Saturday, September 26, 2009

Classias notes

There is an excellent piece of software for training and applying a range of ML algorithms, including logistic regression with L1 or L2 regularization, Pegasos SVM (L1/L2), and the averaged perceptron. It is called Classias and is by Naoaki Okazaki.
It is amazingly fast and can handle large datasets! It works directly on compressed formats such as bz2 and tar.gz.

Here are some quick how-to notes:
  • running binary logistic regression 

    classias-train -tb -a lbfgs.logistic -m MODEL_FILE TRAINING_FILE

    -tb sets the type (-t) to binary (b).
    -a specifies the learning algorithm, here L-BFGS-optimized logistic regression (note: despite its name, logistic regression is a classification algorithm, not a regression model). Currently, L1/L2 regularization is supported only with lbfgs.
    -m specifies the model file.
    The final argument is the training file. Each line has the format
    label fid:fval ...
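
    To make the format concrete, here is a small sketch that writes a toy training file. The file name toy_train.txt, the feature ids, and the +1/-1 label values are assumptions made up for illustration; check the Classias documentation for the exact label conventions.

```shell
# Write a tiny training file in the "label fid:fval ..." format.
# File name, feature ids and the +1/-1 labels are made up for illustration.
printf '%s\n' \
  '+1 f1:1.0 f3:0.5' \
  '-1 f2:1.0 f3:2.0' \
  '+1 f1:0.2 f4:1.0' \
  '-1 f2:0.7 f5:1.5' > toy_train.txt

# Sanity check: print the number of lines that are NOT of the form
# "label fid:fval fid:fval ...". grep exits non-zero when that count
# is zero, hence the "|| true".
grep -c -v -E '^[+-]1( [^ :]+:[0-9.]+)+$' toy_train.txt || true
```

    A file like this would then be passed as the final argument to classias-train.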

  • cross validation, regularization and help
    To perform cross validation, use the -g5 -x options (5 means 5-fold cross validation; any integer can be used). If your held-out data is in a separate file, you can specify both the training and held-out files using another set of options [see the Classias documentation].

    To enable L1 regularization, use -pc1=1. This sets the parameter c1 (the L1 regularization coefficient) to 1 for the algorithm specified by -a. The default coefficient is 0 for L1 and 1 for L2, so if you want L1 regularization only, you must also set L2 to zero with -pc2=0. Otherwise you end up using both L1 and L2 regularization!

    For example, if you set both coefficients to 1, the final trained model has more features than one trained with L1 regularization only, but still far fewer than one trained with L2 only. On the RCV1 dataset, I got 40628 features using only L2 (accuracy 0.95), 491 features (accuracy 0.95) using only L1, and 1597 features (accuracy 0.94) using both.

    General help of classias-train can be seen by doing,
    classias-train --help
    and to see what parameters are available for a specific algorithm (e.g. lbfgs.logistic) do the following,
    classias-train -a lbfgs.logistic -H
    -H shows the algorithm-specific parameter help; -h is the general help.
    Putting it all together, the following command trains a binary logistic regression model with L1 regularization and also performs 5-fold cross validation.
     classias-train -tb -a lbfgs.logistic -pc1=1 -pc2=0  -m rcv1.binary.model -g5 -x rcv1_train.binary
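
    To repeat the L1-only / L2-only / both comparison described above, the three commands can be generated with a small loop. This is only a sketch: it prints the commands rather than running them, the model file names (rcv1.l1.model and so on) are hypothetical, and the flags are just the ones already used in this post.

```shell
# Print (not run) one classias-train command per regularization setting.
# Model file names are hypothetical; the flags are the ones used in this post.
train_file=rcv1_train.binary

print_commands() {
  for setting in "l1 1 0" "l2 0 1" "both 1 1"; do
    set -- $setting   # split into positional parameters: name, c1, c2
    echo "classias-train -tb -a lbfgs.logistic -pc1=$2 -pc2=$3 -m rcv1.$1.model -g5 -x $train_file"
  done
}

print_commands
# To actually train (assuming classias-train is installed): print_commands | sh
```

    Printing first makes it easy to check that -pc2=0 really is set in the L1-only run before spending time on training.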

  • Multi-class classification
  • Tagging  (prediction)
    classias-tag reads test instances from stdin and outputs the class labels, the weights (-w), and, in the case of logistic regression models, the probabilities (-p). Specify the model file with -m. You can compute accuracies with the -t option. To suppress labels etc. when testing, use the quiet option (-q).

    cat rcv1_test_binary | classias-tag -m rcv1.binary.model -p
    cat rcv1_test_binary | classias-tag -m rcv1.binary.model -w
    cat rcv1_test_binary | classias-tag -m rcv1.binary.model -tq
    If your data is in bz2, then use bzcat instead of cat.


Continuously monitor GPU usage

 For NVIDIA GPUs, do the following (this refreshes the output every second): nvidia-smi -l 1