Stochastic Coordinate Descent (SCD)

Overview

SCD is a C++ implementation of (a variant of) the stochastic coordinate descent algorithm proposed in

Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for l₁ regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929-936, 2009.

It can be used for l₁-regularized loss minimization for both classification and regression problems. This means that it (approximately) solves the following minimization problem:

min_w (1/m) ∑_{i=1, … , m} L(w^T x_i, y_i) + λ|w|₁

Currently supported loss functions are the logistic loss [L(a,b) = log(1+exp(-ab))] and the squared loss [L(a,b) = (a-b)²]. The x_i's are example vectors with d coordinates x_i,1, … x_i,d. The y_i's are labels that are either ±1 (classification) or arbitrary floats (regression). The logistic loss only makes sense in a classification context but the squared loss can be used in either. λ is the regularization parameter. SCD is designed to run fast even for large values of m and d and can exploit the sparsity in the examples (i.e. if a lot of coordinates in the example vectors are zero).

Source Code

The source code is released under the GNU General Public License. The authors are not responsible for any consequences arising out of the use of the code, so please use it at your own risk. We do, however, appreciate receiving bug reports from you. Please direct them to Ambuj Tewari via email.

If you use SCD in your research, please cite it as

Shai Shalev-Shwartz and Ambuj Tewari, Stochastic methods for l₁ regularized loss minimization. Proceedings of the 26th International Conference on Machine Learning, pages 929-936, 2009.

The code was developed under Linux using g++ (GCC) version 3.4.6. Current version of SCD is 2.1. The source code for the current version is available at the following location:

http://ttic.uchicago.edu/~tewari/code/scd/scd.tar.gz

Installation

After downloading the file scd.tar.gz into a suitable directory, unpack it by typing

 tar zxvf scd.tar.gz

This will create a directory called SCD with the following 9 files in it:

cmd_line.cc
cmd_line.h
cmd_option.h
Losses.h
gpl.txt
Makefile
scd.cc
test_examples
test_labels

Now, type

 make

This will create an executable named scd (plus a couple of object files).

Usage

After installing, simply type

 ./scd

and the following usage message is displayed:


USAGE: ./scd [options] <data-file> <label-file>

L1 regularized loss minimization using stochastic coordinate descent

OPTIONS:
 -lambda regularization parameter (default = 0.001)
 -loss 0 for logistic, 2 for squared (default = 0)
 -iters number of iterations (default = 1000)
 -printout after how many iterations to print w (default 1000)

All 4 command line options are optional and receive their default values indicated above if they are not supplied by the user. The options lambda, loss and iters set the regularization parameter, the loss function to use, and the number of iterations for which the stochastic coordinate descent algorithm will be run, respectively. The printout option selects the interval after which a summary of current state of the optimization will be printed on the standard output. The following information about the current weight vector w is included in this summary:

time elapsed (in milliseconds)
number of times the data matrix was accessed
|w|₁, the l₁-norm of w
|w|₀, the l₀-norm (no. of non-zero coordinates) of w
(1/m) ∑_{i=1, … , m} L(w^T x_i, y_i), the average loss of w
(1/m) ∑_{i=1, … , m} L(w^T x_i, y_i) + λ|w|₁, the objective function
∑_{i=1, … , m} 1[w^T x_i y_i ≤ 0], the number of mistakes on the examples (ignore this field if doing regression)

Data File Format

Conceptually, we often think of the data as a huge m x d matrix with the examples sitting as rows of this matrix. Viewed this way, scd expects its input data file to be in a sparse column major format as described below:

Line 1: m d
Line 2: i1 v1 i2 v2 ...
⋮
Line d+1: i1 v1 i2 v2 ...

The first line tells scd that there are m example each with d features or coordinates. The second line gives the values for feature 1 for all those examples where feature 1 is non-zero. The pair i1 v1 means that example (i1+1) has value v1. Note that, following C/C++ convention, the indices i1, i2, ... start from zero. For example, if feature 1 is non-zero only for examples 1, 3 and 10 and has values 0.2, 0.9 and -0.7 for these, then line 2 would read:

0 0.2 2 0.9 9 -0.7

The example indices need not appear in increasing order. It is assumed that the values v1, v2, ... lie in the interval [-1,1]. You might have to do some preprocessing of your data to ensure this.

Line 3 then gives values for feature 2, Line 4 for feature 3, and so on till Line d+1.

Label File Format

The label file format is pretty simple: Line j gives the label of example j. So, there should be a total of m lines in the label file and each line should have a single number (±1 in the classification case, any floating point number in the regression case).

Test Run

To make sure everything runs okay on your machine, use the files test_examples and test_labels to do a test run. The file test_examples encodes the following 3 x 5 data matrix (3 examples, 5 features):

0    0    0    0.2  0.2
0    0    0.3  0   -0.5
0.4  0.6  0    0    0

The labels for the 3 examples are 1,1 and -1 respectively. Now, run scd with logistic loss and the default value (0.001) of the regularization parameter by typing

 ./scd -iters 50000 -printout 5000 test_examples test_labels

This runs scd for 50,000 iterations printing a summary line every 5,000 iterations. You should see an output that looks something like

0 11972 41.2552 5 0.0164818 0.057737 0
10 24054 43.7355 5 0.012187 0.0559226 0
10 36020 44.4665 4 0.0109725 0.055439 0
20 47932 44.6602 3 0.0104728 0.055133 0
20 59888 44.9534 3 0.0101683 0.0551217 0
30 71858 45.0208 3 0.0101003 0.0551212 0
30 83900 45.0446 3 0.0100765 0.0551211 0
40 95908 45.0546 3 0.0100664 0.055121 0
50 107974 45.0588 3 0.0100623 0.055121 0
50 119818 45.0607 3 0.0100603 0.055121 0

The actual output might be slightly different due to the randomness in the algorithm but the objective function should converge to 0.05512 for this toy example.

Acknowledgments

Ambuj Tewari would like to thank his wife Shilpa Shukla for help in debugging the code.