Discriminative Segmental Cascades

Paper

Hao Tang, Weiran Wang, Kevin Gimpel, Karen Livescu, Discriminative Segmental Cascades for Feature-Rich Phone Recognition, ASRU, 2015

Download

scrf-fbd0c0a.tar.gz

Build

You will need a C++11-compatible compiler to build, e.g., GCC 4.9 or higher.

tar xzf scrf-fbd0c0a.tar.gz
cd scrf-fbd0c0a
for d in ebt opt la autodiff speech scrf; do
    CXXFLAGS="-O2" make -C $d
done

File Format

A frame batch file contains all frames of all utterances. The frames can be MFCCs or DNN phone posteriors; the paper uses DNN phone log posteriors. Here is an example of a frame batch file.

01.mfc
0.1 0.2 0.3 0.4
0.2 0.4 0.6 0.8
0.3 0.6 0.9 1.2
.
02.mfc
0.4 0.3 0.2 0.1
0.8 0.6 0.4 0.2
1.2 0.9 0.6 0.3
.

Each utterance starts with a name and ends with a dot “.”, with its frames in between. Each line is one frame, and the values within a frame are separated by spaces.
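A minimal C++ sketch of a reader for this format, assuming exactly the layout above: a name line, one space-separated frame per line, then a line containing only a dot. The names Utterance and read_frame_batch are hypothetical and are not part of the scrf code base.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One utterance: its name and its frames (one vector of values per line).
struct Utterance {
    std::string name;
    std::vector<std::vector<double>> frames;
};

// Reads every utterance from a frame batch stream. Hypothetical helper that
// assumes the layout described above: name line, frame lines, then ".".
std::vector<Utterance> read_frame_batch(std::istream& in) {
    std::vector<Utterance> batch;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // tolerate blank lines between utterances
        Utterance utt;
        utt.name = line;
        // Collect frame lines until the terminating dot (or end of input).
        while (std::getline(in, line) && line != ".") {
            std::istringstream fields(line);
            std::vector<double> frame;
            double value;
            while (fields >> value) frame.push_back(value);
            utt.frames.push_back(std::move(frame));
        }
        batch.push_back(std::move(utt));
    }
    return batch;
}
```

Called on a stream holding the example above, it returns two utterances, 01.mfc and 02.mfc, each with three frames of dimension four.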

A lattice batch file contains lattices. Here's an example of a lattice batch file.

01.lat
0 time=0
1 time=100
2 time=200
#
0 1 label=<s>,f1=0.1,f2=0.2
1 2 label=a,f1=0.3,f2=0.4
2 3 label=</s>,f1=0.5,f2=0.6
.
02.lat
0 time=0
1 time=50
2 time=80
3 time=120
#
0 1 label=<s>,f1=0.1,f2=0.2
0 2 label=b,f1=0.3,f2=0.4
1 2 label=a,f1=0.5,f2=0.6
1 3 label=b,f1=0.7,f2=0.8
2 3 label=</s>,f1=0.9,f2=1.0
.

As in the frame batch format, each lattice starts with a name and ends with a dot “.”. Within a lattice, the vertices come first, then a pound sign “#”, then the edges. In the above example, each edge has a label and two additional fields, f1 and f2; such additional key-value pairs can be used as external features.

Each vertex line has a vertex id followed by key-value pairs; in the above example, each vertex has a time. Each edge line has a tail vertex id, a head vertex id, and key-value pairs. Each pair is written as key=value, and pairs are separated by commas “,”.
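The lattice layout can be read the same way. This C++ sketch keeps every value as a string and assumes that no key or value contains a comma or whitespace. The types Lattice, Vertex and Edge and the functions parse_attrs and read_lattice_batch are hypothetical, not scrf's own API.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Attrs = std::map<std::string, std::string>;

struct Vertex { int id = 0; Attrs attrs; };
struct Edge { int tail = 0; int head = 0; Attrs attrs; };
struct Lattice {
    std::string name;
    std::vector<Vertex> vertices;
    std::vector<Edge> edges;
};

// Splits "k1=v1,k2=v2" into a map; the value is everything after the first '='.
Attrs parse_attrs(const std::string& s) {
    Attrs attrs;
    std::istringstream pairs(s);
    std::string kv;
    while (std::getline(pairs, kv, ',')) {
        auto eq = kv.find('=');
        if (eq == std::string::npos) continue;  // skip malformed pairs
        attrs[kv.substr(0, eq)] = kv.substr(eq + 1);
    }
    return attrs;
}

// Reads all lattices: name line, vertex lines, "#", edge lines, then ".".
std::vector<Lattice> read_lattice_batch(std::istream& in) {
    std::vector<Lattice> batch;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        Lattice lat;
        lat.name = line;
        bool in_edges = false;  // becomes true after the "#" separator
        while (std::getline(in, line) && line != ".") {
            if (line == "#") { in_edges = true; continue; }
            std::istringstream fields(line);
            std::string rest;
            if (!in_edges) {
                Vertex v;
                fields >> v.id >> rest;
                v.attrs = parse_attrs(rest);
                lat.vertices.push_back(std::move(v));
            } else {
                Edge e;
                fields >> e.tail >> e.head >> rest;
                e.attrs = parse_attrs(rest);
                lat.edges.push_back(std::move(e));
            }
        }
        batch.push_back(std::move(lat));
    }
    return batch;
}
```

On the example above, the second lattice (02.lat) comes back with four vertices and five edges, and its last edge has tail 2, head 3 and label </s>.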

Features

Features are selected with the --features option, which takes a comma-separated list of feature names such as frame-avg and length-indicators.

Append “@1” to a feature name to make it first order, i.e., to lexicalize it with a single label. Similarly, append “@0” or “@2” to make zeroth-order or second-order features, respectively.

External features annotated on the lattices can be specified with ext:<name> where <name> is the feature name.

For example, to use first-order frame averages and length indicators together with zeroth-order external features f1 and f2, pass --features frame-avg@1,length-indicators@1,ext:f1@0,ext:f2@0.
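To illustrate the syntax, here is a hypothetical C++ parser that splits a --features argument into a feature name, an ext: flag, and the order given by the @ suffix. It is not scrf's own parser. What scrf does when no suffix is given is not stated here, so the sketch simply records -1 in that case.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One entry of a --features list, e.g. "frame-avg@1" or "ext:f1@0".
struct FeatureSpec {
    std::string name;       // e.g. "frame-avg", or "f1" for "ext:f1"
    bool external = false;  // true when written as ext:<name>
    int order = -1;         // 0, 1 or 2 from "@k"; -1 if no suffix (assumption)
};

// Hypothetical parser for the comma-separated --features syntax.
std::vector<FeatureSpec> parse_features(const std::string& arg) {
    std::vector<FeatureSpec> specs;
    std::istringstream items(arg);
    std::string item;
    while (std::getline(items, item, ',')) {
        FeatureSpec spec;
        // Strip a trailing "@k" order suffix, if any.
        auto at = item.rfind('@');
        if (at != std::string::npos) {
            spec.order = std::stoi(item.substr(at + 1));
            item = item.substr(0, at);
        }
        // Strip the "ext:" prefix that marks lattice-annotated features.
        const std::string prefix = "ext:";
        if (item.compare(0, prefix.size(), prefix) == 0) {
            spec.external = true;
            item = item.substr(prefix.size());
        }
        spec.name = item;
        specs.push_back(spec);
    }
    return specs;
}
```

For the example argument above, this yields four entries: frame-avg and length-indicators with order 1, and the external features f1 and f2 with order 0.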

Model Parameters

Model parameters are stored as a JSON dictionary whose keys are strings and whose values are vectors of doubles.

The simplest parameter file is just the empty dictionary as follows.

{}

This is useful as the initial model for training.
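As an illustration of the format, this C++ sketch writes a parameter map as such a dictionary; an empty map produces the {} above. The key names scrf actually uses are not documented here, so any key passed to the hypothetical to_json function is illustrative only. The sketch assumes keys need no JSON escaping and values are finite.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Serializes string -> vector<double> as a JSON object of number arrays.
// Assumes keys contain no characters that need escaping and values are finite.
std::string to_json(const std::map<std::string, std::vector<double>>& params) {
    std::ostringstream out;
    out.precision(17);  // enough significant digits to round-trip a double
    out << '{';
    bool first_key = true;
    for (const auto& kv : params) {
        if (!first_key) out << ',';
        first_key = false;
        out << '"' << kv.first << "\":[";
        for (std::size_t i = 0; i < kv.second.size(); ++i) {
            if (i > 0) out << ',';
            out << kv.second[i];
        }
        out << ']';
    }
    out << '}';
    return out.str();
}
```

Because std::map orders its keys, the output is deterministic, which makes parameter files easy to diff between training epochs.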

Example

To set up an experiment, start from a zero model, i.e., an empty parameter file, along with an empty optimizer-data file.

echo '{}' > param-0
echo '{}' > opt-data-0

The ground truths are stored as linear-chain graphs in train.slat, and the training frames are stored in train.mfc.

To train a model with first-order frame averages and length indicators, we can run the following command.

scrf/learn \
    --frame-batch train.mfc \
    --ground-truth-batch train.slat \
    --param param-0 \
    --opt-data opt-data-0 \
    --loss hinge \
    --features frame-avg@1,length-indicators@1 \
    --lm unigram \
    --step-size 0.1 \
    --max-seg 30 \
    --output-param param-1 \
    --output-opt-data opt-data-1

The --max-seg option limits the maximum duration of a segment. A language model in ARPA format must be specified with the --lm option; in these examples, the language model file is named unigram.
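To see why --max-seg matters, consider a semi-Markov (segmental) Viterbi search: a segment ending at frame e can start at most max_seg frames earlier, so decoding costs O(T * max_seg * L) for T frames and L labels instead of O(T^2 * L). The C++ sketch below is a simplified illustration, not scrf/predict's decoder. The callback score(s, e, y) is hypothetical, and no dependency between neighboring labels is modeled, which assumes that every feature and the language model score depend on at most one label (e.g., @0 and @1 features with a unigram language model). @2 features or a higher-order language model would require keeping the previous label in the search state.

```cpp
#include <functional>
#include <limits>
#include <vector>

struct Segment { int start; int end; int label; };  // covers frames [start, end)

// Semi-Markov Viterbi over T frames with segments of length 1..max_seg.
// score(s, e, y) is a hypothetical score for labeling frames [s, e) with y.
std::vector<Segment> decode(int T, int max_seg, int num_labels,
                            const std::function<double(int, int, int)>& score) {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    std::vector<double> best(T + 1, neg_inf);  // best[e]: best score of frames [0, e)
    std::vector<Segment> back(T + 1);          // last segment of that best path
    best[0] = 0.0;
    for (int e = 1; e <= T; ++e) {
        // The max_seg bound is what keeps this inner loop short.
        for (int d = 1; d <= max_seg && d <= e; ++d) {
            int s = e - d;
            for (int y = 0; y < num_labels; ++y) {
                double cand = best[s] + score(s, e, y);
                if (cand > best[e]) { best[e] = cand; back[e] = {s, e, y}; }
            }
        }
    }
    // Follow back-pointers from the last frame, then restore left-to-right order.
    std::vector<Segment> path;
    for (int e = T; e > 0; e = back[e].start) path.push_back(back[e]);
    return std::vector<Segment>(path.rbegin(), path.rend());
}
```

Raising max_seg lets longer segments compete, at a cost that grows linearly in max_seg.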

Once the model is trained, run the following command to get predictions on the development set.

scrf/predict \
    --frame-batch dev.mfc \
    --param param-1 \
    --lm unigram \
    --max-seg 30

To prune the first-pass hypothesis space, you can run the following command.

scrf/prune \
    --frame-batch train.mfc \
    --param param-1 \
    --lm unigram \
    --max-seg 30 \
    --alpha 0.8 \
    --output train-0.8.slat

The pruned lattices are saved in train-0.8.slat. You can then annotate them with more features and run scrf/learn to train a model for the second pass.
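The --alpha option is not documented here. One common scheme for this kind of pruning is the max-marginal threshold from structured prediction cascades: an edge is kept if the score of the best path through it is at least alpha * max + (1 - alpha) * mean, where max is the best path score and mean is the average max-marginal over all edges. The C++ sketch below implements that scheme on an explicit lattice, under the assumption that it resembles what scrf/prune does. ScoredEdge and prune are hypothetical names, vertices are assumed to be numbered in topological order (0 is the start, n - 1 the end), and scrf/prune itself works on the full first-pass segment space rather than on a given edge list.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct ScoredEdge { int tail; int head; double score; };

// Max-marginal pruning (structured prediction cascades style), assuming
// vertices 0..n-1 are topologically ordered and every edge lies on a
// path from vertex 0 to vertex n-1.
std::vector<ScoredEdge> prune(int n, const std::vector<ScoredEdge>& edges, double alpha) {
    const double neg_inf = -std::numeric_limits<double>::infinity();
    std::vector<double> fwd(n, neg_inf), bwd(n, neg_inf);
    fwd[0] = 0.0;
    bwd[n - 1] = 0.0;
    std::vector<ScoredEdge> sorted = edges;
    std::sort(sorted.begin(), sorted.end(),
              [](const ScoredEdge& a, const ScoredEdge& b) { return a.tail < b.tail; });
    // Forward pass: best score from the start to each vertex.
    for (const ScoredEdge& e : sorted)
        fwd[e.head] = std::max(fwd[e.head], fwd[e.tail] + e.score);
    // Backward pass: best score from each vertex to the end.
    for (auto it = sorted.rbegin(); it != sorted.rend(); ++it)
        bwd[it->tail] = std::max(bwd[it->tail], it->score + bwd[it->head]);
    // Max-marginal of an edge: score of the best complete path through it.
    std::vector<double> mm;
    for (const ScoredEdge& e : edges) mm.push_back(fwd[e.tail] + e.score + bwd[e.head]);
    std::vector<ScoredEdge> kept;
    if (mm.empty()) return kept;
    double best = *std::max_element(mm.begin(), mm.end());
    double mean = std::accumulate(mm.begin(), mm.end(), 0.0) / mm.size();
    double threshold = alpha * best + (1 - alpha) * mean;
    for (std::size_t i = 0; i < edges.size(); ++i)
        if (mm[i] >= threshold) kept.push_back(edges[i]);
    return kept;
}
```

Under this scheme, alpha = 1 keeps only edges on a best path, and smaller values keep progressively more of the lattice.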