I've tried constructing some n-gram tagging models for the chord supertagging task. This page presents an overview of what I've tried and how well they worked.

I've also tried a baseline model that uses n-grams without the parser: see GrammarlessModels.

For a comparison of the models and the results of a combined model (backoff model), see NgramModelComparison.

More on supertagging

For an intro to my modelling work and supertagging models, see ParsingModels.

Evaluation

I evaluate these models using an accuracy measure and cross-entropy. In StupidBaselinesTalks I also used n-best accuracy, though cross-entropy is a single measure that better captures what n-best accuracy was getting at. Both measures are described on SupertaggerEvaluation.
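To make the two measures concrete, here is a minimal sketch of how they could be computed. It assumes a hypothetical tagger interface (tag_distribution(sequence, i) returning a dict of tag probabilities) and a small floor value for tags given zero probability; the real definitions are on SupertaggerEvaluation.

```python
import math

def evaluate(tagger, test_data):
    """Top-tag accuracy and per-chord cross-entropy (in bits) of a tagger.

    Assumes a hypothetical interface: tagger.tag_distribution(sequence, i)
    returns a dict mapping each tag to its probability at position i.
    test_data is a list of (chord_sequence, gold_tags) pairs.
    """
    correct, total, log_prob_sum = 0, 0, 0.0
    for sequence, gold_tags in test_data:
        for i, gold in enumerate(gold_tags):
            dist = tagger.tag_distribution(sequence, i)
            # Accuracy: does the single most probable tag match the gold tag?
            best = max(dist, key=dist.get)
            correct += (best == gold)
            # Cross-entropy: negative log probability assigned to the gold tag
            log_prob_sum += math.log2(dist.get(gold, 1e-10))
            total += 1
    accuracy = correct / total
    cross_entropy = -log_prob_sum / total   # lower is better
    return accuracy, cross_entropy
```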

Smoothing

I use a variety of smoothing methods to assign probability to unseen n-grams. (In most cases this is combined with Katz backoff - see below.) They are listed in the table below, followed by a rough sketch of the estimators. It should be noted, though, that the choice of smoothing method turns out to make very little difference to the results - presumably because there is so little data.

| Smoothing | Description | Abbreviation |
|---|---|---|
| No smoothing | Pure maximum likelihood estimation | mle |
| Laplace | Add-one smoothing. Each count is increased by one, so that nothing has 0 counts | laplace |
| Witten-Bell discounting | Uses Witten-Bell discounting to reserve some probability mass for unseen events | witten-bell |
| Good-Turing discounting | Uses Good-Turing discounting in the same way Witten-Bell is used. This works appallingly on this kind of quantity of data, so I've not reported any results for it | good-turing |
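To make the table concrete, here is a rough sketch of how the first three estimators turn counts into a conditional probability P(tag | context). The function names and arguments are my own, purely for illustration; they are not the actual code.

```python
def mle(count, context_total):
    """Pure maximum likelihood: count(context, tag) / count(context)."""
    return count / context_total if context_total else 0.0

def laplace(count, context_total, vocab_size):
    """Add-one smoothing: every count is increased by one, so no event
    ends up with zero probability."""
    return (count + 1) / (context_total + vocab_size)

def witten_bell(count, context_total, seen_types, vocab_size):
    """Witten-Bell discounting: reserve probability mass in proportion to
    the number of distinct tags seen after this context (seen_types) and
    share that mass among the tags never seen after it."""
    if count > 0:
        return count / (context_total + seen_types)
    unseen = vocab_size - seen_types
    if unseen == 0:
        return 0.0
    return seen_types / (unseen * (context_total + seen_types))
```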

Backoff

I've experimented with using Katz backoff to back off to lower-order n-gram models when an n-gram is unseen. This requires the use of a smoothing method, so that probability mass is reserved for the lower-order model.

Katz backoff only uses the lower-order model's probabilities when the n-gram is unseen. Another approach, which I've not yet tried, is interpolation, where the probabilities from the two models are combined as a weighted sum.
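Roughly, the difference between the two looks like this. The model objects and their methods (seen, prob, reserved_mass) are hypothetical stand-ins, and the fixed interpolation weight lam stands in for weights that would in practice be estimated.

```python
def backoff_prob(higher, lower, tag, context):
    """Katz-style backoff (sketch): use the higher-order model's smoothed
    estimate if the n-gram was seen; otherwise fall back to the lower-order
    model, scaled by the mass alpha that smoothing reserved for unseen events."""
    if higher.seen(context, tag):
        return higher.prob(tag, context)
    alpha = higher.reserved_mass(context)       # mass left over by smoothing
    return alpha * lower.prob(tag, context[1:]) # shorter context for lower order

def interpolated_prob(higher, lower, tag, context, lam=0.7):
    """Interpolation (not yet tried): always combine the two models'
    estimates as a weighted sum, whether or not the n-gram was seen."""
    return lam * higher.prob(tag, context) + (1 - lam) * lower.prob(tag, context[1:])
```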

Cutoff

As is common with these sorts of models, I often treat low counts of an observation as if they were zero counts. The cutoff parameter, which I vary in the tests, is the number of observations required for a count to be treated as non-zero; a cutoff of 0 is therefore effectively no cutoff. I'm generally using a cutoff of 2: there are lots of events with a count of 1 and we don't want to trust these, but setting the cutoff much higher would throw away a large number of training examples.
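Concretely, applying the cutoff is just a filter over the raw counts before estimation; a minimal sketch with made-up bigram counts:

```python
from collections import Counter

def apply_cutoff(counts, cutoff=2):
    """Treat counts below the cutoff as zero: only events observed at
    least `cutoff` times are kept. A cutoff of 0 keeps everything."""
    return Counter({event: c for event, c in counts.items() if c >= cutoff})

# Made-up example counts of (previous tag, tag) pairs
counts = Counter({('T', 'D'): 5, ('D', 'T'): 3, ('T', 'S'): 1})
print(apply_cutoff(counts, cutoff=2))
# Counter({('T', 'D'): 5, ('D', 'T'): 3})  -- the count of 1 is discarded
```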

Results

Accuracy

I first evaluated top-tag accuracy on all of these models. As you can see from the table, using n-gram models (even bigrams) barely improves on the basic unigram model at all. This was a worrying result. Further, an analysis of the actual tags returned showed that the models were getting the same things wrong in most cases (rather than the higher-order models making mistakes that counter-balanced their improvements).

Entropy

However, I then took a look at the cross-entropy of the tagger's distributions (see SupertaggerEvaluation for details, and remember that lower is better). This is more informative and shows that the higher-order models do give us an improvement. What is clear is that they do not boost the probabilities assigned to the correct tags enough for those tags to overtake the tag assigned highest probability by a unigram model.

The entropy results show, though, that they do in general boost the probabilities assigned to the correct tags. This is promising, since the parser will generally consider all the tags to which the tagger assigns sufficiently high probability, so this boost could make an important difference to whether the correct tag ends up in that set.

Figures

Update 6/12: I've recently rerun these experiments, since the code and data have changed somewhat since the original experiments reported below. The first table reports these new results, including results using the C&C tagger.

Not sure why the unigram results were so much worse before.

All models use backoff, one order at a time, as far as unigram.

| Order | Cutoff | Smoothing | Chord map | Entropy | Accuracy |
|---|---|---|---|---|---|
| Trigram | 2 | WB | small | 1.35 | 78.31% |
| Bigram | 0 | WB | small | 1.14 | 79.80% |
| Bigram | 0 | WB | none | 1.23 | |
| Bigram | 0 | WB | big | 1.21 | |
| Bigram | 2 | WB | small | 1.29 | |
| Bigram | 0 | Laplace | small | 1.28 | |
| Unigram | 0 | WB | small | 1.25 | |
| Unigram | 0 | Laplace | small | 1.25 | |
| C&C | | | none | 1.39 | |
| C&C | | | small | 1.52 | |

My original results table.

| Model name | Details | Accuracy | Entropy |
|---|---|---|---|
| unigram-simple | Unigram, no smoothing, no cutoff | 78.63% | 4.86 |
| unigram-wb-0 | Unigram, Witten-Bell smoothing, no cutoff | 79.98% | 4.88 |
| unigram-wb-2 | Unigram, Witten-Bell smoothing, cutoff 2 | 79.06% | 4.87 |
| bigram-nobackoff | Bigram, Witten-Bell smoothing, cutoff 2, no backoff | 80.10% | 1.86 |
| bigram-c2-uni | Bigram, Witten-Bell smoothing, cutoff 2, backoff to unigram | 80.81% | 1.29 |
| bigram-c5-uni | Bigram, Witten-Bell smoothing, cutoff 5, backoff to unigram | 79.35% | 1.53 |
| bigram-nobackoff-lap | Bigram, Laplace smoothing, cutoff 2, no backoff | 79.64% | 1.46 |
| bigram-c2-uni-lap | Bigram, Laplace smoothing, cutoff 2, backoff to unigram | 78.69% | 1.33 |
| bigram-c5-uni-lap | Bigram, Laplace smoothing, cutoff 5, backoff to unigram | 77.75% | 1.47 |
| trigram-c2-uni | Trigram, Witten-Bell smoothing, cutoff 2, backoff to bigram, then unigram | 79.38% | 1.36 |

Discussion

It's very difficult to draw any conclusions from the accuracy results. Some models do unexpectedly badly in comparison to others, but all the differences are tiny and only account for a handful of differently interpreted chords.

The entropy results are much more interesting. The unigram models can be seen to be very much worse than the bigrams. Backoff to unigrams gives the expected improvement. The smoothing results are still pretty inconclusive: Laplace actually seems to do better than Witten-Bell in some configurations.

The trigram model doesn't do better than the best bigram model, though I've only tried one trigram model (because it takes so long to evaluate).