I've tried constructing some n-gram tagging models for the chord supertagging task. This page gives an overview of the models I've tried and how well each worked.
I've also tried a baseline model that uses n-grams without the parser: see GrammarlessModels.
For a comparison of the models and the results of a combined model (backoff model), see NgramModelComparison.
More on supertagging
For an introduction to my modelling work and supertagging models, see ParsingModels.
Evaluation
I evaluate these models using an accuracy measure and cross-entropy. In StupidBaselinesTalks I also use n-best accuracy, though cross-entropy is a single measure that captures much the same information. Both measures are described on SupertaggerEvaluation.
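As a rough illustration of how the two measures relate (the real definitions are on SupertaggerEvaluation), here's a minimal sketch, assuming the tagger returns a probability distribution over tags for each chord. The names `tag_distributions` and `gold_tags` are invented for the example.

```python
import math

def evaluate(tag_distributions, gold_tags):
    """Top-tag accuracy and cross-entropy (bits per chord) for one dataset.

    tag_distributions: one dict per chord, mapping each tag to the
        probability the tagger assigned it.
    gold_tags: the correct tag for each chord.
    """
    correct = 0
    neg_log_prob = 0.0
    for dist, gold in zip(tag_distributions, gold_tags):
        # Accuracy only asks whether the single highest-probability tag is right
        if max(dist, key=dist.get) == gold:
            correct += 1
        # Cross-entropy rewards probability mass put on the correct tag,
        # even when it isn't ranked top (the floor avoids log of zero)
        neg_log_prob -= math.log2(max(dist.get(gold, 0.0), 1e-12))
    return correct / len(gold_tags), neg_log_prob / len(gold_tags)
```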
Smoothing
Summary: the choice of smoothing method doesn't make much difference; add-one is fine for this data.
I use a variety of smoothing methods to assign probability to unseen n-grams (in most cases combined with Katz backoff - see below). They are listed in the table below, with a small sketch of the simplest estimators after it. It should be noted, though, that the choice of smoothing method turns out to make very little difference to the results - presumably because there is so little data.
Smoothing | Description | Abbreviation
No smoothing | Pure maximum likelihood estimation | mle
Add-one smoothing | Each count is increased by one, so that nothing has a zero count | laplace
Witten-Bell | Uses Witten-Bell discounting to reserve some probability mass for unseen events | witten-bell
Good-Turing | Uses Good-Turing discounting in the same way Witten-Bell is used. This works appallingly on this kind of quantity of data, so I've not reported any results for it | good-turing
To try: I've not tried Kneser-Ney smoothing, but have heard that it's good for this sort of thing.
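To make the mle/laplace distinction concrete, here's a minimal sketch of the two simplest estimators for a bigram tag model. My actual models also handle cutoffs and backoff; the function and variable names here are just for illustration.

```python
from collections import Counter

def train_bigram_counts(tag_sequences):
    """Collect bigram counts and context (previous-tag) counts from training data."""
    bigrams, contexts = Counter(), Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def mle_prob(bigrams, contexts, prev, cur):
    """Pure maximum likelihood estimate: unseen bigrams get probability zero."""
    return bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0

def laplace_prob(bigrams, contexts, prev, cur, vocab_size):
    """Add-one (Laplace) estimate: every count is increased by one,
    so no bigram is left with probability zero."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
```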
Backoff
Summary: I'm using Katz backoff, but might want to try interpolation.
I've experimented with using Katz backoff to back off to lower-order n-gram models when an n-gram is unseen. This requires the use of a smoothing method so that probability mass is reserved for the lower-order model.
Katz backoff only uses the lower-order model's probabilities when the n-gram is unseen. Another approach, which I've not yet tried, is interpolation, where the probabilities from the two models are combined as a weighted sum.
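Schematically, the difference between the two looks something like this. Real Katz backoff computes the backoff weight from the probability mass the smoothing method reserved in each context, and an interpolation weight would be estimated on held-out data; both are just placeholder parameters in this sketch.

```python
def katz_backoff_prob(prev, cur, bigram_prob, unigram_prob, backoff_weight):
    """Katz backoff: only consult the unigram model when the bigram is unseen.
    backoff_weight(prev) distributes the reserved mass over unseen events."""
    p = bigram_prob(prev, cur)
    if p > 0.0:
        return p
    return backoff_weight(prev) * unigram_prob(cur)

def interpolated_prob(prev, cur, bigram_prob, unigram_prob, lam=0.7):
    """Interpolation (not yet tried): always mix both models as a weighted sum."""
    return lam * bigram_prob(prev, cur) + (1.0 - lam) * unigram_prob(cur)
```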
Cutoff
Summary: I'm using a cutoff of 2 most of the time.
As is common with these sorts of models, I often treat low counts of an observation as if they were zero counts. The cutoff parameter, which I vary in the tests, is the number of observations required for the count to be treated as non-zero. A cutoff of 0 is therefore effectively no cutoff. I'm generally using a cutoff of 2. There are lots of things with counts of 1 and we don't want to trust these, but if we set the cutoff much higher, we're going to miss out on loads of training examples.
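Applying the cutoff is just a filter on the count table before estimation. A minimal sketch, with made-up chord-transition labels in the usage example:

```python
from collections import Counter

def apply_cutoff(counts, cutoff=2):
    """Treat counts below the cutoff as zero: an observation must be seen at
    least `cutoff` times for its count to be kept. A cutoff of 0 keeps everything."""
    return Counter({event: c for event, c in counts.items() if c >= cutoff})

# Example: with a cutoff of 2, the singleton observation is discarded
counts = Counter({("I", "IV"): 5, ("ii", "V"): 3, ("V", "vi"): 1})
apply_cutoff(counts)  # Counter({("I", "IV"): 5, ("ii", "V"): 3})
```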
Results
Accuracy
I first evaluated top tag accuracy on all of these models. As you can see from the table, using higher-order n-gram models (even bigrams) barely improves on the basic unigram model at all. This was a worrying result. Furthermore, an analysis of the actual tags returned showed that the models were getting the same things wrong in most cases (rather than the higher-order models making mistakes that counterbalanced their improvements).
Entropy
However, I then took a look at the cross entropy of the taggers' distributions (see SupertaggerEvaluation for details, and remember that lower is better). This is more informative and shows that the higher-order models do give an improvement. Evidently they boost the probabilities assigned to the correct tags, just not by enough for those tags to overtake whatever tag the unigram model already ranked highest - which is why the accuracy barely changes.
This boost is still promising: the parser will generally use the whole set of tags to which the tagger assigns sufficiently high probability, so it could make an important difference to whether the correct tag ends up in that set.
Figures
- Entropy is the cross entropy (see SupertaggerEvaluation), measured in bits per chord.
- Accuracy is the top tag accuracy.
Update 6/12: I've recently rerun these experiments, since the code and data have changed somewhat since the original experiments reported below. The first table reports the new results, including results for the C&C tagger.
I'm not sure why the unigram results were so much worse before.
All models use backoff, one order at a time, down to the unigram model.
Order | Cutoff | Smoothing | Chord map | Entropy | Accuracy
Trigram | 2 | WB | small | 1.35 | 78.31%
Bigram | 0 | WB | small | 1.14 | 79.80%
Bigram | 0 | WB | none | 1.23 |
Bigram | 0 | WB | big | 1.21 |
Bigram | 2 | WB | small | 1.29 |
Bigram | 0 | Laplace | small | 1.28 |
Unigram | 0 | WB | small | 1.25 |
Unigram | 0 | Laplace | small | 1.25 |
C&C | | | none | 1.39 |
C&C | | | small | 1.52 |
My original results table:
Model name | Details | Accuracy | Entropy
unigram-simple | Unigram, no smoothing, no cutoff | 78.63% | 4.86
unigram-wb-0 | Unigram, Witten-Bell smoothing, no cutoff | 79.98% | 4.88
unigram-wb-2 | Unigram, Witten-Bell smoothing, cutoff 2 | 79.06% | 4.87
bigram-nobackoff | Bigram, Witten-Bell smoothing, cutoff 2, no backoff | 80.10% | 1.86
bigram-c2-uni | Bigram, Witten-Bell smoothing, cutoff 2, backoff to unigram | 80.81% | 1.29
bigram-c5-uni | Bigram, Witten-Bell smoothing, cutoff 5, backoff to unigram | 79.35% | 1.53
bigram-nobackoff-lap | Bigram, Laplace smoothing, cutoff 2, no backoff | 79.64% | 1.46
bigram-c2-uni-lap | Bigram, Laplace smoothing, cutoff 2, backoff to unigram | 78.69% | 1.33
bigram-c5-uni-lap | Bigram, Laplace smoothing, cutoff 5, backoff to unigram | 77.75% | 1.47
trigram-c2-uni | Trigram, Witten-Bell smoothing, cutoff 2, backoff to bigram, then unigram | 79.38% | 1.36
Discussion
It's very difficult to draw any conclusions from the accuracy results. Some models do unexpectedly badly in comparison to others, but all the differences are tiny and only account for a handful of differently interpreted chords.
The entropy results are much more interesting. The unigram models can be seen to be very much worse than the bigrams. Backoff to unigrams gives the expected improvement. The smoothing results are still pretty inconclusive: Laplace actually seems to do better than Witten-Bell.
The trigram model doesn't do better than the best bigram model, though I've only tried one trigram model (because it takes so long to evaluate).