I've tried constructing some n-gram tagging models for the chord supertagging task. This page gives an overview of the models I've tried and how well each worked.
I've also tried a baseline model that uses n-grams without the parser: see GrammarlessModels.
For a comparison of the models and the results of a combined model (backoff model), see NgramModelComparison.
More on supertagging
For an introduction to my modelling work and supertagging models, see ParsingModels.
Evaluation
I evaluate these models using an accuracy measure and cross-entropy. In StupidBaselinesTalks I also use n-best accuracy, though cross-entropy is a single measure that captures much the same information. Both measures are described on SupertaggerEvaluation.
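As a rough illustration of how the two measures relate (the real definitions are on SupertaggerEvaluation), here's a minimal sketch, assuming the tagger returns a probability distribution over tags for each chord. The names `tag_distributions` and `gold_tags` are invented for the example.

```python
import math

def evaluate(tag_distributions, gold_tags):
    """Top-tag accuracy and cross-entropy (bits per chord) for one dataset.

    tag_distributions: one dict per chord, mapping each tag to the
        probability the tagger assigned it.
    gold_tags: the correct tag for each chord.
    """
    correct = 0
    neg_log_prob = 0.0
    for dist, gold in zip(tag_distributions, gold_tags):
        # Accuracy only asks whether the single highest-probability tag is right
        if max(dist, key=dist.get) == gold:
            correct += 1
        # Cross-entropy rewards probability mass put on the correct tag,
        # even when it isn't ranked top (the floor avoids log of zero)
        neg_log_prob -= math.log2(max(dist.get(gold, 0.0), 1e-12))
    return correct / len(gold_tags), neg_log_prob / len(gold_tags)
```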
Smoothing
Summary: the choice of smoothing method doesn't make much difference; add-one is fine for this data.
I use a variety of smoothing methods to assign probability to unseen n-grams (in most cases combined with Katz backoff - see below). They are listed in the table below, with a small sketch of the simplest estimators after it. It should be noted, though, that the choice of smoothing method turns out to make very little difference to the results - presumably because there is so little data.
Smoothing | Description | Abbreviation
No smoothing | Pure maximum likelihood estimation | mle
Add-one smoothing | Each count is increased by one, so that nothing has a zero count | laplace
Witten-Bell | Uses Witten-Bell discounting to reserve some probability mass for unseen events | witten-bell
Good-Turing | Uses Good-Turing discounting in the same way Witten-Bell is used. This works appallingly on this kind of quantity of data, so I've not reported any results for it | good-turing
To try: I've not tried Kneser-Ney smoothing, but have heard that it's good for this sort of thing.
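To make the mle/laplace distinction concrete, here's a minimal sketch of the two simplest estimators for a bigram tag model. My actual models also handle cutoffs and backoff; the function and variable names here are just for illustration.

```python
from collections import Counter

def train_bigram_counts(tag_sequences):
    """Collect bigram counts and context (previous-tag) counts from training data."""
    bigrams, contexts = Counter(), Counter()
    for tags in tag_sequences:
        for prev, cur in zip(tags, tags[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def mle_prob(bigrams, contexts, prev, cur):
    """Pure maximum likelihood estimate: unseen bigrams get probability zero."""
    return bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0

def laplace_prob(bigrams, contexts, prev, cur, vocab_size):
    """Add-one (Laplace) estimate: every count is increased by one,
    so no bigram is left with probability zero."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
```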
Backoff
Summary: I'm using Katz backoff, but might want to try interpolation.
I've experimented with using Katz backoff to back off to lower-order n-gram models when an n-gram is unseen. This requires the use of a smoothing method so that probability mass is reserved for the lower-order model.
Katz backoff only uses the lower-order model's probabilities when the n-gram is unseen. Another approach, which I've not yet tried, is interpolation, where the probabilities from the two models are combined as a weighted sum.
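Schematically, the difference between the two looks something like this. Real Katz backoff computes the backoff weight from the probability mass the smoothing method reserved in each context, and an interpolation weight would be estimated on held-out data; both are just placeholder parameters in this sketch.

```python
def katz_backoff_prob(prev, cur, bigram_prob, unigram_prob, backoff_weight):
    """Katz backoff: only consult the unigram model when the bigram is unseen.
    backoff_weight(prev) distributes the reserved mass over unseen events."""
    p = bigram_prob(prev, cur)
    if p > 0.0:
        return p
    return backoff_weight(prev) * unigram_prob(cur)

def interpolated_prob(prev, cur, bigram_prob, unigram_prob, lam=0.7):
    """Interpolation (not yet tried): always mix both models as a weighted sum."""
    return lam * bigram_prob(prev, cur) + (1.0 - lam) * unigram_prob(cur)
```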
Cutoff
Summary: I'm using a cutoff of 2 most of the time.
As is common with these sorts of models, I often treat low counts of an observation as if they were zero counts. The cutoff parameter, which I vary in the tests, is the number of observations required for the count to be treated as non-zero. A cutoff of 0 is therefore effectively no cutoff. I'm generally using a cutoff of 2. There are lots of things with counts of 1 and we don't want to trust these, but if we set the cutoff much higher, we're going to miss out on loads of training examples.
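Applying the cutoff is just a filter on the count table before estimation. A minimal sketch, with made-up chord-transition labels in the usage example:

```python
from collections import Counter

def apply_cutoff(counts, cutoff=2):
    """Treat counts below the cutoff as zero: an observation must be seen at
    least `cutoff` times for its count to be kept. A cutoff of 0 keeps everything."""
    return Counter({event: c for event, c in counts.items() if c >= cutoff})

# Example: with a cutoff of 2, the singleton observation is discarded
counts = Counter({("I", "IV"): 5, ("ii", "V"): 3, ("V", "vi"): 1})
apply_cutoff(counts)  # Counter({("I", "IV"): 5, ("ii", "V"): 3})
```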
Results
Accuracy
I first evaluated top tag accuracy on all of these models. As you can see from the table, using higher-order n-gram models (even bigrams) barely improves on the basic unigram model at all. This was a worrying result. Furthermore, an analysis of the actual tags returned showed that the models were getting the same things wrong in most cases (rather than the higher-order models making mistakes that counterbalanced their improvements).
Entropy
However, I then took a look at the cross entropy of the taggers' distributions (see SupertaggerEvaluation for details, and remember that lower is better). This is more informative and shows that the higher-order models do give an improvement. Evidently they boost the probabilities assigned to the correct tags, just not by enough for those tags to overtake whatever tag the unigram model already ranked highest - which is why the accuracy barely changes.
This boost is still promising: the parser will generally use the whole set of tags to which the tagger assigns sufficiently high probability, so it could make an important difference to whether the correct tag ends up in that set.
Figures
- Entropy is the cross entropy (see SupertaggerEvaluation), measured in bits per chord.
- Accuracy is the top tag accuracy.
Update 6/12: I've recently rerun these experiments, since the code and data have changed somewhat since the original experiments reported below. The first table reports the new results, including results for the C&C tagger.
I'm not sure why the unigram results were so much worse before.
All models use backoff, one order at a time, down to the unigram model.
Order | Cutoff | Smoothing | Chord map | Entropy | Accuracy
Trigram | 2 | WB | small | 1.35 | 78.31%
Bigram | 0 | WB | small | 1.14 | 79.80%
Bigram | 0 | WB | none | 1.23 |
Bigram | 0 | WB | big | 1.21 |
Bigram | 2 | WB | small | 1.29 |
Bigram | 0 | Laplace | small | 1.28 |
Unigram | 0 | WB | small | 1.25 |
Unigram | 0 | Laplace | small | 1.25 |
C&C | | | none | 1.39 |
C&C | | | small | 1.52 |
My original results table:
Model name | Details | Accuracy | Entropy
unigram-simple | Unigram, no smoothing, no cutoff | 78.63% | 4.86
unigram-wb-0 | Unigram, Witten-Bell smoothing, no cutoff | 79.98% | 4.88
unigram-wb-2 | Unigram, Witten-Bell smoothing, cutoff 2 | 79.06% | 4.87
bigram-nobackoff | Bigram, Witten-Bell smoothing, cutoff 2, no backoff | 80.10% | 1.86
bigram-c2-uni | Bigram, Witten-Bell smoothing, cutoff 2, backoff to unigram | 80.81% | 1.29
bigram-c5-uni | Bigram, Witten-Bell smoothing, cutoff 5, backoff to unigram | 79.35% | 1.53
bigram-nobackoff-lap | Bigram, Laplace smoothing, cutoff 2, no backoff | 79.64% | 1.46
bigram-c2-uni-lap | Bigram, Laplace smoothing, cutoff 2, backoff to unigram | 78.69% | 1.33
bigram-c5-uni-lap | Bigram, Laplace smoothing, cutoff 5, backoff to unigram | 77.75% | 1.47
trigram-c2-uni | Trigram, Witten-Bell smoothing, cutoff 2, backoff to bigram, then unigram | 79.38% | 1.36
Discussion
It's very difficult to draw any conclusions from the accuracy results. Some models do unexpectedly badly in comparison to others, but all the differences are tiny and only account for a handful of differently interpreted chords.
The entropy results are much more interesting. The unigram models can be seen to be very much worse than the bigrams. Backoff to unigrams gives the expected improvement. The smoothing results are still pretty inconclusive: Laplace actually seems to do better than Witten-Bell.
The trigram model doesn't do better than the best bigram model, though I've only tried one trigram model (because it takes so long to evaluate).