Introduction
I constructed a supertagging model that used n-grams to predict a tag sequence for a chord sequence. For a comparison of different parameters, see NgramSupertaggingModels. In short, most parameters didn't make much difference, so I settled on a standard set and stuck with them. Bigram models worked best.
I call this model St+Parse.
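To make the setup concrete, here is a rough sketch of what a bigram supertagger of this kind might look like. It is not the actual implementation (the real models, parameters and smoothing are described on NgramSupertaggingModels); the function names, data format and add-alpha smoothing below are invented for illustration.

```python
# Rough sketch of a bigram (HMM-style) supertagger with Viterbi decoding.
# NOT the actual implementation: names, data format and smoothing are
# invented for illustration only.
from collections import defaultdict
import math

def train_bigram_tagger(tagged_seqs):
    """tagged_seqs: list of sequences of (chord, tag) pairs."""
    trans = defaultdict(lambda: defaultdict(int))  # counts of tag given previous tag
    emit = defaultdict(lambda: defaultdict(int))   # counts of chord given tag
    for seq in tagged_seqs:
        prev = "<s>"
        for chord, tag in seq:
            trans[prev][tag] += 1
            emit[tag][chord] += 1
            prev = tag
    return trans, emit

def log_prob(counts, key, alpha=0.1):
    # Simple add-alpha smoothing; the real models use other schemes.
    total = sum(counts.values())
    return math.log((counts.get(key, 0) + alpha) / (total + alpha * (len(counts) + 1)))

def viterbi_tags(chords, trans, emit):
    """Most probable tag sequence for a chord sequence under the bigram model."""
    tags = list(emit.keys())
    # best[t] = (log prob, tag sequence) of the best path ending in tag t
    best = {t: (log_prob(trans["<s>"], t) + log_prob(emit[t], chords[0]), [t])
            for t in tags}
    for chord in chords[1:]:
        new_best = {}
        for t in tags:
            score, path = max(((p + log_prob(trans[prev], t), hist)
                               for prev, (p, hist) in best.items()),
                              key=lambda x: x[0])
            new_best[t] = (score + log_prob(emit[t], chord), path + [t])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```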
I also constructed a model of tonal space paths predicted directly from chord sequences (without using the grammar/parser at all). See GrammarlessModels for details. The parameters to the model were identical to those used for the supertagging model.
I call this model Ngram.
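In the same hypothetical terms, the grammarless model can be pictured as exactly the same kind of bigram sequence model, just with steps of the tonal space path as the hidden labels instead of supertags. Reusing the sketch above (the step labels and chords here are made up):

```python
# Same decoder as in the sketch above, but the labels are (invented)
# tonal space path steps rather than supertags.
path_data = [
    [("Dm7", "leftonto"), ("G7", "leftonto"), ("Cmaj7", "resolution")],
]
trans, emit = train_bigram_tagger(path_data)
print(viterbi_tags(["Dm7", "G7", "Cmaj7"], trans, emit))
```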
Finally, I constructed a backoff model that tries to use the supertagging model and falls back to the tonal space path returned by the grammarless model if no full parse can be found.
I call the combined model St+Parse+Ngram.
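The backoff logic itself is simple; roughly (with hypothetical interfaces, not the real code):

```python
# Hypothetical interfaces; only the backoff logic itself is the point.
def predict_path(chords, supertagger, parser, grammarless_model):
    tags = supertagger.tag(chords)         # most probable supertag sequence
    parse = parser.parse(tags)             # may fail to find a full parse
    if parse is not None:
        return parse.tonal_space_path      # St+Parse result
    return grammarless_model.path(chords)  # fall back to the Ngram path
```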
All models are evaluated on the precision, recall and f-score of the most probable tonal space path they return, using 10-fold cross-validation on the JazzCorpus.
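As a very rough sketch of what these scores mean, suppose the predicted and gold paths are compared as multisets of tonal space points; the real evaluation (described in the document below) is the authoritative definition and may align the paths differently.

```python
# Illustration only: treat predicted and gold paths as multisets of points.
from collections import Counter

def path_scores(predicted, gold):
    """Precision, recall and f-score of a predicted path against the gold path."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Paths as sequences of (hypothetical) tonal space coordinates
print(path_scores([(0, 0), (1, 0), (0, 0)], [(0, 0), (1, 0), (1, 1), (0, 0)]))
```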
For a fuller description of the models and analysis of the results, see the document Quantifying the Contribution of the Jazz Grammar over a Pure Statistical Baseline.
St+Parse results
I evaluated a bigram model, with the parameters described on NgramSupertaggingModels.
Precision | 89.9%
Recall | 61.9%
F-score | 73.3%
The trigram didn't do any better, presumably because it's overfitting the data (see below).
Ngram results
I evaluated a pure bigram model with the same parameters.
Precision | 74.6%
Recall | 82.1%
F-score | 78.2%
St+Parse+Ngram results
Again, these are bigram models with the same parameters as above.
Precision | 81.7%
Recall | 88.0%
F-score | 84.7%
Comparison
All of that in one table:
Model | Precision | Recall | F-score | Coverage
St+Parse | 89.9% | 61.9% | 73.3% | 75.0%
Ngram | 74.6% | 82.1% | 78.2% | 100%
St+Parse+Ngram | 81.7% | 88.0% | 84.7% | 100%
Trigrams
You'd imagine trigrams would be able to do a better job, but the results come out similar, in fact a little lower. I tried St+Parse and Ngram with trigram models instead (all other parameters unchanged).
Model | Precision | Recall | F-score | Coverage
St+Parse | 89.5% | 56.9% | 69.6% | 71.1%
Ngram | 74.2% | 82.2% | 78.0% | 100%
My assumption is that the results are lower because the models are overfitting the data. To check, I compared the performance of the St+Parse bigram and trigram models when they're trained on the full dataset and tested on the same data.
Model | Precision | Recall | F-score | Coverage
St+Parse bigram | 93.2% | 65.0% | 76.6% | 76.3%
St+Parse trigram | 93.1% | 68.9% | 79.2% | 78.9%
These results support the overfitting explanation: although the trigram model performed worse than the bigram model in cross-validation, it does slightly better when tested on the data it was trained on.
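For reference, this check amounts to comparing cross-validated scores with scores from training and testing on the same data; a sketch with a hypothetical model interface:

```python
# Hypothetical interface: ModelClass.train(data) returns a model with an
# evaluate(data) method giving an f-score. Only the setup is the point here.
def crossval_fscore(model_class, sequences, folds=10):
    scores = []
    for i in range(folds):
        heldout = sequences[i::folds]
        training = [s for j, s in enumerate(sequences) if j % folds != i]
        scores.append(model_class.train(training).evaluate(heldout))
    return sum(scores) / len(scores)

def self_test_fscore(model_class, sequences):
    # Train on the full dataset and test on the same data
    return model_class.train(sequences).evaluate(sequences)
```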