Introduction

I constructed a supertagging model that used n-grams to predict a tag sequence for a chord sequence. For a comparison of different parameters, see NgramSupertaggingModels. In short, most parameters didn't make much difference, so I settled on a standard set and stuck with them. Bigram models worked best.
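For illustration, here is a minimal sketch of how bigram supertag decoding can work: a standard Viterbi search over tag sequences. The tag inventory and probability tables in the demo are toy stand-ins, not the trained model's parameters (those, and the smoothing, are described on NgramSupertaggingModels).

```python
import math

def viterbi_bigram(chords, tags, emit, trans, start):
    """Find the most probable tag sequence for a chord sequence.

    emit[tag][chord]  -- P(chord | tag)
    trans[prev][tag]  -- P(tag | previous tag)
    start[tag]        -- P(tag at position 0)
    Assumes all needed probabilities are nonzero (i.e. smoothed).
    """
    # delta[t] = best log-probability of any tag path ending in t
    delta = {t: math.log(start[t]) + math.log(emit[t][chords[0]])
             for t in tags}
    back = []  # backpointers, one dict per position after the first
    for chord in chords[1:]:
        new_delta, ptr = {}, {}
        for t in tags:
            prev, score = max(
                ((p, delta[p] + math.log(trans[p][t])) for p in tags),
                key=lambda x: x[1])
            new_delta[t] = score + math.log(emit[t][chord])
            ptr[t] = prev
        delta = new_delta
        back.append(ptr)
    # Trace back from the best final tag
    best = max(delta, key=delta.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

if __name__ == "__main__":
    # Toy example: two hypothetical supertags over a ii-V-I sequence.
    tags = ["T", "D"]
    start = {"T": 0.7, "D": 0.3}
    trans = {"T": {"T": 0.4, "D": 0.6}, "D": {"T": 0.8, "D": 0.2}}
    emit = {"T": {"C": 0.6, "G7": 0.1, "Dm7": 0.3},
            "D": {"C": 0.1, "G7": 0.7, "Dm7": 0.2}}
    print(viterbi_bigram(["Dm7", "G7", "C"], tags, emit, trans, start))
```

The grammarless model described next can be decoded the same way, with tonal space path points in place of supertags.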

I also constructed a model of tonal space paths predicted directly from chord sequences (without using the grammar/parser at all). See GrammarlessModels for details. The parameters to the model were identical to those used for the supertagging model.

Finally, I constructed a backoff model that tries to use the supertagging model and falls back to the tonal space path returned by the grammarless model if no full parse can be found.
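The backoff logic itself is trivial to express. A minimal sketch, assuming hypothetical supertag_and_parse and grammarless_path functions, where the former returns None when no full parse can be found:

```python
def backoff_path(chords, supertag_and_parse, grammarless_path):
    """Return a tonal space path for a chord sequence.

    Try the supertagger + parser first; if no full parse can be
    found (signalled here by returning None), fall back to the
    path predicted directly by the grammarless n-gram model.
    """
    path = supertag_and_parse(chords)
    if path is not None:
        return path
    return grammarless_path(chords)
```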

All models are evaluated on the precision, recall and f-score of the most probable tonal space path they return, using 10-fold cross-validation on the JazzCorpus.
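To make the metrics concrete, here is a minimal sketch of how precision, recall and f-score can be computed over path points. The multiset-intersection matching is an assumption for illustration; the real evaluation may align the paths differently.

```python
from collections import Counter

def prf(predicted, gold):
    """Precision, recall and f-score of a predicted tonal space path
    against the gold path, counting matched path points."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```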

St+Parse results

I evaluated a bigram model with the parameters described on NgramSupertaggingModels.

Precision: 89.9%
Recall:    61.9%
F-score:   73.3%

The trigram model didn't do any better, presumably because it overfits the data (see below).

Ngram results

I evaluated a pure bigram model with the same parameters.

Precision: 74.6%
Recall:    82.1%
F-score:   78.2%

St+Parse+Ngram results

Again, these are bigram models with the same parameters as above.

Precision: 81.7%
Recall:    88.0%
F-score:   84.7%

Comparison

All of that in one table:

Model           Precision  Recall  F-score  Coverage
St+Parse        89.9%      61.9%   73.3%    75.0%
Ngram           74.6%      82.1%   78.2%    100%
St+Parse+Ngram  81.7%      88.0%   84.7%    100%

Trigrams

You'd imagine trigrams would be able to do a better job, but the results come out similar, in fact slightly lower. I tried St+Parse and Ngram with trigram models instead (otherwise the same parameters).

Model     Precision  Recall  F-score  Coverage
St+Parse  89.5%      56.9%   69.6%    71.1%
Ngram     74.2%      82.2%   78.0%    100%

My assumption is that the results are lower because the trigram models overfit the data. To check this, I compared the St+Parse bigram and trigram models when trained on the full dataset and tested on the same data: if overfitting is the cause, the trigram should beat the bigram on its own training data even though it loses in cross-validation.

Model             Precision  Recall  F-score  Coverage
St+Parse bigram   93.2%      65.0%   76.6%    76.3%
St+Parse trigram  93.1%      68.9%   79.2%    78.9%

These results support the overfitting explanation: although the trigram model performed worse than the bigram model in cross-validation, it does slightly better when tested on the data it was trained on.