My CCG grammar of chord sequences has far too much lexical ambiguity for full parsing to be feasible. As with natural language grammars, the answer is to use statistical parsing to narrow down the interpretations the grammar assigns to a chord sequence.

I have tried two approaches to this so far: PCFGs, where generative probabilities are computed for each node in the derivation tree for possible derivations; and supertagging, where a computationally efficient model of short-distance dependencies is used to narrow down lexical interpretations, which are then fully parsed.

I have also implemented an ngram model that directly models the semantics given the input sequence.

For a comparison of all models, see my PhD thesis.


These models are not really probabilistic context-free grammars (PCFGs), since our grammars are not CFGs, but they apply the same principle to computing the probabilities of derivations. This is the method used by Julia Hockenmaier.

The PCFG parsing model on its own doesn't do very well. However, using one of the tagging models (below) to select lexical categories and then parsing using the PCFG model gets us our overall best result.


A supertagger uses a sequence model (often a log-linear model) to choose lexical interpretations (grammatical categories) for chords, which are then parsed in full. The model is very much like a part-of-speech tagging model.

So far, I've been trying generative HMM-type models for tagging chords with categories. I've tried various orders of N-gram models, with some simple smoothing and backoff techniques.

Unigram models

I started with some very simple unigram models. These establish a dumb, and rather high, baseline. I tried different kinds of observations based on the input chords. For details, take a look at my short presentation slides at StupidBaselinesTalks. This is a good intro to how I'm preprocessing the input and why.

N-gram models

I have experimented with different orders of n-gram models, using different types of smoothing and backoff. For details of the models and results, see NgramSupertaggingModels.

Here's an overview of what I found. We're working with a very small amount of training data. Smoothing is very important, but the choice of smoothing method does not make a big difference to the results. Laplace smoothing works pretty much as well as any other. Using models of higher order than the unigram models (even bigrams) barely affects the accuracy of the most probable tag assignment, but does work well in boosting the probability assigned to the correct tag.

N-gram Baseline

As a baseline to establish what work is being done by the grammar, I have implemented a direct model of the semantics of a sequence given the sequence, which uses no grammatical information at all.

Summary: the grammarless model does quite a lot worse than a similar model used as a supertagger for parsing with the grammar, demonstrating that the grammatical information is improving the resulting semantics.

For details and results, see GrammarlessModels.

Combined Model

I've also implemented a combined model that uses a bigram supertagging model and backs off to the n-gram (grammarless) baseline if no parse is found. This brings the coverage up to 100%, hurts the precision a bit, of course, but very much increases the f-score.

For the comparison of the models, see NgramModelComparison.