I've tried constructing some ngram tagging models for the chord supertagging task. This page gives an overview of those models and how well they worked.
I've also tried a baseline model that uses ngrams without the parser: see GrammarlessModels.
For a comparison of the models and the results of a combined model (backoff model), see NgramModelComparison.
More on supertagging
For an intro to my modelling work and supertagging models, see ParsingModels.
Evaluation
I evaluate these models using an accuracy measure and the cross-entropy. In StupidBaselinesTalks I also use n-best accuracy, though the cross-entropy is a single measure that better represents this. These are described on SupertaggerEvaluation.
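For concreteness, here's a minimal Python sketch of the two measures as I use them; the function names and the dict-of-probabilities representation of the tagger's output are illustrative assumptions, not the actual evaluation code (see SupertaggerEvaluation for the real definitions).

```python
import math


def top_tag_accuracy(tag_dists, gold_tags):
    """Fraction of chords for which the tag given the highest probability
    matches the gold-standard tag."""
    correct = sum(1 for dist, gold in zip(tag_dists, gold_tags)
                  if max(dist, key=dist.get) == gold)
    return correct / len(gold_tags)


def cross_entropy(tag_dists, gold_tags):
    """Average negative log2 probability assigned to the correct tag,
    i.e. bits per chord; lower is better."""
    total = -sum(math.log2(dist.get(gold, 1e-12))
                 for dist, gold in zip(tag_dists, gold_tags))
    return total / len(gold_tags)
```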
Smoothing
Summary: the choice of smoothing method doesn't make much difference: add-one is fine on this data.
I use a variety of smoothing methods to assign probability to unseen ngrams. (In most cases this is combined with Katz backoff; see below.) I list them here. It should be noted, though, that the choice of smoothing method turns out to make very little difference to the results, presumably because there is so little data. A rough sketch of the estimators is given at the end of this section.
| Smoothing | Description | Abbreviation |
|---|---|---|
| No smoothing | Pure maximum likelihood estimation | mle |
| Add-one smoothing | Each count is increased by one, so that nothing has 0 counts | laplace |
| Witten-Bell | Uses Witten-Bell discounting to reserve some probability mass for unseen events | wittenbell |
| Good-Turing | Uses Good-Turing discounting in the same way Witten-Bell is used. This works appallingly on this kind of quantity of data, so I've not reported any results for it | goodturing |
To try: I've not tried Kneser-Ney smoothing, but have heard that it's good for this sort of thing.
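Here's the rough sketch of the estimators in the table above. It assumes each model boils down to a dictionary of counts over a known tag vocabulary, and for simplicity the Witten-Bell version spreads the reserved mass uniformly over unseen events, whereas in the backoff models below that mass goes to the lower-order model instead. This is not the actual implementation, just an illustration.

```python
def mle(counts):
    """Pure maximum likelihood: relative frequencies, no mass for unseen events."""
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}


def add_one(counts, vocab):
    """Laplace / add-one: every event in the vocabulary has its count increased by one."""
    total = sum(counts.values()) + len(vocab)
    return {event: (counts.get(event, 0) + 1) / total for event in vocab}


def witten_bell(counts, vocab):
    """Witten-Bell: reserve T/(N+T) of the mass for unseen events, where N is the
    total count and T the number of distinct seen events. Here the reserved mass
    is shared equally among the unseen events; in a backoff model it would go to
    the lower-order distribution instead. Assumes counts is non-empty."""
    n = sum(counts.values())
    t = len(counts)
    unseen = [event for event in vocab if event not in counts]
    dist = {event: c / (n + t) for event, c in counts.items()}
    if unseen:
        unseen_mass = t / (n + t)
        for event in unseen:
            dist[event] = unseen_mass / len(unseen)
    return dist
```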
Backoff
Summary: I'm using Katz backoff, but might want to try interpolation.
I've experimented with using Katz backoff to back off to lower-order ngram models when an ngram is unseen. This requires the use of a smoothing model so that probability mass is reserved for this lower-order model.
Katz backoff only uses the lower-order model's probabilities when the ngram is unseen. Another approach, which I've not yet tried, is interpolation, where the probabilities from the higher- and lower-order models are always combined as a weighted sum.
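A minimal sketch of the two schemes, assuming `higher` and `lower` are functions that return a smoothed tag-to-probability dict for a given history tuple (oldest tag first), and that `alpha` supplies the Katz backoff weight for a history; the interpolation weight `lam` is an arbitrary illustrative value, not one I've tuned.

```python
def katz_backoff_prob(tag, history, higher, lower, alpha):
    """Katz backoff: use the (discounted) higher-order probability when the ngram
    was seen; otherwise fall back to the lower-order model, scaled by the backoff
    weight alpha(history) so that the distribution still sums to one."""
    p = higher(history).get(tag, 0.0)
    if p > 0.0:
        return p
    return alpha(history) * lower(history[1:]).get(tag, 0.0)


def interpolated_prob(tag, history, higher, lower, lam=0.7):
    """Linear interpolation: always mix both models as a weighted sum."""
    return (lam * higher(history).get(tag, 0.0)
            + (1.0 - lam) * lower(history[1:]).get(tag, 0.0))
```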
Cutoff
Summary: I'm using a cutoff of 2 most of the time.
As is common with these sorts of models, I often treat low counts of an observation as if they were zero counts. The cutoff parameter, which I vary in the tests, is the number of observations required for the count to be treated as nonzero. A cutoff of 0 is therefore effectively no cutoff. I'm generally using a cutoff of 2. There are lots of things with counts of 1 and we don't want to trust these, but if we set the cutoff much higher, we're going to miss out on loads of training examples.
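Applying the cutoff to a count table is trivial; this is just an illustrative sketch (the chord-transition keys are made up).

```python
def apply_cutoff(counts, cutoff=2):
    """Keep only observations seen at least `cutoff` times; everything else is
    treated as an unseen event. A cutoff of 0 keeps everything."""
    return {event: c for event, c in counts.items() if c >= cutoff}


# Made-up chord-transition counts, just to show the effect:
counts = {("I", "IV"): 7, ("IV", "V"): 1, ("V", "I"): 12}
print(apply_cutoff(counts, cutoff=2))   # {('I', 'IV'): 7, ('V', 'I'): 12}
```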
Results
Accuracy
I first evaluated top tag accuracy on all of these models. As you can see from the table, using ngram models (even bigrams) barely affects the results at all, over the basic unigram model. This was a worrying result. Further, an analysis of the actual tags getting returned showed that the models were getting the same things wrong in most cases (rather than that the higher-order models made mistakes that counterbalanced their improvements).
Entropy
However, I then took a look at the cross-entropy of the taggers' distributions (see SupertaggerEvaluation for details, and remember that lower is better). This is more informative and shows that the higher-order models do give us an improvement. Taken together with the accuracy results, it is clear that they do not boost the probabilities assigned to the correct tags enough for those tags to overtake the one the unigram model ranked highest.
The entropy results show, though, that they do in general boost the probabilities assigned to the correct tags. This is promising, since the parser will generally use a whole bunch of tags that the tagger assigns sufficiently high probability, so this boost could make an important difference to whether the correct tag ends up in that bunch.
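To make the distinction concrete, here is a made-up two-tag illustration (the numbers are invented, not from the experiments). Suppose the correct tag gets probability 0.30 from the unigram model, against 0.45 for that model's top choice, while the bigram model lifts the correct tag to 0.42 but leaves the competitor at 0.44. Top tag accuracy is identical (both models pick the wrong tag), but the cross-entropy on that chord improves from -log2(0.30) ≈ 1.74 bits to -log2(0.42) ≈ 1.25 bits, and the correct tag now stands a much better chance of making it into the set of tags the parser considers.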
Figures
Entropy is the cross-entropy (see SupertaggerEvaluation), measured in bits per chord.
Accuracy is the top tag accuracy.
Update 6/12: I've recently rerun these experiments; the code and data have changed somewhat since the original experiments reported below. The first table reports the new results, including results from the C&C tagger.
Not sure why the unigram results were so much worse before.
All models use backoff, one order at a time, down to the unigram model.
| Order | Cutoff | Smoothing | Chord map | Entropy | Accuracy |
|---|---|---|---|---|---|
| Trigram | 2 | WB | small | 1.35 | 78.31% |
| Bigram | 0 | WB | small | 1.14 | 79.80% |
| Bigram | 0 | WB | none | 1.23 | |
| Bigram | 0 | WB | big | 1.21 | |
| Bigram | 2 | WB | small | 1.29 | |
| Bigram | 0 | Laplace | small | 1.28 | |
| Unigram | 0 | WB | small | 1.25 | |
| Unigram | 0 | Laplace | small | 1.25 | |
| C&C | | | none | 1.39 | |
| C&C | | | small | 1.52 | |
My original results table.
| Model name | Details | Accuracy | Entropy |
|---|---|---|---|
| unigramsimple | Unigram, no smoothing, no cutoff | 78.63% | 4.86 |
| unigramwb0 | Unigram, Witten-Bell smoothing, no cutoff | 79.98% | 4.88 |
| unigramwb2 | Unigram, Witten-Bell smoothing, cutoff 2 | 79.06% | 4.87 |
| bigramnobackoff | Bigram, Witten-Bell smoothing, cutoff 2, no backoff | 80.10% | 1.86 |
| bigramc2uni | Bigram, Witten-Bell smoothing, cutoff 2, backoff to unigram | 80.81% | 1.29 |
| bigramc5uni | Bigram, Witten-Bell smoothing, cutoff 5, backoff to unigram | 79.35% | 1.53 |
| bigramnobackofflap | Bigram, Laplace smoothing, cutoff 2, no backoff | 79.64% | 1.46 |
| bigramc2unilap | Bigram, Laplace smoothing, cutoff 2, backoff to unigram | 78.69% | 1.33 |
| bigramc5unilap | Bigram, Laplace smoothing, cutoff 5, backoff to unigram | 77.75% | 1.47 |
| trigramc2uni | Trigram, Witten-Bell smoothing, cutoff 2, backoff to bigram, then unigram | 79.38% | 1.36 |
Discussion
It's very difficult to draw any conclusions from the accuracy results. Some models do unexpectedly badly in comparison to others, but all the differences are tiny and only account for a handful of differently interpreted chords.
The entropy results are much more interesting. The unigram models can be seen to be very much worse than the bigrams. Backoff to unigrams gives the expected improvement. The smoothing results are still pretty inconclusive: Laplace actually seems to do better than Witten-Bell.
The trigram model doesn't do better than the best bigram model, though I've only tried one trigram model (because it takes so long to evaluate).