During my PhD, we built an annotated corpus of jazz chord sequences. Each chord sequence is annotated with a full harmonic analysis, in the form of a lexical grammatical category for each chord plus some additional structural information required to determine a unique parse.

The purpose of the corpus is to train statistical models for use in parsing. Without any statistical component, the grammar is highly ambiguous and produces a huge number of parses very slowly for any chord sequence.

Data Formats

The Jazz Parser reads data from an internal format that is not easily readable (or editable) without using the parser's codebase. The dataset is also available in a simple CSV-based format more suitable for use in other systems. The parser's codebase includes scripts to read and write this format. Finally, the data is available in a (relatively) human-readable text format.

The data in the internal format is downloaded with the parser's codebase. You can download the data separately in the CSV format (see above). The format is described in detail below.

CSV Format

A corpus consists of three files: one containing metadata, another containing data about the songs and a third containing all the chords in the corpus. The files are called metadata, songs.csv and chords.csv respectively and are bundled in a tarball (gzipped tar archive).
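As a rough sketch of unpacking the bundle without the parser's codebase, the three files can be read straight out of the tarball with Python's standard tarfile module. The function name read_corpus_files is hypothetical, and the sketch assumes the files are UTF-8 text and may sit inside a subdirectory of the archive; neither detail is stated above.

```python
import tarfile

# The three file names given in the description above.
CORPUS_FILES = {"metadata", "songs.csv", "chords.csv"}


def read_corpus_files(tarball_path):
    """Return {file name: text} for the corpus files in a gzipped tarball.

    Assumptions (not confirmed by the corpus documentation): the files
    are UTF-8 text, and they may be nested in a directory in the archive.
    """
    contents = {}
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Tar member names always use "/" as the separator.
            name = member.name.rsplit("/", 1)[-1]
            if member.isfile() and name in CORPUS_FILES:
                contents[name] = tar.extractfile(member).read().decode("utf-8")
    return contents
```

Calling read_corpus_files on the downloaded archive gives a dictionary keyed by metadata, songs.csv and chords.csv, which can then be passed to whatever parsing code you use.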

If you're reading these into Python, you can use the utilities in jazzparser.data.db_mirrors.csv to read them into an internal data structure.


The metadata file contains several fields with information about the corpus. Each field is on a line of its own, beginning with the field name, which is separated from the value by a colon (:).

The following fields are included: Name, Version, Chords file, and Songs file.
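A minimal sketch of reading the metadata file, assuming only the "field name: value" line layout described above. The function name parse_metadata is hypothetical and is not part of the parser's codebase.

```python
def parse_metadata(text):
    """Parse 'Field name: value' lines into a dictionary.

    Splits on the first colon only, so a value that itself contains a
    colon is kept whole. Blank lines and lines without a colon are
    skipped (an assumption; the format description does not mention them).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        fields[name.strip()] = value.strip()
    return fields
```

The resulting dictionary should then have the keys Name, Version, Chords file and Songs file.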

Songs CSV

Each song occupies one line of the CSV. For each song, the columns represent the following fields: id (internal identifier of the song), key (e.g. C major), bar length (integer number of beats) and first chord (line number in chords.csv). Line numbers are 0-indexed and count from the line after the header, so the first chord in the file is line 0. The songs are stored in the order in which they appear in chords.csv, so the chords of a song are found by taking the lines from its first chord line up to, but not including, the first chord line of the next song.
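The line-number scheme can be sketched as follows. Both function names are hypothetical. The sketch assumes a comma-delimited file and that the last song's chords run to the end of chords.csv, which follows from the description but is not stated outright.

```python
import csv
import io


def read_chord_rows(chords_text):
    """Parse the text of chords.csv and drop the header line, so that
    index 0 of the result is the first chord, matching the numbering
    used in the first chord column of the songs file."""
    rows = list(csv.reader(io.StringIO(chords_text)))
    return rows[1:]


def chords_per_song(first_chords, chord_rows):
    """Split chord rows into one list per song.

    first_chords: the first chord value of each song, in file order.
    chord_rows:   the data rows of chords.csv, header already removed.
    Each song runs from its own first chord up to (not including) the
    next song's first chord; the last song runs to the end (assumption).
    """
    bounds = list(first_chords) + [len(chord_rows)]
    return [chord_rows[start:end] for start, end in zip(bounds, bounds[1:])]
```

The first chord values still have to be read from the songs file and converted to integers before being passed in.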

Chords CSV

The chords of the corpus are all stored in one big CSV, one chord per line. Each chord has the following fields as its columns: root (integer pitch class, relative to the piece's main key), type, additions, bass (integer pitch class, relative to main key), duration (integer number of beats), lexical schema (name of lexical schema for this chord's annotated category, see my thesis), coord middle (bool, whether this chord is left not immediately resolved due to coordination) and coord end (bool, whether this is the end of a non-initial coordination constituent).
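Because roots and basses are stored relative to the key, recovering an absolute pitch class needs the tonic of the song's key. The sketch below assumes "relative" means the number of semitones above the tonic, which is the usual reading but is not spelled out here. The function names and the flat-based note spellings are illustrative choices only.

```python
# One arbitrary spelling per pitch class; enharmonic choice is illustrative.
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]


def absolute_pitch_class(relative_pc, tonic_pc):
    """Convert a key-relative pitch class to an absolute one (C = 0).

    Assumes relative_pc counts semitones above the tonic.
    """
    return (tonic_pc + relative_pc) % 12


def note_name(relative_pc, tonic_pc):
    """Name the note for a key-relative pitch class."""
    return NOTE_NAMES[absolute_pitch_class(relative_pc, tonic_pc)]
```

For example, under this assumption a root of 7 in a song in F (tonic pitch class 5) is the note C, the dominant of the key.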


Annotator

I have written an annotation tool for inputting the chords with their grammatical categories. It is a database application with a web interface, written in Python using Django.

The annotator's code is available in the parser's source. It uses a SQLite database, which is contained within the codebase. The annotator is dependent on the parser's code, but not the other way round.

The annotator itself has certain dependencies, but these don't apply to the parser, since the parser is not dependent on the annotator. To get the annotator running, you'll need everything a basic local Django instance depends on, plus SQLite and its Python bindings.