mir_eval.transcription
The aim of a transcription algorithm is to produce a symbolic representation of a recorded piece of music in the form of a set of discrete notes. There are different ways to represent notes symbolically. Here we use the piano-roll convention, meaning each note has a start time, a duration (or end time), and a single, constant, pitch value. Pitch values can be quantized (e.g. to a semitone grid tuned to 440 Hz), but do not have to be. Also, the transcription can contain the notes of a single instrument or voice (for example the melody), or the notes of all instruments/voices in the recording. This module is instrument agnostic: all notes in the estimate are compared against all notes in the reference.
There are many metrics for evaluating transcription algorithms. Here we limit ourselves to the most simple and commonly used: given two sets of notes, we count how many estimated notes match the reference, and how many do not. Based on these counts we compute the precision, recall, f-measure and overlap ratio of the estimate given the reference. The default criteria for considering two notes to be a match are adopted from the MIREX Multiple fundamental frequency estimation and tracking, Note Tracking subtask (task 2):
“This subtask is evaluated in two different ways. In the first setup , a returned note is assumed correct if its onset is within +-50ms of a reference note and its F0 is within +- quarter tone of the corresponding reference note, ignoring the returned offset values. In the second setup, on top of the above requirements, a correct returned note is required to have an offset value within 20% of the reference note’s duration around the reference note’s offset, or within 50ms whichever is larger.”
In short, we compute precision, recall, f-measure and overlap ratio, once without taking offsets into account, and the second time with.
For further details see Salamon, 2013 (page 186), and references therein:
Salamon, J. (2013). Melody Extraction from Polyphonic Music Signals. Ph.D. thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2013.
IMPORTANT NOTE: the evaluation code in mir_eval
contains several important
differences with respect to the code used in MIREX 2015 for the Note Tracking
subtask on the Su dataset (henceforth “MIREX”):
mir_eval
uses bipartite graph matching to find the optimal pairing of reference notes to estimated notes. MIREX uses a greedy matching algorithm, which can produce sub-optimal note matching. This will result inmir_eval
’s metrics being slightly higher compared to MIREX.MIREX rounds down the onset and offset times of each note to 2 decimal points using
new_time = 0.01 * floor(time*100)
.mir_eval
rounds down the note onset and offset times to 4 decinal points. This will bring our metrics down a notch compared to the MIREX results.In the MIREX wiki, the criterion for matching offsets is that they must be within
0.2 * ref_duration
or 0.05 seconds from each other, whichever is greater (i.e.offset_dif <= max(0.2 * ref_duration, 0.05)
. The MIREX code however only uses a threshold of0.2 * ref_duration
, without the 0.05 second minimum. Sincemir_eval
does include this minimum, it might produce slightly higher results compared to MIREX.
This means that differences 1 and 3 bring mir_eval
’s metrics up compared to
MIREX, whilst 2 brings them down. Based on internal testing, overall the effect
of these three differences is that the Precision, Recall and F-measure returned
by mir_eval
will be higher compared to MIREX by about 1%-2%.
Finally, note that different evaluation scripts have been used for the Multi-F0
Note Tracking task in MIREX over the years. In particular, some scripts used
<
for matching onsets, offsets, and pitch values, whilst the others used
<=
for these checks. mir_eval
provides both options: by default the
latter (<=
) is used, but you can set strict=True
when calling
mir_eval.transcription.precision_recall_f1_overlap()
in which case
<
will be used. The default value (strict=False
) is the same as that
used in MIREX 2015 for the Note Tracking subtask on the Su dataset.
Conventions
Notes should be provided in the form of an interval array and a pitch array. The interval array contains two columns, one for note onsets and the second for note offsets (each row represents a single note). The pitch array contains one column with the corresponding note pitch values (one value per note), represented by their fundamental frequency (f0) in Hertz.
Metrics
mir_eval.transcription.precision_recall_f1_overlap()
: The precision, recall, F-measure, and Average Overlap Ratio of the note transcription, where an estimated note is considered correct if its pitch, onset and (optionally) offset are sufficiently close to a reference note.mir_eval.transcription.onset_precision_recall_f1()
: The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its onset is sufficiently close to a reference note’s onset. That is, these metrics are computed taking only note onsets into account, meaning two notes could be matched even if they have very different pitch values.mir_eval.transcription.offset_precision_recall_f1()
: The precision, recall and F-measure of the note transcription, where an estimated note is considered correct if its offset is sufficiently close to a reference note’s offset. That is, these metrics are computed taking only note offsets into account, meaning two notes could be matched even if they have very different pitch values.
- mir_eval.transcription.validate(ref_intervals, ref_pitches, est_intervals, est_pitches)
Check that the input annotations to a metric look like time intervals and a pitch list, and throws helpful errors if not.
- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- ref_pitchesnp.ndarray, shape=(n,)
Array of reference pitch values in Hertz
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- est_pitchesnp.ndarray, shape=(m,)
Array of estimated pitch values in Hertz
- mir_eval.transcription.validate_intervals(ref_intervals, est_intervals)
Check that the input annotations to a metric look like time intervals, and throws helpful errors if not.
- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- mir_eval.transcription.match_note_offsets(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)
Compute a maximum matching between reference and estimated notes, only taking note offsets into account.
Given two note sequences represented by
ref_intervals
andest_intervals
(seemir_eval.io.load_valued_intervals()
), we seek the largest set of correspondences(i, j)
such that the offset of reference notei
has to be withinoffset_tolerance
of the offset of estimated notej
, whereoffset_tolerance
is equal tooffset_ratio
times the reference note’s duration, i.e.offset_ratio * ref_duration[i]
whereref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]
. If the resultingoffset_tolerance
is less thanoffset_min_tolerance
(50 ms by default) thenoffset_min_tolerance
is used instead.Every reference note is matched against at most one estimated note.
Note there are separate functions
match_note_onsets()
andmatch_notes()
for matching notes based on onsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- offset_ratiofloat > 0
The ratio of the reference note’s duration used to define the
offset_tolerance
. Default is 0.2 (20%), meaning theoffset_tolerance
will equal theref_duration * 0.2
, or 0.05 (50 ms), whichever is greater.- offset_min_tolerancefloat > 0
The minimum tolerance for offset matching. See
offset_ratio
description for an explanation of how the offset tolerance is determined.- strictbool
If
strict=False
(the default), threshold checks for offset matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).
- Returns:
- matchinglist of tuples
A list of matched reference and estimated notes.
matching[i] == (i, j)
where reference notei
matches estimated notej
.
- mir_eval.transcription.match_note_onsets(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False)
Compute a maximum matching between reference and estimated notes, only taking note onsets into account.
Given two note sequences represented by
ref_intervals
andest_intervals
(seemir_eval.io.load_valued_intervals()
), we see the largest set of correspondences(i,j)
such that the onset of reference notei
is withinonset_tolerance
of the onset of estimated notej
.Every reference note is matched against at most one estimated note.
Note there are separate functions
match_note_offsets()
andmatch_notes()
for matching notes based on offsets only or based on onset, offset, and pitch, respectively. This is because the rules for matching note onsets and matching note offsets are different.- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- onset_tolerancefloat > 0
The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).
- strictbool
If
strict=False
(the default), threshold checks for onset matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).
- Returns:
- matchinglist of tuples
A list of matched reference and estimated notes.
matching[i] == (i, j)
where reference notei
matches estimated notej
.
- mir_eval.transcription.match_notes(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False)
Compute a maximum matching between reference and estimated notes, subject to onset, pitch and (optionally) offset constraints.
Given two note sequences represented by
ref_intervals
,ref_pitches
,est_intervals
andest_pitches
(seemir_eval.io.load_valued_intervals()
), we seek the largest set of correspondences(i, j)
such that:The onset of reference note
i
is withinonset_tolerance
of the onset of estimated notej
.The pitch of reference note
i
is withinpitch_tolerance
of the pitch of estimated notej
.If
offset_ratio
is notNone
, the offset of reference notei
has to be withinoffset_tolerance
of the offset of estimated notej
, whereoffset_tolerance
is equal tooffset_ratio
times the reference note’s duration, i.e.offset_ratio * ref_duration[i]
whereref_duration[i] = ref_intervals[i, 1] - ref_intervals[i, 0]
. If the resultingoffset_tolerance
is less than 0.05 (50 ms), 0.05 is used instead.If
offset_ratio
isNone
, note offsets are ignored, and only criteria 1 and 2 are taken into consideration.
Every reference note is matched against at most one estimated note.
This is useful for computing precision/recall metrics for note transcription.
Note there are separate functions
match_note_onsets()
andmatch_note_offsets()
for matching notes based on onsets only or based on offsets only, respectively.- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- ref_pitchesnp.ndarray, shape=(n,)
Array of reference pitch values in Hertz
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- est_pitchesnp.ndarray, shape=(m,)
Array of estimated pitch values in Hertz
- onset_tolerancefloat > 0
The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).
- pitch_tolerancefloat > 0
The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).
- offset_ratiofloat > 0 or None
The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the
offset_tolerance
will equal theref_duration * 0.2
, or 0.05 (50 ms), whichever is greater. Ifoffset_ratio
is set toNone
, offsets are ignored in the matching.- offset_min_tolerancefloat > 0
The minimum tolerance for offset matching. See offset_ratio description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results if
offset_ratio
is notNone
.- strictbool
If
strict=False
(the default), threshold checks for onset, offset, and pitch matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).
- Returns:
- matchinglist of tuples
A list of matched reference and estimated notes.
matching[i] == (i, j)
where reference notei
matches estimated notej
.
- mir_eval.transcription.precision_recall_f1_overlap(ref_intervals, ref_pitches, est_intervals, est_pitches, onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)
Compute the Precision, Recall and F-measure of correct vs incorrectly transcribed notes, and the Average Overlap Ratio for correctly transcribed notes (see
average_overlap_ratio()
). “Correctness” is determined based on note onset, pitch and (optionally) offset: an estimated note is assumed correct if its onset is within +-50ms of a reference note and its pitch (F0) is within +- quarter tone (50 cents) of the corresponding reference note. Ifoffset_ratio
isNone
, note offsets are ignored in the comparison. Otherwise, on top of the above requirements, a correct returned note is required to have an offset value within 20% (by default, adjustable via theoffset_ratio
parameter) of the reference note’s duration around the reference note’s offset, or withinoffset_min_tolerance
(50 ms by default), whichever is larger.- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- ref_pitchesnp.ndarray, shape=(n,)
Array of reference pitch values in Hertz
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- est_pitchesnp.ndarray, shape=(m,)
Array of estimated pitch values in Hertz
- onset_tolerancefloat > 0
The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).
- pitch_tolerancefloat > 0
The tolerance for an estimated note’s pitch deviating from the reference note’s pitch, in cents. Default is 50.0 (50 cents).
- offset_ratiofloat > 0 or None
The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the
offset_tolerance
will equal theref_duration * 0.2
, oroffset_min_tolerance
(0.05 by default, i.e. 50 ms), whichever is greater. Ifoffset_ratio
is set toNone
, offsets are ignored in the evaluation.- offset_min_tolerancefloat > 0
The minimum tolerance for offset matching. See
offset_ratio
description for an explanation of how the offset tolerance is determined. Note: this parameter only influences the results ifoffset_ratio
is notNone
.- strictbool
If
strict=False
(the default), threshold checks for onset, offset, and pitch matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).- betafloat > 0
Weighting factor for f-measure (default value = 1.0).
- Returns:
- precisionfloat
The computed precision score
- recallfloat
The computed recall score
- f_measurefloat
The computed F-measure score
- avg_overlap_ratiofloat
The computed Average Overlap Ratio score
Examples
>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals( ... 'reference.txt') >>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals( ... 'estimated.txt') >>> (precision, ... recall, ... f_measure) = mir_eval.transcription.precision_recall_f1_overlap( ... ref_intervals, ref_pitches, est_intervals, est_pitches) >>> (precision_no_offset, ... recall_no_offset, ... f_measure_no_offset) = ( ... mir_eval.transcription.precision_recall_f1_overlap( ... ref_intervals, ref_pitches, est_intervals, est_pitches, ... offset_ratio=None))
- mir_eval.transcription.average_overlap_ratio(ref_intervals, est_intervals, matching)
Compute the Average Overlap Ratio between a reference and estimated note transcription. Given a reference and corresponding estimated note, their overlap ratio (OR) is defined as the ratio between the duration of the time segment in which the two notes overlap and the time segment spanned by the two notes combined (earliest onset to latest offset):
>>> OR = ((min(ref_offset, est_offset) - max(ref_onset, est_onset)) / ... (max(ref_offset, est_offset) - min(ref_onset, est_onset)))
The Average Overlap Ratio (AOR) is given by the mean OR computed over all matching reference and estimated notes. The metric goes from 0 (worst) to 1 (best).
Note: this function assumes the matching of reference and estimated notes (see
match_notes()
) has already been performed and is provided by thematching
parameter. Furthermore, it is highly recommended to validate the intervals (seevalidate_intervals()
) before calling this function, otherwise it is possible (though unlikely) for this function to attempt a divide-by-zero operation.- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- matchinglist of tuples
A list of matched reference and estimated notes.
matching[i] == (i, j)
where reference notei
matches estimated notej
.
- Returns:
- avg_overlap_ratiofloat
The computed Average Overlap Ratio score
- mir_eval.transcription.onset_precision_recall_f1(ref_intervals, est_intervals, onset_tolerance=0.05, strict=False, beta=1.0)
Compute the Precision, Recall and F-measure of note onsets: an estimated onset is considered correct if it is within +-50ms of a reference onset. Note that this metric completely ignores note offset and note pitch. This means an estimated onset will be considered correct if it matches a reference onset, even if the onsets come from notes with completely different pitches (i.e. notes that would not match with
match_notes()
).- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- onset_tolerancefloat > 0
The tolerance for an estimated note’s onset deviating from the reference note’s onset, in seconds. Default is 0.05 (50 ms).
- strictbool
If
strict=False
(the default), threshold checks for onset matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).- betafloat > 0
Weighting factor for f-measure (default value = 1.0).
- Returns:
- precisionfloat
The computed precision score
- recallfloat
The computed recall score
- f_measurefloat
The computed F-measure score
Examples
>>> ref_intervals, _ = mir_eval.io.load_valued_intervals( ... 'reference.txt') >>> est_intervals, _ = mir_eval.io.load_valued_intervals( ... 'estimated.txt') >>> (onset_precision, ... onset_recall, ... onset_f_measure) = mir_eval.transcription.onset_precision_recall_f1( ... ref_intervals, est_intervals)
- mir_eval.transcription.offset_precision_recall_f1(ref_intervals, est_intervals, offset_ratio=0.2, offset_min_tolerance=0.05, strict=False, beta=1.0)
Compute the Precision, Recall and F-measure of note offsets: an estimated offset is considered correct if it is within +-50ms (or 20% of the ref note duration, which ever is greater) of a reference offset. Note that this metric completely ignores note onsets and note pitch. This means an estimated offset will be considered correct if it matches a reference offset, even if the offsets come from notes with completely different pitches (i.e. notes that would not match with
match_notes()
).- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- offset_ratiofloat > 0 or None
The ratio of the reference note’s duration used to define the offset_tolerance. Default is 0.2 (20%), meaning the
offset_tolerance
will equal theref_duration * 0.2
, oroffset_min_tolerance
(0.05 by default, i.e. 50 ms), whichever is greater.- offset_min_tolerancefloat > 0
The minimum tolerance for offset matching. See
offset_ratio
description for an explanation of how the offset tolerance is determined.- strictbool
If
strict=False
(the default), threshold checks for onset matching are performed using<=
(less than or equal). Ifstrict=True
, the threshold checks are performed using<
(less than).- betafloat > 0
Weighting factor for f-measure (default value = 1.0).
- Returns:
- precisionfloat
The computed precision score
- recallfloat
The computed recall score
- f_measurefloat
The computed F-measure score
Examples
>>> ref_intervals, _ = mir_eval.io.load_valued_intervals( ... 'reference.txt') >>> est_intervals, _ = mir_eval.io.load_valued_intervals( ... 'estimated.txt') >>> (offset_precision, ... offset_recall, ... offset_f_measure) = mir_eval.transcription.offset_precision_recall_f1( ... ref_intervals, est_intervals)
- mir_eval.transcription.evaluate(ref_intervals, ref_pitches, est_intervals, est_pitches, **kwargs)
Compute all metrics for the given reference and estimated annotations.
- Parameters:
- ref_intervalsnp.ndarray, shape=(n,2)
Array of reference notes time intervals (onset and offset times)
- ref_pitchesnp.ndarray, shape=(n,)
Array of reference pitch values in Hertz
- est_intervalsnp.ndarray, shape=(m,2)
Array of estimated notes time intervals (onset and offset times)
- est_pitchesnp.ndarray, shape=(m,)
Array of estimated pitch values in Hertz
- **kwargs
Additional keyword arguments which will be passed to the appropriate metric or preprocessing functions.
- Returns:
- scoresdict
Dictionary of scores, where the key is the metric name (str) and the value is the (float) score achieved.
Examples
>>> ref_intervals, ref_pitches = mir_eval.io.load_valued_intervals( ... 'reference.txt') >>> est_intervals, est_pitches = mir_eval.io.load_valued_intervals( ... 'estimate.txt') >>> scores = mir_eval.transcription.evaluate(ref_intervals, ref_pitches, ... est_intervals, est_pitches)