2. PROPOSED SYSTEM
practical to solve it directly. Fortunately, Donoho [17] demonstrated that if the solution of (2) is sparse enough, it is equal to the solution of the following l1-minimization problem:

x^* = \arg\min_x \|x\|_1 \quad \text{subject to} \quad y = Ax.   (3)

It is shown in [18] that the solution of (3) is close to the solution of the following l1-regularized minimization problem:

x^* = \arg\min_x \|y - Ax\|^2 + \lambda \|x\|_1,   (4)

where the regularization parameter λ is a real non-negative number. We use the method described in [19] to solve (4). After obtaining the sparse coefficient vector x* of (4), we consider that a note is present in the frame if the summation of the coefficients corresponding to that note, its activation index, is larger than a threshold.

Table 2. Parameter values used in the proposed system

  Parameter                                                                   Value
  The RMS amplitude for the normalization of the input WAVE file              0.25
  The length of each base element (|a_{i,k}|)                                 1.0
  The radius of the moving Gaussian filter                                    10
  The global threshold for the selection of the significant spectral peaks    1.0
  The minimal margin for the selection of the significant spectral peaks      0.75
  Regularization parameter (λ)                                                100.0
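As an aside, problems of the form (4) can also be solved approximately by iterative soft-thresholding (ISTA). The system itself uses the interior-point method of [19], so the sketch below is only an illustration of the optimization, not the implementation; the function name, step size choice, and iteration count are ours.

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """Approximate arg min_x ||y - Ax||^2 + lam * ||x||_1 by
    iterative soft-thresholding (illustrative, not the paper's solver)."""
    L = np.linalg.norm(A, 2) ** 2            # spectral norm squared of A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step on the quadratic term with step size 1/(2L)
        g = x - (A.T @ (A @ x - y)) / L
        # Soft-thresholding = proximal step for the l1 penalty
        x = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)
    return x
```

With a square, well-conditioned A the iteration converges quickly; for the overcomplete dictionaries used here, more iterations (or the interior-point solver of [19]) would be needed.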
2.4 Temporal smoothing

The FFT described above treats the short-time frames independently, leaving the temporal structure of music unexploited. To address this issue, we use two-state (on and off) hidden Markov models (HMMs) to model the attack and decay of each note [3]. For each note, we want to maximize

\prod_t p(x_t \mid q_t)\, p(q_t \mid q_{t-1}),   (5)

where q_t is the state at time t, x_t is the frame beginning at time t, p(x_t|q_t) is the probability of x_t given q_t, and p(q_t|q_{t-1}) is the transition probability between states. Although we do not know p(x_t|q_t), from the conditional probability we have

p(q_t \mid x_t) \propto p(x_t \mid q_t)\, p(q_t).   (6)

Therefore, we can maximize

\prod_t \frac{p(q_t \mid x_t)}{p(q_t)}\, p(q_t \mid q_{t-1})   (7)

instead of (5). p(q_t|x_t) is obtained by dividing the activation index of the note by the maximum activation index of x_t. Both the prior p(q_t) and the state transition probability p(q_t|q_{t-1}) can be learnt from the training data. Note that in this problem formulation the learning process is about the music structure rather than the timbre characteristic of a particular piano. We apply the Viterbi algorithm to find the solution of (7). After the temporal smoothing by HMMs, the system generates the transcription of the input WAVE file in the MIDI format.

3. EVALUATION

We evaluate the performance of the proposed system against three state-of-the-art systems [1], [7], [20] using ten one-minute-long classical music recordings (in the form of WAVE files) of a Yamaha Disklavier playback grand piano [3]. The executable files of [1], [7] are provided by the authors, and the evaluation result of [20] is provided by its author. Each music recording comes with a MIDI file to serve as the ground truth. There are 4952 notes in the recordings, and the average polyphony is 3.5. Klapuri's system [7] achieved the best performance for the Multiple Fundamental Frequency Tracking task of the Music Information Retrieval Evaluation eXchange (MIREX) in 2008, whereas Yeh's system [20] achieved the best performance for the same contest in 2010¹.

¹ http://www.music-ir.org/mirex/

3.1 Experiment set-up
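The framing used in the experiments (8 kHz mono input, 100 ms frames, 10 ms hop, magnitude FFT) can be sketched as follows. The helper below is illustrative only; it omits any window function, which the text does not specify, and the function name is ours.

```python
import numpy as np

FS = 8000                  # sampling rate (Hz), as in Section 3.1
FRAME = int(0.100 * FS)    # 100 ms frame -> 800 samples
HOP = int(0.010 * FS)      # 10 ms hop   -> 80 samples

def frames_and_spectra(signal):
    """Split a mono signal into overlapping frames and return the
    magnitude spectrum of each frame (rows = frames, cols = FFT bins)."""
    n = 1 + max(0, (len(signal) - FRAME) // HOP)
    out = np.empty((n, FRAME // 2 + 1))
    for i in range(n):
        frame = signal[i * HOP : i * HOP + FRAME]
        out[i] = np.abs(np.fft.rfft(frame))
    return out
```

At 8 kHz with 800-sample frames, bin k of the real FFT corresponds to k × 10 Hz, so a 1 kHz partial lands in bin 100.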
Table 3. Result of the frame-based evaluation

                      F-measure   Precision   Recall
  Proposed system     70.2%       74.4%       66.5%
  Marolt's system     66.1%       78.6%       57.1%
  Klapuri's system    62.2%       72.4%       54.6%

Table 4. Result of the note-based evaluation

                      F-measure   Precision   Recall
  Proposed system     73.0%       74.6%       71.6%
  Yeh's system        67.1%       57.2%       81.1%
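The scores in Tables 3 and 4 follow the standard precision, recall, and F-measure definitions of (8)–(10); a minimal sketch (the function name is ours):

```python
def prf(n_tp, n_fp, n_fn):
    """Precision, recall, and F-measure from true-positive,
    false-positive, and false-negative counts, per (8)-(10)."""
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

As a sanity check, the Table 3 precision of 74.4% and recall of 66.5% combine to 2 × 74.4 × 66.5 / (74.4 + 66.5) ≈ 70.2%, the reported F-measure.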
Fig. 6. Evaluation results in terms of the polyphony of the frames. The numbers in the parentheses are relative frequencies of the polyphony. (a) F-measure. (b) Precision. (c) Recall.

The waveforms in the database are synthesized by two commercial software tools, and three different piano timbres are used². There are 646 base elements in the database, and each note has six to eight corresponding base elements. The testing WAVE files are monaural, and the sampling rate is 8 kHz. The frame size for FFT is 100 milliseconds, and the hop size between successive frames is 10 milliseconds. The training data used for learning the prior and the transition probability is also provided in [3]. Note that the training data is disjoint from the testing data. Values of the parameters of the proposed system are listed in Table 2.

² Piano 1 and Piano 2 of Native Instruments' Bandstand and Acoustic Piano 1 of Arobas's Guitar Pro.

3.2 Frame-based evaluation

We use three typical metrics to evaluate the transcription performance, namely precision, recall, and F-measure. The metrics are defined as follows:

\text{precision} = \frac{N_{tp}}{N_{tp} + N_{fp}},   (8)

\text{recall} = \frac{N_{tp}}{N_{tp} + N_{fn}},   (9)

\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},   (10)

where N_tp is the number of correctly transcribed notes (true positives), N_fp is the number of unvoiced notes transcribed as voiced (false positives), and N_fn is the number of voiced notes transcribed as unvoiced (false negatives) over all frames.

Table 3 shows the average performance results for the frame-based evaluation. We can see that the proposed system achieves a better F-measure, which is a combined metric for recall and precision. It is also observed that the high recall of our system is obtained at the cost of slightly lower precision than Marolt's system [1].

Fig. 6 shows the details of the transcription performance with respect to polyphony. Both the F-measure and recall of the proposed system are consistently better than those of Klapuri's system [7] and Marolt's system. Compared with Klapuri's system, the proposed system has better precision for polyphony 2, 4, and 6 or more. Compared with Marolt's system, the proposed system has better precision for polyphony 2. As mentioned, there is a tradeoff between precision and recall, and the best tradeoff is achieved when the F-measure is maximized. To improve the F-measure performance, a more sophisticated method that learns the music structure, say, the relationship between concurrent notes, should be investigated in future work.

3.3 Note-based evaluation

In addition to the frame-based evaluation, we also evaluate the performance of the proposed system in a note-based manner and compare it with Yeh's system [20]. A note output by the system is considered a correctly transcribed note if its MIDI number matches the ground truth and if the system switches it on within 100 milliseconds of the onset specified in the ground-truth MIDI file. The evaluation is performed for each of the ten music recordings described earlier. At the end of the evaluation, the number of false negatives is obtained by subtracting the number of correctly transcribed notes from the total number of ground-truth notes, and the number of false positives is obtained by subtracting the number of correctly transcribed notes from the total number of notes output by the system. We do not take the offsets of the notes into consideration because they have little perceptual importance [3]. Table 4 shows the result of the note-based evaluation. We can see that the F-measure and precision of the proposed system are better than those of Yeh's system.

In summary, the performance of the proposed system is consistent and unbiased, and the improvement over the three state-of-the-art systems is significant under the one-tailed t-test (p-value < 0.05).
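For concreteness, the note-matching rule of Section 3.3 (same MIDI number, onset within 100 ms) can be sketched as below. The greedy one-to-one matching is our assumption; the text does not specify how ties between nearby candidates are resolved.

```python
def match_notes(ground_truth, transcribed, tol=0.100):
    """Count true positives under the Section 3.3 criterion: a transcribed
    note matches a ground-truth note if the MIDI numbers are equal and the
    onsets differ by at most `tol` seconds. Each note is a tuple
    (midi_number, onset_seconds); offsets are ignored."""
    used = set()     # ground-truth notes already matched (one-to-one)
    n_tp = 0
    for midi, onset in transcribed:
        for i, (g_midi, g_onset) in enumerate(ground_truth):
            if i not in used and midi == g_midi and abs(onset - g_onset) <= tol:
                used.add(i)
                n_tp += 1
                break
    n_fp = len(transcribed) - n_tp    # system notes with no match
    n_fn = len(ground_truth) - n_tp   # ground-truth notes never matched
    return n_tp, n_fp, n_fn
```

The returned counts plug directly into the precision, recall, and F-measure definitions of (8)–(10).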
4. CONCLUSION

We have presented an automatic transcription system for piano music. The system first performs a data pre-processing step for volume normalization, tuning factor estimation, and database tuning. Then, the system decomposes the audio content of the input music file into short-time frames. After the selection of note candidates based on two types of harmonic series, the sparse representation of each frame is computed by l1-regularized minimization, followed by temporal smoothing of the frame-level results based on hidden Markov models. Evaluation results show that the approach is viable.

The system can be improved in three aspects. First, we estimate only one tuning factor for all the piano keys in the current system, but in fact each key has its own tuning. Estimating separate tuning factors for the treble, middle, and bass sections of the piano may help improve the accuracy of the estimation. Second, in addition to learning the intra-note relationship, we should learn the inter-note relationship to gain insight into the temporal structure of music. Last but not least, onset detection can be incorporated into the system to improve performance, for which the sparse representation coefficients serve as a valuable clue.

ACKNOWLEDGEMENT

The authors would like to thank A. Klapuri and C. Yeh for generously providing the source code and evaluation results for comparison.

REFERENCES

[1] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Trans. Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[2] J. Bello, L. Daudet, and M. Sandler, "Automatic piano transcription using frequency and time-domain information," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 2242–2251, Nov. 2006.
[3] G. Poliner and D. Ellis, "A discriminative model for polyphonic piano transcription," EURASIP J. Advances Signal Process., vol. 8, pp. 1–9, 2007.
[4] G. Peeters, "Music pitch representation by periodicity measures based on combined temporal and spectral representations," in Proc. Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, pp. 53–56, May 2006.
[5] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, pp. 1917–1930, 2002.
[6] J. A. Moorer, "On the transcription of musical sound by computer," Comput. Music J., vol. 1, no. 4, pp. 32–38, 1977.
[7] A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. Int. Conf. Music Inform. Retrieval, Victoria, Canada, pp. 216–221, Oct. 2006.
[8] Z. Duan, B. Pardo, and C. Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2121–2133, Nov. 2010.
[9] C. Yeh, N. Bogaards, and A. Roebel, "Synthesized polyphonic music database with verifiable ground truth for multiple-F0 estimation," in Proc. Int. Conf. Music Inform. Retrieval, Vienna, Austria, pp. 393–398, 2007.
[10] S. A. Abdallah and M. D. Plumbley, "Polyphonic transcription by non-negative sparse coding of power spectra," in Proc. Int. Conf. Music Inform. Retrieval, Barcelona, Spain, pp. 318–325, Oct. 2004.
[11] A. Cont, "Realtime multiple pitch observation using sparse non-negative constraints," in Proc. Int. Conf. Music Inform. Retrieval, Victoria, Canada, pp. 206–212, Oct. 2006.
[12] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[13] IBM Corporation and Microsoft Corporation, "Multimedia Programming Interface and Data Specifications 1.0," Aug. 1991.
[14] MIDI Manufacturers Association, "Complete MIDI 1.0 Detailed Specification," Nov. 2001.
[15] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, "Unsupervised single-channel music source separation by average harmonic structure modeling," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 4, pp. 766–778, May 2008.
[16] C. Yeh, A. Roebel, and X. Rodet, "Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1116–1126, Aug. 2010.
[17] D. Donoho, "For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution," Commun. Pure Appl. Math., vol. 59, no. 6, pp. 797–829, 2006.
[18] E. Candès, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[19] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An interior-point method for large-scale l1-regularized least squares," IEEE J. Sel. Topics Signal Process., vol. 1, pp. 606–617, 2007.
[20] C. Yeh and A. Roebel. (2010). Multiple-F0 estimation for MIREX 2010. Music Information Retrieval Evaluation eXchange. [Online]. Available: http://www.music-ir.org/mirex/abstracts/2010/AR1.pdf