
AUTOMATIC TRANSCRIPTION OF PIANO MUSIC BY SPARSE REPRESENTATION OF MAGNITUDE SPECTRA

Cheng-Te Lee, Yi-Hsuan Yang, and Homer Chen

National Taiwan University


aderleee@gmail.com, affige@gmail.com, homer@cc.ee.ntu.edu.tw

ABSTRACT

Assuming that the waveforms of piano notes are pre-stored and that the magnitude spectrum of a piano signal segment can be represented as a linear combination of the magnitude spectra of the pre-stored piano waveforms, we formulate the automatic transcription of polyphonic piano music as a sparse representation problem. First, the note candidates of the piano signal segment are found by using heuristic rules. Then, the sparse representation problem is solved by l1-regularized minimization, followed by temporal smoothing of the frame-level results based on hidden Markov models. Evaluation against three state-of-the-art systems using ten classical music recordings of a real piano is performed to show the performance improvement of the proposed system.

Index Terms— F0 estimation, multiple pitch estimation, sparse representation, l1-regularized minimization.

1. INTRODUCTION

Music transcription is the process of converting a musical recording to a musical score (i.e., a symbolic representation). In the strict musical sense, this process entails a number of complicated tasks to extract musical attributes, including the pitch, starting time, duration, instrument type, and dynamics of each note, as well as the tempo, meter signature, and tonality (key signature) of the music. In its simplest form, music transcription can be reduced to the estimation of the pitch and timing of each individual note. Such information, although only a small subset of the whole set of musical attributes, is sufficiently useful for content-based music retrieval, such as query by humming, query by example, and cover song identification. This paper mainly deals with this simplified version of music transcription.

Like the work described in [1]-[3], we limit the audio source to piano (either real or synthesized). Although only a single type of audio source is considered, the problem is by no means trivial. In fact, Peeters [4] showed that the transcription of piano music is more challenging than that of other instruments because pianos have a wide pitch range and high polyphony. We are motivated to tackle the transcription of piano music in this work also because the piano is the most popular musical instrument worldwide. The fact that usually no more than six concurrent notes are played from the 88 notes of a piano inspires us to use sparse representation to solve the transcription problem.

___________________________
This work was supported by the National Science Council of Taiwan under contract NSC 97-2221-E-002-111-MY3.

1.1 Previous work

Music signals can be monophonic or polyphonic. For polyphonic music, one has to deal with issues such as source number ambiguity and octave ambiguity that do not exist in monophonic music; therefore, polyphonic music transcription is more complicated than monophonic music transcription. For this reason, monophonic music transcription is relatively easy to deal with and mature [5], but polyphonic music transcription still has much room for improvement.

Moorer [6] made the first attempt to transcribe polyphonic music. Although his system was limited to duets, it has inspired many interesting ideas. For example, Marolt [1] applied neural networks and oscillator networks to track partials, Klapuri [7] adopted an iterative estimation and cancellation scheme to estimate concurrent fundamental frequencies (F0s), Bello et al. [2] applied a rule-based framework to find meaningful groups of spectral peaks, Poliner et al. [3] adopted a machine learning approach to determine whether a particular note is present, and Duan et al. [8] presented a probabilistic approach to fundamental frequency estimation. These systems used batch learning to learn the relationships between the spectra of music signals and the underlying notes. Therefore, the models cannot adapt to different types of music without re-training. Moreover, generating the ground truth data for learning is time-consuming [9]. Abdallah and Plumbley [10] and Cont [11] introduced a sparsity constraint into their transcription systems, but they used only one base element for each note. As shown by Wright et al. [12], the performance of sparse representation highly depends on the overcompleteness of the dictionary of base elements. Here, sparse representation refers to an expression of the input signal as a linear combination of base elements such that the resulting representation has only a few non-zero terms.

To address these issues, we propose a system that adapts to different pianos without retraining. It only requires the WAVE [13] files of a dozen individual notes of a new piano to be added to the piano sound database. In addition, the system uses more than one base element per note to improve the transcription performance.



Fig. 1. The proposed automatic piano transcription system.

1.2 Organization of the paper

The organization of this paper is as follows. In Section 2, we propose a piano transcription system that is based on sparse representation of magnitude spectra without the need for model training. In Section 3, the proposed system is applied to transcribe ten recordings of solo piano music [3] and compared with previous approaches. Section 4 concludes the paper.

2. PROPOSED SYSTEM

Fig. 1 shows the schematic diagram of our transcription system. The system takes a piano WAVE file as input and generates the transcription in the MIDI file format [14]. The system first normalizes the root-mean-squared (RMS) amplitude of the input WAVE file and estimates the tuning factor of the piano. Then the samples in the piano sound database are tuned accordingly. During this preprocessing stage, the system also decomposes the audio content of the normalized WAVE file into short-time frames and applies the fast Fourier transform (FFT) to each frame. The note candidates are selected according to the resulting spectrum and the tuning factor. Then, the sparse representation of the frame spectrum in terms of the spectra of the note candidates is computed. Because we formulate the sparse representation as an l1-regularized minimization problem, there are some small non-zero coefficients that are equivalent to noise. The system eliminates such noise by thresholding. Finally, the system applies temporal smoothing to the resulting frame-level sparse representation to generate the transcription. The details of each component are described below.

Fig. 2. The original and the tuned spectra of a sample of note C4 (MIDI number 60). The estimated tuning factor is 0.982.
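For illustration, the framing and FFT front end can be sketched as follows in Python, using the frame size (100 ms), hop size (10 ms), and sampling rate (8 kHz) reported in Section 3.1; the Hann window is an assumption, as no window function is specified.

```python
import numpy as np

def frame_magnitudes(signal, sr=8000, frame_ms=100, hop_ms=10):
    """Decompose a normalized mono signal into short-time frames and
    return the magnitude spectrum of each frame.

    A minimal sketch; assumes the signal is at least one frame long and
    that a Hann window is acceptable (the window is not specified above).
    """
    frame = int(sr * frame_ms / 1000)            # 800 samples at 8 kHz
    hop = int(sr * hop_ms / 1000)                # 80 samples at 8 kHz
    window = np.hanning(frame)                   # assumed window function
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    mags = np.empty((n_frames, frame // 2 + 1))
    for t in range(n_frames):
        chunk = signal[t * hop : t * hop + frame] * window
        mags[t] = np.abs(np.fft.rfft(chunk))     # magnitude spectrum of frame t
    return mags
```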

2.1 Data preprocessing

Before the transcription process starts, a data pre-processing step is performed.

First, a database of piano note waveforms with standard tuning is built. There should be at least one waveform for each piano note in the database; there can also be several waveforms generated by different pianos for the same note. Each waveform is then divided into short-time samples, and the energy of each sample is normalized.

Then, the RMS amplitude of the input piano WAVE file is normalized, and the tuning of the piano is estimated using the method proposed in [1], which uses adaptive oscillators to find partials in the input WAVE file and calculates the tuning factor by comparing the frequencies of the partials to the frequencies of the notes of a piano with standard tuning (i.e., A4 = 440 Hz). After the tuning factor is obtained, the samples in the database are tuned using the algorithm shown in Table 1. Note that the tuning factor is a real number and that its value is 1.0 for standard tuning. Fig. 2 shows the difference between the original and the tuned spectra of a note sample. The spectra of the tuned samples serve as the base elements of the sparse representation.

Table 1. Tuning algorithm
Input: magnitude spectrum M of a sample, tuning factor τ
Output: tuned magnitude spectrum Mt
1. For i = 1 to the dimension of M
2.    Mt[i] = 0
3. For i = 1 to the dimension of M
4.    n = round((i − 1) × τ) + 1   /* round(x) rounds x to the nearest integer */
5.    Mt[n] = Mt[n] + M[i]   /* accumulate from the original spectrum M */
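A minimal NumPy sketch of the Table 1 procedure, with 0-based indexing and an added bounds check for bins that map past the end of the spectrum:

```python
import numpy as np

def tune_spectrum(M, tau):
    """Remap the bins of a magnitude spectrum by the tuning factor tau (Table 1).

    M is a 1-D array of magnitudes; tau is a real number close to 1.0
    (exactly 1.0 for standard tuning).
    """
    Mt = np.zeros_like(M)
    for i in range(len(M)):
        n = int(round(i * tau))   # nearest tuned bin (0-based)
        if n < len(Mt):           # bounds check, added for tau > 1
            Mt[n] += M[i]         # accumulate magnitude from the original spectrum
    return Mt
```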
2.2 Note candidates selection

For each frame, the system selects a number of note candidates to reduce the search space of the sparse representation coefficients and thereby the time complexity of computing the sparse representation.

First, the system detects significant spectral peaks in a manner similar to the one proposed in [15], as illustrated in Fig. 3. The smoothed spectrum is obtained by convolving the original spectrum with a moving Gaussian filter. If the magnitude at a frequency satisfies the two requirements described below, it is considered a significant spectral peak. The first requirement is that the original magnitude of a significant spectral peak be higher than a predefined global threshold. The second requirement, which is a local one, is that the original magnitude of a significant spectral peak be higher than the magnitude of its smoothed version by a predefined margin.

Fig. 3. Illustration of the peak detection algorithm.

In addition to peak detection, the system determines the frequency of each note. The frequency f of the nth key of a piano is defined by

f = (440 × τ) × 2^((n − 49)/12), (1)

where τ is the tuning factor.
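The peak detector and the key-frequency mapping of (1) can be sketched as follows; the Gaussian kernel width and the restriction to local maxima are assumptions, since only the filter radius, the global threshold, and the margin (Table 2) are specified.

```python
import numpy as np

def key_frequency(n, tau=1.0):
    """Frequency of the nth piano key under tuning factor tau, as in (1)."""
    return (440.0 * tau) * 2.0 ** ((n - 49) / 12.0)

def significant_peaks(mag, global_thresh=1.0, margin=0.75, radius=10):
    """Detect significant spectral peaks in a frame's magnitude spectrum.

    A bin qualifies if (i) its magnitude exceeds the global threshold and
    (ii) it exceeds the Gaussian-smoothed spectrum by the margin (values
    from Table 2). sigma = radius/3 and the local-maximum test are assumptions.
    """
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / (radius / 3.0)) ** 2)
    kernel /= kernel.sum()                        # moving Gaussian filter
    smoothed = np.convolve(mag, kernel, mode="same")
    local_max = np.r_[False, (mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:]), False]
    return np.where(local_max & (mag > global_thresh) & (mag > smoothed + margin))[0]
```

With standard tuning (τ = 1.0), key_frequency(49) returns 440 Hz (A4) and key_frequency(40) returns about 261.6 Hz (C4).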
The system can then select note candidates by the following two heuristic rules (a code sketch follows the list):

• If the magnitude of a significant spectral peak at frequency fp is higher than the magnitudes of the spectrum at the integer multiples (harmonics) of fp, the note whose frequency is nearest to the spectral peak is selected as a candidate.

• Following the first rule, if the magnitude of the spectrum at fp/2 is higher than a pre-defined threshold, the note whose frequency is nearest to fp/2 is also selected as a candidate. The same procedure is applied for fp/3.

The two rules are designed for the two types of harmonic series of piano notes: strong-fundamental and weak-fundamental. As shown in Fig. 4, the spectra of the same note played on different pianos may look different. Fig. 4(a) shows an example note of the strong-fundamental type; its spectral envelope has the maximum magnitude at the fundamental frequency, marked by a circle. The first rule is used to detect notes of the strong-fundamental type, while the second rule is used to detect notes whose spectral envelope has the maximum magnitude at one of the harmonics; see Fig. 4(b). Notes with such a spectral envelope belong to the weak-fundamental type. Yeh et al. give a similar classification of harmonic series [16].

Fig. 4. Spectra of note D4 (MIDI number 62) of two different pianos. The magnitude at the fundamental frequency is marked by a circle. (a) Strong-fundamental type. (b) Weak-fundamental type.
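A sketch of the two rules, where the number of harmonics checked (here, up to the fifth) and the name sub_thresh for the pre-defined threshold of the second rule are assumptions not specified above:

```python
import numpy as np

def note_candidates(mag, bin_hz, peaks, tau=1.0, sub_thresh=1.0, n_harm=5):
    """Select note candidates from the significant peaks using the two rules.

    mag: magnitude spectrum; bin_hz: width of one FFT bin in Hz;
    peaks: output of significant_peaks(). sub_thresh and n_harm are assumed.
    """
    def nearest_key(f):
        # invert (1): n = 49 + 12*log2(f / (440*tau)), clamped to the 88 keys
        n = int(round(49 + 12 * np.log2(f / (440.0 * tau))))
        return min(88, max(1, n))

    def mag_at(f):
        return mag[int(round(f / bin_hz))]

    top = (len(mag) - 1) * bin_hz
    candidates = set()
    for p in peaks:
        fp = p * bin_hz
        # Rule 1: the peak dominates its harmonics -> strong-fundamental note
        if all(mag[p] > mag_at(k * fp) for k in range(2, n_harm + 1) if k * fp < top):
            candidates.add(nearest_key(fp))
            # Rule 2: strong subharmonics suggest a weak-fundamental note
            for d in (2, 3):
                if mag_at(fp / d) > sub_thresh:
                    candidates.add(nearest_key(fp / d))
    return sorted(candidates)
```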
2.3 Computation of sparse representation

We assume that the magnitude spectrum of a frame can be represented as a linear combination of the base elements, which are the spectra of samples of the note candidates in the tuned database. However, the coefficients of such a linear combination may not be unique because of the mathematical dependencies between the base elements. To address this issue, we impose a sparsity constraint requiring that the number of non-zero coefficients be minimal. The sparsity constraint is reasonable [12] because “the sparsest representation is naturally discriminative.” For the transcription problem dealt with in this paper, the constraint implies that the spectrum of a frame is most compactly expressed using the base elements of the notes present in the frame. Fig. 5 illustrates the idea of sparse representation.

Denote the dictionary of base elements for the ith key of a piano by A_i = [a_{i,1} | a_{i,2} | … | a_{i,N_i}], where a_{i,k} is a column vector containing the magnitudes of the Fourier coefficients of the kth short-time sample of that key and N_i is the number of such samples. Let the column vector y be the magnitudes of the Fourier coefficients of a frame. The problem now is to find a sparse coefficient vector x* such that

x* = arg min_x ||x||_0 subject to y = Ax, (2)

where A contains the dictionaries of the note candidates. Because finding the solution of (2) is NP-hard, it is not practical to solve it directly.

Fig. 5. Illustration of the sparse representation of a frame spectrum for music transcription. (a) The magnitude spectrum of a frame. (b) The coefficients of the sparse representation. (c) The base elements. (d) The residues.

Fortunately, Donoho [17] demonstrated that if the solution of (2) is sparse enough, it is equal to the solution of the following l1-minimization problem:

x* = arg min_x ||x||_1 subject to y = Ax. (3)

It is shown in [18] that the solution of (3) is close to that of the following l1-regularized minimization problem:

x* = arg min_x ||y − Ax||_2 + λ ||x||_1, (4)

where the regularization parameter λ is a non-negative real number. We use the method described in [19] to solve (4). After obtaining the sparse coefficient vector x* of (4), we consider a note to be present in the frame if the summation of the coefficients corresponding to that note, its activation index, is larger than a threshold.

Table 2. Parameter values used in the proposed system

Parameter                                                          Value
RMS amplitude for the normalization of the input WAVE file        0.25
Length of each base element (|a_{i,k}|)                           1.0
Radius of the moving Gaussian filter                              10
Global threshold for the selection of significant spectral peaks  1.0
Minimal margin for the selection of significant spectral peaks    0.75
Regularization parameter (λ)                                      100.0
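Problem (4) can be prototyped with an off-the-shelf l1 solver. The sketch below substitutes scikit-learn's Lasso for the interior-point method of [19]; the rescaling of λ to scikit-learn's alpha (Lasso normalizes the squared data term by the problem size), the non-negativity constraint on the coefficients, and the activation threshold value are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in for the solver of [19]

def frame_activations(y, A, note_of_column, lam=100.0, thresh=0.1):
    """Sparse coefficients for one frame (problem (4)) and per-note activations.

    y: magnitude spectrum of the frame; A: matrix whose columns are the base
    elements of the selected note candidates; note_of_column[j]: MIDI number
    of the note that column j belongs to. lam is lambda from Table 2; thresh
    (the activation threshold) is an assumed value.
    """
    m = A.shape[0]
    # Lasso minimizes (1/(2m))*||y - Ax||_2^2 + alpha*||x||_1, so lambda is
    # rescaled; positive=True keeps coefficients non-negative, since
    # magnitude spectra cannot cancel one another.
    solver = Lasso(alpha=lam / (2.0 * m), positive=True, max_iter=10000)
    solver.fit(A, y)
    x = solver.coef_
    activation = {}
    for j, c in enumerate(x):             # sum coefficients per candidate note
        activation[note_of_column[j]] = activation.get(note_of_column[j], 0.0) + c
    present = [n for n, a in activation.items() if a > thresh]
    return x, sorted(present)
```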
2.4 Temporal smoothing

The FFT described above treats the short-time frames independently, leaving the temporal structure of the music unexploited. To address this issue, we use two-state (on and off) hidden Markov models (HMMs) to model the attack and decay of each note [3]. For each note, we want to maximize

∏_t p(x_t | q_t) p(q_t | q_{t−1}), (5)

where q_t is the state at time t, x_t is the frame beginning at time t, p(x_t | q_t) is the probability of x_t given q_t, and p(q_t | q_{t−1}) is the transition probability between states. Although we do not know p(x_t | q_t), from the conditional probability we have

p(q_t | x_t) ∝ p(x_t | q_t) p(q_t). (6)

Therefore, we can maximize

∏_t [ p(q_t | x_t) / p(q_t) ] p(q_t | q_{t−1}) (7)

instead of (5). p(q_t | x_t) is obtained by dividing the activation index of the note by the maximum activation index of x_t. Both the prior p(q_t) and the state transition probability p(q_t | q_{t−1}) can be learnt from the training data. Note that in this problem formulation the learning process captures the music structure rather than the timbre characteristics of a particular piano. We apply the Viterbi algorithm to find the solution of (7). After the temporal smoothing by the HMMs, the system generates the transcription of the input WAVE file in the MIDI format.
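A sketch of the two-state Viterbi pass that maximizes (7). Taking the off-state posterior as the complement of the on-state posterior, and dropping a separate initial-state distribution for the first frame, are assumptions.

```python
import numpy as np

def smooth_note(posterior, prior, trans):
    """Two-state Viterbi smoothing of one note's frame activations, per (7).

    posterior[t, s] approximates p(q_t = s | x_t); prior[s] is p(q_t = s);
    trans[s, s'] is p(q_t = s' | q_{t-1} = s), learnt from training data.
    States are 0 = off, 1 = on. Log-probabilities are used for stability.
    """
    eps = 1e-12
    T = len(posterior)
    score = np.log(posterior[0] + eps) - np.log(prior + eps)
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        # cand[s_prev, s]: score of reaching state s at time t from s_prev
        cand = score[:, None] + np.log(trans + eps)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(posterior[t] + eps) - np.log(prior + eps)
    # backtrack the best on/off state sequence
    states = np.empty(T, dtype=int)
    states[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states
```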
3. EVALUATION

We evaluate the performance of the proposed system against three state-of-the-art systems [1], [7], [20] using ten one-minute classical music recordings (in the form of WAVE files) of a Yamaha Disklavier playback grand piano [3]. The executable files of [1] and [7] were provided by their authors, and the evaluation result of [20] was provided by its author. Each music recording comes with a MIDI file that serves as the ground truth. There are 4952 notes in the recordings, and their average polyphony is 3.5. Klapuri’s system [7] achieved the best performance in the Multiple Fundamental Frequency Tracking task of the Music Information Retrieval Evaluation eXchange (MIREX)¹ in 2008, whereas Yeh’s system [20] achieved the best performance in the same contest in 2010.

¹ http://www.music-ir.org/mirex/

3.1 Experiment set-up

The waveforms in the database are synthesized by two commercial software tools, and three different piano timbres are used². There are 646 base elements in the database, and each note has six to eight corresponding base elements. The testing WAVE files are monaural, and the sampling rate is 8 kHz. The frame size for the FFT is 100 milliseconds, and the hop size between successive frames is 10 milliseconds. The training data used for learning the prior and the transition probability is also provided in [3]. Note that the training data is disjoint from the testing data. The parameter values of the proposed system are listed in Table 2.

² Piano 1 and Piano 2 of Native Instruments’ Bandstand and Acoustic Piano 1 of Arobas’s Guitar Pro.

3.2 Frame-based evaluation

We use three typical metrics to evaluate the transcription performance, namely precision, recall, and F-measure:

precision = N_tp / (N_tp + N_fp), (8)

recall = N_tp / (N_tp + N_fn), (9)

F-measure = (2 × Precision × Recall) / (Precision + Recall), (10)

where N_tp is the number of correctly transcribed notes (true positives), N_fp is the number of unvoiced notes transcribed as voiced (false positives), and N_fn is the number of voiced notes transcribed as unvoiced (false negatives), counted over all frames.
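Eqs. (8)-(10) translate directly into code; the zero-division guards are additions, not part of the definitions above.

```python
def prf(n_tp, n_fp, n_fn):
    """Precision, recall, and F-measure from frame-level counts, per (8)-(10)."""
    precision = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    recall = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```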
Table 3. Result of the frame-based evaluation
                    F-measure   Precision   Recall
Proposed system     70.2%       74.4%       66.5%
Marolt’s system     66.1%       78.6%       57.1%
Klapuri’s system    62.2%       72.4%       54.6%

Table 3 shows the average performance results for the frame-based evaluation. We can see that the proposed system achieves a better F-measure, which is a combined metric for recall and precision. It is also observed that the high recall of our system is obtained at the cost of slightly lower precision than Marolt’s system [1].

Fig. 6 shows the details of the transcription performance with respect to polyphony. Both the F-measure and the recall of the proposed system are consistently better than those of Klapuri’s system [7] and Marolt’s system. Compared with Klapuri’s system, the proposed system has better precision for polyphony 2, 4, and 6 or more. Compared with Marolt’s system, the proposed system has better precision for polyphony 2. As mentioned, there is a tradeoff between precision and recall, and the best tradeoff is achieved when the F-measure is maximized. To improve the F-measure, a more sophisticated method that learns the music structure, say, the relationship between concurrent notes, should be investigated in future work.

Fig. 6. Evaluation results in terms of the polyphony of the frames. The numbers in the parentheses are the relative frequencies of the polyphony. (a) F-measure. (b) Precision. (c) Recall.
3.3 Note-based evaluation

In addition to the frame-based evaluation, we also evaluate the performance of the proposed system in a note-based manner and compare it with Yeh’s system [20]. A note output by the system is considered correctly transcribed if its MIDI number matches the ground truth and if the system switches it on within 100 milliseconds of the onset specified in the ground-truth MIDI file. The evaluation is performed for each of the ten music recordings described earlier. At the end of the evaluation, the number of false negatives is obtained by subtracting the number of correctly transcribed notes from the total number of ground-truth notes, and the number of false positives is obtained by subtracting the number of correctly transcribed notes from the total number of notes output by the system. We do not take note offsets into consideration because they have little perceptual importance [3].

Table 4. Result of the note-based evaluation
                    F-measure   Precision   Recall
Proposed system     73.0%       74.6%       71.6%
Yeh’s system        67.1%       57.2%       81.1%

Table 4 shows the result of the note-based evaluation. We can see that the F-measure and precision of the proposed system are better than those of Yeh’s system.

In summary, the performance of the proposed system is consistent and unbiased, and the improvement over the three state-of-the-art systems is significant under a one-tailed t-test (p-value < 0.05).
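The note-matching rule of this section can be sketched as follows; the one-to-one matching policy (each ground-truth note matched at most once) is an assumption not spelled out above.

```python
def note_based_counts(est_notes, ref_notes, tol=0.1):
    """Count correct notes for the note-based evaluation of Section 3.3.

    est_notes and ref_notes are lists of (midi_number, onset_seconds); a note
    is correct if its MIDI number matches a ground-truth note whose onset
    lies within 100 ms. Offsets are ignored.
    """
    matched = [False] * len(ref_notes)
    n_tp = 0
    for midi, onset in est_notes:
        for j, (ref_midi, ref_onset) in enumerate(ref_notes):
            if not matched[j] and midi == ref_midi and abs(onset - ref_onset) <= tol:
                matched[j] = True
                n_tp += 1
                break
    n_fp = len(est_notes) - n_tp   # system outputs with no matching note
    n_fn = len(ref_notes) - n_tp   # ground-truth notes that were missed
    return n_tp, n_fp, n_fn
```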
4. CONCLUSION

We have presented an automatic transcription system for piano music. The system first performs a data pre-processing step for volume normalization, tuning factor estimation, and database tuning. Then, the system decomposes the audio content of the input music file into short-time frames. After the selection of note candidates based on two types of harmonic series, the sparse representation of each frame is computed by l1-regularized minimization, followed by temporal smoothing of the frame-level results based on hidden Markov models. Evaluation results show that the approach is viable.

The system can be improved in three aspects. First, we estimate only one tuning factor for all piano keys in the current system, but in fact each key has its own tuning. Estimating separate tuning factors for the treble, middle, and bass sections of the piano may help improve the accuracy of the estimation. Second, in addition to learning the intra-note relationship, we should learn the inter-note relationship to gain insight into the temporal structure of music. Last but not least, onset detection can be incorporated into the system to improve performance, for which the sparse representation coefficients serve as a valuable clue.

ACKNOWLEDGEMENT

The authors would like to thank A. Klapuri and C. Yeh for generously providing source code and evaluation results for comparison.

REFERENCES

[1] M. Marolt, “A connectionist approach to automatic transcription of polyphonic piano music,” IEEE Trans. Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[2] J. Bello, L. Daudet, and M. Sandler, “Automatic piano transcription using frequency and time-domain information,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 2242–2251, Nov. 2006.
[3] G. Poliner and D. Ellis, “A discriminative model for polyphonic piano transcription,” EURASIP J. Advances Signal Process., vol. 8, pp. 1–9, 2007.
[4] G. Peeters, “Music pitch representation by periodicity measures based on combined temporal and spectral representations,” in Proc. Int. Conf. Acoust., Speech, Signal Process., Toulouse, France, pp. 53–56, May 2006.
[5] A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Amer., vol. 111, pp. 1917–1930, 2002.
[6] J. A. Moorer, “On the transcription of musical sound by computer,” Comput. Music J., vol. 1, no. 4, pp. 32–38, 1977.
[7] A. Klapuri, “Multiple fundamental frequency estimation by summing harmonic amplitudes,” in Proc. Int. Conf. Music Inform. Retrieval, Victoria, Canada, pp. 216–221, Oct. 2006.
[8] Z. Duan, B. Pardo, and C. Zhang, “Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 8, pp. 2121–2133, Nov. 2010.
[9] C. Yeh, N. Bogaards, and A. Roebel, “Synthesized polyphonic music database with verifiable ground truth for multiple-F0 estimation,” in Proc. Int. Conf. Music Inform. Retrieval, Vienna, Austria, pp. 393–398, 2007.
[10] S. A. Abdallah and M. D. Plumbley, “Polyphonic transcription by non-negative sparse coding of power spectra,” in Proc. Int. Conf. Music Inform. Retrieval, Barcelona, Spain, pp. 318–325, Oct. 2004.
[11] A. Cont, “Realtime multiple pitch observation using sparse non-negative constraints,” in Proc. Int. Conf. Music Inform. Retrieval, Victoria, Canada, pp. 206–212, Oct. 2006.
[12] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[13] IBM Corporation and Microsoft Corporation, “Multimedia Programming Interface and Data Specifications 1.0,” Aug. 1991.
[14] MIDI Manufacturers Association, “Complete MIDI 1.0 Detailed Specification,” Nov. 2001.
[15] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, “Unsupervised single-channel music source separation by average harmonic structure modeling,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 4, pp. 766–778, May 2008.
[16] C. Yeh, A. Roebel, and X. Rodet, “Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1116–1126, Aug. 2010.
[17] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Commun. Pure Appl. Math., vol. 59, no. 6, pp. 797–829, 2006.
[18] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[19] S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, “An interior-point method for large-scale l1-regularized least squares,” IEEE J. Sel. Topics Signal Process., vol. 1, pp. 606–617, 2007.
[20] C. Yeh and A. Roebel, “Multiple-F0 estimation for MIREX 2010,” Music Information Retrieval Evaluation eXchange, 2010. [Online]. Available: http://www.music-ir.org/mirex/abstracts/2010/AR1.pdf
