
ISMIR 2008 – Session 1a – Harmony



Kouhei Sumi,† Katsutoshi Itoyama,† Kazuyoshi Yoshii,‡
Kazunori Komatani,† Tetsuya Ogata,† and Hiroshi G. Okuno†

† Dept. of Intelligence Science and Technology,
Graduate School of Informatics, Kyoto University
Sakyo-ku, Kyoto 606-8501 Japan
{ksumi, itoyama, komatani, ogata, okuno}@kuis.kyoto-u.ac.jp

‡ National Institute of Advanced Industrial Science and Technology (AIST)
Tsukuba, Ibaraki 305-8568 Japan
k.yoshii@aist.go.jp


ABSTRACT

This paper presents a method that identifies musical chords in polyphonic musical signals. As musical chords mainly represent the harmony of music and are related to other musical elements such as melody and rhythm, the performance of chord recognition should improve if this interrelationship is taken into consideration. Nevertheless, this interrelationship has not been utilized in the literature, as far as the authors are aware. In this paper, bass lines are utilized as clues for improving chord recognition because they can be regarded as an element of the melody. A probabilistic framework is devised to uniformly integrate bass lines, extracted by bass pitch estimation, into hypothesis-search-based chord recognition. To prune the hypothesis space of the search, the hypothesis reliability is defined as the weighted sum of three reliabilities: the likelihood of Gaussian Mixture Models for the observed features, the joint probability of chord and bass pitch, and the chord transition N-gram probability. Experimental results show that our method recognized the chord sequences of 150 songs in twelve Beatles albums; the average frame-rate accuracy of the results was 73.4%.

Keywords: chord recognition, bass line, hypothesis search, probabilistic integration

1 INTRODUCTION

In recent years, automatic recognition of musical elements such as melody, harmony, and rhythm (Figure 1) from polyphonic musical signals has become a subject of great interest. The spread of high-capacity portable digital audio players and online music distribution has allowed a diverse user base to store a large number of musical pieces on these players. Information on the content of musical pieces, such as their musical structure, mood, and genre, can be used together with text-based information to make music information retrieval (MIR) more efficient and effective. Manual annotation requires an immense amount of effort, and maintaining a consistent level of quality is not easy. Thus, techniques for extracting musical elements are essential for obtaining content-based information from musical signals.

[Figure 1. Musical elements: melody (bass lines), harmony (chord), and rhythm (beat)]

A key principle in analyzing musical signals is that musical elements are related to each other. Because composers exploit the interrelationship among musical elements, this interrelationship should be considered when analyzing the elements as well. Nevertheless, most studies in the literature have dealt with these elements independently.

This paper exploits the interrelationship between chord sequences and bass lines to improve the performance of chord recognition. The chord sequence is regarded as an element of the harmony, while bass lines are regarded as an element of the melody. The chord sequence consists of a chord symbol sequence and a chord boundary sequence. As the chord sequence may represent the mood of music, it can be used to calculate the similarity in mood between musical pieces. This similarity is important in MIR and music recommendation. The bass line, on the other hand, represents a melody in the bass register; thus, it leads the chord progression.

An approach recently adopted by many researchers for automated description of the chord sequence is the use of Hidden Markov Models (HMMs). Several methods have been suggested to explore the analogy between speech recognition and chord recognition and to consider the temporal connection of chords [1, 2, 3]. Sheh et al. [1] proposed a method that uses the extended Pitch Class Profile (PCP) [4] as a feature vector. They used an HMM that had one state per chord with a large set of classes (147 chord types). However, they were not able to obtain good enough results for recognition. Bello et al. [2] used chroma features and an HMM; they improved accuracy by incorporating musical


knowledge into the model. Lee et al. [3] built key-specific models for automatic chord transcription. They used a 6-dimensional feature vector, called the Tonal Centroid, that is based on the Tonnetz [5]. Higher accuracies were obtained by limiting the number of chord types that could be recognized.

Yoshioka et al. pointed out that chord symbols affect chord boundary recognition and vice versa [6]. They developed a method that concurrently recognizes chord symbols and boundaries by using a hypothesis search that recognizes the chord sequence and key.

While previous studies have treated only the features of chords, we focus on the interrelationship among musical elements and integrate information about bass lines into chord recognition in a probabilistic framework. The framework enables us to deal with multiple musical elements uniformly and to integrate information obtained from statistical analyses of real music.

This paper is organized as follows: Section 2 describes our motivation for developing an automatic chord recognition system, the issues involved in doing so, and our solution. Section 3 explains our method in concrete terms. Section 4 reports the experimental results and describes the effectiveness of our method. Our conclusions are discussed in Section 5.

2.1 Motivation

A bass line is a series of tonal linear events in the bass register. We focus on it specifically because it is strongly related to the chord sequence. Bass sounds are the most predominant tones in the low-frequency region, and bass lines have the following important properties:

• They are comprised of the bass register of the musical chords.
• They lead the chord sequence.
• They can be played with a single tone and have a pitch that is relatively easy to estimate.

We improve chord recognition by exploiting the above properties. Information about bass lines is obtained through bass pitch estimation.

To process multiple musical elements simultaneously, we use a probabilistic framework that enables us to deal with them uniformly because each evaluation value is scaled from 0 to 1. In addition, with statistical training based on probability theory, it is possible to apply information obtained from the analysis of real music to the recognition. Thus, the framework has both scalability and flexibility.

2.2 Issues

We use the hypothesis-search-based (HSB) method proposed by Yoshioka et al. to recognize chord symbols and chord boundaries simultaneously. We chose this method over the HMM-based one because it expressly solves the mutual dependency problem between chord symbols and chord boundaries. Furthermore, the HSB method makes it easier to probabilistically integrate various musical elements. That is, we are able to integrate bass pitch estimation into the chord recognition. In this paper, Yoshioka's HSB method is called the baseline method. However, two issues remain in using it to calculate the evaluation value of the hypothesis.

2.2.1 Usage of bass pitch estimation

Although information about bass sounds is used in the baseline method, it is not used in a probabilistic framework. When the estimated predominant single tone differs from the harmonic tones of the chord, penalties are imposed on the certainty based on bass sounds. Errors in estimating single tones thus tend to produce errors in chord recognition.

2.2.2 Non-probabilistic certainties

The certainties based on musical elements for the evaluation function are not probabilistic in the baseline method. The observation distribution of chroma vectors [7] is approximated with a single Gaussian, and the Mahalanobis generalized distance between acoustic features is used as the certainty. Another certainty, based on chord progression patterns, is defined as a penalty; the criterion for applying this penalty is related to progression patterns that appear several times. However, as the scales of these values are inconsistent, it becomes difficult to integrate multiple elements. Additionally, optimizing the weighting factors of each value takes a lot of time and effort.

2.3 Our Solution

To resolve the above issues, we use bass pitch probability (BPP) and define hypothesis reliability by using a probabilistic function.

2.3.1 Bass Pitch Probability

We utilize BPP as information about bass lines to reduce the effect of bass pitch estimation errors on chord recognition. BPP can be estimated using a method called PreFEst [8]. Because BPP is uniform in non-bass-sound frames, it does not significantly affect chord recognition in these frames.

2.3.2 Probabilistic design of hypothesis reliability

It is necessary to reformulate hypothesis reliability in order to use BPP in the reliability calculation. Like the baseline


method, we use acoustic features and chord progression patterns. However, we define the reliabilities based on these features by using a probabilistic framework. The values for the acoustic features are based on the likelihood of Gaussian Mixture Models (GMMs) for 12-dimensional chroma vectors. The values for the progression patterns are based on the transition probability obtained from statistical N-gram models. This reformulation enables three reliabilities to be integrated: those based on acoustic features, on BPP, and on transition probability.

3 SYSTEM

By applying information about BPP when calculating hypothesis reliability, we integrate the baseline method and bass pitch estimation in our chord recognition system. The three reliabilities based on the three elements are formulated probabilistically so that the system deals with them uniformly.

Figure 2 shows an overview of our automatic chord recognition system. First, the beat tracking method from [9] is used to detect the eighth-note-level beat times of an input musical piece. Second, the hypotheses are expanded over these beat times, and hypothesis reliability is calculated based on the three reliabilities. Third, a beam-search method using hypothesis reliability as the cost function is used to prune the expanded hypotheses. These operations are repeated until the end of the input signal. Finally, we obtain a tuple comprising the chord symbol sequence, chord boundary sequence, and key from the hypothesis that has the highest reliability.

[Figure 2. System Overview: beat tracking of the musical signals, followed by hypothesis reliability evaluation in each search interval based on the likelihood of 12-dimensional chroma vectors under GMMs, the joint probability of chord and bass pitch from the bass pitch probability, and the chord transition N-gram probability given the musical key, yielding the searched chord symbol sequence]

3.1 Specification of Chord Recognition System

We define automatic chord recognition as the process of automatically obtaining a chord symbol sequence, a chord boundary sequence, and a key. These elements are defined as follows:

• Chord symbol sequence: c = [c_1 · · · c_M], c_i ∈ C ≡ R × T. The system uses 48 classes (major, minor, diminished, and sus4 for each of the 12 roots). Triads, sevenths, and so on are included as subclasses of these larger classes, and we focus on discriminating the larger classes. For MIR applications, we believe the larger classes are sufficient to capture the characteristics or moods of the accompaniments of musical pieces.

• Chord boundary sequence: t = [t_0 · · · t_M], t_i ∈ N, where M is the number of chord symbols, t_i denotes the boundary time between c_i and c_{i+1}, t_0 denotes the beginning time of the input signal, and t_M denotes the end time of the input signal.

• Key: k, k ∈ K ≡ R × M.

Here R, M, and T are defined as follows:

  R ≡ {C, C#, · · · , B}, M ≡ {Major, Minor},
  T ≡ {Major, Minor, Diminished, Sus4}

We also assume that the tempo stays constant, that the beat is a common measure (four-four time), and that the key does not change.

3.2 Formulating Hypothesis Reliability

Denoting the observed feature sequence over the frames τ = (τ_s, τ_e) as X_τ, we can probabilistically define the hypothesis reliability Rel_τ as follows:

  Rel_τ = p(c_τ | X_τ)   (X_τ = [x_{τ_s} · · · x_{τ_e}])   (1)

The BPP β_f^τ over a duration τ is defined from the frame-by-frame BPP w^{(t)}(f) as follows:

  β_f^τ = Σ_{i=τ_s..τ_e} w^{(i)}(f) / α = {w^{(τ_s)}(f) + · · · + w^{(τ_e)}(f)} / α   (2)

where f denotes the frequency of the bass pitch and α is a normalization constant. The hypothesis reliability integrating BPP over the duration τ is defined as follows:

  Rel_τ = p(c_τ | X_τ) = Σ_f p(c_τ, β_f^τ | X_τ)   (3)

This hypothesis reliability is converted with Bayes' theorem into another form as follows:

  Σ_f p(c_τ, β_f^τ | X_τ) ∝ Σ_f p(X_τ | c_τ, β_f^τ) p(c_τ, β_f^τ)   (4)
                          = Σ_f p(X_τ | c_τ, β_f^τ) p(c_τ | β_f^τ) p(β_f^τ)   (5)

We use 12-dimensional chroma vectors as the observation features. Since the vectors depend only on the chord symbol, we set the following expression:

  p(X_τ | c_τ, β_f^τ) = p(X_τ | c_τ)   (6)


Thus, the hypothesis reliability over τ becomes as follows:

  Rel_τ = p(X_τ | c_τ) Σ_f p(c_τ | β_f^τ) p(β_f^τ)   (7)

where p(X_τ | c_τ) denotes the reliability based on acoustic features, and Σ_f p(c_τ | β_f^τ) p(β_f^τ) denotes the reliability based on BPP.

The key k is independent of the chord boundaries. With the conditional probability of a chord symbol sequence given a key, the overall hypothesis reliability Rel_all is defined as follows:

  Rel_all = p(c | k) Π_τ Rel_τ   (8)
          = p(c | k) Π_τ p(X_τ | c_τ) Σ_f p(c_τ | β_f^τ) p(β_f^τ)   (9)

where p(c | k) denotes the reliability based on the transition probability of chords.

3.3 Reliability based on Acoustic Features

We use 12-dimensional chroma vectors as acoustic features; these vectors approximately represent the intensities of the 12 semitone pitch classes. As chord symbols are identified by the variety of tones they contain, this representation is essential for chord recognition.

Because we focus on four chord types (major, minor, diminished, and sus4), we use four M-mixture GMMs (Maj-GMM, Min-GMM, Dim-GMM, and Sus4-GMM). The parameters of each GMM, λ_t, are trained on chroma vectors calculated at the frame level. Note that chords of different roots are normalized by rotating the chroma vectors. This normalization reduces the number of GMMs to four and effectively increases the number of training samples. The EM algorithm is used to determine the mean, covariance matrix, and weight of each Gaussian.

After the chroma vectors from the input signals are rotated to adapt to the 12 chords having different root tones, we calculate the log likelihood between them and the four GMMs. The likelihood is equal to p(X_τ | c_τ). Thus, the reliability divided by the number of frames in the hypothesis, g_c, is defined as follows:

  g_c = log p(X_τ | c_τ) = log p(X_r | λ_t)   (10)

where r denotes the index of the chroma vector rotation and t denotes the GMM type.

3.4 Reliability based on Bass Pitch Probability

BPP is obtained by bass pitch estimation, and it represents the degree of each pitch of the bass sounds. Since the bass lines determine the chord sequence, they should be analyzed simultaneously when recognizing the chord sequence.

[Figure 3. Bass-Pitch Processing: the frame-by-frame bass pitch probability of the musical signals is summed over the search interval and multiplied by the learned conditional probability of chords given the bass pitch, yielding the joint probability of chord and bass pitch]

Figure 3 shows an overview of the process for calculating the reliability based on BPP. The prior probability of bass sounds p(β_f^τ) is defined by using the BPP w^{(t)}(f) as follows:

  p(β_f^τ) = w^{(τ)}(f) = Σ_{j=τ_s..τ_e} w^{(j)}(f)   (11)

On the other hand, the conditional probability of chord c_i given bass pitch β_f^τ is obtained from real music by using the correct chord labels and the results of PreFEst for the particular duration. We statistically calculate the frequency of appearance of each bass pitch for each chord.

As the reliability based on BPP is the log joint probability of the chord and bass pitch, the reliability g_b is defined in terms of the BPP w^{(τ)}(f) and the conditional probability p(c_i | β_f^τ):

  g_b = log ( Σ_f p(c_i | β_f^τ) w^{(τ)}(f) )   (12)

3.5 Reliability based on Transition Probability

Music theory indicates that the genre and the artist usually determine the chord progression patterns for a given musical piece. The progression patterns are obtained from the key and scale degree. We probabilistically approximate the frequency with which a chord symbol appears, thus reducing the ambiguity of chord symbols.

We use two 2-gram models, one for major keys and one for minor keys, obtained from real music in advance. In the learning stage, we obtain the 2-gram probabilities from common chord symbol sequences consisting of the key and the correct chord labels. We calculate the 2-gram probability p(c_i | c_{i-1}) from the number of progression patterns and use smoothing to handle progression patterns that do not appear in the training samples.

We estimate the transition 2-gram probability on a log scale by using the hypothesis's key. The reliability g_p is defined as follows:

  g_p = log p(c_i | k) = log p(c_i | c_{i-1})   (13)
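A minimal sketch of the 2-gram reliability of Eq. (13) follows. The chord vocabulary and training progressions are hypothetical toy data, and the add-one (Laplace) smoothing is an illustrative choice, since the paper does not specify which smoothing scheme is used.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab):
    """Count chord bigrams from training sequences and return a
    smoothed log-probability function log p(cur | prev)."""
    bigrams = Counter()
    unigrams = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1

    def log_prob(prev, cur):
        # Add-one smoothing over the chord vocabulary keeps
        # unseen progression patterns at a finite log probability.
        num = bigrams[(prev, cur)] + 1
        den = unigrams[prev] + len(vocab)
        return math.log(num / den)

    return log_prob

# Toy chord vocabulary and training progressions (hypothetical).
vocab = ["C", "Dm", "Em", "F", "G", "Am"]
train = [["C", "F", "G", "C"], ["C", "Am", "F", "G", "C"]]

log_p = train_bigram(train, vocab)
g_p = log_p("G", "C")       # reliability g_p = log p(c_i | c_{i-1})
unseen = log_p("Em", "Dm")  # smoothed: finite even though unseen
```

In the paper's setting one such model would be trained per key mode (major and minor), with the chord labels normalized by the hypothesis's key before counting.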

3.6 Integrating the Three Reliabilities

Because the hypothesis reliability on a log scale is the weighted sum of the three log reliabilities described above, Rel is defined as follows:

  Rel = w_c × g_c + w_b × g_b + w_p × g_p   (14)

where w_c, w_b, and w_p are weight constants.

3.7 Updating the Hypothesis

3.7.1 Hypothesis Expansion

[Figure 4. Hypothesis Expansion: at each eighth-note-level beat time, 48 new hypotheses (one per chord symbol) are generated from each expanded hypothesis]

Figure 4 shows the hypothesis expansion process. We consider the minimum size of the boundary intervals to be the eighth-note-level beat times, the same as in the baseline method. For each unit between one and eight beats, 48 hypotheses are generated, one for each of the 48 chord symbols.

3.7.2 Hypothesis Pruning

We use a beam search to prune the expanded hypotheses; this prevents the number of hypotheses from growing exponentially. In the beam search, hypotheses are pruned using a beam width BS: after the pruning, only the BS hypotheses with higher reliability than the rest are expanded. Furthermore, by delaying the expansion of hypotheses until it is time to evaluate them, we can apply pruning to newly generated hypotheses. As a result, by decreasing wasteful expansions of hypotheses, we reduce both the computation and the memory required.

3.7.3 Update of Hypothesis Reliability

When a hypothesis is expanded, we need to update its reliability. To treat hypotheses with different numbers of intervals fairly, the reliability of a new hypothesis, Rel_new, is defined as follows:

  Rel_new = Rel_prev / N_h + Rel_next   (15)

where Rel_prev denotes the reliability of the previously expanded hypothesis, Rel_next denotes the hypothesis reliability of the newly expanded interval, and N_h denotes the number of previously expanded intervals.

4 EXPERIMENTAL RESULTS

We tested our system on one-minute excerpts from 150 songs in 12 Beatles albums (which contain a total of 180 songs); these excerpts have the properties described in Section 3.1. The songs were separated into 5 groups at random for 5-fold cross validation. For training the parameters of the GMMs, we used chroma vectors calculated from the audio signals of 120 of these songs and from audio signals of each chord played on a MIDI tone generator. These songs were also used to train the conditional probability of chords given the bass pitch (Section 3.4). We utilized 150 songs (137 in a major key, 13 in a minor key) as training data for the 2-gram models. As the correct chord labels, we used the ground-truth annotations of these Beatles albums made by C. A. Harte [10]. The implementation used for the experiments employed the parameters listed in Table 1.

Table 1. Parameter values
  BS = 25   M = 16
  w_c = 1.0   w_b = 2.0   w_p = 0.3

To evaluate the effectiveness of our method, we compared the frame-rate accuracies of the following five methods of computing the hypothesis reliability:

1. Baseline method
2. Using only acoustic features
3. Using acoustic features and BPP
4. Using acoustic features and transition probability
5. Our method (three elements)

Table 2. Results of 5-fold cross validation (frame-rate accuracy, %)
[1]: Baseline Method, [2]: Ac, [3]: Ac+Ba, [4]: Ac+Pt, [5]: Our Method

  Groups   [1]    [2]    [3]    [4]    [5]
  1st      65.1   62.9   71.7   68.3   78.3
  2nd      62.7   62.6   70.6   65.7   74.9
  3rd      57.7   60.4   66.8   61.2   69.5
  4th      61.0   61.3   70.2   64.0   72.7
  5th      61.6   60.9   69.2   64.8   71.4
  Total    61.6   61.6   69.7   64.8   73.4

The results are listed in Table 2. With our system, the average accuracy for the 150 songs was 73.4%. Compared with using only acoustic features, the method using both acoustic features and bass pitch probability improved the recognition rate by 8.1 points. Furthermore, the method using acoustic features, BPP, and transition probability improved the recognition rate by 11.8 points. In addition, our system's accuracy was higher than that of the baseline method. This is because the probabilistic integration enabled us to

utilize information about bass lines as a clue in chord recognition. Thus, the results prove both the importance of considering the interrelationship between the chord sequence and bass lines and the effectiveness of the probabilistic integration of these elements.

We compared our results with those obtained from other systems proposed by Bello [2] and Lee [3]. They used two Beatles albums (Please Please Me, Beatles for Sale) as the test data set; on these albums, our system had an average accuracy of 77.5%. Although both Bello's and Lee's systems used training data different from ours, their systems had 75.0% and 74.4% accuracy, respectively.

Upon investigating the accuracy distribution over all songs, we found that the accuracy histogram is polarized, with two peaks split at approximately 70%. Figure 5 shows two accuracy histograms, one for the songs where the key was estimated correctly and another for those where it was incorrect. As these histograms explain the polarization, it is clearly important to estimate the key correctly for chord recognition. To improve key estimation, we plan to develop a method that searches the hypotheses by using not only a forward search but also backtracking.

[Figure 5. Accuracy Histograms. (Left) Histogram for the songs with the correct key. (Right) Histogram for the songs with an incorrect key.]

5 CONCLUSION

We presented a chord recognition system that takes into account the interrelationship among musical elements. Specifically, we focus on bass lines and integrate hypothesis-search-based chord recognition and bass pitch estimation in a probabilistic framework. To evaluate hypotheses, our system calculates the hypothesis reliability, which is designed as the probabilistic integration of three reliabilities based on acoustic features, bass pitch probability, and chord transition probability. The experimental results showed that our system achieved a 73.4% frame-rate accuracy of chord recognition on 150 songs. They also showed an increase in accuracy when the three reliabilities were integrated, compared with the baseline method and a method using only acoustic features. This shows that to recognize musical elements (not only musical chords but also other elements), it is important to consider the interrelationship among musical elements and to integrate them probabilistically. To recognize chord sequences more effectively, we will design a way to integrate other musical elements such as rhythm.

6 ACKNOWLEDGEMENTS

This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), the Global Centers of Excellence (GCOE) Program, and the CrestMuse Project of the Japan Science and Technology Agency (JST). The authors thank Takuya Yoshioka for his valuable discussion and comments, C. A. Harte for providing us with the ground-truth annotations of the Beatles albums, and Graham Neubig for his careful reading of our English manuscript.

7 REFERENCES

[1] A. Sheh and D. P. W. Ellis, "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models," Proc. ISMIR, pp. 183-189, 2003.
[2] J. P. Bello and J. Pickens, "A Robust Mid-level Representation for Harmonic Content in Music Signals," Proc. ISMIR, pp. 304-311, 2005.
[3] K. Lee and M. Slaney, "A Unified System for Chord Transcription and Key Extraction Using Hidden Markov Models," Proc. ISMIR, pp. 245-250, 2007.
[4] T. Fujishima, "Realtime Chord Recognition of Musical Sound: A System Using Common Lisp Music," Proc. ICMC, pp. 464-467, 1999.
[5] C. A. Harte, M. B. Sandler, and M. Gasser, "Detecting Harmonic Change in Musical Audio," Proc. Audio and Music Computing for Multimedia Workshop, pp. 21-26, 2006.
[6] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H. G. Okuno, "Automatic Chord Transcription with Concurrent Recognition of Chord Symbols and Boundaries," Proc. ISMIR, pp. 100-105, 2004.
[7] M. Goto, "A Chorus-Section Detecting Method for Musical Audio Signals," Proc. ICASSP, V, pp. 437-440, 2003.
[8] M. Goto, "A Real-time Music-scene-description System: Predominant-F0 Estimation for Detecting Melody and Bass Lines in Real-world Audio Signals," Speech Communication, 43:4, pp. 311-329, 2004.
[9] M. Goto, "An Audio-based Real-time Beat Tracking System for Music With or Without Drum-sounds," Journal of New Music Research, 30:2, pp. 159-171, 2001.
[10] C. A. Harte, et al., "Symbolic Representation of Musical Chords: A Proposed Syntax for Text Annotations," Proc. ISMIR, pp. 66-71, 2005.