
AUTOMATIC REAL-TIME ELECTRIC GUITAR AUDIO TRANSCRIPTION

Xander Fiss, Andres Kwasinski

Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY
ABSTRACT

Guitar audio transcription is the process of generating a human-interpretable musical score from guitar audio. The musical score is presented as guitar tablature, which indicates what notes are played and where they are played on the guitar fretboard. Automatic transcription remains a challenge for polyphonic sounds such as those generated by a guitar. The guitar adds further ambiguity to the transcription problem because the same note can be played in many ways. On the other hand, the guitar offers the potential to constrain the polyphonic pitch detection problem because it has few strings and the fingering is usually performed by a single hand. In this paper, a real-time guitar transcription scheme is presented. Accuracy is improved by considering physical limitations of the guitar and the human player.

Index Terms: Music Transcription, Guitar, Tablature, Real Time Systems

1. INTRODUCTION

Learning to play an instrument can be a challenging endeavor. The development of a musician's intuition for the cause-effect relationship inherent in producing sound is essential to mastering the instrument. In addition, learners without prior musical education may find the concepts and theory of musical key, scales, and notation difficult to grasp and apply to their studies. Software that transcribes audio into a musical score in real time is a potential solution to this problem. Nevertheless, the development of such an application poses technical challenges.

The guitar is among a class of instruments that can produce multiple notes at the same time (chords). The notes can also be generated in different ways by playing different strings with different positioning of the fingers. Identifying the notes being played, a task known as polyphonic pitch detection, is a challenging problem. Most research in the area of pitch detection has focused on single pitch tracking for speech recognition and compression. Polyphonic pitch detection has received comparatively little attention [1, 2].
Detecting multiple fundamental frequencies is difficult because it may not be clear whether a peak in frequency is a fundamental, a harmonic, or both. This challenge is exacerbated in musical signals, where notes in a chord are chosen specifically because of the harmony caused by harmonic overlap. Klapuri has done significant work in the field of polyphonic pitch detection [3], presenting a variety of techniques and insights into the problem.

Transcription for an instrument such as the piano ends with polyphonic pitch detection, since each note can only be played with a single key. The guitar, however, differs in that the same note can be played in up to five different places on the fretboard. Guitar audio transcription into tablature identifies not only the notes but also how those notes have been played.

A variety of methods have been used to attempt guitar audio transcription. Klapuri provides an excellent reference on the state of the art in automatic transcription [1]. Computer vision can be used to assist in transcribing guitar audio by determining where the hand is on the fretboard [4], although in that work performance could be further improved by adding polyphonic pitch detection.

In this paper, we investigate an implementation of guitar audio transcription. Because the implementation is intended for interactive applications (e.g. to provide feedback to a student), we require it to report results in real time. Similarly to the approach in [5], the presented scheme iteratively processes frequency peaks; however, instead of removing harmonics, our method marks fundamental frequencies. We demonstrate that by analyzing the instrument and the way it is played, polyphonic pitch detection accuracy has the potential to improve. We take a polyphonic pitch detection algorithm that produces many results and score the results first using the pitch detection algorithm alone, and then also considering the special properties of the guitar. The focus of the algorithm is on rich polyphonies of 4, 5, or 6 simultaneous notes. The correct result ranks top in 33% of cases. In the remaining cases, where multiple solutions practically coincide, the correct solution is ranked among the top.

Fig. 1. Guitar Transcription System.

2. SYSTEM FOR REAL-TIME GUITAR TRANSCRIPTION

Figure 1 presents a block diagram of our proposed system for real-time guitar transcription. The whole system is implemented in software that performs guitar audio transcription in real time.
To maximize portability, extensibility, and performance, the software was written in Java. The software works on all major desktop platforms and the algorithms are not processor-bound, meaning that a responsive user experience is achievable on any number of configurations.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011

The modern guitar usually has six strings, which are plucked or strummed using fingers or a pick. The standard tuning for a 6-string guitar, from low to high frequency (and correspondingly, thick to thin strings), is E2, A2, D3, G3, B3, and E4. These strings run along a fretboard, enabling the guitarist to play a pitch accurately and consistently.

The vibrations of the strings are captured by the electric guitar's pickup electronics. Almost all electric guitars have pickups that superimpose the vibrations into one signal. Some guitars come with pickups that produce individual audio signals for each string, effectively sidestepping the challenge presented in this paper; however, both the pickup and the corresponding system for recording a hexaphonic signal tend to be expensive and difficult to obtain. While the signal that comes from an electric guitar's pickup electronics is affected by the construction of the pickup and the quality of the electric components in the guitar, it is sufficiently clear for the purpose of this work. Also, this setup is more robust against outside noise when using a direct feed from the guitar as opposed to a microphone.

For a guitar signal, a very high sampling frequency fs is not necessary. The highest fundamental frequency a 22 fret guitar can produce is a D6 note at 1174.66 Hz. We chose fs = 8000 Hz, and therefore a Nyquist frequency of 4000 Hz, as it is sufficient to sample up to the 3rd harmonic of D6 and the 4th harmonic of 96% of all possible notes on a 22 fret guitar without aliasing. We judged this choice of fs to be a good compromise, as the lowest possible fs reduces computation time and memory usage by reducing the number of samples in a time window, and hence the size of the signal processing algorithms.

A key operation in guitar transcription is polyphonic pitch detection. Because even state-of-the-art polyphonic pitch detection algorithms are considerably inaccurate, we chose an algorithm that produces many possible results. These results are given scores.
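The sampling-rate argument earlier in this section can be checked numerically. The sketch below is written in Python for brevity (it is not the Java software described here) and assumes equal temperament with A4 = 440 Hz and MIDI note numbering for the open strings; the function and constant names are illustrative only. It counts how many harmonics of a fundamental fit strictly below the Nyquist frequency.

```python
# Open-string MIDI note numbers for standard tuning E2 A2 D3 G3 B3 E4
# (MIDI numbering is an assumption made for this illustration).
OPEN_STRINGS_MIDI = [40, 45, 50, 55, 59, 64]
NUM_FRETS = 22          # 22-fret guitar, as in the text
SAMPLE_RATE_HZ = 8000   # fs chosen in the text


def midi_to_hz(midi_note):
    """Equal-tempered frequency, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def harmonics_below_nyquist(fundamental_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of harmonics (fundamental included) strictly below fs / 2."""
    nyquist = sample_rate_hz / 2.0
    count = int(nyquist // fundamental_hz)
    if count * fundamental_hz == nyquist:
        count -= 1   # a harmonic exactly at Nyquist is not counted
    return count


highest_midi = OPEN_STRINGS_MIDI[-1] + NUM_FRETS   # D6 on the high E string
lowest_midi = OPEN_STRINGS_MIDI[0]                 # open low E string (E2)

print(round(midi_to_hz(highest_midi), 2))                  # highest fundamental in Hz
print(harmonics_below_nyquist(midi_to_hz(highest_midi)))   # harmonics of D6 kept
print(harmonics_below_nyquist(midi_to_hz(lowest_midi)))    # harmonics of E2 kept
```

For D6 this reports three harmonics below 4000 Hz, in line with the statement above. The share of notes whose 4th harmonic survives depends on whether distinct pitches or fretboard positions are counted, so it is not reproduced here.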
The scoring can be performed by looking only at properties of the signal, but our approach improves upon generic polyphonic pitch detection results by applying knowledge of the guitar to the scoring of results. By considering limitations of the sound produced by the guitar, the scoring can be improved. One such limitation is that a guitar can only produce six unique notes at the same time because it only has six strings.

Humans introduce other limitations. The strings are pressed into the fretboard by a single hand. There is a limit to the distance one can stretch one's hand, effectively limiting the spread between the played frets. Also, in most scenarios, the thumb is used to squeeze the neck of the guitar and provide the necessary support to press the strings into the fretboard; it is not used to play a note. Many talented guitarists use their thumbs to play the low E string; however, for the purpose of this research we do not consider this special case. Thus, the maximum number of unique frets in a chord is four.

Polyphonic pitch detection starts by identifying peaks in the signal spectrum, which identifies fundamental frequencies and associated harmonics. Although guitar signals attenuate over time, for short time windows they can be approximated as sinusoids. The Short-Time Fourier Transform (STFT) therefore provides a good match for extracting frequency information. Computation of the STFT is accomplished by taking the DFT of partially overlapping Hann-windowed signal segments. The Hann window is used because it emphasizes peaks while suppressing spectral leakage; this property will be key in performing peak detection. The amount of overlap does not affect the performance of the algorithm when looking at a single window; it simply helps determine the spacing between results in time.

An important consideration is the tradeoff between frequency resolution and latency (window size) when using the STFT. For a total number of samples (the size of the STFT) $N$ and sampling frequency $f_s$, this tradeoff is captured in the expressions for the frequency resolution $\Delta f$ and the duration of a windowed block of data $t_{\mathrm{window}}$:

$$\Delta f = \frac{f_s}{N}, \qquad t_{\mathrm{window}} = \frac{N}{f_s},$$

where by frequency resolution we mean the separation between the centers of two adjacent Hann window main lobes. In this work, the STFT size is 2048 and the sampling frequency is 8000 Hz. Therefore, $\Delta f = 3.91$ Hz and $t_{\mathrm{window}} = 0.256$ seconds.

For real-time applications, such as this work, both frequency resolution and window size are essential to performance. Frequency resolution is necessary to properly identify notes. On the other hand, too large a sampling window means a significant delay (latency) between when a user plays a note and when it can be processed and presented back to the user. With $\Delta f = 3.91$ Hz, one might wonder whether this provides sufficient frequency resolution to discriminate between very low frequency notes, which are more closely clustered together than high frequency notes. This issue is addressed in the peak detection and classification part of the algorithm.

Peak detection in a noisy signal is a challenging problem. A human may be able to look at a noisy signal and point out the maxima and minima, but simple methods such as zero-derivative detection will not work due to the fluctuations in the signal. However, by setting a minimum change in amplitude $\delta$ for maxima and minima detection, the noise in the signal is ignored [6]. The value of $\delta$ is determined by making a preliminary sweep with $\delta = 0.01$. For the true peak detection pass, $\delta$ is defined as the average of the squared magnitudes of the peaks from the first pass divided by an empirically determined factor of 3.2. When $\delta$ is too low, too many peaks are allowed through, and when $\delta$ is too high, not enough are allowed through. This value was incremented until a balance was found between these extremes.

Due to the sampled nature of the STFT, each peak found using the above method is accurate to within half a frequency bin $\Delta f$. Quadratic curve fitting is used to improve peak location estimation [7]. This is assisted by interpolation provided by zero-padding the FFT. The peak location correction algorithm is as follows:


1. The magnitudes of the three highest contiguous points surrounding (and including) the peak are taken as $\alpha$, $\beta$, and $\gamma$. If the peak is at point $x$, then the three points can be $\{x-2, x-1, x\}$, $\{x-1, x, x+1\}$, or $\{x, x+1, x+2\}$.
2. The parabola peak location $p$ is found: $$p = \frac{1}{2}\,\frac{\alpha - \gamma}{\alpha - 2\beta + \gamma}$$
3. The peak location $p$ is relative to the original peak location $x$. The true peak location is estimated: $$\hat{x} = x + p$$
4. The magnitude at the new peak is estimated: $$y(p) = \beta - \frac{1}{4}(\alpha - \gamma)\,p$$

A peak in frequency has the potential to represent a combination of several fundamentals and harmonics. In Note-Octave notation, harmonics can be indicated with the use of exponents, e.g. E2² = 164.8 Hz. Using this notation, however, leads to multiple possible names for the same frequency peak: is the peak E2² or E3¹? In our algorithm, we are primarily concerned with determining the near-integer multiple of the lowest frequency occurrence of the note on the guitar. To this end, we assume the octave to be the octave of the lowest frequency occurrence of the note on the guitar and omit it, using only the exponent to describe the multiple. We call this notation Note-Multiple. While an 82.4 Hz E note would be E2 in Note-Octave notation, it is E¹ in Note-Multiple because 82.4 Hz is the lowest frequency E on the guitar. E² refers to 164.8 Hz, etc.

The frequency of notes is on a logarithmic scale: notes are closer together lower in frequency and more spread apart higher in frequency. The STFT, on the other hand, has uniform resolution. It is therefore possible to have one peak in frequency represent multiple notes and harmonics. The notes E¹ and F¹ are separated by only 4.9 Hz. Both of these notes are played on the same string, so it is not immediately clear which one is being played. It is also possible to have two notes play together but only appear as a single peak, such as A¹ on the low E string and A♯¹ on the A string, which are separated by only 6.5 Hz.
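The peak picking and peak location correction described above can be sketched as follows. This is a minimal Python illustration, not the Java implementation: `detect_peaks` follows the general idea of the minimum-change threshold δ from [6], `adaptive_peaks` applies the two-pass choice of δ literally as stated in the text, and `refine_peak` applies steps 2 to 4 to the centered triple {x-1, x, x+1} only, which is a simplification of step 1. All function names and the toy spectrum are assumptions made for the example.

```python
import math

SAMPLE_RATE_HZ = 8000   # fs from the text
FFT_SIZE = 2048         # STFT size from the text


def detect_peaks(mags, delta):
    """Indices of maxima followed by a drop of more than `delta`."""
    peaks = []
    max_val, max_idx = -math.inf, -1
    min_val = math.inf
    looking_for_max = True
    for i, v in enumerate(mags):
        if v > max_val:
            max_val, max_idx = v, i
        if v < min_val:
            min_val = v
        if looking_for_max:
            if v < max_val - delta:      # dropped far enough: accept the maximum
                peaks.append(max_idx)
                min_val = v
                looking_for_max = False
        elif v > min_val + delta:        # rose far enough: start a new search
            max_val, max_idx = v, i
            looking_for_max = True
    return peaks


def adaptive_peaks(mags, first_delta=0.01, factor=3.2):
    """Two passes: a preliminary sweep, then delta = mean(peak^2) / factor."""
    first_pass = detect_peaks(mags, first_delta)
    if not first_pass:
        return []
    delta = sum(mags[i] ** 2 for i in first_pass) / len(first_pass) / factor
    return detect_peaks(mags, delta)


def refine_peak(mags, k):
    """Parabolic interpolation around bin k (requires 0 < k < len(mags) - 1)."""
    alpha, beta, gamma = mags[k - 1], mags[k], mags[k + 1]
    denom = alpha - 2.0 * beta + gamma
    if denom == 0.0:                     # flat triple: nothing to refine
        return float(k), beta
    p = 0.5 * (alpha - gamma) / denom
    return k + p, beta - 0.25 * (alpha - gamma) * p


def bin_to_hz(bin_position):
    """Convert a (possibly fractional) bin position to Hz using fs / N."""
    return bin_position * SAMPLE_RATE_HZ / FFT_SIZE


# Toy normalized spectrum: one parabolic lobe whose true maximum lies at bin 5.3.
spectrum = [max(0.0, 1.0 - 0.1 * (k - 5.3) ** 2) for k in range(12)]
for k in adaptive_peaks(spectrum):
    if 0 < k < len(spectrum) - 1:
        position, magnitude = refine_peak(spectrum, k)
        print(round(position, 3), round(magnitude, 3), round(bin_to_hz(position), 2))
```

Because the second-pass threshold is built from squared magnitudes, its behavior depends on how the spectrum is scaled; the toy spectrum is normalized to a maximum near 1 for that reason.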
As it turns out, a lack of frequency resolution does not present a problem. A note played on the guitar will not appear as a single peak in frequency, but as a set of peaks: the fundamental and a set of supporting harmonics. A peak is analyzed to see which Note-Multiples it can represent. If the Note-Multiple is a power of two, then it is possibly a fundamental frequency. If it is a fundamental frequency, it will have supporting harmonics. Without supporting harmonics, a power-of-two Note-Multiple can only be a harmonic of a lower-frequency fundamental. Consider, for example, a peak which by itself could represent either E¹ or F¹. Both of these notes can only be produced on the low E string of the guitar, so only one of these notes can be produced at a time. It follows that only one will have supporting harmonics.

This leads to a straightforward way to separate definite fundamentals from a list of potential fundamentals, as a definite fundamental has to meet the following three conditions:

1. It must be a power-of-two multiple.
2. It must have supporting harmonics.
3. It cannot be explained as a harmonic of a previously established definite fundamental.

The algorithm to perform this separation is then as follows: with the list of possible fundamentals sorted from low to high frequency, a fundamental that is not a supporting harmonic of any definite fundamental is moved to the list of definite fundamentals. This process iterates until all potential fundamentals have been processed.

The list of definite fundamentals defines a starting point for producing results from the polyphonic pitch detection algorithm. Each result must include all definite fundamentals and a combination of the potential fundamentals, with six or fewer total fundamentals (because a guitar can only produce six unique notes at the same time). There can be multiple different ways to play (fingerings) each set of fundamentals. Some results can be discarded immediately because no valid fingering exists (e.g. they may require impossibly large hands). The maximum number of fundamentals depends on the lowest frequency fundamental: if this fundamental is played on the low E string, there are five remaining strings for notes to be played on. If it is played on the A string, then there are only four remaining strings, etc. The end results are pairs of fingerings tied to sets of fundamentals. These are referred to as guitar states.

Guitar states are scored using three metrics that are combined using a weighted average. The first two metrics rely only on the signal properties and do not consider the fingering (how fingers may be positioned). These analyze each fundamental to determine the confidence of its existence.
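The separation of definite from potential fundamentals described earlier in this section can be illustrated with a short Python sketch. It covers only the second and third conditions: the input peaks are assumed to have already passed the power-of-two Note-Multiple test. The harmonic matching rule (nearest integer multiple within a tolerance of 4 Hz, roughly one frequency bin) and all names are assumptions for the example, not details taken from the implementation.

```python
def is_harmonic_of(freq_hz, fundamental_hz, tol_hz=4.0):
    """True if freq_hz lies within tol_hz of n * fundamental_hz for some n >= 2."""
    n = round(freq_hz / fundamental_hz)
    return n >= 2 and abs(freq_hz - n * fundamental_hz) <= tol_hz


def count_support(fundamental_hz, peaks_hz, tol_hz=4.0):
    """Number of peaks that can act as supporting harmonics of the fundamental."""
    return sum(1 for f in peaks_hz if is_harmonic_of(f, fundamental_hz, tol_hz))


def split_fundamentals(peaks_hz, tol_hz=4.0):
    """Return (definite, potential) fundamentals, processing low to high."""
    definite, potential = [], []
    for f in sorted(peaks_hz):
        if count_support(f, peaks_hz, tol_hz) == 0:
            continue                     # no supporting harmonics: not a fundamental
        if any(is_harmonic_of(f, d, tol_hz) for d in definite):
            potential.append(f)          # explainable by an established fundamental
        else:
            definite.append(f)
    return definite, potential


# Hypothetical peak list for A2 (110 Hz) and E3 (164.81 Hz) sounding together;
# the peak near 329.9 Hz is shared by the 3rd harmonic of A2 and the 2nd of E3.
peaks = [110.0, 164.81, 220.0, 329.9, 440.0, 494.4]
definite, potential = split_fundamentals(peaks)
print(definite)
print(potential)
```

In this example the 220 Hz peak has its own supporting harmonic at 440 Hz, but it can also be explained as the second harmonic of 110 Hz, so it remains a potential rather than a definite fundamental.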
The first metric, called Support, checks the number of supporting harmonics for a fundamental, as it is reasonable to assume that proposed fundamentals with more supporting harmonics have a higher probability of being fundamentals than those with fewer or no supporting harmonics. The specific values were determined by analyzing the frequency of overlapping harmonics among Note-Multiples. A table was created with a list of Note-Multiples produced by the guitar. Those Note-Multiples that could potentially overlap were noted. Fundamentals with only one supporting harmonic are more difficult to distinguish because they can be caused completely by one note. These cases are assigned the relatively low confidence value of 0.50. Confidence is increased to 0.90 for 3 supporting harmonics, and is set to 1.0 for 4 or more supporting harmonics.

Support      0     1     2     3     4+
Confidence   0.00  0.50  0.75  0.90  1.00

The second metric is called Power. It is defined as the sum of the magnitudes of the frequency peaks corresponding to the first harmonic (fundamental) and the second harmonic. This introduces robustness against destructive interference of overlapping harmonics. $\overline{\mathrm{power}}$ is defined as the mean of the powers of the definite fundamentals. It is reasonable to assume that proposed fundamentals with power higher than $\overline{\mathrm{power}}$ are more likely to be fundamentals, whereas those with lower power are less likely. Arctangent is used as a straightforward, continuous way to determine confidence values. It has the property of being nearly linear near 0, while becoming asymptotic farther away.

$$C_{\mathrm{power}} = 2\tan^{-1}\left(\mathrm{power} - \overline{\mathrm{power}}\right)/2$$

The final metric, called Spread, is fingering confidence. It was previously mentioned that only four fingers are available to press down on the fretboard; results that require more than four fingers are discarded. Confidence values are assigned to fingerings based on the fret spread. This reflects how much a person must stretch their hand to form the chord. Chords with spread 0, 1, and 2 describe almost all chords. Chords with spread 6 or higher are considered biomechanically unfeasible.

Spread       0    1    2    3    4    5    6+
Confidence   1.0  1.0  1.0  0.7  0.4  0.2  0.0

3. RESULTS

As single notes can be accurately tracked using established pitch detection techniques, the system tuning and testing focused on chords, that is, several notes being played at the same time. We chose 18 common chords with emphasis on covering different fingering shapes. The chords have 4, 5, or 6 simultaneous notes.

To create an overall score, the two signal-only metrics (Support and Power) are first combined to obtain a score for pitch detection only. Then the fingering confidence is combined. To determine weighting values, weights were incremented from 0 to 1 in steps of 0.05. Then, scoring and relative rank were analyzed to determine a good weighting value. This was performed first for the signal-only metrics. A support-to-power weighting of 0.35:0.65 was chosen. This was repeated for the weighting between signal and fingering metrics.
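As an illustration of how the three confidence values and the weights reported in this section (0.35:0.65 support-to-power and 0.3:0.7 signal-to-fingering) could be combined, consider the Python sketch below. Several choices in it are assumptions rather than details from the implementation: the arctangent is normalized so that the Power confidence lies in (0, 1), per-fundamental confidences are averaged over the guitar state, open strings (fret 0) are taken to need no finger, and the spread is the distance between the lowest and highest fretted positions. All names are illustrative.

```python
import math

SUPPORT_TABLE = [0.00, 0.50, 0.75, 0.90, 1.00]        # index: supporting harmonics (4+ saturates)
SPREAD_TABLE = [1.0, 1.0, 1.0, 0.7, 0.4, 0.2, 0.0]    # index: fret spread (6+ saturates)
MAX_UNIQUE_FRETS = 4                                  # four fingers, thumb excluded


def support_confidence(num_supporting):
    """Look up the Support confidence from the table in the text."""
    return SUPPORT_TABLE[min(num_supporting, 4)]


def power_confidence(power, mean_power):
    """ASSUMPTION: arctangent rescaled to (0, 1), equal to 0.5 at the mean power."""
    return 0.5 + math.atan(power - mean_power) / math.pi


def spread_confidence(frets):
    """Look up the Spread confidence; `frets` lists the fret on each played string."""
    fretted = [f for f in frets if f > 0]             # ASSUMPTION: open strings are free
    if not fretted:
        return SPREAD_TABLE[0]
    if len(set(fretted)) > MAX_UNIQUE_FRETS:
        return 0.0                                    # needs more than four fingers
    return SPREAD_TABLE[min(max(fretted) - min(fretted), 6)]


def state_score(support_counts, powers, mean_power, frets,
                w_support=0.35, w_signal=0.3):
    """Weighted average of signal (Support, Power) and fingering (Spread) confidence."""
    n = len(support_counts)
    support = sum(support_confidence(s) for s in support_counts) / n
    power = sum(power_confidence(p, mean_power) for p in powers) / n
    signal = w_support * support + (1.0 - w_support) * power
    return w_signal * signal + (1.0 - w_signal) * spread_confidence(frets)


# Hypothetical guitar state: two fundamentals on an open string and fret 2.
print(round(state_score([4, 4], [1.0, 1.0], 1.0, [0, 2]), 4))
```

Keeping the two weights as parameters mirrors the calibration procedure above, in which each weight was swept from 0 to 1 in steps of 0.05.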
A weighting of 0.3:0.7 (signal-to-fingering) was selected. These weighting selections were made because in calibration tests they gave the correct solution the highest ranking among other results.

Without any use of fingering in the scoring, the scores tend to be tightly grouped and gradually decrease from highest to lowest scores. Introducing scoring based on fingering lowers the score of incorrect answers, causing a distinct scoring drop after the strongest results are accounted for. Those scoring values are shown in Table 1. Note that the table does not reflect the result that 33% of the chords tested did appear as rank 1 results; as high frequency harmonics died off, the results no longer reflected what was played by the user. Also, rank is a metric that needs to be carefully interpreted due to the possibly many different valid fingerings available for a single set of fundamentals; in the context of polyphonic pitch detection, these all count as one result.

Table 1. Average Scores

Average Correct Result Score       0.873
Average Top Score                  0.902
Average Rank 20 Score              0.525
Average Rank for Correct Result    6.44

From the perspective of a user of software utilizing our guitar transcription scheme, the system works very well for automatic identification of chords. This is because, in general, the top results all belong to the same chord structure but may be missing a fundamental or have an extra one.

4. CONCLUSION

We have presented a scheme to perform automatic guitar transcription in real time. The challenge of the problem resides in the fact that the guitar can generate up to six notes simultaneously. Furthermore, the guitar adds ambiguity to the transcription problem because the same note can often be played in many ways. The presented scheme is based on combining polyphonic pitch detection, detection and classification of fundamental frequencies, and a scoring scheme that takes into consideration the physical limits of sound production in a guitar and human biomechanical constraints. Our scheme achieves perfect transcription in 33% of cases and ranks the correct solution among the top ones in the rest of the cases, where several possible solutions coincide. This opens the door for rapid transcription systems that allow the user to quickly correct transcription mistakes by providing several answers.

5. REFERENCES

[1] A. Klapuri, "Automatic music transcription as we know it today," Journal of New Music Research, 2004.
[2] A. Pertusa and J. M. Iñesta, "Multiple fundamental frequency estimation using Gaussian smoothness," in 15th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 105-108, 2008.
[3] A. Klapuri, Signal Processing Methods for the Automatic Transcription of Music, Ph.D. thesis, Tampere, Finland, 2004.
[4] M. Paleari, B. Huet, A. Schutz, and D. Slock, "A multimodal approach to music transcription," in 15th IEEE International Conference on Image Processing (ICIP), pp. 93-96, Oct. 2008.
[5] A. Klapuri, "Multipitch analysis of polyphonic music and speech signals using an auditory model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 255-266, Feb. 2008.
[6] E. Billauer, "PeakDet: Peak Detection Algorithm," online, available at http://www.billauer.co.il/peakdet.html.
[7] J. Smith and X. Serra, "Parshl: an analysis/synthesis program for non-harmonic sounds based on a sinusoidal representation," in International Computer Music Conference, 1987.

