
Chapter 5 Indexing and Retrieval of Audio

5.1 INTRODUCTION

Audio is represented as a sequence of samples (except for structured representations such as MIDI) and is normally stored in a compressed form.

Human beings have an amazing ability to distinguish different types of audio. They can instantly tell:
o the type of audio (e.g., human voice, music, or noise)
o the speed (fast or slow)
o the mood (happy, sad, relaxing, etc.)
o its similarity to another piece of audio.

A computer sees a piece of audio as a sequence of sample values.
o The most common method of accessing audio pieces is therefore based on their titles or file names.
o It may be hard to find audio pieces satisfying the particular requirements of applications.
o It cannot support query by example.

Content-based audio retrieval techniques:
o Simplest: sample-to-sample comparison; this does not work well (different sampling rates, different numbers of bits per sample).
o Better: comparison based on a set of extracted audio features: average amplitude, frequency distribution, etc.

General approach to content-based audio indexing and retrieval:
o Audio is classified into common types of audio (speech, music, and noise).
o Different audio types are processed and indexed in different ways.

For example, if the audio type is speech, speech recognition is applied and the speech is indexed based on the recognized words.
o Query audio is classified, processed, and indexed in the same way.
o Audio pieces are retrieved based on similarity between the query index and the audio indexes in the database.

Reasons for classification:
o Different audio types require different processing, indexing, and retrieval techniques.
o Different audio types have different significance to different applications.
o Speech recognition techniques/systems are available.
o The audio type or class information is itself very useful to some applications.
o The search space during retrieval is reduced to a particular audio class after classification.

Audio classification is based on some objective or subjective audio features.

General approach to speech indexing and retrieval:
o apply speech recognition to convert speech to spoken words
o apply traditional IR techniques on the recognized words.

There are two forms of musical representation: structured and sample-based. Temporal and content relationships between different media types may help the indexing and retrieval of multimedia objects.

5.2 MAIN AUDIO PROPERTIES AND FEATURES

Audio can be represented in:
o the time domain (time-amplitude representation), or
o the frequency domain (frequency-magnitude representation).
Features are derived or extracted from these two representations. There are also other, subjective features such as timbre.

5.2.1 Features Derived in the Time Domain

The time-amplitude representation is the most basic signal representation technique: a signal is represented as amplitude varying with time. Silence is represented as 0. The signal value can be positive or negative depending on whether the sound pressure is above or below the equilibrium atmospheric pressure (the pressure when there is silence).

Figure 5.1 Amplitude-time representation of an audio signal.

Three features are commonly derived from this representation:

o Average energy (E): indicates the loudness of the audio signal:

  $E = \frac{1}{N} \sum_{n=0}^{N-1} x(n)^2$

  where N is the number of samples and x(n) is the value of sample n.


o Zero crossing rate (ZCR): indicates the frequency of signal amplitude sign changes. To some extent, it indicates the average signal frequency:

  $ZC = \frac{1}{2N} \sum_{n=1}^{N-1} |\operatorname{sgn} x(n) - \operatorname{sgn} x(n-1)|$

  where sgn x(n) is the sign of x(n) (1 if x(n) is positive and -1 if x(n) is negative).

o Silence ratio: indicates the proportion of the sound piece that is silent. Silence is defined as a period within which the absolute amplitude values of a certain number of samples are below a certain threshold. The silence ratio is calculated as the ratio between the sum of the silent periods and the total length of the audio piece.
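To make these three time-domain features concrete, here is a minimal Python/NumPy sketch; the silence threshold and window length are illustrative assumptions, not values from the text:

```python
import numpy as np

def time_domain_features(x, silence_threshold=0.01, window=256):
    """Average energy, zero crossing rate, and silence ratio of a signal x
    given as samples in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    N = len(x)

    # Average energy: E = (1/N) * sum of x(n)^2
    energy = np.sum(x ** 2) / N

    # Zero crossing rate: ZC = (1/2N) * sum |sgn x(n) - sgn x(n-1)|
    signs = np.where(x >= 0, 1.0, -1.0)
    zcr = np.sum(np.abs(np.diff(signs))) / (2 * N)

    # Silence ratio: fraction of fixed-size windows whose absolute
    # amplitudes all stay below the threshold.
    silent = 0
    for start in range(0, N - window + 1, window):
        if np.max(np.abs(x[start:start + window])) < silence_threshold:
            silent += window
    silence_ratio = silent / N

    return energy, zcr, silence_ratio
```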

5.2.2 Features Derived From the Frequency Domain

Sound Spectrum
The spectrum is the frequency domain representation of a signal, derived from the time domain representation according to the Fourier transform. It indicates the amount of energy at different frequencies.

Figure 5.2 The spectrum of the sound signal in Figure 5.1.

The discrete Fourier transform (DFT) is given by:

  $X(k) = \sum_{n=0}^{N-1} x(n) e^{-j \omega_k n}$

where $\omega_k = 2\pi k / N$, x(n) is a discrete signal with N samples, and k is the DFT bin number. If the sampling rate of the signal is $f_s$ Hz, then the frequency $f_k$ of bin k in hertz is given by:

  $f_k = f_s \frac{\omega_k}{2\pi} = f_s \frac{k}{N}$

If x(n) is time-limited to length N, then it can be recovered completely by taking the inverse discrete Fourier transform (IDFT) of the N frequency samples as follows:

  $x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) e^{j \omega_k n}$

The DFT and IDFT are calculated efficiently using an algorithm called the fast Fourier transform (FFT). In practice, a long signal is broken into blocks called frames (typically 10 to 20 ms) and the DFT is applied to each frame.

Features extracted from the frequency domain:

o Bandwidth: indicates the frequency range of a sound. Music normally has a higher bandwidth than speech signals. Bandwidth = highest frequency - lowest frequency of the nonzero spectrum components. In some cases "nonzero" is defined as at least 3 dB above the silence level.
o Energy distribution: the distribution of signal energy across the frequency components. It is useful for audio classification (music has more high-frequency components than speech). Example: frequencies below 7 kHz belong to the low band and the others belong to the high band; the total energy of each band is calculated as the sum of the power of the samples within the band. The spectral centroid (brightness) is the midpoint of the spectral energy distribution of a sound. Speech has a low centroid compared to music.

o Harmonicity: in a harmonic sound the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency (the fundamental frequency). Music is normally more harmonic than other sounds. Harmonicity is determined by checking whether the frequencies of the dominant components are multiples of the fundamental frequency. Example: the sound spectrum of a flute playing the note G4 has components at 400 Hz, 800 Hz, 1,200 Hz, 1,600 Hz, and so on (f = 400 Hz is the fundamental frequency).

o Pitch: pitch is a subjective feature (related to, but not equivalent to, the fundamental frequency). In practice, we use the fundamental frequency as an approximation of the pitch.

5.2.3 Spectrogram

A spectrogram shows the relation between three variables: frequency content, time, and intensity. Frequency content is shown along the vertical axis and time along the horizontal one. The intensity, or power, of the signal is indicated by a gray scale (the darkest shade shows the greatest amplitude/power). A spectrogram clearly illustrates the relationships among time, frequency, and amplitude.

Figure 5.3 Spectrogram of the sound signal of Figure 5.1.
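A hedged sketch of how the frequency-domain features above might be computed for a single frame using NumPy's FFT; the 7 kHz band split and the 3 dB reading of "nonzero" follow the text, while the function name, parameters, and silence level are illustrative assumptions:

```python
import numpy as np

def frame_spectral_features(frame, fs, low_band_limit=7000.0, silence_level=1e-6):
    """Spectral features of one audio frame (e.g., 10-20 ms of samples)."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))        # magnitude spectrum
    freqs = np.arange(len(spectrum)) * fs / N    # f_k = fs * k / N
    power = spectrum ** 2
    total = power.sum() + 1e-12

    # Bandwidth: highest minus lowest "nonzero" frequency, where nonzero is
    # read as at least 3 dB (a factor of ~2 in power) above the silence level.
    nonzero = freqs[power >= silence_level * 2.0]
    bandwidth = nonzero.max() - nonzero.min() if nonzero.size else 0.0

    # Energy distribution: share of energy below the low/high band split.
    low_band_ratio = power[freqs < low_band_limit].sum() / total

    # Spectral centroid (brightness): midpoint of the energy distribution.
    centroid = np.sum(freqs * power) / total

    return bandwidth, low_band_ratio, centroid
```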

5.2.4 Subjective Features

Timbre is an example of a subjective feature. Timbre relates to the quality of a sound: it encompasses all the distinctive qualities of a sound other than its pitch, loudness, and duration. Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope.

5.3 AUDIO CLASSIFICATION

5.3.1 Main Characteristics of Different Types of Sound

Speech:
o low bandwidth compared to music (about 100 to 7,000 Hz)
o spectral centroids (also called brightness) of speech signals are usually lower than those of music
o because of frequent pauses, speech signals normally have a higher silence ratio than music
o a succession of syllables composed of short periods of friction (caused by consonants) followed by longer periods for vowels, so speech has higher variability in ZCR (zero-crossing rate).

Music:
o high frequency range (bandwidth), from about 16 to 20,000 Hz
o spectral centroid is higher than that of speech
o lower silence ratio (except perhaps music produced by a solo instrument or singing without accompanying music)
o lower variability in ZCR
o regular beats that can be extracted to differentiate it from speech.

Table 5.1 Main Characteristics of Speech and Music

Feature              Speech          Music
Bandwidth            0-7 kHz         0-20 kHz
Spectral centroid    Low             High
Silence ratio        High            Low
Zero-crossing rate   More variable   Less variable
Regular beat         None            Yes

5.3.2 Audio Classification Frameworks

Classification is based on calculated feature values. There are two groups of methods:
o In the first group, each feature is used individually in different classification steps.
o In the second group, a set of features is used together as a vector to calculate the closeness of the input to the training sets.

Step-by-Step Classification (see Figure 5.4)
Examples of reported results:
o one method discriminated broadcast speech and music and achieved a classification rate of 90%
o another used the silence ratio to classify audio into music and speech with an average success rate of 82%.
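A minimal sketch of such a step-by-step scheme, following the decision sequence of Figure 5.4; the threshold values are illustrative placeholders that in practice would be tuned on training data:

```python
def classify_step_by_step(centroid, silence_ratio, zcr_variability,
                          centroid_thr=2000.0, silence_thr=0.2, zcr_var_thr=0.1):
    """Step-by-step classification following the decision flow of Figure 5.4."""
    if centroid > centroid_thr:        # high centroid -> music
        return "music"
    # low centroid: speech or music
    if silence_ratio <= silence_thr:   # low silence ratio -> music
        return "music"
    # high silence ratio: speech or solo music
    if zcr_variability > zcr_var_thr:  # high ZCR variability -> speech
        return "speech"
    return "solo music"
```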

[Figure 5.4 decision flow: audio input → high centroid? yes → music; no → speech plus music → high silence ratio? no → music; yes → speech plus solo music → high ZCR variability? yes → speech; no → solo music.]
Figure 5.4 A possible audio classification process.

Feature-Vector-Based Audio Classification
A set of features is calculated and used as a feature vector.
o Training stage: an average feature vector (reference vector) is found for each class of audio.
o Classification: the feature vector of an input is calculated, and the vector distances between the input feature vector and each of the reference vectors are calculated. The input is classified into the class from which it has the least vector distance (e.g., Euclidean distance).
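A sketch of the feature-vector-based framework: the reference vector of each class is the mean of its training vectors, and an input is assigned to the class with the smallest Euclidean distance. The two-dimensional features in the usage example are invented purely for illustration:

```python
import numpy as np

def train_reference_vectors(training_data):
    """training_data: dict mapping class name -> list of feature vectors."""
    return {label: np.mean(np.asarray(vectors, dtype=float), axis=0)
            for label, vectors in training_data.items()}

def classify_feature_vector(x, references):
    """Assign x to the class whose reference vector is closest (Euclidean)."""
    x = np.asarray(x, dtype=float)
    return min(references, key=lambda label: np.linalg.norm(x - references[label]))

# Usage with made-up (centroid, silence ratio) features:
refs = train_reference_vectors({
    "speech": [[900, 0.35], [1100, 0.40]],
    "music":  [[3200, 0.05], [2800, 0.10]],
})
print(classify_feature_vector([1000, 0.30], refs))   # -> "speech"
```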

5.4 SPEECH RECOGNITION AND RETRIEVAL

The basic approach to speech indexing and retrieval is to:
o convert speech signals into text, and
o apply IR techniques for indexing and retrieval.

Other information can enhance speech indexing and retrieval:
o the speaker's identity
o the mood of the speaker.

5.4.1 Speech Recognition

The automatic speech recognition (ASR) problem is a pattern matching problem. An ASR system is trained to collect models or feature vectors for all possible speech units; a speech unit may be a phoneme, a word, or a phrase, the phoneme being the smallest unit.

The recognition process:
o extract feature vectors of the input speech unit
o compare them with the feature vectors collected during the training process
o the speech unit whose feature vector is closest to that of the input speech unit is deemed to be the unit spoken.

We describe three classes of practical ASR techniques:
o dynamic time warping,
o hidden Markov models (HMMs),
o artificial neural network (ANN) models.
Techniques based on HMMs are the most popular and produce the highest speech recognition performance.

5.4.1.1 Basic Concepts of ASR

An ASR system operates in two stages: training and pattern matching.
o Training stage: the features of each speech unit are extracted and stored in the system.
o Recognition process: the features of an input speech unit are extracted and compared with each of the stored features; the speech unit with the best matching features is taken as the recognized unit (a phoneme may be used as the speech unit).

Recognition is complicated by the following factors:
o A phoneme spoken by different speakers, or by the same speaker at different times, produces different features in terms of duration, amplitude, and frequency components.
o There is background or environmental noise.
o Continuous speech is difficult to separate into individual phonemes because different phonemes have different durations.
o Phonemes vary with their location in a word. The frequency components of a vowel's pronunciation are heavily influenced by the surrounding consonants.

See Figure 5.5 for a general model of ASR systems.
o The typical frame size is 10 ms.
o Mel-frequency cepstral coefficients (MFCCs) are the most commonly used feature vectors. MFCCs are obtained by the following process:
  1. The spectrum of the speech signal is warped to a scale, called the mel-scale, that represents how a human ear hears sound.
  2. The logarithm of the warped spectrum is taken.
  3. An inverse Fourier transform of the result of step 2 is taken to produce what is called the cepstrum.
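A much-simplified sketch of the three steps above; it is not a real MFCC front end (those use a mel filterbank and a discrete cosine transform), and the mel formula used here is the common 2595·log10(1 + f/700) convention, which is an assumption rather than something stated in the text:

```python
import numpy as np

def simplified_cepstrum(frame, fs, n_mel=40):
    """Rough illustration of the warp -> log -> inverse-transform sequence."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
    freqs = np.arange(len(spectrum)) * fs / len(frame)

    # 1. Warp the spectrum to the mel scale and resample on a uniform mel grid.
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)
    mel_grid = np.linspace(mel.min(), mel.max(), n_mel)
    warped = np.interp(mel_grid, mel, spectrum)

    # 2. Take the logarithm of the warped spectrum.
    log_warped = np.log(warped)

    # 3. Inverse Fourier transform of the log spectrum gives the cepstrum;
    #    the first few coefficients would be kept as MFCC-like features.
    cepstrum = np.fft.irfft(log_warped)
    return cepstrum[:13]
```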

[Figure 5.5 block diagram. Training process: training speech → preprocessing & feature extraction → feature vectors → phonetic modeling (together with the corresponding words of the training speech) → phoneme models plus dictionary and grammar. Recognition process: speech input → preprocessing & feature extraction → feature vectors → search and matching (using the phoneme models, dictionary, and grammar) → output word sequence.]
Figure 5.5 A general ASR system (after [13]).

5.4.1.2 Techniques Based on Dynamic Time Warping

Each speech frame is represented by a feature vector. The recognition process:
o The distance between the input feature vectors and those in the recognition database is computed as the sum of frame-to-frame differences between feature vectors.
o The best match is the one with the smallest distance.
o Weakness: this may not work because of nonlinear variations in the timing of speech made by different speakers, or made at different times by the same speaker. For example, the same word spoken by different people will take different amounts of time, so we cannot directly calculate frame-to-frame differences.
Dynamic time warping normalizes or scales the speech duration to minimize the sum of distances between the feature vectors that most likely match best; see Figure 5.6.

Figure 5.6(a): before time warping
o the reference speech and the test speech are the same, but the two have different time durations.
Figure 5.6(b): after time warping
o the reference speech and the test speech have the same length, and their distance can be calculated by summing the frame-to-frame or sample-to-sample differences.
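A compact dynamic-programming sketch of the idea: it finds the alignment of two feature-vector sequences that minimizes the summed frame-to-frame distances, which is the "warping" illustrated in Figure 5.6. The Euclidean frame distance is an assumption:

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic time warping distance between two sequences of feature vectors."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # frame distance
            # Advance the reference, the test, or both (the "warp").
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```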

[Figure 5.6 plots feature amplitude against time for the reference speech and the test speech, (a) before and (b) after time warping.]
Figure 5.6 A dynamic time warping example: (a) before time warping; (b) after time warping.

5.4.1.3 Techniques Based on Hidden Markov Models
o HMM-based techniques are the most widely used and produce the best recognition performance.
o Phonemes are the fundamental units of meaningful sound in speech.
o Each phoneme is different from all the rest, but phonemes are not unchanging in themselves.
o When a phoneme is voiced, it can be identified as similar to its previous occurrences, although not exactly the same.

o A phoneme's sound is modified by its neighbors' sounds.
o The challenge of speech recognition is how to model these variations mathematically.
o An HMM consists of a number of states, linked by a number of possible transitions (Figure 5.7).
o Associated with each state are a number of symbols, each with a certain occurrence probability; a probability is also associated with each transition.
o When a state is entered, a symbol is generated. Which symbol is generated at each state is determined by the occurrence probabilities.
o In Figure 5.7:
  o the HMM has three states;
  o at each state, one of four possible symbols, x1, x2, x3, and x4, is generated with different probabilities, as shown by b1(x), b2(x), b3(x), and b4(x);
  o the transition probabilities are shown as a11, a12, and so forth.

Figure 5.7 An example of an HMM.

o It is not possible to identify a unique sequence of states given a sequence of output symbols.
o Every sequence of states that has the same length as the output symbol sequence is possible, each with a different probability.
o The sequence of states is hidden from the observer, who sees only the output symbol sequence (hence the name hidden Markov model).
o Although it is not possible to identify the unique sequence of states for a given sequence of output symbols, it is possible to determine which sequence of states is most likely to generate the sequence of symbols, based on the state transition and symbol generation probabilities.

HMMs in speech recognition:
o Each phoneme is divided into three audible states: an introductory state, a middle state, and an exiting state.
o Each state can last for more than one frame (normally each frame is 10 ms).
o During the training stage, training speech data is used to construct an HMM for each of the possible phonemes. Each HMM has the above three states and is defined by state transition probabilities and symbol generation probabilities. Symbols are the feature vectors calculated for each frame. Some transitions are not allowed because time flows forward only; for example, transitions from state 2 to 1, 3 to 2, and 3 to 1 are not allowed if the HMM in Figure 5.7 is used as a phoneme model. Transitions from a state to itself are allowed and serve to model the time variability of speech.
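To illustrate how the state transition and symbol generation probabilities combine, here is a minimal sketch of the forward algorithm (mentioned below for isolated-word recognition), which computes the probability that one HMM generates an observed symbol sequence. The toy three-state, four-symbol model is an invented example in the spirit of Figure 5.7:

```python
import numpy as np

def forward_probability(A, B, pi, observations):
    """Probability that an HMM generates the observation sequence.
    A:  state transition probabilities, shape (S, S)
    B:  symbol generation probabilities per state, shape (S, K)
    pi: initial state probabilities, shape (S,)
    observations: list of symbol indices in 0..K-1."""
    A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
    alpha = pi * B[:, observations[0]]      # initialize with the first symbol
    for obs in observations[1:]:
        alpha = (alpha @ A) * B[:, obs]     # propagate through transitions, then emit
    return alpha.sum()

# Toy left-to-right 3-state, 4-symbol phoneme model; a recognizer would keep
# one such model per phoneme and pick the phoneme whose model scores highest.
A = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
pi = [1.0, 0.0, 0.0]
print(forward_probability(A, B, pi, [0, 1, 2, 3]))
```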

o At the end of the training stage, each phoneme is represented by one HMM capturing the variations of feature vectors in different frames. These variations are caused by different speakers, time variations, and surrounding sounds.
o During speech recognition, feature vectors for each input phoneme are calculated frame by frame. The system determines which phoneme HMM is most likely to generate the sequence of feature vectors of the input phoneme; the corresponding phoneme of that HMM is deemed to be the input phoneme. As a word has a number of phonemes, a sequence of phonemes is normally recognized together.
o There are a number of algorithms to compute the probability that an HMM generates a given sequence of feature vectors:
  o the forward algorithm is used for recognizing isolated words;
  o the Viterbi algorithm is used for recognizing continuous speech.

5.4.1.4 Techniques Based on Artificial Neural Networks

ANNs are widely used for pattern recognition. An ANN consists of many neurons interconnected by links with weights. Speech recognition with ANNs consists of two stages:
o training stage: feature vectors of training speech data are used to train the ANN (adjust the weights on different links);
o recognition stage: the ANN identifies the most likely phoneme based on the input feature vectors.

5.4.1.5 Speech Recognition Performance

Performance is measured by the recognition error rate and is affected by the following factors:
1. Subject matter: varying from a set of digits or a newspaper article to general news.
2. Type of speech: read or spontaneous conversation.
3. Size of vocabulary: from dozens to a few thousand words.

Table 5.2 Error Rates for High-Performance Speech Recognition (Based on HMMs)

Subject Matter          Type                       Vocabulary (No. of Words)   Word Error Rate (%)
Connected digits        Read                       10                          <0.3
Airline travel system   Spontaneous                2,500                       2
Wall Street Journal     Read                       64,000                      7
Broadcast news          Read/spontaneous (mixed)   64,000                      30
General phone call      Conversation               10,000                      50

5.4.2 Speaker Identification

Speaker identification or voice recognition attempts to find the identity of the speaker, or to extract information about an individual from his or her speech. It is very useful to multimedia information retrieval. It can determine speaker information such as:
o the number of speakers in a particular setting
o the gender of the speaker

o whether the speaker is an adult or a child
o the speaker's mood, emotional state, and attitude, etc.
Speaker information, together with the speech content, significantly improves information retrieval performance.

Speech recognition, if it is to be speaker-independent, must purposefully ignore any idiosyncratic speech characteristics of the speaker and focus on those parts of the speech signal richest in linguistic information. In contrast, voice recognition must amplify those idiosyncratic speech characteristics that individualize a person and suppress linguistic characteristics that have no bearing on the recognition of the individual speaker.

5.5 MUSIC INDEXING AND RETRIEVAL

Research and development of effective techniques for music indexing and retrieval is still at an early stage. There are two types of music:
o structured (synthetic) music
o sample-based music.

5.5.1 Indexing and Retrieval of Structured Music and Sound Effects

Structured music and sound effects are represented by a set of commands or algorithms:
o MIDI represents music as a number of notes and control commands.
o MPEG-4 Structured Audio represents sound with algorithms and control languages.
Structured sound standards and formats are developed for sound transmission, synthesis, and production.

The structure and note descriptions in these formats make the retrieval process easy (there is no need to do feature extraction). However, an exact match of the sequence of notes does not guarantee the same sound, because different devices may be used for rendering. Finding similar music or sound effects based on similarity is complicated even with structured music and sound effects; the main problem is that it is hard to define similarity between two sequences of notes.

One possibility is to retrieve music based on the pitch changes of a sequence of notes:
o Each note (except the first one) in the query and in the database sound files is converted into a pitch change relative to its previous note.
o The three possible values for the pitch change are U (up), D (down), and S (same or similar).
o A sequence of notes is thus characterized as a sequence of symbols.
o The retrieval task becomes a string-matching process.
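A small sketch of this idea: note sequences are reduced to U/D/S contour strings, and the query contour is located in a stored contour by string matching. The same matcher, allowed k mismatches, also covers the approximate matching needed for hummed queries in Section 5.5.2. The note values and tolerance are illustrative assumptions:

```python
def pitch_contour(notes, tolerance=0.0):
    """Convert a sequence of note pitches (e.g., MIDI numbers or Hz) into a
    string of U (up), D (down), S (same/similar) relative to the previous note."""
    contour = []
    for prev, cur in zip(notes, notes[1:]):
        if abs(cur - prev) <= tolerance:
            contour.append("S")
        elif cur > prev:
            contour.append("U")
        else:
            contour.append("D")
    return "".join(contour)

def find_matches(query, reference, k=0):
    """Positions where the query contour occurs in the reference contour
    with at most k mismatching characters (approximate string matching)."""
    hits = []
    for start in range(len(reference) - len(query) + 1):
        window = reference[start:start + len(query)]
        mismatches = sum(a != b for a, b in zip(query, window))
        if mismatches <= k:
            hits.append(start)
    return hits

# Example: a hummed query matched against a stored melody's contour.
stored = pitch_contour([60, 62, 64, 62, 60, 67, 65])   # MIDI note numbers
query = pitch_contour([62, 64, 62, 60])
print(find_matches(query, stored, k=1))
```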

5.5.2 Indexing and Retrieval of Sample-Based Music

There are two approaches to indexing and retrieval of sample-based music:
o the first approach is based on a set of extracted sound features;
o the second approach is based on the pitches of music notes.

Music retrieval based on a set of features
A set of acoustic features is extracted for each sound, and the similarity between the query and each of the stored music pieces is calculated based on the corresponding feature vectors. This approach can be applied to general sound including music, speech, and sound effects. Five features are used:
o loudness
o pitch
o brightness
o bandwidth
o harmonicity.
Each feature is represented statistically by three parameters: mean, variance, and autocorrelation. The Euclidean distance or the Manhattan distance may be used.

Music retrieval based on pitch
This is similar to pitch-based retrieval of structured music, but we need to extract the pitch of each note (called pitch tracking). Pitch tracking is a simple form of automatic music transcription, which converts musical sound into a symbolic representation. The basic idea:
o Each note of the music (including the query) is represented by its pitch.
o A musical piece or segment is represented as a sequence or string of pitches.
o Retrieval is based on similarity between the query string and the candidate strings.
Pitch is normally defined as the fundamental frequency of a sound. To find the pitch of each note, the input music must first be segmented into individual notes. Segmentation of continuous music, especially humming and singing, is very difficult; therefore, it is normally assumed that music is stored as scores in the database, so the pitch of each note is known. The common query input form is humming. There are two pitch representations:

o In the first method, each pitch except the first one is represented as a pitch direction (or change) relative to the previous note. The pitch direction is either U (up), D (down), or S (similar), so each musical piece is represented as a string of three symbols or characters.
o The second method represents each note as a value based on a chosen reference note. The value assigned is the one, from a set of standard pitch values, that is closest to the estimated pitch. If we represent each allowed value as a character, each musical piece or segment is again represented as a string of characters.

Retrieval then consists of finding a match or similarity between the strings. Humming is not exact, so approximate matching is needed: string matching with k mismatches, where k is determined by the user. The problem consists of finding all instances of a query string Q = q1 q2 q3 ... qm in a reference string R = r1 r2 r3 ... rn such that there are at most k mismatches (characters that are not the same).

5.6 MULTIMEDIA INFORMATION INDEXING AND RETRIEVAL USING RELATIONSHIPS BETWEEN AUDIO AND OTHER MEDIA

So far, we have treated sound independently of other media. In some applications, sound appears as part of a multimedia document or object. For example, a movie consists of a sound track and a video track with fixed temporal relationships between them. Different media in a multimedia object are interrelated in their contents as well as by time. We can use this interrelation to improve multimedia information indexing and retrieval in the following two ways.

First, we can use knowledge or understanding about one medium to understand the contents of other media.

We have used text to index and retrieve speech through speech recognition; we can in turn use audio classification and speech understanding to help with the indexing and retrieval of video. Figure 5.8 shows a multimedia object consisting of a video track and a sound track. The video track has 26 frames. Assume the sound track has been segmented into different sound types: the first segment is speech and corresponds to video frames 1 through 7; the second segment is loud music and corresponds to video frames 7 through 18; the final segment is speech again and corresponds to video frames 19 through 26. We can then use the knowledge of the sound track to do the following on the video track. First, we segment the video track according to the sound track segment boundaries; in this case, the video track is likely to have three segments with boundaries aligned with the sound track segment boundaries. Second, we apply speech recognition to sound segments 1 and 3 to understand what was talked about; the corresponding video track segments are very likely to have similar content, so video frames may be indexed and retrieved based on the speech content without any other processing. This is important because, in general, it is difficult to extract video content even with complicated image processing techniques.
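A toy sketch of this first use of the relationship: sound-track segment boundaries and recognized speech drive the segmentation and indexing of the video track. The segment tuples mirror the Figure 5.8 example, but the recognized text and the index structure are invented for illustration:

```python
# Sound-track segments as (start_frame, end_frame, audio_type, recognized_text);
# the text fields stand in for what speech recognition would return.
sound_segments = [
    (1, 7, "speech", "opening remarks about the concert"),
    (7, 18, "loud music", None),
    (19, 26, "speech", "interview with the composer"),
]

def index_video_by_audio(segments):
    """Build a simple index: video frame ranges keyed by recognized speech words."""
    index = {}
    for start, end, audio_type, text in segments:
        if audio_type == "speech" and text:
            for word in text.split():
                index.setdefault(word, []).append((start, end))
    return index

video_index = index_video_by_audio(sound_segments)
print(video_index.get("composer"))   # -> [(19, 26)]
```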
[Figure 5.8: a video track of frames 1 to 26, with boundaries at frames 7 and 19, aligned with a sound track divided into segments 1, 2, and 3.]

Figure 5.8 An example multimedia object with a video track and a sound track.

The second way to make use of relationships between media for multimedia retrieval is during the retrieval process.

The user can use the most expressive and simplest medium to formulate a query, and the system will retrieve and present relevant information to the user regardless of media type. For example, a user can issue a query using speech to describe what information is required, and the system may retrieve and present relevant information in text, audio, video, or their combinations. Alternatively, the user can use an example image as a query and retrieve information in images, text, audio, and their combinations. This is useful because there are different levels of difficulty in formulating queries in different media. We discuss the indexing and retrieval of composite multimedia objects further in Chapter 9.

5.7 SUMMARY

This chapter described some common techniques and related issues for content-based audio indexing and retrieval. The general approach is to classify audio into some common types such as speech and music, and then use different techniques to process and retrieve the different types of audio. Speech indexing and retrieval is relatively easy, achieved by applying IR techniques to words identified using speech recognition; however, speech recognition performance on general topics without any vocabulary restriction still needs to be improved. For music retrieval, some useful work has been done based on audio feature vector matching and approximate pitch matching. However, more work is needed on how music, and audio in general, is perceived, and on similarity comparison between musical pieces. It would also be very useful if we could automatically classify music further into different types such as pop and classical.

The classification and retrieval capabilities described in this chapter are potentially important and useful in many areas, such as the press and the music industry, where audio information is used. For example, a user can hum or play a song and ask the system to find songs similar to what was hummed or played. A radio presenter can specify the requirements of a particular occasion and ask the system to provide a selection of

audio pieces meeting these requirements. When a reporter wants to find a recorded speech, he or she can type in part of the speech to locate the actual recording. Audio and video are often used together, as in movies and television programs, so audio retrieval techniques may help locate specific video clips, and video retrieval techniques may help locate audio segments. These relationships should be exploited to develop integrated multimedia database management systems.
