Humans classify audio signals all the time without conscious effort. Recognizing a
voice on the telephone, or telling a telephone ring from a doorbell, are tasks that
we do not consider very difficult. Problems arise when the sound is weak, when
there is noise, or when the sound is similar to another sound. In this project, we
propose to design a deep-learning-based transfer learning system to perform audio
classification for similarity search. Audio signal classification consists of
extracting relevant features from a sound and using these features to identify
which of a set of classes the sound most likely belongs to. The feature extraction
and grouping algorithms can be quite diverse, depending on the classification
domain of the application. In this project, we propose to use convolutional neural
networks (CNNs) to extract image features and a technique called transfer learning
for classification. We propose to plot the audio signal as a spectrum of intensity
and feed the spectrum to a CNN for feature extraction. We thus classify the audio
spectrum using transfer learning rather than the audio signal itself. This system
finds applications in domains such as medicine, entertainment, security and audio
fingerprinting. The proposed system is expected to outperform state-of-the-art
techniques because it performs audio feature extraction automatically, without
human intervention.
Speech is the principal means of communication among humans, used to convey
ideas, feelings and thoughts. Using speech to control one's surroundings has
always been an intriguing concept. Speech recognition technology has made it
possible for computers to listen to human voice commands and interpret human
languages. Classification of speech is one of the most vital problems in speech
processing. Although there have been many studies on the classification of speech,
the results are still limited. Firstly, most speech classification approaches
require the input data to have the same dimension. Secondly, all traditional
methods must be trained before classifying speech signals and must be retrained
when more training data or a new class is added. Studies on speech processing have
been carried out for more than 50 years. Although a great deal about how the
system works has been researched, there is still more to be discovered.
Previously, researchers considered speech perception and speech recognition as
separate domains. Speech perception focuses on the process that decodes speech
sounds no matter what words those sounds might comprise. However, there are
differences between speech perception, speech classification and speech
recognition: speech recognition identifies what the input signal is, speech
perception results in an interaction with the input signal, and speech
classification organizes the speech signal into a category for its most effective
and efficient use, based on a set of training speech signals. In this paper, we
focus on the problem of speech classification or, more particularly, on
isolated-word classification. Speech recognition is the process of converting a
given input signal into a sequence of words by means of an algorithm implemented
as a computer program. In other words, a speech recognition system allows a
computer to identify the words that a person speaks into a microphone or telephone
and convert them into readable text. Such a system supports many valuable
applications that require human interaction with machines. Speech recognition is
the interdisciplinary sub-field of computational linguistics that develops
methodologies and technologies that enable the recognition and translation of
spoken language into text by computers. It is also known as automatic speech
recognition (ASR), computer speech recognition or speech to text (STT). It
incorporates knowledge and research from the linguistics, computer science, and
electrical engineering fields. Some speech
recognition systems require "training" (also called "enrollment") where an
individual speaker reads text or isolated vocabulary into the system. The system
analyzes the person's specific voice and uses it to fine-tune the recognition of that
person's speech, resulting in increased accuracy. Systems that do not use training
are called "speaker independent"[1] systems. Systems that use training are called
"speaker dependent". Speech recognition applications include voice user interfaces
such as voice dialing (e.g. "Call home"), call routing (e.g. "I would like to make a
collect call"), domotic appliance control, search (e.g. find a podcast where
particular words were spoken), simple data entry (e.g., entering a credit card
number), preparation of structured documents (e.g. a radiology report), speech-to-
text processing (e.g., word processors or emails), and aircraft (usually termed
direct voice input). The term voice recognition[2][3][4] or speaker
identification[5][6] refers to identifying the speaker, rather than what they are
saying. Recognizing the speaker can simplify the task of translating speech in
systems that have been trained on a specific person's voice or it can be used to
authenticate or verify the identity of a speaker as part of a security process.[7]
From the technology perspective, speech recognition has a long history with
several waves of major innovations. Most recently, the field has benefited from
advances in deep learning and big data. The advances are evidenced not only by
the surge of academic papers published in the field, but more importantly by the
worldwide industry adoption of a variety of deep learning methods in designing
and deploying speech recognition systems. These speech industry players include
Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, SoundHound and
iFLYTEK, many of which have publicized the core technology in their speech
recognition systems as being based on deep learning.
Literature Survey
The motor theory was proposed by Liberman and Cooper [2] in the 1950s and
developed further by Liberman et al. [1,2]. In this theory, listeners are said to
interpret speech sounds in terms of the motoric gestures they would use to make
those same sounds. The TRACE model [5] is a connectionist network with an input
layer and three processing layers: pseudo-spectra (feature), phoneme and word.
There are three types of connection in the TRACE model. The first type is
feedforward excitatory connections from input to features, features to phonemes
and phonemes to words. The second type is lateral inhibitory connections at the
feature, phoneme and word layers. The last type is top-down feedback excitatory
connections from words to phonemes. The original Cohort model was proposed in 1984
by Marslen-Wilson et al. [6]. The core idea of the Cohort model is that human
speech comprehension is achieved by processing incoming speech continuously as it
is heard. At all times, the system computes the best interpretation of the
currently available input, combining information in the speech signal with prior
semantic and syntactic context.
The fuzzy logical theory of speech perception was developed by Massaro [4]. It
proposes that people remember speech sounds in a probabilistic, or graded, way,
and that they remember descriptions of the perceptual units of language, called
prototypes. Within each prototype, various features may combine. However,
features are not just binary: each has a fuzzy value corresponding to how likely
it is that a sound belongs to a particular speech category. Thus, when perceiving
a speech signal, our decision about what we actually hear is based on the relative
goodness of the match between the stimulus information and the values of
particular prototypes. The final decision is based on multiple features or sources
of information, even visual information. For the speech recognition problem,
common methods include hidden Markov models (HMM) [7,8], neural networks [9,10],
dynamic time warping [11] and deep neural network (DNN) acoustic models [12,13].
These approaches usually take as input common features of the speech signal such
as MFCC [9] or LPC [14], or use the raw speech signal with a convolutional neural
network that learns the features [16–18]. To be used with common machine learning
techniques, these input features must all have the same size, so the speech
features must be resampled or quantized. In addition, these machine learning
techniques have the disadvantage that they do not allow training samples to be
added without retraining, which reduces the flexibility needed for large-scale
speech perception applications. To retain all the discriminative features of the
data, Boiman et al. proposed a classification approach called naive Bayes nearest
neighbor (NBNN) [19]; McCann and Lowe then extended this method into an approach
called local naive Bayes nearest neighbor (LNBNN) [20].
Classification Techniques
Decision Trees:
A decision tree builds classification or regression models in the
form of a tree structure. It breaks down a data set into smaller
and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with
decision nodes and leaf nodes. A decision node has two or more
branches, and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the
best predictor, is called the root node. Decision trees can handle
both categorical and numerical data.
Random Forest:
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks,
that operate by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the
individual trees. Random decision forests correct for decision
trees' habit of overfitting to their training set.
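The aggregation step described above can be sketched as follows. This is only a minimal illustration, assuming the individual trees have already been trained and have each produced a prediction; the function names `forest_classify` and `forest_regress` and the sample labels are hypothetical and not part of the project.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the mode (most common class label) of the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_outputs)


# Hypothetical predictions from five already-trained trees for one sample.
votes = ["speech", "music", "speech", "speech", "noise"]
print(forest_classify(votes))            # speech
print(forest_regress([2.0, 3.0, 4.0]))   # 3.0
```

Because each tree only contributes one vote, a single overfitted tree has limited influence on the final answer.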
Neural Network:
A neural network consists of units (neurons), arranged in layers,
which convert an input vector into some output. Each unit takes
an input, applies an (often nonlinear) function to it and then passes
the output on to the next layer. Generally the networks are
defined to be feed-forward: a unit feeds its output to all the units
on the next layer, but there is no feedback to the previous layer.
Weightings are applied to the signals passing from one unit to
another, and it is these weightings which are tuned in the training
phase to adapt a neural network to the particular problem at
hand.
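A forward pass through such a feed-forward network can be sketched as below. This is a minimal illustration only: the sigmoid activation, the layer sizes and all weight values are assumptions made for the example, and no training is performed.

```python
import math


def sigmoid(z):
    """A common nonlinear activation function (assumed for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Compute one layer: each unit outputs activation(w . x + b)."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the signal through each (weights, biases) layer in order."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical network: 2 inputs, 2 hidden units, 1 output, made-up weights.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = network_forward([1.0, 1.0], layers)
print(output)
```

Training would adjust the numbers in `layers`; the structure of the computation stays the same.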
Nearest Neighbor:
The k-nearest-neighbors algorithm is a classification algorithm,
and it is supervised: it takes a bunch of labelled points and uses
them to learn how to label other points. To label a new point, it
looks at the labelled points closest to that new point (those are its
nearest neighbors), and has those neighbors vote, so whichever
label most of the neighbors have becomes the label for the new
point (the “k” is the number of neighbors it checks).
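The voting procedure above can be sketched in a few lines. This is a minimal illustration, assuming points are numeric tuples compared with Euclidean distance; the function name `knn_classify` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, labelled_points, k=3):
    """Label `query` by majority vote among its k nearest labelled points."""
    by_distance = sorted(
        labelled_points,
        key=lambda item: math.dist(query, item[0]),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors with made-up labels.
training = [
    ((0.0, 0.0), "ring"),
    ((0.1, 0.2), "ring"),
    ((1.0, 1.0), "doorbell"),
    ((0.9, 1.1), "doorbell"),
    ((1.2, 0.8), "doorbell"),
]
print(knn_classify((0.2, 0.1), training, k=3))  # ring
```

Note that no training phase is needed: new labelled points can simply be appended to the list.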
Nowadays there are a variety of audio fingerprinting schemes available, but most of them share the
same general architecture [12]. As shown in Figure 2.7, there are two major parts: fingerprint extraction
and fingerprint matching. The fingerprint extraction part computes a set of characteristic features from
the input audio signal. These features are also called fingerprints. They might be extracted at a uniform
rate [30] or only around special zones of the spectrogram [62]. After fingerprint extraction, the
fingerprints of the query sample are used by a matching algorithm to find the best match by
searching a large database of fingerprints. In the fingerprint matching part, we compute the distance
between the query fingerprint and the other fingerprints in the database. The number of comparisons is
usually very high and the computation of distances can be expensive, so a good matching algorithm is
critical. In the end, the hypothesis testing block computes a qualitative or quantitative measure
of the reliability of the search results.
Let’s look at this framework from another perspective. It has two working modes, training mode and
operating mode. During training mode, reference tracks are fed into the fingerprint extraction part and
fingerprints are extracted and stored in a database. When a query track is given, the system switches to
operating mode. Fingerprints are extracted by the same means as the training mode and sent to the
fingerprint matching part. In this step, fingerprints are compared to the other fingerprints in the
database to find the particular document that has the most fingerprints in common with the query sample.
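The two working modes can be sketched with a small in-memory index. This is only an illustration under stated assumptions: fingerprints are taken to be hashable values (here plain integers) that have already been extracted, and the class name `FingerprintDatabase` and its methods are hypothetical.

```python
from collections import Counter, defaultdict


class FingerprintDatabase:
    def __init__(self):
        # Inverted index: fingerprint -> set of track ids that contain it.
        self._index = defaultdict(set)

    def add_track(self, track_id, fingerprints):
        """Training mode: store the fingerprints of a reference track."""
        for fp in fingerprints:
            self._index[fp].add(track_id)

    def best_match(self, query_fingerprints):
        """Operating mode: return (track_id, shared_count), or None."""
        shared = Counter()
        for fp in set(query_fingerprints):
            for track_id in self._index.get(fp, ()):
                shared[track_id] += 1
        return shared.most_common(1)[0] if shared else None


db = FingerprintDatabase()
db.add_track("song_a", [11, 22, 33, 44])
db.add_track("song_b", [33, 44, 55, 66])
print(db.best_match([22, 33, 44, 99]))  # ('song_a', 3)
```

The inverted index means a query only touches tracks that share at least one fingerprint with it, rather than every track in the database.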
Fingerprint Models
We present the basic concepts and architecture of audio fingerprinting systems, and a summary of
related work in speech recognition and speech enhancement in noisy environments. We begin
with a brief introduction to acoustic processing of audio signals. Then, a general audio fingerprinting
framework is introduced, since most audio fingerprinting algorithms follow a similar architecture. Finally,
we review previous work on noise-robust speech recognition, focusing mainly on speech
enhancement techniques.
Acoustic Processing
Acoustic processing is the basis of audio fingerprinting and speech recognition. The main steps of
acoustic processing are: representing a sound wave in a form that facilitates digital signal processing,
obtaining the distribution of frequencies from waveforms, and visualizing an audio file.
Sound Wave
When we listen to a piece of audio, what our ears receive is actually a series of changes in air pressure. The
air pressure is generated by the speaker, who makes air pass through the glottis and out of the oral or nasal
cavities [36]. To represent sound waves, we plot the changes in air pressure over time. For
example, Figure 2.1 shows the waveform for the sentence “set white at B4 now” taken from the GRID
audiovisual sentence corpus. In this figure, we can easily distinguish the waveforms of the vowels from
those of most consonants in this sentence. The reason is that vowels are voiced and loud, leading to high
amplitude in the waveform, while most of the consonants are unvoiced and of low amplitude. Figure 2.2 shows the
waveform for the vowel [E] extracted from this sentence. Note that there are repeated patterns in the
wave, which are related to the underlying frequency.
Frequency and amplitude are two important characteristics of a sound wave. Frequency denotes how
many times per second a wave repeats itself. In Figure 2.2, we can find a wave with a particular pattern that
repeats about 16 times in 0.11 seconds, so there is a frequency component of 16/0.11 ≈ 145 Hz in this
vowel. Here “Hz” (hertz) is the unit of frequency. Amplitude is the strength of the air pressure. Zero means the air
pressure is normal, positive amplitude means the air pressure is higher than normal, and negative
amplitude means it is lower than normal [36]. From a perceptual perspective, frequency and amplitude are
related to pitch and loudness respectively, although the relationship between them is not linear.
To process a sound wave, the first step is to digitize it using an analog-to-digital converter. There
are two stages here: sampling and quantization. Sampling measures the amplitude of a sound wave
at a specified sampling rate, which is the number of samples taken per second. According to the
Nyquist–Shannon sampling theorem [28], the sampling rate should be at least twice the maximum
frequency we want to capture. 8,000 Hz and 16,000 Hz are common sampling rates for speech signals, as most of the energy of
the human voice is distributed between 300 Hz and 3,400 Hz [49]. After sampling, a sequence of amplitude
measurements, which are real-valued numbers, is output. To store the sequence efficiently, we need
quantization. In this stage, the real-valued numbers are converted to integers of 8 bits or 16 bits.
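The two digitization stages can be sketched as follows. This is a minimal illustration, assuming a synthetic sine wave stands in for the analog signal and that amplitudes are normalized to the range [-1, 1]; the function names `sample_sine` and `quantize` are hypothetical.

```python
import math


def sample_sine(frequency_hz, sampling_rate_hz, duration_s):
    """Sample a unit-amplitude sine wave at a fixed sampling rate."""
    if sampling_rate_hz < 2 * frequency_hz:
        raise ValueError("sampling rate is below twice the signal frequency")
    count = round(sampling_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(count)
    ]


def quantize(samples, bits=16):
    """Map real-valued samples in [-1, 1] to signed integers of `bits` bits."""
    max_level = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_level) for s in samples]


# A 145 Hz tone sampled at 8,000 Hz for 10 ms, then stored as 16-bit integers.
samples = sample_sine(frequency_hz=145, sampling_rate_hz=8000, duration_s=0.01)
pcm16 = quantize(samples, bits=16)
print(len(pcm16), pcm16[:5])
```

With 16 bits each sample takes one of 65,535 levels in this sketch; with 8 bits the same signal is stored more compactly but with coarser amplitude steps.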
Spectrum
Processing sound waves in the time domain can be very complicated; however, it turns out to be much
simpler when the signal is converted to the frequency domain. The mathematical operation that converts an
acoustic signal between the time and frequency domains is called a transform. One example is the
Fourier transform, devised by the French mathematician Fourier in the 1820s, which can express a time
function as a sum of infinitely many sine waves, each of which represents a different frequency component.
In the context of acoustic signal processing, the spectrum is a representation of all the frequency
components of a sound wave in the frequency domain. Its resolution depends on which transform is used,
what the sampling rate is and how many samples we use to compute the spectrum.
The discrete Fourier transform (DFT) is the most common way to perform the Fourier transform in real
applications:
X_k = \sum_{n=0}^{N-1} x_n e^{-i 2 \pi k n / N},  k = 0, 1, ..., N-1
Here x is the input sequence of the sound wave and X is the frequency output. N is the number of samples
used in the calculation.
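For illustration, the standard DFT definition can be evaluated directly as sketched below. This is a slow O(N^2) reference version written only to make the roles of x, X and N concrete; the function name `dft` and the test signal are assumptions for the example, and a real application would use an FFT routine instead.

```python
import cmath
import math


def dft(x):
    """Evaluate X[k] = sum_n x[n] * exp(-2j*pi*k*n/N) directly."""
    n_samples = len(x)
    return [
        sum(
            x[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n in range(n_samples)
        )
        for k in range(n_samples)
    ]


# One cycle of a cosine over 8 samples: the energy lands in bins 1 and 7.
signal = [math.cos(2 * math.pi * n / 8) for n in range(8)]
magnitudes = [abs(value) for value in dft(signal)]
print([round(m, 6) for m in magnitudes])
```

Bin k corresponds to the frequency k times the sampling rate divided by N, so using more samples gives a finer frequency resolution.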
Figure 2.3 shows the spectrum of [E] in Figure 2.2 calculated with the Fast Fourier Transform (FFT), a
method that computes the DFT of a sequence rapidly and produces exactly the same result as
evaluating the DFT definition directly. Normally the magnitude of each frequency component is measured in
decibels (dB). From this figure, we can see that there are two major frequency components at 500 Hz
and 1700 Hz in this vowel, and some other weaker frequency components besides them. We can also
Similarity Search Methods
After fingerprints are extracted from the query audio, we need to search for similar fingerprints in the
database. Here similarity is a measure of how alike two fingerprints are, and is expressed as
a distance. A small distance indicates a high degree of similarity, and vice versa. Popular distance
measures include the Euclidean distance [8], the Manhattan distance [31], an error metric called
“Exponential Pseudo Norm” [51], accumulated approximation error [3], etc. How the distance is
computed largely depends on the design of the fingerprint.
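Two of the distance measures named above can be sketched as follows, together with a simple linear scan that returns the closest reference. This is only an illustration, assuming fingerprints are equal-length numeric vectors; the function names and sample data are hypothetical.

```python
def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(query, references, distance):
    """Return the (name, vector) reference with the smallest distance."""
    return min(references.items(), key=lambda item: distance(query, item[1]))


print(euclidean_distance([0, 3], [4, 0]))  # 5.0
print(manhattan_distance([0, 3], [4, 0]))  # 7

references = {"track_a": [1, 1, 0], "track_b": [5, 5, 5]}
print(nearest([1, 0, 0], references, manhattan_distance)[0])  # track_a
```

The linear scan compares the query against every reference, which is exactly the cost that an index structure is meant to avoid on a large database.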
Searching for similar items in a large database is a non-trivial task, although it may be easy to find
an exactly identical item. There are millions of fingerprints in the database, so comparing them one by
one is unlikely to be efficient. The general strategy is to design an index data structure that decreases the
number of distance calculations. To further accelerate the search, some search
algorithms adopt a multi-step strategy. In [31], Haitsma et al. design a two-phase search
algorithm in which full fingerprint comparisons are performed only on candidates selected by a
sub-fingerprint search. In [40], Lin et al. propose a matching system consisting of three parts: “atomic”
subsequence matching, long subsequence matching and sequence matching.
Hypothesis Testing
The final step is to decide whether there is a matching item in the database. If the similarity, which is
based on the above distance, between the query fingerprint and a reference fingerprint in the
database is above a threshold, the reference item is returned as the matching result; otherwise the
system concludes that there is no matching item in the database. Based on the matching results, the
performance of an audio fingerprinting system is measured as the fraction of correct matches
out of all the queries used for testing. Most systems report this recognition rate as their evaluation
result.
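The thresholding decision and the recognition rate can be sketched as below. This is a minimal illustration, assuming similarity scores have already been computed for each reference item; the function names, the threshold value and the sample scores are all hypothetical.

```python
def decide_match(similarities, threshold):
    """Return the best reference id if its similarity exceeds the threshold."""
    if not similarities:
        return None
    best_id = max(similarities, key=similarities.get)
    return best_id if similarities[best_id] > threshold else None


def recognition_rate(predictions, ground_truth):
    """Fraction of queries whose returned match equals the true item."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


print(decide_match({"song_a": 0.82, "song_b": 0.40}, threshold=0.6))  # song_a
print(decide_match({"song_a": 0.30}, threshold=0.6))                  # None
print(recognition_rate(["a", None, "c", "d"], ["a", "b", "c", "x"]))  # 0.5
```

Raising the threshold reduces false matches at the cost of rejecting more genuine ones, so its value is usually tuned on held-out queries.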
Identification of Sound
Speech recognition is the process of converting a speech signal into the corresponding sequence of words [21].
It has been implemented on mobile devices, computers and the cloud [34]. Sometimes, it is also known as
automatic speech recognition. A general speech recognition system is illustrated in Figure 1.1. The
acoustic model describes the probabilistic relationship between the audio signal and phonemes, which are
the basic units of speech. It is estimated from a training dataset consisting of speech files and their
corresponding transcripts. The lexicon describes how phonemes make up individual words, and the
language model defines the probability of different combinations of words. Given a speech waveform,
the recognition algorithm collects probability information from these three sources and outputs the
word string with the highest probability. Recently, with the development of smartphones, wearable
devices and virtual reality, the demand for robust speech recognition has increased greatly, requiring
speech recognition to work in much more challenging circumstances. For example, a user may want to
use Siri on an iPhone while driving a car or sitting in a restaurant, where interfering sounds
around the phone may distort the original speech. A traditional speech recognition system has
many problems in this scenario. As shown in Figure 1.2, the system is trained on clean speech but is later
fed corrupted speech. This mismatch between the training and operating conditions results in a
dramatic deterioration of the recognition rate of the speech recognition system.
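The decoding step described earlier in this section, choosing the word string with the highest probability, can be sketched as follows. This is a heavily simplified illustration: it assumes a small fixed list of candidate word strings, folds the lexicon into the acoustic score, and uses made-up log-probabilities; the function name `decode` is hypothetical.

```python
def decode(candidates, acoustic_log_prob, language_log_prob):
    """Return the candidate word string with the highest combined log score."""
    return max(
        candidates,
        key=lambda words: acoustic_log_prob[words] + language_log_prob[words],
    )


# Made-up scores: log P(audio | words) and log P(words).
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
language = {"recognize speech": -3.0, "wreck a nice beach": -7.0}

print(decode(list(acoustic), acoustic, language))  # recognize speech
```

In this toy case the acoustically slightly better candidate loses because the language model considers it far less likely. Under noisy conditions the acoustic scores become unreliable, which is one way to see why the mismatch described above hurts the recognition rate.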
Integrity Verification
Implementation
Training the system: In this module, we train the system using the labelled
dataset. We restrict the speech signals to 5 different styles. Each style is
treated as a class, giving 5 classes. The name of every class (the style name)
serves as the label for all the images corresponding to that class, which serve as
data. Features are then extracted from every speech signal in each of the classes.
We then train a multi-class classifier such as an SVM, Naive Bayes or a
convolutional neural network on the dataset to obtain a classification model.
This classification model is then used for testing.
Testing the system: In this module, we supply the test dataset to the trained
model. The trained model returns the probability of the test image belonging to
each of the trained classes. We then design a hypothesis to decide the class of the
test image. This determines the speech style represented by the image. This
activity is done on a real-time system such as a Raspberry Pi for ease of execution.
1) It recognises the speech signal that is not used in training. Thus it can be
imagined as a generalized system.
5) The results obtained are efficient.
Software tools and technologies used:
This project draws on concepts from data mining, machine learning and
statistics. It makes heavy use of NumPy, Pandas and data visualization
libraries. A few of the key concepts used in the project are discussed
below.
1. Anaconda Python: In this project, we prefer to code the system in Python for its
versatility and compatibility. We propose to use the Anaconda distribution
to write and run the Python programs, for its ease of execution and the
libraries it bundles. Anaconda ships with most of the commonly used libraries,
which reduces the burden of installation and compilation before use. It also
comes with IDEs such as Spyder and Jupyter Notebook for easy debugging of
the program.
4. Cost function: When building a linear model, we try to minimize the error the
algorithm makes in its predictions. We do this by choosing a function that
measures this error, called the cost function. Evaluation metrics for
classification problems, such as accuracy, are not useful for regression problems.
Instead, we need evaluation metrics designed for comparing continuous values.
Here we use the root mean squared error (RMSE); there are other choices, but
this is one of the most popular and we adopt it here. We write the cost function as
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
where y_i is the observed value, \hat{y}_i is the predicted value and N is the
number of samples.
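As an illustration, the root mean squared error named above can be computed as sketched below. The function name `rmse` and the sample values are assumptions made for this example.

```python
import math


def rmse(targets, predictions):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([0.0, 0.0], [2.0, 2.0]))            # 2.0
print(rmse([3.0, 5.0, 7.0], [3.0, 5.0, 7.0]))  # 0.0
```

Because the errors are squared before averaging, large individual errors are penalized more heavily than many small ones.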
It’s sometimes difficult to see how this mathematical explanation translates into a
practical setting, so it’s helpful to look at an example. The canonical example when
explaining gradient descent is linear regression.
Lines that fit our data better (where better is defined by our error function) will
result in lower error values. If we minimize this function, we will get the best line
for our data. Since our error function consists of two parameters (m and b) we can
visualize it as a two-dimensional surface.
Each point in this two-dimensional space represents a line. The height of the
function at each point is the error value for that line. You can see that some lines
yield smaller error values than others (i.e., fit our data better). When we run
gradient descent search, we will start from some location on this surface and move
downhill to find the line with the lowest error.
To run gradient descent on this error function, we first need to compute its
gradient. The gradient will act like a compass and always point us downhill. To
compute it, we will need to differentiate our error function. Since our function is
defined by two parameters (m and b), we will need to compute a partial derivative
for each. These derivatives work out to be:
We can then repeatedly update the initial approximation to reach the minimum and thereby
obtain the values of the coefficients.
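The update procedure can be sketched as follows. This is a minimal illustration that assumes the error function is the mean squared error of the line y = m*x + b, E(m, b) = (1/N) * sum((y_i - (m*x_i + b))^2); under that assumption the partial derivatives are dE/dm = (2/N) * sum(-x_i * (y_i - (m*x_i + b))) and dE/db = (2/N) * sum(-(y_i - (m*x_i + b))). The learning rate, iteration count and data points are made up for the example.

```python
def gradient_step(m, b, points, learning_rate):
    """One gradient descent update for the line y = m*x + b (assumed MSE)."""
    n = len(points)
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    # Move against the gradient, i.e. downhill on the error surface.
    return m - learning_rate * grad_m, b - learning_rate * grad_b


def fit_line(points, learning_rate=0.05, iterations=2000):
    """Start from m = 0, b = 0 and apply repeated gradient steps."""
    m, b = 0.0, 0.0
    for _ in range(iterations):
        m, b = gradient_step(m, b, points, learning_rate)
    return m, b


# Points lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
m, b = fit_line(points)
print(round(m, 4), round(b, 4))  # 2.0 1.0
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, many more iterations are needed to reach the minimum.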
We generally fit a sigmoid curve using the techniques of cost function optimization
and other techniques as discussed above.
There are several advantages of performing speech recognition. A few of them are
listed below.
1. Automotive sector: Speech recognition can be used in the design of self-driving
cars, gesture-controlled advanced driver-assistance systems, etc.
2. Consumer electronics sector: Speech recognition plays a vital role in making
electronic items such as fans, air conditioners and TVs easier to operate, helping
them become part of everyday life.
7. We use this widely in home automation.
8. The usage can be seen in sign language interpretation.
The project investigates the different techniques used for speech recognition and
describes its various advantages. It deploys a well-known machine learning
algorithm to solve the problem of speech recognition. The project also proposes
to pre-process the data using different computer vision techniques to improve
accuracy. The project can find applications in many fields, such as cheque
reading automation and autonomous driving vehicles.
The present approach for hand gesture classification opens the door to multiple
pathways. These include, but are not limited to:
References
3. Cole, R., Fanty, M.: ISOLET (Isolated Letter Speech Recognition), Department
of Computer Science and Engineering, September 12 (1994)
4. Massaro, D.W.: Testing between the TRACE Model and the Fuzzy Logical
Model of Speech perception. Cognitive Psychology, pp. 398–421 (1989)
5. McClelland, J.L., Elman, J.L.: The TRACE model of speech perception.
Cognitive Psychology (1986)
7. Patel, I.: Speech recognition using HMM with MFCC—an analysis using
frequency spectral decomposition technique. Signal & Image Proc Int J (SIPIJ).
1(2) (2010)
8. Paul, D.B.: Speech Recognition Using Hidden Markov Models. Lincoln Lab. J.
3(1) (1990)
9. Adam, T.B.: Spoken English alphabet recognition with mel frequency cepstral
coefficients and back propagation neural networks. Int J Comput Appl. 42(12),
0975–8887 (2012)
10. Salam, M.S.H., Mohamad, D., Salleh, S.: Malay isolated speech recognition
using neural network: a work in finding number of hidden nodes and learning
parameters. Int Arab J Info Technol 8, 364–371 (2011)
11. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for
spoken word recognition. In: IEEE Transactions on Acoustics, Speech and Signal
Processing, pp. 43–49 (1978)
12. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech
recognition: the shared views of four research groups. In: IEEE Signal Process, pp.
82–97 (2012)
13. Abdel-Hamid, O., et al.: Convolutional neural networks for speech recognition
in IEEE/ACM transactions on audio. Speech and language processing, October,
USA (2014)
15. Favero, R.F.: Compound wavelets: wavelets for speech recognition. In:
International symposium on time-frequency and time-scale analysis, pp. 600–603
(1994)
16. Jaitly, N., Hinton, G.: Learning a better representation of speech soundwaves
using restricted boltzmann machines. In: Proc. of ICASSP, pp. 5884–5887 (2011)
17. Sainath, T., Weiss, R., Senior, A., Wilson, W., Vinyals, O.: Learning the Speech
Front-end with Raw Waveform CLDNNs. In: Interspeech (2015)
18. Dimitri, P., Mathew, M.D., Ronan, C.: Analysis of CNN-based speech
recognition system using raw speech as input. In: Interspeech (2015)
19. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based
image classification. In: CVPR (2008)
20. McCann, S., Lowe, D.G.: Local Naive Bayes nearest neighbor for image
classification. In: CVPR (2012)
21. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. In:
IJCV (2004)
24. Sadaoki, F.: 50 years of Progress in speech and Speaker Recognition Research.
vol. 1, no. 2, November (2005)
25. Davis, K.H., Biddulph, R., Balashek, S.: Automatic recognition of spoken digits.
J. Acoust. Soc. Am., pp. 637–642 (1952)
26. Olson, H.F., Belar, H.: Phonetic typewriter. J. Acoust. Soc. Am. 28(6), 1072–
1081 (1956)
27. Fry, D.B.: Theoretical aspects of mechanical speech recognition. J. Br. Inst.
Radio Eng., pp. 211–299 (1959)
28. Rabiner, L.R., Levinson, S.E., Rosenberg, A.E., Wilpon, J.G.: Speaker
independent recognition of isolated words using clustering techniques. IEEE Trans.
Acoustics, Speech, Signal Proc (1979)
29. Sakoe, H.,: Two level DP matching—a dynamic programming based pattern
matching algorithm for connected word recognition. IEEE Trans. Acoustics,
Speech, Signal Proc., pp. 588–595 (1979)
31. Cole, R., Fanty, M., Muthusamy, Y., Gopalakrishnan, M.: Speaker-independent
recognition of spoken English letters. In: International Joint Conference on Neural
Networks (IJCNN), pp. 45–51 (1990)
32. Cole, R., Fanty, M.: Spoken letter recognition. In: Presented at the
Proceedings of the conference on advances in neural information processing
systems Denver, Colorado, United States (1990)
33. Fanty, M., Cole, R.: Spoken Letter Recognition. In: Presented at the
Proceedings of the conference on advances in neural information processing
systems Denver, Colorado, United States (1990)
35. Ibrahim, M.D., Ahmad, A.M., Smaon, D.F., Salam, M.S.H.: Improved E-set
recognition performance using time-expanded features. In: Presented at the second
national conference on computer graphics and multimedia (CoGRAMM),
Selangor, Malaysia (2004)
Bibliography
[1] MA Abd El-Fattah, Moawad Ibrahim Dessouky, Salah M Diab, and Fathi ElSayed Abd El-Samie. Speech
enhancement using an adaptive wiener filtering approach. Progress In Electromagnetics Research M,
4:167–184, 2008.
[2] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE transactions on
Computers, 100(1):90–93, 1974.
[3] Eric Allamanche, Jürgen Herre, Oliver Hellmuth, Bernhard Fröba, Thorsten Kastner, and Markus
Cremer. Content-based identification of audio material using mpeg-7 low level description. In ISMIR,
2001.
[4] Andreas Antoniou. Digital signal processing. McGraw-Hill Toronto, Canada:, 2006.
[5] Shumeet Baluja and Michele Covell. Content fingerprinting using wavelets. Proc. of European
Conference on Visual Media Production (CVMP), 2006.
[6] Shumeet Baluja and Michele Covell. Audio fingerprinting: Combining computer vision & data stream
processing. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International
Conference on, volume 2, pages II– 213. IEEE, 2007.
[7] Michael Berouti, Richard Schwartz, and John Makhoul. Enhancement of speech corrupted by acoustic
noise. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’79., volume
4, pages 208–211. IEEE, 1979.
[8] Thomas L Blum, Douglas F Keislar, James A Wheaton, and Erling H Wold. Method and article of
manufacture for content-based analysis, storage, retrieval, and segmentation of audio information, June
29 1999. US Patent 5,918,223.
[9] Steven Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on
acoustics, speech, and signal processing, 27(2):113–120, 1979.
[10] Judith C Brown. Calculation of a constant q spectral transform. The Journal of the Acoustical Society
of America, 89(1):425–434, 1991.
[11] Judith C Brown and Miller S Puckette. An efficient algorithm for the calculation of a constant q
transform. The Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.
[12] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. A review of algorithms for audio fingerprinting.
In Multimedia Signal Processing, 2002 IEEE Workshop on, pages 169–173. IEEE, 2002.
[13] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. A review of audio fingerprinting. Journal of
VLSI signal processing systems for signal, image and video technology, 41(3):271–284, 2005.
[14] Pedro Cano, Eloi Batlle, Harald Mayer, and Helmut Neuschmied. Robust sound modeling for song
detection in broadcast audio. Proc. AES 112th Int. Conv, pages 1–7, 2002.
[15] Vijay Chandrasekhar, Matt Sharifi, and David A Ross. Survey and evaluation of audio fingerprinting
schemes for mobile query-by-example applications. In ISMIR, volume 20, pages 801–806, 2011.
[16] Jianping Chen and Tiejun Huang. A robust feature extraction algorithm for audio fingerprinting. In
Pacific-Rim Conference on Multimedia, pages 887–890. Springer, 2008.
[19] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. An audio-visual corpus for speech
perception and automatic speech recognition. The Journal of the Acoustical Society of America,
120(5):2421–2424, 2006.
[21] Li Deng and Douglas O’Shaughnessy. Speech processing: a dynamic and optimization-oriented
approach. CRC Press, 2003.
[22] Dan Ellis. Robust landmark-based audio fingerprinting. web resource, available:
http://labrosa.ee.columbia.edu/matlab/fingerprint, 2009.
[23] Yariv Ephraim, Hanoch Lev-Ari, and William JJ Roberts. A brief survey of speech enhancement. The
Electronic Handbook, 2, 2005.
[24] Sébastien Fenet, Gaël Richard, Yves Grenier, et al. A scalable audio fingerprint method with
robustness to pitch-shifting. In ISMIR, pages 121–126, 2011.
[25] Mark John Francis Gales. Model-based techniques for noise robust speech recognition. PhD thesis,
University of Cambridge Cambridge, 1995.