Sound Report

Introduction
Humans classify audio signals all the time without conscious effort. Recognizing a
voice on the telephone, or telling the difference between a telephone ring and a
doorbell ring, are tasks that we do not consider very difficult. Problems arise
when the sound is weak, when there is noise, or when it is similar to another sound. In
this project, we propose to design a deep learning based transfer learning system to
perform audio classification for similarity search. Audio signal classification
consists of extracting relevant features from a sound and using these features to
identify the class, from a given set, into which the sound is most likely to fit. The feature
extraction and grouping algorithms used can be quite diverse depending on the
classification domain of the application. In this project, we propose to use
convolutional neural networks (CNNs) to extract image features and a technique
called transfer learning for classification. We propose to plot the audio signal as a
spectrum of intensity and feed the spectrum image to a CNN for feature extraction. We
classify the audio spectrum using transfer learning rather than the audio signal
itself. This system finds applications in various domains such as medicine,
entertainment, security and audio fingerprinting. The proposed system is expected
to outperform state-of-the-art techniques, as it performs the task of audio feature
extraction automatically without human intervention.
Speech is the principal means of communication among humans for sharing their
ideas, feelings and thoughts with each other. Using speech to
control one’s surroundings has always been an intriguing concept. Speech recognition
technology has made it possible for computers to listen to human voice commands and
interpret human languages. Classification of speech is one of the most vital
problems in speech processing. Although there have been many studies on the
classification of speech, the results are still limited. Firstly, most speech
classification approaches require all input data to have the same dimension. Secondly,
all traditional methods must be trained before classifying a speech signal and must
be retrained when more training data or a new class is added. Studies on speech
processing have been carried out for more than 50 years. Although a
great deal about how the system works has been researched, there is still more to
be discovered. Previously, researchers considered speech perception and speech
recognition as separate domains. Speech perception focuses on the process that
operates to decode speech sounds no matter what words those sounds might
comprise. There are, however, some differences between speech perception,
speech classification and speech recognition. The differences are that speech
recognition points out what the input signal is, while speech perception results in
an interaction of the input signal and speech classification, organizing speech signals
into a category for their most effective and efficient use based on a set of training
speech signals. In this paper, we focus on the problem of speech classification or,
more particularly, on isolated word classification. Speech recognition is the
process of converting a given input signal into a sequence of words by means of an
algorithm implemented as a computer program. In other words, a speech
recognition system allows a computer to identify the words that a person speaks
into a microphone or telephone and convert them into readable text. The speech
recognition system would support many valuable applications that require human
interaction with machines. Speech recognition is an interdisciplinary subfield of
computational linguistics that develops methodologies and technologies that
enable the recognition and translation of spoken language into text by computers.
It is also known as automatic speech recognition (ASR), computer speech
recognition or speech to text (STT). It incorporates knowledge and research in the
linguistics, computer science, and electrical engineering fields. Some speech
recognition systems require "training" (also called "enrollment") where an
individual speaker reads text or isolated vocabulary into the system. The system
analyzes the person's specific voice and uses it to fine-tune the recognition of that
person's speech, resulting in increased accuracy. Systems that do not use training
are called "speaker independent"[1] systems. Systems that use training are called
"speaker dependent". Speech recognition applications include voice user interfaces
such as voice dialing (e.g. "Call home"), call routing (e.g. "I would like to make a
collect call"), domotic appliance control, search (e.g. find a podcast where
particular words were spoken), simple data entry (e.g., entering a credit card
number), preparation of structured documents (e.g. a radiology report), speech-to-
text processing (e.g., word processors or emails), and aircraft (usually termed
direct voice input). The term voice recognition[2][3][4] or speaker
identification[5][6] refers to identifying the speaker, rather than what they are
saying. Recognizing the speaker can simplify the task of translating speech in
systems that have been trained on a specific person's voice or it can be used to
authenticate or verify the identity of a speaker as part of a security process.[7]
From the technology perspective, speech recognition has a long history with
several waves of major innovations. Most recently, the field has benefited from
advances in deep learning and big data. The advances are evidenced not only by
the surge of academic papers published in the field, but more importantly by the
worldwide industry adoption of a variety of deep learning methods in designing
and deploying speech recognition systems. Speech industry players include
Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, SoundHound and
iFLYTEK, many of which have publicized the core technology in their speech
recognition systems as being based on deep learning.
Literature Survey
Research on speech classification started with the perception of the speech
code, motor theory and letter speech recognition [1–3]. Some popular theories of
speech classification are the Motor theory [2], the TRACE model [4,5], the Cohort model [6]
and the Fuzzy-logical model [4].
The Motor theory was proposed by Liberman and Cooper [2] in the 1950s. The
Motor theory was developed further by Liberman et al. [1,2]. In this theory,
listeners were said to interpret speech sounds in terms of the motoric gestures they
would use to make those same sounds. The TRACE model [5] is a connectionist
network with an input layer and three processing layers: pseudo-spectra, phoneme
and word. There are three types of connection in the TRACE model. The first
connection type is feedforward excitatory connections from input to features,
features to phonemes and phonemes to words. The second connection type is
lateral inhibitory connections at the feature, phoneme and word layers.
The last connection type is top-down feedback excitatory connections from words
to phonemes. The original Cohort model was proposed in 1984 by Marslen-Wilson et al.
[6]. The core idea at the heart of the Cohort model is that human speech
comprehension is achieved by processing incoming speech continuously as it is
heard. At all times, the system computes the best interpretation of currently
available input combining information in the speech signal with prior semantic and
syntactic context.
The fuzzy logical theory of speech perception was developed by Massaro [4]. He
proposes that people remember speech sounds in a probabilistic, or graded, way. The theory
suggests that people remember descriptions of the perceptual units of language,
called prototypes. Within each prototype, various features may combine.
However, features are not just binary; there is a fuzzy value corresponding to how
likely it is that a sound belongs to a particular speech category. Thus, when
perceiving a speech signal our decision about what we actually hear is based on the
relative goodness of the match between the stimulus information and values of
particular prototypes. The final decision is based on multiple features or sources of
information, even visual information. For the speech recognition problem, some
common methods are hidden Markov models (HMM) [7,8], neural networks [9,10],
dynamic time warping [11] and deep neural network (DNN) acoustic models [12,13].
These approaches usually take as input common features of the speech signal such as
MFCC [9] or LPC [14], or the raw speech signal fed to a convolutional neural network
that learns the features [16–18]. To be used with common machine learning techniques,
these input features must all have the same size. Thus, the speech features must be
resampled or quantized to have the same size. In addition, a disadvantage of
these machine learning techniques is that they do not allow adding training
samples without retraining. This reduces the flexibility needed for large-scale
speech perception applications. To retain all the discriminative features of the data,
Boiman proposed a classification approach called naïve Bayes nearest neighbor
(NBNN) [19]; Sancho then developed this method further and proposed an approach
called local naïve Bayes nearest neighbor (LNBNN) [20].
These approaches were successful in image classification problems. In this study,
we propose an approach for speech classification based on spectrogram images.
Researchers have proposed applying the scale-invariant feature transform (SIFT) to
the spectrogram image of a speech signal. SIFT features are invariant to scale and have
been used successfully for image classification [21,22]. In particular, feature points with
SIFT descriptors are extracted from the 2D image of the frequency spectrum of the
speech signal. The number of feature points differs from image to image. Each
feature point describes one local feature of the image; therefore, quantizing these
features would lose the descriptive information about the local features of the
image. The authors therefore use a classification algorithm that accepts feature
vectors of different sizes: LNBNN [20] accepts input features of different
sizes and has an acceptable running time. They propose the use of LNBNN for
classifying speech signals represented by images, based on SIFT features. This is
motivated by the work in [20], where LNBNN is used with SIFT features for
searching images in a database. One advantage of this approach is that new
training samples can be added without retraining. Moreover, this approach allows
feature vectors of different sizes. In this project, we propose to use
transfer learning to perform speech classification. Our work is motivated
by that of [20], with one change: we use deep learning algorithms for
classification rather than classical machine learning based feature extraction
methods.
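The transfer-learning idea proposed above can be sketched with TensorFlow's Keras API. This is only an illustrative sketch, not the project's actual model: the choice of MobileNetV2 as the pretrained network, the 224 x 224 RGB input size and the function name build_transfer_model are assumptions made for the example, while the default of 5 output classes follows the class count used in the Implementation section.

```python
def build_transfer_model(num_classes=5, input_shape=(224, 224, 3), weights="imagenet"):
    """Pretrained CNN as a frozen feature extractor plus a new classifier head."""
    import tensorflow as tf  # imported lazily so the sketch loads without TensorFlow

    # Assumption: MobileNetV2 stands in for whichever pretrained CNN is chosen.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights
    )
    base.trainable = False  # keep the pretrained features fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Only the final dense layer is trained on the spectrogram images; the frozen base reuses features learned on a large image dataset, which is the essence of transfer learning.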
Classification Techniques
In machine learning and statistics, classification is a
supervised learning approach in which the computer
program learns from the input data given to it and then uses
this learning to classify new observations. The data set may
simply be bi-class (such as identifying whether a person is
male or female, or whether a mail is spam or non-spam) or it
may be multi-class. Some examples of classification
problems are speech recognition, handwriting recognition,
biometric identification and document classification.
The types of classification algorithms in
machine learning include:
1. Linear Classifiers: Logistic Regression, Naive Bayes
Classifier
2. Support Vector Machines
3. Decision Trees
4. Boosted Trees
5. Random Forest
6. Neural Networks
7. Nearest Neighbor
Naive Bayes Classifier (Generative Learning Model):
It is a classification technique based on Bayes’ Theorem with an
assumption of independence among predictors. In simple terms,
a Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
Even if these features depend on each other or upon the
existence of the other features, the classifier treats all of them as
contributing independently to the probability. A Naive Bayes model
is easy to build and particularly useful for very large data sets.
Despite its simplicity, Naive Bayes can sometimes outperform more
sophisticated classification methods.
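As a minimal sketch of the independence assumption, the following pure-Python example counts feature values per class and multiplies the per-feature likelihoods (as a sum of logarithms). The toy data, the function names and the use of Laplace smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in zip(samples, labels):
        for index, value in enumerate(features):
            feature_counts[label][index][value] += 1
    return class_counts, feature_counts

def predict_naive_bayes(model, features, num_values=2):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for index, value in enumerate(features):
            seen = feature_counts[label][index][value]
            # Independence assumption: per-feature likelihoods multiply
            score += math.log((seen + 1) / (count + num_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (contains "offer", contains "meeting") -> label
samples = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]
labels = ["spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(samples, labels)
prediction = predict_naive_bayes(model, (1, 0))
```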
Logistic Regression (Predictive Learning Model):
It is a statistical method for analysing a data set in which there
are one or more independent variables that determine an
outcome. The outcome is measured with a dichotomous variable
(in which there are only two possible outcomes). The goal of
logistic regression is to find the best fitting model to describe the
relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of
independent (predictor or explanatory) variables.
Decision Trees:
A decision tree builds classification or regression models in the
form of a tree structure. It breaks down a data set into smaller
and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with
decision nodes and leaf nodes. A decision node has two or more
branches and a leaf node represents a classification or decision.
The topmost decision node in a tree, which corresponds to the
best predictor, is called the root node. Decision trees can handle both
categorical and numerical data.
Random Forest:
Random forests or random decision forests are an ensemble
learning method for classification, regression and other tasks,
that operate by constructing a multitude of decision trees at
training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the
individual trees. Random decision forests correct for decision
trees’ habit of overfitting to their training set.
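The aggregation step described above (mode for classification, mean for regression) can be sketched as follows. The "trees" here are hypothetical one-rule stand-ins rather than trained decision trees, and the feature names are invented for the example.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Classification: return the mode of the individual tree predictions."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, sample):
    """Regression: return the mean of the individual tree predictions."""
    outputs = [tree(sample) for tree in trees]
    return sum(outputs) / len(outputs)

# Hypothetical stand-ins for trained trees: each is a one-rule "stump".
stumps = [
    lambda s: "speech" if s["energy"] > 0.5 else "noise",
    lambda s: "speech" if s["pitch"] > 100 else "noise",
    lambda s: "speech" if s["duration"] > 2.0 else "noise",
]
label = forest_predict(stumps, {"energy": 0.8, "pitch": 140, "duration": 1.0})
```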
Neural Network:
A neural network consists of units (neurons), arranged in layers,
which convert an input vector into some output. Each unit takes
an input, applies an (often nonlinear) function to it and then passes
the output on to the next layer. Generally the networks are
defined to be feed-forward: a unit feeds its output to all the units
on the next layer, but there is no feedback to the previous layer.
Weightings are applied to the signals passing from one unit to
another, and it is these weightings which are tuned in the training
phase to adapt a neural network to the particular problem at
hand.
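A feed-forward pass through such a network can be sketched in a few lines. The sigmoid nonlinearity, the layer sizes and the weight values below are assumptions chosen for illustration; in practice the weights would be tuned during training.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def dense_layer(inputs, weights, biases):
    """One feed-forward layer: weighted sum per unit, then a nonlinearity."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs

def forward(inputs, layers):
    """Pass the input vector through each layer in order (no feedback)."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 1.0], layers)
```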
Nearest Neighbor:
The k-nearest-neighbors algorithm is a supervised classification algorithm:
it takes a set of labelled points and uses
them to learn how to label other points. To label a new point, it
looks at the labelled points closest to that new point (its
nearest neighbors) and has those neighbors vote, so whichever
label most of the neighbors have becomes the label for the new
point (the “k” is the number of neighbors it checks).
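The voting procedure can be sketched as below. The two-dimensional points and their labels are made up for the example, and Euclidean distance is assumed as the closeness measure.

```python
import math
from collections import Counter

def knn_classify(labelled_points, query, k=3):
    """Label `query` by majority vote among its k nearest labelled points."""
    by_distance = sorted(
        labelled_points,
        key=lambda item: math.dist(item[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors with their class labels
points = [
    ((1.0, 1.0), "doorbell"),
    ((1.2, 0.8), "doorbell"),
    ((0.9, 1.1), "doorbell"),
    ((5.0, 5.0), "telephone"),
    ((5.2, 4.9), "telephone"),
]
label = knn_classify(points, (1.1, 1.0), k=3)
```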
General Fingerprint Framework

Audio fingerprinting is a highly specific content-based audio retrieval technique. Given a short audio
fragment as query, an audio fingerprinting system can identify the particular file that contains the
fragment in a large library potentially consisting of millions of audio files. In this thesis, we investigate
the possibility and feasibility of applying audio fingerprinting to do speech recognition in noisy
environments based on speech reconstruction. To reconstruct noisy speech, the speech is first divided into
small segments of equal length. Then, audio fingerprinting is used to find the most similar
segment in a large dataset consisting of clean speech files. If the similarity is above a threshold, the noisy
segment is replaced with the clean segment. Finally, all the segments, after conditional replacement, are
concatenated to form the reconstructed speech, which is sent to a traditional speech recognition
system. In the above procedure, a critical step is using audio fingerprinting to find the clean speech
segment in a dataset. To test its performance, we build a landmark-based audio fingerprinting system.
Experimental results show that this baseline system performs well in traditional applications, but its
accuracy in this new application is not as good as we expected. Next, we propose three strategies to
improve the system, resulting in better accuracy than the baseline system. Finally, we integrate the
improved audio fingerprinting system into a traditional speech recognition system and evaluate the
performance of the whole system.
Nowadays there are a variety of audio fingerprinting schemes available, but most of them share the
same general architecture [12]. As shown in Figure 2.7, there are two major parts: fingerprint extraction
and fingerprint matching. The fingerprint extraction part computes a set of characteristic features from
the input audio signal. These features are also called fingerprints. They might be extracted at a uniform
rate [30] or only around special zones on the spectrogram [62]. After fingerprint extraction, the
fingerprints of the query sample are used by a matching algorithm to find the best match by
searching a large database of fingerprints. In the fingerprint matching part, we compute the distance
between the query fingerprint and other fingerprints in the database. The number of comparisons is
usually very high and the computation of distances can be expensive, so a good matching algorithm is
critical. In the end, the hypothesis testing block computes a qualitative or quantitative measure
of the reliability of the search results.
This framework can also be viewed from another perspective. It has two working modes: training mode and
operating mode. During training mode, reference tracks are fed into the fingerprint extraction part, and
fingerprints are extracted and stored in a database. When a query track is given, the system switches to
operating mode. Fingerprints are extracted by the same means as in training mode and sent to the
fingerprint matching part. In this step, the fingerprints are compared to other fingerprints in the
database to find the particular document that has the most fingerprints in common with the query sample.
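The two working modes can be illustrated with a toy inverted index. This sketch assumes fingerprint extraction has already happened and represents each fingerprint as a plain integer hash; the class name, method names and track ids are hypothetical.

```python
from collections import Counter, defaultdict

class FingerprintIndex:
    """Toy inverted index: fingerprint hash -> set of track ids."""

    def __init__(self):
        self.index = defaultdict(set)

    def add_reference(self, track_id, fingerprints):
        """Training mode: store the fingerprints of a reference track."""
        for fingerprint in fingerprints:
            self.index[fingerprint].add(track_id)

    def query(self, fingerprints):
        """Operating mode: return (track_id, shared_count) for the best match, or None."""
        shared = Counter()
        for fingerprint in set(fingerprints):
            for track_id in self.index.get(fingerprint, ()):
                shared[track_id] += 1
        return shared.most_common(1)[0] if shared else None

# Hypothetical fingerprints (integers stand in for real hashes)
index = FingerprintIndex()
index.add_reference("track_a", [11, 22, 33, 44])
index.add_reference("track_b", [33, 55, 66])
best = index.query([22, 33, 44, 99])
```

Looking up each query fingerprint directly in the index avoids comparing the query against every stored track.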
Fingerprint Models
We present the basic concepts and architecture of audio fingerprinting systems, and a summary of the
related work in speech recognition and speech enhancement in noisy environments. We begin
with a brief introduction to the acoustic processing of audio signals. Then, a general audio fingerprinting
framework is introduced. Most audio fingerprinting algorithms follow a similar architecture. In the end,
we review previous work done in noise-robust speech recognition, mainly focusing on speech
enhancement techniques.
Acoustic Processing
Acoustic processing is the basis of audio fingerprinting and speech recognition. The main steps of
acoustic processing are: represent a sound wave to facilitate digital signal processing, get the
distribution of frequencies from waveforms, and visualize an audio file.
Sound Wave
When we listen to a piece of audio, what our ears receive is actually a series of changes in air pressure. The
air pressure is generated by the speaker, who makes air pass through the glottis and out of the oral or nasal
cavities [36]. To represent sound waves, we plot the changes of air pressure over time. For
example, Figure 2.1 shows the waveform for the sentence “set white at B4 now” taken from the GRID
audiovisual sentence corpus. In this figure, we can easily distinguish the waveforms of the vowels from those of
most consonants in this sentence. The reason is that vowels are voiced and loud, leading to high
amplitude in the waveform, while most of the consonants are unvoiced and of low amplitude. Figure 2.2 shows the
waveform for the vowel [E] extracted from this sentence. Note that there are repeated patterns in the
wave, which are related to the underlying frequency.
Figure 2.1: The waveform of the sentence “set white at B4 now”
Frequency and amplitude are two important characteristics of a sound wave. Frequency denotes how
many times per second a wave repeats itself. In Figure 2.2, we can find a wave with a distinctive pattern that
repeats about 16 times in 0.11 seconds. So there is a frequency component of 16/0.11 ≈ 145 Hz in this
vowel. Here “Hz” (hertz) is the unit of frequency. Amplitude is the strength of the air pressure. Zero means the air
pressure is normal, positive amplitude means the air pressure is stronger than normal, and negative
amplitude means weaker air pressure [36]. From a perceptual perspective, frequency and amplitude are
related to pitch and loudness respectively, although the relationship between them is not linear.
Figure 2.2: The waveform of [E] extracted from Figure 2.1
To process a sound wave, the first step is to digitize it using an analog-to-digital converter. There
are two stages here: sampling and quantization. Sampling measures the amplitude of a sound wave
at a specified sampling rate, which is the number of samples taken per second. According to the Nyquist–Shannon
sampling theorem [28], the sampling rate should be at least two times the maximum frequency we want to
capture. 8,000 Hz and 16,000 Hz are common sampling rates for speech signals, as the major energy of the
human voice is distributed between 300 Hz and 3,400 Hz [49]. After sampling, a sequence of amplitude
measurements, which are real-valued numbers, is output. To store the sequence efficiently, we need
quantization. In this stage, the real-valued numbers are converted to integers of 8 bits or 16 bits.
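The two digitization stages can be sketched with a synthetic sine wave standing in for the analog signal. The 145 Hz tone, the 10 ms duration and the function names are assumptions for illustration; amplitudes are assumed to lie in the range [-1, 1].

```python
import math

def sample_sine(frequency_hz, sampling_rate_hz, duration_s):
    """Sampling: measure a sine wave's amplitude at evenly spaced instants."""
    count = int(sampling_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(count)
    ]

def quantize(samples, bits=16):
    """Quantization: map real amplitudes in [-1, 1] to signed integers."""
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16 bits
    return [round(s * max_level) for s in samples]

sampling_rate = 8000  # Hz, enough for content below 4000 Hz (Nyquist)
samples = sample_sine(145, sampling_rate, duration_s=0.01)
pcm = quantize(samples, bits=16)
```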
Spectrum
Processing sound waves in the time domain can be very complicated; however, it turns out to be much
simpler when the signal is converted to the frequency domain. The mathematical operation that converts an
acoustic signal between the time and frequency domains is called a transform. One example is the
Fourier transform, devised by the French mathematician Fourier in the 1820s, which can transform a time
function into a sum of infinitely many sine waves, each of which represents a different frequency component.
In the context of acoustic signal processing, the spectrum is a representation of all the frequency
components of a sound wave in the frequency domain. Its resolution depends on what transform is used,
what the sampling rate is and how many samples we use to compute the spectrum.
Figure 2.3: The FFT spectrum of the vowel [E].
The discrete Fourier transform (DFT) is the most common way to perform the Fourier transform in real
applications. The DFT is calculated as follows [4]:
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \quad 0 \le k < N.$$
Here x is the input sequence of sound wave and X is the frequency output. N is the number of samples
we use to calculate.
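A direct evaluation of the DFT definition can be sketched as follows. The test signal (a sine completing 2 cycles in 16 samples) is an assumption for illustration; this naive version takes on the order of N^2 operations, which is what the FFT avoids.

```python
import cmath
import math

def dft(x):
    """Evaluate X_k = sum_{n=0}^{N-1} x_n * exp(-2*pi*i*k*n/N) directly."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

# A sine that completes 2 cycles in N = 16 samples
N = 16
signal = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]
magnitudes = [abs(X_k) for X_k in dft(signal)]
# Search only the first half; the second half mirrors it for real input
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
```

The strongest component lands in the bin whose index equals the number of cycles in the analysis window.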
Figure 2.3 shows the spectrum of [E] in Figure 2.2 calculated with Fast Fourier Transform (FFT), a
method which can perform the DFT of a sequence rapidly and generate exactly the same result as
evaluating the DFT definition directly. Normally, the magnitude of each frequency component is measured in
decibels (dB). From this figure, we can find that there are two major frequency components at 500 Hz
and 1700 Hz in this vowel, and some other weaker frequency components besides them. We can also
Similarity Search Methods
After fingerprints are extracted from the query audio, we need to search for similar fingerprints in the
database. Here similarity is a measure of how alike two fingerprints are, and is described by
a distance. A small distance indicates a high degree of similarity, and vice versa. Popular distance
measures include the Euclidean distance [8], the Manhattan distance [31], an error metric called
“Exponential Pseudo Norm” [51], accumulated approximation error [3], etc. How to compute the
distance largely depends on the design of the fingerprint.
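The two most familiar of these measures can be written out as a short sketch. Fingerprints are assumed here to be equal-length numeric vectors, which is only one possible fingerprint design; the example vectors are made up.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical fingerprint vectors
query = [1.0, 2.0, 3.0]
reference = [4.0, 6.0, 3.0]
d_euclid = euclidean_distance(query, reference)
d_manhattan = manhattan_distance(query, reference)
```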
Searching for similar items in a large database is a non-trivial task, although it may be easy to find
an exactly identical item. There are millions of fingerprints in the database, so comparing
them one by one is unlikely to be efficient. The general strategy is to design an index data structure to decrease the
number of distance calculations. To further accelerate the search procedure, some search
algorithms adopt a multi-step search strategy. In [31], Haitsma et al. design a two-phase search
algorithm. Full fingerprint comparisons are only performed on candidates that have been selected by a
sub-fingerprint search. In [40], Lin et al. propose a matching system consisting of three parts: “atomic”
subsequence matching, long subsequence matching and sequence matching.
Hypothesis Testing
The final step is to decide whether there is a matching item in the database. If the similarity, which is
based on the above distance, between the query fingerprint and a reference fingerprint in the
database is above a threshold, the reference item is returned as the matching result; otherwise the
system concludes that there is no matching item in the database. Based on the matching results, the
performance of an audio fingerprinting system is measured as the fraction of correct matches
out of all the test queries. Most systems report this recognition rate as their evaluation
result.
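The threshold decision and the recognition rate can be sketched as below. The similarity scores, the threshold of 0.6 and the track ids are invented for the example; how similarity is derived from distance is left open, as in the text.

```python
def decide_match(similarities, threshold):
    """Return the best reference id if its similarity clears the threshold."""
    if not similarities:
        return None
    best_id = max(similarities, key=similarities.get)
    return best_id if similarities[best_id] > threshold else None

def recognition_rate(predictions, ground_truth):
    """Fraction of queries whose returned match is correct."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Hypothetical similarity scores for three queries
predictions = [
    decide_match({"track_a": 0.91, "track_b": 0.40}, threshold=0.6),
    decide_match({"track_a": 0.35, "track_b": 0.30}, threshold=0.6),
    decide_match({"track_a": 0.20, "track_b": 0.75}, threshold=0.6),
]
rate = recognition_rate(predictions, ["track_a", None, "track_a"])
```

A query with no entry in the database counts as correct only when the system returns no match.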
Identification of Sound
Speech recognition is the process of converting a speech signal to the corresponding sequence of words [21].
It has been implemented on mobile devices, computers and the cloud [34]. It is also known as
automatic speech recognition. A general speech recognition system is illustrated in Figure 1.1. The
acoustic model describes the probabilistic relationship between the audio signal and phonemes, which are
the basic units of speech. It is calculated from a training dataset consisting of speech files and their
corresponding transcripts. The lexicon describes how the phonemes make up individual words, and the
language model defines the probability of different combinations of words. Given a speech waveform,
the recognition algorithm collects probability information from these three sources and outputs the
word string with the highest probability. Recently, with the development of smart phones, wearable
devices and virtual reality, the demand for robust speech recognition has increased greatly, requiring
speech recognition to work in much more challenging circumstances. For example, a user may want to
use Siri on an iPhone while driving a car or sitting in a restaurant, where interfering sounds
around the phone may distort the original speech. A traditional speech recognition system will have a
lot of problems in this scenario. As shown in Figure 1.2, the system is trained on clean speech but is later
fed corrupted speech. This mismatch between the training and operating conditions results in
dramatic deterioration of the recognition rate of the speech recognition system.
Integrity Verification
Implementation
Modules: Typically, a speech recognition system can be divided
into two modules:
1) Training the system
2) Testing the system
Training the system: In this module, we train the system using the labelled
dataset. We restrict the speech signals to 5 different styles. We treat each
style as a class and thus have 5 classes. The name of every class (the style
name) serves as the label for all the images corresponding to that class, which
serve as data. Features are then extracted from every speech signal in each of
the classes. We then train on the dataset using a multi-class classifier such as an SVM,
Naive Bayes or a Convolutional Neural Network to obtain a classification model.
This classification model is then used for testing.
Testing the system: In this module, we supply the test dataset to the trained
model. The trained model returns the probability of the test image belonging to
each of the trained classes. We then design a hypothesis to decide the class of the
test image. This determines the speech class that is present in the image. This activity is
done on a real-time system such as a Raspberry Pi for ease of execution.
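One simple form such a decision hypothesis could take is sketched below: pick the most probable class, but reject the result when the model is not confident. The minimum-confidence value of 0.5 and the class names are assumptions; the report does not specify the rule actually used.

```python
def decide_class(probabilities, class_names, min_confidence=0.5):
    """Pick the most probable class, or None if the model is not confident."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < min_confidence:
        return None
    return class_names[best]

# Hypothetical names for the 5 style classes
class_names = ["style_1", "style_2", "style_3", "style_4", "style_5"]
decision = decide_class([0.05, 0.10, 0.70, 0.10, 0.05], class_names)
```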
In this project, we propose to solve the problem of speech recognition by
performing learning-based image classification on the labelled data.
We use the generated dataset to train the system. We restrict the
number of classes to 5. We pre-process the dataset using different image
processing and computer vision techniques to improve the results. The detailed
visualization of the system is shown in Figure 2.
Figure 2: Proposed System
We take every speech recording and generate its spectrum image. We use all the images
in the dataset for training. We propose to improve the quality of the images in the
dataset using image processing and computer vision techniques such as
morphological operations, metric correction etc. We then associate every image
with a class label, so that the dataset contains the images along with their
labels. We send this dataset to a learning algorithm and obtain a trained model. The
test data undergoes the same process of quality improvement and is then sent to
the trained model to obtain the probability of the test image belonging to each
class. We then use these probabilities to decide the class of the speech signal via a designed
hypothesis. This gives us the information about the speech signal present in the
spectrum image.
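The spectrum-image generation step can be sketched in pure Python as a short-time Fourier transform. The frame size of 64 samples, the hop of 32 samples, the Hann window and the 1000 Hz test tone are all assumptions for illustration; a real pipeline would more likely use an FFT routine from a numerical library.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Magnitude spectrogram: one row of frequency-bin magnitudes per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]  # Hann window
    rows = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            X_k = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / frame_size)
                      for n in range(frame_size))
            row.append(abs(X_k))
        rows.append(row)
    return rows

# Hypothetical input: a 1000 Hz tone sampled at 8000 Hz
sampling_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sampling_rate) for n in range(512)]
image = spectrogram(tone)
peak_bin = max(range(len(image[0])), key=lambda k: image[0][k])
peak_hz = peak_bin * sampling_rate / 64
```

Each row is one time frame and each column one frequency bin, so the nested list can be rendered as a grayscale image and passed to an image classifier.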
Features and objectives of the proposed algorithm
1) It recognises speech signals that were not used in training. Thus it can be
regarded as a generalized system.
2) It completes the task at a much faster speed.
3) The algorithmic complexity is polynomial in nature, not exponential.
4) The memory usage is also of polynomial order.
5) The results obtained are efficient.
Software tools and technologies used:
This project encompasses the concepts of data mining, machine learning and
statistics. This project makes heavy use of NumPy, Pandas, and data visualization
libraries. A few of the key concepts that are utilized in the project are discussed
below.
1. Anaconda Python: In this project, we prefer to code the system in Python for its
versatility and compatibility. We propose to use the Anaconda distribution
to develop the Python programs, for the ease of
execution and the bundled libraries that it provides. Anaconda comes with the majority
of commonly used libraries pre-installed, which reduces the burden of installation and
compilation before use. It also comes with various development tools such as Spyder and
Jupyter Notebook for easy debugging of the program.
2. TensorFlow: TensorFlow is an open source software library for high
performance numerical computation. Its flexible architecture allows easy
deployment of computation across a variety of platforms (CPUs, GPUs, TPUs),
and from desktops to clusters of servers to mobile and edge devices. Originally
developed by researchers and engineers from the Google Brain team within
Google’s AI organization, it comes with strong support for machine learning and
deep learning and the flexible numerical computation core is used across many
other scientific domains.
3. Machine Learning: Machine learning is an application of artificial intelligence
(AI) that provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning focuses on the
development of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data, such as
examples, direct experience, or instruction, in order to look for patterns in data and
make better decisions in the future based on the examples that we provide. The
primary aim is to allow computers to learn automatically, without human
intervention or assistance, and adjust their actions accordingly. Machine learning
can be broadly classified into two categories.
a. Supervised machine learning algorithms apply what has been learned in the
past to new data, using labelled examples to predict future events. Starting from
the analysis of a known training dataset, the learning algorithm produces an
inferred function to make predictions about the output values. The system is able
to provide targets for any new input after sufficient training. The learning
algorithm can also compare its output with the correct, intended output and find
errors in order to modify the model accordingly.
b. In contrast, unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labelled. Unsupervised learning
studies how systems can infer a function to describe a hidden structure from
unlabelled data. The system does not determine the right output; instead, it
explores the data and draws inferences that describe hidden structures in it.
In this project, we make use of a supervised machine learning algorithm
(specifically, linear regression) to predict the share price, since share prices from
previous days are available and can be used for training.
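The supervised setting described above can be illustrated with a very small sketch. It is not the model used in this project: it assumes a nearest-centroid classifier, and the two-dimensional feature points and the labels "ring" and "doorbell" are made-up values chosen only for illustration.

```python
# Toy illustration of supervised learning (an assumed example, not the
# project's model): a nearest-centroid classifier learned from labelled points.
from collections import defaultdict
import math


def fit_centroids(samples, labels):
    """Average the feature vectors of each class to get one centroid per label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in zip(samples, labels):
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }


def predict(centroids, point):
    """Assign the label whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))


# Hypothetical labelled training data: two made-up classes.
train_x = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]
train_y = ["ring", "ring", "doorbell", "doorbell"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, (0.1, 0.2)))  # nearest to the "ring" centroid
print(predict(centroids, (0.8, 1.0)))  # nearest to the "doorbell" centroid
```

The labels are what make this supervised: the centroids are computed per class from known answers. An unsupervised method would have to group the same points without being told which class each one belongs to.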
4. Cost function: When building a linear model, we try to minimize the error the
algorithm makes in its predictions. To do so, we choose a function that measures
this error, called the cost function.
Evaluation metrics for classification problems, such as accuracy, are not useful for
regression problems. Instead, we need evaluation metrics designed for comparing
continuous values. Here we use the Root Mean Squared Error (RMSE); other
metrics exist, but RMSE is one of the most widely used.
We write the cost function as
RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
where N is the number of samples, y_i is the observed value and \hat{y}_i is the
predicted value.
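As a minimal sketch, the root mean squared error can be computed in a few lines of Python. The function name rmse and the sample values are illustrative assumptions, not part of the project code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up observed and predicted values for illustration.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the result is sqrt(5 / 3).
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, which is often the desired behaviour for regression.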
5. Gradient Descent: Gradient descent is one of those “greatest hits” algorithms
that can offer a new perspective for solving problems. Unfortunately, it is rarely
taught in undergraduate computer science programs. This section gives an
introduction to the gradient descent algorithm and walks through an example that
demonstrates how gradient descent can be used to solve machine learning
problems such as linear regression.
At a theoretical level, gradient descent is an algorithm that minimizes functions.
Given a function defined by a set of parameters, gradient descent starts with an
initial set of parameter values and iteratively moves toward a set of parameter
values that minimize the function. This iterative minimization is achieved using
calculus, taking steps in the negative direction of the function gradient.
It is sometimes difficult to see how this mathematical explanation translates into a
practical setting, so it is helpful to look at an example. The canonical example when
explaining gradient descent is linear regression.
Formally, this error function looks like:
Lines that fit our data better (where better is defined by our error function) will
result in lower error values. If we minimize this function, we will get the best line
for our data. Since our error function consists of two parameters (m and b) we can
visualize it as a two-dimensional surface.
This is what it looks like for our data set:
Each point in this two-dimensional space represents a line. The height of the
function at each point is the error value for that line. You can see that some lines
yield smaller error values than others (i.e., fit our data better). When we run
gradient descent search, we will start from some location on this surface and move
downhill to find the line with the lowest error.
To run gradient descent on this error function, we first need to compute its
gradient. The gradient will act like a compass and always point us downhill. To
compute it, we will need to differentiate our error function. Since our function is
defined by two parameters (m and b), we will need to compute a partial derivative
for each. These derivatives work out to be:
We can then repeatedly update the initial approximation, moving toward the
minimum, and thereby obtain the values of the coefficients m and b.
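The procedure can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: it assumes the error function is the mean squared error E(m, b) = (1/N) Σ (y_i − (m·x_i + b))², whose partial derivatives are ∂E/∂m = −(2/N) Σ x_i (y_i − (m·x_i + b)) and ∂E/∂b = −(2/N) Σ (y_i − (m·x_i + b)). The learning rate, iteration count and data points are arbitrary illustrative choices.

```python
def mean_squared_error(m, b, points):
    """Assumed error function: average squared vertical distance to the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)


def gradient_step(m, b, points, learning_rate):
    """Move (m, b) one step in the negative direction of the gradient."""
    n = len(points)
    grad_m = sum(-2.0 * x * (y - (m * x + b)) for x, y in points) / n
    grad_b = sum(-2.0 * (y - (m * x + b)) for x, y in points) / n
    return m - learning_rate * grad_m, b - learning_rate * grad_b


def fit_line(points, learning_rate=0.05, iterations=2000, m=0.0, b=0.0):
    """Run gradient descent from the initial guess (m, b)."""
    for _ in range(iterations):
        m, b = gradient_step(m, b, points, learning_rate)
    return m, b


# Made-up points lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

m, b = fit_line(points)
print(round(m, 3), round(b, 3))  # approximately 2.0 and 1.0
```

If the learning rate is too large the updates overshoot and the error grows instead of shrinking, so in practice it is chosen small enough that the error decreases from one iteration to the next.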
6. Logistic regression: In regression analysis, logistic regression estimates the
parameters of a logistic model. More formally, a logistic model is one where the
log-odds of the probability of an event are a linear combination of independent or
predictor variables. The two possible dependent variable values are often labelled
as "0" and "1", which represent outcomes such as pass/fail, win/lose, alive/dead or
healthy/sick. The binary logistic regression model can be generalized to more than
two levels of the dependent variable: categorical outputs with more than two
values are modelled by multinomial logistic regression, and if the multiple
categories are ordered, by ordinal logistic regression, for example the proportional
odds ordinal logistic model.
We generally fit a sigmoid curve by optimizing a cost function, using techniques
such as the gradient descent discussed above.
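A minimal sketch of binary logistic regression with one predictor is given below. It assumes the log-loss (cross-entropy) as the cost function and plain gradient descent as the optimizer. The pass/fail data, the variable names and the hyperparameters are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, iterations=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # For the log-loss, the gradient reduces to (prediction - label).
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical data: hours of study and a fail (0) / pass (1) outcome.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 0.5 + b)
p_high = sigmoid(w * 4.0 + b)
print(p_low < 0.5 < p_high)
```

The fitted score w * x + b is the log-odds of passing, so each extra hour of study adds w to the log-odds rather than a fixed amount to the probability.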
7. SVM: Support Vector Machines (SVMs) are formulated to construct binary
classifiers from a set of labelled training patterns (x_n, y_n) ∈ R^M × {±1},
n = 1, 2, …, N_c, where M is the data dimension and N_c is the number of
samples per class c. Among a set of functions f : R^M → {±1}, SVMs seek the
function f that gives the minimal generalization error [1] [10]. The selection of the
appropriate f is achieved by minimizing an upper bound on the generalization error
while maximizing the margin between the two classes. Therefore, data are
classified according to the sign of
f(x) = \sum_{i=1}^{SV} \alpha_i y_i k(x_i, x) + b
where b is a bias. The optimal hyperplane corresponds to f(x) = 0. SV is the
number of support vectors, which are the training data whose Lagrange multipliers
α_i are nonzero. The kernel k(·,·) is any mathematical function that satisfies
Mercer’s conditions [2]. For pattern recognition, the Radial Basis Function (RBF)
kernel commonly provides the best performance.
The width parameter σ of the RBF kernel is user-defined. Extension of SVMs to
multiclass problems can be done through various approaches [4]. For a C-class
problem, the One-Against-All (OAA) approach, which is the earliest multi-class
implementation, trains C binary SVMs, each of which separates one class from all
the others. Since OAA requires a large training time, the One-Against-One approach
with the DDAG decision function is commonly used [9]. The DDAG uses
C(C − 1)/2 SVMs, each of which separates two classes. Presently, the DDAG is
employed to perform multiclass SVMs.
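The kernel decision rule and the classifier counts for the two multiclass schemes can be sketched as follows. The sketch assumes one common form of the RBF kernel, k(x, z) = exp(−‖x − z‖² / (2σ²)). The support vectors, multipliers and bias are hand-picked hypothetical values rather than the result of training, and the function names are illustrative.

```python
import math


def rbf_kernel(x, z, sigma):
    """RBF kernel, assuming the form exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_value(x, support_vectors, alphas, labels, bias, sigma):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b over the support vectors."""
    return bias + sum(
        alpha * y * rbf_kernel(sv, x, sigma)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )


def classify(x, support_vectors, alphas, labels, bias, sigma):
    """Return +1 or -1 depending on which side of f(x) = 0 the point lies."""
    value = decision_value(x, support_vectors, alphas, labels, bias, sigma)
    return 1 if value >= 0 else -1


def num_binary_svms(num_classes, scheme):
    """Number of binary SVMs required by each multiclass scheme."""
    if scheme == "one-against-all":
        return num_classes
    if scheme == "one-against-one":
        return num_classes * (num_classes - 1) // 2
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical model: values chosen by hand, not obtained from training.
model = dict(
    support_vectors=[(0.0, 0.0), (2.0, 2.0)],
    alphas=[1.0, 1.0],
    labels=[-1, 1],
    bias=0.0,
    sigma=1.0,
)

print(classify((1.8, 1.9), **model))  # point near the +1 support vector
print(classify((0.1, 0.2), **model))  # point near the -1 support vector
print(num_binary_svms(10, "one-against-all"), num_binary_svms(10, "one-against-one"))
```

One-Against-One needs more classifiers than One-Against-All, but each one is trained only on the samples of two classes, which is why its total training time can still be lower.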
Application & Advantages
Speech recognition has several applications and advantages. A few of them are
listed below.
1. Automotive sector: Speech recognition can be used in the design of self-driving
cars, gesture-controlled advanced driver-assistance systems, etc.
2. Consumer electronics sector: Speech recognition makes electronic items such as
fans, ACs and TVs easier to operate, which helps them become part of everyday
life.
3. Transit sector: It can be used in vehicle transit and control.
4. Gaming sector: It provides a richer gaming experience.
5. Hand gesture recognition can be used to unlock smartphones.
6. Speech recognition plays a vital role in the defence sector.
7. It is widely used in home automation.
8. It can be used in sign language interpretation.
Discussion & Future Work
The project investigates different techniques used for speech recognition and
describes its various advantages. It deploys a well-known machine learning
algorithm to solve the problem of speech recognition. The project also proposes to
pre-process the data using different computer vision techniques to improve the
accuracy. The project can find applications in many fields, such as cheque reading
automation and autonomous driving vehicles.
The present approach for hand gesture classification opens the door to multiple
pathways. These include, but are not limited to:
1) Improving the accuracy of classification.
2) Extending the algorithm to detect multiple words in a single spectrum image.
3) Enabling the algorithm to detect speech gesture in any orientation of the spectrum.
4) Usage of techniques such as auto-encoders in classification.
5) Improving the speed of recognition.
References
1. Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M.:
Perception of speech code. Psychol. Rev. 74, 431–461 (1967)
2. Liberman, A.M., Mattingly, I.G.: The motor theory of speech perception
revised. Cognition 21, 1–36 (1985)
3. Cole, R., Fanty, M.: ISOLET (Isolated Letter Speech Recognition), Department
of Computer Science and Engineering, September 12 (1994)
4. Massaro, D.W.: Testing between the TRACE Model and the Fuzzy Logical
Model of Speech perception. Cognitive Psychology, pp. 398–421 (1989)
5. McClelland, J.L., Elman, J.L.:The TRACE model of speech perception.
Cognitive Psychology (1986)
6. Wilson, W., Marslen, M.: Functional parallelism in spoken word-recognition.
Cognition 25, 71–102 (1984)
7. Patel, I.: Speech recognition using HMM with MFCC—an analysis using
frequency spectral decomposition technique. Signal & Image Proc Int J (SIPIJ).
1(2) (2010)
8. Paul, D.B.: Speech Recognition Using Hidden Markov Models. Lincoln Lab. J.
3(1) (1990)
9. Adam, T.B.: Spoken english alphabet recognition with mel frequency cepstral
coefficients and back propagation neural networks. Int J Comput Appl. 42(12),
0975–8887 (2012)
10. Salam, M.S.H., Mohamad, D., Salleh, S.: Malay isolated speech recognition
using neural network: a work in finding number of hidden nodes and learning
parameters. Int Arab J Info Technol 8, 364–371 (2011)
11. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for
spoken word recognition. In: IEEE Transactions on Acoustics, Speech and Signal
Processing, pp. 43–49 (1978)
12. Hinton, G., et al.: Deep neural networks for acoustic modeling in speech
recognition: the shared views of four research groups. In: IEEE Signal Process, pp.
82–97 (2012)
13. Abdel-Hamid, O., et al.: Convolutional neural networks for speech recognition
in IEEE/ACM transactions on audio. Speech and language processing, October,
USA (2014)
14. Hermansky: Perceptual linear predictive (PLP) analysis of speech. J. Acoust.
Soc. Am. 87(4), 1738–52 (1990)
15. Favero R.F.: Compound wavelets: wavelets for speech recognition. In:
International symposium on time-frequency and time-scale analysis, pp. 600– 603,
(1994)
16. Jaitly, N., Hinton, G.: Learning a better representation of speech soundwaves
using restricted boltzmann machines. In: Proc. of ICASSP, pp. 5884–5887 (2011)
17. Sainath T., Weiss, R., Senior, A., Wilson, W., Vinyals O.: Learning the Speech
Front-end with Raw Waveform CLDNNs. In: Interspeech (2015)
18. Dimitri, P., Mathew, M.D., Ronan, C.: Analysis of CNN-based speech
recognition system using raw speech as input. In: Interspeech (2015)
19. Boiman, O., Shechtman, E., Irani, M.: In defense of nearest-neighbor based
image classification. In: CVPR (2008)
20. McCann, S., Lowe, D.G.: Local Naive Bayes nearest neighbor for image
classification. In: CVPR (2012)
21. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. In:
IJCV (2004)
22. Lowe, D.G.: Object recognition from local scale-invariant features.
Proceedings of the international conference on computer vision 2, 1150–1157
(1999)
23. Sakriani, S., Konstantin, M., Satoshi, N., Wolfgang, M.: Incorporating
knowledge sources into statistical speech recognition. Springer Science &
Business Media (2009)
24. Sadaoki, F.: 50 years of Progress in speech and Speaker Recognition Research.
vol. 1, no. 2, November (2005)
25. Davis K.H., Biddulph R., Balashek, S.: Automatic recognition of spoken digits.
J. Acoust. Soc. Am, pp. 637–642 (1952)
26. Olson, H.F., Belar, H.: Phonetic typewriter. J. Acoust. Soc. Am. 28(6), 1072–
1081 (1996)
27. Fry D.B.: Theoretical aspects of mechanical speech recognition. J. Br. Inst.
Radio Eng., pp. 211–299 (1959)
28. Rabiner, L.R., Levinson, S.E., Rosenberg, A.E., Wilpon, J.G.: Speaker
independent recognition of isolated words using clustering techniques. IEEE Trans.
Acoustics, Speech, Signal Proc (1979)
29. Sakoe, H.,: Two level DP matching—a dynamic programming based pattern
matching algorithm for connected word recognition. IEEE Trans. Acoustics,
Speech, Signal Proc., pp. 588–595 (1979)
30. Loizou, P.C., Spanias, A.S.: High-performance alphabet recognition. IEEE
Trans. Speech Audio Proc. 4, 430–445 (1996)
31. Cole, R., Fanty, M., Muthusamy, Y., Gopalakrishnan M.: Speaker-independent
recognition of spoken english letters. In: International Joint Conference on Neural
Networks (IJCNN), pp. 45–51 (1990)
32. Cole, R., Fanty, M.,: Spoken letter recognition. In: Presented at the
Proceedings of the conference on advances in neural information processing
systems Denver, Colorado, United States (1990)
33. Fanty, M., Cole, R.: Spoken Letter Recognition. In: Presented at the
Proceedings of the conference on advances in neural information processing
systems Denver, Colorado, United States (1990)
34. Karnjanadecha, M., Zahorian, S.A.: Signal modeling for high-performance
robust isolated word recognition. IEEE Trans. Speech Audio Proc. 9, 647–654
(2001)
35. Ibrahim, M.D., Ahmad, A.M., Smaon, D.F., Salam M.S.H.: Improved E-set
recognition performance using time-expanded features. In: Presented at the second
national conference on computer graphics and multimedia (CoGRAMM),
Selangor, Malaysia (2004)
Bibliography
[1] MA Abd El-Fattah, Moawad Ibrahim Dessouky, Salah M Diab, and Fathi ElSayed Abd El-Samie. Speech
enhancement using an adaptive wiener filtering approach. Progress In Electromagnetics Research M,
4:167–184, 2008.
[2] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE transactions on
Computers, 100(1):90–93, 1974.
[3] Eric Allamanche, Jürgen Herre, Oliver Hellmuth, Bernhard Fröba, Throsten Kastner, and Markus
Cremer. Content-based identification of audio material using mpeg-7 low level description. In ISMIR,
2001.
[4] Andreas Antoniou. Digital signal processing. McGraw-Hill Toronto, Canada:, 2006.
[5] Shumeet Baluja and Michele Covell. Content fingerprinting using wavelets. Proc. of European
Conference on Visual Media Production (CVMP), 2006.
[6] Shumeet Baluja and Michele Covell. Audio fingerprinting: Combining computer vision & data stream
processing. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International
Conference on, volume 2, pages II– 213. IEEE, 2007.
[7] Michael Berouti, Richard Schwartz, and John Makhoul. Enhancement of speech corrupted by acoustic
noise. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’79., volume
4, pages 208–211. IEEE, 1979.
[8] Thomas L Blum, Douglas F Keislar, James A Wheaton, and Erling H Wold. Method and article of
manufacture for content-based analysis, storage, retrieval, and segmentation of audio information, June
29 1999. US Patent 5,918,223.
[9] Steven Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on
acoustics, speech, and signal processing, 27(2):113–120, 1979.
[10] Judith C Brown. Calculation of a constant q spectral transform. The Journal of the Acoustical Society
of America, 89(1):425–434, 1991.
[11] Judith C Brown and Miller S Puckette. An efficient algorithm for the calculation of a constant q
transform. The Journal of the Acoustical Society of America, 92(5):2698–2701, 1992.
[12] Pedro Cano, E Batle, Ton Kalker, and Jaap Haitsma. A review of algorithms for audio fingerprinting.
In Multimedia Signal Processing, 2002 IEEE Workshop on, pages 169–173. IEEE, 2002.
[13] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. A review of audio fingerprinting. Journal of
VLSI signal processing systems for signal, image and video technology, 41(3):271–284, 2005.
[14] Pedro Cano, Eloi Batlle, Harald Mayer, and Helmut Neuschmied. Robust sound modeling for song
detection in broadcast audio. Proc. AES 112th Int. Conv, pages 1–7, 2002.
[15] Vijay Chandrasekhar, Matt Sharifi, and David A Ross. Survey and evaluation of audio fingerprinting
schemes for mobile query-by-example applications. In ISMIR, volume 20, pages 801–806, 2011.
[16] Jianping Chen and Tiejun Huang. A robust feature extraction algorithm for audio fingerprinting. In
Pacific-Rim Conference on Multimedia, pages 887–890. Springer, 2008.
[17] CHiME. Small vocabulary track. http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html,
2013. [Online; accessed 31-January-2017].
[18] Heidi Christensen, Jon Barker, Ning Ma, and Phil D Green. The chime corpus: A resource and a
challenge for computational hearing in multisource environments. In Interspeech, pages 1918–1921.
Citeseer, 2010.
[19] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. An audio-visual corpus for speech
perception and automatic speech recognition. The Journal of the Acoustical Society of America,
120(5):2421–2424, 2006.
[20] D. Ellis. Robust landmark-based audio fingerprinting.
http://www.ee.columbia.edu/ln/rosa/matlab/fingerprint/, 2009.
[21] Li Deng and Douglas O’Shaughnessy. Speech processing: a dynamic and optimization-oriented
approach. CRC Press, 2003.
[22] Dan Ellis. Robust landmark-based audio fingerprinting. Web resource, available:
http://labrosa.ee.columbia.edu/matlab/fingerprint, 2009.
[23] Yariv Ephraim, Hanoch Lev-Ari, and William JJ Roberts. A brief survey of speech enhancement. The
Electronic Handbook, 2, 2005.
[24] Sébastien Fenet, Gaël Richard, Yves Grenier, et al. A scalable audio fingerprint method with
robustness to pitch-shifting. In ISMIR, pages 121–126, 2011.
[25] Mark John Francis Gales. Model-based techniques for noise robust speech recognition. PhD thesis,
University of Cambridge Cambridge, 1995.
