
Urban Audio Classification Using Artificial Neural Network

1st Amit Kishor Raturi
Dept. of Computer Science and Engineering
NIT Uttarakhand
Srinagar Garhwal, India
Amitkishorraturi.cse16@nituk.ac.in

2nd Dhruv Verma
Dept. of Computer Science and Engineering
NIT Uttarakhand
Srinagar Garhwal, India
Dhruv.cse16@nituk.ac.in

Abstract—This paper considers the use of Artificial Neural Networks in classifying short audio clips of environmental and urban sounds. We propose the use of Artificial Neural Networks to overcome the loss of accuracy that data scarcity causes in the performance of previously proposed Convolutional Neural Network architectures. Using optimal activation functions, a fully connected network with 3 hidden layers is trained on 8,732 labelled sound clips (4 seconds each) from ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. These types of sounds are very common in daily life, so classification over this dataset of environmental and urban sounds opens up future possibilities in city management and safety operations, especially in smart city engineering. Audio features, which include noise, are extracted using the librosa implementation. The accuracy of the network is evaluated on the Urban8K dataset of environmental and urban recordings.

Index Terms—environmental sound, artificial neural network, mel-frequency cepstral coefficients

I. INTRODUCTION

Sound recognition is a basic audio signal processing problem, with applications ranging from context-aware computing [1] and surveillance [2] to noise mitigation enabled by smart acoustic sensor networks [3], automated navigation, assistive robotics, and other mobile-phone-based services, where context-aware processing is frequently desired or required. We humans use both vision and audio recognition in our day-to-day activities, while machines still rely primarily on vision recognition and audio recognition remains a largely open field. Since research has so far mainly focused on more homogeneous types of audio data, here we have taken urban sound clips, which contain frequent frequency changes and a lot of surrounding noise.

Fig. 1. Visualization of audio data.

When used to understand unstructured environments [4], [5] (e.g., determining interior or exterior locations [6], [7]), a machine's usefulness can be threatened by the loss of its visual recognition ability, and in such conditions audio recognition is very promising for increasing the capability of the machine. There has been ongoing interest in finding approaches to give hearing to mobile robots [8], [9] so as to improve their context awareness with sound data. Other applications include those in the area of wearables and context-aware computing [10], [11], e.g., the design of a mobile device that can automatically change its notification mode based on knowledge of the user's environment, such as switching to silent mode in a theatre or classroom [10], or even provide information tailored to the user's location [12]. Research in general audio environment recognition has attracted some interest in the last few years [13], [14], but the activity is significantly less compared with that for speech or music. Automatic classification of unstructured environments is still in its early stages. A few areas of non-speech sound recognition that have been studied to varying degrees are those relating to the recognition of specific events using sound from carefully produced film or TV tracks [15], [16]. Others include discrimination between musical instruments [17], [18], music genres [19], and between varieties of speech, non-speech and music [20]-[22]. Since very few systems have so far been proposed to model raw environmental audio, in this paper, instead of classifying discrete voice events, we try to classify environmental and urban sounds.
II. LITERATURE SURVEY

Compared with other areas of audio, such as speech or music, research on general unstructured sound-based scene recognition has received little attention. As far as we know, only a few frameworks and systems have been proposed to perform environmental classification with raw sound. Sound-based situation analysis has been considered in [23], [24] and, for wearables and context-aware applications, in [11], [25]. Because of the randomness, high variability, and other difficulties of working with environmental sounds, recognition rates fall quickly with an increasing number of classes; representative results show recognition accuracy limited to around 92% for five classes [8], 77% for 11 classes [26], and roughly 60% for 13 or more classes [23], [24]. The study of sound environments in Peltonen's thesis [24] presented two classification schemes. The first relied on averaging the band-energy ratio as features and classifying them with a k-nearest neighbour (kNN) classifier. The second used MFCCs as features and a Gaussian mixture model (GMM) classifier. Peltonen observed the weaknesses of MFCCs for natural sounds and proposed using the band-energy ratio as a way to represent sounds occurring in different frequency ranges. Both of these studies involved classifying 13 distinct contexts or classes. There has also been some earlier, though quite limited, work on using matching pursuit (MP) for audio classification. The method proposed by Ebenezer et al. [27] demonstrated the use of MP for signal classification. Their framework classified acoustic emissions in a real acoustic tracking system using a modified version of the MP decomposition algorithm. For each class, suitable learning signals were selected, and the time- and frequency-shifted versions of those signals form the class dictionary. After the MP algorithm terminates, the net contribution of correlation coefficients from each class is used as the decision statistic, and the class producing the largest value is chosen. They reported an overall classification rate of 83.0% for the 12-class case. For classification of environmental and urban sounds, the method mainly proposed so far has been the Convolutional Neural Network architecture; Karol J. Piczak's work on environmental sound classification with CNNs is one example [28]. However, the scarcity of labelled data for environmental sound classification had an adverse impact on the final accuracy. Justin Salamon and Juan Pablo Bello's work, Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification [29], addresses this scarcity using data augmentation, and the performance of their model increased significantly, yielding a mean accuracy of 0.79. Since CNNs have so far been the main algorithm used for classifying this type of sound dataset, the goal of this paper is to answer the question: can an Artificial Neural Network be effectively used to classify environmental and urban sound sources?

III. METHODOLOGY

A. Data Preprocessing And Feature Extraction

The raw audio signal is not suitable as direct input to a classifier due to its extremely high dimensionality and the fact that perceptually similar sounds are unlikely to be neighbours in vector space. Thus, a popular approach for feature learning from audio is to convert the signal into a time-frequency representation, a common choice being the mel-spectrogram. We extract log-scaled mel-spectrograms with 40 components (bands) covering the audible frequency range (0-22050 Hz), using a window size of 23 ms (1024 samples at 44.1 kHz) and a hop size of the same duration. We also experimented with a larger number of bands (128), but this did not improve performance, so we kept the lower (and faster to process) resolution of 40 bands.

To extract useful features from the sound data, we have used the Librosa library, which provides several methods to extract different features from the sound clips. The following features have been extracted from the audio clips:
1) melspectrogram: compute a mel-scaled power spectrogram
2) mfcc: Mel-frequency cepstral coefficients
3) chroma_stft: compute a chromagram from a waveform or power spectrogram
4) spectral_contrast: compute spectral contrast
5) tonnetz: compute the tonal centroid features (tonnetz)

Librosa thus provides a convenient way to convert raw sound clips into informative features (along with a class label for each sound clip) that we can feed directly into our classifier. The class label of each sound clip is encoded in the file name; for example, if the file name is 108041-9-0-4.wav then the class label is 9.
Fig. 2. Activation functions utilised in the architecture: (a) the logistic sigmoid σ(x); (b) the hyperbolic tangent tanh(x).

Fig. 3. Proposed architecture of the classifier for urban sound classification (input layer x1...x193, hidden layers of 280, 300 and 250 units, output layer ŷ1...ŷ10).

B. Network Architecture

A typical neural network classifier consists of a number of different layers stacked together along with different activation functions. An activation function (also known as a transfer function) determines the relation between input and output, and normalizes the output value to a certain range. There are a number of activation functions in use with neural networks; they may be linear or non-linear. Some of the widely used activation functions are as follows.

a) Logistic sigmoid function: The sigmoid function has an S-shaped graph and is the most common form of activation function for defining the output of a neuron. The sigmoid function is strictly increasing, continuous and differentiable, and exhibits a graceful balance between linear and nonlinear behaviour. An example of the sigmoid function is the logistic function, defined by

φ(x) = 1 / (1 + e^(-x))    (1)

where x is the local output of the neuron.

b) Hyperbolic tangent sigmoid function: The hyperbolic tangent sigmoid activation function is a symmetric bipolar activation function, defined by

φ(x) = 2 / (1 + e^(-2x)) - 1    (2)

We have tried the classification of urban sounds with different architectures and have achieved around 78 percent accuracy on the Urban8K dataset. There are in total 5 layers in the proposed architecture, including 3 hidden layers, an input layer and an output layer. There are 280, 300 and 250 units in the hidden layers respectively, as shown below. The first and second layers are hyperbolic-tangent activated while the third layer is sigmoid activated; finally, at the output layer we have used softmax activation for the 10 classes of audio.

Different architectures with different combinations of activation functions have been tried, for example combinations of the Rectified Linear Unit (ReLU), the hyperbolic tangent (tanh) and the sigmoid activation function, out of which the following architecture gave the best performance.

TABLE I
DESCRIPTION OF NEURAL NETWORK

Layer      Neurons   Activation Function
Input      193       -
Hidden 1   280       tanh
Hidden 2   300       sigmoid
Hidden 3   250       sigmoid
Output     10        Softmax
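To make the architecture of Table I concrete, the following Keras sketch builds and trains such a network. The activations follow Table I; the optimizer, batch size and epoch count are assumptions made only for illustration, since the paper does not state them, while the 0.8/0.2 split mirrors Section IV.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

def build_model():
    # Fully connected 193-280-300-250-10 network as in Table I.
    model = Sequential([
        Dense(280, activation='tanh', input_shape=(193,)),  # Hidden 1
        Dense(300, activation='sigmoid'),                   # Hidden 2
        Dense(250, activation='sigmoid'),                   # Hidden 3
        Dense(10, activation='softmax'),                    # Output: 10 urban sound classes
    ])
    # Categorical cross-entropy corresponds to the loss of Eq. (3), generalised to 10 classes.
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

def train(X, y):
    # X: (n_clips, 193) feature matrix, y: (n_clips,) integer labels 0-9,
    # e.g. produced by the extract_features()/label_from_filename() sketch above.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = build_model()
    model.fit(X_tr, to_categorical(y_tr, 10), epochs=100, batch_size=64, verbose=0)
    return model, X_te, y_te

Any equivalent framework could be used; the only constraints taken from the paper are the layer sizes, the activations and the softmax output over the ten classes.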
1) Cost Function: A cost function (or loss function) is a function that maps an event, or the values of one or more variables, onto a real number intuitively representing some "cost" associated with the event. We have used the cross-entropy loss function described below.

Cross-entropy loss: Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, so predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0. With y being the target label and p the predicted probability of belonging to the class labelled y, the cross-entropy loss is mathematically defined as

φ(y) = -(y log(p) + (1 - y) log(1 - p))    (3)

Its variation with the predicted probability is shown in Fig. 4.

Fig. 4. Variation of cross-entropy loss.

IV. RESULTS AND ANALYSIS

We have split the Urban8K dataset into 0.8 (training) and 0.2 (testing), achieving an accuracy of 78%, which is comparatively good considering the small dataset and the absence of a convolutional neural network. In the following subsections we describe the performance measures that have been used and the experiments carried out by tuning hyperparameters such as the learning rate and batch size, present the confusion matrix for the proposed model, and finally compare the performance of our classifier with convolutional neural network implementations for urban sound classification.

1) Performance Measures: Performance measures are a crucial part of a classification task. We have utilised the following performance measures for our multi-class urban audio classification.

Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. The general structure of a confusion matrix is shown in Table II. The rows show the actual class and the columns show the predicted class. The entries of Table II are described below.

TABLE II
GENERAL STRUCTURE OF A CONFUSION MATRIX

                  Predicted positive   Predicted negative
Actual positive          TP                   FN
Actual negative          FP                   TN

a) True Positive (TP): Number of samples that belong to the positive class and are labelled as members of the positive class by the classifier.

b) False Positive (FP): Number of samples that belong to the negative class and are labelled as members of the positive class by the classifier.

c) True Negative (TN): Number of samples that belong to the negative class and are labelled as members of the negative class by the classifier.

d) False Negative (FN): Number of samples that belong to the positive class and are labelled as members of the negative class by the classifier.

Based on the above quantities, the following performance measures are defined for a classification model. Accuracy is the percentage of correct predictions on the test data:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Precision is defined as the ratio of the number of correctly classified positive examples to the total number of examples predicted as positive. High precision indicates that an example labelled as positive is indeed positive (a small number of FP):

Precision = TP / (TP + FP)    (5)

Recall is defined as the ratio of the number of correctly classified positive examples to the total number of positive examples. High recall indicates that the class is correctly recognised (a small number of FN):

Recall = TP / (TP + FN)    (6)

Since we have two measures (precision and recall), it helps to have a measurement that represents both of them. We therefore calculate the F-measure, which uses the harmonic mean in place of the arithmetic mean as it punishes extreme values more:

F-Measure = (2 * Precision * Recall) / (Precision + Recall)    (7)

2) Experimental Results: Urban sound classification with a fully connected neural network achieves competitive accuracy compared to CNN-based urban sound classifiers. We are able to get 78% accuracy with the proposed architecture. The performance measures for our classifier are shown in Table III. The confusion matrix for our experiments with the Urban8K dataset is shown in Fig. 5, and the cost as a function of the number of epochs is shown in Fig. 6.
TABLE III
PERFORMANCE MEASURES FOR PROPOSED CLASSIFIER

Accuracy = 0.783    Precision = 0.731
Recall   = 0.723    F-measure = 0.78
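As an illustration of how the values in Table III and the confusion matrix of Fig. 5 can be obtained, the sketch below applies the measures of equations (4)-(7) to the held-out test set with scikit-learn, reusing the model from the earlier training sketch. Macro-averaging over the ten classes is our assumption; the paper does not state how the per-class precision, recall and F-measure were aggregated.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def evaluate(model, X_test, y_test):
    # Predicted class = index of the largest softmax output.
    y_pred = np.argmax(model.predict(X_test), axis=1)
    return {
        'accuracy': accuracy_score(y_test, y_pred),                     # Eq. (4)
        'precision': precision_score(y_test, y_pred, average='macro'),  # Eq. (5), macro-averaged
        'recall': recall_score(y_test, y_pred, average='macro'),        # Eq. (6), macro-averaged
        'f_measure': f1_score(y_test, y_pred, average='macro'),         # Eq. (7), macro-averaged
        'confusion_matrix': confusion_matrix(y_test, y_pred),           # 10x10 matrix as in Fig. 5
    }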

Fig. 5. Confusion matrix for the urban sound classifier.

Fig. 6. Cost function vs. iterations.

V. CONCLUSION

The goal of this paper was to evaluate whether fully connected neural networks can be successfully applied to environmental sound classification tasks, especially considering the limited nature of the datasets available in this field. It seems that they are indeed a viable solution to this problem. The conducted experiments show that a fully connected neural network model can compete with other common approaches such as convolutional neural networks, where the scarcity of data can be a major issue. We are able to produce an accuracy of 78%. Although, taking into consideration the much longer training times, the result is far from ground-breaking, it shows that fully connected neural networks can be effectively applied to environmental sound classification tasks even with limited datasets. What is more, it is quite likely that a considerable increase in the size of the available dataset would vastly improve the performance of the trained models, as the gap to human accuracy is still profound. One of the possible questions open for future inquiry is whether fully connected neural networks could outperform convolutional neural network approaches.

ACKNOWLEDGMENT

We are highly grateful to Dr. Maroti Deshmukh, Assistant Professor (Dept. of Computer Science & Engineering, NIT Uttarakhand), for his guidance during the course of this research project.

REFERENCES

[1] S. Chu, S. Narayanan, and C.-C. Kuo, "Environmental sound recognition with time-frequency audio features," IEEE Trans. on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142-1158, Aug. 2009.
[2] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'05), New Paltz, NY, USA, Oct. 2005, pp. 158-161.
[3] C. Mydlarz, J. Salamon, and J. P. Bello, "The implementation of low cost urban acoustic monitoring devices," Applied Acoustics, vol. in press, 2016.
[4] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun, "Towards robotic assistants in nursing homes: Challenges and results," Special Issue on Socially Interactive Robots, Robotics and Autonomous Systems, vol. 42, no. 3-4, pp. 271-281, 2003.
[5] S. Thrun, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Haehnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz, "Minerva: A second generation mobile tour-guide robot," in Proc. ICRA, 1999.
[6] H. A. Yanco, "A robotic wheelchair system: Indoor navigation and user interface," in Lecture Notes in Artificial Intelligence: Assistive Technology and Artificial Intelligence. New York: Springer-Verlag, 1998, pp. 256-268.
[7] A. Fod, A. Howard, and M. J. Mataric, "Laser-based people tracking," in Proc. ICRA, 2002.
[8] S. Chu, S. Narayanan, C.-C. J. Kuo, and M. J. Mataric, "Where am I? Scene recognition for mobile robots using audio features," in Proc. ICME, 2006.
[9] J. Huang, "Spatial auditory processing for a hearing robot," in Proc. ICME, 2002.
[10] A. Waibel, H. Steusloff, and R. Stiefelhagen, "CHIL: Computers in the human interaction loop," in Proc. WIAMIS, 2004, and the CHIL Project Consortium.
[11] D. P. W. Ellis and K. Lee, "Minimal-impact audio-based personal archives," in Proc. CARPE, 2004.
[12] J. Mantyjarvi, P. Huuskonen, and J. Himberg, "Collaborative context determination to support mobile terminal applications," IEEE Trans. Wireless Communications, vol. 9, no. 5, pp. 39-45, Oct. 2002.
[13] D. P. W. Ellis, "Prediction-driven computational auditory scene analysis," Ph.D. dissertation, Dept. of Elect. Eng. and Comput. Sci., Mass. Inst. of Technol., Cambridge, MA, Jun. 1996.
[14] J.-J. Aucouturier, B. Defreville, and F. Pachet, "The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music," J. Acoust. Soc. Amer., vol. 122, no. 2, pp. 881-891, Aug. 2007.
[15] R. Cai, L. Lu, A. Hanjalic, H. Zhang, and L.-H. Cai, "A flexible framework for key audio effects detection and auditory context inference," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 1026-1039, May 2006.
[16] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., 2005, pp. 158-161.
[17] P. Cano, M. Koppenberger, S. Le Groux, J. Ricard, N. Wack, and P. Herrera, "Nearest-neighbor generic sound classification with a wordnet-based taxonomy," in Proc. 116th AES Conv., 2004.
[18] P. Herrera, A. Yeterian, and F. Gouyon, "Automatic classification of drum sounds: A comparison of feature selection methods and classification techniques," in Proc. ICMAI, 2002.
[19] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293-302, Jul. 2002.
[20] M. J. Carey, E. S. Parris, and H. Lloyd-Thomas, "A comparison of features for speech, music discrimination," in Proc. ICASSP, 1999.
[21] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/music discrimination for multimedia applications," in Proc. ICASSP, 2000, pp. 149-152.
[22] T. Zhang and C.-C. J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. Speech Audio Process., vol. 9, no. 4, pp. 441-457, May 2001.
[23] A. Eronen, V. Peltonen, J. Tuomi, A. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 321-329, Jan. 2006.
[24] V. Peltonen, "Computational auditory scene recognition," M.S. thesis, Tampere Univ. of Technol., Tampere, Finland, 2001.
[25] B. Clarkson, N. Sawhney, and A. Pentland, "Auditory context awareness via wearable computing," in Proc. Workshop on Perceptual User Interfaces, 1998.
[26] R. G. Malkin and A. Waibel, "Classifying user environment for mobile applications using linear autoencoding of ambient audio," in Proc. ICASSP, 2005.
[27] S. P. Ebenezer, A. Papandreou-Suppappola, and S. B. Suppappola, "Classification of acoustic emissions using modified matching pursuit," EURASIP J. Appl. Signal Process., pp. 347-357, 2004.
[28] K. J. Piczak, "Environmental sound classification with convolutional neural networks," in IEEE International Workshop on Machine Learning for Signal Processing, Sept. 17-20, 2015.
[29] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, accepted November 2016.
