
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org


Volume 3, Issue 6, November-December 2014

ISSN 2278-6856

Stress Management with Music Therapy


Sougata Das 1 and Ayan Mukherjee 2

1 Senior Systems Engineer, IBM India Private Limited, Kolkata, India
2 Assistant Professor, Dept. of MCA, Brainware Group of Institutions, Barasat, Kolkata, India

Abstract
This paper is a culmination of two dynamic fields of major practical importance: Voice Recognition and Music Therapy. Voice Recognition is under active development by commercial and research organizations all over the world, and is deployed in critical applications such as healthcare and the military as well as everyday applications such as smartphones, microwaves, and biometric security. However, the portion of our speech that is less quantitative in nature is the emotion it carries, and it is this portion that gives speech its dynamism. This paper is an attempt to extract and detect the emotional content of a speech sample. The detection of emotion is independent of any adjective-oriented words spoken; it relies instead on the pitch, tonality, and stress placed on particular words.

Keywords: DCT, DFT, FFT, MFCC, Cepstrum, Mel Scale, Mel Spectrum, Hamming Window, Music Therapy, Emotion Detection.

1. INTRODUCTION
1.1 History of Music Therapy
Music Therapy [6] is the systematic application of music in the treatment of the physiological and psychosocial aspects of an illness or disability. It focuses on the acquisition of non-musical skills and behaviors, as determined by a board-certified music therapist through systematic assessment and treatment planning. It is therefore an allied health profession and one of the expressive therapies, consisting of an interpersonal process in which a certified music therapist uses music and all of its facets (physical, emotional, mental, social, aesthetic, and spiritual) to help clients improve or maintain their health. Music therapy in the United States of America began in the late 18th century, although the use of music as a healing medium dates back to ancient times, as is evident in biblical scriptures and in the historical writings of ancient civilizations such as Egypt, China, India, Greece, and Rome. Today the power of music remains the same, but music is used much differently than it was in ancient times. The profession of music therapy in the United States began to develop during World War I and World War II, when music was used in Veterans Administration hospitals as an intervention to address traumatic war injuries. Veterans engaged, actively and passively, in music activities that focused on relieving the perception of pain.

Numerous doctors and nurses witnessed the effect music had on veterans' psychological, physiological, cognitive, and emotional state. Since then, colleges and universities have developed programs to train musicians in using music for therapeutic purposes. In 1950, a professional organization was formed by music therapists who worked with veterans and with intellectually disabled, hearing- or visually-impaired, and psychiatric populations. This was the birth of the National Association for Music Therapy (NAMT). In 1998, NAMT joined forces with another music therapy organization to become what is now known as the American Music Therapy Association (AMTA).
1.2 History of Voice Recognition
Voice recognition, or more accurately speech recognition, is a technology developed to remove the need for typing or writing by introducing the human voice as an input. Speech recognition was first developed at the institution that is the mother of many computer-oriented inventions, Bell Labs. The first voice-operated system, AUDREY, was built in 1952 and had the ability to identify spoken digits. A decade later, IBM, another powerhouse of innovation, demonstrated the Shoebox, voice recognition software able to recognize 16 spoken English words. The idea of voice, or rather speech, recognition gradually spread far and wide, and laboratories in the United States, Japan, England, and the Soviet Union were developing voice recognition systems along with dedicated hardware to support them. However modest these efforts might sound today, they were an impressive beginning, given that computation at the time was quite primitive. In the 1970s, the US Department of Defense took an interest in voice recognition, and a Speech Understanding Research (SUR) cell was formed. Amid research work going on all over the world, Carnegie Mellon University developed the HARPY speech understanding system, capable of recognizing 1,011 words, roughly the average vocabulary of a three-year-old child. The most interesting point of HARPY was its search technique, a heuristic algorithm called beam search, which provided an optimal solution. By the end of the 1970s, voice recognition had gone from handling a single voice to identifying and operating on the voices of multiple people. Also, owing to developments in linguistics dedicated to speech recognition, systems went from a few hundred words to a few thousand words, and potentially the ability to recognize unlimited words.

2. PROBLEM STATEMENT
This project aims to detect a negative emotion in a given voice sample and to determine which music should be applied as therapy to the subject. The steps, sketched in code below, are:
Voice Recording: at the application level, a voice sample is recorded and fed in as input.
Detection of Emotion: the program outputs the emotion detected in the sample.
Final Result: on the basis of the detected emotion, a mapping function determines which music will be used as therapy on the subject.
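
As a minimal sketch of these three stages, the skeleton below uses hypothetical helper names; it is not the paper's actual implementation. The signal processing behind the detection stage is described step by step in the Process Description section.

```python
# Hypothetical three-stage pipeline skeleton (illustrative only).

def record_voice(path: str) -> str:
    """Stage 1: the voice sample is recorded (here, read from a stored file)."""
    return path

def detect_emotion(wav_path: str) -> str:
    """Stage 2: classify the emotional content of the sample.
    A real implementation would use the MFCC pipeline of Section 6."""
    raise NotImplementedError  # placeholder

def select_raga(emotion: str) -> str:
    """Stage 3: map the detected emotion to the raga used as therapy (Table 1)."""
    raise NotImplementedError  # placeholder

def run_session(path: str) -> str:
    return select_raga(detect_emotion(record_voice(path)))
```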

3. OBJECTIVES
The objective of this project is stress management using music therapy; the innovation is to apply the therapy using an automated method of detecting stress in a voice sample. The primary objective and motivation of this project is to reduce stress-related problems by utilizing music therapy. The ragas involved in the therapy happen to produce a positive effect on the subject and reduce the temporary negative emotions present.
4. BENEFITS
It is an effective automated system that determines the negative emotion and helps keep the subject from going into a negative psychological state. It removes the need for human intervention in the process, can reduce mortality by lowering the chances of deaths related to psychiatric issues, and can be implemented on devices such as smartphones, making it available to everyone.

5. CONTROL FLOW DIAGRAM
5.1 Block Diagram

Figure 1: Block Diagram

5.2 Flow Diagram

Figure 2: Flow Diagram
6. PROCESS DESCRIPTION
Stress Management [1] using Music Therapy is the culmination of two different well-known fields: it combines Music Therapy and Voice Recognition, with the added ability to detect emotional content in a given voice/speech sample. The entire system is divided into two sections. One section deals with the detection of emotion from the given voice/speech sample; the other maps the detected emotion to the raga that can be used to calm it.
6.1 Framing
A given speech sample is generally not stationary over time; no consistency can be found in its waveform as a whole. In short-term analysis of signals, however, if we consider a sufficiently short interval, the wave within it can be treated as stationary. The non-uniformity of the speech waveform is due to the movement of the speech articulators (the lips, jaw, tongue, etc.), since the change in the voice spectrum depends directly on the rate of change of articulation. We therefore crop the speech waveform into frames, removing any extra silence or acoustic interference that may be present at the start or end of the file.
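
As an illustration of this step, here is a short framing sketch in Python/NumPy. The frame length and hop size are common illustrative values, not parameters given in this paper.

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D speech signal into short overlapping frames.

    Within a short interval the waveform can be treated as stationary.
    Defaults correspond to 25 ms frames with a 10 ms hop at 16 kHz;
    these are common choices, not values taken from this paper.
    """
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.zeros((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = signal[i * hop : i * hop + frame_len]
    return frames
```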
6.2 Windowing
The frames are then processed to remove signal discontinuities at the beginning and end of each frame. The idea is to minimize spectral distortion by using a window to taper the signal to zero at both ends of the frame. In other words, when we perform a Fourier transform, it assumes the signal repeats, and the end of one frame does not connect smoothly with the beginning of the next one; this introduces glitches at regular intervals, so the ends of each frame must be made smooth enough to connect with each other. Preparing the windows requires a soft window function, and the Hamming window is the most widely preferred choice because it tapers the signal smoothly at both ends.
The Hamming window function is:
w(n) = 0.53836 - 0.46164 cos(2πn/(N-1)), n = 0, 1, ..., N-1 (1)
where
w(n) = the Hamming window function,
n = the index of a sample within the frame, and
N = the total number of samples in the frame.
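
A sketch of applying equation (1) to each frame, using NumPy and the coefficients from the equation above:

```python
import numpy as np

def hamming_window(N: int) -> np.ndarray:
    """Hamming window of equation (1): w(n) = 0.53836 - 0.46164*cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (N - 1))

# Multiplying each frame by the window tapers it toward zero at both
# ends, suppressing edge discontinuities before the Fourier transform:
# windowed = frames * hamming_window(frames.shape[1])
```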
6.3 Fourier Transformation
Each frame is now processed by a Fourier transformation (the Discrete Fourier Transform), which converts it from the time domain to the frequency domain: the frames were obtained as amplitude against time, and the transform re-expresses them as a spectrum against frequency. The transformation is implemented using the Fast Fourier Transform (FFT) algorithm, a fast method of computing the Discrete Fourier Transform (DFT), which is defined on a set of N samples {x_k} as follows:

X_n = Σ(k=0..N-1) x_k · e^(-2πikn/N), n = 0, 1, 2, ..., N-1 (2)
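
A sketch of this stage using NumPy's FFT; the FFT size of 512 is an assumed value, not one given in this paper.

```python
import numpy as np

def power_spectrum(windowed_frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """DFT of each windowed frame via the FFT (equation (2)), returning
    the periodogram-style power spectrum kept for the Mel stage."""
    spectrum = np.fft.rfft(windowed_frames, n=n_fft, axis=1)  # one-sided DFT
    return (np.abs(spectrum) ** 2) / n_fft
```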

6.4 Mel-Frequency Wrapping
The spectrum obtained in the frequency domain is now taken as the input of this stage. The signal is mapped onto the Mel scale [3] to mimic human hearing, since human perception of the frequency content of speech sounds does not follow a linear scale. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The Mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz; as a reference point, the pitch of a 1 kHz tone 40 dB above the perceptual hearing threshold is defined as 1000 Mels. We can therefore use the following approximate formula to compute the Mel value of a given frequency:
Mel(f) = 2595 * log10(1 + f/700) (3)
The approach to simulating the subjective spectrum is to use a filter bank with one filter for each desired Mel-frequency component. Each filter has a triangular band-pass frequency response, with spacing and bandwidth determined by a constant Mel-frequency interval. The Mel-scale filter bank is thus a series of L triangular band-pass filters designed to simulate the band-pass filtering believed to occur in the auditory system, corresponding to a series of band-pass filters with constant bandwidth and spacing on the Mel frequency scale.
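
A sketch of equation (3) and of a triangular Mel filter bank. The filter count, FFT size, and sample rate are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def hz_to_mel(f):
    """Equation (3)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of equation (3)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int = 26, n_fft: int = 512, sr: int = 16000) -> np.ndarray:
    """Triangular band-pass filters spaced by a constant Mel interval.

    Centre frequencies are equally spaced on the Mel scale and mapped
    back to FFT bin indices; each filter rises linearly to its centre
    and falls linearly to the next centre.
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank
```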
6.5 Cepstrum
The name Cepstrum [4] is derived from the word Spectrum by reversing its first four letters: "spec" becomes "ceps". The cepstrum is the Fourier transform of the logarithm of the Fourier transform of the windowed signal:
Cepstrum = FT(log(FT(windowed signal)) + j2πm) (4)
The real cepstrum uses the real logarithm function, whereas the complex cepstrum uses the complex logarithm function (the j2πm term accounts for phase unwrapping). The real cepstrum uses only the magnitude information of the spectrum, while the complex cepstrum holds information about both the magnitude and the phase of the initial spectrum, which allows reconstruction of the signal. The cepstral representation of the speech spectrum provides a very good description of the local spectral properties of the signal for a given analysis frame.
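
A sketch of the real cepstrum of one windowed frame. Note that the final step here uses the inverse transform, a common convention equivalent in spirit to the definition in equation (4).

```python
import numpy as np

def real_cepstrum(windowed_frame: np.ndarray) -> np.ndarray:
    """Real cepstrum: (inverse) Fourier transform of the log-magnitude
    spectrum. Phase is discarded, so the signal is not reconstructible;
    the complex cepstrum would keep it via the complex logarithm."""
    spectrum = np.fft.fft(windowed_frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small offset avoids log(0)
    return np.fft.ifft(log_mag).real
```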


Figure 3: Cepstral Representation

6.6 Mel Frequency Cepstrum Coefficients (MFCC)
Mel Frequency Cepstrum Coefficients represent audio on the basis of perception. They were developed by Paul Mermelstein, building on an idea proposed by Bridle and Brown. The method generates a 20-dimensional feature matrix from the signal, and our algorithm uses these values to deduce which emotion is present in the sample. Algorithmically, the concept of the cepstrum [2] is presented in the form of a block diagram: the flow chart below describes how a cepstrum is obtained from a signal.

Figure 4: Flow chart of Cepstrum

The MFCCs are the amplitudes of the resulting spectrum. This procedure is represented step by step in the figure below.

Figure 5: MFCC Flow Chart

The values of the cepstrum are then converted from the frequency domain back to the time domain using the Discrete Cosine Transform. Thus we can calculate the MFCCs as:

c_n = Σ(k=1..K) log(S_k) · cos[n(k - 1/2)π/K], n = 1, 2, ..., K (5)

where S_k is the output of the k-th Mel filter and K is the number of filters.
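
Putting the stages together, here is a sketch of computing 20 MFCCs per frame from the Mel filter-bank energies via the DCT of equation (5). It assumes the power_spectrum() and mel_filterbank() sketches given earlier.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(power_spec: np.ndarray, fbank: np.ndarray, n_coeffs: int = 20) -> np.ndarray:
    """power_spec: (n_frames, n_fft//2 + 1); fbank: (n_filters, n_fft//2 + 1).
    Returns an (n_frames, n_coeffs) matrix of cepstral coefficients."""
    mel_energies = power_spec @ fbank.T           # triangular Mel filtering
    log_energies = np.log(mel_energies + 1e-10)   # log compression
    # The DCT moves the log Mel spectrum back toward the time (quefrency)
    # domain; the first n_coeffs amplitudes are the MFCCs.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```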

7. TEST RESULTS
On the basis of the value generated, we play the appropriate raga file. A mapping table is provided, drawing on the MUSIC THERAPY literature [8], which lists the conditions that a particular music file can be used to treat.

Table 1: Classification of various moods according to the ragas

Mood            Raga
Sad             Kafi
Depression      Kapi
Hypertension    Bageshri
Anger           Sahana
Fear            Mishra Mand

The following set of figures shows, for each mood, the plot of the signal on the left and its power spectrum on the right.

Figure 6: Fear
Figure 7: Anger
Figure 8: Depression
Figure 9: Hypertension
Figure 10: Data store for sad mood
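
A sketch of the final mapping step, built directly from the mood-raga pairs of Table 1; the dictionary keys and function name are illustrative, and audio playback itself is left abstract.

```python
# Mood-to-raga mapping taken from Table 1.
RAGA_FOR_MOOD = {
    "sad": "Kafi",
    "depression": "Kapi",
    "hypertension": "Bageshri",
    "anger": "Sahana",
    "fear": "Mishra Mand",
}

def raga_for(mood: str) -> str:
    """Return the raga prescribed for a detected negative mood."""
    return RAGA_FOR_MOOD[mood.lower()]
```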

8. CONCLUSION
The project is an example of two distinct fields, Computer Science and paramedicine, merged into a single one, and though it is at a nascent stage with a very narrow production scope, it points to a very promising field. It provides a solution to stress management, which is very prevalent in the modern world, and if the project is taken up at a higher level it can also be applied commercially. The technique used to emulate human hearing and perception, the Mel Frequency Cepstrum Coefficient, is quite prone to noise, and even after noise reduction the results are sometimes limited because the emotion cannot always be detected properly; however, with proper scales and further research it should be possible to detect the correct emotion with much greater accuracy. Even with these limitations, we have tried our best to produce satisfactory results and to build a solution that maps negative emotions to the ragas that can be used to relieve them.

9. LIMITATIONS AND SCOPE FOR FUTURE WORK
Firstly, the application is still at a nascent stage, and due to hardware constraints we had to work on stored audio files instead of real-time recordings. Secondly, highly noisy audio input produces deviating results rather than the correct one. Thirdly, not all emotions can currently be detected, owing to the unavailability of source audio files for every exact mood. In future, the system could be implemented as a smartphone app or web application, the range of detectable emotions could be extended using recordings by professional actors, and a professional database application could be used for efficient storage and retrieval of data.

References
[1] Kumar, Ch. Srinivasan (2011), "Design of an Automatic Speaker Recognition System".
[2] Neiberg, Daniel (2006), "Emotion Recognition in Spontaneous Speech Using GMMs".
[3] Cornaz, Christian (2003), "An Automatic Speaker Recognition System".
[4] Tiwari, Vibha (2010), "MFCC and Its Applications in Speaker Recognition".
[5] Sairam, T. V., "Music and Moods 2". [Online] http://ayurvedaforyou.com/music/musicandmoods2.html (Sept 13, 2013)
[6] "Music Therapy". [Online] http://en.wikipedia.org/wiki/Music_therapy (Aug 21, 2013)
[7] Mahesh, Anuradha, "Music Therapy for Wellness". [Online] http://anuradhamahesh.wordpress.com/music-therapy/ (Sept 20, 2013)
[8] "Raga Therapy for Healing Mind and Body". [Online] http://www.medindia.net/patients/patientinfo/raga-therapy-for-healing-mind-and-bodyhealing-ragas_print.htm (Aug 30, 2013)
[9] "Music as Medicine". [Online] http://www.musicasmedicine.com/about/history.cfm (Aug 29, 2013)

