
SINUSOIDAL SYNTHESIS OF SPEECH USING MATLAB

Thesis submitted in partial fulfillment of the requirement of BITS C421T Thesis

By AKSHAY

Under the supervision of Dr. RAHUL SINGHAL, Assistant Professor, EEE Dept., BITS-Pilani



I would like to thank the Almighty first of all for his blessings. I am obliged to Prof B.N. JAIN, Vice Chancellor, Birla Institute of Technology & Science, Pilani, for providing us with a course pattern in which a student gets exposure to projects. I wish to express my deep sense of gratitude to Dr Rahul Singhal, my supervisor for the thesis "Sinusoidal Synthesis of Speech using MATLAB", for providing me this wonderful opportunity to learn about the various parameters associated with speech and about the synthesis of speech from a spectrogram. I would also like to thank him for his constant advice, encouragement and support in the study. I wish to express my gratitude to all the other people, as well as all the websites, that provided me with content for this research work. Last but not least, I would like to thank my parents for their constant support and motivation.

CERTIFICATE

This is to certify that the Thesis entitled "Sinusoidal Synthesis of Speech using Matlab", submitted by Akshay Vijay Jain, ID No. 2009B4A8568P, in partial fulfillment of the requirement of BITS C421T Thesis, embodies the work done by him under my supervision.

Signature of Supervisor Date: 25 November 2013 Dr Rahul Singhal Assistant Professor, EEE Department, BITS PILANI, PILANI CAMPUS

Thesis Abstract

This thesis report discusses the speech signal: how it is stored on a computer, how it is analyzed and how it is synthesized. One way of analyzing a speech signal is the Short Time Fourier Transform, which is discussed in this report along with its parameters. Based on this analysis, we extract a matrix containing the frequencies present in the signal as a function of time. Having obtained this matrix from the spectrogram generated by MATLAB, we then resynthesize the speech signal by sinusoidal addition using MATLAB code.

TABLE OF CONTENTS
1) Introduction
2) Recording of speech signal
3) Analysis of speech signal
   a) Long term frequency analysis
   b) Window sequence
   c) Effect of window
   d) Choice of window
   e) Parameters of Short Term Frequency Spectrum
   f) Time-Frequency domain: spectrogram
   g) Length of window and fundamental frequency
4) Why sinusoids?
5) Additive synthesis
6) Frequency Vs Time matrix from spectrogram in MATLAB
   1. GenerateFreqVsTime Matlab Code
   2. Croplimits Matlab Code
   3. Screenshots
7) Speech signal from Frequency Vs Time matrix in MATLAB
   1. GenerateSoundData Matlab Code
   2. TestAtLevel Matlab Code
8) Conclusion
9) Bibliography/Reference



Speech is an acoustic signal. By that we mean that it is a mechanical wave, that is, an oscillation of pressure transmitted through a solid, liquid or gas, and that it is composed of frequencies within the hearing range. Sound is a sequence of pressure waves that propagates through compressible media such as air or water. (Sound can propagate through solids as well, but there are additional modes of propagation.) Sound that is perceptible by humans has frequencies from about 20 Hz to 20,000 Hz. In air at standard temperature and pressure, the corresponding wavelengths of sound waves range from 17 m to 17 mm. During propagation, waves can be reflected, refracted, or attenuated by the medium.

Figure 1. Typical sound signal


Recording of Speech

Sound recording is an electrical or mechanical inscription of sound waves, such as spoken voice, singing, instrumental music, or sound effects. The two main classes of sound recording technology are analog recording and digital recording.

Acoustic analog recording is achieved by a small microphone diaphragm that detects changes in atmospheric pressure (acoustic sound waves) and records them as a graphic representation of the sound waves on a medium such as a phonograph record (on playback, a stylus senses the grooves on the record). In magnetic tape recording, the sound waves vibrate the microphone diaphragm and are converted into a varying electric current, which an electromagnet then converts to a varying magnetic field. This field stores a representation of the sound as magnetized areas on a plastic tape with a magnetic coating.

Digital recording converts the analog sound signal picked up by the microphone to a digital form by a process of digitization, allowing it to be stored and transmitted by a wider variety of media. Digital recording stores audio as a series of binary numbers representing samples of the amplitude of the audio signal at equal time intervals, at a sample rate high enough to convey all sounds capable of being heard. Digital recordings are considered higher quality than analog recordings not necessarily because they have higher fidelity (wider frequency response or dynamic range), but because the digital format avoids much of the loss of quality found in analog recording, which is caused by noise and electromagnetic interference in playback and by mechanical deterioration or damage to the storage medium. A digital audio signal must be converted back to analog form during playback before it is applied to a loudspeaker or earphones.
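The digitization step can be illustrated with a minimal sketch. It assumes an 8 kHz sample rate and 8 bits per sample (the values passed to audiorecorder later in this report) and uses a 440 Hz sine as a stand-in for the analog signal; the variable names are illustrative only.

```matlab
% Illustrative sketch: sample a tone at equal time intervals and quantize it.
fs = 8000;                          % sampling rate in Hz (assumed)
nbits = 8;                          % bits per sample (assumed)
t = (0:fs-1)/fs;                    % one second of equally spaced sample times
x = 0.5*sin(2*pi*440*t);            % stand-in for the analog signal
q = 2^(nbits-1);                    % 128 quantization steps per unit amplitude
codes = round(x*q);                 % integer codes that would be stored
codes = max(min(codes, q-1), -q);   % clip to the representable range
xhat = codes/q;                     % values recovered on playback
maxErr = max(abs(x - xhat));        % bounded by half a step, 1/(2*q)
fprintf('max quantization error = %.5f (half step = %.5f)\n', maxErr, 1/(2*q));
```

The rounding error of each sample stays within half a quantization step, which is why adding bits per sample lowers the quantization noise.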


Analysis of Speech Signal

The long-term frequency analysis of speech signals yields good information about the overall frequency spectrum of the signal, but no information about the temporal location of those frequencies. Since speech is a very dynamic signal with a time-varying spectrum, it is often insightful to look at frequency spectra of short sections of the speech signal.

a) Long-term frequency analysis

The frequency response of a system is defined as the discrete-time Fourier transform (DTFT) of the system's impulse response h[n]:

$$H(\omega) = \sum_{n=-\infty}^{\infty} h[n]\, e^{-j\omega n}$$

Similarly, for a sequence x[n], its long-term frequency spectrum is defined as the DTFT of the sequence:

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$

Theoretically, we must know the sequence x[n] for all values of n (from $n=-\infty$ until $n=\infty$) in order to compute its frequency spectrum. Fortunately, all terms where x[n] = 0 do not matter in the sum, and therefore an equivalent expression for the sequence's spectrum is

$$X(\omega) = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega n}$$

Here we've assumed that the sequence starts at 0 and is N samples long. This tells us that we can apply the DTFT only to the non-zero samples of x[n], and still obtain the sequence's true spectrum $X(\omega)$. But what is the correct mathematical expression to compute the spectrum over a short section of the sequence, that is, over only part of the nonzero samples of the sequence?

b) Window sequence

It turns out that the mathematically correct way to do that is to multiply the sequence x[n] by a window sequence w[n] that is non-zero only for n = 0, ..., L-1, where L, the length of the window, is smaller than the length N of the sequence x[n]:

$$x_w[n] = x[n]\, w[n]$$

Then we compute the spectrum of the windowed sequence $x_w[n]$ as usual:

$$X_w(\omega) = \sum_{n=0}^{N-1} x_w[n]\, e^{-j\omega n}$$

The following figure illustrates how a window sequence w[n] is applied to the sequence x[n]:

Figure 2 Result of application of windowed sequence to data sequence

As the figure shows, the windowed sequence is shorter in length than the original sequence. So we can further truncate the DTFT of the windowed sequence:

$$X_w(\omega) = \sum_{n=0}^{L-1} x_w[n]\, e^{-j\omega n}$$
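The truncated sum over the window can be evaluated numerically with an FFT, which samples the short-term spectrum at the frequencies 2*pi*k/L. The following is a minimal sketch, not the code used in this thesis: the 500 Hz test tone, the window position and the Hamming window written out from its formula are all assumptions chosen for illustration.

```matlab
fs = 8000;                          % sampling rate in Hz (assumed)
N = 800;                            % length of the full sequence x[n]
n = 0:N-1;
x = sin(2*pi*500*n/fs);             % test sequence: a 500 Hz tone
L = 160;                            % window length, L < N (20 ms at 8 kHz)
start = 200;                        % window location (arbitrary choice)
w = 0.54 - 0.46*cos(2*pi*(0:L-1)/(L-1));   % Hamming window from its formula
xw = x(start+1:start+L) .* w;       % windowed section of L samples
Xw = fft(xw);                       % L samples of the short-term spectrum
[~, k] = max(abs(Xw(1:L/2)));       % strongest bin in the lower half
peakHz = (k-1)*fs/L;                % convert bin index to frequency in Hz
fprintf('peak at %g Hz\n', peakHz);
```

Moving `start` selects a different section of the sequence, and changing `L` changes how long that section is.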

Using this windowing technique, we can select any section of arbitrary length of the input sequence x[n] by choosing the length and location of the window accordingly. The only question that remains is: how does the window sequence w[n] affect the short-term frequency spectrum?

c) Effect of the window

To answer that question, we need to introduce an important property of the Fourier transform. The diagram below illustrates the property graphically:

I. Implementation of an LTI system in the time domain.

II. Equivalent implementation of an LTI system in the frequency domain.


The two implementations of an LTI system are equivalent: they will give the same output for the same input. Hence, convolution in the time domain = multiplication in the frequency domain:

$$y[n] = x[n] * h[n] \;\longleftrightarrow\; Y(\omega) = X(\omega)\, H(\omega)$$
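This equivalence can be checked numerically on short sequences. The sketch below uses arbitrary example values for the input and the impulse response, and zero-pads both FFTs to the length of the linear convolution so that the two results are directly comparable.

```matlab
x = [1 2 3 4];                      % input sequence (arbitrary values)
h = [1 -1 0.5];                     % impulse response (arbitrary values)
yTime = conv(x, h);                 % time-domain implementation
M = numel(x) + numel(h) - 1;        % length of the linear convolution
yFreq = real(ifft(fft(x, M) .* fft(h, M)));   % frequency-domain implementation
maxDiff = max(abs(yTime - yFreq));  % should be at rounding-error level
fprintf('largest difference = %g\n', maxDiff);
```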

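And since the time domain and the frequency domain are each other's dual in the Fourier transform, it is also true that multiplication in the time domain = convolution in the frequency domain:

$$x_w[n] = x[n]\, w[n] \;\longleftrightarrow\; X_w(\omega) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(\theta)\, W(\omega-\theta)\, d\theta$$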

This shows that multiplying the sequence x[n] with the window sequence w[n] in the time domain is equivalent to convolving the spectrum of the sequence, $X(\omega)$, with the spectrum of the window, $W(\omega)$. The result of the convolution of the spectra in the frequency domain is that the spectrum of the sequence is smeared by the spectrum of the window. This is best illustrated by the example in the figure below:


Figure 3 Result of application of window sequence in time and frequency domain

d) Choice of window

Because the window determines the spectrum of the windowed sequence to a great extent, the choice of the window is important. Matlab supports a number of common windows, each with its own strengths and weaknesses. Some common choices of windows are shown below.

Figure 4 Rectangular window sequence


Figure 5 Triangular and Hamming window sequence

All windows share the same characteristics. Their spectrum has a peak, called the main lobe, and ripples to the left and right of the main lobe called the side lobes. The width of the main lobe and the relative height of the side lobes are different for each window. The main lobe width determines how accurately a window is able to resolve different frequencies: wider is less accurate. The side lobe height determines how much spectral leakage the window has. An important thing to realize is that we can't have short-term frequency analysis without a window. Even if we don't explicitly use a window, we are implicitly using a rectangular window.

e) Parameters of the short-term frequency spectrum

Besides the type of window (rectangular, Hamming, etc.), there are two other factors in Matlab that control the short-term frequency spectrum: window length and the number of frequency sample points. The window length controls the fundamental trade-off between time resolution and frequency resolution of the short-term spectrum, irrespective of the window's shape. A long window gives poor time resolution, but good frequency resolution. Conversely, a short window gives good time resolution, but poor frequency resolution. For example, a 250 millisecond window can, roughly speaking, resolve frequency components when they are 4 Hz or more apart (1/0.250 = 4), but it can't tell where in those 250 milliseconds those frequency components occurred. On the other hand, a 10 millisecond window can only resolve frequency components when they are 100 Hz or more apart (1/0.010 = 100), but the uncertainty in time about the location of those frequencies is only 10 milliseconds.

The result of short-term spectral analysis using a long window is referred to as a narrowband spectrum (because a long window has a narrow main lobe), and the result of short-term spectral analysis using a short window is called a wideband spectrum. In short-term spectral analysis of speech, the window length is often chosen with respect to the fundamental period of the speech signal, i.e., the duration of one period of the fundamental frequency. A common choice for the window length is either less than one fundamental period, or more than 2-3 fundamental periods. Examples of narrowband and wideband short-term spectral analysis of speech are given in the figures below:

Figure 6 Wideband and Narrowband analysis of speech

The other factor controlling the short-term spectrum in Matlab is the number of points at which the frequency spectrum $H(\omega)$ is evaluated. The number of points is usually equal to the length of the window. Sometimes a greater number of points is chosen to obtain a smoother looking spectrum. Evaluating $H(\omega)$ at fewer points than the window length is possible, but very rare.
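The two parameters of this subsection can be explored with a short sketch. It assumes a sampling rate of 8 kHz, reuses the 250 ms and 10 ms window durations from the example above, and builds a Hamming-windowed 500 Hz tone as hypothetical test data.

```matlab
fs = 8000;                          % sampling rate in Hz (assumed)
winMs = [250 10];                   % window durations from the text, in ms
winLen = round(winMs/1000*fs);      % window length in samples
deltaF = fs ./ winLen;              % approximate resolution, 1/duration, in Hz
fprintf('%d ms -> %d samples, about %g Hz resolution\n', [winMs; winLen; deltaF]);

% Number of frequency points: asking fft for more points zero-pads the frame
% and samples the same short-term spectrum more densely.
L = winLen(2);                      % the 80-sample (10 ms) window
n = 0:L-1;
xw = sin(2*pi*500*n/fs) .* (0.54 - 0.46*cos(2*pi*n/(L-1)));  % Hamming-windowed tone
X1 = fft(xw);                       % L frequency points
X4 = fft(xw, 4*L);                  % 4*L points (frame zero-padded to 320 samples)
maxDiff = max(abs(X4(1:4:end) - X1));   % every 4th dense point coincides with X1
fprintf('%d vs %d points, difference on shared points = %g\n', ...
        numel(X1), numel(X4), maxDiff);
```

The denser evaluation adds interpolated points between the original ones. It makes the plot smoother but does not improve the frequency resolution, which is still set by the window length.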

f) Time-frequency domain: Spectrogram

An important use of short-term spectral analysis is the short-time Fourier transform or spectrogram of a signal. The spectrogram of a sequence is constructed by computing the short-term spectrum of a windowed version of the sequence, then shifting the window over to a new location and repeating this process until the entire sequence has been analyzed. The whole process is illustrated in the figure below:

Figure 7 Demonstration of making of spectrogram

Together, these short-term spectra (bottom row) make up the spectrogram, and are typically shown in a two-dimensional plot, where the horizontal axis is time, the vertical axis is frequency, and magnitude is the color or intensity of the plot. For example:


Figure 8 A typical spectrogram

The appearance of the spectrogram is controlled by a third parameter: window overlap. Window overlap determines how much the window is shifted between repeated computations of the short-term spectrum. Common choices for window overlap are 50% or 75% of the window length. For example, if the window length is 200 samples and window overlap is 50%, the window would be shifted over 100 samples between each short-term spectrum. If the overlap were 75%, the window would be shifted over 50 samples.

The choice of window overlap depends on the application. When a temporally smooth spectrogram is desirable, window overlap should be 75% or more. When computation should be at a minimum, no overlap or 50% overlap are good choices. If computation is not an issue, you could even compute a new short-term spectrum for every sample of the sequence. In that case, window overlap = window length - 1, and the window would only shift 1 sample between the spectra. But doing so is wasteful when analyzing speech signals, because the spectrum of speech does not change at such a high rate. It is more practical to compute a new spectrum every 20-50 milliseconds, since that is the rate at which the speech spectrum changes.
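The relation between overlap, window shift and the number of short-term spectra can be worked out as follows. This is a sketch under assumed values: an 8 kHz sampling rate and a 5 second signal, with the 200-sample window from the example above.

```matlab
fs = 8000;                          % sampling rate in Hz (assumed)
N = 5*fs;                           % a 5 second signal (assumed)
winLen = 200;                       % window length in samples, as in the text
overlapPct = [0 50 75];             % overlaps discussed above, in percent
overlap = winLen*overlapPct/100;    % overlap in samples
hop = winLen - overlap;             % shift between successive windows
numFrames = floor((N - overlap) ./ hop);   % complete windows that fit in the signal
for k = 1:numel(hop)
    fprintf('%2d%% overlap: shift %3d samples (%.2f ms), %d spectra\n', ...
            overlapPct(k), hop(k), 1000*hop(k)/fs, numFrames(k));
end
```

Going from 50% to 75% overlap roughly doubles the number of spectra to compute, which is the computational cost referred to above.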

g) Length of the window and fundamental frequency


In a wideband spectrogram (i.e., using a window shorter than the fundamental period), the fundamental frequency of the speech signal resolves in time. That means that you can't really tell what the fundamental frequency is by looking at the frequency axis, but you can see energy fluctuations at the rate of the fundamental frequency along the time axis. In a narrowband spectrogram (i.e., using a window 2-3 times the fundamental period), the fundamental frequency resolves in frequency, i.e., you can see it as an energy peak along the frequency axis. See for example the figures below:

Figure 9. Wideband Speech Spectrogram

Figure 10. Narrowband Speech Spectrogram



Why Sinusoids?

In general the goal of modelling a signal is to reduce redundancy and to get a more compact representation of the data. There are different techniques to model a time series, and which technique to apply depends on the signal. Sinusoids are especially suited for modelling speech with harmonic content. Most natural acoustical sounds exhibit this attribute, and the reason for this sinusoidal character can be found in the way speech is produced. The human voice production system consists of two fundamental parts working together, namely the vocal cords (the excitation source) and the pharynx with the mouth and nasal cavities, which act as an acoustical filter. During voiced parts of speech the vocal cords open and close at a certain frequency (the fundamental frequency, f0), modulating the airstream coming from the lungs. The harmonic overtone structure results from the structure of the pharynx, which in a simplified way can be seen as an open tube that lets all overtones f1, ..., fn develop, these being integer multiples of the fundamental f0.


Additive Synthesis

Sine waves can be considered the building blocks of speech. In fact, it was shown in the 19th century by the mathematician Joseph Fourier that any periodic function can be expressed as a series of sinusoids of varying frequencies and amplitudes. This concept of constructing a complex sound out of sinusoidal terms is the basis for additive synthesis, sometimes called Fourier synthesis for the aforementioned reason. In addition, the concepts of additive synthesis have existed since the introduction of the organ, where different pipes of varying pitch are combined to create a sound or timbre. A simple block diagram of the additive form may appear like

Figure 11. Block Diagram representation of Sinusoidal Synthesis

Its mathematical form based on Fourier series will be

$$y(t) = a_0 + \sum_{k=1}^{N} a_k \sin(\omega_k t)$$

where $a_0$ is an offset value for the whole function (typically 0), $a_k$ is the amplitude weighting of the k-th sine term, and $\omega_k$ is its frequency multiplier value. With hundreds of terms, each with its own individual frequency and amplitude weighting, we can design and specify some incredibly complex sounds, especially if we can modulate the parameters over time.
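A minimal additive-synthesis sketch follows. It is an illustration only: the offset `a0`, the amplitude weightings `a`, the frequency multipliers `m` and the 200 Hz fundamental `f0` are assumed example values, and the FFT at the end merely checks that the summed signal contains the requested partials.

```matlab
fs = 8000;                          % sampling rate in Hz (assumed)
t = (0:fs-1)/fs;                    % one second of time samples
f0 = 200;                           % fundamental frequency in Hz (illustrative)
a0 = 0;                             % offset for the whole function
a = [1 0.5 0.25];                   % amplitude weighting of each sine term
m = [1 2 3];                        % frequency multipliers: harmonics of f0
y = a0*ones(size(t));
for k = 1:numel(a)
    y = y + a(k)*sin(2*pi*m(k)*f0*t);   % add one sinusoidal partial
end
Y = abs(fft(y))/(fs/2);             % with 1 s of data, index k+1 holds k Hz
fprintf('amplitudes at 200, 400, 600 Hz: %.2f %.2f %.2f\n', Y(201), Y(401), Y(601));
% soundsc(y, fs);                   % uncomment to listen
```

Making `a` and `m` functions of time, instead of constants, is what allows the timbre to evolve, as speech does.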


6) Frequency Vs Time Matrix from Spectrogram in MATLAB

The frequency content present in speech at a particular instant of time can be determined approximately with the Short Term Fourier Transform (STFT). For our thesis work we use the narrowband spectrogram produced by Matlab, because it gives better frequency resolution with acceptable time resolution. We also tried the wideband spectrogram, but the speech synthesized using information from the wideband spectrogram was very noisy.

First of all, we take the spectrogram of the speech signal with the MATLAB command spectrogram. The plot produced by this command is an RGB image on a decibel scale, in which the intensities above 0 dB are expressed in varying shades of red. We therefore separate the red component from the RGB image. In the separated component we can easily identify the frequencies that had higher intensities in the speech: the pixels corresponding to high-intensity frequencies appear white, the others appear black, and the intermediate ones appear in shades of gray.

The red component is then appropriately cropped and resized to 400 rows, so that every row covers a 10 Hz range, and to a number of columns equal to one hundred times the duration of the speech signal in seconds, so that each column of the image corresponds to 10 milliseconds of speech. It has been found that when we convert the resized image to black and white, by turning gray pixels nearer to white into white and gray pixels nearer to black into black, the quality of the synthesized speech is very near to the original speech. So we produce the black and white image, which corresponds to the Frequency Vs Time graph of the speech signal.
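The mapping from image pixels to frequencies and times can be sketched on synthetic data. Everything below is a hypothetical stand-in: the matrix `gray` imitates a resized red component with values 0 to 255, and a plain comparison is used in place of the Image Processing Toolbox thresholding so that the sketch needs no toolbox. Row 1 is taken to be the 10 Hz row, i.e. the orientation after the image has been flipped.

```matlab
numRows = 400;                      % one row per 10 Hz, as in the text
duration = 2;                       % seconds of speech (assumed)
numCols = 100*duration;             % one column per 10 ms
gray = zeros(numRows, numCols);     % synthetic stand-in for the red component
gray(50, 21:120) = 240;             % a bright trace in row 50, columns 21 to 120
gray(120, :) = 100;                 % a dim trace that the threshold should drop
level = 0.9;                        % threshold as a fraction of full scale
bw = gray > level*255;              % plain comparison standing in for im2bw
[r, c] = find(bw);                  % rows and columns of the retained pixels
freqHz = 10*unique(r);              % row index to frequency in Hz
tStart = 0.01*(min(c) - 1);         % start time of the first retained column
tEnd = 0.01*max(c);                 % end time of the last retained column
fprintf('retained %g Hz from %.2f s to %.2f s\n', freqHz, tStart, tEnd);
```

Only the bright trace survives the threshold, so the resulting matrix states that a 500 Hz component is present between 0.2 s and 1.2 s.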

a) The MATLAB code that records the speech and produces the Frequency Vs Time matrix is as follows

% function GenerateFreqVsTime()
% Record your voice for the requested number of seconds.
f = input('Enter the time in seconds for which you want to record: ');
recObj = audiorecorder(8000, 8, 1);     % 8 kHz, 8 bits, one channel
disp('Start speaking.');
recordblocking(recObj, f);
disp('End of Recording.');
% Play back the recording.
play(recObj);
% Store data in double-precision array.
myRecording = getaudiodata(recObj);
figure(1)
plot(myRecording); title('sound');
% Plot the spectrogram: 1000-sample window, 923 samples overlap, 1024-point FFT.
figure(2)
spectrogram(myRecording, 1000, 923, 1024, 8E3, 'yaxis');
h = gcf;
set(gcf, 'Position', get(0, 'Screensize'));   % Maximize figure.
level = input('Please enter level between 0 and 1: ');
% Save the plot as an image and read it back as an RGB array.
saveas(h, 'spectrogram1.jpg');
fig = imread('spectrogram1.jpg');
figGray = rgb2gray(fig);
figure(9)
imshow(figGray); title('FigGray');
% Keep only the red component.
figRed = fig(:,:,1);
figure(3)
imshow(figRed);
title('figRed');
% Crop to the plot area found by croplimits.
[xmin, ymin, width, height] = croplimits(figRed);
figure(4)
figRedCropped = imcrop(figRed, [xmin ymin width height]);
imshow(figRedCropped); title('figRed Cropped');
% Resize: 400 rows (10 Hz each) and 100 columns per second (10 ms each).
figure(5)
figRedCroppedResized = imresize(figRedCropped, [400 100*f]);
imshow(figRedCroppedResized); title('figRedCroppedResized');
% Flip so that the first row corresponds to the lowest frequency.
figRedCroppedResizedCorrected = flipud(figRedCroppedResized);
% Convert to black and white using the chosen level.
figure(6)
figRedCroppedResizedBW = im2bw(figRedCroppedResized, level);
imshow(figRedCroppedResizedBW); title('figRedCroppedResizedBW');
figure(7)
figRedCroppedResizedBWCorrected = flipud(figRedCroppedResizedBW);
imshow(figRedCroppedResizedBWCorrected);

b) The Matlab code for the croplimits function used in the above code is as follows

function [xmin, ymin, width, height] = croplimits(img)
% Estimate the crop rectangle of the plot area in the saved spectrogram
% image by scanning for non-white (~= 255) pixels near each border.
xmin = 0; r2 = 0; ymin = 0; c2 = 0;
[row, column] = size(img);
midCol = floor(column/2);       % integer index even for odd image sizes
midRow = floor(row/2);

% Top edge: first non-white pixel in the middle column, rows 30 to 90.
for i = 30:90
    if img(i, midCol) ~= 255
        ymin = i + 5;
        break
    end
end

% Bottom edge: scan upwards over the last 120 rows.
count = 0;
for ki = row:-1:row-120
    if img(ki, midCol) ~= 255
        for kj = midCol:midCol+50
            if img(ki, kj) ~= 255
                count = count + 1;
            else
                count = count - 1;
            end
            if count > 0
                r2 = ki;
                break           % leaves the inner (column) loop only
            end
        end
    end
end

% Left edge: scan columns 80 to 180 from left to right.
count = 0;
for j = 80:180
    if img(floor(row/3), j) ~= 255
        for i = midRow:midRow+40
            if img(i, j) ~= 255
                count = count + 1;
            else
                count = count - 1;
            end
        end
        if count > 24
            xmin = j + 8;
            break
        end
    end
end

% Right edge: scan the last 120 columns from right to left.
count = 0;
for j = column:-1:column-120
    if img(midRow, j) ~= 255
        for i = midRow:midRow+100
            if img(i, j) ~= 255
                count = count + 1;
            else
                count = count - 1;
            end
        end
        if count > 0
            c2 = j;
            break
        end
    end
end

height = r2 - ymin + 1;
width = c2 - xmin + 1;
end

c) Screenshots

i. Speech Waveform

Figure 12 Speech Waveform



ii. Spectrogram of above speech using Matlab

Figure 13 Spectrogram of above speech using Matlab

iii. Grayscale Spectrogram

Figure 14 Grayscale Spectrogram



iv. Image of Red component of spectrogram, since the red component represents positive magnitude

Figure 15 Red component of spectrogram


v. Same figure after being cropped by the Matlab function croplimits

Figure 16. Same figure after being cropped by the Matlab function croplimits


vi. Above figure resized by a Matlab function so that each column of pixels corresponds to 10 milliseconds

Figure 17 Resized using Matlab

vii. Above figure inverted so that the first row corresponds to the 10 Hz frequency, the next row to 20 Hz, and the last (400th) row to 4 kHz

Figure 18 Same figure as previous but inverted



viii. Same figure as above with pixels having intensity less than 0.9 set to 0 and all others set to 1

Figure 19 Same figure as above with pixels having intensity less than 0.9 set to 0 and all others set to 1


7) Speech signal from Frequency Vs Time Matrix in MATLAB

Once we have the Frequency Vs Time matrix, we can generate all the frequencies present in a column using the sin function of MATLAB and add them together. We do this for every column, each of which corresponds to 10 milliseconds. We then concatenate the data generated for each column, and the result is the speech signal. The MATLAB code for performing the above series of tasks is as follows

a) GenerateSoundData Matlab Code:
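function sounddata = GenerateSoundData(image)
% Synthesize speech from a Frequency Vs Time matrix:
% row j stands for 10*j Hz, column i for the i-th 10 ms interval.
[row, column] = size(image);
image = double(image);
if max(image(:)) > 1
    image = image/255;                      % scale 8-bit intensities to the range 0..1
end
timeResolution = .01;                       % 10 milliseconds
samplingRate = 8000;                        % 8000 Hz
n = round(timeResolution*samplingRate);     % 80 samples per column
time = (1:n)/samplingRate;
sounddata = zeros(1, n*column);
for i = 1:column
    y = zeros(1, n);
    for j = 10:row-100                      % 100 Hz to 3000 Hz for a 400-row matrix
        y = y + sqrt(image(j,i))*sin(2*pi*time*j*10);
    end
    sounddata(n*(i-1)+1 : n*i) = y;         % place this column's 10 ms of samples
end
sounddata = sounddata';
end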

In this code we generate only frequencies in the range 100 Hz to 3000 Hz, because the frequencies outside this range matter much less for how the speech is heard.

b) TestAtLevel Matlab Code:

function sdata = TestAtLevel(spectrograph, level)
% Threshold the matrix at the given level, synthesize the speech and play it.
bwspectrograph = im2bw(spectrograph, level);
sdata = GenerateSoundData(bwspectrograph);
soundsc(sdata, 8000);
end

To the function TestAtLevel above we pass the matrix figRedCroppedResizedCorrected obtained from GenerateFreqVsTime, along with the level, which specifies the threshold: values below the level are converted to 0, while values greater than the level are converted to 1.
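As a quick sanity check of the row-to-frequency mapping, the synthesis can be run on a hypothetical matrix in which a single row is active. The sketch below is self-contained and does not call the functions above. It repeats the same column-by-column sinusoidal addition on a made-up 400 x 10 matrix with only row 50 set, then locates the strongest frequency in the result.

```matlab
fs = 8000;                              % sampling rate used in the thesis code
samplesPerCol = 80;                     % 10 ms per column at 8 kHz
numCols = 10;
bw = zeros(400, numCols);               % hypothetical Frequency Vs Time matrix
bw(50, :) = 1;                          % only row 50 (500 Hz) is active
t = (1:samplesPerCol)/fs;               % time axis within one column
sdata = zeros(1, samplesPerCol*numCols);
for i = 1:numCols
    y = zeros(1, samplesPerCol);
    for j = find(bw(:, i))'             % active rows in this column
        y = y + sin(2*pi*10*j*t);       % row j contributes a 10*j Hz sinusoid
    end
    sdata(samplesPerCol*(i-1)+1 : samplesPerCol*i) = y;
end
S = abs(fft(sdata));
[~, k] = max(S(1:end/2));               % strongest bin below fs/2
peakHz = (k-1)*fs/numel(sdata);         % bin spacing is 10 Hz for 800 samples
fprintf('dominant frequency: %g Hz\n', peakHz);
```

Because the time axis restarts in every column, frequencies that complete a whole number of cycles in 10 ms (multiples of 100 Hz, such as the 500 Hz used here) join smoothly from one column to the next, while other frequencies have a phase jump at each column boundary.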



The speech waveforms generated with different values of the level used for converting the red component of the spectrogram into a black and white image are shown below, along with their spectrograms

a) Level = 0.8


b) Level = 0.9


c) Level = 0.95




From the above three speech waveforms, it seems that a level of around 0.9 is the best threshold for the red component of the spectrogram generated by Matlab: with it, the speech generated by the Matlab function GenerateSoundData matches the original speech most closely.

8) Conclusion

The sinusoidal model, a framework for modelling speech and music signals, has been presented. Sinusoidal synthesis of speech by extracting frequency and time information from the spectrogram gives acceptable quality of speech. Another strategy would be to decompose the signal into deterministic and stochastic parts and to use different models for the different portions of the speech, as proposed in [5].

9) Bibliography/References
[1] R. McAulay, Th. Quatieri: Speech Analysis/Synthesis Based on a Sinusoidal Representation, in IEEE Transactions on Acoustics, Speech, and Signal Processing, August 1986
[2] J. Smith III, X. Serra: PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation
[3] K. Fitz, L. Haken: On the Use of Time-Frequency Reassignment in Additive Sound Modelling
[4] M. Lagrange, S. Marchand, M. Raspaud, J.-B. Rault: Enhanced Partial Tracking using Linear Prediction, in Proc. of the 6th Int. Conference on Digital Audio Effects (DAFx-03), September 2003
[5] X. Serra: A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition, Thesis, Stanford University, 1989