
ARTIFICIAL BANDWIDTH EXTENSION OF SPEECH

COURSE SGN-1650 AND SGN-1656, 2010–2011

In this work, we implement a simple speech bandwidth extension system that converts a narrowband speech signal into a wideband signal. It is recommended to pass the course SGN-4010 Speech Processing Methods before selecting this exercise. You should be familiar with basic (speech) signal processing methods, such as windowing, the Fourier transform, and linear prediction.

1 Introduction

1.1 Artificial bandwidth extension

In digital signal processing, signals are bandlimited with respect to the sampling frequency used. For instance, if a sampling frequency of 8kHz is used, the highest possible frequency component in the signal is 4kHz (the Nyquist frequency). In analog telephone speech, the speech bandwidth has traditionally been limited to 300–3400Hz. Most of the information in speech is below the upper boundary of 3.4kHz, and even though the very low frequencies are not transmitted, the human hearing system can detect the speech fundamental frequency based on the harmonic components present in the signal. For simplicity, the terms narrowband speech and wideband speech are used here to refer to speech signals with bandwidths of 4kHz (sampling frequency 8kHz) and 8kHz (sampling frequency 16kHz), respectively. The amount of information in narrowband speech is smaller compared to wideband speech, and the perceived speech quality is thus lower.

To achieve wideband speech quality without actually transmitting wideband signals, algorithms for artificial bandwidth extension (ABE, BWE) have been developed. These algorithms convert original narrowband signals into artificial wideband signals by estimating the missing high-frequency content based on the existing low-frequency content.

In this work we implement a simple ABE system that utilizes a source-filter model of speech. Each narrowband signal frame is decomposed into a source part and a filter part, and the parts are extended separately. The vocal tract is modeled as an all-pole filter, and the filter coefficients are estimated using linear prediction (LP). The model residual is used as a source signal. The vocal tract model is extended using the most suitable wideband model taken from a codebook, and the residual signal is extended by time domain zero-insertion. The created signal is added to a resampled and delayed version of the original narrowband signal to form an artificial wideband signal.

1.2 Linear prediction of speech

Linear prediction (LP) is one of the most important tools used in digital speech processing. Speech production can be modeled as a source-filter system where a source signal produced by the vocal cords is filtered by a vocal tract filter with resonances at the formant frequencies. For a recap, check www.cs.tut.fi/kurssit/SGN-4010/LP_en.pdf. The vocal tract can be considered to be a pth order all-pole filter 1/A(z):

    1/A(z) = 1 / (1 + a1 z^-1 + ... + ap z^-p)    (1)

where the filter coefficients a1 . . . ap are estimated using linear prediction. Figure 1 illustrates the spectrum of the all-pole filter 1/A(z) estimated from a short speech frame. The thin line represents the amplitude spectrum (absolute value of the discrete Fourier transform, DFT) of the frame. The speech frame Y(z) (now in frequency domain) is formed by filtering the residual signal X(z) by the vocal tract all-pole filter 1/A(z) (remember that convolution/filtering in time domain corresponds to multiplication in frequency domain):

    Y(z) = X(z) / A(z)
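The relation between Y(z), X(z), and A(z) can be checked numerically. The sketch below (Python/NumPy with SciPy, an illustrative stand-in for the Matlab workflow; the polynomial a is an arbitrary stable example, not taken from real speech) filters a random source signal with 1/A(z) and recovers it with the inverse filter A(z):

```python
import numpy as np
from scipy.signal import lfilter

# Example stable LP polynomial A(z) = 1 - 1.2 z^-1 + 0.5 z^-2
a = [1.0, -1.2, 0.5]

rng = np.random.default_rng(0)
x = rng.standard_normal(200)      # source (residual) signal

y = lfilter([1.0], a, x)          # Y(z) = X(z) / A(z): all-pole filtering
x_rec = lfilter(a, [1.0], y)      # X(z) = Y(z) A(z): inverse filtering

# x_rec matches x up to floating-point rounding
```

Filtering by 1/A(z) and then by A(z) is an exact inverse pair, which is why the residual can be recomputed from the frame at any time.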

[Figure: Amplitude and LP spectrum for a Finnish vowel y; x-axis Frequency (Hz), 0–8000; y-axis Amplitude (dB)]

Figure 1: Frame LP amplitude spectrum (thick line) and amplitude spectrum (thin line).
The residual signal X(z) is formed by filtering the frame Y(z) by the vocal tract inverse filter A(z):

    X(z) = Y(z) A(z)

In Matlab, use function lpc to compute LP coefficients of a given order:

    a = lpc(frame,order); % Estimate LP coefficients

For speech coding purposes, the LP polynomial A(z) can be decomposed into line spectral frequencies (LSF). LSFs have good quantization and interpolation properties and are thus widely used in speech coding. The LSF representation of the previous LP spectrum is given in Figure 2. The thick line represents the LP spectrum and the frequency values of the thin lines represent the LSF coefficient values.
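Matlab's lpc is used as a black box here; for reference, the autocorrelation method with the Levinson-Durbin recursion that it is based on can be sketched in Python/NumPy (an illustrative reimplementation, not the exact built-in):

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin
    recursion; returns [1, a1, ..., ap], mimicking Matlab's lpc."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for step i
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

# Usage: recover a known AR(1) model x[n] = 0.9 x[n-1] + e[n]
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
x[0] = e[0]
for n in range(1, len(e)):
    x[n] = 0.9 * x[n - 1] + e[n]
a = lpc(x, 1)   # a[1] should come out close to -0.9
```

A useful property of this recursion is that the resulting A(z) is minimum phase, so the synthesis filter 1/A(z) is stable.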
[Figure: LP spectrum and LSF coefficients for a Finnish vowel y; x-axis Frequency (Hz), 0–8000; y-axis Amplitude (dB)]

Figure 2: Frame LP spectrum (thick line) and corresponding LSF values (thin lines).
A more detailed derivation of LSFs can be found in http://www.cs.tut.fi/sgn/arg/8003102/syn_en.pdf. The idea is to decompose A(z) into two polynomials that have their roots on the unit circle. An LSF coefficient vector corresponding to A(z) of order p consists of p root angle (frequency) values ωi: ω = (ω1, ω2, . . . , ωp). In Matlab, LP-to-LSF and LSF-to-LP conversions can be computed using functions poly2lsf and lsf2poly:

    w = poly2lsf(a); % Convert LP coefficients into LSF coefficients
    a = lsf2poly(w); % Convert LSF coefficients into LP coefficients

Note that the values of vector w are between 0 and π (whereas in Figure 2 the LSF values are scaled to be between 0 and 8kHz).
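Matlab provides poly2lsf ready-made; as a reference for what the conversion does, here is a sketch in Python/NumPy that follows the decomposition described above (form the palindromic sum polynomial and antipalindromic difference polynomial and take their unit-circle root angles; illustrative, not the exact Matlab routine):

```python
import numpy as np

def poly2lsf(a, eps=1e-8):
    """LP -> LSF conversion sketch: decompose A(z) into a sum polynomial
    P(z) and a difference polynomial Q(z) whose roots lie on the unit
    circle; the sorted root angles in (0, pi) are the LSFs."""
    a = np.asarray(a, dtype=float)      # [1, a1, ..., ap]
    az = np.concatenate([a, [0.0]])     # A(z), zero-padded to degree p+1
    rev = az[::-1]                      # z^-(p+1) A(1/z)
    P = az + rev                        # sum (palindromic) polynomial
    Q = az - rev                        # difference (antipalindromic) polynomial
    ang = np.concatenate([np.angle(np.roots(P)), np.angle(np.roots(Q))])
    # Keep one angle per conjugate pair; drop the trivial roots at 0 and pi
    return np.sort(ang[(ang > eps) & (ang < np.pi - eps)])

# Usage: a second-order model yields two increasing LSFs in (0, pi)
w = poly2lsf([1.0, -1.2, 0.5])
```

For this example the two LSFs are arccos(0.85) and arccos(0.35), which illustrates the interleaving of P and Q root angles along the unit circle.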

2 Work assignment

In this work, we build a simple speech bandwidth extension system. The system should read a narrowband speech signal and process it framewise so that each frame is first decomposed into an all-pole filter and a source signal. The filter and the source signal parts are extended separately. The filter part is extended using a codebook built in Section 2.3. The extended parts are combined using filtering to form a frame that contains the artificial high-frequency components missing from the narrowband frame. Extended time domain frames are concatenated using overlap-add (see http://www.cs.tut.fi/kurssit/SGN-4010/ikkunointi_en.pdf). The extended signal is added to an interpolated and delayed version of the original narrowband signal.
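The overlap-add step mentioned above amounts to summing windowed frames at hop-spaced positions. A minimal Python/NumPy sketch of the idea (the function name and the toy frames are illustrative, not part of the assignment code):

```python
import numpy as np

def overlap_add(frames, hop):
    """Join equal-length frames by summing them at positions spaced
    `hop` samples apart (minimal overlap-add)."""
    flen = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + flen)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + flen] += f
    return out

# Two all-ones frames of length 4 with hop 2: the overlapping middle
# samples sum to 2, the non-overlapping edges stay at 1
out = overlap_add([np.ones(4), np.ones(4)], 2)
```

With the 10ms Hanning synthesis window and 5ms hop used later in this exercise, adjacent windows sum approximately to one, so the frame contributions blend smoothly.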

2.1 What to include in the report?

The report of the work should contain the commented Matlab codes of your bandwidth extension system and the answers to the questions, including the plotted figures. Write the whole report in a single file and send it to hanna.silen@tut.fi. Include the names and student numbers of the group members!

2.2 Getting started

The speech files can be downloaded from:

    http://www.speech.cs.cmu.edu/cmu_arctic/packed/cmu_us_slt_arctic-0.95-release.tar.bz2

The package is extracted in Lintula by typing:

    tar -xjf cmu_us_slt_arctic-0.95-release.tar.bz2

The wideband speech files should now be in folder cmu_us_slt_arctic/wav/. Wavefiles arctic_a0001.wav . . . arctic_a0100.wav are used as LSF codebook training data. As test data, any of the wavefiles not included in the training data can be used, e.g. arctic_a0501.wav.

Before starting to build the ABE system, let's go through some basic speech processing Matlab functions. Read the selected wideband test speech signal into Matlab:

    [ywb,fs] = wavread('cmu_us_slt_arctic/wav/arctic_b0001.wav');

and plot the time domain signal and spectrogram:

    figure; plot(ywb);
    figure; specgram(ywb,512,fs,kaiser(500,5),475);

Question 1: What is the sampling frequency of the downloaded wideband signal? What is the highest frequency component of this band-limited signal?

Create a narrowband signal by downsampling the wideband signal:

    ynb = decimate(ywb,2); % Downsample (wideband -> narrowband)

By default, function decimate filters out the high-frequency content before downsampling, thus preventing aliasing. Therefore, in this case you do not need to take care of the anti-aliasing filtering.

Question 2: Plot the created time domain signal and its spectrogram (note the new sampling frequency). What can you say about the frequency content of the narrowband signal compared to the wideband signal? Listen to the signals (soundsc); are there any audible differences between them?

Increase the sampling frequency by upsampling the narrowband signal:

    yus = resample(ynb,2,1); % Upsample signal

Question 3: Plot the time domain signal and spectrogram. Compare the spectrogram to the narrowband spectrogram. How do they differ?

Add a zero after each sample of the narrowband signal:

    yf = zeros(length(ynb)*2,1);
    yf(1:2:end) = ynb;

Question 4: Plot the signal spectrogram and listen to the signal. How has the signal changed?
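The zero-insertion of Question 4 mirrors the low-band spectrum into the high band. The following Python/NumPy experiment shows this with a hypothetical 1kHz test tone in place of speech (illustrative only; in the assignment the same operation is done on the narrowband signal in Matlab):

```python
import numpy as np

fs_nb, f0 = 8000, 1000                  # narrowband rate, test-tone frequency
n = np.arange(fs_nb)                    # one second of narrowband signal
ynb = np.sin(2 * np.pi * f0 * n / fs_nb)

# Zero-insertion: a zero after every sample doubles the sampling rate
yf = np.zeros(2 * len(ynb))
yf[::2] = ynb

# At the new 16kHz rate the spectrum contains the tone at 1kHz and its
# mirror image at 8kHz - 1kHz = 7kHz (bin spacing is 1 Hz here)
spec = np.abs(np.fft.rfft(yf))
top2 = sorted(np.argsort(spec)[-2:].tolist())
```

The two dominant spectral peaks land at bins 1000 and 7000, i.e. the original tone and its mirror about 4kHz, which is exactly the mirrored copy you should see in the spectrogram.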

2.3 LSF codebook construction

The first step in building our ABE system is the construction of the LSF codebook. The codebook stores narrowband-wideband representation pairs, and it is used in the extension phase of Section 2.4 for finding a suitable wideband representation for the spectral envelope based on the known narrowband representation.
[Figure: table of codebook pairs
    NB LSF vector 1    WB LSF vector 1
    NB LSF vector 2    WB LSF vector 2
    ...                ...
    NB LSF vector N    WB LSF vector N]

Figure 3: The LSF codebook consisting of narrowband-wideband representation pairs; a suitable wideband LSF vector is found based on the corresponding narrowband representation.

To construct the LSF codebook, the following processing should be repeated for every training data wavefile. The narrowband signals can be formed by decimating the existing wideband signals (decimate).

Pre-emphasis

Before LP analysis, filter the signals using a pre-emphasis filter. For wideband signals, use the FIR filter H(z) = 1 - 0.95 z^-1:

    ywb = filter([1 -0.95],1,ywb); % Filter signal ywb

The frequency response of the wideband pre-emphasis filter is illustrated in Figure 4.
[Figure: magnitude (dB) and phase (degrees) responses of the filter over the band 0–8000 Hz]

Figure 4: Frequency response of the wideband pre-emphasis lter.
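The pre-emphasis above is a single-tap FIR difference. As an illustrative stand-in for Matlab's filter([1 -0.95],1,x), the same operation can be sketched in Python/NumPy:

```python
import numpy as np

def preemphasis(x, coef=0.95):
    """FIR pre-emphasis y[n] = x[n] - coef*x[n-1], i.e. H(z) = 1 - coef z^-1."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]               # first sample has no predecessor
    y[1:] = x[1:] - coef * x[:-1]
    return y

# A constant (DC) input is strongly attenuated after the first sample,
# matching the high-pass magnitude response in Figure 4
y = preemphasis(np.array([1.0, 1.0, 1.0]))
```

Flattening the spectral tilt this way improves the conditioning of the LP analysis at high frequencies.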


In the bandwidth extension, we are going to need both narrowband and wideband LP spectra. Therefore, use for the narrowband signal a pre-emphasis filter whose DFT on the frequency band 0–4kHz (sampling frequency 8kHz) is identical to the DFT of the wideband filter on the band 0–4kHz (sampling frequency 16kHz). This can be done easily in frequency domain:

    % FFT length for the wideband signal
    nfft = 2^nextpow2(length(ywb));
    % Wideband filter DFT (length nfft)
    % Note: DFT includes frequencies 0-16kHz (sampling frequency 16kHz)
    Hwb = fft([1 -0.95],nfft);
    % Narrowband filter DFT (length nfft/2)
    % Note: DFT includes frequencies 0-8kHz (sampling frequency 8kHz)
    Hnb = Hwb([1:0.25*nfft 0.75*nfft+1:nfft]);
    % Narrowband signal DFT (length nfft/2)
    Ynb = fft(ynb,0.5*nfft);
    % Filtering in time domain corresponds
    % to multiplication in frequency domain
    Ynb = Ynb(:).*Hnb(:);
    % Inverse DFT: frequency domain -> time domain
    ynb = ifft(Ynb,'symmetric');

Windowing

Window the signals using the following 25ms window and no overlapping between adjacent frames:

    awinlen = round(fs*0.025);
    tmp = hanning(fs*0.005);
    awinfun = [tmp(1:length(tmp)/2); ones(awinlen-length(tmp),1); tmp(length(tmp)/2+1:end)];

Note that the sampling frequency (fs) for the narrowband signal is 8kHz and for the wideband signal 16kHz.

LSF computation

Compute framewise LSF coefficients for the narrowband and wideband signals. Start by estimating the LP coefficients (lpc) and convert them further into LSF coefficients (poly2lsf). Use filter order 10 for narrowband and 18 for wideband signals. Repeat the framewise processing for each training wavefile and store the results in two matrices of size Nx10 and Nx18, where the matrix rows correspond to LSF vectors of individual frames, N being the total number of frames in the training data.

LSF clustering

The collected LSF vectors are not used as codebook entries as such. The wideband vectors are clustered using k-means clustering, and the resulting cluster centroids are used as wideband codebook entries. A detailed description of k-means clustering can be found at http://www.cs.tut.fi/jupeto/jht_lectnotes_eng.pdf. Cluster the wideband LSF vectors using Matlab function kmeans:

    [clidx,clcentr] = kmeans(codevec_wb,200);

The number of clusters was now set to 200 and may be varied.
The rows of matrix clcentr (the cluster centroids) are used as the wideband entries of the codebook. Vector clidx contains the cluster index for every original wideband LSF vector. For each cluster, form a mean vector (of size 1x10) of the corresponding narrowband LSF vectors. Use this mean vector as a key for the corresponding cluster. Check that the clustered LSF vectors result in stable LP filters (ω1 < ω2 < . . . < ωp).
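The clustering and key construction described above can be sketched in Python with SciPy's kmeans2 (illustrative only: the matrices here are random toy stand-ins for the real Nx10 and Nx18 LSF matrices, and the cluster count is reduced from the handout's 200):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy stand-ins for the framewise LSF training matrices (N x 10, N x 18)
rng = np.random.default_rng(0)
N = 500
codevec_nb = rng.uniform(0.0, np.pi, size=(N, 10))
codevec_wb = rng.uniform(0.0, np.pi, size=(N, 18))

n_clusters = 8   # the handout uses 200; smaller here for the toy data
clcentr, clidx = kmeans2(codevec_wb, n_clusters, minit='++', seed=1)

# Key for each cluster: mean of the narrowband LSF vectors whose
# wideband counterparts fell into that cluster
keys = np.vstack([codevec_nb[clidx == k].mean(axis=0)
                  for k in range(n_clusters)])
```

The rows of clcentr play the role of the wideband codebook entries and the rows of keys are the narrowband search keys, one per cluster.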

2.4 Extension of a narrowband test signal

Write a Matlab code that extends a given narrowband test signal artificially into a wideband signal. A block diagram of the system is given below.
[Block diagram: Narrowband signal -> LP ANALYSIS -> SOURCE SIGNAL EXTENSION and SPECTRAL ENVELOPE EXTENSION -> WAVEFORM GENERATION; the resampled and delayed narrowband signal is added to the output to form the artificial wideband signal]

Figure 5: Bandwidth extension procedure.


The basic idea is to create a signal that contains the frequencies that are missing from the original narrowband signal. This signal, having its energy mainly on the frequency band 4–8kHz, is then added to an interpolated version of the original narrowband signal that has most of its energy on the band 0–4kHz. Note that the extension causes a delay in the signal, and therefore the resampled narrowband signal must also be delayed to synchronize the signals.

Start the processing by windowing the incoming signal. The frames are decomposed into source signal and filter parts using LP analysis, and the parts are extended separately. The frame waveform is reconstructed by filtering the extended source signal using the extended filter. After scaling, the frames are joined together using overlap-add.

Pre-emphasis

As in the codebook construction, the input signal is filtered using a pre-emphasis filter. Use the same narrowband pre-emphasis filter as before.

Overlap-add

The filtered signal is extended framewise and the extended frames are concatenated using the overlap-add technique. For overlap-add, a 25ms analysis window and a 10ms synthesis window are used. The time difference between adjacent frames is 5ms in both analysis and synthesis. In analysis, use the same window function as earlier (awinfun). In synthesis, reconstruct a 25ms time domain speech frame and use a 10ms Hanning window to extract a segment around the center of the frame. Join the windowed segments using an overlap of 5ms. An example code for overlap-add is available at http://www.cs.tut.fi/kurssit/SGN-4010/ikkunointi_en.pdf. You can modify this code or write your own implementation. Note that the sampling frequency of our ABE system is 8kHz in analysis and 16kHz in synthesis!

Extension of the source and filter parts

Decompose each narrowband frame using LP analysis into source and filter parts. First, compute the all-pole filter coefficients 1, a1, . . . , ap for the frame and then form the model residual signal as explained in Section 1.2. You can operate either in time domain (use filtering) or frequency domain (use multiplication/division). Use the same model order as in the codebook training.

Create a wideband source signal that has its energy mainly on the frequency band 4–8kHz. Use the narrowband source signal as a basis for the wideband source signal:

1. Increase the sampling rate of the narrowband signal using time domain zero-insertion (or spectral mirroring in frequency domain). This will create a signal whose spectrum on the band 4–8kHz is a mirrored copy of the spectrum on the band 0–4kHz (check the spectrogram).

2. Using the signal in (1), create a signal whose energy is mainly on the band 4–8kHz (check the spectrogram). Use this signal as the wideband source signal.

Use the codebook to extend the spectral envelope:

1. Convert the LP coefficients of the current narrowband frame into LSFs.

2. Find the narrowband codebook entry with the minimum Euclidean distance to the current LSF vector. Select the corresponding wideband entry to be used as the wideband spectral envelope representation.

3. Convert the selected wideband LSF coefficients into LP coefficients for waveform synthesis.

Waveform synthesis

Reconstruct the wideband frame by filtering the created wideband source signal with the selected wideband all-pole filter. Scale the frame energy according to the energy of the original narrowband frame. First, compute the energy Ecb of the frequency band 0–4kHz for the frame that results from filtering an interpolated version of the narrowband source signal (sampling frequency of the interpolated signal is 16kHz) by the selected wideband all-pole filter. Then compute the energy Enb of the original narrowband frame (sampling frequency is 8kHz). Multiply each sample of the frame by the scaling factor sqrt(Enb/Ecb). Use overlap-add to join the scaled time domain frames.

Remove the effect of pre-emphasis filtering by filtering the signal with the inverse filter of the wideband pre-emphasis filter H(z): 1/H(z) = 1/(1 - 0.95 z^-1). In Matlab:

    sig = filter(1,[1 -0.95],sig);

Increase the sampling frequency of the original narrowband signal from 8kHz to 16kHz using command resample. Delay the signal with the total delay of your ABE system. Add the resampled and delayed signal to the artificially extended signal. Listen to the resulting signal.

Question 5: Plot the spectrogram of the resulting signal. How does it differ from the spectrograms of the original wideband and narrowband signals? Listen to the signal. Are there audible differences compared to the original narrowband and wideband signals?
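The minimum-Euclidean-distance codebook search in the envelope extension step can be sketched in Python/NumPy (illustrative; the function name and the 2-dimensional toy keys are hypothetical, the real keys are the 1x10 narrowband LSF means):

```python
import numpy as np

def nearest_entry(lsf_nb, keys):
    """Index of the codebook key (row of `keys`) with the minimum
    Euclidean distance to the frame's narrowband LSF vector."""
    d = np.linalg.norm(keys - np.asarray(lsf_nb, dtype=float), axis=1)
    return int(np.argmin(d))

# Toy example with three 2-dimensional keys
keys = np.array([[0.1, 0.2], [0.5, 0.6], [1.0, 1.2]])
idx = nearest_entry([0.48, 0.55], keys)
```

The returned index selects the corresponding wideband centroid row, whose LSFs are then converted back to LP coefficients for the synthesis filter.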
Question 6: What are the most dominant artifacts caused by the extension? Could they be avoided somehow?
