
Introduction to Signal Processing and Some Applications in Audio Analysis

Md. Khademul Islam Molla
JSPS Research Fellow
Hirose-Minematsu Laboratory
Email: molla@gavo.t.u-tokyo.ac.jp
Outline of the presentation
Basics of discrete time signals
Frequency domain signal analysis
Basic transformations
 Fourier Transform (FT), short-time FT (STFT)
 Wavelet Transform (WT)
 Empirical mode decomposition (EMD), and
 Hilbert spectrum (HS)
Notable comparisons among FT, WT, and HS
Some applications in audio processing
Some open problems to work on
Discrete time signal
It is not possible to process continuous signals digitally
We need to convert them to discrete time signals with a suitable sampling frequency and quantization
The sampling theorem: Fs ≥ 2fc
 where fc is the highest expected signal frequency and Fs is the required sampling frequency
Quantization is required along with sampling
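A minimal NumPy sketch of these two steps (all values here are illustrative, not from the slides):

```python
import numpy as np

fc = 1000.0                     # highest expected signal frequency (Hz), illustrative
Fs = 4 * fc                     # sampling rate chosen above the Nyquist rate 2*fc
t = np.arange(0, 0.01, 1 / Fs)  # discrete time instants
x = np.sin(2 * np.pi * fc * t)  # the sampled (discrete time) signal

# Uniform quantization to 8 bits over the range [-1, 1]
bits = 8
levels = 2 ** bits
xq = np.round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
print(np.max(np.abs(x - xq)))   # error bounded by half a step, 1/(levels - 1)
```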
Discrete time signal

[Figure: signal sampling and signal quantization]
Discrete time signal

[Figure: effects of undersampling]
Discrete time signal

[Figure: effect of sampling at the required frequency]
Discrete time signal
Telephone speech is usually sampled
at 8 kHz to capture up to 4 kHz data
16 kHz is generally regarded as
sufficient for speech recognition and
synthesis
The audio standard is a sample rate
of 44.1 kHz (CD) or 48 kHz (Digital
Audio Tape) to represent frequencies
up to 20 kHz
Discrete time signal

Amplitude: f(x) = 5 cos(x)
Phase: f(x) = 5 cos(x + 3.14)
Frequency: f(x) = 5 cos(3x + 3.14)

[Figure: the three cosines plotted over x ∈ [-10, 10], illustrating changes in amplitude, phase, and frequency]
Time-domain signals
The Independent Variable is Time
The Dependent Variable is the Amplitude
Most of the Information is Hidden in the
Frequency Content
[Figure: time-domain plots of 2 Hz, 10 Hz, and 20 Hz sinusoids, and of their sum 2 Hz + 10 Hz + 20 Hz, over 0-1 s]
Signal Transformation
Why
 To obtain further information from the signal that
is not readily available in the raw signal.
Raw Signal
 Normally the time-domain signal
Processed Signal
 A signal that has been "transformed" by any of the
available mathematical transformations
Fourier Transformation
 The most popular transformation
between time and frequency domains
Frequency domain analysis

Why Frequency Information is Needed
 Be able to see information that is not obvious in the time domain
Types of Frequency Transformation
 Fourier Transform, Hilbert Transform, Short-time Fourier Transform, the Radon Transform, the Wavelet Transform …
Frequency domain analysis
•Powerful & complementary to time domain analysis methods
•Frequency domain representation shows the signal energy and phase with respect to frequency
•Fast and efficient way to view a signal’s information
General Transform as problem-solving tool

analysis: s(t) → S(f) = F[s(t)]
synthesis: S(f) → s(t)
s(t), S(f): Transform Pair

Basic block diagram of signal transformation
Frequency domain analysis
Complex numbers: 4.2 + 3.7i, 9.4447 − 6.7i, -5.2 (= -5.2 + 0i)
General form: Z = a + bi, Re(Z) = a, Im(Z) = b
Amplitude: A = |Z| = √(a² + b²)
Phase: φ = ∠Z = tan⁻¹(b/a)
Frequency domain analysis
Polar coordinates: Z = a + bi, with real part a on the ℜ axis and imaginary part b on the ℑ axis
Amplitude: A = √(a² + b²)
Phase: φ = tan⁻¹(b/a)
Frequency domain analysis
Frequency Spectrum
 Basically, the frequency components (spectral components) of that signal
 Shows what frequencies exist in the signal
Fourier Transform (FT)
 One way to find the frequency content
 Tells how much of each frequency exists in a signal

Spectrum of
speech signal
Fourier Transform
•The Fourier transform decomposes a function into a spectrum of its frequency components
•The inverse transform synthesizes a function from its spectrum of frequency components
•The discrete Fourier transform pair is defined as:

X_k = Σ_{n=0}^{N−1} x_n e^(−j2πkn/N), where X_k represents the kth frequency component

x_n = (1/N) Σ_{k=0}^{N−1} X_k e^(j2πkn/N), where x_n represents the nth sample in the time domain
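For illustration, a direct (unoptimized) NumPy evaluation of this transform pair; it round-trips and matches the FFT by definition:

```python
import numpy as np

def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

def idft(X):
    """Inverse: x_n = (1/N) * sum_k X_k * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return (np.exp(2j * np.pi * n * k / N) @ X) / N

x = np.random.randn(64)
assert np.allclose(idft(dft(x)).real, x)   # the pair round-trips
assert np.allclose(dft(x), np.fft.fft(x))  # matches NumPy's FFT
```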


Fourier Transform

[Figure: a time-domain signal and its amplitude-only Fourier spectrum, with peaks at 5, 10, and 15 Hz]
Fourier Trans. of 1D signal

[Figure: a 1D signal and its Fourier spectrum, with peaks at 5, 10, and 15 Hz]
Fourier Transform

Fourier analysis uses sinusoids as the basis functions in decomposition
Fourier transforms give the frequency information, smearing time
Samples of a function give the temporal information, smearing frequency
FS synthesis
Square wave reconstruction from spectral terms:

sw_N(t) = Σ_{k=1}^{N} [−b_k · sin(kt)], shown for N = 1, 3, 5, 7, 9, 11, 13

[Figure: partial sums of the square wave sw(t) over t ∈ [0, 10], converging as terms are added]

Convergence may be slow (~1/k) - ideally an infinite number of terms is needed.
Practically, the series is truncated when the remainder falls below computer tolerance (⇒ error). BUT … Gibbs’ Phenomenon.
Stationarity of the signal

Stationary Signal
Signals with frequency content
unchanged over the entire time
All frequency components exist at all
times

Non-stationary Signal
Frequency changes in time
One example: the “Chirp Signal”
Stationarity of the signal

Stationary signal: 2 Hz + 10 Hz + 20 Hz; the components occur at all times
[Figure: the stationary signal over 0-1 s and its magnitude spectrum, with peaks at 2, 10, and 20 Hz]

Non-stationary signal: 0.0-0.4 s: 2 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 20 Hz; the components do not appear at all times
[Figure: the non-stationary signal over 0-1 s and its magnitude spectrum, again with peaks at 2, 10, and 20 Hz]
Chirp signal
Chirp 1: frequency sweeps from 2 Hz to 20 Hz
Chirp 2: frequency sweeps from 20 Hz to 2 Hz

Different in the time domain, (nearly) the same in the frequency domain
[Figure: the two chirps over 0-1 s and their nearly identical magnitude spectra over 0-25 Hz]

At what time do the frequency components occur? The FT cannot tell!

Limitations of Fourier Transform

FT only gives which frequency components exist in the signal
The time and frequency information cannot be seen at the same time
A time-frequency representation of the signal is needed
Most signals are non-stationary

ONE SOLUTION: SHORT-TIME FOURIER TRANSFORM (STFT)


Short Time Fourier Transform

Dennis Gabor (1946) used STFT
 To analyze only a small section of the signal at a time -- a technique called Windowing the Signal.
The segment of signal is assumed stationary
A 3D transform:

STFT_x^(ω)(t′, f) = ∫_t [x(t) · ω*(t − t′)] · e^(−j2πft) dt

ω(t): the window function
The STFT is a function of both time and frequency
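As an illustration, a minimal NumPy sketch of the windowing idea (Hann window assumed; the 30 ms window / 10 ms hop are just one common choice, not prescribed by the slide):

```python
import numpy as np

def stft(x, win_len, hop):
    """Slide a window along x and Fourier-transform each windowed segment."""
    w = np.hanning(win_len)                 # the window function ω(t)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = x[start:start + win_len] * w  # x(t) · ω(t − t′)
        frames.append(np.fft.rfft(seg))     # FT of the stationary-assumed segment
    return np.array(frames).T               # rows: frequency bins, cols: time frames

fs = 8000
t = np.arange(0, 1, 1 / fs)
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))
S = stft(x, win_len=240, hop=80)            # 30 ms window, 10 ms hop at 8 kHz
print(S.shape)
```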
Short time Fourier Transform

[Figure: a speech signal and its STFT, obtained by applying the FT to successive windowed segments]
Drawbacks of STFT
Unchanged Window
Dilemma of Resolution
 Narrow window -> poor frequency resolution
 Wide window -> poor time resolution
Heisenberg Uncertainty Principle
 Cannot know what frequency exists at what time
intervals
[Figure: spectrograms via a narrow window and via a wide window]
Wavelet Transform
To overcome some limitations of the Fourier transform

[Figure: discrete wavelet decomposition tree - S splits into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]
Wavelet Overview

Wavelet
 A small wave
Wavelet Transforms
 Provide a way for analyzing waveforms, bounded in both
frequency and duration
 Allow signals to be stored more efficiently than by Fourier
transform
 Be able to better approximate real-world signals
 Well-suited for approximating data with sharp discontinuities
“The Forest & the Trees”
 Notice gross features with a large "window"
 Notice small features with a small "window"
Multi-resolution analysis
Wavelet Transform
 An alternative approach to the short time Fourier
transform to overcome the resolution problem
 Similar to STFT: signal is multiplied with a function

Multi-resolution Analysis
 Analyze the signal at different frequencies with
different resolutions
 Good time resolution and poor frequency resolution at
high frequencies
 Good frequency resolution and poor time resolution at
low frequencies
 More suitable for short-duration high-frequency
components and long-duration low-frequency components
Advantages of WT over STFT

The width of the window is changed as the transform is computed for every spectral component
Different resolutions are thus obtained at different frequencies
Principles of WT

Split up the signal into a bunch of signals representing the same signal, but all corresponding to different frequency bands
Only providing which frequency bands exist at what time intervals
Principles of WT

CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√|s|) ∫ x(t) · ψ*((t − τ)/s) dt

τ: translation (the location of the window); s: scale; ψ: mother wavelet
Wavelet
 Small wave
 Means the window function is of finite length

Mother Wavelet
 A prototype for generating the other window functions
 All the used windows are its dilated or compressed and
shifted versions
Principles of WT
Wavelet bases (time domain and frequency domain forms)

Morlet (ω₀ = frequency): ψ(η) = π^(−1/4) e^(iω₀η) e^(−η²/2)
Paul (m = order): ψ(η) = (2^m i^m m!)/√(π(2m)!) · (1 − iη)^(−(m+1))
DOG (m = derivative): ψ(η) = ((−1)^(m+1)/√(Γ(m + 1/2))) · d^m/dη^m [e^(−η²/2)]

DOG = Derivative Of a Gaussian; m = 2 gives the Marr or Mexican hat wavelet
Scale of wavelet

Scale
 S>1: dilate the signal
 S<1: compress the signal
Low Frequency -> High Scale -> Non-
detailed Global View of Signal -> Span
Entire Signal
High Frequency -> Low Scale -> Detailed
View Lasting a Short Time
Only Limited Interval of Scales is Necessary
Computation of WT

CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√|s|) ∫ x(t) · ψ*((t − τ)/s) dt

Step 1: The wavelet is placed at the beginning of the signal, and s is set to 1 (the most compressed wavelet);
Step 2: The wavelet function at scale "1" is multiplied by the signal, integrated over all times, and then multiplied by 1/√|s|;
Step 3: Shift the wavelet to t = τ, and get the transform value at t = τ and s = 1;
Step 4: Repeat the procedure until the wavelet reaches the end of the signal;
Step 5: Scale s is increased by a sufficiently small value, and the above procedure is repeated for all s;
Step 6: Each computation for a given s fills a single row of the time-scale plane;
Step 7: The CWT is obtained once all scales s have been calculated.
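A naive NumPy sketch that follows these steps literally (Morlet wavelet assumed; O(N²) per scale, so an FFT-based implementation would be preferred in practice):

```python
import numpy as np

def morlet(eta, w0=6.0):
    """Morlet mother wavelet ψ(η) = π^(-1/4) e^(iω0·η) e^(−η²/2)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * eta) * np.exp(-eta ** 2 / 2)

def cwt(x, scales, dt=1.0):
    """Naive CWT following the step-by-step recipe above."""
    t = np.arange(len(x)) * dt
    out = np.empty((len(scales), len(x)), dtype=complex)
    for i, s in enumerate(scales):          # Steps 5-7: loop over all scales
        for k, tau in enumerate(t):         # Steps 3-4: shift the wavelet
            psi = morlet((t - tau) / s)     # scaled, translated wavelet
            out[i, k] = np.sum(x * np.conj(psi)) * dt / np.sqrt(s)  # Step 2
    return out                              # Step 6: one row per scale

x = np.sin(2 * np.pi * 5 * np.arange(0, 2, 0.01))
W = cwt(x, scales=[1, 2, 4, 8], dt=0.01)    # each scale fills one row
```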
Time & Frequency Resolution

[Figure: time-frequency tiling of the WT - at high frequencies, better time resolution and poorer frequency resolution; at low frequencies, better frequency resolution and poorer time resolution]

• Each box represents an equal portion of the time-frequency plane
• Resolution in the STFT is selected once for the entire analysis; the WT tiling changes with frequency

Comparison of transformations
Discretization of WT

It is necessary to sample the time-frequency (scale) plane.
At high scale s (lower frequency f), the sampling rate N can be decreased.
The scale parameter s is normally discretized on a logarithmic grid. The most common base is 2.

N₂ = (s₁/s₂) · N₁ = (f₁/f₂) · N₁

S: 2, 4, 8, …
N: 32, 16, 8, …
Effective and Fast DWT

The discretized WT is not a true discrete transform
Discrete Wavelet Transform (DWT)
 Provides sufficient information both for analysis and synthesis
 Reduces the computation time significantly
 Easier to implement
 Analyzes the signal at different frequency bands with different resolutions
 Decomposes the signal into a coarse approximation and detail information

[Figure: decomposition tree - S into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]
Decomposition with DWT
Halves the time resolution
 Only half the number of samples results at each level
Doubles the frequency resolution
 The spanned frequency band is halved

[Figure: filter bank - x[n] (512 samples, 0-1000 Hz) → Filter 1 → D1: 500-1000 Hz (256 samples); → Filter 2 → D2: 250-500 Hz (128 samples); → Filter 3 → D3: 125-250 Hz (64 samples); remaining A3: 0-125 Hz (64 samples)]
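For instance, a sketch with the PyWavelets package (signal, rates, and level here are illustrative):

```python
import numpy as np
import pywt  # PyWavelets

fs = 2000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

# 3-level DWT: coefficient counts roughly halve at each level,
# and so do the spanned frequency bands
coeffs = pywt.wavedec(x, 'db4', level=3)    # [A3, D3, D2, D1]
for name, c in zip(['A3', 'D3', 'D2', 'D1'], coeffs):
    print(name, len(c))

# perfect reconstruction from the coefficients (synthesis)
x_rec = pywt.waverec(coeffs, 'db4')
print(np.allclose(x, x_rec[:len(x)]))
```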
Decomposition of non-stationary signal

Signal: 0.0-0.4 s: 20 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 2 Hz
Wavelet: db4, Level: 6
[Figure: DWT subbands ordered from low frequency (fL) to high frequency (fH)]

Decomposition of non-stationary signal

Signal: 0.0-0.4 s: 2 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 20 Hz
Wavelet: db4, Level: 6
[Figure: DWT subbands ordered from low frequency (fL) to high frequency (fH)]
Reconstruction from WT

What
How those components can be assembled
back into the original signal without loss of
information?
A Process After decomposition or analysis.
Also called synthesis
How
Reconstruct the signal from the wavelet
coefficients
Where wavelet analysis involves filtering and
downsampling, the wavelet reconstruction
process consists of upsampling and filtering
Reconstruction from WT

Lengthening a signal component by


inserting zeros between samples
(upsampling)
MATLAB Commands: idwt and waverec.
Wavelet Applications

Typical Application Fields


 Astronomy, acoustics, nuclear engineering, sub-
band coding, signal and image processing,
neurophysiology, music, magnetic resonance
imaging, speech discrimination, optics, fractals,
turbulence, earthquake-prediction, radar, human
vision, and pure mathematics applications
Sample Applications
 De-noising signals
 Breakdown detecting
 Detecting self-similarity
 Compressing images
 Identifying pure tone
Signal De-noising

Highest Frequencies
Appear at the Start of
The Original Signal
Approximations Appear
Less and Less Noisy
Also Lose Progressively
More High-frequency
Information.
In A5, About the First
20% of the Signal is
Truncated
Breakdown Detection

The Discontinuous Signal


Consists of a Slow Sine
Wave Abruptly Followed
by a Medium Sine Wave.
The 1st and 2nd Level
Details (D1 and D2) Show
the Discontinuity Most
Clearly
Things to be Detected
 The site of the change
 The type of change (a rupture of the signal, or an abrupt change in its first or second derivative)
 The amplitude of the change
[Figure: discontinuity points]
Detecting Self-similarity
Purpose
 How analysis by wavelets
can detect a self-similar,
or fractal, signal.
 The signal here is the Koch
curve -- a synthetic signal
that is built recursively
Analysis
 If a signal is similar to
itself at different scales,
then the "resemblance
index" or wavelet
coefficients also will be
similar at different scales.
 In the coefficients plot,
which shows scale on the
vertical axis, this self-
similarity generates a
characteristic pattern.
Image Compression
Fingerprints
 FBI maintains a large database of
fingerprints — about 30 million
sets of them.
 The cost of storing all this data
runs to hundreds of millions of
dollars.
Results
 Values under the threshold are
forced to zero, achieving about
42% zeros while retaining almost
all (99.96%) the energy of the
original image.
 By turning to wavelets, the FBI
has achieved a 15:1 compression
ratio
 better than the more traditional
JPEG compression
Identifying Pure Tone

Purpose
 Resolving a signal into constituent
sinusoids of different frequencies
 The signal is a sum of three pure
sine waves
Analysis
 D1 contains signal components whose
period is between 1 and 2.
 Zooming in on detail D1 reveals that
each "belly" is composed of 10
oscillations.
 D3 and D4 contain the medium sine
frequencies.
 There is a breakdown between approximations A3 and A4 -> the medium frequency has been subtracted.
 Approximations A1 to A3 can be used to estimate the medium sine.
 Zooming in on A1 reveals a period
of around 20.
Empirical Mode Decomposition
Principle
Objective — From one observation of x(t), get an AM-FM type representation:

x(t) = Σ_{k=1}^{K} a_k(t) Ψ_k(t)

with a_k(·) amplitude modulating functions and Ψ_k(·) oscillating functions.
Idea — “signal = fast oscillations superimposed on slow oscillations”.

Operating mode — (“EMD”, Huang et al., ’98)
(1) identify, locally in time, the fastest oscillation;
(2) subtract it from the original signal;
(3) iterate upon the residual.
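A minimal SciPy sketch of this operating mode (cubic-spline envelopes and a fixed sift count stand in for Huang et al.'s stopping criterion; boundary effects are ignored):

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One sifting pass: subtract the mean of the extrema envelopes."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None                                 # too few extrema: residual
    upper = CubicSpline(t[maxima], x[maxima])(t)    # envelope through maxima
    lower = CubicSpline(t[minima], x[minima])(t)    # envelope through minima
    return x - (upper + lower) / 2

def emd(x, t, n_imfs=3, n_sifts=10):
    """(1) sift out the fastest oscillation, (2) subtract it, (3) iterate."""
    imfs, residual = [], x.copy()
    for _ in range(n_imfs):
        h = residual
        for _ in range(n_sifts):                    # crude stopping rule
            h_new = sift_once(h, t)
            if h_new is None:
                return imfs, residual
            h = h_new
        imfs.append(h)
        residual = residual - h                     # iterate on the residual
    return imfs, residual

t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, res = emd(x, t)   # fast oscillation comes out first
```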
Empirical Mode Decomposition
Principle

[Figure: an LF sawtooth + a linear FM = the composite signal, each plotted over t ∈ [0, 1]]
Empirical Mode Decomposition
Algorithmic definition

[Animation, shown frame by frame over many slides: the SIFTING PROCESS - envelopes through the extrema are interpolated, their mean is subtracted, and the result is iterated; the procedure successively extracts the First, Second, and Third Intrinsic Mode Functions and finally leaves the Residual]

Empirical Mode Decomposition
Algorithmic definition
Signal = 1st Intrinsic Mode Function + 2nd Intrinsic Mode Function + 3rd Intrinsic Mode Function + Residual
Empirical Mode Decomposition
Intrinsic Mode Functions

— Quasi-monochromatic harmonic oscillations
 #{zero crossings} = #{extrema} ± 1
 symmetric envelopes around the y = 0 axis
— IMF ≠ Fourier mode and, in nonlinear situations, one IMF = several Fourier modes
— Output of a self-adaptive time-varying filter (≠ standard linear filter)

ex: 2 FM sinusoids + a Gaussian wave packet
Empirical Mode Decomposition
Instantaneous frequency (IF)

— The analytic version of each IMF C_i(t) is computed using the Hilbert transform as:

z_i(t) = C_i(t) + jH[C_i(t)] = a_i(t) e^(jθ_i(t))

— hence z_i(t) becomes complex, with phase and amplitude. The IF can then be computed as:

ω_i(t) = dθ_i(t)/dt

— The Hilbert spectrum (HS) is the triplet H(ω, t) = {t, ω_i(t), a_i(t)}
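A small SciPy sketch of this computation for a single component (the chirp below merely stands in for one IMF):

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
c = np.cos(2 * np.pi * (50 * t + 30 * t ** 2))  # a chirp standing in for C_i(t)

z = hilbert(c)                     # analytic signal z(t) = C(t) + jH[C(t)]
a = np.abs(z)                      # instantaneous amplitude a(t)
theta = np.unwrap(np.angle(z))     # instantaneous phase θ(t)
inst_f = np.diff(theta) / (2 * np.pi) * fs   # IF in Hz: ω(t) = dθ/dt
```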
Empirical Mode Decomposition
Intrinsic Mode Functions

[Figure: a signal in time, its spectrum in frequency, and its time-frequency representation]

Empirical Mode Decomposition
Intrinsic Mode Functions

[Figure: the signal and its 1st, 2nd, and 3rd IMFs, each shown in the time-frequency plane]
Empirical Mode Decomposition
Intrinsic Mode Functions

•EMD is a fully data-adaptive decomposition for spectral and time-frequency representation of non-linear and non-stationary time series
•It does not employ any basis function for decomposition
•It produces perfect localization of the signal components in the high-resolution time-frequency space of the time series
Empirical Mode Decomposition
Comparison between HS and STFT

Time-frequency representation of two pure tones (100 Hz and 250 Hz) using HS and STFT
[Figure: Hilbert spectrum (HS) vs. STFT]

Empirical Mode Decomposition
Comparison between Wavelet and HS
[Figure: Wavelet vs. Hilbert spectrum (HS)]
Remarks on FT

•The Fourier Transform has a solid mathematical foundation
•Can be used in robust analysis, since it retains phase information
•The detail of the signal information is limited by the (sinusoidal) basis function
•STFT analysis includes some additional cross-spectral energy that degrades the performance in some applications
Remarks on WT

•WT employs a data-adaptive basis function based on its time and frequency scales
•It can produce more detailed signal information in the T-F representation
•WT also performs well in multi-band decomposition
•The reconstruction error of the multi-band representation is much less than that of the FT
•It cannot preserve the phase information needed for perfect reconstruction from T-F space
Remarks on EMD and HS

•EMD is a fully adaptive multi-band decomposition method
•It produces perfect localization of signal components in T-F space
•HS can represent the instantaneous spectra of the signal
•The signal can be reconstructed with negligible error terms
•It does not yet have a mathematical foundation
•It is difficult to use EMD-based decomposition in robust analysis
Application of DSP in audio analysis

•Audio source separation from mixture using independent subspace analysis (ISA)
•Audio source separation by spatial localization in the underdetermined case
•Robust pitch estimation using EMD

Application of DSP in audio analysis

Audio source separation from mixture using independent subspace analysis (ISA)
Source separation by Independent subspace analysis (ISA)

[Diagram: audio mixture s(t) → STFT → magnitude spectrogram (plus phase φ) → PCA → basis vector selection → ICA → basis vector clustering → source spectrograms → ISTFT → individual sources]
T-F representation of mixture

Audio mixture → Short Time Fourier Transform (STFT) → magnitude spectrogram X + phase information
Window: 30 ms, overlap: 20 ms
Proposed separation model
Mixture spectrogram X = Σ x_i, with source spectrograms x_i = B_i A_i

B_i = [b_1^(i), b_2^(i), …, b_n^(i)]: invariant frequency n-component basis
A_i = [a_1^(i), a_2^(i), …, a_n^(i)]^T: corresponding amplitude envelopes
A_i = B_i^T X,  B_i = X A_i^T
** The goal is to find independent B_i or A_i
Dimension Reduction
Rows or columns of X >> number of sources
Subject to reducing the dimension of X, singular value decomposition (SVD) is used:

X_{n×k} = U_{n×n} S_{n×k} V_{k×k}^T

- U and V: orthogonal matrices (column-wise)
- S: diagonal matrix of singular values σ₁ ≥ σ₂ ≥ σ₃ ≥ … ≥ σ_n ≥ 0
- p basis vectors (from U or V) are selected by requiring (Σ_{i=1}^{p} σ_i) / (Σ_{i=1}^{n} σ_i) ≥ φ, with φ = 0.5 to 0.6
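A minimal NumPy sketch of this selection rule (the spectrogram here is a random placeholder; φ = 0.55 is one value inside the stated 0.5-0.6 range):

```python
import numpy as np

def select_basis(X, phi=0.55):
    """Keep the p leading singular vectors whose singular values account
    for a fraction phi of the total singular value sum."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    p = int(np.searchsorted(np.cumsum(s) / np.sum(s), phi) + 1)
    return U[:, :p], s[:p], Vt[:p, :]

X = np.abs(np.random.randn(257, 400))   # placeholder magnitude spectrogram
B, s, A = select_basis(X)               # p frequency bases and time envelopes
print(B.shape, A.shape)
```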
Proposed separation model

To derive the basis vectors:
 Singular value decomposition (SVD) is applied as PCA
 Some principal components are selected as basis vectors
 Independent component analysis (ICA) is applied to make the bases independent
Independent basis vectors

[Figure: the bases before ICA and after ICA]
** The bases become independent along time frames

Producing source subspaces
The bases of the speech signal
[Figure: time bases and frequency bases]
Source Subspaces (cont.)

[Diagram: mixture spectrogram → PCA + basis selection + ICA → basis vectors B, A → KLd-based clustering → source subspaces B1A1 and B2A2 → source spectrograms]
Source re-synthesis

Separated subspaces (spectrograms) → append phase information → inverse STFT → separated sources:

S_i(n, k) = x_i · e^(j[φ(n,k)])

[Figure: mixture of speech & bip-bip sound; separated speech; separated bip-bip sound]
Experimental results
Separated signals with the proposed algorithm

[Figure: mixtures and separated signals - speech + bip-bip; male + female speech]
Application of DSP in audio analysis

Audio source separation by spatial localization in the underdetermined case
Localization based separation

To avoid dependency on the spectral content of the signal in separation
To increase the number of sources, the spatial location is considered
Binaural mixtures are used instead of a single mixture
Localization based separation (cont.)
Consider a multi-source audio situation
Humans can easily localize and separate the sources with the HAS (human auditory system)
The binaural cues ITD and ILD are mainly used in source localization
Separation is performed by applying beamforming and a binary mask
Source localization Cues
• Interaural time difference (ITD) between two
microphones’ signals (like two ears of human)
• Interaural level difference (ILD)

[Figure: ITD and ILD]
Source localization

X_r(ω) and X_l(ω) are the STFTs of x_r(t) and x_l(t)
ITD and ILD are calculated as:

τ_ITD(ω) = (1/ω) {φ_l(ω) − φ_r(ω)}
κ_ILD(ω) = 20 log( |X_l(ω)| / |X_r(ω)| )

where φ_r(ω) and φ_l(ω) are the unwrapped phases of X_r(ω) and X_l(ω), respectively, at frequency ω
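A minimal NumPy sketch of these two formulas (here `Xl`, `Xr` are one frame's left/right spectra and `omega` the bin frequencies in rad/s, all assumed given; the small epsilon is only to avoid division by zero):

```python
import numpy as np

def itd_ild(Xl, Xr, omega):
    """Per-frequency ITD/ILD from the left/right spectra, as defined above."""
    phase_l = np.unwrap(np.angle(Xl))
    phase_r = np.unwrap(np.angle(Xr))
    tau_itd = (phase_l - phase_r) / omega                          # (1/ω)(φ_l − φ_r)
    kappa_ild = 20 * np.log10(np.abs(Xl) / (np.abs(Xr) + 1e-12))   # in dB
    return tau_itd, kappa_ild

# e.g. for one STFT frame of length N at sampling rate fs, skipping the DC bin:
# omega = 2 * np.pi * np.fft.rfftfreq(N, 1 / fs)[1:]
```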
Source localization (cont.)
ITD becomes ambiguous at higher frequencies (a factor of the microphones’ spacing)
ITD dominates at low frequencies; at high frequencies, ILD dominates to resolve the ambiguity
[Figure: ITD at low frequency, ILD at high frequency]
Source localization (cont.)
ITD and ILD are quantized into 50 levels
Collection of T-F points corresponding to each
ITD/ILD quantized pair produces peaks
Separation by beamforming

ITD is derived for each of the


localized sources
Spatial Beamforming is applied
Linearly constrained minimum
variance Beamforming (LCMVB) is
used
The gain is selected based on the
spatial locations
Separation by binary mask
with HS
It is required to avoid the
limitations of spatial beamforming
Separation is performed by binary
mask estimation based on ITD/ILD
The sources are considered as
disjoint orthogonal in T-F space
 not more than one source is
active at any T-F point
Computing ITD and ILD
Each mixture is transformed to the T-F domain using Hilbert spectrums (H_L and H_R)
ITD and ILD are measured as:

{ITD(ω, t_f), ILD(ω, t_f)} = { (1/ω) ∠[H_L(ω, t_f)/H_R(ω, t_f)], |H_R(ω, t_f)|/|H_L(ω, t_f)| }

ILD_dB(ω, t_f) = 20 log10( |H_R(ω, t_f)|² / |H_L(ω, t_f)|² )

where t_f is the time frame
ITD-ILD Space Localization
ITD and ILD are quantized into 50 levels
Collection of T-F points from HS
corresponding to each ITD/ILD quantized
pair produces peaks
Source Separation
Each peak region in the histogram refers to a source of the binaural mixtures
Construct a binary mask (nullifying T-F points of interfering sources) M_i(ω, t)
The HS of the ith source is separated as:

H_i(ω, t) = M_i(ω, t) H_L(ω, t)

The time domain ith source is given as:

s_i(t) = Σ_ω H_i(ω, t) · cos[φ(ω, t)]
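A sketch of the masking step (NumPy; the per-point `itd`/`ild` arrays and the histogram `peak`/`tol` values are assumed to come from the localization stage above):

```python
import numpy as np

def binary_mask_separate(HL, itd, ild, peak, tol):
    """Keep only T-F points whose (ITD, ILD) lie near one histogram peak;
    interfering points are nulled, giving H_i = M_i * H_L."""
    M = (np.abs(itd - peak[0]) < tol[0]) & (np.abs(ild - peak[1]) < tol[1])
    return M * HL
```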
Source disjoint orthogonality
Disjoint orthogonality (DO) of audio sources assumes that not more than one source is active at any T-F point:

F_1(ω, t) · F_2(ω, t) = 0, ∀ ω, t

where F_1 and F_2 are the TFRs of two signals
SIR (signal to interference ratio) is used as the basis to measure DO
Source disjoint orthogonality (cont.)

[Figure: three audio sources s1, s2, s3 captured at a microphone, and their TFRs in the time-frequency plane]
Source disjoint orthogonality (cont.)
The SIR of the jth source is defined as:

SIR_j = Σ_ω Σ_t X_j(ω, t) / Y_j(ω, t),  Y_j(ω, t) ≠ 0

Y_j(ω, t) = Σ_{i=1, i≠j}^{N} X_i(ω, t)

Y_j: the sum of the interfering sources
Source disjoint orthogonality (cont.)
Dimensions of the HS and STFT of the same signal may be different
DO is defined as a percentage over the entire TFR region
The average DO (ADO) of all sources is:

ADO = (1/N) Σ_{j=1}^{N} SIR_j

N: number of sources
Experimental results
The three mixtures are defined as
m1{sp1(-40°, 0°), sp2(30°, 0°), ft(0°, 0°)},
m2{sp1(20°, 10°), sp2(0°, 10°), ft(-10°,10°)},
m3{sp1(40°, 20°), sp2(30°, 20°), ft(-20°, 20°)}
The separation efficiency is measured as OSSR (original to separated signal ratio), defined as:

OSSR = (1/T) Σ_{t=1}^{T} log10( Σ_{i=1}^{w} s²_original(t + i) / Σ_{i=1}^{w} s²_separated(t + i) )
Experimental results (cont.)
The comparative separation efficiency (OSSR) using HS and STFT:

Mixture | TFR  | OSSR of sp1 | OSSR of sp2 | OSSR of ft
m1      | HS   | -0.0271     | 0.0213      | 0.0264
m1      | STFT | 0.0621      | -0.0721     | -0.0531
m2      | HS   | 0.0211      | -0.0851     | -0.0872
m2      | STFT | 0.0824      | 0.1202      | 0.1182
m3      | HS   | 0.0941      | -0.0832     | 0.0225
m3      | STFT | -0.1261     | 0.1092      | -0.0821
Experimental results
This experiment also compares the DO using HS and STFT as the TFR
STFT is affected by many factors: window function and its length, overlapping, FFT points
HS is independent of such factors
HS is only slightly affected by the number of frequency bins used in the TFR
Experimental results (cont.)
[Figure: the ADO of HS and STFT as a function of the number of frequency bins (N = 3)]

Experimental results (cont.)
[Figure: only the ADO of STFT is affected by the window overlapping factor (%)]
Experimental results (cont.)
STFT includes more cross-spectral energy terms
[Figure: the TFR of two pure tones using HS and STFT]

Experimental results (cont.)
HS always has better DO for audio signals
DO depends on the resolution of the TFR
The STFT has to satisfy the uncertainty inequality Δt·Δω ≥ 1/2
The frequency resolution of HS extends up to the Nyquist frequency
Its time resolution is up to the sampling rate, and hence HS offers better resolution
Remarks

The separation efficiency is independent of the signal’s spectral characteristics
The performance is affected by the angular separation and disjointness of the sources
HS produces better disjointness in the T-F domain and hence better separation
The binaural mixtures were recorded in the anechoic room of NTT
Application of DSP in audio analysis

Robust pitch estimation using EMD
Why EMD in pitch estimation?

Pitch facilitates speech coding, enhancement, recognition, etc.
The autocorrelation function is most commonly used in pitch estimation algorithms
The autocorrelation (AC) function reveals the periodic property of the speech
EMD in pitch estimation (cont.)
[Figure: AC function of a speech frame]

EMD in pitch estimation (cont.)
Pitch is the sample difference between two consecutive peaks in the AC function
Sometimes the pitch peak may be less prominent, especially due to noise
EMD in pitch estimation (cont.)
EMD decomposes any signal into components from higher to lower frequency
It produces the local and global oscillations of the signal
The global oscillation closely follows the envelope of the signal
The IMF of the global oscillation is used to estimate the pitch
Pitch estimation with EMD
[Figure: EMD of the AC function of a speech frame]

Pitch estimation with EMD (cont.)
There exists an IMF in the EMD domain representing the global oscillation of the AC function
That IMF represents the sinusoid of the pitch period
Pitch is the frequency of that IMF, rather than being found from the pitch peak
Pitch estimation with EMD (cont.)

In the EMD here, IMF-5 is the oscillation of the pitch period
Determining the target IMF representing the sinusoid with the pitch period is a crucial step
The IMFs of oscillation lower in frequency than the pitch period can be discarded by energy thresholding
Pitch estimation with EMD (cont.)
A reference pitch is computed by the weighted AC (WAC) method
This pitch information is used to select the IMF with the pitch period
The periodicity of the selected IMF is computed as the pitch period
Pitch estimation with EMD (cont.)

The peak at zero lag is selected
Two cycles are selected from both sides
The average number of samples per cycle gives the periodicity
Proposed Pitch estimation Algorithm
The normalized autocorrelation (AC) of the speech frame is computed
Determine a rough pitch period using the WAC method
Apply EMD to the AC function
Select the IMF of the pitch period on the basis of the WAC-based estimate
The period of the selected IMF is the estimated pitch
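A hedged sketch of the first and last steps (NumPy only; the WAC-based selection is assumed to have been done already, so `imf_of_ac` is a hypothetical input holding the chosen IMF of the AC function):

```python
import numpy as np

def normalized_ac(frame):
    """Step 1: normalized autocorrelation of one speech frame."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return ac / ac[0]

def pitch_from_imf(imf_of_ac, fs):
    """Final step: the pitch is the frequency of the selected IMF.
    Zero crossings of the (roughly sinusoidal) IMF give its period."""
    sign = np.signbit(imf_of_ac).astype(int)
    zc = np.where(np.diff(sign) != 0)[0]       # zero-crossing indices
    if len(zc) < 3:
        return None
    period = 2.0 * np.mean(np.diff(zc))        # samples per full cycle
    return fs / period                         # pitch in Hz
```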
Experimental results

Keele pitch database is used here


20kHz sampling rate
Frame length is 25.6ms with 10ms
shifting
Each frame is filtered by band-pass
filter of pitch range (50-500Hz)
Gross pitch error (GPE) is used to
measure the performance
Experimental results (cont.)

The %GPE of male and female speech at different SNRs (total number of frames: 1823):

SNR    | 30 dB | 20 dB | 10 dB | 0 dB  | -5 dB | -15 dB
Female | 1.90  | 2.83  | 3.93  | 10.12 | 21.83 | 64.24
Male   | 2.15  | 3.78  | 5.22  | 11.89 | 23.56 | 66.76
Remarks

The use of EMD makes the pitch estimation


method more robust
EMD of AC function can extract the
fundamental oscillation of the signal
The pitch can be easily estimated from the
single sinusoid of fundamental oscillation
It is not affected by prominent non-pitch peaks
Future works

The open problem is to identify the IMF with the pitch period
In the present algorithm, the error of the rough ACF-based pitch estimate can propagate to the final estimation
The performance has not yet been tested against other existing algorithms
Open Problem-1
Instantaneous Pitch (IP) estimation using EMD
•Frame-based pitch estimation is already done
•Paper accepted at EUROSPEECH 2007
[Figure: three methods of pitch estimation]
•We have used the pitch-information-based WAC to compute the exact pitch (IMF) from EMD space
•Problem: to compute IP only from the EMD space
Open Problem-2
Voiced/Unvoiced Detection with EMD
•Useful in speech enhancement and speech/speaker recog.
•Paper with preliminary results published at ICICT2007
[Figure: V/UV differentiation]
•Problem: to derive a better separation region for V/UV and to conduct experiments with large speech data
Open Problem-3
Robust Audio Source Localization
•Localization is done by delay-attenuation computed in the T-F space of binaural mixtures - NOT noise robust
[Figure: localization of three sources using TD-LD computed in T-F space]
•The problem is to derive a mathematical model for robust localization in the underdetermined situation
Open Problem-4
Speech denoising using image processing
•Noisy speech can be represented as an image via a time-frequency (T-F) representation, e.g. the spectrogram
[Figure: speech with white noise]
•Image processing algorithms can be used for denoising
•This seems easy for musical/white noise
•The problem is to deal with other noises, even by using binaural mixtures
Open Problem-5
Auditory segmentation with binaural mixtures
•Auditory segmentation is the first stage of source separation using auditory scene analysis (ASA)
[Figure: source separation by ASA]
•The problem is to use binaural mixtures for improved auditory segmentation as source separation
•T-F representations other than the FT can be employed
Open Problem-6
Two stage speech enhancement
•Single stage speech enhancement is not efficient in all noisy situations
•For example, musical noise is introduced with binary masking and some thresholding methods
•Noise may not be separated perfectly by using ICA or ISA (independent subspace analysis) based techniques
•Multi-stage enhancement with a suitable order can improve the performance

Noisy speech → First stage enhancement → Second stage enhancement → Clean speech
Open Problem-7
Informative features extraction
•To use spectral dynamics in speech/speaker recog.
•Special types of speech features are required
[Figure: mixed signal with its spectrogram]
•How to parameterize the speech signal to represent speech dynamics
•WT and HS based spectral analysis can be studied
Open Problem-8
Source based audio indexing
•Useful in multimedia applications and moving audio source separation
[Figure: separation of moving sources - audio sources s1, s2, s3 at different azimuth angles (0 to 180 degrees), at 1.5 m]
•Several new methods could be used for indexing: Ada-boost, Tree-ICA, conditional random fields
Open Problem-9
Time-series prediction with EMD
•Applies to financial and environmental time series
•Conventional methods use the Kalman filter (for smoothing) and an AR model for prediction
[Figure: a non-stationary time series]
•EMD can be used as a smoothing filter to enhance the prediction accuracy
Open Problem-10
Heart-rate analysis with ECG data using EMD
[Figure: different parts of the ECG signal]
•ECG variability analysis at different frequency regions
•Analysis of the instantaneous ECG condition
•Abnormality analysis of heart rate using EMD-based spectral modeling
The End

Questions/Suggestions Please