
Introduction to Signal Processing and Some Applications in Audio Analysis

Md. Khademul Islam Molla
JSPS Research Fellow
Hirose-Minematsu Laboratory
Email: molla@gavo.t.u-tokyo.ac.jp
Outline of the presentation
Basics of discrete time signals
Frequency domain signal analysis
Basic transformations
 Fourier Transform (FT), short-time FT (STFT)
 Wavelet Transform (WT)
 Empirical mode decomposition (EMD), and
 Hilbert spectrum (HS)
Notable comparisons among FT, WT, and HS
Some applications in audio processing
Some open problems to work on
Discrete time signal
It is not possible to process continuous signals digitally
We need to convert them to discrete time signals with a suitable sampling frequency and quantization
The sampling theorem: Fs ≥ 2fc
 where fc is the highest expected signal frequency and Fs is the required sampling frequency
Quantization is required along with sampling
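A minimal NumPy sketch of these two steps (all values here are illustrative, not from the slides):

```python
import numpy as np

fc = 1000.0                     # highest expected signal frequency (Hz), illustrative
Fs = 4 * fc                     # sampling rate chosen above the Nyquist rate 2*fc
t = np.arange(0, 0.01, 1 / Fs)  # discrete time instants
x = np.sin(2 * np.pi * fc * t)  # the sampled (discrete time) signal

# Uniform quantization to 8 bits over the range [-1, 1]
bits = 8
levels = 2 ** bits
xq = np.round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
print(np.max(np.abs(x - xq)))   # error bounded by half a step, 1/(levels - 1)
```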
Discrete time signal

[Figure: signal sampling and signal quantization]
Discrete time signal

[Figure: effects of undersampling]
Discrete time signal

[Figure: effect of sampling at the required frequency]
Discrete time signal
Telephone speech is usually sampled
at 8 kHz to capture up to 4 kHz data
16 kHz is generally regarded as
sufficient for speech recognition and
synthesis
The audio standard is a sample rate
of 44.1 kHz (CD) or 48 kHz (Digital
Audio Tape) to represent frequencies
up to 20 kHz
Discrete time signal

Amplitude: f(x) = 5 cos(x)
Phase: f(x) = 5 cos(x + 3.14)
Frequency: f(x) = 5 cos(3x + 3.14)

[Figure: the three cosines plotted over x ∈ [-10, 10], illustrating changes in amplitude, phase, and frequency]
Time-domain signals
The Independent Variable is Time
The Dependent Variable is the Amplitude
Most of the Information is Hidden in the
Frequency Content
[Figure: time-domain plots of 2 Hz, 10 Hz, and 20 Hz sinusoids, and of their sum 2 Hz + 10 Hz + 20 Hz, over 0-1 s]
Signal Transformation
Why
 To obtain further information from the signal that
is not readily available in the raw signal.
Raw Signal
 Normally the time-domain signal
Processed Signal
 A signal that has been "transformed" by any of the
available mathematical transformations
Fourier Transformation
 The most popular transformation
between time and frequency domains
Frequency domain analysis

Why Frequency Information is Needed
 Be able to see information that is not obvious in the time domain
Types of Frequency Transformation
 Fourier Transform, Hilbert Transform, Short-time Fourier Transform, the Radon Transform, the Wavelet Transform …
Frequency domain analysis
•Powerful & complementary to time domain analysis methods
•Frequency domain representation shows the signal energy and phase with respect to frequency
•Fast and efficient way to view a signal’s information
General Transform as problem-solving tool

analysis: s(t) → S(f) = F[s(t)]
synthesis: S(f) → s(t)
s(t), S(f): Transform Pair

Basic block diagram of signal transformation
Frequency domain analysis
Complex numbers: 4.2 + 3.7i, 9.4447 − 6.7i, -5.2 (= -5.2 + 0i)
General form: Z = a + bi, Re(Z) = a, Im(Z) = b
Amplitude: A = |Z| = √(a² + b²)
Phase: φ = ∠Z = tan⁻¹(b/a)
Frequency domain analysis
Polar coordinates: Z = a + bi, with real part a on the ℜ axis and imaginary part b on the ℑ axis
Amplitude: A = √(a² + b²)
Phase: φ = tan⁻¹(b/a)
Frequency domain analysis
Frequency Spectrum
 Basically, the frequency components (spectral components) of that signal
 Shows what frequencies exist in the signal
Fourier Transform (FT)
 One way to find the frequency content
 Tells how much of each frequency exists in a signal

Spectrum of
speech signal
Fourier Transform
•The Fourier transform decomposes a function into a spectrum of its frequency components
•The inverse transform synthesizes a function from its spectrum of frequency components
•The discrete Fourier transform pair is defined as:

X_k = Σ_{n=0}^{N−1} x_n e^(−j2πkn/N), where X_k represents the kth frequency component

x_n = (1/N) Σ_{k=0}^{N−1} X_k e^(j2πkn/N), where x_n represents the nth sample in the time domain
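For illustration, a direct (unoptimized) NumPy evaluation of this transform pair; it round-trips and matches the FFT by definition:

```python
import numpy as np

def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

def idft(X):
    """Inverse: x_n = (1/N) * sum_k X_k * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    k = np.arange(N)
    n = k.reshape(-1, 1)
    return (np.exp(2j * np.pi * n * k / N) @ X) / N

x = np.random.randn(64)
assert np.allclose(idft(dft(x)).real, x)   # the pair round-trips
assert np.allclose(dft(x), np.fft.fft(x))  # matches NumPy's FFT
```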


Fourier Transform

[Figure: a time-domain signal and its amplitude-only Fourier spectrum, with peaks at 5, 10, and 15 Hz]
Fourier Trans. of 1D signal

[Figure: a 1D signal and its Fourier spectrum, with peaks at 5, 10, and 15 Hz]
Fourier Transform

Fourier analysis uses sinusoids as the basis functions in decomposition
Fourier transforms give the frequency information, smearing time
Samples of a function give the temporal information, smearing frequency
FS synthesis
Square wave reconstruction from spectral terms:

sw_N(t) = Σ_{k=1}^{N} [−b_k · sin(kt)], shown for N = 1, 3, 5, 7, 9, 11, 13

[Figure: partial sums of the square wave sw(t) over t ∈ [0, 10], converging as terms are added]

Convergence may be slow (~1/k) - ideally an infinite number of terms is needed.
Practically, the series is truncated when the remainder falls below computer tolerance (⇒ error). BUT … Gibbs’ Phenomenon.
Stationarity of the signal

Stationary Signal
Signals with frequency content
unchanged over the entire time
All frequency components exist at all
times

Non-stationary Signal
Frequency changes in time
One example: the “Chirp Signal”
Stationarity of the signal

Stationary signal: 2 Hz + 10 Hz + 20 Hz; the components occur at all times
[Figure: the stationary signal over 0-1 s and its magnitude spectrum, with peaks at 2, 10, and 20 Hz]

Non-stationary signal: 0.0-0.4 s: 2 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 20 Hz; the components do not appear at all times
[Figure: the non-stationary signal over 0-1 s and its magnitude spectrum, again with peaks at 2, 10, and 20 Hz]
Chirp signal
Chirp 1: frequency sweeps from 2 Hz to 20 Hz
Chirp 2: frequency sweeps from 20 Hz to 2 Hz

Different in the time domain, (nearly) the same in the frequency domain
[Figure: the two chirps over 0-1 s and their nearly identical magnitude spectra over 0-25 Hz]

At what time do the frequency components occur? The FT cannot tell!

Limitations of Fourier Transform

FT only gives which frequency components exist in the signal
The time and frequency information cannot be seen at the same time
A time-frequency representation of the signal is needed
Most signals are non-stationary

ONE SOLUTION: SHORT-TIME FOURIER TRANSFORM (STFT)


Short Time Fourier Transform

Dennis Gabor (1946) used STFT
 To analyze only a small section of the signal at a time -- a technique called Windowing the Signal.
The segment of signal is assumed stationary
A 3D transform:

STFT_x^(ω)(t′, f) = ∫_t [x(t) · ω*(t − t′)] · e^(−j2πft) dt

ω(t): the window function
The STFT is a function of both time and frequency
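As an illustration, a minimal NumPy sketch of the windowing idea (Hann window assumed; the 30 ms window / 10 ms hop are just one common choice, not prescribed by the slide):

```python
import numpy as np

def stft(x, win_len, hop):
    """Slide a window along x and Fourier-transform each windowed segment."""
    w = np.hanning(win_len)                 # the window function ω(t)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        seg = x[start:start + win_len] * w  # x(t) · ω(t − t′)
        frames.append(np.fft.rfft(seg))     # FT of the stationary-assumed segment
    return np.array(frames).T               # rows: frequency bins, cols: time frames

fs = 8000
t = np.arange(0, 1, 1 / fs)
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))
S = stft(x, win_len=240, hop=80)            # 30 ms window, 10 ms hop at 8 kHz
print(S.shape)
```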
Short time Fourier Transform

[Figure: a speech signal and its STFT, obtained by applying the FT to successive windowed segments]
Drawbacks of STFT
Unchanged Window
Dilemma of Resolution
 Narrow window -> poor frequency resolution
 Wide window -> poor time resolution
Heisenberg Uncertainty Principle
 Cannot know what frequency exists at what time
intervals
[Figure: spectrograms via a narrow window and via a wide window]
Wavelet Transform
To overcome some limitations of the Fourier transform

[Figure: discrete wavelet decomposition tree - S splits into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]
Wavelet Overview

Wavelet
 A small wave
Wavelet Transforms
 Provide a way for analyzing waveforms, bounded in both
frequency and duration
 Allow signals to be stored more efficiently than by Fourier
transform
 Be able to better approximate real-world signals
 Well-suited for approximating data with sharp discontinuities
“The Forest & the Trees”
 Notice gross features with a large "window"
 Notice small features with a small "window"
Multi-resolution analysis
Wavelet Transform
 An alternative approach to the short time Fourier
transform to overcome the resolution problem
 Similar to STFT: signal is multiplied with a function

Multi-resolution Analysis
 Analyze the signal at different frequencies with
different resolutions
 Good time resolution and poor frequency resolution at
high frequencies
 Good frequency resolution and poor time resolution at
low frequencies
 More suitable for short-duration high-frequency
components and long-duration low-frequency components
Advantages of WT over STFT

The width of the window is changed as the transform is computed for every spectral component
Different resolutions are thus obtained at different frequencies
Principles of WT

Split up the signal into a bunch of signals representing the same signal, but all corresponding to different frequency bands
Only providing which frequency bands exist at what time intervals
Principles of WT

CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√|s|) ∫ x(t) · ψ*((t − τ)/s) dt

τ: translation (the location of the window); s: scale; ψ: mother wavelet
Wavelet
 Small wave
 Means the window function is of finite length

Mother Wavelet
 A prototype for generating the other window functions
 All the used windows are its dilated or compressed and
shifted versions
Principles of WT
Wavelet bases (time domain and frequency domain forms)

Morlet (ω₀ = frequency): ψ(η) = π^(−1/4) e^(iω₀η) e^(−η²/2)
Paul (m = order): ψ(η) = (2^m i^m m!)/√(π(2m)!) · (1 − iη)^(−(m+1))
DOG (m = derivative): ψ(η) = ((−1)^(m+1)/√(Γ(m + 1/2))) · d^m/dη^m [e^(−η²/2)]

DOG = Derivative Of a Gaussian; m = 2 gives the Marr or Mexican hat wavelet
Scale of wavelet

Scale
 S>1: dilate the signal
 S<1: compress the signal
Low Frequency -> High Scale -> Non-
detailed Global View of Signal -> Span
Entire Signal
High Frequency -> Low Scale -> Detailed
View Lasting a Short Time
Only Limited Interval of Scales is Necessary
Computation of WT

CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√|s|) ∫ x(t) · ψ*((t − τ)/s) dt

Step 1: The wavelet is placed at the beginning of the signal, and s is set to 1 (the most compressed wavelet);
Step 2: The wavelet function at scale "1" is multiplied by the signal, integrated over all times, and then multiplied by 1/√|s|;
Step 3: Shift the wavelet to t = τ, and get the transform value at t = τ and s = 1;
Step 4: Repeat the procedure until the wavelet reaches the end of the signal;
Step 5: Scale s is increased by a sufficiently small value, and the above procedure is repeated for all s;
Step 6: Each computation for a given s fills a single row of the time-scale plane;
Step 7: The CWT is obtained once all scales s have been calculated.
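A naive NumPy sketch that follows these steps literally (Morlet wavelet assumed; O(N²) per scale, so an FFT-based implementation would be preferred in practice):

```python
import numpy as np

def morlet(eta, w0=6.0):
    """Morlet mother wavelet ψ(η) = π^(-1/4) e^(iω0·η) e^(−η²/2)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * eta) * np.exp(-eta ** 2 / 2)

def cwt(x, scales, dt=1.0):
    """Naive CWT following the step-by-step recipe above."""
    t = np.arange(len(x)) * dt
    out = np.empty((len(scales), len(x)), dtype=complex)
    for i, s in enumerate(scales):          # Steps 5-7: loop over all scales
        for k, tau in enumerate(t):         # Steps 3-4: shift the wavelet
            psi = morlet((t - tau) / s)     # scaled, translated wavelet
            out[i, k] = np.sum(x * np.conj(psi)) * dt / np.sqrt(s)  # Step 2
    return out                              # Step 6: one row per scale

x = np.sin(2 * np.pi * 5 * np.arange(0, 2, 0.01))
W = cwt(x, scales=[1, 2, 4, 8], dt=0.01)    # each scale fills one row
```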
Time & Frequency Resolution

[Figure: time-frequency tiling of the WT - at high frequencies, better time resolution and poorer frequency resolution; at low frequencies, better frequency resolution and poorer time resolution]

• Each box represents an equal portion of the time-frequency plane
• Resolution in the STFT is selected once for the entire analysis; the WT tiling changes with frequency

Comparison of transformations
Discretization of WT

It is necessary to sample the time-frequency (scale) plane.
At high scale s (lower frequency f), the sampling rate N can be decreased.
The scale parameter s is normally discretized on a logarithmic grid. The most common base is 2.

N₂ = (s₁/s₂) · N₁ = (f₁/f₂) · N₁

S: 2, 4, 8, …
N: 32, 16, 8, …
Effective and Fast DWT

The discretized WT is not a true discrete transform
Discrete Wavelet Transform (DWT)
 Provides sufficient information both for analysis and synthesis
 Reduces the computation time significantly
 Easier to implement
 Analyzes the signal at different frequency bands with different resolutions
 Decomposes the signal into a coarse approximation and detail information

[Figure: decomposition tree - S into A1 and D1, A1 into A2 and D2, A2 into A3 and D3]
Decomposition with DWT
Halves the time resolution
 Only half the number of samples results at each level
Doubles the frequency resolution
 The spanned frequency band is halved

[Figure: filter bank - x[n] (512 samples, 0-1000 Hz) → Filter 1 → D1: 500-1000 Hz (256 samples); → Filter 2 → D2: 250-500 Hz (128 samples); → Filter 3 → D3: 125-250 Hz (64 samples); remaining A3: 0-125 Hz (64 samples)]
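For instance, a sketch with the PyWavelets package (signal, rates, and level here are illustrative):

```python
import numpy as np
import pywt  # PyWavelets

fs = 2000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

# 3-level DWT: coefficient counts roughly halve at each level,
# and so do the spanned frequency bands
coeffs = pywt.wavedec(x, 'db4', level=3)    # [A3, D3, D2, D1]
for name, c in zip(['A3', 'D3', 'D2', 'D1'], coeffs):
    print(name, len(c))

# perfect reconstruction from the coefficients (synthesis)
x_rec = pywt.waverec(coeffs, 'db4')
print(np.allclose(x, x_rec[:len(x)]))
```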
Decomposition of non-stationary signal

Signal: 0.0-0.4 s: 20 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 2 Hz
Wavelet: db4, Level: 6
[Figure: DWT subbands ordered from low frequency (fL) to high frequency (fH)]

Decomposition of non-stationary signal

Signal: 0.0-0.4 s: 2 Hz; 0.4-0.7 s: 10 Hz; 0.7-1.0 s: 20 Hz
Wavelet: db4, Level: 6
[Figure: DWT subbands ordered from low frequency (fL) to high frequency (fH)]
Reconstruction from WT

What
How those components can be assembled
back into the original signal without loss of
information?
A Process After decomposition or analysis.
Also called synthesis
How
Reconstruct the signal from the wavelet
coefficients
Where wavelet analysis involves filtering and
downsampling, the wavelet reconstruction
process consists of upsampling and filtering
Reconstruction from WT

Lengthening a signal component by


inserting zeros between samples
(upsampling)
MATLAB Commands: idwt and waverec.
Wavelet Applications

Typical Application Fields


 Astronomy, acoustics, nuclear engineering, sub-
band coding, signal and image processing,
neurophysiology, music, magnetic resonance
imaging, speech discrimination, optics, fractals,
turbulence, earthquake-prediction, radar, human
vision, and pure mathematics applications
Sample Applications
 De-noising signals
 Breakdown detecting
 Detecting self-similarity
 Compressing images
 Identifying pure tone
Signal De-noising

Highest Frequencies
Appear at the Start of
The Original Signal
Approximations Appear
Less and Less Noisy
Also Lose Progressively
More High-frequency
Information.
In A5, About the First
20% of the Signal is
Truncated
Breakdown Detection

The Discontinuous Signal


Consists of a Slow Sine
Wave Abruptly Followed
by a Medium Sine Wave.
The 1st and 2nd Level
Details (D1 and D2) Show
the Discontinuity Most
Clearly
Things to be Detected
 The site of the change
 The type of change (a rupture of the signal, or an abrupt change in its first or second derivative)
 The amplitude of the change
[Figure: discontinuity points]
Detecting Self-similarity
Purpose
 How analysis by wavelets
can detect a self-similar,
or fractal, signal.
 The signal here is the Koch
curve -- a synthetic signal
that is built recursively
Analysis
 If a signal is similar to
itself at different scales,
then the "resemblance
index" or wavelet
coefficients also will be
similar at different scales.
 In the coefficients plot,
which shows scale on the
vertical axis, this self-
similarity generates a
characteristic pattern.
Image Compression
Fingerprints
 FBI maintains a large database of
fingerprints — about 30 million
sets of them.
 The cost of storing all this data
runs to hundreds of millions of
dollars.
Results
 Values under the threshold are
forced to zero, achieving about
42% zeros while retaining almost
all (99.96%) the energy of the
original image.
 By turning to wavelets, the FBI
has achieved a 15:1 compression
ratio
 better than the more traditional
JPEG compression
Identifying Pure Tone

Purpose
 Resolving a signal into constituent
sinusoids of different frequencies
 The signal is a sum of three pure
sine waves
Analysis
 D1 contains signal components whose
period is between 1 and 2.
 Zooming in on detail D1 reveals that
each "belly" is composed of 10
oscillations.
 D3 and D4 contain the medium sine
frequencies.
 There is a breakdown between approximations A3 and A4 -> the medium frequency has been subtracted.
 Approximations A1 to A3 can be used to estimate the medium sine.
 Zooming in on A1 reveals a period
of around 20.
Empirical Mode Decomposition
Principle
Objective — From one observation of x(t), get an AM-FM type representation:

x(t) = Σ_{k=1}^{K} a_k(t) Ψ_k(t)

with a_k(·) amplitude modulating functions and Ψ_k(·) oscillating functions.
Idea — “signal = fast oscillations superimposed on slow oscillations”.

Operating mode — (“EMD”, Huang et al., ’98)
(1) identify, locally in time, the fastest oscillation;
(2) subtract it from the original signal;
(3) iterate upon the residual.
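A minimal SciPy sketch of this operating mode (cubic-spline envelopes and a fixed sift count stand in for Huang et al.'s stopping criterion; boundary effects are ignored):

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One sifting pass: subtract the mean of the extrema envelopes."""
    maxima = argrelextrema(x, np.greater)[0]
    minima = argrelextrema(x, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return None                                 # too few extrema: residual
    upper = CubicSpline(t[maxima], x[maxima])(t)    # envelope through maxima
    lower = CubicSpline(t[minima], x[minima])(t)    # envelope through minima
    return x - (upper + lower) / 2

def emd(x, t, n_imfs=3, n_sifts=10):
    """(1) sift out the fastest oscillation, (2) subtract it, (3) iterate."""
    imfs, residual = [], x.copy()
    for _ in range(n_imfs):
        h = residual
        for _ in range(n_sifts):                    # crude stopping rule
            h_new = sift_once(h, t)
            if h_new is None:
                return imfs, residual
            h = h_new
        imfs.append(h)
        residual = residual - h                     # iterate on the residual
    return imfs, residual

t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
imfs, res = emd(x, t)   # fast oscillation comes out first
```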
Empirical Mode Decomposition
Principle

[Figure: an LF sawtooth + a linear FM = the composite signal, each plotted over t ∈ [0, 1]]
Empirical Mode Decomposition
Algorithmic definition

[Animation, shown frame by frame over many slides: the SIFTING PROCESS - envelopes through the extrema are interpolated, their mean is subtracted, and the result is iterated; the procedure successively extracts the First, Second, and Third Intrinsic Mode Functions and finally leaves the Residual]

Empirical Mode Decomposition
Algorithmic definition
Signal = 1st Intrinsic Mode Function + 2nd Intrinsic Mode Function + 3rd Intrinsic Mode Function + Residual
Empirical Mode Decomposition
Intrinsic Mode Functions

— Quasi-monochromatic harmonic oscillations
 #{zero crossings} = #{extrema} ± 1
 symmetric envelopes around the y = 0 axis
— IMF ≠ Fourier mode and, in nonlinear situations, one IMF = several Fourier modes
— Output of a self-adaptive time-varying filter (≠ standard linear filter)

ex: 2 FM sinusoids + a Gaussian wave packet
Empirical Mode Decomposition
Instantaneous frequency (IF)

— The analytic version of each IMF C_i(t) is computed using the Hilbert transform as:

z_i(t) = C_i(t) + jH[C_i(t)] = a_i(t) e^(jθ_i(t))

— hence z_i(t) becomes complex, with phase and amplitude. The IF can then be computed as:

ω_i(t) = dθ_i(t)/dt

— The Hilbert spectrum (HS) is the triplet H(ω, t) = {t, ω_i(t), a_i(t)}
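A small SciPy sketch of this computation for a single component (the chirp below merely stands in for one IMF):

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
c = np.cos(2 * np.pi * (50 * t + 30 * t ** 2))  # a chirp standing in for C_i(t)

z = hilbert(c)                     # analytic signal z(t) = C(t) + jH[C(t)]
a = np.abs(z)                      # instantaneous amplitude a(t)
theta = np.unwrap(np.angle(z))     # instantaneous phase θ(t)
inst_f = np.diff(theta) / (2 * np.pi) * fs   # IF in Hz: ω(t) = dθ/dt
```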
Empirical Mode Decomposition
Intrinsic Mode Functions

[Figure: a signal in time, its spectrum in frequency, and its time-frequency representation]

Empirical Mode Decomposition
Intrinsic Mode Functions

[Figure: the signal and its 1st, 2nd, and 3rd IMFs, each shown in the time-frequency plane]
Empirical Mode Decomposition
Intrinsic Mode Functions

•EMD is a fully data-adaptive decomposition for spectral and time-frequency representation of non-linear and non-stationary time series
•It does not employ any basis function for decomposition
•It produces perfect localization of the signal components in the high-resolution time-frequency space of the time series
Empirical Mode Decomposition
Comparison between HS and STFT

Time-frequency representation of two pure tones (100 Hz and 250 Hz) using HS and STFT
[Figure: Hilbert spectrum (HS) vs. STFT]

Empirical Mode Decomposition
Comparison between Wavelet and HS
[Figure: Wavelet vs. Hilbert spectrum (HS)]
Remarks on FT

•The Fourier Transform has a solid mathematical foundation
•Can be used in robust analysis, since it retains phase information
•The detail of the signal information is limited by the (sinusoidal) basis function
•STFT analysis includes some additional cross-spectral energy that degrades the performance in some applications
Remarks on WT

•WT employs a data-adaptive basis function based on its time and frequency scales
•It can produce more detailed signal information in the T-F representation
•WT also performs well in multi-band decomposition
•The reconstruction error of the multi-band representation is much less than that of the FT
•It cannot preserve the phase information needed for perfect reconstruction from T-F space
Remarks on EMD and HS

•EMD is a fully adaptive multi-band decomposition method
•It produces perfect localization of signal components in T-F space
•HS can represent the instantaneous spectra of the signal
•The signal can be reconstructed with negligible error terms
•It does not yet have a mathematical foundation
•It is difficult to use EMD-based decomposition in robust analysis
Application of DSP in audio analysis

•Audio source separation from mixture using independent subspace analysis (ISA)
•Audio source separation by spatial localization in the underdetermined case
•Robust pitch estimation using EMD

Application of DSP in audio analysis

Audio source separation from mixture using independent subspace analysis (ISA)
Source separation by Independent subspace analysis (ISA)

[Diagram: audio mixture s(t) → STFT → magnitude spectrogram (plus phase φ) → PCA → basis vector selection → ICA → basis vector clustering → source spectrograms → ISTFT → individual sources]
T-F representation of mixture

Audio mixture → Short Time Fourier Transform (STFT) → magnitude spectrogram X + phase information
Window: 30 ms, overlap: 20 ms
Proposed separation model
Mixture spectrogram X = Σ x_i, with source spectrograms x_i = B_i A_i

B_i = [b_1^(i), b_2^(i), …, b_n^(i)]: invariant frequency n-component basis
A_i = [a_1^(i), a_2^(i), …, a_n^(i)]^T: corresponding amplitude envelopes
A_i = B_i^T X,  B_i = X A_i^T
** The goal is to find independent B_i or A_i
Dimension Reduction
Rows or columns of X >> number of sources
Subject to reducing the dimension of X, singular value decomposition (SVD) is used:

X_{n×k} = U_{n×n} S_{n×k} V_{k×k}^T

- U and V: orthogonal matrices (column-wise)
- S: diagonal matrix of singular values σ₁ ≥ σ₂ ≥ σ₃ ≥ … ≥ σ_n ≥ 0
- p basis vectors (from U or V) are selected by requiring (Σ_{i=1}^{p} σ_i) / (Σ_{i=1}^{n} σ_i) ≥ φ, with φ = 0.5 to 0.6
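A minimal NumPy sketch of this selection rule (the spectrogram here is a random placeholder; φ = 0.55 is one value inside the stated 0.5-0.6 range):

```python
import numpy as np

def select_basis(X, phi=0.55):
    """Keep the p leading singular vectors whose singular values account
    for a fraction phi of the total singular value sum."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    p = int(np.searchsorted(np.cumsum(s) / np.sum(s), phi) + 1)
    return U[:, :p], s[:p], Vt[:p, :]

X = np.abs(np.random.randn(257, 400))   # placeholder magnitude spectrogram
B, s, A = select_basis(X)               # p frequency bases and time envelopes
print(B.shape, A.shape)
```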
Proposed separation model

To derive the basis vectors:
 Singular value decomposition (SVD) is applied as PCA
 Some principal components are selected as basis vectors
 Independent component analysis (ICA) is applied to make the bases independent
Independent basis vectors

[Figure: the bases before ICA and after ICA]
** The bases become independent along time frames

Producing source subspaces
The bases of the speech signal
[Figure: time bases and frequency bases]
Source Subspaces (cont.)

[Diagram: mixture spectrogram → PCA + basis selection + ICA → basis vectors B, A → KLd-based clustering → source subspaces B1A1 and B2A2 → source spectrograms]
Source re-synthesis

Separated subspaces (spectrograms) → append phase information → inverse STFT → separated sources:

S_i(n, k) = x_i · e^(j[φ(n,k)])

[Figure: mixture of speech & bip-bip sound; separated speech; separated bip-bip sound]
Experimental results
Separated signals with the proposed algorithm

[Figure: mixtures and separated signals - speech + bip-bip; male + female speech]
Application of DSP in audio analysis

Audio source separation by spatial localization in the underdetermined case
Localization based separation

To avoid dependency on the spectral content of the signal in separation
To increase the number of sources, the spatial location is considered
Binaural mixtures are used instead of a single mixture
Localization based separation (cont.)
Consider a multi-source audio situation
Humans can easily localize and separate the sources with the HAS (human auditory system)
The binaural cues ITD and ILD are mainly used in source localization
Separation is performed by applying beamforming and a binary mask
Source localization Cues
• Interaural time difference (ITD) between two
microphones’ signals (like two ears of human)
• Interaural level difference (ILD)

[Figure: ITD and ILD]
Source localization

X_r(ω) and X_l(ω) are the STFTs of x_r(t) and x_l(t)
ITD and ILD are calculated as:

τ_ITD(ω) = (1/ω) {φ_l(ω) − φ_r(ω)}
κ_ILD(ω) = 20 log( |X_l(ω)| / |X_r(ω)| )

where φ_r(ω) and φ_l(ω) are the unwrapped phases of X_r(ω) and X_l(ω), respectively, at frequency ω
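A minimal NumPy sketch of these two formulas (here `Xl`, `Xr` are one frame's left/right spectra and `omega` the bin frequencies in rad/s, all assumed given; the small epsilon is only to avoid division by zero):

```python
import numpy as np

def itd_ild(Xl, Xr, omega):
    """Per-frequency ITD/ILD from the left/right spectra, as defined above."""
    phase_l = np.unwrap(np.angle(Xl))
    phase_r = np.unwrap(np.angle(Xr))
    tau_itd = (phase_l - phase_r) / omega                          # (1/ω)(φ_l − φ_r)
    kappa_ild = 20 * np.log10(np.abs(Xl) / (np.abs(Xr) + 1e-12))   # in dB
    return tau_itd, kappa_ild

# e.g. for one STFT frame of length N at sampling rate fs, skipping the DC bin:
# omega = 2 * np.pi * np.fft.rfftfreq(N, 1 / fs)[1:]
```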
Source localization (cont.)
ITD becomes ambiguous at higher frequencies (a factor of the microphones’ spacing)
ITD dominates at low frequencies; at high frequencies, ILD dominates to resolve the ambiguity
[Figure: ITD at low frequency, ILD at high frequency]
Source localization (cont.)
ITD and ILD are quantized into 50 levels
Collection of T-F points corresponding to each
ITD/ILD quantized pair produces peaks
Separation by beamforming

ITD is derived for each of the


localized sources
Spatial Beamforming is applied
Linearly constrained minimum
variance Beamforming (LCMVB) is
used
The gain is selected based on the
spatial locations
Separation by binary mask
with HS
It is required to avoid the
limitations of spatial beamforming
Separation is performed by binary
mask estimation based on ITD/ILD
The sources are considered as
disjoint orthogonal in T-F space
 not more than one source is
active at any T-F point
Computing ITD and ILD
Each mixture is transformed to the T-F domain using Hilbert spectrums (H_L and H_R)
ITD and ILD are measured as:

{ITD(ω, t_f), ILD(ω, t_f)} = { (1/ω) ∠[H_L(ω, t_f)/H_R(ω, t_f)], |H_R(ω, t_f)|/|H_L(ω, t_f)| }

ILD_dB(ω, t_f) = 20 log10( |H_R(ω, t_f)|² / |H_L(ω, t_f)|² )

where t_f is the time frame
ITD-ILD Space Localization
ITD and ILD are quantized into 50 levels
Collection of T-F points from HS
corresponding to each ITD/ILD quantized
pair produces peaks
Source Separation
Each peak region in the histogram refers to a source of the binaural mixtures
Construct a binary mask (nullifying T-F points of interfering sources) M_i(ω, t)
The HS of the ith source is separated as:

H_i(ω, t) = M_i(ω, t) H_L(ω, t)

The time domain ith source is given as:

s_i(t) = Σ_ω H_i(ω, t) · cos[φ(ω, t)]
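A sketch of the masking step (NumPy; the per-point `itd`/`ild` arrays and the histogram `peak`/`tol` values are assumed to come from the localization stage above):

```python
import numpy as np

def binary_mask_separate(HL, itd, ild, peak, tol):
    """Keep only T-F points whose (ITD, ILD) lie near one histogram peak;
    interfering points are nulled, giving H_i = M_i * H_L."""
    M = (np.abs(itd - peak[0]) < tol[0]) & (np.abs(ild - peak[1]) < tol[1])
    return M * HL
```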
Source disjoint orthogonality
Disjoint orthogonality (DO) of audio sources assumes that not more than one source is active at any T-F point:

F_1(ω, t) · F_2(ω, t) = 0, ∀ ω, t

where F_1 and F_2 are the TFRs of two signals
SIR (signal to interference ratio) is used as the basis to measure DO
Source disjoint orthogonality (cont.)

[Figure: three audio sources s1, s2, s3 captured at a microphone, and their TFRs in the time-frequency plane]
Source disjoint orthogonality (cont.)
The SIR of the jth source is defined as:

SIR_j = Σ_ω Σ_t X_j(ω, t) / Y_j(ω, t),  Y_j(ω, t) ≠ 0

Y_j(ω, t) = Σ_{i=1, i≠j}^{N} X_i(ω, t)

Y_j: the sum of the interfering sources
Source disjoint orthogonality (cont.)
Dimensions of the HS and STFT of the same signal may be different
DO is defined as a percentage over the entire TFR region
The average DO (ADO) of all sources is:

ADO = (1/N) Σ_{j=1}^{N} SIR_j

N: number of sources
Experimental results
The three mixtures are defined as
m1{sp1(-40°, 0°), sp2(30°, 0°), ft(0°, 0°)},
m2{sp1(20°, 10°), sp2(0°, 10°), ft(-10°,10°)},
m3{sp1(40°, 20°), sp2(30°, 20°), ft(-20°, 20°)}
The separation efficiency is measured as OSSR (original to separated signal ratio), defined as:

OSSR = (1/T) Σ_{t=1}^{T} log10( Σ_{i=1}^{w} s²_original(t + i) / Σ_{i=1}^{w} s²_separated(t + i) )
Experimental results (cont.)
The comparative separation efficiency (OSSR) using HS and STFT:

Mixture | TFR  | OSSR of sp1 | OSSR of sp2 | OSSR of ft
m1      | HS   | -0.0271     | 0.0213      | 0.0264
m1      | STFT | 0.0621      | -0.0721     | -0.0531
m2      | HS   | 0.0211      | -0.0851     | -0.0872
m2      | STFT | 0.0824      | 0.1202      | 0.1182
m3      | HS   | 0.0941      | -0.0832     | 0.0225
m3      | STFT | -0.1261     | 0.1092      | -0.0821
Experimental results
This experiment also compares the DO using HS and STFT as the TFR
STFT is affected by many factors: window function and its length, overlapping, FFT points
HS is independent of such factors
HS is only slightly affected by the number of frequency bins used in the TFR
Experimental results (cont.)
[Figure: the ADO of HS and STFT as a function of the number of frequency bins (N = 3)]

Experimental results (cont.)
[Figure: only the ADO of STFT is affected by the window overlapping factor (%)]
Experimental results (cont.)
STFT includes more cross-spectral energy terms
[Figure: the TFR of two pure tones using HS and STFT]

Experimental results (cont.)
HS always has better DO for audio signals
DO depends on the resolution of the TFR
The STFT has to satisfy the uncertainty inequality Δt·Δω ≥ 1/2
The frequency resolution of HS extends up to the Nyquist frequency
Its time resolution is up to the sampling rate, and hence HS offers better resolution
Remarks

The separation efficiency is independent of the signal’s spectral characteristics
The performance is affected by the angular separation and disjointness of the sources
HS produces better disjointness in the T-F domain and hence better separation
The binaural mixtures were recorded in the anechoic room of NTT
Application of DSP in audio analysis

Robust pitch estimation using EMD
Why EMD in pitch estimation?

Pitch facilitates speech coding, enhancement, recognition, etc.
The autocorrelation function is most commonly used in pitch estimation algorithms
The autocorrelation (AC) function reveals the periodic property of the speech
EMD in pitch estimation (cont.)
[Figure: AC function of a speech frame]

EMD in pitch estimation (cont.)
Pitch is the sample difference between two consecutive peaks in the AC function
Sometimes the pitch peak may be less prominent, especially due to noise
EMD in pitch estimation (cont.)
EMD decomposes any signal into components from higher to lower frequency
It produces the local and global oscillations of the signal
The global oscillation closely follows the envelope of the signal
The IMF of the global oscillation is used to estimate the pitch
Pitch estimation with EMD
[Figure: EMD of the AC function of a speech frame]

Pitch estimation with EMD (cont.)
There exists an IMF in the EMD domain representing the global oscillation of the AC function
That IMF represents the sinusoid of the pitch period
Pitch is the frequency of that IMF, rather than being found from the pitch peak
Pitch estimation with EMD (cont.)

In the EMD here, IMF-5 is the oscillation of the pitch period
Determining the target IMF representing the sinusoid with the pitch period is a crucial step
The IMFs of oscillation lower in frequency than the pitch period can be discarded by energy thresholding
Pitch estimation with EMD (cont.)
A reference pitch is computed by the weighted AC (WAC) method
This pitch information is used to select the IMF with the pitch period
The periodicity of the selected IMF is computed as the pitch period
Pitch estimation with EMD (cont.)

The peak at zero lag is selected
Two cycles are selected from both sides
The average number of samples per cycle gives the periodicity
Proposed Pitch estimation Algorithm
The normalized autocorrelation (AC) of the speech frame is computed
Determine a rough pitch period using the WAC method
Apply EMD to the AC function
Select the IMF of the pitch period on the basis of the WAC-based estimate
The period of the selected IMF is the estimated pitch
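A hedged sketch of the first and last steps (NumPy only; the WAC-based selection is assumed to have been done already, so `imf_of_ac` is a hypothetical input holding the chosen IMF of the AC function):

```python
import numpy as np

def normalized_ac(frame):
    """Step 1: normalized autocorrelation of one speech frame."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return ac / ac[0]

def pitch_from_imf(imf_of_ac, fs):
    """Final step: the pitch is the frequency of the selected IMF.
    Zero crossings of the (roughly sinusoidal) IMF give its period."""
    sign = np.signbit(imf_of_ac).astype(int)
    zc = np.where(np.diff(sign) != 0)[0]       # zero-crossing indices
    if len(zc) < 3:
        return None
    period = 2.0 * np.mean(np.diff(zc))        # samples per full cycle
    return fs / period                         # pitch in Hz
```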
Experimental results

Keele pitch database is used here


20kHz sampling rate
Frame length is 25.6ms with 10ms
shifting
Each frame is filtered by band-pass
filter of pitch range (50-500Hz)
Gross pitch error (GPE) is used to
measure the performance
Experimental results (cont.)

The %GPE of male and female speech at different SNRs (total number of frames: 1823):

SNR    | 30 dB | 20 dB | 10 dB | 0 dB  | -5 dB | -15 dB
Female | 1.90  | 2.83  | 3.93  | 10.12 | 21.83 | 64.24
Male   | 2.15  | 3.78  | 5.22  | 11.89 | 23.56 | 66.76
Remarks

The use of EMD makes the pitch estimation


method more robust
EMD of AC function can extract the
fundamental oscillation of the signal
The pitch can be easily estimated from the
single sinusoid of fundamental oscillation
It is not affected by prominent non-pitch peaks
Future works

The open problem is to identify the IMF with the pitch period
In the present algorithm, the error of the rough ACF-based pitch estimate can propagate to the final estimation
The performance has not yet been tested against other existing algorithms
Open Problem-1
Instantaneous Pitch (IP) estimation using EMD
•Frame-based pitch estimation is already done
•Paper accepted at EUROSPEECH 2007
[Figure: three methods of pitch estimation]
•We have used the pitch-information-based WAC to compute the exact pitch (IMF) from EMD space
•Problem: to compute IP only from the EMD space
Open Problem-2
Voiced/Unvoiced Detection with EMD
•Useful in speech enhancement and speech/speaker recog.
•Paper with preliminary results published at ICICT2007
[Figure: V/UV differentiation]
•Problem: to derive a better separation region for V/UV and to conduct experiments with large speech data
Open Problem-3
Robust Audio Source Localization
•Localization is done by delay-attenuation computed in the T-F space of binaural mixtures - NOT noise robust
[Figure: localization of three sources using TD-LD computed in T-F space]
•The problem is to derive a mathematical model for robust localization in the underdetermined situation
Open Problem-4
Speech denoising using image processing
•Noisy speech can be represented as an image via a time-frequency (T-F) representation, e.g. the spectrogram
[Figure: speech with white noise]
•Image processing algorithms can be used for denoising
•This seems easy for musical/white noise
•The problem is to deal with other noises, even by using binaural mixtures
Open Problem-5
Auditory segmentation with binaural mixtures
•Auditory segmentation is the first stage of source separation using auditory scene analysis (ASA)
[Figure: source separation by ASA]
•The problem is to use binaural mixtures for improved auditory segmentation as source separation
•T-F representations other than the FT can be employed
Open Problem-6
Two stage speech enhancement
•Single stage speech enhancement is not efficient in all noisy situations
•For example, musical noise is introduced with binary masking and some thresholding methods
•Noise may not be separated perfectly by using ICA or ISA (independent subspace analysis) based techniques
•Multi-stage enhancement with a suitable order can improve the performance

Noisy speech → First stage enhancement → Second stage enhancement → Clean speech
Open Problem-7
Informative features extraction
•To use spectral dynamics in speech/speaker recog.
•Special types of speech features are required
[Figure: mixed signal with its spectrogram]
•How to parameterize the speech signal to represent speech dynamics
•WT and HS based spectral analysis can be studied
Open Problem-8
Source based audio indexing
•Useful in multimedia applications and moving audio source separation
[Figure: separation of moving sources - audio sources s1, s2, s3 at different azimuth angles (0 to 180 degrees), at 1.5 m]
•Several new methods could be used for indexing: Ada-boost, Tree-ICA, conditional random fields
Open Problem-9
Time-series prediction with EMD
•Applies to financial and environmental time series
•Conventional methods use the Kalman filter (for smoothing) and an AR model for prediction
[Figure: a non-stationary time series]
•EMD can be used as a smoothing filter to enhance the prediction accuracy
Open Problem-10
Heart-rate analysis with ECG data using EMD
[Figure: different parts of the ECG signal]
•ECG variability analysis at different frequency regions
•Analysis of the instantaneous ECG condition
•Abnormality analysis of heart rate using EMD-based spectral modeling
The End

Questions/Suggestions Please