
IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO) 2015

Statistical Parametric Speech Synthesis: A Review


Athira Aroon
Department of Electronics Engineering
A.I.S.S.M.S Institute of Information Technology
Pune, India
athiraaroon3@gmail.com

S.B. Dhonde
Department of Electronics Engineering
A.I.S.S.M.S Institute of Information Technology
Pune, India
dhondesomnath@gmail.com

978-1-4799-6480-2/15/$31.00 2015 IEEE

Abstract- In this paper we briefly review Statistical Parametric Speech Synthesis (SPSS) based on the hidden Markov model. A non-mathematical introduction to SPSS is given. We emphasize the recent emerging techniques used in SPSS, such as the autoregressive HMM model, Gaussian Process Regression (GPR), Neural Autoregressive Distribution Estimators (NADE), which overcome limitations of Restricted Boltzmann Machines (RBM), and Deep Neural Networks (DNNs). One of the major drawbacks of SPSS is vocoder quality; in view of this problem we analyze spectral envelope estimation algorithms proposed for high-quality speech synthesis, such as STRAIGHT, TANDEM-STRAIGHT and CheapTrick.

Index Terms-- Text-to-Speech, Statistical Parametric Speech Synthesis (SPSS), Vocoder Quality.

I. INTRODUCTION

Speech is the vocalized form of communication with other people. It comprises names that are drawn from large vocabularies. Spoken words originate from the phonetic combination of a limited set of vowel and consonant speech sound units. Speech synthesis is the artificial production of speech, and a system performing this function is called a speech synthesizer [1].

Fig 1. Text converted to abstract linguistic representation [1] (blocks: input text, analysis routines, abstract underlying linguistic description, synthesis routines, output speech)

A Text-To-Speech (TTS) system converts normal text of any language into speech. Highly intelligible synthesizers are being developed that produce sound as natural as expected. Naturalness and intelligibility are the important qualities required of a speech synthesizer. Intelligibility is the ease with which the output is understood, and naturalness describes how closely the output resembles human speech. Speech synthesis research tries to maximize both of these qualities. Descriptions of speech synthesizers often take a procedural view: they describe the sequence of processes required to convert text into speech, often arranged in a simple 'pipeline' architecture [1][2]. We have undertaken this review in order to study the progress made in speech synthesizers, and in particular in one of their techniques, statistical parametric speech synthesis.

This paper gives a brief description of statistical parametric synthesis. The paper comprises three sections. Section 2 describes the emerging trends in statistical parametric speech synthesis techniques. Section 3 explains the methods used to improve vocoder quality.

II. STATISTICAL PARAMETRIC SPEECH SYNTHESIS

This system is based on hidden Markov models. The model is parametric because it describes the speech using parameters rather than stored exemplars. It is statistical because it describes those parameters using statistics (e.g., means and variances of probability density functions) which capture the distribution of parameter values found in the training data. An HMM-based speech synthesis system comprises a training stage and a synthesis stage. In the training stage, mel-cepstral coefficients are obtained from the speech database by mel-cepstral analysis and are used to train phoneme HMMs [3]. The parameters are extracted for the phonemes, and an HMM is then generated for each phoneme. In the synthesis stage, the text to be synthesized is transformed into a phoneme sequence, and a sentence HMM representing the whole text is constructed by concatenating phoneme HMMs. From the sentence HMM, a speech parameter sequence is generated using the algorithm for speech parameter generation from HMM. By using a suitable algorithm for spectral synthesis, speech is synthesized from the generated mel-cepstral coefficients [3][4].
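The synthesis stage of an HMM-based system, as described in Section II, can be illustrated with a minimal Python sketch. It assumes hypothetical phoneme HMMs (`PHONEME_HMMS`) whose states each hold the mean and variance of a single mel-cepstral coefficient, and a fixed, hypothetical state duration (`FRAMES_PER_STATE`); none of these values come from the reviewed systems. The sketch shows only the concatenation into a sentence HMM and a naive generation step that emits the state means. Practical parameter generation algorithms also use dynamic features and duration models to obtain smooth trajectories.

```python
import random

# Hypothetical phoneme HMMs: each state stores (mean, variance) of one
# mel-cepstral coefficient. Values are invented for illustration only.
PHONEME_HMMS = {
    "h": [(0.2, 0.01), (0.5, 0.02)],
    "ai": [(1.0, 0.05), (1.4, 0.04), (0.9, 0.03)],
}
# Hypothetical fixed state duration in frames (real systems model durations).
FRAMES_PER_STATE = 3


def sentence_hmm(phonemes):
    """Concatenate phoneme HMMs into one left-to-right state sequence."""
    states = []
    for phoneme in phonemes:
        states.extend(PHONEME_HMMS[phoneme])
    return states


def generate_parameters(states, sample=False, seed=0):
    """Emit one parameter value per frame from the state Gaussians.

    With sample=False the state mean is emitted (piecewise constant track);
    with sample=True a value is drawn from each state's Gaussian.
    """
    rng = random.Random(seed)
    track = []
    for mean, variance in states:
        for _ in range(FRAMES_PER_STATE):
            value = rng.gauss(mean, variance ** 0.5) if sample else mean
            track.append(value)
    return track


states = sentence_hmm(["h", "ai"])
trajectory = generate_parameters(states)
print(len(states), len(trajectory))
```

The model is "parametric" in the sense that only the means and variances are stored, and "statistical" in the sense that those numbers summarize the training data; the piecewise constant track produced here also hints at why smoothing and over-smoothing become important topics in the later sections.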

A. Autoregressive HMM Model

The autoregressive HMM is described in [6] as a probabilistic model, and a specific form of it, the linear Gaussian linear autoregressive HMM (LGLAR HMM), is used to model speech parameters. It uses the linear Gaussian linear regression model described in a larger sequential model. It supports existing high-quality speech parameter generation methods, such as parameter generation considering global variance, and it supports a simple and exact time-recursive form of speech parameter generation that is not available for the standard HMM synthesis framework or the trajectory HMM and which may be used for low-latency parameter generation. The LGLAR HMM, like the standard HMM synthesis framework, has certain parameters, such as the MDL tuning factor. The LGLAR HMM is capable of producing speech that is as natural as that of the standard HMM synthesis framework with its conventional settings, but not as natural as that of the trajectory HMM [6].

Fig 2. HMM based TTS system [4]

B. Gaussian Process Regression

Gaussian process regression (GPR) is a statistical technique with a long history in spatial statistics, and more recently in function estimation and prediction. To make the computational cost feasible, the partially independent conditional (PIC) approximation was adopted, and it was shown that the GP-based approach achieved comparatively better performance than an HMM-based system. Contributing advantages of GPs include the flexibility of model complexity and the robustness against over-fitting. Hierarchical GPR may be used to distinguish individual, group, and condition differences. Although GP-based speech synthesis is promising, there exist a number of issues for the realization of practical systems. One of them is the generation of acoustic feature trajectories from the predictive distribution [8]. Generally, parameter generation causes an over-smoothing problem. Global variance (GV), which is widely used to alleviate the over-smoothing problem in HMM-based speech synthesis, was considered as an alternative. Another issue is the selection of the hyperparameters of the kernel function used in the GP, so hyperparameter optimization for the PIC approximation using the EM algorithm was introduced. GV and hyperparameter optimization outperformed the conventional HMM-based approach in subjective evaluation [9].

C. Neural Autoregressive Distribution Estimators (NADE)

The NADE is inspired by the Restricted Boltzmann Machine (RBM), a kind of bipartite undirected graphical model that has been applied to speech synthesis and voice conversion. However, the RBM does not provide a tractable partition function for computing the probability of an observation. Not knowing the exact value of the partition function makes it hard to evaluate how well the distribution estimated by the RBM fits the observations. NADE solves the difficulty of partition function calculation by decomposing the joint distribution of observations into tractable conditional distributions. Therefore, NADE was adopted as the form of the state PDFs instead of the RBM [10].

NADE has been proved to be an efficient multivariate binary distribution estimator and performs similarly to large (but intractable) RBMs on several datasets. Comparing the ability of model generalization between RBMs and NADEs, the experimental results show that NADEs demonstrate better performance than RBMs due to the accurate calculation of gradients at training time. NADE can also be understood as a special kind of autoencoder whose output assigns valid probabilities to observations and hence is a proper generative model. Results have also shown the superiority of NADEs over Gaussian mixture models in describing the distribution of spectral envelopes as a density model and in alleviating the over-smoothing effect at synthesis time. Directions for further work include incorporating the dynamic features of mel-cepstra and spectral envelopes into NADE modeling and extending the spectral features from the spectral envelopes to the FFT spectrum [10][11].

D. Deep Neural Networks (DNNs)

Deep neural networks are feedforward artificial neural networks (ANNs) with many hidden layers, and they have achieved significant improvement in many machine learning areas. They were also introduced as acoustic models for statistical parametric speech synthesis (SPSS). In SPSS, a number of linguistic features that affect speech, including phonetic, syllabic, and grammatical ones, have to be taken into account in acoustic modeling to achieve natural-sounding synthesized speech. Effective modeling of these complex context dependencies is one of the most critical problems for SPSS. In DNN-based SPSS, a DNN is trained to represent the mapping function from linguistic features (inputs) to acoustic features (outputs).
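The mapping from linguistic features to acoustic features can be sketched as the forward pass of a small feedforward network. The layer sizes, the random (untrained) weights and the input frame below are all hypothetical and chosen only to keep the example short; a real DNN acoustic model would be trained on aligned linguistic and acoustic data and would use far larger layers.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Affine transform of x followed by an optional elementwise activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def acoustic_model(linguistic, layers):
    """Map one frame of linguistic features to acoustic features."""
    hidden = linguistic
    for layer in layers[:-1]:
        hidden = dense(hidden, layer, math.tanh)  # hidden layers
    return dense(hidden, layers[-1])              # linear output layer


rng = random.Random(0)
# Hypothetical sizes: 6 linguistic inputs, two hidden layers, 3 acoustic outputs.
sizes = [6, 8, 8, 3]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
# Hypothetical input: one-hot phoneme identity (4 values) plus 2 context values.
frame = [0.0, 1.0, 0.0, 0.0, 0.3, 0.7]
acoustic = acoustic_model(frame, layers)
print(len(acoustic))
```

The linear output layer reflects the regression nature of the task: the outputs stand in for acoustic parameters such as mel-cepstral coefficients rather than class probabilities.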

DNN-based acoustic models offer an efficient and distributed representation of complex dependencies between linguistic and acoustic features and have shown the potential to produce natural-sounding synthesized speech. Mixture density networks (MDNs) have been introduced to overcome the limitations of DNN-based acoustic modelling for speech synthesis, such as the lack of variances and the unimodal nature of the objective function [7]. Objective and subjective evaluations of DNN-based SPSS with mixture density networks have shown that having variances and multiple mixture components, by using a mixture density output layer, was helpful in predicting acoustic features more accurately and improved the naturalness of the synthesized speech significantly [7][10].

TABLE I.

1. Authors: Matt Shannon et al. (2013) [6]
   Proposed Work: Proposed using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis.
   Contribution: Compared to the standard HMM synthesis framework, the trajectory HMM has better mean trajectories and covariances, and a higher naturalness score. Compared to the autoregressive HMM, the trajectory HMM has better mean trajectories.

2. Authors: Zhen-Hua Ling et al. (2013) [10]
   Proposed Work: The authors presented a new approach which uses graphical models with multiple hidden variables, including restricted Boltzmann machines (RBM) and deep belief networks (DBN), to represent the distribution of the low-level spectral envelopes at each HMM state.
   Contribution: Results show the superiority of RBM and DBN over the Gaussian mixture model in describing the distribution of spectral envelopes as density models and in mitigating the over-smoothing effect of the synthetic speech. Features generated by sampling may help make the synthetic speech less monotonic and boring.

3. Authors: Xiang Yin et al. (2014) [11]
   Proposed Work: Adopted neural autoregressive distribution estimators (NADE) for spectral modeling in statistical parametric speech synthesis, in order to alleviate the over-smoothing effect.
   Contribution: Experimental results show the superiority of NADEs over Gaussian mixture models in describing the distribution of spectral envelopes as a density model and in alleviating the over-smoothing effect on the generated spectral structures at synthesis time.

4. Authors: Masanori Morise (2014) [14]
   Proposed Work: A spectral envelope estimation algorithm to achieve high-quality speech synthesis. The algorithm obtains an accurate and temporally stable spectral envelope using the fundamental frequency (F0).
   Contribution: Subjective evaluations demonstrated that CheapTrick was superior to the conventional algorithms and handled F0-manipulated speech more robustly than the other algorithms.

5. Authors: Tomoki Koriyama et al. (2014) [9]
   Proposed Work: The paper examines two issues of a statistical speech synthesis approach based on Gaussian process (GP) regression, which remain although GP-based speech synthesis can give higher performance in the generated parameters than the HMM-based one.
   Contribution: The proposed method, which uses GV and hyperparameter optimization, outperformed the conventional HMM-based approach in subjective evaluation.
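The mixture density output layer mentioned in Section II-D can be sketched for a single scalar acoustic feature. The layout of the raw network outputs (logits, then means, then log-variances), the function names and the numeric values are assumptions made for illustration, not the configuration used in [7]. The sketch shows how such a layer yields variances and more than one mode, which a plain regression output does not provide.

```python
import math


def mdn_parameters(raw, n_mix):
    """Split raw network outputs into mixture weights, means and variances."""
    logits = raw[:n_mix]
    means = raw[n_mix:2 * n_mix]
    log_vars = raw[2 * n_mix:]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    variances = [math.exp(v) for v in log_vars]   # exp keeps variances positive
    return weights, means, variances


def mixture_density(y, weights, means, variances):
    """Density of a scalar acoustic feature y under the Gaussian mixture."""
    density = 0.0
    for w, m, v in zip(weights, means, variances):
        density += w * math.exp(-0.5 * (y - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return density


def negative_log_likelihood(y, raw, n_mix):
    """Training criterion of a mixture density network for one target value."""
    weights, means, variances = mdn_parameters(raw, n_mix)
    return -math.log(mixture_density(y, weights, means, variances))


# Hypothetical raw outputs for 2 components: 2 logits, 2 means, 2 log-variances.
raw_output = [0.0, 0.0, -1.0, 1.0, math.log(0.25), math.log(0.25)]
weights, means, variances = mdn_parameters(raw_output, n_mix=2)
print(weights, means)
```

With these values the predicted distribution has two modes, at -1 and 1, so a target near either mode receives a lower negative log-likelihood than a target at 0, which lies between them.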

III. VOCODER

Vocoder is a term derived from the words VOice and CODER. The major challenges of statistical parametric speech synthesis are the vocoder quality, which is not on par with the pure waveforms of unit selection synthesis; the accuracy of the HMM-based acoustic modeling, which does not exactly model the real speech waveform; and the problem of over-smoothing of the HMM-generated parameter trajectories. We therefore consider vocoder methods for spectral estimation. Many different vocoders may be considered for HMM-based speech synthesis.

A. STRAIGHT

STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) is the most established of the more sophisticated vocoding methods. It is a tool for manipulating voice quality, timbre, pitch, speed and other attributes flexibly. It is an ever-evolving system for attaining better sound quality, close to the original natural speech, by introducing advanced signal processing algorithms and findings in computational aspects of auditory processing. The main feature of STRAIGHT analysis concerns the speech spectrum: a series of advanced methods is adopted to modify and complement the original spectrum directly extracted by the short-time Fourier transform (STFT) [12].

The input speech is decomposed by STRAIGHT into three types of positive-valued parameters: an interference-free spectrogram, an aperiodicity map, and the fundamental frequency (F0) trajectory. The periodic interference in the time domain is eliminated by solving the size mismatch between the fixed time window and the variable pitch with a pitch-adaptive smoothing filter. The phase interference in the frequency domain is then taken into account: a compensatory time window is designed to remove the holes in the spectrogram caused by out-of-phase components. Moreover, a compensation procedure for over-smoothing in the frequency domain is used to recover some underlying spectrum structure and further improve the speech analysis-synthesis performance of the STRAIGHT model [12].

Fig 3. The procedure of smoothed spectrum [12] (recoverable block label: spectrum extraction with time frequency)

One of the advantages of the STRAIGHT model is the aperiodic component parameter, which can effectively describe the voiced attribute in order to enhance the synthesized speech quality. In traditional speech models, such as the Multi-Band Excitation (MBE) model, the voiced/unvoiced labels are simply added to the speech in the time-frequency domain with a hard clustering, which can only identify whether a frame or frequency band of the speech is voiced or not. In contrast to such clustering, the aperiodic component is more flexible in representing the ratio between the periodic and noise energies, which are respectively defined by the higher and lower smoothed spectrum envelopes of the speech.

B. TANDEM-STRAIGHT

TANDEM-STRAIGHT superseded the original STRAIGHT, bringing about a complete reformulation and reengineering based on the same underlying concept. When a linear time-invariant system is excited by a periodic pulse train, its output yields a spectrogram that has periodic interference in both the time and frequency domains, even if the system and the input are temporally stable and spectrally smooth. This is the major problem STRAIGHT and TANDEM-STRAIGHT were designed to address.

TANDEM is a short-term power spectral representation of periodic signals that does not have a temporally varying component. As a procedure, TANDEM shortens the window length while keeping the power spectra temporally constant and the logarithmic power spectra tolerant to background noise. Measures were therefore introduced for the window length, the frequency variations, the temporal variation of the power spectra, and the temporal variation of the logarithmic power spectra. STRAIGHT uses an F0-adaptive triangular smoothing function h1(ω) as an additional anti-aliasing filter impulse response to eliminate the resulting leakage; its base length is set to 2ω0 in this case. TANDEM-STRAIGHT uses an F0-adaptive rectangular smoothing function h2(ω) instead; its base length is set to ω0. h1(ω) is obtained by the convolution of h2(ω) with itself. Smoothing TANDEM spectra using this anti-aliasing smoother selectively removes spectral variations due to periodicity. It provides a simple decomposition, which separates the periodicity and response information almost perfectly [13].

C. CHEAPTRICK

CheapTrick is a simple spectral envelope estimation algorithm for high-quality speech synthesis that is superior to conventional ones both objectively and subjectively. CheapTrick consists of power spectrum estimation with the F0-adaptive Hanning window, smoothing of the power spectrum, and spectral recovery in the quefrency domain. Objective evaluations showed that the algorithm can obtain a more accurate and temporally stable spectral envelope than other algorithms. Conventional STRAIGHT and TANDEM-STRAIGHT cannot fulfill both requirements of high estimation performance and removal of the time-varying component; CheapTrick was proposed as an algorithm that fulfills these requirements. The name CheapTrick comes from its cheap and tricky design based on conventional algorithms such as F0-adaptive windowing and the cepstrum method. CheapTrick was superior to the other algorithms in terms of sound quality regardless of gender. The results cover the sound quality of not only the re-synthesized speech but also the F0-manipulated speech, which suggests that CheapTrick is robust against F0 manipulation. The difference in sound quality was smaller in female speech than in male speech, and this difference is associated with the objective evaluation results, in which the error at higher F0 was smaller than that at lower F0 [14].
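The three steps named in Section III-C (F0-adaptive windowing, smoothing of the power spectrum, and liftering in the quefrency domain) can be sketched in a few lines of Python. This is a simplified illustration rather than the published algorithm: it uses a naive DFT, a plain moving average for the smoothing step, a window of about three pitch periods, and lifter constants `q0` and `q1` that are illustrative assumptions rather than values taken from the text. The test signal is a synthetic harmonic frame, not real speech.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def power_spectrum(frame, fs, f0, n_fft):
    """Step 1: power spectrum using a Hanning window whose length follows F0."""
    win_len = min(len(frame), n_fft, int(round(3 * fs / f0)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (win_len + 1))
              for i in range(win_len)]
    padded = [s * w for s, w in zip(frame, window)] + [0.0] * (n_fft - win_len)
    return [abs(c) ** 2 for c in dft(padded)]


def smooth(power, fs, f0):
    """Step 2: circular moving average over a band proportional to F0."""
    n_fft = len(power)
    half = max(1, int(round(f0 / 3 * n_fft / fs)))  # half-width in bins
    smoothed = []
    for k in range(n_fft):
        band = [power[(k + d) % n_fft] for d in range(-half, half + 1)]
        smoothed.append(sum(band) / len(band))
    return smoothed


def lifter(power, fs, f0, q0=1.18, q1=-0.09):
    """Step 3: smoothing and recovery lifters applied to the cepstrum."""
    n_fft = len(power)
    log_spec = [math.log(p + 1e-12) for p in power]
    # The log spectrum is real and symmetric, so its inverse DFT is DFT / N.
    cepstrum = [c.real / n_fft for c in dft(log_spec)]
    liftered = []
    for n, c in enumerate(cepstrum):
        tau = min(n, n_fft - n) / fs                # symmetric quefrency axis
        x = math.pi * f0 * tau
        smoothing = 1.0 if x == 0 else math.sin(x) / x
        recovery = q0 + 2.0 * q1 * math.cos(2.0 * x)
        liftered.append(c * smoothing * recovery)
    log_env = [c.real for c in dft(liftered)]
    return [math.exp(v) for v in log_env[: n_fft // 2 + 1]]


fs, f0, n_fft = 8000, 200.0, 128
# Synthetic voiced frame: ten harmonics of F0 with decaying amplitude.
frame = [sum(math.cos(2 * math.pi * h * f0 * t / fs) / h for h in range(1, 11))
         for t in range(n_fft)]
envelope = lifter(smooth(power_spectrum(frame, fs, f0, n_fft), fs, f0), fs, f0)
print(len(envelope))
```

Because every step depends on `f0`, the sketch also indicates why the accuracy of the F0 estimate matters for this family of envelope estimators.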

IV. CONCLUSION

The paper reviews the majorly researched methods of Statistical Parametric Speech Synthesis. The autoregressive HMM model supports the existing high-quality speech parameter generation methods considering global variance, and it supports a simple and exact time-recursive form of parameter generation. Gaussian Process Regression (GPR) includes hyperparameter optimization and outperforms the conventional HMM-based approach. Neural Autoregressive Distribution Estimators (NADE) overcome limitations of Restricted Boltzmann Machines (RBM): NADE is a model for joint distributions that is very easy to implement and train and that yields a tractable distribution function. In future work, NADE can be used on problems other than distribution estimation, in particular on problems for which RBMs and autoencoders are often considered. Deep Neural Networks (DNNs) describe the distribution of spectral envelopes, make the synthetic speech less monotonic and improve the naturalness of the synthesized speech.

Vocoder quality is the major drawback of SPSS, so the recent evolving vocoder algorithms STRAIGHT, TANDEM-STRAIGHT and CheapTrick were comparatively reviewed. Among these models CheapTrick outperforms the other methods: it obtains a temporally stable spectral envelope and synthesizes speech with higher sound quality than speech synthesized with other algorithms.

V. REFERENCES

[1] Dr. Shaila Apte, "Speech Synthesis," chapter in the book Speech and Audio Processing, 2013.
[2] David Suendermann, Harald Hoge, and Alan Black, "Challenges in Speech Synthesis", chapter 2 of F. Chen, K. Jokinen, Speech Technology, Springer Science+Business Media, LLC 2010.
[3] Simon King, "An introduction to statistical parametric speech synthesis", Sadhana Vol. 36, Part 5, October 2011, pp. 837-85.
[4] Heiga Zen, Keiichi Tokuda, Alan W. Black, "Statistical Parametric Speech Synthesis", submitted to Speech Communication, April 6 2009.
[5] Gregory E. Cox, George Kachergis, Richard M. Shiffrin, "Gaussian Process Regression for Trajectory Analysis".
[6] Matt Shannon, Student Member, Heiga Zen, Member, and William Byrne, Senior Member, "Autoregressive Models for Statistical Parametric Speech Synthesis", IEEE Transactions on Audio, Speech and Language Processing, Vol. 21, No. 3, March 2013.
[7] Heiga Zen, Andrew Senior, "Deep Mixture Density Network for acoustic modelling in statistical parametric speech synthesis".
[8] Tomoki Koriyama, Takashi Nose, Takao Kobayashi, "Statistical Parametric Speech Synthesis based on Gaussian Process Regression", IEEE Journal of Selected Topics in Signal Processing, Vol. 8, No. 2, pp. 173-183, 2014.
[9] Tomoki Koriyama, Takashi Nose, Takao Kobayashi, "Parametric Speech Synthesis using local and global variance", 24th IEEE International Workshop on Machine Learning and Signal Processing, 2014.
[10] Zhen-Hua Ling, Li Deng, and Dong Yu, "Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis", IEEE Transactions on Audio, Speech and Language Processing, Vol. 21, No. 10, October 2013.
[11] Xiang Yin, Zhen-Hua Ling, Li-Rong Dai, "Spectral modelling using Neural Autoregressive distribution estimators for statistical parametric speech synthesis", 2014 IEEE International Conference on Acoustic and Speech Processing.
[12] Ning XU, Yibin TANG, Yuan GAO, Changping ZHU, Qingbang HAN, "A Simplified STRAIGHT Model with Aperiod Component Reconstruction", Journal of Computational Information Systems 9: 5 (2013).
[13] Hideki Kawahara and Masanori Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework", Sadhana Vol. 36, Part 5, October 2011, pp. 713-727.
[14] Masanori Morise, "CheapTrick, a spectral envelope estimator for high-quality speech synthesis", ScienceDirect, Available online September 2014.