
Speech synthesis

Text-to-speech synthesis

The automatic transformation from (electronic) text to speech

The speaker is defined by the system design

Single speaker

In contrast to speech recognition, the aim is not to handle all speakers and all normal pronunciation variants, but to render one spoken realization of the text that is perceived as natural and intelligible

A text contains orthographic words, numbers, abbreviations,
mnemonics and punctuation

Linguistic analysis of the text is necessary to



Interpret symbols

Analyze the grammatical structure

Infer the semantic interpretation of the text

Torbjørn Svendsen

Speech Synthesis

The processing steps of TTS



Text → text analysis → phonemic analysis → prosodic analysis → speech synthesis → speech

Text analysis: text normalization; analysis of document structure; linguistic analysis

Output: tagged text


Phonemic analysis: homograph disambiguation, morphological analysis, letter-to-sound mapping

Output: tagged phone sequence

Prosodic analysis: intonation; duration; volume



Output: control sequence, tagged phones

Speech synthesis: voice rendering



Output: synthetic speech


Text-to-speech synthesis

Text analysis → phonemic analysis → prosodic analysis → speech synthesis

Speech synthesis concerns the waveform generation from the annotated symbol sequence (typically a phone sequence)

Philosophy: Rule-based vs. data driven synthesis

Method: Articulatory synthesis; formant synthesis; concatenative
(waveform) synthesis


Quality

(Figure: example systems: Festival, NextGen, Infovox, Klattalk, Voder.)

Different strategies give different quality, but also different consistency of quality

No strategy can currently provide consistent high quality (but it is getting closer)

Limited domain gives high quality within the application domain


The synthesis space



(Figure adapted from Granström: the synthesis space, spanned by speech knowledge, intelligibility, flexibility, naturalness, bit rate, units, processing needs, cost, complexity and vocabulary.)


Main methods

Formant synthesis
Concatenative, or Waveform synthesis
Articulatory synthesis


Formant synthesis

Annotated phones → rule system → formant tracks + pitch contour → formant synthesis → synthetic speech

Normally a rule-based (knowledge-driven) system, but can also be data driven

Each formant can be specified with center frequency, bandwidth and
optionally, amplitude

E.g. a 2nd-order filter with resonance at fi and bandwidth bi (frequencies normalized by the sampling rate):

    Hi(z) = 1 / (1 - 2 e^(-pi*bi) cos(2*pi*fi) z^(-1) + e^(-2*pi*bi) z^(-2))
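As an illustration, the resonator above can be sketched as a direct-form difference equation; here fi and bi are taken in Hz and divided by an assumed sampling rate fs, and the unity-gain-at-DC numerator is an assumption (the Klatt convention), not something stated on the slide:

```python
import math

def resonator_coeffs(fi, bi, fs):
    """Coefficients of H(z) = b0 / (1 + a1*z^-1 + a2*z^-2) with resonance at
    fi Hz and bandwidth bi Hz, for sampling rate fs."""
    a2 = math.exp(-2.0 * math.pi * bi / fs)
    a1 = -2.0 * math.exp(-math.pi * bi / fs) * math.cos(2.0 * math.pi * fi / fs)
    b0 = 1.0 + a1 + a2          # unity gain at DC (assumed Klatt-style scaling)
    return b0, a1, a2

def resonator(x, fi, bi, fs):
    """Filter the sequence x through the 2nd-order resonator."""
    b0, a1, a2 = resonator_coeffs(fi, bi, fs)
    y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn - a1 * y1 - a2 * y2   # difference equation
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

In a cascade implementation one such section per formant is run in series; in a parallel implementation the sections are summed with per-formant amplitudes.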


Implementation

Cascade or parallel implementation

Voiced sounds typically use the cascade, unvoiced sounds the parallel implementation

LPC filters can also be used, normally with poorer quality


Klatt formant synthesizer



LPC synthesis

Pulse generator (voiced) or noise generator (unvoiced) → excitation → LPC filter 1/A(z) → speech

Audio examples (in the original lecture): original; LPC-coded, all unvoiced/voiced; speed halved/doubled; pitch halved/doubled; melody
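The source-filter structure above can be sketched as follows; the per-frame parameter format, the gain handling and the filter order are illustrative assumptions, not the lecture's actual implementation:

```python
import random

def lpc_synthesize(frames, frame_len=80):
    """Tiny LPC synthesis sketch.  Each frame is (a, gain, t0) where a = [a1..ap]
    are the LPC coefficients, and t0 > 0 selects a pulse train with period t0
    (voiced) while t0 == 0 selects white noise (unvoiced).  The all-pole filter
    is s[n] = gain*e[n] - sum_k a[k]*s[n-k]."""
    random.seed(0)
    out, mem = [], []
    pulse_phase = 0
    for a, gain, t0 in frames:
        mem = mem if mem else [0.0] * len(a)   # filter memory
        for _ in range(frame_len):
            if t0 > 0:                          # voiced: impulse every t0 samples
                e = 1.0 if pulse_phase == 0 else 0.0
                pulse_phase = (pulse_phase + 1) % t0
            else:                               # unvoiced: white noise
                e = random.uniform(-1.0, 1.0)
            s = gain * e - sum(ak * sk for ak, sk in zip(a, mem))
            mem = [s] + mem[:-1]
            out.append(s)
    return out
```

Halving or doubling the pitch or the speed, as in the audio examples, amounts to changing t0 or frame_len while keeping the same filter coefficients.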


Rule-based formant generation



Formants are slowly varying

Update rates of 5-10 ms are sufficient

Target values describe stationary conditions, {Fi, Bi}



Rules describe the transition between phones

Parameters describe the transition shape

Specific rules for all transition types
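A toy stand-in for such a rule system is sketched below; a real system has phone-pair-specific transition shapes, while here a single linear ramp of assumed width trans_ms replaces them:

```python
def formant_track(targets, durations_ms, step_ms=5, trans_ms=40):
    """Sample one formant track every step_ms.  Each phone holds its target
    value; a linear ramp of width trans_ms centered on each phone boundary
    stands in for the per-transition rules."""
    edges, t = [], 0.0                 # absolute end time of each phone
    for d in durations_ms:
        t += d
        edges.append(t)
    half = trans_ms / 2.0
    track = []
    for i in range(int(edges[-1] // step_ms)):
        tt = i * step_ms
        seg = next(s for s, e in enumerate(edges) if tt < e)
        start = edges[seg] - durations_ms[seg]
        f = targets[seg]
        if seg + 1 < len(targets) and edges[seg] - tt < half:
            w = 0.5 + 0.5 * (edges[seg] - tt) / half   # blend toward next target
            f = w * targets[seg] + (1 - w) * targets[seg + 1]
        elif seg > 0 and tt - start < half:
            w = 0.5 + 0.5 * (tt - start) / half        # blend from previous target
            f = w * targets[seg] + (1 - w) * targets[seg - 1]
        track.append(f)
    return track
```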


Rule-based formant generation



Formant synthesis

Flexible

Produces intelligible speech with few parameters

Simple implementation

Rule derivation is complex, development can be costly

Limited naturalness

NOTE: Given sufficient training data, formant generation can be data driven, e.g. using an HMM in production mode for generating the formant tracks (Acero et al.)

A similar approach for LPC-based synthesis was taken more recently by Tohkura



Articulatory synthesis

The waveform production is performed by describing the movement of
the articulators

Jaw opening, lip rounding, tongue placement and height, etc.

Acoustics, fluid mechanics form basis

Limited success

Complex theory

Computational difficulties and complexity


Synthesis by concatenation

Concatenation of stored waveform fragments

Optional modification of the fragments (duration, pitch, formants)

Dilemma: Use of unmodified fragments will either

Produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch), or

Lead to an enormous database to cover all phonetic and prosodic events

How much modification is possible before degradation is audible?



Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?



1. Which unit?

Longer unit leads to better quality, but



requires more data to be stored

Is more context dependent


Unit requirements

Low concatenation distortion

Longer units -> fewer concatenation points

Units containing attractive concatenation points

Low prosodic distortion


Small inventory means prosodic modification necessary



Modification introduces distortion

Unit should be generalizable



Need to be able to synthesize sequences that were not in the original
inventory (except for limited domain synthesis)

Unit should be trainable



Finite training data sufficient to estimate or predict all units


Coverage

Complete coverage of all phonetic and prosodic events is impossible



Large Number of Rare Events (LNRE)


Some possible unit choices



Context independent phonemes

Bad concatenation properties

Context dependent phonemes



Reduces discontinuity problems

Large number (~125k), needs to be reduced







E.g. generalized triphones or phonetic decision trees

Diphones (dyads)

~2500 possible units

Reduces discontinuity problems

Widespread use

Sub-phonemic units

Increased use (e.g. IBM, AT&T)

Half-phones (AT&T), phone HMM-state (IBM)


Some possible unit choices (cont.)



Syllables, words and phrases

Mainly used for limited domain applications

Fixed message repertoire

Potentially good quality



Demands large storage

And much data collection

Computationally demanding

Complex search in large database

Syllables or demi-syllables most interesting



Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


Designing the acoustic inventory



Recordings from one speaker, appropriately annotated

Voice talent very important for resulting quality

Design choice: rely on prosodic modification by signal processing, or aim for good coverage of the natural prosodic variation in the database

Prosodic modification at synthesis: PSOLA-type synthesis


Typically diphone units



Normally desirable to have nearly constant (neutral) F0

Nonsense words/sentences with (near) full diphone coverage

Small database (~5 minutes of speech contains the essential units)


Designing the acoustic inventory (2)



Unit selection synthesis relies on natural prosodic variation

Representative speech; the speaking style is defined by the database

Many representations of each phonetic unit

Gives prosodic variation

Large database

Facilitates longer units, variable units

Requires search for the best unit sequence

Rich phonetic and prosodic context

Typically real texts


Text selection:

Start with large number of natural sentences

Analyze sentences, predict phonetic and prosodic realization

Use some greedy algorithm to obtain the best coverage possible with a small
number of sentences (2000-4000, typically)

Design supplementary sentences to improve coverage
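The greedy step above can be sketched as a standard set-cover heuristic; the sentence list and the `units_of` function (mapping a sentence to its predicted units, e.g. diphones) are placeholders for the analyzed corpus:

```python
def greedy_select(sentences, units_of):
    """Repeatedly pick the sentence that adds the most not-yet-covered units,
    stopping when no sentence adds anything new."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:                       # nothing left to gain
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered
```

Supplementary sentences would then be designed by inspecting which units are still missing from `covered`.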


Coverage - LNRE

Large number of units with small probability of occurrence

If database units are selected randomly, the probability of encountering a unit not in the database approaches certainty even for a small number of randomly selected sentences

Unit inventory must be chosen with care

Fall-back solutions must exist for non-covered units


Annotation

For small databases, speech can be segmented and annotated manually

Phonemic and prosodic annotation can be detailed

For unit selection databases automation is necessary



Automatic or semi-automatic methods for segmentation in phonemic and
prosodic units

Annotation can be fairly high-level without loss of quality

Annotation level and cost function for unit selection are closely linked


Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


3. Optimal unit string



Selection problem arises when there are several possible choices for
the unit sequence

Traditional diphone synthesis has only one exemplar of each unit

Trivial solution

Selection is made based on the desire for naturalness and minimum discontinuity due to

Different phonetic contexts



Segmentation errors in the database

Acoustic variability

Prosodic differences (pitch discontinuity, formant tracks)

Search problem

Must define an objective function to be minimized


Objective function for search



Lattice of candidate units

Sequence of target units

    d(Θ, T) = Σ_{j=1..N} du(θj, tj) + Σ_{j=1..N-1} dt(θj, θj+1)

Unit cost, du(·)

Transition cost, dt(·)

    Θ = {θ1, θ2, ..., θN}  - candidate segment sequence
    T = {t1, t2, ..., tN}  - target units

    Θ* = argmin over Θ of d(Θ, T)

Objective function for search (2)



How to choose unit and transition cost functions?

Empirical or data driven

Empirical strategy:

Transition cost:

If two segments were originally spoken in succession, dt(·) = 0



Otherwise, cost as the sum of prosodic and coarticulatory cost

Prosodic cost proportional to the difference in F0 (or log F0) at the boundary

Coarticulatory cost based on empirical knowledge of perceived distance

Unit cost

Contribution from prosody and context

Prosody cost proportional to difference in F0

Contextual cost for using a unit from a different phonetic context; based on empirical data


Objective function for search (3)



Data driven cost function

Transition cost

Measure of spectral discontinuity, e.g. spectral distance in the transition area (distance between the end frame of the preceding unit and the first frame of the succeeding unit)

(Optional) prosodic cost, e.g. magnitude of log(F0) difference

Unit cost

Based on context

Examples:

Same context means no cost, different context gives infinite cost

Generalized triphones (GT): units belonging to the same GT means no cost, otherwise the cost is infinite

Phonetic decision trees, e.g. no cost for units at same leaf node


Optimal unit string selection



Given

The objective function to be minimized

A target sequence from the TTS front end

A unit inventory

The minimization can be performed using standard dynamic programming techniques (Viterbi-style)

Similar to HMM decoding, but no probabilities, only cost values

Search can be further simplified by e.g. clustering of units

Initial search using units representing each cluster

Search refinement by selecting best cluster member as selected unit
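The Viterbi-style search can be sketched as below, directly minimizing a sum of unit and transition costs over the candidate lattice; the clustering refinements are omitted, and the cost functions are left as caller-supplied placeholders:

```python
def select_units(candidates, du, dt):
    """candidates[j] lists the database units that could realize target j;
    du(u, j) is the unit (target) cost, dt(v, u) the transition cost from v to u.
    Returns the minimum-cost unit sequence and its total cost."""
    cost = [[du(u, 0) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for j in range(1, len(candidates)):
        row, brow = [], []
        for u in candidates[j]:
            # best predecessor for this candidate
            prev = [cost[j - 1][i] + dt(v, u) for i, v in enumerate(candidates[j - 1])]
            arg = min(range(len(prev)), key=prev.__getitem__)
            row.append(prev[arg] + du(u, j))
            brow.append(arg)
        cost.append(row)
        back.append(brow)
    # backtrack from the cheapest final candidate
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][k]
    path = [k]
    for j in range(len(candidates) - 1, 0, -1):
        k = back[j][k]
        path.append(k)
    path.reverse()
    return [candidates[j][i] for j, i in enumerate(path)], total
```

As the slide notes, this is the same recursion as HMM decoding, with additive costs in place of log probabilities.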


Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


4. Prosodic modification

Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small

Also desirable for unit selection synthesis due to LNRE

Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion

Audio examples (in the original lecture): original; duration modified; pitch modified; duration and pitch modified

(Synchronous) Overlap and Add




OLA: time-scale modification with a fixed distance between analysis windows; produces irregular pitch periods

SOLA: the analysis window is placed at the position which gives maximum correlation between the windows
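A sketch of the SOLA idea under simplifying assumptions (rectangular read window, linear cross-fade, and illustrative window/hop/search sizes, none of which are specified on the slide):

```python
import math

def sola_stretch(x, alpha, win=400, hop=200, search=50):
    """SOLA time-scale sketch: windows are read every round(hop/alpha) input
    samples and overlap-added every hop output samples; each window is first
    shifted by the lag (within +-search) that maximizes its correlation with
    the already-synthesized signal, avoiding plain OLA's irregular periods."""
    out = list(x[:win])
    in_hop = max(1, int(round(hop / alpha)))
    pos_in, pos_out = in_hop, hop
    while pos_in + win + search <= len(x) and pos_in - search >= 0:
        ov = len(out) - pos_out                  # overlap length
        best_s, best_c = 0, -math.inf
        for s in range(-search, search + 1):     # pick max-correlation lag
            seg = x[pos_in + s: pos_in + s + ov]
            c = sum(a * b for a, b in zip(out[pos_out:], seg))
            if c > best_c:
                best_c, best_s = c, s
        seg = x[pos_in + best_s: pos_in + best_s + win]
        for i in range(ov):                      # linear cross-fade
            w = i / ov
            out[pos_out + i] = (1 - w) * out[pos_out + i] + w * seg[i]
        out.extend(seg[ov:])
        pos_in += in_hop
        pos_out += hop
    return out
```

With alpha > 1 the output is longer than the input (slower speech) while the local waveform, and hence the pitch, is preserved.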


Pitch Synchronous OLA (PSOLA)



The window is pitch synchronous, centered around an excitation pulse, with duration equal to two pitch periods, 2·T0

Allows for simple modification of pitch frequency



Can also modify duration

Unvoiced sounds:

Fixed window length (< 10 ms)



Can invert every other repeated segment in order to avoid periodicities
when expanding duration

Can provide high quality as long as the degree of modification is relatively low (< 2)
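A bare-bones sketch of the PSOLA overlap-add step: a Hann window two local pitch periods long is centered on each analysis epoch, and each grain is re-centered on its synthesis epoch. The Hann shape and the caller-supplied epoch mapping are assumptions for illustration:

```python
import math

def psola_resynth(x, ana_epochs, syn_epochs, ana_of_syn):
    """Overlap-add Hann-windowed two-period grains.  ana_of_syn[j] gives the
    analysis epoch whose grain is reused at synthesis epoch j (a real system
    maps epochs by nearest neighbour on the warped time axis)."""
    out = [0.0] * (2 * len(x))
    for j, ts in enumerate(syn_epochs):
        i = ana_of_syn[j]
        # local analysis period: half the distance between surrounding epochs
        t0 = (ana_epochs[min(i + 1, len(ana_epochs) - 1)]
              - ana_epochs[max(i - 1, 0)]) // 2
        t0 = max(t0, 1)
        for n in range(-t0, t0 + 1):
            ta = ana_epochs[i] + n
            if 0 <= ta < len(x) and 0 <= ts + n < len(out):
                w = 0.5 * (1.0 + math.cos(math.pi * n / t0))  # Hann, zero at +-t0
                out[ts + n] += w * x[ta]
    return out
```

Placing the synthesis epochs closer together than the analysis epochs raises the pitch; repeating or skipping grains changes the duration.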


PSOLA principle, F0 change



(Figure: original signal with marked epochs; the windows are shifted to the new epoch spacing T = 1.25·T0 and overlap-added to form the re-harmonized signal.)


PSOLA duration and F0 modification



PSOLA principle

An impulse train with period T,

    e(n) = Σ_{k=-∞..∞} δ(n - kT)

excites a filter with impulse response s(n), where s(n) determines the spectral envelope:

    x(n) = e(n) * s(n) = Σ_{k=-∞..∞} s(n - kT)

Using an appropriate, pitch synchronous window, T0 can be changed without changing the spectral envelope (exact match at f = k·F0)

Window type and degree of change will determine distortion outside
pitch harmonics (interpolated values, correctness determined by
window sidelobes)


How to determine the synthesis epochs

ts(j): time instant of pitch pulse (epoch) j in the synthesis

Ps(t): desired pitch period at time t

If Ps(t) is slowly varying: ts(j+1) - ts(j) = Ps(ts(j))

Exact:

    ts(j+1) - ts(j) = (1 / (ts(j+1) - ts(j))) · ∫_{ts(j)}^{ts(j+1)} Ps(t) dt

Next pulse offset by mean pitch within the synthesis interval



Iterative calculation
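The iterative calculation can be sketched as a fixed-point iteration on the implicit equation above; the 8-point midpoint quadrature for the mean period is an arbitrary choice (it is exact for constant and linear Ps):

```python
def next_epoch(ts, P, tol=1e-9):
    """Find ts(j+1) given ts(j) = ts: the interval length must equal the mean
    desired pitch period P(t) over [ts, ts + d].  Iterate d <- mean(P) until
    it stops changing."""
    d = P(ts)                                   # initial guess: local period
    for _ in range(100):
        m = sum(P(ts + d * (k + 0.5) / 8) for k in range(8)) / 8
        if abs(m - d) < tol:
            break
        d = m
    return ts + d
```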


Synthesis epoch calculation



Pitch modification, no time scaling



Original pitch: Pa(t) = ta(i+1) - ta(i)

Desired pitch: Ps(t) = β(t)·Pa(t)

    ts(j+1) - ts(j) = (1 / (ts(j+1) - ts(j))) · ∫_{ts(j)}^{ts(j+1)} β(t) Pa(t) dt

Pa(t) is piecewise constant

β(t) is normally constant or linear


Changing the time scale



    ts = D(ta) = ∫_0^{ta} α(τ) dτ

Presume α(τ) = α (reduces speed when α > 1)

Similar derivation to the pitch epoch determination:

    ts(j+1) - ts(j) = (α / (ts(j+1) - ts(j))) · ∫_{ts(j)/α}^{ts(j+1)/α} Pa(t) dt

If Pa(t) ≈ Pa (constant in the interval): ts(j+1) - ts(j) = Pa (the pitch period is preserved; only the number of epochs changes)

Changing the time scale



All the modifications



Changing both time and pitch:

    ts(j+1) - ts(j) = (α / (ts(j+1) - ts(j))) · ∫_{ts(j)/α}^{ts(j+1)/α} β(t) Pa(t) dt

Epoch positioning in database



Database must be annotated with pitch pulse location

Accurate positioning necessary for good performance

Automatic methods using pitch estimation techniques give reasonably
good results

Use of a laryngograph (electroglottograph, EGG) during recording is recommended

Measures the resistance over the vocal cords, which depends on the glottal opening



Peak picking of derivative of EGG signal
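The peak-picking idea can be sketched as below; the amplitude threshold and the minimum-distance constraint are illustrative assumptions, not the lecture's actual algorithm:

```python
def epochs_from_egg(egg, fs, max_f0=500.0, thresh=0.3):
    """Pick epochs as peaks of the EGG time derivative (glottal closure shows
    up as the steepest slope).  A peak must exceed thresh * the derivative's
    global maximum and lie at least one minimal pitch period (fs / max_f0
    samples) after the previously accepted epoch."""
    d = [egg[n + 1] - egg[n] for n in range(len(egg) - 1)]
    if not d:
        return []
    peak = max(d)
    min_dist = int(fs / max_f0)
    epochs, last = [], -min_dist
    for n in range(1, len(d) - 1):
        if (d[n] >= d[n - 1] and d[n] > d[n + 1]
                and d[n] > thresh * peak and n - last >= min_dist):
            epochs.append(n)
            last = n
    return epochs
```

Regions producing no peaks above the threshold would be treated as unvoiced, matching the voiced/unvoiced determination mentioned on the next slide.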


Epochs from EGG signal



(Figure: speech signal, EGG signal, and detected pulse locations.)

Peak picking on EGG or its time derivative



Accurate epoch and F0 estimation

Voiced/unvoiced determination


PSOLA limitations

Amplitude mismatch

Voiced fricatives

Increased buzziness

All modification will introduce distortion and unnaturalness

The degree depends on the amount of modification

Limits on maximum modification


Phase mismatches

Wrong positioning of pitch pulses in the database causes glitches in the output

Pitch mismatches

Even with correct F0 and pulse positions, different F0 in the segments causes spectral and waveform discontinuities

HMMs for synthesis


In speech recognition, hidden Markov models are used to model speech production

The task of the recognizer is to find the model that best explains the observed utterance

If the HMM is used for generating observations, the produced feature vector sequence can be used to produce speech from a given unit sequence (phone sequence)


The feature vectors must be suitable for speech production



Combination of continuous and discrete elements

Modifications to HMM theory are necessary to facilitate the generative mode

Potential for efficient and flexible synthesis

Basis for HMM-based synthesis


HMM-based speech synthesis



Training from database



Produce excitation and filter parameters for e.g. LPC-type speech generation

Text-to-speech synthesis: a hybrid solution

Speech training database

HMM-based system for prediction and unit selection

Experimental system

Very good evaluation in international competition (Blizzard Challenge 2010)

(Block diagram, components: training data; input text; TTS frontend; analysis; target model construction; HTS training; state alignment; voice database; candidate list construction; selection & boundary decision; waveform concatenation.)

A few examples

Diphone synthesis: Festival, Arne, Infovox

Unit selection synthesis: Festival, AT&T NextGen


HMM synthesis

Hybrid HMM/Unit selection

Limited domain unit selection synthesis


Summary

Data-driven vs synthesis by rule.

Current synthesis generation is concatenative waveform synthesis.

Single-exemplar synthesis, such as diphone synthesis, requires units to be prosodically modified.

Unit selection synthesis aims to use natural prosody and minimal
prosodic modification.

Issues in waveform synthesis:


Unit definition.

Definition, realization and annotation of waveform library.

Unit selection - search.

Prosodic modification.

