
Speech synthesis

Text-to-speech synthesis

The automatic transformation from (electronic) text to speech

The speaker is defined by the system design

Single speaker

In contrast to speech recognition, the aim is not to handle all speakers and all normal pronunciation variants, but to render one spoken realization of the text that is perceived as natural and intelligible

A text contains orthographic words, numbers, abbreviations,
mnemonics and punctuation

Linguistic analysis of the text is necessary to



Interpret symbols

Analyze the grammatical structure

Infer the semantic interpretation of the text

Torbjørn Svendsen

Speech Synthesis

The processing steps of TTS



Text → text analysis → phonemic analysis → prosodic analysis → speech synthesis → speech

Text analysis: text normalization; analysis of document structure; linguistic analysis

Output: tagged text


Phonemic analysis: homograph disambiguation, morphological analysis, letter-to-sound mapping

Output: tagged phone sequence

Prosodic analysis: intonation; duration; volume



Output: control sequence, tagged phones

Speech synthesis: voice rendering



Output: synthetic speech


Text-to-speech synthesis

Text analysis → phonemic analysis → prosodic analysis → speech synthesis

Speech synthesis concerns the waveform generation from the annotated symbol sequence (typically a phone sequence)

Philosophy: Rule-based vs. data driven synthesis

Method: Articulatory synthesis; formant synthesis; concatenative
(waveform) synthesis


Quality

(Figure: example systems: Festival, NextGen, Infovox, Klattalk, Voder.)

Different strategies give different quality, but also different consistency of quality

No strategy can currently provide consistent high quality (but it is getting closer)

Limited domain gives high quality within the application domain


The synthesis space



(Figure adapted from Granström: the synthesis space, spanned by speech knowledge, intelligibility, flexibility, naturalness, bit rate, units, processing needs, cost, complexity and vocabulary.)


Main methods

Formant synthesis
Concatenative, or Waveform synthesis
Articulatory synthesis


Formant synthesis

Annotated phones → rule system → formant tracks + pitch contour → formant synthesis → synthetic speech

Normally a rule-based (knowledge-driven) system, but can also be data driven

Each formant can be specified with center frequency, bandwidth and
optionally, amplitude

E.g. a 2nd-order filter with resonance at fi and bandwidth bi (frequencies normalized by the sampling rate):

    Hi(z) = 1 / (1 - 2 e^(-pi*bi) cos(2*pi*fi) z^(-1) + e^(-2*pi*bi) z^(-2))
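As an illustration, the resonator above can be sketched as a direct-form difference equation; here fi and bi are taken in Hz and divided by an assumed sampling rate fs, and the unity-gain-at-DC numerator is an assumption (the Klatt convention), not something stated on the slide:

```python
import math

def resonator_coeffs(fi, bi, fs):
    """Coefficients of H(z) = b0 / (1 + a1*z^-1 + a2*z^-2) with resonance at
    fi Hz and bandwidth bi Hz, for sampling rate fs."""
    a2 = math.exp(-2.0 * math.pi * bi / fs)
    a1 = -2.0 * math.exp(-math.pi * bi / fs) * math.cos(2.0 * math.pi * fi / fs)
    b0 = 1.0 + a1 + a2          # unity gain at DC (assumed Klatt-style scaling)
    return b0, a1, a2

def resonator(x, fi, bi, fs):
    """Filter the sequence x through the 2nd-order resonator."""
    b0, a1, a2 = resonator_coeffs(fi, bi, fs)
    y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn - a1 * y1 - a2 * y2   # difference equation
        y2, y1 = y1, yn
        y.append(yn)
    return y
```

In a cascade implementation one such section per formant is run in series; in a parallel implementation the sections are summed with per-formant amplitudes.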


Implementation

Cascade or parallel implementation

Voiced sounds typically use the cascade, unvoiced sounds the parallel implementation

LPC filters can also be used, normally with poorer quality


Klatt formant synthesizer



LPC synthesis

Pulse generator (voiced) or noise generator (unvoiced) → excitation → LPC filter 1/A(z) → speech

Audio examples (in the original lecture): original; LPC-coded, all unvoiced/voiced; speed halved/doubled; pitch halved/doubled; melody
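The source-filter structure above can be sketched as follows; the per-frame parameter format, the gain handling and the filter order are illustrative assumptions, not the lecture's actual implementation:

```python
import random

def lpc_synthesize(frames, frame_len=80):
    """Tiny LPC synthesis sketch.  Each frame is (a, gain, t0) where a = [a1..ap]
    are the LPC coefficients, and t0 > 0 selects a pulse train with period t0
    (voiced) while t0 == 0 selects white noise (unvoiced).  The all-pole filter
    is s[n] = gain*e[n] - sum_k a[k]*s[n-k]."""
    random.seed(0)
    out, mem = [], []
    pulse_phase = 0
    for a, gain, t0 in frames:
        mem = mem if mem else [0.0] * len(a)   # filter memory
        for _ in range(frame_len):
            if t0 > 0:                          # voiced: impulse every t0 samples
                e = 1.0 if pulse_phase == 0 else 0.0
                pulse_phase = (pulse_phase + 1) % t0
            else:                               # unvoiced: white noise
                e = random.uniform(-1.0, 1.0)
            s = gain * e - sum(ak * sk for ak, sk in zip(a, mem))
            mem = [s] + mem[:-1]
            out.append(s)
    return out
```

Halving or doubling the pitch or the speed, as in the audio examples, amounts to changing t0 or frame_len while keeping the same filter coefficients.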


Rule-based formant generation



Formants are slowly varying

Update rates of 5-10 ms are sufficient

Target values describe stationary conditions, {Fi, Bi}



Rules describe the transition between phones

Parameters describe the transition shape

Specific rules for all transition types
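A toy stand-in for such a rule system is sketched below; a real system has phone-pair-specific transition shapes, while here a single linear ramp of assumed width trans_ms replaces them:

```python
def formant_track(targets, durations_ms, step_ms=5, trans_ms=40):
    """Sample one formant track every step_ms.  Each phone holds its target
    value; a linear ramp of width trans_ms centered on each phone boundary
    stands in for the per-transition rules."""
    edges, t = [], 0.0                 # absolute end time of each phone
    for d in durations_ms:
        t += d
        edges.append(t)
    half = trans_ms / 2.0
    track = []
    for i in range(int(edges[-1] // step_ms)):
        tt = i * step_ms
        seg = next(s for s, e in enumerate(edges) if tt < e)
        start = edges[seg] - durations_ms[seg]
        f = targets[seg]
        if seg + 1 < len(targets) and edges[seg] - tt < half:
            w = 0.5 + 0.5 * (edges[seg] - tt) / half   # blend toward next target
            f = w * targets[seg] + (1 - w) * targets[seg + 1]
        elif seg > 0 and tt - start < half:
            w = 0.5 + 0.5 * (tt - start) / half        # blend from previous target
            f = w * targets[seg] + (1 - w) * targets[seg - 1]
        track.append(f)
    return track
```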


Rule-based formant generation



Formant synthesis

Flexible

Produces intelligible speech with few parameters

Simple implementation

Rule derivation is complex, development can be costly

Limited naturalness

NOTE: Given sufficient training data, formant generation can be data driven, e.g. using an HMM in production mode for generating the formant tracks (Acero et al.)

A similar approach for LPC-based synthesis was taken more recently by Tohkura



Articulatory synthesis

The waveform production is performed by describing the movement of
the articulators

Jaw opening, lip rounding, tongue placement and height, etc.

Acoustics, fluid mechanics form basis

Limited success

Complex theory

Computational difficulties and complexity


Synthesis by concatenation

Concatenation of stored waveform fragments

Optional modification of the fragments (duration, pitch, formants)

Dilemma: Use of unmodified fragments will either

Produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch), or

Lead to an enormous database to cover all phonetic and prosodic events

How much modification is possible before degradation is audible?



Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?



1. Which unit?

Longer unit leads to better quality, but



requires more data to be stored

Is more context dependent


Unit requirements

Low concatenation distortion

Longer units -> fewer concatenation points

Units containing attractive concatenation points

Low prosodic distortion


Small inventory means prosodic modification necessary



Modification introduces distortion

Unit should be generalizable



Need to be able to synthesize sequences that were not in the original
inventory (except for limited domain synthesis)

Unit should be trainable



Finite training data sufficient to estimate or predict all units


Coverage

Complete coverage of all phonetic and prosodic events is impossible



Large Number of Rare Events (LNRE)


Some possible unit choices



Context independent phonemes

Bad concatenation properties

Context dependent phonemes



Reduces discontinuity problems

Large number (~125k), needs to be reduced







E.g. generalized triphones or phonetic decision trees

Diphones (dyads)

~2500 possible units

Reduces discontinuity problems

Widespread use

Sub-phonemic units

Increased use (e.g. IBM, AT&T)

Half-phones (AT&T), phone HMM-state (IBM)


Some possible unit choices (cont.)



Syllables, words and phrases

Mainly used for limited domain applications

Fixed message repertoire

Potentially good quality



Demands large storage

And much data collection

Computationally demanding

Complex search in large database

Syllables or demi-syllables most interesting



Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


Designing the acoustic inventory



Recordings from one speaker, appropriately annotated

Voice talent very important for resulting quality

Design choice: rely on prosodic modification by signal processing, or aim for good coverage of the natural prosodic variation in the database

Prosodic modification at synthesis: PSOLA-type synthesis


Typically diphone units



Normally desirable to have nearly constant (neutral) F0

Nonsense words/sentences with (near) full diphone coverage

Small database (~5 minutes of speech contains the essential units)


Designing the acoustic inventory (2)



Unit selection synthesis relies on natural prosodic variation

Representative speech; the speaking style is defined by the database

Many representations of each phonetic unit

Gives prosodic variation

Large database

Facilitates longer units, variable units

Requires search for the best unit sequence

Rich phonetic and prosodic context

Typically real texts


Text selection:

Start with large number of natural sentences

Analyze sentences, predict phonetic and prosodic realization

Use some greedy algorithm to obtain the best coverage possible with a small
number of sentences (2000-4000, typically)

Design supplementary sentences to improve coverage
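The greedy step above can be sketched as a standard set-cover heuristic; the sentence list and the `units_of` function (mapping a sentence to its predicted units, e.g. diphones) are placeholders for the analyzed corpus:

```python
def greedy_select(sentences, units_of):
    """Repeatedly pick the sentence that adds the most not-yet-covered units,
    stopping when no sentence adds anything new."""
    covered, chosen = set(), []
    remaining = list(sentences)
    while remaining:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:                       # nothing left to gain
            break
        covered |= gain
        chosen.append(best)
        remaining.remove(best)
    return chosen, covered
```

Supplementary sentences would then be designed by inspecting which units are still missing from `covered`.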


Coverage - LNRE

Large number of units with small probability of occurrence

If database units are selected randomly, the probability of encountering a unit not in the database approaches certainty even for a small number of randomly selected sentences

Unit inventory must be chosen with care

Fall-back solutions must exist for non-covered units


Annotation

For small databases, speech can be segmented and annotated manually

Phonemic and prosodic annotation can be detailed

For unit selection databases automation is necessary



Automatic or semi-automatic methods for segmentation in phonemic and
prosodic units

Annotation can be fairly high-level without loss of quality

Annotation level and cost function for unit selection are closely linked


Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


3. Optimal unit string



Selection problem arises when there are several possible choices for
the unit sequence

Traditional diphone synthesis has only one exemplar of each unit

Trivial solution

Selection is made based on the desire for naturalness and minimum discontinuity due to

Different phonetic contexts



Segmentation errors in the database

Acoustic variability

Prosodic differences (pitch discontinuity, formant tracks)

Search problem

Must define an objective function to be minimized


Objective function for search



Lattice of candidate units

Sequence of target units

    d(Θ, T) = Σ_{j=1..N} du(θj, tj) + Σ_{j=1..N-1} dt(θj, θj+1)

Unit cost, du(·)

Transition cost, dt(·)

    Θ = {θ1, θ2, ..., θN}  - candidate segment sequence
    T = {t1, t2, ..., tN}  - target units

    Θ* = argmin over Θ of d(Θ, T)

Objective function for search (2)



How to choose unit and transition cost functions?

Empirical or data driven

Empirical strategy:

Transition cost:

If two segments were originally spoken in succession, dt(·) = 0



Otherwise, cost as the sum of prosodic and coarticulatory cost

Prosodic cost proportional to the difference in F0 (or log F0) at the boundary

Coarticulatory cost based on empirical knowledge of perceived distance

Unit cost

Contribution from prosody and context

Prosody cost proportional to difference in F0

Contextual cost for using a unit from a different phonetic context; based on empirical data


Objective function for search (3)



Data driven cost function

Transition cost

Measure of spectral discontinuity, e.g. spectral distance in the transition area (distance between the end frame of the preceding unit and the first frame of the succeeding unit)

(Optional) prosodic cost, e.g. magnitude of log(F0) difference

Unit cost

Based on context

Examples:

Same context means no cost, different context gives infinite cost

Generalized triphones (GT): units belonging to the same GT means no cost, otherwise the cost is infinite

Phonetic decision trees, e.g. no cost for units at same leaf node


Optimal unit string selection



Given

The objective function to be minimized

A target sequence from the TTS front end

A unit inventory

The minimization can be performed using standard dynamic programming techniques (Viterbi-style)

Similar to HMM decoding, but no probabilities, only cost values

Search can be further simplified by e.g. clustering of units

Initial search using units representing each cluster

Search refinement by selecting best cluster member as selected unit
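The Viterbi-style search can be sketched as below, directly minimizing a sum of unit and transition costs over the candidate lattice; the clustering refinements are omitted, and the cost functions are left as caller-supplied placeholders:

```python
def select_units(candidates, du, dt):
    """candidates[j] lists the database units that could realize target j;
    du(u, j) is the unit (target) cost, dt(v, u) the transition cost from v to u.
    Returns the minimum-cost unit sequence and its total cost."""
    cost = [[du(u, 0) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for j in range(1, len(candidates)):
        row, brow = [], []
        for u in candidates[j]:
            # best predecessor for this candidate
            prev = [cost[j - 1][i] + dt(v, u) for i, v in enumerate(candidates[j - 1])]
            arg = min(range(len(prev)), key=prev.__getitem__)
            row.append(prev[arg] + du(u, j))
            brow.append(arg)
        cost.append(row)
        back.append(brow)
    # backtrack from the cheapest final candidate
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][k]
    path = [k]
    for j in range(len(candidates) - 1, 0, -1):
        k = back[j][k]
        path.append(k)
    path.reverse()
    return [candidates[j][i] for j, i in enumerate(path)], total
```

As the slide notes, this is the same recursion as HMM decoding, with additive costs in place of log probabilities.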


Some central issues



1. Which unit?

2. How to design the acoustic library (inventory)?

Content, recording conditions, reading style

Annotation type, level

Segmentation and labeling consistency, effort, automation?

3. How to select the best sequence of units from the acoustic inventory

4. How to perform prosodic modification of the selected sequence?


4. Prosodic modification

Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small

Also desirable for unit selection synthesis due to LNRE

Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion

Audio examples (in the original lecture): original; duration modified; pitch modified; duration and pitch modified

(Synchronous) Overlap and Add




OLA: time-scale modification with a fixed distance between analysis windows; produces irregular pitch periods

SOLA: the analysis window is placed at the position which gives maximum correlation between the windows
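A sketch of the SOLA idea under simplifying assumptions (rectangular read window, linear cross-fade, and illustrative window/hop/search sizes, none of which are specified on the slide):

```python
import math

def sola_stretch(x, alpha, win=400, hop=200, search=50):
    """SOLA time-scale sketch: windows are read every round(hop/alpha) input
    samples and overlap-added every hop output samples; each window is first
    shifted by the lag (within +-search) that maximizes its correlation with
    the already-synthesized signal, avoiding plain OLA's irregular periods."""
    out = list(x[:win])
    in_hop = max(1, int(round(hop / alpha)))
    pos_in, pos_out = in_hop, hop
    while pos_in + win + search <= len(x) and pos_in - search >= 0:
        ov = len(out) - pos_out                  # overlap length
        best_s, best_c = 0, -math.inf
        for s in range(-search, search + 1):     # pick max-correlation lag
            seg = x[pos_in + s: pos_in + s + ov]
            c = sum(a * b for a, b in zip(out[pos_out:], seg))
            if c > best_c:
                best_c, best_s = c, s
        seg = x[pos_in + best_s: pos_in + best_s + win]
        for i in range(ov):                      # linear cross-fade
            w = i / ov
            out[pos_out + i] = (1 - w) * out[pos_out + i] + w * seg[i]
        out.extend(seg[ov:])
        pos_in += in_hop
        pos_out += hop
    return out
```

With alpha > 1 the output is longer than the input (slower speech) while the local waveform, and hence the pitch, is preserved.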


Pitch Synchronous OLA (PSOLA)



The window is pitch synchronous, centered around an excitation pulse, with duration equal to two pitch periods, 2·T0

Allows for simple modification of pitch frequency



Can also modify duration

Unvoiced sounds:

Fixed window length (< 10 ms)



Can invert every other repeated segment in order to avoid periodicities
when expanding duration

Can provide high quality as long as the degree of modification is relatively low (< 2)
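A bare-bones sketch of the PSOLA overlap-add step: a Hann window two local pitch periods long is centered on each analysis epoch, and each grain is re-centered on its synthesis epoch. The Hann shape and the caller-supplied epoch mapping are assumptions for illustration:

```python
import math

def psola_resynth(x, ana_epochs, syn_epochs, ana_of_syn):
    """Overlap-add Hann-windowed two-period grains.  ana_of_syn[j] gives the
    analysis epoch whose grain is reused at synthesis epoch j (a real system
    maps epochs by nearest neighbour on the warped time axis)."""
    out = [0.0] * (2 * len(x))
    for j, ts in enumerate(syn_epochs):
        i = ana_of_syn[j]
        # local analysis period: half the distance between surrounding epochs
        t0 = (ana_epochs[min(i + 1, len(ana_epochs) - 1)]
              - ana_epochs[max(i - 1, 0)]) // 2
        t0 = max(t0, 1)
        for n in range(-t0, t0 + 1):
            ta = ana_epochs[i] + n
            if 0 <= ta < len(x) and 0 <= ts + n < len(out):
                w = 0.5 * (1.0 + math.cos(math.pi * n / t0))  # Hann, zero at +-t0
                out[ts + n] += w * x[ta]
    return out
```

Placing the synthesis epochs closer together than the analysis epochs raises the pitch; repeating or skipping grains changes the duration.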


PSOLA principle, F0 change



(Figure: original signal with marked epochs; the windows are shifted to the new epoch spacing T = 1.25·T0 and overlap-added to form the re-harmonized signal.)


PSOLA duration and F0 modification



PSOLA principle

An impulse train with period T,

    e(n) = Σ_{k=-∞..∞} δ(n - kT)

excites a filter with impulse response s(n), where s(n) determines the spectral envelope:

    x(n) = e(n) * s(n) = Σ_{k=-∞..∞} s(n - kT)

Using an appropriate, pitch synchronous window, T0 can be changed without changing the spectral envelope (exact match at f = k·F0)

Window type and degree of change will determine distortion outside
pitch harmonics (interpolated values, correctness determined by
window sidelobes)


How to determine the synthesis epochs

ts(j): time instant of pitch pulse (epoch) j in the synthesis

Ps(t): desired pitch period at time t

If Ps(t) is slowly varying: ts(j+1) - ts(j) = Ps(ts(j))

Exact:

    ts(j+1) - ts(j) = (1 / (ts(j+1) - ts(j))) · ∫_{ts(j)}^{ts(j+1)} Ps(t) dt

Next pulse offset by mean pitch within the synthesis interval



Iterative calculation
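The iterative calculation can be sketched as a fixed-point iteration on the implicit equation above; the 8-point midpoint quadrature for the mean period is an arbitrary choice (it is exact for constant and linear Ps):

```python
def next_epoch(ts, P, tol=1e-9):
    """Find ts(j+1) given ts(j) = ts: the interval length must equal the mean
    desired pitch period P(t) over [ts, ts + d].  Iterate d <- mean(P) until
    it stops changing."""
    d = P(ts)                                   # initial guess: local period
    for _ in range(100):
        m = sum(P(ts + d * (k + 0.5) / 8) for k in range(8)) / 8
        if abs(m - d) < tol:
            break
        d = m
    return ts + d
```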


Synthesis epoch calculation



Pitch modification, no time scaling



Original pitch: Pa(t) = ta(i+1) - ta(i)

Desired pitch: Ps(t) = β(t)·Pa(t)

    ts(j+1) - ts(j) = (1 / (ts(j+1) - ts(j))) · ∫_{ts(j)}^{ts(j+1)} β(t) Pa(t) dt

Pa(t) is piecewise constant

β(t) is normally constant or linear


Changing the time scale



    ts = D(ta) = ∫_0^{ta} α(τ) dτ

Presume α(τ) = α (reduces speed when α > 1)

Similar derivation to the pitch epoch determination:

    ts(j+1) - ts(j) = (α / (ts(j+1) - ts(j))) · ∫_{ts(j)/α}^{ts(j+1)/α} Pa(t) dt

If Pa(t) ≈ Pa (constant in the interval): ts(j+1) - ts(j) = Pa (the pitch period is preserved; only the number of epochs changes)

Changing the time scale



All the modifications



Changing both time and pitch:

    ts(j+1) - ts(j) = (α / (ts(j+1) - ts(j))) · ∫_{ts(j)/α}^{ts(j+1)/α} β(t) Pa(t) dt

Epoch positioning in database



Database must be annotated with pitch pulse location

Accurate positioning necessary for good performance

Automatic methods using pitch estimation techniques give reasonably
good results

Use of a laryngograph (electroglottograph, EGG) during recording is recommended

Measures the resistance over the vocal cords, which depends on the glottal opening



Peak picking of derivative of EGG signal
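The peak-picking idea can be sketched as below; the amplitude threshold and the minimum-distance constraint are illustrative assumptions, not the lecture's actual algorithm:

```python
def epochs_from_egg(egg, fs, max_f0=500.0, thresh=0.3):
    """Pick epochs as peaks of the EGG time derivative (glottal closure shows
    up as the steepest slope).  A peak must exceed thresh * the derivative's
    global maximum and lie at least one minimal pitch period (fs / max_f0
    samples) after the previously accepted epoch."""
    d = [egg[n + 1] - egg[n] for n in range(len(egg) - 1)]
    if not d:
        return []
    peak = max(d)
    min_dist = int(fs / max_f0)
    epochs, last = [], -min_dist
    for n in range(1, len(d) - 1):
        if (d[n] >= d[n - 1] and d[n] > d[n + 1]
                and d[n] > thresh * peak and n - last >= min_dist):
            epochs.append(n)
            last = n
    return epochs
```

Regions producing no peaks above the threshold would be treated as unvoiced, matching the voiced/unvoiced determination mentioned on the next slide.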


Epochs from EGG signal



(Figure: speech signal, EGG signal, and detected pulse locations.)

Peak picking on EGG or its time derivative



Accurate epoch and F0 estimation

Voiced/unvoiced determination


PSOLA limitations

Amplitude mismatch

Voiced fricatives

Increased buzziness

All modification will introduce distortion and unnaturalness

The degree depends on the amount of modification

Limits on maximum modification


Phase mismatches

Wrong positioning of pitch pulses in the database causes glitches in the output

Pitch mismatches

Even with correct F0 and pulse positions, different F0 in the segments causes spectral and waveform discontinuities

HMMs for synthesis


In speech recognition, hidden Markov models are used to model speech production

The task of the recognizer is to find the model that best explains the observed utterance

If the HMM is used for generating observations, the produced feature vector sequence can be used to produce speech from a given unit sequence (phone sequence)


The feature vectors must be suitable for speech production



Combination of continuous and discrete elements

Modifications to HMM theory are necessary to facilitate the generative mode

Potential for efficient and flexible synthesis

Basis for HMM-based synthesis


HMM-based speech synthesis



Training from database



Produce excitation and filter parameters for e.g. LPC-type speech generation

Text-to-speech synthesis: a hybrid solution

Speech training database

HMM-based system for prediction and unit selection

Experimental system

Very good evaluation in international competition (Blizzard Challenge 2010)

(Block diagram, components: training data; input text; TTS frontend; analysis; target model construction; HTS training; state alignment; voice database; candidate list construction; selection & boundary decision; waveform concatenation.)

A few examples

Diphone synthesis: Festival, Arne, Infovox

Unit selection synthesis: Festival, AT&T NextGen


HMM synthesis

Hybrid HMM/Unit selection

Limited domain unit selection synthesis


Summary

Data-driven vs synthesis by rule.

Current synthesis generation is concatenative waveform synthesis.

Single-exemplar synthesis, such as diphone synthesis, requires units to be prosodically modified.

Unit selection synthesis aims to use natural prosody and minimal
prosodic modification.

Issues in waveform synthesis:


Unit definition.

Definition, realization and annotation of waveform library.

Unit selection - search.

Prosodic modification.

