Text-to-speech synthesis
The automatic transformation from (electronic) text to speech
The speaker is defined by the system design
Single speaker
Torbjørn Svendsen
Speech Synthesis
[Figure: Text -> Phonemic analysis -> Prosodic analysis -> Speech synthesis -> Speech]
Text-to-speech synthesis
[Figure: Text analysis -> Phonemic analysis -> Prosodic analysis -> Speech synthesis]
Quality
[Figure: example systems: Festival, NextGen, Infovox, Klattalk, Voder]
[Figure: synthesis system properties: intelligibility, flexibility, naturalness, bit rate, units, processing needs, cost, complexity, vocabulary. Figure adapted from Granström.]
Main methods
Formant synthesis
Concatenative, or Waveform synthesis
Articulatory synthesis
Formant synthesis
[Figure: annotated phones -> rule system -> formant tracks and pitch contour -> formant synthesis -> synthetic speech]
Normally a rule-based (knowledge-driven) system, but can also be data driven
Each formant can be specified by center frequency, bandwidth and, optionally, amplitude
E.g.
$$ H_i(z) = \frac{1}{1 - 2 e^{-\pi b_i} \cos(2 \pi f_i)\, z^{-1} + e^{-2 \pi b_i} z^{-2}} $$
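A minimal sketch of the resonator H_i(z) and a cascade of such resonators. It assumes that f_i and b_i are the center frequency and bandwidth divided by the sampling rate (the slide does not state the normalization). The function names and the formant values are illustrative only.

```python
import numpy as np

def resonator_coeffs(center_hz, bandwidth_hz, sample_rate):
    """Denominator coefficients [1, a1, a2] of H_i(z) for one formant."""
    f = center_hz / sample_rate      # normalized center frequency (assumption)
    b = bandwidth_hz / sample_rate   # normalized bandwidth (assumption)
    a1 = -2.0 * np.exp(-np.pi * b) * np.cos(2.0 * np.pi * f)
    a2 = np.exp(-2.0 * np.pi * b)
    return np.array([1.0, a1, a2])

def resonate(x, coeffs):
    """All-pole filtering: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    _, a1, a2 = coeffs
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def cascade(x, formants, sample_rate):
    """Cascade implementation: pass the signal through each resonator in turn."""
    for center_hz, bandwidth_hz in formants:
        x = resonate(x, resonator_coeffs(center_hz, bandwidth_hz, sample_rate))
    return x

# Illustrative formant values (center Hz, bandwidth Hz), not taken from the slides.
fs = 16000
impulse = np.zeros(512)
impulse[0] = 1.0
response = cascade(impulse, [(700, 130), (1220, 70), (2600, 160)], fs)
```

Exciting the cascade with a pulse train instead of a single impulse would give a crude voiced sound with these three formants.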
Implementation
Cascade or parallel implementation
Voiced sounds typically use cascade, unvoiced sounds use parallel implementation
LPC filters can also be used
Normally poorer quality
LPC synthesis
[Figure: a pulse generator and a noise generator excite the LPC filter, labelled A(z)]
Example:
Original
LPC-coded, all unvoiced/voiced
Speed: Halved/doubled
Pitch: Halved/doubled
Melody
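The source-filter structure of LPC synthesis can be sketched as follows. This is an illustration under assumptions: the synthesis filter is taken to be the all-pole filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the coefficients below are hypothetical, not estimated from speech.

```python
import numpy as np

def lpc_synthesize(excitation, a):
    """Filter the excitation through 1/A(z), A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

def pulse_train(num_samples, period):
    """Voiced excitation: one unit pulse every `period` samples."""
    e = np.zeros(num_samples)
    e[::period] = 1.0
    return e

def noise(num_samples, seed=0):
    """Unvoiced excitation: white Gaussian noise."""
    return np.random.default_rng(seed).standard_normal(num_samples)

a = [-1.3, 0.8]  # hypothetical stable second-order predictor
voiced = lpc_synthesize(pulse_train(800, 80), a)
voiced_pitch_halved = lpc_synthesize(pulse_train(800, 160), a)  # period doubled
unvoiced = lpc_synthesize(noise(800), a)
```

Because pitch lives only in the excitation, changing the pulse period changes the pitch without touching the filter, which is what makes the "pitch halved/doubled" manipulations straightforward in this model.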
Parameters describe transition shape
Specific rules for all transition types
Formant synthesis
Flexible
Produces intelligible speech with few parameters
Simple implementation
Rule derivation is complex, development can be costly
Limited naturalness
NOTE: Given sufficient training data, formant generation can be data driven, e.g. using an HMM in production mode for generating the formant tracks (Acero et al.)
Articulatory synthesis
The waveform production is performed by describing the movement of
the articulators
Jaw opening, lip rounding, tongue placement and height, ...
Acoustics and fluid mechanics form the basis
Limited success
Complex theory
Computational difficulties and complexity
Synthesis by concatenation
Concatenation of stored waveform fragments
Optional modification of the fragments (duration, pitch, formants)
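A minimal sketch of joining stored waveform fragments. The linear cross-fade at each joint is an assumption made here for illustration; the slides do not specify how fragments are joined. Each fragment must be longer than `overlap`, and `overlap` must be at least 1.

```python
import numpy as np

def concatenate(fragments, overlap):
    """Join fragments, cross-fading `overlap` samples at each joint."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.asarray(fragments[0], dtype=float).copy()
    for frag in fragments[1:]:
        frag = np.asarray(frag, dtype=float)
        # Blend the tail of the output with the head of the next fragment.
        out[-overlap:] = out[-overlap:] * (1.0 - fade_in) + frag[:overlap] * fade_in
        out = np.concatenate([out, frag[overlap:]])
    return out

# Two hypothetical fragments: short sinusoids at slightly different frequencies.
n = np.arange(400)
frag_a = np.sin(2 * np.pi * n / 80.0)
frag_b = np.sin(2 * np.pi * n / 64.0)
joined = concatenate([frag_a, frag_b], overlap=40)
```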
3. How to select the best sequence of units from the acoustic inventory
4. How to perform prosodic modification of the selected sequence?
1. Which unit?
Unit requirements
Low concatenation distortion
Longer units -> fewer concatenation points
Units containing attractive concatenation points
Coverage
E.g. generalized triphones or phonetic decision trees
Diphones (dyads)
~2500 possible units
Reduces discontinuity problems
Widespread use
Sub-phonemic units
Increased use (e.g. IBM, AT&T)
Half-phones (AT&T), phone HMM-state (IBM)
Computationally demanding
Complex search in large database
Large database
Facilitates longer units, variable units
Requires search for the best unit sequence
Rich phonetic and prosodic context
Text selection:
Start with large number of natural sentences
Analyze sentences, predict phonetic and prosodic realization
Use some greedy algorithm to obtain the best coverage possible with a small number of sentences (2000-4000, typically)
Design supplementary sentences to improve coverage
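The greedy text-selection step can be sketched as below. This is one plausible greedy criterion (pick the sentence that adds the most units not yet covered), offered as an assumption; the slides only say "some greedy algorithm". The sentence identifiers and unit labels are hypothetical.

```python
def greedy_select(sentence_units, max_sentences):
    """Pick sentences one at a time, each time taking the one that adds
    the most not-yet-covered units.

    sentence_units: dict mapping sentence id -> set of unit labels
    (e.g. predicted diphones with prosodic context).
    """
    covered = set()
    chosen = []
    remaining = dict(sentence_units)
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        gain = remaining[best] - covered
        if not gain:          # nothing new can be covered
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

# Hypothetical toy corpus.
corpus = {
    "s1": {"a-b", "b-c", "c-d"},
    "s2": {"a-b", "b-c"},
    "s3": {"d-e", "e-f"},
    "s4": {"c-d"},
}
chosen, covered = greedy_select(corpus, max_sentences=3)
```

Units that remain uncovered after this step are the ones for which supplementary sentences would have to be designed.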
Coverage - LNRE (large number of rare events)
[Figure: P(unit)]
Annotation
For small databases, speech can be segmented and annotated manually
Phonemic and prosodic annotation can be detailed
Search problem
Must define an objective function to be minimized
$$ d(\Theta) = \sum_{j=1}^{N} d_u(\theta_j) + \sum_{j=1}^{N-1} d_t(\theta_j, \theta_{j+1}) $$
Empirical strategy:
Transition cost:
Unit cost
Contribution from prosody and context
Prosody cost proportional to difference in F0
Contextual cost by using a unit from a different phonetic context. Based on empirical data.
Unit cost
Based on context
Examples:
Same context means no cost, different context gives infinite cost
Generalized triphones (GT): Unit belongs to same GT means no cost, otherwise cost is infinity
Phonetic decision trees, e.g. no cost for units at same leaf node
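The search for the unit sequence that minimizes a sum of unit costs and transition costs can be sketched with dynamic programming (a Viterbi-style search). This is an illustration, not the method of any particular system. The unit cost follows the "proportional to difference in F0" idea above; the transition cost used here (zero for units adjacent in the database, a fixed penalty otherwise) and the tuple layout `(phone, f0, database_index)` are assumptions.

```python
def select_units(candidates, unit_cost, transition_cost):
    """candidates[j] lists the inventory units usable at target position j.
    Returns (total_cost, path) minimizing the summed unit and transition costs."""
    # best[u] = (cost of the cheapest path ending in unit u, that path)
    best = {u: (unit_cost(0, u), [u]) for u in candidates[0]}
    for j in range(1, len(candidates)):
        new_best = {}
        for u in candidates[j]:
            prev_cost, prev_path = min(
                ((c + transition_cost(p, u), path) for p, (c, path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[u] = (prev_cost + unit_cost(j, u), prev_path + [u])
        best = new_best
    return min(best.values(), key=lambda item: item[0])

# Hypothetical targets (phone, target F0) and candidates (phone, F0, database index).
targets = [("a", 100), ("b", 110)]
candidates = [
    [("a", 100, 0), ("a", 120, 5)],
    [("b", 115, 1), ("b", 110, 9)],
]

def unit_cost(j, u):
    return 0.1 * abs(u[1] - targets[j][1])      # prosody cost from F0 difference

def transition_cost(p, u):
    return 0.0 if u[2] == p[2] + 1 else 1.0     # adjacent in database: free join

cost, path = select_units(candidates, unit_cost, transition_cost)
```

In this toy case the second unit with the slightly worse F0 match wins because it follows the first unit in the database, so no join is needed.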
4. Prosodic modification
Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small
Also desirable for unit selection synthesis due to LNRE
Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion
Example 1: Original - duration - pitch - duration and pitch
PSOLA principle
$$ e(n) = \sum_{k=-\infty}^{\infty} \delta(n - kT) $$
$$ x(n) = e(n) * s(n) = \sum_{k=-\infty}^{\infty} s(n - kT) $$
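The relation x(n) = e(n) * s(n) can be checked numerically on a finite segment. In this sketch the sum is restricted to k >= 0 and a finite number of samples, and the short signal s(n) is a hypothetical windowed sinusoid; both are assumptions made for illustration.

```python
import numpy as np

def periodic_overlap_add(s, period, num_samples):
    """x(n) = sum_k s(n - kT), evaluated for 0 <= n < num_samples and k >= 0."""
    x = np.zeros(num_samples)
    for start in range(0, num_samples, period):
        stop = min(start + len(s), num_samples)
        x[start:stop] += s[: stop - start]
    return x

# Hypothetical short-time signal, two periods long at T = 80 samples.
s = np.hanning(160) * np.sin(2 * np.pi * np.arange(160) / 20.0)

x_long_period = periodic_overlap_add(s, period=80, num_samples=1600)
x_short_period = periodic_overlap_add(s, period=64, num_samples=1600)  # higher pitch

# The same signal obtained by convolving s(n) with the pulse train e(n).
e = np.zeros(1600)
e[::80] = 1.0
x_by_convolution = np.convolve(e, s)[:1600]
```

Changing T changes the pitch of x(n) while the shape of each copy of s(n), and hence the spectral envelope, is kept, which is the idea PSOLA builds on.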
$$ t_s(j+1) - t_s(j) = \frac{\int_{t_s(j)}^{t_s(j+1)} P(t)\,dt}{t_s(j+1) - t_s(j)} $$
$$ t_s(j+1) - t_s(j) = \frac{\int_{t_s(j)}^{t_s(j+1)} \beta(t)\, P_a(t)\,dt}{t_s(j+1) - t_s(j)} $$
$P_a(t)$ is piecewise constant
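A sketch of how synthesis epochs could be placed from the relation between epoch spacing and the mean of beta(t) P_a(t) over that spacing. Several things here are assumptions: beta is taken as a constant that scales the pitch period, P_a(t) is taken as the spacing of the analysis epochs surrounding t, and the equation is solved by a simple fixed-point iteration with a midpoint-rule integral. None of these choices come from the slides.

```python
import numpy as np

def analysis_period(t, t_a):
    """Piecewise-constant P_a(t): spacing of the analysis epochs around t."""
    periods = np.diff(t_a)
    i = np.clip(np.searchsorted(t_a, t, side="right") - 1, 0, len(periods) - 1)
    return periods[i]

def synthesis_epochs(t_a, beta, num_iter=20, grid=64):
    """Place epochs so each spacing equals the mean of beta * P_a over that spacing."""
    t_s = [t_a[0]]
    while t_s[-1] < t_a[-1]:
        t = t_s[-1]
        delta = beta * analysis_period(t, t_a)               # initial guess
        for _ in range(num_iter):                            # fixed-point iteration
            pts = t + (np.arange(grid) + 0.5) * delta / grid  # midpoint rule
            delta = np.mean(beta * analysis_period(pts, t_a))
        t_s.append(t + delta)
    return np.array(t_s)

# Hypothetical analysis epochs (in samples) with a constant period of 100.
t_a = np.arange(0.0, 1001.0, 100.0)
t_s = synthesis_epochs(t_a, beta=0.5)   # period halved, i.e. pitch doubled
```

With a constant analysis period the spacing is simply beta times that period; the iteration only matters where P_a changes inside the interval.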
$$ t_s(j+1) - t_s(j) = \frac{\alpha \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} P_a(t)\,dt}{t_s(j+1) - t_s(j)} $$
$$ t_s(j+1) - t_s(j) = \frac{\alpha \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} \beta(t)\, P_a(t)\,dt}{t_s(j+1) - t_s(j)} $$
Detected pulse locations
PSOLA limitations
Amplitude mismatch
Voiced fricatives
Increased buzziness
Phase mismatches
Wrong positioning of pitch pulses in database
Causes glitches in output
Pitch mismatches
Correct F0 and pulse position
Different F0 in segments causes spectral and waveform discontinuities
If the HMM is used for generating observations, the produced feature vector sequence can be used to produce speech from a given unit sequence (phone sequence)
[Figure: block diagram with the blocks: training data, input text, TTS frontend, analysis, target model construction, HTS training, state alignment, voice database, candidate list construction, selection & boundary decision, waveform concatenation]
A few examples
Diphone synthesis: Festival, Arne, Infovox
HMM synthesis
Hybrid HMM/Unit selection
Limited domain unit selection synthesis
Summary
Data-driven vs synthesis by rule.
The current generation of synthesis is concatenative waveform synthesis.
Single-unit synthesis, e.g. diphone synthesis, requires units to be prosodically modified.
Unit selection synthesis aims to use natural prosody and minimal prosodic modification.
Issues in waveform synthesis:
Unit definition.
Definition, realization and annotation of waveform library.
Unit selection - search.
Prosodic modification.