
Vietnamese Large Vocabulary Continuous Speech Recognition

Thang Tat Vu*, Dung Tien Nguyen**, Mai Chi Luong**, John-Paul Hosom***.

* Graduate School of Information Science


Japan Advanced Institute of Science and Technology
vu-thang@jaist.ac.jp
**Institute of Information Technology
Vietnamese Academy of Science and Technology.
dungtn@gmail.com, lcmai@ioit.ncst.ac.vn
***Center for Spoken Language Understanding (CSLU-OGI)
Oregon Health & Science University (OHSU)
hosom@cslu.ogi.edu

Abstract

This paper presents an early study on building a Vietnamese large vocabulary continuous speech recognition system, concentrating on the choice of unit type and feature set. Our experiments were done using the HTK Toolkit and the VOV broadcast corpus. The results show that the recognizer with mixture units achieved better performance than recognizers with initial-final units or phoneme units. Among the feature sets applied to the mixture unit recognizer, MFCC performs somewhat better than PLP, and the combination of MFCC and F0 features increases the accuracy of the Vietnamese recognition system.

1. Introduction

Automatic speech recognition (ASR) is one branch of the field of speech processing, and is related to a number of different fields of knowledge such as acoustics, linguistics, pattern recognition, and artificial intelligence [6]. The complexity of an ASR system depends on its constraints, such as (a) speaker independence or dependence, (b) large, medium or small vocabulary, (c) complexity of grammar, or (d) continuous, connected or isolated speech.

Large vocabulary continuous speech recognition is only at the beginning of its development for Vietnamese. The equivalent task is considered challenging even in English. Many existing methods for speech recognition have been developed for spoken English and other European languages. The purpose of our work is to apply language-independent ASR techniques using Hidden Markov Models [6, 7, 8, 9] to the Vietnamese language, to investigate the effect of different types of phonetic units on recognition performance, and to evaluate the feature sets used in classification.

Another difficulty is that Vietnamese is a syllabic tonal language, which has six lexical tones for most syllables. In such a tonal language, the meaning of a syllable depends on its tone, and tone classification is an essential issue for a Vietnamese ASR system. Although there are some good results for other tonal languages [11, 13], there are almost no results for Vietnamese tone classification [10], especially for continuous speech.

The MFCC and PLP features do not capture the F0 contour, which is the most important characteristic for distinguishing different tones [11, 13]. The use of a bigram language model improves the accuracy but is not enough to resolve the tonal problem. Therefore, the combination of MFCC and F0 features was used to improve the accuracy of a Vietnamese large vocabulary continuous speech recognition system.

The rest of this paper is organized as follows:
- In the next section, a brief description of Vietnamese phonetics is given.
- In Section 3, we describe the VOV corpus, the speech corpus used in our experiments and evaluations.
- In Sections 4, 5 and 6, the Vietnamese speech recognition system and a number of experiments are described, along with their results, which show the interaction between unit types, feature sets and recognition performance.
- Finally, conclusions and future research are given in the last section.

2. Basic Phonetic Structure of Vietnamese

Vietnamese is a monosyllabic tonal language. Each Vietnamese syllable may be considered as a combination of Initial, Final and Tone components.

Table 1. Structure of Vietnamese syllable

Tone
Initial   Final
          Onset   Nucleus   Coda

The Initial component is always a consonant, or it may be omitted in some syllables (the zero initial). There are 21 Initials and 155 Finals in Vietnamese. The total number of pronounceable distinct syllables in Vietnamese is 18958, but only around 7000 different syllables are used in practice [1].

The Final can be decomposed into Onset, Nucleus and Coda. The Onset and Coda are optional and may not exist in a syllable. The Nucleus consists of a vowel or a diphthong, and the Coda is a consonant or a semi-vowel. There are 1 Onset, 16 Nuclei and 8 Codas in Vietnamese.

Figure 1. Vietnamese Tone Patterns
[Figure: F0 contours of the tone patterns (1) to (8); vertical axis: Pitch, horizontal axis: Time.]

There are six lexical tones in Vietnamese, and they can affect word meaning; six different tones applied to a syllable can result in six distinct words. Syllables with a closure coda can only go with rising tones and drop tones [3, 4]. As shown in Figure 1, the rising and drop tones of syllables ending with stop consonants have F0 contours similar to the rising and drop tones of other syllables, but they rise or drop more sharply [2, 5].

Therefore, most linguists who study Vietnamese acoustics claim that the Vietnamese language contains 8 different tones based on F0 contours. The properties of the F0 contours associated with the different Vietnamese tones are summarized in Table 2.

Table 2. Classification of Vietnamese tones

                      Contour
Pitch   Flat          Unflat
                      Unsteady      Steady       Stop
High    (1) Level     (3) Broken    (5) Rising   (7) Rising (closure coda)
Low     (2) Falling   (4) Curve     (6) Drop     (8) Drop (closure coda)

3. Speech Corpus

The corpus used in this work is the VOV (Voice of Vietnam) corpus: a collection of story reading, VOV mailbag, news reports, and colloquy from the radio program “the Voice of Vietnam”. There are 23424 utterances in the corpus, from about 30 male and female broadcasters and visitors. The number of distinct syllables with tone is 4923 and the number of distinct syllables without tone is 2101. Therefore, the corpus covers all Vietnamese phonemes and most Vietnamese syllables. The total size of the corpus in WAV format is about 2.5 GB.

One deficiency of the corpus is the number of unique speakers. A single radio station has only a limited number of broadcasters, and their voices do not cover most variations of Vietnamese speech. The corpus is also not phonetically balanced. The data gathered from the story reading section is the largest part of the corpus, at about 1 GB.

To build this corpus, we downloaded the sound files in RealAudio format from the VOV website, and then converted them into Wave format. The data in Wave format have a 16000 Hz sampling rate with an A/D conversion precision of 16 bits. A silence detector was used to cut each long sound file into many small utterances. Each utterance contains approximately 10 syllables. Six people listened to more than 50000 utterances and selected 23424 good utterances from this set before typing the corresponding transcriptions. The utterances and their associated transcriptions are the inputs needed for training and evaluating HMM units with the HTK tools. Nine-tenths of the data was randomly chosen for the training set, and the other one-tenth was used for the test set. The same training and test data were used for all experiments.

4. Recognition System

The recognizers in this work were trained and tested using HTK, the Hidden Markov Model Toolkit [7], which can be freely downloaded for research purposes from the Cambridge University Engineering Department web site.

Figure 2. Three state HMM
[Figure: three emitting states, labelled 1, 2 and 3, above an acoustic vector sequence.]

The 3-state HMM architecture with an embedded training process was chosen for all of our experiments. For building a continuous speech recognition system, the non-emitting entry and exit states provide the glue needed to join the models of HMM units together [7].

F0 was extracted with the Praat tool, which can also be freely downloaded from the Institute of Phonetic Sciences (IFA) at the University of Amsterdam. F0 values were used in the third experiment. The documentation for the HTK Toolkit [7] and for the Praat tools [12] was used as a guide for carrying out all our experiments.

The system was trained using the embedded training capability of HTK, and training was performed until convergence. A 3-state HMM was used for each unit. After training monophone HMMs, we used these models to create triphone HMMs. Tree-based clustering was applied to share the parameters of similar triphone HMMs in all our experiments.

The same grammar, namely a bigram model at the syllable level, was used to improve the accuracy of each system in the experiments. Gaussian mixture models were also used to improve accuracy; the number of mixture components was around half the number of monophone units.

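The 3-state topology with non-emitting entry and exit states described for the recognition system can be made concrete with a small sketch. The code below builds the transition matrix of a left-to-right model with three emitting states plus one entry and one exit state, which is the 5-state layout that HTK-style prototype definitions use. The function name and the self-loop probability of 0.6 are illustrative assumptions, not values taken from the experiments; in practice such initial values are re-estimated during training.

```python
def left_to_right_hmm(n_emitting=3, self_loop=0.6):
    """Return a transition matrix (list of rows) for a left-to-right HMM.

    State 0 is the non-emitting entry state and the last state is the
    non-emitting exit state; states 1..n_emitting emit observations.
    The self_loop value is only an assumed starting point.
    """
    n_states = n_emitting + 2
    trans = [[0.0] * n_states for _ in range(n_states)]
    # The entry state moves straight into the first emitting state.
    trans[0][1] = 1.0
    for state in range(1, n_emitting + 1):
        # Each emitting state either repeats or advances by one state.
        trans[state][state] = self_loop
        trans[state][state + 1] = 1.0 - self_loop
    # The exit state has no outgoing transitions, so its row stays zero.
    return trans


for row in left_to_right_hmm():
    print(" ".join(f"{p:.1f}" for p in row))
```

Because the exit state of one unit can be tied to the entry state of the next, models of this shape can be chained into syllable and utterance models without adding any emitting states at the joins.
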
5. Experiments

5.1. Unit type experiment

In this experiment, we compared the recognition performance of three systems based on different basic speech units: initial-final units, phonetic units, and mixture units.

The first recognizer was an initial-final unit recognizer. Some examples of the Initial units and Final units in this recognizer are given in Table 3.

The second recognizer in this experiment was a phonetic unit recognizer. Some examples of the phonetic units in this recognizer are given in Table 3.

The last recognizer used in this experiment was based on mixture units. These units were selected using knowledge of Vietnamese acoustic-phonetics. Some units are combinations of two neighboring units, in particular short phonetic units. Naturally, this set is a mixture of units from the previous two unit sets. Some examples of the mixture units in this recognizer are given in Table 3.

The bigram language model was used for all three recognizers. This language model contains information about the probability of a sequence of two syllables, and syllables may be separated by an sp (short pause) unit [7]. All three recognizers for this experiment were trained and tested using the same feature set: PLP_E_D_A with 12 PLP coefficients. In HTK, this means that the feature vectors have 39 dimensions: 12 PLP coefficients plus energy (E), and their delta (D) and acceleration (A) values.

5.2. Feature set experiment

In this second experiment, we applied different feature sets to our mixture unit recognizer. The motivation for this experiment was to study the influence of feature extraction on recognition performance.
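
The dimension counts quoted for feature kinds such as PLP_E_D_A follow from simple arithmetic: the static vector (base coefficients plus energy) is repeated once for each of the static, delta and acceleration blocks. The sketch below reproduces that count. The function name and the `extra_static` parameter (used here to stand for an added F0 value) are illustrative assumptions, and only the E, D and A qualifiers are handled.

```python
def feature_dim(n_coeffs, qualifiers="", extra_static=0):
    """Count feature-vector dimensions for a qualifier string such as "E_D_A".

    n_coeffs     -- number of base coefficients (e.g. 12 PLP or MFCC values)
    qualifiers   -- underscore-separated subset of E, D, A
    extra_static -- assumed number of extra static values (e.g. 1 for F0)
    """
    quals = [q for q in qualifiers.split("_") if q]
    unknown = set(quals) - {"E", "D", "A"}
    if unknown:
        raise ValueError(f"unsupported qualifiers: {sorted(unknown)}")
    # Static part: base coefficients, optional extras, optional energy.
    static = n_coeffs + extra_static + (1 if "E" in quals else 0)
    # One block for the statics, plus one each for deltas and accelerations.
    blocks = 1 + (1 if "D" in quals else 0) + (1 if "A" in quals else 0)
    return static * blocks


print(feature_dim(12, "E_D"))                    # 26
print(feature_dim(12, "E_D_A"))                  # 39
print(feature_dim(12, "E_D_A", extra_static=1))  # 42
```

The last call corresponds to appending a single F0 value to the static vector before the delta and acceleration blocks are computed.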

Table 3. Examples of different units for recognition systems.


English  Vietnamese  Telex     Tone  Initial-Final   Phoneme                 Mixture
zero     không       khoong    1     /kh/ /oong1/    /kh/ /oo1/ /ngz1/       /kh/ /oo1/ /ngz1/
boat     thuyền      thuyeenf  2     /th/ /uyeen2/   /th/ /u/ /iee2/ /nz2/   /th/ /u/ /iee2/ /nz2/
act      diễn        dieenx    3     /d/ /ieen3/     /d/ /iee3/ /nz3/        /d/ /iee3/ /nz3/
seven    bẩy         baayr     4     /b/ /aai4/      /b/ /aa4/ /iz4/         /b/ /aa4/ /iz4/
four     bốn         boons     5     /b/ /oon5/      /b/ /oo5/ /nz5/         /b/ /oo5/ /nz5/
spot     mụn         munj      6     /m/ /un6/       /m/ /u6/ /nz6/          /m/ /u6/ /nz6/
style    mốt         moots     5     /m/ /oot7/      /m/ /oo7/ /tc7/         /m/ /oot7/
one      một         mootj     6     /m/ /oot8/      /m/ /oo8/ /tc8/         /m/ /oot8/
unit     chiếc       chieecs   5     /ch/ /ieec7/    /ch/ /iee7/ /c7/        /ch/ /ieec7/
cheat    bịp         bipj      6     /b/ /ip8/       /b/ /i8/ /pc8/          /b/ /ip8/
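
The Initial-Final column of Table 3 can be approximated mechanically from the Telex spelling. The sketch below is a simplified illustration, not the procedure used to build the recognizers: it assumes the Telex tone key (s, f, r, x, j) is typed as the last letter, uses an assumed and incomplete list of initial spellings, and leaves the final in its Telex spelling (so it returns "aay" where the table writes /aai4/). Tones 5 and 6 are renumbered to 7 and 8 when the final ends in a stop consonant, following the tone classes of Table 2.

```python
# Telex tone keys mapped to the tone numbers used in Table 3.
TONE_KEYS = {"f": 2, "x": 3, "r": 4, "s": 5, "j": 6}
# Finals ending in these letters have a closure (stop) coda.
STOP_CODAS = ("ch", "c", "p", "t")
# Assumed, simplified list of initial spellings, longest first.
INITIALS = sorted(
    ["ngh", "ng", "nh", "kh", "th", "tr", "ch", "gh", "ph", "dd",
     "b", "c", "d", "g", "h", "k", "l", "m", "n", "q", "r", "s",
     "t", "v", "x"],
    key=len,
    reverse=True,
)


def split_syllable(telex):
    """Split a Telex-spelled syllable into (initial, final, tone number)."""
    s = telex.lower()
    tone = 1  # no tone key means the level tone
    if s and s[-1] in TONE_KEYS:
        tone = TONE_KEYS[s[-1]]
        s = s[:-1]
    # Longest matching initial, or the empty string for a zero initial.
    initial = next((i for i in INITIALS if s.startswith(i)), "")
    final = s[len(initial):]
    # Rising and drop tones on a stop coda form the separate classes 7 and 8.
    if final.endswith(STOP_CODAS):
        tone = {5: 7, 6: 8}.get(tone, tone)
    return initial, final, tone


for word in ["khoong", "thuyeenf", "moots", "mootj", "bipj"]:
    print(word, split_syllable(word))
```

Splitting the final further into phoneme or mixture units would need the unit inventories themselves, which are only shown by example in the table.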

The first recognizer in this experiment used PLP_E_D with 12 PLP coefficients. The feature vector has 26 dimensions: 12 PLP coefficients plus energy (E), and their deltas (D).

The second recognizer in this experiment used MFCC_E_D with 12 MFCC coefficients. The feature vector has 26 dimensions: 12 MFCC coefficients plus energy (E), and their deltas (D).

The third recognizer in this experiment used PLP_E_D_A with 12 PLP coefficients. The feature vector has 39 dimensions: 12 PLP coefficients plus energy (E), and their delta and acceleration coefficients (D_A).

The fourth recognizer in this experiment used MFCC_E_D_A with 12 MFCC coefficients. The feature vector has 39 dimensions: 12 MFCC coefficients plus energy (E), and their delta and acceleration coefficients (D_A).

5.3. F0 and MFCC combination experiment

In this experiment, we applied features from the F0 contour to improve the best system from the first two experiments, which uses MFCC_E_D_A. In addition to the MFCC features extracted by HTK, we also used F0 features extracted by the Praat tool. All the features were written out in HTK format. The feature vectors used here therefore have 42 dimensions: 12 MFCC coefficients, F0 and energy, plus their delta and acceleration values.

6. Results

Table 4 shows the results of the unit type experiment as word accuracy (WA) on the test set. The phonetic unit recognizer performs better than the initial-final unit recognizer, demonstrating the effectiveness of phonetic modeling. The mixture unit recognizer has better recognition accuracy than the other recognizers, showing that the most suitable basic unit for Vietnamese large vocabulary continuous speech recognition is the mixture unit.

Table 4. Recognition performance of three recognizers: initial-final unit, phonetic unit and mixture unit.

basic speech unit     WA
Initial-Final Unit    65.79
Phoneme Unit          71.90
Mixture Unit          72.38

"WA" indicates word-level accuracy (in percent).

Table 5 shows the results of the feature set experiment. The mixture unit recognizer with MFCC_E_D_A achieves the best result, with 73.15% word accuracy. This result shows that MFCC features have somewhat better performance than PLP in our experiment. It should be noted that Table 5 also demonstrates that the addition of A (acceleration coefficients) to the feature set notably improves the performance of the recognizers.

Table 5. Recognition performance of the mixture unit recognizer with four different feature sets.

feature set       WA
MFCC13+D          68.03
MFCC13+D+A        73.15
(PLP12+E)+D       67.98
(PLP12+E)+D+A     72.38

Table 6 shows the results of the F0 and MFCC combination experiment. The accuracy of the mixture unit recognizer with MFCC_E_D_A was improved by adding the F0 feature and its delta and acceleration values. The word accuracy improved by approximately 10 percentage points, from 73.15% to 82.97%. The word error rate thus fell from 26.85% to 17.03%, a relative reduction in error of roughly 36.5%.

Table 6. Recognition performance before and after applying F0 features in the ASR system.

feature set          WA
MFCC13+D+A           73.15
(MFCC13+F0)+D+A      82.97

7. Conclusions

In this paper, we have presented our study on large vocabulary continuous speech recognition for Vietnamese with a radio-broadcast database. The results show that mixture units selected based on our knowledge of Vietnamese acoustic-phonetics are a better choice than initial-final units or phonetic units. Furthermore, the experiment also showed the need to choose the basic units of classification carefully. It may be possible to continue to improve the accuracy by trying other mixture unit sets.

We also found in our experiments that, among the feature sets used in the mixture unit recognizers, the feature set MFCC_E_D_A (12 MFCC coefficients with energy, plus their delta and acceleration coefficients) achieves the best result. Features extracted from the F0 contour also improve the accuracy by about 10% (absolute) at the word level, corresponding to a relative reduction in error of roughly 36.5%.

In comparison with large vocabulary continuous speech recognition for English (using a more controlled speech database), our system has poorer performance, with only 82.97% word accuracy. However, these are the first known results for Vietnamese large vocabulary continuous ASR.

For future research, building a tone classifier is an essential issue for Vietnamese continuous speech recognition. We will do more experiments focusing on tone recognition. These experiments are planned to include information on pitch contours to further improve accuracy. In another direction, a tone recognition system can be built separately and operated in parallel with a no-tone ASR system. In addition, determining the optimal set of mixture units is an area of future work.

8. References

[1] Đoàn Thiện Thuật, Ngữ âm tiếng Việt (Vietnamese Phonetics), Nhà xuất bản đại học quốc gia (Vietnamese National Editions), 2nd printing, 2003.
[2] M.S. Han, K.O. Kim, "Phonetic variation of Vietnamese tones in disyllabic utterances", Journal of Phonetics, vol. 2, 1974, pp. 223-232.
[3] Vũ Thanh Phương, "The acoustic and perceptual nature of tone in Vietnamese", PhD thesis, Australian National University, Canberra, 1981.
[4] Hansjörg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong, "Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese", in Proceedings of Eurospeech 2003, Geneva, 2003.
[5] Dung Tien Nguyen, Mai Chi Luong, Bang Kim Vu, Hansjoerg Mixdorff, Huy Hoang Ngo, "Fujisaki Model based F0 contours in Vietnamese TTS", ICSLP 2004, Korea, 2004.
[6] Soren Kamaric Riis, "Hidden Markov Model and Neural Network for Speech Recognition", PhD thesis, Technical University of Denmark, 1998.
[7] Steven Young et al., The HTK Book, Cambridge University Engineering Department, December 2003.
[8] S. Young, "Large vocabulary continuous speech recognition", IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.
[9] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, 1989.
[10] Q.C. Nguyen, Eric Castelli, Ngoc-Yen Pham, "Tone Recognition for Vietnamese", Eurospeech 2003, Geneva, 2003.
[11] A. Tungthangthum, "Tone Recognition for Thai", Circuits and Systems, IEEE APCCAS 1998, Asia-Pacific Conference, pp. 157-160.
[12] Paul Boersma and David Weenink, www.praat.org, Institute of Phonetic Sciences (IFA), University of Amsterdam.
[13] Jim J.W, Li D., Jacky C., "Modeling context-dependent phonetic units in a continuous speech recognition system for Mandarin Chinese", Proceedings of ICSLP '96.
