Thang Tat Vu*, Dung Tien Nguyen**, Mai Chi Luong**, John-Paul Hosom***.
                  Contour
Pitch     Flat          Unflat
                        Unsteady      Steady        Stop
High      (1) Level     (3) Broken    (5) Rising    (7) Rising (closure coda)
Low       (2) Falling   (4) Curve     (6) Drop      (8) Drop (closure coda)

3. Speech Corpus

The corpus used in this work is the VOV (Voice of Vietnam) corpus: a collection of story reading, VOV mailbag, news reports, and colloquy from the radio program "the Voice of Vietnam". The corpus contains 23,424 utterances from about 30 male and female broadcasters and visitors. The number of distinct syllables with tone is 4923 and the number of distinct syllables without tone is 2101. Therefore, the corpus covers all Vietnamese phonemes and most Vietnamese syllables. The total size of the corpus in WAV format is about 2.5 GB.

One deficiency of the corpus is the number of unique speakers. A single radio station has only a limited number of broadcasters, and their voices do not cover most variations of Vietnamese speech. The corpus is also not phonetically balanced. The data gathered from the story reading section forms the largest part of the corpus, at about 1 GB.

The 3-state HMM architecture with the embedded training process was chosen for all of our experiments. When building a continuous speech recognition system, the non-emitting (null) entry and exit states provide the glue needed to join HMM unit models together [7].

F0 was extracted with the Praat tool, which can be freely downloaded from the Institute of Phonetic Sciences (IFA) at the University of Amsterdam. F0 values were used in the third experiment. The documentation for the HTK Toolkit [7] and for the Praat tools [12] was used as a guide for carrying out all our experiments.

The system was trained using the embedded training capability of HTK, and training was performed until convergence. A 3-state HMM was used for each unit. After training monophone HMMs, we used these models to create triphone HMMs. Tree-based clustering was applied to share the parameters of similar triphone HMMs in all our experiments.

The same grammar, namely a bigram model at the syllable level, was used to improve the accuracy of each system in the experiments. Gaussian mixture models were also used to improve accuracy; the number of mixture components was about half the number of monophone units.
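The syllable-level bigram grammar mentioned above can be illustrated with a small sketch. The code below estimates P(current syllable | previous syllable) from syllable sequences by counting, with add-one smoothing. The function names, the sentence boundary markers, the smoothing choice, and the toy syllables are all assumptions made for illustration; they are not taken from the HTK setup used in the experiments.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers

def train_bigram(utterances):
    """Count bigram histories and bigrams over syllable sequences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {END}  # START is never predicted, so it is not in the vocabulary
    for syllables in utterances:
        seq = [START] + list(syllables) + [END]
        vocab.update(syllables)
        for prev, cur in zip(seq, seq[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, cur, history_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing (an assumed smoothing scheme)."""
    return (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + len(vocab))

# Toy data: each utterance is a list of syllables (placeholders, not VOV data).
utterances = [["xin", "chao"], ["xin", "loi"], ["chao", "ban"]]
h, b, v = train_bigram(utterances)
p_seen = bigram_prob("xin", "chao", h, b, v)    # bigram observed once
p_unseen = bigram_prob("xin", "ban", h, b, v)   # bigram never observed
```

In this toy example a bigram that was observed receives a higher probability than an unobserved one, and the probabilities over all possible next syllables sum to one for a given history.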
The last recognizer used in this experiment was based on mixture units. These units were selected based on knowledge of Vietnamese acoustic-phonetics. Some units are the

In this second experiment, we applied different feature sets to our mixture unit recognizer. The motivation of this experiment was to study the influence of feature extraction on recognition performance.
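The feature sets compared in this experiment (listed in Table 5) differ in the base features (MFCC or PLP) and in whether delta (D) and acceleration (A) coefficients are appended. The sketch below shows one common way to compute such dynamic features: a regression over neighbouring frames of the form described in the HTK Book [7], applied once for deltas and again for accelerations. The window half-width of 2, the end-of-signal handling (repeating the first or last frame), and all function names are assumptions for illustration, not the configuration used in the experiments.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients for a list of feature vectors.

    frames: list of equal-length lists of floats (one list per frame).
    window: regression half-width (2 is an assumed value).
    Frames beyond either end are replaced by the first or last frame.
    """
    n = len(frames)
    denom = 2 * sum(th * th for th in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for k in range(len(frames[t])):
            num = 0.0
            for th in range(1, window + 1):
                ahead = frames[min(t + th, n - 1)][k]
                behind = frames[max(t - th, 0)][k]
                num += th * (ahead - behind)
            vec.append(num / denom)
        out.append(vec)
    return out

def add_dynamic_features(static):
    """Append delta (D) and acceleration (A) coefficients to each static frame."""
    d = deltas(static)
    a = deltas(d)  # acceleration = delta of the delta stream
    return [s + dv + av for s, dv, av in zip(static, d, a)]

# Toy input: 5 frames of 13 static features (e.g. 12 MFCCs + energy),
# each feature rising linearly by 1.0 per frame.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
```

Appending D and A triples the vector size: 13 static features give 39 dimensions per frame, and 14 static features give 42.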
In this experiment, we apply features from the F0 contour to improve the best system from the first two experiments, which uses MFCC_E_D_A. In addition to the MFCC features extracted by HTK, we also used F0 features extracted by the Praat tool. All features are written out in HTK format. The feature vectors used here therefore have 42 dimensions: 12 MFCCs, F0, and energy (14 static features), plus their delta and acceleration coefficients (14 × 3 = 42).

Table 5 shows the results of the feature set experiment. The mixture unit recognizer with MFCC_E_D_A achieves the best result, with 73.15% word accuracy. This result shows that MFCC features perform somewhat better than PLP features in our experiment. It should be noted that Table 3 also demonstrates that the addition of A (acceleration coefficients)
in the feature set does notably improve the performance of the recognizers.

Table 5. Recognition performance of the mixture unit recognizer with four different feature sets.

Feature set        WA (%)
MFCC13+D           68.03
MFCC13+D+A         73.15
(PLP12+E)+D        67.98
(PLP12+E)+D+A      72.38

Table 6 shows the results of the F0 and MFCC combination experiment. The accuracy of the mixture unit recognizer with MFCC_E_D_A was improved by adding the F0 feature and its delta and acceleration values. Word accuracy rose from 73.15% to 82.97%, an improvement of approximately 10% absolute, which corresponds to a relative reduction in error of about 36.5% (word error fell from 26.85% to 17.03%).

Table 6. Recognition performance before and after applying F0 features to the ASR system.

Feature set        WA (%)
MFCC13+D+A         73.15
(MFCC13+F0)+D+A    82.97

7. Conclusions

In this paper, we have presented our study on large vocabulary continuous speech recognition for Vietnamese with a radio-broadcast database. The results show that mixture units selected based on our knowledge of Vietnamese acoustic-phonetics are a better choice than initial-final units and phonetic units. Furthermore, the experiment also showed the need to choose the basic units of classification carefully. It may be possible to continue to improve the accuracy by trying other mixture unit sets.

We also found in our experiments that, among the feature sets used in mixture unit recognizers, the feature set MFCC_E_D_A, which includes 12 MFCC coefficients with energy plus their delta and acceleration coefficients, achieves the best result. Features extracted from the F0 contour also improve the accuracy by 10% (absolute) at word level, corresponding to a 36.5% relative reduction in error.

In comparison with large vocabulary continuous speech recognition for English (using a more controlled speech database), our system has poorer performance, with only 82.97% word accuracy. However, these are the first known results for Vietnamese large vocabulary continuous ASR.

For future research, building a tone classifier is an essential issue for Vietnamese continuous speech recognition. We will do more experiments focusing on tone recognition. These experiments are planned to include information on pitch contours to further improve accuracy. In another direction, a tone recognition system can be built separately and operated in parallel with a no-tone ASR system. In addition, determining the optimal set of mixture units is an area of future work.

8. References

[1] Đoàn Thiện Thuật, Ngữ âm tiếng Việt (Vietnamese Acoustic), Nhà xuất bản đại học quốc gia (Vietnamese National Editions), 2nd printing, 2003.
[2] M.S. Han, K.O. Kim, "Phonetic variation of Vietnamese tones in disyllabic utterances", Journal of Phonetics, Vol. 2, 1974, pp. 223-232.
[3] Vũ Thanh Phương, "The acoustic and perceptual nature of tone in Vietnamese", PhD thesis, Australian National University, Canberra, 1981.
[4] Hansjörg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki, and Mai Chi Luong, "Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese", in Proceedings of Eurospeech 2003, Geneva, 2003.
[5] Dung Tien Nguyen, Mai Chi Luong, Bang Kim Vu, Hansjoerg Mixdorff, Huy Hoang Ngo, "Fujisaki Model based F0 contours in Vietnamese TTS", ICSLP 2004, Korea, 2004.
[6] Soren Kamaric Riis, "Hidden Markov Model and Neural Network for Speech Recognition", PhD thesis, Technical University of Denmark, 1998.
[7] Steven Young et al., The HTK Book, Cambridge University Engineering Department, December 2003.
[8] S. Young, "Large vocabulary continuous speech recognition", IEEE Signal Processing Magazine, Vol. 13, No. 5, pp. 45–57, 1996.
[9] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, 1989.
[10] Q.C. Nguyen, Eric Castelli, Ngoc-Yen Pham, "Tone Recognition for Vietnamese", Eurospeech 2003, Geneva, 2003.
[11] A. Tungthangthum, "Tone Recognition for Thai", Circuits and Systems, IEEE APCCAS 1998, Asia-Pacific Conference, pp. 157-160.
[12] Paul Boersma and David Weenink, www.praat.org, Institute of Phonetic Sciences (IFA), University of Amsterdam.
[13] Jim J.W, Li D., Jacky C., "Modeling context-dependent phonetic units in a continuous speech recognition system for Mandarin Chinese", Proceedings of ICSLP '96.