Académique Documents
Professionnel Documents
Culture Documents
React appropriately Convert the input speech into another medium, e.g. text
(Dis)continuous speech
Discontinuous speech much easier to
recognize
Single words tend to be pronounced more clearly
coarticulation effects
Weak forms Assimilation Contractions
languages
Approaches to ASR
Template matching Knowledge-based (or rule-based) approach Statistical approach:
Noisy channel model + machine learning
Template-based approach
Store examples of units (words,
phonemes), then find the example that most closely fits the input Extract features from speech signal, then its just a complex similarity matching problem, using solutions developed for all sorts of applications OK for discrete utterances, and a single user
Rule-based approach
Use knowledge of phonetics and linguistics
to guide search process Templates are replaced by rules expressing everything (anything) that might help to decode:
Phonetics, phonology, phonotactics Syntax Pragmatics
Statistics-based approach
Collect a large corpus of transcribed
speech recordings Train the computer to learn the correspondences (machine learning) At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
voice input(In addition to DTMF input) from the user and play back relevant information accordingly.
System Architecture
Overview
The speech detector takes audio as input and
returns status events to indicate when speech was found or if no speech occurred within a specified timeout The recognizer takes as input audio and grammars, returns status events to indicate when the caller finished speaking and when recognition finished, and also returns recognition results. The speech detector and recognizer do not communicate to each other directly. Instead, the platform directly controls the speech detector and recognizer as separate resources.
IV A R pplication
N uance S W
In case success is received fromN uance SW the song codes corresponding to the Song/M ovie N e are fetched fromthe am database and song is played to the user. In case of failure user is asked for speech input again. U ser listens to the played song/ searches for next/ dow nload/ dedicate/ search for another m ovie or song/drops the call R esponse as per the user input
Example
A grammar is a text file that defines the
sentences that can be spoken by callers It specifies what the recognition engine can understand If the word spoken by the user is detected by the Nuance software, the word is search in the grammar file If the word is found, the corresponding keyword is return to the IVR application
Remaining problems
Robustness graceful degradation, not catastrophic failure Portability independence of computing platform Adaptability to changing conditions (different mic,
background noise, new speaker, new task domain, new language even) Language Modelling is there a role for linguistics in improving the language models? Confidence Measures better methods to evaluate the absolute correctness of hypotheses. Out-of-Vocabulary (OOV) Words Systems must have some method of detecting OOV words, and dealing with them in a sensible way. Spontaneous Speech disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions etc) remain a problem. Prosody Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
Conclusion
ASR Combines many probabilistic phenomena:
varying acoustic features of phones, likely pronunciations of words, likely sequences of words Relies upon many approximate techniques to translate a signal ASR has moved beyond discrete speech and into continuous speech ASR is promising technology, it may be many years before it will be useful in natural settings