
AUTOMATIC SPEECH RECOGNITION

What is the task?


Getting a computer to understand spoken language. By "understand" we might mean:
- React appropriately
- Convert the input speech into another medium, e.g. text

What's hard about that?


- Digitization: converting the analogue signal into a digital representation
- Signal processing: separating speech from background noise
- Phonetics: variability in human speech
- Phonology: recognizing individual sound distinctions (similar phonemes)
- Lexicology and syntax: disambiguating homophones; features of continuous speech
- Syntax and pragmatics: interpreting prosodic features
- Pragmatics: filtering out performance errors (disfluencies)

Variability in individuals' speech


Variation among speakers due to:
- Vocal range (f0 and pitch range, see later)
- Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
- ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)

Variation within speakers due to:
- Health, emotional state
- Ambient conditions
- Speech style: formal read vs spontaneous

(Dis)continuous speech
Discontinuous speech is much easier to recognize:
- Single words tend to be pronounced more clearly

Continuous speech involves contextual coarticulation effects:
- Weak forms
- Assimilation
- Contractions

Interpreting prosodic features


Pitch, length and loudness are used to indicate stress.

All of these are relative:
- On a speaker-by-speaker basis
- And in relation to context

Pitch and length are phonemic in some languages.

Approaches to ASR
- Template matching
- Knowledge-based (or rule-based) approach
- Statistical approach: noisy channel model + machine learning (formulated below)
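
The noisy channel view can be stated compactly. As a minimal formulation (standard in the ASR literature, not spelled out on the slide), the recognizer picks the word sequence W that is most probable given the observed acoustics O, which Bayes' rule splits into an acoustic model and a language model:

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid O)
        \;=\; \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        \;=\; \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \;
              \underbrace{P(W)}_{\text{language model}}
```

P(O) is constant over the candidate word sequences W and can be dropped; machine learning (see the statistics-based approach below) estimates both remaining factors from transcribed speech.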

Template-based approach
- Store examples of units (words, phonemes), then find the example that most closely fits the input
- Extract features from the speech signal; then it is just a complex similarity-matching problem, using solutions developed for all sorts of applications (sketched below)
- OK for discrete utterances and a single user
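
As an illustration of the similarity-matching step, here is a minimal dynamic time warping (DTW) sketch in Python. DTW is one common way to compare a feature sequence against stored templates of different lengths; the slides do not name a specific technique, so the frame distance, the random stand-in features, and the template store below are assumptions for the example.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences.

    seq_a, seq_b: arrays of shape (frames, features), e.g. MFCC vectors.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            # Allow a match, an insertion, or a deletion of a frame.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(input_features, templates):
    """Return the label of the stored template closest to the input.

    templates: dict mapping a word label to its feature sequence.
    """
    return min(templates, key=lambda word: dtw_distance(input_features, templates[word]))

# Toy usage with random "features" standing in for real ones.
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(size=(30, 13)), "no": rng.normal(size=(25, 13))}
utterance = templates["yes"] + 0.1 * rng.normal(size=(30, 13))
print(recognize(utterance, templates))  # expected: "yes"
```

This works reasonably for a small vocabulary of discrete utterances from a single user, which is exactly the limitation noted above.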

Rule-based approach
- Use knowledge of phonetics and linguistics to guide the search process
- Templates are replaced by rules expressing everything (anything) that might help to decode (see the sketch after this list):
  - Phonetics, phonology, phonotactics
  - Syntax
  - Pragmatics
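
As a small illustration of how such rules can prune the search, here is a sketch of a phonotactic filter in Python. The specific onset constraint and the hypothesis format are assumptions chosen for the example, not rules taken from the slides.

```python
# Hypothetical phonotactic rule: reject candidate transcriptions containing words
# that start with consonant clusters English does not allow as onsets.
DISALLOWED_ONSETS = {"tl", "dl", "zr"}  # illustrative subset only

def violates_phonotactics(word: str) -> bool:
    return any(word.startswith(onset) for onset in DISALLOWED_ONSETS)

def prune_hypotheses(hypotheses):
    """Keep only candidate word sequences whose words all pass the onset rule."""
    return [
        words for words in hypotheses
        if not any(violates_phonotactics(w) for w in words)
    ]

# Toy usage: the second hypothesis is discarded by the rule.
candidates = [["play", "the", "song"], ["tlay", "the", "song"]]
print(prune_hypotheses(candidates))  # [['play', 'the', 'song']]
```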

Statistics-based approach
- Collect a large corpus of transcribed speech recordings
- Train the computer to learn the correspondences (machine learning)
- At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one (see the sketch below)
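
The run-time search can be pictured as scoring each candidate with the two trained models from the noisy channel formulation and taking the best total. Real systems use HMM or neural acoustic models and beam search over enormous hypothesis spaces; the hard-coded scores and tiny candidate list below are assumptions that only illustrate the argmax step.

```python
# Toy stand-ins for trained models.
ACOUSTIC_LOGPROB = {              # assumed log P(audio | words) scores
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}
LM_LOGPROB = {                    # assumed log P(words) scores
    "recognize speech": -3.0,
    "wreck a nice beach": -9.0,
}

def decode(candidates):
    """Pick the candidate maximizing log P(audio | words) + log P(words)."""
    def score(words):
        return ACOUSTIC_LOGPROB[words] + LM_LOGPROB[words]
    return max(candidates, key=score)

# The language model rescues the acoustically slightly weaker hypothesis.
print(decode(list(ACOUSTIC_LOGPROB)))  # "recognize speech"
```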

ASR-based Voice Application


An ASR-based voice application shall take voice input (in addition to DTMF input) from the user and play back relevant information accordingly.

System Architecture

Overview
- The speech detector takes audio as input and returns status events to indicate when speech was found or whether no speech occurred within a specified timeout.
- The recognizer takes audio and grammars as input, returns status events to indicate when the caller finished speaking and when recognition finished, and also returns recognition results.
- The speech detector and recognizer do not communicate with each other directly. Instead, the platform controls them as separate resources (see the sketch below).
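
To make the division of responsibilities concrete, here is a minimal sketch of how a platform might drive the two resources. The class names, event names, and method signatures are assumptions for illustration, not any particular platform's API; audio is faked as a text string.

```python
from dataclasses import dataclass

@dataclass
class Event:
    source: str   # "detector" or "recognizer"
    status: str   # e.g. "speech_found", "timeout", "done"
    result: str = ""

class SpeechDetector:
    """Watches the audio stream and reports whether speech was found in time."""
    def process(self, audio, timeout_ms):
        if audio:                                    # placeholder energy check
            return Event("detector", "speech_found")
        return Event("detector", "timeout")

class Recognizer:
    """Matches audio against the active grammars and returns a result."""
    def process(self, audio, grammars):
        word = audio.split()[0] if audio else ""
        keyword = grammars.get(word, "nomatch")
        return Event("recognizer", "done", result=keyword)

class Platform:
    """Controls detector and recognizer as separate resources;
    the two never talk to each other directly."""
    def __init__(self):
        self.detector = SpeechDetector()
        self.recognizer = Recognizer()

    def handle_turn(self, audio, grammars, timeout_ms=5000):
        detect_event = self.detector.process(audio, timeout_ms)
        if detect_event.status != "speech_found":
            return "no speech within timeout"
        recog_event = self.recognizer.process(audio, grammars)
        return recog_event.result

# Toy usage.
platform = Platform()
print(platform.handle_turn("hello there", {"hello": "GREETING"}))  # GREETING
```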

Call flow of application


Participants: User, IVR Application, Nuance SW.

1. The user calls.
2. The user is asked whether he wants to listen to a specific song or to all the songs from a particular movie.
3. The user gives DTMF input.
4. On the basis of the DTMF input, the user is asked to speak a song name or a movie name.
5. The user gives speech input; the voice data is sent to the Nuance SW.
6. The Nuance SW returns a success/failure response to the IVR application.
7. If success is received from the Nuance SW, the song codes corresponding to the song/movie name are fetched from the database and the song is played to the user. In case of failure, the user is asked for speech input again.
8. The user listens to the played song, searches for the next song, downloads, dedicates, searches for another movie or song, or drops the call.
9. The application responds as per the user input.
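
A compact way to see this flow is as a loop in the IVR application. The function and variable names below are hypothetical, and the Nuance and database calls are stubbed out so the sketch runs; it only mirrors the control flow described above.

```python
from types import SimpleNamespace

# Stub collaborators; names, formats, and behavior are assumptions for the sketch.
def fake_nuance_recognize(audio):
    known = {"example song": "SONG_0001"}
    keyword = known.get(audio)
    return SimpleNamespace(success=keyword is not None, keyword=keyword)

def fake_fetch_song_codes(keyword):
    return [keyword]

def run_call(get_dtmf, record_speech, recognize, fetch_song_codes, play, max_attempts=3):
    """Sketch of the call flow: DTMF menu first, then speech with retry on failure."""
    choice = get_dtmf("Press 1 for a specific song, 2 for all songs of a movie")
    prompt = "Say the song name" if choice == "1" else "Say the movie name"
    for _ in range(max_attempts):
        audio = record_speech(prompt)
        result = recognize(audio)                 # success/failure + keyword
        if result.success:
            play(fetch_song_codes(result.keyword))
            return "played"
        prompt = "Sorry, I did not get that. " + prompt   # failure: ask again
    return "gave up"

# Toy run: DTMF "1", then the caller says a song name the grammar knows.
status = run_call(
    get_dtmf=lambda prompt: "1",
    record_speech=lambda prompt: "example song",
    recognize=fake_nuance_recognize,
    fetch_song_codes=fake_fetch_song_codes,
    play=lambda codes: print("playing", codes),
)
print(status)  # "played"
```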

Example
- A grammar is a text file that defines the sentences that can be spoken by callers; it specifies what the recognition engine can understand.
- If the word spoken by the user is detected by the Nuance software, the word is searched for in the grammar file.
- If the word is found, the corresponding keyword is returned to the IVR application (a toy lookup is sketched below).
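
The lookup itself is simple to picture. Below is a sketch in Python of mapping a recognized word to its keyword; the grammar file format shown (a spoken word followed by its keyword on each line) and the sample entries are assumptions for illustration, not the actual Nuance grammar syntax.

```python
import io

# Assumed toy grammar contents: <spoken word> <keyword>
TOY_GRAMMAR_FILE = """\
titanic MOVIE_TITANIC
sholay  MOVIE_SHOLAY
"""

def load_grammar(fileobj):
    """Read a toy grammar with lines of the form: <spoken word> <keyword>."""
    grammar = {}
    for line in fileobj:
        parts = line.split()
        if len(parts) >= 2:
            grammar[parts[0].lower()] = parts[1]
    return grammar

def lookup(recognized_word, grammar):
    """Return the keyword for a recognized word, or None if it is out of grammar."""
    return grammar.get(recognized_word.lower())

grammar = load_grammar(io.StringIO(TOY_GRAMMAR_FILE))
print(lookup("Sholay", grammar))   # -> "MOVIE_SHOLAY"
```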

Remaining problems
- Robustness: graceful degradation, not catastrophic failure
- Portability: independence of computing platform
- Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, even a new language)
- Language modelling: is there a role for linguistics in improving the language models?
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Out-of-vocabulary (OOV) words: systems must have some method of detecting OOV words and dealing with them in a sensible way
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem
- Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)

Conclusion
- ASR combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words
- It relies upon many approximate techniques to translate a signal
- ASR has moved beyond discrete speech and into continuous speech
- ASR is a promising technology, but it may be many years before it is useful in natural settings
