
SPEECH RECOGNITION

FOR MOBILE SYSTEMS


BY:
PRATIBHA CHANNAMSETTY
SHRUTHI SAMBASIVAN

Introduction
What is speech recognition?
Automatic speech recognition (ASR) is the process by
which a computer maps an acoustic speech signal to text.

CLASSIFICATION OF SPEECH RECOGNITION SYSTEM
Users
- Speaker dependent system
- Speaker independent system
- Speaker adaptive system
Vocabulary
- small vocabulary: tens of words
- medium vocabulary: hundreds of words
- large vocabulary: thousands of words
- very-large vocabulary: tens of thousands of words

CLASSIFICATION OF SPEECH RECOGNITION SYSTEM
Word Pattern
- isolated-word system: single words at a time
- continuous speech system: words are connected together

HOW SPEECH RECOGNITION WORKS

APPLICATIONS
Healthcare
Military
- Helicopters
- Training air traffic controllers
Telephony and other domains

WHY SPEECH RECOGNITION?


Speech is the easiest and most common way for people
to communicate.
Speech is also faster than typing on a keypad and more
expressive than clicking on a menu item.
It is accessible to users with low literacy.
Cellphones have proliferated widely in the market.

CHALLENGES ON MOBILE DEVICES


Limited available storage space
Cheap and variable microphones
No hardware support for floating point arithmetic
Low processor clock-frequency
Small cache of 8-32 KB
Highly variable and challenging acoustic environments ranging
from heavy background traffic noises to a small room with
reverberation of multiple speakers speaking simultaneously
High energy consumption during algorithm execution

ASR MODELS
Embedded speech recognition
Speech recognition in the cloud
Distributed speech recognition
Shared speech recognition with user-based
adaptation (proposed model of use)

EMBEDDED MOBILE SPEECH RECOGNITION

EMBEDDED MOBILE SPEECH RECOGNITION
Advantages
Does not rely on communication with a central server
Cost effective
Not affected by network latency

EMBEDDED MOBILE SPEECH RECOGNITION


Disadvantages
Cannot perform complex computations
Limited in terms of speed and memory
To achieve reliable performance, modifications
need to be made to every sub-system of the ASR to take
both factors into account.

SPEECH RECOGNITION IN THE CLOUD

SPEECH RECOGNITION IN THE CLOUD


Advantages
Improves speed and accuracy
Provides an easy way to upgrade or modify the central
speech recognition system.
Can be used for speech recognition with low-end mobile
devices such as cheap cellphones.

SPEECH RECOGNITION IN THE CLOUD


Disadvantages
Performance degradation
Acoustic models on the central server need to account for
large variations in the different channels.
Each data transfer over the telephone network can cost
money for the end user.

DISTRIBUTED SPEECH RECOGNITION

DISTRIBUTED SPEECH RECOGNITION


Advantages
Does not require transmitting high-quality speech, since
features are extracted on the device
Improved word error rates

DISTRIBUTED SPEECH RECOGNITION


Disadvantages
The major disadvantage of this mode remains cost and
the need for a continuous and reliable cellular connection.
There is a need for standardized feature extraction
processes that account for variabilities arising from
differences in channel, multi-linguality, variable accents,
gender differences, etc.

SHARED SPEECH RECOGNITION WITH USER-BASED ADAPTATION

SHARED SPEECH RECOGNITION WITH USER-BASED ADAPTATION
Advantages
The ability to function even without network connectivity.
Works well for the limited set of conditions it encounters.
It can be covered successfully by existing mobile devices, if
trained or adapted accordingly.
Server capacity has to be provided only for average, not
peak use.

Speech Recognition Process in Detail

Front-end Process
Involves spectral analysis that derives feature vectors to
capture salient spectral characteristics of speech input.
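The front-end stage can be sketched in a few lines of Python. This is a simplified illustration (pre-emphasis, Hamming-windowed frames, log energy per frame), not the full MFCC pipeline real recognizers use; the sample rate and frame sizes are typical assumed values.

```python
import math

def frame_features(samples, rate=16000, frame_ms=25, step_ms=10):
    """Toy front-end: pre-emphasis, Hamming-windowed frames, log energy.

    Real front-ends go further (FFT, mel filterbank, DCT -> MFCCs);
    this sketch only shows the framing stage described above.
    """
    # Pre-emphasis boosts high frequencies: s'[n] = s[n] - 0.97*s[n-1]
    emph = [samples[0]] + [samples[n] - 0.97 * samples[n - 1]
                           for n in range(1, len(samples))]
    size = rate * frame_ms // 1000   # samples per frame
    step = rate * step_ms // 1000    # hop between successive frames
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))
              for n in range(size)]  # Hamming window
    feats = []
    for start in range(0, len(emph) - size + 1, step):
        frame = [emph[start + n] * window[n] for n in range(size)]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))  # one feature per frame
    return feats
```

With 25 ms frames every 10 ms at 16 kHz, one second of audio yields 98 overlapping frames, each reduced to a feature value.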

Back-end Process
Combines word-level matching and sentence-level search
to perform an inverse operation to decode the message
from the speech waveform.

Acoustic model
Provides a method of calculating the likelihood of any
feature vector sequence Y given a word W.
Each phone is represented by an HMM (Hidden Markov Model).

Language Model
The purpose of the language model is to take advantage of
linguistic constraints to compute the probability of different
word sequences.
Assuming a sequence of words W = {w1, w2, ..., wk}, the
probability P(W) can be expanded as
P(W) = P(w1, w2, ..., wk) = P(w1) P(w2|w1) ... P(wk|w1, ..., wk-1)
We generally make the simplifying assumption that any
word depends only on the previous N-1 words in the
sequence.
This is known as an N-gram model.
Grammars: use context-free grammars represented by
Finite State Automata (FSA).
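A minimal bigram (N = 2) model can be sketched as below. It is unsmoothed, whereas real language models add smoothing for unseen word pairs; the `<s>`/`</s>` sentence-boundary tokens are a conventional assumption.

```python
from collections import defaultdict

def train_bigram(sentences):
    """Estimate P(w | prev) from raw pair counts (no smoothing)."""
    pair, ctx = defaultdict(int), defaultdict(int)
    for words in sentences:
        # Pad with sentence-boundary tokens <s> and </s>
        for prev, w in zip(["<s>"] + words, words + ["</s>"]):
            pair[(prev, w)] += 1
            ctx[prev] += 1
    return lambda prev, w: pair[(prev, w)] / ctx[prev] if ctx[prev] else 0.0

def sentence_prob(model, words):
    # P(W) = product over i of P(w_i | w_{i-1})
    p = 1.0
    for prev, w in zip(["<s>"] + words, words + ["</s>"]):
        p *= model(prev, w)
    return p
```

Trained on the two sentences "call home" and "call work", the model gives P(home | call) = 0.5, so "call home" scores 0.5 overall.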

Statistical Speech recognition model

Overview of Statistical Speech recognition

Statistical Speech recognition model


A word sequence is postulated and the language model
computes its probability.
Each word is converted into sounds or phones using a
pronunciation dictionary.
Each phoneme has a corresponding statistical Hidden
Markov Model (HMM).
The HMMs of the phonemes are concatenated to form a word
model, and the likelihood of the data given the word sequence
is computed.
This process is repeated for many word sequences, and the
best is chosen as the output.
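The loop just described implements the fundamental decoding rule W* = argmax over W of P(Y|W) P(W). A sketch, working in log space as practical decoders do; the candidate list and the two scoring functions are stand-ins for the real search, acoustic model, and language model.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Choose the word sequence maximizing log P(Y|W) + log P(W).

    `candidates` stands in for the search over word sequences; the two
    scoring callables stand in for the HMM acoustic model and the
    N-gram language model described above.
    """
    best, best_lp = None, -math.inf
    for words in candidates:
        lp = acoustic_logprob(words) + lm_logprob(words)
        if lp > best_lp:
            best, best_lp = words, lp
    return best
```

The language model can rescue an acoustically ambiguous hypothesis: "wreck a nice beach" may score well acoustically, but "recognize speech" wins once its higher LM probability is added in.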

Speech recognition on embedded platforms
Embedded ASR can be deployed either locally or in a
distributed environment with both advantages and
disadvantages.
For LVCSR, embedded devices are limited in terms of CPU
power and amount of memory.
Most importantly, speed is a limiting factor.

Decoding algorithm
Asynchronous stack-based decoder: memory efficient but
complex.
Viterbi-based decoder: most efficient.
Three types of search implementation:
- Combination of static graph and static search space
- Static graph space with dynamic search space
- Dynamic graph
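The Viterbi decoder mentioned above can be sketched as dynamic programming over HMM states. The two-state discrete-emission model used below is an illustrative assumption; production decoders work in log space with beam pruning over far larger search graphs.

```python
def viterbi(obs, states, start, trans, emit):
    """Viterbi decoding: the single best state path through an HMM."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []  # one backpointer table per time step
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            # Best predecessor of s, and the probability through it
            p, arg = max((prev[r] * trans[r][s], r) for r in states)
            delta[s] = p * emit[s][o]
            ptr[s] = arg
        back.append(ptr)
    # Trace the backpointers from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Unlike the forward algorithm, which sums over all paths, Viterbi keeps only the best path at each step, which is what makes it efficient enough for embedded decoding.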

Mobile speech frameworks

Nuance - Dragon mobile SDK


OpenEars
Sphinx
CeedVocal SDK
Vlingo

Dragon Mobile SDK


The Dragon Mobile SDK provides speech recognition and
text-to-speech functionality.
The Speech Kit framework provides the classes necessary to
perform network-based speech recognition and text-to-speech
synthesis.
It uses the SystemConfiguration and AudioToolbox frameworks.

Speech Kit architecture

OpenEars
OpenEars is an iOS framework for iPhone voice recognition
and speech synthesis (TTS).
It uses the open source CMU Pocketsphinx, CMU Flite, and
CMUCLMTK libraries.
OpenEars works by doing the recognition inside the iPhone
without using the network.

Sphinx
CMU Sphinx is an open-source toolkit for speech recognition
developed by Carnegie Mellon University.
CMU Sphinx is a speaker-independent large-vocabulary
continuous speech recognizer.
Pocketsphinx: a lightweight recognizer library written in C.
Sphinx4: an adjustable, modifiable recognizer written in
Java.

CeedVocal SDK
CeedVocal SDK is an isolated-word speech recognition SDK for
iOS.
It operates locally on the device and supports 6 languages:
English, French, German, Dutch, Spanish and Italian.

Mobile applications using speech recognition

Google Now
Siri
S-Voice
Dragon Search
Dragon Dictation
Trippo-Mondo
Verbally

References
1. Anuj Kumar, Anuj Tewari, Seth Horrigan, Matthew Kam, Florian Metze and
John Canny, "Rethinking Speech Recognition on Mobile Devices".
2. Miroslav Novak, "Towards large vocabulary ASR on embedded platforms".
3. L. R. Rabiner and B.-H. Juang, "Speech Recognition: Statistical Methods".
4. http://www.nuancemobiledeveloper.com, accessed 9th April 2013.
5. http://cmusphinx.sourceforge.net, accessed 9th April 2013.
6. http://www.politepix.com/openears
7. http://www.creaceed.com/ceedvocal
