
Rethinking Speech Recognition on Mobile Devices

Anuj Kumar¹, Anuj Tewari², Seth Horrigan², Matthew Kam¹, Florian Metze¹, John Canny²
¹ Carnegie Mellon University, USA
² University of California, Berkeley, USA

Mobile Devices Have Widely Penetrated the Market

Mobiles have proliferated widely, both in developing nations and in the low socio-economic communities of the developed world, achieving even greater penetration than the personal computer.

Mobile Devices vs. PCs: Text Input vs. Speech

Besides being more intuitive, speech offers several advantages.

[Chart: typing speeds by input method: Speech, Handwriting, QWERTY, Predictive Text, Multi-tap.]

- With speech, interaction becomes independent of device size.
- If accurately recognized, speech is three times faster than QWERTY (Basapur et al. 07).
- Speech is the only plausible input modality for the 800 million non-literate users.

Given the greater availability of mobile phones and the effectiveness of speech as a communication medium, this project proposes various Automatic Speech Recognition (ASR) models for mobiles.

Automatic Speech Recognition (ASR)

[Diagram: training data (speech utterances) feeds machine learning to produce the acoustic model. The speech signal passes through feature extraction to yield speech features, which the decoder combines with the acoustic model, language model, and phonetic dictionary to produce the recognition result (word sequence).]

ASR: Framework

ASR converts spoken words into text. ASR has the following components:
- Feature Extractor: extracts feature vectors from the speech signal
- Acoustic Model (AM): models the acoustic properties of the speech signal
- Phonetic Dictionary: contains a mapping from words to phones
- Language Model (LM): defines which words can follow previously recognized words, thus reducing the word search
- Decoder: integrates the AM and LM with the phonetic dictionary
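A toy Python sketch of the decoder's core idea, combining acoustic and language-model scores in log space with a language weight (the words, scores, and bigram entries below are made up for illustration, not from any real toolkit or corpus):

```python
# Hypothetical log scores for one utterance segment.
acoustic_logprob = {"read": -42.0, "red": -41.5, "reed": -44.0}
bigram_logprob = {("to", "read"): -1.0, ("to", "red"): -5.0, ("to", "reed"): -6.0}

def best_word(prev_word, candidates, lm_weight=9.5):
    """Pick the candidate maximizing acoustic score plus weighted LM score."""
    def score(w):
        return acoustic_logprob[w] + lm_weight * bigram_logprob[(prev_word, w)]
    return max(candidates, key=score)

# "read" wins despite a slightly worse acoustic score, because the
# language model strongly prefers it after "to".
print(best_word("to", ["read", "red", "reed"]))  # read
```

A real decoder searches over whole word sequences (e.g. with Viterbi beam search), but the per-word score combination is the same.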

ASR on Mobile: Challenges

- Limited available storage and a small cache (8-32 KB)
- Cheap and variable microphones
- No hardware support for floating-point calculations
- Low processor clock frequency
- Energy constraints
- Challenging acoustic environments, e.g. heavy traffic noise in the background and reverberation from multiple speakers speaking simultaneously
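The lack of floating-point hardware is why embedded recognizers typically run their signal processing in fixed-point arithmetic. A minimal sketch of Q15 fixed-point multiplication (the format and helper names are assumptions for illustration):

```python
# Q15 fixed point: integers with 15 fractional bits, so 1.0 == 1 << 15.
Q = 15
ONE = 1 << Q

def to_q15(x):
    """Convert a float to its Q15 integer representation."""
    return int(round(x * ONE))

def q15_mul(a, b):
    # The product of two Q15 values carries 2*Q fractional bits,
    # so shift back down by Q to stay in Q15.
    return (a * b) >> Q

a, b = to_q15(0.5), to_q15(0.25)
print(q15_mul(a, b) / ONE)  # 0.125; the multiply itself is integer-only
```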

Embedded Mobile Speech Recognition

[Diagram: user speech enters feature extraction on the mobile device; the ASR search runs locally against the acoustic models and language model, producing the ASR output entirely on the device.]

Pros and Cons of Embedded Model

Pros:
- No network required
- No performance drop due to data loss during transmission
- No cost involved in data transmission
- No latency

Cons:
- Mobile hardware is not as capable as a central server (in terms of speed and memory)

Speech Recognition in the Cloud

[Diagram: on the mobile device, user speech passes through a speech coder and is transmitted to the server; the server extracts features from the codec parameters and runs the ASR search against its acoustic models and language model to produce the ASR output.]

Pros and Cons of Cloud Model

Pros:
- Better ASR speed and accuracy, owing to the server's superior configuration
- A single update of the central system is enough to update all systems on the network
- Even cheap, low-end phones work fine

Cons:
- Performance degradation due to loss during data transmission
- Acoustic models on the central server need to account for large variations across the different channels
- Each data transfer over the telephone network can cost the end user money

Distributed Speech Recognition

[Diagram: on the mobile device, user speech undergoes feature extraction and compression before transmission; the server reconstructs the features and runs the ASR search against its acoustic models and language model to produce the ASR output.]

Pros and Cons of Distributed Model

Pros:
- All the advantages of the cloud model, with less data loss, since the features are coded, transmitted, and decoded at low bit rates

Cons:
- Cost
- Requires a continuous and reliable cellular connection
- Requires a standardized feature-extraction process, on account of variations across channels (microphone, audio data card), variable accents, etc.
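Standardizing the front end also fixes how features are compressed for transmission. Real DSR front ends (e.g. the ETSI ES 201 108 standard) use split vector quantization; a uniform scalar quantizer is enough to sketch the idea (the value range and bit width below are assumptions):

```python
def quantize(x, lo=-4.0, hi=4.0, bits=6):
    """Clamp x to [lo, hi] and map it to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * levels)

def dequantize(code, lo=-4.0, hi=4.0, bits=6):
    """Map an integer code back to an approximate feature value."""
    levels = (1 << bits) - 1
    return lo + code * (hi - lo) / levels

features = [1.23, -0.5, 3.9]             # one toy feature vector
codes = [quantize(f) for f in features]  # 6 bits each instead of a float
recon = [dequantize(c) for c in codes]   # server-side reconstruction
```

Each reconstructed value differs from the original by at most one quantizer step, which is the "low bit rate with little loss" trade-off the distributed model relies on.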

Shared Speech Recognition with User-Based Adaptation (Proposed)

[Diagram: on the mobile device, user speech undergoes feature extraction and compression, and a local ASR search using user-based acoustic models and a language model produces the ASR output. The features are transmitted to the server only when a network is detected; there, feature reconstruction and metadata extraction feed a server-side ASR search (acoustic models, language model) and a user-context-dependent analysis, which sends updated models back to the device.]

Pros and Cons of Adaptation Model

Pros:
- It takes the best of both worlds, i.e. of server-based systems (high accuracy, updates and maintainability, moderate requirements on mobile hardware) and local ASR (functioning without a network).
- One mobile is used by a small number of users, so the device itself can cater to them.
- Since the centralized server only performs adaptation, it may be designed for average rather than peak traffic, as adaptation need not be real-time.

What to Adapt For?

- Device-specific pre-processing, to achieve robustness or signal normalization
- User-specific vocabularies and language models, to model dialectal and idiolectal variations as well as accented speech
- Speaker-specific acoustic models, or feature- and/or model-based adaptation techniques
- Tuning of parameters
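A common example of such device-specific signal normalization is cepstral mean normalization (CMN), which removes a per-utterance channel bias from each feature dimension. A minimal sketch (the data layout is an assumption):

```python
def cmn(frames):
    """Subtract the per-dimension mean over the utterance from every frame.

    frames: list of equal-length feature vectors (lists of floats).
    """
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in frames]

# A constant channel offset (here +2.0 on the first dimension)
# disappears after normalization.
frames = [[3.0, 0.5], [1.0, -0.5]]
print(cmn(frames))  # [[1.0, 0.5], [-1.0, -0.5]]
```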

Results in Laboratory: Operating System

- First, to determine the performance of the systems in a resource-rich Linux environment, we used a PC running Ubuntu Linux 8.04 in VMware Server 1.0.6 on top of Microsoft Windows Server 2003, with an Intel Pentium D clocked at 2.8 GHz and 4 GB of RAM.
- Second, to determine the performance on a resource-constrained mobile device, we used a Nokia N800 Internet Tablet running Maemo Linux OS2008, with a TI OMAP 2420 ARM processor clocked at 330 MHz and 128 MB of RAM.

Results in Laboratory: Dataset

DARPA Resource Management (RM-1) corpus:
- 1600 training utterances
- 1000 tied Gaussian mixture models (senones)
- 30080 context-dependent triphones
- Bigram statistical language model with a vocabulary of 993 words and a language weight of 9.5

ICSI Meeting Recorder (ICSI) corpus:
- 90314 training utterances
- 1000 senones
- 104082 context-dependent triphones
- Trigram statistical language model with a vocabulary of 11908 words and a language weight of 9.5

The test data contains 365 random utterances from RM-1 and 400 from ICSI.
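The bigram and trigram language models above are estimated from n-gram counts over the training utterances. A toy sketch of the bigram maximum-likelihood estimate count(u, w) / count(u), on made-up sentences (the real RM-1 LM uses far more data plus smoothing):

```python
from collections import Counter

sentences = [["show", "the", "ships"], ["show", "the", "map"]]
# Count single words and adjacent word pairs across the corpus.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(prev, w):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("show", "the"))   # 1.0: "show" is always followed by "the"
print(p_bigram("the", "ships"))  # 0.5: "the" splits between "ships" and "map"
```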

Results in Laboratory: Toolkits

- PocketSphinx
- SphinxTiny
- Sphinx-3.7

Results in Laboratory: RM Dataset

(WER: word error rate; SER: sentence error rate; speed in times real time, xRT)

Desktop:
  System        Speed (xRT)   WER    SER
  PocketSphinx  0.05          6.0%   28.2%
  Sphinx-3.7    0.19          7.3%   34.2%
  SphinxTiny    0.37          7.3%   35.1%

Mobile:
  System        Speed (xRT)   WER    SER
  PocketSphinx  0.53          6.0%   28.2%
  Sphinx-3.7    24.34         7.3%   34.2%
  SphinxTiny    2.58          7.3%   35.1%

Results in Laboratory: ICSI Dataset

Desktop:
  System        Speed (xRT)   WER     SER
  PocketSphinx  1.20          55.1%   72.8%
  Sphinx-3.7    0.75          39.6%   69.5%
  SphinxTiny    1.23          39.4%   70.3%

Mobile:
  System        Speed (xRT)   WER     SER
  PocketSphinx  10.65         55.1%   72.8%
  Sphinx-3.7    68.49         39.6%   69.5%
  SphinxTiny    8.91          39.4%   70.3%

Results in Laboratory

- On a small-vocabulary task, PocketSphinx outperforms SphinxTiny in both accuracy and speed.
- As the complexity of the acoustic and language models increases, SphinxTiny's accuracy becomes better than PocketSphinx's.
- PocketSphinx is superior when using small acoustic and language models for real-time recognition, but for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice.


Conclusion

The results highlight the feasibility of speech recognition on mobile devices and its immense potential in education and other fields, both in developing nations and in the low socio-economic communities of developed countries.

Thank You!
Questions?

Presented by: Abhishek Garg (208/CO/11)
Netaji Subhas Institute of Technology
