
Rethinking Speech Recognition on Mobile Devices

Anuj Kumar¹, Anuj Tewari², Seth Horrigan², Matthew Kam¹, Florian Metze¹, John Canny²
¹ Carnegie Mellon University, USA
² University of California, Berkeley, USA

Mobile Devices Have Widely Penetrated the Market

Mobiles have proliferated widely, both in developing nations and in the low socio-economic communities of the developed world, achieving even greater penetration than the personal computer.

Mobile Devices vs. PCs: Text Input vs. Speech

Besides being more intuitive, speech offers several advantages.

[Chart: typing speeds by input method: Speech, Handwriting, QWERTY, Predictive Text, Multi-tap.]

- With speech, interaction becomes independent of device size.
- If accurately recognized, speech is three times faster than QWERTY (Basapur et al. 07).
- Speech is the only plausible input modality for the 800 million non-literate users.

Given the greater availability of mobile phones and the effectiveness of speech as a communication medium, this project proposes various Automatic Speech Recognition (ASR) models for mobiles.

Automatic Speech Recognition (ASR)

[Diagram: training data (speech utterances) feeds machine learning to produce the acoustic model. The speech signal passes through feature extraction to yield speech features, which the decoder combines with the acoustic model, language model, and phonetic dictionary to produce the recognition result (word sequence).]

ASR: Framework

ASR converts spoken words into text. ASR has the following components:
- Feature Extractor: extracts feature vectors from the speech signal
- Acoustic Model (AM): models the acoustic properties of the speech signal
- Phonetic Dictionary: contains a mapping from words to phones
- Language Model (LM): defines which words can follow previously recognized words, thus reducing the word search
- Decoder: integrates the AM and LM with the phonetic dictionary
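A toy Python sketch of the decoder's core idea, combining acoustic and language-model scores in log space with a language weight (the words, scores, and bigram entries below are made up for illustration, not from any real toolkit or corpus):

```python
# Hypothetical log scores for one utterance segment.
acoustic_logprob = {"read": -42.0, "red": -41.5, "reed": -44.0}
bigram_logprob = {("to", "read"): -1.0, ("to", "red"): -5.0, ("to", "reed"): -6.0}

def best_word(prev_word, candidates, lm_weight=9.5):
    """Pick the candidate maximizing acoustic score plus weighted LM score."""
    def score(w):
        return acoustic_logprob[w] + lm_weight * bigram_logprob[(prev_word, w)]
    return max(candidates, key=score)

# "read" wins despite a slightly worse acoustic score, because the
# language model strongly prefers it after "to".
print(best_word("to", ["read", "red", "reed"]))  # read
```

A real decoder searches over whole word sequences (e.g. with Viterbi beam search), but the per-word score combination is the same.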

ASR on Mobile: Challenges

- Limited available storage and a small cache (8-32 KB)
- Cheap and variable microphones
- No hardware support for floating-point calculations
- Low processor clock frequency
- Energy constraints
- Challenging acoustic environments, e.g. heavy traffic noise in the background and reverberation from multiple speakers speaking simultaneously
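The lack of floating-point hardware is why embedded recognizers typically run their signal processing in fixed-point arithmetic. A minimal sketch of Q15 fixed-point multiplication (the format and helper names are assumptions for illustration):

```python
# Q15 fixed point: integers with 15 fractional bits, so 1.0 == 1 << 15.
Q = 15
ONE = 1 << Q

def to_q15(x):
    """Convert a float to its Q15 integer representation."""
    return int(round(x * ONE))

def q15_mul(a, b):
    # The product of two Q15 values carries 2*Q fractional bits,
    # so shift back down by Q to stay in Q15.
    return (a * b) >> Q

a, b = to_q15(0.5), to_q15(0.25)
print(q15_mul(a, b) / ONE)  # 0.125; the multiply itself is integer-only
```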

Embedded Mobile Speech Recognition

[Diagram: user speech enters feature extraction on the mobile device; the ASR search runs locally against the acoustic models and language model, producing the ASR output entirely on the device.]

Pros and Cons of Embedded Model

Pros:
- No network required
- No performance drop due to data loss during transmission
- No cost involved in data transmission
- No latency

Cons:
- Mobile hardware is not as capable as a central server (in terms of speed and memory)

Speech Recognition in the Cloud

[Diagram: on the mobile device, user speech passes through a speech coder and is transmitted to the server; the server extracts features from the codec parameters and runs the ASR search against its acoustic models and language model to produce the ASR output.]

Pros and Cons of Cloud Model

Pros:
- Better ASR speed and accuracy, owing to the server's superior configuration
- A single update of the central system is enough to update all systems on the network
- Even cheap, low-end phones work fine

Cons:
- Performance degradation due to loss during data transmission
- Acoustic models on the central server need to account for large variations across the different channels
- Each data transfer over the telephone network can cost the end user money

Distributed Speech Recognition

[Diagram: on the mobile device, user speech undergoes feature extraction and compression before transmission; the server reconstructs the features and runs the ASR search against its acoustic models and language model to produce the ASR output.]

Pros and Cons of Distributed Model

Pros:
- All the advantages of the cloud model, with less data loss, since the features are coded, transmitted, and decoded at low bit rates

Cons:
- Cost
- Requires a continuous and reliable cellular connection
- Requires a standardized feature-extraction process, on account of variations across channels (microphone, audio data card), variable accents, etc.
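Standardizing the front end also fixes how features are compressed for transmission. Real DSR front ends (e.g. the ETSI ES 201 108 standard) use split vector quantization; a uniform scalar quantizer is enough to sketch the idea (the value range and bit width below are assumptions):

```python
def quantize(x, lo=-4.0, hi=4.0, bits=6):
    """Clamp x to [lo, hi] and map it to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * levels)

def dequantize(code, lo=-4.0, hi=4.0, bits=6):
    """Map an integer code back to an approximate feature value."""
    levels = (1 << bits) - 1
    return lo + code * (hi - lo) / levels

features = [1.23, -0.5, 3.9]             # one toy feature vector
codes = [quantize(f) for f in features]  # 6 bits each instead of a float
recon = [dequantize(c) for c in codes]   # server-side reconstruction
```

Each reconstructed value differs from the original by at most one quantizer step, which is the "low bit rate with little loss" trade-off the distributed model relies on.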

Shared Speech Recognition with User-Based Adaptation (Proposed)

[Diagram: on the mobile device, user speech undergoes feature extraction and compression, and a local ASR search using user-based acoustic models and a language model produces the ASR output. The features are transmitted to the server only when a network is detected; there, feature reconstruction and metadata extraction feed a server-side ASR search (acoustic models, language model) and a user-context-dependent analysis, which sends updated models back to the device.]

Pros and Cons of Adaptation Model

Pros:
- It takes the best of both worlds, i.e. of server-based systems (high accuracy, updates and maintainability, moderate requirements on mobile hardware) and local ASR (functioning without a network).
- One mobile is used by a small number of users, so the device itself can cater to them.
- Since the centralized server only performs adaptation, it may be designed for average rather than peak traffic, as adaptation need not be real-time.

What to Adapt For?

- Device-specific pre-processing, to achieve robustness or signal normalization
- User-specific vocabularies and language models, to model dialectal and idiolectal variations as well as accented speech
- Speaker-specific acoustic models, or feature- and/or model-based adaptation techniques
- Tuning of parameters
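A common example of such device-specific signal normalization is cepstral mean normalization (CMN), which removes a per-utterance channel bias from each feature dimension. A minimal sketch (the data layout is an assumption):

```python
def cmn(frames):
    """Subtract the per-dimension mean over the utterance from every frame.

    frames: list of equal-length feature vectors (lists of floats).
    """
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in frames]

# A constant channel offset (here +2.0 on the first dimension)
# disappears after normalization.
frames = [[3.0, 0.5], [1.0, -0.5]]
print(cmn(frames))  # [[1.0, 0.5], [-1.0, -0.5]]
```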

Results in Laboratory: Operating System

- First, to determine the performance of the systems in a resource-rich Linux environment, we used a PC running Ubuntu Linux 8.04 in VMware Server 1.0.6 on top of Microsoft Windows Server 2003, with an Intel Pentium D clocked at 2.8 GHz and 4 GB of RAM.
- Second, to determine the performance on a resource-constrained mobile device, we used a Nokia N800 Internet Tablet running Maemo Linux OS2008, with a TI OMAP 2420 ARM processor clocked at 330 MHz and 128 MB of RAM.

Results in Laboratory: Dataset

DARPA Resource Management (RM-1) corpus:
- 1600 training utterances
- 1000 tied Gaussian mixture models (senones)
- 30080 context-dependent triphones
- Bigram statistical language model with a vocabulary of 993 words and a language weight of 9.5

ICSI Meeting Recorder (ICSI) corpus:
- 90314 training utterances
- 1000 senones
- 104082 context-dependent triphones
- Trigram statistical language model with a vocabulary of 11908 words and a language weight of 9.5

The test data contains 365 random utterances from RM-1 and 400 from ICSI.
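The bigram and trigram language models above are estimated from n-gram counts over the training utterances. A toy sketch of the bigram maximum-likelihood estimate count(u, w) / count(u), on made-up sentences (the real RM-1 LM uses far more data plus smoothing):

```python
from collections import Counter

sentences = [["show", "the", "ships"], ["show", "the", "map"]]
# Count single words and adjacent word pairs across the corpus.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(prev, w):
    """Maximum-likelihood estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("show", "the"))   # 1.0: "show" is always followed by "the"
print(p_bigram("the", "ships"))  # 0.5: "the" splits between "ships" and "map"
```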

Results in Laboratory: Toolkits

- PocketSphinx
- SphinxTiny
- Sphinx-3.7

Results in Laboratory: RM Dataset

(WER: word error rate; SER: sentence error rate; speed in times real time, xRT)

Desktop:
  System        Speed (xRT)   WER    SER
  PocketSphinx  0.05          6.0%   28.2%
  Sphinx-3.7    0.19          7.3%   34.2%
  SphinxTiny    0.37          7.3%   35.1%

Mobile:
  System        Speed (xRT)   WER    SER
  PocketSphinx  0.53          6.0%   28.2%
  Sphinx-3.7    24.34         7.3%   34.2%
  SphinxTiny    2.58          7.3%   35.1%

Results in Laboratory: ICSI Dataset

Desktop:
  System        Speed (xRT)   WER     SER
  PocketSphinx  1.20          55.1%   72.8%
  Sphinx-3.7    0.75          39.6%   69.5%
  SphinxTiny    1.23          39.4%   70.3%

Mobile:
  System        Speed (xRT)   WER     SER
  PocketSphinx  10.65         55.1%   72.8%
  Sphinx-3.7    68.49         39.6%   69.5%
  SphinxTiny    8.91          39.4%   70.3%

Results in Laboratory

- On a small-vocabulary task, PocketSphinx outperforms SphinxTiny in both accuracy and speed.
- As the complexity of the acoustic and language models increases, SphinxTiny's accuracy becomes better than PocketSphinx's.
- PocketSphinx is superior when using small acoustic and language models for real-time recognition, but for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice.


Conclusion

The results highlight the feasibility of speech recognition on mobile devices and its immense potential in education and other fields, both in developing nations and in the low socio-economic communities of developed countries.

Thank You!
Questions?

Presented by: Abhishek Garg (208/CO/11)
Netaji Subhas Institute of Technology
