on Mobile Devices
Anuj Kumar1, Anuj Tewari2, Seth Horrigan2, Matthew Kam1, Florian Metze1, John Canny2
1 Carnegie Mellon University, USA
2 University of California, Berkeley, USA
[Figure: text input methods on mobile devices: handwriting, QWERTY, predictive text, multi-tap]
If accurately recognized, speech is three times faster than QWERTY (Basapur et al. 07).
Speech is the only plausible input modality for 800 million nonliterate users.
[Figure: ASR pipeline. Speech signal -> feature extraction -> speech features -> decoder -> word sequence -> recognition result. The decoder draws on the acoustic model, the language model and the phonetic dictionary.]
ASR: Framework
ASR converts spoken words into text.
ASR has the following components:
Feature Extractor: extracts the feature vector from the speech signal
Acoustic Model (AM): models the acoustic properties of the test speech signal
Phonetic Dictionary: contains a mapping from words to phones
Language Model (LM): defines which words can follow previously recognized words, thus reducing the word search space
Decoder: integrates the AM and LM with the phonetic dictionary to produce the word sequence
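To make the decoder's role concrete, here is a minimal sketch of how acoustic scores, a bigram language model and a phonetic dictionary might be combined in a Viterbi-style search. Everything in it is an illustrative assumption rather than material from the slides: the three-word dictionary, the probabilities, the one-word-per-segment simplification, and the names `acoustic_logprob` and `decode`. Real recognizers such as the Sphinx family work on frame-level features with HMM-based acoustic models.

```python
import math

# Hypothetical toy models; none of these values come from the slides.
# Phonetic dictionary: word -> phone sequence.
DICTIONARY = {
    "call": ["K", "AO", "L"],
    "tall": ["T", "AO", "L"],
    "home": ["HH", "OW", "M"],
}

# Bigram language model: P(word | previous word); "<s>" marks sentence start.
BIGRAM_LOGPROB = {
    ("<s>", "call"): math.log(0.6),
    ("<s>", "tall"): math.log(0.4),
    ("call", "home"): math.log(0.9),
    ("tall", "home"): math.log(0.1),
}


def acoustic_logprob(observed_phones, word):
    """Stand-in acoustic model: compare observed phones with the word's
    dictionary pronunciation (a match is likely, a mismatch unlikely)."""
    expected = DICTIONARY[word]
    if len(expected) != len(observed_phones):
        return float("-inf")
    return sum(math.log(0.8 if o == e else 0.1)
               for o, e in zip(observed_phones, expected))


def decode(segments):
    """Viterbi search over word sequences, assuming one word per segment."""
    # best[word] = (log score, best word sequence ending in that word)
    best = {"<s>": (0.0, [])}
    for phones in segments:
        new_best = {}
        for word in DICTIONARY:
            am = acoustic_logprob(phones, word)
            if am == float("-inf"):
                continue
            for prev, (score, seq) in best.items():
                lm = BIGRAM_LOGPROB.get((prev, word))
                if lm is None:
                    continue  # the LM rules this transition out
                total = score + am + lm
                if word not in new_best or total > new_best[word][0]:
                    new_best[word] = (total, seq + [word])
        best = new_best
    if not best:
        return []
    return max(best.values(), key=lambda item: item[0])[1]


# The first segment sounds like "tall", but the LM prefers "call home".
print(decode([["T", "AO", "L"], ["HH", "OW", "M"]]))
```

In this toy setup the language model overrides an acoustically better first word because the full sequence "call home" scores higher, which is the kind of search-space pruning the LM bullet describes.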
[Figure: recognition entirely on the mobile device. User speech -> feature extraction -> acoustic models and language model -> ASR output.]
Pros
No network required
[Figure: recognition on the server. Mobile device: user speech, data transmission. Server: feature extraction from codec parameters -> ASR search with acoustic models and language model -> ASR output.]
Cons
Performance degradation due to loss during data transmission
[Figure: recognition split between device and server. Mobile device: user speech, data transmission. Server: feature reconstruction -> ASR search with acoustic models and language model -> ASR output.]
Cons
Cost
Requires a continuous and reliable cellular connection
Requires a standard feature extraction process, because of variations due to differences in channel (mic, audio data card), variable accents, etc.
[Figure: combined device and server architecture. Mobile device: ASR search with acoustic models and language model -> ASR output; data transmission only when a network is detected. Server: feature reconstruction + metadata extraction, user-context-dependent analysis, ASR search with user-based acoustic models and language model, updated data.]
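The labels in this figure suggest a flow in which the device always recognizes locally and exchanges data with the server only when a network is detected. A rough sketch of that flow follows, under stated assumptions: the class names `MobileRecognizer` and `Server`, the `sync` method and the version counter standing in for updated models are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server that collects uploads and returns updated data."""
    received: list = field(default_factory=list)

    def adapt(self, uploads):
        self.received.extend(uploads)
        # Placeholder for feature reconstruction, metadata extraction and
        # user-context-dependent analysis; here it is only a version counter.
        return {"model_version": len(self.received)}


@dataclass
class MobileRecognizer:
    """Hypothetical on-device recognizer that works without a network."""
    model_version: int = 0
    pending: list = field(default_factory=list)

    def recognize(self, features):
        # Placeholder for the on-device ASR search; queue data for later upload.
        self.pending.append(features)
        return f"<hypothesis from model v{self.model_version}>"

    def sync(self, server, network_detected):
        """Upload queued data and fetch updates, only when a network is detected."""
        if not network_detected or not self.pending:
            return False
        update = server.adapt(self.pending)
        self.pending = []
        self.model_version = update["model_version"]
        return True


phone, server = MobileRecognizer(), Server()
phone.recognize([0.1, 0.2])          # works offline
phone.sync(server, network_detected=False)  # nothing is sent
phone.sync(server, network_detected=True)   # upload and receive updated data
```

The point of the design, as far as the figure shows it, is that recognition never blocks on connectivity: the network is used opportunistically to improve the models.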
Platform  Recognizer    Speed (XRT)  Word Error Rate (WER)  Sentence Error Rate (SER)
Desktop   PocketSphinx  0.05         6.0%                   28.2%
Desktop   Sphinx-3.7    0.19         7.3%                   34.2%
Desktop   SphinxTiny    0.37         7.3%                   35.1%
Mobile    PocketSphinx  0.53         6.0%                   28.2%
Mobile    Sphinx-3.7    24.34        7.3%                   34.2%
Mobile    SphinxTiny    2.58         7.3%                   35.1%
Platform  Recognizer    Speed (XRT)  Word Error Rate (WER)  Sentence Error Rate (SER)
Desktop   PocketSphinx  1.20         55.1%                  72.8%
Desktop   Sphinx-3.7    0.75         39.6%                  69.5%
Desktop   SphinxTiny    1.23         39.4%                  70.3%
Mobile    PocketSphinx  10.65        55.1%                  72.8%
Mobile    Sphinx-3.7    68.49        39.6%                  69.5%
Mobile    SphinxTiny    8.91         39.4%                  70.3%
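The slides do not define the three metrics, so the following sketch uses their conventional definitions as an assumption: WER is the word-level edit distance divided by the number of reference words, SER is the fraction of sentences with at least one error, and the speed figure (XRT, times real time) is processing time divided by audio duration, so values below 1 mean faster than real time. The function names and example sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def sentence_error_rate(references, hypotheses):
    """Fraction of sentences that contain at least one error."""
    wrong = sum(r.split() != h.split() for r, h in zip(references, hypotheses))
    return wrong / len(references)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time as a multiple of the audio duration."""
    return processing_seconds / audio_seconds


refs = ["call home now", "turn left"]
hyps = ["call home", "turn left"]
print(word_error_rate(refs, hyps))      # one deleted word out of five
print(sentence_error_rate(refs, hyps))  # one wrong sentence out of two
print(real_time_factor(1.06, 2.0))      # 1.06 s to process 2 s of audio
```

Read this way, a speed of 24.34 means the recognizer needs roughly 24 seconds to process one second of speech, which rules out interactive use.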
Results in Laboratory
On a small-vocabulary task, PocketSphinx outperforms SphinxTiny on both accuracy and speed.
As the complexity of the acoustic and language models increases, SphinxTiny becomes more accurate than PocketSphinx.
PocketSphinx is superior when using small acoustic and language models for real-time recognition, but for tasks that allow larger delays in exchange for better accuracy, SphinxTiny is the better choice.
Conclusion
Thank You!
Questions?
Presented by:
ABHISHEK GARG (208/CO/11)
NETAJI SUBHAS INSTITUTE OF TECHNOLOGY