

A Seminar Report On

Automatic Speech Recognition

Submitted By KESHAVKUMAR.K 10942M1621 M.Tech II Year



Department of Computer Science




Automatic speech recognition (ASR) is a computerized speech-to-text process in which speech is usually recorded with acoustic microphones that capture air pressure changes. This kind of air-transmitted speech signal is prone to two kinds of problems, related to noise robustness and applicability. The former means that mixing the speech signal with ambient noise usually degrades ASR performance. The latter means that speech can easily be overheard on the air-transmission channel, which often results in loss of privacy or annoyance to other people. Automatic speech recognition systems are trained using human supervision to provide transcriptions of speech utterances. The goal is to minimize the human supervision needed for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. This approach aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function for a human to label.



Table Of Contents

1. Introduction to Speech Technology
2. Basics of Speech Recognition
3. Performance of Speech Recognition
4. Architecture of Speech Recognition
5. Algorithms Used
5.1 Hidden Markov models (HMM)
5.2 Dynamic Time Warping (DTW)
5.3 Viterbi
6. Challenges for Speech Recognition
7. Approaches for Speech Recognition
7.1 Template based approach
7.2 Knowledge or rule based approach
7.3 Statistical based approach
8. Machine Learning
9. Language Model
10. Applications
11. Advantages
12. Disadvantages
13. Conclusion
14. References



Automatic Speech Recognition

1. Introduction to Speech Technology
Speech-based interactions allow users to communicate with computers or computer-related devices without the use of a keyboard, mouse, buttons, or any other physical interaction device. Speech technologies are of particular importance to individuals with physical impairments that hinder their use of traditional input devices such as the keyboard and mouse. Speech technology relates to the technologies designed to duplicate and respond to the human voice. They have many uses, which include aiding the voice-disabled, the hearing-disabled, and the blind, communicating with computers without a keyboard, marketing goods or services by telephone, and enhancing game software. The subject includes several subfields, such as speech synthesis, speaker recognition, speaker verification, and speech compression. Even after years of extensive research and development, accuracy in ASR remains a challenge to researchers. There are a number of well-known factors which determine accuracy. The prominent factors include variations in context, speakers, and noise in the environment. Therefore research in automatic speech recognition has many open issues with respect to small or large vocabulary, isolated or continuous speech, speaker-dependent or speaker-independent operation, and environmental robustness. The accuracy and acceptance of speech recognition have come a long way in the last few years. The most popular uses of speech recognition technology are the following:

Playing back simple information: Customers need fast access to information. In many circumstances they do not actually need or want to speak to a live operator. For example, if they have little time or they only require basic information, then speech recognition can be used to cut waiting times and provide customers with the information they want.

Call steering: Putting callers through to the right department. Waiting in a queue to get through to an operator or, worse still, finally being put through to the wrong operator can be very frustrating to a customer, resulting in dissatisfaction. By introducing speech recognition, one can allow callers to choose a self-service route or alternatively say what they want and be directed to the correct department or individual.

Automated identification: Where one needs to authenticate one's identity on the phone without using risky personal data. Some advanced speech recognition systems provide an answer to this problem using voice biometrics. This technology is now accepted as a major tool in combating telephone-based crime. On average it takes less than two minutes to create a voiceprint based on specific text such as name and account number. This is then stored against the individual's record, so when they next call, they can simply say their name, and if the voiceprint matches what is stored, the person is put straight through to a customer service representative.

Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" is sometimes used to refer to speech recognition where the recognition system is trained to a particular speaker, as is the case for most desktop recognition software; hence there is an element of speaker recognition, which attempts to identify the person speaking, to better recognize what is being said. Speech recognition is a broad term, which means it can recognize almost anybody's speech. Voice recognition is a system trained to a particular user, where it recognizes their speech based on their unique vocal sound.



2. Basics of Speech Recognition
The basics of a speech recognition system can be defined as described below:

Utterance: It is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.

Speaker Dependence systems: They are designed around a specific speaker. They generally are more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker-independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.

Speaking Mode: The recognition systems can be either isolated-word processors or continuous-speech processors. Some systems process isolated utterances, which may include a single word or even sentences, and others process continuous speech, in which continuously uttered speech is recognized, as implemented in most real-time applications.

Speaking Style: It can be either speaker dependent or speaker independent. For speaker-dependent recognition, the speaker trains the system to recognize his/her voice by speaking each of the words in the inventory several times. In speaker-independent recognition the speech uttered by any user can be recognized, which is a more complicated process compared to the former systems.

Vocabularies: They are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word. Entries can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g. "sunnatho"), while very large vocabularies can have a hundred thousand or more.

Accuracy: The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying if the spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more. The acceptable accuracy of a system really depends on the application.

Training the acoustic models: Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.



3. Performance of Speech Recognition System

The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with the Word Error Rate (WER), whereas speed is measured with the real-time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR). Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. For simple applications, the acoustic models usually require only a short period of training and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. An accuracy of 98% to 99% can be achieved under optimal conditions. Optimal conditions usually assume that users have speech characteristics which match the training data, can achieve proper speaker adaptation, and work in a low-noise environment (e.g. a quiet room).


This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. Limited-vocabulary systems, which require no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.
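The two headline measures named in this section can be sketched in a few lines. The sketch below assumes the usual definitions: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and the real-time factor is processing time divided by audio duration. The function names are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds


wer = word_error_rate("the dog on the hill barked", "the dog on hill parked")
rtf = real_time_factor(5.0, 10.0)
```

Here the hypothesis drops one word and substitutes another, so two errors are counted against six reference words. Note that WER can exceed 1.0 when the hypothesis contains many insertions.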

4. Architecture
Speech recognition is getting a computer to understand spoken language. By "understand" we might mean:
- React appropriately
- Convert the input speech into another medium, e.g. text

It is done by:
- Digitization
- Acoustic analysis of the speech signal
- Linguistic interpretation

The speech signal which is given as input is converted from an analog signal into a digital representation, where the speech signal is segmented depending on the language. The architecture of speech recognition is shown below.



The first step in almost all speech recognition systems is the extraction of features from the acoustic data. Most systems make use of the Mel Frequency Cepstral Coefficients (MFCC) to describe the speech signal. First, the input signal is windowed. Typically, this is done with Hamming windows that are 30 ms long and overlap by 20 ms. Next, the spectrum is computed by taking the Fourier transform.

These coefficients are then mapped onto the Mel scale by using a set of triangular-shaped filters. After taking the log of the powers (phase information is omitted because it contains no useful information and, moreover, the human ear is also phase-deaf), the resulting coefficients are treated as a signal and the inverse discrete cosine transform is taken. The result is called the Mel Frequency Cepstrum and its coefficients are called Mel Frequency Cepstral Coefficients. Usually the first 12 coefficients are used to describe the part of the speech signal under the Hamming window, forming a feature vector. Next, the energy of the signal, which also contains useful information, is added to the feature vector. The window then shifts (10 ms) and a new feature vector is calculated. This procedure creates a time series of feature vectors from the continuous speech signal.
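The windowing step described above (30 ms Hamming windows shifted by 10 ms) can be sketched in plain Python. This is a minimal illustration, not a full MFCC front end: the Fourier transform, filter bank, and cosine transform are left out. The function names are illustrative, and `hz_to_mel` uses one widely used Mel-scale formula, which is an assumption because the text does not say which variant is meant.

```python
import math


def hamming(n_samples):
    """Hamming window: w[k] = 0.54 - 0.46 * cos(2*pi*k / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_samples - 1))
            for k in range(n_samples)]


def frame_signal(samples, sample_rate, window_ms=30, shift_ms=10):
    """Cut the signal into overlapping, Hamming-weighted frames."""
    win = int(sample_rate * window_ms / 1000)   # samples per window
    hop = int(sample_rate * shift_ms / 1000)    # samples per shift
    window = hamming(win)
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def hz_to_mel(freq_hz):
    """Assumed Mel-scale mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# 100 ms of a constant signal sampled at 16 kHz (a stand-in for real audio)
frames = frame_signal([1.0] * 1600, sample_rate=16000)
```

With a 16 kHz sample rate each frame holds 480 samples and successive frames start 160 samples apart, so consecutive frames share 20 ms of signal, matching the overlap given in the text.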

Because speech is transient in nature, first- and second-order time derivatives of the MFCC features are also added to every feature vector. The MFCC features are compared against the acoustic models in the database to obtain the phonemes. Another important part of typical speech recognition systems is the lexicon (also called the dictionary). The lexicon describes how to combine acoustic models (phonemes) to form words. It contains all words known to the ASR system and the series of phonemes that must be encountered to form each word. The language model combines words to form sentences. Finally we get the recognized words.
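The first- and second-order time derivatives mentioned above can be approximated in several ways. The sketch below assumes the simplest one, a central difference over the neighbouring frames with the edge frames repeated. Real front ends often use a longer regression window, so treat this as an illustration only. The function names are hypothetical.

```python
def delta(features):
    """Central-difference time derivative of a list of feature vectors."""
    n = len(features)
    out = []
    for t in range(n):
        prev_frame = features[max(t - 1, 0)]      # repeat first frame at the edge
        next_frame = features[min(t + 1, n - 1)]  # repeat last frame at the edge
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_dynamic_features(features):
    """Append delta and delta-delta coefficients to every static vector."""
    d1 = delta(features)
    d2 = delta(d1)
    return [static + first + second
            for static, first, second in zip(features, d1, d2)]


# Toy one-dimensional "features" that rise linearly over four frames
full = add_dynamic_features([[0.0], [1.0], [2.0], [3.0]])
```

With 12 MFCCs plus energy as the static part, this construction yields 39-dimensional vectors (13 static, 13 delta, 13 delta-delta).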

5. Algorithms Used
5.1 Hidden Markov model (HMM)
Modern general-purpose speech recognition systems are generally based on HMMs. These are statistical models which output a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. That is, one can assume that over a short time in the range of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model for many stochastic processes.

Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the HMM would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal-covariance Gaussians, which will give a likelihood for each observed vector. Each word or (for more general speech recognition systems) each phoneme will have a different output distribution. An HMM for a sequence of words or phonemes is made by concatenating the individual trained HMMs for the separate words and phonemes.

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states). It would use cepstral normalization to normalize for different speaker and recording conditions. For further speaker normalization it might use Vocal Tract Length Normalization (VTLN) for male-female normalization and Maximum Likelihood Linear Regression (MLLR) for more general speaker adaptation.

The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use Heteroscedastic Linear Discriminant Analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as Maximum Likelihood Linear Transform, or MLLT). Many systems use so-called discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are Maximum Mutual Information (MMI), Minimum Classification Error (MCE) and Minimum Phone Error (MPE). Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path. Here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
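The per-state output distribution described in this section, a mixture of diagonal-covariance Gaussians, can be evaluated as sketched below. This is a minimal illustration under stated assumptions: the parameter layout (lists of weights, means, and variances) and the function names are hypothetical, and the log-sum-exp step is a standard numerical precaution rather than something the text prescribes.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm_likelihood(x, weights, means, variances):
    """Log likelihood of x under a mixture of diagonal-covariance Gaussians."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    # log-sum-exp keeps the sum stable when the individual terms are tiny
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


# One HMM state with two mixture components over 2-dimensional features
score = log_gmm_likelihood(
    x=[0.5, -0.2],
    weights=[0.7, 0.3],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

A diagonal covariance means each feature dimension contributes an independent term, which is why the density reduces to a simple sum over dimensions.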

5.2 Dynamic Time Warping (DTW)

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic Time Warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data which can be turned into a linear representation can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
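The non-linear alignment described above can be sketched with the classic dynamic programming recurrence. The sketch assumes one-dimensional sequences and an absolute-difference local cost; real recognizers compare feature vectors and usually add path constraints. The `recognize` helper and the template values are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best non-linear alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


templates = {"one": [1, 2, 3], "two": [3, 2, 1]}  # toy stand-ins for feature tracks
label = recognize([1, 1, 2, 3, 3], templates)
```

The utterance is a slowed-down version of the first template, so the warping absorbs the repeated values and the distance to that template is zero.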

5.3 Viterbi
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, HMMs. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. These algorithms belong to the realm of information theory.

The algorithm makes a number of assumptions. First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time. Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t and the most likely sequence at point t-1. These assumptions are all satisfied in a first-order hidden Markov model.

The terms "Viterbi path" and "Viterbi algorithm" are also applied to related dynamic programming algorithms that discover the single most likely explanation for an observation. In statistical parsing, a dynamic programming algorithm can be used to discover the single most likely context-free derivation (parse) of a string, which is sometimes called the "Viterbi parse".

Dynamic programming usually takes one of two approaches:
- Top-down approach: The problem is broken into subproblems, and these subproblems are solved and the solutions remembered, in case they need to be solved again. This is recursion and memoization combined together.
- Bottom-up approach: All subproblems that might be needed are solved in advance and then used to build up solutions to larger problems. This approach is slightly better in stack space and number of function calls, but it is sometimes not intuitive to figure out all the subproblems needed for solving the given problem.
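The Viterbi recursion described in this section can be sketched bottom-up for a small discrete HMM. The two states, three observation symbols, and all probabilities below are made-up toy values, not taken from any real recognizer. Practical decoders also work with log probabilities to avoid underflow, which this sketch omits for readability.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely state sequence, its probability) for a discrete HMM."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Only the best path into each previous state matters (first-order assumption)
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Trace back from the most probable final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: hidden states emit coarse energy levels
states = ["silence", "speech"]
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech": {"silence": 0.4, "speech": 0.6}}
emit_p = {"silence": {"low": 0.5, "mid": 0.4, "high": 0.1},
          "speech": {"low": 0.1, "mid": 0.3, "high": 0.6}}

path, prob = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
```

Each column is computed once from the previous one, which is the bottom-up style of dynamic programming; a top-down version would instead memoize a recursive function over (time, state).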

6. Challenges for Speech Recognition

- Inter-speaker variability: vocal tract, gender, dialects
- Language variability:
  - From isolated words to continuous speech
  - Out-of-vocabulary words
- Vocabulary size and domain:
  - From just a few words to large-vocabulary speech recognition
  - Domain that is being recognized
- Noise:
  - Convolutive: recording/transmission conditions
  - Additive: recording environment, transmission SNR
- Intra-speaker variability: stress, age, humor, changes of articulation due to environmental influence

7. Approaches for ASR



7.1 Template based approach

Template-based speech recognition systems have a database of prototype speech patterns (templates) that define the vocabulary. The generation of this database is performed during the training mode. During recognition, the incoming speech is compared to the templates in the database, and the template that represents the best match is selected. Since the rate of human speech production varies considerably, it is necessary to stretch or compress the time axes between the incoming speech and the reference template. This can be done efficiently using Dynamic Time Warping (DTW). In a few algorithms, like Vector Quantization (VQ), it is not necessary to vary the time axis for each word, even if any two words have different utterance lengths. This is performed by splitting the utterance into several different sections and coding each of the sections separately to generate a template for the word. Each word has its own template, and therefore this method becomes impractical as the vocabulary size is increased (> 500 words). One SR system was developed with the concept of time warping (TW) for non-linear alignment of speech. The system performed a speaker-dependent isolated-word recognition task with 97% accuracy on a 200-word vocabulary. Performance of this system degraded to 83% when the vocabulary size was increased to 1000 words. A speaker-dependent, connected-digit, large-vocabulary speech recognition system had an accuracy of 99.6%. This system performed well for speaker-dependent speech recognition, but one of its shortcomings was that it took more than 150 minutes to recognize a word with a vocabulary size of 600.

7.2 Knowledge or rule based approach

Knowledge-based speech recognition systems incorporate expert knowledge that is, for example, derived from spectrograms, linguistics, or phonetics. The goal of a knowledge-based system is to include the knowledge using rules or procedures. The drawback of these systems is the difficulty of quantifying expert knowledge and integrating the multitude of knowledge sources [24]. This becomes increasingly difficult if the speech is continuous, and the vocabulary size increases.

This approach is based on a blackboard architecture:
- At each decision point, lay out the possibilities
- Apply rules to determine which sequences are permitted

It gives poor performance due to:
- Difficulty in expressing rules
- Difficulty in making rules interact
- Difficulty in knowing how to improve the system

7.3 Statistical based approach

It can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools. It is sometimes seen as an "anti-linguistic" approach.
- Collect a large corpus of transcribed speech recordings and train the computer to learn the correspondences (machine learning)
- At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one



8. Machine Learning
8.1 Acoustic and Lexical models

- Analyse training data in terms of relevant features
- Learn from a large amount of data the different possibilities:
  - different phone sequences for a given word
  - different combinations of elements of the speech signal for a given phone/phoneme
- Combine these into a Hidden Markov Model expressing the probabilities



9. Language Model
While grammar-based language models are easy to understand, they are not generally useful for large-vocabulary applications simply because it is so difficult to write a grammar with sufficient coverage of the language. The most common kind of language model in use today is based on estimates of word string probabilities from large collections of text or transcribed speech. In order to make these estimates tractable, the probability of a word given the preceding sequence is approximated by the probability given the preceding one (bigram) or two (trigram) words (in general, these are called n-gram models).

For a bigram language model:
$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-1})$

To estimate the bigram probabilities from a text we must count the number of occurrences of the word pair $(w_{n-1}, w_n)$ and divide that by the number of occurrences of the preceding word $w_{n-1}$. This is a relatively easy computation, and accurate estimates can be obtained from transcriptions of language similar to that expected as input to the system. For example, if we are recognising news stories, text such as the Wall Street Journal corpus can be used to estimate bigram probabilities for the language model. This model is unlikely to transfer very well to another domain such as train timetable enquiries; in general each application requires the language model to be fine-tuned to the language input expected.

The bigram language model gives the simplest measure of word transition probability but ignores most of the preceding context. It is easy to come up with examples of word sequences which will be improperly estimated from a bigram model (for example, in "The dog on the hill barked", the probability of "barked" following "hill" is likely to be underestimated). The more context a language model can use, the more likely it is to be able to capture longer-distance dependencies between words in the input. Unfortunately we hit a severe problem with making estimates of probabilities for anything beyond trigram language models, and even there special care must be taken to deal with word sequences where data is scarce.
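The bigram counting procedure described above (pair count divided by the count of the preceding word) can be sketched as follows. The three-sentence corpus is a made-up toy example, and the `<s>` and `</s>` sentence boundary markers are an assumption added so that sentence-initial and sentence-final words also get estimates.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Maximum-likelihood estimates P(w_n | w_{n-1}) from raw counts."""
    history = Counter()  # occurrences of each word as a predecessor
    pairs = Counter()    # occurrences of each (w_{n-1}, w_n) pair
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        history.update(words[:-1])
        pairs.update(zip(words, words[1:]))
    return {pair: count / history[pair[0]] for pair, count in pairs.items()}


corpus = ["the dog barked", "the dog slept", "the cat slept"]  # toy data
probs = bigram_probabilities(corpus)
p_dog_given_the = probs[("the", "dog")]
```

Any pair that never occurs in the training text is simply absent from the table, which is the data sparsity problem that smoothing is meant to address.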

In a trigram language model the probability of a word given its predecessors is estimated by the probability given the previous two words:
$P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \approx P(w_n \mid w_{n-2}, w_{n-1})$


To estimate this quantity we must count the number of times the triple $(w_{n-2}, w_{n-1}, w_n)$ is observed and divide this by the number of times the pair $(w_{n-2}, w_{n-1})$ occurs. The problem here is clearly that for many triples the number of occurrences is likely to be very low, and so reliable estimates of the trigram probability are unlikely.

To overcome this paucity of data, the technique of language model smoothing is used. Here the overall trigram probability is derived by interpolating trigram, bigram and unigram probabilities:
$P(w_n \mid w_{n-2}, w_{n-1}) = k_1 f(w_n \mid w_{n-2}, w_{n-1}) + k_2 f(w_n \mid w_{n-1}) + k_3 f(w_n)$
where the functions $f()$ are the unsmoothed estimates of trigram, bigram and unigram probabilities. This means that for a triple which does not occur in the training text, the estimated probability will be derived from the bigram model and the unigram model; the estimate will be non-zero for every word included in the lexicon (i.e. every word for which there is an estimate of $P(w)$). The choice of the parameters $k_1 \ldots k_3$ is another optimisation problem.
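The interpolation formula above can be sketched directly from counts. The weights `k = (0.6, 0.3, 0.1)` are arbitrary placeholder values chosen only so that they sum to one; as the text notes, picking them well is a separate optimisation problem. The function names and the one-sentence training text are illustrative.

```python
from collections import Counter


def train_counts(words):
    """Unigram, bigram and trigram counts from a list of words."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri


def interpolated_trigram(w_prev2, w_prev1, w, counts, k=(0.6, 0.3, 0.1)):
    """P(w | w_prev2, w_prev1) = k1*f(trigram) + k2*f(bigram) + k3*f(unigram)."""
    uni, bi, tri = counts
    total = sum(uni.values())
    # Unsmoothed relative-frequency estimates; 0.0 when the history is unseen
    f3 = tri[(w_prev2, w_prev1, w)] / bi[(w_prev2, w_prev1)] if bi[(w_prev2, w_prev1)] else 0.0
    f2 = bi[(w_prev1, w)] / uni[w_prev1] if uni[w_prev1] else 0.0
    f1 = uni[w] / total if total else 0.0
    k1, k2, k3 = k
    return k1 * f3 + k2 * f2 + k3 * f1


counts = train_counts("the dog on the hill barked".split())
seen = interpolated_trigram("the", "hill", "barked", counts)
unseen = interpolated_trigram("on", "hill", "dog", counts)
```

The second query uses a triple and a pair that never occur in the training text, yet it still receives a small non-zero probability from the unigram term, which is the behaviour the smoothing is intended to provide.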




10. Applications
Health care
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. The services provided may be redistributed rather than replaced. Speech recognition can be implemented in the front end or back end of the medical documentation process.

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone.

Battle Management command centres generally require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human-machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited.

Air traffic controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel.

Telephony and other domains
ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. However, despite the high level of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increases in use.



11. Advantages
- It enables increased efficiency in the workplace when hands are busy
- Quicker input of data for processing
- Data entry with no need to type: just speak what you want typed
- Easy for people who are physically challenged

12. Disadvantages
- Robustness: graceful degradation, not catastrophic failure
- Portability: independence of computing platform
- Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, new language even)
- Language modelling: is there a role for linguistics in improving the language models?
- Confidence measures: better methods to evaluate the absolute correctness of hypotheses
- Spontaneous speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions etc.) remain a problem
- Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
- Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace



13. Conclusion
ASR is becoming a sophisticated technology and will grow in popularity, and its success will bring revolutionary changes to the computer industry. This will occur in the business world as well as in our personal lives.

14. References
1. L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Pearson Education, first edition, 2003.
2. L. R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals, Pearson Education.
3. http://en.wikipedia.org/wiki/Speech_Recognition
4. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1318444
5. http://sound2sense.eu/images/uploads/DengStrik2007.pdf
6. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4156191
