What is Speech Recognition?


Automatic speech recognition is a process by which a computer takes a speech signal (recorded using a microphone) and converts it into words. Speech recognition is a hard problem for a number of reasons:

Many different words can be spoken: the average person uses thousands or tens of thousands of words. In speech the boundaries between words are not obvious - one word runs into the next - so the problem is one of finding the words as well as identifying them. This usual case is called continuous speech recognition. Sometimes, to make the problem easier, systems demand that people leave pauses between words; this is an unnatural way of speaking, and recognizing such speech is referred to as isolated word recognition.

When people speak casually or conversationally, it is much more difficult for the computer to recognize what they are saying than when they are dictating or reading from a script: the acoustic variability is much greater. It is more straightforward to build a speaker-dependent system, able to recognize the speech of one particular speaker, but it is much more useful if a system is speaker-independent, able to recognize anyone. Finally, speech recognition works much better in a quiet room with a nearby microphone. If this is not the case, other sounds are also recorded, and such noisy speech is much harder to recognize.

Speech recognition systems work best in particular applications where a person is expected to be speaking about a particular subject (e.g. booking a doctor's appointment). In such cases, the speech recognition system can take advantage of the fact that a person will most likely talk about only a very limited number of things.

A Brief History
Alexander Graham Bell was born March 3, 1847, in Edinburgh, Scotland. His father, Alexander Melville Bell, was an expert in vocal physiology, and his grandfather, Alexander Bell, was an elocution professor. Speech recognition technology really began with Alexander Graham Bell's inventions in the 1870s. By discovering how to convert air pressure waves (sound) into electrical impulses, he began the process of uncovering the scientific and mathematical basis of understanding speech. Bell had discovered that a wire vibrated by the voice while partially immersed in a conducting liquid, such as mercury, could be made to vary its resistance and produce an undulating current. In other words, human speech could be transmitted over a wire.

In the 1950s, Bell Laboratories developed the first effective speech recognizer for numbers. In the 1970s, the ARPA Speech Understanding Research project developed the technology further - in particular by recognizing that the objective of automatic speech recognition is the understanding of speech, not merely the recognition of words. By the 1980s, two distinct types of commercial products were available. The first offered speaker-independent recognition of small vocabularies. The second, offered by Kurzweil Applied Intelligence, Dragon Systems, and IBM, focused on the development of large-vocabulary voice recognition systems, so that text documents could be created by voice dictation. Over the past two decades, voice recognition technology has developed to the point of real-time, continuous speech systems that augment command, security, and content creation tasks with exceptionally high accuracy.

Basic Principles of Speech Recognition


To understand speech recognition software, we must understand the basic units of language and the methods used to interpret them.

Speech
The smallest unit of spoken language is known as a Phoneme. The English language contains approximately 44 phonemes, representing all the vowels and consonants that we use for speech. A typical word such as "moon" can be broken down into three phonemes: m, ue, n.

Interpreting Speech
To interpret speech we must have a way of identifying the components of spoken words. Phonemes act as identifying markers within speech: they remain constant and cannot be broken down further. An algorithm is then needed to interpret the speech, and mathematical models such as Markov Models, which work on the basis of probability, are used for this. To create a speech recognition engine, a large database of models is created, one to match each phoneme. When a comparison is performed, the most likely match between the spoken phoneme and the stored ones is determined, and further computations are performed. This allows the system to work out the exact word that was uttered, the context in which it was used, and the grammar if the word is part of a sentence.
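As a minimal sketch of this idea (the dictionary structure is an illustrative assumption, and the phoneme symbols simply follow the "moon" example above rather than a standard inventory), a phoneme lexicon can be represented as a simple mapping:

```python
# A minimal phoneme lexicon: each word maps to the sequence of phonemes
# that make it up. Symbols follow the "moon" example above; real systems
# use a standard inventory such as ARPAbet.
lexicon = {
    "moon": ["m", "ue", "n"],
}

def phonemes_of(word):
    """Return the phoneme sequence for a word, or None if unknown."""
    return lexicon.get(word.lower())

print(phonemes_of("moon"))  # ['m', 'ue', 'n']
```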

How Does Speech Recognition Work? (Basic terms)


A speech recognizer consists of a number of components. These are learned from data, using a Speech Corpus consisting of recordings of speech and their textual transcriptions. The Speech Recognizer learns to make correspondences between sounds and words.

Signal Processing
This processes the signal recorded by the microphone into Feature Vectors that provide a snapshot of what is going on in the speech signal, emphasizing those features that are relevant to speech recognition. Typically, 100 feature vectors per second are produced.

Acoustic Model
This takes the stream of Feature Vectors and turns it into a stream of Phonemes (or Phoneme Hypotheses). A Phoneme is the unit used to construct words, and corresponds to a particular speech sound. An important aspect of the Acoustic Model is that it does not make definite decisions about what the stream of Phonemes is, but tells us how likely any particular Phoneme is at each point in the speech signal.

Lexicon
This tells us how words are constructed as strings of Phonemes. Alternative pronunciations are also possible.

Language Model
This states which sequences of words are likely and which are not. Just using a grammar is not possible, since people may say anything!
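To make the division of labour concrete, here is a hedged structural sketch of these four components. Every name, number, and data shape below is an illustrative assumption, not a real system's API:

```python
# Structural sketch of the recognizer components described above.
# The toy features and scores are placeholders for real trained models.

def signal_processing(audio):
    """Turn raw audio samples into feature vectors (~100 per second)."""
    frame = 160  # e.g. 10 ms of samples at 16 kHz (an assumption)
    frames = [audio[i:i + frame] for i in range(0, len(audio), frame)]
    return [sum(abs(s) for s in f) / len(f) for f in frames if f]  # toy energy feature

def acoustic_model(features):
    """Map each feature vector to phoneme likelihoods, not hard decisions."""
    return [{"m": 0.7, "ue": 0.2, "n": 0.1} for _ in features]  # toy likelihoods

LEXICON = {"moon": ["m", "ue", "n"]}  # lexicon: how words are built from phonemes

def language_model_score(words):
    """Score how plausible a word sequence is (toy: shorter is likelier)."""
    return 1.0 / (1 + len(words))

# A decoder would search for the word sequence that best explains the
# phoneme likelihoods, weighted by the language model; that search is
# omitted here.
print(acoustic_model(signal_processing([0.0] * 1600))[0])
```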

How Does Speech Recognition Work? A Step-by-Step Walkthrough


The following steps show how a basic Speech Recognition system might work.

Step 1: Corpus Collection
The first element required for speech recognition is what is known as a Corpus Collection: a database of speech data built from multiple speech samples. To create a corpus collection, a number of user voices are sampled. Each user is generally asked to read many pages of text, such as stories, designed so that samples of all the necessary speech data (e.g. phonemes) can be recorded and stored.
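At its simplest, such a corpus is just paired recordings and transcriptions. A sketch (the file names and texts below are hypothetical placeholders):

```python
# A toy corpus: each entry pairs a recording with its transcription.
# File names and texts are hypothetical placeholders.
corpus = [
    ("speaker01_story1.wav", "the quick brown fox jumps over the lazy dog"),
    ("speaker02_story1.wav", "the quick brown fox jumps over the lazy dog"),
]

# Texts like these are chosen so that the recordings cover all the
# phonemes the recognizer needs to learn.
```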

Speaker independent: The speech collection uses samples from multiple sources, e.g. hundreds of voices, to allow it to work with many users. This is extremely complicated, as voice patterns differ with gender, age, and dialect or accent. Creating a collection of this kind takes a lot of work and disk storage. Such collections are used for large-scale commercial systems where any user can interact with the system.

Speaker dependent: This speech collection is compiled from one source, e.g. a single user. A system using it will recognize only that user's speech patterns. Such collections are used for small-scale applications, such as use in the home.

Step 2: Signal Analysis
A speech recognition system requires user input to function, in this case a voice command. The microphone picks up the analogue sound wave and converts it into an electrical signal. The signal is then passed to a processing unit, which we call a signal analyzer, for signal analysis. The signal analyzer separates the signal, since much of the sound in the sample will be background noise and speaker distractions, which are not useful; what we want is the speech itself. The signal analyzer uses a feature extraction process that analyses the waveform of the sound, concentrating on the frequency and the amplitude of the signal.

Once the signal has been analyzed, the extraction process copies what it believes to be the phonemes from the speech sample and passes these on to the acoustic models to be identified.
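A rough sketch of such a feature extraction step follows. Framing the waveform and taking FFT magnitudes is a common approach, but the specific frame sizes and the choice of feature here are assumptions:

```python
import numpy as np

def extract_features(signal, sample_rate=16000, frame_ms=25, step_ms=10):
    """Slice the waveform into overlapping frames and compute the
    frequency content (FFT magnitudes) of each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)         # samples between frames
    features = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        features.append(np.abs(np.fft.rfft(frame)))  # amplitude per frequency
    return np.array(features)

# One second of audio yields roughly 100 feature vectors, as noted above.
print(extract_features(np.zeros(16000)).shape)  # (98, 201)
```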

Step 3: Acoustic Models
The signal analyzer passes what it believes to be phonemes from the speech sample to the acoustic model for identification. A mathematical model, such as a Markov model, is used to perform the identification task. Samples of possible phonemes are passed into the models, and these possibilities are compared with previous results from the trained models. The phoneme with the highest probability of a match is selected as the correct phoneme.

What has now been determined to be the correct phoneme is passed to the language models to determine what the word is and what it means.
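A toy version of that final decision, where the match probabilities are invented for illustration:

```python
# Toy acoustic-model decision: given the match probability computed for
# each candidate phoneme (numbers invented), pick the likeliest one.
scores = {"p": 0.82, "b": 0.11, "t": 0.05, "d": 0.02}
best_phoneme = max(scores, key=scores.get)
print(best_phoneme)  # 'p' -- passed on to the language models
```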

Step 4: Language Models - The Dictionary File
Once the phonemes in the speech sample have been identified, the recognition system must decide what word has been spoken and in what context it is being used. This is where the Language Model is used. The Dictionary File is the first step: it matches the phoneme patterns from the acoustic models to particular words, which in turn decide which grammar pattern is correct for the user input.

The phoneme patterns are identified and a search is performed on the dictionary file. This large file holds the phonetic spellings of thousands of words relevant to the system's application.

If a match is found, it is saved until the rest of the searches have been completed. In this case /p/-/ae/-/n/ identifies "pain", so the word is output to the grammar file.
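A sketch of this lookup (the entries below are invented; a real dictionary file holds thousands):

```python
# Dictionary-file lookup: phoneme sequences map to words.
dictionary = {
    ("p", "ae", "n"): "pain",          # the example above
    ("d", "aa", "k", "t", "er"): "doctor",
}

phonemes = ("p", "ae", "n")
word = dictionary.get(phonemes)
print(word)  # 'pain' -- output to the grammar file
```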

Step 5: Language Models - The Grammar File
The second step is to compare the words found by the dictionary file with pre-generated models that map the paths a conversation or command may take. The grammar file is made up of many templates that map how the conversation can evolve. These templates can be created because the computer expects a certain type of response from the user, so a conversation can follow only a limited number of paths. This allows the language model to match a sentence to the user's response.

Words found in the dictionary file are passed to the grammar file.

The grammar file contains many templates of this kind. The word "pain" that was passed in matches one such template (together with other words, of course).
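A hedged sketch of what a template match might look like; the template, the slots and their vocabulary are all hypothetical:

```python
# Grammar-file matching: a template is a slot pattern that an expected
# response can follow. Template and slot fillers are hypothetical.
template = ["i", "have", "a", "<symptom>", "in", "my", "<body_part>"]
slots = {"<symptom>": {"pain", "ache"}, "<body_part>": {"arm", "leg", "head"}}

def matches(words, template):
    """True if the word sequence fits the template slot-for-slot."""
    return len(words) == len(template) and all(
        w == t or w in slots.get(t, ()) for w, t in zip(words, template)
    )

print(matches("i have a pain in my arm".split(), template))  # True
```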

Step 6: Text Output
The language models complete their analysis of the phonemes and generate the best match, based on their computation of what words have been spoken. The speech recognition system then produces results that can be passed directly to a dialogue system for further evaluation.


Possible Uses for Speech Recognition


Speech recognition technology has huge potential for human-computer interaction. Over the last ten years, recognition technology has begun to appear in a multitude of devices, from home computers to mobile phones and home security systems. It is the simplest way for a user to interact, since speech is the most common and natural form of communication and does not require complicated user interfaces.

Home Care
Speech recognition has many possible uses for home care. It can be used to control things such as heating, lighting or appliances, and it forms a key part of spoken dialogue systems. People can communicate with a speech recognizer using a phone, or by wearing a pin-on microphone (which may be wireless). Microphones may also be attached to the walls or placed on a table, although this makes the task of recognition harder.

Court reporter
A court reporter or court stenographer, also called a "stenotype operator", "shorthand reporter" or "law reporter", is a person whose occupation is to transcribe spoken or recorded speech into written form, using machine shorthand or voice writing equipment to produce official transcripts of court hearings, depositions and other official proceedings. Court reporting companies primarily serve private law firms; local, state and federal government agencies and courts; trade associations; meeting planners; and nonprofits.

Interactive voice response (IVR)
Interactive voice response (IVR) is a technology that allows a computer to interact with humans through the use of voice and DTMF tones input via a keypad. IVR technology is also being introduced into automobile systems for hands-free operation. Current deployment in automobiles revolves around satellite navigation, audio and mobile phone systems.

Hands-free computing
Hands-free computing is any computer configuration in which a user can interact without the use of their hands, an otherwise common requirement of human interface devices such as the mouse and keyboard. Hands-free computing is important because it is useful to both able-bodied and disabled users. Approaches range from using the tongue, lips, mouth or movement of the head, to voice-activated interfaces using speech recognition software with a microphone or Bluetooth technology.

Machine translation
Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation (MAHT) or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. On a basic level, MT performs simple substitution of words in one natural language for words in another, but that alone usually cannot produce a good translation of a text, because recognition of whole phrases and their closest counterparts in the target language is needed.

Automated identification
Automated identification comes into the picture where you need to authenticate someone's identity over the phone without using risky personal data.


Advantages
1. Increases productivity
2. Can help with menial computer tasks, such as browsing and scrolling
3. Can help people who have trouble using their hands
4. Can help people who have cognitive disabilities
5. Has long-term benefits for students
6. No training required for users

Disadvantages
1. Can be hacked with pre-recorded verbal messages
2. Has an initial period of adjusting to each user's voice
3. Less accurate when there is background noise
4. Can be distracting in a cubicle environment
5. The company Nuance is one of the last companies producing the software and is essentially a monopoly


An Introduction to Markov Models

Markov models are an excellent way of abstracting simple concepts into a relatively easily computed form. They are used in everything from data compression to sound recognition. In artificial intelligence, they are used in natural language processing and sound recognition (which becomes an NLP problem once the sound has been converted into text-based data). This essay will introduce you to simple Markov Models and their uses.

Basic Theory

In natural language, most sentences have some form of pattern to them. They are simply a sequence of patterns, albeit a relatively complex sequence at times. Look at the directed graph below, with three nodes plus a start and an end.

From this graph we can create sequences such as:

N1 N2 N3
N1 N2 N2 N2 N3 N3 N3 N3 N3
N1 N1 N2 N2 N3

The thing is, in most sequences (especially complex ones like natural language), different paths have different probabilities of occurring. To accommodate this, we can assign probabilities to the different paths (or arcs):


Now we can analyze the probability of a certain sequence occurring, using our previous examples:

N1 N2 N3 = 0.4 x 0.8 x 0.5 = 0.16
N1 N2 N2 N2 N3 N3 N3 N3 N3 = 0.4 x 0.2 x 0.2 x 0.8 x 0.5 x 0.5 x 0.5 x 0.5 = 0.0008
N1 N1 N2 N2 N3 = 0.6 x 0.4 x 0.2 x 0.8 x 0.5 = 0.0192

A directed graph augmented with probability scores like this is called a Markov Model. Surprisingly simple as this may seem, we can see how powerful it becomes when applied to something like sound recognition.
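The same calculation can be done in code. The transition table below is read off the worked examples above; treating the final 0.5 in the first and third products as an exit arc from N3 is an assumption:

```python
# Transition probabilities for the Markov Model above, as (from, to) pairs.
P = {
    ("N1", "N1"): 0.6, ("N1", "N2"): 0.4,
    ("N2", "N2"): 0.2, ("N2", "N3"): 0.8,
    ("N3", "N3"): 0.5, ("N3", "end"): 0.5,  # exit arc: an assumption
}

def sequence_probability(states):
    """Multiply the arc probabilities along the path, ending at 'end'."""
    prob = 1.0
    for a, b in zip(states, states[1:] + ["end"]):
        prob *= P.get((a, b), 0.0)
    return prob

print(sequence_probability(["N1", "N2", "N3"]))              # 0.16
print(sequence_probability(["N1", "N1", "N2", "N2", "N3"]))  # 0.0192
```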

Applications to Sound Recognition
In sound recognition, a computer has to deal with tens of thousands of words, all of which can be pronounced differently. A brute-force search is simply too inefficient and time-consuming to be of any use. In comes the Markov Model! Imagine that the three nodes shown above (N1, N2 and N3) were actually small chunks of data which, put together in various ways, produced phonemes (small parts of words, such as iy in "heat"), and that the Markov Model represented the different ways those three data chunks could produce valid phonemes, along with their likelihood of occurring. Could we take that paradigm up a level and see how phonemes could be put together to form words? Of course - look at this example of the word tomato, which accommodates pronunciations such as:


t ow m aa t ow - British English
t ah m ey t ow - American English
t ah m ey t a - possible pronunciation when speaking quickly
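Expressed as data, a pronunciation lexicon simply stores several phoneme strings per word. A sketch using the variants above:

```python
# The pronunciation variants of "tomato" listed above, as alternative
# phoneme strings in a lexicon.
pronunciations = {
    "tomato": [
        ["t", "ow", "m", "aa", "t", "ow"],  # British English
        ["t", "ah", "m", "ey", "t", "ow"],  # American English
        ["t", "ah", "m", "ey", "t", "a"],   # quick, casual speech
    ],
}
print(len(pronunciations["tomato"]))  # 3 accepted ways to say the word
```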


What about taking the paradigm even further? How could you link words together to produce a valid sentence? This is just as plausible:

With sentences such as:


I like apple juice - Very probable
I like tomato juice - Very improbable!
I hate apple juice - Relatively improbable
I hate tomato juice - Relatively probable

You can see how, using Markov Models, speech recognition systems can "intelligently" ascertain what was said. For example, look at these two phrases:
James's school...
James is cool

When these are broken down into phonemes they are very similar, especially when spoken quickly. When the computer checks its Markov Model, it would find that the first phrase is (unfortunately) much more likely to appear than the second, so it uses it as a base for the rest of the sentence. Yet, if the phrases were these sentences:
James's school drives monkey feet
James is cool because he plays the guitar

It would switch over to the second sentence, because the first sentence quickly turns meaningless, while the second is both semantically and syntactically correct. This sort of "understanding" is imperative for a speech recognition engine, especially one that has to deal with continuous speech. Does the usefulness of Markov Models stop there? No! Markov Models can be arranged into a hierarchy that enables the computer to turn speech data into meaningful sentences relatively quickly (probabilities omitted):


The computer would use this tiered Markov Model first to extract phonemes from the speech data, then words from the phonemes, and finally sentences from the words!

Conclusion
The remaining question is: where do all the probabilities come from? Normally, this is a brute-force exercise of feeding the system a large amount of data (encyclopedia text, Shakespeare, etc.) and generating a list of probabilities. In fact, some Markov Models are not directed as shown, but simply connect all words to each other and let the probabilities govern the connections (this is normally called a bigram model). You can see how useful Markov Models are, both in speech recognition and natural language processing. While relatively simple in nature, they can prove incredibly productive when applied properly.
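As a toy illustration of where those probabilities come from (the training text below is invented and tiny; real systems use enormous corpora):

```python
from collections import Counter

# Count adjacent word pairs in a training text and normalize: a minimal
# bigram model. The text reuses the juice sentences from above.
text = "i like apple juice i hate tomato juice i like tomato juice".split()
pair_counts = Counter(zip(text, text[1:]))
word_counts = Counter(text[:-1])

def bigram_prob(prev, word):
    """Estimate P(word | prev) from the counts."""
    return pair_counts[(prev, word)] / word_counts[prev] if word_counts[prev] else 0.0

print(bigram_prob("i", "like"))  # ~0.667: 'like' follows 'i' in 2 of 3 cases
```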


Conclusion
Being able to determine what is spoken, or who a speaker is, with near-perfect accuracy is an extremely formidable task. Preventing another individual from breaking into the system can be just as difficult, as it requires a system that depends on text and that will not accept anything other than what it specifies. Our initial idea of being able to determine what word was being spoken is, at best, naïve, and at worst not at all feasible. With that said, however, the end results were very acceptable. Speech in itself does not contain sufficient information for speech recognition... but by combining visual and audio speech features we are able to achieve better performance than what is possible with audio-only ASR.
