
Speech Abiding Systems

R. Manoharan, Professor & MCP


Professor & Head – Department of MCA
SNS College of Technology,
Vazhiyampalayam, Sathy Main Road
Coimbatore – 641 035
Tamil Nadu – India
Objective
Multimedia enables a wide range of new applications, many of which are still in the experimental
phase. When analyzing such a broad field as multimedia from a scientific angle, it is difficult to
avoid reflections on the effects of these new technologies on society as a whole.

The primary objective of this research is to synchronize human speech with the computer, whereupon a Finite State Automaton (FSA) is constructed taking into account the semantic, syntactic and pragmatic content of the human speech, including the Sphota of every word comprising the speech.

Besides this main objective, the research has the following sub-objectives:

• Speech Synthesis

• Filtering the particularities which distinguish one word from another

• Addressing the problems in speech recognition that affect recognition quality.

• Sound concatenation in the time/frequency range.

• Value engineering to achieve cost optimality through suitable Operations Research (O.R.) techniques.

• Enhancing the standard of living by adding a new dimension to the potential market.

• Shifting the overall functionality more and more from hardware to software.

• Internal representation of information to the computer for this research scheme by inventing novel representation media such as transducers.

• Demonstrating novelty in the transmission media.


Introduction
All speeches are characterized by the words occurring in them. A word is nothing but a sound, and it has a form. Every sound has a form and every form has a sound; the sound and the form are inseparable. Despite the significant progress that has been made in the areas of speech recognition and spoken-language processing, building a successful dialogue system still requires large amounts of development time and human expertise. In addition, spoken dialogue system algorithms often have little generalization power and are not portable across application domains.

In state-of-the-art spoken dialogue systems, the system must be able to cope with semantic ambiguity and dynamically changing task definitions. Ambiguity might arise from system misrecognitions or from inherent ambiguity in the user utterances. The user can also change his/her mind and attempt to modify the semantic state of the system by implicit or explicit requests. To cope with such user and system behavior, the concept of persistence in the semantic representation is introduced in the system. The system collects and reasons with all available information that the user supplies in the course of the dialogue.

A hierarchical semantic representation is proposed that encodes all information supplied by the user over multiple dialogue turns and can efficiently represent, and be used to reason about, ambiguous or conflicting information. Implicit in this semantic representation is a pragmatic module consisting of context tracking, pragmatic analysis and pragmatic scoring sub-modules, which computes pragmatic confidence scores for all system beliefs. These pragmatic scores are obtained by combining semantic and pragmatic evidence from the various sub-modules, taking into account the modality of the input as well as identifying and resolving ambiguities.

A context tracking algorithm is also proposed that is application-independent, uses a semantic taxonomy, and can handle and produce confidence scores. The proposed algorithms are simple and easy to implement, yet general and powerful enough to be applicable to numerous spoken dialogue system designs.
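
As a rough illustration of how semantic and pragmatic evidence might be fused into a single belief score, the sketch below uses a weighted geometric mean; the module names, scores and weights are hypothetical assumptions and do not reproduce the scoring rule of any cited system.

```python
import math

def combine_confidences(evidence, weights):
    """Combine per-module confidence scores in (0, 1] into one belief score
    using a weighted geometric mean. Evidence keys and weights are
    illustrative assumptions, not a published scoring rule."""
    total_w = sum(weights[k] for k in evidence)
    log_sum = sum(weights[k] * math.log(max(evidence[k], 1e-9)) for k in evidence)
    return math.exp(log_sum / total_w)

# Hypothetical scores from recognition, context tracking and pragmatic analysis
evidence = {"asr": 0.82, "context": 0.70, "pragmatics": 0.90}
weights = {"asr": 1.0, "context": 0.5, "pragmatics": 0.8}
print(round(combine_confidences(evidence, weights), 3))
```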

Audio feature extraction serves as the basis for a wide range of applications in the area of speech processing. Different approaches and various kinds of audio features have been proposed, with varying success rates. The features can be extracted either directly from the time-domain signal or from a transformation domain, depending upon the choice of the signal analysis approach. Audio signals are highly non-stationary in nature, and the best way to analyze them is to use a joint time-frequency (TF) approach. In order to perform efficient TF analysis on the signal for feature extraction and classification purposes, it is essential to locate the subspaces on the TF plane that demonstrate high discrimination between different classes of signals. Once the target subspaces are identified, it is easier to extract relevant features for classification.
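
A minimal sketch of this idea, assuming synthetic signals and a plain Fisher-style ratio over spectrogram bins (a simple stand-in for local discriminant bases rather than that algorithm itself), is given below.

```python
import numpy as np
from scipy.signal import spectrogram

# Rank time-frequency bins by a Fisher-style ratio to locate regions that
# discriminate two (synthetic) signal classes.
fs = 8000
t = np.arange(fs) / fs                       # 1 second of samples
class_a = [np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size) for _ in range(5)]
class_b = [np.sin(2 * np.pi * 880 * t) + 0.1 * np.random.randn(t.size) for _ in range(5)]

def tf_features(signals):
    # |STFT|^2 per signal, stacked: shape (n_signals, n_freqs, n_frames)
    return np.stack([spectrogram(x, fs=fs, nperseg=256)[2] for x in signals])

Sa, Sb = tf_features(class_a), tf_features(class_b)
fisher = (Sa.mean(axis=0) - Sb.mean(axis=0)) ** 2 / (Sa.var(axis=0) + Sb.var(axis=0) + 1e-12)

# Indices (frequency bin, frame) of the ten most discriminative TF bins
top = np.unravel_index(np.argsort(fisher, axis=None)[-10:], fisher.shape)
print(list(zip(*top)))
```
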
Human Speech

Speech is based on spoken languages, which means that it has semantic content. Human beings use their speech organs without the need to consciously control the generation of sounds. Speech understanding requires efficient adaptation to speakers and their speaking habits. Despite the large number of different dialects and emotional pronunciations, we can understand each other's language. The brain is capable of achieving a very good separation between speech and interference using the signals received by both ears; it is much more difficult for humans to filter signals received by one ear only. The brain corrects speech recognition errors because it understands the content, the grammar rules, and the phonetic and lexical word forms. Speech signals have two important characteristics that can be used by speech processing applications:

(i) Voiced speech signals (in contrast to unvoiced sounds) have an almost periodic
structure over a certain time interval, so that these signals remain quasi-stationary
for about 30 ms.

(ii) The spectra of some sounds have characteristic maxima that normally involve
up to five frequencies. These frequency maxima, generated while speaking, are
called formants; they are characteristic components of the quality of an utterance
(see the sketch below).
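
The following sketch illustrates both points on a synthetic signal: it frames the signal into roughly 30 ms quasi-stationary segments and picks the strongest spectral peaks per frame as a crude stand-in for formant frequencies. Real formant tracking would typically use LPC analysis; the signal and frequencies here are assumptions for illustration.

```python
import numpy as np

# Split a speech-like signal into ~30 ms frames and report the strongest
# spectral peaks per frame as a rough proxy for formant frequencies.
fs = 16000
frame_len = int(fs * 0.030)                      # 480 samples per 30 ms frame

t = np.arange(int(0.3 * fs)) / fs                # 300 ms synthetic signal
signal = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

n_frames = len(signal) // frame_len
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

for i, frame in enumerate(frames):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_len)))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    peaks = np.sort(freqs[np.argsort(spectrum)[-2:]])   # two strongest bins
    print(f"frame {i}: strongest frequencies ~ {peaks.round()} Hz")
```
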

Speech Synthesis

Computers can translate an encoded description of a message into speech. This scheme is called speech synthesis. A particular type of synthesis is text-to-speech conversion. Though such software has been commercially available for various computers, it still lacks naturalness. The problems in speech recognition that affect recognition quality include dialects, emotional pronunciation and environmental noise. With current technology, speaker-dependent recognition of approximately 25,000 words is possible. Speech recognition is a very interesting field for multimedia systems; in combination with speech synthesis, it enables us to implement media transformation. The primary quality characteristics of each speech recognition session are determined by the probability of recognizing a word correctly, which is always less than 1. Factors such as environmental noise, room acoustics, and the physical and psychological state of the speaker play an important role.

A specific problem in speech input is room acoustics: environmental noise may prevail, and frequency-dependent reflections of the sound wave from walls and objects overlay the primary sound wave. Word boundaries also have to be detected, which is not easy, because most speakers and most human languages do not emphasize the end of one word and the beginning of the next. A kind of time normalization is required to be able to compare a speech unit with existing samples. The same word can be spoken fast or slowly; however, we cannot simply compress or stretch the time axis, because elongation factors are not proportional to the total duration.
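
The text does not name a specific time-normalization method; a standard choice for this kind of non-linear alignment is dynamic time warping (DTW), sketched below on two one-dimensional feature sequences.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.
    DTW aligns the sequences non-linearly in time, so a fast and a slow
    rendition of the same word can still be compared."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# The slow version repeats samples, yet DTW still finds a zero distance.
fast = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
slow = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0])
print(dtw_distance(fast, slow))   # 0.0: perfectly alignable
```
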
Methodology Proposed
The proposed methodology uses compression techniques to reduce the number of windows by means of powerful mathematical tools such as the Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT), suitably adapted. In other words, audio content is captured from the signal-based audio data through discrete transformations.
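
As a minimal sketch of this idea, assuming a synthetic 256-sample frame and an arbitrary choice of how many coefficients to keep, the following code represents an audio frame by its largest DCT coefficients and reconstructs an approximation.

```python
import numpy as np
from scipy.fft import dct, idct

# Keep only the largest DCT coefficients of a (synthetic) audio frame,
# discard the rest, and reconstruct an approximation of the frame.
fs = 8000
t = np.arange(256) / fs                           # one 32 ms frame at 8 kHz
frame = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

coeffs = dct(frame, norm="ortho")
keep = 16                                         # retain 16 of 256 coefficients
compressed = np.zeros_like(coeffs)
largest = np.argsort(np.abs(coeffs))[-keep:]
compressed[largest] = coeffs[largest]

reconstructed = idct(compressed, norm="ortho")
error = np.sqrt(np.mean((frame - reconstructed) ** 2))
print(f"kept {keep}/{coeffs.size} coefficients, RMS error = {error:.4f}")
```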

The speech recognition principle is applied by comparing specific characteristics of individual utterances with a set of previously extracted speech elements. This means that these characteristics are normally quantized for the speech sequence to be studied. The results are compared with existing references in order to allocate the utterance to one of the existing speech elements. Identified utterances are stored, transmitted, or processed as a parameterized sequence of speech elements. Practical implementations normally use dedicated components or a signal processor to extract the characteristic properties. The comparison and decision are generally handled by the system's main processor, while the lexicon of reference characteristics normally resides in the computer's secondary storage unit.
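
A toy sketch of the comparison-and-decision step is shown below; the feature vectors and lexicon entries are hypothetical placeholders for the quantized characteristic properties described above (in practice these would be spectral features per word).

```python
import numpy as np

# Allocate an utterance to the closest entry of a reference lexicon by
# comparing feature vectors. Vectors and vocabulary are hypothetical.
lexicon = {
    "yes":  np.array([0.9, 0.1, 0.3]),
    "no":   np.array([0.2, 0.8, 0.5]),
    "stop": np.array([0.4, 0.4, 0.9]),
}

def recognize(features):
    # Nearest reference by Euclidean distance; ties broken arbitrarily.
    return min(lexicon, key=lambda w: np.linalg.norm(lexicon[w] - features))

utterance_features = np.array([0.85, 0.15, 0.25])   # quantized properties
print(recognize(utterance_features))                 # -> "yes"
```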

Most practical methods differ in how they define the characteristic properties. The application of this principle, shown in Fig. 1, can be divided into the steps shown in Fig. 2.

[Fig. 1: Speech recognition principle. Speech analysis (parameter and property extraction on a speech chip) feeds pattern recognition (comparison with reference properties on the main processor); learned reference properties are held in storage, and the output is the recognized speech.]
[Fig. 2: Speech recognition steps. Speech passes through acoustic and phonetic analysis (using sound patterns and word models), syntactical analysis (using syntax) and semantic analysis (using semantics) to produce understood speech.]

(1) Acoustic and phonetic analysis against sound patterns and word models.

(2) Syntactic analysis, which detects errors in the first run and serves as an additional decision tool.

(3) Semantic analysis, which analyses the semantics of the speech sequence recognized so far; it can detect errors from the previous decision processes and remove them through a further interplay with the other analytical methods. Even with current artificial intelligence and neural network technologies, the implementation of this step is extremely difficult. (A toy sketch of chaining these stages follows this list.)
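
Purely as an illustration of how the three stages could be chained, the sketch below uses hypothetical placeholder functions for the acoustic, syntactic and semantic stages; it is not the pipeline of any existing recognizer.

```python
# Three analysis stages as a simple pipeline, where each later stage may
# veto or re-rank earlier hypotheses. All stages are hypothetical stubs.
def acoustic_phonetic_analysis(audio_frames):
    # would map sound patterns to candidate word hypotheses with scores
    return [("open", 0.7), ("oven", 0.3)]

def syntactic_analysis(candidates):
    # drop hypotheses that do not fit the expected sentence structure
    allowed_verbs = {"open", "close"}
    return [(w, p) for w, p in candidates if w in allowed_verbs]

def semantic_analysis(candidates, dialogue_context):
    # pick the highest-scoring surviving hypothesis; a real system would
    # also weigh consistency with the dialogue context
    return max(candidates, key=lambda wp: wp[1]) if candidates else None

frames = []                                  # placeholder audio input
hypotheses = acoustic_phonetic_analysis(frames)
hypotheses = syntactic_analysis(hypotheses)
print(semantic_analysis(hypotheses, dialogue_context={"topic": "files"}))
```
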

These methods often work with characteristics in the time and/or frequency range. The suppression of silence in speech sequences is a typical example of a transformation that depends entirely on the signal's semantics.
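
A minimal sketch of such silence suppression, assuming a synthetic signal, a 20 ms frame size and an energy threshold chosen for illustration, is given below.

```python
import numpy as np

# Silence suppression: drop frames whose short-time energy falls below a
# threshold. Frame size and threshold are illustrative choices.
fs = 8000
frame_len = int(0.02 * fs)                        # 20 ms frames

t = np.arange(int(0.2 * fs)) / fs
speech = np.sin(2 * np.pi * 400 * t)
silence = 0.01 * np.random.randn(t.size)
signal = np.concatenate([silence, speech, silence])

n_frames = len(signal) // frame_len
frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
energy = np.mean(frames ** 2, axis=1)
threshold = 0.1 * energy.max()

kept = frames[energy > threshold]                 # only "active" frames survive
print(f"kept {kept.shape[0]} of {n_frames} frames")
```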

The mechanism that converts an audio signal into a sequence of digital samples is called an Analog-to-Digital Converter (ADC); a Digital-to-Analog Converter (DAC) achieves the opposite conversion.

The digitization process requires two steps. First, the analog signal must be sampled, meaning that only a discrete set of values is retained at regular time or space intervals. The second step is quantization, which converts the sampled signal into a signal that can take only a limited number of values.
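
The two steps can be illustrated with a short sketch, assuming a synthetic sine tone, an 8 kHz sampling rate and an 8-bit uniform quantizer.

```python
import numpy as np

# Digitization in two steps: sample an "analog" signal at discrete instants,
# then quantize each sample to a limited set of levels.
fs = 8000                                  # sampling rate in Hz
bits = 8                                   # quantizer resolution
levels = 2 ** bits

t = np.arange(0, 0.01, 1.0 / fs)           # sampling: discrete time instants
samples = np.sin(2 * np.pi * 440 * t)      # sample values in [-1, 1]

# Uniform quantization: map each sample to the nearest of `levels` values.
step = 2.0 / (levels - 1)
quantized = np.round((samples + 1.0) / step) * step - 1.0

print("max quantization error:", np.max(np.abs(samples - quantized)))
```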

References
1. V.S. Subrahmanian, Principles of Multimedia Database Systems, Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA.
2. Ralf Steinmetz and Klara Nahrstedt, Multimedia Fundamentals, Volume 1: Media Coding and Content Processing, Pearson Education, www.pearsoned.co.in.
3. Egbert Ammicht (Senior Member, IEEE), Information Seeking Spoken Dialogue Systems.
4. Karthikeyan Umapathy (Student Member, IEEE), Sridhar Krishnan (Senior Member, IEEE) and Raveendra K. Rao (Senior Member, IEEE), Audio Signal Feature Extraction and Classification Using Local Discriminant Bases.
5. K.P. Soman and K.I. Ramachandran, Insight into Wavelets: From Theory to Practice, Prentice-Hall of India, New Delhi, www.phindia.com.
