
ABSTRACT

Speech recognition (SR) is the inter-disciplinary sub-field of computational
linguistics which incorporates knowledge and research from the linguistics, computer
science, and electrical engineering fields to develop methodologies and
technologies that enable the recognition and translation of spoken language into
text by computers and computerized devices, including those classified as smart
technologies and robotics. It is also called "automatic speech recognition" (ASR),
"computer speech recognition", or simply "speech to text". This paper presents
techniques for speech-to-text and speech-to-speech automatic summarization based
on speech unit extraction and concatenation. Sentence and word units which
maximize the weighted sum of linguistic likelihood, amount of information,
confidence measure, and grammatical likelihood of concatenated units are extracted
from the speech recognition results and concatenated to produce summaries. For the
latter case, sentences, words, and between-filler units are investigated as units to be
extracted from the original speech. These methods are applied to the summarization
of unrestricted-domain spontaneous presentations and evaluated by objective and
subjective measures.

Introduction
Speech recognition applications include voice user interfaces such as voice
dialling (e.g. "Call home"), call routing (e.g. "I would like to make a collect
call"), domestic appliance control, search (e.g. find a podcast where particular
words were spoken), simple data entry (e.g., entering a credit card number),
preparation of structured documents (e.g. a radiology report), speech-to-text
processing (e.g., word processors or emails), and aircraft (usually termed Direct
Voice Input).
The term voice recognition or speaker identification refers to identifying the
speaker, rather than what they are saying. Recognizing the speaker can simplify the
task of translating speech in systems that have been trained on a specific person's
voice or it can be used to authenticate or verify the identity of a speaker as part of a
security process. One of the key applications of automatic speech recognition is to
transcribe speech documents such as talks, presentations, lectures, and broadcast
news. Although speech is the most natural and effective method of communication
between human beings, it is not easy to quickly review, retrieve, and reuse speech
documents if they are simply recorded as audio signals. Therefore, transcribing
speech is expected to become a crucial capability for the coming IT era. Although
high recognition accuracy can easily be obtained for speech read from a text, such
as anchor speakers' broadcast news utterances, the technological ability for
recognizing spontaneous speech is still limited. Spontaneous speech is ill-formed
and very different from written text. Spontaneous speech usually includes
redundant information such as disfluencies, fillers, repetitions, repairs, and word
fragments. In addition, irrelevant information included in a transcription caused by
recognition errors is usually inevitable. Therefore, an approach in which all words

are simply transcribed is not an effective one for spontaneous speech. Instead,
speech summarization which extracts important information and removes
redundant and incorrect information is ideal for recognizing spontaneous speech.
Speech summarization is expected to save time for reviewing speech documents
and improve the efficiency of document retrieval.
Summarization results can be presented by either text or speech. The former
method has advantages in that: 1) the documents can be easily looked through; 2)
the part of the documents that are interesting for users can be easily extracted; and
3) information extraction and retrieval techniques can be easily applied to the
documents. However, it has disadvantages in that wrong information due to speech
recognition errors cannot be avoided and prosodic information such as the emotion
of speakers conveyed only in speech cannot be presented. On the other hand, the
latter method does not have such disadvantages and it can preserve all the acoustic
information included in the original speech.

Fig: Speech to text and text to speech flow

Motivation
We were inspired by JARVIS from Iron Man I, II, and III, and especially from
Avengers: Age of Ultron.
What is JARVIS?
It is the program used by Mr. Tony Stark as his personal smart software assistant.
JARVIS stands for Just A Rather Very Intelligent System. It is an Artificial
Intelligence program that speaks in a human male voice with a British accent.

BASIC VISUALIZATION AND GUI OF JARVIS

AIM AND OBJECTIVES

Aim: The aim of our project is to provide a smart virtual assistant in order to speed
up work, and to offer a new way for blind people to operate their computers.

Objectives:

To perform certain tasks on command.
To improve the input method (faster input).
To browse the internet faster.
To improve the way the computer is operated (one can give commands through a
wireless microphone).

Project Background and Literature Survey

Background:
Previously, a two-stage summarization method consisting of important sentence
extraction and word-based sentence compaction was investigated. In this approach,
the summarization results are presented as either text or speech. Early attempts to
design systems for automatic speech recognition were mostly guided by the theory
of acoustic-phonetics, which describes the phonetic elements of speech (the basic
sounds of the language) and tries to explain how they are acoustically realized in a
spoken utterance. These elements include the phonemes and the corresponding
place and manner of articulation used to produce the sound in various phonetic
contexts. For example, in order to produce a steady vowel sound, the vocal cords
need to vibrate (to excite the vocal tract), and the air that propagates through the
vocal tract results in sound with natural modes of resonance similar to what occurs
in an acoustic tube. These natural modes of resonance, called the formants or
formant frequencies, are manifested as major regions of energy concentration in
the speech power spectrum. In 1952, Davis, Biddulph, and Balashek of Bell
Laboratories built a system for isolated digit recognition for a single speaker [9],
using the formant frequencies measured (or estimated) during vowel regions of
each digit. Figure 5 shows a block diagram of the digit recognizer developed by
Davis et al., and Figure 6 shows plots of the formant trajectories along the
dimensions of the first and the second formant frequencies for each of the ten
digits, one-nine and oh, respectively. These trajectories served as the reference

pattern for determining the identity of an unknown digit utterance as the best
matching digit.

Literature Survey:
Sadaoki Furui (F'93) is a Professor at the Department of Computer Science,
Tokyo Institute of Technology, Tokyo, Japan. He is engaged in a wide range of
research on speech analysis, speech recognition, speaker recognition, speech
synthesis, and multimodal human-computer interaction and has authored or
coauthored over 350 published articles. From 1978 to 1979, he served on staff at
the Acoustics Research Department of Bell Laboratories, Murray Hill, NJ, as a
Visiting Researcher, working on speaker verification. He is the author of Digital
Speech Processing, Synthesis, and Recognition (New York: Marcel Dekker, 1989;
revised, 2000), Digital Speech Processing (Tokai, Japan: Tokai Univ. Press, 1985),
Acoustics and Speech Processing (Tokyo, Japan: Kindai-Kagaku-Sha, 1992) in
Japanese, and Speech Information Processing (Tokyo, Japan: Morikita, 1998). He
edited (with M. M. Sondhi) Advances in Speech Signal Processing (New York:
Marcel Dekker, 1992). He has translated into Japanese Fundamentals of Speech
Recognition (Tokyo, Japan: NTT Advanced Technology, 1995), authored by L. R.
Rabiner and B.-H. Juang, and Vector Quantization and Signal Compression
(Tokyo, Japan: Corona-sha, 1998), authored by A. Gersho and R. M. Gray. Dr.
Furui is a Fellow of the Acoustical Society of America and the Institute of
Electronics, Information and Communication Engineers of Japan (IEICE). He is
President of the Acoustical Society of Japan (ASJ), the International Speech
Communication Association (ISCA), and the Permanent Council for International
Conferences on Spoken Language Processing (PC-ICSLP). He is a member of the
Board of Governors of the IEEE Signal Processing Society (SPS) and in 1993, he served as
an IEEE SPS Distinguished Lecturer. He has served on the IEEE Technical
Committee on Speech and MMSP and on numerous IEEE Conference organizing
committees. He is Editor-in-Chief of the Transactions of the IEICE. He is also an
Editorial Board member of Speech Communication, the Journal of Computer
Speech and Language, and the Journal of Digital Signal Processing. He has
received numerous awards, including: the Yonezawa Prize and the Paper Awards
from the IEICE (1975, 1988, 1993, and 2003); the Sato Paper Award from the ASJ
(1985 and 1987); the Senior Award from the IEEE Acoustics, Speech, and Signal
Processing Society (1989); the Achievement Award from the Minister of Science
and Technology of Japan (1989); the Technical Achievement Award and the
Book Award from the IEICE (1990 and 2003); the Mira Paul Memorial Award from
the AFECT of India (2001).
Tomonori Kikuchi: received the B.E. and M.E. degrees in computer science from
Tokyo Institute of Technology, Tokyo, Japan, in 2001 and 2003, respectively. He
has been with Japan Patent Office, Tokyo, Japan, since 2003.
Yousuke Shinnaka: received the B. E. degree in electrical and electronic
engineering from Tokyo Institute of Technology, Tokyo, Japan, in 2003. He is
currently pursuing the M.S. degree at Tokyo Institute of Technology.
Chiori Hori (M'02) received the B.E. and M.E. degrees in electrical and
information engineering from Yamagata University, Yonezawa, Japan, in 1994 and
1997, respectively, and the Ph.D. degree from the Graduate School of Information
Science and Engineering, Tokyo Institute of Technology (TITECH), Tokyo, Japan,
in 2002. From April 1997 to March 1999, she was a Research Associate with the
Faculty of Literature and Social Sciences, Yamagata University. She is currently a
Researcher with NTT Communication Science Laboratories (CS Labs), Nippon

Telegraph and Telephone Corporation (NTT), Kyoto, Japan, which she joined in
2002. Dr. Hori is a member of the Acoustical Society of Japan (ASJ), the Institute
of Electronics, Information and Communication Engineers of Japan (IEICE), and
the Information Processing Society of Japan (IPSJ). She received the Paper Award
from the IEICE in 2002 for her work on speech summarization.

Problem Statement
Automatic speech summarization techniques are used with the two
presentation methods. In both cases, the most appropriate sentences, phrases, or
word units/segments are automatically extracted from the original speech and
concatenated to produce a summary.

Project Requirements and Analysis

1. Requirements for an Ideal Voice Assistant System

The ideal VOICE ASSISTANT system would, without error:

be completely transparent to the speaker (i.e. it would not require training and no
microphone would need to be worn by the speaker);

be completely transparent to the listener (i.e. it would not require the user to carry
any special equipment around);

recognise the speech of any speaker (even if they had a cold or an unusual accent);

recognise any word in any context (including whether it is a command);

recognise and convey attitudes, interest, emotion and tone;

recognise the speaker and be able to indicate who and where they are;

cope with any type or level of background noise and any speech quality or level;

synchronise the text and speech to enable searching and manipulation of the speech
using the text.

2. Current VOICE ASSISTANT Systems

Since 1999 Dr Wald has been working with IBM and Liberated Learning
(coordinated by Saint Mary's University in Nova Scotia, Canada) to demonstrate
that VOICE ASSISTANT can make speech accessible to all. (Wald 1999, Wald
2002a, Wald 2002b, Bain, Basson & Wald 2002) Lecturers wear wireless
microphones providing the freedom to move around as they are talking and the
AST is edited for errors and available for students on the Internet. Since standard
automatic speech recognition software lacks certain features that are required to
make the Liberated Learning vision a reality, a prototype application, Lecturer, was
developed in 2000 in collaboration with IBM for the creation of AST and was
superseded the following year by IBM ViaScribe (IBM 2004). Both applications
used the ViaVoice engine and its corresponding training of voice and language
models and automatically provided AST displayed in a window and synchronised
AST stored for later reference. ViaScribe used a standard file format (SMIL)
enabling synchronised audio and the corresponding text transcript and slides to be
viewed on an Internet browser or through media players that support the SMIL 2.0
standard for accessible multimedia.

3. Visual Indication of Pauses


Without the dictation of punctuation, VOICE ASSISTANT produces a continuous
stream of text that is very difficult to understand and so Lecturer and ViaScribe
automatically formatted the transcription based on pauses/silence in the normal
speech stream to provide a readable display. Formatting can be adjustably triggered

by pause/silence length with short and long pause timing and markers
corresponding, for example, to the written phrase and sentence markers comma
and period or the sentence and paragraph markers period and newline.
However, as people don't speak in complete sentences, spontaneous speech does
not have the same structure as carefully constructed written text and so does not
lend itself easily to punctuating with periods or commas. A more readable approach
is to provide a visual indication of pauses which show how the speaker grouped
words together (e.g. one new line for a short pause and two for a long pause).
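
As a simple illustration of this pause-based formatting (a sketch only; the input format, timings, and thresholds below are assumptions, not taken from Lecturer or ViaScribe), line breaks can be derived from word timestamps:

```python
# Sketch: format a transcript by pause length, assuming word timings are available.
# Hypothetical input: a list of (word, start_sec, end_sec) tuples from a recognizer.

def format_by_pauses(words, short_pause=0.5, long_pause=1.5):
    """Insert one newline for a short pause and a blank line for a long pause."""
    lines, current = [], []
    for i, (word, start, end) in enumerate(words):
        current.append(word)
        if i + 1 < len(words):
            gap = words[i + 1][1] - end          # silence before the next word
            if gap >= long_pause:
                lines.append(" ".join(current))
                lines.append("")                  # blank line marks a long pause
                current = []
            elif gap >= short_pause:
                lines.append(" ".join(current))   # newline marks a short pause
                current = []
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

print(format_by_pauses([("hello", 0.0, 0.4), ("everyone", 0.5, 1.0),
                        ("today", 2.8, 3.2), ("we", 3.3, 3.5)]))
```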

Accuracy
LL demonstrated accuracy of 85% or above for 40% of lecturers who used VOICE
ASSISTANT in classes (Leitch et al 2003), while some lecturers could use VOICE
ASSISTANT with over 90% accuracy. Although the same training approach was
followed by all LL Universities in the US, Canada and Australia, and the same
independent measure of accuracy and similar hardware and software were used,
lecturers varied in their lecturing experience, abilities, familiarity with the lecture
material and the amount of time they could spend on improving the voice and
language models. However, in spite of any problems, students and teachers
generally liked the LL concept and felt it improved teaching.

4. Current LL Research and Development

Further planned ViaScribe developments include: a new speech recognition engine


integrated with ViaScribe; removing the requirement for speakers to train the

system by reading predefined scripts (that were designed for generating written
text rather than AST); optimising for recognition of a specific speaker's
spontaneous speech by creating specific language models from their spontaneous
speech rather than generic written documents; speaker independent mode; real time
editing of AST; personalised individual displays; design and implementation of a
web infrastructure including semantic web for information retrieval and machine
readable notes. LL Research and development has also included improving the
recognition, training users, simplifying the interface, and improving the display
readability. Some of these developments have been trialled in the laboratory and
some in the classroom. Although IBM has identified the aim of developing better
than human speech recognition in the not too distant future, the use of a human
intermediary re-voicing and/or real time editing can help compensate for some of
VOICE ASSISTANT's current limitations.

5. Editing and Re-voicing

Recognition errors will sometimes occur as VOICE ASSISTANT is not perfect and
so using one or more editors correcting errors in real time is one way of improving
the accuracy of AST. Not all errors are equally important, and so the editor can use
their initiative to prioritise those that most affect readability and understanding. An
experienced trained re-voicer repeating what has been said can improve accuracy
over Direct AST where the original speech is not of sufficient volume/quality (e.g.
telephone, internet, television, indistinct speaker) or when the system is not trained
(e.g. multiple speakers, meetings, panels, audience questions). Re-voiced VOICE
ASSISTANT is sometimes used for live television subtitling in the UK
(Lambourne, Hewitt, Lyon & Warren 2004) and in classrooms and courtrooms in

the US using a mask to reduce background noise and disturbance to others. While
one person acting as both the re-voicer and editor could attempt to create Real
Time Edited Revoiced AST, this would be more problematic for creating Real
Time Edited Direct AST (e.g. if a lecturer attempted to edit VOICE ASSISTANT
errors while they were giving their lecture). However, using Real Time Edited
Direct AST to increase accuracy might be more acceptable when using VOICE
ASSISTANT to communicate one-to-one with a deaf person.

Improving Readability through Confidence Levels and Phonetic Clues

VOICE ASSISTANT systems will attempt to display the most probable words in
their dictionary based on the speaker's voice and language models, even if the actual
words spoken are not in the dictionary (e.g. unusual or foreign names of people
and places). Although the system has information about the level of confidence it
has about these words, this is not usually communicated to the reader of the AST
whose only clue that an error has occurred will be the context. If the reader knew
that the transcribed word was unlikely to be correct, they would be better placed to
make an educated guess at what the word should have been from the sound of the
word (if they can hear this) and the other words in the sentence (current speech
recognition systems only use statistical probabilities of word sequences and not
semantics). Providing the reader with an indication of the confidence the system
has in recognition accuracy, can be done in different ways (e.g. colour change
and/or displaying the phonetic sounds) and the user could select the confidence
threshold. Since a lower confidence word will not always be wrong and a higher
confidence word right, further research is required on this feature. For a reader

unable to hear the word, the phonetic display would also give additional clues as to
how the word was pronounced and therefore what it might have been.
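
A minimal sketch of one way to surface confidence to the reader (the word/confidence pairs and the 0.6 threshold are illustrative assumptions; real recognizers expose confidence scores in different ways):

```python
# Sketch: flag low-confidence words so a reader knows where to guess from context.
# Input is assumed to be (word, confidence) pairs taken from recognizer output.

def mark_low_confidence(words, threshold=0.6):
    """Wrap words below the confidence threshold in [? ... ?] markers."""
    out = []
    for word, conf in words:
        out.append(word if conf >= threshold else f"[?{word}?]")
    return " ".join(out)

print(mark_low_confidence([("the", 0.98), ("lecture", 0.95),
                           ("on", 0.91), ("eigenvalues", 0.42)]))
# -> the lecture on [?eigenvalues?]
```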

Improving Usability and Performance

Current unrestricted-vocabulary VOICE ASSISTANT systems are normally


speaker dependent and so require the speaker to train the system to the way they
speak, any special vocabulary they use and the words they most commonly employ
when writing. This normally involves initially reading aloud from a provided
training script, providing written documents to analyse, and then continuing to
improve accuracy by improving the voice and language models by correcting
existing words that are not recognised and adding any new vocabulary not in the
dictionary. Current research includes providing pre-trained voice models (the
most probable speech sounds corresponding to the acoustic waveform) and
language models (the most probable words spoken corresponding to the phonetic
speech sounds) from samples of speech, so the user does not need to spend the
time reading training scripts or improving the voice or language models. Speaker
independent systems currently usually generate lower accuracy than trained models
but research includes systems that improve accuracy as they learn more about the
speaker's voice.

Network System

The speaker's voice and language models need to be installed on the classroom
machines and this can occur either by the speaker bringing their computer with
them into the classroom or by uploading the files on to a fixed classroom machine
on a network. A network approach can also help make the system easier to use (e.g.
automatically loading personal voice and language models and saving the speech
and text files) and ensure that teachers don't need to be technical experts and that
technical experts are not required in the classroom to sort out problems.

Coping with Multiple Speakers

Various approaches could be adopted in meetings or interactive group sessions in


order that contributions, questions and comments from all speakers could be
transcribed directly into text. The simplest approach would be for each speaker to
have their own separate personal VOICE ASSISTANT system trained to their
voice. Since speech-recognition systems work by calculating how confident they
are that a word that has been spoken has been recognised in their dictionary, it is
possible with more than one speech recognition engine running on a client-server
system to compare the scores to find the best recognition. This would not involve
the time delay that would occur with the alternative approach of speaker
identification, where the system identifies a change in speaker (through software or
by microphone or locating the position speech originates from) before loading their
voice model. The system could also indicate for deaf people who and where the
speaker was (e.g. using colour, name, photo, position on diagram/image/map or an
individual wireless display screen or badge). Automatic speaker identification
could also be useful for everyone in virtual conferences.

Personalised Displays

Liberated Learning's research has shown that while projecting the text onto a large
screen in the classroom has been used successfully it is clear that in many
situations an individual personalised and customisable display would be preferable
or essential. A JAVA web based application will therefore be developed to provide
users with their own personal display on their own web enabled wireless systems
(e.g. computers, PDAs, mobile phones etc.) customised to their preferences (e.g.,
font, size, colour, text formatting and scrolling).

Speech Command and Control

Speech commands can be used to control events (e.g. NEXT-SLIDE or a more


flexible approach calling up a slide by name) but to avoid confusion it is important
to be able to distinguish when the speaker wishes a slide to be presented, from
when they are describing that slide (e.g. the next slide shows the results of
experiment B). While it is possible to learn to use command words, some people
find mixing speech control with speech dictation difficult and unnatural. It would
be possible to detect when a person is looking at a camera to interpret the speech as
commands, much as a person assumes they are being addressed when the speaker
looks at them.

Emotions, Attitudes and Tone

Current VOICE ASSISTANT doesn't convey emotions, attitude, or tone, but it


might be possible to recognize some of these features and give some indication in
the transcription.

Implementation
Methodology
Speech-to-text summarization methods
Sentence extraction-based methods
LSA-based methods
Generic text summarization is a field that has seen increasing attention from the
NLP community. The actual huge amount of electronic information has to be
reduced to enable the users to handle this information more effectively. Latent
Semantic Analysis (LSA) is an algebraic-statistical method which extracts hidden
semantic structures of words and sentences. It is an unsupervised approach which
does not need any training or external knowledge. LSA uses context of the input
document and extracts information such as which words are used together and
which common words are seen in different sentences. A high number of common
words among sentences indicates that the sentences are semantically related. The
meaning of a sentence is determined using the words it contains, and the meanings of
words are determined using the sentences that contain them. Singular Value
Decomposition (SVD), an algebraic method, is used to find out the interrelations
between sentences and words.
Example 1: three sentences are given as an input to LSA.

d0: The man walked the dog.


d1: The man took the dog to the park.
d2: The dog went to the park.
Step 1.
Input matrix creation: an input document needs to be represented in a way that
enables a computer to understand and perform calculations on it. This
representation is usually a matrix representation where columns are sentences and
rows are words/phrases. The cells are used to represent the importance of words in
sentences. Different approaches can be used for filling out the cell values. Since all
words are not seen in all sentences, most of the time the created matrix is sparse.
Step 2.
Singular Value Decomposition (SVD): SVD is an algebraic method that can model
relationships among words/phrases and sentences. In this method, the given input
matrix A is decomposed into three new matrices as follows:
A = U Σ V^T
where A is the input matrix (words × sentences, m × n), U is the words ×
extracted-concepts matrix (m × n), Σ is the diagonal matrix of scaling (singular)
values in descending order (n × n), and V is the sentences × extracted-concepts
matrix (n × n).
Step 3. Sentence selection
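A minimal sketch of the three steps on the example sentences d0-d2, assuming a simple term-frequency input matrix and numpy's SVD; for Step 3 it uses one common selection strategy (pick, for each leading concept, the sentence with the largest value in the corresponding row of V^T), which is an illustrative choice rather than the only option:

```python
# Sketch of LSA-based extractive selection on the d0-d2 example above.
# Assumptions: term-frequency weighting, "one sentence per top concept" selection.
import numpy as np

sentences = [
    "the man walked the dog",
    "the man took the dog to the park",
    "the dog went to the park",
]

# Step 1: build the words-by-sentences input matrix A (term frequencies).
vocab = sorted({w for s in sentences for w in s.split()})
A = np.array([[s.split().count(w) for s in sentences] for w in vocab], dtype=float)

# Step 2: decompose A = U . diag(sigma) . V^T.
U, sigma, VT = np.linalg.svd(A, full_matrices=False)

# Step 3: for each of the top-k concepts, pick the sentence with the
# largest magnitude in the corresponding row of V^T.
k = 1
selected = {int(np.argmax(np.abs(VT[i]))) for i in range(k)}
for idx in sorted(selected):
    print(sentences[idx])
```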
LSA has several limitations. The first one is that it does not use the information
about word order, syntactic relations, and morphologies. This kind of information
can be necessary for finding out the meaning of words and texts. The second
limitation is that it uses no world knowledge, only the information that exists in the
input document. The third limitation is related to the performance of the algorithm.

With larger and more inhomogeneous data the performance decreases sharply. The
decrease in performance is caused by SVD, which is a computationally complex algorithm.
LSA-based research in instructional settings is a decade old. This paper aims to
provide an overview of this research and to fuel new research directions. We argue
that LSA is a good candidate to analyze instructional interactions associated with
learning environments and to deliver feedback accordingly.
MMR-based methods
MMR-based feature selection selects each feature according to a combined
criterion of information gain and novelty of information. The latter measures the
degree of dissimilarity between the feature being considered and previously selected
features. Maximal Marginal Relevance (MMR) provides precisely such functionality.
MMR selects candidate sentences for a final summary based on the highest
remaining relevance in the corpus (Provast, 2008). In other words, one chooses the
most relevant remaining sentence and ensures that there is a minimal duplication
with any sentences already chosen for the proposed summary. In extractive
summarization, the score of a sentence Si in the kth iteration in MMR is calculated
as follows:
MMR = arg max_{Si ∈ R\S} [ λ Sim1(Si, T) − (1 − λ) max_{Sj ∈ S} Sim2(Si, Sj) ]
where T is a tag set, R is the set of all sentences in the document set, S is the
current set of already selected sentences for the summary, and R\S is the set of
unselected sentences in R. Sim1 is the similarity metric for computing the
similarity between a sentence Si and the tag set T, and Sim2 is the similarity metric
for computing the similarity between two sentences Si and Sj. λ is a parameter used
to adjust the combined score in order to emphasize relevance or to avoid
redundancy.
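
A minimal sketch of this greedy MMR selection, assuming cosine similarity over simple term-frequency vectors for both Sim1 and Sim2 (an illustrative choice; any similarity metrics could be substituted):

```python
# Sketch: greedy MMR selection of sentences against a query/tag set.
import numpy as np

def tf_vector(text, vocab):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def mmr_select(sentences, query, k=2, lam=0.7):
    vocab = sorted({w for s in sentences + [query] for w in s.lower().split()})
    vecs = [tf_vector(s, vocab) for s in sentences]
    qvec = tf_vector(query, vocab)
    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], qvec)                               # Sim1(Si, T)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy                 # MMR criterion
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in selected]

print(mmr_select(["the man walked the dog",
                  "the man took the dog to the park",
                  "the dog went to the park"], query="dog park"))
```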
Feature-based methods
Sentence compaction-based methods
Task-based evaluations measure human performance using the summaries for a
certain task (after the summaries are created). We can, for example, measure the
suitability of using summaries instead of full texts for text categorization [3]. This
evaluation requires a classified corpus of texts.
Feature extraction
The speech waveform is digitized by 16 kHz sampling and 16-bit quantization, and a
25-dimensional feature vector consisting of normalized logarithmic energy, a
12-dimensional Mel-cepstrum, and their derivatives is extracted using a 24 ms frame
applied every 10 ms. Cepstral mean subtraction (CMS) is applied to each utterance.
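
A rough sketch of such a front end using the librosa library (a toolkit assumption; the original system's implementation is not specified here), producing log energy, 12 cepstral coefficients, and their deltas, followed by cepstral mean subtraction:

```python
# Sketch of the acoustic front end: 25-dim vectors per 10 ms frame
# (log energy + 12 MFCCs + their deltas), with cepstral mean subtraction.
# librosa is an assumed toolkit choice; "speech.wav" is a placeholder path.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)           # 16 kHz sampling
frame, hop = int(0.024 * sr), int(0.010 * sr)          # 24 ms window, 10 ms shift

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_mels=26,
                            n_fft=frame, hop_length=hop)         # (12, T)
delta = librosa.feature.delta(mfcc)                              # (12, T)
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
log_energy = np.log(rms + 1e-10)                                 # (1, T)
log_energy -= log_energy.max()                                   # crude per-utterance normalization

features = np.vstack([log_energy, mfcc, delta])                  # (25, T)
features[1:13] -= features[1:13].mean(axis=1, keepdims=True)     # CMS on the static cepstra
print(features.shape)
```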
Acoustic and linguistic models
Speaker-independent, context-dependent phone HMMs with 3000 states and 16
Gaussian mixtures per state are built using a part of the CSJ consisting of 338
presentations with a total length of 59 hours, spoken by male speakers different from
the speaker of the presentation used for testing. The
transcribed presentations in the CSJ with 1.5M words are automatically split into
words (morphemes) by the JTAG morphological analysis program, and the most
frequent 20k words are selected to calculate word bigrams and trigrams.
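
As an illustration of this language-model step (a sketch only; the CSJ text and the JTAG tokenizer are not reproduced, so a toy word-segmented corpus stands in for them), bigram and trigram counts over a limited vocabulary can be gathered as follows:

```python
# Sketch: count word bigrams/trigrams over an already word-segmented corpus,
# keeping only a fixed-size vocabulary of the most frequent words.
from collections import Counter

corpus = [["i", "like", "speech", "recognition"],
          ["speech", "recognition", "is", "fun"]]          # toy stand-in for the CSJ

unigrams = Counter(w for sent in corpus for w in sent)
vocab = {w for w, _ in unigrams.most_common(20000)}        # 20k-word vocabulary

def to_vocab(sent):
    return ["<unk>" if w not in vocab else w for w in sent]

bigrams, trigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + to_vocab(sent) + ["</s>"]
    bigrams.update(zip(words, words[1:]))
    trigrams.update(zip(words, words[1:], words[2:]))

print(bigrams.most_common(3))
```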

Decoder
A word-graph-based two-pass decoder is used for recognition. In the first
pass, frame-synchronous beam search is performed using the above-mentioned
HMM and the bigram language model. A word graph generated as a result of the
first pass is rescored in the second pass using the trigram language model.
Combination of sentence extraction and sentence compaction
Automatic speech summarization is one approach toward accomplishing this goal.
Speech summarization would also be expected to reduce the time needed for
reviewing speech documents and to improve the efficiency of document retrieval.
Thus, speech summarization technology is expected to play an important role in
building various speech archives, including broadcast news, lectures, presentations,
and interviews. The output of these techniques can be presented in two ways:
1) presenting simply concatenated speech segments that are extracted from the
original speech;
Vocal Signal Analysis.
Sound travels through the environment as a longitudinal wave with a speed that
depends on the environment density. The easiest way to represent sounds is a
sinusoidal graphic. The graphic presents variation of air pressure depending on
time. The shape of the sound wave depends on three factors: amplitude, frequency
and phase. The amplitude is the displacement of the sinusoidal graph above and
below temporal axis (y = 0) and it corresponds to the energy the sound wave is
loaded with. Amplitude measurement can be performed using pressure units
(decibels, dB), which measure the amplitude on a logarithmic scale relative to a
standard reference sound. Measuring amplitude using decibels is important in
practice because it is a direct representation of how the sound volume is perceived
by people. The frequency is the number of cycles the sinusoid makes every second.

A cycle consists of an oscillation that starts at the midline, rises to the
maximum, falls to the minimum, and returns to the midline. The
frequency is measured in cycles per second, or Hertz (Hz). The inverse of the
frequency is called the period; it is the time needed for the sound wave to complete
one cycle.
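
A small numerical sketch of these three parameters (amplitude, frequency, and phase) and of expressing amplitude in decibels relative to a reference value (the 440 Hz tone and the reference amplitude are arbitrary choices):

```python
# Sketch: a sine wave x(t) = A * sin(2*pi*f*t + phase), sampled at 16 kHz,
# and its amplitude expressed in dB relative to a reference amplitude.
import numpy as np

sr = 16000                      # samples per second
t = np.arange(0, 0.01, 1 / sr)  # 10 ms of time axis
A, f, phase = 0.5, 440.0, np.pi / 4
x = A * np.sin(2 * np.pi * f * t + phase)

ref = 1.0                                   # reference amplitude
level_db = 20 * np.log10(A / ref)           # logarithmic amplitude measure
period = 1 / f                              # seconds per cycle
print(f"level = {level_db:.1f} dB, period = {period * 1000:.2f} ms")
```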
Word Detection.
Today's detection techniques can accurately identify the starting and ending points
of a spoken word within an audio stream, based on processing signals varying with
time. They evaluate the energy and average magnitude in a short time unit, and
also calculate the average zero-crossing rate. Establishing the starting and ending
points is a simple problem if the audio recording is performed in ideal conditions. In
this case the signal-to-noise ratio is large, so it is easy to determine the location
within the stream that contains a valid signal by analyzing the samples.
In real conditions things are not so simple: the background noise has a significant
intensity and can disturb the process of isolating the word within the stream.
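
A minimal sketch of this kind of endpoint detection based on short-time energy, with the zero-crossing rate computed as an auxiliary cue (the thresholds are illustrative assumptions that would need tuning to real recording conditions):

```python
# Sketch: locate the start/end of a spoken word from short-time frame energy.
# The zero-crossing rate is computed as an auxiliary cue only (unused here).
import numpy as np

def detect_endpoints(x, sr, frame_ms=25, hop_ms=10, energy_thresh=0.02):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    speech = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        energy = np.mean(w ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2  # zero-crossing rate (auxiliary)
        speech.append(energy > energy_thresh)
    if not any(speech):
        return None
    first = speech.index(True)
    last = len(speech) - 1 - speech[::-1].index(True)
    return first * hop / sr, (last * hop + frame) / sr  # start/end times in seconds

# Toy example: a 300 Hz tone between 0.3 s and 0.6 s buried in weak noise.
sr = 16000
t = np.arange(0, 1.0, 1 / sr)
x = np.where((t > 0.3) & (t < 0.6), np.sin(2 * np.pi * 300 * t),
             0.01 * np.random.randn(len(t)))
print(detect_endpoints(x, sr))
```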
2) Synthesizing summarization text by using a speech synthesizer.
Some systems have a somewhat hybrid nature in that they perform unit
selection as above, but in addition perform prosodic modification by signal
processing. In cases where for instance the specification requires a unit with a
particular F0 value and the database has no unit of this value, it is possible to
use signal processing to adjust the unit to the required value. If the degree of
modification is too high, poor quality will result, and in particular if the
modified speech is present in the same utterance as unmodified speech then a
listener may be able to sense the shift in quality. If the degree of modification is

slight however, it is possible to use signal processing with no adverse effect. A


further possibility is to anticipate signal processing in the search. We do this by
giving lower priority (i.e. lower weights) to the features we know we can compensate
for with signal processing.
1] Unit Selection and Concatenation:
Units for Extraction: The following issues need to be addressed in extracting and
concatenating speech segments for making summaries.
1) Units for extraction: sentences, phrases, or words.
2) Criteria for measuring the importance of units for extraction.
3) Concatenation methods for making summary speech.
The following three units are investigated in this paper: sentences, words, and
between-filler units. All the fillers automatically detected in the recognition results
are removed before extracting important segments.
2] Unit Concatenation:
The extracted sentence, word, or between-filler units are then concatenated to
produce the summary speech.
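
The abstract describes extracting the units that maximize a weighted sum of linguistic likelihood, amount of information, confidence measure, and grammatical likelihood; a minimal sketch of such weighted unit scoring is shown below (the weights and the per-unit scores are placeholders, not the trained values used in the paper):

```python
# Sketch: rank candidate units by a weighted sum of the four scores named above.
# The weights and the example scores are illustrative placeholders.

WEIGHTS = {"linguistic": 1.0, "information": 1.0, "confidence": 1.0, "grammar": 1.0}

def unit_score(unit):
    return sum(WEIGHTS[k] * unit[k] for k in WEIGHTS)

candidates = [
    {"text": "speech summarization saves review time",
     "linguistic": -2.1, "information": 3.0, "confidence": 0.9, "grammar": -1.5},
    {"text": "uh you know",
     "linguistic": -1.0, "information": 0.2, "confidence": 0.6, "grammar": -3.0},
]

summary_units = sorted(candidates, key=unit_score, reverse=True)
print([u["text"] for u in summary_units[:1]])
```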
We have designed our first module. This module contains the user login credentials
and can accept input through speech. We first tried words such as "Hello", "Hi",
"open notepad", and "open word". The system can accept input through speech, and
we get output according to our requirements.
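
A minimal sketch of such a speech-command module (the speech_recognition library, its free Google recognizer backend, and the Windows program names are assumptions for illustration; the actual module may differ):

```python
# Sketch: listen for a spoken command and launch the matching program.
# Library choice (speech_recognition) and program names are assumptions.
import subprocess
import speech_recognition as sr

COMMANDS = {
    "open notepad": ["notepad.exe"],
    "open word": ["winword.exe"],
}

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    print("Say a command...")
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio).lower()
    print("Heard:", text)
    for phrase, cmd in COMMANDS.items():
        if phrase in text:
            subprocess.Popen(cmd)   # launch the requested application
            break
    else:
        print("No matching command.")
except sr.UnknownValueError:
    print("Could not understand the audio.")
```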

Result and Evaluation


We have designed our first module. This module can accept input through speech.
We first tried words such as "Hello", "Hi", "open notepad", and "open word". The
system can accept input through speech, and we get output according to our requirements.
We have performed testing for the first module, using the following types of testing.
Software testing is the process of evaluating a software item to detect differences
between the given input and the expected output, and to assess the features of the
software item. Testing assesses the quality of the product. Software testing is a
process that should be carried out during the development process. In other words,
software testing is a verification and validation process.
4.1 Unit Testing
Unit testing is the testing of an individual unit or group of related units. It falls
under the class of white box testing. It is often done by the programmer to test that
the unit he/she has implemented is producing expected output against given input.
4.2 Integration Testing
Integration testing is defined as the testing of combined parts of an application to
determine if they function correctly.

Bottom-up integration
This testing begins with unit testing, followed by tests of progressively higher-level
combinations of units called modules or builds.
Top-down integration
In this testing, the highest-level modules are tested first and progressively lower-level modules are tested thereafter.
4.3 Acceptance Testing
Acceptance testing is often done by the customer to ensure that the delivered
product meets the requirements and works as the customer expected. It falls under
the class of black box testing.
There are two types of acceptance testing: alpha testing and beta testing.
Alpha testing is done at the developer's site, and beta testing is done at the user's site.
Test case ID: VA_1
Unit to test: Login as user
Test data: User_Id, password
Test case: User_id and password are empty
Steps: Click the login button
Expected result: An error message should be displayed

Test case ID: VA_2
Unit to test: Login as user
Test data: User_Id, password
Test case: User_id is empty and password is correct
Steps: Enter the data and click the enter button
Expected result: A message should be displayed

Test case ID: VA_3
Unit to test: Login as user
Test data: User_Id, password
Test case: User_id is correct and password is empty
Steps: Enter the data and click the enter button
Expected result: A message should be displayed

Test case ID: VA_4
Unit to test: Login as user
Test data: User_Id, password
Test case: User_id is correct and password is incorrect
Steps: Enter the data and click the enter button
Expected result: A message should be displayed

Test case ID: VA_5
Unit to test: Login as user
Test data: User_Id, password
Test case: User_id and password are empty
Steps: Enter the data and click the enter button
Expected result: A message should be displayed

Test case ID: VA_11
Unit to test: Search
Test data: Search field data
Test case: Search field is filled with data
Steps: Fill in the data and press the enter button
Expected result: Successfully directed

Proposed outcome
When the user gives input to the system in speech format, the system will convert
the speech into text. Keywords will be matched against the database; if the keyword
is present in the database, the system will respond in speech format. If the required
query is not present, the system will notify the user by giving speech output.
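
A minimal sketch of this proposed flow (speech_recognition for input and pyttsx3 for the spoken reply are assumed library choices; the keyword database is a placeholder dictionary):

```python
# Sketch: speech in -> keyword lookup -> speech out.
# Library choices and the keyword "database" are illustrative assumptions.
import pyttsx3
import speech_recognition as sr

KEYWORD_DB = {
    "time": "Let me check the time for you.",
    "weather": "Fetching the weather report.",
}

def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    audio = recognizer.listen(source)

try:
    query = recognizer.recognize_google(audio).lower()   # speech -> text
    reply = next((resp for kw, resp in KEYWORD_DB.items() if kw in query), None)
    speak(reply if reply else "Sorry, I could not find that query.")
except sr.UnknownValueError:
    speak("Sorry, I did not understand.")
```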

System Analysis Proposed Architecture

Project Plan

Conclusion
Finding out the information related to the needs of a user among large
number of documents is a problem that has become obvious with the growth of
text-based resources. In order to solve this problem, text summarization methods
are proposed and evaluated. The research on summarization started with the
extraction of simple features and went on to use different methods, such as lexical
chains, statistical approaches, graph-based approaches, and algebraic solutions.
One of the algebraic-statistical approaches is the Latent Semantic Analysis method.
Development of a sophisticated summarization system for spoken language
requires further research in both text and speech analysis and provides a fertile test
bed for integration of the two approaches. Some of the areas in which more joint
research with the speech community would prove valuable include: segmentation
of spoken documents at many levels, depending upon genre: utterance, turn, topic
or story; extraction of acoustic and prosodic information (pitch, intensity, timing),
which may be useful in segmentation but also in identifying important passages
to include in a summary; identification of speakers; more accurate named entity
extraction from speech; disfluency detection and techniques for correcting
disfluent passages; speech act labeling; and access to phoneme lattices and to
word-level confidence scores from ASR output, to identify out-of-vocabulary proper
names and to identify words recognized with higher confidence.

References
[1] S. Furui, K. Iwano, C. Hori, T. Shinozaki, Y. Saito, and S. Tamura, "Ubiquitous
speech processing," in Proc. ICASSP 2001, vol. 1, Salt Lake City, UT, 2001, pp. 13-16.
[2] S. Furui, "Recent advances in spontaneous speech recognition and understanding,"
in Proc. ISCA-IEEE Workshop on Spontaneous Speech Processing and
Recognition, Tokyo, Japan, 2003.
[3] I. Mani and M. T. Maybury, Eds., Advances in Automatic Text Summarization.
Cambridge, MA: MIT Press, 1999.
[4] J. Alexandersson and P. Poller, "Toward multilingual protocol generation for
spontaneous dialogues," in Proc. INLG-98, Niagara-on-the-Lake, Canada, 1998.
[5] K. Zechner and A. Waibel, "Minimizing word error rate in textual summaries of
spoken language," in Proc. NAACL, Seattle, WA, 2000.
[6] J. S. Garofolo, E. M. Voorhees, C. G. P. Auzanne, and V. M. Stanford, "Spoken
document retrieval: 1998 evaluation and investigation of new metrics," in Proc.
ESCA Workshop: Accessing Information in Spoken Audio, Cambridge, MA, 1999,
pp. 1-7.
[7] R. Valenza, T. Robinson, M. Hickey, and R. Tucker, "Summarization of spoken
audio through information extraction," in Proc. ISCA Workshop on Accessing
Information in Spoken Audio, Cambridge, MA, 1999, pp. 111-116.
[8] K. Koumpis and S. Renals, "Transcription and summarization of voicemail
speech," in Proc. ICSLP 2000, 2000, pp. 688-691.
[9] K. Maekawa, H. Koiso, S. Furui, and H. Isahara, "Spontaneous speech corpus
of Japanese," in Proc. LREC 2000, Athens, Greece, 2000, pp. 947-952.
[10] T. Kikuchi, S. Furui, and C. Hori, "Two-stage automatic speech
summarization by sentence extraction and compaction," in Proc. ISCA-IEEE
Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan,
2003.
