Interactive Voice Response System

by

K. Pratap Kumar Raju

Indian Institute of Information Technology and Management
Gwalior-474005

January 2002
CERTIFICATE
This is to certify that the thesis entitled “Interactive Voice Response System”, being submitted to the Indian Institute of Information Technology and Management, Gwalior for the award of Master of Technology in Information Technology by K. Pratap Kumar Raju, is a record of bona fide work carried out by him under my supervision and guidance. It is further certified that the work presented has reached the standard required of a PG thesis and has not been submitted to any other university or institute for the award of any degree or diploma.
Date:
Place: Dr. Rajendra Sahu
ACKNOWLEDGEMENT
I would like to thank our Director, Prof. D. P. Agarwal, for providing all the facilities and the working environment in the institute. I would also like to thank the entire institute faculty, who helped me directly or indirectly to complete my thesis work.
Abstract
Interactive voice response (IVR) systems have been around for some time to help
guide customers to appropriate business units or information. However, with the use of
Internet technologies and wireless phones on the rise, coupled with the rapid
development in the speech recognition and speech synthesis technologies, new doors
for voice technology are opening to test demand in the marketplace. What’s more
convenient than picking up a phone? One can have instant access to the information
needed to make business operate more efficiently. Many businesses are betting that
consumers will embrace any technology that provides real-time access to information
piped through their regular telephone, wireless phone or voice-connected handheld
device.
A system in which the input and/or output are through a spoken, rather than a graphical, user interface is what we call an interactive voice response system, or simply an IVR system. The Web has made it possible to access information at the click of a mouse. In recent years the notion of a client has grown beyond desktop computers to devices such as telephones and mobile handsets. This is where voice control comes in.
Analyzing the requirements for developing voice systems, my dissertation work concentrates on how to develop an interactive voice response website. Voice Web technology makes use of open Internet standards (Web infrastructure) such as Hypertext Transfer Protocol (HTTP), Secure Sockets Layer (SSL), cookies and the Extensible Markup Language (XML) based VoiceXML for implementing voice services over the telephone. The system proposes a three-tier architecture. The client side consists of a telephone or cell phone connected to a Public Switched Telephone Network (PSTN). The middle tier consists of a voice server equipped with a VoIP gateway, which enables users of the PSTN to connect to the voice application running in the IP network. This voice server identifies the call made by users of the telephone network, initiates the voice application, presents the user with the required information and terminates the call when the user wants to exit from the application.
The application makes use of VXML to provide an efficient speech interface, grammar files written in the Java Speech Grammar Format (JSGF) to constrain recognition, and servlets to supply the information requested by the voice browser. The front end makes use of the VXML language, which consists of tags to recognize human input and record it for future use. VXML tags take input from the user in small phrases and send these parameters to back-end servlets. The servlets, written in Java, accept the parameters from the front end and use them to get the necessary information from the database server. The database server stores the information of an enterprise or institute in tables from which the necessary information can be retrieved and presented to users. The system can be used from any phone, anywhere. One does not have to put up with entering data on a tiny keypad; rather, one can interact with the service in a very natural manner.
The dissertation work aims at developing an IVR system for IIITM. It promises a good speech interface, to make the user feel comfortable interacting with the system, and an email reader that will read out emails so that one can listen to them rather than browsing through them.
Contents
Chapter 1 Introduction
1.1 Introduction to IVR systems
1.2 Typical voice applications
1.3 How to create and deploy IVR applications
1.4 How do users access IVR applications?
Chapter 4 Methodology
4.1 Implementation details
4.2 Application design and development
4.3 Development tools
4.3.1 VoiceXML
4.3.2 Servlets
4.3.3 JSGF
4.3.4 Oracle database
4.4 Speech interface design
4.4.1 Methodology
4.5 IVR development aspects
4.6 Deployment procedure
4.6.1 Working of the system
4.6.2 Practical issues in deploying IVR
4.6.3 Security issues
References
Chapter 1
1.1 Introduction
IVR systems, also called Voice Response Units (VRUs), automate the handling of calls by interacting with the user. They take input from the user in voice and provide enterprise information by connecting to one or more online databases. Popular IVR applications include bank-by-phone, flight schedule retrieval, and automated order entry and tracking. The common feature of these examples is that a caller's touch-tone or spoken requests are answered with verbal information derived from a "live" database. A significant percentage of installed IVR systems are used in front-end call centers to reroute calls away from costly live agents. Over time, IVR systems have evolved from simple systems accepting touch-tone input to advanced voice systems accepting near natural-language voice input. IVR systems are mostly used in applications that require little information from the user side and more information from the system side, so that the speech recognition engine is not burdened with large inputs that are difficult to recognize.
Voice applications typically fall into one of the following categories: queries and transactions.
Queries: In this scenario, a customer calls into a system to retrieve information from a
Web-based infrastructure. The system guides the customer through a series of menus
and forms by playing instructions, prompts, and menu choices using prerecorded audio
files or synthesized speech. The customer uses spoken commands or DTMF input to
make menu selections and fill in form fields. Based on the customer’s input, the
system locates the appropriate records in a back-end enterprise database. The system
presents the desired information to the customer, either by playing back prerecorded audio files or by synthesizing speech from the data retrieved from the database. Examples of this type of self-service interaction include applications or voice portals providing weather reports, movie listings, stock quotes, health-care-provider listings, and customer service information (Web call centers).
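A query interaction of this kind can be sketched in VoiceXML. The document below is only an illustrative sketch, not part of the thesis system: the menu options, prompt wording, grammar entries and servlet URL are all assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch of a query-style voice menu -->
  <menu>
    <prompt>Welcome. Say weather, movies, or stock quotes.</prompt>
    <choice next="#weather">weather</choice>
    <choice next="#movies">movies</choice>
    <choice next="#stocks">stock quotes</choice>
    <noinput>Sorry, I did not hear you. <reprompt/></noinput>
    <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  </menu>

  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <!-- Inline JSGF grammar listing the valid spoken answers -->
      <grammar type="application/x-jsgf">delhi | mumbai | gwalior</grammar>
      <filled>
        <!-- Send the recognized value to a back-end servlet (URL assumed) -->
        <submit next="http://example.com/servlet/weather" namelist="city"/>
      </filled>
    </field>
  </form>

  <!-- The movies and stocks forms would follow the same pattern -->
</vxml>
```

The caller either speaks a menu choice or presses a DTMF key; the recognized value is then passed to server-side logic that queries the back-end database.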
(ii). Although VoiceXML applications can be written using any text editor, one may find it more convenient to use a graphical development environment that helps to create and manage VoiceXML files. WebSphere Studio and the WebSphere Voice Toolkit support the development of VoiceXML-based applications. (Optional) A system administrator uses a Web application server program to configure and manage a Web server.
(iii). The developer publishes the VoiceXML application (including VoiceXML pages,
grammar files, any prerecorded audio files, and any server-side logic) to the Web
server.
(iv). The developer uses a desktop workstation and the Voice Server SDK to test the
VoiceXML application running on the Web server or local disk, pointing the
VoiceXML browser to the appropriate starting VoiceXML page.
(vi). The system administrator uses one of the deployment platforms to configure,
deploy, monitor, and manage a dedicated Voice Server.
(vii). The developer uses a real telephone to test the VoiceXML application running on
the Voice Server.
(i). A user dials the telephone number provided to the application. The Voice Server
answers the call and executes the application referenced by the dialed phone number.
(ii). The Voice Server plays a greeting to the caller and prompts the caller to indicate what information he or she wants. The application can use prerecorded greetings and prompts or synthesize them from text using the text-to-speech engine. If the application supports barge-in, the caller can interrupt the prompt if he or she already knows what to do.
(iii). The application waits for the caller’s response for a set period of time. The caller can respond either by speaking or by pressing one or more keys on a DTMF telephone keypad, depending on the types of responses expected by the application. If the response does not match the criteria defined by the application (such as a specific word, phrase, or digits), the voice application can prompt the caller to enter the response again, using the same or different wording. If the waiting period has elapsed and the caller has not responded, the application can prompt the caller again, using the same or different wording.
(iv). The application takes whatever action is appropriate to the caller’s response. For example, the application might update information in a database, retrieve information from a database and speak it to the caller, store or retrieve a voice message, launch another application, or play a help message. After taking action, the application prompts the caller with what to do next.
(v). The caller or the application can terminate the call. For example, the caller can
terminate the interaction at any time, simply by hanging up; the Voice Server can
detect if the caller hangs up and can disconnect itself. If the application permits, the
caller can use a command to explicitly indicate that the interaction is over (for
example, by saying “Exit”). If the application has finished running, it can play a
closing message and then disconnect.
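The call flow in steps (i) through (v) can be sketched as a single VoiceXML dialog. This is a minimal illustration, not the thesis application; the timeout value, prompt wording, grammar options and servlet URL are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- step (iii): how long to wait for the caller's response (value assumed) -->
  <property name="timeout" value="5s"/>
  <form id="main">
    <block>
      <!-- step (ii): greeting; barge-in lets the caller interrupt the prompt -->
      <prompt bargein="true">Welcome to the information service.</prompt>
    </block>
    <field name="request">
      <prompt>What would you like to do?</prompt>
      <grammar type="application/x-jsgf">balance | schedule | exit</grammar>
      <!-- waiting period elapsed: reprompt with different wording -->
      <noinput>Are you still there? Please say balance, schedule, or exit.</noinput>
      <!-- response did not match the grammar: ask again -->
      <nomatch>Sorry, I did not catch that. <reprompt/></nomatch>
      <filled>
        <if cond="request == 'exit'">
          <!-- step (v): explicit exit command plays a closing message and disconnects -->
          <prompt>Goodbye.</prompt>
          <disconnect/>
        <else/>
          <!-- step (iv): act on the response via server-side logic (URL assumed) -->
          <submit next="http://example.com/servlet/handle" namelist="request"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```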
Chapter 2
2.1 Problem Definition
Until recently, the World Wide Web has relied exclusively on visual interfaces to
deliver information and services to users via computers equipped with a monitor,
keyboard, and pointing device. In doing so, a huge potential customer base has been
ignored: people who (due to time, location, and/or cost constraints) do not have access
to a computer.
Even if a website is made perfectly dynamic using many technologies, users cannot find it fully convenient, since it requires them to sit at a fixed terminal to access the required information. This is not possible for mobile users, who cannot perform a transaction or get the desired information through a desktop PC. What they want is to be able to do it from anywhere, through any network: the PSTN, the Internet, or a mobile network.
The existing IP architecture provides poor quality of service for voice transfer, for the following reasons.
(i). The existing network makes use of the connectionless, unreliable Internet Protocol, and hence there is no guarantee that a packet will arrive at the destination. Retransmission is not feasible when transferring voice signals through the network, so packets lost in transit due to congestion cannot be recovered.
(ii). Long propagation delays in an unreliable, congested network make listening to voice ineffective.
(iii). Packets may arrive out of order, as they take different routes through the network, which leads to a sequencing problem. Out-of-sequence packets are not acceptable in the transfer of voice signals.
(iv). The devices in the network cause an unpredictable amount of delay between packets, which is called jitter. Large jitter causes packets to reach the destination with unpredictable delay, which leads to poor voice quality.
(v). In dealing with voice, there should be some mechanism to cancel the echo created as the voice travels through the network.
to implement the above-mentioned functions. Develop an interface between the voice
server and the web server.
Interactive voice response websites make the information available on the World Wide Web (WWW) accessible from public telephones and cell phones.
• Interactive Voice Response (IVR) enabled websites make the information on the World Wide Web reachable even from telephones and cell phones. This allows the user to get information easily, round the clock, just by dialing the particular server from their handset.
• IVR systems enable users to perform different types of transactions easily, e.g. checking bank balances or transferring money.
• IVR systems let users check their email with just a telephone. The system takes the necessary information from the caller and reads out the messages intended for them.
• IVR systems are especially useful in call centers, to respond to customers in voice and transfer calls to other information systems.
The flaw in existing websites is that they are not voice interactive. By making a website voice interactive, one can provide information in voice; presenting information in voice has many advantages, some of which are mentioned above. Especially for information queries, where a client sends little information as a request and receives much information from the server side, a voice interactive system is very helpful: all one has to do is speak short query phrases and listen to the required information.
Taking the flaws prevailing in the existing system into consideration, one would develop a system which can interact effectively with the user in voice and provide information in a form the user finds comfortable. If the information is made available to telephones and cell phones, it will be all the more advantageous and help in the substantial growth of the organization.
Interactive Voice Response (IVR) applications enable callers to query and modify database information over the telephone, using their own speech or by dialing digits. Callers can use the touch-tone pad to input requests or just say what they want to do, such as ordering a product, obtaining a work schedule, or requesting account balance information, and the database speaks information back to the caller using text-to-speech. IVR offers customers and businesses a new level of freedom by enabling them to conduct transactions 24 hours a day, seven days a week. Businesses of all sizes are realizing the tremendous benefits of IVR applications for their call processing and information delivery needs. IVR functionality links a phone system to a database to provide customers with 24-hour immediate access to account information via telephone. For example, a bank could make up to 10 data fields available for a caller’s checking account, 10 data fields for his or her savings account, and so on. To ensure security, IVR can be set up to allow the caller access to account information only if the caller enters a valid account number and corresponding personal identification number.
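The account-number-and-PIN check described above might be sketched in VoiceXML as follows. The field names, prompt wording and servlet URL are illustrative assumptions, not taken from any real banking system.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch of the account-plus-PIN security check -->
  <form id="login">
    <field name="account" type="digits">
      <prompt>Please say or enter your account number.</prompt>
    </field>
    <field name="pin" type="digits">
      <prompt>Please enter your personal identification number.</prompt>
    </field>
    <filled>
      <!-- A back-end servlet validates the pair before granting access (URL assumed) -->
      <submit next="http://example.com/servlet/login" namelist="account pin"/>
    </filled>
  </form>
</vxml>
```

Only after the servlet confirms the pair would the application expose the caller's account fields.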
IVR allows full connectivity to the most popular databases, including Microsoft Access, Microsoft Excel, Microsoft FoxPro and dBase. One can read information from and write information to databases, as well as query databases and return information. The application files can reside on the local system, an intranet, or the Internet. Users can access the deployed applications anytime, anywhere, from any telephony-capable device, and the applications can be designed to restrict access only to those who are authorized to receive the information.
“Voice-enabling the World Wide Web” does not simply mean using spoken commands to tell a visual browser to look up a specific Web address or go to a particular bookmark, having a visual browser throw away the graphics on a traditional visual Web page and read the rest of the information aloud, or converting the bold or italics on a visual Web page to some kind of emphasized speech. Voice applications provide an easy and novel way for users to surf or shop on the Internet: “browsing by voice.”
Users can interact with Web-based data (that is, data available via Web-style architecture such as servlets, ASPs, JSPs, JavaBeans, CGI scripts, etc.) using speech rather than a keyboard and mouse. The form that this spoken data takes is often not identical to the form it takes in a visual interface, due to the inherent differences between the interfaces. For this reason, transcoding (that is, using a tool to automatically convert HTML files to VoiceXML) may not be the most effective way to create voice applications. The IVR platform executes any created application when a caller dials in and allows callers to interact with the system using both human speech and DTMF. Advanced database technology permits reading, writing, appending, searching and seeking database information.
It must be noted that voice-based navigation can get complex. When implementing information services on a web server, one can include a wealth of information on a page and overlapping paths to resources, to make sure users reach their destination whatever their approach to searching for it. In voice applications it becomes more important to define the information clearly. Voice data is transient; it depends on the user's memory and ties in much more closely with preconceptions and experience. Finally, our ability to focus on any one voice source among many is limited. The need to avoid ambiguity in the question-and-answer pattern of voice interaction can lead to very complex systems, and it is very difficult to maintain location information: keeping the user aware of where they are in the application, and how that relates to other parts of the application such as the home page. It is characteristic of unpopular applications that the user feels lost and out of control.
The growing awareness of catering for a variety of needs and devices has highlighted the importance of voice-controlled services, and also the importance of making them usable. Voice entry of textual data is much easier than using a phone keypad. Current developments in wireless technology and increases in processor speed have made speech applications a reality. With powerful servers for speech processing and wireless-based thin clients such as mobile phones and PDAs, it is now possible to interact with the user using audio input and output.
In addition to all this, VoiceXML has made the dream of developing voice Web applications come true. VoiceXML is a relatively new XML-based specification designed for developing voice applications over the Web. It has its roots in VXML, a language designed by Motorola, another specification for presenting services and data in the voice medium.
(i) Weather information
In the US most weather information applications are automated using IVR systems. User queries about the weather are automatically answered by the simulated voice generated by the TTS engine.
(ii) Online purchasing
In online purchasing, any queries regarding the items are answered by the automated voice.
(iii) Railway enquiry
Information on the arrival and departure of trains and on reservation availability can all be obtained from the automated response system. The caller speaks the train number and the source and destination stations; the IVR system automatically generates a query on the database, retrieves the information, and speaks the reservation availability out loud.
(iv) Telemedicine
IVR systems have now even entered the field of medicine. An electrocardiogram monitor collects the patient's ECG data and transmits it over a regular telephone line.
(v) Tele-education
Education from distant places (tele-education) is now possible since IVR systems came into the picture.
This service is available to callers using touch-tone telephones. If your phone makes a
different tone each time you press a number, then your phone is a touch-tone phone. If
you hear no tone or a number of clicks with each press of the numbers, then your
phone is operating in pulse mode and cannot access the system. You can purchase a
special adaptor from Telstra, which will enable you to use the system. Some
telephones have a switch or button, which allows you to change the mode to touch-
tone mode.
Calls are charged at the minimum rate of 35 cents per minute regardless of where you
call from, plus an initial charge of 15 cents. A higher charge will be incurred from
mobile telephones and public telephones. Students living overseas can access the
system by dialing 0055 31706 (preceded by the Australian International Code). The
rate is 75 cents per minute, plus the International access rate, which varies from
country to country.
(vi) Automatic call answering in call centers
Before the voice Web came into the picture, call centers used to spend a lot of money on call operators to answer user queries. Now a call center operating fully on the voice Web has the advantage of low cost, thanks to TTS and voice recognition engines.
Emerging Digital Concepts (EDC) is developing solutions for clients using a number of leading speech recognition technologies, including SpeechWorks and Nuance. These technologies are applied on some of the state-of-the-art hardware available today, including Natural MicroSystems and Dialogic.
TigerJet Network provides integrated software and silicon solutions for network communication applications.
TigerJet's Gateway Manager application lets you implement your own private VoIP gateway for Internet-to-regular-phone calls. You can place a call using your own regular phone line from anywhere in the world.
IP Phone integrates all popular choices for making Internet phone calls in a single, easy-to-use application with a central "one stop" interface.
Nuance is the leader in Voice Web software: speech recognition, voice authentication, text-to-speech and voice-browsing products that make the information and services of enterprises, telecommunications networks and the Internet accessible from any telephone. SRI International, one of the leading voice technology research entities throughout the 1980s and 1990s, established Nuance as an independent company. Nuance offers its products through industry partners, platform providers, and value-added resellers around the world.
SRC Telecom. On 4 June 2001, SRC Telecom, the telephony-based speech recognition arm of SRC (The Speech Recognition Company), announced that it is offering a VXML (VoiceXML) application hosting service. SRC has installed a VXML platform that will provide third parties with the first Europe-based application hosting environment.
third parties signifies SRC Telecom’s leadership in delivering the latest telephony
speech solutions.”
By developing applications in VXML, organizations can benefit from the many advantages associated with an open-standards-based development environment. Most notably, VXML provides significant efficiencies during the application design process, ensures ease of software maintenance, and allows greater portability of applications. However, the development of speech applications that achieve high end-user acceptance still requires substantial expertise in human factors engineering, dialogue design and speech systems integration.
JSGF provides a way to define the grammar files that help the system check whether the user input is valid. One can declare small phrases or words as options in the grammar, and the user is required to speak one of these options in order to make a selection. Voice servers are being developed by many companies to make it possible to deploy applications reachable from both the PSTN and IP networks. One of the most popular among them is the voice server developed by IBM. It has many versions, which run on Windows 2000, Windows NT 4.0 and even on Linux. Taking into consideration the efficiency of Windows NT 4.0 and the simplicity of its user interface, the voice server for Windows NT 4.0 is the best option. The latest versions support new tags that help generate simulated voice comparable with the human voice. Java is one of the tools for developing server applications at the back end. Servlets, which make use of Java and special APIs designed for various tasks, work satisfactorily to process requests from the voice browser.
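As a sketch of how such a grammar file might look, consider the following JSGF fragment. The grammar name, rule names and word options below are hypothetical, not taken from the thesis system.

```jsgf
#JSGF V1.0;

// Hypothetical grammar: the caller must speak one of these options
grammar menu;

public <choice> = account | email | timetable | exit;

// A rule can also mix optional words with alternatives:
// matches "check my balance", "my balance", or just "balance"
public <balance> = [check] [my] balance;
```

At run time, only utterances matching an active public rule are accepted as valid input; anything else triggers a no-match event.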
Using the above-mentioned tools (the voice server, which includes the voice browser, the voice recognition engine and the TTS engine, together with the Java Speech Grammar Format), the dissertation work creates a voice application that promises an efficient speech user interface and a user-friendly environment.
2.6 Conclusion
This chapter chronologically analyzes the tools and technologies for developing voice applications. It examines the various tools and finds VXML to be the strongest candidate for the front end of future IVR systems. Using IBM's Windows-supported Voice Server development kit, the Java Speech Markup Language and Java servlets at the back end, one can develop a full-fledged voice application that can be deployed on Web architecture.
2.7 Chapterization
The subsequent chapters deal with the system architecture, problem formulation, and
designing and development aspects of the system.
Chapter 3
[Figure: three-tier architecture showing the voice gateway, VoIP server and Web server]
(ii) VXML gateway.
(iii) Voice Server.
(iv) Web application Server.
(v) Database Server.
3.1.2 VoiceXML Gateway
The purpose of a gateway is to transfer data between two networks that adopt different protocols and different data formats. A VoIP gateway is used to connect the PSTN and an IP network. The IP network makes use of the TCP/IP protocol suite, which transfers data in packet format. The PSTN transmits raw data bits through the network, a completely different data format from that of a TCP/IP network; it uses signaling and switching processes (a control plane and a data plane) in two-layer switches. The VoIP gateway emulates the telephone network in the IP network.
A VoIP gateway consists of a series of digital signal processors (DSPs) which perform the following functions. Whenever the user lifts the phone, it is the function of the gateway to generate and detect the tones of the different DTMF inputs, that is, the destination number. A routing server maps this number to an Internet address to identify the destination node.
[Figure: bank of DSPs inside the VoIP gateway]
3.1.3 Voice Server
The Voice Server mainly consists of a speech recognition engine, a text-to-speech engine and a DTMF simulator, as shown in Figure 4.1.
Speech recognition is the ability of a computer to decode human speech and convert it to text. To convert spoken input to text, the computer first parses the input audio stream and then converts that information to text output. The process of recognition takes place as follows. One creates a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specifies where each grammar should be active within the application. When the application runs, the speech recognition engine processes the incoming audio signal and compares the sound patterns to the patterns of basic spoken sounds, trying to determine the most probable combination that represents the audio input. Finally, the speech recognition engine compares the sounds to the list of words and phrases in the active grammar(s). Only words and phrases in the active grammars are considered as possible speech recognition candidates. Any word for which the speech recognizer does not have a pronunciation is given one and is flagged as an unknown word.
The key determinants of speech recognition accuracy are audio input quality, interface design and grammar design. The quality of audio input is influenced by several factors: the choice of input device, e.g. a microphone connected to a desktop workstation (for applications deployed using the WebSphere Voice Server, the input device could be a regular telephone, cordless telephone, speakerphone, or cellular telephone); the speaking environment, which could be in a car, outdoors, in a crowded room, or in a quiet office; and certain user characteristics such as accent, fluency in the input language, and any atypical pronunciations. While many of these factors may be beyond your control, one should nevertheless consider their implications when designing applications.
Users will achieve the best possible speech recognition with a high-quality input
device that gives good signal-to-noise ratio. For desktop testing, use one of the
microphones listed at http://www.ibm.com/viavoice. Speech clarity is a significant
contributor to audio quality. Adult native speakers who speak clearly (without over-
enunciating or hesitating) and position the microphone or telephone properly achieve
the best recognition; other demographic groups may see somewhat variable
performance.
The design of the application interface has a major influence on speech recognition accuracy. Since only words, phrases, and DTMF key sequences from active grammars are considered as possible speech recognition candidates, what one chooses to put in a grammar, and when one chooses to make that grammar active, have a major impact on speech recognition accuracy.
Text-to-speech conversion is the ability of a computer to “read out loud” (that is, to
generate spoken output from text input). Text-to-speech is often referred to as TTS or
speech synthesis. To generate synthesized speech, the computer must first parse the
input text to determine its structure and then convert that text to spoken output. One
can improve the quality of TTS output by using the speech markup elements provided
by the VoiceXML language, which is described later in the subsequent chapters. TTS
prompts are easier to maintain and modify than recorded audio prompts. For this
reason, TTS is typically used during application development.
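As an illustration of the speech markup elements mentioned above, VoiceXML 1.0 provides tags such as <sayas>, <break> and <emp> for shaping TTS output. The prompt below is a hypothetical sketch; the wording and values are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        Your balance is
        <!-- read the number as a currency amount, not digit by digit -->
        <sayas class="currency">1250.75</sayas>.
        <!-- insert a pause before the closing phrase -->
        <break size="medium"/>
        <!-- emphasize the important word -->
        <emp>Thank</emp> you for calling.
      </prompt>
    </block>
  </form>
</vxml>
```

Without such markup, the synthesizer would have to guess at the pronunciation and phrasing from the raw text alone.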
specifies the -Dvxml.gui=false Java system property when starting the VoiceXML browser. If the DTMF Simulator GUI window is closed, the only way to restart it is to stop and restart the VoiceXML browser. The DTMF Simulator, together with a desktop microphone and speakers, takes the place of a telephone during desktop testing, allowing developers to debug VoiceXML applications without having to connect to telephony hardware and the PSTN (Public Switched Telephone Network) or cellular GSM (Global System for Mobile Communications) network.
Using the DTMF Simulator, one can simulate a telephone keypress event by pressing the corresponding key on the computer keyboard or clicking the corresponding button on the DTMF Simulator GUI, shown in Figure 4.1. For example, if the application prompt is “Press 5 on your telephone keypad,” one can simulate a user response during desktop testing by clicking the 5 button on the DTMF Simulator GUI or pressing the 5 key on the computer keyboard while the cursor focus is in the DTMF Simulator GUI window. The VoiceXML browser will interpret the input as a 5 pressed on a DTMF telephone keypad. If the length of valid DTMF input strings is variable, use the # key to terminate DTMF input.
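A dialog that accepts variable-length DTMF input terminated by # might be sketched as follows. This is an assumption-laden illustration: the field name and prompt wording are invented, and the termchar property is assumed to be supported by the deployed VoiceXML browser.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch: variable-length digit entry ended by the # key -->
  <form id="account">
    <!-- declare # as the key that terminates variable-length DTMF input -->
    <property name="termchar" value="#"/>
    <field name="account_number" type="digits">
      <prompt>
        Please enter your account number, followed by the pound key.
      </prompt>
      <filled>
        <prompt>You entered <value expr="account_number"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

During desktop testing, the same dialog can be driven entirely from the DTMF Simulator GUI instead of a real telephone keypad.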
Interactions with Text-to-Speech and Speech Recognition Engines
During initialization, the VoiceXML browser starts the TTS and speech recognition
engines. The VoiceXML browser uses telephony acoustic models in order to simulate
the behavior of the final deployed telephony application as closely as possible in a
desktop environment. As the VoiceXML browser processes a VoiceXML document, it
plays audio prompts using text-to-speech or recorded audio; for text-to-speech output,
it interacts with the TTS engine to convert the text into audio. Based on the current
dialog state, the VoiceXML browser enables and disables speech recognition
grammars. When the VoiceXML browser receives user audio input, the speech
recognition engine decodes the input stream, checks for valid user utterances as
defined by the currently active speech recognition grammar(s), and returns the results
to the VoiceXML browser. The VoiceXML browser uses the recognition results to fill
in form items or select menu options in the VoiceXML application. If the input is
associated with a <record> element in the VoiceXML document, the VoiceXML
browser stores the recorded audio. As the VoiceXML browser makes transitions to
new dialogs or new documents, it enables and disables different speech recognition
grammars, as specified by the VoiceXML application. As a result, the list of valid user
utterances changes. If the VoiceXML browser encounters an <ibmlexicon> element in
a VoiceXML document, it interacts with the speech recognition and TTS engines to
add or change the pronunciation of a word for the duration of the current VoiceXML
browser session.
3.1.4 Interactions with the Web Server and Enterprise Data Server
VoiceXML applications can be stored on any Web server running on any platform. In this work, the Java Web Server is used to answer the requests generated by the VXML documents. When started, the VoiceXML browser sends an HTTP request over the LAN or Internet to request an initial VoiceXML document from the Web server. The requested VoiceXML document can contain static information, or it can be generated dynamically from data stored in an enterprise database using the same type of server-side logic (CGI scripts, Java Beans, ASP, JSP, Java Servlets, etc.) that is used to generate dynamic HTML documents.
The VoiceXML browser interprets and renders the document. Based on the user’s
input, the VoiceXML browser may request a new VoiceXML document from the Web
server, or may send data back to the Web server to update information in the back-end
database. The important thing is that the mechanism for accessing your back-end
enterprise data does not need to change.
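To sketch this round trip (the servlet URI and field name here are hypothetical, not the application's actual code), a VoiceXML document can send collected data back to the server-side logic with <submit>:

```xml
<form id="lookup">
  <field name="studentname">
    <prompt>Please say the student name.</prompt>
    <!-- External JSGF grammar listing the valid student names. -->
    <grammar src="students.gram" type="text/x-jsgf"/>
  </field>
  <block>
    <!-- The servlet uses the submitted value to query the database and
         returns a dynamically generated VoiceXML document in reply. -->
    <submit next="http://127.0.0.1:8080/servlet/StudentInfo"
            namelist="studentname"/>
  </block>
</form>
```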
3.1.5 Web Server
The Web server runs the application logic, and may contain a database or an interface to an external database or transaction server.
Chapter 4 Methodology
4.1 Implementation Details
After identifying the system components and their details, the present chapter discusses the implementation of the IVR application. The application uses VoiceXML at the front end. The VoiceXML documents run on a speech or voice browser. This voice browser executes the tags one by one in the order specified by the form interpretation algorithm, which identifies the form elements and invokes speech recognition engine or TTS engine function calls to execute the tags. If the application requires dynamic data to be extracted from the database, it sends a request to a servlet program. Servlets use database connectivity to supply the necessary data to the voice browser, which uses the TTS engine to convert this data into voice that is spoken aloud. Servlets run on a web application server; the application uses the Java Web Server for this purpose. Database information is stored in the form of tables in the database server. The Oracle 8i database server proved efficient and convenient for storing this data.
[Figure 4.1 is a block diagram showing the DTMF Simulator, speech recognition engine, TTS engine, VXML application, web server, and enterprise application database with its tables, connected by the numbered data flows listed in the caption.]
Fig 4.1 Application components and data flow: (1) voice in, (2) audio or synthesized speech output, (3) VoiceXML via HTTP over LAN or Internet, (4) DTMF in, (5) database connectivity.
(i). Information regarding the institute establishment and the institute profile.
(v). Eligibility and selection criteria for various courses of IIITM for the students
and faculty.
(vii). Exit from the site, if the user wants to come out of the system at any stage.
For development, VXML and JSGF are used at the front end, and servlets and Java at the back end, in order to form a robust and flexible system. Grammar files are developed using the Java Speech Grammar Format (JSGF), together with threads and JDBC.
4.3.1 VoiceXML
VXML is XML-based markup language for creating distributed voice applications,
much as HTML is a markup language for creating distributed visual applications.
VoiceXML supports dialogues that feature, spoken input, DTMF (telephone key)
input, recording of spoken input, synthesized speech output ("text-to-speech"), pre-
recorded audio output. VoiceXML makes building speech applications easier, in the
same way that HTML simplifies building visual applications.
These files define the voice user interaction and dialog flow control.
Grammar Files define the valid commands that are allowed during the voice
interaction. Grammar can be defined at the development stage or generated
dynamically at the run time. Audio Files are prerecorded audio files that are played
back, or the recordings of the user’s input. VoiceXML language provides features for
four major components of Voice Web: voice dialogs, platform control, telephony,
performance. Each VoiceXML document consists of one or more dialogs. The dialog
features cover the collection of input, generation of audio output, handling of
asynchronous events, performance of client-side scripting and dialog continuation.
Telephony features include simple connection control (call transfer, adding a third party, call disconnect) and telephony information such as Automatic Number Identification (ANI) and Dialed Number Identification Service (DNIS).
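For illustration, a minimal VoiceXML document exercising the simplest of these features, audio output, might look as follows (the prompt text is invented for this sketch):

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A single dialog containing one executable block. -->
  <form id="welcome">
    <block>
      <!-- Rendered through the TTS engine as synthesized speech. -->
      <prompt>Welcome to the institute information service.</prompt>
    </block>
  </form>
</vxml>
```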
VoiceXML Concepts
The user is always in one conversational state, or dialog, at a time. Each dialog determines which dialog will be transitioned to next. Transitions are specified using URIs (Uniform Resource Identifiers), which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a specific dialog, the first dialog in the document is assumed. Dialog execution is terminated when a dialog does not specify a successor, or when it has an element that explicitly exits the conversation.
Dialogs are of two kinds: forms and menus. Forms define an interaction that collects values for a set of field-item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
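A menu of this kind might be sketched as follows (the option names and target document URIs are illustrative only):

```xml
<menu>
  <prompt>Say institute information, selection criteria, or achievements.</prompt>
  <!-- Each choice names the document to transition to when matched. -->
  <choice next="institute.vxml">institute information</choice>
  <choice next="selection.vxml">selection criteria</choice>
  <choice next="achievements.vxml">achievements</choice>
  <!-- Replay the options if the utterance matches none of the choices. -->
  <nomatch>Please choose one of the listed options. <reprompt/></nomatch>
</menu>
```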
Subdialogs are function-like reusable components that can be used for standard
reusable dialog interfaces, like collecting credit card numbers. At the end of execution
of a subdialog, the control returns to the dialog from where it was invoked and returns
the fields that were collected.
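A subdialog invocation might be sketched like this (the file name, anchor, and variable name are hypothetical):

```xml
<form id="payment">
  <!-- Control passes to the named dialog in creditcard.vxml and returns
       here, with the collected fields, when that dialog completes. -->
  <subdialog name="card" src="creditcard.vxml#getcard">
    <filled>
      <prompt>Thank you. Your card details have been recorded.</prompt>
    </filled>
  </subdialog>
</form>
```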
Grammars: each dialog has one or more speech and/or DTMF grammars (valid commands) associated with it. Each dialog's grammars are active only when the user is in that dialog. Some dialogs can be flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or in another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, the application transitions to the new dialog and treats the user's utterance as if it had been said in that dialog.
Events and event handlers: elements inherit the event handlers of their enclosing elements “as if by copy.” In this way, common event handling behavior can be specified at any level and applied to all lower levels.
4.3.2 Servlets
Java Servlets are the key component of server side programming. A servlet is a small
puggle extension to the server that enhances the servers functionality. Servlets are
server side programmes, which run on the web servers to provide the requested
information by the users. Servlets make use of JDBC concepts to connect to the
database where the actual information of the enterprise is stored
(i) Efficient: With traditional CGI, a new process is started for each HTTP request. If
the CGI program does a relatively fast operation, the overhead of starting the process
can dominate the execution time. With servlets, the Java Virtual Machine stays up, and
each request is handled by a lightweight Java thread, not a heavyweight operating
system process.
(ii) Powerful: Java servlets let you easily do several things that are difficult or impossible with regular CGI. For one thing, servlets can talk directly to the Web server (regular CGI programs can't). Servlets can also share data among each other, making useful things like database connection pools easy to implement. They can also maintain information from request to request, simplifying things like session tracking and caching of previous computations. Servlets are written in Java and follow a well-standardized API. Servlets are supported directly or via a plug-in on almost every major Web server.
• | to separate alternatives
• [] to enclose optional words, phrases, or rules
• () to group words, phrases, or rules
• * to indicate that the previous item may occur zero or more times
• + to indicate that the previous item may occur one or more times
For example:
#JSGF V1.0;
grammar employees;
public <name>= Jonathan | Larry | Susan | Melissa;
An inline grammar is specified directly in the VXML document:
<grammar>
request | path | query | server | remote user | backup | exit
</grammar>
VoiceXML browser also uses JSGF as the DTMF grammar format. For example, the
following code snippet defines an inline DTMF grammar that allows the user to make
a selection by pressing the numbers 1 through 4, the asterisk, or the pound sign on a
telephone:
<dtmf type="text/x-jsgf">
1 | 2 | 3 | 4 | “*” | “#”
</dtmf>
4.4.1 Design Methodology
Developing speech user interfaces, like most development activities, involves an
iterative 4-phase process: “Design Phase”, “Prototype Phase”, “Test Phase”,
“Refinement Phase”.
Design Phase: In this phase, the goal is to define proposed functionality and create an
initial design. This involves the following tasks: “Analyzing Your Users”,
“Analyzing User Tasks” , “Making High-Level Decisions” , “Making Low-Level
Decisions” , “Defining Information Flow” , “Identifying Application Interactions” ,
“Planning for Expert Users” .
Prototype Phase: The goal of this phase is to create a prototype of the application,
leaving the design flexible enough to accommodate changes in prompts and dialog
flow in subsequent phases of the design.
For the first iteration, use the technique known as “Wizard of Oz” testing. This technique can be used before coding begins, as it requires only a paper prototype script and two people: one to play the role of the user, and a human “wizard” to play the role of the computer system.
Test Phase: After incorporating the results of the “Wizard of Oz” testing, code and test
a working prototype of the application. During this phase, be sure to analyze the
behavior of both new and expert users.
Refinement Phase: During this phase, update the user interface based on the results of testing the prototype. For example, revise prototype scripts, add tapered prompts and customizable expertise levels, create dialogs for inter- and intra-application interactions, and prune out dialogs that were identified as potential sources of user interface breakdowns. Finally, iterate the Design, Prototype, Test, Refine process, including both new and expert users in the Test phase.
(i) Create the necessary VXML files to understand user input. Using JSGF, create a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specify where each grammar should be active within the application.
(ii) Pass the parameters collected from the user to servlets by specifying the URI. Uniform Resource Identifiers are used to specify the path where the servlet is located. These URIs are specified in the <submit> tag, which submits the parameters to the servlets.
(iii). Create servlets using Servlet API to receive the parameters from the <submit>
tags of VXML files.
(v). Collect the information from the database and pass it through VXML tags like
<prompt> or <block>, which can read the text out loud.
The same procedure is applied to develop the code that processes the different options chosen by the user. Special call-recognizing tags are used in order to deploy the application in a real-time environment. In an ordinary PSTN network, the central office is responsible for generating the dial tone and establishing a connection between the source and destination devices.
A gateway emulates a central office, providing: signaling (dial tone, call set-up, etc.; H.323, MGCP, SS7), conversion to IP (often Ethernet), compression (G.711, G.723.1, etc.), echo cancellation, and quality of service (QoS).
When a user places a call to the voice server using a telephone or cell phone, the voice server automatically recognizes the call with the help of the VoIP gateway and starts executing the application root document. The user selects a choice after hearing the options provided by the application. The speech recognition engine processes the incoming audio signal and compares the sound patterns to the patterns of basic spoken sounds, trying to determine the most probable combination that represents the audio input. Finally, the speech recognition engine compares the sounds to the list of words and phrases in the active grammar(s). Only words and phrases in the active grammars are considered as possible matches. With present technologies, understanding long sentences is quite difficult compared to short phrases.
After the prerequisites were met, all the class files were copied into the servlets directory of the Java Web Server, and all VXML files into a folder named “Thesiscode” on the C: logical partition of the hard disk.
To run the application root document in the desktop simulated environment, open the command prompt and go to the directory where the VXML documents are stored. From there, run the voice browser and execute the file by typing:
<path to the voice browser, i.e. vsaudio> root_iiitm.vxml
For example, if the voice server is installed on the C: partition, the command is “c:\voices~1\bin\vsaudio” root_iiitm.vxml. To run the application in text mode, replace “vsaudio” with “vstext”; this is mostly used for debugging the application. After executing this command, the application runs in a user-friendly manner, so that the user can easily identify where he is in the application.
For example, suppose one wants information about the institute's establishment. It can be obtained by first selecting the institute information option and then the establishment option. After hearing the establishment information, the user is again given a set of choices. For further information regarding M.Tech student selection, select the students selection criteria option and then the M.Tech students choice.
Selecting the first option provides information regarding the institute. Here the user is given choices about what information he would like to have: the institute's establishment, facilities, profile, students database, or faculty database. By providing information such as a student's name and group, the user can hear the complete details of that particular student.
Selecting the second option provides information about the recruitment process for students and faculty at IIITM. The user selects the group of interest (M.Tech, MBA, or IPG) and hears the recruitment procedure for that particular branch.
Selecting the third option announces the achievements of the institute. These include summer placement information, final placement information, and the cultural events held at the institute each year.
Selecting the fourth option gives information about which student was selected by which company for summer and final placements. For this, the user has to supply the student's name and group.
To select the fifth option, the user should first register with the system as a member. To do this, open http://127.0.0.1:8080/registration.html in the Internet Explorer or Netscape browser, fill in the required information, and submit it to the server. The user then receives a congratulations message along with a PIN number, which is supplied to the system. This PIN number is used later to check mail through the email reader. All members should have a POP mail account on any POP mail server such as Yahoo or Hotmail. The POP mail account information (user id and password) should be given so that the system can connect to the POP mail account, get new mail information, and read the mails intended for the user. After supplying the PIN number, the user can hear from the system how many new mails he has and have the mail of interest read out.
4.6.2 Practical issues faced for deployment of the IVR system
(i) VoIP gateway: Developing a VoIP gateway requires a lot of infrastructure, such as DSP modules, and the implementation of protocols like SIP and MGCP, which was practically impossible to complete within the short period of this dissertation work.
(ii) Voice server: A voice server is equipped with a voice browser, a TTS engine, and a speech recognition engine. Given the limited time, the voice server model developed by IBM was used. However, it was not flexible in its functioning, as it was still in the development stage; only some of the functions incorporated in the voice server were used to develop the voice application. As VoIP is not available at present, it is difficult to adapt this application to the PSTN, so it was simulated in the desktop environment.
(iii) Dealing with speech recognition errors: There are three basic types of recognition errors. In the first, the speech recognition engine returns a result that does not match what the user actually said. This can have many causes, including:
• The audio quality is poor.
• Multiple choices in the active grammars sound similar, such as “Newark” and “New York” in a grammar of United States airports.
• The user utterance was not in any of the active grammars, but something from an active grammar sounded similar.
• The user has a strong or unusual accent.
• The user paused before finishing the intended utterance.
In the second, the speech recognition engine did not understand what the user said well enough to return anything at all. This type of error can occur in situations similar to those described above.
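In VoiceXML, errors of these kinds are typically caught with event handlers such as <nomatch> (an utterance outside the active grammars or a low-confidence result) and <noinput> (no speech detected); a hedged sketch with an invented field:

```xml
<field name="airport">
  <prompt>Which airport?</prompt>
  <grammar src="airports.gram" type="text/x-jsgf"/>
  <!-- The recognizer returned something, but nothing in the grammar matched. -->
  <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  <!-- The recognizer heard nothing at all. -->
  <noinput>I did not hear anything. <reprompt/></noinput>
</field>
```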
All of these practical issues were taken into consideration in developing the application.
Security in voice applications can be implemented at two different levels. One is at the
infrastructure level, involving the telephony network and Internet infrastructure. Most
VoiceXML browsers support the existing Web security infrastructure. They support
SSL and cookies to help manage security between the voice server and the Web server.
Communications may be secured with authentication, encryption, and data integrity
measures using existing telephony security technologies. Second is at the application
level, which can be implemented in any of the following three ways:
(i). The user id/password approach in which the application prompts for a user id and
pin code. In most cases, the user is asked to key in the entries instead of speaking (to
avoid overhearing).
(ii). The telephone number identifies the user id. In this approach, the user simply
enters his pin code, reducing the complexity. It is implemented in this application.
Most of the VoiceXML interpreters can identify the incoming phone number.
(iii). Speech verification (Voice Biometrics) authenticates the user, excluding the need
of PIN based verification. Here, the voiceprint samples are stored in the database at
the time the account is set up, to be compared against at the time of authentication.
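The second approach, the one implemented in this application, can be sketched as a DTMF-only PIN field (the servlet URI and form structure below are illustrative, not the application's actual code):

```xml
<form id="login">
  <field name="pin">
    <prompt>Please key in your pin number, followed by the pound sign.</prompt>
    <!-- Keyed rather than spoken entry avoids the PIN being overheard;
         "#" terminates the variable-length digit string. -->
    <dtmf type="text/x-jsgf">
      (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+
    </dtmf>
    <filled>
      <submit next="http://127.0.0.1:8080/servlet/Login" namelist="pin"/>
    </filled>
  </field>
</form>
```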
Chapter 5
5.1 Conclusion
The interactive voice response system developed here makes use of the latest speech recognition engines, so that the speech user interface is efficient in recognizing the human voice. It promises a friendly user interface, as every stage of interaction was designed carefully and efficiently using the powerful voice language VXML.
IVR empowers users with more options regarding when, where, and how they use Internet services. By using speech, the most natural form of communication, and the existing, familiar global telephone network, the most pervasive communications network, and by enabling eyes-free and hands-free operation, this new mode of access promises to further accelerate the growth and maturity of Internet services.
The code is written so that there is a minimum amount of “dead air”, the silence the caller hears while the system fetches resources. VoiceXML provides several facilities to either eliminate or hide the delays associated with retrieving Web resources. To minimize delays, the system maintains a cache for VoiceXML documents, audio files, and other files used by applications. Normally, once the system has fetched a file over the Internet, it keeps a copy in the cache; if the application requests the file again, the system uses the cached copy. This is known as fast caching. Sometimes, even when a file is in the cache, the system should always check for a newer version of the file on the server from which it was originally fetched. This is known as safe caching.
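In VoiceXML 1.0 this choice is expressed per fetch through the caching attribute; a hedged sketch (the file names are illustrative):

```xml
<!-- A static audio prompt can be served from the cache directly ("fast"). -->
<audio src="welcome.wav" caching="fast"/>
<!-- A frequently updated document should be revalidated against the
     origin server before the cached copy is used ("safe"). -->
<goto next="mailbox.vxml" caching="safe"/>
```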
5.1.2 New way of Grammar development
Servlets develop the grammar by collecting data from the database. This is more efficient than the ordinary way of implementing grammars using reusable components, which are files that specify all probable input from the user by listing every combination of alphanumeric characters. Instead, servlets collect the required data from the database and form a grammar file from that information. A thread checks for new data entered into the data table; it runs every 10 seconds and regenerates the grammar files every time the table is updated.
The email reader lets the user simply dial a number from a telephone or cell phone and listen to his emails. This is a cost-effective and efficient way of checking email, especially for users who are always mobile.
As everyone knows, interactive voice response websites require a lot of infrastructure, such as speech recognition engines and voice servers. IVR applications mostly serve the untapped market of mobile and telephone users, for whom they are a cost-effective way of carrying out transactions. To make this possible, a VoIP gateway is required; as it takes a lot of time to develop, the application was simulated in the desktop environment instead. Improvements can be made at various stages of the application, as mentioned below.
(i). One can develop a much more efficient user-friendly interface than the existing
one.
(ii). One can develop a VoIP gateway and realize the goal of deploying the system in a real-time environment.
(iii). One can introduce more sophisticated speech recognition technologies and make the process of speech recognition more accurate than it is now.
Abbreviations
1. VXML - Voice Extensible Markup Language.
2. VSDK - Voice Server Development Kit.
3. PSTN - Public Switched Telephone Network.
4. IP - Internet Protocol.
5. DTMF - Dual Tone Multiple Frequency.
6. JSGF - Java Speech Grammar Format.
7. JSML - Java Speech Markup Language.
8. URI - Uniform Resource Identifier.
9. “Wizard of Oz” - A prototyping technique for IVR development.
List of Figures
1. Fig 2.1 - IVR network architecture of TJNET.
2. Fig 3.1 - Voice Web architecture.
3. Fig 3.1.2 - VoIP gateway.
4. Fig 4.1 - Application components and data flow.