Interactive Voice Response System

by

K. Pratap Kumar Raju

Indian Institute of Information Technology and Management
Gwalior-474005

January 2002
CERTIFICATE
This is to certify that the thesis entitled “Interactive Voice Response System”, being submitted to the Indian Institute of Information Technology and Management, Gwalior for the award of Master of Technology in Information Technology by K. Pratap Kumar Raju, is a record of bona fide work carried out by him under my supervision and guidance. It is further certified that the work presented has reached the standard required of a PG thesis and has not been submitted to any other university or institute for the award of any degree or diploma.
Date:
Place: Dr. Rajendra Sahu
ACKNOWLEDGEMENT
I would like to thank our Director, Prof. D. P. Agarwal, for providing all the facilities and the working environment in the institute. I would also like to thank the entire institute faculty, who helped me directly or indirectly to complete my thesis work.
Abstract
Interactive voice response (IVR) systems have been around for some time to help
guide customers to appropriate business units or information. However, with the use of
Internet technologies and wireless phones on the rise, coupled with the rapid
development in the speech recognition and speech synthesis technologies, new doors
for voice technology are opening to test demand in the marketplace. What’s more
convenient than picking up a phone? One can have instant access to the information
needed to make business operate more efficiently. Many businesses are betting that
consumers will embrace any technology that provides real-time access to information
piped through their regular telephone, wireless phone or voice-connected handheld
device.
A system in which the input and/or output are through a spoken, rather than a graphical, user interface is what we call an interactive voice response system, or simply an IVR system. The Web has made it possible to access information at the click of a mouse. In recent years the notion of a client has grown beyond desktop computers to devices such as telephones and mobile handsets. This is where voice control comes in.
Analyzing the requirements for developing voice systems, my dissertation work concentrates on how to develop an interactive voice response website. Voice Web technology makes use of open Internet standards (Web infrastructure) such as Hypertext Transfer Protocol (HTTP), Secure Sockets Layer (SSL), cookies and the Extensible Markup Language (XML) based VoiceXML for implementing voice services over the telephone. The system proposes a three-tier architecture. The client side consists of a telephone or cell phone connected to a Public Switched Telephone Network (PSTN). The middle tier consists of a voice server equipped with a VoIP gateway, which enables users of the PSTN to connect to the voice application running in the IP network. This voice server identifies the call made by users of the telephone network, initiates the voice application, presents the user with the required information and terminates the call when the user wants to exit from the application.
The application makes use of VXML to provide an efficient speech interface, grammar files written in the Java Speech Grammar Format (JSGF) to constrain recognition, and servlets to supply the information requested by the voice browser. The front end makes use of the VXML language, which consists of tags to recognize human input and record it for future use. VXML tags take input from the user in small phrases and send these parameters to back-end servlets. The servlets, written in Java, accept the parameters from the front end and use them to get the necessary information from the database server. The database server stores the information of an enterprise or institute in tables from which the necessary information can be retrieved and presented to users. The system can be used from any phone, anywhere. One does not have to put up with entering data on a tiny keypad; rather, one can interact with the service in a very natural manner.
The dissertation work aims at developing an IVR system for IIITM. It promises a good speech interface, to make the user feel comfortable interacting with the system, and an email reader that will read out emails so that one can listen to them rather than browsing through them.
Contents
Chapter 1 Introduction
1.1 Introduction to IVR systems
1.2 Typical voice applications
1.3 How to create and deploy IVR applications
1.4 How do users access IVR applications?
Chapter 4 Methodology
4.1 Implementation details
4.2 Application design and development
4.3 Development tools
4.3.1 VoiceXML
4.3.2 Servlets
4.3.3 JSGF
4.3.4 Oracle database
4.4 Speech interface design
4.4.1 Methodology
4.5 IVR development aspects
4.6 Deployment procedure
4.6.1 Working of the system
4.6.2 Practical issues in deploying IVR
4.6.3 Security issues
References
Chapter 1
1.1 Introduction
IVR systems, also called Voice Response Units (VRUs), automate the handling of calls by interacting with the user. They take input from the user in voice and provide enterprise information by connecting to one or more online databases. Popular IVR applications include bank-by-phone, flight schedule retrieval, and automated order entry and tracking. The common feature of these examples is that a caller's touch-tone or spoken requests are answered with verbal information derived from a "live" database. A significant percentage of installed IVR systems are used in front-end call centers to reroute calls away from costly live agents. Over time, IVR systems have evolved from simple systems accepting touch-tone input to advanced voice systems accepting near natural-language voice input. IVR systems are mostly used in applications that require little information from the user side and more information from the system side, so that the speech recognition engine is not burdened with large inputs that are difficult to recognize.
Voice applications typically fall into one of the following categories: queries and transactions.
Queries: In this scenario, a customer calls into a system to retrieve information from a
Web-based infrastructure. The system guides the customer through a series of menus
and forms by playing instructions, prompts, and menu choices using prerecorded audio
files or synthesized speech. The customer uses spoken commands or DTMF input to
make menu selections and fill in form fields. Based on the customer’s input, the
system locates the appropriate records in a back-end enterprise database. The system
presents the desired information to the customer, either by playing back prerecorded audio files or by synthesizing speech from the data retrieved from the database. Examples of this type of self-service interaction include applications or voice portals providing weather reports, movie listings, stock quotes, health-care-provider listings, and customer service information (Web call centers).
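A query interaction of this kind can be sketched in VoiceXML. The document below is only an illustrative sketch, not part of the thesis system: the menu options, prompt wording, grammar entries and servlet URL are all assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch of a query-style voice menu -->
  <menu>
    <prompt>Welcome. Say weather, movies, or stock quotes.</prompt>
    <choice next="#weather">weather</choice>
    <choice next="#movies">movies</choice>
    <choice next="#stocks">stock quotes</choice>
    <noinput>Sorry, I did not hear you. <reprompt/></noinput>
    <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  </menu>

  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <!-- Inline JSGF grammar listing the valid spoken answers -->
      <grammar type="application/x-jsgf">delhi | mumbai | gwalior</grammar>
      <filled>
        <!-- Send the recognized value to a back-end servlet (URL assumed) -->
        <submit next="http://example.com/servlet/weather" namelist="city"/>
      </filled>
    </field>
  </form>

  <!-- The movies and stocks forms would follow the same pattern -->
</vxml>
```

The caller either speaks a menu choice or presses a DTMF key; the recognized value is then passed to server-side logic that queries the back-end database.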
(ii). Although VoiceXML applications can be written using any text editor, one may find it more convenient to use a graphical development environment that helps to create and manage VoiceXML files. WebSphere Studio and the WebSphere Voice Toolkit support the development of VoiceXML-based applications. (Optional) A system administrator uses a Web application server program to configure and manage a Web server.
(iii). The developer publishes the VoiceXML application (including VoiceXML pages,
grammar files, any prerecorded audio files, and any server-side logic) to the Web
server.
(iv). The developer uses a desktop workstation and the Voice Server SDK to test the
VoiceXML application running on the Web server or local disk, pointing the
VoiceXML browser to the appropriate starting VoiceXML page.
(vi). The system administrator uses one of the deployment platforms to configure,
deploy, monitor, and manage a dedicated Voice Server.
(vii). The developer uses a real telephone to test the VoiceXML application running on
the Voice Server.
(i). A user dials the telephone number provided to the application. The Voice Server
answers the call and executes the application referenced by the dialed phone number.
(ii). The Voice Server plays a greeting to the caller and prompts the caller to indicate what information he or she wants. The application can use prerecorded greetings and prompts or synthesize them from text using the text-to-speech engine. If the application supports barge-in, the caller can interrupt the prompt if he or she already knows what to do.
(iii). The application waits for the caller’s response for a set period of time. The caller can respond either by speaking or by pressing one or more keys on a DTMF telephone keypad, depending on the types of responses expected by the application. If the response does not match the criteria defined by the application (such as a specific word, phrase, or digits), the voice application can prompt the caller to enter the response again, using the same or different wording. If the waiting period has elapsed and the caller has not responded, the application can prompt the caller again, using the same or different wording.
(iv). The application takes whatever action is appropriate to the caller’s response. For example, the application might update information in a database, retrieve information from a database and speak it to the caller, store or retrieve a voice message, launch another application, or play a help message. After taking action, the application prompts the caller with what to do next.
(v). The caller or the application can terminate the call. For example, the caller can
terminate the interaction at any time, simply by hanging up; the Voice Server can
detect if the caller hangs up and can disconnect itself. If the application permits, the
caller can use a command to explicitly indicate that the interaction is over (for
example, by saying “Exit”). If the application has finished running, it can play a
closing message and then disconnect.
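The call flow in steps (i) through (v) can be sketched as a single VoiceXML dialog. This is a minimal illustration, not the thesis application; the timeout value, prompt wording, grammar options and servlet URL are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- step (iii): how long to wait for the caller's response (value assumed) -->
  <property name="timeout" value="5s"/>
  <form id="main">
    <block>
      <!-- step (ii): greeting; barge-in lets the caller interrupt the prompt -->
      <prompt bargein="true">Welcome to the information service.</prompt>
    </block>
    <field name="request">
      <prompt>What would you like to do?</prompt>
      <grammar type="application/x-jsgf">balance | schedule | exit</grammar>
      <!-- waiting period elapsed: reprompt with different wording -->
      <noinput>Are you still there? Please say balance, schedule, or exit.</noinput>
      <!-- response did not match the grammar: ask again -->
      <nomatch>Sorry, I did not catch that. <reprompt/></nomatch>
      <filled>
        <if cond="request == 'exit'">
          <!-- step (v): explicit exit command plays a closing message and disconnects -->
          <prompt>Goodbye.</prompt>
          <disconnect/>
        <else/>
          <!-- step (iv): act on the response via server-side logic (URL assumed) -->
          <submit next="http://example.com/servlet/handle" namelist="request"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```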
Chapter 2
2.1 Problem Definition
Until recently, the World Wide Web has relied exclusively on visual interfaces to
deliver information and services to users via computers equipped with a monitor,
keyboard, and pointing device. In doing so, a huge potential customer base has been
ignored: people who (due to time, location, and/or cost constraints) do not have access
to a computer.
Even if a website is made perfectly dynamic using many technologies, users cannot find it fully convenient, since it requires them to sit at a fixed terminal to access the required information. This is not possible for mobile users, who cannot perform a transaction or get the desired information through a desktop PC. What they want is to be able to do it from anywhere, through any network: the PSTN, the Internet, or a mobile network.
The existing IP architecture provides poor quality of service for voice transfer, for the following reasons.
(i). The existing network makes use of the connectionless, unreliable Internet Protocol, and hence there is no guarantee that a packet will arrive at the destination. Retransmission is not feasible when transferring voice signals through the network, so packets lost in transit due to congestion cannot be recovered.
(ii). Long propagation delays in an unreliable, congested network make listening to voice ineffective.
(iii). Packets may arrive out of order, as they take different routes through the network, which leads to a sequencing problem. Out-of-sequence packets are not acceptable in the transfer of voice signals.
(iv). The devices in the network cause an unpredictable amount of delay between packets, which is called jitter. Large jitter causes packets to reach the destination with unpredictable delay, which leads to poor voice quality.
(v). In dealing with voice, there should be some mechanism to cancel the echo created as the voice travels through the network.
to implement the above-mentioned functions. Develop an interface between the voice
server and the web server.
Interactive voice response websites make the information available on the World Wide Web (WWW) accessible from public telephones and cell phones.
• Interactive Voice Response (IVR) enabled websites make the information on the World Wide Web reachable even from telephones and cell phones. This allows the user to get information easily, round the clock, just by dialing the particular server from their handset.
• IVR systems enable users to perform different types of transactions easily, e.g. checking bank balances or transferring money.
• IVR systems let users check their email with just a telephone. The system takes the necessary information from the caller and reads out the messages intended for them.
• IVR systems are especially useful in call centers, to respond to customers in voice and transfer calls to other information systems.
The flaw in existing websites is that they are not voice interactive. By making a website voice interactive, one can provide information in voice; presenting information in voice has many advantages, some of which are mentioned above. Especially for information queries, where a client sends little information as a request and receives much information from the server side, a voice interactive system is very helpful: all one has to do is speak short query phrases and listen to the required information.
Taking the flaws prevailing in the existing system into consideration, one would develop a system which can interact effectively with the user in voice and provide information in a form the user finds comfortable. If the information is made available to telephones and cell phones, it will be all the more advantageous and help in the substantial growth of the organization.
Interactive Voice Response (IVR) applications enable callers to query and modify database information over the telephone, using their own speech or by dialing digits. Callers can use the touch-tone pad to input requests or just say what they want to do, such as ordering a product, obtaining a work schedule, or requesting account balance information, and the database speaks information back to the caller using text-to-speech. IVR offers customers and businesses a new level of freedom by enabling them to conduct transactions 24 hours a day, seven days a week. Businesses of all sizes are realizing the tremendous benefits of IVR applications for their call processing and information delivery needs. IVR functionality links a phone system to a database to provide customers with 24-hour immediate access to account information via telephone. For example, a bank could make up to 10 data fields available for a caller’s checking account, 10 data fields for his or her savings account, and so on. To ensure security, IVR can be set up to allow the caller access to account information only if the caller enters a valid account number and corresponding personal identification number.
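The account-number-and-PIN check described above might be sketched in VoiceXML as follows. The field names, prompt wording and servlet URL are illustrative assumptions, not taken from any real banking system.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch of the account-plus-PIN security check -->
  <form id="login">
    <field name="account" type="digits">
      <prompt>Please say or enter your account number.</prompt>
    </field>
    <field name="pin" type="digits">
      <prompt>Please enter your personal identification number.</prompt>
    </field>
    <filled>
      <!-- A back-end servlet validates the pair before granting access (URL assumed) -->
      <submit next="http://example.com/servlet/login" namelist="account pin"/>
    </filled>
  </form>
</vxml>
```

Only after the servlet confirms the pair would the application expose the caller's account fields.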
IVR allows full connectivity to the most popular databases, including Microsoft Access, Microsoft Excel, Microsoft FoxPro and dBase. One can read information from and write information to databases, as well as query databases and return information. The application files can reside on the local system, an intranet, or the Internet. Users can access the deployed applications anytime, anywhere, from any telephony-capable device, and the applications can be designed to restrict access only to those who are authorized to receive the information.
“Voice-enabling the World Wide Web” does not simply mean using spoken commands to tell a visual browser to look up a specific Web address or go to a particular bookmark, having a visual browser throw away the graphics on a traditional visual Web page and read the rest of the information aloud, or converting the bold or italics on a visual Web page to some kind of emphasized speech. Voice applications provide an easy and novel way for users to surf or shop on the Internet: “browsing by voice.”
Users can interact with Web-based data (that is, data available via Web-style architecture such as servlets, ASPs, JSPs, JavaBeans, CGI scripts, etc.) using speech rather than a keyboard and mouse. The form that this spoken data takes is often not identical to the form it takes in a visual interface, due to the inherent differences between the interfaces. For this reason, transcoding (that is, using a tool to automatically convert HTML files to VoiceXML) may not be the most effective way to create voice applications. The IVR platform executes any created application when a caller dials in and allows callers to interact with the system using both human speech and DTMF. Advanced database technology permits reading, writing, appending, searching and seeking database information.
It must be noted that voice-based navigation can get complex. When implementing information services on a web server, one can include a wealth of information on a page and overlapping paths to resources, to make sure users reach their destination whatever their approach to searching for it. In voice applications it becomes more important to define the information clearly. Voice data is transient; it depends on the user's memory and ties in much more closely with preconceptions and experience. Finally, our ability to focus on any one voice source among many is limited. The need to avoid ambiguity in the question-and-answer pattern of voice interaction can lead to very complex systems, and it is very difficult to maintain location information: keeping the user aware of where they are in the application, and how that relates to other parts of the application such as the home page. It is characteristic of unpopular applications that the user feels lost and out of control.
The growing awareness of catering for a variety of needs and devices has highlighted the importance of voice-controlled services, and also the importance of making them usable. Voice entry of textual data is much easier than using a phone keypad. Current developments in wireless technology and increases in processor speed have made speech applications a reality. With powerful servers for speech processing and wireless-based thin clients such as mobile phones and PDAs, it is now possible to interact with the user using audio input and output.
In addition to all this, VoiceXML has made the dream of developing voice Web applications come true. VoiceXML is a relatively new XML-based specification designed for developing voice applications over the Web. It has its roots in VXML, a language designed by Motorola, another specification for presenting services and data in the voice medium.
(i) Weather information
In the US most weather information applications are automated using IVR systems. User queries about the weather are automatically answered by the simulated voice generated by the TTS engine.
(ii) Online purchasing
In online purchasing, any queries regarding the items are answered by the automated voice.
(iii) Railway enquiry
Information on the arrival and departure of trains and on reservation availability can all be obtained from the automated response system. The caller speaks the train number and the source and destination stations; the IVR system automatically generates a query on the database, retrieves the information, and speaks the reservation availability out loud.
(iv) Telemedicine
IVR systems have now even entered the field of medicine. An electrocardiogram monitor collects the patient's ECG data and transmits it over a regular telephone line.
(v) Tele-education
Education from distant places (tele-education) is now possible since IVR systems came into the picture.
This service is available to callers using touch-tone telephones. If your phone makes a
different tone each time you press a number, then your phone is a touch-tone phone. If
you hear no tone or a number of clicks with each press of the numbers, then your
phone is operating in pulse mode and cannot access the system. You can purchase a
special adaptor from Telstra, which will enable you to use the system. Some
telephones have a switch or button, which allows you to change the mode to touch-
tone mode.
Calls are charged at the minimum rate of 35 cents per minute regardless of where you
call from, plus an initial charge of 15 cents. A higher charge will be incurred from
mobile telephones and public telephones. Students living overseas can access the
system by dialing 0055 31706 (preceded by the Australian International Code). The
rate is 75 cents per minute, plus the International access rate, which varies from
country to country.
(vi) Automatic call answering in call centers
Before the voice Web came into the picture, call centers used to spend a lot of money on call operators to answer user queries. Now a call center operating fully on the voice Web has the advantage of low cost, thanks to TTS and voice recognition engines.
Emerging Digital Concepts (EDC) is developing solutions for clients using a number of leading speech recognition technologies, including SpeechWorks and Nuance. These technologies are applied on some of the state-of-the-art hardware available today, including Natural MicroSystems and Dialogic.
TigerJet Network provides integrated software and silicon solutions for network communication applications.
TigerJet's Gateway Manager application lets you implement your own private VoIP gateway for Internet-to-regular-phone calls. You can place a call using your own regular phone line from anywhere in the world.
IP Phone integrates all popular choices for making Internet phone calls in a single, easy-to-use application with a central "one stop" interface.
Nuance is the leader in Voice Web software: speech recognition, voice authentication, text-to-speech and voice-browsing products that make the information and services of enterprises, telecommunications networks and the Internet accessible from any telephone. SRI International, one of the leading voice technology research entities throughout the 1980s and 1990s, established Nuance as an independent company. Nuance offers its products through industry partners, platform providers, and value-added resellers around the world.
SRC Telecom. On 4 June 2001, SRC Telecom, the telephony-based speech recognition arm of SRC (The Speech Recognition Company), announced that it is offering a VXML (VoiceXML) application hosting service. SRC has installed a VXML platform that will provide third parties with the first Europe-based application hosting environment.
third parties signifies SRC Telecom’s leadership in delivering the latest telephony
speech solutions.”
By developing applications in VXML, organizations can benefit from the many advantages associated with an open-standards-based development environment. Most notably, VXML provides significant efficiencies during the application design process, ensures ease of software maintenance, and allows greater portability of applications. However, the development of speech applications that achieve high end-user acceptance still requires substantial expertise in human factors engineering, dialogue design and speech systems integration.
JSGF provides a way to define the grammar files that help the system check whether the user input is valid. One can declare small phrases or words as options in the grammar, and the user is required to speak one of these options in order to make a selection. Voice servers are being developed by many companies to make it possible to deploy applications reachable from both the PSTN and IP networks. One of the most popular among them is the voice server developed by IBM. It has many versions, which run on Windows 2000, Windows NT 4.0 and even on Linux. Taking into consideration the efficiency of Windows NT 4.0 and the simplicity of its user interface, the voice server for Windows NT 4.0 is the best option. The latest versions support new tags that help generate simulated voice comparable with the human voice. Java is one of the tools for developing server applications at the back end. Servlets, which make use of Java and special APIs designed for various tasks, work satisfactorily to process requests from the voice browser.
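As a sketch of how such a grammar file might look, consider the following JSGF fragment. The grammar name, rule names and word options below are hypothetical, not taken from the thesis system.

```jsgf
#JSGF V1.0;

// Hypothetical grammar: the caller must speak one of these options
grammar menu;

public <choice> = account | email | timetable | exit;

// A rule can also mix optional words with alternatives:
// matches "check my balance", "my balance", or just "balance"
public <balance> = [check] [my] balance;
```

At run time, only utterances matching an active public rule are accepted as valid input; anything else triggers a no-match event.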
Using the above-mentioned tools (the voice server, which includes the voice browser, the voice recognition engine and the TTS engine, together with the Java Speech Grammar Format), the dissertation work creates a voice application that promises an efficient speech user interface and a user-friendly environment.
2.6 Conclusion
This chapter chronologically analyzes the tools and technologies for developing voice applications. It examines the various tools and finds VXML to be the strongest candidate for the front end of future IVR systems. Using IBM's Windows-supported Voice Server development kit, the Java Speech Markup Language and Java servlets at the back end, one can develop a full-fledged voice application that can be deployed on Web architecture.
2.7 Chapterization
The subsequent chapters deal with the system architecture, problem formulation, and
designing and development aspects of the system.
Chapter 3
[Figure: three-tier architecture showing the voice gateway, VoIP server and Web server]
(ii) VXML gateway.
(iii) Voice Server.
(iv) Web application Server.
(v) Database Server.
3.1.2 VoiceXML Gateway
The purpose of a gateway is to transfer data between two networks that adopt different protocols and different data formats. A VoIP gateway is used to connect the PSTN and an IP network. The IP network makes use of the TCP/IP protocol suite, which transfers data in packet format. The PSTN transmits raw data bits through the network, a completely different data format from that of a TCP/IP network; it uses signaling and switching processes (a control plane and a data plane) in two-layer switches. The VoIP gateway emulates the telephone network in the IP network.
A VoIP gateway consists of a series of digital signal processors (DSPs) which perform the following functions. Whenever the user lifts the phone, it is the function of the gateway to generate and detect the tones of the different DTMF inputs, that is, the destination number. A routing server maps this number to an Internet address to identify the destination node.
[Figure: bank of DSPs inside the VoIP gateway]
3.1.3 Voice Server
The Voice Server mainly consists of a speech recognition engine, a text-to-speech engine and a DTMF simulator, as shown in Figure 4.1.
Speech recognition is the ability of a computer to decode human speech and convert it to text. To convert spoken input to text, the computer first parses the input audio stream and then converts that information to text output. The process of recognition takes place as follows. One creates a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specifies where each grammar should be active within the application. When the application runs, the speech recognition engine processes the incoming audio signal and compares the sound patterns to the patterns of basic spoken sounds, trying to determine the most probable combination that represents the audio input. Finally, the speech recognition engine compares the sounds to the list of words and phrases in the active grammar(s). Only words and phrases in the active grammars are considered as possible speech recognition candidates. Any word for which the speech recognizer does not have a pronunciation is given one and is flagged as an unknown word.
The key determinants of speech recognition accuracy are audio input quality, interface design and grammar design. The quality of audio input is influenced by several factors: the choice of input device, e.g. a microphone connected to a desktop workstation (for applications deployed using the WebSphere Voice Server, the input device could be a regular telephone, cordless telephone, speakerphone, or cellular telephone); the speaking environment, which could be in a car, outdoors, in a crowded room, or in a quiet office; and certain user characteristics such as accent, fluency in the input language, and any atypical pronunciations. While many of these factors may be beyond your control, one should nevertheless consider their implications when designing applications.
Users will achieve the best possible speech recognition with a high-quality input
device that gives good signal-to-noise ratio. For desktop testing, use one of the
microphones listed at http://www.ibm.com/viavoice. Speech clarity is a significant
contributor to audio quality. Adult native speakers who speak clearly (without over-
enunciating or hesitating) and position the microphone or telephone properly achieve
the best recognition; other demographic groups may see somewhat variable
performance.
The design of the application interface has a major influence on speech recognition accuracy. Since only words, phrases, and DTMF key sequences from active grammars are considered as possible speech recognition candidates, what one chooses to put in a grammar, and when one chooses to make that grammar active, have a major impact on speech recognition accuracy.
Text-to-speech conversion is the ability of a computer to “read out loud” (that is, to
generate spoken output from text input). Text-to-speech is often referred to as TTS or
speech synthesis. To generate synthesized speech, the computer must first parse the
input text to determine its structure and then convert that text to spoken output. One
can improve the quality of TTS output by using the speech markup elements provided
by the VoiceXML language, which is described later in the subsequent chapters. TTS
prompts are easier to maintain and modify than recorded audio prompts. For this
reason, TTS is typically used during application development.
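As an illustration of the speech markup elements mentioned above, VoiceXML 1.0 provides tags such as <sayas>, <break> and <emp> for shaping TTS output. The prompt below is a hypothetical sketch; the wording and values are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        Your balance is
        <!-- read the number as a currency amount, not digit by digit -->
        <sayas class="currency">1250.75</sayas>.
        <!-- insert a pause before the closing phrase -->
        <break size="medium"/>
        <!-- emphasize the important word -->
        <emp>Thank</emp> you for calling.
      </prompt>
    </block>
  </form>
</vxml>
```

Without such markup, the synthesizer would have to guess at the pronunciation and phrasing from the raw text alone.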
specifies the -Dvxml.gui=false Java system property when starting the VoiceXML browser. If the DTMF Simulator GUI window is closed, the only way to restart it is to stop and restart the VoiceXML browser. The DTMF Simulator, together with a desktop microphone and speakers, takes the place of a telephone during desktop testing, allowing developers to debug VoiceXML applications without having to connect to telephony hardware and the PSTN (Public Switched Telephone Network) or cellular GSM (Global System for Mobile Communications) network.
Using the DTMF Simulator, one can simulate a telephone keypress event by pressing the corresponding key on the computer keyboard or clicking the corresponding button on the DTMF Simulator GUI, shown in Figure 4.1. For example, if the application prompt is “Press 5 on your telephone keypad,” one can simulate a user response during desktop testing by clicking the 5 button on the DTMF Simulator GUI or pressing the 5 key on the computer keyboard while the cursor focus is in the DTMF Simulator GUI window. The VoiceXML browser will interpret the input as a 5 pressed on a DTMF telephone keypad. If the length of valid DTMF input strings is variable, use the # key to terminate DTMF input.
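A dialog that accepts variable-length DTMF input terminated by # might be sketched as follows. This is an assumption-laden illustration: the field name and prompt wording are invented, and the termchar property is assumed to be supported by the deployed VoiceXML browser.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical sketch: variable-length digit entry ended by the # key -->
  <form id="account">
    <!-- declare # as the key that terminates variable-length DTMF input -->
    <property name="termchar" value="#"/>
    <field name="account_number" type="digits">
      <prompt>
        Please enter your account number, followed by the pound key.
      </prompt>
      <filled>
        <prompt>You entered <value expr="account_number"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

During desktop testing, the same dialog can be driven entirely from the DTMF Simulator GUI instead of a real telephone keypad.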
Interactions with Text-to-Speech and Speech Recognition Engines
During initialization, the VoiceXML browser starts the TTS and speech recognition
engines. The VoiceXML browser uses telephony acoustic models in order to simulate
the behavior of the final deployed telephony application as closely as possible in a
desktop environment. As the VoiceXML browser processes a VoiceXML document, it
plays audio prompts using text-to-speech or recorded audio; for text-to-speech output,
it interacts with the TTS engine to convert the text into audio. Based on the current
dialog state, the VoiceXML browser enables and disables speech recognition
grammars. When the VoiceXML browser receives user audio input, the speech
recognition engine decodes the input stream, checks for valid user utterances as
defined by the currently active speech recognition grammar(s), and returns the results
to the VoiceXML browser. The VoiceXML browser uses the recognition results to fill
in form items or select menu options in the VoiceXML application. If the input is
associated with a <record> element in the VoiceXML document, the VoiceXML
browser stores the recorded audio. As the VoiceXML browser makes transitions to
new dialogs or new documents, it enables and disables different speech recognition
grammars, as specified by the VoiceXML application. As a result, the list of valid user
utterances changes. If the VoiceXML browser encounters an <ibmlexicon> element in
a VoiceXML document, it interacts with the speech recognition and TTS engines to
add or change the pronunciation of a word for the duration of the current VoiceXML
browser session.
3.1.4 Interactions with the Web Server and Enterprise Data Server
VoiceXML applications can be stored on any Web server running on any platform. In this work, the Java Web Server is used to answer the requests generated by the VXML documents. When started, the VoiceXML browser sends an HTTP request over the LAN or Internet to request an initial VoiceXML document from the Web server. The requested VoiceXML document can contain static information, or it can be generated dynamically from data stored in an enterprise database using the same type of server-side logic (CGI scripts, Java Beans, ASP, JSP, Java Servlets, etc.) that is used to generate dynamic HTML documents.
The VoiceXML browser interprets and renders the document. Based on the user’s
input, the VoiceXML browser may request a new VoiceXML document from the Web
server, or may send data back to the Web server to update information in the back-end
database. The important thing is that the mechanism for accessing your back-end
enterprise data does not need to change.
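To sketch this round trip (the servlet URI and field name here are hypothetical, not the application's actual code), a VoiceXML document can send collected data back to the server-side logic with <submit>:

```xml
<form id="lookup">
  <field name="studentname">
    <prompt>Please say the student name.</prompt>
    <!-- External JSGF grammar listing the valid student names. -->
    <grammar src="students.gram" type="text/x-jsgf"/>
  </field>
  <block>
    <!-- The servlet uses the submitted value to query the database and
         returns a dynamically generated VoiceXML document in reply. -->
    <submit next="http://127.0.0.1:8080/servlet/StudentInfo"
            namelist="studentname"/>
  </block>
</form>
```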
3.1.5 Web Server
The Web server runs the application logic, and may contain a database or an interface to an external database or transaction server.
Chapter 4 Methodology
4.1 Implementation Details
After identifying the system components and their details, the present chapter discusses the implementation of the IVR application. The application uses VoiceXML at the front end. The VoiceXML documents run on a speech or voice browser. This voice browser executes the tags one by one in the order specified by the form interpretation algorithm, which identifies the form elements and invokes speech recognition engine or TTS engine function calls to execute the tags. If the application requires dynamic data to be extracted from the database, it sends a request to a servlet program. Servlets use database connectivity to supply the necessary data to the voice browser, which uses the TTS engine to convert this data into voice that is spoken aloud. Servlets run on a web application server; the application uses the Java Web Server for this purpose. Database information is stored in the form of tables in the database server. The Oracle 8i database server proved efficient and convenient for storing this data.
[Figure 4.1 is a block diagram showing the DTMF Simulator, speech recognition engine, TTS engine, VXML application, web server, and enterprise application database with its tables, connected by the numbered data flows listed in the caption.]
Fig 4.1 Application components and data flow: (1) voice in, (2) audio or synthesized speech output, (3) VoiceXML via HTTP over LAN or Internet, (4) DTMF in, (5) database connectivity.
(i). Information regarding the institute establishment and the institute profile.
(v). Eligibility and selection criteria for various courses of IIITM for the students
and faculty.
(vii). Exit from the site, if the user wants to come out of the system at any stage.
For development, VXML and JSGF are used at the front end, and servlets and Java at the back end, in order to form a robust and flexible system. Grammar files are developed using the Java Speech Grammar Format (JSGF), together with threads and JDBC.
4.3.1 VoiceXML
VXML is XML-based markup language for creating distributed voice applications,
much as HTML is a markup language for creating distributed visual applications.
VoiceXML supports dialogues that feature, spoken input, DTMF (telephone key)
input, recording of spoken input, synthesized speech output ("text-to-speech"), pre-
recorded audio output. VoiceXML makes building speech applications easier, in the
same way that HTML simplifies building visual applications.
These files define the voice user interaction and dialog flow control.
Grammar Files define the valid commands that are allowed during the voice
interaction. Grammar can be defined at the development stage or generated
dynamically at the run time. Audio Files are prerecorded audio files that are played
back, or the recordings of the user’s input. VoiceXML language provides features for
four major components of Voice Web: voice dialogs, platform control, telephony,
performance. Each VoiceXML document consists of one or more dialogs. The dialog
features cover the collection of input, generation of audio output, handling of
asynchronous events, performance of client-side scripting and dialog continuation.
Telephony features include simple connection control (call transfer, adding a third party, call disconnect) and telephony information such as Automatic Number Identification (ANI) and Dialed Number Identification Service (DNIS).
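For illustration, a minimal VoiceXML document exercising the simplest of these features, audio output, might look as follows (the prompt text is invented for this sketch):

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A single dialog containing one executable block. -->
  <form id="welcome">
    <block>
      <!-- Rendered through the TTS engine as synthesized speech. -->
      <prompt>Welcome to the institute information service.</prompt>
    </block>
  </form>
</vxml>
```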
VoiceXML Concepts
The user is always in one conversational state, or dialog, at a time. Each dialog determines which dialog will be transitioned to next. Transitions are specified using URIs (Uniform Resource Identifiers), which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a specific dialog, the first dialog in the document is assumed. Dialog execution is terminated when a dialog does not specify a successor, or when it has an element that explicitly exits the conversation.
Dialogs are of two kinds: forms and menus. Forms define an interaction that collects values for a set of field-item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
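A menu of this kind might be sketched as follows (the option names and target document URIs are illustrative only):

```xml
<menu>
  <prompt>Say institute information, selection criteria, or achievements.</prompt>
  <!-- Each choice names the document to transition to when matched. -->
  <choice next="institute.vxml">institute information</choice>
  <choice next="selection.vxml">selection criteria</choice>
  <choice next="achievements.vxml">achievements</choice>
  <!-- Replay the options if the utterance matches none of the choices. -->
  <nomatch>Please choose one of the listed options. <reprompt/></nomatch>
</menu>
```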
Subdialogs are function-like reusable components that can be used for standard
reusable dialog interfaces, like collecting credit card numbers. At the end of execution
of a subdialog, the control returns to the dialog from where it was invoked and returns
the fields that were collected.
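A subdialog invocation might be sketched like this (the file name, anchor, and variable name are hypothetical):

```xml
<form id="payment">
  <!-- Control passes to the named dialog in creditcard.vxml and returns
       here, with the collected fields, when that dialog completes. -->
  <subdialog name="card" src="creditcard.vxml#getcard">
    <filled>
      <prompt>Thank you. Your card details have been recorded.</prompt>
    </filled>
  </subdialog>
</form>
```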
Grammars: each dialog has one or more speech and/or DTMF grammars (valid commands) associated with it. Each dialog's grammars are active only when the user is in that dialog. Some dialogs can be flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or in another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, the application transitions to the new dialog and treats the user's utterance as if it had been said in that dialog.
Events and event handlers: elements inherit the event handlers of their enclosing elements “as if by copy.” In this way, common event handling behavior can be specified at any level and applied to all lower levels.
4.3.2 Servlets
Java Servlets are the key component of server side programming. A servlet is a small
puggle extension to the server that enhances the servers functionality. Servlets are
server side programmes, which run on the web servers to provide the requested
information by the users. Servlets make use of JDBC concepts to connect to the
database where the actual information of the enterprise is stored
(i) Efficient: With traditional CGI, a new process is started for each HTTP request. If
the CGI program does a relatively fast operation, the overhead of starting the process
can dominate the execution time. With servlets, the Java Virtual Machine stays up, and
each request is handled by a lightweight Java thread, not a heavyweight operating
system process.
(ii) Powerful: Java servlets let you easily do several things that are difficult or impossible with regular CGI. For one thing, servlets can talk directly to the Web server (regular CGI programs can't). Servlets can also share data among each other, making useful things like database connection pools easy to implement. They can also maintain information from request to request, simplifying things like session tracking and caching of previous computations. Servlets are written in Java and follow a well-standardized API. Servlets are supported directly or via a plug-in on almost every major Web server.
• | to separate alternatives
• [] to enclose optional words, phrases, or rules
• () to group words, phrases, or rules
• * to indicate that the previous item may occur zero or more times
• + to indicate that the previous item may occur one or more times
For example:
#JSGF V1.0;
grammar employees;
public <name>= Jonathan | Larry | Susan | Melissa;
An inline grammar is specified directly in the VXML document:
<grammar>
request | path | query | server | remote user | backup | exit
</grammar>
VoiceXML browser also uses JSGF as the DTMF grammar format. For example, the
following code snippet defines an inline DTMF grammar that allows the user to make
a selection by pressing the numbers 1 through 4, the asterisk, or the pound sign on a
telephone:
<dtmf type="text/x-jsgf">
1 | 2 | 3 | 4 | “*” | “#”
</dtmf>
4.4.1 Design Methodology
Developing speech user interfaces, like most development activities, involves an
iterative 4-phase process: “Design Phase”, “Prototype Phase”, “Test Phase”,
“Refinement Phase”.
Design Phase: In this phase, the goal is to define proposed functionality and create an
initial design. This involves the following tasks: “Analyzing Your Users”,
“Analyzing User Tasks” , “Making High-Level Decisions” , “Making Low-Level
Decisions” , “Defining Information Flow” , “Identifying Application Interactions” ,
“Planning for Expert Users” .
Prototype Phase: The goal of this phase is to create a prototype of the application,
leaving the design flexible enough to accommodate changes in prompts and dialog
flow in subsequent phases of the design.
For the first iteration, use the technique known as “Wizard of Oz” testing. This technique can be used before coding begins, as it requires only a paper prototype script and two people: one to play the role of the user, and a human “wizard” to play the role of the computer system.
Test Phase: After incorporating the results of the “Wizard of Oz” testing, code and test
a working prototype of the application. During this phase, be sure to analyze the
behavior of both new and expert users.
Refinement Phase: During this phase, update the user interface based on the results of testing the prototype. For example, revise prototype scripts, add tapered prompts and customizable expertise levels, create dialogs for inter- and intra-application interactions, and prune out dialogs that were identified as potential sources of user interface breakdowns. Finally, iterate the Design, Prototype, Test, Refine process, including both new and expert users in the Test phase.
(i) Create the necessary VXML files to understand user input. Using JSGF, create a series of speech recognition grammars defining the words and phrases that can be spoken by the user, and specify where each grammar should be active within the application.
(ii) Pass the parameters collected from the user to servlets by specifying the URI. Uniform Resource Identifiers are used to specify the path where the servlet is located. These URIs are specified in the <submit> tag, which submits the parameters to the servlets.
(iii). Create servlets using Servlet API to receive the parameters from the <submit>
tags of VXML files.
(v). Collect the information from the database and pass it through VXML tags like
<prompt> or <block>, which can read the text out loud.
The same procedure is applied to develop the code that processes the different options chosen by the user. Special call-recognizing tags are used in order to deploy the application in a real-time environment. In an ordinary PSTN network, the central office is responsible for generating the dial tone and establishing a connection between the source and destination devices.
A gateway emulates a central office, providing: signaling (dial tone, call set-up, etc.; H.323, MGCP, SS7), conversion to IP (often Ethernet), compression (G.711, G.723.1, etc.), echo cancellation, and quality of service (QoS).
When a user places a call to the voice server using a telephone or cell phone, the voice server automatically recognizes the call with the help of the VoIP gateway and starts executing the application root document. The user selects a choice after hearing the options provided by the application. The speech recognition engine processes the incoming audio signal and compares the sound patterns to the patterns of basic spoken sounds, trying to determine the most probable combination that represents the audio input. Finally, the speech recognition engine compares the sounds to the list of words and phrases in the active grammar(s). Only words and phrases in the active grammars are considered as possible matches. With present technologies, understanding long sentences is quite difficult compared to short phrases.
After the prerequisites were met, all the class files were copied into the servlets directory of the Java Web Server, and all VXML files into a folder named “Thesiscode” on the C: logical partition of the hard disk.
To run the application root document in the desktop simulated environment, open the command prompt and go to the directory where the VXML documents are stored. From there, run the voice browser and execute the file by typing:
<path to the voice browser, i.e. vsaudio> root_iiitm.vxml
For example, if the voice server is installed on the C: partition, the command is “c:\voices~1\bin\vsaudio” root_iiitm.vxml. To run the application in text mode, replace “vsaudio” with “vstext”; this is mostly used for debugging the application. After executing this command, the application runs in a user-friendly manner, so that the user can easily identify where he is in the application.
For example, suppose one wants information about the institute's establishment. It can be obtained by first selecting the institute information option and then the establishment option. After hearing the establishment information, the user is again given a set of choices. For further information regarding M.Tech student selection, select the students selection criteria option and then the M.Tech students choice.
Selecting the first option provides information regarding the institute. Here the user is given choices about what information he would like to have: the institute's establishment, facilities, profile, students database, or faculty database. By providing information such as a student's name and group, the user can hear the complete details of that particular student.
Selecting the second option provides information about the recruitment process for students and faculty at IIITM. The user selects the group of interest (M.Tech, MBA, or IPG) and hears the recruitment procedure for that particular branch.
Selecting the third option announces the achievements of the institute. These include summer placement information, final placement information, and the cultural events held at the institute each year.
Selecting the fourth option gives information about which student was selected by which company for summer and final placements. For this, the user has to supply the student's name and group.
To select the fifth option, the user should first register with the system as a member. To do this, open http://127.0.0.1:8080/registration.html in the Internet Explorer or Netscape browser, fill in the required information, and submit it to the server. The user then receives a congratulations message along with a PIN number, which is supplied to the system. This PIN number is used later to check mail through the email reader. All members should have a POP mail account on any POP mail server such as Yahoo or Hotmail. The POP mail account information (user id and password) should be given so that the system can connect to the POP mail account, get new mail information, and read the mails intended for the user. After supplying the PIN number, the user can hear from the system how many new mails he has and have the mail of interest read out.
4.6.2 Practical issues faced for deployment of the IVR system
(i) VoIP gateway: Developing a VoIP gateway requires a lot of infrastructure, such as DSP modules, and the implementation of protocols like SIP and MGCP, which was practically impossible to complete within the short period of this dissertation work.
(ii) Voice server: A voice server is equipped with a voice browser, a TTS engine, and a speech recognition engine. Given the limited time, the voice server model developed by IBM was used. However, it was not flexible in its functioning, as it was still in the development stage; only some of the functions incorporated in the voice server were used to develop the voice application. As VoIP is not available at present, it is difficult to adapt this application to the PSTN, so it was simulated in the desktop environment.
(iii) Dealing with speech recognition errors: There are three basic types of recognition errors. In the first, the speech recognition engine returns a result that does not match what the user actually said. This can have many causes, including:
• The audio quality is poor.
• Multiple choices in the active grammars sound similar, such as “Newark” and “New York” in a grammar of United States airports.
• The user utterance was not in any of the active grammars, but something from an active grammar sounded similar.
• The user has a strong or unusual accent.
• The user paused before finishing the intended utterance.
In the second, the speech recognition engine did not understand what the user said well enough to return anything at all. This type of error can occur in situations similar to those described above.
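In VoiceXML, errors of these kinds are typically caught with event handlers such as <nomatch> (an utterance outside the active grammars or a low-confidence result) and <noinput> (no speech detected); a hedged sketch with an invented field:

```xml
<field name="airport">
  <prompt>Which airport?</prompt>
  <grammar src="airports.gram" type="text/x-jsgf"/>
  <!-- The recognizer returned something, but nothing in the grammar matched. -->
  <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  <!-- The recognizer heard nothing at all. -->
  <noinput>I did not hear anything. <reprompt/></noinput>
</field>
```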
All of these practical issues were taken into consideration in developing the application.
Security in voice applications can be implemented at two different levels. One is at the
infrastructure level, involving the telephony network and Internet infrastructure. Most
VoiceXML browsers support the existing Web security infrastructure. They support
SSL and cookies to help manage security between the voice server and the Web server.
Communications may be secured with authentication, encryption, and data integrity
measures using existing telephony security technologies. Second is at the application
level, which can be implemented in any of the following three ways:
(i). The user id/password approach in which the application prompts for a user id and
pin code. In most cases, the user is asked to key in the entries instead of speaking (to
avoid overhearing).
(ii). The telephone number identifies the user id. In this approach, the user simply
enters his pin code, reducing the complexity. It is implemented in this application.
Most of the VoiceXML interpreters can identify the incoming phone number.
(iii). Speech verification (Voice Biometrics) authenticates the user, excluding the need
of PIN based verification. Here, the voiceprint samples are stored in the database at
the time the account is set up, to be compared against at the time of authentication.
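The second approach, the one implemented in this application, can be sketched as a DTMF-only PIN field (the servlet URI and form structure below are illustrative, not the application's actual code):

```xml
<form id="login">
  <field name="pin">
    <prompt>Please key in your pin number, followed by the pound sign.</prompt>
    <!-- Keyed rather than spoken entry avoids the PIN being overheard;
         "#" terminates the variable-length digit string. -->
    <dtmf type="text/x-jsgf">
      (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)+
    </dtmf>
    <filled>
      <submit next="http://127.0.0.1:8080/servlet/Login" namelist="pin"/>
    </filled>
  </field>
</form>
```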
Chapter 5
5.1 Conclusion
The interactive voice response system developed here makes use of the latest speech recognition engines, so that the speech user interface is efficient in recognizing the human voice. It promises a friendly user interface, as every stage of interaction was designed carefully and efficiently using the powerful voice language VXML.
IVR empowers users with more options regarding when, where, and how they use Internet services. By using speech, the most natural form of communication, and the existing, familiar global telephone network, the most pervasive communications network, and by enabling eyes-free and hands-free operation, this new mode of access promises to further accelerate the growth and maturity of Internet services.
The code is written so that there is a minimum amount of “dead air”, the silence the caller hears while the system fetches resources. VoiceXML provides several facilities to either eliminate or hide the delays associated with retrieving Web resources. To minimize delays, the system maintains a cache for VoiceXML documents, audio files, and other files used by applications. Normally, once the system has fetched a file over the Internet, it keeps a copy in the cache; if the application requests the file again, the system uses the cached copy. This is known as fast caching. Sometimes, even when a file is in the cache, the system should always check for a newer version of the file on the server from which it was originally fetched. This is known as safe caching.
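In VoiceXML 1.0 this choice is expressed per fetch through the caching attribute; a hedged sketch (the file names are illustrative):

```xml
<!-- A static audio prompt can be served from the cache directly ("fast"). -->
<audio src="welcome.wav" caching="fast"/>
<!-- A frequently updated document should be revalidated against the
     origin server before the cached copy is used ("safe"). -->
<goto next="mailbox.vxml" caching="safe"/>
```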
5.1.2 New way of Grammar development
Servlets develop the grammar by collecting data from the database. This is more efficient than the ordinary way of implementing grammars using reusable components, which are files that specify all probable input from the user by listing every combination of alphanumeric characters. Instead, servlets collect the required data from the database and form a grammar file from that information. A thread checks for new data entered into the data table; it runs every 10 seconds and regenerates the grammar files every time the table is updated.
The email reader lets the user simply dial a number from a telephone or cell phone and listen to his emails. This is a cost-effective and efficient way of checking email, especially for users who are always mobile.
As everyone knows, interactive voice response websites require a lot of infrastructure, such as speech recognition engines and voice servers. IVR applications mostly serve the untapped market of mobile and telephone users, for whom they are a cost-effective way of carrying out transactions. To make this possible, a VoIP gateway is required; as it takes a lot of time to develop, the application was simulated in the desktop environment instead. Improvements can be made at various stages of the application, as mentioned below.
(i). One can develop a much more efficient user-friendly interface than the existing
one.
(ii). One can develop a VoIP gateway and realize the goal of deploying the system in a real-time environment.
(iii). One can introduce more sophisticated speech recognition technologies and make the process of speech recognition more accurate than it is now.
Abbreviations
1. VXML - Voice Extensible Markup Language.
2. VSDK - Voice Server Development Kit.
3. PSTN - Public Switched Telephone Network.
4. IP - Internet Protocol.
5. DTMF - Dual Tone Multiple Frequency.
6. JSGF - Java Speech Grammar Format.
7. JSML - Java Speech Markup Language.
8. URI - Uniform Resource Identifier.
9. “Wizard of Oz” - A prototyping technique for IVR development.
List of Figures
1. Fig 2.1 - IVR network architecture of TJNET.
2. Fig 3.1 - Voice Web architecture.
3. Fig 3.1.2 - VoIP gateway.
4. Fig 4.1 - Application components and data flow.