
1. INTRODUCTION

"A voice browser is a device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities."

The definition of a voice browser, above, is a broad one. The fact that the system deals with speech is obvious given the first word of the name, but what makes a software system that interacts with the user via speech a "browser"? The information that the system uses (for either domain data or dialog flow) is dynamic and comes from somewhere on the Internet.

From an end-user's perspective, the impetus is to provide a service similar to what graphical browsers of HTML and related technologies provide today, but on devices that are not equipped with full browsers, or even with the screens to support them. This situation is only exacerbated by the fact that much of today's content depends on the ability to run scripting languages and third-party plug-ins to work correctly. Much of the effort concentrates on using the telephone as the first voice browsing device. This is not to say that the telephone is the preferred embodiment of a voice browser, only that the number of such access devices is huge, and that it sits at the opposite end of the continuum from the graphical browser, which highlights the requirements that make a speech interface viable. By the first meeting it was clear that this scope-limiting was also needed in order to make progress, given the significant challenges in designing a system that uses or integrates with existing content, or that automatically scales to the features of various access devices.

Voice browsing refers to using speech to navigate an application. These applications are written using parts of the Speech Interface Framework. In much the same way that Web applications are written in HTML and are rendered in a Web browser, speech applications are written in VoiceXML and are rendered via a Voice Browser.

2. VOICE BROWSER DOCUMENTS


2.1 Dialog Requirements:
"A prioritized list of requirements for spoken dialog interaction which any proposed markup language (or extension thereof) should address." The Dialog Requirements document describes properties of a voice browser dialog, including a discussion of modalities (input and output mechanisms combined with various dialog interaction capabilities), functionality (system behavior) and the format of a dialog language. A definition of the latter is not specified, but a list of criteria is given that any proposed language should adhere to.An important requirement of any proposed dialog language is ease-of-creation. Dialogs can be created with a tool as simple as a text-editor, with more specific tools, such as an (XML) structure editor, to tools that are special-purposed to deal with the semantics of the language at hand.

2.2 Grammar Representation Requirements:


It defines a speech recognition grammar specification language that will be generally useful across a variety of speech platforms used in the context of a dialog and synthesis markup environment. When the system or application needs to describe to the speech recognizer what to listen for, one way it can do so is via a format that is both human- and machine-readable.

2.3 Model Architecture for Voice Browser Systems:


"To assist in clarifying the scope of charters of each of the several subgroups of the W3C Voice Browser Working Group, a representative or model architecture for a typical voice browser application has been developed. This architecture illustrates one possible arrangement of the main components of a typical system, and should not be construed as a recommendation."

2.4 Natural Language Processing Requirements:


It establishes a prioritized list of requirements for natural language processing in a voice browser environment. The data that a voice browser uses to create a dialog can vary from a rigid set of instructions and state transitions, whether declaratively and/or procedurally stated, to a dialog that is created dynamically from information and constraints about the dialog itself. The NLP requirements document describes the requirements of a system that takes the latter approach, using as an example paradigm a set of tasks operating on a frame-based model. Slots in the frame that are optionally filled guide the dialog and provide contextual information used for task selection.
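As a purely illustrative sketch of the frame-and-slot idea (this is not a W3C format; the element names are invented), a frame for a hypothetical flight-booking task might look like:

    <frame task="book-flight">
      <!-- A filled slot: provided by the user's last utterance. -->
      <slot name="origin">London</slot>
      <!-- Empty slots: the dialog manager prompts to fill these, and
           their state guides which task is selected next. -->
      <slot name="destination"/>
      <slot name="date"/>
    </frame>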

2.5 Speech Synthesis Markup Requirements:


It establishes a prioritized list of requirements for speech synthesis markup which any proposed markup language should address. A text-to-speech system, which is usually a stand-alone module that does not actually "understand the meaning" of what is spoken, must rely on hints, such as prosodic markup for pitch, rate, emphasis and pauses, to produce an utterance that is natural and easy to understand and, moreover, evokes the desired meaning in the listener. In addition to these prosodic elements, the document also describes issues such as multilingual capability, pronunciation of words not in the lexicon, time synchronization, and textual items that require special preprocessing before they can be spoken properly.

3. STANDARDIZATION
Standardization of the voice browsing technique is driven by the World Wide Web Consortium (W3C), which develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding. This work within the W3C includes:

1. Voice Browser Working Group
2. Speech Interface Framework

3.1 Voice Browser Working Group:


It was established on 26 March 1999 and re-chartered through 31 January 2009. The W3C Voice Browser Working Group made the Speech Interface Framework possible. This framework allows developers to create speech-enabled applications that are based on Web technologies, and provides an environment that will be familiar to those experienced with Web development techniques. Applications are written using parts of the Speech Interface Framework: in much the same way that Web applications are written in HTML and run in a Web browser, speech applications are written in VoiceXML and are rendered through a Voice Browser. It is estimated that over 85% of Interactive Voice Response (IVR) applications for telephones (including mobile) use W3C's VoiceXML standard. The Voice Browser Working Group is coordinating its efforts to make the Web available on more devices and in more situations.

3.1.1. Aim:
The aim of the W3C Working Group is to enable users to speak and listen to Web applications by defining standard languages for developing Web-based speech applications. This Working Group concentrates on languages for capturing and producing speech and for managing the conversation between the user and the computer system, while a related group, the Multimodal Interaction Working Group, works on additional input modes including keyboard and mouse, ink and pen, etc.

3.1.2. W3C Recommendations:


Its Recommendations have been reviewed by W3C Members, by software developers, and by other interested parties, and are endorsed by the Director as Web standards.

3.2 Speech Interface Framework:


This framework includes:

3.2.1. VoiceXML:
A language for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken input and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Some of its versions are:

- VoiceXML 1.0: designed for creating audio dialogs.
- VoiceXML 2.0: uses the form interpretation algorithm (FIA).
- VoiceXML 2.1: 8 additional elements.
- VoiceXML 3.0: relationship between semantics and syntax.
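As a hedged illustration (a minimal sketch, not taken from the specification), the following VoiceXML 2.0 document defines a single form whose field is filled by the form interpretation algorithm; the weather service and the cities.grxml grammar file are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="weather">
        <!-- The FIA visits each unfilled field, plays its prompt,
             and listens using the field's grammar. -->
        <field name="city">
          <prompt>Which city would you like the weather for?</prompt>
          <grammar src="cities.grxml" type="application/srgs+xml"/>
          <filled>
            <prompt>Getting the weather for <value expr="city"/>.</prompt>
          </filled>
        </field>
      </form>
    </vxml>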

3.2.2. Speech Recognition Grammar Specification (SRGS) 1.0:


A document language that can be used by developers to specify the words and patterns of words to be listened for by a speech recognizer or other grammar processor.
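A hedged sketch of such a grammar in the SRGS XML form, listing the city names the hypothetical cities.grxml file from the previous example might contain:

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" mode="voice" root="city">
      <!-- The recognizer listens for exactly one of these city names. -->
      <rule id="city" scope="public">
        <one-of>
          <item>London</item>
          <item>Paris</item>
          <item>Sydney</item>
        </one-of>
      </rule>
    </grammar>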

3.2.3. Speech Synthesis Markup Language (SSML):


A markup language for rendering a combination of prerecorded speech, synthetic speech, and music. Some of its versions are:

- Speech Synthesis Markup Language (SSML) 1.0
- Speech Synthesis Markup Language (SSML) 1.1
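A minimal SSML 1.0 sketch (illustrative only; the sentence content and the "time" interpret-as value are example choices) showing the kinds of prosodic hints discussed in Section 2.5 — a pause, a pitch and rate change, and a hint on how to speak a time expression:

    <?xml version="1.0" encoding="UTF-8"?>
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      <p>
        Your flight departs at
        <say-as interpret-as="time">8:45am</say-as>.
        <break time="500ms"/>
        <prosody pitch="high" rate="slow">Please arrive one hour early.</prosody>
      </p>
    </speak>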

3.2.4. Semantic Interpretation (SISR):


A document format that represents annotations to grammar rules for extracting semantic results from recognition, e.g., Semantic Interpretation (SISR) 1.0.
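A hedged sketch of how SISR 1.0 tags might annotate an SRGS rule so that several spoken variants all yield the same semantic result (the phrasing is illustrative):

    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="confirm" tag-format="semantics/1.0">
      <rule id="confirm">
        <one-of>
          <!-- Each tag assigns the rule's semantic result. -->
          <item>yes <tag>out = true;</tag></item>
          <item>yeah <tag>out = true;</tag></item>
          <item>no <tag>out = false;</tag></item>
        </one-of>
      </rule>
    </grammar>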

3.2.5. Pronunciation Lexicon Specification (PLS):


A representation of phonetic information for use in speech recognition and synthesis, e.g., Pronunciation Lexicon Specification (PLS) 1.0.
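A minimal PLS 1.0 sketch mapping a written word to an IPA pronunciation; the word choice is illustrative:

    <?xml version="1.0" encoding="UTF-8"?>
    <lexicon version="1.0"
             xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
             alphabet="ipa" xml:lang="en-US">
      <lexeme>
        <grapheme>tomato</grapheme>
        <phoneme>təˈmeɪtoʊ</phoneme>
      </lexeme>
    </lexicon>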

4. DESIGN OF THE SYSTEM


This section aims to give an overview of the design of a phone browser, TeleBrowse. The prototype is assumed to have a phone-based physical medium. The architecture is introduced first, followed by the functions provided by the prototype.

4.1 System Architecture:


The system consists of three modules: a voice driven interface, an HTML translator and a command processor. The voice driven interface of the prototype includes a voice recogniser, a DTMF recogniser, a speech synthesiser and an audio engine. The physical medium for communication is intended to be a telephone, where the user can be remote from the system's location, but it can also be simulated with a microphone/speaker combination connected directly to a PC running the system. A voice recogniser and a DTMF (Dual Tone Multi-Frequency) recogniser are available for input to the system, while the speech/audio synthesis provides output to the user. Hence, a command issued to the system can take one of two forms: spoken commands by the user, or DTMF tones punched in on a phone's keypad. DTMF assigns a specific frequency, or tone, to each key on a touch-tone telephone so that it can easily be identified by a microprocessor. The purpose of DTMF in the architecture is cost effectiveness: although voice recognition technology is capable of handling most of the translation task, it is an expensive one, so DTMF is used for issuing commands that are not complex enough to warrant a voice driven format.

The voice driven interface essentially accepts spoken words as input. The input signals are then compared against a set of pre-defined commands. If there is an appropriate match, the corresponding command is output to the command processor. The engine also handles auxiliary functions (in conjunction with the speech synthesis engine) such as the confirmation of commands where appropriate, speed control and the URL dictation system. The text-to-speech/audio engine is responsible for producing the only output from the system back to the user. The output is in the form of spoken (synthesised) text, or sounds for the auditory icons. The input for these two engines comes from the command processor, and can either be a text stream consisting of the actual information to be read out as the content of the page (processed by the TTS engine), or an instruction to play a sound as an auditory icon to mark a document structure (processed by the audio engine).

The other major component is the HTML translator. When a user requests an HTML document, the contents of the document must first be parsed and translated to a form suitable for use in the audio realm. This includes the removal of unwanted tags and information, and the re-tagging of structures for compliance and subsequent use with the audio output engine of the interface. The translator also summarises information about the document, such as the title and the positions of various structures, for use with the document summary feature.

The command processor sits between the HTML translator and the interface, and is responsible for acting on the voice/DTMF commands issued by a user. The processor retrieves HTML documents from the WWW and feeds them to the HTML translation algorithm. It also controls the navigation between web pages, and the functionality associated with this navigation (bookmarks, history list, etc.). This component also processes all the other system and housekeeping commands associated with the program. It outputs a stream of marked text to the speech synthesis/audio engine; the stream consists of a combination of actual textual information and tags to mark where audio cues should be played.

"The architecture diagram was created as an aid to how we structure our work into subgroups. The diagram will help us to pinpoint areas currently outside the scope of existing groups." Although individual instances of voice browser systems are apt to vary considerably, it is reasonable to try to point out architectural commonalities as an aid to discussion, design and implementation. Not all segments of this diagram need be present in any one system, and systems which implement various subsets of this functionality may be organized differently. Systems built entirely from third-party components, with an imposed architecture, may end up with unused or redundant functional blocks.

Two types of clients are illustrated: telephony and data networking. The fundamental telephony client is, of course, the telephone, either wireline or wireless. The handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes for ASR barge-in over audio output. A speakerphone will also require an acoustic echo canceller to remove room echoes. The data network interface will require only acoustic echo cancellation if used with an open microphone, since there is no line echo on data networks. The IP interface is shown for illustration only; other data transport mechanisms can be used as well.

The model architecture is shown below. Solid (green) boxes indicate system components, peripheral solid (yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes indicate information flows.

Fig. 4.1: System architecture

5. LEVELS OF VOICE BROWSING


To understand voice browsing better in its current form, it can be examined at three levels.

5.1. Voice Browsing:
This is the first level at which to examine the technique of voice browsing. The simplest way to understand the voice browser and the voice web is to consider the web itself. People all over the world visit websites and get visual feedback; the voice web, by contrast, contains voice sites where the feedback is delivered through dialogues. The most basic example is calling a cellphone operator's portal, where speech recognition software presents the caller with a series of options such as recharging an account, talking to a customer care executive, or listening to new offers. The technology of voice browsing has brought costs down considerably: a human operator costs roughly a rupee per minute of talking to customers, while an automated call costs about 10 paise per minute. Voice browsing also has uses in the corporate sector, especially in banks and airlines.

5.2. Voice Browsing the World Wide Web:


This is the second level at which to examine the voice browsing technique. We can browse, through voice, those websites which offer voice portals. To maintain customer loyalty, these portals offer voice browsing of personal content. To stake a claim in this market, AOL bought Quack.com. The voice portal market is expanding almost everywhere, be it Europe, the US or Asia.

5.3. The Voice Web:


The third level is the voice web. Many companies have introduced forums where programmers can set up voice sites, so as to increase interest in voice browsing and speech recognition. These forums grow into voice webs, which sprout voice auction sites and voice-based chat rooms.



6. WEB PAGE DESIGN STRATEGIES FOR VOICE BROWSERS


The major obstacle to wide-scale commercial deployment of voice browsers for the web is not the technology, but the ease (or difficulty!) with which web page designers can add speech support to their sites. Authoring a web page for any specific type of user agent or system configuration should never be a completely separate subject with arcane new techniques developed for each special need, but rather an application of the common set of Universally Accessible Design principles that should be part of every web author's repertoire. With few exceptions, pages should never be designed for certain types (or brands) of browsers, but should instead be designed for all uses (and potential uses) of the information. All web documents should be equally accessible to voice browsers and to visual user agents. The HTML Writers Guild's studies, and discussions with web authors, have shown that the primary obstacle to universal accessibility is ignorance. There are few cases where a conscious decision has been made to produce a generally inaccessible web page; rather, the author is simply unaware of the need to create accessible pages and of the techniques by which that is done. Once enlightened, most web authors eagerly embrace the concept of universal accessibility, since the benefits are many and obvious. In this section, some of the primary techniques of Universally Accessible Design are briefly listed as they relate to voice browsers, with ideas as to how authors can implement these considerations when designing web pages.

6.1. Aural Style Sheets:


Aural style sheets are part of the Cascading Style Sheets, Level 2 specification, and provide a level of control over spoken text roughly analogous to that for displayed/printed text. Aural rendering of a document, combining speech synthesis and "auditory icons", is already commonly used by the blind and print-impaired communities. Often such aural presentation occurs by converting the document to plain text and feeding this to a screen reader: software or hardware that simply reads all the characters on the screen. This results in a less effective presentation than would be the case if the document structure were retained. Style sheet properties for aural presentation may be used together with visual properties (mixed media) or as an aural alternative to visual presentation. The use of an aural style sheet (or of aural style sheet properties included in a general style sheet document) allows the author to specify characteristics of the spoken text such as volume, pitch, speed, and stress; to indicate pauses and insert audio "icons" (sound files); and to show how certain phrases, acronyms, punctuation, and numbers should be voiced. Combined with the @media selector for media types, a well-crafted aural style sheet can greatly increase the accessibility of a web document in a voice browser. Further investigation in this area is encouraged, especially in the area of example aural style sheets and suggestions for authoring techniques.
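As a hedged illustration (a minimal sketch using CSS 2 aural properties; the sound file names are hypothetical), an aural style sheet might look like this:

    @media aural {
      /* Read headings in a deeper, more deliberate voice. */
      h1, h2  { voice-family: male; stress: 80; richness: 90; }
      /* Announce links with a short sound before the link text. */
      a       { cue-before: url("link.wav"); }
      /* Emphasised text is spoken at a higher pitch. */
      em      { pitch: high; }
      /* Spell out acronyms letter by letter. */
      acronym { speak: spell-out; }
      /* Pause briefly between paragraphs. */
      p       { pause-after: 300ms; }
    }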

6.2. Rich Meta-Content:
HTML 4.0 gives the author the ability to embed a great deal of meta-content into a document, specifying information which expands on the semantic meaning of the content and allows for specialized rendering by the user agent. In other words, by using features found in HTML 4.0 (and to a limited extent, in other versions of HTML), an author can give better information to the browser, which can then make the document easier to use. Judicious and ample use of meta-content within a document allows the author to not simply specify the content, but also suggest the meaning and relationship of that content in the context of the document. Voice browsers can then use that meta-information as appropriate for their presentation and structural needs.
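For instance (a hedged sketch; the page content is invented), meta-content can be supplied through META elements and TITLE attributes:

    <head>
      <title>Quarterly Sales Report</title>
      <meta name="keywords" content="sales, quarterly, report">
      <meta name="description" content="Regional sales figures for the quarter">
    </head>
    <body>
      <p>Read <a href="report.html"
                 title="Full text of the quarterly report">the report</a>.</p>
    </body>

A voice browser might read the description when summarising the page, and the TITLE text when listing its links.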

6.3. Planned Abstraction:
One use of meta-content information is the development of pages designed to be abstracted. The typical document found on the web can be quite lengthy; finding information by listening to a web page read out loud takes longer than visually scanning the page, especially when most web pages are designed for visual use. Thus, most voice browsers will provide a method for abstracting a page: presenting one or more outlines of the page's content based on a semantic interpretation of the document.


Examples of potential or valid abstraction techniques include:


- Listing all the links and link text on a page.
- Forming a structure based on the H1, H2, ... H6 headers.
- Summarizing table data.
- Scanning for TITLE attributes in elements and presenting a list of options for expansion.
- Vocalizing any "bold" or emphasized text.
- Digesting the entire document into a summary based on keywords, as some search engines provide.

There are any number of other options available for voice browser programmers to use to provide short, easily digestible versions of web content to the browser user. This suggests that the web author should provide as much meta-content as possible, as well as careful use of HTML elements in their proper manner. Specific techniques include the following (a short sketch follows the list):

- Useful choices for link text (e.g., "the report is available" instead of "click here").
- Appropriate use of heading tags to define document structure, not simply for size/formatting.
- Use of the SUMMARY attribute for tables.
- Use of STRONG and EM where appropriate, providing benefits for both vocal and visual "scanability".
- Use of META elements with KEYWORDS and SUMMARY content.
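A brief hedged sketch pulling these techniques together (the document content is invented):

    <h1>Annual Report</h1>
    <h2>Financial Summary</h2>
    <p>The full <a href="report.html">text of the report</a> is
       <em>now available</em>.</p>
    <table summary="Revenue by quarter, one row per quarter">
      <tr><th>Quarter</th><th>Revenue</th></tr>
      <tr><td>Q1</td><td>1.2M</td></tr>
    </table>

An abstracting voice browser can build an outline from the headings, list the meaningful link text, and read the table summary in place of the full table.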

6.4. Alternative Content for Unsupported Media Types:


The "poster child" for web accessibility is the ALT attribute, which allows alternative text to be specified for images; if a user agent cannot display the visual image, the ALT text can be used instead. Widespread use of the ALT attribute by all sites on the Internet would likely double the accessibility of World Wide Web with such a simple change. Web authors who do not correctly use ALT text are seriously damaging the usability of the entire medium!


For voice browsers, ALT text is vitally important, since images cannot be rendered aurally at all. Especially when an image is used as part of a link, alternative content must be provided so that the voice browser can accurately render the page in a manner useful to the user. In addition to the ALT attribute for IMG elements, HTML 4.0 provides a number of other ways of specifying alternative content that can be used by a browser when an unsupported media type is encountered. Some of these include (a sketch follows the list):

- ALT attributes for image map AREAs, APPLETs, and image INPUT buttons.
- Text captions and transcripts for multimedia (video and audio).
- NOSCRIPT elements when including scripting languages, as voice browsers may be unable to process JavaScript instructions.
- NOFRAMES elements when using frame sets, as frames are a very visually oriented method of document display.
- Use of nested OBJECT elements to include a wide variety of alternative content for many media types.
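A hedged sketch of such fallbacks (the file names are hypothetical):

    <img src="logo.gif" alt="TeleBrowse home page">

    <object data="tour.mov" type="video/quicktime">
      A narrated tour of the site.
      <a href="tour-transcript.html">Read the transcript instead.</a>
    </object>

    <noscript>
      <a href="search.html">Use the non-script search page.</a>
    </noscript>

A voice browser that cannot play the movie or run the script simply renders the nested content instead.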


7. FUNCTIONALITY
All communication from the user to the system is made by issuing voice commands or using DTMF tones. Such commands are arranged into objects known as menus. Depending upon the functionality required or available, different menus are available at different points in the program's execution. A grammar set is defined to recognise the speech commands. Some of these rules are for administrative control, for example:

  <exit|quit [program|application|TeleBrowse]>
  speak <faster|slower>
  Where am I?
  What is my homepage?

The other rules are used to control navigation; these are anticipated to be the most frequently used commands. Navigation is supported in various ways: within the same page (intra-page navigation), browsing a new web page (inter-page navigation), bookmarks, the history list, moving by document structure, or following a hyperlink in the web page. The following grammars display the nature of these rules (an SRGS rendering of one rule follows the list):

  Start browsing by <location|bookmark|homepage>
  Maintain bookmarks
  Start reading [all|again]
  Load <location|bookmark|homepage>
  Go to the history list
  Jump <forwards|backwards> x <structure>
  <Next|Previous> <structure>
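As a hedged sketch, the <Next|Previous> <structure> rule might be expressed in W3C SRGS as follows; the prototype's own grammar format is not specified here, and the set of structures shown is abbreviated (the full set is given in the next paragraph):

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="move">
      <!-- <Next|Previous> <structure>: move between structures in a page. -->
      <rule id="move">
        <one-of>
          <item>next</item>
          <item>previous</item>
        </one-of>
        <ruleref uri="#structure"/>
      </rule>
      <rule id="structure">
        <one-of>
          <item>paragraph</item>
          <item>link</item>
          <item>list</item>
          <item>table</item>
        </one-of>
      </rule>
    </grammar>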

A <structure> is one of paragraph, link, anchor, level 1/2/3 heading, list, page or table, and x represents a positive integer value. All three versions of this command represent one action: moving between structures within a document. Another way to navigate to a specific target page is via dictation. Dictation is invoked whenever a browse-by-location command is requested, and it is responsible for fetching a URL address from the user. Users dictate to the system by saying words representing single letters, which improves recognition accuracy. A good example is the military alphabet: alpha for a, bravo for b, charlie for c, etc. The grammar recognises this military code, and also common animals, such as frog for f. Macros and shortcuts are also used to simplify the dictation process. The http:// at the beginning of every URL is added automatically, and the system recognises phrases like World Wide Web, company, and aussie, among many more, to represent www., .com, and .au respectively. The dictation menu also allows corrections to be made, a review of what has been dictated so far, and the ability to restart the dictation session.

Output from the system is either synthesised text or sounds (as auditory icons). The synthesised text can represent either actual information being read from a web page, or feedback to the user about the system's operation. When a page is to be read out (post-translation), the page is broken up and analysed one piece at a time. Two situations can occur: if the piece is a tag with an associated auditory icon, the icon is played; if the piece is simply text, it is synthesised into voice. Typical applications of auditory icons include a creaking door to represent an internal link (a link to an anchor within the same web page), a doorbell to represent an email address, or the clicking sound of a camera shutter to indicate an image. When certain tags are encountered (end of paragraph, end of list, end of table row, etc.), speaking ceases and the user is returned to the menu they were last at. Alternatively, if a user wishes to interrupt the speaking prior to the next break point, the interrupt key * can be used.


8. EXPERIMENTS AND RESULTS


The TeleBrowse prototype was developed under the Visual Basic 6 environment. Three ActiveX controls were used in conjunction with the VB project. The Microsoft Telephony Control (see http://www.microsoft.com/speech) provides the voice recognition engine, speech synthesis engine, and audio output engine. The Microsoft Internet Transfer Control retrieves HTML documents from the Web using the HTTP protocol, while the HTML Zap Control (Newcomb, 1997) provides a simple interface for analysing HTML documents.

The primary goal of the evaluation of the TeleBrowse prototype was to determine the usability and acceptance of the application as a voice-driven web-browsing tool. Measurement is done via the transformed data. Operational efficiency is represented by the physical interface, speech recognition and synthesis capabilities, document and page navigation, and the prompts and feedback from the prototype. The quality aspect is evaluated through the availability, integrity and usefulness of the data after translation, and also the support of various structures in the original HTML page, e.g. headings, links, tables.

The experiment was carried out on two different groups of users who had similar characteristics: they were all above 18 years old, they had prior experience of current web browsing products, and they had limited knowledge of the WWW and the related technology. The major difference between the two groups was that subjects in the first group (G1) were normally sighted, while those in the second group (G2) were visually impaired, or to be precise, suffered from complete blindness. In other words, subjects in G1 were familiar with typical visually oriented web browsers such as Microsoft Internet Explorer. In this experiment, there were five people in G1 and four subjects in G2. Evaluation sessions were run on a one-to-one basis. There was also a general discussion involving the group at the end of the individual experiments. No interaction occurred between subjects from differing groups.

Each subject was briefed about the operation of the prototype and the methods for interacting with it. Along with the briefing, each user was given a sheet on which was printed a vocabulary listing of the phrases the program would respond to, and the points in the program's execution at which such phrases could be used. For the subjects in G2, the sheet was printed using a Braille printing device. The sheet also contained a set of tasks the subjects had to complete using the program. Some of the tasks included: starting the application, checking what the current homepage was set to, commencing reading of the web document once loaded, jumping to another location by dictating a URL address directly into the system, etc. Accompanying each task was a description of the behaviour the system would demonstrate during the task's execution, and the phrases to use in interaction with the system to complete the task.

All tasks were first completed using the software running in emulation mode on a laptop computer. The speech recognition engine was not trained to adapt to any specific person. After completing this and gaining a degree of experience in using the system, subjects were then given an opportunity to use the system as they chose over the phone, completing the effect of a phone-based web-browsing tool. Subjects were also asked to view the same web pages that had just been viewed with the prototype using a browser they would typically use. In the case of G1, this was either Microsoft Internet Explorer or Netscape Navigator. In the case of G2, this was again Microsoft Internet Explorer, but this time with the addition of the JAWS screen reading program.

After using the prototype for a sufficient amount of time (in most cases a period of about twenty minutes to half an hour), each subject was asked to complete a questionnaire to record their experiences with the prototype. The questionnaire was arranged into two parts: the measurement of efficiency and of integrity. Some of the questions asked the participants to give numerical scores on a scale of 1 (very poor) to 7 (very good). Other questions were free format, where the subjects could provide their own comments. This gave an initial and very broad insight into how subjects from each group responded to using the prototype in the experiments. By graphing both groups' results on the same axes, a comparison of acceptance between the two groups could also be seen. In addition to the rated-response questions, the evaluation questionnaire contained a further thirteen questions of free-form nature. These free-form questions were designed to draw out any comments, problems, criticism, or general feedback from the test subjects. There were, on average, two such questions per section of the evaluation criteria.

8.1. Voice Recognition:
This was one of the more poorly rated criteria. While both groups considered it a necessary technology for the idea of a phone browser, subjects suffered from its shortcomings, and it did result in a loss of efficiency for most users. Subjects from G2 responded more favourably than those from G1. Of particular concern was the dictation of URL addresses. This was noted as a shortcoming in the interface by every user from both groups. The idea of having to spell out URL addresses one letter at a time (and wait for confirmation of each letter) was not well received. The idea of using shortcuts like company to spell out .com was considered a strong improvement, and this technique should be explored further.

8.2. Speech Synthesis:
There was little or no problem with this sub-system. Subjects found the voice easy to understand and of suitable volume and pitch. The major contrast between the two groups was the usage of the speed control feature. Subjects from G1 saw no reason to adjust the speed of the synthesised voice. They were content with the default normally paced speaking voice. However, subjects from G2 tended to change the speed to a much higher rate before doing anything else.

8.3. Navigation:
The overall ability of intra-page and inter-page navigation using the system was rated favourably by both groups. The use of auditory icons to mark HTML structures was viewed by the G2 subjects as superior to any similar screen reader marking scheme. Subjects from G1 also appreciated the ability of the auditory icons to quickly and simply mark structures in a document, in a way that was natural and easy to remember. The idea of metaphorically matching the meaning of sounds with the structures they represented was well liked and accepted. There was little problem remembering the mapping of sound to structure, especially after using the system for an extended amount of time. One criticism of the auditory icons was that they appeared too frequently, and could be seen as breaking up the flow of text unnecessarily. A comment made by many subjects was that the prototype offered functionality similar and familiar to that of browsers they had previously used. Thus, features such as bookmarks, the history or "go" list and the ability to save a home page were all well received. The ability to follow links contained in documents was also well liked: using different auditory icons for the different types of links allowed subjects to know in advance whether the link led to a target within the same document or to another document. Again, though, the problem of dictating URL addresses was brought up.


8.4. Online Help:
This section of the criteria did not rate well, due to the lack of help associated with the commands and prompts used in the system. All users thought that more detailed, context-based help explaining the meaning and usage of commands should be available at any point during the system's operation, as opposed to the simple listing of commands currently available. Certainly the need to refer to other supporting documentation for more detailed information should be avoided, as access to this information would not be available in the environments where a phone browser might be used. The tutorial available from the system's main menu was well accepted in terms of its content, but perhaps a similarly detailed tutorial should be available at every menu in the system, customised for the relevant set of commands.

8.5. Information & Structure:


There was a marked difference between the two evaluation groups in these two aspects. G2 was more willing than G1 to accept the level of integrity of the information presented by the prototype. Users in G1 commented on the inability to quickly visualise an entire page at a glance, and reported frustration at being forced to listen to the content in sequential fashion.

8.6. Overall Impression:
The subjects from both evaluation groups accepted the prototype as a viable method of browsing the web in the audio realm by phone. The efficiency of the product was quite highly regarded by most subjects, and the system interface fared very well. The only major problem was the dictation of URL addresses to the system.


9. FUTURE OF VOICE TECHNOLOGY


Speech technology is expected to grow rapidly, with the voice portal market reaching billions in just a few years. The Kelsey Group estimates that the voice browsing market will reach 6.5 billion dollars, while Ovum estimates a world market of 26 billion dollars. Given the variation in these figures, the actual growth of the voice technology industry is anyone's guess. Navigating a WAP device by scrolling through many lists is very difficult; hands-free interaction enables easy communication between the user and the system.

Voice browsing can be used to access three kinds of information:

9.1. Business:


Information such as automated telephone ordering services, support desks, order tracking, airline arrival and departure information services, cinema and theatre booking services, home banking services, etc. can be retrieved very easily using voice browsing.

9.2. Public:
A voice browser can be used to access services like local, national and international news, along with community information such as weather forecasts, traffic conditions, school closures and events. It can also be used to obtain national and international stock market information, and for business and e-commerce transactions.

9.3. Personal Use:
Voice browsing is used to access personal information such as voice mail, personal horoscopes, personal newsletters, calendars, and address and telephone lists. In future it is expected that voice browsing will also become visual, i.e., multimodal. The greatest achievement, however, would be the integration of voice browsing with all types of operating systems; this would surely make voice browsing available to each and every application.


10. BENEFITS

Voice is a very natural user interface because it enables the user to speak and listen using skills learned in childhood. Currently, users speak and listen to telephones and cell phones with no display when interacting with voice browsers. Some voice browsers may have small screens, such as those found on cell phones and palm computers. In the future, voice browsers may also support other modes and media, such as pen, video, and sensor input, and graphics, animation, and actuator controls as output. For example, voice and pen input would be appropriate for Asian users whose spoken language does not lend itself to entry on traditional QWERTY keyboards. Some voice browsers are portable: they can be used anywhere, at home, at work, and on the road. Information will be available to a greater audience, especially to people who have access to handsets, either telephones or cell phones, but not to networked computers. Voice browsers present a pragmatic interface for functionally blind users, or for users needing Web access while keeping their hands and eyes free for other things. And voice browsers present an invisible user interface, freeing workspace previously occupied by keyboards and mice.


11. CONCLUSION

The TeleBrowse prototype developed for this project proved successful in the evaluation as a telephone-based web-browsing tool. Problems were highlighted in the user study. Some of these problems relate to the features currently available in the prototype, for example the navigation of a table, the interruption of output by voice rather than the * key on the phone pad, and the missing function of filling in forms on a web page. At the same time, there are deeper issues which require more research, including the dictation function (how and where contextual information can be acquired to improve the accuracy of dictation) and the visualisation process (what kind of model can be made available that provides a feeling of information arriving in parallel). Testing the prototype in real-life environments rather than under standard laboratory conditions, e.g. walking in the street or driving a car, is also an important direction for future work; this may uncover further issues relating to situated interaction. Research into domain-specific speech interaction models may also improve the accuracy and effectiveness of the system. It is the intention of this project to create an environment where web surfing with voice is possible and the surfing experience is a pleasant one. The knowledge accumulated in producing the first-generation prototype has given us many insights and laid the groundwork for the second-generation prototype.


12. BIBLIOGRAPHY

http://www.w3.org/TR/REC-CSS2/aural.html
http://www.hwg.org/opcenter/w3c/voicebrowsers.html
http://www.w3.org/Voice/1998/Workshop/Michael-Brown.html
http://www.oasis-open.org/cover/wap-wml.html
Newcomb, M. (1997). HtmlZap ATL ActiveX Control. http://www.miken.com/htmlzap/
http://www.brookes.ac.uk/schools/cms/research/speech/publications/43hft97.htm

