1. INTRODUCTION
"A voice browser is a device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities." This definition of a voice browser is a broad one. That the system deals with speech is obvious from the first word of the name, but what makes a software system that interacts with the user via speech a "browser"? The information that the system uses (for either domain data or dialog flow) is dynamic and comes from somewhere on the Internet. From an end-user's perspective, the aim is to provide a service similar to what graphical browsers of HTML and related technologies provide today, but on devices that are not equipped with full browsers, or even with the screens to support them. This situation is only exacerbated by the fact that much of today's content depends on scripting languages and third-party plug-ins to work correctly. Much of the effort concentrates on using the telephone as the first voice browsing device. This is not to say that the telephone is the preferred embodiment of a voice browser, only that the number of such access devices is huge, and that, because the telephone sits at the opposite end of the continuum from the graphical browser, it highlights the requirements that make a speech interface viable. By the first meeting it was clear that this limiting of scope was also needed in order to make progress, given the significant challenges in designing a system that uses or integrates with existing content, or that automatically scales to the features of various access devices. Voice browsing refers to using speech to navigate an application. These applications are written using parts of the Speech Interface Framework: in much the same way that Web applications are written in HTML and rendered in a Web browser, speech applications are written in VoiceXML and rendered by a voice browser.
to a dialog that is created dynamically from information and constraints about the dialog itself. The NLP requirements document describes the requirements of a system that takes the latter approach, using an example paradigm of a set of tasks operating on a frame-based model. Slots in the frame that are optionally filled guide the dialog and provide contextual information used for task-selection.
3. STANDARDIZATION
Standards for voice browsing were developed by the World Wide Web Consortium (W3C), which develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding. The relevant W3C work includes: 1. the Voice Browser Working Group, and 2. the Speech Interface Framework.
3.1.1.Aim:
The aim of the W3C Working Group is to enable users to speak and listen to Web applications by developing standard languages for building Web-based speech applications. This Working Group concentrates on languages for capturing and producing speech and for managing the conversation between user and computer system, while a related group, the Multimodal Interaction Working Group, works on additional input modes including keyboard and mouse, ink and pen, etc.
3.2.1.VoiceXML:
A language for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Some of its versions are: VoiceXML 1.0, designed for creating audio dialogs; VoiceXML 2.0, which uses the form interpretation algorithm (FIA); VoiceXML 2.1, which adds eight additional elements; and VoiceXML 3.0, which clarifies the relationship between semantics and syntax.
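As a sketch of what such a document looks like, the following minimal VoiceXML 2.0 dialog prompts the caller for one spoken value and reads it back. The element names follow the VoiceXML 2.0 specification; the form name, prompt text, and grammar file name are invented for this illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A single form (dialog) containing one input field -->
  <form id="greeting">
    <field name="colour">
      <prompt>Welcome. Please say your favourite colour.</prompt>
      <!-- External SRGS grammar constraining what may be recognised -->
      <grammar type="application/srgs+xml" src="colour.grxml"/>
      <filled>
        <!-- Speak back the recognised value, then end the session -->
        <prompt>You said <value expr="colour"/>. Goodbye.</prompt>
        <exit/>
      </filled>
    </field>
  </form>
</vxml>
```

The form interpretation algorithm mentioned above visits each unfilled field in turn, plays its prompt, and collects input against its grammar.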
(processed by the TTS engine), or an instruction to play a sound as an auditory icon to mark a document structure (processed by the audio engine). The other major component is an HTML translator. When a user requests an HTML document, its contents must first be parsed and translated into a form suitable for use in the audio realm. This includes the removal of unwanted tags and information, and the retagging of structures for compliance and subsequent use with the audio output engine of the interface. The translator also summarises information about the document, such as the title and the positions of various structures, for use with the document summary feature. A command processor sits between the HTML translator and the interface. The command processor is responsible for acting on the voice/DTMF commands issued by a user. The processor retrieves HTML documents from the WWW and feeds them to the HTML translation algorithm. It also controls the navigation between web pages and the functionality associated with this navigation (bookmarks, history list, etc.). This component also processes all the other system and housekeeping commands associated with the program. A stream of marked text is output to the speech synthesis/audio engine. The stream consists of a combination of actual textual information and tags marking where audio cues should be played. "The architecture diagram was created as an aid to how we structure our work into subgroups. The diagram will help us to pinpoint areas currently outside the scope of existing groups." Although individual instances of voice browser systems are apt to vary considerably, it is reasonable to try to point out architectural commonalities as an aid to discussion, design and implementation. Not all segments of this diagram need be present in any one system, and systems which implement various subsets of this functionality may be organized differently.
Systems built entirely from third-party components, with an imposed architecture, may end up with unused or redundant functional blocks. Two types of clients are illustrated: telephony and data networking. The fundamental telephony client is, of course, the telephone, either wireline or wireless. The handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be either tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes for ASR barge-in over audio output. A speakerphone will also require an acoustic echo canceller to remove room echoes. The data network interface will require only acoustic echo cancellation if used with an open microphone, since there is no line echo on data networks. The IP interface is shown for illustration only; other data transport mechanisms can be used as well. The model architecture is shown below. Solid (green) boxes indicate system components, peripheral solid (yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes indicate information flows.
5.1.Voice Browsing:
This is the first level in examining the technique of voice browsing. The simplest way to understand the voice browser and the voice web is to consider the web itself. People all over the world visit websites and get visual feedback. The voice web, by contrast, contains voice sites where the feedback is delivered through dialogue. The most basic example is calling a cellphone operator's portal, where speech recognition software presents the caller with a series of options, such as recharging an account, talking to a customer-care executive, or listening to new offers. The technology of voice browsing has brought down costs considerably: where a human operator would cost a rupee per minute to talk to customers, an automated call costs around 10 paise per minute. Voice browsing also has uses in the corporate sector, especially in banks and airlines.
an aural alternative to visual presentation. The use of an aural style sheet (or aural style sheet properties included in a general style sheet document) allows the author to specify characteristics of the spoken text such as volume, pitch, speed, and stress; indicate pauses and insert audio "icons" (sound files); and show how certain phrases, acronyms, punctuation, and numbers should be voiced. Combined with the @media selector for media types, a well-crafted aural style sheet can greatly increase the accessibility of a web document in a voice browser. Further investigation in this area is encouraged, especially in the form of example aural style sheets and suggestions for authoring techniques.
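A short example may help here. The following sketch uses the aural properties defined in CSS2; the cue sound file name is invented for this illustration, and real voice browsers vary in which of these properties they honour.

```css
@media aural {
  /* Read top-level headings louder and more slowly, with a pause after */
  h1, h2 {
    volume: loud;
    speech-rate: slow;
    pause-after: 500ms;
  }
  /* Play an audio "icon" before each hyperlink (file name is illustrative) */
  a { cue-before: url("link.wav"); }
  /* Spell acronyms out letter by letter rather than pronouncing them */
  acronym { speak: spell-out; }
  /* Give emphasised text extra inflection */
  em { stress: 60; richness: 80; }
}
```

Because the rules sit inside `@media aural`, visual browsers ignore them entirely, so the same document can carry both presentations.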
6.2.Rich Meta-Content:
HTML 4.0 gives the author the ability to embed a great deal of meta-content into a document, specifying information which expands on the semantic meaning of the content and allows for specialized rendering by the user agent. In other words, by using features found in HTML 4.0 (and to a limited extent, in other versions of HTML), an author can give better information to the browser, which can then make the document easier to use. Judicious and ample use of meta-content within a document allows the author to not simply specify the content, but also suggest the meaning and relationship of that content in the context of the document. Voice browsers can then use that meta-information as appropriate for their presentation and structural needs.
6.3.Planned Abstraction:
One use for meta-content information is the development of pages which are designed to be abstracted. The typical web document can often be quite lengthy; finding information by listening to a web page read aloud takes longer than visually scanning a page, especially when most web pages are designed for visual use. Thus, most voice browsers will provide a method for abstracting a page: presenting one or more outlines of the page's content based on a semantic interpretation of the document.
- Listing all the links and link text on a page.
- Forming a structure based on the H1, H2, ... H6 headers.
- Summarizing table data.
- Scanning for TITLE attributes in elements and presenting a list of options for expansion.
- Vocalizing any "bold" or emphasized text.
- Digesting the entire document into a summary based on keywords, as some search engines provide.
There are any number of other options available for voice browser programmers to use to provide short, easily digestible versions of web content to the browser user. This suggests that the web author should provide as much meta-content as possible, as well as making careful use of HTML elements in their proper manner. Specific techniques include:
- Useful choices for link text (e.g., "the report is available" instead of "click here").
- Appropriate use of heading tags to define document structure, not simply for size or formatting.
- Use of the SUMMARY attribute for tables.
- Use of STRONG and EM where appropriate, providing benefits for both vocal and visual "scanability".
For voice browsers, ALT text is vitally important, since images cannot be represented aurally at all. Especially when an image is used as part of a link, alternative content must be provided so that the voice browser can accurately render the page in a manner useful to the user. In addition to ALT attributes for IMG elements, HTML 4.0 provides a number of other ways of specifying alternative content that can be used by a browser when an unsupported media type is encountered. Some of these include:
- ALT attributes for image map AREAs, APPLETs, and image INPUT buttons.
- Text captions and transcripts for multimedia (video and audio).
- NOSCRIPT elements when including scripting languages, as voice browsers may be unable to process JavaScript instructions.
- NOFRAMES elements when using frame sets, as frames are a very visually oriented method of document display.
- Use of nested OBJECT elements to include a wide variety of alternative contents for many media types.
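The techniques above can be sketched in HTML 4.0 markup as follows; the file names, link targets, and table contents are invented for this illustration.

```html
<!-- ALT text lets a voice browser speak the image's purpose -->
<a href="report.html"><img src="chart.gif" alt="Quarterly sales report"></a>

<!-- SUMMARY describes a table before its data is vocalised -->
<table summary="Train departure times by platform">
  <tr><th>Platform</th><th>Departure</th></tr>
  <tr><td>3</td><td>10:15</td></tr>
</table>

<!-- NOFRAMES gives non-visual browsers an alternative to a frame set -->
<frameset cols="20%,80%">
  <frame src="nav.html">
  <frame src="main.html">
  <noframes><body><a href="main.html">Main content</a></body></noframes>
</frameset>

<!-- Nested OBJECTs fall back from video, to a still image, to plain text -->
<object data="tour.mpeg" type="video/mpeg">
  <object data="tour.gif" type="image/gif">
    A guided tour of the campus.
  </object>
</object>
```

In each case the browser renders the richest form it supports and falls back to the textual alternative, which is exactly what a voice browser needs.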
7.FUNCTIONALITY
All communication from the user to the system is made by issuing voice commands or using DTMF tones. Such commands are arranged into objects known as menus. Depending upon the functionality required and available, different menus are offered at different points in the program's execution. A grammar set is defined to recognise the speech commands. Some of these rules are for administrative control, for example:
- <exit|quit [program|application|TeleBrowse]>
- speak <faster|slower>
- Where am I?
- What is my homepage?
The other rules are used to control navigation; it is anticipated that these are the most frequently used commands. Navigation is supported in various ways: within the same page (intra-page navigation), browsing a new web page (inter-page navigation), bookmarks, history list, document structure, or following a hyperlink in the web page. The following grammars display the nature of these rules:
- Start browsing by <location|bookmark|homepage>
- Maintain bookmarks
- Start reading [all|again]
- Load <location|bookmark|homepage>
- Go to the history list
- Jump <forwards|backwards x <structure>>
- <Next|Previous <structure>>
A <structure> is one of paragraph, link, anchor, level 1/2/3 heading, list, page or table, and x represents a positive integer value. All three versions of this command represent one action: moving between structures within a document. Another way to navigate to a specific target page is via dictation. Dictation is invoked whenever a browse-by-location type command
is requested, and it is responsible for fetching a URL address from the user. Users dictate to the system by saying words representing a single letter, to improve recognition accuracy. A good example is the military alphabet: alpha for a, bravo for b, charlie for c, etc. The grammar recognises this military alphabet, and also common animals, such as frog for f. Macros and shortcuts are also used to simplify the dictation process. The http:// at the beginning of every URL is added automatically, and the system recognises phrases like World Wide Web, company, and aussie, among many more, to represent www., .com, and .au respectively. The dictation menu also allows corrections to be made, a review of what has been dictated so far, and the ability to restart the dictation session. Output from the system is either synthesised text or sounds (as auditory icons). The synthesised text can represent either actual information being read from a web page or feedback to the user about the system's operation. When a page is to be read out (post-translation), the page is processed one piece at a time and analysed. Two situations can occur: if the piece is a tag with an associated auditory icon, the icon is played; if the piece is simply text, it is synthesised into voice. Typical applications of auditory icons include a creaking door to represent an internal link (a link to an anchor within the same web page), a doorbell to represent an email address, or the clicking sound of a camera shutter to represent an image. When certain tags are encountered (end of paragraph, end of list, end of table row, etc.), speaking ceases and the user is returned to the menu they were last at. Alternatively, if a user wishes to interrupt the speaking before the next break point, the interrupt key * can be used.
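Command rules like those above can be written down in the W3C Speech Recognition Grammar Specification (SRGS), the grammar format used with VoiceXML. The report does not state which grammar format the prototype used, so the following is only a sketch of the "speak <faster|slower>" rule in SRGS XML form, with an invented rule name:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- SRGS 1.0 sketch of the administrative rule "speak <faster|slower>" -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en" root="speed" mode="voice">
  <rule id="speed" scope="public">
    <item>speak</item>
    <!-- Exactly one of the alternatives must follow the word "speak" -->
    <one-of>
      <item>faster</item>
      <item>slower</item>
    </one-of>
  </rule>
</grammar>
```

Constraining recognition to a small closed grammar like this, rather than free dictation, is what keeps command recognition accurate over a telephone channel.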
set of tasks the subjects had to complete using the program. Some of the tasks that subjects were required to complete included: starting the application, checking what the current homepage was set to, commencing reading of the web document once loaded, jumping to another location by dictating a URL address directly into the system, etc. Accompanying each task was a description of the behaviour the system would demonstrate during the task's execution, and the phrases to use in interacting with the system to complete it. All tasks were first completed using the software running in emulation mode on the laptop computer. The speech recognition engine was not trained to adapt to any specific person. After completing this and gaining a degree of experience in using the system, subjects were then given an opportunity to use the system as they chose over the phone, completing the effect of a phone-based web-browsing tool. Subjects were also asked to view the same web pages that had just been viewed with the prototype using a browser they would typically use. In the case of G1, this was either Microsoft Internet Explorer or Netscape Navigator. In the case of G2, this was again Microsoft Internet Explorer, but this time with the addition of the JAWS screen reading program. After using the prototype for a sufficient amount of time (in most cases a period of about twenty minutes to half an hour), each subject was asked to complete a questionnaire to record their experiences with the prototype. The questionnaire was arranged into two parts: the measurement of efficiency and of integrity. Some of the questions asked participants to give numerical scores on a scale of 1 (very poor) to 7 (very good). Other questions were free-format, where subjects could provide their own comments. This gave us an initial and very broad insight into how subjects from each group responded to using the prototype in the experiments.
By graphing both groups' results on the same axes, we could also see a comparison of acceptance between the two groups. In addition to the rated-response questions, the evaluation questionnaire contained a further thirteen free-form questions. These were designed to draw out any comments, problems, criticism, or general feedback from the test subjects. There were, on average, two such questions per section of the evaluation criteria.
8.1.Voice Recognition:
It was one of the more poorly rated criteria. While both groups considered it a necessary technology for the idea of a phone browser, subjects suffered from its shortcomings,
and it did result in a loss of efficiency for most users. Subjects from G2 responded more favourably than those from G1. Of particular concern was the dictation of URL addresses, which was noted as a shortcoming in the interface by every user from both groups. The idea of having to spell out URL addresses one letter at a time (and wait for confirmation of each letter) was not well received. The idea of using shortcuts like company to spell out .com was considered a strong improvement, so this technique should be explored further.
8.2.Speech Synthesis:
There was little or no problem with this sub-system. Subjects found the voice easy to understand and of suitable volume and pitch. The major contrast between the two groups was the usage of the speed control feature. Subjects from G1 saw no reason to adjust the speed of the synthesised voice. They were content with the default normally paced speaking voice. However, subjects from G2 tended to change the speed to a much higher rate before doing anything else.
8.3.Navigation:
The overall ability of intra-page and inter-page navigation using the system was rated favourably by both groups. The use of auditory icons to mark HTML structures was viewed by the G2 subjects as being superior to any similar screen reader marking scheme. Subjects from G1 also appreciated the ability of the auditory icons to quickly and simply mark structures in a document, in a way that was natural and easy to remember. The idea of metaphorically matching the meaning of sounds with the structures they represented was well liked and accepted. There was little problem with remembering the mapping of sound to structure, especially after using the system for an extended amount of time. A criticism of the auditory icons was that they appeared too frequently, and could be seen as breaking up the flow of text unnecessarily. A comment made by many subjects was that the prototype offered similar and familiar functionality to that of browsers they had previously used. Thus, features such as bookmarks, the history or go list, and the ability to save a home page were all well received. The ability to follow links contained in documents was well liked. Using different auditory icons for the different types of links allowed subjects to know in advance whether a link led to a target within the same document or to an external document; this too was well liked. Again, the problem of dictating URL addresses was brought up.
8.4.Online Help:
This section of the criteria did not rate well, due to the lack of help associated with commands and prompts used in the system. All users thought that more detailed, context-based help explaining the meaning and usage of commands should be available at any point during the system's operation, as opposed to the simple listing of commands currently available. Certainly the need to refer to other supporting documentation for more detailed information should be avoided, as access to this information would not be available in environments where a phone browser might be used. The tutorial available from the system's main menu was well accepted in terms of its content, but perhaps a similarly detailed tutorial should be available at every menu in the system, customized for the relevant set of commands.
8.6.Overall Impression:
The subjects from both evaluation groups accepted the prototype as a viable method of browsing the web in the audio realm by phone. The efficiency of the product was quite highly regarded by most subjects, and the system interface fared very well. The only major problem was the dictation of URL addresses to the system.
9.2.Public:
A voice browser can be used to access services such as local, national and international news, along with community information such as weather forecasts, traffic conditions, school closures and events. It can also be used to gather national and international stock market information, and to carry out business and e-commerce transactions.
9.3.Personal use:
It is used to access personal information such as voice mail, personal horoscopes, personal newsletters, calendars, address and telephone lists, etc. In future it is expected that voice browsing will also become visual, i.e., multimodal. The greatest achievement, however, would be the integration of voice browsing with all types of operating system. This success would surely make voice browsing available to each and every application.
10.BENEFITS
Voice is a very natural user interface because it lets the user speak and listen using skills learned during childhood. Currently, users speak and listen through telephones and cell phones with no display in order to interact with voice browsers. Some voice browsers may have small screens, such as those found on cell phones and palm computers. In the future, voice browsers may also support other modes and media, such as pen, video, and sensor input, and graphics, animation, and actuator controls as output. For example, voice and pen input would be appropriate for Asian users whose spoken languages do not lend themselves to entry with traditional QWERTY keyboards. Some voice browsers are portable: they can be used anywhere, at home, at work, and on the road. Information will be available to a greater audience, especially to people who have access to handsets, either telephones or cell phones, but not to networked computers. Voice browsers present a pragmatic interface for functionally blind users, or for users needing Web access while keeping their hands and eyes free for other things. Voice browsers present an invisible user interface, while freeing workspace previously occupied by keyboards and mice.
11.CONCLUSION
The VoiceBrowse prototype developed for this project proved successful in the evaluation as a telephone-based web-browsing tool. Problems were highlighted in the user study. Some of these problems relate to the features currently available in the prototype: for example, the navigation of tables, the interruption of output using voice rather than the * key on the phone pad, and the missing ability to fill in forms on a web page. At the same time, there are deeper issues which require more research, including the dictation function (how and where contextual information can be acquired to improve the accuracy of dictation) and the visualisation process (what kind of model can be provided that gives a feeling of information arriving in parallel). Testing the prototype in real-life environments rather than under standard laboratory conditions, e.g. walking in the street or driving a car, is also an important direction for future work; this may uncover further issues relating to situated interaction. Research into domain-specific speech interaction models may also improve the accuracy and effectiveness of the system. It is the intention of this project to create an environment where web surfing with voice is possible and the surfing experience is a pleasant one. The knowledge accumulated in producing the first-generation prototype, VoiceBrowse, has given us many insights and laid the groundwork for developing the second-generation prototype.