
Natural Language Processing Applications, Lecture 1
Claire Gardent
CNRS/LORIA, Campus Scientifique, BP 239, F-54506 Vandœuvre-lès-Nancy, France
2007/2008

Today's lecture
- Administrative issues
- Course overview: What is NLP? How is it done? A brief history of NLP. Linguistics in NLP. Why is it hard? What are the applications?

Documentation
- Webpage of the course: www.loria.fr/gardent/applicationsTAL
- Slides of the lecture will be handed out at the beginning of each lecture.
- Thursday Nancy NLP Seminar: www.loria.fr/gardent/Seminar/content/seminar07-2.php
- First seminar: this week; Guillaume Pitel (in English), on using Latent Semantic Analysis to bootstrap a FrameNet for French.

Course Overview
Theory:
- What is NLP? Why is it hard? How is it done? What are the applications?
- Symbolic approaches, exemplified by natural language generation (Meaning → Text)
- Statistical approaches, exemplified by information retrieval and information extraction (Text → Meaning)
Practice:
- Python and NLTK (the Natural Language Toolkit)
- Software project
- Presentations

Computers, accounts and lab sessions
- You should all have a login account on the UHP machines. Nancy 2 students first need to register at Nancy 1 (UHP); registration is free.
- From this account you should be able to access Python, NLTK and whatever is needed for the exercises and the project. If not, tell us!
- Room I100 is reserved for you every Wednesday morning until Christmas.
- Optional lab sessions with Bertrand Gaiffe, Wednesday mornings from 10 to 12 in Room I100, starting next week.

Assessment
Grades will be calculated as follows:
- Final exam: 60%
- Presentations: 10%
- Project: 30%

Presentations
Each student must present a paper on either Question Answering or Natural Language Generation. A list of papers suggested for presentation will be put shortly on the course website. If you prefer, you can choose some other paper on either QA or NLG, but you must then first run it by me for approval. I will collect your choices at the end of the second week. Presentations will be held on 4 October (QA) and 16 October (NLG). More about presentations and their grading at http://www.loria.fr/gardent/applicationsTAL

Software Project
A list of software projects will be presented at the end of the second week. You should gather into groups of 2 or 3 and choose a topic; if desired, there can also be individual projects. I will collect your choices at the end of the third week (4 October). Each group will give a short oral presentation (intermediate report) of their project at the end of the 5th week (18 October). The results (program and output) of each group's project will be returned at the end of the semester (roughly end of January). More about projects at http://www.loria.fr/gardent/applicationsTAL

Course schedule
- Mon 17 September, 2pm. What is NLP? Why is it hard? How is it done? An overview of NLP applications.
- Tue 18 September, 10am. Python fundamentals.
- Thu 20 September, 10am. Regular expressions.
- Mon 24 September, 2pm. Corpus processing and tokenization with NLTK.
- Tue 25 September, 10am. Tagging and chunking with NLTK.
- Thu 27 September, 10am. Information retrieval; presentation of topics for software projects; collecting individual choices for presentations.
- Mon 1 October, 10am. Information extraction.
- Tue 2 October, 10am. Information extraction.
- Thu 4 October, 10am. Presentations on Question Answering.
- Mon 8 October, 2pm. NLG: introduction.
- Tue 9 October, 10am. NLG: macro- and microplanning.
- Thu 11 October, 10am. NLG: referential expressions.
- Mon 15 October, 2pm. NLG: surface realisation.
- Tue 16 October, 10am. Presentations on NLG.
- Thu 18 October, 10am. Software project presentations.

Contact information
Need to find me?
LORIA, Room B234. Phone: 03 83 59 20 39. Email: Claire.Gardent AT loria.fr. Web: www.loria.fr/gardent

What is NLP?
Natural Language Processing aims at making computers talk; more precisely, at endowing them with the linguistic ability of humans.

Two illustrative examples of NLP applications
- 1966: ELIZA, the first chatterbot, i.e., a computer program that mimics human conversation. Developed by Joseph Weizenbaum at the Massachusetts Institute of Technology.
- 1973: SHRDLU, a computer program that allows the user to interact with a robot in a blocks world. Developed by Terry Winograd (MIT).

1966: ELIZA
ELIZA, the first chatterbot: a computer program that mimics human conversation (Joseph Weizenbaum, Massachusetts Institute of Technology). The user types in a statement or set of statements in natural language; ELIZA then analyzes the user's statement and generates some response, which it types out.
http://www-ai.ijs.si/eliza/eliza.html

Sample conversation
Men are all alike.
IN WHAT WAY?
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE?
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
It's true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO YOUR MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE

Eliza's strategy: keywords
- The input is read and inspected for the presence of a keyword.
- When a keyword is found, the sentence is transformed according to a rule associated with the keyword (regular-expression rewriting).
- Otherwise a content-free remark or, under certain conditions, an earlier transformation is retrieved.
- The text so computed or retrieved is then printed out.

Example transformations
- I am X → How long have you been X?
  I am very unhappy. → HOW LONG HAVE YOU BEEN VERY UNHAPPY?
- X you Y me → What makes you think I Y you?
  It seems that you hate me. → WHAT MAKES YOU THINK THAT I HATE YOU?
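The keyword strategy above can be sketched in a few lines of Python. This is an illustrative toy, not Weizenbaum's original script: the two rules and the response templates are invented for the example.

```python
import re

# Each rule pairs a keyword pattern with a response template;
# the captured group is echoed back inside the template.
RULES = [
    (re.compile(r"\bI am (.+)", re.I), "HOW LONG HAVE YOU BEEN {0}?"),
    (re.compile(r"\byou (.+) me\b", re.I), "WHAT MAKES YOU THINK I {0} YOU?"),
]
DEFAULT = "PLEASE GO ON."  # content-free remark when no keyword matches

def respond(statement):
    for pattern, template in RULES:
        match = pattern.search(statement)
        if match:
            return template.format(match.group(1).rstrip(".!").upper())
    return DEFAULT

print(respond("I am very unhappy."))       # HOW LONG HAVE YOU BEEN VERY UNHAPPY?
print(respond("It seems that you hate me."))
print(respond("I went for a long boat ride."))
```

The last call shows the fallback behaviour that makes Eliza's dialogues so monotonous: any input with no keyword gets the same content-free remark.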

Why does it work?
- Limited to psychiatrist talk: Eliza can pretend to know almost nothing of the real world. ("I went for a long boat ride." "Tell me about boats.")
- Not a real dialogue (no real communication): Eliza generates a reaction (often a rather non-committal one) to the user's statement. The user cannot ask for information; the user cannot issue an order.

Does it really work?
Eliza: Hello. I am ELIZA. How can I help you?
You: I went to Paris yesterday
Eliza: Can you elaborate on that?
You: I took the train
Eliza: Can you elaborate on that?
You: The train was late
Eliza: Please go on.
You: All the seats were full
Eliza: Please go on.
You: A little boy was having a tantrum
Eliza: Please go on.

Does it really work?
- Yes/no questions:
User: Are you a good therapist?
Eliza: Would you prefer it if I were a good therapist?
User: I'm wondering about your skills as a therapist.
Eliza: Does it please you to believe I am wondering about my skills as a therapist?
- Wh-questions:
User: Who are you?
Eliza: Would you prefer it if I were not?
User: What do you want me to tell?
Eliza: Does that question interest you?

When Eliza fails
- Indexicals (I, you, here, now):
User: I am having fun right now.
Eliza: Does it please you to believe that I am having fun right now?
- World knowledge:
User: I am a human.
Eliza: How long have you been a human?
- Complex assertions:
User: I don't have a problem. I'm quite happy, I have a girlfriend, I study. Everything is OK!
Eliza: Why do you not have any problem, I'm quite happy, you have a girlfriend, you study. Everything is OK!
- Ill-formed input:
User: azefzaer is qgqdg in the agag.
Eliza: Please go on.
- Ill-formed output:
User: Doesn't my story interest you?
Eliza: Why do you say your story interest I?
User: I think that you can help me but I'm not sure that you can understand me.
Eliza: Do you wish to but im not sure that me

Is it NLP?
- Ad hoc transformations: no linguistic (e.g., syntactic, semantic, morphological) knowledge or processing is involved.
- Not portable: it would not work for another domain, e.g., a train-scheduling information system. "I want to go to Paris" → ??? "Tell me about Paris"

1973: SHRDLU
Terry Winograd, MIT. Interaction with a robot in a blocks world. The user can:
- instruct the robot to manipulate the blocks
- ask it about the blocks' configurations
- ask it about its reasoning
- update it with facts
http://hci.stanford.edu/winograd/shrdlu/

SHRDLU: a sample dialogue
1s. Pick up a red block.
1u. OK. (does it)
Understands the instruction; identifies the referent of a description.
2s. Grasp the pyramid.
2u. I DON'T UNDERSTAND WHICH PYRAMID YOU MEAN.
Knows that definite "the" must refer to a uniquely identifiable object.
3s. Find a block which is taller than the one you are holding and put it into the box.
3u. BY "IT", I ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE I AM HOLDING.
Identifies two possible antecedents for "the one ...".
4s. What does the box contain?
4u. THE BLUE PYRAMID AND THE BLUE BOX.
Keeps track of how the context is affected by the dialogue.
5s. What is the pyramid supported by?
5u. THE BOX.
Handles the passive/active variation; correctly identifies "the pyramid" as the pyramid just mentioned.
6s. How many blocks are not in the box?
6u. FOUR OF THEM.
7s. Is at least one of them narrower than the one which I told you to pick up?
7u. YES, THE RED CUBE.
8s. Is it supported?
8u. YES, BY THE TABLE.
9s. Can the table pick up blocks?
9u. NO.
Can reason about the situation.

Is it NLP?
Real communicative abilities: the user can request information, issue instructions and update the system with new information. The system understands language in a limited domain by using syntactic parsing and semantic reasoning:
- A large-scale grammar of English plus a parser
- Procedural semantics for words and phrases

Early NLP: the 1950s
- Machine Translation (MT), one of the earliest applications of computers.
- Major attempts in the US and USSR: Russian to English and the reverse.
- The Georgetown University (Washington) system translated sample texts in 1954. Euphoria: lots of funding, many groups in the US and USSR. But the system could not be scaled up.

1964: The ALPAC report
- Assessed the research results of the groups working on MT.
- Concluded: MT is not possible in the near future; funding for MT should cease; basic research should be supported.
- Word-to-word translation does not work: linguistic knowledge is needed.

1960s-80s: Linguistics and CL
- 1957: Noam Chomsky's Syntactic Structures. A formal definition of grammars and languages; provides the basis for automatic syntactic processing of NL expressions.
- Montague's PTQ: a formal semantics for NL; the basis for the logical treatment of NL meaning.
- 1967: Woods' procedural semantics. A procedural approach to the meaning of a sentence; provides the basis for automatic semantic processing of NL expressions.

Some successful early CL systems
- 1970: TAUM-METEO, machine translation of weather reports (Canada)
- 1970s: SYSTRAN, an MT system still used by Google
- 1973: LUNAR, for questioning an expert system on rock analyses from Moon samples
- 1973: SHRDLU (T. Winograd), instructing a robot to move toy blocks

1980s: Symbolic NLP
- Formally grounded and reasonably computationally tractable linguistic formalisms (Lexical Functional Grammar, Head-Driven Phrase Structure Grammar, Tree Adjoining Grammar, etc.)
- The linguistic/logical paradigm is extensively pursued
- Not robust enough; few applications

1980s onward: Corpora and resources
- Disk space becomes cheap; machine-readable text becomes ubiquitous
- US funding emphasises large-scale evaluation on real data
- 1994: the British National Corpus is made available, a balanced corpus of British English
- Mid-1990s: WordNet (Fellbaum & Miller), a computational thesaurus developed by psycholinguists
- Early 2000s: the World Wide Web used as a corpus

1990s: Statistical NLP
The following factors promoted the emergence of statistical NLP:
- Speech recognition showed that, given enough data, simple statistical techniques work
- US funding emphasised speech-based interfaces and information extraction
- Large digitised corpora became available

CL history: summary
- 1950s: Machine translation; ended by the ALPAC report.
- 1960s: Applications use linguistic techniques (ELIZA, SHRDLU) from Chomsky (formal grammars, parsers); procedural semantics (Woods) also important. Approaches only work on restricted domains; not portable.
- 1970s/80s: Symbolic NLP. Applications based on extensive linguistic and real-world knowledge. Not robust enough; lexical acquisition bottleneck.
- 1990s to now: Statistical NLP. Applications based on statistical methods and large (annotated) corpora.

Symbolic vs. statistical approaches
Symbolic:
- Based on hand-written rules
- Requires linguistic expertise
- No frequency information
- More brittle and slower than statistical approaches
- Often more precise than statistical approaches
- Error analysis is usually easier than for statistical approaches
Statistical:
- Supervised or unsupervised
- Rules acquired from large corpora
- Not much linguistic expertise required
- Robust and quick
- Requires large (annotated) corpora
- Error analysis is often difficult

Linguistics in NLP
NLP applications use knowledge about language to process language. All levels of linguistic knowledge are relevant:
- Phonetics, phonology: the study of linguistic sounds and of their relation to words
- Morphology: the study of word components
- Syntax: the study of the structural relationships between words
- Semantics: the study of meaning
- Pragmatics: the study of how language is used to accomplish goals and of the influence of context on meaning
- Discourse: the study of linguistic units larger than a single utterance

Phonetics/phonology
- Phonetics: the study of the speech sounds used in the languages of the world; how to transcribe those sounds (IPA, the International Phonetic Alphabet); how sounds are produced (articulatory phonetics).
- Phonology: the study of the way a sound is realised in different environments. A sound (phone) can usually be realised in different ways (allophones) depending on its context. E.g., the hand-transcribed Switchboard corpus of English telephone speech lists 16 ways of pronouncing "because" and "about".

An example illustrating the sound-to-text mapping issue:
(1) a. Recognise speech.
    b. Wreck a nice peach.
Phonetics and phonology can be used either to map words into sounds (speech synthesis) or to map sounds onto words (speech recognition).

Morphology
The study of the structure of words. Two types of morphology:
- Inflectional: decomposes a word into a lemma and one or more grammatical affixes giving information about tense, gender, number, etc. E.g., cats = lemma "cat" + affix "-s".
- Derivational: decomposes a word into a lemma and one or more affixes giving information about meaning and/or category. E.g., unfair = prefix "un-" + lemma "fair".

Morphology: main issues
- Exceptions and irregularities:
  women → woman, plural
  aren't → are not
- Ambiguity:
  saw → saw, noun, singular, neuter
  saw → see, verb, 1st/2nd/3rd person, singular or plural, past

Morphology: Methods and Tools
Methods:
- Lemmatisation (morphological analysis)
- Stemming (an approximation)
Tools:
- Finite-state transducers

Morphology: Applications
In CL applications, morphological information is useful, e.g.,
- to resolve anaphora: (2) Sarah met the women in the street. She did not like them. [She(sg) = Sarah(sg); them(pl) = the women(pl)]
- for spell checking and for generation: * The women(pl) is(sg)
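The difference between stemming and lemmatisation can be illustrated with a toy suffix-stripping stemmer (the suffix list below is invented for the example): it strips regular affixes by rule, but, being only an approximation, it cannot handle irregular forms such as "women".

```python
# A toy suffix-stripper: try each suffix in turn and strip the first
# that matches, provided enough of the word remains to be a plausible stem.
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("cats"))    # cat
print(stem("walked"))  # walk
print(stem("women"))   # women: irregular plural, the rule-based stemmer fails
```

A lemmatiser, by contrast, would use a lexicon (or a finite-state transducer encoding one) to map "women" to "woman".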

Syntax
- Captures structural relationships between words and phrases; describes the constituent structure of NL expressions.
- Grammars are used to describe the syntax of a language.
- Syntactic analysers and surface realisers assign a syntactic structure to a string or to a semantic representation on the basis of a grammar.

Syntactic tree example
For "John often gives a book to Mary":
(S (NP John) (VP (Adv often) (V gives) (NP (Det a) (N book)) (PP (Prep to) (NP Mary))))
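The bracketed tree above can be represented and manipulated programmatically. Here is a small sketch using plain nested tuples; the tuple encoding is ad hoc, chosen to keep the example self-contained (NLTK, used later in the course, provides a richer Tree class for the same purpose).

```python
# (label, child, child, ...) encodes a node; a bare string is a leaf word.
tree = ("S",
        ("NP", "John"),
        ("VP",
         ("Adv", "often"),
         ("V", "gives"),
         ("NP", ("Det", "a"), ("N", "book")),
         ("PP", ("Prep", "to"), ("NP", "Mary"))))

def bracket(node):
    """Render a tuple-encoded tree in Penn-Treebank-style brackets."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(bracket(c) for c in children) + ")"

def leaves(node):
    """Collect the terminal words, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

print(bracket(tree))
print(" ".join(leaves(tree)))  # John often gives a book to Mary
```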

Methods in syntax
Words → syntactic tree
- Algorithm: a parser
- Resources used: lexicon + grammar
- Symbolic: hand-written grammar and lexicon
- Statistical: grammar acquired from a treebank
- Difficulties: coverage and ambiguity

Syntax: Applications
In CL applications, syntactic information is useful, e.g.,
- for spell checking (e.g., subject-verb agreement)
- to construct the meaning of a sentence
- to generate a grammatical sentence

Spell checking
(3) * Its a fair exchange. → no syntactic tree
    It's a fair exchange. → OK, syntactic tree
(4) * My friends is unhappy.
    The number of my friends who were unhappy was amazing.
    The man who greets my friends is amazing.
    → subject-verb agreement

Syntax and meaning
John loves Mary (Agent = Subject) → love(j,m)
Mary loves John (Agent = Subject) → love(m,j)
Mary is loved by John (Agent = By-object) → love(j,m)

Lexical semantics
The study of word meanings and of their interaction with context. Words have several possible meanings.
- Early methods use selectional restrictions to identify the meaning intended in a given context:
(5) a. The astronomer saw the star.
    b. The astronomer married the star.
- Statistical methods use cooccurrence information derived from corpora annotated with word senses:
(6) a. John sat on the bank.
    b. John went to the bank.
    c. ?? King Kong sat on the bank.
- Lesk algorithm: word overlap between the words appearing in the definitions of the ambiguous word and the words surrounding that word in the text.

Lexical relations, i.e., relations between word meanings, are also very important for CL-based applications. The most used lexical relations are:
- Hyponymy (ISA), e.g., dog is a hyponym of animal
- Meronymy (part-of), e.g., arm is a meronym of body
- Synonymy, e.g., eggplant and aubergine
- Antonymy, e.g., big and little
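The Lesk idea mentioned above can be sketched as a simple overlap count: pick the sense whose dictionary gloss shares the most words with the sentence context. The two glosses below are toy stand-ins invented for the example, not entries from a real dictionary.

```python
# Toy sense inventory for the ambiguous word "bank".
SENSES = {
    "bank/riverside": "sloping land beside a body of water river",
    "bank/finance": "financial institution that accepts deposits money",
}

def lesk(context_sentence):
    """Return the sense whose gloss overlaps most with the context."""
    context = set(context_sentence.lower().split())
    def overlap(gloss):
        return len(context & set(gloss.split()))
    return max(SENSES, key=lambda sense: overlap(SENSES[sense]))

print(lesk("John sat on the bank of the river"))  # bank/riverside
print(lesk("John deposited money at the bank"))   # bank/finance
```

Real implementations add stopword filtering, stemming and tie-breaking; this sketch only shows the core overlap computation.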

Lexical semantics: Applications
In NLP applications, the most commonly used lexical relation is hyponymy, which is used:
- for semantic classification (e.g., selectional restrictions, named entity recognition)
- for shallow inference (e.g., "X murdered Y" implies "X killed Y")
- for word sense disambiguation
- for machine translation (if a term cannot be translated, substitute a hypernym)

Compositional semantics
The semantics of phrases. Useful to reason about the meaning of an expression (e.g., to improve the accuracy of a question-answering system).
(7) a. John saw Mary.
    b. Mary saw John.
Same words, different meanings.

Pragmatics
Compositional semantics delivers the literal meaning of an utterance, but NL phrases are often used non-literally. Examples:
(8) a. Can you pass the salt?
    b. You are standing on my foot.
Speech act analysis and plan recognition are needed to determine the full meaning of an utterance.

Discourse
Much of language interpretation is dependent on the preceding discourse/dialogue. Example: anaphora resolution.
(9) a. The councillors refused the women a permit because they feared revolution.
    b. The councillors refused the women a permit because they advocated revolution.

Linguistics in deep symbolic NLP systems
The various types of linguistic knowledge are put to work in deep NLP systems. A deep natural language processing system builds a meaning representation (needed, e.g., for NL interfaces to databases, question answering and good MT) from the user's input and produces some feedback to the user. In a deep NLP system, each type of linguistic knowledge is encoded in a knowledge base which can be used by one or several modules of the system.

Two main problems
- Ambiguity: the same linguistic unit (word, constituent, sentence, etc.) can be interpreted/categorised in several competing ways.
- Paraphrase: the same content can be expressed in different ways.

Problem 1: Ambiguity
Ambiguity pervades all levels of linguistic analysis: the same sentence can mean different things.
- Phonological: the same sounds can mean different things. "Recognise speech" or "Wreck a nice peach"?
- Lexical semantics: the same word can mean different things. étoile: sky star or celebrity?
- Part of speech: the same word can belong to different parts of speech. la: pronoun, noun or determiner?
- Syntax: the same sentence can have different syntactic structures. Jean regarde (la fille avec des lunettes) vs. Jean ((regarde la fille) avec des lunettes)
- Semantics: the same sentence can have different meanings. La belle ferme la porte:
  (La belle)Subj (ferme la porte)VP, "the beautiful woman closes the door"
  (La belle ferme)Subj (la porte)VP, "the beautiful farm carries it"

A combinatorial problem
Ambiguities multiply out, inducing a combinatorial problem. Example: "La porte que la belle ferme présente ferme mal."
Number of parts of speech per word:
la: 3, porte: 3, que: 3, la: 3, belle: 3, ferme: 5, présente: 2, ferme: 5, mal: 2
Number of possible combinations: 3 × 3 × 3 × 3 × 3 × 5 × 2 × 5 × 2 = 24,300.
The combinatorics is high.

Problem 2: Paraphrase
There are many ways of saying the same thing. Example:
- Quand mon laptop arrivera-t-il ?
- Pourriez-vous me dire quand je peux espérer recevoir mon laptop ?
In generation (Meaning → Text), this implies making choices against this same high combinatorics.
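The ambiguity multiplication can be checked mechanically; the per-word part-of-speech counts below are the ones assumed for the example sentence.

```python
from math import prod

# One part-of-speech count per word of
# "La porte que la belle ferme présente ferme mal."
pos_counts = [3, 3, 3, 3, 3, 5, 2, 5, 2]

# Every combination of per-word choices is a distinct POS assignment,
# so the counts multiply.
total = prod(pos_counts)
print(total)  # 24300
```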

Some NLP applications
Useful systems have been built for, e.g.:
- Spelling and grammar checking
- Speech recognition
- Spoken language dialogue systems
- Machine translation
- Text summarisation
- Information retrieval and extraction
- Question answering

Three main types of applications:
1. Language input technologies
2. Language processing technologies
3. Language output technologies

Language input technologies
- Speech recognition
- Optical character recognition
- Handwriting recognition
- Retroconversion

Speech recognition (1)
Key focus: spoken utterance → text.
Two main types of applications:
- Desktop control: dictation, voice control, navigation
- Telephony-based transactions: travel reservation, remote banking, pizza ordering, voice control

Speech recognition (2)
- Cheap PC desktop software is available
- 60-90% accuracy: good enough for dictation and simple transactions, but dependent on the speaker and the circumstances
- Speech recognition is not understanding!
- Based on statistical techniques and very large corpora
- Works for many languages
- Accuracy depends on audio conditions (robustness problem)
Cf. the PAROLE team (Yves Laprie).

Speech recognition (3): Dictation
Desktop control products:
- Philips FreeSpeech (www.speech.philips.com)
- IBM ViaVoice (www.software.ibm.com/speech)
- ScanSoft's Dragon NaturallySpeaking (www.lhsl.com/naturallyspeaking)
See also the Google directory category: http://directory.google.com/Top/Computers/Speech Technology/

Dictation systems can do more than just transcribe what was said:
- leave out the "um"s and "eh"s
- implement corrections that are dictated
- fill the information into forms
- rephrase sentences (add missing articles, verbs and punctuation; remove redundant or repeated words and self-corrections)
They communicate what is meant, not what is said. Speech can be used both to dictate content and to issue commands to the word-processing application (speech macros, e.g., to insert frequently used blocks of text or to navigate through a form).

Speech recognition (4)
Telephony-based fielded products:
- Nuance (www.nuance.com)
- ScanSoft (www.scansoft.com)
- Philips (www.speech.philips.com)
- Telstra directory enquiry (tel. 12455)
See also the Google directory category: http://directory.google.com/Top/Computers/Speech Technology/Telephony/

Optical character recognition (1)
Key focus: printed material → computer-readable representation.
Applications:
- Scanning (text → digitized format)
- Business card readers (to scan the printed information from business cards into the correct fields of an electronic address book): www.cardscan.com
- Website construction from printed documents

Optical character recognition (2)
Current state of the art:
- 90% accuracy on clean text
- 100-200 characters per second (as opposed to 3-4 for typing)
Fundamental issues:
- Character segmentation and character recognition
- Problems: unclean data and ambiguity
Many OCR systems use linguistic knowledge to correct recognition errors:
- N-grams for word choice during processing
- Spelling correction in post-processing

Optical character recognition (3)
Fielded products:
- Caere's OmniPage (www.scansoft.com)
- Xerox TextBridge (www.scansoft.com)
- ExperVision's TypeReader (www.expervision.com)

Handwriting recognition (1)
Key focus: human handwriting → computer-readable representation.
Applications:
- Forms processing
- Mail routing
- Personal digital assistants (PDAs)

Handwriting recognition: fundamental issues
- Everyone writes differently!
- Isolated letters vs. cursive script
- Train the user or the system?
- Most people type faster than they write: choose applications where keyboards are not appropriate
- Elaborate language models and writing-style models are needed

Handwriting recognition (2)
- 5-6% error rate on isolated letters; cursive script is harder
- Good typists tolerate up to a 1% error rate; human subjects make 4-8% errors
Cf. the READ team (Abdel Belaïd).

Handwriting recognition (3)
Products:
- Palm's Graffiti (www.palm.com)
- Communication Intelligence Corporation's Jot (www.cic.com)
- Motorola's Lexicus; ParaGraph's CalliGrapher (www.paragraph.com)

Retroconversion
Key focus: identify the logical and physical structure of the input text.
Applications:
- Recognising tables of contents
- Recognising bibliographical references
- Locating and recognising mathematical formulae
- Document classification

Language processing technologies
- Spelling and grammar checking
- Spoken language dialogue systems
- Machine translation
- Text summarisation
- Search and information retrieval
- Question answering systems

Spelling and grammar checking
Various levels of sophistication:
- Flag words which are not in the dictionary (dictionary lookup): * neccessary
- For languages with rich morphology, flag words which are morphosyntactically incorrect (morphological processing): e.g., * He gived a book to Mary
- Syntax might be needed (possessive pronoun distribution, subject-verb agreement): * Its a fair exchange; * My friend were unhappy
- Word sense disambiguation: * The trees bows were heavy with snow vs. The tree's boughs were heavy with snow
Existing spell checkers only handle a limited number of these problems.
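The first level of sophistication, dictionary lookup, amounts to a set-membership test over the words of the text. The word list below is a tiny stand-in for a real dictionary.

```python
# Toy dictionary: a real checker would load a full word list.
DICTIONARY = {"it", "is", "a", "fair", "exchange", "necessary"}

def flag_unknown(text):
    """Return the words of the text that are not in the dictionary."""
    return [w for w in text.lower().split() if w.strip(".,!?") not in DICTIONARY]

print(flag_unknown("It is a fair exchange"))  # []
print(flag_unknown("It is neccessary"))       # ['neccessary']
```

Note what this level cannot do: "Its a fair exchange" passes, because "its" is a valid word; catching it requires the syntactic knowledge discussed above.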

Spoken Language Dialog Systems (SLDSs)
Goal: a system that you can talk to in order to carry out some task.
Key focus: speech recognition, speech synthesis, dialogue management.
Applications:
- Information-provision systems: provide information in response to a query (requests for timetable information, weather information)
- Transaction-based systems: undertake transactions such as buying/selling stocks or reserving a seat on a plane

SLDSs: some problems
- No training period is possible in phone-based systems
- Error handling remains difficult
- User initiative remains limited (or is likely to result in errors)

SLDS state of the art
- Commercial systems are operational for limited transaction and information services: a stock broking system, a betting service, the American Airlines flight information system
- Limited (finite-state) dialogue management
- NL understanding is poor

SLDS commercial systems
- Nuance (www.nuance.com)
- SpeechWorks (www.scansoft.com)
- Philips (www.speech.philips.com)
See also the Google directory category: http://directory.google.com/Top/Computers/Speech Technology/

Machine Translation
Key focus: translating a text written/spoken in one language into another language.
Applications:
- Web-based translation services
- Spoken-language translation services

Five main types of approaches to MT:
- Direct transfer: map between words
- Syntactic transfer: map between syntactic structures
- Semantic transfer: map between semantic structures
- Interlingua: parse to derive a language-neutral semantic representation and generate from there
- Example-based: use a database of translation pairs and find the closest matching phrase
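Direct transfer, the simplest of the five approaches, can be sketched with a toy dictionary (the lexicon and the example sentence are invented for the illustration). It also shows why word-for-word mapping fails, echoing the ALPAC conclusion that linguistic knowledge is needed: word order, agreement and idioms are not handled.

```python
# Toy French-to-English word dictionary.
LEXICON = {"le": "the", "chat": "cat", "noir": "black", "dort": "sleeps"}

def direct_transfer(sentence):
    """Replace each word by its dictionary translation, keeping word order."""
    return " ".join(LEXICON.get(word, word) for word in sentence.split())

print(direct_transfer("le chat noir dort"))
# "the cat black sleeps": the French adjective order is wrongly preserved
```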

Existing MT systems
- Bowne's iTranslator (www.itranslator.com)
- TAUM-METEO (1979), English/French: the domain of weather reports; highly successful
- Systran (among several European languages): human-assisted, rough translation; used over the internet through AltaVista, http://babelfish.altavista.com

The limitations of TAUM-METEO
- An exceptional domain: limited language, large translation need. A limited domain with enough material, never again found.
- The same group tried to build TAUM-AVIATION for aircraft maintenance manuals, with only limited success.

The limitations of Systran
English input: Two British undercover soldiers are arrested by Iraqi police in Basra following a car chase. They are reported to have fired on the police.
Systran output: Deux soldats britanniques de capot interne sont arrêtés par la police d'Iraq à Bassora suivant une chasse de voiture. On rapporte qu'ils mettent le feu sur la police.
- undercover / "de capot interne": incorrect word translation
- following / "suivant" (instead of "suite à"): gerund/preposition ambiguity wrongly resolved
- car chase / "chasse de voiture" (instead of "course en voiture"): wrong recognition of the N-N compound
- fire on / "mettre le feu sur": non-recognition of the verbal locution

MT and lexical meaning
French input: L'arbre est une structure très utilisée en linguistique. On l'utilise par exemple, pour représenter la structure syntaxique d'une expression ou, par le biais des formules logiques, pour représenter le sens des expressions de la langue naturelle.
Systran output: The tree is a structure very much used in linguistics. It is used for example, to represent the syntactic structure of an expression or, by the means of the logical formulas, to represent the direction of the expressions of the natural language.
The lexical error: "sens" (here, meaning) is translated as "direction".

Word salad
French input: Cette approche est particulièrement intéressante parce que, un peu comme les grammaires d'unification introduites il y a quelques décennies par Martin Kay, [ ... ]. Cette vision, qui est sans doute celle de la plupart des linguistes, n'a malheureusement toujours pas trouvé de cadre informatique adéquat pour s'exprimer et s'instancier.
Systran output: This approach is particularly interesting because, a little like introduced grammars of unification a few decades ago by Martin Kay, [ ... ]. This vision which is undoubtedly, that of the majority of the linguists, unfortunately still did not find of data-processing framework adequate to be expressed and instancier.

MT state of the art
- Broad-coverage systems are already available on the web (Systran)
- Reasonable accuracy for specific domains (TAUM-METEO) or controlled languages
- Machine-aided translation is what is mostly used

Text summarisation
Key issue: text → shorter version of the text.
Applications:
- To decide whether it is worth reading the original text
- To read the summary instead of the full text
- To automatically produce abstracts

Three main steps:
1. Extract important sentences (compute document keywords and score document sentences with respect to these keywords)
2. Cohesion check: spot anaphoric references and modify the text accordingly (e.g., add the sentence containing a pronoun's antecedent; remove difficult sentences; remove pronouns)
3. Balance and coverage: modify the summary to have an appropriate text structure (delete redundant sentences; harmonise the tense of verbs; ensure balance and proper coverage)
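Step 1 above can be sketched as naive keyword scoring: rank sentences by the total corpus frequency of the content words they contain. The stopword list and the scoring are simplistic placeholders for the keyword computation a real summariser would use.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "it"}

def extract_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences of the text."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Document keywords: frequency of non-stopword tokens.
    content = [w.strip(".,") for w in text.lower().split()]
    freq = Counter(w for w in content if w not in STOPWORDS)
    # A sentence scores the summed frequency of its words.
    def score(sentence):
        return sum(freq[w.strip(".,")] for w in sentence.lower().split())
    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[:n_sentences]

print(extract_summary("The cat sat. The cat ate fish. Dogs bark."))
```

Steps 2 and 3 (cohesion, balance and coverage) are exactly what this extractive sketch lacks: it may pull out a sentence whose pronoun antecedent is left behind.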

Text summarisation: state of the art
- Sentences are extracted on the basis of location, linguistic cues and statistical information
- Low discourse coherence
Commercial systems:
- British Telecom's ProSum (transend.labs.bt.com)
- Copernic (www.copernic.com)
- MS Word's summarisation tool
- See also http://www.ics.mq.edu.au/swan/summarization/projects.htm

Information Extraction/Retrieval and QA
Given an NL query and documents (e.g., web pages):
- retrieve the documents containing the answer (retrieval)
- fill in a template with the relevant information (extraction)
- produce an answer to the query (Q/A)
QA today:
- Limited to factoid questions, e.g., Who invented the electric guitar? How many hexagons are on a soccer ball? Where did Bill Gates go to college?
- Excludes how-to questions, yes-no questions and questions that require complex reasoning
- Highest possible accuracy estimated at around 70%

Information Extraction/Retrieval and QA: systems
- IR systems: Google, Yahoo, etc.
- QA systems: AskJeeves (www.askjeeves.com), Artificial Life's ALife Sales Rep (www.artificial-life.com), NativeMinds' vReps (www.nativeminds.com), Soliloquy (www.soliloquy.com)

Language output technologies
- Text-to-speech
- Tailored document generation

Text-to-Speech (1)
Key focus: text → natural-sounding speech.
Applications:
- Spoken rendering of email via desktop and telephone
- Document proofreading
- Voice portals
- Computer-assisted language learning

Text-to-Speech (2)
Requires appropriate use of intonation and phrasing.
Existing systems:
- ScanSoft's RealSpeak (www.lhsl.com/realspeak)
- British Telecom's Laureate
- AT&T Natural Voices (http://www.naturalvoices.att.com)

Tailored document generation
Key focus: document structure + parameters → individually tailored documents.
Applications:
- Personalised advice giving
- Customised policy manuals
- Web-delivered dynamic documents
- Tailored job descriptions
- Project status reports
- Weather reports
Systems:
- KnowledgePoint (www.knowledgepoint.com)
- CoGenTex (www.cogentex.com)

CL Applications: Summary
- NLP applications process language using knowledge about language; all levels of linguistic knowledge are relevant.
- Two main problems: ambiguity and paraphrase.
- NLP applications use a mix of symbolic and statistical methods.
- Current applications are not perfect: symbolic processing is not robust/portable enough, and statistical processing is not accurate enough.
- Applications can be classified into two main types: aids to human users (e.g., spell checkers, machine-aided translation) and agents in their own right (e.g., NL interfaces to databases, dialogue systems).
- Useful applications have been built since the late 70s; commercial success is harder to achieve.
