
Week 1, November 2005

Introduction to natural language processing (NLP)


1. Objectives
In this part of the course we will investigate two computing paradigms: the rationalist, rule based,
symbolic approach, contrasted with empirical, data driven methods. The investigation will be carried out
by exploring some natural language processing tasks, so you should understand the current scope and
limitations of NLP.
Natural language is language used for any human communication, not a programming language. It can be
in a limited or specific domain, e.g. legal documents or technical manuals, or informal conversation.
Reference: Beardon, p 1-5
2. NLP applications
Automated Speech Recognition - ASR (or How to wreck a nice beach)
An acoustic processor takes in a speech signal and produces a list of candidate words. Then the
words are analysed by a language model, and the most likely word in the context is selected. The
quality of ASR has improved markedly in recent years. See subtitles on BBC2 live sports
programmes: these are produced through a speech recognition system, developed here at UH,
jointly with industrial partners.
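The language-model step can be sketched in a few lines of Python. This is only an illustration, assuming a bigram model with add-one smoothing trained on a tiny invented corpus; the notes do not say which model the UH system uses, and the names CORPUS, bigram_prob and pick_word are hypothetical.

```python
from collections import Counter

# Toy training text, invented for illustration (not real ASR data).
CORPUS = (
    "it is hard to recognise speech . "
    "we can recognise speech by machine . "
    "a storm can wreck a nice beach ."
).split()

BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))   # counts of adjacent word pairs
UNIGRAMS = Counter(CORPUS)                   # counts of single words
VOCAB = len(UNIGRAMS)                        # number of distinct words


def bigram_prob(prev, word):
    """Estimate P(word | prev), with add-one smoothing for unseen pairs."""
    return (BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + VOCAB)


def pick_word(prev, candidates):
    """Return the candidate word the model finds most likely after prev."""
    return max(candidates, key=lambda w: bigram_prob(prev, w))
```

Given the acoustically similar candidates speech and beach, pick_word("recognise", ["speech", "beach"]) selects speech, while pick_word("nice", ["speech", "beach"]) selects beach, because those are the pairs seen in the toy corpus.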
Speech Synthesis
This is the conversion of text to speech, used, for example, in automated dialogues. Prosodic
information has to be captured, such as stress, rhythm, intonation (variations in pitch), amplitude
(loudness). Context has to be captured to disambiguate pronunciation, such as wind (a breeze)
and wind (to turn a handle).
Information Retrieval
This includes automated extraction of information. The TREC 2002 Question Answering paper
shows how a combination of different language processors can be used to find answers to
questions in large data collections.
Automated Dialog systems
This type of system can, in a limited domain, conduct a dialog with a user making enquiries about
travel information, or sports results (such as the SYRINX system). These systems are not yet fully
automated: the user may have to default to a human operator.
Machine Assisted Translation
A great challenge to NLP, with only limited successes so far, such as the Canadian weather forecasts
translated automatically between French and English. See the Systran web page,
http://www.systranbox.com/systran/box where you can type in your text for translation. Then cut
and paste the result and translate back to English.
The Verbmobil project is one of the best known attempts in this field: it aims to translate from
German to Japanese via English. The limited task is just to fix a time for a meeting.
Word correction
Spell checkers are invaluable; grammar checkers are not so effective.
Copy detection - the Ferret
Finding similar passages of text in large document collections. Used to detect plagiarism in
students' work.
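One simple way to find similar passages can be sketched as follows. This is an assumption made for illustration: the notes do not describe how the Ferret works, and the functions word_trigrams and resemblance are hypothetical names. The sketch compares the sets of three-word sequences in two texts.

```python
def word_trigrams(text):
    """Return the set of consecutive three-word sequences in text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(text_a, text_b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    a, b = word_trigrams(text_a), word_trigrams(text_b)
    if not a or not b:
        return 0.0          # texts shorter than three words cannot be compared
    return len(a & b) / len(a | b)
```

Independently written texts share very few trigrams, so a high score suggests copying. A real system would also need proper tokenisation and a threshold chosen from data.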
3. Biological background to language
Only humans use language, and speech is the primary language form. We investigate the production of
speech and also its perception, that is hearing and understanding.
The human ability to communicate by speech has evolved despite physiological disadvantages, which
other primates, and infant humans, do not have. The need to communicate effectively must have
outweighed the disadvantages.
In order to produce a wider range of speech sounds more effectively, the human vocal tract has become
longer. This has resulted in the need for a mechanism that enables the same tract to be used for eating
and breathing at different times, and a consequent possibility of choking.
The advantages include the fact that sounds are produced which can be transmitted faster than those
produced by animals without the longer vocal tract; there is a wider range of sounds that can be produced,
and they can be more easily combined. Also, human speech sounds are less susceptible to perceptual
confusion. See reference P Lieberman, 1992.
In looking at language acquisition we need to distinguish between acquisition
a) in the lifetime of the species
b) in historical times
c) in the lifetime of the individual
For many decades there has been controversy over how language is acquired by each individual, and this
has been reflected in approaches to natural language processing. The well known protagonist of one
school of thought is Noam Chomsky, who proposes that humans have an innate language capability, and
in particular a grammatical module, waiting to be activated by learning the actual ambient language. One
of his principal arguments is the poverty of the stimulus. This is the observed fact that children can
produce utterances that they have never heard spoken before. They have learnt the words, and can then
put them together in novel ways according to certain rules. This view of language acquisition leads to a
rationalist, rule based approach to NLP, working on the hypothesis that a set of rules can be found that
encompasses any language. After half a century of trying, this has not been achieved.
The contrasting theory is that we only learn language from ambient speech. This is associated with the
development of empirical, data driven language processors, where rules are not laid down in advance but
may emerge from processing large quantities of data. Though this does not meet some of the arguments
of opposing theories, the approach has been very successful in practice.
Many applications are hybrid systems that combine rule-based and empirical elements, as we will see next
term: for instance, entries to the Question Answering competition at TREC 2002, and the Alpine neural parser.

4. Decomposing the NLP processing task


In order to make language processing tractable, we have to decompose the processing task into
independent stages. However, these stages influence each other, and feedback is often necessary.
Prosodic
This means analysing information from the acoustic signal indicating pauses, intonation, stress
etc.
Morphological
Analysis of the shape of words. Meaning is changed by inflection, e.g. love can be a present
tense verb, while loved is past tense. Prefixes can change word sense, e.g. kind and unkind.
English is not heavily inflected: meaning depends on word order too. Other languages make much
more use of inflection, e.g. Latin, Russian. In English the boy loves the girl is not the same as
the girl loves the boy. In Latin the first of these can be translated as puer (boy) puellam (girl) amat (loves)
or as puellam puer amat. The roles of lover and loved are shown by the word endings.
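The Latin example can be made concrete with a small sketch. It assumes a hypothetical hand-built table, LATIN_FORMS, listing a few word forms with their roles; a real morphological analyser would derive the role from the ending rather than look up whole words.

```python
# Hypothetical mini-lexicon: word form -> (English gloss, grammatical role).
LATIN_FORMS = {
    "puer": ("boy", "subject"),       # nominative ending
    "puerum": ("boy", "object"),      # accusative ending
    "puella": ("girl", "subject"),    # nominative ending
    "puellam": ("girl", "object"),    # accusative ending
    "amat": ("loves", "verb"),
}


def roles(sentence):
    """Map each grammatical role to its English gloss, ignoring word order."""
    result = {}
    for word in sentence.lower().split():
        gloss, role = LATIN_FORMS[word]
        result[role] = gloss
    return result
```

Both roles("puer puellam amat") and roles("puellam puer amat") give the boy as subject and the girl as object, whereas roles("puella puerum amat") reverses them. An equivalent function for English would have to use word position instead.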
Syntactic
Analysis of the structure of language: how words can be strung together, and what the grammar
of the language is. We need syntactic understanding to get the meaning of a spoken utterance or
passage of text (slide from cognitive test illustrates this). Many utterances are not fully
grammatical, but completely ungrammatical speech is rare and usually incomprehensible.
Syntactic analysis is the focus of our work in this part of the course, so there is much more
to come on this.

Semantic
Determining the meaning of words and groups of words. We have to address different types of
ambiguity:
(i) same words with completely different meanings e.g. might, will, bank
(ii) same words with some common ancestry e.g. stamp, surf
(iii) same words with varying meanings e.g. game (reference Wittgenstein in
Philosophical Investigations)
(iv) different interpretations e.g. replace the bolt can either mean get another bolt or
put the bolt back where it was
Pragmatic
Non-explicit inference, e.g. Can you tell me the time?
In a Verbmobil dialog to fix a meeting, one speaker suggests a day, to which the reply
is Isn't that All Saints' Day? implying it is a holiday.

Dialog management
Handling the process of a dialog in interactive systems.

World knowledge
Very hard to capture outside the most limited domains, but essential to understanding.
For example, a BBC subtitle on a sporting event should have read there is a bit of acrimony between these
two players, but the actual output was there is a bit of macaroni between these two players.

5. Review of basic grammar


In order to study syntax we need to know basic grammar. To develop an automated parser we need to use
a set of part-of-speech tags. We start with a very simple, minimal set:
Tagset1
noun           content word
verb           content word
adjective      content word
adverb         content word
determiner     function word
pronoun        function word
preposition    function word
conjunction    function word
exceptions     anything else

Content words have semantic content: they are also known as open class words since their number can
grow as new names and activities arise. Function words can be thought of as the glue which sticks the
content words together. They are also known as closed class words since new ones are very unlikely to
occur.
Many words have more than one part-of-speech tag: e.g. water, shop, fish can be nouns or verbs.
In working systems tagsets are usually larger: some of the categories can be subdivided, and there are
others we have temporarily omitted.
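A lexicon for this minimal tagset can be sketched as a table from words to sets of possible tags. The entries below are a hypothetical sample chosen for illustration; a working system would have a far larger lexicon.

```python
# Hypothetical sample lexicon: each word maps to every tag it can take.
LEXICON = {
    "the": {"determiner"},
    "with": {"preposition"},
    "water": {"noun", "verb"},
    "shop": {"noun", "verb"},
    "fish": {"noun", "verb"},
    "quickly": {"adverb"},
}


def possible_tags(word):
    """Return the tags a word may take; unknown words fall into 'exceptions'."""
    return LEXICON.get(word.lower(), {"exceptions"})


def is_ambiguous(word):
    """True if the word has more than one possible part-of-speech tag."""
    return len(possible_tags(word)) > 1
```

For ambiguous words such as water, the tagger or parser has to use the surrounding context to choose between the candidate tags.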
Phrase definitions
A noun phrase is a noun or group of words that acts like a noun.
A verb phrase is a verb or a verb with its following predicate.
A prepositional phrase is a preposition followed by a noun phrase.
Consider the sentence
the man crossed the river with the crocodiles
This contains the noun phrases: the man, the river, the river with the crocodiles
The verb phrase: crossed the river with the crocodiles
The prepositional phrase: with the crocodiles
See Covington, pages 83-85, or Allen, chapter 2.
Phrase structure grammars capture some of the structure of language, and define how a parser moves
from state to state in a sentence. They model some significant characteristics, such as recursion. PSG
rules include direct recursion, such as
NP --> Det NP
or indirect recursion, such as
NP --> Noun PP (as in he crossed rivers with crocodiles)
PP --> Prep NP
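The recursive rules map directly onto recursive functions. The sketch below is a minimal recogniser, not a full parser. It adds one rule that is an assumption, NP --> Noun, so that the recursion can terminate, and it uses a small hypothetical lexicon.

```python
# Hypothetical lexicon giving each word a single tag.
LEXICON = {"the": "Det", "rivers": "Noun", "crocodiles": "Noun",
           "teeth": "Noun", "with": "Prep"}


def match_np(tags, i=0):
    """Return the index just past the longest NP starting at i, or None."""
    if i < len(tags) and tags[i] == "Det":
        return match_np(tags, i + 1)              # NP --> Det NP (direct)
    if i < len(tags) and tags[i] == "Noun":
        end = match_pp(tags, i + 1)               # NP --> Noun PP (indirect)
        return end if end is not None else i + 1  # NP --> Noun (assumed)
    return None


def match_pp(tags, i):
    """Return the index just past a PP starting at i, or None."""
    if i < len(tags) and tags[i] == "Prep":
        return match_np(tags, i + 1)              # PP --> Prep NP
    return None


def is_noun_phrase(phrase):
    """True if the whole phrase is an NP (raises KeyError on unknown words)."""
    tags = [LEXICON[word] for word in phrase.lower().split()]
    return match_np(tags) == len(tags)
```

Because match_np calls match_pp, which calls match_np again, phrases of any depth are accepted, for example rivers with crocodiles with teeth.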

6. Examples of rule based grammars


A. Link Grammar: see separate notes
B. Phrase Structure Grammars (PSGs)
Known in Prolog as Definite Clause Grammars, PSGs belong to the class of Context Free Grammars.
Parsing with a phrase structure grammar has three steps:
1. Words in the text being parsed are allocated parts of speech
2. A phrase structure grammar is produced
3. The parse of an input sentence is displayed
Here is a very simple example. Take the sentence
the man crossed the river
1. Give the words part-of-speech tags
the        determiner (det)
man        noun
crossed    verb
river      noun
2. Group words into phrases
NP (noun phrase)   -> det noun
VP (verb phrase)   -> verb NP
Group phrases into a sentence
S                  -> NP VP
3. Display
(S (NP the man) (VP crossed (NP the river) ) )
or as a tree:
                  sentence
               /            \
            NP                VP
           /  \             /    \
        det    noun    verb        NP
         |      |        |        /  \
         |      |        |     det    noun
         |      |        |      |      |
        the    man    crossed  the   river
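The three steps for this sentence can be sketched as a small Python program. It is a hand-written illustration of the three rules S -> NP VP, NP -> det noun and VP -> verb NP, not the Prolog DCG form used in Covington; the function names are hypothetical.

```python
# Step 1: part-of-speech tags for the words of the example sentence.
LEXICON = {"the": "det", "man": "noun", "crossed": "verb", "river": "noun"}


def parse_np(words, i):
    """NP -> det noun. Return (bracketed NP, next index) or None."""
    if (i + 1 < len(words)
            and LEXICON.get(words[i]) == "det"
            and LEXICON.get(words[i + 1]) == "noun"):
        return f"(NP {words[i]} {words[i + 1]})", i + 2
    return None


def parse_vp(words, i):
    """VP -> verb NP. Return (bracketed VP, next index) or None."""
    if i < len(words) and LEXICON.get(words[i]) == "verb":
        np = parse_np(words, i + 1)
        if np is not None:
            bracket, j = np
            return f"(VP {words[i]} {bracket})", j
    return None


def parse_sentence(text):
    """S -> NP VP. Return the bracketed parse, or None if rejected."""
    words = text.lower().split()
    np = parse_np(words, 0)
    if np is None:
        return None
    np_bracket, i = np
    vp = parse_vp(words, i)
    if vp is None or vp[1] != len(words):
        return None                      # no VP, or words left over
    return f"(S {np_bracket} {vp[0]})"   # Step 3: display the structure
```

Calling parse_sentence("the man crossed the river") returns the bracketed form (S (NP the man) (VP crossed (NP the river))). A word sequence that does not fit the rules, such as crossed the man, returns None.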

References
You do not have to read all of these, but you should read some of them.
Allen, James, 1995, Natural Language Understanding, Benjamin/Cummings
(especially chap 2)
Charniak, Eugene, 1993, Statistical Language Learning, MIT Press
(especially chapter 1, which compares symbolic and rule based approaches)
Covington, Michael, 1994, Natural Language Processing for Prolog Programmers, Prentice Hall (see
chap 4 for a clear, simple introduction). Good explanation even if you do not know Prolog.
Russell, Stuart, and Norvig, Peter, 2003, Artificial Intelligence, a Modern Approach, Prentice Hall, chaps
22 and 23
Background reading
Pinker, Steven, 1994, The Language Instinct - interesting and entertaining, neo-Chomskyist.
Sampson, Geoffrey, 2005, The Language Instinct Debate, Continuum
Crystal, David, 1995, The Cambridge Encyclopaedia of the English Language, CUP - dip into this.
Lieberman, Philip, 2000, The Evolution of Language and our Reptilian Brain - interesting, new approach.
Pick out his main themes and skip the rest.
Chomsky, Noam, 2000, New horizons in the study of language and mind, CUP - hard to understand.
Gives you an idea of his approach.