
Nov 2005, DAI week 2

Rationalist and Empiricist paradigms in Natural Language Processing


For the rationalist, rule based, symbolic AI approach the primary task is finding a
grammar which fully represents the structure of natural language. This is done by using
intuition and examining a limited number of examples. The aim is then to encode the
grammar and use it in automated processes.
For the empiricist, data driven approach, also called corpus based linguistics, the primary
task is finding regularities in language by examining large corpora. Then the aim is to
construct usable models from this training data.
Rationalist position
"Knowledge of language involves in the first place knowledge of grammar;
language is a derivative and perhaps not very interesting concept."
Chomsky, 1980
Empiricist position
"Don't think, but look! ... the more we examine actual language, the sharper
becomes the conflict between it and our requirement, the crystalline purity of logic."
Wittgenstein, 1945
Chomsky has modified his position over the years, and now says that empirical evidence
is relevant. Though Wittgenstein wrote long ago, this quotation nicely sums up the
empiricist approach.
Empirical, corpus based, methods
Basic principle: Set the parameters of the system by exposure to training data, store
them and then apply to new, unseen data.
The training data is a corpus of texts. It may be a collection of articles of the same sort
(e.g. the Reuters corpus of 1 million news reports). The British National Corpus (BNC)
is 100 million words of texts from many different domains. The corpus may be
transcribed natural conversation or other forms of speech.
A language model is constructed by analysing the training data. A language model is a set
of characteristics of language, drawn from a large corpus or corpora. This is explained
further below.
Empirical methods include probabilistic systems and neural processors. The concepts
underlying these methods are related.

Types of grammar: the Chomsky hierarchy


Type 0: Turing equivalent (recursively enumerable); not needed for natural language processing
Type 1: Context sensitive
Type 2: Context free
Type 3: Regular

Empirical methods and grammar type


Empirical methods are based on regular (type 3) grammars. Dependencies between
adjacent symbols, such as words, are captured and analysed. If there is a sequence of
words such as
word1 word2 word3
we might ask: "How likely is word3 to occur, given that word1 and word2 have gone
before?"
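As a concrete illustration (a minimal sketch, not part of the original notes), the question above can be answered by counting: assuming only a plain Python list of tokens, P(word3 | word1, word2) is estimated as the number of times the trigram occurs divided by the number of times the bigram (word1, word2) occurs.

```python
from collections import Counter

def train_trigram_model(tokens):
    """Count trigram and bigram frequencies from a tokenised corpus."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def p_next(trigrams, bigrams, w1, w2, w3):
    """Estimate P(w3 | w1, w2) = C(w1, w2, w3) / C(w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# Toy corpus; in practice the counts would come from a large corpus such as the BNC
corpus = "the cat sat on the mat the cat sat on the chair".split()
tri, bi = train_trigram_model(corpus)
print(p_next(tri, bi, "the", "cat", "sat"))   # 1.0 in this toy corpus
```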
Little semantic information can be captured if a system is based on a regular grammar.
For example, the agent and victim cannot be extracted from sentences like
John kicked Bill.
John was kicked by Bill.
For a given application this may not matter. E.g. the Ferret plagiarism detector is based on
detecting similar word patterns in pairs of texts; semantics are not always needed.
Regular grammars can be modelled by finite state automata, but they cannot model
phrase structure or higher level grammar. Even if this does matter, a regular grammar
base may be chosen to make computation tractable.
Rule based methods and grammar type
Rule based methods are typically founded on context free grammars (CFGs), i.e. type 2
grammars, occasionally higher. Phrase structure grammar (PSG) is context free. As well
as capturing dependencies between adjacent symbols, relationships between groups of
symbols are modelled, for example (noun phrase - verb phrase). It may be possible to
determine subject / predicate roles.
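To make the idea concrete, here is a minimal sketch (not from the notes) of a toy phrase structure grammar, assuming the NLTK toolkit is available; the grammar rules are invented for illustration and a real rule base would be far larger.

```python
import nltk

# Toy context free (phrase structure) grammar, invented for illustration
grammar = nltk.CFG.fromstring("""
    S     -> NP VP
    NP    -> Det N | PropN
    VP    -> V NP
    Det   -> 'the'
    N     -> 'boy' | 'dog'
    V     -> 'kicked' | 'bit'
    PropN -> 'John' | 'Bill'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['John', 'kicked', 'Bill']):
    tree.pretty_print()   # shows the NP (subject) and VP (predicate) structure
```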
However, in practice the rule base cannot usually cover unrestricted natural language.
CFGs typically generate a large number of alternative parses unless sentences are short
and simple. Semantic knowledge may be needed to disambiguate sentence structure.

Context sensitive information is not captured. Consider translating the sentence:


The pipe connections of the cooler must be checked.
The subject is "The pipe connections of the cooler". The head of the subject,
"connections", is plural, so the verb "must" should be plural in the translation. In English,
modal verbs have the same singular / plural form, so we cannot tell from the word itself which it is. A
CFG could not represent the necessary context sensitive number agreement.
Context free grammars can be modelled by push down automata. They cannot model
context sensitive grammar.
Tag disambiguation with CLAWS
Tag disambiguation is a preliminary task in many natural language processors, and the
CLAWS system is a well known method. Many words in English and other languages
have more than one part-of-speech tag, and the right one needs to be chosen for a given
context. Examples:
light - adjective, noun, verb
race - noun, verb
cut - verb, noun, past participle
her - pronoun, possessive pronoun
flies - noun, verb
Tagsets
Tagsets of different sizes are used for different purposes, depending on the degree of
resolution needed. For instance, nouns can be subdivided into singular and plural, verbs
into different tenses and number. Tagsets in use vary in size from 2 (just content and
function words) to over 100. In CLAWS version 4, about 65 tags are used. A very simple
tagset with 10 classes might be:
Content words
noun
verb
adjective
adverb

Function words
determiner
pronoun
preposition
conjunction
auxiliary verb

Other
The underlying concepts
The underlying principle of CLAWS is based on probabilities: how likely is it that a
word has a certain tag, given its context? This method uses information from local
dependencies. For example, noun - verb is a likely pair, whereas determiner - verb is an
unlikely pair. What about noun - noun? People often think this is unlikely at first, but
taking an empirical approach we find that in English it is in fact very common. For
example: window pane, table cloth, paper handkerchief, Computer Science Department.

The CLAWS process is based on a Markov model. A Markovian sequence is defined as
one in which the probability of a certain output element occurring depends on a finite
number of preceding terms. In this simple case the number of preceding terms is one:
we assert that the probability of a word having a certain tag depends on its predecessor.
The training data is tagged by hand. Then the frequencies of the tag transitions are
counted. These frequencies are taken as approximations to probabilities, and constitute
the language model.
For this version of CLAWS, 2 million words of the BNC were tagged by hand. Transition
frequencies (how often one tag follows another) were found. This is the training data; the
transition frequencies constitute the language model. See http://info.ox.ac.uk/bnc
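The following sketch (an illustration, not the actual CLAWS code) shows how such transition frequencies could be counted from hand-tagged training data, assuming the data is represented as lists of (word, tag) pairs.

```python
from collections import defaultdict

def transition_counts(tagged_sentences):
    """Count how often one tag follows another in hand-tagged training data.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns counts[t1][t2] = number of times tag t2 immediately follows tag t1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for t1, t2 in zip(tags, tags[1:]):
            counts[t1][t2] += 1
    return counts

# Toy hand-tagged data, invented for illustration
training = [[('Henry', 'NP'), ('likes', 'VBZ'), ('stews', 'NNS'), ('.', '.')]]
print(transition_counts(training)['NP']['VBZ'])   # 1
```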
Tag disambiguation - example 1
Suppose we want to disambiguate the following sentence using the CLAWS method.
Henry likes stews .
First, the processor looks up each word in the lexicon, to get one or more possible tags.
In this case it gives:
Henry - NP (proper noun)
likes - VBZ (singular verb) or NNS (plural noun)
stews - VBZ or NNS
. - . (full stop)

From the training data we have the following part of a transition matrix (each cell is the
count of the following tag occurring immediately after the preceding tag):

                      Following tag
Preceding tag    NNS     VBZ     .
NP                 7      17     not needed
NNS                5       1     135
VBZ               28       0     37

We calculate the probability score for each possible choice of tags:

NP_NNS_NNS_. = 7 * 5 * 135 = 4725
NP_NNS_VBZ_. = 7 * 1 * 37 = 259
NP_VBZ_NNS_. = 17 * 28 * 135 = 64260
NP_VBZ_VBZ_. = 17 * 0 * 37 = 0

These numbers would be normalised with reference to the total number of times that each
tag occurred. They show that the sequence NP_VBZ_NNS_. is the most likely.
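The same result can be reproduced by brute-force enumeration of the four candidate tag sequences (a minimal sketch, not the CLAWS implementation; the counts are copied from the worked example above, and for longer sentences exhaustive enumeration becomes impractical, so dynamic programming is typically used instead).

```python
from itertools import product

# Transition counts copied from the worked example above
counts = {
    ('NP', 'NNS'): 7,   ('NP', 'VBZ'): 17,
    ('NNS', 'NNS'): 5,  ('NNS', 'VBZ'): 1,  ('NNS', '.'): 135,
    ('VBZ', 'NNS'): 28, ('VBZ', 'VBZ'): 0,  ('VBZ', '.'): 37,
}

# Candidate tags for each token of "Henry likes stews ."
candidates = [['NP'], ['NNS', 'VBZ'], ['NNS', 'VBZ'], ['.']]

def score(seq):
    """Multiply the transition counts along one candidate tag sequence."""
    total = 1
    for t1, t2 in zip(seq, seq[1:]):
        total *= counts.get((t1, t2), 0)
    return total

best = max(product(*candidates), key=score)
print(best, score(best))   # ('NP', 'VBZ', 'NNS', '.') 64260
```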

The probabilities are calculated in the following way:

Let t and t' be part-of-speech tags; (t, t') means that t' follows t.
C(t) is a count of the number of times that tag t occurs in the training set.
C(t, t') is a count of the number of times that tag t' follows tag t.

P(t' | t) = C(t, t') / C(t)
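In code this estimate is a simple ratio (a sketch; pair_counts and tag_counts are assumed to hold the counts C(t, t') and C(t) obtained from the training data).

```python
def transition_probability(pair_counts, tag_counts, t, t_next):
    """P(t' | t) = C(t, t') / C(t), estimated from training-set counts."""
    if tag_counts.get(t, 0) == 0:
        return 0.0
    return pair_counts.get((t, t_next), 0) / tag_counts[t]
```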
The lexicon holds a list of words and their possible tags. When processing new, unseen,
text the processor will take each word in turn and first look it up in the lexicon. If it is not
there, tags will be assigned using rules for suffix or prefix.
Examples:
suffix    probable tag
-able     adjective        e.g. suitable
-cle      noun             e.g. article
-dle      noun or verb     e.g. handle
If the word does not have a suffix or prefix in the list, then a default value of noun or verb is given.
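The lookup-then-fallback logic might be sketched as follows (an illustration only, using just the three suffixes listed above; the real CLAWS suffix and prefix lists are much longer).

```python
# Illustrative suffix rules, taken from the examples above
SUFFIX_TAGS = {
    'able': ['adjective'],
    'cle':  ['noun'],
    'dle':  ['noun', 'verb'],
}

def guess_tags(word, lexicon):
    """Look the word up in the lexicon; fall back to suffix rules, then to a default."""
    if word in lexicon:
        return lexicon[word]
    for suffix, tags in SUFFIX_TAGS.items():
        if word.endswith(suffix):
            return tags
    return ['noun', 'verb']   # default when no suffix or prefix rule applies

lexicon = {'likes': ['VBZ', 'NNS']}      # toy lexicon, invented for illustration
print(guess_tags('suitable', lexicon))   # ['adjective']
```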
Summary

When words in the text have been allocated one or more tags, the disambiguation process begins.
The transition probabilities are looked up in a table, which has been created from the training data.
Then the most likely tags are calculated.
Example 2
Which of the words in the following sentence have more than one part-of-speech tag?
Norman forced her to cut down on smoking.
References for tag disambiguation

Garside et al., The Computational Analysis of English, chapter 4.

Charniak, E., Statistical Language Learning, chapter 3.

2nd Corpus based application


The Alpine neural parser
The Alpine parser takes natural language text and divides sentences up into constituent
parts. For example:
Yesterday                    {pre-subject}
that dog with a long tail    {subject}
bit the boy                  {predicate}
The pre-subject may be empty, but in a declarative sentence there will be a subject and a
predicate.
Steps in the process:
1. Using CLAWS, get possible tags for each word. The text is mapped onto a sequence
of part-of-speech tags.
2. Use the trained neural net to find the boundary markers of the subject. A single layer
neural net is used, as the data turns out to be linearly separable. The net is trained in
supervised mode (a minimal sketch of such a net is given after this list).

The neural net has 2 outputs: grammatical and ungrammatical, yes and no.
Each sentence generates a set of alternative tag sequences, and alternative locations for
the boundary markers. These are converted into sets of features that can be entered into
an input vector.
The net determines the sequence that is grammatical.
The correct location of the boundary markers, together with the correct tags, is displayed.
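Below is a minimal sketch of a single layer perceptron trained in supervised mode on linearly separable data; the features and labels are invented for illustration and are not the actual Alpine feature set.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Single layer perceptron, supervised training on linearly separable data.

    X: feature vectors (one per candidate tag sequence / boundary placement)
    y: 1 for 'grammatical', 0 for 'ungrammatical'
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# Toy, invented training data: two features, linearly separable
X = np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]])
y = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])   # [1, 1, 0, 0]
```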

Alpine as a hybrid system


Alpine is a corpus based application, using empirical methods.
It also has a rule based component: we assert that a sentence can be decomposed into
{pre-subject} {subject} {predicate}
Development of the two paradigms in historical context
Empirical methods were dominant in the 1950s, exemplified by behaviourism in
psychology and the rise of information theory. The linguist Firth summed up the
approach for language analysis in the well known aphorism: "You shall know a word by
the company it keeps." In tune with this philosophy, Rosenblatt introduced perceptrons,
the first neural processors.
However, Chomsky demolished the empirical position in linguistics in his seminal book
Syntactic Structures (1957). He established the Chomsky hierarchy, and showed the
limitations of regular grammars. As well as his work on natural language, his work on
computer languages and the construction of compilers contributed to his reputation.

Taking a similar stand, Minsky and Papert later demonstrated the limitations of the
perceptron in Perceptrons (1969). Empirical, data driven approaches went out of fashion.
Much work was done developing rule based systems in limited language domains,
usually with invented examples rather than naturally occurring sentences. But by the
1980s the limitations of rule based language processors became apparent. The goal of
codifying syntactic rules seems in practice unattainable for unrestricted natural language.
At this time modest achievements in speech recognition, based on empirical methods,
excited enthusiasm. Empirical methods became possible as the necessary large corpora
were collected. Increasing computing power made data driven, empirical methods
feasible, and they came back into favour.
There were other developments that encouraged this trend. In the 1980s some research
and its funding shifted from academia to industry. There was more interest in working
systems than underlying theories. The focus shifted from deep analysis of small samples
of language (rule based methods) to a shallow analysis of real language (data driven
methods).
Information theory provided metrics to evaluate empirically derived language models,
based on regular grammars. Rule based systems had no comparable evaluation
techniques. This was important in securing resources in an industrial environment.
In the mid 1980s neural networks had a resurgence, as methods were found to avoid the
problems with perceptrons. A key publication was Parallel Distributed Processing (1986)
by Rumelhart and McClelland. Many recent developments in natural language processing
are based mainly on data driven methods, e.g. language models for speech recognition,
and systems for information retrieval commonly based on Support Vector Machines.
Integrated systems that incorporate rule based and data driven modules are now found to
be necessary for many key tasks. Hybrid systems that are primarily data driven can
operate within a rule based framework. E.g. probabilistic or neural parsers.
Systems may combine modules based on different paradigms.
Example 1: the Verbmobil system, which aims to translate simple conversation between
English and German. Components include probabilistic taggers, leading into rule based
parsers, integrated with semantic analysers.
Example 2: See the paper "LCC Tools for Question Answering", which describes an
information retrieval system integrating at least 5 different processing modules. These
include a probabilistic parser and a rule based logic prover.
