and some modifications from presentations found in the WEB by several scholars including the following
Previous Lectures
Pre-start questionnaire
Introduction and Phases of an NLP system
NLP Applications - Chatting with Alice
Finite State Automata & Regular Expressions & languages
Deterministic & Non-deterministic FSAs
Morphology: Inflectional & Derivational
Parsing and Finite State Transducers
Stemming & Porter Stemmer
20 Minute Quiz
Statistical NLP Language Modeling
N-Grams
Smoothing and N-Gram: Add-one & Witten-Bell
Return Quiz 1
Parts of Speech
Today's Lecture
Parts of Speech
Start with eight basic categories
These categories are based on morphological and distributional properties (not semantics).
Some cases are easy, others are not.
Parts of Speech
Closed classes
Prepositions: on, under, over, near, by, at, from, to, with, etc.
Determiners: a, an, the, etc.
Pronouns: she, who, I, others, etc.
Conjunctions: and, but, or, as, if, when, etc.
Auxiliary verbs: can, may, should, are, etc.
Particles: up, down, on, off, in, out, at, by, etc.
Open classes
Nouns
Verbs
Adjectives
Adverbs
There are various standard tagsets to choose from; some have many more tags than others.
The choice of tagset depends on the application.
Accurate tagging is possible even with large tagsets.
Brown corpus: 87 tags
Penn Treebank: 45 tags
Lancaster UCREL C5: 61 tags
Lancaster C7: 145 tags
UCREL C5
Tagging
Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence.
Assume we have:
A tagset
A dictionary that gives the possible set of tags for each entry
A text to be tagged
A reason?
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS
the driver put the keys on the table
TAGS
N V P DET
Tags per word type    87-tag Brown      45-tag Treebank Brown
2 tags                4,967             6,731
3 tags                411               1,621
4 tags                91                357
5 tags                17                90
6 tags                2 (well, beat)    32
7 tags
8 tags                                  4 ('s, half, back, a)
9 tags                                  3 (that, more, in)
Most words are unambiguous.
Many of the most common English words are ambiguous.
Rule-based Tagging
Use a dictionary (lexicon) to assign each word a list of potential POS tags.
Use large lists of hand-written disambiguation rules to identify a single POS for each word.
Sami is expected to race tomorrow.
Sami/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People continue to inquire the reason for the race for outer space.
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: choose the tag with the greater likelihood, comparing P(race|VB) and P(race|NN)
Actual estimates from the Switchboard corpus:
P(race|NN) = .00041
P(race|VB) = .00003
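The lexical-frequency comparison can be sketched in a few lines of Python. This is only an illustration built on the two Switchboard estimates quoted above; the dictionary and function names are hypothetical. Used alone, this criterion picks NN for every occurrence of race, including the verb in "to race tomorrow", which is why the later slides bring in context.

```python
# Lexical likelihoods P(word | tag) as quoted on the slide (Switchboard corpus).
LEXICAL_LIKELIHOOD = {
    ("race", "NN"): 0.00041,
    ("race", "VB"): 0.00003,
}


def most_likely_tag(word, candidate_tags):
    """Return the candidate tag with the highest P(word | tag).

    Hypothetical helper: pairs missing from the table count as 0.
    """
    return max(candidate_tags,
               key=lambda tag: LEXICAL_LIKELIHOOD.get((word, tag), 0.0))


print(most_likely_tag("race", ["NN", "VB"]))
```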
An example of Transformation-based Learning
Very popular (freely available, works fairly well)
A SUPERVISED method: requires a tagged corpus
Basic idea: do a quick job first (using frequency), then revise it using contextual rules
An example
Examples:
It is expected to race tomorrow.
The race for outer space.
Tagging algorithm:
1. Tag all uses of race as NN (most likely tag in the Brown corpus)
2. Use a transformation rule to replace the tag NN with VB for all uses of race preceded by the tag TO
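The two steps can be sketched as follows. This is a minimal illustration, not the tagger's actual implementation: only the race/NN entry and the NN-to-VB rule come from the slide, the other lexicon entries are assumed Penn-style tags added so the two example phrases can be processed, and the function names are hypothetical.

```python
# Step 1 lexicon: most frequent tag per word.
# Only "race": "NN" is from the slide; all other entries are assumptions.
MOST_FREQUENT_TAG = {
    "it": "PRP", "is": "VBZ", "expected": "VBN", "to": "TO",
    "race": "NN", "tomorrow": "NN", "the": "DT", "for": "IN",
    "outer": "JJ", "space": "NN",
}


def initial_tagging(words):
    """Step 1: give every word its most frequent tag."""
    return [(w, MOST_FREQUENT_TAG[w.lower()]) for w in words]


def apply_rule(tagged, word, from_tag, to_tag, prev_tag):
    """Step 2: retag `word` from from_tag to to_tag when the previous tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(tagged)):
        current_word, current_tag = tagged[i]
        if (current_word.lower() == word and current_tag == from_tag
                and tagged[i - 1][1] == prev_tag):
            result[i] = (current_word, to_tag)
    return result


for sentence in ["It is expected to race tomorrow", "The race for outer space"]:
    tagged = initial_tagging(sentence.split())
    print(apply_rule(tagged, "race", "NN", "VB", "TO"))
```

The rule fires only in the first sentence, where race follows to/TO; in the second, race keeps its initial NN tag.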
Stochastic (Probabilities)
Simple approach
Disambiguate words based on the probability that a word occurs with a particular tag
N-gram approach
The best tag for a given word is determined by the probability that it occurs with the n previous tags
Viterbi Algorithm
Trim the search for the most probable tag using the best N Maximum Likelihood Estimates (N is the number of tags of the following word)
[Figure: tag lattice for the words "the can will rust", with candidate tags DT, noun, aux, and verb]
[Figure: state sequence (states S4, S5) over the words "promised to back the bill", with the tags VBD and VB shown]
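$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}$$
P(W) is common to every candidate tag sequence, so it can be dropped:
$$\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(W \mid T)\,P(T)$$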
1) Sami is expected to race tomorrow.
2) People continue to inquire the reason for the race for outer space.
1) Sami/NNP is/VBZ expected/VBN to/TO race/? tomorrow/NN
2) People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/? for/IN outer/JJ space/NN
Bigram
Finally:
P(NN | TO) P(race | NN) = 0.000007
P(VB | TO) P(race | VB) = 0.00001
Since 0.00001 > 0.000007, race after to/TO is tagged VB.
Tag   Tag count   Tag pair   Pair count   Bigram          Estimate
N     833         N, N       108          Prob(N | N)     0.13
N     833         N, P       366          Prob(P | N)     0.44
V     300         V, N       75           Prob(N | V)     0.35
V     300         V, ART     194          Prob(ART | V)   0.65
P     307         P, ART     226          Prob(ART | P)   0.74
P     307         P, N       81           Prob(N | P)     0.26
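Each estimate in this table is a maximum likelihood ratio: the count of the tag pair divided by the count of its first tag. A minimal sketch that recomputes the ratios from the counts above (the names are hypothetical):

```python
# Counts copied from the bigram table.
TAG_COUNTS = {"N": 833, "V": 300, "P": 307}
PAIR_COUNTS = {
    ("N", "N"): 108, ("N", "P"): 366,
    ("V", "N"): 75, ("V", "ART"): 194,
    ("P", "ART"): 226, ("P", "N"): 81,
}


def bigram_estimate(prev_tag, tag):
    """MLE of Prob(tag | prev_tag) = count(prev_tag, tag) / count(prev_tag)."""
    return PAIR_COUNTS.get((prev_tag, tag), 0) / TAG_COUNTS[prev_tag]


for prev_tag, tag in PAIR_COUNTS:
    print(f"Prob({tag} | {prev_tag}) = {bigram_estimate(prev_tag, tag):.2f}")
```

Five of the six ratios round to the listed estimates. The V, N pair gives 75/300 = 0.25 rather than the listed 0.35, so one of those two figures is presumably a misprint in the table.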
A Markov Chain
[Figure: Markov chain over the tags, with states including ART and N and transition probabilities 0.71, 1, 0.43, 0.13, 0.44, 0.35, 0.65, 0.74, 0.26]
Word Counts
          N     V     ART   P     Total
flies     21    23    0     0     44
fruit     49    5     1     0     55
like      10    30    0     21    61
a         1     0     201   0     202
the       1     0     300   2     303
flower    53    15    0     0     68
flowers   42    16    0     0     58
birds     64    1     0     0     65
others    592   210   56    284   1142
Total     833   300   558   307   1998
[Figure sequence: Viterbi lattice for "flies like a flower", starting from NULL/0. Nodes at each word, with the scores shown on the slides:]
Word      Nodes
flies     flies/V = 7.6*10^-6, flies/N, flies/P, flies/ART = 0
like      like/V = 0.00031, like/N = 1.3*10^-5, like/P = 0.00022, like/ART
a         a/V, a/N = 1.2*10^-7, a/P, a/ART = 7.2*10^-5
flower    flower/V = 2.6*10^-9, flower/N = 4.3*10^-6, flower/P, flower/ART
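The lattice search for "flies like a flower" can be reproduced with a short Viterbi sketch. Lexical probabilities are taken as count(word, tag) / count(tag) from the Word Counts table, and tag transitions come from the bigram estimates. Several inputs are assumptions rather than slide content: the 0.71, 1, and 0.43 transitions appear only as edge labels of the Markov chain figure and are read here as start-to-ART, ART-to-N, and N-to-V; the 0.29 start-to-N value and the 0.0001 floor for unlisted tag pairs are assumed; all names are hypothetical. Because of rounding, the scores come out close to, but not digit-for-digit identical with, the values on the slides.

```python
# Tag totals and word counts copied from the Word Counts table.
TAGS = ["N", "V", "ART", "P"]
TAG_TOTALS = {"N": 833, "V": 300, "ART": 558, "P": 307}
WORD_COUNTS = {
    "flies":  {"N": 21, "V": 23, "ART": 0,   "P": 0},
    "like":   {"N": 10, "V": 30, "ART": 0,   "P": 21},
    "a":      {"N": 1,  "V": 0,  "ART": 201, "P": 0},
    "flower": {"N": 53, "V": 15, "ART": 0,   "P": 0},
}

# Tag-bigram estimates. "<0>" is a hypothetical start-of-sentence marker.
TRANSITIONS = {
    ("<0>", "ART"): 0.71,  # ASSUMPTION: read off the Markov chain figure
    ("<0>", "N"): 0.29,    # ASSUMPTION: 1 - 0.71, not listed on the slides
    ("ART", "N"): 1.0,     # ASSUMPTION: read off the Markov chain figure
    ("N", "V"): 0.43,      # ASSUMPTION: read off the Markov chain figure
    ("N", "N"): 0.13, ("N", "P"): 0.44,
    ("V", "N"): 0.35, ("V", "ART"): 0.65,
    ("P", "ART"): 0.74, ("P", "N"): 0.26,
}
UNSEEN = 0.0001  # ASSUMPTION: small floor for tag pairs with no listed estimate


def transition(prev_tag, tag):
    return TRANSITIONS.get((prev_tag, tag), UNSEEN)


def lexical(word, tag):
    """P(word | tag) estimated as count(word, tag) / count(tag)."""
    return WORD_COUNTS[word][tag] / TAG_TOTALS[tag]


def viterbi(words):
    """Return (best tag sequence, list of per-word score tables)."""
    scores = [{t: transition("<0>", t) * lexical(words[0], t) for t in TAGS}]
    backpointers = [{}]
    for i in range(1, len(words)):
        column, pointers = {}, {}
        for tag in TAGS:
            # Keep only the best-scoring predecessor for each node.
            best_prev = max(TAGS, key=lambda p: scores[i - 1][p] * transition(p, tag))
            column[tag] = (scores[i - 1][best_prev] * transition(best_prev, tag)
                           * lexical(words[i], tag))
            pointers[tag] = best_prev
        scores.append(column)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final node.
    path = [max(TAGS, key=lambda t: scores[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(backpointers[i][path[-1]])
    path.reverse()
    return path, scores


sentence = ["flies", "like", "a", "flower"]
path, scores = viterbi(sentence)
print(path)
for word, column in zip(sentence, scores):
    print(word, {tag: f"{score:.2g}" for tag, score in column.items()})
```

Under these assumptions the best path tags flies as N, like as V, a as ART, and flower as N.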
Performance
This method has achieved 95-96% accuracy with reasonably complex English tagsets and reasonable amounts of hand-tagged training data.
Forward pointer: it is also possible to train a system without hand-labeled training data.
End of Part 1
Lecture 10: Parts of Speech 2-2 Morphosyntactic Tagset Of Arabic - Husni Al-Muhtaseb
Shereen Khoja: 177 tags
103 Nouns
57 Verbs
9 Particles
7 Residual
1 Punctuation
Three genders
Masculine, Feminine, Neuter
Three persons
The speaker
The person being addressed
The person that is not present
Three numbers
Singular, Dual, Plural
Three moods of the verb
Indicative, Subjunctive, Jussive
Three cases
Nominative, Accusative, Genitive
[Figure sequence: tagset hierarchy, built up over several slides]
Word
  Noun
    Common
    Proper
    Pronoun
      Personal
      Relative
        Specific
        Common
      Demonstrative
    Numeral
      Cardinal
      Ordinal
      Numerical Adjective
    Adjective
  Verb
    Perfect
    Imperfect
    Imperative
  Particle
    Subordinates
    Answers
    Explanations
    Prepositions
    Adverbial
    Conjunctions
    Interjections
    Exceptions
    Negatives
  Residual
    Foreign
    Mathematical Formulae
    Numerals
  Punctuation
    Question Mark
    Exclamation Mark
    Comma
[Figure: system diagram with the components ManTag, Training Corpus, DataExtract, Probability Matrix, Tagged Corpus, and APT]
DataExtract Process
Takes in a tagged corpus and extracts various lexicons and the probability matrix
Sproat (1992) defines a clitic as a syntactically separate word that functions phonologically as an affix
Lexical probability: probability of a word having a certain tag
Contextual probability: probability of a tag following another tag
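Both quantities can be estimated by counting over a tagged corpus. The sketch below is an illustration of that idea, not the DataExtract program itself; the function name, the input format (a list of sentences, each a list of (word, tag) pairs), and the toy corpus are assumptions.

```python
from collections import Counter, defaultdict


def extract_probabilities(tagged_sentences):
    """Estimate lexical and contextual probabilities by relative frequency.

    lexical[word][tag]    ~ probability that `word` carries `tag`
    contextual[prev][tag] ~ probability that `tag` follows `prev`
    """
    word_tag = defaultdict(Counter)
    tag_next = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            word_tag[word][tag] += 1
            if prev_tag is not None:
                tag_next[prev_tag][tag] += 1
            prev_tag = tag
    lexical = {w: {t: c / sum(tags.values()) for t, c in tags.items()}
               for w, tags in word_tag.items()}
    contextual = {p: {t: c / sum(nxt.values()) for t, c in nxt.items()}
                  for p, nxt in tag_next.items()}
    return lexical, contextual


# Toy corpus (hypothetical), reusing the English "race" example.
corpus = [
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("race", "NN")],
]
lexical, contextual = extract_probabilities(corpus)
print(lexical["race"])
print(contextual["TO"])
```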
Probability matrix (each row sums to 1)
        N       V       P       No.     Pu.
N       0.711   0.065   0.143   0.010   0.071
V       0.926   0.037   0.0     0.008   0.029
P       0.689   0.199   0.085   0.016   0.011
No.     0.509   0.06    0.098   0.009   0.324
Pu.     0.492   0.159   0.152   0.046   0.151
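The probability matrix from this slide can be held in a small lookup structure. The sketch below assumes that rows are the first tag and columns the tag that follows it; that reading is suggested by the fact that each row sums to 1, but it is an assumption, as are the names used.

```python
TAGS = ["N", "V", "P", "No.", "Pu."]

# Values copied from the slide; row = first tag, column = following tag (ASSUMPTION).
MATRIX = {
    "N":   [0.711, 0.065, 0.143, 0.010, 0.071],
    "V":   [0.926, 0.037, 0.0,   0.008, 0.029],
    "P":   [0.689, 0.199, 0.085, 0.016, 0.011],
    "No.": [0.509, 0.06,  0.098, 0.009, 0.324],
    "Pu.": [0.492, 0.159, 0.152, 0.046, 0.151],
}


def contextual_probability(first_tag, next_tag):
    """Look up the probability that next_tag follows first_tag."""
    return MATRIX[first_tag][TAGS.index(next_tag)]


for tag, row in MATRIX.items():
    print(tag, "row sum:", round(sum(row), 3))
print(contextual_probability("V", "N"))
```

Read this way, a noun is the most likely successor of every category in the matrix.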
Arabic Corpora
59,040 words of the Saudi al-Jazirah newspaper, dated 03/03/1999
3,104 words of the Egyptian al-Ahram newspaper, dated 25/01/2000
5,811 words of the Qatari al-Bayan newspaper, dated 25/01/2000
17,204 words of al-Mishkat, an Egyptian published paper in social science, April 1999
Statistical Component
1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
5. PU [punctuation]
1.1. C [common]
1.2. P [proper]
1.3. Pr [pronoun]
1.4. Nu [numeral]
1.5. A [adjective]
Singular, masculine, accusative, common noun
Singular, masculine, genitive, common noun
1.3.1. P [personal]
Detached words, or attached to a word:
to nouns to indicate possession
to verbs as direct object
to prepositions
Third person, singular, masculine, personal pronoun
Singular, feminine, demonstrative pronoun
Relative Pronoun
1.3.2.1. S [specific]
1.3.2.2. C [common]
Dual, feminine, specific, relative pronoun
Plural, masculine, specific, relative pronoun
Common, relative pronoun
1.4.1. Ca [cardinal]
1.4.2. O [ordinal]
1.4.3. Na [numerical adjective]
Gender
M [masculine] F [feminine] N [neuter]
Number
Sg [singular] Du [dual] Pl [plural]
Person
1 [first] 2 [second] 3 [third]
Case
N [nominative] A [accusative] G [genitive]
Definiteness
D [definite] I [indefinite]
Verbs
1. P [perfect]
2. I [imperfect]
3. Iv [imperative]
First person, singular, neuter, perfect verb
First person, singular, neuter, indicative, imperfect verb
Second person, singular, masculine, imperative verb
Person
1 [first] 2 [second] 3 [third]
Number
Sg [singular] Pl [plural] Du [dual]
Mood
I [indicative] S [subjunctive] J [jussive]
1.1. Pr [prepositions]
1.2. A [adverbial]
1.3. C [conjunctions]
1.4. I [interjections]
1.5. E [exceptions]
1.6. N [negatives]
1.7. A [answers]
1.8. X [explanations]
1.9. S [subordinates]
Prepositions: in
Adverbial particles: shall
Conjunctions: and
Interjections: you
Exceptions: except
Negatives: not
Answers: yes
Explanations: that is
Subordinates: if
Parts of Speech
Part of Speech
Nouns
Adverbs
Verbs
Particles
Unique
Residual
Punctuation
1. Noun
Nouns (N)
I. Type
II. Definiteness
III. Gender
IV. Number
V. Case
VI. Followship
VII. Variability
VIII. Soundness
I. Type
Common: C
Proper: P
Adjective: J
Numeral: N
Personal Pronoun: S
Relative Pronoun: R
Demonstrative Pronoun: D
II. Definiteness
Definite: D
Indefinite: I
III. Gender
Masculine: M
Feminine: F
Unmarked: U
IV. Number
Singular: 1
Dual: 2
Plural: 3 (Sound: S, Broken: B, Mass: M)
Unmarked: 4
Singular & Dual & Plural (man): A
V. Case
Nominative: N (Agent: A, Subject of cana: C, Subject: S, Predicative of inn: I)
Accusative: A (Patient: P, Predicative of cada: K, State (manner): S, Predicative of cana: C, Subject of inn: I, Distinguative: D, Cause: U, Infinitive: F)
Genitive: G (Post preposition: P, Adjunct (post noun): A)
Vocative: V
VI. Followship
Assertion: A
Coordinated: C
Attributive: T
Substitute: S
VII. Variability
Invariable (static): I
Variable: V
Semi-Variable: S
Vowels: W
Letters: L
VIII. Soundness
Defective: D
Sound: S
Ending with ya: Y
Type: Adjective (J)
Degree
Positive: P
Comparative: C
Superlative: S
Type: Numeral (N)
Function
Cardinal: R
Ordinal: O
Numerical adjective: A
Person
First: 1
Second: 2
Third: 3
Attachment
Attached: T
Detached: D
Type
Specific: F
Common: M
Example
<Noun, Common, Definite, Feminine, Singular, Nominative (Agent), , Variable-Vowels, Sound>
<N-C-D-F-1-NA--VW-S>
<N-C-I-F-3B-AP--VW-S>
<Noun, Personal Pronoun, Definite, Feminine, Singular, Genitive post noun (Adjunct), , Invariable (static), , Third, Attached>
<N-S-D-F-1-GA--I--3-T>
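A tag such as <N-C-D-F-1-NA--VW-S> can be decoded field by field. The sketch below handles only the eight-field noun layout (Type, Definiteness, Gender, Number, Case, Followship, Variability, Soundness) and uses the code letters listed on the noun slides. The function and table names are hypothetical, the hyphen-separated layout is inferred from the example tag, and placing Infinitive (F) under Accusative is an assumption.

```python
NOUN_FIELDS = ["type", "definiteness", "gender", "number",
               "case", "followship", "variability", "soundness"]

# Fields encoded by a single letter.
SIMPLE_CODES = {
    "type": {"C": "Common", "P": "Proper", "J": "Adjective", "N": "Numeral",
             "S": "Personal Pronoun", "R": "Relative Pronoun",
             "D": "Demonstrative Pronoun"},
    "definiteness": {"D": "Definite", "I": "Indefinite"},
    "gender": {"M": "Masculine", "F": "Feminine", "U": "Unmarked"},
    "followship": {"A": "Assertion", "C": "Coordinated",
                   "T": "Attributive", "S": "Substitute"},
    "soundness": {"D": "Defective", "S": "Sound", "Y": "Ending with ya"},
}

# Fields encoded by a main code plus an optional sub-code.
NUMBER = {"1": "Singular", "2": "Dual", "3": "Plural"}
PLURAL_KINDS = {"S": "Sound", "B": "Broken", "M": "Mass"}
CASES = {"N": "Nominative", "A": "Accusative", "G": "Genitive", "V": "Vocative"}
CASE_ROLES = {
    "N": {"A": "Agent", "C": "Subject of cana", "S": "Subject",
          "I": "Predicative of inn"},
    "A": {"P": "Patient", "K": "Predicative of cada", "S": "State (manner)",
          "C": "Predicative of cana", "I": "Subject of inn",
          "D": "Distinguative", "U": "Cause",
          "F": "Infinitive"},  # ASSUMPTION: placement under Accusative
    "G": {"P": "Post preposition", "A": "Adjunct (post noun)"},
    "V": {},
}
VARIABILITY = {"I": "Invariable (static)", "V": "Variable", "S": "Semi-Variable"}
MARKING = {"W": "Vowels", "L": "Letters"}


def describe(field, code):
    """Translate one field code into a readable label (None if the field is empty)."""
    if code == "":
        return None
    if field in SIMPLE_CODES:
        return SIMPLE_CODES[field][code]
    main, sub = code[0], code[1:]
    if field == "number":
        label, subs = NUMBER[main], PLURAL_KINDS
    elif field == "case":
        label, subs = CASES[main], CASE_ROLES[main]
    else:  # variability
        label, subs = VARIABILITY[main], MARKING
    return f"{label} ({subs[sub]})" if sub else label


def decode_noun_tag(tag):
    """Decode a hyphen-separated noun tag into a field -> label dictionary."""
    parts = tag.strip("<>").split("-")
    if parts[0] != "N" or len(parts) != len(NOUN_FIELDS) + 1:
        raise ValueError(f"not an 8-field noun tag: {tag!r}")
    return {field: describe(field, code)
            for field, code in zip(NOUN_FIELDS, parts[1:])}


print(decode_noun_tag("<N-C-D-F-1-NA--VW-S>"))
```

Pronoun tags carry extra Person and Attachment fields and would need a longer field list than the one sketched here.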
2. Adverbs
Adverbs (D)
Aspect
Time: T
Place: P
Case
Nominative
Accusative
Genitive
3. Verbs
Verbs (V)
I. Tense (Aspect)
II. Gender
III. Number
IV. Person
V. Case
VI. Conditional
VII. Voice
VIII. Variability
IX. Perfectness
X. Augmentation
XI. Amount
XII. Soundness
I. Tense
1. 2. 3.
II. Gender
1. Masculine (M)
2. Feminine (F)
3. Unmarked (U)
III. Number
Singular & Dual & Plural: verb of (man) (A)
Dual & Plural: verb of (ma, nahno) (T)
IV. Person
1. 2. 3.
V. Case
1. Indicative (N)
2. Subjunctive (A): Infinitive (F), Non Infinitive (N)
3. Jussive (G)
VI. Conditional
1. 2.
VII. Voice
1. Active (A)
2. Passive (P)
VIII. Variability
1. 2.
Vowels (W)
Letters (L)
IX. Perfectness
1. Perfect (P)
2. Imperfect (Can and cada) (I)
X. Augmentation
XI. Amount
XII. Soundness
Defective (D)
Sound (S)
XIII. Transitivity
Transitive (T)
One Patient (O)
Two Patient (T)
Intransitive (I)
<Verb, Past, Feminine, Singular, Third, Subjunctive non infinitive, , Active, Invariable (static), Perfect, Augmented, Trilateral, Sound, Intransitive Agent only>
<V-P-F-1-3-AN--A-I-P-A-T-S-IA>
4. Particles
1. Coordinating
2. Subordinating
3. Interrogative
4. Preposition
5. Possibility
6. Protection
7. Future
8. Conditional
9. Answer
10. Exclamation
11. Interjection/Introgative
12. Negative
13. Imperative (Order)
14. Cause
15. Gerund
16. Deporticle
17. ta of Femininity
18. Explanation
<Particles, ta of Femininity> <P-17>
<Particles, Swearness> <P-21>
5. Unique (U)
Denominal: D
Past: P
Present: R
Imperative (Order): I
<Unique, the letters at the beginning of some of the suras of Al-Quran>
<U-L>
6. Residual
Residual (R)
Type
Foreign: F
Formula: R
Acronym: A
Abbreviation: B
Gender
Masculine: M
Feminine: F
Unmarked: U
Number
Singular: 1
Dual: 2
Plural: 3
Case
Nominative: N
Accusative: A
Genitive: G
Vocative: V
Followship
Assertion: A
Coordinated: C
Attributive: T
Substitute: S
<Residual, Foreign, Feminine, Singular, Genitive Adjunct (post noun)>
<R-F-F-1-GA>
7. Punctuation (C)
? Question Mark (Q)
! Exclamation Mark (X)
.. Ellipsis (E)
. Full Stop (F)
Comma (C)
; Dotted Comma (D)
- Hyphen (H)
- Interspersion Marks (I)
, The English Comma (G)
, , Interspersion Marks (R)
( ) Brackets (B)
" Quotation Marks (U)
: Colon (O)
[ ] Square Brackets (S)
{ }
/ Slash (L)
<C-F>