Collocations
Frank Smadja, Retrieving Collocations from Text: Xtract, Computational Linguistics, 1993
Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
Technical and non-technical
Examples of collocations
The Dow Jones average of industrials
The Dow average
The Dow industrials
*The Jones industrials
The Dow Jones industrial
*The industrial Dow
*The Dow industrial
Collocation properties
Arbitrary (dialect dependent)
ride a bike, set the table
Domain dependent
dry suit, wet suit
Recurrent
Cohesive
Part of a collocation primes for the rest
Applications
Lexicography
Grammatical restrictions (compare with/to but associate with)
Generation
Translation
Types of collocations
Predicative relations
make a decision, hostile takeover
Flexible: possibly intervening words, possibly morphological and syntactic variation
Semantic constraints (cf. doctors-dentists and doctors-hospitals)
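The flexibility of a predicative relation can be made concrete by counting how often the two words co-occur at each relative distance. The sketch below is only an illustration: the function name `offset_histogram`, the tokenized-sentence input, and the window of 5 words are assumptions, not details taken from the slides.

```python
from collections import Counter

def offset_histogram(sentences, w1, w2, window=5):
    """Count occurrences of w2 at each signed distance from w1.

    sentences: list of token lists (hypothetical input format).
    window: maximum distance considered on either side (an assumed value).
    """
    hist = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != w1:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] == w2:
                    hist[j - i] += 1
    return hist

sentences = [
    ["make", "a", "decision"],
    ["make", "the", "final", "decision"],
    ["the", "decision", "to", "make"],
]
hist = offset_histogram(sentences, "make", "decision")
```

A pair whose counts spread over several offsets, as here, behaves like a flexible collocation; a pair whose counts pile up at a single offset behaves like a rigid one.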
Evaluation
Ask a lexicographer to evaluate the output
40% precision after stages one and two
80% precision after stage three
94% conditional recall
Terminology
Béatrice Daille, Study and Implementation of Combined Techniques for Automatic Extraction of Terminology, ACL Balancing Act workshop, 1994
Terms refer to concepts
Terms are key for populating a domain ontology
Terms are typically nominal compounds of a certain structure, e.g., N N, N of N
Defining terms
Unique reference
Unique translation
Term extension by
modification (e.g., addition of an adjective)
substitution
extension of structure
coordination
Algorithm
Apply syntactic constraints to match pairs of words in a candidate term
Filter by application of an association measure
Measures examined: pointwise mutual information, χ² (chi-square), log-likelihood ratio
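The three measures named above can all be computed from the 2x2 contingency table of a word pair. The following is a minimal sketch using the standard textbook formulas; the function name `association_scores` and the cell names `k11` to `k22` are assumptions for illustration, and the code assumes every row and column total is non-zero.

```python
import math

def association_scores(k11, k12, k21, k22):
    """Score a word pair (w1, w2) from its 2x2 contingency table.

    k11: pairs with w1 and w2       k12: pairs with w1 but not w2
    k21: pairs with w2 but not w1   k22: pairs with neither word
    Assumes all row and column totals are non-zero.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22

    # Pointwise mutual information: log2 of observed over expected joint count.
    pmi = math.log2(k11 * n / (row1 * col1)) if k11 > 0 else float("-inf")

    # Pearson chi-square statistic for a 2x2 table.
    chi2 = n * (k11 * k22 - k12 * k21) ** 2 / (row1 * row2 * col1 * col2)

    # Log-likelihood ratio: 2 * sum over cells of observed * ln(observed / expected).
    llr = 0.0
    for obs, r, c in ((k11, row1, col1), (k12, row1, col2),
                      (k21, row2, col1), (k22, row2, col2)):
        if obs > 0:
            llr += obs * math.log(obs * n / (r * c))
    llr *= 2.0

    return {"pmi": pmi, "chi2": chi2, "llr": llr}

independent = association_scores(10, 10, 10, 10)
associated = association_scores(20, 0, 0, 20)
```

For the independent table all three scores are zero; for the perfectly associated table they are all positive, so ranking candidates by any of them puts the associated pair first.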
Observations
Compare with a reference list
Frequency is a strong predictor
Log-likelihood ratio works best
Additional criteria:
diversity of the distribution of each word
distance between the two words (determines flexibility but not term status)
Analysis
Examined association measures
Well-known problems:
eliminating general-language constructs (e.g., collocations)
what to do with single-word terms?
Observations
Frequency works well
But a stronger predictor is P(k > 1) compared to P(k ≥ 1) in the same document, where k is the number of times a candidate occurs
Use syntactic patterns to propose terms, then check whether they reappear in the same document
Require this across multiple documents
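One way to read the recurrence test above is as a per-candidate ratio: among the documents that contain a candidate at all, the fraction in which it occurs more than once. The sketch below follows that reading; the function names, the `(word, tag)` input format, the single tag `"N"`, and the noun-noun pattern are all assumptions chosen for illustration.

```python
from collections import Counter

def propose_candidates(tagged_doc):
    """Propose noun-noun bigrams from a list of (word, tag) pairs."""
    cands = []
    for (w1, t1), (w2, t2) in zip(tagged_doc, tagged_doc[1:]):
        if t1 == "N" and t2 == "N":
            cands.append((w1.lower(), w2.lower()))
    return cands

def repetition_ratio(tagged_docs):
    """Per candidate: documents where it repeats / documents where it occurs."""
    docs_with_any = Counter()
    docs_with_repeat = Counter()
    for doc in tagged_docs:
        counts = Counter(propose_candidates(doc))
        for cand, k in counts.items():
            docs_with_any[cand] += 1
            if k > 1:
                docs_with_repeat[cand] += 1
    return {c: docs_with_repeat[c] / docs_with_any[c] for c in docs_with_any}

docs = [
    [("word", "N"), ("sense", "N"), ("is", "V"), ("word", "N"), ("sense", "N")],
    [("word", "N"), ("sense", "N"), ("matters", "V")],
    [("the", "D"), ("last", "A"), ("week", "N")],
]
ratios = repetition_ratio(docs)
```

Requiring the repetition in several documents, as the last point asks, would amount to thresholding the repeat count rather than the ratio.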
Term Expansion
Jacquemin, Klavans, and Tzoukermann, Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax, ACL 1997
Need to expand a given list of terms, especially for scientific domains
Term variation
Syntactic (same words, different structure)
Morphosyntactic (derivational forms of words)
Semantic (synonyms are used)
In IR, normalization through stemming and removal of stop words
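The IR-style normalization mentioned above can be sketched in a few lines. This is a toy illustration only: the stop-word list, the suffix list, and the names `crude_stem` and `normalize` are assumptions, and a real system would use a proper stemmer.

```python
STOP_WORDS = {"of", "the", "a", "an", "in", "for", "and"}  # illustrative list
SUFFIXES = ("ations", "ation", "ing", "es", "s")  # crude, illustrative rules

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(term):
    """Lowercase, drop stop words, stem, and ignore word order."""
    words = [w for w in term.lower().split() if w not in STOP_WORDS]
    return frozenset(crude_stem(w) for w in words)

same_syntactic = normalize("expansion of terms") == normalize("term expansion")
same_semantic = normalize("term growth") == normalize("term expansion")
```

Under these assumptions the syntactic variant collapses to the same key, while the synonym-based variant does not, which is the gap that variant-aware expansion is meant to close.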
Approach
Process the corpus, matching new candidate terms to old ones via unification
Matching based on
inflectional morphology (transducer)
derivational morphology (rule-based)
syntactic transformations
additions of words
Results
Manual inspection of several thousand proposed terms
Precision of 89%
Effectiveness in indexing increases by a factor of three when using the variants (precision/recall from 99.7/72 to 97/93)