
Newsletter Volume 4 Issue 8 August 2012

Editor's Desk

Nouns and the Null Thesaurus Matter Most in Indexing as well as Searching
Nouns constitute nearly half of the descriptors (postable terms) in any retrieval system. In free-text searching (a post-controlled vocabulary situation), if you express your key concepts as nouns, half of your search strategy is done. The general rule for indexing, and hence for searching in systems that support natural language, is to use a noun or noun phrase wherever possible. Adjectives may be used where no noun is available. When a noun phrase begins with an adjective, the noun from which the adjective is derived should be considered. Prepositions are excluded from noun phrases wherever possible.

If indexing has not already taken care of this, and/or the search engine does not assure you that it merges variant word forms and statistically associates synonymous words, it is necessary to include all variant forms and spellings, synonyms, near-synonyms, quasi-synonyms, abbreviations and acronyms of the nouns, and to tie them together with the Boolean operator OR so that recall is significantly boosted.
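A minimal sketch of that OR-expansion, for illustration only. The function names, the example terms, and the query syntax (parentheses, double-quoted phrases, upper-case operators) are assumptions; real search engines differ. Joining separate concepts with AND is also an assumption of the sketch, since the text above only prescribes OR within a concept.

```python
def or_group(variants):
    """Join the variant forms of ONE concept with OR (hypothetical helper)."""
    unique = []
    for term in variants:
        term = term.strip()
        # Skip blanks and case-insensitive duplicates, keeping first-seen order.
        if term and term.lower() not in (u.lower() for u in unique):
            unique.append(term)
    # Assumed syntax: multi-word phrases are wrapped in double quotes.
    quoted = ['"%s"' % t if " " in t else t for t in unique]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts):
    """Combine several concepts; AND between concepts is an assumption."""
    return " AND ".join(or_group(variants) for variants in concepts)


# Illustrative concepts: spelling variants, then singular/plural/word forms.
concepts = [
    ["aluminium", "aluminum"],
    ["weld", "welds", "welding"],
]
query = build_query(concepts)
print(query)  # (aluminium OR aluminum) AND (weld OR welds OR welding)
```

Each inner list holds every form of one noun the searcher can think of; the wider that list, the higher the recall of the resulting query.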

Usually, singular versus plural forms, American versus British spellings, acronyms and abbreviations, as well as true synonyms, are easy to guess and identify for most concepts. It is with near-synonyms and quasi-synonyms that one has to be doubly careful. Near-synonyms are not completely synonymous but are sufficiently close, like evaporation and vaporization, that the distinction is not worth making while indexing or searching in a retrieval system. They also include terms of different linguistic origin, popular and scientific names (allergy/hypersensitivity), terms from different cultures (flats/apartments), different grammatical forms, trade names, etc.

Quasi-synonyms have different meanings but need to be treated as synonyms. They include words that represent opposites, extremes or different viewpoints on a descriptive property continuum, like stability and instability, or roughness and smoothness. The assumption is that a user interested in stability automatically wishes to know about the instability aspect as well. They also cover concepts that have considerable semantic overlap (lighting and illumination), reciprocals and complements (hardness/softness, dryness/wetness, accuracy/error), and non-equivalents and diametrical opposites that are mutually exclusive (organic/inorganic, rest/motion). However, reversals like potential and counter potential should be treated as separate terms.
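One way to picture this treatment is a table of equivalence classes, sketched below. The data structure and function names are hypothetical; the term pairs are the examples given above. Reversals are simply left out of the table, so each one resolves only to itself.

```python
# Each set is one class of terms to be treated as synonyms when searching.
EQUIVALENCE_CLASSES = [
    {"evaporation", "vaporization"},   # near-synonyms
    {"allergy", "hypersensitivity"},   # popular and scientific names
    {"flats", "apartments"},           # terms from different cultures
    {"stability", "instability"},      # opposites on one property continuum
    {"lighting", "illumination"},      # considerable semantic overlap
]
# "potential" and "counter potential" (reversals) are deliberately absent.


def build_index(classes):
    """Map every member term to the full class it belongs to."""
    index = {}
    for members in classes:
        frozen = frozenset(members)
        for member in members:
            index[member] = frozen
    return index


def equivalents(term, index):
    """Return all terms to search for; unknown terms map to themselves."""
    key = term.lower()
    return sorted(index.get(key, {key}))


INDEX = build_index(EQUIVALENCE_CLASSES)
print(equivalents("stability", INDEX))          # ['instability', 'stability']
print(equivalents("counter potential", INDEX))  # ['counter potential']
```

A search on stability therefore also retrieves material on instability, while a search on potential is not widened to counter potential.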

Controlling near-synonyms and quasi-synonyms tends to improve recall but also tends to reduce precision. When controlling word forms (endings), it is customary to reduce them to the root form (i.e., the stem), which a computer does almost effortlessly but at the cost of losing valid distinctions. This again improves recall and reduces precision. For example, the stem weld, which covers welds, welding, welded, weldable, weldability and welder, can have enormous recall power. For these reasons, many retrieval systems, even if they do not have a controlled vocabulary (thesaurus), try to practise a hybrid language, with an authority file for nouns in the given specific field along with a computer-generated list of stem words called a null thesaurus.
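The idea of a computer-generated stem list can be sketched as follows. This is a deliberately crude suffix stripper written for illustration, not the Porter algorithm or any particular system's stemmer; the suffix list, the minimum stem length of three letters, and the function names are all assumptions.

```python
from collections import defaultdict

# Assumed suffix list, ordered longest first so "ability" wins over "s".
SUFFIXES = ("ability", "able", "ing", "ed", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def null_thesaurus(words):
    """Group word forms under their stem: a toy null thesaurus."""
    groups = defaultdict(set)
    for word in words:
        groups[crude_stem(word)].add(word.lower())
    return {stem: sorted(forms) for stem, forms in groups.items()}


words = ["weld", "welds", "welding", "welded",
         "weldable", "weldability", "welder"]
table = null_thesaurus(words)
print(table)
```

All seven forms fall under the single stem weld, which is the recall gain. The precision cost shows up just as quickly: this sketch also reduces wander to wand, and it no longer separates welder (a person or machine) from weldability (a material property).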

Lastly, apart from the authority file for nouns and the null thesaurus, two simple ways of creating a limited (or partial) controlled vocabulary, which can be used as a post-controlled vocabulary for searching, are: (i) grouping all types of synonymous words into classes and (ii) grouping related specific terms under more generic terms. Further, term frequency, statistical association of terms, hedges, search fragments, meta-thesauri, controlled vocabulary dynamics, and other desktop and taxonomy tools make information retrieval far easier and more effective.

M S Sridhar
sridhar@informindia.co.in
