
Domain-Specific Term Extraction for Concept Identification in Ontology Construction
Kiruparan Balachandran and Surangika Ranathunga
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka
2016 IEEE/WIC/ACM International Conference on Web Intelligence

Introduction - Ontology
An ontology is a formal and explicit specification of a shared conceptualization.
An ontology consists of:

Classes
Properties (Taxonomic and Non-taxonomic)
Individuals
Values
Axioms: used to verify the consistency of the ontology.
E.g., a sorting algorithm can be considered an algorithm if and only if it solves a certain Computer Science problem.

[Figure: example ontology. Sorting Algorithm is an Algorithm; Algorithm solves Problem; Algorithm has Complexity.]

Issues in Manual Construction


Time consuming
Noise
Experts have different viewpoints, assumptions, and needs regarding the same domain

Ontology Learning

Ontology learning (OL) is a solution to overcome issues related to the manual construction of ontologies.

Can be an automatic or a semi-automatic process

Building an ontology from scratch
Enriching or adapting an existing ontology

Ontology Learning Layer-Cake Approach

Layers, from top to bottom, with examples:
Rules: isA(Sorting Algorithm, Algorithm) -> solve(Sorting Algorithm, Problem)
Relations: solve(Algorithm, Problem), known as a non-taxonomic relationship
Concept Hierarchy: isA(Sorting Algorithm, Algorithm), known as a taxonomic relationship
Concepts: Algorithm (I, E, L)
Synonyms: {Randomized Algorithm, Sorting Algorithm}, {System Software, Application Software}
Terms: {Randomized Algorithm, Sorting Algorithm, System Software, Application Software}
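
To make the upper layers concrete, the isA and solve examples can be sketched as a small set of triples with one inference rule. This is a minimal illustration only: the triple layout and the function name `apply_inheritance_rule` are assumptions for this sketch, not part of the presented work.

```python
# Hypothetical triple store for the slide's example (subject, relation, object).
triples = {
    ("Sorting Algorithm", "isA", "Algorithm"),   # taxonomic relationship
    ("Algorithm", "solve", "Problem"),           # non-taxonomic relationship
}


def apply_inheritance_rule(facts):
    """Apply the rule isA(X, Y) and rel(Y, Z) -> rel(X, Z) once.

    This mirrors the slide's rule
    isA(Sorting Algorithm, Algorithm) -> solve(Sorting Algorithm, Problem).
    """
    inferred = set(facts)
    for child, relation, parent in facts:
        if relation != "isA":
            continue
        for subject, predicate, obj in facts:
            # Copy every non-taxonomic relation of the parent down to the child.
            if subject == parent and predicate != "isA":
                inferred.add((child, predicate, obj))
    return inferred


closure = apply_inheritance_rule(triples)
for fact in sorted(closure):
    print(fact)
```

The rule is applied a single time, which is enough for a one-level hierarchy; a deeper hierarchy would need the rule repeated until no new triples appear.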

Unresolved Issues in Term Extraction


Assuming that the domain expert supplies the domain-specific terms
Selecting corpora based on word count
Considering only a single contrastive corpus

[Figure: a target domain corpus and contrastive corpora. Domains shown: Computer Science, Bio Medical, Cricket, other domains.]


Issues in Term Extraction


Inverse document frequency: inadequate to identify the cross-domain distribution
Domain relevance (DR): fails if a term is used at a higher frequency in a few documents in a domain, but not equally across domains
Domain consensus (DC): only considers the term distribution within a domain but not the term distribution across domains
DR with DC for a single contrastive corpus: when a large number of corpora are combined, each term receives a significant count from the individual corpora, and this count misleads the calculation of the statistical distribution
Complex term extraction: rules do not consider all possible POS tags and do not limit the size of complex terms
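
The three measures criticized above can be sketched as follows. The slides do not give their formulas, so this block assumes commonly used definitions: IDF as log(N / document frequency), domain relevance as a term's relative frequency in the target domain divided by its maximum relative frequency over all domains, and domain consensus as the entropy of the term's distribution over the documents of one domain. All function names and the toy corpora are hypothetical.

```python
import math


def idf(term, documents):
    """Inverse document frequency over a list of tokenized documents."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0


def relative_frequency(term, documents):
    """Occurrences of the term divided by the total token count of the domain."""
    total = sum(len(doc) for doc in documents)
    return sum(doc.count(term) for doc in documents) / total if total else 0.0


def domain_relevance(term, target_docs, domains):
    """Assumed DR: p(term | target) / max over domains of p(term | domain)."""
    denom = max(relative_frequency(term, docs) for docs in domains.values())
    return relative_frequency(term, target_docs) / denom if denom else 0.0


def domain_consensus(term, documents):
    """Assumed DC: entropy of the term's distribution over one domain's documents."""
    counts = [doc.count(term) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c)


# Toy tokenized corpora (illustrative only).
cs = [["algorithm", "sorts", "data"], ["algorithm", "runs"]]
bio = [["protein", "binds", "cell"]]
domains = {"cs": cs, "bio": bio}

print(idf("algorithm", cs + bio))
print(domain_relevance("algorithm", cs, domains))
print(domain_consensus("algorithm", cs))
```

Note that `domain_consensus` looks only at documents inside one domain, which is the limitation the slide points out: it says nothing about how the term behaves in other domains.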


Improving Domain-Specific Term Extraction Process

Objective: the Terms layer of the ontology learning layer cake (Rules, Relations, Concept Hierarchy, Concepts, Synonyms, Terms)

Domain-Specific Term Extraction Process

Selecting and organizing corpora -> target domain corpus with contrastive corpora -> corpus annotation -> POS-tagged corpus -> extracting domain-specific terms

Selecting Corpora
Select corpora with high lexical richness
[Table: occurrence (%, normalized by length) by frequency of words, without stop words; columns include MK, NUS, and FAO.]
Values in source order: 1.31, 2.23, 1.69, 2.46, 0.08, 20.82, 0.39, 0.55, 0.81, 0.63, 6.49, 0.19, 0.25, 0.39, 0.27, 2.26, 3.09, 0.12, 0.15, 0.26, 0.17, 0.12, 1.82, 0.08, 0.10, 0.19, 0.11, 1.26
Total: 2.11, 3.31, 3.63, 3.66, 2.48, 33.50
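
One way to compare corpora by lexical richness is a frequency profile of their vocabulary. The sketch below assumes that each table cell is the number of distinct non-stop words occurring exactly k times, expressed as a percentage of corpus length; that reading of the table, the tiny stop-word list, and the name `frequency_profile` are all assumptions made for illustration.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def frequency_profile(tokens, max_freq=5, stop_words=STOP_WORDS):
    """Percentage (of corpus length) of distinct words occurring exactly k times."""
    words = [t.lower() for t in tokens if t.lower() not in stop_words]
    word_counts = Counter(words)
    # Map each frequency k to the number of distinct words with that frequency.
    words_per_frequency = Counter(word_counts.values())
    length = len(tokens)
    if length == 0:
        return {k: 0.0 for k in range(1, max_freq + 1)}
    return {
        k: 100.0 * words_per_frequency.get(k, 0) / length
        for k in range(1, max_freq + 1)
    }


tokens = "the algorithm sorts the data and the algorithm stops".split()
profile = frequency_profile(tokens)
print(profile)
```

Under this assumption, a corpus with a larger share of low-frequency words has a richer vocabulary for its size.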

Organizing Corpora
Target domain is iteratively selected

Target domain: Mikalai Krapivin (Computer Science). Contrastive domains: GENIA (Bio Medical), Cricinfo RSS
Target domain: GENIA (Bio Medical). Contrastive domains: Mikalai Krapivin (Computer Science), Cricinfo RSS
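
The iterative selection can be sketched as a loop in which each corpus takes a turn as the target while the remaining corpora act as contrastive corpora. The function name `organize_corpora` is hypothetical, and the sketch rotates through every corpus, which may be more than the presented setup does.

```python
def organize_corpora(corpora):
    """Yield (target, contrastive corpora) pairs, one per corpus."""
    for target in corpora:
        contrastive = [name for name in corpora if name != target]
        yield target, contrastive


corpora = ["Mikalai Krapivin", "GENIA", "Cricinfo RSS"]
arrangements = list(organize_corpora(corpora))
for target, contrastive in arrangements:
    print(target, "vs", contrastive)
```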

Extracting Domain-Specific Terms


Selecting and organizing corpora -> target domain corpus with contrastive corpora -> corpus annotation -> POS-tagged corpus -> extracting domain-specific terms -> domain-specific terms

Extracting domain-specific terms uses:
Linguistic rules to extract simple terms and complex terms
Statistical distribution calculation to support multiple contrastive corpora

Domain-specific terms (example output): rca algorithm, time complexity, computational complexity, processor

Extracting Simple and Complex Terms


Corpora: Mikalai Krapivin, GENIA, Cricinfo

Tokenize, annotate with POS
{algorithm/NN}, {computational/JJ complexity/NN}, {time/NN complexity/NN}, {RCA/NNP algorithm/NN}, {Minkowski/NNP sum/NN}, {convex/NN subpolygons/NNS}

Linguistic rules: find simple and complex terms

Select candidate terms
{algorithm/NN}, {computational/JJ complexity/NN}, {time/NN complexity/NN}, {RCA/NNP algorithm/NN}
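
The linguistic step can be sketched as pattern matching over POS-tagged tokens. The slides do not list the actual rules, so the pattern here is an assumption: a simple term is a single noun, and a complex term is a run of adjectives or nouns ending in a noun, capped at `max_len` tokens. The tag sets and the name `extract_candidates` are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
MODIFIER_TAGS = {"JJ"} | NOUN_TAGS  # assumed set of tags allowed before the head noun


def extract_candidates(tagged, max_len=3):
    """Return (simple_terms, complex_terms) from a list of (word, tag) pairs."""
    simple_terms, complex_terms = [], []
    for i, (word, tag) in enumerate(tagged):
        if tag in NOUN_TAGS:
            simple_terms.append(word.lower())
        # Complex terms: 2..max_len tokens, modifiers followed by a head noun.
        for length in range(2, max_len + 1):
            span = tagged[i:i + length]
            if len(span) < length:
                break  # ran past the end of the sentence
            *modifiers, head = span
            if head[1] in NOUN_TAGS and all(t in MODIFIER_TAGS for _, t in modifiers):
                complex_terms.append(" ".join(w.lower() for w, _ in span))
    return simple_terms, complex_terms


tagged_sentence = [
    ("the", "DT"), ("computational", "JJ"), ("complexity", "NN"),
    ("of", "IN"), ("RCA", "NNP"), ("algorithm", "NN"),
]
simple_terms, complex_terms = extract_candidates(tagged_sentence)
print(simple_terms)
print(complex_terms)
```

The `max_len` cap reflects the earlier point that existing rules do not limit the size of complex terms; the value 3 is arbitrary.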

Domain weight calculation for each term t (e.g., processor, computational complexity)

Weigh Domain-Specific Terms


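Target corpus: Mikalai Krapivin. Contrastive corpora: GENIA, Cricinfo.

For the complex term "computational complexity", the contrastive value is the maximum over the contrastive corpora:
$p_{MAX}(\text{computational complexity}) = \max\big(p_{GENIA}(\text{computational complexity}),\ p_{Cricinfo}(\text{computational complexity})\big)$

Target-corpus values: $p(\text{computational complexity})$, $p(\text{computational})$, $p(\text{complexity})$
Contrastive-corpus maxima: $p_{MAX}(\text{computational complexity})$, $p_{MAX}(\text{computational})$, $p_{MAX}(\text{complexity})$

$a_i$, $\arg\max_{\text{term}} a_i$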
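
The role of the maximum over contrastive corpora can be sketched as below. Only the use of p and the MAX over GENIA and Cricinfo comes from the slide; treating p as a relative frequency, and the final `contrast_ratio` that divides the target value by the larger of the target value and the contrastive maximum, are assumptions for illustration and not the authors' weighting formula.

```python
def relative_frequency(term, tokens):
    """Occurrences of a (possibly multi-word) term per token of the corpus."""
    words = term.split()
    n = len(words)
    hits = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words)
    return hits / len(tokens) if tokens else 0.0


def p_max(term, contrastive):
    """Largest relative frequency of the term over all contrastive corpora."""
    return max(relative_frequency(term, toks) for toks in contrastive.values())


def contrast_ratio(term, target_tokens, contrastive):
    """Hypothetical score in [0, 1]; 1.0 means no contrastive corpus uses the term more."""
    p_target = relative_frequency(term, target_tokens)
    denom = max(p_target, p_max(term, contrastive))
    return p_target / denom if denom else 0.0


# Toy corpora (illustrative only).
target = "the computational complexity of the algorithm".split()
contrastive = {
    "GENIA": "the protein complexity".split(),
    "Cricinfo": "the batsman scored".split(),
}

for term in ("computational complexity", "computational", "complexity"):
    print(term, contrast_ratio(term, target, contrastive))
```

Taking the maximum per term, rather than merging all contrastive corpora into one, keeps a term that is frequent in just one other domain from being diluted by the rest.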


Evaluation of Domain-Specific Term Extraction

Based on a comparison with existing studies that use the same data sets.

Evaluation of the best 700 simple terms and best 700 complex terms for the Computer Science domain (precision, top 700):
Our approach, ComSci precision for complex terms: 52.5%
Our approach, ComSci precision for simple terms: 55%
Existing approach, DC precision for simple terms: 47%
Existing approach, C-value/NC-value precision for simple and complex terms: 28%

Evaluation of the best 300 simple terms and best 300 complex terms for the Bio Medical domain (precision, top 300):
Our approach, Bio Medical precision for complex terms: 62%
Our approach, Bio Medical precision for simple terms: 80%
Existing approach, DC precision for simple terms: 55%
Existing approach, C-value/NC-value precision for simple and complex terms: 32%
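
Precision over the top-ranked terms can be sketched as the fraction of the first k ranked terms that appear in a reference list. The slides do not say how correctness was judged, so the gold set, the example terms, and the name `precision_at_k` are assumptions.

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked terms that are in the gold set."""
    top = ranked_terms[:k]
    if not top:
        return 0.0
    return sum(1 for term in top if term in gold_terms) / len(top)


ranked = ["time complexity", "batsman", "rca algorithm", "protein"]
gold = {"time complexity", "rca algorithm"}
score = precision_at_k(ranked, gold, k=4)
print(score)
```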

Conclusion
Our contribution in term extraction for ontology learning
Implemented a mechanism to select corpora and discussed an approach to organize corpora
Implemented a calculation of statistical distribution that can extract simple and complex terms from multiple domains by using multiple contrastive corpora
Improvement
Consider synonyms of each domain-specific term, which helps to identify more terms specific to a domain


Questions?

Thank You
