
International Journal of Computer Trends and Technology- volume4Issue2- 2013

Network Based Framework for Information Retrieval Using Feature Association with Improved Keyword Weighting Factor
Evangelin. D (1), Kalaivani. V (2), and Nelson Samuel Jebastin. J (3)

(1) Assistant Professor, Department of Information Technology, Sri Vidya College of Engineering and Technology, Virudhunagar, TamilNadu, India
(2) Associate Professor, Department of Computer Science and Technology, National Engineering College, Kovilpatti, TamilNadu, India
(3) Assistant Professor in Bioinformatics, Annamalai University, Chidambaram, TamilNadu, India

Abstract
Various fields such as medicine, telecommunication, finance and banking, and online business use electronic records to store and retrieve information effectively. For instance, in the medical field, documents are prepared to store information about each disease an expert has dealt with. These documents include information such as symptoms, causes, and treatments for the disease. When the same or a similar case occurs later, a search can be performed on the stored documents and a relevant solution may be found quickly. As more records accumulate, however, retrieving the relevant documents efficiently becomes more complex, and current record retrieval techniques are not well suited to this defect record retrieval problem. In this study, a new paradigm for master defect record (i.e., the document containing the remedy for a problem) retrieval using network-based feature association (NBFA) is proposed. The retrieval process constructs feature associations between similar records in the database to limit the search space, and a new keyword weighting scheme is proposed to improve the relevance measure. Performance assessments on real data from the medical field are provided, and the difficulties and challenges in this area of research are discussed.

Index Terms: Text mining, Information Retrieval, query, keyword weighting, vector space model (VSM).

I. INTRODUCTION

Data mining is required in different areas including telecommunications, manufacturing, medicine [1], finance and banking, defect record retrieval, and online business [2]. As the cost of electronic storage has decreased, it has become worthwhile to store large amounts of information about past instances (e.g., transactions, failures, and diagnoses). For example, in the medical field, a document is prepared for a particular disease and the treatment required to cure it. When the same case appears later, a search can be performed on the stored documents. The challenge of this area of research is that a typical text record contains information stored in rich and complex ways, often with little or no fixed-form fields. For the proposed network-based feature association (NBFA), the records of the dataset should contain at least one fixed-form field. There are several ways to describe and store information contained in text records in concise ways. Still, in spite of hopes to use text records in a helpful way, the growth of stored information in text record databases has caused an often disadvantageous information flood [3]. To overcome this setback, a special branch of data mining, called text mining, has been widely studied in recent years to allow users to acquire useful knowledge from large amounts of text records [4]. Information Retrieval (IR) is an application of computer systems to fetch unstructured electronic text and perform other related activities; an IR system is used to retrieve relevant information for a user with an information need. IR is defined as follows [5]: an IR system does not inform (i.e., change the knowledge of) the user on the subject of his inquiry; it merely informs on the existence (or non-existence) and whereabouts of documents relating to his request. Querying and query processing are other common terms for IR. Text Classification is the process of automated classification of documents based on keyword occurrences. When comparing document vectors, the similarity measure employed can be the sum of common keyword values in the compared document vectors (for global keyword values) or the sum of the products of common keyword values (i.e., the dot product). This paper focuses on the development of a master defect record retrieval framework, which is tested using real-world technical defect records in the medical field.

II. RELATED WORK

Andrew Rodriguez et al. [6] discussed that data mining needs arise in various fields such as telecommunication, manufacturing, medicine, finance and banking, defect record

ISSN: 2231-2803

http://www.internationaljournalssrg.org



retrieval, and online business. N. Uramoto et al. [7] described a set of text-mining tools, designated MedTAKMI, an extension of TAKMI (Text Analysis and Knowledge MIning), that facilitates knowledge discovery from very large biomedical text databases and healthcare applications. Patricia Cerrito et al. [8] analyzed the use of data-mining techniques on an electronic medical record in the Emergency Department of a hospital to improve care while lowering costs; the data-mining techniques and association rules in Enterprise Miner were used to examine the data, and patients' orders, medications, and complaints were also examined using Text Miner to investigate relationships among the variable categories. Fatudimu et al. [9] discussed that a very large amount of politically oriented text is now available online; fortunately, many tools exist to manage this outbreak of textual information, many of them derived from earlier work in Information Retrieval (IR), natural language processing, statistics, artificial intelligence (AI), information theory, and data mining. Andrew Rodriguez et al. [6] observed that as electronic records (e.g., medical records and technical defect records) accumulate, retrieving a record from a past instance with the same or similar circumstances becomes extremely valuable. They studied a new paradigm for master defect record retrieval using network-based feature association (NBFA), in which the master record retrieval process constructs feature associations to limit the search space.

Two major text mining research areas, Information Retrieval (IR) and Text Categorization (TC), are used in this work. IR is the process of retrieving a single, linked record via a query of a text record database. TC refers to the assignment of text records into distinct classes or categories [10]. The vector space model (VSM) is an efficient method of text representation proposed in [11]. The VSM builds on the idea that an approximate meaning of a document is captured by the words contained in it: documents are represented as vectors in a vector space. The document set comprises an r x t record-term matrix in which each row represents a document, each column represents a keyword/term, and each entry represents the term value, which can be weighted using one of many weighting schemes. The TFIDF weighting scheme assigns a weight to every term in every record; the same term in different records may have different TFIDF weights. The TFIDF weighting scheme is often used in the VSM to determine the similarity between two text records. A variety of similarity measures and schemes have been proposed [6], [12], [13]; no single similarity measure has been found to be a good performer under all circumstances, and minor variations, such as different logarithmic base values, can cause prominent changes in the performance of a similarity measure. In this framework, a simple n-nearest neighbour search (n-NNS) using the Jaccard similarity coefficient is used, since the Jaccard coefficient is among the most common in the IR literature.
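The record-term matrix construction described above can be sketched in Python (used here purely for illustration; the record contents and vocabulary are hypothetical):

```python
from collections import Counter

def build_vsm(records, vocabulary):
    """Build an r x t record-term matrix: rows are records,
    columns are keywords, entries are raw term counts."""
    matrix = []
    for text in records:
        counts = Counter(text.lower().split())
        matrix.append([counts[term] for term in vocabulary])
    return matrix

# Toy example with hypothetical defect summaries:
records = ["display flicker after driver update",
           "driver crash on update install"]
vocabulary = ["driver", "update", "flicker", "crash"]
matrix = build_vsm(records, vocabulary)
# matrix[0] holds the counts of "driver", "update", "flicker",
# "crash" in the first record
```

Each row of `matrix` could then be weighted, e.g. by TFIDF, before similarity computation.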
III. PROBLEM DEFINITION

As the cost of electronic storage has decreased, it has become worthwhile to store large amounts of information about past instances (e.g., transactions, failures, and diagnoses). For instance, in the medical field, medical documents store information about each disease an expert has dealt with, including symptoms, causes, and treatments. When the same or a similar case occurs later, a search can be performed on the stored documents and a relevant solution may be found quickly. These records usually contain all of the information about a problem from its first report up to and including its resolution. When a user faces a problem, the user raises a query to look for similarities between the current problem and the previously recorded problems stored in the database. The objective is to retrieve the record most relevant to the user's query. The size of these defect records varies greatly, since each defect requires, and each record includes, a varying amount of detail. Text Classification (TC) is used to accurately classify or group text records into common entities in a dataset; the entities or groups are based on the correlation criterion of record characteristics.

IV. PROPOSED METHODOLOGY

As shown in Fig. 1, the record retrieval process consists of the following steps: 1) data pre-processing; 2) vector space model; 3) n-nearest neighbour search; 4) keyword weight; 5) network-based feature association; and 6) record searching configurations. The defect records, i.e., the problems occurring in the particular field, are given as input. In the data pre-processing step, keywords are extracted and represented in a standard format using the vector space model. Next, the keyword weight is calculated. A component identifier is assigned to each record in order to associate the components (features) using links between the records. Based on the keyword weighting scheme and the NBFA methods described in the following sections, eight search configurations can be derived. Using these configurations, the set of most similar records is retrieved for the given defect (input query). The process of retrieving the master defect record is described in the following section.


Fig. 1. Architecture of the Proposed System (input query; data pre-processing: keyword extraction and component identifier assignment; keyword weight; feature association; record search using the record searching configurations; returned records)

A. Data Pre-Processing
In this module, keyword extraction is performed. The keyword extraction process is semi-automated and incorporates initial human input to reduce irrelevant content in the text record (input query). A list of possible keywords is generated by scanning the headline and problem summary of the given input record, and the terms that pass through this process are retained as keywords. Articles, conjunctions, prepositions, and pronouns, often referred to as stop words or functional words in this body of research, are removed during keyword extraction. In parallel, each record is assigned a component identifier; this field is used in the later steps as follows.

B. Vector Space Model
The query given by the user is in an unstructured format. To represent the query in a standard format, i.e., a format the system can understand, a VSM is constructed for the user's query. As mentioned in Section A, the keywords are extracted from the input query and a keyword vector is constructed using the VSM; each entry represents the number of times the word occurs in the query. A VSM is also constructed for the dataset by building a keyword vector for each record based on the keywords that occur in the input query; here each entry represents the number of times the keyword occurs in the whole document. Using the semi-automated extraction system, the record representations can be updated periodically as the size and number of records grow. Another objective is to scan the dataset to record the master-duplicate relationships. Each record has a duplicate-of field. If this field does not indicate another record, then the record being scanned is not a duplicate of any other record; if the duplicate-of field has a value, it is the record identifier of the record's master record. Fig. 2 shows how the incoming query is converted into standard format using the vector space model. The search vector is given as a query to the database, the similarity between the query and the documents in the database is computed, and the documents are ranked and returned based on the similarity measure.

Fig. 2. Vector Space Model for Query and Documents (keyword vectors are constructed for the input query and for each record in the dataset; a similarity measure ranks the records, e.g. R5, R7, and returns them for the given query)
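The keyword extraction and query-vector construction of Sections A and B can be sketched as follows. Python is used for illustration; the stop-word list and the sample query are assumptions, since the paper's actual process is semi-automated with human input:

```python
import re

# A minimal stop-word list for illustration only; the paper's
# semi-automated process curates the final keyword set by hand.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on",
              "to", "for", "it", "its", "is", "are", "was", "when"}

def extract_keywords(text):
    """Tokenize a record's headline/summary and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def query_vector(query, vocabulary):
    """Count how many times each vocabulary keyword occurs in the query."""
    keywords = extract_keywords(query)
    return {term: keywords.count(term) for term in vocabulary}
```

The same `query_vector` routine, applied to a whole record instead of a query, yields the per-record keyword vectors of the dataset-side VSM.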

C. n-Nearest Neighbour Search (n-NNS)
To find records in the database having the same or similar defect issues as the defect record (query) given by the user, a standard n-NNS using the Jaccard similarity coefficient is implemented. This search compares selected elements (keywords) in the database with the terms (keywords) extracted from the query to measure the similarity between the query and each document. The problem of finding nearest neighbours has been widely studied for data in a simple, low-dimensional vector space [1]. In this case, however, the data lies in a large metric space in which the number of neighbours grows very rapidly as the search space increases. To lessen the total cost of comparisons, NBFA (described in Section E) is implemented to reduce the search space. The Jaccard similarity coefficient between the user's query and a document in the dataset is defined as follows:

Similarity(i, j) = Σ_{t ∈ T} min(v_t^i, v_t^j) / Σ_{t ∈ T} max(v_t^i, v_t^j)    (1)
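A minimal sketch of the n-NNS step, assuming the generalized (min/max) Jaccard form of Eq. (1); the sample vectors in the test are hypothetical:

```python
def jaccard_similarity(vq, vd):
    """Generalized Jaccard coefficient between two keyword-count
    vectors (query vq, document vd): the sum of element-wise
    minima over the sum of element-wise maxima."""
    num = sum(min(a, b) for a, b in zip(vq, vd))
    den = sum(max(a, b) for a, b in zip(vq, vd))
    return num / den if den else 0.0

def n_nearest(query_vec, record_vecs, n):
    """Basic n-NNS: rank record indices by similarity to the query
    and keep the top n."""
    ranked = sorted(range(len(record_vecs)),
                    key=lambda i: jaccard_similarity(query_vec, record_vecs[i]),
                    reverse=True)
    return ranked[:n]
```

With NBFA enabled, `record_vecs` would be restricted to records whose components are reachable from the query record's component, rather than the whole database.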



where v^i = (v_1^i, v_2^i, . . . , v_|T|^i) gives the number of times each term occurs in the query, v^j = (v_1^j, v_2^j, . . . , v_|T|^j) gives the number of times each term occurs in a document, and |T| is the dimension of the keyword vector space.

D. Keyword Weight
In this module, the issue of the uniqueness of keywords is addressed. This issue is important because commonly used keywords may not help in identifying the master-duplicate relationships. To incorporate the uniqueness of keywords, a global keyword weight is employed: in the keyword matching step of the NNS, each keyword weight depends on the number of documents in which the word occurs. The uniqueness of keywords in the database is examined and a keyword weight is generated using a basic count of the number of documents in which the keyword occurs. This technique is intuitive, since keywords that appear in many different records of the database are not unique and are less likely to help locate the master record. Note that what is counted here is not the number of times a term appears in a document but the number of documents in which the keyword occurs: keywords that appear in only a few records are given a greater weight than keywords that appear in many records. For term t, the keyword weight is given by

w_t = log[ ( Σ_{d=1}^{n} tf_{t,d} ) / df_t ]    (2)

where tf_{t,d} is the term frequency of term t in document d, df_t is the document frequency of term t, and n is the number of documents. After the keyword weight function is generated, it is incorporated into the n-NNS by modifying the Jaccard similarity measure. The new similarity is given by

Similarity(i, j) = Σ_{t ∈ T} w_t · min(v_t^i, v_t^j) / Σ_{t ∈ T} max(v_t^i, v_t^j)    (3)

where v^i and v^j are the query and document keyword vectors as in (1), and |T| is the dimension of the keyword vector space.

E. Network Based Feature Association
To improve the performance of record searches, the records are associated by establishing links among them. Each record contains the component field (a fixed-format field), which in this work is used to create associations among the records; the component field is suggested to be the most useful of the fixed-form fields by the engineers familiar with the records. The underlying principle of feature association is to reduce the search space. Specifically, associations of the component field are built by linking components based on training master-duplicate pairs. Since these features are present in all records, this information can be used to limit the search space as shown in Fig. 3. First, construct an unconnected component network, where the set of nodes is the set of components. Then, add an arc (i, j) between nodes (components) i and j if, in the training set, a record with component i is a duplicate of a record with component j. When the search for the master record in the database is performed, only the records associated with components that are reachable from the component of the new record are considered, as shown in Fig. 3. Note that each node is reachable from itself. The associated components network can be constructed in several ways due to the nature and characteristics of network linking; three types of network construction are described in the following sections.

Fig. 3. Search Space in Database
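The keyword weight of Eq. (2) and the weighted similarity of Eq. (3) can be sketched as follows, assuming the summation form of Eq. (2) reconstructed above; the count vectors in the test are illustrative:

```python
import math

def keyword_weights(record_vecs):
    """Global keyword weight per Eq. (2): for each term, the log of
    its total frequency across all documents divided by the number
    of documents it appears in. Rare terms get larger weights."""
    num_terms = len(record_vecs[0])
    weights = []
    for k in range(num_terms):
        tf_total = sum(vec[k] for vec in record_vecs)     # sum of tf over documents
        df = sum(1 for vec in record_vecs if vec[k] > 0)  # document frequency
        weights.append(math.log(tf_total / df) if df else 0.0)
    return weights

def weighted_similarity(vq, vd, w):
    """Weighted Jaccard per Eq. (3): numerator terms scaled by w_t."""
    num = sum(wt * min(a, b) for wt, a, b in zip(w, vq, vd))
    den = sum(max(a, b) for a, b in zip(vq, vd))
    return num / den if den else 0.0
```

Note that a term occurring once in every document gets weight log(n/n) = 0, so ubiquitous keywords contribute nothing to the match, which is the intended behaviour.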

1) Direct Links
The direct links method of network construction limits the search space the most. A node (component) is reachable from another node only if there exists a directed arc, built when the training set contained a master-duplicate pair that directly links the two components. For example, in the direct links column of Fig. 4, assume only three master-duplicate pairs are used for training; the resulting NBFAs are a) if B then A; b) if C then A; and c) if D then C. After training, if there is a record with known component B, then the search space is limited to records with known components B and A. Similarly, if there is a record with known component C, then the search space is limited to records with known components C and A. However, if there is a record with known component A, then the search space is limited to records with known component A only.

2) Undirected Direct Links
The undirected direct links method of network construction is much like the method described earlier, except that the links that are built are bidirectional (undirected). Essentially, the added information is the master-component-to-duplicate-component relationship. For example, in the undirected direct links column of Fig. 5, assume only three master-duplicate pairs are used for training; the resulting NBFAs are a) if A then B and C; b) if B then A; c) if C then A and D; and d) if D then C. After training, if there is a record with known component A, then the search space is limited to records with known components A, B, and C. Similarly, if there is a record with known component C, then the search space is limited to records with known components C, A, and D.
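A sketch of the link construction and reachability test, using the B/A, C/A, D/C training example above. The `indirect` flag, which follows links transitively, is an illustrative extension anticipating the third construction method:

```python
def build_links(training_pairs, undirected=False):
    """Build component arcs from training master-duplicate pairs:
    one arc (duplicate component -> master component) per pair;
    undirected=True also adds the reverse arc (the undirected
    direct links method)."""
    arcs = {}
    for dup_comp, master_comp in training_pairs:
        arcs.setdefault(dup_comp, set()).add(master_comp)
        if undirected:
            arcs.setdefault(master_comp, set()).add(dup_comp)
    return arcs

def reachable(arcs, component, indirect=False):
    """Components whose records stay in the search space. Each node
    is reachable from itself; with indirect=True, links are followed
    transitively instead of one hop only."""
    seen = {component}
    frontier = [component]
    while frontier:
        node = frontier.pop()
        for nxt in arcs.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                if indirect:
                    frontier.append(nxt)
    return seen
```

With the three training pairs (B duplicates A, C duplicates A, D duplicates C), `reachable(arcs, "B")` yields the components B and A, matching the direct links example in the text.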



3) Indirect Links
The indirect links method of network construction grows the largest network with respect to the number of connected components in a tree. For example, after the training that takes place in Fig. 6, an indirect link has been formed from component D to A by the pairs A-C and C-D, although there is no direct link from component A to D. Consider the example in Fig. 6: assume that a new text record r7 has component A, whose indirect links include nodes B, C, and D. When the NNS is performed, only the records associated with components A, B, C, and D are considered; all other records are excluded from the search space.

Fig. 4. Direct Links

Fig. 5. Undirected Direct Links

Fig. 6. Indirect Links

F. Record Searching Configuration
Based on the keyword weighting scheme and the NBFA methods described in Sections D and E, eight search configurations can be derived: 1) no component association, without keyword weights (basic NNS); 2) no component association, with keyword weights; 3) direct links, without keyword weights; 4) direct links, with keyword weights; 5) undirected direct links, without keyword weights; 6) undirected direct links, with keyword weights; 7) indirect links, without keyword weights; and 8) indirect links, with keyword weights. Each configuration has advantages and disadvantages. For example, the direct links method is the most limiting and reduces the search space the most; however, it also increases the probability that the true master record is excluded from the search space. On the other hand, the indirect links method provides the least constrained search space; although the true master record is more likely to be included, the recall of the search may not increase drastically.

V. RESULT ANALYSIS
Validation is carried out by automatically querying the database with a defect (input) record and determining whether the corresponding master record, i.e., the record containing the remedy for the given problem, is returned in the top n documents retrieved from the database (as illustrated in Fig. 2). The keyword vector of each defect (input) record is used as the search input for a query. A successful query returns the master record from the entire database for the defect (input) record given by the user. Only the records whose associated components are reachable from the component of the defect (input) record are



included in the search space, i.e., only the associated records are searched (as shown in Fig. 3). Two properties are now well accepted by the research community for measuring search effectiveness: recall and precision. Recall is the ability of the search to find all of the relevant items in the corpus. Precision is the ability to retrieve documents that are mostly relevant to the given query. Both metrics are used to evaluate the performance of the documents retrieved for the user's query.
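The two metrics can be computed directly from the sets of retrieved and relevant record identifiers (the IDs in the test are hypothetical):

```python
def recall(retrieved, relevant):
    """Fraction of all relevant records in the database that the
    search actually returned."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Fraction of the returned records that are actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
```

In the master-record setting there is a single relevant record per query, so recall over a batch of queries reduces to the fraction of queries whose master record appears in the top n results.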
Recall = (number of relevant documents retrieved) / (total number of relevant documents in the database)

Fig. 7. Relative recall yielded by the eight record searching configurations

Precision = (number of relevant items retrieved) / (total number of items retrieved)

Fig. 8. Precision values yielded by the eight record searching configurations

The relative recall and precision values are calculated for all eight record searching configurations, and the values obtained in our experiments are plotted in Fig. 7 and Fig. 8. The analysis shows that the configuration Direct Link with Keyword Weight is the best of the eight record searching configurations, given that the objective of this work is to retrieve the master record containing the relevant solution for the user's query.

REFERENCES
[1] N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, and K. Takeda, (2004) "A text-mining system for knowledge discovery from biomedical documents," IBM Systems Journal, vol. 43, no. 3, pp. 516-533.
[2] B. Liu, (2003) "Mining data records in web pages," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 601-606.
[3] P. Myllymaki, T. Silander, H. Tirri, and P. Uronen, (2001) "Bayesian data mining on the web with B-Course," in Proc. First IEEE Int. Conf. Data Mining, pp. 626-629.
[4] P. Cerrito and J. C. Cerrito, (2006) "Data and text mining the electronic medical record to improve care and to lower costs," in Proc. SAS SUGI, pp. 1-20.
[5] F. W. Lancaster, (1968) Information Retrieval Systems: Characteristics, Testing, and Evaluation. New York: Wiley.
[6] A. Rodriguez, W. A. Chaovalitwongse, L. Zhe, H. Singhal, and H. Pham, (2010) "Master defect record retrieval using network-based feature association," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 40, no. 3, pp. 319-329.
[7] N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, and K. Takeda, (2004) "A text-mining system for knowledge discovery from biomedical documents," IBM Systems Journal, vol. 43, no. 3, pp. 516-533.
[8] P. Cerrito and J. C. Cerrito, (2006) "Data and text mining the electronic medical record to improve care and to lower costs," in Proc. SAS SUGI, pp. 1-20.
[9] I. T. Fatudimu, A. G. Musa, C. K. Ayo, and A. B. Sofoluwe, (2008) "Knowledge discovery in online repositories: a text mining approach," European Journal of Scientific Research, vol. 22, no. 2, pp. 241-250.
[10] D. L. Lee, H. Chuang, and K. Seamons, (1997) "Document ranking and the vector-space model," IEEE Software, vol. 14, no. 2, pp. 67-75.
[11] J. Atkinson and A. Rivas, (2008) "Discovering novel causal patterns from biomedical natural-language texts using Bayesian nets," IEEE Transactions on Information Technology in Biomedicine, vol. 12, no. 6, pp. 714-722.
[12] A. Mustafa, A. Akbar, and A. Sultan, (2009) "Knowledge discovery using text mining: a programmable implementation on information extraction and categorization," International Journal of Multimedia and Ubiquitous Engineering, vol. 4, no. 2.
[13] K. D. Quinones, H. Su, B. Marshall, S. Eggers, and H. Chen, (2007) "User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system," IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 5, pp. 527-536.

Authors
D. Evangelin (corresponding author) is an Assistant Professor of Information Technology at Sri Vidya College of Engineering and Technology, TamilNadu, India. She has published papers in two international journals and in national and international conferences.

V. Kalaivani is an Associate Professor of Computer Science and Engineering at National Engineering College, Kovilpatti, TamilNadu, India. She has 15 years of teaching experience and 11 years of research experience, and has published many papers in international journals and in national and international conferences.

J. Nelson Samuel Jebastin is an Assistant Professor of Bioinformatics at Annamalai University, Chidambaram, TamilNadu, India. He has six years of teaching experience and four years of research experience, and has published papers in three international journals and in national and international conferences.
