Network Based Framework for Information Retrieval Using Feature Association with Improved Keyword Weighting Factor
Evangelin.D¹, Kalaivani.V², and Nelson Samuel Jebastin.J³
¹ Assistant Professor, Department of Information Technology, Sri Vidya College of Engineering and Technology, Virudhunagar, TamilNadu, India
² Associate Professor, Department of Computer Science and Technology, National Engineering College, Kovilpatti, Tamilnadu, India
³ Assistant Professor in Bioinformatics, Annamalai University, Chidambaram, TamilNadu, India

I. INTRODUCTION
Data mining is required in different areas including telecommunications, manufacturing, medicine [1], finance and banking, defect record retrieval, and online business [2]. As the cost of electronic storage space has decreased, it has become worthwhile to store large amounts of information about past instances (e.g., transactions, failures, and diagnoses). For example, in the medical field, a document is prepared for a particular disease together with the treatment required to cure it. When the same case appears later, a search can be performed on the stored documents. The challenge of this area of research is that a typical text record contains information stored in rich and complex ways, often with few or no fixed-form fields. For the proposed network-based feature association (NBFA), the records of the dataset should contain at least one fixed-form field. There are several ways to describe and store the information contained in text records concisely. Still, in spite of hopes to use text records in a helpful way, the increase in stored information in text record databases has caused an often disadvantageous information flood [3]. To overcome this setback, a special branch of data mining, called text mining, has been widely studied in recent years to allow users to acquire useful knowledge from large amounts of text records [4].
Information Retrieval (IR) is an application of computer systems to fetch unstructured electronic text and perform other related activities. An IR system is used for the retrieval of relevant information by a user who has a need for local information. IR is defined as [5]: "An IR system does not inform (i.e., change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." Querying and query processing are other common terms for IR.
Text classification is the process of automated classification of documents based on keyword occurrences. When comparing document vectors, the similarity measure employed can be the sum of common keyword values in the compared document vectors (for global keyword values) or the sum of the products of common keyword values in the compared document vectors (i.e., the dot product).
This paper focuses on the development of a master defect record retrieval framework, which will be tested using real-world technical defect records from the medical field.

II. RELATED WORK
Andrew Rodriguez et al. [6] discussed that data mining needs arise in various fields such as telecommunication, manufacturing, medicine, finance and banking, defect record
ISSN: 2231-2803
http://www.internationaljournalssrg.org
Fig. 1. Architecture of Proposed System (blocks: Data Preprocessing [1. Keyword Extraction; 2. Assigning Component Identifier]; Keyword Weight; Feature Association; Keyword Vector constructed for input query; Extracted Keywords; Similarity Measure; Return Records)
A. Data Pre-Processing
In this module, keyword extraction is performed. The keyword extraction process is semi-automated and incorporates initial human input to reduce irrelevant content in the text record (input query). A list of possible keywords is generated by scanning the record's headline and problem summary in the given input record. The terms that make it through this process are retained as keywords. Articles, conjunctions, prepositions, and pronouns are often referred to as stop words or functional words in this body of research. Such terms are removed in the keyword extraction process. In parallel, each record is assigned a component identifier. This component identifier field is used in the later steps.

B. Vector Space Model
The query given by the user is in unstructured format. To represent the query in a standard format, i.e., a format that the system can understand, a vector space model (VSM) is constructed for the query. As mentioned in Section A, the keywords are extracted from the input query and the keyword vector is constructed using the VSM. Each entry in the VSM represents the number of times the word occurs in the query. The VSM is also constructed for the dataset by building a keyword vector for each record based on the keywords that occur in the input query. Here, the keyword vector represents the number of times a particular keyword occurs in the whole document. Using our semi-automated extraction system, the record representations can be updated periodically as the size and number of records increase. Another objective is to scan the dataset to record the master-duplicate relationships. Each record has a "duplicate of" field. If this field does not indicate another record, then the record being scanned is not a duplicate.
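The pre-processing and vector construction steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the tokenization rule, the function names `extract_keywords` and `keyword_vector`, and the sample texts are all hypothetical, and the human-input stage of the semi-automated process is omitted.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; the paper does not give its list).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on",
              "to", "with", "for", "it", "he", "she", "they"}

def extract_keywords(text):
    """Lowercase the text, split it into tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def keyword_vector(text, vocabulary):
    """Count how often each vocabulary keyword occurs in the text."""
    counts = Counter(extract_keywords(text))
    return [counts[k] for k in vocabulary]

# The vocabulary is taken from the query, as described in Section B.
query = "Fever and cough with fever in the evening"
vocabulary = sorted(set(extract_keywords(query)))
query_vector = keyword_vector(query, vocabulary)
record_vector = keyword_vector("Persistent cough and mild fever", vocabulary)
```

Both vectors share the query's keyword dimensions, so they can be compared entry by entry in the similarity step.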
C. n-Nearest Neighbour Search (n-NNS)
In order to find records in the database having the same or similar defect issues as the defect record (query) given by the user, a standard n-NNS using a Jaccard similarity coefficient is implemented. This search can be viewed as comparing selected elements (keywords) in the database with terms (keywords) extracted from the query to measure the similarity rank between the query and the document. The problem of finding nearest neighbors has been widely studied for data in a simple, low-dimensional vector space [1]. In this case, however, the data lies in a large metric space in which the number of neighbors grows very rapidly as the search space increases. To lessen the total cost of comparisons, NBFA (described in Section E) is implemented to reduce the search space. To find the similarity between the query given by the user and a document in the dataset, the Jaccard similarity coefficient is used, defined as follows:
$$\mathrm{Similarity}_{i,j} = \frac{\sum_{t \in T} v_t^{i}\, v_t^{j}}{\sum_{t \in T} \max\left(v_t^{i},\, v_t^{j}\right)} \qquad (1)$$
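A small Python sketch of the similarity in (1) and of the resulting top-n ranking is given below. It assumes the reading of (1) as the sum of products over the sum of element-wise maxima; the function names `similarity` and `top_n`, the record identifiers, and the vectors are hypothetical.

```python
def similarity(v_i, v_j):
    """Eq. (1): sum of element-wise products over sum of element-wise maxima."""
    if len(v_i) != len(v_j):
        raise ValueError("keyword vectors must have the same dimension")
    numerator = sum(a * b for a, b in zip(v_i, v_j))
    denominator = sum(max(a, b) for a, b in zip(v_i, v_j))
    # Two all-zero vectors share no keywords; treat their similarity as 0.
    return numerator / denominator if denominator else 0.0

def top_n(query_vec, records, n):
    """Return the ids of the n records most similar to the query."""
    ranked = sorted(records,
                    key=lambda rid: similarity(query_vec, records[rid]),
                    reverse=True)
    return ranked[:n]

# Hypothetical keyword vectors for three stored records.
records = {"R1": [1, 1, 0], "R2": [1, 0, 0], "R3": [0, 0, 1]}
query_vec = [1, 1, 0]
best = top_n(query_vec, records, 2)
```

For binary (0/1) vectors the numerator counts the shared keywords and the denominator counts the keywords present in either vector, which is the usual Jaccard coefficient.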
The keyword weight $w_t$ of a term $t$ is computed as

$$w_t = \log\left[\left(\sum_{d=1} tf_t\right) / df_t\right] \qquad (2)$$

where $tf_t$ is the term frequency of the term $t$ in document $d$ (summed over the documents $d$) and $df_t$ is the document frequency of the term $t$. After the keyword weight function is generated, it is incorporated into the n-NNS by modifying the Jaccard similarity measure. The new similarity is given by

$$\mathrm{Similarity}_{i,j} = \frac{\sum_{t \in T} w_t\, v_t^{i}\, v_t^{j}}{\sum_{t \in T} \max\left(v_t^{i},\, v_t^{j}\right)} \qquad (3)$$

where $v^i = (v_1^i, v_2^i, \ldots, v_{|T|}^i)$ holds the number of times each term occurs in the query, $v^j = (v_1^j, v_2^j, \ldots, v_{|T|}^j)$ holds the number of times the same term occurs in a document, and $|T|$ is the dimension of the keyword vector space.

E. Network Based Feature Association
In order to improve the performance of record searches, the records are associated by establishing links among them. Each record contains the component field (a fixed-format field). In this work, the component field is used to create associations among the records. It is suggested that the component field is the most useful of the fixed-form fields to the engineers familiar with the records. The underlying principle of feature association is to reduce the search space. Specifically, associations of the component field are built by linking components based on training master-duplicate pairs. Since these features are present in all records, this information can be used to limit the search space, as shown in Fig. 3. First, construct an unconnected component network, where the set of nodes is the set of components. Then, add an arc (i, j) between nodes (components) i and j if, in the training set, a record with component i is a duplicate of a record with component j. When the search for the master record in the database is performed, only the records associated with components that are reachable from the component of the new record are considered, as shown in Fig. 3. Note that each node is reachable from itself. The associated components network can be constructed in several ways owing to the nature and characteristics of network linking. Three types of network construction are described in the following sections.
1) Direct Links
The direct links method of network construction limits the search space the most. A node (component) is reachable from another node only if there exists a directed arc, built from the training set, for a master-duplicate pair that directly links the two components. For example, in the direct links column of Fig. 4, assume that only three master-duplicate pairs are used for training; the resulting NBFAs are a) if B then A; b) if C then A; and c) if D then C. After training, if there is a record with known component B, then the search space is limited to only records with known components B and A. Similarly, if there is a record with known component C, then the search space is limited to only records with known components C and A. However, if there is a record with known component A, then the search space is limited to only records with known component A.

2) Undirected Direct Links
The undirected direct links method of network construction is much like the method described earlier, except that the links that are built are bidirectional (undirected). Essentially, the information that is added is the duplicate component to master component relationship. For example, in the undirected direct links column of Fig. 5, assume that only three master-duplicate pairs are used for training; the resulting NBFAs are a) if A then B and C; b) if B then A; c) if C then A and D; and d) if D then C. After training, if there is a record with known component A, then the search space is limited to records with known components A, B, and C. Similarly, if there is a record with known component C, then the search space is limited to only records with known components C, A, and D.
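The two constructions above can be sketched in Python using the three training pairs from the example. This is an illustrative sketch rather than the authors' code: the function names `build_network` and `search_space` are hypothetical, and each pair is assumed to be written as (duplicate component, master component), which is inferred from the rules "if B then A", "if C then A", and "if D then C".

```python
from collections import defaultdict

def build_network(pairs, undirected=False):
    """Build component arcs from (duplicate_component, master_component) pairs."""
    arcs = defaultdict(set)
    for dup, master in pairs:
        arcs[dup].add(master)       # direct link: duplicate -> master
        if undirected:
            arcs[master].add(dup)   # undirected variant also links master -> duplicate
    return arcs

def search_space(component, arcs):
    """Components to search for a record: its own component plus directly linked ones."""
    return {component} | arcs.get(component, set())

# Training pairs assumed from the example: B->A, C->A, D->C.
pairs = [("B", "A"), ("C", "A"), ("D", "C")]
direct = build_network(pairs)
undirected = build_network(pairs, undirected=True)
```

With these pairs, `search_space("B", direct)` gives components B and A, while `search_space("A", undirected)` gives A, B, and C, matching the cases worked through in the text. Only one-hop links are followed here.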
F. Record Searching Configuration
Based on the keyword weighting scheme and NBFA methods described in Sections D and E, eight search configurations can be derived. The eight configurations are: 1) no component association, without keyword weights (basic NNS); 2) no component association, with keyword weights; 3) direct links, without keyword weights; 4) direct links, with keyword weights; 5) undirected direct links, without keyword weights; 6) undirected direct links, with keyword weights; 7) indirect links, without keyword weights; and 8) indirect links, with keyword weights. Each of these configurations has advantages and disadvantages. For example, the direct links method is the most limiting and reduces the search space the most. However, this method increases the probability that the true master record is excluded from the search space. On the other hand, the indirect links method provides the least constrained search space. Although it is more likely that the true master record is included in the search space, it may not drastically increase the search recall.

V. RESULT ANALYSIS
The validations are carried out by automatically querying the database with a defect (input) record and determining whether the corresponding master record, i.e., the record containing the remedy for the given problem, is returned in the top n retrieved documents from the database (as illustrated in Fig. 2). The keyword vector of each defect (input) record is used as the search input for a query. A successful query returns the master record from the entire database for the defect (input) record given by the user. Only the records whose associated components are reachable from the component of the defect (input) record are considered.
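The configuration space and the validation check can be sketched as follows. This is a hedged illustration, not the authors' implementation: all function names and sample vectors are hypothetical, the keyword weight is one reading of (2) as the log of total term frequency over document frequency, the natural logarithm is assumed because the base is not stated, and the component-based restriction of the search space is left out for brevity.

```python
import math
from itertools import product

ASSOCIATIONS = ["none", "direct", "undirected direct", "indirect"]

def configurations():
    """Enumerate the eight (association method, use keyword weights) pairs."""
    return list(product(ASSOCIATIONS, [False, True]))

def keyword_weights(doc_vectors):
    """Assumed reading of Eq. (2): log(total term frequency / document frequency)."""
    weights = []
    for column in zip(*doc_vectors):
        total_tf = sum(column)
        df = sum(1 for count in column if count > 0)
        weights.append(math.log(total_tf / df) if df else 0.0)
    return weights

def similarity(v_i, v_j, weights=None):
    """Eq. (1) when weights is None, Eq. (3) when keyword weights are supplied."""
    if weights is None:
        weights = [1.0] * len(v_i)
    num = sum(w * a * b for w, a, b in zip(weights, v_i, v_j))
    den = sum(max(a, b) for a, b in zip(v_i, v_j))
    return num / den if den else 0.0

def master_in_top_n(query_vec, records, master_id, n, weights=None):
    """Check whether the master record is among the n best-ranked records."""
    ranked = sorted(records,
                    key=lambda rid: similarity(query_vec, records[rid], weights),
                    reverse=True)
    return master_id in ranked[:n]
```

A validation run under this sketch would call `master_in_top_n` once per defect record for each of the eight configurations and count the successful queries.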
The relative recall and precision values are calculated for all eight record searching configurations. The graphs of the values obtained in our experiments are shown in Fig. 7 and Fig. 8. The analysis indicates that the configuration called Direct Link with Keyword Weight is the best method among the eight record searching configurations, because the objective of this work is to retrieve a master record that contains the relevant solution for the query given by the user.

[Figure: Relative Recall for No Component Association, Indirect Link, Undirected Direct Link, and Direct Link]
[Figure: Precision for No Component Association, Indirect Link, Undirected Direct Link, and Direct Link]

REFERENCES
[1] Uramoto.N, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, K. Takeda, (2004) A text-mining system for knowledge discovery from biomedical documents, IBM Systems Journal, vol. 43, no. 3, pp. 516-533.
[2] Liu.B, (2003) Mining data records in web pages, in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, pp. 601-606.
[3] Myllymaki.P, T. Silander, H. Tirri, and P. Uronen, (2001) Bayesian data mining on the web with b-course, in Proc. First IEEE Int. Conf. Data Mining, pp. 626-629.
[4] Patricia Cerrito, Louisville, John C. Cerrito, Kroger Pharmacy, (2006) Data and Text Mining the Electronic Medical Record to Improve Care and to Lower Costs, in Proc. SAS SUGI, pp. 120.
[5] Lancaster F.W, (1968) Information Retrieval Systems: Characteristics, Testing, and Evaluation. New York: Wiley.
[6] Andrew Rodriguez, W. Art Chaovalitwongse, Liang Zhe, Harsh Singhal, and Hoang Pham, (2010) Master Defect Record Retrieval Using Network-Based Feature Association, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 40, no. 3, pp. 319-329.
[7] Uramoto.N, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, K. Takeda, (2004) A text-mining system for knowledge discovery from biomedical documents, IBM Systems Journal, vol. 43, no. 3, pp. 516-533.
[8] Patricia Cerrito, Louisville, John C. Cerrito, Kroger Pharmacy, (2006) Data and Text Mining the Electronic Medical Record to Improve Care and to Lower Costs, in Proc. SAS SUGI, pp. 120.
[9] Fatudimu I.T, Musa A.G, Ayo C.K, Sofoluwe A. B, (2008) Knowledge Discovery in Online Repositories: A Text Mining Approach, European Journal of Scientific Research, ISSN 1450-216X, vol. 22, no. 2, pp. 241-250.
[10] Dik L. Lee, Hong Kon, Huei Chuang, Kent Seamons, (1987) Document Ranking and the Vector-Space Model, IEEE.
[11] John Atkinson, Alejandro Rivas, (2008) Discovering Novel Causal Patterns from Biomedical Natural-Language Texts Using Bayesian Nets, IEEE Transactions on Information Technology in Biomedicine, vol. 12, no. 6, pp. 714-722.
[12] Atika Mustafa, Ali Akbar, and Ahmer Sultan, (2009, April) Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization, International Journal of Multimedia and Ubiquitous Engineering, vol. 4, no. 2.
[13] Karin D. Quinones, Hua Su, Byron Marshall, Shauna Eggers, and Hsinchun Chen, (2007) User-Centered Evaluation of Arizona BioPathway: An Information Extraction, Integration, and Visualization System, IEEE Transactions on Information Technology in Biomedicine, vol. 11, no. 5, pp. 527-536.

Authors
D.Evangelin (Correspondence Author) is Assistant Professor/IT at Sri Vidya College of Engineering and Technology, TamilNadu, India. She has published papers in 2 international journals and in national and international conferences.
V.Kalaivani is Associate Professor/CSE at National Engineering College, Kovilpatti, Tamilnadu, India. Her teaching experience spans 15 years and her research experience spans 11 years. She has published many papers in international journals and in many national and international conferences.
J.Nelson Samuel Jebastin is Assistant Professor/BioInformatics at Annamalai University, Chidambaram, Tamilnadu, India. His teaching experience spans 6 years and his research experience spans 4 years. He has published papers in 3 international journals and in many national and international conferences.