

ISSN No. 2278-3083

Volume 1, No. 3, July-August 2012

International Journal of Science and Applied Information Technology


Available Online at http://warse.org/pdfs/ijsait05132012.pdf

A Pattern Taxonomy Model with New Pattern Discovery Model for Text Mining
Mrs. K. Mythili, Professor, Department of Computer Applications, Hindusthan College of Arts and Science, Coimbatore-6, Tamilnadu, India. mythiliaru1@gmail.com
Mrs. K. Yasodha, Research Scholar, Department of Computer Applications, Hindusthan College of Arts and Science, Coimbatore-6, Tamilnadu, India. Yaso194@gmail.com

ABSTRACT

Most mining techniques are proposed for the purpose of developing efficient mining algorithms that find particular patterns within a reasonable and acceptable time frame. With the large number of patterns generated by data mining approaches, how to effectively use and update these patterns is still an open research issue. In the existing system, an effective pattern discovery technique was introduced which first calculates the specificities of discovered patterns and then evaluates term weights according to the distribution of terms in the discovered patterns, rather than the distribution in documents, to solve the misinterpretation problem. It also considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence to address the low-frequency problem. The process of updating ambiguous patterns is referred to as pattern evolution. This approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. The technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents. However, it does not consider the time series when ranking a given set of documents. In the proposed system, a temporal text mining approach is introduced. The system is evaluated in terms of its ability to predict forthcoming events in the documents. Here the optimal decomposition of the time period associated with the given document set is discovered, where each subinterval consists of consecutive time points having identical information content. Extracting sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events.

Keywords: Temporal Text Mining, Pattern Deploying, Pattern Evolution.

1. INTRODUCTION

Text mining is a burgeoning new field that attempts to glean meaningful information from natural language text. It may be

loosely characterized as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with algorithmically. Nevertheless, in modern culture, text is the most common vehicle for the formal exchange of information. The field of text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial. A text summarizer strives to produce a condensed representation of its input, intended for human consumption. It may condense individual documents or groups of documents. Text compression, a related area, also condenses documents, but summarization differs in that its output is intended to be human-readable. The output of text compression algorithms is certainly not human-readable, nor is it actionable: the only operation it supports is decompression, that is, automatic reconstruction of the original text. As a field, summarization differs from many other forms of text mining in that there are people, namely professional abstractors, who are skilled in the art of producing summaries and carry out the task as part of their professional life. Studies of these people and the way they work provide valuable insights for automatic summarization. Text mining is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate knowledge (or features) in text documents to help users find what they want. In the beginning, Information Retrieval (IR) provided many term-based methods to address this challenge. There are two fundamental issues regarding the effectiveness of pattern-based approaches: low frequency and misinterpretation. Given a specified topic, a highly frequent pattern (normally a short pattern with large support) is usually a general pattern, while a specific pattern usually has low frequency. If we decrease the minimum support, a lot of noisy patterns will be discovered. Misinterpretation means that the measures used in pattern mining (e.g., support and confidence) turn out to


be unsuitable for using discovered patterns to answer what users want. The difficult problem, hence, is how to use discovered patterns to accurately evaluate the weights of useful features (knowledge) in text documents. The objective of this work is to discover patterns from large databases effectively and to help improve the effectiveness of pattern-based approaches. The pattern evaluation approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. This technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents. It also allows the user to track previous time series from the sets of documents. In the proposed system, the temporal text mining approach is used. Temporal text mining combines information extraction and data mining techniques over text repositories and incorporates time information. The sequences of events from the sets of documents are extracted in order to track past events effectively. The optimal decomposition of the time period associated with the given document set is constructed. The notion of compressed level decomposition is introduced, where each subinterval consists of consecutive time points having identical information content, and significant information measures are computed as document sets are combined.

2. REVIEW OF LITERATURE

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization [9] is the assignment of natural language texts to one or more predefined categories based on their content, and is an important component in many information organization and management tasks. Machine learning methods, including Support Vector Machines (SVMs), have tremendous potential for helping people effectively organize electronic resources. Text mining often involves the extraction of keywords with respect to some measure of importance. Weblog data is textual content with a clear and significant temporal aspect. Text categorization [1] (also known as text classification or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories,

selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need to manually organize document bases, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. This outlines the fundamental traits of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources available to researchers and developers wishing to take up these technologies for deploying real-world applications. Web mining technology [4] extracts statistical information, discovers interesting user patterns, clusters users into groups according to their navigational behavior, discovers potential correlations between web pages and user groups, identifies potential customers for e-commerce, enhances the quality and delivery of Internet information services to the end user, improves web server system performance and site design, and facilitates personalization. Identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of text evaluation, which may even be more important than opinions on each individual object. Work on comparative sentence identification [7] first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences in text documents. Experimental results using three types of documents (news articles, consumer reviews of products, and Internet forum postings) show a precision of 79% and recall of 81%. Comparison is one of the most convincing ways of text evaluation. Extracting comparative sentences from text is useful for many applications. For example, in the business environment, whenever a new product comes onto the market, the product manufacturer wants to know consumer opinions on the product and how the product compares with those of its competitors. Much of such information is now readily available on the Web in the form of customer reviews, forum discussions, blogs, etc. Extracting such information can significantly help businesses in their marketing and product benchmarking efforts. Clearly, product comparisons are useful not only for product manufacturers, but also for potential customers, as they enable customers to make better purchasing decisions. Latent Semantic Indexing (LSI) [3] is a statistical method that models the implicit higher-order structure in the association of words and objects and improves retrieval performance by up to 30%. In addition, large performance improvements of 40% and 67% can be achieved using differential term weighting and iterative retrieval methods. These methods include restricting the allowable indexing and retrieval vocabulary and training intermediaries to generate terms from these restricted vocabularies, hand-crafting domain-specific thesauri to provide synonyms for users'


search terms, constructing explicit models of domain-relevant knowledge, and automatically clustering terms and documents. The rationale for restricted or controlled vocabularies is that they are, by design, relatively unambiguous. However, they have high costs and marginal (if any) benefits compared with automatic indexing based on the full content of texts. The use of a thesaurus is intended to improve retrieval by expanding terms that are too specific. Mining frequent patterns [5] in transaction databases, time-series databases, and many other kinds of databases has been studied extensively in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly for a large number of patterns and/or long patterns. The frequent-pattern tree (FP-tree) is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, together with an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, the FP-tree, which avoids costly, repeated database scans; (2) FP-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets; and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. The performance study in [5] shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm as well as faster than some recently reported frequent-pattern mining methods. SVMs [2] can be used to learn a variety of representations, such as neural nets, splines, and polynomial estimators, and are one of the best approaches to data modeling. A knowledge discovery model is developed to effectively use and update the discovered patterns [6] and apply them to the field of text mining. Text mining is the discovery of interesting knowledge in text documents, and it is a challenging issue to find accurate knowledge (or features) in text documents to help users find what they want. The Rocchio [8] relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are better founded, but also because they achieve better performance.
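For concreteness, the following Python sketch mines frequent itemsets in the candidate-free FP-growth style described above. It assumes the third-party mlxtend library and an invented toy transaction set; neither appears in this paper.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Toy "documents" represented as sets of terms (hypothetical data).
transactions = [
    ["pattern", "mining", "text"],
    ["pattern", "text"],
    ["pattern", "mining"],
    ["text", "mining", "taxonomy"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine all itemsets whose relative support is at least 0.5,
# without generating Apriori-style candidate sets.
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent)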

3. PROPOSED WORK

3.1 Feature Selection Method

In this method, documents are taken as input and the features for the set of documents are collected. Features are selected based on the TF-IDF method. Information retrieval has developed many mature techniques which identify the terms that are important features in text documents. However, many terms with larger weights (e.g., under the term frequency and inverse document frequency (tf*idf) weighting scheme) are general terms, because they can be frequently used in both relevant and irrelevant information. The feature selection approach is used to improve the accuracy of evaluating term weights, because the discovered patterns are more specific than whole documents. In order to reduce irrelevant features, many dimensionality reduction approaches have been conducted through the use of feature selection techniques.
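The following is a minimal Python sketch of TF-IDF-based feature selection, using scikit-learn; the sample documents and the top-k cutoff are illustrative assumptions, since the paper does not fix a particular selection rule.

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the document set.
docs = [
    "pattern mining discovers frequent patterns in text",
    "temporal text mining tracks events over time",
    "term weights are evaluated from discovered patterns",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # documents x terms matrix

# Rank terms by their maximum tf*idf weight across the corpus and
# keep the top k as candidate features (k = 5 is an assumption).
k = 5
scores = tfidf.max(axis=0).toarray().ravel()
terms = vectorizer.get_feature_names_out()
top_k = sorted(zip(terms, scores), key=lambda p: -p[1])[:k]
print(top_k)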

3.2 Finding Frequent and Closed Sequential Patterns

When the feature selection process is completed, the frequent and closed patterns are discovered from the documents. For a termset X in document d, [X] is used to denote the covering set of X for d, which includes all paragraphs dp ∈ PS(d) such that X ⊆ dp, i.e.,

[X] = {dp | dp ∈ PS(d), X ⊆ dp}.

Its absolute support is the number of occurrences of X in PS(d), i.e.,

sup_a(X) = |[X]|.

Its relative support is the fraction of the paragraphs that contain the pattern, that is,

sup_r(X) = |[X]| / |PS(d)|.

Patterns can be structured into a taxonomy by using the subset relation. Smaller patterns in the taxonomy are usually more general, because they can be used frequently in both positive and negative documents; larger patterns are usually more specific, since they may be used only in positive documents. The semantic information will be used in the pattern taxonomy to improve the performance of using closed patterns in text mining. A sequential pattern X is called a frequent pattern if its relative support (or absolute support) is at least a minimum support. Properties of closed patterns can be used to define closed sequential patterns: a frequent sequential pattern is closed if no frequent superpattern of it has the same support. A sketch of computing the support counts is given below.
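This is a minimal Python sketch of the covering-set and support computations defined above; the representation of each paragraph as a set of terms and the min_sup value are illustrative assumptions.

# Compute the covering set, absolute support, and relative support
# of a termset X over the paragraph set PS(d) of a document d.

def covering_set(X, paragraphs):
    # [X] = {dp in PS(d) | X is a subset of dp}
    return [dp for dp in paragraphs if X <= dp]

def supports(X, paragraphs):
    cover = covering_set(X, paragraphs)
    sup_a = len(cover)                  # absolute support |[X]|
    sup_r = sup_a / len(paragraphs)     # relative support |[X]| / |PS(d)|
    return sup_a, sup_r

# Hypothetical document split into paragraphs of terms.
PS_d = [
    {"pattern", "mining", "text"},
    {"pattern", "taxonomy"},
    {"text", "mining"},
]
X = frozenset({"pattern"})
sup_a, sup_r = supports(X, PS_d)
min_sup = 0.5                           # assumed threshold
print(sup_a, sup_r, sup_r >= min_sup)   # X is frequent if sup_r >= min_sup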



3.3 Pattern Taxonomy Model

In the PTM method, all the documents d are split into paragraphs p, which yields PS(d). Patterns can be structured into a taxonomy by using the subset relation. From the set of paragraphs in the documents, the frequent patterns and the covering sets are discovered for each. Smaller patterns in the taxonomy are usually more general, because they can be used frequently in both positive and negative documents. Larger patterns in the taxonomy are usually more specific, since they may be used only in positive documents. The semantic information will be used in the pattern taxonomy to improve the performance of using closed patterns in text mining.

3.4 D-Pattern Discovery

A d-pattern mining algorithm is used to discover the d-patterns from the set of documents. The efficiency of pattern taxonomy mining is improved by an SP mining algorithm that finds all the closed sequential patterns, using the well-known Apriori property to reduce the search space. The algorithm describes the training process of finding the set of d-patterns. For every positive document, the SP mining algorithm is first called, giving rise to a set of closed sequential patterns. The main focus is the deploying process, which consists of d-pattern discovery and term support evaluation. All discovered patterns in a positive document are composed into a d-pattern, giving rise to a set of d-patterns. Thereafter, term supports are calculated based on the normal forms for all terms in d-patterns.
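A rough Python sketch of the deploying step follows, under the assumption that a document's discovered patterns are merged into a d-pattern by distributing each pattern's support evenly over its terms; this even split is one plausible reading of the normal form mentioned above, not a formulation given in this paper, and the data are invented.

from collections import defaultdict

def deploy(patterns):
    # patterns: list of (termset, relative_support) pairs discovered in
    # one positive document. Merge them into a d-pattern: a mapping from
    # each term to an accumulated support, where each pattern's support
    # is distributed evenly over its terms (assumed normal form).
    d_pattern = defaultdict(float)
    for terms, support in patterns:
        share = support / len(terms)
        for t in terms:
            d_pattern[t] += share
    return dict(d_pattern)

# Hypothetical closed sequential patterns from one positive document.
patterns = [({"pattern", "mining"}, 0.75), ({"pattern"}, 1.0)]
print(deploy(patterns))   # e.g. {'pattern': 1.375, 'mining': 0.375}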

3.5 IP Evaluation and Shuffling

In inner pattern evaluation, the similarity between a test document and the concept is estimated using the inner product. The relevance of a document d to a topic can be calculated by the function R(d) = d · V, where V is the concept vector of deployed term weights. Expanding the inner product, the weight assigned to each incoming document d by the corresponding weight function W is

W(d) = Σ_{t ∈ T} weight(t) · τ(t, d),

where T is the set of terms in the deployed patterns and τ(t, d) = 1 if t ∈ d and 0 otherwise. A computer is capable of generating a "perfect shuffle", a random permutation of the set of documents. For a given noise negative document nd, the time complexity is O(nm²). Sketches of IP evaluation and of shuffling are given below.
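The following is a minimal Python sketch of the inner-product relevance function R(d) = d · V and the document shuffle described above; the term weights and documents are invented for illustration.

import random

def relevance(doc_terms, weights):
    # R(d) = d . V: sum the deployed weights of the terms that
    # actually occur in the document (i.e., where tau(t, d) = 1).
    return sum(w for t, w in weights.items() if t in doc_terms)

# Hypothetical deployed term weights (the concept vector V).
V = {"pattern": 1.375, "mining": 0.375, "noise": 0.2}

docs = [
    {"pattern", "mining", "text"},
    {"noise", "text"},
]
print([relevance(d, V) for d in docs])   # e.g. [1.75, 0.2]

# A "perfect shuffle": a uniformly random permutation of the documents.
random.shuffle(docs)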



3.6 Temporal Sequential Pattern Mining

In the pattern discovery model, a dynamic programming algorithm is used for finding the optimal information preserving decomposition and the optimal lossy decomposition. There is a close relationship between the decomposition of the time period associated with the document set and the significant information computed for temporal analysis, yet the problem of identifying a suitable time decomposition for a given document set does not seem to have received adequate attention. Time points, intervals, and decompositions are therefore defined. A time point is given in a base granularity such as seconds, minutes, or days. The time interval between time points t1 and t2 is defined as {t | t1 ≤ t ≤ t2}. A decomposition of a time interval T is a sequence of consecutive subintervals T1, T2, T3, ..., Tn whose concatenation is T. The information content is mapped from a keyword wi and a document dataset D as fm(wi, D) = v, where v ∈ R+.
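The following is a Python sketch of a dynamic programming search for an optimal decomposition of the time period; the within-interval cost function, the bound k on the number of subintervals, and the sample series are illustrative assumptions, not the paper's exact formulation.

def optimal_decomposition(values, cost, k):
    # values[i]: information content fm at time point i. Split the time
    # points into at most k consecutive subintervals, minimizing the
    # total within-interval cost; returns (best_cost, interval_starts).
    n = len(values)
    INF = float("inf")
    # best[j][i]: min cost of covering points 0..i-1 with j intervals.
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for s in range(i):   # last interval covers points s..i-1
                if best[j - 1][s] == INF:
                    continue
                c = best[j - 1][s] + cost(values[s:i])
                if c < best[j][i]:
                    best[j][i] = c
                    back[j][i] = s
    # Pick the interval count with the lowest total cost.
    j_best = min(range(1, k + 1), key=lambda j: best[j][n])
    total = best[j_best][n]
    # Recover the start index of each subinterval.
    cuts, i, j = [], n, j_best
    while j > 0:
        s = back[j][i]
        cuts.append(s)
        i, j = s, j - 1
    return total, sorted(cuts)

# Within-interval cost: total deviation from the interval mean, so
# subintervals with near-identical information content cost almost nothing.
def spread(vs):
    m = sum(vs) / len(vs)
    return sum(abs(v - m) for v in vs)

series = [1.0, 1.1, 1.0, 5.2, 5.0, 5.1, 2.0]   # made-up fm values per time point
print(optimal_decomposition(series, spread, k=3))   # e.g. starts [0, 3, 6]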

4. PERFORMANCE EVALUATION

In this study, a document collection is used to evaluate the proposed approach, and several common measures are applied for performance evaluation. The evaluation compares the proposed and existing systems on precision, recall, and the F-measure, which combines precision and recall. The experimental results show that the proposed method performs better than the existing system on these measures (Figure 1), and that the proposed system is more reliable and scalable for complex applications.

Figure 1: Comparison of existing and proposed work

5. CONCLUSION

A PTM model with a new pattern discovery model for text mining mainly focuses on the implementation of temporal text patterns. Here a dynamic programming algorithm for the optimal information preserving decomposition and the optimal lossy decomposition is introduced. This is used for analyzing the relationship between the decomposition of the time period associated with the document set and the significant information computed for temporal analysis. It quickly finds patterns for various ranges of parameters. It focuses on using information extraction to extract a structured database from a corpus of natural language text and then discovering patterns in the resulting database using traditional KDD tools. It also concerns record linkage, a form of data cleaning that identifies equivalent but textually distinct items in the extracted data prior to mining, and is related to natural language learning. Further work will focus on text mining for bioinformatics, and on applying the discovered patterns to various time series analysis domains such as prediction, serving as pattern templates for numeric-to-symbolic (N/S) conversion, and summarization of the time series.

REFERENCES

1. M.F. Caropreso, S. Matwin, and F. Sebastiani. Statistical Phrases in Automated Text Categorization, Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell'Informazione, 2000.
2. C. Cortes and V. Vapnik. Support-Vector Networks, Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
3. S.T. Dumais. Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.
4. J. Han and K.C.-C. Chang. Data Mining for Web Intelligence, Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
5. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.
6. Y. Huang and S. Lin. Mining Sequential Patterns Using Graph Search Techniques, Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
7. N. Jindal and B. Liu. Identifying Comparative Sentences in Text Documents, Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '06), pp. 244-251, 2006.
8. T. Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 143-151, 1997.

9. T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, Proc. European Conf. Machine Learning (ECML '98), pp. 137-142, 1998.

