Dr. C. Nalini
Professor, Department of Computer Science and Engineering,
Bharath University, Chennai, India
Dr. A. Kumaravel
Professor, School of Computing, Bharath University, Chennai, India
ABSTRACT
Detecting research progress in the academic research domain is a challenging problem for research communities and funding agencies. Data retrieved from social networks can support results in this direction, and in this paper we address the issue with the help of text mining tasks. Classification, one of the major data mining methodologies, can be applied effectively for this purpose in combination with the information gain ratio. The objective of this paper is to evaluate learning algorithms for classifying examples drawn from a selected balanced dataset of research articles in technical conferences. The main intention in this context is to obtain high accuracy on the available balanced dataset. For this purpose, classifiers of various types, such as Decision Trees, Rules, Naïve Bayes, and Meta learning models, are built using the open source mining tool Weka. It is necessary to reduce the error before constructing the final models, so the parameters and the number of training iterations are varied accordingly.
Key words: Data Mining, Information Gain, Classification, Naïve Bayes, Meta Classifier,
Attribute Selection, Search Methods.
Cite this Article: G. Ayyappan, Dr. C. Nalini and Dr. A. Kumaravel, Efficient Mining for
Social Networks Using Information Gain Ratio Based on Academic Dataset. International
Journal of Civil Engineering and Technology, 8(1), 2017, pp. 936–942.
http://www.iaeme.com/IJCIET/issues.asp?JType=IJCIET&VType=8&IType=1
1. INTRODUCTION
In this paper we address the problem of predicting research progress from academic social network data using ranker-based attribute selection. In decision tree learning, the information gain ratio is the ratio of the information gain to the intrinsic information [14]. It is used to reduce the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. Information gain is also known as mutual information [14]. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect: a notable problem occurs when it is applied to attributes that take on a large number of distinct values [14].
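The two quantities above can be made concrete in a few lines. The sketch below is an illustrative implementation, not the Weka code used in this work: it computes the Shannon entropy of a label set, then the gain ratio of a nominal attribute as information gain divided by the split (intrinsic) information, in the C4.5 style. The toy attribute and class values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(values, labels):
    """Information gain of a nominal attribute divided by its
    intrinsic (split) information, as in C4.5-style decision trees."""
    total = len(labels)
    # Partition the class labels by attribute value.
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    remainder = sum(len(p) / total * entropy(p) for p in parts.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # intrinsic information of the split
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: a binary attribute that perfectly separates two classes.
attr = ["a", "a", "b", "b"]
cls  = ["small", "small", "large", "large"]
print(gain_ratio(attr, cls))  # 1.0
```

Dividing by the split information is exactly what penalizes attributes with many distinct values: a many-valued attribute inflates the denominator even when its information gain is high.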
Selection techniques can be divided into two categories: filter methods [3] and wrapper methods [8]. Various feature ranking and feature selection techniques have been proposed in the machine learning literature, such as Principal Component Analysis [6], Correlation-based Feature Selection [6], Information Gain attribute evaluation [6], Gain Ratio attribute evaluation [6], Support Vector Machine feature elimination [5] and Chi-Square Feature Evaluation [6]. Some of these methods perform only feature ranking rather than feature selection, so they are usually combined with another method when the appropriate number of features must be determined. Bi-directional search, forward selection, backward selection, best-first search [12], genetic search [4], and other methods are often used in this kind of research work. The criteria for feature selection are information theoretic, such as the Shannon entropy measure of a dataset. The main drawback of the entropy measure is its sensitivity to the number of attribute values [11]; the measure also suffers from the drawback that it may choose attributes with very low information content [9]. A comprehensive discussion of Bayes theorem for feature selection is available in [1,13].
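To illustrate the filter-method idea, the sketch below scores each attribute independently of any classifier and then sorts by score, using a chi-square statistic between attribute and class as one of the criteria named above. The attributes and labels are hypothetical toy data, not the paper's dataset.

```python
from collections import Counter

def chi_square(values, labels):
    """Chi-square statistic between a nominal attribute and the class;
    higher means a stronger observed association with the class."""
    n = len(labels)
    v_counts = Counter(values)
    c_counts = Counter(labels)
    joint = Counter(zip(values, labels))
    stat = 0.0
    for v in v_counts:
        for c in c_counts:
            expected = v_counts[v] * c_counts[c] / n
            observed = joint[(v, c)]
            stat += (observed - expected) ** 2 / expected
    return stat

# Filter-style ranking: score every attribute without any classifier,
# then sort (hypothetical toy attributes over a 3-class label set).
labels = ["s", "s", "m", "m", "l", "l"]
attrs = {
    "informative":   ["x", "x", "y", "y", "z", "z"],  # tracks the class
    "uninformative": ["p", "q", "p", "q", "p", "q"],  # independent of it
}
ranking = sorted(attrs, key=lambda a: chi_square(attrs[a], labels),
                 reverse=True)
print(ranking)  # ['informative', 'uninformative']
```

A ranking like this still leaves open how many of the top attributes to keep, which is why ranking filters are paired with a search or cut-off strategy such as those listed above.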
The remainder of this paper is organized as follows. Section 2 describes the dataset and its construction. Section 3 presents the terms and methods. Section 4 gives the proposed algorithm and the proposed approach for finding the information gain ratio. Section 5 reports the results and discusses the various classifier approaches. We present the conclusion in Section 6.
2. DATASET DESCRIPTION
We collected this dataset from Arnet Miner (http://www.arnetminer.org). From this massive real-world dataset we randomly took a balanced training set of 6000 academic records for the text mining process. The 6000 records are divided into three folders, named large, medium, and small; each folder contains 2000 individual text file records.
To make the underlying dataset suitable for text mining, as a preprocessing step we load the files with the TextDirectoryLoader on the command line interface, which associates each file with the class given by its directory. We then tokenize each text file into a set of attributes by applying the StringToWordVector filter.
The resulting balanced academic dataset, spanning the small, medium, and large folders, contains 6000 instances and 1733 attributes, with 2000 instances per folder. We selected this dataset because it is clean and simple, and we preprocessed it into the appropriate format.
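The two preprocessing steps can be sketched as follows. This is a minimal stand-in for Weka's TextDirectoryLoader and StringToWordVector, not their actual implementation: each file becomes one instance, the folder name becomes the class, and each distinct word becomes a binary presence attribute. The directory layout and the demo documents are hypothetical.

```python
import os
import re

def load_text_directory(root):
    """Yield (class_label, text) pairs, one per file, with the name of
    the containing folder (small/medium/large) as the class label."""
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                yield label, f.read()

def string_to_word_vector(docs):
    """Turn (label, text) pairs into binary word-presence vectors."""
    tokenized = [(label, set(re.findall(r"[a-z]+", text.lower())))
                 for label, text in docs]
    vocab = sorted(set().union(*(words for _, words in tokenized)))
    instances = [([int(w in words) for w in vocab], label)
                 for label, words in tokenized]
    return vocab, instances

# Two hypothetical documents standing in for the 6000 real files.
docs = [("small", "data mining of small graphs"),
        ("large", "large scale graph mining")]
vocab, instances = string_to_word_vector(docs)
print(vocab)  # ['data', 'graph', 'graphs', 'large', 'mining', 'of', 'scale', 'small']
```

On the real corpus this vocabulary step is what produces the 1733 word attributes over the 6000 instances.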
4. PROPOSED ALGORITHM
The relational database of string tokens is processed with the following filtering algorithm based on information gain. At each step, accuracies are obtained for the current information gain level, the maximum accuracy is preferred, and the corresponding classifier is identified. The pseudocode for the main algorithm is given in Figure 2.
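Our reading of that loop can be sketched as runnable code. Everything here is a stand-in: the scores would come from the gain-ratio Ranker, the classifier would be one of the Weka models evaluated in Section 5 (a trivial 1-nearest-neighbour is used just to make the sketch self-contained), and the data is a hypothetical four-attribute toy set.

```python
def one_nn_accuracy(train, test):
    """Accuracy of 1-NN with Hamming distance on binary vectors."""
    correct = 0
    for x, y in test:
        nearest = min(train,
                      key=lambda t: sum(a != b for a, b in zip(t[0], x)))
        correct += nearest[1] == y
    return correct / len(test)

def select_best(instances, scores):
    """scores[i] ranks attribute i (e.g. by gain ratio); evaluate each
    top-k prefix of the ranking and keep the smallest k that attains
    the best accuracy, mirroring the filtering loop of Figure 2."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=True)
    best_acc, best_k = 0.0, 0
    for k in range(1, len(order) + 1):
        keep = order[:k]
        reduced = [([x[i] for i in keep], y) for x, y in instances]
        acc = one_nn_accuracy(reduced, reduced)
        if acc > best_acc:
            best_acc, best_k = acc, k
    return best_acc, best_k

# Hypothetical binary instances; attribute 0 decides the class.
data = [([1, 0, 1, 0], "a"), ([1, 1, 0, 0], "a"),
        ([0, 0, 1, 1], "b"), ([0, 1, 0, 1], "b")]
scores = [0.9, 0.1, 0.2, 0.5]   # pretend gain-ratio scores
print(select_best(data, scores))  # (1.0, 1)
```

In the actual experiments the inner evaluation is repeated for every classifier family (misc, rules, trees, Meta), and the classifier reaching the maximum accuracy at its best level is the one reported.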
Table 1 (B) Accuracies of the misc, rules, and trees classifiers based on Ranker
Figure 3 (B) Behaviour of the misc, rules, and trees classifiers based on Ranker
The Hyper Pipes and Decision Stump classifiers show middling performance, between 64% and 75%. JRip and J48 occupy the dominant role, between 84% and 87%. ZeroR fails to show any applicable tendency and remains constant at 33.33%.
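The 33.33% figure is exactly the majority-class baseline: ZeroR always predicts the single most frequent class, so on a balanced three-class dataset of 2000 instances per class it can only ever get one instance in three right. A minimal check, using the class proportions from Section 2:

```python
from collections import Counter

def zero_r_accuracy(labels):
    """Accuracy of ZeroR: always predict the most frequent class."""
    majority, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# Balanced 3-class dataset, 2000 instances per class as in Section 2.
labels = ["small"] * 2000 + ["medium"] * 2000 + ["large"] * 2000
print(round(zero_r_accuracy(labels) * 100, 2))  # 33.33
```

Any classifier that does not beat this baseline has learned nothing from the attributes.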
6. CONCLUSION
This work establishes the significance of information gain for selecting attributes in the context of social networks of academic data. Hence, as an application of this work with extended scope, the priority of research titles and author levels can be traced.
REFERENCES
[1] Balamurugan, S.A., Rajaram, R.: Effective and Efficient Feature Selection for Large Scale Data
using Bayes Theorem. Journal of Automation and Computing 6(1), 62–71 (2009)
[2] Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases (2008),
http://www.ics.uci.edu/~mlearn/mlrepository.html
[3] Cover, T.M.: On the possible ordering on the measurement selection problem. IEEE Transactions
on SMC 7(9), 657–661 (1977)
[4] Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley,
Reading (1989)
[5] Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using
support vector machines. Machine Learning 46, 389–422 (2002)
[6] Hall, M.A., Smith, L.A.: Practical feature subset selection for machine learning. In: Proceedings of
the 21st Australian Computer Science Conference, pp. 181–191 (1998)
[7] Han, J., Kamber, M.: Data mining Concepts and Techniques. Morgan Kaufmann, San Francisco
(2006)
[8] Kohavi, R., John, G.H.: The Wrapper approach. In: Liu, H., Motoda, H. (eds.) Feature Extraction,
Construction and Selection, pp. 30–47. Kluwer Academic Publishers, Dordrecht (1998)
[9] Lopez de Mantaras, R.: A distance-based attribute selection measure for decision tree induction.
Machine Learning 6, 81–92 (1991)
[10] WEKA, Open Source Collection of Machine Learning Algorithms
[11] White, A.P., Liu, W.Z.: Bias in the information-based measure in decision tree induction. Machine
Learning 15, 321–329 (1994)
[12] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn.
Morgan Kaufmann, San Francisco (2005)