
International Journal of Civil Engineering and Technology (IJCIET)

Volume 8, Issue 1, January 2017, pp. 936-942, Article ID: IJCIET_08_01_110


Available online at http://www.iaeme.com/IJCIET/issues.asp?JType=IJCIET&VType=8&IType=1
ISSN Print: 0976-6308 and ISSN Online: 0976-6316

IAEME Publication Scopus Indexed

EFFICIENT MINING FOR SOCIAL NETWORKS USING INFORMATION GAIN RATIO BASED ON ACADEMIC DATASET
G. Ayyappan
Research Scholar, Department of Computer Science and Engineering,
Bharath University, Chennai, India

Dr. C. Nalini
Professor, Department of Computer Science and Engineering,
Bharath University, Chennai, India

Dr. A. Kumaravel
Professor, School of Computing, Bharath University, Chennai, India

ABSTRACT
Detecting research progress in the academic research domain is challenging for research communities and funding agencies, and data retrieved from social networks can support results in this direction. In this paper we address this issue with the help of text mining tasks. Classification, one of the major data mining methodologies, combined with the information gain ratio, can be applied effectively for this purpose. The objective of this paper is to evaluate learning algorithms for classifying examples from a selected balanced dataset of research articles in technical conferences. The main intention in this context is to obtain high accuracy on the available balanced dataset. For this purpose, various types of classifiers, such as Decision Trees, Rules, Naïve Bayes, and Meta learning models, are built using the open source mining tool Weka. It is necessary to reduce the error before constructing the final models, and thus the parameters and the number of training iterations are varied.
Key words: Data Mining, Information Gain, Classification, Naïve Bayes, Meta Classifier, Attribute Selection, Search Methods.
Cite this Article: G. Ayyappan, Dr. C. Nalini and Dr. A. Kumaravel, Efficient Mining for Social Networks Using Information Gain Ratio Based on Academic Dataset. International Journal of Civil Engineering and Technology, 8(1), 2017, pp. 936-942.
http://www.iaeme.com/IJCIET/issues.asp?JType=IJCIET&VType=8&IType=1

1. INTRODUCTION
In this paper we address the problem of predicting research progress from academic social network data, based on ranker attribute selection. In decision tree learning, the information gain ratio is the ratio of the information gain to the intrinsic information [14]. It is used to reduce the bias towards multi-valued attributes by taking the

http://www.iaeme.com/IJCIET/index.asp 936 editor@iaeme.com


number and size of branches into account when choosing an attribute. Information gain is also known as mutual information [14]. Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect: a notable problem occurs when it is applied to attributes that take on a large number of distinct values [14].
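The bias correction described above can be sketched in plain Python; the toy labels below are illustrative, not drawn from the paper's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of an attribute: H(class) - H(class | attribute)."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Gain ratio: information gain divided by the intrinsic information
    (the entropy of the split itself), which penalizes many-valued splits."""
    intrinsic = entropy(values)
    return info_gain(values, labels) / intrinsic if intrinsic > 0 else 0.0

# A unique-per-instance attribute gets maximal gain but a reduced gain ratio.
labels = ["small", "small", "medium", "medium", "large", "large"]
ids    = [1, 2, 3, 4, 5, 6]              # one distinct value per instance
binary = ["a", "a", "a", "b", "b", "b"]  # a plain two-valued attribute
print(round(gain_ratio(ids, labels), 3),     # 0.613
      round(gain_ratio(binary, labels), 3))  # 0.667
```

Under these toy labels the unique ID attribute has the highest raw information gain (it splits the data perfectly) yet a lower gain ratio than the ordinary binary attribute, which is exactly the multi-valued-attribute bias correction discussed above.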
Selection techniques can be divided into two categories: filter methods [3] and wrapper methods [8]. Various feature ranking and feature selection techniques have been proposed in the machine learning literature, such as Principal Component Analysis [6], Correlation-based Feature Selection [6], Information Gain attribute evaluation [6], Gain Ratio attribute evaluation [6], Support Vector Machine feature elimination [5] and Chi-Square feature evaluation [6]. Some of these methods do not perform feature selection but only feature ranking; they are usually combined with another method when one needs to determine the appropriate number of features. Bi-directional search, forward selection, backward selection, best-first search [12], genetic search [4], and other methods are often used in this line of research. The criteria for feature selection are information theoretic, such as the Shannon entropy measure for a dataset. The main drawback of the entropy measure is its sensitivity to the number of attribute values [11]; it may also choose attributes with very low information content [9]. A comprehensive discussion of Bayes theorem for feature selection is available in [1, 13].
The rest of the paper is organized as follows. Section 2 describes the dataset and its preparation. Section 3 presents the terms and methods. Section 4 presents the proposed algorithm for finding the information gain ratio. Section 5 reports the results and discusses the various classifier approaches. Section 6 concludes the paper.

2. DATASET DESCRIPTION
We collected this dataset from Arnet Miner (http://www.arnetminer.org). From this massive real-time dataset we randomly took a balanced training set of 6000 academic records for the text mining process. The 6000 records were divided into three folders, named large, medium, and small, each containing 2000 individual text file records.
To make the underlying dataset suitable for text mining, we first applied the Text Directory Loader on the command line interface as a preprocessing step; it reads the files in each directory and associates the folder name as the class. Then we tokenized each text file into a set of attributes by applying the String to Word Vector filter. The resulting balanced academic dataset, comprising the small, medium, and large folders, has 6000 instances and 1733 attributes; each folder contributes 2000 instances. We selected this dataset because it is clean and simple, and preprocessed it into the appropriate format.
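The two Weka preprocessing steps above can be imitated in plain Python on a throwaway toy corpus; the three one-file folders below stand in for the real 6000-file dataset, and all folder names, file names, and texts are hypothetical.

```python
import pathlib
import tempfile

# Build a toy directory layout: one folder per class, one text file per record.
root = pathlib.Path(tempfile.mkdtemp())
for cls, text in [("small", "short note on mining"),
                  ("medium", "conference paper on data mining"),
                  ("large", "journal article on social network mining")]:
    (root / cls).mkdir()
    (root / cls / "doc.txt").write_text(text)

# Text Directory Loader step: one (tokens, class) record per file,
# with the parent folder name used as the class label.
records = [(path.read_text().lower().split(), path.parent.name)
           for path in sorted(root.rglob("*.txt"))]

# String to Word Vector step: every distinct token becomes a binary attribute.
vocabulary = sorted({tok for tokens, _ in records for tok in tokens})
vectors = [[int(tok in tokens) for tok in vocabulary] for tokens, _ in records]
print(len(records), len(vocabulary))   # instances x attributes
```

On the real dataset, this same instances-by-attributes matrix is what ends up as the 6000 x 1733 relation that the classifiers are trained on.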

Figure 1 Research design for academic data mining




3. TERMS, DEFINITIONS AND METHODS


The following terms and methods are applied in our experiments on the social network induced dataset.
Information gain: Evaluates the worth of an attribute by measuring the information gain with respect to the class.
Classifiers: The Weka tool provides several classifier families: Bayes, Functions, Trees, Lazy, Rules, Meta, and Multi-Instance. In this paper we use the Bayes, Meta, Misc, Rules, and Trees classifiers.
Accuracy: The correctly and incorrectly classified instances show the percentage of test instances that were correctly and incorrectly classified. Error rates are used for numeric prediction rather than classification: in numeric prediction, predictions are not simply right or wrong, the error has a magnitude, and these measures reflect that.
TP Rate: rate of true positives (instances correctly classified as a given class).
FP Rate: rate of false positives (instances falsely classified as a given class).
Precision: proportion of instances classified as a given class that truly belong to that class.
Recall: proportion of instances of a given class that are correctly classified as that class (equivalent to TP Rate).
F-Measure: a combined measure of precision and recall, calculated as 2 × Precision × Recall / (Precision + Recall).
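These per-class measures follow directly from the four confusion counts; the counts in the example are illustrative, not taken from the paper's tables.

```python
def class_metrics(tp, fp, fn, tn):
    """Per-class evaluation measures from true/false positive/negative counts."""
    tp_rate = tp / (tp + fn)          # recall: hits among actual positives
    fp_rate = fp / (fp + tn)          # false alarms among actual negatives
    precision = tp / (tp + fp)        # hits among predicted positives
    recall = tp_rate
    f_measure = 2 * precision * recall / (precision + recall)
    return tp_rate, fp_rate, precision, recall, f_measure

# Illustrative counts: 80 hits, 20 misses, 20 false alarms, 80 correct rejections.
tpr, fpr, prec, rec, f1 = class_metrics(tp=80, fp=20, fn=20, tn=80)
print(tpr, fpr, round(f1, 2))         # 0.8 0.2 0.8
```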

4. PROPOSED ALGORITHM
The relational database of string tokens is processed by the following filtering algorithm based on information gain. At each step, accuracies are obtained for a given information gain cut-off level; the maximum accuracy is preferred and the corresponding classifier is identified. The pseudocode for the main algorithm is given in figure 2.
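A minimal sketch of this filtering loop is shown below; `select_by_gain` and `train_and_score` are hypothetical stand-ins for Weka's attribute selection and classifier evaluation runs, and the cut-off values are placeholders.

```python
def best_cutoff(dataset, classifier, cutoffs, select_by_gain, train_and_score):
    """Try each information-gain cut-off, keep attributes above it, score the
    classifier on the reduced dataset, and return the best (accuracy, cut-off)."""
    best = (0.0, None)
    for c in cutoffs:
        attributes = select_by_gain(dataset, c)              # keep gain >= c
        accuracy = train_and_score(classifier, dataset, attributes)
        best = max(best, (accuracy, c))
    return best
```

The same loop is then repeated per classifier, so that both the winning cut-off level and the winning classifier can be identified.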

Figure 2 Proposed algorithm with 2 iterations

5. RESULTS AND DISCUSSIONS


The following tables and figures are extracted from the experiments over the social academic data. The key observations are summarized at the end of this section.




Table 1 (A) Bayes & Meta classifiers accuracies based on Ranker

Figure 3 (A) Behaviors of Bayes & Meta classifiers based on Ranker


The classifiers Bayes Net, Complement Naïve Bayes and Naïve Bayes exhibit a similar tendency, as seen in figure 3, within the range 0.03 to 0.1 for the cut-off values. While the classifier ACS.J48 shows the bottom-most behavior, the classifier Dagging.SMO dominates the rest of the classifiers.

Table 1 (B) Misc, Rules, and Trees classifiers accuracies based on Ranker




Figure 3 (B) Behaviors of Misc, Rules, and Trees classifiers based on Ranker
The Hyper Pipes and Decision Stump classifiers show middling performance, between 64% and 75%. JRip and J48 occupy the dominant role, between 84% and 87%. ZeroR fails to show any applicable tendency and remains constant at 33.33%.
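The constant ZeroR figure is easy to verify: ZeroR always predicts the majority class, and with three equally sized folders of 2000 instances each, that yields 2000/6000.

```python
from collections import Counter

# Class distribution of the balanced dataset described in section 2.
counts = Counter({"small": 2000, "medium": 2000, "large": 2000})

# ZeroR predicts the single most frequent class for every instance.
zero_r_accuracy = counts.most_common(1)[0][1] / sum(counts.values())
print(round(100 * zero_r_accuracy, 2))   # 33.33
```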

Table 2 Comparisons of Ranker and without Ranker classifiers

Figure 4 Behaviors of Ranker and without Ranker classifiers


The effect of the ranker method for attribute selection is depicted in figure 4. Most classifiers with ranker-based selection dominate, or are at least comparable with, the corresponding classifications without the ranker.




Figure 5 Curve generated for J48 (large, medium, and small)


The area under the curve is at least 0.88 in every case (large = 0.88, medium = 0.88, small = 0.98). We observed that this performance metric exceeds the accuracy measure.
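The area under the ROC curve can be computed directly from class scores as the probability that a randomly chosen positive instance outscores a randomly chosen negative one (ties count one half); the scores below are illustrative, not the J48 outputs behind figure 5.

```python
def roc_auc(scores, labels):
    """AUC via the rank-sum identity: the fraction of positive/negative pairs
    in which the positive instance receives the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three positives, one negative: 2 of the 3 pairs are ranked correctly.
print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1]))
```

An AUC of 1.0 means every positive outranks every negative, which is why the figure's values (0.88 to 0.98) can exceed the raw accuracies reported in the tables.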

6. CONCLUSION
This work establishes the significance of information gain for selecting attributes in the context of social networks of academic data. As an application of this work with extended scope, the priority of research titles and author levels can hence be traced out.

REFERENCES
[1] Balamurugan, S.A., Rajaram, R.: Effective and Efficient Feature Selection for Large Scale Data using Bayes Theorem. Journal of Automation and Computing 6(1), 62-71 (2009)
[2] Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases (2008),
http://www.ics.uci.edu/~mlearn/mlrepository.html
[3] Cover, T.M.: On the possible ordering on the measurement selection problem. IEEE Transactions on SMC 7(9), 657-661 (1977)
[4] Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley,
Reading (1989)
[5] Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46, 389-422 (2002)
[6] Hall, M.A., Smith, L.A.: Practical feature subset selection for machine learning. In: Proceedings of the 21st Australian Computer Science Conference, pp. 181-191 (1998)
[7] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2006)
[8] Kohavi, R., John, G.H.: The Wrapper approach. In: Liu, H., Motoda, H. (eds.) Feature Extraction, Construction and Selection, pp. 30-47. Kluwer Academic Publishers, Dordrecht (1998)
[9] Lopez de Mantaras, R.: A distance-based attribute selection measure for decision tree induction. Machine Learning 6, 81-92 (1991)
[10] WEKA, Open Source Collection of Machine Learning Algorithms
[11] White, A.P., Liu, W.Z.: Bias in information-based measures in decision tree induction. Machine Learning 15, 321-329 (1994)
[12] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)




[13] Subramanian Appavu, Ramasamy Rajaram, M. Nagammai, N. Priyanga, S. Priyanka: Bayes Theorem and Information Gain Based Feature Selection for Maximizing the Performance of Classifiers. In: CCSIT 2011, Part I, CCIS 131, pp. 501-511. Springer-Verlag, Berlin Heidelberg (2011)
[14] https://en.wikipedia.org/wiki/Information_gain_in_decision_trees (downloaded on 5.10.2016)
[15] https://weka.wikispaces.com/ROC+curves (downloaded on 5.10.2016)
[16] https://weka.wikispaces.com/Area+under+the+curve (downloaded on 5.10.2016)
[17] Malpani Radhika S and Dr. Sulochana Sonkamble, A Data Mining Approach to Avoid Potential Biases. International Journal of Computer Engineering and Technology, 6(7), 2015, pp. 27-34.
[18] Lakshmi, R. and Antony Selvadoss Thanamani, Data Mining Based Dynamic Replication Algorithm for Improving Data Availability in Data Grids. International Journal of Computer Engineering and Technology (IJCET), 7(5), 2016, pp. 09-16.
