e-ISSN: 2455-5703
I. INTRODUCTION
A. Big Data and Characteristics
Organizations and institutes collect and store data every minute, every hour and every day, so data is available in large
quantities. What matters, however, is not the amount of data but what organizations do with it to identify information that
is useful to them. Analyzing the data can reveal insights or critical information that help the organization make decisions
that support its growth. The term big data describes a large volume of data that is available in both structured and
unstructured formats. Although big data is a new term, the process of collecting data, storing it in large amounts and
analyzing it to gather new information was practiced long before the term came into use. The characteristics of big data
can be described by three Vs: (1) Volume, (2) Velocity and (3) Variety.
The applications of big data include areas such as health care, telecom, finance, etc. In this paper the process of association
rule generation in big data is discussed and an association rule mining technique is proposed to generate the rules from the KDD
CUP 99 dataset.
B. Data Mining in Big Data
Big data mining deals with large amounts of data stored in data warehouses and databases. It is used to extract or identify
interesting patterns and information from this data. Many data mining techniques can be applied to big data, including
classification, clustering, association rules, prediction, estimation, documentation and description. These techniques have
long been the subject of extensive research, and many algorithms have been developed for each of them, for big data as well.
One well-known technique applied to big data is association rule mining. It is an efficient data mining technique for
discovering hidden patterns and information in large databases. The relationships between the attributes of the data are
identified using an association rule mining algorithm. Some basic types of association rule mining algorithms are the
Apriori algorithm, distributed algorithms and parallel algorithms.
C. Association Rule Mining
Association Rule Mining (ARM) [1] is a popular data mining approach for analysing a dataset to discover interesting
patterns or relationships between its items. The concept of strong association rules was first used by Agrawal et al. [2]
to identify association rules between the items sold in a large-scale transaction database collected from a supermarket
using a point-of-sale system. The relationship between the items is identified based on the purchase pattern. The ARM
technique generates a set of association rules that hold between the items of the given dataset, based on the number of
occurrences of each item combination in the dataset.
An association rule defines a relationship between items in the given dataset. Consider three items A, B and C. The rule
{A, B} → C says that if a person buys the two items A and B together, then he/she will most likely also buy item C. That
is, the relations between the items are generated by identifying patterns within the dataset. The Association Rule Mining
(ARM) technique [3] consists of two stages as follows:
1) Identification of the itemsets that occur frequently in the dataset: The frequent itemsets are those that have a support
value (sup(itemset)) equal to or greater than the pre-defined minimum support value (min_sup). The support value of an
itemset is calculated as the number of transactions that contain every item of that itemset. In the above example, the
support of {A, B} is the number of transactions that contain both A and B.
2) Association rule generation using frequent itemsets: In this stage the interesting rules are generated by calculating the
confidence factor (conf) for all the frequent itemsets that were generated in the previous stage. The confidence value for
the above example rule {A, B} → C is sup({A, B, C})/sup({A, B}).
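The two measures can be illustrated with a short Python sketch. It uses the standard definitions (support as a transaction count, confidence of X → Y as sup(X ∪ Y)/sup(X)); the function names and the toy transactions are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """conf(X -> Y) = sup(X union Y) / sup(X); 0.0 if X never occurs."""
    x = frozenset(antecedent)
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0
    return support(x | frozenset(consequent), transactions) / sup_x


# Hypothetical toy transactions, for illustration only.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"A", "B", "C"}),
    frozenset({"B", "C"}),
]

sup_ab = support({"A", "B"}, transactions)               # 3 transactions hold A and B
conf_ab_c = confidence({"A", "B"}, {"C"}, transactions)  # 2 of those 3 also hold C
```

With min_sup = 3, the itemset {A, B} would count as frequent in this toy data, and the rule {A, B} → C would have a confidence of 2/3.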
D. MapReduce Approach for ARM
Association rule generation is widely used, but it faces many issues, the major one being the handling of large and
multidimensional datasets [4]. A single-processor system with ordinary CPU speed and resources cannot handle such large
data, which makes the algorithm inefficient. Recent developments in network technology, and especially in cloud platforms,
have provided new ideas for association rule generation by making use of parallel environments such as Hadoop [5].
MapReduce has been a popular and widely used model for processing large amounts of data ever since it was introduced by
Google. Google's implementation runs on top of the Google File System (GFS), and Amazon Web Services (AWS) offers Hadoop
MapReduce as a hosted service.
A MapReduce job usually splits the input data into chunks, which are processed by the map tasks in parallel. The Mapper
emits its results as key-value pairs, and the framework sorts these outputs by key. The Reducer then combines the map
outputs for each key to produce the final output. The MapReduce framework consists of a single JobTracker as the master and
one TaskTracker as the slave on each cluster node. All input and output in MapReduce are <key, value> pairs. Hadoop is a
Java-based distributed programming environment sponsored by Apache that can be used to process and handle large amounts of
data. Hadoop was created around the concept of MapReduce for large-scale processing on clusters with a large number of
nodes.
In Association Rule Mining with MapReduce, the Mapper emits each combination of items as the key, and the value keeps
track of the number of occurrences, i.e. the support count. The Reducer then merges the Mapper outputs for each key and
calculates the final support and confidence for all the candidate itemsets. In this way the association rules with the
highest support and confidence can be identified.
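The Mapper and Reducer roles described above can be imitated in a single Python process. The sketch below is only an illustration under stated assumptions: it does not use Hadoop, the function names (map_transaction, shuffle, reduce_counts, count_itemsets) are hypothetical, and the shuffle step stands in for the grouping and sorting that the framework would perform between the two phases.

```python
from collections import defaultdict
from itertools import combinations


def map_transaction(transaction, k):
    """Mapper: emit (itemset, 1) for every k-item combination of one transaction."""
    for itemset in combinations(sorted(set(transaction)), k):
        yield itemset, 1


def shuffle(pairs):
    """Group the emitted values by key and sort by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(key, values):
    """Reducer: sum the partial counts to get the support count of one itemset."""
    return key, sum(values)


def count_itemsets(transactions, k):
    """Run map, shuffle and reduce in sequence and return {itemset: count}."""
    emitted = (pair for t in transactions for pair in map_transaction(t, k))
    return dict(reduce_counts(key, values) for key, values in shuffle(emitted).items())


# Hypothetical toy transactions, for illustration only.
transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C", "A"]]
pair_counts = count_itemsets(transactions, 2)
```

Sorting the items of each transaction before forming combinations makes ("A", "B") and ("B", "A") map to the same key, so all occurrences of an itemset reach the same reducer call.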
The remainder of this paper is organized as follows: Section 2 reviews the various association rule mining algorithms
that use Hadoop and MapReduce; Section 3 describes the proposed method and its working; Section 4 shows the experimental
results of the proposed method; and finally Section 5 provides the overall conclusion of the paper.
MapReduce algorithm to identify the appropriate frequent itemsets and association rules by using a near-linear speed up process.
A large number of random samples are mined by using the original dataset.
Jongwook Woo et al. proposed a Market Basket Analysis algorithm combined with MapReduce for association rule
generation. This is one of the most used algorithms for association rules [12]. The algorithm first sorts the given dataset
in ascending order, then converts each instance of the dataset into a (key, value) pair and feeds these pairs into
MapReduce. The execution is done on the Amazon EC2 MapReduce platform. The experimental results show that performance is
increased by the parallel MapReduce code, but a bottleneck still appears at a certain point when more nodes are used.
B. Need for Proposed Method
The binomial algorithm is not suitable for many datasets, so a novel method is needed that can be applied to datasets of
any format [13]. In addition, binomial transformation is complex, time consuming and unnecessary. It is also difficult to
handle and process large volumes of data on a single server, so a parallel environment is needed.
In this paper an improved scalable and distributed key-value pair algorithm is proposed for selecting frequent itemsets
from the dataset and for generating association rules. The proposed algorithm is a bottom-up approach: the candidate
itemsets are generated first, and their support values are then calculated by counting their occurrences in the dataset
transactions. The minimum support value is then applied to reduce the candidate itemsets to the frequent itemsets. A very
large dataset is used here, and after the frequent itemsets are selected the association rules are generated. The
implementation uses the MapReduce platform, and the complete process is parallelized.
First the dataset is read as input by the MapReduce code from HDFS storage, and each item is processed as a separate
key to calculate the frequent 1-itemsets, as in Fig. 1. Then the frequent 2-itemsets are generated using pairs of items
from the 1-itemsets. This process is repeated for as many iterations as the required itemset size demands. Fig. 1 shows the
calculation up to 3-itemsets using MapReduce. The key used in the Mapper represents the n-itemset, where n is the number
of items used to form the key. The flow of the proposed MapReduce framework is shown in Fig. 2.
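The level-wise process, from 1-itemsets up to n-itemsets, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function name frequent_itemsets and the toy transactions are hypothetical, and the pruning step (building level-k candidates only from items that survived level k-1) is an assumed simplification, since the paper does not spell out how candidates are restricted between iterations.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_sup, max_k):
    """Return {k: {itemset: count}} for k = 1..max_k, computed level by level."""
    levels = {}
    allowed = None  # items allowed in candidates; None means no restriction yet
    for k in range(1, max_k + 1):
        counts = Counter()
        for t in transactions:
            items = set(t) if allowed is None else set(t) & allowed
            # Each sorted k-combination acts as the key of one candidate itemset.
            counts.update(combinations(sorted(items), k))
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        if not frequent:
            break  # no frequent k-itemsets, so no larger ones can exist
        levels[k] = frequent
        # Assumed pruning: only items seen in a frequent k-itemset go on to level k+1.
        allowed = {item for s in frequent for item in s}
    return levels


# Hypothetical toy transactions, for illustration only.
transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["A", "B", "C"], ["B", "D"]]
levels = frequent_itemsets(transactions, min_sup=2, max_k=3)
```

In this toy data item D occurs only once, so it is dropped after the first level and never appears in a 2-itemset or 3-itemset candidate.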
During the MapReduce operation the input dataset or file is split into many sections in the Mapper phase, with each
Mapper having a unique key. In ARM the key represents the items available within the dataset and the value is the number
of occurrences of the item in the dataset. Initially the count is set to 1 in the Mapper, and it is incremented for each
further occurrence. Finally, in the Reducer the total number of occurrences is found by merging, and the support and
confidence are calculated. The output file consists of the list of rules generated based on the support and confidence.
The test set contains 311,029 connections or records, with 17 new attack types not available in the training data.
No.  Feature              No.  Feature
1    duration             22   is_guest_login
2    protocol_type        23   count
3    service              24   srv_count
4    flag                 25   serror_rate
5    src_bytes            26   srv_serror_rate
6    dst_bytes            27   rerror_rate
7    land                 28   srv_rerror_rate
8    wrong_fragment       29   same_srv_rate
9    urgent               30   diff_srv_rate
10   hot                  31   srv_diff_host_rate
11   num_failed_logins    32   dst_host_count
12   logged_in            33   dst_host_srv_count
13   num_compromised      34   dst_host_same_srv_rate
14   root_shell           35   dst_host_diff_srv_rate
15   su_attempted         36   dst_host_same_src_port_rate
16   num_root             37   dst_host_srv_diff_host_rate
17   num_file_creation    38   dst_host_serror_rate
18   num_shells           39   dst_host_srv_serror_rate
19   num_access_files     40   dst_host_rerror_rate
20   num_outbound_cmds    41   dst_host_srv_rerror_rate
21   is_host_login
Table 1: Features of the input dataset
The 41 features of the KDD CUP 99 dataset are shown in Table 1, and Fig. 3 shows sample values of the dataset. The values
of features 1 to 41 are separated by commas in the dataset, as shown in Fig. 3. That is, each instance or row of the
dataset consists of 42 attributes, 41 feature attributes and one class attribute, all separated by commas. The row values
are split so that each attribute can be read separately.
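Splitting one row into its 41 features and class attribute might look like the following Python sketch. The function name parse_record, the partial FEATURE_INDEX lookup and the sample row are assumptions made for illustration; the row is built to have the expected shape of 42 comma-separated fields and should not be read as a value taken from Fig. 3.

```python
NUM_FEATURES = 41

# 1-based positions of a few features, following the numbering in Table 1.
FEATURE_INDEX = {
    "duration": 1,
    "protocol_type": 2,
    "service": 3,
    "is_host_login": 21,
    "is_guest_login": 22,
}


def parse_record(line):
    """Split one comma-separated row into (41 feature values, class label)."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != NUM_FEATURES + 1:
        raise ValueError(f"expected {NUM_FEATURES + 1} fields, got {len(fields)}")
    return fields[:NUM_FEATURES], fields[NUM_FEATURES]


def feature(features, name):
    """Look up a feature value by name, converting the 1-based table index."""
    return features[FEATURE_INDEX[name] - 1]


# Hypothetical sample row with 41 feature values followed by the class label.
sample_row = ",".join(
    ["0", "tcp", "http", "SF", "181", "5450"]
    + ["0"] * 5 + ["1"] + ["0"] * 10
    + ["8", "8"] + ["0.00"] * 4
    + ["1.00", "0.00", "0.00", "9", "9", "1.00", "0.00", "0.11"]
    + ["0.00"] * 5
    + ["normal."]
)

features, label = parse_record(sample_row)
protocol = feature(features, "protocol_type")
guest_login = feature(features, "is_guest_login")
```

Values are kept as strings here because the attributes are later used as items in itemsets; converting numeric columns would be a separate design choice.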
by calculating support and confidence and then selecting the rules based on that. Based on this it is possible to identify if the user
of a specific instance or attack is a guest login or host login. The obtained values of support and confidence during the 4 levels of
MapReduce operations are shown in Fig. 4.
The execution of the MapReduce phase [18] in Hadoop and the final results of the Reducer phase are shown in Fig. 5 and
Fig. 6 respectively. Fig. 5 shows the execution of the Reducer phase while the output file is being generated, together
with the final statistics of the MapReduce job. The generated output file is shown in Fig. 6.
The final output in Fig. 6 lists all the frequent itemsets that are generated, along with their support and confidence
values. The output format is <itemset, support, confidence>, and it is generated for all possible combinations of itemsets
for the given input attributes. In this case the 2-itemsets are generated.
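A sketch of how reduced counts could be turned into lines of the form <itemset, support, confidence> for 2-itemsets is given below. Several details are assumptions, because the text does not fix them: support is reported as a raw count, the confidence of a pair (a, b) is taken for the rule a → b with the first item as antecedent, and the function name format_rules, the thresholds and the counts are hypothetical.

```python
def format_rules(item_counts, pair_counts, min_sup, min_conf):
    """Return '<itemset, support, confidence>' lines for the qualifying 2-itemsets."""
    lines = []
    for (a, b), sup_ab in sorted(pair_counts.items()):
        if sup_ab < min_sup:
            continue  # itemset is not frequent
        conf = sup_ab / item_counts[a]  # assumed rule direction: a -> b
        if conf >= min_conf:
            lines.append(f"<{{{a}, {b}}}, {sup_ab}, {conf:.2f}>")
    return lines


# Hypothetical counts, as a reducer might produce them for 1- and 2-itemsets.
item_counts = {"A": 4, "B": 4, "C": 3}
pair_counts = {("A", "B"): 3, ("A", "C"): 3, ("B", "C"): 2}

rule_lines = format_rules(item_counts, pair_counts, min_sup=3, min_conf=0.7)
```

Here {B, C} is dropped because its support count of 2 is below min_sup, while {A, B} and {A, C} each pass with a confidence of 3/4.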
REFERENCES
[1] M.Z. Ashrafi, D. Taniar, K. Smith, ODAM: An Optimized Distributed Association Rule Mining Algorithm, Distributed
Systems Online, IEEE, Volume 5, Issue 3, 2004.
[2] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, In Proceedings of International Conference on
Very Large Data Bases, pp. 487-499, Santiago, Chile, September 1994.
[3] Jong Soo Park, Ming-Syan Chen, Philip S. Yu, An Effective Hash-based Algorithm for Mining Association Rules, In
Proceedings of the ACM SIGMOD International Conference on Management of Data, Michael Carey and Donovan
Schneider, ACM, 1995.
[4] S.A. Ozel, H.A. Guvenir, An Algorithm for Mining Association Rules using Perfect Hashing and Database Pruning, 10th
Turkish Symposium on Artificial Intelligence and Neural Networks, Gazimagusa, Springer, pp. 257-264, 2001.
[5] Karam Gouda, Mohammed Javeed Zaki, Efficiently Mining Maximal Frequent Itemsets, In Proceedings of the IEEE
International Conference on Data Mining, pp. 163-170, November 29-December 02, 2001.
[6] J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, ACM SIGMOD International
Conference, Dallas, 2000.
[7] D.W. Cheung, Jiawei Han, V.T. Ng, A.W. Fu, Yongjian Fu, A Fast Distributed Algorithm for Mining Association Rules, In
Proceedings of International Conference on Parallel and Distributed Information Systems, IEEE CS Press, 1996.
[8] E. Ansari, G. Dastghaibifard, M. Keshtkaran, H. Kaabi, Distributed Frequent Itemset Mining using Trie Data Structure,
International Journal of Computer Science, Volume 35, Issue 3, pp. 337-381, 2008.
[9] J.S. Park, M.S. Chen, P.S. Yu, Efficient Parallel Data Mining for Association Rules, In Proceedings of the Fourth
International Conference on Information and Knowledge Management, pp. 31-33, 1995.
[10] J. Woo, Y. Xu, Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, In Proceedings of the
International Conference on Parallel and Distributed Processing Techniques and Applications, 2001.
[11] Ming-Yen Lin, Pei-Yu Lee, Sue-Chen Hsueh, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, In
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM, 2012.
[12] Peddi Kishor, Sammulal Porika, Literature Survey on Association Rule Discovery in Data Mining, International Journal of
Computer Science and Management Research, Volume 2, Issue 1, January 2013.
[13] C.S. Zhang, Z.Y. Li, D.S. Zheng, An Improved Algorithm for Apriori, In Proceedings of the 1st International Workshop on
Education Technology and Computer Science, Volume 1, pp. 995-998, 2009.
[14] C. Jin, C. Vecchiola, R. Buyya, MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms, Fourth IEEE
International Conference on eScience, pp. 214-221, 2008.
[15] T. Elsayed, J. Lin, Douglas W. Oard, Pairwise Document Similarity in Large Collections with MapReduce, In Proceedings
of 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009.
[16] J.H.C. Yeung, C.C. Tsang, K.H. Tsoi, B. Kwan, C. Cheung, A.P.C. Chan, P.H.W. Leong, Map-reduce as a Programming
Model for Custom Computing Machines, In Proceedings of the 16th IEEE Symposium on Field-Programmable Custom
Computing Machines, pp. 149-159, 2008.
[17] M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, I. Stoica, Improving MapReduce Performance in Heterogeneous
Environments, EECS Department, University of California, Berkeley, Technical Report Number UCB/EECS-2008-99,
August 19, 2008.
[18] Mohammadhossein Barkhordari, Mahdi Niamanesh, ScadiBino: An Effective MapReduce-based Association Rule Mining
Method, ACM 16th International Conference on Electronic Commerce, August 2014.
[19] P. Ganesh Kumar, D. Devaraj, Intrusion Detection using Artificial Neural Network with Reduced Input Features,
International Journal on Soft Computing, ICTACT, Issue 1, pp. 30-36, July 2010.