
2011 International Conference on Computational and Information Sciences

Research of Cloud Computing based on the Hadoop platform

Chen Hao
Computer Science & Technology Department, Southwest Petroleum University, Chengdu, China
Email: chanhaul@gmail.com

Qiao Ying
Computer Science & Technology Department, Southwest Petroleum University, Chengdu, China
Email: teachqiao@163.com

Abstract—The application of cloud computing produces massive accumulations of data, so how to manage data and allocate distributed storage effectively has become a hot topic. Hadoop, developed by the Apache Foundation, processes massive data in parallel and is now the most widely applied distributed platform. This paper presents the Hadoop platform computing model and the Map/Reduce algorithm, and combines the K-means clustering algorithm with data mining techniques to analyze the effectiveness of their application on the cloud computing platform.

Keywords—Cloud computing; Hadoop; Map/Reduce; Data mining; K-means

I. INTRODUCTION

With the maturing of computer network technology, cloud computing has been widely recognized and applied. IT giants such as Google, IBM, Amazon and Microsoft have launched their own commercial products and treat cloud computing as a strategic priority for future development. This brings a large-scale data problem: online information is growing explosively, and each user may hold a huge amount of data. At the same time, transistor circuits are gradually approaching their physical limits, so Moore's law of CPU performance doubling every 18 months is reaching its failure point. Facing such massive information, how to manage and store the data are the important issues we must deal with. Hadoop is a cloud computing platform designed by the Apache open source project; we use this framework to solve these problems and manage data conveniently. It rests on two major technologies: HDFS, which provides storage and fault tolerance for huge files, and Map/Reduce, which computes over the data in a distributed fashion.

II. BACKGROUND KNOWLEDGE AND RELEVANT CONCEPTS

A. Cloud computing

Cloud computing grew out of a variety of network technologies, including parallel computing, distributed computing and grid computing. It carries out tasks virtually in the cloud, combining many cheap computing nodes into one huge system that provides large computing capacity. In other words, it can be regarded as separating the monitor from the main engine, as Figure 1 shows:

Figure 1 Cloud Computing Structure

Cloud computing works like a banking system: we can store data and use applications as conveniently as we save and manage money at a bank. The user no longer needs a lot of hardware as background support; the only requirement is a connection to the cloud.

B. Parallel computing

Parallel computing is a method of raising computing efficiency by solving a problem with multiple resources at once. The main principle is to split a task into N parts and send them to N computers, so that efficiency increases roughly N times. Its serious shortcoming is that the parts remain dependent on one another, which is a barrier to the development of parallel computing.

C. Distributed computing

The basic principle of distributed computing is consistent with that of parallel computing. Its advantages are strong fault tolerance and the ability to expand computing capacity easily by adding computer nodes. The difference is that the parts are independent of each other, so the failure of a batch of computing nodes does not affect the accuracy of the calculation. A small sketch of this split-and-distribute idea follows.
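The following is a minimal sketch, not taken from the paper, of the "split a task into N parts and send each part to its own worker" idea described above. It uses Python's standard concurrent.futures module, and the example task (summing chunks of a list) is purely illustrative.

    # Minimal illustration of splitting one task into N independent parts
    # and running them on separate workers (the split-and-distribute idea
    # sketched above). The task here -- summing a list -- is hypothetical.
    from concurrent.futures import ProcessPoolExecutor

    def work_on_part(part):
        # Each worker handles its own part independently; if one part
        # fails it can simply be resubmitted without affecting the others.
        return sum(part)

    def run_distributed(data, n_parts=4):
        size = (len(data) + n_parts - 1) // n_parts
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        with ProcessPoolExecutor(max_workers=n_parts) as pool:
            partial_results = list(pool.map(work_on_part, parts))
        return sum(partial_results)

    if __name__ == "__main__":
        print(run_distributed(list(range(1_000_000))))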
D. GFS and HDFS

A GFS cluster consists of a single master and many chunkservers, and is accessed by many clients. Its structure is shown in Figure 2:

Figure 2 GFS Structure

HDFS is a distributed file system that runs on ordinary hardware and implements the architecture described in Google's GFS paper. It adopts a master/slave model: an HDFS cluster consists of one Namenode and many Datanodes, where the Namenode is the central management server, as shown in Figure 3:

Figure 3 HDFS Structure

III. THE STRUCTURE OF CLOUD COMPUTING

A cloud computing platform is a powerful cloud network that connects a large number of concurrent services and can be extended over many virtual servers. Combining all resources through the platform supports huge computing and storage capacity. The general cloud computing system structure is shown in Figure 4:

Figure 4 Cloud Computing Platform

A. Hadoop Structure

Hadoop, developed by the Apache project, is a basic distributed system infrastructure. It lets users write distributed software easily even if they know nothing about the underlying environment. HDFS, the base layer, is the main storage system of Hadoop and runs on ordinary cluster components. It is usually deployed on low-cost hardware to provide a high transmission rate and access to application data, which makes it suitable for programs with large datasets.
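As a hedged illustration (not from the paper), the sketch below shows how a client program might place a local file into HDFS by invoking the standard hdfs dfs shell commands from Python. The file name and target directory are hypothetical, and a working Hadoop installation with a running Namenode and Datanodes is assumed.

    # Hypothetical illustration of storing a local file in HDFS by calling
    # the standard "hdfs dfs" command-line tool; assumes Hadoop is installed
    # and an HDFS cluster (Namenode + Datanodes) is already running.
    import subprocess

    def put_into_hdfs(local_path, hdfs_dir):
        # Create the target directory (no error if it already exists).
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
        # Copy the local file into HDFS; its blocks are replicated across Datanodes.
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)
        # List the directory to confirm the upload.
        subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)

    if __name__ == "__main__":
        put_into_hdfs("experiment_data.txt", "/user/demo/input")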

B. Map/Reduce

Map/Reduce, presented by Jeffrey Dean and Sanjay Ghemawat, is a programming model for computing over massive data; it was developed by Google and is a core technology of cloud computing.


The model abstracts the common operations on a large dataset into Map and Reduce steps, which reduces the programmer's difficulty with distributed and parallel computing. First, the dataset is split into several small parts by Map; these parts are then sent to a large number of computers for parallel computation, producing a series of intermediate keys. Finally, Reduce integrates all the results and outputs them. The structure is shown in Figure 5:

Figure 5 Map/Reduce Process

The Map function is supplied by users according to their different business needs. It processes a key-value pair and produces a new key-value pair as an intermediate result. The Map/Reduce library gathers the values belonging to the same key and passes them to the Reduce function. The Reduce function, likewise defined by the user, processes the transferred intermediate keys and merges the values of the same key to produce a smaller set of values. This summarizes the process of a Map/Reduce operation; a minimal sketch follows.
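The following minimal sketch, written for illustration rather than taken from the paper, simulates the Map, shuffle and Reduce phases just described using word counting as the example job. On a real Hadoop cluster the framework itself would run the user's map and reduce functions across many nodes.

    # Illustrative in-memory simulation of the Map / shuffle / Reduce phases
    # described above, using word counting as the example job. On a real
    # Hadoop cluster the framework would run these functions on many nodes.
    from collections import defaultdict

    def map_func(_, line):
        # Map: turn one input record into intermediate (key, value) pairs.
        for word in line.split():
            yield word, 1

    def reduce_func(word, counts):
        # Reduce: merge all values that share the same intermediate key.
        return word, sum(counts)

    def run_mapreduce(records):
        intermediate = defaultdict(list)
        # Map phase (would run in parallel across many machines).
        for key, value in records:
            for out_key, out_value in map_func(key, value):
                intermediate[out_key].append(out_value)
        # Shuffle is modeled by the grouping in 'intermediate';
        # the Reduce phase merges each group into a final result.
        return dict(reduce_func(k, v) for k, v in intermediate.items())

    if __name__ == "__main__":
        lines = enumerate(["hadoop stores data", "hadoop computes data"])
        print(run_mapreduce(lines))  # {'hadoop': 2, 'stores': 1, ...}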

IV. APPLICATION OF CLUSTERING

Clustering algorithms are an important part of data mining, especially in a system that computes over such large amounts of data, and clustering is particularly vital in cloud technology: partitioning data by its characteristics is the most important step for the storage and security of cloud computing. There are many clustering algorithms; here we combine K-means with Map/Reduce to discuss how data are distributed on the Hadoop platform.

A. K-means algorithm

K-means optimizes an objective function based on the distance between each data point and its cluster center, using Euclidean distance as the similarity measure. The resulting clustering satisfies two rules: objects within the same cluster are highly similar, while different clusters have little similarity between them. This corresponds to the idea of "high cohesion, low coupling" in software engineering. The specific process is as follows (a minimal code sketch appears after Figure 6):

A) First, select K objects as the centers of the data clusters;
B) Assign each of the remaining objects to one of these centers according to its similarity (distance);
C) Recalculate the center of each cluster to obtain a new center;
D) Repeat these steps until the criterion function converges.

In addition, the mean squared error is usually adopted as the criterion function. As an example, we simulate a set of data, scatter it with K-means, and use k = 3 cluster types. The experimental data are shown in Figure 6:

Figure 6 Experimental Data 1 (three clusters, ClusterOne to ClusterThree, and their centroids)
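As a hedged illustration of steps A) through D) above (not code from the paper), the sketch below implements plain K-means with NumPy. The generated sample data and the choice of k = 3 only imitate the kind of experiment described; the random values are not the paper's dataset.

    # Minimal K-means sketch following steps A)-D) above; the generated
    # sample data only imitates the experiment described and is not the
    # paper's actual dataset.
    import numpy as np

    def kmeans(points, k=3, max_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # A) pick k objects as the initial cluster centers.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            # B) assign every point to the nearest center (Euclidean distance).
            dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # C) recompute each center as the mean of its assigned points.
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            # D) stop when the centers (and hence the squared-error criterion) converge.
            if np.linalg.norm(new_centers - centers) < tol:
                break
            centers = new_centers
        return labels, centers

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (1.0, 3.0, 5.0)])
        labels, centers = kmeans(data, k=3)
        print(centers)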

On this basis, we can also add some disturbance factors to strengthen the hiding of the data, so that the data gain a higher level of security. For example, after adding an F(5, 10) distribution function to the original data, the experimental data are shown in Figure 7:

Figure 7 Experimental Data 2 (the three clusters and their centroids after the disturbance is added)

B. The application of K-means

Consequently, we can use the K-means clustering method to group together data that are similar to each other. Combined with Map/Reduce in Hadoop, this lets us distribute the storage and use of data in cloud computing. The structure is shown in Figure 8:

Figure 8 Combination Structure

According to the diagram, the process can be divided into five steps (a minimal sketch follows the list):

A) In cloud computing, the data are partitioned into many blocks by the K-means algorithm, ensuring that the data within each block have high similarity.
B) The blocks are distributed again under central control, and a pointer is assigned to each distributed block so that the Map/Reduce operation becomes more convenient.
C) The locations of these small blocks are reported to the Master so that it can assign tasks.
D) The Master dispatches the data blocks to the Map workers for processing.
E) Map returns the intermediate values, and the Master then starts the Reduce operation.

Compared with the previous approach that does not use K-means, this process has a greater impact on the data distribution: classifying the data with K-means before Map/Reduce makes the data classifications more effective to manage.
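The following sketch, an illustration rather than the paper's implementation, shows one way the five steps above could fit together: K-means labels partition the records into blocks of similar data, and the "Master" then hands each block to a Map/Reduce pass. It assumes the hypothetical kmeans() and run_mapreduce() sketches from earlier in this text are in the same module.

    # Illustrative sketch of the five steps above: K-means first partitions the
    # records into blocks of similar data, and the "Master" then hands each block
    # to a Map/Reduce pass. It reuses the hypothetical kmeans() and run_mapreduce()
    # sketches shown earlier; this is not the paper's code.

    def partition_with_kmeans(points, records, k=3):
        # Steps A)-B): cluster the feature points and group the matching
        # records into blocks with high internal similarity.
        labels, _ = kmeans(points, k=k)
        blocks = {j: [] for j in range(k)}
        for label, record in zip(labels, records):
            blocks[int(label)].append(record)
        return blocks

    def process_blocks(blocks):
        # Steps C)-E): report each block to the Master, which assigns it a
        # Map/Reduce pass and collects the reduced output per block.
        return {block_id: run_mapreduce(enumerate(lines))
                for block_id, lines in blocks.items()}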

V. CONCLUSION

Data storage is an important element of cloud computing. This paper discussed the core technologies of the Hadoop framework, HDFS and Map/Reduce. Combining data mining with the K-means clustering algorithm makes data management easier and faster in the cloud computing model. Although this technology is still in its infancy, we believe that with continuous improvement, cloud computing will develop in a more secure and reliable direction.

REFERENCES
[1] Wang Xiangqian. Optimization of High Performance MapReduce System. Computer Software Theory, University of Science and Technology of China, 2010.
[2] Zhu Zhu. Research and Application of a Massive Data Processing Model Based on Hadoop. Beijing University of Posts and Telecommunications, 2008.
[3] Yang Chenzhu. The Research of Data Mining Based on Hadoop. Chongqing University, 2010.
[4] Qiu Rongtai. Research on MapReduce Application Based on Hadoop. Henan Polytechnic University, 2009.
[5] Wang Peng. Into Cloud Computing. People's Posts and Telecommunications Press, 2009 (in Chinese).
