
IPASJ International Journal of Computer Science (IIJCS)
Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm
A Publisher for Research Motivation ........ Email: editoriijcs@ipasj.org
Volume 5, Issue 4, April 2017    ISSN 2321-5992

Effective Methods for Range Aggregate Queries in Big Data with Enhanced Security
Anisa I. Tamboli¹ and Sandeep G. Sutar²

¹Student, Annasaheb Dange College of Engineering and Technology, Ashta, Maharashtra, India
²Assistant Professor, Annasaheb Dange College of Engineering and Technology, Ashta, Maharashtra, India

Abstract
Big data analysis can reveal developments across various aspects of society and the day-to-day preferences of individuals. This offers a new opportunity to explore fundamental questions about our complex world, so efficient techniques and tools for big data analysis are increasingly important. Range aggregate queries are important tools in decision management, online suggestion, trend estimation, and so on. Existing methods for handling range aggregate queries are insufficient to quickly obtain accurate results in big data. In this paper, we propose effective methods for handling range aggregate queries in big data. The proposed system uses the Hadoop Distributed File System (HDFS), which provides a framework for analysing and transforming very large data sets using the MapReduce paradigm. The interface to HDFS is the Linux file system, which in turn improves performance for applications. The proposed system divides big data into independent partitions with the MapReduce paradigm and then generates an estimation sketch for each partition. When a range aggregate query request arrives, the system obtains the result directly by summarizing estimates from all partitions. Big data involves a major increase in data volumes, and the selected tuples may reside in different file formats, i.e., data may be structured, semi-structured, or unstructured. The proposed system aims to provide a fast approach for range aggregate queries that fetches results in the least amount of time over structured and semi-structured heterogeneous file contexts.
Keywords: Big data, MapReduce, RAQ (Range Aggregate Query), MongoDB

1. INTRODUCTION
Big data is described as a huge amount of data that requires new techniques to capture, analyse, and extract knowledge from it [2]. Because of this size, effective analysis becomes very difficult with existing conventional techniques. Properties of big data such as volume, velocity, variety, variability, value, and complexity pose many challenges [4]. Another aspect linked to big data is social sites and media: services such as Google's Gmail, Facebook, and WhatsApp are hit every day by billions of people all over the world. The more elementary challenge for big data applications is to traverse the large volumes of data and fetch useful information or knowledge for future actions [2].
An application example of big data analysis is a distributed intrusion detection system (DIDS), which monitors and reports anomalous activities or strange patterns at the network level. A DIDS detects anomalies via statistics that summarize traffic features from diverse sensors, improving false-alarm rates when detecting coordinated attacks. Such a scenario motivates a typical range-aggregate query problem: summarizing aggregated features from all tuples within given query ranges [1]. Range-aggregate queries apply an aggregate function to such records for analysis within a given query range. These queries work efficiently on small datasets, but when big data comes into the picture, the huge number of records is not processed efficiently [5].
In the existing system, range aggregate queries are executed in a big data environment with better efficiency than other, linear execution processes. The proposed work contributes the use of the MongoDB database for better results than the preceding system; MongoDB is built specifically to handle semi-structured and unstructured data. The proposed work divides big data into multiple partitions using the MapReduce algorithm and then generates a local estimation sketch for each partition. When a range-aggregate query request arrives, the system obtains the result directly by summarizing all estimates from all partitions. The proposed work also applies data security techniques that enhance the privacy of sensitive data [8].

2. PROPOSED WORK

2.1 SCOPE
Existing methods for handling range aggregate queries are insufficient to quickly obtain accurate results in big data environments. In this paper, we propose effective methods for handling range aggregate queries in big data environments. The proposed system also focuses on heterogeneous file contexts, i.e., structured and semi-structured files can be stored in a database and accessed to answer range aggregate queries. A heterogeneous file context normally requires data cleaning and preprocessing, with semi-structured files converted into structured form before aggregate functions are applied to a structured database. The proposed system overcomes this problem using the MongoDB database: MongoDB stores semi-structured data, e.g. an XML file, in the form of a tree structure and executes queries directly on this tree-structured dataset. The proposed system also provides data security techniques that give privacy to sensitive data that must be hidden from specific users. This is done by displaying sensitive data in a generalized form [8].
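As an illustration, the following minimal Python sketch (not the exact pipeline of the proposed system) shows how a semi-structured record, e.g. one parsed from XML, could be stored and queried as a nested document in MongoDB using the pymongo driver; the database, collection, and field names are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["raq_demo"]["employees"]  # hypothetical database/collection

# A semi-structured record stored directly as a nested (tree-structured) document.
collection.insert_one({
    "name": "A. Kumar",
    "age": 27,
    "contacts": {"email": "a.kumar@example.com", "phone": "000-0000"},
})

# Dot notation descends the tree directly, with no prior conversion
# to a flat relational schema.
for doc in collection.find({"contacts.email": {"$exists": True}}):
    print(doc["name"], doc["age"])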

2.2 SYSTEM ARCHITECTURE


As shown in Figure 1, the proposed system functions in the following steps:
Step 1: The system first divides big data into multiple partitions using the MapReduce paradigm and then generates an estimation sketch for each partition.
Step 2: When a range aggregate query request arrives, the system obtains the result by summarizing estimations from all partitions.
Step 3: It divides all data into different groups according to their attribute values of interest.
Step 4: It then separates each group into multiple partitions according to the current data distribution.
Step 5: The system uses MongoDB, where the semi-structured dataset is converted into structured form and the estimated result of the range aggregate query is displayed to the user (a minimal sketch of the partition-and-summarize idea follows Figure 1).

Figure 1 System Architecture
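To make the partition-and-summarize idea concrete, the following is a minimal, self-contained Python sketch under simplified assumptions: exact (count, sum) pairs stand in for the approximate estimation sketches, the partitions are assumed to already contain only tuples in the queried range, and a single process stands in for the MapReduce cluster.

# Build one summary ("sketch") per partition.
def build_sketch(partition):
    return (len(partition), sum(partition))

# Answer a range-SUM query by combining per-partition summaries,
# never rescanning the raw tuples.
def answer_sum_query(sketches):
    return sum(total for _, total in sketches)

partitions = [[3.0, 5.0], [7.0], [2.0, 8.0, 1.0]]  # toy partitioned data
sketches = [build_sketch(p) for p in partitions]
print(answer_sum_query(sketches))  # 26.0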


3. MODULES
MODULE 1: PREPROCESSING PHASE
The preprocessing phase includes the following parts: big data setup, user profile, and account registration. First, the user sets up the big data environment and then creates an account; only then are they allowed to access the system. Once a user has created an account, they can log in through it. Based on the user's request, the system processes it and responds.

MODULE 2: RAQ REQUEST


The user can send an RAQ (Range Aggregate Query) request over the uploaded structured and semi-structured data, and may also access the data further, for example by adding content to it (a sketch of such a request follows). The partitioned data are then combined to reconstruct the original data, which the user can download.
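As an illustration, a range aggregate query such as "average number of contacts for employees aged 20 to 30" could be sent to MongoDB as an aggregation pipeline. This is a hedged sketch with hypothetical collection and field names, not the system's actual request format.

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["raq_demo"]["employees"]

# Range aggregate query: average contacts over the age range [20, 30).
pipeline = [
    {"$match": {"age": {"$gte": 20, "$lt": 30}}},  # the queried range
    {"$group": {"_id": None, "avg_contacts": {"$avg": "$num_contacts"}}},
]
for result in collection.aggregate(pipeline):
    print(result["avg_contacts"])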

MODULE 3: MAPPER AND REDUCER


MapReduce is the basic data processing scheme used in Hadoop. It splits the entire task into two parts, known as mappers and reducers. At a high level, mappers read data from the distributed file system, process it, and emit intermediate results to the reducers. Reducers aggregate the intermediate results to generate the final output, which is written back to HDFS. A Hadoop job involves running several mappers and reducers across different nodes in the cluster.
Example: Imagine that, for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has, grouped by age. In SQL, such a query could be expressed as:

SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age

Using the MapReduce paradigm, the K1 key values could be the integers 1 through 1100, each representing a batch of 1 million records; the K2 key value could be a person's age in years; and this computation could be achieved using the following functions:
Function Map is
    Input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    For each social.person record in the K1 batch do
        Let Y be the person's age
        Let N be the number of contacts the person has
        Produce one output record (Y, (N, 1))
    Repeat
End function

Function Reduce is
    Input: age (in years) Y
    For each input record (Y, (N, C)) do
        Accumulate in S the sum of N*C
        Accumulate in Cnew the sum of C
    Repeat
    Let A be S/Cnew
    Produce one output record (Y, (A, Cnew))
End function
The MapReduce system would line up the 1100 Map processors and provide each with its corresponding 1 million input records. The Map step would produce 1.1 billion (Y, (N, 1)) records, with Y values ranging between, say, 8 and 103. The MapReduce system would then line up the 96 Reduce processors by shuffling the key/value pairs (since the average is needed per age) and provide each with its millions of corresponding input records. The Reduce step would result in the much reduced set of only 96 output records (Y, A), which would be written to the final result file, sorted by Y. In this way, MapReduce is a framework with which range aggregate queries can be processed over huge amounts of data, in parallel, on large clusters of commodity hardware, in a reliable manner.
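The following is a minimal, self-contained Python simulation of the example above, with a small toy dataset standing in for the 1.1 billion records and a single process standing in for the cluster; it shows that accumulating the (N, C) pairs per age reproduces the per-age averages.

from collections import defaultdict

# Toy (age, contacts) records standing in for the social.person table.
records = [(25, 30), (25, 50), (42, 10), (42, 20), (42, 30)]

# Map: emit (age, (contacts, 1)) for every record.
intermediate = [(age, (contacts, 1)) for age, contacts in records]

# Shuffle: group the intermediate pairs by their age key.
groups = defaultdict(list)
for age, pair in intermediate:
    groups[age].append(pair)

# Reduce: accumulate the sums of N*C and of C, then output the average.
for age in sorted(groups):
    s = sum(n * c for n, c in groups[age])
    count = sum(c for _, c in groups[age])
    print(age, s / count, count)  # 25 40.0 2, then 42 20.0 3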


MODULE 4: SECURITY
Security is improved by data security techniques that enhance the privacy of sensitive data. Query results can be returned securely by hiding sensitive data from users, thereby improving security for query processing in the big data environment.
Example: Display the age attribute of employees in generalized form.
The age attribute can take any numerical value, and a user could reveal an employee's data based on his or her exact age. A generalization algorithm therefore replaces each particular age with a range, e.g. 20-30 or 30-40, so the exact age of an employee remains secured [8].
Generalization Algorithm:
1. Retrieve the age from the MongoDB database.
2. Compute the units digit of the age.
3. Subtract this digit from the age; this gives the start of the age range.
4. Add 10 to the start of the range to get its end.
5. Concatenate the start and end of the range to obtain the generalized result (a Python sketch follows).
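As a minimal sketch, the algorithm can be written in a few lines of Python; step 1, the retrieval of the age from the MongoDB database, is assumed to have already produced the integer value.

def generalize_age(age):
    # Steps 2-3: subtract the units digit to get the start of the range.
    start = age - (age % 10)
    # Step 4: add 10 to get the end of the range.
    end = start + 10
    # Step 5: concatenate the start and end.
    return f"{start}-{end}"

print(generalize_age(27))  # 20-30
print(generalize_age(34))  # 30-40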

4. CONCLUSION
In this paper, we proposed an approach to range aggregate queries: a new approximate answering method that quickly obtains accurate estimations for range aggregate queries in big data environments. We believe our system provides a better starting point for building real-time answering methods for big data analysis. The proposed system overcomes problems present in the existing system, making it more proficient at quickly providing accurate results for range aggregate queries in big data environments.

REFERENCES
[1] Elumalai R., Mathankumar G., Gunaseelan V., Aravind Raj S., Gnanavel S., "Black Money Check: Integration of Big Data & Cloud Computing to Detect Black Money Rotation with Range Aggregate Queries," International Research Journal in Advanced Engineering and Technology, Vol. 2, Issue 2, 2016, pp. 767-772.
[2] Prasadkumar Kale, Arti Mohanpurkar, "Efficient Query Handling on Big Data in Network Using Pattern Matching Algorithm: A Review," Volume 3, Issue 11, November 2014.
[3] Ching-Tien Ho, Rakesh Agrawal, Nimrod Megiddo, Ramakrishnan Srikant, "Range Queries in OLAP Data Cubes," IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120.
[4] Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin, "Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture," arXiv:1210.7350v1, 27 Oct 2012.
[5] Weifa Liang, Hui Wang, Maria E. Orlowska, "Range Queries in Dynamic OLAP Data Cubes," St. Lucia Qld 4072, Australia; received 12 August 1999, revised 14 February 2000.
[6] Dilpreet Singh, Chandan K. Reddy, "A Survey on Platforms for Big Data Analytics," Journal of Big Data, 2014, 1:8, doi:10.1186/s40537-014-0008-6.
[7] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, Vol. 51, No. 1, January 2008.
[8] Varsha P. Gaikwad, Nikita R. Khare, Chaitanya N. Kalantri, "Collaborative Data Publishing Technique with Enhanced Security," International Journal of Latest Trends in Engineering and Technology, Vol. 6, Issue 4, March 2016, ISSN 2278-621X.

Anisa I. Tamboli received the B.Tech degree in Information Technology from Walchand College of Engineering, Sangli, and is pursuing the M.E. in Computer Science and Engineering at Annasaheb Dange College of Engineering, Ashta. She has 2.9 years of industrial experience and 2 years of teaching experience.

Sandeep G. Sutar is an Assistant Professor in the Computer Science and Engineering department with 12 years of teaching experience. He received the B.E. and M.E. degrees in Computer Science and Engineering from Shivaji University, Kolhapur, and is a renowned teacher for subjects such as Grid Computing, Cloud Computing, and Big Data.

