Abstract - In a big data environment, aggregate queries are important tools for finding individual persons' behavior, trends, and various activities in the real world. Aggregation applies an aggregate function to all the tuples within a specified range. Existing approaches are not fast enough for large datasets such as those in banks, financial institutions, etc., so it is important to provide effective methods and tools for big data analysis. In the proposed system, Fast RAQ divides big data into different partitions using a balanced partitioning algorithm and generates a local estimate for each individual partition. When a query request is given, the algorithm obtains the result directly by grouping the local estimates from all partitions into a collective result. This system applies Fast RAQ to the banking domain. The banking datasets are divided into multiple tuples and stored in different sets of databases across different places. The proposed method tracks the multiple accounts a user maintains in different banks, together with their transaction details. This helps in finding out tax violators using their unique id.

Key Words: Hadoop, Bigdata, Cloud, Fast RAQ, Balanced partitioning

1. INTRODUCTION

Big data analysis is generally used to explore hidden patterns in large datasets. It provides a new approach to discovering solutions for various difficulties in the real world. It is vital to provide cost-effective and time-saving methods and tools for analysis in a big data environment.

The main aim of the project is to identify tax violators in the banking sector: to track the transactions of users across multiple banks and monitor them using fast range-aggregate queries with a balanced partitioning algorithm. The result is obtained by summarizing the local estimates from all the partitions into a collective result.

The transaction details of different banks are taken and stored in the cloud. The datasets are partitioned into different tuples before being uploaded to cloud storage. The algorithm partitions the datasets according to their attributes, interests, etc.

In this project, the time taken to process a given query is greatly reduced, and the system provides efficient and accurate results for large datasets. Hadoop and the MapReduce concept are used for the storage and processing of the large datasets.

2. LITERATURE SURVEY

Xiaochun Yun, et al. [1] describe the implementation of a low-cost and fast approach for getting accurate results in big data analysis using queries.

Zhiqiang Zhang, et al. [2] proposed Hadoop online aggregation in a distributed environment. Random sampling and sample-size estimation are analyzed. Two sample values are calculated: (1) a user-calculated sample value and (2) a system-calculated sample value. The approach also ensures that approximate aggregation results are produced.

A. Munar, et al. [3] base their work on the highly scalable and fault-tolerant MapReduce model for use with large-scale databases. It uses various big data analytics to handle systems with different requirement specifications. The paper mainly focuses on providing good performance even when there is an enormous increase in the size of the database.

Y. Shi, et al. [4] present a cloud-based system for online aggregation that provides progressive approximate aggregate answers for both single tables and multiple joined tables.

3. PROPOSED SYSTEM

In the proposed system, the datasets are divided into different partitions using the partitioning algorithm. A sample value is then obtained from each individual partition, and the analysis of the datasets is derived from these values. When a query arrives, the main cost factors in analyzing big data are network synchronization and the scanning of files for every transaction while evaluating range-aggregate queries. In the proposed system, because the query is a fast range-aggregate query, counter values from the aggregated columns and sample values from the rows are calculated as the query passes through every tuple, so the cost of network synchronization and file scanning is reduced. This leads to accurate results.
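The partition-then-merge idea described above can be sketched in plain Python. The record layout, the hash-based partitioning, and the per-partition (count, sum) estimate are illustrative assumptions for demonstration, not the actual Fast RAQ implementation:

```python
# Illustrative sketch of the partition/local-estimate idea behind Fast RAQ.
# The record fields, hash partitioning, and (count, sum) estimates are
# assumptions for demonstration only.

def partition(records, num_partitions):
    """Split records into roughly balanced partitions by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec["account_id"]) % num_partitions].append(rec)
    return parts

def local_estimate(part, lo, hi):
    """Per-partition (count, sum) over transactions with amount in [lo, hi]."""
    amounts = [r["amount"] for r in part if lo <= r["amount"] <= hi]
    return len(amounts), sum(amounts)

def range_aggregate(parts, lo, hi):
    """Answer a range-aggregate query by merging every partition's estimate."""
    total_count = total_sum = 0
    for part in parts:
        c, s = local_estimate(part, lo, hi)
        total_count += c
        total_sum += s
    return total_count, total_sum

records = [
    {"account_id": "A1", "amount": 500},
    {"account_id": "A2", "amount": 1500},
    {"account_id": "A1", "amount": 2500},
    {"account_id": "A3", "amount": 800},
]
parts = partition(records, num_partitions=2)
print(range_aggregate(parts, 400, 2000))  # (3, 2800)
```

Because each partition answers the query locally, only the small (count, sum) pairs need to be merged at query time, regardless of how the records were distributed.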
2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 2267
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 03 | Mar -2017 www.irjet.net p-ISSN: 2395-0072
4.1 Big Data Analysis

Big data analysis is the process of analyzing large datasets to explore hidden patterns, facts, etc.

4.2 Hadoop

Hadoop is used for storing and analyzing large datasets in a distributed environment.

4.3 My SQL Query Browser

MySQL Query Browser is used to store the datasets; it acts as the backend of the project.

The users' banking data is partitioned into multiple tuples and stored in different sets of databases. Fast RAQ first divides the big data into independent partitions with a balanced partitioning algorithm, and then generates a local estimate for each individual partition. When a query request is given, the algorithm obtains the result directly by grouping the local estimates from all partitions into a collective result. MapReduce usually splits the input dataset into independent chunks, which are processed by the map tasks in parallel. The framework sorts the outputs of the maps, which are then input to the reduce tasks. A particular value is fixed in order to filter the required data from the whole database.

First, the user creates an account using a first name, last name, and document id. After creating the account, a unique account number and password are generated for the user. All the user details are stored in a remote server.

5.2 Remote Server

The remote server contains information about the user, and the service provider maintains the user information there. It provides authentication when a user tries to log in to the account. The user information will be stored in the remote server.

The banking transactions of different banks from different locations are uploaded to the cloud storage.

5.4 Big Data Setup

Every day, several millions of transactions can take place in the banking sector, and it is a challenging process to provide an efficient way to explore this big data. The transaction details may contain a single day's transactions, or weekly or monthly transactions.

5.5 Data Split-up/Partitioning
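The map, shuffle/sort, and reduce flow with a fixed filter value, as described in Section 4, can be sketched in-process in plain Python. The per-account sum and the threshold value are illustrative assumptions, not the actual Hadoop job used in the project:

```python
# Minimal in-process sketch of the map -> shuffle/sort -> reduce flow.
# The per-account sum and the fixed filter threshold are illustrative
# assumptions for demonstration only.
from collections import defaultdict
from itertools import chain

THRESHOLD = 1000  # "a particular value is fixed" to filter records

def map_task(chunk):
    """Emit (account_id, amount) pairs for transactions above the threshold."""
    return [(r["account_id"], r["amount"]) for r in chunk if r["amount"] >= THRESHOLD]

def shuffle(mapped_pairs):
    """Sort and group the map outputs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in sorted(chain.from_iterable(mapped_pairs)):
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Aggregate all values for one key."""
    return key, sum(values)

chunks = [
    [{"account_id": "A1", "amount": 1500}, {"account_id": "A2", "amount": 300}],
    [{"account_id": "A1", "amount": 2000}, {"account_id": "A3", "amount": 1200}],
]
mapped = [map_task(c) for c in chunks]  # one map task per independent chunk
result = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'A1': 3500, 'A3': 1200}
```

The A2 record falls below the fixed threshold and is filtered in the map phase; the remaining pairs are grouped by account id before reduction.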
REFERENCES