

Comparative Analysis on Techniques for Big Data Testing

Adiba Abidin, Divya Lal, Naveen Garg, Vikas Deep
Department of Information Technology
Amity University, Uttar Pradesh, India
adibaabidin@gmail.com, lal.divya92@gmail.com, er.gargnaveen@gmail.com, vdeep.amity@gmail.com

Abstract—Big data is a major topic of discussion these days. The term is used everywhere, from newspapers to technical magazines and from social media to journals. Big data refers to complex data sets whose size is beyond the ability of traditional processing techniques to handle within a desired span of time. Big data involves large volumes, possibly petabytes or exabytes of data, comprising billions to trillions of records of millions of people. Testing this huge volume of data is a big challenge. With the emergence of social media, the cloud and smartphones, industries have to deal with voluminous data. Big data provides solutions to complex business problems: analysis of huge data sets serves as a basis for faster and better decision making, and new products and services are being developed for customers[4]. This paper focuses on the various testing techniques that have been implemented.

Keywords—big data; testing; data; huge; organizations; approach.

I. INTRODUCTION

Data has been increasing day by day across the world for the last two decades. There are more than two lakh tweets every minute, millions of queries are searched on Google, huge numbers of videos are uploaded, millions of emails are sent, and a large volume of data is processed on Facebook and created on websites[1]. According to IBM, "Every day we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone". So now the question arises: where does this data go? What do we do with this data? The only answer is to use this data in an efficient and effective manner that can provide some benefit to organizations as well as to customers.

With the advent of new technologies, a large amount of structured and unstructured data is produced, gathered from various sources such as social media, websites, audio, video and so on, which is difficult to manage and process[1]. The need for big data comes from various big companies that analyze huge amounts of data to uncover hidden patterns and important information which can serve as a basis for decision making.

A. Urge of Big Data

Big data is basically a general term used to describe structured and unstructured data. The size of big data is dynamic in nature and varies from organization to organization. Big data is a kind of data set which, if not designed and analyzed properly, can lead to severe failures with adverse effects on organizations. In order to avoid such a situation, testing of big data needs to be done efficiently. The tester should know the big data framework from scratch, and should be clear about the concepts of big data as well as data warehouse concepts. Whether it is a data warehouse or a big data system, the basic component of interest to testers is the data. One might think that someone who knows how to test a data warehouse can also test big data easily, but that is not the case. Only if the tester can identify the basic differences between the two will it be easy to test big data appropriately. As big data involves the use of tools such as Hadoop, the tester must also know the framework; this helps the tester test big data correctly. A robust automation framework can help in doing comprehensive testing.

B. Big Data Characteristics

The four V's that define big data are Variety, Velocity, Volume and Veracity, as shown in Fig. 1.


Fig. 1. 4 V’s of big data

C. Challenges and Solutions of the 4 V's

Variety

Information can be stored in numerous formats, for instance a database, Excel, Access or, for that matter, a plain text file. Sometimes the information is not even in the conventional formats we assume; it may come as video, SMS, PDF or something we may not have thought of. It is the organization's job to organize it and make it meaningful. The real world holds information in a wide range of formats, and that is the challenge we have to overcome with big data. This variety of information characterizes big data[2].

Variety Testing

The problems which occur in variety testing are: (1) validation of semi-structured and unstructured data requires human intervention; (2) due to a lack of properly defined formats, unstructured validation issues exist; (3) scripting to process semi-structured and unstructured data is a big issue; (4) sampling problems.

Solutions for the above challenges are as follows: (1) use compare tools to compare the data and identify inconsistencies; (2) for semi-structured data, conversion into a structured format is required; (3) parsing of unstructured text data is required, after which it is compared with the data output. A sketch of this conversion-and-compare approach is given below.
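As one illustration of solutions (2) and (3), the following Python sketch converts semi-structured JSON records into a flat, structured form and compares them against expected output. It is a minimal sketch: the field names, default values and sample records are hypothetical, chosen only to show the pattern, not taken from any tool discussed in this paper.

```python
import json

# Hypothetical semi-structured input: one JSON document per line.
raw_lines = [
    '{"id": 1, "user": {"name": "asha", "city": "Delhi"}, "amount": 120.5}',
    '{"id": 2, "user": {"name": "ravi"}, "amount": "98.0"}',  # missing city, amount as string
]

def flatten(record):
    """Convert one semi-structured record into a structured (flat) row."""
    user = record.get("user", {})
    return {
        "id": int(record["id"]),
        "name": user.get("name", ""),
        "city": user.get("city", "UNKNOWN"),  # default for missing fields
        "amount": float(record["amount"]),    # normalize numeric types
    }

structured = [flatten(json.loads(line)) for line in raw_lines]

# Expected output of the system under test (hypothetical values).
expected = [
    {"id": 1, "name": "asha", "city": "Delhi", "amount": 120.5},
    {"id": 2, "name": "ravi", "city": "UNKNOWN", "amount": 98.0},
]

# Compare-tool step: report any row-level inconsistency.
for got, want in zip(structured, expected):
    print("OK:" if got == want else "MISMATCH:", got)
```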

Velocity

At first, organizations analyzed information using a batch process: one takes a chunk of data, submits a job to the server and waits for the outcome. That plan works when the incoming data rate is slower than the batch processing rate, so that the outcome is useful despite the delay. With the new sources of data, the batch process breaks down. Information now streams into the server in real time, in a continuous manner, and the outcome is only useful if the delay is short [3].

Velocity Testing

The issues which arise in velocity testing are as follows: (1) building a production-like setup for regression testing; (2) simulating production job runs and node breakdowns.

The solution for the above difficulties is to capture the throughput, the job completion time and the availability of nodes, as sketched below.
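A minimal Python sketch of that measurement idea, assuming a hypothetical `process_batch` job; in a real cluster the records would arrive from a stream and the node count would come from the cluster manager rather than a placeholder.

```python
import time

def process_batch(records):
    """Hypothetical stand-in for the job under test."""
    return [r * 2 for r in records]

records = list(range(100_000))  # simulated incoming stream chunk

start = time.perf_counter()
process_batch(records)
elapsed = time.perf_counter() - start  # job completion time in seconds

throughput = len(records) / elapsed    # records processed per second
available_nodes = 1                    # placeholder; a cluster API would report this

print(f"job completion time: {elapsed:.3f} s")
print(f"throughput: {throughput:,.0f} records/s")
print(f"available nodes: {available_nodes}")
```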
Volume

The new norm is that more sources of data are added on a regular basis. Previously, all of an organization's information was created internally by employees. Now, information is created by employees and customers, and for a certain class of organizations it is additionally produced by machines. For instance, hundreds of millions of smartphones send a variety of data to the network infrastructure, and multiple sensor readings arrive from processing plants, pipelines and so forth. This data did not exist five years ago, and the result is that more data sources with larger data sizes combine to increase the volume of data that must be analyzed and tested. This is a significant issue for those hoping to put that data to use rather than letting it simply vanish.

Volume Testing

The challenges are as follows: (1) the sheer amount of data; (2) splitting and storage of data on various nodes; (3) overall coverage is difficult; (4) data summarization issues.

Approaches for the above challenges are: (1) using a data sampling strategy, chosen based on the data requirements; (2) converting raw data into the expected result format so that it can be compared with the actual output data; (3) preparing comparison scripts. A sampling-and-compare sketch follows.
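The following Python sketch illustrates approach (1) together with a simple comparison script. The sampling rate and the transformation (a trivial uppercase step) are hypothetical placeholders for the real pipeline logic.

```python
import random

random.seed(42)  # reproducible sample

source = [f"record-{i}" for i in range(1_000_000)]  # simulated large data set

# (1) Data sampling strategy: test a 0.1% sample instead of all records.
sample = random.sample(source, k=len(source) // 1000)

def pipeline_transform(record):
    """Hypothetical stand-in for the big data job being tested."""
    return record.upper()

# (2) Convert raw records into the expected result format.
expected = {r: r.upper() for r in sample}

# (3) Comparison script: run the job on the sample and diff the outputs.
mismatches = [r for r in sample if pipeline_transform(r) != expected[r]]

print(f"sampled {len(sample)} of {len(source)} records")
print(f"mismatches: {len(mismatches)}")
```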
Veracity

The data coming from different sources in huge volumes at a very high rate would be meaningless if it were incorrect, leading to various problems for customers and organizations, so organizations need to ensure the correctness of the data. Veracity deals with unpredictable and indefinite data. If the data is erroneous, the result is unreliable data that harms the organizations. Preprocessing of this data therefore needs to be done in order to generate better quality data and improve efficiency.

Veracity Testing

Challenges: irrelevant, unreliable and indefinite data coming from various sources.

Solution: preprocessing of the data is required, which includes testing that the data is clean and reliable, as in the sketch below.
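A small Python sketch of such a preprocessing step, assuming hypothetical validity rules (non-empty name, plausible age range); real rules would come from the data's own specification.

```python
records = [
    {"name": "asha", "age": 29},
    {"name": "", "age": 41},      # unreliable: missing name
    {"name": "ravi", "age": -7},  # indefinite: impossible age
    {"name": "meena", "age": 35},
]

def is_reliable(record):
    """Hypothetical validity rules for one record."""
    return bool(record["name"]) and 0 <= record["age"] <= 120

clean = [r for r in records if is_reliable(r)]
rejected = len(records) - len(clean)

print(f"kept {len(clean)} records, rejected {rejected} unreliable records")
```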
II. APPROACHES OF TESTING BIG DATA

The need of the hour is to leverage big data, so testing of big data needs to be done adequately. Taking this into consideration, we reviewed various research works done in this area.



A. Genetic Algorithm

As the genetic algorithm is sequential in nature and does not support parallelism, the work focuses on parallelizing the genetic algorithm using an extended Hadoop MapReduce for clustering, as shown in figure 2. Two-phase clustering is carried out. The first-phase clustering is performed by splitting the input data; each split is then passed to a mapper. The result is then passed to the second-phase clustering, where a single reducer is used. In other words, the genetic algorithm is implemented with numerous mappers and a single reducer.

Fig. 2. Parallel implementation using Hadoop MapReduce [7]

The work was compared with the sequential algorithm, and it was observed that a high level of accuracy was obtained without much adjustment, further enhancing the processing of large data set clusters.[7] A sketch of the mapper/reducer structure is given below.
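The following is a highly simplified, pure-Python sketch of the numerous-mappers/single-reducer structure described above, run locally rather than on Hadoop. The per-split "genetic" step is reduced to keeping the fittest of several random centroid candidates, which merely stands in for the real GA operators (selection, crossover, mutation); the data and fitness function are hypothetical.

```python
import random

random.seed(0)
# Synthetic 1-D data set with two natural clusters.
data = [random.gauss(0, 1) for _ in range(300)] + [random.gauss(8, 1) for _ in range(300)]

def fitness(centroids, points):
    """Negative total distance of points to their nearest centroid (higher is better)."""
    return -sum(min(abs(p - c) for c in centroids) for p in points)

def mapper(split):
    """Phase 1: each mapper searches for good centroids on its own split.
    A real implementation would evolve the population with GA operators;
    here we simply keep the fittest of 20 random candidate pairs."""
    population = [sorted(random.sample(split, 2)) for _ in range(20)]
    return max(population, key=lambda c: fitness(c, split))

def reducer(mapper_outputs, points):
    """Phase 2: the single reducer picks the globally fittest centroids."""
    return max(mapper_outputs, key=lambda c: fitness(c, points))

# Numerous mappers: split the input data, one mapper per split.
splits = [data[i::4] for i in range(4)]
intermediate = [mapper(s) for s in splits]

best = reducer(intermediate, data)
print("final centroids:", [round(c, 2) for c in best])
```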

B. PSO

The work introduces clustering of data using PSO to estimate the centroids of the clusters marked by the user. The work then continues with K-means clustering to compute the initial swarm.

The work was carried out by comparing K-means, PSO and the hybrid clustering algorithm on the basis of quality. The two PSO techniques, i.e. the standard PSO algorithm and the one seeded with the K-means algorithm, were compared, and it was concluded that the hybrid version shows improved convergence with smaller quantization errors.[11]

Various PSO-based clustering algorithms were used, where the major evaluation parameters were quantization error, objective function value, fitness value, inter- and intra-cluster distance, mean and standard deviation, execution time, error rate and mean square error.[15] A condensed sketch of the hybrid idea appears below.
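This is a condensed one-dimensional Python sketch of the hybrid approach, assuming the standard gbest PSO velocity update and K-means seeding of one particle of the initial swarm; the real studies operate on multidimensional data with larger swarms. The data set and parameter values are hypothetical.

```python
import random

random.seed(1)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(6, 1) for _ in range(200)]
K = 2  # number of clusters marked by the user

def quantization_error(centroids):
    """Mean distance of each point to its nearest centroid (lower is better)."""
    return sum(min(abs(p - c) for c in centroids) for p in data) / len(data)

def kmeans(iters=10):
    """Plain K-means, used only to seed one particle of the hybrid swarm."""
    cents = random.sample(data, K)
    for _ in range(iters):
        buckets = [[] for _ in range(K)]
        for p in data:
            buckets[min(range(K), key=lambda i: abs(p - cents[i]))].append(p)
        cents = [sum(b) / len(b) if b else random.choice(data) for b in buckets]
    return cents

def pso(swarm, iters=50, w=0.72, c1=1.49, c2=1.49):
    """Standard gbest PSO over centroid vectors."""
    vel = [[0.0] * K for _ in swarm]
    pbest = [list(p) for p in swarm]
    gbest = list(min(swarm, key=quantization_error))
    for _ in range(iters):
        for i, p in enumerate(swarm):
            for d in range(K):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - p[d])
                             + c2 * r2 * (gbest[d] - p[d]))
                p[d] += vel[i][d]
            if quantization_error(p) < quantization_error(pbest[i]):
                pbest[i] = list(p)
            if quantization_error(p) < quantization_error(gbest):
                gbest = list(p)
    return gbest

# Standard PSO: fully random initial swarm.
swarm = [random.sample(data, K) for _ in range(10)]
standard = pso([list(p) for p in swarm])

# Hybrid: one particle of the initial swarm is seeded by K-means.
hybrid_swarm = [list(p) for p in swarm]
hybrid_swarm[0] = kmeans()
hybrid = pso(hybrid_swarm)

print("standard PSO quantization error:", round(quantization_error(standard), 4))
print("hybrid PSO quantization error:  ", round(quantization_error(hybrid), 4))
```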



C. Performance Testing

Fig. 3. Representation of a big data analytics application [12]

Kafka queues are used to direct the numerous input streams; the data is then carried to a NoSQL database or HDFS, and depending upon the storage, NoSQL queries or MapReduce programs are used.

There are some important areas which should be taken into consideration to accomplish good performance testing, a few of them being: how data flows across the numerous nodes, how many threads are involved in performing read and write operations, timeout values, the performance of MapReduce, and so on[12]. A small probe of the read/write-thread question is sketched below.
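A minimal local stand-in in Python, using an in-memory queue rather than Kafka; in a real test the queue would be a Kafka topic and the readers and writers its consumers and producers. The thread counts, item count and timeout are hypothetical tuning parameters of the kind the text mentions.

```python
import queue
import threading
import time

q = queue.Queue(maxsize=10_000)
N_WRITERS, N_READERS, ITEMS = 4, 4, 50_000

def writer(n):
    for i in range(n):
        q.put(i, timeout=5)   # the timeout value is itself a test parameter

def reader(n, out):
    for _ in range(n):
        out.append(q.get(timeout=5))

start = time.perf_counter()
threads, results = [], [[] for _ in range(N_READERS)]
for _ in range(N_WRITERS):
    threads.append(threading.Thread(target=writer, args=(ITEMS // N_WRITERS,)))
for out in results:
    threads.append(threading.Thread(target=reader, args=(ITEMS // N_READERS, out)))
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{ITEMS} items through {N_WRITERS} writers / {N_READERS} readers "
      f"in {elapsed:.3f} s ({ITEMS / elapsed:,.0f} items/s)")
```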
D. Regression Testing

The work emphasizes taking the evaluation from the client for further improvement, in order to prevent delay in the maintenance process and an increase in cost.[13] The paper focuses on methods of data generation and workload characterization, which help in the maintenance life cycle of a commercial big data system. A data-generation sketch follows.
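As one way to generate regression test data, the Python sketch below produces a reproducible synthetic data set; the schema is hypothetical. Reproducibility (a fixed seed) is the design point: the same input can be regenerated and the same workload replayed after every maintenance change, so outputs can be compared across releases.

```python
import csv
import random

def generate_regression_data(path, rows, seed=7):
    """Write a reproducible synthetic data set for regression runs."""
    rng = random.Random(seed)  # same seed -> identical file every time
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id", "region", "amount"])  # hypothetical schema
        for i in range(rows):
            w.writerow([i,
                        rng.choice(["north", "south", "east", "west"]),
                        round(rng.uniform(1, 1000), 2)])

generate_regression_data("regression_input.csv", rows=10_000)
print("regression data set written: regression_input.csv")
```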
E. Failover Testing

The work emphasizes validating the recovery process: data processing should switch between nodes seamlessly. Failover testing is carried out to ensure data recovery, prevent data corruption and manage edit logs. Two metrics are handled during this testing: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO).[5]
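These two metrics can be computed from timestamps recorded during a simulated node failure, as in the small Python sketch below; the timestamp values are hypothetical.

```python
from datetime import datetime

# Hypothetical timestamps recorded during a simulated node failure.
last_good_backup = datetime(2016, 3, 1, 9, 55, 0)   # last point data is recoverable to
failure_time = datetime(2016, 3, 1, 10, 0, 0)       # node goes down
service_restored = datetime(2016, 3, 1, 10, 7, 30)  # processing resumes on other nodes

# RPO: how much data (in time) can be lost; RTO: how long recovery takes.
rpo = failure_time - last_good_backup
rto = service_restored - failure_time

print(f"RPO: {rpo.total_seconds() / 60:.1f} minutes of data at risk")
print(f"RTO: {rto.total_seconds() / 60:.1f} minutes to recover")
```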
III. RESULT AND CONCLUSION

Testing techniques have been discussed in Section II and, based on this discussion, some benefits of these techniques are shown in Table I.

TABLE I. BENEFITS OF DIFFERENT APPROACHES

S.No.  Approach             Benefit
1      Genetic Algorithm    Execution time is reduced and a high level of accuracy is achieved
2      PSO                  PSO merged with the K-means algorithm provides more efficient results in terms of quality than the PSO algorithm alone
3      Performance Testing  Numerous options are provided for a typical objective, reducing the complexity
4      Regression Testing   A complete and representative regression database can shorten the maintenance process
5      Failover Testing     An efficient way of managing data recovery and edit logs and preventing data corruption

Comparison of techniques based on the 4 V's: as the 4 V's play an important role in testing big data, the testing techniques have been compared on the basis of the 4 V's, as shown in Table II.

TABLE II. COMPARING THE VARIOUS APPROACHES

S.No.  Approach             Variety  Velocity  Volume  Veracity
1      GA                   YES      YES       YES     NO
2      PSO                  YES      YES       YES     NO
3      Performance Testing  YES      YES       YES     YES
4      Regression Testing   YES      YES       YES     YES
5      Failover Testing     YES      YES       YES     YES

IV. FUTURE WORK

In the current business environment, organizations not only need to discover the significant information they require, they must discover it rapidly. Organizations act and take decisions with the help of visualization, but the resulting challenge is the high volume of data and the depth of analysis required. Data analysis carried out with the help of visualization requires a high level of understanding. If the data is not proper and does not reach the user on time, then analyzing it rapidly is of no use. Plotting points on a chart for investigation becomes troublesome when managing extremely large amounts of data or an assortment of categories of data. Rather than using tables containing text and numbers, graphical representation is a good and fast way to achieve visualization, but the huge and complex nature of big data makes plotting difficult.[10] These challenges can be addressed by making the testing techniques more efficient. As data increases day by day, the necessity of controlling and testing data becomes more and more important. The most important things a tester has to keep in mind are the dynamic nature of the data and the various performance bottleneck issues associated with big data. The techniques described above can be used for testing big data; in addition, in the near future some of these techniques could be combined in order to make the results more efficient.

References

[1] Shilpa and Manjit Kaur, "Big Data and Methodology - A Review", IJARCSSE, vol. 3, issue 10, pp. 991-995, October 2013.
[2] "Big Data – What is Big Data – 3 Vs of Big Data – Volume, Velocity and Variety – Day 2 of 21", SQLAuthority.com, October 2013.
[3] "The 3Vs that define Big Data", Data Science Central, July 2012.
[4] Tom Davenport, "Three Big Benefits of Big Data Analytics", sascom magazine.
[5] Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja, "Big Data: Testing Approach to Overcome Quality Challenges", Infosys Labs Briefings, vol. 11, no. 1, pp. 65-72, 2013.
[6] Nivranshu Hans, Sana Mahajan and S. N. Omkar, "Big Data Clustering Using Genetic Algorithm on Hadoop MapReduce", International Journal of Scientific & Technology Research, vol. 4, issue 4, pp. 58-62, April 2015.
[7] Dian Palupi Rini, Siti Mariyam Shamsuddin and Siti Sophiyati Yuhaniz, "Particle Swarm Optimization: Technique, System and Challenges", International Journal of Computer Applications, vol. 14, no. 1, pp. 19-27, January 2011.


[8] "Five Big Data Challenges and How to Overcome Them with Visual Analytics", SAS.
[9] D. W. van der Merwe and A. P. Engelbrecht, "Data Clustering Using Particle Swarm Optimization", Department of Computer Science, University of Pretoria.
[10] Mustafa Batterywala and Shirish Bhale, "Performance Testing of Big Data Applications", Impetus Technologies, STC 2013.
[11] Alexander Alexandrov, Christoph Brucke and Volker Markl, "Issues in Big Data Testing and Benchmarking", Technische Universität Berlin.
[12] Bhasker Allene and Marco Righini, "Better Performance for Big Data", Intel Corporation, 2013.
[13] Bhagyashree Bhoyar, Pramod Patil and Priyanka Abhang, "A Survey of Accelerated PSO Swarm Search Feature Selection for Data Stream Mining Big Data", IJIET, vol. 6, issue 3, pp. 53-58, February 2016.


