Yan Ding¹, Huaimin Wang, Peichang Shi, Hongyi Fu
National University of Defense Technology
Changsha, China
¹yanding@nudt.edu.cn
Changguo Guo, Muhua Zhang
Chinese Electronic Equipment System Corporation Institute
Beijing, China
Abstract—Computation integrity is difficult to verify when mass data processing is outsourced. Current integrity protection mechanisms and policies verify the results generated by participating nodes within the computing environment of a service provider (SP), but they cannot prevent subjective cheating by the SP itself. This paper provides an analysis and a model of computation integrity for mass data processing services. A third-party sampling-result verification method, called trusted sampling-based third-party result verification (TS-TRV), is proposed to prevent lazy cheating by SPs. TS-TRV is a general solution for common computing jobs and uses the powerful computing capability of SPs to support verification computing, thus lessening the computing and transmission burden on the verifier. Theoretical analysis and a series of simulation experiments indicate that TS-TRV effectively detects the cheating behavior of the SP while ensuring the authenticity of sampling. Whereas the transmission overhead of naïve sampling verification is O(N), the network transmission overhead of TS-TRV is only O(logN). TS-TRV thus efficiently solves the verification problem for the intermediate results of MapReduce-based mass data processing.
Keywords—result verification, mass data processing, MapReduce, trusted sampling, Merkle tree
I. INTRODUCTION
Cloud computing is currently the focus of much research. Its key feature, servitization [1], provides remarkable convenience to users. However, the outsourcing service mode raises new issues: the control of resources and processing shifts from users to the cloud service provider (SP), which makes service processing uncontrollable and results hard to verify. Consequently, the integrity of services is technically assured only by the SP. In reality, cloud computing is charged on demand; thus, profit-driven SPs of services such as data storage [2][3] and large-scale computing outsourcing [4][5] may degrade service quality and shrink the size of the computing problem actually solved.
Currently, mass data processing techniques are in demand in both research and business domains, and numerous data analysis and processing services are provided to accommodate various requests. Actual service modes include processing and analyzing data specified by users, as well as data analysis services over open data platforms. Because the concrete data processing is performed by the SP, users cannot tell whether their data have been processed completely, and they lack the computing resources to verify the authenticity of the results given the huge data volume involved. SPs are therefore able to cut corners in computing and falsify results for profit; in practice, the data analysis results provided by inexpensive, minor data analysis agents are often doubted. Verifying the computation integrity of mass data processing is thus necessary to guarantee service quality and protect user interests.
In practical mass data processing, the large amount of computed data makes traditional relational data management unsuitable. MapReduce [6], a parallel programming framework, dynamically organizes a large number of nodes into a computing environment and uses the principle of parallel computing to fulfill mass data processing tasks. MapReduce has been adopted by Google, Yahoo!, Amazon, and Facebook, and has become the dominant mass data processing technique. Thus, studying computation integrity verification for MapReduce is a practical way to address the service integrity issue of mass data processing. Current research on result verification for mass data processing focuses mainly on maintaining the integrity of results generated by the participating computing nodes inside a computing environment. In cloud computing, however, trust between the user and the SP is a major factor when the user chooses a service. Therefore, from the users' viewpoint, it is important to verify not only the computing result of mass data processing but also the computation integrity of the SP itself.
To address this challenge, this paper proposes the trusted sampling-based third-party result verification method (TS-TRV). By sampling the MapReduce intermediate results, we can verify whether user data are processed completely in the map phase. TS-TRV utilizes the Merkle tree [7] to organize the intermediate results of the SP for verification, thereby guaranteeing the authenticity of sampling and decreasing the overhead of result submission. Theoretical analysis and simulation experiments show that the communication overhead of TS-TRV is O(logN), whereas that of common sampling techniques is O(N). The computational overhead of verification falls mainly on the side of the SP; the verifier thus saves computing and network transmission costs, which relaxes the requirements on its computing environment, in line with the cloud computing principle that computing should be concentrated on the cloud side.
The paper is organized as follows. Section 2 introduces related research in this domain. Section 3 defines the MapReduce computing model and the cheating model, and Section 4 presents TS-TRV. Section 5 presents the analysis and experimental evaluations as well as the comparison of TS-TRV with similar works. Section 6 concludes this paper and presents our future research focus.
2013 IEEE Seventh International Symposium on Service-Oriented System Engineering
978-0-7695-4944-6/12 $26.00 © 2012 IEEE
DOI 10.1109/SOSE.2013.65
II. RELATED WORKS
The result verification of outsourced computation emerged as a topic along with distributed computing modes. Given that computational jobs are dispatched to participating nodes that the job manager cannot control, the computing results submitted by nodes need to be verified before use. Result verification comprises three categories of techniques. First, replication and voting apply redundant computing: multiple computing nodes perform the same job, and a result is accepted when it is submitted by more than half of the total nodes [8]. Second, sampling techniques address the resource cost of replication, including result-based sampling [9] and test-job injection sampling [10]; with sampling, computation results are verified and trusted with a certain probability. Finally, checkpointing deals with result verification for sequential computation [11]: the computation is divided into several time slices, and at the end of each slice a checkpoint is taken to partially verify the computing result.
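The replication-and-voting idea can be sketched in a few lines; this is a minimal illustration in which the majority threshold follows the more-than-half rule of [8], and all names are ours:

```python
from collections import Counter

def vote(results):
    """Accept a result only if more than half of the redundant
    nodes submitted it; otherwise reject the computation."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

print(vote([42, 42, 7]))   # 42: the majority agrees
print(vote([1, 2, 3]))     # None: no majority, result rejected
```

The cost is obvious from the sketch: every job is executed by several nodes, which motivates the sampling techniques described next.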
Currently, research studies concerning result verification
for mass data processing of MapReduce focus on the
computation integrity of the inner nodes in the MapReduce
computing environment. Wei Wei et al. worked on the
verification problem of computation results in an open
MapReduce environment [12]. According to them, a
computation result generated by participating nodes from
different resource owners may not be trusted. Thus, they
proposed an integrity protection mechanism called
SecureMR, which uses two-copy replication to verify the
result in the map phase. Results can be submitted to the
reduce phase only if the results of all copies are the same.
SecureMR aims for a 100% detection rate. However, this method increases computational costs and cannot cope with collusion. Building on this finding, Yongzhi Wang and his colleagues addressed the collusion problem [13]. They introduced a verifier role into the MapReduce
computing model. Computation results undergo replication
verification, are sampled, and are then recomputed by the
verifier, thus solving the collusion problem to a certain
extent. However, this method is based on the assumption
that the verifier is absolutely trusted. Thus, the verifier
becomes a system bottleneck. Z. Xiao et al. worked on result cheating caused by network attacks on the working nodes in the MapReduce platform [14]. They used a set of trusted auditing nodes to record the results generated
by various phases of MapReduce. The cheating nodes can
be located by recomputing the results. Considering the credibility of the computed objects, Chu Huang et al. proposed a watermark injection method to verify whether submitted jobs are completed correctly [15]. The watermarks used for verification are inserted randomly into the job before the job is submitted by the user. After the result is returned, the watermarks are checked first to determine whether they were processed correctly; if they were, the integrity requirement is assumed to be met with a certain probability. This solution is effective for text processing jobs, which can utilize substitution encryption to generate watermarks. However, creating watermarks is difficult for jobs whose outputs are hard to predict, such as statistics.
Current studies on result verification for MapReduce focus on the inner computing environment. There is an urgent need for simple and efficient computation integrity verification of common MapReduce computing results from the viewpoint of users. Given the scale of mass data processing jobs, verifying the result of a whole MapReduce job via replication induces unacceptable computational overhead. Thus, decreasing the overhead induced by verification becomes a vital factor to consider.
III. RESULT CHEATING PROBLEM ON MASS DATA PROCESSING
A. MapReduce programming model
The MapReduce programming model consists of a single master node (job tracker) and several slave nodes (task trackers). Taking Hadoop [16] as a sample implementation, we can illustrate the MapReduce programming model as follows.
Figure 1. Illustration of the MapReduce programming model.
As shown in Fig. 1, the MapReduce process can be divided into two phases: map and reduce. First, during the map phase, the input is partitioned into m splits, which are independent from each other. The master node dispatches these splits to several worker nodes, called mappers, to perform parallel map operations. During execution, each mapper deals with one split, casts the map operation on all input key-value pairs, and saves the result on the local node; the computational result of this phase is called the intermediate result. When map computation is completed, all intermediate results are partitioned into r different parts according to their keys, and every partition is assigned to a worker node, called a reducer, to cast the reduce operation. In the reduce phase, each reducer reads its partition of the intermediate results from all necessary mapper nodes and casts the reduce operation on them to obtain the final result, which is then stored on the distributed file system.
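The two phases above can be sketched as a single-process Python illustration (this is not Hadoop itself; word count stands in for the user-defined job, and all names are ours):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map phase: each mapper processes one split independently and
    # emits intermediate <key2, value2> pairs.
    intermediate = []
    for split in splits:
        for key1, value1 in split:
            intermediate.extend(map_fn(key1, value1))

    # Shuffle: partition intermediate results by key so that each
    # reducer sees all values belonging to its keys.
    partitions = defaultdict(list)
    for key2, value2 in intermediate:
        partitions[key2].append(value2)

    # Reduce phase: each partition is reduced to a final <key3, value3> pair.
    return {key2: reduce_fn(key2, values) for key2, values in partitions.items()}

# Word count as the user-defined map and reduce functions.
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

splits = [[(0, "a b a")], [(1, "b c")]]
print(run_mapreduce(splits, wc_map, wc_reduce))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real deployment the map and reduce loops run on different machines; the single-process form only mirrors the data flow.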
On the basis of the MapReduce principle, we construct
definitions of the MapReduce programming model as
follows:
Definition 1 (MapReduce programming model). Given a problem with an input set D = {x1, ..., xn}, where each xi is in the <key1, value1> key-value pair format, the computation procedure includes two phases, and the results form a set R.
• Map phase
In this phase, the computation of each input key-value pair is independent of the others. Thus, the map function can be regarded as a one-to-one mapping and denoted as f(x). Hence, the computation in the map phase can be expressed as follows: for all xi ∈ D, yi = f(xi) is computed, where f is the user-defined map function. The intermediate result set is Y = {y1, ..., yn}, where yi is in the <key2, value2> key-value pair format.
In a real process, the job input D is divided into m sets Di, where i ∈ [1, m]. Each Di is dispatched to a mapper, and the m mappers complete the computation in parallel. The computation results generated by the mappers form sets Yi, where i ∈ [1, m] and Y = Y1 ∪ ... ∪ Ym.
• Reduce phase
Reduce computation is completed by reduction of the results generated in the map phase. Thus, the reduce function can be regarded as a mapping from the intermediate result set Y to the final result R. Therefore, the computation in the reduce phase can be expressed as R = g(Y), where g is the user-defined reduce function, and the result set is R = {r1, ..., rs}, where ri is in the <key3, value3> key-value pair format.
In a real process, Y is divided into several sets according to the value of key2 of each yi, and each set is assigned to a reducer; the r reducers work in parallel and produce the computation result sets Ri, with R = R1 ∪ ... ∪ Rr.
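The division of Y by key2 can be illustrated with the usual hash-partition rule (a sketch; the modulo rule and function names are our assumption, not part of the model):

```python
def partition(intermediate, r):
    """Split intermediate <key2, value2> pairs into r sets Y_1..Y_r so
    that all pairs sharing a key2 land in the same reducer's set."""
    parts = [[] for _ in range(r)]
    for key2, value2 in intermediate:
        parts[hash(key2) % r].append((key2, value2))
    return parts

Y = [("x", 1), ("y", 2), ("x", 3)]
parts = partition(Y, 2)  # both "x" pairs end up in one partition
```

Any function of key2 alone works here; what matters for correctness of the reduce phase is only that equal keys are never split across reducers.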
B. Cheating model
During MapReduce data processing, the SP can gain business profit by conducting only partial computation to save computational costs, or by returning a false result to confuse the user; in both cases, computational integrity is compromised. Based on its motivation, the possible cheating behavior of the SP falls into two categories:
• Lazy cheating
Under this model, the SP performs only part of the computation task and uses the partial computing result as a substitute for the real result, which could only be generated by performing all necessary computations. Computational cost is thereby lowered and extra profit is gained. Such cheating has three types according to the phase in which it occurs: 1) Cheating in map phase: The SP does the actual
computing only on part of the input and performs cheaper
computing
f′(xi) (for the remaining xi ∈ D) as a substitute for the true map function f(xi).

v = hash(v_left || v_right)

where v represents a node in the tree, hash is the specified one-way hash function (which may be MD5, SHA-1, or similar), || stands for the concatenation of two hash values, and R denotes the committed root value. The SP collects the l sibling-node values along the path from yi's corresponding node to the root node in mapper k's Merkle tree, where l is the height of k's Merkle tree. Then, at the master node, the values along the path formed by the mapper root values are collected in the same way.
For example, suppose the input x6 of the third map node is sampled. As shown in Fig. 1, the sample corresponds to leaf L6 in the Merkle tree of the third node. To rebuild the root node from L6, the necessary nodes form the set {L5, A, B, C}. As shown in Fig. 2, the necessary nodes for calculating the global root node at the global Merkle-tree level are {R4, D, E}. Therefore, the verification information set of this sampling is {L5, A, B, C, R4, D, E}. After constructing the response sets for all sampling challenges, the SP sends them to the verifier.
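The per-mapper Merkle tree and the authentication-path check can be made concrete with a minimal sketch (our own assumptions: SHA-1 as the one-way hash, a power-of-two number of leaves, illustrative names):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(leaves):
    """Return all levels of the Merkle tree, from leaf hashes to the root."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def auth_path(levels, index):
    """Sibling values along the path from leaf `index` to the root --
    the verification information the SP returns for one sample."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1                     # sibling of node i at this level
        path.append((sibling % 2, level[sibling]))  # (side, hash value)
        index //= 2
    return path

def verify(leaf, path, root):
    """Rebuild the root from a sampled leaf and compare with the commitment."""
    node = h(leaf)
    for side, sib in path:
        node = h(node + sib) if side == 1 else h(sib + node)
    return node == root

leaves = [b"y%d" % i for i in range(8)]   # 8 intermediate results of one mapper
levels = build_tree(leaves)
root = levels[-1][0]                      # root value committed to the verifier
path = auth_path(levels, 5)               # sample the 6th result
assert verify(leaves[5], path, root)      # an authentic sample passes
assert not verify(b"forged", path, root)  # a substituted result is detected
```

Because the root was committed before sampling, the SP cannot swap in a recomputed yi after seeing the challenge without breaking the hash chain.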
Step 4: Result verification
For each sampled input xi, the verifier first calculates f(xi) and then compares f(xi) with yi. If they are not equal, the SP cheated. Otherwise, the verifier uses the verification information set to rebuild the root value and compares it with the committed root value; any mismatch likewise reveals cheating.
For s samplings, the cheating detection rate of the verification is

P = 1 − (1 − p1·p2)^s    (2)
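If the detection rate of s samplings takes the form P = 1 − (1 − p1·p2)^s — an assumed reading of formula (2), consistent with p1 and p2 playing equivalent roles as stated in the experiments — then the sample count needed for a target rate follows directly:

```python
import math

def detection_rate(p1, p2, s):
    # Each of the s independent samples hits a cheated input with
    # probability p1 * p2; detection means at least one hit.
    return 1.0 - (1.0 - p1 * p2) ** s

def samples_needed(p1, p2, target):
    # Smallest s with detection_rate(p1, p2, s) >= target.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p1 * p2))

# e.g. with p1 = p2 = 0.3, reaching 99% detection:
s = samples_needed(0.3, 0.3, 0.99)
print(s)  # 49
```

The sample count grows only logarithmically in the residual miss probability, which is why a few hundred samples suffice in the experiments below.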
A simulation environment was then constructed in Matlab to test the cheating detection rate of TS-TRV. The input size is N = 150,000 and the number of mapper nodes is M = 150, so each mapper deals with 1,000 inputs. First, errors are injected according to different mapper cheating probabilities p1 and intra-node cheating probabilities p2, and then sampling tests are conducted on the result. To reduce the effect of random factors on the test result, every parameter configuration is repeated 200 times, and the mean values are taken as the final result.
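The error-injection experiment can be re-sketched in Python (the paper used Matlab; the parameters follow the text, while sampling with replacement is our simplifying assumption):

```python
import random

def simulate_detection(n_mappers, inputs_per_mapper, p1, p2, s, trials=200):
    """Monte Carlo estimate of the cheating detection rate: mark each
    mapper as cheating with probability p1, each of its inputs as
    cheated with probability p2, then draw s samples and check whether
    any cheated input is hit."""
    detected = 0
    for _ in range(trials):
        cheating_mappers = [m for m in range(n_mappers) if random.random() < p1]
        cheated = {(m, i) for m in cheating_mappers
                   for i in range(inputs_per_mapper) if random.random() < p2}
        samples = {(random.randrange(n_mappers), random.randrange(inputs_per_mapper))
                   for _ in range(s)}
        if samples & cheated:
            detected += 1
    return detected / trials

# Scaled-down run in the spirit of the paper's configuration.
rate = simulate_detection(n_mappers=150, inputs_per_mapper=100,
                          p1=0.3, p2=0.3, s=100)
```

Because cheating mappers are drawn per trial, a very small p1 combined with few mappers yields the randomness effect discussed below.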
Fig. 5 shows how, for a fixed number of samples, the cheating detection rate varies with p1 and p2. In Figs. 5-a and 5-b, 100 and 300 samples were selected, respectively, and a test was conducted to determine how the cheating detection rate follows changes in p1 and p2. When the number of samples is 100, 15.3% of all 2,601 dots have a detection rate lower than 95%, and 77.32% of the results are higher than 99%. When the number of samples increases to 300, only 8.3% of the dots have a detection rate lower than 95%, and 87.93% of the results are higher than 99%. According to formula (2), the mapper cheating probability p1 and the intra-node cheating probability p2 have equivalent effects on the cheating detection rate. In the experiment, however, errors are injected at a specific mapper cheating probability; when the number of mappers is rather small, the error injection is strongly affected by randomness. As a result, the cheating detection rate cannot be improved by increasing p2 when p1 is very small. When the number of mappers becomes larger than a certain value (M ≥ 100), the system cheating detection rate follows the theoretical analysis of formula (2).
(Fig. 5-a: cheating detection rate versus p1 and p2, sample number = 100.)
Figure 5. Relationship between cheating detection rate and cheating
probability.
For the commitment-based sampling method, the time spent on network transmission is mainly used for passing all intermediate results to the verifier; the concrete sampling and verification steps are done solely by the verifier, and no interaction with the SP is needed. Hence, if the input size is N, the transfer overhead is around O(N). With TS-TRV, only the root value of the Merkle tree is transferred during result submission, so the network overhead of that step is merely O(1). In addition, after sampling, the SP needs to transfer the verification information sets; if the input size is N and the number of samples is s, the network transfer overhead is O(logN).
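The gap is concrete at the experimental scale; as a back-of-the-envelope illustration (our own assumption: each transferred value is a 20-byte SHA-1 digest):

```python
import math

N = 150_000   # input size used in the paper's experiments
HASH = 20     # bytes per value, assuming SHA-1 digests

naive_bytes = N * HASH                          # ship all N values: O(N)
path_bytes = math.ceil(math.log2(N)) * HASH     # one authentication path: O(logN)

print(naive_bytes)  # 3000000
print(path_bytes)   # 360 (an 18-level path)
```

Even multiplied by a few hundred samples, the per-path cost stays orders of magnitude below shipping the full intermediate result set.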
VI. CONCLUSIONS
This paper provided an analysis and modeling of
computing integrity for mass data processing services. To
handle the lazy cheating model of SPs, we proposed a third-
party sampling result verification method called TS-TRV.
The cheating detection rate and the performance of TS-TRV
are both analyzed and simulated in an experimental
environment. The results show that TS-TRV achieves a high cheating detection rate and low network transfer overhead. Its verification cost is concentrated on the SP's side, thereby reducing the computing and transmission burden on the verifier. In our future work, we will address the computational integrity issue in the reduce phase of the MapReduce framework.
ACKNOWLEDGMENT
This work was supported by the National Basic Research
Program of China under Grant No.2011CB302600, the
National Natural Science Foundation of China under Grant
No. 61161160565, the HGJ Major Project of China under
Grant No. 2012ZX01040001 and the Fund No. KJ-12-06.
REFERENCES
[1] D. G. Feng, M. Zhang, Y. Zhang and Z. Xu, "Study on Cloud Computing Security," Journal of Software, Vol. 22, Jan. 2011, pp. 71-83 (in Chinese), doi: 10.3724/SP.J.1001.2011.03958.
[2] A. Juels and B. S. Kaliski, "PORs: Proofs of retrievability for large files," Proc. the 14th ACM Conf. on Computer and Communications Security (CCS '07), ACM Press, Oct. 2007, pp. 584-597, doi: 10.1145/1315245.1315317.
[3] G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner et al., "Provable data possession at untrusted stores," Proc. the 14th ACM Conf. on Computer and Communications Security (CCS '07), ACM Press, Oct. 2007, pp. 598-609, doi: 10.1145/1315245.1315318.
[4] C. Wang, K. Ren, J. Wang and K. M. R. Urs, "Harnessing the Cloud for Securely Outsourcing Large-scale Systems of Linear Equations," Proc. the 31st International Conference on Distributed Computing Systems (ICDCS '11), IEEE Press, Jun. 2011, pp. 549-558, doi: 10.1109/ICDCS.2011.41.
[5] C. Wang, K. Ren and J. Wang, "Secure and Practical Outsourcing of Linear Programming in Cloud Computing," Proc. IEEE INFOCOM 2011, IEEE Press, Apr. 2011, pp. 820-828, doi: 10.1109/INFCOM.2011.5935305.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Proc. the 6th Symposium on Operating Systems Design & Implementation (OSDI '04), USENIX Association, Berkeley, Mar. 2004, pp. 10-10.
[7] R. Merkle, "Secrecy, Authentication, and Public Key Systems," PhD thesis, Electrical Engineering, Stanford University, 1979.
[8] M. Taufer, D. Anderson, P. Cicotti and C. Brooks III, "Homogeneous redundancy: A technique to ensure integrity of molecular simulation results using public computing," Proc. the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS '05), Workshop 1, IEEE Press, Apr. 2005, p. 119a, doi: 10.1109/IPDPS.2005.247.
[9] W. Du, J. Jia, M. Mangal and M. Murugesan, "Uncheatable grid computing," Proc. the 24th International Conference on Distributed Computing Systems (ICDCS '04), IEEE Press, Mar. 2004, pp. 4-11, doi: 10.1109/ICDCS.2004.1281562.
[10] S. Zhao, V. Lo and C. G. Dickey, "Result Verification and Trust-Based Scheduling in Peer-to-Peer Grids," Proc. the 5th IEEE International Conference on Peer-to-Peer Computing (P2P '05), IEEE Press, Aug. 2005, pp. 31-38, doi: 10.1109/P2P.2005.32.
[11] F. Monrose, P. Wyckoff and A. Rubin, "Distributed execution with remote audit," Proc. the Network and Distributed System Security Symposium (NDSS '99), Internet Society, Feb. 1999, pp. 103-113.
[12] W. Wei, J. Du, T. Yu and X. Gu, "SecureMR: A service integrity assurance framework for MapReduce," Proc. the 25th Annual Computer Security Applications Conference (ACSAC '09), IEEE Press, Dec. 2009, pp. 73-82, doi: 10.1109/ACSAC.2009.17.
[13] Y. Wang and J. Wei, "VIAF: Verification-based Integrity Assurance Framework for MapReduce," Proc. IEEE International Conference on Cloud Computing (CLOUD '11), IEEE Press, Jul. 2011, pp. 300-307, doi: 10.1109/CLOUD.2011.33.
[14] Z. Xiao and Y. Xiao, "Accountable MapReduce in cloud computing," Proc. IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS '11), IEEE Press, Apr. 2011, pp. 1082-1087, doi: 10.1109/INFCOMW.2011.5928788.
[15] C. Huang, S. Zhu and D. Wu, "Towards Trusted Services: Result Verification Schemes for MapReduce," Proc. the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid '12), IEEE Press, May 2012, pp. 41-48, doi: 10.1109/CCGrid.2012.77.
[16] Apache Hadoop. Available: http://hadoop.apache.org
(Fig. 5-b: cheating detection rate versus p1 and p2, sample number = 300.)
396