Académique Documents
Professionnel Documents
Culture Documents
KEYWORDS
Big data, analytics, workflow management
1 INTRODUCTION
The analysis of Big Data is a core and critical
task in multifarious domains of science and
industry. Such analysis needs to be performed
on a range of data stores, both traditional and
modern, on data sources that are heterogeneous
in their schemas and formats, and on a diversity
of query engines.
The users that need to perform such data
analysis may have several roles, like, business
analysts, engineers, end-users, scientists etc.
Users with different roles may need different
84
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
http://www.asap-fp7.eu
vertex&
edge&
processor&
input&or&output&
data&and&metadata&
plain$
sink$
root$
plain$
sink$
root$
plain$
root$
plain$
plain$
85
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
P1#
I1#
P2#
I1#
O1#
O2#
task2#
O1#
P3#
I2#
task1#
task3#
O2#
86
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
I1#
P1#
O1#
O3#
I1# I3#
I1#
I1#
P2#
O2#
O1#
P1#
O3#
O2#
P2#
histogram&
rela.on&
create&&
histogram&
other&
sta.s.cs&
further&
process&
store&&
in&log&
87
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
select&
O1&
I2&
O2&
join&
O3&
project&
I1&
sort&
O&
O4&
I"
edge"
O"
P1$
P1$
edge$
edge$
P2$
P2$
88
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
data&
lter&
stream&
batch&
average&
data)
in#memory)
algorithm)
vola3le)
persistent)
disk#based)
algorithm)
data*
algorithm*
data*with**
required*
availability*
guarantees***
algorithm*that*
locks*the*data*
Reliability.
Metadata
concerning
the
correctness or completeness of data. The values
may be the following:
- Reliable: The data are considered to be correct
and complete2.
- Non-reliable: The data are considered to be
either incorrect and/or incomplete.
The above values of the metadata types may,
even further, have properties, especially related
to quantification. For example, the flow values
may be characterised by size (e.g. the size of
the batches or streams or the range of their
varying sizes) or therate of their availability in
time; the persistence values may be
characterised by the lifetime of data or required
2
89
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
90
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
"DB": <db_meta>}}
}}
91
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
"constrains": {
"input1": {
"data_info": {
"attributes":["$attr1","$attr2"]}},
"input2": {
"data_info": {
"attributes":["$attr1","$attr2"]}},
"output1": {
"data_info": {
"attributes":["$attr1","$attr2"]}},
"op_specification": {
"algorithm": {
"join": {
"join_condition":
"input1.$attr1=input2.$attr1"}}}}
}}
calc1$
select1$
calc3$
join$
select2$
kMeans$
calc2$
result$
calc1$
calc2$
archive$
history$
92
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
93
Proceedings of the Second International Conference on Data Mining, Internet Computing, and Big Data, Reduit, Mauritius 2015
9 CONCLUSIONS
This paper proposes a novel workflow model
for the expression of analytics tasks on Big
Data. The model enables the separation of task
dependencies from task functionality. Using the
proposed model, a user is able to express a
variety of application logics and to set her
degree of control on the execution of the
workflow. Finally, we depict the proposed
workflow model on specific use cases from the
telecommunication and web analytics domains.
Ongoing and future work focuses on defining
methods for the manipulation of the workflow
structure in order to optimise its execution
semantics.
ACKNOWLEDGEMENTS
The research leading to these results has
received funding from the European Union
Seventh Framework Programme (FP7/2007-
REFERENCES
[1]
[2]
[3]
[4]
Pegasus. http://pegasus.isi.edu/.
[5]
[6]
[7]
[8]
[9]
[10]
[11]
94