Abstract—In the Big Data era, machine learning has more potential to discover valuable insights from the data. As an important machine learning technique, Bayesian Network (BN) has been widely used to model probabilistic relationships among variables. To deal with the challenges of Big Data BN learning, we apply distributed data-parallelism (DDP) and scientific workflow techniques to the BN learning process. We first propose an intelligent Big Data pre-processing approach and a data quality score to measure and ensure data quality and data faithfulness. Then, a new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. To easily integrate the algorithm with DDP engines, such as Hadoop, we employ the Kepler scientific workflow system to build the whole learning process. We demonstrate how Kepler can facilitate building and running our Big Data BN learning application. Our experiments show good scalability and learning accuracy when running the application in real distributed environments.

Keywords—Big Data; Bayesian network; Distributed computing; Ensemble learning; Scientific workflow; Kepler; Hadoop

I. INTRODUCTION

With the explosive growth of data encountered in our daily lives, we have entered the Big Data era [1]. For the United States health care sector alone, the creative and effective use of Big Data could bring more than $300 billion in potential annual value [35]. Making the best use of Big Data and releasing its value are core problems that researchers all over the world have longed to solve since the new millennium.

Bayesian Network (BN), a probabilistic graph model, provides intuitive and theoretically solid mechanisms for processing uncertain information and presenting causalities among variables. BN is an ideal tool for causal relationship modeling and probabilistic reasoning, and is widely used in modeling [2][3][4], prediction [5][6][7], and risk analysis [8]. BNs have been applied in a wide range of domains such as health care, education, finance, environment, bioinformatics, telecommunication, and information technology [9]. With abundant data resources nowadays, learning BNs from Big Data could discover valuable business insights [4] and bring potential revenue value [8] to different domains.

To efficiently process large quantities of data, a scalable approach is needed. Although distributed data-parallelism (DDP) patterns, such as Map, Reduce, Match, CoGroup and Cross, are promising techniques for building scalable data-parallel analysis and analytics applications, applying the DDP patterns to Big Data BN learning still faces several challenges: (1) How can we effectively pre-process Big Data to evaluate its quality and reduce its size if necessary? (2) How can we design a workflow capable of taking gigabyte-scale data sets and learning BNs with decent accuracy? (3) How can we provide easy scalability support to BN learning algorithms? These three questions have not received substantial attention in current research. This is the main motivation for this work: the creation of the novel Scalable Bayesian Network Learning (SBNL) workflow. The SBNL workflow makes three contributions to the current literature:

• Intelligent Big Data pre-processing through a proposed data quality score, called SArc, to measure and ensure data quality and data faithfulness.

• Effective BN learning from Big Data by leveraging ensemble learning and a distributed computing model. A new weight-based ensemble algorithm is proposed to learn a BN structure from an ensemble of local results. The algorithm is implemented as an R package and is reusable by third parties.

• A user-friendly approach to build and run scalable Big Data machine learning applications on top of DDP patterns and engines via scientific workflows. Users do not need to write programs to fit the interfaces of different DDP engines; they only need to build algorithm-specific components using languages like R or Matlab, with which they are already familiar.

SBNL is validated using three publicly available data sets. SBNL obtains significant performance gains when applied to distributed environments while keeping the same learning accuracy, making it an ideal workflow for Big Data Bayesian Network learning.

The remainder of this paper is organized as follows. Section II presents the background of BN learning techniques and evaluation. Section III discusses how to build scalable Big Data applications via the Kepler scientific workflow system. Section IV describes the SBNL workflow in detail. Evaluation results and related work are presented in Sections V and VI, respectively. Section VII concludes this paper with future work.

II. BAYESIAN NETWORK LEARNING TECHNIQUES AND EVALUATION

P(B, D) = P(B) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'/q_i)}{\Gamma(N'/q_i + N_{ij})} \cdot \prod_{k=1}^{r_i} \frac{\Gamma(N'/(r_i q_i) + N_{ijk})}{\Gamma(N'/(r_i q_i))}    (1)

where N_ijk is the number of instances in D for which X_i = k
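In practice, the score in Eq. (1) is evaluated in log space with the log-gamma function to avoid numerical overflow. The following minimal sketch (written in Python purely for illustration; the paper's implementation is in R) computes one variable's contribution to log P(B, D), assuming counts[j][k] holds N_ijk for parent configuration j and value k; the function name and the equivalent-sample-size parameter n_prime standing in for N' are our own.

```python
from math import lgamma

def bde_log_score(counts, n_prime=1.0):
    """Log BDe contribution of one variable X_i (cf. Eq. 1), where
    counts[j][k] = N_ijk for parent configuration j and value k."""
    q = len(counts)                  # number of parent configurations (q_i)
    total = 0.0
    for row in counts:
        r = len(row)                 # number of values of X_i (r_i)
        n_ij = sum(row)              # N_ij = sum over k of N_ijk
        a_ij = n_prime / q           # Dirichlet prior mass per configuration
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        a_ijk = n_prime / (r * q)    # prior mass per (configuration, value)
        for n_ijk in row:
            total += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return total
```

A variable with no observations contributes zero, while any observed data yields a strictly negative log contribution, since each factor of the marginal likelihood is a probability below one.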
2 Spark Project: http://spark.apache.org/
3 Kepler Project: https://kepler-project.org/
its input data as a whole. Instead, they first partition the input data and then process each partition separately.

The UDF (user-defined function) for the DDP patterns is an independent component and can naturally be encapsulated within a DDP actor. The logic of the UDF can be expressed either as a sub-workflow or as compiled code. In the first case, users can compose a sub-workflow for their UDF via the Kepler GUI using specific subsidiary actors for the DDP pattern and any other general actors. Since the sub-workflow is not specific to any engine API, the same sub-workflow can be executed on different DDP engines. Like other actors, multiple DDP actors can be linked to construct bigger applications.

Each DDP pattern defines its execution semantics, i.e., how data partitions are processed by the pattern. This clear definition enables decoupling between a DDP pattern and its execution engines. To execute DDP workflows on different DDP execution engines, we have implemented a DDP director in Kepler. Currently, this director can execute DDP workflows with Hadoop, Stratosphere and Spark. At runtime, the director detects the availability of DDP execution engines and transforms workflows into their corresponding jobs. The adaptability of the director makes it user-friendly, since it hides the underlying execution engines from users.

C. Machine Learning Support in Kepler

There are many popular tools/languages for machine learning, such as R, Matlab, Python and Knime [31]. Complex machine learning applications might need to integrate different components implemented in different tools/languages. Kepler supports easy integration of these tools/languages within one process. Besides the ExternalExecution actor in Kepler, which invokes arbitrary binary tools in batch mode, we also have actors specifically for many scripting languages. For instance, users can embed their own R scripts in the RExpression actor. Users can further customize the input/output ports of the RExpression actor to connect with other actors and build complex applications.

In addition, we are investigating how to integrate other popular machine learning tools, such as Mahout 4, into Kepler. Users will be able to use their machine learning functions/libraries as actors and connect them with other actors.

IV. PROPOSED APPROACH

A. Overview of SBNL Workflow

After introducing the background knowledge in the previous sections, we now give an overview of our SBNL workflow.

Fig. 2. Overview of the SBNL algorithm.

As shown in Fig. 2, the SBNL workflow consists of four components: (1) data partitioning, (2) local learner, (3) master learner, and (4) Kepler workflow.

In the data partitioning component, the SBNL workflow partitions the data set into data partitions of reasonable size. SBNL has a score-based algorithm to dynamically determine the best partition size, balancing learning complexity and accuracy. Data partitions are then sent evenly to each local learner. The local learner first uses the value of SArc to examine each data partition's quality. If the quality is good, SBNL enters the local ensemble learning (LEL) step: each local learner runs the MMHC algorithm separately on each local data partition to learn an individual BN. Then, the local learner applies our proposed ensemble method to the individual BNs to generate a final local BN. During local learning, the best local data partition is obtained in each local learner.

Finally, the SBNL workflow reaches the master learner component. This component receives the local BNs and the best local data partitions from all local learners. The global best data partition is obtained in the master learner. Then, the master learner runs our proposed ensemble algorithm on the local BNs using the best data partition. Note that the master learner does not run any BN learning algorithm; it only assigns a weight to each local BN and ensembles the final BN. All the computationally heavy tasks are thus distributed among the local learners.

Details of each component in the SBNL workflow are specified in the following sub-sections.
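The dataflow among the four components described above can be sketched as follows. This is a minimal illustration, not the actual implementation: learn_bn, sarc and ensemble stand in for MMHC, the SArc calculator and the weighted ensemble step, and selecting the "best" partition by highest SArc is our assumption.

```python
def sbnl_dataflow(partition_groups, learn_bn, sarc, ensemble):
    """Sketch of SBNL's dataflow: local learners do all the BN
    learning; the master learner only ensembles the local results."""
    local_bns, best_parts = [], []
    for group in partition_groups:        # one group per local learner
        good = [p for p in group if sarc(p) >= -0.5]  # data quality evaluation
        bns = [learn_bn(p) for p in good]             # one BN per partition
        best = max(good, key=sarc)                    # best local partition
        local_bns.append(ensemble(bns, best))         # local ensemble learning
        best_parts.append(best)
    global_best = max(best_parts, key=sarc)  # master: global best partition
    return ensemble(local_bns, global_best)  # master ensembles only
```

With toy stand-ins for the three callbacks, the function simply routes data through the two ensemble stages, which is the point of the sketch.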
B. Quality Evaluation and Data Partitioning

The first thing the SBNL workflow does is evaluate the quality of a given big data set. It runs a scoring algorithm to incrementally evaluate a partition Dp of the whole data set D. Each time, the scoring algorithm doubles the size of Dp until a threshold is reached. If the SArc value of Dp is larger than 1.0, then the whole data set will not be used for the SBNL workflow.

4 Mahout: https://mahout.apache.org/
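The doubling evaluation described above can be sketched as follows (a hedged illustration in Python; sarc_fn stands in for the SArc calculation, and the initial slice size and the caps are placeholders, not the paper's values):

```python
def find_partition_size(data, sarc_fn, max_steps=10, max_size=100_000):
    """Double the candidate partition until its (negated) SArc score
    stops improving or a step/size cap is reached -- a sketch of the
    incremental evaluation described above."""
    ns = min(1000, len(data))        # placeholder initial slice size
    best = float("inf")
    for _ in range(max_steps):
        score = -sarc_fn(data[:ns])  # smaller negated score = better quality
        if score >= best or ns >= min(max_size, len(data)):
            break
        best = score
        ns *= 2
    return ns
```

The loop stops either because quality no longer improves or because the partition would exceed the size cap, so the returned size is always bounded.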
Facing Big Data larger than the memory size, a single machine cannot compute the Bayesian score of the whole data set. Therefore, we use a distributed computing model in the SBNL workflow. A big data set is partitioned into K slices of size Ns:

Nd = Ns * K + Nr    (5)

where Nd is the size of data set D and Nr is the number of rows in the remainder of D after K partitions. By counting the last Nr rows as another slice, the total number of slices is K+1. Given the total number of local learners, denoted Nlocal, we try to send data slices evenly to the local learners for better load balance.

An important task here is to determine Ns. We propose a fast incremental algorithm, FindNs, to find a proper partition size. FindNs is described in Table II.

TABLE II. THE FINDNS FUNCTION

function FindNs(data, maxStep, maxSize) {
    bestScore = 100; currentStep = 1;
    nrowdata = number of rows in data;
    ncoldata = number of columns in data;
    Ns = 1000 * (ncoldata % 10);
    slicedData = data[1:Ns];
    score = SarcCalculator(slicedData) * (-1);
    while (score < bestScore && currentStep < maxStep && Ns < maxSize) {
        bestScore = score;
        Ns = Ns * 2;
        slicedData = data[1:Ns];
        score = SarcCalculator(slicedData) * (-1);
        currentStep = currentStep + 1;
    }
    return Ns;
}

function SarcCalculator(D, parameters) {
    network = mmhc(D, parameters);
    score = score(network, D, type = "BDe");
    SArc = score / (Nd * (number of arcs in network));
    return SArc;
}

The FindNs algorithm begins with the initial slice size:

Ns = 1000 * (ncoldata % 10)    (7)

It then doubles the value of Ns iteratively and evaluates the data partition (data[1:Ns]) until the SArc value can no longer be improved, or the maximum number of iterations or the maximum partition size is reached. In this way, the quality of each data partition is ensured by SArc, and the data partition size is kept under a threshold.

C. Local Learner

The first activity in the local learner is Data Quality Evaluation (DQE). During DQE, each data partition is examined with the function SarcCalculator. If a data partition's SArc is less than -0.5, the partition is dropped by the SBNL workflow. Partitions of good quality then enter the local ensemble learning (LEL) step, in which local BNs are learned with the MMHC algorithm. This step also examines each data partition and selects the best partition. Then, LEL calculates learnScores for the local BNs using the best data partition.

TABLE III. LOCAL ENSEMBLE LEARNING

LocalEnsembleLearning(dataPartitions) {
    # Initialization
    localBNs = learn BNs from dataPartitions using MMHC;
    learnScores = BDe scores of localBNs using best data partition;
    finalLocalBN = ensembleBNs(localBNs, learnScores, bestPartition);
}

ensembleBNs(localBNs, learnScores, bestPartition) {
    weights = weightCalculator(learnScores, bestPartition);
    mergedMatrix = matrix(0, nnodes, nnodes);

    # Transform and merge local BNs
    for (n in 1:length(localBNs)) {
        adjMatrix = BNToAdjMatrix(localBNs[n]);
        mergedMatrix = adjMatrix * weights[n] + mergedMatrix;
    }

    # Transform merged matrix into final local BN
    minThreshold = min(weights);
    finalLocalBN = MergedMatrixToBN(mergedMatrix, minThreshold * 2);
    return finalLocalBN;
}

Assigning a weight to each individual learner is an important technique in ensemble learning. LEL leverages this weighting technique in a method called ensembleBNs. In ensembleBNs, a weight vector is calculated based on the values of localScores. For example, if localScores = [-0.2, -0.3, -0.25, -0.25], the corresponding normalized weight vector is [0.306, 0.204, 0.245, 0.245]: a local score that is smaller in absolute value receives a higher weight. After obtaining the weights, ensembleBNs transforms the local BNs into adjacency matrices and merges them into one matrix using the weight vector. In the end, ensembleBNs uses the merged matrix to generate the final local BN. A threshold is set as the minimal value in the weight vector, and ensembleBNs iterates over the merged matrix, identifying an arc when mergedMatrix[i, j] > minThreshold * 2. This is a voting mechanism that promotes an arc when it is present in more than two local BNs.

D. Master Learner

After describing the local learner, we now introduce the final component in the SBNL workflow: the master learner. The master learner adopts a similar strategy to the local learner and reuses the function ensembleBNs. There are two inputs: the final local BNs, denoted Slocal, and the best local partitions, denoted Dlocalbest. The master learner contains four steps: 1) obtain the global best partition Dbest from Dlocalbest; 2) calculate scores for Slocal using Dbest; 3) call the ensembleBNs function to obtain the final BN; 4) return the final BN as the learning result of the SBNL workflow from Big Data.
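The weighting and merging steps of ensembleBNs, which both the local and master learners use, can be illustrated with a small sketch (Python for illustration; the paper's implementation is an R package). It assumes, as the worked example above implies, that weights are proportional to the inverse absolute BDe score and normalized to sum to one; the function names here are our own.

```python
def weights_from_scores(scores):
    """Weight proportional to 1 / |score|, normalized to sum to 1
    (a smaller absolute score yields a higher weight)."""
    inv = [1.0 / abs(s) for s in scores]
    total = sum(inv)
    return [v / total for v in inv]

def ensemble_adjacency(adj_matrices, weights):
    """Merge the 0/1 adjacency matrices of the local BNs into one
    weighted matrix, then keep an arc only where the merged weight
    exceeds twice the minimum weight (the voting threshold)."""
    n = len(adj_matrices[0])
    merged = [[0.0] * n for _ in range(n)]
    for adj, w in zip(adj_matrices, weights):
        for i in range(n):
            for j in range(n):
                merged[i][j] += w * adj[i][j]
    thr = 2 * min(weights)
    return [[1 if merged[i][j] > thr else 0 for j in range(n)]
            for i in range(n)]
```

With the example scores [-0.2, -0.3, -0.25, -0.25], this reproduces the weight vector [0.306, 0.204, 0.245, 0.245], and an arc survives only when the local BNs containing it jointly carry more than twice the minimum weight.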
According to Table VI, we can claim that SBNL has 100% data selection accuracy in its local learner component. Therefore, we can conclude that SArc is an accurate measure for testing the faithfulness of a data set D.

After running FindNs on the three big data sets, we obtain Ns for each big data set (in Table VII). Note that the SArc values shown

We first give the Hailfinder data set to the SBNL workflow. In the first data evaluation actor, the SArc value of Hailfinder remains very high, around 1.5. So the SBNL workflow determines that the Hailfinder data set is not suitable for BN learning. To confirm this, we further apply a data set of Hailfinder to the MMHC

5 Hadoop Distributed File System (HDFS): http://wiki.apache.org/hadoop/HDFS
algorithm. The learned BN is very different from the actual Hailfinder network, with over 30 missing arcs. This study affirms the correctness of the SBNL workflow: low-quality data sets are rejected at the outset by SBNL so as to ensure good learning results.

We also evaluated the Alarm and Insurance data sets. Both data sets have good quality. The accuracy analysis is summarized in Table IX. The Alarm10M data set is partitioned into 208 partitions and the Insurance10M data set is partitioned into 625 partitions. For the Alarm10M data set, we compare SBNL's result with a single 96000-row data set (Alarm96K) applied directly to the MMHC algorithm on a single machine. Similarly, for the Insurance10M data set, we compare SBNL's result with a single 16000-row data set (Insurance16K) applied to the MMHC algorithm.

TABLE IX. NETWORK ACCURACY ANALYSIS
(AA = added arcs, MA = missing arcs, SHD = structural Hamming distance)

                        SArc   AA   MA   SHD
Alarm10M (SBNL)         0.28    0    9     9
Alarm96K (Single)       0.27    2    5     7
Insurance10M (SBNL)     0.68    2   26    27
Insurance16K (Single)   0.69    3   24    25

The Alarm data set has good data quality with a very low SArc; therefore, the learned BN is close to the actual network. The best partition size of the Alarm data set is 96000, so we use a separate 96000-row Alarm data set (Alarm96K) to compare SBNL's accuracy. After applying the Alarm10M data set to SBNL, we learn a BN with 37 correct arcs and zero added arcs, at a structural Hamming distance of nine. This is close to the learning result of the Alarm96K data set, showing the good learning accuracy of the SBNL workflow. Note that there is no added arc; this is due to the ensemble weighting mechanism of SBNL, which selects popular arcs discovered by the local learners, resulting in a very compact BN with most of the correct arcs.

On the other hand, the Insurance data set has a higher SArc value, so its learning accuracy is not as good as that of the Alarm network. The best partition size of Insurance10M is 16000. It can be observed that the learning results of the Insurance10M data set with the SBNL workflow are similar to those of the Insurance16K data set. Again, this comparison confirms the learning accuracy of the SBNL workflow.

VI. RELATED WORK

To efficiently manage the massive amounts of data encountered in big data applications, approaches to in-situ analytics have been investigated. Zou et al. explore the use of data reduction via online data compression in [32] and apply this idea to large-scale remote visual data exploration [33]. Our approach addresses the data set size problem by using a pre-processing technique to eliminate poor-quality data, and by using an approach that leverages an ensemble model coupled with distributed processing.

Learning a BN from data is a traditional research area with a long history. Chickering et al. [16] show that finding the optimal BN structure in the graph search space is NP-hard. A comprehensive comparative survey of BN learning algorithms was carried out in [9]. The majority of learning algorithms are not designed for Big Data BN learning. The number of possible BN structures grows super-exponentially with the number of variables. In addition, large data sets can hardly fit in the memory of a single machine. Therefore, it is advisable to learn BNs from Big Data through distributed computing methods in a divide-and-conquer fashion. Chen et al. [13] study the problem of learning the structure of a BN from distributed heterogeneous data sources, but their approach focuses on learning sparsely connected networks with different features at each site. In 2010, Na and Yang proposed a method for learning the structure of a BN from distributed data sources [14], but their local learning uses the K2 algorithm with medium accuracy, and the approach does not scale to big data sets. In 2011, Tamada et al. proposed a parallel algorithm for learning optimal BN structure [15], but this approach is limited to optimal structure search, which is not suitable for large data sets with millions of records. In the Big Data BN learning area, current research focuses mainly on methods for distributed computing and scale-up implementation. To the best of our knowledge, this research is the first to bring the workflow concept into Big Data BN learning. This is a key contribution to the existing research.

There are several studies on scaling up machine learning applications. The MapReduce framework has been shown to be broadly applicable to many machine learning algorithms [26]. Das et al. use a JSON query language, called Jaql, as a bridge between R and Hadoop [23]; their system provides a new package for HDFS operations. Ghoting and Pednault propose Hadoop-ML, an infrastructure on which developers can build task-parallel or data-parallel machine learning algorithms on program blocks under the language runtime environment [24]. Budiu et al. demonstrate how to use DryadLINQ for machine learning applications such as decision tree induction and k-means [34]. Yet the learning curves of these tools are relatively steep, since researchers have to learn the architectures and interfaces to implement their own data mining algorithms. Wegener et al. introduce a system architecture for GUI-based data mining of large data on clusters based on MapReduce that overcomes the limitations of data mining toolkits [25]; it uses an implementation based on Weka and Hadoop to verify the architecture. This work is similar to ours, as both provide GUI support and Hadoop integration. Our work targets another popular machine learning and data mining tool, namely R, and our framework can adapt to different DDP engines.

There are also some machine learning workflow tools, such as Knime and IPython Notebook6. For instance, Knime provides many machine learning packages, yet its Big Data extension7 is currently limited to Hadoop/HDFS access. We have not seen how DDP patterns/sub-workflows are supported in these workflow tools.

6 IPython Notebook: http://ipython.org/notebook.html
7 Knime Big Data Extension: http://www.knime.org/knime-big-data-extension
VII. CONCLUSIONS

In the Big Data era, techniques for processing and analyzing data must work in contexts where the data set consists of millions of samples and the amount of data is measured in petabytes. By combining machine learning, distributed computing and workflow techniques, we design a Scalable Bayesian Network Learning (SBNL) workflow. The workflow includes intelligent Big Data pre-processing and effective BN learning from Big Data by leveraging ensemble learning and a distributed computing model. We also illustrate how the Kepler scientific workflow system can easily provide scalability to Bayesian network learning. It should be noted that this approach can be applied to many other machine learning techniques as well, to make them scalable and Big Data ready.

For future work, we plan to improve the performance of the data partitioning part by integrating the current partitioning approach with HDFS to achieve parallel data partitioning and loading. We also plan to apply our work to bigger data sets with more distributed resources to further verify its scalability.

ACKNOWLEDGMENT

This work is supported by the Natural Science Foundation of Jiangsu Province, China under grant No. BK20140857 and the National Science Foundation, U.S. under grants DBI-1062565 and 1331615.

REFERENCES

[1] R. Lu, H. Zhu, X. Liu, J. K. Liu, J. Shao, "Toward efficient and privacy-preserving computing in big data era", IEEE Network, Vol. 28, Issue 4, pp. 46-50, 2014.
[2] Y. Zhang, Y. Zhang, E. Swears, N. Larios, Z. Wang, Q. Ji, "Modeling Temporal Interactions with Interval Temporal Bayesian Networks for Complex Activity Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, Issue 10, pp. 2468-2483, 2013.
[3] M. Neil, C. Xiaoli, N. Fenton, "Optimizing the Calculation of Conditional Probability Tables in Hybrid Bayesian Networks Using Binary Factorization", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, Issue 7, pp. 1306-1312, 2012.
[4] M. Gui, A. Pahwa, S. Das, "Bayesian Network Model With Monte Carlo Simulations for Analysis of Animal-Related Outages in Overhead Distribution Systems", IEEE Transactions on Power Systems, Vol. 26, Issue 3, pp. 1618-1624, 2011.
[5] N. E. Fenton, M. Neil, "A critique of software defect prediction models", IEEE Transactions on Software Engineering, Vol. 25, Issue 5, pp. 675-689, 1999.
[6] S. Sun, C. Zhang, G. Yu, "A bayesian network approach to traffic flow forecasting", IEEE Transactions on Intelligent Transportation Systems, Vol. 7, Issue 1, pp. 124-132, 2006.
[7] K. Dejaeger, T. Verbraken, B. Baesens, "Toward Comprehensible Software Fault Prediction Models Using Bayesian Network Classifiers", IEEE Transactions on Software Engineering, Vol. 39, Issue 2, pp. 237-248, 2013.
[8] M. Neil and N. Fenton, "Using Bayesian Networks to Model the Operational Risk to Information Technology Infrastructure in Financial Institutions", Journal of Financial Transformation, Vol. 22, pp. 131-138, 2008.
[9] Y. Tang, K. Cooper, C. Cangussu, "Bayesian Belief Network Structure Learning Algorithms", Technical Report UTDCS-25-09, University of Texas at Dallas, 2009.
[10] I. Tsamardinos, L. E. Brown, C. F. Aliferis, "The max-min hill-climbing Bayesian network structure learning algorithm", Machine Learning, Vol. 65, Issue 1, pp. 31-78, 2006.
[11] J. Cheng, R. Greiner, J. Kelly, D. A. Bell, W. Liu, "Learning Bayesian networks from data: An information-theory based approach", Artificial Intelligence, Vol. 137, pp. 43-90, 2002.
[12] X. Xie, Z. Geng, "A Recursive Method for Structural Learning of Directed Acyclic Graphs", Journal of Machine Learning Research, Vol. 9, pp. 459-483, 2008.
[13] R. Chen, K. Sivakumar, H. Kargupta, "Learning bayesian network structure from distributed data", In Proceedings of the 3rd SIAM International Data Mining Conference, pp. 284-288, 2003.
[14] Y. Na, J. Yang, "Distributed Bayesian network structure learning", In Proceedings of the 2010 IEEE International Symposium on Industrial Electronics (ISIE), pp. 1607-1611, 2010.
[15] Y. Tamada, S. Imoto, S. Miyano, "Parallel Algorithm for Learning Optimal Bayesian Network Structure", Journal of Machine Learning Research, Vol. 12, pp. 2437-2459, 2011.
[16] D. M. Chickering, D. Geiger, D. Heckerman, "Learning Bayesian networks is NP-hard", Technical Report MSR-TR-94-17, Microsoft Research, 1994.
[17] D. Opitz and R. Maclin, "Popular ensemble methods: An empirical study", Journal of Artificial Intelligence Research, Vol. 11, pp. 169-198, 1999.
[18] B. Zenko, "A comparison of stacking with meta decision trees to bagging, boosting, and stacking with other methods", In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), pp. 669-670, 2001.
[19] K. Monteith, J. L. Carroll, K. Seppi, T. Martinez, "Turning Bayesian Model Averaging into Bayesian Model Combination", In Proceedings of the International Joint Conference on Neural Networks (IJCNN'11), pp. 2657-2663, 2011.
[20] J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky, "Bayesian Model Averaging: A Tutorial", Statistical Science, Vol. 14, Issue 4, pp. 382-401, 1999.
[21] M. Scutari, Bayesian Network Repository, http://www.bnlearn.com/bnrepository, 2011.
[22] I. Beinlich, H. J. Suermondt, R. M. Chavez, G. Cooper, "The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks", In Proceedings of Artificial Intelligence in Medical Care, pp. 247-256, 1989.
[23] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson, "Ricardo: Integrating R and Hadoop", In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 987-998, 2010.
[24] A. Ghoting and E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics", In Proceedings of the Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.
[25] D. Wegener, M. Mock, D. Adranale, S. Wrobel, "Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters", In Proceedings of the International Conference on Data Mining Workshops (ICDMW '09), pp. 296-301, 2009.
[26] C. Chu, S. K. Kim, Y. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, K. Olukotun, "Map-Reduce for machine learning on multicore", in Advances in Neural Information Processing Systems 19, pp. 281-288, 2007.
[27] D. Battre, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke, "Nephele/PACTs: A programming model and execution framework for web-scale analytical processing", In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), pp. 119-130, 2010.
[28] B. Ludaescher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger-Frank, M. Jones, E. A. Lee, J. Tao, Y. Zhao, "Scientific workflow management and the Kepler system", Concurrency and Computation: Practice & Experience, Special Issue on Scientific Workflows, Vol. 18, Issue 10, pp. 1039-1065, 2006.
[29] A. Goderis, C. Brooks, I. Altintas, E. Lee, C. Goble, "Heterogeneous composition of models of computation", Future Generation Computer Systems, Vol. 25, Issue 5, pp. 552-560, 2009.
[30] J. Wang, D. Crawl, I. Altintas, W. Li, "Big Data Applications using Workflows for Data Parallel Computing", Computing in Science & Engineering, Vol. 16, Issue 4, pp. 11-22, 2014.
[31] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, B. Wiswedel, "KNIME: The Konstanz Information Miner", Studies in Classification, Data Analysis, and Knowledge Organization, pp. 319-326, 2008.
[32] H. Zou, F. Zheng, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, Q. Liu, N. Podhorszki, S. Klasky, "Quality-Aware Data Management for Large Scale Scientific Applications", In Proceedings of High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 816-820, 2012.
[33] H. Zou, M. Slawinska, K. Schwan, M. Wolf, G. Eisenhauer, F. Zheng, J. Dayal, J. Logan, S. Klasky, T. Bode, M. Kinsey, M. Clark, "FlexQuery: An Online In-situ Query System for Interactive Remote Visual Data Exploration at Large Scale", In Proceedings of the 2013 IEEE International Conference on Cluster Computing (Cluster 2013), pp. 1-8, 2013.
[34] M. Budiu, D. Fetterly, M. Isard, F. McSherry, Y. Yu, "Large-scale machine learning using DryadLINQ", in R. Bekkerman, M. Bilenko, J. Langford (Eds.), Scaling up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press, pp. 49-68, 2011.
[35] W. Raghupathi, V. Raghupathi, "Big data analytics in healthcare: promise and potential", Health Information Science and Systems, Vol. 2, Issue 1, Article 3, 2014.