
Data Intensive Distributed Computing:
Challenges and Solutions for Large-Scale Information Management

Tevfik Kosar
State University of New York at Buffalo (SUNY), USA

Information Science Reference

Detailed Table of Contents

Preface ............................................................................ xiii

Section 1
New Paradigms in Data Intensive Computing

Chapter 1
Data-Aware Distributed Computing
Esma Yildirim, State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
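
The core idea above, data placement as a first-class scheduled job, lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch (not the chapter's actual scheduler; all names are invented): transfer tasks are queued with priorities, executed, and retried on failure, exactly the capabilities a compute-job scheduler would provide.

```python
# Hypothetical sketch of a data-aware scheduler: data placement is a
# first-class, prioritized job with its own queue, execution, and error
# recovery. All names are invented for illustration.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DataPlacementJob:
    priority: int                                  # lower value = runs sooner
    src: str = field(compare=False)
    dst: str = field(compare=False)
    retries_left: int = field(default=3, compare=False)

class DataAwareScheduler:
    def __init__(self):
        self.queue = []

    def submit(self, job):
        heapq.heappush(self.queue, job)            # queued like a compute job

    def run(self, transfer):
        """Drain the queue; transfer(src, dst) performs one data movement."""
        while self.queue:
            job = heapq.heappop(self.queue)
            try:
                transfer(job.src, job.dst)         # e.g. a GridFTP/HTTP copy
            except IOError:
                if job.retries_left > 0:           # error recovery: requeue
                    job.retries_left -= 1
                    self.submit(job)
```

A workflow planner would enqueue such placement jobs between the compute stages they feed, rather than hiding transfers inside the compute tasks themselves.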

Chapter 2
Towards Data Intensive Many-Task Computing .......................................... 28
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
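
The caching side of data diffusion can be made concrete with a toy per-node cache in Python. The chapter develops a competitive online eviction policy; plain LRU below is purely a stand-in, and all names are invented.

```python
# Toy per-node cache to make data diffusion concrete. The chapter develops
# a competitive online eviction policy; plain LRU here is only a stand-in.
from collections import OrderedDict

class NodeCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()              # path -> size, oldest first

    def get(self, path):
        """True on a local hit; a miss means fetching from shared storage."""
        if path in self.items:
            self.items.move_to_end(path)        # refresh recency on hit
            return True
        return False

    def put(self, path, size):
        while self.items and self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)   # evict LRU
            self.used -= evicted_size
        self.items[path] = size
        self.used += size
```

As tasks are dispatched to nodes that already cache their inputs, popular data "diffuses" across the cluster and the shared file system is hit less often.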

Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management .......................................... 74
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA

Service-oriented architectures (SOA) enable orchestration of loosely-coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
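
The chaining-with-recovery idea can be sketched in a few lines of Python. This is an invented illustration, not the chapter's rule engine: each forward micro-service is paired with an optional recovery micro-service, and a failure rolls back the completed steps in reverse order.

```python
# Invented illustration of micro-services chained into a workflow with
# transactional recovery; this is not the chapter's actual rule engine.
def checksum(ctx):
    ctx["sum"] = hash(ctx["data"])

def replicate(ctx):
    ctx["copies"] = 2

def unreplicate(ctx):                    # recovery partner for replicate
    ctx.pop("copies", None)

# Each step pairs a forward micro-service with an optional recovery one.
WORKFLOW = [(checksum, None), (replicate, unreplicate)]

def run_workflow(ctx):
    done = []
    for step, recover in WORKFLOW:
        try:
            step(ctx)
            done.append(recover)
        except Exception:
            for r in reversed(done):     # roll back completed steps
                if r:
                    r(ctx)
            raise
    return ctx

print(run_workflow({"data": "genome.fa"}))   # -> data, sum, copies
```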

Section 2
Distributed Storage

Chapter 4
Distributed Storage Systems for Data Intensive Computing .......................................... 95
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5
Metadata Management in PetaShare Distributed Storage Network .......................................... 118
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high level cross-domain metadata schema to provide a structured systematic view of multiple science domains supported by PetaShare.
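
As a rough illustration of the asynchronous multi-master design, the Python toy below lets every site accept metadata writes locally and propagate them to peers in the background. It is not PetaShare's code, the site names are placeholders, and it deliberately omits the conflict resolution a real multi-master system must handle.

```python
# Toy asynchronous multi-master metadata service (not PetaShare's code).
# Each site accepts writes locally and propagates them to peers later;
# conflict resolution, which a real system needs, is omitted here.
import queue

class MetadataServer:
    def __init__(self, name):
        self.name, self.store, self.peers = name, {}, []
        self.inbox = queue.Queue()           # updates arriving from peers

    def write(self, key, value):
        self.store[key] = value              # local write returns immediately
        for peer in self.peers:
            peer.inbox.put((key, value))     # asynchronous replication

    def sync(self):
        """Apply queued peer updates; would run in a background thread."""
        while not self.inbox.empty():
            key, value = self.inbox.get()
            self.store[key] = value

site_a, site_b = MetadataServer("SiteA"), MetadataServer("SiteB")
site_a.peers, site_b.peers = [site_b], [site_a]
site_a.write("/petashare/run1.dat", {"size": 4096})
site_b.sync()                                # SiteB now sees the new entry
```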

Chapter 6
Data Intensive Computing with Clustered Chirp Servers .......................................... 140
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA

Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.
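
The bandwidth-aggregation idea behind such clustering is easy to sketch: stripe a large file round-robin across several commodity file servers so reads and writes proceed in parallel. The Python below is a hypothetical illustration, not Chirp's protocol; `servers` is assumed to be a list of objects exposing a `put(key, data)` call.

```python
# Hypothetical sketch of aggregating I/O bandwidth by striping one large
# file across several commodity file servers; Chirp's real protocol and
# tooling differ. Each server is assumed to expose put(key, data).
CHUNK = 64 * 1024 * 1024        # 64 MB stripe unit, an arbitrary choice

def write_striped(data, servers):
    """Round-robin fixed-size chunks of `data` across `servers`."""
    placements = []
    for i in range(0, len(data), CHUNK):
        server = servers[(i // CHUNK) % len(servers)]
        key = f"chunk-{i // CHUNK}"
        server.put(key, data[i:i + CHUNK])
        placements.append((server, key))    # kept so reads can reassemble
    return placements
```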

Section 3
Data & Workflow Management

Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows .......................................... 156
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on the type of workflows that have tasks processing huge amounts of data, usually in the range from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimize total makespan of the workflow, minimize cost and usage of network bandwidth, minimize cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the working of the Grid systems, providing insights on performance optimization of scientific applications dealing with data-intensive workloads.
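
One of the surveyed objectives, minimizing makespan while accounting for data movement, can be illustrated with a simple greedy list-scheduling heuristic. The sketch below is far simpler than the algorithms the chapter surveys; `runtime` and `transfer_time` are assumed, user-supplied estimator functions.

```python
# Greedy illustration of one surveyed objective: map each ready task to
# the resource with the earliest estimated finish time, where the estimate
# includes data transfer, not just computation. The surveyed systems use
# far richer heuristics; runtime() and transfer_time() are assumed inputs.
def schedule(tasks, resources, runtime, transfer_time):
    free_at = {r: 0.0 for r in resources}   # when each resource frees up
    plan = {}
    for t in tasks:
        finish = {r: free_at[r] + transfer_time(t, r) + runtime(t, r)
                  for r in resources}
        best = min(finish, key=finish.get)  # earliest finish wins
        plan[t] = best
        free_at[best] = finish[best]
    return plan                             # task -> resource mapping
```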

Chapter 8
Data Management in Scientific Workflows .......................................... 177
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase where the analysis is defined, the workflow planning phase where resources needed for execution are selected, the workflow execution part, where the actual computations take place, and the result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow cycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9
Replica Management in Data Intensive Distributed Science Applications .......................................... 188
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss the data replication solutions to meet the challenges associated with increasingly large data sets and the requirement to run analysis at geographically distributed sites.
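
A replica catalog of the kind the early strategies centered on reduces to a mapping from logical file names to physical replica locations. The Python below is a minimal sketch in that spirit; the API names are invented, not those of an actual service such as the Globus Replica Location Service.

```python
# Minimal replica catalog in the spirit of the early strategies described
# above: logical file names (LFNs) map to physical replica locations
# (PFNs). API names are invented for illustration.
class ReplicaCatalog:
    def __init__(self):
        self.mapping = {}                        # logical name -> set of URLs

    def register(self, lfn, pfn):
        self.mapping.setdefault(lfn, set()).add(pfn)

    def unregister(self, lfn, pfn):
        self.mapping.get(lfn, set()).discard(pfn)

    def lookup(self, lfn):
        return sorted(self.mapping.get(lfn, ()))

catalog = ReplicaCatalog()
catalog.register("lfn://exp1/run42.dat", "gsiftp://siteA/data/run42.dat")
catalog.register("lfn://exp1/run42.dat", "gsiftp://siteB/data/run42.dat")
print(catalog.lookup("lfn://exp1/run42.dat"))    # both replica locations
```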

Section 4
Data Discovery & Visualization

Chapter 10
Data Intensive Computing for Bioinformatics .......................................... 207
Judy Qiu, Indiana University Bloomington, USA
Jaliya Ekanayake, Indiana University Bloomington, USA
Thilina Gunarathne, Indiana University Bloomington, USA
Jong Youl Choi, Indiana University Bloomington, USA
Seung-Hee Bae, Indiana University Bloomington, USA
Yang Ruan, Indiana University Bloomington, USA
Saliya Ekanayake, Indiana University Bloomington, USA
Stephen Wu, Indiana University Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University Bloomington, USA
Mina Rho, Indiana University Bloomington, USA
Haixu Tang, Indiana University Bloomington, USA

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
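
The Iterative MapReduce pattern the authors advocate is easiest to see on k-means clustering, the textbook case: each iteration is one map (assign points to centroids) and one reduce (recompute centroids), with only the small centroid list carried between rounds. The sketch below runs in-process on 1-D points; an Iterative MapReduce runtime (e.g. Twister) would distribute these steps across a cluster.

```python
# Iterative MapReduce illustrated with k-means: one map and one reduce
# per iteration, re-broadcasting the small centroid list instead of
# reloading the large input data. In-process toy, 1-D points only.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # map: label each point with the index of its nearest centroid
        assign = [(min(range(len(centroids)),
                       key=lambda c: (points[i] - centroids[c]) ** 2), i)
                  for i in range(len(points))]
        # reduce: new centroid = mean of the points assigned to it
        for c in range(len(centroids)):
            members = [points[i] for cc, i in assign if cc == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids

print(kmeans([1.0, 1.2, 9.8, 10.0], [0.0, 5.0]))   # converges to ~[1.1, 9.9]
```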

Chapter 11
Visualization of Large-Scale Distributed Data .......................................... 242
Jason Leigh, University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Tom Peterka, Argonne National Laboratory, USA
Nicholas Schwarz, Northwestern University, USA

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly challenging because the size of the data is several orders of magnitude larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain through case studies approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.

Chapter 12
On-Demand Visualization on Scalable Shared Infrastructure .......................................... 275
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of The Pacific, USA
Jian Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which is serving computational scientists who want adequate visualization services provided to them on-demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that were able to achieve fault-tolerance and load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline driven (i.e. time-critical) visualization.
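
The flow-control mechanism, system-level back-off under a visualization deadline, can be sketched as follows. This is an invented illustration, not the authors' scheduler: the client retries a busy node with exponential back-off and jitter, capped so it never sleeps past the deadline.

```python
# Invented illustration of flow control via system-level back-off under a
# rendering deadline: retry a busy node with exponential back-off and
# jitter, but never sleep past the deadline itself.
import random
import time

def request_with_backoff(render, deadline, base=0.05, cap=1.0):
    """Call render() until it succeeds or `deadline` (monotonic) passes."""
    delay = base
    while time.monotonic() < deadline:
        try:
            return render()               # e.g. fetch one rendered subvolume
        except RuntimeError:              # node busy: back off and retry
            remaining = deadline - time.monotonic()
            time.sleep(max(0.0, min(delay + random.uniform(0, base),
                                    cap, remaining)))
            delay *= 2
    raise TimeoutError("deadline passed; degrade quality or reroute")

# usage: request_with_backoff(lambda: ask_node(region), time.monotonic() + 2.0)
# (ask_node and region are hypothetical stand-ins for a real render request)
```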

Compilation of References .......................................... 291

About the Contributors .......................................... 319

Index .......................................... 331
