
Data Intensive Distributed Computing:
Challenges and Solutions for Large-Scale Information Management

Tevfik Kosar
State University of New York at Buffalo (SUNY), USA

Information Science Reference

Detailed Table of Contents

Preface ............................................................................ xiii

Section 1
New Paradigms in Data Intensive Computing

Chapter 1
Data-Aware Distributed Computing
Esma Yildirim, State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
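
The core idea above, data placement as a first-class scheduled job, lends itself to a compact illustration. Below is a minimal, hypothetical Python sketch (not the chapter's actual scheduler; all names are invented): transfer tasks are queued with priorities, executed, and retried on failure, exactly the capabilities a compute-job scheduler would provide.

```python
# Hypothetical sketch of a data-aware scheduler: data placement is a
# first-class, prioritized job with its own queue, execution, and error
# recovery. All names are invented for illustration.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DataPlacementJob:
    priority: int                                  # lower value = runs sooner
    src: str = field(compare=False)
    dst: str = field(compare=False)
    retries_left: int = field(default=3, compare=False)

class DataAwareScheduler:
    def __init__(self):
        self.queue = []

    def submit(self, job):
        heapq.heappush(self.queue, job)            # queued like a compute job

    def run(self, transfer):
        """Drain the queue; transfer(src, dst) performs one data movement."""
        while self.queue:
            job = heapq.heappop(self.queue)
            try:
                transfer(job.src, job.dst)         # e.g. a GridFTP/HTTP copy
            except IOError:
                if job.retries_left > 0:           # error recovery: requeue
                    job.retries_left -= 1
                    self.submit(job)
```

A workflow planner would enqueue such placement jobs between the compute stages they feed, rather than hiding transfers inside the compute tasks themselves.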

Chapter 2
Towards Data Intensive Many-Task Computing .......................................... 28
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
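
The caching side of data diffusion can be made concrete with a toy per-node cache in Python. The chapter develops a competitive online eviction policy; plain LRU below is purely a stand-in, and all names are invented.

```python
# Toy per-node cache to make data diffusion concrete. The chapter develops
# a competitive online eviction policy; plain LRU here is only a stand-in.
from collections import OrderedDict

class NodeCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()              # path -> size, oldest first

    def get(self, path):
        """True on a local hit; a miss means fetching from shared storage."""
        if path in self.items:
            self.items.move_to_end(path)        # refresh recency on hit
            return True
        return False

    def put(self, path, size):
        while self.items and self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)   # evict LRU
            self.used -= evicted_size
        self.items[path] = size
        self.used += size
```

As tasks are dispatched to nodes that already cache their inputs, popular data "diffuses" across the cluster and the shared file system is hit less often.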

Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management .......................................... 74
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA

Service-oriented architectures (SOA) enable orchestration of loosely-coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
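
The chaining-with-recovery idea can be sketched in a few lines of Python. This is an invented illustration, not the chapter's rule engine: each forward micro-service is paired with an optional recovery micro-service, and a failure rolls back the completed steps in reverse order.

```python
# Invented illustration of micro-services chained into a workflow with
# transactional recovery; this is not the chapter's actual rule engine.
def checksum(ctx):
    ctx["sum"] = hash(ctx["data"])

def replicate(ctx):
    ctx["copies"] = 2

def unreplicate(ctx):                    # recovery partner for replicate
    ctx.pop("copies", None)

# Each step pairs a forward micro-service with an optional recovery one.
WORKFLOW = [(checksum, None), (replicate, unreplicate)]

def run_workflow(ctx):
    done = []
    for step, recover in WORKFLOW:
        try:
            step(ctx)
            done.append(recover)
        except Exception:
            for r in reversed(done):     # roll back completed steps
                if r:
                    r(ctx)
            raise
    return ctx

print(run_workflow({"data": "genome.fa"}))   # -> data, sum, copies
```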

Section 2
Distributed Storage

Chapter 4
Distributed Storage Systems for Data Intensive Computing .......................................... 95
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5
Metadata Management in PetaShare Distributed Storage Network .......................................... 118
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high level cross-domain metadata schema to provide a structured systematic view of multiple science domains supported by PetaShare.
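
As a rough illustration of the asynchronous multi-master design, the Python toy below lets every site accept metadata writes locally and propagate them to peers in the background. It is not PetaShare's code, the site names are placeholders, and it deliberately omits the conflict resolution a real multi-master system must handle.

```python
# Toy asynchronous multi-master metadata service (not PetaShare's code).
# Each site accepts writes locally and propagates them to peers later;
# conflict resolution, which a real system needs, is omitted here.
import queue

class MetadataServer:
    def __init__(self, name):
        self.name, self.store, self.peers = name, {}, []
        self.inbox = queue.Queue()           # updates arriving from peers

    def write(self, key, value):
        self.store[key] = value              # local write returns immediately
        for peer in self.peers:
            peer.inbox.put((key, value))     # asynchronous replication

    def sync(self):
        """Apply queued peer updates; would run in a background thread."""
        while not self.inbox.empty():
            key, value = self.inbox.get()
            self.store[key] = value

site_a, site_b = MetadataServer("SiteA"), MetadataServer("SiteB")
site_a.peers, site_b.peers = [site_b], [site_a]
site_a.write("/petashare/run1.dat", {"size": 4096})
site_b.sync()                                # SiteB now sees the new entry
```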

Chapter 6
Data Intensive Computing with Clustered Chirp Servers .......................................... 140
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA

Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.
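
The bandwidth-aggregation idea behind such clustering is easy to sketch: stripe a large file round-robin across several commodity file servers so reads and writes proceed in parallel. The Python below is a hypothetical illustration, not Chirp's protocol; `servers` is assumed to be a list of objects exposing a `put(key, data)` call.

```python
# Hypothetical sketch of aggregating I/O bandwidth by striping one large
# file across several commodity file servers; Chirp's real protocol and
# tooling differ. Each server is assumed to expose put(key, data).
CHUNK = 64 * 1024 * 1024        # 64 MB stripe unit, an arbitrary choice

def write_striped(data, servers):
    """Round-robin fixed-size chunks of `data` across `servers`."""
    placements = []
    for i in range(0, len(data), CHUNK):
        server = servers[(i // CHUNK) % len(servers)]
        key = f"chunk-{i // CHUNK}"
        server.put(key, data[i:i + CHUNK])
        placements.append((server, key))    # kept so reads can reassemble
    return placements
```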

Section 3
Data & Workflow Management

Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows .......................................... 156
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on the type of workflows that have tasks processing huge amounts of data, usually in the range from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimize total makespan of the workflow, minimize cost and usage of network bandwidth, minimize cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the working of the Grid systems, providing insights on performance optimization of scientific applications dealing with data-intensive workloads.
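
One of the surveyed objectives, minimizing makespan while accounting for data movement, can be illustrated with a simple greedy list-scheduling heuristic. The sketch below is far simpler than the algorithms the chapter surveys; `runtime` and `transfer_time` are assumed, user-supplied estimator functions.

```python
# Greedy illustration of one surveyed objective: map each ready task to
# the resource with the earliest estimated finish time, where the estimate
# includes data transfer, not just computation. The surveyed systems use
# far richer heuristics; runtime() and transfer_time() are assumed inputs.
def schedule(tasks, resources, runtime, transfer_time):
    free_at = {r: 0.0 for r in resources}   # when each resource frees up
    plan = {}
    for t in tasks:
        finish = {r: free_at[r] + transfer_time(t, r) + runtime(t, r)
                  for r in resources}
        best = min(finish, key=finish.get)  # earliest finish wins
        plan[t] = best
        free_at[best] = finish[best]
    return plan                             # task -> resource mapping
```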

Chapter 8
Data Management in Scientific Workflows .......................................... 177
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase where the analysis is defined, the workflow planning phase where resources needed for execution are selected, the workflow execution part, where the actual computations take place, and the result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow cycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9
Replica Management in Data Intensive Distributed Science Applications .......................................... 188
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss the data replication solutions to meet the challenges associated with increasingly large data sets and the requirement to run analysis at geographically distributed sites.
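
A replica catalog of the kind the early strategies centered on reduces to a mapping from logical file names to physical replica locations. The Python below is a minimal sketch in that spirit; the API names are invented, not those of an actual service such as the Globus Replica Location Service.

```python
# Minimal replica catalog in the spirit of the early strategies described
# above: logical file names (LFNs) map to physical replica locations
# (PFNs). API names are invented for illustration.
class ReplicaCatalog:
    def __init__(self):
        self.mapping = {}                        # logical name -> set of URLs

    def register(self, lfn, pfn):
        self.mapping.setdefault(lfn, set()).add(pfn)

    def unregister(self, lfn, pfn):
        self.mapping.get(lfn, set()).discard(pfn)

    def lookup(self, lfn):
        return sorted(self.mapping.get(lfn, ()))

catalog = ReplicaCatalog()
catalog.register("lfn://exp1/run42.dat", "gsiftp://siteA/data/run42.dat")
catalog.register("lfn://exp1/run42.dat", "gsiftp://siteB/data/run42.dat")
print(catalog.lookup("lfn://exp1/run42.dat"))    # both replica locations
```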

Section 4
Data Discovery & Visualization

Chapter 10
Data Intensive Computing for Bioinformatics .......................................... 207
Judy Qiu, Indiana University Bloomington, USA
Jaliya Ekanayake, Indiana University Bloomington, USA
Thilina Gunarathne, Indiana University Bloomington, USA
Jong Youl Choi, Indiana University Bloomington, USA
Seung-Hee Bae, Indiana University Bloomington, USA
Yang Ruan, Indiana University Bloomington, USA
Saliya Ekanayake, Indiana University Bloomington, USA
Stephen Wu, Indiana University Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University Bloomington, USA
Mina Rho, Indiana University Bloomington, USA
Haixu Tang, Indiana University Bloomington, USA

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
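
The Iterative MapReduce pattern the authors advocate is easiest to see on k-means clustering, the textbook case: each iteration is one map (assign points to centroids) and one reduce (recompute centroids), with only the small centroid list carried between rounds. The sketch below runs in-process on 1-D points; an Iterative MapReduce runtime (e.g. Twister) would distribute these steps across a cluster.

```python
# Iterative MapReduce illustrated with k-means: one map and one reduce
# per iteration, re-broadcasting the small centroid list instead of
# reloading the large input data. In-process toy, 1-D points only.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # map: label each point with the index of its nearest centroid
        assign = [(min(range(len(centroids)),
                       key=lambda c: (points[i] - centroids[c]) ** 2), i)
                  for i in range(len(points))]
        # reduce: new centroid = mean of the points assigned to it
        for c in range(len(centroids)):
            members = [points[i] for cc, i in assign if cc == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids

print(kmeans([1.0, 1.2, 9.8, 10.0], [0.0, 5.0]))   # converges to ~[1.1, 9.9]
```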

Chapter 11
Visualization of Large-Scale Distributed Data .......................................... 242
Jason Leigh, University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Tom Peterka, Argonne National Laboratory, USA
Nicholas Schwarz, Northwestern University, USA

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly challenging because the size of the data is several orders of magnitude larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain through case studies approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.

Chapter 12
On-Demand Visualization on Scalable Shared Infrastructure .......................................... 275
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of The Pacific, USA
Jian Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which is serving computational scientists who want adequate visualization services provided to them on-demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that were able to achieve fault-tolerance and load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline driven (i.e. time-critical) visualization.
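
The flow-control mechanism, system-level back-off under a visualization deadline, can be sketched as follows. This is an invented illustration, not the authors' scheduler: the client retries a busy node with exponential back-off and jitter, capped so it never sleeps past the deadline.

```python
# Invented illustration of flow control via system-level back-off under a
# rendering deadline: retry a busy node with exponential back-off and
# jitter, but never sleep past the deadline itself.
import random
import time

def request_with_backoff(render, deadline, base=0.05, cap=1.0):
    """Call render() until it succeeds or `deadline` (monotonic) passes."""
    delay = base
    while time.monotonic() < deadline:
        try:
            return render()               # e.g. fetch one rendered subvolume
        except RuntimeError:              # node busy: back off and retry
            remaining = deadline - time.monotonic()
            time.sleep(max(0.0, min(delay + random.uniform(0, base),
                                    cap, remaining)))
            delay *= 2
    raise TimeoutError("deadline passed; degrade quality or reroute")

# usage: request_with_backoff(lambda: ask_node(region), time.monotonic() + 2.0)
# (ask_node and region are hypothetical stand-ins for a real render request)
```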

Compilation of References .......................................... 291

About the Contributors .......................................... 319

Index .......................................... 331
