Data Intensive Distributed Computing:
Challenges and Solutions for Large-Scale Information Management

Tevfik Kosar
State University of New York at Buffalo (SUNY), USA

Information Science Reference
Preface ..... xiii
Section 1
New Paradigms in Data Intensive Computing
Chapter 1
Data-Aware Distributed Computing
State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA
With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance, and data handling is no longer a task that can be considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become data-aware. In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data movement tasks via dynamic tuning of underlying protocol transfer parameters.
Chapter 2
Towards Data Intensive Many-Task Computing ..... 28
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating improvements in both performance and scalability.
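The online cache eviction policy at the core of data diffusion can be illustrated with a minimal sketch (illustrative only, not the chapter's actual implementation): each compute node keeps a fixed-capacity local cache of data objects and evicts the least recently used object when a new one arrives.

```python
from collections import OrderedDict

class NodeCache:
    """Minimal LRU cache sketch: a node keeps recently used data
    objects locally and evicts the least recently used when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # name -> data, oldest access first

    def get(self, name, fetch_remote):
        if name in self.store:               # cache hit: serve locally
            self.store.move_to_end(name)
            return self.store[name]
        data = fetch_remote(name)            # cache miss: pull from shared storage
        self.store[name] = data
        if len(self.store) > self.capacity:  # evict least recently used
            self.store.popitem(last=False)
        return data
```

As tasks repeatedly touch the same objects, hits are served from node-local storage and only misses reach the shared file system, which is the locality effect the chapter exploits.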
Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management ..... 74
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA
Service-oriented architectures compose loosely coupled functional software units to develop complex but agile applications. Data management on a data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors describe an architecture in which data management operations are implemented as micro-services, which communicate with each other using parameter exchange and, across the network, a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for scalable, distributed data management.
Section 2
Distributed Storage
Chapter 4
Distributed Storage Systems for Data Intensive Computing ..... 95
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5
Metadata Management in PetaShare Distributed Storage Network ..... 118
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA
The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of PetaShare, a distributed storage network that provides a unified name space across geographically distributed storage sites and provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high level cross-domain metadata schema to provide a structured systematic view of multiple science domains supported by PetaShare.
Chapter 6
Data Intensive Computing with Clustered Chirp Servers ..... 140
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA

Over the past few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.
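The capacity arithmetic behind clustering can be made concrete with a small sketch (the numbers below are illustrative, not figures from the chapter):

```python
def disks_needed(target_mb_per_s, per_disk_mb_per_s):
    """How many commodity disks must be clustered to sustain a target
    aggregate I/O rate, given the streaming throughput of one disk."""
    # Ceiling division: any partial disk rounds up to a whole disk.
    return -(-target_mb_per_s // per_disk_mb_per_s)

# For example, sustaining 10 GB/s from disks that each stream
# ~100 MB/s requires on the order of 100 disks, plus matching
# I/O paths and memory on the machines that host them.
```

This is the reasoning that makes clusters of commodity machines, each contributing disks and I/O paths, the cost-efficient route to petabyte-scale processing.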
Section 3
Data Management
Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows ..... 156
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia

Scientific applications are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on the type of workflows that have tasks processing huge amount of data, usually in the range from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimize total makespan of the workflow, minimize cost and usage of network bandwidth, minimize cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes techniques used in each of these systems for processing huge amount of data. A survey of workflow management techniques is useful for understanding the working of the Grid systems providing insights on performance optimization of scientific applications dealing with data-intensive workloads.
Chapter 8
Data Management in Scientific Workflows ..... 177
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, and others use workflow technologies to do large-scale data analysis. Workflows enable researchers to collaboratively design and manage complex analyses, and to access and generate large quantities of data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase where the analysis is defined, the workflow planning phase where resources needed for execution are selected, the workflow execution part, where the actual computations take place, and the result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow cycle. They describe challenge problems and illustrate them in the context of real-life applications, covering the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9
Replica Management in Data Intensive Distributed Science Applications ..... 188
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA
Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to meet the performance and reliability requirements of a range of scientific collaborations, and these systems have motivated interest in application-independent, policy-driven replica management. The authors discuss the data replication solutions to meet the challenges associated with increasingly large data sets and the requirement to run analysis at geographically distributed sites.
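At its simplest, a replica location catalog of the kind surveyed above maps a logical file name to the set of physical copies. The sketch below is a toy illustration; the names and methods are hypothetical, not the API of any particular catalog system.

```python
from collections import defaultdict

class ReplicaCatalog:
    """Toy replica catalog sketch: maps a logical file name to the
    physical locations (site URLs) that hold a copy of the file."""
    def __init__(self):
        self.entries = defaultdict(set)

    def register(self, logical_name, physical_url):
        """Record that a replica of logical_name exists at physical_url."""
        self.entries[logical_name].add(physical_url)

    def unregister(self, logical_name, physical_url):
        """Remove a replica location, e.g. after a site retires the copy."""
        self.entries[logical_name].discard(physical_url)

    def lookup(self, logical_name):
        """Return all known physical locations for a logical file."""
        return sorted(self.entries[logical_name])
```

A scheduler can then pick the "best" replica (nearest, least loaded) from the lookup result; keeping the catalog itself robust and scalable is exactly the problem the early systems described above focused on.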
Section 4
Data Intensive Applications

Chapter 10
Data Intensive Computing for Bioinformatics ..... 207
Judy Qiu, Indiana University Bloomington, USA
Jaliya Ekanayake, Indiana University Bloomington, USA
Thilina Gunarathne, Indiana University Bloomington, USA
Jong Youl Choi, Indiana University Bloomington, USA
Seung-Hee Bae, Indiana University Bloomington, USA
Yang Ruan, Indiana University Bloomington, USA
Saliya Ekanayake, Indiana University Bloomington, USA
Stephen Wu, Indiana University Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University Bloomington, USA
Haixu Tang, Indiana University Bloomington, USA
Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
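The MapReduce model the chapter builds on can be sketched in-process. This is a toy single-machine illustration of the programming model only; real runtimes distribute the map and reduce phases across many nodes.

```python
from collections import defaultdict
from itertools import chain

def mapreduce(records, mapper, reducer):
    """Tiny in-process sketch of the MapReduce model: map each record
    to (key, value) pairs, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Illustrative bioinformatics-flavored job: count 3-mers across DNA reads.
def kmer_mapper(read, k=3):
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def sum_reducer(key, values):
    return sum(values)
```

An iterative variant, as the chapter suggests, would feed the reduce output back into another map phase while keeping static data cached between iterations.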
Chapter 11
Visualization of Large-Scale Distributed Data ..... 242
University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Scientific data sets have grown larger than what can be managed on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain through case studies approaches for scaling them by leveraging distributed computing. Lastly they describe advanced display technologies that are now available for examining large-scale data.
Chapter 12
On-Demand Visualization ..... 275
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of The Pacific, USA
Jian Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA
The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which is serving computational scientists who want adequate visualization services provided to them on-demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that were able to achieve fault-tolerance and load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline driven (i.e. time-critical) visualization.
Compilation of References