
Parallel I/O Performance Study and Optimizations with HDF5,
A Scientific Data Package
MuQun Yang, Christian Chilan, Albert Cheng,
Quincey Koziol, Mike Folk, Leon Arber
The HDF Group
Champaign, IL 61820

Outline
Introduction to parallel I/O libraries
HDF5, netCDF, netCDF4
Parallel HDF5 and Parallel netCDF
performance comparison
Parallel netCDF4 and Parallel netCDF
performance comparison
Collective I/O optimizations inside HDF5
Conclusion

I/O Process for Parallel Applications

[Diagram: all processes (P0-P3) funnel their data through P0, which alone calls the I/O library and writes to the file system]
P0 becomes a bottleneck
May exceed system memory

[Diagram: each process (P0-P3) performs its own I/O through a separate I/O library instance]
May achieve good performance
Needs post-processing(?)
More work for applications

I/O Process for Parallel Applications

[Diagram: all processes (P0-P3) access a shared file through parallel I/O libraries on top of a parallel file system]

HDF5 vs. netCDF

HDF5 and netCDF both provide data formats and programming interfaces.

HDF5:
Hierarchical file structure
Flexible data model
Many features:
In-memory compression filters
Chunked storage
Parallel I/O through MPI-IO

NetCDF:
Linear data layout
A parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provides support for parallel access on top of MPI-IO

Overview of NetCDF4

Advantages:
New features provided by HDF5:
More than one unlimited dimension
Various compression filters
Complex datatypes such as compound (struct) or array datatypes
Parallel I/O through MPI-IO (see the sketch below)
NetCDF user-friendly APIs
Long-term maintenance and distribution
Potentially larger user community

Disadvantage:
Must install the HDF5 library
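As an illustration of the parallel access path, here is a minimal sketch of creating a netCDF-4 file through MPI-IO with the netCDF-4 C API of this era; the file, dimension, and variable names are hypothetical, and error checks are omitted for brevity:

/* Minimal sketch: parallel netCDF-4 file creation through MPI-IO.
 * File, dimension, and variable names are illustrative only. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>   /* nc_create_par, nc_var_par_access, NC_COLLECTIVE */

int main(int argc, char **argv)
{
    int ncid, dimid, varid, rank, nprocs;
    size_t start[1], count[1];
    double val;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* NC_NETCDF4 selects the HDF5-based format; the file is created
       collectively on the communicator. */
    nc_create_par("roms_history.nc", NC_NETCDF4 | NC_MPIIO,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    nc_def_dim(ncid, "x", (size_t)nprocs, &dimid);
    nc_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);

    /* Request collective (rather than independent) access for the variable. */
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);

    /* Each process writes its own element of the variable. */
    start[0] = (size_t)rank;
    count[0] = 1;
    val = (double)rank;
    nc_put_vara_double(ncid, varid, start, count, &val);

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}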

An Example for Collective I/O

Every processor has a noncontiguous selection.
Access requests are interleaved.
Write operation with 32 processors; each processor's selection has 512K rows and 8 columns (32 MB/proc.)
Independent I/O: 1,659.48 s
Collective I/O: 4.33 s

[Diagram: row-major data layout; within each row, the selections of P0-P3 are interleaved (P0 P1 P2 P3 in row 1, row 2, ...)]
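The gap between the two timings above comes down to one transfer-property setting in HDF5. A minimal sketch in C, assuming the dataset handle, dataspaces, and buffer are set up elsewhere:

/* Sketch: performing an HDF5 write collectively rather than independently.
   The dataset, memory dataspace, and file dataspace are assumed to have
   been created elsewhere (e.g., via H5Dcreate/H5Sselect_hyperslab). */
#include <hdf5.h>

herr_t write_collectively(hid_t dset, hid_t mem_space, hid_t file_space,
                          const double *buf)
{
    /* Dataset transfer property list: parallel HDF5 defaults to
       independent MPI-IO transfers; request a collective transfer. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* One write call covers this process's whole (noncontiguous)
       selection; MPI-IO merges the interleaved requests. */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, mem_space,
                             file_space, dxpl, buf);
    H5Pclose(dxpl);
    return status;
}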

Outline
Introduction to parallel I/O libraries
HDF5, netCDF, netCDF4
Parallel HDF5 and Parallel netCDF
performance comparison
Parallel netCDF4 and Parallel netCDF
performance comparison
Collective I/O optimizations inside HDF5
Conclusion

Parallel HDF5 and PnetCDF Performance Comparison

Previous study: PnetCDF claims higher performance than HDF5.
PnetCDF 1.0.1 vs. HDF5 1.6.5
Platforms:
NCAR Bluesky: Power 4
LLNL uP: Power 5

HDF5 and PnetCDF Performance Comparison

The benchmark is the I/O kernel of FLASH.
FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
Each processor handles 80 blocks and writes them into 3 output files.
The performance metric given by FLASH I/O is the parallel execution time.
The more processors, the larger the problem size.

Previous HDF5 and PnetCDF Performance Comparison on ASCI White

[Figure from the FLASH I/O website]

HDF5 and PnetCDF Performance Comparison

[Figure: Flash I/O Benchmark (checkpoint files), bandwidth (MB/s) vs. number of processors for PnetCDF and HDF5 independent I/O. Left: Bluesky (Power 4), 10-160 processors, y-axis up to 60 MB/s. Right: uP (Power 5), 10-310 processors, y-axis up to 2500 MB/s.]

HDF5 and PnetCDF Performance Comparison

[Figure: Flash I/O Benchmark (checkpoint files), bandwidth (MB/s) vs. number of processors for PnetCDF, HDF5 collective, and HDF5 independent I/O. Left: Bluesky (Power 4), 10-160 processors, y-axis up to 60 MB/s. Right: uP (Power 5), 10-310 processors, y-axis up to 2500 MB/s.]

ROMS

Regional Ocean Modeling System
Supports MPI and OpenMP
I/O in NetCDF
History file written in parallel
Data: 1D-4D double-precision float and integer arrays

PnetCDF4 and PnetCDF Performance Comparison

[Figure: bandwidth (MB/s, 0-160) vs. number of processors (0-144) for PnetCDF collective and NetCDF4 collective I/O. Fixed problem size = 995 MB.]

Performance of PnetCDF4 is close to PnetCDF.

ROMS Output with Parallel NetCDF4

[Figure: bandwidth (MB/s, 0-300) vs. number of processors (0-144) for output sizes 995 MB and 15.5 GB.]

I/O performance improves as the output file size increases.
Parallel NetCDF4 can provide decent I/O performance for large problem sizes.

Outline
Introduction to parallel I/O libraries
HDF5, netCDF, netCDF4
Parallel HDF5 and Parallel netCDF
performance comparison
Parallel netCDF4 and Parallel netCDF
performance comparison
Collective I/O optimizations inside HDF5
Conclusion


Improvements of Collective I/O Support inside HDF5

Advanced HDF5 feature: non-regular selections
Performance optimization: chunked storage
Provide several I/O options to achieve good collective I/O performance
Provide APIs for applications to participate in the optimization process

Improvement 1: HDF5 Non-regular Selections

[Diagram: 2-D array with the I/O covering shaded, non-contiguous selections]

Only one HDF5 I/O call is needed for the whole selection, which is good for collective I/O. A sketch of building such a selection follows.
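A minimal sketch, using the standard HDF5 hyperslab calls, of building one non-regular selection by OR-ing two rectangular blocks; the offsets and sizes are illustrative:

/* Sketch: building a non-regular selection in a 2-D dataspace by OR-ing
   two hyperslabs, so the whole irregular region can be serviced by a
   single (collective) H5Dwrite call. */
#include <hdf5.h>

void select_irregular_region(hid_t file_space)
{
    /* First rectangular block: rows 0-1, columns 0-3. */
    hsize_t start1[2] = {0, 0}, count1[2] = {2, 4};
    /* Second, non-adjacent block: rows 4-5, columns 6-7. */
    hsize_t start2[2] = {4, 6}, count2[2] = {2, 2};

    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start1, NULL, count1, NULL);
    /* OR in the second block: the union is one non-regular selection. */
    H5Sselect_hyperslab(file_space, H5S_SELECT_OR, start2, NULL, count2, NULL);
}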

HDF5 Chunked Storage

Required for extendable data variables
Required for filters
Better subsetting access time

For more information about chunking:
http://hdf.ncsa.uiuc.edu/UG41r3_html/Perform.fm2.html#149138

Performance issue:
Severe performance penalties with many small chunk I/Os
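A minimal sketch of enabling chunked storage through a dataset creation property list; the dataset name, extents, and chunk dimensions are illustrative, and the five-argument H5Dcreate matches the HDF5 1.6-era API used in this study:

/* Sketch: creating a chunked (and therefore extendable) 2-D dataset. */
#include <hdf5.h>

hid_t create_chunked_dataset(hid_t file)
{
    hsize_t dims[2]    = {1024, 1024};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 1024};  /* extendable => chunking required */
    hsize_t chunk[2]   = {64, 64};

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    /* Chunked layout: required for unlimited dimensions and filters, and
       better for subset access; chunks that are too small relative to each
       process's selection trigger the many-small-I/O penalty noted above. */
    H5Pset_chunk(dcpl, 2, chunk);

    return H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space, dcpl);
}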

Improvement 2: One Linked-chunk I/O

[Diagram: chunks 1-4, accessed by P0 and P1, are combined into one MPI-IO collective view]

One MPI collective I/O call covers all the chunks.

Improvement 3: Multi-chunk I/O Optimization

We have to keep the option of doing collective I/O per chunk, because of:
Collective I/O bugs inside different MPI-IO packages
Limitations of system memory

Problem: bad performance caused by improper use of collective I/O.

[Diagram: two access patterns of processes P0-P7 over chunks; when only a few processes touch each chunk, HDF5 just uses independent I/O]

Improvement 4

Problem: HDF5 may not have enough information to make the correct decision about how to do collective I/O.

Solution: provide APIs for applications to participate in the decision-making process, as sketched below.
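A sketch of the kind of steering this enables, assuming the H5Pset_dxpl_mpio_chunk_opt family of transfer-property calls that parallel HDF5 exposes for this purpose; the threshold values are illustrative:

/* Sketch: steering HDF5's collective-chunk decisions through the
   dataset transfer property list. */
#include <hdf5.h>

hid_t make_tuned_dxpl(void)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* Option A: override HDF5's decision outright; H5FD_MPIO_CHUNK_ONE_IO
       forces one linked-chunk I/O, H5FD_MPIO_CHUNK_MULTI_IO forces the
       per-chunk path:
       H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);          */

    /* Option B: keep HDF5's decision logic but feed it thresholds.
       Threshold number of chunks per process at which HDF5 switches
       to one linked-chunk I/O ... */
    H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);
    /* ... and do a chunk's I/O collectively only when at least this
       percentage of processes access that chunk. */
    H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 50);

    return dxpl;
}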

Flow Chart of the Collective Chunk I/O Improvements inside HDF5

[Flow chart: in collective chunk mode, HDF5 first decides, with optional user input, whether to do one linked-chunk I/O. Yes: one collective I/O call for all chunks. No: multi-chunk I/O, where HDF5 decides per chunk, again with optional user input, whether to do collective I/O. Yes: collective I/O per chunk. No: independent I/O.]

For details about the performance study and optimizations inside HDF5:
http://hdf.ncsa.uiuc.edu/HDF5/papers/papers/ParallelIO/ParallelPerformance.pdf
http://hdf.ncsa.uiuc.edu/HDF5/papers/papers/ParallelIO/HDF5-CollectiveChunkIO.pdf

Conclusions

HDF5 provides collective I/O support for non-regular selections.
Supporting collective I/O for chunked storage is not trivial; users can participate in the decision-making process that selects among the different I/O options.
I/O performance is quite comparable when the parallel NetCDF and parallel HDF5 libraries are used in similar manners.
I/O performance of parallel NetCDF4 is comparable with parallel NetCDF, with about a 15% slowdown on average for the output of the ROMS history file. We suspect the slowdown is due to the software overhead of passing information from parallel NetCDF4 to HDF5.

Acknowledgments

This work is funded by National Science Foundation TeraGrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.
