Implementation Overview

The basic outline of mppnccombine-fast consists of one “Writer” rank and one or more “Reader” ranks. The Writer rank handles all writing to the output file, while the Reader ranks read in data from the many files to be collated and send the data to the Writer rank.

The main slowdown when copying compressed variables is that the HDF5 library has to de-compress them during the read and re-compress them during the write. mppnccombine-fast works around this by using HDF5 1.10.2’s direct chunk IO functions H5DOwrite_chunk() and H5DOread_chunk() to copy the compressed data from one file to the other directly, rather than going through the de-compress/re-compress cycle.
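A minimal sketch of what a direct chunk copy looks like with these calls, assuming both datasets are already open in HDF5 mode (error handling is trimmed and the helper name is illustrative, not the tool’s own):

/* Sketch: copy one compressed chunk between two open HDF5 datasets without
 * de-compressing it. The input and output offsets differ in general, since a
 * chunk from an input tile lands at a different position in the collated output. */
#include <hdf5.h>
#include <hdf5_hl.h>
#include <stdint.h>
#include <stdlib.h>

herr_t copy_one_chunk(hid_t in_dset, hid_t out_dset,
                      const hsize_t *in_offset, const hsize_t *out_offset) {
    /* Size of the compressed chunk as stored on disk */
    hsize_t chunk_bytes = 0;
    if (H5Dget_chunk_storage_size(in_dset, in_offset, &chunk_bytes) < 0)
        return -1;

    void *buf = malloc(chunk_bytes);
    if (buf == NULL)
        return -1;

    /* Read the raw, still-compressed chunk bytes ... */
    uint32_t filter_mask = 0;
    herr_t err = H5DOread_chunk(in_dset, H5P_DEFAULT, in_offset, &filter_mask, buf);

    /* ... and write them straight into the output dataset, filters untouched */
    if (err >= 0)
        err = H5DOwrite_chunk(out_dset, H5P_DEFAULT, filter_mask, out_offset,
                              (size_t)chunk_bytes, buf);

    free(buf);
    return err;
}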

To get an even larger speedup MPI is used to run separate read and write processes, since HDF5 IO calls are blocking.
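As a rough illustration of the process split (the function names in the comments are placeholders, not the tool’s actual entry points):

/* Sketch: rank 0 becomes the Writer, every other rank becomes a Reader. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Writer: owns the output file and services write requests */
        /* run_writer(argc, argv); */
    } else {
        /* Reader: reads input files and ships chunks to the Writer */
        /* run_reader(argc, argv); */
    }

    MPI_Finalize();
    return 0;
}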

Since the NetCDF4 library is much nicer to use but doesn’t provide public access to the underlying HDF5 file, we need to play a bit of musical chairs with the files, swapping between NetCDF4 and HDF5 modes by re-opening them.
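The swap itself is just a close followed by a re-open of the same path with the other library, roughly (illustrative only, error handling omitted):

/* Sketch: finish NetCDF-level work on a file, then re-open it with the
 * plain HDF5 API to get at the raw chunks. */
#include <netcdf.h>
#include <hdf5.h>

hid_t reopen_as_hdf5(int ncid, const char *path) {
    /* Close the NetCDF handle first so HDF5 has sole access to the file */
    nc_close(ncid);

    /* Re-open with HDF5 for direct chunk reads/writes */
    return H5Fopen(path, H5F_ACC_RDWR, H5P_DEFAULT);
}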

digraph ranks {
    subgraph cluster_writer {
        copy_uncollated -> async_loop -> all_ended

        label = "Writer"
        copy_uncollated [label = "Copy Uncollated Vars"]
        async_loop [label = "Async Write Loop"]
        all_ended [label = "All Readers Ended"]
        rank = same
    }

    subgraph cluster_reader {
        open_file -> send_chunks -> end_of_files

        label = "Reader"
        open_file [label = "Open File"]
        send_chunks [label = "Send Collated Chunks"]
        end_of_files [label = "All Files Read"]
        rank = same
    }

    send_chunks:s -> open_file:e [constraint=false]

    async_loop:s -> async_loop:n [constraint=false]

    send_chunks -> async_loop [constraint=false]
    end_of_files -> async_loop [constraint=false]
}

Writer Rank

The Writer starts out by copying the dimensions, attributes and any uncollated variables from the first of the listed input files using the NetCDF API in init() and copy_contiguous(). It then re-opens the file using the HDF5 API and enters the ‘Async Write Loop’ in run_async_writer().
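A simplified sketch of the metadata copy (the real init() and copy_contiguous() do more than this, for instance sizing the collated dimensions to the full output extent; the helper below is illustrative):

/* Sketch: copy dimensions and global attributes from the first input file
 * to the output file using the NetCDF C API. */
#include <netcdf.h>
#include <stdio.h>
#include <stdlib.h>

static void check(int err) {
    if (err != NC_NOERR) {
        fprintf(stderr, "NetCDF error: %s\n", nc_strerror(err));
        exit(EXIT_FAILURE);
    }
}

void copy_dims_and_global_atts(int in_ncid, int out_ncid) {
    int ndims, nvars, natts, unlimdim;
    check(nc_inq(in_ncid, &ndims, &nvars, &natts, &unlimdim));

    /* Copy every dimension, keeping the unlimited (record) dimension unlimited */
    for (int d = 0; d < ndims; d++) {
        char name[NC_MAX_NAME + 1];
        size_t len;
        int out_dimid;
        check(nc_inq_dim(in_ncid, d, name, &len));
        check(nc_def_dim(out_ncid, name,
                         d == unlimdim ? NC_UNLIMITED : len, &out_dimid));
    }

    /* Copy global attributes verbatim */
    for (int a = 0; a < natts; a++) {
        char name[NC_MAX_NAME + 1];
        check(nc_inq_attname(in_ncid, NC_GLOBAL, a, name));
        check(nc_copy_att(in_ncid, NC_GLOBAL, name, out_ncid, NC_GLOBAL));
    }
}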

This loop polls for any incoming MPI messages from the Reader processes, then performs the corresponding action (e.g. writing a compressed chunk directly to the output file at some location). Once a Reader has finished reading all of its input files it sends a close message to the Writer rank; once all close messages have been received the Writer rank closes the output file and exits.

Reader Ranks

The Readers distribute input files amongst themselves using a shared atomic counter. When a Reader is ready for a new file it gets the next value from the counter, then in copy_chunked() opens that file using NetCDF to query its attributes and to discover and copy the collated variables.
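One way to build such a counter is with MPI one-sided operations, hosting the value on a single rank and incrementing it atomically with MPI_Fetch_and_op (a sketch of the idea; the tool’s actual implementation may differ in detail):

/* Sketch: a shared atomic counter for handing out input file indices. */
#include <mpi.h>

/* Collective: create the counter window, hosting the value on host_rank. */
MPI_Win make_counter(MPI_Comm comm, int host_rank) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    int *baseptr = NULL;
    MPI_Win win;
    MPI_Win_allocate(rank == host_rank ? sizeof(int) : 0, sizeof(int),
                     MPI_INFO_NULL, comm, &baseptr, &win);

    if (rank == host_rank) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, host_rank, 0, win);
        *baseptr = 0;                      /* first file index */
        MPI_Win_unlock(host_rank, win);
    }
    MPI_Barrier(comm);                     /* counter is initialised for everyone */
    return win;
}

/* Atomically fetch the next file index and bump the counter. */
int next_file_index(MPI_Win win, int host_rank) {
    int next, one = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, host_rank, 0, win);
    MPI_Fetch_and_op(&one, &next, MPI_INT, host_rank, 0, MPI_SUM, win);
    MPI_Win_unlock(host_rank, win);
    return next;
}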

Depending on the chunking and alignment of the file, the Reader decides either to copy the data in uncompressed form using NetCDF with copy_netcdf_variable_chunks(), or to copy the compressed chunks directly by re-opening the file in HDF5 mode with copy_hdf5_variable_chunks().
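The exact criteria are in the tool itself, but the gist is that a direct chunk copy is only safe when the input chunks map cleanly onto chunk boundaries in the output, roughly along these lines (illustrative only):

/* Sketch: decide whether a variable's compressed chunks can be copied
 * directly, or whether it must go through a decompress/recompress copy. */
#include <stdbool.h>
#include <stddef.h>

bool can_copy_chunks_directly(int ndims,
                              const size_t *in_chunk_shape,
                              const size_t *out_chunk_shape,
                              const size_t *out_offset) {
    for (int d = 0; d < ndims; d++) {
        /* Chunk shapes must match between input and output ... */
        if (in_chunk_shape[d] != out_chunk_shape[d])
            return false;
        /* ... and the tile must start on a chunk boundary in the output */
        if (out_offset[d] % out_chunk_shape[d] != 0)
            return false;
    }
    return true;
}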

Once all available files have been read the Reader sends a final close message to the Writer and exits.

The Async Write Loop

The Async Write Loop is set up to handle a number of message types that the Readers will send to the Writer.

The Writer asynchronously polls for these messages in run_async_writer(), then actions them in receive_open_variable_async() etc.
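A stripped-down sketch of such a loop, using hypothetical message tags and trivial payloads rather than the tool’s actual message types:

/* Sketch: the Writer polls for Reader messages until every Reader has sent
 * its close message, then falls out of the loop to close the output file. */
#include <mpi.h>
#include <stdlib.h>

enum { TAG_WRITE_CHUNK = 1, TAG_CLOSE = 2 };   /* hypothetical tags */

void async_write_loop(MPI_Comm comm, int n_readers) {
    int open_readers = n_readers;

    while (open_readers > 0) {
        int flag = 0;
        MPI_Status status;

        /* Poll for a message from any Reader */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (!flag)
            continue;                      /* nothing waiting yet */

        switch (status.MPI_TAG) {
        case TAG_WRITE_CHUNK: {
            /* Size and receive the compressed payload, then write it out,
             * e.g. with H5DOwrite_chunk() as sketched earlier */
            int nbytes;
            MPI_Get_count(&status, MPI_BYTE, &nbytes);
            void *buf = malloc(nbytes);
            MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE, TAG_WRITE_CHUNK,
                     comm, MPI_STATUS_IGNORE);
            /* ... locate the target dataset/offset and write buf ... */
            free(buf);
            break;
        }
        case TAG_CLOSE:
            /* This Reader has run out of input files */
            MPI_Recv(NULL, 0, MPI_BYTE, status.MPI_SOURCE, TAG_CLOSE,
                     comm, MPI_STATUS_IGNORE);
            open_readers--;
            break;
        }
    }
    /* All Readers have finished; the caller can now close the output file */
}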