Implementation Overview

The basic outline of mppnccombine-fast consists of one “Writer” rank and one or more “Reader” ranks. The Writer rank handles all writing to the output file, while the Reader ranks read in data from the many files to be collated and send the data to the Writer rank.

The main slowdown when copying compressed variables is that the HDF5 library has to de-compress them during the read and re-compress them during the write. mppnccombine-fast works around this by using HDF5 1.10.2’s direct chunk IO functions H5DOwrite_chunk() and H5DOread_chunk() to copy the compressed data from one file to the other directly, rather than going through the de-compress/re-compress cycle.
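A minimal sketch of what a direct chunk copy looks like with these calls, assuming both datasets are already open in HDF5 mode (error handling is trimmed and the helper name is illustrative, not the tool’s own):

/* Sketch: copy one compressed chunk between two open HDF5 datasets without
 * de-compressing it. The input and output offsets differ in general, since a
 * chunk from an input tile lands at a different position in the collated output. */
#include <hdf5.h>
#include <hdf5_hl.h>
#include <stdint.h>
#include <stdlib.h>

herr_t copy_one_chunk(hid_t in_dset, hid_t out_dset,
                      const hsize_t *in_offset, const hsize_t *out_offset) {
    /* Size of the compressed chunk as stored on disk */
    hsize_t chunk_bytes = 0;
    if (H5Dget_chunk_storage_size(in_dset, in_offset, &chunk_bytes) < 0)
        return -1;

    void *buf = malloc(chunk_bytes);
    if (buf == NULL)
        return -1;

    /* Read the raw, still-compressed chunk bytes ... */
    uint32_t filter_mask = 0;
    herr_t err = H5DOread_chunk(in_dset, H5P_DEFAULT, in_offset, &filter_mask, buf);

    /* ... and write them straight into the output dataset, filters untouched */
    if (err >= 0)
        err = H5DOwrite_chunk(out_dset, H5P_DEFAULT, filter_mask, out_offset,
                              (size_t)chunk_bytes, buf);

    free(buf);
    return err;
}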

To get an even larger speedup MPI is used to run separate read and write processes, since HDF5 IO calls are blocking.
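As a rough illustration of the process split (the function names in the comments are placeholders, not the tool’s actual entry points):

/* Sketch: rank 0 becomes the Writer, every other rank becomes a Reader. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Writer: owns the output file and services write requests */
        /* run_writer(argc, argv); */
    } else {
        /* Reader: reads input files and ships chunks to the Writer */
        /* run_reader(argc, argv); */
    }

    MPI_Finalize();
    return 0;
}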

Since the NetCDF4 library is much nicer to use but doesn’t provide public access to the underlying HDF5 file, we need to play a bit of musical chairs with the files, swapping between NetCDF4 and HDF5 modes by re-opening them.
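The swap itself is just a close followed by a re-open of the same path with the other library, roughly (illustrative only, error handling omitted):

/* Sketch: finish NetCDF-level work on a file, then re-open it with the
 * plain HDF5 API to get at the raw chunks. */
#include <netcdf.h>
#include <hdf5.h>

hid_t reopen_as_hdf5(int ncid, const char *path) {
    /* Close the NetCDF handle first so HDF5 has sole access to the file */
    nc_close(ncid);

    /* Re-open with HDF5 for direct chunk reads/writes */
    return H5Fopen(path, H5F_ACC_RDWR, H5P_DEFAULT);
}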

digraph ranks {
    subgraph cluster_writer {
        copy_uncollated -> async_loop -> all_ended

        label = "Writer"
        copy_uncollated [label = "Copy Uncollated Vars"]
        async_loop [label = "Async Write Loop"]
        all_ended [label = "All Readers Ended"]
        rank = same
    }

    subgraph cluster_reader {
        open_file -> send_chunks -> end_of_files

        label = "Reader"
        open_file [label = "Open File"]
        send_chunks [label = "Send Collated Chunks"]
        end_of_files [label = "All Files Read"]
        rank = same
    }

    send_chunks:s -> open_file:e [constraint=false]

    async_loop:s -> async_loop:n [constraint=false]

    send_chunks -> async_loop [constraint=false]
    end_of_files -> async_loop [constraint=false]
}

Writer Rank

The Writer starts out by copying the dimensions, attributes and any uncollated variables from the first of the listed input files using the NetCDF API in init() and copy_contiguous(). It then re-opens the file using the HDF5 API and enters the ‘Async Write Loop’ in run_async_writer().
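A simplified sketch of the metadata copy (the real init() and copy_contiguous() do more than this, for instance sizing the collated dimensions to the full output extent; the helper below is illustrative):

/* Sketch: copy dimensions and global attributes from the first input file
 * to the output file using the NetCDF C API. */
#include <netcdf.h>
#include <stdio.h>
#include <stdlib.h>

static void check(int err) {
    if (err != NC_NOERR) {
        fprintf(stderr, "NetCDF error: %s\n", nc_strerror(err));
        exit(EXIT_FAILURE);
    }
}

void copy_dims_and_global_atts(int in_ncid, int out_ncid) {
    int ndims, nvars, natts, unlimdim;
    check(nc_inq(in_ncid, &ndims, &nvars, &natts, &unlimdim));

    /* Copy every dimension, keeping the unlimited (record) dimension unlimited */
    for (int d = 0; d < ndims; d++) {
        char name[NC_MAX_NAME + 1];
        size_t len;
        int out_dimid;
        check(nc_inq_dim(in_ncid, d, name, &len));
        check(nc_def_dim(out_ncid, name,
                         d == unlimdim ? NC_UNLIMITED : len, &out_dimid));
    }

    /* Copy global attributes verbatim */
    for (int a = 0; a < natts; a++) {
        char name[NC_MAX_NAME + 1];
        check(nc_inq_attname(in_ncid, NC_GLOBAL, a, name));
        check(nc_copy_att(in_ncid, NC_GLOBAL, name, out_ncid, NC_GLOBAL));
    }
}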

This loop polls for any incoming MPI messages from the Reader processes, then performs the corresponding action (e.g. writing a compressed chunk directly to the output file at some location). Once a Reader has finished reading all of its input files it sends a close message to the Writer rank; once all close messages have been received the Writer rank closes the output file and exits.

Reader Ranks

The Readers distribute input files amongst themselves using a shared atomic counter. When a Reader is ready for a new file it gets the next value from the counter, then in copy_chunked() opens that file using NetCDF to query its attributes and to discover and copy the collated variables.
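One way to build such a counter is with MPI one-sided operations, hosting the value on a single rank and incrementing it atomically with MPI_Fetch_and_op (a sketch of the idea; the tool’s actual implementation may differ in detail):

/* Sketch: a shared atomic counter for handing out input file indices. */
#include <mpi.h>

/* Collective: create the counter window, hosting the value on host_rank. */
MPI_Win make_counter(MPI_Comm comm, int host_rank) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    int *baseptr = NULL;
    MPI_Win win;
    MPI_Win_allocate(rank == host_rank ? sizeof(int) : 0, sizeof(int),
                     MPI_INFO_NULL, comm, &baseptr, &win);

    if (rank == host_rank) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, host_rank, 0, win);
        *baseptr = 0;                      /* first file index */
        MPI_Win_unlock(host_rank, win);
    }
    MPI_Barrier(comm);                     /* counter is initialised for everyone */
    return win;
}

/* Atomically fetch the next file index and bump the counter. */
int next_file_index(MPI_Win win, int host_rank) {
    int next, one = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, host_rank, 0, win);
    MPI_Fetch_and_op(&one, &next, MPI_INT, host_rank, 0, MPI_SUM, win);
    MPI_Win_unlock(host_rank, win);
    return next;
}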

Depending on the chunking and alignment of the file, the Reader decides either to copy the data in uncompressed form using NetCDF with copy_netcdf_variable_chunks(), or to copy the compressed chunks directly by re-opening the file in HDF5 mode with copy_hdf5_variable_chunks().
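The exact criteria are in the tool itself, but the gist is that a direct chunk copy is only safe when the input chunks map cleanly onto chunk boundaries in the output, roughly along these lines (illustrative only):

/* Sketch: decide whether a variable's compressed chunks can be copied
 * directly, or whether it must go through a decompress/recompress copy. */
#include <stdbool.h>
#include <stddef.h>

bool can_copy_chunks_directly(int ndims,
                              const size_t *in_chunk_shape,
                              const size_t *out_chunk_shape,
                              const size_t *out_offset) {
    for (int d = 0; d < ndims; d++) {
        /* Chunk shapes must match between input and output ... */
        if (in_chunk_shape[d] != out_chunk_shape[d])
            return false;
        /* ... and the tile must start on a chunk boundary in the output */
        if (out_offset[d] % out_chunk_shape[d] != 0)
            return false;
    }
    return true;
}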

Once all available files have been read the Reader sends a final close message to the Writer and exits.

The Async Write Loop

The Async Write Loop is set up to handle a number of message types that the Readers will send to the Writer.

The Writer asynchronously polls for these messages in run_async_writer(), then actions them in receive_open_variable_async() etc.
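A stripped-down sketch of such a loop, using hypothetical message tags and trivial payloads rather than the tool’s actual message types:

/* Sketch: the Writer polls for Reader messages until every Reader has sent
 * its close message, then falls out of the loop to close the output file. */
#include <mpi.h>
#include <stdlib.h>

enum { TAG_WRITE_CHUNK = 1, TAG_CLOSE = 2 };   /* hypothetical tags */

void async_write_loop(MPI_Comm comm, int n_readers) {
    int open_readers = n_readers;

    while (open_readers > 0) {
        int flag = 0;
        MPI_Status status;

        /* Poll for a message from any Reader */
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (!flag)
            continue;                      /* nothing waiting yet */

        switch (status.MPI_TAG) {
        case TAG_WRITE_CHUNK: {
            /* Size and receive the compressed payload, then write it out,
             * e.g. with H5DOwrite_chunk() as sketched earlier */
            int nbytes;
            MPI_Get_count(&status, MPI_BYTE, &nbytes);
            void *buf = malloc(nbytes);
            MPI_Recv(buf, nbytes, MPI_BYTE, status.MPI_SOURCE, TAG_WRITE_CHUNK,
                     comm, MPI_STATUS_IGNORE);
            /* ... locate the target dataset/offset and write buf ... */
            free(buf);
            break;
        }
        case TAG_CLOSE:
            /* This Reader has run out of input files */
            MPI_Recv(NULL, 0, MPI_BYTE, status.MPI_SOURCE, TAG_CLOSE,
                     comm, MPI_STATUS_IGNORE);
            open_readers--;
            break;
        }
    }
    /* All Readers have finished; the caller can now close the output file */
}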