This is the specification for a molecular dynamics trajectory file format based on HDF5 which is supported by the MDTraj package.
In storing MD trajectory data for purposes including very large scale analysis, there are a few design goals. (1) The trajectories should be small and space efficient on disk. (2) The trajectories should be fast to write and fast to read. (3) The data format should support flexible read options. For instance, random access to different frames in the trajectory should be possible. It should be possible to easily query the dimensions of the trajectory (n_frames, n_atoms, etc.) without loading the file into memory. It should be possible to load only every n-th frame, or to directly load only a subset of the atoms, with limited memory overhead. (4) The trajectory format should be easily extensible in a backward compatible manner. For instance, it should be possible to add new arrays/fields like the potential energy or the topology without breaking backwards compatibility.
Currently, MDTraj is able to read and write trajectories in DCD, XTC, TRR, BINPOS, and AMBER NetCDF formats, in addition to HDF5. This presents an opportunity to compare these formats and see how they fit our design goals. The most space-efficient is XTC, because it uses 16-bit fixed-precision encoding. For some reason, though, the XTC read times are quite slow. DCD is fast to read and write, but relatively inflexible. NetCDF is fast and flexible. BINPOS and MDCRD are garbage: they’re neither fast, small, nor flexible.
Of the formats we currently have, AMBER NetCDF is the best, in that it satisfies all of the design goals except for the first. But the trajectories are twice as big on disk as XTC, which is really quite unfortunate. For dealing with large data sets, size matters. So let’s define an HDF5 standard that has the benefits of AMBER NetCDF and the benefits of XTC mixed together. We’ll use an extensible data format (HDF5), we’ll provide options for lossy and lossless compression, and we’ll store the topology inside the trajectory, so that a single trajectory file always contains the information needed to understand (and visualize) the system.
This specification is heavily influenced by the AMBER NetCDF standard. Significant portions of the text are copied verbatim.
Creators may extend this format by adding new arrays. Arrays containing per-atom and per-frame data that naturally possesses physical units should declare those units explicitly in the array attributes. Readers should be flexible, ignoring the presence of arrays that they are not equipped to handle.
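The reader behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for the contents of an HDF5 file, and the array names, the `units` attribute name, and the helper `read_known_arrays` are all hypothetical rather than taken from this specification.

```python
# Hypothetical set of array names this reader knows how to interpret.
KNOWN_ARRAYS = {"coordinates", "time"}


def read_known_arrays(arrays):
    """Keep the arrays this reader understands and skip the rest.

    `arrays` maps an array name to a (data, attributes) pair. It is a
    plain-dict stand-in for the contents of a real HDF5 file.
    """
    loaded = {}
    for name, (data, attrs) in arrays.items():
        if name not in KNOWN_ARRAYS:
            continue  # unknown extension array: ignore it, do not fail
        # "units" is an assumed attribute name for the declared units.
        loaded[name] = {"data": data, "units": attrs.get("units")}
    return loaded


# Illustrative contents, including one array the reader does not know.
file_contents = {
    "coordinates": ([[0.0, 0.1, 0.2]], {"units": "nanometers"}),
    "time": ([0.0], {"units": "picoseconds"}),
    "my_custom_charges": ([0.5], {"units": "elementary_charge"}),
}
loaded = read_known_arrays(file_contents)
```

The point of the design is that an unrecognized array is skipped silently instead of raising an error, so files written by newer creators stay readable by older readers.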
It is our experience that not having the topology stored in the same file as the trajectory’s coordinate data is a pain: it’s just really inconvenient. And generally, the trajectories are long enough that it doesn’t take up much incremental storage space to store the topology in there too. The topology is not that complicated.
The topology will be stored in JSON. The JSON will then be serialized as a string and stored in the HDF5 file with an ASCII encoding.
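As a minimal sketch of this serialization step, the following Python uses only the standard `json` module. The helper names `encode_topology` and `decode_topology` are hypothetical, and writing the resulting bytes into the HDF5 file itself is left out.

```python
import json


def encode_topology(topology):
    """Serialize a topology dict to an ASCII-encoded JSON byte string."""
    # ensure_ascii=True (the default) escapes any non-ASCII characters,
    # so the subsequent ASCII encoding cannot fail.
    return json.dumps(topology, ensure_ascii=True).encode("ascii")


def decode_topology(raw):
    """Inverse of encode_topology: ASCII bytes back to a dict."""
    return json.loads(raw.decode("ascii"))


# A deliberately tiny topology: one chain, one residue, one atom.
topology = {
    "chains": [
        {
            "index": 0,
            "residues": [
                {
                    "index": 0,
                    "resSeq": 1,
                    "name": "ACE",
                    "atoms": [{"index": 0, "name": "H1", "element": "H"}],
                }
            ],
        }
    ],
    "bonds": [],
}

raw = encode_topology(topology)
restored = decode_topology(raw)
```

Because the topology consists only of strings, integers, lists, and dictionaries, the round trip through JSON is lossless.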
The topology stores a hierarchical description of the chains, residues, and atoms in the system. Each chain is associated with an index and a list of residues. Each residue is associated with a name, an index, a resSeq index (not zero-indexed), and a list of atoms. Each atom is associated with a name, an element, and an index. All of the indices except resSeq should be zero-based.
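The hierarchy can be walked with three nested loops. The sketch below is illustrative only: the helper names `iter_atoms` and `check_indices` are hypothetical, and the check assumes that chain, residue, and atom indices each count up consecutively from zero across the whole system, which is stricter than the zero-based requirement stated above.

```python
def iter_atoms(topology):
    """Yield (chain, residue, atom) dicts in the order they are stored."""
    for chain in topology["chains"]:
        for residue in chain["residues"]:
            for atom in residue["atoms"]:
                yield chain, residue, atom


def check_indices(topology):
    """Return True if chain, residue, and atom indices are 0, 1, 2, ...

    Assumption: indices are consecutive across the whole system.
    resSeq is deliberately not checked, since it is not zero-based.
    """
    chains = [c["index"] for c in topology["chains"]]
    residues = [r["index"] for c in topology["chains"] for r in c["residues"]]
    atoms = [a["index"] for _, _, a in iter_atoms(topology)]
    return all(idx == list(range(len(idx))) for idx in (chains, residues, atoms))


# Small two-residue example with made-up contents.
good = {
    "chains": [
        {
            "index": 0,
            "residues": [
                {"index": 0, "resSeq": 1, "name": "ACE",
                 "atoms": [{"index": 0, "name": "C", "element": "C"},
                           {"index": 1, "name": "O", "element": "O"}]},
                {"index": 1, "resSeq": 2, "name": "ALA",
                 "atoms": [{"index": 2, "name": "N", "element": "N"}]},
            ],
        }
    ],
    "bonds": [],
}

# The same layout, but with a single atom whose index starts at 1.
bad = {
    "chains": [
        {"index": 0,
         "residues": [{"index": 0, "resSeq": 1, "name": "ACE",
                       "atoms": [{"index": 1, "name": "C", "element": "C"}]}]}
    ],
    "bonds": [],
}

good_ok = check_indices(good)
bad_ok = check_indices(bad)
atom_names = [atom["name"] for _, _, atom in iter_atoms(good)]
```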
The name of a residue is not strictly prescribed, but should generally follow PDB 3.0 nomenclature. The element of an atom shall be one of the one- or two-letter element abbreviations from the periodic table. The name of an atom shall indicate some information about the type of the atom beyond just its element, such as ‘CA’ for the alpha carbon, ‘HG’ for a gamma hydrogen, etc. This format does not specify exactly what atom names are allowed; creators should follow the conventions from the forcefield they are using.
In addition to the chains, the topology shall also contain a list of the bonds. The bonds shall be a list of length-2 lists of integers, where the integers refer to the indices of the two atoms that are bonded.
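A reader might want to confirm that every bond is a pair of integers naming atoms that actually exist before using the list. The sketch below shows one way to do that and to turn the bond list into a neighbor table; the function name `bonded_neighbors` and the choice to raise `ValueError` on a malformed bond are assumptions, not part of this specification.

```python
def bonded_neighbors(topology):
    """Validate the bond list and return {atom index: sorted neighbor indices}."""
    # Collect the atom indices that are actually present in the chains.
    atom_indices = {
        atom["index"]
        for chain in topology["chains"]
        for residue in chain["residues"]
        for atom in residue["atoms"]
    }
    neighbors = {index: set() for index in atom_indices}
    for bond in topology["bonds"]:
        if len(bond) != 2 or not all(isinstance(i, int) for i in bond):
            raise ValueError(f"bond must be a pair of integers: {bond!r}")
        i, j = bond
        if i not in atom_indices or j not in atom_indices:
            raise ValueError(f"bond refers to an unknown atom: {bond!r}")
        # Bonds are undirected, so record both directions.
        neighbors[i].add(j)
        neighbors[j].add(i)
    return {index: sorted(bonded) for index, bonded in neighbors.items()}


# Made-up three-atom fragment: atom 1 is bonded to atoms 0 and 2.
fragment = {
    "chains": [
        {"index": 0,
         "residues": [{"index": 0, "resSeq": 1, "name": "ACE",
                       "atoms": [{"index": 0, "name": "H1", "element": "H"},
                                 {"index": 1, "name": "CH3", "element": "C"},
                                 {"index": 2, "name": "H2", "element": "H"}]}]}
    ],
    "bonds": [[1, 0], [1, 2]],
}
table = bonded_neighbors(fragment)

# A bond pointing at a nonexistent atom should be rejected.
broken = dict(fragment, bonds=[[0, 5]])
try:
    bonded_neighbors(broken)
    rejected = False
except ValueError:
    rejected = True
```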
The following shows the topology of alanine dipeptide in this format. Since it’s JSON, the whitespace is optional and just for readability.
{
  "bonds": [
    [4, 1], [4, 5], [1, 0], [1, 2], [1, 3], [4, 6], [14, 8],
    [14, 15], [8, 10], [8, 9], [8, 6], [10, 11], [10, 12], [10, 13],
    [7, 6], [14, 16], [18, 19], [18, 20], [18, 21], [18, 16], [17, 16]
  ],
  "chains": [
    {
      "index": 0,
      "residues": [
        {
          "atoms": [
            {"element": "H", "index": 0, "name": "H1"},
            {"element": "C", "index": 1, "name": "CH3"},
            {"element": "H", "index": 2, "name": "H2"},
            {"element": "H", "index": 3, "name": "H3"},
            {"element": "C", "index": 4, "name": "C"},
            {"element": "O", "index": 5, "name": "O"}
          ],
          "index": 0,
          "resSeq": 1,
          "name": "ACE"
        },
        {
          "atoms": [
            {"element": "N", "index": 6, "name": "N"},
            {"element": "H", "index": 7, "name": "H"},
            {"element": "C", "index": 8, "name": "CA"},
            {"element": "H", "index": 9, "name": "HA"},
            {"element": "C", "index": 10, "name": "CB"},
            {"element": "H", "index": 11, "name": "HB1"},
            {"element": "H", "index": 12, "name": "HB2"},
            {"element": "H", "index": 13, "name": "HB3"},
            {"element": "C", "index": 14, "name": "C"},
            {"element": "O", "index": 15, "name": "O"}
          ],
          "index": 1,
          "resSeq": 2,
          "name": "ALA"
        },
        {
          "atoms": [
            {"element": "N", "index": 16, "name": "N"},
            {"element": "H", "index": 17, "name": "H"},
            {"element": "C", "index": 18, "name": "C"},
            {"element": "H", "index": 19, "name": "H1"},
            {"element": "H", "index": 20, "name": "H2"},
            {"element": "H", "index": 21, "name": "H3"}
          ],
          "index": 2,
          "resSeq": 3,
          "name": "NME"
        }
      ]
    }
  ]
}