Hello Charlie,
I have a bunch of small files containing the results of a big computation, and each of the small files contains data from a rectangular patch of the domain, a tile. Thus, each file more-or-less corresponds to a chunk, to use the terminology of netcdf chunking.
I would like to merge the small files into a single large file. My current approach involves reading all tiles, assembling a large global array in memory, and then writing the global array to a freshly created netcdf file.
I wonder if a tool exists, or if it is feasible to write one, which can merge the tiles more directly? If each of the original small files (tiles) could be organized to form a proper "chunk", is there a way to quickly copy them (and some metadata, too, I would imagine) into a readable "chunked" netcdf file?
My goal is to speed up the global (merged tiles) file creation process. I am not concerned with memory usage.
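A minimal sketch of the in-memory assembly approach described above, using NumPy only; the tile layout, shapes, and function names here are hypothetical placeholders, and the final write to a netcdf file (e.g., via the netCDF4 or xarray libraries) is only indicated in a comment:

```python
import numpy as np

def assemble_tiles(tiles, tile_shape, grid):
    """Stitch a dict of {(row, col): 2-D array} tiles into one global array."""
    ny, nx = tile_shape
    rows, cols = grid
    dtype = next(iter(tiles.values())).dtype
    out = np.empty((rows * ny, cols * nx), dtype=dtype)
    for (r, c), patch in tiles.items():
        # Each tile lands in its rectangular patch of the global array.
        out[r * ny:(r + 1) * ny, c * nx:(c + 1) * nx] = patch
    return out

# Example: four 2x2 tiles forming a 4x4 global array.
tiles = {(r, c): np.full((2, 2), 10 * r + c) for r in range(2) for c in range(2)}
global_arr = assemble_tiles(tiles, (2, 2), (2, 2))
# global_arr would then be written once to a freshly created netcdf file,
# e.g. with netCDF4.Dataset(...).createVariable(...)[:] = global_arr
```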
-Ed
Great question, Ed. I don't have an "off-the-shelf" answer, and NCO does not directly support that capability. I suggest checking to see if someone has written a "stitcher" tool in Python to assemble quilts out of patches. This is also the kind of thing that could be brute-forced with a shell script and an operator that can append along a single dimension (e.g., ncrcat) combined with a dimension permutation capability (ncpdq). If you find a good answer, please post it here.
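One possible shape for that brute-force shell-script approach, as a hedged sketch only: the tile filenames (tile_<row>_<col>.nc) and dimension names (y, x) are hypothetical, and since ncrcat concatenates along the record dimension, each stitching axis is promoted in turn with ncks --mk_rec_dmn. Depending on the file format, ncpdq may also be needed to move the record dimension into the leading position before concatenating:

```shell
# Hypothetical sketch: stitch a 2x2 grid of tiles into one global file.
# Assumes tiles named tile_<row>_<col>.nc with spatial dimensions y and x.
for r in 0 1; do
  for c in 0 1; do
    # Promote x to the record dimension so ncrcat can append along it.
    ncks -O --mk_rec_dmn x tile_${r}_${c}.nc xrec_${r}_${c}.nc
  done
  # Concatenate the tiles of this row along x.
  ncrcat -O xrec_${r}_0.nc xrec_${r}_1.nc row_${r}.nc
  # Switch the record dimension to y for the second pass.
  ncks -O --mk_rec_dmn y row_${r}.nc yrec_${r}.nc
done
# Concatenate the rows along y to form the global file.
ncrcat -O yrec_0.nc yrec_1.nc global.nc
```

Note this still rewrites the data twice rather than copying chunks directly, so it is a workaround rather than the fast chunk-copy Ed asked about.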
Thanks for your reply. I have no great answer. I wrote a script that assembles the tiles into global arrays in memory and then writes to disk. It is about 5x slower than simply copying the files on the filesystem, so it is not too bad, but probably not too good either. A colleague is comparing it to the standard python tool for doing this (with MOM6 files), so I should know how it stacks up in a few days.