Imagine you are in Tasmania and need to move 35TB (1 million files) to S3 in the Sydney region. The link between Tasmania and continental Australia will undergo maintenance in the next month, which means one or both of the following:
- You cannot use network links to transfer the data
- Tasmania might be drifting further away from the mainland now that it is untethered
In short, I’m going to be presented with a bunch of HDs and I need to copy the data onto them, fly to Sydney and upload the data to S3. If I were given a single 35TB HD I could just copy the data and be done with it – no dramas. Most likely, though, the HDs will be smaller than 35TB, so I need to look at a few options for doing that.
Things to consider are:
- Files should be present on the HDs in their original form, so they can be uploaded to S3 directly without needing a staging space for unarchiving etc.
- HDs should be accessible independently, so that if a HD is faulty I can easily identify which files need copying again
- The copy operation should be reproducible, so the previous point can be satisfied if anything goes wrong in the copying process
- Copying should be done in parallel (it’s 35TB, it’ll take a while)
- It has to be simple to debug if things go wrong
LVM/ZFS over a few HDs
Building a larger volume over a few HDs requires me to connect all of the HDs to one machine at the same time, and if any of them fails I lose all the data. I decided not to do that – too risky. It’d also be difficult to debug if anything goes wrong.
tar | split
Not a bad option on its own. An archive can be built and split into parts, then the parts copied onto the destination HDs. But the loss of a single HD would prevent me from extracting anything on the HDs that follow it – the later parts of a split archive are useless without the earlier ones.
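For illustration, the tar | split approach looks roughly like this. The sketch below runs on a toy directory; in the real scenario the source would be the 35TB tree and the split size around the capacity of one HD (e.g. -b 4T):

```shell
# Toy source tree standing in for the 35TB of data:
SRC=$(mktemp -d)
echo hello > "$SRC/a.txt"
echo world > "$SRC/b.txt"

# Build the archive as a stream and cut it into fixed-size parts
# (1M here; ~one HD's capacity in the real run):
tar -C "$SRC" -cf - . | split -b 1M - data.tar.part.

# Restoring needs *every* part, in order -- lose one part and everything
# after it in the stream is unreadable:
DEST=$(mktemp -d)
cat data.tar.part.* | tar -C "$DEST" -xf -
```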
tar also supports -L (tape length) and can potentially split the backup on its own, writing a multi-volume archive without the help of split. Still, it’ll take a very long time to spool it to multiple HDs, as tar can’t write the volumes in parallel – and it has to work the first time, so I can’t say I’m super keen on it. In addition, I’ll have to improvise something for untarring and uploading to S3, as I’ll have no staging area to untar those 35TB. I’ll need something along the lines of
tar -O -xf ... | s3cmd
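A minimal sketch of that staging-free upload, demonstrated on a toy archive. The upload function here is a local stand-in – with s3cmd the real call would be something like s3cmd put - s3://bucket/name, assuming your s3cmd version supports reading the object from stdin:

```shell
# Stand-in for the real uploader; $1 = object name, contents on stdin.
# In reality: s3cmd put - "s3://my-bucket/$1" (stdin support assumed).
upload() {
    mkdir -p "uploads/$(dirname "$1")"
    cat > "uploads/$1"
}

# Toy archive standing in for one HD's worth of tarred data:
SRC=$(mktemp -d)
echo payload > "$SRC/file.txt"
tar -C "$SRC" -cf backup.tar .

# Walk the member list, stream each regular file to the uploader
# without ever extracting it to disk:
tar -tf backup.tar | grep -v '/$' | while read -r name; do
    tar -O -xf backup.tar "$name" | upload "$name"
done
```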
I decided to write a utility that’ll do what I need since there’s only one chance of getting it right – it’s called span-files.sh. It operates in three phases:
- index – lists all files to be copied and their sizes
- span – given a maximum HD size, iterates over the index and generates a list of files to be copied per HD
- copy – produces rsync --files-from=list.X commands to run per HD. They can all be run in parallel if needed
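This is not the real span-files.sh, but a minimal sketch of the three phases, assuming GNU find and awk. It runs here on a toy tree with a 10-byte “HD”; in reality SRC would be the 35TB tree and HD_SIZE the capacity of one destination HD:

```shell
# Toy tree standing in for the 35TB of data:
SRC=$(mktemp -d)
HD_SIZE=10   # bytes per "HD" for the demo
printf 'aaaaaa' > "$SRC/f1"; printf 'bbbbbb' > "$SRC/f2"; printf 'cc' > "$SRC/f3"

# Phase 1, index: every file with its size and relative path.
find "$SRC" -type f -printf '%s %P\n' | sort > index.txt

# Phase 2, span: first-fit the index into per-HD lists list.0, list.1, ...
awk -v max="$HD_SIZE" '
    BEGIN { hd = 0 }
    { if (used + $1 > max && used > 0) { hd++; used = 0 }
      used += $1
      sub(/^[0-9]+ /, "")           # drop the size, keep the path
      print > ("list." hd) }
' index.txt

# Phase 3, copy: emit one rsync per HD; they can all run in parallel.
for list in list.*; do
    echo rsync -a --files-from="$list" "$SRC/" "/mnt/hd-${list#list.}/"
done
```

Because each list.X is just relative paths, a faulty HD maps directly back to one list, and re-running its rsync reproduces exactly that HD’s contents – which covers the independence and reproducibility requirements above.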
The utility is available here:
I’ll let you know how it all went after I do the actual copy. I still wonder whether I forgot some things…