Imagine you are in Tasmania and need to move 35TB (1 million files) to S3 in the Sydney region. The link between Tasmania and continental Australia will undergo maintenance in the next month, which means one or both of the following:
- You cannot use network links to transfer the data
- Tasmania might be drifting further away from the mainland now that it is untethered
In short, I’m going to be presented with a bunch of HDs, and I need to copy the data onto them, fly to Sydney and upload the data to S3. If I were given a single 35TB HD I could just copy the data and be done with it – no dramas. More likely, though, the HDs will be smaller than 35TB, so I need to look at a few options for doing that.
Things to consider are:
- Files should be present on the HDs in their original form, so they can be uploaded to S3 directly without needing a staging area for unzipping etc.
- HDs should be accessible independently – if one HD is faulty, I can easily identify which files need copying again
- The copy operation should be reproducible, so the previous point can be satisfied if anything goes wrong in the copying process
- Copying should be done in parallel (it’s 35TB, it’ll take a while)
- It has to be simple to debug if things go wrong
LVM/ZFS over a few HDs
Building a larger volume over a few HDs requires me to connect all the HDs to a machine at the same time, and if any one of them fails I will lose all the data. I decided not to do that – too risky. It would also be difficult to debug if anything goes wrong.
tar | split
Not a bad option on its own. An archive can be built and split into parts, then the parts can be copied onto the destination HDs. But the loss of a single HD would break the tar stream and prevent me from extracting the files on the HDs that follow it.
tar also supports -L (tape length) and can potentially split the backup across volumes on its own, without the use of split. Still, it would take a very long time to spool the archive to multiple HDs, as it cannot do so in parallel. In addition, I would have to improvise something for untarring and uploading to S3, as I will have no staging area in which to untar those 35TB. I would need something along the lines of tar -O -xf ... | s3cmd. Either way, I can’t say I am super keen on this approach – it has to work the first time.
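To make the idea concrete, here is a tiny sketch of the archive/split/stream-extract pipeline. The paths and the 512-byte part size are purely illustrative (in practice -b would be just under each HD's capacity), and the s3cmd line in the comment is a sketch of the idea, not a tested invocation:

```shell
# Create a tiny source tree (stand-in for the real 35TB).
mkdir -p /tmp/tar-split-demo/src
echo "hello" > /tmp/tar-split-demo/src/a.txt

# Archive and split into fixed-size parts.
tar -C /tmp/tar-split-demo -cf - src \
  | split -b 512 - /tmp/tar-split-demo/backup.tar.part-

# Restore: concatenate the parts back into a tar stream and extract
# file contents to stdout (-O), piping into the uploader -- no
# staging area needed. The upload step would look something like:
#   cat backup.tar.part-* | tar -xf - -O | s3cmd put - s3://bucket/key
cat /tmp/tar-split-demo/backup.tar.part-* | tar -xf - -O
```

This works for a straight restore, but it shows the weakness mentioned above: every part must be present and intact for the stream to be reassembled.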
Span Files
I decided to write a utility that does what I need, since there’s only one chance to get it right. It’s called span-files.sh and operates in three phases:
- index – lists all the files to be copied and their sizes
- span – given a maximum HD size, iterates over the index and generates a list of files to be copied per HD
- copy – produces rsync --files-from=list.X commands to run per HD; they can all be run in parallel if needed
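The three phases could be sketched roughly like this. This is just an illustration of the idea, not the actual span-files.sh – the greedy first-fit packing, the list.X naming and the /mnt/hd-X mount paths are all assumptions:

```shell
#!/bin/bash

# index: record every file under the given root with its size in
# bytes (uses GNU find's -printf).
index() {
  find "$1" -type f -printf '%s %p\n' > index.txt
}

# span: walk the index and greedily assign files to per-HD lists,
# starting a new list whenever the current HD would overflow.
span() {
  local max=$1 hd=1 used=0
  while read -r size path; do
    if (( used + size > max )) && (( used > 0 )); then
      hd=$((hd + 1))
      used=0
    fi
    used=$((used + size))
    echo "$path" >> "list.$hd"
  done < index.txt
}

# copy: emit one rsync command per HD list. The commands are
# independent of each other, so they can all be run in parallel.
# "/" as the source root and the /mnt/hd-X destinations are
# placeholder assumptions.
copy() {
  for list in list.*; do
    echo "rsync -a --files-from=$list / /mnt/hd-${list#list.}"
  done
}
```

Because each list.X is generated deterministically from the index, the whole thing is reproducible: if one HD turns out to be faulty, only its rsync command needs re-running.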
The utility is available here:
https://github.com/danfruehauf/Scripts/tree/master/span-files
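Since the generated rsync commands are independent, running them in parallel is just a matter of backgrounding each one and waiting. A minimal sketch, with echo standing in for the real per-HD rsync invocations and the HD count assumed to be three:

```shell
# Launch one copy job per HD in the background, then wait for all
# of them to finish. Each echo stands in for a per-HD rsync run.
for hd in 1 2 3; do
  echo "rsync --files-from=list.$hd / /mnt/hd-$hd" &
done
wait
```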
I’ll let you know how it all went after I do the actual copy. I still wonder whether I forgot some things…
There’s also the option of not reinventing the wheel…
It’s called Amazon Snowball. See https://aws.amazon.com/importexport/
I had thought about that option, but unfortunately at the time of writing this post (and comment), Snowball was not available in the ap-southeast-2 region (Sydney). It is currently available only in the us-east-1 and us-west-2 regions.
I stand corrected for your specific use case 🙂
I only mentioned it because it wasn’t covered in the post at all.
By the way, maybe it would be possible to pipeline two AWS features: Snowball and this: https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/ – but I am guessing that at such a volume, cost may be prohibitive… ?