Imagine you are in Tasmania and need to move 35TB (1 million files) to S3 in the Sydney region. The link between Tasmania and continental Australia will undergo maintenance in the next month, which means one or both of the following:
- You cannot use network links to transfer the data
- Tasmania might be drifting further away from the mainland now that it is untethered
In short, I’m going to be presented with a bunch of HDs, and I need to copy the data onto them, fly to Sydney and upload the data to S3. If I were given a single 35TB HD I could just copy the data and be done with it – no dramas. More likely, though, the HDs will be smaller than 35TB, so I need to look at a few options for doing that.
Things to consider are:
- Files should be present on the HDs in their original form, so they can be uploaded to S3 directly without needing a staging area for unzipping etc.
- HDs should be accessible independently – if one HD is faulty, I can easily identify which files need copying again
- The copy operation should be reproducible, so the previous point can be satisfied if anything goes wrong in the copying process
- Copying should be done in parallel (it’s 35TB, it’ll take a while)
- It has to be simple to debug if things go wrong
LVM/ZFS over a few HDs
Building a larger volume over a few HDs requires me to connect all the HDs to a machine at the same time, and if any one of them fails I will lose all the data. I decided not to do that – too risky. It would also be difficult to debug if anything goes wrong.
tar | split
Not a bad option on its own. An archive can be built and split into parts, then the parts can be copied onto the destination HDs. But the loss of a single HD would break the tar stream and prevent me from extracting the files on the HDs that follow it.
tar also supports -L (tape length) and can potentially split the backup across volumes on its own, without the use of split. Still, it would take a very long time to spool the archive to multiple HDs, as it cannot do so in parallel. In addition, I would have to improvise something for untarring and uploading to S3, as I will have no staging area in which to untar those 35TB. I would need something along the lines of tar -O -xf ... | s3cmd. Either way, I can’t say I am super keen on this approach – it has to work the first time.
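To make the idea concrete, here is a tiny sketch of the archive/split/stream-extract pipeline. The paths and the 512-byte part size are purely illustrative (in practice -b would be just under each HD's capacity), and the s3cmd line in the comment is a sketch of the idea, not a tested invocation:

```shell
# Create a tiny source tree (stand-in for the real 35TB).
mkdir -p /tmp/tar-split-demo/src
echo "hello" > /tmp/tar-split-demo/src/a.txt

# Archive and split into fixed-size parts.
tar -C /tmp/tar-split-demo -cf - src \
  | split -b 512 - /tmp/tar-split-demo/backup.tar.part-

# Restore: concatenate the parts back into a tar stream and extract
# file contents to stdout (-O), piping into the uploader -- no
# staging area needed. The upload step would look something like:
#   cat backup.tar.part-* | tar -xf - -O | s3cmd put - s3://bucket/key
cat /tmp/tar-split-demo/backup.tar.part-* | tar -xf - -O
```

This works for a straight restore, but it shows the weakness mentioned above: every part must be present and intact for the stream to be reassembled.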
Span Files
I decided to write a utility that does what I need, since there’s only one chance to get it right. It’s called span-files.sh and operates in three phases:
- index – lists all the files to be copied and their sizes
- span – given a maximum HD size, iterates over the index and generates a list of files to be copied per HD
- copy – produces rsync --files-from=list.X commands to run per HD; they can all be run in parallel if needed
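The three phases could be sketched roughly like this. This is just an illustration of the idea, not the actual span-files.sh – the greedy first-fit packing, the list.X naming and the /mnt/hd-X mount paths are all assumptions:

```shell
#!/bin/bash

# index: record every file under the given root with its size in
# bytes (uses GNU find's -printf).
index() {
  find "$1" -type f -printf '%s %p\n' > index.txt
}

# span: walk the index and greedily assign files to per-HD lists,
# starting a new list whenever the current HD would overflow.
span() {
  local max=$1 hd=1 used=0
  while read -r size path; do
    if (( used + size > max )) && (( used > 0 )); then
      hd=$((hd + 1))
      used=0
    fi
    used=$((used + size))
    echo "$path" >> "list.$hd"
  done < index.txt
}

# copy: emit one rsync command per HD list. The commands are
# independent of each other, so they can all be run in parallel.
# "/" as the source root and the /mnt/hd-X destinations are
# placeholder assumptions.
copy() {
  for list in list.*; do
    echo "rsync -a --files-from=$list / /mnt/hd-${list#list.}"
  done
}
```

Because each list.X is generated deterministically from the index, the whole thing is reproducible: if one HD turns out to be faulty, only its rsync command needs re-running.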
The utility is available here:
https://github.com/danfruehauf/Scripts/tree/master/span-files
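Since the generated rsync commands are independent, running them in parallel is just a matter of backgrounding each one and waiting. A minimal sketch, with echo standing in for the real per-HD rsync invocations and the HD count assumed to be three:

```shell
# Launch one copy job per HD in the background, then wait for all
# of them to finish. Each echo stands in for a per-HD rsync run.
for hd in 1 2 3; do
  echo "rsync --files-from=list.$hd / /mnt/hd-$hd" &
done
wait
```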
I’ll let you know how it all went after I do the actual copy. I still wonder whether I forgot some things…
There’s also the option of not reinventing the wheel…
It’s called Amazon Snowball. See https://aws.amazon.com/importexport/
I had thought about that option, but unfortunately at the time of writing this post (and comment), Snowball was not available in the ap-southeast-2 region (Sydney). It is currently available only in the us-east-1 and us-west-2 regions.
I stand corrected for your specific use case 🙂
I only mentioned it because it wasn’t covered in the post at all.
By the way, maybe it would be possible to pipeline two AWS features: Snowball and this: https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/ – but I am guessing that at such a volume, cost may be prohibitive… ?