Spanning files over multiple smaller devices   3 comments

Imagine you are in Tasmania and need to move 35TB (1 million files) to S3 in the Sydney region. The link between Tasmania and continental Australia will undergo maintenance in the next month, which means either one or both:

  • You cannot use network links to transfer the data
  • Tasmania might be drifting further away from the mainland now that it is untethered

In short, I’m going to be presented with a bunch of HDs and I need to copy the data on them, fly to Sydney and upload the data to S3. If the HD given would be 35TB I could just copy the data and be done with it – no dramas. Likely though, the HDs will be smaller than 35TB, so I need to look at a few options of doing that.

Things to consider are:

  • Files should be present on the HDs in their original form – so they can be uploaded to S3 directly without needing a staging space for unzipping etc
  • HDs should be accessible independently, in case a HD is faulty I can easily identify what files need copying again
  • Copy operation should be reproducible, so previous point could be satisfied if anything goes wrong in the copying process
  • Copying should be done in parallel (it’s 35TB, it’ll take a while)
  • It has to be simple to debug if things go wrong

LVM/ZFS over a few HDs

Building a larger volume over a few HDs require me to connect all HDs at the same time to a machine and if any of them fail I will lose all the data. I decide to not do that – too risky. It’ll also be difficult to debug if anything goes wrong.

tar | split

Not a bad option on its own. An archive can be built and split into parts, then the parts could be copied onto the detination HDs. But the lose of a single HD will prevent me from copying the files on the next HD.

tar also supports -L (tape length) and can potentially split the backup on its own without the use of split. Still, it’ll take a very long time to spool it to multiple HDs as it wouldn’t be able to do it in parallel. In addition, I’ll have to improvise something for untarring and uploading to S3 as I will have no staging area to untar those 35TB. I’ll need something along the lines of tar -O -xf ... | s3cmd.

tar also has an interesting of -L (tape length), which will split a volume to a few tapes. Can’t say I am super keen using it. It has to work the first time.

Span Files

I decided to write a utility that’ll do what I need since there’s only one chance of getting it right – it’s called span-files.sh. It operates in three phases:

  • index – lists all files to be copied and their sizes
  • span – given a maximum size of a HD, iterate on the index and generate a list of files to be copied per HD
  • copy – produces rsync --files-from=list.X commands to run per HD. They can all be run in parallel if needed

The utility is available here:
https://github.com/danfruehauf/Scripts/tree/master/span-files

I’ll let you know how it all went after I do the actual copy. I still wonder whether I forgot some things…

Posted February 7, 2016 by malkodan in System Administration

Tagged with , , , , ,

3 responses to “Spanning files over multiple smaller devices

Subscribe to comments with RSS.

  1. There’s also the option of not reinventing the wheel…

    It’s called Amazon Snowball. See https://aws.amazon.com/importexport/

Leave a comment