Archive for the ‘files’ Tag

Spanning files over multiple smaller devices

Imagine you are in Tasmania and need to move 35TB (1 million files) to S3 in the Sydney region. The link between Tasmania and continental Australia will undergo maintenance in the next month, which means one or both of the following:

  • You cannot use network links to transfer the data
  • Tasmania might be drifting further away from the mainland now that it is untethered

In short, I’m going to be presented with a bunch of HDs and I need to copy the data onto them, fly to Sydney and upload the data to S3. If the HD I’m given were 35TB I could just copy the data and be done with it – no dramas. More likely, though, the HDs will be smaller than 35TB, so I need to look at a few options for doing that.

Things to consider are:

  • Files should be present on the HDs in their original form – so they can be uploaded to S3 directly without needing a staging space for unzipping etc
  • HDs should be accessible independently, in case a HD is faulty I can easily identify what files need copying again
  • The copy operation should be reproducible, so the previous point can be satisfied if anything goes wrong in the copying process
  • Copying should be done in parallel (it’s 35TB, it’ll take a while)
  • It has to be simple to debug if things go wrong

LVM/ZFS over a few HDs

Building a larger volume over a few HDs requires me to connect all the HDs to a machine at the same time, and if any of them fails I lose all the data. I decided not to do that – too risky. It would also be difficult to debug if anything went wrong.

tar | split

Not a bad option on its own. An archive can be built and split into parts, then the parts copied onto the destination HDs. But the loss of a single HD would prevent me from extracting the files on the following HDs.

tar also supports -L (tape length) and can potentially split the backup on its own without the use of split. Still, it’ll take a very long time to spool it to multiple HDs, as it wouldn’t be able to do so in parallel. In addition, I’ll have to improvise something for untarring and uploading to S3, as I will have no staging area to untar those 35TB. I’ll need something along the lines of tar -O -xf ... | s3cmd.

Can’t say I’m super keen on relying on -L, though – it has to work the first time.
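
For illustration, this is roughly what the tar | split route would look like; the part size, mount points and bucket name below are made up, and the streaming upload is only a sketch:

    # Build one archive stream and split it into HD-sized chunks
    # (4TB parts, purely for illustration); the parts would then be
    # copied onto the destination HDs one by one.
    tar -cf - /data | split --bytes=4T - backup.tar.part.

    # Restoring needs *all* parts concatenated in order – which is
    # exactly the weakness: losing one part breaks the whole chain.
    cat /mnt/hd*/backup.tar.part.* | tar -xf -

    # Uploading without a staging area would mean streaming each file
    # straight out of the archive, something along the lines of:
    cat /mnt/hd*/backup.tar.part.* | tar -O -xf - path/to/file | s3cmd put - s3://my-bucket/path/to/file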

Span Files

I decided to write a utility that’ll do what I need since there’s only one chance of getting it right – it’s called span-files.sh. It operates in three phases:

  • index – lists all files to be copied and their sizes
  • span – given the maximum size of a HD, iterates over the index and generates a list of files to be copied per HD
  • copy – produces rsync --files-from=list.X commands to run per HD. They can all be run in parallel if needed

The utility is available here:
https://github.com/danfruehauf/Scripts/tree/master/span-files
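
Purely to illustrate the three phases, a minimal sketch could look like the following – with a made-up 4TB budget, /data as the source and /mnt/0, /mnt/1, … as HD mount points; this is not the actual span-files.sh:

    #!/bin/bash
    # index: list every file with its size in bytes
    find /data -type f -printf '%s %p\n' > index.txt

    # span: greedily assign files to per-HD lists, 4TB per HD
    max_bytes=$((4 * 1024 ** 4))
    hd=0 used=0
    > list.0
    while read -r size path; do
        if (( used + size > max_bytes )); then
            hd=$((hd + 1)); used=0
            > "list.$hd"
        fi
        printf '%s\n' "$path" >> "list.$hd"
        used=$((used + size))
    done < index.txt

    # copy: one rsync per HD, all of which can run in parallel
    for list in list.*; do
        rsync -a --files-from="$list" / "/mnt/${list#list.}/" &
    done
    wait

Since each list is independent, a faulty HD only means re-running the rsync for its own list.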

I’ll let you know how it all went after I do the actual copy. I still wonder whether I forgot some things…

Posted February 7, 2016 by malkodan in System Administration

Handling many files in one directory

The Assignment

You have a directory with gazillion files. Since most filesystems are not very efficient with many files in one directory, it is advisable to spread them among a hierarchy of directories. Write a program (or script) which handles a directory with many files and spreads them in an efficient hierarchy.

Does that sound like a university assignment or something? Yes, it does.

Well, apparently such a situation just happened to me in real life. Searching across the internet, I couldn’t find anything too useful – though I’ll stand corrected if something already deals with this problem; post a comment if so.

And yes, thank god I’m using Unix (Linux) – I don’t even want to think what one would do on Windows.

The Situation

An application was spooling many files into the same directory, generating up to a million files in a single directory. I’m sorry I cannot disclose any more information about it, but let’s just say it is a well-known open source application.

Access to individual files was still fast, thanks to ext4 and dir_index, but the directory index was too big to actually list the files or do anything else with them without clogging up the whole system. And we need these files.
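
For reference, whether dir_index is enabled on a given ext4 filesystem can be checked with tune2fs (the device name below is just a placeholder):

    tune2fs -l /dev/sdb1 | grep -i features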

So we’ve decided to model the files in a way that’ll be more efficient for browsing and we can then handle it from there.

The Solution

After implementing something pretty quick and dirty to mitigate the immediate pain, I sat down and wrote something a bit more generic. I’m happy to introduce the spread_files.sh utility.
What does it take care of:

  • Reading the directory index just once
  • Hierarchy depth as parameter
  • Stacking up to X files per mv command
  • Has recursion in Bash!!

Obviously the best solution would be to never get into that situation in the first place, but if you do, feel free to use spread_files.sh.
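
Just to illustrate the idea – this is not the actual spread_files.sh; the source directory, the two-character bucketing and the batch size are placeholders:

    #!/bin/bash
    # Toy version of the approach: bucket files by the first two
    # characters of their name (assumes plain file names, longer than
    # two characters, with no spaces or newlines).
    src=/var/spool/bigdir

    # Read the huge directory index only once
    find "$src" -maxdepth 1 -type f -printf '%f\n' > /tmp/files.txt

    # For each bucket, create the target directory and move its files
    # in batches, so every mv invocation carries many files at once
    cut -c1-2 /tmp/files.txt | sort -u | while read -r bucket; do
        mkdir -p "$src/$bucket"
        awk -v b="$bucket" 'substr($0, 1, 2) == b' /tmp/files.txt \
            | sed "s|^|$src/|" \
            | xargs -d '\n' -n 500 mv -t "$src/$bucket"
    done

The real script adds the configurable hierarchy depth and the Bash recursion mentioned above.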