Archive for November 2013

Ninja Merge

Recently I was presented with the following situation at work:

  • Your input is a handful of directories filled with files, some of which are a “sort of a copy” of one another
  • Your output should be one directory with all the files from the source directories merged into it
  • The caveat: if any of the files collide, you must mark them somehow for inspection

So that sounds pretty simple, doesn’t it? In my case, though, the input was millions of files. I’m not sure of the exact number, and it doesn’t matter. The best solution to this problem is never to get into this situation, but sometimes you just inherit stuff like that at a new workplace.

The Solution

We needed a ninja. I called it ninja-merge.sh. It is a Bash wrapper for rsync that will merge directories one by one into a destination directory and handle the collisions for you using a checksum function (md5 was “good enough” for that task).
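To make the idea concrete, here is a minimal sketch of the merge-and-mark logic. It is not the actual ninja-merge.sh code: the function names (md5_of, stash_collision, merge_dir) are made up for illustration, and it uses plain cp and md5sum (GNU coreutils assumed) instead of rsync so that it stays self-contained. A file at a new path is copied in, a file with the same path and checksum is skipped, and a file with the same path but a different checksum has both versions copied to the collision directory.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the merge-and-mark idea; NOT the real ninja-merge.sh.
# Assumes md5sum, cp, mkdir and find are available (GNU coreutils/findutils).

# md5_of FILE: print only the MD5 hex digest of FILE.
md5_of() {
    md5sum < "$1" | cut -d' ' -f1
}

# stash_collision FILE RELPATH COLLISION_DIR:
# copy FILE to COLLISION_DIR/RELPATH.<md5>, so each distinct version
# of a colliding path gets its own name.
stash_collision() {
    local target
    target="$3/$2.$(md5_of "$1")"
    mkdir -p "$(dirname "$target")"
    cp -p "$1" "$target"
}

# merge_dir SRC DEST COLLISION_DIR: merge one source tree into DEST.
# Filenames containing newlines are not handled in this sketch.
merge_dir() {
    local src="$1" dest="$2" coll="$3" rel
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        rel=${rel#./}
        if [ ! -e "$dest/$rel" ]; then
            # New path: copy it straight into the destination.
            mkdir -p "$(dirname "$dest/$rel")"
            cp -p "$src/$rel" "$dest/$rel"
        elif [ "$(md5_of "$src/$rel")" != "$(md5_of "$dest/$rel")" ]; then
            # Same path, different content: keep both versions for inspection.
            stash_collision "$dest/$rel" "$rel" "$coll"
            stash_collision "$src/$rel" "$rel" "$coll"
        fi
        # Same path, same checksum: nothing to do.
    done
}

# Usage (directory names are placeholders):
#   for d in src1 src2 src3; do merge_dir "$d" merged collisions; done
```

In this sketch the first version to arrive stays in the destination, and every differing version ends up in the collision directory under its own checksum suffix. The real script may well make different choices here.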

Get ninja-merge.sh here:
https://github.com/danfruehauf/Scripts/tree/master/ninja-merge

It even has unit tests and the works. All that you have to do is specify:

  • A list of source directories
  • A destination directory
  • A directory to store the collisions

If a path collided, you might end up with something like this in your collision directory, with each version of the file carrying its checksum as a suffix:

$ cd collision_directory && find . -type f
./a/b/filename1.nc.345c3132699d7524cefe3a161859ebee
./a/b/filename1.nc.259974c1617b40d95c0d29a6dd7b207e

Sorting the collisions is something you’ll have to do manually. Sorry!
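Before sorting by hand, it can help to see which paths collided and how many versions each one has. The helper below is a small sketch that is not part of ninja-merge.sh. The name list_collided_paths is made up, and it assumes every file in the collision directory ends in a dot followed by a 32-character hexadecimal MD5 digest, as in the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of ninja-merge.sh.
# list_collided_paths DIR: print each collided path once, prefixed by the
# number of distinct versions found for it in DIR.
list_collided_paths() {
    (cd "$1" && find . -type f) |
        sed -E 's/\.[0-9a-f]{32}$//' |   # strip the trailing .<md5> suffix
        sort |
        uniq -c
}

# Usage (directory name is a placeholder):
#   list_collided_paths collision_directory
```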