Backups… all night?

As the one responsible for the backups at work, I never compromised on anything.
As a SysAdmin you must:

  1. Backup
  2. Backup some more
  3. Test your backups

Shoot to maim

At first we mainly needed to back up our Subversion repository. A pretty easy task for any SysAdmin.
What I would do is simply dump the repository at night, and scp it to two other workstations of developers in the company (I didn’t really have much of a choice in terms of other computers in the network).
It worked.
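The original script was never shown, so what follows is only a minimal sketch of that kind of nightly job, assuming svnadmin and scp are available. The repository path, the host names devbox1 and devbox2, and the destination directory are hypothetical placeholders. It prints the commands instead of running them unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of a nightly Subversion dump copied to two other machines.
# All paths and host names below are hypothetical placeholders.
REPO="${REPO:-/var/svn/repo}"
DEST_HOSTS="${DEST_HOSTS:-devbox1 devbox2}"
DEST_DIR="${DEST_DIR:-/var/backups/svn}"
DUMP_FILE="${DUMP_FILE:-/tmp/svn-$(date +%Y%m%d).dump}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

backup_svn() {
    # Step 1: dump the whole repository into a single file.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: svnadmin dump -q $REPO > $DUMP_FILE"
    else
        svnadmin dump -q "$REPO" > "$DUMP_FILE" || return 1
    fi
    # Step 2: copy the dump to every destination host.
    for host in $DEST_HOSTS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: scp -q $DUMP_FILE $host:$DEST_DIR/"
        else
            scp -q "$DUMP_FILE" "$host:$DEST_DIR/" || return 1
        fi
    done
}

backup_svn
```

Run from cron with DRY_RUN=0, something like this would do the dump-and-copy every night. Copying to two machines means a single dead workstation does not take the only backup with it.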

The golden goose is on the loose

After a while I managed to convince our R&D manager that it was time for detachable backups. Detachable backups can save you if the building is on fire or if someone decides to fire a rocket at your building (unlikely even in Israel, but as a SysAdmin you never take any chances).
With the virtual threat of a virtual rocket that might incinerate all of our important information, we decided that the cheapest and most effective course of action was to purchase a tape drive and a few tapes. Mind you, the year was 2006 and portable HDs were expensive, uncommon and small.

Backing up to tape had always been an item on my TODO list that I wanted to tick off.
At one of my previous jobs we had a tape archive that took care of it transparently, and it was managed by a different team. Ever since, I had yearned for a totally sequential /dev/tape of my own.
Very soon I discovered that these tapes are a plain headache:

  1. It is a mess in general to deal with the tapes as the access is sequential
  2. It’s slow!!!
  3. The only reasonable way to deal with the backups is with dump, an archaic tool that works only on the extX filesystem family!
  4. It takes a while to eject the tape. I can still remember the minutes of waiting in the server room for the tape to eject so I could deposit it at home
  5. The tapes tend to break! Like any tape, the film tends to run away from the bearings, rendering the tape useless

Too bad our backup was far from being able to fit on a DVD.

The glamour, the fortune, the pain

The tape backup served us well for more than 2 years. I was so happy the solution was robust enough to keep us running for that long without the need for any major changes.
But portable USB HDs became cheaper and larger, and it was time for a change. I was excited to receive two brand new and shiny 500GB HDs. I diligently worked on a new backup script. A backup script that would not be dependent on the filesystem type (hell! I wanted to use JFS!), a backup script that would keep snapshots going weeks back! A backup script that would rule them all!
This backup script will hopefully be published in one of my next posts.
I felt like the king of the world. Backups became easy, and I was much more confident in the new backup, as the files could easily be seen on the mounted HD, in contrast to the sequential tape and the binary filesystem dump.
At first I ran the backups manually during the day. I inspected them carefully and was pleased.
It was time for the backup to take place at night. And so it was.

From time to time I would get this in the backup log:
Input/output error

At first I didn’t pay much attention.
WTF?! Are my HDs broken?! No way, they were brand new and it happened on both of them. But dmesg also showed some nasty messages while accessing the HDs.
I started to trigger the backups manually in the daytime. Not a single error.

Backups went back to night time.
In the morning I would run ls:

# ls /media/backup
Input/output error
# ls /media/backup
daily.1 daily.2 daily.3 daily.4 daily.5 daily.6 daily.7

What the hell is going on around here?! The first command fails but the second succeeds?
The first command also used to lag for a while, whereas the second returned in a breeze. Only later did I discover that this was a key hint.

My backup creates many hard links using “cp -al” (in order to preserve snapshots). I speculated that I might be messing up the filesystem structure with too many links to the same inode. Unlikely, but I was shooting in all directions; I was clueless.
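My real script is not published here, so take the following as a rough sketch of the hard-link snapshot idea rather than the script itself. It assumes GNU cp, where the lowercase -l flag makes hard links instead of copying file data (the uppercase -L dereferences symlinks, which is a different thing). The function names rotate_snapshots and snapshot are made up for illustration, and the rsync step is an assumption about how a snapshot might be refreshed.

```shell
#!/bin/sh
# rotate_snapshots DEST KEEP
# Shifts daily.1 -> daily.2 -> ... -> daily.KEEP (hypothetical layout),
# then recreates daily.1 as a hard-linked copy of daily.2.
rotate_snapshots() {
    dest=$1
    keep=$2
    # Drop the oldest snapshot to make room.
    rm -rf "$dest/daily.$keep"
    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        if [ -d "$dest/daily.$prev" ]; then
            mv "$dest/daily.$prev" "$dest/daily.$i"
        fi
        i=$prev
    done
    if [ -d "$dest/daily.2" ]; then
        # -a keeps attributes, -l hard-links files instead of copying data.
        cp -al "$dest/daily.2" "$dest/daily.1"
    else
        mkdir -p "$dest/daily.1"
    fi
}

# snapshot SRC DEST [KEEP]
# Rotates, then syncs SRC into the fresh daily.1. rsync replaces changed
# files with new ones, so older snapshots keep their old content.
snapshot() {
    src=$1
    dest=$2
    keep=${3:-7}
    rotate_snapshots "$dest" "$keep"
    rsync -a --delete "$src/" "$dest/daily.1/"
}
```

Unchanged files share one inode across all snapshots, so seven days of history cost little more than one full copy plus the daily changes.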

So there I go, easing up the backups and eliminating the snapshot functionality. Guess what? Still errors on the backup.

What do I do next? Do I stay up at night just to witness the problem in real time?! Don’t laugh, a friend of mine actually had to do that once, on another occasion.
By this time I had already presented the issue to all of my fellow SysAdmin friends. None of them had any idea. I can’t blame them.
I was frustrated. Even the archaic tape backups worked better than the HDs. Is newer always better? Perhaps not.
I recreated the filesystem on the portable HDs as ext3 instead of JFS. Maybe JFS is buggy?
I’ll save you the trouble. JFS is far from being buggy, and the problem also had nothing to do with crond.

We’ll show the unbelievers

For days I’d watch the nightly email the backup produced, notice the failure and rerun it manually during the day. Until one day.
It struck me like lightning on a sunny day.

The second command would always succeed on the device. What if this HD is a little tired?
What if the portable HD goes to sleep and is having problems waking up?
It’s worth trying.

# sdparm --set=STANDBY=0 --save /dev/sdb

What do you say? – It worked!
It appears that some USB HDs go to sleep and don’t wake up nicely when they should.
Should I file a bug about it? Was it the hardware that malfunctioned?
I was so happy this issue was solved that I never cared about either.
Maybe after crafting this post it is time to care a little more, though.
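For a drive that refuses to have its standby timer changed, the “first ls fails, second succeeds” observation suggests another workaround: poke the mount point a few times before the backup starts, so the disk has a chance to spin up. This is only a sketch of that idea, not something I ran in production. The function name wait_for_mount and its default retry count and delay are invented for illustration.

```shell
#!/bin/sh
# wait_for_mount DIR [TRIES] [DELAY]
# Lists DIR up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as a listing succeeds, 1 if every attempt fails.
wait_for_mount() {
    dir=$1
    tries=${2:-5}
    delay=${3:-2}
    n=1
    while [ "$n" -le "$tries" ]; do
        # A sleeping disk may answer the first access with an I/O error.
        if ls "$dir" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        n=$((n + 1))
    done
    return 1
}
```

A backup script could then begin with `wait_for_mount /media/backup || exit 1` and only start copying once the listing comes back clean.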

As the madmen play on words and make us all dance to their song…

I’m sitting at my desk, receiving the nightly email informing me the backup was successful. The portable HDs now also utilize an encrypted filesystem. The backup never fails.
I look at my watch, drink a glass of wine and rejoice.

One response to “Backups… all night?”

  1. cool!

    1. I truly think that it’s a bug (probably in the kernel). If you can reproduce it easily, check it with the latest kernel and see if it still occurs.

    2. In a few years you’ll be laughing at your cp -al solution. I wonder if there’s already a sane snapshot-supporting filesystem solution for Linux. btrfs is amazing, but is still discouraged for production.
    a. Maybe until then it’s worth having an OpenSolaris VM with ZFS?
    b. Or Veritas Storage Foundation, which even has a good Linux version. (VSF “Basic” is free of charge and will answer most of your needs. You should read the license, though.)
    c. A storage box that supports snapshotting.
    All of the above solutions would give you much more power, by offering a larger number of available snapshots and speeding up the process (no need to copy anything).

    3. I disagree with your statement that the only reasonable tool for tape backups is dump. Actually, VC is the only place I know of that used dump. Most places use tar (Tape ARchive, remember?) or proprietary backup tools (e.g. Veritas NetBackup) for tape backup/restore. Anyway, I’m not trying to argue that tapes are good or anything, you did well 🙂

    4. dar -A looks interesting.

    The guy who stayed awake all night
