Archive for January 2010

Backups… all night?   1 comment

Being the one that is responsible for the backups at work – I never compromised on anything.
As a SysAdmin you must:

  1. Backup
  2. Backup some more
  3. Test your backups

Shoot to maim

At first we mainly needed to backup our subversion repository. A pretty easy task for any SysAdmin.
What I would do is simply dump the repository at night, and scp it to two other workstations of developers in the company (I didn’t really have much of a choice in terms of other computers in the network).
It worked.

The golden goose is on the loose

After a while I managed to convince our R&D manager it is time for detachable backups. Detachable backups can save you in case the building is on fire or if someone decides to shoot a rocket on your building (unlikely even in Israel, but as a SysAdmin – never take any chances).
With the virtual threat of a virtual rocket that might incinerate all of our important information, we decided that the cheapest and most effective way of action is to purchase a tape drive and a few tapes. Mind you, the year is 2006 and portable HDs are expensive, uncommon and small.

Backing up to a tape has always been something on my TODO list that I had to tick.
During one of my previous jobs, we had a tape archive that took care of it transparently, it was managed by a different team. Ever since I had always had yearnings for the /dev/tape that’s totally sequential.
It was very soon that I’d discovered that these tapes are a plain headache:

  1. It is a mess in general to deal with the tapes as the access is sequential
  2. It’s slow!!!
  3. The only reasonable way to deal with the backups is with dumpe2fs – it’s an archaic tool it’s archaic and work only on the extX filesystem family!
  4. It takes a while to eject the tape, I can still remember the minutes of waiting in the servers room for the tape to eject, so I can deposit it at home
  5. The tapes tend to break! like any tape, the film tends to run away from the bearings, rendering the tape useless

Too bad our backup was far from being able to fit on a DVD media.

The glamour, the fortune, the pain

The tape backup held us good for more than 2 years. I was so happy the solution was robust enough to keep us running for that much time without the need of any major changes.
But portable USB HDs became cheaper and larger and it was time for a change. I was excited to receive two brand new and shiny 500GB HDs. I diligently worked on a new backup script. A backup script that would not be dependant on the filesystem type (hell! i wanted to use JFS!), a backup script that would have snapshots weeks back! a backup script that would rule them all!
This backup script will hopefully be published in one of my next posts.
I felt like king of the world, backups became easy, I was much more confident with the new backup as the files could be seen on the mounted HD easily, in contrast to the sequential tape and the binary filesystem dump.
Backups ran manually by me during the day. I inspected them carefully and was pleased.
It was time for the backup to take place at night. And so it was.

From time to time I would get in the backup log:
Input/output error

At first I didn’t pay much attention.
WTF?! are my HDs broken?! – no way, they are brand new and it happened on both of them. But dmesg also showed some nasty information while accessing the HDs.
I started to trigger the backups manually at day time. Not a single error.

Backups went back to night time.
At the morning I would issue a ls:

# ls /media/backup
Input/output error
# ls /media/backup
daily.1 daily.2 daily.3 daily.4 daily.5 daily.6 daily.7

What the hell is going on around here?! – first command fails but the second succeeds?
First command also used to lag for a while, where the second breezed out. I discovered only later it was a key hint.

My backup creates many links using “cp -aL” (in order to preserve snapshots), I had a speculation I might be messing the filesystem structure with too many links to the same inode – unlikely, but I was shooting at all directions, I was clueless.

So there I go, easing the backups up and eliminating the snapshot functionality. Guess what? – still errors on the backup.

What do I do next? Do I stay up at night just to witness the problem in real time?!, Don’t laugh, a friend of mine actually had to do it once in other occasions.
At this time I already introduced this issue to all of my fellow SysAdmin friends. None of them had any idea. I can’t blame them.
I was frustrated, even the archaic tape backups worked better than the HDs, is newer always better? – perhaps not.
I recreated the filesystem on the portable HDs as ext3 instead of JFS, maybe JFS is buggy?
I’ll save you the trouble. JFS is far from being buggy and it had also nothing to do with crond.

We’ll show the unbelievers

For days I’d watch the nightly email the backup would produce, notice the failure and rerun it manually during the day. Until one day.
It had struck me like a lightning on a sunny day.

The second command would always succeed on the device. What if this HD is a little tired?
What if the portable HD goes to sleep and is having problems waking up?
It’s worth trying.

# sdparm --set=STANDBY=0 /dev/sdb
# sdparm --save /dev/sdb

What do you say? – It worked!
It appears that some USB HDs go to sleep and doesn’t wake up nicely when they should.
Should I file a bug about it? Was it the hardware that malfunctioned?
I was so happy this issue was solved – I never cared about either.
Maybe after crafting this post – it is time to care a little more though.

As the madmen play on words and make us all dance to their song…

I’m sitting at my desk, receiving the nightly email informing me the backup was successful. The portable HDs now also utilize an encrypted filesystem. The backup never fails.
I look at my watch, drink a glass of wine and rejoice.