[Networker] Power failure and DB corruption

2013-03-11 09:29:44
Subject: [Networker] Power failure and DB corruption
From: Michael Leone <Michael.Leone AT PHA.PHILA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Mon, 11 Mar 2013 09:21:32 -0400
Gotta love Mondays ...

On Sat around 1PM, apparently our electrical power decided it didn't feel 
like hanging out with us anymore, and left to go hang out somewhere else. 
So our UPS kicked in. The generator is supposed to kick in at this point, 
and the batteries are only there to hold us over until the generator 
finishes coming online.

This has worked in the past, all smoothly and as expected. But not that 
day ...

The generator decided it didn't want to be bothered, and never turned 
on ... (well, it had apparently had a coolant leak that we didn't know 
about, and it wouldn't turn on because of that. A cascade of failures 
...). And so there was no electrical main power, and no generator, and the 
UPS was left with its batteries.  And when the batteries got too low ...

(you see where I'm going with this, right?)

CRASH. Everything came down hard.  Luckily (?)  the mains kicked in right 
around then. But it took so long for the main network switch to come up 
that DNS resolution was failing (among other things). So NW assigned new 
client IDs to its clients. And (apparently) purged a lot of history 
while it was at it, since my log and index drive went from 8G free to 85G 
free ...

<SIGH>

I have a severity 1 call into EMC, just waiting for a call back. I foresee 
having to do a DR-level recovery, a mmrecov, in order to straighten things 
back out. Luckily, I had a full bootstrap and CFI backup (savegrp -O -l 
full)  that finished about an hour before this all happened, so - if I do 
have to do a mmrecov - I will just lose the stuff that happened (or tried 
to happen) after the crash, from Sat 1PM onward. But as long as I can get 
back all my client histories for the last 5 years, I'll be happy. (Right 
now, I don't have everything - there are holes in the backup history for 
various clients, with days and months missing.)
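
For anyone wanting to quantify holes like these, here is a minimal sketch in Python. It assumes the save dates for one client have already been exported by some means (for example, from an mminfo report) and parsed into `date` objects; the function name `find_gaps` and the sample dates are hypothetical, not anything NetWorker provides.

```python
from datetime import date, timedelta

def find_gaps(save_dates, start, end):
    """Return inclusive (first_missing, last_missing) ranges between
    start and end on which no save date exists for the client."""
    have = set(save_dates)
    gaps = []
    gap_start = None
    day = start
    while day <= end:
        if day not in have:
            # Open a new gap if we are not already inside one.
            if gap_start is None:
                gap_start = day
        elif gap_start is not None:
            # A backup exists today, so the gap ended yesterday.
            gaps.append((gap_start, day - timedelta(days=1)))
            gap_start = None
        day += timedelta(days=1)
    # Close a gap that runs through the end of the window.
    if gap_start is not None:
        gaps.append((gap_start, end))
    return gaps

# Hypothetical sample data: backups on Mar 1, 2 and 5 only.
sample = [date(2013, 3, 1), date(2013, 3, 2), date(2013, 3, 5)]
for first, last in find_gaps(sample, date(2013, 3, 1), date(2013, 3, 6)):
    print(first, "to", last)
```

Running it per client before and after a recovery would give a rough way to check whether the missing days actually came back.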

(there was a decision that we didn't need to run the UPS software on the 
servers, to gracefully shut them down if/when the batteries go out; that's 
why nothing shut down gracefully)

-- 
Michael Leone
Network Administrator, ISM
Philadelphia Housing Authority
2500 Jackson St
Philadelphia, PA 19145
Tel:  215-684-4180
Cell: 215-252-0143
<mailto:michael.leone AT pha.phila DOT gov>