Gotta love Mondays ...
On Sat around 1PM, apparently our electrical power decided it didn't feel
like hanging out with us anymore, and left to go hang out somewhere else.
So our UPS kicked in. The generator is supposed to kick in at this point,
and the batteries are there to only hold us over until the generator
finishes coming back online.
This has worked in the past, all smoothly and as expected. But not that
day ...
The generator decided it didn't want to be bothered, and never turned
on ... (well, it apparently had a coolant leak that we didn't know
about, and it wouldn't turn on because of that. A cascade of failures
...). And so there was no electrical main power, and no generator, and the
UPS was left with its batteries. And when the batteries got too low ...
(you see where I'm going with this, right?)
CRASH. Everything came down hard. Luckily (?) the mains kicked in right
around then. But it took so long for the main network switch to come up
that DNS resolution was failing (among other things). So NW assigned new
client IDs to its clients. And (apparently) purged a lot of history
while it was at it, since my log and index drive went from 8G free to 85G
free ...
<SIGH>
I have a severity 1 call in to EMC, just waiting for a call back. I foresee
having to do a DR-level recovery, an mmrecov, in order to straighten things
back out. Luckily, I had a full bootstrap and CFI backup (savegrp -O -l
full) that finished about an hour before this all happened, so - if I do
have to do an mmrecov - I will just lose the stuff that happened (or tried
to happen) after the crash, from Sat 1PM onward. But as long as I can get
back all my client histories for the last 5 years, I'll be happy. (Right
now, I don't have everything - there are holes in the backup history for
various clients, with days and months missing.)
(there was a decision that we didn't need to run the UPS software on the
servers, to gracefully shut them down if/when the batteries go out; that's
why nothing shut down gracefully)
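
(For what it's worth, the check that UPS software makes is not complicated. Below is a minimal sketch of the decision logic only, not any vendor's actual agent. The UpsStatus fields, the should_shut_down name, and the 30% / 10 minute thresholds are all made up for illustration; a real agent would read these values from the UPS and then trigger the OS shutdown itself.)

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of UPS state (field names are assumptions)."""
    on_battery: bool        # True when mains power is gone
    charge_percent: float   # remaining battery charge, 0-100
    runtime_minutes: float  # estimated runtime left at current load


def should_shut_down(status, min_charge=30.0, min_runtime=10.0):
    """Return True when a server should start a clean shutdown.

    The thresholds are illustrative defaults, not recommendations.
    """
    # On mains power there is nothing to do, however low the battery is.
    if not status.on_battery:
        return False
    # On battery: shut down once either safety margin is used up.
    return (status.charge_percent <= min_charge
            or status.runtime_minutes <= min_runtime)
```

(With something like this polling every minute or so, the servers would have gone down cleanly while the batteries still had some margin, instead of all at once when they ran dry.)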
--
Michael Leone
Network Administrator, ISM
Philadelphia Housing Authority
2500 Jackson St
Philadelphia, PA 19145
Tel: 215-684-4180
Cell: 215-252-0143
<mailto:michael.leone AT pha.phila DOT gov>