I had an interesting deadlock crop up yesterday. It's a slightly long
story, so please bear with me.
First, let's set the scene.
I have a storage node that has two AFTDs, each of which is 16 TB, and two
LTO-4 tape drives (which are two of the 14 LTO-4 drives in the Qualstar
XLS 832-700 tape library). On Thursday, one of the tape drives broke
and I disabled it in NetWorker.
So, we have two AFTDs and one working tape drive out of the two tape
drives that NetWorker knows are attached to this storage node.
At 4:30pm on Thursday, a 1.5 TB saveset started staging from AFTD-1 to
the working tape drive. At 5:30pm on Thursday, the NetWorker server
(ozzie) did its daily backup, which was written to (of course) AFTD-1.
This being the backup of the NetWorker server, it is set to auto-clone,
and so the clone task kicked off and was waiting for a tape for the
OFFSITE pool.
At some point in the evening, the tape that the staging process was
using filled up and was ejected from the one working tape drive.
NetWorker then loaded the OFFSITE tape that was needed for the clone that
had been waiting since about 6pm. Oops! The staging process holds AFTD-1
and wants that working tape drive to stage the last 250 GB of that 1.5 TB
saveset! DEADLOCK! The cloning task now holds the tape drive but can't
obtain the lock on AFTD-1 because the staging process still has it.
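To make the circular wait concrete, here is a minimal sketch in Python of
the situation as I understand it. This is only a toy model: the task and
resource names ("staging", "clone", "tape-drive-1", "tape-drive-2") and the
find_deadlock helper are my own illustration, not anything from NetWorker's
internals, and it assumes each task holds one resource and waits for one.

```python
# Toy wait-for model of the deadlock described above.
# All names here are hypothetical; this is not how NetWorker is implemented.

def find_deadlock(holds, wants):
    """Return the tasks forming a wait-for cycle, or [] if there is none.

    holds: task -> resource the task currently holds
    wants: task -> resource the task is waiting for
    """
    # Invert holds so we can look up which task owns a resource.
    owner = {resource: task for task, resource in holds.items()}
    for start in wants:
        path = []
        task = start
        while task is not None and task not in path:
            path.append(task)
            # Follow the edge: task waits on a resource owned by another task.
            task = owner.get(wants.get(task))
        if task is not None:
            # We came back to a task already on the path: that is a cycle.
            return path[path.index(task):]
    return []


# One working drive: staging holds AFTD-1 and wants the drive,
# while the clone holds the drive and wants AFTD-1.
holds = {"staging": "AFTD-1", "clone": "tape-drive-1"}
wants = {"staging": "tape-drive-1", "clone": "AFTD-1"}
print(find_deadlock(holds, wants))

# Second drive re-enabled: staging can wait on a free drive instead,
# so nothing it wants is held by anyone and the cycle disappears.
wants_two_drives = {"staging": "tape-drive-2", "clone": "AFTD-1"}
print(find_deadlock(holds, wants_two_drives))
```

With only one drive the two tasks wait on each other; once a second drive
is available, staging can finish and release AFTD-1, which lets the clone
proceed.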
Thank the lucky stars that the failure of the serial port on the LTO-4
drive had cleared up and I was able to re-enable the broken tape drive
to let NetWorker unwind itself. (Alternatively, I could have left it to
stew until the replacement drive showed up, but I never quite trust that
will happen, and my experience with these loss-of-communication issues is
that the serial ports reset themselves and the problem clears up after a
while, then the drive works for months before failing again.)
However, in the midst of all this, NetWorker's countdown to cleaning the
disabled drive hit zero, so I had a raft of errors in the log from
NetWorker trying to clean the tape drive that it knew was disabled.
My theory is that the NetWorker code is not checking that a drive is
disabled soon enough in the process, so it a) tries to clean a disabled
drive, and b) gets itself into stupid deadlocks.
Sadly, this being a timing issue involving large amounts of data that my
test environment is not set up to deal with, I can't quite see how I'm
going to test my theory.
Therefore, I ask my fellow NetWorker administrators - does this sound
like anything you've ever experienced yourselves? (Alternatively, we
have a lot of very strange errors that every vendor always tells us
nobody else has ever seen, so we have our own theory that the land the
university is built on must have been a sacred burial ground for some
earlier people and we're suffering the curse for desecrating their holy
land.)
If you're still here, if we ever meet, I'll buy you a coffee!
Thanks,
--
Frank Swasey | http://www.uvm.edu/~fcs
Sr Systems Administrator | Always remember: You are UNIQUE,
University of Vermont | just like everyone else.
"I am not young enough to know everything." - Oscar Wilde (1854-1900)