[Networker] NetWorker 7.6.5 on RHEL 6.3 64-bit

I had an interesting deadlock crop up yesterday. It's a slightly long story, so please bear with me.


First, let's set the scene.

I have a storage node that has two AFTDs, each of which is 16 TB, and two LTO-4 tape drives (which are two of the 14 LTO-4 drives in the Qualstar XLS 832-700 tape library). On Thursday, one of the tape drives broke and I disabled it in NetWorker.

So, we have two AFTDs and one working tape drive out of the two tape drives that NetWorker knows are attached to this storage node.

At 4:30pm on Thursday, a 1.5 TB saveset started staging from AFTD-1 to the working tape drive. At 5:30pm on Thursday, the NetWorker server (ozzie) did its daily backup, which was written to (of course) AFTD-1. This being the backup of the NetWorker server, it is set to auto-clone, and so the clone task kicked off and was waiting for a tape for the OFFSITE pool.

At some point in the evening, the tape that the staging process was using filled up and was ejected from the one working tape drive. NetWorker loaded the OFFSITE tape that was needed for the clone that had been waiting since about 6pm. Oops! The staging process holds AFTD-1 and wants that working tape drive to stage the last 250 GB of that 1.5 TB saveset! DEADLOCK! The cloning task now has the tape drive but can't obtain the lock on AFTD-1 because the staging process still has it.
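
To make the circular wait concrete, here is a minimal Python sketch. It is a toy model of what I saw, not how NetWorker tracks locks internally; the task and resource names (staging, clone, tape-drive-1, tape-drive-2) and the find_deadlock helper are made up for illustration.

```python
def find_deadlock(holds, wants):
    """Return the tasks caught in a circular wait, or [] if there is none.

    holds maps resource -> task currently holding it (hypothetical model).
    wants maps task -> the one resource it is waiting for.
    """
    for start in wants:
        seen = []
        task = start
        # Follow the chain: task -> resource it wants -> task holding that.
        while task is not None and task not in seen:
            seen.append(task)
            resource = wants.get(task)
            task = holds.get(resource)
        if task is not None:
            # We came back to a task already on the chain: a cycle.
            return seen[seen.index(task):]
    return []


# Thursday evening: one working drive, already loaded for the clone.
holds = {"AFTD-1": "staging", "tape-drive-1": "clone"}
wants = {"staging": "tape-drive-1", "clone": "AFTD-1"}
print(find_deadlock(holds, wants))

# After re-enabling the second drive, staging can wait on a free drive.
wants_after = {"staging": "tape-drive-2", "clone": "AFTD-1"}
print(find_deadlock(holds, wants_after))
```

In the first case each task holds what the other one needs, so neither can ever proceed. In the second, staging gets a free drive, finishes, releases AFTD-1, and the clone can then run.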

Thank the lucky stars that the failure of the serial port on the LTO-4 drive had cleared up and I was able to re-enable the broken tape drive to let NetWorker unwind itself. (Alternatively, I could have left it to stew until the replacement drive showed up, but I never quite trust that will happen, and my experience with these loss-of-communication issues is that the serial ports reset themselves and it clears up after a while, then works for months before failing again.)

However, in the midst of all this, NetWorker's countdown to cleaning the disabled drive hit zero, so I had a raft of errors in the log from NetWorker trying to clean the tape drive that it knew was disabled.

My theory is that the NetWorker code is not checking that a drive is disabled early enough in its processes, so it a) tries to clean a disabled drive, and b) gets itself into stupid deadlocks.

Sadly, this being a timing issue involving large amounts of data that my test environment is not set up to deal with, I can't quite see how I'm going to test my theory.

Therefore, I ask my fellow NetWorker administrators - does this sound like anything you've ever experienced yourselves? (Alternatively, we have a lot of very strange errors that every vendor always tells us nobody else has ever seen, so we have our own theory that the land the university is built on had to be a sacred burial ground for some earlier people and we're suffering the curse for desecrating their holy land.)


If you're still here, if we ever meet, I'll buy you a coffee!

Thanks,

--
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
  "I am not young enough to know everything." - Oscar Wilde (1854-1900)