[Networker] NetWorker 7.6.5 on RHEL 6.3 64-bit

2013-03-02 07:06:38
Subject: [Networker] NetWorker 7.6.5 on RHEL 6.3 64-bit
From: Frank Swasey <Frank.Swasey AT UVM DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sat, 2 Mar 2013 07:06:32 -0500
I had an interesting deadlock crop up yesterday. It's a slightly long story, please bear with me.

First, let's set the scene.

I have a storage node with two AFTDs, each 16 TB, and two LTO-4 tape drives (two of the 14 LTO-4 drives in the Qualstar XLS 832-700 tape library). On Thursday, one of the tape drives broke and I disabled it in NetWorker.

So, we have two AFTDs and one working tape drive out of the two tape drives that NetWorker knows are attached to this storage node.

At 4:30pm on Thursday, a 1.5 TB saveset started staging from AFTD-1 to the working tape drive. At 5:30pm on Thursday, the NetWorker server (ozzie) did its daily backup, which was written to (of course) AFTD-1. This being the backup of the NetWorker server, it is set to auto-clone, so the clone task kicked off and was waiting for a tape for the OFFSITE pool.

At some point in the evening, the tape that the staging process was using filled up and was ejected from the one working tape drive. NetWorker loaded the OFFSITE tape that was needed for the clone that had been waiting since about 6pm. Oops! The staging process holds AFTD-1 and wants that working tape drive to stage the last 250 GB of that 1.5 TB saveset, while the cloning task now holds the tape drive and can't obtain the lock on AFTD-1 because the staging process still has it. DEADLOCK!
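
For anyone who wants to see the circular wait spelled out, here is a toy model of it in Python. This is only a sketch of the situation as I understand it: the task and resource names ("staging", "clone", "tape-drive", "tape-drive-2") are labels made up for illustration, and nothing here is a NetWorker API or reflects how NetWorker tracks its locks internally.

```python
# Toy wait-for model of the deadlock described above.
# All names are hypothetical labels, not NetWorker identifiers.

def find_deadlock(holds, wants):
    """Return a list of tasks that wait on each other in a cycle, or None.

    holds maps resource -> task currently holding it.
    wants maps task -> resource it is blocked waiting for.
    """
    for start in wants:
        path = [start]
        task = start
        while True:
            resource = wants.get(task)
            if resource is None:
                break  # this task is not waiting on anything
            owner = holds.get(resource)
            if owner is None:
                break  # the resource is free, so the task can proceed
            if owner == start:
                return path  # we are back where we began: circular wait
            if owner in path:
                break  # a cycle exists, but it does not include start
            path.append(owner)
            task = owner
    return None


# One working drive: staging holds AFTD-1, the clone holds the drive.
holds = {"AFTD-1": "staging", "tape-drive": "clone"}
wants = {"staging": "tape-drive", "clone": "AFTD-1"}
stuck = find_deadlock(holds, wants)

# Second drive re-enabled: staging can wait on a free drive instead.
wants_after = {"staging": "tape-drive-2", "clone": "AFTD-1"}
unstuck = find_deadlock(holds, wants_after)

print(stuck, unstuck)
```

In the first case the two tasks each hold what the other needs. Once a second drive is available, staging can finish and release AFTD-1, and the clone is merely waiting rather than deadlocked.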

Thank the lucky stars that the failure of the serial port on the LTO-4 drive had cleared up and I was able to re-enable the broken tape drive to let NetWorker unwind itself. (Alternatively, I could have left it to stew until the replacement drive showed up, but I never quite trust that will happen, and my experience with these loss-of-communication issues is that the serial ports reset themselves and the problem clears up after a while, then the drive works for months before failing again.)

However, in the midst of all this, NetWorker's countdown to cleaning the disabled drive hit zero, so I had a raft of errors in the log from NetWorker trying to clean the tape drive that it knew was disabled.

My theory is that the NetWorker code is not checking whether a drive is disabled early enough in its processing, so it a) tries to clean a disabled drive, and b) gets itself into stupid deadlocks.

Sadly, this being a timing issue involving large amounts of data that my test environment is not set up to deal with, I can't quite see how I'm going to test my theory.

Therefore, I ask my fellow NetWorker administrators - does this sound like anything you've ever experienced yourselves? (Alternatively, we have a lot of very strange errors that every vendor always tells us nobody else has ever seen, so we have our own theory that the land the university is built on had to be a sacred burial ground for some earlier people and we're suffering the curse for desecrating their holy land)

If you're still here, if we ever meet, I'll buy you a coffee!

Thanks,

--
Frank Swasey                    | http://www.uvm.edu/~fcs
Sr Systems Administrator        | Always remember: You are UNIQUE,
University of Vermont           |    just like everyone else.
  "I am not young enough to know everything." - Oscar Wilde (1854-1900)
