Networker

Re: [Networker] SLAs and backup cost estimates

2008-05-16 15:19:48
Subject: Re: [Networker] SLAs and backup cost estimates
From: Stan Horwitz <stan AT TEMPLE DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Fri, 16 May 2008 15:17:20 -0400
On May 16, 2008, at 2:56 PM, Fazil Saiyed wrote:

Hi,
Stan, thanks for the update. What I meant by "Device Management" is
jbconfig\GUI, to get back to a base config if there are problems with devices
and we need to rebuild or adjust the backup server. NetWorker recently
added "jbedit", but we now use the GUI exclusively. I have a script that
captures pool, device associations, client config, etc. in case I need to
refer back.
Major problems exist for us with keeping track of "clone completion" and
media movement to offsite with the correct retention. Reporting on what is
being backed up in real time, with actionable items, remains deficient. We
have now moved most of the scheduling to external job schedulers.
During DR, NetWorker never fails to surprise us, whether it's long boot
times due to "reverse name resolution" (we now have no hosts file; we use
DNS), devices not working right, or recovers failing (could be drivers or
hardware issues at the offsite location). We do not use replication or have
a standby server for recovery at DR.
Frustration mounts as the recovery team wants their data fast.
We use an ADIC tape library at the offsite vendor site, which is also what
we use locally. Every time at DR there is a different reason why NetWorker
takes longer to recover: HBA, hardware, name resolution, etc.
Reconfiguring NetWorker at DR involves removing the existing config, which
is usually the most time-consuming part. I wish NetWorker recovery were not
tied to it, or were changed so that you recover the NetWorker indexes and
functionality and then either reconfigure the hardware or import the
hardware config separately. This would speed up connecting to the tape
drives and tape library.
Just my thoughts.
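
The reverse name resolution delays described in the quoted message can be measured before a DR test. The following is only a minimal sketch, not anything NetWorker provides: it times a reverse DNS lookup for each client address using Python's standard `socket.gethostbyaddr`. The function names and the one-second threshold are assumptions made for illustration.

```python
import socket
import time


def time_reverse_lookups(addresses, resolver=socket.gethostbyaddr):
    """Return a list of (address, hostname_or_None, seconds) tuples.

    `resolver` defaults to the real reverse lookup; it is a parameter only
    so the function can be exercised without touching DNS.
    """
    results = []
    for address in addresses:
        start = time.monotonic()
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
            hostname = resolver(address)[0]
        except OSError:
            # socket.herror and socket.gaierror are subclasses of OSError
            hostname = None
        elapsed = time.monotonic() - start
        results.append((address, hostname, elapsed))
    return results


def slow_or_failed(results, threshold_seconds=1.0):
    """Pick out lookups that failed or took longer than the threshold."""
    return [r for r in results if r[1] is None or r[2] > threshold_seconds]
```

Calling `slow_or_failed(time_reverse_lookups(client_addresses))` with the list of backup client IP addresses would list the clients whose reverse lookups fail or are slow, which are the likely candidates for fixing in DNS at the DR site.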

We have two tape libraries. One tape library (our Sony PetaSite) has 14 Sony S-AIT1 tape drives in it, and roughly 1000 tape slots. This tape library is connected via fibre channel to my Solaris 10 SPARC NetWorker server (it's a Sun T2000) and also to a Sun X4500 running Solaris 10 x86. This was all set up by an outside consultant from Cambridge Computer Services, so I did not have to use jbconfig at all. I have not had to touch that configuration since it was last changed in October, when we wired in the X4500. Four of the 14 tape drives are shared via dynamic drive sharing. Five of the tape drives are local to the NetWorker server. The remaining five are local to the X4500 storage node. All are wired in via a QLogic SAN switch.

This Sony PetaSite was first deployed four years ago and was last upgraded two years ago, when it was given more tape drives. It has been running without interruption since that upgrade, and I have not rebooted it in that time (it runs a modified version of Red Hat Linux). On the other hand, about once a week on average, one of the S-AIT1 tape drives chokes on a tape and NetWorker fails to verify it. When that happens, I almost always have to schlep over to the building where the tape library is housed and power cycle the affected tape drive, but I haven't needed to do anything at the OS level to keep the devices happy.

I am aware that NetWorker needs improvement in how tape cloning is done, but it is not a pain point at my site. Although we do ship a few tapes off-site, they are the original copies. We don't bother cloning the tapes before shipping them off-site, but we do replicate the data on those tapes to a disk device (outside of NetWorker) before each night's backups run. So if we need to get back a file quickly, we can just copy it over, and the replicas are off-site at our SunGard DR facility.

I can't seem to get my other storage node, a Dell 2950 with a Qualstar tape library that has 4 LTO-3 tape drives, to maintain persistent binding. The tape drives are all wired into individual HBA ports on the Dell. The HBA cards are Dell branded, but made by QLogic. Every once in a while, NetWorker loses track of which tapes are in which LTO-3 drives due to a loss of the fibre channel binding. I quickly learned (with help from EMC tech support) that power cycling the tape library and the 2950, then rebuilding the tape library resource (via the NMC GUI), fixes the problem just fine. In fact, this problem occurred last week while I was in Florida on vacation. I had taken the precaution of documenting the procedure to delete the tape library resource and rebuild it, so a colleague back in the office followed my documentation, and it's been working fine since. I haven't needed to use jbconfig for this process; the GUI works just fine. I am also considering replacing the HBA cards in my Dell 2950 with cards that have more robust drivers.
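
A loss of binding like the one described above amounts to a drive's serial number showing up at a different device path than the one recorded when the library was configured. As a rough illustration only, the sketch below compares two such mappings. The function name, the serial numbers, and the device paths are all hypothetical, and how the current mapping is collected (for example, from the OS or an HBA utility) is deliberately left out.

```python
def find_binding_drift(recorded, current):
    """Compare two {drive_serial: device_path} mappings.

    Returns {serial: (recorded_path, current_path)} for every drive whose
    path changed; current_path is None if the drive is no longer visible.
    """
    drift = {}
    for serial, old_path in recorded.items():
        new_path = current.get(serial)
        if new_path != old_path:
            drift[serial] = (old_path, new_path)
    return drift


# Hypothetical example data: serials and paths are made up for illustration.
recorded_bindings = {
    "LTO3-A": "/dev/nst0",
    "LTO3-B": "/dev/nst1",
    "LTO3-C": "/dev/nst2",
    "LTO3-D": "/dev/nst3",
}
current_bindings = {
    "LTO3-A": "/dev/nst0",
    "LTO3-B": "/dev/nst2",  # swapped with LTO3-C after a rescan
    "LTO3-C": "/dev/nst1",
    # LTO3-D is not visible at all
}

drifted = find_binding_drift(recorded_bindings, current_bindings)
```

An empty result would mean the drives are where the configuration expects them; anything else suggests the tape library resource needs the delete-and-rebuild procedure before the next backup window.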

The only time anyone here who needed to do a recover got frustrated was when the disk array on our old Sun V480 died last October, just when a client with 1TB worth of data also died. Since I am not a hardware guy, it took me several days to diagnose the problem and fix it. This was certainly not a NetWorker issue, and the hardware involved has been replaced since that failure.

Let me know if you have any other questions.
