Re: [Networker] Problem with mmrecov after /nsr array failure

2007-10-17 22:00:19
Subject: Re: [Networker] Problem with mmrecov after /nsr array failure
From: Rob Sterba <Sterba_Robert AT EMC DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 17 Oct 2007 21:45:09 -0400
Have you tried deleting the library and all devices and re-adding them? 

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On
Behalf Of Stan Horwitz
Sent: Wednesday, October 17, 2007 7:36 PM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: [Networker] Problem with mmrecov after /nsr array failure

This past Saturday, my NetWorker 7.4 server's /nsr storage array
failed. The server runs Solaris 9, and the array is an old Sun A1000,
one of two that were connected to the server. I placed a
service call to have the problem fixed. On Saturday morning, I
found myself in a computer room while a hardware fixer upper guy  
fixed the array ... so we thought. To make a long story short, fsck  
ran for 24 hours and by Sunday night, I was up and running again with  
the fixed array, but NetWorker crashed within seconds of restarting,  
so I decided to hold off until Monday morning to address the problem  
so I could get some sleep.

So on Monday, rebooting the server didn't produce any SCSI errors at
all and it came back up fine, so I did an mmrecov from Friday's
bootstrap tape. I restarted NetWorker and all was fine. Later that  
evening, the same array died on me again. Sigh! On Tuesday, we did  
more array repairs, but nothing we tried worked. The broken A1000  
disk array is one of two we had sitting on my backup server. The  
second one is /nsr2 (which contains some CFI data for a few large  
clients), but it was only 20% full and the /nsr array only
contained 21GB worth of data. Since the /nsr2 array had something
like 150GB free on it, my boss and I decided to create a directory
called nsr on the /nsr2 array and we disconnected the faulty /nsr  
array from the SCSI chain and powered it off. So /nsr now sits on  
the /nsr2 array and all the /nsr2 array's cfi data is still visible  
to NetWorker as /nsr2. I hope this makes sense.
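
A minimal sketch of how the relocated layout could be sanity-checked before restarting NetWorker. It assumes /nsr is reached through a symbolic link to the nsr directory on the /nsr2 array; the post does not say whether a symlink or a mount was used, so treat that as an assumption. The function name describe_nsr_location is hypothetical, and the demo runs in a scratch directory rather than on real paths.

```python
import os
import shutil
import tempfile


def describe_nsr_location(nsr_path):
    """Report where nsr_path really lives and how much space is free there."""
    real = os.path.realpath(nsr_path)   # follows symlinks, if any
    usage = shutil.disk_usage(real)     # total/used/free bytes of that filesystem
    return {
        "path": nsr_path,
        "resolves_to": real,
        "is_symlink": os.path.islink(nsr_path),
        "free_gb": usage.free / 1024 ** 3,
    }


if __name__ == "__main__":
    # Demo in a scratch directory so nothing on a real server is touched.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "nsr2", "nsr")  # stands in for /nsr2/nsr
        link = os.path.join(scratch, "nsr")            # stands in for /nsr
        os.makedirs(target)
        os.symlink(target, link)                       # assumed layout: symlink
        info = describe_nsr_location(link)
        print(info["is_symlink"], info["resolves_to"])
        print("free GB on target filesystem: %.1f" % info["free_gb"])
```

On a real server the same function could be pointed at /nsr to confirm that it resolves onto the surviving array and that the free space is in line with the roughly 150GB mentioned above.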

This all works, and I got no SCSI errors at all when I rebooted the
server twice. Since this scheme wiped out the entire contents of
/nsr, I used jbconfig to configure a tape library resource so I could
read the bootstrap tape. Then I used mmrecov to recover the same  
bootstrap saveset from the same tape I used on Monday. This worked,  
except for one problem. When I did the mmrecov, instead of recovering  
to /nsr/res.R it recovered the data to /nsr/res and when I restarted  
NSR, the tape library that's connected to our server appeared twice  
in the NetWorker management console window and each instance of the  
tape library had two device resources for every physical device on  
the library (14 physical devices), except for the five devices that  
we use for NDMP which only had one device resource each. This server  
also has a Linux storage node connected to a totally different  
library, and that library's resource information is fine. I spent two  
hours tonight trying to fix this issue, including doing another  
mmrecov, which also dumped its data into /nsr/res instead of /nsr/res.R.
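
The duplication described above (the library listed twice, with two device resources per physical drive) can be confirmed mechanically once the resource list is in hand. The following is only a sketch: it assumes the resources have already been exported as (library, device) pairs by some other means, and the function name duplicated_devices and the sample names are hypothetical. It does not call any NetWorker command.

```python
from collections import Counter


def duplicated_devices(resources):
    """Return {(library, device): count} for entries that appear more than once.

    `resources` is an iterable of (library_name, device_path) pairs. How that
    list is obtained from the server is outside the scope of this sketch.
    """
    counts = Counter(resources)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sample = [
        ("library_a", "/dev/rmt/0cbn"),
        ("library_a", "/dev/rmt/0cbn"),  # second resource for the same drive
        ("library_a", "/dev/rmt/1cbn"),
    ]
    for (library, device), n in duplicated_devices(sample).items():
        print(library, device, "appears", n, "times")
```

Any drive reported here has more than one resource defined for it, which matches the symptom of the 14 physical devices showing up twice while the NDMP devices appear once.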

I tried deleting the second tape library resource, but this did not  
help. As a result, tape mount requests are not being satisfied for  
the main tape library, but they are for the tape library on my  
storage node. I don't know if it's relevant, but the tape library is a
Sony PetaSite with 14 S-AIT1 drives and it's fibre channel connected
to my NetWorker server. We do not do drive or tape library sharing.  
The inquire command also shows exactly the same thing it showed  
before we disconnected the broken A1000 array (except of course, for  
the missing array).

If anyone has any idea how to correct this problem, please let me  
know; otherwise, I intend to open up a support case with EMC in the  
morning (since I am too exhausted to do it now).

--
Stan Horwitz
stan AT temple DOT edu

CONFIDENTIALITY STATEMENT: The information contained in this e-mail,  
including attachments, is the confidential information of, and/or is  
the property of, Temple University. The information is intended for  
use solely by the individual or entity named in the e-mail. If you  
are not an intended recipient or you received this in error, then any  
review, printing, copying, or distribution of any such information is  
prohibited. Please notify the sender immediately by reply e-mail and  
then delete this e-mail from your system.

To sign off this list, send email to listserv AT listserv.temple DOT edu and
type "signoff networker" in the body of the email. Please write to
networker-request AT listserv.temple DOT edu if you have any problems with this
list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
