[Veritas-bu] Backups slow to a crawl

2005-03-25 11:34:59
Subject: [Veritas-bu] Backups slow to a crawl
From: jeffm AT nicusa DOT com (Jeff McCombs)
Date: Fri, 25 Mar 2005 11:34:59 -0500
Bill,

    Yep. The library is connected via SCSI to the Sun system. I also
thought this might be a cable problem, especially since this is a
single-ended connection (blame purchasing, not me). I thought maybe I was
over that 3 meter cable length, or maybe something got knocked loose..

    I re-seated the SCSI card in the system, manually cleaned the drives
with a brand new cleaning tape (even though the one I had only has about 20
cleanings on it), re-seated the drives in the library, and double checked
the SCSI connections yet again.

    bptm logs show no errors. /var/adm/messages shows no errors. I even
dropped to OBP, set the diag switch to 'true', and ran obdiag.. No problems
reported. prtdiag -v.. No problems... Only thing I haven't tried is VCS.
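
    A rough sketch of how the messages check could be scripted, for
anyone repeating it. The function name is made up, and the two patterns
only follow Bill's description (parity errors, reduced transfer rate);
they are not exact driver message strings. It assumes a POSIX grep, which
on Solaris may mean /usr/xpg4/bin/grep rather than /usr/bin/grep.

```shell
# Hypothetical helper: count lines in a syslog-style file that mention
# parity errors or a transfer rate change. The patterns are assumptions
# taken from the wording in this thread, not verified driver messages.
scsi_log_hits() {
    logfile=${1:-/var/adm/messages}
    # -i: ignore case, -c: print only the number of matching lines
    grep -i -c -e 'parity' -e 'transfer rate' "$logfile"
}

# Example: scsi_log_hits /var/adm/messages
```

    A count of 0 only says those two phrases never showed up; it doesn't
rule out a cable problem that the driver never logged.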

    I'm pretty much at my wits' end. I'm having a spare SCSI controller card
sent from our offices in Indianapolis, which should arrive sometime early
next week. I'll swap the card out just to be safe and run further tests.

    I'm also going to head back out onsite and physically swap the tape
drives in the library. I'll run some additional tests outside of NBU. If the
kw/s and %b problems follow the drive, I'll be able to say it's the drive.
If not, maybe it's the controller.. Or the cable, though I didn't see any
bent pins...
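
    To make "follows the drive" less of an eyeball judgment, something
like the filter below could pick the bad samples out of the iostat
output. It is only a sketch: the function name and the default thresholds
(%b of 90 or more, kw/s of 200 or less) are assumptions, and it expects
the 'iostat -xn' column order of r/s w/s kr/s kw/s wait actv wsvc_t asvc_t
%w %b device. It needs an awk that accepts -v (nawk on older Solaris).

```shell
# Hypothetical filter: read 'iostat -xn'-style lines on stdin and print
# the samples for one device where %b is high but kw/s is low.
# Assumed columns: r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
stalled_samples() {
    dev=$1
    busy_min=${2:-90}   # minimum %b to count as "busy" (assumed default)
    kw_max=${3:-200}    # maximum kw/s to count as "slow" (assumed default)
    awk -v dev="$dev" -v busy="$busy_min" -v kw="$kw_max" '
        $NF == dev && $10 + 0 >= busy + 0 && $4 + 0 <= kw + 0 {
            print "stall: kw/s=" $4 " %b=" $10
        }'
}

# Example: iostat -xn 1 | stalled_samples rmt/1
```

    Run once per drive before and after the swap; if the stall lines
move with the physical drive rather than staying on the same device
name, that points at the drive.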

    Anyone ever have a SCSI cable just fail? Is that possible? I suppose it
is.. 

    It's probably sunspots. Yeah.. That's what I'll tell management..
"Sorry, backups suck right now because of Sunspots. Check back with me in 11
years, after this current spot-cycle completes.." :)

    -Jeff


On 3/25/05 11:12 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
wrote:

> Jeff:
> 
> Just a thought... I am not sure I have thoroughly read this thread so
> forgive me if I rehash stuff.
> 
> Are your drives direct-attached via scsi? If so have you investigated
> scsi cable problems? If the backup server is a Sun then take a look in
> /var/adm/messages. Look for parity errors or statements about reduced
> transfer rate. If you see things like that then look at the cable as the
> issue. This one is tough.
> 
> Good luck,
> 
> Bill
> 
> --------------------------------------------------------
>      Bill Jorgensen
>      CSG Systems, Inc.
>      (w) 303.200.3282
>      (p) 303.947.9733
> --------------------------------------------------------
>      UNIX... Spoken with hushed and
>      reverent tones.
> --------------------------------------------------------
> 
> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff
> McCombs
> Sent: Friday, March 25, 2005 8:42 AM
> To: Veritas-bu AT mailman.eng.auburn DOT edu
> Subject: Re: [Veritas-bu] Backups slow to a crawl
> 
> Gang,
> 
>     Ok. So I took Darren's suggestion and 'downed' the drive in NBU,
> drove out to our facility with a new, unused tape and slapped it into
> the drive.
> 
> I hopped over to my home directory where I've got a good 5G or so of data
> with a good mix of file sizes and types and ran the following:
> 
> tar cf - . | compress | dd obs=1024k of=/dev/rmt/1 conv=sync
> 
> And watched the output of iostat -xtcn, with samples being taken every
> second.
> 
> And everything looked good for the first, oh.. 5 minutes or so. But the
> longer that the stream to tape ran, the worse the performance started to
> get. After 5 minutes I began to see kw/s fall while busy climbed. Busy
> went from 4-10% with kw/s around 3 MB/sec when things were good, to
> 90-100% with kw/s of 100-200 KB/sec. The longer it ran, the worse it got.
> Eventually, 6 out of 10 samples were reading 100% busy and a kw/s of 0.
> The other 4 samples would range from busy at 89-99%, with kw/s down into
> the sub-50 KB/sec range.
> 
> I also checked the output of 'iostat -xtcne' during this run, and while
> there were soft and hard errors in the counters, these never actually
> increased. 'iostat -nE' provided the following:
> 
> rmt/0           Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
> Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P
> rmt/1           Soft Errors: 56 Hard Errors: 2 Transport Errors: 2
> Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P
> 
> Again though, after performing more tests, I couldn't get these counters
> to increase.
> 
> I did get a response from Veritas. The tech on the phone suggested I
> muck with the buffers. Per his instructions, I set NET_BUFFER_SZ to
> 131072, NUMBER_DATA_BUFFERS to 32, and SIZE_DATA_BUFFERS to 131072.
> 
> I ran a full backup of our system dedicated to managing Checkpoint
> firewalls (Sun V100, approx 8GB of data, 100 Mb FDX network on the same
> 3750 switch & VLAN as the backup system), and performance was actually
> worse on the first drive! Both drives sat at approximately 512k/sec,
> though busy was in the 4-10% range for the duration of the backup.
> 
> Aargh. If this was a Windows system, I'd be blaming drivers.. I checked
> cables, cleaned and reseated the drives, made sure the SCSI controller
> card was seated properly, checked termination.. Guess I'll call Overland
> and have them get me a new drive.
> 
> Many thanks to those of you who have helped me out already. It's much
> appreciated!
> 
> -jeff
> 
> On 3/24/05 11:14 AM, "Darren Dunham" <ddunham AT taos DOT com> wrote:
>> 
>> I didn't reply initially because it appeared that you had fixed it.
>> 
>> I too would be very suspicious of those iostat figures.  To me the
>> high busy alongside very low throughput screams drive problems.
>> Multiplexing shouldn't be affecting that.
>> 
>> If at all possible, I'd try to replicate the error by doing some drive
>> testing outside of NBU.
>> 
>> Down the drive, load a scratch tape, then get busy with 'dd' or
>> something.  Can you make it behave similarly?  If so, I'd make it my
>> number one suspect.

-- 
Jeff McCombs                 |                                    NIC, Inc
Systems Administrator        |                       http://www.nicusa.com
jeffm AT nicusa DOT com             |                                NASDAQ: EGOV
Phone: (703) 909-3277        |        "NIC - the People Behind eGovernment"
--
If you try to fail, and you succeed - What did you just do?


