[Veritas-bu] Backups slow to a crawl

Subject: [Veritas-bu] Backups slow to a crawl
From: jeffm AT nicusa DOT com (Jeff McCombs)
Date: Fri, 25 Mar 2005 10:41:48 -0500
Gang,

    Ok. So I took Darren's suggestion and 'downed' the drive in NBU, drove
out to our facility with a new, unused tape and slapped it into the drive.

I hopped over to my home directory, where I've got a good 5G or so of data
with a good mix of file sizes and types, and ran the following:

tar cf - . | compress | dd obs=1024k of=/dev/rmt/1 conv=sync

And watched the output of iostat -xtcn, with samples being taken every
second.

And everything looked good for the first, oh.. 5 minutes or so. But the
longer the stream to tape ran, the worse the performance got. After 5
minutes I began to see throughput drop while busy climbed. Busy went from
4-10% with kw/s around 3 MB/sec when things were good, to 90-100% with kw/s
of 100-200 KB/sec. Eventually, 6 out of 10 samples were reading 100% busy
and a kw/s of 0. The other 4 samples would range from busy at 89-99%, with
kw/s down in the sub-50 KB/sec range.
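
The "high busy, low throughput" pattern described above can be counted
rather than eyeballed. The following is only a sketch: it assumes the
Solaris 'iostat -xn' column order (kw/s in column 4, %b in column 10, device
name last), and the function name and thresholds are made up for
illustration, not tuned values.

```shell
# Count iostat samples where the drive is busy but moving little data.
# Assumed column order (Solaris 'iostat -xn'):
#   r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
# flag_stalls, busy_min and kw_max are hypothetical names and values.
flag_stalls() {
    dev="$1"
    awk -v dev="$dev" -v busy_min=90 -v kw_max=500 '
        # Only look at lines for the requested device.
        $NF == dev && NF >= 11 {
            total++
            if ($10 + 0 >= busy_min && $4 + 0 <= kw_max) stalled++
        }
        END { printf "%d of %d samples stalled\n", stalled, total }
    '
}
```

One way to use it is to save the iostat output to a file during the dd run
and feed that file to the function afterwards, e.g.
'flag_stalls rmt/1 < iostat.log' (the log file name is hypothetical). The
summary is printed once the input ends.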

I also checked the output of 'iostat -xtcne' during this run, and while
there were soft and hard errors in the counters, these never actually
increased. 'iostat -nE' provided the following:

rmt/0           Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P
rmt/1           Soft Errors: 56 Hard Errors: 2 Transport Errors: 2
Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P

Again though, after performing more tests, I couldn't get these counters to
increase.
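
To check whether the counters move during a test, the 'iostat -nE' output
can be boiled down to one line per drive and compared before and after. This
is a rough sketch that assumes the header-line layout shown above; the
function name is hypothetical.

```shell
# Reduce 'iostat -nE' output to one line per tape device:
#   device soft hard transport
# Assumes lines of the form shown above, e.g.
#   rmt/1   Soft Errors: 56 Hard Errors: 2 Transport Errors: 2
err_counts() {
    awk '/Soft Errors:/ && $1 ~ /^rmt\// { print $1, $4, $7, $10 }'
}
```

For example, run 'iostat -nE | err_counts > before' ahead of the test and
'iostat -nE | err_counts > after' once it finishes, then 'diff before after'
(the file names are placeholders). No diff output means no counter changed.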

I did get a response from Veritas. The tech on the phone suggested I muck
with the buffers. Per his instructions, I set NET_BUFFER_SZ to 131072,
NUMBER_DATA_BUFFERS to 32, and SIZE_DATA_BUFFERS to 131072.
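
For reference, these settings are plain text files holding a single number.
The sketch below writes the three values given above; the file locations
are an assumption (the usual UNIX paths under /usr/openv/netbackup) and
should be checked against the documentation for your NetBackup version. The
function name is hypothetical.

```shell
# Write the three buffer values as NetBackup "touch files".
# Assumption: NET_BUFFER_SZ lives directly under the netbackup directory
# and the two *_DATA_BUFFERS files live under db/config.
# Pass a different root to try this against a scratch directory first.
set_nbu_buffers() {
    root="${1:-/usr/openv/netbackup}"
    mkdir -p "$root/db/config" || return 1
    echo 131072 > "$root/NET_BUFFER_SZ"
    echo 32     > "$root/db/config/NUMBER_DATA_BUFFERS"
    echo 131072 > "$root/db/config/SIZE_DATA_BUFFERS"
}
```

With these values, 32 buffers of 131072 bytes come to 4 MB of data buffer
space for each set of buffers allocated.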

I ran a full backup of our system dedicated to managing Checkpoint firewalls
(Sun V100, approx 8GB of data, 100 Mb FDX network on the same 3750 switch &
VLAN as the backup system), and performance was actually worse on the first
drive! Both drives sat at approximately 512 KB/sec, though busy stayed in
the 4-10% range for the duration of the backup.

Aargh. If this was a Windows system, I'd be blaming drivers.. I checked
cables, cleaned and reseated the drives, made sure the SCSI controller card
was seated properly, checked termination.. Guess I'll call Overland and have
them get me a new drive.

Many thanks to those of you who have helped me out already. It's much
appreciated!

-jeff

On 3/24/05 11:14 AM, "Darren Dunham" <ddunham AT taos DOT com> wrote:
> 
> I didn't reply initially because it appeared that you had fixed it.
> 
> I too would be very suspicious of those iostat figures.  To me the high
> busy alongside very low throughput screams drive problems.  Multiplexing
> shouldn't be affecting that.
> 
> If at all possible, I'd try to replicate the error by doing some drive
> testing outside of NBU.
> 
> Down the drive, load a scratch tape, then get busy with 'dd' or
> something.  Can you make it behave similarly?  If so, I'd make it my
> number one suspect.

-- 
Jeff McCombs                 |                                    NIC, Inc
Systems Administrator        |                       http://www.nicusa.com
jeffm AT nicusa DOT com      |                                NASDAQ: EGOV
Phone: (703) 909-3277        |        "NIC - the People Behind eGovernment"
--
    "So we went to Atari and said, 'Hey, we've got this amazing thing,
     even built with some of your parts, and what do you think about
     funding us? Or we'll give it to you. We just want to do it. Pay
     our salary, we'll come work for you.' And they said 'No.' So
     then we went to Hewlett-Packard, and they said 'Hey, we don't
     need you. You haven't got enough college yet.'"
                    - Steve Jobs, cofounder of Apple Computer