Veritas-bu

[Veritas-bu] Backups slow to a crawl

2005-03-25 13:31:56
Subject: [Veritas-bu] Backups slow to a crawl
From: SJACOBSO AT novell DOT com (Scott Jacobson)
Date: Fri, 25 Mar 2005 11:31:56 -0700

Jeff,

From the e-mail thread so far, you have some great suggestions and tips,
but if I'm reading your response correctly, you haven't yet determined
where your problem resides: pulling data over the network from the source,
or on the master server or tape drive side.

When I've started to see slowdowns like this, I've turned up the logging
verbosity and looked in the bptm log for "waiting for empty" vs. "waiting
for full" entries. I don't have the TID from Veritas that explains the
details, but comparing the two should help point to where to focus your
troubleshooting.

Scott J.
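
A rough way to make that comparison is to count how often each phrase shows up in a bptm log. The sketch below is only an illustration: the function name and the log path in the usage comment are assumptions, and it matches just the two phrases quoted above rather than any exact Veritas message format.

```shell
#!/bin/sh
# Count "waiting for empty" vs "waiting for full" lines in a bptm log.
# Assumption: each wait is logged on its own line containing one of these
# phrases; the exact message format is not taken from Veritas documentation.
count_bptm_waits() {
    log="$1"
    # grep -c prints the number of matching lines (0 when none match).
    empty=$(grep -c "waiting for empty" "$log")
    full=$(grep -c "waiting for full" "$log")
    echo "waiting_for_empty=$empty waiting_for_full=$full"
}

# Hypothetical usage; adjust the path to wherever your bptm logs live:
# count_bptm_waits /usr/openv/netbackup/logs/bptm/log.032505
```

Roughly speaking, a log dominated by waits for full buffers suggests the tape side is being starved of data, while one dominated by waits for empty buffers suggests the tape side is not draining them fast enough. Treat that as a starting point and check the Veritas documentation for the exact interpretation.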

>>> Jeff McCombs <jeffm AT nicusa DOT com> 3/25/2005 9:34 AM >>>

Bill,

    Yep. The library is connected via SCSI to the Sun system. I too
thought this might be a cable problem. Especially since this is a
single-ended connection (blame purchasing, not me). I thought maybe I might
be over that 3 meter cable length limit, or maybe something got knocked loose..

    I re-seated the SCSI card in the system, manually cleaned the drives
with a brand new cleaning tape (even though the one I had only has about 20
cleanings on it), re-seated the drives in the library, and double checked
the SCSI connections yet again.

    bptm logs show no errors. /var/adm/messages shows no errors. I even
dropped to OBP, set the diag switch to 'true', and ran obdiag.. No problems
reported. prtdiag -v.. No problems... Only thing I haven't tried is VCS.

    I'm pretty much at wit's end. I'm having a spare SCSI controller card
sent from our offices in Indianapolis, which should arrive sometime early
next week. I'll swap the card out just to be safe and run further tests.

    I'm also going to head back out onsite and physically swap the tape
drives in the library. I'll run some additional tests outside of NBU. If the
kw/s and %b problems follow the drive, I'll be able to say it's the drive.
If not, maybe it's the controller.. Or the cable, though I didn't see any
bent pins...

    Anyone ever have a SCSI cable just fail? Is that possible? I suppose it
is..

    It's probably sunspots. Yeah.. That's what I'll tell management..
"Sorry, backups suck right now because of Sunspots. Check back with me in 11
years, after this current spot-cycle completes.." :)

    -Jeff


On 3/25/05 11:12 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
wrote:

> Jeff:
>
> Just a thought... I am not sure I have thoroughly read this thread, so
> forgive me if I rehash stuff.
>
> Are your drives direct-attached via SCSI? If so, have you investigated
> SCSI cable problems? If the backup server is a Sun, then take a look in
> /var/adm/messages. Look for parity errors or statements about reduced
> transfer rate. If you see things like that, then look at the cable as the
> issue. This one is tough.
>
> Good luck,
>
> Bill
>
> --------------------------------------------------------
>      Bill Jorgensen
>      CSG Systems, Inc.
>      (w) 303.200.3282
>      (p) 303.947.9733
> --------------------------------------------------------
>      UNIX... Spoken with hushed and
>      reverent tones.
> --------------------------------------------------------
>
> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff McCombs
> Sent: Friday, March 25, 2005 8:42 AM
> To: Veritas-bu AT mailman.eng.auburn DOT edu
> Subject: Re: [Veritas-bu] Backups slow to a crawl
>
> Gang,
>
>     Ok. So I took Darren's suggestion and 'downed' the drive in NBU, drove
> out to our facility with a new, unused tape and slapped it into the drive.
>
> I hopped over to my home directory, where I've got a good 5G or so of data
> with a good mix of file sizes and types, and ran the following:
>
> tar cf - . | compress | dd obs=1024k of=/dev/rmt/1 conv=sync
>
> And watched the output of iostat -xtcn, with samples being taken every
> second.
>
> And everything looked good for the first, oh.. 5 minutes or so. But the
> longer that the stream to tape ran, the worse the performance started to
> get. After 5 minutes I began to see the kw/s:busy ratio drop. Busy went from
> 4-10% and kw/s of about 3 MB/sec when things were good, to 90-100% and kw/s
> of 100-200k/sec. The longer it ran, the worse it got. Eventually, 6 out of 10
> samples were reading 100% busy and a kw/s of 0. The other 4 samples would
> range from busy @ 89-99, kw/s down into the sub-50k/sec range.
>
> I also checked the output of 'iostat -xtcne' during this run, and while
> there were soft and hard errors in the counters, these never actually
> increased. 'iostat -nE' provided the following:
>
> rmt/0           Soft Errors: 18 Hard Errors: 0 Transport Errors: 0
> Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P
> rmt/1           Soft Errors: 56 Hard Errors: 2 Transport Errors: 2
> Vendor: QUANTUM  Product: DLT8000          Revision: 0250 Serial No: ?P
>
> Again though, after performing more tests, I couldn't get these counters to
> increase.
>
> I did get a response from Veritas. The tech on the phone suggested I muck
> with the buffers. Per his instructions, I set NET_BUFFER_SZ to 131072,
> NUMBER_DATA_BUFFERS to 32, and SIZE_DATA_BUFFERS to 131072.
>
> I ran a full backup of our system dedicated to managing Checkpoint firewalls
> (Sun V100, approx 8GB of data, 100 MB FDX network on the same 3750 switch &
> VLAN as the backup system), and performance was actually worse on the first
> drive! Both drives sat at approximately 512k/sec, though busy was in the
> 4-10% range for the duration of the backup.
>
> Aargh. If this was a Windows system, I'd be blaming drivers.. I checked
> cables, cleaned and reseated the drives, made sure the SCSI controller card
> was seated properly, checked termination.. Guess I'll call Overland and have
> them get me a new drive.
>
> Many thanks to those of you who have helped me out already. It's much
> appreciated!
>
> -jeff
>
> On 3/24/05 11:14 AM, "Darren Dunham" <ddunham AT taos DOT com> wrote:
>>
>> I didn't reply initially because it appeared that you had fixed it.
>>
>> I too would be very suspicious of those iostat figures.  To me the high
>> busy alongside very low throughput screams drive problems.  Multiplexing
>> shouldn't be affecting that.
>>
>> If at all possible, I'd try to replicate the error by doing some drive
>> testing outside of NBU.
>>
>> Down the drive, load a scratch tape, then get busy with 'dd' or
>> something.  Can you make it behave similarly?  If so, I'd make it my
>> number one suspect.
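
The "high busy, very low throughput" pattern discussed in this thread can be counted mechanically instead of eyeballed. The following is a minimal sketch, assuming Solaris `iostat -xn` style output in which a header line names the columns (including kw/s, %b and device). The function name and the thresholds in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Flag iostat samples where a device is very busy but writing very little.
# Assumption: input looks like Solaris `iostat -xn` output, with a header
# line naming the columns (including kw/s, %b and device) before the data.
flag_stalled_samples() {
    dev="$1"      # device name as iostat prints it, e.g. rmt/1
    min_busy="$2" # flag samples with %b at or above this value
    max_kw="$3"   # ...and kw/s at or below this value
    awk -v dev="$dev" -v min_busy="$min_busy" -v max_kw="$max_kw" '
        # Header line: remember which columns hold kw/s, %b and device.
        /kw\/s/ && /%b/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "kw/s")   kw = i
                if ($i == "%b")     busy = i
                if ($i == "device") name = i
            }
            next
        }
        # Data line for the chosen device.
        kw && name && $name == dev {
            total++
            if ($busy + 0 >= min_busy + 0 && $kw + 0 <= max_kw + 0) stalled++
        }
        END { printf "samples=%d stalled=%d\n", total, stalled }
    '
}

# Hypothetical usage: 60 one-second samples while the dd test is running.
# iostat -xn 1 60 | flag_stalled_samples rmt/1 90 200
```

A stalled count that keeps growing on one drive as the write continues, and that follows the drive when it is moved to another slot, would point at the drive rather than the controller or cable.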

-- 
Jeff McCombs                 |                                    NIC, Inc
Systems Administrator        |                       http://www.nicusa.com
jeffm AT nicusa DOT com             |                                NASDAQ: EGOV
Phone: (703) 909-3277        |        "NIC - the People Behind eGovernment"

--
If you try to fail, and you succeed - What did you just do?


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu


