Jeff:
Replacement may well be the answer. Run iostat with -e and see what you get. Also, look in /usr/openv/netbackup/logs/bptm and see whether the tape manager is logging any errors for tapes while they are mounted on that drive.
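To illustrate the first check: `iostat -e` on Solaris prints per-device soft/hard/transport error counters, and anything nonzero on the suspect rmt device is worth chasing. A rough sketch only; the four-column layout is assumed from Solaris output, and the sample counts piped in below are made up for illustration, not real data:

```shell
# Flag any device reporting a nonzero error total in the
# soft/hard/transport/total summary that `iostat -e` prints.
# The two header lines are skipped; fields are assumed to be
# device, s/w, h/w, trn, tot.
nonzero_errs() {
  awk 'NR > 2 && $5 > 0 { print $1, "total errors:", $5 }'
}

# Made-up sample in that format (on the real server you would
# pipe the output of `iostat -e` itself into the filter):
printf '%s\n' \
  '          ---- errors ---' \
  'device  s/w h/w trn tot' \
  'rmt/0     0   0   0   0' \
  'rmt/1     0  12   3  15' | nonzero_errs
```

A drive whose hard or transport counters keep climbing while /var/adm/messages stays quiet matches the failure mode Chris describes below.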
Bill
--------------------------------------------------------
Bill Jorgensen
CSG Systems, Inc.
(w) 303.200.3282
(p) 303.947.9733
--------------------------------------------------------
UNIX... Spoken with hushed and
reverent tones.
--------------------------------------------------------
-----Original Message-----
From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
Chris.Romano AT Lazard DOT com
Sent: Thursday, March 24, 2005 8:56 AM
To: Jeff McCombs
Cc: veritas-bu AT mailman.eng.auburn DOT edu;
veritas-bu-admin AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Backups slow to a crawl
I had a similar problem: each morning I would get into the office and see rmt2 still doing its last few backups while the other 3 tape drives had finished. The problem turned out to be the drive; Quantum swapped it out with a new one and the problem was solved.
Even though rmt2 was backing things up, it was operating at a crawl due to I/O errors and retries. The interesting thing was that no errors were showing in /var/adm/messages. Quantum could see the errors when they connected directly to the library with their PC.
Chris.
Chris.
From: "Jeff McCombs" <jeffm AT nicusa DOT com>
Sent by: veritas-bu-admin AT mailman.eng.auburn DOT edu
To: veritas-bu AT mailman.eng.auburn DOT edu
Date: 24 Mar 2005 10:21 AM
Subject: Re: [Veritas-bu] Backups slow to a crawl
Ok. I lied. Removing multiplexing did not fix the problem.
It's strange; I _know_ my network is clean, and I know my backup policies should be fine.
I'm still concerned about the busy percentage of rmt/1 vs. rmt/0.
Just to refresh for new readers: my backups are failing for some clients due to a status 196 (window closed). These are small systems without a lot of data on them, and the failures don't seem to be related to the backup type, MPX, or streams setting. For example, our jumpstart system took 9 hours to back up 22G, averaging 672Kb/sec, whereas our development database server backed up 24G in 3.5 hours at an average speed of 1743K/sec (though the number of files was almost half that of the jumpstart system, which may have an impact).
In trying to troubleshoot, I watched the system's I/O performance using 'iostat' and noticed that /dev/rmt/1, the 2nd drive in our library (Overland Neo 2000), appears to be having some problems sending data to tape. The %-busy on the drive shoots up to 100% as kw/s (kbytes written/sec) drops drastically, down into the 200-300 range.
/dev/rmt/0 has no problems during the same time period: %-busy sits anywhere from 2 to 30%, and kw/s is in the 1.2K to 2.5K range.
The only correlation I can find with systems that are failing backups with a 196 status is that they were queued to rmt/1. Systems queued to rmt/0 back up fine, and usually their backups complete in 15 minutes or so.
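The stall signature described above (a drive pegged near 100% busy while kw/s collapses) can be picked out of an `iostat -nx` stream with a short awk filter. A sketch only: the column positions ($4 = kw/s, $10 = %b, $11 = device) match the sample output quoted later in this thread, and the 90% / 600 KB/s thresholds are arbitrary cutoffs chosen for illustration:

```shell
# Print device, %busy and kw/s for any interval where a device is
# more than 90% busy yet writing under 600 KB/s -- i.e. busy but
# not moving data, the pattern seen on rmt/1.
flag_stalls() {
  awk '$10 > 90 && $4 < 600 { print $11, "busy=" $10 "%", "kw/s=" $4 }'
}

# Two captured data lines: rmt/1 is flagged, rmt/0 is not.
printf '%s\n' \
  '0.0  2.0 0.0  125.8 0.0 1.0 0.0 490.0 0  98 rmt/1' \
  '0.0 33.0 0.0 2076.4 0.0 0.2 0.0   5.8 0  19 rmt/0' | flag_stalls
```

Fed from a live `iostat -nx 1`, something like this gives a quick way to confirm whether the 196 failures line up in time with rmt/1 stalls.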
Now correct me if I'm wrong, but under ideal circumstances the following should happen as backup windows open and a schedule starts: client jobs are assigned to available drives (per policy or global configuration), and division of work is done on a client basis, not a job basis (so clientA:job1 -> drive1 and clientA:job2 -> drive2 doesn't occur). As client jobs complete, any available drive should pick up the backlog for any other drive(?). For example:
Job queue per drive
Drive 1:         Drive 2:
ClientA:job1     ClientB:job1
ClientA:job2     ClientB:job2
ClientC:job1     ClientD:job1
ClientC:job2     ClientD:job2
ClientC:job3     ClientE:job1
If Drive 1 clears its jobs while Drive 2 is still working on ClientB, Drive 1 should pick up ClientE, and possibly ClientD, right?
This doesn't seem to be happening, and I'm curious as to why. I did see 'Jerry's' (though he signs his email as Brian) post yesterday about technote #274544 (or #274559 for 5.0 folks), and the related technote #237534. However, even after attempting the workarounds suggested in the technote and specifying the storage unit in the policy (we only have one anyway), I'm still getting 196's. We don't have a large volume DB either, with only 100 tapes.
Can anyone shed some light here? I've included some specifics on the policies and clients below. I worry that rmt/1 is failing, and the darn thing just got out of warranty last month to boot (of course!). I've gone ahead and opened a service request with Veritas, but... well, you know how long getting anything useful out of them can take (it took me a month to get a 5.1 media kit!).
System info:
Media Server / Master server are the same system.
SunFire V240, Solaris 9, current recommended patch set as of 02/05
NBU Enterprise 5.0 MP4
Overland Neo 2000 Storage, 26-slot / 2-drive DLT library.
# of clients: 32
Clients are Solaris 9 systems, 5.0 MP4 client software.
Client file list: ALL_LOCAL_DRIVES
No extra directives in bp.conf
Policy configuration (CDC-revised):
Type: Standard
Storage Unit: backup-dlt2-robot-tld-0
Volume Pool: NetBackup (overridden per schedule)
Checkpoints: 15-minutes
Limit Jobs: not Set
Priority: 0
Follow NFS: Not Set
Cross Mount Pts: Yes
Collect TIR: Yes with Move
Compression: Yes
Multiple Streams: Yes
No Advanced client settings
Schedule: Daily-Differential
Calendar based: Mo, We, Fr (18:00 - 06:00)
Policy Pool: Daily
Retention: 2 weeks
Multiplexing: 1
Schedule: Daily-Cumulative
Calendar based: Sa, Tu, Th (18:00 - 06:00)
Policy Pool: Daily
Retention: 2 weeks
Multiplexing: 1
Schedule: Weekly
Calendar based: Su (00:00 - 23:59 window)
Retries: Yes
Multiple Copies:
#1 - Pool: Weekly-Short, Retention 2-weeks
#2 - Pool: Weekly-Offsite, Retention 1-month
Multiplexing: 1
Schedule: Monthly
Calendar based: 1st of every month
(M-F 18:00-06:00, Sa/Su 00:00-23:59)
Retries: Yes
Multiple Copies:
#1 - Pool: Monthly-short, Retention 2 months
#2 - Pool: Monthly-Offsite, retention 6 months
Multiplexing: 1
On 3/23/05 1:32 PM, "Jeff McCombs" <jeffm AT nicusa DOT com> wrote:
> Yeah, I originally thought that this might be a network problem myself.
> However, I have checked the network settings on the Sun systems and the
> Cisco switches in between. I'm even forcing 100FDX on the switch and
> system just to be safe (auto-negotiation never works, regardless of what
> the vendors say).
>
> Seems that this is an MPX thing. I did some further testing, backing up
> systems without multiplexing enabled, and the problem goes away. The
> rmt/1 device stops pegging at 100% busy with 0 kw/s, and client full
> backups drop back down into the 15-minute range...
>
>
>
>
> On 3/23/05 10:56 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
> wrote:
>
>> Jeff:
>>
>> A few things to consider (assuming a Sun server as the NBU master):
>>
>> 1.) Are you aware of anything that has changed on your NBU server?
>> 2.) Are you aware of anything that has changed with your network?
>> (Provided you are doing Ethernet-based backups. If not, what about
>> the SAN?)
>> 3.) Are you aware of any changes to the policies?
>>
>> If no to the above try the following:
>>
>> 1.) Find out what Veritas recommends for your environment for these
>> two variables:
>> NUMBER_DATA_BUFFERS
>> SIZE_DATA_BUFFERS
>> These are found in /usr/openv/netbackup/db/config. They may not give
>> them to you if you open a ticket with the solution center (Professional
>> Services); ask around if they do not.
>>
>> 2.) Check the network driver settings for a few things. This depends on
>> the network type you are using: 100Mb-switched, 10Mb-switched, etc.
>>
>> root[prod-backup:/]# ndd -get /dev/qfe adv_autoneg_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100hdx_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100fdx_cap
>> 1
>> What the output above says is that the qfe driver is advertising 100Mb
>> half duplex, 100Mb full duplex, and autonegotiation. Once you know how
>> the network driver is configured, go to your network guys and ask how
>> the port on the switch is configured (unless you are the network guy).
>> If the port is NOT set to 100-full or autonegotiate, have them set it
>> accordingly.
>>
>> 3.) Reseat the RJ-45 connectors for the physical connections.
>>
>> These are some things that have bit us in the past.
>>
>> Good luck,
>>
>> Bill
>>
>> --------------------------------------------------------
>> Bill Jorgensen
>> CSG Systems, Inc.
>> (w) 303.200.3282
>> (p) 303.947.9733
>> --------------------------------------------------------
>> UNIX... Spoken with hushed and
>> reverent tones.
>> --------------------------------------------------------
>>
>> -----Original Message-----
>> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
>> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff
>> McCombs
>> Sent: Wednesday, March 23, 2005 6:51 AM
>> To: veritas-bu AT mailman.eng.auburn DOT edu
>> Subject: [Veritas-bu] Backups slow to a crawl
>>
>> Gurus,
>>
>> NB 5.0 MP4, single combination media/master server, Solaris 9.
>> Overland
>> Neo 2000 26-slot 2 drive DLT.
>>
>> I'm noticing that for some reason or another, all of my client backups
>> have slowed to a _crawl_. A _cumulative_ (!) backup of local disk on a
>> Sun V100 is taking somewhere on the order of 2 hours at this point, and
>> with over 40 systems, I'm blowing past my window consistently.
>>
>> I'm not quite sure what's going on here, but as I sit and watch the
>> output from 'iostat', I'm noticing that rmt/1 (the 2nd drive in the
>> Neo) is fluctuating between 100% busy with kw/s close to zero, and
>> 1-15% busy with kw/s up in the 1000's.
>>
>> rmt/0 seems to be fine; kw/s sits consistently up in the 1.8-2K range,
>> while busy is anywhere from 2% - 30% on average. My other disks aren't
>> working hard, the CPU isn't loaded, and I've got plenty of memory.
>>
>> The policy I'm using allows multiple data streams with no limit on
>> jobs, and most schedules allow an MPX of 2. I'm backing up
>> ALL_LOCAL_DRIVES on all clients, and I'm not using any NEW_STREAM
>> directives. I'm not seeing any errors on the media either.
>>
>> Can anyone shed some light on what might be happening here? Am I
>> looking at a drive that might be having some problems, or am I barking
>> up the wrong tree and it's something else entirely?
>>
>> A small sample of iostat output covering the affected devices is below
>> (extra disks removed from the output):
>>
>> root@backup(pts/1):~# iostat -nx 1 100
>> extended device statistics
>> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
>> 0.0 4.1 0.0 252.2 0.0 0.0 0.0 5.9 0 2 rmt/0
>> 0.0 4.6 0.0 278.4 0.0 0.1 0.0 27.3 0 12 rmt/1
>>
>> 0.0 4.1 0.0 252.3 0.0 0.0 0.0 5.9 0 2 rmt/0
>> 0.0 4.6 0.0 278.4 0.0 0.1 0.0 27.3 0 12 rmt/1
>>
>> 0.0 33.0 0.0 2076.4 0.0 0.2 0.0 5.8 0 19 rmt/0
>> 0.0 2.0 0.0 125.8 0.0 1.0 0.0 490.0 0 98 rmt/1
>>
>> 0.0 38.0 0.0 2394.0 0.0 0.2 0.0 5.4 0 21 rmt/0
>> 0.0 8.0 0.0 504.0 0.0 1.0 0.0 124.9 0 100 rmt/1
>>
>> 0.0 27.0 0.0 1701.1 0.0 0.2 0.0 6.5 0 17 rmt/0
>> 0.0 2.0 0.0 126.0 0.0 1.0 0.0 499.9 0 100 rmt/1
>>
>> 0.0 33.0 0.0 2078.9 0.0 0.2 0.0 5.3 0 18 rmt/0
>> 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 100 rmt/1
>>
>> 0.0 16.0 0.0 1008.0 0.0 0.1 0.0 6.2 0 10 rmt/0
>> 0.0 13.0 0.0 819.0 0.0 0.6 0.0 48.4 0 63 rmt/1
>>
>> 0.0 40.0 0.0 2520.1 0.0 0.2 0.0 5.9 0 24 rmt/0
>> 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 100 rmt/1
>>
>> 0.0 33.0 0.0 2078.9 0.0 0.2 0.0 5.3 0 18 rmt/0
>> 0.0 10.0 0.0 630.0 0.0 1.0 0.0 99.9 0 100 rmt/1
>>
--
Jeff McCombs | NIC, Inc
Systems Administrator | http://www.nicusa.com
jeffm AT nicusa DOT com | NASDAQ: EGOV
Phone: (703) 909-3277 | "NIC - the People Behind eGovernment"
--
What do you do for endangered animals that only eat endangered plants?
_______________________________________________
Veritas-bu maillist - Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu