We also just fixed a similar problem that was caused by a dodgy SCSI card
in our E10K.
Again there were no errors in /var/adm/messages, just an SDLT drive
running at 900Kb/sec!!!
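
For anyone chasing the same kind of fault: Solaris keeps per-device
error counters that never make it to syslog. Something like

    iostat -En

dumps the Soft/Hard/Transport error counts for every device, tape
drives included, and a non-zero Hard or Transport count on a tape
entry is a good hint even when /var/adm/messages is clean.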
Sean
> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
> Chris.Romano AT Lazard DOT com
> Sent: 24 March 2005 15:56
> To: Jeff McCombs
> Cc: veritas-bu AT mailman.eng.auburn DOT edu;
> veritas-bu-admin AT mailman.eng.auburn DOT edu
> Subject: Re: [Veritas-bu] Backups slow to a crawl
>
>
>
> I had a similar problem... each morning I would get into the
> office and see rmt2 still doing its last few backups while
> the other 3 tape drives had finished. The problem turned out
> to be the drive: Quantum swapped it out with a new one and
> the problem was solved.
>
> Even though rmt2 was backing things up, it was operating at a
> crawl due to I/O errors and retries. The interesting thing
> was, no errors were showing in /var/adm/messages.
>
> Quantum could see the errors when they connected directly to
> the Library with their PC.
>
>
> Chris.
>
>
> From: "Jeff McCombs" <jeffm AT nicusa DOT com>
> Sent by: veritas-bu-admin AT mailman.eng.auburn DOT edu
> To: veritas-bu AT mailman.eng.auburn DOT edu
> Date: 24 Mar 2005 10:21 AM
> Subject: Re: [Veritas-bu] Backups slow to a crawl
>
> Ok. I lied. Removing multiplexing did not fix the problem.
>
> It's strange; I _know_ my network is clean, and I know my backup
> policies should be fine.
>
> I'm still concerned about the busy percentage of rmt/1 vs. rmt/0.
>
> Just to refresh for new readers, my backups are failing for
> some clients due to a status-196 (window closed). These are
> small systems without a lot of data on them. It doesn't
> seem to be related to the backup type, MPX, or streams
> setting. For example, our jumpstart system took 9 hours to
> back up 22G, averaging 672Kb/sec, whereas our development
> database server backed up 24G in 3.5 hours, an average of
> 1743K/sec (though its file count was almost half that of
> the jumpstart system, which may have an impact).
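>
> (Back-of-envelope check on those averages, rough only since the
> elapsed times above are rounded and NBU's own accounting will
> differ a bit:
>
> echo "scale=0; 22*1024*1024/(9*3600)" | bc    # ~711 KB/s, jumpstart
> echo "scale=0; 24*1024*1024/(3.5*3600)" | bc  # ~1997 KB/s, DB box
>
> so the 672K and 1743K figures NBU reports are in the right ballpark.)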
>
> In trying to troubleshoot, I watched the system's I/O
> performance with 'iostat' and noticed that /dev/rmt/1, the
> 2nd drive in our library (Overland Neo 2000), appears to be
> having trouble sending data to tape: the %-busy on the
> drive shoots up to 100% as kw/s (kbytes written/sec)
> drops drastically, down into the 200-300 range.
>
> /dev/rmt/0 has no problems during the same time period:
> %-busy sits anywhere from 2 to 30%, and kw/s is in the 1.2K
> to 2.5K range.
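>
> (That behavior is easy to watch live, assuming your tape devices
> show up as rmt/N the way mine do:
>
> iostat -nx 5 | egrep 'device|rmt'
>
> prints the header row plus just the tape drives every 5 seconds.)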
>
> The only correlation I can find is that the systems failing
> backups with a 196 status are the ones that were queued to
> rmt/1. Systems queued to rmt/0 back up fine, and usually their
> backups complete in 15 minutes or so.
>
> Now correct me if I'm wrong, but under ideal circumstances,
> the following should happen as backup windows open and a
> schedule starts:
>
> client jobs are assigned to available drives (per policy
> or global configuration), and the division of work is done
> on a per-client basis rather than per-job (so clientA:job1 ->
> drive1 and clientA:job2 -> drive2 doesn't occur).
>
> As client jobs are completed, any available drive should
> pick up the backlog for any other drive(?). For example:
>
> Job queue per drive
> Drive 1: Drive 2:
> ClientA:job1 ClientB:job1
> ClientA:job2 ClientB:job2
> ClientC:job1 ClientD:job1
> ClientC:job2 ClientD:job2
> ClientC:job3 ClientE:job1
>
> If Drive-1 clears its jobs while Drive-2 is still
> working on ClientB, Drive-1 should pick up ClientE, and
> possibly ClientD, right?
>
> This doesn't seem to be happening, and I'm curious as to
> why. I did see 'Jerry's' (though he signs his email as
> Brian) post yesterday about technote #274544 (or #274559 for
> 5.0 folks) and the related #237534 technote. However, even
> after attempting the workarounds suggested there and
> specifying the storage unit in the policy (we only have one
> anyway), I'm still getting 196's. We don't have a large
> volume DB either, with only 100 tapes.
>
> Can anyone shed some light here? I've included some
> specifics on the policies and clients below. I worry that
> rmt/1 is failing, and the darn thing just got out of
> warranty last month to boot (of course!). I've gone ahead and
> opened a service request with Veritas, but... well, you know
> how long getting anything useful out of them can be (it took
> me a month to get a 5.1 media kit!).
>
> System info:
> Media Server / Master server are same system.
> SunFire V240, Solaris 9, current recommended patch
> set as of 02/05
> NBU Enterprise 5.0 MP4
> Overland Neo 2000 Storage, 26-slot / 2-Drive DLT library.
>
> # of clients: 32
> Clients are Solaris 9 systems, 5.0 MP4 client software.
> Client file list: ALL_LOCAL_DRIVES
> No extra directives in bp.conf
>
> Policy configuration (CDC-revised):
> Type: Standard
> Storage Unit: backup-dlt2-robot-tld-0
> Volume Pool: NetBackup (overridden per schedule)
> Checkpoints: 15-minutes
> Limit Jobs: not Set
> Priority: 0
> Follow NFS: Not Set
> Cross Mount Pts: Yes
> Collect TIR: Yes with Move
> Compression: Yes
> Multiple Streams: Yes
> No Advanced client settings
>
> Schedule: Daily-Differential
> Calendar based: Mo, We, Fr (18:00 - 06:00)
> Policy Pool: Daily
> Retention: 2 weeks
> Multiplexing: 1
>
> Schedule: Daily-Cumulative
> Calendar based: Sa, Tu, Th (18:00 - 06:00)
> Policy Pool: Daily
> Retention: 2 weeks
> Multiplexing: 1
>
> Schedule: Weekly
> Calendar based: Su (00:00 - 23:59 window)
> Retries: Yes
> Multiple Copies:
> #1 - Pool: Weekly-Short, Retention 2-weeks
> #2 - Pool: Weekly-Offsite, Retention 1-month
> Multiplexing: 1
>
> Schedule: Monthly
> Calendar based: 1st of every month
> (M-F 18:00-06:00, Sa/Su 00:00-23:59)
> Retries: Yes
> Multiple Copies:
> #1 - Pool: Monthly-short, Retention 2 months
> #2 - Pool: Monthly-Offsite, retention 6 months
> Multiplexing: 1
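>
> (For reference, the same policy and schedule details can be pulled
> straight off the master with the stock admin command, standard
> path on a 5.x install:
>
> /usr/openv/netbackup/bin/admincmd/bppllist <policyname> -U
>
> in case anyone wants to compare against their own setup.)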
>
>
> On 3/23/05 1:32 PM, "Jeff McCombs" <jeffm AT nicusa DOT com> wrote:
>
> > Yeah, I originally thought that this might be a network problem
> > myself. However, I have checked the network settings on the Sun
> > systems and the Cisco switches in between. I'm even forcing 100FDX
> > on the switch and system just to be safe (autonegotiation never
> > works, regardless of what the vendors say).
> >
> > Seems that this is an MPX thing. I did some further testing:
> > backing up systems without multiplexing enabled, the problem
> > goes away. The rmt/1 device stops pegging at 100% busy with
> > 0 kw/s, and client full backups drop back down into the
> > 15-minute range...
> >
> >
> >
> >
> > On 3/23/05 10:56 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
> > wrote:
> >
> >> Jeff:
> >>
> >> A few things to consider (assuming a Sun server as the NBU master):
> >>
> >> 1.) Are you aware of anything that has changed on your NBU server?
> >> 2.) Are you aware of anything that has changed with your network?
> >> (Providing you are doing Ethernet-based backups. If not, what
> >> about the SAN?)
> >> 3.) Are you aware of any changes to the policies?
> >>
> >> If no to the above, try the following:
> >>
> >> 1.) Find out what Veritas recommends for your environment for
> >> these two variables:
> >> NUMBER_DATA_BUFFERS
> >> SIZE_DATA_BUFFERS
> >> These are touch files in /usr/openv/netbackup/db/config (a sketch
> >> with common starting values follows after this list). The solution
> >> center (Professional Services) may not give you the values if you
> >> open a ticket; ask around if they do not.
> >>
> >> 2.) Check the network driver settings for a few things. This
> >> depends on the network type you are using: 100Mb-switched,
> >> 10Mb-switched, etc.
> >>
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_autoneg_cap
> >> 1
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_100hdx_cap
> >> 1
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_100fdx_cap
> >> 1
> >> The output above says that the qfe driver is advertising
> >> autonegotiation plus 100Mb half and full duplex. Once you know
> >> how the network driver is configured, go to your network guys and
> >> ask how the port on the switch is configured (unless you are the
> >> network guy). If the port is NOT set to 100-full or autonegotiate,
> >> have them set it accordingly (forcing the host side with ndd is
> >> sketched after this list as well).
> >>
> >> 3.) Reseat the RJ-45 connectors for the physical connections.
> >>
> >> These are some things that have bitten us in the past.
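> >>
> >> For what it's worth, a minimal sketch of items 1 and 2. The buffer
> >> values are common starting points, not Veritas's blessed numbers
> >> for any particular site, so treat them as an example only:
> >>
> >> # Touch files are just a number in a file; create as root on the
> >> # media server and the next backup job picks them up:
> >> echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
> >> echo 16 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
> >>
> >> # Forcing qfe0 to 100-full instead of autonegotiating. The switch
> >> # port must be forced to match or you get a duplex mismatch, and
> >> # these settings do not persist across a reboot:
> >> ndd -set /dev/qfe instance 0
> >> ndd -set /dev/qfe adv_100fdx_cap 1
> >> ndd -set /dev/qfe adv_100hdx_cap 0
> >> ndd -set /dev/qfe adv_10fdx_cap 0
> >> ndd -set /dev/qfe adv_10hdx_cap 0
> >> ndd -set /dev/qfe adv_autoneg_cap 0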
> >>
> >> Good luck,
> >>
> >> Bill
> >>
> >> --------------------------------------------------------
> >> Bill Jorgensen
> >> CSG Systems, Inc.
> >> (w) 303.200.3282
> >> (p) 303.947.9733
> >> --------------------------------------------------------
> >> UNIX... Spoken with hushed and
> >> reverent tones.
> >> --------------------------------------------------------
> >>
> >> -----Original Message-----
> >> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> >> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff
> >> McCombs
> >> Sent: Wednesday, March 23, 2005 6:51 AM
> >> To: veritas-bu AT mailman.eng.auburn DOT edu
> >> Subject: [Veritas-bu] Backups slow to a crawl
> >>
> >> Gurus,
> >>
> >> NB 5.0 MP4, single combination media/master server, Solaris 9.
> >> Overland Neo 2000 26-slot 2 drive DLT.
> >>
> >> I'm noticing that for some reason or another, all of my client
> >> backups have slowed to a _crawl_. A _cumulative_ (!) backup of
> >> local disk on a Sun V100 is taking somewhere on the order of 2
> >> hours at this point, and with over 40 systems, I'm blowing past
> >> my window consistently.
> >>
> >> I'm not quite sure what's going on here, but as I sit and watch
> >> the output from 'iostat', I'm noticing that rmt/1 (the 2nd drive
> >> in the Neo) is fluctuating between 100% busy with kw/s close to
> >> zero, and 1-15% busy with kw/s up into the 1000's.
> >>
> >> rmt/0 seems to be fine; kw/s sits consistently up in the 1.8-2K
> >> range, while busy is anywhere from 2% - 30% on average. My other
> >> disks aren't working hard, CPU isn't loaded, and I've got plenty
> >> of memory.
> >>
> >> The policy I'm using allows for multiple data streams, no limits
> >> on jobs, and most schedules allow for an MPX of 2. I'm backing up
> >> ALL_LOCAL_DRIVES on all clients, and I'm not using any NEW_STREAM
> >> directives. I'm not seeing any errors on the media either.
> >>
> >> Can anyone shed some light on what might be happening here? Am I
> >> looking at a drive that might be having some problems, or am I
> >> barking up the wrong tree, and it's something else entirely?
> >>
> >> A small sample of iostat output covering the affected devices is
> >> below.
> >>
> >> Sample (extra disks removed from output):
> >> root@backup(pts/1):~# iostat -nx 1 100
> >> extended device statistics
> >> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> >> 0.0 4.1 0.0 252.2 0.0 0.0 0.0 5.9 0 2 rmt/0
> >> 0.0 4.6 0.0 278.4 0.0 0.1 0.0 27.3 0 12 rmt/1
> >>
> >> 0.0 4.1 0.0 252.3 0.0 0.0 0.0 5.9 0 2 rmt/0
> >> 0.0 4.6 0.0 278.4 0.0 0.1 0.0 27.3 0 12 rmt/1
> >>
> >> 0.0 33.0 0.0 2076.4 0.0 0.2 0.0 5.8 0 19 rmt/0
> >> 0.0 2.0 0.0 125.8 0.0 1.0 0.0 490.0 0 98 rmt/1
> >>
> >> 0.0 38.0 0.0 2394.0 0.0 0.2 0.0 5.4 0 21 rmt/0
> >> 0.0 8.0 0.0 504.0 0.0 1.0 0.0 124.9 0 100 rmt/1
> >>
> >> 0.0 27.0 0.0 1701.1 0.0 0.2 0.0 6.5 0 17 rmt/0
> >> 0.0 2.0 0.0 126.0 0.0 1.0 0.0 499.9 0 100 rmt/1
> >>
> >> 0.0 33.0 0.0 2078.9 0.0 0.2 0.0 5.3 0 18 rmt/0
> >> 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 100 rmt/1
> >>
> >> 0.0 16.0 0.0 1008.0 0.0 0.1 0.0 6.2 0 10 rmt/0
> >> 0.0 13.0 0.0 819.0 0.0 0.6 0.0 48.4 0 63 rmt/1
> >>
> >> 0.0 40.0 0.0 2520.1 0.0 0.2 0.0 5.9 0 24 rmt/0
> >> 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0 100 rmt/1
> >>
> >> 0.0 33.0 0.0 2078.9 0.0 0.2 0.0 5.3 0 18 rmt/0
> >> 0.0 10.0 0.0 630.0 0.0 1.0 0.0 99.9 0 100 rmt/1
> >>
>
> --
> Jeff McCombs           | NIC, Inc
> Systems Administrator  | http://www.nicusa.com
> jeffm AT nicusa DOT com | NASDAQ: EGOV
> Phone: (703) 909-3277  | "NIC - the People Behind eGovernment"
> --
> What do you do for endangered animals that only eat endangered plants?
>
>
> _______________________________________________
> Veritas-bu maillist - Veritas-bu AT mailman.eng.auburn DOT edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu