Veritas-bu

[Veritas-bu] Backups slow to a crawl

2005-03-24 15:03:39
Subject: [Veritas-bu] Backups slow to a crawl
From: sean_clarke AT softhome DOT net (Sean Clarke)
Date: Thu, 24 Mar 2005 20:03:39 -0000
We also just fixed a similar problem that was caused by a dodgy SCSI card
in our E10K.

Again there were no errors in /var/adm/messages, just an SDLT drive
running at 900Kb/sec!!!
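
For anyone chasing the same thing: a quiet /var/adm/messages doesn't
mean the HBA or drive is healthy. A quick check on Solaris - sketched
here from memory rather than pasted from our session - is to look at
the per-device error counters and the drive status directly:

    # per-device soft/hard/transport error counters, tape drives included
    iostat -En

    # status of the suspect drive
    mt -f /dev/rmt/1 status

Non-zero hard or transport errors against a tape device that never
show up in messages are worth chasing.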

Sean

> -----Original Message-----
> From: veritas-bu-admin AT mailman.eng.auburn DOT edu 
> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of 
> Chris.Romano AT Lazard DOT com
> Sent: 24 March 2005 15:56
> To: Jeff McCombs
> Cc: veritas-bu AT mailman.eng.auburn DOT edu; 
> veritas-bu-admin AT mailman.eng.auburn DOT edu
> Subject: Re: [Veritas-bu] Backups slow to a crawl
> 
> 
> 
> I had a similar problem... each morning I would get into the 
> office and see rmt2 still doing its last few backups while 
> the other 3 tape drives had finished. The problem turned out 
> to be the drive... Quantum swapped it out with a new one and 
> the problem was solved.
> 
> Even though rmt2 was backing things up, it was operating at a 
> crawl due to I/O errors and retries. The interesting thing 
> was, no errors were showing in /var/adm/messages.
> 
> Quantum could see the errors when they connected directly to 
> the Library with their PC.
> 
> 
> Chris.
> 
> 
> From:     "Jeff McCombs" <jeffm AT nicusa DOT com>
> Sent by:  veritas-bu-admin AT mailman.eng.auburn DOT edu
> To:       veritas-bu AT mailman.eng.auburn DOT edu
> Date:     24 Mar 2005 10:21 AM
> Subject:  Re: [Veritas-bu] Backups slow to a crawl
> 
> 
> 
> Ok. I lied. Removing multiplexing did not fix the problem.
> 
> It's strange, I _know_ my network is clean, I know my backup 
> policies should be fine..
> 
> I'm still concerned about the busy percentage of rmt/1 vs. rmt/0.
> 
> Just to refresh for new readers, my backups are failing for 
> some clients due to a status-196 (window closed). These are 
> small systems without a lot of data on them, and it doesn't 
> seem to be related to the backup type, MPX or streams 
> setting. For example, our jumpstart system took 9 hours to 
> back up 22G, averaging 672Kb/sec, whereas our development 
> database server backed up 24G in 3.5 hours at an average of 
> 1743Kb/sec (though the number of files was almost half that of 
> the jumpstart system, which may have an impact).
> 
> In trying to troubleshoot, I watched the system's I/O 
> performance using 'iostat' and noticed that /dev/rmt/1, the 
> 2nd drive in our library (Overland Neo 2000), appears to be 
> having some problems sending data to tape. The %-busy on the 
> drive shoots up to 100% as kw/s (kbytes written/sec) drops 
> drastically down into the 200-300 range.
> 
> /dev/rmt/0 has no problems during the same time period. 
> %-busy sits anywhere from 2 to 30%, and kw/s stays in the 
> 1.2K to 2.5K range.
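> 
> (For anyone who wants to watch the same thing: this is just the 
> stock Solaris iostat with a filter on the tape devices - the grep 
> is only for illustration, the raw 'iostat -nx' output is at the 
> bottom of my original mail:
> 
>     iostat -xn 5 | egrep 'device|rmt/'
> 
> That refreshes every 5 seconds and keeps the header plus the rmt 
> entries, so %b and kw/s for rmt/0 and rmt/1 can be compared side 
> by side.)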
> 
> The only correlation I can find is that the systems failing 
> backups with a 196 status were queued to rmt/1. Systems 
> queued to rmt/0 back up fine, and their backups usually 
> complete in 15 minutes or so.
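> 
> (One way to double-check that correlation from the command line - 
> commands listed from memory, so verify the flags on your install 
> before relying on them:
> 
>     # all backup status records for the last 24 hours (196's included)
>     /usr/openv/netbackup/bin/admincmd/bperror -U -backstat -hoursago 24
> 
>     # full per-job detail, for matching failed jobs to a storage unit
>     /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -all_columns
> 
> Everything that failed with a 196 should stand out there.)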
> 
> Now correct me if I'm wrong, but under ideal circumstances, 
> the following should happen as backup windows open and 
> a schedule starts:
> 
>     Client jobs are assigned to available drives (per policy 
> or global configuration), and the division of work is done on a 
> per-client basis rather than per-job (so clientA:job1 -> drive1 
> and clientA:job2 -> drive2 doesn't occur).
> 
>     As client jobs are completed, any available drive should 
> pick up the backlog for any other drive(?). For example:
> 
>                 Job queue per drive
>       Drive 1:                    Drive 2:
>     ClientA:job1                ClientB:job1
>     ClientA:job2                ClientB:job2
>     ClientC:job1                ClientD:job1
>     ClientC:job2                ClientD:job2
>     ClientC:job3                ClientE:job1
> 
>       If Drive-1 clears its jobs while Drive-2 is still 
> working on ClientB, Drive-1 should pick up ClientE, and 
> possibly ClientD, right?
> 
>     This doesn't seem to be happening, and I'm curious as to 
> why. I did see 'Jerry's' (though he signs his email as 
> Brian) post yesterday about technote #274544 (or #274559 for 
> 5.0 folks), and the related technote #237534. However, even 
> after attempting the workarounds suggested in the technote and 
> specifying the storage unit in the policy (we only have one 
> anyway), I'm still getting 196's. We don't have a large 
> volume DB either, with only 100 tapes.
> 
>     Can anyone shed some light here? I've included some 
> specifics on the policies and clients below.. I worry that 
> rmt/1 is failing.. And the darn thing just got out of 
> warranty last month to boot (of course!). I've gone ahead and 
> opened a service request with Veritas, but .. Well you know 
> how long getting anything useful out of them can be (took me 
> a month to get a 5.1 media kit!).
> 
>     System info:
>         Media Server / Master server are same system.
>         SunFire V240, Solaris 9, current recommended patch 
> set as of 02/05
>         NBU Enterprise 5.0 MP4
>         Overland Neo 2000 Storage, 26-slot / 2-Drive DLT library.
> 
>     # of clients: 32
>     Clients are Solaris 9 systems, 5.0 MP4 client software.
>     Client file list: ALL_LOCAL_DRIVES
>     No extra directives in bp.conf
> 
>     Policy configuration (CDC-revised):
>         Type:   Standard
>         Storage Unit:       backup-dlt2-robot-tld-0
>         Volume Pool:        NetBackup (overridden per schedule)
>         Checkpoints:        15-minutes
>         Limit Jobs:         not Set
>         Priority:           0
>         Follow NFS:         Not Set
>         Cross Mount Pts:    Yes
>         Collect TIR:        Yes with Move
>         Compression:        Yes
>         Multiple Streams:   Yes
>         No Advanced client settings
> 
>         Schedule: Daily-Differential
>             Calendar based:      Mo, We, Fr (18:00 - 06:00)
>             Policy Pool:         Daily
>             Retention:           2 weeks
>             Multiplexing:        1
> 
>         Schedule: Daily-Cumulative
>             Calendar based:     Sa, Tu, Th (18:00 - 06:00)
>             Policy Pool:        Daily
>             Retention:          2 weeks
>             Multiplexing:       1
> 
>         Schedule: Weekly
>             Calendar based:     Su (00:00 - 23:59 window)
>               Retries:          Yes
>             Multiple Copies:
>                 #1 - Pool: Weekly-Short, Retention 2-weeks
>                 #2 - Pool: Weekly-Offsite, Retention 1-month
>             Multiplexing:       1
> 
>         Schedule: Monthly
>             Calendar based:     1st of every month
>                                 (M-F 18:00-06:00, Sa/Su 00:00-23:59)
>                 Retries:        Yes
>             Multiple Copies:
>                 #1 - Pool: Monthly-short, Retention 2 months
>                 #2 - Pool: Monthly-Offsite, retention 6 months
>             Multiplexing:       1
> 
> 
> On 3/23/05 1:32 PM, "Jeff McCombs" <jeffm AT nicusa DOT com> wrote:
> 
> > Yeah, I originally thought that this might be a network problem 
> > myself. However, I have checked the network settings on the Sun 
> > systems and the Cisco switches in between. I'm even forcing 100FDX 
> > on the switch and system just to be safe (auto-negotiation never 
> > works, regardless of what the vendors say).
> >
> > Seems that this is an MPX thing. I did some further testing, 
> > backing up systems without multiplexing enabled, and the problem 
> > goes away. The rmt/1 device stops sitting at 100% busy with 0 kw/s, 
> > and client full backups drop back down into the 15-minute range...
> >
> >
> >
> >
> > On 3/23/05 10:56 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com> wrote:
> >
> >> Jeff:
> >>
> >> A few things to consider (assuming a Sun server as the NBU master):
> >>
> >> 1.) Are you aware of anything that has changed on your NBU server?
> >> 2.) Are you aware of anything that has changed with your network? 
> >> (Providing you are doing Ethernet-based backups. If not, what 
> >> about the SAN?)
> >> 3.) Are you aware of any changes to the policies?
> >>
> >> If no to the above try the following:
> >>
> >> 1.) Find out what Veritas recommends for your environment for 
> >> these two variables:
> >> NUMBER_DATA_BUFFERS
> >> SIZE_DATA_BUFFERS
> >> These are found in /usr/openv/netbackup/db/config. They may not 
> >> give them to you if you open a ticket with the solution center 
> >> (Professional Services); ask around if they do not. A rough 
> >> sketch of checking and setting them follows after this list.
> >>
> >> 2.) Check the network driver settings for a few things. This 
> >> depends on the network type you are using (100Mb-switched, 
> >> 10Mb-switched, etc.).
> >>
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_autoneg_cap
> >> 1
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_100hdx_cap
> >> 1
> >> root[prod-backup:/]# ndd -get /dev/qfe adv_100fdx_cap
> >> 1
> >> The output above shows that the qfe driver is advertising 100 
> >> half duplex, 100 full duplex, and autonegotiation. Once you know 
> >> how the network driver is configured, ask your network guys how 
> >> the port on the switch is configured (unless you are the 
> >> network guy). If the port is NOT set to 100-full or 
> >> autonegotiate, have them set it accordingly.
> >>
> >> 3.) Reseat the RJ-45 connectors for the physical connections.
> >>
> >> These are some things that have bit us in the past.
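> >>
> >> (The sketch promised under item 1 - the file names are the real 
> >> ones NBU looks for, but the values below are only common starting 
> >> points, not a recommendation for your particular hardware:
> >>
> >>     cd /usr/openv/netbackup/db/config
> >>     cat NUMBER_DATA_BUFFERS SIZE_DATA_BUFFERS   # current values, if the files exist
> >>     echo 16     > NUMBER_DATA_BUFFERS           # buffers per drive
> >>     echo 262144 > SIZE_DATA_BUFFERS             # 256KB per buffer
> >>
> >> New backup jobs pick the values up on their next run, and the bptm 
> >> log shows the buffer size actually in use.)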
> >>
> >> Good luck,
> >>
> >> Bill
> >>
> >> --------------------------------------------------------
> >>      Bill Jorgensen
> >>      CSG Systems, Inc.
> >>      (w) 303.200.3282
> >>      (p) 303.947.9733
> >> --------------------------------------------------------
> >>      UNIX... Spoken with hushed and
> >>      reverent tones.
> >> --------------------------------------------------------
> >>
> >> -----Original Message-----
> >> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
> >> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff 
> >> McCombs
> >> Sent: Wednesday, March 23, 2005 6:51 AM
> >> To: veritas-bu AT mailman.eng.auburn DOT edu
> >> Subject: [Veritas-bu] Backups slow to a crawl
> >>
> >> Gurus,
> >>
> >>     NB 5.0 MP4, single combination media/master server, Solaris 9. 
> >> Overland Neo 2000 26-slot 2 drive DLT.
> >>
> >>     I'm noticing that for some reason or another, all of my client 
> >> backups have slowed to a _crawl_. A _cumulative_ (!) backup of 
> >> local disk on a Sun V100 is taking somewhere on the order of 2 
> >> hours at this point, and with over 40 systems, I'm blowing past 
> >> my window consistently.
> >>
> >>     I'm not quite sure what's going on here, but as I sit 
> >> and watch the output from 'iostat', I'm noticing that rmt/1 
> >> (the 2nd drive in the Neo) is fluctuating between 100% busy 
> >> with kw/s close to zero, and 1-15% busy with kw/s up into 
> >> the 1000's.
> >>
> >>     rmt/0 seems to be fine, kw/s sits consistently up in the 
> >> 1.8-2K range, while busy is anywhere from 2% - 30% on average. 
> >> My other disks aren't working hard, CPU isn't loaded and I've 
> >> got plenty of memory.
> >>
> >>     The policy I'm using allows for multiple datastreams, no 
> >> limits on jobs, and most schedules allow for an MPX of 2. I'm 
> >> backing up ALL_LOCAL_DRIVES on all clients, and I'm not using 
> >> any NEW_STREAM directives. I'm not seeing any errors on the 
> >> media either.
> >>
> >>     Can anyone shed some light on what might be happening here? 
> >> Am I looking at a drive that might be having some problems, or 
> >> am I barking up the wrong tree, and it's something else entirely?
> >>
> >>     A small sample of iostat output covering the affected 
> >> devices is below.
> >>
> >> Sample (extra disks removed from output):
> >> root@backup(pts/1):~# iostat -nx 1 100
> >>                     extended device statistics
> >>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
> >>     0.0    4.1    0.0  252.2  0.0  0.0    0.0    5.9   0   2 rmt/0
> >>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
> >>
> >>     0.0    4.1    0.0  252.3  0.0  0.0    0.0    5.9   0   2 rmt/0
> >>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
> >>
> >>     0.0   33.0    0.0 2076.4  0.0  0.2    0.0    5.8   0  19 rmt/0
> >>     0.0    2.0    0.0  125.8  0.0  1.0    0.0  490.0   0  98 rmt/1
> >>
> >>     0.0   38.0    0.0 2394.0  0.0  0.2    0.0    5.4   0  21 rmt/0
> >>     0.0    8.0    0.0  504.0  0.0  1.0    0.0  124.9   0 100 rmt/1
> >>
> >>     0.0   27.0    0.0 1701.1  0.0  0.2    0.0    6.5   0  17 rmt/0
> >>     0.0    2.0    0.0  126.0  0.0  1.0    0.0  499.9   0 100 rmt/1
> >>
> >>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
> >>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
> >>
> >>     0.0   16.0    0.0 1008.0  0.0  0.1    0.0    6.2   0  10 rmt/0
> >>     0.0   13.0    0.0  819.0  0.0  0.6    0.0   48.4   0  63 rmt/1
> >>
> >>     0.0   40.0    0.0 2520.1  0.0  0.2    0.0    5.9   0  24 rmt/0
> >>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
> >>
> >>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
> >>     0.0   10.0    0.0  630.0  0.0  1.0    0.0   99.9   0 100 rmt/1
> >>
> 
> --
> Jeff McCombs              |  NIC, Inc
> Systems Administrator     |  http://www.nicusa.com
> jeffm AT nicusa DOT com    |  NASDAQ: EGOV
> Phone: (703) 909-3277     |  "NIC - the People Behind eGovernment"
> --
> What do you do for endangered animals that only eat endangered plants?
> 
> 
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu 
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>