Subject: [Veritas-bu] Backups slow to a crawl
From: jeffm AT nicusa DOT com (Jeff McCombs)
Date: Thu, 24 Mar 2005 10:21:55 -0500
Ok. I lied. Removing multiplexing did not fix the problem.

It's strange; I _know_ my network is clean, and I know my backup policies should
be fine...

I'm still concerned about the busy percentage of rmt/1 vs. rmt/0.

Just to refresh for new readers, my backups are failing for some clients due
to a status 196 (backup window closed). These are small systems, without a lot
of data on them. It doesn't seem to be related to the backup type, MPX, or
streams setting. For example, our jumpstart system took 9 hours to back up
22G, averaging 672 KB/sec, whereas our development database server
backed up 24G in 3.5 hours at an average of 1743 KB/sec (though its number of
files was almost half that of the jumpstart system, which may have an
impact).

In trying to troubleshoot, I watched the system's I/O performance using
'iostat' and noticed that /dev/rmt/1, the 2nd drive in our library (Overland
Neo 2000), appears to be having some problems sending data to tape. The
%-busy on the drive shoots up to 100% as kw/s (kbytes written/sec) drops
drastically, down into the 200-300 range.

/dev/rmt/0 has no problems during the same time period. Its %-busy sits
anywhere from 2 to 30%, and kw/s is in the 1.2K to 2.5K range.
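
(For anyone who wants to watch the same thing, a quick way to see just the
tape drives is iostat filtered down to the rmt devices; the interval below is
arbitrary:

    # iostat -xn 5 | egrep 'device|rmt/'

That's the same data as the full sample at the bottom of this thread, just
with less noise.)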

The only correlation I can find is that the systems failing their backups with
a 196 status are the ones that were queued to rmt/1. Systems queued to rmt/0
back up fine, and those backups usually complete in 15 minutes or so.

Now correct me if I'm wrong, but under ideal circumstances, the following
should happen as backup windows open and a schedule starts:

    Client jobs are assigned to available drives (per policy or global
configuration); the division of work is done on a per-client basis and not
per job (so clientA:job1 -> drive1 and clientA:job2 -> drive2 doesn't occur).

    As client jobs are completed, any available drive should pick up the
backlog for any other drive(?). For example:

                Job queue per drive
      Drive 1:                    Drive 2:
    ClientA:job1                ClientB:job1
    ClientA:job2                ClientB:job2
    ClientC:job1                ClientD:job1
    ClientC:job2                ClientD:job2
    ClientC:job3                ClientE:job1

      If Drive-1 clears its jobs while Drive-2 is still working on
ClientB, Drive-1 should pick up ClientE, and possibly ClientD, right?
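
    (In case it helps to see exactly what I'm looking at, jobs can be
correlated to drives with the stock commands below; the paths are the
standard 5.0 layout, so adjust if yours differ:

        # /usr/openv/netbackup/bin/admincmd/bpdbjobs -report
        # /usr/openv/volmgr/bin/vmoprcmd -d

    bpdbjobs lists the active and queued jobs, and vmoprcmd -d shows what is
mounted in each drive, so between the two you can see which clients ended up
on rmt/0 vs. rmt/1.)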

    This doesn't seem to be happening, and I'm curious as to why... I did see
'Jerry's' (though he signs his email as Brian) post yesterday about technote
#274544 (or #274559 for 5.0 folks), and the related technote #237534.
However, even after attempting the workarounds suggested in the technote and
specifying the storage unit in the policy (we only have one anyway), I'm
still getting 196's. We don't have a large volume DB either, with only 100
tapes.
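
    (For reference, the storage-unit setting the policy is actually using can
be confirmed from the command line as well; bppllist dumps it as the
'Residence' line. The policy name below is just a placeholder:

        # /usr/openv/netbackup/bin/admincmd/bppllist <policy_name> -U | grep -i residence

    which should come back with backup-dlt2-robot-tld-0, matching the policy
config further down.)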

    Can anyone shed some light here? I've included some specifics on the
policies and clients below... I worry that rmt/1 is failing, and the darn
thing just got out of warranty last month to boot (of course!). I've gone
ahead and opened a service request with Veritas, but... well, you know how
long getting anything useful out of them can take (it took me a month to get
a 5.1 media kit!).
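
    (While I wait on Veritas, the only sanity checks I can think of for the
drive itself are along these lines, run while the drive is idle; nothing
fancy, just looking for SCSI or st-driver complaints:

        # mt -f /dev/rmt/1 status
        # egrep -i 'rmt|scsi' /var/adm/messages | tail -20

    If anyone has a better way of proving a DLT drive is going south, short
of swapping it out, I'm all ears.)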

    System info:
        Media Server / Master server are same system.
        SunFire V240, Solaris 9, current recommended patch set as of 02/05
        NBU Enterprise 5.0 MP4
        Overland Neo 2000 Storage, 26-slot / 2-Drive DLT library.
    
    # of clients: 32
    Clients are Solaris 9 systems, 5.0 MP4 client software.
    Client file list: ALL_LOCAL_DRIVES
    No extra directives in bp.conf

    Policy configuration (CDC-revised):
        Type:   Standard
        Storage Unit:       backup-dlt2-robot-tld-0
        Volume Pool:        NetBackup (overridden per schedule)
        Checkpoints:        15-minutes
        Limit Jobs:         Not Set
        Priority:           0
        Follow NFS:         Not Set
        Cross Mount Pts:    Yes
        Collect TIR:        Yes with Move
        Compression:        Yes
        Multiple Streams:   Yes
        No Advanced client settings

        Schedule: Daily-Differential
            Calendar based:      Mo, We, Fr (18:00 - 06:00)
            Policy Pool:         Daily
            Retention:           2 weeks
            Multiplexing:        1

        Schedule: Daily-Cumulative
            Calendar based:     Sa, Tu, Th (18:00 - 06:00)
            Policy Pool:        Daily
            Retention:          2 weeks
            Multiplexing:       1

        Schedule: Weekly
            Calendar based:     Su (00:00 - 23:59 window)
              Retries:          Yes
            Multiple Copies:
                #1 - Pool: Weekly-Short, Retention 2-weeks
                #2 - Pool: Weekly-Offsite, Retention 1-month
            Multiplexing:       1

        Schedule: Monthly
            Calendar based:     1st of every month
                                (M-F 18:00-06:00, Sa/Su 00:00-23:59)
                Retries:        Yes
            Multiple Copies:
                #1 - Pool: Monthly-short, Retention 2 months
                #2 - Pool: Monthly-Offsite, retention 6 months
            Multiplexing:       1
            

On 3/23/05 1:32 PM, "Jeff McCombs" <jeffm AT nicusa DOT com> wrote:

> Yeah, I originally thought that this might be a network problem myself.
> However, I have checked the network settings on the Sun systems and the Cisco
> switches in between. I'm even forcing 100FDX on the switch and system just
> to be safe (auto-negotiation never works, regardless of what the vendors
> say).
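> 
> (For the record, "forcing" on the Solaris side means ndd settings along
> these lines; the driver name and instance are only an example, adjust for
> the actual NIC:
> 
>     ndd -set /dev/qfe instance 0
>     ndd -set /dev/qfe adv_autoneg_cap 0
>     ndd -set /dev/qfe adv_100fdx_cap 1
>     ndd -set /dev/qfe adv_100hdx_cap 0
> 
> plus the matching 100-full hard-coded on the Cisco port.)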
> 
> Seems that this is an MPX thing. I did some further testing, backing up
> systems without multiplexing enabled, and the problem goes away. The rmt/1
> device stops sitting at 100% busy with 0 kw/s, and client full backups drop
> back down into the 15-minute range...
> 
> 
> 
> 
> On 3/23/05 10:56 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
> wrote:
> 
>> Jeff:
>> 
>> A few things to consider (assuming a Sun server as the NBU master):
>> 
>> 1.) Are you aware of anything that has changed on your NBU server?
>> 2.) Are you aware of anything that has changed with your network?
>> (Providing you are doing Ethernet-based backups. If not, what about the
>> SAN?)
>> 3.) Are you aware of any changes to the policies?
>> 
>> If no to the above, try the following:
>> 
>> 1.) Find out what Veritas recommends for your environment for these two
>> variables:
>> NUMBER_DATA_BUFFERS
>> SIZE_DATA_BUFFERS
>> These are found in /usr/openv/netbackup/db/config. Veritas may not give
>> them to you if you open a ticket with the solution center (Professional
>> Services); ask around if they do not.
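>> 
>> (These are just single-value touch files, so once you know the recommended
>> numbers, setting them is e.g.:
>> 
>>     echo 16     > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
>>     echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
>> 
>> The 16 and 262144 above are only illustrative values; use whatever Veritas
>> recommends for DLT in your environment.)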
>> 
>> 2.) Check the network driver settings for a few things. This depends on
>> the network type you are using. 100Mb-switched, 10Mb-switched, etc.
>> 
>> root[prod-backup:/]# ndd -get /dev/qfe adv_autoneg_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100hdx_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100fdx_cap
>> 1
>> What the output above says is that the qfe driver is set to advertise 100
>> half duplex, 100 full duplex, and autonegotiation. Once you know how the
>> network driver is configured, go to your network guys and ask them how the
>> port on the switch is configured (unless you are the network guy). If the
>> port is NOT set to 100-full or autonegotiate, have them set it accordingly.
>> 
>> 3.) Reseat the RJ-45 connectors for the physical connections.
>> 
>> These are some things that have bitten us in the past.
>> 
>> Good luck,
>> 
>> Bill
>> 
>> --------------------------------------------------------
>>      Bill Jorgensen
>>      CSG Systems, Inc.
>>      (w) 303.200.3282
>>      (p) 303.947.9733
>> --------------------------------------------------------
>>      UNIX... Spoken with hushed and
>>      reverent tones.
>> --------------------------------------------------------
>> 
>> -----Original Message-----
>> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
>> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff
>> McCombs
>> Sent: Wednesday, March 23, 2005 6:51 AM
>> To: veritas-bu AT mailman.eng.auburn DOT edu
>> Subject: [Veritas-bu] Backups slow to a crawl
>> 
>> Gurus,
>> 
>>     NB 5.0 MP4, single combination media/master server, Solaris 9.
>> Overland Neo 2000 26-slot, 2-drive DLT.
>> 
>>     I'm noticing that for some reason or another, all of my client backups
>> have slowed to a _crawl_. A _cumulative_ (!) backup of local disk on a Sun
>> V100 is taking somewhere on the order of 2 hours at this point, and with
>> over 40 systems, I'm blowing past my window consistently.
>> 
>>     I'm not quite sure what's going on here, but as I sit and watch the
>> output from 'iostat', I'm noticing that rmt/1 (the 2nd drive in the Neo)
>> is fluctuating between 100% busy with kw/s at close to zero, and busy at
>> 1-15% with kw/s up into the 1000's.
>> 
>>     rmt/0 seems to be fine; kw/s sits consistently up in the 1.8-2K range,
>> while busy is anywhere from 2% - 30% on average. My other disks aren't
>> working hard, CPU isn't loaded, and I've got plenty of memory.
>> 
>>     The policy I'm using allows for multiple datastreams, no limits on
>> jobs, and most schedules allow for an MPX of 2. I'm backing up
>> ALL_LOCAL_DRIVES on all clients, and I'm not using any NEW_STREAM
>> directives. I'm not seeing any errors on the media either.
>> 
>>     Can anyone shed some light on what might be happening here? Am I
>> looking at a drive that might be having some problems, or am I barking up
>> the wrong tree, and it's something else entirely?
>> 
>>     A small sample of iostat output covering the affected devices is below.
>> 
>> Sample (extra disks removed from output):
>> root@backup(pts/1):~# iostat -nx 1 100
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>     0.0    4.1    0.0  252.2  0.0  0.0    0.0    5.9   0   2 rmt/0
>>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
>> 
>>     0.0    4.1    0.0  252.3  0.0  0.0    0.0    5.9   0   2 rmt/0
>>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
>> 
>>     0.0   33.0    0.0 2076.4  0.0  0.2    0.0    5.8   0  19 rmt/0
>>     0.0    2.0    0.0  125.8  0.0  1.0    0.0  490.0   0  98 rmt/1
>> 
>>     0.0   38.0    0.0 2394.0  0.0  0.2    0.0    5.4   0  21 rmt/0
>>     0.0    8.0    0.0  504.0  0.0  1.0    0.0  124.9   0 100 rmt/1
>> 
>>     0.0   27.0    0.0 1701.1  0.0  0.2    0.0    6.5   0  17 rmt/0
>>     0.0    2.0    0.0  126.0  0.0  1.0    0.0  499.9   0 100 rmt/1
>> 
>>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
>>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
>> 
>>     0.0   16.0    0.0 1008.0  0.0  0.1    0.0    6.2   0  10 rmt/0
>>     0.0   13.0    0.0  819.0  0.0  0.6    0.0   48.4   0  63 rmt/1
>> 
>>     0.0   40.0    0.0 2520.1  0.0  0.2    0.0    5.9   0  24 rmt/0
>>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
>> 
>>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
>>     0.0   10.0    0.0  630.0  0.0  1.0    0.0   99.9   0 100 rmt/1
>> 

-- 
Jeff McCombs                 |                                    NIC, Inc
Systems Administrator        |                       http://www.nicusa.com
jeffm AT nicusa DOT com             |                                NASDAQ: EGOV
Phone: (703) 909-3277        |        "NIC - the People Behind eGovernment"
--
What do you do for endangered animals that only eat endangered plants?