[Veritas-bu] Backups slow to a crawl

Subject: [Veritas-bu] Backups slow to a crawl
From: Bill_Jorgensen AT csgsystems DOT com (Jorgensen, Bill)
Date: Thu, 24 Mar 2005 09:26:43 -0700
Jeff:

Replacement may be the way to go. Use iostat with -e and see what you get.
Also, look in /usr/openv/netbackup/logs/bptm and see whether the tape manager
is logging any errors for tapes while they are mounted in that drive.
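
Something along these lines, as a rough sketch (the log.mmddyy file name
assumes the default legacy logging convention and that the bptm log directory
already existed when the backups ran; adjust the date to the night in
question):

     # Error columns (soft, hard, transport) per device, with rmt/N names
     iostat -en 5

     # Tape manager debug log for a given day; bptm only writes it if the
     # directory has been created
     mkdir -p /usr/openv/netbackup/logs/bptm
     grep -i error /usr/openv/netbackup/logs/bptm/log.032405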

Bill

--------------------------------------------------------
     Bill Jorgensen
     CSG Systems, Inc.
     (w) 303.200.3282
     (p) 303.947.9733
--------------------------------------------------------
     UNIX... Spoken with hushed and
     reverent tones.
--------------------------------------------------------

-----Original Message-----
From: veritas-bu-admin AT mailman.eng.auburn DOT edu
[mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of
Chris.Romano AT Lazard DOT com
Sent: Thursday, March 24, 2005 8:56 AM
To: Jeff McCombs
Cc: veritas-bu AT mailman.eng.auburn DOT edu;
veritas-bu-admin AT mailman.eng.auburn DOT edu
Subject: Re: [Veritas-bu] Backups slow to a crawl


I had a similar problem... each morning I would get in the office and see
rmt2 still doing its last few backups while the other 3 tape drives had
finished. The problem turned out to be the drive... Quantum swapped it out
with a new one and the problem was solved.

Even though rmt2 was backing things up, it was operating at a crawl due to
I/O errors and retries. The interesting thing was, no errors were showing in
/var/adm/messages.

Quantum could see the errors when they connected directly to the library
with their PC.


Chris.

From: "Jeff McCombs" <jeffm AT nicusa DOT com>
Sent by: veritas-bu-admin AT mailman.eng.auburn DOT edu
To: veritas-bu AT mailman.eng.auburn DOT edu
Date: 24 Mar 2005 10:21 AM
Subject: Re: [Veritas-bu] Backups slow to a crawl

Ok. I lied. Removing multiplexing did not fix the problem.

It's strange; I _know_ my network is clean, and I know my backup policies
should be fine...

I'm still concerned about the busy percentage of rmt/1 vs. rmt/0.

Just to refresh for new readers, my backups are failing for some clients due
to a status 196 (window closed). These are small systems, without a lot of
data on them. It doesn't seem to be related to the backup type, MPX, or
streams setting. For example, our jumpstart system took 9 hours to back up
22G, averaging 672Kb/sec, whereas our development database server backed up
24G in 3.5 hours at an average speed of 1743K/sec (though the number of files
was almost half that of the jumpstart system, which may have an impact).

In trying to troubleshoot, I watched the system's I/O performance using
'iostat' and noticed that /dev/rmt/1, the 2nd drive in our library (Overland
Neo 2000), appears to be having trouble sending data to tape. I noticed that
the %-busy on the drive shoots up to 100% as kw/s (kbytes written/sec) drops
drastically, down into the 200-300 range.

/dev/rmt/0 has no problems during the same time period. %-busy sits anywhere
from 2 to 30%, and kw/s is in the 1.2 to 2.5K range.
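
For anyone following along, roughly what I'm watching (a sketch; the device
names are from this box, and the egrep just keeps the header line and the two
tape drives):

    root@backup(pts/1):~# iostat -xn 5 720 | egrep 'device|rmt/'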

The only correlation I can find among the systems failing backups with a 196
status is that they were queued to rmt/1. Systems queued to rmt/0 back up
fine, and usually their backups complete in 15 minutes or so.

Now correct me if I'm wrong, but under ideal circumstances, the following
should happen as backup windows open and a schedule starts:

    Client jobs are assigned to available drives (per policy or global
configuration). Division of work is done on a per-client basis and not a
per-job one (so clientA:job1 -> drive1 and clientA:job2 -> drive2 doesn't
occur).

    As client jobs are completed, any available drive should pick up the
backlog for any other drive(?). For example:

                Job queue per drive
      Drive 1:                    Drive 2:
    ClientA:job1                ClientB:job1
    ClientA:job2                ClientB:job2
    ClientC:job1                ClientD:job1
    ClientC:job2                ClientD:job2
    ClientC:job3                ClientE:job1

      If Drive-1 clears its jobs while Drive-2 is still working on ClientB,
Drive-1 should pick up ClientE, and possibly ClientD, right?

    This doesn't seem to be happening, and I'm curious as to why. I did see
'Jerry's' (though he signs his email as Brian) post yesterday about technote
#274544 (or #274559 for 5.0 folks), and the related #237534 technotes.
However, even after attempting the workarounds suggested in the technote and
specifying the storage unit in the policy (we only have one anyway), I'm
still getting 196's. We don't have a large volume DB either, with only 100
tapes.

    Can anyone shed some light here? I've included some specifics on the
policies and clients below. I worry that rmt/1 is failing... and the darn
thing just got out of warranty last month, to boot (of course!). I've gone
ahead and opened a service request with Veritas, but... well, you know how
long getting anything useful out of them can be (it took me a month to get a
5.1 media kit!).

    System info:
        Media Server / Master server are same system.
        SunFire V240, Solaris 9, current recommended patch set as of
02/05
        NBU Enterprise 5.0 MP4
        Overland Neo 2000 Storage, 26-slot / 2-Drive DLT library.

    # of clients: 32
    Clients are Solaris 9 systems, 5.0 MP4 client software.
    Client file list: ALL_LOCAL_DRIVES
    No extra directives in bp.conf

    Policy configuration (CDC-revised):
        Type:   Standard
        Storage Unit:       backup-dlt2-robot-tld-0
        Volume Pool:        NetBackup (overridden per schedule)
        Checkpoints:        15-minutes
        Limit Jobs:         not Set
        Priority:           0
        Follow NFS:         Not Set
        Cross Mount Pts:    Yes
        Collect TIR:        Yes with Move
        Compression:        Yes
        Multiple Streams:   Yes
        No Advanced client settings

        Schedule: Daily-Differential
            Calendar based:      Mo, We, Fr (18:00 - 06:00)
            Policy Pool:         Daily
            Retention:           2 weeks
            Multiplexing:        1

        Schedule: Daily-Cumulative
            Calendar based:     Sa, Tu, Th (18:00 - 06:00)
            Policy Pool:        Daily
            Retention:          2 weeks
            Multiplexing:       1

        Schedule: Weekly
            Calendar based:     Su (00:00 - 23:59 window)
              Retries:          Yes
            Multiple Copies:
                #1 - Pool: Weekly-Short, Retention 2-weeks
                #2 - Pool: Weekly-Offsite, Retention 1-month
            Multiplexing:       1

        Schedule: Monthly
            Calendar based:     1st of every month
                                (M-F 18:00-06:00, Sa/Su 00:00-23:59)
                Retries:        Yes
            Multiple Copies:
                #1 - Pool: Monthly-short, Retention 2 months
                #2 - Pool: Monthly-Offsite, retention 6 months
            Multiplexing:       1


On 3/23/05 1:32 PM, "Jeff McCombs" <jeffm AT nicusa DOT com> wrote:

> Yeah, I originally thought that this might be a network problem myself.
> However, I have checked the network settings on the Sun systems and the
> Cisco switches in-between. I'm even forcing 100FDX on the switch and system
> just to be safe (auto-negotiation never works, regardless of what the
> vendors say).
>
> Seems that this is an MPX thing. I did some further testing, backing up
> systems without multiplexing enabled, and the problem goes away. The rmt/1
> device stops with the 100% busy and 0 kw/s; client full backups drop back
> down into the 15-minute range...
>
>
>
>
> On 3/23/05 10:56 AM, "Jorgensen, Bill" <Bill_Jorgensen AT csgsystems DOT com>
> wrote:
>
>> Jeff:
>>
>> A few things to consider (assuming a Sun server as the NBU master):
>>
>> 1.) Are you aware of anything that has changed on your NBU server?
>> 2.) Are you aware of anything that has changed with your network?
>> (Providing you are doing Ethernet-based backups. If not, what about the
>> SAN?)
>> 3.) Are you aware of any changes to the policies?
>>
>> If no to the above try the following:
>>
>> 1.) Find out what Veritas recommends for your environment for these two
>> variables:
>> NUMBER_DATA_BUFFERS
>> SIZE_DATA_BUFFERS
>> These are found in /usr/openv/netbackup/db/config. They may not give
>> them to you if you open a ticket with the solution center (Professional
>> Services). Ask around if they do not.
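>>
>> For example (a rough illustration only -- the right values depend on what
>> Veritas recommends for your drives and HBA; 256KB buffers are a common
>> starting point for DLT):
>>
>> root[prod-backup:/]# echo 262144 > /usr/openv/netbackup/db/config/SIZE_DATA_BUFFERS
>> root[prod-backup:/]# echo 16 > /usr/openv/netbackup/db/config/NUMBER_DATA_BUFFERS
>>
>> The new values take effect for backup jobs started after the files are in
>> place.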
>>
>> 2.) Check the network driver settings for a few things. This depends on
>> the network type you are using: 100Mb-switched, 10Mb-switched, etc.
>>
>> root[prod-backup:/]# ndd -get /dev/qfe adv_autoneg_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100hdx_cap
>> 1
>> root[prod-backup:/]# ndd -get /dev/qfe adv_100fdx_cap
>> 1
>> What the output above is stating is that the qfe driver is set at 100
>> half and full duplex, and autonegotiate. Once you know how the network
>> driver is configured, go to your network guys and ask them how the port
>> on the switch is configured (unless you are the network guy). If the port
>> is NOT set to 100-full or autonegotiate, have them set it accordingly.
>>
>> 3.) Reseat the RJ-45 connectors for the physical connections.
>>
>> These are some things that have bit us in the past.
>>
>> Good luck,
>>
>> Bill
>>
>> --------------------------------------------------------
>>      Bill Jorgensen
>>      CSG Systems, Inc.
>>      (w) 303.200.3282
>>      (p) 303.947.9733
>> --------------------------------------------------------
>>      UNIX... Spoken with hushed and
>>      reverent tones.
>> --------------------------------------------------------
>>
>> -----Original Message-----
>> From: veritas-bu-admin AT mailman.eng.auburn DOT edu
>> [mailto:veritas-bu-admin AT mailman.eng.auburn DOT edu] On Behalf Of Jeff
>> McCombs
>> Sent: Wednesday, March 23, 2005 6:51 AM
>> To: veritas-bu AT mailman.eng.auburn DOT edu
>> Subject: [Veritas-bu] Backups slow to a crawl
>>
>> Gurus,
>>
>>     NB 5.0 MP4, single combination media/master server, Solaris 9.
>> Overland
>> Neo 2000 26-slot 2 drive DLT.
>>
>>     I'm noticing that for some reason or another, all of my client backups
>> have slowed to a _crawl_. A _cumulative_ (!) backup of local disk on a Sun
>> V100 is taking somewhere on the order of 2 hours at this point, and with
>> over 40 systems, I'm blowing past my window consistently.
>>
>>     I'm not quite sure what's going on here, but as I sit and watch the
>> output from 'iostat', I'm noticing that rmt/1 (the 2nd drive in the Neo) is
>> fluctuating between 100% busy with kw/s close to zero, and 1-15% busy with
>> kw/s up into the 1000's.
>>
>>     rmt/0 seems to be fine; kw/s sits consistently up in the 1.8-2K range,
>> while busy is anywhere from 2%-30% on average. My other disks aren't
>> working hard, CPU isn't loaded, and I've got plenty of memory.
>>
>>     The policy I'm using allows for multiple datastreams, no limits on
>> jobs, and most schedules allow for an MPX of 2. I'm backing up
>> ALL_LOCAL_DRIVES on all clients, and I'm not using any NEW_STREAM
>> directives. I'm not seeing any errors on the media either.
>>
>>     Can anyone shed some light on what might be happening here? Am I
>> looking at a drive that might be having some problems, or am I barking up
>> the wrong tree, and it's something else entirely?
>>
>>     A small sample of iostat output covering the affected devices is
>> below (extra disks removed from the output):
>> root@backup(pts/1):~# iostat -nx 1 100
>>                     extended device statistics
>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>>     0.0    4.1    0.0  252.2  0.0  0.0    0.0    5.9   0   2 rmt/0
>>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
>>
>>     0.0    4.1    0.0  252.3  0.0  0.0    0.0    5.9   0   2 rmt/0
>>     0.0    4.6    0.0  278.4  0.0  0.1    0.0   27.3   0  12 rmt/1
>>
>>     0.0   33.0    0.0 2076.4  0.0  0.2    0.0    5.8   0  19 rmt/0
>>     0.0    2.0    0.0  125.8  0.0  1.0    0.0  490.0   0  98 rmt/1
>>
>>     0.0   38.0    0.0 2394.0  0.0  0.2    0.0    5.4   0  21 rmt/0
>>     0.0    8.0    0.0  504.0  0.0  1.0    0.0  124.9   0 100 rmt/1
>>
>>     0.0   27.0    0.0 1701.1  0.0  0.2    0.0    6.5   0  17 rmt/0
>>     0.0    2.0    0.0  126.0  0.0  1.0    0.0  499.9   0 100 rmt/1
>>
>>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
>>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
>>
>>     0.0   16.0    0.0 1008.0  0.0  0.1    0.0    6.2   0  10 rmt/0
>>     0.0   13.0    0.0  819.0  0.0  0.6    0.0   48.4   0  63 rmt/1
>>
>>     0.0   40.0    0.0 2520.1  0.0  0.2    0.0    5.9   0  24 rmt/0
>>     0.0    0.0    0.0    0.0  0.0  1.0    0.0    0.0   0 100 rmt/1
>>
>>     0.0   33.0    0.0 2078.9  0.0  0.2    0.0    5.3   0  18 rmt/0
>>     0.0   10.0    0.0  630.0  0.0  1.0    0.0   99.9   0 100 rmt/1
>>

--
Jeff McCombs                 | NIC, Inc
Systems Administrator        | http://www.nicusa.com
jeffm AT nicusa DOT com       | NASDAQ: EGOV
Phone: (703) 909-3277        | "NIC - the People Behind eGovernment"
--
What do you do for endangered animals that only eat endangered plants?


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu