Re: Troubleshooting a slowdown problem?

On Thu, Jun 17, 2004 at 12:02:29PM -0400, KEVIN ZEMBOWER wrote:
> A couple of months ago, I added a server, centernet, to my
> Amanda backups.  Since that time, about once a week or two, but
> not everyday, the backup runs for 16-20 hours, instead of its
> normal less than 8. It's running right now since last night at
> 8:00pm:

[I've reordered the per-DLE lines, but made no other changes to them]

> 
> admin://db/c$                       0   462720k finished (20:56:41)
> admin://db/e$                       1       10k finished (20:16:16)
> admin://db/f$                       1  2559660k finished (20:51:25)
> admin://db/f$/inetsrv/webpub/images 1       30k finished (20:16:07)
> admin:sda1                          0     3410k finished (20:17:09)
> admin:sda3                          0  3683924k wait for dumping 
> admin:sdb1                          0    24270k finished (20:17:05)
> mailinglists:hda1                   0     2198k finished (20:04:30)
> mailinglists:hda2                   0   399339k finished (23:03:41)
> mailinglists:hda7                   0   743827k finished (4:41:56)

> centernet:sda1                      0     4846k finished (20:06:03)
> centernet:sda2                      0   715715k finished (2:15:47)
> centernet:sda3                      0   110883k finished (20:58:42)
> centernet:sda5                      0  1812031k dumping  1332992k ( 73.56%) 
> (2:02:34)
> centernet:sda6                      1       73k finished (20:03:37)
> centernet:sda7                      0      564k finished (20:04:03)
> centernet:sda9                      0    30391k finished (20:18:59)
>

The times here are indeed times of day.  I believe it's the time
that the DLE entered the state shown.

> [...]
>
> 7 dumpers idle  : no-hold
> taper idle
> network free kps:     26362
> holding space   :  34661948k ( 95.03%)
>  dumper0 busy   :  8:13:03  ( 95.12%)
>  dumper1 busy   :  6:24:10  ( 74.11%)
>  dumper2 busy   :  2:52:40  ( 33.31%)
>    taper busy   :  1:07:38  ( 13.05%)
>  0 dumpers busy :  0:00:00  (  0.00%)
>  1 dumper busy  :  0:13:42  (  2.64%)             no-hold:  0:13:42  (100.00%)
>  2 dumpers busy :  7:57:50  ( 92.18%)  client-constrained:  5:31:58  ( 69.47%)
>                                                   no-hold:  2:25:40  ( 30.49%)
>                                                start-wait:  0:00:11  (  0.04%)
>  3 dumpers busy :  0:26:50  (  5.18%)  client-constrained:  0:26:47  ( 99.77%)
>                                                start-wait:  0:00:03  (  0.23%)
[...]

Unfortunately, the amstatus output doesn't tell us how long each
DLE took to dump.  We'll have to wait for the report email once
the run has finished, unless you want to start digging through
the "amdump" file -- and frankly, it's easier to wait and let
Amanda do the arithmetic :-)  But at this point, my guess would
be that it's centernet:sda2 and centernet:sda5 that are the
problems.

Interesting things here:
  - Most of the DLEs finished before 21:00, i.e. within the first
    hour.  To the basic machinery seems to be working.

  - Large DLEs tend to finish later -- but not always; consider
    admin:://db/f$.  Thus, size isn't the *only* factor.  (E.g.
    it's unlikely that large DLEs are being dumped direct to
    tape; you would seem to have enough holding disk for a 2.5-GB
    DLE, which is way more than centernet:sda5 needs.)

  - All of the small DLEs on centernet finish quickly, so it
    seems not to be the case that centernet's Amanda setup is
    just hosed.

  - The only non-centernet DLEs to finish later than 21:00 are
    also large.  But my first guess would be to put that down to
    lack of holding-disk space; the two large centernet DLE's are
    blocking the others.  Thus, adding more holding disk would
    help some, although, as mentioned above, I *don't* believe it
    would address the underlying problem with centernet.  E.g.
    it would let admin:sda3 start dumping now, instead of waiting
    for centernet:sda5 to finish.

  - The run is currently in "no-hold" state, and has spent almost
    three hours there; that's likely the amount of time (so far)
    that more holding disk would save

> dumpcycle 3             # the number of days in the normal dump cycle
> runspercycle 3          # the number of amdump runs in dumpcycle days

I was going to guess that the problem happens when centernet:sda5
gets a level 0, but a three-day dumpcycle, together with your
observation that the problem happens "about once a week or two",
would seem to rule that out.

As for what's slowing down those large centernet DLEs, we don't
really know much yet.  What's the CPU load like on centernet --
is it CPU-bound doing compression?  Is any non-Amanda stuff
running that might be hogging resources?

How about the network between centernet and the Amanda server?
Try ftping a large file -- preferably while one of those DLEs is
still progressing slowly, so that if there's an intermittent
network problem, you'll have a better chance of catching it
red-handed.

Is there an overloaded firewall in between the two machines?

--

|  | /\
|-_|/  >   Eric Siegerman, Toronto, Ont.        erics AT telepres DOT com
|  |  /
It must be said that they would have sounded better if the singer
wouldn't throw his fellow band members to the ground and toss the
drum kit around during songs.
        - Patrick Lenneau