Frank Smith wrote:
> Tom Robinson wrote:
> >>> Hi,
> >>>
> >>> I'm running amanda (2.6.0p2-1) but have an older client running
> >>> 2.4.2p2-1. On that client the full backup of a 4GB disk takes a very
> >>> long time:
> >>>
> >>> DUMP SUMMARY:
> >>>                                          DUMPER STATS         TAPER STATS
> >>> HOSTNAME  DISK  L  ORIG-KB   OUT-KB  COMP%  MMM:SS  KB/s  MMM:SS    KB/s
> >>> ------------------------------------------------------------------------
> >>> host      /     0  4256790  1819411   42.7  637:22  47.6   26:01  1165.9
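[As a quick sanity check on that summary, the reported DUMPER rate is just OUT-KB divided by the elapsed dumper time, which confirms the dump really did average under 50 KB/s:]

```shell
# Sanity-check the DUMPER KB/s figure from the report above:
# 637:22 is 637 minutes 22 seconds = 38242 s, and
# 1819411 KB / 38242 s comes out to the reported 47.6 KB/s.
awk 'BEGIN { secs = 637*60 + 22; printf "%.1f KB/s\n", 1819411 / secs }'
```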
> >>>
> >>> I'm not sure where to start looking for this bottle-neck.
> >>>
> >>> Any clues would be appreciated.
> bump
>
> Try looking on the client while the backup is running. Could be
> any of a lot of things. Network problems (check for errors on
> the NIC and the switch port), lack of CPU to run the compression,
> disk I/O contention, huge numbers of files (either in aggregate
> or in a single directory), or possibly even impending disk failure
> (lots of read retries or a degraded RAID).
> Watching something like 'top' during the backup should show
> whether your CPU is overloaded, whether you are always waiting
> on disk, and whether some other process is also doing heavy
> disk I/O. Your system logs should show any disk errors, and the
> output of ifconfig or similar will show the error counts on the NIC.
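[On Linux you can also read the NIC error counters straight out of the kernel, which is handy when ifconfig's formatting varies. A rough sketch; the field positions assume the standard /proc/net/dev layout of 8 receive columns followed by 8 transmit columns per interface:]

```shell
# Print per-interface RX/TX error counters from /proc/net/dev.
# NR > 2 skips the two header lines; with runs of ':' and spaces as
# separators, field 2 is the interface name, field 5 the receive
# "errs" column, and field 13 the transmit "errs" column.
awk -F'[: ]+' 'NR > 2 { printf "%-8s rx_errs=%s tx_errs=%s\n", $2, $5, $13 }' /proc/net/dev
```

A steadily climbing error count while the backup runs would point at the NIC, cable, or switch port.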
> If you don't see anything obvious at first, try running your
> dump program (dump or tar or whatever Amanda is configured to use)
> with the output directed to /dev/null and see how long that takes;
> if that is also slow, the bottleneck is neither the network nor
> Amanda. Then try it without compression to see how much that
> speeds things up.
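[The timing test Frank describes can be sketched like this, using tar on a small scratch tree as a stand-in for dump; the paths and the use of gzip are illustrative, not what Amanda necessarily runs:]

```shell
# Build a small scratch tree to archive (illustrative path).
mkdir -p /tmp/dumptest
dd if=/dev/zero of=/tmp/dumptest/file bs=1M count=8 2>/dev/null

# Raw read/archive speed: output straight to /dev/null, no compression.
time tar cf /dev/null /tmp/dumptest

# Read plus compression: pipe the archive through gzip instead.
time tar cf - /tmp/dumptest | gzip > /dev/null

rm -rf /tmp/dumptest
```

If the uncompressed pass to /dev/null is already slow, look at the disk; if only the compressed pass is slow, look at CPU.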
Hi,
Thanks for the feedback Frank. I am running dump.
After re-nicing the sendbackup and dumper processes, first to -1 and then
to -3, the load average still hovers at zero:
load average: 0.00, 0.00, 0.00
Re-nicing again to 0, I looked at iostat -x and found the disk saturated
(%util frequently reaches 100 but drops quickly). The average queue
size (avgqu-sz) and await are also astoundingly high:
avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  0.00  100.00

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda        8.67    0.00  0.00   1.33   74.67   10.67     64.00  14316554.32      0.00  7500.00  100.00

avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  0.33   99.67

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda        1.00    0.67  0.67  10.00   10.67   85.33      9.00  14316556.95    700.00   140.62   15.00

avg-cpu:  %user  %nice  %sys   %idle
           1.67   0.00  6.00   92.33

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  0.67   0.00   85.33    0.00    128.00         1.93  28150.00  6600.00   44.00

avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  2.00   98.00

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  0.00   0.00   85.33    0.00      0.00         0.53      0.00     0.00    2.67

avg-cpu:  %user  %nice  %sys   %idle
           0.33   0.00  5.33   94.33

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  1.00   1.33   85.33   10.67     41.14         6.90  10128.57  4271.43   99.67
More concerning: monitoring the network with tshark on the server side,
I see a lost segment followed by a flurry of "Dup ACK" and
"TCP Retransmission" messages every eight to ten seconds:
11.641313 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=558073 Ack=1 Win=5392 Len=1348 TSV=519917696 TSER=427020080
11.670185 10.0.225.2 -> 192.168.0.31 TCP [TCP Previous segment lost]
53096 > 11003 [ACK] Seq=594469 Ack=1 Win=5392 Len=1348 TSV=519917781
TSER=427020930
11.670211 192.168.0.31 -> 10.0.225.2 TCP 11003 > 53096 [ACK] Seq=1
Ack=559421 Win=501 Len=0 TSV=427020990 TSER=519917696 SLE=594469 SRE=595817
11.699896 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=595817 Ack=1 Win=5392 Len=1348 TSV=519917781 TSER=427020930
11.699916 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#1] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021020 TSER=519917696
SLE=594469 SRE=597165
11.730662 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=597165 Ack=1 Win=5392 Len=1348 TSV=519917787 TSER=427020990
11.730747 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#2] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021050 TSER=519917696
SLE=594469 SRE=598513
11.730716 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=598513 Ack=1 Win=5392 Len=1348 TSV=519917787 TSER=427020990
11.730761 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#3] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021050 TSER=519917696
SLE=594469 SRE=599861
11.760131 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=599861 Ack=1 Win=5392 Len=1348 TSV=519917790 TSER=427021020
11.760151 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#4] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021080 TSER=519917696
SLE=594469 SRE=601209
11.789844 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=601209 Ack=1 Win=5392 Len=1348 TSV=519917793 TSER=427021050
11.789874 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#5] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021110 TSER=519917696
SLE=594469 SRE=602557
11.820556 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=602557 Ack=1 Win=5392 Len=1348 TSV=519917793 TSER=427021050
11.820584 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#6] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021140 TSER=519917696
SLE=594469 SRE=603905
11.850132 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=603905 Ack=1 Win=5392 Len=1348 TSV=519917796 TSER=427021080
11.850159 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#7] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021170 TSER=519917696
SLE=594469 SRE=605253
11.879920 10.0.225.2 -> 192.168.0.31 TCP [TCP Retransmission] 53096 >
11003 [ACK] Seq=559421 Ack=1 Win=5392 Len=1348 TSV=519917702 TSER=427020140
----output truncated----
While the disk does reach saturation (and recovers quickly), I suspect
all the retransmissions are slowing things down even more.
I don't see any errors on the client interface but there are four on the
server interface over the last four days.
Any comments would be helpful.
Thanks,
Tom
--
Tom Robinson
System Administrator
MoTeC
121 Merrindale Drive
Croydon South
3136 Victoria
Australia
T: +61 3 9761 5050
F: +61 3 9761 5051
M: +61 4 3268 7026
E: tom.robinson AT motec.com DOT au