Frank Smith wrote:
> Tom Robinson wrote:
> >>> Hi,
> >>>
> >>> I'm running amanda (2.6.0p2-1) but have an older client running
> >>> 2.4.2p2-1. On that client the full backup of a 4GB disk takes a very
> >>> long time:
> >>>
> >>> DUMP SUMMARY:
> >>>                                          DUMPER STATS         TAPER STATS
> >>> HOSTNAME  DISK  L  ORIG-KB   OUT-KB  COMP%  MMM:SS  KB/s  MMM:SS    KB/s
> >>> ------------------------------------------------------------------------
> >>> host      /     0  4256790  1819411   42.7  637:22  47.6   26:01  1165.9
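[As a quick sanity check on that summary, the reported DUMPER rate is just OUT-KB divided by the elapsed dumper time, which confirms the dump really did average under 50 KB/s:]

```shell
# Sanity-check the DUMPER KB/s figure from the report above:
# 637:22 is 637 minutes 22 seconds = 38242 s, and
# 1819411 KB / 38242 s comes out to the reported 47.6 KB/s.
awk 'BEGIN { secs = 637*60 + 22; printf "%.1f KB/s\n", 1819411 / secs }'
```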
> >>>
> >>> I'm not sure where to start looking for this bottle-neck.
> >>>
> >>> Any clues would be appreciated.
> bump
>
> Try looking on the client while the backup is running. Could be
> any of a lot of things. Network problems (check for errors on
> the NIC and the switch port), lack of CPU to run the compression,
> disk I/O contention, huge numbers of files (either in aggregate
> or in a single directory), or possibly even impending disk failure
> (lots of read retries or a degraded RAID).
> Watching something like 'top' during the backup should show
> whether your CPU is overloaded, whether you are always waiting
> on disk, and whether some other process is also doing heavy
> disk I/O. Your system logs should show any disk errors, and the
> output of ifconfig or similar will show the error counts on the NIC.
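[On Linux you can also read the NIC error counters straight out of the kernel, which is handy when ifconfig's formatting varies. A rough sketch; the field positions assume the standard /proc/net/dev layout of 8 receive columns followed by 8 transmit columns per interface:]

```shell
# Print per-interface RX/TX error counters from /proc/net/dev.
# NR > 2 skips the two header lines; with runs of ':' and spaces as
# separators, field 2 is the interface name, field 5 the receive
# "errs" column, and field 13 the transmit "errs" column.
awk -F'[: ]+' 'NR > 2 { printf "%-8s rx_errs=%s tx_errs=%s\n", $2, $5, $13 }' /proc/net/dev
```

A steadily climbing error count while the backup runs would point at the NIC, cable, or switch port.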
> If you don't see anything obvious at first, try running your
> dump program (dump or tar or whatever Amanda is configured to use)
> with the output directed to /dev/null and see how long that takes;
> if that is also slow, the bottleneck is neither the network nor
> Amanda. Then try it without compression to see how much that
> speeds things up.
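[The timing test Frank describes can be sketched like this, using tar on a small scratch tree as a stand-in for dump; the paths and the use of gzip are illustrative, not what Amanda necessarily runs:]

```shell
# Build a small scratch tree to archive (illustrative path).
mkdir -p /tmp/dumptest
dd if=/dev/zero of=/tmp/dumptest/file bs=1M count=8 2>/dev/null

# Raw read/archive speed: output straight to /dev/null, no compression.
time tar cf /dev/null /tmp/dumptest

# Read plus compression: pipe the archive through gzip instead.
time tar cf - /tmp/dumptest | gzip > /dev/null

rm -rf /tmp/dumptest
```

If the uncompressed pass to /dev/null is already slow, look at the disk; if only the compressed pass is slow, look at CPU.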
Hi,
Thanks for the feedback Frank. I am running dump.
After re-nicing the sendbackup and dumper processes, first to -1 and then
to -3, the load average still hovers at zero:
load average: 0.00, 0.00, 0.00
Re-nicing again to 0, I looked at iostat -x and found the disk saturated
(%util frequently reaches 100 but drops quickly). The average queue
size (avgqu-sz) and await are also astoundingly high:
avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  0.00  100.00

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda        8.67    0.00  0.00   1.33   74.67   10.67     64.00  14316554.32      0.00  7500.00  100.00

avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  0.33   99.67

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda        1.00    0.67  0.67  10.00   10.67   85.33      9.00  14316556.95    700.00   140.62   15.00

avg-cpu:  %user  %nice  %sys   %idle
           1.67   0.00  6.00   92.33

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  0.67   0.00   85.33    0.00    128.00         1.93  28150.00  6600.00   44.00

avg-cpu:  %user  %nice  %sys   %idle
           0.00   0.00  2.00   98.00

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  0.00   0.00   85.33    0.00      0.00         0.53      0.00     0.00    2.67

avg-cpu:  %user  %nice  %sys   %idle
           0.33   0.00  5.33   94.33

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s  wsec/s  avgrq-sz     avgqu-sz     await    svctm   %util
hda       10.00    0.00  1.00   1.33   85.33   10.67     41.14         6.90  10128.57  4271.43   99.67
More concerning: monitoring the network with tshark on the server side,
I see a lost segment followed by a flurry of "Dup ACK" and
"TCP Retransmission" messages every eight to ten seconds:
11.641313 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=558073 Ack=1 Win=5392 Len=1348 TSV=519917696 TSER=427020080
11.670185 10.0.225.2 -> 192.168.0.31 TCP [TCP Previous segment lost]
53096 > 11003 [ACK] Seq=594469 Ack=1 Win=5392 Len=1348 TSV=519917781
TSER=427020930
11.670211 192.168.0.31 -> 10.0.225.2 TCP 11003 > 53096 [ACK] Seq=1
Ack=559421 Win=501 Len=0 TSV=427020990 TSER=519917696 SLE=594469 SRE=595817
11.699896 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=595817 Ack=1 Win=5392 Len=1348 TSV=519917781 TSER=427020930
11.699916 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#1] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021020 TSER=519917696
SLE=594469 SRE=597165
11.730662 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=597165 Ack=1 Win=5392 Len=1348 TSV=519917787 TSER=427020990
11.730747 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#2] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021050 TSER=519917696
SLE=594469 SRE=598513
11.730716 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=598513 Ack=1 Win=5392 Len=1348 TSV=519917787 TSER=427020990
11.730761 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#3] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021050 TSER=519917696
SLE=594469 SRE=599861
11.760131 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=599861 Ack=1 Win=5392 Len=1348 TSV=519917790 TSER=427021020
11.760151 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#4] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021080 TSER=519917696
SLE=594469 SRE=601209
11.789844 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=601209 Ack=1 Win=5392 Len=1348 TSV=519917793 TSER=427021050
11.789874 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#5] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021110 TSER=519917696
SLE=594469 SRE=602557
11.820556 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=602557 Ack=1 Win=5392 Len=1348 TSV=519917793 TSER=427021050
11.820584 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#6] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021140 TSER=519917696
SLE=594469 SRE=603905
11.850132 10.0.225.2 -> 192.168.0.31 TCP 53096 > 11003 [ACK]
Seq=603905 Ack=1 Win=5392 Len=1348 TSV=519917796 TSER=427021080
11.850159 192.168.0.31 -> 10.0.225.2 TCP [TCP Dup ACK 657#7] 11003 >
53096 [ACK] Seq=1 Ack=559421 Win=501 Len=0 TSV=427021170 TSER=519917696
SLE=594469 SRE=605253
11.879920 10.0.225.2 -> 192.168.0.31 TCP [TCP Retransmission] 53096 >
11003 [ACK] Seq=559421 Ack=1 Win=5392 Len=1348 TSV=519917702 TSER=427020140
----output truncated----
While the disk does reach saturation (and recovers quickly), I suspect
all the retransmissions are slowing things down even more.
I don't see any errors on the client interface but there are four on the
server interface over the last four days.
Any comments would be helpful.
Thanks,
Tom
--
Tom Robinson
System Administrator
MoTeC
121 Merrindale Drive
Croydon South
3136 Victoria
Australia
T: +61 3 9761 5050
F: +61 3 9761 5051
M: +61 4 3268 7026
E: tom.robinson AT motec.com DOT au