Hi all,
This is my last shot at troubleshooting this before reverting back to the
old version I was running.
Last week I'd gotten to the point that amdump was not even ending the run
properly. It just exited and left me needing to run amcleanup. After
removing a new host, it now at least completes, but it's still got a few
errors that I cannot track down. I'll revisit the new host once I can get
this other mess cleaned up.
Here's the last report with all the errors that seem odd along with some
log snippets from both sides for each error:
FAILURE AND STRANGE DUMP SUMMARY:
h10.foo.com / lev 3 FAILED [cannot read header: got
0 instead of 32768]
h10.foo.com / lev 3 was successfully retried
client amandad:
amandad: time 0.335: accept recv REQ pkt:
<<<<<
SERVICE sendsize
OPTIONS features=fffffeff9ffeffffff7f;maxdumps=1;hostname=h10.foo.com;config
=DailySet1;
[...]
DUMP / 0 1970:1:1:0:0:0 -1 OPTIONS |;auth=bsdtcp;compress-fast;index;
DUMP / 3 2007:4:3:7:33:58 -1 OPTIONS |;auth=bsdtcp;compress-fast;index;
amandad: time 0.335: creating new service: sendsize
OPTIONS features=fffffeff9ffeffffff7f;maxdumps=1;hostname=h10.biglist.com;config
=DailySet1;
[...]
DUMP / 0 1970:1:1:0:0:0 -1 OPTIONS |;auth=bsdtcp;compress-fast;index;
DUMP / 3 2007:4:3:7:33:58 -1 OPTIONS |;auth=bsdtcp;compress-fast;index;
amandad: time 0.336: sending ACK pkt:
<<<<<
amandad: time 4.137: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
amandad: time 6.257: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
amandad: time 7.798: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 7.924: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 8.041: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 13.378: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 22.698: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 31.718: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
amandad: time 36.094: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 36.575: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 38.793: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 842.858: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 1363.486: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 1363.525: sending REP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]
amandad: time 1363.635: received ACK pkt:
<<<<<
security_close(handle=0x8058000, driver=0x28094900 (BSDTCP))
security_stream_close(0x8059000)
amandad: time 1363.644: pid 70599 finish time Sat Apr 7 03:32:47 2007
phew... So for disk "/" (actually all disks), we're sending the size
quite a bit.
And another amandad.debug from the client:
security_handleinit(handle=0x8052000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x8058000, driver=0x28094900 (BSDTCP))
amandad: time 0.034: accept recv REQ pkt:
<<<<<
SERVICE sendbackup
OPTIONS
features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP / 3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;
amandad: time 0.034: creating new service: sendbackup
OPTIONS
features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP / 3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;
amandad: time 0.035: sending ACK pkt:
<<<<<
security_streaminit(stream=0x8079000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x8082000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x808b000, driver=0x28094900 (BSDTCP))
amandad: time 0.181: sending REP pkt:
<<<<<
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
amandad: time 0.230: received ACK pkt:
<<<<<
security_close(handle=0x8052000, driver=0x28094900 (BSDTCP))
security_stream_close(0x8058000)
security_stream_close(0x808b000)
security_stream_close(0x8079000)
security_stream_close(0x8082000)
amandad: time 1.122: pid 72283 finish time Sat Apr 7 03:34:28 2007
sendbackup for / had no errors.
sendsize.debug does have some errors for /:
sendsize: debug 1 pid 70603 ruid 2 euid 2: start at Sat Apr 7 03:10:04 2007
sendsize: version 2.5.1p3
Reading conf file "/usr/local/etc/amanda/amanda-client.conf".
Could not open conf file "/usr/local/etc/amanda/DailySet1/amanda-client.conf":
No such file or directory
sendsize: debug 1 pid 70603 ruid 2 euid 2: rename at Sat Apr 7 03:10:04 2007
sendsize[70603]: time 0.067: waiting for any estimate child: 1 running
sendsize[70605]: time 0.067: calculating for amname /, dirname /, spindle -1
sendsize[70605]: time 0.067: getting size via dump for / level 0
sendsize[70605]: time 0.067: calculating for device /dev/twed0s1a with ufs
sendsize[70605]: time 0.068: running "/sbin/dump 0Shsf 0 1048576 -
/dev/twed0s1a"
sendsize[70605]: time 0.068: running /usr/local/libexec/amanda/killpgrp
sendsize[70605]: time 0.082: DUMP: Date of this level 0 dump: Sat Apr 7
03:10:04 2007
sendsize[70605]: time 0.083: DUMP: Date of last level 0 dump: the epoch
sendsize[70605]: time 0.083: DUMP: Dumping /dev/twed0s1a (/) to standard
output
sendsize[70605]: time 2.051: DUMP: mapping (Pass I) [regular files]
sendsize[70605]: time 2.763: DUMP: mapping (Pass II) [directories]
sendsize[70605]: time 2.764: DUMP: estimated 85064 tape blocks.
sendsize[70605]: time 2.764: .....
sendsize[70605]: estimate time for / level 0: 2.696
sendsize[70605]: estimate size for / level 0: 85064 KB
sendsize[70605]: time 2.764: asking killpgrp to terminate
dump_calc_estimates: warning - seek failed: Illegal seek
sendsize[70605]: time 3.770: getting size via dump for / level 3
sendsize[70605]: time 3.771: calculating for device /dev/twed0s1a with ufs
sendsize[70605]: time 3.771: running "/sbin/dump 3Shsf 0 1048576 -
/dev/twed0s1a"
sendsize[70605]: time 3.771: running /usr/local/libexec/amanda/killpgrp
sendsize[70605]: time 3.775: DUMP: Date of this level 3 dump: Sat Apr 7
03:10:08 2007
sendsize[70605]: time 3.778: DUMP: Date of last level 2 dump: Tue Apr 3
03:33:59 2007
sendsize[70605]: time 3.778: DUMP: Dumping /dev/twed0s1a (/) to standard
output
sendsize[70605]: time 4.472: DUMP: mapping (Pass I) [regular files]
sendsize[70605]: time 4.779: DUMP: mapping (Pass II) [directories]
sendsize[70605]: time 4.885: DUMP: estimated 69 tape blocks.
sendsize[70605]: time 4.885: .....
sendsize[70605]: estimate time for / level 3: 1.114
sendsize[70605]: estimate size for / level 3: 69 KB
sendsize[70605]: time 4.885: asking killpgrp to terminate
dump_calc_estimates: warning - seek failed: Illegal seek
sendsize[70605]: time 5.891: done with amname / dirname / spindle -1
sendsize[70603]: time 5.891: child 70605 terminated normally
sendsize[70603]: time 5.891: waiting for any estimate child: 1 running
[...]
and from amdump.1 on the server:
planner: time 1.139: setting up estimates for h10.foo.com:/
setup_estimate: h10.biglist.com:/: command 0, options: none last_level
3 next_level0 8 level_days 1 getting estimates 0 (-2) 3 (-2) -1 (-2)
[...]
planner: time 4.075: got partial result for host h10.foo.com disk /: 0
-> -2K, 3 -> -2K, -1 -> -2K
[...]
planner: time 7.717: got partial result for host h10.foo.com disk /: 0
-> 85064K, 3 -> -2K, -1 -> -2K
[...]
planner: time 9.837: got partial result for host h10.foo.com disk /: 0
-> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.378: got partial result for host h10.foo.com disk /:
0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.504: got partial result for host h10.foo.com disk /:
0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.620: got partial result for host h10.foo.com disk /:
0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
OK, this just goes on for quite some time... I snipped about 10 more of
those lines.
driver: send-cmd time 394.723 to taper: FILE-WRITE 00-00127
/spool2/amanda/20070405031000/h10.foo.com._.3 h10.foo.com
UNKNOWNFEATURE / 3 20070405031000 0
[...]
driver: startaflush: FIRST h10.foo.com / 42 101098191
[...]
driver: finished-cmd time 396.992 taper wrote h10.foo.com:/
[...]
planner: time 846.438: got partial result for host h10.foo.com disk /:
0 ->85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 1367.068: got partial result for host h10.foo.com disk
/: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 1367.190: getting estimates took 1365.664 secs
FAILED QUEUE: empty
DONE QUEUE:
[...]
138: h10.biglist.com /
[...]
ANALYZING ESTIMATES...
[...]
pondering h10.foo.com:/... next_level0 8 last_level 3 (not due for a
full dump, picking an incr level)
pick: size 69 level 3 days 1 (thresh 22K, 2 days)
curr level 3 nsize 69 csize 10 total size 13432774 total_lev0 32964
balanced-lev0size 3594603
INITIAL SCHEDULE (size 13432774):
[...]
h10.foo.com / pri 1 lev 3 nsize 69 csize 10
[...]
DELAYING DUMPS IF NEEDED, total_size 13432774, tape length 104857600 mark
0
delay: Total size now 13432774.
PROMOTING DUMPS IF NEEDED, total_lev0 32964, balanced_size 3594603...
[...]
no try h10.foo.com:/ 2 1 8 = 23
[...]
no try h10.foo.com:/ 2 1 8 = 23
many more repeats of the above...
planner: time 1368.188: analysis took 0.994 secs
GENERATING SCHEDULE:
--------
[...]
DUMP h10.foo.com fffffeff9ffeffffff7f / 20070407031000 1 3
2007:4:3:7:33:58 69 10 0 12
[...]
driver: assign_holdingdisk: time 1368.294: 0 assigning holding disk
/spool2/amanda to disk h10.foo.com:/, reserved 96, left 0
[...]
driver: send-cmd time 1368.295 to chunker1: PORT-WRITE 01-00140
/spool2/amanda/20070407031000/h10.foo.com._.3 h10.biglist.com
fffffeff9ffeffffff7f / 3 2007:
4:3:7:33:58 1048576 DUMP 96 |;auth=bsdtcp;compress-fast;index;
[...]
driver: send-cmd time 1368.306 to dumper1: PORT-DUMP 01-00140 4609
h10.foo.com fffffeff9ffeffffff7f / NODEVICE 3 2007:4:3:7:33:58 DUMP X X X
|;auth=bsdtcp;compress-fast;index;
[...]
send request:
----
SERVICE sendbackup
OPTIONS
features=fffffeff9ffeffffff7f;hostname=h10.biglist.com;config=DailySet1;
DUMP / 3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;
----
got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
----
got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
----
got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
----
[...]
rename_tmp_holding: /spool2/amanda/20070407031000/h10.biglist.com._.3:
empty file?
driver: adjust_diskspace: time 1444.241: h10.foo.com:/
/spool2/amanda/20070407031000/h10.foo.com._.3
[...]
driver: adjust_diskspace: time 1444.241: after: disk h10.foo.com:/ used 0
[...]
driver: assign_holdingdisk: time 1444.244: 0 assigning holding disk
/spool2/amanda to disk h10.foo.com:/, reserved 96, left 0
driver: started chunker1 pid 88265
driver: send-cmd time 1444.244 to chunker1: START 20070407031000
driver: send-cmd time 1444.245 to chunker1: PORT-WRITE 00-00190
/spool2/amanda/20070407031000/h10.biglist.com._.3 h10.foo.com
fffffeff9ffeffffff7f / 3 2007:
4:3:7:33:58 1048576 DUMP 96 |;auth=bsdtcp;compress-fast;index;
chunker: pid 88265 executable chunker1 version 2.5.1p3
driver: result time 1444.269 from chunker1: PORT 3469
driver: send-cmd time 1444.269 to dumper1: PORT-DUMP 00-00190 3469
h10.foo.com fffffeff9ffeffffff7f / NODEVICE 3 2007:4:3:7:33:58 DUMP X X X
|;auth=bsdtcp;compress-fast;index;
send request:
----
SERVICE sendbackup
OPTIONS
features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP / 3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;
----
[...]
driver: finished-cmd time 1467.828 dumper1 dumped h10.foo.com:/
driver: finished-cmd time 1467.829 chunker1 chunked h10.foo.com:/
driver: adjust_diskspace: time 1467.832: h10.foo.com:/
/spool2/amanda/20070407031000/h10.foo.com._.3
driver: adjust_diskspace: time 1467.832: hdisk spool1 done, reserved 96
used 42
diff -54 alloc -8548747 dumpers 4
driver: adjust_diskspace: time 1467.832: after: disk h10.foo.com:/
used 42
driver: send-cmd time 1467.832 to taper: FILE-WRITE 00-00195
/spool2/amanda/20070407031000/h10.foo.com._.3 h10.foo.com
fffffeff9ffeffffff7f / 3 20070407 031000 0
driver: startaflush: FIRST h10.foo.com / 42 95280821
[...]
driver: finished-cmd time 1470.910 taper wrote h10.foo.com:/
[...]
Wow.
OK, the other failures are much simpler:
devel2.foo.com /spool2/jails lev 3 STRANGE
that's gnutar, and files changed during the dump, but it was successful.
h14.foo.com /spool lev 3 FAILED [dump (7763) /sbin/dump
returned 3]
h14.foo.com /spool lev 3 was successfully retried
that's a partition where a db was being dumped while dump was running.
Not an amanda issue.
h13.foo.com /home2/savingse lev 2 STRANGE
And that's another gnutar with files changing under it.
What else for that first host with the dump failure on "/"?
While other hosts did not fail the same as h10, there are a ton of
messages on just about every DLE like:
"planner: time 4.075: got partial result for host h03.foo.com disk
/spool: 0 -> -2K, 3 -> -2K, -1 -> -2K"
I have the full amdump log available if needed.
Sorry for all the info, but I'm not sure which parts are important and
which aren't.
Thanks,
Charles
|