Amanda-Users

backup failures post-upgrade (2.5.1p3)

2007-04-08 01:40:50
Subject: backup failures post-upgrade (2.5.1p3)
From: Charles Sprickman <spork AT bway DOT net>
To: amanda-users AT amanda DOT org
Date: Sun, 8 Apr 2007 01:32:17 -0400 (EDT)
Hi all,

This is my last shot at troubleshooting this before reverting back to the old version I was running.

Last week I'd gotten to the point that amdump was not even ending the run properly. It just exited and left me needing to run amcleanup. After removing a new host, it now at least completes, but it's still got a few errors that I cannot track down. I'll revisit the new host once I can get this other mess cleaned up.

Here's the last report with all the errors that seem odd along with some log snippets from both sides for each error:

FAILURE AND STRANGE DUMP SUMMARY:
h10.foo.com / lev 3 FAILED [cannot read header: got 0 instead of 32768]
  h10.foo.com     /                lev 3  was successfully retried

client amandad:

amandad: time 0.335: accept recv REQ pkt:
<<<<<
SERVICE sendsize
OPTIONS features=fffffeff9ffeffffff7f;maxdumps=1;hostname=h10.foo.com;config
=DailySet1;
[...]
DUMP /  0 1970:1:1:0:0:0 -1  OPTIONS |;auth=bsdtcp;compress-fast;index;
DUMP /  3 2007:4:3:7:33:58 -1  OPTIONS |;auth=bsdtcp;compress-fast;index;

amandad: time 0.335: creating new service: sendsize
OPTIONS features=fffffeff9ffeffffff7f;maxdumps=1;hostname=h10.biglist.com;config
=DailySet1;
[...]
DUMP /  0 1970:1:1:0:0:0 -1  OPTIONS |;auth=bsdtcp;compress-fast;index;
DUMP /  3 2007:4:3:7:33:58 -1  OPTIONS |;auth=bsdtcp;compress-fast;index;

amandad: time 0.336: sending ACK pkt:
<<<<<
amandad: time 4.137: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064

amandad: time 6.257: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69

amandad: time 7.798: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 7.924: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 8.041: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 13.378: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 22.698: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 31.718: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69

amandad: time 36.094: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 36.575: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 38.793: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 842.858: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 1363.486: sending PREP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 1363.525: sending REP pkt:
<<<<<
OPTIONS features=fffffeff9ffeffffff7f;
/ 0 SIZE 85064
/ 3 SIZE 69
[...]

amandad: time 1363.635: received ACK pkt:
<<<<<

security_close(handle=0x8058000, driver=0x28094900 (BSDTCP))
security_stream_close(0x8059000)
amandad: time 1363.644: pid 70599 finish time Sat Apr  7 03:32:47 2007

phew... So for disk "/" (actually all disks), we're sending the size quite a bit.

And another amandad.debug from the client:

security_handleinit(handle=0x8052000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x8058000, driver=0x28094900 (BSDTCP))
amandad: time 0.034: accept recv REQ pkt:
<<<<<
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP /  3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;

amandad: time 0.034: creating new service: sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP /  3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;

amandad: time 0.035: sending ACK pkt:
<<<<<

security_streaminit(stream=0x8079000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x8082000, driver=0x28094900 (BSDTCP))
security_streaminit(stream=0x808b000, driver=0x28094900 (BSDTCP))
amandad: time 0.181: sending REP pkt:
<<<<<
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;

amandad: time 0.230: received ACK pkt:
<<<<<

security_close(handle=0x8052000, driver=0x28094900 (BSDTCP))
security_stream_close(0x8058000)
security_stream_close(0x808b000)
security_stream_close(0x8079000)
security_stream_close(0x8082000)
amandad: time 1.122: pid 72283 finish time Sat Apr  7 03:34:28 2007

sendbackup for / had no errors.

sendsize.debug does have some errors for /:

sendsize: debug 1 pid 70603 ruid 2 euid 2: start at Sat Apr  7 03:10:04 2007
sendsize: version 2.5.1p3
Reading conf file "/usr/local/etc/amanda/amanda-client.conf".
Could not open conf file "/usr/local/etc/amanda/DailySet1/amanda-client.conf": 
No such file or directory
sendsize: debug 1 pid 70603 ruid 2 euid 2: rename at Sat Apr  7 03:10:04 2007
sendsize[70603]: time 0.067: waiting for any estimate child: 1 running
sendsize[70605]: time 0.067: calculating for amname /, dirname /, spindle -1
sendsize[70605]: time 0.067: getting size via dump for / level 0
sendsize[70605]: time 0.067: calculating for device /dev/twed0s1a with ufs
sendsize[70605]: time 0.068: running "/sbin/dump 0Shsf 0 1048576 - 
/dev/twed0s1a"
sendsize[70605]: time 0.068: running /usr/local/libexec/amanda/killpgrp
sendsize[70605]: time 0.082:   DUMP: Date of this level 0 dump: Sat Apr  7 
03:10:04 2007
sendsize[70605]: time 0.083:   DUMP: Date of last level 0 dump: the epoch
sendsize[70605]: time 0.083:   DUMP: Dumping /dev/twed0s1a (/) to standard 
output
sendsize[70605]: time 2.051:   DUMP: mapping (Pass I) [regular files]
sendsize[70605]: time 2.763:   DUMP: mapping (Pass II) [directories]
sendsize[70605]: time 2.764:   DUMP: estimated 85064 tape blocks.
sendsize[70605]: time 2.764: .....
sendsize[70605]: estimate time for / level 0: 2.696
sendsize[70605]: estimate size for / level 0: 85064 KB
sendsize[70605]: time 2.764: asking killpgrp to terminate
dump_calc_estimates: warning - seek failed: Illegal seek
sendsize[70605]: time 3.770: getting size via dump for / level 3
sendsize[70605]: time 3.771: calculating for device /dev/twed0s1a with ufs
sendsize[70605]: time 3.771: running "/sbin/dump 3Shsf 0 1048576 - 
/dev/twed0s1a"
sendsize[70605]: time 3.771: running /usr/local/libexec/amanda/killpgrp
sendsize[70605]: time 3.775:   DUMP: Date of this level 3 dump: Sat Apr  7 
03:10:08 2007
sendsize[70605]: time 3.778:   DUMP: Date of last level 2 dump: Tue Apr  3 
03:33:59 2007
sendsize[70605]: time 3.778:   DUMP: Dumping /dev/twed0s1a (/) to standard 
output
sendsize[70605]: time 4.472:   DUMP: mapping (Pass I) [regular files]
sendsize[70605]: time 4.779:   DUMP: mapping (Pass II) [directories]
sendsize[70605]: time 4.885:   DUMP: estimated 69 tape blocks.
sendsize[70605]: time 4.885: .....
sendsize[70605]: estimate time for / level 3: 1.114
sendsize[70605]: estimate size for / level 3: 69 KB
sendsize[70605]: time 4.885: asking killpgrp to terminate
dump_calc_estimates: warning - seek failed: Illegal seek
sendsize[70605]: time 5.891: done with amname / dirname / spindle -1
sendsize[70603]: time 5.891: child 70605 terminated normally
sendsize[70603]: time 5.891: waiting for any estimate child: 1 running
[...]

and from amdump.1 on the server:

planner: time 1.139: setting up estimates for h10.foo.com:/
setup_estimate: h10.biglist.com:/: command 0, options: none last_level 3 next_level0 8 level_days 1 getting estimates 0 (-2) 3 (-2) -1 (-2)
[...]
planner: time 4.075: got partial result for host h10.foo.com disk /: 0 -> -2K, 3 -> -2K, -1 -> -2K
[...]
planner: time 7.717: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> -2K, -1 -> -2K
[...]
planner: time 9.837: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.378: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.504: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 11.620: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
OK, this just goes on for quite some time... I snipped about 10 more of those lines.

driver: send-cmd time 394.723 to taper: FILE-WRITE 00-00127 /spool2/amanda/20070405031000/h10.foo.com._.3 h10.foo.com UNKNOWNFEATURE / 3 20070405031000 0
[...]
driver: startaflush: FIRST h10.foo.com / 42 101098191
[...]
driver: finished-cmd time 396.992 taper wrote h10.foo.com:/
[...]
planner: time 846.438: got partial result for host h10.foo.com disk /: 0 ->85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 1367.068: got partial result for host h10.foo.com disk /: 0 -> 85064K, 3 -> 69K, -1 -> -2K
[...]
planner: time 1367.190: getting estimates took 1365.664 secs
FAILED QUEUE: empty
DONE QUEUE:
[...]
138: h10.biglist.com /
[...]
ANALYZING ESTIMATES...
[...]
pondering h10.foo.com:/... next_level0 8 last_level 3 (not due for a full dump, picking an incr level)
   pick: size 69 level 3 days 1 (thresh 22K, 2 days)
curr level 3 nsize 69 csize 10 total size 13432774 total_lev0 32964 balanced-lev0size 3594603
INITIAL SCHEDULE (size 13432774):
[...]
h10.foo.com / pri 1 lev 3 nsize 69 csize 10
[...]
DELAYING DUMPS IF NEEDED, total_size 13432774, tape length 104857600 mark 0
  delay: Total size now 13432774.
PROMOTING DUMPS IF NEEDED, total_lev0 32964, balanced_size 3594603...
[...]
no try h10.foo.com:/ 2 1 8 = 23
[...]
no try h10.foo.com:/ 2 1 8 = 23

many more repeats of the above...

planner: time 1368.188: analysis took 0.994 secs

GENERATING SCHEDULE:
--------
[...]
DUMP h10.foo.com fffffeff9ffeffffff7f / 20070407031000 1 3 2007:4:3:7:33:58 69 10 0 12
[...]
driver: assign_holdingdisk: time 1368.294: 0 assigning holding disk /spool2/amanda to disk h10.foo.com:/, reserved 96, left 0
[...]
driver: send-cmd time 1368.295 to chunker1: PORT-WRITE 01-00140 /spool2/amanda/20070407031000/h10.foo.com._.3 h10.biglist.com fffffeff9ffeffffff7f / 3 2007:
4:3:7:33:58 1048576 DUMP 96 |;auth=bsdtcp;compress-fast;index;
[...]
driver: send-cmd time 1368.306 to dumper1: PORT-DUMP 01-00140 4609 h10.foo.com fffffeff9ffeffffff7f / NODEVICE 3 2007:4:3:7:33:58 DUMP X X X |;auth=bsdtcp;compress-fast;index;
[...]
send request:
----
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=h10.biglist.com;config=DailySet1;
DUMP /  3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;

----

got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;

----

got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;
got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;

----

got response:
----
CONNECT DATA 499999 MESG 499998 INDEX 499997
OPTIONS features=fffffeff9ffeffffff7f;

----
[...]
rename_tmp_holding: /spool2/amanda/20070407031000/h10.biglist.com._.3: empty file? driver: adjust_diskspace: time 1444.241: h10.foo.com:/ /spool2/amanda/20070407031000/h10.foo.com._.3
[...]
driver: adjust_diskspace: time 1444.241: after: disk h10.foo.com:/ used 0
[...]
driver: assign_holdingdisk: time 1444.244: 0 assigning holding disk /spool2/amanda to disk h10.foo.com:/, reserved 96, left 0
driver: started chunker1 pid 88265
driver: send-cmd time 1444.244 to chunker1: START 20070407031000
driver: send-cmd time 1444.245 to chunker1: PORT-WRITE 00-00190 /spool2/amanda/20070407031000/h10.biglist.com._.3 h10.foo.com fffffeff9ffeffffff7f / 3 2007:
4:3:7:33:58 1048576 DUMP 96 |;auth=bsdtcp;compress-fast;index;
chunker: pid 88265 executable chunker1 version 2.5.1p3
driver: result time 1444.269 from chunker1: PORT 3469
driver: send-cmd time 1444.269 to dumper1: PORT-DUMP 00-00190 3469 h10.foo.com fffffeff9ffeffffff7f / NODEVICE 3 2007:4:3:7:33:58 DUMP X X X |;auth=bsdtcp;compress-fast;index;
send request:
----
SERVICE sendbackup
OPTIONS features=fffffeff9ffeffffff7f;hostname=h10.foo.com;config=DailySet1;
DUMP /  3 2007:4:3:7:33:58 OPTIONS |;auth=bsdtcp;compress-fast;index;

----
[...]
driver: finished-cmd time 1467.828 dumper1 dumped h10.foo.com:/
driver: finished-cmd time 1467.829 chunker1 chunked h10.foo.com:/
driver: adjust_diskspace: time 1467.832: h10.foo.com:/ /spool2/amanda/20070407031000/h10.foo.com._.3 driver: adjust_diskspace: time 1467.832: hdisk spool1 done, reserved 96 used 42
diff -54 alloc -8548747 dumpers 4
driver: adjust_diskspace: time 1467.832: after: disk h10.foo.com:/ used 42 driver: send-cmd time 1467.832 to taper: FILE-WRITE 00-00195 /spool2/amanda/20070407031000/h10.foo.com._.3 h10.foo.com fffffeff9ffeffffff7f / 3 20070407 031000 0
driver: startaflush: FIRST h10.foo.com / 42 95280821
[...]
driver: finished-cmd time 1470.910 taper wrote h10.foo.com:/
[...]

Wow.

OK, the other failures are much simpler:

 devel2.foo.com  /spool2/jails    lev 3  STRANGE

that's gnutar, and files changed during the dump, but it was successful.

h14.foo.com /spool lev 3 FAILED [dump (7763) /sbin/dump returned 3]
 h14.foo.com     /spool           lev 3  was successfully retried

that's a partition where a db was being dumped while dump was running. Not an amanda issue.

  h13.foo.com     /home2/savingse  lev 2  STRANGE

And that's another gnutar with files changing under it.

What else for that first host with the dump failure on "/"?

While other hosts did not fail the same as h10, there are a ton of messages on just about every DLE like:

"planner: time 4.075: got partial result for host h03.foo.com disk /spool: 0 -> -2K, 3 -> -2K, -1 -> -2K"

I have the full amdump log available if needed.

Sorry for all the info, but I'm not sure which parts are important and which aren't.

Thanks,

Charles


<Prev in Thread] Current Thread [Next in Thread>
  • backup failures post-upgrade (2.5.1p3), Charles Sprickman <=