backup ceased working? (longer mail)

Hi all,

please forgive me posting longer texts to that list, but right now I am
into a situation where I sort of feel lost within my amanda
configuration and log files. After working well for quite a while, our
tape backup ceased working a few days ago, and by now I haven't managed
to figure out why.

Basically, we do run two configurations - full dump to tapes each
weekend, incrementals every weekday. Last weekends full dumps did not
finish. Neither did any of the incrementals the last days. I am
completely unsure why, the only last thing I can remember having done to
the backup machine is "chown -R"'ing the whole filesystem because of a
change in the Windows domain these machines belong to. Initially, I
thought incremental to take exceptionally long because due to chown'ing
all files have been touched, but after Full dump also ending unfinished,
I start doubting this. Unfortunately, I am rather short on evidence on
what is goin' on as there hardly are any log messages to make me think.



What to be _observed_:



- We run dumps from a local disk filesystem directly to tape, no holding
disk.


- The last days, dumps seemed to be started well, having a bunch of
dumpers and a taper process running as well as sendsize/tar, and things
seem fine.


- Sooner or later, sendsize/tar disappears from the process list,
leaving the dumper and taper processes running until all eternity.
Attaching to them using "strace -p", however, shows that they seem to be
waiting - for what?


- Manually killing all the processes and doing amcleanup ends in a
notification mail to be sent out telling me that all dump results are
missing.


- Running amverify leaves me thinking that actually nothing has been
stored to tape so far. This also fits what amrestore keeps thinking
about the tape:


--- amverify output:
[...]
Loading current slot...
changer: got exit: 0 str: 3 /dev/nst0
Using device /dev/nst0
Waiting for device to go ready...
Rewinding...
Processing label...
Volume Back07, Date 20060816
Rewinding...
End-of-Information detected.
Rewinding...

--- /amverify


--- amrestore output:
backer:/tmp# amrestore /dev/nst0
amrestore: missing file header block
amrestore:   2: reached end of information

--- /amrestore output





- The only "real"? error message to be seen is in amandad.2006XXX.debug:
[...]
>>>>>
amandad: time 21599.282: /usr/lib/amanda/sendsize timed out waiting for
REP data
amandad: time 21599.298: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
amandad: time 21600.301: pid 19395 finish time Wed Aug 16 09:00:04 2006



- The regular amanda dump logs seem to be fine except for "log" to not
be showing any proof of any dumps put to tape:



--- /var/log/amanda/Incremental/log:
[...]
DISK planner backer.planconnect.net backercfg
DISK planner backer.planconnect.net backervar
START planner date 20060816
WARNING planner tapecycle (4) <= runspercycle (4)
STATS driver startup time 0.066
START taper datestamp 20060816 label Back07 tape 0

--- log



--- corresponding amdump tail:
[...]
changer: opening pipe to: /usr/lib/amanda/chg-zd-mtx -slot current
changer: got exit: 0 str: 3 /dev/nst0
taper: slot: 3 wrote label `Back07' date `20060816'
driver: result time 4.889 from taper: TAPER-OK
driver: state time 4.889 free kps: 600 space: 0 taper: idle
idle-dumpers: 5 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle:
not-idle
driver: interface-state time 4.889 if : free 600
                             driver: hdisk-state time 4.889
                                                          planner: time
7232.420: got partial result for host backer.planconnect.net disk
backervar: 1 -> -2K, 2 -> -2K, -1 -> -2K
                                                  planner: time
7232.444: got partial result for host backer.planconnect.net disk
backercfg: 1 -> -2K, -1 -> -2K, -1 -> -2K
planner: time 7232.444: got partial result for host
backer.planconnect.net disk refast: 1 -> -2K, -1 -> -2K, -1 -> -2K

[...]
planner: time 14160.633: got partial result for host
backer.planconnect.net disk planc1: 1 -> -2K, 2 -> -2K, -1 -> -2K
planner: time 14160.633: got partial result for host
backer.planconnect.net disk planc3: 1 -> 2471980K, 2 -> 2466110K, -1 -> -2K
taper: DONE [idle wait: 35325.938 secs]
taper: writing end marker. [Back07 OK kb 0 fm 0]

--- /amdump





- sendsize log has a strange error, as well, for some of the files in
the backup fs:

--- sendsize log:
[...]
sendsize[19779]: time 31925.159: /bin/tar:
./docs/work/d2845/d894/oi01ngxo/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 31925.161: /bin/tar:
./docs/work/d2845/d895/ega9cq2a/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 31925.162: /bin/tar:
./docs/work/d2845/d896/z5l00n0d/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 32962.193: Total bytes written: 300444446720
(280GiB, 44MiB/s)
sendsize[19779]: time 32962.205: .....
sendsize[19779]: estimate time for planc1 level 2: 6551.741
sendsize[19779]: estimate size for planc1 level 2: 293402780 KB
sendsize[19779]: time 32962.205: waiting for /bin/tar "planc1" child
sendsize[19779]: time 32962.218: after /bin/tar "planc1" wait
sendsize[19779]: time 32962.278: done with amname 'planc1', dirname
'/backup/PV/', spindle 1

--- /sendsize log





- For the sake of being complete, one of the runtar logs (which seems
fine to me).

--- runtar:

runtar: debug 1 pid 20008 ruid 34 euid 0: start at Wed Aug 16 10:20:14
2006
/bin/tar: version 2.5.0p2

running: /bin/tar: /bin/tar --create --file /dev/null --directory
/backup/PV/ --one-file-system --num
eric-owner --listed-incremental
/var/lib/amanda/gnutar-lists/backer.planconnect.netplanc1_2.new --spa
rse --ignore-failed-read --totals .

runtar: pid 20008 finish time Wed Aug 16 10:20:14 2006

--- /runtar



That's about all I know, and I feel kinda lost this way, as I am sure it
will not run this evening, either. Though probably the information I
provided actually is rather vague, does anyone have a slight idea of
where to look to see what happens or to just gather more information
than that? Most notably:

- What happens to the dumper/taper processes, what are they actually
waiting for after sendsize has ended?

- What is the "bad file descriptor" all about? These files seem to be
perfectly fine, at least I can open and read them, also using
permissions of the backup user.

- Why the timeout error in amandad log?

- Overally, what is most likely to be responsible for that error?
Hardware? Software? Misconfiguration?


I'd be grateful for any hints no matter how small. System configuration:

- Debian GNU/Linux
- amanda 2.5.0p2-1
- Tandberg Autoloader (scsiinfo: NEC LL0101H-0A Rev. 002)
- IBM Ultrium-TD2 Rev. 36M6 tape drive


Thanks in advance and sorry again for the long mail.
Best,
Kristian