Hi all,
please forgive me posting longer texts to that list, but right now I am
into a situation where I sort of feel lost within my amanda
configuration and log files. After working well for quite a while, our
tape backup ceased working a few days ago, and by now I haven't managed
to figure out why.
Basically, we do run two configurations - full dump to tapes each
weekend, incrementals every weekday. Last weekends full dumps did not
finish. Neither did any of the incrementals the last days. I am
completely unsure why, the only last thing I can remember having done to
the backup machine is "chown -R"'ing the whole filesystem because of a
change in the Windows domain these machines belong to. Initially, I
thought incremental to take exceptionally long because due to chown'ing
all files have been touched, but after Full dump also ending unfinished,
I start doubting this. Unfortunately, I am rather short on evidence on
what is goin' on as there hardly are any log messages to make me think.
What to be _observed_:
- We run dumps from a local disk filesystem directly to tape, no holding
disk.
- The last days, dumps seemed to be started well, having a bunch of
dumpers and a taper process running as well as sendsize/tar, and things
seem fine.
- Sooner or later, sendsize/tar disappears from the process list,
leaving the dumper and taper processes running until all eternity.
Attaching to them using "strace -p", however, shows that they seem to be
waiting - for what?
- Manually killing all the processes and doing amcleanup ends in a
notification mail to be sent out telling me that all dump results are
missing.
- Running amverify leaves me thinking that actually nothing has been
stored to tape so far. This also fits what amrestore keeps thinking
about the tape:
--- amverify output:
[...]
Loading current slot...
changer: got exit: 0 str: 3 /dev/nst0
Using device /dev/nst0
Waiting for device to go ready...
Rewinding...
Processing label...
Volume Back07, Date 20060816
Rewinding...
End-of-Information detected.
Rewinding...
--- /amverify
--- amrestore output:
backer:/tmp# amrestore /dev/nst0
amrestore: missing file header block
amrestore: 2: reached end of information
--- /amrestore output
- The only "real"? error message to be seen is in amandad.2006XXX.debug:
[...]
>>>>>
amandad: time 21599.282: /usr/lib/amanda/sendsize timed out waiting for
REP data
amandad: time 21599.298: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
amandad: time 21600.301: pid 19395 finish time Wed Aug 16 09:00:04 2006
- The regular amanda dump logs seem to be fine except for "log" to not
be showing any proof of any dumps put to tape:
--- /var/log/amanda/Incremental/log:
[...]
DISK planner backer.planconnect.net backercfg
DISK planner backer.planconnect.net backervar
START planner date 20060816
WARNING planner tapecycle (4) <= runspercycle (4)
STATS driver startup time 0.066
START taper datestamp 20060816 label Back07 tape 0
--- log
--- corresponding amdump tail:
[...]
changer: opening pipe to: /usr/lib/amanda/chg-zd-mtx -slot current
changer: got exit: 0 str: 3 /dev/nst0
taper: slot: 3 wrote label `Back07' date `20060816'
driver: result time 4.889 from taper: TAPER-OK
driver: state time 4.889 free kps: 600 space: 0 taper: idle
idle-dumpers: 5 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 0 driver-idle:
not-idle
driver: interface-state time 4.889 if : free 600
driver: hdisk-state time 4.889
planner: time
7232.420: got partial result for host backer.planconnect.net disk
backervar: 1 -> -2K, 2 -> -2K, -1 -> -2K
planner: time
7232.444: got partial result for host backer.planconnect.net disk
backercfg: 1 -> -2K, -1 -> -2K, -1 -> -2K
planner: time 7232.444: got partial result for host
backer.planconnect.net disk refast: 1 -> -2K, -1 -> -2K, -1 -> -2K
[...]
planner: time 14160.633: got partial result for host
backer.planconnect.net disk planc1: 1 -> -2K, 2 -> -2K, -1 -> -2K
planner: time 14160.633: got partial result for host
backer.planconnect.net disk planc3: 1 -> 2471980K, 2 -> 2466110K, -1 -> -2K
taper: DONE [idle wait: 35325.938 secs]
taper: writing end marker. [Back07 OK kb 0 fm 0]
--- /amdump
- sendsize log has a strange error, as well, for some of the files in
the backup fs:
--- sendsize log:
[...]
sendsize[19779]: time 31925.159: /bin/tar:
./docs/work/d2845/d894/oi01ngxo/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 31925.161: /bin/tar:
./docs/work/d2845/d895/ega9cq2a/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 31925.162: /bin/tar:
./docs/work/d2845/d896/z5l00n0d/file.hpgl: Warning: Cannot seek to 0:
Bad file descriptor
sendsize[19779]: time 32962.193: Total bytes written: 300444446720
(280GiB, 44MiB/s)
sendsize[19779]: time 32962.205: .....
sendsize[19779]: estimate time for planc1 level 2: 6551.741
sendsize[19779]: estimate size for planc1 level 2: 293402780 KB
sendsize[19779]: time 32962.205: waiting for /bin/tar "planc1" child
sendsize[19779]: time 32962.218: after /bin/tar "planc1" wait
sendsize[19779]: time 32962.278: done with amname 'planc1', dirname
'/backup/PV/', spindle 1
--- /sendsize log
- For the sake of being complete, one of the runtar logs (which seems
fine to me).
--- runtar:
runtar: debug 1 pid 20008 ruid 34 euid 0: start at Wed Aug 16 10:20:14
2006
/bin/tar: version 2.5.0p2
running: /bin/tar: /bin/tar --create --file /dev/null --directory
/backup/PV/ --one-file-system --num
eric-owner --listed-incremental
/var/lib/amanda/gnutar-lists/backer.planconnect.netplanc1_2.new --spa
rse --ignore-failed-read --totals .
runtar: pid 20008 finish time Wed Aug 16 10:20:14 2006
--- /runtar
That's about all I know, and I feel kinda lost this way, as I am sure it
will not run this evening, either. Though probably the information I
provided actually is rather vague, does anyone have a slight idea of
where to look to see what happens or to just gather more information
than that? Most notably:
- What happens to the dumper/taper processes, what are they actually
waiting for after sendsize has ended?
- What is the "bad file descriptor" all about? These files seem to be
perfectly fine, at least I can open and read them, also using
permissions of the backup user.
- Why the timeout error in amandad log?
- Overally, what is most likely to be responsible for that error?
Hardware? Software? Misconfiguration?
I'd be grateful for any hints no matter how small. System configuration:
- Debian GNU/Linux
- amanda 2.5.0p2-1
- Tandberg Autoloader (scsiinfo: NEC LL0101H-0A Rev. 002)
- IBM Ultrium-TD2 Rev. 36M6 tape drive
Thanks in advance and sorry again for the long mail.
Best,
Kristian
|