Re: "disk was stranded on waitq"/sendsize timed out waiting for REP data
2007-03-05 14:45:05
If you can't kill sendsize, it's because it is hung in a system call.
That often happens when it tries to access a mount point.
Do you have a hung mount point?
Does "df" also hang?
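One way to check for a hung mount point without getting your own shell stuck is to stat each path from a throwaway child process and give up after a timeout. The following is only a rough sketch, not part of Amanda: the names probe_path and find_hung_paths are made up for illustration, the 5-second timeout is arbitrary, and it assumes a Python 3 interpreter is available on the host. You supply the list of mount points yourself (for example, the directories named in your disklist).

```python
import subprocess
import sys
import time


def probe_path(path, timeout=5.0):
    """Return True if a stat() of `path` completes within `timeout` seconds."""
    # Do the stat in a child process: if the mount is hung, only the child
    # blocks inside the kernel, and this script stays responsive.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if child.poll() is not None:
            # The call returned (possibly with an error such as "no such
            # file"), so the path is at least not hanging.
            return True
        time.sleep(0.05)
    # Still blocked after the timeout. Try to kill the child, but do not
    # wait() for it: a process stuck in a system call may never exit.
    child.kill()
    return False


def find_hung_paths(paths, timeout=5.0):
    """Return the subset of `paths` that did not answer within `timeout`."""
    return [p for p in paths if not probe_path(p, timeout)]
```

Calling find_hung_paths(["/scanner", "/home"]) would return the paths that never answered. A path that is reported here is a good candidate for the mount that sendsize is stuck on.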
Jean-Louis
Toralf Lund wrote:
We just started to get a serious problem with our amdump execution
(Amanda 2.5.0p2). As usual, we don't think we have changed anything at
all after the last successful dump.
Symptoms:
1. "amstatus" says
fileserv:/scanner 0 planner: [hmm, disk was
stranded on waitq]
2. "sendsize" on the host in question hangs, and I mean really hangs
- not even 'kill -9' will stop it.
3. The amandad.<id>.debug on this host ("fileserv") says:
amandad: time 14027.090: sending ACK pkt:
<<<<<
>>>>>
amandad: time 21600.297: /usr/freeware/libexec/sendsize timed out
waiting for REP data
amandad: time 21600.309: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
amandad: time 35627.467: /usr/freeware/libexec/sendsize timed out
waiting for REP data
amandad: time 35627.467: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
amandad: time 35650.476: pid 11670783 finish time Thu Mar 1
07:54:12 2007
This happens for all disks on one particular host. Other DLEs appear
to be OK, but nothing is actually dumped, since amdump will give up
the entire operation due to these problems (I think.)
Also, we actually run amdump with two different configs (the usual
tape backup and an "incremental only" with output to hard disk) on the
same disks every night (but not simultaneously, of course), and we see
this behaviour for both.
HELP!
- Toralf