Re: "disk was stranded on waitq"/sendsize timed out waiting for REP data

If you can't kill sendsize, it's because it is hung in a system call.
This often happens when it tries to access a mount point.
Do you have a hung mount point?

Did "df" also hang?
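If running "df" by hand is awkward (it may hang your shell too), here is a minimal sketch for probing mount points one at a time. It assumes Python 3 is available on the client. `mount_responds` is a hypothetical helper, not part of Amanda. It calls `os.statvfs()` on each path, roughly the kind of query "df" makes per file system, in a background thread and only waits a bounded time. A path whose call never returns is reported as hung instead of blocking the check. The thread is abandoned rather than killed, because a process stuck in such a call ignores even `kill -9`.

```python
import os
import sys
import threading


def mount_responds(path, timeout=10.0):
    """Return True if statvfs(path) completes (successfully or with an
    error) within `timeout` seconds, False if it is still blocked."""
    finished = threading.Event()

    def probe():
        try:
            os.statvfs(path)  # file system query similar to what "df" does
        except OSError:
            pass  # an error still means the kernel answered
        finally:
            finished.set()

    # Daemon thread: if statvfs() never returns, we just stop waiting for it
    # instead of trying to kill it.
    threading.Thread(target=probe, daemon=True).start()
    return finished.wait(timeout)


if __name__ == "__main__":
    paths = sys.argv[1:] or ["/"]
    for p in paths:
        status = "ok" if mount_responds(p) else "HUNG (no answer)"
        print(f"{p}: {status}", flush=True)
```

Pass the mount points to check as arguments, e.g. `python3 checkmounts.py / /scanner` (the script name is made up). On some systems the script itself may linger at exit if a probe thread is stuck in the kernel, which is why each result is flushed as soon as it is known.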

Jean-Louis

Toralf Lund wrote:

We just started to get a serious problem with our amdump execution (Amanda 2.5.0p2). As usual, we don't think we have changed anything at all since the last successful dump.
Symptoms:

  1. "amstatus" says
     fileserv:/scanner                        0 planner: [hmm, disk was
     stranded on waitq]
  2. "sendsize" on the host in question hangs, and I mean really hangs
     - not even 'kill -9' will stop it.
  3. The amandad.<id>.debug on this host ("fileserv") says:
     amandad: time 14027.090: sending ACK pkt:
     <<<<<
      >>>>>
     amandad: time 21600.297: /usr/freeware/libexec/sendsize timed out
     waiting for REP data
     amandad: time 21600.309: sending NAK pkt:
     <<<<<
     ERROR timeout on reply pipe
      >>>>>
     amandad: time 35627.467: /usr/freeware/libexec/sendsize timed out
     waiting for REP data
     amandad: time 35627.467: sending NAK pkt:
     <<<<<
     ERROR timeout on reply pipe
      >>>>>
     amandad: time 35650.476: pid 11670783 finish time Thu Mar  1
     07:54:12 2007
This happens for all disks on one particular host. Other DLEs appear to be OK, but nothing is actually dumped, since amdump will give up the entire operation due to these problems (I think).
Also, we actually run amdump with two different configs (the usual tape backup and an "incremental only" with output to hard disk) on the same disks every night (but not simultaneously, of course), and we see this behaviour for both.
HELP!

- Toralf