Amanda-Users

Re: "disk was stranded on waitq"/sendsize timed out waiting for REP data

2007-03-05 14:45:05
Subject: Re: "disk was stranded on waitq"/sendsize timed out waiting for REP data
From: Jean-Louis Martineau <martineau AT zmanda DOT com>
To: Toralf Lund <toralf AT procaptura DOT com>
Date: Mon, 05 Mar 2007 14:35:01 -0500
If you can't kill sendsize, it's because it is hung in a system call.
That often happens when it tries to access a mount point.
Do you have a hung mount point?

Does "df" also hang?
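
One way to check for a hung mount point without getting another shell stuck is to probe each path from a throwaway child process with a timeout. The sketch below is only an illustration. The names probe_path and probe_all and the 10-second default are made up for this example, it assumes Python 3 is available on the host, and it uses os.statvfs as a stand-in for the kind of filesystem query "df" performs.

```python
import subprocess
import sys

# Child program: one statvfs() call on the path given as argv[1].
_CHILD = "import os, sys; os.statvfs(sys.argv[1])"


def probe_path(path, timeout=10.0):
    """Return 'ok', 'error' or 'hung' for a single path (hypothetical helper)."""
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        status = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do NOT wait for the child afterwards: a process
        # blocked inside a system call may never exit, and waiting for it
        # would hang this script as well.
        child.kill()
        return "hung"
    return "ok" if status == 0 else "error"


def probe_all(paths, timeout=10.0):
    """Probe every path in turn and map each one to its result."""
    return {path: probe_path(path, timeout) for path in paths}


if __name__ == "__main__":
    # Pass the mount points to check on the command line; "/" is the fallback.
    for path, state in probe_all(sys.argv[1:] or ["/"]).items():
        print(f"{state:5}  {path}")
```

Any path reported as "hung" is a candidate for the mount point that sendsize is stuck on. The probe child for such a path may itself stay behind as an unkillable process until the mount recovers.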

Jean-Louis

Toralf Lund wrote:
We just started to get a serious problem with our amdump execution (Amanda 2.5.0p2). As usual, we don't think we have changed anything at all since the last successful dump.

Symptoms:

  1. "amstatus" says
     fileserv:/scanner                        0 planner: [hmm, disk was stranded on waitq]
  2. "sendsize" on the host in question hangs, and I mean really hangs
     - not even 'kill -9' will stop it.
  3. The amandad.<id>.debug on this host ("fileserv") says:
     amandad: time 14027.090: sending ACK pkt:
     <<<<<
      >>>>>
     amandad: time 21600.297: /usr/freeware/libexec/sendsize timed out waiting for REP data
     amandad: time 21600.309: sending NAK pkt:
     <<<<<
     ERROR timeout on reply pipe
      >>>>>
     amandad: time 35627.467: /usr/freeware/libexec/sendsize timed out waiting for REP data
     amandad: time 35627.467: sending NAK pkt:
     <<<<<
     ERROR timeout on reply pipe
      >>>>>
     amandad: time 35650.476: pid 11670783 finish time Thu Mar  1 07:54:12 2007
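
For longer debug files, the timeout entries can be pulled out mechanically. This is a rough sketch, not anything shipped with Amanda: the function names are invented here, it assumes each entry sits on a single line of the form "amandad: time <number>: <message>" (unwrapped, unlike in this mail), and it assumes the number is seconds elapsed since amandad started.

```python
import re

# One debug entry: "amandad: time <seconds>: <message>" (assumed layout).
LINE_RE = re.compile(r"^\s*amandad: time (\d+(?:\.\d+)?): (.*)$")

TIMEOUT_TEXT = "timed out waiting for REP data"


def timestamped_events(lines):
    """Return (seconds, message) for every line that matches LINE_RE."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2).strip()))
    return events


def reply_timeouts(lines):
    """Keep only the entries that report a reply timeout."""
    return [
        (seconds, message)
        for seconds, message in timestamped_events(lines)
        if TIMEOUT_TEXT in message
    ]


# Sample input: part of the excerpt above, with wrapped lines rejoined.
SAMPLE = """\
amandad: time 14027.090: sending ACK pkt:
<<<<<
>>>>>
amandad: time 21600.297: /usr/freeware/libexec/sendsize timed out waiting for REP data
amandad: time 21600.309: sending NAK pkt:
<<<<<
ERROR timeout on reply pipe
>>>>>
""".splitlines()

if __name__ == "__main__":
    for seconds, message in reply_timeouts(SAMPLE):
        # Show the elapsed time in hours next to the original message.
        print(f"{seconds / 3600:.2f} h  {message}")
```

To run it on a real file, pass the lines of the amandad debug file to reply_timeouts() instead of SAMPLE.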

This happens for all disks on one particular host. Other DLEs appear to be OK, but nothing is actually dumped, since amdump gives up on the entire operation because of these problems (I think).

Also, we actually run amdump with two different configs (the usual tape backup and an "incremental only" config with output to hard disk) on the same disks every night (but not simultaneously, of course), and we see this behaviour for both.

HELP!

- Toralf