Amanda-Users

Re: 2.5.1p2: planner returns error=EOF on read

2006-11-27 09:34:17
Subject: Re: 2.5.1p2: planner returns error=EOF on read
From: Jean-Louis Martineau <martineau AT zmanda DOT com>
To: Jean-Francois Malouin <Jean-Francois.Malouin AT bic.mni.mcgill DOT ca>, AMANDA users <amanda-users AT amanda DOT org>
Date: Mon, 27 Nov 2006 09:24:03 -0500
Jean-Francois,

Could you try the attached patch, it should kill the tar process.
If it still doesn't work, could you attach a debugger to the sendbackup process to know where it is?

Jean-Louis

Jean-Francois Malouin wrote:
* Jean-Louis Martineau <martineau AT zmanda DOT com> [20061123 17:20]:
Do amandad is running?

To be sure that I would hit the problem again I didn't load new tapes
so after the holddding filled up I have a bunch of gtar running and
just wasting a lot of cycles. A typical example (sorry for the long lines):

amanda 3555936 3557855 0 - - 0:00 <defunct> amanda 3557855 1 0 05:37:39 ? 1:30 /opt/amanda/amanda1/libexec/sendbackup amandad bsdtcp
    root    3568087    3557855  0 05:40:02 ?      277:35 gtar --create --file - 
--directory /data/mafalda/mafalda1/susanita/sandra/DTI -

Tracing the sendbackup process yields no output while the gtar one
gives just an uninterupted flow of lines:

   68mS             tar(3568087): write(1, <fd 63 e3 6c ff 42 3c df 87
41 82 bf e9 dd 95 2c>..., 10240) errno = 32 (Broken pipe)

Don't know if this is going to be of any help...
jf

Could you try to find what each process (amanda,sendbackup,tar) are doing? on which system call they are hung?
On linux, I would use strace for that.

Jean-Louis

Jean-Francois Malouin wrote:
Jean-Louis,

The amdump just finished and because amanda ran out of tapes
(runtapes=10) it completed badly leaving stuff in the holdding
disk. I'm flushing it at the moment but I noticed that when amanda
gave up it left a lot of gnutar lying around, with amandad as parent.
I had to manually kill them. Looks like signals are not doing ok.
I've attached one example of a DLE's sendbackup and runtar debug file.
It's not the first time I notice that when a DLE fails to make it
to tape successfully processes are left running...

* Jean-Louis Martineau <martineau AT zmanda DOT com> [20061122 09:32]:
Could you post amandad.<timestamp>.debug file from yorick?

Jean-Francois Malouin wrote:
On a server running irix-6.5 and amanda 2.5.1p2
planner debug shows:

security_getdriver(name=bsdtcp) returns 4075aa8
security_handleinit(handle=1001d4a8, driver=4075aa8 (BSDTCP))
security_streaminit(stream=1005e538, driver=4075aa8 (BSDTCP))
security_close(handle=1001cfe0, driver=4075aa8 (BSDTCP))
security_stream_close(10053a80)
security_stream_seterr(1005e538, SOCKET_EOF)
security_seterror(handle=1001d4a8, driver=4075aa8 (BSDTCP) error=EOF
on read from yorick)
security_close(handle=1001d4a8, driver=4075aa8 (BSDTCP))
security_stream_close(1005e538)

and the amanda report shows things like:

yorick  DATA_sub101   lev 0  FAILED [hmm, disk was stranded on waitq]
planner: ERROR Request to yorick failed: EOF on read from yorick

and the backup fails. The funny thing is that I have 2 other configurations
running 2.5.1p2 in parallel that doesn't exhibit this behaviour.
Any clues?
jf

------------------------------------------------------------------------

runtar: debug 1 pid 2826413 ruid 666 euid 0: start at Thu Nov 23 08:04:13 2006
runtar: version 2.5.1p2
/usr/freeware/bin/tar version: tar (GNU tar) 1.13.25

config: stk_80-conf1
runtar: debug 1 pid 2826413 ruid 0 euid 0: rename at Thu Nov 23 08:04:13 2006 running: /usr/freeware/bin/tar: 'gtar' '--create' '--file' '-' '--directory' '/data/mafalda/mafalda1/susanita/jen/anxiety_version1/sub101' '--one-file-system' '--listed-incremental' '/opt/amanda/amanda1/var/amanda/gnutar-lists/yoricksub101_1.new' '--sparse' '--ignore-failed-read' '--totals' '.' runtar: pid 2826413 finish time Thu Nov 23 08:04:13 2006 ------------------------------------------------------------------------

sendbackup: debug 1 pid 2836263 ruid 666 euid 666: start at Thu Nov 23 08:03:20 2006
sendbackup: version 2.5.1p2
Could not open conf file "/opt/amanda/amanda1/etc/amanda/amanda-client.conf": No such file or directory Reading conf file "/opt/amanda/amanda1/etc/amanda/stk_80-conf1/amanda-client.conf". sendbackup: debug 1 pid 2836263 ruid 666 euid 666: rename at Thu Nov 23 08:03:20 2006 sendbackup req: <GNUTAR sub101 /data/mafalda/mafalda1/susanita/jen/anxiety_version1/sub101 1 2006:11:13:13:21:32 OPTIONS |;auth=bsdtcp;index;>
 parsed request as: program `GNUTAR'
                    disk `sub101'
device `/data/mafalda/mafalda1/susanita/jen/anxiety_version1/sub101'
                    level 1
                    since 2006:11:13:13:21:32
                    options `|;auth=bsdtcp;index;'
sendbackup: start: yorick:sub101 lev 1
sendbackup-gnutar: time 0.208: doing level 1 dump as listed-incremental
>from '/opt/amanda/amanda1/var/amanda/gnutar-lists/yoricksub101_0' to
'/opt/amanda/amanda1/var/amanda/gnutar-lists/yoricksub101_1.new'
sendbackup-gnutar: time 53.057: doing level 1 dump from date: 2006-11-13 13:22:26 GMT sendbackup: time 53.246: started index creator: "/usr/freeware/bin/tar -tf - 2>/dev/null | sed -e 's/^\.//'" sendbackup: time 53.254: spawning /opt/amanda/amanda1/libexec/runtar in pipeline sendbackup: argument list: runtar stk_80-conf1 gtar --create --file - --directory /data/mafalda/mafalda1/susanita/jen/anxiety_version1/sub101 --one-file-system --listed-incremental /opt/amanda/amanda1/var/amanda/gnutar-lists/yoricksub101_1.new --sparse --ignore-failed-read --totals . sendbackup-gnutar: time 53.292: /opt/amanda/amanda1/libexec/runtar: pid 2826413
sendbackup: time 53.292: started backup
sendbackup: time 1839.833: index tee cannot write [Broken pipe]
sendbackup: time 1839.833: pid 2836285 finish time Thu Nov 23 08:34:00 2006
sendbackup: time 1839.926: 118: strange(?): sendbackup: index tee cannot write [Broken pipe]
sendbackup: time 9284.787: error [/usr/freeware/bin/tar got signal 15]
sendbackup: time 9284.856: pid 2836263 finish time Thu Nov 23 10:38:05 2006


diff -u -r --show-c-function --new-file --exclude-from=amanda.diff 
amanda-2.5.1p2.new/client-src/sendbackup.c 
amanda-2.5.1p2.new.sendbackup/client-src/sendbackup.c
--- amanda-2.5.1p2.new/client-src/sendbackup.c  2006-07-25 14:27:56.000000000 
-0400
+++ amanda-2.5.1p2.new.sendbackup/client-src/sendbackup.c       2006-11-27 
08:52:01.000000000 -0500
@@ -549,9 +549,10 @@ check_status(
     amwait_t   w)
 {
     char *thiserr = NULL;
-    char *str;
+    char *str, *strX;
     int ret, sig, rc;
     char number[NUM_STR_SIZE];
+    char numberpid[NUM_STR_SIZE];
 
     str = childstr(pid);
 
@@ -570,54 +571,59 @@ check_status(
         * but the failure is noted.
         */
        if(ret != 0) {
-           fprintf(stderr, "? %s returned %d\n", str, ret);
+           fprintf(stderr, "? index %s returned %d\n", str, ret);
            rc = 0;
        }
-    }
-
-#ifndef HAVE_GZIP
-    if(pid == comppid) {
+       indexpid = -1;
+       strX = "index ";
+    } else if(pid == comppid) {
        /*
         * compress returns 2 sometimes, but it is ok.
         */
+#ifndef HAVE_GZIP
        if(ret == 2) {
            rc = 0;
        }
-    }
 #endif
-
-#ifdef DUMP_RETURNS_1
-    if(pid == dumppid && tarpid == -1) {
+       comppid = -1;
+       strX = "compress ";
+    } else if(pid == dumppid && tarpid == -1) {
         /*
         * Ultrix dump returns 1 sometimes, but it is ok.
         */
+#ifdef DUMP_RETURNS_1
         if(ret == 1) {
            rc = 0;
        }
-    }
 #endif
-
-#ifdef IGNORE_TAR_ERRORS
-    if(pid == tarpid) {
+       dumppid = -1;
+       strX = "dump ";
+    } else if(pid == tarpid) {
        /*
         * tar bitches about active filesystems, but we do not care.
         */
+#ifdef IGNORE_TAR_ERRORS
         if(ret == 2) {
            rc = 0;
        }
-    }
 #endif
+       dumppid = tarpid = -1;
+       strX = "dump ";
+    } else {
+       strX = "unknown ";
+    }
 
     if(rc == 0) {
        return 0;                               /* normal exit */
     }
 
+    snprintf(numberpid, SIZEOF(number), "%d", (int)pid);
     if(ret == 0) {
        snprintf(number, SIZEOF(number), "%d", sig);
-       thiserr = vstralloc(str, " got signal ", number, NULL);
+       thiserr = vstralloc(strX, "(", numberpid, ") ", str, " got signal ", 
number, NULL);
     } else {
        snprintf(number, SIZEOF(number), "%d", ret);
-       thiserr = vstralloc(str, " returned ", number, NULL);
+       thiserr = vstralloc(strX, "(", numberpid, ") ", str, " returned ", 
number, NULL);
     }
 
     if(errorstr) {
@@ -733,10 +739,47 @@ parse_backup_messages(
        /*NOTREACHED*/
     }
 
-    while((wpid = wait(&retstat)) != -1) {
+    while((wpid = waitpid((pid_t)-1, &retstat, WNOHANG)) > 0) {
        if(check_status(wpid, retstat)) goterror = 1;
     }
 
+    if (dumppid != -1) {
+       sleep(5);
+       while((wpid = waitpid((pid_t)-1, &retstat, WNOHANG)) > 0) {
+           if(check_status(wpid, retstat)) goterror = 1;
+       }
+    }
+    if (dumppid != -1) {
+       dbprintf(("%s: Sending SIGHUP to dump process %d\n",
+                 debug_prefix_time(NULL), (int)dumppid));
+       if(dumppid != -1) {
+           if(kill(dumppid, SIGHUP) == -1) {
+               dbprintf(("%s: Can't send SIGHUP to %d: %s\n",
+                         debug_prefix_time(NULL), (int)dumppid,
+                         strerror(errno)));
+           }
+       }
+       sleep(5);
+       while((wpid = waitpid((pid_t)-1, &retstat, WNOHANG)) > 0) {
+           if(check_status(wpid, retstat)) goterror = 1;
+       }
+    }
+    if (dumppid != -1) {
+       dbprintf(("%s: Sending SIGKILL to dump process %d\n",
+                 debug_prefix_time(NULL), (int)dumppid));
+       if(dumppid != -1) {
+           if(kill(dumppid, SIGKILL) == -1) {
+               dbprintf(("%s: Can't send SIGKILL to %d: %s\n",
+                         debug_prefix_time(NULL), (int)dumppid,
+                         strerror(errno)));
+           }
+       }
+       sleep(5);
+       while((wpid = waitpid((pid_t)-1, &retstat, WNOHANG)) > 0) {
+           if(check_status(wpid, retstat)) goterror = 1;
+       }
+    }
+
     if(errorstr) {
        error("error [%s]", errorstr);
        /*NOTREACHED*/
<Prev in Thread] Current Thread [Next in Thread>