Amanda-Users

Re: Failed Backups

2003-06-06 12:06:45
Subject: Re: Failed Backups
From: "Steven M. Wilson" <stevew AT purdue DOT edu>
To: Chris Gordon <chris AT theory14 DOT net>
Date: Fri, 06 Jun 2003 11:04:24 -0500
Chris,

I looked around a little in the Amanda source code and convinced myself that there was a bug there. I sent a note to the amanda-hackers mailing list and received a prompt reply from Jean-Louis Martineau with a patch that fixed the problem for me. I'll attach his message and patch.

Hope that helps!

Steve


Chris Gordon wrote:

Steve,

On Wed, Jun 04, 2003 at 02:29:20PM -0000, smw_purdue wrote:
Chris,

I'm having the same problem using a similar configuration of backups
to disk without any holding disks.  Every time Amanda drops into
degraded mode it's because an error occurred with one of the clients
(usually a timeout, indicating that a client system was unavailable).
I would suspect that there's a bug in the code that puts Amanda into
degraded mode on more errors than just a tape error.  Notice in your
log that you have an "unknown response" from gilgamesh.  This error
was probably what kicked Amanda into degraded mode.

That is exactly what appears to be happening.  I configured a holding
disk in an attempt to eliminate that as a possible cause. In my case,
the problem is intermittent with everything working fine for some time
and then a failure.  The failure may affect some file systems on a given
host or most/all of the backup run.

Today, I had two file systems fail again on gilgamesh and I began checking the various logs for issues. What I found in
"sendbackup.lotsofnumbers.debug" is:

---[ begin ]---
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1496
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1497
sendbackup: time 0.002: stream_server: waiting for connection: 0.0.0.0.1498
sendbackup: time 0.003: waiting for connect on 1496, then 1497, then 1498
sendbackup: time 29.996: stream_accept: timeout after 30 seconds
sendbackup: time 29.996: timeout on data port 1496
sendbackup: time 59.996: stream_accept: timeout after 30 seconds
sendbackup: time 59.996: timeout on mesg port 1497
sendbackup: time 89.996: stream_accept: timeout after 30 seconds
sendbackup: time 89.996: timeout on index port 1498
sendbackup: time 89.996: pid 5263 finish time Fri Jun  6 00:47:44 2003
---[ end ]---

Anybody out there have time to debug the source?  I may take a look at
it but time is at a premium right now... (when isn't it???).

Anyone have any ideas?  This only happens occasionally and I haven't
yet been able to draw a correlation.

Thanks,
Chris

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
stevew AT purdue DOT edu    765.496.1946


--- server-src/driver.c.orig    2003-01-01 18:28:54.000000000 -0500
+++ server-src/driver.c 2003-06-04 15:54:44.000000000 -0400
@@ -2242,10 +2222,10 @@
            error("error [dump to tape DONE result_argc != 5: %d]", result_argc);
        }
 
-       free_serial(result_argv[2]);
-
        if(failed == 1) goto tryagain;  /* dump didn't work */
-       else if(failed == 2) goto fatal;
+       else if(failed == 2) goto failed_dumper;
+
+       free_serial(result_argv[2]);
 
        /* every thing went fine */
        update_info_dumper(dp, origsize, dumpsize, dumptime);
@@ -2259,9 +2239,10 @@
 
     case TRYAGAIN: /* TRY-AGAIN <handle> <err mess> */
     tryagain:
+       headqueue_disk(&runq, dp);
+    failed_dumper:
        update_failed_dump_to_tape(dp);
        free_serial(result_argv[2]);
-       headqueue_disk(&runq, dp);
        tape_left = tape_length;
        break;
 
@@ -2269,7 +2250,6 @@
     case TAPE_ERROR: /* TAPE-ERROR <handle> <err mess> */
     case BOGUS:
     default:
-    fatal:
        update_failed_dump_to_tape(dp);
        free_serial(result_argv[2]);
        failed = 2;     /* fatal problem */
--- Begin Message ---
Subject: Re: Going to degraded mode unnecessarily
From: Jean-Louis Martineau <martinea AT IRO.UMontreal DOT CA>
To: "Steven M. Wilson" <stevew AT purdue DOT edu>
Date: Wed, 4 Jun 2003 16:59:58 -0400
Hi Steven,

Could you try this patch? It should apply to the latest 2.4.4
snapshot from http://www.iro.umontreal.ca/~martinea/amanda

Jean-Louis

On Wed, Jun 04, 2003 at 02:16:14PM -0500, Steven M. Wilson wrote:
> 
> 
> I have a question for the Amanda development experts.
> 
> I'm using version 2.4.4 and backing up to hard disk directly (no tapes, no 
> holding disks).  On several occasions, I've had a client error cause Amanda 
> to go into degraded mode.  It appears that the dump_to_tape function 
> (server-src/driver.c) takes any FATAL dumper error and forces Amanda into  
> degraded mode.  Shouldn't the code be more discerning as to what caused the 
> error?  I would think that Amanda should go into degraded mode only if an 
> error were related to the output device.  In my case the error was on the 
> client and unrelated to writing the backup to disk.
> 
> Here's some of the related amdump messages:
> 
> driver: result time 6754.491 from dumper0: FAILED 01-00368 [data timeout]
> taper: reader-side: got label slot024 filenum 184
> driver: result time 6754.492 from taper: DONE 00-00367 slot024 184 [sec 
> 2174.408 kb 2061376 kps 948.0 {wr: writers 64419 rdwait 2166.220 wrwait 
> 7.959 filemark 0.021}]
> driver: error time 6754.503 serial gen mismatch dump of driver schedule 
> before start degraded mode:
> 
> Note that the "serial gen mismatch" error is probably an unrelated bug in 
> the code.  It looks like "free_serial" is called twice in a row for a 
> failure (see line 2245 of driver.c and 2263 and 2274).  We might want to 
> look at moving the first call to free_serial so that it occurs after the 
> if(failed==1)...else if(failed==2)... block.
> 
> Thanks!
> 
> Steve
> 
> -- 
> Steven M. Wilson, Systems and Network Manager
> Markey Center for Structural Biology
> Purdue University
> stevew AT purdue DOT edu    765.496.1946
> 

-- 
Jean-Louis Martineau             email: martineau AT IRO.UMontreal DOT CA 
Departement IRO, Universite de Montreal
C.P. 6128, Succ. CENTRE-VILLE    Tel: (514) 343-6111 ext. 3529
Montreal, Canada, H3C 3J7        Fax: (514) 343-5834

Attachment: driver.c.failed_dump_to_tape.diff
Description: Text document


--- End Message ---
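
For readers following the patch in this thread, here is a minimal C sketch of the control flow it appears to set up in dump_to_tape. This is a simplified model, not Amanda code: the enum, the struct, and handle_result() are hypothetical names made up for illustration, and the degraded-mode branch reflects only what the messages above describe. It shows the two points raised in the thread: a dumper failure no longer falls into the tape-error path, and the serial handle is released exactly once on every path.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not Amanda's real types. */
enum outcome {
    DUMP_OK,      /* dump finished normally */
    DUMP_RETRY,   /* failed == 1: dump did not work, try again */
    DUMP_FAILED,  /* failed == 2: dumper (client-side) failure */
    TAPE_FAILED   /* output device error */
};

struct run_state {
    int serial_frees;  /* how many times the serial handle was released */
    int requeued;      /* 1 if the disk went back to the head of the run queue */
    int degraded;      /* 1 if the driver would enter degraded mode */
};

/* Models the patched branches for a single dump result. */
static struct run_state handle_result(enum outcome o)
{
    struct run_state s = {0, 0, 0};

    switch (o) {
    case DUMP_OK:
        s.serial_frees++;      /* released after the failed checks */
        break;
    case DUMP_RETRY:
        s.requeued = 1;        /* tryagain: requeue the disk first */
        /* fall through */
    case DUMP_FAILED:
        s.serial_frees++;      /* failed_dumper: record failure, release once */
        break;
    case TAPE_FAILED:
    default:
        s.serial_frees++;
        s.degraded = 1;        /* only an output error leads here */
        break;
    }

    /* The invariant the patch restores: one release per result. */
    assert(s.serial_frees == 1);
    return s;
}
```

Calling handle_result(DUMP_FAILED) in this model leaves degraded at 0 and requeued at 0, while handle_result(TAPE_FAILED) sets degraded to 1, which is the distinction Steve asked for in his message to amanda-hackers.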