Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-16 01:56:36
Subject: Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
From: Kern Sibbald <kern AT sibbald DOT com>
To: Stephen Thompson <stephen AT seismo.berkeley DOT edu>
Date: Fri, 16 Apr 2010 07:54:21 +0200
On Thursday 15 April 2010 22:16:46 Stephen Thompson wrote:
> Hello,
>
> Thanks for the response.
>
> No, it's nothing to do with mail configuration; 100% sure of that.
> (I know people say that all the time, but, seriously, it's the director).
>
> And by alerts, I do mean "Messages" in the bacula vernacular.
>
> The first time this crash happened, we received 120,000 Messages in the
> form of emails to our administrative account.  The messages were
> identical both to each other and to the content of the $JOB.mail file in
> our bacula working directory (which is never removed automatically after
> one of these crashes - perhaps that causes the endless cycle).  The same
> Message also appears to be written to our bacula log file each time an
> email is generated (or vice versa).
>
> It seems to me like it's possible for the director to get stuck in a
> loop and send the contents of that mail file again and again,
> infinitely.  Both times we've had the SD crash (both have happened since
> upgrading to 5.0.1), the only thing that stopped the Message generation
> was stopping the director itself.
>
> Of course, that's the annoying symptom.  The more serious problem is the
> crash of our SD.  Any pointers to getting "ptrace" working with the
> automatic scripts?
>
1. Make sure the binaries are compiled with the -g option
and

2. Run the Director as root
or
3. Reacquire root permission in the traceback script
or
4. Run the Director under the debugger manually

Test by sending a SIGILL or SIGSEGV to the Director.
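
A minimal sketch of such a test, assuming a POSIX system and Python 3. The helper name send_test_signal and the throwaway child process are hypothetical stand-ins for illustration, not part of Bacula; against a real daemon you would pass its PID instead, with sufficient privileges to signal it.

```python
import os
import signal
import subprocess
import sys
import time


def send_test_signal(pid: int, sig: int = signal.SIGSEGV) -> None:
    """Deliver a fatal signal to the process with the given PID."""
    os.kill(pid, sig)


if __name__ == "__main__":
    # Stand-in target: a throwaway child process, NOT a real Bacula daemon.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    time.sleep(0.5)  # give the child a moment to start
    send_test_signal(child.pid)
    child.wait()
    # On POSIX, a negative return code means "terminated by that signal".
    print("child exit status:", child.returncode)
```

If the traceback setup is working, signalling the real daemon this way should produce a usable traceback rather than the "ptrace: Operation not permitted" output quoted below.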

Kern

> thanks!
> Stephen
>
> On 04/15/2010 12:40 PM, Kern Sibbald wrote:
> > On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
> >> Additionally, seems like the SD was possibly reading a new
> >> freshly-labeled tape when it crashed...  Last items in bacula log
> >> besides alerts already mentioned:
> >
> > In Bacula "alerts" refer to tape drive information stored concerning tape
> > problems, so I am assuming you mean messages.
> >
> >> 15-Apr 09:31 server-sd JobId 100000: Writing spooled data to Volume.
> >> Despooling 35,000,185,219 bytes ...
> >> 15-Apr 09:51 server-sd JobId 100000: End of Volume "FB0568" at 888:1414
> >> on device "SL500-Drive-1" (/dev/nst0). Write of 262144 bytes got -1.
> >> 15-Apr 09:51 server-sd JobId 100000: Re-read of last block succeeded.
> >> 15-Apr 09:51 server-sd JobId 100000: End of medium on Volume "FB0568"
> >> Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
> >> 15-Apr 09:51 server-sd JobId 100000: 3307 Issuing autochanger "unload
> >> slot 38, drive 1" command.
> >> 15-Apr 09:52 server-sd JobId 100000: 3301 Issuing autochanger "loaded?
> >> drive 1" command.
> >> 15-Apr 09:52 server-sd JobId 100000: 3302 Autochanger "loaded? drive 1",
> >> result: nothing loaded.
> >> 15-Apr 09:52 server-sd JobId 100000: 3304 Issuing autochanger "load slot
> >> 39, drive 1" command.
> >> 15-Apr 09:52 server-sd JobId 100000: 3305 Autochanger "load slot 39,
> >> drive 1", status is OK.
> >> 15-Apr 09:52 server-sd JobId 100000: Volume "FB0569" previously written,
> >> moving to end of data.
> >>
> >> Nothing but thousands of 'repetitive' alerts after that...
> >
> > What exactly is repeated?
> >
> > There was a Bacula bug, #1480, in message delivery that may be the same
> > one you are experiencing.  It was triggered by a misconfigured SMTP
> > server or by a reference in Bacula to a non-existent SMTP server, and
> > the simple solution is to make sure Bacula points to a valid, functional
> > SMTP server.  This problem was not particular to version 5.0.1, but I
> > think it was fixed after the release of 5.0.1.  Please see the bugs
> > database for more details.
> >
> > Kern
> >
> >> thanks again,
> >> Stephen
> >>
> >> On 04/15/2010 10:25 AM, Stephen Thompson wrote:
> >>> Hello,
> >>>
> >>> I have just now experienced a possible new bug with bacula 5.0.1.
> >>>
> >>> The symptoms are this:
> >>>
> >>> bacula-sd crashes
> >>> bacula-dir continues to run
> >>> bacula-dir then spews out identical "Intervention needed" emails until
> >>> manually restarted
> >>>
> >>> The first time, this happened over a weekend, and upon returning I
> >>> found my inbox had about 120,000 bacula emails, all the SAME and of
> >>> this type:
> >>>
> >>> "15-Apr 10:02 client-fd JobId 100001: Fatal error: backup.c:1048
> >>> Network send error to SD. ERR=Broken pipe"
> >>>
> >>> It happened again just now (second time since upgrading from 3.0.3 to
> >>> 5.0.1) and I managed to stop the director with only a few thousand
> >>> emails going out.
> >>>
> >>> So there are really 2 issues here:
> >>>
> >>> 1)
> >>> Why does the director apparently get stuck in an infinite loop of
> >>> sending the same email message?  Is this a known bug?
> >>>
> >>> 2)
> >>> Regarding the SD, I received one alert of this type, the rest like the
> >>> above:
> >>>
> >>>     "15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT:
> >>> dev->blocked()"
> >>>
> >>> A traceback like:
> >>> --
> >>> ptrace: Operation not permitted.
> >>> /var/bacula/work/29091: No such file or directory.
> >>> $1 = 0
> >>> /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command
> >>> file: No symbol "exename" in current context.
> >>> --
> >>>
> >>> And a bactrace like:
> >>> --
> >>> Attempt to dump current JCRs
> >>> JCR=0x19a24888 JobId=100000 name=client_1.2010-04-14_18.02.33_41
> >>> JobStatus=l use_count=1
> >>>            JobType=B JobLevel=F
> >>>            sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
> >>>            end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> >>>            db=(nil) db_batch=(nil) batch_started=0
> >>> JCR=0x1981b248 JobId=100001 name=client_10.2010-04-14_20.00.15_04
> >>> JobStatus=R
> >>>            use_count=1
> >>>            JobType=B JobLevel=I
> >>>            sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
> >>>            end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
> >>>            db=(nil) db_batch=(nil) batch_started=0
> >>> Attempt to dump plugins. Hook count=0
> >>> --
> >>>
> >>> Both clients and server seem healthy, except for the SD crash.
> >>> Any ideas?
> >>>
> >>>
> >>> thanks!
> >>> Stephen
> >>>
> >>>
> >>> -------------------------------------------------------------------------
> >>> Further info:
> >>>
> >>> My catalog...
> >>>
> >>>        mysql-5.0.77 (64bit) MyISAM
> >>>        210Gb in size
> >>>        1,412,297,215 records in File table
> >>>        note: database built with bacula 2x scripts,
> >>>        upgraded with 3x scripts, then again with 5x scripts
> >>>        (i.e. nothing customized along the way)
> >>>
> >>> My OS & hardware for bacula DIR+SD server...
> >>>
> >>>        Centos 5.4 (fully patched)
> >>>        8Gb RAM
> >>>        2Gb Swap
> >>>        1Tb EXT3 filesystem on external fiber RAID5 array
> >>>        (dedicated to database, incl. temp files)
> >>>        2 dual-core [AMD Opteron(tm) Processor 2220] CPUs
> >>>        StorageTek SL500 Library with 2 LTO3 Drives
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> -------------------------------------------------------------------------
> >>> Download Intel® Parallel Studio Eval
> >>> Try the new software tools for yourself. Speed compiling, find bugs
> >>> proactively, and fine-tune applications for parallel performance.
> >>> See why Intel Parallel Studio got high marks during beta.
> >>> http://p.sf.net/sfu/intel-sw-dev
> >>> _______________________________________________
> >>> Bacula-devel mailing list
> >>> Bacula-devel AT lists.sourceforge DOT net
> >>> https://lists.sourceforge.net/lists/listinfo/bacula-devel



_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users