Bacula-users

Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??

2010-04-15 16:18:58
Subject: Re: [Bacula-users] [Bacula-devel] 5.0.1 infinite email loop bug??
From: Stephen Thompson <stephen AT seismo.berkeley DOT edu>
To: Kern Sibbald <kern AT sibbald DOT com>
Date: Thu, 15 Apr 2010 13:16:46 -0700
Hello,

Thanks for the response.

No, it has nothing to do with the mail configuration; I'm 100% sure of that.
(I know people say that all the time, but seriously, it's the director.)

And by alerts, I do mean "Messages" in the bacula vernacular.

The first time this crash happened, we received 120,000 Messages in the 
form of emails to our administrative account.  The messages were 
identical both to each other and to the content of the $JOB.mail file in 
our bacula working directory (which is never removed automatically after 
one of these crashes - perhaps that causes the endless cycle).  The same 
Message also appears to be written to our bacula log file each time an 
email is generated (or vice versa).

It seems to me that the director can get stuck in a
loop and send the contents of that mail file again and again,
indefinitely.  Both times the SD has crashed (both crashes happened after
upgrading to 5.0.1), the only thing that stopped the Message generation
was stopping the director itself.

Of course, that's just the annoying symptom.  The more serious problem is
the crash of our SD.  Any pointers on getting "ptrace" working with the
automatic traceback scripts?

thanks!
Stephen

On 04/15/2010 12:40 PM, Kern Sibbald wrote:
> On Thursday 15 April 2010 19:36:51 Stephen Thompson wrote:
>> Additionally, it seems the SD was possibly reading a new,
>> freshly labeled tape when it crashed...  The last items in the bacula log,
>> besides the alerts already mentioned:
>
> In Bacula, "alerts" refers to information stored about tape drive
> problems, so I am assuming you mean messages.
>
>>
>>
>> 15-Apr 09:31 server-sd JobId 100000: Writing spooled data to Volume. Despooling 35,000,185,219 bytes ...
>> 15-Apr 09:51 server-sd JobId 100000: End of Volume "FB0568" at 888:1414 on device "SL500-Drive-1" (/dev/nst0). Write of 262144 bytes got -1.
>> 15-Apr 09:51 server-sd JobId 100000: Re-read of last block succeeded.
>> 15-Apr 09:51 server-sd JobId 100000: End of medium on Volume "FB0568" Bytes=887,261,470,720 Blocks=3,384,635 at 15-Apr-2010 09:51.
>> 15-Apr 09:51 server-sd JobId 100000: 3307 Issuing autochanger "unload slot 38, drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3301 Issuing autochanger "loaded? drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3302 Autochanger "loaded? drive 1", result: nothing loaded.
>> 15-Apr 09:52 server-sd JobId 100000: 3304 Issuing autochanger "load slot 39, drive 1" command.
>> 15-Apr 09:52 server-sd JobId 100000: 3305 Autochanger "load slot 39, drive 1", status is OK.
>> 15-Apr 09:52 server-sd JobId 100000: Volume "FB0569" previously written, moving to end of data.
>>
>> Nothing but thousands of 'repetitive' alerts after that...
>
> What exactly is repeated?
>
> There was a Bacula bug, #1480, in message delivery that may be the same one
> you are experiencing.  It was triggered by a misconfigured SMTP server or by
> Bacula referencing a non-existent SMTP server, and the simple solution is to
> make sure Bacula points to a valid, functional SMTP server.  This problem
> was not particular to version 5.0.1, but I think it was fixed after the
> release of 5.0.1.  Please see the bugs database for more details.
>
> Kern
>
>>
>> thanks again,
>> Stephen
>>
>> On 04/15/2010 10:25 AM, Stephen Thompson wrote:
>>> Hello,
>>>
>>> I have just now experienced a possible new bug with bacula 5.0.1.
>>>
>>> The symptoms are as follows:
>>>
>>> bacula-sd crashes
>>> bacula-dir continues to run
>>> bacula-dir then spews out identical "Intervention needed" emails until
>>> manually restarted
>>>
>>> The first time, this happened over a weekend, and upon returning I found
>>> my inbox had about 120,000 bacula emails, all the SAME and of this type:
>>>
>>> "15-Apr 10:02 client-fd JobId 100001: Fatal error: backup.c:1048 Network send error to SD. ERR=Broken pipe"
>>>
>>> It happened again just now (second time since upgrading from 3.0.3 to
>>> 5.0.1) and I managed to stop the director with only a few thousand
>>> emails going out.
>>>
>>> So there are really 2 issues here:
>>>
>>> 1)
>>> Why does the director apparently get stuck in an infinite loop of
>>> sending the same email message?  Is this a known bug?
>>>
>>> 2)
>>> Regarding the SD, I received one alert of this type; the rest were like
>>> the one above:
>>>
>>>     "15-Apr 10:02 server-sd: ERROR in lock.c:268 Failed ASSERT: dev->blocked()"
>>>
>>> A traceback like:
>>> --
>>> ptrace: Operation not permitted.
>>> /var/bacula/work/29091: No such file or directory.
>>> $1 = 0
>>> /opt/bacula-5.0.1/scripts/btraceback.gdb:2: Error in sourced command
>>> file: No symbol "exename" in current context.
>>> --
>>>
>>> And a bactrace like:
>>> --
>>> Attempt to dump current JCRs
>>> JCR=0x19a24888 JobId=100000 name=client_1.2010-04-14_18.02.33_41
>>> JobStatus=l use_count=1
>>>            JobType=B JobLevel=F
>>>            sched_time=14-Apr-2010 21:35 start_time=14-Apr-2010 21:35
>>>            end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
>>>            db=(nil) db_batch=(nil) batch_started=0
>>> JCR=0x1981b248 JobId=100001 name=client_10.2010-04-14_20.00.15_04
>>> JobStatus=R
>>>            use_count=1
>>>            JobType=B JobLevel=I
>>>            sched_time=15-Apr-2010 09:15 start_time=15-Apr-2010 09:15
>>>            end_time=31-Dec-1969 16:00 wait_time=31-Dec-1969 16:00
>>>            db=(nil) db_batch=(nil) batch_started=0
>>> Attempt to dump plugins. Hook count=0
>>> --
>>>
>>> Both clients and server seem healthy, except for the SD crash.
>>> Any ideas?
>>>
>>>
>>> thanks!
>>> Stephen
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Further info:
>>>
>>> My catalog...
>>>
>>>        mysql-5.0.77 (64bit) MyISAM
>>>        210 GB in size
>>>        1,412,297,215 records in File table
>>>        note: database built with bacula 2.x scripts,
>>>        upgraded with 3.x scripts, then again with 5.x scripts
>>>        (i.e. nothing customized along the way)
>>>
>>> My OS & hardware for the bacula DIR+SD server...
>>>
>>>        CentOS 5.4 (fully patched)
>>>        8 GB RAM
>>>        2 GB swap
>>>        1 TB ext3 filesystem on external fiber RAID5 array
>>>        (dedicated to database, incl. temp files)
>>>        2 dual-core [AMD Opteron(tm) Processor 2220] CPUs
>>>        StorageTek SL500 Library with 2 LTO3 Drives
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Download Intel® Parallel Studio Eval
>>> Try the new software tools for yourself. Speed compiling, find bugs
>>> proactively, and fine-tune applications for parallel performance.
>>> See why Intel Parallel Studio got high marks during beta.
>>> http://p.sf.net/sfu/intel-sw-dev
>>> _______________________________________________
>>> Bacula-devel mailing list
>>> Bacula-devel AT lists.sourceforge DOT net
>>> https://lists.sourceforge.net/lists/listinfo/bacula-devel
>


-- 
Stephen Thompson               Berkeley Seismological Laboratory
stephen AT seismo.berkeley DOT edu    215 McCone Hall # 4760
404.538.7077 (phone)           University of California, Berkeley
510.643.5811 (fax)             Berkeley, CA 94720-4760

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users
