Bacula-users

Re: [Bacula-users] bacula hang waiting for storage

2008-12-02 14:30:27
Subject: Re: [Bacula-users] bacula hang waiting for storage
From: Nils Blanck-Wehde <nils.blanck-wehde AT backofficeservice DOT biz>
Date: Tue, 02 Dec 2008 20:25:32 +0100
Julien, hi Pasi,

it looks like we have similar problems. But to be honest I have no real 
clue whats causing them. Right now it could be anything from a HBA 
driver problem to a bug in bacula.

I will just give you some details about our setup so maybe we can find 
similarities.

I just looked in my old system log. One time the "Error writing final 
EOF" happened, I got the following kernel message:

    scsi 2:0:5:1: scsi: Device offlined - not ready after error recovery

The exact text of this messages seems to come from the adaptec aic7xxx 
kernel module so if you use another host controller, the message may vary.
You will find the complete snip of my syslog attached.

Our System Setup:
bacula 2.4.2 (fschwarz EL5 rpms) on CentOS 5.2 (kernel  
2.6.18-92.1.10.el5 #1 SMP Tue Aug 5 07:41:53 EDT 2008 i686 i686 i386 
GNU/Linux)
We are using a Quantum Superloader 3 DLT connected to an Adaptec 29160LP 
PCI64 controller sitting in a 32bit PCI slot. Our backup server has a 
Jetway VIA C7 MiniITX Board. Due to the 1U chassis the HBA sits in a 90° 
PCI riser card. I think that the risercard COULD lead to timing problems 
but I am not sure.

If you guys use a similar setup we might be able to narrow the 
phenomenon down.

Best regards, Nils





Nov  8 15:38:15 bacula-Backupserver kernel: st0: Current: sense key: 
Medium Error
Nov  8 15:38:15 bacula-Backupserver kernel:     Add. Sense: Write error
Nov  8 15:38:15 bacula-Backupserver kernel:
Nov  8 15:53:15 bacula-Backupserver kernel: st 2:0:5:0: Attempting to 
queue an ABORT message
Nov  8 15:53:15 bacula-Backupserver kernel: CDB: 0x10 0x0 0x0 0x0 0x1 0x0
Nov  8 15:53:15 bacula-Backupserver kernel: scsi2: At time of recovery, 
card was not paused
Nov  8 15:53:15 bacula-Backupserver kernel: >>>>>>>>>>>>>>>>>> Dump Card 
State Begins <<<<<<<<<<<<<<<<<
Nov  8 15:53:15 bacula-Backupserver kernel: scsi2: Dumping Card State 
while idle, at SEQADDR 0x9
Nov  8 15:53:15 bacula-Backupserver kernel: Card was paused
Nov  8 15:53:15 bacula-Backupserver kernel: ACCUM = 0x4, SINDEX = 0x57, 
DINDEX = 0x26, ARG_2 = 0x0
Nov  8 15:53:15 bacula-Backupserver kernel: HCNT = 0x0 SCBPTR = 0x0
Nov  8 15:53:15 bacula-Backupserver kernel: SCSIPHASE[0x0] SCSISIGI[0x0] 
ERROR[0x0] SCSIBUSL[0x0]
Nov  8 15:53:15 bacula-Backupserver kernel: LASTPHASE[0x1] SCSISEQ[0x12] 
SBLKCTL[0xa] SCSIRATE[0x0]
Nov  8 15:53:15 bacula-Backupserver kernel: SEQCTL[0x10] SEQ_FLAGS[0xc0] 
SSTAT0[0x0] SSTAT1[0x8]
Nov  8 15:53:15 bacula-Backupserver kernel: SSTAT2[0x0] SSTAT3[0x0] 
SIMODE0[0x8] SIMODE1[0xa4]
Nov  8 15:53:15 bacula-Backupserver kernel: SXFRCTL0[0x80] DFCNTRL[0x0] 
DFSTATUS[0x89]
Nov  8 15:53:15 bacula-Backupserver kernel: STACK: 0x0 0x164 0x179 0x3
Nov  8 15:53:15 bacula-Backupserver kernel: SCB count = 4
Nov  8 15:53:15 bacula-Backupserver kernel: Kernel NEXTQSCB = 2
Nov  8 15:53:15 bacula-Backupserver kernel: Card NEXTQSCB = 2
Nov  8 15:53:15 bacula-Backupserver kernel: QINFIFO entries:
Nov  8 15:53:15 bacula-Backupserver kernel: Waiting Queue entries:
Nov  8 15:53:15 bacula-Backupserver kernel: Disconnected Queue entries: 0:3
Nov  8 15:53:15 bacula-Backupserver kernel: QOUTFIFO entries:
Nov  8 15:53:15 bacula-Backupserver kernel: Sequencer Free SCB List: 1 2 
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
29 30 31
Nov  8 15:53:15 bacula-Backupserver kernel: Sequencer SCB Info:
Nov  8 15:53:15 bacula-Backupserver kernel:   0 SCB_CONTROL[0x44] 
SCB_SCSIID[0x57] SCB_LUN[0x0] SCB_TAG[0x3]
Nov  8 15:53:15 bacula-Backupserver kernel:   1 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   2 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   3 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   4 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   5 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   6 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   7 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   8 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:   9 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  10 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  11 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  12 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  13 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  14 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  15 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  16 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  17 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  18 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  19 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  20 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  21 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  22 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  23 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  24 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  25 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  26 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  27 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  28 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  29 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  30 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel:  31 SCB_CONTROL[0x0] 
SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Nov  8 15:53:15 bacula-Backupserver kernel: Pending list:
Nov  8 15:53:15 bacula-Backupserver kernel:   3 SCB_CONTROL[0x40] 
SCB_SCSIID[0x57] SCB_LUN[0x0]
Nov  8 15:53:15 bacula-Backupserver kernel: Kernel Free SCB list: 1 0
Nov  8 15:53:15 bacula-Backupserver kernel: Untagged Q(5): 3
Nov  8 15:53:15 bacula-Backupserver kernel:
Nov  8 15:53:15 bacula-Backupserver kernel: <<<<<<<<<<<<<<<<< Dump Card 
State Ends >>>>>>>>>>>>>>>>>>
Nov  8 15:53:15 bacula-Backupserver kernel: (scsi2:A:5:0): Device is 
disconnected, re-queuing SCB
Nov  8 15:53:15 bacula-Backupserver kernel: Recovery code sleeping
Nov  8 15:53:15 bacula-Backupserver kernel: Recovery SCB completes
Nov  8 15:53:15 bacula-Backupserver kernel: Recovery code awake
Nov  8 15:53:15 bacula-Backupserver kernel: aic7xxx_abort returns 0x2002
Nov  8 15:53:15 bacula-Backupserver kernel: st 2:0:5:0: Attempting to 
queue a TARGET RESET message
Nov  8 15:53:15 bacula-Backupserver kernel: CDB: 0x10 0x0 0x0 0x0 0x1 0x0
Nov  8 15:53:15 bacula-Backupserver kernel: st 2:0:5:0: Command not found
Nov  8 15:53:15 bacula-Backupserver kernel: aic7xxx_dev_reset returns 0x2002
Nov  8 15:53:26 bacula-Backupserver kernel: st 2:0:5:0: scsi: Device 
offlined - not ready after error recovery
Nov  8 15:53:26 bacula-Backupserver kernel: st0: Error 10000 (sugg. bt 
0x0, driver bt 0x0, host bt 0x1).



Julien Cigar schrieb:
> Same problem here with a Sony SDX-700C
>
> On Tue, 2008-12-02 at 15:30 +0100, Nils Blanck-Wehde wrote:
>   
>> Hi!
>>
>> Just wanted to let you know that I came across the exact same error 
>> Error writing final EOF to tape. This Volume may not be readable.
>> a couple of times with 2.4.2 using a Quantum DLT VS1 drive connected
>> to Adaptec 29160LP. I don't think that the tape is really defective as
>> bacula states. I could do working backups on these tapes later. Maybe
>> its a problem with positioning (forwarding) the tape to the right
>> position?
>> If there is still interest in this issue I might search for the
>> corresponding job-output.
>>
>> Nils
>>
>> Pasi Kärkkäinen schrieb: 
>>     
>>> On Thu, Nov 27, 2008 at 05:53:41PM +0100, Arno Lehmann wrote:
>>>   
>>>       
>>>> Hi,
>>>>
>>>> 27.11.2008 15:10, Pasi Kärkkäinen wrote:
>>>>     
>>>>         
>>>>> On Thu, Nov 27, 2008 at 08:14:45AM +0100, Arno Lehmann wrote:
>>>>>       
>>>>>           
>>>>>> Hi,
>>>>>>
>>>>>> 26.11.2008 21:22, Bob Hetzel wrote:
>>>>>>         
>>>>>>             
>>>>>>> I've got bacula currently in a hung state with the following 
>>>>>>> interesting 
>>>>>>> info.  When I run a status storage produces the following...
>>>>>>>           
>>>>>>>               
>>>>>> Is your Bacula still stuck? If so, and you have gdb installed, and a 
>>>>>> Bacula with debug symbols, now might be a good time to see what it's 
>>>>>> doing...
>>>>>>         
>>>>>>             
>>>> ...
>>>>     
>>>>         
>>>>> I have also seen this lately.. but that was with Bacula 2.5.18.
>>>>>
>>>>> I could make that hang happen multiple times, but I'm not totally sure 
>>>>> what
>>>>> caused that..
>>>>>       
>>>>>           
>>>> Well, if you can recreate the issue it's worth the effort building 
>>>> Bacula with debug information so you get usable backtraces.
>>>>
>>>> If the problem happens again, you can use gdb to create a backtrace, 
>>>> showing the developers more details about what happens and thus 
>>>> enabling them to fix the issue.
>>>>
>>>> I would recommend that now.
>>>>
>>>>     
>>>>         
>>> And now it's stuck again.. 
>>>
>>> Last output in bconsole:
>>>
>>> 01-Dec 20:01 bacula-sd JobId 4231: Forward spacing Volume "Pool4-Vol-0111" 
>>> to file:block 0:218.
>>> 01-Dec 20:04 bacula-sd JobId 4231: Error: block.c:568 Write error at 
>>> 509:3263 on device "IBM-LTO3-Drive" (/dev/nst0). ERR=Input/output error.
>>> 01-Dec 20:04 bacula-sd JobId 4231: Error: Error writing final EOF to tape. 
>>> This Volume may not be readable. dev.c:1723 ioctl MTWEOF error on 
>>> "IBM-LTO3-Drive" (/dev/nst0). ERR=Input/output error.
>>> 01-Dec 20:04 bacula-sd JobId 4231: End of medium on Volume "807NNNL3" 
>>> Bytes=482,782,454,784 Blocks=7,483,606 at 01-Dec-2008 20:04.
>>> 01-Dec 20:04 bacula-sd JobId 4231: 3307 Issuing autochanger "unload slot 7, 
>>> drive 0" command.
>>>
>>> bconsole is still usable after this..
>>>
>>> "sta director" shows a lot of jobs waiting for execution (since this was a
>>> 'copy pool uncopied jobs to tape'-job), but nothing happens really.
>>>
>>>
>>> "sta storage" makes bconsole hang.. last output:
>>>
>>> Device status:
>>> Autochanger "IBM-LTO3-AutoChanger" with devices:
>>>    "IBM-LTO3-Drive" (/dev/nst0)
>>> Device "FSDevice0" (/mnt/backup1/pool00) is not open.
>>> Device "FSDevice1" (/mnt/backup1/pool01) is not open.
>>> Device "FSDevice2" (/mnt/backup1/pool02) is not open.
>>> Device "FSDevice3" (/mnt/backup1/pool03) is not open.
>>> Device "FSDevice4" (/mnt/backup1/pool04) is mounted with:
>>>     Volume:      Pool4-Vol-0111
>>>     Pool:        *unknown*
>>>     Media type:  File4
>>>     Total Bytes Read=3,848,656,896 Blocks Read=59,658 Bytes/block=64,512
>>>     Positioned at File=0 Block=3,848,592,601
>>> Device "IBM-LTO3-Drive" (/dev/nst0) is not open.
>>>     Device is being initialized.
>>>     Drive 0 is not loaded.
>>> ====
>>>
>>> Used Volume status:
>>>
>>> <hangs here, have to kill the bconsole>
>>>
>>> What kind of backtrace do you want? From which daemon? bacula-sd? 
>>>
>>> -- Pasi
>>>       

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users