Re: [Bacula-users] Dell TL4000 labeling timeout

@Marcin - dmesg is clean

@Ana - I ~think~ /dev/changer is created by udev, I'm not 100%.  It's
a symlink to sg19 in this case:

lrwxrwxrwx 1 root root 4 Jun 10 17:06 /dev/changer -> sg19
crw-rw---- 1 root disk 21, 19 Jun 10 17:06 /dev/sg19

Here are the tape drive parts of "lsscsi -l".  I removed a bunch of
HDDs attached to the system:

[2:0:8:0]    tape    IBM      ULT3580-HH6      D8E5  /dev/st0
  state=running queue_depth=254 scsi_level=7 type=1 device_blocked=0 timeout=900
[2:0:10:0]   tape    IBM      ULT3580-HH6      D8E5  /dev/st1
  state=running queue_depth=254 scsi_level=7 type=1 device_blocked=0 timeout=900
[2:0:10:1]   mediumx IBM      3573-TL          C.30  -
  state=running queue_depth=254 scsi_level=6 type=8 device_blocked=0 timeout=0

At some point we had the Tape unit powered down and then powered it
back up, but the system remained on, so the dmesg mentions the drives
multiple times:

[anoonan@odin ~]$ dmesg | grep Attached
sd 0:1:0:0: Attached scsi disk sda
scsi 0:0:0:0: Attached scsi generic sg0 type 0
scsi 0:0:1:0: Attached scsi generic sg1 type 0
sd 0:1:0:0: Attached scsi generic sg2 type 0
sd 1:0:0:0: Attached scsi disk sdb
sd 1:0:0:0: Attached scsi generic sg3 type 0
sd 1:0:0:1: Attached scsi disk sdc
sd 1:0:0:1: Attached scsi generic sg4 type 0
sd 1:0:0:2: Attached scsi disk sdd
sd 1:0:0:2: Attached scsi generic sg5 type 0
sd 1:0:0:3: Attached scsi disk sde
sd 1:0:0:3: Attached scsi generic sg6 type 0
sd 1:0:0:31: Attached scsi disk sdf
sd 1:0:0:31: Attached scsi generic sg7 type 0
sd 1:0:1:0: Attached scsi disk sdg
sd 1:0:1:0: Attached scsi generic sg8 type 0
sd 1:0:1:1: Attached scsi disk sdh
sd 1:0:1:1: Attached scsi generic sg9 type 0
sd 1:0:1:2: Attached scsi disk sdi
sd 1:0:1:2: Attached scsi generic sg10 type 0
sd 1:0:1:3: Attached scsi disk sdj
sd 1:0:1:3: Attached scsi generic sg11 type 0
sd 1:0:1:31: Attached scsi disk sdk
sd 1:0:1:31: Attached scsi generic sg12 type 0
scsi 2:0:0:0: Attached scsi generic sg13 type 1
scsi 2:0:1:0: Attached scsi generic sg14 type 1
st 2:0:0:0: Attached scsi tape st0
st 2:0:1:0: Attached scsi tape st1
st 2:0:2:0: Attached scsi tape st1
st 2:0:2:0: Attached scsi generic sg14 type 1
st 2:0:3:0: Attached scsi tape st1
st 2:0:3:0: Attached scsi generic sg14 type 1
st 2:0:4:0: Attached scsi tape st0
st 2:0:4:0: Attached scsi generic sg13 type 1
st 2:0:5:0: Attached scsi tape st1
st 2:0:5:0: Attached scsi generic sg14 type 1
scsi 2:0:5:1: Attached scsi generic sg15 type 8
st 2:0:6:0: Attached scsi tape st0
st 2:0:6:0: Attached scsi generic sg13 type 1
scsi 2:0:6:1: Attached scsi generic sg14 type 8
st 2:0:7:0: Attached scsi tape st1
st 2:0:7:0: Attached scsi generic sg15 type 1
sd 1:0:0:4: Attached scsi disk sdl
sd 1:0:0:4: Attached scsi generic sg13 type 0
sd 1:0:0:5: Attached scsi disk sdm
sd 1:0:0:5: Attached scsi generic sg14 type 0
sd 1:0:1:4: Attached scsi disk sdn
sd 1:0:1:4: Attached scsi generic sg15 type 0
sd 1:0:1:5: Attached scsi disk sdo
sd 1:0:1:5: Attached scsi generic sg16 type 0
st 2:0:8:0: Attached scsi tape st0
st 2:0:8:0: Attached scsi generic sg17 type 1
st 2:0:9:0: Attached scsi tape st1
st 2:0:9:0: Attached scsi generic sg18 type 1
scsi 2:0:9:1: Attached scsi generic sg19 type 8
st 2:0:10:0: Attached scsi tape st1
st 2:0:10:0: Attached scsi generic sg18 type 1
scsi 2:0:10:1: Attached scsi generic sg19 type 8

I did change out the device names for the drives to be /dev/tape/by-id
names instead to make sure the naming stays stable after reboots, but
I haven't tried changing /dev/changer to anything else.  The btape
tests were successful, and I haven't had any problems with mtx or even
mt commands, though as mentioned previously, I've gotten Input/Output
errors from mt when doing rewind/weof commands to the drives.  That
being said, I'm suspicious of the amount of time that mt reports
DR_OPEN on loads.  I can issue mtx-changer commands OK, as well,
though I'm not sure if the "load" command is actually returning
correctly, or is just timing out internally.  unloads/list/listall
from mtx-changer have always returned successfully.

This is a Centos 5 machine, so not very new at all.  mtx package is:
mtx-1.2.18-9, mt-st is mt-st-0.9b-4.el5

It may be a little challenge to upgrade this system (probably to
Centos 6), but not impossible if it needs to happen.

Thanks,
Andrew

On Wed, Jun 17, 2015 at 8:32 PM, Ana Emília M. Arruda
<emiliaarruda AT gmail DOT com> wrote:
> Hello Andrew,
>
> Is /dev/changer created by udev rules? Have you tried /dev/sgX instead? Can
> you send us the output of the "lsscsi -l" command and "dmesg | grep
> Attached"? Have you checked your drives/autochanger using just mtx/mt
> commands to see if they are working? Which is your mtx version?
>
> Best regards,
> Ana
>
> On Wed, Jun 17, 2015 at 7:29 PM, Marcin Haba <ganiuszka AT gmail DOT com> 
> wrote:
>>
>> Hello,
>>
>> Do you have any errors in dmesg (hardware errors, bus reset, SCSI
>> errors ... etc.) ?
>>
>> Best regards,
>> Marcin Haba (gani)
>>
>> 2015-06-17 21:56 GMT+02:00 Andrew Noonan <anoonan AT gmail DOT com>:
>> > Hi all,
>> >
>> >      It's taking a lot longer because of the higher timeouts, but the
>> > label is still failing with a termination.  If I understand it
>> > correctly, the mtx-changer script is polling with 'mt' looking for the
>> > $ready state, defined in the config file as ONLINE (for Linux).  I'm
>> > not seeing drive 0 go into that state... I just see:
>> >
>> > SCSI 2 tape drive:
>> > File number=-1, block number=-1, partition=0.
>> > Tape block size 0 bytes. Density code 0x0 (default).
>> > Soft error count since last status=0
>> > General status bits on (50000):
>> >  DR_OPEN IM_REP_EN
>> >
>> > the other device looks like:
>> >
>> > SCSI 2 tape drive:
>> > File number=0, block number=0, partition=0.
>> > Tape block size 0 bytes. Density code 0x5a (no translation).
>> > Soft error count since last status=0
>> > General status bits on (41010000):
>> >  BOT ONLINE IM_REP_EN
>> >
>> > So I see that it's ~possible~ to see the ONLINE state, but it doesn't
>> > seem like it ever gets to that state during load.
>> >
>> > Any thoughts?
>> >
>> > Thanks,
>> > Andrew
>> >
>> > On Wed, Jun 17, 2015 at 11:44 AM, Andrew Noonan <anoonan AT gmail DOT com>
>> > wrote:
>> >> Hi Ana,
>> >>
>> >>      Thanks for the reply.  I'm adding those into the drives.  BTW,
>> >> 900 is the value.  Having no real experience with these, is it
>> >> abnormal for a load to take the 10+ minutes, or is that reasonable?
>> >> My next step is to add those settings in, restart the SD, and attempt
>> >> to do a "label barcode" again.
>> >>
>> >> Thanks,
>> >> Andrew
>> >>
>> >> On Tue, Jun 16, 2015 at 9:10 PM, Ana Emília M. Arruda
>> >> <emiliaarruda AT gmail DOT com> wrote:
>> >>> Hello Andrew,
>> >>>
>> >>> You can find in the output of a "lsscsi -l" command the timeout for
>> >>> your
>> >>> drives. Then you can configure 3 timeout directives for each one of
>> >>> your two
>> >>> drives (LRADrive-1 e LRADrive-2):
>> >>>
>> >>> Maximum Changer Wait = X
>> >>> Maximum Rewind Wait = X
>> >>> Maximum Open Wait = X
>> >>>
>> >>> where X is the timeout value for your dirves.
>> >>>
>> >>> You can also customize your mtx-changer script for this timeout
>> >>> changing the
>> >>> bellow 300 seconds value:
>> >>>
>> >>> wait_for_drive() {
>> >>>   i=0
>> >>>   while [ $i -le 300 ]; do  # Wait max 300 seconds
>> >>>
>> >>> Best regards,
>> >>> Ana
>> >>>
>> >>>
>> >>> On Tue, Jun 16, 2015 at 5:02 PM, Andrew Noonan <anoonan AT gmail DOT com>
>> >>> wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> I'm almost completely new to tape.  We've been doing disk-based
>> >>>> backups for years, but we now have a project where we want to offsite
>> >>>> hundreds of TB permanently, and have a Dell TL4000 (a rebranded IBM
>> >>>> 3573-TL from the looks of it) with 2 ULT3580 LTO-6 drives.  We're
>> >>>> running bacula 5.2.  The server is a Dell 1950 running Centos 5
>> >>>> (sorry
>> >>>> for the old OS).
>> >>>>
>> >>>> The btape tests run on both units without a problem, including the
>> >>>> autochanger tests, and manually executing load/unload/list commands
>> >>>> with mtx-changer seem to run fine.  The one exception to this is that
>> >>>> the mtx-changer load command seems to take about 10 minutes to
>> >>>> complete, which seems unreasonably long.  These are brand new tapes
>> >>>> and I haven't written anything to them other then whatever btape does
>> >>>> with testing.  I put a 5 minute sleep on the load for mtx-changer,
>> >>>> but
>> >>>> other then that haven't customized the script, as I'm not sure what
>> >>>> I'd customize.
>> >>>>
>> >>>> The "update slots" command from the director works OK, but when I go
>> >>>> to do a "label barcode", the resulting "load slot" gets killed by
>> >>>> Bacula:
>> >>>>
>> >>>> 3992 Bad autochanger "load slot 20, drive 1": ERR=Child died from
>> >>>> signal 15: Termination.
>> >>>> Results=Program killed by Bacula (timeout)
>> >>>>
>> >>>> I've seen that in some of these posts to the list, this ends up being
>> >>>> permissions problems against the devices, but that doesn't seem to be
>> >>>> the case as far as I can see:
>> >>>>
>> >>>> bacula-sd is running as the bacula user/group.  The bacula user is in
>> >>>> the "disk" group, and the *st* devices are in the disk group with
>> >>>> "rw"
>> >>>> permissions:
>> >>>>
>> >>>> crw-rw---- 1 root disk 9, 128 Jun  4 12:02 /dev/nst0
>> >>>> crw-rw---- 1 root disk 9, 224 Jun  4 12:02 /dev/nst0a
>> >>>> crw-rw---- 1 root disk 9, 160 Jun  4 12:02 /dev/nst0l
>> >>>> crw-rw---- 1 root disk 9, 192 Jun  4 12:02 /dev/nst0m
>> >>>> crw-rw---- 1 root disk 9, 129 Jun 10 17:06 /dev/nst1
>> >>>> crw-rw---- 1 root disk 9, 225 Jun 10 17:06 /dev/nst1a
>> >>>> crw-rw---- 1 root disk 9, 161 Jun 10 17:06 /dev/nst1l
>> >>>> crw-rw---- 1 root disk 9, 193 Jun 10 17:06 /dev/nst1m
>> >>>> crw-rw---- 1 root disk 9,   0 Jun  4 12:02 /dev/st0
>> >>>> crw-rw---- 1 root disk 9,  96 Jun  4 12:02 /dev/st0a
>> >>>> crw-rw---- 1 root disk 9,  32 Jun  4 12:02 /dev/st0l
>> >>>> crw-rw---- 1 root disk 9,  64 Jun  4 12:02 /dev/st0m
>> >>>> crw-rw---- 1 root disk 9,   1 Jun 10 17:06 /dev/st1
>> >>>> crw-rw---- 1 root disk 9,  97 Jun 10 17:06 /dev/st1a
>> >>>> crw-rw---- 1 root disk 9,  33 Jun 10 17:06 /dev/st1l
>> >>>> crw-rw---- 1 root disk 9,  65 Jun 10 17:06 /dev/st1m
>> >>>>
>> >>>> Here's a block of debug from the SD during a label attempt for one of
>> >>>> the
>> >>>> slots:
>> >>>>
>> >>>> odin-sd: autochanger.c:434-0 Wiffle through devices looking for slot
>> >>>> odin-sd: autochanger.c:313-0 Locking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:740-0 omsg=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst0 0
>> >>>> odin-sd: autochanger.c:272-0 Run
>> >>>> program=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst0 0
>> >>>> odin-sd: watchdog.c:206-0 Registered watchdog 636b888, interval 300
>> >>>> odin-sd: bpipe.c:220-0 Wait for 28962 opt=1
>> >>>> odin-sd: bpipe.c:228-0 Got break wpid=28962 status=0 ERR=none
>> >>>> odin-sd: bpipe.c:249-0 child status=0
>> >>>> odin-sd: watchdog.c:226-0 Unregistered watchdog 636b888
>> >>>> odin-sd: bpipe.c:264-0 returning stat=0,0
>> >>>> odin-sd: autochanger.c:274-0 run_prog: /usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst0 0 stat=0 result=0
>> >>>> odin-sd: autochanger.c:327-0 Unlocking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:313-0 Locking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:740-0 omsg=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst1 1
>> >>>> odin-sd: autochanger.c:272-0 Run
>> >>>> program=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst1 1
>> >>>> odin-sd: watchdog.c:206-0 Registered watchdog 636b888, interval 300
>> >>>> odin-sd: bpipe.c:220-0 Wait for 28976 opt=1
>> >>>> odin-sd: bpipe.c:228-0 Got break wpid=28976 status=0 ERR=none
>> >>>> odin-sd: bpipe.c:249-0 child status=0
>> >>>> odin-sd: watchdog.c:226-0 Unregistered watchdog 636b888
>> >>>> odin-sd: bpipe.c:264-0 returning stat=0,0
>> >>>> odin-sd: autochanger.c:274-0 run_prog: /usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer loaded 14 /dev/nst1 1 stat=0 result=0
>> >>>> odin-sd: autochanger.c:327-0 Unlocking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:453-0 Slot=14 not found in another device
>> >>>> odin-sd: autochanger.c:313-0 Locking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:183-0 Doing changer load slot 14 "LRADrive-2"
>> >>>> (/dev/nst1)
>> >>>> odin-sd: autochanger.c:740-0 omsg=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer load 14 /dev/nst1 1
>> >>>> odin-sd: dev.c:1746-0 close_dev "LRADrive-2" (/dev/nst1)
>> >>>> odin-sd: dev.c:1751-0 device "LRADrive-2" (/dev/nst1) already closed
>> >>>> vol=
>> >>>> odin-sd: autochanger.c:190-0 Run
>> >>>> program=/usr/lib64/bacula/mtx-changer
>> >>>> /dev/changer load 14 /dev/nst1 1
>> >>>> odin-sd: watchdog.c:206-0 Registered watchdog 636b888, interval 300
>> >>>> odin-sd: bpipe.c:443-0 Run program fgets killed=1
>> >>>> odin-sd: bpipe.c:220-0 Wait for 28990 opt=1
>> >>>> odin-sd: bpipe.c:228-0 Got break wpid=28990 status=15 ERR=none
>> >>>> odin-sd: bpipe.c:256-0 Child died from signal 15
>> >>>> odin-sd: watchdog.c:235-0 Unregistered inactive watchdog 636b888
>> >>>> odin-sd: bpipe.c:264-0 returning stat=15,134217743
>> >>>> odin-sd: autochanger.c:205-0 load slot 14, drive 1, bad stats=Child
>> >>>> died from signal 15: Termination.
>> >>>> odin-sd: autochanger.c:212-0 load slot 14 status=134217743
>> >>>> odin-sd: autochanger.c:327-0 Unlocking changer LogRepoAutochanger
>> >>>> odin-sd: autochanger.c:218-0 After changer, status=134217743
>> >>>> odin-sd: dev.c:1735-0 Clear volhdr vol=
>> >>>> odin-sd: vol_mgr.c:544-0 vol_unused: no vol on "LRADrive-2"
>> >>>> (/dev/nst1)
>> >>>> odin-sd: lock.c:302-0 return lock. old=BST_WRITING_LABEL from
>> >>>> dircmd.c:554
>> >>>> odin-sd: lock.c:307-0 return lock. new=BST_NOT_BLOCKED
>> >>>> odin-sd: dev.c:1746-0 close_dev "LRADrive-2" (/dev/nst1)
>> >>>> odin-sd: dev.c:1751-0 device "LRADrive-2" (/dev/nst1) already closed
>> >>>> vol=
>> >>>> odin-sd: acquire.c:731-0 Enter detach_dcr_from_dev
>> >>>> odin-sd: dircmd.c:220-0 <dird: label LogRepoAutochanger
>> >>>> VolumeName=000030L6 PoolName=LogrepoArchive MediaType=LTO-6 Slot=15
>> >>>> drive=1
>> >>>> odin-sd: dircmd.c:234-0 Do command: label
>> >>>> odin-sd: dircmd.c:627-0 Try changer device LRADrive-1
>> >>>> odin-sd: dircmd.c:648-0 Device LogRepoAutochanger drive wrong: want=1
>> >>>> got=0 skipping
>> >>>> odin-sd: dircmd.c:627-0 Try changer device LRADrive-2
>> >>>> odin-sd: dircmd.c:643-0 Found changer device LRADrive-2
>> >>>> odin-sd: dircmd.c:656-0 Found device LRADrive-2
>> >>>> odin-sd: block.c:144-0 Returning new block=636b800
>> >>>> odin-sd: acquire.c:713-0 JobId=0 enter attach_dcr_to_dev
>> >>>> odin-sd: dircmd.c:421-0 Can label. Device is not open
>> >>>> odin-sd: lock.c:285-0 steal lock. old=BST_NOT_BLOCKED from
>> >>>> dircmd.c:470
>> >>>> odin-sd: lock.c:290-0 steal lock. new=BST_WRITING_LABEL
>> >>>> odin-sd: dircmd.c:471-0 Stole device "LRADrive-2" (/dev/nst1) lock,
>> >>>> writing label.
>> >>>>
>> >>>> The config I've got for these is:
>> >>>>
>> >>>> Device {
>> >>>>   Name = LRADrive-1
>> >>>>   Alert Command = "sh -c 'smartctl -H -l error %c'"
>> >>>>   AlwaysOpen = yes
>> >>>>   ArchiveDevice = /dev/nst0
>> >>>>   AutoChanger = yes
>> >>>>   AutomaticMount = yes
>> >>>>   DeviceType = Tape
>> >>>>   DriveIndex = 0
>> >>>>   LabelMedia = no
>> >>>>   MediaType = LTO-6
>> >>>>   RandomAccess = no
>> >>>>   RemovableMedia = yes
>> >>>> }
>> >>>>
>> >>>> Device {
>> >>>>   Name = LRADrive-2
>> >>>>   Alert Command = "sh -c 'smartctl -H -l error %c'"
>> >>>>   AlwaysOpen = yes
>> >>>>   ArchiveDevice = /dev/nst1
>> >>>>   AutoChanger = yes
>> >>>>   AutomaticMount = yes
>> >>>>   DeviceType = Tape
>> >>>>   DriveIndex = 1
>> >>>>   LabelMedia = no
>> >>>>   MediaType = LTO-6
>> >>>>   RandomAccess = no
>> >>>>   RemovableMedia = yes
>> >>>> }
>> >>>>
>> >>>> Autochanger {
>> >>>>   Name = LogRepoAutochanger
>> >>>>   ChangerCommand = "/usr/lib64/bacula/mtx-changer %c %o %S %a %d"
>> >>>>   ChangerDevice = /dev/changer
>> >>>>   Device = LRADrive-1
>> >>>>   Device = LRADrive-2
>> >>>> }
>> >>>>
>> >>>>
>> >>>> I know there are some things that could be optimized here for
>> >>>> performance, and I'm certainly interested in them, but right now I
>> >>>> can't even label my tapes :)
>> >>>>
>> >>>> I suspect it's the long load delay, and I wasn't sure if maybe the
>> >>>> drive is searching for some mark or something.  On that note, I tried
>> >>>> to do a "rewind" and "weof" using the /dev/st0 device (wasn't sure if
>> >>>> nst0 would complain about issuing a rewind), but I would get
>> >>>> "Input/Output error" messages from mt on both the rewind and weof
>> >>>> commands.
>> >>>>
>> >>>> Any advice I could get would be helpful.
>> >>>>
>> >>>> Thanks!
>> >>>> Andrew
>> >>>>
>> >>>>
>> >>>>
>> >>>> ------------------------------------------------------------------------------
>> >>>> _______________________________________________
>> >>>> Bacula-users mailing list
>> >>>> Bacula-users AT lists.sourceforge DOT net
>> >>>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>> >>>
>> >>>
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > _______________________________________________
>> > Bacula-users mailing list
>> > Bacula-users AT lists.sourceforge DOT net
>> > https://lists.sourceforge.net/lists/listinfo/bacula-users
>>
>>
>>
>> --
>> "Większej miłości nikt nie ma nad tę, jak gdy kto życie swoje kładzie
>> za przyjaciół swoich." Jezus Chrystus
>>
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Bacula-users mailing list
>> Bacula-users AT lists.sourceforge DOT net
>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>
>

------------------------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users