Bacula-users

Re: [Bacula-users] Fatal error: askdir.c:340 NULL Volume name. This shouldn't happen!!!

2014-02-13 14:56:48
Subject: Re: [Bacula-users] Fatal error: askdir.c:340 NULL Volume name. This shouldn't happen!!!
From: "Clark, Patricia A." <clarkpa AT ornl DOT gov>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Thu, 13 Feb 2014 14:54:07 -0500
On 1/28/14 4:22 AM, "Wolfgang Denk" <wd AT denx DOT de> wrote:


>Hello,
>
>for some time, I'm facing the situation that 2 specific backup jobs
>(out of a total of some 30+) fail every now and then with the error
>message:
>
>Fatal error: askdir.c:340 NULL Volume name. This shouldn't happen!!!
>
>Re-running the jobs works fine.
>
>Backup is done to a LTO library with 2 tape drives.   Here is an example
>log that shows what is happening:
>
>1) I start two simultaneous backup jobs.
>
>    *run level=Incremental pool=INCR Hercules-Root
>    Run Backup job
>    JobName:  Hercules-Root
>    Level:    Incremental
>    Client:   hercules-fd
>    FileSet:  Hercules Root
>    Pool:     INCR (From User input)
>    Storage:  LTOLIB (From Job resource)
>    When:     2014-01-28 09:41:11
>    Priority: 10
>    OK to run? (yes/mod/no): yes
>    Job queued. JobId=71138
>    *run level=Incremental pool=INCR Pollux-Root
>    Run Backup job
>    JobName:  Pollux-Root
>    Level:    Incremental
>    Client:   pollux-fd
>    FileSet:  Pollux Root
>    Pool:     INCR (From User input)
>    Storage:  LTOLIB (From Job resource)
>    When:     2014-01-28 09:41:19
>    Priority: 10
>    OK to run? (yes/mod/no): yes
>    Job queued. JobId=71139
>
>2) The DIR decides to allocate each job to one of the two available
>   tape drives.  At this point of time, the following tapes are loaded
>   to the tape drives:
>
>       drive 0: slot 26
>       drive 1: slot 32
>
>   Both tapes are in the correct pool (INCR) and have status "Append".
>
>   JobId 71138 will use drive 0,
>   JobId 71139 will use drive 1.
>
>    28-Jan 09:41 mneme-dir JobId 71138: Start Backup JobId 71138,
>Job=Hercules-Root.2014-01-28_09.41.13_25
>    28-Jan 09:41 mneme-dir JobId 71138: Using Device "LTO3-0" to write.
>    28-Jan 09:41 ltos-sd JobId 71138: 3307 Issuing autochanger "unload
>slot 32, drive 1" command.
>    28-Jan 09:41 mneme-dir JobId 71139: Start Backup JobId 71139,
>Job=Pollux-Root.2014-01-28_09.41.21_26
>    28-Jan 09:41 mneme-dir JobId 71139: Using Device "LTO3-1" to write.
>    28-Jan 09:43 ltos-sd JobId 71139: Fatal error: askdir.c:340 NULL
>Volume name. This shouldn't happen!!!
>    28-Jan 09:43 ltos-sd JobId 71139: Spooling data ...
>    28-Jan 09:44 ltos-sd JobId 71139: Elapsed time=00:01:23, Transfer
>rate=0  Bytes/second
>
>3) JobId 71139 fails with "NULL Volume name" error.  Note that before
>   the tape originally loaded in this drive has been unloaded.
>
>    28-Jan 09:44 pollux-fd JobId 71139: Error: bsock.c:429 Write error
>sending 8 bytes to Storage daemon:ltos.denx.de:9103: ERR=Connection reset
>by peer
>    28-Jan 09:44 pollux-fd JobId 71139: Fatal error: xattr.c:98 Network
>send error to SD. ERR=Connection reset by peer
>    28-Jan 09:44 mneme-dir JobId 71139: Error: Bacula mneme-dir 5.2.13
>(19Jan13):
>      Build OS:               x86_64-redhat-linux-gnu redhat
>      JobId:                  71139
>      Job:                    Pollux-Root.2014-01-28_09.41.21_26
>      Backup Level:           Incremental, since=2014-01-28 00:06:14
>      Client:                 "pollux-fd" 5.2.13 (19Jan13)
>x86_64-redhat-linux-gnu,redhat,Cat)
>      FileSet:                "Pollux Root" 2005-12-22 11:06:26
>      Pool:                   "INCR" (From User input)
>      Catalog:                "MyCatalog" (From Client resource)
>      Storage:                "LTOLIB" (From Job resource)
>      Scheduled time:         28-Jan-2014 09:41:19
>      Start time:             28-Jan-2014 09:41:23
>      End time:               28-Jan-2014 09:44:47
>      Elapsed time:           3 mins 24 secs
>      Priority:               10
>      FD Files Written:       2
>      SD Files Written:       0
>      FD Bytes Written:       0 (0 B)
>      SD Bytes Written:       0 (0 B)
>      Rate:                   0.0 KB/s
>      Software Compression:   None
>      VSS:                    no
>      Encryption:             no
>      Accurate:               no
>      Volume name(s):
>      Volume Session Id:      275
>      Volume Session Time:    1390085323
>      Last Volume Bytes:      16,678,095,872 (16.67 GB)
>      Non-fatal FD errors:    1
>      SD Errors:              1
>      FD termination status:  Error
>      SD termination status:  Error
>      Termination:            *** Backup Error ***
>
>4) The other (first started) job now also unloads the tape, and then
>   load the one previously unloaded from the other drive:
>
>    28-Jan 09:44 ltos-sd JobId 71138: 3307 Issuing autochanger "unload
>slot 26, drive 0" command.
>    28-Jan 09:44 ltos-sd JobId 71138: 3304 Issuing autochanger "load slot
>32, drive 0" command.
>    28-Jan 09:45 ltos-sd JobId 71138: 3305 Autochanger "load slot 32,
>drive 0", status is OK.
>    28-Jan 09:45 ltos-sd JobId 71138: Volume "INC007L3" previously
>written, moving to end of data.
>    28-Jan 09:47 ltos-sd JobId 71138: Ready to append to end of Volume
>"INC007L3" at file=6.
>    28-Jan 09:47 ltos-sd JobId 71138: Spooling data ...
>
>
>To me it appears as if the DIR does not correctly take into account
>which tape is loaded where.  It sees two tape drives, assignes one of
>them for the first starting job, but then decides not to use the
>currently mounted tape (which would be perfectly fine from all
>criteria like Pool, Status, Use Days etc. - actually will later be
>loaded into the other drive to run more jobs), but to use the one
>which is currently loaded in the _other_ drive.  It starts to unload
>from the other drive.
>
>Now the second job starts running and finds that someone pulled the
>tape out from under it, and it fails.
>
>
>Does my interpretation make sense?
>
>Is this a common problem, or am I doing something wrong?
>
>
>The "interesting" thing is that it's always the same 2 jobs out of my
>list which are candidates for this error.  And it does not always
>happen - maybe 2 times per week or so...
>
>All this is with 5.2.13 on Fedora 20 systems...
>
>
>All help / ideas welcome.  Thanks in advance.
>
>Best regards,
>
>Wolfgang Denk
>
>-- 
>DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
>HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
>Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd AT denx DOT de
>The following statement is not true.  The previous statement is true.
>>>>>>
I posted a similar notice 10/31/2013.  Beyond the no responses, there is
no bug report which means no fix is pending.  Signing up to make a bug
report is also broken and has been mentioned on this list.  No message is
sent that allows for verification and setting up an account.

I can suggest a work-around that reduces the incidence of tape stealing by
the storage director from one job for another - set their job schedules
with an offset.  I have used a 15 minute offset for jobs that had started
at the same time.  I still have the "NULL Volume name" happening, but with
less frequency.  I just look for it and rerun the affected jobs.  My
current biggest issue is that the storage daemon doesn't see the job
failure and release the tape resource unless I restart the bacula
processes.  There appears to be a bug report and a fix pending release for
that issue.

Patti Clark
Linux System Administrator
Oak Ridge National Laboratory


------------------------------------------------------------------------------
Android apps run on BlackBerry 10
Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
Now with support for Jelly Bean, Bluetooth, Mapview and more.
Get your Android app in front of a whole new audience.  Start now.
http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users