[Bacula-users] Timeout problem with spooled jobs

From: "Jeffrey R. Lang" <JRLang AT uwyo DOT edu>
To: bacula Users <bacula-users AT lists.sourceforge DOT net>
Date: Thu, 3 Sep 2015 23:16:04 +0000
First, let me say thanks to Kern and all those who have helped make
Bacula a great tool.

My backup environment consists of a server, a VTL, and a tape library
connected by a 10 GbE network, running Bacula 5.2.13. I plan to upgrade
once the tape library is integrated and things are working; that should
be a good starting point for an upgrade.

My issue is that I have enabled job spooling for jobs destined for the
tape library, but these jobs always time out after the first spooled
block of data is written to tape.  Here's an example:

03-Sep 11:19 bkupsvr2-dir JobId 12570: No prior Full backup Job record found.
03-Sep 11:19 bkupsvr2-dir JobId 12570: No prior or suitable Full backup found 
in catalog. Doing FULL backup.
03-Sep 11:19 bkupsvr2-dir JobId 12570: Start Backup JobId 12570, 
Job=bighorn-home.2015-09-03_11.19.32_03
03-Sep 11:20 bkupsvr2-dir JobId 12570: Using Device "LTO5-0" to write.
03-Sep 11:21 bkupsvr2-sd JobId 12570: 3304 Issuing autochanger "load slot 94, 
drive 0" command.
03-Sep 11:22 bkupsvr2-sd JobId 12570: 3305 Autochanger "load slot 94, drive 0", 
status is OK.
03-Sep 11:22 bkupsvr2-sd JobId 12570: Volume "000094L5" previously written, 
moving to end of data.
03-Sep 11:23 bkupsvr2-sd JobId 12570: Ready to append to end of Volume 
"000094L5" at file=1724.
03-Sep 11:23 bkupsvr2-sd JobId 12570: Spooling data ...
03-Sep 15:07 bkupsvr2-sd JobId 12570: User specified Device spool size reached: 
DevSpoolSize=322,122,610,512 MaxDevSpoolSize=322,122,547,200
03-Sep 15:07 bkupsvr2-sd JobId 12570: Writing spooled data to Volume. 
Despooling 322,122,610,512 bytes ...
03-Sep 15:23 mmcnsd4-fd JobId 12570: Error: bsock.c:429 Write error sending 
253977 bytes to Storage daemon:bkupsvr2.gg.uwyo.edu:9103: ERR=Connection timed 
out
03-Sep 15:23 mmcnsd4-fd JobId 12570: Fatal error: backup.c:1200 Network send 
error to SD. ERR=Connection timed out
03-Sep 15:23 bkupsvr2-dir JobId 12570: Error: Director's comm line to SD 
dropped.
03-Sep 15:23 bkupsvr2-dir JobId 12570: Error: Bacula bkupsvr2-dir 5.2.13 
(19Jan13):
  Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
  JobId:                  12570
  Job:                    bighorn-home.2015-09-03_11.19.32_03
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "mmonsd4-fd" 5.2.13 (19Jan13) 
x86_64-unknown-linux-gnu,redhat,
  FileSet:                "bighorn-home" 2015-06-12 08:21:10
  Pool:                   "ARCC" (From Job resource)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "NEO4200" (From Pool resource)
  Scheduled time:         03-Sep-2015 11:19:31
  Start time:             03-Sep-2015 11:20:26
  End time:               03-Sep-2015 15:23:28
  Elapsed time:           4 hours 3 mins 2 secs
  Priority:               10
  FD Files Written:       1,805,322
  SD Files Written:       0
  FD Bytes Written:       321,716,097,371 (321.7 GB)
  SD Bytes Written:       0 (0 B)
  Rate:                   22062.5 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               yes
  Volume name(s):         000094L5
  Volume Session Id:      1
  Volume Session Time:    1441300753
  Last Volume Bytes:      1,817,725,514,752 (1.817 TB)
  Non-fatal FD errors:    2
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***

If I turn off job spooling, the job completes as expected.

I have enabled "heartbeats" on the client, storage daemon, and director,
but that didn't help.

My current client configuration is this:

FileDaemon {                          # this is me
  Name = mmcnsd4-fd
  FDport = 9102                  # where we listen for the director
  WorkingDirectory = /usr/local/bacula/working
  Pid Directory = /usr/local/bacula/working
  Maximum Concurrent Jobs = 20
  Maximum Network Buffer Size = 262144
  Heartbeat Interval = 60
}

My storage daemon configuration is:
Storage {                             # definition of myself
  Name = bkupsvr2-sd
  SDPort = 9103                  # Director's port     
  WorkingDirectory = "/usr/local/bacula/working"
  Pid Directory = "/usr/local/bacula/working"
  Maximum Concurrent Jobs = 20
  Heartbeat Interval = 60
}
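For reference, the Director-side heartbeat is set in the Director resource
of bacula-dir.conf. A sketch of what I have there (the non-heartbeat
directives are abbreviated and the values shown are just placeholders):

```
# bacula-dir.conf (sketch; values other than Heartbeat Interval are placeholders)
Director {
  Name = bkupsvr2-dir
  DIRport = 9101
  WorkingDirectory = /usr/local/bacula/working
  Pid Directory = /usr/local/bacula/working
  Maximum Concurrent Jobs = 20
  Heartbeat Interval = 60        # keep idle control connections alive
}
```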

The only difference I can see is that with spooling turned off, data
flows constantly over the network connection, whereas with spooling
turned on there is a quiet period on the FD-to-SD connection while the
SD despools to tape.
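One thing I'm considering trying, in case something stateful between the
hosts drops the idle connection during that quiet period, is shortening
the kernel's TCP keepalive timers on the client and storage server. By
default Linux sends its first keepalive probe only after two hours of
idle time (tcp_keepalive_time = 7200), which is longer than a multi-hour
despool. A sketch with illustrative values, not a vetted recommendation:

```
# /etc/sysctl.d/90-bacula-keepalive.conf (illustrative values)
net.ipv4.tcp_keepalive_time = 60     # first probe after 60 s idle (default 7200)
net.ipv4.tcp_keepalive_intvl = 10    # then probe every 10 s (default 75)
net.ipv4.tcp_keepalive_probes = 5    # give up after 5 unanswered probes (default 9)
```

Loaded with "sysctl -p /etc/sysctl.d/90-bacula-keepalive.conf"; this only
matters for sockets that actually enable SO_KEEPALIVE, so it complements
rather than replaces the Bacula heartbeat settings.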

I've talked with my network engineer about this and he says there's
nothing in the network that would cause the application to close the
connection.

So, has anyone seen this problem before?
Any ideas on what to look at to figure this out?

jeff


