Bacula-users

[Bacula-users] Backup failing reliably repeatable

2009-04-16 11:42:42
Subject: [Bacula-users] Backup failing reliably repeatable
From: John Lockard <jlockard AT umich DOT edu>
To: bacula-users AT lists.sourceforge DOT net
Date: Thu, 16 Apr 2009 11:37:38 -0400
Client and server at 2.4.4.  Both client and server are Linux 2.6

Logs from client:
--------------------
15-Apr 16:53 tibor-dir JobId 2954: Start Backup JobId 2954, 
Job=Belobog-Data-Users.2009-04-15_16.32.07.04
15-Apr 16:53 tibor-dir JobId 2954: Using Volume "100108L2" from 'Scratch' pool.
15-Apr 16:53 tibor-dir JobId 2954: Using Device "NEO-LTO-1"
15-Apr 16:53 tibor-sd JobId 2954: 3307 Issuing autochanger "unload slot 3, 
drive 0" command.
15-Apr 16:53 tibor-sd JobId 2954: 3304 Issuing autochanger "load slot 21, drive 
0" command.
15-Apr 16:54 tibor-sd JobId 2954: 3305 Autochanger "load slot 21, drive 0", 
status is OK.
15-Apr 16:54 tibor-sd JobId 2954: Wrote label to prelabeled Volume "100108L2" 
on device "NEO-LTO-1" (/dev/nst0)
15-Apr 16:54 tibor-sd JobId 2954: Spooling data ...
15-Apr 18:22 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 18:22 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,786,126 bytes ...
15-Apr 18:32 tibor-sd JobId 2954: Despooling elapsed time = 00:09:57, Transfer 
rate = 46.80 M bytes/second
15-Apr 18:33 tibor-sd JobId 2954: Spooling data again ...
15-Apr 19:29 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 19:29 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,794,024 bytes ...
15-Apr 19:40 tibor-sd JobId 2954: Despooling elapsed time = 00:11:24, Transfer 
rate = 40.85 M bytes/second
15-Apr 19:41 tibor-sd JobId 2954: Spooling data again ...
15-Apr 20:28 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 20:28 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,795,119 bytes ...
15-Apr 20:42 tibor-sd JobId 2954: Despooling elapsed time = 00:14:30, Transfer 
rate = 32.11 M bytes/second
15-Apr 20:43 tibor-sd JobId 2954: Spooling data again ...
15-Apr 21:27 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 21:27 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,795,083 bytes ...
15-Apr 21:44 tibor-sd JobId 2954: Despooling elapsed time = 00:16:34, Transfer 
rate = 28.11 M bytes/second
15-Apr 21:45 tibor-sd JobId 2954: Spooling data again ...
15-Apr 22:29 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 22:29 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,795,185 bytes ...
15-Apr 22:43 tibor-sd JobId 2954: Despooling elapsed time = 00:14:28, Transfer 
rate = 32.19 M bytes/second
15-Apr 22:44 tibor-sd JobId 2954: Spooling data again ...
15-Apr 23:29 tibor-sd JobId 2954: User specified spool size reached.
15-Apr 23:29 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,793,858 bytes ...
15-Apr 23:42 tibor-sd JobId 2954: Despooling elapsed time = 00:12:54, Transfer 
rate = 36.10 M bytes/second
15-Apr 23:43 tibor-sd JobId 2954: Spooling data again ...
16-Apr 00:26 tibor-sd JobId 2954: User specified spool size reached.
16-Apr 00:26 tibor-sd JobId 2954: Writing spooled data to Volume. Despooling 
27,941,794,859 bytes ...
16-Apr 00:43 tibor-sd JobId 2954: Despooling elapsed time = 00:16:46, Transfer 
rate = 27.77 M bytes/second
16-Apr 00:43 tibor-sd JobId 2954: Spooling data again ...
16-Apr 02:07 belobog-fd JobId 2954: Fatal error: backup.c:1087 Network send 
error to SD. ERR=Connection reset by peer
16-Apr 02:07 tibor-dir JobId 2954: Error: Bacula tibor-dir 2.4.4 (28Dec08): 
16-Apr-2009 02:07:38
  Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
  JobId:                  2954
  Job:                    Belobog-Data-Users.2009-04-15_16.32.07.04
  Backup Level:           Full
  Client:                 "belobog" 2.4.4 (28Dec08) 
x86_64-unknown-linux-gnu,suse,10.0
  FileSet:                "belobog-data-users" 2009-04-11 05:15:01
  Pool:                   "Radev-Full" (From Job FullPool override)
  Storage:                "Overland_Neo_4000" (From Pool resource)
  Scheduled time:         15-Apr-2009 16:32:02
  Start time:             15-Apr-2009 16:53:13
  End time:               16-Apr-2009 02:07:38
  Elapsed time:           9 hours 14 mins 25 secs
  Priority:               30
  FD Files Written:       7,029,945
  SD Files Written:       0
  FD Bytes Written:       217,674,494,521 (217.6 GB)
  SD Bytes Written:       0 (0 B)
  Rate:                   6543.6 KB/s
  Software Compression:   None
  VSS:                    no
  Storage Encryption:     no
  Volume name(s):         100108L2
  Volume Session Id:      50
  Volume Session Time:    1239651099
  Last Volume Bytes:      194,987,520,000 (194.9 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***
--------------------

On the client side I see this in the firewall log:
Apr 16 02:07:38 belobog kernel: SFW2-OUT-ERROR IN= OUT=eth0 SRC=xxx.xxx.xxx.32 
DST=xxx.xxx.yyy.25 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=59630 DF PROTO=TCP 
SPT=11310 DPT=9103 WINDOW=1460 RES=0x00 ACK RST URGP=0 OPT 
(0101080A3D08C91922123CB5)

(SRC and DST have been cleansed in the firewall log by me)

Looking at information on SuSEfirewall (iptables with wrappers) I'm led to 
believe
the firewall log message is just a notice.  Basically that the connection was
dropped, and not by the firewall.

I can repeat this over and over on this same machine with a backup of this
partition or any other partition which is at least 300G in size (I don't
have any partitions larger than 217G but smaller than 200G).  All of these
backups fail at roughly 217G as the FD Bytes Written.

This backup runs from the client to the backup server with a 100G spool
area.  From the spool, the data is written to LTO2 tape.  Backups on this
system failed at 217G when my spool was 50G, 10G and 5G in size.

I have other servers with more storage on which backups run fine, up to
4TB.

Any ideas?

Thanks,
John

-- 
Brain: Pinky, Are you pondering what I'm pondering?
Pinky: Well, I think so, Brain, but I can't memorize
       a whole opera in Yiddish.
-------------------------------------------------------------------
         John M. Lockard |  U of Michigan - School of Information
 Unix and Security Admin |      1214 SI North - 1075 Beal Ave.
      jlockard AT umich DOT edu |        Ann Arbor, MI  48109-2112
 www.umich.edu/~jlockard |     734-615-8776 | 734-647-8045 FAX
-------------------------------------------------------------------

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and 
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today. 
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>