Bacula-users

Re: [Bacula-users] Backup failing reliably repeatable

2009-04-16 19:43:34
Subject: Re: [Bacula-users] Backup failing reliably repeatable
From: "James Harper" <james.harper AT bendigoit.com DOT au>
To: "John Lockard" <jlockard AT umich DOT edu>, <bacula-users AT lists.sourceforge DOT net>
Date: Fri, 17 Apr 2009 09:39:24 +1000
Does belobog have the same network adapter and kernel as your other
servers?

What brand (machine and network adapter), what kernel, and what network
adapter?

Could be a tcp offload issue, try disabling tso and tx csum offload on
the fd (use ethtool)

Could also be a firewall issue timing out the connection... are the sd,
dir, and fd on the same lan segment?

James




> -----Original Message-----
> From: John Lockard [mailto:jlockard AT umich DOT edu]
> Sent: Friday, 17 April 2009 01:38
> To: bacula-users AT lists.sourceforge DOT net
> Subject: [Bacula-users] Backup failing reliably repeatable
> 
> Client and server at 2.4.4.  Both client and server are Linux 2.6
> 
> Logs from client:
> --------------------
> 15-Apr 16:53 tibor-dir JobId 2954: Start Backup JobId 2954,
Job=Belobog-Data-
> Users.2009-04-15_16.32.07.04
> 15-Apr 16:53 tibor-dir JobId 2954: Using Volume "100108L2" from
'Scratch'
> pool.
> 15-Apr 16:53 tibor-dir JobId 2954: Using Device "NEO-LTO-1"
> 15-Apr 16:53 tibor-sd JobId 2954: 3307 Issuing autochanger "unload
slot 3,
> drive 0" command.
> 15-Apr 16:53 tibor-sd JobId 2954: 3304 Issuing autochanger "load slot
21,
> drive 0" command.
> 15-Apr 16:54 tibor-sd JobId 2954: 3305 Autochanger "load slot 21,
drive 0",
> status is OK.
> 15-Apr 16:54 tibor-sd JobId 2954: Wrote label to prelabeled Volume
"100108L2"
> on device "NEO-LTO-1" (/dev/nst0)
> 15-Apr 16:54 tibor-sd JobId 2954: Spooling data ...
> 15-Apr 18:22 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 18:22 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,786,126 bytes ...
> 15-Apr 18:32 tibor-sd JobId 2954: Despooling elapsed time = 00:09:57,
Transfer
> rate = 46.80 M bytes/second
> 15-Apr 18:33 tibor-sd JobId 2954: Spooling data again ...
> 15-Apr 19:29 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 19:29 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,794,024 bytes ...
> 15-Apr 19:40 tibor-sd JobId 2954: Despooling elapsed time = 00:11:24,
Transfer
> rate = 40.85 M bytes/second
> 15-Apr 19:41 tibor-sd JobId 2954: Spooling data again ...
> 15-Apr 20:28 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 20:28 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,795,119 bytes ...
> 15-Apr 20:42 tibor-sd JobId 2954: Despooling elapsed time = 00:14:30,
Transfer
> rate = 32.11 M bytes/second
> 15-Apr 20:43 tibor-sd JobId 2954: Spooling data again ...
> 15-Apr 21:27 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 21:27 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,795,083 bytes ...
> 15-Apr 21:44 tibor-sd JobId 2954: Despooling elapsed time = 00:16:34,
Transfer
> rate = 28.11 M bytes/second
> 15-Apr 21:45 tibor-sd JobId 2954: Spooling data again ...
> 15-Apr 22:29 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 22:29 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,795,185 bytes ...
> 15-Apr 22:43 tibor-sd JobId 2954: Despooling elapsed time = 00:14:28,
Transfer
> rate = 32.19 M bytes/second
> 15-Apr 22:44 tibor-sd JobId 2954: Spooling data again ...
> 15-Apr 23:29 tibor-sd JobId 2954: User specified spool size reached.
> 15-Apr 23:29 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,793,858 bytes ...
> 15-Apr 23:42 tibor-sd JobId 2954: Despooling elapsed time = 00:12:54,
Transfer
> rate = 36.10 M bytes/second
> 15-Apr 23:43 tibor-sd JobId 2954: Spooling data again ...
> 16-Apr 00:26 tibor-sd JobId 2954: User specified spool size reached.
> 16-Apr 00:26 tibor-sd JobId 2954: Writing spooled data to Volume.
Despooling
> 27,941,794,859 bytes ...
> 16-Apr 00:43 tibor-sd JobId 2954: Despooling elapsed time = 00:16:46,
Transfer
> rate = 27.77 M bytes/second
> 16-Apr 00:43 tibor-sd JobId 2954: Spooling data again ...
> 16-Apr 02:07 belobog-fd JobId 2954: Fatal error: backup.c:1087 Network
send
> error to SD. ERR=Connection reset by peer
> 16-Apr 02:07 tibor-dir JobId 2954: Error: Bacula tibor-dir 2.4.4
(28Dec08):
> 16-Apr-2009 02:07:38
>   Build OS:               x86_64-unknown-linux-gnu redhat Enterprise
release
>   JobId:                  2954
>   Job:                    Belobog-Data-Users.2009-04-15_16.32.07.04
>   Backup Level:           Full
>   Client:                 "belobog" 2.4.4 (28Dec08)
x86_64-unknown-linux-
> gnu,suse,10.0
>   FileSet:                "belobog-data-users" 2009-04-11 05:15:01
>   Pool:                   "Radev-Full" (From Job FullPool override)
>   Storage:                "Overland_Neo_4000" (From Pool resource)
>   Scheduled time:         15-Apr-2009 16:32:02
>   Start time:             15-Apr-2009 16:53:13
>   End time:               16-Apr-2009 02:07:38
>   Elapsed time:           9 hours 14 mins 25 secs
>   Priority:               30
>   FD Files Written:       7,029,945
>   SD Files Written:       0
>   FD Bytes Written:       217,674,494,521 (217.6 GB)
>   SD Bytes Written:       0 (0 B)
>   Rate:                   6543.6 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Storage Encryption:     no
>   Volume name(s):         100108L2
>   Volume Session Id:      50
>   Volume Session Time:    1239651099
>   Last Volume Bytes:      194,987,520,000 (194.9 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              0
>   FD termination status:  Error
>   SD termination status:  Error
>   Termination:            *** Backup Error ***
> --------------------
> 
> On the client side I see this in the firewall log:
> Apr 16 02:07:38 belobog kernel: SFW2-OUT-ERROR IN= OUT=eth0
SRC=xxx.xxx.xxx.32
> DST=xxx.xxx.yyy.25 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=59630 DF
PROTO=TCP
> SPT=11310 DPT=9103 WINDOW=1460 RES=0x00 ACK RST URGP=0 OPT
> (0101080A3D08C91922123CB5)
> 
> (SRC and DST have been cleansed in the firewall log by me)
> 
> Looking at information on SuSEfirewall (iptables with wrappers) I'm
led to
> believe
> the firewall log message is just a notice.  Basically that the
connection was
> dropped, and not by the firewall.
> 
> I can repeat this over and over on this same machine with a backup of
this
> partition or any other partition which is at least 300G in size (I
don't
> have any partitions larger than 217G but smaller than 200G).  All of
these
> backups fail at roughly 217G as the FD Bytes Written.
> 
> This backup runs from the client to the backup server with a 100G
spool
> area.  From the spool, the data is written to LTO2 tape.  Backups on
this
> system failed at 217G when my spool was 50G, 10G and 5G in size.
> 
> I have other servers with more storage on which backups run fine, up
to
> 4TB.
> 
> Any ideas?
> 
> Thanks,
> John
> 
> --
> Brain: Pinky, Are you pondering what I'm pondering?
> Pinky: Well, I think so, Brain, but I can't memorize
>        a whole opera in Yiddish.
> -------------------------------------------------------------------
>          John M. Lockard |  U of Michigan - School of Information
>  Unix and Security Admin |      1214 SI North - 1075 Beal Ave.
>       jlockard AT umich DOT edu |        Ann Arbor, MI  48109-2112
>  www.umich.edu/~jlockard |     734-615-8776 | 734-647-8045 FAX
> -------------------------------------------------------------------
> 
>
------------------------------------------------------------------------
------
> Stay on top of everything new and different, both inside and
> around Java (TM) technology - register by April 22, and save
> $200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
> 300 plus technical and hands-on sessions. Register today.
> Use priority code J9JMT32. http://p.sf.net/sfu/p
> _______________________________________________
> Bacula-users mailing list
> Bacula-users AT lists.sourceforge DOT net
> https://lists.sourceforge.net/lists/listinfo/bacula-users

------------------------------------------------------------------------------
Stay on top of everything new and different, both inside and 
around Java (TM) technology - register by April 22, and save
$200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco.
300 plus technical and hands-on sessions. Register today. 
Use priority code J9JMT32. http://p.sf.net/sfu/p
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>