Bacula-users

Re: [Bacula-users] Sending spooled attrs to the Director Fatal error: Network error with FD during Backup: ERR=Connection reset by peer ?

2011-12-06 11:38:13
From: Bob Hetzel <beh AT case DOT edu>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 06 Dec 2011 11:35:54 -0500
I've been doing backups for a long time now, and one thing I've learned is
that a backup that takes more than 24 hours is asking for trouble. In
theory this should work, but because your fulls take so long, any file that
changes after the job has already passed it won't be picked up until the
full completes and a later backup runs.

Here's what I mean in more detail. Suppose the data is laid out in 9
equal-sized directories, a1 through a9. In theory the 9-day full backup
job gets all of a1 in the first 24 hours and then works through the
remaining directories in order, but if anything changes in a1 before the
job completes, it won't go back for it. The obvious answer in that
situation is to split the backup into at least 9 separate jobs. To be sure,
this will mean some work on your part, and it would also be a good idea to
audit periodically to make sure you aren't skipping the backup of any
directories.
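
As a rough illustration of that kind of audit, here is a minimal Python
sketch. It is not part of Bacula; the function name and the idea of keeping
a plain list of the paths each job backs up are assumptions made for the
example. It lists the top-level directories under a data root and reports
any that no job covers.

```python
import os
import tempfile


def uncovered_directories(root, job_paths):
    """Return top-level subdirectories of root that no job path covers.

    job_paths is a hypothetical list of the directories named in your
    split-up jobs; how you collect it from your configuration is up to you.
    """
    covered = {os.path.normpath(p) for p in job_paths}
    missing = []
    for entry in sorted(os.listdir(root)):
        full = os.path.normpath(os.path.join(root, entry))
        # Only directories are audited; plain files at the top level are ignored.
        if os.path.isdir(full) and full not in covered:
            missing.append(full)
    return missing


if __name__ == "__main__":
    # Demo on a throwaway tree: nine directories, but only eight jobs.
    with tempfile.TemporaryDirectory() as root:
        for i in range(1, 10):
            os.mkdir(os.path.join(root, f"a{i}"))
        jobs = [os.path.join(root, f"a{i}") for i in range(1, 9)]
        for path in uncovered_directories(root, jobs):
            print("not backed up by any job:", path)
```

Run from cron, something like this would flag a new directory (a10, say)
that appeared after the jobs were defined.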

The more you can split it up, the happier your life will be. On a system
that big you may even have enough I/O throughput available to run 2 or
more jobs in parallel, cutting the backup window down substantially.
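
To put rough numbers on that, the sketch below takes the SD byte count and
elapsed time from the job report quoted further down and estimates the
backup window for a given number of parallel jobs. The assumption that
throughput scales linearly with the number of jobs is mine and is
optimistic; real tape drives, disks and networks will fall short of it, so
treat the result as a best case.

```python
SECONDS_PER_DAY = 86400


def average_rate(total_bytes, elapsed_seconds):
    """Average throughput in bytes per second."""
    return total_bytes / elapsed_seconds


def estimated_days(total_bytes, rate_bytes_per_s, parallel_jobs=1):
    """Best-case backup window in days, assuming linear scaling (an assumption)."""
    seconds = total_bytes / (rate_bytes_per_s * parallel_jobs)
    return seconds / SECONDS_PER_DAY


# Figures from the quoted job report: SD Bytes Written and Elapsed time.
TOTAL_BYTES = 16_495_138_769_029
ELAPSED_SECONDS = 9 * SECONDS_PER_DAY + 6 * 3600 + 17

if __name__ == "__main__":
    rate = average_rate(TOTAL_BYTES, ELAPSED_SECONDS)
    print(f"average rate: {rate / 1e6:.1f} MB/s")
    for jobs in (1, 2, 3):
        days = estimated_days(TOTAL_BYTES, rate, jobs)
        print(f"{jobs} parallel job(s): about {days:.1f} days")
```

The average rate works out to roughly 20 MB/s, which suggests the tape
drive is not the limiting factor and there may be room for a second job.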

Ideally, you'd be able to break it down into units small enough that your
fulls won't interfere with your incrementals. Bacula, like most backup
packages, doesn't let you resume a failed full backup from the point where
it died, so breaking a big job into smaller ones means that after a system
problem you only have to rerun the jobs that failed, not the whole backup.

In addition, if your full backup takes more than 9 days, your disaster
recovery will take even longer, so keep that in mind as well. If you can
separate the jobs by how critical the data is, you can restore the most
important information first just to get things running.


> Date: Mon, 5 Dec 2011 19:55:49 -0500
> From: "Ethier, Michael" <methier AT CGR.Harvard DOT edu>
> Subject: [Bacula-users] Sending spooled attrs to the Director Fatal
>
> Hello,
>
> We are running Bacula 5.0.3 on RHEL and Centos. I have recently had a
> 16.5TB backup fail at the end when the system tried to spool the
> attribute data, messages are below. The backend database used is MySQL:
>
> [root@hulsbackup lib]#  mysql -V
> mysql  Ver 14.12 Distrib 5.0.77, for redhat-linux-gnu (x86_64) using readline 5.1
>
> and lives on the same machine partition as the data spool directory. All
> backup data was spooled and dumped to tape successfully it appears.
>
> I have successfully backed up a 5TB data set before this. However,
> between that backup and this failed one, we moved the bacula server to a
> different net and changed to a LACP bonded interface. There is a local
> iptables firewall running on the Bacula server.
>
> In addition we kept hitting this 6 day limit where backups were getting
> auto killed, so I changed the following lines, and recompiled with a 60
> day limit on both the bacula server and client.
>
> bnet.c:   bsock->timeout = 60 * 60 * 60 * 24;   /* 60 days timeout */
> bsock.c:   timeout = 60 * 60 * 60 * 24;   /* 60 days timeout */
>
> Other than that, everything is the default code. Has anyone hit this
> problem and knows the solution to this problem ? I can't easily re-run
> and reproduce this since it runs for over 9 days.
>
> Thanks,
> Mike
>
> ...
> ...
>
> 05-Dec 02:48 hulsbackup-sd JobId 109: Alert: Home page is http://smartmontools.sourceforge.net/
> 05-Dec 02:48 hulsbackup-sd JobId 109: Alert:
> 05-Dec 02:48 hulsbackup-sd JobId 109: Alert: TapeAlert: OK
> 05-Dec 02:48 hulsbackup-sd JobId 109: Alert:
> 05-Dec 02:48 hulsbackup-sd JobId 109: Alert: Error Counter logging not supported
> 05-Dec 02:48 hulsbackup-sd JobId 109: Sending spooled attrs to the Director. Despooling 196,979,273 bytes ...
> 05-Dec 03:12 hulsbackup-dir JobId 109: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
> 05-Dec 03:12 hulsbackup-dir JobId 109: Fatal error: No Job status returned from FD.
> 05-Dec 03:12 hulsbackup-dir JobId 109: Error: Bacula hulsbackup-dir 5.0.3 (04Aug10): 05-Dec-2011 03:12:15
>   Build OS:               x86_64-unknown-linux-gnu redhat Enterprise release
>   JobId:                  109
>   Job:                    ceserve1.2011-11-25_21.11.56_11
>   Backup Level:           Full
>   Client:                 "ceserve1-fd" 5.0.3 (04Aug10) x86_64-unknown-linux-gnu,redhat,
>   FileSet:                "ceserve1-data" 2011-11-02 11:03:12
>   Pool:                   "Default" (From Job resource)
>   Catalog:                "MyCatalog" (From Client resource)
>   Storage:                "Autochanger" (From command line)
>   Scheduled time:         25-Nov-2011 21:11:47
>   Start time:             25-Nov-2011 21:11:58
>   End time:               05-Dec-2011 03:12:15
>   Elapsed time:           9 days 6 hours 17 secs
>   Priority:               10
>   FD Files Written:       0
>   SD Files Written:       571,253
>   FD Bytes Written:       0 (0 B)
>   SD Bytes Written:       16,495,138,769,029 (16.49 TB)
>   Rate:                   0.0 KB/s
>   Software Compression:   None
>   VSS:                    no
>   Encryption:             no
>   Accurate:               no
>   Volume name(s):         000093L3|000094L3|000095L3|000096L3|000097L3|000098L3|000099L3|000100L3|000101L3|000102L3|000103L3|000104L3|000105L3|000106L3|000107L3|000108L3|000109L3|000110L3|000111L3|000112L3|000113L3|000114L3|000115L3|000127L3|000117L3|000118L3|000119L3|000013L3|000121L3|000122L3|000123L3|000124L3|000125L3|000126L3|000166L3|000128L3|000129L3|000130L3|000131L3|000132L3
>   Volume Session Id:      2
>   Volume Session Time:    1322270042
>   Last Volume Bytes:      246,238,949,376 (246.2 GB)
>   Non-fatal FD errors:    0
>   SD Errors:              39
>   FD termination status:  Error
>   SD termination status:  OK
>   Termination:            *** Backup Error ***
>
