Bacula-users

Re: [Bacula-users] Timeout (?) problems with some Full backups

2009-08-13 05:12:43
Subject: Re: [Bacula-users] Timeout (?) problems with some Full backups
From: Nick Lock <nick.lock AT exa-networks.co DOT uk>
To: John Lockard <jlockard AT umich DOT edu>
Date: Thu, 13 Aug 2009 10:08:05 +0100
Thanks very much for the help everyone, I do appreciate it. And quite
possibly - you've fixed it :)

From the top:

John Drescher wrote:

> I would make the heartbeat interval much shorter.

Thinking that this might be a good idea, I dropped all the Heartbeat
Intervals to 1 minute and left the same backup to run again along with
the usual overnight backups. (hooray for bash, I can push a new
bacula-fd.conf to all the servers in less than 2 minutes!) As it was the
end of the day for me, I thought letting Bacula do its stuff in peace
and taking a fresh look this morning would be best.
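
For anyone wanting to script the same change, here is a rough sketch of
rewriting the directive before pushing the file out. It only assumes the
"Heartbeat Interval" directive named above; the helper name
set_heartbeat and the sample config (resource name, old value) are made
up for illustration and are not taken from the real bacula-fd.conf.

```python
import re

# Matches an existing "Heartbeat Interval = <value>" line, keeping any
# trailing comment. Matching is case-insensitive.
_HEARTBEAT = re.compile(
    r"^([ \t]*Heartbeat[ \t]+Interval[ \t]*=[ \t]*)([^\n#;]*?)([ \t]*)(?=$|[#;])",
    re.IGNORECASE | re.MULTILINE,
)


def set_heartbeat(conf_text, interval):
    """Return (new_text, count) with every existing heartbeat value replaced.

    Lines are only rewritten, never added: a count of 0 means the
    directive was not present and the file needs a manual edit.
    """
    return _HEARTBEAT.subn(
        lambda m: m.group(1) + interval + m.group(3), conf_text
    )


if __name__ == "__main__":
    # Hypothetical sample, not the real config from this thread.
    sample = (
        "FileDaemon {\n"
        "  Name = scavenger-fd\n"
        "  Heartbeat Interval = 300  # old value\n"
        "}\n"
    )
    new_text, count = set_heartbeat(sample, "1 minute")
    print(count)
    print(new_text)
```

Checking the returned count before copying the file anywhere is a cheap
guard against pushing a config that was silently left unchanged.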

I assume that a 1 minute heartbeat won't be too detrimental to the
network - compared to the flood of backup data that's coming into the
same server? (I run the Dir and SD on the same server)

I'll show the results below...

> Also I am interested on why the backup rate is this slow. Slow
> network?

This particular machine is a VM on a quite highly loaded host. We put
lower-priority, less demanding VMs together on one host in order
to give more resources to higher priority tasks. The effective CPU speed
in the VM is quite low, so I think that the compression which the
bacula-fd task performs on the data is a rate-limiting step. We *can*
get line-speed transfers from other servers, it just happens that this
one is a little slower!



Josh Fisher wrote:

> Something is dropping the Dir-FD connection. A router in between FD
> and Dir, perhaps.

Yes, it seems that something is dropping the connection. For peace of
mind I'll have a word with our networking engineer as to what the
current connection timeout values are on the hardware I don't directly
control. It's also occurred to me that I'll have to check what the
current tcp timeout settings are on the server - maybe a recent update
changed them and I didn't notice...? (Full backups have been running OK
for a few months prior to my current problems!)
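
One way to do that check on a Linux box is sketched below. It assumes
that the kernel's TCP keepalive sysctls are among the settings worth
looking at, which is a guess; they only affect sockets that have
keepalive enabled, and other timeouts (on firewalls or routers, for
instance) will not show up here. The function name read_keepalive is
made up for the example.

```python
from pathlib import Path

# Standard Linux keepalive sysctls, exposed as files under /proc.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",    # idle seconds before the first probe
    "tcp_keepalive_intvl",   # seconds between probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Return {name: int} for each keepalive sysctl found under base."""
    values = {}
    for key in KEEPALIVE_KEYS:
        path = Path(base) / key
        if path.is_file():
            values[key] = int(path.read_text().strip())
    return values


if __name__ == "__main__":
    for name, value in read_keepalive().items():
        print(f"{name} = {value}")
```

Running it before and after a package update would show whether any of
the values changed.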

However, it might be moot after last night's attempts with a 1 minute
heartbeat (see below).


John Lockard wrote:

> While the job is running, keep an eye on the system which houses
> your MySQL database and make sure that it isn't filling up a
> partition with temp data.  I was running into a similar problem
> and needed to move my mysql_tmpdir (definable in /etc/my.cnf)
> to another location.

I'd never even considered that! We're using Postgres for the db, the
temp directory for that is
currently /var/lib/postgresql/8.3/main/base/pgsql_tmp and has 33 GB free.
I don't think that it's the problem this time, but I'll certainly have
to keep it in mind in the future...



OK - the results. Here's the same backup that I showed failing in my
first post:
-----------------------------------------------------------------------

12-Aug 16:37 exa-bacula-dir JobId 5518: Start Backup JobId 5518,
Job=backup_scavenger.2009-08-12_16.37.28.03
12-Aug 16:37 exa-bacula-dir JobId 5518: Purging oldest volume
"scavenger-full-1191"
12-Aug 16:37 exa-bacula-dir JobId 5518: 1 File on Volume
"scavenger-full-1191" purged from catalog.
12-Aug 16:37 exa-bacula-dir JobId 5518: There are no more Jobs
associated with Volume "scavenger-full-1191". Marking it purged.
12-Aug 16:37 exa-bacula-dir JobId 5518: All records pruned from Volume
"scavenger-full-1191"; marking it "Purged"
12-Aug 16:37 exa-bacula-dir JobId 5518: Using Device
"FileStorageScavenger"
12-Aug 16:37 exa-bacula-sd JobId 5518: Recycled volume
"scavenger-full-1191" on device
"FileStorageScavenger" (/srv/bacula/volume/web-scavenger), all previous
data lost.
12-Aug 16:37 exa-bacula-dir JobId 5518: Max Volume jobs exceeded.
Marking Volume "scavenger-full-1191" as Used.
12-Aug 18:16 exa-bacula-sd JobId 5518: Job write elapsed time =
01:38:34, Transfer rate = 373.4 K bytes/second
12-Aug 18:19 exa-bacula-dir JobId 5518: Bacula exa-bacula-dir 2.4.4
(28Dec08): 12-Aug-2009 18:19:42
  Build OS:               x86_64-pc-linux-gnu debian lenny/sid
  JobId:                  5518
  Job:                    backup_scavenger.2009-08-12_16.37.28.03
  Backup Level:           Full
  Client:                 "scavenger" 2.4.4 (28Dec08)
i486-pc-linux-gnu,debian,5.0
  FileSet:                "full-scavenger" 2009-04-16 15:58:05
  Pool:                   "scavenger-full" (From Job FullPool override)
  Storage:                "FileScavenger" (From Job resource)
  Scheduled time:         12-Aug-2009 16:37:24
  Start time:             12-Aug-2009 16:37:30
  End time:               12-Aug-2009 18:19:42
  Elapsed time:           1 hour 42 mins 12 secs
  Priority:               10
  FD Files Written:       81,883
  SD Files Written:       81,883
  FD Bytes Written:       2,197,458,673 (2.197 GB)
  SD Bytes Written:       2,208,651,895 (2.208 GB)
  Rate:                   358.4 KB/s
  Software Compression:   28.4 %
  VSS:                    no
  Storage Encryption:     no
  Volume name(s):         scavenger-full-1191
  Volume Session Id:      1
  Volume Session Time:    1250091413
  Last Volume Bytes:      2,212,931,408 (2.212 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Backup OK

12-Aug 18:19 exa-bacula-dir JobId 5518: Begin pruning Jobs.
12-Aug 18:19 exa-bacula-dir JobId 5518: Pruned 1 Job for client
scavenger from catalog.
12-Aug 18:19 exa-bacula-dir JobId 5518: Begin pruning Files.
12-Aug 18:19 exa-bacula-dir JobId 5518: No Files found to prune.
12-Aug 18:19 exa-bacula-dir JobId 5518: End auto prune.
------------------------------------------------------------------------
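
As a quick sanity check on the "Rate" line in the summary above, the
figure can be reproduced from "FD Bytes Written" and "Elapsed time".
This is only a sketch: it assumes the summary uses decimal units
(1 KB = 1000 bytes), which is consistent with the "2.197 GB" shown next
to the byte count, and the function name is made up.

```python
def rate_kb_per_s(total_bytes, elapsed_seconds):
    """Average throughput in KB/s, taking 1 KB as 1000 bytes (assumed)."""
    return total_bytes / elapsed_seconds / 1000


# Values copied from the job summary above.
fd_bytes = 2_197_458_673
elapsed = 1 * 3600 + 42 * 60 + 12  # 1 hour 42 mins 12 secs = 6132 s

print(f"{rate_kb_per_s(fd_bytes, elapsed):.1f} KB/s")  # prints "358.4 KB/s"
```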

Bingo! Everything worked, including the Full backups that have been
failing since Sunday.

I'll be keeping an eye on it for the next few days, but as the one
change seems to have fixed all my problems I'm feeling confident that
the heartbeat interval was the problem.

Many thanks to Josh Fisher and John Lockard for their assistance, and
specific thanks to John Drescher for his suggestion :)


Nick Lock.

