Bacula-users

[Bacula-users] Backup starts then immediately stops transferring and times out?

2012-06-15 13:09:54
Subject: [Bacula-users] Backup starts then immediately stops transferring and times out?
From: Erich Weiler <weiler AT soe.ucsc DOT edu>
To: bacula-users AT lists.sourceforge DOT net
Date: Fri, 15 Jun 2012 10:07:12 -0700
Hi Y'all,

We have two RHEL 6 servers, one is running bacula-dir and bacula-sd 
(with a lot of disk storage for the backups), and the other is the 
client we want to back up from running bacula-fd.

Both servers are behind different firewalls, and each has a public IP 
address that is configured for 1:1 NAT to the internal server.  I can 
telnet from the director to the client on port 9102 and connect OK, and 
the client can telnet to the storage daemon on the other server on port 
9103 and connect OK.  So, I think the network path is working.

What appears to happen is that we start the backup job, and it connects 
and begins transferring and then almost immediately stops and then times 
out.  It transfers like 800 bytes and seems to just sit there.  This is 
what it looks like on the different status reports from bconsole on the 
directory node:


DIRECTOR STATUS:
*status
Status available for:
      1: Director
      2: Storage
      3: Client
      4: All
Select daemon type for status (1-4): 1
bacula-dir Version: 5.0.0 (26 January 2010) x86_64-unknown-linux-gnu 
redhat (Final)
Daemon started 15-Jun-12 09:39, 0 Jobs run since started.
  Heap: heap=135,168 smbytes=61,022 max_bytes=61,278 bufs=247 max_bufs=253

Scheduled Jobs:
Level          Type     Pri  Scheduled          Name               Volume
===================================================================================
Incremental    Backup    10  16-Jun-12 02:00    hubcon02           
Default0002
====

Running Jobs:
Console connected at 15-Jun-12 09:52
  JobId Level   Name                       Status
======================================================================
     19 Full    hubcon02.2012-06-15_09.53.03_03 is running
====

Terminated Jobs:
  JobId  Level    Files      Bytes   Status   Finished        Name
====================================================================
      4  Full          0         0   Error    13-Jun-12 14:41 hubcon02
      5  Full          0         0   Error    13-Jun-12 15:40 hubcon02
      8  Full          0         0   Error    13-Jun-12 22:10 hubcon02
     10  Full         10    173.5 K  Error    13-Jun-12 22:56 hubcon02
     11  Full          0         0   Error    13-Jun-12 23:06 hubcon02
     13  Full          0         0   Error    14-Jun-12 01:30 hubcon02
     14  Full         10    107.9 K  Error    14-Jun-12 02:15 hubcon02
     15  Full          0         0   Error    15-Jun-12 04:11 hubcon02
     16  Full         10    108.2 K  Cancel   15-Jun-12 08:05 hubcon02
     17  Full         10    108.2 K  Error    15-Jun-12 08:22 hubcon02

====
*

STORAGE STATUS:
*status
Status available for:
      1: Director
      2: Storage
      3: Client
      4: All
Select daemon type for status (1-4): 2
The defined Storage resources are:
      1: storage-hubcon01
      2: storage-hubcon02
      3: storage-app01
Select Storage resource (1-3): 2
Connecting to Storage daemon storage-hubcon02 at 128.114.61.132:9103

bacula-sd Version: 5.0.0 (26 January 2010) x86_64-unknown-linux-gnu 
redhat (Final)
Daemon started 15-Jun-12 09:24, 0 Jobs run since started.
  Heap: heap=131,072 smbytes=304,519 max_bytes=304,713 bufs=149 max_bufs=151
Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8

Running Jobs:
Writing: Full Backup job hubcon02 JobId=19 Volume="Default0002"
     pool="Default" device="backup-hubcon02" (/data/backup/backup-hubcon02)
     spooling=0 despooling=0 despool_wait=0
     Files=4 Bytes=52,038 Bytes/sec=634
     FDReadSeqNo=32 in_msg=23 out_msg=5 fd=8
====

Jobs waiting to reserve a drive:
====

Terminated Jobs:
  JobId  Level    Files      Bytes   Status   Finished        Name
===================================================================
      8  Full          0         0   Cancel   13-Jun-12 22:10 hubcon02
      9  Full          2       631   Cancel   13-Jun-12 22:32 hubcon02
     10  Full          4    51.80 K  Cancel   13-Jun-12 23:05 hubcon02
     11  Full          0         0   Error    13-Jun-12 23:06 hubcon02
     12  Full          0         0   Cancel   13-Jun-12 23:18 hubcon02
     13  Full          2       631   Error    14-Jun-12 01:19 hubcon02
     14  Full          2       631   Cancel   14-Jun-12 02:15 hubcon02
     15  Full          2       860   Error    15-Jun-12 04:00 hubcon02
     16  Full          2       860   Cancel   15-Jun-12 08:05 hubcon02
     17  Full          2       860   Cancel   15-Jun-12 08:22 hubcon02
====

Device status:
Device "backup-hubcon01" (/data/backup/backup-hubcon01) is not open.
Device "backup-hubcon02" (/data/backup/backup-hubcon02) is mounted with:
     Volume:      Default0002
     Pool:        Default
     Media type:  File
     Total Bytes=215 Blocks=0 Bytes/block=215
     Positioned at File=0 Block=215
Device "backup-app01" (/data/backup/backup-app01) is not open.
====

Used Volume status:
Default0002 on device "backup-hubcon02" (/data/backup/backup-hubcon02)
     Reader=0 writers=2 devres=0 volinuse=1
====

====

*

CLIENT STATUS:
*status
Status available for:
      1: Director
      2: Storage
      3: Client
      4: All
Select daemon type for status (1-4): 3
The defined Client resources are:
      1: hubcon01-fd
      2: hubcon02-fd
      3: app01-fd
Select Client (File daemon) resource (1-3): 2
Connecting to Client hubcon02-fd at bacula-fd.mydomain.com:9102

hubcon02-fd Version: 5.0.0 (26 January 2010)  x86_64-unknown-linux-gnu 
redhat Enterprise 6.0
Daemon started 15-Jun-12 09:26, 1 Job run since started.
  Heap: heap=0 smbytes=167,245 max_bytes=167,392 bufs=116 max_bufs=117
  Sizeof: boffset_t=8 size_t=8 debug=0 trace=0

Running Jobs:
JobId 19 Job hubcon02.2012-06-15_09.53.03_03 is running.
     Full Backup Job started: 15-Jun-12 09:53
     Files=10 Bytes=173,753 Bytes/sec=1,368 Errors=0
     Files Examined=10
     Processing file: /root/.gstreamer-0.10/registry.x86_64.bin
     SDReadSeqNo=5 fd=5
Director connected at: 15-Jun-12 09:55
====

Terminated Jobs:
  JobId  Level    Files      Bytes   Status   Finished        Name
======================================================================
      6  Full         10    107.9 K  Error    13-Jun-12 21:46 hubcon02
      9  Full         10    107.9 K  Cancel   13-Jun-12 22:32 hubcon02
     10  Full         10    173.5 K  Error    13-Jun-12 22:55 hubcon02
     13  Full         10    107.9 K  Error    13-Jun-12 23:35 hubcon02
     12  Full          0         0   Error    14-Jun-12 01:19 hubcon02
     14  Full         10    107.9 K  Error    14-Jun-12 02:15 hubcon02
     15  Full         10    108.2 K  Error    15-Jun-12 02:16 hubcon02
     16  Full         10    108.2 K  Error    15-Jun-12 08:05 hubcon02
     17  Full         10    108.2 K  Error    15-Jun-12 08:22 hubcon02
     18  Full         10    108.2 K  Error    15-Jun-12 09:43 hubcon02
====
*

So, the storage says it transferred 52,038 bytes almost immediately 
after the job started and then transferred no more, I check for a while 
after.  Sometimes it tranfers only 800 bytes then just waits.  
Eventually the job times out with:

15-Jun 02:00 bacula-dir JobId 15: No prior Full backup Job record found.
15-Jun 02:00 bacula-dir JobId 15: No prior or suitable Full backup found in 
catalog. Doing FULL backup.
15-Jun 02:00 bacula-dir JobId 15: Start Backup JobId 15, 
Job=hubcon02.2012-06-15_02.00.00_10
15-Jun 02:00 bacula-dir JobId 15: Using Device "backup-hubcon02"
15-Jun 02:00 bacula-sd JobId 15: Volume "Default0002" previously written, 
moving to end of data.
15-Jun 02:00 bacula-sd JobId 15: Ready to append to end of Volume "Default0002" 
size=215
15-Jun 04:00 bacula-sd JobId 15: Fatal error: append.c:242 Network error 
reading from FD. ERR=Connection reset by peer
15-Jun 04:00 bacula-sd JobId 15: Job write elapsed time = 02:00:00, Transfer 
rate = 0  Bytes/second
15-Jun 04:00 bacula-sd JobId 15: Error: bsock.c:518 Read error from 
client:111.21.21.21:36643: ERR=Connection reset by peer
15-Jun 04:11 bacula-dir JobId 15: Fatal error: Network error with FD during 
Backup: ERR=Connection timed out
15-Jun 04:11 bacula-dir JobId 15: Fatal error: No Job status returned from FD.
15-Jun 04:11 bacula-dir JobId 15: Error: Bacula cghub-bacula-sc-dir 5.0.0 
(26Jan10): 15-Jun-2012 04:11:21

It *does* seem like the FD can connect back to the DIR and SD, so I don't know 
why it would stop transferring immediately after starting and then eventually 
time out.  I've confirmed my firewalls are not timing out the connections.  I 
also have "Heartbeat Interval = 60" in the SD, DIR and FD config files.  But 
the reason I don't think it's a firewall timeout is because it almost 
immediately stops transferring after only like 1 second.

Anyone have any ideas as to what is happening?

Thanks in advance!!

-erich


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and 
threat landscape has changed and how IT managers can respond. Discussions 
will include endpoint security, mobile security and the latest in malware 
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>