[Bacula-users] Storage Daemon hangs resulting in a network timeout

2012-08-27 11:26:38
Subject: [Bacula-users] Storage Daemon hangs resulting in a network timeout
From: "DAHLBOKUM Markus (FPT INDUSTRIAL)" <markus.dahlbokum AT fptindustrial DOT com>
To: "bacula-users AT lists.sourceforge DOT net" <bacula-users AT lists.sourceforge DOT net>
Date: Mon, 27 Aug 2012 17:07:59 +0200

Hi,

 

This year I switched our backup server to Ubuntu 12.04.

For the first month everything was OK, until the last Ubuntu update.

 

The backup structure is as follows:

A large disk storage is attached to the storage server. Only the file daemon runs on this machine.

The director and the storage daemons run on the backup server.

 

Once a week, all files from the storage server are written to tape.

This is done by two jobs (one per tape drive) that start with a delay between them, but they may run simultaneously for some time.

 

Both machines run Ubuntu 12.04, 64-bit.

Kernel: 3.2.0-27

Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1

 

As the amount of data is rather large, a job may request a second tape from its pool. We don’t use an auto-changer, so I have to replace the tape on Monday morning. The job then continues.

 

This worked for about 3 months until the latest update, which included an update of the Bacula packages.

The update was done on both machines.

 

Now the following error occurs:

 

25-Aug 09:08 ttl010-dir JobId 31: Start Backup JobId 31, Job=backup4.2012-08-25_09.08.00_15
25-Aug 09:08 ttl010-dir JobId 31: Using Device "Drive-1"
25-Aug 09:08 ttl010-sd JobId 31: Recycled volume "PB4T1" on device "Drive-1" (/dev/nst0), all previous data lost.
25-Aug 09:08 ttl010-dir JobId 31: Volume used once. Marking Volume "PB4T1" as Used.
25-Aug 16:17 ttl010-sd JobId 31: End of Volume "PB4T1" at 1152:13356 on device "Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
25-Aug 16:17 ttl010-sd JobId 31: Re-read of last block succeeded.
25-Aug 16:17 ttl010-sd JobId 31: End of medium on Volume "PB4T1" Bytes=1,152,787,894,272 Blocks=17,869,355 at 25-Aug-2012 16:17.
25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is waiting. Cannot find any appendable volumes.
Please use the "label" command to create a new Volume for:
    Storage:      "Drive-1" (/dev/nst0)
    Pool:         Pool-backup4
    Media type:   LTO-4
25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error sending 65536 bytes to Storage daemon:160.220.129.201:9103: ERR=Connection timed out
25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network send error to SD. ERR=Connection timed out
25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error sending -6 bytes to client:160.220.129.203:36643: ERR=Connection reset by peer
25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  31
  Job:                    backup4.2012-08-25_09.08.00_15
  Backup Level:           Full
  Client:                 "ttl011-fd" 5.2.5 (26Jan12) x86_64-pc-linux-gnu,ubuntu,12.04
  FileSet:                "backup" 2012-05-14 19:58:03
  Pool:                   "Pool-backup4" (From Job resource)
  Catalog:                "MyCatalog" (From Client resource)
  Storage:                "Tape1" (From Job resource)
  Scheduled time:         25-Aug-2012 09:08:00
  Start time:             25-Aug-2012 09:08:03
  End time:               25-Aug-2012 16:33:27
  Elapsed time:           7 hours 25 mins 24 secs
  Priority:               10
  FD Files Written:       252,220
  SD Files Written:       0
  FD Bytes Written:       1,151,889,284,025 (1.151 TB)
  SD Bytes Written:       0 (0 B)
  Rate:                   43103.2 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         PB4T1
  Volume Session Id:      2
  Volume Session Time:    1345456088
  Last Volume Bytes:      0 (0 B)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Wait for new Volume
  Termination:            *** Backup Error ***
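
As a sanity check on the figures in the report above, the following is a minimal Python sketch that recomputes the elapsed time and transfer rate, and measures the gap between the SD starting to wait for a new volume and the FD reporting the timeout. All values are copied from the log; the assumption that Bacula's "KB" means 1000 bytes is mine, and the log timestamps only have minute resolution, so the gap is approximate.

```python
from datetime import datetime, timedelta

# Figures copied from the job report above.
fd_bytes = 1_151_889_284_025
start = datetime(2012, 8, 25, 9, 8, 3)
end = datetime(2012, 8, 25, 16, 33, 27)

elapsed = end - start
# Assumption: the report's "KB/s" uses 1 KB = 1000 bytes.
rate_kb_s = fd_bytes / elapsed.total_seconds() / 1000

# Log timestamps (minute resolution only).
sd_waiting = datetime(2012, 8, 25, 16, 18)   # "Cannot find any appendable volumes"
fd_timeout = datetime(2012, 8, 25, 16, 33)   # "ERR=Connection timed out"
gap = fd_timeout - sd_waiting

print("elapsed:", elapsed)
print("rate KB/s:", round(rate_kb_s, 1))
print("gap between SD wait and FD timeout:", gap)
```

If the arithmetic holds, the rate matches the reported value, which suggests the FD was sending at full speed until the end of the first tape and that the connection failed roughly a quarter of an hour after the SD began waiting.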

 

 

 

 

After this error occurred for the first time, I configured my file and storage daemons to use a heartbeat, but the problem is still present.

 

dmesg shows the following:

 

[1701840.528104] INFO: task bacula-sd:25961 blocked for more than 120 seconds.
[1701840.528113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1701840.528120] bacula-sd       D ffffffff81806080     0 25961      1 0x00000000
[1701840.528130]  ffff880118dc7bf8 0000000000000082 ffff880118dc7ba8 ffff880118dc7ba8
[1701840.528141]  ffff880118dc7fd8 ffff880118dc7fd8 ffff880118dc7fd8 0000000000013780
[1701840.528151]  ffffffff81c0d020 ffff880115f22de0 ffff880118dc7bd8 7fffffffffffffff
[1701840.528160] Call Trace:
[1701840.528178]  [<ffffffff81657d8f>] schedule+0x3f/0x60
[1701840.528188]  [<ffffffff816583d5>] schedule_timeout+0x2a5/0x320
[1701840.528199]  [<ffffffff8130d117>] ? kobject_put+0x27/0x60
[1701840.528208]  [<ffffffff81659f85>] ? _raw_spin_lock_irq+0x15/0x20
[1701840.528217]  [<ffffffff81657bcf>] wait_for_common+0xdf/0x180
[1701840.528227]  [<ffffffff8105fae0>] ? try_to_wake_up+0x200/0x200
[1701840.528235]  [<ffffffff81657d4d>] wait_for_completion+0x1d/0x20
[1701840.528256]  [<ffffffffa0186b1a>] st_do_scsi.constprop.17+0x12a/0x280 [st]
[1701840.528267]  [<ffffffffa018c033>] do_load_unload.part.13+0x98/0x12f [st]
[1701840.528278]  [<ffffffffa018b04a>] st_ioctl+0xa6a/0xb80 [st]
[1701840.528288]  [<ffffffff8118a16a>] do_vfs_ioctl+0x8a/0x340
[1701840.528296]  [<ffffffff8117823d>] ? vfs_read+0x10d/0x180
[1701840.528303]  [<ffffffff8118a4b1>] sys_ioctl+0x91/0xa0
[1701840.528311]  [<ffffffff81662282>] system_call_fastpath+0x16/0x1b
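
The "D" in the trace above is the uninterruptible-sleep state, here inside an ioctl on the tape driver (st). To check whether the process is still stuck in that state, one option is to read its state from /proc. The sketch below is only an illustration: the function names are made up for this example, the sample line is hypothetical and shortened, and the PID is the one from the trace, which will differ after a daemon restart.

```python
def parse_proc_stat(line):
    """Return (pid, comm, state) from one line of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def is_uninterruptible(pid):
    """True if the process is currently in state D (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_stat(f.read())[2] == "D"


# Hypothetical, shortened sample line for illustration only.
sample = "25961 (bacula-sd) D 1 25961 25961 0 -1 4202560"
print(parse_proc_stat(sample))
```

On the backup server, calling `is_uninterruptible(25961)` repeatedly while the job is waiting for a tape would show whether bacula-sd stays blocked or only passes through the D state briefly.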

 

 

Any suggestions as to what is going wrong?

 

Thank you in advance.

Regards,

Markus
