Hi,
this year I switched our backup server to Ubuntu 12.04.
For the first month everything was ok until the last Ubuntu update.
The backup structure is as follows:
On the storage server, a huge disk storage is attached. Here only the file daemon is running.
On the backup server the director and the storage daemons are running.
Once a week the complete files from the storage server are written onto tape.
Two jobs (two tape drives) with a delay, but they might run simultaneously for some time.
OS is in both cases Ubuntu 12.04 64 bit.
Kernel: 3.2.0-27
Bacula taken from the Ubuntu packages: 5.2.5-0ubuntu6.1
As the amount of data is rather big, the jobs might request a second tape from their pool. We don’t use an auto-changer, so I need to replace the tape on Monday morning. Then the job continues.
That way it worked for about 3 months until the latest update, which contained an update of the bacula packages.
The update was done on both machines.
Now the following error occurs:
25-Aug 09:08 ttl010-dir JobId 31: Start Backup JobId 31, Job=backup4.2012-08-25_09.08.00_15
25-Aug 09:08 ttl010-dir JobId 31: Using Device "Drive-1"
25-Aug 09:08 ttl010-sd JobId 31: Recycled volume "PB4T1" on device "Drive-1" (/dev/nst0), all previous data lost.
25-Aug 09:08 ttl010-dir JobId 31: Volume used once. Marking Volume "PB4T1" as Used.
25-Aug 16:17 ttl010-sd JobId 31: End of Volume "PB4T1" at 1152:13356 on device "Drive-1" (/dev/nst0). Write of 64512 bytes got -1.
25-Aug 16:17 ttl010-sd JobId 31: Re-read of last block succeeded.
25-Aug 16:17 ttl010-sd JobId 31: End of medium on Volume "PB4T1" Bytes=1,152,787,894,272 Blocks=17,869,355 at 25-Aug-2012 16:17.
25-Aug 16:18 ttl010-sd JobId 31: Job backup4.2012-08-25_09.08.00_15 is waiting. Cannot find any appendable volumes.
Please use the "label" command to create a new Volume for:
Storage: "Drive-1" (/dev/nst0)
Pool: Pool-backup4
Media type: LTO-4
25-Aug 16:33 ttl011-fd JobId 31: Error: bsock.c:389 Write error sending 65536 bytes to Storage daemon:160.220.129.201:9103: ERR=Connection timed out
25-Aug 16:33 ttl011-fd JobId 31: Fatal error: backup.c:1190 Network send error to SD. ERR=Connection timed out
25-Aug 16:33 ttl010-sd JobId 31: Error: bsock.c:389 Write error sending -6 bytes to client:160.220.129.203:36643: ERR=Connection reset by peer
25-Aug 16:33 ttl010-dir JobId 31: Error: Bacula ttl010-dir 5.2.5 (26Jan12):
Build OS: x86_64-pc-linux-gnu ubuntu 12.04
JobId: 31
Job: backup4.2012-08-25_09.08.00_15
Backup Level: Full
Client: "ttl011-fd" 5.2.5 (26Jan12) x86_64-pc-linux-gnu,ubuntu,12.04
FileSet: "backup" 2012-05-14 19:58:03
Pool: "Pool-backup4" (From Job resource)
Catalog: "MyCatalog" (From Client resource)
Storage: "Tape1" (From Job resource)
Scheduled time: 25-Aug-2012 09:08:00
Start time: 25-Aug-2012 09:08:03
End time: 25-Aug-2012 16:33:27
Elapsed time: 7 hours 25 mins 24 secs
Priority: 10
FD Files Written: 252,220
SD Files Written: 0
FD Bytes Written: 1,151,889,284,025 (1.151 TB)
SD Bytes Written: 0 (0 B)
Rate: 43103.2 KB/s
Software Compression: None
VSS: no
Encryption: no
Accurate: no
Volume name(s): PB4T1
Volume Session Id: 2
Volume Session Time: 1345456088
Last Volume Bytes: 0 (0 B)
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Wait for new Volume
Termination: *** Backup Error ***
After this error occurred the first time I modified my file and storage daemons to use a heartbeat. But the problem is still present.
Dmesg gives the following:
[1701840.528104] INFO: task bacula-sd:25961 blocked for more than 120 seconds.
[1701840.528113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1701840.528120] bacula-sd D ffffffff81806080 0 25961 1 0x00000000
[1701840.528130] ffff880118dc7bf8 0000000000000082 ffff880118dc7ba8 ffff880118dc7ba8
[1701840.528141] ffff880118dc7fd8 ffff880118dc7fd8 ffff880118dc7fd8 0000000000013780
[1701840.528151] ffffffff81c0d020 ffff880115f22de0 ffff880118dc7bd8 7fffffffffffffff
[1701840.528160] Call Trace:
[1701840.528178] [<ffffffff81657d8f>] schedule+0x3f/0x60
[1701840.528188] [<ffffffff816583d5>] schedule_timeout+0x2a5/0x320
[1701840.528199] [<ffffffff8130d117>] ? kobject_put+0x27/0x60
[1701840.528208] [<ffffffff81659f85>] ? _raw_spin_lock_irq+0x15/0x20
[1701840.528217] [<ffffffff81657bcf>] wait_for_common+0xdf/0x180
[1701840.528227] [<ffffffff8105fae0>] ? try_to_wake_up+0x200/0x200
[1701840.528235] [<ffffffff81657d4d>] wait_for_completion+0x1d/0x20
[1701840.528256] [<ffffffffa0186b1a>] st_do_scsi.constprop.17+0x12a/0x280 [st]
[1701840.528267] [<ffffffffa018c033>] do_load_unload.part.13+0x98/0x12f [st]
[1701840.528278] [<ffffffffa018b04a>] st_ioctl+0xa6a/0xb80 [st]
[1701840.528288] [<ffffffff8118a16a>] do_vfs_ioctl+0x8a/0x340
[1701840.528296] [<ffffffff8117823d>] ? vfs_read+0x10d/0x180
[1701840.528303] [<ffffffff8118a4b1>] sys_ioctl+0x91/0xa0
[1701840.528311] [<ffffffff81662282>] system_call_fastpath+0x16/0x1b
Any suggestions what is going wrong?
Thank you in advance.
Regards,
Markus