Bacula-users

Re: [Bacula-users] Solaris "Packet size too big" failures

2008-12-23 12:13:34
Subject: Re: [Bacula-users] Solaris "Packet size too big" failures
From: Conor Edberg <conor AT lsit.ucsb DOT edu>
To: Jason Dixon <jdixon AT omniti DOT com>
Date: Tue, 23 Dec 2008 08:53:03 -0800
On Dec 9, 2008, at 2:36 PM, Jason Dixon wrote:

> On Tue, Dec 09, 2008 at 09:26:38PM +0000, Allan Black wrote:
>> Jason Dixon wrote:
>>> Alas, I spoke too soon.  The CatalogBackup job failed again last  
>>> night,
>>> usual symptoms.
>>
>> OK. need to find out what the FD is doing. I would recommend:
>>
>> truss -o filename -f -a -e -v all -w 2 -p <FD pid>
>>
>> Is it possible to run the catalog backup during the day, by hand?
>> That way you could avoid leaving the truss running (and collecting
>> data) all night.
>
> I've run it 6 times today, no failures yet.  Frustrating.  Here are  
> the
> results showing all the previous failures, then the successes today.
>
> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole |  
> grep BackupCatalog | grep '| f '
> |    60 | BackupCatalog            | 2008-11-08 00:36:12 | B    | F  
> |          1 |     965,869,568 | f         |
> |   116 | BackupCatalog            | 2008-11-11 23:14:05 | B    | F  
> |          1 |     614,727,680 | f         |
> |   286 | BackupCatalog            | 2008-11-22 23:14:28 | B    | F  
> |          1 |   5,278,400,512 | f         |
> |   298 | BackupCatalog            | 2008-11-23 23:14:32 | B    | F  
> |          1 |   4,723,965,952 | f         |
> |   336 | BackupCatalog            | 2008-11-26 23:14:30 | B    | F  
> |          1 |   3,979,280,384 | f         |
> |   361 | BackupCatalog            | 2008-11-28 23:16:11 | B    | F  
> |          1 |   2,101,936,128 | f         |
> |   385 | BackupCatalog            | 2008-11-30 23:13:21 | B    | F  
> |          1 |   3,863,216,128 | f         |
>
> -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole |  
> grep BackupCatalog | grep '2008-12-09'
> |   486 | BackupCatalog            | 2008-12-09 16:46:51 | B    | F  
> |          1 |   3,183,911,677 | T         |
> |   487 | BackupCatalog            | 2008-12-09 16:55:42 | B    | F  
> |          1 |   3,183,912,220 | T         |
> |   488 | BackupCatalog            | 2008-12-09 17:03:22 | B    | F  
> |          1 |   3,183,912,724 | T         |
> |   489 | BackupCatalog            | 2008-12-09 17:18:03 | B    | F  
> |          1 |   3,183,913,233 | T         |
> |   490 | BackupCatalog            | 2008-12-09 17:23:26 | B    | F  
> |          1 |   3,183,913,740 | T         |
> |   491 | BackupCatalog            | 2008-12-09 17:29:53 | B    | F  
> |          1 |   3,183,914,246 | T         |
>
>
> -- 
> Jason Dixon
> OmniTI Computer Consulting, Inc.
> jdixon AT omniti DOT com
> 443.325.1357 x.241
>



I'm seeing similar problems to the one Jason described.  However, in  
addition to the 'Packet Size too big' failure, I see a variety of  
other error messages as well.  FD, SD, and dir are all running on the  
same machine, disk to disk backup. Some examples of the failures:

20-Dec 05:01 server-sd JobId 328: Fatal error: append.c:190 File index  
from FD not positive or sequential
20-Dec 05:01 server-sd JobId 328: Job write elapsed time = 00:18:22,  
Transfer rate = 10.48 M bytes/second
20-Dec 05:01 server-fd JobId 328: Fatal error: backup.c:1003 Network  
send error to SD. ERR=Broken pipe
20-Dec 05:01 server-fd JobId 328: Error: bsock.c:341 Write error  
sending -1 bytes to Storage daemon:192.168.1.123:9103: ERR=Broken pipe
20-Dec 05:01 server-dir JobId 328: Error: Bacula server-dir 2.5.16- 
beta (20Oct08): 20-Dec-2008 05:01:23
   Build OS:               i386-pc-solaris2.11 solaris 5.11

05-Dec 13:30 server-fd JobId 275: Fatal error: backup.c:957 Network  
send error to SD. ERR=I/O error
05-Dec 13:30 server-sd JobId 275: Fatal error: append.c:259 Network  
error on data channel. ERR=No data available
05-Dec 13:30 server-sd JobId 275: Job write elapsed time = 01:36:26,  
Transfer rate = 50.96 M bytes/second
05-Dec 13:30 server-sd JobId 275: Error: bsock.c:485 Read error from  
client:192.168.1.123:36643: ERR=No data available
05-Dec 13:30 server-dir JobId 275: Bacula server-dir 2.5.16-beta  
(20Oct08): 05-Dec-2008 13:30:43
   Build OS:               i386-pc-solaris2.11 solaris 5.11

10-Oct 10:15 server-sd JobId 130: Fatal error: append.c:180 Malformed  
data header from FD:
10-Oct 10:15 server-sd JobId 130: Job write elapsed time = 00:03:34,  
Transfer rate = 13.56 M bytes/second\
10-Oct 10:15 server-fd JobId 130: Fatal error: backup.c:892 Network  
send error to SD. ERR=Broken pipe\
10-Oct 10:15 server-dir JobId 130: Error: Bacula server-dir 2.4.2  
(26Jul08): 10-Oct-2008 10:15:49\
   Build OS:               i386-pc-solaris2.11 solaris 5.11

19-Sep 23:05 server-sd JobId 23: Ready to append to end of Volume  
"test1" size=140938726135
19-Sep 23:20 server-sd JobId 23: Fatal error: append.c:159 Error  
reading data header from FD. ERR=No data available
19-Sep 23:20 server-sd JobId 23: Job write elapsed time = 00:15:04,  
Transfer rate = 8.229 M bytes/second
19-Sep 23:20 server-sd JobId 23: Fatal error: bsock.c:415 Packet size  
too big from "client:127.0.0.1:36643. Terminating connection.
19-Sep 23:20 server-fd JobId 23: Fatal error: backup.c:892 Network  
send error to SD. ERR=Broken pipe
19-Sep 23:20 server-fd JobId 23: Error: bsock.c:306 Write error  
sending 32768 bytes to Storage daemon:127.0.0.1:9103: ERR=Broken pipe
19-Sep 23:20 server-dir JobId 23: Error: Bacula server-dir 2.4.2  
(26Jul08): 19-Sep-2008 23:20:18
   Build OS:               i386-pc-solaris2.11 solaris 5.11
   JobId:                  23
   Job:                    Client1.2008-09-19_23.05.49



Here's the truss of a failure (I think this is the relevant bit),  
using the options from Alan's message above:

14457/53:       write(6, "\0\080\0 m k Y\r\n M 3 m".., 32772)   = 32772
14457/53:       read(14, " a p L S p L 9 f 6 + l 2".., 32768)   = 32768
14457/53:       write(6, "\0\080\0 a p L S p L 9 f".., 32772)   = 32772
14457/53:       read(14, " I T F a T F B i m 4 S V".., 32768)   = 32768
14457/53:       write(6, "\0\080\0 I T F a T F B i".., 32772)   = 32772
14457/53:       read(14, " C O / B f / / v\r\n 9 f".., 32768)   = 32768
14457/53:       write(6, "\0\080\0 C O / B f / / v".., 32772)   = 32772
14457/54:       pollsys(0xFE2DFDF0, 1, 0xFE2DFE98, 0x00000000)  = 1
14457/54:               fd=6  ev=POLLRDNORM rev=POLLRDNORM
14457/54:               timeout: 5.000000000 sec
14457/53:       read(14, " 6 d L r f 3 3\r\n 3 p q".., 32768)   = 32768
14457/54:       read(6, 0xFE2DFF7C, 4)                           
Err#131 ECONNRESET
14457/54:       lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) =  
0xFFBFFEFF [0x0000FFFF]
14457/54:       lwp_exit()
14457/53:       write(6, "\0\080\0 6 d L r f 3 3\r".., 32772)   Err#32  
EPIPE
14457/53:           Received signal #13, SIGPIPE [ignored]
14457/53:       time()                                          =  
1229375022
14457/53:       time()                                          =  
1229375022
14457/53:       write(4, "\0\0\094 J m s g   J o b".., 152)     = 152
14457/53:       close(14)                                       = 0
14457/53:       close(13)                                       = 0
14457/53:       close(12)                                       = 0
14457/53:       close(11)                                       = 0
14457/53:       close(10)                                       = 0
14457/53:       close(9)                                        = 0
14457/53:       close(8)                                        = 0
14457/53:       close(7)                                        = 0
14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
14457/53:               tmout: 0.050000000 sec
14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
14457/53:               tmout: 0.050000000 sec
14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
14457/53:               tmout: 0.050000000 sec
14457/53:       nanosleep(0xFE3DEE0C, 0x00000000)               = 0
14457/53:               tmout: 0.050000000 sec


--
Conor Edberg
Letters & Science IT
UC Santa Barbara

------------------------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users