Re: [Bacula-users] Solaris "Packet size too big" failures

2009-01-27 05:15:01
From: Allan Black <Allan.Black AT btconnect DOT com>
To: bacula-users AT lists.sourceforge DOT net
Date: Tue, 27 Jan 2009 10:12:03 +0000
Jason Dixon wrote:
> a "Packet size too big" error.  The Director resides on a global zone in
> Solaris x86.  I've managed to capture a truss during one of the
> failures:
> http://mirrors.omniti.com/bacula/bacula.truss

Very strange. Everything seems to be going normally:

14106/1:        pollsys(0x08046EF0, 1, 0x00000000, 0x00000000)  = 1
14106/1:                fd=4  ev=POLLRDNORM rev=POLLRDNORM
14106/1:        accept(4, 0x08047D90, 0x08047DA0, SOV_DEFAULT)  = 5
14106/1:                AF_INET  name = 10.80.117.97  port = 40563
[...]
14106/67:       read(5, "\0\0\0  ", 4)                          = 4
14106/67:       read(5, " H e l l o   D i r e c t".., 32)       = 32
[Incoming connection from the director]

[...]
[The director tells the FD to back up /data/bacu<something>]

14106/67:       so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 6
14106/67:       setsockopt(6, SOL_SOCKET, SO_KEEPALIVE, 0xFE65ECBC, 4, SOV_DEFAULT) = 0
14106/67:       connect(6, 0x080FBEBC, 16, SOV_DEFAULT)         = 0
14106/67:               AF_INET  name = 10.80.117.97  port = 9103
[FD opens connection to the SD]

14106/67:       open64("/data/bacula/work/bacula.sql", O_RDONLY) = 7
14106/67:       write(6, "\0\0\005 1   2   0", 9)               = 9
14106/67:       read(7, " - -\n - -   P o s t g r".., 65536)    = 65536
14106/67:       write(6, "\001\0\0 - -\n - -   P o".., 65540)   = 65540
14106/67:       read(7, "   B J J F E L   B J J F".., 65536)    = 65536
14106/67:       write(6, "\001\0\0   B J J F E L  ".., 65540)   = 65540
[The FD opens /data/bacula/work/bacula.sql and passes the contents to the SD]
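
Note that each write to the SD is exactly 4 bytes longer than the matching read
from the file (65540 vs. 65536), and the Director connection earlier shows a
4-byte read followed by a 32-byte read. That pattern is consistent with a 4-byte
length header in front of each packet. A minimal Python sketch of that framing,
assuming a signed big-endian 32-bit length (an inference from the trace, not
taken from the Bacula source; `frame` and `unframe` are made-up names):

```python
import struct

# Assumption: 4-byte signed big-endian length header, inferred from the trace.
HEADER = struct.Struct(">i")


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length, as the 65540-byte writes suggest."""
    return HEADER.pack(len(payload)) + payload


def unframe(packet: bytes):
    """Return (declared_length, payload) for one framed packet."""
    (length,) = HEADER.unpack_from(packet, 0)
    return length, packet[HEADER.size:HEADER.size + length]


if __name__ == "__main__":
    packet = frame(b"x" * 65536)
    print(len(packet))          # 65536 bytes of data plus the 4-byte header
    print(frame(b"a" * 32)[:4])  # header for a 32-byte message
```

With this framing a 32-byte message gets the header bytes 00 00 00 20, and 0x20
is an ASCII space, which would fit the "\0\0\0  " seen in the first 4-byte read.
If a receiver ever picked up a header from the wrong place in the stream, it
would decode arbitrary data as a length, which is one way an error like "Packet
size too big" could arise.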

[...]
14106/67:       read(7, " 8 1 6 6 7\t 4 4 3\t 5 3".., 65536)    = 65536
14106/67:       write(6, "\001\0\0 8 1 6 6 7\t 4 4".., 65540)   = 65540
14106/67:       read(7, " i e   A   A   C\t 2 Z q".., 65536)    = 65536
14106/67:       write(6, "\001\0\0 i e   A   A   C".., 65540)   = 65540
14106/67:       read(7, " B J J 1 / q   B G Q P L".., 65536)    = 65536
14106/2:        lwp_park(0xFE76EF2C, 0)         (sleeping...)
14106/2:                timeout: 29.999928355 sec
14106/68:       pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000)  = 0
14106/68:               fd=6  ev=POLLRDNORM rev=0
14106/68:               timeout: 5.000000000 sec
14106/67:       write(6, 0x08114854, 65540)     (sleeping...)
14106/68:       pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000)  = 1
14106/68:               fd=6  ev=POLLRDNORM rev=POLLRDNORM
14106/68:               timeout: 5.000000000 sec
14106/67:       write(6, "\001\0\0 B J J 1 / q   B".., 65540)   = 65540
14106/67:       read(7, " 6 w P q 8 2 V q 2 3 X n".., 65536)    = 65536
14106/68:       read(6, 0xFE55FF80, 4)                          Err#131 ECONNRESET
14106/68:       lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
14106/68:       lwp_exit()
14106/67:       write(6, "\001\0\0 6 w P q 8 2 V q".., 65540)   Err#32 EPIPE
14106/67:           Received signal #13, SIGPIPE [ignored]
[This is where it goes wrong]

Just after half way through the above, this happens:

14106/68:       pollsys(0xFE55FE10, 1, 0xFE55FEC8, 0x00000000)  = 1
14106/68:               fd=6  ev=POLLRDNORM rev=POLLRDNORM
14106/68:               timeout: 5.000000000 sec

which indicates that a "normal" incoming event has occurred on file descriptor
6, which is the connection to the SD. 3 lines later,

14106/68:       read(6, 0xFE55FF80, 4)                          Err#131 ECONNRESET

The FD attempts to read from the SD, and gets "Connection reset by peer". From
the job report you posted, it doesn't look like the SD is crashing/restarting,
nor is the machine rebooting.

Something, somewhere though, is interfering with the connection between the FD
and the SD. Sorry to say this, but you may have to truss the SD!
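
When going through the SD truss, it may help to pull out only the failed calls
on the socket in question. A rough Python sketch, assuming one truss record per
line (wrapped lines would need to be rejoined first) and the "Err#<n> <NAME>"
format seen in the trace above; `failed_calls` is a hypothetical helper, not
part of truss or Bacula:

```python
import re

# Matches records like:
#   14106/68:  read(6, 0xFE55FF80, 4)   Err#131 ECONNRESET
# Groups: pid, lwp, syscall name, argument text, errno number, errno name.
ERR_LINE = re.compile(r"^(\d+)/(\d+):\s+(\w+)\((.*)\)\s+Err#(\d+)\s+(\w+)")


def failed_calls(lines, fd=None):
    """Return (lwp, syscall, errno, name) for each failed call.

    If fd is given, keep only calls whose first argument is that descriptor.
    """
    found = []
    for line in lines:
        match = ERR_LINE.match(line)
        if match is None:
            continue
        _pid, lwp, call, args, errno, name = match.groups()
        first_arg = args.split(",", 1)[0].strip()
        if fd is not None and first_arg != str(fd):
            continue
        found.append((int(lwp), call, int(errno), name))
    return found


# Sample records copied from the FD trace, with wrapped lines rejoined.
SAMPLE = [
    r'14106/67:       read(7, " 6 w P q 8 2 V q 2 3 X n".., 65536)    = 65536',
    r'14106/68:       read(6, 0xFE55FF80, 4)                          Err#131 ECONNRESET',
    r'14106/68:       lwp_exit()',
    r'14106/67:       write(6, "\001\0\0 6 w P q 8 2 V q".., 65540)   Err#32 EPIPE',
]

if __name__ == "__main__":
    for record in failed_calls(SAMPLE, fd=6):
        print(record)
```

Run against the FD excerpt, this picks out the ECONNRESET on the read and the
EPIPE on the following write. The same filter on the SD side should show
whether the SD saw an error of its own before the reset.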

Allan
