Bacula-users

Re: [Bacula-users] Win32 FD / Write error sending N bytes to Storage daemon

2011-06-06 12:50:06
Subject: Re: [Bacula-users] Win32 FD / Write error sending N bytes to Storage daemon
From: Yann Cézard <yann.cezard AT univ-pau DOT fr>
To: bacula-users AT lists.sourceforge DOT net
Date: Mon, 06 Jun 2011 18:46:48 +0200
Le 20/05/2011 10:22, Yann Cézard a écrit :
Hi everyone,

Since a few weeks, I am facing a really strange problem with
my win32 bacula-fd.

It seems that the problem started when I upgraded my SD + DIR
to the 5.0.X (I was still using the 2.4.4 until that time).
The problem is that almost (this is only observable on Full backups
of several GB) all my win32-fd Full backups now fails with the following errors :
- with a 2.4.4 client :
12-May 22:22 msadpau-fd JobId 2538: Generate VSS snapshots. Driver="VSS Win 2003", Drive(s)="E"
13-May 00:00 msadpau-fd JobId 2538: Fatal error: ../../filed/backup.c:892 Network send error to SD. ERR=Input/output error
13-May 00:02 msadpau-fd JobId 2538: VSS Writer (BackupComplete): "System Writer", State: 0x1 (VSS_WS_STABLE)
[...]
13-May 00:02 msadpau-fd JobId 2538: VSS Writer (BackupComplete): "NTDS", State: 0x1 (VSS_WS_STABLE)
13-May 00:02 backuppa-sd JobId 2538: JobId=2538 Job="msad-stockage-pau.2011-05-12_22.00.00_37" marked to be canceled.
13-May 00:02 backuppa-sd JobId 2538: Job write elapsed time = 01:39:44, Transfer rate = 6.296 M Bytes/second
13-May 00:02 backuppa-sd JobId 2538: Error: bsock.c:518 Read error from client:10.1.2.17:36643: ERR=Connection reset by peer
13-May 00:02 backuppa-dir JobId 2538: Error: Bacula backuppa-dir 5.0.2 (28Apr10): 13-May-2011 00:02:15
  Build OS:               x86_64-pc-linux-gnu debian squeeze/sid
  Backup Level:           Full
  Client:                 "BLABLABLA" 2.4.4 (28Dec08) Linux,Cross-compile,Win32
- after upgrading the client to 5.0.3 (error message is more verbose, but
the problem is still there) :
16-May 17:22 msadpau-fd JobId 2552: Generate VSS snapshots. Driver="VSS Win 2003", Drive(s)="E"
16-May 20:14 msadpau-fd JobId 2552: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending 65562 bytes to Storage daemon:backuppa:9103: ERR=Input/output error
16-May 20:14 msadpau-fd JobId 2552: Fatal error: /home/kern/bacula/k/bacula/src/filed/backup.c:1024 Network send error to SD. ERR=Input/output error
16-May 20:15 msadpau-fd JobId 2552: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:339 Socket has errors=1 on call to Storage daemon:backuppa:9103
16-May 20:16 msadpau-fd JobId 2552: VSS Writer (BackupComplete): "System Writer", State: 0x1 (VSS_WS_STABLE)
[...]
16-May 20:16 msadpau-fd JobId 2552: VSS Writer (BackupComplete): "WMI Writer", State: 0x1 (VSS_WS_STABLE)
16-May 20:16 backuppa-sd JobId 2552: JobId=2552 Job="msad-stockage-pau.2011-05-16_17.22.21_58" marked to be canceled.
16-May 20:16 backuppa-sd JobId 2552: Job write elapsed time = 02:53:20, Transfer rate = 6.234 M Bytes/second
16-May 20:16 backuppa-sd JobId 2552: Error: bsock.c:518 Read error from client:10.1.2.17:36643: ERR=Connection reset by peer
16-May 20:16 backuppa-dir JobId 2552: Error: Bacula backuppa-dir 5.0.2 (28Apr10): 16-May-2011 20:16:02
  Build OS:               x86_64-pc-linux-gnu debian squeeze/sid
  Backup Level:           Full
  Client:                 "BLABLABLA" 5.0.3 (04Aug10) Linux,Cross-compile,Win32
- after upgrading the DIR/SD from 5.0.2 to 5.0.3 (Debian squeeze => wheezy) :
19-mai 10:07 msadpau-fd JobId 2565: Generate VSS snapshots. Driver="VSS Win 2003", Drive(s)="E"
19-mai 10:09 msadpau-fd JobId 2565: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending 65536 bytes to Storage daemon:backuppa:9103: ERR=Input/output error
19-mai 10:09 msadpau-fd JobId 2565: Fatal error: /home/kern/bacula/k/bacula/src/filed/backup.c:1024 Network send error to SD. ERR=Input/output error
19-mai 10:11 msadpau-fd JobId 2565: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:339 Socket has errors=1 on call to Storage daemon:backuppa:9103
19-mai 10:11 msadpau-fd JobId 2565: VSS Writer (BackupComplete): "System Writer", State: 0x1 (VSS_WS_STABLE)
[...]
19-mai 10:11 msadpau-fd JobId 2565: VSS Writer (BackupComplete): "WMI Writer", State: 0x1 (VSS_WS_STABLE)
19-mai 10:11 backuppa-sd JobId 2565: JobId=2565 Job="msad-stockage-pau.2011-05-19_10.03.34_03" marked to be canceled.
19-mai 10:11 backuppa-sd JobId 2565: Error: bsock.c:537 Read error from client:10.1.2.17:36643: ERR=Connexion ré-initialisée par le correspondant
19-mai 10:11 backuppa-sd JobId 2565: Job write elapsed time = 00:04:12, Transfer rate = 11.61 M Bytes/second
19-mai 10:11 backuppa-dir JobId 2565: Error: Bacula backuppa-dir 5.0.3 (04Aug10): 19-mai-2011 10:11:35
  Build OS:               x86_64-pc-linux-gnu debian wheezy/sid
  Backup Level:           Full
  Client:                 "Serveur MSAD Pau" 5.0.3 (04Aug10) Linux,Cross-compile,Win32
After doing a lot of search through the list, and "googling", I found differents
testimonies of similar problems, but no real solution.

What I tried :
- turn off the anti-virus software : no change
- Heartbeat Level in Storage and Client : no change
- reducing keepalive on SD : no change
- changing the Network Buffer Size : no change :
19-mai 17:29 msadpau-fd JobId 2577: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending 32768 bytes to Storage daemon:backuppa:9103: ERR=Input/output error

19-mai 17:36 msadpau-fd JobId 2579: Error: /home/kern/bacula/k/bacula/src/lib/bsock.c:393 Write error sending 131072 bytes to Storage daemon:backuppa:9103: ERR=Input/output error
- I also turn on debuging/tracing on FD and SD size, didn't help so much.
  Here is the FD side :
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:1028-0 Send data to SD len=65536
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:1028-0 Send data to SD len=65536
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:1028-0 Send data to SD len=65536
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:1028-0 Send data to SD len=65536
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:1028-0 Send data to SD len=65536
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/heartbeat.c:96-0 wait_intr=0 stop=0
/// Here, the SD side is showing that it's doing a read and both side seems to wait for about 2 minutes
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/heartbeat.c:96-0 wait_intr=0 stop=0
/// and then the connexion is closed.

msadpau-fd: /home/kern/bacula/k/bacula/src/filed/heartbeat.c:96-0 wait_intr=0 stop=0
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/heartbeat.c:91-0 Got BNET_SIG 0 from SD
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/heartbeat.c:96-0 wait_intr=1 stop=1
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/backup.c:211-0 end blast_data ok=0
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:1660-0 Error in blast_data.
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:282-0 Quit command loop. Canceled=1
msadpau-fd: /home/kern/bacula/k/bacula/src/lib/runscript.c:110-0 runscript: running all RUNSCRIPT object (ClientAfterJob) JobStatus=f
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:309-0 End FD msg: 2800 End Job TermCode=102 JobFiles=21762 ReadBytes=2921610796 JobBytes=2921545260 Errors=2 VSS=1 Encrypt=0

msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:388-0 Calling term_find_files
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:391-0 Done with term_find_files
msadpau-fd: /home/kern/bacula/k/bacula/src/win32/compat/compat.cpp:210-0 Enter wchar_win32_path
msadpau-fd: /home/kern/bacula/k/bacula/src/win32/compat/compat.cpp:394-0 Leave wchar_win32_path=\
msadpau-fd: /home/kern/bacula/k/bacula/src/lib/jcr.c:181-0 write_last_jobs seek to 192
msadpau-fd: /home/kern/bacula/k/bacula/src/lib/mem_pool.c:369-0 garbage collect memory pool
msadpau-fd: /home/kern/bacula/k/bacula/src/filed/job.c:393-0 Done with free_jcr
msadpau-fd: /home/kern/bacula/k/bacula/src/lib/mem_pool.c:369-0 garbage collect memory pool



But now the strangiest thing is that, in order to have the job fails faster
during my tests, I turn off the compression feature. And the result is that the
job failed "a lot" faster, but not due to compression. In fact, with compression,
sometime the job failed at 40GB written, sometime at 100GB. Without compression,
it failed at 2GB, always at the exactly same number of bytes written.
So i decided to make a test in the other way, and set the compression up
(was GZIP2 at start, set it at GZIP4) : the job ran well !

So now I am wondering what can I do to found where the problem is exactly ?
Is it on FD side ? on SD side ? It seems to happen when the data is send too fast,
or too much load on the client side ?

Other facts :
- problem is observed on different kind of hardware
- problem is observed with clients on the same network than the
  the backup server, as well as with client on another network.
- no problem at all on linux client with bigger sized jobs.
- there is no heavy load on the DIR/SD/MySQL side
- the different servers which have the problem doesn't show
  any problem of network connectivity.

Any clue ? anything that I can do to try to trace the problem ?

Regards
Nobody has a clue ?

It's really annoying, a lot of full backups are failling, it's getting really problematic.
We tried to update the NIC driver, not better, I tried to deactivate TCP offload in the NIC
driver, not better (even worse in fact, the job failed faster).

I found differents testimonies of such problems on list archives, but no solution.
Did everybody give up and go to another backup solution for Windows client ?
It seems to be Windows/network related, but the fact is that bacula is the only application
that has problem; so it's really weird.

Regards.
-- 
Yann Cézard  -  infrastructures - administrateur systèmes serveurs
Centre de ressources informatiques    -     http://cri.univ-pau.fr
Université de Pau et des pays de l'Adour -  http://www.univ-pau.fr
------------------------------------------------------------------------------
Simplify data backup and recovery for your virtual environment with vRanger.
Installation's a snap, and flexible recovery options mean your data is safe,
secure and there when you need it. Discover what all the cheering's about.
Get your free trial download today. 
http://p.sf.net/sfu/quest-dev2dev2 
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users