Hi,
Please excuse me if I am using this wrong; in all my years
in IT, this seems to be the first time I have used a mailing list for support.
(I’m usually pretty good at the whole RTFM thing.)
We have a backup box (FC6) that runs backups of a
lot of Windows servers using rSync.
We are not running the latest version of BackupPC. I
am reluctant to update this unless I have a good idea that it’s going to
help since we have a lot of automated scripts that manage the BackupPC config
files and we will need to review them all. If this is the route we need
to go then no problem, but as I understand it my problem is specifically
related to rSync. (Please correct me if I am wrong.)
We have been running this for well over a year now (maybe a
few years, my memory fails me) and iirc this problem has only started showing
up over the past 6-12 months. We do not have any automatic updates on the
BackupPC box so nothing there should have really changed.
Our backup box resides on what we call our CORE
network. The servers it backs up are all on remote networks, which are
connected to the CORE network with VPNs. (The VPNs are running
over DSL.) Backups do take a long time to run (~10 hours or so) due to the
amount of data.
The problem we are seeing is that backups are randomly
failing.
The log file on BackupPC shows something like this:
Connected to xxx.xxx.xxx.xxx:873, remote version 29
Negotiated protocol version 26
Connected to module kale-susl
Sending args: --server --sender --numeric-ids --perms --owner --group -D --links --times --block-size=2048 --recursive --ignore-times . .
Xfer PIDs are now 5220
[ skipped 971 lines ]
Read EOF: Connection reset by peer
Tried again: got 0 bytes
finish: removing in-process file Data/Apps/GoldMine/GMBase/ScriptsW.MDX
Child is aborting
Parent read EOF from child: fatal error!
Done: 923 files, 3041217353 bytes
Got fatal error during xfer (Child exited prematurely)
Backup aborted (Child exited prematurely)
The log on the Windows server is:
2008/11/18 17:46:05 [3252] connect from UNKNOWN (xxx.xxx.xxx.xxx)
2008/11/18 17:46:05 [3252] rsync on . from backupuser@UNKNOWN (xxx.xxx.xxx.xxx)
2008/11/18 17:46:05 [3252] building file list
2008/11/18 18:03:14 [3252] rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Connection reset by peer (104)
2008/11/18 18:03:14 [3252] rsync error: error in rsync protocol data stream (code 12) at /home/lapo/packaging/tmp/rsync-2.6.9/io.c(1122) [sender=2.6.9]
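To check whether the failures cluster around a particular elapsed time, the
rsyncd log could be summarised with something like the sketch below. It is
only a rough illustration: it assumes every log line has the
"date time [pid] message" layout shown above, and the function name
failed_session_durations is made up for this example.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
#   YYYY/MM/DD HH:MM:SS [pid] message
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] (.*)$")


def failed_session_durations(lines):
    """Return {pid: seconds between 'connect from' and 'rsync error'}.

    Hypothetical helper: only sessions that end in an 'rsync error'
    line are reported; lines that do not match the layout are skipped.
    """
    started = {}    # pid -> timestamp of the 'connect from' line
    durations = {}  # pid -> seconds until the first 'rsync error' line
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        pid, message = match.group(2), match.group(3)
        if message.startswith("connect from"):
            started[pid] = stamp
        elif message.startswith("rsync error") and pid in started:
            durations[pid] = (stamp - started.pop(pid)).total_seconds()
    return durations
```

For the excerpt above this gives 1029 seconds (a little over 17 minutes)
for PID 3252. Run over a full log, it should show whether the resets come at
a consistent point in the session or are spread from minutes to hours.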
I have been trying to work this out for a month or two now.
The problems seem to be random, but more common on specific
servers.
There is nothing special about these specific servers –
the failures seem random but persistent.
Originally, we were running the recommended rSync package
for BackupPC.
After looking into the problem over the past month, I have
seen a lot of posts suggesting that this was a common problem with a
particular build of rSync.
I have updated rSync on the BackupPC box; "rpm -q
rsync" currently replies...
rsync-2.6.9-5.fc8 (yes, the only
updated rpm I could find was an fc8 one)
On a few select servers (including the one that generated
the above logs) I set up Cygwin directly and added rSync to it with the
installer wizard.
I selected rSync 2.6.9 rather than 3.x.x as I assumed this
would be required for compatibility.
These seem to be the only recommendations I can find for
fixing this problem. (updating rSync)
Sadly, it has not helped me so far.
The connection between the BackupPC server and the example
server used for the above logs is a VPN like the rest of the servers, but this
server is local and the VPN operates over local Ethernet links (i.e. stable links).
I have tried and tried to verify as much as I can that there
are no network/VPN dropouts at the times that this is failing, and I'm pretty
sure there are not. It sometimes fails within 3 minutes of the job
starting, other times after hours. I know I have had a remote desktop
session open to the server and been actively using it at the time it failed, and
I noticed absolutely no disturbance in my RD session. (You would
expect at least a short pause there if there was a brief disruption to the
connection.)
I am at a loss and I am really hoping that someone will be
able to show me a way to further my research into what is causing this problem
so that I can hopefully isolate the issue and resolve it.
I have said above that this seems to be most prominent on
specific servers (about 7 of them) but it IS happening occasionally on all of
our servers.
We do get drops on our VPNs from time to time, but we
have a monitoring system in place that alerts us immediately about this.
These drops probably average about one every 3 months
per VPN. The rSync failures are daily, and usually several per day.
Any help would be very much appreciated.
Kind Regards,
James Sefton
Phase 5 Communications Ltd. (UK)