Networker

Re: [Networker] Problem with ending a savegroup

2005-08-11 11:11:12
Subject: Re: [Networker] Problem with ending a savegroup
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 11 Aug 2005 11:10:14 -0400
I have seen this problem occur on Linux systems, too. In my experience, there are two causes: A. The affected client has a zillion files, so the backup software has to crawl through a horrendous number of inodes to determine what to back up (e.g. on an incremental), or there's simply a ton of data to back up anyway, which protracts the whole process; or B. some kind of DNS problem. We had this happen once: all the other clients in the group (30+ machines) would finish their incrementals in about 30 minutes, but one client would just hang for an hour or two before finally doing anything. It would complete its backups, but it took forever to start. We finally tracked it down to a bogus entry in the client's /etc/hosts file. After we fixed that entry, the problem went away.
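
For anyone who wants to hunt for that kind of bogus hosts entry, here is a rough Python sketch of the check, not something we actually ran at the time. It parses the text of a hosts file and flags any name whose address disagrees with what a resolver returns. The function names (parse_hosts, find_mismatches) and the resolve argument are made up for illustration. One assumption to be aware of: on most systems the stock resolver consults /etc/hosts first, so the comparison is only meaningful if resolve queries DNS directly, or if you run it on another machine (e.g. the backup server) against a copy of the client's file.

```python
def parse_hosts(text):
    """Return a list of (ip, name) pairs from hosts-file text."""
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            pairs.append((ip, name))
    return pairs


def find_mismatches(pairs, resolve):
    """Flag entries whose name resolves to a different address.

    `resolve` is a caller-supplied function (hypothetical here) that maps a
    host name to an IPv4 address string and raises OSError on failure.
    Returns a list of (name, hosts_ip, resolved_ip_or_None) tuples.
    """
    suspects = []
    for ip, name in pairs:
        # Skip loopback and IPv6 entries; this sketch only compares IPv4.
        if ip.startswith("127.") or ":" in ip:
            continue
        try:
            resolved = resolve(name)
        except OSError:
            resolved = None  # the name does not resolve at all
        if resolved != ip:
            suspects.append((name, ip, resolved))
    return suspects
```

You could call it as find_mismatches(parse_hosts(open("client_hosts_copy").read()), socket.gethostbyname) from the server, where client_hosts_copy is a hypothetical copy of the client's file. Anything it reports is only a candidate to look at by hand, since a deliberate override in the hosts file will show up as well.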

I think what we need is a feature in the product that would allow the remaining running or pending savesets to continue while still allowing the group to restart. Maybe those savesets in limbo could be transferred to a temporary group so they could continue to run, and then run again later, without affecting the main group.

The thing about killing off a running group so it can restart is that you might not want to do that if there's a full still running and it's near completion. I think I'd prefer to do it manually so I can make that determination on a case-by-case basis.
However, that provides little succor when you're away on vacation.

George

John Stoffel wrote:

Conrad> All that would do is prevent Windows boxes from hanging Unix
Conrad> systems.  The Windows boxes would still hang each other, and
Conrad> the occasional Unix failure would still hang the group.

Sure, but at least it wouldn't hang all the systems.  Some improvement
is better than none.
Conrad> The problem is very inconsistent. A client will cause a
Conrad> savegroup to hang one day that hadn't done that the previous
Conrad> day and won't do it the next. When a client hangs a savegroup
Conrad> consistently, we can track it down. And when we can't we do
Conrad> exactly what you suggest, with a special savegroup.

Conrad> Yes, patching and client reboots do sometimes help the
Conrad> situation. There doesn't appear to be any correlation with
Conrad> filesystem size or number of files. In most cases there is no
Conrad> data passing across the link.

Wish I had better help for you in this situation, sorry I can't do
more than the obvious.

John

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=


