Re: [Networker] Problem with ending a savegroup

I have seen this problem occur on Linux systems, too. In my experience, there are two reasons: A. The affected client has a zillion files, causing the backup software to crawl through a horrendous number of inodes to determine what to back up (e.g. on an incremental), or maybe there's just a ton of stuff to back up anyway, protracting the whole process; or B. some kind of DNS problem. We had this happen once: all 30+ clients in the group would finish their incrementals in about 30 minutes, but one client would just hang for an hour or two before finally doing anything. We finally tracked it down to a bogus entry in the client's /etc/hosts file. It would complete its backups, but it would take forever before it started. After that fix, problem solved.
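
For anyone chasing the same kind of name-resolution delay, here is a rough Python sketch of one way to look for it. It is not a NetWorker tool and it only covers one kind of bad entry (the same name listed against two different addresses), which is an assumption, since the exact entry in the case above is not recorded here. The function names parse_hosts, conflicting_names and timed_lookup are made up for this example.

```python
import socket
import time


def parse_hosts(text):
    """Return (ip, [names]) pairs from hosts-file text, ignoring comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:  # need an address plus at least one name
            entries.append((fields[0], fields[1:]))
    return entries


def conflicting_names(text):
    """Map each name listed with more than one address to those addresses."""
    seen = {}
    for ip, names in parse_hosts(text):
        for name in names:
            seen.setdefault(name.lower(), set()).add(ip)
    # Note: a name with both an IPv4 and an IPv6 entry (e.g. localhost)
    # also shows up here and is normally harmless.
    return {n: sorted(ips) for n, ips in seen.items() if len(ips) > 1}


def timed_lookup(name):
    """Resolve name; return (elapsed seconds, addresses or None on failure)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(name, None)
        addrs = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addrs = None
    return time.monotonic() - start, addrs
```

On the slow client, one could feed conflicting_names() the contents of /etc/hosts and call timed_lookup() with the backup server's name; a lookup that takes many seconds, or returns an address you don't expect, points toward name resolution rather than the file count.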

I think what we need is a feature in the product that would somehow allow the remaining running or pending savesets to continue but also allow the group to restart. Maybe those savesets in limbo could somehow be transferred to a temporary group so they could continue to run and then run again later, but the main group would not be affected?

The thing about killing off a running group so it can restart is that you might not want to do that if there's a full still running and it's near completion. I think I'd prefer to do it manually so I can make that determination on a case-by-case basis.

However, that provides little succor when you're away on vacation.

George

John Stoffel wrote:

Conrad> All that would do is prevent Windows boxes from hanging Unix
Conrad> systems.  The Windows boxes would still hang each other, and
Conrad> the occasional Unix failure would still hang the group.

Sure, but at least it wouldn't hang all the systems.  Some improvement
is better than none.

Conrad> The problem is very inconsistent. A client will cause a
Conrad> savegroup to hang one day that hadn't done that the previous
Conrad> day and won't do it the next. When a client hangs a savegroup
Conrad> consistently, we can track it down. And when we can't we do
Conrad> exactly what you suggest, with a special savegroup.

Conrad> Yes, patching and client reboots do sometimes help the
Conrad> situation. There doesn't appear to be any correlation with
Conrad> filesystem size or number of files. In most cases there is no
Conrad> data passing across the link.

Wish I had better help for you in this situation, sorry I can't do
more than the obvious.

John

--
Note: To sign off this list, send a "signoff networker" command via email
to listserv AT listserv.temple DOT edu or visit the list's Web site at
http://listserv.temple.edu/archives/networker.html where you can
also view and post messages to the list. Questions regarding this list
should be sent to stan AT temple DOT edu
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=

