Networker

Re: [Networker] AW: [Networker] Inactivity Timeout on particular Save Set

2003-02-13 09:15:04
Subject: Re: [Networker] AW: [Networker] Inactivity Timeout on particular Save Set
From: Ingo Roschmann <ingo AT VISIONET DOT DE>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Thu, 13 Feb 2003 09:14:54 -0500
Hi,

we still need some good ideas for our problem of occasionally hanging save
processes on a W2K cluster.
In short, the problem is:
- savegrp abandons one particular save set after its inactivity timeout
  period
- save.exe then hangs on the client and cannot be killed except by a reboot
- while one save.exe hangs, manually backing up the save set with 'save' from
  the client is not possible: after a few seconds of activity the new save
  also hangs; the backup level is irrelevant
- chkdsk shows no errors
- monitoring of the disks, RAID array and network (with some Compaq tool)
  does not show any errors (according to the customer)

I have checked the file indexes and found no errors.
I also tried a manual full backup of the save set, and when it stopped
responding I ran it a second time: the verbose output of both saves ends at
exactly the same position. An incremental backup, however, does _not_ stop
at that particular file, although it also stops responding later on. So I
don't think there are "evil files" or anything like that.

How can I check whether it is a problem with NetWorker, the client's file
system, the client's hardware or something else?
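
One way to separate these cases, given only as a rough sketch: read every
file on the affected volume with no NetWorker process involved. If a plain
read hangs as well, that points towards the file system or the hardware
rather than the backup software. The function name read_tree, the chunk size
and the log file are assumptions made for this illustration; nothing here is
a NetWorker tool. Each path is written to the log before the file is opened,
so if the script stops responding, the last line of the log names the file
that was being read, much like the verbose save output.

```python
import os

CHUNK = 1024 * 1024  # read in 1 MiB pieces (arbitrary choice)


def read_tree(root, log_path):
    """Read every file under root, logging each path before opening it.

    Returns (files_read, bytes_read, errors). Keep log_path outside root,
    otherwise the log itself is read as well.
    """
    files_read = 0
    bytes_read = 0
    errors = []
    with open(log_path, "w", encoding="utf-8",
              errors="backslashreplace") as log:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Force the path to disk first, so it survives a hang.
                log.write(path + "\n")
                log.flush()
                os.fsync(log.fileno())
                try:
                    with open(path, "rb") as fh:
                        while True:
                            chunk = fh.read(CHUNK)
                            if not chunk:
                                break
                            bytes_read += len(chunk)
                    files_read += 1
                except OSError as exc:
                    # Locked or unreadable files are recorded, not fatal.
                    errors.append((path, str(exc)))
    return files_read, bytes_read, errors
```

A call such as read_tree(r"E:\data", r"C:\temp\probe.log") would cover one
volume; both paths are hypothetical. Files held open by other programs will
show up in the error list, which is expected and not a sign of disk trouble.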

Any ideas appreciated!

Ingo



On Fri, 17 Jan 2003 11:47:57 +0100, Gottwald, Stephan
<Stephan.Gottwald@ITZ-DUESSELDORF.DE> wrote:

>Hi
>
>We just experienced this problem on a W2K server.
>
>It was a file server with a RAID array (no cluster).
>We couldn't do a full backup of the data drive (70 GB) any more.
>Sometimes a manual backup would complete successfully, though not often.
>save.exe hangs and there is no way to end the task; commands that rely on
>the \Pipe mechanism (e.g. remote shutdown) no longer work, and logging on at
>the console or through a terminal server session is not possible.
>
>Only a hard reboot (power cycle) would restart the server.
>
>I ran a chkdsk, and it reported no errors.
>
>We tried everything, up to a complete reinstall of the NetWorker software on
>the server and all clients.
>I even discarded all existing indexes, the media database etc.
>
>Some weeks later we suddenly had a total corruption of the NTFS file
>system.
>There were a lot of files we could no longer open or even copy to another
>server.
>W2K reported that it could not access the drive.
>
>The only solution we had was to format the drive and restore the data from
>backup, plus the data we had successfully copied to another server.
>
>I don't know whether it is a problem with W2K or the RAID array.
>We had no errors in the internal log of the array controller up to the day
>the total corruption occurred.
>On that day, one drive did not respond at boot time, though we could bring
>it online with the management software and had no problems with this drive
>afterwards.
>
>Right now we are closely monitoring the RAID array to see if anything
>happens...
>
>Hope this helps.
>
>Greetings
>
>
>
>Stephan Gottwald
>
>
>> -----Original Message-----
>> From: Ingo Roschmann [mailto:ingo AT VISIONET DOT DE]
>> Sent: Friday, 17 January 2003 11:16
>> To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
>> Subject: [Networker] Inactivity Timeout on particular Save Set
>>
>>
>> Hi all,
>>
>> I know, there's been a lot of discussion on inactivity
>> timeouts, but I haven't found a hint that covers our problem:
>>
>> From time to time, one particular save set fails due to
>> inactivity timeout; in addition, the save.exe-process hangs
>> and we can only kill it by rebooting the machine.
>>
>> Our environment is NetWorker 6.1.1; the server is Solaris 8, the
>> client is a W2K MSCS cluster with Open File Manager 8.0 running.
>>
>> Symptoms are:
>> - It is always the same save set that fails and the save set
>> is located on its own group of hard disks
>> - Increasing the inactivity timeout is useless; a manual save
>> on the client shows that the process seems to stop working
>> after a few minutes
>> - Another save set on the same cluster node at the same time works fine
>> - The backup level is irrelevant; the error occurred with
>> incremental backups as well as with level 5 backups
>> - While the abandoned save.exe process still hangs, every
>> attempt to do another save will fail; if the hanging process
>> is killed (after a reboot), backup may work for a few days
>> until the error occurs again
>> - The last time the error occurred a scandisk on the volume
>> reported errors
>>
>> Now I have 3 questions:
>>
>> - What do you think of the idea that the hard disks the save set
>> is located on are the problem, and are there any suggestions
>> for testing this hypothesis?
>> - Are there any suggestions for how we could kill the hanging
>> save.exe processes on the client? We can't kill them with Task
>> Manager or with tools like pskill, and stopping the
>> NetWorker services doesn't help either (and we don't want to
>> reboot the server every time the problem occurs)
>> - Does anyone have any other ideas, please?
>>
>> Thanks in advance,
>> Ingo
>>
>> --
>> Note: To sign off this list, send a "signoff networker"
>> command via email to listserv AT listmail.temple DOT edu or visit
>> the list's Web site at
>> http://listmail.temple.edu/archives/networker.html where you
>> can also view and post messages to the list.
>> =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
>>