Re: Large Linux clients

        Ya, 
        Sorry, I have no answers for you, but you do have my sympathy.

        I've had to do that kind of detective work before. Sometimes it
is an oddly named file or a file with a very long name, and sometimes it's
a file that somehow got a very bizarre date, like "Apr 15  1904". In a
few cases it has also been hung NFS mounts somewhere in the path.
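
        For what it's worth, a scan for those usual suspects could look
something like the sketch below. It is only a rough illustration in Python,
not anything TSM ships: the 200-character name limit and the "before 1970 or
more than a day in the future" date test are my own assumptions, and the
function names are made up for the example.

```python
import os
import time

# Illustrative thresholds (assumptions, not values from any TSM documentation).
MAX_NAME_LEN = 200
EARLIEST_SANE_MTIME = 0  # 1970-01-01; a 1904 date shows up as a negative mtime


def is_suspicious(name, mtime, now):
    """Return a list of reasons this entry looks odd (empty if it looks fine)."""
    reasons = []
    if len(name) > MAX_NAME_LEN:
        reasons.append("long name")
    # Control characters, undecodable bytes, leading/trailing whitespace.
    if not name.isprintable() or name != name.strip():
        reasons.append("odd characters")
    if mtime < EARLIEST_SANE_MTIME or mtime > now + 86400:
        reasons.append("bizarre date")
    return reasons


def scan(root):
    """Walk root and yield (path, reasons) for every suspicious entry."""
    now = time.time()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError as exc:
                yield path, ["lstat failed: %s" % exc.strerror]
                continue
            reasons = is_suspicious(name, mtime, now)
            if reasons:
                yield path, reasons
```

        Something like "for path, why in scan('/some/dir'): print(path, why)"
would list the candidates. A hung NFS mount would hang this script too, so
it only helps with the name and date cases.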

        I've had to drill down through each of the subdirs one after
another, just like you did, to figure it out, because there was no filename
or other hint in the schedule or error logs, just a generic failed message.

        Luckily I only have to do it about once or twice a year, but it
is time-consuming.

 Ben


-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Zoltan Forray/AC/VCU
Sent: Friday, April 01, 2005 9:03 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Large Linux clients

Thanks for the suggestion.   However, that is not the case here.  We
already tried this.

We did "find . | wc -l" to get the object count (1.1M) with no problems.
But the backup still will not work. It constantly fails, in
unpredictable/inconsistent places, with the same "Producer Thread"
error.

I spent 2+ days drilling through the various sub-directories (of the
directory that causes the failures), one by one, and was able to back up
38 of the 40 subdirs, totalling over 980K objects, without a problem.
When I included the two other directories in the same pile, the
backup would fail.

When I then went back and individually selected the sub-sub directories
of these sub-directories (one at a time), I was able to back up *ALL* of
the sub-sub directories, no problem.  Then I went back and selected the
upper-level directory and backed it up, no problem.

Let me draw a picture of the structure of these directories.

The problem directories are in this directory:
/coyote/dsk3/patients/prostateReOpt/Mount_0/ .

If I try to back up /Mount_0/ as a whole, it crashes every time.   If I
point to the sub-dirs below /Mount_0/ (40 of these, all with the same 4
named subsub dirs), two of these cause a crash. I noted that these two both
have >72K objects while the other 38 have fewer than 60K objects.
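
For anyone wanting to repeat the per-directory object count without
running "find | wc -l" forty times, here is a minimal Python sketch. The
function names are hypothetical, and unlike "find . | wc -l" the count does
not include the starting directory itself, so numbers will be off by one
per directory.

```python
import os


def count_objects(root):
    """Count every file and directory below root (root itself excluded)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def counts_by_subdir(parent):
    """Map each immediate subdirectory of parent to its object count."""
    counts = {}
    with os.scandir(parent) as entries:
        for entry in entries:
            # Symlinked directories are skipped so nothing is counted twice.
            if entry.is_dir(follow_symlinks=False):
                counts[entry.name] = count_objects(entry.path)
    return counts
```

Sorting the result of counts_by_subdir() on the parent directory by count
would put the two large subdirs at the top of the list.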

Yet when I manually picked the 4 subsub dirs of the Patient_172 dir, the
backup worked (sort of - see below). Same for Patient_173.

To really drive me crazy, on the first attempt at backing up one of the
subsub dirs under Patient_172, the backup crashed. Yet I could back up
the other 3 with no issue. So, we started looking at the problem subdir
and noticed a weird file name that ended in a tilde (~).  When I
excluded it, the backup ran. Then when I went back and picked just the
file with the tilde, it backed up fine (my head is getting
balder-and-balder !!).  I then went back and re-selected the whole
Patient_172 directory and it backed up (or at least scanned it, since
everything was already backed up) just fine !!!
ARRRRRRRRRRRRGGGGGGHHHHHHHHHHHHH !!

This is maddening and shows no rhyme-or-reason.




Henk ten Have <hthta AT NCSA.UIUC DOT EDU>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
04/01/2005 08:29 AM
Please respond to: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
cc:
Subject: Re: [ADSM-L] Large Linux clients

An old trick I used for many years:
to investigate a "problem" filesystem, do a "find" in that filesystem.
If the find dies, tsm definitely will die.
I'll bet your find will die, and that's why your backup will die/hang or
whatever also. A find will do a filestat on all files/dirs, which is
actually the same thing the backup does.
So your issue is OS related and not tsm.
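
The same kind of check can be sketched in Python when you want the failing
paths collected rather than scrolling past. This is only an illustration
under my own assumptions (the function name stat_walk is made up); it does
an explicit lstat on every entry and records anything the OS refuses.

```python
import os


def stat_walk(root):
    """lstat every entry under root; return (objects_seen, errors)."""
    errors = []
    seen = 0

    def on_error(exc):
        # Called by os.walk when a directory cannot be listed.
        errors.append((exc.filename, exc.strerror))

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                os.lstat(path)
                seen += 1
            except OSError as exc:
                errors.append((path, exc.strerror))
    return seen, errors
```

If the walk finishes with an empty error list, the filesystem at least
answers a stat on every object. It says nothing about a fault inside the
backup client itself.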

Cheers
Henk

On Tuesday 29 March 2005 12:11, you wrote:
> On Mar 29, 2005, at 12:37 PM, Zoltan Forray/AC/VCU wrote:
> > ...However, then I try to backup the tree at the third-level (e.g.
> > /coyote/dsk3/), the client pretty much seizes immediately and
> > dsmerror.log says "B/A Txn Producer Thread, fatal error, Signal 11".
> > The server shows the session as "SendW" and nothing else going
> > on....
>
> Zoltan -
>
> Signal 11 is a segfault - a software failure.
> The client programming has a defect, which may be incited by a problem
> in that area of the file system (so have that investigated). A
> segfault can be induced by memory constraint, which in this context
> would most likely be Unix Resource Limits, so also enter the command
> 'limit' in Linux csh or tcsh and potentially boost the stack size
> ('unlimit stacksize'). This is to say that the client was probably
> invoked under artificially limited environmentals.
>
>     Richard Sims
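
For shells without the csh 'limit' builtin, the stack limit Richard
mentions can also be inspected from Python's standard resource module. A
minimal sketch, with helper names of my own choosing; whether a larger
stack actually cures this particular Signal 11 is an open question.

```python
import resource


def describe(limit):
    """Render a limit value the way 'limit' would: a number or 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def stack_limits():
    """Return the (soft, hard) stack size limits in bytes, as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe(soft), describe(hard)


def raise_stack_to_hard_limit():
    """Raise the soft stack limit to the hard limit for this process.

    Child processes started afterwards inherit the new limit.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)
```

Calling stack_limits() in the same environment the scheduler runs in shows
whether the client is being started under a smaller limit than an
interactive login gets.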