ADSM-L

Re: Large Linux clients

2005-04-01 11:04:33
Subject: Re: Large Linux clients
From: Zoltan Forray/AC/VCU <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 1 Apr 2005 11:03:12 -0500
Thanks for the suggestion.  However, that is not the case here.  We already
tried this.

We did "find . | wc -l" to get the object count (1.1M) with no problems.
But the backup still will not work. It constantly fails, in
unpredictable/inconsistent places, with the same "Producer Thread" error.

I spent 2+ days drilling through the various sub-directories (of this
directory that causes the failures), one by one, and was able to back up 38
of the 40 subdirs, totalling over 980K objects, without a problem.  When
I included the two other directories in the same pile, the backup would
fail.

When I then went back and individually selected the sub-sub directories of
these sub-directories (one at a time), I was able to back up *ALL* of the
sub-sub directories, no problem.  Then I went back and selected the
upper-level directory and backed it up, no problem.

Let me draw a picture of the structure of these directories.

The problem directories are in this directory:
/coyote/dsk3/patients/prostateReOpt/Mount_0/ .

If I try to back up /Mount_0/ as a whole, it crashes every time.  If I
point to the sub-dirs below /Mount_0/ (40 of these, all with the same four
named sub-sub dirs), two of these cause a crash. I noted that these two both
have >72K objects while the other 38 have fewer than 60K objects.
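
For anyone wanting to repeat the per-directory comparison above, here is a
rough sketch of how the object counts could be gathered. The function names
are made up for illustration, and it assumes no file names contain newlines
(those would inflate the "wc -l" count). Hidden sub-dirs are skipped by the
shell glob.

```shell
# count_objects DIR: number of objects under DIR, counting DIR itself,
# the same way "find . | wc -l" does.
count_objects() {
    find "$1" | wc -l | tr -d ' '
}

# count_per_subdir PARENT: print "<count> <subdir>" for each immediate
# sub-directory of PARENT, largest first.
count_per_subdir() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        printf '%s %s\n' "$(count_objects "$d")" "$d"
    done | sort -rn
}

# Example call, using the path from this thread:
#   count_per_subdir /coyote/dsk3/patients/prostateReOpt/Mount_0
```

Sorting largest first makes it easy to see which sub-dirs sit above or
below a given object count.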

Yet when I manually picked the four sub-sub dirs of the Patient_172 dir, the
backup worked (sort of - see below). Same for Patient_173.

To really drive me crazy, on the first attempt at backing up one of the
sub-sub dirs under Patient_172, the backup crashed. Yet I could back up the
other 3 with no issue. So we started looking at the problem subdir and
noticed a weird file name that ended in a tilde (~).  When I excluded it,
the backup ran. Then when I went back and picked just the file with the
tilde, it backed up fine (my head is getting balder and balder!!).  I
then went back and re-selected the whole Patient_172 directory and it
backed up (or at least scanned it, since everything was already backed up)
just fine!!!  ARRRRRRRRRRRRGGGGGGHHHHHHHHHHHHH!!
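
To hunt for other oddly named files like the tilde one, something along
these lines might help. This is only a sketch: the function names are
invented here, and the second check (names containing bytes outside
printable ASCII) is my own assumption about what else could count as a
"weird" name, not something observed in this case.

```shell
# find_tilde_names DIR: list objects under DIR whose names end in a tilde.
find_tilde_names() {
    find "$1" -name '*~'
}

# find_nonprintable_names DIR: list objects whose names contain at least one
# byte outside the printable ASCII range (pattern evaluated in the C locale).
find_nonprintable_names() {
    LC_ALL=C find "$1" -name '*[![:print:]]*'
}

# Example call, using a directory named in this thread:
#   find_tilde_names /coyote/dsk3/patients/prostateReOpt/Mount_0
```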

This is maddening and shows no rhyme or reason.




Henk ten Have <hthta AT NCSA.UIUC DOT EDU>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
04/01/2005 08:29 AM
Please respond to: "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: [ADSM-L] Large Linux clients

An old trick I have used for many years:
to investigate a "problem" filesystem, do a "find" in that filesystem.
If the find dies, TSM definitely will die.
I'll bet your find will die, and that's why your backup will die/hang or
whatever also. A find will do a filestat on all files/dirs, which is
actually the same thing the backup does.
So your issue is OS related and not TSM.
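
A minimal sketch of that check, for reference. The function name is made
up, and two choices here are assumptions rather than part of the trick as
described: "-ls" is added so that find has to stat every object it visits,
and "-xdev" keeps the walk on one filesystem.

```shell
# stat_walk DIR ERRFILE: walk DIR on a single filesystem, stat every object,
# and write any error messages from find to ERRFILE.
# The return status is find's own exit status (non-zero means trouble).
stat_walk() {
    find "$1" -xdev -ls >/dev/null 2>"$2"
}

# Example call, using the path from this thread:
#   stat_walk /coyote/dsk3 /tmp/find-errors.log || echo "find had problems"
```

If find itself crashes, the shell reports an exit status above 128, which
also counts as non-zero here.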

Cheers
Henk

On Tuesday 29 March 2005 12:11, you wrote:
> On Mar 29, 2005, at 12:37 PM, Zoltan Forray/AC/VCU wrote:
> > ...However, when I try to back up the tree at the third level (e.g.
> > /coyote/dsk3/), the client pretty much seizes immediately and
> > dsmerror.log says "B/A Txn Producer Thread, fatal error, Signal 11".
> > The server shows the session as "SendW" and nothing else going on....
>
> Zoltan -
>
> Signal 11 is a segfault - a software failure.
> The client programming has a defect, which may be incited by a problem
> in that area of the file system (so have that investigated). A segfault
> can be induced by memory constraint, which in this context would most
> likely be Unix Resource Limits, so also enter the command 'limit' in
> Linux csh or tcsh and potentially boost the stack size ('unlimit
> stacksize'). This is to say that the client was probably invoked under
> artificially limited environmentals.
>
>     Richard Sims
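
The 'limit' and 'unlimit stacksize' commands quoted above are csh/tcsh
syntax. In a Bourne-style shell (sh, bash) the rough equivalent is
"ulimit -s". The sketch below uses invented function names, and the dsmc
invocation in the comment is only an assumed example, not a command taken
from this thread.

```shell
# show_stack_limits: print the soft and hard stack limits of this shell.
show_stack_limits() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -s)" "$(ulimit -H -s)"
}

# run_with_max_stack CMD [ARGS...]: raise the soft stack limit to the hard
# limit inside a subshell and run CMD there. The calling shell's own limits
# are left unchanged.
run_with_max_stack() {
    (
        ulimit -S -s "$(ulimit -H -s)" || exit 1
        "$@"
    )
}

# Hypothetical usage (the dsmc arguments are an assumption):
#   show_stack_limits
#   run_with_max_stack dsmc incremental /coyote/dsk3/
```

Only the soft limit is touched, so this works without root. If the hard
limit is itself low, it would have to be raised by other means first.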
