ADSM-L

Re: [ADSM-L] Urgent - Library Master mount queue breaking down, tapes going into RESERVED status and never getting mounted

2010-09-10 17:02:19
Subject: Re: [ADSM-L] Urgent - Library Master mount queue breaking down, tapes going into RESERVED status and never getting mounted
From: Richard Rhodes <rrhodes AT FIRSTENERGYCORP DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 10 Sep 2010 17:00:37 -0400
One time when we had problems like this, it was caused by the rmt devices being
out of sync with the TSM paths.  We never did figure out how it occurred, but we
ended up blowing away all our paths and drives and recreating them.
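For anyone who ends up doing the same cleanup, the general shape of it from the
TSM admin client is roughly the following, repeated for each drive (LIBMGR,
VTLLIB, DRIVE01 and the rmt device are placeholders, not our actual names):

  query drive f=d
  query path f=d
  delete path LIBMGR DRIVE01 srctype=server desttype=drive library=VTLLIB
  delete drive VTLLIB DRIVE01
  define drive VTLLIB DRIVE01
  define path LIBMGR DRIVE01 srctype=server desttype=drive library=VTLLIB device=/dev/rmt1

A define path is also needed for each library client, using the device name as
that server sees it.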

Rick






             "John D.
             Schneider"
             <john.schneider@C                                          To
             OMPUTERCOACHINGCO         ADSM-L AT VM.MARIST DOT EDU
             MMUNITY.COM>                                               cc
             Sent by: "ADSM:
             Dist Stor                                             Subject
             Manager"                  Re: Urgent - Library Master mount
             <[email protected]         queue breaking down, tapes going
             .EDU>                     into RESERVED status and never
                                       getting mounted

             09/10/2010 04:39
             PM


             Please respond to
             "ADSM: Dist Stor
                 Manager"
             <[email protected]
                   .EDU>






Richard,
   All good suggestions.  No AIX errors with the VTL or VTL drives.  We
are using the Atape driver, because the VTL is emulating a 3584 with
LTO1 drives.

There are a number of Atape trace files, though, in particular Atape.smc0.traceX.
Looking in them I see regular errors, but I wonder if this is a red herring,
because when I look at the Library Master for a physical 3584 library, I see
similar trace files with the same sort of errors on the smc1 device for the real
3584 library.

So are these libraries always getting these errors?

I looked at our SAN switches a couple of days ago and zeroed out the error
counters for the AIX host, the EDL, and the ISLs between the switches.
Two days later, all those ports are still completely error free.  So I don't
see how it could be in the switches.
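For what it's worth, the adapter-side counters on the AIX host can be
cross-checked as well; something like this, where fcs0 is just an example
adapter name:

  fcstat fcs0
  errpt -a | more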

All good ideas, and I don't mean to disparage them.  I just don't see a
smoking gun, yet.

Best Regards,

John D. Schneider
The Computer Coaching Community, LLC
Office: (314) 635-5424 / Toll Free: (866) 796-9226
Cell: (314) 750-8721



-------- Original Message --------
Subject: Re: [ADSM-L] Urgent - Library Master mount queue breaking
down, tapes going into RESERVED status and never getting mounted
From: Richard Rhodes <rrhodes AT FIRSTENERGYCORP DOT COM>
Date: Fri, September 10, 2010 12:44 pm
To: ADSM-L AT VM.MARIST DOT EDU

Sounds like maybe the library manager is not communicating with the VTL.
Some things to check:

- any errors in the AIX error log?
- any errors in the VTL?
- any SAN errors?

If you are running Atape . . .
- check the logs in /var/adm/ras
- are you running multi-pathing?  If yes, what is the status of the paths?
  (Example commands are below.)

Atape with multi-paths is very good at hiding hardware problems.
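Roughly, the AIX-side checks look like this (device names are examples, and the
exact Atape attribute names depend on the driver level):

  errpt -a | more                 # AIX error log
  ls -ltr /var/adm/ras/Atape*     # Atape logs and traces
  lsdev -Cc tape                  # state of the rmt/smc devices
  lsattr -El rmt0                 # look for the alternate pathing attributes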


Rick





 "John D.
 Schneider"
 <john.schneider@C To
 OMPUTERCOACHINGCO ADSM-L AT VM.MARIST DOT EDU
 MMUNITY.COM> cc
 Sent by: "ADSM:
 Dist Stor Subject
 Manager" Urgent - Library Master mount queue
 <[email protected] breaking down, tapes going into
 .EDU> RESERVED status and never getting
 mounted

 09/10/2010 01:05
 PM


 Please respond to
 "ADSM: Dist Stor
 Manager"
 <[email protected]
 .EDU>






 Greetings,
 Our environment is 8 TSM instances on AIX, running AIX 5.3 ML11 and
TSM 5.4.3.0.  I know we are rather far behind, but this has been an
extremely stable version for us until just recently.  There are 4
instances on one AIX host and 4 on the other.  The hosts are pSeries
570s.  There is also a Windows LAN-free client in the mix.  Total client
count is about 1500, with schedules more or less spread across the night.
Backup performance is OK; the AIX hosts generally carry a CPU load of
20-30 across 8 CPUs.
 One of the TSM instances serves as a TSM Library Master for the
others, and has no other workload.  It mounts tapes for an EMC Disk
Library (a virtual tape library) configured with 128 virtual LTO1 tape
drives, shared between all the instances.  The device class for the
library has a 15-minute mount retention period.  Most clients can only
mount a single virtual tape; a few larger database servers are allowed
to mount more.  All have "keep mount point" set to yes.
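For context, the relevant definitions look roughly like this (EDLCLASS and
DBNODE1 are placeholders for our actual names):

  query devclass EDLCLASS format=detailed
  update devclass EDLCLASS mountretention=15
  update node DBNODE1 maxnummp=2 keepmp=yes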
 This basic configuration has been in place about three years. At
first we had problems, and had to put LIBSHRTIMEOUT 60 and COMMTIMEOUT
3600 in the dsmserv.opt of the Library Master. But it has been many
months since we had to make any configuration changes to the
environment. I like STABLE.
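Those options sit in the Library Master's dsmserv.opt like this:

  COMMTIMEOUT    3600
  LIBSHRTIMEOUT  60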
 But things are growing, and we are adding new clients all the time,
and have added about forty in the last few weeks.
 A couple of weeks ago, the Library Master instance got into a state
where there were lots of tapes in RESERVED status when we did a 'q
mount'.  There were still occasional mounts happening, but lots of
clients were in Media Wait.  We restarted the Library Master and the
problem went away, but then it came back about a week later.
 Now it is happening every day.  Last night we stayed up all night
watching it; at first we could see just a couple of RESERVED tape
drives and lots of normal mounts coming and going.  Then the number of
RESERVED ones would slowly creep up over the course of an hour or two
until there were 80 or more in RESERVED status and dozens of clients in
Media Wait.  Ordinarily virtual tape mounts take 2-4 seconds; last
night during the problem they were taking 15-20 seconds.  At about 1am
we restarted the Library Master, and the RESERVED drives went away, but
they were back again within the hour.
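The commands we were watching it with on the Library Master were essentially
just:

  query mount
  query request
  query session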
 One thing I noticed then was that the Library Master had over 300
sessions, all admin.  Usually it has very few.  Our MAXSESSIONS was set
to 500, so I wondered if perhaps we were overrunning it.  We bumped it
up to 1000 on all instances.  We restarted all the TSM instances this
time, including the LAN-free one.  (The LAN-free Windows server was
hung, although we don't know whether that is a coincidence or has
something to do with the problem.)
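The session count and the MAXSESSIONS change were done roughly like the
following on each instance (the SELECT assumes the standard SESSIONS table
layout):

  select count(*) from sessions where session_type='Admin'
  setopt maxsessions 1000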
 After we restarted, we appeared to be stable for about 4 hours, so we
started rerunning a bunch of the TSM clients that had failed last night
during the problem.  In no time at all the RESERVED list grew huge,
clients were in Media Wait again, and we had to restart the Library
Master again.

 So it seems to me the problem has to do with the Library Master's
queuing mechanism.  Somehow it is becoming overwhelmed with tape mount
requests and can't satisfy them all, so they go into RESERVED status.
Some of this is normal behavior; we see drives go into RESERVED status
whenever a burst of mounts happens at once, but then the queue clears
after a few minutes.  Now, though, even after an hour or two it never
catches up, and things go from bad to worse.

 One other tidbit, which might not even be related: back on 8/23 our
EMC Disk Library had a drive fail, but it had rebuilt onto a spare
within 24 hours.  We just found out about it and haven't replaced the
drive yet.  I don't think it is related, but I didn't want to leave out
any potentially important facts.

 If anybody has any advice on how to tune the Library Master to allow
it to support a greater number of requests at once, please let me know.

Best Regards,

John D. Schneider
The Computer Coaching Community, LLC
Office: (314) 635-5424 / Toll Free: (866) 796-9226
Cell: (314) 750-8721



