Re: Clients not gettin contacted?

In article <01bde30a$65348200$4b2d9ca7 AT pingoo.chubb DOT com>, Jim Healy
<jhealy AT CHUBB DOT COM> says:
>
>Hello out in ADSM land another fantastic friday is upon us And of course
>I have a new dilema.
>I'm running version2 on my MVS server(yes I know soon to go ver3) and
>I'm running a lot of various clients(v2 and v3) mostly os2 in scheduler
>prompted mode. Up until 2 nights ago schedules were running normal.
>Here's the problem: it seems that some clients are not getting contacted
>to initiate the scheduled backup.  I've checked the schedule log and the
>error log and neither have any updates for the last two days. The Mvs
>server log has no info about any attempts to contact the client to
>initiate the backup. Why? can anyone point me in the direction of an
>answer?
>Signed,Desperate in NJ
>

We have seen this problem with the adsm vm ver 2 server. I believe
that we know what is happening and I have two workarounds.

I should note that I think that this is a generic server problem,
but I have never been able to convince IBM that my theories are correct.
IBM opened an apar for fixing the way the scheduler works, but I do
not believe that they have fixed it yet. And I do not think
that they see the problem in exactly the same way that I do:
they were treating it more as a cpu load problem than
as an algorithm problem.
But they did spend a lot of time working on it.
I got the theory from a customer on an adsm tele-conference call,
which seems to have vanished.

First the background:
--------------------
We had a server with over 1,000 clients, mostly  windows nt or
windows 95. They were set up on various schedules, all
prompted. A number of them did not back up for
good reasons (machine powered off, portable taken home, adsm client
not set up correctly, etc.), but for a number of others we could not
find a good reason why they did not back up. In fact, if we
defined an assoc to a midday schedule for a machine that had not
backed up the night before, it would back up.
The dsmsched.log for these machines showed nothing except
server contact after the schedule window had ended.
The server log showed nothing until the schedule window closed, at
which time the adsm server would contact the client to update it
on the next schedule window.

We had one schedule with most of the pc's scheduled from 17:30 hrs est
till 08:30 the next morning. We had other schedules that
started at 18:30, 19:30 and 20:30 hrs and one that started at 23:30 hrs.
These were to push folks in other time zones to a later time slot,
and the 23:30 hrs start time was for new pc's on their first backup.
We also had a schedule that started at 17:30 but with a higher
priority for several servers. Later we put the servers on a different
adsm server.

We had a suggestion to split the main schedule into two groups,
so we created a twin schedule for many of the nodes. This did not
have any effect.

Now the theory of what is going on:
------------------------------------
I believe that in a given short interval of time, the adsm server
makes one combined list of all the nodes with pending scheduled events
to act on and sorts this list by schedule priority. For example
in our case at 19:00 hrs we would have pending sessions for those
in the schedule that started at 17:30 and those in the schedule at
18:30.

I believe that the scheduler prompter fires up every so many minutes;
the interval seems to vary based on something that I do not know. But
I believe that in each of these fire-up periods, if there are
scheduler slots open, it starts at the top of the combined list
of nodes-to-contact to start a schedule. If it cannot contact
a node, it has to go through a timeout period. Before it reaches the
end of the combined list, the period ends and the scheduler prompter
goes to the top of the list and starts the process over again. So
the nodes at the bottom of the list are never contacted.
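
The theory above can be sketched as a toy simulation. This is only a
model of the behavior described in this post, not of how the adsm
server is actually implemented; the function name, the cost values and
the node names are all made up for illustration.

```python
def run_prompter(nodes, reachable, passes, budget,
                 contact_cost=1, timeout_cost=10):
    """Toy model: the prompter restarts at the top of the list each pass.

    nodes     -- node names, assumed already sorted by schedule priority
    reachable -- set of nodes that answer when prompted
    budget    -- time units available in one pass (hypothetical value)
    Returns the set of nodes whose schedule was started.
    """
    started = set()
    for _ in range(passes):
        remaining = budget
        for node in nodes:
            if node in started:
                continue  # already running, nothing to do
            # An unreachable node costs a full timeout period.
            cost = contact_cost if node in reachable else timeout_cost
            if cost > remaining:
                break  # pass is over; the next pass starts at the top again
            remaining -= cost
            if node in reachable:
                started.add(node)
    return started


# Three unreachable nodes sit above two healthy ones in the list.
nodes = ["dead1", "dead2", "dead3", "pc1", "pc2"]
reachable = {"pc1", "pc2"}
started = run_prompter(nodes, reachable, passes=5, budget=25)
print(sorted(started))  # [] : the healthy nodes are never reached
```

In this model the timeouts at the top of the list use up every pass, so
pc1 and pc2 are never prompted no matter how many passes are run.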

Workarounds:
------------
What we found was if we created a twin schedule for those pc's that
we did not think were going to back up, this would make things work.
The magic was that the twin schedule had to have a lower priority
than the main schedule. Of course this took a lot of work, moving
machines from the main schedule to the twin if they were problem
cases and moving them from the twin back to the main schedule once
they were fixed up.

Things got out of hand during a period when the
micro computer dept was restaging pc's for many folks and both the
old and new pc's were scheduled in adsm, so
we gave up on prompted and went to polling.
This seemed to fix the problem. Actually we have both polling and
prompted. New units would get polling. Old units
would get switched to polling if they called the help desk about
a failure to back up. We would see this be the only change
required to turn a client from a non-performer into a performer.
Fix:
---
Nodes that time out have to be moved to the bottom of the scheduler
prompter's to-do list.
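
A minimal sketch of this proposed fix, again as a toy model and not the
real server code. The function name, the time budget and the cost
values are assumptions made up for the example.

```python
from collections import deque


def run_prompter_with_demotion(nodes, reachable, passes, budget,
                               contact_cost=1, timeout_cost=10):
    """Toy model: a node that times out is moved to the bottom of the list.

    nodes     -- node names, assumed already sorted by schedule priority
    reachable -- set of nodes that answer when prompted
    budget    -- time units available in one pass (hypothetical value)
    Returns the set of nodes whose schedule was started.
    """
    queue = deque(nodes)
    started = set()
    for _ in range(passes):
        remaining = budget
        # Look at each queued node at most once per pass.
        for _ in range(len(queue)):
            node = queue[0]
            cost = contact_cost if node in reachable else timeout_cost
            if cost > remaining:
                break  # pass is over; the queue keeps its current order
            remaining -= cost
            queue.popleft()
            if node in reachable:
                started.add(node)
            else:
                queue.append(node)  # timed out: demote to the bottom
    return started


nodes = ["dead1", "dead2", "dead3", "pc1", "pc2"]
result = run_prompter_with_demotion(nodes, {"pc1", "pc2"},
                                    passes=2, budget=25)
print(sorted(result))  # ['pc1', 'pc2']
```

With the same list and time budget that starve the healthy nodes when
the prompter always restarts at the top, demoting the timed-out nodes
lets pc1 and pc2 be prompted on the second pass.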

-------------------------------------------------------------------------
Leonard Boyle, Mainframe support            snolen AT vm.sas DOT com
SAS Institute Inc.                          ussas4hs@ibmmail
Room E206                                   (919) 677-8000 ext 6241
203 SAS Campus Drive
Cary NC 27513