Re: ADSM RESTORE OPS

> ----------
> From:         Daniel Thompson[SMTP:thompsod AT USAA DOT COM]
> Sent:         Sunday, September 21, 1997 8:00 PM
> To:   ADSM-L AT VM.MARIST DOT EDU
> Subject:      Re: ADSM RESTORE OPS
>
> Daniel,
>
> We appear to have found a workaround which solves the problem with the
> help of our TCP/IP vendor.  I forward his notes to you and thank you
> for your response to my email.  Do you have any comments to add now
> that this additional big clue has been found as to the "actual" cause
> of the problem here (which is still unresolved)?
>
> Gil
>
> Response from TCP/IP vendor follows:
>
>   Here is a summary of my visit to Maritz and where we need to go
> from here to solve the actual problem that has been in existence
> since the 15th of May.
>
>    The problem as I understand it appeared during the restores of
> Databases from the IBM host using IBM's ADSM. Restores would appear
> to progress normally for a time and then at random intervals hang and
> never recover. The connection between the Sun servers with the
> databases and the Host is a FDDI ring with the host attachment being
> an IBM Osa board. Interlink's TCP Access is being used on the Host
> with the standard TCP implementation on the Suns. The Sun systems are
> patched with all the latest patches and the TCP Access code is
> PUT9703 with many patches applied.
>
>   After looking at the traces of the TCP flow from both Sun Snoop and
> the HP FDDI Monitor the condition at the time of the Hang was seen to
> be the following. The TCP data flow appears to go along just fine
> till the moment of the hang. At that point we see the Sun make 8 fast
> retransmits indicating that it for some reason doesn't like the block
> of data from the host. The TCP Access code correctly retransmits the
> block many times after that but the Sun TCP refuses to acknowledge any
> of these blocks. After sending the original 8 fast retransmits the Sun
> TCP stack goes completely brain dead. No more traffic is seen from
> the Sun. After 4.9 minutes the IBM ADSM application decides the
> client has gone away and breaks the connection.
>
>    Further investigation showed that during the 4.9 minutes that the
> host was retransmitting the block in question, running a
> netstat -s command showed that the Sun's tcpInErrs counter was
> increasing for each block that was sent from the host. Thinking that
> maybe the Host may have malformed the block in question we extracted
> from the HP trace a complete block and with the Help of Jim Sansing
> of Interlink looked at it to see if it was in Error. Jim has
> determined that the block appears to be fine and doesn't know why the
> Sun should reject it.
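
The counter check described above can be repeated by comparing two `netstat -s` snapshots. The following is only a minimal sketch: it assumes the Solaris-style layout of one or more "name = value" pairs per line, and the sample snapshots and helper names (`parse_counters`, `counter_delta`) are hypothetical, not taken from the traces in this thread.

```python
import re

# Assumed layout: Solaris-style `netstat -s` output, where each line holds
# one or more "name = value" pairs. The sample text below is illustrative,
# not taken from the trace described in this thread.
COUNTER_RE = re.compile(r"(\w+)\s*=\s*(-?\d+)")


def parse_counters(netstat_output):
    """Return a dict mapping counter names to integer values."""
    return {name: int(value) for name, value in COUNTER_RE.findall(netstat_output)}


def counter_delta(before, after, name="tcpInErrs"):
    """Return how much one counter grew between two snapshots."""
    return parse_counters(after).get(name, 0) - parse_counters(before).get(name, 0)


# Hypothetical snapshots taken a few seconds apart during a hang.
sample_before = """
TCP     tcpInDupAck         =   812     tcpInErrs           =     3
        tcpRetransSegs      =    41     tcpOutRsts          =     0
"""
sample_after = """
TCP     tcpInDupAck         =   812     tcpInErrs           =     7
        tcpRetransSegs      =    41     tcpOutRsts          =     0
"""

if __name__ == "__main__":
    print("tcpInErrs grew by", counter_delta(sample_before, sample_after))
```

A delta that rises in step with the host's retransmissions would match the behavior reported here. It does not say why the Sun counts the segments as errors.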
>
>   In the meantime I fooled with some parameters on the Sun and
> managed to get restores to work. I changed the TCP_MSS_MAX parameter
> to 3840 bytes. After making that change the restores appear to work.
> With the help of John Frank we have run many restores that have
> completed successfully since making that change. We also set the
> parameter back to its original value of 65495, allowing the Host and
> Sun to select the largest packet size they thought best. With the
> default TCP_MSS_MAX the hang problem reappeared.
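
For a sense of what the two settings mean on an FDDI ring, here is a rough sketch of the segment-size arithmetic. It is a simplified model, assuming the IP MTU of 4352 bytes for FDDI (RFC 1390) and 20-byte IP and TCP headers with no options. It ignores the MSS the host advertises and any path MTU effects, and the function name `effective_mss` is hypothetical.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20  # TCP header without options
FDDI_IP_MTU = 4352     # IP MTU on FDDI per RFC 1390


def effective_mss(tcp_mss_max, link_mtu=FDDI_IP_MTU):
    """Simplified model: the MSS is capped by both the tunable and the link MTU."""
    return min(tcp_mss_max, link_mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES)


if __name__ == "__main__":
    # The two TCP_MSS_MAX values mentioned in the vendor's notes.
    for setting in (65495, 3840):
        print(setting, "->", effective_mss(setting))
```

Under these assumptions the default of 65495 leaves the link MTU as the limit (4312-byte segments), while 3840 caps segments below that. This only illustrates the size difference and is not an explanation of why the larger segments were rejected.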
>
>   So for the moment we seem to have found a workaround which allows
> the restores to happen but haven't found the real reason for the
> failures. Since we haven't really fixed the problem it's possible
> that it could reappear, though that seems less likely with each
> successful restore. With Jim saying that the packet seen by the HP
> scope is good, the problem moves back into Sun's court. We need to
> find out from Sun why it's rejecting the block.
>
>   Jim Sansing has suggested that what needs to happen next is for a
> conference call to be set up between Interlink, Maritz and Sun to see
> if an action plan can be drawn up to find the reason for the packet
> rejection. Maybe John at Maritz could set this up as I'm going to be
> on the road for the next 2 days?
>
>   Anyhow that's the summary as I see it. Currently we have the
> restores working but the cause is unknown. The next step is to get
> everyone on the phone to see if we can nail down the real cause
> for this problem.
>
>     Don
>   Some questions:
>   1) What does the session status show on the MVS server when viewed
> via the admin client: IDLEWAIT, MEDIAWAIT, etc.?
>
> 2) If IDLEWAIT, is there a tape dismount anywhere near the time that
> the idlewait began?  Where exactly this dismount message occurs
> depends on the volume retention parms on the device class you are
> using.  You can find this dismount in the MVS system logs.  This is a
> valid question only for the volumes used in the restore.  If you do
> not know these, try a show volumeusage before the restore.  (A show
> volumeusage for large restores is part of our internal procedures.
> This command does not seem to work once the restore is in progress.)
>
> Let us know,
> Dan T.
> ----------
> > From: Standen, Gilbert <StandeGL AT MARITZ DOT COM>
> > To: ADSM-L AT VM.MARIST DOT EDU
> > Subject: ADSM RESTORE OPS
> > Date: Friday, September 19, 1997 5:18 PM
> >
> > Greetings!  "ADSM restore intermittent hangs" (also known as "The
> > Lawnmower man hits a pothole now and then")  We use MVS ADSM with
> > Interlink for MVS TCP/IP
> > to SUN Solaris 2.5.1 over FDDI network and use the ADSM to back up
> > several ORACLE databases running on the SUN servers.  We have an
> > intermittent glitch and would greatly appreciate anything you could
> > share with us.
> > Backup works with 100% reliability, but restores sometimes
> > inexplicably "hang".  We see only ANR0480W in the ADSM error log.
> > Vendor technical support for various products in the data path have
> > so far not solved this problem. We have been working on this
> > problem for three months.
> > Any responses very welcome !
> >
> > Gil Standen
> > Maritz Corporation
> > St. Louis, Missouri
> > standegl AT maritz DOT com
> > (314)-827-1016
> > (314)-827-3146 (fax)
>