ADSM-L

Re: Long pause in the middle of an NT backup.

2004-06-10 12:49:54
Subject: Re: Long pause in the middle of an NT backup.
From: Pete Tanenhaus <tanenhau AT US.IBM DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 10 Jun 2004 11:54:32 -0400
>> Actually, we're not using journaling on the server in question. The
journal
>> service hasn't been installed.

The only other possibility is that you are using Named Pipes for
communicating with the TSM server
(verify by checking CommMethod option in dsm.opt).

The error (rc 233) indicates that the server broke the connection for some
reason which should have caused
the client to try to reconnect to the server or terminate the backup if
the connection couldn't be re-established.



Pete Tanenhaus
Tivoli Storage Solutions Software Development
email: tanenhau AT us.ibm DOT com
tieline: 320.8778, external: 607.754.4213

"Those who refuse to challenge authority are condemned to conform to it"

---------------------- Forwarded by Pete Tanenhaus/San Jose/IBM on 06/10/2004 
11:50 AM ---------------------------
Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Sent by:        "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To:     ADSM-L AT VM.MARIST DOT EDU
cc:
Subject:        Re: Long pause in the middle of an NT backup.



Actually, we're not using journaling on the server in question. The
journal
service hasn't been installed.

--Matt

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Pete Tanenhaus
Sent: Thursday, June 10, 2004 10:12 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Long pause in the middle of an NT backup.


> dsmerror.log:
>
> 06/08/2004 01:30:48 win32NpWrite(): Error 233 writing to named pipe

This indicates that the backup client has encountered an error sending a
journal notification to the journal service.

Most likely what happened is that the error occurred while the client was
performing the initial non-journal based backup of a drive with an active
but not yet valid journal.

The journal service probably encountered an error reading a notification
from the backup client and severed the connection with the client (check
jbberror.log for a return code of 998), and any subsequent notifications
sent to the journal service by the client would result in broken
connection
errors (this is what rc 233 is).

Since the client can't send any more notifications the final "Incremental
Complete" notification is never sent to the journal service and the
journal
for the drive  never set to the valid state, which means that the next
backup of the drive won't be journal based.

This problem is addressed by apar IC40627 and will be fixed in the next
5.23
fixtest (check with IBM service for availability).

I'm not sure if this answers all your questions as to why the backup is so
slow but it should at least answer why journaling isn't working.



Pete Tanenhaus
Tivoli Storage Solutions Software Development
email: tanenhau AT us.ibm DOT com
tieline: 320.8778, external: 607.754.4213

"Those who refuse to challenge authority are condemned to conform to it"

---------------------- Forwarded by Pete Tanenhaus/San Jose/IBM on
06/10/2004 10:58 AM --------------------------- Please respond to "ADSM:
Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Sent by:        "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To:     ADSM-L AT VM.MARIST DOT EDU
cc:
Subject:        Re: Long pause in the middle of an NT backup.



I don't remember at what client level the syntax changed, but at client
level 5.2 you can add a "testflag instrument:detail" to your dsm.opt and
generate a report to dsminstr.report file that details how the session is
spending it's time.

David

>>> Matt.Anderson AT PEARSON DOT COM 6/9/2004 3:55:32 PM >>>
Thanks for the responses. I'm going to implement a cron job tonight to
query
the sessions every few minutes to see exactly what the session in question
is doing during the "inactive" time.

Journaling (as suggested in another message) is currently being tested for
some on one of our servers. We had problems with it in the past but are
hoping it's improved to the point where it's usable to us.

Regarding fragmentation, I hadn't realized that could be related to this
behavior, but it does make sense. In fact, we know we're horribly
fragmented
but are unable to take the necessary downtime to unload/reload the
database
in the foreseeable future.

Thanks!

--Matt



-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Andy Carlson
Sent: Wednesday, June 09, 2004 2:47 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: Long pause in the middle of an NT backup.


One thing that appens during that time is that the TSM server is
downloading
directory entries to the client, to sort through and decide what to back
up.
With millions of entries (which it appears you have), it can take awhile.
We
had a situation with a terribly fragmented DB, where this pause was
lasting
10-15 hours on some boxes.  We did an unload/load, and they have all but
one
gone back to "normal" (the one that didn't has 5.5Million entries, and
does
not have enough memory on the client box).

On Tue, 8 Jun 2004, Anderson, Matt wrote:

> I'm experiencing behavior on some NT nodes where, in the middle of a
> scheduled backup, there will be a long pause where no activity seems
> to occur. It would appear to occur after the backup of one drive
> finishes, but before the next one starts. For example, see this log
> excerpt from this
> morning:
>
> 06/08/2004 01:30:46 ANS1898I ***** Processed 1,911,000 files *****
> 06/08/2004 01:30:47 ANS1898I ***** Processed 1,916,000 files *****
> 06/08/2004 01:30:49 ANS1898I ***** Processed 1,916,500 files *****
> 06/08/2004 01:31:50 Normal File-->       141,971,832
> \\my-server\f$\PerforceETC\Logs\Log.txt  Changed
> 06/08/2004 01:32:34 Retry # 1  Normal File-->       141,971,832
> \\my-server\f$\PerforceETC\Logs\Log.txt [Sent]
> 06/08/2004 01:32:34 Successful incremental backup of
'\\my-server\f$'
>
> 06/08/2004 07:56:10 ANS1898I ***** Processed 1,917,000 files *****
> 06/08/2004 07:56:13 ANS1898I ***** Processed 1,917,500 files *****
> 06/08/2004 07:56:15 ANS1898I ***** Processed 1,918,000 files *****
> 06/08/2004 07:56:17 ANS1898I ***** Processed 1,918,500 files *****
>
> There's almost 6.5 hours where no activity was logged in the
> dsmsched.log. There is a (possibly) corresponding message in the
> dsmerror.log:
>
> 06/08/2004 01:30:48 win32NpWrite(): Error 233 writing to named pipe
>
> The client is Windows 2000 running TSM client version 5.2.2.5 The
TSM
> server is AIX 5.2 running TSM server version 5.2.2.2
>
> Any insight that can be provided would be appreciated!
>
> Thanks,
>
> Matt Anderson
> Backup & Recovery Engineer
>
>
**********************************************************************
> ******
> This email may contain confidential material.
> If you were not an intended recipient,
> Please notify the sender and delete all copies.
> We may monitor email to and from our network.
>
****************************************************************************
>
>

****************************************************************************
This email may contain confidential material.
If you were not an intended recipient,
Please notify the sender and delete all copies.
We may monitor email to and from our network.
****************************************************************************

****************************************************************************
This email may contain confidential material.
If you were not an intended recipient,
Please notify the sender and delete all copies.
We may monitor email to and from our network.
****************************************************************************