ADSM-L

Re: Hanging TSM backup invalidates journal?

2003-11-10 05:14:20
Subject: Re: Hanging TSM backup invalidates journal?
From: David McClelland <David.McClelland AT REUTERS DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Mon, 10 Nov 2003 10:11:34 +0000
Pete,

Thanks, as ever, for your responses.

>> ...indicates that another backup session is attempting to connect to
>> the journal daemon while another journal based backup session is in
>> progress...

Indeed - one backup session (two threads) had hung in IdleW, and the
following day's backup had kicked in, so it all seems to fit.

We are running the 5.1.5.0 client and are awaiting a time slot to
upgrade to 5.1.6.7, so I should be able to have a play with this timeout
setting. I look forward to 5.22... Coming to an FTP server near me soon?

I too have noticed many times, to my chagrin, that cancelling out of a
backup session (especially during testing, when I want to see whether
the magical 'Using journal for x$' comes up...) can invalidate the
journal.

As for the tsmjbbd.exe process: even though there are no journal
backups in progress, I've come in this morning and it's still sitting at
263,632K, and the backups over the weekend haven't been journaled ones.
I suspect the journal process isn't running properly now, probably
because of last week's hung backup, so I think I'm going to have to
bounce it and begin again...

(5 minutes later) Hum, I had trouble stopping the TSM Journal Service
process, and found that a dsmc backup in a hung state was locking it. I
probably should have spotted that (I could swear it *wasn't* there
before, but NTPROCINFO.EXE suggested it had been there since last week,
even though there was no corresponding session on the TSM server). So I
think this hung dsmc process accounts for most of the apparently
non-functional journaling daemon.

That's almost all for now; I think this is resolved... Has anyone else
who is using TSM journaling devised an availability/monitoring plan for
it?

Rgds,

David McClelland
Global Management Systems
DTC/6 - x5-4670


-----Original Message-----
From: Pete Tanenhaus [mailto:tanenhau AT US.IBM DOT COM] 
Sent: 07 November 2003 15:38
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Hanging TSM backup invalidates journal?


Based on your description, a couple of things are going on.

The following message:

>>> Named pipe error
>>> connecting to server WaitOnPipe failed. > NpOpen: call failed with 
>>> return code:121 pipe name //./pipe.jnl'.

indicates that another backup session is attempting to connect to the
journal daemon while another journal based backup session is in
progress.

This can happen if multiple backup client processes attempt to perform a
journal based backup at the same time, or if the ResourceUtilization
option setting is higher than 2 and produces multiple backup sessions.

The level of client you are running will only wait about 2 minutes for a
connection to the journal daemon to become free and will then time out.

A testflag was implemented in the 5.1.6.2 level fixtest to allow a
client to specify how long it will wait for a connection to the journal
daemon to become free (that is, for the currently running jbb session to
finish).
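To make the "one session at a time, then time out" behaviour concrete, here is a toy model in Python. It is only a sketch under stated assumptions: `JournalPipe`, `connect` and `disconnect` are hypothetical names, the single-connection daemon is modeled as a lock acquired with a timeout, and the 120-second default merely mirrors the "about 2 minutes" above. The real testflag's name and syntax are not given in this thread and are not modeled.

```python
import threading


class JournalPipe:
    """Hypothetical model: the journal daemon serves one JBB session at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def connect(self, wait_seconds: float = 120.0) -> bool:
        """Wait up to wait_seconds for the connection; False means we timed out."""
        return self._lock.acquire(timeout=wait_seconds)

    def disconnect(self) -> None:
        """Free the connection so the next waiting session can proceed."""
        self._lock.release()
```

For example, if one session holds the connection and a second calls `connect(0.1)`, the second gets `False` after 0.1 seconds, which is the analogue of the return code 121 failure; raising the wait time only helps if the first session eventually finishes, not if it is hung.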

You might also consider reducing the ResourceUtilization setting to 2 or
less.
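A quick way to audit many clients for this is to scan each client's dsm.opt. The sketch below is a minimal, hypothetical helper: it assumes options are written one per line as `NAME value`, that lines starting with `*` are comments, that the full option name `RESOURCEUTILIZATION` is used (abbreviated spellings are not handled), and that the last occurrence wins.

```python
from typing import Optional


def resource_utilization(opt_text: str) -> Optional[int]:
    """Return the RESOURCEUTILIZATION value from dsm.opt text, or None if unset."""
    value = None
    for raw in opt_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and '*' comment lines
        parts = line.split()
        if parts[0].lower() == "resourceutilization" and len(parts) > 1:
            value = int(parts[1])  # assumption: the last occurrence wins
    return value


def jbb_multi_session_risk(opt_text: str) -> bool:
    """True if the setting is above 2, which can start multiple backup sessions."""
    value = resource_utilization(opt_text)
    return value is not None and value > 2
```

Running `jbb_multi_session_risk` over each journaled client's option file flags those that may open more than one session against the journal daemon.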

Multi-session journal based backup isn't currently supported and is a
known requirement for a future release (APAR IC36361 is currently open
against this problem).

I have also recently discovered a problem in which a valid journal gets
invalidated anytime a journal based backup starts but doesn't complete
(due to a session drop, client terminated by the user, etc.).

As a result, the next backup will not be journal based (it will be a
normal full incremental), and journal based backup won't be available
again until a full backup completes and re-validates the journal.
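The invalidation rule described above can be summarized as a small state model. This is an illustrative sketch only (the `JournalState` class and its method are hypothetical, not TSM code). It encodes the pre-IC37908 behaviour as stated in this thread: any backup that starts but does not complete leaves the journal invalid, and only a completed full incremental makes it valid again.

```python
from dataclasses import dataclass


@dataclass
class JournalState:
    """Hypothetical model of journal validity as described in this thread."""

    valid: bool = False

    def run_backup(self, completes: bool) -> str:
        """Run one nightly backup and return which kind of backup it was."""
        kind = "journal-based" if self.valid else "full incremental"
        if not completes:
            # An interrupted backup (session drop, user cancel, hang) invalidates the journal.
            self.valid = False
        elif kind == "full incremental":
            # Only a completed full incremental re-validates the journal.
            self.valid = True
        return kind
```

Starting from a valid journal, an interrupted journal-based backup makes the next run a full incremental; once that completes, the following run is journal based again. This matches the 40-minute versus 13-hour pattern in the original report.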

APAR IC37908 has been opened against this problem, which should be fixed
in the 5.22 level client.

It is reasonable for the journal daemon process to utilize a large
amount of memory while processing a large journal query, which involves
building a sorted list of objects to send to the client, but the memory
should eventually be released when the journal based backup completes.

I have noticed that very large journal queries and journal based backups
can create prolonged delays in the journal daemon, and I am looking at
ways of making these queries more efficient, both in terms of memory
utilization and in terms of processing time.

Hope this helps answer your questions ....

Regards, Pete


Pete Tanenhaus
Tivoli Storage Solutions Software Development
email: tanenhau AT us.ibm DOT com
tieline: 320.8778, external: 607.754.4213

"Those who refuse to challenge authority are condemned to conform to it"

---------------------- Forwarded by Pete Tanenhaus/San Jose/IBM on 11/07/2003 10:11 AM ---------------------------

Please respond to "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
Sent by:        "ADSM: Dist Stor Manager" <ADSM-L AT VM.MARIST DOT EDU>
To:     ADSM-L AT VM.MARIST DOT EDU
cc:
Subject:        Hanging TSM backup invalidates journal?



Dear TSMville,

Hanging Backup Invalidates TSM Journal

o - TSM Client 5.1.5.0, journaling engine backing up c. 12,000,000 files
with a daily addition/update of around 80,000. When it works: 40
minutes. When it doesn't: 13 hours...
o - TSM Server 5.1.6.2, WinNT

A couple of nights ago, a journal backup hung and just kinda stayed
around on the TSM server in IdleW without anyone noticing. The next
day's backup began and (I'm guessing from here on) it couldn't get
access to the TSM journal, so it reverted to a looooong normal
incremental backup. I subsequently spotted this, killed off the two
IdleW sessions and kicked off a new backup on the journal client.
However, it failed to do a journal backup and started a normal
incremental again...

Looking in the dsmerror.log, I spy a 'NpOpen: Named pipe error
connecting to server WaitOnPipe failed. > NpOpen: call failed with
return code:121 pipe name //./pipe.jnl'.

I understand that this named pipe is opened at the start of a journal
backup, when the b/a client attempts to connect to the journal daemon.
The return code 121 suggests that the connect failed, and possibly that
the tsmjbbd.exe process wasn't up and running. I look at Task Manager,
and it is, but it's consuming a 'healthy' 263,632K of memory. Observing
its behaviour, I see it is still doing some work ('I/O Other' in Task
Manager's useful extra columns) but nothing in the 'I/O Writes' or
'I/O Reads' columns. Is this suspect?

I'm guessing that either the journal became invalidated somewhere down
the line during the hung backup, or the subsequent backup attempt failed
because the old TSM backup still had a lock on it? The tsmjbbd.exe
process is still present, and there is nothing from these dates in the
jbberror.log.

Any ideas what may be going on here? I seem to be able to get around 6
or 7 days of JBB backups before it starts to break and I have to
hand-hold it to get it up again... In terms of automatically monitoring
this, a Tivoli process monitor that makes sure the tsmjbbd.exe process
is running is only useful up to a point (i.e. it wouldn't have spotted
the above), so it looks as though I'm going to have to trawl the stdout
of our backup logs to make sure that 'using journal for x$' is present.
Where else should I be looking? Perhaps in (what we've called)
jbberror.log for 'Journal will be restarted for FS x'?
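The log-trawling check described above could be sketched roughly as follows. `check_jbb_run` is a hypothetical helper. The two marker phrases are taken from this message, and matching them as case-insensitive substrings is an assumption about the exact log wording, which may differ between client levels. An empty result means the run looks journal based and the daemon reported no restarts; anything else could be forwarded to a logfile adapter or alert.

```python
from typing import Iterable, List


def check_jbb_run(backup_stdout: str, jbberror_lines: Iterable[str], fs: str) -> List[str]:
    """Return a list of problems for one nightly run (an empty list means it looks OK).

    Assumption: marker wording is as quoted in the thread and matched as a
    case-insensitive substring.
    """
    problems: List[str] = []
    marker = f"using journal for {fs}".lower()
    if marker not in backup_stdout.lower():
        problems.append(f"'Using journal for {fs}' not found: backup was probably not journal based")
    for line in jbberror_lines:
        if "journal will be restarted" in line.lower():
            problems.append("jbberror.log: " + line.strip())
    return problems
```

In practice, you would read the scheduled backup's stdout and jbberror.log (for instance with `open(path, errors="replace")`) and call this once per journaled file system.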

So, questions are:

o - Any ideas what might be behind the above? A dead/alive tsmjbbd.exe,
and if so, how?
o - tsmjbbd.exe: how big should it be in 'healthy' usage? Is 263MB a bit
excessive?
o - Any ideas about the best way to monitor jbb backups (preferably
using Tivoli, e.g. ITM, logfile adapters, etc.)?

Quite a lot there - sorry!

Rgds,

David McClelland
Global Management Systems
Reuters
85 Fleet Street
London EC4P 4AJ
E-mail  david.mcclelland AT reuters DOT com
Reuters Messaging       david.mcclelland.reuters.com AT reuters DOT net




-------------------------------------------------------------- -- Visit
our Internet site at http://www.reuters.com

Get closer to the financial markets with Reuters Messaging - for more
information and to register, visit http://www.reuters.com/messaging

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be the
views of Reuters Ltd.

