I see this fairly often. It's always a slow-running client who is
backing up a large file. I almost had a Log-full disaster last Sunday
when DB backup started with a slow client already backing up. I
eventually had to DISABLE SESSIONS and cancel all client backups until
the DB backup finished.
The only answer here is to enable throughput monitoring so that slow
clients get cancelled automatically. (I actually have a cron process
that if the log gets 70% full, it parses the output of SHOW LOGPIN and
cancels that client session.) You cannot tolerate slow clients.
We seem to be having a lot more slow clients lately, and one recurring
theme in many of my log full problems is certain faulty versions of
Symantec Endpoint Protection running on client systems. I'm still trying
to pin down the relationship between SEP and slow TSM backups. I'll post
it here if I can narrow it down to a particular release of SEP.
Roger Deschner University of Illinois at Chicago rogerd AT uic DOT edu
Academic Computing & Communications Center
==="If a train station is where a train stops, what's a work station?"==
On Mon, 23 Nov 2009, Dwight Cook wrote:
>I've seen that but generally once the log has hit 80% (of a 13 GB log) and
>TSM is imposing a 2 millisecond delay in transactions in order to try to
>keep the log from filling and crashing TSM.
>I only have seen it a few times a year across various very active
>environments but one thing that seems to be common with me (and you) is that
>there is a long running and slow running client session present in
>conjunction with expiration.
>Notice your SERVERGRAPH client... about 54K sessions old and only 63 MB sent
>& 150 received.
>Not sure how many sessions a day you have but the pattern I've seen is that
>if a slow running client has been running across a couple of expiration
>processes, something will get pinned (but not a deadlock condition) and
>eventually the server will have to be halted to clear the situation.
>I haven't bothered to call in a problem because I don't have the time nor
>the environments available to try to make this recreatable on demand.
>
>I personally think that the long running client blocks the commit from
>expiration and ~if~ it had a long enough time to run it would eventually
>release its pin of the log but in the systems I work with, we can't
>exist/wait that long.
>
>I'll also point out that when we have caught this in its early stage and run
>a DB backup... it would complete and have no impact on clearing the log.
>I've seen this when running both log mode normal and roll forward.
>(Also if caught early enough, I've been able to cancel the client
>successfully.)
>
>Dwight
>
>
>
>
>-----Original Message-----
>From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
>Fred Johanson
>Sent: Monday, November 23, 2009 2:27 PM
>To: ADSM-L AT VM.MARIST DOT EDU
>Subject: [ADSM-L] Log pinned
>
>A different problem than the one a few weeks gone by.
>
>DB backup finished, leaving log usage at 0.1 % used. No sessions running,
>so I started an expiration process. After 30 minutes, this is what I saw:
>
>tsm: GLASSHOUSE>q pr
>
> Process Process Description Status
> Number
>-------- --------------------
>-------------------------------------------------
> 873 Backup Storage Pool Primary Pool SERVDISKPOOL, Copy Pool
>OFFSITEPOOL,
> Files Backed Up: 43179, Bytes Backed
>Up:
> 145,892,524,032, Unreadable Files: 0,
>Unreadable
> Bytes: 0. Current Physical File
>(bytes):
> 2,147,143,680 Current output volume:
>C31199.
> 874 Backup Storage Pool Primary Pool SERVDISKPOOL, Copy Pool
>OFFSITEPOOL,
> Files Backed Up: 12475, Bytes Backed
>Up:
> 137,712,177,152, Unreadable Files: 0,
>Unreadable
> Bytes: 0. Current Physical File
>(bytes):
> 1,384,419,328 Current output volume:
>C32082.
> 875 Backup Storage Pool Primary Pool SERVDISKPOOL, Copy Pool
>OFFSITEPOOL,
> Files Backed Up: 48409, Bytes Backed
>Up:
> 160,964,038,656, Unreadable Files: 0,
>Unreadable
> Bytes: 0. Current Physical File
>(bytes):
> 3,980,283,904 Current output volume:
>C32083.
> 878 Expiration Examined 207408 objects, deleting
>207376 backup
> objects, 0 archive objects, 0 DB
>backup volumes,
> 0 recovery plan files; 0 errors
>encountered.
>
>tsm: GLASSHOUSE>q sess
>
> Sess Comm. Sess Wait Bytes Bytes Sess
>Platform Client Name
> Number Method State Time Sent Recvd Type
>------- ------ ------ ------ ------- ------- -----
>-------- -------------------
>388,208 Tcp/Ip Run 0 S 63.7 M 150 Admin
>Linux86 SERVERGRAPH
>442,970 Tcp/Ip Run 0 S 134.3 K 2.2 K Admin
>AIX FRED
>443,069 Tcp/Ip IdleW 8.9 M 38.4 K 1.2 K Admin
>AIX AILIAS
>445,050 Tcp/Ip IdleW 15 S 4.3 K 405 Node
>SUN SOL- JON
>
>ARIS
>
>tsm: GLASSHOUSE>q log
>
>Available Assigned Maximum Maximum Page Total
>Used Pct Max.
> Space Capacity Extension Reduction Size Usable
>Pages Util Pct
> (MB) (MB) (MB) (MB) (bytes) Pages
>Util
>--------- -------- --------- --------- ------- ---------
>--------- ----- -----
> 12,788 12,788 0 6,996 4,096 3,273,216
>1,482,016 45.3 92.1
>
>
>I cancelled the expiration and the runaway log stopped running.
>
>My thinking is that the expiration is feverishly writing its transaction to
>the log. Since there are no other sessions accessing the log, there are no
>interrupts to trigger a commit. So the log keeps filling, triggering and
>retriggering DB backups. If my thinking is correct, this is a design
>feature in the pre-DB2 b-tree DB, not easily remedied. Just something to
>constantly monitor.
>
>Anybody have any thoughts on this?
>
>
>Fred Johanson
>TSM Administrator
>University of Chicago
>
>773-702-8464
>
|