Archive log fills up but nothing is running/happening on the instance.

Tomasz

(Side note: everything important for my TSM question is in bold.)

Hello guys,

I have a big problem with my client's legacy environment. This is the part of their infrastructure that is still on TSM 6.x and will not be patched or updated any further, as it exists purely for restoring data from archives.
So there is no support from IBM whatsoever, because the infrastructure is outdated...

So, like I wrote, this is an old environment, and I got literally no docs on how it was built, by whom, or which library (LB) it is connected to (it is connected to one, but no one can tell me whether the library itself has died, because due to Covid restrictions they do NOT want to send anyone to check...). That is thanks to team "X", and I blame myself as well, as I didn't pursue the topic during the KT.

So everything I have is my own assumptions and my investigation of the environment. Enough of the introduction.

When I received this part of the environment, I was asked to check whether the RISK raised for it was still valid. The RISK stated that all 3 out of 3 instances were dead and that the previous team had not been able to start them. As I said, those instances exist only for archive restores (retention of 1 million years or close to that number), so once a restore was done, team "X" halted them, as agreed with the customer, and shut down the library.

More than half a year after they were halted, someone wanted to do a restore, and here the fun part begins. While the TSM servers were halted, "something" took all the space in the active and archive logs, so none of the instances were able to start properly. As team "X" had been off-boarded, I did some diagnostics (I read what the logs gave me = full archive/active logs), and after assigning some additional storage I had spare, I was able to run them properly.

In the meantime, the UNIX team randomly decommissioned some of the LUNs on which two of the three instances were placed, and therefore crashed those two poor old guys beyond repair (neither the old team nor the guys in charge of the library knew where the tape with the DB backup went; my guess is eBay...).
But back to the one instance that is left.

I was able to run that instance with the additional storage allocated, but even after I turned off all the processes there (client schedules and admin schedules, replication and the like), the archive log kept filling until the instance went down. I once again allocated new storage to the archive logs, but all of that was done from the AIX side; I was not able to do anything from the TSM side because the log was growing too fast.

The second time we got this instance running, my colleague tried to run a DB backup. As we have Schrodinger's Library here, with no tapes inside (neither data nor scratch), he checked that we had lots of disk space, made one DB backup to disk, and then, once the archive logs were cleaned, halted the instance.

I came back to that "magic" environment after 6 months to confirm that the instance was still able to start (server uptime is 1200 days, so in case of a reboot it would probably die...). To my total surprise, the archive and active logs were once again full (roughly 650 GB for the archive log and 500 GB for the active log), even though TSM was stopped. After a while of digging (like, I read the whole internet...), I found out that a process called db2fmc can do that (among other things, it tries to start the instance if it is dead, and if it can't, it writes everything to the logs). It was running even though TSM had been halted properly.

Once I fixed that issue, I figured that everything should go back to normal, and that if I added some space to the archive/active logs I would be able to run the instance, check everything and forget about that part of my life. But as you might already guess, this was not the case. Yes, the archive logs did not grow while the instance was halted (once the DB2 fault monitor daemon was killed), but as soon as we start the instance, it devours all the space we add to the logs in an instant.
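To get some hard numbers on how fast the logs grow and which files are eating the space, something like the following could be run from the AIX side while the instance is up. This is only a rough sketch in Python: the function names are made up for illustration, and the directory you pass in (for example your archive log path) is whatever your environment uses, which I do not know for this one.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A log file may be removed between listing and stat; skip it.
                pass
    return total


def growth_rate(first, second):
    """Return growth in bytes per second between two (timestamp, size) samples."""
    (t0, s0), (t1, s1) = first, second
    if t1 <= t0:
        raise ValueError("second sample must be taken after the first")
    return (s1 - s0) / (t1 - t0)


def largest_files(path, limit=5):
    """Return up to `limit` (relative_path, size) pairs, biggest files first."""
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                found.append((os.path.relpath(full, path), os.path.getsize(full)))
            except OSError:
                pass
    found.sort(key=lambda item: item[1], reverse=True)
    return found[:limit]
```

Taking `(time.time(), dir_size_bytes(log_dir))` twice, a minute or so apart, and passing both samples to `growth_rate` gives a bytes-per-second figure, which at least tells you how long a given amount of added space will last.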


So here are my questions, as I cannot keep adding space to the archive logs:

-What might be taking that archive log space, and how do I kill it or prevent it from doing so? Is it possible that the DB backup, which tries to run when the archive logs are above 80%, takes that space, and because it can't go to tape (again: Schrodinger's Library) it crashes, loops and fills the logs? While we were increasing the archive logs, usage dropped only to 97%, so way above the threshold for the automatic DB backup. I have just 220 GB of spare space, so maybe I should increase the archive log capacity until usage falls below the 80% threshold?
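To put a number on that last idea, here is a rough calculation of how much space would have to be added for usage to drop below a given threshold. It is only a sketch: the function names are made up, it assumes the amount of data in the log stays constant while space is added (which has not been true here so far), and the example figures of 650 GB at 97% are just illustrative, since I am not sure those two numbers were measured at the same moment.

```python
def used_gb(capacity_gb, used_pct):
    """Return the used space in GB for a filesystem of capacity_gb at used_pct percent."""
    return capacity_gb * used_pct / 100.0


def capacity_for_threshold(used, threshold_pct):
    """Return the capacity in GB at which `used` GB equals threshold_pct percent."""
    if not 0 < threshold_pct <= 100:
        raise ValueError("threshold_pct must be in (0, 100]")
    return used / (threshold_pct / 100.0)


def extra_space_needed(capacity_gb, used_pct, threshold_pct=80.0):
    """Return how many GB must be added so usage falls to threshold_pct or lower."""
    needed = capacity_for_threshold(used_gb(capacity_gb, used_pct), threshold_pct)
    return max(0.0, needed - capacity_gb)


# Illustrative figures only: a 650 GB archive log filesystem that is 97% full.
extra = extra_space_needed(650, 97)
print(f"Extra space needed: about {extra:.0f} GB")
```

With those assumed figures the result is roughly 138 GB, which would fit into the 220 GB I have spare, but only if nothing keeps writing to the log in the meantime.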

-We have one DB backup on disk (the one made by my colleague a year ago), and as it consumed over 1 TiB of space, we do not have the luxury of making another one the same way (lack of storage). But after that backup, NOTHING was done with the environment (no data movement, nothing) and all the logs were purged, so HOW BIG IS THE RISK if we manually delete the logs so that TSM recreates them on its own during startup? Or should we do a DB restore from the one backup we have, which should put our logs back to square one, that is, 0% consumption on an instance that has not been started yet?

There is no way to do another DB backup to disk due to lack of space, and as no one can tell me what is going on with the library, and as there are no scratch tapes in it, a DB backup to tape is not an option either...


For all the help you can provide, I am truly thankful!!
And my apologies for my bad English; unfortunately it appears I suck at learning languages other than Python :p

Best regards
Tomasz
 