ADSM-L

Re: [ADSM-L] Recovering Linux TSM server from partial filesystem failure

2014-03-11 12:56:24
Subject: Re: [ADSM-L] Recovering Linux TSM server from partial filesystem failure
From: Zoltan Forray <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 11 Mar 2014 12:52:46 -0400
The db2diag.log file was lost along with the root and /home partitions.  All
I have are ghost messages from the activity log (the TSMManager console saves
a lot of the messages in its buffers).
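
Since those saved console lines are the only record left, one way to make
them easier to read is to pull the timestamp and ANR message ID out of each
line and count the messages by ID.  What follows is only a minimal Python
sketch, assuming the lines look like the excerpts quoted further down in this
thread (date, time, AM/PM, then an ANRnnnnX message ID).  The function names
are hypothetical, and wrapped continuation lines are simply skipped.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp, then an ANR message ID whose last letter is the severity
# (for example E = error, W = warning, D = diagnostic).
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) (ANR\d{4}[A-Z])"
)


def parse_line(line):
    """Return (timestamp, message_id), or None for continuation lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")
    return stamp, match.group(2)


def count_by_id(lines):
    """Count how often each ANR message ID appears in the given lines."""
    parsed = (parse_line(line) for line in lines)
    return Counter(item[1] for item in parsed if item is not None)


if __name__ == "__main__":
    sample = [
        "3/6/2014 8:00:11 PM ANR2971E Database backup/restore/rollforward",
        "terminated - DB2 sqlcode -980 error.",
        "3/7/2014 8:00:10 PM ANR2971E Database backup/restore/rollforward",
        "3/7/2014 8:00:10 PM ANR1893E Process 213 for Database Backup completed",
    ]
    for message_id, count in count_by_id(sample).most_common():
        print(message_id, count)
```

Counting by ID rather than by full text keeps repeated alerts such as
ANR3491E from drowning out the first real error.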


On Tue, Mar 11, 2014 at 12:21 PM, Chavdar Cholev
<chavdar.cholev AT gmail DOT com> wrote:

> Zoltan,
> if not you can check here:
> http://www-01.ibm.com/support/docview.wss?uid=swg21420318
>
> On 3/11/2014 17:17, Zoltan Forray wrote:
>
>> With the lack of replies, I am guessing I can't recover this server from
>> what is left behind.  I do have old DB backups, but for what this server
>> does, it isn't worth bothering.  I can rebuild it faster.
>>
>> I do have additional questions that somebody might have an answer to.
>>
>> 1.  Any reason NOT to install 7.1 on this box?  My only hesitation is that
>> my last 6.1 server (being upgraded to 6.3.4 in two weeks) has to
>> communicate with it to perform DBSNAPSHOT backups.
>>
>> 2.  When doing the postmortem on this failed server (still waiting for
>> results from hardware diagnostics - my OS guy is headed to the offsite
>> location to check on the results and to start reinstalling the OS), I
>> noticed this message from my monitoring system:
>>
>> 3/6/2014 8:00:11 PM ANR2971E Database backup/restore/rollforward
>> terminated - DB2 sqlcode -980 error.
>>
>> Unfortunately, everywhere I Google sqlcodes, there is no *-980*.  Does
>> anybody have a better magic decoder ring to tell me what this is saying?
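
On the decoder-ring question: DB2 message identifiers are normally formed
from the SQLCODE by taking its absolute value, zero-padding it to at least
four digits, and prefixing "SQL", so -980 corresponds to SQL0980 rather than
to anything literally named "-980".  A minimal Python sketch of that mapping
follows.  The function names are hypothetical, and the trailing severity
letter (for example the C in SQL0980C) cannot be derived from the number
alone.  As far as I know, SQL0980C reports a disk error, which would fit the
filesystem damage described in this thread, but the exact text should be
confirmed on a working DB2 instance with the command the sketch prints.

```python
def sqlcode_to_message_id(sqlcode: int) -> str:
    """Map a DB2 SQLCODE such as -980 to its message identifier stem."""
    # The sign only distinguishes errors (negative) from warnings
    # (positive); the identifier uses the absolute value, zero-padded
    # to at least four digits.
    return f"SQL{abs(sqlcode):04d}"


def clp_help_command(sqlcode: int) -> str:
    """Build the DB2 command line processor help request for a SQLCODE."""
    return f'db2 "? {sqlcode_to_message_id(sqlcode)}"'


if __name__ == "__main__":
    print(sqlcode_to_message_id(-980))
    print(clp_help_command(-980))
```

Searching for the SQL0980 form instead of the raw number tends to be more
productive, because that is how the DB2 message reference is indexed.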
>>
>>
>> On Mon, Mar 10, 2014 at 11:57 AM, Zoltan Forray <zforray AT vcu DOT edu> 
>> wrote:
>>
>>> As soon as I know more, I will post here.  My OS guy (offsite with the
>>> box) just reported/confirmed the root filesystem is a loss and will have
>>> to be rebuilt/reinstalled.  He is running Dell hardware diagnostics right
>>> now.
>>>
>>> Going back through what logs/reports I have available, I found that there
>>> was some kind of hiccup on 03/06/2014, which seems to be the start of its
>>> downfall.
>>>
>>> 3/6/2014 1:58:51 PM ANR0106E admnode.c(23257): Unexpected error 4505
>>> fetching row in table "Nodes".
>>> 3/6/2014 1:58:51 PM ANR9999D_2821097399 imInsertArchive(imarins.c:858)
>>> Thread<124724>: Error 9999 setting anyV2Client=yes for nodeId=9, will
>>> continue
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724> issued message 9999 from:
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000dc6503
>>> OutDiagToCons
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000dc9305 outDiagfExt
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x000000007ec0a6
>>> imInsertArchive
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000863a41
>>> imUpdateInventory
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x0000000088a38b
>>> imPrepareTxn
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000d64435 tmEndX
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000b8078d SmEndVbTxn
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000ba6230
>>> SmNodeSession
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000b69c23
>>> smExecuteSession
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000e7119d
>>> psSessionThread
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00000000e5e01a StartThread
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00003e6be079d1 *UNKNOWN*
>>> 3/6/2014 1:58:51 PM ANR9999D Thread<124724>  0x00003e6b6e8b6d *UNKNOWN*
>>> 3/6/2014 1:58:51 PM ANR3491E No sender email address found - unable to
>>> send email for alert, ANR9999D.
>>> 3/6/2014 1:58:51 PM ANR0157W Database operation INSERT for table
>>> DF.Segments failed with result code 4505 and tracking ID: 0x7fff6c07dd28.
>>> 3/6/2014 1:58:51 PM ANR0158W Database operation INSERT for table
>>> DF.Segments failed with operation code 4505 and tracking id
>>> 0x7fff6c07dd28.
>>> The data for column 0 is: (int32)2.
>>> 3/6/2014 1:58:51 PM ANR0158W Database operation INSERT for table
>>> DF.Segments failed with operation code 4505 and tracking id
>>> 0x7fff6c07dd28.
>>> The data for column 1 is: (int32)0.
>>> 3/6/2014 1:58:51 PM ANR0158W Database operation INSERT for table
>>> DF.Segments failed with operation code 4505 and tracking id
>>> 0x7fff6c07dd28.
>>> The data for column 2 is: (int64)7348.
>>> 3/6/2014 1:58:51 PM ANR0158W Database operation INSERT for table
>>> DF.Segments failed with operation code 4505 and tracking id
>>> 0x7fff6c07dd28.
>>> The data for column 3 is: (int32)0.
>>> 3/6/2014 1:58:51 PM ANR0102E dfcreate.c(1959): Error 4505 inserting row
>>> in
>>> table "DF.Segments".
>>> 3/6/2014 1:58:51 PM ANR1181E dftxn.c(216): Data storage transaction
>>> 0:2952042 was aborted.
>>> 3/6/2014 1:58:51 PM ANR0532W smnode.c(4155): Transaction 0:2952042 was
>>> aborted for session 38312 for node FIREBALL (Linux/x86_64).
>>> 3/6/2014 1:58:51 PM ANR3491E No sender email address found - unable to
>>> send email for alert, ANR1181E.
>>>
>>> Note that "FIREBALL" is one of the production servers that does a
>>> DBSNAPSHOT to this server.
>>>
>>> Then nothing until it tried to back up its own database later that day.
>>>
>>> 3/6/2014 8:00:11 PM ANR2971E Database backup/restore/rollforward
>>> terminated - DB2 sqlcode -980 error.
>>> 3/6/2014 8:00:11 PM ANR1893E Process 209 for Database Backup completed
>>> with a completion state of FAILURE.
>>> 3/6/2014 8:00:11 PM ANR3491E No sender email address found - unable to
>>> send email for alert, ANR1893E.
>>>
>>> Then again the following day and that was all she wrote.  Seized up later
>>> that day/night.....
>>>
>>> 3/7/2014 8:00:10 PM ANR2971E Database backup/restore/rollforward
>>> terminated - DB2 sqlcode -980 error.
>>> 3/7/2014 8:00:10 PM ANR1893E Process 213 for Database Backup completed
>>> with a completion state of FAILURE.
>>> 3/7/2014 8:00:10 PM ANR3491E No sender email address found - unable to
>>> send email for alert, ANR1893E.
>>>
>>> On Mon, Mar 10, 2014 at 11:09 AM, Arbogast, Warren K
>>> <warbogas AT iu DOT edu> wrote:
>>>
>>>> Zoltan,
>>>> We are all eager to know if the something that happened had anything to
>>>> do with TSM 6.3.2 or DB2. Since they seem to be fine and the OS needs
>>>> to be rebuilt, presumably not. Sometimes i's and t's beg to be dotted
>>>> and crossed.
>>>>
>>>> Best wishes,
>>>> Keith Arbogast
>>>> Indiana University
>>>>
>>>>
>>>> On Mar 10, 2014, at 10:55 AM, Zoltan Forray wrote:
>>>>
>>>>> We recently had our offsite/recover TSM server (RH Linux 6.4, TSM
>>>>> 6.3.4.200) go south.  Something happened that caused DB2 to start
>>>>> crashing/dumping and subsequently completely filled the filesystem
>>>>> containing /home/tsminst1 directory.  Since this was the root folder,
>>>>> the system tanked and is now unrecoverable.  My OS guy says the root
>>>>> system seems to be corrupted and will probably require a complete OS
>>>>> reinstall.
>>>>>
>>>>> However, the filesystems containing the TSM DB, LOG and ARCHLOG files
>>>>> all seem to be OK.
>>>>>
>>>>> Since this is an offsite, non-critical server that simply stored DB
>>>>> Snapshots of my other production TSM servers, nuking and rebuilding is
>>>>> not a big deal, mostly lots of busy-work.  This could also give me the
>>>>> opportunity to install and play with 7.1.
>>>>>
>>>>> I would like to make this a "DR recovery" scenario/test.  Since the DB
>>>>> is still there, can it be recovered from what remains, i.e. the /TSMDB,
>>>>> /TSMLOG, /TSMARCHLOG filesystems?
>>>>>
>>>>> --
>>>>> *Zoltan Forray*
>>>>> TSM Software & Hardware Administrator
>>>>> Virginia Commonwealth University
>>>>> UCC/Office of Technology Services
>>>>> zforray AT vcu DOT edu - 804-828-4807
>>>>> Don't be a phishing victim - VCU and other reputable organizations will
>>>>> never use email to request that you reply with your password, social
>>>>> security number or confidential personal information. For more details
>>>>> visit http://infosecurity.vcu.edu/phishing.html
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>


--
*Zoltan Forray*
TSM Software & Hardware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
zforray AT vcu DOT edu - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will
never use email to request that you reply with your password, social
security number or confidential personal information. For more details
visit http://infosecurity.vcu.edu/phishing.html