TSM DB RESTORE taking two weeks

paulraines

I have a large 1.5-petabyte storage cluster consisting of IBM DDN 9900s connected by dual fibre channel to nine blades in an IBM BladeCenter running RHEL5 and GPFS 3.3. One of these blades is my TSM 6.2.3 server, connected to an IBM TS3584 library with 4 LTO-5 drives and over 1400 LTO tapes.

There is about 700TB right now on our GPFS filesystems that we back up with TSM. This corresponds to several hundred million files. An incremental backup pass through all of it now takes over a month (independent of the amount of actually new data to back up), which most experts think is simply how slow GPFS metadata scanning is on this hardware.

First, a bit about how TSM is installed on the server. The software (/opt/tivoli) and the instance user's home (~tsminst1) are on the server's internal two-disk RAID1, on ext3 filesystems. The TSM DB and logs are on /tsmdb, an ext3 filesystem on a LUN on the DDN SAN. The TSM disk pools are on a GPFS filesystem built from several LUNs on the DDN SAN.

On July 7th one of the internal disks of the TSM server failed. It was part of a two-disk RAID1 on the internal LSI-based RAID controller of the HS22 blade. For reasons no one can explain, the RAID failed, the system crashed, and the root filesystem got corrupted. After an fsck the system rebooted fine. But when I tried to get TSM dsmserv running again, it failed, claiming database errors.

At this point I opened a PMR with Tivoli support. It took ten days of back and forth for them to finally decide that my tablespaces were hopelessly corrupted, that nothing could be done to fix them directly, and that a TSM RESTORE DB would be the only way to proceed. I still don't understand how tables in files that were sitting on the DDN SAN, and not on the local server disk that failed, got corrupted, but whatever.

So I started the DB RESTORE (no point in time designated) from the last DB BACKUP, made in May (remember, each backup pass takes over a month, and I thought I could not run DB BACKUPs during them). I could see the dsmserv process mount the tape with the last DB BACKUP, and it ejected it about 8 hours later. It has now been in a ROLLFORWARD phase for about two weeks. The latest status check today says:

Code:
$ db2pd -recovery -db TSMDB1
Database Partition 0 -- Database TSMDB1 -- Active -- Up 13 days 16:03:54 -- Date 08/12/2013 16:55:43

Recovery:
Recovery Status     0x04000401
Current Log         S0011867.LOG
Current LSN         000005CB7887C494
Job Type            ROLLFORWARD RECOVERY
Job ID              2
Job Start Time      (1375159911) Tue Jul 30 00:51:51 2013
Job Description     Database Rollforward Recovery
Invoker Type        User
Total Phases        2
Current Phase       1

Progress:
Address            PhaseNum   Description   StartTime            CompletedWork        TotalWork
0x0000000200D8EDE8 1          Forward       Tue Jul 30 00:51:51  279241218495 bytes   Unknown
0x0000000200D8EF70 2          Backward      NotStarted           0 bytes              Unknown
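
In case it helps anyone searching later: the restore itself was kicked off with the plain no-point-in-time form of the command as the instance user, essentially just

Code:
dsmserv restore db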

When I look in the directory
/tsmdb/TSM_logs/Archivelog/archmeth1/tsminst1/TSMDB1/NODE0000/C0000000
I see over 306GB of log files. There are over 600 of them, most of them 500MB in size.
The newest files in the directory are:

Code:
-rw-r----- 1 tsminst1 tsmsrvrs   4308992 Jul 24 08:33 S0011961.LOG
-rw-r----- 1 tsminst1 tsmsrvrs    458752 Jul 23 09:51 S0011960.LOG
-rw-r----- 1 tsminst1 tsmsrvrs   3571712 Jul  8 16:11 S0011959.LOG
-rw-r----- 1 tsminst1 tsmsrvrs   3661824 Jul  8 00:44 S0011958.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 124985344 Jul  8 00:38 S0011957.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul  7 07:29 S0011956.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul  6 21:58 S0011955.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul  6 14:06 S0011954.LOG

So it looks like the ROLLFORWARD has about 100 LOG files to go, which at the rate I see it going should be another 3-4 days. BTW, the oldest LOG file in that directory corresponds to the May date of my BACKUP DB.
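
As a sanity check on that estimate, here is the back-of-the-envelope arithmetic as a quick shell snippet (my own, nothing official; it assumes all the archived logs follow the S<number>.LOG naming):

Code:
# Compare db2pd's "Current Log" number against the newest archived log on disk.
cd /tsmdb/TSM_logs/Archivelog/archmeth1/tsminst1/TSMDB1/NODE0000/C0000000
current=11867    # from db2pd: "Current Log  S0011867.LOG"
newest=$(ls S*.LOG | sed 's/^S0*//;s/\.LOG$//' | sort -n | tail -1)
echo "$((newest - current)) log files left to replay"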

As to why things have gone so slowly, the IBM support people are blaming the fact that the server has only 8GB of RAM, and GPFS is taking 4GB of that. That is certainly a factor, and I blame IBM for selling me the server with too little RAM, since they sold me the whole cluster as a complete solution.

Also, my interpretation of what this ROLLFORWARD phase is doing is that all these log files are records of every database transaction TSM has done since the last DB BACKUP, and it is rerunning all those transactions on the DB restored from tape. So when this is done, the state of my database should be close to the state it was in at the time of the crash on July 7th.
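
You can also ask DB2 itself where the rollforward stands, as the instance owner; at least this is my reading of the DB2 docs, not something support gave me:

Code:
db2 rollforward db TSMDB1 query status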

However, the IBM support person on the PMR insists that the state of the DB when the RESTORE DB finishes will be as it was on the day in May the DB BACKUP was made. I just don't believe him, because then what would be the point of doing the ROLLFORWARD over all those LOG files?

So that is my question, will my TSM DB2 database be at the May DB BACKUP state or closer to the July crash state when this process finishes?
 
I once recovered a DB2 DB that had only one DB backup on tape, and it was two years old. Yes, there were lots of log files I had to get back to the local server, but after two days the DB was restored to the current date. I would say yours will be rolling forward to July. You can check your DB backup with the db2adutl query command and see the last log sequence the backup contains. I would say it does not contain the logs from July and hence is rolling forward using the archived logs stored locally.
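
If you have not used db2adutl before, it runs as the instance owner and talks to TSM through the API; roughly like this (from memory, so run it without arguments to see the exact options):

Code:
db2adutl query full db TSMDB1
db2adutl query logs db TSMDB1
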
Can you run some nmon/topas/perfmon type command and see whether the top process is db2, and check the free memory, etc.?
I have 40GB of RAM on each TSM server, but I would not run it with under 16GB.
 
Thanks for the confirmation. Sounds like the tech I got assigned from TSM support doesn't really understand the DB RESTORE process.

I don't know how to run db2adutl, but I don't see how a tape backup from May could have the logs from July.

If I run top and sort by memory I see in the first few lines:

Code:
Tasks: 253 total,   1 running, 252 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.4%us,  1.6%sy,  0.0%ni, 88.5%id,  9.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8148480k total,  7467920k used,   680560k free,    60396k buffers
Swap: 10190840k total,  5918924k used,  4271916k free,  2817848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27603 root       0 -20 5041m 4.1g  18m S  0.0 52.7 584:08.94 mmfsd
 6017 tsminst1  18   0 4432m 1.0g 995m S  0.0 12.7 420:02.53 db2sysc
 6029 tsminst1  18   0  661m  13m  11m S  0.0  0.2   0:58.68 db2fmp
22549 root      RT   0  156m  10m 2868 S  0.0  0.1 243:19.25 multipathd
 5934 tsminst1  15   0  245m  10m  10m S  0.0  0.1  12:52.62 dsmserv
 5452 root      15   0 94984 8496 8268 S  0.0  0.1   1:12.89 db2fmcd
27295 ntp       15   0 29736 5296 4136 S  0.0  0.1   0:00.04 ntpd
 6015 root      17   0  869m 5224 5220 S  0.0  0.1   0:00.12 db2syscr
 6024 tsminst1  20   0  300m 4768 4764 S  0.0  0.1   0:00.09 db2vend

mmfsd is the GPFS daemon. There is a lot of swap in use, but I have not seen that swap usage number change in days, so it is not swap thrashing. I guess the restore process grabbed what memory it could when first run and is staying within that, even though more memory is available now.
 
The log file for DB2 is called db2diag.log and usually lives in instance_home_dir/sqllib/db2dump. You need to check that it is getting updated and that the restore is actually running without issues. It sounds like it's almost finished, since the log it's up to is almost the current sequence.
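
A quick way to watch it, assuming the default dump directory for your instance:

Code:
tail -f ~tsminst1/sqllib/db2dump/db2diag.log
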
Obviously you will now be backing up the TSM server DB at least once a day, or, like me, twice a day:
Code:
backup db ty=full devc=lto    (full backup, kept in the library; run at 12 AM every day)
backup db ty=dbs devc=lto     (snapshot, sent offsite each day; run at 12 PM every day)
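
If you want the server to run those two on its own, an administrative schedule does it; roughly like this (from memory, verify the syntax with HELP DEFINE SCHEDULE):

Code:
define schedule db_full type=administrative cmd="backup db ty=full devc=lto" active=yes starttime=00:00 period=1 perunits=days
define schedule db_snap type=administrative cmd="backup db ty=dbs devc=lto" active=yes starttime=12:00 period=1 perunits=days
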
Please let us know how your roll forward finishes and the outcome.
 
db2diag.2.log shows just lines like "MESSAGE : Started retrieve for log file S0011925.LOG" or "Completed retrieve for log file S0011925.LOG ...". So even though it is taking forever, it appears not to be reporting any errors.
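
A quick way to scan for anything worse, assuming the default dump directory (db2diag entries carry a LEVEL field):

Code:
grep -E "LEVEL: (Severe|Error)" ~tsminst1/sqllib/db2dump/db2diag*.log | tail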
 
That must be a record for long restores. Please let us know when it finishes, along with the total time, the size of your TSM DB, etc.
 
The DB RESTORE is now done. I am waiting for the IBM Support people to tell me the next steps.
The size that "db2pd -recovery -db TSMDB1" reported about 30 minutes before it finished was 326928816683 bytes, and the elapsed time was 16 days 12:25:28.
 
Doing a bunch of queries, everything in TSM seems restored to a point shortly before the crash on July 7th, not the May date of the actual BACKUP DB. So, like I thought, the rollforward logs look to have restored almost everything.

Looking for STGDELETE and STGREUSE entries in the volume history, I found the last tape marked as such just a day before the crash. Checking in the TSM database, it is definitely now a scratch tape. So it does not look like I need to audit any tapes.
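
For anyone retracing this later, those entries can also be pulled straight from the admin command line instead of reading the file, something like:

Code:
query volhistory type=stgdelete
query volhistory type=stgreuse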

I think I still need to audit my disk storage pool. I have 70 files of 500GB each in the disk storage pool. I started a disk audit with

Code:
audit volume /gpfs/nsdtsm01/BACKUPPOOLS/VOL01C FIX=yes

where VOL01C was the most recently modified storage pool file I could find, with a mod date of the day of the crash. At this point a 'q proc' shows

Code:
 Process Process Description  Status
  Number
-------- -------------------- -------------------------------------------------
       1 AUDIT LICENSE        ANR2823I Have audited 0 nodes
       4 Audit Volume         Volume /gpfs/nsdtsm01/BACKUPPOOLS/VOL01C (storage
          (Repair)             pool BACKUPPOOL), Files Processed: 2, Damaged
                               Files Deleted: 0, Partial Files Skipped: 0.
                               Current Physical File (bytes): 24,038,789

and after 12 hours, 'q proc' shows no change. It still says Files Processed: 2 and the same 24,038,789 bytes.

So I halted the server, took the box down, and installed the new memory. I now have 20GB of RAM in it. I then started dsmserv back up and reran the AUDIT VOLUME command. After several hours the output of 'q proc' is still not changing. Maybe this is normal. top shows the db2sysc process using lots of CPU and 9.9GB of virtual memory. Do you have any idea how long an AUDIT VOLUME on a disk volume normally takes? I have done AUDIT VOLUMEs on LTO tapes of over 1TB before and they only took a few hours.

Also, that AUDIT LICENSE proc has been going forever too. Is that something to worry about?

Also, 'q stg' shows 0% of my disk pool needing to migrate to tape, which I know is wrong, as before the crash over 70% needed migrating. Maybe this is because NOMIGRRECL is set in dsmserv.opt, or maybe because the disk pools need auditing.
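
If it turns out migration just needs a kick once NOMIGRRECL comes out of dsmserv.opt, my understanding is that it can be forced by hand, something like:

Code:
migrate stgpool BACKUPPOOL lowmig=0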
 
As part of a license audit operation, the server calculates, by node, the amount of backup, archive, and space-management storage in use. That is why we issue AUDIT LICENSE before we issue QUERY AUDITOCCUPANCY.
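
So the usual sequence from the admin command line is simply:

Code:
audit license
query auditoccupancy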

Good Luck,
Sias
 