paulraines
ADSM.ORG Member
- Joined
- Aug 12, 2013
- Messages
- 11
- Reaction score
- 0
- Points
- 0
I have a large 1.5 petabyte storage cluster consisting of IBM DDN 9900s connected by dual fibre channel to nine blades in an IBM BladeCenter running RHEL5 and GPFS 3.3. One of these blades is my TSM 6.2.3 server, connected to an IBM TS3584 library with 4 LTO-5 drives and over 1400 LTO tapes.
There is about 700TB right now on our GPFS filesystems that we back up with TSM, corresponding to several hundred million files. An incremental backup pass through all of this now takes over a month (independent of how much new data there actually is to back up), which most experts think is simply how slow GPFS metadata scans are on this hardware.
First, a bit about how TSM is installed on the server. The software (/opt/tivoli) and the instance user's home (~tsminst1) are on the server's internal two-disk RAID1, on ext3 filesystems. The TSM DB and logs are on /tsmdb, an ext3 filesystem on a LUN from the DDN SAN. The TSM disk pools are on a GPFS filesystem built from several LUNs on the DDN SAN.
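To summarize the layout (my own summary, not output from any tool):
Code:
/opt/tivoli, ~tsminst1   -> internal two-disk RAID1, ext3
/tsmdb (DB + logs)       -> one LUN on the DDN SAN, ext3
TSM disk pools           -> GPFS filesystem over several DDN SAN LUNs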
On July 7th one of the internal disks of the TSM server failed. It was part of a two-disk RAID1 on the internal LSI-based RAID controller of the HS22 blade. For reasons no one can explain, the RAID failed, the system crashed, and the root filesystem got corrupted. After an fsck the system rebooted fine, but when I tried to get the TSM dsmserv to run again, it failed claiming database errors.
At this point I opened a PMR with Tivoli support. It took ten days of back and forth for them to finally decide that my tablespaces were hopelessly corrupted, that nothing could be done to fix them directly, and that a TSM RESTORE DB was the only way to proceed. I still don't understand how tables in files that were sitting on the DDN SAN, and not on the local server disk that failed, got corrupted, but whatever.
So I started the DB RESTORE (no point in time specified) from the last DB BACKUP, taken in May (remember, each backup pass takes over a month, and I thought I could not run DB BACKUPs while one was in progress). I could see the dsmserv process mount the tape with the last DB BACKUP and eject it about 8 hours later. It has now been in the ROLLFORWARD phase for about two weeks. The latest status check today says:
Code:
$ db2pd -recovery -db TSMDB1
Database Partition 0 -- Database TSMDB1 -- Active -- Up 13 days 16:03:54 -- Date 08/12/2013 16:55:43
Recovery:
Recovery Status 0x04000401
Current Log S0011867.LOG
Current LSN 000005CB7887C494
Job Type ROLLFORWARD RECOVERY
Job ID 2
Job Start Time (1375159911) Tue Jul 30 00:51:51 2013
Job Description Database Rollforward Recovery
Invoker Type User
Total Phases 2
Current Phase 1
Progress:
Address             PhaseNum  Description  StartTime            CompletedWork       TotalWork
0x0000000200D8EDE8  1         Forward      Tue Jul 30 00:51:51  279241218495 bytes  Unknown
0x0000000200D8EF70  2         Backward     NotStarted           0 bytes             Unknown
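For the record, the restore itself was kicked off the standard way with the server halted; roughly this (from memory, so the exact invocation may have differed slightly):
Code:
# run as the instance user; with no TODATE/TOTIME given, dsmserv
# restores the May backup image and then rolls forward through
# every archive log it can find
$ dsmserv restore db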
When I look in the directory
/tsmdb/TSM_logs/Archivelog/archmeth1/tsminst1/TSMDB1/NODE0000/C0000000
I see over 306GB of log files: more than 600 of them, most 500MB in size.
The newest files in the directory are:
Code:
-rw-r----- 1 tsminst1 tsmsrvrs 4308992 Jul 24 08:33 S0011961.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 458752 Jul 23 09:51 S0011960.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 3571712 Jul 8 16:11 S0011959.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 3661824 Jul 8 00:44 S0011958.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 124985344 Jul 8 00:38 S0011957.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul 7 07:29 S0011956.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul 6 21:58 S0011955.LOG
-rw-r----- 1 tsminst1 tsmsrvrs 536879104 Jul 6 14:06 S0011954.LOG
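(Those totals came from nothing fancier than counting and sizing the files in that directory, roughly:)
Code:
$ cd /tsmdb/TSM_logs/Archivelog/archmeth1/tsminst1/TSMDB1/NODE0000/C0000000
$ ls | wc -l    # number of archive logs: 600+
$ du -sh .      # total size: over 306GB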
So it looks like the ROLLFORWARD has about 100 LOG files to go, which at the rate I see it going should take another 3-4 days. BTW, the oldest LOG file in that directory corresponds to the May date of my BACKUP DB.
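(The "about 100" is just the gap between the log db2pd says it is currently on and the newest log on disk:)
Code:
# current log per db2pd is S0011867; newest on disk is S0011961
$ echo $(( 11961 - 11867 ))
94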
As to why things have gone so slowly, the IBM support people blame the fact that the server has only 8GB of RAM, with GPFS taking 4GB of that. That is certainly a factor, and I blame IBM for selling me the server with too little RAM, since they sold me the whole cluster as a complete solution.
My interpretation of what this ROLLFORWARD phase is doing is that these log files are records of all the database transactions TSM has performed since the last DB BACKUP, and that it is rerunning all those transactions against the DB restored from tape, so that when it is done the state of my database will be close to the state it was in at the time of the crash on July 7th.
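If my reading is right, what dsmserv is driving under the covers is the ordinary DB2 restore-and-rollforward sequence, something like the sketch below (not commands I actually ran, and the backup timestamp is made up for illustration):
Code:
# 1. lay down the May backup image (timestamp is hypothetical)
db2 restore db TSMDB1 taken at 20130515120000
# 2. replay every archived transaction log up to the last one available,
#    which is what brings the DB forward from May toward July 7th
db2 rollforward db TSMDB1 to end of logs and stop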
However, the IBM support person on the PMR insists that the state of the DB when the RESTORE DB finishes will be as it was on the day in May the DB BACKUP was made. I just don't believe him, because if that were true, what would be the point of doing the ROLLFORWARD over all those LOG files?
So that is my question: will my TSM DB2 database be at the May DB BACKUP state, or closer to the July crash state, when this process finishes?