Best Practice for Maintenance Scheduling

HardeepSingh

Hi,

We have a new setup with TSM 7.1 in place.
The old environment used 8 TSM servers on 5.5.4. The entire backup client inventory has now been moved to 3 TSM servers with better hardware and version 7.1.
Things were working fine, but since we completed the migration of the entire 2,500-client inventory to the new environment, the maintenance schedules have been taking far too long to complete. On one server in particular, BACKUP STGPOOL takes over 3 days to finish, which isn't healthy.
This is all the more alarming because the admin scripts are designed to run serially: BACKUP STGPOOL runs in parallel for all storage pools, but BACKUP DB waits until BACKUP STGPOOL is complete, and so on.
The result is that no BACKUP DB, no DRM, no migration, no expiration and hence no reclamation runs for 3-4 days straight.
What are the possible options to speed this process up?

Here are the specifics of the storage pools:

Storage Pool Name       Device Class  Estimated Capacity  Pct Util  Pct Migr  High Mig %  Low Mig %  Next Storage Pool
----------------------  ------------  ------------------  --------  --------  ----------  ---------  -----------------
3592TAPE_PROD           3592TAPE          38,664,269 G        0.3       1.4        90         70
3592TAPE_PROD_COPY      3592TAPE          53,086,120 G        0.2
3592TAPE_PROD_ORA       3592TAPE          46,818,929 G        0.3       1.5        90         70
3592TAPE_PROD_ORA_COPY  3592TAPE          34,468,716 G        0.3
ARCHIVEPOOL             DISK                     0.0 M        0.0       0.0        90         70
BACKUPPOOL              DISK                     0.0 M        0.0       0.0        90         70
DISK_PROD               DISK                     6,720 G     55.3      55.0        90         50     3592TAPE_PROD
DISK_PROD_ORA           DISK                     6,720 G     51.6      50.9        90         70     3592TAPE_PROD_ORA

I have uploaded this in a txt file for better viewing.

The BACKUP STGPOOL sequence we were using was as follows:

backup stgpool 3592tape_prod_ora 3592tape_prod_ora_copy maxprocess=3
backup stgpool disk_prod 3592tape_prod_copy maxprocess=1 wait=yes
backup stgpool disk_prod_ora 3592tape_prod_ora_copy maxprocess=1 wait=yes
backup stgpool 3592tape_prod 3592tape_prod_copy maxprocess=1 wait=yes


We've changed the order now, but it didn't seem to help much:

backup stgpool disk_prod 3592tape_prod_copy maxprocess=1
backup stgpool 3592tape_prod 3592tape_prod_copy maxprocess=1 wait=yes
backup stgpool disk_prod_ora 3592tape_prod_ora_copy maxprocess=2 wait=yes
backup stgpool 3592tape_prod_ora 3592tape_prod_ora_copy maxprocess=2 wait=yes



Any suggestions?
Should we try changing the order of the other admin processes?
The scheduler starts at 04:00 sharp, server time. It starts with BACKUP STGPOOL, irrespective of any BACKUP STGPOOL still running from before. Once that is complete, it moves on to the next script to run BACKUP DB. This script does a precheck for a running BACKUP STGPOOL process and, when it finds one, keeps rescheduling its own execution:

select process,process_num from processes where process='Backup Storage Pool'
if (rc_ok) goto reschedule
backup db devclass=3592tape type=dbsnapshot wait=yes
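
A fuller sketch of that precheck as a TSM server script might look like the following (the label name and the message text are illustrative only; the actual re-run is handled by the admin schedule described above):

/* Sketch only: run BACKUP DB once no BACKUP STGPOOL process is active */
select process_num from processes where process='Backup Storage Pool'
if (rc_ok) goto resched
backup db devclass=3592tape type=dbsnapshot wait=yes
exit
/* a BACKUP STGPOOL process was found, so skip BACKUP DB for now */
resched: issue message i "BACKUP STGPOOL still running - BACKUP DB deferred to the next schedule window"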
 

Attachments

  • stgpool.txt
You are in a bad spot. If you can't address it, it will snowball, because each new maintenance cycle will take longer as it has to catch up on the backlog created by the previous one. Any time the complete backup cycle (client backup + maintenance) takes more than 24 hours it's cause for alarm, let alone when it drags on for multiple days.

There are a few possibilities. Your servers may be running slowly because of their configuration; check all the items found here:
http://www-01.ibm.com/support/knowledgecenter/SSGSG7_7.1.1/com.ibm.itsm.perf.doc/t_srv_hw_check.html

The other option is that you may need more tape drives in order to process the amount of data ingested daily in a timely manner.
 
You say it takes 3 days to back up a storage pool, but what is the actual throughput in MB/s?
Is that the number you'd expect from a 3592 tape drive or not? If not, then maybe you need to check for hardware issues.
 
I may have missed an important detail. This TSM server in particular gets a huge load of incremental data.
It receives over 10-12 TB of data from Oracle servers into its storage pools daily. Data from the other (non-Oracle) production servers amounts to 1-2 TB on an average day.
Should we be distributing the Oracle data amongst the other TSM servers?
The other TSM servers are facing the same issue too, but the problem isn't as serious as on this one.
 
I'll go through your suggested links, Marclant. Thank you.
The tape drive count is 50. Even with all client sessions and overdue admin processes in progress, we've never hit 100% tape drive utilization so far.
We don't have LAN-free, barring just one Oracle node, so almost all client data goes to the disk pool first. The disk pool reaches its high-migration threshold quickly, say within 18-20 hours, and migration of the disk pool is triggered even with BACKUP STGPOOL still in progress. That greatly degrades performance.
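
One common way to keep automatic migration from kicking in mid-copy is to raise the high-migration threshold while the disk pools are being copied and restore it afterwards. A rough sketch only, using the pool names and thresholds from the q stgpool output above (the maxprocess values are placeholders):

update stgpool disk_prod highmig=100
update stgpool disk_prod_ora highmig=100
backup stgpool disk_prod 3592tape_prod_copy maxprocess=4 wait=yes
backup stgpool disk_prod_ora 3592tape_prod_ora_copy maxprocess=4 wait=yes
update stgpool disk_prod highmig=90 lowmig=50
update stgpool disk_prod_ora highmig=90 lowmig=70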
 
Thank you for your note, Chris.
I think the throughput is very relative, owing to the server activities in progress. I'll revisit the performance logs to validate the average throughput. What should the ideal figure be?
 
My two cents:

Since you have consolidated from 8 to 3 TSM servers, have you thought of providing multiple, bigger disk pools (and, to go a step further, multiple copy pools) split by backup type?

Over the years I have posted setup solutions in this forum that I believe have helped me escape this problem: separate and bigger multiple primary disk pools, separate copy pools, tape drives dedicated to a primary or copy pool, etc.
 
Another point I forgot to mention:

Have you placed a maximum of two tape drives per Fibre Channel port? If you have more than two, the drives will race each other for bandwidth and you will definitely see low tape utilization numbers.

It is hard to set up a Windows box (and I hope you are not running on Windows!) with enough PCI slots to accommodate many Fibre Channel HBA cards. This is why, for a consolidated environment like yours, I prefer AIX boxes.
 

There are lots of variables (e.g. do you have TS1120, 30, 40 or 50?), but as a ballpark figure, if you have late-model drives on a fast SAN with a maximum of 2 drives per HBA, as Ed suggests, I would expect to see at least 100 MB/s average throughput per drive in TSM.

Your 12 TB = 12,582,912 MB. At 100 MB/s that should take 125,829 seconds, or about 35 hours using 1 tape drive, or 17.5 hours using 2 tape drives, and so on.

This query may help:
select process_num as "Number", process, current_timestamp as "Current time", start_time, (cast(files_processed as decimal(18,0))) as "Files", (cast(bytes_processed as decimal(18,0)))/1024/1024 as "MB", timestampdiff(2,char(current_timestamp-start_time)) as "Elapsed Time", (cast(files_processed as decimal(18,0))/timestampdiff(2,char(current_timestamp-start_time))) as "Files/second", (cast(bytes_processed as decimal(18,0))/1024/1024/timestampdiff(2,char(current_timestamp-start_time))) as "MBytes/second" from processes order by process_num asc
This whitepaper may help:
http://www-03.ibm.com/support/techd...8015/$FILE/TS1150_Performance_White_Paper.pdf
 


Thanks Chris & Ed,

We have an IBM TS3500 with IBM 3592 J1A tape drives, running lin_tape version 2.9.8.
The servers run RHEL 6 Linux, with library sharing enabled.

The query gave me the following output (current time was 2015-07-09 21:40:04 for all rows):

Number  Process              Start time           Files   MB          Elapsed (s)  Files/s  MB/s
------  -------------------  -------------------  ------  ----------  -----------  -------  ------
359     Backup Storage Pool  2015-07-08 11:23:55     753  13,187,925      123,369   0.0061   106.9
360     Backup Storage Pool  2015-07-08 11:23:56     960  13,596,581      123,368   0.0078   110.2
361     Backup Storage Pool  2015-07-08 11:23:56    1163  13,462,155      123,368   0.0094   109.1
369     Migration            2015-07-09 20:39:49      31     141,295        3,615   0.0086    39.1
370     Migration            2015-07-09 20:39:49    3384     301,735        3,615   0.9361    83.5
371     Migration            2015-07-09 20:39:49    1543     338,506        3,615   0.4268    93.6
 
The backlog is making the process stretch on for far too long. This is the status from last night:
 

Attachments

  • Capture.PNG

To answer the question about the FC setup:
All drives are connected to all 3 TSM servers; on each TSM server, two FC ports are connected to 2 backup VSANs (tape drives).
 
Considering you have 10-year-old tape drives, your performance is not bad. You're close to my estimated 100 MB/s per drive.

I'd recommend you greatly increase the maximum number of processes when you run BACKUP STGPOOL on your disk pools. BACKUP STGPOOL from tape to tape uses twice as many tape drives as disk to tape. If you make sure everything in the disk pool has been copied to the copy storage pool *BEFORE* you migrate the disk pool, then there should be almost no data left to copy from tape to tape. Try MAXPRocess=10 or more, depending on how many tape drives you can spare.
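
As a rough sketch of that ordering, using the pool names from earlier in the thread (the maxprocess values are placeholders to tune against the drives you can spare):

backup stgpool disk_prod 3592tape_prod_copy maxprocess=10 wait=yes
backup stgpool disk_prod_ora 3592tape_prod_ora_copy maxprocess=10 wait=yes
/* only once the disk pools are protected, drain them to tape */
migrate stgpool disk_prod lowmig=0 wait=yes
migrate stgpool disk_prod_ora lowmig=0 wait=yes
/* the tape-to-tape copies should now find very little new data */
backup stgpool 3592tape_prod 3592tape_prod_copy maxprocess=3 wait=yes
backup stgpool 3592tape_prod_ora 3592tape_prod_ora_copy maxprocess=3 wait=yes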
 

If I understood your statement correctly, you have a zoned switch between the TSM servers and the tape drives, with only 2 fibre connections from each TSM server.

My honest opinion: this is not sufficient.

In an old environment that I supported, an old P520 P-Series box was connected to a 7-drive TS3500 (essentially a beefed-up 3584 with GEN 5 3592 tape drives and an M/F share), and I had 4 fibre connections to the switch. That gave me a 1-to-2 ratio. The node count was 650, with daily Oracle, SQL and Exchange backups of 6 TB (the other, non-DB backups amounted to something like 3 TB). Start to end, the daily admin cycle finished within 2.5 to 3 hours.

You also have library sharing enabled, which makes me believe that any two TSM servers (three would be worse) may be competing for resources at any given point in time. If you have logically assigned tape drives to each TSM server, then this should not be an issue.
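
One way to enforce such a logical split on each library client is the device-class mount limit. A sketch only, assuming the device class is the 3592TAPE class from the q stgpool output and that roughly a third of the 50 drives goes to each server:

/* run on each library client; 16 is only an example figure */
update devclass 3592tape mountlimit=16
/* verify what each server is actually mounting */
query mount
query drive format=detailed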
 
If you have sufficient tape drives available, you can try migrating from the disk storage pools and creating a copy at the same time:
upd stg ... copystgpools=
Migration will consume twice as many drives, but the data will be written to the primary and copy storage pools in one shot.
Just keep in mind that this approach doesn't replace BACKUP STGPOOL, which should still be scheduled.
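
A sketch of what that could look like for the pools in this thread. If I remember correctly, simultaneous write during migration is controlled on the migration target pool via COPYSTGPOOLS and AUTOCOPY, so verify against the 7.1 documentation before using it:

/* simultaneous write settings on the migration target pools (sketch) */
update stgpool 3592tape_prod copystgpools=3592tape_prod_copy autocopy=all
update stgpool 3592tape_prod_ora copystgpools=3592tape_prod_ora_copy autocopy=all
/* drain the disk pools; primary and copy tapes are written together */
migrate stgpool disk_prod lowmig=0 wait=yes
migrate stgpool disk_prod_ora lowmig=0 wait=yes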
 