Expiration Woes

jgraul

We seem to have the slowest big ass TSM on the planet. Why in the world would a 32-core Windows 2008 R2 server with 64 GB of RAM, and the DB running on SSDs (4 SSDs configured as RAID 5), be so slow? Please, no Windows jokes; we are having problems completing our daily jobs.

We have about 530 nodes in our enterprise, almost entirely Windows servers. We are now expiring about 9 million files a day, but we never get more than 3.5 million objects expired per hour. Over the last six days, expiration time has increased daily: day 1 took 162 minutes, day 2 took 196, day 3 217, day 4 252, day 5 274, and day 6 374. As I write this on day seven, it's still running after 300 minutes and has only deleted 5.5 million objects.
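A quick back-of-the-envelope sketch of those runtimes (this assumes each run expired roughly the stated ~9 million objects, which is an approximation) shows the throughput falling steadily:

```python
# Throughput trend for the six completed expiration runs.
# Assumption: each day's run expired roughly 9 million objects.
minutes_per_run = [162, 196, 217, 252, 274, 374]
objects_per_day = 9_000_000

for day, m in enumerate(minutes_per_run, start=1):
    per_hour = objects_per_day / (m / 60)
    print(f"day {day}: {m} min -> {per_hour / 1e6:.2f}M objects/hour")
```

By day 6 the effective rate has dropped from about 3.3M to about 1.4M objects per hour, under that assumption.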

Has anyone experienced anything like this?
 
Hi,

Just a few questions/pointers:
What is your TSM version? Have you tried running multiple expiration processes in parallel (expire inventory resource=X)? Are the nodes roughly equal in the number of files expired, or is the expiration of a single node/filespace the problem?
Anything in the db2diag.log? Do you use deduplication?

Harry
 
We are on TSM 6.3.3.0 and we use resource=8. We don't use deduplication because we have a Data Domain as our primary storage; it has built-in deduplication. The Windows 2008 servers definitely have a lot more system state files than Windows 2003, but I don't think that would explain the 2x increase in expiration time over 6 days, given that the number of files examined and processed differs by less than 500K.
 
One question:

Is compression turned ON for the clients?

I have a much bigger number of nodes than you (> 530) and expiration runs fast. We are on Linux and TSM 6.1.5, and we use a Data Domain with devclass=file for primary backup.
 
We don't have compression turned on.
 
We have 10Gb nics on the TSM server and the DD, which is a DD670.
 
I wouldn't think expiration would have a dependency on the stg pool device, but rather the TSM DB which is on SSD disks.
 
I wouldn't think expiration would have a dependency on the stg pool device, but rather the TSM DB which is on SSD disks.

True but I have seen weird things on Windows.

I also want to get an understanding of what your setup looks like.
 
Here is one thing to check.

Do you have the latest drivers for the SSD, or the RAID card (I am assuming hardware RAID)?

Have you done an off-TSM test with the SSD environment? Are the results as fast as expected?
 
If you run expiration like this:

expire inventory node=a,b,c resource=3

Would this be faster than:

expire inventory node=a,b,c (resource defaults to 1)

I am looking at the possibility of a bug or flaw in 6.3.3 assuming all of your hardware checks out.
 
We normally run 'expire inventory skipdirs=no resource=8'. Earlier in the year we experimented with the resource setting and found that we could expire about 8.5 million objects in 140 or so minutes pretty consistently. We really didn't see any improvement with higher resource values.
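For comparison, that earlier baseline of ~8.5 million objects in ~140 minutes works out to roughly the 3.5M-objects-per-hour ceiling mentioned at the top of the thread (a rough sketch, using the approximate figures quoted):

```python
# Baseline rate from earlier in the year: ~8.5M objects in ~140 minutes.
objects = 8_500_000
minutes = 140
rate_per_hour = objects / (minutes / 60)
print(f"{rate_per_hour / 1e6:.2f}M objects/hour")  # ~3.64M objects/hour
```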

Lately we haven't been able to get consistent results. I rebooted the server yesterday after expiration completed. Very interested to see how it runs today.
 
So the SSD disks are internal (not SAN)? If so, what make and model server are they in? Just wondering if it's an internal PCI bus bottleneck you're running into.
 
It's an IBM x3650 M4. Internally we have 2 SAS mirrored drives for the OS, swap, and applications, 2 SAS mirrored drives for the archive logs, and 4 SSD drives in RAID 5 for the database and active logs. We have an IBM ServeRAID M5110e controller installed.
 
So yesterday it took 398 minutes to process 8,857,527 objects. It had processed over 7 million after 35 minutes. What the heck is it doing the rest of the time?

I should point out that this is actually a 16 core system. The 32 comes from hyperthreading...
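The run is heavily front-loaded. A quick sketch using the numbers above (taking "over 7 million" as approximately 7 million) shows how far the rate collapses in the tail:

```python
# Front-loaded expiration run: fast first phase, then a slow tail.
total_objects = 8_857_527
total_minutes = 398
fast_objects = 7_000_000  # assumption: "over 7 million" taken as ~7M
fast_minutes = 35

tail_objects = total_objects - fast_objects  # ~1.86M objects left
tail_minutes = total_minutes - fast_minutes  # the remaining 363 minutes

print(f"first phase: {fast_objects / fast_minutes:,.0f} objects/min")
print(f"tail phase:  {tail_objects / tail_minutes:,.0f} objects/min")
```

Under those assumptions the rate drops from about 200,000 objects/min to roughly 5,000 objects/min, a ~40x slowdown for the last ~20% of the objects.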
 
So yesterday it took 398 minutes to process 8,857,527 objects. It had processed over 7 million after 35 minutes. What the heck is it doing the rest of the time?

I should point out that this is actually a 16 core system. The 32 comes from hyperthreading...

I am making a guess here.

Since we don't know the ins and outs of how TSM interfaces with the new DB2 environment, I am guessing that there must be some reorganization of the TSM metadata, or TSM is 'looking' for the other 1.8+ million objects to expire.
 
I'm starting to think it may be something to do with expiring Windows 2008 and Windows 7 system state files. They are taking a really long time. One Windows 7 PC's system state took over 90 minutes to expire.
 