Compression with Data Deduplication

lightness


Hi All,

I'm exploring the benefits of compression and deduplication and was wondering if someone could help me get definite numbers. I'm running TSM server 7.1.5.0 with deduplication enabled on the server. I backed up a drive of just over 2 GB with client compression enabled. A trace shows nicely how much the data was compressed (around 50% in this case), but how do I find out how much the data was deduplicated? I know that combining client compression with server deduplication is not the best approach, and I may see less deduplication with compression enabled, but is there a way to see the numbers on this?
 

Are you running legacy pools or directory container pools?
If container pools, I'd recommend updating the server as high as you can go for a multitude of benefits.

If running legacy pools, you'll need to ensure the identify duplicates processes are running for your pools (see the example below).
If you only backed up one server, one time, to that one storage pool, you won't see much in the way of deduplication benefit yet. That comes once data starts to change and more blocks match what is already stored.
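For example, a minimal way to kick one off manually and watch it (the pool name and parameters here are just placeholders, adjust for your environment):
Code:
identify duplicates BACKUPPOOL duration=60 numprocess=2
query process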

You could get a rough idea via something like this (copied from thobias' GitHub page, https://github.com/thobiast/tsm_sql):
Code:
SELECT occ.node_name, node.domain_name, node.platform_name, CAST(FLOAT(SUM(logical_mb)) / 1024 AS DEC(8,2)) AS GB FROM occupancy occ, nodes node WHERE occ.node_name=node.node_name GROUP BY occ.node_name, node.domain_name, node.platform_name ORDER BY GB DESC
Compare the values reported there against what the client reported.
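If you want to see what the client actually sent for that comparison, the summary table is one place to look. A rough sketch (the node name is a placeholder):
Code:
select start_time, activity, entity, bytes from summary where activity='BACKUP' and entity='NODE_NAME'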

If you are using directory container pools, there's a handy GENERATE DEDUPSTATS command: https://www.ibm.com/support/knowled.../srv.reference/r_cmd_dedupstats_generate.html
Note that it can run for a really long time if you have a lot of data/clients.
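Roughly like this, assuming a directory container pool (the pool and node names are placeholders):
Code:
generate dedupstats CONTAINERPOOL NODE_NAME wait=yes
query dedupstats CONTAINERPOOL NODE_NAME format=detailed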

I will say from my own experience that legacy pools would get me at best a 2.8:1 reduction, while the directory container pools are achieving almost a 5:1 reduction.

You may also want to look at updating to the most recent 7.x code, or even jump to v8 (again, go with the latest). There have been a lot of improvements and fixes.
 
Thank you for that info. We are planning on going to the latest version of v8 in the very near future, and are currently running legacy pools.
 

Ok. So with legacy pools, identify duplicates needs to be running. I still have a mixture of old and new pools, so I find a good spot in my admin schedule to run those processes. A bit of a manual way I've done it in the past is:
Code:
select * from occupancy where node_name='NODE_NAME'
For example, one of my clients reports this:
LOGICAL_MB: 33944632.97
REPORTING_MB: 44502881.14
So you can look at what the client sent to the server, and then, after identify duplicates has run, look at LOGICAL_MB to get a rough idea of what else was trimmed off. **EDIT: LOGICAL_MB also takes into account any compression done on the client, so it won't be 100% accurate. Just saying.
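If you want the server to do the math for you, something along these lines should work as a rough sketch (NODE_NAME is a placeholder, and I'm assuming your server level has the REPORTING_MB column, as in the output above):
Code:
select node_name, stgpool_name, sum(reporting_mb) as reporting_mb, sum(logical_mb) as logical_mb, dec((1 - sum(logical_mb) / sum(reporting_mb)) * 100, 5, 2) as saved_pct from occupancy where node_name='NODE_NAME' group by node_name, stgpool_name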

If you want to look at the whole storage pool, a q stgpool POOL_NAME f=d will give you some info such as this:
Deduplication Savings: 10,947 G (17.00%)

Others might have a better way to get this information, but it works well enough for my needs.
 