Thoughts on De-dup setup for Traditional Sequential storage

Colin

Hi all,

I am currently trying to devise a way to have all my incoming data de-duped before reaching its final resting place. Unfortunately I cannot go to container based storage (with inline de-dup) because I still have some tapes involved.

My current idea is to have some initial storage pools where incoming data is held, with identify processes running against them. After a certain amount of time, the data would automatically migrate to its final storage pools (the movement of data causing the de-dup to take effect).

I was just curious if anyone else has a similar scenario and what they came up with as a hands off method of managing de-dup of incoming files.

Thanks,
 
My current idea is to have some initial storage pools where incoming data is held, with identify processes running against them. After a certain amount of time, the data would automatically migrate to its final storage pools (the movement of data causing the de-dup to take effect).
I don't follow. You can only run the identify duplicates against a dedup pool. And if you move out of a dedup pool into another dedup pool, the data gets rehydrated, then dedupped again. Best to send it directly to the final destination first. Also best to only have 1 dedup pool for best data reduction as deduplication only works within a pool, not across pools.
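The "send it directly to the final destination" setup can be sketched as a single FILE-class dedup pool with the backup copy group pointed straight at it. A minimal sketch only; the device-class, pool, and policy names (DEDUPFILE, DEDUPPOOL, STANDARD) and the directory are placeholders:

```
/* One FILE device class and one dedup-enabled sequential pool;
   IDENTIFYPROCESS starts the identify workers automatically. */
DEFINE DEVCLASS DEDUPFILE DEVTYPE=FILE MOUNTLIMIT=20 MAXCAPACITY=50G DIRECTORY=/tsm/dedup
DEFINE STGPOOL DEDUPPOOL DEDUPFILE MAXSCRATCH=500 DEDUPLICATE=YES IDENTIFYPROCESS=2

/* Point the backup copy group at the dedup pool so data never has
   to be moved into it later. */
UPDATE COPYGROUP STANDARD STANDARD STANDARD STANDARD TYPE=BACKUP DESTINATION=DEDUPPOOL
ACTIVATE POLICYSET STANDARD STANDARD
```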


Unfortunately I cannot go to container based storage (with inline de-dup) because I still have some tapes involved.
Depending on the reason, maybe that was addressed at 7.1.7 since they added enhancements in that regard: https://www.ibm.com/support/knowledgecenter/SSGSG7_7.1.7/srv.common/r_wn_tsmserver.html
 
I was under the impression that identify duplicates only marks files as duplicates. The actual de-duplication of those files only takes place the next time said data is moved, e.g. with a MOVE VOLUME command.

How would one automate this process? Presently I have to run MOVE DATA on all of my volumes in that storage pool to get the files that have been identified to actually dedup, correct?
 
I was under the impression that identify duplicates only marks files as duplicates. The actual de-duplication of those files only takes place the next time said data is moved, e.g. with a MOVE VOLUME command.
The identify duplicates process does identify duplicates, but it's only searching in the dedup pool(s).
Then there's a thread that is referencing and deleting duplicate chunks.
Now, this leaves empty space in the pool, so you run reclamation to reclaim the space. You can achieve the same thing manually, but why do manual work that can be automated.
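To keep that last step hands-off, the reclamation threshold can live on the pool definition itself. A sketch, assuming a dedup pool named DEDUPPOOL:

```
/* Reclaim any volume automatically once 60% of its space is
   reclaimable; RECLAIMPROCESS sets how many processes run. */
UPDATE STGPOOL DEDUPPOOL RECLAIM=60 RECLAIMPROCESS=2
```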

Presently I have to run MOVE DATA on all of my volumes in that storage pool to get the files that have been identified to actually dedup, correct?
No.

How would one automate this process?
Send backups directly to the dedup pool, and do client-side dedup where you can to save work on the server.
Do your daily BACKUP STGPOOL.
Do daily expiration and reclamation.
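Those daily steps can be wired up as administrative schedules so nothing needs to be kicked off by hand. A sketch, with hypothetical node, pool, and schedule names, and start times to taste:

```
/* Enable client-side dedup per node (the client also needs
   "deduplication yes" in its options file). */
UPDATE NODE MYNODE DEDUPLICATION=CLIENTORSERVER

/* Daily copy to the copy pool, then inventory expiration.
   Reclamation is driven by the pool's RECLAIM threshold. */
DEFINE SCHEDULE BACKUPSTG TYPE=ADMINISTRATIVE CMD="BACKUP STGPOOL DEDUPPOOL COPYPOOL" ACTIVE=YES STARTTIME=06:00 PERIOD=1 PERUNITS=DAYS
DEFINE SCHEDULE EXPIREINV TYPE=ADMINISTRATIVE CMD="EXPIRE INVENTORY" ACTIVE=YES STARTTIME=10:00 PERIOD=1 PERUNITS=DAYS
```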
 
Alright, so just to verify: if I have two identify processes running against a stgpool that has de-duplication enabled, they will mark files for de-duplication, and then a second hidden thread actually deletes that duplicate data. Then space reclamation happens to clean up the big empty chunks left over. (Correct me if I am wrong anywhere.)

If this is correct, is there a way to track the hidden thread's progress / activity?

Sorry in advance for my noobness.

Also when you say do daily backup of the stgpool are you referring to like a copy pool or something?
 
If this is correct, is there a way to track the hidden thread's progress / activity?
You can use SHOW DEDUPDELETEINFO. The output is somewhat self-explanatory.
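If you want to watch it alongside the identify workers, both can be checked from the same administrative session. A sketch:

```
/* Undocumented diagnostic: status of the duplicate-chunk deletion thread. */
SHOW DEDUPDELETEINFO

/* The IDENTIFY DUPLICATES processes show up here with bytes processed. */
QUERY PROCESS
```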

Also when you say do daily backup of the stgpool are you referring to like a copy pool or something?
Yes, the backup stgpool process copies the data from the primary to the copy pool.
 
Awesome, thanks for all this input. I greatly appreciate it. Helped a lot.
 
Depending on the reason, maybe that was addressed at 7.1.7 since they added enhancements in that regard: https://www.ibm.com/support/knowledgecenter/SSGSG7_7.1.7/srv.common/r_wn_tsmserver.html

The new container-copy storage pools have painful limitations...
1. You cannot restore files directly from container-copy storage pools (i.e. restoring from the tapes in a DR).
2. They should not be used for DR due to the length of time required to restore the container pool from the container-copy pool.

So it's basically there for repairing small amounts of data in a container pool. I think TSM v8 is coming out somewhat soon, and will have some new functionality that'll potentially kill off tape requirements altogether for a lot of businesses.
 
The new container-copy storage pools have painful limitations...
1. You cannot restore files directly from container-copy storage pools (i.e. restoring from the tapes in a DR).
2. They should not be used for DR due to the length of time required to restore the container pool from the container-copy pool.

So it's basically there for repairing small amounts of data in a container pool. I think TSM v8 is coming out somewhat soon, and will have some new functionality that'll potentially kill off tape requirements altogether for a lot of businesses.


Those 'limitations' are only with tape correct? If I am fully disk those are not an issue? (Pushing for a full move to disk.)
 