Anyone use data "de-dup" technology?

Almost half the stgpool size with de-dup in 6.1.3.2

I think the native TSM 6.x de-dup works pretty well :)

One example is this VTL stgpool, where de-dup has kept 4.4 TB (49%) of the data from being stored:

Storage Pool Name: ACTIVE_VTL_DD
Storage Pool Type: Primary
Device Class Name: VTLDD
Estimated Capacity: 7,682 G
Space Trigger Util: 94.0
Pct Util: 62.0
Pct Migr: 62.0
Pct Logical: 79.0
High Mig Pct: 90
Low Mig Pct: 70
Migration Delay: 0
Migration Continue: Yes
Migration Processes: 1
Reclamation Processes: 6
Next Storage Pool: ACTIVELTO
Reclaim Storage Pool:
Maximum Size Threshold: No Limit
Access: Read/Write
Description: Dedup-pool
Overflow Location:
Cache Migrated Files?:
Collocate?: Group
Reclamation Threshold: 60
Offsite Reclamation Limit:
Maximum Scratch Volumes Allowed: 760
Number of Scratch Volumes Used: 505
Delay Period for Volume Reuse: 0 Day(s)
Migration in Progress?: No
Amount Migrated (MB): 1,483,681.47
Elapsed Migration Time (seconds): 63,407
Reclamation in Progress?: No
Last Update by (administrator): XXXXX
Last Update Date/Time: 05/21/2010 12:05:12
Storage Pool Data Format: Native
Copy Storage Pool(s):
Active Data Pool(s):
Continue Copy on Error?: Yes
CRC Data: No
Reclamation Type: Threshold
Overwrite Data when Deleted:
Deduplicate Data?: Yes
Processes For Identifying Duplicates: 2
Duplicate Data Not Stored: 4 400 G (49%)
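
The output above is just the detailed pool query, so you can watch the same numbers on your own server; the "Duplicate Data Not Stored" field at the bottom is the de-dup saving. Substitute your own pool name:

query stgpool ACTIVE_VTL_DD format=detailed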


...

We only run 2 de-dup identify processes, and we monitor the reclaim processes that free up volumes.


Process Process Description Status
Number
-------- -------------------- -------------------------------------------------
206 Identify Duplicates Storage pool: ACTIVE_VTL_DD. Volume: NONE. State:
idle. State Date/Time: 2010-06-24 08:26:52.
Current Physical File(bytes): 0. Total Files
Processed: 3388025. Total Duplicate Extents
Found: 27 705 973. Total Duplicate Bytes Found:
4 785 235 436 185.
207 Identify Duplicates Storage pool: ACTIVE_VTL_DD. Volume: NONE. State:
idle. State Date/Time: 2010-06-24 08:20:17.
Current Physical File(bytes): 0. Total Files
Processed: 598506. Total Duplicate Extents
Found: 7 594 202. Total Duplicate Bytes Found: 1
043 675 163 147.
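
If you want to set the same thing up, these are the knobs we use; a rough sketch with our pool name (your process counts, duration and threshold will differ):

/* allow 2 parallel duplicate-identification processes on the pool */
update stgpool ACTIVE_VTL_DD deduplicate=yes numprocesses=2

/* or (re)start them by hand, here for an 8-hour window */
identify duplicates ACTIVE_VTL_DD numprocess=2 duration=480

/* reclamation is what actually frees up the emptied volumes */
reclaim stgpool ACTIVE_VTL_DD threshold=60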

...

So we are happy with the TSM 6 de-dup.

...

With IBM ProtecTIER or EMC Data Domain it may be possible to get a 3-6x (300-600%) de-dup ratio, but it will require a lengthy and costly installation project, and TSM has always had a very good native VTL function anyway, so you don't need an external VTL product.

...




Kind Regards,
Nicke
 
With IBM ProtecTIER or EMC Data Domain it may be possible to get a 3-6x (300-600%) de-dup ratio, but it will require a lengthy and costly installation project, and TSM has always had a very good native VTL function anyway, so you don't need an external VTL product.

Why a lengthy project? I just want to say that when I was a DD customer it took only about a day of work. :) I guess it depends on the size of the environment and how good you are with it.

I agree that you don't want to have a VTL, though. A VTL trades physical for virtual, but all the tape management stays the same. Why would anybody want that?.. NFS will do just fine, or actually better. If it's speed you need, get 10 Gb/s Ethernet adapters. VTL is just another layer that WAS needed for integration back when TSM didn't have a well-developed FILE type device class. That's NOT the case anymore. Actually, the DD folks have been trying to talk customers out of using VTL for a while now, but some just don't want to listen. :)
 
You have to realize that if you go with NFS, the Data Domain has a limit on streams, so you had better limit the mounts correctly with the device class, just as the drives are your limiting factor with a VTL. We currently run 2 fully loaded DD880's and replicate them offsite to another pair. It works very well, and replication and restores are awesome. We see between 5-7x deduplication and no slowness with restores. Inline backup and replication is the way to go.
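
To make the stream limit concrete: on the TSM side the cap lives on the FILE device class. A rough sketch; the class/pool names, paths, mount limit and volume size are made-up placeholders you would size to your DD model's stream limit:

/* FILE device class on the Data Domain NFS export; MOUNTLIMIT caps */
/* the concurrent open volumes, i.e. your DD streams                */
define devclass DDFILE devtype=file mountlimit=30 maxcapacity=50G directory=/ddnfs/tsm1,/ddnfs/tsm2

/* sequential FILE pool on that device class */
define stgpool BACKUP_DD DDFILE maxscratch=500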
 
We run dedup. I have read about possible dedup of 500:1 and so on... Possible on the planet Pandora, but not here... Right now we have deduped 16%, and we have, in theory, "good" data for dedup. I was expecting at least 40%, so I am very disappointed.

\Masonit

A DDR can't dedupe if there aren't many dupes :) With incremental forever, that's usually the case.
 
We have NAS, tape, VTL, and SATA, and we need more TSM storage. I have been recommending we purchase an EMC EDL over DD, NAS, or SATA.

We currently have a VTL and it has been 100% rock solid from day 1. No issues of any kind, and we REALLY work it over. The compression is excellent, and the recovery of small files from multiple volumes is almost instant. In terms of recovery, the appliance-driven VTL is significantly quicker than Windows-managed SATA disk.

The CDL is also once removed from the OS system administrators. I don't like that my entire SATA file pool could be whacked by a click-happy admin.

Since our business is insurance, 90% of our data is database data that arrives in TSM pre-compressed and sometimes encrypted. Furthermore, our database retention is only 10 days. If you ask an honest Data Domain (or other dedup) engineer what the likely return on investment would be on 10-day-old, compressed, and encrypted data, they will tell you not to waste your time. The same goes for any form of software dedupe.

Appliance dedup today is filling a void that is rapidly being eliminated by encryption and software dedup. It may have some short-term rewards for some (though not without up-front cost), but it will be superseded by software technology.

Quick and dirty.... I feel the best approach is to buy large quantities of disk that can store any form of ones and zeros that you put there. I believe large quantities of inexpensive disk will be much more beneficial to a backup engineer than a medium supply of expensive deduped disk.

I believe I could run our entire backup platform with only physical LTO5 for big files and VTL devices for little files. In the backup world I feel an abundance of capacity outweighs an abundance of complexity.
 
After nearly two years.... I'd like to bring this back, as hopefully more people are familiar with dedup operations by now.

We recently shifted to occupancy-based licensing, which gives us unlimited usage of the SQL TDP. Therefore my primary objective is to leverage the TDP for SQL backups WITH CLIENT-SIDE COMPRESSION.

I have deployed the TDP and configured the dsm.opt file in the TDPSQL folder to use the dedup management classes and the DEDUPLICATION YES option. The node is configured with the client-or-server dedupe option. My question is: are there any other options that need to be configured? If not, how do I determine whether client-side dedupe is working, or how well the data is deduplicating? I can't seem to find a log that provides me with this information.

I can see that the data eventually gets deduplicated, but I'm not certain where that is taking place. Can I kill identify duplicates and still get client-side deduplication?
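
For reference, this is roughly what I have lined up, with names changed to placeholders; a sketch of my understanding, so correct me if something is missing.

The dsm.opt in the TDPSQL folder:

* client-side dedup; I've read the local dedup cache is usually disabled for API/TDP clients
DEDUPLICATION YES
ENABLEDEDUPCACHE NO

And on the server, both the node and the target FILE pool have to allow it:

update node SQL_NODE deduplication=clientorserver
update stgpool SQL_FILEPOOL deduplicate=yes

I'm guessing the "Deduplication reduction" line in the client's final backup statistics and the "Duplicate Data Not Stored" figure from query stgpool f=d are the numbers to watch?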



The biggest consumers of my storage and NIC resources are my locally stored SQL flat-file backups.
 
I have switched companies since this thread started. Now I have an opportunity to work with Data Domain and TSM. I hope to have some test hardware soon where I can run some TSM backups without any other data on the array, to get purer numbers. It looks like TSM writing to Data Domain will be in the 3:1 to 7:1 range for unstructured data. This is very similar to what I had seen previously with ProtecTIER.

I had the opportunity to attend the Pulse conference this year. I found the information from other users there on TSM deduplication to be disappointing. Several speakers talked about getting a 30%-60% reduction in their backup data footprint. Compared to TSM 5.5, they reported their DB grew by a factor of 10 or more on the servers where they achieved these numbers. One presenter said their TSM 5.5 DB was 45 GB, and once the clients were moved to TSM 6.3 using client/server deduplication, their new DB was around 600 GB.

Does this match others' experiences? It really makes me want to shy away from TSM dedup.

-Rowl
 
Well... much has changed since I started this thread... but here are my findings:

For our storage purchase we decided to buy a pure EMC CX4 SATA frame and leverage client-side dedupe and compression. We ended up moving about 1000 clients from six TSM 5.5 CDL instances to two TSM 6.2 instances using FILE device classes. On 5.5 we tried to keep our TSM database down to about 100 GB, and on 6.2 we are aiming for 200 GB. This limit was set to allow us to recover the TSM DB quickly in the event we need to. All of our TSM servers are on Windows 2008.

Initially we tested server-side and client-side deduplication, and that was fairly disappointing. I was expecting to dedupe the crap out of the OS data and the full SQL DB exports that run each night, but it simply didn't play out that way. On the bright side, all of our SATA clients are using client-side compression, and we are seeing 2:1 on unstructured data and sometimes 6:1 on SQL DB data. Be advised we saw the same level of compression with version 5.5.

Now... we do see some server-side deduplication, but the question is whether it's worth it or not. Our TSM databases reside on very expensive EMC VMAX FC storage (about $35K per TB) and our backups go to inexpensive SATA disk (about $2K per TB). Is it really worth adding the complexity and database growth of dedupe? So far it doesn't look good for dedupe for us. This is even more true when you consider how much compressed data you can store on an LTO4 tape!!!

Overall I am very happy with TSM 6.2, as it allows us to register and store tons more data per instance. We plan to consolidate about eight TSM 5.5 instances into four TSM 6.2 instances and have capacity to spare. We do not plan to leverage dedupe unless we find candidates that TRULY dedupe well, as in 10:1. So far we are happy with what compression gives us. (For example: I have a 320 GB SQL database that compresses down to 60 GB using the TDP with compression.)

My recommendation would be to leverage client-side compression wherever you can. Then test sample clients for dedupe and compare that to what you would achieve with compression alone. Lastly, we did test running compression and dedupe together, and it really didn't provide a tangible benefit.
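
In option-file terms that recommendation is only a couple of lines; a sketch of the relevant dsm.opt entries (COMPRESSALWAYS is there so objects that grow during compression are not resent uncompressed):

* client-side compression
COMPRESSION YES
* keep the compressed object even if it grows during compression
COMPRESSALWAYS YES

You can also force it from the server side with update node <nodename> compression=yes if you don't want to touch every option file.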

As for other threads on this topic.... I found quite a few but when I asked deeper questions no one was able to provide the answers I was looking for. This leads me to believe most admins really don't understand what is going on behind the scenes.
 
Currently I work in an environment with TSM 6.2.3 servers backing up to Data Domain storage for dedupe. Now, the problem with any dedupe storage solution is trying to create the offsite copy while the dedupe disk is busy compressing the data. The whole offsite tape creation process can be unbearably slow. Luckily we have a remote site that we replicate our Data Domains to, so I don't have to worry about offsite tape creation. So if you are looking into a dedupe solution, I would recommend you identify how you plan to keep data in an offsite location. Trying to generate tapes could be time- and cost-prohibitive.

As for my compression ratio: 6.2:1 (160 TB pre-compression, 25.9 TB post-compression).

That's just from one of my 5 Data Domains.
 
Hi Timgren

Deduplication works well on databases (Oracle, MS SQL).
If you are using only incremental backups of file systems, the dedup benefit is only a little bit better than compression.
Some data just isn't deduplicatable ;p

A HW black-box VTL with dedup is great, but very expensive. Data is deduplicated on disk only; on tape it is stored at full size.

So, if you are planning to use deduplication (HW or SW), think about the type of your data before buying a dedup solution.
 
Wow. I started this thread 06-26-2007. I was shocked to log in and see it still alive and well. Amazing.

Let me do a quick follow up.
We have two DD880's and some DD640's (I think that's the model). We get about 5:1 - 6:1 dedup rates with TSM 6.2.3, which is pretty typical. We're fiddling with TSM dedup but haven't really spent much time with it. Initial tests haven't been impressive enough to invest a whole lot into it. It works... but you get what you pay for.

Based on the original pitch, dedup certainly isn't the "golden child" vendors liked to claim, regardless of the creative marketing TCO they presented. We have yet to "save" anything other than sq ft in the data center. In fact, if we offloaded the entire DD880 to raw SATA drives with NO compression whatsoever, we'd still come out ahead money-wise. The "savings" come down to less spinning physical disk (I see this as a vendor benefit, not a customer one), less data center space, lower power costs, and ease of management. Actual storage costs are close to a wash against plain disk storage (i.e., 5:1 dedup rates... at 5x the total cost). If evaluated against LTO5's, tape will win every time, hands down, no contest.

The DD880 is VERY easy to use, very stable, and has performed very well for us. I have no real complaints about the product or the support. It makes managing TSM storage a breeze. This of course has value too - but it's not value that a financial auditor will ever realize.

The problem we realized early on using the DD880 for direct RMAN backups was that it presented the DBAs with the entire 70 TB filesystem and no real way to limit its use to a set quota. I don't know about your shop... but here... when DBAs see storage -- they fill it up completely! And ONLY when it's absolutely full and unusable do they figure out how to clean it out or expire old data sets. So unless the DBAs were provided with their very own DD880, direct DB backups on a system shared with TSM aren't an option. I've also found no real value in doing the recommended daily "full RMAN backups" just so the dedup ratio increases. I find that this number has NO value in a fixed-cost environment. Sending 7x the data instead of 2x the data to a dedupped storage appliance doesn't "save" anything, nor does it add any additional protection.

We're currently evaluating the Quantum DXi8500, which promises the same capacity at about 1/2 the cost, with a more reasonable annual maintenance fee AND a VM solution that looks promising. Of course... when it comes to vendor claims... I'll believe it when I see it.
 
My whole environment is de-dup. I am getting almost 70% combined compression and dedup savings. I have roughly 420 nodes. It has been working great. I had some issues; TSM 6.3.3 resolved some of them. You need to upgrade to TSM 6.3.3.100.

Thanks,
 