Data dedupe and offsite copy reclamation problem

francs

ADSM.ORG Member
Hi

This is more of a "heads up" than a request for a solution, as IBM has looked into my problem extensively and replied that "everything works as designed and expected." (See the full reply below.)

Background:
We have disk-based/file pools for most of our backups, and those pools are deduplicated.
Copies of these are then made to LTO4 tape and sent offsite daily.

The problem:
Space reclamation of the offsite copy pool is VERY slow (tape-to-tape reclamation is quick).
In a 5-hour period only about 60-90 GB of data gets reclaimed.
The reclamation threshold used to be set to 70%, and reclamation would finish within the 5-hour window we had.
Now only tapes at a 95% reclamation threshold finish in that window.
This has of course wreaked havoc with the tape cycles.
We had to almost double the number of tapes in the offsite copy pool to have enough vault retrieves coming back for the next cycle of copies.


Below is IBM's response after weeks of troubleshooting:
The development team reviewed all server performance traces we gathered in May, ran tests on their own machines, and reviewed the offsite reclamation code.
From the TSM development point of view there is nothing else they can do, as no defect was found and everything works as designed and expected.
 
Curiosity --- On what type of storage are your disk-based/file pools? We have a NetApp device, and moving data from one location on the NetApp to another is very slow.
 
Our dedupe pools are all on a DS3400 with SATA disks.
The TSM DB is on a DS4700 with Fibre Channel disks.
When you move data from a dedupe pool to a non-dedupe pool, TSM has to reassemble/rebuild each file, and that's where the bottleneck is. If you watch what TSM does during the copy, you'll see a lot of TSM DB activity while it determines where the pieces are, and then some activity on the disk pool while it copies the file.
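
To picture what that looks like, here's a toy model of rehydration (my own illustration, nothing to do with TSM's real code): every file is just an ordered list of chunk references in the DB, so rebuilding it costs one index lookup plus one random read per chunk.

# Toy model of rehydrating a deduplicated file (illustration only).
# chunk_index stands in for the TSM DB: it maps a file to an ordered
# list of (offset, size) chunk locations, and "pool" stands in for a
# FILE volume holding chunks from many files. Note the repeated chunk:
# dedup stores it once, so the file references the same offset twice.
import io

pool = io.BytesIO(b"CCCCAAAABBBBDDDD")          # shared chunk store

chunk_index = {
    "file1": [(4, 4), (8, 4), (4, 4)],          # AAAA + BBBB + AAAA
}

def rehydrate(file_id):
    data = bytearray()
    for offset, size in chunk_index[file_id]:   # one DB lookup per chunk
        pool.seek(offset)                       # one random read per chunk
        data.extend(pool.read(size))
    return bytes(data)

print(rehydrate("file1"))                       # b'AAAABBBBAAAA'

Scale that up to millions of chunks and the random I/O against the DB and the SATA pool is exactly what you sit waiting on.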
 
Same here.

I'm getting similar numbers... yesterday's reclamation: 120 GB reclaimed in 4 hours on the following hardware:

TSM 6.2.3 on RHEL 5.6
16 GB RAM
DB2 on 300 GB 15k SAS disks
2 TB 7.2k SATA disks for storage pools
offsite copy storage pool in an LTO-5 library with 2 drives
active-data storage pool as virtual volumes on a remote server

That's about 9 MB/s (120 GB over 4 hours works out to roughly 8.5 MB/s)... not exactly what I expected from $50k worth of hardware.

I've found that any data movement from a deduplicated storage pool to a non-deduplicated one is ridiculously slow, and it hogs the database disks with 8 KB random reads. Generating backup sets and exporting data are just as slow.
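
If you want to watch it happen, a trivial wrapper around iostat during a MOVE DATA or reclamation shows the pattern (the device names are examples; substitute whatever your DB2 filesystems actually live on):

# Print only the DB disks from iostat's extended stats so the
# small-random-read pattern is easy to spot during reclamation.
# DB_DEVICES is an example -- use your own device names.
import subprocess

DB_DEVICES = ("sdb", "sdc")      # e.g. the 15k SAS disks holding DB2

proc = subprocess.Popen(["iostat", "-x", "5"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    if line.startswith("Device") or line.startswith(DB_DEVICES):
        print(line, end="")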

Maybe putting the DB on fast mirrored SSDs would make deduplication more usable. I wish I had anticipated this... we're now using about twice as many tapes as I thought we would.

Here's what I did to make offsite reclamation somewhat manageable: a script starts reclamation with threshold 99 and offsitereclaimlimit=1, waits until it finishes, then starts a new reclaim process with threshold 98, and so on... that way only one tape at a time is reclaimed. If you simply specify a threshold of 50 and more than one volume satisfies that criterion, it may well happen that none of them gets fully reclaimed.
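
The idea as a sketch (the pool name and admin credentials are placeholders for your environment; I'm driving dsmadmc from Python here, but a plain server script works just as well):

# Reclaim the offsite copy pool one volume at a time (sketch; pool name
# and credentials below are placeholders). OFFSITERECLAIMLIMIT=1 caps
# each pass at a single offsite volume, and WAIT=YES makes dsmadmc block
# until the reclaim process ends, so each threshold step only starts
# after the previous one has finished.
import subprocess

POOL = "OFFSITE_COPY"   # your copy storage pool
ADMIN = ["dsmadmc", "-id=admin", "-password=secret", "-noconfirm"]

def tsm(command):
    subprocess.run(ADMIN + [command], check=True)

tsm(f"update stgpool {POOL} offsitereclaimlimit=1")
for threshold in range(99, 49, -1):    # 99, 98, ... down to 50
    tsm(f"reclaim stgpool {POOL} threshold={threshold} wait=yes")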

Also, upgrade to 6.2.3 and use the new server options to disable DB2 reorgs while the reclamation process runs. This helped a bit too. I actually disabled reorgs altogether and do them manually while the server is stopped. Yeah, nuke the site from orbit... it's the only way to be sure :)

In conclusion, I wouldn't recommend mixing deduplicated storage pools with non-deduplicated ones such as tape, unless you're working with small data sets (a few TB) or you have some REALLY fancy hardware (multiple SSDs for the DB? RAM larger than the DB? RAMSAN?).

I'm still hoping that IBM forgot to put a proper index on some table (or put too many) and that they'll fix this in some future fixpack. Until then... I'm considering turning off deduplication and using tapes as a second-tier primary storage.

P.S.
Another disappointment for deduplication: it turns out that deduplication of virtual volumes works nowhere near as well as deduplication of files... my primary storage pool saved about 60% of its space, while on the remote virtual volumes I only saved about 10-15%. I'm guessing this is because files end up at different offsets when stored in virtual volumes... dedup didn't pick up much of what could have been deduplicated. I had to give up on a remote copy pool, and now I'm using an active-data pool with a grace period of 7 days... so it's like a copy pool with a different retention. Meh.

P.P.S.

...everything works as designed and expected.

Um... "In retrospect, our design sucks a bit. Sorry about that." ? :D
 

Had the same issue. Here is IBM support's statement about this:

For a normal backup, when the client files are the same, the same data is stored on the server, so we see a high dedup ratio. But a virtual volume is actually stored as an archive file on the target server, so that archive file contains all the data for a volume. And a volume holds not just the client files; there is also a lot of supporting data, like frame headers, BackInsNorm verbs, and the like, which makes the volume data different each time, even when we back up the same client files. Furthermore, the client files are broken into pieces when stored, so the structure inside the volume looks like |frame hdr|data blk|frame hdr|data blk|... So it is very likely that we will not get a high dedup ratio on virtual volume data on the target server.


They will probably publish a technote.
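
Their explanation is easy to demonstrate with a toy chunker. This is only an illustration with fixed-size chunks (real dedup chunking is content-defined, so the numbers exaggerate), but the direction matches the ratios reported above: the varying header fields spoil some chunks outright, and the offset shift they introduce stops nearly everything else from matching the primary pool's chunks.

# Toy demo of why virtual-volume framing hurts dedup (illustration only;
# fixed-size chunking, unlike TSM's real content-defined chunking).
import hashlib, os

def chunks(data, size=4096):
    # Fixed-size chunking; return the set of chunk digests.
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

def frame(data, session, blk=16384):
    # |frame hdr|data blk|frame hdr|data blk|... with a varying session id,
    # mimicking the supporting data IBM describes inside a virtual volume.
    out = bytearray()
    for seq, i in enumerate(range(0, len(data), blk)):
        out += b"HDR" + session.to_bytes(4, "big") + seq.to_bytes(4, "big")
        out += data[i:i + blk]
    return bytes(out)

data = os.urandom(1 << 20)     # the "same client data" sent twice
raw, v1, v2 = chunks(data), chunks(frame(data, 1)), chunks(frame(data, 2))

# Varying header bytes at fixed spots spoil roughly 1 chunk in 4 here...
print(f"framed session 1 vs 2: {len(v1 & v2)}/{len(v1)} chunks match")
# ...and the 11-byte shift leaves almost nothing matching the raw chunks.
print(f"raw data vs framed:    {len(raw & v1)}/{len(raw)} chunks match")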
 