Compression setting in node configuration?

ldmwndletsm

There has been much mention in a number of threads on this site that client-side compression requires compression to be set to 'Yes' in the node configuration on the server and also set in the dsm.sys on the client. Not one or the other, but both. Unless I misunderstood, why do they allow for three compression values on the node: Client, Yes, No?

[ Question ]
What would be the difference between 1. setting it to 'Client' and also having 'compression yes' in the dsm.sys versus 2. setting it to 'Yes' and also listing 'compression yes' in dsm.sys?

None of this makes sense. Seems it should instead be:

No on the server: Client will not compress no matter what is or is not listed in dsm.sys
Yes on the server: Client will compress regardless (even if it overtly states 'compression no' in dsm.sys)
Client: Client will choose (No by default; otherwise, it will state yes or no), and server will honor that.

What other logic makes sense here? Other than maybe having the 'compression yes' or 'no' set in some client option set on the server.

[ Question ]
My dsmsched.log output indicates that compression is occurring, based on the 'Objects compressed by' and 'Total data reduction ratio' lines in the Summary output. Both are always positive values. But there is no 'compression' statement anywhere in the dsm.sys. The node configuration on the server has: compression: yes. The client option set is DEFAULT, and a query on CLOPTSET reports only the one, and it only has two values listed, nothing about compression. Are we sure compression is not occurring? Or are these lines in the Summary output just misleading?
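
For reference, here's roughly how I've been checking what's actually in effect. The node name is made up and I'm going from memory on the commands, so treat this as a sketch rather than gospel:

    /* dsmadmc macro: check the server side */
    /* the Compression field shows Client, Yes, or No */
    query node mynode format=detailed
    /* and see whether the option set forces anything */
    query cloptset DEFAULT

On the client, 'dsmc query options' lists the compression value the client is actually running with.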
 
TSM is all about choices. Lots of little choices that will come back to haunt you in your dreams. Kidding! Or am I?

OK, taking a stab at this.
There are some storage pools you do not want to send compressed client data to. A setting of No on the server for that client prevents an operator of the client from saying 'Yeah, I want my data to be compressed' and modifying the client options. That could be bad for your storage pool.

  • Setting it to YES tells the client to send the server compressed data no matter what. Again, this is to override the client operator. While I have no use cases for this, think of someone with a billion text files. Text compresses easily, so why send 6 KB text files over the network when you can drop them down to 1 KB for a whole mess of them? I would assume in most backup environments this setting is ignored and left up to the 'client'.
  • And the best choice, I feel, for most cases is leaving it up to the client to decide (a quick sketch of the three settings follows below).
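
Just to make that concrete, roughly what the three values look like from the admin command line (node names are made up, so it's only a sketch):

    /* dsmadmc macro: the three possible node settings */
    /* let the client's own COMPRESSION option decide (the usual choice) */
    update node nodea compression=client
    /* force compression on, no matter what the client options say */
    update node nodeb compression=yes
    /* force compression off, no matter what the client options say */
    update node nodec compression=no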

So, a use case:
I have to back up a lot of medical images. I do not want to spend client cycles or server cycles trying to perform any data reduction on those images. In fact, it would take up MORE space in the storage pool. The storage pool is a legacy filepool that acts as a buffer before rolling off to LTO media for the primary and copy pools. Also, the roll-off of data to tape would require the server to rehydrate the data, which uses more server resources. Since I know I'm not the only person to get on those boxes, and it's not that I don't trust the other people, I just don't trust the others. :) I don't want client-compressed data coming in.

Everything else, I leave up to the client. Some clients compress data; for others I let the server do the workload with the directory containers, just because my Power8 is more powerful than the little x86 boxes I am backing up. Or I am working on getting a more timely backup of them, so choices have to be made.

If you see 'Objects compressed by' showing a positive or negative value, and as you said the server is telling the client to compress data, then yes: you are getting some form of compression, even if it's negative compression! Think images, or lots of data already in a compressed format. You can always update the node and run a new backup and see the changes.
Total data reduction is also a metric of how much of the data didn't change, along with any other data reduction techniques. A lot of other backup vendors will say 'we get 99% data reduction!', when in fact they are counting on incremental-forever technology to perform that data reduction. Example: you have a 10 GB filesystem with very little new data being created. Each day an incremental runs. Well, you are only backing up what has changed. 99% didn't change. 99% data reduction!
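
A rough back-of-the-envelope on that example, assuming (my number, not a real one) that only about 100 MB of the 10 GB actually changes each day:

    data backed up by the incremental = 0.1 GB
    data that didn't need sending     = 10 GB - 0.1 GB = 9.9 GB
    claimed 'data reduction'          = 9.9 / 10 = 99%

Same filesystem, nothing fancier than incremental forever doing the heavy lifting.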

Hope I'm fairly accurate in my statements above! If not, mods smack me :)
And hope it helps.
 
Thank you, RecoveryOne. In checking things over more closely, yes, you are correct. If I have compression set to 'yes' on the server, it is NOT necessary for the client to have any such setting in its client system option file (dsm.sys or dsm.opt or whatever). It will be forced regardless. There is also the following entry that shows up in the dsmsched.log file every time a backup runs that certainly makes it clear to me, but I missed this earlier:

04/28/2020 16:43:07 Data compression forced on by the server

So what would be an example of a storage pool that you would not want to send compressed data to?

I'm not using dedup or container pools, so inline compression will not work. I could rely solely on drive compression.

When you said: "Also, the roll-off of data to tape would require the server to rehydrate the data, which uses more server resources."

How is the data being rehydrated? Are you referring to sending compressed data to tape drives, wherein the tape drive will attempt to compress the data (since drive compression is usually enabled), thus possibly inflating it? Or are you instead referring to deduplication, wherein the data is rehydrated when written back to tape from disk?

Also, what if you used exclude.compression commands to exclude image files? Not that that would facilitate anything, but would it work mechanically, or would that still add too much time?
 
Starting from the bottom of your post upwards:
Also, what if you used exclude.compression commands to exclude image files? Not that that would facilitate anything, but would it work mechanically, or would that still add too much time?
Yes, one could use exclude.compression, but then in my environment I would need to provide it with its own management class, and a few other considerations. Also, it would add time, as the same schedule would be processing multiple drives vs. breaking it into smaller chunks.

For example, say I have nodea with a C:\ and an E:\ drive. C:\ is 300 GB of data that performs well with data reduction techniques. E:\ is all images, 1.6 TB, that when compressed actually grow by 5-10%. Also, during my setup the directory container pools didn't have a tier-to-tape option, or even a tier-to-cloud option, and I didn't want to burn up my limited spinning disk space.
So I have a client called nodea_c_drive that only backs up C:\ and systemstate to my directory container pool. Then I have a client called nodea_e_drive that only backs up E:\ to a legacy filepool that has deduplication set to no, and compression controlled by the client (disabled).
Also, with two separate nodes I can control which data gets migrated to tape resources, either via move nodedata or migrate stgpool. Yes, move nodedata can process by filespaces as well, but I find it much easier to let the system migrate storage when watermarks are set. The less I have to touch it, the better.
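
Roughly, the client side of that looks like the below; the names and paths are made up, so it's a sketch of the idea rather than my actual config:

    * dsm_c.opt - scheduler for nodea_c_drive
    NODENAME       nodea_c_drive
    DOMAIN         C: SYSTEMSTATE
    COMPRESSION    YES

    * dsm_e.opt - scheduler for nodea_e_drive
    NODENAME       nodea_e_drive
    DOMAIN         E:
    COMPRESSION    NO

Each node is then registered into a policy domain whose backup copy group points at the pool I want it landing in (directory container for the C: node, legacy filepool for the E: node).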

When you said: "Also, the roll-off of data to tape would require the server to rehydrate the data, which uses more server resources."
and
How is the data being rehydrated? Are you referring to sending compressed data to tape drives, wherein the tape drive will attempt to compress the data (since drive compression is usually enabled), thus possibly inflating it? Or are you instead referring to deduplication, wherein the data is rehydrated when written back to tape from disk?

When the filepool hits a high watermark, it empties to tape. If that data had any reduction techniques applied to it by client or server, it would have to be rehydrated by the server before being sent to the HBAs and on to tape. What the tape drives do is outside of what the Spectrum Protect product does. Right now, the only storage pool that can send natively compressed data to tape is the directory container (in a protect-only process). There are pros and cons to both approaches. With filepools being migrated or copied to tape, the client can restore data directly from tape. With a directory container being PROTECTED to tape, you must restore the entire directory container storage pool before any client can do restores. Now that the directory container has tier to tape as of 8.1.8(?), that data is actually rehydrated and then sent to tape. I would assume it's treated like a normal tape pool at that point and clients can access that data directly, but don't quote me on that. Haven't yet used that functionality.
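
For anyone following along, the protect-to-tape piece looks more or less like this, if I have the syntax right; the pool and device class names are made up, and double-check the docs since I'm going from memory:

    /* dsmadmc macro: protect a directory container pool to tape */
    /* a container-copy pool on the LTO device class */
    define stgpool contcopy_tape ltoclass pooltype=copycontainer maxscratch=50
    /* point the directory container pool at it */
    update stgpool dircontpool protectstgpool=contcopy_tape
    /* the actual protect run, usually on a schedule */
    protect stgpool dircontpool type=local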

So what would be an example of a storage pool that you would not want to send compressed data to?
In my case, I have multiple clients where, when I try to back them up with data reduction techniques, the client logs show negative compression ratios. Generally that doesn't bother me, as the dedup engine makes up for any deficiencies. Also, on a lot of my clients I have compressalways set to no, so when the client tries to compress a file and it grows, it sends the file uncompressed and moves on to the next file. This does add some time, however, but it saves my back-end storage. Generally think application installers or other things like that, not TB of uncompressible data.
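
In the client options that combination is just the following (dsm.sys stanza on Unix, dsm.opt on Windows; a sketch):

    * compress on the client, but if a file grows while compressing,
    * resend it uncompressed rather than keep the bigger version
    COMPRESSION      YES
    COMPRESSALWAYS   NO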

For the 300 or so TB that I have that doesn't compress at all, I have a legacy file pool that's about 20 TB in size. Dedup processing is turned off for it, and I make sure clients don't have dedup or compress set in their opt file. And yeah, not much to it really. Clients back up as normal, and that storage pool migrates to tape when high watermarks are hit. The nifty thing is, when looking at the tapes used, they report 2.7 TB or less at 100% full. So yes, the drives are getting some compression, but not great. Also, in my environment I don't have the means to keep 300+ TB of data on spinning disk, and the most efficient way to get it from disk to tape is not to rehydrate it. Also, I would assume if you had a client writing directly to a tape device storage pool, you'd want to disable client compression.

Any of that help?
 
Thank you. My responses interleaved below.

Starting from the bottom of your post upwards:

Yes, one could use exclude.compression, but then in my environment I would need to provide it with its own management class, and a few other considerations. Also, it would add time, as the same schedule would be processing multiple drives vs. breaking it into smaller chunks.

Hmm ... Why would the management class have to change?

For example, say I have nodea with a C:\ and an E:\ drive. C:\ is 300 GB of data that performs well with data reduction techniques. E:\ is all images, 1.6 TB, that when compressed actually grow by 5-10%. Also, during my setup the directory container pools didn't have a tier-to-tape option, or even a tier-to-cloud option, and I didn't want to burn up my limited spinning disk space.
So I have a client called nodea_c_drive that only backs up C:\ and systemstate to my directory container pool. Then I have a client called nodea_e_drive that only backs up E:\ to a legacy filepool that has deduplication set to no, and compression controlled by the client (disabled).
Also, with two separate nodes I can control which data gets migrated to tape resources, either via move nodedata or migrate stgpool. Yes, move nodedata can process by filespaces as well, but I find it much easier to let the system migrate storage when watermarks are set. The less I have to touch it, the better.

Right. I think I got it.

When the filepool hits a high watermark, it empties to tape. If that data had any reduction techniques applied to it by client or server, it would have to be rehydrated by the server before being sent to the HBAs and on to tape. What the tape drives do is outside of what the Spectrum Protect product does. Right now, the only storage pool that can send natively compressed data to tape is the directory container (in a protect-only process). There are pros and cons to both approaches. With filepools being migrated or copied to tape, the client can restore data directly from tape. With a directory container being PROTECTED to tape, you must restore the entire directory container storage pool before any client can do restores. Now that the directory container has tier to tape as of 8.1.8(?), that data is actually rehydrated and then sent to tape. I would assume it's treated like a normal tape pool at that point and clients can access that data directly, but don't quote me on that. Haven't yet used that functionality.

Let me first make sure I'm clear on what you're not saying. You're *not* saying that TSM would take data on a standard (traditional) storage pool disk volume (no dedup) that was sent compressed by the client (no dedup) and then uncompress it when migrating (high water mark hit) to tape, right? I wouldn't imagine it would do that, in which case the compressed data will simply arrive at the tape drive, and the tape drive will try to compress it (it has no way of knowing that it's already compressed) and will either fail to compress it, in which case it's written as is, or it might be able to squeeze more out of it, since drive compression is more robust than the client-side compression (LZW standard).

So are you referring instead to a storage pool wherein the data is deduped on the server side (or maybe the client, but regardless, it's stored deduped on disk on the server), so that when it's later migrated to tape (high water mark), it will, of course, be rehydrated? I think rehydration always occurs when moving/copying data from disk, where it's been deduped, to tape. A few companies played around with preserving deduped data on tape, but I think it's generally never done? Is that right?

In my case, I have multiple clients where, when I try to back them up with data reduction techniques, the client logs show negative compression ratios. Generally that doesn't bother me, as the dedup engine makes up for any deficiencies. Also, on a lot of my clients I have compressalways set to no, so when the client tries to compress a file and it grows, it sends the file uncompressed and moves on to the next file. This does add some time, however, but it saves my back-end storage. Generally think application installers or other things like that, not TB of uncompressible data.

Yes, I recently added the 'compressalways no' option to my client user-options file, and now I see a lot of messages regarding files having increased in size (.gz in particular) and usually an accompanying line that the file Grew. I wasn't seeing those before. I plan to add some exclude.compression statements to force it not to even bother trying.
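
Something along these lines is what I have in mind; the patterns are just examples, not a recommendation:

    * don't even attempt compression on data that's already compressed
    EXCLUDE.COMPRESSION /.../*.gz
    EXCLUDE.COMPRESSION /.../*.tgz
    EXCLUDE.COMPRESSION /.../*.zip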

For the 300 or so TB that I have that doesn't compress at all, I have a legacy file pool that's about 20 TB in size. Dedup processing is turned off for it, and I make sure clients don't have dedup or compress set in their opt file. And yeah, not much to it really. Clients back up as normal, and that storage pool migrates to tape when high watermarks are hit. The nifty thing is, when looking at the tapes used, they report 2.7 TB or less at 100% full. So yes, the drives are getting some compression, but not great. Also, in my environment I don't have the means to keep 300+ TB of data on spinning disk, and the most efficient way to get it from disk to tape is not to rehydrate it. Also, I would assume if you had a client writing directly to a tape device storage pool, you'd want to disable client compression.
Any of that help?

Are those LTO6? I think its native capacity is 2.5 TB so that sounds about right?

Well, in my experience with EMC NetWorker, I've always written directly to tape, never used client-side compression. Reasonably fast network, and drives are LTO6, with pretty good compression ratios. It's not unusual to see some tapes as high as 4-5+ TB, although a number of them might be down in the 2.5-6ish range for a pool that's mostly highly uncompressible data, natch. I find reasonably good write speeds, too, as long as enough streams are writing to keep the drive buffers full. Sometimes, though, when there's only one file system backing up (the lone wolf), and all the others have long finished, speed might drop precipitously, which is not surprising. Offloading from disk to tape would resolve a lot of that and maintain better write speeds to tape.

With TSM, I have a disk volume on the server (traditional storage pool; no dedup), a reasonably fast network and a very robust client (server) that seems happy with client side compression. That's less data going over the network (there are a lot of other clients that I don't manage so this seems equitable) and less being stored on the disk volume, so that makes better use of that space. Of course, the tape drives will try to compress it so there is redundancy there, but maybe with it going from disk to tape that mitigates it as opposed to going directly to tape?

Also, with compressalways set to no, this helps reduce those client cpu cycles. I will be curious to see how much the exclude.compression statements help.
 
Hmm ... Why would the management class have to change?
Because in my case, I don't want it to fill up my spinning disk pool, which is the directory container. You can't migrate from the directory container pools to other storage pools unless it's tier to cloud or tier to tape. You should be able to define a new management class and have data write to a different storage pool if required.
**Edit: But having it broken up just makes my life and my operations team's life easier. Sure, I have a shortcut to call different dsm.opt files on the clients, but overall it works well.
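
For the record, if someone did want to steer a subset of data to a different pool from a single node, the rough shape would be something like this; the names are made up and I haven't tested this exact snippet, so verify before using:

    /* dsmadmc macro: extra management class whose backup copy group lands in the filepool */
    define mgmtclass standard standard nocompressmc
    define copygroup standard standard nocompressmc standard type=backup destination=legacy_filepool
    /* re-validate and activate the policy set after the change */
    validate policyset standard standard
    activate policyset standard standard

Then on the client an include statement like 'include E:\...\* nocompressmc' binds that data to the new class.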

Let me first make sure I'm clear on what you're not saying. You're *not* saying that TSM would take data on a standard (traditional) storage pool disk volume (no dedup) that was sent compressed by the client (no dedup) and then uncompress it when migrating (high water mark hit) to tape, right? I wouldn't imagine it would do that, in which case the compressed data will simply arrive at the tape drive, and the tape drive will try to compress it (it has no way of knowing that it's already compressed) and will either fail to compress it, in which case it's written as is, or it might be able to squeeze more out of it, since drive compression is more robust than the client-side compression (LZW standard).

So are you referring instead to a storage pool wherein the data is deduped on the server side (or maybe the client, but regardless, it's stored deduped on disk on the server), so that when it's later migrated to tape (high water mark), it will, of course, be rehydrated? I think rehydration always occurs when moving/copying data from disk, where it's been deduped, to tape. A few companies played around with preserving deduped data on tape, but I think it's generally never done? Is that right?

In a legacy pool (file or disk pool), if the client sends file1.zip, it's a compressed file. When that file is written to tape, then yes, the tape drive will, I'd assume, not perform any compression on it and will write the bitstream it has to tape. What I'm saying is, if you have compression and deduplication on the client side, or are running identify duplicates on the storage pool, all those chunks it found and discarded would be reassembled by the server before being written to tape. Also, it is my understanding that the LZ4 or LZO compression used by the Spectrum Protect server would be reinflated, so the tape drive can then do its native compression. So no, the product will not inflate compressed objects that it didn't create.

Now, with the directory container pools, the data is written to tape in a different format; I guess in the container format? That is a compressed and deduplicated bitstream, so in this case, yes, it is preserving deduped data on tape. This has the limitation of requiring you to restore the entire directory container pool to disk if you need to do client restores from this 'copy'. It's why IBM really recommends having replication for these storage pools, as the restore from tape to disk could take a good while, assuming you have the disk space to restore it to. All-or-nothing deal.

Yes, I recently added the 'compressalways no' option to my client user-options file, and now I see a lot of messages regarding files having increased in size (.gz in particular) and usually an accompanying line that the file Grew. I wasn't seeing those before. I plan to add some exclude.compression statements to force it not to even bother trying.
I have a few specific exclusions like .zip or tgz, gz etc, but not applied everywhere. For the most part I just let the client duke it out. Or if vendor A is using some other random compression algorithm, I just don't care enough to figure them all out :) Also compressed and encrypted MSSQL flatfile backups are the worst. Uncompressed and non encrypted MSSQL flatfile backups and their cousins are all called... '*.bak'. Yeah, just let the client sort them out :)

And yes, LTO6 drives. The non-dedup pool is getting about 2.6 to 2.7 TB on a 2.5 TB tape, with several tapes 100% full at 2.4 TB. Yes, there is data in those filespaces that does compress, but on the whole it's not worth the CPU cycles to do that on the client or server side, plus issues with time (see next paragraph below). My legacy dedup pool's copy volumes to tape land at 2.8 TB or higher; I've never seen anything above 4.0 TB. Sure, may not be much, but when budgets are denied year after year for infrastructure upgrades, you work with what you've got. And at the time, I was getting better performance/results not spending server cycles trying to rehydrate that data. Also, writing to a disk pool and then sending that data to tape lets me better control tape resource utilization.

I'm not ashamed to say it: a lot of the decisions about why my environment is set up the way it is come down to budgets and current infrastructure. We've tried to cram 100 lb of sand into a 1-gallon bucket filled by a 1/4" pipe that was designed 8 years ago and implemented fully 2 years after that. I spent months tuning AIX parameters for volume groups, jfs2, and HBA settings, trying to eke out every last bit of performance I could. It wasn't uncommon to see so much drive hitching that the write and read speed was 16Mb/sec. After tuning a lot of parameters, seeing it go up to 20Mb/s was amazing. Pretty sure some of my old posts here reflect those speeds. No matter how many drives I would write to, I'd top out at a max of 256 or so Mb/sec (reported from fcstat/nmon). It would take 8+ hours to copy 1.4 TB or so to tape. Since then, some improvements have been made as far as SAN connectivity, but very few 10g Ethernet links. What took 8 hours is now being accomplished in 2 or thereabouts.

In the previous paragraph I mentioned tape resource utilization. That was a huge pain point: 1.4 TB took 8 hours to write, the database backup took 4 hours, and other copies were being made for my other storage pools that could take 8+ hours... I was cutting that 24-hour window very, very tight and leaving little to no room for reclamation of tape volumes! Things are much better now. It's not uncommon to see my admin tasks start around 4am and be completed by 4pm, which, in my mind, having previously struggled with the fabled 'wheel of life', is amazing. After several years, only recently have I been able to push our SAN infrastructure to where our SAN team is starting to get worried that I'm using all their bandwidth :D We run disk storage and tape resources over the same fabrics, so there can be times when performance isn't the best due to switches queuing up frames.

That said, I was an early adopter of directory container pools as soon as they added in the ability to protect them via tape resources. I needed the more compact compression features they provided to meet retention. I just do not at this time have enough disk storage to store everything so still leveraging tape resources. Also, our auditors like to know that tape is in use. They describe it as 'slow media, not affected by ransomware' vs online storage. So, if/when I get to a point of a replication environment, I will very likely have a tertiary copy on tape.

I don't know everything, I don't claim to know everything. I still ask the fine folks here for help or what am I missing type of questions. My setup is fairly limited in scope compared to others that use this product. And if I am wrong, call me out on it and help me improve! I've a few health checks with the IBM team, and I've had to explain some design decisions that were made and reasons why, and yes they go against best practices.

So there's a little back story as well. Not sure if all of that is useful, but hopefully gives you an idea of where I'm coming from.
 
In a legacy pool (file or disk pool), if the client sends file1.zip, it's a compressed file. When that file is written to tape, then yes, the tape drive will, I'd assume, not perform any compression on it and will write the bitstream it has to tape. What I'm saying is, if you have compression and deduplication on the client side, or are running identify duplicates on the storage pool, all those chunks it found and discarded would be reassembled by the server before being written to tape. Also, it is my understanding that the LZ4 or LZO compression used by the Spectrum Protect server would be reinflated, so the tape drive can then do its native compression. So no, the product will not inflate compressed objects that it didn't create.

This raises an interesting question, which is somewhat off topic. Let's say, for example, that you're not using deduplication and no container pools, so no server-side compression either, BUT you are using client-side compression (LZW), and the client compresses a file and, let's say, achieves 70% reduction on that particular file. It's now stored that way (in situ) on the server disk storage pool volume. So what happens when it's migrated to tape? Does TSM decompress that file, sending it to the tape drive the way it was originally on the client's disk? After all, it did create the compressed version of the object. Or does it send it as is, compression maintained, even though it did create the compressed version? My presumption was the latter.

I guess what I'm really asking is: if client-side compression reduces network traffic, reduces load on the backup server, and reduces space on the disk storage pool (all good), and you had compression turned off on the drives (not that you would, but let's say just as a test), and for some reason TSM reinflated any files that it previously compressed before writing them to tape, then any benefit to compressing them on the client would be ephemeral and would not reduce tape consumption. It's not categorically clear from the IBM documentation whether this is or is not the case.

But when I query some tapes that are full, the physical space occupied that's reported is around the standard default native capacity for LTO6. Unless the data is horribly uncompressible, it would seem that the only way these tapes could fill up at that amount would be if the drives were not able to compress much more, which suggests that the data had already been compressed when it arrived at the drive and remained that way. Otherwise, if that data had somehow been decompressed before being sent to the drives, then those numbers would be much higher, I would expect. Does that sound right?
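
(For the record, the numbers I'm looking at come from something like this, with a made-up volume name; if I recall the field names right, it's the estimated capacity and percent utilization values I mean by physical space occupied:)

    /* dsmadmc: per-volume capacity and utilization */
    query volume A00123L6 format=detailed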

Clearly, with deduplication it would reassemble everything and rehydrate it when it sends it to tape.

Now, with the directory container pools, the data is written to tape in a different format; I guess in the container format? That is a compressed and deduplicated bitstream, so in this case, yes, it is preserving deduped data on tape. This has the limitation of requiring you to restore the entire directory container pool to disk if you need to do client restores from this 'copy'. It's why IBM really recommends having replication for these storage pools, as the restore from tape to disk could take a good while, assuming you have the disk space to restore it to. All-or-nothing deal.

Very interesting. I was not aware of that capability. But perhaps it's not too much of a stretch to preserve dedup to tape, given that you have to restore the entire entity in its entirety, not just some arbitrary file. So this is a special case of deduped data not being rehydrated when written to tape, but one has to note the limitations.

Sounds like you've definitely eked out a lot from your backups and optimization.
 