Volumes with Pct Util < 1 don't get reclaimed

FloydATC:
I have about 800 sequential file volumes, about 3/4 of them contain almost no data worth keeping. However, when I try to reclaim only those volumes by specifying a high reclamation threshold, TSM tells me there is no data to process. What am I missing? If I specify a really low threshold, reclamation can shuffle data around for days on end but not a single volume is actually freed up. I have audited every single volume and found no discrepancies.

tsm: BACKUP-01>q vol stg=filepool

Volume Name Storage Device Estimated Pct Volume
Pool Name Class Name Capacity Util Status
------------------------ ----------- ---------- --------- ----- --------
Q:\TSM\FILEVOLS\FILE000 FILEPOOL FILE20GB 20.5 G 0.2 Filling
Q:\TSM\FILEVOLS\FILE001 FILEPOOL FILE20GB 20.5 G 17.2 Filling
Q:\TSM\FILEVOLS\FILE002 FILEPOOL FILE20GB 20.5 G 0.3 Filling
Q:\TSM\FILEVOLS\FILE004 FILEPOOL FILE20GB 20.5 G 0.2 Filling
Q:\TSM\FILEVOLS\FILE005 FILEPOOL FILE20GB 20.5 G 1.3 Filling
Q:\TSM\FILEVOLS\FILE007 FILEPOOL FILE20GB 20.5 G 0.3 Filling
Q:\TSM\FILEVOLS\FILE009 FILEPOOL FILE20GB 20.5 G 18.3 Filling
Q:\TSM\FILEVOLS\FILE00A FILEPOOL FILE20GB 20.5 G 50.8 Filling
(...and hundreds more like these)


tsm: BACKUP-01>q stg

Storage Device Estimated Pct Pct High Low Next Stora-
Pool Name Class Name Capacity Util Migr Mig Mig ge Pool
Pct Pct
----------- ---------- ---------- ----- ----- ---- --- -----------
FILEPOOL FILE20GB 16,070 G 22.6 22.6 100 0 FILEPOOL2
FILEPOOL2 FILE20GB 29,791 G 42.1 42.1 90 70
(...other, irrelevant storage pools snipped)


tsm: BACKUP-01>reclaim stg filepool threshold=90 duration=60
ANR2111W RECLAIM STGPOOL: There is no data to process for FILEPOOL.
ANS8001I Return code 11.

tsm: BACKUP-01>reclaim stg filepool threshold=60 duration=60
ANR2111W RECLAIM STGPOOL: There is no data to process for FILEPOOL.
ANS8001I Return code 11.

tsm: BACKUP-01>reclaim stg filepool threshold=30 duration=60
ANR2110I RECLAIM STGPOOL started as process 49.
ANR4930I Reclamation process 49 started for primary storage pool FILEPOOL manually, threshold=30, duration=60.
ANS8003I Process number 49 started.

tsm: BACKUP-01>q pr

Process Process Description Status
Number
-------- -------------------- -------------------------------------------------
49 Space Reclamation Volume Q:\TSM\FILEVOLS\FILE0FC (storage pool
FILEPOOL), Moved Files: 25032, Moved Bytes:
93,348,604, Unreadable Files: 0, Unreadable
Bytes: 0. Current Physical File (bytes): 53,784
Current input volume: Q:\TSM\FILEVOLS\FILE0FC.
Current output volume(s):
Q:\TSM\FILEVOLS\FILE15E.
 
Hi Floyd,

You should know that reclamation only processes full volumes (in a primary storage pool).

My advice:
- use a relatively low threshold (10-15-20),
- use a longer duration (60 minutes is probably too short),
- use the RECLAIMPRocess storage pool option to run parallel processes if your disks are fast enough (see the example commands below).
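
For example, something like this (the process count, threshold and duration here are illustrative values only, not tuned recommendations for your setup):

update stgpool filepool reclaimprocess=4
reclaim stgpool filepool threshold=15 duration=240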

Regards,
Erwann
 
The low migration threshold is currently at 0% because I'm trying to completely clean out this storage pool. The migration process completes after just a few minutes; it runs nowhere near 60 minutes. MOVE DATA completes with "success" but doesn't actually change the Pct Util on the volume.

I'm beginning to suspect that this problem somehow has to do with deduplicated data but I have no idea how to solve it.
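
To check whether the pool really is dedup-enabled, a detailed query should show it (I believe the relevant field in the output is "Deduplicate Data?"):

q stg filepool f=d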
 
I'm beginning to suspect that this problem somehow has to do with deduplicated data but I have no idea how to solve it.

Hi Floyd,

It's important information that you're using deduplication for that pool.

Your server probably has the DEDUPREQUIRESBACKUP option set to YES (this is the default).

This option allows reclamation to occur only after the data has been copied to a copy storage pool. See:
http://publib.boulder.ibm.com/infoc...ref.doc/r_opt_server_deduprequiresbackup.html

Note that you probably should not set this option to NO!

So first ensure that all the data from FILEPOOL has been copied.
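
For example, assuming your copy storage pool is called COPYPOOL (a hypothetical name; use the copy pool you actually back FILEPOOL up to):

backup stgpool filepool copypool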

Regards,
Erwann
 
opermty: Here's a sample volume. Notice that we've set most of the nearly unused volumes in this pool to Read-Only in an attempt to track them over time.

tsm: BACKUP-01>q vol Q:\TSM\FILEVOLS\FILE004 f=d

Volume Name: Q:\TSM\FILEVOLS\FILE004
Storage Pool Name: FILEPOOL
Device Class Name: FILE20GB
Estimated Capacity: 20.5 G
Scaled Capacity Applied:
Pct Util: 0.3
Volume Status: Filling
Access: Read-Only
Pct. Reclaimable Space: 0.0
Scratch Volume?: No
In Error State?: No
Number of Writable Sides: 1
Number of Times Mounted: 11,003
Write Pass Number: 34
Approx. Date Last Written: 10/18/2012 15:31:04
Approx. Date Last Read: 10/18/2012 20:43:45
Date Became Pending:
Number of Write Errors: 0
Number of Read Errors: 0
Volume Location:
Volume is MVS Lanfree Capable : No
Last Update by (administrator): ADMIN
Last Update Date/Time: 10/19/2012 13:03:06
Begin Reclaim Period:
End Reclaim Period:
Drive Encryption Key Manager:

Trident: Collocation is set to Group for all storage pools. Even if it were set to Node, it still wouldn't quite explain the number of nearly unused volumes, because we don't have that many nodes:

tsm: BACKUP-01>select count(*) from volumes where pct_utilized < 1

Unnamed[1]
------------
501

tsm: BACKUP-01>select count(*) from nodes

Unnamed[1]
------------
176
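
To dig further, a per-node breakdown might help. Something like this (a sketch against the VOLUMEUSAGE table; I haven't verified the exact syntax on our server) should show how many FILEPOOL volumes each node has data on:

select node_name, count(distinct volume_name) from volumeusage where stgpool_name='FILEPOOL' group by node_name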
 
And it's not just the nearly unused volumes that behave this way... Let's try MOVE DATA on a full volume:

tsm: BACKUP-01>q vol Q:\TSM\FILEVOLS\FILE029 f=d
move data Q:\TSM\FILEVOLS\FILE029 stg=filepool2 wait=yes
q vol Q:\TSM\FILEVOLS\FILE029 f=d

Volume Name: Q:\TSM\FILEVOLS\FILE029
Storage Pool Name: FILEPOOL
Device Class Name: FILE20GB
Estimated Capacity: 20.4 G
Scaled Capacity Applied:
Pct Util: 99.9
Volume Status: Full
Access: Read-Only
Pct. Reclaimable Space: 0.1
Scratch Volume?: No
In Error State?: No
Number of Writable Sides: 1
Number of Times Mounted: 5
Write Pass Number: 1
Approx. Date Last Written: 01/05/2012 15:21:18
Approx. Date Last Read: 10/19/2012 01:36:36
Date Became Pending:
Number of Write Errors: 0
Number of Read Errors: 0
Volume Location:
Volume is MVS Lanfree Capable : No
Last Update by (administrator): ADMIN
Last Update Date/Time: 10/19/2012 13:03:06
Begin Reclaim Period:
End Reclaim Period:
Drive Encryption Key Manager:


tsm: BACKUP-01>move data Q:\TSM\FILEVOLS\FILE029 stg=filepool2 wait=yes
ANR2233W This command will move all of the data stored on volume Q:\TSM\FILEVOLS\FILE029 to other volumes in storage pool
FILEPOOL2; the data will be inaccessible to users until the operation completes.

Do you wish to proceed? (Yes (Y)/No (N)) y
ANR0984I Process 1524 for MOVE DATA started in the FOREGROUND at 07:59:52.
ANR1140I Move data process started for volume Q:\TSM\FILEVOLS\FILE029 (process ID 1524).
ANR1141I Move data process ended for volume Q:\TSM\FILEVOLS\FILE029.
ANR0985I Process 1524 for MOVE DATA running in the FOREGROUND completed with completion state SUCCESS at 07:59:54.


The process completed successfully after just two seconds, and here's what the console logged:

ANR2017I Administrator ADMIN issued command: MOVE DATA Q:\TSM\FILEVOLS\FILE029 stg=filepool2 wait=yes
ANR2233W This command will move all of the data stored on volume Q:\TSM\FILEVOLS\FILE029 to other volumes in storage pool
FILEPOOL2; the data will be inaccessible to users until the operation completes.
ANR2017I Administrator ADMIN issued command: MOVE DATA Q:\TSM\FILEVOLS\FILE029 stg=filepool2 wait=yes
ANR1157I Removable volume Q:\TSM\FILEVOLS\FILE029 is required for move process.
ANR0984I Process 1524 for MOVE DATA started in the FOREGROUND at 07:59:52.
ANR1140I Move data process started for volume Q:\TSM\FILEVOLS\FILE029 (process ID 1524).
ANR1176I Moving data for collocation set 1 of 1 on volume Q:\TSM\FILEVOLS\FILE029.
ANR1141I Move data process ended for volume Q:\TSM\FILEVOLS\FILE029.
ANR0985I Process 1524 for MOVE DATA running in the FOREGROUND completed with completion state SUCCESS at 07:59:54.


Notice that nothing changed on the volume, not even the read timestamp:

tsm: BACKUP-01>q vol Q:\TSM\FILEVOLS\FILE029 f=d

Volume Name: Q:\TSM\FILEVOLS\FILE029
Storage Pool Name: FILEPOOL
Device Class Name: FILE20GB
Estimated Capacity: 20.4 G
Scaled Capacity Applied:
Pct Util: 99.9
Volume Status: Full
Access: Read-Only
Pct. Reclaimable Space: 0.1
Scratch Volume?: No
In Error State?: No
Number of Writable Sides: 1
Number of Times Mounted: 5
Write Pass Number: 1
Approx. Date Last Written: 01/05/2012 15:21:18
Approx. Date Last Read: 10/19/2012 01:36:36
Date Became Pending:
Number of Write Errors: 0
Number of Read Errors: 0
Volume Location:
Volume is MVS Lanfree Capable : No
Last Update by (administrator): ADMIN
Last Update Date/Time: 10/19/2012 13:03:06
Begin Reclaim Period:
End Reclaim Period:
Drive Encryption Key Manager:
 
Hi Floyd,
This is an important information to know that you're using deduplication for that pool.
Your server probably has the DEDUPREQUIRESBACKUP option set to YES (this is the default).

DEDUPREQUIRESBACKUP is set to NO, otherwise deduplication would not have worked at all for our setup.

tsm: BACKUP-01>q opt deduprequiresbackup

Server Option Option Setting
----------------- --------------------
DedupRequiresBac- No
kup
 
opermty: Here's a sample volume. Notice that we've set most of the nearly unused volumes in this pool to Read-Only in an attempt to track them over time.

I would set them back to readwrite. Try a MOVE DATA after that.
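
For example, with the sample volume above:

update volume Q:\TSM\FILEVOLS\FILE004 access=readwrite
move data Q:\TSM\FILEVOLS\FILE004 wait=yes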
 
I had a chat with Trident, who pointed out to me that using MOVE DATA to an alternate storage pool on a 'Filling' volume doesn't work. Reclamation seems to be going in circles, so I've now set all volumes back to readwrite and I'm trying a sort of "manual reclamation" using a script that goes like this:

# NB: $tsm is our pre-initialized admin session wrapper; its query()
# method runs a server command and returns the parsed result records.
my $count = 0;
while (my $vol = random_rw_volume()) {
    my $volume = $vol->{'Volume Name'};
    print $vol->{'Volume Status'} . " volume " . $volume;

    if ($vol->{'Volume Status'} eq "Empty") {
        # Write-protect empty volumes so nothing refills them.
        $tsm->query("update volume $volume access=readonly");
        print " write protected.";
    }

    if ($vol->{'Volume Status'} eq "Filling") {
        # Consolidate partially filled volumes within FILEPOOL.
        $tsm->query("move data $volume wait=yes");
        print " moved within FILEPOOL.";
    }

    if ($vol->{'Volume Status'} eq "Full") {
        # Full volumes can be moved off to the new pool, then locked.
        $tsm->query("move data $volume stg=FILEPOOL2 wait=yes");
        $tsm->query("update volume $volume access=readonly");
        print " moved to FILEPOOL2 and write protected.";
    }

    print "\n";
    $count++;
}
print "Done, $count processed.\n";

# Returns a random read/write volume record from FILEPOOL,
# or undef when none are left (which ends the loop above).
sub random_rw_volume {
    my $query = "query volume stgpool=FILEPOOL access=readwrite";
    my @records = $tsm->query($query);
    return $records[ rand @records ];
}

I'm guessing that collocation will prevent this script from ever filling ALL the volumes so they can be moved, but after running for just a few minutes it does seem to be moving data. That would mean the script never actually finishes, but it should still free up at least 3/4 of the volumes. That is, if it works...
 
That is wrong, and I think you are overcomplicating things. The MOVE DATA commands didn't work because you had set the volumes to read-only. MOVE DATA does work on filling volumes, and you do not even need to move the data to a different storage pool. Just do

move data <volume_name>

Anyway, TSM will use filling volumes before creating a new volume. What may be causing an excess of filling volumes is many concurrent backups running at once, with collocation turned on but nodes not placed into collocation groups.

I would suggest you set your lowmig to something other than 0 (60 would be a good start) and use MOVE DATA to reduce the number of filling volumes so reclamation can work normally again. You may need to script the MOVE DATA commands to get back to a healthy place; see the sketch below.
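
One way to generate those commands is a SELECT that builds them from the volumes table, along these lines (a sketch; I am assuming the STATUS column holds 'FILLING' and that your server version supports || string concatenation):

select 'move data ' || volume_name || ' wait=yes' from volumes where stgpool_name='FILEPOOL' and status='FILLING'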

Would be good to see your collocation setup. Can you post a "q stg FILEPOOL f=d" and a "q collocg"? "q collocg f=d" will also show the nodes assigned to the different groups.
 
Collocation is set to 'Group' but I have no collocation groups, so in effect I'm running with 'Node' collocation, as explained earlier. And this is fine. For some reason the reclamation process keeps using empty volumes instead of picking a 'Filling' one, even though I have 3-4 times more nearly empty volumes than I have nodes, so it has to be something other than collocation. Moving data off a read-only volume is not a problem in itself. Reclaiming a read-only volume is not a problem in itself. The problem is that no volumes actually stay empty, because those processes prefer to use empty volumes. Under normal circumstances this would not be a problem, but the overall goal here is to free up disk space.

Now, moving data off a 'Filling' volume to an alternate storage pool does indeed seem to have been the issue that's been bugging me for two weeks. As I posted earlier, the command finishes with 'Success' after two seconds, having moved 0 bytes. For a 'Full' volume it works as expected. Maybe that's not supposed to happen, maybe our server is broken, but that's what happens and that's the problem.

Anyway... Based on this experience I've modified my script to write-protect 'Filling' volumes after the MOVE DATA as well. So far I have processed 20 volumes with this script; they are now 'Empty' and write-protected to ensure they stay that way until I'm ready to delete them.
 
Under normal circumstances this would not be a problem but the overall goal here is to free up disk space.

Is reducing the maxscratch setting of the storage pool an option?

Would be interesting to know how many nodes are backing up to this storage pool. There will be at least one filling tape per node if you are not using collocation groups.
 
Unfortunately, maxscratch is not in use for this storage pool. I suspect that if the volumes had been created using maxscratch, I would not have had this problem. The volumes were manually defined by the people who installed TSM for us, because they were spread across multiple RAID groups shared with other storage pools for TDP etc. Since then, I have learned that this is not necessarily a problem if done right. Anyway, the new pool I'm migrating to uses just one massive RAID group and maxscratch. I'm a beginner but I'm making progress.

About the number of nodes, as mentioned earlier:
tsm: BACKUP-01>select count(*) from volumes where pct_utilized < 1

Unnamed[1]
------------
501

tsm: BACKUP-01>select count(*) from nodes

Unnamed[1]
------------
176

I too expected one filling tape per node, plus maybe a few fragments here and there. Nothing like this.

Using my own script /is/ somewhat overcomplicated, in the sense that MIGRATE STGPOOL should be able to do this job much better than a quick Perl hack, but as long as it doesn't, and I can't figure out why, I have to use the tool that gets the job done.
 
Ok, that makes things a little clearer. Are you ready to move everything to the new storage pool? I would update the pool to make the new pool the target for migrations with "upd stg <old_pool_name> next=<new_pool_name>". Then just use "migrate stg <old_pool_name> lo=0".
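
For your pools, that would be something like this (assuming FILEPOOL2 is the pool you are moving everything to; adjust if the final destination is a different pool):

upd stg FILEPOOL next=FILEPOOL2
migrate stg FILEPOOL lo=0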

With 176 nodes, I would use collocation groups. This will reduce the number of filling tapes. For a FILE-type pool, sharing volumes between different nodes won't cause you any issues, and grouping will dramatically reduce the number of volumes. I use scripts that set the collocation group when creating new nodes, so it doesn't get missed. It's more of an issue with tape, but with your number of nodes you can expect 300+ filling volumes without collocation. The basic setup looks like the sketch below.
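
A minimal sketch (the group and node names here are made up; you would define one group per set of related nodes):

define collocgroup WINSERVERS
define collocmember WINSERVERS NODE_A NODE_B NODE_C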
 
Ok, that makes things a little clearer. Are you ready to move everything to the new storage pool? I would update the pool to make the new pool the target for migrations with "upd stg <old_pool_name> next=<new_pool_name>". Then just use "migrate stg <old_pool_name> lo=0".
This is what I tried first. The migration process finishes with status 'success' after moving just a few Gbytes, leaving the bulk of the data behind. I still don't understand why.

I will discuss collocation with our service partner; I think it might be a good idea for at least most of them. Keeping the few problematic ones apart from the rest is of course what those groups are all about...
 