[ADSM-L] Dealing with defunct filespaces.

Hi all.

Whilst investigating something else, we discovered a number of nodes
that have old filespaces still stored within TSM, e.g.:

                                      Node Name: (node name)
                                 Filespace Name: /data
                     Hexadecimal Filespace Name:
                                           FSID: 4
                                       Platform: SUN SOLARIS
                                 Filespace Type: UFS
                          Is Filespace Unicode?: No
                                  Capacity (MB): 129,733.3
                                       Pct Util: 92.1
                    Last Backup Start Date/Time: 06/09/05   20:03:56
                 Days Since Last Backup Started: 764
               Last Backup Completion Date/Time: 06/09/05   20:05:16
               Days Since Last Backup Completed: 764
Last Full NAS Image Backup Completion Date/Time:
Days Since Last Full NAS Image Backup Completed:

                                      Node Name: (node name)
                                 Filespace Name: /Z/oracle
                     Hexadecimal Filespace Name:
                                           FSID: 12
                                       Platform: SUN SOLARIS
                                 Filespace Type: UFS
                          Is Filespace Unicode?: No
                                  Capacity (MB): 119,642.2
                                       Pct Util: 31.5
                    Last Backup Start Date/Time: 08/26/05   01:03:08
                 Days Since Last Backup Started: 686
               Last Backup Completion Date/Time: 08/26/05   01:14:01
               Days Since Last Backup Completed: 686
Last Full NAS Image Backup Completion Date/Time:
Days Since Last Full NAS Image Backup Completed:

                                      Node Name: (node name)
                                 Filespace Name: /mnt
                     Hexadecimal Filespace Name:
                                           FSID: 15
                                       Platform: SUN SOLARIS
                                 Filespace Type: UFS
                          Is Filespace Unicode?: No
                                  Capacity (MB): 120,992.9
                                       Pct Util: 55.8
                    Last Backup Start Date/Time: 01/26/06   20:05:15
                 Days Since Last Backup Started: 533
               Last Backup Completion Date/Time: 01/26/06   20:06:34
               Days Since Last Backup Completed: 533
Last Full NAS Image Backup Completion Date/Time:
Days Since Last Full NAS Image Backup Completed:


These are all filesystems that once existed on the client but were
removed as part of an application upgrade (or system rebuild, or
...), and hence no longer exist. It seems that TSM is taking the
attitude of "if I can't see the filesystem, I'll not do anything
about marking files in that filesystem inactive", so the data never
expires. I can understand the reasoning behind this approach, but it
does mean that there's a large amount of data floating around that is
no longer needed (a quick and dirty estimate says around 83 TB across
primary and copy pools, although some of that needs to stay).

A DELETE FILESPACE will clear them up quickly, obviously, but there's
a twist: how can we identify filesystems like this, short of going
around to each client node and doing a df or equivalent? Searching
the filespaces table gives us some 600 filespaces in total; I *know*
that several of these have to stay. For example, image backups don't
update the backup_end timestamp, and there are some filespaces that
are backed up exclusively with image backups.
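
One way to avoid visiting each client is to work from the server-side
listing itself. The following is only a minimal sketch, assuming the
detailed filespace output has been saved as text laid out like the
listing earlier in this message (one "Label: value" pair per line, a
blank line between filespaces). The function names and the day
threshold are made up for illustration.

```python
def parse_filespaces(text):
    """Split 'Label: value' listing text into one dict per filespace."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current filespace entry.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so times like 20:05:16 survive.
        label, _, value = line.partition(":")
        current[label.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def stale_filespaces(records, min_days):
    """Return (node, filespace, fsid, days) for entries idle >= min_days."""
    stale = []
    for rec in records:
        days = rec.get("Days Since Last Backup Completed", "").replace(",", "")
        if not days.isdigit():
            # Blank field: no completed backup recorded here, so leave
            # the filespace for manual review rather than guessing.
            continue
        if int(days) >= min_days:
            stale.append((rec["Node Name"], rec["Filespace Name"],
                          int(rec["FSID"]), int(days)))
    return stale
```

This only produces a first cut. A filespace that is backed up
exclusively with image backups can still show an old completion date
and be flagged here, so anything it returns needs checking before a
delete.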

At the moment, the best I can come up with is to:
  * use a SELECT statement on the filespaces table to get a "first
    cut":
      select node_name, filespace_name, filespace_id from filespaces
        where backup_end < current_timestamp - N days
  * use QUERY OCCUPANCY on each of the filespaces mentioned in the
    first cut; if the total occupied space is below some threshold,
    ignore it as not being worth the effort;
  * use a SELECT statement on the backups table to confirm that no
    backups have come through in the past N days:
      select 1 from db where exists
        (select object_id from backups
          where node_name='whatever' and filespace_id=whatever
            and state='ACTIVE_VERSION'
            and current_timestamp < backup_date + N days)
    I use exists to try to minimise the effort TSM needs to put into
    the query; I also have the active_version check in there for the
    same reason (if there are only inactive versions, they'll drop
    off the radar anyway in due course). Hopefully TSM's SQL
    execution is optimised to stop in this case when it finds one
    match rather than trying to find all matches ...
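
The three steps above can be strung together once the rows are out of
TSM. This is a rough sketch only: it assumes the first-cut rows, the
per-filespace occupancy totals (in MB) and the "any recent active
backup?" answer have already been fetched by whatever means, and the
names triage, occupancy_mb and has_recent_active are hypothetical.

```python
from datetime import datetime, timedelta


def triage(filespaces, occupancy_mb, has_recent_active,
           now: datetime, n_days: int, min_mb: float):
    """Return (node, filespace, fsid, mb) delete candidates, largest first.

    filespaces: iterable of (node, filespace_name, fsid, backup_end),
        where backup_end is a datetime or None.
    occupancy_mb: dict mapping (node, fsid) to total occupied MB.
    has_recent_active: callable (node, fsid) -> bool; stands in for the
        EXISTS query against the backups table.
    """
    cutoff = now - timedelta(days=n_days)
    candidates = []
    for node, name, fsid, backup_end in filespaces:
        # Step 1: first cut on backup_end. A missing timestamp is skipped,
        # which matches how a NULL behaves in the SQL comparison.
        if backup_end is None or backup_end >= cutoff:
            continue
        # Step 2: ignore filespaces too small to be worth the effort.
        mb = occupancy_mb.get((node, fsid), 0.0)
        if mb < min_mb:
            continue
        # Step 3: drop anything that still has recent active backups.
        if has_recent_active(node, fsid):
            continue
        candidates.append((node, name, fsid, mb))
    # Biggest space savings first.
    return sorted(candidates, key=lambda c: c[3], reverse=True)
```

The result is a list for a human to review, not something to feed
straight into a delete: image-only filespaces still have to be
excluded by hand.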

Does anybody have any better ideas? Unfortunately, because of the
nature of Monash's organisation, simply having central policies
saying "you must do X when shuffling filesystems around" won't cut it
(and let's be honest here - how many sysadmins are likely to remember
such policies, given how infrequent such moves are?)

Yes, I have a call open with IBM support about this. :-) If there's
sufficient interest, I can summarise their eventual response to the
mailing list (so far, it's mostly been clarification of the call, and
a few pointers that match with what we've already done.)

Thanks,

Stuart.