TSM 6.3 volumes and physical disk discrepancy

Hi TSM friends,

I have an issue with a TSM server v6.3. Basically we have 100TB of storage space, of which only 80TB is visible in TSM. We found that some volumes are, for example, 1.4TB on the physical side, but TSM sees them as 246GB and shows them 98% utilized. It is really a strange thing; we tried to reclaim them, audit the space, etc., but with no success. Deduplication was activated on the server and we deactivated it as a test, but still with no success.

Has anyone had a case like this, where part of a volume's capacity goes missing, if that is the right way to put it...

:mad:
 
I am not sure what device class you are using. If you are talking about deduplication, it is probably a FILE device class. But since you are talking about a 1.4TB volume, it sounds like a DISK device class to me.
What you are showing here probably means that you are using a DISK device class and have caching turned on for the storage pool.
Correct?
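If you are not sure yourself, these two should show it (YOURPOOL and YOURDEVCLASS are just placeholders for your actual names):

q stgpool YOURPOOL f=d (look at 'Device Class Name' and 'Cache Migrated Files?')
q devclass YOURDEVCLASS f=d (look at 'Device Type' - DISK or FILE)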


 
Hi Mita, that is correct. We actually found the issue this morning: the MAXSCRATCH parameter was set really high, which resulted in a large number of scratch volumes being created. We had also created some volumes manually, which confused the TSM server. After changing MAXSCRATCH to a more sensible number, the storage pool utilization is showing normal again and we can see all the space as usual. We went from 98% to 76% storage pool utilization within a couple of minutes.
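If it helps anyone, the check and the fix were basically just the storage pool parameter, roughly like this (YOURPOOL and the number are placeholders - pick a limit that matches the volumes you really want):

q stgpool YOURPOOL f=d (look at 'Maximum Scratch Volumes Allowed')
update stgpool YOURPOOL maxscratch=50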
 
Now there is a mess afterwards, as we are constantly getting sessions in MediaW state. That is strange, since we don't actually have tapes; everything goes into the storage pool. We suspect that the main cause is that the server was creating volumes by itself while we also added some volumes manually, and now that we have changed that, the server keeps searching for those volumes and the sessions stay in MediaW status.

Any ideas ?
 
You are mixing a few issues together.

As for the MediaW status, first verify that all DISK volumes are online with 'q vol', especially those belonging to the storage pool in question.

Identify the sessions in MediaW status with 'q ses' and investigate the activity log with 'q actlog search=YOURSESSIONID begindate=-1'; there should be some hint there. Begindate is needed because 'q actlog' by default only looks at the last hour or so.
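Roughly like this, with YOURPOOL and the session number as placeholders:

q vol stgpool=YOURPOOL (check the volume status / access of the DISK volumes)
q ses f=d (note the session numbers stuck in MediaW)
q actlog search=12345 begindate=-1 (repeat for each suspicious session number)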
 
Hey, you are right that I mixed different issues together. As for 'q vol', I did that and there is no 'online' status; the volumes show as FULL or Filling. The strange thing is that when one shows FULL, it actually reports 89% utilization...
The volumes are part of a storage pool which is set to sequential access with collocation by node switched on (if that has any meaning for this situation).
 
You are confusing people here by skipping from issue to issue.

So they are treated like tapes - fine. First verify the status of the volumes with 'select volume_name,access from volumes'....

If a process waits for a tape, usually all suitable drives are busy; you can verify that with 'q pr', or indirectly with 'q mo'...
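For example (YOURPOOL is a placeholder; pool names are stored in upper case in the SQL tables):

select volume_name,access from volumes where stgpool_name='YOURPOOL'
q pr (any migration / reclamation / move data processes holding volumes?)
q mo (what is mounted right now)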
 
Well, it actually looks confusing because first we noticed the issue with the data discrepancy, then we changed the MAXSCRATCH parameter and it was fine, and then the other issues appeared as sessions started going into MediaW status. We are not using tapes; we back up directly to our storage pool with no further destination, and we define the volumes on disks from our SAN. I suspect our server was built with the wrong settings, as we tried to use deduplication, and the storage pool is currently set to sequential access for that reason.

I apologize if the thread became messy, that was not my intention, but issue after issue came up and I got confused by it all myself.

So the current status is: while the MAXSCRATCH parameter was set to a big number, TSM was creating volumes by itself, showing them as scratch, and we were missing 20TB of free space because of that. We had 1.4TB volumes that were showing as 245GB volumes from the TSM side. After we lowered that parameter the storage pool size normalized and we gained back those 20TB, but now the backups have started to fail because they are searching for those old volumes, and I don't know which parameter is making them do so....

Any help will be appreciated.
 
now the backups started to fail because they are searching for those old volumes.

Can you back this up? You must have some log entry claiming something like "trying to mount volume XYZ but it is not available". Can you post such text/log?

That way you would be able to list the volumes that are being requested but cannot be mounted - if it really is the way you say....
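If you know one of the volume names involved, something like this should pull the related messages out of the activity log (VOLNAME is a placeholder, and begindate can only go back as far as your activity log retention):

q actlog begindate=-3 search=VOLNAME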
 
Those errors appeared a couple of days ago; we rebooted the machine and they disappeared, and I cannot locate them in the log because we don't keep logs that long. Now the issue is that we are still not seeing those extra 20TB from the TSM side, and on top of that a lot of sessions are staying in MediaW status.
 
Well, check RESOURCEUTILIZATION in dsm.opt (on the client, probably) and/or review the output of 'select node_name,MAX_MP_ALLOWED from nodes' - I presume you know which nodes are problematic. Strictly speaking this applies to real tapes, but perhaps it works the same way for your disk volumes.

I am not sure, but by default RESOURCEUTILIZATION=2.

If you run 'q ses' you will see which nodes have sessions in MediaW status and how many.

As for the lost TBs, can you post the output of 'q vol XYZ f=d' here - the first lines would be enough - and then, from the shell (of the operating system, if it is a unix-like system), 'ls -al XYZ'?
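I mean something like this, with VOLNAME being the volume name exactly as 'q vol' shows it:

q vol VOLNAME f=d
ls -al VOLNAME (or 'dir' on the file, if it turns out to be a Windows box)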
 
We set resource utilization to 4 by default, the select gave me the exact same number for all nodes - 5 - and the device class is currently at 150 mount points allowed. I read on the IBM site that you should check the number of clients you have and adjust accordingly. There are also options for the mount wait period and mount retention period; if those are changed so that sessions are cancelled earlier instead of waiting too long, and the mount points allowed are increased to 300, which is roughly the number of clients, maybe that will help.

So far I have only tried increasing it, but most sessions still stay in MediaW for more than 60 minutes. The strange thing is that IBM says the default value is 60 minutes for both the mount retention period and the mount wait period, but neither is set explicitly here and the waits go far beyond 60 minutes.
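For reference, the mount point limit sits on the device class; what I changed was roughly this (FILEDEV is a placeholder for our device class name):

q devclass FILEDEV f=d (shows the current 'Mount Limit')
update devclass FILEDEV mountlimit=300

As far as I can tell, mount retention and mount wait are not parameters of a FILE device class at all, so maybe they only apply to tape-type classes - please correct me if that is wrong.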
 
I was able to get idle session details:

Sess Number: 20,185
Comm. Method: Tcp/Ip
Sess State: MediaW
Wait Time: 4.8 M
Bytes Sent: 4.8 K
Bytes Recvd: 2.9 K
Sess Type: Node
Platform: TDP MSSQL Win64
Client Name: client233_SQL
Media Access Status: Waiting for output volume(s):
BACKUPPOOL,O:\TSMDATA\SERVER1\BACKUPPOOL\DISK08.DSM,(289 Seconds)
User Name:
Date/Time First Data Sent: 09/05/2013 07:11:40
Proxy By Storage Agent:

Sess Number: 17,276
Comm. Method: Tcp/Ip
Sess State: MediaW
Wait Time: 5.1 H
Bytes Sent: 405
Bytes Recvd: 971
Sess Type: Node
Platform: WinNT
Client Name: Clinet11
Media Access Status: Waiting for output volume(s):
BACKUPPOOL,M:\TSMDATA\SERVER1\BACKUPPOOL\DISK12.DSM,(18450 Seconds)
User Name:
Date/Time First Data Sent: 09/05/2013 02:11:05

More than 300 sessions are like this, pointing to random volumes and waiting.

Here is an example of the volume status:

O:\TSMDATA\SERVER1\BACKUPPOOL\DISK08.DSM   BACKUPPOOL   INCR   1,023.8 G   28.0   Filling
M:\TSMDATA\SERVER1\BACKUPPOOL\DISK12.DSM   BACKUPPOOL   INCR     496.0 G   93.4   Filling
 
I looked at our own DISK volumes - strange thing - by default we set 2TB volumes, and some of them are 2TB (with low utilization) while some are smaller (with more or less random utilization).
But in both cases the size on the filesystem matches what is presented from within TSM.
Perhaps there are two types of volumes, one of them "auto-expanding".

So look at their size in the filesystem, check whether they really are smaller than the target value, and then make sure there is free space in the filesystem for them to grow.
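One way to compare both sides, with VOLNAME as a placeholder:

q dirspace (free space per FILE device class directory, if you are on a FILE device class)
q vol VOLNAME f=d (the 'Estimated Capacity' TSM believes in)

and then the size of the same file and the free space at the operating system level (ls -l / df on unix, dir on Windows).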
 
Did you check the utilization (not occupancy) of your disks? Their capacity (I/O speed) is probably limited, so there is no point in overloading them with too many concurrent backups.

Also, did you check the running processes with 'q pr'? Blocked migration and reclamation can exhaust mount points - at least with real tapes this can happen sometimes.

Is it possible that you start the same backups while the previous backup is still running?
 
That's the confusing part: some volumes are not shown with their correct size. From TSM we see them as 234GB while the volumes are actually 1.4TB, so TSM thinks the volume is full. I am wondering why the sessions keep trying to access the same volume. When we raise the MAXSCRATCH number, the sessions work fine, but our storage space disappears even further...
 
Well, I don't know...

What about filesystem corruption?

You might try to vary offline/online some of these volumes....
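For a random-access DISK volume that would be, with VOLNAME as a placeholder:

vary offline VOLNAME
vary online VOLNAME

Since your pool is sequential access, the closer equivalent is probably toggling the volume access instead:

update volume VOLNAME access=readonly
update volume VOLNAME access=readwrite

and 'audit volume VOLNAME fix=no' if you really suspect corruption.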
 
Well, we have started to suspect that because the server was set up incorrectly in the beginning, the volumes may be corrupted, and we will need to redo it after moving those 95-96TB of data to a backup location. We contacted IBM as well, but they are as slow as possible with their support efforts.

We will see. Thanks for the help so far.
 
OK, there's not all that much information here, but I recognize a couple of things going on. The likely reason that the 2TB volumes are smaller than 2TB is that the file system most likely filled up. You should drop the 2TB volume size down to 50GB; there are a number of reasons to keep them at that size. You also need to take a look at the mount limits. In a FILE device class only one process can write to a volume at a time. That's why you need the volumes smaller and the mount limit increased - that will stop the media wait issue. Make sure you reclaim the volumes too.

Another thing to consider is that the device class may be used for other things, like database backups. If you don't manage your volume history properly you could have a bunch of old database backup volumes sitting in the same file system.
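Roughly like this, if I use BACKUPPOOL from your 'q ses' output and FILEDEV as a placeholder for the device class name (the exact values depend on your environment):

update devclass FILEDEV maxcapacity=50G mountlimit=300 (new volumes will be created at 50GB; existing ones keep their size)
reclaim stgpool BACKUPPOOL threshold=60 wait=yes
q volhistory type=dbbackup (see how many old database backup volumes are still registered)
delete volhistory type=dbbackup todate=today-30 (then prune the old entries)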
 
Hi,

Currently we are at a 150 mount point limit, since I saw in an IBM article that you should not have more mount points than volumes. Now the question is: if we shrink those 2TB volumes to 50GB volumes, the number of volumes will increase drastically - what are the other reasons to make them so small? Is a sequential access pool necessary for deduplication? All the issues now come from the fact that we set up sequential access and, from what I am hearing from everyone here, it looks like we don't know how to set it up properly. All our other servers are random access. The strange thing is that we have another 6.3.4 server with sequential access and it has no problems, but it holds much less data; otherwise the volumes are designed the same way. Is that because we have already messed up the volumes and need to start from the beginning? I know IBM says deduplication works only with the FILE class, but it looks like the teammate who set up our servers is not aware of how sequential-pool-based servers should be set up.

Do you have any other advice on how to tune this one? I appreciate your time.
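For reference, my understanding is that if we rebuild from scratch, a deduplicated pool has to sit on a FILE device class, and the definition would look roughly like this (all names, sizes and directories below are placeholders - please correct me if the dedup part is wrong):

define devclass DEDUPFILE devtype=file maxcapacity=50G mountlimit=300 directory=D:\TSMDATA\STG01,E:\TSMDATA\STG02
define stgpool DEDUPPOOL DEDUPFILE maxscratch=200 deduplicate=yes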
 