Unable to open drive error number= 46 on AIX, anyone seen before?

RecoveryOne

ADSM.ORG Senior Member
Joined
Mar 15, 2017
Messages
347
Reaction score
81
Points
0
Hey everyone,
So been tracking on this issue for some time now. I've two cases open with IBM, one on the Spectrum Protect side, one on the Tape Support side. Figure I'd ask here as I've exhausted local knowledge in my area, and I'd really like to track down what is causing this!

I've a 3584 with library FW H010, and many LTO6 drives running latest or near latest firmware as recommended by IBM. I was actually down level on the FW and the recommendation was to go to current to see if issues are resolved. On current version of atape as well. HBA's are on latest FW when I last looked a few months back. Walked through everything really that I can think of.

So what I'm seeing is this: ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46
So error number=46 is being passed to TSM from AIX. AIX errpt is clear, fabric reports no errors. If we look at /usr/include/sys/errno.h we see that 46 is 'device not ready'.

In some of my previous posts here, you will notice that I was complaining about tape performance. It wasn't always great in my environment. One of the good things to come of this event was the need/requirement to migrate to a new SAN fabric. We (IBM/myself) were hoping that the new fabric would help eliminate this error as we cut our ISL's in half if not more, and well, newer fabric. Sadly, two days after the conversion it has cropped up again.

When that error number= 46 pops up, depending on the process, the process will terminate. A protect storage pool type=local worker thread will terminate and won't try to open another drive. So if there's 5 protect processes defined, the summary will then run with 4 and a warning message will be printed to actlog. An audit library with checkl=yes will also terminate. A database backup however will report the error and try to open another drive as will a backup storagepool.

The annoying part is, even if Drive01 reports not ready from one processes, another process or the same process can come by and use it like below where Drive01 is used by the same process after a the error (edited due to reasons):
Code:
02/08/2020 07:50:05 ANR0984I Process 189 for BACKUP STORAGE POOL started in the FOREGROUND at 07:50:05. (SESSION: 44850, PROCESS: 189)
02/08/2020 07:50:05 ANR2110I BACKUP STGPOOL started as process 189. (SESSION: 44850, PROCESS: 189)
02/08/2020 07:56:15 ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46.
02/08/2020 07:56:19 ANR8381E LTO volume VOL1 could not be mounted in drive DRIVE01 (/dev/rmt11). (SESSION: 44850, PROCESS: 189)
02/08/2020 07:56:19 ANR1401W Mount request denied for volume VOL1 - mount failed. (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:30 ANR8779E Unable to open drive DRIVE02 (/dev/rmt12), error number= 46. (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:34 ANR8381E LTO volume VOL2 could not be mounted in drive DRIVE02 (/dev/rmt12). (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:34 ANR1401W Mount request denied for volume VOL2 - mount failed. (SESSION: 44850, PROCESS: 189)

But then we loop back to Drive01 and all is fine:
Code:
02/08/2020 08:02:57 ANR8337I LTO volume VOL3 mounted in drive DRIVE01 (/dev/rmt11). (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:57 ANR0513I Process 189 opened output volume VOL3. (SESSION: 44850, PROCESS: 189)

Its not always the same drive being called out. I've seen it occur on all drives. However, it does seem to cluster around Frame 1 and 2 drives 1-4 more so than the other drives.
This issue has persisted from version TSM 8.1.5 to 8.1.8.
So, has anyone encountered anything like this before?
 
Wanted to add, I've gone over the zoning, and paths and everything from drive wwn's to serials match from TSM to AIX to the library. We used to run alt path for the LTO6 drives in case of fabric disruption, but have since abandoned that in order to try and isolate/resolve the issue. I seen some older posts here with AIX 5.x where the media changer was defined twice. I do not believe that to be a problem, as you should be able to select multiple drives as a control path. In addition to LTO5 and higher drives having multiple HBA's per drive kinda makes that a moot point.
 
current Atape version?
It seems that the tape is not even being mounted, I would look at the barcode reader and have it calibrated.
I have seen similair issues whne the barcode reader have not been correctly instaled, in som postions it was not identifying the barcode and the mount would fail.
 
Hogmaster, as far as I and IBM can tell (even had CE on site) there is nothing physically wrong with the library. I did wonder about the camera calibration, but the volumes are actually making it to the drive and no library errors are present. Also the unit passed whatever diag's the CE ran it through when they were onsite.

Atape is 13.0.34.0
AIX 7.1.5.2
Library is a 3584L53 with a D53 and 3 D52 frames with FW H010
Drives are LTO6 with FW KM40.
HBA's are at FW 210313 which as of October 2019 was latest.

I know Atape 13.0.35.0 was released last month.

The odd thing is, the tape is being mounted. I managed to capture this (once again edited for reasons):
Code:
LIBRARY_NAME         DRIVE_NAME            DRIVE_STATE      VOL_NAME      ONLINE          DRV_OWNER
----------------     -----------------     ------------     ---------     -----------     ---------------------
3584LIB              F1D03                 LOADED           VOL1          YES             TSM
3584LIB              F1D04                 LOADED           VOL2          YES             TSM
3584LIB              F1D05                 LOADED           VOL3          YES             TSM
3584LIB              F1D06                 LOADED           VOL4          YES             TSM
3584LIB              F1D07                 LOADED           VOL5          YES             TSM
3584LIB              F2D01                 LOADED           VOL6          YES             TSM

Now running the query again back to back I got this:
Code:
LIBRARY_NAME         DRIVE_NAME            DRIVE_STATE      VOL_NAME      ONLINE          DRV_OWNER
----------------     -----------------     ------------     ---------     -----------     ---------------------
3584LIB              F1D03                 LOADED           VOL1          YES             TSM
3584LIB              F1D04                 LOADED           VOL2          YES             TSM
3584LIB              F1D05                 LOADED           VOL3          YES             TSM
3584LIB              F1D06                 LOADED           VOL4          YES             TSM
3584LIB              F1D07                 LOADED           VOL5          YES             TSM
3584LIB              F2D01                 UNLOADED         VOL6         YES             TSM

Query I used to capture the above info is:
Code:
select cast((library_name)as char(15)) as LIBRARY_NAME, -
cast((DRIVE_NAME)as char(16)) as DRIVE_NAME, -
cast((drive_state)as char(10)) as DRIVE_STATE, -
cast((volume_name)as char(8)) as VOL_NAME, cast((online) -
as char(10)) as ONLINE, -
cast((ALLOCATED_TO)as char(20)) as DRV_OWNER from drives -
order by library_name, drive_name


So, vol6 went unloaded.
Actlog then reported:
Code:
02/10/2020 14:50:35 ANR8779E Unable to open drive F2D01 (/dev/rmt11), error number= 46. (SESSION: 2316, PROCESS: 24)
02/10/2020 14:50:35 ANR8945W Scratch volume mount failed VOL6. (SESSION: 2316, PROCESS: 24)
02/10/2020 14:50:40 ANR8381E LTO volume VOL6 could not be mounted in drive F2D01 (/dev/rmt11). (SESSION: 2316, PROCESS: 24)
02/10/2020 14:50:40 ANR1404W Scratch volume mount request denied - mount failed. (SESSION: 2316, PROCESS: 24)

I watch the drive above go reserved, loading, loaded, unloaded.

Next thought was the internal volume label was wrong. Using itdt to load the tape and then dd to read the header, everything is fine (according to IBM and my own digging up of how LTO tapes headers should be presented).

The above is just one of many examples of this issue. I'm seeing the error happen on scratch volumes and on data volumes.

**Edit: I should clean up the drive names but being asked to make sure that tape serials are at least blanked out. Its too early and don't feel like tracking down rmt to fake drive name :)
 
Back
Top