RecoveryOne
ADSM.ORG Senior Member
- Joined
- Mar 15, 2017
- Messages
- 347
- Reaction score
- 81
- Points
- 0
Hey everyone,
So, I've been tracking this issue for some time now. I have two cases open with IBM, one on the Spectrum Protect side and one on the Tape Support side. Figured I'd ask here since I've exhausted the local knowledge in my area, and I'd really like to track down what's causing this!
I have a 3584 with library FW H010 and a number of LTO6 drives running the latest or near-latest firmware as recommended by IBM. I was actually down-level on the FW, and the recommendation was to go to current to see if the issues resolved. I'm on the current version of Atape as well. The HBAs were on the latest FW when I last checked a few months back. I've walked through everything I can think of, really.
So what I'm seeing is this: ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46
So error number= 46 is being passed to TSM from AIX. AIX errpt is clear, and the fabric reports no errors. If we look at /usr/include/sys/errno.h, we see that 46 is ENOTREADY, 'device not ready'.
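For anyone wanting to double-check that errno-to-name mapping themselves, here's a small Python sketch that pulls the symbolic name out of errno.h-style `#define` lines. The header excerpt is illustrative only (exact contents vary by AIX release); the `ENOTREADY = 46` line matches the value behind the ANR8779E above.

```python
import re

# Illustrative excerpt of an AIX /usr/include/sys/errno.h (assumption:
# real contents vary by release; ENOTREADY = 46 matches the error seen).
HEADER = """
#define EBUSY     16  /* Resource busy */
#define ENOTREADY 46  /* Device not ready */
#define ETIMEDOUT 78  /* Connection timed out */
"""

def errno_name(num, header_text):
    """Return the symbolic name #define'd to the given errno value, or None."""
    for sym, val in re.findall(r"#define\s+(E\w+)\s+(\d+)", header_text):
        if int(val) == num:
            return sym
    return None

print(errno_name(46, HEADER))  # prints ENOTREADY
```

Pointing the same parse at the real header on the server (or just `grep 46 /usr/include/sys/errno.h`) confirms what TSM is being told by the OS.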
In some of my previous posts here, you'll notice I was complaining about tape performance; it wasn't always great in my environment. One of the good things to come out of this event was the need to migrate to a new SAN fabric. We (IBM and myself) were hoping the new fabric would help eliminate this error, since we cut our ISLs in half if not more, and, well, newer fabric. Sadly, two days after the conversion it cropped up again.
When that error number= 46 pops up, the process may terminate, depending on which process it is. A protect storage pool type=local worker thread will terminate and won't try to open another drive, so if there are 5 protect processes defined, the summary will then run with 4 and a warning message is printed to the actlog. An audit library with checkl=yes will also terminate. A database backup, however, will report the error and try to open another drive, as will a backup storagepool.
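The difference between those behaviors boils down to whether the process retries with another drive after a failed open. Here's a hypothetical sketch of that distinction; none of these names are TSM internals, it's just a way of picturing give-up vs. try-the-next-drive.

```python
# Hypothetical model of the behaviors described above (not TSM internals):
# a "protect stgpool" worker terminates on the first failed open, while a
# db backup / backup stgpool logs the failure and tries another drive.
class DriveNotReady(Exception):
    """Stands in for the errno 46 (device not ready) open failure."""

def open_drive(name, ready):
    if not ready.get(name, False):
        raise DriveNotReady(name)
    return name

def acquire_drive(drives, ready, retry_next_drive):
    for drive in drives:
        try:
            return open_drive(drive, ready)
        except DriveNotReady:
            if not retry_next_drive:
                return None  # worker thread terminates, no retry
            # db backup behavior: report the failure, try another drive
    return None

ready = {"DRIVE01": False, "DRIVE02": True}
print(acquire_drive(["DRIVE01", "DRIVE02"], ready, retry_next_drive=False))  # None
print(acquire_drive(["DRIVE01", "DRIVE02"], ready, retry_next_drive=True))   # DRIVE02
```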
The annoying part is that even if Drive01 reports not ready to one process, another process, or the same one, can come along and use it, like below, where Drive01 is used by the same process after the error (edited due to reasons):
Code:
02/08/2020 07:50:05 ANR0984I Process 189 for BACKUP STORAGE POOL started in the FOREGROUND at 07:50:05. (SESSION: 44850, PROCESS: 189)
02/08/2020 07:50:05 ANR2110I BACKUP STGPOOL started as process 189. (SESSION: 44850, PROCESS: 189)
02/08/2020 07:56:15 ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46.
02/08/2020 07:56:19 ANR8381E LTO volume VOL1 could not be mounted in drive DRIVE01 (/dev/rmt11). (SESSION: 44850, PROCESS: 189)
02/08/2020 07:56:19 ANR1401W Mount request denied for volume VOL1 - mount failed. (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:30 ANR8779E Unable to open drive DRIVE02 (/dev/rmt12), error number= 46. (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:34 ANR8381E LTO volume VOL2 could not be mounted in drive DRIVE02 (/dev/rmt12). (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:34 ANR1401W Mount request denied for volume VOL2 - mount failed. (SESSION: 44850, PROCESS: 189)
But then we loop back to Drive01 and all is fine:
Code:
02/08/2020 08:02:57 ANR8337I LTO volume VOL3 mounted in drive DRIVE01 (/dev/rmt11). (SESSION: 44850, PROCESS: 189)
02/08/2020 08:02:57 ANR0513I Process 189 opened output volume VOL3. (SESSION: 44850, PROCESS: 189)
It's not always the same drive being called out; I've seen it occur on all drives. However, it does seem to cluster around Frame 1 and 2, drives 1-4, more so than the other drives.
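To put numbers on that clustering, one option is to tally the ANR8779E messages per drive from a saved actlog capture. A quick sketch, using an illustrative excerpt in the same message format as the output above (in practice you'd feed in a full `query actlog` dump):

```python
import re
from collections import Counter

# Illustrative actlog excerpt (same ANR8779E format as shown above);
# substitute a real "query actlog" capture here.
ACTLOG = """\
02/08/2020 07:56:15 ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46.
02/08/2020 08:02:30 ANR8779E Unable to open drive DRIVE02 (/dev/rmt12), error number= 46.
02/09/2020 03:11:02 ANR8779E Unable to open drive DRIVE01 (/dev/rmt11), error number= 46.
"""

# Count occurrences of the open failure per drive name.
counts = Counter(re.findall(r"ANR8779E Unable to open drive (\S+)", ACTLOG))
print(counts.most_common())  # [('DRIVE01', 2), ('DRIVE02', 1)]
```

If the counts really do pile up on Frame 1 and 2, drives 1-4, that's useful ammunition for the hardware side of the case.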
This issue has persisted from TSM version 8.1.5 through 8.1.8.
So, has anyone encountered anything like this before?