TSM operations fail after power outage

WalterITD

G'Day TSM Gurus,

I'm hoping someone could point me in the right direction regarding an issue I am experiencing atm.

Over the weekend we had a power outage and our old TSM server was shut down abruptly for a few hours;
since then nothing seems to work.

System Details:

- TSM Ver.5, Rel 4, Level 2 running on AIX (yes it's old - will be decommissioned in the next year or so)
- 2x LTO4 tape drives Fibre connected

Issues:

- Using the command line from DSMADMC - cannot eject tapes going to vault or insert scratch tapes
(Directly from the library interface I was able to move the tapes out, but when I run a q libvol those manually ejected tapes are still listed there)

- All paths, drives and library appear online

- q act shows:
03/20/19 10:21:24 ANR8840E Unable to open device /dev/smc0 with error 6. (PROCESS: 4)
03/20/19 10:21:24 ANR8441E Initialization failed for SCSI library TS3310. (PROCESS: 4)
03/20/19 10:21:24 ANR1401W Mount request denied for volume WU0001L3 - mount failed. (PROCESS: 4)

- Can successfully ping library from TSM server

Paths are online:

tsm: TSM>q path f=d

Source Name: TSM
Source Type: SERVER
Destination Name: TS3310
Destination Type: LIBRARY
Library:
Node Name:
Device: /dev/smc0
External Manager:
LUN:
Initiator: 0
Directory:
On-Line: Yes
Last Update by (administrator): ADMIN
Last Update Date/Time: 09/25/14 08:34:56

Source Name: TSM
Source Type: SERVER
Destination Name: US_LTO4_00
Destination Type: DRIVE
Library: TS3310
Node Name:
Device: /dev/rmt0
External Manager:
LUN:
Initiator: 0
Directory:
On-Line: Yes
Last Update by (administrator): ADMIN
Last Update Date/Time: 09/25/14 08:36:42

Source Name: TSM
Source Type: SERVER
Destination Name: US_LTO4_01
Destination Type: DRIVE
Library: TS3310
Node Name:
Device: /dev/rmt1
External Manager:
LUN:
Initiator: 0
Directory:
On-Line: Yes
Last Update by (administrator): ADMIN
Last Update Date/Time: 09/25/14 08:36:57

Drives appear to be normal and online

tsm: TSM>q drive f=d

Library Name: TS3310
Drive Name: US_LTO4_00
Device Type: LTO
On-Line: Yes
Read Formats: ULTRIUM4C,ULTRIUM4,ULTRIUM3C,ULTRIUM3,ULT-
RIUM2C,ULTRIUM2
Write Formats: ULTRIUM4C,ULTRIUM4,ULTRIUM3C,ULTRIUM3
Element: 256
Drive State: UNKNOWN
Volume Name:
Allocated to:
WWN: 500308C09E998000
Serial Number: 1310159398
Last Update by (administrator): ADMIN
Last Update Date/Time: 09/25/14 08:36:42
Cleaning Frequency (Gigabytes/ASNEEDED/NONE): NONE

Library Name: TS3310
Drive Name: US_LTO4_01
Device Type: LTO
On-Line: Yes
Read Formats: ULTRIUM4C,ULTRIUM4,ULTRIUM3C,ULTRIUM3,ULT-
RIUM2C,ULTRIUM2
Write Formats: ULTRIUM4C,ULTRIUM4,ULTRIUM3C,ULTRIUM3
Element: 257
Drive State: UNKNOWN
Volume Name:
Allocated to:
WWN: 500308C09E998004
Serial Number: 1310174122
Last Update by (administrator): ADMIN
Last Update Date/Time: 09/25/14 08:36:57
Cleaning Frequency (Gigabytes/ASNEEDED/NONE): NONE

I tried searching for details before posting here but wasn't able to find anything relevant,
hope someone here can help.

Looking forward to your feedback

Thanks in advance!!

Cheers
 

ANR8840E Unable to open device /dev/smc0 with error 6
Error 6 is an AIX error code:
# grep 6 /usr/include/sys/errno.h
/* $Header: @(#) AIX71D_area/1 bos/kernel/sys/errno.h, incstd, aix71D, 1123A_71D 2011-06-02T03:26:47--01:00$ */
* (C) COPYRIGHT International Business Machines Corp. 1985, 1996
#define ENXIO 6 /* No such device or address */
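A narrower version of the same lookup, as a minimal sketch: it matches the error number as a whole field, so header comments that merely contain a 6 do not show up. The function name `errno_name` and its optional second argument are made up for illustration; the default path is the one used in the grep above.

```shell
# errno_name NUM [HEADER]
# Print the "#define" line whose value is exactly NUM.
# HEADER defaults to the AIX path used above; pass another file to try it elsewhere.
errno_name() {
    num=$1
    hdr=${2:-/usr/include/sys/errno.h}
    # Field 1 must be "#define" and field 3 must equal NUM (6 matches, 16 and 1996 do not)
    awk -v n="$num" '$1 == "#define" && $3 == n { print }' "$hdr"
}

# Example on the AIX host: errno_name 6
```

This only helps with reading the code; it says nothing about why AIX cannot find the device.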
 

thx for the reply Marclant,

So AIX error 6 says device /dev/smc0 (which is the TSM library) doesn't exist, but if I do a q path it shows as online and I can ping the library from the AIX server.

Any ideas on how to fix this, or how to check/confirm whether some hardware is dead?

Thx
 

Essentially, when TSM tries to access smc0, AIX tells TSM that it doesn't exist. So it has nothing to do with the paths in TSM. You'd have to check that at the AIX level; it's been too long since I worked with a library on AIX.

Maybe reboot the library and AIX and hope it comes up. You may have to run the configuration manager (cfgmgr) in AIX to detect it.
 

You need to look at your AIX devices. Start with your tape devices and your HBAs.
What version of AIX are you running? I think most of these commands should work on 5.3, but I don't have any left to verify.

On AIX:
lsdev -Cc tape
Should end up with something like:
rmt0 Available 02-00-01-PRI IBM 3580 Ultrium Tape Drive (FCP)
rmt1 Available 02-00-01-PRI IBM 3580 Ultrium Tape Drive (FCP)
rmt2 Available 02-00-01-PRI IBM 3580 Ultrium Tape Drive (FCP)
smc0 Available 02-00-01 IBM 3576 Library Medium Changer (FCP)

If the above says Defined instead of Available, AIX still knows about the device but can't reach it right now, so I'd look at the library to make sure the tape drive is healthy. The TS3310 should also be able to tell you whether it sees a connection to the fabric. If the TS3310 is healthy, then let's poke the HBAs.
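To pull just the problem devices out of that listing, something like this rough sketch could help. The helper name `not_available` is made up for illustration, and it assumes the state is the second column, as in the lsdev output above.

```shell
# not_available: read `lsdev -Cc <class>` output on stdin and print the
# name and state of every device whose state is not "Available".
not_available() {
    # NF >= 2 skips blank lines; $1 is the device name, $2 its state
    awk 'NF >= 2 && $2 != "Available" { print $1, $2 }'
}

# Example on the AIX host: lsdev -Cc tape | not_available
```

No output means every device in that class is Available.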

Check your hba's.
lsdev -Cc adapter | grep fcs
fcs0 Available 02-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
fcs1 Available 02-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
fcs2 Available 0C-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)
fcs3 Available 0C-01 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)

Let's take a closer look at your adapters to make sure they are connected:
# fcstat fcs0 | grep Attention
Attention Type: Link Up

(or use fcstat fcsX and review the full details, errors, topology, etc.)

OK, let's loop back to tape, assuming they are all Available.
lscfg -vpl rmtX (or smcX)
That should return your serial numbers as AIX sees them. If there's a mismatch between what AIX sees and what TSM sees, redefine the drives in TSM to match AIX.
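A small sketch of that comparison, under assumptions: the helper names `vpd_serial` and `same_serial` are made up, and the parsing assumes the usual VPD layout where the label is padded with dots (for example `Serial Number...............1310159398`). The TSM side is the Serial Number field from q drive f=d.

```shell
# vpd_serial: read `lscfg -vpl <device>` output on stdin and print the value
# of the first "Serial Number" field (nothing if the field is absent).
vpd_serial() {
    # Strip leading blanks, the label, and its dot padding; keep the value
    sed -n 's/^[[:space:]]*Serial Number\.\.*//p' | head -n 1
}

# same_serial AIX_SERIAL TSM_SERIAL
# Succeeds only if the AIX serial is non-empty and equal to the TSM one.
same_serial() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example on the AIX host, with the serial TSM reports for the drive:
#   same_serial "$(lscfg -vpl rmt0 | vpd_serial)" 1310159398 || echo "mismatch"
```

An empty AIX serial counts as a mismatch here, since a device that returns no VPD cannot be confirmed either way.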

Double check which tape is the control path in the 3310. Make sure that's the SN you are seeing on AIX for smc0.

If you need additional help, feel free to ask. I've two libraries (TS4500 and a TS3310) running on my server :)
I'm a bit tied up so can't walk through all the steps right now, but the above should get you going.

"been too long since I worked with a library on AIX."
marclant, that makes me sad :(

**edit
If all else fails, check out IBM's ITDT http://www-01.ibm.com/support/docview.wss?uid=ssg1S4000662
 

Hi RecoveryOne,

Thanks so much for your suggestions!

Something is definitely broken.

I pulled the info based on your notes and here are my results:

Although the 3310 library reports the drives online, I get a "Defined" status for each.
Also, not sure what's up with the 2 libraries (smc0 & smc1?)

# lsdev -Cc tape
rmt0 Defined 00-08-02 IBM 3580 Ultrium Tape Drive (FCP)
rmt1 Defined 00-09-02 IBM 3580 Ultrium Tape Drive (FCP)
smc0 Defined 00-08-02-PRI IBM 3576 Library Medium Changer (FCP)
smc1 Defined 00-09-02-ALT IBM 3576 Library Medium Changer (FCP)


While checking the fibre channel adaptors I do not get an "Attention" result.

# lsdev -Cc adapter | grep fcs
fcs0 Available 00-08 FC Adapter
fcs1 Available 00-09 FC Adapter

root@nim(/home/root) # fcstat fcs0 | grep Attention
root@nim(/home/root) #
root@nim(/home/root) # fcstat fcs1 | grep Attention
root@nim(/home/root) #

Querying both channels with fcstat, I see:

Link Failure Count: 1
Loss of Sync Count: 1

# fcstat fcs0

FIBRE CHANNEL STATISTICS REPORT: fcs0

Device Type: FC Adapter (df1000fd)
Serial Number: 1C90608129
Option ROM Version: 02C82774
Firmware Version: B1F2.70A5
World Wide Node Name: 0x20000000C984D426
World Wide Port Name: 0x10000000C984D426

FC-4 TYPES:
Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
Active: 0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running): 4 GBIT
Port FC ID: 0x050000
Port Type: Fabric

Seconds Since Last Reset: 0

Transmit Statistics Receive Statistics
------------------- ------------------
Frames: 26 26
Words: 2816 2816

LIP Count: 0
NOS Count: 0
Error Frames: 0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 1
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 8
Invalid CRC Count: 0

IP over FC Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0
No Command Resource Count: 0

IP over FC Traffic Statistics
Input Requests: 0
Output Requests: 0
Control Requests: 0
Input Bytes: 0
Output Bytes: 0

FC SCSI Traffic Statistics
Input Requests: 0
Output Requests: 0
Control Requests: 0
Input Bytes: 0
Output Bytes: 0


# fcstat fcs1

FIBRE CHANNEL STATISTICS REPORT: fcs1

Device Type: FC Adapter (df1000fd)
Serial Number: 1C90608129
Option ROM Version: 02C82774
Firmware Version: B1F2.70A5
World Wide Node Name: 0x20000000C984D427
World Wide Port Name: 0x10000000C984D427

FC-4 TYPES:
Supported: 0x0000012000000000000000000000000000000000000000000000000000000000
Active: 0x0000010000000000000000000000000000000000000000000000000000000000
Class of Service: 3
Port Speed (supported): 4 GBIT
Port Speed (running): 4 GBIT
Port FC ID: 0x060000
Port Type: Fabric

Seconds Since Last Reset: 0

Transmit Statistics Receive Statistics
------------------- ------------------
Frames: 35 35
Words: 2816 3072

LIP Count: 0
NOS Count: 0
Error Frames: 0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 1
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 8
Invalid CRC Count: 0

IP over FC Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0

FC SCSI Adapter Driver Information
No DMA Resource Count: 0
No Adapter Elements Count: 0
No Command Resource Count: 0

IP over FC Traffic Statistics
Input Requests: 0
Output Requests: 0
Control Requests: 0
Input Bytes: 0
Output Bytes: 0

FC SCSI Traffic Statistics
Input Requests: 0
Output Requests: 0
Control Requests: 0
Input Bytes: 0
Output Bytes: 0

Querying the drives returns no serial numbers:

# lscfg -vpl rmt0

PLATFORM SPECIFIC

Name: tape
Node: tape
Device Type: byte

# lscfg -vpl rmt1

PLATFORM SPECIFIC

Name: tape
Node: tape
Device Type: byte


Library and server have been rebooted several times but issue persists.

Library interface says all is normal (library + drives online)

I will continue my research based on this

Thanks again in advance for any further insight!

Cheers
 

**NOTE: All my commands are valid for AIX 7.1, and likely later TLs of 6.1. I have no lower AIX systems to compare against, so finding alternate commands that will work is your responsibility :)

What version of AIX are you running? That may explain why the link isn't showing.
How much of your infrastructure lost power?

Let's take a look at errpt before going further.
errpt -a | more can tell us a bit more about what's going on, I'd hope. THIS MIGHT BE A VERY VERY LONG LIST!!!
May see something like:
Code:
---------------------------------------------------------------------------
LABEL:          FCA_ERR6
IDENTIFIER:     ECCE4018

Date/Time:       Wed Mar 20 08:00:09 2019
Sequence Number: 364
Machine Id:      00061D72D900
Node Id:         tsmdev
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   fcs1

Description
SOFTWARE PROGRAM ERROR

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES
As you can see, my fcs1 is being called out. The error above is not an issue in my env, as I have REALLY old HBAs connected to a 16Gb SAN switch just because :)

What errpt shows will direct further actions. If errpt just shows the system reboot and normal AIX startup, continue on. If you have questions about what is being displayed, feel free to post (or shoot me a direct message if you like, and remove any sensitive data) and I will attempt to help as best I can.
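Since errpt -a can run very long, one way to get an overview first is to count entries per resource from the one-line summary that plain errpt prints. This is only a sketch: the helper name `errpt_by_resource` is made up, and it assumes the summary columns are IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION.

```shell
# errpt_by_resource: read plain `errpt` summary output on stdin and print
# "<count> <resource>" for each resource name, busiest first.
errpt_by_resource() {
    # Skip the header line; field 5 is RESOURCE_NAME
    awk '$1 != "IDENTIFIER" && NF >= 5 { count[$5]++ }
         END { for (r in count) print count[r], r }' | sort -rn
}

# Example on the AIX host:
#   errpt | errpt_by_resource
# Then read the full detail for one resource only:
#   errpt -a -N fcs1 | more
```

Anything naming fcs0, fcs1, rmtX or smcX is worth reading in full before moving on to the switch.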

# lsdev -Cc tape
rmt0 Defined 00-08-02 IBM 3580 Ultrium Tape Drive (FCP)
rmt1 Defined 00-09-02 IBM 3580 Ultrium Tape Drive (FCP)
smc0 Defined 00-08-02-PRI IBM 3576 Library Medium Changer (FCP)
smc1 Defined 00-09-02-ALT IBM 3576 Library Medium Changer (FCP)
So, AIX is aware of them but is not seeing anything.
Are these direct-attached, or are you using a SAN switch? From your fcstat output it looks like you are on a switch and you have a pair of them: "Port FC ID: 0x050000" and "Port FC ID: 0x060000". Then again, this could be what AIX shows when nothing is plugged in... I honestly do not know off the top of my head.
If SAN, is the switch functional? Any errors such as FC_FAIL? If so, resolve :)
Double-check your SAN zoning. Did the switch go down? Did it drop its config?

Once you have looked at errpt and your switches, lets work back from the TS3310:

I think most firmware versions should get you to this area, or something fairly similar:
[Screenshot 1553108374239.png: TS3310 interface, Manage Drives > Fiber Channel Ports]
If you can't see it, that's Manage Drives > Fiber Channel Ports. Confirm Topology with your SAN team if using a switch. You should be able to see the link speed. I have 3 drives not connected to anything, as you can see by the 'auto' and 'unknown' lines.

Go up to Drive IDs; all show ONLINE, right? I have one not ready :) As you can see below, -1,1 shows online, but if you look above in the first screen grab, it's not connected! Yes, I physically do not have anything plugged into the back of said drive.
[Screenshot 1553108474692.png: TS3310 interface, Drive IDs]

Go up one more to Control Paths; this is the /dev/smcX that you are seeing:
[Screenshot 1553108534615.png: TS3310 interface, Control Paths]
In my case, logical lib_a has two control paths. In your case, I bet you have two as well, and AIX sees them as ALTernate (smc1 for you) and PRImary (smc0). Or you have path failover, which I sort of doubt, as you mentioned the drives are LTO4 and it wasn't till LTO5 that they came out with redundant paths (could be mistaken).

If there are no SAN issues, or the devices are direct-attached, let's poke your HBAs with a pointy stick via the diagnostics!
As root:
smitty diag > Current Shell Diagnostics > (read the screen Enter to continue) > Resource Selection
This should take you to a screen that looks something like:
Code:
RESOURCE SELECTION LIST                                                   801006

From the list below, select any number of resources by moving
the cursor to the resource and pressing 'Enter'.
To cancel the selection, press 'Enter' again.
To list the supported tasks for the resource highlighted, press 'List'.

Once all selections have been made, press 'Commit'.
To avoid selecting a resource, press 'Previous Menu'.


[TOP]
  All Resources
      This selection will select all the resources currently displayed.
  sys0                                 System Object
  sysplanar0                           System Planar
  vio0                                 Virtual I/O Bus
                 U8203.E4A.1061D72-
  vsa0             V1-C0               LPAR Virtual Serial Adapter
  vty0             V1-C0-L0            Asynchronous Terminal
[MORE...53]

F1=Help             F4=List             F7=Commit           F10=Exit
F3=Previous Menu
Arrow down till you find your HBA's and they are highlighted, then hit enter to put a + in front of the line.
then F7 to commit (if using putty or other term emulator Esc-7 will work).
Next screen will be Run Diagnostics hit enter.
Next screen you will need to arrow down to Problem Determination, then enter.
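This will look something like:
[Screenshot 1553110339620.png: diagnostics running in Problem Determination mode]
At the very end, if all is OK:
[Screenshot 1553110429117.png: diagnostics completion screen]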

In short, lots of things to look at.
My bet is the SAN switches lost their configs, or your HBAs got crispy when AIX lost power.

If everything checks out (zoning, SAN, HBAs all healthy), you could run cfgmgr -v. However, since you stated it has been rebooted a few times, and AIX runs cfgmgr at boot to find devices, I'm not going to say that will help.

Good luck!
 

Oh, should note that if your HBA's are having issues, fcstat generally will be unable to get any info out of the device and it will fail. Forget what the message is, but fcstat will hang for a few mins, and then toss a single line to the screen.
Since fcstat worked, and reviewing your output above, I'm not seeing any cause for alarm at the adapter level.
Link Failure, Loss of Sync is fairly normal at boot. The Invalid Tx Word Count has me a little nervous.
An example from my hba:
Code:
        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 218646331               124878093
Words:  111814470144            56272208640

LIP Count: 0
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 6
Loss of Signal: 0
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 45

As you can see, I've sent a lot more data and only had 45 invalid Tx words. Might be worth looking into. May not be.
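To put the two counts on the same footing, here is a rough sketch that scales the invalid count by the traffic the port has carried. The helper name `invalid_tx_rate` is made up, and using words sent plus words received as the yardstick is my own choice for illustration, not an official threshold.

```shell
# invalid_tx_rate: read `fcstat fcsX` output on stdin and print the
# "Invalid Tx Word Count" per million words transferred (sent + received).
invalid_tx_rate() {
    awk '
        $1 == "Words:" { words = $2 + $3 }           # transmit + receive columns
        /Invalid Tx Word Count:/ { bad = $NF }       # last field is the counter
        END {
            if (words > 0) {
                printf "%.4f invalid words per million words transferred\n", bad * 1000000 / words
            } else {
                print "no traffic recorded"
            }
        }'
}

# Example on the AIX host: fcstat fcs0 | invalid_tx_rate
```

With the numbers posted earlier for fcs0 (8 invalid, 2816 words each way) this comes out around 1420 per million, against well under 1 per million for my HBA. The counters were only seconds old on a port that had barely moved any data, so treat it as a hint rather than proof.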

My money is still on the SAN switch if there is one.
 

Hi RecoveryOne,

This is excellent info, thanks.

I went through everything you posted and everything seems ok until I get to the smitty diag,
this is my list:

All Resources
This selection will select all the resources currently displayed.
sys0 System Object
sysplanar0 System Planar
U789C.001.DQDC393-
pci0 P1 PCI Bus
+ fcs0 P1-C5-T1 FC Adapter
fcnet0 P1-C5-T1 Fibre Channel Network Protocol Device
fscsi0 P1-C5-T1 FC SCSI I/O Controller Protocol Device
fcs1 P1-C5-T2 FC Adapter
fcnet1 P1-C5-T2 Fibre Channel Network Protocol Device
fscsi1 P1-C5-T2 FC SCSI I/O Controller Protocol Device
vio0 Virtual I/O Bus
U8203.E4A.066C422-
vscsi0 V4-C3-T1 Virtual SCSI Client Adapter
vscsi1 V4-C4-T1 Virtual SCSI Client Adapter
hdisk1 V4-C4-T1-L850000000000
Virtual SCSI Disk Drive
hdisk4 V4-C4-T1-L840000000000
Virtual SCSI Disk Drive
hdisk3 V4-C4-T1-L830000000000
Virtual SCSI Disk Drive
hdisk2 V4-C4-T1-L820000000000
Virtual SCSI Disk Drive
hdisk0 V4-C4-T1-L810000000000
Virtual SCSI Disk Drive
ent0 V4-C2-T1 Virtual I/O Ethernet Adapter (l-lan)
vsa0 V4-C0 LPAR Virtual Serial Adapter
vty0 V4-C0-L0 Asynchronous Terminal
L2cache0 L2 Cache
mem0 Memory
oppanel Operator panel


I tried running the diags on fcs0 and fcs1 but I keep getting this:

ADDITIONAL INFORMATION FOR fcs0 IN LOCATION U789C.001.DQDC393-P1-C5-T1 2602908


No trouble was found. However, the resource was not tested
because its other port (fcs1) is configured.


To test this resource, you can perform one of the following:
1) Unconfigure its other port (fcs1) and run Diagnostics again.
To unconfigure fcs1, run the following command
from the command line:
rmdev -Rl fcs1
After running Diagnostics, run the following command:
cfgmgr
2) Shut down the system and run in maintenance mode.
3) Run Diagnostics from the Diagnostic Standalone package.

Press Enter or Cancel to return to the
application.

I then tried selecting both fcs0 and fcs1 (had to include sysplanar0 as well or the diag option was not available);
it runs, but I get the same results. I'm not sure I should unconfigure an FC port just to run it.

I'm off to check the SAN switch

Thanks again!!
 

Ahh yes, I see your card is a dual-port 4Gb HBA. I overlooked that above. I was assuming that fcs0 and fcs1 were each their own card. My mistake.

Any SAN disks attached for TSM or other work loads?
'lspath' plus 'lsdev -Cc disk' and 'lspv -u' would easily show if you are using SAN storage.
I'll leave it up to you whether you want to spend the time to reconfigure your HBAs if you follow option 1 the system gave you. Any custom settings/tunables you have in place would be deleted, so you'd have to put those back. Also, all your disks would be gone, and if you removed them while TSM is running, with db/logs/whatever on that SAN volume... yeah, ouch. Basically a fair amount of work to get everything back.
If you did do that, the VG descriptors should live on the disks, and there's a fairly good chance things will just come right back in. I cannot say with 100% certainty that is the case. Then again, if you have faulty HBAs you are going down that path anyhow.
This is why I can't stress enough having two independent cards for multipathing :)

Then again, if you have SAN volumes, and they are all working just fine, I do not see the need to run diags for the HBA's.
 

Hey Guys,

Just wanted to let you know that in the end, after much troubleshooting and testing, the issue was that both drives in the library had died; although they were reporting online, they were preventing the system from initializing correctly.

Thanks again to everyone for your help; it was greatly appreciated.

cheers
 

Ouch. Well, that would do it. I just would have thought the library would know that the drives were having an issue.
Glad you have it figured out and up and running.
 