TSM Restore issues

Manufest · Aug 7, 2009

Hello,

I'm having some issues restoring some data off of a CentOS box running SurgeMail. After several hours (18 hours the first time) the restore begins, and then after about 9.5 MB it stops and just sits there. A couple of hours later, the connection is severed. I restart the restore (dsmc restart restore) and, even though it claims to be a restartable restore, it seems to start all over again and tries to restore the same files, then gets severed, and so on.

When I query the backup, it creates a list of files pretty fast. I just cancelled the restore and restarted it with the second command listed under Restore Commands.

Setup:

CentOS x64 with TSM Client 5.5.1
Windows 2k3 32 bit TSM server with TSM 5.3.6
100 Mb/s LAN

Restore Commands:

dsmc restore -pitd=7/31/2009 -pitt=07:00:00 -subdir=yes /var/spool/vmail/clientcompany.com/bj/ae/clientusername/ /home/tsm/restore/clientusername/

and most recently:

dsmc restore -pitd=7/31/2009 -pitt=07:00:00 /var/spool/vmail/clientcompany.com/bj/ae/clientusername/mdir/* /home/tsm/restore/clientusername/

Activity log:

As soon as the restore starts or is restarted, it gives me the three volumes that it needs.
When the client is disconnected, there's a message about the session being severed.

Client error and schedule logs:

No errors except that the sessions are being interrupted and will attempt to restart.

Other Considerations:

There are no other processes or sessions running on TSM.
Local LAN - no connectivity issues
Backups run fine regularly
Volumes that it needs are all Read-Write access, and are in the library, checked in and all.

The size of the restore is only a couple hundred MB.

Any feedback or advice would be greatly appreciated.

Thanks.

eRogue · Aug 7, 2009

If your server has a Broadcom BMC57xx driver, make sure you have the latest driver for it... There are some known issues with TSM and the Broadcom driver, but on Windows boxes. ...

Manufest · Aug 10, 2009

Thanks for the reply. I've got an Intel card, so that isn't it...

I let the restore run and it eventually crashed my TSM instance. Nothing else in the actlog or the Windows logs explains it. No other changes were made on the server. Going back, I see that the server instance had crashed before when I tried restoring these files.

Here's the windows event:

Event Type: Error
Event Source: ADSMServer
Event Category: None
Event ID: 27
Date: 8/8/2009
Time: 4:09:29 PM
User: N/A
Computer: TSM
Description:
TSM Server Diagnostic: ANR9999D: ADSM Exception Information: file = pkthread.c, line = 2279,Code = c0000005, Address = 102B9981
Attempt to read data at address 0~

Any ideas? I'm auditing the volumes that the data is on...

picay · Aug 10, 2009

I've seen ANR9999D, pkthread.c when logs became full...
Did you enable trace on the client?

Eldoraan · Aug 10, 2009

I'm not sure a client at that high a level will work reliably with server code that old. The official stance is that 5.5.x clients are supported with 5.4.x servers and above. That 5.3.6 server may be too low for that client level.
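
As a rough sketch of that rule, the check below compares a server level against the 5.4 minimum quoted above. It only encodes the statement in this post, not IBM's published support matrix, and the function names are made up for illustration.

```python
def parse_level(level):
    """Turn a level string such as '5.3.6' into a comparable tuple (5, 3, 6)."""
    return tuple(int(part) for part in level.split("."))


def meets_stated_minimum(server_level, minimum_level="5.4"):
    """Check a server level against the minimum quoted in the post for 5.5.x clients.

    The default of '5.4' is an assumption taken from the post, not from IBM documentation.
    """
    return parse_level(server_level) >= parse_level(minimum_level)


# The combination in this thread: 5.5.1 client against a 5.3.6 server.
print(meets_stated_minimum("5.3.6"))
# A server at the quoted minimum.
print(meets_stated_minimum("5.4.0"))
```

Tuple comparison is used so that a level like 5.3.6.2 (a fix pack) still compares correctly against 5.4.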

javajockey · Aug 10, 2009

That TSM client should, at least in theory, work. Why don't you try consolidating your data for the restore? In other words, move the data from your TAPEPOOL to the backuppool. This will eliminate any media waits during the restore. You will also find out about any media errors if the process fails. Here is the command:

move nodedata XXXXXX fromstg=tapepool tostg=backuppool maxprocess=3
(change the pool names and maxprocess if necessary)

Monitor the actlog to see if there are any errors. Once the data is on your diskpool, the restore should be really fast.

Manufest · Aug 10, 2009

Thank you all for all of the replies. They are very appreciated.

Picay:
-Trace is not enabled.
-The recovery log only gets up to about 3 or 4% usage.
-Client logs are relatively small

Eldoraan:
- I can try restoring the data from a 5.3.6 client on a different Linux box. I would need to build it, but that won't take long (VM template). I was under the impression that server/client versions didn't matter all that much, and that they'd operate at whatever level of functionality both sides support. Though, am I able to restore data that was backed up by a 5.5.1 client with a 5.3.6 client?

javajockey:
- Great idea... and I have a space issue that prevents me from doing it. The node has 2 TB of data in my tape pool, and I don't have any free storage that large.

I came into work tonight and the path to one of the drives was offline. I brought it back online, and after a few minutes it took itself offline again. The activity log reports problems opening the drive. I'll reboot the tape library, but there's a pretty good chance I have a faulty drive. I also found a fix pack (5.3.6.2) that hasn't been installed. So I'll apply that, reboot the library and the server, take that drive path offline, and do the restore with just one drive and see how it goes. I'll test the drive after the restore.

javajockey · Aug 11, 2009

Well, one drive going offline really shouldn't be preventing you from performing the restore (assuming you have multiple tape drives). I think at this point you should delete the drive and the path to the drive, then define them again. I'm more of a UNIX guy, but Windows should be similar. The exact command you would use to define the path is in the devconfig file. You should have all of this documented in case you need to delete the library and recreate it. The steps are as follows:
Delete the path.
Delete the drive.
This is the first thing that IBM would have you do, assuming the drive has connectivity (fibre or SCSI).

Here is an example for UNIX, for a library named lto4lib:

del path tsmsvr1 lto8 srctype=server desttype=drive library=lto4lib
del drive lto4lib lto8

define drive lto4lib lto8 serial=0007854476 element=264 online=yes cleanfreq=asneeded

define path tsmsvr1 lto8 srctype=server desttype=drive library=lto4lib device=/dev/rmt4 online=yes

Manufest · Aug 11, 2009

Deleted and recreated - failed to initialize. Rebooted both TSM and the library, same thing. Tried it a couple of times and it eventually worked, but the drive is in a state of "unknown." I also got an email with the following error.

Device : <FLX11424C>
Attention : Crit or Warn Drive Tape Alert flag
Number: 0x01, 1
Drive number: 0x02, 2
Tape Alert Flag: 0x20, 32
This message was generated automatically from
IBM 3573-TL

I looked up the error, and it is described as:

Set when the tape drive detects a problem with the SCSI, Fibre Channel, or RS-422 interface.

The first tape drive is fine, just the second drive. So I'll be contacting IBM about that.
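
For anyone reading similar alert emails, here is a minimal sketch of decoding the "hex, decimal" fields shown above. The lookup table is deliberately partial (only a handful of TapeAlert flag numbers are listed), and the helper names are made up for illustration.

```python
# Partial table of TapeAlert flag numbers; many flags are intentionally omitted.
TAPEALERT_FLAGS = {
    20: "Clean now",
    30: "Hardware A",
    31: "Hardware B",
    32: "Interface",
}


def parse_field(field):
    """Parse a field such as '0x20, 32' and confirm the hex and decimal parts agree."""
    hex_part, dec_part = (part.strip() for part in field.split(","))
    value = int(hex_part, 16)
    if value != int(dec_part):
        raise ValueError(f"hex and decimal values disagree in {field!r}")
    return value


def describe_flag(field):
    """Return the flag name for a 'Tape Alert Flag' field, or a placeholder if not listed."""
    number = parse_field(field)
    return TAPEALERT_FLAGS.get(number, f"flag {number} (not in this partial table)")


print(describe_flag("0x20, 32"))  # the flag from the email above
print(parse_field("0x02, 2"))     # the drive number field
```

Flag 32 is the interface flag, which matches the description quoted above.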

I've upgraded to 5.3.6.3 (from 5.3.6.0). I'm currently trying another restore while I'm building another machine that I'll install the 5.3.6 client on and try a restore from there.

Thanks!

javajockey · Aug 11, 2009

At this point, I'd say you've exhausted all other options. In my experience, drive failures generally turn out to be a mechanical failure of some sort (a tape getting stuck :sad:).

Let us know if it was in fact a drive failure.

Eldoraan · Aug 11, 2009

If the drives are fibre connected, it could be a bad cable or GBIC causing intermittent errors.

Manufest · Aug 12, 2009

They're SCSI drives. Migrations work fine to the one drive. The other drive I have disabled for now (waiting for IBM warranty).

Pretty much the same result with the restore this time. I had to restart the restore several times, though, and at one point the local restore said it was finished, but only a few files were restored. When I checked the server, the restore was inactive. When I checked dsmc q restore, it showed that the restore was restartable. I then restarted it.

Every time I restart the restore (dsmc restart restore) it starts from the beginning again, and asks me whether I want to overwrite or skip the already restored files. It loads the first tape, then the second tape, then the third tape, and then dies. It seems to literally restart the entire restore each time (but the elapsed time when I do a q restore on the server keeps going up).

So I just started the restore on a freshly built VM with 5.3.6 to see if just maybe the version is the issue. I had pinged the TSM server from the client for 24 hours and it didn't miss a single one, so I don't think network interruptions are the cause.

Any ideas?

Cheeky · Aug 12, 2009

pkthread-related errors are generally memory-related issues.

Have you checked to see what the memory resources are like on either the server or client?

Manufest · Aug 12, 2009

Seems to be working on the VM... I've restored 163 MB so far. All three tapes have been mounted, and the last one is still in use. Not much more left to restore. It's been going for over 12 hours and hasn't had a "severed" connection yet. Both the client and the server are reporting the same amount of data sent (163 MB). Before, it took several restarts to get 12 MB over the same time period.

I'll post when it finishes and let you know.

Manufest · Aug 13, 2009

It just finished. A lot more data than I thought, about 513 MB. Took about 22 hours.
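
To put those numbers in perspective, here is a quick back-of-the-envelope calculation using the approximate figures from this thread (513 MB, 22 hours, 100 Mb/s LAN). It works out to about 6.5 KB/s on average, a tiny fraction of what the LAN can carry, which suggests the network was not the limiting factor. Decimal units (1 MB = 1000 KB) are assumed throughout.

```python
def average_rate_kb_per_s(megabytes, hours):
    """Average transfer rate in KB/s, using decimal units (1 MB = 1000 KB)."""
    return megabytes * 1000 / (hours * 3600)


# Approximate figures quoted in the thread.
restore_rate = average_rate_kb_per_s(513, 22)

# 100 Mb/s is 12.5 MB/s, i.e. 12500 KB/s, ignoring protocol overhead.
lan_rate = 100 / 8 * 1000

print(f"average restore rate: {restore_rate:.1f} KB/s")
print(f"fraction of LAN line rate: {restore_rate / lan_rate:.4%}")
```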

So it was either the client version, or the production SurgeMail server was hogging all of the memory and killing the application. Really not sure.

Thanks for all of your help!

Mita201 · Aug 13, 2009

I am pretty sure that you can't restore data with a 5.3.x client if the data was backed up with a 5.5.x client.

Manufest · Aug 13, 2009

I just did....

Mita201 · Aug 13, 2009

And your restored files are OK?

Manufest · Aug 16, 2009

So far, they're good.

Mita201 · Aug 16, 2009

Well, then that is something new for me!
What I have seen in a similar situation (but on Windows) is the dsmc client crashing, or restored files that are 0 bytes in size.
