RMAN: Restore Validate: SOMETIMES with error

adsmsuser

Hi!

We have some Oracle DBs here that SOMETIMES have problems with backup (in this case: Restore Validate).

We have IDLETIMEOUT on the server set to 240 minutes, and it seems that TSM kills the session after that duration, no matter whether it is still needed.
Here are some logs:

TSM-SERVER-LOG:
07.04.11 21:44:32 MESZ ANR0406I Session 164254 started for node ORACLE-TDP (TDPO Linux86-64) (TCP/IP 10.6.227.155(63289)). (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR1639I Attributes changed for node ORACLE-TDP: TCP Name from ebrefdb2 to ebprddb1, TCP Address from 10.6.227.165 to 10.6.227.155, GUID from 55.1b.1f.3a.ef.ca.11.de.ad.e5.00.15.17.c8.c7.40 to 00.28.98.58.ea.f8.11.de.88.87.00.15.17.c8.c6.26. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0408I Session 164255 started for server TSMLM1 (HP-UX) (TCP/IP) for library sharing. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0409I Session 164255 ended for server TSMLM1 (HP-UX). (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0408I Session 164256 started for server TSMLM2 (HP-UX) (TCP/IP) for library sharing. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0409I Session 164256 ended for server TSMLM2 (HP-UX). (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0408I Session 164257 started for server TSMLM2 (HP-UX) (TCP/IP) for library sharing. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0409I Session 164257 ended for server TSMLM2 (HP-UX). (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0408I Session 164258 started for server TSMLM2 (HP-UX) (TCP/IP) for library sharing. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0409I Session 164258 ended for server TSMLM2 (HP-UX). (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0408I Session 164259 started for server TSMLM2 (HP-UX) (TCP/IP) for library sharing. (SESSION: 164254)
07.04.11 21:44:32 MESZ ANR0409I Session 164259 ended for server TSMLM2 (HP-UX). (SESSION: 164254)
08.04.11 01:45:16 MESZ ANR0482W Session 164254 for node ORACLE-TDP (TDPO Linux86-64) terminated - idle for more than 240 minutes. (SESSION: 164254)
^ HERE IT GETS TERMINATED:
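
As a quick sanity check on the log above, the following sketch computes how long session 164254 was open before ANR0482W was raised. It only uses the two timestamps quoted in the server log; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the TSM server log above (format DD.MM.YY HH:MM:SS).
FMT = "%d.%m.%y %H:%M:%S"
session_started = datetime.strptime("07.04.11 21:44:32", FMT)  # ANR0406I
session_killed = datetime.strptime("08.04.11 01:45:16", FMT)   # ANR0482W

# Elapsed time between session start and termination.
idle = session_killed - session_started
idle_minutes = idle.total_seconds() / 60

print(f"Session 164254 was open for {idle_minutes:.1f} minutes")
```

The result is just over 240 minutes, which matches the configured IDLETIMEOUT and suggests the server saw no activity on this session at all after it was opened.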

RMAN-LOG:
channel ORA_SBT_TAPE_1: reading from backup piece rman_full_20110402_201510_1_9444_6.bak
channel ORA_SBT_TAPE_1: piece handle=rman_full_20110402_201510_1_9444_6.bak tag=FULL LEVEL 0 BACKUP
channel ORA_SBT_TAPE_1: restored backup piece 6
channel ORA_SBT_TAPE_1: reading from backup piece rman_full_20110402_201510_1_9444_7.bak
channel ORA_SBT_TAPE_1: ORA-19870: error while restoring backup piece rman_full_20110402_201510_1_9444_7.bak
ORA-19507: failed to retrieve sequential file, handle="rman_full_20110402_201510_1_9444_7.bak", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
 ANS1235E (RC-72) An unknown system error has occurred from which TSM cannot recover.

TDP/O-Tracefile:
2011-04-08 01:45:18.544 [000566] [1303844672] : session2.cpp ( 905): tdpoPrepGet(): dsmHandle = 1, 'ANS1235E (RC-72) An unknown system error has occurred from which TSM cannot recover.'

2011-04-08 01:45:18.544 [000566] [1303844672] : session2.cpp ( 911): tdpoPrepGet(): Exit - DSMBEGINGETDATA() failed. dsmHandle = 1, rc = -72

I am not sure where the problem is located.
Actually, TSM should not kill the session, but on the other hand,
RMAN should do something to keep it alive!?

Has anyone experienced such a problem and come to a conclusion?
TSM says: increase IDLETIMEOUT.
But I think a 4-hour IDLETIMEOUT is plenty!


Thanks
 
Hi,

This happens in LAN-free environments when the transfer time of an RMAN backup piece is longer than IDLETIMEOUT. RMAN sends the data to (or retrieves it from) the storage agent and contacts the TSM server only after the piece is finished.
You can either increase the IDLETIMEOUT (and COMMTIMEOUT) parameters or limit the size of the backup piece on the RMAN side (we are now at 8 or 16 GB and have no problems).
Limiting the piece size has another advantage: if anything goes wrong during a restore, RMAN knows which pieces were already restored and does not try them again.
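
To make the trade-off between piece size and IDLETIMEOUT concrete, here is a rough back-of-the-envelope sketch. It assumes the simple model described above (the server session stays idle for the whole transfer of one piece). The throughput figures and the 50% safety margin are purely hypothetical, and the function names are invented for this example.

```python
def transfer_minutes(piece_gb: float, throughput_mb_s: float) -> float:
    """Minutes needed to move one backup piece at a constant throughput."""
    return piece_gb * 1024 / throughput_mb_s / 60


def max_piece_size_gb(throughput_mb_s: float, idle_timeout_min: int,
                      safety: float = 0.5) -> float:
    """Largest piece (GB) that transfers within `safety` * IDLETIMEOUT."""
    usable_seconds = idle_timeout_min * 60 * safety
    return throughput_mb_s * usable_seconds / 1024


IDLETIMEOUT_MIN = 240  # value from the thread

# Hypothetical throughputs in MB/s; measure your own environment.
for mb_s in (1, 5, 50):
    minutes = transfer_minutes(16, mb_s)
    verdict = "OK" if minutes < IDLETIMEOUT_MIN else "would hit IDLETIMEOUT"
    limit = max_piece_size_gb(mb_s, IDLETIMEOUT_MIN)
    print(f"{mb_s:>3} MB/s: 16 GB piece takes {minutes:6.1f} min ({verdict}); "
          f"piece limit with 50% margin: {limit:.1f} GB")
```

On very slow storage even a 16 GB piece can exceed the timeout, so the piece size limit has to be chosen with the slowest restore target in mind.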

Harry
 
Hm, well. I guess we found the error, or at least the reason why we ran into timeouts WITH LAN-free.

Let's say:
some DB admins ran an RMAN duplicate while
other DB admins ran a backup.

As the test environment has quite slow storage, this duplicate took long enough to let the restore validate run into the 240-minute timeout.

^^
 