Hello all,
I am facing a really strange issue which nobody seems to be able to solve...
A 2-node Windows cluster is not able to run backups: the session(s) just hang in SendW state on the TSM server when running an incremental backup (the byte count goes up from time to time, reaching a total of only a few MB). If I remove all filespaces for the node and run an incremental, it works, and if I run a selective backup for the same domain, it also works.
I am not the administrator for the cluster nodes, but I know someone made some changes to the SAN-mounted disks at the same time the issue occurred (not sure if it's just a coincidence).
I've run some traces and can include some of the output here:
(ALL_BACK flag)
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 524): consumers: 1, producers: 1, snapCBThreads: 1, on baSpecQ 0, dispatcher not used.
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 443): about to delay for 1 second.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 524): consumers: 1, producers: 1, snapCBThreads: 1, on baSpecQ 0, dispatcher not used.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 443): about to delay for 1 second.
(SERVICE flag)
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(1902): TcpRead: Upper level requested 305 bytes
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2021): TcpReadAvailable: Issuing recv for 305 bytes.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2157): TcpReadAvailable: 177 bytes read.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(1941): TcpRead: 177 bytes read of 305 requested.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2021): TcpReadAvailable: Issuing recv for 128 bytes.
B/A Performance thread 4================>
2016-08-31 20:01:47.720 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 482): txnQ 0000005FE2EF6080.
2016-08-31 20:01:47.720 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
B/A Txn Consumer thread 3================>
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 519): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait but no tries in table; continue to wait.
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 527): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait; CanProceed=False.
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 489): fifoQgetNextWaitNoTS(0000005FE2EF6080): Waiting for next object.
B/A Performance thread 4================>
2016-08-31 20:01:48.721 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 482): txnQ 0000005FE2EF6080.
2016-08-31 20:01:48.721 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
B/A Txn Consumer thread 3================>
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 519): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait but no tries in table; continue to wait.
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 527): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait; CanProceed=False.
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 489): fifoQgetNextWaitNoTS(0000005FE2EF6080): Waiting for next object.
VSS is working fine, Windows backup is working, and I've tried it against two separate TSM servers. I am out of ideas now...
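
In case it helps anyone skim a longer trace file: below is a minimal sketch in Python (my own helper, not part of TSM) that counts how often each source location appears in the trace. It makes the repeating one-second polling loop in bacpfm.cpp and the wait loop in dsfifo.cpp easy to spot. The function name and the regular expression are assumptions based only on the line format shown in the excerpts above.

```python
import re
from collections import Counter

# Matches the "file.cpp( 482)" part of a trace line.
# The format is assumed from the excerpts in this post.
LOCATION = re.compile(r"([A-Za-z0-9_]+\.cpp)\(\s*(\d+)\)")


def count_trace_locations(lines):
    """Count how often each (source file, line number) appears in trace lines."""
    counts = Counter()
    for line in lines:
        match = LOCATION.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# A few lines copied from the trace output in this post.
sample = [
    r"2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.",
    r"2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 443): about to delay for 1 second.",
    r"2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.",
    r"2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(1902): TcpRead: Upper level requested 305 bytes",
    "B/A Performance thread 4================>",
]

if __name__ == "__main__":
    # Print the most frequent locations first.
    for (source, lineno), hits in count_trace_locations(sample).most_common():
        print(f"{source}({lineno}): {hits}")
```

To run it against a real trace, pass the open file object instead of `sample`; lines without a source location (such as the thread banners) are simply skipped.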