Hanging backup (SendW)

sebbash

ADSM.ORG Member
Joined
Feb 1, 2016
Messages
13
Reaction score
1
Points
0
Hello all,

I am facing a really strange issue which nobody seems to be able to solve...

A 2 node windows cluster is not able to do backups, the session/s is just hanging in SendW state from TSM when running incremental backup (the amount of K is going up from time to time to a total to just a few M). If I remove all filespaces for the node and run an inc it's working, and if I am running a selective backup for the same domain it's also working.

I am not the administrator for the cluster nodes, but I know someone did some changes to SAN mounted disks the same time the issue occured (not sure if it's just an coincidence).

I've run some traces and can include some of the output here also:
(ALL_BACK flag)
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 524): consumers: 1, producers: 1, snapCBThreads: 1, on baSpecQ 0, dispatcher not used.
2016-08-31 19:46:46.971 : ..\..\common\ba\bacpfm.cpp( 443): about to delay for 1 second.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 482): txnQ 000000EF9254FFC0.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 524): consumers: 1, producers: 1, snapCBThreads: 1, on baSpecQ 0, dispatcher not used.
2016-08-31 19:46:47.972 : ..\..\common\ba\bacpfm.cpp( 443): about to delay for 1 second.

(SERVICE flag)
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(1902): TcpRead: Upper level requested 305 bytes
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2021): TcpReadAvailable: Issuing recv for 305 bytes.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2157): TcpReadAvailable: 177 bytes read.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(1941): TcpRead: 177 bytes read of 305 requested.
2016-08-31 20:01:46.998 [013344] [4040] : ..\..\common\com\commtcp.cpp(2021): TcpReadAvailable: Issuing recv for 128 bytes.

B/A Performance thread 4================>
2016-08-31 20:01:47.720 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 482): txnQ 0000005FE2EF6080.
2016-08-31 20:01:47.720 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False

B/A Txn Consumer thread 3================>
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 519): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait but no tries in table; continue to wait.
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 527): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait; CanProceed=False.
2016-08-31 20:01:47.720 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 489): fifoQgetNextWaitNoTS(0000005FE2EF6080): Waiting for next object.

B/A Performance thread 4================>
2016-08-31 20:01:48.721 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 482): txnQ 0000005FE2EF6080.
2016-08-31 20:01:48.721 [013344] [10476] : ..\..\common\ba\bacpfm.cpp( 491): consumerInTransit False

B/A Txn Consumer thread 3================>
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 519): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait but no tries in table; continue to wait.
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 527): fifoQgetNextWaitNoTS(0000005FE2EF6080): Returned from wait; CanProceed=False.
2016-08-31 20:01:48.721 [013344] [14124] : ..\..\common\ut\dsfifo.cpp( 489): fifoQgetNextWaitNoTS(0000005FE2EF6080): Waiting for next object.


VSS is working fine, Windows backup is working, I've tried it on two separate TSM servers, I am out of ideas now...
 
How many objects does it has to inspect? If the consumer is waiting, it's likely that it's waiting for the producer to find candidate files to backup. With a selective, everything is backed up, so the producer has very little work to do, it just gives all files.
 
Well, I haven't really understood the difference between objects and files but after the first backup (which completes flawless) there's approx 300k of files and almost 4 _million_ objects. The difference sounds huge to me, but that might be normal as it's C: D: and systemstate that's getting backup.
Also it keeps hanging for hours and hours without any success.
 
Objects = files + directories

It's likely not hanging, it's determining which files it needs to backup. As you saw, there is two threads, a producer and a consumer. The producer queries the TSM Server to get the list of objects previously backed up and their attributes, then it goes through all the disks and finds which objects have changed, as soon as it finds a match, it gives it to the consumer that actually sends the data to the TSM Server, and meanwhile the producer continues to look for objects that have changed. So with 4 million objects to inspect, that can take a lot of time, and in the foreground it looks like nothing is happening, but a lot is happening in the background.
 
After some more troubleshooting I found out it was systemstate which contained 3.6 million objects. If I exclude systemstate from the backup and only backup C: and D: the incremental backup is working flawless again. However if I add systemstate I'm back on square one again...
I know this might be the wrong forum, but does anyone know what could have happened to systemstate and if it's possible to fix? Doesn't really sound healthy with that many objects...
 
@marclant - you said object equals files plus directories. When I ran a first full backup for systemstate it did backup ~200k files, when removing the same filespace almost 3.6 million objects was removed. Does that mean there is 3.4 million directories somewhere which can't be seen?
 
Does that mean there is 3.4 million directories somewhere which can't be seen?
Unlikely.

When you delete a filespace, you delete all active and inactive versions. So depending on your retention, the number of objects (files and/or directories) in the filespace is much greater than the number of objects on the machine.
 
Yes, of course. But for this case there were only one version as I first deleted all filespaces from it and then ran a full backup once, that's what making me so confused - 3.6 million objects for _1_ full backup of that filespace (without any older versions).
 
I can't explain without data. Show me the backup statistics of the backup systemstate, the output of: query occupancy before you delete the filespace, and the command you used to delete the filespace as well as the summary of the delete filespace process from the activity log.
 
To get systemstate backed up to TSM I ran 'backup systemstate' (took almost 1 hour).

I am not able to paste anything from the occupancy as it's removed at the moment and no access to rerun the backup right now. However I know it was 11.3GB on 130k files.

Filespace removal:
09/01/2016 18:11:03 ANR0802I DELETE FILESPACE WSC01376-N2\SystemState\NULL\Sy-
stem State\SystemState (fsId=11) (backup/archive data)
for node WSC01376-N2 started. (SESSION: 200558, PROCESS:
820)
09/01/2016 18:14:49 ANR0806I The WSC01376-N2\SystemState\NULL\System
State\SystemState (fsId=11) file space was deleted for
node WSC01376-N2: 3,566,216 objects were deleted.
(SESSION: 200558, PROCESS: 820)
09/01/2016 18:14:49 ANR0987I Process 820 for DELETE FILESPACE running in the
BACKGROUND processed 3,566,216 items with a completion
state of SUCCESS at 06:14:49 PM. (SESSION: 200558,
PROCESS: 820)
 
Can't explain it. It does sound odd if in fact the filespace did not exist prior to the full systemstate backup that backed up 130k files.
 
backing up the system state individually was successful or showing the same pattern ?
did you happen to check the windows event log.
 
Windows systemstate backups can contain lot of things, especially with late windows versions and especially domain controllers:
https://technet.microsoft.com/en-us/library/cc938537.aspx
The other thing I'm wondering though is
-Your OS version
-Your TSM client version
-Your TSM Server version
-What more advanced features are you using (deduplication and such)
From current TSM Server perspective, I would think couple million objects is kinderspiel, unless it(or the system being backed up) is having some sort of performance issues - in other words, that shouldn't be a problem.

On the other hand googling for "tsm backup hangs systemstate backup" does give some results, so depending on your client level you may be encountering known issue.
 
OS version - Win Server 2012R2
TSM client version - 7.1.3
TSM server version - 7.1.5.200
Been trying both dedup and non dedup, without luck.

Running backup for systemstate with SYSTEMSTATEBACKUPMETHOD FULL is working, without it's not working.
Running incremental backup for local disks with journaling is working, without it's not working.
Whenever there's no need of exchanging data with the TSM server it's working, including restore sessions bigger than 10KB or so. Below is a snip from two traces with differens file sizes:

Restore - 2mb file
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1829): Send Verb: Length: 660 Code: 00013A00 Type: diStatsNew
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2207): TcpWrite(): Warning - protocol violation, session not open
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2388): TcpFlush: writebuff(0000000000000000), bufflen(0), lock(0000000000000000)
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2397): TcpFlush(): Warning - protocol violation, session not open
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2514): TcpFlush: error 10038 during send.
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2625): ANS1006I TCP/IP write error on socket = 4294967295, errno = 10038, reason : An operation was attempted on something that is not a socket.
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1882): Send Verb: 2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1883): ...error
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1886): sessSendVerb: Error sending Verb, rc: -50
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(4902): sessClose: Transitioning: sInit state ===> sInit state
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(2063): sessClose: Session closed.

Restore - 7kb file
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\session.cpp(1829): Send Verb: Length: 582 Code: 00013A00 Type: diStatsNew
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\commtcp.cpp(2311): TcpWrite: 582 bytes written of 582 requested.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\session.cpp(1866): Sent Verb: Length: 582 Code: 00013A00 Type: diStatsNew->
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\commtcp.cpp(2388): TcpFlush: writebuff(0000000000000000), bufflen(0), lock(0000000000000000)
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\commtcp.cpp(2650): TcpFlush: 582 bytes written on socket 1768.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\session.cpp(1958): sessFlushVerb: Successful flush of send data to the server
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 484): fifoQgetNextWaitNoTS(00000061045C0E20): cbCanProceed=True.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 539): fifoQgetNextWaitNoTS(00000061045C0E20): Next object is available.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 196): fifoQchangeWaitTime(00000061045C0E20): wait time 0, change 0
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\circq.cpp( 248): Q 4584d00, poped 31ce640, count 1, top index 1 - 5fe7690, bot index 2 - 0
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 594): fifoQgetNext(00000061045C0E20): Giving entry 00000061031CE640, rc 0.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ba\DccTaskStatus.cpp( 467): DccTaskletMsg.msgID ==> 10
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 484): fifoQgetNextWaitNoTS(00000061045C0E20): cbCanProceed=True.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 539): fifoQgetNextWaitNoTS(00000061045C0E20): Next object is available.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 196): fifoQchangeWaitTime(00000061045C0E20): wait time 0, change 0
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\circq.cpp( 248): Q 4584d00, poped 5fe7690, count 0, top index 2 - 0, bot index 2 - 0
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 584): fifoQgetNext(00000061045C0E20): Giving entry 0000006105FE7690, but next entry is NULL.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ut\dsfifo.cpp( 594): fifoQgetNext(00000061045C0E20): Giving entry 0000006105FE7690, rc 0.
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\ba\DccTaskStatus.cpp( 467): DccTaskletMsg.msgID ==> 18
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\diagent\agent\DdiBStatusOutput.cpp(1520): Entering --> DdiBStatusOutput::ccMsgFileExists
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\common\com\session.cpp(2443): Address of buffer is 6f76068
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\diagent\dicu\dicurest.cpp(1513): =========> Entering dicuSendFileExist()
2016-09-08 13:56:39.165 [004444] [18324] : ..\..\diagent\dicu\dicurest.cpp(1589):

Any advice would be appreciated big time!
 
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2207): TcpWrite(): Warning - protocol violation, session not open
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2388): TcpFlush: writebuff(0000000000000000), bufflen(0), lock(0000000000000000)
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2397): TcpFlush(): Warning - protocol violation, session not open
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2514): TcpFlush: error 10038 during send.
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\commtcp.cpp(2625): ANS1006I TCP/IP write error on socket = 4294967295, errno = 10038, reason : An operation was attempted on something that is not a socket.
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1882): Send Verb: 2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1883): ...error
2016-09-08 12:29:29.577 [005164] [17812] : ..\..\common\com\session.cpp(1886): sessSendVerb: Error sending Verb, rc: -50

That looks like a networking problem. What were the error messages in dsmerror.log for this particular restore?
 
Networking as in the physical network or the networking on the client? I've been talking with the network guys who thinks the network is alright as ping goes through and backup is working (even if it's just one way backup working). Not sure if there's a firewall in between, but the backup/restore is initiated from the client and all traffic should go on that session (which is obviously working, client -> server on port 1500)?
dsmerror at that time:
2016-09-08 12:29:29 ANS0361I DIAG: ..\..\common\com\session.cpp(1886): sessSendVerb: Error sending Verb, rc: -50
2016-09-08 12:29:29 ANS0361I DIAG: sessSendVerb: Error sending Verb, rc: -50
 
The RC -50 is a networking error.

Networking as in the full TCP communication from host to host. So is starts as soon at it hits the TCP stack on the client machine until it exists the TCP stack on the server machine, anything in between could be a potential culprit, which includes the following on both the client and server
- OS patches
- network card
- network card driver
- network card firmware
As well as everything between the client and the server.

A simple ping test confirms connectivity at that instant, it doesn't mean it was ok a few seconds earlier. It's also a really small packet compared to a file restore.

Here are some tests you can do:
- try pinging the TSM Server from the TSM Client using a large packet and repeat 100 times:
ping -n 100 -l 65500 {tsm server IP address}
- do the same test from the server to the client
- try outside of TSM to copy a large file from the client to the server, and vice-versa
- try outside of TSM to copy a several small files from the client to the server, and vice-versa

If any of those fail, the issue is outside TSM for sure.
 
Finally this seems to have been solved!

After changing some settings on the NIC (CPU offload) it somehow magically started to work as expected again.
 
Just wanted to thank everyone for their interest to help out, the schedules are now (finally) running as expected again!
 
Hi,

We are also facing same issue....

sessSendVerb: Error sending Verb, rc: -50

Could you please let us know the changes u made to NIC (CPU offload) settings.

Thanks in advance!
 
Back
Top