Networker

Re: [Networker] Server lost connection problem

2012-06-19 13:45:00
Subject: Re: [Networker] Server lost connection problem
From: Chester Martin <cmartin AT SPP DOT ORG>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 19 Jun 2012 12:39:38 -0500
Hello,
At first glance you would think it is a timeout issue that adjusting the keep
alive values would fix.  Since you just updated to a new DDOS, you may want
to see if there are any errors being reported on the Data Domain side.  Also,
see if there is a certain time when these errors happen.  Meaning, are there
clients that kick off at 5pm, the error happens at 6pm, and all the client
backups that were running at that time cancel with the "connection dropped"
error?  You didn't mention anything about the Data Domain devices going
offline, but if there was an issue with the NetWorker server not talking to
Data Domain, your devices should go offline.  But if you have "auto media
management" enabled on the DD devices, NetWorker will attempt to bring them back
online.
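
One rough way to check for a time pattern is to bucket the failures by hour
of day.  The sketch below assumes you have already pulled the failure times
out of the savegroup reports by hand; the record layout, the timestamp
format, and the client names are all made up for illustration and are not a
NetWorker log format.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(failures):
    """Count failures per hour of day.

    `failures` is an iterable of (timestamp_string, client_name) pairs.
    The "YYYY-MM-DD HH:MM" timestamp format is an assumption for this sketch.
    """
    counts = Counter()
    for stamp, _client in failures:
        hour = datetime.strptime(stamp, "%Y-%m-%d %H:%M").hour
        counts[hour] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    ("2012-06-18 18:02", "client-a"),
    ("2012-06-18 18:03", "client-b"),
    ("2012-06-18 18:05", "client-c"),
    ("2012-06-19 02:40", "client-a"),
]

for hour, count in sorted(failures_by_hour(sample).items()):
    print(f"{hour:02d}:00  {count} failure(s)")
```

If most of the failures land in the same hour, that points toward something
scheduled (a group start, cloning, or a job on the Data Domain side) rather
than a random network problem.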

I would think that increasing the client parallelism would add to the problem
instead of helping it.  Increasing the client parallelism will cause you to have
more streams going to the DD box, which may slow your backups down.  If any
parallelism needs to be adjusted, I would do it at the group level and not the
client level.  Is it possible you have your group parallelism set to 0 and some
of the clients could be waiting on resources and timing out?
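
To put rough numbers on it, here is a back-of-the-envelope sketch using the
figures from the original message quoted below (11 Boost devices, max
sessions of 10 each, client parallelism going from 4 to 12).  The number of
clients running at the same time is a hypothetical value I picked, and the
stream count is only an upper bound: a client will not necessarily open as
many streams as its parallelism allows, and group and server parallelism
also limit what actually runs.

```python
def session_capacity(devices, max_sessions_per_device):
    """Total sessions the configured devices can accept at once."""
    return devices * max_sessions_per_device


def potential_streams(concurrent_clients, client_parallelism):
    """Upper bound on streams if every running client used its full parallelism."""
    return concurrent_clients * client_parallelism


capacity = session_capacity(11, 10)  # 11 Boost devices x 10 sessions each
concurrent_clients = 20              # hypothetical; not from the original message

for parallelism in (4, 12):
    demand = potential_streams(concurrent_clients, parallelism)
    status = "exceeds" if demand > capacity else "fits within"
    print(f"parallelism {parallelism}: up to {demand} streams, "
          f"{status} {capacity} device sessions")
```

With these assumed numbers, the default of 4 stays under the device session
limit, while 12 could ask for more than twice what the devices accept, which
would leave save sets waiting for a session.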

-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On 
Behalf Of Stanley R. Horwitz
Sent: Tuesday, June 19, 2012 10:11 AM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: [Networker] Server lost connection problem

About two weeks ago, I upgraded my NetWorker server from 7.6.1 to NetWorker 
7.6.3.4.Build.879. This server backs up 336 clients (mostly Windows and Linux). 
All of the clients back up to a Data Domain system using Boost and a few are 
cloned nightly to LTO-5 tape. I have 11 Boost devices configured for direct use 
on the server and each Boost device has its max sessions set to a value of 10. 
No storage nodes are involved in this data zone.

After we upgraded our DD system to the latest OS, the backups of larger servers
improved in their throughput, but for the past few days, I have been noticing an
unusual number of backup failures for several groups of clients, both Linux and
Windows, including some that also have NetWorker 7.6.3 on them. The error in
the savegroup report is always the same: "connection dropped."

There do not appear to be any problems with network connectivity, and in
most cases, these clients do not back up through a firewall. I
do not see any errors on the clients or the NetWorker server when I use
"netstat -i." Incremental backups of the same clients also work without issue.

I reviewed the NetWorker tuning guide on PowerLink, but I haven't yet run any of
the uasm tests it recommends. It did contain a
recommendation to increase client parallelism to 12, which I changed a few
minutes ago. Most of the clients had the default setting of 4 for their
parallelism.

If anyone has any ideas on how to investigate this problem, please let me know. 
I am skeptical that doing tests with uasm will bring forth any enlightenment on 
this issue.
