Re: [Networker] Server lost connection problem

2012-06-21 13:34:27
Subject: Re: [Networker] Server lost connection problem
From: "Stanley R. Horwitz" <stan AT TEMPLE DOT EDU>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 21 Jun 2012 17:28:29 +0000
This issue turned out to be the result of many clients being mistaken for 
intruders as a result of a recent update in our intrusion detection system. 

On Jun 19, 2012, at 1:39 PM, Chester Martin wrote:

> Hello,
> At first glance you would think it'd be a timeout issue that adjusting the 
> keep alive values would fix.  Being that you just updated to a new DDOS you 
> may want to see if there are any errors being reported on the data domain 
> side.  Also, see if there is a certain time when these errors happen.  
> Meaning, are there clients that kick off at 5pm and the error happens at 6pm 
> and all the client backups that were running at that time cancel with the 
> "connection dropped" error?  You didn't mention anything about the data 
> domain devices going offline, but if there was an issue with the networker 
> server not talking to data domain your devices should go offline.  But if you 
> have "auto media management" enabled on the dd devices networker will attempt 
> to bring them back online.
> 
> I would think that increasing the client parallelism would add to the problem 
> instead of help it.  Increasing the client parallelism will cause you to have 
> more streams going to the dd box, which may slow your backup down.  If any 
> parallelism needs to be adjusted I would do it from the group level and not 
> the client level.  Is it possible you have your group parallelism set to 0 
> and some of the clients could be waiting on resources and timing out?
> 
> -----Original Message-----
> From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] 
> On Behalf Of Stanley R. Horwitz
> Sent: Tuesday, June 19, 2012 10:11 AM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: [Networker] Server lost connection problem
> 
> About two weeks ago, I upgraded my NetWorker server from 7.6.1 to NetWorker 
> 7.6.3.4.Build.879. This server backs up 336 clients (mostly Windows and 
> Linux). All of the clients back up to a Data Domain system using Boost and a 
> few are cloned nightly to LTO-5 tape. I have 11 Boost devices configured for 
> direct use on the server and each Boost device has its max sessions set to a 
> value of 10. No storage nodes are involved in this data zone.
> 
> After we upgraded our DD system to the latest OS, the backups of larger 
> servers improved in their throughput, but for the past few days, I am 
> noticing an unusual number of backup failures for several groups of clients, 
> both Linux and Windows, including some that also have NetWorker 7.6.3 on 
> them. The error in the savegroup report is always the same: "connection dropped."
> 
> There do not appear to be any problems with network 
> connectivity and in most cases, these clients do not back up via a firewall. 
> I do not see any errors on the clients or the NetWorker server when I use 
> "netstat -i." Incremental backups of the same clients also work without 
> issue. 
> 
> I reviewed the NetWorker tuning guide on PowerLink, but I haven't done any of 
> the tests they recommended yet with uasm, although it did contain a 
> recommendation to increase client parallelism to 12, which I changed a few 
> minutes ago. Most of the clients had the default setting of 4 for their 
> parallelism.
> 
> If anyone has any ideas on how to investigate this problem, please let me 
> know. I am skeptical that doing tests with uasm will bring forth any 
> enlightenment on this issue.
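
The parallelism point raised in the reply can be checked with quick arithmetic. The sketch below uses only the figures quoted in the thread (11 Boost devices, max sessions of 10 each, 336 clients, client parallelism of 4 or 12). It assumes that max sessions acts as a per-device cap on concurrent streams and that every client runs at full parallelism at the same moment, which is a deliberate worst case and not how staggered groups normally behave. The function names are hypothetical and are not part of NetWorker.

```python
# Rough session-budget check using the numbers from the thread.
# Assumption: "max sessions" caps concurrent streams per Boost device.
# Assumption: worst case, all clients run at once at full parallelism.

def session_budget(devices: int, max_sessions_per_device: int) -> int:
    """Total concurrent sessions the configured devices can accept."""
    return devices * max_sessions_per_device


def worst_case_streams(clients: int, client_parallelism: int) -> int:
    """Upper bound on streams if every client saturates its parallelism."""
    return clients * client_parallelism


if __name__ == "__main__":
    capacity = session_budget(devices=11, max_sessions_per_device=10)
    for parallelism in (4, 12):
        demand = worst_case_streams(clients=336, client_parallelism=parallelism)
        print(
            f"client parallelism {parallelism}: "
            f"up to {demand} streams against {capacity} device sessions"
        )
```

Under these assumptions, raising client parallelism from 4 to 12 triples the potential demand while the device-side budget stays the same, which is consistent with the suggestion to tune parallelism at the group level instead.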
