This issue turned out to be caused by a recent update to our intrusion
detection system, which mistook many of the clients for intruders.
On Jun 19, 2012, at 1:39 PM, Chester Martin wrote:
> Hello,
> At first glance you would think it's a timeout issue that adjusting the
> keepalive values would fix. Since you just upgraded to a new DD OS, you
> may want to check whether any errors are being reported on the Data Domain
> side. Also, see if there is a particular time when these errors happen.
> For example, do clients kick off at 5 PM, then the error happens at 6 PM,
> and all the client backups that were running at that time fail with the
> "connection dropped" error? You didn't mention the Data Domain devices
> going offline, but if the NetWorker server could not talk to the Data
> Domain system, your devices should go offline. However, if you have "auto
> media management" enabled on the DD devices, NetWorker will attempt to
> bring them back online.
>
> I would think that increasing the client parallelism would add to the problem
> rather than help it. Increasing the client parallelism sends more streams to
> the DD box, which may slow your backups down. If any parallelism needs to be
> adjusted, I would do it at the group level rather than the client level. Is
> it possible your group parallelism is set to 0 and some of the clients are
> waiting on resources and timing out?
>
> -----Original Message-----
> From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU]
> On Behalf Of Stanley R. Horwitz
> Sent: Tuesday, June 19, 2012 10:11 AM
> To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
> Subject: [Networker] Server lost connection problem
>
> About two weeks ago, I upgraded my NetWorker server from 7.6.1 to NetWorker
> 7.6.3.4.Build.879. This server backs up 336 clients (mostly Windows and
> Linux). All of the clients back up to a Data Domain system using Boost and a
> few are cloned nightly to LTO-5 tape. I have 11 Boost devices configured for
> direct use on the server and each Boost device has its max sessions set to a
> value of 10. No storage nodes are involved in this data zone.
>
> After we upgraded our DD system to the latest OS, backup throughput for the
> larger servers improved, but for the past few days I have been noticing an
> unusual number of backup failures across several groups of both Linux and
> Windows clients, including some that also run NetWorker 7.6.3. The error in
> the savegroup report is always the same: "connection dropped."
>
> There do not appear to be any problems with network connectivity, and in
> most cases these clients do not back up through a firewall. I do not see any
> errors on the clients or the NetWorker server when I run "netstat -i."
> Incremental backups of the same clients also work without issue.
>
> I reviewed the NetWorker tuning guide on Powerlink, but I haven't yet run any
> of the uasm tests it recommends. The guide did recommend increasing client
> parallelism to 12, which I changed a few minutes ago. Most of the clients
> had the default parallelism setting of 4.
>
> If anyone has any ideas on how to investigate this problem, please let me
> know. I am skeptical that testing with uasm will shed any light on this
> issue.