This is a very complex subject. I have been doing extensive network
monitoring for years with an eye towards being proactive. When networks
were simpler and built on shared Ethernet (and the data less distributed),
monitoring the physical layer and looking for the development of errors and
traffic growth was sufficient.
I tested services like web response, mail response, NFS response, font
server response, etc., but was not able to correlate traffic volume to a
server with its response time. (Most of our servers, much to my dismay, do
not run SNMP.)
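A service test of the kind mentioned above can be as simple as timing a raw HTTP fetch against a server. Here is a minimal sketch in Python; the host, port, and timeout values are illustrative assumptions, not the actual tests we run:

```python
import socket
import time

def web_response_time(host, port=80, path="/", timeout=30):
    """Time a minimal HTTP GET; return seconds to the first reply bytes,
    or None if the connection fails or times out."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
            s.sendall(request.encode("ascii"))
            s.settimeout(timeout)
            if not s.recv(1024):  # wait for the first chunk of the reply
                return None
    except OSError:
        return None  # connection refused, unreachable, timed out, etc.
    return time.time() - start
```

Run periodically from cron and logged, a test like this gives exactly the kind of history that lets you page someone on timeouts and build up a response-time baseline.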
With today's switched networks, where one has one node per switch port, it
is possible to do both response-time testing and port-utilization
monitoring and display them side by side. I am currently working on this.
It is important today in order to be proactive and discover potential
problems before the users complain. Yesterday, for example, we had
complaints of slow AFS response. I do not have an automatable AFS test
yet, but by looking at the port stats of the Catalyst that the particular
server was attached to, I could see that the traffic to the server was
well above normal. With some study, one could probably develop a baseline
and set thresholds for alerting. If I had data for an extended period of
time for both traffic and response time, I would be able to quickly
postulate whether a slowdown was happening because of something in the
server, or whether it was at that moment being subjected to a higher
volume of activity.
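A baseline of the kind described could be built from historical samples: summarize the traffic readings for a port, then flag any new reading that is well above the norm. A minimal sketch follows; the sample values and the three-standard-deviation threshold are assumptions for illustration, not figures from our network:

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical traffic samples (e.g. hourly octets on a port)."""
    return mean(samples), stdev(samples)

def above_normal(value, baseline, n_sigma=3.0):
    """True if a new sample exceeds the baseline mean by n_sigma deviations."""
    avg, sd = baseline
    return value > avg + n_sigma * sd

# Hypothetical hourly traffic readings for one switch port:
history = [110, 95, 102, 98, 120, 105, 99, 101, 97, 108]
base = build_baseline(history)
print(above_normal(104, base))  # an ordinary reading -> False
print(above_normal(200, base))  # well above normal  -> True
```

The same comparison works for response times: with both series on hand, you can tell at a glance whether a slowdown coincides with a traffic spike or points at the server itself.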
There is much more to this topic than I have time to discuss here.
However, we have used this type of analysis many times. The most recent
was a few months ago, when it proved that there was a configuration
problem with a server. Our web server response tests were showing that it
was taking about 30-60 seconds (with frequent timeouts) to get a web
response. The users weren't really complaining; only the webmaster, who
got paged when it timed out, was complaining. Further investigation
revealed that the web server was not configured properly. If we had not
had the testing going on, we might not have discovered this situation
until the users complained.
There is actually a moral to the above story. When the webmaster
complained to me about being paged too frequently (when it appeared that
there was no problem), I offered suggestions like: well, let me try
several times before I page you. I really did not want to do that,
though, as I had been testing web service response time for years and we
hadn't had this problem. This is where baselines are especially
important. If I hadn't had the historical experience, I might have
changed the test and masked the problem.
Hope this helps.
Also check out:
This discusses our experience with WAN Monitoring.
" Of course the opinions expressed here are my own. "
Connie Logg CAL AT SLAC.Stanford DOT Edu ph: 650-926-2879
Network Management and Performance Analyst
SLAC (MS 97), P.O. Box 4349, Stanford, CA 94309
"Happiness is found along the way, not at the end of the road."