RE: [nv-l] Netview Redundancy / Failover

There are several "best practices" that an IBM services team has
developed over the years for NetView redundancy. There are two basic
designs, each with slight variations.

Requirements:

      - Only administer one map.


Best Practices:
1.  Use a pair of MLMs for receiving unsolicited traps (a pair, for
redundancy). When your NetView servers discover them, NetView
automatically adds itself to each MLM as a trap receiver. With a single
SNMP set you can turn trap forwarding from the backup MLM on or off, which
is easy to automate in your MLM failover script. This also opens the door
to setting trap filters at the MLMs to block all unsolicited traps you
don't care about.
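
As a rough sketch of how the forwarding toggle might be scripted, the
snippet below builds a Net-SNMP snmpset command line for the backup MLM.
The host name, community string, the 1/0 value encoding and above all the
OID are assumptions for illustration; the real forwarding object has to be
looked up in the MLM MIB.

```python
import subprocess

def build_toggle_command(host, community, oid, enable):
    """Return the snmpset argument list that switches forwarding on or off.

    `oid` must be the trap-forwarding object from the MLM MIB; it is passed
    in rather than hard-coded because this sketch does not know it.
    """
    value = "1" if enable else "0"  # assumed encoding: 1 = on, 0 = off
    return ["snmpset", "-v1", "-c", community, host, oid, "i", value]

def toggle_forwarding(host, community, oid, enable):
    """Run snmpset; raises CalledProcessError if the set fails."""
    subprocess.run(build_toggle_command(host, community, oid, enable),
                   check=True)
```

A failover script would call toggle_forwarding() with enable=True for the
backup MLM when the primary MLM stops answering, and with enable=False
once the primary is back.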

2.  If you need failover within about 1 minute -- If you are a service
provider that MUST manage customer networks 24x7 with no downtime, then a
set of 3 NetView servers is recommended. This gives you an Administrative
server, an Operations peer1 server and an Operations peer2 server. It
allows your administrative team to add and remove managed devices without
impacting operations. At end of day, the database is copied from the
Administrative server to the peer server not currently being used. At
shift change, operations can move to the new database by opening maps on
the updated server. This means that peer1 will be production on Monday,
peer2 on Tuesday, etc. The other peer is always available for redundancy.
In a disaster you can fall back to the Administrative server.
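
The daily rotation can be sketched as below. This is only an illustration:
the anchor date and the strict day-by-day alternation are assumptions, and
the peer names are just labels.

```python
from datetime import date

PEERS = ("peer1", "peer2")
# Assumed anchor: a Monday on which peer1 is the production peer.
ANCHOR = date(2024, 1, 1)

def production_peer(day):
    """Peer that operations uses on `day`, alternating every day."""
    return PEERS[(day - ANCHOR).days % 2]

def copy_target(day):
    """Peer that receives the Administrative database at end of `day`.

    This is the idle peer, which becomes production on the following day.
    """
    return PEERS[1] if production_peer(day) == PEERS[0] else PEERS[0]
```

With strict daily alternation the weekday of each peer shifts from week to
week (7 is odd), so a site that wants peer1 on every Monday would need a
different rule.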

In this scenario we have also created a process to move many NetView
clients from one peer server to another in a controlled fashion. This is
now easier with Web clients, since you don't have the map sync process
using ovwdb.

3.  If you need failover within about 10 minutes -- If you are not a
service provider, then a set of 2 NetView servers is recommended. Again,
single map administration is important. Also, if you have event
automation, rules or forwarding to TEC, those activities must check
whether the NetView server is in "production" or "backup" mode. Scripts
can do that easily. Using MLMs to control trap flow helps. In this case,
you have to decide how often to update the backup NetView. At most
customers I found this was done in a weekly maintenance window:
      - backup server kicked into production mode and verified
      - primary server brought down
      - backup of primary server database, config files
      - primary server brought up and verified
      - backup server brought down
      - restore of primary server database on backup server, config files
      - backup server kicked into backup mode
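
One way the production/backup check could look is sketched below. The flag
file and the function names are hypothetical, not part of NetView: the
steps that "kick" a server into a mode would write the flag, and
automation or TEC forwarding scripts would read it before acting.

```python
from pathlib import Path

VALID_MODES = ("production", "backup")

def set_mode(flag_file, mode):
    """Record the server's mode; called by the failover/maintenance steps."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    Path(flag_file).write_text(mode + "\n")

def get_mode(flag_file):
    """Read the mode; a missing or invalid flag counts as backup."""
    try:
        mode = Path(flag_file).read_text().strip()
    except OSError:
        return "backup"
    return mode if mode in VALID_MODES else "backup"

def forward_event(flag_file, event, send):
    """Call send(event) only in production mode; return True if sent."""
    if get_mode(flag_file) != "production":
        return False
    send(event)
    return True
```

Defaulting to backup when the flag is missing is a deliberate choice in
this sketch: a server in an unknown state stays quiet rather than sending
duplicate events.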

4. If you need failover within about 30 minutes (more or less) -- You can
configure NetView in an HA environment, using a shared disk to share the
NetView database between two servers. The failover time will equal
detection time + NetView startup time. The risk is that a database
problem, for example with a newly added device, is the most likely cause
of a production failure in the first place. Thus any solution that relies
on a single NetView database is at the greatest risk of not working when
your production NetView has a problem.
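
The failover-time arithmetic is simple enough to check in a few lines. The
durations used in the usage note are made-up values for illustration, not
measurements.

```python
from datetime import timedelta

def failover_time(detection, startup):
    """Failover time = detection time + NetView startup time."""
    return detection + startup

def meets_target(detection, startup, target):
    """True if the combined failover time fits within the target."""
    return failover_time(detection, startup) <= target
```

For example, with an assumed 5 minutes to detect the failure and 20
minutes for NetView to start, failover_time(timedelta(minutes=5),
timedelta(minutes=20)) gives 25 minutes, which fits a 30 minute target.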

Summary
Option 1 can and should be used in all cases.
Options 2 and 3 work well; choose between them based on your needs. Both
take custom work by yourself or experienced NetView consultants.
Option 4 is less recommended than Option 3.
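
As a small illustration of choosing by failover requirement, the sketch
below lists which options fit a given target. It uses only the rough
failover figures quoted above (1, 10 and 30 minutes); the labels and the
function name are my own.

```python
# Approximate failover times in minutes, as quoted for each option above.
FAILOVER_MINUTES = {
    "Option 2 (3 servers)": 1,
    "Option 3 (2 servers)": 10,
    "Option 4 (HA, shared disk)": 30,
}

def candidate_designs(target_minutes):
    """Return the options whose rough failover time fits the target.

    Option 1 (the MLM pair) is always included; the rest are listed
    fastest first.
    """
    by_speed = sorted(FAILOVER_MINUTES.items(), key=lambda kv: kv[1])
    fits = [name for name, minutes in by_speed if minutes <= target_minutes]
    return ["Option 1 (MLM pair)"] + fits
```

The function only filters on failover time; it does not capture the
single-database risk that makes Option 4 less attractive than Option 3.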

Kind regards,
Stephen Hochstetler              shochste AT us.ibm DOT com
International Technical Support Organization  - Austin
Office - 512-436-8564                      FAX - 512-436-9326

ITSO redbooks at  http://www.redbooks.ibm.com