ADSM-L

Re: How do you verify the Completion and Accuracy of Backups and Restores?

2006-11-09 13:21:42
Subject: Re: How do you verify the Completion and Accuracy of Backups and Restores?
From: "Prather, Wanda" <Wanda.Prather AT JHUAPL DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 9 Nov 2006 13:20:33 -0500
Yes, it frequently takes some work to customize the report.
But if customized properly, it won't be that large, even for 100
clients, to go through reliably.

Strategies my customers have used:

1) You need to set up the filters on the ACTIVITY LOG Section of the
report so it only includes ACTIONABLE messages.
For example, you don't need to see all the entries for files that failed
to back up because they are in use, because they are listed in a later
section of the report as well.  So you just exclude that message number.

Likewise, you will find a lot of nuisance messages that show up as
errors, when they were really just a syntax error on a command the admin
typed in.  You can exclude those as well, and you will get the report
down to just the actionable messages.

2) Turn OFF the MISSED FILES SUMMARY section so it isn't produced.  That
just seems pointless to me; you can't do anything with it.  The relevant
information is in the MISSED FILE DETAILS section.

3) Turn OFF any other sections that you don't need to see; for example,
some of the graphs may not be needed every day.

4) Set the option that says "use collapsable sections"; then the viewer
can look at just 1 section at a time.
On most days, the only sections that have to be reviewed will be the
CUSTOM SUMMARY, the ACTIVITY LOG DETAILS, ADMIN SCHEDULE STATUS, and
CLIENT SCHEDULE STATUS.  MISSED FILE DETAILS should be reviewed about
once a week, and the admin should take action on the files that are
being missed regularly.

5) The term "successful completion of ALL backups" doesn't mean much to
TSM; there are plenty of sites that run backups continuously, 24 hours a
day, and schedules overlap.  So which schedules have completed depends on
what time of day you ask.  What you should do with TOR is have the
TSM administrator adjust the client schedule windows so that there is a
time gap BETWEEN the schedules where you can run the TOR report.  That
will eliminate MOST of the schedules that show up as not completed,
unless a client is hung and running hours behind schedule.

6) Consider distributing some of the workload.  If a client misses or
fails a backup, is your TSM admin supposed to GO to that client and
rerun the backup, or do they just notify the owner of the machine?  Can
you make the client owner responsible for dealing with it?  In that
case, all you have to do is set up TOR so that it sends an email to the
client owner with the schedule status.
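
The filtering idea in strategy 1 can be sketched outside TOR as well.
This is only an illustration in Python, not how TOR itself applies its
filters (those are set in the TOR report configuration).  The function
name, the exclusion list, and the sample log lines below are all
made up for the example; the message numbers you exclude should come
from your own activity log.

```python
import re

# Hypothetical exclusion list: message numbers one site decided are not
# actionable (e.g. "file in use" and mistyped admin commands).
EXCLUDED = {"ANE4987E", "ANR2000E"}

# A TSM message number: three-letter prefix, four digits, severity letter.
MSG_ID = re.compile(r"\b(AN[A-Z]\d{4}[IWES])\b")


def actionable(lines, excluded=EXCLUDED):
    """Return warning/error lines whose message number is not excluded."""
    kept = []
    for line in lines:
        match = MSG_ID.search(line)
        if match is None:
            continue  # no recognizable message number on this line
        msg_id = match.group(1)
        # Keep W(arning), E(rror) and S(evere); drop I(nformational).
        if msg_id[-1] in "WES" and msg_id not in excluded:
            kept.append(line)
    return kept


# Made-up activity log lines, for illustration only.
sample = [
    "11/09/2006 02:10:11 ANE4987E Error processing '/home/app/data.lock': "
    "the object is in use by another process",
    "11/09/2006 03:00:00 ANR2000E Unknown command - QUERY ACTLGO.",
    "11/09/2006 03:15:42 ANR2579E Schedule DAILY in domain STANDARD "
    "for node FILESRV1 failed (return code 12).",
    "11/09/2006 04:00:00 ANR2017I Administrator ADMIN issued command: "
    "QUERY EVENT * *",
]

for line in actionable(sample):
    print(line)
```

With the sample above, only the failed-schedule line is left, which is
the kind of message somebody actually has to act on.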

FWIW, 100 TSM clients isn't actually all that many in TSM land.  I think
your TSM admin just needs some help getting a grip on what needs to be
addressed.

 


 

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Wesley Smith
Sent: Thursday, November 09, 2006 12:01 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: FW: [ADSM-L] How do you verify the Completion and Accuracy of
Backups and Restores?

 Thanks, Wanda.

I believe the TOR tool is what they are currently using.  I think a
large part of their problem is that the reports are so large, it is
impossible for one person to go through the report every day with any
amount of reliability.  I know that they are responsible for handling
the backups of well over 100 servers and that it is being done by just
one person.  I've seen the report as currently generated and noted a
number of problems with it.  The report runs at a scheduled time rather
than having job triggers that would kick it off after the successful
completion of all backups.  As a result, the report will show backups
that started but without showing that they have completed.  On some
days, there will be very few of these.  On other days, quite a few.
Throw stuff like that into the mix of the real errors and other
"pseudo errors" and you find yourself trying to chase down a lot of
non-errors.

I will be passing along to the appropriate people that perhaps there is
some additional filtering that could be done to these reports to reduce
their size to something that is more manageable.  I'm hoping that we
will be able to come up with some filtering and scripting aids that will
help to automate this process as much as possible and reduce to a
minimum the need for the Tivoli support person to spend a lot of time
every day just reviewing the night's work.

Thanks again for your time and help.

Wesley

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Prather, Wanda
Sent: Wednesday, November 08, 2006 1:35 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: [ADSM-L] How do you verify the Completion and Accuracy of
Backups and Restores?

Ditto.

To start, set up the TSM Operational Reporter (TOR).
It is free as part of the product.
It works real well right out of the box, but can also be customized to
do some clever things.  

If your TSM folks are running a non-Windows TSM server, they may not be
familiar with it, as it is a Windows application.
If your TSM server is Windows, TOR gets installed when the server is
installed, but you still need to configure it.

If your TSM server isn't Windows, you'll need to install TOR separately
on a Windows host.
But it doesn't have to be a Windows server or anything fancy; you can
run it on your desktop.

It will tell you, every day, EXACTLY which backup schedules completed or
did not; which clients had missed files, and what those files were.

It also scrapes the TSM activity log for any server-end messages that
need attention (although there are also frequently nuisance messages
that you will want to filter out, using the customization available in
TOR).

You can have the reports generated as HTML that is available for
browsing, or mailed to you.
Sounds like nobody has done this yet.  
SOMEBODY SHOULD REVIEW THIS REPORT EVERY DAY.  
AND ACTUALLY ATTEND TO THE THINGS THAT NEED ATTENDING.

You can read about TOR in the "monitoring your server" section of the
TSM Administrator's Guide.

If the missed backups you are referring to are databases that are being
backed up using a TSM Data Protection agent (backing up through the
API), you may have to be creative about gathering the reports from those
logs (esp. with Oracle - I think you have to actually view the RMAN logs
to guarantee that those worked correctly).  But I have had success
writing very small scripts (e.g. perl) that scrape the information out
of those logs and send it to be displayed in the TOR Daily report.
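
As a rough idea of what such a small scraping script might look like,
here is a sketch in Python rather than perl.  It is not the script
described above, just an illustration: it relies on RMAN and Oracle
reporting errors with RMAN-nnnnn and ORA-nnnnn codes, and the function
name and the sample log text are made up.  How the resulting summary
gets into the TOR report is left out.  Note also that finding no error
codes is not by itself proof that a backup was complete.

```python
import re

# Error lines start with an RMAN-nnnnn or ORA-nnnnn code and a colon.
ERROR_LINE = re.compile(r"^\s*((?:RMAN|ORA)-\d{5}):\s*(.*)$")

# Codes RMAN uses for the banner around an error stack, not errors themselves.
BANNER_CODES = {"RMAN-00569", "RMAN-00571"}


def summarize_rman_log(text):
    """Return (status, errors) for the text of one RMAN log."""
    errors = []
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match and match.group(1) not in BANNER_CODES:
            errors.append((match.group(1), match.group(2).strip()))
    status = "FAILED" if errors else "OK"
    return status, errors


# Made-up log excerpt, for illustration only.
sample_log = """\
Starting backup at 08-NOV-06
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on t1 channel at 11/08/2006 02:14:07
ORA-19506: failed to create sequential file, name="df_1", parms=""
"""

status, errors = summarize_rman_log(sample_log)
print(status)
for code, message in errors:
    print(code, message)
```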

Wanda Prather
"I/O, I/O, It's all about I/O"  -(me)
 

 

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Mark Stapleton
Sent: Wednesday, November 08, 2006 11:44 AM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: How do you verify the Completion and Accuracy of Backups
and Restores?

From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Wesley Smith
>       My problem is that they (that sister agency) do not seem to have a
>reliable way of verifying that all backups have been properly
>completed. They don't even seem to have a way to know that all files
>(that need to be backed up) are being backed up.  I've seen the reports
>that get generated during the backup process and I am definitely
>unimpressed.  Backups start and backups complete.  There doesn't seem to
>be anything that says how many rows are copied or how large the files
>are or anything else that could be used for verifying the accuracy of
>the backups.  They tell my folks that we should trust Tivoli is doing
>the job correctly.  Trust is the problem....

Let's start there. When you look at the dsmsched.log file, which contains
a record of all scheduled backups and their outcomes, you should have a
record of what files are backed up and the size of the files, and the
timestamps give an idea of how long it took to back each file up. (This
is assuming that the QUIET option is not present in the client option
file or the client option set designated for that TSM client.) If you're
using the specialized TSM agents for databases or mail apps, the
scheduled backup logs contain fairly granular information about
individual file backups. What more do you need?
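
To give an idea of how little code it takes to pull numbers out of
dsmsched.log, here is a minimal Python sketch.  It assumes the summary
lines look like "Total number of objects backed up:  312", which is
what the backup-archive client typically writes at the end of a
scheduled backup; the exact layout can vary with client version and
language.  The function name and the sample excerpt are made up.

```python
import re

# Assumed layout: "<date> <time> Total number of objects <what>: <count>"
STAT = re.compile(r"Total number of objects ([a-z ]+):\s*([\d,]+)")


def schedule_stats(log_text):
    """Collect the 'Total number of objects ...' counters from log text.

    If the text covers several scheduled runs, later values overwrite
    earlier ones, so pass in one run at a time.
    """
    stats = {}
    for line in log_text.splitlines():
        match = STAT.search(line)
        if match:
            count = int(match.group(2).replace(",", ""))
            stats[match.group(1).strip()] = count
    return stats


# Made-up excerpt, for illustration only.
sample = """\
11/08/2006 02:31:05 --- SCHEDULEREC STATUS BEGIN
11/08/2006 02:31:05 Total number of objects inspected:   45,210
11/08/2006 02:31:05 Total number of objects backed up:      312
11/08/2006 02:31:05 Total number of objects failed:           2
11/08/2006 02:31:05 Total number of bytes transferred:   1.21 GB
11/08/2006 02:31:05 --- SCHEDULEREC STATUS END
"""

stats = schedule_stats(sample)
if stats.get("failed", 0) > 0:
    print("objects failed:", stats["failed"])
```

A nonzero "failed" count is the cue to go look at the individual file
entries earlier in the same log.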

>       We have needed to have restores done on just a few databases in the
>past and the restores were not complete and up to date.  In each case
>we were able to rebuild the data using logs maintained within the
>applications but that should not have been necessary.  Each recovery was
>done at a point after a backup and before additional processing had been
>done within the apps so they should have been complete.  In each case,
>the folks who run Tivoli for us were able to track down and show that
>problems had occurred during the processing of the backups.  They did
>this through circumstantial evidence and in each case once again said
>that they have no way of verifying that the backups are actually good.
>I hear a lot about the difficulty of trying to write a program to
>process the Tivoli log files.
>
>       I think I'm at wit's end with these folks and the product.  I know
>that the people are competent and I suspect that the product (like
>other things available from IBM) really is weak on the reporting and
>verification issue.

While TSM itself does lack some reporting functionalities (particularly
when it comes to client backups and restores), I have to say this:

On every properly maintained and monitored TSM system I have touched in
the 12 years I've administered and engineered this product, I have
*never* lost a single byte of information. Period. If you cannot do a
restore because of "lost" data, something is happening during backups
that is not being caught at the time of the backups.

>I'm hoping that someone out there in the Big Wide World has already 
>solved this problem with an in-house or third-party solution.  Sorry 
>for being so long winded.  Any ideas...?

I think what is needed here is greater familiarity with TSM and its
proper administration. Proper verification of good backups is best done
by regular DR practice of planned bare-metal restores of chosen
machines. If you can take data backed up by TSM and restore a given
machine in a DR environment, and the machine comes back properly, you
know the job is being done right. If it doesn't, *then* you dig into
*why*.

BTW, there are responses to this thread advocating ServerGraph and
Bocada for reporting and monitoring. Be aware that those applications do
a fine job of monitoring server operations. (Well, ServerGraph does,
anyway.) Their reporting, however, is not granular enough to indicate
whether a given file is being backed up properly.

--
Mark Stapleton (mark.s AT evolvingsol DOT com)
Senior TSM consultant