ADSM-L

Re: [ADSM-L] Backup fails with no error message

2014-08-28 09:43:11
Subject: Re: [ADSM-L] Backup fails with no error message
From: Andrew Raibeck <storman AT US.IBM DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 28 Aug 2014 09:41:05 -0400
Thomas,

I haven't forgotten about this... I did find one place in our code where,
if a memory allocation fails, a message is logged only to a trace file (not
the dsmerror.log file). A fix for this is currently targeted for a future release.

I do not know if this covers all the various scenarios you encountered, but
it sounds closest to the issue you initially reported (the RC 12 with no
other error message).

Regards,

- Andy

____________________________________________________________________________

Andrew Raibeck | Tivoli Storage Manager Level 3 Technical Lead |
storman AT us.ibm DOT com

IBM Tivoli Storage Manager links:
Product support:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager

Online documentation:
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Documentation+Central/page/Tivoli+Storage+Manager
Product Wiki:
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Storage+Manager/page/Home

"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2014-07-22
10:36:01:

> From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
> To: ADSM-L AT vm.marist DOT edu
> Date: 2014-07-22 10:39
> Subject: Re: Backup fails with no error message
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> Andy,
>
> It looks like the problem was in fact a shortage of memory. The
> problem started occurring again this past Saturday. A backup with
> tracing after an earlier occurrence of the problem ran out of stack
> space. I attempted to raise the stack size limit from its default of
> about 32 MB to its hard limit of about 4 GB before running another
> backup with tracing. For some reason I only got half the requested
> limit. The backup failed, but produced a useful error message for
> the first time in the history of this problem: an ANS1225E message
> indicating that the client software was unable to obtain memory
> needed for file compression. I was able to rerun the backup
> successfully after using the 'ulimit' command to allow unlimited
> memory size and data segment size. The default soft limits are in
> fact much smaller than the corresponding values on most of our other
> systems. The default data segment size is about 128 MB and the
> default memory size is about 1 GB. I am currently trying to get the
> system vendor to sign off on a request to allow unlimited memory and
> data segment sizes for backups of the resource group disks.
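
For anyone wanting to try the same workaround, here is a minimal sketch of raising the soft limits for a single backup run only. The function name run_with_raised_limits is hypothetical, and the flags -d (data segment) and -m (memory size) are an assumption based on the usual ksh/bash 'ulimit' built-in; check your own shell before relying on it.

```shell
# Hypothetical helper: run one command with the soft data-segment (-d) and
# memory (-m) limits raised as far as the hard limits allow. The change is
# confined to a subshell, so the calling shell keeps its original limits.
run_with_raised_limits() {
    (
        for flag in -d -m; do
            # Try "unlimited" first; fall back to the hard limit if refused.
            ulimit -S "$flag" unlimited 2>/dev/null ||
                ulimit -S "$flag" "$(ulimit -H "$flag")" 2>/dev/null ||
                echo "warning: could not raise ulimit $flag" >&2
        done
        # Run the requested command; its exit status becomes ours.
        "$@"
    )
}
```

With the command from earlier in this thread, that would be invoked as: run_with_raised_limits dsmc inc /main/UT -servername=DC1P1_MAIN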
>
> Thomas Denier
> Thomas Jefferson University Hospital
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On
> Behalf Of Andrew Raibeck
> Sent: Wednesday, July 09, 2014 10:44 PM
> To: ADSM-L AT VM.MARIST DOT EDU
> Subject: Re: [ADSM-L] Backup fails with no error message
>
> It is a puzzler.
>
> Just to verify: you have checked dsmerror.log as well for error
> messages and found nothing? Another thought is to check the TSM
> server activity log for any telltale error or warning messages that
> might provide a hint.
>
> The TSM client return codes are derived directly from the severity
> of the messages issued during whatever operation is running.
> ANSnnnnI messages are RC 0; ANSnnnnW are RC 8; and ANSnnnnE or
> ANSnnnnS are RC 12. The exceptions are related to skipped files:
> these "exception" messages are ANSnnnnE but the return code handling
> sets the RC to 4. The highest severity prevails, so if, for example,
> an ANSnnnnW (RC 8) and ANSnnnnE (RC 12) are issued, then the RC will
> be 12. We have had the odd "skipped file" message that is not
> setting the RC to 4, but those have been fixed via APARs, and in any
> case I would still expect some error message in the log. If you
> inspect the error log, let me
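
The return code rules described above can be sketched as a small shell helper. This is only an illustration of the stated rules, not the client's actual logic: the function names are made up, and SKIPPED_FILE_MSGS is a hypothetical caller-supplied list, left empty here because the thread does not say which message IDs are the "skipped file" exceptions.

```shell
# Hypothetical list of "skipped file" message IDs that yield RC 4.
# Empty by default: the thread does not enumerate them.
SKIPPED_FILE_MSGS=${SKIPPED_FILE_MSGS:-}

# Map one message ID (for example ANS1225E) to a return code, using the
# severity letter at the end of the ID.
rc_for_message() {
    for skipped in $SKIPPED_FILE_MSGS; do
        [ "$1" = "$skipped" ] && { echo 4; return; }
    done
    case $1 in
        ANS[0-9][0-9][0-9][0-9]I) echo 0 ;;
        ANS[0-9][0-9][0-9][0-9]W) echo 8 ;;
        ANS[0-9][0-9][0-9][0-9]E | ANS[0-9][0-9][0-9][0-9]S) echo 12 ;;
        *) echo "unrecognized message ID: $1" >&2; return 1 ;;
    esac
}

# The highest return code among all messages issued prevails.
overall_rc() {
    max=0
    for id in "$@"; do
        rc=$(rc_for_message "$id") || return 1
        [ "$rc" -gt "$max" ] && max=$rc
    done
    echo "$max"
}
```

For example, "overall_rc ANS1076E ANS1225E" prints 12, since both IDs end in E and neither is in the (empty) exception list.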
>
> The "GlobalRC" trace example I showed you illustrates when a
> non-zero producing message sets the return code. Thus when whatever
> message is processed that trips the RC 12, I would expect to see it
> in the trace. If you have trace files from when the problem did not
> occur, and the RC was 0, then I would not expect to see any of the
> "GlobalRC" messages in the trace.
>
> I am a little surprised if no such error message appears in the
> dsmerror.log file. I have recently seen one case where the client
> experienced an "out of memory" error but no message was written to
> the console, schedule log, or error log. However the SERVICE trace
> is still sufficient to reveal the problem. What are the ulimits set
> to for this client, and are there an unusually large number of files in
> any of these file systems? Are we talking about millions of files,
> and maybe the file system is on the cusp of running out of memory
> during backup? It's a long shot, but figured I'd mention it.
>
> If you are willing to continue to run the tracing, it would be a good
> idea.
> If the problem persists but you are unable to obtain a trace, open a
> PMR and we'll have to come up with an alternative way to figure out
> what is going on.
>
> Regards,
>
> - Andy
>
> "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2014-07-09
> 11:41:51:
>
> > From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
> > To: ADSM-L AT vm.marist DOT edu
> > Date: 2014-07-09 11:42
> > Subject: Re: Backup fails with no error message
> > Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
> >
> > The regularly scheduled backup ran successfully on Tuesday morning.
> > The scheduled backup this morning failed with exit status 12 and no
> > error message. The backup start and end times indicated that the
> > failure occurred while processing a different file system in the same
> > resource group.
> >
> > I ran a backup of the file system with service tracing enabled. The
> > TSM client eventually crashed with a segmentation fault.  I found two
> > trace files, neither of which contained 'GlobalRC'. The core file from
> > the crash consumed nearly all of the remaining space in the root file
> > system. As far as I can tell, a system administrator responding to an
> > automated alert removed the core file without consulting me.
> >
> > I ran a backup of the entire resource group without tracing. This was
> > successful.
> >
> > I am thinking of upgrading the client software, even though none of
> > the bug fixes listed has any obvious connection to the behavior I am
> > seeing.
> >
> > Should I just keep trying the tracing every time a backup fails and
> > hope I eventually get lucky and obtain a useful trace?
> >
> > Thomas Denier
> > Thomas Jefferson University Hospital
> >
> > -----Original Message-----
> > From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf
> > Of Andrew Raibeck
> > Sent: Monday, July 07, 2014 4:19 PM
> > To: ADSM-L AT VM.MARIST DOT EDU
> > Subject: Re: [ADSM-L] Backup fails with no error message
> >
> > Thomas,
> >
> > Run the failing backup command and this time add these parameters:
> >
> > -traceflags=service -tracefile=/sometracefilename
> >
> > For example:
> >
> > dsmc inc /main/UT -servername=DC1P1_MAIN -traceflags=service -tracefile=/tsmtrace.out
> >
> > Name the trace file whatever you want, just make sure to put it in a
> > file system with room for a potentially large trace file.
> >
> > Note: If you anticipate GB and GB of output, you can add the option
> > -tracemax=1024 to wrap the trace file at 1 GB. The risk is, if
> > whatever happens is not immediately causing the backup to stop, the
> > needed trace lines could be written over due to wrapping. But based on
> > your description, off-hand I'd say the backup stops when the problem
> > occurs so the risk due to wrapping should be low.
> >
> > After the backup finishes with the RC 12, scan the trace for "GlobalRC"
> > (without the quotes) and you should find lines like these:
> >
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 428): msgNum = 1076 changed the Global RC.
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 429): Old values: rc = 0, rcMacroMax = 0, rcMax = 0.
> > 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 443): New values: rc = 12, rcMacroMax = 12, rcMax = 12.
> >
> > This will show you which message is driving the RC change. In my
> > example, "msgNum = 1076" corresponds to ANS1076E.
> >
> > Based on the message, you might be able to figure out the rest; but at
> > the least you have a trace file you can send in to support.
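
The trace scan described above can be sketched as a one-function filter. This assumes each trace record sits on a single line in the real trace file (the line breaks in the sample above look like mail wrapping) and that the "msgNum = N changed the Global RC" wording matches your client level; the function name globalrc_messages is made up for illustration.

```shell
# Hypothetical helper: print ANSnnnn for every message that changed the
# global return code, based on "msgNum = N changed the Global RC" lines.
# The severity letter (I/W/E/S) is not in the trace line, so it is omitted.
globalrc_messages() {
    awk '/changed the Global RC/ {
        for (i = 1; i + 2 <= NF; i++)
            if ($i == "msgNum" && $(i + 1) == "=")
                printf "ANS%04d\n", $(i + 2)
    }' "$1"
}
```

Run against the example trace file named earlier, that would be: globalrc_messages /tsmtrace.out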
> >
> > Regards,
> >
> > - Andy
> >
> > "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2014-07-07
> > 15:59:57:
> >
> > > From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
> > > To: ADSM-L AT vm.marist DOT edu
> > > Date: 2014-07-07 16:00
> > > Subject: Backup fails with no error message
> > > Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
> > >
> > > We have an AIX system on which backups of a specific file system
> > > terminate with exit status 12 but with no error message indicating a
> > > reason for this exit status.
> > > If I execute the command
> > >
> > > dsmc inc /main/UT -servername=DC1P1_MAIN
> > >
> > > as root, I will see typical messages about the number of files
> > > processed and about specific files being backed up, followed by the
> > > usual summary messages. The exit status will be 12. The summary
> > > statistics will show a number of files examined equal to about half
> > > the number of files present in the file system. There will not be
> > > any error message explaining the exit status or the failure to
> > > examine the entire file system.
> > >
> > > The DC1P1_MAIN stanza in dsm.sys has some unusual features because
> > > it is used to back up one of the resource groups for a clustered
> > > environment. The stanza includes three 'domain' statements listing
> > > the file systems in the resource group. The stanza includes a
> > > 'nodename' option specifying the node name that owns the backup
> > > files from the resource group. The stanza includes an 'asnode'
> > > option specifying the node name used to authenticate sessions from
> > > the cluster node involved (we and the system vendor were not able
> > > to agree on an acceptable arrangement for storing a TSM password
> > > within the resource group). This stanza works fine for the other
> > > file systems in the same resource group, and worked fine for
> > > /main/UT up until June 26.
> > >
> > > I have found two ways to circumvent the problem. One circumvention
> > > is to run the command
> > >
> > > dsmc inc /main/UT/ -subdir=y -servername=DC1P1_MAIN
> > >
> > > to back up the top level directory of the file system rather than
> > > the file system as such. An 'lsfs' command shows nothing unusual
> > > about the file system; it is a jfs2 file system, like all the other
> > > file systems, and uses the same mount options as the other file
> > > systems. The other circumvention is to add an 'exclude.dir' line
> > > for a specific subdirectory of /main/UT to the include/exclude
> > > file. The subdirectory came under suspicion because it was last
> > > updated a few hours after the last fully successful backup.
> > >
> > > The client code is TSM 6.4.1.0. The client OS is AIX 7.1. The TSM
> > > server is TSM 6.2.5.0 running under zSeries Linux.
> > >
> > > Does anyone recognize this as a known problem? If not, does anyone
> > > have suggestions for presenting the problem to TSM support? I am
> > > having difficulty imagining any kind of productive interaction if I
> > > don't have a message identifier to report.
> > >
> > > Thomas Denier
> > > Thomas Jefferson University Hospital
> > >
> > > The information contained in this transmission contains privileged
> > > and confidential information. It is intended only for the use of
> > > the person named above. If you are not the intended recipient, you
> > > are hereby notified that any review, dissemination, distribution or
> > > duplication of this communication is strictly prohibited. If you
> > > are not the intended recipient, please contact the sender by reply
> > > email and destroy all copies of the original message.
> > >
> > > CAUTION: Intended recipients should NOT use email communication for
> > > emergent or urgent health care matters.
> > >
> >