ADSM-L

Re: [ADSM-L] Backup fails with no error message

2014-07-10 12:49:03
Subject: Re: [ADSM-L] Backup fails with no error message
From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Thu, 10 Jul 2014 16:46:28 +0000

The error log entries for each day usually include a few messages about
files not found or files that changed while TSM was reading them. There
is a warning message each day noting that a specific directory is excluded.
The directory is named in an 'exclude.dir' statement and is the top level
directory for a file system listed in a 'domain' statement. I have asked the
system vendor for clearance to remove the file system from the domain
statement. I have not gotten a response so far. There are no messages that
have any evident connection to the exit status of 12 or to stopping the
backup prematurely.

The file system in which the backup stopped from June 27 to July 7 has
about 12.6 GB of free space. The file system in which the backup stopped
yesterday has about 6.5 GB of free space. The file system used for TSM logs
has about 3.8 GB of free space. Neither of the file systems in which the
backup stopped at one time or another has millions of files; a successful
backup of the entire resource group early this morning inspected 559,009
files.

The backup that got a segmentation fault apparently ran out of stack space;
the error report in the output from 'errpt -a' includes the words 'Too many
stack elements'. The soft limit on the stack size for root is 65,536 512-byte
blocks. The hard limit is 8,388,608 blocks. Are there any published
recommendations for resource limits for the TSM client?
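
For readers who prefer more familiar units, the two limits quoted above can be converted from 512-byte blocks as follows. This is only an arithmetic sketch: the block size and block counts come from the message, while the function name is made up for illustration.

```python
BLOCK_SIZE = 512  # bytes per block, as stated in the message


def blocks_to_mib(blocks: int) -> float:
    """Convert a count of 512-byte blocks to MiB."""
    return blocks * BLOCK_SIZE / (1024 * 1024)


soft_mib = blocks_to_mib(65_536)      # soft stack limit for root
hard_mib = blocks_to_mib(8_388_608)   # hard stack limit for root

print(f"soft limit: {soft_mib:.0f} MiB")  # soft limit: 32 MiB
print(f"hard limit: {hard_mib:.0f} MiB")  # hard limit: 4096 MiB
```

So the soft limit works out to 32 MiB and the hard limit to 4 GiB.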

I looked over the other error reports in the output from 'errpt -a'. I didn't
find anything recognizably relevant around the times when 'dsmc' ended
with exit status 12, in the interval between the successful /main/UT backup
on June 26 and the failed backup on June 27, or in the interval between the
successful /main/U backup on July 8 and the failed backup on July 9.

Thomas Denier
Thomas Jefferson University Hospital

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf Of
Andrew Raibeck
Sent: Wednesday, July 09, 2014 10:44 PM
To: ADSM-L AT VM.MARIST DOT EDU
Subject: Re: [ADSM-L] Backup fails with no error message

It is a puzzler.

Just to verify: you have checked dsmerror.log as well for error messages and
found nothing? Another thought is to check the TSM server activity log for
any telltale error or warning messages that might provide a hint.

The TSM client return codes are derived directly from the severity of the
messages issued during whatever operation is running. ANSnnnnI messages are
RC 0; ANSnnnnW are RC 8; and ANSnnnnE or ANSnnnnS are RC 12. The exceptions
are related to skipped files: these "exception" messages are ANSnnnnE but
the return code handling sets the RC to 4. The highest severity prevails, so
if, for example, an ANSnnnnW (RC 8) and ANSnnnnE (RC 12) are issued, then
the RC will be 12. We have had the odd "skipped file" message that is not
setting the RC to 4, but those have been fixed via APARs, and in any case I
would still expect some error message in the log. If you inspect the error
log, let me
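
The severity-to-return-code rule described above can be sketched roughly as follows. This is an illustration of the rule as stated in this message, not the client's actual implementation. The function name and the `skipped_file_ids` parameter are hypothetical, and the set of "skipped file" identifiers has to be supplied by the caller because the message does not list them.

```python
import re

# Return code per severity letter, as described in the message above.
SEVERITY_RC = {"I": 0, "W": 8, "E": 12, "S": 12}

# ANSnnnnX: four digits followed by a severity letter.
MSG_ID = re.compile(r"^ANS(\d{4})([IWES])$")


def return_code(message_ids, skipped_file_ids=frozenset()):
    """Return the highest RC implied by the given message identifiers.

    skipped_file_ids is a caller-supplied (hypothetical) set of ANSnnnnE
    identifiers that count as RC 4 instead of RC 12.
    """
    rc = 0
    for msg in message_ids:
        match = MSG_ID.match(msg)
        if match is None:
            continue  # not an ANS message identifier; ignore it
        if msg in skipped_file_ids:
            rc = max(rc, 4)
        else:
            rc = max(rc, SEVERITY_RC[match.group(2)])
    return rc


# ANS0000W is a placeholder identifier; ANS1076E is the one named in this thread.
print(return_code(["ANS0000W", "ANS1076E"]))  # 12
```

Because the highest severity prevails, a single E or S message is enough to produce RC 12 regardless of how many informational or warning messages accompany it.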

The "GlobalRC" trace example I showed you illustrates what happens when a
message that produces a non-zero return code sets the return code. Thus,
when the message that trips the RC 12 is processed, I would expect to see it
in the trace. If you have trace files from when the problem did not occur,
and the RC was 0, then I would not expect to see any of the "GlobalRC"
messages in the trace.
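
Scanning a large trace file for these lines can be automated. The following is a minimal sketch that assumes the trace lines look like the sample "GlobalRC" lines quoted further down in this thread; the function names are made up for illustration, and the pattern may need adjusting for other client levels or platforms (the sample paths come from a Windows client, while an AIX trace would presumably show forward slashes).

```python
import re

# Pattern modeled on the sample "GlobalRC" trace lines quoted in this thread.
MSGNUM = re.compile(
    r"GlobalRC\.cpp\s*\(\s*\d+\):\s*msgNum\s*=\s*(\d+)\s+changed the Global RC"
)


def rc_changing_messages(lines):
    """Return the message numbers that changed the global return code."""
    found = []
    for line in lines:
        match = MSGNUM.search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def scan_trace(path):
    """Scan a trace file line by line, so large files are not loaded whole."""
    with open(path, errors="replace") as handle:
        return rc_changing_messages(handle)


sample = [
    r"07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 428): msgNum = 1076 changed the Global RC.",
    r"07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 429): Old values: rc = 0, rcMacroMax = 0, rcMax = 0.",
]
print(rc_changing_messages(sample))  # [1076]
```

An empty result for a run that ended with RC 12 would match the situation described in this thread, where neither trace file contained 'GlobalRC'.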

I am a little surprised if no such error message appears in the dsmerror.log
file. I have recently seen one case where the client experienced an "out of
memory" error but no message was written to the console, schedule log, or
error log. However, the SERVICE trace is still sufficient to reveal the
problem. What are the ulimits set to for this client, and is there an
unusually large number of files in any of these file systems? Are we talking
about millions of files, and maybe the file system is on the cusp of running
out of memory during backup? It's a long shot, but figured I'd mention it.

If you are willing to continue to run the tracing, it would be a good idea.
If the problem persists but you are unable to obtain a trace, open a PMR and
we'll have to come up with an alternative way to figure out what is going
on.

Regards,

- Andy

____________________________________________________________________________

Andrew Raibeck | Tivoli Storage Manager Level 3 Technical Lead |
storman AT us.ibm DOT com

IBM Tivoli Storage Manager links:
Product support:
http://www.ibm.com/support/entry/portal/Overview/Software/Tivoli/Tivoli_Storage_Manager

Online documentation:
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Documentation+Central/page/Tivoli+Storage+Manager
Product Wiki:
https://www.ibm.com/developerworks/mydeveloperworks/wikis/home/wiki/Tivoli+Storage+Manager/page/Home

"ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2014-07-09
11:41:51:

> From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
> To: ADSM-L AT vm.marist DOT edu,
> Date: 2014-07-09 11:42
> Subject: Re: Backup fails with no error message
> Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
>
> The regularly scheduled backup ran successfully on Tuesday morning.
> The scheduled backup this morning failed with exit status 12 and no
> error message. The backup start and end times indicated that the
> failure occurred while processing a different file system in the same
> resource group.
>
> I ran a backup of the file system with service tracing enabled. The
> TSM client eventually crashed with a segmentation fault.  I found two
> trace files, neither of which contained 'GlobalRC'. The core file from
> the crash consumed nearly all of the remaining space in the root file
> system. As far as I can tell, a system administrator responding to an
> automated alert removed the core file without consulting me.
>
> I ran a backup of the entire resource group without tracing. This was
> successful.
>
> I am thinking of upgrading the client software, even though none of
> the bug fixes listed has any obvious connection to the behavior I am
> seeing.
>
> Should I just keep trying the tracing every time a backup fails and
> hope I eventually get lucky and obtain a useful trace?
>
> Thomas Denier
> Thomas Jefferson University Hospital
>
> -----Original Message-----
> From: ADSM: Dist Stor Manager [mailto:ADSM-L AT VM.MARIST DOT EDU] On Behalf
> Of Andrew Raibeck
> Sent: Monday, July 07, 2014 4:19 PM
> To: ADSM-L AT VM.MARIST DOT EDU
> Subject: Re: [ADSM-L] Backup fails with no error message
>
> Thomas,
>
> Run the failing backup command and this time add these parameters:
>
> -traceflags=service -tracefile=/sometracefilename
>
> For example:
>
> dsmc inc /main/UT -servername=DC1P1_MAIN -traceflags=service -tracefile=/tsmtrace.out
>
> Name the trace file whatever you want, just make sure to put it in a
> file system with room for a potentially large trace file.
>
> Note: If you anticipate GB and GB of output, you can add the option
> -tracemax=1024 to wrap the trace file at 1 GB. The risk is, if
> whatever happens is not immediately causing the backup to stop, the
> needed trace lines could be written over due to wrapping. But based on
> your description, off-hand I'd say the backup stops when the problem
> occurs so the risk due to wrapping should be low.
>
> After the backup finishes with the RC 12, scan the trace for "GlobalRC"
> (without the quotes) and you should find lines like these:
>
> 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 428): msgNum = 1076 changed the Global RC.
> 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 429): Old values: rc = 0, rcMacroMax = 0, rcMax = 0.
> 07/07/2014 16:12:14.122 [003772] [3812] : ..\..\common\ut\GlobalRC.cpp ( 443): New values: rc = 12, rcMacroMax = 12, rcMax = 12.
>
> This will show you which message is driving the RC change. In my
> example, "msgNum = 1076" corresponds to ANS1076E.
>
> Based on the message, you might be able to figure out the rest; but at
> the least you have a trace file you can send in to support.
>
> Regards,
>
> - Andy
>
> "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu> wrote on 2014-07-07
> 15:59:57:
>
> > From: Thomas Denier <Thomas.Denier AT JEFFERSON DOT EDU>
> > To: ADSM-L AT vm.marist DOT edu,
> > Date: 2014-07-07 16:00
> > Subject: Backup fails with no error message
> > Sent by: "ADSM: Dist Stor Manager" <ADSM-L AT vm.marist DOT edu>
> >
> > We have an AIX system on which backups of a specific file system
> > terminate with exit status 12 but with no error message indicating a
> > reason for this exit status.
> > If I execute the command
> >
> > dsmc inc /main/UT -servername=DC1P1_MAIN
> >
> > as root, I will see typical messages about the number of files
> > processed and about specific files being backed up, followed by the
> > usual summary messages. The exit status will be 12. The summary
> > statistics will show a number of files examined equal to about half the
> > number of files present in the file system. There will not be any error
> > message explaining the exit status or the failure to examine the entire
> > file system.
> >
> > The DC1P1_MAIN stanza in dsm.sys has some unusual features because it
> > is used to back up one of the resource groups for a clustered
> > environment. The stanza includes three 'domain' statements listing the
> > file systems in the resource group. The stanza includes a 'nodename'
> > option specifying the node name that owns the backup files from the
> > resource group. The stanza includes an 'asnode' option specifying the
> > node name used to authenticate sessions from the cluster node involved
> > (we and the system vendor were not able to agree on an acceptable
> > arrangement for storing a TSM password within the resource group). This
> > stanza works fine for the other file systems in the same resource
> > group, and worked fine for /main/UT up until June 26.
> >
> > I have found two ways to circumvent the problem. One circumvention is to
> > run the command
> >
> > dsmc inc /main/UT/ -subdir=y -servername=DC1P1_MAIN
> >
> > to back up the top level directory of the file system rather than the
> > file system as such. An 'lsfs' command shows nothing unusual about the
> > file system; it is a jfs2 file system, like all the other file systems,
> > and uses the same mount options as the other file systems. The other
> > circumvention is to add an 'exclude.dir' line for a specific
> > subdirectory of /main/UT to the include/exclude file. The subdirectory
> > came under suspicion because it was last updated a few hours after the
> > last fully successful backup.
> >
> > The client code is TSM 6.4.1.0. The client OS is AIX 7.1. The TSM
> > server is TSM 6.2.5.0 running under zSeries Linux.
> >
> > Does anyone recognize this as a known problem? If not, does anyone
> > have suggestions for presenting the problem to TSM support? I am
> > having difficulty imagining any kind of productive interaction if I
> > don't have a message identifier to report.
> >
> > Thomas Denier
> > Thomas Jefferson University Hospital
> >
> > The information contained in
> > this transmission contains privileged and confidential information.
> > It is intended only for the use of the person named above. If you
> > are not the intended recipient, you are hereby notified that any
> > review, dissemination, distribution or duplication of this
> > communication is strictly prohibited. If you are not the intended
> > recipient, please contact the sender by reply email and destroy all
> > copies of the original message.
> >
> > CAUTION: Intended recipients should NOT use email communication for
> > emergent or urgent health care matters.