Subject: Re: [ADSM-L] Magic Decoder Ring needed
From: Zoltan Forray <zforray AT VCU DOT EDU>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Fri, 13 Oct 2017 15:05:55 -0400
Update: This error/problem is now occurring once or twice a day, usually
while a "backup stgpool" of our primary disk pool is running.

There is nothing in any of our hardware/OS logs, including the PERC
controller logs. There is a Dell PERC firmware upgrade pending that is
labeled "Urgent" that we will pursue.

If this is another one of our "bad spots" in one of our disk volumes, can
someone from IBM help decode the error to perhaps point to what stgpool
volume has the "problem"?  We ran an audit on one of the 20+ volumes in
this stgpool but nothing showed up as "bad".  With over 30TB to run audits
on (and of course they are always busy), it will take a while.  The latest
message:

10/13/2017 11:55:53 AM ANR1330E The server has detected possible corruption
in an object that is being restored or moved. The actual values for the
incorrect frame are: magic 20890B50 hdr version 25350 hdr length  2320
sequence number 2114564210 data length D07D0F20 server ID 175174927 segment
ID 9270951345039929524 crc  4C5AC0C.
10/13/2017 11:55:53 AM ANR1331E Invalid frame detected.  Expected magic
53454652 sequence number       71 server id        0 segment id
2720204019.
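
For comparing the "actual" and "expected" values field by field, here is a
minimal sketch in Python. It assumes the field labels and order exactly as
they appear in the two messages quoted in this thread, and it assumes that
magic, data length and crc are hexadecimal while the other fields are
decimal. The names (ACTUAL_RE, EXPECTED_RE, parse, magic_ascii, mismatches)
are made up for illustration. It only decodes the message text; it does not
say which stgpool volume holds the frame.

```python
import re

# Layout assumed from the ANR1330E/ANR1331E text quoted in this thread;
# other server levels may word these messages differently.
ACTUAL_RE = re.compile(
    r"magic\s+(?P<magic>[0-9A-F]+)\s+"
    r"hdr version\s+(?P<hdr_version>\d+)\s+"
    r"hdr length\s+(?P<hdr_length>\d+)\s+"
    r"sequence number\s+(?P<sequence>\d+)\s+"
    r"data length\s+(?P<data_length>[0-9A-F]+)\s+"
    r"server ID\s+(?P<server_id>\d+)\s+"
    r"segment ID\s+(?P<segment_id>\d+)\s+"
    r"crc\s+(?P<crc>[0-9A-F]+)",
    re.IGNORECASE,
)
# ANR1331E sometimes lists only the expected magic, so the rest is optional.
EXPECTED_RE = re.compile(
    r"Expected magic\s+(?P<magic>[0-9A-F]+)"
    r"(?:\s+sequence number\s+(?P<sequence>\d+)"
    r"\s+server id\s+(?P<server_id>\d+)"
    r"\s+segment id\s+(?P<segment_id>\d+))?",
    re.IGNORECASE,
)
# Assumption: these fields are printed in hex, everything else in decimal.
HEX_FIELDS = {"magic", "data_length", "crc"}


def parse(regex, message):
    """Collapse line wraps, then return the named fields that were present."""
    match = regex.search(" ".join(message.split()))
    if match is None:
        raise ValueError("message does not match the assumed layout")
    return {
        name: int(text, 16 if name in HEX_FIELDS else 10)
        for name, text in match.groupdict().items()
        if text is not None
    }


def magic_ascii(magic):
    """Render the 4 magic bytes as text, with '.' for non-printable bytes."""
    raw = magic.to_bytes(4, "big")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


def mismatches(actual_msg, expected_msg):
    """Return {field: (actual, expected)} for every field that differs."""
    actual = parse(ACTUAL_RE, actual_msg)
    expected = parse(EXPECTED_RE, expected_msg)
    return {
        name: (actual[name], want)
        for name, want in expected.items()
        if actual[name] != want
    }
```

Pasting the ANR1330E text as the first argument of mismatches() and the
ANR1331E text as the second reports every field the server listed as
expected that differs from what it read. The expected magic 53454652 is the
four ASCII bytes "SEFR" (magic_ascii(0x53454652) returns "SEFR"). For the
10/13 message above, all four compared fields (magic, sequence number,
server id, segment id) differ, whereas the frame in the 10/10 message
carried the expected magic.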


On Wed, Oct 11, 2017 at 9:33 AM, Skylar Thompson <skylar2 AT u.washington DOT edu>
wrote:

> I'm not aware of a fix for the problem (it's with Dell PERC H810s) but the
> problem manifested itself in lots and lots of media errors on a physical
> device, visible when you export the controller log. The symptoms for TSM
> included both CRC errors in the pool and also sporadically awful I/O
> throughput.
>
> The controller logs identified the slot with the media errors, and
> replacing the drive made all the above problems go away. Of course the real
> solution is going to be retiring these soon-to-be-EOSL'd devices, and I've
> finally got a budget to do it...
>
> I didn't spend a lot of time looking for a fix, given that we'll be getting
> rid of the equipment in a few weeks. It could very well be an interaction
> between the RAID HBA and physical disk firmware. Unfortunately the system has
> a mix of disk vendors since Dell isn't consistent about which vendor they
> ship for replacements, but the drive I identified was a Fujitsu MBD2300RC.
>
> On Tue, Oct 10, 2017 at 02:18:01PM -0400, Zoltan Forray wrote:
> > Thank you for the info.  We have started running AUDITs, but with 30TB+ in
> > this disk stgpool, it will take a while.  I am very interested in
> > additional details on the RAID firmware issue you mentioned - any specifics
> > would be very helpful.  AFAIK, we are up-to-date on all Dell firmware (we
> > patch fairly regularly).
> >
> > Within the past 9 months, this server has had 3 diskpool volumes (all part
> > of RAID-5 arrays) suddenly become "bad", requiring full restores, with no
> > explanation since there was no sign of hardware problems. While I did open
> > a PMR with IBM, by the time they looked at my last failure, they said there
> > was nothing they could do to analyze the problem and to call them back the
> > next time it happens.
> >
> > On Tue, Oct 10, 2017 at 2:04 PM, Skylar Thompson <skylar2 AT u.washington DOT edu>
> > wrote:
> >
> > > Hi Zoltan,
> > >
> > > We ran into this recently, and it was caused by a firmware bug in a RAID
> > > adapter that caused it not to fail an obviously-failing disk in our disk
> > > spool. We followed the procedure here:
> > >
> > > https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.6/tshoot/r_pdg_1330_1331_msg.html
> > >
> > > It did take a few AUDIT VOLUME/MOVE DATA cycles to find everything but now
> > > it's happy. In a few cases, the file shown by SHOW INVO was obviously
> > > detritus, so we deleted it client-side with DELETE BACKUP instead of an
> > > audit, because it takes a long time to audit our disk volumes.
> > >
> > > On Tue, Oct 10, 2017 at 01:56:47PM -0400, Zoltan Forray wrote:
> > > > Recently we started seeing these errors on one of our servers:
> > > >
> > > > 10/10/2017 13:35:51  ANR1330E The server has detected possible corruption
> > > >                       in an object that is being restored or moved. The
> > > >                       actual values for the incorrect frame are: magic
> > > >                       53454652 hdr version 2 hdr length 32 sequence number
> > > >                       22610 data length 3FFB0 server ID 0 segment ID
> > > >                       2720223190 crc 0. (SESSION: 39218, PROCESS: 171)
> > > > 10/10/2017 13:35:51  ANR1331E Invalid frame detected.  Expected magic
> > > >                       53454652
> > > >
> > > > The Process ID points to a Backup Stgpool process (the only thing running),
> > > > not anything being "moved or restored".  There are also a bunch of sessions
> > > > running/stuck/hung but that is a different problem.
> > > >
> > > > Any idea on how to determine what is causing this?  We've seen the error
> > > > quite a few times within the past few days.
> > > >
> > > > --
> > > > *Zoltan Forray*
> > > > Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
> > > > Xymon Monitor Administrator
> > > > VMware Administrator
> > > > Virginia Commonwealth University
> > > > UCC/Office of Technology Services
> > > > www.ucc.vcu.edu
> > > > zforray AT vcu DOT edu - 804-828-4807
> > > > Don't be a phishing victim - VCU and other reputable organizations will
> > > > never use email to request that you reply with your password, social
> > > > security number or confidential personal information. For more details
> > > > visit http://phishing.vcu.edu/
> > >
> > > --
> > > -- Skylar Thompson (skylar2 AT u.washington DOT edu)
> > > -- Genome Sciences Department, System Administrator
> > > -- Foege Building S046, (206)-685-7354
> > > -- University of Washington School of Medicine
> > >



--
*Zoltan Forray*
Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
Xymon Monitor Administrator
VMware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
www.ucc.vcu.edu
zforray AT vcu DOT edu - 804-828-4807
Don't be a phishing victim - VCU and other reputable organizations will
never use email to request that you reply with your password, social
security number or confidential personal information. For more details
visit http://phishing.vcu.edu/
