Re: [ADSM-L] Magic Decoder Ring needed

I'm not aware of a fix for the problem (it's with Dell PERC H810s) but the
problem manifested itself in lots and lots of media errors on a physical
device, visible when you export the controller log. The symptoms for TSM
included both CRC errors in the pool and also sporadically awful I/O
throughput.

The controller logs identified the slot with the media errors, and
replacing the drive made all the above problems go away. Of course the real
solution is going to be retiring these soon-to-be-EOSL'd devices, and I've
finally got a budget to do it...

That said, I didn't spend a lot of time looking for a fix, given that we'll
be getting rid of the equipment in a few weeks. It could very well be an
interaction between the RAID HBA and physical disk firmware. Unfortunately
the system has a mix of disk vendors, since Dell isn't consistent about
which vendor they ship for replacements, but the drive I identified was a
Fujitsu MBD2300RC.
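
For anyone who wants to pull the frame fields out of the ANR1330E message
quoted further down, here is a minimal Python sketch. The field labels come
straight from that one sample; the function names are my own, and reading
the magic value as hex-encoded ASCII is an assumption on my part, not
something the message documents. Quote markers ("> > >") need to be stripped
from the text before parsing.

```python
import re

# Message text copied from the ANR1330E report quoted in this thread,
# with the mail client's line wrapping and quote markers removed.
SAMPLE = (
    "ANR1330E The server has detected possible corruption in an object "
    "that is being restored or moved. The actual values for the incorrect "
    "frame are: magic 53454652 hdr version    2 hdr length    32 "
    "sequence number 22610 data length    3FFB0 server ID        0 "
    "segment ID 2720223190 crc        0. (SESSION: 39218, PROCESS: 171)"
)

# Field labels as they appear in the sample. Values are kept as strings
# because the message does not say which are hex and which are decimal.
FIELD_RE = re.compile(
    r"magic\s+(?P<magic>[0-9A-Fa-f]+)\s+"
    r"hdr version\s+(?P<hdr_version>\w+)\s+"
    r"hdr length\s+(?P<hdr_length>\w+)\s+"
    r"sequence number\s+(?P<sequence_number>\w+)\s+"
    r"data length\s+(?P<data_length>\w+)\s+"
    r"server ID\s+(?P<server_id>\w+)\s+"
    r"segment ID\s+(?P<segment_id>\w+)\s+"
    r"crc\s+(?P<crc>\w+)"
)


def parse_anr1330e(text):
    """Return the frame fields from an ANR1330E message as a dict, or None."""
    # Collapse runs of whitespace so wrapped log lines still match.
    flat = " ".join(text.split())
    match = FIELD_RE.search(flat)
    return match.groupdict() if match else None


def magic_as_ascii(magic_hex):
    """Interpret the magic value as hex-encoded ASCII bytes (an assumption)."""
    return bytes.fromhex(magic_hex).decode("ascii", errors="replace")


fields = parse_anr1330e(SAMPLE)
print(fields["magic"], "->", magic_as_ascii(fields["magic"]))
print("crc:", fields["crc"])
```

For the sample above, the magic value 53454652 reads as "SEFR" when treated
as ASCII, and the reported crc is 0. I don't know what IBM intends either
value to mean, so treat this as a way to compare frames across occurrences
rather than a diagnosis.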

On Tue, Oct 10, 2017 at 02:18:01PM -0400, Zoltan Forray wrote:
> Thank you for the info.  We have started running AUDITs but with 30TB+ in
> this disk stgpool, it will take a while.  I am very interested in
> additional details on the RAID firmware issue you mentioned - any specifics
> would be very helpful.  AFAIK, we are up-to-date on all Dell firmware (we
> patch fairly regularly).
>
> Within the past 9 months, this server has had 3 diskpool volumes (all part
> of RAID-5 arrays) suddenly become "bad", requiring full restores, with no
> explanation since there was no sign of hardware problems. While I did open
> a PMR with IBM, by the time they looked at my last failure, they said there
> was nothing they could do to analyze the problem and to call them back the
> next time it happens.
>
> On Tue, Oct 10, 2017 at 2:04 PM, Skylar Thompson <skylar2 AT u.washington DOT edu> wrote:
>
> > Hi Zoltan,
> >
> > We ran into this recently, and it was caused by a firmware bug in a RAID
> > adapter that caused it not to fail an obviously-failing disk in our disk
> > spool. We followed the procedure here:
> >
> > https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.6/tshoot/r_pdg_1330_1331_msg.html
> >
> > It did take a few AUDIT VOLUME-MOVE DATA cycles to find everything but now
> > it's happy. In a few cases, the file shown by SHOW INVO was obviously
> > detritus, so we deleted it client-side with DELETE BACKUP instead of an
> > audit, because it takes a long time to audit our disk volumes.
> >
> > On Tue, Oct 10, 2017 at 01:56:47PM -0400, Zoltan Forray wrote:
> > > Recently we started seeing these errors on one of our servers:
> > >
> > > 10/10/2017 13:35:51  ANR1330E The server has detected possible corruption in
> > >                       an object that is being restored or moved. The actual
> > >                       values for the incorrect frame are: magic 53454652 hdr
> > >                       version    2 hdr length    32 sequence number 22610
> > >                       data length    3FFB0 server ID        0 segment ID
> > >                       2720223190 crc        0. (SESSION: 39218, PROCESS: 171)
> > > 10/10/2017 13:35:51  ANR1331E Invalid frame detected.  Expected magic 53454652
> > >
> > > The Process ID points to a Backup Stgpool process (the only thing running),
> > > not anything being "moved or restored".  There are also a bunch of sessions
> > > running/stuck/hung but that is a different problem.
> > >
> > > Any idea on how to determine what is causing this?  We've seen the error
> > > quite a few times within the past few days.
> > >
> > > --
> > > *Zoltan Forray*
> > > Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
> > > Xymon Monitor Administrator
> > > VMware Administrator
> > > Virginia Commonwealth University
> > > UCC/Office of Technology Services
> > > www.ucc.vcu.edu
> > > zforray AT vcu DOT edu - 804-828-4807
> > > Don't be a phishing victim - VCU and other reputable organizations will
> > > never use email to request that you reply with your password, social
> > > security number or confidential personal information. For more details
> > > visit http://phishing.vcu.edu/
> >
> > --
> > -- Skylar Thompson (skylar2 AT u.washington DOT edu)
> > -- Genome Sciences Department, System Administrator
> > -- Foege Building S046, (206)-685-7354
> > -- University of Washington School of Medicine
> >

--
-- Skylar Thompson (skylar2 AT u.washington DOT edu)
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine