Re: [Networker] Very strange problem and question on block size error?

2004-06-06 10:54:49
Subject: Re: [Networker] Very strange problem and question on block size error?
From: "T. S. Kimball" <tkimball AT BRASS DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Sun, 6 Jun 2004 10:54:20 -0400
How big is that first backup sent to the tape?

We've had that error when a new tape is (re)labeled and all that gets
written to it is a small index file.  As you say, there are no other
errors and the tape is otherwise happy.  I've learned to ignore it, but
it does make me concerned from time to time.

Our setup:  L700, 6 x DLT7000, all Fuji media (DLT-IV), Sun E-450
Solaris 2.6 & Sun SBU/Legato 6.1.3.


--TSK


On Thu, 3 Jun 2004, George Sinclair wrote:

> Imation. I don't believe we've ever used any other brand of LTO. In the
> case of SDLT, we've used both Imation and Quantum.
>
> George
>
> Davina Treiber wrote:
> >
> > What brand of LTO tapes?
> >
> > George Sinclair wrote:
> > > Hi,
> > >
> > > We were experiencing a very strange problem on Wednesday, but I think
> > > this may be a sign of a deeper problem. I'm hoping someone can help me
> > > resolve this or at least let me know if our configuration looks okay. I
> > > think there is something still wrong or misconfigured with our
> > > stinit.def file and/or NetWorker configuration, but allow me to explain.
> > >
> > > This past Wednesday, I was playing around with a new pool of tapes. The
> > > very first time I would write to a tape, the savesets would run just
> > > fine, and once everything completed, NetWorker would issue the following
> > > messages:
> > >
> > > Block size is 32768 bytes not 131072 bytes. Verify the device
> > > configuration. Tape positioning by record is disabled.
> > >
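For anyone wanting to count these in their own logs, here is a minimal Python sketch that pulls the two sizes out of the message wording quoted above. The function name and the reading of the message as "saw 32768, wanted 131072" are my assumptions, and other NetWorker releases may word the message differently.

```python
import re

# Matches the wording quoted in the message above; this pattern is an
# assumption about the format, not something taken from NetWorker docs.
BLOCK_SIZE_RE = re.compile(r"Block size is (\d+) bytes not (\d+) bytes")


def parse_block_size_warning(line):
    """Return (seen, wanted) block sizes in bytes, or None if no match."""
    match = BLOCK_SIZE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = ("Block size is 32768 bytes not 131072 bytes. Verify the device "
          "configuration. Tape positioning by record is disabled.")
seen, wanted = parse_block_size_warning(sample)
# 32768 bytes is 32 KB and 131072 bytes is 128 KB, a factor of four apart.
print(f"seen {seen // 1024} KB, wanted {wanted // 1024} KB")
```

Run against each line of the messages file, this would give a rough tally of how often the mismatch is reported and which sizes are involved.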
> > > This was occurring on several new tapes. Now, I know many folks have
> > > seen these messages, and they know what this means, but allow me to
> > > explain further. No abnormal messages of any kind were generated when
> > > these tape(s) were first labeled, and they were all brand new tapes.
> > > Subsequent write jobs did NOT issue any such messages - no errors. Every
> > > time I re-labeled a tape, though, it would happen again, but only on
> > > the first backup to the tape. NO such messages were generated
> > > when doing a recover, even when recovering data from the very first
> > > write. The recovery went lickety-split, no errors. I cleaned all drives,
> > > but had the same problem. I even switched between drives: same problem. There
> > > were no errors reported in the devices window, though. I should note
> > > that we normally NEVER see any errors or strange messages when labeling
> > > tapes. The only thing we see, under normal circumstances with new tapes,
> > > is the typical input/output error that you expect with new tapes that
> > > have never been labeled prior.
> > >
> > > This weirdness occurred on both LTO and SDLT tapes. The affected drives
> > > were SDLT and LTO first gen. drives located on two different libraries
> > > (an ATL P1000 SDLT library with two drives and a StorageTek L80 with 4
> > > Seagate LTO drives), both attached to the same Linux storage node server.
> > > The primary server is a Sun running Solaris 2.6. Both the storage node
> > > server and the primary run 6.1.1.
> > >
> > > Next, I tried the same operations again on Thursday, and the errors did
> > > NOT occur this time. No matter how hard I tried, they never reared their
> > > ugly heads. I even added more new tapes and continued testing,
> > > re-labeling the tapes and switching between drives, but no errors, and
> > > there were NO reboots or any shutdown/restart of the software during
> > > this time. But, last night, while the nightly backups were running
> > > (these use an older, different pool), I did see the error occur on a new
> > > tape that had not previously been written to. The error occurred after
> > > the backups to that tape completed.
> > >
> > > Quite some time ago, we were seeing these kinds of errors when we would
> > > go to do recoveries, and sporadically during backups, too, but mostly
> > > during recoveries. We then created a /etc/stinit.def file on the
> > > storagenode server (I've provided a copy below), and the problems went
> > > away. We do not use any environment variables to set block size, etc. In
> > > investigating this further, however, I see that since January 2004, this
> > > problem has occurred on a number of occasions according to the
> > > /nsr/log/messages file. Since January, we've used 147 tapes (SDLT=66,
> > > LTO=81), and there have been problems on 22 (SDLT=9, LTO=13). For at
> > > least half of these 22 tapes, a similar message(s) appeared in the
> > > NetWorker log file after the tape was marked full, e.g.:
> > >
> > > Jan  5 14:51:25 primary root: [ID 702911 daemon.notice] NetWorker Media: (info) loading volume FUL605 into rd=storagenode:/dev/nst3
> > > Jan  5 15:10:36 primary root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=storagenode:/dev/nst5 writing: No space left on device, at file 137 record 2
> > > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 on rd=storagenode:/dev/nst5 is full
> > > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 used 139 GB of 100 GB capacity
> > > Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "FUL618" on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
> > > Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media: (info) verification of volume "FUL618", volid 4126804993 succeeded.
> > >
> > > The tape appeared okay otherwise. For the other half of the 22 tapes,
> > > however, there were other errors suggesting that the tape was
> > > prematurely marked full before reaching its capacity, possibly due to
> > > some server error.
> > >
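As a quick sanity check on the tape counts quoted above, this small Python sketch works out the problem rate per media type. The variable and function names are just for illustration; the numbers are the ones in the message.

```python
# Tape counts quoted in the message above (since January 2004).
used = {"SDLT": 66, "LTO": 81}
problems = {"SDLT": 9, "LTO": 13}


def failure_rate(bad, total):
    """Return the share of tapes with problems, as a percentage."""
    return 100.0 * bad / total


rates = {kind: failure_rate(problems[kind], used[kind]) for kind in used}
total_used = sum(used.values())
total_bad = sum(problems.values())
overall = failure_rate(total_bad, total_used)

for kind, rate in rates.items():
    print(f"{kind}: {problems[kind]}/{used[kind]} = {rate:.1f}%")
print(f"All: {total_bad}/{total_used} = {overall:.1f}%")
```

The two rates come out close to each other (roughly 14% and 16%), which would fit a cause common to both libraries rather than one drive type or media brand, though the sample is small.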
> > > Here's our stinit.def file:
> > >
> > > # Seagate Ultrium LTO
> > > manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
> > > scsi2logical=1 can-bsr auto-lock
> > > mode1 blocksize=0
> > > }
> > >
> > > # SDLT220
> > > manufacturer="QUANTUM" model = "SuperDLT1" {
> > > scsi2logical=1
> > > can-bsr=1
> > > auto-lock=0
> > > two-fms=0
> > > drive-buffering=1
> > > buffer-writes
> > > read-ahead=1
> > > async-writes=1
> > > can-partitions=0
> > > fast-mteom=1
> > > #
> > > # If your stinit supports the timeouts:
> > > timeout=3600 # 1 hour
> > > long-timeout=14400 # 4 hours
> > > #
> > > mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
> > > mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
> > > }
> > >
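One thing that can be checked mechanically is what block size each mode line asks for. The Python sketch below is a rough parser of my own, not part of stinit. It reads the relevant lines copied from the file above and reports whether each mode is variable-block (blocksize=0 in the Linux st driver) or fixed.

```python
import re

# Relevant lines copied from the stinit.def quoted above; the other
# per-drive options are left out because this check does not use them.
STINIT_TEXT = '''
manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
scsi2logical=1 can-bsr auto-lock
mode1 blocksize=0
}
manufacturer="QUANTUM" model = "SuperDLT1" {
mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
}
'''


def mode_block_sizes(text):
    """Map (model, mode) -> blocksize for every modeN line in the text."""
    result = {}
    model = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        found_model = re.search(r'model\s*=\s*"([^"]+)"', line)
        if found_model:
            model = found_model.group(1)
        found_mode = re.match(r"(mode\d)\b.*?\bblocksize=(\d+)", line)
        if found_mode and model is not None:
            result[(model, found_mode.group(1))] = int(found_mode.group(2))
    return result


sizes = mode_block_sizes(STINIT_TEXT)
for (model, mode), size in sizes.items():
    kind = "variable" if size == 0 else f"fixed {size} bytes"
    print(f"{model} {mode}: {kind}")
```

Every mode here comes out as variable-block, so on the face of it the file is not forcing a fixed 32768-byte block size on either drive type.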
> > > I'm not sure why the stinit.def file does not specify the density for
> > > the LTO, whether it even should, or whether the values for the SDLT
> > > are correct. Can anyone tell me why we've seen these sporadic
> > > errors and what changes, if any, we need to make? Does our stinit.def look
> > > okay? Might that be causing this?
> > >
> > > Thanks.
> > >
> > > George
> > >
> > > --
> > > Note: To sign off this list, send a "signoff networker" command via email
> > > to listserv AT listmail.temple DOT edu or visit the list's Web site at
> > > http://listmail.temple.edu/archives/networker.html where you can
> > > also view and post messages to the list.
> > > =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
> > >


