Re: [Networker] Very strange problem and question on block size error?

2004-06-03 10:34:19
Subject: Re: [Networker] Very strange problem and question on block size error?
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Thu, 3 Jun 2004 10:35:34 -0400

Imation. I don't believe we've ever used any other brand of LTO. In the
case of SDLT, we've used both Imation and Quantum.

George

Davina Treiber wrote:
>
> What brand of LTO tapes?
>
> George Sinclair wrote:
> > Hi,
> >
> > We were experiencing a very strange problem on Wednesday, but I think
> > this may be a sign of a deeper problem. I'm hoping someone can help me
> > resolve this or at least let me know if our configuration looks okay. I
> > think there is something still wrong or misconfigured with our
> > stinit.def file and/or NetWorker configuration, but allow me to explain.
> >
> > This past Wednesday, I was playing around with a new pool of tapes. The
> > very first time I would write to a tape, the savesets would run just
> > fine, and once everything completed, NetWorker would issue the following
> > messages:
> >
> > Block size is 32768 bytes not 131072 bytes. Verify the device
> > configuration. Tape positioning by record is disabled.
> >
> > This was occurring on several new tapes. Now, I know many folks have
> > seen these messages and know what they mean, but allow me to
> > explain further. No abnormal messages of any kind were generated when
> > these tapes were first labeled, and they were all brand new tapes.
> > Subsequent write jobs did NOT issue any such messages, no errors at all,
> > but every time I re-labeled a tape, it happened again, and only on
> > the first backup to that tape. However, NO such messages were generated
> > when doing a recover, even when recovering data from the very first
> > write. The recovery went lickety-split, no errors. I cleaned all drives,
> > but same problem. I even switched between drives, same problem. There
> > were no errors reported in the devices window, though. I should note
> > that we normally NEVER see any errors or strange messages when labeling
> > tapes. The only thing we see, under normal circumstances with new tapes,
> > is the typical input/output error that you expect with new tapes that
> > have never been labeled before.
> >
> > This weirdness occurred on both LTO and SDLT tapes. The affected drives
> > were SDLT and first-generation LTO drives located in two different
> > libraries (an ATL P1000 SDLT library with two drives and a StorageTek
> > L80 with four Seagate LTO drives), both attached to the same Linux
> > storage node server. The primary server is a Sun running Solaris 2.6.
> > Both the storage node server and the primary run 6.1.1.
> >
> > Next, I tried the same operations again on Thursday, and the errors did
> > NOT occur this time. No matter how hard I tried, they never reared their
> > ugly heads. I even added more new tapes and continued testing,
> > re-labeling the tapes and switching between drives, but no errors, and
> > there were NO reboots or any shutdown/restart of the software during
> > this time. But, last night, while the nightly backups were running
> > (these use an older, different pool), I did see the error occur on a new
> > tape that had not previously been written to. The error occurred after
> > the backups to that tape completed.
> >
> > Quite some time ago, we were seeing these kinds of errors when we would
> > go to do recoveries, and sporadically during backups, too, but mostly
> > during recoveries. We then created a /etc/stinit.def file on the
> > storagenode server (I've provided a copy below), and the problems went
> > away. We do not use any environment variables to set block size, etc. In
> > investigating this further, however, I see that since January 2004, this
> > problem has occurred on a number of occasions according to the
> > /nsr/log/messages file. Since January, we've used 147 tapes (SDLT=66,
> > LTO=81), and there have been problems on 22 (SDLT=9, LTO=13). For at
> > least half of these 22 tapes, similar messages appeared in the
> > NetWorker log file after the tape was marked full, e.g.:
> >
> > Jan  5 14:51:25 primary root: [ID 702911 daemon.notice] NetWorker Media: (info) loading volume FUL605 into rd=storagenode:/dev/nst3
> > Jan  5 15:10:36 primary root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=storagenode:/dev/nst5 writing: No space left on device, at file 137 record 2
> > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 on rd=storagenode:/dev/nst5 is full
> > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 used 139 GB of 100 GB capacity
> > Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "FUL618" on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
> > Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media: (info) verification of volume "FUL618", volid 4126804993 succeeded.
> >
> > Otherwise the tape appeared okay. For the other half of the 22
> > tapes, there were other errors suggesting that the tape was
> > prematurely marked full before reaching its capacity, possibly due to
> > some server error.
> >
> > Here's our stinit.def file:
> >
> > # Seagate Ultrium LTO
> > manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
> > scsi2logical=1 can-bsr auto-lock
> > mode1 blocksize=0
> > }
> >
> > # SDLT220
> > manufacturer="QUANTUM" model = "SuperDLT1" {
> > scsi2logical=1
> > can-bsr=1
> > auto-lock=0
> > two-fms=0
> > drive-buffering=1
> > buffer-writes
> > read-ahead=1
> > async-writes=1
> > can-partitions=0
> > fast-mteom=1
> > #
> > # If your stinit supports the timeouts:
> > timeout=3600 # 1 hour
> > long-timeout=14400 # 4 hours
> > #
> > mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
> > mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
> > }
> >
> > I'm not sure why the stinit.def file does not specify the density for
> > the LTO drives, whether it even should, or whether the values for the
> > SDLT are correct. Can anyone tell me why we might be seeing these
> > sporadic errors and what changes, if any, we need to make? Does our
> > stinit.def look okay? Might that be causing this?
> >
> > Thanks.
> >
> > George
> >
> > --
> > Note: To sign off this list, send a "signoff networker" command via email
> > to listserv AT listmail.temple DOT edu or visit the list's Web site at
> > http://listmail.temple.edu/archives/networker.html where you can
> > also view and post messages to the list.
> > =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
> >
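
The tape counts quoted above (147 used, 22 with problems, split by media type) are easier to compare as rates. The following is only a small sketch that does the arithmetic on the figures from the message; the function and variable names are made up for illustration. If the rates for SDLT and LTO come out similar, that would be at least consistent with a cause shared by both libraries (such as the storage node or its st driver settings) rather than one media type, though it does not prove it.

```python
# Tape counts since January 2004, taken from the quoted message.
tapes_used = {"SDLT": 66, "LTO": 81}
tapes_with_problems = {"SDLT": 9, "LTO": 13}


def failure_rate(problems, used):
    """Return the percentage of tapes that logged a problem."""
    return 100.0 * problems / used


def summarize(used, problems):
    """Return a per-media-type failure percentage plus an overall figure."""
    rates = {media: failure_rate(problems[media], used[media]) for media in used}
    rates["ALL"] = failure_rate(sum(problems.values()), sum(used.values()))
    return rates


if __name__ == "__main__":
    for media, pct in summarize(tapes_used, tapes_with_problems).items():
        print(f"{media}: {pct:.1f}% of tapes affected")
```

With these numbers the SDLT and LTO rates are within a few percentage points of each other.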

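
For anyone wanting to count how often the "Verify the device configuration" notice shows up per volume and device, here is a rough sketch of a scanner for a syslog-style file such as /nsr/log/messages. It assumes entries begin with a timestamp like "Jan  5 15:10:48" and that any wrapped lines belong to the entry before them, as in the excerpt quoted in this thread; the regular expressions and function names are illustrative guesses based on that excerpt, not a documented NetWorker log format.

```python
import re

# A line that starts a new syslog entry, e.g. "Jan  5 15:10:48 primary root: ..."
ENTRY_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
# Volume and device as they appear in the quoted "Cannot decode block" notice.
VOLUME = re.compile(r'Volume "([^"]+)"\s+on device "([^"]+)"')
MARKER = "Verify the device configuration"


def join_entries(lines):
    """Merge wrapped continuation lines back into one string per entry."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def block_size_warnings(lines):
    """Return (timestamp, volume, device) for each matching notice.

    Volume and device are None when the notice does not name them.
    """
    found = []
    for entry in join_entries(lines):
        if MARKER not in entry:
            continue
        match = VOLUME.search(entry)
        volume, device = match.groups() if match else (None, None)
        found.append((entry[:15], volume, device))
    return found


# Two entries copied from the quoted log excerpt, wrapped as in the message.
SAMPLE = """\
Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media:
(notice) sdlt tape
FUL618 on rd=storagenode:/dev/nst5 is full
Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "FUL618"
on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the
device configuration. Tape
positioning by record is disabled.
"""

if __name__ == "__main__":
    for stamp, volume, device in block_size_warnings(SAMPLE.splitlines()):
        print(stamp, volume, device)
```

To run it against a real log, pass the open file object to block_size_warnings in place of SAMPLE.splitlines().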