Re: [Networker] Very strange problem and question on block size error?

2004-06-07 10:25:26
Subject: Re: [Networker] Very strange problem and question on block size error?
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Mon, 7 Jun 2004 10:26:52 -0400
Not that big. In my testing, I was backing up a total of 4 savesets
between 3 clients, via nwadmin, not command line. Here's the breakdown:

client 1: 1 saveset = 81 MB, index = 87 MB
client 2: 1 saveset = 43 MB, index = 123 MB
client 3: 2 savesets = 68 MB, 40 MB, index = 98 MB

Total = 232 MB (savesets), 308 MB (indexes)

As you can see, fairly small. The message ALWAYS appeared on the
first write after re-labeling the tape, but only on the first write.
After that it stopped, and there were NO problems recovering. I
experimented with 5-6 tapes (both LTO and SDLT) and two different
libraries (ATL P1000 with gen. 1 SDLT drives and StorageTek L80 with
Seagate LTO gen. 1 drives), and it happened every time. Quantum media on
SDLT and Imation media on the LTO. When I continued testing a few days
later, though, it didn't happen. Strange! I can't recall for certain,
but I think I was running fulls every time since there was so little
data. I have seen this error every now and then, though, on regular
nightly backups, but typically only on that first write.
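
If anyone wants to check how often this shows up in their own logs, here
is a rough Python sketch that tallies the "Verify the device
configuration" notices per volume and device. It assumes the notice
wording matches the /nsr/log/messages excerpt quoted further down; the
function name tally_notices and the SAMPLE text are made up for
illustration, and other NetWorker releases may word the notice
differently.

```python
import re
from collections import Counter

# Pattern for the notice as it appears in the quoted log excerpt.
# ASSUMPTION: the wording is taken from that single sample only.
NOTICE = re.compile(
    r'Volume "(?P<volume>[^"]+)" on device "(?P<device>[^"]+)": '
    r'(?P<reason>[^.]+)\. Verify the device configuration'
)


def tally_notices(log_text):
    """Count notices per (volume, device, reason)."""
    # Log entries are often wrapped across lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    return Counter(
        (m["volume"], m["device"], m["reason"])
        for m in NOTICE.finditer(flat)
    )


# Hypothetical input: one wrapped entry copied from the quoted excerpt.
SAMPLE = """\
Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "FUL618"
on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the
device configuration. Tape
positioning by record is disabled.
"""

if __name__ == "__main__":
    for (volume, device, reason), count in tally_notices(SAMPLE).items():
        print(volume, device, reason, count)
```

To run it against a real log, pass the contents of /nsr/log/messages to
tally_notices() instead of SAMPLE.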

George

"T. S. Kimball" wrote:
>
> How big is that first backup sent to the tape?
>
> We've had that error when a new tape is (re)labeled, and all that
> gets written to it is a small index file.  As you say, no other errors
> and the tape is otherwise happy.  I've learned to ignore it, but it
> does make me concerned from time to time.
>
> Our setup:  L700, 6 x DLT7000, all Fuji media (DLT-IV), Sun E-450
> Solaris 2.6 & Sun SBU/Legato 6.1.3.
>
> --TSK
>
> On Thu, 3 Jun 2004, George Sinclair wrote:
>
> > Imation. I don't believe we've ever used any other brand of LTO. In the
> > case of SDLT, we've used both Imation and Quantum.
> >
> > George
> >
> > Davina Treiber wrote:
> > >
> > > What brand of LTO tapes?
> > >
> > > George Sinclair wrote:
> > > > Hi,
> > > >
> > > > We were experiencing a very strange problem on Wednesday, but I think
> > > > this may be a sign of a deeper problem. I'm hoping someone can help me
> > > > resolve this or at least let me know if our configuration looks okay. I
> > > > think there is something still wrong or misconfigured with our
> > > > stinit.def file and/or NetWorker configuration, but allow me to explain.
> > > >
> > > > This past Wednesday, I was playing around with a new pool of tapes. The
> > > > very first time I would write to a tape, the savesets would run just
> > > > fine, and once everything completed, NetWorker would issue the following
> > > > messages:
> > > >
> > > > Block size is 32768 bytes not 131072 bytes. Verify the device
> > > > configuration. Tape positioning by record is disabled.
> > > >
> > > > This was occurring on several new tapes. Now, I know many folks have
> > > > seen these messages, and they know what this means, but allow me to
> > > > explain further. No abnormal messages of any kind were generated when
> > > > these tapes were first labeled, and they were all brand new tapes.
> > > > Subsequent write jobs did NOT issue any such messages either - no
> > > > errors. Every time I re-labeled a tape, though, it would happen again,
> > > > but only on the first backup to the tape. However, NO such messages
> > > > were generated when doing a recover, even when recovering data from the
> > > > very first write. The recovery went lickety-split, no errors. I cleaned
> > > > all drives, but same problem. I even switched between drives, same
> > > > problem. There were no errors reported in the devices window, though. I
> > > > should note that we normally NEVER see any errors or strange messages
> > > > when labeling tapes. The only thing we see, under normal circumstances
> > > > with new tapes, is the typical input/output error that you expect with
> > > > new tapes that have never been labeled before.
> > > >
> > > > This weirdness occurred on both LTO and SDLT tapes. The affected drives
> > > > were SDLT and LTO first gen. drives located on two different libraries
> > > > (an ATL P1000 SDLT library with two drives and a StorageTek L80 with 4
> > > > Seagate LTO drives), both attached to the same Linux storage node
> > > > server. The primary server is a Sun running Solaris 2.6. Both the
> > > > storage node server and the primary run 6.1.1.
> > > >
> > > > Next, I tried the same operations again on Thursday, and the errors did
> > > > NOT occur this time. No matter how hard I tried, they never reared their
> > > > ugly heads. I even added more new tapes and continued testing,
> > > > re-labeling the tapes and switching between drives, but no errors, and
> > > > there were NO reboots or any shutdown/restart of the software during
> > > > this time. But, last night, while the nightly backups were running
> > > > (these use an older, different pool), I did see the error occur on a new
> > > > tape that had not previously been written to. The error occurred after
> > > > the backups to that tape completed.
> > > >
> > > > Quite some time ago, we were seeing these kinds of errors when we would
> > > > go to do recoveries, and sporadically during backups, too, but mostly
> > > > during recoveries. We then created a /etc/stinit.def file on the
> > > > storagenode server (I've provided a copy below), and the problems went
> > > > away. We do not use any environment variables to set block size, etc. In
> > > > investigating this further, however, I see that since January 2004, this
> > > > problem has occurred on a number of occasions according to the
> > > > /nsr/log/messages file. Since January, we've used 147 tapes (SDLT=66,
> > > > LTO=81), and there have been problems on 22 (SDLT=9, LTO=13). For at
> > > > least half of these 22 tapes, similar messages appeared in the
> > > > NetWorker log file after the tape was marked full, e.g.:
> > > >
> > > > Jan  5 14:51:25 primary root: [ID 702911 daemon.notice] NetWorker Media: (info) loading volume FUL605 into rd=storagenode:/dev/nst3
> > > > Jan  5 15:10:36 primary root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=storagenode:/dev/nst5 writing: No space left on device, at file 137 record 2
> > > > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 on rd=storagenode:/dev/nst5 is full
> > > > Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 used 139 GB of 100 GB capacity
> > > > Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "FUL618" on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
> > > > Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media: (info) verification of volume "FUL618", volid 4126804993 succeeded.
> > > >
> > > > Otherwise the tape appeared okay. For the other half of the 22
> > > > tapes, though, there were other errors suggesting that the tape was
> > > > prematurely marked full before reaching its capacity, possibly due to
> > > > some server error.
> > > >
> > > > Here's our stinit.def file:
> > > >
> > > > # Seagate Ultrium LTO
> > > > manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
> > > > scsi2logical=1 can-bsr auto-lock
> > > > mode1 blocksize=0
> > > > }
> > > >
> > > > # SDLT220
> > > > manufacturer="QUANTUM" model = "SuperDLT1" {
> > > > scsi2logical=1
> > > > can-bsr=1
> > > > auto-lock=0
> > > > two-fms=0
> > > > drive-buffering=1
> > > > buffer-writes
> > > > read-ahead=1
> > > > async-writes=1
> > > > can-partitions=0
> > > > fast-mteom=1
> > > > #
> > > > # If your stinit supports the timeouts:
> > > > timeout=3600 # 1 hour
> > > > long-timeout=14400 # 4 hours
> > > > #
> > > > mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
> > > > mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
> > > > }
> > > >
> > > > I'm not sure why the stinit.def file does not specify the density for
> > > > the LTO, whether it even should, or whether the values for the SDLT
> > > > are correct. Can anyone tell me why we've seen these sporadic errors
> > > > and what changes, if any, we need to make? Does our stinit.def look
> > > > okay? Might that be causing this?
> > > >
> > > > Thanks.
> > > >
> > > > George
> > > >
> > > > --
> > > > Note: To sign off this list, send a "signoff networker" command via email
> > > > to listserv AT listmail.temple DOT edu or visit the list's Web site at
> > > > http://listmail.temple.edu/archives/networker.html where you can
> > > > also view and post messages to the list.
> > > > =*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
> > > >
