Subject: Re: [Networker] Very strange problem and question on block size error?
From: Chad Smykay <csmykay AT RACKSPACE DOT COM>
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Date: Fri, 4 Jun 2004 14:28:01 -0500
Check that your NSR block size environment variable is set for each type of drive.  For
example:

NSR_DEV_BLOCK_SIZE_LTO_Ultrium_2=64
export NSR_DEV_BLOCK_SIZE_LTO_Ultrium_2
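These variables need to be in the environment of the NetWorker daemons on the
storage node before they start, typically by setting them in the NetWorker
startup script. A minimal sketch for the drive types mentioned below; the
variable suffixes and the 64/128 values here are illustrative assumptions, not
settings verified against these drives (the value appears to be in KB, matching
the 64 above):

# In the storage node's NetWorker startup script, before nsrd/nsrmmd start.
# Suffixes and values below are assumptions -- match them to your actual
# device types and the block sizes you want.
NSR_DEV_BLOCK_SIZE_LTO_Ultrium=64       # first-generation LTO, 64 KB
NSR_DEV_BLOCK_SIZE_SuperDLT1=128        # SDLT220, 128 KB
export NSR_DEV_BLOCK_SIZE_LTO_Ultrium NSR_DEV_BLOCK_SIZE_SuperDLT1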

Hope this helps.


Chad Smykay, RHCE, LCNA
Systems Storage Administrator
Rackspace Managed Hosting (TM) - The Managed Hosting Specialist (TM)


-----Original Message-----
From: Legato NetWorker discussion [mailto:NETWORKER AT LISTMAIL.TEMPLE DOT EDU]
On Behalf Of George Sinclair
Sent: Friday, May 28, 2004 12:09 PM
To: NETWORKER AT LISTMAIL.TEMPLE DOT EDU
Subject: [Networker] Very strange problem and question on block size error?

Hi,

We were experiencing a very strange problem on Wednesday, and I think it may be
a sign of a deeper issue. I'm hoping someone can help me resolve it, or at
least let me know whether our configuration looks okay. I suspect something is
still wrong or misconfigured in our stinit.def file and/or NetWorker
configuration, but allow me to explain.

This past Wednesday, I was playing around with a new pool of tapes. The very
first time I would write to a tape, the savesets would run just fine, and
once everything completed, NetWorker would issue the following
messages:

Block size is 32768 bytes not 131072 bytes. Verify the device configuration.
Tape positioning by record is disabled.
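For reference, 32768 bytes is 32 KB and 131072 bytes is 128 KB, so the first
chunk on the tape was written with a smaller block size than the 128 KB
NetWorker expected. On Linux, the block size and density the st driver
currently has configured for a drive can be checked with mt, e.g. (assuming
the drive is /dev/nst3; substitute your own device):

mt -f /dev/nst3 status
# Look for "Tape block size 0 bytes" (variable-block mode, which the
# blocksize=0 lines in the stinit.def below request) and the reported
# density code.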

This was occurring on several new tapes. Now, I know many folks have seen
these messages and know what they mean, but allow me to explain further. No
abnormal messages of any kind were generated when these tapes were first
labeled, and they were all brand new tapes. Subsequent write jobs did NOT
issue any such messages - no errors. But every time I re-labeled a tape, it
would happen again, and only on the first backup to the tape. However, NO such
messages were generated when doing a recover, even when recovering data from
the very first write. The recovery went lickety-split, no errors. I cleaned
all the drives - same problem. I even switched between drives - same problem.
No errors were reported in the devices window, though. I should note that we
normally NEVER see any errors or strange messages when labeling tapes. The
only thing we see, under normal circumstances with new tapes, is the typical
input/output error you expect from a tape that has never been labeled before.

This weirdness occurred on both LTO and SDLT tapes. The affected drives were
SDLT and first-generation LTO drives in two different libraries (an ATL P1000
SDLT library with two drives and a StorageTek L80 with four Seagate LTO
drives), both attached to the same Linux storage node server.
The primary server is a Sun running Solaris 2.6. Both the storage node and the
primary server run NetWorker 6.1.1.

Next, I tried the same operations again on Thursday, and the errors did NOT
occur this time. No matter how hard I tried, they never reared their ugly
heads. I even added more new tapes and continued testing, re-labeling the
tapes and switching between drives, but no errors, and there were NO reboots
or any shutdown/restart of the software during this time. But, last night,
while the nightly backups were running (these use an older, different pool),
I did see the error occur on a new tape that had not previously been written
to. The error occurred after the backups to that tape completed.

Quite some time ago, we were seeing these kinds of errors when doing
recoveries, and sporadically during backups too, but mostly during recoveries.
We then created an /etc/stinit.def file on the storage node server (a copy is
provided below), and the problems went away. We do not use any environment
variables to set the block size, etc. In investigating this further, however,
I see from the /nsr/log/messages file that the problem has occurred on a
number of occasions since January 2004. Since January, we've used 147 tapes
(66 SDLT, 81 LTO), and there have been problems on 22 of them (9 SDLT, 13
LTO). For at least half of these 22 tapes, a similar message appeared in the
NetWorker log file after the tape was marked full, e.g.:

Jan  5 14:51:25 primary root: [ID 702911 daemon.notice] NetWorker Media: (info) loading volume FUL605 into rd=storagenode:/dev/nst3
Jan  5 15:10:36 primary root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=storagenode:/dev/nst5 writing: No space left on device, at file 137 record 2
Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 on rd=storagenode:/dev/nst5 is full
Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 used 139 GB of 100 GB capacity
Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "FUL618" on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media: (info) verification of volume "FUL618", volid 4126804993 succeeded.

The tape appeared okay otherwise. For the other half of the 22 tapes, though,
there were other errors suggesting the tape was prematurely marked full and
never reached its capacity, possibly due to some server error.
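A simple way to tally how often this has happened is to grep the messages
file for the strings in those warnings, e.g. (the match strings are taken
from the excerpts above; adjust them to whatever your log actually shows):

# Count occurrences of the block-size warnings in the NetWorker log:
grep -c "Cannot decode block" /nsr/log/messages
grep -c "Tape positioning by record is disabled" /nsr/log/messages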

Here's our stinit.def file:

# Seagate Ultrium LTO
manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
scsi2logical=1 can-bsr auto-lock
mode1 blocksize=0
}

# SDLT220
manufacturer="QUANTUM" model = "SuperDLT1" {
scsi2logical=1
can-bsr=1
auto-lock=0
two-fms=0
drive-buffering=1
buffer-writes
read-ahead=1
async-writes=1
can-partitions=0
fast-mteom=1
#
# If your stinit supports the timeouts:
timeout=3600 # 1 hour
long-timeout=14400 # 4 hours
#
mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
}
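Note that /etc/stinit.def only takes effect when stinit is run against the
tape devices (typically done at boot), so after editing it one would re-run it
by hand, something like the following (the device paths are just examples; use
your own nst devices):

# Re-apply the definitions to the storage node's tape drives (run as root):
stinit -v /dev/nst3 /dev/nst5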

I'm not sure why the stinit.def file does not specify the density for the
LTO, whether it even should, or whether the values for the SDLT are correct.
Can anyone tell me why we've seen these sporadic errors and what changes, if
any, we need to make? Does our stinit.def look okay? Might that be causing
this?
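For comparison, a stanza that does pin the LTO density might look something
like the sketch below. The 0x40 density code is the standard code for
first-generation Ultrium (100 GB native), but treat it as an assumption to
check against the drive documentation rather than a setting known to work
here:

# Hypothetical Seagate Ultrium stanza with an explicit density (unverified):
manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
scsi2logical=1 can-bsr auto-lock
mode1 blocksize=0 density=0x40 compression=1    # LTO-1, 100 GB native
}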

Thanks.

George

--
Note: To sign off this list, send a "signoff networker" command via email to
listserv AT listmail.temple DOT edu or visit the list's Web site at
http://listmail.temple.edu/archives/networker.html where you can also view
and post messages to the list.
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
