[Networker] Very strange problem and question on block size error?

Hi,

We experienced a very strange problem on Wednesday, and I think it
may be a sign of a deeper problem. I'm hoping someone can help me
resolve this or at least let me know if our configuration looks okay. I
think there is still something wrong or misconfigured in our
stinit.def file and/or NetWorker configuration, but allow me to explain.

This past Wednesday, I was playing around with a new pool of tapes. The
very first time I would write to a tape, the savesets would run just
fine, and once everything completed, NetWorker would issue the following
messages:

Block size is 32768 bytes not 131072 bytes. Verify the device
configuration. Tape positioning by record is disabled.

This occurred on several new tapes. I know many folks have seen these
messages and know what they mean, but allow me to explain further. No
abnormal messages of any kind were generated when these tapes were first
labeled, and they were all brand-new tapes. Subsequent write jobs did
NOT issue any such messages. Every time I re-labeled a tape, however,
the message came back, but only on the first backup to that tape. NO
such messages were generated when doing a recover, even when recovering
data from the very first write; the recovery went lickety-split, with no
errors. I cleaned all the drives and switched between drives, but the
problem persisted. There were no errors reported in the devices window,
though. I should note that we normally NEVER see any errors or strange
messages when labeling tapes. The only thing we see, under normal
circumstances, is the typical input/output error that you expect with
new tapes that have never been labeled before.
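
To put the two numbers in that message side by side, here is a minimal
Python sketch that pulls them out of the message text. The regular
expression and the name parse_block_size_message are hypothetical,
written against the wording quoted above, and are not anything NetWorker
provides. Reading the first number as what was found on the tape and the
second as what NetWorker expected is an assumption.

```python
import re

# Pattern written against the message wording quoted in this post
# (hypothetical helper, not a NetWorker tool).
BLOCK_MSG = re.compile(r"Block size is (\d+) bytes not (\d+) bytes")

def parse_block_size_message(text):
    """Return (first_bytes, second_bytes) from the message, or None."""
    match = BLOCK_MSG.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

message = ("Block size is 32768 bytes not 131072 bytes. Verify the device "
           "configuration. Tape positioning by record is disabled.")

# Assumption: first number = found on tape, second = expected.
found, expected = parse_block_size_message(message)
print(f"found {found // 1024} KiB, expected {expected // 1024} KiB, "
      f"ratio {expected // found}")
```

So the message is comparing a 32 KiB block against a 128 KiB one, a
factor of four.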

This weirdness occurred on both LTO and SDLT tapes. The affected drives
were SDLT and first-generation LTO drives located in two different
libraries (an ATL P1000 SDLT library with two drives and a StorageTek
L80 with four Seagate LTO drives), both attached to the same Linux
storage node server. The primary server is a Sun running Solaris 2.6.
Both the storage node server and the primary run NetWorker 6.1.1.

Next, I tried the same operations again on Thursday, and the errors did
NOT occur this time. No matter how hard I tried, they never reared their
ugly heads. I even added more new tapes and continued testing,
re-labeling the tapes and switching between drives, but no errors, and
there were NO reboots or any shutdown/restart of the software during
this time. But, last night, while the nightly backups were running
(these use an older, different pool), I did see the error occur on a new
tape that had not previously been written to. The error occurred after
the backups to that tape completed.

Quite some time ago, we were seeing these kinds of errors when we would
go to do recoveries, and sporadically during backups, too, but mostly
during recoveries. We then created a /etc/stinit.def file on the
storagenode server (I've provided a copy below), and the problems went
away. We do not use any environment variables to set block size, etc. In
investigating this further, however, I see that since January 2004, this
problem has occurred on a number of occasions according to the
/nsr/log/messages file. Since January, we've used 147 tapes (SDLT=66,
LTO=81), and there have been problems on 22 (SDLT=9, LTO=13). For at
least half of these 22 tapes, similar messages appeared in the
NetWorker log file after the tape was marked full, e.g.:

Jan  5 14:51:25 primary root: [ID 702911 daemon.notice] NetWorker Media: (info) loading volume FUL605 into rd=storagenode:/dev/nst3
Jan  5 15:10:36 primary root: [ID 702911 daemon.notice] NetWorker media: (warning) rd=storagenode:/dev/nst5 writing: No space left on device, at file 137 record 2
Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 on rd=storagenode:/dev/nst5 is full
Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) sdlt tape FUL618 used 139 GB of 100 GB capacity
Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media: (notice) Volume "FUL618" on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the device configuration. Tape positioning by record is disabled.
Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media: (info) verification of volume "FUL618", volid 4126804993 succeeded.

In cases like this one, the tape appeared okay otherwise. For the other
half of the 22 tapes, however, there were other errors suggesting that
the tape was prematurely marked full before reaching its capacity,
possibly due to some server error.
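
For anyone who wants to tally affected volumes the same way, here is a
rough Python sketch of how the log could be scanned. It assumes
syslog-style entries shaped like the excerpt above and only recognizes
the Volume "..." form shown there; the function names join_wrapped and
flagged_volumes are hypothetical. The join step is there because mail
clients tend to fold long log lines, and it leaves unfolded entries
unchanged.

```python
import re

# Entry header as in the excerpt, e.g. "Jan  5 15:10:48 primary ...".
ENTRY_START = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} ")
VOLUME = re.compile(r'Volume "([^"]+)"')

def join_wrapped(lines):
    """Re-join log entries that were folded across several lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new entry begins here
        else:
            entries[-1] += " " + line     # continuation of the last entry
    return entries

def flagged_volumes(lines):
    """Volumes named in entries carrying the device-configuration notice."""
    volumes = []
    for entry in join_wrapped(lines):
        if "Verify the device configuration" in entry:
            match = VOLUME.search(entry)
            if match and match.group(1) not in volumes:
                volumes.append(match.group(1))
    return volumes

# Sample taken from the excerpt, folded the way a mail client might fold it.
sample = '''Jan  5 15:10:37 primary root: [ID 702911 daemon.notice] NetWorker media:
(notice) sdlt tape
FUL618 used 139 GB of 100 GB capacity
Jan  5 15:10:48 primary root: [ID 702911 daemon.notice] NetWorker media:
(notice) Volume "FUL618"
on device "rd=storagenode:/dev/nst5": Cannot decode block. Verify the
device configuration. Tape
positioning by record is disabled.
Jan  5 15:11:50 primary root: [ID 702911 daemon.notice] NetWorker media:
(info) verification of
volume "FUL618", volid 4126804993 succeeded.'''.splitlines()

print(flagged_volumes(sample))
```

Against the real file, the same function could be fed the lines of
/nsr/log/messages. Other wordings of the notice may not name the volume
in the same way, so treat the result as a lower bound.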

Here's our stinit.def file:

# Seagate Ultrium LTO
manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
scsi2logical=1 can-bsr auto-lock
mode1 blocksize=0
}

# SDLT220
manufacturer="QUANTUM" model = "SuperDLT1" {
scsi2logical=1
can-bsr=1
auto-lock=0
two-fms=0
drive-buffering=1
buffer-writes
read-ahead=1
async-writes=1
can-partitions=0
fast-mteom=1
#
# If your stinit supports the timeouts:
timeout=3600 # 1 hour
long-timeout=14400 # 4 hours
#
mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
}

I'm not sure why the stinit.def file does not specify a density for
the LTO drives, whether it even should, or whether the values for the
SDLT are correct. Can anyone tell me why we might be seeing these
sporadic errors and what changes, if any, we need to make? Does our
stinit.def look okay? Might it be causing this?
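
To make the density question concrete, here is a small Python sketch
that lists which mode lines in a stinit.def set no density. It is a
simplified parser written only against the layout of the file above (one
option group per line, # comments, a closing brace on its own line), and
modes_without_density is a hypothetical name. It does not say whether a
density is required, only where one is absent.

```python
import re

def modes_without_density(text):
    """Map each drive model to the mode lines that set no density."""
    missing = {}
    model = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments
        if not line:
            continue
        found = re.search(r'model\s*=\s*"([^"]+)"', line)
        if found:
            model = found.group(1)             # start of a drive definition
            continue
        if line == "}":
            model = None                       # end of a drive definition
            continue
        if model and re.match(r"mode\d+\b", line) and "density=" not in line:
            missing.setdefault(model, []).append(line.split()[0])
    return missing

# Abbreviated copy of the definitions from this post.
STINIT_DEF = '''
# Seagate Ultrium LTO
manufacturer=SEAGATE model = "ULTRIUM06242-XXX" {
scsi2logical=1 can-bsr auto-lock
mode1 blocksize=0
}

# SDLT220
manufacturer="QUANTUM" model = "SuperDLT1" {
scsi2logical=1
timeout=3600 # 1 hour
mode1 blocksize=0 density=0x48 compression=1    # 110 GB + compression
mode2 blocksize=0 density=0x48 compression=0    # 110 GB, no compression
}
'''

print(modes_without_density(STINIT_DEF))
```

On our file this reports only mode1 of the Seagate LTO entry; both SDLT
modes carry density=0x48.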

Thanks.

George
