Strange NAS to Surestore backup behaviour

Hi,

I'm having intermittent problems with the backups of a NAS device. This backup uses the amanda-netapp-dump-0.1setuidump/dump utils.

I apologise for the size of this email but I'm hoping someone with a similar setup might have some pointers, opinions, guesses...

FreeBSD 2.0
Amanda 2.4.4p4
HP Surestore 20 slot DLT8000 Library
Network Appliance FAS250 (approx 97GB data)

1. Sometimes nothing is written to the first tape, even though amcheck reported OK (including the write test), e.g.:

---Start Mail Report---
002118D 0:00 0.0 0.0 0
002119D 0:18 4826.7 12.6 13
NOTES:
planner: Full dump of bkup02.anz.lucent.com:/dev/netapp/users promoted from 26 days ahead.
taper: tape 002118D kb 0 fm 0 writing filemark: Input/output error
taper: retrying bkup02.anz.lucent.com:/dev/netapp/users.0 on new tape: [writing filemark: Input/output error]
taper: tape 002119D kb 4943264 fm 13 [OK]
---End Mail Report---

I have tried adding sleeps of various lengths at the end of each function in the chg-chio script, but this has made no difference.
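
To see how often this happens across runs, the taper NOTES lines can be tallied with a small script. This is only a minimal sketch in Python and not part of Amanda: the function name parse_taper_lines and the regular expression are my own assumptions, based purely on the line format shown in the report above.

```python
import re

# Matches taper summary lines such as:
#   taper: tape 002118D kb 0 fm 0 writing filemark: Input/output error
#   taper: tape 002119D kb 4943264 fm 13 [OK]
# (pattern is an assumption derived from the report excerpt, not from Amanda docs)
TAPER_RE = re.compile(r"^taper: tape (\S+) kb (\d+) fm (\d+) (.+)$")

def parse_taper_lines(report_text):
    """Return a list of (label, kb, filemarks, status) tuples."""
    results = []
    for line in report_text.splitlines():
        m = TAPER_RE.match(line.strip())
        if m:
            label, kb, fm, status = m.groups()
            results.append((label, int(kb), int(fm), status))
    return results

sample = """\
taper: tape 002118D kb 0 fm 0 writing filemark: Input/output error
taper: retrying bkup02.anz.lucent.com:/dev/netapp/users.0 on new tape: [writing filemark: Input/output error]
taper: tape 002119D kb 4943264 fm 13 [OK]
"""

tapes = parse_taper_lines(sample)
# Any status other than "[OK]" is treated as a failed tape.
failed = [t for t in tapes if t[3] != "[OK]"]

for label, kb, fm, status in tapes:
    print(label, kb, fm, status)
print("tapes with errors:", [t[0] for t in failed])
```

Feeding it several saved mail reports would show whether the filemark errors follow particular tape labels or slots.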

2. Sometimes backups that could fit on one tape are spread over multiple tapes, using only a small percentage of each tape, e.g.:

---Start Mail Report---
These dumps were to tapes 002104D, 002124D, 002111D.
The next 4 tapes Amanda expects to used are: 002118D, 002119D, 002120D, 002121D.

STATISTICS:
                          Total       Full      Daily
                        --------   --------   --------
Estimate Time (hrs:min)    0:38
Run Time (hrs:min)        36:03
Dump Time (hrs:min)       35:02      35:00       0:03
Output Size (meg)       35209.5    35209.4        0.1
Original Size (meg)     88129.0    88127.4        1.6
Avg Compressed Size (%)    40.0       40.0        3.9   (level:#disks ...)
Filesystems Dumped           13         12          1   (1:1)
Avg Dump Rate (k/s)       285.8      286.2        0.4

Tape Time (hrs:min)        2:09       2:09       0:00
Tape Size (meg)         35209.5    35209.4        0.1
Tape Used (%)              91.9       91.9        0.0   (level:#disks ...)
Filesystems Taped            13         12          1   (1:1)
Avg Tp Write Rate (k/s) 4672.5     4673.9       29.7

USAGE BY TAPE:
Label         Time      Size      %    Nb
002104D       0:03     813.6    2.1     4
002124D       0:18    4820.5   12.6     1
002111D       1:48   29575.4   77.2     8

taper: tape 002104D kb 833344 fm 4 writing filemark: Input/output error
taper: retrying bkup02.anz.lucent.com:/dev/netapp/users.0 on new tape: [writing filemark: Input/output error]
taper: tape 002124D kb 4936224 fm 1 writing filemark: Input/output error
taper: retrying bkup02.anz.lucent.com:/dev/netapp/usr/jna.0 on new tape: [writing filemark: Input/output error]
taper: tape 002111D kb 30285632 fm 8 [OK]

DUMP SUMMARY:
                                       DUMPER STATS                 TAPER STATS
HOSTNAME     DISK        L  ORIG-KB   OUT-KB COMP%  MMM:SS   KB/s  MMM:SS   KB/s
------------------------ ----------------------------------------- -------------
bkup02.anz.l -etapp/blah 0     3038      528  17.4    0:20   26.1    0:02  234.8
bkup02.anz.l -netapp/etc 0   104291    46749  44.8    0:59  791.3    0:09 5089.9
bkup02.anz.l -tapp/users 0 10152807  4936168  48.6   73:42 1116.4   17:34 4683.0
bkup02.anz.l -/usr/hwcad 0 12308545  4105162  33.4   94:15  725.9   14:17 4792.7
bkup02.anz.l -sr/include 0     1950      101   5.2    0:13    8.0    0:02   50.8
bkup02.anz.l -pp/usr/jna 0 64714344 26179304  40.5 1912:46  228.1   93:50 4649.6
bkup02.anz.l -pp/usr/lib 0     1610        8   0.5    0:11    0.6    0:02    3.7
bkup02.anz.l -/usr/local 1     1627       65   4.0    2:40    0.4    0:02   29.7
bkup02.anz.l -usr/lucent 0     1609        9   0.6    0:10    0.8    0:02    4.1
bkup02.anz.l -pp/usr/ncd 0   200036    85204  42.6    1:51  766.5    0:15 5525.6
bkup02.anz.l -pp/usr/net 0   111856    43823  39.2    1:08  648.4    0:08 5251.9
bkup02.anz.l -pp/usr/nms 0     1612       10   0.6    0:11    0.8    0:02    4.6
bkup02.anz.l -sr/swtools 0  2640785   657395  24.9   13:56  786.7    2:09 5083.7
---End Mail Report---
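
As a sanity check on the "could fit on one tape" claim, the USAGE BY TAPE percentages can be recomputed from the tape length. This is a minimal sketch in Python; it assumes the percentage is simply size divided by the tapetype length (38295 mbytes, from the CUST-DLT8000 definition in the amanda.conf further down), and percent_used is a name of my own choosing.

```python
# "length 38295 mbytes" from the CUST-DLT8000 tapetype in amanda.conf
TAPE_LENGTH_MB = 38295

# (label, size in MB) taken from the USAGE BY TAPE section of the report
usage = [("002104D", 813.6), ("002124D", 4820.5), ("002111D", 29575.4)]

def percent_used(size_mb, length_mb=TAPE_LENGTH_MB):
    """Percentage of one tape that size_mb would occupy (assumed formula)."""
    return round(100.0 * size_mb / length_mb, 1)

for label, size in usage:
    print(label, percent_used(size))

# Total written across all three tapes, and what share of ONE tape that is.
total = sum(size for _, size in usage)
print("total MB:", round(total, 1), "percent of one tape:", percent_used(total))
```

Under that assumption the per-tape figures match the report, and the combined total comes out below the length of a single tape.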

3. Dump times to the holding disk can sometimes vary greatly e.g.:

Amanda Dump 20050606                    Elapsed Time = 9:40:43
Bandwidth = 25120                       Final Status = TAPE ERROR
Holding disk = 66560                    Dumped/Failed = 13/0
Tape Policy = FIRST                     Output data size = 25563
Dumpers = 4                             Estimated data size = 25608
Driver alg = drain-ends At big end 0

Amanda Dump 20050531                    Elapsed Time = 32:34:04
Bandwidth = 25120                       Final Status = TAPE ERROR
Holding disk = 66560                    Dumped/Failed = 13/0
Tape Policy = FIRST                     Output data size = 25589
Dumpers = 4                             Estimated data size = 25607
Driver alg = drain-ends At big end 0

Apart from the time the dumps take to complete, I can see no difference between the sessions while they're running. Both the Amanda server and the NAS device appear to be running well, with no CPU, memory or disk bottlenecks. There are no apparent network problems, though I have been unable to get LAN utilization stats.
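
To put a number on the difference between the two sessions above, the elapsed times can be converted to seconds and compared. This is a rough sketch in Python; to_seconds is a helper name I made up, and the "Output data size" values are used exactly as printed, since I am not certain of their unit.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' elapsed-time string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Values copied from the two session summaries above.
runs = {
    "20050606": {"elapsed": "9:40:43", "output": 25563},
    "20050531": {"elapsed": "32:34:04", "output": 25589},
}

for date, r in runs.items():
    secs = to_seconds(r["elapsed"])
    # Output units per second, in whatever unit the summary reports.
    print(date, secs, round(r["output"] / secs, 3))

# How many times longer the slow session took for nearly the same output size.
ratio = to_seconds("32:34:04") / to_seconds("9:40:43")
print("slow/fast elapsed ratio:", round(ratio, 2))
```

The output sizes differ by well under 1%, so the longer run is not explained by more data.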

---Start amanda.conf---
org "Toaster"
mailto "backup"
dumpuser "amanda"

inparallel 4
dumporder "BTBTBTBTBTBT"
netusage 10000 Kbps
dumpcycle 4 weeks
runspercycle 20
tapecycle 100 tapes

bumpsize 20 Mb
bumpdays 1
bumpmult 4

etimeout 3600
dtimeout 6000
ctimeout 30
tapebufs 20

runtapes 4
tpchanger "chg-chio"
tapedev "/dev/nrst1"
rawtapedev "/dev/null"
changerfile "/usr/pkg/etc/amanda/Toaster/changer.conf"
changerdev "/dev/ch0"
maxdumpsize -1
tapetype CUST-DLT8000
labelstr "^0021[0-9][0-9]D"
amrecover_do_fsf yes
amrecover_check_label yes
amrecover_changer "/dev/nrst1"

holdingdisk hd1 {
    comment "main holding disk"
    directory "/amanda/hd1/CH0"
    use 65Gb
    chunksize 35Gb
    }

autoflush yes

infofile "/var/amanda/Toaster/curinfo"
logdir "/var/amanda/Toaster"
indexdir "/var/amanda/Toaster/index"

define tapetype CUST-DLT8000 {
    comment "DLT8000 Drive generated by amtapetype"
    length 38295 mbytes
    filemark 30 kbytes
    speed 5800 kps
}

define dumptype global {
    comment "Global definitions"
    # index yes
    # record no
}

define dumptype comp-high-fast {
    global
    comment "very important partitions on fast machines"
    compress client fast
    priority high
}

define interface local {
    comment "a local disk"
    use 10000 kbps
}

define interface fxp0 {
    comment "100 Mbps ethernet"
    use 5120 kbps
}
---End amanda.conf---

About 70% of the backup sessions run well but once or twice a week something goes wrong.

I have a second Surestore library doing normal system backups, and these run without problems, though no config there needs to span multiple tapes, i.e. each backup config fits on one tape.

Any ideas on where to start troubleshooting these problems would be greatly appreciated. Does the config file look OK, or does anyone recommend changes?

Thanks,
Greg.