Re: Strange NAS to Surestore backup behaviour

--On Thursday, August 11, 2005 11:52:10 +1000 "Keenan, Greg John (Greg)** CTR **" <gjkeenan AT lucent DOT com> wrote:

> Hi, 
> 
> I'm having inconsistent problems with the backups of a NAS device.  This
> backup uses the amanda-netapp-dump-0.1setuidump/dump utils.
> 
> I apologise for the size of this email but I'm hoping someone with a
> similar setup might have some pointers, opinions, guesses...
> 
> FreeBSD 2.0 
> Amanda 2.4.4p4 
> HP Surestore 20 slot DLT8000 Library 
> Network Appliance FAS250 (approx 97GB data) 
> 
> 1. Sometimes there are no writes to the first tape even though amcheck
> was OK, including the write test, e.g.: 
> 
> ---Start Mail Report--- 
> 002118D       0:00       0.0    0.0     0 
> 002119D       0:18    4826.7   12.6    13 
> NOTES: 
>   planner: Full dump of bkup02.anz.lucent.com:/dev/netapp/users promoted from 26 days ahead. 
>   taper: tape 002118D kb 0 fm 0 writing filemark: Input/output error 
>   taper: retrying bkup02.anz.lucent.com:/dev/netapp/users.0 on new tape: [writing filemark: Input/output error] 
>   taper: tape 002119D kb 4943264 fm 13 [OK] 
> ---End Mail Report--- 
> 
> I have put different length sleeps at the end of each function in the
> chg-chio script but this has made no difference. 
> 
> 
> 
> 2. Sometimes backups that could fit on 1 tape are spread over multiple
> tapes, only utilising a small percentage of each available tape, e.g.:
> 
> ---Start Mail Report--- 
> These dumps were to tapes 002104D, 002124D, 002111D. 
> The next 4 tapes Amanda expects to used are: 002118D, 002119D, 002120D, 002121D. 
> 
> STATISTICS: 
>                           Total       Full      Daily 
>                         --------   --------   -------- 
> Estimate Time (hrs:min)    0:38 
> Run Time (hrs:min)        36:03 
> Dump Time (hrs:min)       35:02      35:00       0:03 
> Output Size (meg)       35209.5    35209.4        0.1 
> Original Size (meg)     88129.0    88127.4        1.6 
> Avg Compressed Size (%)    40.0       40.0        3.9   (level:#disks ...) 
> Filesystems Dumped           13         12          1   (1:1) 
> Avg Dump Rate (k/s)       285.8      286.2        0.4 
> 
> Tape Time (hrs:min)        2:09       2:09       0:00 
> Tape Size (meg)         35209.5    35209.4        0.1 
> Tape Used (%)              91.9       91.9        0.0   (level:#disks ...) 
> Filesystems Taped            13         12          1   (1:1) 
> Avg Tp Write Rate (k/s)  4672.5     4673.9       29.7 
> 
> USAGE BY TAPE: 
>   Label         Time      Size      %    Nb 
>   002104D       0:03     813.6    2.1     4 
>   002124D       0:18    4820.5   12.6     1 
>   002111D       1:48   29575.4   77.2     8 
> 
>   taper: tape 002104D kb 833344 fm 4 writing filemark: Input/output error 
>   taper: retrying bkup02.anz.lucent.com:/dev/netapp/users.0 on new tape: [writing filemark: Input/output error] 
>   taper: tape 002124D kb 4936224 fm 1 writing filemark: Input/output error 
>   taper: retrying bkup02.anz.lucent.com:/dev/netapp/usr/jna.0 on new tape: [writing filemark: Input/output error]

Any time you see I/O errors at random offsets, you should first check for:
a) dirty heads on the tape drive
b) bad tapes (although it's not likely that many would go bad at once, unless
   they have all been heavily used)
c) SCSI errors (check your system logs) due to improper termination (none, or
   more than one terminator), a bad cable or poor connection (try disconnecting
   and reconnecting the cable), or possibly even a bad controller
d) a bad drive

You may be experiencing a totally different problem, but start with the easy
stuff first.

Also, look into the 'columnspec' config option.  It won't help your I/O errors
but it will make your daily report easier to read.
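
As a rough sketch of what that could look like in amanda.conf: the widths
below are guesses sized to the largest values in the DUMP SUMMARY section of
your report (8-digit KB counts, dump times like 1912:46), not tested values,
and the column names should be checked against the amanda.conf man page for
your version.

```
# Illustrative only: each entry is Name=space:width, where "space" is the
# padding printed before the column. Widths here are assumptions, not tuned.
columnspec "OrigKB=1:8,OutKB=1:8,DumpTime=1:7,DumpRate=1:6,TapeTime=1:6,TapeRate=1:6"
```

Columns not listed keep their default widths, so this only needs to name the
ones that are running together.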

Frank

>   taper: tape 002111D kb 30285632 fm 8 [OK] 
> 
> DUMP SUMMARY: 
>                                      DUMPER STATS            TAPER STATS
> 
> HOSTNAME     DISK        L ORIG-KB OUT-KB COMP% MMM:SS  KB/s MMM:SS  KB/s
> -------------------------- --------------------------------- ------------
> bkup02.anz.l -etapp/blah 0    3038    528  17.4   0:20  26.1   0:02 234.8
> bkup02.anz.l -netapp/etc 0  104291  46749  44.8   0:59 791.3   0:095089.9
> bkup02.anz.l -tapp/users 0 101528074936168  48.6  73:421116.4  17:344683.0
> bkup02.anz.l -/usr/hwcad 0 123085454105162  33.4  94:15 725.9  14:174792.7
> bkup02.anz.l -sr/include 0    1950    101   5.2   0:13   8.0   0:02  50.8
> bkup02.anz.l -pp/usr/jna 0 6471434426179304  40.51912:46 228.1  93:504649.6
> bkup02.anz.l -pp/usr/lib 0    1610      8   0.5   0:11   0.6   0:02   3.7
> bkup02.anz.l -/usr/local 1    1627     65   4.0   2:40   0.4   0:02  29.7
> bkup02.anz.l -usr/lucent 0    1609      9   0.6   0:10   0.8   0:02   4.1
> bkup02.anz.l -pp/usr/ncd 0  200036  85204  42.6   1:51 766.5   0:155525.6
> bkup02.anz.l -pp/usr/net 0  111856  43823  39.2   1:08 648.4   0:085251.9
> bkup02.anz.l -pp/usr/nms 0    1612     10   0.6   0:11   0.8   0:02   4.6
> bkup02.anz.l -sr/swtools 0 2640785 657395  24.9  13:56 786.7   2:095083.7
> ---End Mail Report--- 
> 
> 3. Dump times to the holding disk can sometimes vary greatly e.g.: 
> 
> Amanda Dump 20050606                    Elapsed Time = 9:40:43 
> Bandwidth = 25120                       Final Status = TAPE ERROR 
> Holding disk = 66560                    Dumped/Failed = 13/0 
> Tape Policy = FIRST                     Output data size = 25563 
> Dumpers = 4                             Estimated data size = 25608 
> Driver alg = drain-ends At big end 0 
> 
> Amanda Dump 20050531                    Elapsed Time = 32:34:04 
> Bandwidth = 25120                       Final Status = TAPE ERROR 
> Holding disk = 66560                    Dumped/Failed = 13/0 
> Tape Policy = FIRST                     Output data size = 25589 
> Dumpers = 4                             Estimated data size = 25607 
> Driver alg = drain-ends At big end 0 
> 
> Apart from the length of time for the dumps to complete I can see no
> difference between the sessions when they're running.  Both the Amanda
> server and the NAS device appear to be running well with no CPU, memory
> or disk bottlenecks.  No apparent network problems though I have been
> unable to get LAN utilization stats.
> 
> ---Start amanda.conf--- 
> org "Toaster" 
> mailto "backup" 
> dumpuser "amanda" 
> 
> inparallel 4 
> dumporder "BTBTBTBTBTBT" 
> netusage  10000 Kbps 
> dumpcycle 4 weeks 
> runspercycle 20 
> tapecycle 100 tapes 
> 
> bumpsize 20 Mb 
> bumpdays 1 
> bumpmult 4 
> 
> etimeout 3600 
> dtimeout 6000 
> ctimeout 30 
> tapebufs 20 
> 
> runtapes 4 
> tpchanger "chg-chio" 
> tapedev "/dev/nrst1" 
> rawtapedev "/dev/null" 
> changerfile "/usr/pkg/etc/amanda/Toaster/changer.conf" 
> changerdev "/dev/ch0" 
> maxdumpsize -1 
> tapetype CUST-DLT8000 
> labelstr "^0021[0-9][0-9]D" 
> amrecover_do_fsf yes 
> amrecover_check_label yes 
> amrecover_changer "/dev/nrst1" 
> 
> holdingdisk hd1 { 
>     comment "main holding disk" 
>     directory "/amanda/hd1/CH0" 
>     use 65Gb 
>     chunksize 35Gb 
>     } 
> 
> autoflush yes 
> 
> infofile "/var/amanda/Toaster/curinfo" 
> logdir   "/var/amanda/Toaster" 
> indexdir "/var/amanda/Toaster/index" 
> 
> define tapetype CUST-DLT8000 { 
>     comment "DLT8000 Drive generated by amtapetyep" 
>     length 38295 mbytes 
>     filemark 30 kbytes 
>     speed 5800 kps 
> } 
> 
> define dumptype global { 
>     comment "Global definitions" 
>     # index yes 
>     # record no 
> } 
> 
> define dumptype comp-high-fast { 
>     global 
>     comment "very important partitions on fast machines" 
>     compress client fast 
>     priority high 
> } 
> 
> define interface local { 
>     comment "a local disk" 
>     use 10000 kbps 
> } 
> 
> define interface fxp0 { 
>     comment "100 Mbps ethernet" 
>     use 5120 kbps 
> } 
> ---End amanda.conf--- 
> 
> 
> About 70% of the backup sessions run well but once or twice a week
> something goes wrong. 
> 
> I have a second Surestore library doing normal system backups and these
> run without problem though no config needs to span multiple tapes i.e.
> each backup config fits on 1 tape.
> 
> Any ideas on where to start troubleshooting these problems greatly
> appreciated.  Does the config file look OK or does anyone recommend
> changes?
> 
> Thanks, 
> Greg. 
> 



--
Frank Smith                                     fsmith AT hoovers DOT com
Sr. Systems Administrator                                 Voice: 512-374-4673
Hoover's Online                                             Fax: 512-374-4501