Amanda-Users

Re: no backups since a week ago (data write: Connection reset by peer)

2007-05-10 19:27:21
Subject: Re: no backups since a week ago (data write: Connection reset by peer)
From: Gene Heskett <gene.heskett AT verizon DOT net>
To: Steven Settlemyre <settlemy AT asel.udel DOT edu>
Date: Thu, 10 May 2007 09:03:14 -0400
On Thursday 10 May 2007, Steven Settlemyre wrote:
> From amdump log:
>
>This similar flow happens every time it fails. At first I was worried
>about the "not enough diskspace" message, but this is just because my
>holding disk filled up. I see the same thing happening in successful runs.
>
>driver: state time 8459.717 free kps: 37761 space: 55584 taper: idle
>idle-dumpers: 7 qlen tapeq: 0 runq: 18 roomq: 2 wakeup: 0 d
>river-idle: no-diskspace
>driver: interface-state time 8459.717 if : free 37761
>driver: hdisk-state time 8459.717 hdisk 0: free 55584 dumpers 1
>driver: result time 8459.727 from chunker3: RQ-MORE-DISK 03-00004
>find diskspace: not enough diskspace. Left with 508960 K
>find diskspace: not enough diskspace. Left with 71872 K
>find diskspace: not enough diskspace. Left with 17664 K
>find diskspace: not enough diskspace. Left with 508960 K
>driver: Don't know how to send ABORT command to chunker
>taper: DONE [idle wait: 7182.514 secs]
>chunker: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7311.263: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7311.263: pid 1129 finish time Thu May 10 02:06:02 2007
>chunker: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7888.517: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7888.517: pid 31879 finish time Thu May 10 02:06:02 2007
>chunker: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7888.485: error [bad command after RQ-MORE-DISK: "QUIT"]
>chunker: time 7888.485: pid 31880 finish time Thu May 10 02:06:02 2007
>taper: writing end marker. [VOL14 OK kb 2300352 fm 28]
>dumper: kill index command
>dumper: kill index command
>dumper: kill index command
>amdump: end at Thu May 10 02:06:02 EDT 2007
>Scanning /holding/amanda...
>  20070509234502: found Amanda directory.
>
Humm, I see no such diskspace related messages in last nights amdump.1 file 
here.

How much holding disk do you have, and how much of it is reserved for 
emergency incrementals?  By default, its 100%, so if you don't have 
that "reserved" set at all...

I think one should have enough holding disk that is not reserved to cover the 
two largest disklist entries you have.  In my case that would be around 15GB 
of non-reserved space, but then I have about 80GB available here, more than 
adequate till I fill it up.  Again..  :-)

>Gene Heskett wrote:
>> On Wednesday 09 May 2007, Steven Settlemyre wrote:
>>> Can someone please help me?
>>>
>>> Steven Settlemyre wrote:
>>>> I haven't changed my configs for months and things were running great
>>>> until last week. Since last tues, none of my dailies have finished,
>>>> and last night a monthly failed.
>>>>
>>>> Looking through the logs I see the problem always seems to start with
>>>> "data write: Connection reset by peer" and "Don't know how to send
>>>> ABORT command to chunker". I'm having a hard time interpreting the
>>>> logs and can't seem to find too much in the archives about this. Was
>>>> wondering if someone could walk me through an explanation of the
>>>> problem and how to avoid it in the future.
>>>>
>>>> My monthlies run tape spanning on 3 40G tapes.
>>>>
>>>> Here is the email output generated:
>>>>
>>>> *** THE DUMPS DID NOT FINISH PROPERLY!
>>>>
>>>> These dumps were to tape Monthly21.
>>>> The next 3 tapes Amanda expects to use are: Monthly01, Monthly02,
>>>> Monthly03.
>>>> The next 3 new tapes already labelled are: Monthly19, Monthly20,
>>>> Monthly22.
>>>>
>>>> FAILURE AND STRANGE DUMP SUMMARY:
>>>>  wagstaff      /usr/local                    lev 1  FAILED [data
>>>> write: Connection reset by peer]
>>>>  lollipop      /files1                       lev 0  FAILED [data
>>>> write: Connection reset by peer]
>>>>  helios      /files3                       lev 1  FAILED [data write:
>>>> Connection reset by peer]
>>>>  helios      /                             RESULTS MISSING
>>>>  helios      /files2                       RESULTS MISSING
>>>>  helios      /usr                          RESULTS MISSING
>>>>  helios      /usr/local                    RESULTS MISSING
>>>>  helios      /var                          RESULTS MISSING
>>>>  lollipop      /                             RESULTS MISSING
>>>>  lollipop      /usr                          RESULTS MISSING
>>>>  lollipop      /usr/local                    RESULTS MISSING
>>>>  wagstaff      /files3                       RESULTS MISSING
>>>>  wagstaff      /files4                       RESULTS MISSING
>>>>  wagstaff      /files5                       RESULTS MISSING
>>>>  wagstaff      /files6/vol/Voiceware         RESULTS MISSING
>>>>  wizard        /files2                       RESULTS MISSING
>>>>  snapserver  /hd/vol_mnt0/shares/TermLab   RESULTS MISSING
>>>>  snapserver  /hd/vol_mnt0/shares/bcl       RESULTS MISSING
>>>>  snapserver  /hd/vol_mnt0/shares/biochem   RESULTS MISSING
>>>>  snapserver  /hd/vol_mnt0/shares/confocal  RESULTS MISSING
>>>>  driver: FATAL Don't know how to send ABORT command to chunker
>>>>  chunker: FATAL error [bad command after RQ-MORE-DISK: "QUIT"]
>>>>  chunker: FATAL error [bad command after RQ-MORE-DISK: "QUIT"]
>>>>  chunker: FATAL error [bad command after RQ-MORE-DISK: "QUIT"]
>>>>
>>>>
>>>> STATISTICS:
>>>>                          Total       Full      Incr.
>>>>                        --------   --------   --------
>>>> Estimate Time (hrs:min)    0:08
>>>> Run Time (hrs:min)         1:01
>>>> Dump Time (hrs:min)        1:55       1:40       0:16
>>>> Output Size (meg)        8519.7     7729.7      790.1
>>>> Original Size (meg)     13146.3    11595.5     1550.8
>>>> Avg Compressed Size (%)    64.8       66.7       50.9   (level:#disks
>>>> ...)
>>>> Filesystems Dumped           35         12         23   (1:23)
>>>> Avg Dump Rate (k/s)      1261.0     1323.3      863.1
>>>>
>>>> Tape Time (hrs:min)        0:53       0:44       0:09
>>>> Tape Size (meg)          8521.6     7730.3      791.3
>>>> Tape Used (%)              21.1       19.0        2.1   (level:#disks
>>>> ...)
>>>> Filesystems Taped            35         12         23   (1:23)
>>>>   (level:#chunks ...)
>>>> Chunks Taped                 35         12         23   (1:23)
>>>> Avg Tp Write Rate (k/s)  2724.3     3000.8     1433.6
>>>>
>>>> USAGE BY TAPE:
>>>>  Label           Time      Size      %    Nb    Nc
>>>>  Monthly21       0:53  8726112k   21.1    35    35
>>>>
>>>>
>>>> FAILED AND STRANGE DUMP DETAILS:
>>>>
>>>> /--  wagstaff /usr/local lev 1 FAILED [data write: Connection reset by
>>>> peer]
>>>> sendbackup: start [wagstaff level 1]
>>>> sendbackup: info BACKUP=/usr/sbin/ufsdump
>>>> sendbackup: info RECOVER_CMD=/usr/local/bin/gzip -dc
>>>>
>>>> |/usr/sbin/ufsrestore -f... -
>>>>
>>>> sendbackup: info COMPRESS_SUFFIX=.gz
>>>> sendbackup: info end
>>>>
>>>> |   DUMP: Writing 32 Kilobyte records
>>>> |   DUMP: Date of this level 1 dump: Tue May 08 01:11:26 2007
>>>> |   DUMP: Date of last level 0 dump: Mon Apr 30 23:54:14 2007
>>>> |   DUMP: Dumping /dev/rdsk/c0t0d0s7 (wagstaff:/usr/local) to standard
>>>>
>>>> output.
>>>>
>>>> |   DUMP: Mapping (Pass I) [regular files]
>>>> |   DUMP: Mapping (Pass II) [directories]
>>>> |   DUMP: Mapping (Pass II) [directories]
>>>> |   DUMP: Mapping (Pass II) [directories]
>>>> |   DUMP: Estimated 13585968 blocks (6633.77MB) on 0.10 tapes.
>>>> |   DUMP: Dumping (Pass III) [directories]
>>>> |   DUMP: Dumping (Pass IV) [regular files]
>>>> |   DUMP: 16.49% done, finished in 0:50
>>>> |   DUMP: 28.34% done, finished in 0:57
>>>> |   DUMP: 38.89% done, finished in 1:12
>>>>
>>>> \--------
>>>>
>>>> /--  lollipop /files1 lev 0 FAILED [data write: Connection reset by
>>>> peer] sendbackup: start [lollipop level 0]
>>>> sendbackup: info BACKUP=/usr/sbin/ufsdump
>>>> sendbackup: info RECOVER_CMD=/usr/bin/gzip -dc |/usr/sbin/ufsrestore
>>>> -f... -
>>>> sendbackup: info COMPRESS_SUFFIX=.gz
>>>> sendbackup: info end
>>>>
>>>> |   DUMP: Writing 32 Kilobyte records
>>>> |   DUMP: Date of this level 0 dump: Mon May 07 23:52:56 2007
>>>> |   DUMP: Date of last level 0 dump: the epoch
>>>> |   DUMP: Dumping /dev/rdsk/c0t2d0s2 (lollipop:/files1) to standard
>>>>
>>>> output.
>>>>
>>>> |   DUMP: Mapping (Pass I) [regular files]
>>>> |   DUMP: Mapping (Pass II) [directories]
>>>> |   DUMP: Estimated 34371404 blocks (16782.91MB) on 0.25 tapes.
>>>> |   DUMP: Dumping (Pass III) [directories]
>>>> |   DUMP: Dumping (Pass IV) [regular files]
>>>> |   DUMP: 6.08% done, finished in 2:34
>>>> |   DUMP: 13.06% done, finished in 2:13
>>>> |   DUMP: 19.93% done, finished in 2:00
>>>> |   DUMP: 26.85% done, finished in 1:49
>>>> |   DUMP: 33.93% done, finished in 1:37
>>>> |   DUMP: 41.38% done, finished in 1:25
>>>> |   DUMP: 49.81% done, finished in 1:10
>>>> |   DUMP: 57.36% done, finished in 0:59
>>>> |   DUMP: 60.37% done, finished in 0:59
>>>> |   DUMP: 64.71% done, finished in 0:54
>>>> |   DUMP: 72.54% done, finished in 0:41
>>>> |   DUMP: 78.46% done, finished in 0:32
>>>> |   DUMP: 84.45% done, finished in 0:23
>>>> |   DUMP: 89.91% done, finished in 0:15
>>>> |   DUMP: 96.51% done, finished in 0:05
>>>>
>>>> \--------
>>>>
>>>> /--  helios /files3 lev 1 FAILED [data write: Connection reset by peer]
>>>> sendbackup: start [helios level 1]
>>>> sendbackup: info BACKUP=/bin/tar
>>>> sendbackup: info RECOVER_CMD=/bin/gzip -dc |/bin/tar -f... -
>>>> sendbackup: info COMPRESS_SUFFIX=.gz
>>>> sendbackup: info end
>>>> \--------
>>>>
>>>>
>>>> NOTES:
>>>>  planner: wagstaff /files6/vol/speech7 20070507 0 [dumps too big, 9328
>>>> KB, full dump delayed]
>>>>  planner: oz /files1 20070507 0 [dumps too big, 11716 KB, full dump
>>>> delayed]
>>>>  planner: wagstaff /files6/vol/bvd 20070507 0 [dumps too big, 12856
>>>> KB, full dump delayed]
>>>>  planner: oz / 20070507 0 [dumps too big, 21760 KB, full dump delayed]
>>>>  planner: wagstaff / 20070507 0 [dumps too big, 23105 KB, full dump
>>>> delayed]
>>>>  planner: helios / 20070507 0 [dumps too big, 38583 KB, full dump
>>>> delayed]
>>>>  planner: wizard / 20070507 0 [dumps too big, 44165 KB, full dump
>>>> delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/surgery 20070507 0 [dumps too
>>>> big, 91538 KB, full dump delayed]
>>>>  planner: wagstaff /usr 20070507 0 [dumps too big, 112839 KB, full
>>>> dump delayed]
>>>>  planner: lollipop /usr 20070507 0 [dumps too big, 131680 KB, full
>>>> dump delayed]
>>>>  planner: wizard /usr 20070507 0 [dumps too big, 162844 KB, full dump
>>>> delayed]
>>>>  planner: martin /var 20070507 0 [dumps too big, 178985 KB, full dump
>>>> delayed]
>>>>  planner: wizard /var 20070507 0 [dumps too big, 210323 KB, full dump
>>>> delayed]
>>>>  planner: professor /usr/local 20070507 0 [dumps too big, 219910 KB,
>>>> full dump delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/NeuroGen 20070507 0 [dumps
>>>> too big, 247911 KB, full dump delayed]
>>>>  planner: wagstaff /files6/vol/spdata1 20070507 0 [dumps too big,
>>>> 358932 KB, full dump delayed]
>>>>  planner: lollipop /usr/local 20070507 0 [dumps too big, 436518 KB,
>>>> full dump delayed]
>>>>  planner: wizard /usr/local 20070507 0 [dumps too big, 458028 KB, full
>>>> dump delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/immuno 20070507 0 [dumps too
>>>> big, 494376 KB, full dump delayed]
>>>>  planner: oz /usr 20070507 0 [dumps too big, 520179 KB, full dump
>>>> delayed]
>>>>  planner: wagstaff /files6/vol/spdata2 20070507 0 [dumps too big,
>>>> 522063 KB, full dump delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/MID 20070507 0 [dumps too
>>>> big, 522216 KB, full dump delayed]
>>>>  planner: professor /usr 20070507 0 [dumps too big, 565153 KB, full
>>>> dump delayed]
>>>>  planner: helios /usr 20070507 0 [dumps too big, 639655 KB, full dump
>>>> delayed]
>>>>  planner: helios /var 20070507 0 [dumps too big, 1609562 KB, full dump
>>>> delayed]
>>>>  planner: helios /usr/local 20070507 0 [dumps too big, 1930526 KB,
>>>> full dump delayed]
>>>>  planner: helios /files2 20070507 0 [dumps too big, 3326170 KB, full
>>>> dump delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/BioInf 20070507 0 [dumps too
>>>> big, 4663399 KB, full dump delayed]
>>>>  planner: wagstaff /usr/local 20070507 0 [dumps too big, 5087313 KB,
>>>> full dump delayed]
>>>>  planner: snapserver /hd/vol_mnt0/shares/urology 20070507 0 [dumps too
>>>> big, 5786512 KB, full dump delayed]
>>>>  planner: helios /files3 20070507 0 [dumps too big, 8327643 KB, full
>>>> dump delayed]
>>>>  planner: wagstaff /files1 20070507 0 [dumps too big, 11203120 KB,
>>>> full dump delayed]
>>>>  taper: tape Monthly21 kb 8726112 fm 35 [OK]
>>>>
>>>>
>>>> DUMP SUMMARY:
>>>>                                          DUMPER STATS
>>>> TAPER STATS
>>>> HOSTNAME     DISK        L    ORIG-kB     OUT-kB  COMP%  MMM:SS   KB/s
>>>> MMM:SS   KB/s
>>>> -------------------------- -------------------------------------------
>>>> -------------
>>>> helios /             MISSING
>>>> -------------------------------------------------
>>>> helios /files2       MISSING
>>>> -------------------------------------------------
>>>> helios /files3     1 FAILED
>>>> --------------------------------------------------
>>>> helios /usr          MISSING
>>>> -------------------------------------------------
>>>> helios /usr/local    MISSING
>>>> -------------------------------------------------
>>>> helios /var          MISSING
>>>> -------------------------------------------------
>>>> lollipop /             MISSING
>>>> -------------------------------------------------
>>>> lollipop /files1     0 FAILED
>>>> --------------------------------------------------
>>>> lollipop /usr          MISSING
>>>> -------------------------------------------------
>>>> lollipop /usr/local    MISSING
>>>> -------------------------------------------------
>>>> martin /var        1      33930      15808   46.6    0:03 5776.6
>>>> 0:04 4107.7
>>>> oz /           1        290         96   33.1    0:00  404.7   0:01
>>>> 83.3
>>>> oz /files1     1       3640       1344   36.9    0:00 4641.9   0:04
>>>> 304.1
>>>> oz /usr        1      60180      18432   30.6    0:05 3425.3   0:10
>>>> 1822.2
>>>> oz /var        0     796760     339968   42.7    1:34 3632.0   3:12
>>>> 1771.7
>>>> professor.as /           0     104980      46016   43.8    0:22
>>>> 2099.4   0:09 4941.8
>>>> professor.as /usr        1       6590        640    9.7    0:01
>>>> 449.2   0:01  534.7
>>>> professor.as /usr/local  1       1640        192   11.7    0:00
>>>> 422.0   0:09   21.8
>>>> snapserver -res/BioInf 1       2380        416   17.5    0:13   27.5
>>>> 0:05   90.0
>>>> snapserver -shares/MID 1        310         96   31.0    0:04   11.8
>>>> 0:01   83.7
>>>> snapserver -es/MolGene 0         10         64  640.0    0:00   32.8
>>>> 0:03   23.2
>>>> snapserver -s/NeuroGen 1         80         64   80.0    0:01   10.5
>>>> 0:01   56.0
>>>> snapserver -es/TermLab   MISSING
>>>> -------------------------------------------------
>>>> snapserver -ares/admin 0         10         64  640.0    0:00   38.7
>>>> 0:01   45.1
>>>> snapserver -shares/bcl   MISSING
>>>> -------------------------------------------------
>>>> snapserver -es/biochem   MISSING
>>>> -------------------------------------------------
>>>> snapserver -s/confocal   MISSING
>>>> -------------------------------------------------
>>>> snapserver -ares/histo 0         10         64  640.0    0:00   31.3
>>>> 0:01   58.4
>>>> snapserver -res/immuno 1       3400        416   12.2    0:04   87.6
>>>> 0:01  292.0
>>>> snapserver -ares/mcard 0         10         64  640.0    0:00   36.2
>>>> 0:01   54.6
>>>> snapserver -ares/mysql 0         10         64  640.0    0:00   41.4
>>>> 0:02   26.7
>>>> snapserver -s/neurosci 0    4421720    2413152   54.6   12:45 3155.2
>>>> 10:18 3905.8
>>>> snapserver -es/surgery 1         10         64  640.0    0:00   39.4
>>>> 0:07    8.6
>>>> snapserver -s/sysadmin 0       2010       2048  101.9    0:00 6036.8
>>>> 0:03  586.2
>>>> snapserver -es/urology 1     160880     113472   70.5    0:19 5824.1
>>>> 0:50 2275.4
>>>> wagstaff /           1         95         64   67.4    0:04    2.9
>>>> 0:01   56.7
>>>> wagstaff /files1     1     824351     564832   68.5    8:59 1048.6
>>>> 3:52 2436.2
>>>> wagstaff /files3       MISSING
>>>> -------------------------------------------------
>>>> wagstaff /files4       MISSING
>>>> -------------------------------------------------
>>>> wagstaff /files5       MISSING
>>>> -------------------------------------------------
>>>> wagstaff -/Voiceware   MISSING
>>>> -------------------------------------------------
>>>> wagstaff -s6/vol/bvd 1         10         64  640.0    0:00    4.5
>>>> 0:01   56.7
>>>> wagstaff -ol/spdata1 1       1890        416   22.0    0:09   42.3
>>>> 0:01  359.9
>>>> wagstaff -ol/spdata2 1        480        160   33.3    0:02   50.8
>>>> 0:04   36.5
>>>> wagstaff -ol/speech7 1         40         64  160.0    0:00   14.4
>>>> 0:01   57.9
>>>> wagstaff /usr        1         31         64  206.5    0:05    0.2
>>>> 0:01   55.9
>>>> wagstaff /usr/local  1 FAILED
>>>> --------------------------------------------------
>>>> wizard /           1         95         64   67.4    0:04    2.9
>>>> 0:02   35.1
>>>> wizard /files1     0    4576895    4086720   89.3   63:58 1064.8
>>>> 14:28 4706.1
>>>> wizard /files2       MISSING
>>>> -------------------------------------------------
>>>> wizard /usr        1         63         64  101.6    0:06    0.2
>>>> 0:01   58.0
>>>> wizard /usr/local  1     369407      21600    5.8    2:39  135.8
>>>> 2:23  151.2
>>>> wizard /var        1     118239      71840   60.8    2:38  454.1
>>>> 1:22  872.8
>>>> wizard /var/log    0     110271      72032   65.3    2:10  554.3
>>>> 11:28  104.7
>>>> wizard /var/mail   0    1861087     955584   51.3   18:53  843.7
>>>> 4:10 3820.7
>>>>
>>>> (brought to you by Amanda version 2.5.0p2)
>>
>> This version is getting a wee bit long in the tooth but I think I'd look
>> in the logs for timeout messages..  One of the reasons for a lack of reply
>> might be that no one is ATM, running a duplicate of your config as deduced
>> from the limited info posted.  More details might help.



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
The number of computer scientists in a room is inversely proportional
to the number of bugs in their code.