Greetings all;
I rebooted to 2.6.27.6 yesterday, from 2.6.28-rc4, and although the ls -l
of /dev/sd* does not change, amanda thought the whole system needed a level 0
disguised as a level 1.
However, there now appears to be no recovery. With 10 chucks of 1GB in the
holding disk from /usr/movies, it flushes them when I issue a flush, but
fails with a holding disk read error, and leaves all 10 chunks sitting in
the holding disk.
This leads me to believe that in addition to its ignoring the
define application-tool app_amgtar {
comment "amgtar"
plugin "amgtar"
#property "GNUTAR-PATH" "/path/to/gtar"
#property "GNUTAR-LISTDIR" "/path/to/gnutar_list_dir"
#default from gnutar_list_dir setting in amanda-client.conf
#property "ONE-FILE-SYSTEM" "yes" #use '--one-file-system' option
#property "SPARSE" "yes" #use '--sparse' option
#property "ATIME-PRESERVE" "yes" #use '--atime-preserve=system' option
property "CHECK-DEVICE" "no" #use '--no-check-device' if set to "no"
}
property option CHECK-DEVICE in my amanda.conf according to
the /tmp/amanda-dbg/amgtar logs, it also comes to mind that amanda may have
problems handling a case where the chunk number appended needs to go past a
single digit. Here is an ls -l of that holding disk after a flush:
total 10419428
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:32 coyote._usr_movies.1
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.1
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.2
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.3
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:35 coyote._usr_movies.1.4
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:35 coyote._usr_movies.1.5
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.6
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.7
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.8
-rw------- 1 amanda amanda 995323904 2008-11-16 01:37 coyote._usr_movies.1.9
But I see the last, .9 chunk is not quite a full gig. So that idea wasn't
triggered in this case. Or was it? Was it trying to complete the .9 and
create a .10 and couldn't get past the 9?
From the emailed backup report:
coyote /usr/movies lev 1: partial taper: Error reading holding disk
'/dumps/20081116012505/coyote._usr_movies.1': Input/output
error'
coyote /usr/movies lev 1: partial taper: No space left on device
STATISTICS:
Total Full Incr.
-------- -------- --------
Estimate Time (hrs:min) 0:07
Run Time (hrs:min) 0:31
Dump Time (hrs:min) 0:21 0:11 0:10
Output Size (meg) 13475.8 2448.8 11027.0
Original Size (meg) 17684.3 5938.3 11746.0
Avg Compressed Size (%) 42.3 38.9 54.5 (level:#disks ...)
Filesystems Dumped 6 3 3 (1:2 9:1)
Avg Dump Rate (k/s) 10967.8 3713.7 19370.2
Tape Time (hrs:min) 0:06 0:01 0:05
Tape Size (meg) 13399.7 2448.8 10950.9
Tape Used (%) 100.1 18.3 81.8 (level:#disks ...)
Filesystems Taped 7 3 4 (1:3 9:1)
(level:#chunks ...)
Chunks Taped 7 3 4 (1:3 9:1)
Avg Tp Write Rate (k/s) 36330.6 34053.0 36882.2
USAGE BY TAPE:
Label Time Size % Nb Nc
Dailys-17 0:06 13721325k 100.1 7 7
FAILED DUMP DETAILS:
/-- coyote /usr/movies lev 1 FAILED [data write: Broken pipe]
sendbackup: info BACKUP=APPLICATION
sendbackup: info APPLICATION=amgtar
sendbackup: info RECOVER_CMD=amgtar -f... -
sendbackup: info end
\--------
NOTES:
planner: Incremental of coyote:/etc bumped to level 3.
planner: coyote /GenesAmandaHelper-0.6 20081116012505 0 "[dumps too big,
4690472 KB, full dump delayed]"
planner: Full dump of coyote:/usr/kerberos promoted from 3 days ahead.
planner: Full dump of coyote:/usr/brlcad promoted from 5 days ahead.
driver: coyote /usr/movies 20081116012505 1 [Will retry dump because of
holding disk error: Error reading holding
disk '/dumps/20081116012505/coyote._usr_movies.1': Input/output error']
taper: Will request retry of failed split part.
driver: Out of tapes; going into degraded mode.
And
DUMP SUMMARY:
DUMPER STATS
TAPER STATS
HOSTNAME DISK L ORIG-kB OUT-kB COMP% MMM:SS KB/s
MMM:SS KB/s
coyote /usr/movies 1 10408860 10408860 -- 4:5 34993.5
2:3 37691.5 TAPE-PARTIAL
The available holding disk runs to about 100GB on any given day, and reserve
is set to 25%. I just cat'd that file named above to stdout, took several \
minutes and this showed up in the inbox:
This email was generated by the smartd daemon running on:
host name: coyote.coyote.den
DNS domain: coyote.den
NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/sda, 1 Currently unreadable (pending) sectors
For details see host's SYSLOG (default: /var/log/messages).
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.
/var/log/messages: (Its just after logrotation here)
[root@coyote ~]# tail -fn100 /var/log/messages
Nov 16 06:16:26 coyote rsyslogd: [origin software="rsyslogd" swVersion="2.0.2"
x-pid="2752" x-info="http://www.rsyslog.com"]
[x-configInfo udpReception="No" udpPort="514" tcpReception="No" tcpPort="0"]
restart
Nov 16 06:21:26 coyote smartd[4611]: Device: /dev/sda, 1 Currently unreadable
(pending) sectors
Nov 16 06:21:26 coyote smartd[4611]: Device: /dev/sda, 1 Offline uncorrectable
sectors
But the disk is reading ok. That time would seem to correspond with me doing
the flush, which failed. Also, smartctl reports only the last 5 errors, and
they all match this:
=====
Error 36 occurred at disk power-on lifetime: 4782 hours (199 days + 6 hours)
When the command that caused the error occurred, the device was active or
idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 18 c6 ba e0 Error: UNC at LBA = 0x00bac618 = 12240408
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 08 17 c6 ba e0 00 04:19:14.836 READ DMA EXT
27 00 00 00 00 00 e0 00 04:19:14.778 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 04:19:14.775 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 02 04:19:18.900 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 04:19:18.833 READ NATIVE MAX ADDRESS EXT
=====
Power on hours currently 4783, and the disk is reading fine now.
This does seem to indicate that I should mark that sector out,
but I have NDI how if it is not automatic.
Comments anyone, either about the CHECK-DEVICE "no" being ignored, or about this
disk? dd, FWIW, seems to walk right on by that 12240408 block right now.
I have "smartrctl -t long /dev/sda" running on it now. Says its offline, and
will take 163 hours to complete, good grief!
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Be valiant, but not too venturous.
Let thy attire be comely, but not costly.
-- John Lyly
|