Amanda-Users

CHECK-DEVICE "no" still issues a yes.

2008-11-16 07:11:55
Subject: CHECK-DEVICE "no" still issues a yes.
From: Gene Heskett <gene.heskett AT verizon DOT net>
To: AMANDA users <amanda-users AT amanda DOT org>
Date: Sun, 16 Nov 2008 07:05:01 -0500
Greetings all;

I rebooted to 2.6.27.6 yesterday, from 2.6.28-rc4, and although the ls -l 
of /dev/sd* does not change, amanda thought the whole system needed a level 0 
disguised as a level 1.

However, there now appears to be no recovery. With 10 chucks of 1GB in the 
holding disk from /usr/movies, it flushes them when I issue a flush, but 
fails with a holding disk read error, and leaves all 10 chunks sitting in 
the holding disk.

This leads me to believe that in addition to its ignoring the

define application-tool app_amgtar {
     comment "amgtar"
     plugin  "amgtar"
     #property "GNUTAR-PATH" "/path/to/gtar"
     #property "GNUTAR-LISTDIR" "/path/to/gnutar_list_dir"
     #default from gnutar_list_dir setting in amanda-client.conf
     #property "ONE-FILE-SYSTEM" "yes"  #use '--one-file-system' option
     #property "SPARSE" "yes"           #use '--sparse' option
     #property "ATIME-PRESERVE" "yes"   #use '--atime-preserve=system' option
     property "CHECK-DEVICE" "no"       #use '--no-check-device' if set to "no"
}

property option CHECK-DEVICE in my amanda.conf according to 
the /tmp/amanda-dbg/amgtar logs, it also comes to mind that amanda may have
problems handling a case where the chunk number appended needs to go past a 
single digit.  Here is an ls -l of that holding disk after a flush:
total 10419428
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:32 coyote._usr_movies.1
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.1
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.2
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:33 coyote._usr_movies.1.3
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:35 coyote._usr_movies.1.4
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:35 coyote._usr_movies.1.5
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.6
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.7
-rw------- 1 amanda amanda 1073741824 2008-11-16 01:36 coyote._usr_movies.1.8
-rw------- 1 amanda amanda  995323904 2008-11-16 01:37 coyote._usr_movies.1.9

But I see the last, .9 chunk is not quite a full gig. So that idea wasn't 
triggered in this case.  Or was it?  Was it trying to complete the .9 and
create a .10 and couldn't get past the 9?

From the emailed backup report:
   coyote /usr/movies lev 1: partial taper:  Error reading holding disk 
'/dumps/20081116012505/coyote._usr_movies.1': Input/output 
error'
   coyote /usr/movies lev 1: partial taper:  No space left on device

STATISTICS:
                          Total       Full      Incr.
                        --------   --------   --------
Estimate Time (hrs:min)    0:07
Run Time (hrs:min)         0:31
Dump Time (hrs:min)        0:21       0:11       0:10
Output Size (meg)       13475.8     2448.8    11027.0
Original Size (meg)     17684.3     5938.3    11746.0
Avg Compressed Size (%)    42.3       38.9       54.5   (level:#disks ...)
Filesystems Dumped            6          3          3   (1:2 9:1)
Avg Dump Rate (k/s)     10967.8     3713.7    19370.2

Tape Time (hrs:min)        0:06       0:01       0:05
Tape Size (meg)         13399.7     2448.8    10950.9
Tape Used (%)             100.1       18.3       81.8   (level:#disks ...)
Filesystems Taped             7          3          4   (1:3 9:1)
   (level:#chunks ...)
Chunks Taped                  7          3          4   (1:3 9:1)
Avg Tp Write Rate (k/s) 36330.6    34053.0    36882.2

USAGE BY TAPE:
  Label           Time      Size      %    Nb    Nc
  Dailys-17       0:06 13721325k  100.1     7     7

FAILED DUMP DETAILS:

/--  coyote /usr/movies lev 1 FAILED [data write: Broken pipe]
sendbackup: info BACKUP=APPLICATION
sendbackup: info APPLICATION=amgtar
sendbackup: info RECOVER_CMD=amgtar -f... -
sendbackup: info end
\--------

NOTES:
  planner: Incremental of coyote:/etc bumped to level 3.
  planner: coyote /GenesAmandaHelper-0.6 20081116012505 0 "[dumps too big, 
4690472 KB, full dump delayed]"
  planner: Full dump of coyote:/usr/kerberos promoted from 3 days ahead.
  planner: Full dump of coyote:/usr/brlcad promoted from 5 days ahead.
  driver: coyote /usr/movies 20081116012505 1 [Will retry dump because of 
holding disk error: Error reading holding 
disk '/dumps/20081116012505/coyote._usr_movies.1': Input/output error']
  taper: Will request retry of failed split part.
  driver: Out of tapes; going into degraded mode.

And
DUMP SUMMARY:
                                                 DUMPER STATS              
TAPER STATS 
HOSTNAME DISK                        L ORIG-kB  OUT-kB COMP% MMM:SS    KB/s 
MMM:SS    KB/s
coyote  /usr/movies                 1 10408860 10408860   --    4:5 34993.5   
2:3 37691.5 TAPE-PARTIAL

The available holding disk runs to about 100GB on any given day, and reserve 
is set to 25%.  I just cat'd that file named above to stdout, took several \
minutes and this showed up in the inbox:
This email was generated by the smartd daemon running on:

   host name: coyote.coyote.den
  DNS domain: coyote.den
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda, 1 Currently unreadable (pending) sectors

For details see host's SYSLOG (default: /var/log/messages).

You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

/var/log/messages:  (Its just after logrotation here)
[root@coyote ~]# tail -fn100 /var/log/messages
Nov 16 06:16:26 coyote rsyslogd: [origin software="rsyslogd" swVersion="2.0.2" 
x-pid="2752" x-info="http://www.rsyslog.com";]
[x-configInfo udpReception="No" udpPort="514" tcpReception="No" tcpPort="0"] 
restart
Nov 16 06:21:26 coyote smartd[4611]: Device: /dev/sda, 1 Currently unreadable 
(pending) sectors
Nov 16 06:21:26 coyote smartd[4611]: Device: /dev/sda, 1 Offline uncorrectable 
sectors

But the disk is reading ok.  That time would seem to correspond with me doing 
the flush, which failed.  Also, smartctl reports only the last 5 errors, and
they all match this:
=====
Error 36 occurred at disk power-on lifetime: 4782 hours (199 days + 6 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 c6 ba e0  Error: UNC at LBA = 0x00bac618 = 12240408

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 17 c6 ba e0 00      04:19:14.836  READ DMA EXT
  27 00 00 00 00 00 e0 00      04:19:14.778  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02      04:19:14.775  IDENTIFY DEVICE
  ef 03 45 00 00 00 a0 02      04:19:18.900  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:19:18.833  READ NATIVE MAX ADDRESS EXT
=====
Power on hours currently 4783, and the disk is reading fine now.
This does seem to indicate that I should mark that sector out,
but I have NDI how if it is not automatic.

Comments anyone, either about the CHECK-DEVICE "no" being ignored, or about this
disk?  dd, FWIW, seems to walk right on by that 12240408 block right now.

I have "smartrctl -t long /dev/sda" running on it now. Says its offline, and 
will take 163 hours to complete, good grief!

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Be valiant, but not too venturous.
Let thy attire be comely, but not costly.
                -- John Lyly

<Prev in Thread] Current Thread [Next in Thread>