BackupPC-users

[BackupPC-users] BackupPC recovery from unreliable disk

2011-12-21 21:52:31
Subject: [BackupPC-users] BackupPC recovery from unreliable disk
From: JP Vossen <jp AT jpsdomain DOT org>
To: BackupPC-users AT lists.sourceforge DOT net
Date: Wed, 21 Dec 2011 21:50:29 -0500
I'm running Debian Squeeze stock backuppc-3.1.0-9 on a server and I'm
getting kernel messages [1] and SMART errors [2] about the WD 2TB SATA
disk.  Fine, I RMA'd it and have the new one...  Now what?  I know I can 
either 'dd' or start fresh.  But...


If I start fresh, I know everything will work and be valid, but I
lose my historical backups when I wipe the bad disk and RMA it.


If I 'ddrescue' BAD --> GOOD, I'll worry about the integrity of the
BackupPC store.  As I understand it, the incoming files are hashed and
stored, but the store itself is never checked (true?).  So when I do
backups, if an incoming file hash matches a file already in the store,
the incoming file is "de-duped" and dropped.  But what if the file
actually in the store is corrupt due to the bad disk?

Am I correct?  If so, is there a way to have BackupPC validate that the
files in the pool actually match their hash and weren't mangled by the disk?
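
One way to narrow the question down before copying is to find out which
files the failing disk can no longer read at all. The following is a
minimal sketch, not a BackupPC feature: find_unreadable is a hypothetical
helper, and the directory it is pointed at is whatever the pool path is on
the system in question. It walks a tree, reads every file end to end, and
records any file or directory for which the operating system reports an
error. It only catches hard read errors like the UNC errors in [1]; it does
not detect data that reads back without error but has been altered.

```python
import os


def find_unreadable(root, chunk_size=1 << 20):
    """Return (path, error) pairs for entries under root that cannot be read."""
    bad = []

    def on_walk_error(exc):
        # os.walk reports directories it cannot list through this callback.
        bad.append((exc.filename, str(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Read the whole file in chunks; the data is discarded,
                    # only the success or failure of each read matters.
                    while fh.read(chunk_size):
                        pass
            except OSError as exc:
                bad.append((path, str(exc)))
    return bad
```

Calling find_unreadable on the pool directory of the old disk (mounted
read-only) returns a list that can be printed or saved. Backups that
reference the listed files are the ones to treat as suspect after a
ddrescue copy.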


Any other solution I'm missing?

Thanks,
JP
___________________________________________
[1] Example kernel errors:

Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
kernel: [4020993.728571] end_request: I/O error, dev sda, sector 81203507
kernel: [4021009.712952] end_request: I/O error, dev sda, sector 81203507

System Events
=-=-=-=-=-=-=
kernel: [4020983.471256] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: [4020983.471290] ata3.00: BMDMA stat 0x25
kernel: [4020983.471315] ata3.00: failed command: READ DMA
kernel: [4020983.471347] ata3.00: cmd c8/00:18:33:11:d7/00:00:00:00:00/e4 tag 0 dma 12288 in
kernel: [4020983.471351]          res 51/40:07:33:11:d7/40:00:28:00:00/e4 Emask 0x9 (media error)
kernel: [4020983.471424] ata3.00: status: { DRDY ERR }
kernel: [4020983.471446] ata3.00: error: { UNC }
kernel: [4020983.501157] ata3.00: configured for UDMA/133


[2] Example SMART error:

Error 1704 occurred at disk power-on lifetime: 10149 hours (422 days + 21 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   40 51 40 45 66 01 e0  Error: UNC 64 sectors at LBA = 0x00016645 = 91717

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   c8 00 40 3f 66 01 e0 08  46d+13:36:50.242  READ DMA
   ec 00 00 00 00 00 a0 08  46d+13:36:50.233  IDENTIFY DEVICE
   ef 03 46 00 00 00 a0 08  46d+13:36:50.225  SET FEATURES [Set transfer mode]
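
The register dumps in both excerpts encode the failing sector. The
following is a small sketch of the arithmetic, assuming 28-bit LBA
addressing (READ DMA, opcode c8, is an LBA28 command) and assuming the
usual register order of low, mid, and high LBA bytes followed by the
device register. lba28 is a hypothetical helper name. The results agree
with the LBA printed in the SMART entry and with the sector number in the
kernel I/O error in [1].

```python
def lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    sn, cl, ch are LBA bits 0-7, 8-15 and 16-23; the low nibble of the
    device register dh holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# SMART entry: SN=45 CL=66 CH=01 DH=e0
print(lba28(0x45, 0x66, 0x01, 0xE0))  # 91717

# Kernel "res" line: ...:33:11:d7/.../e4
print(lba28(0x33, 0x11, 0xD7, 0xE4))  # 81203507
```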

----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.