Subject: Re: [BackupPC-users] why hard links?
From: dan <dandenson AT gmail DOT com>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Wed, 17 Jun 2009 21:09:50 -0600
I have seen some discussion on other boards about using a database to provide de-duplication the way BackupPC does. It is a love-it-or-hate-it idea. Though it sounds promising, consider this: database-backed email servers are typically outperformed by their file-based counterparts by significant margins. If you want a really high-performance email system, you put user and alias data in the database, because it is good at storing that type of data, and the emails themselves in files.

I do think there would be a performance advantage to splitting some of the file metadata, such as the hash, date, size, and modification time, out into a database and putting just the raw file on the filesystem. The database will outperform the filesystem for accessing that kind of data, and the filesystem will perform better too, because I/O will be reduced. Implementing this would be a lot of work, though, and would only offer small gains. It would be much more profitable to put the effort into improving filesystems or implementing some sort of delayed-write system.
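As a concrete sketch of that split (hypothetical schema and layout, not anything BackupPC actually does): raw content lands on the filesystem keyed by its hash, while the hash, size, and mtime live in a small database.

```python
# Illustrative only: metadata (hash, size, mtime) in a database, raw
# content on the filesystem. Table/column names are made up for this
# sketch and are not BackupPC's actual layout.
import hashlib
import os
import sqlite3
import tempfile

pool_dir = tempfile.mkdtemp()
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pool (hash TEXT PRIMARY KEY, size INTEGER, mtime REAL)")

def store(data: bytes) -> str:
    """Write content to the pool at most once; record metadata in the DB."""
    digest = hashlib.md5(data).hexdigest()
    path = os.path.join(pool_dir, digest)
    if not os.path.exists(path):            # de-duplication by content hash
        with open(path, "wb") as f:
            f.write(data)
    st = os.stat(path)
    db.execute("INSERT OR IGNORE INTO pool VALUES (?, ?, ?)",
               (digest, st.st_size, st.st_mtime))
    return digest

h1 = store(b"hello")
h2 = store(b"hello")                        # duplicate: no second file written
```

The point of the sketch is that the metadata queries ("which pool entries have this hash/size?") never touch the filesystem at all, which is where the claimed I/O reduction would come from.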

On Tue, Jun 2, 2009 at 7:15 PM, Holger Parplies <wbppc AT parplies DOT de> wrote:
Hi,

Jeffrey J. Kosowsky wrote on 2009-06-02 14:26:44 -0400 [Re: [BackupPC-users] why hard links?]:
> Les Mikesell wrote at about 12:32:14 -0500 on Tuesday, June 2, 2009:
>  > Jeffrey J. Kosowsky wrote:
>  > > [...]
>  > >  > If you have to add an extra system call to lock/unlock around some
>  > >  > other operation you'll triple the overhead.
>  > >
>  > > I'm not sure how you definitively get to the number "triple". Maybe
>  > > more maybe less.

I agree. It's probably more.

>  > Ummm, link(), vs. lock(),link(),unlock() equivalents, looks like 3x the
>  > operations to me - and at least the lock/unlock parts have to involve
>  > system calls even if you convert the link operation to something else.
>
> 3x operations != 3x worse performance
> Given that disk seeks times and input bandwidth are typical
> bottlenecks, I'm not particularly worried about the added
> computational bandwidth of lock/unlock.

Since you can't lock() the file you are about to create (can you?), you'll
probably need a different file - either one big global lock file or one on the
directory level. I'm not familiar with the kernel code, but I wouldn't be
surprised if that got you the disk seeks you are worried about.
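For concreteness, a minimal sketch of the lock()/link()/unlock() pattern under discussion, using a separate per-directory lock file since the link target itself cannot be locked before it exists. This is purely illustrative; it is not something BackupPC does, and the lock-file name is made up.

```python
# Illustrative: guard a link() with a separate lock file, since the file
# being created cannot itself be locked yet. Note the two extra system
# calls bracketing the one real operation.
import fcntl
import os
import tempfile

d = tempfile.mkdtemp()
target = os.path.join(d, "poolfile")
open(target, "wb").close()

lockpath = os.path.join(d, ".lock")         # hypothetical per-directory lock
lock = open(lockpath, "w")

fcntl.flock(lock, fcntl.LOCK_EX)            # extra syscall #1
os.link(target, os.path.join(d, "newlink"))  # the operation itself
fcntl.flock(lock, fcntl.LOCK_UN)            # extra syscall #2
lock.close()
```

Whether the lock file also costs a disk seek depends on whether its inode and the directory are already cached, which is exactly the open question above.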

>  > > Les - I'm really not sure why you seem so intent on picking apart a
>  > > database approach.
>  >
>  > I'm not. I'm encouraging you to show that something more than black
>  > magic is involved. [...]
>
> I never claimed performance. My claims have been around flexibility,
> extendability, and transportability.

And I'm worried about complexity and robustness:
1. Complexity (setup and operation)
  What additional skills do you need to set up the BackupPC version you are
  imagining and keep it running?
2. Complexity (development and testing)
  Who is going to write and, more importantly, debug the code? How do you test
  all the new cases that can go wrong? How do people feel about entrusting
  vital data to a system they no longer have a basic understanding of?
3. Complexity (disaster recovery)
  When everything goes wrong, what can you still do with the data? Currently,
  you can locate a file in the file system (file mangling is not that
  complicated) or even with an FS debugging tool in an image of an
  unmountable FS and BackupPC_zcat it to get the contents. Attributes are lost
  that way, but for regaining the contents of a few crucial files, this can
  work quite well. It could be made to even restore the attributes with only
  slightly more requirements (intact attribs file). With a database, can you
  do anything at all without a completely running BackupPC system? What are
  the exact requirements? Database file? Database engine? Accessible pool
  file system?
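As a rough illustration of the "file mangling is not that complicated" point: the commonly described scheme prefixes each path component with "f" and %-escapes awkward characters, so reversing it by hand is a few lines. The details below are an assumption of that scheme, not a copy of BackupPC's actual code, which remains authoritative.

```python
# Sketch of reversing BackupPC-style name mangling: strip the leading
# "f" from a path component and decode %xx escapes. Assumes the scheme
# described above; BackupPC's own mangling code is the reference.
import re

def demangle(component: str) -> str:
    """Undo one mangled path component, e.g. 'f%2fodd' -> '/odd'."""
    assert component.startswith("f"), "mangled components start with 'f'"
    return re.sub(r"%([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  component[1:])
```

The same hand-reversibility is what makes rescuing a file from an unmountable image feasible without a running BackupPC; a database adds at least one more opaque layer to that path.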
4. Robustness, points of failure
  How do you handle losing single files, on-disk corruption of a few files?
  Losing/corrupting many files? Your database?

> I think all (or nearly all) of my 7 claimed advantages are
> self-evident.

Yes, mostly, though they were claimed in a different thread. I hope everyone
has multiple MUAs open ...

1. I don't see how "platform and filesystem independence" fits together with
  the use of a database, though. You are currently dependent on a POSIX file
  system. How is depending on one of a set of databases any better?

4. How does backing up the database and *a portion of the pool* work? Sure,
  you can make anything fault-tolerant, but are missing files faults of which
  you *want* to be tolerant?
  But yes, backing up the complete pool would be easier, though it's your
  responsibility to get it right (i.e. consistent), and there's probably no
  sane way to check.

5.1. Why is file name mangling a kludge, and in what way is storing file names
    in a database better?

5.2. What is non-standard about defining a file format any way you like? It's
    not like compressed pool files would otherwise adhere to a particular
    known file format. But yes, treating compressed and uncompressed files
    alike would be nice.

5.3. I'm not really sure encrypting files *on the server* does much, unless
    you are thinking of a remote storage pool. In particular, you need to be
    able to decrypt files not only for restoration, but also for pooling
    (unless you want an intermediate copy and an extra comparison).

5.5. Configuration stored in the database? Is that supposed to be an
    advantage?

6. If you mean access controlled by the database (different database users),
  I don't really see why you are worried about access to the *meta data* when
  the actual contents remain readable (you're not saying that it being such a
  huge amount of data is a security feature, are you?).
  If you mean that a database will make it easier to implement file level
  access control, I honestly don't see how.

7. How so? If you are less concerned about how much space you use, you can
  store things in a way that they can be accessed faster. But I still think
  you are mistaken in that multiple attrib files would need to be read. I've
  had to read so much discussion on this today that I won't check the code
  now, but I'd reason that for attrib file pooling to make any sense, the
  default would be an identical attrib file (compared to the reference
  backup) if no files in the directory were changed.
  Or, differently, if BackupPC *would* need to scan multiple attrib files,
  your delete-file-from-backups script would only ever need to modify one
  attrib file for any file it deletes, right? ;-)

> Plus, I don't want my backup system to be
> filesystem dependent because I might have other reasons for picking
> other filesystems or my OS of the future (or of today) might not even
> support the filesystem features required.

The same arguments hold against incorporating a database.

> I think good system design calls for abstracting the backup software from
> the underlying filesystem.

Well, the only thing you are abstracting from are hardlinks, which are POSIX
standard. I wouldn't be surprised if there were other POSIX dependencies.
BackupPC currently makes no other assumptions about the file system, does it?
Well, file size maybe - you need a file system capable of storing large enough
files. And long enough paths. I look forward to the introduction of
$Conf{PathSeparator} ...

Regards,
Holger

_______________________________________________
BackupPC-users mailing list
BackupPC-users AT lists.sourceforge DOT net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
