Subject: Re: [BackupPC-users] Backing up a BackupPC server
From: "Jeffrey J. Kosowsky" <backuppc AT kosowsky DOT org>
To: "General list for user discussion, questions and support" <backuppc-users AT lists.sourceforge DOT net>
Date: Tue, 02 Jun 2009 14:55:42 -0400
Les Mikesell wrote at about 13:16:05 -0500 on Tuesday, June 2, 2009:
 > Jeffrey J. Kosowsky wrote:
 > > 
 > >  > Do you actually have any experience with large scale databases?  I think
 > >  > most installations that come anywhere near the size and activity of a
 > >  > typical backuppc setup would require a highly experienced DBA to
 > >  > configure and would have to be spread across many disks to have adequate
 > >  > performance.
 > > 
 > > I am by no means a database expert, but I think you are way
 > > overstating the complexity issues.
 > 
 > I've worked with lots of filesystems and a few databases - and had many 
 > more problems with the databases.  For example, they are not at all 
 > happy or forgiving if you run out of underlying filesystem space.  And 
 > it's not clear how to fix them if they are corrupted by a crash. When 
 > you are dealing with backups you want them to work regardless of other 
 > problems - the time you need them is precisely when you have a bunch of 
 > other problems.

Only the metadata would be stored in the database; the file contents
would stay in the pool on the filesystem.
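
To make that concrete, here's a rough sketch (Python with SQLite) of
what a metadata-only schema might look like. The table and column
names are hypothetical illustrations, not anything from BackupPC:

    import sqlite3

    # Metadata only: file contents stay in the pool on the filesystem,
    # referenced here by the hash that names the pool file.
    conn = sqlite3.connect("backuppc-meta.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS backups (
            backup_num INTEGER PRIMARY KEY,
            host       TEXT NOT NULL,
            started    INTEGER NOT NULL      -- epoch seconds
        );
        CREATE TABLE IF NOT EXISTS files (
            backup_num INTEGER REFERENCES backups(backup_num),
            path       TEXT NOT NULL,        -- path within the share
            mode       INTEGER,
            uid        INTEGER,
            gid        INTEGER,
            size       INTEGER,
            mtime      INTEGER,
            pool_hash  TEXT,                 -- pool file holding contents
            PRIMARY KEY (backup_num, path)
        );
    """)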

 > 
 > > While the initial design would
 > > certainly need someone with experience, I don't know why each
 > > implementation would require a "highly experienced DBA" or why it
 > > "would have to be spread across many disks" any more than a standard
 > > BackupPC implementation. Modern databases are written to hide a lot of
 > > the complexity of optimization.
 > 
 > Modern filesystems optimize file access because they know the related 
 > structures (directories, inodes, free space list).  Databases don't know 
 > what you are going to put in them or how they relate.  They can be tuned 
 > to optimize them for any particular thing but that isn't inherent.

The filesystem would be used to store the files; the database would
store only the metadata. I'm sure just about any modern database
would be far more efficient at storing metadata than a packed flat
file, which is what an attrib file is. Any time you want to access a
file's metadata you have to unpack the attrib file, parse it into a
Perl structure, and then pull out the specific element you want. With
incremental backups you may need to read several attrib files just to
resolve a single file. That can't be more efficient than a
well-implemented relational database lookup. Even worse, any change
to an attrib file requires reading the whole thing in, unpacking and
parsing it, making the change, repacking it, and rewriting it. Again,
that is much, much less efficient than a single database write.
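
As a rough illustration, against the hypothetical schema sketched
above (this is not anything BackupPC actually does):

    import sqlite3

    # Assumes the hypothetical metadata schema from the earlier sketch.
    conn = sqlite3.connect("backuppc-meta.db")

    # One indexed lookup fetches a single file's metadata directly...
    row = conn.execute(
        "SELECT mode, uid, gid, size, mtime, pool_hash"
        " FROM files WHERE backup_num = ? AND path = ?",
        (123, "etc/hosts")).fetchone()

    # ...and one statement changes a single attribute in place, with no
    # read-unpack-edit-repack-rewrite cycle over a whole attrib file.
    with conn:
        conn.execute(
            "UPDATE files SET mtime = ? WHERE backup_num = ? AND path = ?",
            (1243958142, 123, "etc/hosts"))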

 > 
 > > Plus the database is large only in the
 > > sense of having lots of table entries but is otherwise not
 > > particularly complex nor do you have to deal with multiple
 > > simultaneous access queries which is usually the major bottleneck
 > > requiring optimization and performance tuning.
 > 
 > Multiple concurrent writes are the hard part, something backuppc will be 
 > doing all night long.

 > > This seems like a red herring. The disk head motion issue applies
 > > whether the data is stored in a database or in a combination of a
 > > filesystem + attrib files.
 > 
 > Sort of, but the OS, filesystem and buffer cache have years of design 
 > optimization for their specific purpose and they are pretty good at it. 
 >   And unless the database uses the raw device it can only add overhead 
 > to the underlying filesystem access.

Overhead is only bad if it is significant and rate-limiting.

 > 
 > > If anything, storage in a single database
 > > would be more efficient than having to find and individually load (and
 > > unpack) multiple attrib files since the database storage can be
 > > optimized to some degree automagically while even attrib files that
 > > are logically "sequential" could be scattered all over the disk
 > > leading to inefficient head movement.
 > 
 > This is the sort of thing where you need to produce evidence.  I'd 
 > expect the attrib files to be generally optimized with respect to the 
 > locations of the relevant directories that you will be accessing at the 
 > same time because the filesystem knows about these locations when 
 > allocating the space, whereas a database on top of a filesystem has no 
 > idea of where the disk head will be going next.

Well, again, when you access a file you generally need to read in
multiple attrib files across the chain of incremental backups. There
is no way the filesystem can know about those relationships. Also,
since the files are often hard-linked to pre-existing pool files,
there is no reason to think the attrib files are located anywhere
near the pool files.
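
With the metadata in a database, walking that chain collapses into a
single query. Another sketch against the hypothetical schema above,
assuming the newest record at or below the backup being viewed wins
(handling deletions would need tombstone rows, omitted here):

    import sqlite3

    # Assumes the hypothetical metadata schema from the earlier sketch.
    conn = sqlite3.connect("backuppc-meta.db")

    # Resolve one file as seen in backup 123, instead of reading every
    # attrib file from the last full up through backup 123.
    row = conn.execute(
        "SELECT mode, uid, gid, size, mtime, pool_hash"
        " FROM files"
        " WHERE path = ? AND backup_num <= ?"
        " ORDER BY backup_num DESC LIMIT 1",
        ("etc/hosts", 123)).fetchone()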

 > 
 > > Also, the database could be
 > > stored on one disk and the pool on another but this would be difficult
 > > if not impossible to do on BackupPC where the pool, the links, and the
 > > attrib files are all on the same filesystem.
 > 
 > Agreed - if you have a skilled DBA to arrange this.  It's not going to 
 > happen out of the box.

The pool is not stored in the database, so no skilled DBA is needed
to put the database on one disk and the pool on another.

 > 
 > >  >     Also, while some databases do offer remote replication, it isn't 
 > >  > magic either and keeping it working isn't a common skill.
 > >  > 
 > > 
 > > Again a red herring. Just having the ability to temporarily "throttle"
 > > BackupPC leaving the database in a consistent state would allow one to
 > > just simply copy (e.g., rsync) the database and the pool to a backup
 > > device. This copy would be much faster than today's BackupPC because
 > > you wouldn't have the hard link issue. Remote replication would be
 > > even better but not necessary to solve the common issue of copying the
 > > pool raised by so many people on this list.
 > 
 > There's only a small difference in scale here (and it's not obvious 
 > which direction) between rsync'ing a raw database file and rsync'ing an 
 > image copy of a filesystem.  There's probably not much of a practical 
 > difference.

Except that I have a lot of other stuff on my filesystem, so I don't
want to image the whole filesystem; I just want to image the
backups. Also, not all filesystems support efficient methods for
imaging a partially filled filesystem. Again, you are assuming tight
integration between the setup of the filesystem and the backup
software, whereas I want to abstract away any such requirements as
much as possible, even at the expense of some extra overhead.
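
For instance, if the metadata lived in something like SQLite, the
whole copy could look like the sketch below. The paths are
hypothetical, and I'm assuming backup activity has already been
throttled so the pool and metadata are quiescent:

    import sqlite3
    import subprocess

    # Snapshot the metadata database consistently using sqlite3's
    # online backup API.
    src = sqlite3.connect("backuppc-meta.db")
    dst = sqlite3.connect("/mnt/backupdev/backuppc-meta.db")
    src.backup(dst)
    dst.close()
    src.close()

    # Then a plain rsync of the pool: no forest of hard links to
    # reconstruct, because the "links" are rows in the database.
    subprocess.run(["rsync", "-a", "/var/lib/backuppc/pool/",
                    "/mnt/backupdev/pool/"], check=True)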
