Re: [Bacula-users] [Bacula-devel] Feature / Project request - support for file-system / volume / san dedup for file devices

2010-02-10 02:59:14
Subject: Re: [Bacula-users] [Bacula-devel] Feature / Project request - support for file-system / volume / san dedup for file devices
From: Kern Sibbald <kern AT sibbald DOT com>
To: bacula-devel AT lists.sourceforge DOT net
Date: Wed, 10 Feb 2010 08:56:12 +0100
Hello,

Thanks for your Feature request.

Concerning deduplication: 
- currently Bacula has file level deduplication implemented in version 5.0.0.  
It is done in a novel way (at least I have not seen it in any other product) 
that permits the sys admin to optimize deduplication at the file level. 
- Bacula Systems has a research and development project on filesystem block 
level deduplication (quite different from your suggestion) that is showing a 
lot of promise.
- the Bacula project is just now discussing a new deduplication scheme based 
on partial file deduplication using sliding blocks (not at all the same as 
filesystem block based deduplication).
- It is possible that, as part of the above mentioned partial file 
deduplication project, we will design a new Volume format, but this is 
not a strict requirement for doing partial file deduplication.
- Bacula's current Volume format is not designed to handle filesystem block 
level deduplication, so any kind of project that attempts to "align" data in 
the current Bacula Volume format based on filesystem blocks, then do 
deduplication is, in my opinion, doomed to failure.

So, there are at least three different major techniques that can be used 
for deduplication:

1. File level deduplication, which Bacula has and which works with files 
backed up to tape as well as disk using Bacula's current Volume format.

2. Filesystem level block deduplication (using snapshot technology).

3. Individual file block deduplication (there are multiple variations of this 
technique).

Item 1 is already implemented in Bacula version 5.0.0.

Item 2 has been successfully demonstrated internally by Bacula Systems for 
Linux systems and we are working on the same for Windows systems -- we hope 
this will be ready for release in 3Q2010, with testing possibly sooner. To 
totally automate it we may need some extensions to the current Volume format, 
but they are rather minor and do not require any Volume design changes.

Item 3 is a new project, only in the early discussion phase on the 
bacula-devel list.  It looks very promising but needs a lot of work.
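
To illustrate what "sliding block" deduplication can mean in general, here 
is a minimal sketch of one common variation, content-defined chunking. It is 
not the design under discussion on bacula-devel; the gear table, the 12-bit 
mask, and all function names are assumptions made for the example. Chunk 
boundaries are chosen from the data itself, so a small insertion changes only 
nearby chunks, whereas fixed-size blocks all shift.

```python
import hashlib
import random

# Assumed parameters for the sketch: a 256-entry table of random 64-bit
# values and a 12-bit mask (a boundary roughly every 4 KiB on average).
_table_rng = random.Random(1234)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK = (1 << 12) - 1


def cdc_chunks(data):
    """Split data where a rolling (gear) hash of recent bytes hits the mask."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out


def fixed_chunks(data, size=4096):
    """Split data into fixed-size blocks, as a block-aligned scheme would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old_chunks, new_chunks):
    """Fraction of the new chunks whose content already exists in the old set."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    return hits / len(new_chunks)


def demo():
    """Compare both schemes on a file that gains a few bytes near the start."""
    n = 400_000
    old = random.Random(7).getrandbits(8 * n).to_bytes(n, "little")
    new = old[:1000] + b"inserted bytes" + old[1000:]
    fixed = shared_fraction(fixed_chunks(old), fixed_chunks(new))
    sliding = shared_fraction(cdc_chunks(old), cdc_chunks(new))
    return fixed, sliding
```

Calling demo() returns the fraction of chunks of the modified file that are 
already stored, first for fixed blocks and then for content-defined chunks; 
the second value should be much higher because the boundaries resynchronize 
shortly after the inserted bytes.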

With all the above, I do not think that it is yet time to discuss changing the 
Bacula Volume format (though a new (second) Volume format is one of the 
options I am considering for item 3).

Best regards,

Kern



On Wednesday 10 February 2010 02:29:48 Darren Mackay wrote:
> Item :  Support for file-system / volume / san dedup for file devices
>
> Date:   10 Feb 2010
>
> Origin: Darren Mackay (Velitium)
>
> Status:
>
> What:   File devices should provide support for block based deduplication
> provided by the underlying file-systems / volume manager / san.
>
> Why:    A number of file-systems / volume managers / sans now provide block
> based deduplication. For block level dedup, it is not uncommon for
> deduplication ratios to be 3x, 4x, or 5x for unstructured data.
>
> Currently it appears (forgive me and advise if this is actually incorrect,
> as this is drawn from a number of forum posts) that the bacula storage
> daemon is packing the data-stream back-2-back, which prevents block based
> deduplication as the data-stream is not aligned to blocks as defined by the
> underlying storage device. I have also read several posts that indicate
> that bacula may multiplex data streams, which in the case of underlying
> dedup, would further prevent dedup from being performed.
>
> Allowing for dedup in the underlying file-system / volume / san would also
> alleviate the need for sysadmins to tune baselines between different hosts
> which use the same storage daemon file device(s).
>
> Notes:
>
> Based on limited testing, some dedup can be performed, but the number
> of duplicate blocks detected is limited. For instance, consecutive full
> backups from a single client machine (approx 200GB, both o/s and
> unstructured file data) for only a single concurrent job should have
> resulted in a significant portion of the backup being detected as duplicate
> blocks by the underlying storage (OpenSolaris ZFS in this case), however,
> the actual amount of dedup detected for the 2nd full backup was approx 70k
> blocks (~ 8.5GB). Subsequent runs of the full backup yielded similar
> results. Allowing for metadata, I would have expected at least 80% of the
> full backup to dedup.
>
> Several levels of dedup support are possible, which could be implemented
> in a staged approach.
>
> Phase 1 - File device dedup support
> - This would allow for dedup between file devices on the same system.
> - Add padding at the end of each file to a user configurable block size.
>
>    DedupBlockSize = 8k (configurable, in bytes)
>
> - If the configuration option is missing, then disable all support for
> underlying dedup for file devices.
>
> Phase 2 - Autodetection of dedup supported file-systems
> - When dedup is provided by the host o/s of the file system device, the
> storage daemon should detect if dedup is enabled for the file device
> location. For Solaris / OpenSolaris ZFS, this value is available through
> the filesystem extended properties. In this case, if dedup is enabled for
> the ZFS filesystem, the storage daemon should read the filesystem block
> size and use this value. (note - ZFS also uses variable block sizes, and
> thus will only allocate the required size if the requirement is less than
> the actual block size)
>
> Phase 3 - Alignment of the datastream to underlying file-system blocks and
> separation of bacula metadata into separate blocks
> - This would allow for underlying storage system deduplication between both
> bacula file devices and real data stored elsewhere on the file-system /
> volume / san.
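
The alignment problem described in the request, and the Phase 1 padding 
idea, can be illustrated with a small sketch. It is a toy model only and does 
not use Bacula's actual Volume format: the record layout, the header strings, 
and every function name are assumptions, and BLOCK stands in for the proposed 
(hypothetical) DedupBlockSize option. Two "full backups" of the same files are 
packed into volume images, once back to back and once padded to block 
boundaries, and then compared block by block as a block-level dedup engine 
would.

```python
import hashlib
import random

BLOCK = 8192  # stands in for the proposed DedupBlockSize, in bytes


def pad(data, block=BLOCK):
    """Zero-pad data up to the next multiple of the block size."""
    return data + b"\0" * (-len(data) % block)


def pack(files, header, aligned):
    """Build a toy volume image from (header + file data) records.

    aligned=False packs records back to back; aligned=True pads every
    header and every file, so file data always starts on a block boundary.
    """
    out = bytearray()
    for data in files:
        out += (pad(header) + pad(data)) if aligned else (header + data)
    return bytes(out)


def blocks(volume, block=BLOCK):
    """Split a volume image into the fixed-size blocks the storage sees."""
    return [volume[i:i + block] for i in range(0, len(volume), block)]


def shared_fraction(first, second):
    """Fraction of the second volume's blocks already present in the first."""
    seen = {hashlib.sha256(b).digest() for b in blocks(first)}
    later = blocks(second)
    return sum(hashlib.sha256(b).digest() in seen for b in later) / len(later)


def demo():
    """Return (unaligned, aligned) dedup fractions for two identical backups."""
    rng = random.Random(0)
    sizes = [rng.randrange(20000, 60000) for _ in range(20)]
    files = [rng.getrandbits(8 * n).to_bytes(n, "little") for n in sizes]
    # Assumption: per-record metadata differs in length between the two jobs,
    # which shifts all following data relative to the block grid.
    return tuple(
        shared_fraction(pack(files, b"job=1;", a), pack(files, b"job=22;", a))
        for a in (False, True)
    )
```

In this model the back-to-back layout shares essentially no blocks between the 
two backups, because a one-byte difference in metadata shifts everything after 
it, while the padded layout shares every file-data block and loses only the 
metadata blocks. The cost of the padding is up to one block of zeros per file 
and per header.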


