[Bacula-users] Feature / Project request - support for file-system / volume / san dedup for file devices

2010-02-09 20:32:45
Subject: [Bacula-users] Feature / Project request - support for file-system / volume / san dedup for file devices
From: Darren Mackay <darrenmackay.lists AT gmail DOT com>
To: bacula-devel AT lists.sourceforge DOT net, bacula-users AT lists.sourceforge DOT net
Date: Wed, 10 Feb 2010 11:29:48 +1000
Item :  Support for file-system / volume / san dedup for file devices

Date:   10 Feb 2010

Origin: Darren Mackay (Velitium)

Status:

What:   File devices should support the block-based deduplication provided by the underlying file-system / volume manager / SAN.

Why:    A number of file-systems / volume managers / SANs now provide block-based deduplication. For block-level dedup, deduplication ratios of 3x, 4x, or 5x are not uncommon for unstructured data.

Currently it appears (forgive me and advise if this is actually incorrect, as this is drawn from a number of forum posts) that the bacula storage daemon packs the data-stream back-to-back, which prevents block-based deduplication because the data-stream is not aligned to the blocks defined by the underlying storage device. I have also read several posts indicating that bacula may multiplex data streams, which would further prevent the underlying dedup from being performed.

Allowing for dedup in the underlying file-system / volume / san would also alleviate the need for sysadmins to tune baselines between different hosts which use the same storage daemon file device(s).

Notes:

Based on limited testing, some dedup can be performed, but the number of duplicate blocks detected is limited. For instance, consecutive full backups from a single client machine (approx 200GB, both o/s and unstructured file data), run as a single concurrent job, should have resulted in a significant portion of the backup being detected as duplicate blocks by the underlying storage (OpenSolaris ZFS in this case). However, the actual amount of dedup detected for the 2nd full backup was approx 70k blocks (~ 8.5GB). Subsequent runs of the full backup yielded similar results. Allowing for metadata, I would have expected at least 80% of the full backup to dedup.
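
To put the figures above side by side, a short calculation in Python using only the approximate numbers from this report (the variable names are illustrative):

```python
full_backup_gb = 200.0     # approximate size of one full backup
deduped_gb = 8.5           # approximate data detected as duplicate on the 2nd run
expected_fraction = 0.80   # minimum share expected to dedup

observed_fraction = deduped_gb / full_backup_gb
expected_gb = expected_fraction * full_backup_gb

print(f"observed: {observed_fraction:.2%} of the backup deduplicated")
print(f"expected: at least {expected_gb:.0f} GB deduplicated")
```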

Several levels of dedup support are possible, and could be implemented in a staged approach.

Phase 1 - File device dedup support
- This would allow for dedup between file devices on the same system.
- Add padding at the end of each file to a user-configurable block size.

   DedupBlockSize = 8k (configurable, in bytes)

- If the configuration option is missing, then disable all support for underlying dedup for file devices.
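
The Phase 1 padding rule can be sketched as follows, in Python. This is only an illustration of the arithmetic: parse_size() and padding_needed() are hypothetical helpers, not existing bacula functions, and the "8k" syntax is assumed to mean 8 * 1024 bytes.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '8k' or '8192' to bytes (k/m suffix, base 1024)."""
    units = {"k": 1024, "m": 1024 ** 2}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def padding_needed(length: int, block_size) -> int:
    """Bytes of padding that bring length up to the next block boundary.

    block_size=None models a missing DedupBlockSize option: no padding,
    i.e. dedup support is disabled.
    """
    if not block_size:
        return 0
    return -length % block_size


block_size = parse_size("8k")
for length in (10000, 16384):
    print(length, "bytes ->", padding_needed(length, block_size), "bytes of padding")
```

A file whose length is already a multiple of the block size gets no padding, so the overhead is at most one block minus one byte per file.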

Phase 2 - Autodetection of dedup supported file-systems
- When dedup is provided by the host o/s of the file system device, the storage daemon should detect whether dedup is enabled for the file device location. For Solaris / OpenSolaris ZFS, this value is available through the filesystem extended properties. In this case, if dedup is enabled for the ZFS filesystem, the storage daemon should read the filesystem block size and use this value. (note - ZFS also uses variable block sizes, and thus will only allocate the required size if the requirement is less than the actual block size)
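
A rough sketch of the Phase 2 detection, in Python, assuming the ZFS dataset name for the file device location is already known and that the block size corresponds to the ZFS recordsize property. The function names are hypothetical; the query uses the zfs get command with -H (no header, tab-separated) and -p (exact numeric values).

```python
import subprocess


def parse_zfs_properties(output: str) -> dict:
    """Parse 'property<TAB>value' lines from: zfs get -H -o property,value"""
    props = {}
    for line in output.splitlines():
        if line.strip():
            name, value = line.split("\t", 1)
            props[name] = value.strip()
    return props


def dedup_block_size(props: dict):
    """Return recordsize in bytes if dedup is enabled, otherwise None."""
    if props.get("dedup", "off") == "off":
        return None
    return int(props["recordsize"])


def query_dataset(dataset: str) -> dict:
    """Ask ZFS for the dedup and recordsize properties of one dataset."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-p", "-o", "property,value",
         "dedup,recordsize", dataset],
        check=True, capture_output=True, text=True)
    return parse_zfs_properties(result.stdout)


# Parsing a sample of the expected output format, so that the example
# runs without a ZFS pool present.
sample = "dedup\ton\nrecordsize\t131072\n"
print(dedup_block_size(parse_zfs_properties(sample)))
```

On a host with ZFS, dedup_block_size(query_dataset(name)) would give the value to use in place of a configured DedupBlockSize, or None when dedup is off for that dataset.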

Phase 3 - Alignment of the datastream to underlying file-system blocks and separation of bacula metadata into separate blocks
- This would allow for underlying storage system deduplication between both bacula file devices and real data stored elsewhere on the file-system / volume / san.
