Feature Request Form

Item 1:   Implement Deduplication feature
  Origin: Corneliu Popescu, corneliu.popescu_at_gmail_dot_com
  Date:   8 April 2009
  Status: New

  What:   Detect duplicate files and, instead of backing up each duplicate,
          store a pointer to the original backed-up file. Make all of this
          transparent to restore operations.


  Why:    It would eliminate redundancy and thereby reduce resource usage
          during backup: storage space, network bandwidth, and backup
          window time. The gain would be largest when backing up multiple
          similar systems (end-user workstations, for example), where many
          files are identical across machines.

  Notes:  The feature could be implemented via a checksum calculation. The
          checksum could be stored in the database along with the file
          information and searched for to identify previously backed-up
          copies of the same file, regardless of original filename and path.
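
          As an illustration only, the checksum step might look like the
          Python sketch below. The request does not name an algorithm, so
          SHA-256 is an assumption, and file_checksum and demo are
          hypothetical names. The demo shows that two files with the same
          content get the same checksum even though name and path differ.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    SHA-256 is an assumption; the request only says "checksum".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Checksum three temporary files; two of them share their content."""
    contents = {
        "report.txt": b"same bytes",
        "copy/of-report.bak": b"same bytes",
        "other.txt": b"different bytes",
    }
    sums = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in contents.items():
            path = os.path.join(tmp, name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as handle:
                handle.write(data)
            sums[name] = file_checksum(path)
    return sums
```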

          When backing up a file, the backup agent would first compute the
          checksum locally. The agent would then send the checksum and other
          necessary file information (such as filename and path) to the
          server. The server would search for the checksum in its database to
          identify a previously backed-up copy of the same file. If none is
          found, the server would back up the file normally, asking the
          agent to send the actual file content over the network. Additionally,
          the server would store the checksum along with the other file 
          information in the file's new database record, to allow for future 
          searches. If a match is found, the server would treat the file as a
          duplicate and ask the agent to mark it as already backed up (so the
          actual file content is not sent over the network) and to continue
          with the next file. Additionally, the server would create a new
          database record for the duplicate file and store in it a pointer
          to the database record of the original file, so that the previously
          backed-up content can be located on a future restore request.
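
          The server-side decision described above could be sketched as
          follows. This is a minimal in-memory model, not a proposal for the
          real catalog schema: FileRecord, Catalog, register, and the two
          reply strings are all hypothetical, and equal checksums are
          treated as equal content.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileRecord:
    record_id: int
    path: str
    checksum: str
    # Record id of the original file; None when this record is an original.
    original_id: Optional[int] = None


@dataclass
class Catalog:
    records: List[FileRecord] = field(default_factory=list)
    by_checksum: Dict[str, int] = field(default_factory=dict)

    def register(self, path: str, checksum: str) -> str:
        """Create a record for the file and tell the agent what to do next."""
        record_id = len(self.records)
        original_id = self.by_checksum.get(checksum)
        # Every file gets its own record, duplicate or not.
        self.records.append(FileRecord(record_id, path, checksum, original_id))
        if original_id is None:
            # First time this checksum is seen: index it for future searches
            # and ask the agent for the actual content.
            self.by_checksum[checksum] = record_id
            return "SEND_CONTENT"
        # Known checksum: the record only points at the original.
        return "SKIP_DUPLICATE"
```

          For example, registering "/home/a/doc.txt" and then
          "/home/b/copy.txt" with the same checksum yields "SEND_CONTENT"
          for the first call and "SKIP_DUPLICATE" for the second, and the
          second record's original_id refers to the first record.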

          When restoring a file, the server would first determine whether
          the file is a duplicate. If it is not, the restore would proceed
          normally. If it is, the server would follow the pointer to the
          original file, locate the original's content, and send it to the
          agent as the content of the duplicate file.
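
          A rough sketch of that restore lookup, assuming each database
          record carries an optional pointer to its original. The dictionary
          layout and the names restore_content, records, and stored_content
          are hypothetical and only serve the illustration.

```python
def restore_content(record_id, records, stored_content):
    """Return (restore_path, content) for the requested record.

    A duplicate has no stored content of its own, so its content is read
    from the original record, but it is restored under its own path.
    """
    record = records[record_id]
    original_id = record["original_id"]
    source_id = record_id if original_id is None else original_id
    return record["path"], stored_content[source_id]


# Record 1 is an original; record 2 is a duplicate pointing at record 1.
records = {
    1: {"path": "/home/a/doc.txt", "original_id": None},
    2: {"path": "/home/b/copy.txt", "original_id": 1},
}
# Only originals have content on backup media.
stored_content = {1: b"file bytes"}

original = restore_content(1, records, stored_content)
duplicate = restore_content(2, records, stored_content)
```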

          Since this feature would make it more likely that a single restore
          has to read files from several media, some additional
          optimization of the restore job might be necessary (for example,
          restoring all files stored on the same medium in one pass,
          independent of the order in which the files were requested). Also,
          appropriate backup media management would be required to ensure,
          as far as possible, that the content of an original file still
          exists and can be accessed by a future restore job.
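
          One possible form of the restore optimization is to group the
          requested files by the medium that holds their content before
          reading anything. The sketch below assumes the server already
          knows the medium for each file; group_by_medium and the volume
          names are hypothetical.

```python
def group_by_medium(requests):
    """Group restore requests by medium.

    `requests` is an iterable of (path, medium) pairs in the order the
    files were requested. The result maps each medium to the paths stored
    on it, so every medium has to be mounted and read only once.
    """
    plan = {}
    for path, medium in requests:
        plan.setdefault(medium, []).append(path)
    return plan


requests = [
    ("/home/a/doc.txt", "Vol-2"),
    ("/home/b/notes.txt", "Vol-1"),
    ("/home/c/copy.txt", "Vol-2"),
]
plan = group_by_medium(requests)
```

          Here both files on "Vol-2" end up in one group even though a file
          from "Vol-1" was requested between them.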

          The backup administrator should be able to enable or disable this
          feature per job, as needed.