Feature Request Form

Item 1: Implement Deduplication feature
Origin: Corneliu Popescu, corneliu.popescu_at_gmail_dot_com
Date: 8 April 2009
Status: New

What: Detect duplicate files and, instead of backing them up again, store a pointer to the original backed-up file. All of this should be transparent to restore operations.

Why: It would eliminate redundancy and so reduce resource usage during backup: storage space, network bandwidth, and backup window time. It would dramatically improve performance especially when backing up multiple similar systems (end-user workstations, for example).

Notes: The feature could be implemented via a checksum calculation. The checksum could be stored in the database along with the file information and searched for, to identify previously backed-up copies of the same file regardless of original filename and path.

When a file has to be backed up, the backup agent would first calculate the checksum locally. The agent would then send the checksum and other necessary file information (such as filename and path) to the server. The server would search for the checksum in its database to identify a previously backed-up copy of the same file.

If the checksum is not found, the server would back up the file normally, asking the agent to send the actual file content over the network. The server would also store the checksum along with the other file information in the file's new database record, to allow for future searches.

If the checksum is found, the server would treat the file as a duplicate, ask the agent to mark it as already backed up (so that the actual file content is not sent over the network), and continue with the next file. The server would also create a new database record for the duplicate file and store in it a pointer to the database record of the original file, so that the previously backed-up content can be located on a future restore request.

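The backup steps described above could be sketched roughly as follows. This is only an illustration under assumptions, not a proposed implementation: the Catalog class is a hypothetical in-memory stand-in for the server's database, the names FileRecord, agent_checksum and agent_backup are invented for the example, and SHA-256 is assumed as the checksum (the request does not name an algorithm).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class FileRecord:
    record_id: int
    path: str
    checksum: str
    original_id: Optional[int] = None  # set only when the file is a duplicate


class Catalog:
    """Hypothetical in-memory stand-in for the server's database."""

    def __init__(self) -> None:
        self.records: List[FileRecord] = []
        self.by_checksum: Dict[str, int] = {}  # checksum -> record id of the original
        self.content: Dict[int, bytes] = {}    # record id -> stored file content

    def backup(self, path: str, checksum: str,
               fetch: Callable[[], bytes]) -> FileRecord:
        record_id = len(self.records)
        original_id = self.by_checksum.get(checksum)
        record = FileRecord(record_id, path, checksum, original_id)
        if original_id is None:
            # Checksum not seen before: ask the agent for the content
            # and remember the checksum for future searches.
            self.content[record_id] = fetch()
            self.by_checksum[checksum] = record_id
        # For a duplicate, only the new record with its pointer is stored.
        self.records.append(record)
        return record


def agent_checksum(data: bytes) -> str:
    """Checksum computed locally by the agent (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def agent_backup(catalog: Catalog, path: str,
                 data: bytes) -> Tuple[FileRecord, int]:
    """Back up one file; return its record and the content bytes sent."""
    sent: List[int] = []

    def send_content() -> bytes:
        # Called by the server only when it actually needs the content.
        sent.append(len(data))
        return data

    record = catalog.backup(path, agent_checksum(data), send_content)
    return record, sum(sent)
```

Backing up the same content under two different paths with agent_backup would transfer the content only once; the second record carries just a pointer to the first. A real implementation would probably also compare file sizes before declaring a duplicate, since two different files can in principle share a checksum.
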
When a file has to be restored, the server would first determine whether the file is a duplicate. If it is not a duplicate, the restore would continue normally. If it is a duplicate, the server would follow the pointer to the original file, locate the original file's content, and push it to the agent as the duplicate file.

Since implementing this feature would make it more likely that a restore has to read files from different media, some additional optimization of the restore job might be necessary (restoring all files stored on the same media at once, independent of the order of the files in the restore request). Appropriate backup media management would also be required, to make as sure as possible that the duplicated file content still exists and can be accessed by a future restore job.

The backup admin should be able to enable or disable this feature per job, as needed.
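
The restore-side lookup and the per-media grouping mentioned above could look something like the sketch below. It is a minimal illustration under assumptions: FileRecord, resolve and plan_restore are hypothetical names, the records dictionary stands in for the server's database, and the volume field is an assumed way of recording which media holds a file's content.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileRecord:
    record_id: int
    path: str
    volume: Optional[str] = None       # media holding the content (originals only)
    original_id: Optional[int] = None  # pointer to the original (duplicates only)


def resolve(records: Dict[int, FileRecord], record_id: int) -> FileRecord:
    """Return the record whose content should be read for this file.

    Assumes a duplicate always points directly at an original,
    so a single hop is enough.
    """
    record = records[record_id]
    if record.original_id is not None:
        record = records[record.original_id]
    return record


def plan_restore(records: Dict[int, FileRecord],
                 wanted: List[int]) -> Dict[str, List[Tuple[str, int]]]:
    """Group requested files by volume so each volume is read in one pass.

    Each entry is (path to restore, record id that holds the content).
    """
    plan: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for record_id in wanted:
        source = resolve(records, record_id)
        plan[source.volume].append((records[record_id].path, source.record_id))
    return dict(plan)
```

With such a plan, a duplicate is restored under its own path from the original's content, and all files whose content lives on the same volume are handled together regardless of their position in the restore request.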