Last night, I did something I should have done a long time ago: reimplementing OS X's Time Machine. For two reasons: (1) it _can_ be surprisingly slow and sometimes gets stuck for no apparent reason, and (2) sorry Apple, but Time Machine just sucks! I have been wondering for many weeks now "why doesn't this piece of shit use content-addressable storage rather than the half-implemented logic it currently uses?" (not to mention how unfriendly its sparsebundle files are, and how fucked up that is).
Besides the speed (or lack thereof), the main problem with Time Machine is that when I rename a folder, it treats the contents of the renamed folder as all new. More precisely, let us assume the following configuration:
```
Pictures/
|- 2013/
|  |- file1.jpg
|  |- file2.jpg
|- 2014/
|  |- file1.jpg
|  |- file3.jpg
```
First of all, assuming that 2013/file1.jpg and 2014/file1.jpg are actually the same file, it won't notice. Moreover, if I add 2014/file4.jpg, leading to...
```
Pictures/
|- 2013/
|  |- file1.jpg
|  |- file2.jpg
|- 2014/
|  |- file1.jpg
|  |- file3.jpg
|  |- file4.jpg
```
Then it will know that 2014/file1.jpg and 2014/file3.jpg are unchanged, but if I now rename '2014' to '2015', it will think that 2015/file1.jpg, 2015/file3.jpg and 2015/file4.jpg are completely new files. This can be a problem if, for instance, 2014 weighed 50 GB, because it means a backup that is stupidly going to consume 50 GB of external hard drive space for... essentially nothing.
Now, we have to be very careful when designing a new backup solution, because your backup solution is not the place where you want to show how clever you are. Your backup solution is what saves your ass the day you have lost your primary data location and need to be up and running without fuss (possibly on a completely different machine than the one you had). You do not want to find yourself in a position where your backup is unusable the day you need it, simply because whatever "program" is needed to access it doesn't work on your new (or borrowed) machine. More precisely, you can be clever when "making" the backup, but you should not need to be clever (or lucky) to "use/consume" it the day you have lost everything but the backup.
My solution is simple. On my external backup drive, I have two folders called Datablocks and Snapshots. Datablocks contains data blocks indexed by their SHA-1 hash, and Snapshots contains the natural, complete file system trees (one per timestamp).
```
BACKUP-DRIVE/
|- Datablocks/
|  |- 00/01/81/sha1-0001816a94d3c43364541e78ce88ec273068bd90
|  |- 00/01/be/sha1-0001beb1400acda0909a32c1cf6bf492f1121e07
|  (etc.)
|- Snapshots/
|  |- 2013-12-01-110611/
|  |  |- Galaxy/
|  |  |  |- Encyclopaedia
|  |  |  |- EnergyGrid
|  (etc.)
```
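The sharded layout above (two hex characters per directory level, so no single directory ever holds millions of entries) can be sketched as a small helper. This is just an illustration of the naming scheme; `datablock_path` is a hypothetical name, not part of any existing tool:

```python
from pathlib import Path

def datablock_path(datablocks: Path, digest: str) -> Path:
    """Map a SHA-1 hex digest to its sharded location under Datablocks/,
    e.g. 0001816a... -> 00/01/81/sha1-0001816a..."""
    return datablocks / digest[:2] / digest[2:4] / digest[4:6] / f"sha1-{digest}"
```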
The backup logic is simple: when backing up a file, my program first stores its datablock in Datablocks (if it wasn't already there), and then creates a hard link from the datablock to the file's location in the Snapshots subfolder.
This has a few advantages:
- My data is directly accessible in its most natural form (in Snapshots), and in particular any operating system can read it (as long as it understands the filesystem the drive is formatted in, HFS+ in my case).
- It uses space significantly better than Time Machine (a piece of data is stored only once regardless of how many snapshots it appears in).
- Garbage collection is unambiguous and easy to implement: remove any datablock whose link count is 1 (because that means it is no longer referenced from any snapshot).
I obviously need to be careful if one day I want to make a copy of my primary backup drive. A straightforward copy (the way the Finder would do it) would break the hard links and blow up space on the new drive. Remedy: use rsync with the --hard-links option (i.e. rsync -aH), which recreates the hard-link structure on the destination.