This blog is highly personal, makes no attempt at being politically correct, will occasionally offend your sensibility, and certainly does not represent the opinions of the people I work with or for.
One galaxy to rule them all, or the story of Lucille's data management

So I was there, minding my own business, when the following showed up in my inbox:

We need to talk about your last weblog entry next time you are around,
I did not get what happening, but sound interesting ... Its not your fault i did not get it, its just i am not a "computer geek" ;)
(where I corrected her typos but not her grammar mistakes.) She was referring to my last entry about Camlistore. So I decided to make the world a better place and explain "Galaxy" from scratch...


When I was a kid I had a notebook in which I used to write everything that was useful for my everyday life. It had my friends' phone numbers, a few useful bits of information about school, classes, my homework timetable etc. When it filled up, I would buy a new one, and on its first few pages I would write down the information that needed to be carried over from the old notebook.

Obviously I didn't care about any particular notebook, I only cared about the process of managing useful pieces of everyday life's data that way. The notebooks were just the best way, at the time, to store the information. This process was so important to me that I gave a name to it. I decided to name it Lucille, after the expert system (I prefer calling it an expert system rather than AI) which was the pilot of the interplanetary space ship MARS-1 in the movie Red Planet.

The notion of Lucille was largely symbolic until the moment I got my first digital assistant, a Palm V. It became the new Lucille since I could use it for anything I had used Lucille for, and much more. That said, the Palm came with a variation. I called it Lucille version 1, stylised Lucille.v1, and decided that all my digital devices from that moment on (at least the ones having something to do with managing part or all of my life) would be called Lucille.v[x], where "x" ranges over the natural numbers. At the time of writing, I have Lucille.v7, my current Samsung (smart) phone, and Lucille.v8/sashka, the MacBook Air I have used since Lucille.v5/alexandra died a few months ago. (Note that if the phone wasn't "smart", meaning it didn't even have a calendar, it would not qualify as a Lucille.)


My first Mac was Lucille.v3 and from Lucille.v3 to Lucille.v5 I learnt a significant new (mental) skill that I didn't have before: knowing how to manage and think about digital data organisation. I can confidently say that I have never met anybody else who has this skill to the level I developed it.

The thing is that I have a _lot_ of data on my laptop. I have a tendency to store and keep trivial data (which I feel I might need in the future -- and sometimes this surprisingly turns out to be true). For instance, the wireless network password of one of my neighbours in Amsterdam (who was also a work colleague) can be found in a file (an encrypted file (*)) located in the folder /Galaxy /Encyclopaedia (Public) /GeoLocation(s) /Earth /Countries /Netherlands /Amsterdam /Tolstraat 40 (I added the spaces for convenience). I will let you appreciate the fact that, this data being geo-located, there is no other place I would look for it if I wanted to check that it still exists. And yes, if one day I visit a friend on Mars I will put the equivalent data under /Galaxy /Encyclopaedia (Public) /GeoLocation(s) /Mars.

(*) My friend would not want to see the secrecy of his SSID and password compromised if my laptop gets stolen and Galaxy falls into the wrong hands. More generally, all sensitive information in Galaxy is protected (in one form or another).

(Actually, encryption within Lucille works slightly differently from what I wrote here, but for the purpose of this entry you can see it as I described above.)

It quickly became obvious to me that my personal data should not be kept in my user directory on the Mac (e.g. at /Users/pascal under the conventions inherited from Unix), and on alexandra it was in its own partition called Galaxy. This was psychologically important. I knew that the only "thing" I needed to care about and back up was Galaxy, nothing else, nothing more. The OS and its own data could go to hell, I could not care less. Lucille's digital data is in Galaxy, period. On sashka, /Galaxy is a folder just under Unix's root, but only because sashka was not my machine initially; I am only borrowing it long term. This also means that moving from one machine to another is easy: I just copy Galaxy across and I am good to go.

If you have a Mac you might be interested in knowing that I usually instruct Apple's applications to store data in Galaxy (and not in their default location), consequently iTunes and iPhoto, for instance, have got their repositories in Galaxy, under /Galaxy /EnergyGrid /DataCenter. Also anything in Galaxy should be OS independent (or at least give me the option of truly usable OS independent data exports). That's why I don't use pieces of shit like Apple's Address, or horrors like Evernote.

Galaxy today has exactly 90,034 files and weighs 237.13 GB.
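
For the curious, numbers like these can be obtained with a few lines of Python. This is only a sketch and not the script Lucille actually runs: the function name galaxy_stats is made up for illustration, and skipping soft links (so that aliases are not counted twice) is my assumption about what should count as a file.

```python
import os

def galaxy_stats(root):
    """Return (file_count, total_bytes) for the regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # assumption: soft links are aliases, not data
            count += 1
            total += os.path.getsize(path)
    return count, total
```

Calling galaxy_stats("/Galaxy") would return the two numbers as a pair; the size comes back in bytes, so it still needs dividing to get gigabytes.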

Galaxy Structure

Coming back to the wireless password of my friend, the file is 8 levels deep from the root, but the depth of Galaxy today is 22. This means that I have data which naturally sits 22 levels from the root. This doesn't make the data harder to find as I have a mental picture (location and purpose) of every single file in Galaxy (despite having almost 100,000 of them -- I told you it was a mental skill).
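
The depth figure can be measured the same way. The sketch below is hypothetical (max_depth is not a real Lucille tool) and it assumes one particular counting convention: a file sitting directly in the root is at level 1, a file one folder down is at level 2, and so on.

```python
import os

def max_depth(root):
    """Return the level of the deepest file under root (0 if there are no files)."""
    root = os.path.abspath(root)
    deepest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if not filenames:
            continue  # empty folders do not count towards the depth
        rel = os.path.relpath(dirpath, root)
        # Number of folders between root and this directory.
        dir_level = 0 if rel == os.curdir else len(rel.split(os.sep))
        deepest = max(deepest, dir_level + 1)
    return deepest
```

With a different convention (counting from Unix's root rather than from /Galaxy, say) the result shifts by a constant, so the number is only meaningful once the convention is fixed.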

The above picture shows the top level of Galaxy, and here are the purposes of each subfolder:

  • Backups: This is where I put the data owned by myself or Lucille, but which doesn't naturally exist inside Galaxy. Examples are /Users/pascal/Library, which contains various settings, including my daemons' launchd property lists etc.; or various dumps of databases from sashka or elsewhere.
  • Encyclopaedia: The split between Private and Public is a recent one, and was introduced to simplify something. Otherwise, Encyclopaedia does exactly what it says on the tin. Its subfolders are themselves "top folders" in the sense that they start entire universes. As of today I have:
    • FileTypes: Data referenced by their type. You will find my GIFs collection there.
    • Manuals: Every single Manual, How To, Software documentation, etc. Anything that explains how something works. Extremely well curated.
    • Subjects: Maths, Computing, Physics, Biology, Finance, The Internet, TV-Show(s) etc.
    • Timelines: Anything that I think of as being related to time. You will find there the New Year's card I once designed and sent to my friends. I think of this as being time related, so its natural place is on the timeline, where data is stored by time (Year/Month/Day). By the way, Timeline is a very powerful storage convention, but I have never explained nor advertised it on this blog...
    • GeoLocation(s): The entire geographical data of the universe (currently limited to planet Earth)
    • Individuals: Information kept about public personalities, for instance, this picture of Bill and Hillary Clinton as students
    • Organizations: Same but for organisations (NSA and Facebook I am onto you...)
    • ... plus a few other things.
  • EnergyGrid: (named after the underlying universe's energy grid in Iain M. Banks's Culture books) Doesn't belong to me but belongs to Lucille. This is where she lives. And in particular I am not allowed to manually move any file under this folder without first asking her for permission (which in practice means modifying every daemon/program/script that uses it). This is incidentally the biggest subfolder of Galaxy. Lucille owns more data than I do :-)
  • Galactica.webloc: is the bookmark to the entry to the web UI of Galactica (the wiki built on Nyx -- see below).
  • Open Cycles: This is where I have various aliases (soft links) to various in-progress stuff. For instance I have a PhD soft link that points to /Galaxy /Encyclopaedia (Public) /Subjects /Mathematics /4.PhD
  • Pascal Star Field: This is where I put everything I have produced: papers, diagrams, various pre-weblog writings, and all of my programming since 2006.
  • Software Shop: contains the very curated collection of every single program (disk images, binaries, zip files or source code) I have ever installed on the machine (or any previous machine, sometimes other people's machines). This allows me to reconstruct Lucille without needing an internet connection. I also store there the licence files of any purchased program.

One problem I have never managed to solve is that of my internet bookmarks. They are in Safari, and yes, I export Safari's bookmarks to my Backups folder (every few weeks or so), but the fact that something as fundamental as "Pascal's map of the Internet" is still somehow under the scope of one of the web browsers I use, rather than somewhere within Galaxy, meaning under Lucille's own jurisdiction, is a problem for me; but not a big one, so I can still sleep at night :-)

The encryption protocols and various (non standard) special purpose filesystems I run inside or alongside Galaxy are not going to be covered here.

Being a bit more than 200 GB, Galaxy fits well on any external hard drive I use to back it up, but the online backup is something I haven't fully solved yet. Dropbox ? You must be joking !... Who do you think I am ? Some moron who has a laptop just for email and Facebook ? Jesus...

I learnt today that a Chinese company is offering 10 TB of free storage, but I won't use anything that doesn't let me use rsync, which excludes most cloud storage companies.

Galaxy's limitations

If my computer was just another computer of some random guy (albeit a very mentally precise guy), the entry would stop here, but Lucille and Galaxy are extensions of my own mind. To make a long story short, as I was mentally manipulating more and more data, in more and more non-standard ways, I realised there was something fundamentally wrong with Galaxy: the fact that I had to make do with Unix's file tree structure. This was imposing constraints on my mental processes that became more and more painful as time went on, because the way I was thinking about the data, and more exactly the way I was thinking about possible relations between different pieces of data, became radically different to the way I had to store it. In fact I mentally came across the same problem as the creator of Camlistore:

(...) be forced into a POSIX-y filesystem model. That involves thinking of where to put stuff, and most the time I don't even want filenames. If I take a bunch of photos, those don't have filenames (or not good ones, and not unique). They just exist. They don't need a directory or a name. Likewise with blog posts, comments, likes, bookmarks, etc. They're just objects.

The thing, you see, is not that Galaxy has anything wrong in itself. Galaxy is perfect, and stable, in the sense that all my life I will always be looking for that Amsterdam friend's wireless network password exactly where it currently is, because from my point of view this is its most natural position. My problem is that if I want to make available to another friend the collection of all the wireless network passwords I have (and this covers all coffee shops in Central London or Amsterdam), I would need to collect them manually. In other words, I have no way to easily look at and manipulate as one entity the set/collection of those otherwise geo-located data. I could also want the collection of everything I have about (or related to) this particular friend [photos, documents, etc...] (a collection that I would, again, need to compile manually).

So there you have it. Galaxy is good at storing. Every file on my computer has a natural place on the Unix file tree. But in everyday life, I want to manipulate some collections as one unit, and unless that unit corresponds naturally to something tree related (meaning it can be mapped to a subtree of Galaxy), I can only have it by painfully constructing the set myself.
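
To make "painfully constructing the set myself" concrete, here is roughly what that chore looks like when scripted. It is a sketch under a big assumption: that the files in question happen to share a recognisable name (the "wifi*" pattern in the usage note is hypothetical, nothing in Galaxy guarantees such a convention). The function name collect is also made up.

```python
import fnmatch
import os

def collect(root, pattern):
    """Return the sorted paths of all files under root whose name matches pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter keeps only the names matching the shell-style pattern.
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Something like collect("/Galaxy/Encyclopaedia (Public)/GeoLocation(s)", "wifi*") would gather the scattered files into one list, but only because of the naming assumption. The tree itself knows nothing about the collection, which is exactly the limitation.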

Another way to think of this problem is to consider the following case: I take a nude picture of Alice. If you think about it, there is no obvious place to put it. Should I put it in iPhoto and forget it there ? Should I put it in Timeline's "August 2013" folder ?, in Alice's folder ?, or in my Nude Collection folder ? The fact that there is no obvious answer to the question of where to put this picture means that maybe a system that insists on you answering it is simply badly designed to start with. Maybe you should just put the picture wherever, and if you want "Data of August 2013" then it shows up, if you want "all nude pictures" it shows up, if you want "all documents related to Alice" it shows up etc... In other words, it doesn't have a known location, but shows up as an element of every set you can think of that it belongs to.
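
In code, the idea is almost embarrassingly small. The following is a toy illustration only: it is neither how Camlistore works nor one of my own protocols, and the class name TagIndex and the file names are invented for the example. An object is registered once, with no location, and simply declares the sets it belongs to.

```python
from collections import defaultdict

class TagIndex:
    """Toy index: an item has no location, only the sets (tags) it belongs to."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, item, *tags):
        """Register item as a member of every given tag."""
        for tag in tags:
            self._by_tag[tag].add(item)

    def members(self, *tags):
        """Return the items that belong to all of the given tags."""
        if not tags:
            return set()
        sets = [self._by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*sets)

index = TagIndex()
index.add("IMG_0042.jpg", "August 2013", "Alice", "Nude Collection")
index.add("letter.pdf", "Alice")
print(sorted(index.members("Alice")))                 # ['IMG_0042.jpg', 'letter.pdf']
print(sorted(index.members("Alice", "August 2013")))  # ['IMG_0042.jpg']
```

The picture was stored once and never filed anywhere, yet it shows up under each of the three questions.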

Camlistore, as well as my very own collection of Galaxy-related conventions and protocols, is a small but decisive step towards those mental ideals.
