This blog is highly personal, makes no attempt at being politically correct, will occasionaly offend your sensibility, and certainly does not represent the opinions of the people I work with or for.
Inside Nyx; a story of a graph, Merkle trees and functional programming
avatar

Nyx (named after the eponymous Greek goddess) is essentially my personal homegrown content management system. I write it on my spare time, so it's a work in progress. This entry presents its current state.

Part 1: The reasons why Nyx exist

Nyx has existed under various forms and various names over the years (and some of its data have existed since the very early, pre-2010, versions) from the time I was pondering on the problem of unstructured data storage, as part of thinking about the general problem of Personal Information Management. I then got interested in that problem again a couple of years ago as I was wondering how the NSA must be organising their data and decided to solve that problem as if they had tasked me to do so. This lead to Permanet. Permanet remained abandoned for a while, but then my interest in content management systems got renewed when I discovered Camlistore, and my work on content addressable storage lead to the Lucille Object Store.

The current form of Nyx appeared one day when I told myself that I really should solve my hand written math papers problem. I could have had a blog where I would put them; a place where everyday I would put my papers of the day, and, weeks after weeks, see the evolution of my PhD work. Note that it's only at this point (Summer 2015) that Nyx got its graphical user interface. Until that point it existed as a graph that I used to interact with solely from the command line.

Solving that problem would have been easy enough (any blog engine with a user friendly support for pictures would have done it, something I could have easily built), but I wanted to solve the general problem of personal data points existing on a graph of (very) heterogeneous nodes; the ultimate goal also includes that this system should act as a file system if and when needed.

Part 2: The internal organisation

Part 2.1: Permanodes

The atomic elements of Nyx are called permanodes (terminology originally shamelessly stolen from Camlistore). Permanodes are essentially atomic data pieces. Each permanode has a type ranging over: text (string of unlimited length), html (technically same as text but in practice used for HTML documents, such as weblog entries), url (technically same as text but in practice used for URLs), latex which contains LaTeX source code (will come back, to this), the generic type file (including pictures, .pdf files, movies etc.), a unix-directory type (the abstraction of a node of the unix file tree), the email type (straight from STMP servers) and few other types more esoteric. Each permanode carries its own unique identifier and its own type. Those keys are the only metadata shared by permanodes.

This weblog entry is a permanode for instance. You can see the corresponding object here, as well as the naked entry here (served with mimetype "text/html", so the HTML markup will be interpreted as normal).

Although permanodes are inherently different (ie: the data carried by different types is very different), this is where the difference stops. All operations in Nyx are done uniformly across types. This means, for instance, that the operation of adding a description to a permanode is a generic operation which internally works the same regardless of the type. (If you are a OO person, you can think that cross type operations are defined on the generic permanode object that all types inherit to.)

Even though Nyx has slowly acquired JavaScript in-browser capabilities, the interaction with Nyx mostly happens at the command line. For this, I have created a Scheme like language giving me access to all its capabilities; ranging from editing/modifying/updating and deleting permanodes to running complex search engine queries. What happens exactly during the edition of a permanode obviously depends on the type: If the permanode was of type text, my text editor opens pre-filled with the contents of the text, I edit it, and when I close it the updated text is committed as new contents of the permanode; and same goes for other text based permanodes (html, url, latex). If the permanode was of type file, then the file is written on my desktop, I can edit it (or replace it by a new file) and then the binary data is committed as new contents of the same permanode. Similarly for other types (this said, emails are not editable).

It is possible to set a variety of metadata against a permanode. For instance to give it a title, or a label or anything else. It is also possible to specify that a permanode is metadata of another permanode (rare occurrence, but natural operation in some circumstances). It is also possible to link two permanodes and some other operations. The links are typed, directed, and occasionally carry data themselves.

Internally, permanodes and metadata objects are JSON objects. They are stored separately, in the sense that adding a metadata to an existing object does not modify the object, the metadata object is just added to the repository. For instance, nothing prevents (or even morally discourages) a permanode to have two titles. The two metadata objects carrying that information exist independently to each other, and independently to the permanode. This implies that if the permanode is deleted, then the objects carrying metadata for this permanode have simply lost their target. We can remove them by running the garbage collector.

As an example of what a permanode looks like, here is a picture:

{
    "lucille-energy-grid-uuid": "c385083c-a95e-4188-8283-729a3aa03396",
    "lucille-energy-grid-object-type": "nyx:permanode:file-point",
    "nyx:creation-datetime": "2015-07-24 14:33:41 +0100",
    "extension": ".jpg",
    "sha1": "6bd31c98c4366c1c31f8674a0463fb8c962f7a48",
    "nyx:blobrefs": [
        "sha1-6bd31c98c4366c1c31f8674a0463fb8c962f7a48"
    ]
}
The 'blobrefs' are the sha1 hashes of pieces of the file (Nyx cuts big files into smaller ones, for convenience). The binary data of permanodes are stored in a dedicated key value store where the values are binary data. The key is always the hash of the data, this is called contents addressable storage.

Note that the extension is indeed part of the permanode. This is how Nyx knows, for instance, which mime type to use when serving this permanode online [0]; otherwise the permanode would just refer to an unspecified blob of data. The fact that we know that the data is a picture has no bearing on the way Nyx internally treats the permanode relatively to other nyx:permanode:file-point/s, but will be affecting the way the permanode is displayed online.

[0] update: IPFS had to solve the same problem, but did it differently. They use the unixfs data format.

And an example of a metadata object is as follows:

{
    "lucille-energy-grid-uuid": "2950e8db-82ed-4fbf-906e-cd4b37aabba0",
    "lucille-energy-grid-object-type": "nyx:metadata",
    "nyx:creation-datetime": "2015-07-17 22:19:52 +0100",
    "nyx:metadata-type": "description",
    "nyx:target:uuid": "a6aa85e0-3592-4b6b-a915-8f1c6a954d0c",
    "nyx:payload": "Thoughts on Reddit"
}
which gives its title to one of my recent weblog entries.

Sometimes I write some papers then take pictures of them and end up with a collection of permanodes that I mentally see as one single entity. To encapsulate that information, Nyx has a notion of container which represents an ordered sequence of permanodes. (Permanodes of any type, not necessarily pictures.) The fact that individual permanodes of the same container can be of completely different types, is one of the most interesting aspect of containers, and Nyx as a whole. Containers are treated like permanodes, you can attach metadata to them and they are returned as part of search results.

Nyx has a search engine which looks through metadata objects and use (a homegrown version of) the Levenshtein distance to select relevant information. The search time is linear in the size of the Nyx repository [1], but it is sub-second with the current repository (7876 objects in the repository at the time those lines are written). Once the search completes a listing of results (an HTML file showing a view similar to a Google search result page) is automatically generated.

[1] update: This is no longer the case for two reasons. First the repository now has more than 20,000 objects and in its latest incarnation the search now uses Merkle trees representing the repository, and a search is pretty much just the result of a fold of the tree. Because of built-in memoization of the related higher order functions, this is blazingly fast.

As I mentioned above, all entries of this weblog are html permanodes of Nyx. I have a custom script that synchronises the relevant subset of Nyx with a key value store on the alseyn.net server.

Part 2.2: Relations between permanodes

There is fundamentally no hierarchy between permanodes, but there are two relations that drive the user interface. One relation indicates whether a permanode is a "comment" of another one, and another relation indicates if a permanode is a "children" of another one (actually this last relation exists in two version, where the children are non ordered or totally ordered). When a permanode is displayed all its comments and children are displayed as well. Obvisouly, given any permanode, this permanode can be seen as the root of a tree under the parent-child relation; this affect some kinds of displays. Note that using those two relation one can easily implement more complex structures, like Reddit's subreddits and comment system.

Part 3: The visualisation of objects

With the exception of some files with an extension which cannot be displayed online (for instance .zip), most data in Nyx can be fed to a web browser. The case of the LaTeX permanode is particular: the data referred to by the permanode is a piece of text (the LaTeX source code), but the online representation is a picture. Indeed the code is compiled to a pdf file and then exported as a picture. This solution was chosen because I cannot stand the way wikis try and mix HTML code and mathematical formulae (rendered as pictures or some JavaScript nonsense); the alignment is often horrendous [2]. Now, this is obviously not for 100 pages long documents. Existing mathematical documents imported into Nyx are imported as .pdf files. This permanode is only used when I add mathematical annotations to my work (and feel like typing TeX instead of taking another picture of a piece of paper). It is also worth mentioning the case of unix-directory. In this case, the entire folder is rsync'ed mounted to my desktop. I can then have a look at it and destroy it, or make updates to it and then send the updates back to Nyx.

[2] update: I recently noticed that is problem has been solved. The stackedit.io editor, for instance has excellent LaTeX support.

Each permanode has its own URL, and the underlying data is served with the corresponding mime type. In fact, each object inside the Nyx repository has its own URL, including metadata objects. In this latter case the JSON object is showed.

Let's understand the web views of Nyx. Most permanodes can be fed to a web browser, and most, but not all, of can be showed inline, meaning in the middle of an HTML page. For instance this picture has its own url (the previous link) but also has the following inline representation

... whereas this pdf file does not have one.

With the above in mind, rendering a container online is easy. You take each permanode of the container and inline represent them, one after the other. That's all. The picture below, for instance, is the bottom part of an existing container (my screen is not high enough to show all of it). You see three permanodes displayed there. First, the bottom part of a picture, then the rendering of a small LaTeX document, and the announce of a container (containers can be points of other containers).



If you look carefully you will see that each permanode in a container has an ordinal (in the picture above, 1, 2 and 2.5). In fact, a container is its own real line (line of numbers for the non mathematicians) and if I wanted to add a permanode between the picture of the girl (at 1) and the LaTeX picture (at 2), I would just take its identifier and say that I put it on this line at ordinal, say, 1.5 (anything between 1 and 2 will do). So when I said that a container is a ordered sequence of identifiers, do not imagine that there is an array stored somewhere. When a container wants to know what it contains, it asks to the repository "Guys, which of you are on me and where exactly are you ?"... The answer is not even provided by the permanodes themselves actually, but by specific metadata objects (of course).

You can access the LaTeX permanode of that container at its own page (as for all permanodes).



We can also view the display of the json blob of that (latex) permanode



Interestingly, I never had LaTeX in mind when I built Nyx. The introduction of the latex permanode happened mostly by accident (after I, one day, realised that Nyx's permissive design allowed it). In fact, new permanode types can be added anytime (for instance if one day I want to start displaying a new media type that is not covered by the existing permanode types), the only real step when adding a new type is to decide its web rendering (meaning finding its mime type) and its inline rendering; and in the worst case the file is simply written on my Desktop and I use whichever relevant program to open it.

Coming back to internal representation, since each piece of metadata is its own object and since the fact that a permanode is a part of a container is represented by a separate metadata object (which essentially carries the container uuid, the permanode uuid and the position), any permanode can appear in more than one container (or even several positions of the same container).

Part 4: The search engine

(Added to the original entry, end of September)

The search engine is admittedly the most interesting part of Nyx. I wanted it to be fast enough so that it runs faster than the time it takes me to type one letter to another; as to have a complete refresh of the display in real time. Incidentally I wish that Google could do that (which is not going to happen due to basic network latency, not to mention Google's own slowness). Instead of showing me predictions on what I might be searching for, I would like to see the result of what I have typed so far.

Essentially the search engine is simple. Given a pattern, for instance "Paris", I want the list of permanodes which match this pattern. There are several interpretations of "match", and here I am not going to enter into details of what this means in the current version of Nyx. As far as this entry goes, let us says that I have a function which returns a boolean for any ( permanode, pattern ) pair.

The result of a search depends on the entire repository, and in an ideal world I would pass the repository itself as argument to the matching function. I would give it the repository and a pattern and it would return the subset of permanodes matching this pattern. This is trivial to implement using any language select function, but then those trivial implementations have the downside of leading to a search time esssentially linear in the size of the repository.

The fantasy of passing entire data stores as arguments to functions steams from my functional tendencies. I spent quite a while over the past year thinking about it and getting closer and closer to the ideal using tricks, until I realised that actually this is perfectly possible. The thing is that you do not pass the repository itself but the root of its Merkle tree). (And you also need to make sure of storing subtrees against their own hashes.)

"Tree ?! Which tree ? I thought the repository was a graph!?", are you thinking. Well, yes the repository is a graph but given any collection of points you can always arbitrarily put them on a tree. Imagine that you have done that and that each node has a label that is a hash: the hash of the combination of the collection of objects it carries and the hash of its children. This lead to the notion of capital-ship, which is a special node type, part of an overlay tree above the Nyx Graph.

The search engine was explained by me to Marcello in an email, which I reproduce here

To implement the tree, I introduced the notion of capital-ship. It's a node that references permanodes and other capital-ships. The link from one capital-ship to others is directed and the graph has no cycles, so it's a tree. Any modification of any permanode modifies the ship it "belongs to" and modifies the tree recursively up to the root ship. Ships are always referred to by their hash (and as far as the ship network is concerned, permanodes as well). The unique entry point to the tree is a file containing the hash of the root ship.

To talk to the root ship, you simply need to call a function taking three arguments. First argument is the hash of the ship you talk to, second argument is the function that will be used to extract the information from permanodes and third argument is the function used to combine answers coming from children ships.

To make this explicit let me show you how I do the three below operations ( simple cases just to highlight the principle )

  • (1) count how many objects there are in the repository,
  • (2) get the latest created object,
  • (3) select all subnyxes permanodes.

(1) count how many objects there are in the repository



The universe_network_ask_a_question function will be recursively calling itself down the tree from the root ship. The important thing is that it only computes once, ever, for any [ shiphash, lambda, lambda ] combination. (I do memoization using Xcache.)

In this example the first lambda maps any object to integer 1 and the second sums the answers received from children ships. The answer for a ship is the combination of answers from the ships' permanodes and the answers from the ships' children ships.

This is rightfully slow the first time you run it ( where "slow" means that the entire repository will be probed ), but insanely fast every time afterwards. Even if the repository has been updated (eg: new math papers created) and the root hash has changed, most of the ships have not changed and their cached answers can and will be used (therefore the recursive process stops almost instantly).

(2) get the latest created object



Self explanatory.

(3) select all subnyxes permanodes



Self explanatory.

The search engine (which is no longer an "engine" but simply a function call) now works using a similar code as Example (3). the lambda1 used in the search simply selects the object if the object reacts to the search pattern. The final collection is the the collection of all repository objects reacting to that pattern.

Cheers :-)

ps: I didn't do this (yet), but turns out that if you log each change of the Merkle root against a time stamp you can go back to the exact state of the repository at any point in time :-)

I said in the email that the folding function only runs once, ever, for any [ shiphash, lambda, lambda ] combination. This is basic memoization, and as it turns out, Nyx's search implementation highlights one of my best discoveries in programming: immutability and memoization leads to blazingly fast systems.

A fully detailled explanation of Memoization against a Merkle tree is given here.

A screenshot of the search display is given below, by which I was looking for "Nyx". JavaScript is displaying that the last round trip took 9 milliseconds :-)



Otherwise this little .gif I just made should give an idea... (search starts when the string has 3 letters or more)



Another very interesting feature of the search engine is that it has its own language. This was explained in another email to Marcello :-)

You remember the Scheme like language I introduced to help me drive Nyx from the command line. I decided to use the same parser to interpret the input of the search. So now, if the search engine sees something like "Paris Hilton" it will just search for Paris Hilton, as before, but if it sees something like ":: { Pascal } { Paris Hilton :year:2010 }", it does as follows:

1. Because of the starting :: the rest will be interpreted as a sentence of the search language.
2. The sentence "{ Pascal } { Paris Hilton :year:2010 }" will be split in the two expressions "{ Pascal }" and "{ Paris Hilton :year:2010 }"
3. The engine applies a OR across sentences, so the result will be the results of "{ Pascal }" together with the results of "{ Paris Hilton :year:2010 }"
4. The result of "{ Pascal }" is simply what happens when you search for "Pascal"
5. The engine applies a AND between the tokens of an expression, so the result of "{ Paris Hilton :year:2010 }" is the set of permanodes reacting to "Paris" and reacting to "Hilton" and that were created in the year 2010. The fact that the token ":year:2010" starts with ':' indicates that it is to be interpreted as a special command and in this case its evaluation returns all the permanodes created in 2015. [3]

In total, I get all the "Pascal" permanodes and all the "Paris Hilton" permanodes created in 2010 :-)

Forgot to say: Because the round trip to the search engine is so fast, I can afford to send the command line at each letter, of course half way through a correct language expression, it cannot be evaluated. I just get "parsing error". See attached



[3] I would kill to have Google let me do that. There are lots of special commands related to time, size, contents type, country of origin etc. I can think of :-)

Closing remarks

The three layers of Nyx are really separated, both physically and semantically. At the bottom, the data is either binary blobs or JSON objects. Then any programming language can be used to make Nyx itself, and at the top whichever visualisation mechanism you want to use for the objects (currently the web browser) is your choice. This said, above all else, the fact that Nyx (in particular the Graph itself) makes no assumptions about why the data is there in the first place, what it means and how to represent it (in other words that it is very close to any reasonable definition of Universal Content Management System), is where its true value lies.

Update: August 2016: Nyx has been re-designed and re-written from scratch.

[ add a comment ]

Archives