Memoization and Storage in Nyx

I am going to explain how caching/memoization (currently) works in Nyx, and why I use several SQLite files. It's fun.

Part 1: Memoization revisited.

Imagine that you have a little tool that needs to store some data. Nothing fancy, so you decide to go for a SQLite database. One of the functions does something very simple, like counting the number of entries. You write it like this (pseudocode):


function count(){
	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects')
	return resultset[0]['count']
}
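For concreteness, here is roughly what that could look like in real code, say Python with the standard sqlite3 module (the database path and the 'objects' table are just placeholders carried over from the pseudocode):

import sqlite3

DB_PATH = 'path/to/database.sqlite3'  # placeholder path from the pseudocode

def count():
	# open the database and count the rows of the 'objects' table
	with sqlite3.connect(DB_PATH) as db:
		row = db.execute('select count(*) as count from objects').fetchone()
	return row[0]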

Then, for argument's sake, you realise that your language's SQLite binding is painfully slow. You're thinking... and you realise that, after all, if the file has not changed, then the answer must be the same. So you decide to do this:


function count(){

	// hash of the file's contents, not of the path string
	filehash = sha1_of_file('path/to/database.sqlite3')
	count = memcache.get(filehash)
	if(not is_null(count)) {
		return count
	}

	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects')
	count = resultset[0]['count']

	memcache.set(filehash,count)

	return count

}

What you have done here is cache the count in memcache (or the cache of your choice), against the sha1 of the database file. Of course we assume that the overhead of computing the sha1 is negligible. In Nyx, since writes happen far less often than reads, the hash is computed and stored in the cache after each write, and therefore does not need to be recomputed at each read. This means that, except right after a write, you get the answer as fast as your memory can serve it. In effect, in Nyx it would look like this:


function count(){

	// the hash was computed and stored at write time
	filehash = memcache.get('database-hash')
	count = memcache.get(filehash)
	if(not is_null(count)) {
		return count
	}

	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects')
	count = resultset[0]['count']

	memcache.set(filehash,count)

	return count

}
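The examples above only show the read path; the write side is implied. Here is a minimal sketch of what it could look like, again in Python, with a plain dict standing in for memcache (add_object and its one-column schema are made up for illustration, not Nyx's actual write function):

import hashlib
import sqlite3

DB_PATH = 'path/to/database.sqlite3'  # placeholder path
CACHE = {}  # stand-in for memcache

def sha1_of_file(path):
	# hash the file's contents, not the path string
	with open(path, 'rb') as f:
		return hashlib.sha1(f.read()).hexdigest()

def add_object(payload):
	# perform the write...
	with sqlite3.connect(DB_PATH) as db:
		db.execute('insert into objects (payload) values (?)', (payload,))
	# ...then refresh the stored hash; every value cached against the
	# previous hash is now unreachable, i.e. automatically invalidated
	CACHE['database-hash'] = sha1_of_file(DB_PATH)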

Ok, so all this is not new; we learnt it 10 years ago, when people started to store expensive function calls in memory (memoization). That being said, my example is slightly more complex than what you see in online memcache tutorials. There, you would typically see this:


function count(){

	count = memcache.get('object-count')
	if(not is_null(count)) {
		return count
	}

	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects')
	count = resultset[0]['count']

	memcache.set('object-count',count)

	return count

}

This would not work for me, because if I get a result from the cache, how do I know that the number I got is the correct one? The database could have changed since the last time the number was stored, and I would be getting a count which doesn't actually correspond to reality. What I have seen people do is flush that cache key if an object is added or removed, but this has always felt messy to me, because you then also need to flush any other cache key holding a value that could possibly have been changed by your action. I much prefer a convention which *automatically* invalidates *any* previously stored value as soon as the file is changed. In effect, any action that changes the database simply needs to update one value, namely 'database-hash', and every value stored against the previous version of the database is automatically invalidated.

But then one problem remains: I could not possibly store the count against the sha1 of the database alone, because the count function might not be the only function storing information against this version of the database. Therefore, I would actually do this:


function count(){

	filehash = memcache.get('database-hash')
	// the suffix '167B7458-431A' is unique to this function
	count = memcache.get(filehash+'167B7458-431A')
	if(not is_null(count)) {
		return count
	}

	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects')
	count = resultset[0]['count']

	memcache.set(filehash+'167B7458-431A',count)

	return count

}

The key that the count is stored against does depend on the exact version of the file, but is unique to this particular function (because the suffix '167B7458-431A' is unique to this function). This explains why uuids turn up in the Nyx code: they ensure that multiple functions have their own specific cache keys while all being sensitive to the same hash.
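That convention factors out nicely. Here is one way it could be wrapped up as a Python decorator, reusing the CACHE dict from the write-path sketch above (this illustrates the pattern; it is not the actual Nyx code):

import functools

def memoize_against_hash(uid):
	# uid: a uuid string unique to the decorated function
	def decorator(fn):
		@functools.wraps(fn)
		def wrapper(*args):
			filehash = CACHE.get('database-hash')
			key = (filehash, uid, args)  # the arguments become part of the key
			if key not in CACHE:
				CACHE[key] = fn(*args)
			return CACHE[key]
		return wrapper
	return decorator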

Last but not least, this works even with parameterized functions. Imagine that we want to count the number of entries with a score greater than or equal to the argument of the function. We simply use the score as part of building the storage key. The function becomes:


function count(score){

	filehash = memcache.get('database-hash')
	count = memcache.get(filehash+'167B7458-431A'+score)
	if(not is_null(count)) {
		return count
	}

	db = get_database_handler()
	resultset = db.query('select count(*) as count from objects where score>=?', [score])
	count = resultset[0]['count']

	memcache.set(filehash+'167B7458-431A'+score,count)

	return count

}

A different cache key for each score. Easy.
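With the decorator sketched earlier, this parameterized version comes for free, since the arguments are already folded into the key (the table name and uuid are still illustrative):

@memoize_against_hash('167B7458-431A')
def count(score):
	with sqlite3.connect(DB_PATH) as db:
		row = db.execute(
			'select count(*) as count from objects where score >= ?',
			(score,),
		).fetchone()
	return row[0]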

Part 2: Moving to multi-file storage.

The system that I presented in Part 1 works, but obviously your entire collection of functions needs to recompute some values after each write. This is fine, but we can do better. It turns out that most, if not all, of your read functions just collect stuff from the database; there are no complex SQL queries. Moreover, once objects are added they rarely get updated. It's not like Wikipedia, where potentially any page can be updated; it's more like your blog: you write an entry, maybe update it a few minutes later, but once an entry is more than a few days old, it will probably never be updated.

Introducing timed databases. It works better if I just show a picture...

What you are looking at are the SQLite files containing Nyx data dating back to 2006. Each file contains the objects created during the month given by the filename. (Nyx didn't exist in 2006, but the oldest pieces of data in Nyx are the first entries of this weblog...)

And yes, you are right: when I want to count the number of objects, I need to sum the counts of up to 120 files (12 files per year for 10 years). But then, here is the kicker: 119 of those numbers are cached and... will never change. Only 2016-01 (this entry is written in January 2016) might change in the next few days (actually, it will change as soon as I finish writing this), and the count related to it will need to be recomputed from scratch (meaning an actual SQL query will be run against that particular file).
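In code, the global count could look something like the sketch below, reusing CACHE and sha1_of_file from the earlier sketches (the directory layout is an assumption, and in Nyx the per-file hashes would themselves come from the cache rather than be recomputed on every read):

import glob
import sqlite3

def count_all():
	total = 0
	# one SQLite file per month, e.g. 2006-01.sqlite3 ... 2016-01.sqlite3
	for path in sorted(glob.glob('data/*.sqlite3')):
		key = (sha1_of_file(path), 'count-per-file')
		if key not in CACHE:
			# only the current month's file ever misses the cache
			with sqlite3.connect(path) as db:
				CACHE[key] = db.execute('select count(*) from objects').fetchone()[0]
		total += CACHE[key]
	return total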

And yes, you are also right: the Nyx storage abstraction layer completely hides from the layers above it the fact that the repository follows these conventions. To the rest of the code, the storage looks like it is... in one piece.

And, this goes without saying, with the above optimizations, and given that I use Xcache, which survives computer restarts, Nyx is *way* faster than if I were performing queries against a single database containing 10 years of emails and other personal data. Queries are either not run at all, or run against files that hold at most one month of data.

(ps) All this is a particular case of something I explained once: memoization against a Merkle tree.

