TL;DR The web, somehow, needs to be like a huge git repo, with a sane notion of "object at a location".
I had an epiphany this morning. Something that somehow links what I explained recently about user interfaces (Pascal's deal about User Interfaces, for dummies & Communication Engines and the Timeline Problem), my background mental threads about the world's dynamics (On the great misuse of Knowledge, Experience and Personal Opinions) and my own slowly ongoing work on Nyx (Inside Nyx) (and some other things I haven't blogged about). Something very simple that I had never realised before, even though I may have accidentally expressed it before (without realising that I was...). But let's start at the beginning.
I was preparing for work, going through my morning todo list, and was given this link: http://www.win.tue.nl/~gwoegi/P-versus-NP.htm (I sometimes come across interesting web pages but don't have time to read them, so I commit the URLs to my todo list and they come back up later). I have little interest in the P versus NP problem at the moment (I need to focus on my own research), but this page seems to be quite well maintained and up to date, so I decided to keep it in Nyx. That's when I paused, because I was wondering what to keep. Should I keep the URL or make a copy of the web page itself?
The latter always comes to mind when I remember that a few webpages I liked in the past have unfortunately disappeared. In this instance, the author (and curator) of the page, a staff member of the Eindhoven University of Technology (The Netherlands), may do any number of things that will result in the page disappearing, including (but not limited to) giving up research to move to another industry. I am sure that the information itself will always exist somewhere on some CS researcher's hard drive, but not necessarily so neatly presented on a webpage. Therefore, I tend to sometimes store the web pages themselves. This is not ideal, as my version might then fall out of sync with his as time goes on (after he performs updates). So, to have the best of both worlds, I often store both: in the unlikely event that the URL becomes useless, I at least have a copy of the page as it was at some point in time. Note that I sometimes copy the HTML document itself, but other times download the web page as a PDF file and keep that instead.
In an ideal world I would register my interest in a web page and not only have my local data operator (Nyx) make a copy of it, but also have it automatically download any updates to the page if and when they occur. And you know what? The beauty is that I can do that today. I can set up a new Nyx permanode to do just that, but that would be solving the wrong problem. You see, in a more ideal world, the operation of "registering" to any piece of data on the web would result in you having a local copy of that data as well as a way to automatically get the updates. I can already see some of you saying "But... that problem has been solved..., git and stuff". Yes, it has been solved. The problem is that we are not making use of the solution! In fact there are lots of beautiful ways we could solve that problem, we are just not making use of them, but that's not the point right now because it gets more interesting...
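To make the idea of "registering" a bit more concrete, here is a minimal sketch in Python of a registry that keeps a local copy of a page and records a new version only when the content actually changes. Everything in it is an assumption made for illustration: the names `Registry`, `register`, `refresh` and `history` are hypothetical, this is not how Nyx works, and detecting changes by comparing SHA-256 digests is just one simple choice among many.

```python
import hashlib
import urllib.request
from typing import Callable, Dict, List


def fetch_http(url: str) -> bytes:
    """Default fetcher: download the raw bytes found at a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class Registry:
    """Hypothetical registry: one list of snapshots per registered URL."""

    def __init__(self, fetch: Callable[[str], bytes] = fetch_http):
        self._fetch = fetch
        self._versions: Dict[str, List[bytes]] = {}  # url -> snapshots, oldest first
        self._digests: Dict[str, str] = {}           # url -> digest of latest snapshot

    def register(self, url: str) -> None:
        """Start tracking a URL and take a first local copy."""
        self._versions.setdefault(url, [])
        self.refresh(url)

    def refresh(self, url: str) -> bool:
        """Fetch the URL again; store a new snapshot only if the content changed."""
        body = self._fetch(url)
        digest = hashlib.sha256(body).hexdigest()
        if self._digests.get(url) == digest:
            return False  # nothing new, keep quiet
        self._versions[url].append(body)
        self._digests[url] = digest
        return True

    def history(self, url: str) -> List[bytes]:
        """Every distinct version seen so far, oldest first."""
        return list(self._versions[url])
```

Calling `refresh` from a scheduled job would give the "automatic updates" part. The fetcher is passed in as a parameter so that the same logic could sit on top of something other than HTTP.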
Turns out that a few seconds later, I was given another link: Jedi Masters, a weblog entry that I love profoundly. So much that I cannot take the risk that this weblog disappears and the text is lost. So I downloaded it. I could not do it as a PDF file because the text is full of hyperlinks that would be lost in the export, and they really are part of the interest and fun of the text, so I went for the HTML code. Unfortunately, unlike the previous web page, this one comes with lots of stuff that I am not interested in (listing of recent entries etc...). So I displayed the source, extracted the div in which the text itself (the HTML code with the hyperlinks) lives, and stored that in Nyx as an HTML permanode. (By the way, if you want to see _this_ entry of my weblog without the layout, just what I actually wrote, here is the link. See, if I can do it why can't others...)
And this is when it hit me. Why the fuck am I doing all this? The two operations above are an indication that the web is not meeting my needs, for a reason that became totally obvious to me at that very moment. I realised that my frustration with the web arises from the difference between where the web has been put by 20 years of shortsightedness and the web as it is in my mind. And note that I use the word web purposefully and not Internet.
You see, those people are pushing the web into being a gigantic multi-channel TV station. People consume the web (Reddit, Facebook, Twitter, Pinterest etc...) on a daily basis, with very little memory of the past, as a source of entertainment. They get their fix of lolcats, celebrity news and whatever else interests them, with the latest news served to them easily digestible, one Facebook wall at a time, and they don't really want to remember anything. This is why news websites feel so wrong to me. They are not presented as data sources for personal investigation and instead continue to look like printed news (with the same layout, the same way of presenting information and the same lack of interactive features), the kind of paper you read, toss out, and never get back to (like those free magazines you find in public transport in the morning).
In fact most web sites are not trying, to my sadness, to be useful (raw) data providers, but instead try to look like TV channels with visual content just good enough to capture your time long enough to expose you to advertising (or to steal your user activity patterns to be sold). The situation is better in the blogosphere though; weblogs do disappear, and with them possibly large collections of potentially interesting information, but they tend to be stable.
The above totally contrasts with the fact that, when exposed to a large amount of information, my mind only wants to (and instinctively starts to) find subsets of it that, when put in a different order, reveal a piece of knowledge that wasn't originally apparent. A bit like looking through a landfill and realising that scattered between lots of rubbish are all the parts of a Ferrari (possibly parts originally coming from different Ferraris), and that with just a little bit of work you could reconstitute an entire car.
Of course, we invented data APIs and it is sometimes possible to retrieve raw data from some websites (never the ones I want), but it always feels like an afterthought, rather than what should be the norm: user interfaces built above data APIs. Let me actually link this to another problem I have. For any existing news site, I wish I could see, and possibly have a live updated view of, their content in (reverse) chronological order. Literally have a single page where every time a new piece of content comes up, article or video, it appears at the top and that's all. Nothing else, nothing more, just that. And if there are no updates for a few hours then fine, the page remains silent (meaning unchanged). I feel literally insulted that I cannot see this view. Now, if this was given to me, the next step would be (at the command line) to select subjects (for instance science and politics news, nothing else) to be displayed (in which case the latest Kardashian story would not show up). Another step would be to combine the views of one or more sites; you know, awesomeness :-)
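The three steps above (a reverse chronological view, a subject filter, and combining several sites) are easy to sketch once the data is available as plain records. The Python below is only an illustration under that assumption: the `Item` record and the `timeline` function are hypothetical names, and no existing news site API is implied.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one piece of content published by a site."""
    site: str
    title: str
    published: datetime
    topics: FrozenSet[str]


def timeline(*feeds: Iterable[Item], topics: Optional[Set[str]] = None) -> List[Item]:
    """Merge any number of feeds into one list, newest item first.

    If `topics` is given, keep only the items tagged with at least one of them.
    """
    items = [
        item
        for feed in feeds
        for item in feed
        if topics is None or item.topics & topics
    ]
    return sorted(items, key=lambda item: item.published, reverse=True)
```

With something like this, `timeline(site_a, site_b, topics={"science", "politics"})` would be the combined view, with the celebrity stories left out.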
The above comes together with another problem: the web is very bad at being downloadable. Videos are a pain to download (there are a few YouTube videos I would *really* like to keep locally -- without the hacks or malware required to do so), web pages are a pain to download, even text is a pain to extract. By the way, I wish we could agree on something. For any given webpage we could have a file format, essentially a simple SQLite file, that would contain the HTML document along with the CSS, JS files and pictures (and sound, if any) needed to redisplay it. You know, something I can open while I am offline and see the webpage as it is supposed to be. More generally, every point of data on the web should have a unique identifier (URLs already do that) and the "thing" at that location should be fully downloadable (so that I can embed it into my own data collection). So far only pictures, text files, PDFs and movie containers (mp4) have that property (maybe I am forgetting another format, but you get my point). You can see that I am not the only one with that problem in the way news sites show content from other sites. They screenshot it and show the pictures. During the recent Reddit excitement, for instance, this is the way other sites were showing Reddit threads. Licences and copyright put aside, why have we lived for so long without ever asking: why can't I download a Reddit thread? I mean the data itself, with all the supporting metadata and support files to redisplay it. You know, not only because I might want to show it on my own site (if legally possible), but also because this thread might be so important to somebody that they might want to keep it so that it survives Reddit's own death.
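As a rough illustration of the "simple SQLite file" idea, here is a sketch using Python's standard `sqlite3` module. The table layout (one row per resource, keyed by its URL, with a flag marking the root HTML document) and the function names `write_archive` and `read_resource` are assumptions of mine, not an existing or agreed format.

```python
import sqlite3
from typing import Dict, Optional, Tuple

# Assumed layout: one row per resource needed to redisplay the page.
SCHEMA = """
CREATE TABLE IF NOT EXISTS resource (
    url     TEXT PRIMARY KEY,            -- where the resource lived on the web
    mime    TEXT NOT NULL,               -- e.g. text/html, text/css, image/png
    is_root INTEGER NOT NULL DEFAULT 0,  -- 1 for the HTML document itself
    body    BLOB NOT NULL                -- raw bytes of the resource
)
"""


def write_archive(path: str, root_url: str,
                  resources: Dict[str, Tuple[str, bytes]]) -> None:
    """Store a page and its supporting files (url -> (mime, bytes)) in one file."""
    con = sqlite3.connect(path)
    try:
        with con:  # commit on success, roll back on error
            con.execute(SCHEMA)
            for url, (mime, body) in resources.items():
                con.execute(
                    "INSERT OR REPLACE INTO resource (url, mime, is_root, body) "
                    "VALUES (?, ?, ?, ?)",
                    (url, mime, int(url == root_url), body),
                )
    finally:
        con.close()


def read_resource(path: str, url: str) -> Optional[Tuple[str, bytes]]:
    """Return (mime, bytes) for a stored URL, or None if it is not in the archive."""
    con = sqlite3.connect(path)
    try:
        return con.execute(
            "SELECT mime, body FROM resource WHERE url = ?", (url,)
        ).fetchone()
    finally:
        con.close()
```

An offline viewer would load the row flagged `is_root` and resolve every CSS, JS or image request against the same table instead of the network.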
Fortunately, we do not need to rebuild the web, somebody just needs to start setting an example... And, cherry on the cake, we could even do it awesomely well: content addressed, totally distributed, permanent, resilient, etc...
Yes, we need something like IPFS.
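For readers who have not met the term, "content addressed" means that the address of a piece of data is derived from the data itself rather than from the place it happens to be stored. The toy store below is only a sketch of that principle in Python. It is not IPFS's API or identifier format, and `ContentStore` is a hypothetical name. The address here is simply the SHA-256 digest of the bytes.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy in-memory store where the address of a blob is the hash of its bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return their address."""
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Return the bytes at an address, checking that they still match it."""
        data = self._blobs[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data does not match its address")
        return data
```

Two consequences follow directly: identical content always gets the same address, whoever stores it, and anybody holding a copy can verify that it is the real thing, which is what makes distributed and permanent copies practical.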