**Fun with metric spaces**

I made an interesting discovery last night: the most visited websites on the internet are not necessarily the ones that people link to the most... with a twist.

I wrote a crawler (technically not my first crawler, since some months ago I wrote a tool to create metadata from Wolfram Alpha pages for my math work, but indeed the very first that targets the web as a whole) which, starting from an origin (for the purposes of this text, weblog.alseyn.net), builds its neighbourhood as follows.

It downloads the origin's html page and scans it to extract urls. The pages these urls refer to are, by definition, at distance 1 from the origin. The crawler then downloads the html pages at distance 1, extracts the urls inside them, and says that the newly discovered pages are at distance 2 (unless they already appeared at distance 1, in which case they remain at distance 1 -- this happens when a page at distance 1 refers to another page at distance 1).
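This layer-by-layer expansion is a plain breadth-first search over the link graph. Here is a minimal Python sketch of the idea (not my actual PHP crawler; the link graph below is hypothetical, for illustration only):

```python
from collections import deque

def distances_from(origin, links):
    """Breadth-first search over a link graph.
    links maps each page to the pages it refers to; returns the
    distance of every reachable page from the origin."""
    dist = {origin: 0}
    queue = deque([origin])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # a page already seen keeps its first (hence smallest) distance
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist

# Toy link graph (made-up links, just to show the mechanics)
links = {
    "weblog.alseyn.net": ["alseyn.net", "nucleuscms.org"],
    "alseyn.net": ["nucleuscms.org", "twitter.com"],
}
print(distances_from("weblog.alseyn.net", links))
# {'weblog.alseyn.net': 0, 'alseyn.net': 1, 'nucleuscms.org': 1, 'twitter.com': 2}
```

Note how nucleuscms.org stays at distance 1 even though it is also reachable in two hops, exactly as described above.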

Following this method you can define a distance between the origin and virtually any web page on the internet. And if you apply it from every single page, you have defined a simple distance structure on the set of all web pages (not quite a metric, strictly speaking, since links are not symmetric: a page can link to another without being linked back). But I won't consider that structure here, as I am only interested in the distance between one fixed page and the others.

I said that there was a twist. The twist is that my program only considers root urls, not sub-pages. Root urls are urls of the form http://wikipedia.org or http://www.bbc.co.uk, but urls of the form http://twitter.com/alexandra are not root urls, as they refer to sub-pages of the root (the root of http://twitter.com/alexandra is http://twitter.com).
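In code, the distinction is just "does the url carry anything beyond the scheme and the host". A small Python sketch of the idea (again not my actual implementation):

```python
from urllib.parse import urlparse

def root_url(url):
    """Reduce a url to its root: scheme plus host, dropping any path."""
    parts = urlparse(url)
    return parts.scheme + "://" + parts.netloc

def is_root_url(url):
    """A root url has no path (or just '/') and no query string."""
    parts = urlparse(url)
    return parts.path in ("", "/") and not parts.query

print(root_url("http://twitter.com/alexandra"))     # http://twitter.com
print(is_root_url("http://www.bbc.co.uk"))          # True
print(is_root_url("http://twitter.com/alexandra"))  # False
```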

Considering only root urls means considering only a subset of the web, and a very tiny subset at that. For instance, http://wikipedia.org is the root page of more than 10 million pages, and those pages are not taken into account because their urls are not root urls. The main reason for this restriction is that without it my program was literally downloading all of deviantart.com into my database, which in itself was not a problem for me, but it wasn't adding any new domain names to the search because of the way deviantart rewrites the urls appearing on its pages.

But beyond restricting yourself to a subset of the web, you also work with a different notion of distance. The distance between weblog.alseyn.net and nsa.gov, for instance, is greater than the same distance computed over the entire web. weblog.alseyn.net might refer to a specific wikipedia page which in turn refers to the NSA website, putting the NSA web page at true distance 2 from the origin (assuming the origin doesn't refer to the NSA web page directly). But my program would not discover this path, because it is not allowed to leave a page along a non-root url (in this case, the url of the wikipedia page).

The exact regex I used was:

```php
preg_match('/http:\/\/[^"\s<>\')\/\?]*/', strtolower($line), $primary_matches);
```

I know I could have done a better job (I actually did in an earlier version, when I was registering every single url in the database for statistical analysis), but this got the job done nicely.
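Incidentally, the character class in that regex excludes `/`, so a match stops right after the host name -- which is part of how the root-only behaviour falls out naturally. A rough Python equivalent (the sample line below is made up):

```python
import re

# Rough Python translation of the preg_match pattern above: grab "http://"
# followed by everything up to a character that usually ends a url.
# Because '/' is excluded from the tail, matches stop at the host.
URL_RE = re.compile(r'http://[^"\s<>\')/?]*')

line = '<a href="http://nucleuscms.org/index.php">Nucleus</a> and http://clustrmaps.com'
urls = URL_RE.findall(line.lower())
print(urls)  # ['http://nucleuscms.org', 'http://clustrmaps.com']
```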

alexa.com maintains a list of the most visited sites on the web (it can be found here [alexa.com]), and according to this list the ten most visited sites are (in decreasing order of traffic): google.com, facebook.com, youtube.com, yahoo.com, blogspot.com, baidu.com, wikipedia.org, live.com, twitter.com, qq.com.

The question I asked myself was: how far away are those pages from weblog.alseyn.net? And the surprising answer is that within distance 6 of the origin, only 8 of them could be found, not all 10. The 8 lucky ones, given with their distance to the origin, are: facebook.com (2), twitter.com (2), youtube.com (2), google.com (3), wikipedia.org (4), yahoo.com (4), baidu.com (6), live.com (6). I also noticed that facebook and twitter are always surprisingly close to any web page (my program accepts any page as the origin and works from it), mostly, I presume, because everybody nowadays puts a facebook or twitter button on their pages, even (and above all) root pages.

So the two missing top-ten entries are blogspot.com and qq.com. I am not too surprised about qq.com (even though, with baidu.com in the lot, qq.com would show up sooner or later), but I would have expected blogspot.com itself (as a root url) to be mentioned somewhere.

In case you wonder why I stopped at distance 6: after 6 I had a few problems. The most important is that the computation takes so long (at least one hour to go from 1 to 6) that I got bored, and I knew that distances 7 and 8 would bring in just too many pages (exponential growth -- and this is particularly true if you compute true distances). There was also the fact that my program could not run more than 10 threads at once (maybe I should blog about multithreaded PHP one of these days), which added to the timing/boredom issue. The database (MySQL) and the bandwidth were not a problem, though I know they would become one at some point.
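For the curious, a capped worker pool like mine is easy to sketch. This is a Python illustration with a 10-worker limit (my crawler is PHP, so this is a translation of the idea, not the real thing; the urls in the frontier are just examples, and any network error simply yields an empty page):

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    """Download one page; return empty text on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""

# Pages at the current distance, waiting to be downloaded
frontier = ["http://alseyn.net", "http://nucleuscms.org"]

# max_workers=10 mirrors the 10-thread cap mentioned above
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, frontier))
```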

Here is the number of pages found at each distance from weblog.alseyn.net:

Distance 0 : 1 page (the origin itself)

Distance 1 : 6 pages (http://www.w3.org, http://alseyn.net, http://pascal.alseyn.net, http://aubrey.alseyn.net, http://nucleuscms.org, http://clustrmaps.com)

Distance 2 : 45 pages

Distance 3 : 285 pages

Distance 4 : 1731 pages

Distance 5 : 8869 pages

Distance 6 : 55450 pages

The total number of root urls at distance 6 or less from the origin is 66342. The entire MySQL dump can be found here: 20110510-web-distance.sql.zip [weblog.alseyn.net]
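The counts above grow by a roughly constant factor per level, which is the exponential growth I mentioned earlier. A quick sanity check on the ratios (using only the numbers from the table above):

```python
# Pages found at each distance, from the table above
counts = {0: 1, 1: 6, 2: 45, 3: 285, 4: 1731, 5: 8869, 6: 55450}

# Growth factor from each level to the next
ratios = [round(counts[d] / counts[d - 1], 1) for d in range(1, 7)]
print(ratios)  # [6.0, 7.5, 6.3, 6.1, 5.1, 6.3]
```

At roughly a factor of 6 per level, distance 7 alone would likely mean well over 300000 more pages to download -- hence the boredom.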

There are many other questions I could look at, but I will stop here; otherwise I will turn all this into a full-time job :-)
