April 23, 2003

Grub, Google and the Semantic Web

Filed under: Semantic Web — Tim @ 6:50 am

Dr. Elwyn Jenkins penned a good overview discussing Grub, Google and search engine convergence (or as extropians might extol: “the Singularity”).

Here’s how the numbers break down:

As of this writing, Google has 3,083,324,652 web pages indexed.

Google crawls about 150 million pages each day, so it takes them about 20 days to crawl what they claim is the web. They arguably have the largest crawling farm in existence, with access to over 100,000 processors and 263,000 hard drives (though much of what is stored on them belongs to their other services like Newsgroups, Images, Cached sites, etc.). Compare that to Grub:

2,193 clients running Grub crawled 124,456,219 URLs in the last 24 hours.

The Grub client on my computer alone has crawled a total of 150,000 URLs in the past day and a half; I have Grub running on another computer and it has racked up 85,000 URLs in the same amount of time (it was disconnected for several hours – the monkeys mashed too many bananas on the keyboard, you know how that is).

Additionally, Grub’s FAQ section states that there are around 10 billion web pages in existence and about 2 million more added each day.

If we want real-time indexing let’s look at the necessary numbers to do so:

* ~30 days in a solar month
* 24 hours in a Romantic day
* 60 minutes in an Olmecian/Sumerian hour
* 60 seconds in a non-metric minute.

Now let’s assume a few things. First, everything stays the same (ceteris paribus). Second, that only 3 billion pages need to be crawled.

125,000,000 (the 124,456,219 above, rounded) / 24 hours = 5,208,333.33 pages crawled in an hour
5,208,333.33 / 60 minutes = 86,805.56 pages crawled in a minute
86,805.56 / 60 seconds = 1,446.76 pages crawled in a second

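3,000,000,000 pages / 1,446.76 pages per second = 2,073,600

Treat that 2,073,600 as a rough stand-in for the number of Grub users needed to index the web in real-time. Strictly, pages divided by pages-per-second gives seconds: it is the 24 days that today’s 2,193 clients would need for one pass over 3 billion pages.

If 10 billion is the real number of pages, just multiply 2,073,600 by 10/3 (about 3.33) = 6,912,000 users.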

So, we’re talking about numbers in the SETI@home and Kazaa ball-park, and that assumes nothing changes (like more efficient code or bandwidth allocation).
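For anyone who wants to check the arithmetic, here is a minimal Python sketch using only the figures quoted in this post. The function names are my own inventions for illustration. `clients_needed` goes one step beyond the numbers above: it assumes every new client crawls at today's average per-client rate and that throughput scales linearly, and the once-a-day recrawl target in the last line is just one possible reading of "real-time", not something Grub or Looksmart has stated.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the post.
GRUB_CLIENTS = 2_193
GRUB_URLS_PER_DAY = 125_000_000      # 124,456,219, rounded as in the post
GOOGLE_PAGES_PER_DAY = 150_000_000
GOOGLE_INDEX = 3_083_324_652


def pass_seconds(pages: int, urls_per_day: float) -> float:
    """Seconds needed for one full crawl of `pages` at a fixed daily rate."""
    return pages / (urls_per_day / SECONDS_PER_DAY)


def clients_needed(pages: int, recrawl_seconds: float) -> float:
    """Clients needed to recrawl `pages` every `recrawl_seconds`.

    Assumption: each client keeps today's average rate (linear scaling).
    """
    per_client_rate = GRUB_URLS_PER_DAY / SECONDS_PER_DAY / GRUB_CLIENTS
    return pages / recrawl_seconds / per_client_rate


# The post's two headline figures (dimensionally, these are seconds).
print(round(pass_seconds(3_000_000_000, GRUB_URLS_PER_DAY)))   # 2073600
print(round(pass_seconds(10_000_000_000, GRUB_URLS_PER_DAY)))  # 6912000

# Google's crawl cycle in days, from its own quoted numbers.
print(round(GOOGLE_INDEX / GOOGLE_PAGES_PER_DAY, 1))           # 20.6

# Clients for a once-a-day pass over 3 billion pages (my assumption).
print(round(clients_needed(3_000_000_000, SECONDS_PER_DAY)))   # 52632
```

Changing `recrawl_seconds` shows how sensitive the client count is to what "real-time" is taken to mean.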

Here are a few ideas (Looksmart may already be planning some of these; if so, I am unaware of it).

- First, decentralize the servers using a supernode/shard based system like Kazaa has (or DNS does). One benefit of this is bandwidth based: no “one” entity is in control of it, so if some servers go down, there are alternatives. Search times could also be minimized, since queries would be routed through the nearest node.

This decentralized system could be protocol/standards based, so other companies, organizations and mimes could attach their own results (so an RDF-only crawler could merge with the system – which assists the growth of the Semantic Web).

- Second, more APIs. I’m sure this is being worked on currently but having the ability to search with Grub as you do with the Google Toolbar would be gravy. Actually, all Google has to do is adopt the same sort of system Grub has (it’s open-sourced so they should at least try it out and explain why it’s an inferior solution to their system). If Google did evolve to use this distributed crawling method, they already have a large library of useful APIs to use (instead of having Looksmart or others reinvent the wheel).

- Third, tell the world, or at least geekdom, what you plan on doing with the project in the long term. So far many geeks think you’re just trying to use individuals like myself as a tool – I certainly hope that is not the case (I doubt it is).
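To make the first idea a little more concrete, here is a rough Python sketch of shard-based routing with failover. Everything in it is hypothetical: the `SHARDS` table, the `node-*.example` hostnames and both function names are made up for illustration, and this is not how Grub, Kazaa or DNS actually route anything. It only shows the general shape: a stable hash sends each URL to one shard, and each shard lists several supernodes (nearest first), so a dead node has an alternative.

```python
import hashlib

# Hypothetical supernode table: shard id -> replica supernodes, nearest first.
SHARDS = {
    0: ["node-a.example", "node-b.example"],
    1: ["node-c.example", "node-d.example"],
    2: ["node-e.example", "node-f.example"],
}


def shard_for(url: str, shard_count: int = len(SHARDS)) -> int:
    """Map a URL to a shard with a stable hash (same URL, same shard)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def pick_supernode(url: str, down: frozenset = frozenset()) -> str:
    """Return the first replica for the URL's shard that is not marked down."""
    for node in SHARDS[shard_for(url)]:
        if node not in down:
            return node
    raise RuntimeError("no live supernode for this shard")


url = "http://example.com/"
primary = pick_supernode(url)
fallback = pick_supernode(url, down=frozenset({primary}))
print(primary, fallback)
```

A real system would also need some way to discover nodes and notice when they go down, which this sketch leaves out entirely.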

Yup, the distributed computing approach utilized by Grub is innovative and even exciting; I look forward to seeing how it will evolve.

One last note: Google just acquired Applied Semantics, whose patented CIRCA Technology:

understands, organizes, and extracts knowledge from websites and information repositories in a way that mimics human thought and enables more effective information retrieval. A key application of the CIRCA technology is Applied Semantics’ AdSense product that enables web publishers to understand the key themes on web pages in order to deliver highly relevant and targeted advertisements.

Even if it is just for ads, this buyout shows there is a market for AI agents that can effectively understand what is being discussed on a website, making RDF and OWL that much more important.

MSNBC Weblog Central… again

Filed under: Blogging — Tim @ 4:30 am

I’ve mentioned the MSNBC Weblog Central site a couple of times now. The first time was to rag on them, the second was to explain that my ragging was a little too quick to judge. Now my third time is to show you how one simple email can draw both publicity and legitimacy to one’s blog.

As you can see, I made their ‘Best of Blogs’ for the week. What I did was send them a comment regarding my blog and why someone would be interested in it. You can too (feel free to resend mine if you’d like): just scroll to the bottom and fill out the comments section. I think I sent mine in two weeks ago, so don’t expect any immediate gratification.

Here’s their comment on my comment:

Tim Swanson explains from Dallas that “syndicating, promoting, advertising and marketing your blog are often not emphasized enough. I spend nearly all of my blogging hours searching for sites and blogs that specifically do that very thing. And the list grows daily; to include more than 50 sites anyone and everyone can use to syndicate their weblog more efficiently as to generate more traffic.”

Thanks, Tim, for that service.

That actually sounds sincere, so I won’t knock it.

So, someone like Mr. Michael Fagan who has oodles of blogging resources would easily get picked for the list (be more like Mike… I wanna be like Mike).

Speaking of resources, here are a few more I found:

The Truth Laid Bear Ecosystem – The TTLB Blogosphere Ecosystem is an application which scans weblogs once daily and generates a list of weblogs ranked by the number of incoming links they receive from other weblogs on the list. Add your blog by filling out this form.

Malaysia Central – Bloggers from Malaysia.

Twin Cities Babelogue – For bloggers from the Twin Cities area (that’s in Minnesota). I think Dallas and Fort Worth should be renamed the Sorta Twin Cities Area, along with everyone changing their twang to a more Canadian variety: aboot, eh, donchaknow?

Lastly, the Official MSNBC Weblog Index. All they need to do is copy my list. That’d be too easy.