7/30/2007

Buy it, Build it, then Unbreak it

Filed under: Debate, Semantic Web, TEH INTARWEB, Technology — Tim @ 8:04 am

Four years ago I wrote about a metasearch project called Grub. At the time, it was being hailed as an open-source way to index the interweb in real time.

And it pretty much went nowhere for a number of reasons, one of which was that the parent company was free-riding off of its install base.

Another reason is that end-users trying to search the results had a subpar experience when compared to the dominant search engines of the time, especially the bigger players like Yahoo, MSN, and Google.

And now it seems as if a bit of fresh air has been blown back into Grub’s little, presumably unsalted, heart.

Last week, Jimmy Wales (co-creator of Wikipedia) purchased the spider/crawler from its beleaguered parent company and intends to “truly” open-source it this time.

In Om Malik’s write-up of this event, he mentions that building a scalable infrastructure was perhaps the prime culprit in Grub’s original demise; I still think that is merely one piece of the overall service.

No matter how big your index is, providing relevant results and presenting them in a useful manner still seems like the most important attribute desired by consumers.

And while a number of factors have gone into its success, arguably one of the reasons Google dominates the industry is that its developers have figured out not just how to index the web in a timely manner, but also how to present it in a productive, meaningful way. This is despite the fact that both Yahoo and MSN have indices of comparable size and scope.

Here’s to hoping that Wales & Co. will not fulfill Santayana’s maxim. Heck, if they really are on top of things this time, I might even utilize the fat pipe I have here to help with the project.

Note: be sure to check out the Daft Punk song, “Technologic.”

8/6/2006

Oh what could have been: topical networks

Filed under: Blogging, Collectrix, Economics, Semantic Web, TEH INTARWEB, Technology — Tim @ 1:46 am

Looks like Feedburner is getting in on the same kind of business model Collectrix.com wanted to do 3 years ago.

Best of luck to them (I think they can do it too).

And it most certainly is a sign of the times.

Via Mike Ewens.

8/4/2006

Mispronounced words: Powerpoint edition

Filed under: Culture, Foolish, Semantic Web, TEH INTARWEB — Tim @ 4:46 pm

Below are four words that I have heard mispronounced over the past several weeks during presentations. The way I break it down is not necessarily the proper phonetic way:

Scarcity. This should be pronounced as if it were the following two words: scare + city. I unfortunately heard it stated as scar-city.

Niagara. The first part should be pronounced like the word “nigh” as in, “the end is nigh.” Then “ag” from the word “agriculture.” The “arah” sounds like the portion from the name “Sarah.” So, nigh + ag + arah. Not nigh-uhh-garah.

Aesthetics. The “aes” portion sounds similar to the first part of the word “Aztec.” The “the” in the middle sounds like the “th” in “athletic” or “thespian.” Also, the “thet” portion rhymes with “vet.” The last portion “tics” or “ics” sounds exactly like the insect “ticks.” Thus, az + thet + ticks. Not eyes-thet-icks.

Efficacy. This is not efficiency. It means something entirely different. The first part of the word “eff” sounds like the letter “f” or the first part of the word “effin.” The “i” in the middle sounds like the “i” in “icky.” The “ca” sounds like the first part of “cut.” And the “cy” sounds like the word “sea” or “see.” Therefore, f + ick + cuh + sea. Not eff-ick-kasey.

8/3/2006

Metrics and measurement of feeds

Filed under: Blogging, Culture, Google, Semantic Web, Syndication, TEH INTARWEB, Technology — Tim @ 11:58 pm

I’ll be honest, I don’t care much for the Google Reader. Here’s my brief review of it.

After a solid year, I have still stuck with Bloglines (despite trying others such as Rojo).

However, some people do like it and the gReader team has recently provided some interesting aggregated numbers for our consumption.

Note: the numbers cover namespace extensions only. They looked at things like how often a Creative Commons license is embedded in a feed (and not whether the feed was RSS or Atom).

Via Niall Kennedy.

7/29/2006

Social Bookmarking for Novices and Pros Alike

Filed under: Google, Semantic Web, TEH INTARWEB, Technology — Tim @ 2:10 pm

If you’ve grown tired of trying to organize your bookmarks at each computer you use, perhaps you might be interested in two handy services.

One is Del.icio.us, which is now operated by Yahoo. The way it works is fairly simple. You create an account and install a little utility that integrates with your browser. Whenever you come across a site that you want to bookmark, you simply click the “tag this” button and it instantly becomes immortalized in your ever-growing cornucopia of links. You can also look at what other people are bookmarking and build RSS feeds off of specific tags (e.g. technology, golf, base jumping). See mine.

The other popular one is Notebook from Google. It operates in a similar manner, yet it is not only more substantive, but also influences the underlying ranking system of each site that is “noted.”

10/10/2005

General Use Search Engines Will Create The AI Of SciFi

Filed under: Google, Semantic Web, TEH INTARWEB, Technology — Tim @ 1:16 am

Numerous academic and commercial endeavors have been undertaken to develop practical AI applications which can “think” like human beings (i.e. they can understand both abstract and concrete meanings).

Among others, one project that has garnered attention over the past few years is Cyc. The developers created an online submission tool in which any Joe can submit a nugget of truth or piece of wisdom into its database. Over time this database has grown to hundreds of thousands of facts and concepts which can be queried for various uses (its technical engine reportedly has ties with the semantic web via OWL).

My thoughts: as original as projects like Cyc sound, I believe that general-purpose search engines from Yahoo, MSN or Google will be the first to achieve the intuitive ‘human-like’ cognitive abilities that projects like Cyc set out to build. If the modus operandi is a bastardized use of the Law of Large Numbers, not only do these firms have larger amounts of resources (financial, human capital, etc.), but they also have an ever-increasing user base that continues to add concepts, constructs and pithy fortune cookie factoids into their ginormous databases every second of every day (note: there are other methods for creating AI, not just statistical averages).

Perhaps with enough data mining we can one day figure out where Jimmy Hoffa really is.

10/7/2005

Google RSS Reader

Filed under: Google, Semantic Web, Syndication, Technology — Tim @ 9:32 pm

Google finally released a beta of an RSS reader today. It has the usual slick, clean interface; however, I find it lacking a couple of features:

I actually like the scrollbar on the left side that Bloglines provides for navigating through the feed list quickly. With the GReader I have to click manually each time I want to go up or down (including Page Up/Down), rather than being able to slide a scrollbar (although there is a shortcut key for quick navigation).

Also, I like the item preview display that Bloglines has (GReader simply has headlines which you must click on for further information - no blurb).

In addition, perhaps organizing via newsfeed could be added as an option as well (currently relevancy and date are the only two sort options).

Layout issues aside, I was able to import my OPML file within a few seconds, and everything else seems responsive as well. The keyboard shortcuts are nice too, probably addictive in the long run. It also has the obligatory ‘tag’-creating ability for article labelization (sic).

I am sure it will only be a few months until we see it integrate seamlessly with Blogger (an ‘InstaBlogThis’ feature over the horizon?).

I give it a B-. There is a discussion of the service up at Google Groups.

10/3/2005

How Much Longer Will You Need To Use That Library Card

It has been all of 3 days since I last mentioned anything about Google. Today’s post is a quick discussion of Google Print.

The endeavor in a nutshell: Google is financing a book-scanning operation of material found at Stanford, Harvard, University of Michigan, Oxford and the New York Public Library. All books, including those that are copyrighted, are included in its database, which can be searched just like the normal web query tool we have grown accustomed to using. In addition, those little text ads on the side will be displayed each time your search hits a keyword someone paid for (e.g. economics, basketball, Britney Spears).

When Google first announced the library portion of the project, two large groups cried foul. The first was not a surprise: brick-and-mortar publishers. In fact, the Authors Guild was so upset that it has actually filed a lawsuit to prevent Google from displaying any information from copyrighted books (here is Google’s non-PR speak response). Due in part to these legal concerns, Google stopped scanning copyrighted texts in August; however, it will resume scanning in November. In the intervening weeks, any publisher or writer who wishes to opt out of the program can do so by simply filling out this form.

The other group up in arms was a set of European nation-states, most notably France. In April, several uber nationalists such as Jacques Chirac suggested that Google’s actions will invariably bias the scope of material found online toward the Anglo-Saxon variety: “Google’s plans have rattled the cultural establishment in Paris, raising fears that the French language and ideas could be just sidelined on the worldwide web, which is already dominated by English.”

This past September, Google announced that it had begun working with various non-English-language European publishers to bring them into the program.

Despite these accommodations, today the European Union and various other regulatory bodies announced they are funding an initiative to place the same material online at taxpayers’ expense (versus financed privately via Google).

To add to this helter-skelter trail, Yahoo announced yesterday that it will be working with the University of California, the University of Toronto and various archiving services (such as the Internet Archive) to scan and provide access to books in the public domain (it is called the Open Content Alliance).

There is a catch, however. Whereas Google will scan every book and allow users to search each text (although you cannot read the entirety of the book unless otherwise permitted by the author or publisher, similar to Amazon’s Search-Inside-The-Book feature), Yahoo is realistically only going to be able to scan approximately 15% of the content available in these libraries. Another oddity in Yahoo’s approach is that it will allow anyone to index the text they scan (mining via metadata like RSS), including Google (whereas Google is relatively closed and proprietary). That raises an unanswered question about the business model for Yahoo (perhaps they will use it as a tax write-off).

Incidentally, O’Reilly Media is opening up its volumes for free access via Yahoo’s book-scanning project, which is odd considering that Tim O’Reilly sits on the advisory board at Google and has publicly endorsed Google Print.

So where does that leave you, Mr. Internet User? I think this digitization movement can be seen almost unanimously as a win-win situation (sans the operations subsidized via taxes). This will enable people from every walk of life to find information that would otherwise be left in obscurity: it is empowering. And on a personal level, it is a fantastic resource to have on hand as a graduate student working on research projects (Google Scholar is also a great service).