4/22/2003

Blogmatcher redux, RSS Search, Microsoft, HTML Tidy and Sebastian

Filed under: Blogging Links — Tim @ 11:34 am

A couple of days ago I mentioned Ryo and his pet hermit crab: BlogMatcher. It appears that the crab has moved into a roomier conch shell, as Ryo stated:

I completely rewrote the searching/matching code in C (instead of PHP), and results pop up in less than a second now (in fact, often < 0.5 seconds). I've also added pagination for easier navigation.

I've refined the search algorithm a bit, partially inspired by your comments. I added some very basic statistical analysis in the scoring algorithm so that common URLs are scored down while uncommon URLs are marked up (at least in theory). It's still a little skewered for your blog, but I think the results have been improved somewhat. Also, the index has grown to 2900+ blogs (and I'm sure it'll continue to grow).

I gave it another whirl and was impressed with the results — best of all there was no bitter aftertaste: Pure, Clean & Fresh.
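For the curious, the "score common URLs down, uncommon URLs up" idea Ryo describes can be roughly sketched in Python. This is only a guess at the general technique (an inverse-frequency weighting), not BlogMatcher's actual C code; the function name `score_matches` and the sample data shape are made up for illustration.

```python
import math
from collections import Counter

def score_matches(my_links, blog_links):
    """Rank blogs by shared links, weighting rare URLs above common ones.

    blog_links maps a blog name to the list of URLs it links to
    (a hypothetical shape for the index, not BlogMatcher's real one).
    """
    total = len(blog_links)
    # Count how many indexed blogs link to each URL.
    doc_freq = Counter(url for links in blog_links.values() for url in set(links))
    scores = {}
    for blog, links in blog_links.items():
        shared = set(my_links) & set(links)
        # A URL every blog links to adds log(1) = 0; rarer URLs add more.
        scores[blog] = sum(math.log(total / doc_freq[url]) for url in shared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this weighting, two blogs that both link to a site everyone links to score nothing for it, while sharing an obscure link counts for a lot.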

Ryo was good enough to put up an FAQ for your reading pleasure as well. The only major change that I would like to see is some sort of static page that people can click through to and automatically see their search results (as opposed to typing the search in every time).

In addition to BlogMatcher scanning Weblogs.com, no fewer than three other services do as well: Organica, Metaweblog and Penthouse (notice how everything semi-important with me always degenerates into something to do with sex).

RSS Search - as the name suggests, it is a search engine that scours RSS feeds. Now just like Feedster, typing in my name reveals that Tim Robbins is the Big Man on Campus. Again, he’s a great actor, but in the blogosphere he best step down. Submit your blog here and hope for the best.

While looking for some dirt on the Opteron, I stumbled upon this article at ExtremeTech discussing the research and development projects over at Microsoft. Yes, despite my disdain for their business model they still work on some neat-o projects. One that I was particularly interested in was the StuffI’veSeen project. This software basically tracks and tags every bit of digital information you bump into throughout the day — I’d like to see this installed on John Poindexter’s computer.

Be sure to also visit Microsoft’s Research Project page; the Terraserver is the same one I used to come up with my GeoURL.

HTML Tidy - This handy piece of software fixes the annoying noncompliant markup that people like myself create. In addition to the downloadable open-source version of the application, you can also use this browser-based online copy.

I’ll end this with a remembrance for Sebastian, the pet hermit crab my younger sister gave me 7 years ago and I forgot to feed… Let us never mention his name again:

sebastian.gif

AMD Opteron, 64-bit Computing for the Rest of Us

Filed under: Technology — Tim @ 8:15 am

For those of you not following the CPU industry, Advanced Micro Devices (AMD) unveiled their make-or-break CPU design to the world today. Other than for financial reasons, why is this release important?

Here’s the situation: each year more applications and software packages are developed that require additional processing power and memory (RAM). There is a memory barrier (limit) that current 32-bit solutions cannot bypass: 4 gigabytes. Eventually the makers of CPUs (Intel, AMD, Motorola, etc.) would have to migrate to or design a CPU that could jump over this 4 GB barrier.
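Where does the 4 GB figure come from? A 32-bit address can name 2^32 different bytes, and that works out to exactly 4 gigabytes. A quick Python sketch of the arithmetic (the helper name `addressable_bytes` is just for illustration, and it assumes one address per byte):

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses an address of this width can name."""
    return 2 ** address_bits

GIB = 2 ** 30  # one gigabyte in the binary sense (1,073,741,824 bytes)

print(addressable_bytes(32) // GIB)  # 4
print(addressable_bytes(64) // GIB)  # 17179869184
```

The second number is the theoretical ceiling for a full 64-bit address; actual chips typically wire up fewer address bits than that, but the ceiling is still far beyond 4 GB.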

Meanwhile, the addressable memory limit comes closer and closer each day. Currently, just a handful of computer users need a system that can utilize more than 4 GB of memory. However, as software evolves, more and more memory is needed for adequate performance. Databases, for instance, perform better the more memory they have access to. The 4 GB limit is a sticking point for companies and individuals that utilize very large databases. The same goes for computer animation (CGI) in movies such as Lord of the Rings or Star Wars; the software used in these two movies required access to more than 4 GB (this was largely bypassed by spreading the work across hundreds of CPUs in Beowulf clusters).

Imagine trying to run the applications you currently use on one of your older computers. Among other things, your old system could not perform adequately (if at all). When Windows 95 was released, users discovered that having less than 32 MB of RAM meant running a less-than-optimal Windows. Microsoft and others encouraged users to purchase more RAM so that Windows would run properly at “full speed” (though they also owned stock in Kingston and other RAM manufacturers, suggesting that they had ulterior motives).

AMD and Intel are going about a 64-bit solution in two different ways. Intel has a two-part plan to migrate its clients. The first part is to continue to develop the Pentium line (the new Pentium ‘5’, dubbed ‘Prescott’, will come out in the fall) for the next 3 or so years. Simultaneously, Intel developed a 64-bit solution called the Itanium, originally released in May of 2001. However, the initial release of the Itanium was greeted by less-than-enthusiastic consumers and very few sales, for several reasons:

- The chip costs a small fortune, starting at ~$1500, something neither enthusiasts nor IT managers are willing to shell out.
- These unwilling customers might have purchased the Itanium if the CPU had performed better than it did and if there had been more applications developed for it.

You see, the Itanium uses a completely ‘new’ (actually 10+ years old) design that is incompatible with the software currently available. Software developers have to rewrite much of their code in order for their applications to work, creating headaches and causing large amounts of money to disappear. Intel has spent years working with HP and other companies to help build an infrastructure and support base around the chip, along with investing over $100 million (subsidies) to entice both start-ups and existing software firms to port their software over (it’d be like designing the world’s fastest car with an engine incompatible with existing fluids).

AMD, on the other hand, designed the Opteron, which is both 64-bit and backwards compatible with existing 32-bit applications, so it works right out of the box. Additionally, the Opteron is priced at around half the cost of the Itanium. Also note that there were additional architectural improvements, such as an on-chip integrated memory controller, which lowers latency and increases the performance of many applications. And, in what could be considered another coup, AMD introduced their first HyperTransport-based product. This innovative technology is itself worthy of a blog post, one that I might do later.

What does this all mean to you? If you purchased a computer in the past year or so, you probably have no need to upgrade, especially if all you do is blog, surf and IM. However, if you are looking at upgrading, begin the process of becoming an educated consumer. Read through reviews, attempt to learn some of the lingo being thrown around and keep up-to-date in this evolving new market.

One additional note: the chip AMD released today is geared towards servers and workstations. The desktop version will not be available until sometime in September, so you can take your time, no panic.

Distributed Computing Projects, Yesterday and Today

Filed under: Technology — Tim @ 1:01 am

In the mid-to-late ’90s one of the new creative approaches to number crunching was a project called Distributed.net. The goal of this project was to utilize the spare CPU cycles of thousands of ‘normal’ computers around the globe to crack an encrypted RSA key.
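The basic trick behind this kind of project is to chop the space of possible keys into blocks and hand each block to a different machine. Here is a toy Python sketch of that idea. It is not Distributed.net's actual client or protocol; the function names, the tiny key sizes and the `is_correct_key` callback are all made up for illustration, and the loop runs serially where a real project would farm the blocks out in parallel.

```python
def work_units(keyspace_bits, unit_bits):
    """Yield (start, end) key ranges, end exclusive, covering the keyspace."""
    unit_size = 2 ** unit_bits
    for start in range(0, 2 ** keyspace_bits, unit_size):
        yield start, start + unit_size

def search_unit(start, end, is_correct_key):
    """What one volunteer machine does: try every key in its block."""
    for key in range(start, end):
        if is_correct_key(key):
            return key
    return None

def crack(keyspace_bits, unit_bits, is_correct_key):
    """Stand-in for the central server handing out blocks one at a time."""
    for start, end in work_units(keyspace_bits, unit_bits):
        found = search_unit(start, end, is_correct_key)
        if found is not None:
            return found
    return None
```

For example, `crack(16, 8, lambda k: k == 12345)` splits a 16-bit keyspace into 256 blocks of 256 keys each and returns the matching key once the block containing it is searched.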

Back in high school all the cool geeks (versus “uncool” geeks) teamed up with the BeOS team to crack the encrypted key (actually, it was Digital Terra, a sub-team of the BeOS team). A few years later several of us moved on to the well-known SETI@home project - I even started a team called The Vatos that appears to still be submitting work units (not my doing).

Another popular distributed computing project (though most users don’t know that it’s running) is within the Google Toolbar. In addition to rating the PageRank of your site, the Toolbar analyzes and processes protein folding for the Folding@home project.

One particular implementation that has caused a flurry of lawsuits and an increase in sales of vaselinated products is Kazaa. This peer-to-peer program works using a decentralized distributed model in which those running the program act as both host and user simultaneously. Not that I use it [wink] but I “heard” that at any given point over 4 million users have at their disposal an aggregated collection of about 6.5 Petabytes (6,500,000 Gigabytes). Kazaa is actually a mixture of both distributed bandwidth and filesharing (versus number crunching) and there are several other innovative projects that are currently trying to emulate the rampant success of Kazaa.

Which brings me to Grub. After reading the futuristic article regarding the Semantic Web and Google, I’ve been looking for applications and AI-agents that can scour RDF-based sites for useful information. A friend of mine sent me a link to Grub, which uses a distributed bandwidth technique to continuously index the web. This open-source project works as follows: users download a copy of the indexing software, install it and initialize bots (spiders, crawlers, etc.) which scour through the web, going from site to site to site to site… The bots then send the results to Wayne Enterprises who in turn sells the data to Emu Ranchers in Texas (they’re all over here).

Actually, the initial results are sent back to the central servers in Grub’s San Francisco office, where they are compiled and organized.
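To make the "many volunteers crawl, one central server compiles" flow a bit more concrete, here is a bare-bones Python sketch. It is not Grub's real software or protocol: how URLs get assigned to volunteers is my own assumption (a simple stable hash), the function names are hypothetical, and `fetch` is a stand-in you pass in rather than a real network call.

```python
import hashlib

def assign_client(url, num_clients):
    """Pick a volunteer for a URL with a stable hash (an assumed scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_clients

def distribute(urls, num_clients):
    """Split the URL list into one batch per volunteer."""
    batches = [[] for _ in range(num_clients)]
    for url in urls:
        batches[assign_client(url, num_clients)].append(url)
    return batches

def crawl_batch(batch, fetch):
    """What one volunteer's bot does: fetch each page in its batch."""
    return {url: fetch(url) for url in batch}

def central_index(urls, num_clients, fetch):
    """Stand-in for the central servers merging everyone's results."""
    index = {}
    for batch in distribute(urls, num_clients):
        index.update(crawl_batch(batch, fetch))
    return index
```

Because the hash is stable, the same URL always lands with the same volunteer, so no page gets crawled twice in one pass.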

The long-term goal Grub hopes to accomplish is to have hundreds of thousands of users running the program, which then indexes the entire web in real-time (no more 30-day waits). This in turn allows ideas like the Semantic Web to be a reality, enabling a host of innovative approaches to literally every facet of your life: travel plans, purchases, babysitters, carpet cleaning and even locating swinger pubs.

The current method that most search-engine companies use is to send out their own bots and then crunch the results with their own computers. Google, for instance, uses over 100,000 off-the-shelf processors in a distributed computing model to create those .052394731 second results. They re-compile their database once every 30 days or so, using various algorithms to enhance and interpret the data their bots send them.

Eventually an application like Grub will take Google’s place - though I wouldn’t be too surprised if Google came out with something similar to Grub… to stay in business and all that jazz. Why would something like Grub succeed?

- First, because it is feasible to do so right now - it’s not a far-fetched theory that can’t be implemented (I have Grub running on two computers)
- Second, users such as me want access to recent data (real-time if possible) for uses ranging from computer purchases to online dating (no comment)
- Third, Grub is a shorter word than Google, thus saving that much more ink and bandwidth

Oh and speaking of P2P computing, the co-founder of Napster is now utilizing the distributed computing model for blocking spam. It’s called Spamnet and is being made through his new company, Cloudmark. I’ve used it for a couple of months now and really have nothing to complain about. It’s simple to install (integrates seamlessly with Outlook) and uses a “voting” system to categorize and filter mail. Plus, none of the mail is permanently discarded, so you can go back and “Unblock” that Yahoo Group mailing list post or look at why Viagra will make you a happier person.
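A “voting” filter of this sort might look something like the Python sketch below. To be clear, this is not Cloudmark's algorithm, which I have not seen; the class name, the fingerprinting scheme and the vote threshold are all assumptions made up to illustrate the idea of users collectively blocking and unblocking mail.

```python
import hashlib
from collections import defaultdict

class VotingFilter:
    def __init__(self, threshold=3):
        self.threshold = threshold      # hypothetical cutoff, not a real setting
        self.votes = defaultdict(int)   # fingerprint -> net "block" votes
        self.quarantine = []            # blocked mail is kept, not deleted

    @staticmethod
    def fingerprint(message):
        """Reduce a message to a hash that ignores case and spacing."""
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def block(self, message):
        self.votes[self.fingerprint(message)] += 1

    def unblock(self, message):
        self.votes[self.fingerprint(message)] -= 1

    def is_spam(self, message):
        return self.votes[self.fingerprint(message)] >= self.threshold

    def deliver(self, message, inbox):
        """Route mail to the quarantine or the inbox based on the votes."""
        target = self.quarantine if self.is_spam(message) else inbox
        target.append(message)
```

Once enough users have voted to block a message, every copy of it gets routed to the quarantine list instead of the inbox, and an “Unblock” vote pulls the count back down.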

For further reading about the semantic web, be sure to look at OWL and RDF. And for a listing of various distributed computing projects, visit here - Gateway is even getting in on the P2P action.