Friday, April 2, 2010
Posted by About this Blog at 8:02 PM
Searching for information on electronic media is a field that goes back a long way, perhaps to July 1945 and the publication of Vannevar Bush's "As We May Think" in The Atlantic Monthly.
Gerard Salton is widely regarded as the father of modern search, having developed the System for the Mechanical Analysis and Retrieval of Text (SMART) Information Retrieval System at Cornell University in the 1960s. Since then, software engineers have worked with librarians and documentation specialists to develop software that quickly finds scientific information stored in databases containing tens of thousands of books and articles. These efforts have followed two parallel tracks:
Some automated the work of librarians, who index documents within a subject area, describe those documents with keywords, and then compile those keywords in a database called a thesaurus (not to be confused with the reference book of synonyms and antonyms). Employing these programs, the user (usually an expert) can perform complex research using Boolean operators (and, or, not, and so on).
Others wanted to automate the process fully by having the computer compare the words of the request with those in the documents. In these programs, like LexisNexis from Mead Corporation, the computer shows the user all the documents where the requested keywords appear, and the user can weight them by relevance. To prevent too many junk results in the form of irrelevant documents, the engineers created tools for sorting: The user can ask the machine to show only documents after a certain date, those where two keywords appear in proximity, or documents meeting other criteria.
The elegant simplicity of the latter approach intrigued data-processing specialists because it no longer took a specialist to query the database. Anyone could enter keywords, and no one had to prepare and index the documents beforehand. The hope was that documents could simply be digitized and stored in a database, available to be searched.
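To make the full-text approach concrete, here is a minimal sketch of keyword matching with the two filters mentioned above (date and proximity). It is only an illustration, not how Lexis or any real product worked: the toy corpus `DOCS`, the functions `build_index` and `search`, and the parameters `after` and `within` are all made-up names.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy corpus: (document id, publication date, text).
DOCS = [
    ("a", date(1995, 3, 1), "the court ruled on the patent appeal"),
    ("b", date(1997, 6, 15), "patent law and the appeal process in federal court"),
    ("c", date(1998, 1, 9), "an appeal for calm after the patent office fire"),
]

def build_index(docs):
    """Map each word to {doc_id: [word positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, _, text in docs:
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search(index, docs, words, after=None, within=None):
    """Return the sorted ids of documents that contain every requested word.

    after:  keep only documents dated later than this date.
    within: for two-word queries, keep only documents where the two
            words occur at most this many positions apart.
    """
    dates = {doc_id: d for doc_id, d, _ in docs}
    hits = set(dates)
    for w in words:
        hits &= set(index.get(w, {}))  # Boolean AND across the keywords
    if after is not None:
        hits = {d for d in hits if dates[d] > after}
    if within is not None and len(words) == 2:
        a, b = words
        hits = {d for d in hits
                if any(abs(i - j) <= within
                       for i in index[a][d] for j in index[b][d])}
    return sorted(hits)

index = build_index(DOCS)
print(search(index, DOCS, ["patent", "appeal"]))            # ['a', 'b', 'c']
print(search(index, DOCS, ["patent", "appeal"], within=1))  # ['a']
```

All three documents contain both words, but only one is about a patent appeal. The proximity filter removes the other two, which is exactly the kind of junk-result pruning these tools were built for.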
Language being what it is, however, the latter attempt to automate search has its disadvantages. For example, if you try to introduce synonyms or contextual meanings into the database, you create more volume and false positives. That's not necessarily a problem as long as the databases remain specialized within a limited field for use by professionals (like the legal documents for attorneys stored in Lexis), but using these programs for searching the Web was another matter.
When searching the Web, users could find plenty of documents containing the words they searched for, but there were too many irrelevant results. As the Web grew, and as more pages were assembled and indexed, search result quality deteriorated. As Page and Brin wrote in their 1998 paper titled "The Anatomy of a Large-Scale Hypertextual Web Search Engine," "'Junk results' often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results)."
To counter this failing, early search engines vacillated between two solutions. Some limited the size of their databases because adding pages produced worse results. Others, like Yahoo!, took an approach based on the thesaurus concept: They created elaborate systems to categorize and rank sites based on topic. A webmaster wanting to register a site was told to specify its category with keywords. Once submitted to Yahoo!, specialists called ontologists would check the description's relevance.
The thesaurus method of search posed significant problems. For example, suppose you typed the word horse into a search box and then pressed ENTER for your results. In response, you would see various search categories, such as Zoology, Sports, Art, and so on. Visit the Zoology branch and you'd find sites about the horse as an animal. Click the Sports track and you'd see pages about horsemanship and betting. The Art category would take you to sites on equestrian paintings. The Food section would reveal French recipes for horse meat. In Politics, you might find a rant by a British activist, complaining about a French conspiracy to eat his pet. Yahoo! employed hundreds of workers to analyze and sort web pages this way according to the thesaurus method, language by language, culture by culture. Clearly, the thesaurus method was inferior to, and much more time-intensive than, the automated search method, but automated search was far more expensive and complex.
Dissatisfied with the current state of search, Page and Brin looked for, and discovered, a way to automatically classify pages found in a search by their relevance or rank. Of course, they were not alone in trying to find a solution to the search problem.
For example, search engines like DirectHit tried to classify sites according to their cumulative use. If someone followed a link to a site and stayed a long time, that site was considered to be more relevant than one that was infrequently and/or briefly visited. This is how Lycos and HotBot still rank sites today.
Ranking pages by cumulative use has certain advantages over earlier methods, but the method also has inherent flaws. For one, cumulative use is hard to measure accurately. With today's tabbed browsers, which open several pages at once, a user might keep a page open for a long time without actually reading it, skewing the server statistics until they are at best unreliable and at worst meaningless. The method also lends itself to cheating. If I want to push my site up on the search page, all I need to do is write a small robot program that goes to the site, stays for a few minutes, leaves, and then comes back again using a different proxy IP address. Catch me if you can.
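A rough sketch shows how little it takes to game a ranking based on cumulative use. The scoring rule here (total seconds spent on a site) is a simplifying assumption, not DirectHit's actual formula, and the site names, the visit logs, and `rank_by_dwell` are hypothetical.

```python
from collections import defaultdict

def rank_by_dwell(visits):
    """Rank sites by total seconds spent across all visits, highest first."""
    totals = defaultdict(float)
    for site, seconds in visits:
        totals[site] += seconds
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(totals, key=lambda s: (-totals[s], s))

# Hypothetical visit log: (site, seconds spent on it).
organic = [("useful.example", 300), ("useful.example", 240), ("spam.example", 5)]

# A hypothetical robot that returns to spam.example ten times,
# staying three minutes on each visit.
bot = [("spam.example", 180)] * 10

print(rank_by_dwell(organic))        # ['useful.example', 'spam.example']
print(rank_by_dwell(organic + bot))  # ['spam.example', 'useful.example']
```

Ten automated visits are enough to flip the order, and nothing in the visit log distinguishes them from real readers.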
Like the developers of DirectHit, Page and Brin decided that reputation was the best way to measure a site's quality and relevance. But rather than measure a site according to the number and duration of hits, they looked at the nature of scientific research and the importance of citations.
For example, to judge the quality of an author, an idea, or a concept, researchers count how often the relevant article is cited in scholarly publications and then rank scientific articles by the number of times other articles refer to them. In the world of the Internet, links to pages are more or less the equivalent of citations. If I put a link in my text, encouraging readers to load a page on another site, chances are I consider it important or at least relevant. By counting the number of links to various pages, a search engine can classify those pages and obtain more reliable results. This forms the basis of Google's search algorithm.
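The citation-counting idea can be sketched in a few lines. This is the naive version only, with every link counting equally. The link graph `LINKS` and the function `inlink_counts` are hypothetical names for illustration.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}

def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # repeated links from one page count once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts

counts = inlink_counts(LINKS)
ranking = sorted(counts, key=lambda p: (-counts[p], p))
print(ranking)  # ['c', 'a', 'b', 'd']
```

Page c is "cited" by three other pages and comes first, while page d, which nobody links to, comes last.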
But Google isn't that simple. For one thing, not all citations have the same value, nor do all links have the same importance. For example, a citation in an article written by a Nobel laureate and published in a prestigious journal has more value than, say, one in a student's article in some little-known school's newspaper. In the same way, links coming from pages that are cited often are given more weight by Google than those coming from pages with few incoming links.
Google added other subtleties as well, such as the distance between words when a query contains several of them, and a system of weighting that gives more value to links from sites with many incoming links but few outgoing links. This mechanism made it possible to improve search quality greatly without the need for human intervention.
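The weighting described above can be approximated with a short iterative calculation in the spirit of the PageRank formula from Page and Brin's 1998 paper. Treat this as a simplified sketch rather than Google's actual implementation: the damping value of 0.85, the handling of pages with no outgoing links, the iteration count, and the names `LINKS` and `pagerank` are assumptions made for the example.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively spread each page's rank across the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its weight among its outgoing links, so a
                # link from a page with few outgoing links is worth more.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # weight evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

In this toy graph, page a has fewer incoming links than page c, yet it ends up ranked almost as high. Its main supporter is c, a heavily linked page whose single outgoing link passes all of its weight to a.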
As obvious as it may appear today, the method requires highly complex mathematics and involves the integration of several classes of problems. This is why initial support for Google came mainly from the scientific community. In fact, Google's initial success was due to a mix of programming theory and network sociology. And because of its novelty, Google qualifies as a genuine invention, which is why it interested scientific researchers and mathematicians.
This is an important detail. As you'll read in this book, throughout Google's history one of its main strengths has been its ability to maintain relationships with the academic community. The quality of these relationships stems from the personalities of the company's founders and their contacts with high-level researchers like Terry Winograd, their former college professor and now a Google consultant. But Google's work in areas of inquiry that interest researchers also enables the company to transform questions posed by its engineers into problems that mathematicians are eager to solve.