In March 2005, Google registered the trademark 'TrustRank' with the U.S. Patent and Trademark Office (USPTO). What might this tell us about Google's forthcoming initiatives, and how might a 'TrustRank' feature and its potential functionality fit alongside the existing Google 'PageRank' feature?
Photo credit: Diego Sapriza
PageRank (PR) is at the very core of the Google search engine. It is a system of Web site measurement that Web publishers are, typically, obsessed with, in particular how high their own Web site's PR is. In very simple terms, PR evaluates and ranks Web sites according to a computed value determined by the number of other sites linking to them, and by the PR of those linking sites.
So, although Google PR determines the 'importance' of a Web site, it does not determine its value in terms of the trustworthiness of the content on a site and of the site overall.
Indeed, spam merchants have been able to exploit this heavy dependence on the number of links pointing to a Web site. Through various devious means they artificially inflate the Google PR of their own sites, thereby making them appear higher in search results.
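To make the link-counting idea and its weakness concrete, here is a minimal sketch of a simplified PageRank computed by repeated passes over a link graph. It is an illustration only, not Google's production system: the function name, the toy graph, the damping value of 0.85 and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nobody links to "spam".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["a"]}
plain = pagerank(web)

# The same web after the spammer adds five throwaway pages that link to "spam".
farmed_web = dict(web)
for i in range(5):
    farmed_web[f"farm{i}"] = ["spam"]
farmed = pagerank(farmed_web)

print(f"spam score without link farm: {plain['spam']:.4f}")
print(f"spam score with link farm:    {farmed['spam']:.4f}")
```

In this toy graph the score of "spam" rises once the farm pages point at it, even though no reputable page has linked to it. That is the kind of manipulation described above.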
This is where 'TrustRank' may come in. A paper published last year by researchers at the Digital Library Technologies group at Stanford (alma mater of the Google co-founders), called "Combating Web Spam with TrustRank" (.pdf) and recently made available on the Stanford server, may provide a clue as to what it is all about.
The paper is extremely technical, and a full read-through is only really recommended for those who have a deep understanding of algorithms and computer science. Its abstract, however, gives a readable outline:
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam.
We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good.
In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques.
Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
The paper then goes on to present the research methodology and findings in the following order:
1. We formalize the problem of web spam and spam detection algorithms.
2. We define metrics for assessing the efficacy of detection algorithms.
3. We present schemes for selecting seed sets of pages to be manually evaluated.
4. We introduce the TrustRank algorithm for determining the likelihood that pages are reputable.
5. We discuss the results of an extensive evaluation, based on 31 million sites crawled by the AltaVista search engine, and a manual examination of over 2,000 sites. We provide some interesting statistics on the type and frequency of encountered web contents, and we use our data for evaluating the proposed algorithms.
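The core idea in the abstract, starting from a few hand-checked seed pages and following links outward, can be sketched as a variant of PageRank in which the "random jump" always returns to a trusted seed. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the toy graph, the damping value and the iteration count are illustrative, and the seed set is simply given by hand rather than chosen by the selection schemes the paper discusses.

```python
def trustrank(links, good_seeds, damping=0.85, iterations=20):
    """Propagate trust from manually approved seed pages along outlinks.

    `links` maps each page to the list of pages it links to.
    `good_seeds` is the set of pages an expert has judged reputable.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    seeds = [page for page in pages if page in good_seeds]
    if not seeds:
        raise ValueError("at least one seed page must appear in the graph")

    # Seed vector: all starting trust sits on the approved seeds.
    seed_weight = {p: (1.0 / len(seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(seed_weight)

    for _ in range(iterations):
        # Unlike plain PageRank, the baseline share goes only to seeds.
        new_trust = {p: (1.0 - damping) * seed_weight[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # simplification: trust on dead-end pages is dropped
            share = damping * trust[page] / len(targets)
            for target in targets:
                new_trust[target] += share
        trust = new_trust
    return trust


# Hypothetical web: reputable pages link among themselves,
# while a spam page is propped up only by its own farm pages.
web = {
    "directory": ["news", "university"],
    "news": ["university", "blog"],
    "university": ["directory"],
    "blog": ["news"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "spam": ["farm1", "farm2"],
}
trust = trustrank(web, good_seeds={"directory"})

for page, score in sorted(trust.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.4f}")
```

Because no trusted page links into the spam cluster in this example, "spam" and its farm pages end up with a trust score of zero however many links they exchange, while "blog" earns some trust despite never being reviewed by hand. In a real crawl some good pages do link to spam, so trust would leak rather than stop cleanly.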
As the free Web that we know today becomes increasingly chaotic, overpowering and untrustworthy, TrustRank may become an important factor in its long-term survival as a global information repository.
Update - Sunday April 30th 2005
New Scientist reports:
"Now Google, whose name has become synonymous with internet searching, plans to build a database that will compare the track record and credibility of all news sources around the world, and adjust the ranking of any search results accordingly.
The database will be built by continually monitoring the number of stories from all news sources, along with average story length, number with bylines, and number of the bureaux cited, along with how long they have been in business. Google's database will also keep track of the number of staff a news source employs, the volume of internet traffic to its website and the number of countries accessing the site.
Google will take all these parameters, weight them according to formulae it is constructing, and distil them down to create a single value. This number will then be used to rank the results of any news search."