Figure 8: Composition of the evaluation sample.
spam, but we also applied a second filter: we only selected sites with a clearly identifiable authority (such as a governmental or educational institution or company) that controlled the contents of the site. The extra filter was added to guarantee the longevity of the good seed set, since the presence of physical authorities decreases the chance that the sites will degrade in the short run.
6.3 Evaluation Sample
To evaluate the metrics presented in Section 3.2, we needed a set X of sample sites with known oracle scores. (Note that X is different from the seed set and is used only for assessing the performance of our algorithms.) We settled on a sample of 1000 sites, a number that gave us enough data points while remaining manageable in terms of oracle evaluation time.
We decided not to select the 1000 sample sites at random. With a random sample, many of the sites would be very small (with few pages), have very low PageRank, or both. (Both size and PageRank follow power-law distributions, with many sites at the tail end of the distribution.) As we discussed in Section 5.2, it is more important for us to correctly detect spam in high-PageRank sites, since they will more often appear high in query result sets. Furthermore, it is hard for the oracle to evaluate small sites because of the reduced body of evidence, so it also does not make sense to include many small sites in our sample.
To ensure diversity, we adopted the following sampling method. We listed the sites in decreasing order of their PageRank scores and segmented the list into 20 buckets. Each bucket contained a different number of sites, chosen so that the scores in each bucket summed to 5 percent of the total PageRank score. As a result, the first bucket contained the 86 sites with the highest PageRank scores, bucket 2 the next 665, and the 20th bucket the 5 million sites that were assigned the lowest PageRank scores.
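As a rough illustration of the bucketing step above, the sketch below sorts sites by decreasing PageRank and moves to the next bucket each time the running total crosses another multiple of 1/20 of the total score. The function name `pagerank_buckets` and the rule for placing a site that straddles a bucket boundary are our own assumptions; the text does not say how boundary sites were handled, so each bucket here holds only approximately 5 percent of the total.

```python
def pagerank_buckets(scores, num_buckets=20):
    """Split sites into buckets holding roughly equal shares of total PageRank.

    scores maps site -> PageRank score. Sites are visited in decreasing score
    order, and each site goes into the bucket in which the cumulative mass
    *before* it falls (a hypothetical boundary rule chosen for simplicity).
    Buckets may be empty if a single site carries more than one share.
    """
    # sorted() is stable, so tied sites keep their original order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    share = sum(scores.values()) / num_buckets  # target mass per bucket (5% for 20 buckets)
    buckets = [[] for _ in range(num_buckets)]
    cumulative = 0.0
    for site in ranked:
        # Clamp so floating-point drift near the end cannot yield index num_buckets.
        index = min(int(cumulative / share), num_buckets - 1)
        buckets[index].append(site)
        cumulative += scores[site]
    return buckets


if __name__ == "__main__":
    toy_scores = {"a": 4, "b": 3, "c": 1, "d": 1, "e": 1}
    print(pagerank_buckets(toy_scores, num_buckets=2))  # [['a', 'b'], ['c', 'd', 'e']]
```

As in the real data, the high-scoring bucket ends up with few sites and the low-scoring bucket with many, because a handful of sites carry a large share of the total PageRank.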
We constructed our sample set of 1000 sites by selecting 50 sites at random from each bucket. We then performed a manual (oracle) evaluation of the sample sites to determine whether or not they were spam. The outcome of the evaluation is presented in Figure 8, a pie chart that shows how our sample breaks down into various types of sites. We found that we could use 748 of the sample sites to evaluate TrustRank:
• Reputable. 563 sites featured quality content with zero or a statistically insignificant number of links pointing to spam sites.
• Web organization. 37 sites belonged to organizations that either have a role in the maintenance of the World Wide Web or perform business related to Internet services. While all of them were good sites, most of their links were automatic (e.g., “Site hosted by Provider X”). Therefore, we decided to give them a distinct label so that we could track their features separately.
• Advertisement. 13 of the sites acted as targets for banner ads. These sites lack genuinely useful content, and their high PageRank scores are due exclusively to the large number of automatic links that they receive. Nevertheless, they still qualify as good sites, showing no sign of spamming activity.
• Spam. 135 sites featured various forms of spam. We considered these sites bad.
These 748 sites formed our sample set X. The remaining 252 sites were deemed unusable for the evaluation of TrustRank for various reasons:
• Personal page host. 22 of the sites hosted personal web pages. The large, uncontrolled body of editors contributing to the wide variety of content on each of these sites made it impossible to categorize them as either bad or good. Note that this issue would not arise in a page-level evaluation.
• Alias. 35 sites were simple aliases of sites better known under a different name. We decided to drop these aliases because the importance of an alias could not appropriately reflect the importance of the original site.
• Empty. 56 sites were empty, consisting of a single
page that provided no useful information.
• Non-existent. 96 sites were non-existent: either the DNS lookup failed, or our systems were not able to establish a TCP/IP connection with the corresponding computers.
• Unknown. We were unable to properly evaluate 43 sites based on the available information. These were mainly East Asian sites, which were challenging to evaluate because no English translation was available.
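The sample construction and labeling described in this section can be summarized in a short sketch, assuming the 20 PageRank buckets are available as Python lists of site names. The helper names `stratified_sample` and `build_evaluation_set`, and the lowercase label strings, are hypothetical. The oracle labels themselves came from manual evaluation, so the usage example only replays the category counts reported above to check that they add up to 748 usable sites (613 good, 135 spam) and 252 unusable ones.

```python
import random

# Hypothetical label strings for the oracle categories listed in Section 6.3.
GOOD_LABELS = {"reputable", "web organization", "advertisement"}
BAD_LABELS = {"spam"}
UNUSABLE_LABELS = {"personal page host", "alias", "empty", "non-existent", "unknown"}


def stratified_sample(buckets, per_bucket=50, seed=None):
    """Draw per_bucket distinct sites uniformly at random from each bucket.

    Raises ValueError if a bucket holds fewer than per_bucket sites.
    """
    rng = random.Random(seed)
    sample = []
    for bucket in buckets:
        sample.extend(rng.sample(bucket, per_bucket))
    return sample


def build_evaluation_set(labels):
    """Split {site: oracle_label} into the evaluation set X and the dropped sites.

    X maps each usable site to True (good) or False (spam).
    """
    x, dropped = {}, []
    for site, label in labels.items():
        if label in GOOD_LABELS or label in BAD_LABELS:
            x[site] = label in GOOD_LABELS
        elif label in UNUSABLE_LABELS:
            dropped.append(site)
        else:
            raise ValueError(f"unknown label: {label!r}")
    return x, dropped


if __name__ == "__main__":
    # Replay the category counts reported for the 1000-site sample.
    counts = {
        "reputable": 563, "web organization": 37, "advertisement": 13, "spam": 135,
        "personal page host": 22, "alias": 35, "empty": 56, "non-existent": 96, "unknown": 43,
    }
    labels = {f"site{i}": label
              for i, label in enumerate(name for name, n in counts.items() for _ in range(n))}
    x, dropped = build_evaluation_set(labels)
    print(len(x), len(dropped), sum(x.values()))  # 748 252 613
```

Sampling a fixed number of sites per bucket, rather than uniformly over all sites, is what keeps high-PageRank sites well represented even though the top bucket holds only 86 sites.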
6.4 Results
In Section 4 we described a number of strategies for propagating trust from a set of good seeds. In this section we focus on three of the alternatives, TrustRank and two baseline strategies, and evaluate their performance using our sample X: