
Page Rank Algorithm

Catherine Benincasa, Adena Calden, Emily Hanlon, Matthew Kindzerske, Kody Law, Eddery Lam, John Rhoades, Ishani Roy, Michael Satz, Eric Valentine and Nathaniel Whitaker Department of Mathematics and Statistics University of Massachusetts, Amherst May 12, 2006

Abstract

PageRank is the algorithm used by the Google search engine, originally formulated by Sergey Brin and Larry Page in their paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. It is based on the premise, prevalent in the world of academia, that the importance of a research paper can be judged by the number of citations the paper has from other research papers. Brin and Page have simply transferred this premise to its web equivalent: the importance of a web page can be judged by the number of hyperlinks pointing to it from other web pages.

Introduction

There are various methods of information retrieval (IR), such as Latent Semantic Indexing (LSI). LSI uses the singular value decomposition (SVD) of a term-by-document matrix to capture latent semantic associations. The LSI method can efficiently handle difficult query terms involving synonyms and polysemes, since the SVD enables LSI to cluster documents and terms into concepts (e.g., car and automobile should belong to the same category). Unfortunately, computing and storing the SVD of the term-by-document matrix is costly. Secondly, there are enormous numbers of documents on the web, and these documents are not subjected to an editorial review process. The web therefore contains redundant documents, broken links, and poor-quality documents. Moreover, the web needs to be updated continuously as pages are modified, added, and deleted. The final feature of the web, which has proven worthwhile for an IR system, is its hyperlink structure. The PageRank algorithm introduced by Google effectively represents the link structure of the internet, assigning each page a credibility score based on this structure. Our focus here will be on the analysis and implementation of this algorithm.

PageRank Algorithm

PageRank uses the hyperlink structure of the web to view inlinks into a page as a recommendation of that page from the author of the inlinking page. Since inlinks from good pages should carry more weight than inlinks from marginal pages, each webpage is assigned an appropriate rank score, which measures the importance of the page. The PageRank algorithm was formulated by Google founders Larry Page and Sergey Brin as a basis for their search engine. After webpages are retrieved by robot crawlers, they are indexed and cataloged (as discussed in the section on Data Management); PageRank values are then assigned prior to query time according to perceived importance. The importance of each page is determined by the links to that page, and is increased by the number of sites which link to it. Thus the rank $r(P)$ of a given page $P$ is given by

$$r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|} \qquad (1)$$

where $B_P$ is the set of all pages pointing to $P$ and $|Q|$ is the number of outlinks from $Q$. The entries of the matrix $P$ are usually

$$p_{i,j} = \begin{cases} \dfrac{1}{|P_i|} & \text{if } P_i \text{ links to } P_j, \\[4pt] 0 & \text{otherwise.} \end{cases}$$

(These weights can be distributed in a non-uniform fashion as well, which will be explored in the application section. For this particular application, a uniform distribution will suffice.) For theoretical and practical reasons, such as convergence and convergence rates, the matrix P is adjusted. The raw Google matrix P is nonnegative with row sums equal to one or zero. Zero row sums correspond to pages that have no outlinks; these are referred to as dangling nodes. We eliminate the dangling nodes using one of two techniques, so that the rows artificially sum to 1. P is then a row stochastic matrix, which in turn means that the PageRank iteration represents the evolution of a Markov chain.
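As a concrete illustration of how the raw matrix P is formed, the following sketch (not the authors' code; it assumes the link structure is already available as a Python dictionary mapping each page index to the list of pages it links to) builds P with the uniform weights 1/|P_i| described above.

import numpy as np

def raw_google_matrix(outlinks, n):
    # Build the raw (unadjusted) Google matrix P from a link structure.
    # outlinks maps page index i to the list of pages i links to.
    # Pages with outlinks get uniform weight 1/|P_i|; dangling nodes
    # (no outlinks) are left as zero rows and handled later.
    P = np.zeros((n, n))
    for i, targets in outlinks.items():
        if targets:
            P[i, targets] = 1.0 / len(targets)
    return P

# Three-page example used in section 2.1: A -> B, A -> C, B -> C, C -> A
links = {0: [1, 2], 1: [2], 2: [0]}
print(raw_google_matrix(links, 3))   # rows: [0 .5 .5], [0 0 1], [1 0 0]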

2.1 Markov Model

Figure 1

Figure 1 is a simple example of the stationary distribution of a Markov model. This structure accurately represents the probability that a random surfer is at each of the three pages at any point in time. The Markov model represents the web's directed graph as a transition probability matrix P whose element $p_{ij}$ is the probability of moving from page i to page j in one step (click). This is accomplished in a few steps. Step one is to create a binary adjacency matrix to represent the link structure:

$$\begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & 1 & 1 \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array}$$

The second step is to transform this adjacency matrix into a probability matrix by normalizing each row:

$$\begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array}$$

This matrix is the unadjusted or raw Google matrix. The dominant eigenvalue of every stochastic matrix P is $\lambda = 1$. Therefore, if the PageRank iteration converges, it converges to the normalized left-hand eigenvector $v^T$ satisfying

$$v^T = v^T P \qquad (2)$$

with $v^T e = 1$, which is the stationary or steady-state distribution of the Markov chain. Thus Google intuitively characterizes the PageRank value of each site as the long-run proportion of time spent at the site by a web surfer eternally clicking on links at random. In this model we have not yet taken into account clicking back or entering URLs on the command line. In our basic example, we have

$$(R(A)\;\; R(B)\;\; R(C))\, A = (R(A)\;\; R(B)\;\; R(C)), \qquad \text{where } A = \begin{pmatrix} 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$

This gives

$$R(A) = R(C), \qquad R(B) = \tfrac{1}{2} R(A), \qquad R(C) = \tfrac{1}{2} R(A) + R(B), \qquad R(A) + R(B) + R(C) = 1,$$

and the solution of this linear system is $(0.4\;\; 0.2\;\; 0.4)$, so that $(0.4\;\; 0.2\;\; 0.4)\,A = (0.4\;\; 0.2\;\; 0.4)$.
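The hand computation above can be checked numerically. The sketch below (an illustration, not the authors' code) finds the left eigenvector of the raw Google matrix for eigenvalue 1 with NumPy and confirms the stationary distribution (0.4, 0.2, 0.4).

import numpy as np

# Raw Google matrix of the three-page example (rows sum to 1)
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Left eigenvector for eigenvalue 1: v P = v is the same as P^T v = v
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmax(np.isclose(vals, 1.0))])
v = v / v.sum()                     # normalize so that v^T e = 1

print(v)                            # approximately [0.4 0.2 0.4]
print(np.allclose(v @ P, v))        # True: v is the stationary distribution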

Let us now consider the larger network represented by Figure 2.

Figure 2

This network has 8 nodes; the corresponding adjacency matrix is therefore 8 x 8, as shown in Figure 3.

Figure 3. Again, we can transform this adjacency matrix into a stochastic matrix by normalizing each row in the same way.

2.1.1 Generalization

Before going into the logistics of calculating this PageRank vector, we generalize to an N-dimensional system. Let $A_i$ be the binary vector of outlinks from page $i$,

$$A_i = (a_{i1}, a_{i2}, \ldots, a_{iN}), \qquad |A_i| = \sum_{j=1}^{N} a_{ij}. \qquad (3)$$

Normalizing each row gives

$$P = \begin{pmatrix} \tfrac{1}{|A_1|} A_1 \\ \tfrac{1}{|A_2|} A_2 \\ \vdots \\ \tfrac{1}{|A_N|} A_N \end{pmatrix} = \begin{pmatrix} P_{11} & \cdots & P_{1N} \\ \vdots & & \vdots \\ P_{N1} & \cdots & P_{NN} \end{pmatrix},$$

with rows $P_i = (p_{i1}, p_{i2}, \ldots, p_{iN})$, so that

$$|P_i| = \sum_{j=1}^{N} p_{ij} = 1. \qquad (4)$$

We now have a row stochastic probability matrix, unless, of course, a page (node) points to no others: $|A_i| = |P_i| = 0$. Now let

$$w_i^T = \frac{1}{N}, \qquad i = 1, \ldots, N,$$

and furthermore let $d_i = 0$ if $i$ is not a dead end and $d_i = 1$ if it is a dead end. Then with $W = d\, w^T$, the matrix $S = P + W$ is a stochastic matrix. It should be noted that there is more than one way to deal with dead ends, such as removing them altogether or adding an extra link which points to all the other pages (a so-called master node). We explore qualitatively the effects these methods have in the results analysis section. (See Figure 10 for a dead end.)
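A minimal sketch of this dead-end adjustment, assuming the raw matrix P is held as a dense NumPy array (the sparse case is treated in the Data Management section):

import numpy as np

def make_stochastic(P):
    # Form S = P + d w^T: each zero row of P (a dead end) is replaced by
    # the uniform distribution w^T = (1/N, ..., 1/N).
    N = P.shape[0]
    d = (P.sum(axis=1) == 0).astype(float)   # d_i = 1 iff row i is a dead end
    w = np.full(N, 1.0 / N)
    return P + np.outer(d, w)

# Example: page 1 has no outlinks, so row 1 of P is all zeros
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
S = make_stochastic(P)
print(S.sum(axis=1))                         # every row of S now sums to 1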

2.2 Computing PageRank

The computation of PageRank is essentially an eigenvector problem, equivalent to solving the linear system

$$v^T (I - P) = 0, \qquad (5)$$

with $v^T e = 1$. There are several methods which can be utilized in this calculation; provided our matrix is irreducible, we are able to utilize the power method.

2.2.1 Power Method

We are interested in the convergence of the iteration $x_m^T G = x_{m+1}^T$. For convenience we convert this expression to $G^T x_m = x_{m+1}$. Clearly, the eigenvalues of $G^T$ satisfy $1 = \lambda_1 > |\lambda_2| \geq \cdots \geq |\lambda_n|$. Let $v_1, \ldots, v_n$ be the corresponding eigenvectors, and let $x_0$ (of dimension $n$) be such that $\|x_0\|_1 = 1$, so that for coefficients $a_i$ with $a_1 \neq 0$,

$$x_0 = \sum_{i=1}^{n} a_i v_i, \qquad G^T x_0 = \sum_{i=1}^{n} a_i G^T v_i = \sum_{i=1}^{n} a_i \lambda_i v_i = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i v_i = x_1,$$

$$G^T x_1 = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^2 v_i = x_2, \qquad \ldots, \qquad G^T x_m = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^{m+1} v_i = x_{m+1},$$

so, since $|\lambda_i| < 1$ for $i \geq 2$,

$$\lim_{m \to \infty} x_{m+1} = \lim_{m \to \infty} G^T x_m = a_1 v_1 = \pi,$$

the stationary state of the Markov chain.
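In code, the power method is just repeated left multiplication by G until the iterates stop changing. The sketch below is an illustration under the assumptions of this section (G row stochastic, the chain irreducible and aperiodic), not the implementation used for the crawl data.

import numpy as np

def power_method(G, tol=1e-10, max_iter=1000):
    # Iterate x_{m+1}^T = x_m^T G; for a suitable G the limit is the
    # stationary distribution pi.
    N = G.shape[0]
    x = np.full(N, 1.0 / N)          # any starting vector with entries summing to 1
    for _ in range(max_iter):
        x_next = x @ G               # one step of the chain
        if np.linalg.norm(x_next - x, 1) < tol:
            break
        x = x_next
    return x_next

P = np.array([[0.0, 0.5, 0.5],       # the three-page example again
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(power_method(P))               # converges to roughly [0.4 0.2 0.4]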

2.3 Irreducibility and Convergence of the Markov Chain

A difficulty that arises in computation is that S can be a reducible matrix when the underlying chain is reducible. Reducible chains are those that contain sets of states in which the chain eventually becomes trapped. For example, if webpage $S_i$ contains only a link to $S_j$, and $S_j$ contains only a link to $S_i$, then a random surfer who hits either $S_i$ or $S_j$ is trapped into bouncing endlessly between the two pages, which is the essence of reducibility. The definition of irreducibility is the following: for each pair $i, j$, there exists an $m$ such that $(S^m)_{ij} \neq 0$. In the case of an undirected graph, reducibility is equivalent to the graph splitting into disjoint, non-empty subsets with no links between them (see Figure 11). However, the issue of meshing the rankings of such subsets back together in a meaningful way still remains.

2.3.1 Sink

So far we are dealing with a directed graph; however, we also have to be concerned with the elusive sink (missing Figures 16, 17). A Markov chain in which every state is eventually reachable from every other state is guaranteed to possess a unique positive stationary distribution by the Perron-Frobenius theorem. Hence the raw Google matrix P is first modified to produce a stochastic matrix S. Due to the structure of the World Wide Web and the nature of gathering the web structure, such as our breadth-first method (which will be explained in the section on implementation), a stochastic matrix is almost certainly reducible. One way to force irreducibility is to dampen the stochastic matrix S by a scalar $\alpha$ between 0 and 1. In our computation we choose $\alpha$ to be 0.85. For $\alpha$ between 0 and 1, consider the following:

$$R(u) = \alpha \sum_{v \in B_u} \frac{R(v)}{n_v} + (1 - \alpha),$$

where $\alpha = 0.85$, $B_u$ is the set of pages linking to $u$, and $n_v$ is the number of outlinks of $v$. The new stochastic matrix G then becomes

$$G = \alpha S + (1 - \alpha) D, \qquad (6)$$

where

$$D = e\, w^T, \qquad e = \langle 1, 1, \ldots, 1 \rangle^T, \qquad w_i^T = \left\langle \frac{1}{N}, \ldots, \frac{1}{N} \right\rangle.$$

Again, it should be noted that $w^T$ can be any probability vector. In our basic example, this amounts to $0.85\,A + 0.15\,B = G$, where A is our usual 3 x 3 stochastic matrix, B is a 3 x 3 matrix with $\tfrac{1}{3}$ in every entry, and

$$G = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.05 & 0.9 \\ 0.9 & 0.05 & 0.05 \end{pmatrix}.$$

This method allows for additional accuracy in our particular model since it accounts for the possibility of arriving at a particular page by means other than via a link. This certainly occurs in reality; hence this method improves the accuracy of our model, provides us with the needed irreducibility, and, as we will see, improves the rate of convergence of the power method.
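For the basic three-page example, the damped matrix G and its PageRank vector can be computed directly; the sketch below (illustrative only) forms G = αS + (1 − α)D with the uniform w^T and runs a few power iterations.

import numpy as np

def google_matrix(S, alpha=0.85):
    # G = alpha * S + (1 - alpha) * e w^T with w^T = (1/N, ..., 1/N);
    # G is stochastic and irreducible for 0 < alpha < 1.
    N = S.shape[0]
    return alpha * S + (1.0 - alpha) * np.full((N, N), 1.0 / N)

S = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
G = google_matrix(S)
print(G[0])                          # first row: [0.05 0.475 0.475]

x = np.full(3, 1.0 / 3)              # power iteration on the damped matrix
for _ in range(100):
    x = x @ G
print(x)                             # close to, but no longer exactly, (0.4, 0.2, 0.4)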

Data Management

Up to this point, we have assumed that we are always able to discover the desired networks or websites containing the information we google for. However, careful readers may notice that we have not really discussed how to figure out the structure of these networks. In this section, we switch our attention to more technical features. How are we going to determine the structure of our networks? Furthermore, once we are able to come up with the list of websites, is there any way we can find the ranks more efficiently and economically?

3.1 Breadth First Search

The Breadth First Search method is our main approach to identifying the structure of a network, and its algorithm is the following. We begin with a single node (webpage) in our network and assign it the number 1, as in Figure a.


Figure a. This node links to several nodes, and we assign each of these nodes a number, as in Figure b.


Figure b. From Figure b, we observe that one node links to node 2, so we assign this node another number. Then we switch to node 3, assigning a number to the node that connects to node 3, and so on. Figure c gives the final result.


Figure c. As you can see, by using the Breadth First Search method we are able to complete the graph structure, and therefore we are able to create our adjacency matrix.
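A sketch of this crawl in code (a simplified illustration, not the crawler used for the results below; get_outlinks stands in for fetching a page and extracting its links):

from collections import deque

def breadth_first_crawl(start_page, get_outlinks, limit=60000):
    # Number pages in the order they are discovered and record the links
    # among them; `limit` caps the size of the crawl.
    number = {start_page: 1}                 # page -> assigned number
    edges = []                               # (from, to) pairs of page numbers
    queue = deque([start_page])
    while queue and len(number) < limit:
        page = queue.popleft()
        for target in get_outlinks(page):
            if target not in number:         # first time we see this page
                number[target] = len(number) + 1
                queue.append(target)
            edges.append((number[page], number[target]))
    return number, edges

The numbered edge list produced this way is the kind of from-to information used in the next subsection.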

3.2 Sparse Matrix

Now we are able to form our adjacency matrix, knowing the structure of the network through the Breadth First Search method. In reality, however, the network contains millions or even billions of pages, and these matrices will be huge. If we applied our power method directly to such matrices, even with the fastest computer in the world it would take a long time to compute the dominant eigenvector. Therefore, it is economical for us to develop ways to reduce the size of these matrices without affecting the ranking of the pages. In this paper, the sparse matrix format and Compressed Row Storage are the methods we use to accelerate the calculation. First, let us consider the following network:


Figure d. The link information is formatted from files to files, as represented by the table next to the network. Sparse PR then reads in a from-to file and computes the ranks; it outputs the pages in order of rank. Figure e shows the result for our sample.


Figure e. The sparse matrix format allows us to use less memory without compromising the final ranking. The full matrix format requires $N^2 + 2N$ memory locations (N = number of nodes); for 60k nodes this is about 50 GBytes of RAM. The sparse format requires $3N + 2L$ locations (L = number of links); for 60k nodes and 1.6M links this is about 50 MBytes of RAM. The sparse matrix clearly uses far less memory than a full matrix in the computation, and is therefore much more efficient in terms of the amount of memory used.
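The comparison can be reproduced with the counts quoted above; the sketch below (an estimate only, assuming one storage location per matrix entry) simply evaluates the two formulas for the crawl size used later in the paper.

# Storage comparison using the formulas above (N = nodes, L = links)
N, L = 60_513, 1_600_000

full_locations   = N**2 + 2*N       # full matrix plus two work vectors
sparse_locations = 3*N + 2*L        # CRS vectors plus the rank vectors

print(full_locations / sparse_locations)   # the full format needs roughly 1000x more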

3.3 Compressed Row Vectors

In this section we develop a method to accelerate the process of multiplying by the matrix. We compress the row vectors, since we already know how each node points to other nodes. CRS requires two vectors of size L (the number of links) and one of size N (the number of nodes). Consider the following example, where we have 3 nodes and 6 links. First, we construct a vector aa of size L; this vector holds the nonzero entries in reading order. Second, we construct a vector ja of size L; this vector holds the column indices of the nonzero entries. Finally, we create the ia vector of size N; this is a cumulative count of nonzero entries by row. For example, the first row has two nonzero entries, so the first element of ia is 2; the second row has one nonzero entry, so the second element is 3, and so on.

Figure f. CRS storage allows us to carry out the matrix-vector multiplication in the following concise form (taking ia(0) = 0, so that the entries of row i occupy positions ia(i-1)+1 through ia(i) of aa):

// for each row in the original matrix
for i = 1 to N
    // for each nonzero entry in that row
    for j = ia(i-1) + 1 to ia(i)
        // multiply that entry by the corresponding
        // entry in the vector; accumulate in the result
        result(i) = result(i) + aa(j) * vector(ja(j))

CRS is efficient, since we only need L additions and L multiplications, instead of the roughly $N^2$ additions and $N^2$ multiplications a full matrix-vector product requires. Now we can apply the power method, computing those tedious matrix multiplications and additions in a much more efficient way.
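A runnable version of the same multiplication, using 0-based indices and an ia vector with a leading 0, is sketched below; the matrix values are made up for illustration (Figure f is not reproduced here) but follow the 3-node, 6-link pattern of the example, with 2, 1, and 3 nonzero entries in the three rows.

import numpy as np

def crs_matvec(aa, ja, ia, x):
    # y = A x for a matrix in compressed row storage (0-based indices):
    # aa holds the nonzero values, ja their column indices, and ia the
    # cumulative nonzero counts per row with a leading 0 (length N + 1).
    N = len(ia) - 1
    y = np.zeros(N)
    for i in range(N):
        for k in range(ia[i], ia[i + 1]):    # nonzeros of row i
            y[i] += aa[k] * x[ja[k]]
    return y

A  = np.array([[0.0, 0.5, 0.5],              # illustrative 3-node matrix
               [0.0, 0.0, 1.0],
               [0.2, 0.3, 0.5]])
aa = np.array([0.5, 0.5, 1.0, 0.2, 0.3, 0.5])
ja = np.array([1, 2, 2, 0, 1, 2])
ia = np.array([0, 2, 3, 6])

x = np.array([1.0, 2.0, 3.0])
print(crs_matvec(aa, ja, ia, x))             # matches the dense product A @ x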


Results

To apply the PageRank method, an adjacency matrix representing a directed graph is needed. The conventional use of PageRank is to rank a subset of the internet. A program called a webcrawler must be employed to crawl a desired domain and map its structure (i.e., its links). A simple approach to this problem is to use a breadth-first search technique. This technique involves starting at a particular node, say node 1, and discovering all of node 1's neighbors before beginning to search for the neighbors of node 1's first discovered neighbor. Figure 4 demonstrates this graphically. This technique can be contrasted with depth-first search, which starts on a path and continues until the path ends before beginning a second unique path. Breadth-first search is much more appropriate for webcrawlers because it is much less likely that arbitrarily close neighbors will be excluded during a lengthy crawl.

Figure 4. A crawl in January of 2006 was focused on the umass.edu domain and yielded an adjacency matrix of size 60,513 x 60,513. The PageRank method was implemented in conjunction with the CRS scheme to minimize the resources required. A final ranking was obtained, and a sample can be seen in Figure 5. Notice that the first and sixth ranked websites are the same. This is due to the fact that the webcrawler did not differentiate between different aliases of a URL.


Figure 5. Another implementation can be applied to a network of airports, with flights representing directed edges. In this implementation, the notion of multilinking comes into play; more precisely, there may exist more than one flight from one airport to the next. In the internet application, the restriction was made to allow only one link from any particular node to another; handling multiple links requires only slight alterations to the working software to ensure a stochastic matrix. Figure 6 shows a sample of the results of a PageRank application on 1500 North American airports.
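One natural way to handle such multilinks (an illustration of the idea, not necessarily the alteration the authors made) is to let the weight of each edge be proportional to the number of flights along it, which again yields a row stochastic matrix:

import numpy as np

# Hypothetical flight counts: entry (i, j) = number of flights from airport i to j
flights = np.array([[0, 3, 1],
                    [2, 0, 0],
                    [1, 4, 0]], dtype=float)

# Normalize each row by its total, so multiple flights to the same airport
# carry proportionally more weight; the result is row stochastic.
row_sums = flights.sum(axis=1, keepdims=True)
P = np.divide(flights, row_sums, out=np.zeros_like(flights), where=row_sums > 0)
print(P.sum(axis=1))    # each airport's outgoing probabilities sum to 1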

Figure 6

A more visible application may be in a sports tournament setting. The methods used for ranking collegiate football teams are annually a hot topic for debate. Currently, an average of seven ranking systems are used by the BCS to select which teams are accepted to the appropriate bowl or title games. Five of these models are computer based and are arguably special cases of PageRank.

Conclusion

This paper presents one of the possible ways of ranking. However, it is clear that the matrices Google is dealing with are thousands of times larger than the one we used. Therefore, it is safe to assume that Google has more efficient ways to compute ranks for webpages. Furthermore, we have not introduced any method to confirm our results and algorithms. It is easy to check the results when the network is small, but as the networks get bigger and bigger, verifying the results becomes extremely difficult. One potential solution to this problem is to simulate a web surfer and use a random number generator to determine the linkage between websites. It would be interesting to see the result.

References

[1] Amy N. Langville and Carl D. Meyer. A Survey of Eigenvector Methods for Web Information Retrieval. SIAM Review, Vol. 47, No. 1.

[2] S. Brin, L. Page, et al. The PageRank Citation Ranking: Bringing Order to the Web.

