
Principal Component Centrality: A KLT-inspired Transform for Identifying Influential Neighborhoods in Social Network Graphs

Muhammad U. Ilyas* and Hayder Radha


Department of Electrical and Computer Engineering Michigan State University East Lansing, MI 48824 {ilyasmuh, radha}@egr.msu.edu

Abstract: The measurement of node centrality is a key area of research in social network analysis. However, centrality is vaguely defined and there are several measures in existence. The appropriateness of a particular node centrality measure for a particular network is judged by the combination of type of commodity duplication and type of flow process in the network. In the context of social networks, Eigenvector centrality (EVC) is a node centrality measure used to rank nodes according to the influence they exert. In this paper, we introduce Principal Component Centrality (PCC), a measure of node centrality that is inspired by the Karhunen-Loève transform (KLT)/Principal Component Analysis (PCA). We demonstrate PCC's ability to discover significantly more social hubs acting as influential neighborhoods in massive social graphs (with millions of links/nodes) than EVC, whose focus is limited to one cluster within such graphs. We show PCC's performance by processing a friendship graph obtained from Google's Orkut social networking service and a gaming graph obtained from users of Facebook's Fighters Club application.

Fig. 1. This figure shows a graph on the lower plane (the graph plane), overlaid with another plane (the centrality plane) of the interpolated surface plot of node centrality scores. The centrality plane typically exhibits a number of peaks or local centrality maxima.

I. INTRODUCTION
Centrality [3], [4], [6], [14], [26] is a measure to assess the criticality of a node's position. Node centrality as a measure of a node's importance by virtue of its central location has been in common use by social scientists in the study of social networks for decades. Over the years several different meanings of centrality have emerged. Naturally, the idea of ranking nodes for their ability to spread or detect (positive or negative) influence is of significant interest to social network analysis. Among many centrality measures, Eigenvector Centrality (EVC) is arguably the most successful tool for
This work was supported in part by NSF Award CNS-0721550, NSF Award CCF-0728996, NSF Award CCF-0515253, and an unrestricted gift from Microsoft Research.

detecting the most influential node(s) within a social graph. Thus, EVC has been a highly popular centrality measure in the social sciences ([15], [29], [3], [13], [11], [12], [30], [28], [5], [4]), where it is often referred to simply as centrality. As we demonstrate later in this paper, one key shortcoming of EVC is its focus on (virtually) a single influential set of nodes that tend to cluster within a single neighborhood. In other words, EVC has the tendency of identifying a set of influential nodes that are all within the same region of a graph. This shortcoming may not represent a major issue for many social science problems and Internet applications, such as PageRank, where EVC has been used extensively [19]. Meanwhile, when dealing with massive social network graphs, it is hardly the case that there is a single neighborhood of influential nodes; rather, there are usually multiple

influential neighborhoods, most of which are not detected or identified by EVC. In order to identify influential neighborhoods, there is a need to associate such neighborhoods with some form of an objective measure of centrality that can be evaluated and searched for. To that end, one can think of a centrality plane that is overlaid over the underlying social graph under consideration. This centrality plane may contain multiple centrality score maxima, each of which is centered on an influential neighborhood. Nodes located under a centrality peak have a higher centrality score than, and are thus more central than, any of their neighbors. We use the term social hubs to refer to nodes forming centrality maxima. Figure 1 illustrates this concept. Thus, these social hubs form the kernel of influential neighborhoods in social networks. Hence, our focus in this paper is on identifying influential neighborhoods rather than influential nodes. We will show that EVC has a tendency to be too narrowly focused on a dominating neighborhood. To this end, we introduce a new measure of centrality that we call Principal Component Centrality (PCC) that gradually widens the focus of EVC in a controlled manner. More importantly, PCC provides a general framework for transforming social graphs into a spectral space analogous to popular signal transforms that operate on random signals. In this paper, we give a brief review of common centrality measures accompanied by a critique with regard to their scope of application to social networks. We then present Principal Component Centrality (PCC), a node centrality measure that is inspired by the Karhunen-Loève transform (KLT) and Principal Component Analysis (PCA). In essence, PCC is a general transform of social graphs that can provide vital insight into the centrality and related characteristics of such graphs.
Similar to the KLT of a signal, the proposed PCC of a social graph gives a form of compact representation that identifies influential nodes and, more importantly, influential neighborhoods. Hence, PCC provides an elegant social graph transform framework that outperforms EVC, as we show in this paper. In particular, early in this paper, we demonstrate EVC's shortcoming by using both EVC and PCC to compute node centralities in a network small enough to allow meaningful illustration. This is followed by a thorough description of PCC and its utility in transforming massive real-world social network graphs. We also develop the equivalent of an inverse PCC transform that attempts to reconstruct a representation of the original social graph from its

influential neighborhoods. The rest of this paper is organized as follows. Section II gives a background review of existing centrality measures for graphs, highlights problems in EVC and motivates our development of a new node centrality. Section III introduces PCC as a new measure of centrality. It describes in detail the advantages, mathematical interpretation, visualization and the effect of varying the number of features of PCC. Section IV applies PCC to two real-world social graphs. The first is an undirected, unweighted friendship graph from Google's Orkut social networking service. Orkut currently has approximately 440,000 active subscribers [24] and has large subscriber bases in Brazil and India. The data set available to us [21] consists of 70,000 users connected by 2,971,776 links. The second is a weighted, undirected gaming graph of matches between users of Facebook's Fighters Club application. This data set was originally collected by Nazir in [22]. It consists of 667,560 recorded matches between 143,020 users, making for a graph with 526,224 edges between users. Section V concludes the paper.

II. BACKGROUND
Let $A$ denote the adjacency matrix of a graph $G(V, E)$ consisting of the set of nodes $V = \{v_1, v_2, v_3, \ldots, v_N\}$ of size $N$ and the set of undirected edges $E$. When a link is present between two nodes $v_i$ and $v_j$, both $A_{i,j}$ and $A_{j,i}$ are set equal to 1; otherwise they are set to 0. Let $\Gamma(v_i)$ denote the neighborhood of $v_i$, the set of nodes $v_i$ is connected to directly.

A. Degree Centrality
The degree centrality of a node in a graph is a measure of its relative importance to the graph's connectivity. The degree centrality of a node is defined in terms of the number of edges incident on it. Nodes with more incident edges have higher degree centrality than nodes with fewer incident edges. If $d_i$ denotes the degree of node $v_i$ then its degree centrality, normalized by the maximum possible degree $N - 1$, is computed by:

$$C_D(v_i) = \frac{d_i}{N - 1} \tag{1}$$
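As a minimal sketch of equation 1, assuming NumPy and an unweighted adjacency matrix with a zero diagonal, degree centrality reduces to a normalized row sum. The function name `degree_centrality` and the 4-node star graph are hypothetical and serve only as an illustration.

```python
import numpy as np

def degree_centrality(A):
    """Return C_D(v_i) = d_i / (N - 1) for every node of an undirected graph."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    degrees = A.sum(axis=1)  # d_i: number of edges incident on node i (unweighted A)
    return degrees / (n - 1)

# Hypothetical 4-node star graph: node 0 is linked to nodes 1, 2 and 3.
A_star = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])

scores = degree_centrality(A_star)
print(scores)
```

In this example the hub node is adjacent to all other nodes and therefore attains the maximum score of 1, while each leaf scores 1/3.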

Degree centrality is a measure of a node's rate of dissemination (of an infection) in the immediate short term. It has the advantage that its computation does not require nodes to exchange information. However, it has two significant disadvantages: 1) Without an exchange of centrality information with other nodes, it is not possible to interpret and

evaluate an individual node's centrality relative to that of others. 2) Degree centrality does not take into account the centrality of a node's neighbors.

B. Closeness Centrality
The closeness centrality of a node is defined as the mean length of geodesic paths to all other nodes. Intuitively, nodes occupying a more central location within the graph are expected to have shorter paths. Closeness centrality is a measure of the rate at which a node can spread an infection to all reachable nodes. Closeness is a suitable measure of centrality when the flow of commodity in the network follows geodesic paths. Closeness centrality is a good measure of the average detection time in a network with flows of non-replicating commodity following geodesic paths.

C. Betweenness Centrality
The betweenness centrality of a node is defined as the fraction of geodesic paths (shortest paths), out of all geodesic paths between all pairs of nodes, that pass through that node. Thus, nodes located on more geodesic paths have a higher betweenness centrality than nodes located on fewer geodesic paths. Intuitively, since the subproblem optimality principle holds for the shortest path problem, a node's location on a geodesic path implies close proximity to all other nodes on that path. A node's betweenness can be interpreted as a measure of the disruption caused when the node is removed from the network. Like closeness, betweenness too assumes that the flow of commodity is along geodesics. Betweenness centrality is a good measure of the average probability of detection of flows in a network with non-replicating commodity following geodesic paths.

D. Eigenvector Centrality
Eigenvector centrality (EVC) is a relative score recursively defined as a function of the number and strength of a node's connections to its neighbors, as well as those neighbors' centralities. Let $x(i)$ be the EVC score of a node $v_i$. Then,
$$x(i) = \frac{1}{\lambda} \sum_{j \in \Gamma(v_i)} x(j) = \frac{1}{\lambda} \sum_{j=1}^{N} A_{i,j}\, x(j) \tag{2}$$

Fig. 2. A spatial graph of 200 nodes. Node colors are indicative of the range in which their EVC falls.

Here $\lambda$ is a constant. Equation 2 can be rewritten in vector form as equation 3, where $x = \{x(1), x(2), x(3), \ldots, x(N)\}$ is the vector of EVC scores of all nodes.

$$x = \frac{1}{\lambda} A x \quad \Longleftrightarrow \quad \lambda x = A x \tag{3}$$

This is the well-known eigenvector equation from which this centrality takes its name. $\lambda$ is an eigenvalue and $x$ is the corresponding eigenvector of matrix $A$. Obviously several eigenvalue/eigenvector pairs exist for an adjacency matrix $A$. The EVC of nodes is defined on the basis of the Perron eigenvalue $\lambda_A$ (the Perron eigenvalue is the largest of all eigenvalues of $A$ and is also called the principal eigenvalue). If $\lambda$ is any other eigenvalue of $A$ then $\lambda_A \geq |\lambda|$. The eigenvector $x = \{x(1), x(2), \ldots, x(N)\}$ corresponding to the Perron eigenvalue is the Perron eigenvector or principal eigenvector. Thus the EVC of a node $v_i$ is the corresponding element $x(i)$ of the Perron eigenvector $x$. Note that when the adjacency matrix $A$ is symmetric and the graph is connected, all elements of the principal eigenvector $x$ are positive. As mentioned above, EVC is widely used in the social sciences ([20], [30], [3], [14], [12], [13], [31], [29], [5], [4]) and is often referred to simply as centrality. EVC does not suffer from the same problems as degree, closeness and betweenness centralities. In computing a node's EVC it takes into consideration its neighbors' EVC scores. Because of its recursive definition,

EVC is suited to measure a node's power to influence other nodes in the network, both directly and indirectly through its neighbors. Connections to neighbors that are in turn well connected themselves are rated higher than connections to neighbors that are weakly connected. Like closeness and betweenness, the EVC of a node provides a network-wide perspective. At the same time it can take advantage of distributed methods of computing eigenvectors/eigenvalues of a matrix but does not have to bear the overhead of excess network traffic. Sankaralingam [27], Kohlschütter [18], Canright, Engø-Monsen and Jelasity [7], Bischof [2], Bai [1] and Tisseur [28] proposed parallel algorithms for computing eigenvectors and eigenvalues of adjacency matrices.

E. The Need for a New Centrality Measure
In the preceding sections we highlighted some of the key characteristics of the most common measures of centrality. Our discussion left us with only one viable measure of centrality that takes into consideration the centrality scores of a node's neighbors and which provides a network-wide perspective, i.e. eigenvector centrality. EVC has been used extensively to great effect in the study and analysis of a wide variety of networks that are shown to exhibit small-world and scale-free properties. In [8] Canright and Engø-Monsen correlated EVC with the instantaneous rate of spread of contagion on a Gnutella peer-to-peer network graph, a social network of students in Oslo, a collaboration graph of researchers at Telenor R&D and a snapshot of a collaboration graph of the Santa Fe Institute. In [23] Newman analyzed the use of EVC in a lexical network of co-occurring words in Reuters newswire stories. In [9] Carreras et al. used EVC to study the spread of epidemics in mobile networks.
They used three sets of traces collected by Intel Cambridge, a trace of the public transportation network from the DieselNet project at the University of Massachusetts at Amherst, and mobility and interaction traces from MIT's Reality Mining project. Now consider the graph in figure 2. It consists of 200 nodes and is typical of the kinds of peer-to-peer networks formed by social interactions in networks such as the cellphone-based Nokia SensorPlanet project ([25], [17]). Its nodes are assigned one of six colors from the adjacent color palette. Each of the six colors represents one of six bins of a histogram spanning, in uniform step sizes, the range from the smallest to the largest EVCs. As the legend accompanying figure 2 shows, blue represents the lowest EVCs and red the highest. We make the following observations:

Fig. 3. [Top] Histogram of eigenvalues of the adjacency matrix A and the Laplacian matrix of the network in figure 2; [Bottom] Cumulative sum $\sum_{i=1}^{P} |\lambda_i| / \sum_{i=1}^{N} |\lambda_i|$ of the sequence of eigenvalues of the adjacency matrix and the Laplacian matrix of the network in figure 2 when sorted in descending order of magnitude. In both figures the lines plotted in red are averages of 50 networks generated randomly with the same parameters.

1) EVCs are tightly clustered around a very small region with respect to the total size of the network and drop off sharply as one moves away from the node of peak EVC.
2) EVC is unable to provide much centrality information for the vast majority of nodes in the network.
3) The position of the peak EVC node appears somewhat arbitrary because visual inspection reveals almost equally significant clusters of nodes in other locations in the graph. Counter to intuition, the high-EVC cluster is connected to the rest of the network by a single link.

III. PRINCIPAL COMPONENT CENTRALITY
The EVC of a node is recursively defined as a measure of centrality that is proportional to the number of neighbors of a node and their respective EVCs. As we saw in section II-D, the mathematical expression for the vector of node EVCs is equivalent to the principal eigenvector. Our motivation for Principal Component Centrality (PCC) as a new measure of node centrality may be understood by looking at EVC through the lens of the Karhunen-Loève Transform (KLT). When the KLT is derived from an $N \times N$ covariance matrix of $N$ random variables, the principal eigenvector is the most dominant feature vector, i.e. the direction in $N$-dimensional hyperspace along which the spread of data

Fig. 4. Reconstructed topologies of the graph from figure 2 using only the first 1, 2, 3, 5, 10, 15, 50 and all 200 eigenvectors (panels $\tilde{A}_1$, $\tilde{A}_2$, $\tilde{A}_3$, $\tilde{A}_5$, $\tilde{A}_{10}$, $\tilde{A}_{15}$, $\tilde{A}_{50}$ and $A_{200}$, each drawn in the x-y plane).

points is maximized. Similarly, the second eigenvector (corresponding to the second largest eigenvalue) is representative of the second most significant feature of the data set. It may also be thought of as the most significant feature after the data points are collapsed along the direction of the principal eigenvector. When the covariance matrix is computed empirically from a set of data points, the eigendecomposition is the well-known Principal Component Analysis (PCA) [11]. Since we are operating on the adjacency matrix derived from graph data we call the node centrality proposed in this paper Principal Component Centrality (PCC). In a covariance matrix, a non-zero entry with a large magnitude at positions $(i, j)$ and $(j, i)$ is representative of a strong relationship between the $i$-th and $j$-th random variables. A non-zero entry in the adjacency matrix representing a link from one node to another is, in a broad sense, also an indication of a relationship between the two nodes. Based on this understanding we draw an analogy between the graph adjacency matrix and the covariance matrix. In the preceding section we described various centrality measures from the literature. Among them, EVC is the node centrality most often used in the study of social networks and other networks with small-world properties. While EVC assigns centrality to nodes according to the strength of the most dominant feature of the data

set, PCC takes into consideration additional, subsequent features. We define the PCC of a node in a graph as the Euclidean distance/$\ell^2$ norm of a node from the origin in the $P$-dimensional eigenspace formed by the $P$ most significant eigenvectors. For a graph consisting of a single connected component, the $N$ eigenvalues $|\lambda_1| \geq |\lambda_2| \geq \ldots \geq |\lambda_N| \geq 0$ correspond to the normalized eigenvectors $x_1, x_2, \ldots, x_N$. The eigenvector/eigenvalue pairs are indexed in order of descending magnitude of eigenvalues. When $P = 1$, PCC equals a scaled version of EVC. Unlike other measures of centrality, PCC has a tuning parameter, $P$, that adjusts the number of eigenvectors included in the PCC. The selection of an appropriate value of $P$ is addressed in subsection III-D. Let $X$ denote the $N \times N$ matrix of concatenated eigenvectors $X = [x_1\; x_2\; \ldots\; x_N]$ and let $\Lambda = [\lambda_1\; \lambda_2\; \ldots\; \lambda_N]'$ be the vector of eigenvalues. Furthermore, if $P < N$ and if matrix $X$ has dimensions $N \times N$, then $X_{N \times P}$ will denote the submatrix of $X$ consisting of the first $N$ rows and first $P$ columns. Then PCC can be expressed in matrix form as:

$$C_P = \sqrt{\left((A X_{N \times P}) \odot (A X_{N \times P})\right) \mathbf{1}_{P \times 1}} \tag{4}$$

The $\odot$ operator is the Hadamard (or entrywise product or Schur product) operator, $\mathbf{1}_{P \times 1}$ is a vector of $P$ ones and the square root is applied entrywise. Since $A x_k = \lambda_k x_k$, the $k$-th column of $A X_{N \times P}$ is $\lambda_k x_k$, so equation 4 can also be written in terms of the eigenvalue and eigenvector matrices $\Lambda$ and $X$ of the adjacency matrix $A$:

$$C_P = \sqrt{(X_{N \times P} \odot X_{N \times P})\,(\Lambda_{P \times 1} \odot \Lambda_{P \times 1})}. \tag{5}$$
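The two matrix forms of PCC can be checked against each other numerically. The following is a minimal sketch, assuming NumPy and a symmetric adjacency matrix; the helper names (`sorted_eigenpairs`, `pcc_eq4`, `pcc_eq5`) and the 5-node graph are hypothetical and chosen only for illustration. It also checks the statement that, for $P = 1$, PCC is a scaled version of EVC.

```python
import numpy as np

def sorted_eigenpairs(A):
    """Eigen-decompose a symmetric matrix, ordered by descending |eigenvalue|."""
    eigvals, eigvecs = np.linalg.eigh(A)   # columns of eigvecs are unit-norm eigenvectors
    order = np.argsort(-np.abs(eigvals))   # indices sorted by descending magnitude
    return eigvals[order], eigvecs[:, order]

def pcc_eq4(A, P):
    """PCC following equation 4: row-wise l2 norm of A @ X_{N x P}."""
    _, X = sorted_eigenpairs(A)
    AX = A @ X[:, :P]
    return np.sqrt((AX * AX) @ np.ones(P))

def pcc_eq5(A, P):
    """PCC following equation 5: uses only the eigenvectors and eigenvalues."""
    lam, X = sorted_eigenpairs(A)
    X_P, lam_P = X[:, :P], lam[:P]
    return np.sqrt((X_P * X_P) @ (lam_P * lam_P))

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

c2_eq4 = pcc_eq4(A, 2)
c2_eq5 = pcc_eq5(A, 2)

# For P = 1, PCC equals |lambda_1| times the entry magnitudes of the principal eigenvector.
lam, X = sorted_eigenpairs(A)
c1 = pcc_eq5(A, 1)
evc_scaled = np.abs(lam[0]) * np.abs(X[:, 0])

print(np.allclose(c2_eq4, c2_eq5), np.allclose(c1, evc_scaled))
```

The second form avoids the matrix product with $A$, which may matter for large graphs once the leading eigenpairs are available; for very large sparse graphs a sparse eigensolver would presumably replace the dense `numpy.linalg.eigh` call used here.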

It is important to note a major difference between a traditional signal transform under the KLT and the proposed PCC graph transform. First, recall that, under the KLT, a transform matrix $T$ is derived from a covariance matrix $C$; the eigenvector-based transform $T$ is then applied to any realization of the random signal that has covariance $C$. Meanwhile, under the proposed PCC, the adjacency matrix $A$ plays a dual role: on one hand, it plays the role of the covariance matrix of the KLT; on the other hand, one can think of $A$ as being the signal that is represented compactly by the PCC vector $C_P$. Effectively, the adjacency matrix $A$ represents the social graph (i.e., signal) that we are interested in analyzing, and at the same time $A$ is used to derive the eigendecomposition; hence the dual role for $A$. Later, we will develop the equivalent of an inverse PCC, and we will see this dual role of the adjacency matrix $A$ again.

A. Interpretation of Eigenvalues
The definition of PCC is based on the graph adjacency matrix $A$. For a matrix $A$ of size $N \times N$ its eigenvectors $x_i$ for $1 \leq i \leq N$ are interpreted as $N$-dimensional features (feature vectors) of the set of $N$-dimensional data points represented by their covariance (adjacency) matrix $A$. The magnitude of an eigenvalue corresponding to an eigenvector provides a measure of the importance and prominence of the feature represented by it. The eigenvalue $\lambda_i$ is the power of the corresponding feature $x_i$ in $A$. An alternative representation of a graph's topology is the graph Laplacian matrix, which is frequently used in spectral graph theory [10]. The graph Laplacian can be obtained from the adjacency matrix by setting the diagonal entries of the adjacency matrix to $A_{i,i} = -\sum_{j=1,\, j \neq i}^{N} A_{i,j}$, i.e. a diagonal entry in a Laplacian matrix is the negative of the sum of all off-diagonal entries in the same row in the adjacency matrix (this yields the negative of the standard Laplacian $L = D - A$, where $D$ is the diagonal matrix of row sums; the two share eigenvectors and their eigenvalues differ only in sign, so eigenvalue magnitudes are unaffected). This definition applies equally to weighted and unweighted graphs.
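The Laplacian construction described above can be sketched as follows, assuming NumPy and an adjacency matrix with a zero diagonal. The function names `standard_laplacian` and `diagonal_replaced` and the example graph are hypothetical. The sketch compares the matrix obtained by replacing the diagonal of $A$ with negated row sums against the standard Laplacian $D - A$.

```python
import numpy as np

def standard_laplacian(A):
    """Standard graph Laplacian L = D - A, with D the diagonal matrix of row sums."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def diagonal_replaced(A):
    """Adjacency matrix whose diagonal entries are set to minus the off-diagonal row sums."""
    M = np.asarray(A, dtype=float).copy()
    np.fill_diagonal(M, 0.0)                 # make sure only off-diagonal entries are summed
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

L = standard_laplacian(A)
M = diagonal_replaced(A)
eig_L = np.linalg.eigvalsh(L)   # eigenvalues in ascending order
eig_M = np.linalg.eigvalsh(M)

print(np.allclose(M, -L))       # the two constructions differ only in sign
print(eig_L)
```

Because the two matrices differ only by a sign, a ranking of features by eigenvalue magnitude is the same for both.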
The standard graph Laplacian $L = D - A$ is always positive-semidefinite, which means all of its eigenvalues are non-negative, with at least one eigenvalue equal to 0. The adjacency matrix, however, does not guarantee positive semidefiniteness

and typically has several negative eigenvalues. This is the reason the ordering of features is based on magnitudes of eigenvalues. The bar chart at the top of figure 3 plots histograms of eigenvalues for both the adjacency and Laplacian matrices of the network in figure 2. Why, then, did we not use the Laplacian matrix in the first place? The reason is that the eigendecomposition of the adjacency matrix yields greater energy compaction than that of the Laplacian. The bottom plot in figure 3 shows the normalized cumulative function of the sorted sequence of eigenvalue powers. The curve for the eigenvalues of the adjacency matrix rises faster than that of the Laplacian matrix. The adjacency matrix curve indicates that 25%, 50% and 75% of total power is captured by the first 15 (7.5%), 44 (22%) and 89 (44.5%) features, respectively. In contrast, the Laplacian matrix eigendecomposition shows that the same power levels are contained in its first 26 (13%), 61 (30.5%) and 103 (51.5%) features, respectively. Thus eigendecomposition of the adjacency matrix of social graphs offers more energy compaction, i.e. a set of features of the adjacency matrix captures more energy than the same number of features of the corresponding Laplacian matrix.

B. Interpretation of Eigenvectors
EVC interprets the elements of the Perron eigenvector $x_1$ of the adjacency matrix $A$ as measures of the corresponding nodes' centralities in the network topology (see section II-D). Research on scale-free network topologies has demonstrated EVC's usefulness. However, when applied to large spatial graphs of uniformly, randomly deployed nodes such as the one in figure 2, EVC fails to assign significant scores to a large fraction of nodes. For a broader understanding that encompasses all eigenvectors we revert to the interpretation of eigenvectors as features. One way of understanding PCC is in terms of PCA [11], from which PCC takes part of its name. PCA finds the eigenvectors $x_1, x_2, x_3, \ldots$
, $x_N$ and eigenvalues of $G$'s adjacency matrix $A$. Every eigenvector represents a feature of the adjacency matrix. To understand how these feature vectors are to be interpreted in graphical terms, refer to equation 6, which uses eigenvectors and eigenvalues to reconstruct an approximation $\tilde{A}_P$ of the adjacency matrix $A$. Reconstruction can be performed to varying degrees of accuracy depending on $P$, the number of features/eigenvectors used. If we set $P = N$ in equation 6 (all eigenvectors/eigenvalues are used), the adjacency matrix can be reconstructed without losses (see He [15]). Here, $\Lambda$ denotes the diagonal matrix of eigenvalues sorted in descending order of magnitude


C. Graphical Interpretation of PCC
In this section we evaluate the usefulness of the PCC scores assigned to nodes of a network. Recall that a node's PCC is its $\ell^2$ norm in $P$-dimensional eigenspace. Perceptual limitations restrict us from redrawing the graph in any eigenspace with more than 3 dimensions. Figure 5 is a drawing of the graph in figure 2 in the 3-dimensional eigenspace formed by the 3 most significant eigenvectors of the adjacency matrix $A$. Nodes are colored according to their $C_{15}$ PCC scores, derived from the 15 most significant eigenvectors, divided into 6 equally sized intervals between the lowest and highest PCC score. Based on the interpretation of PCC we expect nodes with higher (red) PCC scores to be located farther away from the origin at (0, 0, 0) than nodes with lower (blue) PCC scores. From figure 5 we can see that this is clearly the case. For clarification, the cluster of low-PCC nodes around the origin (0, 0, 0) is marked with a red, dashed oval.

D. Effect of Number of Features on PCC
In this section we study the effect that varying the number of eigenvectors $P$ has on PCC. For an illustrated example we revert to the randomly generated network topology of 200 nodes in figure 2. We compute PCC while varying $P$ from 1 through 2, 3, 5, 10, 15, 50 and 200. Figures 6a through 6h re-plot the network with nodes colored to indicate their PCC scores. The bin size for all histograms is set to 0.25. Recall that since PCC scores at $P = 1$ are a scaled version of EVC, figure 6a represents the baseline case of EVC. In figure 6a, EVC identifies a small cluster in the upper right corner as the nodes most central to the network. Note that, ironically, this cluster is separable from the larger graph by the removal of merely one link. On the other hand, clusters of nodes in the larger, better connected part of the graph are assigned EVCs on the low end of the scale. As $P$ is increased from figure 6b through 6h, more clusters of high-PCC nodes appear.
As expected, the accompanying histograms below each graph plot show that this has the effect of increasing the variance of PCC scores. Adding successively more features/eigenvectors will have the obvious effect of increasing the sum total of node PCC scores, i.e. $\mathbf{1}_{1 \times N} C_m > \mathbf{1}_{1 \times N} C_n$ when $m > n$. However, it is unclear how much PCC scores change as $P$ is varied from 1 through $N$. In [7] Canright et al. use the phase difference between eigenvectors computed in successive iterations as a stopping criterion for their fully distributed method for computing the principal eigenvector. We use the phase angle between

Fig. 5. Spectral drawing of the graph in three dimensions using entries of $x_1$, $x_2$, and $x_3$ for the three coordinate axes. Nodes are colored according to their $C_{15}$ PCC.

on the diagonal (from upper left corner to lower right corner).

$$\tilde{A}_P = X_{N \times P}\, \Lambda_{P \times N}\, X^{T}_{N \times N} \tag{6}$$

where

$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_N \end{bmatrix}$$

and, by analogy with $X_{N \times P}$, $\Lambda_{P \times N}$ denotes its first $P$ rows.

To illustrate, consider the unweighted, undirected graph $G(V, E)$ shown in figure 2 with adjacency matrix $A$. $A$'s entries are either 0 or 1. However, this is not necessarily true for $\tilde{A}_P$, the version of the matrix reconstructed using the $P$ most significant eigenvectors. The entries in $\tilde{A}_P$ will very likely be fractional. Therefore, before viewing the recovered topology in the reconstructed adjacency matrix $\tilde{A}_P$, its entries have to be thresholded. Prior to plotting the topology, we rounded values less than 0.5 down to 0 and values larger than or equal to 0.5 up to 1. Figure 4 plots the adjacency matrix reconstructed from the most significant 1, 2, 3, 5, 10, 15, 50 and all 200 feature vectors. The plot for $\tilde{A}_1$ shows that the recovered topology information is highly localized to the vicinity of nodes with the highest EVC. The plot using $\tilde{A}_2$ adds another highly connected but still very localized cluster to the network. Adding more feature vectors extends the set of connected nodes in various parts of the network. Adding more eigenvectors to the computation of PCC has the effect of increasing the resolution of centrality scores of nodes lying in less well connected regions of the network.
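The reconstruction and thresholding steps can be sketched as below, assuming NumPy and a symmetric adjacency matrix. The function name `reconstruct_adjacency` and the small example graph are hypothetical. The code computes the truncated sum $\sum_{k=1}^{P} \lambda_k x_k x_k^T$, which is an equivalent way of writing the matrix product in equation 6.

```python
import numpy as np

def reconstruct_adjacency(A, P, threshold=0.5):
    """Rebuild A from its P most significant eigenpairs, then threshold to 0/1."""
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(eigvals))[:P]   # indices of the P largest |eigenvalues|
    X_P = eigvecs[:, order]
    lam_P = eigvals[order]
    A_tilde = (X_P * lam_P) @ X_P.T            # sum over k of lambda_k * x_k x_k^T
    A_binary = (A_tilde >= threshold).astype(int)
    return A_binary, A_tilde

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

A_full_binary, A_full = reconstruct_adjacency(A, P=5)   # all eigenpairs: lossless
A_one_binary, A_one = reconstruct_adjacency(A, P=1)     # principal eigenpair only

print(np.array_equal(A_full_binary, A.astype(int)))
print(A_one_binary)
```

With all eigenpairs the original matrix is recovered, while with $P = 1$ only the entries dominated by the principal eigenvector survive the 0.5 threshold.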

Fig. 6. PCC of nodes in the network of figure 2 when computed using the first 1, 2, 3, 5, 10, 15, 50 and all 200 eigenvectors (panels (a) through (h), showing $C_1$, $C_2$, $C_3$, $C_5$, $C_{10}$, $C_{15}$, $C_{50}$ and $C_{200}$). The histograms accompanying each graph plot show the distribution of PCC of their nodes. The red line plot in each histogram represents the average PCC histogram of 50 randomly generated networks with the same parameters as the network in figure 2.

PCC vectors and EVC to study the effect of adding more features. We compute the phase angle $\phi(P)$ of a PCC vector $C_P$ using $P$ features with the EVC vector $C_E$ as,

$$\phi(P) = \arccos\left(\frac{C_P \cdot C_E}{|C_P|\,|C_E|}\right). \tag{7}$$
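A minimal sketch of equation 7 follows, assuming NumPy. The helper names `pcc` and `phase_angle` and the example graph are hypothetical. The EVC vector is taken to be the $P = 1$ PCC vector, which is an assumption that relies on the earlier statement that PCC with $P = 1$ is a scaled version of EVC and therefore points in the same direction.

```python
import numpy as np

def pcc(A, P):
    """PCC scores from the P eigenpairs of largest eigenvalue magnitude."""
    eigvals, eigvecs = np.linalg.eigh(A)
    order = np.argsort(-np.abs(eigvals))[:P]
    return np.sqrt((eigvecs[:, order] ** 2) @ (eigvals[order] ** 2))

def phase_angle(c_p, c_e):
    """Angle in radians between a PCC vector and the EVC vector (equation 7)."""
    cos_phi = np.dot(c_p, c_e) / (np.linalg.norm(c_p) * np.linalg.norm(c_e))
    return np.arccos(np.clip(cos_phi, -1.0, 1.0))   # clip guards against round-off

# Hypothetical 5-node undirected graph: a triangle 0-1-2 with a tail 2-3-4.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

c_e = pcc(A, 1)   # assumption: a scaled EVC has the same direction as the EVC vector
angles = [phase_angle(pcc(A, P), c_e) for P in range(1, A.shape[0] + 1)]
print(angles)
```

Since both vectors have non-negative entries, the angle always lies between 0 and $\pi/2$, and it is 0 (up to round-off) for $P = 1$.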

Here, $\cdot$ denotes the inner product operator. The relationship of the phase angle with the number of features used in PCC for the network under consideration is plotted in figure 7. Initially, the function of phase angle rises sharply and then levels off almost completely at 22 features. This means that, in this example, the relative PCCs of nodes cease to change with the addition of more

avoid redundancy and, in effect, observe a greater number of nodes with the same number of monitors. This is why our focus is on identifying influential neighborhoods rather than influential nodes. To this end we demonstrate the application of PCC to two social network data sets, one derived from Google's Orkut social network service and the other from Facebook's Fighters Club application.
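The definition of a social hub as a local maximum of centrality can be sketched as follows, assuming NumPy. The function name `social_hubs`, the path graph and the scores are hypothetical. Excluding isolated nodes is an assumption made here for illustration and is not stated in the text.

```python
import numpy as np

def social_hubs(A, scores):
    """Return indices of nodes whose score is strictly higher than all neighbours' scores."""
    A = np.asarray(A)
    scores = np.asarray(scores, dtype=float)
    hubs = []
    for i in range(A.shape[0]):
        neighbours = np.flatnonzero(A[i])
        neighbours = neighbours[neighbours != i]   # ignore self-loops
        # Assumption: nodes without neighbours are not counted as hubs.
        if neighbours.size > 0 and np.all(scores[i] > scores[neighbours]):
            hubs.append(i)
    return hubs

# Hypothetical path graph 0-1-2-3-4 with made-up centrality scores.
A_path = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
example_scores = [0.1, 0.9, 0.3, 0.8, 0.2]

hubs = social_hubs(A_path, example_scores)
print(hubs)   # [1, 3]
```

Nodes 1 and 3 are selected even though they are only two hops apart, because each outscores both of its own neighbours. Placing monitors at such local maxima, rather than at the top-ranked nodes overall, is what spreads them over different vicinities of the graph.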

A. Orkut Data Set
The first data set is a friendship graph of 70,000 users of Google's Orkut social network service, obtained from subscribers' friends lists. It was originally collected by Mislove [21] and constitutes an unweighted, undirected graph of 70,000 user nodes with 2,971,776 undirected links between them. It was processed and analyzed in MATLAB 7.4 (R2007a) on a Dell PowerEdge server with an Intel Quad-Core Xeon 2.13 GHz processor and 4 GB RAM. Our objective in applying PCC to this data set is the discovery of more social hubs than are identified by EVC. Figure 8a plots the EVC scores of all 70,000 nodes in the data set. It shows that node 692 in the network has the highest EVC, followed by a cluster of nodes with node IDs centered around 43,000. As in the example illustrated earlier, the remaining majority of nodes is assigned centrality scores close to 0. The histogram of node EVCs in figure 8b confirms this. Earlier, for the sample network in section III, we determined the number of features $P$ to use for the computation of PCC as the number of eigenvalues after which the rate of growth of their cumulative sum begins to decline significantly. An alternative approach, which yielded a clearer cut-off point for the selection of the number of features to use in PCC, was the plot of the rate of change in the phase angle of PCC vectors with the EVC vector (see figure 7). Figure 8c plots the phase angles of PCC while varying the number of features from 1 through 100. We select $P = 14$ as a cut-off point for the computation of PCC (marked in red). Figure 8e plots the PCC vector of all nodes in the Orkut graph. When we compare it to the plot of the EVC vector, what stands out immediately is how many nodes with near-zero EVCs are assigned higher and highly varying PCC scores.
Figure 8f is the histogram of PCC scores which, at a standard deviation of 6839.7, is more spread out than that of the EVC with a standard deviation of 4382.7. Figure 8d plots the number of local maxima that are found in the graph for values of P. At first glance it

Fig. 7. Plot of phase angles (in radians) of PCC vectors with the EVC vector for the graph in figure 6.
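The phase angle plotted in figure 7 can be reproduced in outline as below. The sketch assumes that PCC scores each node by the Euclidean norm of its coordinates along the first P eigenvectors, each scaled by its eigenvalue, and that eigenpairs are ordered by decreasing eigenvalue magnitude; both should be checked against the PCC definition given earlier in the paper. The function names and the hand-derived eigenpairs of a triangle graph are illustrative only, and any eigensolver could supply the inputs.

```python
import math


def pcc(eigenvalues, eigenvectors, p):
    """P-feature PCC: norm of each node's eigenvalue-scaled coordinates.

    eigenvalues:  list of eigenvalues, most significant first (assumed order).
    eigenvectors: list of unit eigenvectors (each a list over nodes), same order.
    """
    n = len(eigenvectors[0])
    return [
        math.sqrt(sum((eigenvalues[k] * eigenvectors[k][i]) ** 2 for k in range(p)))
        for i in range(n)
    ]


def phase_angle(u, v):
    """Angle in radians between two score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


# Hand-derived eigenpairs of the triangle graph (adjacency = ones minus identity).
s3, s2, s6 = math.sqrt(3), math.sqrt(2), math.sqrt(6)
eigenvalues = [2.0, -1.0, -1.0]
eigenvectors = [
    [1 / s3, 1 / s3, 1 / s3],
    [1 / s2, -1 / s2, 0.0],
    [1 / s6, 1 / s6, -2 / s6],
]
evc = [abs(x) for x in eigenvectors[0]]  # EVC taken as the principal eigenvector
angles = [phase_angle(pcc(eigenvalues, eigenvectors, p), evc) for p in (1, 2, 3)]
```

With P = 1 the PCC vector is a scaled copy of the EVC vector, so the angle is zero up to rounding. With all eigenvectors included, the squared PCC of a node in an unweighted graph equals its degree, which gives a quick sanity check on an implementation.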

features beyond the first 22 features. The phase angle plot may be used for determining how many features are sufficient for the computation of the PCC of a network.

IV. APPLICATION OF PCC TO SOCIAL NETWORKS

In this section we apply PCC to a large-scale data set obtained from the friend network of the Orkut social networking service [16]. We apply PCC to identify social hubs in the network. We motivate the search for local centrality maxima with the following application. Suppose we wish to deploy a limited number of monitors in a social graph to spot the emergence and adoption of as many trends as possible. The spread of trends in social networks is modeled as an influence process. The degree to which a node is influential in the spread of a trend in the long term is measured by its EVC, which was explained in detail in section II-D. Recall from the earlier example network (on which we studied PCC in detail) that EVC and PCC vary gradually along a path. We define a social hub in a network as a node whose centrality score forms a local maximum, i.e. its centrality score is higher than those of all of its neighbors (see figure 1 for an illustration of social hubs and the corresponding local maxima). The number of local maxima identified is used as a performance metric. The results obtained by using EVC provide the baseline for comparison. On a walk over a graph, the EVCs of nodes change gradually, i.e. a node's EVC is high in part because its neighbors' EVCs are also high. This means that if we were to pick nodes for placement of monitors in descending order of EVC, quite a few would end up monitoring the same well-connected cluster of nodes. This would introduce redundancy in the monitoring at the cost of coverage. With a limited number of monitors available, it would be more desirable to position them in different vicinities of the graph. If, on the other hand, a node is selected only if its centrality measure is significant and locally maximum we can

Fig. 8. Orkut data set: a) EVC scores of nodes, b) A histogram of EVC scores, c) Phase angles of PCC vectors $C_1$ through $C_{100}$ with the EVC vector $C_E$, d) Number of local maxima discovered using PCCs of varying number of features, e) PCC scores of nodes using 14 features, f) A histogram of PCC scores of nodes using 14 features, g) A histogram of EVCs of the 20 social hubs with the highest EVCs, h) A histogram of PCCs of the 20 social hubs with the highest PCCs based on 14 features, i) The size of the intersection set of $S_P$ and $V(C_P[|S_P|])$ for $1 \le P \le 100$, and j) The size of the intersection set of $S_{P_1}$ and $S_{P_2}$ when $1 \le P_1, P_2 \le 100$.
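The quantity shown in panel (i), the overlap between the hub set $S_P$ and the $|S_P|$ highest-scoring nodes, reduces to a ranking and a set intersection. The sketch below is illustrative: the function name `top_k_overlap` and the toy scores are assumptions, and ties in the ranking are broken arbitrarily by input order.

```python
def top_k_overlap(hubs, centrality):
    """Size of the intersection of the hub set with the top-|hubs| nodes by score."""
    k = len(hubs)
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    return len(set(hubs) & set(ranked[:k]))


# Toy scores on a path a-b-c-d-e-f; b and e are the local maxima (hubs).
scores = {"a": 0.8, "b": 0.9, "c": 0.7, "d": 0.2, "e": 0.4, "f": 0.1}
hub_set = ["b", "e"]
overlap = top_k_overlap(hub_set, scores)
print(overlap)  # 1
```

Here the two highest-scoring nodes are b and its neighbor a, so ranking by score alone would spend the second monitor in the neighborhood already covered by b.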

might appear that EVC found 84, while PCC improves this number to 91 when using just 14 out of 70,000 possible features. In the plotted range a maximum of 95 maxima are identified when 61 features are used. We examine how many of the social hubs identified are in fact trivial maxima with low centrality. This is accomplished by viewing the centrality scores of nodes identified as social hubs. Let $S_n$ denote the set of social hubs identified by using PCC with n features/eigenvectors. Figure 8g is the histogram of the EVCs of the top 20 EVC-scoring nodes in $S_1$ (the set of social hubs identified using only EVC, i.e. PCC with 1 feature). Only one node (node number 692, the node with the highest EVC of all 70,000 nodes in figure 8a) truly stands out with a high EVC, whereas the EVC scores of the other 19 social hubs lie in the lowest bin of the histogram. This behavior is consistent with our observations in the illustrated example in figure 2, where EVC was only able to identify a single social hub. Thus, after the definition of local maxima excludes nodes surrounding the most central node, EVC fails to identify any other influential neighborhood. One might wonder whether, among the more than 2000 nodes with pronounced EVC scores in figure 8a, there is not a single node besides node 692 that might be a local maximum. We verified from the data set that node 692 has 2185 neighbors, most of which have node IDs in the range between 42134 and 44314 (clearly visible in figure 8a). Since these neighbors all score below node 692, none of them can be a local maximum.

In contrast, figure 8h is a histogram of the PCC scores of the 20 nodes with the highest PCC scores in $S_{14}$ (the set of social hubs identified using $C_{14}$, PCC with 14 features). Here, a total of 8 social hubs have non-trivial PCC scores; the remaining 12 social hubs have PCC scores too low to be considered significant. The IDs of the nodes identified as social hubs of influential neighborhoods, in descending order of PCC, are 692, 317, 4749, 487, 39, 14857, 35348 and 12219. This is a substantial improvement over the single neighborhood identified using EVC. Thus, using PCC in conjunction with the node selection criterion provided by the definition of local maxima identifies many more influential neighborhoods in a social network than is possible by using EVC.

We also ask how different the set of nodes identified as social hubs is from the nodes we would have identified as central were we to rely solely on nodes' centrality scores. Specifically, we compare $S_P$, the set of nodes identified as social hubs based on $C_P$, with $V(C_P[|S_P|])$, the vertex set (returned by the function $V()$) of the first $|S_P|$ nodes ranked in descending order of $C_P$. Figure 8i plots the size of the intersection set $|S_P \cap V(C_P[|S_P|])|$. The data point at P = 1 is 1 and is the number of nodes common to the set of social hubs identified by EVC and the set of nodes identified by EVC scores alone. For the range $1 \le P \le 100$, at most 3 nodes identified as social hubs are also present in the set of the first $|S_P|$ most central nodes, i.e. placing monitors at nodes based solely on their centrality scores produces a lot of redundant coverage. We compute the sizes of intersection sets of all pairs

of $S_{P_1}$ and $S_{P_2}$ for $1 \le P_1, P_2 \le 100$. This is plotted in figure 8j. It shows that when the number of features is increased from $P_1$ to $P_2$, the set of social hubs $S_{P_2}$ is (almost) always a superset of the set $S_{P_1}$ if $P_1 < P_2$. Thus the inclusion of more feature vectors adds members to the set of social hubs without removing previous ones.

B. Facebook Fighters' Club Data Set

The second data set is derived from a list of matches played between users of Facebook's Fighters' Club application. This data set contains 143,020 users and 667,560 matches, and was originally collected by Nazir et al. [22]. It differs from the Orkut friendship graph in that it is a weighted, undirected graph. Each vertex in the weighted gaming graph represents a user of the application, and there are 526,224 edges between them. The weight of an edge between two users is the number of matches recorded between them in the data set. Edge weights in this data set range from 1 to 29. Figure 9a plots the EVC scores of all 143,020 nodes in the data set. Unlike in the preceding Orkut data set, only very few nodes are assigned EVCs significantly greater than 0. The group of nodes in the node ID space above 130,000 are the only ones with high EVCs, while almost all other nodes have near-zero EVCs. The histogram of node EVCs in figure 9b confirms this. Figure 9c plots the phase angles of the PCC vector while varying the number of features from 1 through 100. Figure 9d plots the number of local maxima that are found in the gaming graph for values of $1 \le P \le 100$. EVC finds approximately $1.23 \times 10^5$ local maxima, while PCC increases this number slightly to $1.232 \times 10^5$ when using just 20 out of 143,020 possible features.
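The weighted gaming graph described for this data set, where an edge weight counts the matches between two users, can be assembled from a match list roughly as follows. This is a sketch under assumptions: the function names, the tuple-based match format, and the handling of self-matches are introduced here and are not taken from the data set's actual format.

```python
from collections import Counter


def build_weighted_graph(matches):
    """Collapse a list of (user_a, user_b) matches into undirected edge weights."""
    weights = Counter()
    for a, b in matches:
        if a == b:
            continue  # self-matches are assumed not to occur; skip defensively
        weights[frozenset((a, b))] += 1  # frozenset makes (a, b) == (b, a)
    return weights


def to_adjacency(weights):
    """Symmetric adjacency as dict-of-dicts: adj[u][v] == adj[v][u] == weight."""
    adj = {}
    for pair, w in weights.items():
        u, v = tuple(pair)
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


# Four made-up matches between three users collapse into two weighted edges.
matches = [("ann", "bob"), ("bob", "ann"), ("ann", "cy"), ("bob", "ann")]
weights = build_weighted_graph(matches)
adj = to_adjacency(weights)
print(len(weights), adj["ann"]["bob"])  # 2 3
```

Repeated matches between the same pair add weight rather than edges, which is why the data set has fewer edges than matches.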
The phase angle attains a stable value around P = 10 features (marked by a red line), so we use P = 10 for the computation of PCC, i.e. $C_{10}$. Figure 9e plots the PCCs of all nodes in the graph with 10 features and figure 9f is their histogram. This again raises the question of how different $S_P$, the set of nodes classified as social hubs based on $C_P$, is from $V(C_P[|S_P|])$, the vertex set (returned by the function $V()$) of the first $|S_P|$ nodes ranked in descending order of $C_P$. Figure 9i plots the size of the intersection set $|S_P \cap V(C_P[|S_P|])|$. The data point at P = 1 is approximately $1.23 \times 10^5$ and is the number of nodes common to the set

Fig. 9. Facebook Fighters' Club application data set: a) EVC scores of nodes, b) A histogram of EVC scores, c) Phase angles of PCC vectors $C_1$ through $C_{100}$ with the EVC vector $C_E$, d) Number of local maxima discovered using PCCs of varying number of features, e) PCC scores of nodes using 10 features, f) A histogram of PCC scores of nodes using 10 features, g) A histogram of EVCs of the 200 social hubs with the highest EVCs, h) A histogram of PCCs of the 200 social hubs with the highest PCCs based on 10 features, i) The size of the intersection set of $S_P$ and $V(C_P[|S_P|])$ for $1 \le P \le 100$, and j) The size of the intersection set of $S_{P_1}$ and $S_{P_2}$ when $1 \le P_1, P_2 \le 100$.
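The pairwise comparison in panel (j) amounts to intersecting every pair of hub sets and checking whether sets found with fewer features are contained in those found with more. The sketch below uses hypothetical function names and made-up hub sets purely for illustration.

```python
def pairwise_intersections(hub_sets):
    """hub_sets: dict mapping P -> set of hub nodes found with P features."""
    ps = sorted(hub_sets)
    return {(p1, p2): len(hub_sets[p1] & hub_sets[p2]) for p1 in ps for p2 in ps}


def nesting_violations(hub_sets):
    """Pairs (p1, p2) with p1 < p2 where the hubs for p1 are not all kept at p2."""
    ps = sorted(hub_sets)
    return [
        (p1, p2)
        for i, p1 in enumerate(ps)
        for p2 in ps[i + 1:]
        if not hub_sets[p1] <= hub_sets[p2]
    ]


# Made-up hub sets: hub "a" is lost when going to 3 features.
hub_sets = {1: {"a"}, 2: {"a", "b"}, 3: {"b", "c"}}
sizes = pairwise_intersections(hub_sets)
violations = nesting_violations(hub_sets)
print(sizes[(1, 2)], violations)  # 1 [(1, 3), (2, 3)]
```

An empty violation list would correspond to the strict superset behavior; the "(almost) always" in the text suggests a small number of such pairs can occur in practice.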

of social hubs identified by EVC and the set of nodes identified by EVC scores alone. As we proceed along the horizontal axis, the number of features used to compute PCC is increased. Each data point is the number of nodes in the intersection of the set of social hubs found by P-feature PCC ($C_P$) with the equally sized set of most central nodes by PCC $C_P$. As P increases in the range $1 \le P \le 100$, the size of the intersection set drops rapidly at first and then climbs back close to the starting value at 65 features.

We compute the sizes of intersection sets of all pairs of $S_{P_1}$ and $S_{P_2}$ for $1 \le P_1, P_2 \le 100$. This is plotted in figure 9j. It shows that when the number of features is increased from $P_1$ to $P_2$, the set of social hubs $S_{P_2}$ is (almost) always a superset of the set $S_{P_1}$ if $P_1 < P_2$. Thus the inclusion of more feature vectors retains a large fraction of the social hubs identified using fewer features.

V. CONCLUSIONS

We reviewed previously defined measures of centrality and pointed out their shortcomings in general and those of EVC in particular. We introduced PCC, a new measure of node centrality. PCC is based on PCA and the KLT, and takes the view of treating a graph's adjacency matrix as a covariance matrix. PCC interprets a node's centrality as its $\ell^2$ norm from the origin in the eigenspace formed by the P most significant feature vectors (eigenvectors) of the adjacency matrix. Unlike EVC, PCC allows the addition of more features for the computation of node centralities. We explored two criteria for the selection of the number of features to use for PCC: a) the relative contribution of each feature's power (eigenvalue) to the total power of the adjacency matrix, and b) incremental changes in the phase angle between the PCC with P features and the EVC as P is increased. We also provided a visual interpretation of significant eigenvectors of an adjacency matrix. The use of the adjacency matrix was compared with that of the Laplacian, and it was shown that eigendecomposition of the adjacency matrix yields a significantly higher degree of energy compaction than does the Laplacian at the same number of features. We also investigated the effect of adding successive eigenvectors and the information they contain by looking at reconstructions of the original graph's topology using a subset of features.

We applied PCC analysis to Google's Orkut social networking service. Our objective was the identification of social hubs in social networks that are left undiscovered by EVC. In the case of the Orkut graph we saw that using the 14 most significant eigenvectors out of a possible 70,000 raises the number of influential neighborhoods discovered from just 1 (that around the most central node) to 8 (including the one identified by EVC). The increase in the number of social hubs found using PCC is even greater. The top 200 social hubs found using PCC all have normalized PCC greater than 0.1. The social hubs found using EVC, however, contain only 13 social hubs with normalized EVC greater than 0.1. The Orkut friendship graph we used was unweighted and undirected, while the Facebook application graph was undirected and weighted. However, in order to ensure that eigenvalues and eigenvectors remain real, the graphs must be undirected. We compared the sets of nodes identified as social hubs with the set of highest scoring nodes by centrality alone and saw that many of the nodes with high centralities belong to the same neighborhood. Thus, the notion of a local maximum serves its purpose of removing neighbors of highly central nodes. A comparison of social hubs discovered by PCC with different numbers of features showed that the addition of more features in PCC adds new social hubs to the list of identified hubs without replacing previously identified ones.

In the future we intend to extend the definition of PCC so it can be applied to both directed and undirected graphs. Furthermore, we propose to formulate a distributed method for computing PCC along the lines of Canright's method [7] for computing EVC in peer-to-peer systems.

REFERENCES
[1] Z. Bai, J. Demmel, J. Dongarra, A. Petitet, H. Robinson, and K. Stanley, "The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers," SIAM Journal on Scientific Computing, vol. 18, pp. 1446-1461, 1997.
[2] C. Bischof and I. for Defense Analyses, Parallel Performance of a Symmetric Eigensolver Based on the Invariant Subspace Decomposition Approach. Supercomputing Research Center: IDA, 1994.
[3] P. Bonacich, "Power and centrality: A family of measures," American Journal of Sociology, vol. 92, pp. 1170-1182, 1987.
[4] P. Bonacich and P. Lloyd, "Eigenvector-like measures of centrality for asymmetric relations," Social Networks, vol. 23, no. 3, pp. 191-201, 2001.
[5] S. Borgatti, "Centrality and network flow," Social Networks, vol. 27, no. 1, pp. 55-71, 2005.
[6] S. Borgatti and M. Everett, "A graph-theoretic perspective on centrality," Social Networks, vol. 28, no. 4, pp. 466-484, 2006.
[7] G. Canright, K. Engø-Monsen, and M. Jelasity, "Efficient and robust fully distributed power method with an application to link analysis," Department of Computer Science, University of Bologna, Tech. Rep. UBLCS-2005-17, pp. 2005-17, 2005.
[8] G. Canright and K. Engø-Monsen, "Spreading on networks: A topographic view," ComPlexUs, vol. 3, no. 1-3, pp. 131-146, 2006.
[9] I. Carreras, D. Miorandi, G. Canright, and K. Engø-Monsen, "Understanding the spread of epidemics in highly partitioned mobile networks," in Proceedings of the 1st international conference on Bio inspired models of network, information and computing systems. ACM New York, NY, USA, 2006.
[10] F. R. K. Chung, Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society, February 1997. [Online]. Available: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0821803158
[11] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley New York, 2001.
[12] M. Everett and S. Borgatti, "Extending centrality," P. J. Carrington, J. Scott, and S. Wasserman, Eds. Cambridge University Press, 2005.
[13] L. Freeman, "A set of measures of centrality based on betweenness," Sociometry, vol. 40, no. 1, pp. 35-41, 1977.
[14] N. Friedkin, "Theoretical foundations for centrality measures," American Journal of Sociology, vol. 96, no. 6, p. 1478, 1991.
[15] H. He, "Eigenvectors and reconstruction," The Electronic Journal of Combinatorics, vol. 14, no. 14, p. 1, 2007.
[16] G. Inc., "orkut - welcome to the orkut club!" http://www.orkut.com/Signup, Tech. Rep., Oct 14, 2009.
[17] A. Kansal, M. Goraczko, and F. Zhao, "Building a sensor network of mobile phones," in Proceedings of the 6th international conference on Information processing in sensor networks. ACM, 2007, p. 548.
[18] C. Kohlschütter, P. Chirita, and W. Nejdl, "Efficient parallel computation of pagerank," Lecture Notes in Computer Science, vol. 3936, p. 241, 2006.
[19] A. Langville, C. Meyer, and P. Fernández, "Google's pagerank and beyond: the science of search engine rankings," The Mathematical Intelligencer, vol. 30, no. 1, pp. 68-69, 2008.
[20] S. Milgram, "The small world problem," Psychology Today, vol. 2, no. 1, pp. 60-67, 1967.
[21] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Measurement and analysis of online social networks," in Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC'07), San Diego, CA, October 2007.
[22] A. Nazir, S. Raza, and C. Chuah, "Unveiling facebook: a measurement study of social network based applications," in Proceedings of the 8th ACM SIGCOMM conference on Internet measurement. ACM New York, NY, USA, 2008, pp. 43-56.
[23] M. Newman, "Analysis of weighted networks," Physical Review E, vol. 70, no. 5, p. 56131, 2004.
[24] Orkut, "Site profile for orkut.com (rank 4,168) complete," http://siteanalytics.compete.com/orkut.com/?metric=uv, Tech. Rep., September, 2009.
[25] B. Pál and P. Scientist, "Sensorplanet: An open global research framework for mobile device centric wireless sensor networks," in MobiSensors: NSF Workshop on Data Management for Mobile Sensor Networks, Pittsburgh, 2007.
[26] B. Ruhnau, "Eigenvector-centrality - a node-centrality?" Social Networks, vol. 22, no. 4, pp. 357-365, 2000.
[27] K. Sankaralingam, S. Sethumadhavan, and J. Browne, "Distributed pagerank for p2p systems," in High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on, 2003, pp. 58-68.
[28] F. Tisseur and J. Dongarra, "A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures," SIAM Journal on Scientific Computing, vol. 20, p. 2223, 1999.
[29] S. Wasserman and K. Faust, Social network analysis: Methods and applications. Cambridge University Press, 1994.
[30] D. Watts, "Networks, dynamics, and the small-world phenomenon," American Journal of Sociology, vol. 105, no. 2, pp. 493-527, 1999.
[31] D. Watts, Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton University Press, 1999.