
CS109/Stat121/AC209/E-109

Data Science
Network Models
Hanspeter Pfister & Joe Blitzstein

pfister@seas.harvard.edu / blitzstein@stat.harvard.edu

This Week
HW4 due tonight at 11:59 pm

Friday lab 10-11:30 am in MD G115


Examples from Newman (2003)

FIG. 2 Three examples of the kinds of networks that are the topic of this review. (a) A food web of predator-prey interactions between species in a freshwater lake [272]. Picture courtesy of Neo Martinez and Richard Williams. (b) The network of collaborations between scientists at a private research institution [171]. (c) A network of sexual contacts between individuals in the study by Potterat et al. [342].
Graphs
A graph G=(V,E) consists of a vertex set V and an

edge set E containing unordered pairs {i,j} of vertices.

[Figure: a graph (left) and a multigraph (right)]
The degree of vertex v is the number of
edges attached to it.
A Plea for Clarity: What is a Network?

graph vs. multigraph (are loops and multiple edges allowed? what counts as a simple graph?)

directed vs. undirected

weighted vs. unweighted

dynamics of the network vs. dynamics on the network

labeled vs. unlabeled

network as the quantity of interest vs. quantities of interest on networks
Why model networks?

Raw network drawings ("hairballs") are hard to interpret.

We can define interesting features (statistics) of a network, such as measures of clustering, and compare the observed values against those predicted by a model.

Warning: much of the network literature carelessly ignores how the network data were gathered (sampling) and whether there are missing/unknown nodes or edges!
Erdős–Rényi Random Graph Model

For each pair of vertices, independently flip a coin with probability p of heads; include that edge exactly when the coin lands heads.



Let n get large and p get small, with the average
degree c = (n-1)p held constant.

What happens for c < 1?

What happens for c > 1?

What happens for c = 1?
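These questions can be explored empirically. A minimal simulation sketch (the helper names sample_gnp and largest_component_size are mine, not from the course materials):

```python
import random

def sample_gnp(n, p, rng):
    """Sample an Erdos-Renyi G(n, p) graph as an adjacency list."""
    adj = {v: [] for v in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # independent coin flip for each pair
                adj[i].append(j)
                adj[j].append(i)
    return adj

def largest_component_size(adj):
    """Size of the largest connected component, found by traversal."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], 0
        seen.add(start)
        while stack:
            v = stack.pop()
            comp += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, comp)
    return best

rng = random.Random(42)
n = 2000
for c in (0.5, 2.0):
    p = c / (n - 1)
    frac = largest_component_size(sample_gnp(n, p, rng)) / n
    print(f"c = {c}: largest component holds {frac:.1%} of the vertices")
```

With c = 0.5 the largest component stays tiny (order log n), while c = 2.0 produces a giant component containing a constant fraction of all vertices.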
Degree Sequences
Take V = {1, . . . , n} and let di be the degree of vertex i.
The degree sequence of G is d = (d1 , . . . , dn ).

[Figure: a graph on 9 labeled vertices]

n = 9, d = (3, 4, 3, 3, 4, 3, 3, 3, 2)

A sequence d is graphical if there is a graph G with degree sequence d; such a G is called a realization of d.
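Whether a sequence is graphical can be checked without building a realization, using the Erdős–Gallai conditions (standard, though not stated on the slide): the degree sum must be even, and a family of partial-sum inequalities must hold. A sketch:

```python
def is_graphical(d):
    """Erdos-Gallai test: d is graphical iff sum(d) is even and, for
    every k, the k largest degrees satisfy
    sum(d[:k]) <= k*(k-1) + sum(min(d_i, k) for the remaining d_i)."""
    d = sorted(d, reverse=True)
    n = len(d)
    if sum(d) % 2 != 0 or any(x < 0 or x > n - 1 for x in d):
        return False
    for k in range(1, n + 1):
        lhs = sum(d[:k])
        rhs = k * (k - 1) + sum(min(x, k) for x in d[k:])
        if lhs > rhs:
            return False
    return True

print(is_graphical([3, 4, 3, 3, 4, 3, 3, 3, 2]))  # the sequence above
print(is_graphical([3, 3, 3, 1]))                 # fails the k = 2 inequality
```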
MCMC on Networks
mixing times, burn-in, bottlenecks, autocorrelation,...

Switchings Chain

[Figure: one step of the switchings chain on vertices 1-4: a pair of disjoint edges is replaced by another pair on the same four vertices, preserving every degree]
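One step of the switchings chain can be sketched as follows (function names are mine); rejecting any proposal that would create a loop or a repeated edge keeps the chain on simple graphs with the given degree sequence:

```python
import random

def degree_sequence(edges, n):
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def switch_step(edges, rng):
    """One switchings-chain step: pick two edges (a,b), (c,d) and propose
    rewiring them to (a,d), (c,b). Reject if that would create a loop or
    a repeated edge, so the state stays a simple graph."""
    i, j = rng.sample(range(len(edges)), 2)
    a, b = edges[i]
    c, d = edges[j]
    new1, new2 = frozenset((a, d)), frozenset((c, b))
    present = {frozenset(e) for e in edges}
    if len(new1) < 2 or len(new2) < 2 or new1 in present or new2 in present:
        return  # rejected proposal: the chain simply stays put
    edges[i], edges[j] = (a, d), (c, b)

rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print("before:", degree_sequence(edges, 4))
for _ in range(1000):
    switch_step(edges, rng)
print("after: ", degree_sequence(edges, 4))  # same degrees, different graph
```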
Power Laws

Power-law (a.k.a. scale-free) networks: the number of vertices of degree k is proportional to k^(-γ).

Stumpf et al (2005): subnets of scale-free networks are not scale-free, especially for large values of the exponent γ. Their subnets are formed by i.i.d. node-based sampling.

What about features other than degree distributions?

p1 Model (Holland-Leinhardt 1981)

A model for directed graphs on the set of n nodes. Holland and Leinhardt's p1 model focuses on dyads: it keeps track of whether node i links to j, j to i, neither, or both. It contains the following parameters:

θ: a base rate for edge propagation,

α_i (expansiveness): the effect of an outgoing edge from i,

β_j (popularity): the effect of an incoming edge into j,

ρ_ij (reciprocation/mutuality): the added effect of reciprocated edges.

Let P_ij(0,0) be the probability of the absence of an edge between i and j, P_ij(1,0) the probability of i linking to j (1 indicates the outgoing node of the edge), and P_ij(1,1) the probability of i linking to j and j linking to i. The p1 model posits the following (see [149]):

log P_ij(0,0) = λ_ij,
log P_ij(1,0) = λ_ij + α_i + β_j + θ,
log P_ij(0,1) = λ_ij + α_j + β_i + θ,
log P_ij(1,1) = λ_ij + α_i + β_j + α_j + β_i + 2θ + ρ_ij,

where λ_ij is a normalizing constant making the four probabilities sum to 1.
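Since each dyad has only four states, λ_ij can be solved in closed form. A small sketch (the parameter values are made up for illustration):

```python
import math

def dyad_probs(theta, alpha_i, beta_j, alpha_j, beta_i, rho_ij):
    """Return the four p1 dyad probabilities
    P_ij(0,0), P_ij(1,0), P_ij(0,1), P_ij(1,1),
    solving for lambda_ij so that they sum to 1."""
    # log-probabilities up to the common additive constant lambda_ij
    logits = [
        0.0,                                                      # (0,0)
        alpha_i + beta_j + theta,                                 # (1,0)
        alpha_j + beta_i + theta,                                 # (0,1)
        alpha_i + beta_j + alpha_j + beta_i + 2 * theta + rho_ij, # (1,1)
    ]
    lam = -math.log(sum(math.exp(t) for t in logits))  # lambda_ij
    return [math.exp(lam + t) for t in logits]

probs = dyad_probs(theta=-1.0, alpha_i=0.3, beta_j=0.2,
                   alpha_j=-0.1, beta_i=0.4, rho_ij=1.5)
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

A positive ρ_ij pushes probability mass toward the mutual state (1,1) relative to what the one-directional rates alone would give.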
ERGMs (Exponential Random Graph Models)

ERGMs define a probability measure P on the space of all graphs on n nodes. For example, with the degree sequence as the statistic:

P(G) = Z^{-1} exp( Σ_{i=1}^n θ_i d_i(G) ),

where Z is a normalizing constant. The real parameters θ_1, ..., θ_n are chosen to reflect the observed degrees. This model appears explicitly in Park and Newman, in the language of statistical mechanics.

How can we test and fit this model?
How can we use this model?

Holland and Leinhardt [35] give iterative algorithms for the maximum likelihood estimates of the parameters, and Snijders [65] considers MCMC methods. These can be used to prove that the maximum likelihood estimator is asymptotically normal as n → ∞, provided that there is a uniform bound on the parameters θ_i for all i. Exponential models are standard fare in statistics and statistical mechanics.
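A fact worth noting (not on the slide): with the degrees as statistics, Σ_i θ_i d_i(G) = Σ_{i<j} (θ_i + θ_j)·1{ij ∈ G}, so edges are independent with P(ij ∈ G) = e^{θ_i+θ_j} / (1 + e^{θ_i+θ_j}). A brute-force check on n = 4, with made-up θ values:

```python
import math
from itertools import combinations, product

n = 4
theta = [0.5, -0.3, 0.0, 1.2]            # illustrative parameters
pairs = list(combinations(range(n), 2))  # the 6 possible edges

# Enumerate all 2^6 graphs, weighting each by exp(sum_i theta_i * d_i(G)).
weights = {}
for bits in product([0, 1], repeat=len(pairs)):
    deg = [0] * n
    for b, (i, j) in zip(bits, pairs):
        if b:
            deg[i] += 1
            deg[j] += 1
    weights[bits] = math.exp(sum(t * d for t, d in zip(theta, deg)))
Z = sum(weights.values())

# Marginal probability that edge {0,1} is present, under the ERGM...
marg = sum(w for bits, w in weights.items() if bits[0]) / Z
# ...versus the logistic formula implied by the factorization.
logistic = 1 / (1 + math.exp(-(theta[0] + theta[1])))
print(marg, logistic)
```

The two numbers agree, confirming that for this particular statistic the model reduces to independent edges (so the real modeling power of ERGMs comes from richer statistics such as triangle counts).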

Pseudolikelihood (Strauss-Ikeda 1990)

Fix a pair of nodes {i,j}, and consider the indicator r.v. of whether the edge {i,j} is present in G.

We don't know the normalizing constant! But conditioning on the rest of G yields a great simplification:

P(edge {i,j} | rest) / P(no edge {i,j} | rest) = e^{θ·(x(G+) − x(G−))},

where G+ and G− denote G with the edge {i,j} added and removed, respectively; the normalizing constant cancels.

So use logistic regression? Be careful of variance estimates!
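The conditional log-odds depend only on the change in the statistics when the edge {i,j} is toggled. A sketch using edge count and triangle count as the statistic vector (both the statistic choice and the θ values are illustrative, not from the course):

```python
from itertools import combinations

def stats(edges, n):
    """Statistic vector x(G) = (number of edges, number of triangles)."""
    edge_set = {frozenset(e) for e in edges}
    tri = sum(1 for a, b, c in combinations(range(n), 3)
              if {frozenset((a, b)), frozenset((a, c)),
                  frozenset((b, c))} <= edge_set)
    return (len(edge_set), tri)

def conditional_log_odds(edges, n, i, j, theta):
    """log [ P(edge {i,j} | rest) / P(no edge {i,j} | rest) ]
       = theta . (x(G+) - x(G-)); the normalizing constant cancels."""
    others = [e for e in edges if frozenset(e) != frozenset((i, j))]
    x_plus = stats(others + [(i, j)], n)    # graph with the edge
    x_minus = stats(others, n)              # graph without the edge
    return sum(t * (p - m) for t, p, m in zip(theta, x_plus, x_minus))

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
theta = (-1.0, 0.8)  # edge parameter, triangle parameter
print(conditional_log_odds(edges, 4, 0, 1, theta))
```

Toggling {0,1} here changes the edge count by 1 and the triangle count by 1, so the log-odds is (-1.0) + 0.8 = -0.2. Fitting θ by plugging such conditional probabilities into logistic regression is exactly the pseudolikelihood shortcut, with the caveat about variance estimates above.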
MCMCMLE (Geyer-Thompson 1992)

From now on we write P_θ(G) = exp(θ·x(G)) / c(θ) = q_θ(G) / c(θ).

Fix some baseline θ_0 and estimate the log-likelihood ratio:

l(θ) − l(θ_0) = (θ − θ_0)·x(G) − log( c(θ) / c(θ_0) ).

The ratio of normalizing constants is:

c(θ) / c(θ_0) = E_{θ_0}[ q_θ(G) / q_{θ_0}(G) ],

so we can approximate the MLE via MCMC, by sampling from P_{θ_0} to estimate this expectation.

What about the choice of θ_0 though?
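As a sanity check of the identity (a toy example, not Geyer-Thompson's actual setup): when x(G) is just the edge count on n = 4 vertices, c(θ) = (1 + e^θ)^6 in closed form, and sampling from P_{θ_0} is trivial because the edges are i.i.d. Bernoulli:

```python
import math
import random

rng = random.Random(7)
m = 6                     # number of possible edges on n = 4 vertices
theta, theta0 = 0.8, 0.0  # target and baseline parameters (made up)

def c_exact(t):
    """Exact normalizing constant of the edge-count ERGM: (1 + e^t)^m."""
    return (1 + math.exp(t)) ** m

# Under theta0, edges are i.i.d. Bernoulli(e^theta0 / (1 + e^theta0)),
# so we can sample from P_theta0 directly (no MCMC needed for this toy).
p0 = math.exp(theta0) / (1 + math.exp(theta0))
total = 0.0
draws = 50_000
for _ in range(draws):
    edge_count = sum(rng.random() < p0 for _ in range(m))
    # q_theta(G) / q_theta0(G) = exp((theta - theta0) * x(G))
    total += math.exp((theta - theta0) * edge_count)
estimate = total / draws

print("MC estimate:", estimate, " exact:", c_exact(theta) / c_exact(theta0))
```

The Monte Carlo average lands close to the exact ratio. The variance of this estimator blows up when θ is far from θ_0, which is exactly why the choice of θ_0 matters.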


[Table comparing models (Erdős–Rényi, dyad independence, ERGM, fixed degree, geometric) under sampling schemes (short paths, i.i.d. node, i.i.d. edge, snowball, RDS)]
Latent Space Models

Hoff et al (2002) model: each node has a latent position z_i in a low-dimensional space, edges form independently given the positions, and (in the distance version, without covariates)

log odds(i ~ j) = α − |z_i − z_j|.
Degrees

Normalization?

[Figure: a graph with each vertex labeled by its degree]
Closeness

uses the reciprocal of the average shortest distance to other nodes

[Figure: a graph with each vertex labeled by its closeness score (values between 0.39 and 0.56)]
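Closeness is a short BFS computation. A sketch for connected, unweighted graphs (the function name is mine):

```python
from collections import deque

def closeness(adj, v):
    """Closeness of v: reciprocal of the average shortest-path distance
    from v to every other node (assumes a connected graph)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # breadth-first search from v
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    others = [d for node, d in dist.items() if node != v]
    return len(others) / sum(others)

# Path graph 0 - 1 - 2: the middle node is closest to the rest.
adj = {0: [1], 1: [0, 2], 2: [1]}
print([closeness(adj, v) for v in (0, 1, 2)])
```

On the path graph the middle node gets closeness 1.0 (average distance 1), while the endpoints get 2/3 (average distance 1.5).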
Betweenness

Many variations: shortest paths vs. flow maximization vs. all paths vs. random paths.
Eigenvector Centrality

use the eigenvector of the adjacency matrix A corresponding to its largest eigenvalue (Bonacich); more generally, power centrality
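A bare-bones power iteration (a sketch; the identity shift is a standard trick I have added to avoid oscillation on bipartite graphs, not something from the slide):

```python
import math

def eigenvector_centrality(A, iters=100):
    """Power iteration on A + I: repeatedly apply the matrix and
    renormalize. For a connected graph this converges to the
    eigenvector of A's largest eigenvalue; the identity shift keeps
    bipartite graphs (whose spectrum is symmetric) from oscillating."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# Star graph: center 0 connected to leaves 1, 2, 3.
A = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
scores = eigenvector_centrality(A)
print([round(s, 3) for s in scores])  # center scores highest, leaves tie
```

For the star the limiting vector is proportional to (√3, 1, 1, 1), so the center's score is 1/√2 ≈ 0.707 and each leaf's is 1/√6 ≈ 0.408.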