Social Networks
Sitan Chen
November 26, 2014
Introduction
The development of Pennebaker et al.'s Linguistic Inquiry and Word Count (LIWC) program (2001) near
the turn of the century sparked a rich line of work, surveyed in Pennebaker's The Secret Life of Pronouns (2011),
centered around studying word usage in everyday speech as an indicator for everything from personality
traits to relationships between people. The idea is that this sort of information about individuals is
captured remarkably well merely by their language styles, i.e. the frequencies with which they use different
parts of speech.
As a case in point, the words we use to communicate with our friends turn out to be closely linked
to the nature of our friendships with those people. This phenomenon has been widely studied in modes
of communication where information content is dense, like email, letter-writing, and spoken word. For
example, a frequency analysis of letters between contemporaries Carl Jung and Sigmund Freud found that
their writing styles were most similar in the first four years of their acquaintance but diverged as relations
between the two psychologists turned sour. Likewise, individuals engaged in conversation have been shown
to match language styles fairly quickly (Pennebaker, 2011).
One might wonder whether this link between friendship and language matching is robust enough to
manifest itself even in modes of communication where information is sparse, like posts on social media
websites. In this work, we study this by asking whether it is true that the closer two friends are, the more
similar the language styles of their posts on Facebook.
While a common-sense explanation for language matching is simply that when two interacting individuals pay attention to each other, their behaviors will converge, there is also evidence that language
matching occurs at a more fundamental level by way of so-called mirror neurons (Pennebaker, 2011).
These are cells which fire both when an individual is performing some action and when she is observing
others perform that action (Rizzolatti & Craighero, 2004). These explain in large part the phenomenon of
language matching in conversations: mirror neurons fire when we hear the speech of another individual,
facilitating both our understanding and subsequent imitation of what we hear. So if the link between
conversation and language matching is a consequence of the automatic firing of mirror neurons, then I
conjecture that the link between friendship and language matching occurs at a similarly automatic level
and should thus manifest itself in a wide range of modes of communication. This is the central hypothesis
that I will test, and I refer to it as the mirror neuron hypothesis.
Conversely, there are some arguments for why this link shouldn't be detectable in Facebook posts. Of
course, one could simply reason that such posts are too short for there to be a broad enough spectrum
of representable language styles. One could further argue that public Facebook posts, unlike private
correspondence with a particular individual, are authored without the sense of immediacy of a one-on-one
conversation, so it is unclear in this case whether mirror neuron firing plays a role at all. We refer to this
reasoning as the public posts hypothesis. Another theory is that while one close friendship may very well
influence an individual's language style, the sum of the influences of multiple, equally close friends might
amount to little more than noise. We refer to this reasoning as the overcrowded influences hypothesis.
The design of this study for determining which of these three hypotheses holds was simple. As I
describe in the next section, I wrote a computer script which, roughly speaking, scanned through my
Facebook account and determined for each pair of friends 1) a measure of closeness based on number of
public Facebook interactions, and 2) a measure of similarity between the frequencies with which they used
particular parts of speech.
In order to determine if any of the three hypotheses defined above holds, one would first determine
whether there is a statistically significant correlation between closeness and language matching on Facebook. If so, then the fact that the latter can serve as an indicator of the former even in short blocks of text
on Facebook would support the mirror neuron hypothesis that language matching occurs at some fundamental, automatic level, mirror neuron firing being the dominant explanation. On the other hand, should
there be no noticeable correlation between closeness and language matching except for individuals with a
small number of close friends, this would suggest that a single close friend affects one's language style even
if many close friends cumulatively do not, which is precisely the reasoning of the overcrowded influences
hypothesis. Lastly, should no correlation be observed at all, this would suggest that the mirror neuron
hypothesis, which we already know applies to one-on-one private interaction, breaks down in the analysis
of public posts by that individual, which is precisely the public posts hypothesis.
Methods
As mentioned above, data collection for this study was centered around a computer script which scanned
through my own social network on Facebook. The source code is attached in the appendix.
The script begins by building a graph corresponding to my social network, where each node represents
a person and each edge is a connection between two friends in my network. Then for each of my friends,
the script makes calls to the Facebook Graph API to extract all statuses posted to that person's timeline
since 2010. For each status by user A, the script counts the number of likes likes(A, B) and comments
comments(A, B) left by each user B and stores the comments and status in a file for later processing. The
script then scans my friends public photos and counts the number photos(A, B) of photos in which A and
B are tagged together, for all A, B. Lastly, the script computes the number of mutual friends mutual(A, B)
any two friends A, B share inside my network. From these values, it computes a friendship closeness score
closeness(A, B) for each pair of users A, B.
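As a rough sketch, a closeness score of this kind might simply combine the four interaction counts. The equal weighting below is an illustrative assumption, not the study's actual definition:

```python
# Illustrative closeness score combining the four interaction counts.
# The equal weights here are an assumption for illustration only;
# they are not the formula used in the actual study.
def closeness(comments, likes, photos, mutual):
    return comments + likes + photos + mutual

# Example: a pair with 12 comments, 30 likes, 4 shared photos,
# and 25 mutual friends.
print(closeness(12, 30, 4, 25))
```

In practice one would likely want to weight the four counts differently, since, e.g., a comment is a costlier interaction than a like.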
In the next half of the script, each file containing text written by one of my friends A is passed through
the Python Natural Language Toolkit's part-of-speech tagger and reduced to a vector v_A = (v_1^A, ..., v_n^A)
whose entries are the frequencies v_i^A of each part of speech i. Each v_A is normalized to v_A := v_A / ||v_A||,
and between any two friends A, B, the script computes the language distance to be the Euclidean distance

    dist(A, B) = ||v_A - v_B||_2 = sqrt( sum_{i=1}^{n} (v_i^A - v_i^B)^2 ).
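The normalization and distance computation can be sketched in a few lines of Python. The part-of-speech counts below are made up for illustration and are not data from the study:

```python
import math

# Sketch of the language-distance computation: normalize two part-of-speech
# frequency vectors to unit length, then take their Euclidean distance.
def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def language_distance(freqs_a, freqs_b):
    va, vb = normalize(freqs_a), normalize(freqs_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(va, vb)))

# Two hypothetical friends with counts over, say, (nouns, verbs, pronouns).
# Similar usage profiles give a small distance; dissimilar ones, a larger one.
print(language_distance([10, 5, 2], [8, 4, 2]))
print(language_distance([10, 5, 2], [1, 9, 7]))
```

Because the vectors are normalized first, the distance depends only on the relative proportions of the parts of speech, not on how much each friend writes.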
The script then generates five scatter plots of dist(A, B) versus each of closeness(A, B), comments(A, B),
likes(A, B), photos(A, B), and mutual(A, B).
Results
Figures 1a, 1b, 1c, 1d, and 1e are respectively plots of closeness(A, B), comments(A, B), likes(A, B),
photos(A, B), and mutual(A, B) versus dist(A, B) over all pairs of friends. We ran linear regressions for
these five sets of points, the results of which are shown in Table 1, indicating that the closer two friends
are, the lower their language distance is and the more similar their language styles are.
Table 1: Linear regressions of dist(A, B) against each closeness measure.

                      slope       intercept    r           p-value
    closeness        -0.00169     0.618       -0.279       0.01
    comments         -0.00281     0.520       -0.133       0.01
    likes            -0.00225     0.525       -0.146       0.01
    photos           -0.000899    0.514       -0.00635     0.0997
    mutual friends   -0.00178     0.610       -0.239       0.01
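For reference, an ordinary least-squares fit of the kind reported in Table 1 can be sketched as follows. The data points here are fabricated for illustration and are not the study's data:

```python
# Minimal ordinary least-squares fit, as used to produce the slopes and
# intercepts in Table 1. The (x, y) points below are fabricated toy data.
def linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# A toy decreasing trend: language distance falling as closeness rises.
xs = [0, 1, 2, 3, 4]
ys = [0.62, 0.60, 0.59, 0.57, 0.55]
slope, intercept = linear_regression(xs, ys)
print(slope, intercept)
```

A negative fitted slope, as in every row of Table 1, is what the closeness-predicts-similarity hypothesis requires.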
Discussion
The low p-values and negative slopes from our linear regressions of a variety of measures of closeness versus
language similarity support the hypothesis that the closer two friends are, the more similar their language
styles on Facebook. For modes of communication like letter-writing and conversation in which information
content is denser, such a link is already known, but to observe this link even in social media posts is striking.
Such a result renders the aforementioned public posts and overcrowded influences hypotheses unlikely, as
the same phenomenon of language matching observed in private one-on-one interaction also seems to
persist in peoples public Facebook posts and even among users with many close friends. Moreover, that
this persists at the microscopic level of a Facebook comment or status suggests that some fundamental,
automatic process drives this language matching. In the case of conversations, this process is the firing of
mirror neurons (Pennebaker, 2011), so it seems reasonable that the mirror neuron hypothesis also holds for
Facebook posts, though verifying this calls for brain imaging techniques outside the scope of this project.
There are however some limitations to these conclusions related to the availability of data and the
metric for friendship closeness used. The most obvious issue is that this scheme makes use only of the
nodes in my own social network. For one, this means any conclusions drawn are based on a possibly highly
unrepresentative sample of the total population of all Facebook users. Another issue is that our closeness
metric weighs recent interactions the same as old ones: under closeness(A, B), two friends who have a long history of close
friendship but have drifted apart in recent years would still be deemed close friends. Secondly, our metric
ignores the relative frequency with which each user interacts with others. The result is that extroverted
users who interact often with everyone get treated as close friends with everyone, while introverted users
who interact infrequently get treated as close friends to no one. It is possible to derive a better model of
closeness using techniques in supervised machine learning, but this is also out of the scope of this paper.
In addition to the improvements suggested above, another potential avenue would be to study
one-sided friendships. In the current analysis, because closeness(A, B) = closeness(B, A), we treat all
friendships as symmetric relationships between people and ignore friendships in which one individual is
more invested than the other. One interesting question is thus whether an individual's language style is
influenced more by one-sided friendships in which she invests disproportionately more, or ones in which
she invests disproportionately less. One could study this by redefining closeness(A, B) to be asymmetric
in its two arguments.
References
[1] Pennebaker, J., Francis, M.E., & Booth, R.J. (2001). Linguistic Inquiry and Word Count: A computerized text analysis program. Mahwah, NJ: Erlbaum.
[2] Pennebaker, J. (2011). The Secret Life of Pronouns: What Our Words Say About Us. New York: Bloomsbury Press.
[3] Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169-192.
Appendix
import nltk
import facebook
import requests
import json
import random
import math
import numpy as np
import pickle
import matplotlib.pyplot as plt
import scipy

# Obtain FB ids of all friends
# (graph is a facebook.GraphAPI instance authenticated earlier in the full source)
friends = graph.get_connections('me', 'friends')
friend_ids = [friend['id'] for friend in friends['data']]
# Constructs social graph
# Nodes: Friends
# Edges: Pairs of your friends who are friends with each other
# Weights on edges: [Comments, Likes, Photos]
def build_graph():
    adjacency_list = dict()
    for friend in friend_ids:
        print friend
        # mutual_with_me is a helper defined elsewhere in the full source
        neighbors = mutual_with_me(friend)
        adjacency_list[friend] = {neighbor: [0, 0, 0] for neighbor in neighbors}
    return adjacency_list

# Call build_graph to construct own social graph
adjacency_list = build_graph()

# Create empty files for storing friends' writing later
for (friend, d) in adjacency_list.iteritems():
    f = open('/Users/sitanchen/Documents/2014-2015/psy/data/' + friend, 'w+')
    f.close()
# Tallies likes of user to's posts
def update_likes(likes, to):
    for like in likes['data']:
        if like['id'] in friend_ids and like['id'] != to:
            if like['id'] in adjacency_list[to].keys() and to in adjacency_list[like['id']].keys():
                adjacency_list[to][like['id']][1] += 1
                adjacency_list[like['id']][to][1] += 1

# Tallies comments made on user to's posts
# Writes comments to appropriate author's file
def update_comments(comments, to):
    for comment in comments['data']:
        fm = comment['from']['id']
        if fm in friend_ids and fm != to:
            if fm in adjacency_list[to].keys() and to in adjacency_list[fm].keys():
                msg = comment['message']
                # update_file appends msg to fm's data file; defined elsewhere in the full source
                update_file(msg, fm)
                adjacency_list[to][fm][0] += 1
                adjacency_list[fm][to][0] += 1
# Tallies photos in which friend is tagged
def update_tags(friend, tag_ids):
    for tag_id in tag_ids:
        if tag_id in friend_ids and tag_id != friend:
            if friend in adjacency_list[tag_id].keys() and tag_id in adjacency_list[friend].keys():
                adjacency_list[tag_id][friend][2] += 1
                adjacency_list[friend][tag_id][2] += 1

# Retrieve more FB data
def get_more(ls):
    if 'paging' in ls.keys():
        if 'next' in ls['paging']:
            return requests.get(ls['paging']['next']).json()
    return []
# Main text extraction
def get_txt():
    for friend in friend_ids[800:]:
        year = '2014'
        posts = graph.get_connections(friend, 'posts')
        # To save time, just check until 2010
        while year[0:3] == '201':
            if 'data' not in posts.keys():
                print posts
                break
            else:
                # Update likes and comments found
                for post in posts['data']:
                    if 'likes' in post.keys():
                        likes = post['likes']
                        update_likes(likes, friend)
                    if 'comments' in post.keys():
                        comments = post['comments']
                        update_comments(comments, friend)
                    year = post['created_time'][0:4]
                # Try retrieving more FB data
                posts = get_more(posts)
                if not posts:
                    break
# Tabulates mutual photos
def get_photos():
    for friend in friend_ids[397:]:
        print friend
        year = '2014'
        photos = graph.get_connections(friend, 'photos')
        # To save time, just check until 2010
        while year[0:3] == '201':
            if 'data' in photos.keys():
                for photo in photos['data']:
                    # Retrieve tags for each photo
                    if 'tags' in photo.keys():
                        tags = photo['tags']['data']
                        new_tags = []
                        for tag in tags:
                            if 'id' in tag.keys():
                                new_tags.append(tag['id'])
                        update_tags(friend, new_tags)
                    year = photo['created_time'][0:4]
            # Try retrieving more FB data
            photos = get_more(photos)
            if not photos:
                break
# Computes normalized vector of frequencies of
# each part of speech used by the user
def lang_profile(user):
    def normalize(v):
        norm = math.sqrt(sum([x**2 for x in v]))
        return [x/norm for x in v]
    # Reads in file containing all writing done by the user on FB
    with open('/Users/sitanchen/Documents/2014-2015/psy/data/' + user, 'r') as f:
        # tag_dict maps POS tags to vector indices; defined elsewhere in the full source
        total = [0] * len(tag_dict)
        for post in f.readlines():
            post = post.strip('\n')
            # Split the post into tokens, i.e. words
            post = nltk.word_tokenize(post)
            n = len(post)
            pairs = nltk.pos_tag(post)
            # Tallies different parts of speech used
            for pair in pairs:
                tag = pair[1]
                if tag in tag_dict.keys():
                    total[tag_dict[tag]] += 1.
        # normalize vector
        if sum(total) != 0:
            return normalize(total)
        else:
            return [np.infty] * len(tag_dict)

# Euclidean distance between two vectors
def dist_vec(u, v):
    diff = [ai - bi for ai, bi in zip(u, v)]
    return math.sqrt(sum([x**2 for x in diff]))
# Retrieve photo and text data
get_txt()
get_photos()

# profiles will store the linguistic profile vector of each friend
# dists, comments, likes, photos, and mutual will store the
# closeness/comment count/like count/photo count/mutual friend count
# and language similarity between all pairs of friends
profiles = {}
dists = {}; comments = {}; likes = {}; photos = {}; mutual = {}

# Compute linguistic profiles and initialize dists
for friend in friend_ids:
    profile = lang_profile(friend)
    profiles[friend] = profile
    # closeness, distance of profiles
    dists[friend] = {neighbor: [0, 0] for neighbor in adjacency_list[friend].keys()}
# Plot data and run linear regression
def scat_plot(d, name, truncate=True, thres=400):
    # Process data into a list of pairs
    pre_pairs = [a.values() for a in d.values()]
    flattened = [a for b in pre_pairs for a in b]
    pairs = []
    for pair in flattened:
        if math.isnan(pair[1]) == False and pair[1] != np.inf:
            pairs.append(pair)
    # Optionally truncate overly popular users because