
Friendship Closeness and Language Style Matching in Online Social Networks
Sitan Chen
November 26, 2014

Introduction

The development of Pennebaker et al.'s Linguistic Inquiry and Word Count (LIWC) program (2001) near
the turn of the century sparked a rich line of work, surveyed in Pennebaker's Secret Life of Pronouns (2011),
centered around studying word usage in everyday speech as an indicator for everything from personality
traits to relationships between people. The idea is that this sort of information about individuals is
captured remarkably well merely by their language styles, i.e. the frequencies with which they use different
parts of speech.
As a case in point, the words we use to communicate with our friends turn out to be closely linked
to the nature of our friendships with those people. This phenomenon has been widely studied in modes
of communication where information content is dense, like email, letter-writing, and spoken word. For
example, a frequency analysis of letters between contemporaries Carl Jung and Sigmund Freud found that
their writing styles were most similar in the first four years of their acquaintance but diverged as relations
between the two psychologists turned sour. Likewise, individuals engaged in conversation have been shown
to match language styles fairly quickly (Pennebaker, 2011).
One might wonder whether this link between friendship and language matching is robust enough to
manifest itself even in modes of communication where information is sparse, like posts on social media
websites. In this work, we study this by asking whether it is true that the closer two friends are, the more
similar the language styles of their posts on Facebook.
While a common-sense explanation for language matching is simply that when two interacting individuals pay attention to each other, their behaviors will converge, there is also evidence that language
matching occurs at a more fundamental level by way of so-called mirror neurons (Pennebaker, 2011).
These are cells which fire both when an individual is performing some action and when she is observing
others perform that action (Rizzolatti & Craighero, 2004). These explain in large part the phenomenon of
language matching in conversations: mirror neurons fire when we hear the speech of another individual,
facilitating both our understanding and subsequent imitation of what we hear. So if the link between
conversation and language matching is a consequence of the automatic firing of mirror neurons, then I
conjecture that the link between friendship and language matching occurs at a similarly automatic level
and should thus manifest itself in a wide range of modes of communication. This is the central hypothesis
that I will test, and I refer to it as the mirror neuron hypothesis.
Conversely, there are some arguments for why this link shouldn't be detectable in Facebook posts. Of
course, one could simply reason that such posts are too short for there to be a broad enough spectrum
of representable language styles. One could further argue that public Facebook posts, unlike private
correspondence with a particular individual, are authored without the sense of immediacy of a one-on-one
conversation, so it is unclear in this case whether mirror neuron firing plays a role at all. We refer to this
reasoning as the public posts hypothesis. Another theory is that while one close friendship may very well
influence an individual's language style, the sum of the influences of multiple, equally close friends might
amount to little more than noise. We refer to this reasoning as the overcrowded influences hypothesis.
The design of this study for determining which of these three hypotheses holds was simple. As I
describe in the next section, I wrote a computer script which, roughly speaking, scanned through my
Facebook account and determined for each pair of friends 1) a measure of closeness based on number of
public Facebook interactions, and 2) a measure of similarity between the frequencies with which they used
particular parts of speech.
In order to determine if any of the three hypotheses defined above holds, one would first determine
whether there is a statistically significant correlation between closeness and language matching on Facebook. If so, then the fact that the latter can serve as an indicator of the former even in short blocks of text
on Facebook would support the mirror neuron hypothesis that language matching occurs at some fundamental, automatic level, mirror neuron firing being the dominant explanation. On the other hand, should
there be no noticeable correlation between closeness and language matching except for individuals with a
small number of close friends, this would suggest that a single close friend affects one's language style even
if many close friends cumulatively do not, which is precisely the reasoning of the overcrowded influences
hypothesis. Lastly, should no correlation be observed at all, this would suggest that the mirror neuron
hypothesis, which we already know applies to one-on-one private interaction, breaks down in the analysis
of public posts by that individual, which is precisely the public posts hypothesis.

Methods

As mentioned above, data collection for this study was centered around a computer script which scanned
through my own social network on Facebook. The source code is attached in the appendix.
The script begins by building a graph corresponding to my social network, where each node represents
a person and each edge is a connection between two friends in my network. Then for each of my friends,
the script makes calls to the Facebook Graph API to extract all statuses posted on that person's timeline
since 2010. For each status by user A, the script counts the number of likes, likes(A, B), and comments,
comments(A, B), left by each user B, and stores the comments and status in a file for later processing. The
script then scans my friends' public photos and counts the number photos(A, B) of photos in which A and
B are tagged together, for all A, B. Lastly, the script computes the number of mutual friends mutual(A, B)
any two friends A, B share inside my network. From these values, it computes the friendship closeness of
any pair of users A, B to be

closeness(A, B) = likes(A, B) + comments(A, B) + photos(A, B) + mutual(A, B).
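As a minimal sketch of the formula above, the snippet below sums the four counts for one pair of friends. The `Interaction` container and the numbers are hypothetical and only illustrate the arithmetic; they are not taken from the study's data.

```python
from collections import namedtuple

# Hypothetical container for the four per-pair counts named in the text.
Interaction = namedtuple("Interaction", ["likes", "comments", "photos", "mutual"])


def closeness(interaction):
    """Unweighted sum of the four public-interaction counts."""
    return (interaction.likes + interaction.comments
            + interaction.photos + interaction.mutual)


# Made-up counts for one pair of friends (illustrative only).
pair = Interaction(likes=12, comments=5, photos=3, mutual=40)
print(closeness(pair))  # 60
```

Because the sum is unweighted, a pair with many mutual friends and no direct interaction can score as high as a pair that interacts often.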

In the next half of the script, each file containing text written by one of my friends A is passed through
the Python Natural Language Toolkit's part-of-speech tagger and reduced to a vector $v^A = \{v_i^A\}$ whose
entries are frequencies $v_i^A$ of each part of speech $i$. Each $v^A$ is normalized to $v^A := v^A / \|v^A\|$, and between
any two friends A, B, the script computes language distance to be the Euclidean distance

$$\mathrm{dist}(A, B) = \|v^A - v^B\| = \sqrt{\sum_{i=1}^{n} (v_i^A - v_i^B)^2},$$

some nonnegative scalar between 0 and $\sqrt{2}$.
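The normalization and distance computation described above can be sketched as follows. This is an illustration under stated assumptions, not the appendix code: the tag lists are written by hand rather than produced by NLTK, and `TAGSET`, `pos_profile`, and `language_distance` are hypothetical names.

```python
import math
from collections import Counter


def pos_profile(tags, tagset):
    """Unit-length vector of part-of-speech counts, ordered as in tagset."""
    counts = Counter(tags)
    vec = [float(counts[tag]) for tag in tagset]
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("no recognised tags; the profile is undefined")
    return [x / norm for x in vec]


def language_distance(u, v):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy tagset and hand-written tag sequences (assumptions for illustration).
TAGSET = ["NN", "VB", "PRP", "DT"]
profile_a = pos_profile(["PRP", "VB", "DT", "NN", "NN"], TAGSET)
profile_b = pos_profile(["NN", "NN", "VB", "DT", "PRP"], TAGSET)  # same counts as A
profile_c = pos_profile(["PRP", "PRP", "PRP"], TAGSET)            # pronouns only
profile_d = pos_profile(["NN", "NN"], TAGSET)                     # nouns only

print(language_distance(profile_a, profile_b))  # identical tag counts
print(language_distance(profile_c, profile_d))  # no tags in common
```

Two users with identical tag counts are at distance 0, while two users who share no part of speech are at distance $\sqrt{2}$, the largest value possible for unit vectors with nonnegative entries.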

The script then generates five scatter plots of dist(A, B) versus each of closeness(A, B), comments(A, B),
likes(A, B), photos(A, B), and mutual(A, B).

Results

Figure 1: Results of frequency analysis script. Panels: (a) closeness versus language similarity; (b) comments versus language similarity; (c) likes versus language similarity; (d) photos versus language similarity; (e) mutual friends versus language similarity.

Figures 1a, 1b, 1c, 1d, and 1e are respectively plots of closeness(A, B), comments(A, B), likes(A, B),
photos(A, B), and mutual(A, B) versus dist(A, B) over all pairs of friends. We ran linear regressions for
these five sets of points, the results of which are shown in Table 1, indicating that the closer two friends
are, the lower their language distance is and the more similar their language styles are.

                  slope        intercept    r-value     p-value
closeness         -0.00169     0.618        -0.279      < 0.01
comments          -0.00281     0.520        -0.133      < 0.01
likes             -0.00225     0.525        -0.146      < 0.01
photos            -0.000899    0.514        -0.00635    0.0997
mutual friends    -0.00178     0.610        -0.239      < 0.01

Table 1: Results of linear regressions


The r-values shown in Table 1 are too small in magnitude to indicate a strong linear correlation between language similarity and friendship closeness. That said, negative slopes were observed for
each regression, and for friendship closeness, comments, likes, and mutual friends, the p-value corresponding to the null hypothesis that the slope should be zero is essentially zero, suggesting that it is highly
probable that some correlation exists between these measures and language similarity. The reason this
does not apply to photo count is that in many cases, privacy settings prevented the script from accessing
photo data, so photo counts for most friendships were zero.
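To make the regression quantities concrete, here is a minimal sketch of how a slope, intercept, and r-value are obtained by least squares. The data points are made up, and `linear_fit` is a hypothetical helper written in plain Python. The appendix script instead calls `scipy.stats.linregress`, which also reports the p-value.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept, plus Pearson's r, for paired data."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up (closeness, language distance) pairs, for illustration only.
closeness_values = [0.0, 10.0, 20.0, 30.0, 40.0]
distances = [0.60, 0.58, 0.61, 0.55, 0.56]

slope, intercept, r = linear_fit(closeness_values, distances)
print(slope, intercept, r)
```

The slope says how much the distance changes per unit of closeness, while r measures how tightly the points cluster around the fitted line. With many data points, a slope can differ significantly from zero even when r is small in magnitude.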

Discussion

The low p-values and negative slopes from our linear regressions of a variety of measures of closeness versus
language similarity support the hypothesis that the closer two friends are, the more similar their language
styles on Facebook. For modes of communication like letter-writing and conversation in which information
content is denser, such a link is already known, but to observe this link even in social media posts is striking.
Such a result renders the aforementioned public posts and overcrowded influences hypotheses unlikely, as
the same phenomenon of language matching observed in private one-on-one interaction also seems to
persist in people's public Facebook posts and even among users with many close friends. Moreover, that
this persists at the microscopic level of a Facebook comment or status suggests that some fundamental,
automatic process drives this language matching. In the case of conversations, this process is the firing of
mirror neurons (Pennebaker, 2011), so it seems reasonable that the mirror neuron hypothesis also holds for
Facebook posts, though verifying this calls for brain imaging techniques outside the scope of this project.
There are however some limitations to these conclusions related to the availability of data and the
metric for friendship closeness used. The most obvious issue is that this scheme makes use only of the
nodes in my own social network. For one, this means any conclusions drawn are based on a possibly highly
unrepresentative sample of the total population of all Facebook users.

Figure 2: Distribution of word counts (truncated to counts of less than 50)


For another, the resulting limit on the raw quantity of data available drastically affects our analysis
of language styles. Indeed, Facebook statuses and comments may be unrepresentative of users' language
styles to begin with, and this is especially true of posts that are too short to reflect anything about the
individual. One fix is to ignore all blocks of text below a certain length, leaving only posts long enough
to reflect nuances in a user's language style. Unfortunately, as the histogram in Figure 2 indicates, the
dearth of data used in this study meant that even for N as small as 20, filtering out posts of length less
than N would have left a mere 4274 of 84141 posts to work with. One way to obtain enough data to get
away with this filtering would be to analyze more Facebook accounts; another option would be to conduct
this study on another social media site like Twitter where public posts are more plentiful.
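The length filter discussed above can be sketched as follows. This is an illustration only: the example posts are invented, `filter_short_posts` is a hypothetical helper, and whitespace splitting stands in for the NLTK tokenizer used in the appendix.

```python
def token_count(post):
    # Whitespace split as a stand-in for nltk.word_tokenize (assumption).
    return len(post.split())


def filter_short_posts(posts, min_tokens):
    """Keep only posts with at least min_tokens tokens."""
    return [post for post in posts if token_count(post) >= min_tokens]


# Invented posts: two short ones and one with 23 tokens.
posts = [
    "happy birthday!!",
    "so proud of you",
    "Long reflective status " + "word " * 20,
]
kept = filter_short_posts(posts, min_tokens=20)
print(len(kept), "of", len(posts), "posts kept")
```

With the figures reported above, the same filter at N = 20 would keep 4274 of 84141 posts, or roughly 5% of the corpus.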
One last limitation was the crudeness of the metric used to estimate friendship closeness. Due to time
constraints, it was most convenient to define closeness(A, B) as the total number of public interactions on
Facebook between two users. This definition was rather arbitrary, and there are two issues in particular
that hamper its effectiveness in describing closeness. Firstly, it ignores time-sensitive information, treating
recent interactions the same as old ones. Under closeness(A, B), two friends who have a long history of close
friendship but have drifted apart in recent years would still be deemed close friends. Secondly, our metric
ignores the relative frequency with which each user interacts with others. The result is that extroverted
users who interact often with everyone get treated as close friends with everyone, while introverted users
who interact infrequently get treated as close friends to no one. It is possible to derive a better model of
closeness using techniques in supervised machine learning, but this is also out of the scope of this paper.
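As a rough illustration of how the two issues above might be addressed without machine learning, the sketch below down-weights old interactions and expresses a pair's weight as a share of one user's total. This is not a method from this study. The exponential half-life, the one-year default, and both function names are assumptions chosen only for illustration.

```python
def decayed_count(ages_in_days, half_life_days=365.0):
    """Sum of interactions, each weighted by 0.5 ** (age / half_life).

    The half-life is a hypothetical tuning parameter: an interaction that is
    one half-life old counts half as much as one made today.
    """
    return sum(0.5 ** (age / half_life_days) for age in ages_in_days)


def relative_closeness(weight_ab, total_weight_a):
    """Share of A's total interaction weight that involves B."""
    if total_weight_a == 0:
        return 0.0
    return weight_ab / total_weight_a


# Invented example: three interactions made today, one year and two years ago.
print(decayed_count([0.0, 365.0, 730.0]))  # 1 + 0.5 + 0.25
print(relative_closeness(2.0, 8.0))        # 0.25
```

Under this scheme an extroverted user's many interactions are spread over a large total, so no single friendship is inflated. The resulting score is no longer symmetric in A and B.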
In addition to the improvements suggested above, another potential avenue of study would be to study
one-sided friendships. In the current analysis, because closeness(A, B) = closeness(B, A), we treat all
friendships as symmetric relationships between people and ignore friendships in which one individual is
more invested than the other. One interesting question is thus whether an individual's language style is
influenced more by one-sided friendships in which she invests disproportionately more, or ones in which
she invests disproportionately less. One could study this by redefining closeness(A, B) to be

closeness(A, B) = (likes by B of A's posts) + (comments by B on A's posts)
                  + (photos by B of A) + (mutual friends)


and checking for each user A whether A is most similar in language to B for whom closeness(A, B) is high,
or to B for whom closeness(B, A) is high.
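The asymmetric definition above can be sketched as follows. The data layout is an assumption: directed counts are stored per ordered pair, mutual-friend counts per unordered pair, and all numbers are invented.

```python
def directed_closeness(directed, mutual, a, b):
    """closeness(a, b): b's likes, comments and photos aimed at a, plus mutual friends."""
    likes, comments, photos = directed.get((a, b), (0, 0, 0))
    return likes + comments + photos + mutual.get(frozenset((a, b)), 0)


# Hypothetical counts. Key (x, y) holds (likes, comments, photos) made by y
# on or of x, matching the order of arguments in closeness(x, y).
directed = {
    ("A", "B"): (9, 4, 2),  # B's activity directed at A
    ("B", "A"): (1, 0, 0),  # A's activity directed at B
}
mutual = {frozenset(("A", "B")): 10}

print(directed_closeness(directed, mutual, "A", "B"))  # 25
print(directed_closeness(directed, mutual, "B", "A"))  # 11
```

Here closeness(A, B) exceeds closeness(B, A), so B invests more in the friendship than A does. The proposed follow-up would compare A's language distance to such friends against the distance to friends for whom the imbalance runs the other way.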

References
[1] Pennebaker, J., Francis, M.E., & Booth, R.J. (2001). Linguistic inquiry and word count: A computerized text analysis program. Mahwah, NJ: Erlbaum.
[2] Pennebaker, J. (2011). The secret life of pronouns: What our words say about us. New York: Bloomsbury Press.
[3] Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience
27: 169-192.

Appendix

import nltk
import facebook
import requests
import json
import random
import math
import numpy as np
import pickle
import matplotlib.pyplot as plt
import scipy.stats

# Log in to FB and go to https://developers.facebook.com/tools/explorer
# to obtain your own access token for interacting with the Graph API
access_token = '<insert your access token here>'
graph = facebook.GraphAPI(access_token)

# Obtain FB ids of all friends
friends = graph.get_connections('me', 'friends')
friend_ids = [friend['id'] for friend in friends['data']]

# Read in file with list of all parts of speech used in NLTK's POS tagger
with open('/Users/sitanchen/Documents/2014 2015/psy/cats/tagset.txt', 'r') as f:
    tagset = f.readlines()
tagset = [tag.strip('\n') for tag in tagset]
tag_dict = {tagset[i]: i for i in range(len(tagset))}

# Returns the ids of the mutual friends that user and I share
def mutual_with_me(user):
    friendlist = []
    friends = graph.get_connections('me', 'mutualfriends/' + user)
    while 'paging' in friends.keys():
        for friend in friends['data']:
            if friend['id'] in friend_ids:
                friendlist.append(friend['id'])
        if 'next' in friends['paging'].keys():
            friends = requests.get(friends['paging']['next']).json()
        else:
            break
    return friendlist

# Constructs social graph
# Nodes: Friends
# Edges: Pairs of your friends who are friends with each other
# Weights on edges: [Comments, Likes, Photos]
def build_graph():
    adjacency_list = dict()
    for friend in friend_ids:
        print(friend)
        neighbors = mutual_with_me(friend)
        # (line truncated in the source; the closing "neighbors}" is assumed)
        adjacency_list[friend] = {neighbor: [0, 0, 0] for neighbor in neighbors}
    return adjacency_list

# Call build_graph to construct own social graph
adjacency_list = build_graph()

# Create empty files for storing friends' writing later
for (friend, d) in adjacency_list.items():
    f = open('/Users/sitanchen/Documents/2014 2015/psy/data/' + friend, 'w+')
    f.close()

# Checks whether two people are actually friends
# (the source line was truncated and used names not defined in this function;
# the body below is reconstructed from the arguments and is an assumption)
def are_friends(A, B):
    return (A in adjacency_list[B].keys()) and (B in adjacency_list[A].keys())

# Writes to user fm's file the message msg
def update_file(msg, fm):
    with open('/Users/sitanchen/Documents/2014 2015/psy/data/' + fm, 'a') as f:
        f.write(msg.encode('utf-8') + '\n')

# Tallies likes of user to's posts
def update_likes(likes, to):
    for like in likes['data']:
        if like['id'] in friend_ids and like['id'] != to:
            # (condition truncated in the source; second half assumed from the lines below)
            if like['id'] in adjacency_list[to].keys() and to in adjacency_list[like['id']].keys():
                adjacency_list[to][like['id']][1] += 1
                adjacency_list[like['id']][to][1] += 1

# Tallies comments made on user to's posts
# Writes comments to appropriate author's file
def update_comments(comments, to):
    for comment in comments['data']:
        fm = comment['from']['id']
        if fm in friend_ids and fm != to:
            # (condition truncated in the source; second half assumed from the lines below)
            if fm in adjacency_list[to].keys() and to in adjacency_list[fm].keys():
                msg = comment['message']
                update_file(msg, fm)
                adjacency_list[to][fm][0] += 1
                adjacency_list[fm][to][0] += 1

# Tallies photos in which friend is tagged
def update_tags(friend, tag_ids):
    for tag_id in tag_ids:
        if tag_id in friend_ids and tag_id != friend:
            # (condition truncated in the source; second half assumed from the lines below)
            if friend in adjacency_list[tag_id].keys() and tag_id in adjacency_list[friend].keys():
                adjacency_list[tag_id][friend][2] += 1
                adjacency_list[friend][tag_id][2] += 1

# Retrieve more FB data
def get_more(ls):
    if 'paging' in ls.keys():
        if 'next' in ls['paging']:
            return requests.get(ls['paging']['next']).json()
    return []

# Main text extraction
def get_txt():
    for friend in friend_ids[800:]:
        year = '2014'
        posts = graph.get_connections(friend, 'posts')
        # To save time, just check until 2010
        while year[0:3] == '201':
            if 'data' not in posts.keys():
                print(posts)
                break
            else:
                # Update likes and comments found
                for post in posts['data']:
                    if 'likes' in post.keys():
                        likes = post['likes']
                        update_likes(likes, friend)
                    if 'comments' in post.keys():
                        comments = post['comments']
                        update_comments(comments, friend)
                    year = post['created_time'][0:4]
                # Try retrieving more FB data
                posts = get_more(posts)
                if not posts:
                    break

# Tabulates mutual photos
def get_photos():
    for friend in friend_ids[397:]:
        print(friend)
        year = '2014'
        photos = graph.get_connections(friend, 'photos')
        # To save time, just check until 2010
        while year[0:3] == '201':
            if 'data' in photos.keys():
                for photo in photos['data']:
                    # Retrieve tags for each photo
                    if 'tags' in photo.keys():
                        tags = photo['tags']['data']
                        new_tags = []
                        for tag in tags:
                            if 'id' in tag.keys():
                                # (line truncated in the source; appending tag['id'] is assumed)
                                new_tags.append(tag['id'])
                        update_tags(friend, new_tags)
                    year = photo['created_time'][0:4]
            # Try retrieving more FB data
            photos = get_more(photos)
            if not photos:
                break

# Counts number of mutual friends between A and B
def mutual_friends(A, B):
    nbdA = set(adjacency_list[A].keys())
    nbdB = set(adjacency_list[B].keys())
    return len(nbdA & nbdB)

# Computes friendship closeness between A and B,
# defined to be number of all public interactions on FB
def closeness(A, B):
    interact = adjacency_list[A][B]
    comments = interact[0]
    likes = interact[1]
    photos = interact[2]
    friends = mutual_friends(A, B)
    return comments + likes + photos + friends

# Computes normalized vector of frequencies of
# each part of speech used by the user
def lang_profile(user):
    def normalize(v):
        norm = math.sqrt(sum([x ** 2 for x in v]))
        return [x / norm for x in v]
    # Reads in file containing all writing done by the user on FB
    with open('/Users/sitanchen/Documents/2014 2015/psy/data/' + user, 'r') as f:
        total = [0] * len(tagset)
        for post in f.readlines():
            post = post.strip('\n')
            # Split the post into tokens, i.e. words
            post = nltk.word_tokenize(post)
            n = len(post)
            pairs = nltk.pos_tag(post)
            # Tallies different parts of speech used
            for pair in pairs:
                tag = pair[1]
                if tag in tag_dict.keys():
                    total[tag_dict[tag]] += 1.
    # Normalize vector; users with no tagged text get an all-infinity
    # profile, which the plotting code filters out later
    if sum(total) != 0:
        return normalize(total)
    else:
        return [np.inf] * len(tag_dict)

# Euclidean distance between two vectors
def dist_vec(u, v):
    diff = [ai - bi for ai, bi in zip(u, v)]
    return math.sqrt(sum([x ** 2 for x in diff]))

# Retrieve photo and text data
get_txt()
get_photos()

# Profiles will store the linguistic profile vector of each friend
# Dists, comments, likes, photos, and mutual will store the
# closeness/comment count/like count/photo count/mutual friend count
# and language similarity between all pairs of friends
profiles = {}
dists = {}; comments = {}; likes = {}; photos = {}; mutual = {}
# Compute linguistic profiles and initialize dists
for friend in friend_ids:
    profile = lang_profile(friend)
    profiles[friend] = profile
    # closeness, distance of profiles
    # (line truncated in the source; the closing ".keys()}" is assumed)
    dists[friend] = {neighbor: [0, 0] for neighbor in adjacency_list[friend].keys()}
    # Added so that the per-pair assignments below do not raise KeyError
    comments[friend] = {}; likes[friend] = {}; photos[friend] = {}; mutual[friend] = {}

# Compute dists, comments, likes, photos, mutual
for node in friend_ids:
    for friend in adjacency_list[node].keys():
        # Call closeness(,) to get friendship closeness
        cness = closeness(node, friend)
        dist = dist_vec(profiles[friend], profiles[node])
        dists[friend][node] = [cness, dist]
        comments[friend][node] = [adjacency_list[friend][node][0], dist]
        likes[friend][node] = [adjacency_list[friend][node][1], dist]
        photos[friend][node] = [adjacency_list[friend][node][2], dist]
        # (line garbled in the source, which reads "[mutual_friends(id, k), v[1]]";
        # reconstructed by analogy with the lines above, so treat it as an assumption)
        mutual[friend][node] = [mutual_friends(friend, node), dist]

# Plot data and run linear regression
def scat_plot(d, name, truncate=True, thres=400):
    # Process data into a list of pairs
    prepairs = [a.values() for a in d.values()]
    flattened = [a for b in prepairs for a in b]
    pairs = []
    for pair in flattened:
        if not math.isnan(pair[1]) and pair[1] != np.inf:
            pairs.append(pair)
    # Optionally truncate overly popular users because
    # they distort the scale of the scatter plot
    if truncate:
        pairs = [pair for pair in pairs if pair[0] < thres]
    x = [a[0] for a in pairs]
    y = [a[1] for a in pairs]
    # Run linear regression using scipy.stats
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    print(slope, intercept, r_value, p_value, std_err)
    # Plot the points, then the fitted line
    # (the line's format string was lost in the source; '-' is assumed)
    plt.plot(x, y, '.')
    plt.plot(x, np.add(np.multiply(slope, x), intercept), '-')
    plt.xlabel(name)
    plt.ylabel('language style distance')
    plt.savefig(name + '.png')
    plt.show()

# Plot all the things!
# (the source passes the function closeness as the first argument; dists,
# which holds the [closeness, distance] pairs, is assumed to be intended)
scat_plot(dists, 'closeness')
scat_plot(comments, 'comments')
scat_plot(likes, 'likes')
scat_plot(photos, 'photos')
scat_plot(mutual, 'mutual friends')
