Académique Documents
Professionnel Documents
Culture Documents
5, MAY 2018
Abstract—Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urban
planning and marketing. Current spatial clustering models disregard information about the people and the time who and when are
related to the clustered places. In this paper, we show how the density-based clustering paradigm can be extended to apply on places
which are visited by users of a geo-social network. Our model considers spatio-temporal information and the social relationships
between users who visit the clustered places. After formally defining the model and the distance measure it relies on, we provide
alternatives to our model and the distance measure. We evaluate the effectiveness of our model via a case study on real data; in
addition, we design two quantitative measures, called social entropy and community score, to evaluate the quality of the discovered
clusters. The results show that temporal-geo-social clusters have special properties and cannot be found by applying simple spatial
clustering approaches and other alternatives.
Ç
1 INTRODUCTION
A checkin is a triplet huid; pid; timei modeling the fact that
C LUSTERING is commonly used as a method for data
exploration, characterization, and summarization.
Density-based clustering [1], in particular, divides a large
user uid visited place pid at a certain time.
We define the new problem of Density-based Clustering
collection of points into densely populated regions and it is Places in Geo-Social Networks (DCPGS), to detect geo-social
the most appropriate clustering paradigm for spatial data, clusters in GeoSNs. DCPGS extends DBSCAN by replacing
which have low dimensionality [2]. Density-based clusters the Euclidean distance threshold eps for the extents of dense
have arbitrary shapes and sizes and exclude objects in areas regions by a threshold , which considers both the spatial and
of low density (i.e., outliers). The DBSCAN model [1] finds the social distances between places. For two places pi and pj ,
the spatial eps-neighborhood of each point p in the dataset, the spatial distance is considered to be the Euclidean dis-
which is a circular region centered at p with radius eps. If tance between pi and pj , while the social distance should con-
the eps-neighborhood of p is dense, meaning that it contains sider the social relationships between the two sets of users Upi
no less than MinPts places, p is called a core point. Dense and Upj which have checked in pi and pj , respectively. We
eps-neighborhoods are put into the same cluster if they con- define and use such a social distance measure, based on the
tain the cores of each other. intuition that two places are socially similar if they share
In this paper, we investigate the extension of traditional many common users in their checkin records or the users in
density-based clustering for spatial locations to consider these records are linked by friendship edges. Fig. 1a illus-
their relationship to a social network of people who visit them trates the data of a GeoSN that includes eight users (u1 –u8 )
and the time when they were visited. In specific, we consider and two places (pi and pj ). The dashed lines represent user
the places of a Geo-Social Network (GeoSN) application, friendships and the solid lines annotated with timestamps
which allows users to capture their geographic locations (t1 –t9 ) illustrate checkins of users at places. For instance, user
and share them in the social network, by an operation called u1 is a friend with u2 and u3 and has visited place pi at time t3
checkin. Online social networks with this functionality and pj at t5 . As Fig. 1b shows, each place is modeled by its
include Gowalla,1 Foursquare,2 and Facebook Places.3 spatial coordinates (e.g., hlai ; loi i for the latitude and longi-
tude of pi ) and the set of users that have visited it (e.g., Upi for
pi ). For the spatial distance between pi and pj , we can use the
1. http://gowalla.com
2. https://foursquare.com Euclidean distance between hlai ; loi i and hlaj ; loj i, while for
3. https://www.facebook.com/about/location the social distance component we use the set of common
users in Upi and Upj (e.g., fu1 ; u2 g) and the users in one
D. Wu is with the College of Computer Science and Engineering, Shenzhen place’s record who are friends with visitors of the other place
University, Shenzhen 518060, China. E-mail: dingming@szu.edu.cn. (e.g., u3 2 Upi and u6 2 Upj who are friends with each other).
J. Shi is with the Lenovo Big Data Lab, Hong Kong. E-mail: jshi@lenovo.com.
N. Mamoulis is with the Department of Computer Science and Engineer- The intuition is that users who are friends can influence each
ing, University of Ioannina, Ioannina GR 45110, Greece. other to visit the places included in their checkin history. The
E-mail: nikos@cs.uoi.gr. details of our distance measure are presented in Section 2.
Manuscript received 5 Jan. 2017; revised 27 Nov. 2017; accepted 3 Dec. 2017. In our previous work [3], the time of the check-ins made by
Date of publication 11 Dec. 2017; date of current version 30 Mar. 2018. users is not considered during the clustering process. The
(Corresponding author: Jieming Shi.) social connections established between places may be either
Recommended for acceptance by G. Li.
out-dated or based on distant checkins in terms of time.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below. For instance, the place clusters discovered according to the
Digital Object Identifier no. 10.1109/TKDE.2017.2782256 checkins one year ago may not interest analysts that value
1041-4347 ß 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tp://www.ieee.org/publications_standards/publications/rights/index.html for more information.
WU ET AL.: DENSITY-BASED PLACE CLUSTERING USING GEO-SOCIAL NETWORK DATA 839
the temporal information in the clustering not only reveals and parameters later on. If the geo-social -neighborhood of
how the clusters evolve with time, but also helps finding a place pi contains at least MinPts places, then pi is a core
groups of places having more coherent social relationships. place. In DCPGS, w.r.t. and MinPts; a place pi is directly
Summing up, the contributions of this paper are as density-reachable from a place pj if pi 2 N ðpj Þ and
follows: N ðpj Þ MinPts; a place pi is density reachable from a place
pj if there is a chain of places p1 ; . . . ; pn ; p1 ¼ pi ; pn ¼ pj such
We propose and formulate the problem of density-
based clustering GeoSN places. that pk1 is directly density-reachable from pk (1 < k n); a
We define a simple but effective social distance mea- place pi is density connected to a place pj if there is a place pk ,
sure between places in GeoSNs. such that both pi and pj are density-reachable from pk . A
We demonstrate the effectiveness of DCPGS by case cluster C in DCPGS w.r.t. and MinPts is a non-empty sub-
studies and quantitative evaluation through two set of places satisfying the following conditions: 1) 8pi ; pj : if
quality measures that are also devised in this paper. pi 2 C and pj is density-reachable from pi then pj 2 C. 2)
We study three ways of incorporating temporal 8pi ; pj 2 C: pi is density-connected to pj . Outliers are the pla-
information in DCPGS and evaluate the resulting ces that do not belong to any cluster.
temporal-geo-social clusters. Parameters. and MinPts are the main parameters of
The rest of the paper is organized as follows. Section 2 for- DCPGS. MinPts (i.e., the minimum number of places in the
mulates the DCPGS problem, defines the social distance mea- neighborhood of a core point) is set as in the original
sure between places that we use, and introduces three DBSCAN model (see [1]); a typical value is 5. takes a value
methods of incorporating temporal information into DCPGS. between 0 and 1, because, as we explain later on, we define
The effectiveness of our framework are analyzed in Sections 3. Dgs ðpi ; pj Þ to take values in this range. Since the geo-social
Related work is reviewed in Section 4. Finally, Section 5 con- distance Dgs ðpi ; pj Þ is a function of a spatial and a social dis-
cludes this paper and discusses future work. The efficient tance, t and maxD constrain these individual distances to
algorithms in our previous work [3] for clustering GeoSN avoid the following two cases that negatively affect the
places and the performance evaluations are presented in quality of geo-social clusters.
Appendices B and C, which can be found on the Computer
Society Digital Library at http://doi.ieeecomputersociety. The geo-social distance between two places pi and pj
org/10.1109/TKDE.2017.2782256. could be less than if they are extremely close to
each other in space, but have no social connection at
all. This may lead to putting places close to each
2 MODEL AND DEFINITIONS other spatially, but having no social relationship,
Our data input includes three components: a social network, into the same cluster.
a set of places and the checkins of users to the places. The The geo-social distance between two places pi and pj
social network is an undirected graph G ¼ ðU; EÞ, where U could be less than if they have very small social dis-
is the set of all users and each edge ðui ; uj Þ 2 E indicates tance, but they are extremely far from each other spa-
that users ui ; uj 2 U are friends. Set P is the set of all places tially. This may lead to putting places with close
visited by users, in the form of hlatitude; longitudei GPS social distances, but large spatial distances, into the
points. Thus, identifiers are assigned to places according to same cluster.
their distinct GPS coordinates. Set CK ¼ fhui ; pk ; tr ijui 2 U Constraints t and maxD are defined for quality control
and pk 2 P g includes all checkins generated by users in U. and can be set by experts or according to the analyst’s expe-
A checkin in CK is a triplet hui ; pk ; tr i modelling the fact rience. We experimentally study how clustering quality is
affected by the two constraints and in Section 3.
that user ui visited place pk at a certain tr . For a place pk , the
Distance Functions. The social distance DS ðpi ; pj Þ takes as
set Upk of visiting users of pk is defined by Upk ¼ fui jhui ; pk ; i
inputs the sets of users Upi and Upj who have visited pi and
2 CKg, where means any time. Fig. 1b shows Upi and Upj
pj , respectively, and returns a value between 0 and 1. In
for the two places pi and pj of the toy example in Fig. 1a.
Section 2.2, we present our definition for DS ðpi ; pj Þ and
The figure also connects the user pairs in the two sets who alternative ways to define it based on previous work.
are linked by friendship edges in the social network. Note Before defining the geo-social distance Dgs ðpi ; pj Þ, we con-
that user u8 does not belong to either Upi or Upj , but connects vert the Euclidean distance Eðpi ; pj Þ into a spatial distance
users u4 and u7 in the social graph. Eðpi ;pj Þ
DP ðpi ; pj Þ ¼ maxD so that any place pj in the geo-social
-neighborhood of pi has spatial distance no larger than 1.
2.1 DCPGS Model
Finally, Dgs ðpi ; pj Þ is defined as weighted sum of DS ðpi ; pj Þ
Our Density-based Clustering Places in Geo-Social Net-
and DP ðpi ; pj Þ, i.e.,
works model extends the model of DBSCAN [1]; for each
place pi in the GeoSN, DCPGS finds the geo-social -neighbor- Dgs ðpi ; pj Þ ¼ v DP ðpi ; pj Þ þ ð1 vÞ DS ðpi ; pj Þ; (1)
hood N ðpi Þ of pi , which includes all places pj such that where v 2 ½0; 1.
Dgs ðpi ; pj Þ , DS ðpi ; pj Þ t, and Eðpi ; pj Þ maxD. For
two places pi , pj , Eðpi ; pj Þ is the Euclidean distance, DS ðpi ; pj Þ 2.1.1 Alternatives to DCPGS
is the social distance, and Dgs ðpi ; pj Þ ¼ fðDS ðpi ; pj Þ; Eðpi ; pj ÞÞ Using the proposed geo-social distance, our place clustering
is the geo-social distance, defined as a function of Eðpi ; pj Þ model extends density-based clustering in spatial data-
and DS ðpi ; pj Þ. Parameter is geo-social distance threshold, bases. We next present two alternatives to DCPGS, which
while t and maxD are two sanity constraints for the social can be also used to cluster GeoSN places.
and the spatial distances between places, respectively. We SNN-based Clustering. The SNN clustering algorithm [8] is
will give detailed definitions for all above distance functions an improvement of DBSCAN that can find clusters of widely
WU ET AL.: DENSITY-BASED PLACE CLUSTERING USING GEO-SOCIAL NETWORK DATA 841
differing shapes, sizes, and densities. It first finds the k nearest The above definition of DS ðpi ; pj Þ takes both the set similar-
neighbors of each data point and then redefines the similarity ity between sets Upi and Upj and the social relationships
between pairs of points in terms of how many nearest neigh- among users in Upi and Upj into account. In addition, the dis-
bors the two points share. Using this definition of similarity, tance measure penalizes pairs of places pi and pj which are
the SNN algorithm identifies core points and then builds clus- popular (i.e., Upi and/or Upj are large) but their set of contrib-
ters around the core points as DBSCAN does. uting users is relatively small (see Equation (3)). The reason is
We extend SNN to cluster places in the geo-social net- that such place pairs are not characteristic to their (loose)
work as follows. First, the neighborhood of a place p in social connections. As an example, consider places pi and pj
SNN is defined as NNðpÞ ¼ fq 2 P ^ q 6¼ pjEðp; qÞ maxD of Fig. 1. To compute DS ðpi ; pj Þ, we first set Upi ¼ fu1 ; u2 ; u3 ; u4 g
and DS ðp; qÞ tg. Then, according to SNN, the similarity and Upj ¼ fu1 ; u2 ; u5 ; u6 ; u7 g. All users in Upi and Upj are
between two places pi and pj is defined as the cardin- checked one by one to obtain the contributing users between
ality of the common places in their neighborhood, i.e., pi and pj . We derive CU ij ¼ fu1 ; u2 ; u3 ; u5 ; u6 g, since (i) both
Ssnn ðpi ; pj Þ ¼ jNNðpi Þ \ NNðpj Þj. Given a user specified sim- u1 and u2 have visited pi and pj , (ii) user u3 , who visited pi ,
ilarity parameter epssnn , the SNN geo-social neighborhood has a friend u6 who visited pj , (iii) symmetrically, user u6 ,
of a place pi is defined as Nsnn ðpi Þ ¼ fpj 2 P jSsnn ðpi ; pj Þ who visited pj , has a friend u3 who visited pi , and (iv) u5
epssnn g. Place pi is a core place if jNsnn ðpi Þj MinPtssnn , (2 Upj ) has a friend u2 having been to pi . According to Defini-
where MinPtssnn is the user specified density threshold. If tion 2, the social distance DS ðpi ; pj Þ between pi and pj in Fig. 1
two core places are within each other’s SNN geo-social is 1 jCU ij j=ðjUpi [ Upj jÞ ¼ 1 5=7 0:2857.
neighborhood, then they are put into the same cluster. Observe that only direct friendship edges between users
Graph-based Clustering. GeoSN places can alternatively be of Upi and Upj are considered in our social distance defini-
clustered by the use of graph clustering models. The main tion. Longer network paths, such as friend-of-friend rela-
idea of such a model is to construct a place network PN, tionships, are ignored (e.g., the case of users u4 and u7 in
which connects places according to their social and spatial Fig. 1b who are connected via user u8 ). According to the
distances and then apply an off-the-shelf community detec- small world effect [12], a user in a social network can reach
tion algorithm on PN. Specifically, given two places pi and a large portion of other users within only few hops. For
pj , if Eðpi ; pj Þ maxD and DS ðpi ; pj Þ t, an undirected and instance, the 90-percentile effective diameter [13] of Gowalla
weighted edge with geo-social weight Wgs ðpi ; pj Þ ¼ 1 Dgs GeoSN, used in our experiments, is just 5.7.4 This means
ðpi ; pj Þ is added between pi and pj . Community detection within quite a few hops most users can reach a very large
algorithms like Link Clustering [9], [10] or Metis [11] can percentage of all users. In Gowalla, only 8 users can access
then be applied to derive the clusters. Link Clustering con- more than 1 percent of all the users in 1 hop, while 40516
structs a dendrogram of network communities (that may users (20.61 percent of all the users) can access more than 1
overlap) in a hierarchical manner. Metis is another multi- percent of all the users in 2 hops and 141582 users (72.02
level graph partition paradigm that includes three phases: percent of all the users) can reach more than 1 percent of all
graph coarsening, initial partitioning, and uncoarsening; the users in 3 hops. The number of users who can reach
Metis divides a network into k non-overlapping communi- more than 1 percent users in 2 or 3 hops increases dramati-
ties. As we will show in Section 3, these graph clustering cally compared to the percentage of those visited within 1
methods are inferior to DCPGS. hop. Thus, paths longer than 1 hops are too common and
cannot be considered as (indirect) user relationships; i.e.,
2.2 Social Distance between Places
their impact is much weaker compared to direct friendship
The social distance DS ðpi ; pj Þ between pi and pj naturally edges. Hence, Definition 2 introduces a simple, but power-
depends on the social network relationships between the ful social distance measure. Properties of DS ðpi ; pj Þ include
sets Upi and Upj of users who visited pi and pj , respectively. symmetry (i.e., DS ðpi ; pj Þ ¼ DS ðpj ; pi Þ) and self-minimality
DS ðpi ; pj Þ is based on the set CU ij of contributing users (i.e., DS ðpi ; pj Þ 2 ½0; 1, and DS ðpi ; pj Þ ¼ 0 for Upi ¼ Upj ). On
between two places pi and pj : the other hand, DS ðpi ; pj Þ does not obey the triangular
Definition 1 (Contributing Users). Given two places pi and inequality, but this does not affect our clustering algorithm.
pj with visiting users Upi and Upj , respectively, the set of con-
tributing users CU ij for the place pair ðpi ; pj Þ is defined as 2.2.1 Alternatives to DS
Our DCPGS model is independent of the social distance
CU ij ¼ fua 2 Upi jua 2 Upj or 9ub 2 Upj ; ðua ; ub Þ 2 Eg definition between places (i.e., DS ). As an alternative to
(2)
[ fua 2 Upj jua 2 Upi or 9ub 2 Upi ; ðua ; ub Þ 2 Eg: our Definition 2, the following measures can be used. In
Section 3, we evaluate the effectiveness of these alternatives.
Specifically, if a user ua has visited both pi and pj , then ua is Jaccard. Based on the Jaccard similarity Jðpi ; pj Þ ¼ ðjUpi \
a contributing user. Also if ua has visited place pi , ub has vis- Upj jÞ=ðjUpi [ Upj jÞ between the sets of visiting users for
ited pj , and ua and ub are friends, both ua and ub are contribut- pi and pj , we define DJac S ðpi ; pj Þ ¼ 1 Jðpi ; pj Þ, which is
ing users. Users in CU ij contribute positively (negatively) to not intuitive; it disregards the social network, assuming that
the social similarity (distance) between pi and pj . Formally: two users who are friends do not affect each other in visiting
Definition 2 (Social Distance). Given two places pi and pj GeoSN places.
with visiting users Upi and Upj , respectively, the social dis- SimRank. SimRank is a structural-context model for mea-
tance between pi and pj is defined as suring the similarity between nodes in a graph. The idea is
that two nodes are equivalent if they relate to equivalent
jCU ij j nodes. We can define a SimRank-based social distance
DS ðpi ; pj Þ ¼ 1 : (3)
jUpi [ Upj j
4. http://snap.stanford.edu/data/index.html
842 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 5, MAY 2018
spatially loose clusters by reducing the density parameters, Alternatives to DCPGS. We visually analyzed the results
this would result in merging too many clusters together, of the alternatives to DCPGS, i.e., SNN-based clustering
making denser clusters indistinguishable. model and graph-based clustering models (LinkClustering
Fuzzy Boundary Clusters. Some DCPGS geo-social clusters and Metis), described in Section 2.1.1. LinkClustering and
have fuzzy boundaries with each other, which is reasonable Metis produce similar results; indicatively, we show the
in the real world, since groups of socially connected users clusters produced by LinkClustering in Fig. 2d. LinkClus-
may spatially overlap. On the other hand, DBSCAN produ- tering produces thousands of small clusters (average size
ces clusters with strict boundaries. For instance, in Fig. 2a, around 3), which are typically not well-separated spatially.
there is no strict boundary between the two clusters Due to the sparsity of geo-social network data, the con-
enclosed in region C. Although PureSocialDistance, which structed place network contains many connected compo-
is the other extreme method, also produces clusters with nents that are disconnected with each other (e.g., the place
fuzzy boundaries (see Fig. 2c), the clusters are spatially network built when t ¼ 0:7, maxD ¼ 100, and v ¼ 0:5 con-
indistinguishable and they are not interesting, i.e., for the tains 34,496 connected components with 4.3 nodes and 8.2
applications mentioned in the Introduction. edges on average). The clusters found by Metis are fewer
WU ET AL.: DENSITY-BASED PLACE CLUSTERING USING GEO-SOCIAL NETWORK DATA 845
Fig. 7. Continuous history frame geo-social clustering in Manhattan: ¼ 0:5, MinPts ¼ 5, maxD ¼ 120, t ¼ 0:7, and v ¼ 0:5.
848 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 30, NO. 5, MAY 2018
Fig. 9. Temporally contributing users in Manhattan: ¼ 0:5, MinPts ¼ 5, maxD ¼ 120, t ¼ 0:7, and v ¼ 0:5.
Temporally Contributing Users. Fig. 9 shows the temporal- temporal-geo-social clusters, which may serve various pur-
geo-social clusters found in Manhattan when parameter u of poses of analysis and investigation. The history-frame method
temporally contributing users is set to 1 week, 1 month, and offers the evolution of clusters over time. The damping win-
6 months, respectively. When u is a short time interval (e.g., dow method generates the up-to-date clusters. The temporally
1 week), the result reveals that the places in the same clus- contributing user method shows the places that are revisited
ters are revisited by socially connected users within a short within a period of time. More results from the area of Chicago
period of time. As u increases, the temporal-social distances on the Brightkite data set can be found in Appendix A,
between more places decrease, thus as expected (1) some available in the online supplemental material.
clusters expand, (2) multiple clusters merge (e.g., regions A Social Entropy based Evaluation. Fig. 12 shows the social
and B), (3) new clusters are found (e.g., region C). For mar- entropy of the temporal-geo-social clusters discovered by the
keting and management purpose, people may be interested three methods introduced in Section 2.3 compared with the
in both the short term and long term clusters. social entropy of the geo-social clusters found by DCPGS.
Damping Window. In Fig. 11, we show the temporal-geo- Recall that a low social entropy means that the great majority
social clusters found by the damping window method in of the visitors of a place cluster come from the same commu-
Manhattan, where the starting time is 01/02/2009 and the nity, indicating that the places in the cluster have tighter social
current time is set to 31/07/2010 and 31/01/2011, respec- relationships between each other. We observe that the clusters
tively. To better understand the effect of the damping found by the damping window method (DCPGSDW-Exp)
window, Fig. 10 shows the clusters found by DCPGS on the and the temporally contributing user method (DCPGSTT)
same data in the same periods as the damping window have better (lower) social entropy, compared to the result
method. We observe that both the size of the clusters and found by DCPGS that does not consider the temporally infor-
the number of clusters found by the damping window are mation. Furthermore, the improvement achieved by damping
smaller than DCPGS. This is expected, since weighing the window method is larger than that of the temporally contrib-
old data less increases the temporal-social distances between uting user method. However, the social entropy of the clusters
places. Under the same parameter setting, more regions in found by the history frame method (DCPGSHF) is worse than
the damping window are considered as sparse. However, the result of DCPGS. DCPGSHF in fact splits the whole data
the advantage of the damping window is offering the into several sub-datasets based on time frames and performs
up-to-date clustering result. As we can see in Fig. 11, the dis- clustering on these sub-datasets. The social entropies of the
covered clusters for the two periods are different, which are clusters from these sub-datasets should not necessarily be
sensitive to time. For instance, by 31/07/2010, clusters smaller than the clusters from the whole data. DCPGSTT and
are found on Lafayette St., while by 31/01/2011, clusters are DCPGSDW-Exp require more intra temporal closeness
found close to Washington Square Park. Nevertheless, it is among places within the temporal-geo-social cluster. The
not easy to notice these interesting up-to-date small clusters smaller entropies of these two methods indicates that tempo-
in Fig. 10 because of the accumulated old data. ral constraints can enhance the social connections within a
In summary, the three ways of considering the temporal cluster. According to the analysis in previously sections, the
information in the geo-social clustering yield different
Fig. 10. DCPGS in Manhattan: ¼ 0:5, MinPts ¼ 5, maxD ¼ 120, t ¼ 0:7, Fig. 11. Damping window in Manhattan: ¼ 0:5, MinPts ¼ 5, maxD ¼
and v ¼ 0:5. 120, t ¼ 0:7, and v ¼ 0:5.
WU ET AL.: DENSITY-BASED PLACE CLUSTERING USING GEO-SOCIAL NETWORK DATA 849
link prediction model [5]. Backstrom et al. [41] predict the places, considering the social ties between users that visit
location of an individual from a sparse set of known user them. Our measure is shown to be more effective to compute,
locations using the relationship between geography and compared to more complex ones based on node-to-node graph
friendship. Wang et al. [15] find that the similarity between proximity and SNN-based model. We analyzed the effective-
the movements of two individuals strongly correlates with ness of DCPGS via case studies and demonstrated that DCPGS
their proximity in the social network. This correlation is can discover clusters with interesting properties (i.e., barrier-
used as a tool for link prediction in a social network. Pham based splitting, spatially loose clusters, clusters with fuzzy
et al. [42] propose an entropy-based model (EBM) that not boundaries), which cannot be found by merely using spatial
only infers social connections but also estimates the strength clustering. To improve the quality of clusters, we incorporate
of social connections by analyzing people’s co-occurrences the temporal information of the checkins using three different
in space and time. Different from existing work, we neither ways that satisfy different analysis and investigation require-
study the spatial-social relationship nor do prediction or ments. Besides, we designed two evaluation measures to
recommendation utilizing this relationship. We perform quantitatively evaluate the social quality of clusters detected
density-based clustering of GeoSN places considering both by DCPGS or competitors, called social entropy and commu-
the spatial distances between them and the social relation- nity score, which also confirm that DCPGS is more effective
ships between users that visit the places. than alternative approaches and the temporal dimension fur-
Detecting and Evaluating Communities in Networks. There ther improves the quality of the clusters.
are many existing works on network community detection
and clustering of nodes in a graph using only the network ACKNOWLEDGMENTS
distance between nodes [43], [44], [45], [46]. SCAN [45] is an
algorithm that detects clusters, hubs, and outliers in net- This work was supported in part by NSFC grant No. 61502310
works. [46] proposed partitioning, hierarchical and density- and by the European Unions Horizon 2020 research and inno-
based algorithms to cluster objects on spatial networks, vation programme under grant agreement No 657347.
based on shortest-path distance. [19] summarized and
empirically evaluated algorithms for network community REFERENCES
detection. In Section 3.2.2, we have used network commu- [1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based
nity quality measures [19] to evaluate the social quality of algorithm for discovering clusters in large spatial databases with
the place clusters found by our algorithms. noise,” in Proc. 2nd Int. Conf. Knowl. Discovery Data Mining, 1996,
Importing Time into Clustering. Previous works on spatio- pp. 226–231.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Min-
temporal data clustering typically include the time concept in ing. Boston, MA, USA: Addison-Wesley, 2005.
the data or algorithms as a threshold or as a parameter of the [3] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung, “Density-based
distance function. Some works transform the spatio-temporal place clustering in geo-social networks,” in Proc. ACM SIGMOD
clustering problem to a multi-sequence spatial data clustering Int. Conf. Manag. Data, 2014, pp. 99–110.
[4] S. Scellato, A. Noulas, R. Lambiotte, and C. Mascolo, “Socio-spa-
problem [47]. The history-frame clustering method that we tial properties of online location-based social networks,” in Proc.
propose is similar to multi-sequence clustering. Trasarti [48] Int. Conf. Web Social Media, 2011, pp. 329–336.
imported time to data as a new dimension. Thus clustering is [5] S. Scellato, A. Noulas, and C. Mascolo, “Exploiting place features
performed on high dimensional data using standard distance in link prediction on location-based social networks,” in Proc. 17th
measure such as the Euclidean distance. In this method, the ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011,
pp. 1046–1054.
time effect may be reduced compared with many other [6] D.-N. Yang, C.-Y. Shen, W.-C. Lee, and M.-S. Chen, “On socio-spa-
dimensions. ST-DBSCAN [49] improves DBSCAN to cluster tial group query for location-based social networks,” in Proc. 18th
spatialtemporal data where a time period attached to the spa- ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2012,
tial data expresses when it was valid or stored in the database. pp. 949–957.
During clustering, spatio-temporal data are filtered by retain- [7] N. Armenatzoglou, S. Papadopoulos, and D. Papadias, “A general
framework for geo-social query processing,” Proc. VLDB Endow-
ing only the temporal neighbors and their corresponding spa- ment, vol. 6, no. 10, pp. 913–924, 2013.
tial values. Two objects are temporal neighbors if the values [8] L. Ert€oz, M. Steinbach, and V. Kumar, “Finding clusters of differ-
of these objects are observed in consecutive time units such as ent sizes, shapes, and densities in noisy, high dimensional data,”
consecutive days in the same year or in the same day in conse- in Proc. 7th SIAM Int. Conf. Data Mining, 2003, pp. 47–58.
[9] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities
cutive years. Similarly, [50], [51], [52] integrate the temporal reveal multiscale complexity in networks,” Nature, vol. 466,
information into the distance function used for clustering. In pp. 761–764, 2010.
other words, two objects have to be close in terms of time [10] I. R. Brilhante, M. Berlingerio, R. Trasarti, C. Renso, J. A. F. de
(e.g., time difference lower than 1 hour) in order to belong to Mac^edo, and M. A. Casanova, “ComeTogether: Discovering com-
the same cluster. Different from existing works that simply munities of places in mobility data,” in Porc. 12th IEEE Int. Conf.
Mobile Data Manage., 2012, pp. 268–273.
set a time threshold and use it as an additional filtering step [11] G. Karypis and V. Kumar, “A fast and high quality multilevel
when computing distances between objects, the damping scheme for partitioning irregular graphs,” SIAM J. Scientific Com-
window and temporally contributing user methods in the put., vol. 20, no. 1, pp. 359–392, 1998.
paper take the temporal information and the users’ social rela- [12] S. Milgram, “The small world problem,” Psychology Today, vol. 2,
tionships into account simultaneously. no. 1, pp. 60–67, 1967.
[13] J. Leskovec, J. M. Kleinberg, and C. Faloutsos, “Graph evolution:
Densification and shrinking diameters,” ACM Trans. Knowl. Dis-
5 CONCLUSION covery Data, vol. 1, no. 1, 2007, Art. no. 2.
In this paper, we studied for the first time the problem of Den- [14] G. Jeh and J. Widom, “SimRank: A measure of structural-context
sity-based Clustering Places in Geo-Social Networks. Our clus- similarity,” in Proc. of 8th ACM SIGKDD Int. Conf. Knowl. Discovery
Data Mining, 2002, pp. 538–543.
tering model extends the density-based clustering paradigm [15] D. Wang, D. Pedreschi, C. Song, F. Giannotti, and A.-L. Barabasi,
to consider both the spatial and social distances between pla- “Human mobility, social ties, and link prediction,” in Proc. 17th ACM
ces. We defined a new measure for the social distance between SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 1100–1108.
WU ET AL.: DENSITY-BASED PLACE CLUSTERING USING GEO-SOCIAL NETWORK DATA 851
[16] C. Wang, V. Satuluri, and S. Parthasarathy, “Local probabilistic [40] L. Nahemow and M. Lawton, “Similarity and propinquity in
models for link prediction,” in Proc. IEEE Int. Conf. Data Mining, friendship formation,” J. Personality Soc. Psychology, vol. 32, no. 2,
2007, pp. 322–331. pp. 205–213, 1975.
[17] P. Sarkar and A. W. Moore, “A tractable approach to finding clos- [41] L. Backstrom, E. Sun, and C. Marlow, “Find me if you can: Improv-
est truncated-commute-time neighbors in large graphs,” in Proc. ing geographical prediction with social and spatial proximity,” in
26th Conf. Uncertainty Artif. Intell., 2007, pp. 335–343. Proc. 23rd Int. Conf. World Wide Web, 2010, pp. 61–70.
[18] K. V. Mardia, J. Kent, and J. Bibby, Multivariate Analysis (Probability [42] H. Pham, C. Shahabi, and Y. Liu, “EBM: An entropy-based model
and Mathematical Statistics). London, U.K.: Academic Press, 1980. to infer social strength from spatiotemporal data,” in Proc. ACM
[19] J. Leskovec, K. J. Lang, and M. W. Mahoney, “Empirical compari- SIGMOD Int. Conf. Manag. Data, 2013, pp. 265–276.
son of algorithms for network community detection,” in Proc. 19th [43] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe, “A
Int. Conf. World Wide Web, 2010, pp. 631–640. framework for community identification in dynamic social
[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques. networks,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery
Burlington, MA, USA: Morgan Kaufmann, 2000. Data Mining, 2007, pp. 717–726.
[21] R. T. Ng and J. Han, “Efficient and effective clustering methods [44] A. Clauset, M. E. Newman, and C. Moore, “Finding community
for spatial data mining,” in Proc. 20th Int. Conf. Very Large Data structure in very large networks,” Phys. Rev. E, vol. 70, no. 6, 2004.
Bases, 1994, pp. 144–155. [45] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A struc-
[22] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient tural clustering algorithm for networks,” in Proc. 13th ACM
data clustering method for very large databases,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 824–
SIGMOD Int. Conf. Manag. Data, 1996, pp. 103–114. 833.
[23] G. Karypis, E.-H. Han, and V. Kumar, “Chameleon: Hierarchical [46] M. L. Yiu and N. Mamoulis, “Clustering objects on a spatial
clustering using dynamic modeling,” IEEE Comput., vol. 32, no. 8, network,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2004,
pp. 68–75, Aug. 1999. pp. 443–454.
[24] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering [47] H. F. Tork, “Spatio-temporal clustering methods classification,” in
algorithm for large databases,” in Proc. ACM SIGMOD Int. Conf. Proc. 11th Doctoral Symp. Inf. Eng., 2012.
Manag. Data, 1998, pp. 73–84. [48] R. Trasarti, “Mastering the spatio-temporal knowledge discovery
[25] A. Gunawan, “A faster algorithm for DBSCAN,” Master Thesis. process,” Ph.D. dissertation, Department of Computer Science,
Dept. Math. Comput. Sci., Technische Univ. Eindhoven, Eindhoven, University of Pisa, Pisa, Italy, 2010.
Netherlands, Mar. 2013. [49] D. Birant and A. Kut, “ST-DBSCAN: An algorithm for clustering
[26] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: spatial-temporal data,” Data Knowl. Eng., vol. 60, no. 1, pp. 208–
Ordering points to identify the clustering structure,” in Proc. ACM 221, 2007.
SIGMOD Int. Conf. Manag. Data, 1999, pp. 49–60. [50] M. Wang, A. Wang, and A. Li, “Mining spatial-temporal clusters
[27] A. Hinneburg and D. A. Keim, “An efficient approach to cluster- from geo-databases,” in Proc. 2nd Int. Conf. Adv. Data Mining
ing in large multimedia databases with noise,” in Proc. 4th ACM Appl., 2006, pp. 263–270.
SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1998, pp. 58–65. [51] G. Andrienko and N. Andrienko, “Interactive cluster analysis of
[28] S. Jahirabadkar and P. Kulkarni, “Short communication: Algo- diverse types of spatiotemporal data,” SIGKDD Explor. Newsl.,
rithm to determine epsilon-distance parameter in density based vol. 11, no. 2, pp. 19–28, 2010.
clustering,” Expert Syst. Appl., vol. 41, no. 6, pp. 2939–2946, 2014. [52] D. Fitrianah, A. N. Hidayanto, H. Fahmi1, J. L. Gaol, and A. M.
[29] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, “Density-based clus- Arymurthy, “ST-AGRID:: A spatio temporal grid density based
tering in spatial databases: The algorithm GDBSCAN and its clustering and its application for determining the potential fishing
applications,” Data Min. Knowl. Discov., vol. 2, no. 2, pp. 169–194, zones,” Int. J. Softw. Eng. Appl., vol. 9, no. 1, pp. 13–26, 2015.
1998.
[30] S. T. Mai, X. He, J. Feng, C. Plant, and C. B€ohm, “Anytime density- Dingming Wu received the PhD degree in com-
based clustering of complex data,” Knowl. Inf. Syst., vol. 45, no. 2, puter science from Aalborg University, Denmark,
pp. 319–355, 2015. in 2011. She is an assistant professor in the Col-
[31] J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, lege of Computer Science & Software Engineer-
and approximation,” in Proc. ACM SIGMOD Int. Conf. Manag. ing, Shenzhen University, China. Her general
Data, 2015, pp. 519–530. research area is data management and mining,
[32] G. L. Andrienko, N. V. Andrienko, C. Hurter, S. Rinzivillo, and including data modeling, database design, and
S. Wrobel, “Scalable analysis of movement data for extracting and query languages, efficient query and update
exploring significant places,” IEEE Trans. Vis. Comput. Graph., processing, indexing, and mining algorithms.
vol. 19, no. 7, pp. 1078–1094, Jul. 2013.
[33] A. Noulas, B. Shaw, R. Lambiotte, and C. Mascolo, “Topological
properties and temporal dynamics of place networks in urban Jieming Shi received the bachelor’s degree in
environments,” in Proc. 23rd Int. Conf. World Wide Web, 2015, science from the Department of Computer Sci-
pp. 431–441. ence and Technology, Nanjing University, in,
[34] Z. Yu, O. C. Au, R. Zou, W. Yu, and J. Tian, “An adaptive unsu- 2011, and the PhD degree in computer science
pervised approach toward pixel clustering and color image from the University of Hong Kong. He is now a
segmentation,” Pattern Recog., vol. 43, no. 5, pp. 1889–1906, 2010. staff researcher with the Lenovo Big Data Lab in
[35] Y. van Gennip, B. Hunter, R. Ahn, P. Elliott, K. Luh, Hong Kong. His research interests include geo-
M. Halvorson, S. Reid, M. Valasik, J. Wo, G. E. Tita, A. L. Bertozzi, social network mining, knowledge graph manage-
and P. J. Brantingham, “Community detection using spectral ment, and query processing on spatial-textual
clustering on sparse geosocial data,” SIAM J. Applied Mathematics, data.
vol. 73, no. 1, pp. 67–83, 2013.
[36] B. Zhang, W. J. Yin, M. Xie, and J. Dong, “Geo-spatial clustering Nikos Mamoulis received the diploma in com-
with non-spatial attributes and geographic non-overlapping con- puter engineering and informatics from the Uni-
straint: A penalized spatial distance measure,” in Proc. 12th versity of Patras, Greece, and the PhD degree in
Pacific-Asia Conf. Adv. Knowl. Discovery Data Mining, 2007, computer science from the Hong Kong University
pp. 1072–1079. of Science and Technology, in 1995 and 2000,
[37] S. Yokoyama, A. Bog ardi-Mesz€oly, and H. Ishikawa, “EBSCAN: An respectively. He is currently a faculty member in
entanglement-based algorithm for discovering dense regions in the Department of Computer Science and Engi-
large geo-social data streams with noise,” in Proc. 8th ACM SIGSPA- neering, University of Ioannina. His research
TIAL Int. Workshop Location-Based Soc. Netw., 2015, pp. 7:1–7:10. focuses on the management and mining of com-
[38] J. Q. Stewart, “An inverse distance variation for certain social plex data types, privacy and security in data-
influence,” Sci., vol. 93, no. 2404, pp. 89–90, 1941. bases, and uncertain data management.
[39] L. Festinger, S. Schachter, and K. Back, Social Pressures in Informal
Groups: A Study of Human Factors in Housing. Palo Alto, CA, USA: " For more information on this or any other computing topic,
Stanford Univ. Press, 1963. please visit our Digital Library at www.computer.org/publications/dlib.