
Improving Trend Analysis Using Social Network Features

ABSTRACT
In recent years, large volumes of data have been extensively studied by organizations trying to extract and use the information produced by people on the internet. In this context, trend analysis is one of the most important areas explored by researchers. Typically, good prediction results are hard to obtain because of the complexity of the problem. This paper goes beyond simple trend identification methods by including the structure of the information sources, i.e., social network metrics, as an additional dimension to model and predict trends through time. The results show that the inclusion of such metrics improved the accuracy of the prediction. Our experiments used the publication titles of all Brazilian PhDs in Computer Science for the analyzed periods in order to evaluate the developed trend prediction approach.

CCS Concepts
• Computing methodologies → Model development and analysis; Machine learning approaches;

Keywords
social network; trend analysis; data mining

1. INTRODUCTION

The amount of data produced by people on the internet has increased enormously, mainly because of the great number of business and social media applications. It is known that this data can be stored, treated, organized and analyzed to be useful for many kinds of organizations. Trend analysis is one of the research areas that can provide insights about users' behavior on the World Wide Web. For example, e-commerce companies can analyze users' purchase behavior to improve their logistics and sales processes, whilst state managers can identify potential research areas to invest in.

The problem of finding significant trends in this big volume of data is challenging. The challenge arises from the dynamicity of the data combined with other factors, such as the influence of the entities that produce the data. In order to model and predict trends, many studies have used time series and content based approaches.
Trends can also be analyzed from the social structure of the data sources. The social structure plays an important role in the dynamics of information spreading. Although there have been several works studying trend identification from time series and content based data, and others focusing on social network techniques, the literature still lacks research that combines these two concepts. This work starts from the premise that combining social network and time series data can yield better results in trend modeling and prediction. Thus, this work aims to improve trend prediction capability by attaching social network metrics to time series and content based trend analysis models.
In this paper, we propose an approach that builds a prediction model combining time series with social network metrics. The approach was tested and validated using data from the publication titles of Brazilian PhDs in Computer Science and then compared to a time series regression model.
This paper is organized as follows. Section 2 describes some basic concepts and related work. Section 3 details the methodology used. The results are described in Section 4. Finally, conclusions are presented in Section 5.

2. RELATED WORK

2.1 Time Series and Content Based Trend Analysis

Analyzing the frequency behavior of information through time to identify patterns has been the focus of trend analysis studies in recent years. Time as the independent variable, together with variables usually extracted from the meta-analysis of the data, have been the only factors used in most studies [18]. Frequency indexes for trend analysis of terms [1, 17] and prices for stock market applications [16] have been well explored by researchers.
Sometimes the variables have to be built and not just extracted. The identification of trends in texts (such as chats, forums, and academic papers) is one of these cases [18]. For these applications, the step of mining the data and building the dependent variables becomes very important. Thus, many techniques have been developed to extract topics from text in order to automate the whole trend analysis process [9]. Furthermore, besides just identifying topics with potential for popularity, some works have gone deeper, trying to find out the behavior of these topics. In recent years, some studies [8, 13] have built models to understand the stability of topics or to classify topics as trends depending on the behavior of subtopics.

2.2 Social Network Trend Analysis

The use of social network theory in trend analysis can be seen as an important improvement in the area. Information is produced by individuals, and individuals have social characteristics that matter in the way information diffuses [3]. The structure of the social connections is not the only factor: some individuals have more influence than others, and methods are being developed to identify them [14]. Another important issue in the area, rather than finding the most influential nodes, is finding the starting point of the trends in the network, and it has been a research topic as well [2]. Therefore, social network information is being used in several ways to predict trends based on the network behavior [12].
Several kinds of indicators can be derived from scholarly data, and knowledge can be discovered in a quantitative way [10]. Research productivity, for example, can be measured by models that use citation indexes and academic social network analysis [4]. Our work also uses data from a science and technology system, aiming to identify trending areas and themes of research.

3. METHODOLOGY

In this work, we have assessed the curricula of 5,642 Brazilian PhDs in Computer Science who published scientific papers between 1991 and 2012. For each curriculum, we gathered the titles of all published full papers. Then, we automatically extracted terms from these titles and performed trend analysis for each term. Figure 1 illustrates the schematic data flow used in this work. In the following, each step is described.

Figure 1: Schematic data flow of the work process

3.1 Data Gathering

Brazil has a unique curricula platform called Lattes Platform (http://lattes.cnpq.br/). This system maintains the curricula of the major researchers working in Brazil and currently contains more than 4 million registered curricula. In this work, all the information used comes from the Lattes Platform.
For data gathering, all the Computer Science PhDs' curricula were selected for the periods analyzed (comprising 5,642 curricula). The information extracted was organized and stored in a relational database using the methodology described by Digiampietri et al. [6]. From these curricula, 55,710 titles were identified from papers published between 1991 and 2012. For the experiments, the prediction was made for the year 2012.

3.2 Term Extraction

In this step, the goal was to automate the data preparation. The first part of term extraction was to split the titles into subsets of words or sequences of words without stop-words. As an example, the title Social Network Analysis For Digital Media was split into the following terms: Social, Network, Analysis, Digital, Media, Social Network, Network Analysis, Digital Media and Social Network Analysis.
With all the possible sets of terms, we used a scoring system to identify the most important terms. This scoring method was based on the adjacent frequency of the words that compose the terms. The equation used to calculate the importance of each candidate term is

LRF(CT) = f(CT) \left( \prod_{i=1}^{T} (LF(N_i)+1)(RF(N_i)+1) \right)^{1/T} > 1.0

where $f(CT)$ is the frequency of the candidate term $CT$, and $LF(N_i)$ and $RF(N_i)$ indicate the frequencies of the left and right adjacent candidates, respectively. This equation is described in detail by Nakagawa et al. [11].
We observed that the composed terms were more significant than the simple terms for the subjects discussed in the publications. Therefore, we selected the 1,638 most important composed terms. These 1,638 terms were extracted in the three different periods used for the experiments, as explained in Section 4.
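To make the extraction and scoring steps concrete, the sketch below illustrates them under stated assumptions: a tiny illustrative stop-word list, candidate terms taken as every contiguous word sequence inside the stop-word-free chunks of a title (reproducing the Social Network Analysis For Digital Media example above), and the LRF score computed from adjacent-word frequencies as in the equation above. This is not the authors' implementation, only a minimal illustration.

```python
from collections import Counter

# Tiny illustrative stop-word list; the original work likely used a fuller one.
STOP_WORDS = {"for", "the", "of", "and", "a", "an", "in", "on", "to", "with"}

def stopword_free_chunks(title):
    """Split a title at stop-words into chunks of consecutive content words."""
    chunks, current = [], []
    for word in title.lower().split():
        if word in STOP_WORDS:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

def candidate_terms(title):
    """Every contiguous word sequence inside each chunk is a candidate term."""
    terms = []
    for chunk in stopword_free_chunks(title):
        for i in range(len(chunk)):
            for j in range(i + 1, len(chunk) + 1):
                terms.append(tuple(chunk[i:j]))
    return terms

def lrf_scores(titles):
    """Score candidates by LRF(CT) = f(CT) * (prod (LF+1)(RF+1))^(1/T)."""
    term_freq, left_freq, right_freq = Counter(), Counter(), Counter()
    for title in titles:
        for term in candidate_terms(title):
            term_freq[term] += 1
        for chunk in stopword_free_chunks(title):
            for left, right in zip(chunk, chunk[1:]):
                right_freq[left] += 1   # `left` gained one right neighbour
                left_freq[right] += 1   # `right` gained one left neighbour
    scores = {}
    for term, freq in term_freq.items():
        product = 1.0
        for word in term:
            product *= (left_freq[word] + 1) * (right_freq[word] + 1)
        scores[term] = freq * product ** (1.0 / len(term))
    return scores

if __name__ == "__main__":
    titles = ["Social Network Analysis For Digital Media",
              "Social Network Metrics For Trend Analysis"]
    ranked = sorted(lrf_scores(titles).items(), key=lambda kv: -kv[1])
    for term, score in ranked[:5]:
        print(" ".join(term), round(score, 2))
```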

3.3 Social Network Analysis

3.3.1 The network

In this paper, the social network was modeled based on scientific collaboration. Sonnenwald defined scientific collaboration as the interaction among two or more scientists sharing the same goal in order to facilitate the completion of tasks [15]. More specifically, the network was built according to joint publications (coauthorship relationships). The social network was modeled as an undirected graph composed of vertices (authors) and edges (coauthorships).
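As an illustration of this construction, the following sketch builds such a coauthorship graph with the networkx library; the input format (a list of papers with author identifiers) is hypothetical.

```python
import itertools
import networkx as nx

# Hypothetical input: one record per paper, listing its authors' identifiers.
papers = [
    {"title": "Social Network Analysis For Digital Media", "authors": ["a1", "a2"]},
    {"title": "Trend Analysis of Scientific Production", "authors": ["a2", "a3", "a4"]},
]

# Undirected graph: vertices are authors, edges are coauthorship relationships.
G = nx.Graph()
for paper in papers:
    G.add_nodes_from(paper["authors"])
    for u, v in itertools.combinations(paper["authors"], 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1   # count joint publications on the edge
        else:
            G.add_edge(u, v, weight=1)

print(G.number_of_nodes(), "authors,", G.number_of_edges(), "coauthorship edges")
```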
Figures 2, 3 and 4 were adapted from [7], which used the same dataset as ours but aimed to characterize the social network of Brazilian Computer Science PhDs. In Figure 2, each node is a Brazilian state, and it is possible to see the interstate relations. In Figures 3 and 4, each node represents a PhD in Computer Science, colored according to his or her Brazilian state. The algorithm that produced these figures uses a force-directed approach, in which there is a repulsion force between unrelated nodes and an attraction force between related ones. The two networks differ because Figure 3 contains only edges between nodes from the same state, whereas the network in Figure 4 contains all the edges of this academic social network.

Figure 2: Computer Science PhDs social network - Brazilian states

3.3.2 The metrics

The metrics of a social network consist of different characteristics that can be quantified. In the proposed approach, some metrics were selected to become part of the set of independent variables. The selection was based on assumptions about the capability of each metric to explain the information spreading. For example, one of the assumptions is that a node included in the giant component of a network is more capable of disseminating information through the network than a node which is not. The metrics selected are: giant composition, shortest path to the most central node, degree centrality, eigenvector centrality, page rank centrality, betweenness centrality, closeness centrality, clustering coefficient, structural equivalence with the most central node, and community average centrality. The centrality metrics express how important a node is in the network, the shortest path metric indicates how far the node is from the most central node, and the structural equivalence shows how similar the analyzed node is to the most central node. The most important/central node was used as the reference. To justify the use of the most central node as the reference, Table 1 shows the difference in degree and eigenvector centrality between the most central node and the other top ten most important nodes in the network.
Table 1: Eigenvector centrality and degree of the top ten central nodes

Rank   Eigenvector Centrality   Degree
1      1.000                    67
2      0.986                    45
3      0.944                    45
4      0.845                    31
5      0.825                    35
6      0.799                    29
7      0.798                    37
8      0.763                    24
9      0.745                    30
10     0.744                    27
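Most of the node-level metrics listed above are available in standard graph libraries. The sketch below, which is only an illustration and not the authors' code, computes several of them with networkx, using the node with the highest eigenvector centrality as the reference (most central) node; structural equivalence and the community-level aggregation are handled in the next step.

```python
import networkx as nx

def node_metrics(G):
    """Compute a subset of the node-level metrics with networkx."""
    giant = max(nx.connected_components(G), key=len)
    eigen = nx.eigenvector_centrality(G, max_iter=1000)
    most_central = max(eigen, key=eigen.get)      # reference (most central) node
    degree = nx.degree_centrality(G)
    pagerank = nx.pagerank(G)
    betweenness = nx.betweenness_centrality(G)
    closeness = nx.closeness_centrality(G)
    clustering = nx.clustering(G)
    # Shortest path length to the most central node (inf if unreachable).
    dist = nx.single_source_shortest_path_length(G, most_central)
    metrics = {}
    for node in G.nodes():
        metrics[node] = {
            "in_giant_component": node in giant,
            "shortest_path_to_central": dist.get(node, float("inf")),
            "degree_centrality": degree[node],
            "eigenvector_centrality": eigen[node],
            "pagerank": pagerank[node],
            "betweenness": betweenness[node],
            "closeness": closeness[node],
            "clustering": clustering[node],
        }
    return metrics, most_central
```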

Figure 3: Brazilian Computer Science PhDs social network

Figure 4: Brazilian Computer Science PhDs social network - Reorganized

After the selection of metrics, all the related metrics were calculated for each node. Then, for each of the 1,638 extracted terms, the metrics of the nodes were summed. However, instead of building the variables directly from the summed node metrics, the nodes were grouped into communities (we used the algorithm proposed by Clauset et al. [5] to identify the communities). We assumed that using community metrics instead of individual node metrics for most of the variables would better explain the information diffusion. The individual information would result in misleading social metrics for some terms that could be widespread in a single community but not in the network as a whole; if many nodes of the same community use the same term, the metrics could become skewed. This intra-community balancing captures the fact that information inside a community propagates quickly and tends to become general knowledge of the nodes belonging to it.
Therefore, the final values of the variables were based on the community values. For example, let us assume that a term A is used by two authors, author 1 and author 2. If author 1 and author 2 are in the same community, the metrics would be calculated as the average of the metrics of that community, restricted to the authors that used term A. But if author 1 and author 2 were in different communities, the average metric values of each community would be calculated and these results would then be summed (a sketch of this aggregation is given after the metric definitions below). The details of the metrics are described below.
Giant composition: number of nodes in the giant component.
Shortest path to the most central node: the shortest path to the most central node.
Degree centrality: average degree centrality of the nodes inside the community.
Eigenvector centrality: average eigenvector centrality of the nodes inside the community.
Page rank centrality: average page rank centrality of the nodes inside the community.
Betweenness centrality: average betweenness centrality of the nodes inside the community.
Closeness centrality: average closeness centrality of the nodes inside the community.
Clustering coefficient: average clustering coefficient of the nodes inside the community.
Structural equivalence with the most central node: average structural equivalence of the nodes inside the community.
Community average centrality: average centrality of all community nodes.
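The community-based aggregation described above can be sketched as follows, assuming the node metrics computed in the earlier sketch and a hypothetical mapping from each term to the authors who used it; community detection uses networkx's greedy modularity implementation of the Clauset et al. algorithm [5]. This is an illustration of the scheme, not the authors' code.

```python
from collections import defaultdict
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def term_features(G, node_metrics, term_authors):
    """Aggregate node metrics per term: average inside each community that
    uses the term, then sum the per-community averages."""
    # Communities found with the Clauset-Newman-Moore greedy modularity method [5].
    communities = list(greedy_modularity_communities(G))
    community_of = {n: i for i, c in enumerate(communities) for n in c}
    metric_names = next(iter(node_metrics.values())).keys()

    features = {}
    for term, authors in term_authors.items():
        by_community = defaultdict(list)
        for author in authors:
            if author in community_of:
                by_community[community_of[author]].append(author)
        totals = defaultdict(float)
        for members in by_community.values():
            for name in metric_names:
                avg = sum(node_metrics[m][name] for m in members) / len(members)
                totals[name] += avg        # sum of the per-community averages
        features[term] = dict(totals)
    return features
```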
Finally, the social network metrics were put together with the time series prediction, forming the independent variables of the dataset.
For evaluation and model comparison we used the Relative Absolute Error (RAE):

RAE = \frac{\sum_{i=1}^{n} |f_i - y_i|}{\sum_{i=1}^{n} |y_i - \bar{y}|}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,

where $f_i$ is the predicted value and $y_i$ is the observed value.
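A direct implementation of this measure is straightforward; the helper below is a small illustrative version.

```python
def relative_absolute_error(predicted, actual):
    """RAE = sum(|f_i - y_i|) / sum(|y_i - mean(y)|)."""
    mean_y = sum(actual) / len(actual)
    numerator = sum(abs(f - y) for f, y in zip(predicted, actual))
    denominator = sum(abs(y - mean_y) for y in actual)
    return numerator / denominator

# An RAE above 1.0 (100%) means the model does worse than always
# predicting the mean of the observed values.
print(relative_absolute_error([2.0, 3.5, 4.0], [2.5, 3.0, 5.0]))
```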

4. EXPERIMENTS

The experiments aim to measure and compare the improvement of the proposed model against the time series model.

Figure 5: Examples of regression curves


First of all, we ran experiments for the time series data and, after that, we experimented with the proposed model. Furthermore, as a specific goal, we wanted to know the importance of the period of analysis on the results. Therefore, we used three different periods: 1991-2011, 2002-2011 and 2007-2011.

4.1 Time Series Trend Analysis

Given one dependent variable and a set of independent variables, a regression analysis can be formulated as

Y \approx f(X, \beta),

where the dependent variable $Y$ is approximated by a function $f$ of the independent variables $X$ and the respective parameters $\beta$.
Before adding the social network variables to the trend analysis, we performed a time series regression analysis with the TF-IDF importance index (a widespread index for measuring term importance) as the dependent variable, and time (year) as the independent variable. For each term extracted by the method described in Section 3.2, the TF-IDF was calculated for each year in each of the three experimentation periods. Then, parametric and non parametric methods were applied.
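As an illustration, one possible way to build such yearly TF-IDF series is sketched below, treating each year's set of titles as one document; the exact weighting variant used by the authors is not specified beyond "TF-IDF", so this particular formulation is an assumption.

```python
import math

def yearly_tfidf(titles_by_year, term):
    """Treat each year as a document: tf = term count in that year,
    idf = log(number of years / number of years containing the term)."""
    years = sorted(titles_by_year)
    counts = {y: sum(t.lower().count(term) for t in titles_by_year[y]) for y in years}
    years_with_term = sum(1 for y in years if counts[y] > 0)
    idf = math.log(len(years) / years_with_term) if years_with_term else 0.0
    return {y: counts[y] * idf for y in years}

# Hypothetical toy data: a few titles per year.
titles_by_year = {
    2009: ["Social Network Analysis For Digital Media"],
    2010: ["Routing Problem Heuristics", "Supply Chain Optimization"],
    2011: ["Social Network Metrics For Trend Prediction"],
}
print(yearly_tfidf(titles_by_year, "social network"))
```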
For the parametric tests we did not use a single kind of regression for all time series. Instead, we worked with the regression (linear or nonlinear) that best fitted each series, evaluated with Ordinary Least Squares. The kinds of regression used were linear, exponential, logarithmic, power law, and polynomial of degree two to five. The regression curves for some terms are shown in Figure 5.
For the non parametric tests, we used Artificial Neural Network (ANN), Support Vector Machine (SVM) and Rotation Forest, trying to approximate the function that best explains the historical series distribution.
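A sketch of this per-term model selection is given below: each functional form is fitted by least squares with numpy and the one with the smallest residual sum of squares is kept (the goodness criterion here is an illustrative choice, not necessarily the authors'); the chosen curve can then be extrapolated to the prediction year.

```python
import numpy as np

def best_parametric_fit(years, tfidf):
    """Fit several functional forms by least squares and keep the best one."""
    x = np.asarray(years, dtype=float)
    y = np.asarray(tfidf, dtype=float)
    t = x - x.min() + 1.0          # shift so log/power-law forms are defined
    candidates = {}
    for degree in (1, 2, 3, 4, 5):  # degree 1 is the linear fit; 2-5 are polynomials
        coeffs = np.polyfit(t, y, degree)
        candidates[f"poly{degree}"] = np.polyval(coeffs, t)
    candidates["logarithmic"] = np.polyval(np.polyfit(np.log(t), y, 1), np.log(t))
    if np.all(y > 0):
        # exponential and power law fitted in log space
        a, b = np.polyfit(t, np.log(y), 1)
        candidates["exponential"] = np.exp(b) * np.exp(a * t)
        a, b = np.polyfit(np.log(t), np.log(y), 1)
        candidates["power_law"] = np.exp(b) * t ** a
    name, fitted = min(candidates.items(), key=lambda kv: np.sum((kv[1] - y) ** 2))
    return name, fitted

years = list(range(2002, 2012))
series = [0.5, 0.6, 0.9, 1.4, 1.9, 2.7, 3.4, 4.1, 5.2, 6.3]
print(best_parametric_fit(years, series)[0])
```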
Table 2 shows the best RAE results for the three periods of the time series trend analysis. For the parametric methods, the best result was obtained for the longest period, while for the non parametric methods the shortest period was the best. We can see that the non parametric methods produced more accurate results than the parametric ones.
Table 2: Parametric and non parametric regression RAE results for the three periods

Period      Parametric   Non Parametric
1991-2011   113.16%      51.52%
2002-2011   136.42%      51.01%
2007-2011   288.14%      50.31%

4.2 Proposed Model: Adding Social Network Features to Trend Analysis

In the previous subsection, we modeled the problem in such a way that the TF-IDF index of each term depends only on the year. In the proposed model, with the addition of the social network information, the problem is modeled in the following way:

TFIDF(Term) = SNM(Term) + RR(Term),

where SNM is the set of social network metrics built as described in Section 3.3.2 and RR is the best regression result of the term for the prediction year.
For these experiments we used four prediction techniques: Linear Regression, Artificial Neural Network (ANN), Support Vector Machine (SVM) and Rotation Forest. We varied the parameters of each technique, generating 16 tests for ANN, 9 tests for SVM and 15 tests for Rotation Forest. Furthermore, we varied the attribute selection strategies: we generated datasets with all attributes and datasets with attributes selected by Relief and by manual selection, which is a good selection method when the analyst has knowledge about the dataset. The exception was the linear regression technique, which was executed for all possible attribute sets, and the set with the lowest RAE was chosen.
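The sketch below illustrates how such a combined dataset could be assembled and evaluated. It uses synthetic data with a hypothetical layout and, since Rotation Forest is not available in scikit-learn, a random forest is used only as a stand-in ensemble regressor; this is not the authors' experimental setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one row per term, columns = community-aggregated social
# network metrics (Section 3.3.2) plus RR, the best time series regression
# result for the prediction year; target = real TF-IDF in the prediction year.
rng = np.random.default_rng(0)
n_terms, n_metrics = 200, 10
X_snm = rng.random((n_terms, n_metrics))
rr = rng.random(n_terms) * 300
y = 0.7 * rr + 50 * X_snm[:, 0] + rng.normal(0, 10, n_terms)   # synthetic target
X = np.column_stack([X_snm, rr])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rotation Forest is not in scikit-learn; RandomForestRegressor is a stand-in.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

rae = np.sum(np.abs(pred - y_test)) / np.sum(np.abs(y_test - y_test.mean()))
print(f"RAE: {rae:.2%}")
```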
Table 3 presents the best results (RAE) for each technique
according to the periods and selection methods.
From the techniques' point of view, the best performances, as shown in Table 3, were achieved by Rotation Forest. Rotation Forest achieved the best performances for the short periods, while SVM did better for long periods, outperforming Rotation Forest in the 1991-2011 period.
With respect to the periods, the best result was obtained in the 2007-2011 period (39.28%). However, in general, the best results were obtained in the 2002-2011 period. The average RAE values for the best techniques are: 43.77% for 2002-2011; 51.57% for 2007-2011; and 69.68% for 1991-2011. There is an important difference between the models at this point. While the parametric model had better results for long periods (Table 2), the non-parametric and the proposed models had better results for shorter periods. For the proposed model, this can be explained by the dynamism of the network.
Table 3: Best results of each prediction technique

Technique     Period      All attributes   Manual selection   Relief
Linear Reg.   1991-2011   72.18%           -                  -
ANN           1991-2011   72.95%           73.43%             73.44%
SVM           1991-2011   65.21%           62.21%             64.15%
Rot. Forest   1991-2011   71.51%           70.57%             71.18%
Linear Reg.   2002-2011   53.22%           -                  -
ANN           2002-2011   45.75%           46.25%             46.23%
SVM           2002-2011   43.45%           41.05%             41.30%
Rot. Forest   2002-2011   40.29%           40.04%             40.12%
Linear Reg.   2007-2011   62.76%           -                  -
ANN           2007-2011   57.77%           59.02%             57.98%
SVM           2007-2011   53.31%           52.37%             51.22%
Rot. Forest   2007-2011   41.68%           39.28%             40.33%

The metrics, built in a static way from a network that spans a long period, can present misleading values, i.e., some metrics may indicate characteristics of the network that were true in the past but no longer held at the end of the period.
Comparing the best results of the proposed model with the parametric model, we obtained error reductions of 42%, 70% and 85% for the 1991-2011, 2002-2011 and 2007-2011 periods, respectively. Comparing the best results with the non-parametric model, we had an error increase of 20% for 1991-2011 and error reductions of 20% and 25% for 2002-2011 and 2007-2011, respectively.
Table 4 compares the results of 15 terms for both models. For reference, these terms were selected based on the main tendencies calculated by the pure time series trend analysis (parametric model). In this table, the real TF-IDF of each term is compared with the values predicted by the parametric and non parametric time series prediction models and by the proposed model. The prediction technique used for the proposed model was Rotation Forest for the period 2007-2011 (the best prediction result, as shown in Table 3).
The accuracy gain displayed in Table 4 is a sample of the trend analysis improvement obtained by using social network features. The experimental results show that the error produced by the proposed model corresponds, on average, to only 17% and 18% of the error produced by the parametric and non parametric models, respectively, which do not use social network features.

5. DISCUSSION

As discussed before, time series and content based analyses have been widely used to predict trends. These approaches consider that all information is equally generated, differing only in the time dimension. However, the content generated by people, primarily on the internet, is clearly influenced by the connections of its generators with other people. Intending to fill this gap, we presented a new concept of trend analysis that adds the social network factor to content based trend analysis models. The proposed model achieved better results than the time series based model.
Table 4: Models results comparison for the 15 first trends of the time series prediction model in 2012

Term                  Real     Parametric   Error    Non parametric   Error    Proposed   Error
service discovery     135.17   441.52       306.35   58.43            76.74    123.39     11.77
based approach        155.19   424.16       268.97   249.40           94.21    161.10     5.91
information systems   147.32   334.29       186.97   182.08           34.76    148.37     1.05
supply chain          174.31   298.37       124.06   145.71           28.60    143.96     30.35
web services          225.28   297.74       72.46    190.14           35.14    201.05     24.23
product line          174.99   291.57       116.57   481.68           306.69   154.73     20.26
motion estimation     107.78   274.36       166.58   174.73           66.95    99.00      8.78
social network        249.05   269.42       20.38    327.70           78.65    198.94     50.11
business process      131.75   240.09       108.34   264.25           132.50   119.61     12.14
time series           150.79   217.76       66.97    196.08           45.29    147.03     3.76
neural network        213.36   178.86       34.51    565.81           352.45   198.85     14.51
sign language         108.21   176.83       68.62    76.97            31.24    101.69     6.52
são paulo             191.93   172.84       19.09    71.51            120.42   145.79     46.15
genetic programming   128.25   156.64       28.39    104.18           24.07    107.98     20.26
routing problem       101.11   147.16       46.05    195.61           94.50    83.75      17.36

In addition to simple prediction techniques such as linear regression, we applied more robust techniques that resulted in even more accurate models. As we hypothesized, these findings cast light on the issue of trend prediction: information content and characteristics of the social structure of the information sources can be combined to better explain the temporal behavior of the information.
This work explored a concept still little studied by other researchers and, thus, there are some shortcomings to be worked on. The dynamism of the social network is one of them. We worked with a single time window for the social network modeling; however, slicing the time interval will probably improve the prediction models by better representing, in the social structures, the dynamism of the network through time. Another improvement can be made from the application point of view: grouping the extracted terms into topics may be more relevant for academic scholars than analyzing each term alone. In conclusion, we found that looking at the social structure of the data sources from a supporting perspective can help to explain the temporal behavior of the information.

Acknowledgments
This work was partially funded by FAPESP, CAPES and
CNPq.

6. REFERENCES

[1] H. Abe and S. Tsumoto. Evaluating a method to detect temporal trends of phrases in research documents. In 2009 8th IEEE International Conference on Cognitive Informatics, pages 378-383. IEEE, 2009.
[2] Y. Altshuler, W. Pan, and A. Pentland. Trends prediction using social diffusion models. In Social Computing, Behavioral-Cultural Modeling and Prediction, pages 97-104, 2012.
[3] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st International Conference on World Wide Web, pages 519-528. ACM, 2012.
[4] O. Cimenler, K. A. Reeves, and J. Skvoretz. A regression analysis of researchers' social network metrics on their citation performance in a college of engineering. Journal of Informetrics, 8(3):667-682, July 2014.
[5] A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.
[6] L. Digiampietri, J. Mena-Chalco, J. de Jesús Pérez-Alcázar, E. F. Tuesta, K. Delgado, and R. Mugnaini. Minerando e caracterizando dados de currículos Lattes. In CSBC 2012 - BraSNAM, July 2012.
[7] L. A. Digiampietri, C. M. Alves, C. C. Trucolo, and R. A. Oliveira. Análise da rede dos doutores que atuam em computação no Brasil. In CSBC 2014 - BraSNAM, pages 33-44, 2014.
[8] N. Kawamae. Theme chronicle model: Chronicle consists of timestamp and topical words over each theme. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management - CIKM '12, page 2065, New York, NY, USA, 2012. ACM Press.
[9] A. Kontostathis, L. Galitsky, and W. Pottenger. A survey of emerging trend detection in textual data mining. In Survey of Text Mining, pages 1-44, 2004.
[10] H. Moed, W. Glänzel, and U. Schmoch. Handbook of Quantitative Science and Technology Research. 2004.
[11] H. Nakagawa and T. Mori. A simple but powerful automatic term extraction method. In COLING-02 on COMPUTERM 2002: Second International Workshop on Computational Terminology - Volume 14, COMPUTERM '02, pages 1-7, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[12] W. Pan, N. Aharony, and A. Pentland. Composite social network for predicting mobile apps installation. In AAAI, 2011.
[13] H. Park, E. Kim, K.-J. Bae, H. Hahn, T.-E. Sung, and H.-C. Kwon. Detection and analysis of trend topics for global scientific literature using feature selection based on Gini-index. In 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence, pages 965-969. IEEE, 2011.
[14] S. Singh, N. Mishra, and S. Sharma. Survey of various techniques for determining influential users in social networks. In Emerging Trends in Computing, Communication and Nanotechnology (ICE-CCN), 2013 International Conference on, pages 398-403, March 2013.
[15] D. H. Sonnenwald. Scientific collaboration. Annual Review of Information Science and Technology, 41(1):643-681, 2007.
[16] L. A. Teixeira and A. L. I. de Oliveira. Predicting stock trends through technical analysis and nearest neighbor classification. In 2009 IEEE International Conference on Systems, Man and Cybernetics, pages 3094-3099. IEEE, 2009.
[17] C. C. Trucolo and L. A. Digiampietri. Trend analysis of the Brazilian scientific production in Computer Science. FSMA, 14:2-9, 2014.
[18] C. C. Trucolo and L. A. Digiampietri. Uma revisão sistemática acerca das técnicas de identificação e análise de tendências. In X Simpósio Brasileiro de Sistemas de Informação (SBSI 2014), pages 639-650, Londrina, 2014.
