ABSTRACT
In recent years, large volumes of data have been extensively studied by organizations trying to extract and use information produced by people on the internet. In this context, trend analysis is one of the most important areas explored by researchers. Typically, good prediction results are hard to obtain because of the complexity of the problem. This paper goes beyond simple trend identification methods by including the structure of the information sources, i.e., social network metrics, as an additional dimension to model and predict trends through time. The results show that the inclusion of such metrics improved the accuracy of the prediction. Our experiments used the publication titles of all Brazilian PhDs in Computer Science for the periods analyzed in order to evaluate the developed trend prediction approach.
CCS Concepts
• Computing methodologies → Model development and analysis; Machine learning approaches;
Keywords
social network; trend analysis; data mining
1. INTRODUCTION
The problem of finding significant trends in this big volume of data is challenging. The challenge arises from the dynamicity of the data combined with other factors, such as the influence of the entities that produce the data. In order to model and predict trends, many studies have used time series and content-based approaches.
Trends can also be analyzed from the social structure of the data sources. The social structure plays an important role in the dynamics of information spreading. Although there have been several works studying trend identification from time series and content-based data, and others focusing on social network techniques, the literature still lacks research that combines these two concepts. This work starts from the premise that combining social network and time series data can produce better results in trend modeling and prediction. Thus, this work aims to improve trend prediction capability by attaching social network metrics to time series and content-based trend analysis models.
In this paper, we propose an approach that constructs a prediction model combining time series with social network metrics. The approach was tested and validated using data from the publication titles of Brazilian PhDs in Computer Science, and then compared to a time series regression model.
This paper is organized as follows. Section 2 describes some basic concepts and related work. Section 3 details the methodology used. The results are described in Section 4. Finally, conclusions are presented in Section 5.
2. RELATED WORK

2.1 Time Series and Content Based Trend Analysis
In these applications, the step of mining the data and building the dependent variables becomes very important. Thus, many techniques have been developed to extract topics from text in order to automate the whole trend analysis process [9]. Furthermore, besides just identifying topics with potential for popularity, some works have gone deeper, trying to find out the behavior of these topics. In recent years, some studies [8, 13] have built models to understand the stability of topics or to classify topics as trends depending on the behavior of subtopics.
2.2

3. METHODOLOGY
In this work, we have assessed the curricula of 5,642 Brazilian PhDs in Computer Science who published scientific papers between 1991 and 2012. For each curriculum, we gathered the titles of all published full papers. Then, we automatically extracted terms from these titles and performed trend analysis for each term. Figure 1 illustrates the schematic data flow used in this work. Each method is described below.

3.1 Data Gathering
The curricula were gathered from the Lattes platform (http://lattes.cnpq.br/).

3.2 Term Extraction

3.3

3.3.1
The metrics were selected to become part of the independent variables set. The selection was based on assumptions about the capability of each metric to explain the information spreading. For example, one of the assumptions is that a node included in the giant component of a network is more capable of disseminating information through the network than a node which is not included. The selected metrics are: giant composition, shortest path to the most central node, degree centrality, eigenvector centrality, PageRank centrality, betweenness centrality, closeness centrality, clustering coefficient, structural equivalence with the most central node, and community average centrality. The centrality metrics explain how important a node is in the network, the shortest path metric indicates how far the node is from the central node, while the structural equivalence shows how similar the node being analyzed is to the most central node. The most important/central node was used as the reference. To justify the use of the most central node as the reference, Table 1 shows the difference in degree and eigenvector centrality between the most central node and the other top ten most important nodes in the network.
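As an illustration, the node-level versions of most of these metrics can be computed with networkx. This is only a sketch: the graph below is a stand-in (not the paper's actual co-authorship network), and the structural equivalence metric is omitted for brevity.

```python
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the co-authorship network

# Most central node (by eigenvector centrality), used as the reference node.
eig = nx.eigenvector_centrality(G, max_iter=1000)
ref = max(eig, key=eig.get)

giant = max(nx.connected_components(G), key=len)  # giant component

deg = nx.degree_centrality(G)
pr = nx.pagerank(G)
btw = nx.betweenness_centrality(G)
clo = nx.closeness_centrality(G)
clu = nx.clustering(G)

metrics = {}
for v in G.nodes:
    metrics[v] = {
        "in_giant": v in giant,  # giant-component membership
        "dist_to_ref": nx.shortest_path_length(G, v, ref) if v in giant else None,
        "degree": deg[v],
        "eigenvector": eig[v],
        "pagerank": pr[v],
        "betweenness": btw[v],
        "closeness": clo[v],
        "clustering": clu[v],
    }

print(metrics[ref]["dist_to_ref"])  # 0: the reference node itself
```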
Table 1: Eigenvector centrality and degree of the top ten central nodes

Top important nodes   Eigenvector Centrality   Degree
 1                    1.000                    67
 2                    0.986                    45
 3                    0.944                    45
 4                    0.845                    31
 5                    0.825                    35
 6                    0.799                    29
 7                    0.798                    37
 8                    0.763                    24
 9                    0.745                    30
10                    0.744                    27
3.3.2 The Metrics
After the selection of metrics, all related metrics were calculated for each node. Then, for each of the 1,638 extracted terms, the metrics of the nodes were summed. However, instead of building the variables from the individually summed node metrics, the nodes were grouped into communities (we used the algorithm proposed by Clauset et al. [5] to identify the communities). We assumed that using community metrics instead of individual node metrics for most of the variables would better explain the information diffusion. The individual information would result in misleading social metrics for terms that are widespread in a single community but not in the network as a whole: if many nodes of the same community use the same term, the metrics could become skewed. This intra-community balance incorporates the fact that information inside a community propagates quickly and tends to become general knowledge among the nodes belonging to it. Therefore, the final values of the variables were based on the community values. For example, let us assume that a term A is used by two authors: author 1 and author 2. If author 1 and author 2 are in the same community, the metrics would be calculated as the average, within that community, of the metrics of the authors that used term A. But if author 1 and author 2 were in different communities, the average metric values of each community would be calculated and these results would then be summed. The details of the metrics are described below.
Giant composition: number of nodes in the giant component.
Shortest path to the most central node: length of the shortest path to the most central node.
Degree centrality: average degree centrality of the nodes inside the community.
Eigenvector centrality: average eigenvector centrality of the nodes inside the community.
PageRank centrality: average PageRank centrality of the nodes inside the community.
Betweenness centrality: average betweenness centrality of the nodes inside the community.
Closeness centrality: average closeness centrality of the nodes inside the community.
Clustering coefficient: average clustering coefficient of the nodes inside the community.
Structural equivalence with the most central node: average structural equivalence of the nodes inside the community.
Community average centrality: average centrality of all community nodes.
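The community-averaged variable construction can be sketched as follows. This is an illustrative sketch, not the paper's exact pipeline: the graph and the term-to-author mapping are toy stand-ins, and networkx's `greedy_modularity_communities` is used as an implementation of the Clauset et al. community detection algorithm.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # stand-in for the co-authorship network

# Detect communities (Clauset-Newman-Moore greedy modularity).
communities = list(greedy_modularity_communities(G))
node2comm = {v: i for i, c in enumerate(communities) for v in c}

deg = nx.degree_centrality(G)  # one example node-level metric

def term_variable(term_authors, node_metric):
    """Average the metric over the term's authors within each community,
    then sum the per-community averages (as in the author 1 / author 2
    example above)."""
    by_comm = {}
    for a in term_authors:
        by_comm.setdefault(node2comm[a], []).append(a)
    return sum(sum(node_metric[a] for a in members) / len(members)
               for members in by_comm.values())

# Authors 0 and 1 are illustrative users of some term.
print(term_variable([0, 1], deg))
```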
Finally, the social network metrics were combined with the time series predictions to form the independent variables of the dataset.
For evaluation and model comparison we used the Relative Absolute Error (RAE). The equation for RAE is

    RAE = \frac{\sum_{i=1}^{n} |f_i - y_i|}{\sum_{i=1}^{n} |y_i - \bar{y}|}

where f_i is the predicted value, y_i is the observed value, and

    \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.

4. EXPERIMENTS
The experiments aim to measure and compare the improvement of the proposed model against the time series model.
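The RAE used for this comparison can be computed directly; a minimal sketch, with `f` holding the predictions and `y` the observations:

```python
def rae(f, y):
    """Relative Absolute Error: sum of absolute prediction errors divided
    by the sum of absolute deviations of the observations from their mean."""
    ybar = sum(y) / len(y)
    return (sum(abs(fi - yi) for fi, yi in zip(f, y))
            / sum(abs(yi - ybar) for yi in y))

print(rae([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # ≈ 0.3
```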
4.1

Given a dependent variable and a set of independent variables, a regression analysis can be formulated as

    Y \approx f(X, \beta)

where the dependent variable Y is approximated by the independent variables X and the respective parameters \beta of a function f.
Before adding the social network variables to the trend analysis, we performed time series regression analysis with the importance index TF-IDF (a widespread index for measuring term importance) as the dependent variable, and time (year) as the independent variable. For each term extracted by the method described in Section 3.2, TF-IDF was calculated for each year in each of the three experimentation periods. Then, parametric and non-parametric methods were applied.
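A per-year TF-IDF score can be sketched as below, treating each year's concatenated titles as one document. The paper does not specify its exact weighting scheme, so this uses the standard tf × idf formulation, and the corpus shown is toy data, not the Lattes collection.

```python
import math
from collections import Counter

titles_by_year = {  # toy data, not the actual corpus
    2010: ["neural network model", "support vector machine"],
    2011: ["neural network applications", "data mining"],
    2012: ["data mining trends", "data analysis"],
}

# One "document" (token list) per year.
docs = {yr: " ".join(ts).split() for yr, ts in titles_by_year.items()}
n_docs = len(docs)

def tfidf(term, year):
    tokens = docs[year]
    tf = Counter(tokens)[term] / len(tokens)          # term frequency in that year
    df = sum(1 for t in docs.values() if term in t)   # years containing the term
    idf = math.log(n_docs / df) if df else 0.0        # inverse document frequency
    return tf * idf

print(tfidf("data", 2012))
```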
For the parametric tests we did not restrict ourselves to a single kind of regression; instead, we worked with the regression (linear or nonlinear) that best fitted each of the series, evaluated by Ordinary Least Squares. The kinds of regression used were linear, exponential, logarithmic, power law, and polynomial with degrees two to five. The regression curves for some terms are shown in Figure 5.
For the non-parametric tests, we used an Artificial Neural Network (ANN), a Support Vector Machine (SVM) and Rotation Forest, trying to approximate the function that best explains the distribution of the historical series.
Table 2 shows the best RAE results for the three periods.
4.2

[Table 2: technique "Relief"; recovered RAE values: 73.44%, 64.15%, 71.18%, 46.23%, 41.30%, 40.12%, 57.98%, 51.22%, 40.33%]
5. DISCUSSION
robust techniques that resulted in even more accurate models. As we supposed, these findings cast light on the issue of trend prediction: the information content and the characteristics of the social structure of the information sources can be combined to better explain the temporal behavior of information.
This work explored a concept still little studied by other researchers and, thus, there are some shortcomings to be worked on. The dynamism of the social network is one of them. We worked with a single time window for the social network modeling; however, slicing the time interval will probably improve the prediction models by better representing, in the social structures, the dynamism of the network through time. Another improvement can be made from the application point of view. Grouping the extracted terms into topics may be more relevant for academic scholars than analyzing each term alone. In conclusion, we found that looking at the social structure of data sources in a supporting perspective can help to explain the temporal behavior of information.
Acknowledgments
This work was partially funded by FAPESP, CAPES and
CNPq.
6. REFERENCES