Vous êtes sur la page 1sur 35

A Framework of Mining Trajectories from Untrustworthy Data

in Cyber-Physical System
LU-AN TANG, NEC Labs America
XIAO YU, QUANQUAN GU, and JIAWEI HAN, University of Illinois at Urbana-Champaign
GUOFEI JIANG, NEC Labs America
ALICE LEUNG, BBN Technology
THOMAS LA PORTA, Pennsylvania State University

A cyber-physical system (CPS) integrates physical (i.e., sensor) devices with cyber (i.e., informational) components to form a context-sensitive system that responds intelligently to dynamic changes in real-world
situations. The CPS has wide applications in scenarios such as environment monitoring, battlefield surveillance, and traffic control. One key research problem of CPS is called mining lines in the sand. With a large
number of sensors (sand) deployed in a designated area, the CPS is required to discover all trajectories (lines)
of passing intruders in real time. There are two crucial challenges that need to be addressed: (1) the collected
sensor data are not trustworthy, and (2) the intruders do not send out any identification information. The
system needs to distinguish multiple intruders and track their movements. This study proposes a method
called LiSM (Line-in-the-Sand Miner) to discover trajectories from untrustworthy sensor data. LiSM constructs a watching network from sensor data and computes the locations of intruder appearances based on
the link information of the network. The system retrieves a cone model from the historical trajectories to
track multiple intruders. Finally, the system validates the mining results and updates sensors reliability
scores in a feedback process. In addition, LoRM (Line-on-the-Road Miner) is proposed for trajectory discovery
on road networksmining lines on the roads. LoRM employs a filtering-and-refinement framework to reduce
the distance computational overhead on road networks and uses a shortest-path-measure to track intruders.
The proposed methods are evaluated with extensive experiments on big datasets. The experimental results
show that the proposed methods achieve higher accuracy and efficiency in trajectory mining tasks.
Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Cyber-physical system, sensor network, trajectory

Research was sponsored in part by the U.S. Army Research Lab under Cooperative Agreement No. W911NF09-2-0053 (NSCTA); the Army Research Office under Cooperative Agreement No. W911NF-13-1-0193; National Science Foundation IIS-1017362, IIS-1320617, and IIS-1354329; HDTRA1-10-1-0120; and MIAS, a
DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC. The views and conclusions
contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. government. The U.S.
government is authorized to reproduce and distribute reprints for government purposes notwithstanding
any copyright notation here on.
Authors addresses: L.-A. Tang and G. Jiang, NEC Labs America, 4 Independence Way, Suite 200, Princeton,
NJ 08540; emails: {ltang, gfj}@nec-labs.com; X. Yu, Q. Gu, and J. Han, University of Illinois at UrbanaChampaign, 201 N. Goodwin Avenue, Urbana, IL 61801; emails: {xiaoyu1, qgu3, hanj}@illinois.edu; A. Leung,
BBN Technology, 10 Moulton Street, Cambridge, MA 02138; email: aleung@bbn.com; T. La Porta, Pennsylvania State University, 342 Information Sciences and Tech. Building, University Park, PA 16802; email:
tlp@cse.psu.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
c 2015 ACM 1556-4681/2015/02-ART16 $15.00

DOI: http://dx.doi.org/10.1145/2700394

ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16

16:2

L.-A. Tang et al.

ACM Reference Format:


Lu-An Tang, Xiao Yu, Quanquan Gu, Jiawei Han, Guofei Jiang, Alice Leung, and Thomas La Porta. 2015.
A framework of mining trajectories from untrustworthy data in cyber-physical system. ACM Trans. Knowl.
Discov. Data 9, 3, Article 16 (February 2015), 35 pages.
DOI: http://dx.doi.org/10.1145/2700394

1. INTRODUCTION

A cyber-physical system (CPS) is an integration of sensor networks with informational devices [National Science Foundation 2008]. The CPS employs a large number
of low-cost, densely deployed sensors to watch over designated areas and automatically
discover passing intruders. Such a system has many promising applications in both
military and civilian fields, including missile defense [Hwang et al. 2004], battlefield
awareness [Hewish 2001; Tang et al. 2012a], traffic control [Lo et al. 2008; Zheng and
Zhou 2011], neighborhood watch [Li et al. 2011b], environment monitoring [Tolle et al.
2005], and wildlife tracking [Li et al. 2011a]. The key problem in the preceding applications is called mining lines in the sand [Arora et al. 2004]that is, discovering
trajectories of passing intruders from the collected sensor data.
Figure 1 shows the framework of a battlefield CPS: the sand (seismic, acoustic, and
magnetic sensors) is deployed in a designated area. It constantly collects signals of
vibration, sound, and magnetic force from the environment. When an intruder passes
by, these nsors detect a signal change and send out detection records. The system
analyzes the collected data and discovers intruder trajectories in real time. Such a
system helps military forces see through the fog of war and protect troops and bases
on the battlefield.
However, mining lines in the sand is considered one of the major challenges in CPS
research field, partly due to the following problems:
Untrustworthy data: Many deployment experiences have shown that untrustworthy (i.e., faulty) data is the most serious problem that impacts CPS performance
[Szewczyk et al. 2004; Tolle et al. 2005]. Untrustworthy data are generated due to
various reasons, including hardware failure, communication limits, environmental
influences, and so on. It is difficult to filter them out solely based on signal values,
because the values of faulty signals are similar to the correct ones.
Tracking intruders: There are usually multiple intruders in the monitoring area,
and the system is required to track all of them. Since the intruders do not send out
any identification information, the system has to distinguish them and track their
movements.
Massive data: A CPS usually contains hundreds and even thousands of sensors
[National Science Foundation 2008]. Each sensor generates a data record every few
minutes; such records form a big dataset. In several applications, actions must be
taken immediately to deal with the intruders. The system is required to discover
trajectories in real time.
In this study, we propose a framework called LiSM (Line-in-the-Sand Miner) to discover intruder trajectories from untrustworthy sensor data. LiSM first constructs a
watching network to model the relationship between sensors and data records. Then
LiSM detects the intruders appearances based on link information of the watching
network. To track multiple intruders, a cone model is proposed to generate intruder
trajectories. The system employs a validation process to filter out false positives and
updates sensors reliability scores. The technical contributions of this study are summarized as follows.
Constructing a watching network to model the relationship among sensors, records,
and intruders. Such a network helps detect the intruder appearances in every
time-stamp.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:3

Fig. 1. The framework of battlefield CPS.

Proposing a cone model to track multiple intruders. The cone model is constructed
from historical trajectories and indicates the possible regions of an intruders next
appearance. The system matches newly detected appearances with the cone model
to generate intruder trajectories.
Validating the candidate trajectories. The system filters out false positives and updates sensors reliability scores in a feedback process.
Extending the proposed framework to support trajectory discovery on the road
network.
Conducting extensive experiments to evaluate the effectiveness and efficiency of
proposed methods on big datasets. The experiment results show that our approach
yields higher precision and recall than existing methods.
This article substantially extends the version from that presented at the ACM
SIGKDD 2013 conference [Tang et al. 2013] in the following ways by (1) introducing the task of mining lines on the roads to model a new trajectory discovery problem
on the road network; (2) analyzing the problems when applying LiSM to new scenarios
and proposing a new method called LoRM (Line-on-the-Road Miner); (3) designing a
filtering-and-refinement framework to improve algorithm efficiency; (3) proposing the
shortest-path-measure to help track intruders on the road network; (4) carrying out the
time complexity analysis for proposed algorithms; (5) providing complete formal proofs
for properties and propositions; (6) covering the related studies in more details and including recent ones; (7) expanding our performance studies on road network datasets;
and (8) discussing the important issues for mining lines in the sand. The experimental
results show that LoRM only costs 15% to 20% the time of LiSM and achieves higher
accuracy in mining tasks.
The rest of the article is organized as follows. Section 2 introduces the background
knowledge and problem formulation. Section 3 proposes the techniques of intruder
detection. Section 4 introduces the intruder tracking methods. Section 5 introduces
LoRM for trajectory discovery on the road network. Section 6 evaluates the algorithms
performances. Section 7 discusses some important issues of the problem. Section 8
gives a survey of related work. Finally, Section 9 concludes the article.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:4

L.-A. Tang et al.

Fig. 2. Example: the detection records.

2. PROBLEM STATEMENT

Recent advances in sensor technology have produced many types of sensors for areamonitoring purposes. Such sensors can be roughly classified into two categories: (1) active sensors (e.g., infrared sensors and radar sensors), which radiate signal pulses and
detect objects by the echo bouncing off the intruders, and (2) passive sensors (e.g., acoustic sensors, seismic sensors, and magnetic sensors), which only receive signals from the
environment. Active sensors achieve higher accuracy but require significantly more
power to operate and drain batteries quickly. Furthermore, when active sensors radiate signal pulses, they are at high risk of being detected by the intruders. As a result,
the CPS is usually deployed with a large number of low-cost, energy-saving passive
sensors.
Passive sensors constantly collect signals of sound, vibration, and magnetic forces
from the environment. When an intruder passes by, the sensors detect it based on the
signal changes. However, due to hardware limitation, the sensors can only report the
possible area of an intruders appearance rather than a point location. In this study,
we model the reported area as a planar region bounded by a circle.
Definition 1 (Detection Record). Let si be a sensor and t j be a timestamp. The
detection record ri, j is a two-tuples, ri, j ={cen(ri, j ), rad(ri, j )}. cen(ri, j ) and rad(ri, j ) are
the center and radius of a round area indicating the possible position of intruders
appearance in t j .
Example 1. Figure 2 shows a list of detection records in time t1 . The solid triangle
node is the intruder o1 . The round nodes are nearby sensors. The solid round nodes
(red) are the responding sensors that send out detection records, such as s1 , s2 , and
s7 . The centers of the estimated regions are tagged as hollow triangles. Sensor s6 is a
nonresponding sensor that does not generate any detection record. It is tagged as a
shadowed round node (blue).
Example 1 reveals three major problems of passive sensors. First, even if the intruder
is detected by multiple sensors, each sensor reports an intruders appearance with a
margin of error. The detection records should be aggregated for a more accurate result.
Second, some false-positive records are generated, such as r2,1 and r5,1 . The system
must filter them out. Third, the sensor, s6 , should send out a detection record but fails
to do so. It is a false negative.
False-positive and false-negative records are caused by various reasons, such as
the wind blowing and animal movements. Sensor reliability is a critical factor that
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:5

impacts the quality of detection results. We introduce two measurements of the sensors
reliability, as defined next.
Definition 2 (Valid Detection). Let qk, j be the position of intruder ok in time t j . A record
ri, j is called a valid detection if there exists an intruder ok that dist(cen(ri, j ), qk, j )
rad(ri, j ).
Definition 3 (Robustness). Let s be a sensor. The robustness (s) is defined as the
proportion of valid detections in all the records generated by s.
Definition 4 (Sensitivity). Let s be a sensor. The sensitivity (s) is defined as the
probabilities that s sends out a valid detection record when an intruder passes through
ss watching area.
The robustness denotes the sensors detection precision, and the sensitivity denotes
the sensors recall. Knowledge of the sensors robustness and sensitivity is important for
filtering out false data. However, the two scores may change over time. In the beginning,
the sensors robustness and sensitivity are both high. As time elapses, sensors may be
damaged by the harsh environment or run out of battery power. Therefore, both scores
will drop, and they should be dynamically updated based on the detection results.
The intruder is an object entering the watching area. The system discovers the
intruders movement as an intruder trajectory, which is a sequence of intruder appearances in different timestamps.
Definition 5 (Intruder Trajectory). Let ok be an intruder and t j be a timestamp. The
intruder appearance pk, j = {xk, t j } indicates oks spatial position xk in t j . The intruder
trajectory is defined as Lk = { pk,1 , pk,2 , . . . , pk,n}.
Since users are only interested in trajectories that are long enough, they may set a
threshold on the trajectory size. In addition, the sensor data arrive continuously in
a data stream format. The system cannot output the results after scanning the whole
dataset. Users require intruder trajectories to be discovered in real time.
The main theme of this study is on data mining, and we assume that the sensors
have been already deployed and synchronized. The detection records are collected and
transmitted to a data center. There are many state-of-the-art works on sensor deployment and synchronization, gateway design, and message transmission [Sivrikaya and
Yener 2004; Cevher and Kaplan 2007]. Now the task boils down to finding out the
intruders trajectories from sensor data.
Problem Statement (Mining Lines in the Sand). Let S be the set of sensors and R
be the sensor data arriving by time, R ={R1 , R2 , . . . , R j , . . .}, where R j = {r1, j , r2, j , . . .,
rm, j }. The sensors locations are fixed, and their robustness and sensitivity scores are
initialized. Given a length threshold , the task of mining lines in the sand is to discover
the set of intruder trajectories L={L1 , L2 , . . . , Lk} in real time, where size(Lk) .
Note that the total number of intruders is not known in advance. LiSM is required
to discover trajectories of all intruders entering the watching area. Due to the larger
number of false-positive and false-negative records, if the intruder has a long trajectory
across monitoring region, it is very hard for the system to detect all intruder appearances and track them together as the original trajectory. Instead, the system should
report some subtrajectories, which can be composed to recover majority parts of the
original trajectory. The goal of the system is to provide trustworthy trajectories with
enough length to help the user understand the movement of intruders and make a
decision.
The system framework is illustrated in Figure 3. LiSM is composed with three
modules: the intruder appearance miner, the trajectory generator, and the trajectory
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:6

L.-A. Tang et al.

Fig. 3. The system framework of LiSM.

Fig. 4. List of notations.

validator. The appearance miner constructs a watching network from the sensor data
and detects the intruder appearances in each snapshot. The trajectory generator
composes the detected appearances to be intruder trajectories. A cone model is built
from the historical trajectory to predict the intruders next possible movement. The
detected intruder appearances are matched with the prediction, and the best-matched
one is added to the corresponding trajectory. The trajectory validator calculates the
trustworthiness of each candidate trajectories, selects the ones with high trustworthiness as mining results, and removes the low-trustworthy candidates. Finally, the
system updates sensor reliability scores based on the trajectory trustworthiness.
We will introduce the detailed techniques of LiSM in the following sections. Figure 4
lists the notations used throughout this article.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:7

Fig. 5. Example: the watching network.

3. THE WATCHING NETWORK

In Example 1, s1 , s3 , and s4 all detect the appearance of intruder o1 . However, another


nearby sensor, s6 , should detect the intruder but does not generate any record. Such
a nonresponding sensor disagrees with its responding neighbors. Therefore, the first
task of LiSM is to retrieve the hidden relationships of these sensors and intruders.
Definition 6 (Watching Sensors). Let S be the sensor set and ri, j be a detection
record. The watching sensor set S(ri, j ) is defined as S(ri, j ) = {s|s S, dist(s, cen(ri, j ))
range(s) + rad(ri, j )}, where dist(s, cen(ri, j )) is the distance between sensor s and the
center of ri, j , range(s) is the sensors maximum sensing range.
Based on the detection records, the watching sensor set is partitioned into two parts
of responding sensors and nonresponding sensors.
Definition 7 (Responding and Nonresponding Sensors). Let ri, j be a detection record,
and S(ri, j ) be the watching sensor set of ri, j . The responding sensor set Sr (ri, j ) is defined
as Sr (ri, j ) = {sk|sk S(ri, j ), rk, j that dist(cen(rk, j ), cen(ri, j )) rad(rk, j ) + rad(ri, j )}, the
nonresponding sensor set Sn(ri, j ) = S(ri, j ) Sr (ri, j ).
For a sensor s, if dist(s, cen(ri, j )) > range(s) + rad(ri, j ), then s cannot detect the
intruder. If s locates in the area that range(s) rad(ri, j ) dist(s, cen(ri, j )) range(s) +
rad(ri, j ), s will not generate any record if the position of intruder is out of range(s).
The watching sensor set includes all sensors within range(s) + rad(ri, j )}. Therefore, the
set of nonresponding sensors is actually a superset of the false-negative ones. We will
refine them later, after computing the intruder position.
With Definitions 6 and 7, we can construct a watching network. This network contains nodes representing sensors and records. Two types of links are constructed in the
network: positive links connect the records to responding sensors, and negative links
connect the records to nonresponding sensors.
Example 2. Figure 5 shows a watching network constructed from the records in
Example 1. For the sake of simplicity, we assume that all sensors have the same sensing
range in this example. The system draws a circle for each record ri, j . The circles center
is at cen(ri, j ), and the radius is range(s) rad(ri, j ). The watching sensors S(ri, j ) are
located inside this circle (e.g., Figure 5 shows a circle of r4,1 ). The system then connects
records with positive links (solid lines) to responding sensors and generates negative
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:8

L.-A. Tang et al.

links (dashed lines) between records and nonresponding sensors. Since sensor s6 does
not send any record, it has negative links to all related records. Note that even though
s2 is a watching sensor of r4,1 and s2 sends out a detection record r2,1 , the distance
between cen(r4,1 ) and cen(r2,1 ) is larger than rad(r4,1 ) + rad(r2,1 ), thus the link between
s2 and r4,1 is a negative link.
In the sensor data, many detection records are caused by the same intruderfor
example, r1,1 , r3,1 and r4,1 are caused by intruder o1 . Such records are called homologous
records.
Definition 8 (Homologous Record Set). Let qk, j be the position of intruder ok in time
t j and R j be the detection record set in t j . The homologous record set of qk, j is defined
as Hk, j = {r|ri, j R j , dist(cen(ri, j ), qk, j ) rad(ri, j )}.
If the intruders position, qk, j , is known in advance, the system can easily find the
homologous records. However, the intruders position is exactly required as the mining
result. The system has to approximate the homologous records based on the following
property.
PROPERTY 1. Let Hk, j be a homologous record set in t j , ri, j , rl, j Hk, j be two records,
and si , sl be the sensors that send out those records. Then, si is a responding sensor of
rl, j and sl is a responding sensor of ri, j .
PROOF. Let qk, j be the position of the corresponding intruder in Hk, j . According to
Definition 8, dist(cen(ri, j ), qk, j ) rad(ri, j ) and dist(cen(rl, j ), qk, j ) rad(rl, j ).
Based on triangle inequality, dist(cen(ri, j ),cen(rl, j )) dist(cen(ri, j ),qk, j ) +
dist(cen(rl, j ),qk, j ) rad(ri, j ) + rad(rl, j ).
By Definition 7, si is a responding sensor of rl, j and sl is a responding sensor of ri, j .
The homologous record sets can be approximated by scanning the watching network.
The system first picks a record as the seed to initialize a homologous record set. Then
the system repeats following steps: (1) randomly selecting a record in the set, retrieving
all responding sensors following the positive links; (2) checking each responding sensor,
and if it is also the responding sensor for other records in the set, adding this record to
the set. This iteration ends when all records in the homologous set have been processed.
In this way, we can guarantee that there is an intersected area among all member
records of the approximated homologous record set.
Once a homologous record set is generated, we estimate the position of an intruder
appearance with Equation (1), where i, j is a normalized weight based on the radius of
ri, j . The records with with lower uncertainty (i.e., smaller radius) have higher weights
in determining the position of intruder appearance. Note that we adopt a linear model
to compute i, j for general cases; the weight computation can be modified based on
specific signal decay models of the sensors:

pk, j =
i, j cen(ri, j )
ri, j Hk, j

rad(ri, j )
.
rl, j Hk, j rad(rl, j )

i, j = 1 

(1)

When pk, j is computed, we can refine the watching sensor sets and filter out the
sensors that locate outside the detection range.
Definition 9 (Refined Watching Sensors). Let S be the sensor set and pk, j be an
intruder appearance. The refined watching sensor set S( pk, j ) is defined as S( pk, j ) =
{s|s S, dist(s, pk, j ) < range(s)}.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:9

Fig. 6. Example: the watching network with intruder appearances.

Definition 10 (Refined Responding and Nonresponding Sensors). Let pk, j be an


intruder appearance and S( pk, j ) the watching sensor set of pk, j . The refined responding
sensor set Sr ( pk, j ) is defined as Sr ( pk, j ) = {si |si S( pk, j ), ri, j that dist( pk, j ,cen(ri, j ))
rad(ri, j )}, the refined nonresponding sensor set Sn( pk, j ) = S( pk, j )-Sr ( pk, j ).
The intruder appearances are added as new nodes to the watching network. Similarly,
the positive and negative links are connected between the refined sensors and the
appearances, as shown in Figure 6.
With the link information of the watching network, we can estimate the trustworthiness of each intruder appearance based on the sensors robustness and sensitivity.
For an appearance pk, j , let si Sr ( pk, j ) be a responding sensor and s j Sn( pk, j ) be a
nonresponding sensor. If pk, j is a real appearance, then si reports a valid detection and
s j is a false negative. The probability of pk, j being a valid detection is calculated as
Equation (2), where (si ) is the robustness of si and (s j ) is the sensitivity of s j :


(si )
(1 (s j )).
(2)
Pr ( pk, j )+ =
si Sr ( pk, j )

s j Sn( pk, j )

Similarly, the probability of pk, j being a false positive can be written as Equation
(3):


Pr ( pk, j ) =
(1 (si ))
(s j ).
(3)
si Sr ( pk, j )

s j Sn( pk, j )

The trustworthiness of intruder appearance, ( pk, j ), is then calculated as


Equation (4):
( pk, j ) = log
=

Pr ( pk, j )+

Pr ( pk, j )

(si )
+
log
1 (si )

si Sr ( pk, j )


s j Sn( pk, j )

log

1 (s j )
.
(s j )

(4)

The robustness (si ) and sensitivity (s j ) are initialized by user in the beginning.
The system will automatically update the two scores based on the results of trajectory
discovery. We will discuss the details of score updating in Section 4.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:10

L.-A. Tang et al.

Fig. 7. Algoirthm: the intruder appearance detection.

Figure 7 lists the algorithm to detect intruder appearances. The algorithm first scans
each detection record and retrieves the responding and nonresponding sensors (lines 1
through 4). Then the system initializes the homologous record Hk, j by randomly picking
a seed record from the watching network (lines 6 and 7). For each unvisited record ri, j
in Hk, j , the algorithm retrieves ri, j s responding sensors and checks its record rl, j . If rl, j
does not belong to any existing homologous record sets and the distance from rl, j to all
other records of Hk, j is less than the sum of the radius, rl, j is then added to Hk, j (lines 8
through 14). Once Hk, j is generated, the system calculates the intruder appearance pk, j
and adds it to the network (lines 15 through 18).
PROPOSITION 1. Let m be the size of record set R j and n be the size of sensor set S. The
time complexity of Algorithm 1 is O(m2 n).
PROOF. Algorithm 1 includes two steps: constructing the watching network (lines 1
through 4) and detecting intruder appearances (lines 5 through 19).
In the first step, the system needs to scan the sensor set and retrieve the corresponding watching sensors for each record in R j , and the total time cost is O(mn).
In the second step, the algorithm generates the homologous record sets by checking
the responding sensors of unvisited records. Let mh be the average size of homologous
record sets and nr be the average number of responding sensors for one record. The time
cost is O(m2hnr ). In the worst case, mh = m and nr = n. Algorithm 1s time complexity is
O(m2 n).
Note that mh and nr are actually much smaller than m and n in real cases. Then, the
algorithms time cost is close to O(mn).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:11

4. TRAJECTORY GENERATION

The watching network discovers the intruder appearances in each snapshot. It is an


effective tool for mining dots in the sand. However, a more critical task is connecting
the dots as lines. Since the intruders do not send out any identification information,
the system has to distinguish them automatically.
After mining the intruder appearances in the first snapshot, LiSM initializes a set
of candidate trajectories. Each candidate trajectory contains a discovered intruder
appearance. In the following snapshots, the system continues adding newly detected
intruder appearances to the candidate trajectories.
Let pi, j be an intruder appearance in time t j and Lk be a candidate trajectory. The
key problem is to calculate the likelihood pi, j belonging to Lk. This value is determined
by two factors: (1) ( pi, j ), the trustworthiness of pi, j , and (2) P( pi, j , Lk), the matching
probability based on the spatial locations of pi, j and Lk:
( pi, j Lk) = ( pi, j ) P( pi, j , Lk).

(5)

The trustworthiness of pi, j is already calculated by Algorithm 1. To compute the


the matching probability between pi, j and Lk, we propose a cone model. This model
stores the intruders recent moving history and predicts the intruders next move in a
cone area. The detected intruder appearances are projected onto the area to compute
matching probability.
Definition 11 (-recent Trajectory). Let Lk be the trajectory of intruder ok, t j be the
current timestamp, and be a positive number, size(Lk). The -recent trajectory
Lk is defined as a subset of Lk, Lk = { pk, j , pk, j+1 , . . . , pk, j1 }.
The -recent trajectory contains the -latest appearances of intruder ok before time
t j . It is a short history of the intruders movement. The system can calculate oks recent
moving speed and direction based on Lk . The average and deviation of the intruders
speed in period [t j , t j1 ] are calculated as Equations (6) and (7), where (t j1 t j ) is
the time length between the two timestamps:
j2


vk =

dist( pk,i , pk,i+1 )

i= j


 j2

(vk) = 
i= j

(t j1 t j )

(6)

dist( pk,i , pk,i+1 )2


vk2 .
(t j1 t j )(ti+1 ti )

(7)

The function direction( pk,i , pk,i+1 ) is applied to measure the angle between oks moving direction and the x-axis in time [ti , ti+1 ]. The mean and deviation of the moving
direction are computed as shown in Equations (8) and (9):
j2


k =

direction( pk,i , pk,i+1 )

i= j

(t j1 t j )


 j2
  direction( pk,i , pk,i+1 )2
k2 .
(k) = 
(t

t
)(t

t
)
j1
j
i+1
i
i= j

(8)

(9)

When intruders pass through the watching area, they are unlikely to change moving
speed and direction dramatically. We make the assumption that the values of intruder
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:12

L.-A. Tang et al.

Fig. 8. Example: the cone model.

speed and direction follow a normal distribution and build a cone area to predict oks
appearance in t j .
Example 3. Figure 8 shows the cone model for intruder ok. Suppose that is set to
5; the system retrieves oks latest five appearances as Lk and computes oks speed and
direction. If those parameters follow a normal distribution, the probability is 99.7%
that oks speed and direction of period [t j1 , t j ] are within three standard deviations of
the mean values. The system calculates the four boundary points as shown in Figure 8.
The area of oks next possible appearance is then generated as a partial cone with apex
in pk, j1 .
Let pi, j be an intruder appearance in the cone area and pk, j1 be the latest intruder
appearance of Lk . If intruder ok moves from pk, j1 to pi, j , then oks speed and direction
in [t j1 , t j ] are estimated as Equations (10) and (11):
dist( pk, j1 , pi, j )
,
(t j t j1 )

(10)

direction( pk, j1 , pi, j )


.
(t j t j1 )

(11)

vk, j =

k, j =

By comparing vk, j and k, j , the system can estimate the matching probability between
pi, j and Lk as Equation (12):


(vk, j vk)2
1
(12)
P( pi, j , Lk) =
exp
2 (vk)2
2 (vk)


(k, j k)2
1

.
exp
2 (k)2
2 (k)
Example 4. Suppose that there are three intruder appearances detected in t j , as
shown in Figure 8. p1, j and p2, j are located in the cone area, and p3, j is outside the area.
Their trustworthiness scores are ( p1, j ) = 0.1, ( p2, j ) = 0.8, and ( p3, j ) = 0.9. Even
though p3, j has the highest trustworthiness, it is impossible that this is an appearance
of Lk. By considering the matching probability and trustworthiness of the remaining
two appearances, the system selects p2, j as the intruders appearance in t j .
Note that we make the assumption that the values of intruder speed and direction
follow a normal distribution in this study. Based on our experiment results, this
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:13

Fig. 9. Algorithm: the trajectory tracking.

assumption works well. The cone model can be adopted to other distributions/models
of intruder movements.
If the trajectory Lk does not contain enough intruder appearances (i.e., size(Lk)
), the system constructs a cone model with default speed v0 and (v0 ). The default
parameters can be specified by the user or calculated as the mean of all other intruders
-recent trajectories. The system also releases the constraint on movement direction
(i.e., the intruder may move in any direction). The matching probability is then written
as Equation (13):


(vk, j vk)2
1
P( pi, j , Lk) =
exp
.
(13)
2 (vk)2
2 (vk)
Figure 9 shows the detailed steps of trajectory tracking. For each candidate trajectory
Lk, Algorithm 2 first checks the trajectory size. If the size is larger than , the system
retrieves -recent trajectory Lk and calculates the intruders speed and direction. If
the size of Lk is less than , the system uses the default parameters (lines 2 through 5).
Then, the algorithm constructs the cone model. For each intruder appearance inside
the cone area, the system calculates the matching probability. The one with the highest
probability is tagged as matched and added in Lk (lines 6 through 15). Finally, the
system initializes new candidate trajectories for the unmatched intruder appearances
(lines 16 through 18).
PROPOSITION 2. Let m be the size of record set R j and n be the number of candidate
trajectories. The time complexity of Algorithm 2 is O(mn).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:14

L.-A. Tang et al.

PROOF. Algorithm 2 contains two steps: (1) retrieving the -recent trajectory and (2)
computing the cone model.
The time complexity of step 1 is O(n).
For step 2, let mi be the total number of intruder appearances in time t j and mc be the
average number of intruder appearances in the cone of a trajectory. No matter whether
the -recent trajectory can be retrieved or not, for each trajectory, the system has to
check all intruder appearances in the cone area. The time cost is O(mc n).
In the worst case, for each trajectory, all intruder appearances are located in the cone
area, mc = mi . In addition, the maximum number of intruder appearances equals the
size of record set R j , mi = m. Hence, the time complexity of step 2 is O(mn).
Since m is magnitude larger than , the time complexity of Algorithm 2 is O(mn).
Note that mi is the number of intruder appearances detected in each snapshot. It is
usually much smaller than m. mc is the average number of intruder appearances in
the cone. mc is even smaller than mi . Algorithm 2s efficiency is indeed determined by
n, which is the number of candidate trajectories. The system may create many new
trajectories in each snapshot, and n will eventually be a large value as time elapses.
The system needs to control the number of candidate trajectories to improve efficiency.
On the other hand, Algorithm 2 initializes new trajectories based on unmatched
intruder appearances in every snapshot. However, the majority of them are ghost
trajectories. The ghost trajectories are generated by the false-positive appearances,
such as p2,1 , p3,1 in Figure 6. When the time elapses, real trajectories grow longer with
more subsequent appearances added in, but ghost trajectories are unlikely to get more
appearances. Hence, we can eventually prune them.
Definition 12 (Trajectory Expectation). Let Lk be a candidate trajectory and t j be the
current timestamp. The trajectory expectation E j (Lk) is an indicator on the expectation
that Lk be a qualified mining result in time t j . E j (Lk) is defined as Equation (14), where
t1 is the timestamp of the first intruder appearance in Lk, and is a decay constant:

( pk,i ) (t j t1 ).
(14)
E j (Lk) =
pk,i Lk

In the end of every snapshot, the system checks the expectation of each candidate
trajectory. If the expectation is less than zero, such a trajectory is unlikely to become a
qualified result and should be removed from main memory. Meanwhile, if a trajectorys
length is longer than the threshold , the system will report it to the user.
In many CPS applications, the sensors may be damaged by the environment or
run out of battery power as time elapses; the system should also update the sensors
reliability scores.
Let Lk be a candidate trajectory, Lk = { pk,1 , pk,2 , . . . , pk,n}. If Lk is removed from the
candidate set as a ghost trajectory, all intruder appearances of Lk will be tagged as
ghost appearances. Let pk, j be such a ghost appearance. For all responding sensors
si Sr ( pk, j ), si has reported a false positive, and its robustness should be reduced. (si )
is then updated as shown in Equation (15), where li is the number of false positives
reported by si , and ni is the total number of detection records generated by si :
(si ) = 1

li
.
ni

(15)

Meanwhile, if Lk is output as a qualified mining result, all intruder appearances of Lk


are considered to be true. Let pk, j be a true appearance; for the nonresponding sensor
s j Sn( pk, j ), s j has made a false-negative error. The sensitivity of si is then reduced as
shown in Equation (16), where fi is the number of false negatives by si , and mi is the
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:15

Fig. 10. Algorithm: mining lines in the sand.

total number of intruders passed through si s watching area. Let li be the number of
false positives by si and ni be the total number of detection records sent by si , mi = ni
li + fi :
(si ) = 1

fi
fi
=1
.
mi
ni li + fi

(16)

Algorithm 3 shows the detailed process of LiSM (Figure 10). When new data arrives,
the system first calls Algorithm 1 to construct the watching network and detect the
intruder appearances (lines 2 through 4), then tracks trajectories with the cone model
(line 5). After that, the system checks each candidates trustworthy expectation (line 6).
If the expectation is less than zero, such a trajectory is a ghost trajectory and should
be removed. The system retrieves the responding sensors for every appearance of the
ghost trajectory and reduces their robustness scores (lines 7 through 12). Meanwhile,
if the trajectorys size reaches the length threshold , it will be added to the result set.
The system retrieves all nonresponding sensors and reduces their sensitivity (lines 13
through 20).
PROPOSITION 3. Let k be the number of record sets in R, l be the average size of R j , m
be the size of the sensor set, and n be the average size of the candidate trajectory set. The
time complexity of Algorithm 3 is O(kl(lm + n)).
PROOF. Algorithm 3 has three steps to process the record set R j : the intruder detection (calling Algorithm 1), trajectory tracking (calling Algorithm 2), and the candidate
validation step (lines 6 through 20).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:16

L.-A. Tang et al.

According to Propositions 1 and 2, the time costs of Algorithm 1 and 2 are O(l2 m)
and O(ln).
In the validation step, the system needs to check the links for each intruder appearance. In the worst case, the system has to scan all links between the sensors and
intruder appearances. Let p be the average number of links between the sensors and
records in the watching network G j . The time complexity of this step is O( p). Since
p lm, the time cost of processing record set R j can be written as O(l(lm + n)), and
the total time complexity of Algorithm 3 is O(kl(lm + n)).
5. MINING LINES ON THE ROADS

In the previous sections, we have investigated the problem of mining lines in the sand.
LiSM is proposed to discover intruder trajectories in 2D Euclidean space. In many real
applications, the user requires detection of intruders moving on the roads. The problem
of mining lines on the roads poses some unique challenges. This section proposes LoRM
for trajectory mining on the road network.
Problem Statement (Mining Lines on the Roads). Let M be a road network, S be the
set of sensors installed on M, and R be the sensor data arriving by time, R = {R1 ,
R2 , . . . , R j , . . .}, where R j = {r1, j , r2, j , . . . , rm, j }. Given a length threshold , the task of
mining lines on the roads is to discover the set of intruder trajectories on M, L = {L1 ,
L2 , . . . , Lk}, where size(Lk) .
The general framework of LiSM can be used by LoRM. The system first scans record
sets, constructs the watching network, and detects intruder appearances. Then, LoRM
combines detected appearances as intruder trajectories and updates sensors reliability
scores. However, the detailed steps of both detection and tracking should be changed
according to several unique difficulties of the problem scenario:
Constraints of intruder detection: The intruders move on the roads, but the sensors detection records may be off-road. The system should match sensors detection
records to the road network for meaningful detections.
Network distance computation: On the road network, the time cost to track intruders
is much higher, because the system has to search the shortest paths for distance
computation. The algorithm efficiency becomes a major issue.
Unpredictable moving direction: The cone model is proposed to help intruder tracking. Such a model assumes that the intruders move on a 2D plane, and hence their
moving directions are predictable from the historical data. However, now the intruders are moving on the road network. Their moving directions are bounded along the
road segments, and the cone model is no longer feasible to predict the intruders
movements.
5.1. Intruder Detection on the Roads

The sensors are installed on the roads to collect sound or vibration signals from environments. When an intruder passes by, the sensors detect it based on the signal changes.
Due to hardware limitations, the sensors detections may be off-road; the system needs
to match the detection records onto the road network.
Definition 13 (On-Road Detection). Let M be a road network, si be a sensor, t j be a
timestamp, and detection record ri, j = {cen(ri, j ), rad(ri, j )}. On-road detection road(ri, j )
is defined as a spatial coordinate of M (road(ri, j ) M) satisfying the following:
(1) p M, dist(cen(ri, j ), road(ri, j )) dist(cen(ri, j ), p);
(2) dist(cen(ri, j ), road(ri, j )) rad(ri, j ).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:17

Fig. 11. Example: the on-road detections.

Intuitively, road(ri, j ) is the closest match of cen(ri, j ) to road network M. Since the
area of the intruders appearance is bounded by rad(ri, j ), the system only checks the
road segments located inside this area. If there is no road segment, detection record ri, j
must be a false positive. The system will directly remove ri, j and reduce the robustness
score of sensor si .
Example 5. Figure 11(a) shows a set of detection records in time t1 . The round nodes
are the monitoring sensors. The solid round nodes (red) are the responding sensors that
send out detection records, such as s1 , s3 , and s4 . The system matches the records to
the nearest roads to get on-road detections (hollow triangles). Since detection r2,1 and
r4,1 cannot be matched to any road segment, they are false positives (blue triangles).
Sensor s2 is a nonresponding sensor that does not generate any detection record. It is
tagged as a shadowed round node (blue).
After matching detection records to the road network, the system computes responding and nonresponding sensors for each road detection. The watching network is constructed as Figure 11(b). To estimate the position of intruder appearances, the system
generates homologous detection sets from the watching network. The intruder appearance, pk, j , is then calculated as a weighted average of on-road detections in the homologous set, as shown in Equation (17). The weight i, j is determined by the distance
between cen(ri, j ) and road(ri, j ). The records with lower uncertainty (smaller distance)
have higher weights:


pk, j = road
i, j road(ri, j )
ri, j Hk, j

i, j = 1

dist(cen(ri, j ) road(ri, j ))

.
dist(cen(rl, j ) road(rl, j ))

(17)

rl, j Hk, j

Note that it is possible that ri, j Hk, j i, j locates outside the road network M. In such
cases, the system matches the coordinate to the nearest road as the position for pk, j .
Figure 12 lists the algorithm to detect intruder appearances on the road network.
The algorithm first matches the detection records to nearby roads to get on-road detections. The false positives are removed (lines 1 through 6). Then, the system retrieves
responding and nonresponding sensors for each on-road detection (lines 7 through 9).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:18

L.-A. Tang et al.

Fig. 12. Algorithm: intruder detection on road network.

After that, the system generates homologous record sets, following similar steps as
Algorithm 1 (lines 11 through 19). Once a homologous set Hk, j is generated, the system
calculates the intruder appearance pk, j and adds it to the watching network (lines 20
through 23).
PROPOSITION 4. Let m be the size of record set R j , n be the size of sensor set S, and l
be the average number of road segments in the possible area of a detection record. The
time complexity of Algorithm 4 is O(m(n + l)).
PROOF. Algorithm 4 has two steps: generating on-road detections (lines 1 through 9)
and detecting intruder appearances (lines 10 through 25).
In the first step, the system needs to match the detection records to the road network
and scan the sensor set for responding sensors; the total time cost is O(m(l + mn)).
In the second step, the algorithm generates the homologous record sets by checking
the responding sensors of unvisited records. Let mh be the average size of homologous
record sets and nr be the average number of responding sensors for one record. The
time cost is O(m2hnr ). In the worst case, mh = m and nr = n. Then, Algorithm 4s time
complexity is O(m(l + mn)).
5.2. Tracking Trajectories on the Roads

After detecting intruder appearances, LoRM needs to generate trajectories. However,


the algorithm efficiency becomes a problem. The bottleneck is computing the road
network distance.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:19

Fig. 13. Example: the problem of tracking intruders with the cone model.

Definition 14 (Road Network Distance). Let pi and p j be two spatial coordinates of


road network M. The road network distance netd( pi , p j ) is the length of the shortest
path connecting pi and p j on M.
Suppose that the system maintains m candidate trajectories in memory and detects
l intruder appearances in a new snapshot. The road network contains n edges (i.e.,
road segments). The system has to compute the network distance between every pair
of candidate and appearance. There are totally lm pairs. In the worst case, the system
has to search all edges of the network to compute the shortest path. Hence, the total
time complexity of intruder tracking is O(lmn). Note that a road network typically
contains millions of edges; n is a very large number.
Another problem is about the tracking effectiveness, as illustrated in the following
example.
Example 6. In Figure 13, the system retrieves a recent trajectory, Lk = { pk,1 , pk,2 }, and
computes the cone model, conek. In time t3 , six new intruder appearances are detected
as p1,3 , p2,3 , . . . , p6,3 . However, none of them is located inside conek. The problem is
caused by the cone model. The cone model assumes that intruder oks moving direction
can be predicted from the historical data. However, ok moves along the road segments
now. The cone model is thus no longer accurate.
To solve the problems, we propose two techniques: (1) a filtering-and-refinement
framework to improve the tracking efficiency and (2) a shortest-path-measure for effectively tracking.
PROPERTY 2. In road network M, the Euclidean distance between two points is the
lower bound of the network distance: pi , p j M, dist( pi , p j ) netd( pi , p j ).
PROOF. In the Euclidean space, the shortest path between two points is a straight
line. Since the road network is also in the same Euclidean space, the Euclidean distance
is less than or equal to the road network distance.
The algorithms overhead can be significantly reduced based on Property 2. Let pk, j1
be the last intruder appearance in candidate trajectory Lk and pi, j be a newly detected
appearance in time t j . Before computing the network distance between pk, j1 and
pi, j , the system first calculates the Euclidean distance dist( pk, j1 , pi, j ). The Euclidean
distance computation only needs the spatial coordinates and involves no cost to access
the road network. If dist( pk, j1 , pi, j )/(t j t j1 ) is already larger than the maximum speed
of intruder ok, pi, j is impossible to be the next appearance of ok. The system filters it
directly without computing netd( pk, j1 , pi, j ).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:20

L.-A. Tang et al.

Fig. 14. Example: track intruder with three appearances.

Based on Property 2, LoRM runs a filtering process, as illustrated in Figure 14.


The system draws a circle with pk,2 as the center and vmax (t j t j1 ) as the radius.
The appearances that locates outside the circlefor example, p4,3 , p5,3 , and p6,3 are
all filtered out. Only the inside ones ( p1,3 , p2,3, and p3,3 ) are possible to be oks next
appearance.
So which one of the three is most likely to be the next appearance? After a careful
examination, one may find that p1,3 is more likely to be the one. Since ok has already
moved from pk,1 to pk,2 , it has to turn back to visit p3,3 and p2,3 . Such a move is relatively
rare in the real world.
In most cases, an intruder moves from the source to destination following the shortest
path between the two points. Based on this observation, we propose a shortest-pathmeasure as defined next.
Definition 15 (Partial Trajectory Length). Let Lk be a -recent trajectory, Lk =
{ pk, j , pk, j+1 , . . . , pk, j1 }, j l j 2. The partial trajectory length is defined
 j2
as length( pk,l , pk, j1 ) = i=l netd( pk,i , pk,i+1 ).
Definition 16 (Shortest-Path-Measure). Let pi, j be an intruder appearance, Lk be a
-recent trajectory, j l j 2. The shortest-path-measure between pk,l and pi, j
is defined as Equation (18):
SP( pk,l , pi, j ) =

netd( pk,l , pi, j )


.
length( pk,l , pk, j1 ) + netd( pk, j1 , pi, j )

(18)

Intuitively, this measure reflects the ratio of the shortest path and the actual distance
between pk,l and pi, j . If they are the same, SP( pk,l , pi, j ) has the maximum value as 1.
Let pi, j be an intruder appearance and Lk be a -recent trajectory. The average shortest-path-measure between Lk s apperances and pi, j is calculated as
Equation (19):
 j


l= j2 SP( pk,l , pi, j )
.
(19)
ASP Lk , pi, j =
w1
The trustworthiness of pi, j being the next appearance of Lk, ( pi, j Lk), is then
computed as shown in Equation (20). It has three parts: the trustworthiness of pi, j , the
matching probability based on moving speed, and the average shortest-path-measure
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:21

Fig. 15. Algorithm: tracking trajectories on road network.

between Lk and pi, j :


1

(vk, j vk)2

( pi, j Lk) = ( pi, j )


exp
2 (vk)
2 (vk)2



ASP Lk , pi, j .

(20)

In Equation (20), we assume that the moving speed of the intruders still follows the
normal distribution. However, the moving speed may be influenced by the topology and
traffic conditions of the road. We will discuss this issue in Section 7.
Figure 15 lists the detailed steps to track intruders on the road network. For each
candidate trajectory, Algorithm 5 first retrieves the -recent trajectory Lk (lines 2
through 4). Then, the algorithm filters the intruder appearances using Property 2
(lines 6 through 8). After that, the system carries out the refinement process to compute
the shortest-path-measure and the matching trustworthiness. The one with the highest
trustworthiness is tagged as matched and added to Lk (lines 9 through 15). Finally, the
system initializes new candidate trajectories for the unmatched intruder appearances
(lines 16 through 18).
PROPOSITION 5. Let n be the number of edges in the road network, m be the number of
trajectory candidates, and l be the number of intruder appearances. The time complexity
of Algorithm 5 is O(lmn).
PROOF. In the worst case, no appearances can be filtered out and the system has
compute the shortest distance between each pair of candidate trajectory and intruder
appearances. The total time complexity is still O(lmn).
Note that even Algorithm 5s time complexity is the same as the original one. In the
experiment, we find that about 80% of the intruder appearances can be filtered and
the algorithms efficiency is improved dramatically.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:22

L.-A. Tang et al.

Fig. 16. Experiment settings.

6. PERFORMANCE EVALUATION
6.1. Experiment Setup

Datasets: To test the performance of LiSM in big and untrustworthy data, we generated
four datasets based on the real military trajectories from the CBMANET project [Krout
2007]. The data generator retrieves 10 to 40 trajectories from CBMANET and simulates
sensor monitoring fields along their routes with 200 to 10,000 deployed sensors. If
an intruder passes by, the sensor generates a detection record. The data generator
randomly selects a portion of sensors as false-positive reporters, which may generate
detection records without any local intruder. The system also selects some a portion
of negative sensors; such sensors may not send a detection record when an intruder
passes by. The detailed parameters of those datasets are listed in Figure 16.
Baselines: The proposed LiSM algorithm (LM) is compared with two baselines: (1)
The Kalman filteringbased method (KF) and (2) the TruAlarm method with nearestneighboring tracking strategy (TA). TruAlarm is a method to evaluate the trustworthy
alarms and detect intruder appearance Tang et al. [2010].
Environments: The experiments are conducted on a PC with Intel 7500 Dual CPU
2.20GHz and 3.00GB RAM. The operating system is Windows 7 Enterprise. All algorithms are implemented in Java on the Eclipse 3.3.1 platform with JDK 1.5.0.
6.2. Evaluations on Mining Efficiency

In the first experiment, we evaluate the efficiency of different algorithms with default
parameters. The system processes LM, KF, and TA on the four datasets and records
their time costs. Figure 17(a) shows the results on the four datasets. Note that the y-axis
is in logarithmic scale. In general, all three algorithms are efficient enough to process
the data. LM achieves the best efficiency in all cases, because the algorithm filters
out low-expectation trajectory candidates in each snapshot and tracks the trajectories
quickly with the cone model.
We then study the factors that influence LMs efficiency. We set the decay factor
from 0.05 to 0.2 and record the algorithms time cost on datasets D1 to D4 in Figure 17(b).
With larger , the system prunes more candidate trajectories and achieves better time
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:23

Fig. 17. Efficiency: time costs on different datasets (a) and influence of (b).

efficiency. We also study the algorithms running time with trajectory size threshold
and the recent trajectory length . Both parameters do not influence the algorithms
efficiency, so we omit the results here.
6.3. Evaluations on Mining Effectiveness

To evaluate the quality of mining results, we retrieve the intruders true trajectories as
ground truth and compare against the mining results. There are two stage of mining
lines in the sand: (1) detecting the intruder appearances and (2) tracking their trajectories. In this experiment, we first compare the detected intruder appearances with
the ground truth. If their distance is less than a reasonable error bound (20m), the
detection is considered as a valid result. Then we check each generated trajectory Lk;
since the user only prefers high-quality results to help in their decision making, we
consider Lk as a valid trajectory only if more than X% of Lks intruder appearances can
be matched to a real trajectory in the ground truth (the default X% is 90%).
We compute two measurements to evaluate the algorithms effectiveness:
Precision: The proportion of valid appearances/trajectories over the mining results.
This represents the algorithms selectivity for filtering out false positives.
Recall: The proportion of valid appearances/trajectories over the ground truth. This
criterion shows the algorithms sensitivity for detecting the intruders.
Note that due to the large number of false-positive and false-negative records, if an
intruder has a long trajectory across the monitoring region, the system usually cannot
discover the original one. Instead, LiSM outputs several subtrajectories with length
larger than the threshold. The tracking recall is then calculated as the proportion of the
number of intruder appearances in valid trajectories over the total number of intruder
appearances in the trajectories of ground truth.
The detection precision and recall of LM, KF, and TA are shown in Figure 18. All
three methods can achieve a relative high recall of about 80%. However, the precision
of KF and TA drops rapidly in D3 and D4 , which have more false detection records. The
precision of KF is less than 20% in D4 , which is only one fourth of LMs precision. TAs
precision is also lower than 50%. In contrast, LM filters out the false-positive data and
keeps the precision higher than 80%.
In the next experiment, we investigate LMs precision and recall with different trajectory length threshold . The results are shown in Figures 19 and 20. With larger
, fewer trajectories are reported. Hence, the algorithms precision increases, but the
recall drops. We also notice that the recall of LM on D2 is lower than other three.
The reason is about the dataset; even the size of D2 is smaller than D3 and D4 , but
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:24

L.-A. Tang et al.

Fig. 18. Effectiveness: detecting precision (a) and recall (b) of intruder appearances on different datasets.

Fig. 19. Effectiveness: detecting precision (a) and recall (b) with respect to .

Fig. 20. Effectiveness: tracking precision (a) and recall (b) with respect to .

some trajectories in D2 are hard to be tracked and the recall is lower. We will discuss
such cases in Section 7. Based on the experiment results, our suggestion is to select
moderate (e.g., 8 to 10) to make LM achieve the best performance.
Finally, we study the influences of parameter and . The results of LMs effectiveness are recorded in Figures 21 to 24. If the length of -recent trajectory is too short,
LM may not be able to track the intruder with an accurate cone model. Therefore,
should be set as a reasonable large value (e.g., 6 to 9).
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:25

Fig. 21. Effectiveness: detecting precision (a) and recall (b) with respect to .

Fig. 22. Effectiveness: tracking precision (a) and recall (b) with respect to .

Fig. 23. Effectiveness: detecting precision (a) and recall (b) with respect to .

The decay factor is used to filter the candidate trajectories; if it is set too large, the
algorithm may prune some trustworthy candidates. The recall of LM is then reduced.
Note that the recall of LM drops rapidly on D4 ; nearly half of the intruder trajectories
cannot be discovered if is reduced to 0.2. Because the false-positive and false-negative
rates are much higher in D4 than other datasets, the trustworthiness scores of valid
detection are not high. According to Equation (14), if the value of is set too high, some
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:26

L.-A. Tang et al.

Fig. 24. Effectiveness: tracking precision (a) and recall (b) with respect to .

Fig. 25. Experiment settings of mining lines on the roads.

valid trajectories will also be pruned. As a result, the value of should be set relatively
small (e.g., 0.05) to discovery trajectories in a noisy dataset.
6.4. Experiments of Mining Lines on the Roads

We evaluate the performance of three algorithms: (1) LM, the original LiSM algorithm;
(2) FR, the improved algorithm with a filtering-and-refinement framework, but still
using the cone model for tracking; and (3) LR, the LoRM algorithm, with a filtering-andrefinement framework and tracking based on the shortest-path-measure. The details
of the experiment setting are listed in Figure 25.
First, we evaluate the efficiency of three algorithms. The time costs on the four
datasets are recorded in Figure 26(a). Note that the y-axis is in logarithmic scale. FR
and LRs time costs are only 15% to 20% of LM. Figure 26(b) shows the percentage of
road network distance computation time over the total cost. Without a filtering and
refinement framework, distance computation becomes the bottleneck of LM. However,
FR and LR filter most intruder appearances without distance computation. A huge
amount of time is then saved.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:27

Fig. 26. Efficiency: time costs on different datasets (a) and the portion of distance computation (b).

Fig. 27. Effectiveness: tracking precision (a) and recall (b) on different datasets.

In the next experiment, we check tracking precision and recall of the three algorithms. The results are shown in Figure 27. Although FRs efficiency is close to LR, its
tracking precision and recall are much lower. LR has the highest tracking precision
and recall on all of the datasets. Dataset D4 contains 50% false positives and 30%
false negatives. LR still achieves about 90% precision and near 80% recall on it. These
results indicate that LR is more suitable than LM and FR to track intruders on the
road network.
7. DISCUSSION

In this section we discuss some issues of mining lines in the sand.


(1) Why use the cone model for trajectory tracking?
In this study, we propose the cone model for trajectory tracking. Indeed, there are
several state-of-the-art methods to model trajectory uncertainty and conduct trajectory
predictions, such as the cylinder model [Trajcevski et al. 2004] and bead model [Pfoser
and Jensen 1999] (more details of these works are introduced in Section 8.3). Most of
them are designed for the purposes of completing low sampling GPS trajectories or
range query. In these models, the system computes possible regions of a moving object
between two consecutive trajectory points. However, in the scenario of mining lines in
the sand, the system needs to predict the future intruder appearance based on historical data. We cannot use these uncertainty models directly. In addition, the efficiency
of model computation is a major issue. LiSM uses a greed algorithm for trajectory
tracking, where each trajectory captures the best-matched appearance. Hence, the
system can output the results in online time.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:28

L.-A. Tang et al.

Fig. 28. Example: the different models for trajectory tracking.

Fig. 29. Effectiveness: tracking precision (a) and recall (b) on different models.

There are some prediction models similar to cone model, such as the pie model
(Figure 28(b)) and disc model (Figure 28(c)). The disc model is computed with a fixed
upper bound for moving speed, and the pie model is refined from the disc model by
considering the moving direction. We conduct an experiment to test the effectiveness
of these models. The experiment settings are the same as shown in Figure 16. We use
Algorithm 1 to detect intruder appearances and track the trajectories with different
models. The tracking precision and recall are shown in Figure 29.
The results show that even the disc model and pie model have similar recall with the
cone model; however, their tracking precision is much lower. The reason is the lower
bound for moving speed. Both the pie model and disc model do not have any restrictions
on the minimum speed of the moving objects. The false-positive sensors continuously
produce many false detection records, and the positions of the false detections are close
to each other. Without a lower bound for moving speed, the system is likely to track
those neighboring false detections together and report many ghost trajectories.
(2) In which cases is the cone model not effective?
We propose the cone model to predict an intruders next appearance. This model is
designed based on an observation from the real data: when intruders travel through
monitoring regions, they move with a relatively stable speed and direction. There are
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:29

Fig. 30. Example: the case of intruders meeting.

rare sudden changes of the speed or direction in the movements. In addition, most
intruders pass through the monitoring region as soon as possible; they do not stop
or stay for a long time. Therefore, the cone model assumes that an intruders moving
speed and direction follow the normal distribution.
If the intruders have some specific movements, such as patrol around the region,
stop at some positions, and frequently change speed and direction, the cone model
is then not effective to track them. The cone model is also not effective on the road
network. We propose the shortest-path-measure and redesign the tracking algorithm
on the road. In the real world, the vehicle movements on the road are influenced by
many factors, including the traffic, weather, and road condition. Since we only focus on
studying the sensors data, our road network model does not consider these factors. The
infrastructure of proposed methods can be easily adapted with more complex models
to achieve better accuracy.
The cone model is also limited in the cases of meeting, as shown in Figure 30. oi and o j
are two intruders moving through the monitoring region; they meet together and make
a turn in time t4 . The detected intruder appearances, pi,4 and p j,4 , are close to each
other. In such a case, the cone model wrongly tracks their movements, and the mined
trajectories are misleading. The problem is caused by only considering the spatial
information: oi s next appearance is the best match of o j s historical movement, and the
cone model cannot distinguish two intruders solely based on their recent trajectories.
To solve the problem, the system needs extra information to identify multiple intruders,
such as the signal strength emitted from the intruder (e.g., magnetic force or sound),
the speed of the intruder (not estimated from -recent trajectory but measured by the
sensors), and so on. In our previous work [Tang et al. 2012a], we studied the problem
of identifying the types of intruders based on the emitted signal strength.
(3) Are there any problems with using discovery results to measure sensor reliability?
In the framework of LiSM, the systems feeds back the results of discovered trajectories to adjust sensor robustness and sensitivity scores. This mechanism has a potential
risk of mispunishment: (1) the ghost trajectories may contain some valid intruder
appearances, and the system may wrongly reduce the robustness of the sensors that
make correct detections; (2) it is possible that some false-positive detections make up a
ghost trajectory, and such a trajectory is longer than the length threshold. The system
then makes a mistake to reduce the sensitivity of all of nonresponding sensors. These
cases are indeed rare, as most ghost trajectories do not contain any valid appearance,
and few of them are longer than the length threshold.
To reduce the risk of mispunishment, we can set a lower bound for ni (the total
number of detection records by sensor si ) in Equations (15) and (16). When updating the
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:30

L.-A. Tang et al.

robustness and sensitivity scores, if ni is smaller than the lower bound, the system uses
the value of lower bound to replace ni . In this way, the influence of the preceding cases
are reduced. On the other hand, if a sensor becomes unreliable and begins to generate
false positives or negatives, it is very likely to continue generating more false positives
and negatives. In this way, the number of false reports from this sensor will increase
rapidly, and the system can detect it soon and reduce the robustness and sensitivity.
8. RELATED WORK
8.1. Trustworthiness Analysis in CPS

Sha et al. [2008] review the history of CPS development. The study of data trustworthiness is listed as one of the three major challenges. The CPS applications require a high
level of trust in the operations. The trustworthiness is measured from reliability, safety,
security, and usability. The system models and abstractions should incorporate fault
models and recovery policies that reflect the scale, lifetime, control, and reparability of
components.
Ganti et al. [2008] propose a CPS application of SenseWorld. It is used to facilitate
connecting sensors, people, and software objects to build community-centric sensing
applications. The authors point out that the first and foremost challenge is to infer higher-level information from the lower-level sensor data. To obtain meaningful
information, they use a frequent itemset mining algorithm to collect the high-level
inferences and get a general picture.
Johnson and Mitra [2009] provide several directions to handle the failures in CPS.
They classify the failures into three categories: (1) the cyber-side failures, such as software bugs and system crashes; (2) the physical-side problems, such as sensor failures
and irrelevant object influences; and (3) the communication-side issues, including message drops, omissions, man-in-the-middle attack, and so on. The authors suggest that
the degradation of the system state could potentially be used to detect failures.
Makedon et al. [2009] design an event-driven framework for assistant CPS environments. The event is modeled as the abnormal behavior of the system, such as accidents
or acute needs. Those behaviors are organized in a hierarchical tree. Two-step event
identification first assimilates different types of data and then identifies the event of
interest. A low-level security standard is applied to the raw data, and a high-level
security strategy is used to check the generalized events.
The research of CPS is still at the beginning stage. Many papers have addressed the
importance of data trustworthiness, but detailed solution plans are seldom provided.
As mentioned in Johnson and Mitra [2009], the complexity of this problem is the main
challenge. Some studies use the strategies of abstraction and generalization to reduce
the influences of false alarms [Ganti et al. 2008] or to detect untrustworthy data by
performance degradation of the system [Makedon et al. 2009]. However, most works
only propose such strategies as research directions and do not have detailed solutions.
8.2. Intruder Detection from Sensor Data

Arora et al. [2004] propose the intrusion detection problem in wireless networks and
design a detection model with a sensor network. The approach is based on a dense,
distributed wireless network. The authors study nine types of sensors, including magnetic, radar, thermal, acoustic sensors, and so on. Based on the performance requirements of the scenario and the sensing, communication, energy, and computation ability,
the magnetic and radar sensors are used to detection intruders. The authors propose
a classifier to determine the intruder type based on the number of detection records
for example, if a vehicle passes through the monitoring field, about 40 sensors detect it and send out records; if a soldier passes through the monitoring field, only 20
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:31

detection records are generated. Since the classification model is constructed based on
the number of detection records, the accuracy may be influenced by the false positives.
Tang et al. propose the TruAlarm filtering method [Tang et al. 2010, 2012b]. TruAlarm
carries out trustworthiness inferences for the sensor alarms based on the estimated
locations of the objects.
Sheng and Hu [2005] propose the maximum likelihoodbased estimation method.
This method uses acoustic signal energy measurements taken at individual sensors
of an ad hoc wireless sensor network to estimate the locations of multiple acoustic
sources. They propose a multiresolution search algorithm and an EM-like iterative
algorithm to expedite the computation of source locations. Lin et al. [2006] propose
a framework for the in-network intruder tracking. They develop the DAT and ZAT
tree structures according to the physical topology of the sensor network. The proposed
method can analytically formulate the cost of object tracking based on the update and
query rates, which is a significant improvement in this area. Tang et al. [2012a] propose
an IntruMine method to detect and verify the intruders in a sensor network. IntruMine
constructs several monitoring graphs to model the relationships between sensors and
possible intruders, and computes the position and energy of each intruder with the link
information from these monitoring graphs.
Cevher and Kaplan [2007] study the problem of how to assign sensors to different
tracking and monitoring tasks to achieve the optimal efficiency. They propose a sensor
assignment algorithm with fuzzy location estimation.
Most methods try to detect the intruders in a single snapshotthat is, without
considering the temporal information of the intruders movement. Some studies focus
on saving sensors energy and communication bandwidth; they try to provide an optimal
sensor deployment plan. LiSM and LoRM actually complement those technologies and
improve the systems applicability.
8.3. Trajectory Predication and Map Matching

Wolfson points out in a vision paper [Wolfson 2002] that the trajectory of a moving object
is inherently imprecise due to continuous motion. Trajcevski et al. [2004] propose the
cylinder model for the uncertainty of trajectories. In this model, the possible region
of a moving object at a timestamp is within a disc, and the trajectory is represented
as a cylinder in the 3D (2D+time) space. A comparable model, the space-time prism
(beads) [Pfoser and Jensen 1999] is proposed to represent the uncertain movement
of a moving object as the union of two half-cones (a bead) in the 3D space. Kuijpers
and Othman [2010] propose a bead modelbased approach to model the uncertainty
in trajectories. This method has significantly improved efficiency in managing the
uncertainty of trajectories.
The problem of map matching (i.e., match the trajectories to the road network) has
been studied by many researchers in past decades. Greenfeld [2002] uses distance
and orientation similarity measures to match the trajectory points to a road edge.
The adaptive clipping method uses the Dijkstra algorithm to construct the shortest
path on a local free space graph. Yin and Wolfson [2004] employ an offline snapping
method that aims to find a minimum weight path based on edit distance. Brakatsoulas
et al. [2005] propose the map-matching approaches based on the Frechet distance
and its variants. Frechet distance takes the continuity of curves into account and
is therefore suitable for comparing trajectories. Pink and Hummel [2008] propose
a map-matching method based on a Bayesian classifier that incorporates a hidden
Markov model to model topological constraints of the road network. Hummel and
Tischler [2005] extend the Kalman filter and cubic spline interpolation to enhance map
matching.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:32

L.-A. Tang et al.

Zheng et al. [2011] introduce the uncertain trajectories hierarchy for probabilistic
range queries. In this milestone work, the uncertainty of the objects moving along road
networks are modeled as time-dependent probability distribution functions by giving
a maximal speed on each road segment. In this way, both snapshot and continuous
probabilistic range queries can be conducted on the road network. Kuijpers et al. [2010]
extend the models of space-time prisms to the road network. This model represents
the measurement errors of a moving object. An algorithm is proposed to compute the
envelope of all space-time prisms that have an anchor in these sample regions. The
spatial projection of the network-based space-time prisms under uncertainty is thus
derived.
However, most of the state-of-the-art models are designed for the purposes of trajectory completion and range query. The uncertainly models are frequently used to
address the low sampling problem in GPS trajectories. In the scenario of mining lines
in sand, the sampling frequency of the sensors are relatively high; the main problem
is to predict the next position based on existing data. Hence, these models cannot be
used directly.
8.4. Target Tracking in Sensor Network

Ozdemir et al. [2009] use the techniques of particle filtering to track intruders. They
model the link between sensors and the fusion center as a binary symmetric channel
and create a general framework using particle filters. Three different types of channelaware particle filters have been developed, including hard-decoding links, coherent
soft-decoding links, and noncoherent soft-decoding links.
Hammad et al. [2003] propose the stream window join algorithm to track moving
objects in sensor network database. Aslam et al. [2003] propose a geometry-based
method to track intruders. Lin et al. [2006] propose a framework for in-network intruder
tracking. Zhong et al. [2009] provide the techniques to track targets with the sequence
of the alarming sensors. Trajcevski et al. [2010] propose the methods of trajectory
reduction for energy savings. Ghica et al. [2010] propose the tracking principles with
epoch awareness. Liu et al. [2004] study the distributed state representation problem
for target tracking in sensor networks. Oh et al. [2009] propose the Markov chain
data association method for target tracking. Crespi et al. [2008] propose a quantitative
theory of trackability to investigate the consistent tracking number in a sensor network
application.
In these studies, the researchers assume that the intruders locations at each snapshot are already known. They focus on combining the intruder detections at different
snapshots to generate trajectories. However, as pointed out in Arora et al. [2004], the
intruder tracking results cannot be accurate based on many false intruder detections.
In addition, the tracking model is a problem in theory because it is difficult to build an
accurate kinematic movement model [Crespi et al. 2008]. The model must be adjusted
dynamically during the tracking process. LiSM and LoRM utilize the -recent trajectory to rebuild the tracking model at every timestamp and thus keep high tracking
accuracy.
Pan et al. [2007a] use the supervised learning method to locate receiving signal sensors (RSS). The transfer learning techniques are proposed as semisupervised mining
[Pan et al. 2007b]. The main difference in their works is about the sensors. The RSS
are moving sensors that receive signals from some fix nodes (APs), and the algorithm is
designed to learn sensor locations. In the CPS applications, most sensors are fixed and
their locations are already known. The problem is to detect the location of intruders.
To the best of our knowledge, this is the first study to solve both detecting and
tracking problems in an integrated framework. Figure 31 compares the features of
some related approaches with the proposed method.
ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:33

Fig. 31. The comparison to related works.

9. CONCLUSION AND FUTURE WORK

This study investigates the problem of mining trajectories in a CPS. We propose a novel
method, LiSM, to discover intruder trajectories from untrustworthy sensor data. The
watching network is designed to detect intruder appearances, and the cone model is
used to track their trajectories. In addition, LoRM is proposed to discover trajectories
on the road network. We evaluate the proposed algorithms in extensive experiments
on big datasets. The proposed methods achieve better performances on both mining
efficiency and accuracy.
In the future, we are going to integrate LiSM and LoRM with more information (e.g.,
weather and traffic) to improve system performance.
REFERENCES
Anish Arora, Prabal Dutta, Sandip Bapat, Vinod Kulathumani, Hongwei Zhang, Vinayak Naik, Vineet
Mittal, Hui Cao, Murat Demirbas, Mohamed G. Gouda, Young-ri Choi, Ted Herman, Sandeep S.
Kulkarni, Umamaheswaran Arumugam, Mikhail Nesterenko, Adnan Vora, and Mark Miyashita. 2004.
A line in the sand: A wireless sensor network for target detection, classification, and tracking. Computer
Networks 46, 5, 605634.
Javed Aslam, Zack Butler, Florin Constantin, Valentino Crespi, George Cybenko, and Daniela Rus. 2003.
Tracking a moving object with a binary sensor network. In Proceedings of the 1st International Conference
on Embedded Networked Sensor Systems. 150161.
Sotiris Brakatsoulas, Dieter Pfoser, Randall Salas, and Carola Wenk. 2005. On map-matching vehicle tracking data. In Proceedings of the 31st International Conference on Very Large Data Bases. 853864.
Volkan Cevher and Lance M. Kaplan. 2007. Acoustic sensor network design for position estimation. In ACM
Transactions on Sensor Networks 5, 3, Article No. 21.
Valentino Crespi, George Cybenko, and Guofei Jiang. 2008. The theory of trackability with applications to
sensor networks. ACM Transaction on Sensor Networks 4, 3, Article No. 16.
Raghu K. Ganti, Yu-En Tsai, and Tarek F. Abdelzaher. 2008. SenseWorld: Towards cyber-physical social
networks. In Proceedings of the International Conference on Information Processing in Sensor Networks.
563564.
Oliviu Ghica, Goce Trajcevski, Fan Zhou, Roberto Tamassia, and Peter Scheuermann. 2010. Selecting tracking principals with epoch awareness. In Proceedings of the 18th SIGSPATIAL International Conference
on Advances in Geographic Information Systems. 222231.
Joshua Greenfeld. 2002. Matching GPS observations to locations on a digital map. In Proceedings of the 81st
Annual Meeting of the Transportation Research Board.

ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

16:34

L.-A. Tang et al.

Moustafa A. Hammad, Walid G. Aref, and Ahmed K. Elmagarmid. 2003. Stream window join: Tracking
moving objects in sensor network databases. In Proceedings of the 15th International Conference on
Scientific and Statistical Database Management. 7584.
Mark Hewish. 2001. Reformatting fighter tactics. In Janes International Defense Review.
Britta Hummel and Karin Tischler. 2005. GPS-only map matching: Exploiting vehicle position history,
driving restriction information and road network topology in a statistical framework. In Proceedings of
the GIS Research UK Conference.
Inseok Hwang, Hamsa Balakrishnan, Kaushik Roy, and Claire Tomlin. 2004. Multiple-target tracking and
identity management in clutter, with application to aircraft tracking. In Proceedings of the 2004 American
Control Conference. 34223428.
Taylor Johnson and Sayan Mitra. 2009. Handling failures in cyber-physical systems: Potential directions. In
Proceedings of the Real-Time Systems Symposium.
Tim Krout. 2007. CB MANET scenario data distribution. In BBN Technique Report.
Bart Kuijpers, Harvey J. Miller, Tijs Neutens, and Walied Othman. 2010. Anchor uncertainty and space-time
prisms on road networks. International Journal of Geographical Information Science 24, 8, 12231248.
Bart Kuijpers and Walied Othman. 2010. Trajectory databases: Data models, uncertainty and complete
query languages. Journal of Computer and System Sciences 76, 7, 538560.
Xu Li, Rongxing Lu, Xiaohui Liang, Xuemin Shen, Jiming Chen, and Xiaodong Lin. 2011b. Smart community:
An internet of things application. IEEE Communications Magazine 49, 11, 6875.
Zhenhui Li, Jiawei Han, Ming Ji, Lu An Tang, Yintao Yu, Bolin Ding, Jae-Gil Lee, and Roland Kays. 2011a.
MoveMine: Mining moving object data for discovery of animal movement patterns. ACM Transactions
on Intelligent Systems and Technology 2, 4, Article No. 37.
Chih-Yu Lin, Wen-Chih Peng, and Yu-Chee Tseng. 2006. Efficient in-network moving object tracking in
wireless sensor network. IEEE Transactions on Mobile Computing 5, 8, 10441056.
Juan Liu, Maurice Chu, Jie Liu, Jim Reich, and Feng Zhao. 2004. Distributed state representation for
tracking problems in sensor networks. In Proceedings of the 3rd Workshop on Information Processing in
Sensor Networks. 234242.
Chia-Hao Lo, Wen-Chih Peng, Chien-Wen Chen, Ting-Yu Lin, and Chun-Shuo Lin. 2008. CarWeb: A traffic
data collection platform. In Proceedings of the 9th International Conference on Mobile Data Management.
221222.
Fillia Makedon, Zhengyi Le, Heng Huang, Eric Becker, and Dimitrios Kosmopoulos. 2009. An event driven
framework for assistive CPS environments. ACM SIGBED Review 6, 2, Article No. 3.
National Science Foundation. 2008. Cyber-physical systems. In Program Announcements and Information.
Songhwai Oh, Stuart Russell, and Shankar Sastry. 2009. Markov chain Monte Carlo data association for
multi-target tracking. IEEE Transactions on Automatic Control 54, 3, 481497.
Onur Ozdemir, Ruixin Niu, and Pramod K. Varshney. 2009. Tracking in wireless sensor networks using
particle filtering: Physical layer considerations. IEEE Transactions on Signal Processing 57, 5, 1987
1999.
Rong Pan, Junhui Zhao, Vincent Wenchen Zheng, Jeffrey Junfeng Pan, Dou Shen, Sinno Jialin Pan, and
Qiang Yang. 2007b. Domain constrained semi-supervised mining of tracking models in sensor networks.
In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. 10231027.
Sinno Jialin Pan, James T. Kwok, Qiang Yang, and Jeffrey Junfeng Pan. 2007a. Adaptive localization in a
dynamic WiFi environment through multi-view learning. In Proceedings of the 22nd National Conference
on Artificial Intelligence (AAAI07). 11081113.
Dieter Pfoser and Christian S. Jensen. 1999. Capturing the uncertainty of moving-object representations. In
Proceedings of the 6th International Symposium on Advances in Spatial Databases (SSD09). 111132.
Oliver Pink and Britta Hummel. 2008. A statistical approach to map matching using road network geometry,
topology and vehicular motion constraints. In Proceedings of the 11th International IEEE Conference on
Intelligent Transportation Systems. 862867.
Lui Sha, Sathish Gopalakrishnan, Xue Liu, and Qixin Wang. 2008. Cyber-physical systems: A new frontier.
In Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy
Computing. 19.
Xiaohong Sheng and Yu-Hen Hu. 2005. Maximum likelihood multiple source localization using acoustic
energy measurements with wireless sensor networks. IEEE Transactions on Signal Processing. 4453.

Yener. 2004. Time synchronization in sensor networks: A survey. IEEE Network


Fikret Sivrikaya and Bulent
18, 4, 4550.

ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

16:35

Robert Szewczyk, Joseph Polastre, Alan Mainwaring, and David Culler. 2004. Lessons from a sensor network
expedition. In Wireless Sensor Networks. Lecture Notes in Computer Science, Vol. 2920. Springer, 307
322.
Lu-An Tang, Quanquan Gu, Xiao Yu, Jiawei Han, Thomas La Porta, Alice Leung, Tarek Abdelzaher, and
Lance Kaplan. 2012a. IntruMine: Mining intruders in untrustworthy data of cyber-physical systems. In
Proceedings of the SIAM International Conference on Data Mining (SDM).
Lu-An Tang, Xiao Yu, Quanquan Gu, Jiawei Han, Alice Leung, and Thomas La Porta. 2013. Mining lines in
the sand: On trajectory discovery from untrustworthy data in cyber-physical system. In Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Lu-An Tang, Xiao Yu, Sangkyum Kim, Quanquan Gu, Jiawei Han, Alice Leung, and Thomas La Porta.
2012b. Trustworthiness analysis of sensor data in cyber-physical systems. Journal of Computer and
System Sciences 79, 3, 383401.
Lu-An Tang, Xiao Yu, Sangkyum Kim, Jiawei Han, Chih-Chieh Hung, and Wen-Chih Peng. 2010. Tru-Alarm:
Trustworthiness analysis of sensor networks in cyber-physical systems. In Proceedings of the 2010 IEEE
10th International Conference on Data Mining. 10791084.
Gilman Tolle, Joseph Polastre, Robert Szewczyk, David E. Culler, Neil Turner, Kevin Tu, Stephen Burgess,
Todd Dawson, Philip Buonadonna, David Gay, and Wei Hong. 2005. A macroscope in the redwoods. In
Proceedings of the 3rd International Conference on Embedded Networked Sensor Systems. 5163.
Goce Trajcevski, Oliviu Ghica, and Peter Scheuermann. 2010. Tracking-based trajectory data reduction in
wireless sensor networks. In Proceedings of the 2010 IEEE International Conference on Sensor Networks,
Ubiquitous, and Trustworthy Computing. 99106.
Goce Trajcevski, Ouri Wolfson, Klaus Hinrichs, and Sam Chamberlain. 2004. Managing uncertainty in
moving objects databases. ACM Transactions on Database Systems 29, 3, 463507.
Ouri Wolfson. 2002. Moving objects information management: The database challenge. In Proceedings of the
5th International Workshop on Next Generation Information Technologies and Systems. 7589.
Huabei Yin and Ouri Wolfson. 2004. A weight-based map matching method in moving objects databases.
In Proceedings of the 16th International Conference on Scientific and Statistical Database Management.
437438.
Kai Zheng, Goce Trajcevski, Xiaofang Zhou, and Peter Scheuermann. 2011. Probabilistic range queries
for uncertain trajectories on road networks. In Proceedings of the 14th International Conference on
Extending Database Technology. 283294.
Yu Zheng and Xiaofang Zhou. 2011. Computing with Spatial Trajectories. Springer.
Ziguo Zhong, Ting Zhu, Dan Wang, and Tian He. 2009. Tracking with unreliable node sequences. In Proceedings of IEEE INFOCOM. 12151223.
Received October 2013; revised July 2014; accepted September 2014

ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

Vous aimerez peut-être aussi