
Artificial Intelligence for Situation Assessment

RIKARD LAXHAMMAR

Master of Science Thesis Stockholm, Sweden 2007

Artificial Intelligence for Situation Assessment

RIKARD LAXHAMMAR

Master's Thesis in Computer Science (20 credits) at the School of Computer Science and Engineering, Royal Institute of Technology, year 2007. Supervisor at CSC was Örjan Ekeberg. Examiner was Anders Lansner. TRITA-CSC-E 2007:046, ISRN-KTH/CSC/E--07/046--SE, ISSN-1653-5715

Royal Institute of Technology School of Computer Science and Communication KTH CSC SE-100 44 Stockholm, Sweden URL: www.csc.kth.se

Abstract
In the first phase of this master's thesis, an overview of different approaches involving artificial intelligence to support situation assessment is presented, discussing the potential and the limits of each approach. In particular, two key issues in the context of automated sea surveillance are addressed: identification of particular events and scenarios in the situation picture, and detection of anomalous behaviour in general in the situation picture. Characterizing all events and situations that may be of interest to a supervisor is a very difficult task. The set of available examples of each particular event or situation is usually very limited, as the sought-for events and situations occur relatively rarely and may vary significantly from one case to another. Turned the other way round, however, these rare events and situations can be detected as anomalies in a model of routine behaviour. Large amounts of data corresponding to routine behaviour are usually available, which motivates the use of data mining and clustering techniques for building models of normal behaviour. In the second phase of this project, anomaly detection in sea traffic, based on clustering of real recorded vessel traffic, is further investigated and implemented. The implemented feature models are based on momentary vessel locations and velocities. Unsupervised clustering is done by a combination of two different cluster models and learning algorithms: one based on Mixtures of Gaussians (MoG) densities and Expectation-Maximization, the other based on Neural Networks and Adaptive Resonance Theory (ART). Qualitative results from evaluating the implemented models show that the most distinguishing anomalies found in typical routine traffic correspond to vessels that are crossing sea lanes and vessels that are travelling close to and in the opposite direction of sea lanes. Generally, the implemented models detect the same anomalies to a rather large extent.
The anomalies detected by the implemented systems are of a rather elementary nature; the type of feature model essentially determines the character of the detectable anomalies. Therefore, a more sophisticated feature model, based on manoeuvres in the motion pattern over time, is proposed as a future development of the implemented system. However, the generality of the proposed system should be stressed, as it is applicable to other domains involving generic motion in the two-dimensional plane, requiring minimal adaptation and, since the systems are based on unsupervised algorithms, no specific domain knowledge.

Artificiell intelligens för situationsanalys


Summary
In the first part of this project, an overall survey of different AI-based techniques and methods for supporting situation assessment is presented, discussing the possibilities and limitations of each method and technique. Two key problems in the context of automated sea surveillance are addressed: identification of particular events and situations of interest, and detection of anomalous behaviour in general. Characterizing all events and situations that may be of interest to a supervisor is a very difficult task. The number of available examples of each particular type of event or situation is usually very limited, since the sought-for events and situations occur relatively rarely and may vary considerably from one case to another. Turned the other way round, however, these events and situations can be detected as deviations from a model of typical normal behaviour. Large amounts of data corresponding to typical normal behaviour are usually available, which motivates the use of data mining and clustering techniques for building models of normal behaviour. In the second part of the project, algorithms for anomaly detection in sea traffic, based on clustering of recorded sea traffic, are investigated and implemented. The implemented feature models are based on the momentary velocities and positions of the vessels. Unsupervised clustering is performed with a combination of two different cluster models and clustering algorithms: one based on mixtures of Gaussian distributions and Expectation-Maximization (EM), the other based on artificial neural networks and Adaptive Resonance Theory (ART). Qualitative results from the evaluation of the implemented models show that the most prominent anomalies found in typical routine traffic correspond to vessels crossing sea lanes and vessels travelling close to and in the opposite direction of sea lanes. In general, the different models detect the same anomalies to a rather large extent.
The anomalies detected by the implemented models are rather elementary in nature; the type of feature model is fundamental to the character of the prominent anomalies. Therefore, a more sophisticated feature model, based on manoeuvres in the motion pattern over time, is proposed as a suitable continuation of this work. The generality of the implemented system should nevertheless be stressed, as it can be transferred and adapted with relatively little effort to other domains involving motion in the two-dimensional plane.

Foreword
This document constitutes my master's thesis at the department of Computer Science at the Royal Institute of Technology, Stockholm. The master's project has been commissioned by Saab Systems in Järfälla, Sweden, and has been done within the framework of their R&D efforts within the field of Situation Assessment. My closest colleagues at Saab Systems have been Andreas Lingvall, project manager of the Situation Assessment R&D project, and Johan Edlund, system developer within the Situation Assessment R&D project and also my formal supervisor at Saab. I appreciate the informal contact and discussions I have had with both Andreas and Johan throughout my master's project, especially during its initial part. These discussions have been of great importance for my project, helping me understand the problems and issues related to situation assessment and sea surveillance. I would like to acknowledge Dr. Örjan Ekeberg, my supervisor at the Royal Institute of Technology, for his easy approachability and the feedback he has given me during my work. Finally, I would also like to thank Dr. Anders Holst, at the Swedish Institute of Computer Science, for the visit to his office and our discussions regarding anomaly detection in sea traffic.

Contents
1 Introduction
   1.1 Background
   1.2 Problem description
   1.3 Purpose and goal
2 Theoretical background
   2.1 Situation Awareness
   2.2 Data Fusion and Situation Assessment
3 AI in SA - State of the Art
   3.1 Traditional Rule-based Expert Systems
   3.2 Bayesian Networks
   3.3 Fuzzy Reasoning, Fuzzy Sets & Fuzzy Logic
   3.4 Case-based Reasoning
   3.5 Anomaly Detection based on Machine Learning
   3.6 Conclusions of theoretical study
4 Design of an Anomaly Detection System
   4.1 Data Description
   4.2 The feature models
   4.3 Cluster models and clustering algorithms
5 Implementation
   5.1 Data pre-processing
   5.2 Software packages
   5.3 Training the models
   5.4 Performing Anomaly Detection
6 Experimental evaluation
   6.1 Experimental setup
   6.2 Results
7 Analysis and discussion
   7.1 Degree of vigilance of the systems
   7.2 Anomaly detection in unlabelled data
   7.3 Anomaly detection in labelled data
   7.4 Anomaly detection in artificial data
   7.5 The feature models
   7.6 Comparing MoG and ART in general
   7.7 Future work and improvements
8 Conclusion
References


1 Introduction
1.1 Background
At Saab Systems, research and development of systems for tracking and identification of moving objects has been conducted for many years. The objects of interest may be airborne, land-borne or vessels at sea. The tracking process presents the current state of the operational picture to the supervisor, including a number of objects and their momentary positions, velocities and motion histories since they were first detected.

Figure 1.1: An example of an operational picture in the context of sea surveillance, where the surveillance area corresponds to the strait of Öresund between Sweden and Denmark. The trapezoid-shaped objects correspond to tracked vessels.

To enhance a supervisor's understanding of what is significant in the operational picture, beyond the existence of multiple unidentified or identified objects travelling in different directions, research within the field of situation assessment has been pursued. The purpose of situation assessment is to increase a supervisor's awareness of the current operational picture by finding and identifying relations between objects and their environment. Fundamental to the analysis is an ontology, which serves as a model of a specific domain (or subset) of the world. The ontology describes relevant concepts in the domain, such as objects, events, rules, relations and situations, that are of interest to the supervisor. Interesting relations and situations described by the ontology can be extracted from the operational picture, effectively enhancing the situation awareness of the supervisor. A prototype rule-based expert system has been developed at Saab Systems for performing situation assessment within the domain of sea surveillance. The system is able to identify a number of basic kinematical relations between objects and then deduce different situations of interest in a simulated real-time sea surveillance environment. The domain knowledge required for creating adequate models and rules for identifying situations of interest, such as smuggling, piloting and hijacking, has been acquired by consulting domain experts within the Swedish Naval Intelligence Battalion.


1.2 Problem description


With this background, the main problem is how a system can perform automatic situation assessment in an operational picture in order to enhance the situational awareness of a human supervisor. Related to this issue is the definition of situation assessment and situational awareness and how these concepts relate, a topic that has been the subject of much research within the field of cognitive science. The concepts of situational awareness and situation assessment are therefore briefly discussed in the theoretical background part of the thesis; a more extensive investigation of this matter is, however, beyond the scope of this project. The specific domain chosen for the investigation of situation assessment in this project is sea surveillance. Usability studies performed by Saab at the Malmö Sea Surveillance Centre have resulted in a summary of surveillance issues where an automatic surveillance system has the potential to be a useful support for the operators. According to the study, a general problem in sea surveillance is the detection of abnormal traffic patterns. Such abnormal patterns could be related to speeding, prohibited anchoring, grounding, drunkenness at sea and other criminal or dangerous activities. In addition, special attention is paid to identifying particular scenarios that develop over time and may involve multiple subjects and objects. Typical examples of such scenarios within sea surveillance are smuggling and fish poaching.

1.3 Purpose and goal


The main purpose of this master's project is to investigate different techniques from the field of Artificial Intelligence for enhancing the process of automatic situation assessment in general, and within the domain of sea surveillance in particular. An alternative or complementary solution to the existing prototype system is sought, one able to solve at least one of the two main sea surveillance problems: detection of abnormal traffic patterns and identification of particular scenarios. In addition, the system should be able to identify and handle uncertainties and present them to an operator in a convenient way. Investigating the possibilities of increasing system performance over time by supervised and/or unsupervised learning is another particular goal. The domain for situation assessment within this project is set to sea surveillance, mainly because of the previous work done at Saab and the relatively large amount of data and valuable knowledge available through Saab's well-established contact with the Swedish Naval Intelligence Battalion. However, taking the long view, the goal is also that the results and conclusions from this investigation should be relevant and applicable to other domains within airborne and land-borne surveillance. In the first phase of the project, a wide variety of approaches and AI techniques are investigated by studying literature and articles within the field of AI in general and situation assessment in particular. The goal of this study is to present an overview by assessing the potential and capability of the different approaches and techniques. In the second phase of the project, a particular approach is chosen for further evaluation and implementation. The goal is that this particular approach, in the long run, will have the potential to either extend or replace the existing solution developed by Saab.


2 Theoretical background
This chapter describes the concepts of Situation Awareness, Data Fusion and Situation Assessment and how these concepts relate to each other.

2.1 Situation Awareness


The term Situation Awareness (SAW) is commonly used within the Human-Computer Interaction community. One of the leading scientists within this field is Mica Endsley, who has formulated a general definition of SAW as "the perception of elements in the environment within a volume of time and space, the comprehension of their meaning and the projection of their status in the near future" [1]. To put it simply, SAW is knowing what is going on around you. One of the main concerns within Human-Computer Interaction is how to design computer interfaces that enhance the SAW of an operator in an optimal way. From this point of view, models of SAW describe the cognitive processes that occur in the mind of operators. Below, Endsley's model for SAW in dynamic decision making, encompassing her definition of SAW, is presented.

[Figure omitted: Endsley's model of SAW in dynamic decision making. Three levels of Situation Awareness (Level 1: Perception of Elements in Current Situation; Level 2: Comprehension of Current Situation; Level 3: Projection of Future Status) precede Decision and Performance of Actions, with Feedback, Goals & Objectives, and Preconceptions influencing the process.]

Figure 2.1: The Endsley model for SAW

In the figure, SAW is modelled as a stage preceding the cognitive process of decision making. SAW can be thought of as the operator's internal model of the state of the environment. This representation serves as the basis for the operator's decisions about what to do in the current situation and how to carry out any necessary actions. Relevant to the definition of SAW is a notion of what is important. For a given operator, the specific goals and objectives associated with the current job are highly related to SAW. For example, the arrival of a hostile ship would be a significant event to a coast guard, in contrast to the sighting of a seagull, which is an irrelevant event in this context. The limitations of the human mind have been shown to have a significant impact on the degree of SAW for humans operating in dynamic environments with large amounts of information available [1, 8]. There has therefore been great interest in developing techniques and systems that support the cognitive processes of SAW, from the low level of perception to the high level of decision making.

2.2 Data Fusion and Situation Assessment


Due to human limitations in perceiving and comprehending the large amounts of information available from different sensors in the operating environment, models and methods for automatically fusing sensor information to achieve SAW have been developed [2, 3, 9]. This concept is referred to as data fusion, and the Joint Directors of Laboratories (JDL) group [3] has proposed a definition of data fusion as "the process of combining data to refine state estimates and predictions". The dominant model for data fusion, also proposed by the JDL, is presented below.

[Figure omitted: The JDL model for Data Fusion. Sources feed the Data Fusion Domain, which comprises Level 0: Sub-Object Assessment, Level 1: Object Assessment, Level 2: Situation Assessment, Level 3: Impact Assessment and Level 4: Process Refinement, supported by a Database Management System and connected to a Human/Computer Interface.]

Figure 2.2: The JDL model for Data Fusion

The different logical levels of data fusion are briefly described below:

Level 0 & 1 - Object Assessment: These levels correspond to the process of object fusion, utilizing information from multiple data sources over time to assemble a representation of the objects of interest in the environment. The resulting object assessments can include estimates and predictions of kinematics (velocity vector) and target type/ID.

Level 2 - Situation Assessment: This level essentially involves the association and combination of level 1 objects into aggregations, identifying relations between them. The relations can be of different types, including spatial, temporal and organizational relations. The identification of such relations between objects in the environment constitutes Situation Assessment (SA). It is important to understand the difference between SA, which is a process, and SAW, which represents a state achieved through [the process of] SA.


Level 3 - Impact Assessment: Impact Assessment corresponds to the estimation and prediction of the effects of planned, estimated or predicted actions on the different situations in the environment.

Level 4 - Process Refinement: Process Refinement is an adaptive process that identifies what is required to improve the level 1, 2 and 3 assessments and how sensors should be configured to obtain the most relevant data for improving the assessments.

It is not difficult to see the correspondence between Endsley's cognitive model of SAW and the process of data fusion just described [2]. However, it should be pointed out that the purpose of data fusion, as opposed to SAW, is to maintain a model of the situation external to the operator [3]. Thus, in order for a computer to support SAW in the mental domain, the data fusion model persisting in the technical domain must be communicated through a Human/Computer Interface.


3 AI in SA - State of the Art


In this chapter, an overview of the different approaches and AI techniques related to Situation Assessment is presented and briefly discussed. The first part of the chapter presents approaches that principally try to solve the problem of identifying particular situations and scenarios of interest in the operational picture. In the second part, approaches based on anomaly detection are presented. The chapter is concluded by a brief summary and discussion of the key observations and conclusions from the preliminary study, motivating the method and model chosen for further evaluation in this master's project.

3.1 Traditional Rule-based Expert Systems


A Rule-based Expert System is a knowledge-based system consisting of a rule base and a fact database. The domain knowledge of a human expert is encoded as rules, and the state of the world is represented as a set of facts in the fact database. Generally, a rule consists of an antecedent part and a conclusion part, where the conclusion can be deduced by logical inference if the antecedent is true. Whenever the facts corresponding to the antecedent part are true, the rule can be fired and the new facts corresponding to the conclusion are added to the fact base. The dynamics (i.e. the rule firing) of a rule-based system are controlled by an inference engine, also known as a reasoner.
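As a minimal illustration of this rule-firing mechanism, the Python sketch below performs naive forward chaining over propositional facts; the rules and facts are invented for illustration and are not taken from any actual expert system.

```python
# Minimal forward-chaining sketch of a rule-based expert system.
# A rule is (antecedent facts, conclusion fact); facts are plain strings.

rules = [
    ({"sunny", "summer"}, "nice weather"),            # IF Sunny AND Summer THEN Nice weather
    ({"nice weather", "weekend"}, "heavy boat traffic"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose antecedents hold until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, conclusion in rules:
            if antecedent <= facts and conclusion not in facts:
                facts.add(conclusion)  # fire the rule: add its conclusion to the fact base
                changed = True
    return facts

print(forward_chain({"sunny", "summer", "weekend"}, rules))
```

Real inference engines such as JESS use far more efficient matching (the Rete algorithm) and support variables, but the basic fire-until-fixpoint dynamics are the same.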

[Figure omitted: Overview of a Rule-based Expert System. A domain expert provides expert knowledge, which a knowledge engineer encodes as rules (e.g. IF Sunny AND Summer THEN Nice weather) in the rule base. The expert system consists of the rule base, a fact base (e.g. Fact 1: Sunny, Fact 2: Summer) and an inference engine.]

Figure 3.1: Overview of a Rule-based Expert System

Defining an ontology for the different types of objects, their relations and the situations of interest, and using a Rule-based Expert System to reason about these concepts, is one approach to performing SA. By identifying the objects and the basic relations between them in the situation picture, the rules can be used to infer higher-order relations corresponding to situations of interest to the operator. Variations of this approach have been used by Saab Systems [5] and other research groups [4] for achieving SAW (these approaches are described further below). One of the main advantages of rule-based systems is that they are relatively straightforward to implement. There are many inference engines available for developing customized expert systems simply by defining the rules. The fact that domain knowledge is expressed as rules makes the knowledge transparent and easy to interpret. However, one major drawback of classical rule-based systems is the lack of uncertainty modelling. Even though variations of rule-based systems incorporating uncertainty values for rules and propositions have been developed and implemented, this approach implies serious restrictions and careful considerations when developing the rule base [16]. Therefore, this approach is generally not recommended.

3.1.1 Learning in Rule-based Systems


Implementing algorithms that automatically learn new rules (i.e. without an operator manually defining them) is not an easy task, especially if we consider rules based on first-order logic with predicates and variables. One way to learn a set of rules based on propositional logic is to first learn a decision tree and then translate the tree into an equivalent set of propositional rules, one rule for each leaf node in the tree [15]. Another approach is to develop a genetic algorithm that encodes each rule set as a bit string and uses genetic search operators to explore this hypothesis space [15]. However, both of these approaches have two weaknesses. First, the algorithms are limited to variable-free propositional logic, i.e. they lack the expressive power of first-order logic for learning rules with predicates and variables. Second, the algorithms do not support incremental learning over time. Sequential covering algorithms are a family of greedy algorithms for learning rule sets based on the strategy of learning one rule at a time to incrementally grow the final set of rules [15]. In each learning iteration, a new rule covering as many training examples as possible is found, and all of the examples it covers are removed. This procedure is repeated until all of the training examples are covered. To be able to reason about relations between objects (which is relevant to SA), the rules have to support predicates and variables, i.e. first-order logic. Inductive learning of first-order rules is also known as inductive logic programming [16]. Generally, the learning strategy in inductive logic programming can be top-down oriented (general-to-specific rule induction), bottom-up oriented (specific-to-general rule induction) or a combination of both [15]. In a top-down approach, one normally starts out with a general hypothetical rule and repeatedly specializes it until it fits the data, i.e. until no training instances are misclassified.

In contrast, a bottom-up approach is more example-driven in the sense that one starts out with a specific training example and tries to construct a specific rule explaining it. The field of inductive rule learning is still at a rather academic level. Even though there exists a theoretical framework with well-defined algorithms [15], the state of the art appears to be quite limited. The most successful applications of learning systems based on first-order representations have been within the field of chemistry [13], learning different chemical properties. The main problem is that this type of learning requires a relatively large amount of training data composed of examples having the right features. Another problem, related to rule-based systems in general, is the difficulty of handling noise and uncertainty in data.
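The sequential covering strategy with top-down specialization described above can be sketched roughly as follows; the attribute-set representation of rules, the toy attributes and the greedy scoring heuristic are simplifying assumptions for illustration, not the exact algorithm of [15].

```python
# Sketch of sequential covering for propositional rules.
# An example is a set of true attributes; a "rule" is a set of attributes
# that must all be present for the rule to cover an example.

def rule_covers(rule, example):
    return rule <= example

def learn_one_rule(positives, negatives):
    """Greedily specialize a rule (general-to-specific) until it covers
    no negative examples."""
    rule = set()
    pos, neg = list(positives), list(negatives)
    candidates = set().union(*pos) if pos else set()
    while neg and candidates:
        # pick the attribute that keeps the most positives while excluding negatives
        best = max(candidates, key=lambda a:
                   sum(a in p for p in pos) - sum(a in n for n in neg))
        rule.add(best)
        candidates.discard(best)
        pos = [p for p in pos if rule_covers(rule, p)]
        neg = [n for n in neg if rule_covers(rule, n)]
    return rule

def sequential_covering(positives, negatives):
    """Learn one rule at a time, removing the positives each rule covers."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        covered = [p for p in remaining if rule_covers(rule, p)]
        if not covered:  # no progress: stop to avoid looping forever
            break
        rules.append(rule)
        remaining = [p for p in remaining if not rule_covers(rule, p)]
    return rules

# e.g. two positive and one negative example over invented vessel attributes:
print(sequential_covering([{"fast", "crossing"}, {"fast", "lane"}],
                          [{"slow", "lane"}]))
```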

3.1.2 Temporal Reasoning


Another relevant issue is temporal reasoning. Conventional rule-based expert systems do not provide a natural mechanism for temporal reasoning. However, time constraints can still be modelled for specific cases. At Saab Systems [5], for example, different time connectors constraining the temporal order of relations (i.e. meta-relations) have been successfully devised for use in their rule base.
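As a rough illustration of how such a time connector might work, the sketch below reduces a "before" constraint to comparing the observation times of two relations; the function, log format and relation names are hypothetical and are not Saab's implementation.

```python
# Illustrative "before" time connector: a meta-relation constraining the
# temporal order in which two basic relations must have been observed.

def before(relation_log, first, second):
    """True if relation `first` was first observed before relation `second`.
    `relation_log` maps relation names to their earliest observation time."""
    return (first in relation_log and second in relation_log
            and relation_log[first] < relation_log[second])

# e.g. a rendezvous scenario might require "approaching" to precede "alongside":
log = {"approaching": 10.0, "alongside": 42.5}
print(before(log, "approaching", "alongside"))
```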

3.1.3 The approach of Saab Systems


At Saab Systems, a prototype system for situation assessment within the domain of sea surveillance has been developed. The system has an agent-based architecture and an ontology defining the domain concepts that need to be communicated between the different agents. Each agent has a specific task, including, among others:

- Extracting and pre-processing input information to the system (Input agent)
- Finding basic relations, such as kinematical and spatial relations, between objects (Basic Relation Finder agent)
- Reasoning about relations and situations (Reasoner agent)
- Providing an interface and functionality for rule definition (Rule Editor agent)
- Providing database services (Database agent)

The ontology, being a data structure, is defined in the semantic web language Web Ontology Language (OWL)1. The first-order rules are defined in the OWL-compliant language Semantic Web Rule Language (SWRL)2. The reasoner agent is implemented with the Java rule-based engine JESS3. However, because JESS doesn't support OWL, rules defined in SWRL have to be translated to the JESS syntax by an auxiliary translation interface [5].

3.1.4 The SAWA approach


The work of Saab Systems mentioned earlier is very much inspired by the original approach to SAW developed by Versatile Information Systems (VIS) [4]. They have developed a system called SAW Assistant (SAWA). Central to their approach is an agent-based architecture and a core ontology, defined in OWL, that captures the relevant domain concepts. Rules are defined in SWRL, and a GUI-based rule editor called RuleVisor has been developed by the research team. Reasoning is performed by a JESS inference engine. However, in contrast to Saab, SAWA takes a top-down approach when reasoning about events, relations and situations. Even with an appropriate ontology, the number of possible relations definable within the ontology constraints can remain intractable. Thus, to further constrain a situation, the specific goals of the user are incorporated into the process. By knowing more specifically what an operator is looking for, automatic focusing of attention on relevant events and relations can improve the SAW of the operator. This type of relevance reasoning is handled by a dedicated agent in the SAWA system architecture. As is the case with Saab's system, SAWA currently lacks a model for handling uncertainties; the researchers are therefore investigating the possibility of using Bayesian techniques for this purpose. Moreover, the SAWA approach lacks a convenient solution to the temporal reasoning issue. In particular, there is no natural support for sequential ordering constraints on events and relations in time.

3.2 Bayesian Networks


Using different variations of Bayesian Belief Networks (BBNs) is a very common technique for probabilistic reasoning about events and relations in the context of situation awareness and situation assessment [7, 8, 9, 10 et al.].

1 http://www.w3.org/TR/owl-features/
2 http://www.w3.org/Submission/SWRL/
3 http://herzberg.ca.sandia.gov/

Artificial Intelligence for Situation Assessment

AI in SA State of the Art

BBNs are based on Bayes' rule, which allows the computation of the posterior probability associated with a particular proposition (an object or a relation, for instance), given the prior and conditional probabilities associated with it. A BBN can be represented as a graph, where the nodes correspond to propositions and the directed edges correspond to causal relationships between them. When setting up a BBN, prior probabilities and conditional probabilities have to be specified for each node in the network. Posterior probabilities for each node can then be updated according to Bayes' rule as new information is presented to the network. In particular, new information regarding the probability of objects and basic relations can be fused by the BBN to assess the probability of higher-level relations or situations.

Two important requirements are pointed out by Gonsalves and his colleagues when using BBNs for SA in dynamic real-time environments [8]:

1. Rapid modelling of complex situations via BBNs
2. Efficient BBN inference based on incoming evidence

For rapid modelling, they suggest that each BBN is constructed at real-time from a library of smaller component-like BBNs to assess a specific situation. To address the issue of efficient inference, they propose a way in which a BBN can be broken up into sub-networks and distributed across multiple computers, allowing computations to be carried out in parallel. In addition, the distribution mechanism allows computation at various levels of abstraction and granularity, suitable for hierarchical organizations. The powerful model for handling uncertainty and causal relationships, combined with a relatively straightforward implementation, makes the Bayesian Belief Network a very attractive model. However, BBNs are essentially propositional: the set of variables is fixed and finite and each variable has a fixed domain of possible values [16].
Regular BBNs lack the concepts of objects and relations and thus cannot take full advantage of the structure of the domain, or reuse parts of it [11]. These facts limit the application of BBNs in complex domains. To handle these constraints, different extensions to BBNs called Relational Probabilistic Models (RPM) [11, 16] have been proposed. First-order RPMs support much more complex models, including generic objects and relations, and are able to express facts about some or all objects. Another weakness of regular BBNs is that they lack a natural mechanism for temporal reasoning. Various forms of Dynamic Bayesian Networks have been proposed for handling this [16]. However, it is still possible to model temporal sequence constraints in a regular BBN by including multi-state nodes corresponding to a memory [7]. Finally, because Bayesian methods are based on probability theory, adequate statistics are required for making good models. Without a sufficiently large amount of data (training examples) or other prior knowledge, approximating the various probability distributions can be a difficult task. Furthermore, modelling the causal relationships may also prove non-trivial in a complex domain.
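As a minimal illustration of the posterior update described above, the following Python sketch applies Bayes' rule to a hypothetical two-node network (a Smuggling hypothesis with a single observed child node); all probabilities are invented for illustration and a full BBN would propagate such updates through every node of the graph:

```python
# A hypothetical two-node network: a "Smuggling" hypothesis node with a single
# observed child node "vessel stops in the area". All numbers are invented.

def posterior(prior, likelihood_true, likelihood_false):
    """Bayes' rule: P(H | e) = P(e | H) P(H) / P(e)."""
    evidence = likelihood_true * prior + likelihood_false * (1.0 - prior)
    return likelihood_true * prior / evidence

p_smuggling = 0.01            # prior belief in the hypothesis
p_stop_given_s = 0.9          # P(stop | smuggling)
p_stop_given_not_s = 0.1      # P(stop | no smuggling)

# Observing the stop raises the posterior well above the prior.
p = posterior(p_smuggling, p_stop_given_s, p_stop_given_not_s)
print(round(p, 4))
```

A single observation lifts the belief from 1% to roughly 8%; further evidence nodes would raise or lower it in the same fashion.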

3.2.1 Hierarchical Event Recognition


One approach to performing situation assessment is to identify events in the situation picture. According to Higgins [7], an event can be defined as a change in the relationship between objects that occurs over a period of time. In our case, this implies that one has to identify objects, their track motion and their relationships to other objects in order to successfully recognize events as object relationships that evolve over time. Generally speaking, complex events can be modelled as sequences of sub-events. These sub-events can in turn be decomposed further in a recursive manner. At the bottom of this recursive tree we find primitive events and primitive relationships between objects. These primitives constitute the base elements of the event hierarchy and need to be measured directly from the situation picture, which may involve different context-specific extraction algorithms. Typical primitives in the context of SA could be spatial relations between objects (in front of another object, to the right of it, etc.), kinematical relations (approaching, etc.), states of objects (inside a region, etc.) or simple events corresponding to changes in direction of motion.

3.2.2 Implementing Hierarchical Event Recognition through Bayesian Networks


Higgins [7] proposes an approach to event recognition using Bayesian networks consisting of input nodes and output nodes. The input nodes of a network correspond to the sub-events and primitives in the event hierarchy, and the output nodes correspond to the higher-order events. All nodes have an associated measure of probability ascribed to them. Furthermore, the causal dependencies between sequences of sub-events and primitives and the corresponding complex event are represented by the directed arcs between nodes and the associated conditional probabilities. The idea is to provide probabilities for the input sub-events (evidence) and let the network calculate the probabilities for the complex events presented in the output nodes.

To handle the sequential ordering constraint of the sub-events, i.e. when the time order of the sub-events is part of the definition of a complex event, additional multi-state nodes are included in the networks. A sequence node is a multi-state node that represents the belief in the state of a particular sequence, i.e. the probability that the sequential constraint is fulfilled for a particular complex event. The state of a sequence node is first set to some initial state and is later determined by 1) the input nodes corresponding to the sub-events of the sequence and 2) a memory node representing the previous state of the sequence node. These memory nodes provide a feature resembling a Dynamic Bayesian Network. One difference, however, is that the state nodes are not updated at every time step, as time is not explicitly incorporated in the model. Rather, they are updated in response to updates of the input nodes. The probability of a complex event is thus determined by the corresponding (multi-state) sequence node, the rest of the input nodes (for which there is no ordering constraint) and their associated conditional probabilities.


[Figure: a Bayesian network with input nodes (A is a suspicious foreign freighter; X is a typical smuggling area; vessel A stops in area X; vessel B approaches area X from area Y; vessel B returns to area Y), the multi-state nodes Memory and Sequence, and the output node Smuggling.]

Figure 3.2: An example of a Bayesian Network for the complex event Smuggling. The scenario involves smuggling as vessel B picks up illegal cargo previously dumped by vessel A. Notice the multi-state node Sequence that keeps track of the sequential state of the sub-events and the multi-state node Memory that stores the previous state of Sequence.

Taking a holistic view, the system can be seen as consisting of a number of instantiated event networks, where each network is associated with an instance of a complex event and the specific objects related to it. The dynamics of these networks are event-driven in the sense that changes in primitive relationships between objects and sub-events cause the network to process; new output from a network is only produced as a consequence of a change in an event state.

Higgins [7] points out that Bayesian networks do not necessarily require training data for configuration. This fact can be exploited when there is a lack of adequate training data, which is common when one tries to capture unusual events and behaviour. Under such circumstances, the structure and conditional probabilities of the network may be defined explicitly, based on the experience of a domain expert. However, assessing causal relationships and probabilities may still be far from trivial, even for a domain expert. One advantage of Bayesian networks is their capability of reasoning with uncertainties, allowing uncertainty in the input and providing a measure of probability for the output, i.e. the probability that an event has occurred or that a complex event is in a particular state. Thus, uncertainties in the primitives and the simple events may be propagated upwards through the hierarchy of events in a consistent manner.


3.3 Fuzzy Reasoning, Fuzzy Sets & Fuzzy Logic


Another approach to reasoning with uncertainties is Fuzzy Reasoning (also known as Approximate Reasoning), which is based on the theory of Fuzzy Sets and Fuzzy Logic. In classical set theory, the membership of elements in a set is assessed in binary terms according to a crisp condition: an element either belongs or does not belong to the set. By contrast, fuzzy set theory, an extension of classical set theory, permits a gradual assessment of the membership of elements in a set. The fuzzy membership function describes to what degree a particular element (entity or object) is considered to belong to a fuzzy set (concept). As an example, the membership function for the concept tall could state that somebody who measures 175 cm belongs to the set of tall people to the degree 0.5, while somebody who measures 200 cm is a member to the degree 1. In this example, the fuzzy set captures the vagueness of a linguistic term.
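A membership function of this kind is trivial to express in code. The sketch below is a hypothetical piecewise-linear membership function for tall, with break points chosen so that 175 cm maps to degree 0.5 as in the example above:

```python
def tall(height_cm):
    """Hypothetical piecewise-linear membership function for 'tall': degree 0
    below 150 cm, rising linearly to degree 1 at 200 cm, so that 175 cm maps
    to 0.5 as in the example."""
    if height_cm <= 150:
        return 0.0
    if height_cm >= 200:
        return 1.0
    return (height_cm - 150) / 50.0

print(tall(175), tall(200), tall(140))  # 0.5 1.0 0.0
```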

[Figure: crisp versus fuzzy membership functions for the sets short and tall over the length axis, with the crisp boundary at 175 cm.]

Figure 3.3: A comparison of classical crisp set boundaries versus fuzzy set boundaries for the concepts of short and tall. Notice in the case of crisp sets that a person measuring 174 cm is considered absolutely short while a person measuring 176 cm is considered absolutely tall, despite the insignificantly small difference in length.

Fuzzy logic is a technology that allows realistic complex models of the real world to be defined with simple qualitative (fuzzy) descriptions of the conditions and rules. In particular, systems based on fuzzy inference with fuzzy rules have been very successful within the field of complex process control [22]. The general methodology in this context is first fuzzification of sensor input data from numerical to linguistic (fuzzy set) format, then evaluation of the fuzzy rules through fuzzy inference, and finally de-fuzzification of the output (the conclusions of the fuzzy rules) to numerical format (serving as input to the process).


[Figure: a fuzzy rule system in a control loop: process output data (sensor data) is fuzzified, evaluated by the fuzzy rule system, and de-fuzzified into process input data (control data).]

Figure 3.4: Overview of a Fuzzy Rule System in the context of complex process control.

It is important to point out that the type of uncertainty captured by fuzzy logic is generally related to vagueness and imprecision rather than pure probability, which is associated with statistics. In the context of SA, concepts like objects, states of objects, properties and relations between objects could be modelled as fuzzy sets, which allows for more abstract and general rules. For example, the fuzzy kinematical relation approaching could describe to what degree an object A is approaching another object B. If object A is heading straight at object B and closing in on it, they would fulfil the relationship to a degree close to one. The more the course of object A deviates from the course straight towards B, the lower the degree of membership for the approaching relation between the objects. In contrast, in a probabilistic approach this relation would be considered either true or false with a certain probability.
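The fuzzify-infer-defuzzify loop can be sketched as follows. The rule base, membership functions and output levels are invented for illustration, and de-fuzzification is done with a simple weighted average rather than a full centroid computation:

```python
# Invented rule base: IF distance is close THEN braking is hard;
#                     IF distance is far   THEN braking is soft.

def close(d):
    return max(0.0, min(1.0, (50.0 - d) / 50.0))   # distance in metres

def far(d):
    return max(0.0, min(1.0, d / 100.0))

def defuzzify(strengths, output_levels):
    """Weighted average of the rule output levels (a simple stand-in for
    centroid de-fuzzification)."""
    den = sum(strengths)
    return sum(w * y for w, y in zip(strengths, output_levels)) / den if den else 0.0

d = 25.0                                    # crisp sensor reading
strengths = [close(d), far(d)]              # fuzzification + rule evaluation
braking = defuzzify(strengths, [1.0, 0.1])  # hard = 1.0, soft = 0.1
print(round(braking, 3))
```

At 25 m the close rule fires with strength 0.5 and the far rule with 0.25, so the de-fuzzified braking command lands between the two output levels, closer to hard.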

3.3.1 Fuzzy Inference Engines & Algorithms


A fuzzy logic extension to the Java rule-based inference engine Jess has been developed by the Canadian Institute for Information Technology [21]. The extension supports the definition of fuzzy concepts and the creation of fuzzy rules for performing fuzzy reasoning in a Java setting. Fuzzy concepts are represented using fuzzy variables, fuzzy sets and fuzzy values. A fuzzy variable is used to describe a general fuzzy concept and consists of a name (like length), its units (centimetres), a range (0-250 cm) and a set of fuzzy terms that are used to describe specific fuzzy concepts for this variable. The fuzzy terms are defined by a name (tall, short etc.) together with a corresponding fuzzy set that identifies the degree of membership of the term over the range of the fuzzy variable. The fuzzy variable terms can be combined with logical operators such as and, or and not to create an expression that is encoded in a fuzzy value, representing a specific fuzzy concept. A fuzzy rule includes two sets of fuzzy values representing the antecedent and conclusion parts of the rule.

FLINT [23] is another fuzzy toolkit, supporting fuzzy first-order predicates and fuzzy inference for the logic programming language Prolog. This toolkit is more powerful than the Java-based one mentioned above in the sense that it also supports Bayesian theory for modelling uncertainty in data. In particular, it supports the construction of fuzzy probabilistic rules combining fuzzy logic and Bayesian reasoning. In addition, it supports the use of the certainty values previously mentioned in the section on traditional rule-based expert systems.

An algorithm for performing weighted fuzzy reasoning for rule-based systems based on matrix operations is proposed by Chen [24].
Given a number of propositions with associated fuzzy truth values, the algorithm performs data-driven reasoning by computing (through matrix operations with a rule matrix) the fuzzy truth values for the remaining propositions in the fact base. To increase flexibility, fuzzy weights for each proposition in the antecedent part of a rule are supported, effectively giving priority to the more important conditions in the rule. In addition, the rules themselves can be (fuzzy) weighted to reflect their relative importance. Variations of this approach involving Petri nets [25, 26] have also been proposed, where the reasoning is more goal-oriented in the sense that the nets try to infer truth values for specific propositions based on a number of given truth values.
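The flavour of such matrix-based reasoning can be sketched as below. This is a simplified illustration using max-min composition with an invented rule matrix; it omits the per-proposition and per-rule weights of Chen's actual algorithm:

```python
# Three propositions; the invented rule matrix states that proposition 0
# supports proposition 1 (strength 0.9) and proposition 1 supports
# proposition 2 (strength 0.8).

def max_min_step(truth, rules):
    """One inference step: new truth of j = max over i of
    min(truth[i], rules[i][j])."""
    n = len(truth)
    return [max(min(truth[i], rules[i][j]) for i in range(n)) for j in range(n)]

rules = [[1.0, 0.9, 0.0],
         [0.0, 1.0, 0.8],
         [0.0, 0.0, 1.0]]
truth = [0.7, 0.0, 0.0]             # only proposition 0 is initially known

truth = max_min_step(truth, rules)  # infers proposition 1
truth = max_min_step(truth, rules)  # infers proposition 2
print(truth)
```

Starting from a single known truth value, repeated composition with the rule matrix propagates (capped) truth values to the rest of the fact base.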

3.4 Case-based Reasoning


Case-based reasoning (CBR) is normally associated with problem solving, where a particular problem is represented as a case and a corresponding solution is sought. To solve a problem in this context, the particular case is compared to previous cases stored in a library where the associated solutions are also stored. The key assumption is that problems can be solved by analogy, i.e. that similar cases have similar solutions. Thus, by identifying a sufficiently similar case and adapting its solution to the current case, the problem can be solved. If the adapted solution proves to be adequate, the newly solved case is stored in the library together with its solution. Should the adapted solution prove inadequate, a different adaptation of the solution is attempted, or another similar case is sought. In the context of SA, CBR can be used for identifying situations (corresponding to cases) as variations of known template situations (corresponding to cases stored in a library). In this context, the concept of a solution to a particular case is not relevant as it is in generic CBR; the problem is limited to identifying the correct situation template.

A high-level architecture for reasoning about events and situations has been proposed by Jakobson and his colleagues [20]. The system is based on two component technologies: real-time Event Correlation (EC) and CBR. An event in this context corresponds to a change of the state of an object, or a change of the state of a relation between multiple objects, at a specific point in time or during a specific time interval. EC is used for reasoning about events from a time perspective by correlating sets of events, while CBR is used for modelling situations as dynamic cases. A set of correlated events may trigger the invocation of a case, where the case adds further meaning to the set of events and infers a possible situation. In addition, the case may point out further information that is needed (from the event correlation agent or some other event data source) in order to strengthen the belief that a particular situation is present.

The main problems when trying to do SA by CBR are 1) how to represent a case and 2) how to measure similarity between different cases. The key problem is the representation; a smart representation will make it more straightforward to devise a suitable metric and algorithm for computing the similarity between cases. In the system described above, the situations (cases) can be expressed as rules containing sets of correlated events. However, the details of the invocation process, including the retrieval of cases from the library, are not described, i.e. how to measure similarity between the current situation and the stored situations. Generally, CBR is regarded as a good approach when it is difficult to construct an explicit model of the problem domain and the process associated with it, i.e. when it is difficult to formulate rules describing the situations [19]. The fast retrieval and adaptation of a solution that incorporates learning is based on a powerful analogy technique, resembling human cognition. However, as mentioned earlier, it is not straightforward to represent cases and to measure similarity between them.
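When cases can be represented as feature vectors, the retrieval step might be sketched as follows; the features, templates and similarity measure are hypothetical choices, not taken from [20]:

```python
import math

# Hypothetical case features: (speed in knots, distance to sea lane in km,
# number of correlated suspicious events).
library = {
    "routine transit":    (14.0, 0.1, 0),
    "possible smuggling": (2.0, 5.0, 3),
}

def similarity(a, b):
    return 1.0 / (1.0 + math.dist(a, b))

def retrieve(case):
    """Return the name of the most similar stored template."""
    return max(library, key=lambda name: similarity(case, library[name]))

print(retrieve((3.0, 4.5, 2)))
```

A slow vessel loitering far from the lanes with several correlated events retrieves the smuggling template; the hard part in practice is exactly what this sketch glosses over, namely choosing the representation so that Euclidean distance is meaningful.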


3.5 Anomaly Detection based on Machine Learning


A different approach to SA is to look for anomalies in the event pattern. This corresponds to the detection of abnormal traffic patterns mentioned in the problem description of this thesis. Holst [18] defines anomaly detection as a method for separating an often inhomogeneous and hard-to-characterize minority of data from a more regular majority, by studying and characterizing the majority so that data in the minority appear as deviations from the patterns found in the majority. An alternative to anomaly detection, as defined above, would be to characterize the minority of data and look for these characteristics in the total data set. However, this approach is usually not convenient, because the relatively small amount of data corresponding to the minority we wish to detect is irregular and difficult to characterize.

Generally, there are two main methods for performing anomaly detection: statistical methods and model-based methods [18]. The statistical methods are data-driven in the sense that they consider a set of (primarily) normal situations and try to characterize these. These examples constitute a training set that is used for training (building) the model using machine learning techniques. A trained model can then be used to evaluate new situations, classifying them as more or less anomalous based on their particular features. In contrast, model-based methods involve building more explicit models, based on domain expert knowledge, that capture normal situations.

3.5.1 Feature Models and Data Clustering


A common statistical method used in the context of anomaly detection is data clustering, which is a form of unsupervised learning. This method involves automatic grouping of training data into multiple unlabelled clusters according to distinguishing features. The assumption is that the presence of anomalies in the training data is either non-existent or relatively low. Assuming the latter implies that small clusters, i.e. clusters with few data points, are more likely to correspond to anomalous behaviour. New un-clustered points presented to the model are regarded as more or less anomalous based on the distance to the nearest clusters in feature space; points located close to a (fairly large) cluster are considered normal, while points ending up far away from all clusters are considered anomalous.

Fundamental to clustering is the representation (model) of the data in which we want to detect anomalies: which characteristics of the data we choose and how we model these characteristics. As is the case with generic classification, it is important to choose features that adequately distinguish between the relevant categories of the data; in this case normal and anomalous behaviour. In the context of sea surveillance, features of the vessel motion can be divided into layers according to the level of abstraction and the degree of locality in time and space of the feature model.

Holst [18] suggests a simple low-level feature model for vessel motion, based on the momentary velocities in the latitude and longitude directions within a particular area. The whole surveillance area is first discretized by introducing a grid. The normal velocities of the training data within each square of the grid are then modelled by a single two-dimensional Gaussian distribution. Anomaly detection is performed simply by calculating the likelihood of a new point under the normal distribution; if the likelihood is below a certain threshold, the point is regarded as anomalous.
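A sketch of this per-cell model, using synthetic velocity data, an invented threshold, and a diagonal covariance for brevity (the full model uses a general two-dimensional Gaussian):

```python
import math

def fit_cell(velocities):
    """Fit mean and (diagonal) variance of the 2-D velocities in one grid cell."""
    n = len(velocities)
    mean = [sum(v[i] for v in velocities) / n for i in (0, 1)]
    var = [sum((v[i] - mean[i]) ** 2 for v in velocities) / n for i in (0, 1)]
    return mean, var

def log_likelihood(v, mean, var):
    return sum(-0.5 * (math.log(2 * math.pi * var[i])
                       + (v[i] - mean[i]) ** 2 / var[i]) for i in (0, 1))

# Synthetic routine traffic in one cell: roughly (5, 0) m/s along a sea lane.
normal = [(5.0, 0.1), (4.8, -0.1), (5.2, 0.0), (5.1, 0.2), (4.9, -0.2)]
mean, var = fit_cell(normal)

THRESHOLD = -10.0    # illustrative log-likelihood threshold

def is_anomalous(v):
    return log_likelihood(v, mean, var) < THRESHOLD

# A vessel moving with the lane is normal; one moving against it is flagged.
print(is_anomalous((5.0, 0.0)), is_anomalous((-5.0, 0.0)))
```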


[Figure: training data points in the (longitude velocity, latitude velocity) feature space with a two-dimensional Gaussian fitted to them; a point whose likelihood under the distribution falls below the threshold triggers an anomaly detection.]

Figure 3.5: The principal idea of the Holst model. Normal (training) data is illustrated by the green data points in the velocity feature space. The two-dimensional Gaussian distribution modelling normal vessel velocities in the indicated area is illustrated by its mean and variance. A (red) point corresponding to an anomaly is shown.

Taking it to a global level, features of vessel motion can be analyzed over time by considering trajectories. By clustering similar trajectories corresponding to regular traffic, a model of normal vessel routes can be constructed [17]. This approach is described in more detail below. Furthermore, by extracting local higher-level events corresponding to manoeuvres in the motion patterns, it is possible to characterize more abstract features of the motion. An example of such a model has been proposed in this project and is presented and discussed later in the thesis.

3.5.2 Probabilistic approaches to Clustering


A probabilistic approach to clustering and anomaly detection is to make a statistical model of the data, where the possible values of the features are modelled as probability distributions. These probability distributions model what is considered normal behaviour, based on the feature values of the data set. The problem is how to construct these statistical models, i.e. how to find the probability distributions of the features given the data set. This can be solved by using statistical machine learning techniques to find the parameters of the approximating distributions that fit the data best, i.e. an approximation of the true distribution of the feature values.


3.5.2.1 Mixtures of Gaussian Densities and the EM-algorithm

A powerful statistical model for approximating arbitrary distributions is the Mixtures of Gaussian (MoG) densities model [27]. A MoG consists of a number of multivariate Gaussian distributions known as mixture components, where each component c has its own parameters: a mean value (μ_c) and a covariance (Σ_c). In addition, each component has an associated weight (π_c), where all weights are non-negative and sum to one. The problem is how to place these Gaussians in the feature space (i.e. how to find the parameters) so that every data point more or less belongs to a corresponding component distribution. This process can be regarded as a clustering problem, where each mixture component corresponds to a cluster. A popular technique for the parameter estimation is the Expectation-Maximization (EM) algorithm [15, 16, 27], which incrementally finds a set of parameters that maximizes the likelihood of the training data. The algorithm consists of two main steps that are performed iteratively until a certain end condition is fulfilled, usually a convergence condition. During the Expectation step, the algorithm estimates, for each data point x in feature space, the probability that the point was generated by each particular component c, taking the component weights π_c into account. This is done by computing, for each data point x and each component c, the posterior probability p(c | x) based on Bayes' rule:

    p(c | x) = p(x | c) p(c) / p(x) = π_c p(x | c) / Σ_{c'} π_{c'} p(x | c')    (3.1)

where p(x | c) is the likelihood of component distribution c generating data point x. This expectation of the point-to-component association is then used in the Maximization step, where the estimated parameters π_c, μ_c, Σ_c of each component c are updated according to a maximum likelihood criterion. The Maximization step involves adjusting the parameters of each component in such a way that the component better fits the data points, taking the posterior probabilities q_nc into account, where q_nc = p(c | x_n) for data point x_n. More specifically, the parameters are updated according to formulas 3.2 to 3.4:

    π_c = (1/N) Σ_{n=1..N} q_nc                                      (3.2)

    μ_c = (1/(N π_c)) Σ_{n=1..N} q_nc x_n                            (3.3)

    Σ_c = (1/(N π_c)) Σ_{n=1..N} q_nc (x_n - μ_c)(x_n - μ_c)^T       (3.4)
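The two steps can be sketched compactly in code. The sketch below implements the E-step of formula 3.1 and the M-step updates of formulas 3.2 to 3.4 for the one-dimensional case (the multivariate case replaces the variance by a covariance matrix); the data and initialization are synthetic:

```python
import math
import random

def gauss(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(data, pi, mu, var):
    N, C = len(data), len(pi)
    # E-step: posterior responsibilities q_nc = p(c | x_n), formula (3.1)
    q = []
    for x in data:
        num = [pi[c] * gauss(x, mu[c], var[c]) for c in range(C)]
        z = sum(num)
        q.append([v / z for v in num])
    # M-step: weight, mean and variance updates, formulas (3.2)-(3.4)
    for c in range(C):
        nc = sum(q[n][c] for n in range(N))
        pi[c] = nc / N
        mu[c] = sum(q[n][c] * data[n] for n in range(N)) / nc
        var[c] = sum(q[n][c] * (data[n] - mu[c]) ** 2 for n in range(N)) / nc
    return pi, mu, var

random.seed(0)
# Two well-separated synthetic clusters.
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(10.0, 1.0) for _ in range(200)])
pi, mu, var = [0.5, 0.5], [1.0, 9.0], [4.0, 4.0]
for _ in range(30):
    pi, mu, var = em_step(data, pi, mu, var)
print([round(m, 1) for m in sorted(mu)])
```

With this well-behaved data the component means converge to roughly 0 and 10; with a poor initialization the same loop can get stuck in a local maximum, which motivates the multiple-restart strategy discussed below.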


The updated model is then used to compute updated posterior probabilities in the Expectation step, and so on; i.e. the Expectation and Maximization steps are repeated in turn to incrementally improve the model. However, one of the major drawbacks of the classical EM-algorithm is that it is very sensitive to initialization. Depending on the starting values of the parameters, the algorithm may (and most probably will) converge to a local maximum that is different from the global maximum. Therefore, it is very common to perform multiple runs, restarting the algorithm with a new, more or less random, set of initial parameters each time it gets stuck in a local maximum. The parameters and the corresponding data likelihood are saved for each run, i.e. for each local maximum. Finally, the model corresponding to the maximum data likelihood is chosen. Another problem is that the model tends to degenerate during the EM process as some components may collapse. These collapses occur when a component centres on a very small cluster of points and shrinks until it covers only a single point, i.e. its variance approaches zero. The classical EM-algorithm does not provide any inherent mechanism for determining the number of mixture components. Various extensions to EM have been proposed to solve this problem, some of them involving greedy approaches in which components are added incrementally [27].

3.5.2.2 Anomaly Detection based on Trajectory Clustering

An Italian research group involved in video surveillance has proposed an on-line clustering algorithm that is able to group trajectories in real-time [17]. Here the clusters are dynamic, built in real-time as new trajectory data is acquired, thus enabling on-line learning. The trajectories are represented as lists of vectors, each encoding the spatial position (x- and y-coordinates) of an object at each time interval. The clusters are represented in a similar way: a list of vectors encoding the main trajectory and a local approximation of the cluster variance in each time interval. Because the time interval between the points in a trajectory is fixed, objects sharing the same spatial trajectory but having different speeds will still correspond to different clusters. In order to check whether a trajectory fits a given cluster, a distance measure must be defined. Because the Euclidean distance performs poorly in the presence of time shifts, the authors propose an alternative measure that better incorporates the time aspect. When a match is found between a trajectory and a cluster, the cluster is updated in such a manner that each cluster is a dynamic approximation of the mean and variance of all the trajectories that have matched it so far. To increase the model's ability to quickly adapt to changes in the normal behaviour over time, the weights of older tracks decrease exponentially with time. The trajectory model can be built up as a forest of hierarchical tree structures where each node represents a cluster segment. A trajectory can then be represented as a path in a tree.
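The matching and update steps might look roughly as follows; the distance measure, match threshold and forgetting factor are placeholder choices for illustration, not the authors' exact formulation [17]:

```python
FORGET = 0.9   # fraction of weight retained by the existing cluster statistics

class TrajectoryCluster:
    def __init__(self, trajectory):
        # trajectory: list of (x, y) positions sampled at fixed time intervals
        self.mean = list(trajectory)

    def distance(self, trajectory):
        # Naive per-time-step Euclidean distance; the authors propose a
        # time-shift tolerant measure instead.
        return sum(((px - mx) ** 2 + (py - my) ** 2) ** 0.5
                   for (px, py), (mx, my) in zip(trajectory, self.mean)
                   ) / len(self.mean)

    def update(self, trajectory):
        # Exponentially forget old tracks while absorbing the new one.
        for t, (px, py) in enumerate(trajectory):
            mx, my = self.mean[t]
            self.mean[t] = (FORGET * mx + (1 - FORGET) * px,
                            FORGET * my + (1 - FORGET) * py)

cluster = TrajectoryCluster([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
new_track = [(0.0, 0.2), (1.0, 0.2), (2.0, 0.2)]
if cluster.distance(new_track) < 1.0:    # matched: absorb the trajectory
    cluster.update(new_track)
print(cluster.mean)
```

A track that does not match any cluster would instead found a new cluster, which corresponds to the anomaly detection and tree-extension case described below.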


Figure 3.6: Square a) of the figure illustrates some recorded trajectories. In b), similar trajectories are first clustered into main classes. These classes are then segmented in c). In d), each cluster segment corresponds to a node in a hierarchical tree structure.

Each trajectory cluster segment (i.e. node in a tree) can be assigned a probability based on the number of trajectories that are associated with it (i.e. the number of trajectories passing through the cluster segment). The probability of a certain trajectory can then be evaluated as the product of the constituent cluster segment probabilities. An anomalous trajectory will thus appear as a path in the tree structure having a very small probability. The cluster model is dynamic in the sense that it can be updated incrementally in real-time as new trajectories are discovered. If a new trajectory is identified, the probabilities associated with each cluster segment are updated. If the new trajectory deviates too much from all of the previous paths, the model is extended by adding one or more nodes to the tree. The latter case obviously constitutes an anomaly detection. In addition to anomaly detection, the clusters can also be used as feedback for a low-level tracking system; information about the clusters is then used to enhance the predictions of the kinematical models based on linear Kalman filters.

Another approach to trajectory analysis has been proposed by the team from BAE Systems mentioned earlier [13]. They have developed a system that predicts near-future vessel location based on vessel type, current location, velocity and course. The prediction model is based on a feedback neural network trained by unsupervised associative learning, where the weights are updated via Hebbian learning. If the difference between the prediction and the observation is larger than a certain threshold, the system generates an alert of anomalous vessel motion.
In the context of SA, both approaches described above could be used for the analysis of basic motion patterns of vessels. Extracted suspicious behaviour related to specific vessels can then be used as input to some higher level reasoning system (e.g. a rule-based expert system), effectively fusing information to achieve a higher level of SAW. Because the character of the motion trajectories vary depending on the type of vessel, it is important to create different trajectory models for different vessel classes. In particular, the motion pattern of large cargo-ships is generally more constrained and follows relatively well defined routes. A good example of such vessels is car ferries that operate in a very


constrained environment. In contrast, the motion pattern of small private boats is generally irregular and differs from that of larger operational vessels. Thus, models based on trajectory clustering are probably most suitable for detecting larger vessels that deviate significantly from their normal operational routes.

3.5.3 A Neural Network Approach to Semi-Supervised Clustering


A research group from BAE Systems [12, 13] has lately been investigating the use of different learning mechanisms based on neural networks for learning behavioural patterns in the domain of port surveillance. The proposed system features an unsupervised clustering algorithm combined with a supervised mapping and labelling algorithm, giving an operator the option of guiding the system's learning. The goal of the system is to continuously learn to detect event anomalies with little or no operator supervision and with limited training data in a non-stationary environment. To learn a context-sensitive model of vessel behaviour at event level, they have developed a modified version of the Fuzzy ARTMAP neural network classifier [28]. The major modification of the algorithm involves the discovery of salient features that are sufficient for discriminating between classes.

The input to the system consists of momentary vessel reports including data about the vessel type and ID, current time, position, speed and course. To increase context sensitivity, additional information like weather conditions, season etc., as well as more detailed vessel properties (port of departure, home port, type of goods etc.), could be encoded as input. The speed and performance of the algorithm make it suitable for real-time interactive situations, wherein an operator can guide the learning by simple labelling. Learning normal (and anomalous) events can be done by prior off-line supervised training of the system with a set of observations (vessel reports) known to reflect normal (or anomalous) behaviour in different contexts. However, this is not necessary, as the system can learn different behaviours from vessel reports in real time (on-line), autonomously or via operator input. During operation, if a new event is not automatically considered normal or anomalous by the system, it is labelled as unknown and an alert is issued. This unknown event can then be classified by an operator.
However, should the unknown event be left unattended for a period of time, it will eventually be considered normal by the system. In the port-surveillance application described by BAE Systems [12], discrimination is made only between normal, anomalous and unknown behaviour. However, the class of anomalous events could be extended to multiple classes of behaviour corresponding to more specific suspected criminal or dangerous behaviour. A relevant limitation of this system is its inability to learn sequences or sets of events as behaviours. For example, the system may detect multiple events that are normal at an atomic level, while the order (i.e. the sequence) of the events is actually anomalous.

3.6 Conclusions of theoretical study


One of the goals in this project was to design and (partially) implement an alternative or complementary solution to the existing prototype implemented by Saab, able to solve at least one of the main sea surveillance problems: detection of abnormal traffic patterns and identification of particular scenarios. The current prototype developed by Saab is oriented towards identification of particular scenarios that evolve over time; it has a rule-based reasoning agent that can infer particular situations from known relations and situations (facts) that evolve over time. As is the case


with most conventional rule-based systems, the prototype lacks the ability to reason with uncertainties, and it has no support for automatic learning over time; two features that developers at Saab are, however, pursuing to some extent.

3.6.1 Introducing a new reasoning agent for handling uncertainties


One way to extend the current system to handle uncertainties could be to introduce Bayesian reasoning according to a model similar to that of Higgins. The concept of an event hierarchy, with complex events composed of sub-events, is actually very similar to the model used by Saab: the basic relations in the Saab ontology correspond to the primitives and sub-events, while situations in the Saab ontology correspond to complex events. However, defining all the causal relationships, prior probabilities and conditional probabilities may be far from trivial. To make at least fairly good estimates, a substantial amount of qualitative statistics and/or high-quality human expert knowledge related to the particular situations sought for is required.

An alternative approach, less sensitive to such definitions, is the Fuzzy Reasoning framework. Because Fuzzy Set Theory is essentially not based on probability theory, it does not depend on accurate estimates of probability distributions, as opposed to Bayesian Reasoning. In fact, Fuzzy Reasoning more closely resembles the way a human expert reasons, which motivates its use when emulating a human expert. A fuzzy framework similar to the one proposed by Chen could be introduced in the existing system. This would involve introducing fuzzy truth values for each proposition (instance of a relation or situation), fuzzy certainty values for the rules and fuzzy weights for the propositions in the rules. Specifying fuzzy truth values is easier and more intuitive than specifying accurate probability distributions.

However, the two extensions just discussed are still oriented towards identifying particular situations of interest. In order to construct such systems, specific knowledge regarding these situations must be encoded into the system.
This can be done manually by having a domain expert explicitly define the model (rules, causal relationships, distributions, etc.), or automatically by training the model using machine learning techniques, which requires a set of labelled situation examples.
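As a hedged illustration of how such fuzzy rule evaluation might look, the following Python sketch evaluates a single weighted rule. The combination scheme (weighted mean of premise truth values, scaled by the rule's certainty) and all numbers are illustrative assumptions for this sketch, not the operators of Chen's framework.

```python
def fuzzy_rule(premises, weights, certainty):
    """Evaluate one weighted fuzzy rule: the conclusion's truth value is the
    weighted mean of the premises' fuzzy truth values, scaled by the rule's
    fuzzy certainty value. (An illustrative combination scheme only.)"""
    weighted_mean = sum(t * w for t, w in zip(premises, weights)) / sum(weights)
    return weighted_mean * certainty

# A vessel is "suspicious" if it is fast (truth 0.8) and near a restricted
# area (truth 0.6); speed is weighted twice as heavily, and the rule itself
# carries a certainty of 0.9.
suspicion = fuzzy_rule([0.8, 0.6], [2.0, 1.0], 0.9)
```

The appeal, as noted above, is that an expert can specify such truth values and weights directly, without estimating full probability distributions.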

3.6.2 Introducing an anomaly detection agent


The scenarios and situations that the previously discussed system approaches try to find are by nature difficult to define; they usually occur rarely and may vary significantly from one example to another. Therefore, very few (if any) examples covering a fairly wide range of the situations are usually available. In addition, much of the relevant knowledge is held within the minds of domain experts, and extracting this knowledge is a difficult task. In other words, it is very difficult to accurately characterize generic situation templates. However, turning the problem around, these scenarios and situations can be detected as anomalies in a model of routine behaviour. Usually, large amounts of data corresponding to routine behaviour are available, which motivates the use of Data Mining and clustering techniques for building models of normal behaviour.

Therefore, a third extension, involving the introduction of a new anomaly detection agent, has been chosen for further investigation and evaluation in this project. Such an agent would address the problem of abnormal traffic pattern detection, where the anomaly detection could be based on one or more feature models. The systems for anomaly detection described earlier have the advantage of not requiring supervised training, i.e. they do not require labelled examples (even though some of the systems may still benefit from them). Once a feature model has been defined, the systems will


self-organize in order to find anomalous patterns. However, it should be noted that some initial supervised knowledge is still required in order to construct an adequate feature model capable of finding relevant anomalies. An anomaly detection agent could work in parallel with the reasoning agent and support data fusion by supplying the reasoning agent with additional qualified input. For example, the anomaly detection agent might identify a vessel with a more or less anomalous traffic pattern and classify it as suspicious. This information is then supplied to the reasoning agent, which may combine this fact with other facts to infer a particular situation, or simply focus attention on this particular vessel. Thus, an anomaly detection agent should be regarded as a complementary, rather than an alternative, solution to the existing prototype.


4 Design of an Anomaly Detection System


This chapter describes the anomaly detection approach chosen for further investigation and implementation in this master's project, including a description and motivation of the chosen design, method, models and algorithms. In addition, a more detailed description of the data is given.

4.1 Data Description


The surveillance data available in this master's project has been supplied by the Swedish naval intelligence battalion and consists of sea traffic recorded at the Malmö Sea Surveillance Centre, located in the south of Sweden. The data set covers sea traffic along the coast of southern Sweden, parts of the neighbouring coasts of Denmark, Germany and Poland, and the sea in between. A rough overview of the surveillance area is presented in the figure below.

Figure 4.1: Overview of the surveillance area. The red delineation roughly delimits the surveillance area.

The main part of the data consists of two continuous recordings: nine days of continuous autumn traffic during the period 11/11 to 20/11 2006 and seven days of summer traffic


during the period 1/7 to 7/7 2006. These recordings are assumed to reflect typical vessel traffic, containing a very low level of anomalies. In addition, six shorter scenarios, recorded during January, February, March, April and May, were also supplied, each with a duration of 2-6 hours. Common to all the shorter scenarios is the presence of a known anomalous situation. Two of the scenarios involve a collision between two larger vessels: a large passenger ferry with a smaller freight vessel, and a large foreign tanker with a smaller freight vessel. In three of the other scenarios the actual event is a grounding, while in the last scenario the reason for the stop is unknown but could be related to a potential smuggling scenario.

The tracks of the moving targets (vessels) are stored as rows in MS Access databases, where each row corresponds to a vessel report (data point). The columns of the Access database represent the attributes of the vessel reports, including, among others, target ID, latitude, longitude, absolute speed, course, timestamp and AIS number (MMSI). The majority (approximately 90%) of the tracks in the database correspond to data generated by AIS; these vessels are assigned an MMSI number different from zero. The rest of the tracks correspond to vessels identified and tracked by MST systems and are generally smaller ships lacking AIS (and thus having an MMSI equal to zero).

Apart from the labelled data corresponding to the vessels involved in the anomalous situations in the shorter scenarios, the data supplied in this project is unlabelled, which motivates the use of unsupervised learning techniques. It is assumed that a great majority of the unlabelled vessel traffic corresponds to regular traffic, although more or less anomalous patterns may still be present in this data. This assumption is of key importance, as it influences the learning strategy and the interpretation of the results during anomaly detection based on unsupervised clustering.

4.2 The feature models


The main challenge when constructing a system for anomaly detection is how to represent the data in which anomalies are to be found, i.e. how to find a suitable feature model. As discussed earlier, the choice of feature model is critical for finding the right type of anomalies in the data. Thus, in the case of sea surveillance, the main problem is how to parameterize the motion in a smart way.

4.2.1 Basic model based on momentary velocities in two dimensions


Two feature models have been chosen for implementation in this project. The first, basic model corresponds to the Holst model (see section 3.5.1), where the surveillance area is divided into a grid in which each square models the vessel velocities in the latitude and longitude directions in that particular area. The velocities of each vessel report are calculated by transforming the absolute speed and course, expressed in polar coordinates, into velocity components in Cartesian coordinates.
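This polar-to-Cartesian transformation can be sketched as follows. Python is used for illustration here (the thesis implementation is in MATLAB), and the navigation convention of course in degrees clockwise from north is an assumption about the recorded data.

```python
import math

def course_speed_to_velocity(speed_knots, course_deg):
    """Convert absolute speed and course (assumed to be degrees clockwise
    from north, the usual navigation convention) into Cartesian velocity
    components: east (longitude direction) and north (latitude direction)."""
    rad = math.radians(course_deg)
    v_lon = speed_knots * math.sin(rad)  # east component
    v_lat = speed_knots * math.cos(rad)  # north component
    return v_lon, v_lat

# A vessel heading due north at 10 knots has velocity (0, 10);
# heading due east at 10 knots gives (10, 0).
```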

The Automatic Identification System (AIS) is a system used by ships and vessel traffic systems, principally for identification of vessels at sea. Most ships using AIS are large operational vessels, and international law requires all ships of 300 gross tons or more to be fitted with AIS. For more information on AIS: http://www.imo.org/

The Multi Sensor Tracker (MST) is a tracking system developed by Saab that is based on data fusion of multiple sensors, in this case radars. For information on Saab's MST: http://products.saab.se/PDBWeb/ShowProduct.aspx?ProductId=808


This model was chosen as it serves as a simple base model for investigating anomaly detection. The simplicity of the model makes it rather generic and therefore applicable in other similar domains.

4.2.2 Extended model incorporating spatial position


The basic model captures anomalies corresponding to vessels travelling in anomalous directions and/or at anomalous speeds. However, it does not capture anomalies related to spatial position. Consider, for example, the obvious anomaly of a vessel travelling in the wrong direction in a two-way sea lane. Because the vessel has a speed and course corresponding to the regular traffic on the correct side of the sea lane, it will not be detected as an anomaly, since its speed and course are not correlated with its spatial position.

The second model is a development of the basic model that incorporates the spatial position. This is done by expanding the model to a four-dimensional feature space comprising the two velocity components in the latitude and longitude directions and the global position as latitude and longitude coordinates. Incorporating the spatial position makes the model more powerful, as it can more accurately distinguish between and characterize different types of vessel motion.

In principle, this model could be applied to the motion of the whole surveillance area at once. However, because the total number of data points may be very large, and the points may be rather widely spread out in this high-dimensional feature space, a large number of cluster components would be required to adequately cover all samples. As the numbers of data points and clusters increase, the computational complexity of the clustering algorithm may become a problem. Therefore, the surveillance area should be divided in a manner similar to the base model discussed earlier. How the map is subdivided matters for modelling vessel behaviour efficiently. In particular, a dynamic adaptive grid-division algorithm could be implemented, so that vessel behaviour within each grid cell is as homogeneous as possible, thus requiring relatively few mixture components. As an example, vessel behaviour on the open sea differs quite a lot from the (more constrained) behaviour in ports, and these areas should therefore be well separated. However, designing and implementing such an algorithm has been considered out of the scope of this master's project.

To summarize, the chosen feature models are relatively simple but are thought to serve as a convenient base for evaluating different cluster models and clustering algorithms for anomaly detection within sea surveillance. The type of anomalous behaviour expected to be captured by these models is rather elementary: vessels travelling too fast in speed-restricted areas, crossing sea lanes, travelling in the wrong direction in sea lanes, or remaining stationary in sea lanes or in areas where anchoring is prohibited.
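A minimal sketch of a (non-adaptive) grid division and the grouping of 4-D feature vectors per cell is given below; the grid origin and cell size are illustrative values, not those used in the project.

```python
from collections import defaultdict

def grid_cell(lat, lon, lat0=54.0, lon0=12.0, cell_deg=0.5):
    """Map a position to a (row, col) cell of a regular grid. The south-west
    corner (lat0, lon0) and the cell side in degrees are illustrative."""
    return int((lat - lat0) // cell_deg), int((lon - lon0) // cell_deg)

def group_by_cell(reports):
    """Group 4-D feature vectors (v_lat, v_lon, lat, lon) by grid cell, so
    that a separate cluster model can be trained per cell."""
    cells = defaultdict(list)
    for v_lat, v_lon, lat, lon in reports:
        cells[grid_cell(lat, lon)].append((v_lat, v_lon, lat, lon))
    return cells
```

An adaptive division would instead split or merge cells based on how homogeneous the behaviour within each cell is.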

4.3 Cluster models and clustering algorithms


In the Holst model, the velocities within each region are modelled by a single two-dimensional Gaussian distribution. A serious limitation of this model is the assumption that a single Gaussian component will adequately cover all possible occurrences of regular data points. Unless the regular data in a particular area is very homogeneous, this model is not efficient for modelling normal data. As an illustrative example, consider an area corresponding to a two-way travelling lane where the absolute speed is roughly the same in both directions. Regular data points in the velocity space will gather in two distinct clusters, where each cluster centre is the mirror of the other. If the clusters are of


roughly the same size, a single Gaussian will centre close to zero (i.e. approximately in the middle of the two clusters) and grow rather wide in order to cover both clusters. Data points located close to zero will thus be regarded as perfectly normal occurrences, which is not what we would expect in reality. On the contrary, such points may correspond to vessels that have stopped in the middle of a travelling lane and should thus be regarded as potential anomalies.

In order to model regular data more efficiently, other approaches involving more complex cluster models have been investigated. In particular, two models that support multiple clusters have been implemented and evaluated in this project: the MoG model and the Fuzzy ART Neural Network model.
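This limitation can be illustrated numerically. The Python sketch below fits a single one-dimensional Gaussian to synthetic two-way-lane speeds (all traffic values are synthetic, chosen only to mimic the example above):

```python
import math, random

random.seed(0)
# Two-way lane: half the reports head east at ~10 knots, half west at ~-10.
v = ([random.gauss(10, 1) for _ in range(500)]
     + [random.gauss(-10, 1) for _ in range(500)])

mu = sum(v) / len(v)                          # close to zero
var = sum((x - mu) ** 2 for x in v) / len(v)  # very large (~100)

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# The fitted single Gaussian rates a stationary vessel (v = 0) as at least as
# "normal" as the actual lane traffic at +/-10 knots -- the opposite of what
# an anomaly detector should conclude.
```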

4.3.1 The Mixture of Gaussians model


A natural extension of the Holst model is the introduction of a Mixture of multivariate Gaussians (MoG) model (see section 3.5.2.1) in place of the single Gaussian. Such a model is capable of approximating arbitrarily complex distributions in arbitrarily high dimensions and is a natural choice when there is no reason to believe that data follows any particular distribution. As in the Holst model, a new data point is evaluated by calculating the likelihood that the MoG generated it. This likelihood is then compared to a certain threshold; if the likelihood is below the threshold, the point is considered anomalous.

As previously discussed, the typical problems when constructing mixture models are 1) determining a suitable number of mixture components and 2) initializing the component parameters. When (normal) data is relatively inhomogeneous, a large number of mixture components is required to construct an adequate model. However, as the complexity of the model increases, so does the risk of over-fitting, a common problem within machine learning in general. Thus, finding the right number of mixture components is not a trivial problem. Furthermore, the data likelihood of the resulting mixtures usually varies with the parameter initialization, i.e. the algorithm usually finds different local maxima for different parameter initializations.

4.3.1.1 Greedy EM-learning

To solve the problem of finding the optimal number of components, and to avoid the need to run the algorithm multiple times from random initializations, an extension of the classical EM-algorithm has previously been developed [27]. The algorithm, known as Greedy EM-learning, is based on a greedy approach that iteratively builds the optimal mixture model by adding new components one at a time. Instead of starting with a more or less random configuration of a predefined number of components and improving upon this configuration with regular EM, the mixture model is built component-wise. The algorithm starts by determining the optimal one-component mixture. It then repeats two steps until a stopping criterion is met: 1) insert a new component and 2) apply EM until convergence.

Before inserting a new component, a set of randomly initialized candidate components is evaluated in order to determine the optimal new component for insertion in the current mixture model, i.e. the candidate component that maximizes the likelihood of the data in the new mixture. The evaluation of the candidate components is done by partial EM searches, one for each candidate in the set, where the parameters of the existing components are fixed, i.e. optimization is done only over the parameters of each candidate component. Partial EM searches are used purely for reasons of computational complexity. For each insertion problem, a fixed number of candidates per existing mixture component is considered when searching for the optimal component. Thus, the number of


candidates considered in each iteration increases linearly with the number of mixture components.

Figure 4.2: Sub-figures (a)-(d) illustrate the construction of a 4-component MoG during greedy EM. Each Gaussian component is shown as an ellipse, centred at the corresponding mean value and with a spread corresponding to the component's variance.

To evaluate the performance of the current mixture model, the log-likelihood of a test set is calculated after each new component insertion. If the log-likelihood has decreased since the previous insertion, the stopping criterion is fulfilled and the algorithm terminates, returning the previous mixture model as the solution. Should the number of mixture components exceed a certain predefined threshold, the algorithm also terminates, returning the current mixture components.
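The greedy outer loop and its stopping criterion can be sketched as follows. This is an illustrative Python reduction to one dimension: each step simply refits the full mixture with plain EM, whereas the algorithm in [27] runs partial EM over randomly initialized candidates with the existing components fixed.

```python
import math

def pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_1d(data, k, iters=30):
    """Plain EM for a 1-D MoG, initialized at k evenly spaced quantiles.
    (Sketch only: the greedy variant in [27] keeps existing components
    fixed while fitting each candidate; here we refit everything.)"""
    sd = sorted(data)
    mus = [sd[int(len(sd) * (j + 0.5) / k)] for j in range(k)]
    m = sum(data) / len(data)
    s0 = max(math.sqrt(sum((x - m) ** 2 for x in data) / len(data)), 1e-3)
    sig, pis = [s0] * k, [1.0 / k] * k
    for _ in range(iters):
        resp = []
        for x in data:                       # E-step: responsibilities
            w = [pis[j] * pdf(x, mus[j], sig[j]) for j in range(k)]
            s = sum(w) or 1e-300
            resp.append([wj / s for wj in w])
        for j in range(k):                   # M-step: update parameters
            nj = sum(r[j] for r in resp) + 1e-12
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sig[j] = max(math.sqrt(var), 1e-3)
            pis[j] = nj / len(data)
    return pis, mus, sig

def avg_loglik(data, model):
    pis, mus, sig = model
    return sum(math.log(sum(p * pdf(x, m_, s) for p, m_, s in zip(pis, mus, sig)) + 1e-300)
               for x in data) / len(data)

def greedy_em(train, test, max_k=10):
    """Outer greedy loop: grow the mixture one component at a time and stop
    when the held-out log-likelihood decreases (or max_k is reached)."""
    best = em_1d(train, 1)
    best_ll = avg_loglik(test, best)
    for k in range(2, max_k + 1):
        cand = em_1d(train, k)
        ll = avg_loglik(test, cand)
        if ll < best_ll:
            return best                      # previous mixture wins
        best, best_ll = cand, ll
    return best
```

On bimodal data such as the two-way-lane example, the loop accepts a second component (a large likelihood gain) and stops once additional components no longer improve the held-out likelihood.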

4.3.2 The Fuzzy ART Neural Network model


In order to have a reference system when evaluating the mixture model, a second cluster model was implemented in this project. The Fuzzy ART Neural Network has previously been applied in a very similar context [12] and has shown promising results.

4.3.2.1 What is Fuzzy ART?

ART stands for Adaptive Resonance Theory and encompasses a wide variety of neural networks based on neurophysiological models [28, 29]. In theory, ART networks are defined algorithmically in terms of detailed differential equations intended as plausible models of biological neurons. In practice, however, they are implemented using analytical solutions and approximations to these differential equations. Learning in ART networks can be supervised, unsupervised or a combination of both. The ART network implemented in this project is based solely on unsupervised learning. Unsupervised ART is very similar to other iterative clustering algorithms, including EM


(Mixture Models) and other neural networks like Vector Quantization (VQ). However, a crucial feature distinguishing ART from the rest is that the number of categories (corresponding to mixture components, or prototypes in VQ) is determined dynamically during learning. In other words, ART does not require a predefined number of categories before clustering. This feature is said to solve the classical stability-plasticity dilemma in machine learning by supporting fast learning without forgetting previously learnt knowledge.

Fuzzy ART is an extension of binary ART that incorporates fuzzy logic. Input data is fuzzified by extending the feature values from strictly binary values (0 or 1) to fuzzy membership values in the range [0, 1]. This allows for more flexibility, as analogue signals can be used directly as input. In particular, Fuzzy ART better supports the continuous feature values of the feature model chosen for implementation in this project.

4.3.2.2 The principal architecture of the Fuzzy ART Network

The Fuzzy ART Network is essentially a three-layer adaptive neural network. In the figure below, the vectors s, x and y correspond to the activity in the three layers: the Input Field, the Feature Field and the Category Field. All nodes in the Feature Field and the Category Field are connected through the bottom-up weights b_ij and the top-down weights t_ij. The Input Field simply serves as a placeholder for the input pattern during the pattern categorization process described below.

Figure 4.3: The principal structure of the ART Network architecture: the Input Field (activity s), the Feature Field (activity x) and the Category Field (activity y), connected by the bottom-up weights b_ij and the top-down weights t_ij.

4.3.2.3 The dynamics of unsupervised ART

Before an input pattern is presented to the network, it has to be normalized. Furthermore, the normalized input pattern should be complement coded in order to avoid category proliferation [29]. The dynamics of unsupervised ART can be summarized as follows: a pattern that resembles an existing category fairly well updates that category, while a pattern that does not resemble any existing category generates a new category.
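Complement coding itself is a one-line transformation (Python illustration):

```python
def complement_code(pattern):
    """Complement coding for Fuzzy ART: each normalized feature a in [0, 1]
    is paired with its complement 1 - a. The city-block norm of the coded
    vector is then constant (equal to the number of original features, up to
    rounding), which is what prevents category proliferation."""
    return list(pattern) + [1.0 - a for a in pattern]

coded = complement_code([0.2, 0.7])   # four values whose sum is 2
```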


4.3.2.3.1 Learning

During learning (clustering), a new pattern is presented to the network by activating the nodes of the Input Field, which in turn moves the pattern to the Feature Field. The x signal of the Feature Field then activates the Category Field through the bottom-up weights b_ij. The category node with the highest activation is chosen as the winner node, corresponding to the category that most resembles the input pattern, while all remaining nodes are inhibited. Next, the x signal is first reset and then reactivated by the y signal (i.e. the winning category) through the top-down weights t_ij. The updated x signal is then compared to the original input pattern s. If they are sufficiently similar, resonance is said to occur, which involves learning: the weights (in both directions) between the winning category node and the Feature Field are updated in order to adjust to the newly learnt pattern (compare with the updating of clusters).

However, if no resonance occurs, y is reset and the current winner node is inhibited. A new candidate winner node is then activated by repeating the process. This matching process is repeated until a category is found that resonates with the input pattern, i.e. one that is sufficiently similar to it. Eventually, if no resonance occurs with any of the existing categories, a new category is created, with weights representing the new input pattern.

The learning algorithm in ART can be regarded as a greedy iterative algorithm, in the sense that it incrementally builds and updates the network as new patterns are presented, and that the order in which patterns are presented influences how the network evolves and its final configuration.

4.3.2.3.2 Categorization without learning

Input patterns can also be categorized by the network without any learning taking place. The dynamics of the network are similar, but no learning occurs during resonance, i.e. the network returns the winning category without adjusting its weights. If no resonance occurs, the pattern is simply classified as unknown and no new category is created.

4.3.2.3.3 Parameters regulating the dynamics

The dynamics of the ART Network are essentially regulated by two parameters: vigilance and learning rate. The vigilance parameter lies in the interval [0, 1] and determines the threshold related to resonance. High vigilance makes greater demands on the similarity between input pattern and category for resonance to occur, which implies that more categories are generated (specialization). Lower vigilance loosens the resonance criterion, which implies that fewer categories are generated (generalization). The learning rate parameter also lies in the interval [0, 1] and simply determines how fast learning takes place, i.e. how fast the weights are updated. Setting the learning rate to one corresponds to the fast-learning mode, in which the weights are updated completely to a stable state before the next pattern is presented.
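The dynamics described above can be condensed into a short Python sketch, using the standard Fuzzy ART choice function, vigilance test and weight update [28, 29]. Complement-coded input is assumed, and the class structure and parameter defaults are illustrative, not a description of the project's actual implementation.

```python
def fuzzy_min(a, b):
    """Element-wise fuzzy AND (minimum)."""
    return [min(x, y) for x, y in zip(a, b)]

def norm1(a):
    """City-block norm |a|."""
    return sum(a)

class FuzzyART:
    """Minimal unsupervised Fuzzy ART. Expects complement-coded input in
    [0, 1]. rho is the vigilance, beta the learning rate (beta = 1 is fast
    learning), and alpha a small choice parameter."""
    def __init__(self, rho=0.75, beta=1.0, alpha=0.001):
        self.rho, self.beta, self.alpha = rho, beta, alpha
        self.w = []                                  # one weight vector per category

    def train(self, I):
        # Rank categories by the choice function T_j = |I ^ w_j| / (alpha + |w_j|).
        scores = sorted(((norm1(fuzzy_min(I, w)) / (self.alpha + norm1(w)), j)
                         for j, w in enumerate(self.w)), reverse=True)
        for _, j in scores:
            # Vigilance test: resonance iff |I ^ w_j| / |I| >= rho.
            if norm1(fuzzy_min(I, self.w[j])) / norm1(I) >= self.rho:
                m = fuzzy_min(I, self.w[j])          # resonance: update weights
                self.w[j] = [self.beta * mi + (1 - self.beta) * wi
                             for mi, wi in zip(m, self.w[j])]
                return j
        self.w.append(list(I))                       # no resonance: new category
        return len(self.w) - 1

    def categorize(self, I):
        """Categorization without learning; None means 'unknown'."""
        scores = sorted(((norm1(fuzzy_min(I, w)) / (self.alpha + norm1(w)), j)
                         for j, w in enumerate(self.w)), reverse=True)
        for _, j in scores:
            if norm1(fuzzy_min(I, self.w[j])) / norm1(I) >= self.rho:
                return j
        return None

# Complement-coded toy patterns: similar reports share a category,
# while a dissimilar pattern opens a new one.
cc = lambda p: list(p) + [1 - a for a in p]
net = FuzzyART(rho=0.8)
cat_a = net.train(cc([0.1, 0.1]))       # first pattern: creates category 0
cat_b = net.train(cc([0.9, 0.9]))       # fails vigilance: creates category 1
```

Raising rho towards 1 would force even the two similar patterns into separate categories, which is the specialization/generalization trade-off described above.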


5 Implementation
This chapter covers implementation details such as how the data was pre-processed, a description of the machine learning software packages, and how these were used to train and test the models. The machine learning algorithms and the scripts for presentation and pre-processing of data have been implemented in MATLAB. In addition, the initial part of the data pre-processing was implemented in Java.

5.1 Data pre-processing


The pre-processing of the vessel traffic data can be divided into two phases: first, extracting the data from the Access databases and storing it in a format directly readable by MATLAB; then, filtering to remove noise and other undesirable data, and sub-sampling to decrease the data size.

5.1.1 Data Extraction


The first phase involves extracting a subset of the attributes (columns) of each vessel report (row) in the databases (Access files) and storing this information in multiple text files. This is done by a Java program that first extracts each vessel report with the attributes target number, AIS number, latitude and longitude position, course, speed and timestamp. Each extracted vessel report is then stored as a row in a text file containing all vessel reports of a particular target, in chronological order according to the timestamp attribute. That is, a separate, sorted text file containing all vessel reports of each distinct target is created, where the name of each text file matches the corresponding target ID. Creating separate data files for each distinct target facilitates the analysis of individual target tracks over time.

The classes and methods of the Java Database Connectivity (JDBC) API were used to set up a client, connect it to the Access databases and perform SQL queries for retrieving the data. In order to connect to the Access databases, a JDBC driver supporting the Access protocol for data communication had to be installed. The choice fell on a platform-independent JDBC type 4 driver, written in Java, that converts the calls directly to the Access protocol.

5.1.2 Data Filtering


In the second phase, the data in each target text file is loaded into MATLAB, stored in matrix form and further pre-processed by various MATLAB scripts and functions. Once the data has been stored as matrices, it is filtered to remove noise, undesirable data and other anomalies that may interfere with the learning of normal patterns. The implemented filters are the following:

Coordinate filter: Removes targets located outside the surveillance area, corresponding to pure noise.

(The JDBC driver used was supplied by Hongxin Technology & Trade Ltd. of Xiangtan City (HXTT), http://www.hxtt.com/)


Speed filter: Removes targets with very high speeds (over 44 knots). Such data may correspond to very fast motor boats, helicopters or other anomalies and is therefore unsuitable for inclusion in a model of normal behaviour.

Course filter: Removes targets with a negative course, which corresponds to an unknown course.

Port filter: Removes targets located in port areas.

Targets located in port areas usually have speeds close to zero and are therefore not very interesting to analyze in this model. As a matter of fact, the presence of these vessels tends to cause numerical instability and collapse of the mixture model during learning. Therefore, a function for filtering out the data points corresponding to vessels in port areas was implemented. The port filter is based on a list of the latitude and longitude coordinates of the major ports in Sweden, Denmark, Germany and Poland; it simply removes all data points located within a certain radius of the port coordinates.

Finally, to decrease the size of the data without removing too much information, the data is sub-sampled by extracting every 10th sample from each individual target track. This corresponds to an updating interval of approximately three to four minutes between vessel reports. The updating interval was more or less arbitrarily chosen, but has been considered adequate with respect to the speed and character of the vessel motion.
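The filtering and sub-sampling pipeline can be sketched as follows (Python illustration; the area bounds, port list and port radius are placeholder values, not those used in the project).

```python
import math

# Illustrative constants -- the real area delineation, port list and radii
# used in the thesis are not reproduced here.
LAT_BOUNDS, LON_BOUNDS = (53.0, 57.0), (10.0, 20.0)
PORTS = [(55.61, 12.99)]          # e.g. roughly the position of Malmö harbour
PORT_RADIUS_DEG = 0.05
MAX_SPEED_KN = 44.0

def keep(report):
    """Coordinate, speed, course and port filters applied to a single vessel
    report (lat, lon, speed in knots, course in degrees)."""
    lat, lon, speed, course = report
    if not (LAT_BOUNDS[0] <= lat <= LAT_BOUNDS[1]
            and LON_BOUNDS[0] <= lon <= LON_BOUNDS[1]):
        return False              # outside the surveillance area: pure noise
    if speed > MAX_SPEED_KN:
        return False              # implausibly fast target
    if course < 0:
        return False              # negative course = unknown course
    if any(math.hypot(lat - p, lon - q) < PORT_RADIUS_DEG for p, q in PORTS):
        return False              # inside a port area
    return True

def preprocess(track, step=10):
    """Filter a chronologically sorted target track, then keep every 10th report."""
    return [r for r in track if keep(r)][::step]
```

Note that the port-radius test above treats degrees of latitude and longitude as equivalent distances, which is a simplification at these latitudes.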

5.2 Software packages


To save time, software packages including MATLAB functions for building and training Gaussian mixture distributions and ART-networks have been used for implementing the proposed models and algorithms. These functions have been developed by researchers and are available for download at public websites.

5.2.1 Description of the EM-based Gaussian Mixture Model package


Functions for training Mixture of Gaussians models with the EM algorithm have been developed by the researchers Nikos Vlassis (currently at the University of Amsterdam) and Sjaak Verbeek (previously at the University of Amsterdam). They have implemented a version of the greedy learning of Gaussian mixture models proposed in articles by Verbeek. The package contains a main function, em.m, for performing greedy EM, and a number of auxiliary functions used by the main function. The main function takes the following input arguments:

Training data set
Test data set (optional)
Maximum number of mixture components
Number of candidate clusters per component of the current mixture
A plot flag and a diagnostics flag

The fourth argument refers to the number of candidate clusters to be considered when inserting a new mixture component during greedy EM. If this number is set to zero, standard non-greedy EM is performed instead. The test data set is optional if non-greedy EM is performed. However, a test set is required during greedy EM, as the stopping criterion depends on the current mixture's performance on the test set: when performance starts to decrease as the number of components increases, the algorithm terminates. If the plot flag is enabled and the data is represented in one or two dimensions only, a plot window appears illustrating the dynamics and result of the EM algorithm. The window shows all the data points in the background and iteratively updates the plots of the mixture components as their means and covariances are adjusted to better fit the data. Apart from the plot window, the output of the main function when the algorithm has terminated is a matrix containing the weight, mean and covariance of each mixture component. If test data was supplied, the average log-likelihood of the test data is also returned.

[7] http://www.world-register.org
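The stopping criterion can be illustrated in Python with a plain (non-greedy) EM loop: mixtures of increasing size are fitted and the process stops as soon as the held-out log-likelihood no longer improves. This is a simplified one-dimensional sketch of the model-selection idea, not Verbeek's greedy component insertion; the quantile-based initialization is an assumption:

```python
import numpy as np

def em_gmm_1d(x, k, iters=50):
    """Plain EM for a 1-D Gaussian mixture (not the greedy insertion of em.m)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means over the data
    var = np.full(k, x.var() + 1e-6)
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[n, j] of component j for point n
        d = x[:, None] - mu[None, :]
        p = w * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / (p.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means and variances
        n = r.sum(axis=0) + 1e-12
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var

def avg_loglik(x, w, mu, var):
    """Average log-likelihood of x under the mixture (w, mu, var)."""
    d = x[:, None] - mu[None, :]
    p = (w * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)).sum(axis=1)
    return float(np.log(p + 1e-300).mean())

def select_components(train, test, k_max=20):
    """Grow the mixture until the test-set log-likelihood stops improving."""
    best_score, best_k, best_params = -np.inf, 0, None
    for k in range(1, k_max + 1):
        params = em_gmm_1d(train, k)
        score = avg_loglik(test, *params)
        if score <= best_score:   # performance decreased: terminate
            break
        best_score, best_k, best_params = score, k, params
    return best_k, best_params
```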

5.2.2 Description of the Fuzzy ART Neural Network package


The fuzzy ART neural network package for MATLAB used in this master's project was downloaded from MATLAB Central, an online MATLAB exchange community. The functions are written by Aaron Garrett and have received very good reviews from other community members[8]. The package allows creation, training and testing of fuzzy ART neural networks. Four main functions called by the user are included: functions for creating a network, training the network, complement coding the input patterns and categorizing a new input pattern. A number of auxiliary functions called by the main functions are also included in the package.

The fuzzy feature of the ART network implies that the features can take continuous values. In this implementation, the value range for each feature is between 0 and 1, so the input data first has to be normalized to fit this range. In addition, ART requires that patterns are extended by their complement coding before they are presented as input to the network. This is done by the specific complement coding function supplied in the package.

When creating a new ART network, the only necessary input from the user is the number of features (dimensions) of the patterns (data). However, there are a number of parameters of the ART network that affect the learning of categories (clusters). If not manually set by the user, these parameters take the default values given in parentheses: vigilance (0.75), maximum number of categories (100), bias (0.000001), number of epochs (100) and learning rate (1, i.e. fast learning). The network training function simply takes the network and the training set as input. The output is the updated network together with a vector listing the category assigned to each training sample. Categories are encoded as positive integers, starting at 1.
To categorize (test) a new pattern, it is passed to the categorization function, which returns a vector listing the categories assigned to each test sample. If a sample is too far away from any existing category, i.e. resonance is not achieved, that particular sample is assigned category -1, indicating that it is unknown to the network.
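The complement coding, resonance test and category numbering described above can be condensed into a minimal fuzzy ART sketch in Python (the actual package is Garrett's MATLAB code; this sketch assumes fast learning and omits the epoch loop):

```python
import numpy as np

def complement_code(x):
    """Fuzzy ART input coding: a pattern a (values in [0, 1]) becomes [a, 1 - a]."""
    return np.concatenate([x, 1.0 - x])

class FuzzyART:
    """Minimal fuzzy ART sketch; parameter names follow the text above."""
    def __init__(self, n_features, vigilance=0.75, bias=1e-6, max_categories=100):
        self.rho, self.alpha, self.max_cat = vigilance, bias, max_categories
        self.w = []              # one weight vector per category
        self.m = n_features      # |I| = n_features for complement-coded input

    def _match(self, i, w):
        return np.minimum(i, w).sum() / self.m          # |I ∧ w| / |I|

    def train(self, i):
        # Visit existing categories in order of decreasing choice value
        order = (np.argsort([-np.minimum(i, w).sum() / (self.alpha + w.sum())
                             for w in self.w]) if self.w else [])
        for j in order:
            if self._match(i, self.w[j]) >= self.rho:   # resonance test
                self.w[j] = np.minimum(i, self.w[j])    # fast learning
                return j + 1                            # categories numbered from 1
        if len(self.w) < self.max_cat:
            self.w.append(i.copy())                     # commit a new category
            return len(self.w)
        return -1

    def categorize(self, i, vigilance=None):
        rho = self.rho if vigilance is None else vigilance
        for j, w in enumerate(self.w):
            if self._match(i, w) >= rho:
                return j + 1
        return -1                                       # unknown: no resonance
```

With vigilance close to 1 the network forms many narrow categories; lowering the categorization vigilance admits more samples, mirroring the anomaly threshold discussed later.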

[8] http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=4306&objectType=File


5.3 Training the models


The training of the mixture models and ART networks is done by two separate MATLAB scripts, one for each model type. The training scripts use the functions from the corresponding software packages to implement the machine learning algorithms. Before executing the scripts, a training data set has to be specified. When executed, the scripts start by discretizing the area into a specified number of grid cells in the latitudinal and longitudinal directions. The scripts then iterate through each cell: extracting the training samples in that particular cell, calculating the velocity in the latitude and longitude directions for each sample, training the corresponding model and storing its parameters. However, if there are not enough training samples available in a particular cell, no model is trained.
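The per-cell training loop might be sketched as follows (illustrative Python; the names grid_index and train_grid_models, the column layout and the fit callback are assumptions, while the grid discretization and the 10-sample minimum come from the text):

```python
import numpy as np

def grid_index(lat, lon, area, n_lat=30, n_lon=40):
    """Map a position to its grid cell; 30 x 40 matches the evaluation setup."""
    lat_min, lat_max, lon_min, lon_max = area
    i = min(int((lat - lat_min) / (lat_max - lat_min) * n_lat), n_lat - 1)
    j = min(int((lon - lon_min) / (lon_max - lon_min) * n_lon), n_lon - 1)
    return i, j

def train_grid_models(samples, area, fit, min_samples=10):
    """Train one model per cell; cells with too few samples get no model.

    `samples` is an (n, 4) array [lat, lon, v_lat, v_lon]; `fit` is any model
    trainer (a MoG or ART learner) applied to the samples of one cell.
    """
    cells = {}
    for s in samples:
        cells.setdefault(grid_index(s[0], s[1], area), []).append(s)
    return {k: fit(np.array(v)) for k, v in cells.items() if len(v) >= min_samples}
```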

5.3.1 Training the Mixture models


Training of the mixture models is done with the greedy EM algorithm. Therefore, a corresponding test data set has to be extracted as well for each grid cell. The training set and test set are passed to the greedy EM algorithm, which returns the parameters of the optimal number of mixture components. Because mixture models are essentially based on statistics and probability theory, the number of training samples should be reasonably large in order to build an adequate model. Therefore, no model is built if the number of training samples in a particular cell is less than 10.

5.3.2 Training the ART networks


When training an ART network, data is first normalized to the interval [0, 1] and then complement coded, before it is passed to the training function together with the network. Furthermore, the vigilance parameter of the network has to be specified before training. As with the mixture models, no network is trained if the number of training samples is less than 10.

5.4 Performing Anomaly Detection


Anomaly detection based on the trained models is done by two main scripts, one for each model type. Before executing the scripts, the evaluation data set subject to anomaly detection has to be specified. When executed, the scripts iterate through each grid cell: extracting the corresponding samples from the evaluation set, calculating the latitude and longitude velocities, retrieving the cell-specific model and presenting the data to it. Each sample is classified by the model as either normal, anomalous or unknown.

5.4.1 Anomaly detection based on Mixture Models


When performing anomaly detection based on mixture models, the log-likelihood of each sample is calculated. If the log-likelihood of a particular sample is below a specific anomaly threshold, that sample is regarded as anomalous. If no model is available for a particular cell (due to an insufficient amount of training data), all the evaluation samples of that cell are classified as unknown.
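The classification rule can be sketched in Python, assuming a trained mixture is stored as a (weights, means, covariances) triple (an assumption; the MATLAB package returns these parameters as one matrix):

```python
import numpy as np

def mog_logpdf(x, weights, means, covs):
    """Log-likelihood of points x (n, d) under a Gaussian mixture model."""
    n, d = x.shape
    total = np.zeros(n)
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        inv = np.linalg.inv(cov)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
        # Quadratic form diff' inv diff evaluated row-wise
        total += w * norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))
    return np.log(total + 1e-300)

def classify(x, model, threshold):
    """Anomalous if log-likelihood < threshold; unknown if the cell has no model."""
    if model is None:
        return np.full(len(x), 'unknown')
    loglik = mog_logpdf(x, *model)
    return np.where(loglik < threshold, 'anomalous', 'normal')
```

Thresholds such as the -30 and -17 used in the global experiments would be passed as the threshold argument.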

5.4.2 Anomaly detection based on ART Networks


When performing anomaly detection based on ART, the evaluation samples are categorized by the network. Before categorizing, the vigilance parameter is specified, corresponding to the anomaly threshold. Samples that are unrecognized by the network (i.e., too far away from any of the network's categories for resonance to occur) are classified as anomalous. If no network is available for a particular cell, all the corresponding evaluation samples are classified as unknown.


6 Experimental evaluation
This chapter describes the setup, goals and results of the experimental evaluation of the anomaly detection system described and implemented in this project.

6.1 Experimental setup


The main goal of the experimental evaluation is to investigate what types of anomalous behaviour can be captured by the proposed feature models and algorithms, i.e. an analysis based mainly on the qualitative results from performing anomaly detection. The experiment is divided into two main parts. In the first part, we investigate the character of the most distinguishing anomalies found in the unlabelled data by the proposed feature models and clustering algorithms. Furthermore, we investigate to what extent the proposed feature models and clustering algorithms detect the same anomalies. In the second part, we evaluate to what degree the labelled anomalies, both real and artificial, can be found by the MoG models, using either the two-dimensional or the four-dimensional feature model. In particular, we investigate how well the systems can distinguish the labelled anomalies from unlabelled local background data reflecting typical traffic.

6.1.1 Training sets


When training the MoG and ART models, the seven-day summer recording has been used as the training set. Furthermore, when training the MoG models, the nine-day autumn recording has been used as a test set; recall that greedy EM requires a test set to evaluate the current performance during training. It is important that the training set and test set are not too closely related in time, as this may imply undesired correlation; ideally, there should be no correlation between them in order to get a good estimate of model performance. Therefore, the autumn recording has been used as test data when training on the summer recording.

6.1.2 Evaluation sets


The data used to evaluate the trained models during anomaly detection is of four types:

Unlabelled data, reflecting typical vessel traffic, from the six shorter recorded scenarios
Unlabelled data, reflecting typical vessel traffic, from the longer autumn recording
Labelled data, corresponding to anomalous situations, in the six shorter recorded scenarios
Artificially created data, corresponding to vessel behaviour that would be regarded as more or less anomalous by a human supervisor

The labelled anomalies in the recorded scenarios comprise three grounding scenarios, two collision scenarios and one unexpected stop scenario. The artificially created data includes two simulated scenarios: a speeding scenario and a smuggling scenario. Together, these scenarios include anomalies corresponding to vessels that are:

crossing or travelling more or less perpendicular to main sea lanes
travelling relatively fast in sea lanes and other speed-normalized areas
stopping or travelling very slowly in the middle of or close to sea lanes

6.1.3 Grid size


The grid size is fixed to 30 squares in the latitude direction and 40 squares in the longitude direction. This value was found to be well balanced in the sense that it supports adequate resolution without being too computationally demanding.

6.1.4 Model specific parameters


6.1.4.1 Mixture Models

When training the MoGs with greedy EM, the maximum number of mixture components was fixed to 20. However, this number of components was never reached during training. The number of candidate components per mixture component during insertion was fixed to 10. During anomaly detection, the only parameter that has been adjusted is the anomaly likelihood threshold.

6.1.4.2 ART Networks

The learning rate has been set to one (i.e. fast learning) and the vigilance parameter fixed to 0.95 for all networks during learning. However, the vigilance value for categorization has been varied during anomaly detection.

6.2 Results
Qualitative results from the experimental anomaly detection are presented as figures, illustrating traffic corresponding to training data and evaluation data classified by the model. The colour coding for the traffic is as follows:

Light grey traffic corresponds to the model's training data.
Dark grey traffic corresponds to evaluation data classified as normal by the model.
Black traffic corresponds to evaluation data classified as anomalous by the model.

6.2.1 Global Anomaly Detection in unlabelled data


This section shows results from performing anomaly detection on the unlabelled data from the six scenarios at a global level, covering the whole surveillance area. Four different types of models are evaluated: four-dimensional MoGs, two-dimensional MoGs, four-dimensional fuzzy ART networks and two-dimensional fuzzy ART networks. Data points are plotted in the two-dimensional longitude and latitude space, i.e. the velocities of the data points are not illustrated. Because the unlabelled scenario data is assumed to reflect more or less typical traffic patterns, one would expect an anomaly detection system to detect very few, if any, anomalies in this data. Therefore, the likelihood thresholds and vigilance values have been manually tuned to levels that generate very few anomalies in the unlabelled scenario data. Furthermore, the parameters have been tuned so that the models generate approximately the same number of anomalies for the whole set of unlabelled scenario data.


Figure 6.1: Results from global anomaly detection of the unlabelled data using a four-dimensional MoG for each square, with the anomaly threshold set to -30.

Figure 6.2: Results from global anomaly detection of the unlabelled data using a two-dimensional MoG for each square, with the anomaly threshold set to -17.


Figure 6.3: Results from global anomaly detection of the unlabelled data using a four-dimensional ART network for each square, trained with vigilance value 0.95 and evaluated with vigilance value 0.94.

Figure 6.4: Results from global anomaly detection of the unlabelled data using a two-dimensional ART network for each square, trained with vigilance value 0.95 and evaluated with vigilance value 0.92.


6.2.1.1 Quantitative results from global anomaly detection

This section presents quantitative results from comparing the sets of anomalous data points detected by the different models. The purpose is to show to what extent the different feature models and clustering algorithms detect the same anomalies. Let:


A_4dim-MoG be the set of unlabelled points classified as anomalous by the four-dimensional MoG
A_2dim-MoG be the set of unlabelled points classified as anomalous by the two-dimensional MoG
A_4dim-ART be the set of unlabelled points classified as anomalous by the four-dimensional ART network
A_2dim-ART be the set of unlabelled points classified as anomalous by the two-dimensional ART network
A_MoG = A_4dim-MoG ∪ A_2dim-MoG be the union set of unlabelled points classified as anomalous by the mixture models
A_ART = A_4dim-ART ∪ A_2dim-ART be the union set of unlabelled points classified as anomalous by the ART networks
A_4dim = A_4dim-MoG ∪ A_4dim-ART be the union set of unlabelled points classified as anomalous by the four-dimensional feature models
A_2dim = A_2dim-MoG ∪ A_2dim-ART be the union set of unlabelled points classified as anomalous by the two-dimensional feature models
D be the complete set of unlabelled data classified by the models

Then, for each particular pair of feature model and clustering algorithm, the number of points classified as anomalous divided by the total number of data points is:

|A_4dim-MoG| / |D| = 0.72%
|A_2dim-MoG| / |D| = 0.74%
|A_4dim-ART| / |D| = 0.78%
|A_2dim-ART| / |D| = 0.73%

Let us define a similarity measure between two sets of (anomalous) points as:

S(A_1, A_2) = |A_1 ∩ A_2| / |A_1 ∪ A_2|


Using this similarity measure, we can now evaluate to what degree the different pairs of feature models and/or clustering algorithms detect the same anomalies:

S(A_4dim-MoG, A_2dim-MoG) = 63.24%
S(A_4dim-MoG, A_4dim-ART) = 59.66%
S(A_4dim-MoG, A_2dim-ART) = 56.01%
S(A_2dim-MoG, A_4dim-ART) = 69.28%
S(A_2dim-MoG, A_2dim-ART) = 64.95%
S(A_4dim-ART, A_2dim-ART) = 81.54%
S(A_MoG, A_ART) = 60.84%
S(A_4dim, A_2dim) = 70.60%
Comparing the intersection of all four sets of anomalous points with their union, we get a measure of to what degree all four models detect the same anomalies:

|A_4dim-MoG ∩ A_2dim-MoG ∩ A_4dim-ART ∩ A_2dim-ART| / |A_4dim-MoG ∪ A_2dim-MoG ∪ A_4dim-ART ∪ A_2dim-ART| = 46.75%
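This similarity measure is the Jaccard index of the two anomaly sets; a short Python sketch (the index sets below are made up for illustration):

```python
def jaccard(a, b):
    """S(A1, A2) = |A1 ∩ A2| / |A1 ∪ A2|; 1.0 for identical sets, 0.0 for disjoint."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical indices of data points flagged as anomalous by two models:
mog_anomalies = {3, 7, 19, 42}
art_anomalies = {7, 19, 42, 57, 88}
similarity = jaccard(mog_anomalies, art_anomalies)  # 3 shared / 6 total = 0.5
```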


6.2.2 Local Anomaly Detection of unlabelled data


This section presents local views of the global anomaly detection results presented above. The chosen areas illustrate anomalies either jointly detected by all four models, or exclusively detected by the MoG models or the ART models. Data points are represented as vectors positioned in the longitude and latitude space: the position of a vector corresponds to the vessel's location, and its length and direction correspond to the vessel's velocity.

6.2.2.1 Anomalies found by all four models

The following figures present three sample areas where all anomalies in the unlabelled data, for the particular area, are jointly detected by all four models.

Figure 6.5: Obvious anomaly detections; the black tracks correspond to vessels travelling in an area where no other vessels have travelled before and in the opposite direction of the main sea lane.


Figure 6.6: The majority of the tracks are concentrated in the sea lane going in the north-west direction, while a minority deviate from it. Two vessel tracks are detected as anomalous: the first travelling close to the sea lane but in exactly the opposite direction, and the other crossing the sea lane at a narrow angle.

Figure 6.7: The main sea lane is less distinct in this area. However, the detected anomaly stands out clearly as a vessel crossing the local tracks at a relatively high speed.


6.2.2.2 Anomalies found exclusively by MoG models

The following figures present three sample areas where all anomalies in the unlabelled data, for each particular area, are exclusively detected by the two MoG models.

Figure 6.8: The main two-way sea lane is clearly illustrated by the high concentration of light grey training data in two opposite directions. The anomaly, exclusively found by the MoG models, corresponds to a vessel travelling in an anomalous direction as it crosses the main sea lane.


Figure 6.9: The main two-way sea lane is clearly illustrated by the high concentration of light grey traffic in two opposite directions. A minority of the traffic travels in parallel to the main sea lane, but still follows the general direction of motion. The anomaly, exclusively found by the MoG models, corresponds to a vessel travelling in an anomalous direction as it crosses the direction of travel of the traffic.
Figure 6.10: The main two-way sea lane is clearly illustrated. The anomaly, exclusively found by the MoG models, corresponds to a vessel crossing the main sea lane. Notice that a few other vessels (corresponding to the light grey training data) have previously crossed the sea lane at nearby locations. However, these occurrences are probably too few to have any significant impact on the trained models.


6.2.2.3 Anomalies detected exclusively by ART models

The following figures present three sample areas where all anomalies in the unlabelled data, for each particular area, are exclusively detected by the two ART models.

Figure 6.11: The anomalies, exclusively detected by the ART models, correspond to a vessel that travels in locally anomalous directions and, above all, travels at a very low speed.


Figure 6.12: The few anomalies, exclusively detected by the ART models, correspond to anomalous vessel motion in the middle of the major sea lane. Specifically, the relatively high speed and, in some cases, the reverse direction of travel stand out as clear anomalies in contrast to the regular traffic.
Figure 6.13: Traffic data available for this area is relatively sparse. However, a single vessel in the upper left corner, distinguished by its relatively high speed, is exclusively detected as an anomaly by the ART models.


6.2.3 Anomaly analysis of labelled data


In the anomaly analysis of the recorded and labelled data presented below, the four-dimensional and two-dimensional MoG models have been used to calculate the likelihood of the data corresponding to the labelled anomalies. The mean of these likelihoods has then been used as the threshold when performing anomaly detection on the nine days of recorded, unlabelled data available in the area. The share of the unlabelled data classified as anomalous at this threshold indicates how well the system can distinguish the labelled anomalies from typical background data in that particular area.

6.2.3.1 Unexpected stop scenario

This scenario involves a vessel making an unexpected stop in its route, a few nautical miles north-west of Bornholm. The situation may be a potential smuggling scenario where the cargo is dumped during the stop and later picked up by another vessel.

Figure 6.14: Potential smuggling scenario, a few nautical miles north-west of Bornholm. The tracks in black correspond to a vessel that has made an unexpected stop in the area indicated by the circle.

Table 6.1: Statistics for the unexpected stop scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of the labelled data
inside the indicated stop area                            -2.99       -9.31
Share of unlabelled data for this region with
likelihood <= that mean                                   5.76%       1.57%
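The threshold and share statistics reported in Table 6.1 and in the following tables can be computed from per-sample log-likelihoods as sketched below (a Python sketch; the thesis implementation is in MATLAB):

```python
import numpy as np

def anomaly_share(labelled_loglik, unlabelled_loglik):
    """The mean log-likelihood of the labelled anomalies becomes the threshold;
    returned with it is the share of unlabelled background data at or below it."""
    threshold = float(np.mean(labelled_loglik))
    share = float(np.mean(np.asarray(unlabelled_loglik) <= threshold))
    return threshold, share
```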


6.2.3.2 Vessel collisions

The first collision scenario involves a collision between a large freighter vessel and a smaller freighter vessel, a few nautical miles north-west of Bornholm. After the collision, a number of vessels in the neighbouring area appear to travel straight to the collision zone in order to assist the involved vessels.
Figure 6.15: The tracks of the two colliding vessels are coloured black, and the location of the collision is encircled. The dark grey tracks highlight unlabelled data classified as anomalous by a four-dimensional MoG at threshold -10, which corresponds approximately to the mean log-likelihood of the colliding vessels' data. These tracks are assumed to correspond to vessels deviating from their routine course in order to assist the distressed vessels.

Table 6.2: Statistics for the first collision scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of data corresponding to the
first vessel in the collision zone                       -10.71      -10.52
Mean log-likelihood of data corresponding to the
second vessel in the collision zone                      -10.04      -10.46
Share of unlabelled data for this region with
likelihood <= the mean of the second vessel                0.3%        0.8%


The second collision scenario involved an accident at the west coast of Sweden, south-west of Varberg, where a large passenger ferry collided with a smaller freighter vessel.
Figure 6.16: Vessel collision at the west coast of Sweden, south-west of Varberg. The tracks of the two colliding vessels are coloured black and dark grey, and the location of the collision is encircled.

Table 6.3: Statistics for the second collision scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of data corresponding to the
vessel in black in the collision zone                     -1.41       -9.23
Mean log-likelihood of data corresponding to the
vessel in dark grey in the collision zone                 -1.72      -10.08
Share of unlabelled data for this region with
likelihood <= the mean of the vessel in dark grey         74.7%        2.0%


6.2.3.3 Grounding scenarios

The first grounding scenario took place close to the coast of Denmark, north-west of Helsingør.

Figure 6.17: The tracks in black correspond to the vessel that has grounded; the grounding spot is indicated by the circle.

Table 6.4: Statistics for the first grounding scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of the labelled data in the
indicated grounding area                                  -1.30       -4.88
Share of unlabelled data for this region with
likelihood <= that mean                                   11.4%       28.5%


The second grounding scenario took place west of Malmö seaport:

Figure 6.18: Grounding scenario west of Malmö seaport (data around the port area has been filtered away, as can be seen in the figure). The tracks in black correspond to the vessel that has grounded; the grounding spot is indicated by the circle.

Table 6.5: Statistics for the second grounding scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of the labelled data in the
indicated grounding area                                  -0.25       -3.64
Share of unlabelled data for this region with
likelihood <= that mean                                   21.5%       25.2%


The third grounding scenario took place in the Öresund passage, southwest of Landskrona:
Figure 6.19: Grounding scenario in the Öresund passage, southwest of Landskrona. Traffic around the Danish island of Ven has been filtered away, as can be seen by the large white area in the centre of the figure. The tracks in black correspond to the vessel that has grounded; the grounding spot is indicated by the circle.

Table 6.6: Statistics for the third grounding scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of the labelled data in the
indicated grounding area                                  -1.96       -5.13
Share of unlabelled data for this region with
likelihood <= that mean                                   19.2%       23.5%


6.2.4 Anomaly Detection of artificial data


6.2.4.1 Simulated smuggling scenario

The vessel in black corresponds to a large freighter travelling at a speed of around 7 knots. It slows down to below 1 knot in the indicated rendezvous area, dumps its cargo and accelerates back to 7 knots after a short while. Meanwhile, another, smaller vessel (coloured dark grey) approaches the rendezvous area perpendicular to the sea lane at a speed of around 12 knots. When it reaches the area it slows down, picks up the cargo and accelerates back to 12 knots, heading back in the direction it came from.

Figure 6.20: The tracks in black and dark grey correspond to the large freighter and smaller pick-up vessel respectively.

Table 6.7: Statistics for the large freighter (black tracks) involved in the smuggling scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of data inside the
rendezvous area                                           -3.23       -8.72
Mean log-likelihood of data outside the
rendezvous area                                           -0.82       -5.94
Share of unlabelled data in this region with
likelihood <= the mean inside the rendezvous area          6.5%        1.3%


Table 6.8: Statistics for the smaller pick-up vessel (dark grey tracks) involved in the smuggling scenario

                                                      4-dim MoG   2-dim MoG
Mean log-likelihood of data inside the
rendezvous area                                           -5.10      -10.55
Mean log-likelihood of data outside the
rendezvous area                                          -48.39      -55.11
Share of unlabelled data in this region with
likelihood <= the mean inside / outside the
rendezvous area                                       2.2% / 0%   0.5% / 0%

6.2.4.2 Simulated speeding scenario

In this scenario, artificial data for a vessel travelling at a relatively high speed in a sea lane has been evaluated. The mean speed of the (real) vessel data used for training the model in this area is 6.97 knots and the variance 3.22. The speed of the artificial vessel is around 15 knots as it travels in the main direction of the sea lane.

[Figure: map of the artificial track, latitude 55.5-55.66 versus longitude 15-15.2]

Figure 6.21: Speed vectors in black correspond to artificially generated data of a vessel travelling at a relatively high speed in the sea lane.


Table 6.9: Statistics for the speeding vessel

                                                             4-dim MoG   2-dim MoG
  Mean log-likelihood of the artificial data                    -8.25       -14.7
  Share of unlabelled evaluation data classified as
  anomalous at this level                                        0.5%        0.2%


7 Analysis and discussion


In this chapter, the results from the implementation and evaluation of the anomaly detection algorithms are analysed and discussed. The potential and capabilities of the feature models and clustering algorithms are discussed in general and in relation to the specific goals of the experiment. The chapter is concluded with a discussion about future work and improvements of the anomaly detection system.

7.1 Degree of vigilance of the systems


One of the main issues when configuring a system for anomaly detection is how to find an appropriate anomaly threshold, corresponding to the system's level of vigilance. Recall that in the case of the MoG models and the ART networks, the anomaly threshold corresponds to the anomaly likelihood threshold and the vigilance value, respectively. Increasing these values makes the systems more vigilant, as they find more anomalies in the data. However, it also means that the rate of false alarms increases, since the systems tend to flag more anomalies that are of no interest to an operator. On the other hand, if the level of vigilance is decreased, the system might dismiss more subtle anomalies that are in fact of high interest to an operator. Thus, the level of vigilance strongly influences the usability of an anomaly detection system.
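The threshold mechanics are the same for any score-based detector and can be sketched as follows (the per-point scores are invented for illustration):

```python
def flag_anomalies(log_likelihoods, threshold):
    # A point is flagged when its log-likelihood under the normal model
    # falls below the vigilance threshold.
    return [ll < threshold for ll in log_likelihoods]

scores = [-1.2, -0.8, -9.5, -2.0, -15.3]       # hypothetical per-point scores
lax_alarms = flag_anomalies(scores, -10.0)     # low vigilance: only the clearest outlier
strict_alarms = flag_anomalies(scores, -5.0)   # higher vigilance: more alarms, more false positives
```

Raising the threshold can only add alarms, never remove them, which is why tuning it is a direct trade between missed anomalies and operator fatigue.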

7.2 Anomaly detection in unlabelled data


One of the main goals of the experimental evaluation was to investigate what type of anomalous data was detected by the different models (qualitative analysis) and to what extent the models detected the same anomalies (quantitative analysis). The models were tuned to a rather low level of vigilance and had approximately the same rate of anomaly detection: 0.72% to 0.78% of the unlabelled evaluation points were classified as anomalous by the four models.

7.2.1 Quantitative analysis


Considering the quantitative results in general, we observe that the intersection of the four sets of anomalous points (one set for each model) constitutes about 47% of the corresponding union set. In other words, all four models detect the same anomalies to a rather large extent. In particular, the anomalous sets of the two ART models have an intersection set that is more than 80% of the size of the corresponding union set. At the other extreme, the 4-dimensional MoG model and the 2-dimensional ART model have a corresponding ratio of about 56%. Thus, if the goal is to detect a wide range of anomalies, the combination of the 4-dimensional MoG model and the 2-dimensional ART model would be the most effective hybrid solution in this case.

Comparing the ART models and MoG models in general, we observe that the intersection of the MoG and ART anomaly sets constitutes approximately 61% of the corresponding union set. Making the corresponding comparison between the four-dimensional and two-dimensional models, the ratio is about 71%. Thus, the choice of clustering algorithm appears to have a greater impact than the choice of feature model on the range of anomalies that are detected.
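The overlap figures above are intersection-over-union ratios, which can be computed as follows (the four sets of point ids are toy stand-ins, not the evaluation data):

```python
def overlap_ratio(anomaly_sets):
    # |intersection| / |union| of the point ids flagged by each model.
    inter = set.intersection(*anomaly_sets)
    union = set.union(*anomaly_sets)
    return len(inter) / len(union)

# Toy anomaly sets for four hypothetical models.
mog4, mog2 = {1, 2, 3, 4}, {2, 3, 4, 5}
art4, art2 = {2, 3, 4, 6}, {2, 3, 4, 7}
all_four = overlap_ratio([mog4, mog2, art4, art2])  # 3 shared ids out of 7 flagged
art_pair = overlap_ratio([art4, art2])              # pairwise agreement is higher
```

As in the reported results, the pairwise ratio between two similar models exceeds the four-way ratio, since every model contributes points the others miss.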


7.2.2 Qualitative analysis


Results show that the most distinguishing anomalies found in the unlabelled data correspond to vessels crossing sea lanes and vessels travelling close to, and in the opposite direction of, sea lanes. Generally, the unlabelled anomalies detected by the systems at the relatively low degree of vigilance stand out rather clearly against the normal data. The location and extension of the major sea lanes are often easily identified. However, information regarding the normal velocities in the sea lanes is less transparent, as the high concentration of data vectors makes the plots rather cluttered.

The anomalies exclusively detected by the MoG models typically involve vessels crossing major sea lanes at more or less regular speeds. These areas are relatively highly trafficked, and only a small minority of the traffic deviates from the major sea lanes. The anomalies exclusively detected by the ART models typically involve vessels travelling at relatively high or low speeds. Thus, it appears that the MoG models are more sensitive to the direction of the motion, while the ART models are more sensitive to the absolute speed of the motion.

The quantitative comparison of the sets of anomalous points detected by the four-dimensional and two-dimensional models indicates that the two feature models are sensitive to different types of anomalies. However, no particular correlation or pattern has been found during qualitative analysis of such comparisons.

7.3 Anomaly detection in labelled data


To summarize, the MoG models were quite effective in distinguishing the collision data and the unexpected-stop data as anomalous among the unlabelled data. However, the models performed rather poorly in detecting the anomalous data related to the grounding scenarios. In the first collision scenario, less than 1% of the unlabelled data had a likelihood equal to or less than the likelihood of the collision data for both MoG models. However, the four-dimensional MoG model performed extremely poorly compared to the two-dimensional MoG model in the second collision scenario; the share of unlabelled data detected as anomalous was 74.7% for the four-dimensional MoG and 2.0% for the two-dimensional MoG. In all three grounding scenarios, both MoG models performed about equally poorly in distinguishing grounding data from the remaining data. Between 11.4% and 28.5% of the unlabelled data was detected as anomalous at the anomaly levels of the grounding data; an anomaly detection system that generates alarms at such a rate is not very usable.

Comparing the two feature models, results indicate that the spatial information incorporated in the four-dimensional feature model does not appear to enhance the system's ability to detect the labelled anomalies. As a matter of fact, the four-dimensional model performed considerably worse in some cases; the two-dimensional MoG was considerably more effective in distinguishing the second collision and the unexpected stop as anomalies among the rest of the unlabelled data. Otherwise, the performance of the two systems, with respect to distinguishing the labelled anomalies from the rest of the unlabelled data, was equivalent in all scenarios.


7.4 Anomaly detection in artificial data


In the simulated smuggling scenario, the large freighter slowing down is detected rather clearly as an anomaly by the two-dimensional MoG and less clearly by the four-dimensional MoG. These results are very similar to the results from the recorded unexpected-stop scenario, which is to be expected as the vessel behaviour in the two cases is rather similar. The smaller vessel approaching the rendezvous zone perpendicular to the sea lane is even more clearly distinguished by the models; the direction and speed of its motion deviate clearly from the normal motion pattern in the area.

The results show that the implemented systems have the potential of detecting smuggling scenarios as anomalies in the motion patterns of the involved vessels. Strictly speaking, however, the systems detect two separate anomalies: the slowing down of the first vessel, and the perpendicular approach at relatively high speed of the second vessel. To make the system truly aware of the smuggling situation, the two separate anomalies should somehow be correlated, i.e. the feature model should be extended to capture spatial relations in time between vessels. This would make the system even more effective in detecting smuggling scenarios.

Another issue worth mentioning is to what degree small vessels, like the one involved in this simulated scenario, actually are detectable by the surveillance systems and present in the data sets. In the training data supplied in this project, the majority of the vessels are of relatively large size; small vessels, like non-operative pleasure boats, are completely absent from the data sets. Thus, if the vessel picking up the cargo is very small and therefore not present in the situation picture, the system's only chance is to detect the anomaly corresponding to the larger vessel dumping its cargo.

The simulated speeding scenario was detected as a very clear anomaly.
This result was expected as the speed of the vessel was rather far from the mean speed in the region.

7.5 The feature models


The models implemented in this project are quite limited, as they only consider features that correspond to instantaneous properties of the vessel motion. Anomaly detection is based on the momentary state of a vessel's motion without considering its previous states. Furthermore, the states of the surrounding vessels are not considered; e.g. spatial relations in time between vessels are not captured in the feature models. This implies that anomalies involving multiple vessels, such as smuggling or hijacking, are difficult to find with the implemented models. However, the simplicity of the feature models makes them rather general: they are in principle applicable to any domain involving motion in the two-dimensional plane, and no particular domain knowledge is required for defining them.
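The two feature models can be sketched in a few lines, assuming the momentary velocity is estimated as a finite difference between consecutive track positions (an assumption for illustration; the actual source of the velocity estimates is the tracking system):

```python
def velocity_features(p0, p1, dt):
    # 2-d feature vector: momentary velocity estimated from two consecutive
    # track positions (x, y), dt time units apart.
    (x0, y0), (x1, y1) = p0, p1
    return ((x1 - x0) / dt, (y1 - y0) / dt)

def position_velocity_features(p0, p1, dt):
    # 4-d extension: the current position plus the momentary velocity.
    vx, vy = velocity_features(p0, p1, dt)
    return (p1[0], p1[1], vx, vy)

f2 = velocity_features((0.0, 0.0), (2.0, 1.0), 0.5)
f4 = position_velocity_features((0.0, 0.0), (2.0, 1.0), 0.5)
```

Each track point thus yields one feature vector, and the clustering algorithms operate on these vectors without any memory of earlier states of the track.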

7.6 Comparing MoG and ART in general


Comparing MoG and ART in general, we can identify advantages and disadvantages of each approach to clustering. The main advantage of ART over standard EM and MoG is its ability to learn dynamically; ART networks do not require a predefined number of categories (clusters) before clustering. The dynamic learning feature implies that ART networks can react faster to changes in the normal behaviour, as new categories are created instantaneously when needed. Furthermore, ART networks can easily be extended to support supervised learning by introducing an ARTMAP structure; this extension is discussed in the future work section below. However, ART's ability to quickly learn new patterns also implies that the system is more sensitive to overtraining and noise. If an anomalous pattern is mistakenly presented as a normal pattern during training, the network will instantaneously create an erroneous category for this pattern. Learning of the MoG with EM is much less sensitive to noise, due to the statistical nature of the algorithm, particularly if the noise is stochastic; the relative weights of noisy points are negligible when the clusters are updated during EM. Thus, if there is reason to believe that training data might include anomalies, EM and MoG should be preferred to unsupervised ART.

The MoG has natural support for uncertainty measures, as it outputs the likelihood of points under the model. The ART network lacks a corresponding feature, as it does not output any information regarding the degree of similarity between the categories and the input patterns. Another disadvantage that ART shares with neural networks in general is that it is not transparent in the same sense as a MoG. While a MoG can be analysed explicitly by considering the underlying probability distributions, ART networks behave more like black boxes.
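The contrast between the two kinds of output can be illustrated with a deliberately simplified one-dimensional caricature (real ART operates with fuzzy set operations on complement-coded inputs; the prototypes, vigilance radius and mixture components below are invented for illustration):

```python
import math

def art_match(x, prototypes, vigilance):
    # ART-style outcome: the nearest prototype if it passes the vigilance
    # test, otherwise no match at all -- the answer is purely categorical.
    best = min(prototypes, key=lambda p: abs(p - x))
    return best if abs(best - x) <= vigilance else None

def mog_log_likelihood(x, components):
    # MoG-style outcome: a graded log-likelihood under a 1-d mixture of
    # (weight, mean, variance) components.
    density = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                  for w, m, v in components)
    return math.log(density)

protos = [0.0, 10.0]
mixture = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
```

The ART-style function either names a category or reports nothing, while the MoG score degrades smoothly as a point moves away from the clusters, which is exactly the uncertainty information an operator could use to rank alarms.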

7.7 Future work and improvements


7.7.1 Feature model based on manoeuvres in motion pattern
As discussed previously, the simplicity of the feature models implemented in this project strongly constrains the type of detectable anomalies. A natural extension of the system would therefore be a more sophisticated feature model. A complementary feature model based on events that correspond to manoeuvres in the motion pattern is proposed. By analysing the motion of a particular track within a limited time window, certain events corresponding to interesting manoeuvres can be identified. In particular, the vessel trajectories are divided into different types of segments, where each segment corresponds to a type of manoeuvre characterized by a number of kinematical features.

7.7.1.1 Segmentation of trajectories

Generally, segments are characterized as regular segments if they are of a specific minimum length and have close-to-constant angular speed and linear acceleration. The regular segments are then further divided into subcategories according to more specific kinematical constraints:

Regular straight segment
- Linear speed is different from zero during the whole segment.
- Linear acceleration is close to zero during the whole segment.
- Angular speed is close to zero (constant course) during the whole segment.

Weak yaw segment
- Angular speed is close to constant and different from zero during the whole segment.
- The relative difference in angle between the course at the starting point and the course at the ending point is within a certain interval (smaller angular difference than for a sharp yaw).

Sharp yaw segment
- Angular speed is close to constant and different from zero during the whole segment.
- The relative difference in angle between the course at the starting point and the course at the ending point is within a certain interval (greater angular difference than for a weak yaw).

Brake segment
- Linear acceleration is constant and below a certain negative threshold during the whole segment.

Acceleration segment
- Linear acceleration is constant and above a certain positive threshold during the whole segment.

Stop segment
- Linear speed is very close to zero during the whole segment.
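The taxonomy above, together with the irregular catch-all, could be realized as a simple rule-based classifier over per-segment kinematic summaries; the threshold values below are illustrative placeholders, not values from this project:

```python
def classify_segment(mean_speed, lin_acc, ang_speed, total_turn,
                     eps=0.01, sharp_turn=0.8):
    # Assign one of the proposed segment types from per-segment kinematic
    # summaries. eps and sharp_turn (radians) are hypothetical thresholds.
    if abs(mean_speed) < eps:
        return "stop"
    if abs(ang_speed) >= eps and abs(lin_acc) >= eps:
        return "irregular"          # jerky: turning and accelerating at once
    if abs(ang_speed) >= eps:
        return "sharp yaw" if abs(total_turn) >= sharp_turn else "weak yaw"
    if lin_acc <= -eps:
        return "brake"
    if lin_acc >= eps:
        return "acceleration"
    return "straight"               # constant course at constant speed
```

In practice the segmentation step would first cut the trajectory at points where these summaries change, and then label each resulting segment with rules of this kind.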

The parts of the trajectory that remain when all regular segments have been extracted correspond to irregular segments. These segments generally involve more or less jerky motion, as the linear and angular acceleration differ from zero.

7.7.1.2 Characterizing vessel trajectories in a high-dimensional feature space

A vessel trajectory is characterized by first segmenting it and then counting the number of segments of each type within a limited time window. Each such partial trajectory analysis generates a data point in a high-dimensional feature space, where each dimension corresponds to the number of occurrences of a particular manoeuvre (i.e. segment type). As an example, we might consider a seven-dimensional space where the features are the numbers of 1) straight segments, 2) weak yaw segments, 3) sharp yaw segments, 4) brake segments, 5) acceleration segments, 6) stop segments and 7) irregular segments. In addition, features corresponding to the total time duration of each segment type could be added to the model.

Clusters in this high-dimensional space would correspond to typical manoeuvre patterns associated with particular vessel classes: large freighter ships, car ferries, fishing boats, pilots etc. The clusters could be represented by a mixture of multivariate Poisson distributions that model the expected number of occurrences of each manoeuvre within each cluster. To estimate the parameters of the Poisson mixture, an adapted version of the EM algorithm may be used, similar to the case with mixtures of Gaussians. The Poisson distribution has the advantage that it is defined by a single parameter (compared with the Gaussian, which is defined by its mean and variance). This implies that the Poisson distribution does not run the risk of collapsing during the EM algorithm, as the Gaussian does when its variance approaches zero.

An alternative to the Poisson mixture model is to introduce a Hidden Markov Model (HMM) that treats the different types of segments as states of the vessel motion. By using an HMM for modelling sequences of segments, the likelihood of an observed sequence of manoeuvres can be calculated, thereby identifying more or less anomalous sequences of manoeuvres.
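The Poisson scoring of a manoeuvre-count vector against one cluster can be sketched as follows, assuming an independent Poisson rate per segment type (the cluster rates and count vectors are hypothetical, not estimated from any data):

```python
import math

def poisson_log_pmf(k, lam):
    # log P(K = k) for a Poisson distribution with rate lam.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def manoeuvre_log_likelihood(counts, rates):
    # Log-likelihood of a manoeuvre-count vector under one cluster with an
    # independent Poisson rate per segment type.
    return sum(poisson_log_pmf(k, lam) for k, lam in zip(counts, rates))

# Hypothetical rates for a freighter-like cluster, in the order: straight,
# weak yaw, sharp yaw, brake, acceleration, stop, irregular.
freighter_rates = [6.0, 1.0, 0.2, 0.3, 0.3, 0.1, 0.2]
typical = [6, 1, 0, 0, 1, 0, 0]     # calm, lane-following behaviour
erratic = [1, 3, 4, 2, 2, 1, 5]     # many yaws, brakes and irregular segments
```

A full mixture would weight such per-cluster scores by the mixing proportions; even this single-cluster sketch shows how an erratic count vector scores far below a typical one.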


7.7.2 Introducing semi-supervised learning


Vessels that exhibit irregular motion behaviour that does not belong to any particular cluster may be considered anomalous. However, an interesting possibility would arise if we were able to introduce supervised learning by labelling the clusters found during training as typical car ferry behaviour, pilot behaviour and so on. Then, if an unidentified vessel appears and behaves like a typical fishing boat, we might suspect that the vessel is poaching fish. Another example might be an unidentified vessel behaving like a pilot, guiding a foreign vessel, when it is actually hijacking the vessel. Alternatively, we could detect anomalies as deviations from expected behaviour; a vessel identifying itself as a cargo ship, but behaving like a fishing boat, would raise suspicion regarding its true intent.

Taking a more general view, clusters could be labelled as either normal or anomalous, i.e. labelled anomalies are explicitly incorporated in the model as anomaly clusters. For an unlabelled instance, the distances to the nearest normal and anomalous clusters would then be used to measure its degree of anomaly. Supervised learning may be achieved in two ways: either explicitly, by supplying the system with a labelled set of anomalous data, or implicitly, by having an operator confirm or dismiss alarms corresponding to potential anomalies found by the system.

The ARTMAP neural network architecture, previously mentioned in section 3.5.2, is an extension of regular ART that incorporates supervised learning. It supports a mechanism for associating learnt categories (clusters) with predefined classes. During learning, each data point is presented to the network together with a class label, strengthening the association between its assigned cluster and the specified class. When presented with an unlabelled data point, the network first determines the corresponding cluster and then predicts the class based on this information.
If no existing cluster is sufficiently similar to the unlabelled data, the network may either create a new cluster and associate it with an unknown class, or simply state that the input was unknown to the network. In the case of anomaly detection, clusters labelled as unknown may later be specified as either normal or anomalous by an operator. If more observed data is assigned to an unknown cluster and no operator input is received, the system may automatically re-label the cluster as normal after some time.

In the case of the mixture models, supervised learning may be introduced simply by training separate mixture models for each particular class of data, e.g. one mixture model for labelled normal data and another for labelled anomalous data. When performing anomaly detection on unlabelled data, the class corresponding to the mixture model generating the highest likelihood may be regarded as the most likely class. However, this approach is probably not appropriate for modelling anomalous behaviour; anomalous data is, as previously discussed, difficult to characterize, and the amount of adequate statistics available is usually very limited, making it difficult to construct accurate mixture models. Thus, the ARTMAP structure should be preferred when introducing supervised learning.
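The cluster-labelling workflow described above can be reduced to a one-dimensional toy sketch (this is not ARTMAP itself, only the label-propagation idea; centres and labels are invented):

```python
def nearest_cluster(x, centers):
    # Index of the closest learnt cluster centre (1-d toy distance).
    return min(range(len(centers)), key=lambda i: abs(centers[i] - x))

def predict(x, centers, labels):
    # Classify via the label attached to the nearest learnt cluster; new
    # clusters start out "unknown" until an operator provides feedback.
    return labels[nearest_cluster(x, centers)]

centers = [2.0, 8.0, 20.0]                    # toy cluster centres
labels = ["normal", "normal", "unknown"]
before = predict(19.0, centers, labels)       # "unknown" -> raised to the operator
labels[2] = "anomalous"                       # operator confirms the alarm
after = predict(19.0, centers, labels)        # future matches inherit the label
```

The key property is that operator feedback only relabels a cluster; the unsupervised clustering itself keeps running unchanged when no feedback is available.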


8 Conclusion
In the first phase of this project, an overview of different approaches involving artificial intelligence to solve the problem of situation assessment has been presented. The potential and capabilities of the different approaches have been discussed, resulting in suggestions for future development of the prototype system for situation assessment developed by Saab. Two approaches to solving the problem of uncertainty handling have been proposed: hierarchical event recognition through Bayesian networks, and a fuzzy reasoning approach involving fuzzy rules and fuzzy truth values.

Characterizing all events and situations that may be of interest to a supervisor is a very difficult task. The set of available examples for each particular event or situation is usually very limited, as the events and situations sought occur relatively rarely and may vary significantly from one case to another. Turning it the other way around, however, these rare events and situations can be detected as anomalies in a model of routine behaviour. Usually, large amounts of data corresponding to routine behaviour are available, which motivates the use of data mining and clustering techniques for building models of normal behaviour.

In the second phase of this project, anomaly detection based on unsupervised machine learning techniques has been further investigated and implemented in the domain of sea surveillance. The system is based on unsupervised clustering of vessel traffic data, where the data model is specified by a particular feature model. Two feature models have been developed and implemented: a two-dimensional model based on the momentary velocity in the two-dimensional plane, and a four-dimensional extension of the base model incorporating the momentary spatial position.
Clustering has been done with two different models and learning algorithms: one based on Mixture of Gaussians (MoG) densities and Expectation-Maximization, and the other based on neural networks and Adaptive Resonance Theory (ART). The feature models and algorithms have been evaluated using real recorded data from the sea surveillance centre in Malmö and artificially generated data. The recorded data is in principle unlabelled and is assumed to reflect normal traffic. However, a few labelled anomalies involving vessel collisions, groundings and unexpected stops are present in the recorded data. The artificial data corresponds to vessels involved in a simulated smuggling scenario and a speeding scenario. The four combinations of the two feature models and two clustering algorithms have been evaluated by training on a majority of the unlabelled recorded data and testing on the remaining minority of the recorded data, including the labelled anomalies, and on the artificial anomalous data.

Qualitative results show that the most distinguishing anomalies found in the unlabelled data correspond to vessels crossing sea lanes and vessels travelling close to, and in the opposite direction of, sea lanes. Generally, the four models detect the same anomalies to a rather large extent. However, results indicate that the MoG models are more sensitive to the direction of the motion, while the ART models are more sensitive to the absolute speed of the motion. Qualitative analysis of the recorded labelled anomalies indicated that the MoG models were quite effective in detecting the collisions and the unexpected stop as anomalies among routine traffic. However, the models performed rather poorly in detecting the anomalies related to the grounding scenarios. Comparing the two feature models, results indicate that the spatial information incorporated in the four-dimensional feature model does not appear to enhance the system's ability to detect the labelled anomalies in general.


Comparing the two cluster models, the mixture model has been considered more suitable when training data contains noise or anomalies; the dynamic learning feature of ART implies fast learning of new patterns at the cost of increased sensitivity to noise. Generally, the types of anomalies that are detectable by the implemented systems, at a reasonable alarm rate, are rather elementary in nature. It is not hard to imagine less advanced ad hoc systems that could identify the same anomalies. However, the generality of the system should be pointed out: since it is based on unsupervised algorithms, it has the potential of being applied to other domains involving generic motion in the two-dimensional plane, requiring minimal adaptation and no specific domain knowledge.

An extension involving a more sophisticated feature model, based on manoeuvres in the motion pattern, has been proposed for future work. Such a feature model would be a powerful complement, capable of capturing anomalies in motion tracks that evolve over time. Another suggested extension involves the introduction of semi-supervised learning, allowing an operator to make learning more effective by confirming or rejecting anomalies detected by the system. Such a system, implemented as an ARTMAP structure, would benefit from operator feedback without being dependent on it, as it self-organizes when operator input is not available.
