
MODIFIED DYNAMIC TIME WARPING FOR DISCOVERING CLIMATE

CHANGE PATTERNS USING HIERARCHICAL CLUSTERING

MAHMOUD AHMED SAMMOUR

THESIS SUBMITTED IN FULFILMENT FOR THE DEGREE OF


MASTER OF INFORMATION TECHNOLOGY

FACULTY OF INFORMATION SCIENCE AND TECHNOLOGY


UNIVERSITI KEBANGSAAN MALAYSIA
BANGI

2017
LEDINGAN MASA DINAMIK YANG DIUBAH SUAI UNTUK MENEMUI CORAK
PERUBAHAN IKLIM MENGGUNAKAN PENGELOMPOKAN HIERARKI

MAHMOUD AHMED SAMMOUR

TESIS YANG DIKEMUKAKAN UNTUK MEMPEROLEH IJAZAH


SARJANA TEKNOLOGI MAKLUMAT

FAKULTI TEKNOLOGI DAN SAINS MAKLUMAT


UNIVERSITI KEBANGSAAN MALAYSIA
BANGI

2016

DECLARATION

I hereby declare that the work in this thesis is my own except for quotations and
summaries which have been duly acknowledged.

29 August 2016 MAHMOUD AHMED SAMMOUR

P78597

ACKNOWLEDGMENTS

First and foremost, praise is to Almighty Allah for all his blessings for giving me
patience and good health throughout the course of my life.

I would like to thank my thesis supervisor, Assoc. Prof. Dr. Zulaiha Ali
Othman. She consistently allowed this thesis to be my own work, but steered me in
the right direction whenever she thought I needed it.

I would also like to take this opportunity to thank the Ministry of Higher
Education Malaysia Scholarship Division for its generosity in funding the Malaysian
International Scholarship (MIS). I am very honored to be a recipient of this award.

Finally, I must express my very profound gratitude to my parents, my
beloved wife and my friends for providing me with unfailing support and
continuous encouragement throughout my years of study and through the process of
researching and writing this thesis. This accomplishment would not have been
possible without them. Thank you.

ABSTRACT

Ozone analysis is the process of identifying meaningful patterns that facilitate the
prediction of future trends. One of the common techniques used for ozone analysis is
clustering. Clustering is a popular method that contributes significant knowledge to
time series data mining by discovering hidden patterns within the data. This can be
represented by aggregating similar data into specific groups. However, clustering
time series is a challenging task because existing clustering approaches are intended
to handle static data and are less able to handle the dynamic data of time series,
especially in climate change and environmental change. Accurate clustering is in
demand, especially for time-dependent data such as tropical climate and pollution
change. Two major factors affect the final shape of the clusters: the clustering
technique and the similarity measure used. Clustering techniques comprise
partitioning-based methods such as k-means, k-medoids and fuzzy c-means, and
grouping-based methods such as Agglomerative Hierarchical Clustering (AHC).
Unlike partitioning techniques, which aim to identify an initial solution (centroids),
hierarchical clustering aims to group the objects in a hierarchical manner, which is
more appropriate for time series data. On the other hand, similarity measures have a
significant impact on the accuracy of the clustering process. Several existing time
series clustering approaches have utilized common similarity measures such as
Euclidean distance and Minkowski distance. However, such measures have a main
drawback arising from the nature of similarity in time series data, where similarity is
based on the series pattern rather than on distance matching. In contrast, Dynamic
Time Warping (DTW) has been widely used for time series data due to its ability to
identify sequential correspondences. Although DTW has been used commonly with
partitioning-based clustering, fewer research efforts have addressed the use of DTW
with hierarchical clustering. Therefore, this study has three aims: firstly, to
investigate the performance of applying DTW in AHC; secondly, to enhance DTW in
AHC (EDTW-AHC) by improving the dynamic selection of the shortest path, suited
for dynamic data; and thirdly, to discover the Malaysian climate and ozone change
pattern. The first and second objectives follow the standard experiment-based
research method, including design and development of the algorithms, conducting
experiments and evaluation phases. The performance of the proposed algorithms was
evaluated using twenty UCR time series benchmark datasets. The proposed
algorithms were later used to discover knowledge from a Malaysian ozone change
dataset collected from Putrajaya for the year 2006; the patterns were extracted using
the Apriori association rules algorithm. The experiments show that the extracted
patterns are related to CO, which is an interesting relation in accordance with the
literature.

ABSTRAK

Analisis ozon adalah proses mengenal pasti corak yang bermakna yang akan
memudahkan ramalan trend masa depan. Salah satu teknik umum yang digunakan
untuk menganalisis ozon adalah teknik pengelompokan. Pengelompokan adalah salah
satu kaedah popular yang menyumbang pengetahuan penting untuk perlombongan
data siri masa bagi mencari corak tersembunyi di dalamnya. Ia boleh diwakili dengan
menjumlahkan data yang serupa ke dalam kumpulan tertentu. Walau bagaimanapun,
pengelompokan siri masa adalah satu tugas yang mencabar kerana pendekatan
pengelompokan yang sedia ada bertujuan untuk mengendalikan data statik, tetapi
kurang mampu untuk mengendalikan data dinamik siri masa terutamanya untuk
perubahan iklim dan perubahan alam sekitar. Pengelompokan data yang tepat
dikehendaki terutamanya bagi data yang serupa dengan masa seperti untuk iklim
tropika dan perubahan pencemaran. Terdapat dua faktor utama yang memberi kesan
kepada bentuk akhir kelompok; teknik pengelompokan dan teknik ukuran keserupaan
digunakan. Teknik-teknik ini terdiri daripada kaedah pengelompokan berdasarkan
pembahagian seperti k-min, k-medoid dan c-min kabur, dan kaedah pengelompokan
berdasarkan kumpulan seperti Hierarki Pengelompokan Aglomeratif (AHC). Tidak
seperti teknik yang bertujuan untuk mengenal pasti penyelesaian awal (sentroid),
pengelompokan hierarki bertujuan untuk mengumpulkan objek-objek secara hierarki
yang lebih sesuai untuk data siri masa. Sebaliknya, sukatan-sukatan keserupaan
mempunyai kesan yang besar ke atas proses pengelompokan dari segi kejituan.
Beberapa pendekatan pengelompokan siri masa sedia ada telah menggunakan sukatan-
sukatan keserupaan biasa seperti jarak Euclid dan jarak Minkowski. Walau
bagaimanapun, sukatan-sukatan tersebut mempunyai kelemahan utama yang boleh
diwakili oleh sifat keserupaan dalam data siri masa di mana keserupaan adalah
berdasarkan corak siri dan bukannya kesamaan jarak. Sebaliknya, Ledingan Masa
Dinamik (DTW) telah digunakan secara meluas untuk data siri masa kerana
kemampuannya untuk mengenal pasti kesamaan urutan. Walaupun DTW telah biasa
digunakan dengan pengelompokan berdasarkan pembahagian, namun usaha
penyelidikan kurang dihalakan kepada penggunaan DTW dengan pengelompokan
hierarki. Oleh itu, kajian ini mempunyai tiga matlamat. Pertama, adalah untuk
menyiasat prestasi penggunaan DTW di dalam AHC. Kedua, meningkatkan DTW di
dalam AHC (EDTW-AHC) dengan meningkatkan pemilihan dinamik jalan tersingkat
yang sesuai untuk data dinamik. Manakala yang ketiga adalah untuk mencari corak
perubahan iklim dan ozon Malaysia. Objektif pertama dan kedua dijalankan mengikut
kaedah penyelidikan berdasarkan eksperimen standard termasuk fasa reka bentuk dan
pembangunan algoritma, menjalankan eksperimen dan penilaian. Prestasi algoritma
yang dicadangkan telah dinilai menggunakan Dua Puluh set data siri masa UCI
sebagai penanda aras. Kebergunaan algoritma yang dicadangkan itu kemudiannya
digunakan untuk mencari pengetahuan mengenai set data perubahan Ozone di
Malaysia yang dikutip dari Putrajaya bagi tahun 2006; corak-corak diekstrak
menggunakan algoritma Peraturan Persatuan Apriori. Eksperimen menunjukkan
bahawa corak yang diekstrak mempunyai kaitan dengan CO di mana ia merupakan
perhubungan yang menarik mengikut kesusasteraan.

TABLE OF CONTENTS

Page
DECLARATION iii

ACKNOWLEDGMENTS iv

ABSTRACT v

ABSTRAK vi

CONTENTS vii

LIST OF TABLES xi

LIST OF FIGURES xii

LIST OF ABBREVIATIONS xiv


CHAPTER I INTRODUCTION

1.1 Introduction 1
1.2 Problem Statement 3
1.3 Research Objectives 4
1.4 Research Scope 5
1.5 Significance of Research 5
1.6 Research Methodology 6
1.7 Research Organization 7

CHAPTER II LITERATURE REVIEW

2.1 Introduction 9
2.2 Time Series Data Mining 9
2.3 Climate Change 10
2.4 Ozone (O3) 12
2.5 Clustering 14
2.6 Clustering Algorithms 16

2.6.1 Hierarchical clustering 16



2.6.2 Partitioning-based 18
2.6.3 Densities-based clustering 19
2.6.4 Model-based clustering 20

2.7 Hierarchical Clustering of Time-Series 21


2.8 Similarity/Dissimilarity Measure 23

2.8.1 Euclidean distance 23


2.8.2 Minkowski distance 24
2.8.3 Dynamic time warping (DTW) 24

2.9 Time Series Clustering Related Work 26


2.10 Ground-Level Ozone Clustering Related Work 29
2.11 Summary 30

CHAPTER III RESEARCH METHODOLOGY

3.1 Introduction 31
3.2 Research Methodology 31

3.2.1 Problem identification phase 32


3.2.2 Data collection and preparation 33
3.2.3 Proposed solution phase 33
3.2.4 Experiment phase 34
3.2.5 Evaluation phase 35

3.3 Dataset 36

3.3.1 UCR benchmark datasets 36


3.3.2 Malaysian climate change dataset 38

3.4 Preprocessing 39

3.4.1 Cleaning 39
3.4.2 Discretization 40

3.5 Agglomerative Hierarchical Clustering (AHC) 41


3.6 Distance Measures 42
3.7 Modified Dynamic Time Warping (M-DTW) 43
3.8 Evaluation 45
3.9 Summary 46

CHAPTER IV COMPARATIVE ANALYSIS OF THREE DISTANCE


MEASURES FOR AGGLOMERATIVE HIERARCHICAL
TIME SERIES CLUSTERING

4.1 Introduction 48
4.2 Experiment Setting 49
4.3 Results of AHC with DTW, Minkowski and Euclidean Distance
Measures for All Datasets 51
4.4 Discussion 54
4.5 Summary 55

CHAPTER V MODIFIED DYNAMIC TIME WARPING

5.1 Introduction 56
5.2 Experiment Setting 57
5.3 Results of DTW and Modified DTW for All Datasets 59
5.4 Discussion 62
5.5 Summary 62

CHAPTER VI GROUND LEVEL OZONE CLUSTERING IN PUTRAJAYA USING

THE PROPOSED M-DTW WITH AHC

6.1 Introduction 63
6.2 Experiment Setting 64

6.2.1 Evaluation 65

6.3 Results of Intra-Cluster 66


6.4 Results of Inter-Cluster 67
6.5 Categories of Air Pollution 69
6.6 Critical Analysis of Clustering When K=5 69
6.7 Critical Analysis of Clustering When K=9 71
6.8 Comparison Between Clustering When K=5 And K=9 75
6.9 Analysis Using Association Rules 77

6.9.1 Comparison 81

6.10 Summary 81

CHAPTER VII CONCLUSION AND FUTURE WORK

7.1 Conclusion 83
7.2 Research Contribution 84
7.3 Future Work 85

REFERENCES 86
APPENDIX A 96
APPENDIX B 107

LIST OF TABLES

Table Number Page

3.1 UCR datasets 37

3.2 Data with missing values 39

3.3 Mean Average mechanism 40

3.4 Pseudo code of AHC 41

3.5 Pseudo code of Euclidean distance 42

3.6 Pseudo code of Minkowski distance 42

3.7 Pseudo code of DTW 42

3.8 Pseudo code of M-DTW 44

3.9 Contingency table 45

4.1 Results of AHC using the three distance measures for ‘CBF’ dataset 50

4.2 Results of AHC with the three distance measures for all datasets 52

5.1 Results of AHC using DTW and modified DTW for ‘TwoLeadECG’ 58

5.2 Results of AHC with DTW and modified DTW for all datasets 60

6.1 Results of DTW and modified DTW using intra-cluster 66

6.2 Results of DTW and modified DTW using inter-cluster 67

6.3 Categories of air pollution (AirNow 2009) 69

6.4 Values using cluster k = 5 75

6.5 Values using cluster k = 9 76

6.6 Sample of rules with their confidence values 78

6.7 Results of association rules for clustering when k=5 79

6.8 Results of association rules for clustering when k=9 80



LIST OF FIGURES

Figure Number Page

2.1 Hierarchical clustering approaches 17

2.2 Partitioning clustering 18

2.3 Dynamic Time Warping (DTW) 25

3.1 General structure of the methodology 31

3.2 Problem Identification phase 32

3.3 Data collection and preparation phase 33

3.4 Proposed Solution phase 34

3.5 Experiment phase 35

3.6 Evaluation phase 36

3.7 Sample of Malaysian climate change data 38

3.8 DTW similarity matrix 44

4.1 Correspondences representation for the three distance measures 51

4.2 Comparison among the three distance functions in terms of f-measure 54

5.1 Correspondences representation for the two distance measures 59

5.2 Comparison among the two distance functions in terms of f-measure 61

6.1 RMSE-SD results of DTW and modified DTW for intra-cluster 67

6.2 RMSE-SD results of DTW and modified DTW for inter-cluster 68

6.3 Results of clustering when k=5 70

6.4 Results of clustering when k=9 73

6.5 Distribution of days over the five categories 77

6.6 Distribution of days over the nine categories 77



LIST OF ABBREVIATIONS

AHC Agglomerative Hierarchical Clustering

DTW Dynamic Time Warping

EU Euclidean

Min Minkowski

SVM Support Vector Machine

NN Neural Network

CC Climate Change

TS Time Series

AR Association Rules

O3 Ozone

RMSE Root Mean Square Error


CHAPTER I

1. INTRODUCTION

1.1 INTRODUCTION

Knowledge Discovery and Data Mining (KDD) is the process of analyzing large
amounts of data in order to acquire meaningful patterns (Bernad 1996). With the
dramatic expansion of data in various domains of interest, a crucial demand has
appeared for extracting useful knowledge from this data. Such demand has
contributed toward varying the tasks used for extracting useful knowledge, such as
pattern recognition, machine learning, data visualization, optimization, and
high-performance computing (Berry & Linoff 1997). However, discovering
knowledge from time-series data has caught many researchers' attention. Time-series
data mining is the process of identifying meaningful patterns within sequences of time
(Hamilton 1994). Time-series data mining has been applied in different fields such as
the stock market, brain activity, mathematical finance and climate change (Müller et al.
1997).

Nowadays, a surge of research has arisen in response to the climate
change occurring around the world. This can be represented by addressing
multiple tasks related to the climate, such as weather forecasting (Gneiting & Raftery
2005), rainfall prediction (Htike & Khalifa 2010) and flood estimation (Arnaud et al.
2002). Recently, ground-level ozone (O3) has been addressed by multiple authors
owing to its harmful effects on the human body (Akimoto et al. 2015; Christensen et
al. 2015; Kocijan et al. 2016; Sharma et al. 2016).

The key characteristic of climate change analysis lies in the nature of the
examined data, which is formed within intervals (e.g. daily, monthly and yearly).
This kind of analysis is called time-series analysis. The key challenge of time-series
analysis lies in the technique used in the analysis. Several research studies have
examined the clustering technique for time-series analysis, which has the ability to
identify meaningful patterns within the intervals.

Clustering is a popular method that contributes significant knowledge to
time series data mining by discovering hidden patterns within the data (Vora & Oza
2013). This can be represented by aggregating similar data into specific groups.
However, clustering time series is a challenging task because existing clustering
approaches are intended to handle static data and are less able to handle the dynamic
data of time series, especially in climate change and environmental change (Ferreira
& Zhao 2016). Accurate clustering is in demand, especially for time-dependent data
such as tropical climate and pollution change (Rani & Sikka 2012).

Various clustering methods have been proposed for climate change
analysis; these can be categorized into two main types: partitioning-based and
grouping-based clustering (Xu & Wunsch 2005). Partitioning-based clustering
methods comprise k-means, k-medoids and fuzzy c-means (Vora & Oza 2013),
while grouping-based clustering methods include Divisive Hierarchical Clustering
(DHC) and Agglomerative Hierarchical Clustering (AHC) (Jiang et al. 2003). Unlike
partitioning techniques, which aim to identify an initial solution (centroids),
hierarchical clustering aims to group the objects in a hierarchical manner, which is
more appropriate for time series data (Aghagolzadeh et al. 2007).

On the other hand, similarity measures have a significant impact on the accuracy
of the clustering process (Xu & Wunsch 2005). Several existing time series
clustering approaches have utilized common similarity measures such as Euclidean
distance and Minkowski distance (Kalpakis et al. 2001). However, such measures
have a main drawback arising from the nature of similarity in time series data, where
similarity is based on the series pattern rather than on distance matching.
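This drawback can be seen in a small illustrative sketch (not part of the thesis experiments): two series share exactly the same shape but are shifted in time, yet pointwise measures report a large dissimilarity because they only compare values at identical indices.

```python
import math

def euclidean(a, b):
    # Pointwise Euclidean distance between two equal-length series.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p=3):
    # Pointwise Minkowski distance of order p (p = 2 reduces to Euclidean).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Two series with the same single-peak pattern, shifted by two steps.
s1 = [0, 0, 1, 2, 1, 0, 0, 0]
s2 = [0, 0, 0, 0, 1, 2, 1, 0]

# Both pointwise measures report a non-zero distance despite identical shape.
print(euclidean(s1, s2))
print(minkowski(s1, s2))
```

A pattern-aware measure such as DTW would instead align the two peaks and treat the series as highly similar.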

Apart from the traditional distance measures, Dynamic Time Warping (DTW)
has been widely used for time series data due to its ability to identify sequential
correspondences (Niennattrakul & Ratanamahatana 2007). Although DTW has been
used commonly with partitioning-based clustering, fewer research efforts have
addressed the use of DTW with hierarchical clustering.
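The sequential-correspondence idea can be sketched with a minimal, textbook dynamic-programming formulation of DTW (this is the classic algorithm, not the thesis's modified version): the cumulative-cost matrix lets one series stretch or compress against the other, so a pure time shift costs nothing.

```python
def dtw(a, b):
    # Classic DTW by dynamic programming: D[i][j] holds the minimum
    # cumulative cost of aligning a[:i] with b[:j]; each cell extends the
    # warping path diagonally, vertically, or horizontally.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

s1 = [0, 0, 1, 2, 1, 0, 0, 0]
s2 = [0, 0, 0, 0, 1, 2, 1, 0]
print(dtw(s1, s2))  # 0.0: the warping path absorbs the shift entirely
```

Euclidean distance on the same pair is non-zero, which is exactly the pattern-versus-distance mismatch described above.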

Hence, this study aims to investigate the performance of applying DTW in
AHC. In addition, this study aims to propose a modification of DTW in AHC
(M-DTW-AHC) by improving the dynamic selection of the shortest path, suited for
dynamic data. Finally, the proposed modification will be carried out on the Malaysian
climate and ozone change pattern.

1.2 PROBLEM STATEMENT

Ozone analysis is the process of identifying meaningful patterns that facilitate
the prediction of future trends. Such analysis has been conducted by many researchers
with two main approaches: statistical approaches (Chang et al. 2014; Frossard et al.
2013; Li et al. 2013; Sharma et al. 2016) and clustering approaches (Ahmadi et al.
2015; Wang et al. 2013). Basically, ozone analysis relies mainly on time-series
data, which could be formed hourly, monthly or yearly. Monteiro et al. (2012)
conducted a comparative study of statistical and clustering approaches for ozone
analysis. The authors implied that the clustering approach outperformed the
statistical approach in terms of discovering interesting patterns by aggregating
similar sequences into multiple clusters. However, determining an appropriate
clustering approach for ozone analysis remains a challenging task, due to the variety
of clustering approaches that could be used for time series analysis.

Several clustering methods have been proposed for time-series analysis,
including partitioning-based clustering (Niennattrakul & Ratanamahatana 2007;
Vlachos et al. 2003) and grouping-based clustering (Badr et al. 2015; Jiang et al.
2003; Malley et al. 2014). However, according to Liao (2005), who established a
comparison between partitioning-based (i.e. k-means) and grouping-based (i.e. AHC)
clustering for time series weather forecasting, AHC tends to be the more appropriate
clustering method for time series due to its hierarchical manner. Similarly, Kavitha
and Punithavalli (2010) investigated the role of AHC for time series analysis; they
emphasized that AHC provides a comprehensible cluster definition process that can
group the most similar members into smaller clusters, which significantly fits the
case of sequences in time series. Hence, AHC has demonstrated superior
performance for time series compared to other clustering approaches.
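The bottom-up grouping that makes AHC attractive for time series can be sketched in a few lines. The following is a generic average-linkage version over a precomputed distance matrix (whose entries could come from any measure, e.g. DTW); it is an illustration only, not the thesis's implementation.

```python
def ahc(dist, k):
    # Agglomerative hierarchical clustering with average linkage.
    # dist is a symmetric pairwise-distance matrix; clusters are merged
    # bottom-up until k clusters remain.
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        # Find the closest pair of clusters under average linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Toy distance matrix: items 0,1 are close; 2,3 are close; groups are far apart.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
print(ahc(D, 2))  # [[0, 1], [2, 3]]
```

Note that the clustering itself never looks at the raw series, only at the distance matrix, which is why the choice of distance measure dominates the quality of the result.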

On the other hand, the similarity or distance function used by the clustering
technique plays an essential role in the performance of the clustering task
(Kalpakis et al. 2001). Several distance measures have been proposed, including
Euclidean, Minkowski and Dynamic Time Warping. In fact, integrating an
appropriate distance measure with an appropriate clustering technique is a
challenging task (Kalpakis et al. 2001).

Kumar et al. (2015) proposed a time series clustering approach using
UCR benchmark datasets. The authors used two distance measures, DTW and
Euclidean distance, and their results show that DTW outperforms the Euclidean
distance. In addition, Petitjean et al. (2011) examined DTW for time series
clustering; they implied that the strength of DTW in time series analysis lies in its
ability to find the shortest path in the process of identifying sequential matches.
Therefore, DTW could be considered for any time series clustering.

However, there is a point at which random selection could take place in the
dynamic programming. Such randomization may lead to a longer path that deviates
from the optimum path. Thus, there is a vital demand to modify DTW by enhancing
the dynamic selection of the shortest path.

1.3 RESEARCH OBJECTIVES

The objectives of this research can be illustrated as follows:

i. To identify the best distance measure for agglomerative hierarchical clustering
in time series.

ii. To propose a modified dynamic time warping by reducing its shortest path.

iii. To discover the Putrajaya ozone change pattern by applying the modified
dynamic time warping.

1.4 RESEARCH SCOPE

The overall scope of this study lies in clustering ground-level ozone in Malaysia.
First, an investigation is conducted in order to compare the performance of three
distance measures, Euclidean distance, Minkowski distance and DTW, with AHC. In
addition, this study aims to propose a modification of DTW by improving the
dynamic selection of the shortest path, suited for dynamic data. Finally, this study
aims to discover the Malaysian climate and ozone change pattern based on the
proposed method.

The first and second objectives follow the standard experiment-based research
method, including design and development of the algorithms, conducting experiments
and evaluation phases. The performance of the proposed algorithms will be evaluated
using twenty UCR time series benchmark datasets. The proposed algorithms are later
used to discover knowledge of the Malaysian climate change pattern using climate
change datasets from four stations (Petaling Jaya, KLIA, Subang and UM, collected
from the Institute of Climate Change, UKM) and the Malaysian ozone change
patterns in the Klang Valley using nine ozone stations, collected from Lestari UKM.

1.5 SIGNIFICANCE OF RESEARCH

The significance of this research lies in the challenging task of selecting an
appropriate method for ground-level ozone analysis. In fact, there is a vital demand to
identify meaningful patterns in order to obtain clues about the increase in ozone
levels. The clustering technique is one of the methods that has been widely examined
for identifying interesting patterns of ozone concentration (Ahmadi et al. 2015;
Saithanu & Mekparyup 2012; Sun et al. 2015; Wang et al. 2013).

However, there are several approaches to clustering, such as partitioning-based,
density-based and hierarchical approaches. Each approach has its own characteristics
and mechanisms, which leads to different performance. In this manner, identifying an
appropriate approach for ground-level ozone clustering is a challenging task.

Furthermore, any clustering approach should be integrated with a similarity
function that determines the correspondence criteria (i.e. similarity or distance).
This similarity function plays an essential role in producing precise clustering
results. Basically, there are various types of similarity or distance functions, such as
Euclidean distance, Minkowski distance and Dynamic Time Warping (DTW). Every
distance measure has its own characteristics and performance. Therefore, identifying
a suitable distance measure for ground-level ozone clustering is another challenging
task.

In addition, since ozone analysis relies mainly on time intervals, i.e.
time-series analysis, the distance measures sometimes require some modification in
order to fit the intervals. Some researchers have modified the mechanism of DTW in
order to make it more appropriate for time-series clustering (Petitjean et al. 2011;
Yuan et al. 2011; Zhu et al. 2012). Hence, it is necessary to make some adjustments
to the distance measure in order to provide robust ground-level ozone clustering.

1.6 RESEARCH METHODOLOGY

The research methodology of this study follows the standard experiment-based
research method, comprising Problem Identification, Data Collection and Preparation,
Proposed Solution, Experiment and Evaluation (Kelly & Lesh 2002). The first phase
aims to identify the problem of the study. This is performed by reviewing the
literature on ground-level ozone clustering. An exploration is conducted in order to
identify an appropriate clustering approach with a suitable distance measure. The
output of this phase is the problem statement of the study.

The second phase, Data Collection and Preparation, is associated with the
dataset to be used, as well as the preprocessing tasks to be conducted to turn the
data into an appropriate representation.

In the third phase, the proposed solution for the selected problem is
discussed. This can be represented by conducting a comparative analysis of multiple
distance measures in order to identify the best performer among them. The output of
this phase is represented as the objectives of the study.

The fourth phase is associated with the experiments to be conducted. In this
vein, the proposed solution is carried out on the prepared data, and the results
obtained by the proposed solution are observed. The output of this phase is
associated with the accomplishment of the research objectives.

The fifth phase is associated with the evaluation of the proposed method, in
which multiple evaluation metrics are used to assess the effectiveness of the
proposed method's results. The output of this phase is associated with the research
findings. Chapter III discusses these phases in more detail.

1.7 RESEARCH ORGANIZATION

This research contains seven chapters, ordered as follows:

Chapter II provides a comprehensive literature review of multiple fields, including
data mining, climate change analysis, time series analysis, ground-level ozone
analysis, clustering techniques and distance measures. Additionally, a critical
analysis of the related work is provided.

Chapter III describes the research methodology, where the datasets, preprocessing
tasks, clustering technique and selected distance measures are illustrated.
Additionally, the evaluation methods used to assess the effectiveness of the proposed
method are declared.

Chapter IV represents the first contribution, in which a comparative study is
conducted among three distance measures, Euclidean distance, Minkowski distance
and Dynamic Time Warping, using Agglomerative Hierarchical Clustering. The
experiment setting is declared by determining the dataset and tool used to
accommodate the comparison. Finally, a critical analysis of the obtained results is
provided.

Chapter V represents the second contribution, in which the DTW distance measure is
modified and applied with AHC for time series analysis. The experiment setting is
declared by determining the dataset and tool used to accommodate the modification.
Finally, a critical analysis of the obtained results is provided.

Chapter VI represents the third contribution, in which the modified DTW is
integrated with AHC and applied to Malaysian meteorological data in order to
perform ground-level ozone clustering. The experiment setting is declared by
determining the dataset and tool used to accommodate the comparison. Finally, a
critical analysis of the obtained results is provided.

Chapter VII provides the conclusion of the research, in which the summary,
accomplishment of contributions and future work are discussed.
CHAPTER II

2. LITERATURE REVIEW

2.1 INTRODUCTION

The increase of ground-level ozone (O3) nowadays in several urban areas around
the world is a serious problem. Such concentrations of ozone negatively affect the
human body. This research interest has caught many researchers' attention, leading
them to propose prediction approaches that enable controlling the consequences.
Clustering is one of the main approaches that have been proposed for ground-level
ozone analysis. Clustering aims to aggregate similar data points within clusters
(Rani & Sikka 2012). In this manner, similar trends can be aggregated in a single
group, which facilitates cause analysis. This chapter provides a comprehensive
literature review of multiple domains, including time-series data mining, climate
change, ground-level ozone analysis, clustering techniques and distance measures.
Additionally, every domain is supported with its corresponding state of the art.

2.2 TIME SERIES DATA MINING

In recent years, the field of data mining has expanded as research and development
have made increasing use of temporal data, and particularly time series data
(Antunes & Oliveira 2001). Representing a vital category of temporal data objects,
time series data can be acquired from financial and scientific applications. Those
applications include day-to-day temperatures, mutual funds and stock prices, details
of weekly sales, and electrocardiograms (ECG). According to Hamilton (1994), a
time series can be defined as:

“A time series is a series of data points listed (or graphed) in time order. Most
commonly, a time series is a sequence taken at successive equally spaced
points in time. Thus it is a sequence of discrete-time data.”

In a time series, an array of observations is sequenced in chronological order. Time
series data is large in size, high in dimensionality, and updated regularly (Hamilton
1994). Another feature of time series data is that it is marked by numeration and
continuity, meaning it is continuous and numerical in nature. In addition, it is
characterized by its wholeness rather than by individual features, as its data is
always regarded as a whole rather than as individual numerical fields. For these
reasons, contrary to conventional databases where search is based on exact matches,
search in time series is customarily executed using approximate matching.

Various kinds of search are executed with respect to time series data, such as
looking for analogous time series (Bernad 1996; Chan & Fu 1999), subsequence
searching in time series (Faloutsos et al. 1994), dimensionality reduction (Keogh
1997; Keogh et al. 2001) and segmentation (Abonyi et al. 2005). Researchers have
conducted extensive studies rigorously examining both the database and pattern
recognition communities for different domains of time series data (Keogh & Kasetty
2003). One of the principal problems with reference to time series data is how to
represent it. To this end, one approach most researchers apply is the transformation
of time series data into another domain, which achieves the purpose of
dimensionality reduction and is followed by an indexing mechanism. Furthermore,
other major tasks for time series mining are the measurement of similarity between
time series and time series subsequences, and segmentation. The literature on time
series representation provides an array of mining tasks that can broadly be
synthesized into four fields: classification, pattern discovery and clustering, rule
discovery, and summarization. In terms of focus, some research concentrates on one
of these areas, whereas other work may address any of the others.

2.3 CLIMATE CHANGE

Climate change is a change in the statistical distribution of weather patterns when that change lasts for an extended period of time (i.e., decades to millions of years) (Matson et al. 2010). In order to study climate change science and its impacts, the discovery of knowledge from temporal, spatial and spatiotemporal data can be critical (Laxman & Sastry 2006; Pelekis et al. 2004). Statistically, the study of climate has grown into a mature area. Of late, however, several developments have expanded it; in particular, the growth of observation and model outputs, together with the increased availability of geographical data, offers rich opportunities for data miners (Dezfuli 2011; Mitsa 2010). Carrying out such research is not free of challenges, which range from long-range, long-memory and possibly nonlinear dependence to nonlinear dynamical behavior and the presence of thresholds (Ganguly & Steinhaeuser 2008). Those challenges are augmented by critical extreme climatic events such as region-wise stresses triggered by global climate change, the quantification of uncertainty, and the intersection of the natural and built environment with climate change. In terms of domains, two major areas are identified in Knowledge Discovery (KD). One is data analysis and mining, which aims at extracting patterns from large volumes of climate observations and model outputs; the second is data-guided modeling and simulation (Chang et al. 2014), for example, studying models of water and energy and their estimated impacts, which take downscaled outputs as inputs.

Evidently, the climate change phenomenon has existed for centuries. Critically, climate change is an area where the focus is on long-lasting change in the distribution of weather patterns over a duration range (Stocker et al. 2014). Climate change has exponential impacts on the global environmental scenario, and the implications are far-reaching and long-lasting for critical dimensions of the environment such as shrinkage of glaciers, break-up of ice on the surface of rivers and lakes, impacts on animals and plants, expedited flowering of trees, and several other major environmental hazards (Carvalho et al. 2016). One of the major causes of climate change is the greenhouse effect, which contributes massively to global warming in relation to the phenomenal increase in temperature and in ground-level ozone.

2.4 OZONE (O3)

Ozone, scientifically called trioxygen, is an inorganic molecule with the chemical formula O3; it is a pale blue gas with a distinctively pungent smell (Control & Prevention 1997). Reports suggest that ground-level ozone (O3) can be rather harmful to the human respiratory system. In addition, research reports that it can result in several other detrimental conditions: severe exposure to ozone can upset the functioning of the lungs and can potentially increase inflammation (Bell et al. 2007). For instance, studies find that the mortality rate in urban areas is correlated with the effects of ozone (Bell et al. 2004). Other studies, such as the one by (Wilson et al. 2014), conclude that the effects of ozone can be non-linear, and specifically that extreme exposure to ground-level ozone can be dangerous for health. Therefore, in view of these dangers, it is critical to research and determine the factors which cause ground-level ozone to spread the most.

Being a secondary pollutant, ozone is created by a chemical reaction when nitrogen oxides (NOx) and volatile organic compounds (VOCs) are exposed to ultraviolet radiation from sunlight. According to (Jacob & Winner 2009), high ozone levels are associated with numerous meteorological drivers, mainly increased temperature, increased solar radiation, and low wind speed. However, research has yet to conclude what meteorological conditions mark an extreme ozone day as distinguished from one with merely high ozone levels.

In the last decade, many organizations have become concerned about the dramatic rise of ground-level ozone. (Vingarzan 2004), who conducted a review of ground-level ozone trends using data from the last century in Canada and the USA, concluded that ground-level ozone has increased dramatically over the last three decades. In response, the research community has attempted to propose statistical models able to predict the increase of ozone. For instance, (Ray & Ensor 2009) have proposed an agglomerative hierarchical clustering for identifying the most polluted areas in Houston, Texas in terms of ground-level ozone. In their study, the authors identified multiple factors that have a significant impact on ozone increments, such as wind speed, wind direction, and solar radiation.

In addition, (Adame et al. 2012) have proposed a k-means clustering approach with the Euclidean distance measure in order to identify peaks of ozone rates in an industrial area in Central-Southern Spain; the authors successfully identified several polluted plots. Another approach was proposed by (Ahmad & Aziz 2013), in which a statistical method of passive sampling was used to investigate air pollution in Pakistan. Furthermore, (Monteiro et al. 2012) have proposed a combination of statistical quantile regression and agglomerative hierarchical clustering in order to measure air pollution in terms of ground-level ozone.

Other researchers have attempted to identify characteristics of ground-level ozone, such as (Smith et al. 2013), who proposed a Hybrid Single Particle Lagrangian Integrated Trajectory (HySPLIT) model in order to characterize ground ozone concentration in the gulf of Texas. In their study they found that the lowest ozone concentrations are associated with trajectories that remained over the central Gulf for at least 48 hours, while higher concentrations are associated with trajectories that pass close to the northern and western Gulf Coast. (Wang et al. 2013) have addressed the problem of detecting ground-level ozone from a spatio-temporal aspect, proposing a nearest-neighbor clustering approach in order to identify spatio-temporal patterns of air pollution.

Another observational study, conducted by (Li et al. 2013), concentrated on pollution in Tangshan, north China, and relied mainly on statistical analysis. The study reported dramatic increases of O3 and NOx from 2008 to 2011, and attributed these increment rates to the extent of industries located in that city. In addition, (Hájek & Olej 2012) have conducted a comparative study of three regression approaches, Neural Network (NN), Support Vector Machine (SVM) and Fuzzy Logic (FL), for predicting ground-level ozone. Based on the Root Mean Square Error (RMSE), SVM showed superior performance in predicting ozone levels. Similarly, (Kandil et al. 2014) have examined two NN models, a feed-forward NN and a back-propagation NN, for ozone prediction. Multiple features were encoded and fed into the network, including temperature, humidity, wind speed, incoming solar radiation, sulfur dioxide and nitrogen dioxide; the feed-forward NN outperformed the other model.

In addition, (Sun et al. 2015) have conducted a comparison of two regression methods, SVM and a multi-layer perceptron NN, for identifying ozone levels in the Houston–Galveston–Brazoria area, Texas; results showed superior performance for SVM. (Tamas et al. 2016) have used three approaches in order to detect air pollution: Artificial Neural Network (ANN), Self-Organizing Map (SOM) and k-means clustering. Using hourly data, the results revealed two main pollutants, ozone (O3) and nitrogen dioxide (NO2).

On the other hand, (Akimoto et al. 2015) have conducted a long-term statistical study of ground-level ozone in Japan from 1990 to 2010, focusing on identifying the causes of the increment rates of ozone. The authors identified three main causes: (i) the decrease of the NO titration effect, (ii) the increase of transboundary transport, and (iii) the decrease of in-situ photochemical production. Similarly, in an observational study of the causes of ozone levels by (Christensen et al. 2015), the authors indicated that the Asian continent is one of the main sources affecting ground-level ozone in the Western United States.

2.5 CLUSTERING

Cluster analysis, or so-called clustering, is the process of aggregating similar data points in groups that are called clusters (Vora & Oza 2013). The world around us offers abundant data. On a day-to-day basis, people collect a great deal of information, store it, and represent it as data to further analyze and manage. Thus, one of the most fundamental steps towards processing such data is to form groups, or to classify and organize it into a set of categories or clusters. Classification is believed to have played a crucial and integral role in the long history of human development, and is regarded as one of the most pervasive primitive activities of human beings. People are ever curious and inquisitive to learn about new objects; to understand a new phenomenon, they strive to discover the features that describe or form those objects, and to identify their similarities and differences, or the generalized proximity of one object to another, by comparing and contrasting known objects against devised standards or criteria.

Fundamentally, classification systems may be supervised or unsupervised, depending on whether they assign new inputs to one of a finite number of discrete supervised classes or to unsupervised categories, respectively (Berkhin 2006). In supervised classification, the mapping from a set of input data vectors to a finite set of discrete class labels is modeled in terms of some mathematical function $y = f(\mathbf{x}, \mathbf{w})$, where $\mathbf{w}$ is a vector of adjustable parameters (Dembélé & Kastner 2003). The values of these parameters are determined by an inductive learning algorithm, also described as an inducer, which aims at minimizing an empirical risk functional (related to an inductive principle) on a finite data set of input–output examples, where N is the finite cardinality of the available representative data set (Kolatch 2001).

Unsupervised classification, also termed clustering or exploratory data analysis, does not involve labelled data (Jain & Dubes 1988). Rather than providing an exact characterization of unobserved samples generated from the same probability distribution, clustering aims at separating a finite unlabeled data set into a finite and discrete set of “natural,” concealed data structures (Baraldi & Alpaydin 2002). This places clustering outside the framework of unsupervised predictive learning problems. Notably, clustering is quite different from multidimensional scaling (perceptual maps), which aims at depicting all assessed objects in a manner that shrinks the dimensions to as few as possible while reducing topographical distortion. In addition, in practice, an array of (predictive) vector quantizers is also used for the analysis of (non-predictive) clustering (Rani & Sikka 2012).

Through algorithms, data are partitioned into groups of clusters, categories or sub-sets. Clusters have not yet been given a universally agreed-upon definition (Everitt et al. 2001). Generally, a large segment of researchers define and explain clusters by analyzing their internal homogeneity and their external separation (Hansen & Jaumard 1997): patterns in the same cluster should be similar to each other, while patterns in different clusters should not be. From the perspective of cluster analysis, it is critical that clear and meaningful methods and mechanisms be applied to examine this similarity and dissimilarity.

2.6 CLUSTERING ALGORITHMS

When constructing taxonomies of clustering algorithms, various starting points and criteria are used (Kavitha & Punithavalli 2010; Liu & Xiong 2011; Rani & Sikka 2012; Vora & Oza 2013). Drawing on the attributes and properties of the generated clusters, two classification techniques are widely used and agreed upon by many researchers: partitioning clustering and hierarchical clustering (Jain et al. 1999). In hierarchical clustering, data objects are grouped with a sequence of partitions, which may range from a single cluster containing all individuals to a collection of singleton clusters, or vice versa. Partitioning clustering, by contrast, splits data objects directly into some pre-specified number of clusters without forming a hierarchical structure. In the following pages, both kinds are explained in detail.

2.6.1 Hierarchical Clustering

Hierarchical clustering (HC) algorithms draw on a proximity matrix to arrange and organize data into a hierarchy of structures. A binary tree, or dendrogram, is used to show the results of hierarchical clustering. The root node of the dendrogram characterizes the entire data set and each leaf node is viewed as a data object. The intermediate nodes represent the extent to which objects are proximal to each other, and the height of the dendrogram typically reflects the distance between each pair of objects or clusters, or between an object and a cluster. To obtain the final clustering results, the dendrogram can be cut at various levels. This representation provides rich descriptions and helps visualize the main clustering structures of the data, particularly when actual hierarchical relations exist in the data, such as in evolutionary research on a multitude of species of organisms.

Hierarchical clustering (HC) algorithms are classified into divisive and agglomerative methods, as shown in Figure 2.1. The two work in divergent trajectories. Agglomerative clustering begins with singleton clusters and, through a series of merge operations, ultimately guides all objects into the same group. Divisive clustering works in the opposite direction: initially the whole data set stands within a single cluster, and the procedure splits clusters continuously until singleton clusters are formed. For a cluster containing several objects, all possible divisions into two sub-sets may need to be examined, which is rather expensive in terms of computation (Everitt et al. 2001); in view of this, divisive methods are mostly not practiced. One oft-repeated criticism of the classical hierarchical algorithms is that they lack robustness and are rather sensitive to noise and outliers.

Figure 2.1 Hierarchical clustering approaches

In practice, an object that has been assigned to a cluster is not considered again. This signifies a major limitation of HC algorithms: they cannot correct misassignments made previously. Another limitation is the computational complexity of HC, as most HC algorithms are costly, which does not allow them to be applied to large-scale data sets. Additionally, HC entails some other disadvantages: the methods tend to form spherical shapes and exhibit the reversal phenomenon, in which the usual hierarchical structure becomes distorted.
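The agglomerative process described above can be sketched in a few lines. The following is a minimal single-linkage sketch on one-dimensional points; the data values and the target number of clusters are illustrative assumptions, not part of the thesis experiments.

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering
# with single linkage on one-dimensional points; data values are
# illustrative only.
def single_linkage(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]          # start: one object per cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]                           # j > i, so this is safe
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 2))
# → [[1.0, 1.2, 5.0, 5.1], [9.0]]
```

Stopping the merge loop at different values of n_clusters corresponds to cutting the dendrogram at different levels, as discussed above.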

2.6.2 Partitioning-based

Contrary to the hierarchical clustering delineated in the previous section, which produces clusters successively by iterative fusions or divisions, partitioning clustering assigns a set of objects to clusters with no hierarchical structure. In principle, a specific criterion can be devised and all possible partitions evaluated to find the optimal one. However, this brute-force method is impractical on account of the computational expense it involves (Liu 1968): enumerating the partitions is infeasible even for a small clustering problem such as organizing 30 objects into 3 groups. Therefore, in view of these limitations, heuristic algorithms have been designed to seek near-optimal solutions. Figure 2.2 depicts the mechanism of aggregating data points using partitioning clustering.

Figure 2.2 Partitioning clustering

The criterion function is one of the most crucial factors in partitional clustering (Vora & Oza 2013), and the most widely applied criterion is the sum-of-squared-error function. To solve many practical problems, the K-means algorithm, which is simple and easy to implement, can be applied; it functions efficiently for compact and hyper-spherical clusters. The time complexity of K-means is low, so it can be applied to relatively large data sets, and parallel K-means techniques can be developed to accelerate the algorithm further (Jain 2010). Researchers have identified a number of drawbacks in the use of K-means, and an array of variants has appeared to overcome them. Here is a summary of some of the major disadvantages (Singh & Chauhan 2011):

I. So far, no universally efficient methodology exists for identifying the initial partition and the number of clusters. Different initial points lead to different convergence centroids. Generally, the effective strategy for overcoming this problem is to run the algorithm repeatedly with random initial partitions.

II. The iteratively optimal procedure of K-means cannot guarantee convergence to a global optimum. Stochastic optimization techniques, like simulated annealing (SA) and genetic algorithms, can find the global optimum at the price of expensive computation.

III. K-means is sensitive to outliers and noise. Even if an object is quite far away from the cluster centroid, it is still forced into a cluster and thus distorts the cluster shapes.

IV. The definition of “means” limits the application to numerical variables only. The K-medoids algorithm mentioned previously is a natural choice when the computation of means is unavailable, since the medoids do not need any computation and always exist.
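Drawback I above is commonly mitigated by restarting the algorithm from random initial partitions and keeping the run with the lowest sum of squared errors. The sketch below illustrates this for K-means on one-dimensional data; the data, the number of restarts, and the seed are illustrative assumptions.

```python
import random

# A minimal K-means sketch on one-dimensional data, restarted from several
# random initial centroids to lessen the initialization sensitivity noted
# in drawback I; data and parameter values are illustrative only.
def kmeans(points, k, n_restarts=10, n_iters=50, seed=0):
    rng = random.Random(seed)
    best_sse, best_centroids = float("inf"), None
    for _ in range(n_restarts):
        centroids = rng.sample(points, k)          # random initial centroids
        for _ in range(n_iters):
            # assignment step: each point joins its nearest centroid
            groups = [[] for _ in range(k)]
            for p in points:
                groups[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
            # update step: move each centroid to the mean of its group
            centroids = [sum(g) / len(g) if g else centroids[i]
                         for i, g in enumerate(groups)]
        # sum of squared errors for this restart
        sse = sum(min(abs(p - c) ** 2 for c in centroids) for p in points)
        if sse < best_sse:
            best_sse, best_centroids = sse, centroids
    return sorted(best_centroids)

print(kmeans([1.0, 1.2, 9.0, 9.2], 2))
```

With well-separated groups, every restart converges to centroids near 1.1 and 9.1; keeping the best of several restarts matters most when the groups overlap.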

2.6.3 Density-Based Clustering

Looking at data from the viewpoint of probability, data objects are presumably generated by several probability distributions: data points in different clusters arise from different probability distributions. These density functions may be derived from different families, such as the multivariate Gaussian, or from the same family with different parameters. If the distributions are known, locating a certain set of clusters amounts to estimating the parameters of a number of underlying models.

Fuzzy Clustering. Except for GGA, the clustering techniques discussed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. In fuzzy clustering, this restriction is relaxed, and an object can belong to all of the clusters with a certain degree of membership (Groenen & Jajuga 2001). This is particularly useful when the boundaries among the clusters are not well separated and are ambiguous. Moreover, the memberships may help discover more sophisticated relations between a given object and the disclosed clusters. FCM is one of the most popular fuzzy clustering algorithms.

Neural Networks-Based Clustering. Neural networks-based clustering has been dominated by SOFMs and adaptive resonance theory (ART). In competitive neural networks, active neurons reinforce their neighborhood within certain regions, while suppressing the activities of other neurons (so-called on-center/off-surround competition). Typical examples include LVQ and SOFM (Jelinek 1990; Kohonen 1990; Vesanto & Alhoniemi 2000). Intrinsically, LVQ performs supervised learning and is not categorized as a clustering algorithm (Pal et al. 1993), but its learning properties provide an insight into describing the potential data structure using the prototype vectors in the competitive layer. Pointing out the limitations of LVQ, including sensitivity to initialization and the lack of a definite clustering objective, Pal et al. (1993) proposed a general LVQ algorithm for clustering, known as GLVQ. They constructed the clustering problem as an optimization process based on minimizing a loss function, defined on the locally weighted error between the input pattern and the winning prototype, and also showed the relations between LVQ and online k-means.

2.6.4 Model-Based Clustering

The function of model-based clustering is to recover the original models from a certain set of data (Xiong & Yeung 2002). This approach seeks a model for each cluster and finds the model best suited to the data. For example, one may assume that cluster centroids are chosen randomly and that noise is subsequently added to them with a normal distribution; the model recovered from the generated data then defines the clusters (Fraley & Raftery 2002). Classically, model-based methods use either statistical approaches, e.g., COBWEB (Fisher 1987), or neural network approaches, e.g., ART (Carpenter & Grossberg 1988) or the Self-Organizing Map (Jelinek 1990). For clustering time-series data, researchers also use Self-Organizing Maps (SOM). As stated before, SOM, founded on neural networks, is a model-based clustering method that resembles processing that happens in the brain. For example, authors have used it to cluster time-series features (Wang et al. 2004). However, because SOM needs the dimension of the weight vector to be defined, it cannot work well with time-series of unequal length (Liao 2005). Moreover, there are also model-based approaches to clustering time-series data built on polynomial models (Bagnall & Janacek 2005), Gaussian mixture models (Biernacki et al. 2000), ARIMA (Corduas & Piccolo 2008), Markov chains (Ramoni et al. 2000) and Hidden Markov models (Bicego et al. 2003, 2004). In general, this family of methods has two main drawbacks. First, it is founded on user assumptions and requires set parameters, which might result in false or inaccurate results. Second, its process is slow and time-consuming, especially on bulky data sets (Andreopoulos et al. 2009).

2.7 HIERARCHICAL CLUSTERING OF TIME-SERIES

Hierarchical clustering, according to Kaufman et al. (1990), is a method of cluster analysis that builds a hierarchy of clusters through agglomerative or divisive algorithms. The agglomerative algorithm, also known as the bottom-up approach, takes every item as a cluster and then progressively joins the clusters. Contrastively, the divisive algorithm, also known as the top-down approach, initiates with a single cluster containing all the objects as a whole and then divides it until clusters with one object are obtained. A flaw in the quality of hierarchical clustering is the inability to adjust a cluster after dividing it in the divisive way, or after joining clusters in the agglomerative way. Consequently, hybrid clustering approaches are adopted, in which the hierarchical clustering algorithm is combined with another algorithm to resolve this inability. Furthermore, extensive work has been done to improve the performance of hierarchical clustering, such as Chameleon (Karypis et al. 1999), CURE (Guha et al. 1998) and BIRCH (Zhang et al. 1996), where the created clusters are refined.

In hierarchical clustering of time-series too, a nested hierarchy of similar groups is generated based on a pair-wise distance matrix of the time-series (Vlachos et al. 2003). Some studies suggest that hierarchical clustering has great visualization power in time-series clustering (Keogh & Pazzani 1998; Van Wijk & Van Selow 1999), a characteristic that makes it well suited to time-series clustering. Accordingly, Oates et al. (2000) suggest the use of agglomerative clustering to produce clusters of the experiences of an autonomous agent; they use Dynamic Time Warping (DTW) as a distance measure with a dataset containing 150 trials of real data. Additionally, another study by (Hirano et al. 2004) suggests the use of average-linkage agglomerative clustering, a type of hierarchical approach, for time-series clustering. It is worth noting that hierarchical clustering is often used to evaluate dimensionality reduction or distance metrics owing to its power in visualization. Similarly, Lin et al. (2003) recommend the Symbolic Aggregate Approximation (SAX) representation and use hierarchical clustering to evaluate their work; they found comparable results using SAX, hierarchical clustering and Euclidean distance.

On the other hand, unlike other algorithms, hierarchical clustering does not depend upon the number of clusters as an initial parameter, which is an extraordinary feature of this algorithm. It is also a strength in time-series clustering, because it is usually hard to define the number of clusters in real-world problems. In contrast with other algorithms, hierarchical clustering is capable of clustering time-series of irregular lengths (Agrawal et al. 1993). However, the possibility of clustering unequal time-series with this algorithm rests on the use of an appropriate elastic distance measure, such as Dynamic Time Warping (DTW) (Berndt & Clifford 1994) or Longest Common Subsequence (LCSS) (Banerjee & Ghosh 2001), to figure out the dissimilarity/similarity of the time-series. In fact, the lack of any need for prototypes in its process is what prepares this algorithm to accept unequal time-series. However, hierarchical clustering is essentially unable to deal effectively with large time-series collections (Wang et al. 2006) due to its quadratic computational complexity; as a result, it is limited to small datasets because of its poor scalability.
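As an illustration of how an elastic measure accommodates unequal lengths, the following is a minimal sketch of an LCSS-based dissimilarity between two series of different lengths; the matching threshold eps and the series values are illustrative assumptions, not the formulation used in any of the cited studies.

```python
# A minimal sketch of a Longest Common Subsequence (LCSS) dissimilarity:
# two points "match" when they differ by less than eps, so series of
# unequal length can still be compared; eps and data are illustrative only.
def lcss(a, b, eps=0.5):
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]     # L[i][j]: LCSS of prefixes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) < eps:    # points close enough: match
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # normalize to a dissimilarity in [0, 1] by the shorter series' length
    return 1.0 - L[n][m] / min(n, m)

print(lcss([1.0, 2.0, 3.0], [1.1, 1.9, 2.6, 3.1]))   # unequal lengths compare fine
```

A pair-wise matrix of such dissimilarities is all that hierarchical clustering needs as input, which is why no prototype of a fixed length is ever required.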

2.8 SIMILARITY/DISSIMILARITY MEASURE

This section looks at distance measurement methods for choosing the best similarity measure among the time-series being compared. In theoretical terms, the issue of time-series similarity/dissimilarity search was first projected by Agrawal et al. (1993) and has since become an essential theoretical issue in the data mining community.

Time-series clustering depends upon the distance measure to a high extent. Various measures can be applied to measure the distance among time-series. Some similarity measures are built on specific time-series representations, for example MINDIST, which is compatible with SAX (Lin et al. 2007), while others work apart from any representation method and are compatible with raw time-series. Traditionally, the distance between static objects in clustering is exact-match based, whereas in time-series clustering the distance is calculated approximately. It is important to be able to figure out the similarity of time-series when comparing time-series with irregular sampling time frames and lengths.

The key characteristic of the clustering process lies in the function that is used to identify the similarity between two data objects. Such data vary: they could be formed as raw values of equal or unequal length, or as a vector space of feature-pairs (Liao 2005). In particular, this section focuses on three kinds of distance/similarity measures, namely the Euclidean, Minkowski, and Dynamic Time Warping (DTW) measures, which are illustrated in the following sub-sections.

2.8.1 Euclidean Distance

Let $x = (x_1, x_2, \dots, x_P)$ and $y = (y_1, y_2, \dots, y_P)$ be two P-dimensional vectors; the Euclidean distance between them can then be measured as (Liao 2005):

$d(x, y) = \sqrt{\sum_{i=1}^{P} (x_i - y_i)^2}$  (2.1)

2.8.2 Minkowski Distance

Let $x$ and $y$ again be two P-dimensional vectors. The Minkowski distance is a generalization of the Euclidean distance, computed as follows (Liao 2005):

$d(x, y) = \left( \sum_{i=1}^{P} |x_i - y_i|^q \right)^{1/q}$  (2.2)

where q is a positive integer; q = 2 recovers the Euclidean distance. Table 3.6 shows the pseudo code of this measure.
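Equations (2.1) and (2.2) translate directly into code; in the minimal sketch below the test vectors are illustrative only.

```python
# A minimal sketch of Equations (2.1) and (2.2): the Minkowski distance of
# order q between two P-dimensional vectors, with q = 2 giving the
# Euclidean distance; the vectors are illustrative only.
def minkowski(x, y, q):
    """d(x, y) = (sum_i |x_i - y_i|^q)^(1/q), q a positive integer."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    # Equation (2.1) is the q = 2 special case of Equation (2.2)
    return minkowski(x, y, 2)

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y))        # → 5.0
print(minkowski(x, y, 1))     # q = 1 is the Manhattan distance → 7.0
```

Both measures compare vectors position by position, which is why, unlike DTW below, they require the two series to have equal length.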

2.8.3 Dynamic Time Warping (DTW)

DTW has been widely used to compare discrete sequences as well as sequences of continuous values (Liao 2005). The warping mechanism inside DTW aims to align two time-series sequences in order to identify correspondences. The two sequences can be arranged on the sides of a grid, with one on the top and the other along the left-hand side; both sequences start at the bottom left of the grid.

Inside each cell, a distance measure can be placed comparing the corresponding elements of the two sequences. To find the best match or alignment between the two sequences, one needs to find a path through the grid which minimizes the total distance between them. The procedure for computing this overall distance involves finding all possible routes through the grid and computing the overall distance for each one. The overall distance is the minimum of the sum of the distances between the individual elements on the path divided by the sum of the weighting function, where the weighting function is used to normalize for the path length. It is apparent that for any considerably long sequences, the number of possible paths through the grid will be very large.

Let Q and C be two time-series sequences, as shown in Figure 2.3. Dynamic time warping finds the distance by calculating the accumulated cost, using dynamic programming, on the similarity matrix built from the distances between each pair of points in the two time-series.

Figure 2.3 Dynamic Time Warping (DTW)

Let $Q = q_1, q_2, \dots, q_n$ and $C = c_1, c_2, \dots, c_m$ be two time-series sequences. DTW minimizes the difference between these series by constructing an $n \times m$ matrix in which the distance/similarity between $q_i$ and $c_j$ is calculated using the Euclidean distance.

A warping path $W = w_1, w_2, \dots, w_K$, where $\max(n, m) \le K < n + m - 1$, is a set of elements from the matrix that meets three constraints: boundary condition, continuity and monotonicity. The boundary condition constraint requires the warping path to start and finish in diagonally opposite corner cells of the matrix, that is, $w_1 = (1, 1)$ and $w_K = (n, m)$. The continuity constraint restricts the allowable steps to adjacent cells. The monotonicity constraint forces the points in the warping path to be monotonically spaced in time. The warping path that has the minimum distance/similarity between the two series is of interest. Hence, the DTW can be computed as follows:

$DTW(Q, C) = \min \sqrt{\sum_{k=1}^{K} w_k}$  (2.3)
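The minimization behind Equation (2.3) is evaluated by dynamic programming rather than by enumerating all paths. The sketch below uses the absolute difference as the local distance (squaring each cell and taking a final square root, as in Equation (2.3), would work equally well); the sequences are illustrative only.

```python
# A minimal dynamic-programming sketch of DTW: cell (i, j) holds the cost
# of the best warping path aligning the first i points of Q with the first
# j points of C, under the boundary, continuity and monotonicity
# constraints; the sequences are illustrative only.
def dtw(Q, C):
    n, m = len(Q), len(C)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0                                 # boundary condition
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(Q[i - 1] - C[j - 1])       # local distance d(q_i, c_j)
            # continuity/monotonicity: only steps from (i-1,j), (i,j-1), (i-1,j-1)
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]                                # cost at the (n, m) corner

print(dtw([1, 2, 3], [1, 2, 2, 3]))   # → 0.0 (the repeated 2 is absorbed by warping)
print(dtw([1, 2, 3], [1, 3, 3]))      # → 1.0
```

Note that the two sequences may have different lengths, and that filling the $n \times m$ matrix gives DTW its quadratic $O(nm)$ cost, which is the source of the scalability concern raised in Section 2.7.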

2.9 TIME SERIES CLUSTERING RELATED WORK

Several clustering approaches have been proposed for time-series data analysis. For instance, Donko et al. (DONKO et al.) have proposed a clustering approach for time-series analysis, concentrating on three sub-tasks: presenting the data measurements for a specific time sequence, presenting several conditions based on regions of similar states, and presenting these conditions between two consecutive time sequences. For this purpose, agglomerative hierarchical clustering (AHC) has been used, starting with each object in its own cluster and then combining similar clusters. An object linkage task is performed to identify the similarity between objects, for which the ‘single linkage’ parameter has been adjusted with the Euclidean distance similarity measure. AHC is then used to identify similar objects over time. Finally, the authors used both numeric and graphical representations to present the results.

Badr et al. (Badr et al. 2015) presented a regionalization tool developed on top of
agglomerative hierarchical clustering. The main motivation for this study was the
lack of accessible clustering implementations in the literature, which hinders
comparison between studies. The proposed tool contains several agglomerative
clustering methods that have been used in the literature, including Ward's method
and the linkage methods single-linkage, complete-linkage, McQuitty's, median and
centroid, allowing users to apply them to other clustering problems. In addition,
the tool contains a new clustering method designed particularly for regionalization,
namely the 'regional-linkage' method. The regional-linkage method is a modification
of the average-linkage method that minimizes inter-regional correlations between
region means. It also provides the ability to identify noisy elements for quality
control and to perform an objective tree cut based on correlation significance.

Serra & Zárate (Serra & Zárate 2015) provided an analytic study focusing on the
characterization of time series clustering. The authors concentrated on the changes
in the number of clusters and in cluster membership across the time-series
intervals. Using a temporal database (TDB), they applied both agglomerative
hierarchical clustering and partitioning clustering with Euclidean distance as the
distance function.

Bouguettaya et al. (Bouguettaya et al. 2015) proposed an efficient agglomerative
hierarchical clustering method, evaluated on a real-world movie ratings dataset.
This study was motivated by the main drawback of AHC: although AHC can accommodate
clustering without requiring predefined parameters (unlike k-means), it incurs a
high computational cost. The authors therefore modified AHC to build hierarchies
from a group of centroids rather than from the raw data points. This is done by
starting the bottom-up process in the middle of the hierarchy, while the lower part
of the hierarchy is clustered using a less expensive method such as k-means; this
combination of AHC and k-means is called the KnA method. Using the Euclidean
distance measure, the proposed method showed a significant improvement in time
consumption with roughly the same accuracy as standard AHC.

Vasimalla et al. (Kumar et al. 2015) proposed a semi-supervised classification
method using three distance measures: Euclidean Distance (ED), Dynamic Time Warping
(DTW) and DTW-D (the ratio of DTW over ED). Their study was motivated by the
problem of labeling real-world time series datasets. Because labeling a real-world
dataset is a challenging task, many researchers have instead used the
already-labeled UCR time series benchmark datasets for time series clustering
analysis. The authors therefore proposed a semi-supervised classification method in
which a small portion of real-world data is labeled for training. The experimental
results showed that DTW-D successfully enhanced the process of labeling time series
data.

Ferreira & Zhao (Ferreira & Zhao 2016) proposed a clustering approach for time
series community detection in social networks. The proposed method used multiple
distance measures, including ED, the infinity norm, DTW, Pearson correlation, the
wavelet transform and the integrated periodogram. Among these distance functions,
DTW showed the best performance.

On the other hand, many research efforts have proposed enhancements and
modifications of Dynamic Time Warping (DTW) to improve it for specific domains. For
instance, Efrat et al. (Efrat et al. 2007) addressed the problem of curve matching,
a drawback of DTW in which curves are sampled as sequences to identify the
similarity between two time series. The authors argued that such sampling of a
curve directly affects the quality of the results. They therefore proposed a
continuous DTW for calculating similarity among curves using exact and approximate
matching algorithms. The proposed continuous DTW was evaluated on a signature
verification dataset and outperformed the traditional (i.e. discrete) DTW.

In addition, Petitjean et al. (2011) proposed a global averaging mechanism for DTW
in the context of clustering time series sequences. The authors concentrated on the
drawback of using DTW with pairwise sequence similarity, in which two individual
sequences are compared with each other. The main drawback of pairwise matching lies
in its sensitivity to sequence ordering, which has a significant impact on the
quality of the results. Twenty UCR benchmark datasets were used to evaluate the
proposed averaging DTW, and the results showed superior performance compared to
pairwise matching.

In the context of gene expression, Yuan et al. (2011) found that the traditional
DTW cannot be applied to genes that differ significantly across time series, since
such differences may indicate a significant factor rather than a simple time shift.
Time shift estimation therefore appears to be insufficient for identifying the
similarity between two gene expressions. A modified version of DTW was thus
proposed to handle both time shift estimation and the detection of significant
indicators.

Finally, Zhu et al. (2012) addressed the problem of time consumption when using DTW
for large-scale time series clustering. The authors proposed a novel approximation
of DTW in which the DTW distance is bounded between the LB_Keogh and Euclidean
distance functions. The key characteristic of the proposed method lies in the
accurate approximation obtained by combining the two bounds, which is achieved by
estimating the best 'mixing weight' of the upper and lower bounds from a tiny
sample of true DTW distances. Many UCR benchmark datasets were used for evaluation,
and the results showed a significant reduction in time consumption when using the
proposed approximation of DTW.

2.10 GROUND-LEVEL OZONE CLUSTERING RELATED WORK

Various studies have tackled the problem of detecting ozone trends. For instance,
Solazzo et al. (Solazzo et al. 2012) conducted a comprehensive analysis of
surface-level ozone based on air quality in Europe and North America, in which an
ensemble clustering approach was used to group similar data.

On the other hand, Saithanu & Mekparyup (Saithanu & Mekparyup 2012) proposed
agglomerative hierarchical clustering with the Euclidean distance measure for
clustering ozone levels in eastern Thailand. In their study, the authors
concentrated on the significant factors that lead to increased ozone levels, such
as temperature, wind direction, humidity and wind speed.

Similarly, Austin et al. (Austin et al. 2014) concentrated on factors associated
with ozone levels, such as temperature, pressure and sea level, for ozone detection
using k-means clustering. The data used in that study are daily data collected from
Boston Logan airport. The authors attempted to identify the most appropriate number
of clusters, and the results showed that five clusters yielded the best
performance.

In addition, Malley et al. (Malley et al. 2014) proposed Hierarchical Clustering
Analysis (HCA) with non-negative matrix factorization for classifying ozone levels
in Europe. Multiple datasets related to ozone variation measurements for the period
1991-2010 were used, and the clustering was applied to identify the relationships
influencing ozone levels.

Finally, Ahmadi et al. (Ahmadi et al. 2015) applied two kinds of clustering,
k-means and agglomerative hierarchical clustering, to ozone level analysis. K-means
was used first to detect significant ozone patterns, and agglomerative hierarchical
clustering was then used to identify hourly ozone patterns. Finally, multiple
regression tasks were carried out to predict ozone based on seasons and zones.

2.11 SUMMARY

This chapter has provided a comprehensive literature review for ground-level ozone.
To do so, an investigation was first conducted on time series analysis in general,
followed by a narrower exploration of climate change analysis, in which ozone
analysis was tackled as one of the common issues behind climate change. After that,
clustering, the common method used for ozone analysis, was examined, and multiple
clustering approaches were discussed, including hierarchical, partitioning,
density-based and model-based approaches. In addition, one of the significant
aspects of clustering, the distance measure, was addressed by illustrating multiple
similarity functions. Finally, a critical analysis of the related work was
presented.
CHAPTER III

3 RESEARCH METHODOLOGY

3.1 INTRODUCTION

This chapter describes the research methodology developed to carry out the research
objectives. Section 3.2 illustrates in detail the research design of this study and
its phases. Section 3.3 discusses the first phase of the methodology, the dataset,
in which the details of the data used in the experiments are determined. Section
3.4 discusses the second phase, preprocessing, which is associated with turning the
data into an appropriate form. Section 3.5 discusses the third phase, clustering,
in which Agglomerative Hierarchical Clustering is described in detail. Section 3.6
discusses the distance measures used in this study, including Euclidean distance,
Minkowski distance, Dynamic Time Warping (DTW) and the proposed modified DTW.
Section 3.7 describes the evaluation method used to measure the performance of the
proposed method.

3.2 RESEARCH METHODOLOGY

This study aims to enhance Dynamic Time Warping (DTW) to discover climate change
patterns using hierarchical clustering. In order to accomplish the objectives of
the study determined in Chapter I, the research methodology is described here. As
shown in Figure 3.1, the general structure of the methodology consists of five main
phases: Problem Identification, Data Collection, Proposed Solution, Experiment and
Evaluation.

Phase 1: Problem Identification → Phase 2: Data Collection → Phase 3: Proposed
Solution → Phase 4: Experiment → Phase 5: Evaluation

Figure 3.1 General structure of the methodology



3.2.1 Problem Identification Phase

Problem Identification phase aims to identify the problem of the study. This can be
performed by reviewing the literature in terms of ground-level ozone analysis. Since,
clustering technique is one of the common method that has been used for ground-level
ozone analysis thus, an investigation will be conducted on the clustering approaches in
order to identify an appropriate clustering approach for ozone analysis. In addition,
one of the major issues behind clustering approaches lies on the distance measures
therefore, an exploration for the distance measures has been conducted in order to
identify the suitable measure. Finally, a critical analysis for the state of the work of
ground-level ozone clustering in order to determine the endure gaps that would be
tackled. The output of this phase is the problem statement. Figure 3.2 depicts this
phase.

Literature Review (investigating ground-level ozone clustering; examining
clustering approaches; exploring similarity/distance measures) → Problem statement

Figure 3.2 Problem Identification phase



3.2.2 Data Collection and Preparation

This phase covers the collection of the data used in the experiments. Two sets of
data are used: first, twenty UCR benchmark datasets, and second, a Malaysian
climate change dataset. In addition, this phase accommodates a data preparation
task to turn the data into a suitable form, using two tasks: transformation and
discretization. Figure 3.3 depicts this phase.

Datasets (UCR time-series benchmark datasets; Malaysian climate change dataset) →
Preprocessing / data preparation (Transformation; Discretization)

Figure 3.3 Data collection and preparation phase

3.2.3 Proposed Solution Phase

In this phase, the proposed solution for the selected problem is discussed. In
order to identify which distance measure performs best, a comparative analysis is
carried out for three distance measures: Euclidean distance, Minkowski distance and
Dynamic Time Warping (DTW). The comparison is held using Agglomerative Hierarchical
Clustering (AHC), since the literature review has shown that AHC demonstrates good
performance for time-series clustering compared to other approaches. In addition, a
modification is made to enhance the dynamic selection of DTW, and the modified DTW
is integrated with AHC. The output of this phase is the objectives of the study.
Figure 3.4 depicts this phase.

Comparative analysis: AHC with Euclidean (EU), Minkowski (Min) and Dynamic Time
Warping (DTW) → Enhancement: enhance the dynamic selection of DTW → Modified DTW
(M-DTW) → Integrate with AHC

Figure 3.4 Proposed Solution phase

3.2.4 Experiment Phase

This phase is associated with the experiments conducted to accomplish the research
objectives. Two main experiments are performed. First, the comparative study is
applied to the UCR benchmark datasets using agglomerative hierarchical clustering
with three distance measures: Euclidean distance, Minkowski distance and dynamic
time warping. The second experiment applies the modified DTW to a Malaysian climate
change dataset. After that, the obtained results are observed and analyzed. Figure
3.5 depicts this phase.

Experiment: AHC with EU, Min and DTW on the UCR time-series benchmark datasets; AHC
with the modified DTW on the Malaysian climate change dataset → Results

Figure 3.5 Experiment phase

3.2.5 Evaluation Phase

This phase aims to evaluate the results of the two experiments by assessing their
effectiveness. The first experiment is evaluated based on cluster density using
precision, recall and f-measure, and a statistical T-test is applied to measure the
significance of the changes. The second experiment is evaluated using an internal
validation method that focuses on the correctness of the instances contained in the
clusters. In addition, an association rules method is used to discover interesting
patterns. Figure 3.6 depicts this phase.

Experiment 1: density assessment (F-measure); significant changes (T-test) →
Results. Experiment 2: correctness assessment (internal validation); pattern
discovery (association rules) → Results

Figure 3.6 Evaluation phase

3.3 DATASET

This study aims to propose a modified Dynamic Time Warping for time series
clustering. For this purpose, two kinds of dataset are used: first, benchmark
datasets to demonstrate the enhancement of DTW; second, real data, namely the
Malaysian climate change dataset, to which the proposed enhancement of DTW is
applied. Both are illustrated in the following sub-sections.

3.3.1 UCR Benchmark Datasets

Twenty UCR benchmark datasets have been used in this study (Yanping et al. 2015).
These datasets contain time series sequences from multiple domains of interest,
such as finance, health and weather. Table 3.1 lists these datasets and their
details.

Table 3.1 UCR datasets

# Dataset # of Classes Length
1 TwoLeadECG 2 82
2 OSU Leaf 6 427
3 Phoneme 39 1024
4 ArrowHead 36 251
5 DistalPhalanxTW 6 80
6 FordA 2 500
7 FISH 7 463
8 50Words 50 270
9 Beef 5 470
10 Swedish Leaf 15 128
11 Face (all) 14 131
12 Computers 2 720
13 Lightning-7 7 319
14 Gun-Point 2 150
15 MiddlePhalanxOutlineAgeGroup 3 80
16 ProximalPhalanxOutlineCorrect 2 80
17 DistalPhalanxOutlineAgeGroup 3 80
18 BeetleFly 2 521
19 ProximalPhalanxOutlineAgeGroup 3 80
20 CBF 3 128

3.3.2 Malaysian Climate Change Dataset

Putrajaya is one of the developed cities of Malaysia. With dramatic economic
development and population expansion, several environmental pollution issues have
arisen, one of which is increasing ozone pollution. Such an increase has a
significant impact on human health (Monteiro et al. 2012), and several stations are
now employed to observe ozone trends. Data were collected from the Malaysian
Meteorological Department and the Drainage and Irrigation Department, Malaysia. The
data contain multiple types of meteorological measurements; however, this study
concentrates on ozone levels from various stations. Figure 3.7 shows a sample of
the data.

Figure 3.7 Sample of Malaysian climate change data

As shown in Figure 3.7, the data are recorded hourly for each day of the year 2006.
Each hour represents the ozone level, and an additional attribute represents the
average level per day.

3.4 PREPROCESSING

This phase prepares the data to make it more suitable for processing. Real-world
data typically include irrelevant data, noise and incomplete instances, and
handling them plays an essential role in improving the performance of the
prediction process (Isa et al. 2008). Hence, two tasks are used for this purpose:
cleaning and discretization. These tasks are illustrated in detail in the following
sub-sections.

3.4.1 Cleaning

This task handles missing values, which can cause incorrect matches in the
prediction process (Teegavarapu & Chandramouli 2005) and therefore have to be
handled. Table 3.2 shows a sample of data with missing values.

Table 3.2 Data with missing values

F1 F2 F3 F4 F5
22.3 87.6 2.31 2.78 2.31
26.4 88.9 5.74 ? 5.74
22.9 84.7 1.68 6.78 1.68
27.8 85.2 ? 5.46 ?
24.1 88.3 ? ? ?
26.5 ? ? ? ?
26.9 ? 2.69 1.64 2.69
29.3 84.2 10.4 2.14 10.4
21.2 ? ? 4.65 ?

As shown in Table 3.2, the data contain missing values, represented by the
character '?'. To handle them, this study uses the mean average mechanism to fill
in such instances: all observed values in the selected attribute are summed, and
the sum is divided by the number of observed records. For instance, in the second
attribute (F2), the missing values are filled by summing the six observed values
and dividing the result by six. Table 3.3 shows the same table after applying the
mean average mechanism.

Table 3.3 Mean Average mechanism

Temperature Humidity Rainfall Flow WL


22.3 87.6 2.31 2.78 2.79
26.4 88.9 5.74 4.29 5.74
22.9 84.7 1.68 6.78 1.25
27.8 85.2 5.03 5.46 4.56
24.1 88.3 5.03 4.29 4.56
26.5 86.4 5.03 4.29 4.56
26.9 86.4 2.69 1.64 6.47
29.3 84.2 10.4 2.14 8.46
21.2 86.4 5.03 4.65 4.56
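The mean-imputation step described above can be sketched in Python. This is an
illustrative sketch, not the thesis's implementation: the encoding of '?' as None,
the helper name and the two-decimal rounding are our assumptions (the thesis table
reports the F2 fill value as 86.4).

```python
def impute_mean(rows, missing=None):
    """Fill each column's missing entries with the mean of that
    column's observed (non-missing) values, rounded to 2 decimals."""
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not missing]
        mean = round(sum(observed) / len(observed), 2)
        for r in filled:
            if r[c] is missing:
                r[c] = mean  # replace the missing entry with the column mean
    return filled

# the F2 column of Table 3.2, with '?' encoded as None (six observed values)
f2 = [[87.6], [88.9], [84.7], [85.2], [88.3], [None], [None], [84.2], [None]]
filled_f2 = impute_mean(f2)  # missing entries become 518.9 / 6 ≈ 86.48
```

Note that the division is by the count of observed values (six here), not by the
total number of records; dividing by all nine records would bias the mean downward.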

3.4.2 Discretization

This task limits the attribute values to a specific interval, which facilitates the
clustering process by reducing the values to a particular range. Such
discretization is essential for specific algorithms such as hierarchical clustering
(Monira et al. 2010). The data are rescaled according to the following equation:

y = ((x − x_min) / (x_max − x_min)) × (y_max − y_min) + y_min (3.1)

where x is the data value to be normalized, x_max is the maximum value of all the
classes, x_min is the minimum value of all the classes, y is the normalized value,
y_max is the desired maximum value, and y_min is the desired minimum value.
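Equation (3.1) can be sketched as a small Python function (function and variable
names are ours; the sample values are illustrative):

```python
def min_max_scale(x, x_min, x_max, y_min=0.0, y_max=1.0):
    """Rescale x from the observed range [x_min, x_max] onto the
    desired range [y_min, y_max], as in Equation (3.1)."""
    return (x - x_min) / (x_max - x_min) * (y_max - y_min) + y_min

temps = [22.3, 26.4, 29.3, 21.2]  # a sample attribute column
lo, hi = min(temps), max(temps)
scaled = [min_max_scale(t, lo, hi) for t in temps]
# the observed minimum maps to 0.0 and the observed maximum to 1.0
```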



3.5 AGGLOMERATIVE HIERARCHICAL CLUSTERING (AHC)

AHC was investigated extensively in Chapter II. In this study, AHC is carried out
with four different similarity measures: Euclidean distance, Minkowski distance,
DTW and M-DTW. The similarity measure plays an essential role in time series
clustering techniques due to its significant impact on the overall performance
(Cha 2007). Table 3.4 shows the pseudo code of AHC with complete linkage.

Table 3.4 Pseudo code of AHC

Algorithm 1. AHC with complete-linkage

1. Input:
2. X = {x1, x2, …, xn} // set of vectors (time series elements)
3. P // proximity matrix that shows the distance d(i, j)
4. // between xi and xj
5. d(i, j) // distance between i and j using the Euclidean, Minkowski,
6. // DTW or M-DTW measure
7. Cr // r-th cluster
8. D(Cr, Cs) // distance between clusters Cr and Cs
9. K // sequence number
10. L(K) // distance level of the K-th clustering
11. Steps:
12. 1. Begin with n clusters, each containing one object, and set
13.    level L(0) = 0 and K = 0;
14. 2. Find the least dissimilar pair (Cr, Cs) in the current
15.    clustering, according to D(Cr, Cs) = min(D(Ci, Cj));
16. 3. Increment the sequence number: K = K + 1; merge clusters Cr and
17.    Cs into a single cluster to form the next clustering; set
18.    the level of this clustering to L(K) = D(Cr, Cs);
19. 4. Update the proximity matrix D by deleting the rows and
20.    columns corresponding to clusters Cr and Cs and adding a new
21.    row and column corresponding to the newly formed cluster.
22.    The proximity between the new cluster, denoted (Cr + Cs), and
23.    an old cluster Ck is defined as: D[Ck, (Cr + Cs)] =
24.    max(D[Ck, Cr], D[Ck, Cs]);
25. 5. If all objects are in one cluster, stop; else go to step 2.
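The steps of Algorithm 1 can be sketched as a naive Python implementation. This is
an illustrative sketch, not the thesis's C# code: the O(n³) pairwise search is
deliberately simple, the distance function is pluggable, and the stopping criterion
is phrased as a target number of clusters rather than a full dendrogram.

```python
def ahc_complete(points, dist, n_clusters):
    """Naive agglomerative hierarchical clustering with complete
    linkage: start with singleton clusters and repeatedly merge the
    least dissimilar pair, where the distance between two clusters
    is the maximum pairwise distance between their members."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: D(Ci, Cj) = max over member pairs
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# toy 1-D series values; any of the distance measures of Section 3.6
# could be plugged in via the dist argument
groups = ahc_complete([1.0, 1.2, 5.0, 5.1, 9.0],
                      lambda a, b: abs(a - b), n_clusters=3)
```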

3.6 DISTANCE MEASURES

The key characteristic of the clustering process lies in the function used to
identify the similarity between two data objects. Such data vary: they may be raw
values of equal or unequal length, or a vector space of feature pairs (Liao 2005).
This study utilizes four distance/similarity measures: Euclidean, Minkowski, DTW
and M-DTW, all investigated extensively in Chapter II. Table 3.5 shows the pseudo
code of the Euclidean distance.

Table 3.5 Pseudo code of Euclidean distance

Algorithm 2. Euclidean Distance

1. V1 = (v1,1, …, v1,n) // a vector with n time points
2. V2 = (v2,1, …, v2,n) // a vector with n time points
3. Euclidean(V1, V2) = sqrt( Σ i=1..n (v1,i − v2,i)² )

Table 3.6 shows the pseudo code of Minkowski.

Table 3.6 Pseudo code of Minkowski distance

Algorithm 3. Minkowski Distance

1. V1 = (v1,1, …, v1,n) // a vector with n time points
2. V2 = (v2,1, …, v2,n) // a vector with n time points
3. Minkowski(V1, V2) = ( Σ i=1..n |v1,i − v2,i|^p )^(1/p)
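Algorithms 2 and 3 can be sketched directly in Python (assuming equal-length
vectors; function names are ours):

```python
def euclidean(v1, v2):
    """Euclidean distance between two equal-length vectors (Algorithm 2)."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def minkowski(v1, v2, p):
    """Minkowski distance of order p (Algorithm 3); p = 2 reduces to
    the Euclidean distance and p = 1 to the Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)

d_eu = euclidean([0.0, 0.0], [3.0, 4.0])     # 5.0
d_mk = minkowski([0.0, 0.0], [3.0, 4.0], 2)  # 5.0, same as Euclidean
```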

Table 3.7 shows the pseudo code of DTW.

Table 3.7 Pseudo code of DTW

Algorithm 4. Dynamic Time Warping

1. Input:
2. S // a 2-dimensional cost matrix with (n+1) × (m+1) cells
3. V1 // the first vector, with n time points
4. V2 // the second vector, with m time points
5. i, j // loop indices; cost is a number
6. Steps:
7. Initialize the matrix
8. S[0, 0] = 0
9. FOR i = 1 to m DO LOOP
10.   S[0, i] = ∞
11. END
12. FOR i = 1 to n DO LOOP
13.   S[i, 0] = ∞
14. END
15. Populate the similarity matrix
16. FOR i = 1 to n DO LOOP
17.   FOR j = 1 to m DO LOOP
18.     cost = d(V1[i], V2[j]) // Euclidean distance
19.     S[i, j] = cost + MIN ( S[i-1, j] ,  // increment
20.                            S[i, j-1] ,  // decrement
21.                            S[i-1, j-1] ) // match
22.   END
23. END
24. Return S[n, m]
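Algorithm 4 can be sketched as a runnable Python function. The local cost is taken
as the absolute difference, which for scalar time points coincides with the
Euclidean point distance assumed in the pseudo code:

```python
import math

def dtw(v1, v2):
    """Dynamic Time Warping distance: fill an (n+1) x (m+1) cost
    matrix where each cell adds the local cost to the minimum of
    its three predecessor cells (increment, decrement, match)."""
    n, m = len(v1), len(v2)
    S = [[math.inf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(v1[i - 1] - v2[j - 1])  # local distance
            S[i][j] = cost + min(S[i - 1][j],      # increment
                                 S[i][j - 1],      # decrement
                                 S[i - 1][j - 1])  # match
    return S[n][m]

dtw([1, 2, 3], [1, 2, 2, 3])  # 0.0: the warp absorbs the repeated 2
```

The example illustrates why DTW suits time series of unequal length: the warping
path stretches one sequence against the other at zero cost where the shapes match.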

3.7 MODIFIED DYNAMIC TIME WARPING (M-DTW)

To illustrate the modification made to DTW, Figure 3.8 shows the similarity matrix
used to identify similar sequences between two time series, in which bright cells
indicate similarity and dark cells indicate dissimilarity. The objective is to find
the shortest path through this matrix, starting from the top-most left corner and
using dynamic programming to reach the point of origin, accumulating the sum of the
similarities of the visited points (the accumulated path). At each step, DTW picks
the minimum among the three neighbouring cells, adds it to the accumulated cost,
and continues until the point P(0, 0) is reached. However, when two or more of
these cells share the same minimum value, DTW selects one of them arbitrarily,
since any of them satisfies the minimum; this arbitrary choice may yield a path
other than the shortest path. DTW was therefore modified to ensure the selection of
the optimum shortest path.

Figure 3.8 DTW similarity matrix

As shown in Table 3.8, the pseudo code of the modified M-DTW is illustrated, in
which DTW is modified to ensure the selection of the optimum shortest path, as
demonstrated in lines 19 to 27.

Table 3.8 Pseudo code of M-DTW

Algorithm 5. Modified Dynamic Time Warping

1. Input:
2. S // a 2-dimensional cost matrix with (n+1) × (m+1) cells
3. V1 // the first vector, with n time points
4. V2 // the second vector, with m time points
5. i, j // loop indices; cost is a number
6. Steps:
7. Initialize the matrix
8. S[0, 0] = 0
9. FOR i = 1 to m DO LOOP
10.   S[0, i] = ∞
11. END
12. FOR i = 1 to n DO LOOP
13.   S[i, 0] = ∞
14. END
15. Populate the similarity matrix
16. FOR i = 1 to n DO LOOP
17.   FOR j = 1 to m DO LOOP
18.     cost = d(V1[i], V2[j]) // Euclidean distance
19.     IF S[i-1, j-1] equals S[i, j-1] THEN
20.       MINI = S[i-1, j-1] // tie: prefer the match (diagonal) move
21.     ELSE IF S[i-1, j-1] equals S[i-1, j] THEN
22.       MINI = S[i-1, j-1] // tie: prefer the match (diagonal) move
23.     ELSE
24.       MINI = MIN ( S[i-1, j] ,  // increment
25.                    S[i, j-1] ,  // decrement
26.                    S[i-1, j-1] ) // match
27.     S[i, j] = cost + MINI
28.   END
29. END
30. Return S[n, m]
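One reading of the M-DTW idea can be sketched in Python. This is an interpretation,
not a transcription: a literal tie test would also match the infinite border cells,
so this sketch applies the diagonal preference only among cells that attain the
true minimum, and it additionally backtracks the (now deterministic) warping path
to make the effect of the tie-breaking visible.

```python
import math

def mdtw(v1, v2):
    """DTW with deterministic tie-breaking: when the diagonal (match)
    predecessor ties for the minimum, it is always chosen, so the
    traced warping path is unique rather than arbitrary."""
    n, m = len(v1), len(v2)
    S = [[math.inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]  # chosen predecessor
    S[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(v1[i - 1] - v2[j - 1])
            diag, up, left = S[i - 1][j - 1], S[i - 1][j], S[i][j - 1]
            mini = min(diag, up, left)
            S[i][j] = cost + mini
            # prefer the diagonal move whenever it ties for the minimum
            if diag == mini:
                move[i][j] = (i - 1, j - 1)
            elif up == mini:
                move[i][j] = (i - 1, j)
            else:
                move[i][j] = (i, j - 1)
    # backtrack the unique warping path from (n, m) to the origin
    path, cell = [], (n, m)
    while cell != (0, 0):
        path.append(cell)
        cell = move[cell[0]][cell[1]]
    return S[n][m], list(reversed(path))

d, path = mdtw([1, 2, 3], [1, 2, 3])
# d == 0.0 and the path is the pure diagonal (1,1), (2,2), (3,3)
```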

3.8 EVALUATION

The experiments are conducted using the proposed AHC with four distance functions:
Euclidean distance, DTW, Minkowski and M-DTW, as illustrated earlier. Each
experiment is evaluated based on the distribution of members within the clusters.
To illustrate the validation method, let C be a cluster to be evaluated, with
member items {a, b, c, d}. If the majority of the members are associated with a
specific class label n, then the cluster is considered to represent class n, and
members labelled with other classes are counted as false positive instances. For
example, if a, b and c are labelled with class n, and d is labelled with class m,
then the cluster is considered class n with three true positive instances (a, b and
c) and one false positive (d). A contingency table is then used to calculate
precision, recall and f-measure. Table 3.9 shows the contingency table.

Table 3.9 Contingency table

Related Not-Related
Clustered TP FP
Not clustered FN TN

Here TP is an instance that is affiliated to a cluster and related to it, FP is an
instance that is affiliated to a cluster but not related to it, FN is an instance
that is not affiliated to a cluster although it is related to it, and TN is an
instance that is neither affiliated to a cluster nor related to it. Hence,
precision, recall and f-measure can be calculated with the following equations:

Precision = TP / (TP + FP) (3.2)

Recall = TP / (TP + FN) (3.3)

F-measure = 2 × Precision × Recall / (Precision + Recall) (3.4)

Since the evaluation is conducted on several datasets, a statistical T-test is also
needed.
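The majority-label evaluation described above can be sketched in Python. This is a
simplified illustration (function and variable names are ours, and clusters that
share a majority class would be double-counted in this sketch):

```python
from collections import Counter

def cluster_f_measure(clusters, labels):
    """Majority-label cluster evaluation: each cluster takes the
    class of the majority of its members; those members count as TP,
    the remaining members as FP, and same-class instances that ended
    up in other clusters as FN."""
    tp = fp = fn = 0
    class_totals = Counter(labels.values())
    for members in clusters:
        majority, count = Counter(labels[m] for m in members).most_common(1)[0]
        tp += count                            # majority-class members
        fp += len(members) - count             # members of other classes
        fn += class_totals[majority] - count   # majority class elsewhere
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# the {a, b, c, d} example from the text: a, b, c are class n, d is class m
labels = {"a": "n", "b": "n", "c": "n", "d": "m"}
p, r, f = cluster_f_measure([["a", "b", "c", "d"]], labels)
# p = 3/4, r = 3/3, f = 6/7
```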

According to Anderson (1984), the T-test is a test of statistical significance that
indicates whether or not the difference between two groups' averages most likely
reflects a "real" difference in the population from which the groups were sampled.

3.9 SUMMARY

This chapter has discussed the research methodology of the study and its phases.
The problem identification phase was discussed in detail, as was the development
phase, while more attention was given to the evaluation phase, since it is
associated with the experiments conducted. These experiments were illustrated by
identifying the datasets, the preprocessing tasks and the application of the
proposed method. Finally, the evaluation methods were explained.
CHAPTER IV

4 COMPARATIVE ANALYSIS OF THREE DISTANCE MEASURES FOR


AGGLOMERATIVE HIERARCHICAL TIME SERIES CLUSTERING

4.1 INTRODUCTION

Time series analysis has emerged in response to the evolution of chronologically
represented data, where data are recorded at time intervals (Fu 2011). There are
many kinds of time series data, arising in areas such as finance, weather
forecasting and pattern recognition (Wismüller et al. 2002). The common task in
time series data mining is identifying similar sequences, a process performed using
clustering techniques.

Clustering aims to group similar data in the same cluster based on a similarity
function (Han et al. 2011). Clustering can be performed using two main approaches:
partitioning (e.g. k-means) and hierarchical clustering (e.g. agglomerative
clustering) (Badr et al. 2015). Each clustering technique is integrated with a
particular similarity (distance) measure that identifies the similarity among
objects, and integrating an appropriate similarity measure with an appropriate
clustering technique is a challenging task (Kalpakis et al. 2001).

Several similarity functions have been proposed for integration with clustering
techniques, such as the Euclidean distance, Minkowski distance and Dynamic Time
Warping (DTW) measures. DTW has been widely used for time series data due to its
ability to identify sequential correspondences, and several research efforts have
integrated DTW with the k-means clustering technique (Petitjean et al. 2011).
However, few research efforts have addressed the integration of Agglomerative
Hierarchical Clustering (AHC) with DTW for time-series clustering analysis. AHC is
an appropriate clustering method for time series due to its hierarchical manner,
and it has several advantages reported in the literature. First, AHC provides a
comprehensible cluster definition process that groups the most similar members into
smaller clusters (Manning et al. 2008). Second, it provides clusters that are
easier to interpret and more predictable than those of partitioning clustering
methods, which tend to produce unstructured sets of clusters (Cimiano et al. 2004).
Third, it facilitates cluster validation, i.e. unlike partitioning clustering
methods, it does not require a predefined number of clusters (Dezfuli 2011).

This chapter presents a comparative analysis of three distance measures, Dynamic
Time Warping (DTW), Euclidean and Minkowski, for time series clustering using
Agglomerative Hierarchical Clustering (AHC). Section 4.2 describes the experiment
setting, in which the parameters of the proposed method are declared. Section 4.3
presents the results obtained by the proposed method and their analysis. Section
4.4 provides a discussion in which the results are related to the state of the art.

4.2 EXPERIMENT SETTING

In order to illustrate the experiment setting, it is necessary to identify the datasets
used. Twenty UCR benchmark datasets (Keogh & Folias 2002) were used in the
experiments (illustrated in Chapter III).

Consequently, the proposed agglomerative hierarchical clustering was carried out
with the three distance measures: Euclidean, Minkowski and DTW. The method was
implemented in the C# programming language using complete linkage.
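The complete-linkage agglomeration just described can be sketched as follows. This is a naive O(n³) illustration in Python rather than the thesis's C# implementation; the pairwise distance matrix is assumed to be precomputed with any of the three measures.

```python
def complete_linkage_ahc(dist, k):
    """Naive agglomerative clustering with complete (max) linkage.

    dist : symmetric matrix of pairwise distances between n objects
    k    : stop when this many clusters remain
    Returns a list of clusters, each a list of object indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: cluster distance = max pairwise distance
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    return clusters

# Four points on a line form two obvious groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
print(complete_linkage_ahc(dist, 2))  # [[0, 1], [2, 3]]
```

Complete linkage tends to produce compact, roughly equal-diameter clusters, which is why it is a common choice when cluster counts are compared against class labels.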

The comparison among the three distance measures is based on the common
information retrieval metrics: precision, recall and f-measure. The datasets are
annotated, i.e. each is provided with a class label, and the class label for each dataset
indicates the number of classes it contains. Therefore, the application of AHC with
each distance measure is evaluated based on the closeness between the number of
clusters and the number of classes for each dataset. For example, the ‘CBF’ dataset,
one of the UCR benchmarks, has a class label of ‘3’. Applying the proposed AHC
with the three distance measures on this dataset yields the f-measure results for
multiple numbers of clusters shown in Table 4.1.

Table 4.1 Results of AHC using the three distance measures for ‘CBF’ dataset

# Clusters (Class = 3)   DTW      Minkowski   Euclidean
2                        0.7228   0.8333      0.7285
3                        0.7176   0.5305      0.7113
4                        0.6169   0.4812      0.5796
5                        0.6018   0.4516      0.5870
6                        0.5931   0.3763      0.5839
7                        0.5795   0.3583      0.5289
8                        0.5275   0.3135      0.4802
9                        0.4825   0.2972      0.4680
10                       0.4463   0.2675      0.4325

Since the class label is ‘3’, as shown in Table 4.1, the comparison among the three
distance measures is performed at 3 clusters, and superiority is assigned to the highest
f-measure at that setting. Hence, DTW outperforms the other measures with an
f-measure of 71.76%, followed by Euclidean with 71.13%, while Minkowski achieved
the lowest value of 53.05%. Although Minkowski achieved an f-measure of 83.33%
at 2 clusters, that value is not considered because the number of clusters (i.e. 2)
disagrees with the class label (i.e. 3).
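The f-measure evaluation used throughout can be sketched as below. The thesis does not spell out its exact variant, so a common class-to-cluster matching formulation, weighted by class size, is assumed here.

```python
def clustering_f_measure(labels, clusters):
    """Set-based F-measure of a clustering against gold class labels.

    labels   : gold class label per object
    clusters : predicted cluster id per object
    Each class is matched with the cluster maximising F, weighted by
    class size (an assumed variant; the thesis's exact formula is not given).
    """
    n = len(labels)
    total = 0.0
    for c in set(labels):
        class_members = {i for i in range(n) if labels[i] == c}
        best_f = 0.0
        for g in set(clusters):
            cluster_members = {i for i in range(n) if clusters[i] == g}
            overlap = len(class_members & cluster_members)
            if overlap == 0:
                continue
            precision = overlap / len(cluster_members)
            recall = overlap / len(class_members)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += (len(class_members) / n) * best_f
    return total

print(clustering_f_measure([0, 0, 1, 1], [5, 5, 7, 7]))  # 1.0 — perfect clustering
```

Cluster ids need not match label values: only the grouping matters, which is why precision and recall are computed over member overlaps rather than label equality.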

Therefore, the evaluation mechanism is based on the closeness between the number
of clusters and the class labels. The following section provides the analysis of the
results for all datasets based on this mechanism.

4.3 RESULTS OF AHC WITH DTW, MINKOWSKI AND EUCLIDEAN DISTANCE MEASURES FOR ALL DATASETS

This section presents the results of AHC with the three distance measures, DTW,
Euclidean and Minkowski, for all datasets. As mentioned earlier, the results are
reported as f-measure. Note that the comparison among the three distance measures
is made based on the closeness between the number of clusters and the class label.
Table 4.2 depicts these results (for all results see Appendix A).

As shown in Table 4.2, the correspondences between the classes and the numbers of
clusters have been highlighted. For DTW, correspondences occurred at nine datasets:
“1, 3, 6, 12, 14, 15, 16, 18, 19”. For Euclidean, correspondences occurred at seven
datasets: “1, 6, 12, 14, 16, 17, 18”. Finally, for Minkowski, correspondences occurred
at six datasets: “1, 6, 12, 14, 16, 18”. Figure 4.1 represents the correspondences
among the three distance measures.

Figure 4.1 Correspondences representation for the three distance measures


Table 4.2 Results of AHC with the three distance measures for all datasets

#   Dataset   Class   Length   f-measure: DTW / Minkowski / Euclidean   # Clusters: DTW / Minkowski / Euclidean
1 TwoLeadECG 2 82 11.41 6.26 9.07 2 2 2
2 OSU Leaf 6 427 45.54 33.3 36.9 3 3 3
3 Phoneme 39 1024 39.38 28.43 38.43 39 144 194
4 ArrowHead 3 251 41.8 43.57 41.57 4 2 2
5 DistalPhalanxTW 6 80 44.71 48.94 41.27 2 4 4
6 FordA 2 500 51.28 51.33 50.65 2 2 2
7 FISH 7 463 51.23 40.67 41.09 4 3 3
8 50Words 50 270 49.55 41.35 42.88 26 34 45
9 Beef 5 470 49.91 50.91 49.71 2 2 2
10 Swedish Leaf 15 128 46.92 38.43 38.81 3 5 9
11 Face (all) 14 131 59.54 40.36 34.9 2 2 19
12 Computers 2 720 54 60.32 51.99 2 2 2
13 Lightning-7 7 319 56.18 47.18 47.36 6 2 5
14 Gun-Point 2 150 70.83 57.7 61.14 2 2 2
15 MiddlePhalanxOutlineAgeGroup 3 80 70.06 60.14 61.08 3 2 2
16 ProximalPhalanxOutlineCorrect 2 80 74.85 57.55 54.55 2 2 2
17 DistalPhalanxOutlineAgeGroup 3 80 70.45 48.14 67.17 2 2 3
18 BeetleFly 2 521 76.22 85.18 55.03 2 2 2
19 ProximalPhalanxOutlineAgeGroup 3 80 78.18 66.83 43.93 3 2 2
20 CBF 3 128 71.76 52.69 71.14 2 2 2
Apart from the correspondences, and in more specific detail, Minkowski obtained its
greatest f-measure on the ‘BeetleFly’ dataset, achieving 85.18%; Euclidean obtained
its greatest f-measure on the ‘CBF’ dataset, achieving 71.14%; and DTW obtained its
greatest f-measure on ‘ProximalPhalanxOutlineAgeGroup’, achieving 78.18%.

The results show that DTW outperformed both the Minkowski and Euclidean
distance measures in terms of f-measure for 14 out of 20 datasets:
‘1, 2, 3, 7, 8, 10, 11, 13, 14, 15, 16, 17, 19, 20’. In contrast, Minkowski outperformed
the other distance functions for 6 datasets: '4, 5, 6, 9, 12, 18'. The Euclidean distance
function did not obtain the highest f-measure on any dataset.

As can be seen in Table 4.2, when the time series were short, DTW outperformed
Minkowski (with the exception of Dataset 5), whereas when the series were long,
Minkowski outperformed DTW. Hence, DTW demonstrated better performance on
short time series compared to long ones. Figure 4.2 depicts the comparison among
the three distance measures based on f-measure.

In order to determine the significance of applying DTW, a statistical test, namely the
T-test, was used. This test estimates the effectiveness of the results by comparing the
two distance measures Minkowski and DTW; Euclidean was excluded because it
never obtained the highest results. The p-value obtained between Minkowski and
DTW is 0.0004. Since this result is less than 0.05, DTW demonstrates a significant
enhancement.
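For reference, the paired test statistic underlying such a comparison can be computed as in the sketch below. The thesis does not state which t-test software or variant was used, so a dependent-samples (paired) formulation is assumed; the p-value would still be read from a t-distribution with n − 1 degrees of freedom.

```python
from math import sqrt

def paired_t_statistic(x, y):
    """t statistic of a paired (dependent-samples) t-test.

    x, y : matched score lists (e.g. per-dataset f-measures of two measures)
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    return mean / sqrt(var / n)

# Hypothetical per-dataset scores for two measures.
print(paired_t_statistic([2, 4, 6], [1, 1, 3]))  # ≈ 3.5
```

A paired test is appropriate here because both distance measures are evaluated on the same 20 datasets, so the per-dataset differences, not the raw scores, carry the signal.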

Figure 4.2 Comparison among the three distance functions in terms of f-measure

In fact, the comparison between DTW and both the Minkowski and Euclidean
distance measures essentially constitutes a comparison against baselines, since
Euclidean distance in particular has been used as the state of the art in many
clustering approaches proposed for time series data (Bouguettaya et al. 2015;
Ferreira & Zhao 2016; Kumar et al. 2015; Serra & Zárate 2015).

4.4 DISCUSSION

As shown in the results section, DTW outperformed the other distance measures.
This was expected from the literature, where several clustering approaches using
DTW have been proposed for time series in various domains (Al-Naymat et al. 2009;
Niennattrakul & Ratanamahatana 2007; Petitjean et al. 2011; Yuan et al. 2011; Zhu et
al. 2012). Multiple comparative studies have also compared different distance
measures with clustering techniques.

For instance, Ferreira and Zhao (2016) proposed a clustering approach for time series
community detection in social networks using multiple distance measures, including
Euclidean distance, Infinite Norm, DTW, Pearson Correlation, Wavelet Transform
and Integrated Periodogram. Among these distance functions, DTW showed superior
performance. Similarly, Kumar et al. (2015) proposed a time series clustering using
Euclidean distance and DTW, where DTW again showed superior performance.

Based on the outperformance of DTW in these experiments and the results reported
in the literature, DTW demonstrates competitive performance compared to other
distance measures.

4.5 SUMMARY

This chapter has presented a comparative analysis of three distance measures:
Dynamic Time Warping (DTW), Minkowski and Euclidean. These distance measures
were applied with Agglomerative Hierarchical Clustering (AHC) for time-series
clustering. In conclusion, DTW obtained superior results compared to the other
distance measures.

CHAPTER V

5 MODIFIED DYNAMIC TIME WARPING

5.1 INTRODUCTION

Clustering time series data aims to identify sequential correspondences between time
sequences (Rani & Sikka 2012). Several domains have been examined in the context
of time series, such as finance, weather forecasting and pattern recognition
(Wismüller et al. 2002).

Clustering can be performed using two main approaches: partitioning (e.g. k-means
clustering) and hierarchical clustering (e.g. agglomerative clustering) (Badr et
al. 2015). Each clustering technique is integrated with a particular similarity
(distance) measure that identifies similarity among the objects, and integrating an
appropriate similarity measure with an appropriate clustering technique is a
challenging task (Kalpakis et al. 2001).

Several similarity functions have been proposed for integration with clustering
techniques. Dynamic Time Warping (DTW) is one of the common similarity functions
and has demonstrated competitive results compared to other measures. DTW aims to
find the shortest path in the process of identifying sequential matches. However, there
are points at which a random selection can take place in the dynamic programming.
Such randomization may lead to a longer path that is driven away from the optimal
one. This chapter proposes a modified DTW that enhances the dynamic selection of
the shortest path.
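To illustrate where the arbitrariness arises, the sketch below backtracks a warping path through an already filled DTW cost matrix; preferring the diagonal move whenever accumulated costs tie is one deterministic policy that avoids needlessly long paths. This is only an illustration of the issue, not the actual M-DTW modification presented in this chapter.

```python
def backtrack_path(cost):
    """Recover a warping path from a filled DTW cost matrix.

    When two or three predecessor cells tie on accumulated cost, the
    standard formulation leaves the choice arbitrary; here ties are
    broken toward the diagonal, which keeps the path short.
    """
    i, j = len(cost) - 1, len(cost[0]) - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], 0, i - 1, j - 1))  # diagonal first
        if i > 0:
            candidates.append((cost[i - 1][j], 1, i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], 2, i, j - 1))
        # tuple comparison: min cost wins; on cost ties, lowest priority (diagonal) wins
        _, _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path

# With an all-zero cost matrix every move ties; the diagonal preference
# yields the 2-step path instead of a 3-step detour.
print(backtrack_path([[0, 0], [0, 0]]))  # [(0, 0), (1, 1)]
```

A purely random tie-break could instead take the vertical-then-horizontal detour, producing a longer warping path for the same accumulated cost, which is the behaviour the modification targets.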

The modified Dynamic Time Warping (M-DTW) is carried out using Agglomerative
Hierarchical Clustering (AHC) for time-series analysis. Section 5.2 describes the
experiment setting in which the parameters of the proposed method are declared.
Section 5.3 presents the results obtained by the proposed method and their analysis.
Section 5.4 provides a discussion in which these results are related to the state of the
art.

5.2 EXPERIMENT SETTING

This section identifies the parameters of applying agglomerative hierarchical
clustering with the proposed modified DTW. Twenty UCR benchmark datasets
(Keogh & Folias 2002) were used in the experiments (see Chapter III for more
details).

Consequently, the proposed agglomerative hierarchical clustering was carried out,
implemented in the C# programming language with complete linkage. Since DTW
demonstrated the best results compared to the other similarity measures (see Chapter
IV), the modified DTW is compared against standard DTW. Accordingly, AHC is
applied with both DTW and the modified DTW.

The comparison between DTW and the modified DTW is based on the common
information retrieval metrics: precision, recall and f-measure. The datasets are
annotated, i.e. each is provided with a class label, and the class label for each dataset
indicates the number of classes it contains. Therefore, the application of AHC with
each distance measure is evaluated based on the closeness between the number of
clusters and the number of classes for each dataset. For example, the ‘TwoLeadECG’
dataset, one of the UCR benchmarks, has a class label of ‘2’. Applying the proposed
AHC with the DTW and modified DTW distance measures on this dataset yields the
f-measure results for multiple numbers of clusters shown in Table 5.1.

Table 5.1 Results of AHC using DTW and modified DTW for ‘TwoLeadECG’ dataset

# Clusters (Class = 2)   DTW       M-DTW
2                        0.76280   0.79608
3                        0.55124   0.50527
4                        0.46581   0.40339
5                        0.41893   0.33850
6                        0.36434   0.29213
7                        0.32221   0.25632
8                        0.28788   0.22921
9                        0.26345   0.20880
10                       0.26636   0.20174

Since the class label is ‘2’, as shown in Table 5.1, the comparison between the two
distance measures is performed at 2 clusters, and superiority is assigned to the highest
f-measure at that setting. Hence, M-DTW outperforms DTW by achieving 79.61%
f-measure, whereas DTW obtained 76.28%. Note that a distance measure may obtain
its greatest f-measure at a number of clusters different from the class label; in that
case, the evaluation still compares the number of clusters against the class label.

Therefore, the evaluation mechanism is based on the closeness between the number
of clusters and the class labels. The following section presents the analysis of the
results for all datasets based on this mechanism.

5.3 RESULTS OF DTW AND MODIFIED DTW FOR ALL DATASETS

This section presents the results of AHC with the two distance measures, DTW and
the modified DTW, for all datasets. As mentioned earlier, the results are reported as
f-measure. Note that the comparison between the two distance measures is made
based on the closeness between the number of clusters and the class label. Table 5.2
depicts these results (for all results see Appendix A).

As shown in Table 5.2, the correspondences between the classes and the numbers of
clusters have been highlighted. For DTW, correspondences occurred at nine datasets:
“1, 3, 6, 12, 14, 15, 16, 18, 19”. For the modified DTW, correspondences occurred at
twelve datasets: “1, 2, 3, 4, 6, 12, 14, 15, 16, 18, 19, 20”. Figure 5.1 represents the
correspondences between DTW and the modified distance measure.

Figure 5.1 Correspondences representation for the two distance measures



Table 5.2 Results of AHC with DTW and modified DTW for all datasets

#   Dataset   Class   Length   f-measure: DTW / M-DTW   # of Clusters: DTW / M-DTW
1 TwoLeadECG 2 82 11.41 12.28 2 2
2 OSU Leaf 6 427 45.54 38.51 3 6
3 Phoneme 39 1024 39.38 39.97 39 39
4 ArrowHead 3 251 41.8 49.56 4 3
5 DistalPhalanxTW 6 80 44.71 51.16 2 2
6 FordA 2 500 51.28 52.25 2 2
7 FISH 7 463 51.23 52.52 4 5
8 50Words 50 270 49.55 52.61 26 26
9 Beef 5 470 49.91 56.98 2 2
10 Swedish Leaf 15 128 46.92 51.21 3 3
11 Face (all) 14 131 59.54 61.77 2 2
12 Computers 2 720 54 63.84 2 2
13 Lightning-7 7 319 56.18 64.41 6 4
14 Gun-Point 2 150 70.83 67.07 2 2
15 MiddlePhalanxOutlineAgeGroup 3 80 70.06 67.77 3 3
16 ProximalPhalanxOutlineCorrect 2 80 74.85 70.22 2 2
17 DistalPhalanxOutlineAgeGroup 3 80 70.45 73.21 2 2
18 BeetleFly 2 521 76.22 78.95 2 2
19 ProximalPhalanxOutlineAgeGroup 3 80 78.18 81.32 3 3
20 CBF 3 128 71.76 93.33 2 3

Apart from the correspondences, and in more specific detail, DTW obtained its
greatest f-measure on ‘ProximalPhalanxOutlineAgeGroup’, achieving 78.18%,
whereas the modified DTW obtained its greatest f-measure on the ‘CBF’ dataset,
achieving 93.33%.

DTW outperformed the modified DTW in terms of f-measure for four datasets,
namely ‘OSU Leaf’, ‘Gun-Point’, ‘MiddlePhalanxOutlineAgeGroup’ and
‘ProximalPhalanxOutlineCorrect’. This is due to a limitation of AHC in which cluster
merges cannot be undone (Gao et al. 2010); in these cases the modification of DTW
leads to merging irrelevant clusters. Much like an incorrect rejection of a true
hypothesis that happens to yield a better outcome, the randomization in standard
DTW's shortest-path selection can occasionally sidestep this limitation of AHC.

On the other hand, the modified DTW outperformed DTW for the remaining 16
datasets. In this respect, the modification of DTW to acquire the shortest path has
successfully enhanced the quality of the clustering results. Figure 5.2 shows the
performance of both DTW and the modified DTW.

Figure 5.2 Comparison between the two distance functions in terms of f-measure

In order to determine the significance of applying the modified DTW, a statistical
test, namely the T-test, was used. This test estimates the effectiveness of the results
by comparing the two distance measures, DTW and the modified DTW. The p-value
obtained between DTW and the modified DTW is 0.0210. Since this result is less
than 0.05, the modified DTW demonstrates a significant enhancement.

5.4 DISCUSSION

According to Niennattrakul and Ratanamahatana (2007), DTW has demonstrated
good performance compared to other distance measures. Nevertheless, several
researchers have attempted to modify DTW: Zhu et al. (2012) modified DTW to
reduce its time consumption, Yuan et al. (2011) adapted DTW for a specific domain,
namely gene expression, and Petitjean et al. (2011) modified DTW to identify
similarity via a global averaging mechanism rather than pair-wise sequence matching.
In line with these modifications, this study has proposed a modified DTW that
optimizes the acquired shortest path. The results show superior performance
compared to the standard DTW.

5.5 SUMMARY

This chapter has presented a modified Dynamic Time Warping (M-DTW) for
agglomerative hierarchical time-series clustering. It began with the experiment
setting, in which the evaluation method was identified: the results of the modified
DTW are compared with those of DTW. Subsequently, a critical analysis of the
results was made.
CHAPTER VI

6 GROUND LEVEL OZONE CLUSTERING IN PUTRAJAYA USING THE PROPOSED M-DTW WITH AHC

6.1 INTRODUCTION

Putrajaya is one of the developed cities of Malaysia. With dramatic economic
development and population expansion, several environmental pollution issues have
arisen; one of these is increasing ozone pollution. This increase has a significant
impact on human health (Monteiro et al. 2012). Several stations are employed
nowadays to observe ozone trends. In order to analyze such trends, machine learning
techniques, especially clustering, offer a great opportunity for detecting significant
patterns. Clustering aims to aggregate similar data points within clusters (Rani &
Sikka 2012); in this manner, similar trends can be aggregated in a single group,
which facilitates cause analysis. However, the key challenge behind clustering
ozone levels lies in the representation of the data, which is recorded over time
(Steinbach et al. 2003).

Time series data emerged as a response to the evolution of chronologically
represented data, where observations are recorded at time intervals (Fu 2011). There
are many kinds of time series data, such as financial, weather forecasting and pattern
recognition data (Wismüller et al. 2002). The common task in time series data mining
is identifying similar sequences, a process performed using clustering techniques.

Many clustering techniques can be used for this task. One of the most common is
hierarchical clustering, which has been considered the state of the art for various
environmental and meteorological data in the literature (Ahmadi et al. 2015; Malley
et al. 2014; Saithanu & Mekparyup 2012). Hierarchical clustering builds a hierarchy
of clusters: either the data points are initialized as one cluster and then split into
multiple clusters (Divisive Hierarchical Clustering), or each data point is initialized as
its own cluster and clusters are then merged into a smaller number of clusters
(Agglomerative Hierarchical Clustering) (Berkhin 2006).

On the other hand, the similarity or distance function used by the clustering
technique plays an essential role in the performance of the clustering task (Kalpakis
et al. 2001). Several distance measures have been proposed, including Euclidean,
Minkowski and Dynamic Time Warping. Integrating an appropriate distance measure
with an appropriate clustering technique is a challenging task (Kalpakis et al. 2001);
therefore, identifying a suitable distance measure for ozone level clustering is a vital
process.

This chapter presents the third objective of this study, in which the agglomerative
hierarchical clustering technique is carried out with the modified Dynamic Time
Warping proposed in the second objective. The hybrid of AHC with modified DTW
is applied to ground level ozone in Putrajaya, Malaysia. Section 6.2 depicts the
experiment setting in which the parameters of applying AHC with modified DTW
are illustrated. Section 6.3 presents the intra-cluster results of AHC with DTW and
modified DTW, whereas Section 6.4 presents the inter-cluster results. Finally,
Sections 6.5 to 6.7 provide a discussion that emphasizes the critical analysis of the
results.

6.2 EXPERIMENT SETTING

This section highlights the experiment parameters, illustrating the application of
AHC with the modified DTW. First, the data was collected from LESTARI
(LESTARI 2016), the Institute for Environment and Development in Malaysia and
the Asia Pacific. The institute was established in 1994 within Universiti Kebangsaan
Malaysia (UKM) to deal with environment and development issues. The data
contains ozone levels for one year (i.e. 2006) for Putrajaya city, recorded at hourly
intervals, giving 8544 instances.

After acquiring the data, a preprocessing task was applied to prepare the data for
processing. Raw data typically includes irrelevant, noisy and incomplete instances,
and handling such data plays an essential role in improving the performance of the
clustering process (Isa et al. 2008). Hence, two tasks were performed for this
purpose: cleaning and discretization. Cleaning handles missing values and calibration
errors, since such values can cause incorrect matches in the clustering process
(Teegavarapu & Chandramouli 2005). Microsoft Excel was used to detect these
values, identifying 158 missing values and 431 calibration errors, which were
handled with a Matlab ANN prediction algorithm. The discretization task limits the
values to specific intervals; such intervals facilitate the clustering process by
reducing the values to a particular range. Discretization is essential for certain
algorithms such as hierarchical clustering (Monira et al. 2010).
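As an illustration of the discretization step, the sketch below maps hourly readings into equal-width interval labels. The thesis does not specify its binning scheme, so the bin count, the value range and the equal-width policy are all assumptions here.

```python
def discretize(values, n_bins, lo, hi):
    """Equal-width discretization of readings into interval labels 0..n_bins-1.

    values : raw hourly readings (e.g. ozone levels in ppb)
    n_bins : number of equal-width intervals over [lo, hi)
    """
    width = (hi - lo) / n_bins
    out = []
    for v in values:
        b = int((v - lo) / width)
        out.append(min(max(b, 0), n_bins - 1))  # clamp readings outside the range
    return out

# Hypothetical hourly ozone readings, binned into 4 assumed intervals.
hourly = [0.004, 0.037, 0.061, 0.113, 0.008]
print(discretize(hourly, 4, 0.0, 0.12))  # [0, 1, 2, 3, 0]
```

Reducing continuous readings to a small label set makes repeated distance computations cheaper and makes similar daily profiles collapse onto identical symbol sequences, which benefits hierarchical clustering.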

Consequently, the proposed AHC with modified DTW was implemented in the C#
programming language with complete (max) linkage. Note that, to validate the
performance of the proposed modified DTW, the results are compared with those of
the standard DTW. The clustering was performed with the number of clusters as a
parameter, ranging from 3 to 15. This range was set as a result of analyzing the data
and identifying the appropriate classes.

6.2.1 Evaluation

One of the challenging tasks behind clustering is evaluating its results, in which the
question ‘what is the best way to group the data?’ must be answered (Rauber et al.).
Two main approaches have been proposed for validating a clustering process:
external and internal validation (Liu et al. 2010). External validation assesses the
clusters against a known distribution using common information retrieval metrics
such as precision, recall and f-measure. However, this mechanism relies on labeled
data; since real-life data is usually unlabeled, external validation tends to be
insufficient.

On the other hand, internal validation measures the correctness among objects within
a cluster (i.e. intra-cluster) and among objects across clusters (i.e. inter-cluster). The
main aim of a clustering task is to ensure that objects within a single cluster are
highly similar while objects in different clusters are highly dissimilar. Hence,
computing the Root Mean Square Error Standard Deviation (RMSE-SD) measures
the homogeneity of objects within a single cluster and across clusters. Note that a
smaller intra-cluster RMSE-SD indicates better performance, since the objects are
very similar, whereas a larger intra-cluster RMSE-SD indicates lower performance,
since the homogeneity among the objects is reduced. Therefore, the best results are
associated with a smaller RMSE-SD within clusters (intra-cluster) and a greater
RMSE-SD between clusters (inter-cluster).
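A pooled form of this intra-cluster index can be sketched as follows. The thesis does not give the exact RMSE-SD formula, so the common RMSSTD (root-mean-square standard deviation) definition is assumed, applied here to scalar summaries of the series for simplicity.

```python
from math import sqrt

def intra_rmsstd(clusters):
    """Pooled standard deviation of objects around their cluster centroids.

    clusters : list of clusters, each a list of scalar values
                (e.g. one summary value per time series — an assumption).
    Smaller values indicate more homogeneous clusters.
    """
    sq_err, dof = 0.0, 0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        sq_err += sum((x - centroid) ** 2 for x in cluster)  # within-cluster error
        dof += len(cluster) - 1                              # pooled degrees of freedom
    return sqrt(sq_err / dof)

# Two perfectly tight clusters score 0; spreading a cluster raises the index.
print(intra_rmsstd([[1, 1, 1], [5, 5, 5]]))  # 0.0
```

Evaluating this index across the 3–15 cluster range and picking the minimum is precisely how the best cluster counts are selected in the next section.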

6.3 RESULTS OF INTRA-CLUSTER

In this section, the intra-cluster results of the proposed AHC using DTW and
modified DTW are presented. The results were obtained for multiple numbers of
clusters; based on observation of the data, the number of clusters ranges from 3 to
15. Table 6.1 shows the intra-cluster results.

Table 6.1 Results of DTW and modified DTW using intra-cluster

# Clusters DTW Modified-DTW


15 0.0118 0.0042
14 0.0121 0.0041
13 0.0125 0.0041
12 0.0133 0.0042
11 0.0118 0.0045
10 0.0121 0.0045
9 0.0124 0.0039
8 0.013 0.0039
7 0.0137 0.0041
6 0.0145 0.0066
5 0.0102 0.0068
4 0.0103 0.0073
3 0.0039 0.0054

As shown in Table 6.1, the minimum RMSE-SD for DTW was obtained at 3 clusters,
achieving 0.0039, whereas for modified DTW the minimum of 0.0039 was obtained
at 8 and 9 clusters. As mentioned earlier, a smaller intra-cluster RMSE-SD indicates
better performance; therefore, 3, 8 and 9 clusters are the most accurate settings.
Figure 6.1 depicts these results.

Figure 6.1 RMSE-SD results of DTW and modified DTW for intra-cluster

6.4 RESULTS OF INTER-CLUSTER

In this section, the inter-cluster results of the proposed AHC using DTW and
modified DTW are presented. The results were obtained for multiple numbers of
clusters; based on observation of the data, the number of clusters ranges from 3 to
15. Table 6.2 shows the inter-cluster results.

Table 6.2 Results of DTW and modified DTW using inter-cluster

# Clusters DTW Modified-DTW


15 0.29 0.3869
14 0.26 0.3825
13 0.27 0.3814
12 0.27 0.3813
11 0.28 0.3901
10 0.3 0.4031
9 0.29 0.4077
8 0.25 0.3401
7 0.24 0.3252
6 0.27 0.3221
5 0.29 0.3153
4 0.29 0.3149
3 0.34 0.3608

As shown in Table 6.2, the maximum RMSE-SD for DTW was at 3 clusters,
achieving 0.34, whereas for modified DTW the maximum RMSE-SD was obtained at
9 clusters, achieving 0.4077. As mentioned earlier, a greater inter-cluster RMSE-SD
indicates better performance. Comparing the maximum RMSE-SD values of DTW
and modified DTW, it is clear that the modified DTW outperformed DTW by
achieving the greater value of 0.4077. Figure 6.2 depicts these results.

Figure 6.2 RMSE-SD results of DTW and modified DTW for inter-cluster

6.5 CATEGORIES OF AIR POLLUTION

The US Office of Air and Radiation (AirNow 2009) has discussed the factors that
lead to air pollution. In its investigation, ozone was one of the main factors that can
harm human health. To this end, AirNow (2009) provides five categories of air
pollution, shown in Table 6.3.

Table 6.3 Categories of air pollution (AirNow 2009)

1. Very Unhealthy
2. Unhealthy
3. Unhealthy for Sensitive Groups
4. Moderate
5. Good

In order to provide a more critical analysis of the acquired clusters, the best number
of clusters according to RMSE-SD, which is 9, is considered, together with the five
categories of the AirNow (2009) categorization. Therefore, two cluster counts, 5 and
9, are considered in the analysis; the next sections address this analysis.

6.6 CRITICAL ANALYSIS OF CLUSTERING WHEN K=5

This section provides a critical analysis of the clustering when k=5 by identifying
new patterns, i.e. by detecting anomalous or abnormal trends in the ground-level
ozone rates. Each of the five clusters is discussed separately. The analysis covers the
days included in each cluster and is conducted over three 8-hour intervals, following
Christensen et al. (2015). Figure 6.3 depicts the results of this experiment.

Figure 6.3 Results of clustering when k=5 (patterns per cluster: 1, no pattern; 2, sharp decrease of the ozone values; 3, no pattern; 4, sudden increase of the values; 5, sharp decrease of the values)

For cluster 1, the first 8-hour interval began at 0.004 ppb and ended at 0.005 ppb.
The second interval showed a rise in ozone values, reaching a peak of 0.061 ppb at
2 pm and ending at 0.050 ppb at 5 pm. In the third interval, the ozone values
gradually decreased to 0.008 ppb. It is difficult to identify meaningful patterns in this
cluster.

For cluster 2, the first interval began at 0.014 ppb and ended at 0.005 ppb. The
second interval showed a rise in ozone values, reaching a peak of 0.113 ppb at 2 pm
and ending at 0.089 ppb. In the third interval, the values decreased to 0.014 ppb. A
remarkable pattern can be noticed in this cluster, represented by the sharp decrease
of the ozone values.

For cluster 3, the first 8-hour interval began at 0.017 ppb and ended at 0.007 ppb.
The second interval showed an increase in values, reaching a peak of 0.058 ppb at
2 pm that persisted until 4 pm. In the third interval, the values gradually decreased to
0.012 ppb. It is difficult to identify meaningful patterns in this cluster.

For cluster 4, the first 8-hour interval began at 0.005 ppb and ended at 0.006 ppb.
The second interval showed an increase in values, reaching a peak of 0.037 ppb at
2 pm that persisted until 3 pm. In the third interval, the values gradually decreased to
0.005 ppb. A remarkable pattern can be noticed in this cluster, represented by the
sudden increase of the values.

For cluster 5, the first 8-hour interval began at 0.033 ppb and ended at 0.016 ppb.
The second interval showed an increase in values, reaching a peak of 0.059 ppb at
3 pm and ending at 0.051 ppb. In the third interval, the values sharply decreased to
0.019 ppb at 8 pm and ended at 0.017 ppb. A remarkable pattern can be noticed in
this cluster, represented by the sharp decrease of the values.

6.7 CRITICAL ANALYSIS OF CLUSTERING WHEN K=9

This section aims to provide a critical analysis of clustering when k=9 based by
identifying new patterns. This can be conducted by detecting anonymous or abnormal
trends for the ground-level ozone rates. In this manner, each cluster included within
the nine clusters will be discussed separately. The analysis is tackling the days
72

included in this cluster and will be conducted based three 8-hour intervals regarding to
Christensen et al. (2015). Figure 6.4 depicts the results of this experiment.

Figure 6.4 Results of clustering when k=9 (patterns per cluster: 1, sharp decrease and increase of values; 2, sharp increase; 3, sharp decrease of values; 4, multiple peaks at noon hours; 5, sharp increase with multiple peaks; 6, sharp increase and decrease with multiple peaks; 7, sharp decrease of the values with multiple peaks; 8, sharp increase and decrease of values; 9, sharp increase and decrease of the values)

For cluster 1, the first 8-hour interval began at 0.008 ppb and ended at
0.005 ppb. The second interval showed a rise in ozone values to a peak of
0.076 ppb at 2 pm, ending at 0.059 ppb at 5 pm. In the third interval, the
values gradually decreased to 0.007 ppb. A remarkable pattern can be noticed
in this cluster: a sharp decrease and increase of ozone values.

For cluster 2, the first interval began at 0.004 ppb and ended at 0.005 ppb.
The second interval began at 0.008 ppb, increased sharply to a maximum peak of
0.055 ppb at 2 pm, and then decreased to 0.046 ppb at 5 pm. The third interval
decreased to 0.009 ppb. A remarkable pattern can be noticed in this cluster: a
sharp increase.

For cluster 3, the first interval began at 0.014 ppb and ended at 0.005 ppb.
The second interval showed an increase to a peak of 0.113 ppb at 2 pm, ending
at 0.089 ppb. In the third interval, the values decreased to 0.014 ppb. A
remarkable pattern can be noticed in this cluster: a sharp decrease of values.

For cluster 4, the first interval began at 0.015 ppb and ended at 0.006 ppb.
The second interval showed multiple peaks, with values reaching 0.053 ppb at
4 pm and 0.049 ppb at 5 pm. The third interval showed a decrease to 0.007 ppb.
A remarkable pattern can be noticed in this cluster: multiple peaks at noon
hours.

For cluster 5, the first interval began at 0.004 ppb and ended at 0.005 ppb.
The second interval showed three peaks, 0.032 ppb at 12 pm, 0.036 ppb at 2 pm
and 0.035 ppb at 4 pm, ending at 0.029 ppb. The third interval showed an
unstable decrease to 0.005 ppb. A remarkable pattern can be noticed in this
cluster: a sharp increase of values with multiple peaks.

For cluster 6, the first interval began at 0.015 ppb, then showed a stable
decrease to 0.008 ppb by 8 am. The second interval showed two peaks, 0.034 ppb
at 12 pm and 0.039 ppb at 3 pm. The third interval showed a gradual decrease
to 0.005 ppb. A remarkable pattern can be noticed in this cluster: a sharp
increase and decrease of values, in addition to multiple peaks.

For cluster 7, the first interval began at 0.019 ppb and ended at 0.011 ppb.
The second interval showed two peaks, 0.078 ppb at 4 pm and 0.074 ppb at 2 pm.
The third interval showed a gradual decrease to 0.019 ppb. A remarkable
pattern can be noticed in this cluster: a sharp decrease of values with
multiple peaks.

For cluster 8, the first interval began at 0.033 ppb and ended at 0.016 ppb.
The second interval showed a maximum peak of 0.059 ppb at 3 pm, ending at
0.051 ppb. The third interval showed a sharp decrease to 0.017 ppb. A
remarkable pattern can be noticed in this cluster: a sharp increase and
decrease of values.

For cluster 9, the first interval began at 0.025 ppb and ended at 0.007 ppb.
The second interval showed a sharp increase to a maximum of 0.062 ppb at 2 pm,
ending at 0.049 ppb. The third interval showed a gradual decrease to
0.014 ppb. A remarkable pattern can be noticed in this cluster: a sharp
increase and decrease of values.
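The per-interval statistics reported above (start, end and maximum of each of the three 8-hour intervals) can be extracted from a day's hourly profile with a short helper. This is a minimal sketch; the hourly values below are hypothetical, shaped like one of the midday-peak clusters, not actual station readings.

```python
def interval_summary(hourly):
    """Start, end and maximum ozone value for three 8-hour intervals."""
    assert len(hourly) == 24
    intervals = [hourly[0:8], hourly[8:16], hourly[16:24]]
    return [{"start": iv[0], "end": iv[-1], "max": max(iv)} for iv in intervals]

# Hypothetical hourly ozone values (ppb) with a sudden midday peak.
series = ([0.005] * 7 + [0.006]
          + [0.010, 0.020, 0.030, 0.035, 0.036, 0.037, 0.037, 0.030]
          + [0.020, 0.015, 0.010, 0.008, 0.007, 0.006, 0.005, 0.005])
summary = interval_summary(series)
print(summary[1]["max"])  # afternoon-interval peak -> 0.037
```

Running the same summary over every day in a cluster gives the figures quoted in each cluster description.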

6.8 COMPARISON BETWEEN CLUSTERING WHEN K=5 AND K=9

This section provides a comparison between the two cluster numbers, k=9 and
k=5, which have been analyzed in the previous sections. The comparison is
based on multiple variables, including the starting ozone values, the maximum
peak, the maximum peak of the median and the ending values. Table 6.4 shows
the values for k = 5.
Table 6.4 Values using cluster k = 5

#Days   Class   Morning Start   Morning End   Afternoon Max   Afternoon Median Max   Evening End   Standard Category
60      4       0.005           0.006         0.09            0.037                  0.005         Good
8       5       0.033           0.016         0.093           0.059                  0.017         Moderate
104     3       0.017           0.007         0.105           0.058                  0.012         Unhealthy for Sensitive Groups
173     1       0.004           0.005         0.115           0.061                  0.008         Unhealthy
19      2       0.014           0.005         0.148           0.113                  0.014         Very Unhealthy

As shown in Table 6.4, the number of days in the 'unhealthy' category
represents nearly half of the year, which appears to be an overestimated
categorization; this category should therefore be divided into more
categories. The 'moderate' category, in contrast, contains only eight days,
which appears to be an underestimated categorization, as this category is
expected to contain more days. Table 6.5 shows the values for k = 9.

Table 6.5 Values using cluster k = 9

#Days   Class   Morning Start   Morning End   Afternoon Max   Afternoon Median Max   Evening End   Proposed Category
38      5       0.004           0.005         0.06            0.036                  0.005         Very Good
22      6       0.015           0.008         0.09            0.039                  0.005         Good
63      4       0.015           0.006         0.093           0.053                  0.007         High Moderate
121     2       0.004           0.005         0.096           0.055                  0.009         Moderate
8       8       0.033           0.016         0.093           0.059                  0.017         Low Moderate
20      9       0.025           0.007         0.077           0.062                  0.014         Unhealthy for Sensitive Groups
52      1       0.008           0.005         0.115           0.076                  0.007         Very Unhealthy for Sensitive Groups
21      7       0.019           0.011         0.105           0.078                  0.019         Unhealthy
19      3       0.014           0.005         0.148           0.113                  0.014         Very Unhealthy

As shown in Table 6.5, unlike the standard five-category scheme, the
nine-category scheme has the ability to provide a better description of the
year's days by offering more categories. For instance, the 'unhealthy'
category has been split into two categories, 'unhealthy' and 'very unhealthy
for sensitive groups', each containing a reasonable number of days. In
addition, the 'moderate' category has been split into three categories, 'high
moderate', 'moderate' and 'low moderate', which similarly contain reasonable
numbers of days. Finally, the 'good' category has also been divided into two
categories, 'very good' and 'good'. Figure 6.5 and Figure 6.6 show the
distribution of categories over the number of days.

Figure 6.5 Distribution of days over the five categories

Figure 6.6 Distribution of days over the nine categories

6.9 ANALYSIS USING ASSOCIATION RULES

In order to provide more analysis of the cause and effect for ground-level
ozone, the association rule method has been used to identify specific factors.
Basically, determining the factors that affect ozone is a difficult task.
Akimoto et al. (2015) conducted a study to analyze the causes of ground-level
ozone in Japan using 20 years of data. They surprisingly concluded that, even
with the decrease of NOx and NMHC (i.e. considered the main causes of
increasing ozone), there is still an ongoing increase in ground-level ozone
rates. Based on their judgment, they attributed the cause to transportation.

Hence, identifying the factors that affect ground-level ozone is a challenging
task. This study, however, attempts to present an analysis of specific cases
of extreme growth in ozone rates. Therefore, the association rules approach
has been used to clarify the factors that increase ozone rates in Putrajaya,
Malaysia.

In order to distinguish the interesting patterns or rules, it is necessary to
consider the value of confidence, which is defined as follows:

Confidence: the confidence of a rule is defined as Conf(X ⇒ Y) =
supp(X ∪ Y)/supp(X), where supp(X ∪ Y) denotes the support of transactions in
which both X and Y appear. Confidence ranges from 0 to 1, where closeness to 1
indicates an interesting relation. Confidence is an estimate of Pr(Y | X), the
probability of observing Y given X. The support supp(X) of an itemset X is
defined as the proportion of transactions in the data set that contain the
itemset.
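As a minimal sketch of these definitions, the following computes supp(X) and Conf(X ⇒ Y) over a toy transaction set. The item strings are hypothetical and only illustrate the discretized factor=value form used in the tables below.

```python
def support(itemset, transactions):
    # proportion of transactions containing every item in `itemset`
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    # Conf(X => Y) = supp(X ∪ Y) / supp(X)
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical discretized hourly records (factor=value items).
transactions = [
    {"NOx=0.003", "Temp=33.7", "O3=0.089"},
    {"NOx=0.003", "Temp=29.2", "O3=0.059"},
    {"NOx=0.003", "Temp=33.7", "O3=0.070"},
    {"CO=0.18", "O3=0.038"},
]
print(support({"NOx=0.003"}, transactions))                   # -> 0.75
print(confidence({"NOx=0.003"}, {"O3=0.089"}, transactions))  # 1/3
```

A rule such as "NOx=0.003 ∧ Temp=33.7 ⇒ O3=0.089" is kept when its confidence exceeds the chosen threshold, exactly as in the Top-N selection below.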

Based on the above illustration, an experiment has been performed on the
clusters obtained when k=5 and k=9, with a minimum support of 0.005 and a
Top-N approach for confidence. The Top-N approach selects the twenty rules
with the greatest confidence values. Table 6.6 shows a sample of rules with
their confidence values (for all rules, see Appendix B).
Table 6.6 Sample of rules with their confidence values

Rule       Confidence
Rule 1     0.75
Rule 2     0.72
Rule 3     0.70
Rule 4     0.56
Rule 5     0.51
Rule 6     0.47
Rule 7     0.39
Rule 8     0.26
Rule 9     0.18
Rule 10    0.09

Hence, the rules with the highest confidence values are considered in the
analysis, as these top rules yield interesting relations while the rest are
insignificant. First, the results of the association rules for the clustering
when k=5 are depicted in Table 6.7.

Table 6.7 Results of association rules for clustering when k=5

Index   Factor 1       ==>   Ozone   Confidence
1 CO=0.18 ==> 0.038 1.00
2 NOx=0.022 ==> 0.04 1.00
3 NO=0.004 ==> 0.073 1.00
4 NO=0.005 ==> 0.04 1.00
5 NO2=0.007 ==> 0.046 1.00
6 NO2=0.008 ==> 0.032 1.00
7 NO2=0.009 ==> 0.067 1.00
8 NO2=0.017 ==> 0.04 1.00
9 Temp=22 ==> 0.049 1.00
10 Temp=22.2 ==> 0.041 1.00
11 Temp=22.9 ==> 0.038 1.00
12 Temp=23.3 ==> 0.049 1.00
13 Temp=24 ==> 0.034 1.00
14 Temp=24.2 ==> 0.032 1.00
15 Temp=24.3 ==> 0.03 1.00
16 Temp=26.5 ==> 0.05 1.00
17 Temp=27 ==> 0.043 1.00
18 Temp=27.5 ==> 0.052 1.00
19 Temp=27.6 ==> 0.049 1.00
20 Temp=27.8 ==> 0.052 1.00

As shown in Table 6.7, the top twenty rules obtained when k=5 are each
associated with a single factor, distributed as follows: 1 rule for CO, 1 rule
for NOx, 2 rules for NO, 4 rules for NO2, and the remaining 12 rules for
temperature.

For the association rules of the clustering when k=9, Table 6.8 depicts the
top twenty rules.

Table 6.8 Results of association rules for clustering when k=9

Index   Factor 1     Factor 2      ==>   Ozone   Confidence
1 NOx=0.003 Temp=33.7 ==> 0.089 1.00
2 NOx=0.003 Temp=32.1 ==> 0.088 1.00
3 NOx=0.003 Temp=30.4 ==> 0.070 1.00
4 NOx=0.003 Temp=29.9 ==> 0.061 1.00
5 NOx=0.003 Temp=29.2 ==> 0.059 1.00
6 NOx=0.003 NO2=0.013 ==> 0.059 1.00
7 NOx=0.003 NO2=0.002 ==> 0.070 1.00
8 NOx=0.003 CO=1.7 ==> 0.070 1.00
9 NOx=0.003 CO=1.61 ==> 0.061 1.00
10 NOx=0.003 CO=0.31 ==> 0.088 1.00
11 NOx=0.003 CO=0.27 ==> 0.086 1.00
12 NOx=0.003 CO=0.16 ==> 0.078 1.00
13 CO=0.75 - ==> 0.148 1.00
14 CO=0.91 - ==> 0.147 1.00
15 CO=0.7 - ==> 0.143 1.00
16 CO=0.44 - ==> 0.140 1.00
17 CO=0.63 - ==> 0.140 1.00
18 CO=0.60 - ==> 0.139 1.00
19 CO=0.59 - ==> 0.139 1.00
20 CO=0.78 - ==> 0.131 1.00

As shown in Table 6.8, the first 12 rules are associated with two factors,
whereas the remaining rules are associated with a single factor. In
particular, the first five rules (i.e. 1-5) are associated with NOx and
temperature, where increasing temperature with NOx = 0.003 leads to an
increase in ozone rates. The following two rules (i.e. 6 and 7) are associated
with NOx and NO2, where decreasing NO2 with NOx = 0.003 leads to an increase
in ozone rates. The following five rules (i.e. 8-12) are associated with NOx
and CO, where decreasing CO with NOx = 0.003 (especially for rules 10 and 11)
leads to an increase in ozone rates.

On the other hand, the remaining 8 rules (i.e. 13-20) are associated with a
single factor, CO. In fact, these rules relate to the peak or highest ozone
rates. Although there is no direct relation between CO values and ozone rates,
CO is, as a general view, related to transportation. This supports the finding
of Akimoto et al. (2015), which implies that transportation is one of the main
reasons behind the growth of ground-level ozone rates.

6.9.1 Comparison

In order to compare the results of the association rules for the clustering
when k=5 (Table 6.7) and when k=9 (Table 6.8), the focus is first placed on
the number of factors. It is noticeable that the clustering when k=9 has
generated rules with two factors, while the clustering when k=5 has produced
only single-factor rules. This suggests that the clusters obtained when k=9
contain more interesting relations. Secondly, in terms of CO, the clustering
when k=9 has produced 13 rules associated with CO, compared to a single rule
when k=5. In addition, in terms of NOx, the clustering when k=9 has produced
12 rules, compared to a single rule when k=5. For NO2, k=9 has produced 2
rules, compared to 4 rules when k=5. Since Akimoto et al. (2015) have claimed
that CO is the most significant factor affecting ozone, the clustering when
k=9 clearly shows superior performance in generating interesting rules
compared to the clustering when k=5. On the other hand, Adame et al. (2012)
have illustrated that temperature is not significantly associated with ozone,
but rather UV radiation is. Since most of the rules generated by the
clustering when k=5 relate to temperature, such rules are naïve compared to
the rules generated by the clustering when k=9.

6.10 SUMMARY

This chapter discussed the third objective of this study, in which AHC with
both the standard DTW and the proposed modified DTW has been carried out on
Malaysian climate change data for Putrajaya city in the year 2006. The results
of both DTW and the modified DTW have been highlighted with a critical
analysis.
CHAPTER VII

7 CONCLUSION AND FUTURE WORK

7.1 CONCLUSION

One of the main contributors to pollution nowadays is ground-level ozone (O3);
over the last two decades, ozone has increased in several places around the
world. Many studies have claimed that the concentration of tropospheric ozone
is a potential human health hazard. Furthermore, the impacts of ozone
concentration may extend to agriculture and forests. In this vein, predicting
ground-level ozone has a significant impact on controlling these consequences.
Clustering is one of the common techniques with the ability to discover
specific patterns in ground-level ozone concentration. However, the key
challenge behind clustering ozone levels lies in data representation, since
the data are represented in a temporal manner (Steinbach et al. 2003).

Time series analysis emerged in response to the evolution of chronologically
represented data, where the data are recorded in time intervals (Fu 2011).
There are many kinds of time series data, such as financial data, weather
forecasting data and pattern recognition data (Wismüller et al. 2002). The
common task of time series data mining is identifying similar sequences, a
process performed using clustering techniques.

There are many clustering techniques that could be used for this task. One of
the common techniques is hierarchical clustering, which has been considered
the state of the art for various environmental and meteorological data in the
literature (Ahmadi et al. 2015; Malley et al. 2014; Saithanu & Mekparyup
2012). Hierarchical clustering aims to build a hierarchy of clusters, either
by initializing all data points as one cluster and then splitting it into
multiple clusters (divisive hierarchical clustering), or by initializing each
data point as a cluster and then merging clusters into a smaller number
(agglomerative hierarchical clustering) (Berkhin 2006).

On the other hand, the similarity or distance function used by the clustering
technique plays an essential role in the performance of the clustering task
(Kalpakis et al. 2001). Several distance measures have been proposed,
including the Euclidean, Minkowski and Dynamic Time Warping (DTW) distance
measures. In fact, integrating an appropriate distance measure with an
appropriate clustering technique is a challenging task (Kalpakis et al. 2001).
Therefore, identifying a suitable distance measure for ozone level clustering
is a vital task.
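For reference, the standard DTW distance is the classic dynamic program below. This is the unmodified measure, not the thesis's modified variant, and it illustrates why DTW tolerates time shifts that Euclidean distance penalizes.

```python
# Classic dynamic-programming DTW distance between two sequences.
def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = cost of best alignment of a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0 (time-shifted but perfectly aligned)
```

Here the repeated 2 in the second sequence is absorbed by the warping path at zero cost, whereas a point-by-point measure would be forced to compare misaligned samples.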

Hence, this research has conducted a comparative study among three distance
measures, Euclidean distance, Minkowski distance and DTW, using AHC. The
comparison was performed using twenty UCI benchmark time series datasets. The
T-test result obtained between Minkowski and DTW is 0.0004, and between
Euclidean and DTW it is 0.0003. Since the obtained results are less than 0.05,
DTW demonstrated significant superiority.
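The T-test used for such comparisons is a paired test over per-dataset results. A minimal sketch of the paired t statistic follows; the accuracy values are hypothetical stand-ins, not the actual experimental figures from the twenty datasets.

```python
import math

# Paired t statistic over per-dataset accuracies of two methods.
def paired_t(a, b):
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

dtw_acc  = [0.90, 0.88, 0.92, 0.91, 0.89]  # hypothetical DTW accuracies
eucl_acc = [0.85, 0.84, 0.86, 0.85, 0.83]  # hypothetical Euclidean accuracies
print(paired_t(dtw_acc, eucl_acc))  # t statistic; compare against a t-table, df = n - 1
```

A large t statistic (here around 13.5) at df = n - 1 corresponds to a p-value far below 0.05, which is the criterion applied to the reported results.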

In addition, this research has presented a modification of DTW in which the
dynamic selection of the shortest path has been enhanced. The proposed
modification of DTW was applied with AHC using twenty UCI benchmark time
series datasets. The T-test result obtained between DTW and the modified DTW
is 0.0210. Since the obtained result is less than 0.05, the modified DTW
demonstrated significant superiority.

Finally, the proposed modified DTW has been applied with AHC on a Malaysian
meteorological dataset in order to perform ground-level ozone clustering.
Results show that the modified DTW has the ability to discover significant
patterns of ozone concentration.

7.2 RESEARCH CONTRIBUTION

The contribution of this research is represented first by a comparative study
of multiple distance measures for hierarchical time series clustering. This
contribution has been accomplished by applying AHC with three distance
measures using twenty benchmark datasets. In addition, this research has
presented a modification of the DTW distance measure; this contribution has
been accomplished by applying AHC with both DTW and the modified DTW using
twenty benchmark datasets. After demonstrating the superiority of the proposed
modified DTW, this research applied the proposed method to a Malaysian
meteorological dataset in order to perform ground-level ozone clustering. This
contribution has been evaluated using association rules.

7.3 FUTURE WORK

This section highlights potential enhancements or modifications that could be
added in future research in the field of ground-level ozone clustering. These
enhancements are represented as the following suggestions and recommendations:

i. Using a dataset spanning multiple years would help identify frequent
patterns, which may facilitate the determination of the important factors
affecting ozone.

ii. Analyzing more factors, such as wind, rainfall, transportation and
industry, would have a significant impact on identifying the causes of
increasing ozone rates.

iii. Exploring more areas in Malaysia in terms of ground-level ozone would
provide a significantly better understanding.

iv. Using more distance measures and different clustering approaches would
facilitate achieving robust effectiveness.

REFERENCES

Abonyi, J., Feil, B., Nemeth, S. & Arva, P. 2005. Modified Gath–Geva clustering for
fuzzy segmentation of multivariate time-series. Fuzzy Sets and Systems 149(1).
39-56.

Adame, J., Notario, A., Villanueva, F. & Albaladejo, J. 2012. Application of cluster
analysis to surface ozone, NO2 and SO2 daily patterns in an industrial area in
Central-Southern Spain measured with a DOAS system. Science of the Total
Environment 429. 281-291.

Aghagolzadeh, M., Soltanian-Zadeh, H., Araabi, B. & Aghagolzadeh, A. 2007. A
hierarchical clustering based on mutual information maximization. Image
Processing, 2007. ICIP 2007. IEEE International Conference on, I-277-I-280.

Agrawal, R., Faloutsos, C. & Swami, A. 1993. Efficient similarity search in sequence
databases. International Conference on Foundations of Data Organization and
Algorithms, 69-84.

Ahmad, S. S. & Aziz, N. 2013. Spatial and temporal analysis of ground level ozone
and nitrogen dioxide concentration across the twin cities of Pakistan.
Environmental monitoring and assessment 185(4). 3133-3147.

Ahmadi, M., Huang, Y. & John, K. 2015. Predicting Hourly Ozone Pollution in
Dallas-Fort Worth Area Using Spatio-Temporal Clustering.

AirNow. 2009. Ozone and Your Health.
https://cfpub.epa.gov/airnow/index.cfm?action=ozone_health.index#request.PDFPath#ozone-c.pdf.

Akimoto, H., Mori, Y., Sasaki, K., Nakanishi, H., Ohizumi, T. & Itano, Y. 2015.
Analysis of monitoring data of ground-level ozone in Japan for long-term
trend during 1990–2010: causes of temporal and spatial variation. Atmospheric
environment 102. 302-310.

Al-Naymat, G., Chawla, S. & Taheri, J. 2009. SparseDTW: a novel approach to speed
up dynamic time warping. Proceedings of the Eighth Australasian Data
Mining Conference-Volume 101, 117-127.

Anderson, T. 1984. Multivariate statistical analysis. Wiley and Sons, New York, NY.

Andreopoulos, B., An, A., Wang, X. & Schroeder, M. 2009. A roadmap of clustering
algorithms: finding a match for a biomedical application. Briefings in
bioinformatics 10(3). 297-314.

Antunes, C. M. & Oliveira, A. L. 2001. Temporal data mining: An overview. KDD
workshop on temporal data mining, 13.

Arnaud, P., Bouvier, C., Cisneros, L. & Dominguez, R. 2002. Influence of rainfall
spatial variability on flood prediction. Journal of Hydrology 260(1). 216-230.

Austin, E., Zanobetti, A., Coull, B., Schwartz, J., Gold, D. R. & Koutrakis, P. 2014.
Ozone trends and their relationship to characteristic weather patterns. Journal
of Exposure Science and Environmental Epidemiology.

Badr, H. S., Zaitchik, B. F. & Dezfuli, A. K. 2015. A tool for hierarchical climate
regionalization. Earth Science Informatics. 1-10.

Bagnall, A. & Janacek, G. 2005. Clustering time series with clipped data. Machine
learning 58(2-3). 151-178.

Banerjee, A. & Ghosh, J. 2001. Clickstream clustering using weighted longest
common subsequences. Proceedings of the web mining workshop at the 1st
SIAM conference on data mining, 144.

Baraldi, A. & Alpaydin, E. 2002. Constructive feedforward ART clustering networks.
II. IEEE transactions on neural networks 13(3). 662-677.

Bell, M. L., Goldberg, R., Hogrefe, C., Kinney, P. L., Knowlton, K., Lynn, B.,
Rosenthal, J., Rosenzweig, C. & Patz, J. A. 2007. Climate change, ambient
ozone, and health in 50 US cities. Climatic Change 82(1-2). 61-76.

Bell, M. L., McDermott, A., Zeger, S. L., Samet, J. M. & Dominici, F. 2004. Ozone
and short-term mortality in 95 US urban communities, 1987-2000. Jama
292(19). 2372-2378.

Berkhin, P. 2006. A survey of clustering data mining techniques. Grouping
multidimensional data. 25-71. Springer.

Bernad, D. 1996. Finding patterns in time series: a dynamic programming approach.
Advances in knowledge discovery and data mining.

Berndt, D. J. & Clifford, J. 1994. Using Dynamic Time Warping to Find Patterns in
Time Series. KDD workshop, 359-370.

Berry, M. J. & Linoff, G. 1997. Data mining techniques: for marketing, sales, and
customer support. John Wiley & Sons, Inc.

Bicego, M., Murino, V. & Figueiredo, M. A. 2003. Similarity-based clustering of
sequences using hidden Markov models. International Workshop on Machine
Learning and Data Mining in Pattern Recognition, 86-95.

Bicego, M., Murino, V. & Figueiredo, M. A. 2004. Similarity-based classification of
sequences using hidden Markov models. Pattern recognition 37(12). 2281-2291.

Biernacki, C., Celeux, G. & Govaert, G. 2000. Assessing a mixture model for
clustering with the integrated completed likelihood. IEEE transactions on
pattern analysis and machine intelligence 22(7). 719-725.

Bouguettaya, A., Yu, Q., Liu, X., Zhou, X. & Song, A. 2015. Efficient agglomerative
hierarchical clustering. Expert Systems with Applications 42(5). 2785-2797.

Carpenter, G. A. & Grossberg, S. 1988. The ART of adaptive pattern recognition by a


self-organizing neural network. Computer 21(3). 77-88.

Carvalho, M., Melo-Gonçalves, P., Teixeira, J. & Rocha, A. 2016. Regionalization of


Europe based on a K-Means Cluster Analysis of the climate change of
temperatures and precipitation. Physics and Chemistry of the Earth, Parts
A/B/C.

Cha, S.-H. 2007. Comprehensive survey on distance/similarity measures between
probability density functions. City 1(2). 1.

Chan, K. & Fu, A. 1999. Efficient Time Series Matching by Wavelets. In proceedings
of the 15th IEEE Int. Conference on Data Engineering. Sydney: Australia,
126-133.

Chang, H. H., Hao, H. & Sarnat, S. E. 2014. A statistical modeling framework for
projecting future ambient ozone and its health impact due to climate change.
Atmospheric environment 89. 290-297.

Christensen, J. N., Weiss-Penzias, P., Fine, R., McDade, C. E., Trzepla, K., Brown, S.
T. & Gustin, M. S. 2015. Unraveling the sources of ground level ozone in the
Intermountain Western United States using Pb isotopes. Science of the Total
Environment 530. 519-525.

Cimiano, P., Hotho, A. & Staab, S. 2004. Comparing conceptual, divise and
agglomerative clustering for learning taxonomies from text. Proceedings of the
16th Eureopean Conference on Artificial Intelligence, ECAI'2004, including
Prestigious Applicants of Intelligent Systems, PAIS 2004,

Centers for Disease Control and Prevention. 1997. NIOSH pocket guide to chemical hazards.

Corduas, M. & Piccolo, D. 2008. Time series clustering and classification by the
autoregressive metric. Computational Statistics & Data Analysis 52(4). 1860-
1872.

Dembélé, D. & Kastner, P. 2003. Fuzzy C-means method for clustering microarray
data. Bioinformatics 19(8). 973-980.

Dezfuli, A. K. 2011. Spatio-temporal variability of seasonal rainfall in western
equatorial Africa. Theoretical and applied climatology 104(1-2). 57-69.

Donko, D., Hadzimejlic, N. & Hadzimejlic, N. Evaluation of Clusters for
Climate Data.

Efrat, A., Fan, Q. & Venkatasubramanian, S. 2007. Curve matching, time warping,
and light fields: New algorithms for computing similarity between curves.
Journal of Mathematical Imaging and Vision 27(3). 203-216.

Everitt, B. S., Landau, S. & Leese, M. 2001. Cluster Analysis Arnold. A member of
the Hodder Headline Group, London.

Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. 1994. Fast subsequence
matching in time-series databases. ACM.

Ferreira, L. N. & Zhao, L. 2016. Time series clustering via community detection in
networks. Information Sciences 326. 227-242.

Fisher, D. H. 1987. Knowledge acquisition via incremental conceptual clustering.
Machine learning 2(2). 139-172.

Fraley, C. & Raftery, A. E. 2002. Model-based clustering, discriminant analysis, and
density estimation. Journal of the American statistical Association 97(458).
611-631.

Frossard, L., Rieder, H., Ribatet, M., Staehelin, J., Maeder, J., Rocco, S. D., Davison,
A. & Peter, T. 2013. On the relationship between total ozone and atmospheric
dynamics and chemistry at mid-latitudes–Part 1: Statistical models and spatial
fingerprints of atmospheric dynamics and chemistry. Atmospheric Chemistry
and Physics 13(1). 147-164.

Fu, T.-c. 2011. A review on time series data mining. Engineering Applications of
Artificial Intelligence 24(1). 164-181.

Ganguly, A. R. & Steinhaeuser, K. 2008. Data mining for climate change and impacts.
2008 IEEE International Conference on Data Mining Workshops, 385-394.

Gao, H., Jiang, J., She, L. & Fu, Y. 2010. A New Agglomerative Hierarchical
Clustering Algorithm Implementation based on the Map Reduce Framework.
JDCTA 4(3). 95-100.

Gneiting, T. & Raftery, A. E. 2005. Weather forecasting with ensemble methods.
Science 310(5746). 248-249.

Groenen, P. J. & Jajuga, K. 2001. Fuzzy clustering with squared Minkowski distances.
Fuzzy Sets and Systems 120(2). 227-237.

Guha, S., Rastogi, R. & Shim, K. 1998. CURE: an efficient clustering algorithm for
large databases. ACM SIGMOD Record, 73-84.

Hájek, P. & Olej, V. 2012. Ozone prediction on the basis of neural networks, support
vector regression and methods with uncertainty. Ecological Informatics 12.
31-42.

Hamilton, J. D. 1994. Time series analysis. Princeton university press Princeton.

Han, J., Kamber, M. & Pei, J. 2011. Data mining: concepts and techniques: concepts
and techniques. Elsevier.

Hansen, P. & Jaumard, B. 1997. Cluster analysis and mathematical programming.
Mathematical programming 79(1-3). 191-215.

Hirano, S., Sun, X. & Tsumoto, S. 2004. Comparison of clustering methods for
clinical databases. Information Sciences 159(3). 155-165.

Htike, K. K. & Khalifa, O. O. 2010. Rainfall forecasting models using focused time-
delay neural networks. Computer and Communication Engineering (ICCCE),
2010 International Conference on, 1-6.

Isa, D., Lee, L. H., Kallimani, V. & Rajkumar, R. 2008. Text document preprocessing
with the Bayes formula for classification using the support vector machine.
Knowledge and Data Engineering, IEEE Transactions on 20(9). 1264-1272.

Jacob, D. J. & Winner, D. A. 2009. Effect of climate change on air quality.
Atmospheric environment 43(1). 51-63.

Jain, A. K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition
Letters 31(8). 651-666.

Jain, A. K. & Dubes, R. C. 1988. Algorithms for clustering data. Prentice-Hall, Inc.

Jain, A. K., Murty, M. N. & Flynn, P. J. 1999. Data clustering: a review. ACM
Computing Surveys (CSUR) 31(3). 264-323.

Jelinek, F. 1990. Self-organized language modeling for speech recognition. Readings
in speech recognition. 450-506.

Jiang, D., Pei, J. & Zhang, A. 2003. DHC: a density-based hierarchical clustering
method for time series gene expression data. Bioinformatics and
Bioengineering, 2003. Proceedings. Third IEEE Symposium on, 393-400.

Kalpakis, K., Gada, D. & Puttagunta, V. 2001. Distance measures for effective
clustering of ARIMA time-series. Data Mining, 2001. ICDM 2001,
Proceedings IEEE International Conference on, 273-280.

Kandil, M., Gadallah, A., Tawfik, F. & Kandil, N. 2014. Prediction of Maximum
Ground Ozone Levels using Neural Network. International Journal of
Computing and Digital Systems 3(2). 133-140.

Karypis, G., Han, E.-H. & Kumar, V. 1999. Chameleon: Hierarchical clustering using
dynamic modeling. Computer 32(8). 68-75.

Kavitha, V. & Punithavalli, M. 2010. Clustering time series data stream-a literature
survey. arXiv preprint arXiv:1005.4270.

Kelly, E. & Lesh, R. 2002. Understanding and explicating the design experiment
methodology. Building Research Capacity 3. 1-3.

Keogh, E. 1997. Fast similarity search in the presence of longitudinal scaling in time
series databases. Tools with Artificial Intelligence, 1997. Proceedings., Ninth
IEEE International Conference on, 578-584.

Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. 2001. Dimensionality
reduction for fast similarity search in large time series databases. Knowledge
and information Systems 3(3). 263-286.

Keogh, E. & Folias, T. 2002. The UCR time series data mining archive. Computer
Science & Engineering Department, University of California, Riverside, CA.
http://www.cs.ucr.edu/eamonn/TSDMA/index.html.

Keogh, E. & Kasetty, S. 2003. On the need for time series data mining benchmarks: a
survey and empirical demonstration. Data Mining and knowledge discovery
7(4). 349-371.

Keogh, E. J. & Pazzani, M. J. 1998. An Enhanced Representation of Time Series
Which Allows Fast and Accurate Classification, Clustering and Relevance
Feedback. KDD, 239-243.

Kocijan, J., Gradišar, D., Božnar, M. Z., Grašič, B. & Mlakar, P. 2016. On-line
algorithm for ground-level ozone prediction with a mobile station.
Atmospheric environment 131. 326-333.

Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE 78(9). 1464-
1480.

Kolatch, E. 2001. Clustering algorithms for spatial databases: A survey. PDF is
available on the Web.

Kumar, V., Narasimham, C. & Sujith, B. 2015. Classification of Time Series Data by
One Class Classifier using DTW-D. Procedia Computer Science 54. 343-352.

Laxman, S. & Sastry, P. S. 2006. A survey of temporal data mining. Sadhana 31(2).
173-198.

LESTARI. 2016. The Institute for Environment and Development in Malaysia and the
Asia Pacific. http://www.ukm.my/lestari.

Li, P., Xin, J., Bai, X., Wang, Y., Wang, S., Liu, S. & Feng, X. 2013. Observational
studies and a statistical early warning of surface ozone pollution in Tangshan,
the largest heavy industry city of North China. International journal of
environmental research and public health 10(3). 1048-1061.

Liao, T. W. 2005. Clustering of time series data—a survey. Pattern Recognition
38(11). 1857-1874.

Lin, J., Keogh, E., Wei, L. & Lonardi, S. 2007. Experiencing SAX: a novel symbolic
representation of time series. Data Mining and Knowledge Discovery 15(2).
107-144.

Liu, C. L. 1968. Introduction to combinatorial mathematics.

Liu, F. & Xiong, L. 2011. Survey on text clustering algorithm. Software Engineering
and Service Science (ICSESS), 2011 IEEE 2nd International Conference on,
901-904.

Liu, Y., Li, Z., Xiong, H., Gao, X. & Wu, J. 2010. Understanding of internal
clustering validation measures. Data Mining (ICDM), 2010 IEEE 10th
International Conference on, 911-916.

Malley, C. S., Braban, C. F. & Heal, M. R. 2014. The application of hierarchical
cluster analysis and non-negative matrix factorization to European atmospheric
monitoring site classification. Atmospheric Research 138. 30-40.

Manning, C. D., Raghavan, P. & Schütze, H. 2008. Introduction to information
retrieval. Cambridge University Press, Cambridge.

Matson, P., Dietz, T., Abdalati, W., Busalacchi, A., Caldeira, K., Corell, R., Defries,
R., Fung, I., Gaines, S. & Hornberger, G. 2010. Advancing the science of
climate change. The National Academy of Sciences. 20.

Mitsa, T. 2010. Temporal data mining. CRC Press.

Monira, S. S., Faisal, Z. M. & Hirose, H. 2010. Comparison of artificially intelligent
methods in short term rainfall forecast. Computer and Information Technology
(ICCIT), 2010 13th International Conference on, 39-44.

Monteiro, A., Carvalho, A., Ribeiro, I., Scotto, M., Barbosa, S., Alonso, A.,
Baldasano, J., Pay, M., Miranda, A. & Borrego, C. 2012. Trends in ozone
concentrations in the Iberian Peninsula by quantile regression and clustering.
Atmospheric Environment 56. 184-193.

Müller, K.-R., Smola, A. J., Rätsch, G., Schölkopf, B., Kohlmorgen, J. & Vapnik, V.
1997. Predicting time series with support vector machines. International
Conference on Artificial Neural Networks, 999-1004.

Niennattrakul, V. & Ratanamahatana, C. A. 2007. On clustering multimedia time
series data using k-means and dynamic time warping. Multimedia and
Ubiquitous Engineering, 2007. MUE'07. International Conference on, 733-
738.

Pal, N. R., Bezdek, J. C. & Tsao, E.-K. 1993. Generalized clustering networks and
Kohonen's self-organizing scheme. IEEE transactions on neural networks
4(4). 549-557.

Pelekis, N., Theodoulidis, B., Kopanakis, I. & Theodoridis, Y. 2004. Literature review
of spatio-temporal database models. The Knowledge Engineering Review
19(03). 235-274.

Petitjean, F., Ketterlin, A. & Gançarski, P. 2011. A global averaging method for
dynamic time warping, with applications to clustering. Pattern Recognition
44(3). 678-693.

Rani, S. & Sikka, G. 2012. Recent techniques of clustering of time series data: A
Survey. Int. J. Comput. Appl 52(15). 1-9.

Rauber, A., Paralic, J. & Pampalk, E. Empirical Evaluation of Clustering
Algorithms.

Ray, S. J. T. B. K. & Ensor, K. B. 2009. A Model-based Approach for Clustering Air
Quality Monitoring Networks in Houston, Texas.

Saithanu, K. & Mekparyup, J. 2012. Clustering of Air Quality and Meteorological
Variables Associated with High Ground Ozone Concentration in the Industrial
Areas, at the East of Thailand. International Journal of Pure and Applied
Mathematics 81(3). 505-515.

Serra, A. P. & Zárate, L. E. 2015. Characterization of time series for analyzing of the
evolution of time series clusters. Expert Systems with Applications 42(1). 596-
611.

Sharma, S., Sharma, P., Khare, M. & Kwatra, S. 2016. Statistical behavior of ozone in
urban environment. Sustainable Environment Research.

Singh, S. S. & Chauhan, N. 2011. K-means v/s K-medoids: A Comparative Study.
National Conference on Recent Trends in Engineering & Technology.
Smith, J., Mercado, F. & Estes, M. 2013. Characterization of Gulf of Mexico
background ozone concentrations. 12th Annual CMAS Conference, Chapel
Hill, NC.

Solazzo, E., Bianconi, R., Vautard, R., Appel, K. W., Moran, M. D., Hogrefe, C.,
Bessagnet, B., Brandt, J., Christensen, J. H. & Chemel, C. 2012. Model
evaluation and ensemble modelling of surface-level ozone in Europe and
North America in the context of AQMEII. Atmospheric Environment 53. 60-
74.

Steinbach, M., Tan, P.-N., Kumar, V., Klooster, S. & Potter, C. 2003. Discovery of
climate indices using clustering. Proceedings of the ninth ACM SIGKDD
international conference on Knowledge discovery and data mining, 446-455.

Stocker, T., Qin, D., Plattner, G.-K., Tignor, M., Allen, S. K., Boschung, J., Nauels,
A., Xia, Y., Bex, V. & Midgley, P. M. 2014. Climate change 2013: The
physical science basis. Cambridge University Press Cambridge, UK, and New
York.

Sun, W., Palazoglu, A., Singh, A., Zhang, H., Wang, Q., Zhao, Z. & Cao, D. 2015.
Prediction of surface ozone episodes using clusters based generalized linear
mixed effects models in Houston–Galveston–Brazoria area, Texas.
Atmospheric Pollution Research 6(2). 245-253.

Tamas, W., Notton, G., Paoli, C., Nivet, M.-L. & Voyant, C. 2016. Hybridization of
Air Quality Forecasting Models Using Machine Learning and Clustering: An
Original Approach to Detect Pollutant Peaks. Aerosol and Air Quality
Research 16(2). 405-416.

Teegavarapu, R. S. & Chandramouli, V. 2005. Improved weighting methods,
deterministic and stochastic data-driven models for estimation of missing
precipitation records. Journal of Hydrology 312(1). 191-206.

Van Wijk, J. J. & Van Selow, E. R. 1999. Cluster and calendar based visualization of
time series data. Information Visualization, 1999 (InfoVis '99). Proceedings.
1999 IEEE Symposium on, 4-9, 140.

Vesanto, J. & Alhoniemi, E. 2000. Clustering of the self-organizing map. Neural
Networks, IEEE Transactions on 11(3). 586-600.

Vingarzan, R. 2004. A review of surface ozone background levels and trends.
Atmospheric Environment 38(21). 3431-3442.

Vlachos, M., Lin, J., Keogh, E. & Gunopulos, D. 2003. A wavelet-based anytime
algorithm for k-means clustering of time series. In Proc. Workshop on
Clustering High Dimensionality Data and Its Applications.

Vora, P. & Oza, B. 2013. A Survey on K-mean Clustering and Particle Swarm
Optimization. International Journal of Science and Modern Engineering
(IJISME). 24-26.

Wang, S., Cai, T. & Eick, C. F. 2013. New Spatiotemporal Clustering Algorithms and
their Applications to Ozone Pollution. 2013 IEEE 13th International
Conference on Data Mining Workshops, 1061-1068.

Wang, X., Smith, K. & Hyndman, R. 2006. Characteristic-based clustering for time
series data. Data Mining and Knowledge Discovery 13(3). 335-364.

Wang, X., Smith, K. A., Hyndman, R. & Alahakoon, D. 2004. A scalable method for
time series clustering. Unrefereed research papers 1.

Wilson, A., Rappold, A. G., Neas, L. M. & Reich, B. J. 2014. Modeling the effect of
temperature on ozone-related mortality. The Annals of Applied Statistics 8(3).
1728-1749.

Wismüller, A., Lange, O., Dersch, D. R., Leinsinger, G. L., Hahn, K., Pütz, B. &
Auer, D. 2002. Cluster analysis of biomedical image time-series. International
Journal of Computer Vision 46(2). 103-128.

Xiong, Y. & Yeung, D.-Y. 2002. Mixtures of ARMA models for model-based time
series clustering. Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE
International Conference on, 717-720.

Xu, R. & Wunsch, D. 2005. Survey of clustering algorithms. Neural Networks, IEEE
Transactions on 16(3). 645-678.

The UCR Time Series Classification Archive. 2015.
www.cs.ucr.edu/~eamonn/time_series_data/.

Yuan, Y., Chen, Y.-P. P., Ni, S., Xu, A. G., Tang, L., Vingron, M., Somel, M. &
Khaitovich, P. 2011. Development and application of a modified dynamic time
warping algorithm (DTW-S) to analyses of primate brain expression time
series. BMC bioinformatics 12(1). 347.

Zhang, T., Ramakrishnan, R. & Livny, M. 1996. BIRCH: an efficient data clustering
method for very large databases. ACM SIGMOD Record, 103-114.

Zhu, Q., Batista, G. E., Rakthanmanon, T. & Keogh, E. J. 2012. A Novel
Approximation to Dynamic Time Warping allows Anytime Clustering of
Massive Time Series Datasets. SDM, 999-1010.

APPENDIX A

RESULTS OF AHC WITH EUCLIDEAN, MINKOWSKI, DTW AND M-DTW


FOR EACH DATASET
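
The tables below report cluster-quality scores from agglomerative hierarchical clustering (AHC) under four distance measures. As a hedged illustration of the two ingredients involved — this is a minimal sketch with hypothetical function names, not the implementation used to produce these tables — a textbook O(nm) dynamic time warping distance and single-linkage agglomerative merging over the pairwise DTW matrix can be written as:

```python
# Illustrative sketch only (not the thesis implementation).
# dtw() is the classic cumulative-cost dynamic program; single_linkage()
# merges the two closest clusters (minimum pairwise DTW) until k remain.

def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # (n+1) x (m+1) cumulative-cost table; D[0][0] is the empty alignment.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match step
    return D[n][m]

def single_linkage(series, k):
    """Agglomerate len(series) singletons down to k clusters of indices."""
    clusters = [[i] for i in range(len(series))]
    d = {(i, j): dtw(series[i], series[j])
         for i in range(len(series)) for j in range(i + 1, len(series))}
    while len(clusters) > k:
        # Pick the cluster pair with the smallest inter-cluster distance.
        p, q = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: min(d[tuple(sorted((i, j)))]
                               for i in clusters[pq[0]]
                               for j in clusters[pq[1]]))
        clusters[p] += clusters.pop(q)
    return clusters
```

`single_linkage(series, k)` returns the k clusters as lists of series indices; a real experiment at the scale of these datasets would use an optimized DTW (e.g. with a warping window) and a library linkage routine rather than this quadratic search.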

Table A-1 CBF dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.5714 1 0.5714 1 0.5714 1 0.5714
2 0.7228 2 0.8333 2 0.7286 2 0.8235
3 0.7176 3 0.5306 3 0.7114 3 0.9333
4 0.6170 4 0.4813 4 0.5797 4 0.8061
5 0.6018 5 0.4517 5 0.5870 5 0.7074
6 0.5931 6 0.3764 6 0.5839 6 0.6667
7 0.5796 7 0.3583 7 0.5289 7 0.6000
8 0.5276 8 0.3135 8 0.4802 8 0.5455
9 0.4826 9 0.2972 9 0.4680 9 0.5000
10 0.4463 10 0.2675 10 0.4325 10 0.4615

Table A-2 ProximalPhalanxOutlineAgeGroup dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.6418 1 0.6418 1 0.6418 1 0.6418
2 0.7263 2 0.6878 2 0.5422 2 0.7464
3 0.7818 3 0.6683 3 0.4393 3 0.8132
4 0.6623 4 0.5598 4 0.5524 4 0.6962
5 0.5751 5 0.4974 5 0.4923 5 0.6109
6 0.5109 6 0.4335 6 0.4762 6 0.5407
7 0.4638 7 0.3942 7 0.4366 7 0.4883
8 0.4273 8 0.4368 8 0.4299 8 0.4447
9 0.3954 9 0.3967 9 0.3969 9 0.4112
10 0.3637 10 0.3687 10 0.3669 10 0.3796

Table A-3 BeetleFly dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.6667 1 0.6667 1 0.6667 1 0.6667
2 0.7622 2 0.8518 2 0.5503 2 0.7895
3 0.6166 3 0.6906 3 0.4395 3 0.6449
4 0.5159 4 0.5765 4 0.4012 4 0.5399
5 0.4450 5 0.4935 5 0.4098 5 0.4646
6 0.4531 6 0.4247 6 0.3525 6 0.4085
7 0.4000 7 0.3796 7 0.3135 7 0.3618
8 0.3610 8 0.3403 8 0.3053 8 0.3610
9 0.3472 9 0.3102 9 0.3102 9 0.3287
10 0.3183 10 0.3016 10 0.3016 10 0.3016

Table A-4 DistalPhalanxTW dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.6949 1 0.6949 1 0.6949 1 0.6949
2 0.8323 2 0.5155 2 0.5155 2 0.8323
3 0.6556 3 0.4475 3 0.4765 3 0.6611
4 0.5451 4 0.5370 4 0.5405 4 0.5487
5 0.5123 5 0.4737 5 0.4758 5 0.4803
6 0.4471 6 0.4894 6 0.4127 6 0.5116
7 0.4567 7 0.4337 7 0.3672 7 0.4860
8 0.4168 8 0.3939 8 0.3956 8 0.4501
9 0.3806 9 0.3651 9 0.3839 9 0.4231
10 0.3584 10 0.3440 10 0.3550 10 0.3909

Table A-5 DistalPhalanxOutlineAgeGroup dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.7823 1 0.7823 1 0.7823 1 0.7823
2 0.8388 2 0.5888 2 0.5888 2 0.8370
3 0.7045 3 0.4814 3 0.6717 3 0.7321
4 0.5964 4 0.5663 4 0.5663 4 0.6438
5 0.6215 5 0.4877 5 0.4870 5 0.5878
6 0.5454 6 0.4399 6 0.4396 6 0.5290
7 0.4904 7 0.4605 7 0.4513 7 0.4696
8 0.4480 8 0.4168 8 0.4155 8 0.4286
9 0.4060 9 0.3859 9 0.3872 9 0.3906
10 0.3753 10 0.3614 10 0.3552 10 0.3613

Table A-6 ProximalPhalanxOutlineCorrect dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.6593 1 0.6593 1 0.6593 1 0.6593
2 0.7485 2 0.5755 2 0.5455 2 0.7022
3 0.5887 3 0.4726 3 0.4618 3 0.5801
4 0.4985 4 0.4862 4 0.4020 4 0.5171
5 0.4330 5 0.4197 5 0.4202 5 0.4493
6 0.3808 6 0.3635 6 0.3690 6 0.3955
7 0.3378 7 0.3200 7 0.3301 7 0.3523
8 0.3047 8 0.2847 8 0.2957 8 0.3161
9 0.2782 9 0.2568 9 0.2676 9 0.2865
10 0.2549 10 0.2355 10 0.2448 10 0.2622

Table A-7 MiddlePhalanxOutlineAgeGroup dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
2 0.6140 2 0.6141 2 0.6141 2 0.6140
3 0.7006 3 0.6014 3 0.6108 3 0.6777
4 0.6369 4 0.5746 4 0.6562 4 0.5863
5 0.5361 5 0.5127 5 0.5566 5 0.4932
6 0.4672 6 0.4525 6 0.4988 6 0.4845
7 0.4229 7 0.3987 7 0.4553 7 0.4306
8 0.3782 8 0.3641 8 0.4171 8 0.3900
9 0.3483 9 0.3830 9 0.3788 9 0.3553
10 0.3191 10 0.3516 10 0.3466 10 0.3271
11 0.2927 11 0.3257 11 0.3212 11 0.3049

Table A-8 Gun-Point dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
2 0.7083 2 0.5770 2 0.6114 2 0.6707
3 0.5780 3 0.4518 3 0.5897 3 0.5827
4 0.5072 4 0.4471 4 0.4930 4 0.5266
5 0.4303 5 0.4138 5 0.4621 5 0.4560
6 0.3786 6 0.3996 6 0.4464 6 0.3965
7 0.3429 7 0.3940 7 0.3940 7 0.3537
8 0.3100 8 0.3552 8 0.3513 8 0.3163
9 0.2809 9 0.3205 9 0.3205 9 0.2873
10 0.2581 10 0.2944 10 0.2944 10 0.2632
11 0.2495 11 0.2722 11 0.2722 11 0.2543

Table A-9 Lightning-7 dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
7 0.5618 7 0.4718 7 0.4736 7 0.6441
8 0.5513 8 0.4726 8 0.4452 8 0.5911
9 0.5449 9 0.4832 9 0.4473 9 0.5793
10 0.5225 10 0.4583 10 0.4375 10 0.5890
11 0.4864 11 0.4391 11 0.4052 11 0.5499
12 0.4550 12 0.4227 12 0.3854 12 0.5197
13 0.4314 13 0.4179 13 0.3997 13 0.5049
14 0.4120 14 0.4015 14 0.4174 14 0.4830
15 0.3913 15 0.3822 15 0.4194 15 0.4616
16 0.3755 16 0.3761 16 0.4156 16 0.4496
17 0.3753 17 0.3597 17 0.3970 17 0.4224
18 0.3598 18 0.3533 18 0.4214 18 0.4055
19 0.3524 19 0.3390 19 0.4009 19 0.3994
20 0.3393 20 0.3240 20 0.3922 20 0.3845
21 0.3335 21 0.3116 21 0.3781 21 0.3706

Table A-10 Computers dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
2 0.5400 2 0.6032 2 0.5199 2 0.6384
3 0.4869 3 0.4818 3 0.4135 3 0.5151
4 0.4426 4 0.3963 4 0.3558 4 0.4360
5 0.3717 5 0.3359 5 0.3099 5 0.3679
6 0.3281 6 0.2916 6 0.2984 6 0.3291
7 0.2922 7 0.2560 7 0.2831 7 0.2962
8 0.2633 8 0.2286 8 0.2563 8 0.2659
9 0.2420 9 0.2127 9 0.2334 9 0.2400
10 0.2216 10 0.1958 10 0.2190 10 0.2202
11 0.2036 11 0.1813 11 0.2020 11 0.2023

Table A-11 Face (all) dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
14 0.5954 14 0.4036 14 0.3490 14 0.6177
15 0.5994 15 0.3922 15 0.3532 15 0.5933
16 0.5823 16 0.3869 16 0.3575 16 0.5720
17 0.5947 17 0.3825 17 0.3572 17 0.5805
18 0.5745 18 0.3770 18 0.3579 18 0.5618
19 0.5649 19 0.3709 19 0.3622 19 0.5492
20 0.5571 20 0.3647 20 0.3550 20 0.5350
21 0.5419 21 0.3561 21 0.3485 21 0.4941
22 0.5267 22 0.3454 22 0.3418 22 0.4766
23 0.5524 23 0.3403 23 0.3382 23 0.4632
24 0.5384 24 0.3360 24 0.3368 24 0.4540
25 0.5277 25 0.3286 25 0.3342 25 0.4475
26 0.5240 26 0.3249 26 0.3327 26 0.4361
27 0.5175 27 0.3192 27 0.3275 27 0.4252
28 0.5095 28 0.3159 28 0.3196 28 0.4258
29 0.4977 29 0.3075 29 0.3168 29 0.4158
30 0.4866 30 0.3031 30 0.3119 30 0.4062
31 0.4784 31 0.3042 31 0.3071 31 0.4063
32 0.4672 32 0.3037 32 0.3054 32 0.3972
33 0.4574 33 0.3016 33 0.3056 33 0.3882
34 0.4500 34 0.2963 34 0.2998 34 0.4095
35 0.4409 35 0.2961 35 0.2938 35 0.4004
36 0.4321 36 0.2918 36 0.2885 36 0.3999
37 0.4512 37 0.2880 37 0.2842 37 0.3936
38 0.4426 38 0.2917 38 0.2853 38 0.3861
39 0.4344 39 0.2877 39 0.2798 39 0.3788
40 0.4253 40 0.2845 40 0.2776 40 0.3737
41 0.4202 41 0.2818 41 0.2769 41 0.3742
42 0.4128 42 0.2781 42 0.2722 42 0.3665
43 0.4124 43 0.2736 43 0.2687 43 0.3618
44 0.4053 44 0.2717 44 0.2660 44 0.3567
45 0.3998 45 0.2672 45 0.2641 45 0.3505
46 0.3946 46 0.2650 46 0.2613 46 0.3445
47 0.3882 47 0.2632 47 0.2606 47 0.3388
48 0.3821 48 0.2620 48 0.2624 48 0.3332
49 0.3778 49 0.2580 49 0.2584 49 0.3295
50 0.3759 50 0.2535 50 0.2599 50 0.3408

Table A-12 Swedish Leaf dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
15 0.4692 15 0.3843 15 0.3881 15 0.5120
16 0.4749 16 0.3779 16 0.3724 16 0.4921
17 0.4591 17 0.3666 17 0.3703 17 0.4836
18 0.4477 18 0.3513 18 0.3547 18 0.4660
19 0.4428 19 0.3647 19 0.3634 19 0.4495
20 0.4283 20 0.3673 20 0.3480 20 0.4336
21 0.4176 21 0.3542 21 0.3413 21 0.4311
22 0.4092 22 0.3446 22 0.3333 22 0.4171
23 0.3967 23 0.3362 23 0.3220 23 0.4348
24 0.3866 24 0.3310 24 0.3122 24 0.4216
25 0.3825 25 0.3236 25 0.3135 25 0.4092
26 0.3964 26 0.3258 26 0.3088 26 0.4164
27 0.3858 27 0.3287 27 0.3005 27 0.4050
28 0.4032 28 0.3203 28 0.2956 28 0.3942
29 0.3932 29 0.3146 29 0.2902 29 0.3840
30 0.3836 30 0.3056 30 0.2878 30 0.3742
31 0.3745 31 0.2985 31 0.2805 31 0.3647
32 0.3653 32 0.2944 32 0.2730 32 0.3556
33 0.3569 33 0.2895 33 0.2690 33 0.3471
34 0.3506 34 0.2829 34 0.2644 34 0.3440
35 0.3449 35 0.2826 35 0.2603 35 0.3387
36 0.3374 36 0.2763 36 0.2544 36 0.3312
37 0.3302 37 0.2739 37 0.2491 37 0.3235
38 0.3233 38 0.2668 38 0.2436 38 0.3188
39 0.3167 39 0.2646 39 0.2384 39 0.3122
40 0.3181 40 0.2595 40 0.2368 40 0.3062
41 0.3144 41 0.2540 41 0.2479 41 0.3001
42 0.3214 42 0.2719 42 0.2471 42 0.2941
43 0.3149 43 0.2685 43 0.2424 43 0.2885
44 0.3107 44 0.2636 44 0.2442 44 0.2935
45 0.3052 45 0.2589 45 0.2701 45 0.2880
46 0.2999 46 0.2615 46 0.2647 46 0.2828
47 0.3064 47 0.2582 47 0.2600 47 0.2777
48 0.3013 48 0.2538 48 0.2553 48 0.2729
49 0.3027 49 0.2507 49 0.2508 49 0.2682
50 0.3022 50 0.2466 50 0.2467 50 0.2648
51 0.2975 51 0.2544 51 0.2427 51 0.2604

Table A-13 Beef dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
1 0.3333 1 0.3333 1 0.3333 1 0.3333
2 0.7080 2 0.7080 2 0.7080 2 0.7080
3 0.5196 3 0.6759 3 0.6360 3 0.5952
4 0.4971 4 0.6190 4 0.4971 4 0.5664
5 0.4901 5 0.5091 5 0.4971 5 0.5698
6 0.4190 6 0.4615 6 0.4615 6 0.5198
7 0.4087 7 0.4190 7 0.4190 7 0.4807
8 0.4389 8 0.4215 8 0.4215 8 0.4645
9 0.4170 9 0.3900 9 0.3900 9 0.4313
10 0.4038 10 0.4165 10 0.4165 10 0.4016

Table A-14 50Words dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
91 0.4334 91 0.3847 91 0.4090 91 0.4319
92 0.4308 92 0.3862 92 0.4112 92 0.4334
93 0.4278 93 0.3868 93 0.4117 93 0.4300
94 0.4261 94 0.3856 94 0.4102 94 0.4284
95 0.4253 95 0.3837 95 0.4097 95 0.4277
96 0.4287 96 0.3812 96 0.4065 96 0.4278
97 0.4261 97 0.3809 97 0.4057 97 0.4270
98 0.4241 98 0.3819 98 0.4031 98 0.4274
99 0.4226 99 0.3790 99 0.4036 99 0.4243
100 0.4187 100 0.3759 100 0.4010 100 0.4210

Table A-15 FISH dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
7 0.5123 7 0.4067 7 0.4109 7 0.5271
8 0.5062 8 0.4185 8 0.4236 8 0.4989
9 0.4611 9 0.3879 9 0.4031 9 0.4976
10 0.4424 10 0.4317 10 0.3948 10 0.4745
11 0.4590 11 0.4067 11 0.3849 11 0.4403
12 0.4702 12 0.3868 12 0.3633 12 0.4193
13 0.4503 13 0.3699 13 0.3494 13 0.3973
14 0.4419 14 0.3538 14 0.3335 14 0.3753
15 0.4205 15 0.3563 15 0.3237 15 0.3720
16 0.4188 16 0.3411 16 0.3095 16 0.3563
17 0.4000 17 0.3246 17 0.3140 17 0.3688
18 0.3827 18 0.3155 18 0.3246 18 0.3522
19 0.3686 19 0.3001 19 0.3141 19 0.3440
20 0.3542 20 0.2969 20 0.3083 20 0.3318
21 0.3430 21 0.2959 21 0.2956 21 0.3198

Table A-16 FordA dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
2 0.5128 2 0.5123 2 0.5065 2 0.5225
3 0.4908 3 0.4098 3 0.4050 3 0.4205
4 0.4241 4 0.3459 4 0.3386 4 0.4102
5 0.3646 5 0.3053 5 0.3040 5 0.3602
6 0.3161 6 0.2718 6 0.2661 6 0.3158
7 0.2787 7 0.2422 7 0.2493 7 0.2791
8 0.2498 8 0.2176 8 0.2244 8 0.2504
9 0.2268 9 0.1989 9 0.2081 9 0.2332
10 0.2137 10 0.1818 10 0.1906 10 0.2168
11 0.1974 11 0.1682 11 0.1764 11 0.2001
12 0.1896 12 0.1560 12 0.1641 12 0.1865
13 0.1802 13 0.1455 13 0.1547 13 0.1740
14 0.1694 14 0.1374 14 0.1475 14 0.1630
15 0.1591 15 0.1293 15 0.1411 15 0.1555
16 0.1506 16 0.1221 16 0.1334 16 0.1470
17 0.1445 17 0.1156 17 0.1289 17 0.1393
18 0.1372 18 0.1110 18 0.1224 18 0.1324
19 0.1305 19 0.1069 19 0.1166 19 0.1271
20 0.1252 20 0.1027 20 0.1112 20 0.1214
21 0.1198 21 0.0982 21 0.1063 21 0.1160

Table A-17 ArrowHead dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
3 0.4180 3 0.4357 3 0.4157 3 0.4956
4 0.4363 4 0.3818 4 0.3818 4 0.4626
5 0.3947 5 0.3569 5 0.3850 5 0.4073
6 0.4306 6 0.4110 6 0.3285 6 0.4489
7 0.3864 7 0.3682 7 0.2941 7 0.4049
8 0.3514 8 0.3329 8 0.3225 8 0.3659
9 0.3354 9 0.3017 9 0.2955 9 0.3301
10 0.3081 10 0.2760 10 0.2722 10 0.3037
11 0.2871 11 0.2602 11 0.2566 11 0.2798
12 0.2657 12 0.2419 12 0.2407 12 0.2586
13 0.2531 13 0.2260 13 0.2307 13 0.2422
14 0.2361 14 0.2171 14 0.2165 14 0.2276
15 0.2229 15 0.2350 15 0.2331 15 0.2150
16 0.2104 16 0.2219 16 0.2231 16 0.2183
17 0.2024 17 0.2104 17 0.2100 17 0.2073
18 0.1998 18 0.2001 18 0.1995 18 0.1974
19 0.1906 19 0.1968 19 0.1903 19 0.1906

Table A-18 Phoneme dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
92 0.3044 92 0.2894 92 0.3007 92 0.3202
93 0.3027 93 0.2870 93 0.2981 93 0.3175
94 0.3020 94 0.2862 94 0.2975 94 0.3287
95 0.3010 95 0.2980 95 0.2961 95 0.3287
96 0.2991 96 0.2977 96 0.2950 96 0.3281
97 0.2980 97 0.2982 97 0.2937 97 0.3256
98 0.3006 98 0.2979 98 0.2929 98 0.3250
99 0.3005 99 0.2955 99 0.2917 99 0.3247
100 0.2982 100 0.2948 100 0.2894 100 0.3218
101 0.2958 101 0.2925 101 0.2886 101 0.3192

Table A-19 TwoLeadECG dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
2 0.7628 2 0.7429 2 0.7336 2 0.7961
3 0.5053 3 0.6059 3 0.6323 3 0.5512
4 0.4034 4 0.4964 4 0.5001 4 0.4658
5 0.3385 5 0.4899 5 0.4624 5 0.4189
6 0.2921 6 0.4299 6 0.4304 6 0.3643
7 0.2563 7 0.3839 7 0.3837 7 0.3222
8 0.2292 8 0.3460 8 0.3438 8 0.2879
9 0.2088 9 0.3147 9 0.3187 9 0.2635
10 0.2017 10 0.2795 10 0.2933 10 0.2664
11 0.1849 11 0.2628 11 0.2715 11 0.2469

Table A-20 OSU Leaf dataset

# Clusters  DTW     # Clusters  Min     # Clusters  EU      # Clusters  M-DTW
6 0.4554 6 0.3330 6 0.3600 6 0.3951
7 0.4480 7 0.3342 7 0.3507 7 0.3954
8 0.4489 8 0.3159 8 0.3348 8 0.3951
9 0.4147 9 0.2957 9 0.3286 9 0.3747
10 0.3921 10 0.2997 10 0.3230 10 0.3586
11 0.3674 11 0.2827 11 0.3155 11 0.3599
12 0.3454 12 0.2842 12 0.3106 12 0.3514
13 0.3237 13 0.2679 13 0.3003 13 0.3387
14 0.3216 14 0.2593 14 0.2853 14 0.3268
15 0.3169 15 0.2474 15 0.2728 15 0.3077
16 0.3031 16 0.2443 16 0.2655 16 0.3242
17 0.2979 17 0.2406 17 0.2636 17 0.3094
18 0.2871 18 0.2368 18 0.2605 18 0.2963
19 0.2936 19 0.2303 19 0.2445 19 0.2904
20 0.2823 20 0.2260 20 0.2352 20 0.2784
21 0.2731 21 0.2313 21 0.2271 21 0.2681

APPENDIX B

ASSOCIATION RULES
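
The Eval columns in Tables B-1 and B-2 report Weka-style rule metrics: confidence, lift, leverage and conviction for a rule X -> Y. As a hedged reading aid — the function below uses the textbook definitions and is not Weka's code; Weka's Apriori applies its own smoothing, which is one reason Table B-1 can print finite conviction values even at confidence 1 — the metrics can be computed from transaction counts as:

```python
# Illustrative textbook definitions of the rule metrics shown in the Eval
# columns (not Weka's implementation). All inputs are raw counts over the
# same set of n_total transactions.

def rule_metrics(n_total, n_x, n_y, n_xy):
    """Metrics for a rule X -> Y.

    n_x / n_y: transactions containing X / Y; n_xy: containing both.
    """
    p_x, p_y, p_xy = n_x / n_total, n_y / n_total, n_xy / n_total
    conf = p_xy / p_x                    # P(Y | X)
    lift = conf / p_y                    # how much X boosts Y over chance
    leverage = p_xy - p_x * p_y          # joint support above independence
    # Conviction diverges at conf = 1 under the textbook definition.
    conviction = float("inf") if conf == 1 else (1 - p_y) / (1 - conf)
    return {"conf": conf, "lift": lift, "lev": leverage, "conv": conviction}
```

For example, a rule seen in 15 of 100 transactions, with X in 20 and Y in 50, has confidence 0.75, lift 1.5, leverage 0.05 and conviction 2.0.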

Table B-1 Association rules for clustering when k=9

Index Factor 1 Factor 2 O3 Eval


1 NOx=0.003 Temp=33.7 0.086 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
2 NOx=0.003 Temp=32.1 0.088 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
3 NOx=0.003 Temp=30.4 0.070 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
4 NOx=0.003 Temp=29.9 0.061 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
5 NOx=0.003 Temp=29.2 0.059 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
6 NOx=0.003 NO2=0.013 0.059 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
7 NOx=0.003 NO2=0.002 0.070 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
8 NOx=0.003 CO=1.7 0.070 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
9 NOx=0.003 CO=1.61 0.061 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
10 NOx=0.003 CO=0.31 0.088 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
11 NOx=0.003 CO=0.27 0.086 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
12 NOx=0.003 CO=0.16 0.059 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
13 CO=0.75 - 0.148 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
14 CO=0.91 - 0.147 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
15 CO=0.7 - 0.143 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
16 CO=0.44 - 0.140 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
17 CO=0.63 - 0.140 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
18 Temp=32.6 - 0.139 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
19 CO=0.59 - 0.139 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
20 CO=0.78 - 0.131 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
21 Temp=34.2 - 0.124 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
22 CO=0.48 - 0.124 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
23 NOx=0.015 - 0.123 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
24 CO=0.46 - 0.121 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
25 CO=0.73 - 0.119 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
26 NO2=0.021 - 0.116 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
27 NOx=0.02 - 0.113 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
28 NO2=0.019 - 0.113 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
29 CO=0.67 - 0.113 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
30 Temp=32.8 - 0.109 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
31 Temp=30 - 0.105 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
32 CO=0.43 - 0.105 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
33 Temp=33.1 - 0.100 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
34 NOx=0.081 - 0.099 <conf:(1)> lift:(23.75) lev:(0.01) [0] conv:(0.96)
35 NO2=0.059 - 0.099 <conf:(1)> lift:(23.75) lev:(0.01) [0] conv:(0.96)
36 CO=0.6 - 0.098 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
37 CO=1.72 - 0.098 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
38 NOx=0.023 - 0.094 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
39 CO=1.71 - 0.087 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
40 Temp=33.7 - 0.086 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
41 CO=0.15 - 0.084 <conf:(1)> lift:(23.75) lev:(0.01) [0] conv:(0.96)
42 CO=0.39 - 0.084 <conf:(1)> lift:(23.75) lev:(0.01) [0] conv:(0.96)
43 CO=0.5 - 0.084 <conf:(1)> lift:(23.75) lev:(0.01) [0] conv:(0.96)
44 Temp=30.2 - 0.079 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
45 CO=0.1 - 0.079 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
46 NOx=0.017 - 0.076 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
47 Temp=28.3 - 0.076 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
48 Temp=32 - 0.075 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
49 NOx=0.028 - 0.071 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
50 NO2=0.024 - 0.071 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
51 Temp=29.4 - 0.071 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
52 CO=0.07 - 0.071 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
53 CO=0.68 - 0.071 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
54 NO2=0.002 - 0.070 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
55 CO=1.7 - 0.070 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
56 CO=0.28 - 0.068 <conf:(1)> lift:(31.67) lev:(0.01) [0] conv:(0.97)
57 CO=0.34 - 0.063 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
58 Temp=33.2 - 0.062 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
59 CO=1.61 - 0.061 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
60 Temp=29.7 - 0.060 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
61 CO=0.3 - 0.060 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
62 CO=0.56 - 0.060 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
63 Temp=28.8 - 0.056 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
64 CO=1.65 - 0.056 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
65 Temp=30.6 - 0.055 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)
66 Temp=31.8 - 0.050 <conf:(1)> lift:(47.5) lev:(0.01) [0] conv:(0.98)
67 Temp=30.1 - 0.042 <conf:(1)> lift:(95) lev:(0.01) [0] conv:(0.99)

Table B-2 Association rules for clustering when k=5

Index Factor 1 O3 Eval


1 CO=0.18 O3=0.038 <conf:(1)> lift:(14) lev:(0.04) conv:(1.86)
2 CO=0.27 O3=0.073 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
3 NOx=0.007 O3=0.042 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
4 NOx=0.007 O3=0.042 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
5 NO=0.001 O3=0.042 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
6 NO=0.001 O3=0.073 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
7 NOx=0.007 O3=0.042 <conf:(0.67)> lift:(9.33) lev:(0.04) conv:(1.39)
8 NOx=0.007 O3=0.042 <conf:(0.5)> lift:(7) lev:(0.04) conv:(1.24)
9 NO2=0.006 O3=0.042 <conf:(0.5)> lift:(7) lev:(0.04) conv:(1.24)
10 NOx=0.001 O3=0.045 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
11 NOx=0.002 O3=0.047 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
12 NOx=0.001 O3=0.045 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
13 NOx=0.001 O3=0.045 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
14 NOx=0.002 O3=0.047 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
15 NOx=0.001 O3=0.045 <conf:(0.5)> lift:(10.5) lev:(0.04) conv:(1.27)
16 NOx=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
17 NOx=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
18 NOx=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
19 NO2=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
20 NOx=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
21 NO=0.001 O3=0.045 <conf:(0.4)> lift:(8.4) lev:(0.04) conv:(1.19)
22 CO=0.07 O3=0.045 <conf:(0.33)> lift:(7) lev:(0.04) conv:(1.14)
23 NOx=0.002 O3=0.032 <conf:(0.33)> lift:(7) lev:(0.04) conv:(1.14)
24 NOx=0.002 O3=0.046 <conf:(0.33)> lift:(2.8) lev:(0.03) conv:(1.06)
25 NO=0.001 O3=0.049 <conf:(0.33)> lift:(4.67) lev:(0.04) conv:(1.11)
26 NO=0.001 O3=0.045 <conf:(0.33)> lift:(7) lev:(0.04) conv:(1.14)
27 NOx=0.002 O3=0.032 <conf:(0.33)> lift:(7) lev:(0.04) conv:(1.14)
28 NOx=0.002 O3=0.046 <conf:(0.33)> lift:(2.8) lev:(0.03) conv:(1.06)
29 NO2=0.004 O3=0.049 <conf:(0.29)> lift:(4) lev:(0.04) conv:(1.08)

30 NO2=0.004 O3=0.073 <conf:(0.29)> lift:(4) lev:(0.04) conv:(1.08)


31 NO2=0.001 O3=0.046 <conf:(0.27)> lift:(2.29) lev:(0.04) conv:(1.08)
32 NO=0.001 O3=0.046 <conf:(0.27)> lift:(2.29) lev:(0.04) conv:(1.08)
33 NO2=0.002 O3=0.047 <conf:(0.25)> lift:(5.25) lev:(0.04) conv:(1.09)
34 NO=0.001 O3=0.047 <conf:(0.25)> lift:(5.25) lev:(0.04) conv:(1.09)
35 NOx=0.002 O3=0.032 <conf:(0.2)> lift:(4.2) lev:(0.04) conv:(1.06)
36 NOx=0.002 O3=0.046 <conf:(0.2)> lift:(1.68) lev:(0.02) conv:(0.98)
37 NOx=0.002 O3=0.047 <conf:(0.2)> lift:(4.2) lev:(0.04) conv:(1.06)
38 NOx=0.002 O3=0.032 <conf:(0.2)> lift:(4.2) lev:(0.04) conv:(1.06)
39 NOx=0.002 O3=0.046 <conf:(0.2)> lift:(1.68) lev:(0.02) conv:(0.98)
40 NOx=0.002 O3=0.047 <conf:(0.2)> lift:(4.2) lev:(0.04) conv:(1.06)
41 NO2=0.001 O3=0.032 <conf:(0.18)> lift:(3.82) lev:(0.04) conv:(1.05)
42 NO2=0.001 O3=0.045 <conf:(0.18)> lift:(3.82) lev:(0.04) conv:(1.05)
43 NO=0.001 O3=0.032 <conf:(0.18)> lift:(3.82) lev:(0.04) conv:(1.05)
44 NO=0.001 O3=0.045 <conf:(0.18)> lift:(3.82) lev:(0.04) conv:(1.05)
45 NO=0.001 O3=0.046 <conf:(0.14)> lift:(1.14) lev:(0.01) conv:(0.99)
46 NO=0.001 O3=0.042 <conf:(0.08)> lift:(1.14) lev:(0.01) conv:(0.98)
47 NO=0.001 O3=0.049 <conf:(0.08)> lift:(1.14) lev:(0.01) conv:(0.98)
