
BACHELOR THESIS

Computer vision based approach


for analysing fluids inside
transparent glass vessels

Jasper Busschers
August 17, 2018

Johan Loeckx, Promotor


Computer Science
Acknowledgements

In this thesis, we study a dataset of fluids in order to build a machine
learning application that identifies and predicts how these fluids will
change. This problem proved to be far more difficult than initially
expected, because of the limited amount of labeled data and the vast
variety of appearances fluids can have.
These difficulties forced me to study more advanced deep learning
approaches and helped me gain a deeper understanding of the possibilities
and limitations of deep learning. I would like to thank my promotor
J. Loeckx for guiding me during this research.
I would also like to thank F. Zonfrilli from P&G for providing the
dataset and for providing more insight into the problem.
Lastly, I also thank my friend J. Cardon, who allowed me to use his
computer for all experiments. This allowed me to perform far more
experiments.
Abstract

The goal of this bachelor thesis is to identify the changes that occur
to fluids when kept under certain conditions. Identifying these visual
changes in samples and keeping logs of their state is a time-consuming
procedure carried out by many chemistry labs, for example by Procter &
Gamble (P&G), who requested this research and also provided the data.
Their interest is to improve quality assurance of their products by
performing experiments on different samples. Another goal is to predict
early how samples will change in order to reduce the duration of the
experiments.
Solving classification problems like the one discussed above is often
done via supervised learning. These methods rely on vast amounts of
labelled data that were not available in our dataset. That is why this
thesis covers 2 alternative approaches that can be used to solve the
classification problem. In the first approach, we try to quantify the
amount of change that occurred by making an artificial neural network
(ANN) learn a similarity score. This score can then be used to identify
different changes.
The other approach uses Generative Adversarial Networks[1] in order
to directly classify samples with the correct label. This architecture
consists of 2 different networks: the generator and the discriminator.
The generator is trained to produce new samples that appear realistic.
This training is done in an unsupervised way since no labels are required.
The discriminator is then trained to classify the correct change and
to predict whether the input is a real image or one generated by the
generator.
This architecture is interesting both for classifying and for predicting
the state of a sample. The discriminator can be trained as a good
classifier, and the generator can be trained to predict the future state
of a sample. More on this in chapter 5.
Contents

1 Introduction 1
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Classifying the correct change . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Quantifying the changes . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Classifying using a Generative approach . . . . . . . . . . . 3
1.3 Making earlier predictions . . . . . . . . . . . . . . . . . . . . . . . 3

2 Dataset 4
2.1 Sequence representation . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Absolute difference . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Failure types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Manually created labels . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Evaluating the difficulties . . . . . . . . . . . . . . . . . . . . . . . 6

3 Traditional deep learning approach 8


3.1 Convolutional neural network . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Convolution layer . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Pooling layer . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.3 Activation layer . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.4 Fully connected layer . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Auto Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Implementation traditional approach 13


4.1 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 Generative approach 17
5.1 Generative adversarial networks . . . . . . . . . . . . . . . . . . . . 18
5.2 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.2.1 DCGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2.2 Cycle GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.3 Semi-supervised learning with GANs . . . . . . . . . . . . . . . . . 20

6 Implementation generative approach 22


6.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
6.1.1 The architecture . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.4 Prediction with different inputs . . . . . . . . . . . . . . . . . . . . 25

7 Improving predictions 27
7.1 Predicting future appearance . . . . . . . . . . . . . . . . . . . . . . 28
7.1.1 Using the first and middle image . . . . . . . . . . . . . . . 28
7.1.2 Using the absolute difference . . . . . . . . . . . . . . . . . . 28
7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

8 Conclusion and future work 31


8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
8.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

Bibliography 33

Chapter 1

Introduction

In many industries it is very important to test how the quality of certain products
degrades over time. Such tests can be a time-consuming process since they need to
be performed over a long period of time. After such an experiment, a log should also
be made for every sample stating which change occurred.
In this thesis, we are especially interested in the changes that can occur inside
fluids that are kept under certain conditions inside glass vessels. The goal is to
reduce the labour time needed to perform such experiments. This can be done by
automatically classifying the change that occurred in a sample, which reduces the
time needed to create logs. Another way is to learn to predict the future
appearance of a sample by looking at earlier appearances. This would reduce the
amount of time experiments need to last.
In order to solve these problems, a dataset of fluid samples inside transparent glass
vessels was used. The dataset was provided by P&G and contains approximately
1000 sequences of images, showing samples of fluids during their experiments. In
chapter 2 this dataset will be covered in more detail. Here we will discuss what
kind of data representation best captures the change in a sequence.
Chapter 3 gives an overview of traditional deep learning approaches and discusses
how these can be used to identify changes in fluids, for example auto encoders,
which we will use in chapter 4 to quantify the amount of change that occurred.
For this, a neural network is trained to encode stable samples as similarly as
possible and samples showing a change as differently as possible.
In chapter 5 an overview will be given of generative approaches for solving the
classification problem. These methods can be trained in a semi-supervised way,
requiring far less labelled data. The results of this approach will be presented in
chapter 6 where we use a Generative Adversarial network (GAN) to classify the
changes in sequences of fluids. This chapter will merely focus on the classification
accuracy and will not cover the quality of the generated data.
In chapter 7 we present a GAN architecture specialised in producing high quality

generated images. This network will be trained to predict the final appearance of
a sample by looking at some earlier images.

1.1 Problem definition


As described in the introduction, the goal of this thesis is to reduce the work
needed to perform experiments on samples of fluids. Such experiments require a
lot of labour hours in order to manually keep logs of every sample. In this thesis,
we will try to create a system that is able to automatically detect and classify a
change when it occurs.
Creating handcrafted methods for identifying certain changes would require much
knowledge about the problem. Such a method would also have difficulty dealing
with the wide variety of appearances of fluids. That is why this thesis will be
limited to deep learning approaches used to identify changes. In chapter 3 an
overview is given of traditional deep learning approaches that make use of neural
networks. In chapter 5 we review another deep learning approach that makes use
of generative networks to learn to identify changes.
Another problem is that these experiments often run for a long time before results
can be measured. This problem can be solved either by identifying the changes earlier
or by predicting the future appearance based on earlier states.
Chapter 7 will discuss how the results can be improved and how we can make
earlier predictions.

1.2 Classifying the correct change


The main goal of this thesis is to assign one of 5 labels to a set of images of a
fluid. A sample can be classified either as stable or as one of 4 failure modes:
sedimentation, creaming, splitting and color change. These will be discussed in
further detail throughout chapter 2.
Usually one would solve such a problem as a recurrent classification problem when
there is enough labelled data available. In such a recurrent approach, a whole
sequence of data is fed to an ANN that is then trained to produce the right label.
The amount of data required to learn a problem in such a supervised way is not
strictly defined and depends on the difficulty of the problem. However, the dataset
used in this thesis was only provided with 22 labels for 1256 sequences, which
certainly is not sufficient for such methods. This is why this thesis will not cover
any fully supervised methods. Instead, 2 different approaches will be presented.

1.2.1 Quantifying the changes
In chapter 4 an attempt is made to solve the problem by training an ANN to
quantify the amount of change between 2 images of the same sample. The Siamese
Network is an architecture that was presented for this goal[2]. This network trains
an encoder to encode similar samples as closely as possible, while dissimilar
samples should be encoded very differently. In our case we call 2 images of the
same sample similar when the sample's state remained stable.

1.2.2 Classifying using a Generative approach


Chapter 5 discusses a semi-supervised approach that makes use of GANs. Here a
general description is given of GANs as well as a brief summary of the related
research that is of interest in this thesis. The most relevant for our goals is the
paper ”Improved Techniques for Training GANs” by Salimans et al.[3], which describes
how GANs can reach state-of-the-art classification accuracy in semi-supervised
learning problems.
In chapter 6 we will discuss the implementation of this generative approach that
is used to solve the original classification problem. Here we will also present
the classification results after performing cross-validation. Since the goal in this
chapter is only to classify sequences and not to generate new detailed images, only
the output of the discriminator will be evaluated.

1.3 Making earlier predictions


Identifying changes automatically is already convenient for reducing the amount
of labour needed for these experiments. But the time needed for these experiments
can also be reduced by predicting changes earlier.
This can also be seen as a recurrent classification problem if the goal is merely
to predict the label early. A disadvantage of this method is that it still relies on
a large amount of labelled data.
Another way of dealing with such problems is in a generative way: here we let the
network produce a later image in the sequence. This method doesn't rely on any
further labelling but uses the final image of a sequence as ground truth instead. In
this method a mapping is created from one or more early images in the sequence
to the last image in the sequence.
Such a mapping can be used to predict the appearance of the sample at the end
of an experiment by using some earlier states.
More on this in chapter 7, where we will evaluate ways to make earlier predictions.

Chapter 2

Dataset

The dataset used to study changes in fluids has been provided by P&G and has
been constructed by placing samples inside a chamber at a certain temperature.
Every hour a picture is taken of the samples. This creates a sequence of images that
show the changes that occurred in the sample. The dataset contains 1036 usable
sequences. Every sequence contains somewhere between 200 and 1500 images
depending on how long the experiment was conducted. A sample can be classified
as either stable or one of 4 failure states: sedimentation, phase splitting, creaming
and color change. These will be discussed in more detail in section 2.3.
The initial dataset only provided 20 labeled sequences which would certainly not
be sufficient for any fully supervised machine learning method. Arguably, even
unsupervised methods would still require more labels in order to create a validation
set that represents the dataset well. This is why the first stage of this thesis
revolved around ways to represent this data and also creating new labels to be
able to compare the different approaches.

2.1 Sequence representation


Representing the sequences of images is a tricky problem. When we choose too
many images, the sample size can become very big. This also forces us to use
bigger networks to deal with this larger input. It is known that bigger networks
take more time to train and can require more data.
So if we want to represent this data as sequences of images, an interval
should be chosen to limit the number of images in the sequence. Since the length
of the sequences in our dataset varies from 200 to 1500, it is difficult to choose an
interval for all samples. Also since the original labeling doesn’t state the image
where the first change occurred, we can only say for sure that the last image shows
the labeled failure without manually checking every sequence.

For this reason, we have chosen to represent each sequence by its first and last
image.

2.2 Absolute difference


By using the first and last image of a sequence, we can compute their difference
by subtracting the first image from the last image. Doing this allows us to only
capture the effective relationship between the two images while throwing away all
common values. Below the equation to compute this difference is given.

Diff(x1, x2) = max(x1 − x2, x2 − x1)

We perform the subtraction in both directions because the value of a pixel cannot
be below 0, so negative differences are clipped. Taking the maximum of the two
subtractions gives us the absolute difference between the 2 images as a new image.
This result expresses the relationship between those images. The advantage of this
method is that we only need to pass 1 image to a neural network to classify the
relationship between 2 images.
All images are also cropped to 64x64 before taking their difference. This is done
for 2 reasons: first of all, we can only subtract 2 images that are exactly the same
size. The second reason is that this cropped version cuts out the vessel borders
and allows us to only judge changes within the fluid itself.
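To make this preprocessing concrete, the sketch below shows one way the center crop and the two-sided difference could be implemented. It is our own illustration in NumPy; the function names and the uint8 assumption are ours, not taken from the thesis code.

    import numpy as np

    def center_crop(img: np.ndarray, size: int = 64) -> np.ndarray:
        """Crop a H x W (x C) image around its center to size x size."""
        h, w = img.shape[:2]
        top, left = (h - size) // 2, (w - size) // 2
        return img[top:top + size, left:left + size]

    def abs_diff(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
        """Two-sided subtraction of uint8 images, clipping negatives at 0,
        then taking the element-wise maximum: max(x1 - x2, x2 - x1)."""
        a = np.clip(x1.astype(np.int16) - x2.astype(np.int16), 0, 255)
        b = np.clip(x2.astype(np.int16) - x1.astype(np.int16), 0, 255)
        return np.maximum(a, b).astype(np.uint8)

    # usage: diff = abs_diff(center_crop(first_image), center_crop(last_image))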

2.3 Failure types

The figure above shows an example of every failure type: on the left the two input
images x1 and x2 are shown, and on the right the result of Diff(x1,x2). The result
of Diff(x1,x2) gives us some insights into the classification problem. We can see
that stable, phase splitting and color change are easily identifiable, while
sedimentation and creaming only show a very slight difference.

2.4 Manually created labels


Since the original dataset only contained 22 labels, we decided to manually create
more labels. 2 different sets of labels were created, each used to train a different
kind of network. In the first set, every pair of images in a sequence was given a
binary label, 1 or 0. A label 1 is given to samples that remain stable over time,
while any significant change causes a sample to be labeled as 0.
This set is used in chapter 4 to train a network to distinguish stable sequences
from sequences that show a failure. The set is made by comparing the first image
of a sequence to 100 other images in the same sequence. This was done for 650
sequences, which resulted in a dataset of 6500 samples.
The second set of labels was constructed by manually going over the first and last
image of all 1036 sequences and labeling them with a number from 0 to 5 for the
corresponding classification. Here we cannot guarantee that the labels are completely
correct, as they were assigned without deep knowledge in the field, using only the
few originally labeled images as reference.
classification       color change   stable   sedimentation   splitting   creaming
number of samples    269            648      40              53          28

Above you can see the distribution of every class after relabelling the entire
dataset. We can see that while stable samples and color changes are very well
represented in the dataset, sedimentation, splitting and creaming are all very
underrepresented. Such a dataset is called an unbalanced dataset. This can cause a
network to optimize only for the most frequent labels and ignore those that occur
rarely. The next section will cover approaches to deal with unbalanced datasets.

2.5 Evaluating the difficulties


In section 2.3 we discussed how samples labeled as creaming or sedimentation only
show a very slight change that is difficult to spot, even for a human. Considering
that these sequences are also very underrepresented in the dataset, it is safe to
assume these will be the most difficult ones to classify.

While we cannot make these failures easier to detect, we can partly solve the
problem of underrepresentation. One way of solving this problem is by oversampling
the underrepresented classes or undersampling those that are overrepresented. In
our case, oversampling is preferred since we would lose much training data by
undersampling.
Another way to deal with unbalanced datasets is by giving each class a weight that
will say how much a neural network should care about the sample. In order to
compute the weight for each class, we first compute the average weight Aw. This
is the number of samples Ns divided by the number of classifications k.

Aw = Ns / k

Then for every classification, we compute the weight by dividing Aw by the number
of samples in that classification. Giving us the following weights:

classification weight
color change 0.77
stable 0.32
sedimentation 5.19
splitting 3.92
creaming 7.41

These will be used in chapter 6 where we will discuss the implementation of a


neural network trained to classify these changes.
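As a quick sanity check of these numbers, the following sketch (our own, with our own variable names) recomputes the class weights from the counts listed above:

    from collections import OrderedDict

    counts = OrderedDict([
        ("color change", 269), ("stable", 648),
        ("sedimentation", 40), ("splitting", 53), ("creaming", 28),
    ])

    n_samples = sum(counts.values())        # Ns = 1038
    avg_weight = n_samples / len(counts)    # Aw = Ns / k = 207.6

    # weight of a class = Aw divided by the number of samples in that class
    weights = {label: round(avg_weight / n, 2) for label, n in counts.items()}
    print(weights)  # {'color change': 0.77, 'stable': 0.32, 'sedimentation': 5.19, ...}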

Chapter 3

Traditional deep learning approach

In the previous chapter, we evaluated the dataset and chose a representation for
the data. We also covered the difficulties of this dataset and discussed how it can
be used in machine learning applications.
In this chapter, we will give the necessary overview of neural networks used for
image processing and how these can be used to solve the classification problem as
discussed in 1.1.
Deep neural networks have recently shown far better performance for image
processing than handcrafted approaches, because they don't rely on a predefined
feature representation. Instead, they work by passing an input through a network
of weights to produce an output. The weights are then updated for every sample
so that the network fits the dataset best. This allows the network to learn more
complex feature representations of the data.
In 1998 LeCun [4] showed the first implementation of a multi-layer neural network
performing such gradient-based learning on text recognition. Many of the techniques
used there later became key technologies for modeling convolutional neural
networks. In 2012 Alex Krizhevsky et al. presented convolutional neural networks
(CNN) for image processing[5]. This work showed the first example of a
convolutional neural network outperforming every traditional neural network on
image classification problems. Section 3.1 will describe convolutional neural
networks in more detail.
Going into the mathematical details or implementations of all these networks would
go beyond the scope of this thesis as these are implemented by specialized machine
learning libraries (PyTorch, Tensorflow, Caffe...). Instead, this chapter will try to
give a general explanation about the building blocks for a convolutional neural
network and models proposed to solve different kinds of learning problems.

3.1 Convolutional neural network
In traditional neural networks, image processing was usually achieved by having a
set of neurons as input layer, each taking one pixel of the image as input[6].
Afterwards, this input gets processed by fully connected hidden layers and combined
into a result. The limitations of such a network are that it doesn't scale well to
bigger images and that neurons in the same layer work independently from one
another without sharing connections. Convolutional neural networks work around
this problem by using 3-dimensional layers (width, height, and depth) [7].

Using this architecture, neurons can be implemented as filters, so-called
convolutions. Multiple of those convolutions together create a convolutional layer
that produces a 3-dimensional result, where the depth is defined by the number of
filters applied on that layer. The width and height of this output depend on the
size of the convolution filters, the stride and the padding used in the layer. More
on this in the next section.
In most cases, the goal of a convolution layer is to expand the input into an output
with a higher depth dimension to extract more advanced features, though reducing
the dimensions can also be done. A convolution layer is most commonly followed by
either a pooling layer or an activation layer. Pooling layers are generally used to
reduce the dimensions of the result that was generated by the convolutional layer.
One of the most commonly used pooling techniques is max pooling. More on this
topic in section 3.1.2.
The purpose of the activation layer is to add non-linearity by applying an
activation function to the weighted result of each filter in the previous layer.
The activation function should be a non-linear function (Sigmoid, Tanh, ReLU,...).
These will be discussed in section 3.1.3.

3.1.1 Convolution layer


Convolution is a mathematical operation originating from signal theory. Its goal
is to combine multiple values together[6], where each value has a weight stating
how strongly it should contribute to the combined result. This is the main
process in convolutional neural networks, where we are looking for the weights
that represent a concept.
In neural networks, convolutions are defined by their filter size and a stride. The
filter slides both horizontally and vertically over the image to generate an
activation map stating how strongly the filter is triggered at each point. The stride
of the filter defines the steps it should skip when moving horizontally or vertically
over the image. The size of the output activation map is defined as follows:

D = (W − F + 2P)/S + 1

Where D is the dimension of the output activation map, W the input dimension
(width or height), F the filter size and S the stride. The dimension of the output
can be manipulated by adding zero padding around the borders of the input; P
states the amount of padding added around the border. It is common practice to
make each filter the same size within a convolutional layer. A convolutional layer
returns the results of all of these filters as a 3-dimensional matrix, where the depth
is defined by the number of filters used on that layer and the initial depth of the
input.
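As an illustration (our own example, not taken from the thesis), the formula can be checked in code and compared against an actual convolutional layer in PyTorch:

    import torch
    import torch.nn as nn

    def conv_output_size(w: int, f: int, p: int, s: int) -> int:
        """D = (W - F + 2P)/S + 1 for one spatial dimension."""
        return (w - f + 2 * p) // s + 1

    # a 3x3 convolution with stride 1 and padding 1 keeps a 64x64 image at 64x64
    print(conv_output_size(64, 3, 1, 1))  # 64

    # the same layer in PyTorch: 3 input channels, 16 filters
    layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
    x = torch.randn(1, 3, 64, 64)          # batch of one RGB image
    print(layer(x).shape)                   # torch.Size([1, 16, 64, 64])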

3.1.2 Pooling layer


It is common practice to use a pooling layer in between two convolutional layers.
The pooling layer is responsible for down-sampling the input dimensions in order
to control over-fitting and the amount of computation in the network. Pooling
layers are usually implemented as a 2x2 filter with stride 2. The pooling layer
takes a matrix as input with height H1, width W1, and depth D1 and produces an
output with dimensions W2 x H2 x D2. The output dimensions are defined by the
following formulas.

W2 = (W1 − F)/S + 1
H2 = (H1 − F)/S + 1
D2 = D1

Where F is the filter size and S the stride of the pooling layer. With a filter size
of 2 and a stride of 2, it is easy to see that the original width and height will be
cut in half. The pooling layer works on the full depth of the input, so the output
depth stays the same as the input depth.

Max-Pooling Max-pooling is one of the most commonly used techniques to
implement pooling and makes use of a 2x2 filter with stride 2. It works by keeping
the maximum value out of the 4 values the filter covers. These are returned in a
new matrix with half the height and width of its input, keeping only the highest
values.

3.1.3 Activation layer


In 1959 David H. Hubel studied the working of the visual cortex of the brain by
showing different imagery to a cat and looking at the response in the visual cortex
[8]. He discovered that neurons in the brain have a hierarchical structure, where
initially activated neurons process simpler features such as light intensity,
followed by neurons for more advanced features such as movement and orientation.
Activation functions are based on this principle, replicating neurons being
triggered. A neuron accepts a vector x of inputs and computes the weighted sum
using the following formula.

z = Σi Wi · Xi + b

Where Xi and Wi are the values found at the i'th position of the input vector and
the weights vector, and b is the bias added to the result. The result z gets passed
to an activation function that, based on z, decides whether the neuron gets
triggered or not. A good choice of activation function can dramatically increase
the learning speed of the network. ReLU, or rectified linear unit, is often named
as the activation function to use for convolutional neural networks[6][7]. This is
because it has been shown that a ReLU function converges roughly 6 times faster
than Tanh or Sigmoid [5]. The ReLU function also has a very low computational
cost as it only relies on a max operation.
The function is rather simple, as it sets all negative inputs x to zero and passes
all positive values unchanged. The advantage of this is that the gradient does not
saturate for positive inputs. A problem can occur when the input is negative: the
gradient is zero and the neuron stops learning. To address this problem, multiple
adaptations of the ReLU function have been proposed, for example leaky ReLU and
exponential ReLU. These alternatives create a small slope on the negative side of
the axis to improve learning on negative input.
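To tie the three layer types together, here is a minimal PyTorch sketch of one conv-pool-activation block of the kind described above; the channel sizes are arbitrary, chosen only for illustration:

    import torch
    import torch.nn as nn

    # one building block: convolution -> ReLU activation -> 2x2 max pooling
    block = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # 64x64 -> 64x64, depth 16
        nn.ReLU(),                                                            # non-linearity
        nn.MaxPool2d(kernel_size=2, stride=2),                                # 64x64 -> 32x32, depth unchanged
    )

    x = torch.randn(1, 1, 64, 64)   # one single-channel 64x64 image
    print(block(x).shape)           # torch.Size([1, 16, 32, 32])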

3.1.4 Fully connected layer


Fully connected layers are used at the end of most traditional convolutional neural
networks. Their task is to reduce the space to a single vector, stating a score for
each classification. This is achieved by having a set of neurons that each take all
activations from the previous layer as input. The number of neurons defines how
many classifications the layer returns. Fully connected layers are known to heavily
increase the number of parameters and thus the memory usage of the network.
For this reason, these layers have fallen out of favour in more recent works [9][10].

3.2 Loss functions


So far we have discussed how a network architecture can be constructed to process
images and return an output. However, just building the network is not enough:
before the network can start to improve itself, it must first know how well it is
doing already.
The performance of a network is measured by a loss function. This function returns
a score stating how well the network fits the training data. The network is then
trained to minimize this loss, and thus the error it makes on the training data.
The choice of loss function largely determines what the network is going to learn.
That is why many different loss functions exist for different machine learning
problems. For example, cross-entropy loss is most commonly used for classification
problems.

3.3 Auto Encoder


Convolutional neural networks cannot only be trained to produce an output
classification; different combinations of networks can also be used for different
purposes. In autoencoder networks, the goal is not to learn a classification but
rather to learn an efficient encoding of the data.
Learning such an encoding has obvious uses like reducing bandwidth usage, but it is
also used in unsupervised learning to let a network learn a feature representation
of the data. The feature representation can then be studied to learn to classify
different samples.
Autoencoder networks consist of 2 different networks, the encoder and the decoder,
that are trained together. The encoder uses multiple convolutional and pooling
layers to reduce the input to a small encoding. The decoder then uses multiple
transposed convolutions to decode the encoding back to the original input.
The loss of this network is the amount of difference between the input and the
reconstructed input after encoding and decoding it. Often the mean squared error
is used as loss metric, but many other metrics exist.
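A minimal convolutional autoencoder along these lines could look as follows; this is our own illustrative sketch, not the thesis implementation:

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            # encoder: two conv + pool stages, 64x64x1 -> 16x16x8
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            # decoder: transposed convolutions back to 64x64x1
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(8, 16, 2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = AutoEncoder()
    x = torch.randn(4, 1, 64, 64)
    loss = nn.MSELoss()(model(x), x)   # reconstruction error used as training signal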

Chapter 4

Implementation traditional
approach

In chapter 2 we discussed how we could prepare the dataset to be used in deep
learning applications. The dataset contained very few labeled data for certain
categories. The last chapter introduced convolutional neural networks and described
how these can be used to build an autoencoder.
This chapter will cover a special kind of encoder network, called the Siamese
Network. This type of network was popularised by the 2014 paper ”DeepFace:
Closing the Gap to Human-Level Performance in Face Verification”[2]. With ”the
gap to human-level performance” is meant that a human does not need a lot of
examples in order to distinguish 2 persons. Their approach is called one-shot
learning and has mostly been applied to face recognition. The goal of the network
is to learn to encode similar samples close to each other, while dissimilar samples
should be encoded with a large distance between their encodings.
For this we used the triple dataset discussed in chapter 2, where the binary labeling
states whether the samples are similar or not. The distance between 2 encodings
is defined as the Euclidean distance between the two output vectors and is also
called the dissimilarity score. In our case, a high dissimilarity score means that a
failure has occurred, while a low score indicates that the sample remained stable. In
this approach, the network is merely trained to distinguish stable sequences from
unstable ones and not to give an exact classification. This was done to give an
initial idea about the suitability of such a method on a simpler classification
problem.

4.1 Network architecture
The architecture of the Siamese Network used in this chapter consists of 2 parts.
The first part is the encoder network of an autoencoder. The encoder consists of 3
convolutions and 3 max-pooling layers. These layers first increase the depth of the
input and then downscale it to a small encoding of the image. The second part of
the network uses 3 fully connected layers to flatten the output and reduce the
encoding to a vector of size 5. This allows us to easily compute the distance
between 2 encodings using the Euclidean distance.
layer        size   stride   padding   output size
conv1        3      3        1         16x10x10
max pool1    2      2        0         32x5x5
conv2        3      2        1         48x3x3
max pool2    2      1        0         48x2x2
conv3        3      1        1         48x2x2
max pool3    2      2        0         48x1x1

layer   number of inputs   number of outputs
FC1     48                 16
FC2     16                 8
FC3     8                  5

The network has been kept intentionally small to reduce the chance of overfitting
and because deeper networks require more data to be trained correctly. Dropout
and batch normalisation were also applied in between the convolutional layers to
reduce overfitting. The network is not trained to produce an output classification
but rather to differentiate between 2 classifications. For this reason a special loss
function was used: contrastive loss [11]. In its standard form, for a pair with
binary label Y (1 for similar, stable pairs) it is

L = Y · Dw² / 2 + (1 − Y) · max(0, m − Dw)² / 2

where Dw is the Euclidean distance between the two encodings and m a margin
below which dissimilar pairs contribute to the loss. In our network the chosen
margin was 1.
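In code, this loss could be written as the following PyTorch sketch (our own illustration, following the label convention above rather than the thesis implementation):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(enc1, enc2, label, margin: float = 1.0):
        """label = 1 for similar (stable) pairs, 0 for pairs showing a change."""
        dw = F.pairwise_distance(enc1, enc2)                      # Euclidean distance
        similar_term = label * dw.pow(2)                          # pull stable pairs together
        dissimilar_term = (1 - label) * torch.clamp(margin - dw, min=0).pow(2)  # push changed pairs apart
        return 0.5 * (similar_term + dissimilar_term).mean()

    # usage with two batches of size-5 encodings and binary labels
    enc1, enc2 = torch.randn(8, 5), torch.randn(8, 5)
    labels = torch.randint(0, 2, (8,)).float()
    loss = contrastive_loss(enc1, enc2, labels)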

4.2 Training
The Siamese Network is trained by passing the first and the last image of a
sequence separately through the network. Then the Euclidean distance between the
2 output encodings is computed. The network is trained to produce similar encodings
if the sequence remained stable and very different encodings otherwise.
Note that no comparisons were made between different failure sequences, in order
to keep the initial classification problem simple.
The network was trained using the triple dataset of 6500 samples from chapter 2.
This dataset was split into a training set of 6100 and a testing set of 400.

4.3 Results
Since a Siamese Network does not output a classification but rather a distance
between 2 encodings, no accuracy score can be given without choosing a threshold
from which we view 2 samples as distinct enough. Here we chose to display the
validation set of 400 in a scatter plot to illustrate the problems with this approach.

Even on the simplified classification problem, a high number of stable samples
were still misclassified with a high dissimilarity score. In section 2.3 we discussed
how sedimentation and creaming only show a very small difference. We assume that
this is partially to blame for the unsatisfying accuracy.
This also gives a potential explanation for the network's inability to learn the
problem: we trained the encoder with 2 conflicting goals. The encoder should
encode the first and last image of a sequence very similarly whenever the sequence
remained stable, but whenever sedimentation or creaming occurred the first and
last image are also almost identical. Still, we trained the network to encode those
very differently. This makes the learning problem incredibly difficult to optimize.
That is why the next chapter will cover another approach, which uses generative
methods to identify changes.

Chapter 5

Generative approach

In chapter 2 we discussed the difficulties of the dataset, one of the biggest being
the underrepresentation of certain failures. Chapter 3 discussed the key concepts
of deep learning and the building blocks of convolutional neural networks. In the
last chapter, we attempted to use these basic building blocks to quantify the amount
of change and discovered that our approach was not well suited for this purpose.
In this chapter, we will discuss a different kind of CNN architecture that uses a
generative approach. This architecture can be trained in a semi-supervised way,
allowing us to use not only the small set of labeled data but the entire collection
of 1.2 million images for unsupervised training.
In such generative approaches, the network is asked to produce new samples to
better learn the representation of the dataset. This chapter limits itself to
discussing generative adversarial networks (GANs), while chapter 6 will cover
the actual implementation and results. Other popular models such as Restricted
Boltzmann Machines (RBMs) and Variational Auto-Encoders (VAEs) exist, but
have been shown to capture far less detailed representations than GANs[12].
GANs were introduced in 2014 by Ian Goodfellow et al. [1] as an implicit
probabilistic model to learn a representation of the data. This is achieved by
training 2 different networks: the discriminator and the generator. The
discriminator is trained to identify whether its input is a real sample or an
artificially produced one. The generator takes a vector of noise as input and is
trained to produce samples that cause the discriminator to classify them as real.
These 2 networks play out a min-max game, as their objective is to make the other
network fail. When the generator starts making the discriminator fail, the
discriminator will start improving, and the other way around. The advantage of this
method is that the loss doesn't converge too quickly.
In 2016 Alec Radford et al. [13] presented an improved version of the original
GAN architecture called Deep Convolutional Generative Adversarial Networks
(DCGAN). Here new guidelines were given to better design and train GANs. These
will be further discussed in section 5.2.1.
Although GANs can learn very detailed feature representations of their data, they
can be very difficult to train. One of the biggest difficulties lies in the balance
of power between the generator and the discriminator: an imbalance can cause
the networks to never converge to a minimum. A vanishing gradient can also occur
when the discriminator becomes too good. This happens when the loss of the
discriminator falls to 0 and no more optimization can happen.
One of the most mentioned problems when training GANs is called mode collapse and
is a direct consequence of the way the adversarial loss is defined. It happens when
many noise inputs are mapped to the same output and the generator is no longer
improving.
Researchers are currently striving towards building a fundamental model of GANs
that is not vulnerable to these problems [12]. In the meantime, many hacks have
been proposed to stabilize the training of GANs while better methods are still
being researched[14, 15, 16]. Many of these hacks revolve around adding noise to
both networks either on the input or via dropout layers.
Ever since the framework for generative adversarial training was presented, many
new architectures have been proposed for different purposes. One of the most
obvious extensions, presented in 2014, is called conditional GANs[17]. In this
architecture, a label is added to the input of both the generator and the
discriminator in order to specify the kind of samples that need to be generated.
This also allows the discriminator to be trained conditionally, judging whether
samples look real or fake conditioned on some label.
In 2017 Jun-Yan Zhu et al. [18] presented a new architecture called Cycle GAN.
Their architecture makes major modifications to the original architecture in order
to learn a one to one mapping between 2 sets. Whereas regular GANs are used to
learn a mapping from input noise to data, Cycle GAN is used to learn a one-to-one
mapping from input to output. This is done by introducing a new loss to enforce
this mapping. A more detailed description of these and other architectures can be
found in section 5.2.

5.1 Generative adversarial networks


In the original definition of GANs, as described by Goodfellow et al. [1], the
generator G is defined as a multilayer perceptron taking as input noise z drawn
from a prior distribution; using this noise it learns a mapping G(z; θg) to the data
space. The discriminator D is also defined as a multilayer perceptron that takes an
input x and produces a probability D(x) that this input came from the dataset
instead of being generated by G. The discriminator is trained to maximize the
probability of assigning the correct label, giving a result near zero on fake
examples and a probability near 1 for real samples. When given real data, the
discriminator is trained to maximise the following objective.

LossD(x) = log(D(x))

Meanwhile, the discriminator is also expected to minimise the output probability
when given a fake example, by maximising

LossD2(G(z)) = log(1 − D(G(z)))

The generator, on the other hand, is trained to maximize the probability D produces
when given a fake sample, i.e. to minimise

LossG(z) = log(1 − D(G(z)))

In theory, these 2 networks play out a min-max game until they settle on a Nash
equilibrium. However, in reality, each model updates its weights independently of
the other, which doesn't allow any cooperation. Salimans et al. discussed this
problem in 2016, showing that updating gradients in such a concurrent way cannot
guarantee convergence to a minimum[3]. Section 5.2 will cover potential solutions
proposed to solve these convergence problems.
A trained GAN can be used for 2 different purposes: if the goal is just to produce
new samples, then it is sufficient to only keep the trained generator, while if the
goal is to create a classifier, then it is sufficient to just keep the trained
discriminator.
In chapter 6 the goal is to create a good discriminator used to classify the changes
in sequences.
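The adversarial objectives above translate into a training loop roughly like the following PyTorch sketch (an illustrative minimum of our own, assuming generator and discriminator modules G and D with a sigmoid probability output are already defined):

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()  # binary cross-entropy on D's probability output

    def gan_step(G, D, opt_g, opt_d, real, noise_dim=100):
        batch = real.size(0)
        ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

        # discriminator step: push D(real) towards 1 and D(G(z)) towards 0
        opt_d.zero_grad()
        fake = G(torch.randn(batch, noise_dim)).detach()
        loss_d = bce(D(real), ones) + bce(D(fake), zeros)
        loss_d.backward()
        opt_d.step()

        # generator step: push D(G(z)) towards 1 (the common non-saturating variant)
        opt_g.zero_grad()
        fake = G(torch.randn(batch, noise_dim))
        loss_g = bce(D(fake), ones)
        loss_g.backward()
        opt_g.step()
        return loss_d.item(), loss_g.item()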

5.2 Improvements
While the architecture discussed in the previous section showed the huge potential
of GANs, adoption was slow as little was known about how to design such networks.
This section will cover different improvements and guidelines that have been
proposed to better design GANs.

5.2.1 DCGAN
In 2016 Alec Radford et al. [13] presented the first GAN that used only
convolutions and was specialized for image processing. Multiple design choices
were explained that later became guidelines for designing deeper convolutional
GANs. Their network eliminated all fully connected and pooling layers and replaced
them with convolutional layers.

They also showed that the best stability was achieved when using LeakyReLU across
all layers of the discriminator. They claimed that following their guidelines
results in a stable architecture.

5.2.2 Cycle GAN


In 2017 Jun-Yan Zhu et al.[18] presented a new way of training generative
adversarial networks called Cycle GAN. While all generative architectures discussed
before tried to learn a mapping from noise to data, in Cycle GAN the goal is to
learn a mapping between 2 sets of data A and B. In their work this was for example
used to create a mapping from satellite images to roadmaps.
Their approach also helps to solve the problem of mode collapse, where all
generator inputs produce the same output, because the architecture forces every
input to map to a different output.
Cycle GAN does this by defining 2 different generators: the first generator Gab
takes as input an image from set A (call this realA) and produces a transformation
to an image from B; we call this result fakeB. The second generator Gba is then
given the fakeB produced by Gab and trained to output a reconstruction as close
as possible to realA. To enforce this mapping, a new loss was introduced called the
cycle consistency loss. In this loss, the reconstructed input is compared against
the real input; the loss reaches zero whenever the recovered input becomes the
same as the real input. By adding this loss to the generator loss, the generator is
forced to learn a one-to-one mapping between the input set A and the output set
B.
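Written out, the cycle consistency term simply compares realA with Gba(Gab(realA)). A sketch in PyTorch, using our own notation and a weight value that is a common choice rather than one stated in the thesis:

    import torch
    import torch.nn.functional as F

    def cycle_consistency_loss(real_a, g_ab, g_ba, weight: float = 10.0):
        """L1 distance between realA and its reconstruction Gba(Gab(realA))."""
        fake_b = g_ab(real_a)            # translate A -> B
        recovered_a = g_ba(fake_b)       # translate back B -> A
        return weight * F.l1_loss(recovered_a, real_a)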
In 2018 Amjad Almahairi et al. expanded this idea by making the generator learn
a many-to-many mapping[19]. Here they propose to add a noise variable as extra
input to the generator. They also propose using an encoder to learn a mapping
from the output image to the given noise. The encoder loss is added to the gener-
ator to force a correlation between the output image and the input noise. Without
the use of such encoder, the generator would just learn to ignore the noise.
This allows the generator to produce more diverse outputs for a single input by
changing this noise variable, creating a many-to-many mapping.

5.3 Semi-supervised learning with GANs


Traditional GAN architectures as discussed in section 5.1 were only used for
unsupervised training: they were given a dataset and were only trained to classify
whether a sample is real or not. In the 2016 paper ”Improved Techniques for
Training GANs”, Salimans et al. presented a semi-supervised way of training GANs
[3]. In their approach, 2 different datasets are used for training, one being
labelled while the other contains no labels.
The unlabeled dataset is used to train the generator to produce a wide variety of
new samples. The biggest change lies in the discriminator, which is given an extra
output via the use of fully connected layers. This extra output assigns one of k+1
possible classifications to the sample, where k is the number of classes in the
problem; an extra class is added for generated samples. This allows the network to
improve on the classification problem by using the samples that were generated.
Both a supervised and an unsupervised loss can then be computed over these 2
outputs and added together. Using their technique, they were able to achieve state
of the art results in semi-supervised learning.
They also present new guidelines that proved to reduce the instability problems
when training GANs.

Chapter 6

Implementation generative
approach

In the last chapter, we gave a general description of GANs and discussed different
approaches in order to solve the initial instability problems GANs suffer from. We
also discussed how GANs can be used in a semi-supervised way to solve classifica-
tion problems.
In this chapter, we discuss the implementation of a semi-supervised approach used
to solve the original classification problem, where the goal is to classify sequences
of images containing fluids. In this chapter we won’t be evaluating the quality of
the generated results, as these will be discussed in chapter 7. Instead, this chapter
will present the classification results after performing cross-validation.
The labeled dataset used in this chapter is the dataset of 1036 labeled samples as
discussed in chapter 2. The unsupervised set used in this chapter is constructed
by taking 4 comparisons per sequence for 1000 sequences. This makes the unsu-
pervised set a total of 4000 samples.

6.1 Implementation
In the original formulation of GANs, the generator learns a mapping from a noise
vector z to some data. Since samples in our dataset consist of 2 images and a
label, we can choose between 2 different representations for the generator to
produce.
In the first representation, the generator produces both images as output given
a noise vector z. This allows us to better judge the quality of the output of the
generator, but forces us to use far more parameters, which makes it more difficult
to train. We found in our experiments that the generator would take significantly
longer to learn the relationship between the two outputs. This caused the
discriminator to learn much faster and a vanishing gradient to occur.
Another option would be to have the generator produce the pre-processed difference
between the two images. By doing this, we make the network focus only on the
important features, but it also makes the network very vulnerable to mode collapse,
which indeed happened in our experiments.
In order to find a middle ground between these 2 approaches, some concepts from
Cycle GAN were adopted. In Cycle GAN the goal is no longer to learn a mapping
from noise to data, but rather to create a mapping from images of one set to images
of another set. We adopt this by using the first image of a sequence as set A and
the last image as set B. Then we have our generator learn a mapping from an input
image out of set A to a corresponding image out of set B. The generator produces
an output fake1 that is then given to the discriminator along with the original
input image. The discriminator then judges whether the absolute difference between
the input and generated image looks like a real change.

realDiff = max(img1 − img2, img2 − img1)

fakeDiff = max(img1 − fake1, fake1 − img1)
The discriminator is then trained both supervised and unsupervised. For the
unsupervised training, both realDiff and fakeDiff are passed through the network.
The discriminator then outputs the probability of the input being real or fake and
also a classification label. The standard discriminator loss can be computed over
the two output probabilities; this is used as the unsupervised loss.
While the original discriminator as discussed in section 5.1 only had a single binary
output to classify real or fake, the discriminator in this chapter also produces a
second output. This output is constructed via fully connected layers and returns
one of 6 output classifications: 5 classifications originate from the given problem,
while another label is added to indicate a fake example. The supervised loss is then
computed by taking the cross-entropy loss over the output label.
The total loss of the discriminator is then defined as the sum of the supervised
and unsupervised loss.
The generator is trained to minimize the difference between the discriminator's
output for real samples and for generated samples. Training against an earlier
layer of the discriminator in this way was shown to improve semi-supervised
training[3].
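A compressed sketch of how the two discriminator losses could be combined in PyTorch (our own illustration of the scheme described above, with hypothetical tensor names; the real-vs-fake probability and the 6-way label are assumed to come from two heads of the discriminator):

    import torch
    import torch.nn as nn

    # class weights from section 2.5, plus an assumed weight of 1.0 for the "fake" class
    class_weights = torch.tensor([0.77, 0.32, 5.19, 3.92, 7.41, 1.0])

    bce = nn.BCELoss()
    ce = nn.CrossEntropyLoss(weight=class_weights)

    def discriminator_loss(d, real_diff, fake_diff, labels):
        """labels: class indices 0-4 for the real, labelled difference images."""
        p_real, logits_real = d(real_diff)   # (probability real, 6-way class logits)
        p_fake, logits_fake = d(fake_diff)

        # unsupervised part: distinguish real differences from generated ones
        unsup = bce(p_real, torch.ones_like(p_real)) + bce(p_fake, torch.zeros_like(p_fake))

        # supervised part: cross-entropy on the labelled samples, plus the extra
        # "generated" class (index 5) for the fake differences
        fake_labels = torch.full((fake_diff.size(0),), 5, dtype=torch.long)
        sup = ce(logits_real, labels) + ce(logits_fake, fake_labels)
        return sup + unsup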

6.1.1 The architecture


The network architecture used in this chapter was implemented in PyTorch and
started off from an example network built for the MNIST handwritten digit
dataset[20]. The generator in this network was replaced by the one used in Cycle
GAN[21]. The generator starts by downsampling the input via convolutional layers.
The result is then passed through 9 residual units and finally upsampled again.
The discriminator is a simple convolutional neural network consisting of 5
convolutional layers; each layer halves the width and height of its input.

layer                size   stride   padding   output size
conv1 + LeakyReLU    4      2        1         32x32x8
conv2 + LeakyReLU    4      2        1         16x16x16
conv3 + LeakyReLU    4      2        1         8x8x32
conv4 + LeakyReLU    4      2        1         4x4x36
conv5 + LeakyReLU    3      1        1         4x4x10

layer                       number of inputs   number of outputs
FC1 (output label)          160                10
FC2 (output probability)    10                 5

The discriminator was also inspired by Cycle GAN but has been modified to output
a classification label. The network size was also reduced, since this gave about
the same results. The network consists of 5 convolutional layers, followed by 2
fully connected layers that produce the output probability and classification label.

6.2 Training
The discriminator is trained on both a supervised and an unsupervised dataset,
to correctly distinguish real samples from generated ones and also to predict the
corresponding label. For the supervised dataset, we used the newly labeled dataset
discussed in chapter 2, which provided 1036 labeled sequences. The generator is
only trained unsupervised, using the loss computed from the output of the
discriminator.
The images are first cropped from the center to 64x64 in order to reduce the size
of the network and also to only include changes that occurred inside the fluid.
This cropped image is given to the generator to generate the last image of that
sequence. Then the absolute difference is taken between the real input and the
generated output. The resulting image shows the relationship between the 2 images.
This image is then given to the discriminator to distinguish between real
relationships in the dataset and generated ones. Doing this indirectly forces the
generated image to remain similar enough to the input image while only showing
realistic changes. Both networks were trained with a constant learning rate of
0.002 and optimized using the Adam optimizer as suggested by Salimans et al. [3].

6.3 Results
The accuracy of this network was computed using cross-validation by splitting the
dataset into 11 subsets: 10 subsets of size 100 and 1 subset of size 36. These were
trained separately, starting from randomly initialised weights, for 20 epochs each.
In section 2.5 we presented weights that can be used for dealing with unbalanced
datasets; these weights are used in the supervised loss to balance the different
classifications. The results can be seen below in the confusion matrix, which
displays which categories show the most errors.
confusion matrix   color change   stable   sedimentation   splitting   creaming
color change       89.6%          7.8%     1.4%            0%          1.1%
stable             7.3%           90.3%    0.7%            1.7%        0%
sedimentation      20%            30%      50%             0%          0%
splitting          9.5%           7.5%     0%              83%         0%
creaming           0%             34.5%    0%              0%          65.5%
We can see that stable and color change are classified very accurately, and that
most mistakes of the other categories fall into those two classes. This can be
partially blamed on the manual labelling, as color changes are not always spotted.
Sedimentation and creaming, on the other hand, show a low accuracy. This is mostly
because these changes are the hardest to spot of all the failures, but also because
there were very few samples in the training set showing these failures.
This can be seen in the confusion matrix: samples labeled as creaming are only
misclassified as stable, because they appear very similar. The same goes for
sedimentation, which is misclassified as either stable or color change.

6.4 Prediction with different inputs


In this section we perform an experiment to see whether the network is able to
correctly classify a change when it is not given the last image but an earlier one.
This experiment is done using the same cross-validation as in section 6.3. In this
experiment we trained and tested using the first image of the experiment and the
one halfway through the experiment.
The table below displays how the accuracy drops as we try to make the prediction
earlier.

classification   accuracy (first-last)   accuracy (first-middle)
color change     89.6%                   62.86%
stable           90.3%                   83%
sedimentation    50%                     27.5%
splitting        83%                     81.13%
creaming         65.5%                   17.24%

We see that accuracy is lower for every classification when the network is trained
to predict changes earlier. Splitting and stable still remain quite accurate, since
these are also visible halfway through the experiment. But every other
classification that is less visible in earlier stages has a significantly lower
accuracy.
In the next chapter we will discuss an alternative way to make earlier predictions,
by generating the final image of an experiment.

Chapter 7

Improving predictions

In the last chapter, we tried to classify the final state of a sample by only looking
at the first and last picture of the sequence. We discussed the accuracy of this
approach and tried to make earlier predictions by using the first and middle image
of every sequence to train and test. This method still relies on the assumption
that all samples are labeled correctly. This cannot be guaranteed, since the labels
only describe the change between the first and the last image, which does not
necessarily mean the change is already visible in earlier images. The labels were
also created manually, without much knowledge in the field of chemistry.
In this chapter we discuss another approach to make earlier predictions that does
not rely on labeled data. In this approach, we will not try to classify the change
early but rather predict how the sample will look at a later time. In section 5.2.2
we evaluated Cycle GAN. This architecture is used to create a mapping between 2
sets of images. A prediction can also be seen as a mapping from one or more
previous images to one in the future.
The detailed results of Cycle GAN triggered a lot of research into architectures
incorporating this idea. Pix2Pix[22] is a closely related image-to-image translation
architecture from the same research group. Here they proposed using conditional
GANs and making the generator not only produce realistic looking samples, but
also bad looking samples to better train the discriminator. They also don't let
the result be judged only by the discriminator, but include an L1 loss with the
true output as well. This caused their network to produce more detailed results
than the original Cycle GAN did. Section 7.2 will cover the results when using the
Pix2Pix architecture to predict the final image of a sequence.

7.1 Predicting future appearance
Predicting the changes in fluids may be a challenging task because fluids that look
similar do not necessarily react in the same way. It is clear that if we want to
predict the future appearance of a sample, we need to give more information than
just the first image.
In this section, we compare the predictions obtained when training the Pix2Pix
architecture to predict the final appearance of a sample. First, we show the
predictions when the generator is only given the first image of a sequence. This
result is merely used as a baseline to compare the other approaches against.

7.1.1 Using the first and middle image


In the last section, we discussed how it is impossible to predict the future ap-
pearance of a fluid by only looking at a single image. In this section we try to
predict the final appearance of a sample by looking at 2 images: the first image of
a sequence and the one found in the middle, which for convenience we call Imgf
and Imgm. The generator is given Imgf and Imgm and is trained to predict the
final image of the sequence.
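One straightforward way to present both images to the generator is to stack them
along the channel dimension; the sketch below is only an assumed illustration of
this idea, with tensor shapes chosen for concreteness.

    import torch

    def make_generator_input(img_f, img_m):
        # img_f, img_m: tensors of shape (batch, 3, H, W) holding Imgf and Imgm.
        # Stacking along the channel dimension yields a (batch, 6, H, W) tensor,
        # so the generator's first layer must accept 6 input channels instead of 3.
        return torch.cat([img_f, img_m], dim=1)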
Section 7.2 displays the results of these predictions. We can see that the predic-
tions are much closer than those made by only looking at the first image. However,
the generator achieved this by ignoring the first image and only using Imgm to
make its prediction.
The next section will discuss another approach that uses the absolute difference
as discussed in chapter 2.

7.1.2 Using the absolute difference


In the last section, we tried using the first image of a sequence (Imgf) and the
one found in the middle (Imgm) to predict the last image. This approach did not
learn the relationship between the first and the last image: it made its predictions
by ignoring the first image and only looking at its other input.
In this section, we propose a different approach that uses the pre-processing
diff(x1, x2) that we discussed in chapter 2. This pre-processing takes the absolute
difference of 2 images, removing all pixel values the images have in common and
keeping only the difference between the two.
In this approach, we try to predict the final image by using the first image of a
sequence along with the result of diff(Imgf, Imgm). This forces the network to
look at both inputs, since both contain important information.
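As a rough illustration, this pre-processing can be written as a pixel-wise absolute
difference; the sketch below assumes both images are given as tensors of the same
shape and value range.

    import torch

    def diff(x1, x2):
        # Pixel-wise absolute difference: regions that are identical in both images
        # become (near) zero, so only the change between the two images remains.
        return torch.abs(x1 - x2)

    # The generator input then combines the first image with the observed change:
    # torch.cat([img_f, diff(img_f, img_m)], dim=1)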

In the next section, we discuss the results of this approach and determine whether
it indeed produces better predictions.

7.2 Results
This section will compare the prediction results of every approach we discussed in
this chapter. These results were computed by training an unmodified version of
the Pix2Pix architecture for 25 epochs for all approaches[23]. All network configu-
rations remained the same in all experiments; only the input size of the generator
was increased when 2 images were used instead of 1.
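Concretely, such a change only touches the first layer of the generator. The sketch
below shows what widening the first convolution to accept two stacked RGB images
might look like; the layer parameters are chosen for illustration and are not taken
from the cited implementation.

    import torch.nn as nn

    def first_generator_conv(num_input_images=2):
        # Two stacked RGB images give 3 * 2 = 6 input channels; every other layer
        # of the generator can stay exactly the same.
        return nn.Conv2d(in_channels=3 * num_input_images, out_channels=64,
                         kernel_size=4, stride=2, padding=1)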
In the first experiment, we trained the generator to predict the last image of a
sequence when it is only given the first image of that sequence.
In the second experiment, we gave the generator both the first image and the im-
age found in the middle of the sequence.
Finally, in the last experiment, we used the first image and the difference between
the first and the middle image. The figure below compares the predictions of these
approaches on 3 different inputs taken from the test set.

On the left we see the true data taken from the dataset. Imgf is the first image
of the sequence and Imgm the one found in the middle of the sequence. The last
image of the sequence is taken as ground truth.

On the right, we see the predictions of these 3 approaches on the given sequences.
These predictions were made on samples from the test set. We first evaluate the
method that uses only the first image to predict. As expected, this approach was
not able to predict much change.
The second approach uses Imgf and Imgm to make its predictions. We can see
that the predictions stay relatively close to Imgm, which shows that the network
is not actually learning the relationship between the first image and the last image;
instead, it only uses Imgm to make its predictions.
In section 7.1.2 we discussed another approach that no longer uses Imgm itself but
the difference between Imgf and Imgm. The results of this approach are displayed
on the far right. We can see that these predictions come far closer to the true
outcome.
This chapter does not report a distance metric with respect to the ground truth;
it is merely meant to give insight into the problem of predicting the future
appearance. We showed that it is very difficult, if not impossible, to predict the
future appearance using only a single image.
We also showed that far more accurate results can be achieved by using the abso-
lute difference as a second input, because it captures the relationship between the
first and the last image far better than giving an entire image does.

Chapter 8

Conclusion and future work

8.1 Conclusion
In this thesis, we tried to reduce the time needed for the experiments conducted by
chemical and manufacturing companies. This could be achieved both by automati-
cally identifying changes as they occur and by learning to predict changes earlier.
In chapter 2 we discussed the different classifications and went into more detail
about the dataset used to study the changes in fluids. We saw that the sequences
classified as sedimentation or creaming were heavily underrepresented in the
dataset. These failures are also very difficult to identify since they show only very
little visible change. This could be addressed by obtaining more, and more
detailed, images of the samples.
First we tried to create a system that could quantify the amount of change between
2 images of the same sample. For this, we used a Siamese network trained to
encode the first and last image of a sequence. In theory, the network should
produce very similar encodings for samples that remained stable, while any change
should cause the encodings to differ strongly.
This proved to be a very difficult problem, since changes in the camera angle can
also cause a sample to be encoded differently. It also relied on creaming and sedi-
mentation being encoded completely differently from stable samples; these changes
are barely visible and only a few samples showing them were available, which
caused the network to ignore them.
For this reason, we presented a generative approach to identify these failures.
Generative models are able to learn more detailed feature representations of the
data by producing new, realistic samples. We presented GANs as the architecture
for this generative approach.
In chapter 6 we trained such a GAN in a semi-supervised way to classify the correct
change for a given sequence. This approach offered better results on all classifica-
tions, though the accuracy when identifying creaming and sedimentation remained
quite low. This accuracy could be improved by using more detailed images or by
increasing the number of samples to learn from. We also saw that the classification
accuracy dropped considerably when we tried to predict changes earlier, because
these predictions still rely heavily on the labeling.
In chapter 7 we tried another approach to make earlier predictions, using a GAN
to predict the final image of a sequence. We compared 3 different approaches and
found that the best results are obtained by using the absolute-difference pre-
processing discussed in chapter 2, because it captures the relationship between
the first and the last image the best.

8.2 Future work


In chapter 6 we discussed a GAN that was trained to identify changes between
the first image of a sequence and the last image. The network showed reasonable
accuracy, with only 2 classes performing poorly. We used techniques for unbal-
anced datasets and pre-processing to increase the performance on those classes.
While these helped, their accuracy still remained significantly lower than that of
all other classifications.
We could try to improve this accuracy further by experimenting with other kinds
of pre-processing that make the changes more visible, though the most obvious
way to improve the accuracy is to use more samples for training.
While it is difficult to improve the accuracy much further without more data, we
can try to make predictions earlier in the sequence. In chapter 7 we tried to halve
the duration of the experiments by predicting the final image of a sequence. That
chapter only covered 3 different approaches on the same architecture and did not
give a distance-wise accuracy score.
In further experiments, we could evaluate these approaches by computing their
distance-wise accuracy with respect to the ground truth. We could then evaluate
whether it is possible to make earlier predictions by generating the last image and
then classifying the change.
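One simple distance-wise score would be the mean absolute pixel error between the
generated image and the ground truth over the test set; the sketch below is only
one possible metric, not something that was measured in this thesis.

    import torch

    def mean_l1_distance(pairs):
        # pairs: iterable of (generated, ground_truth) image tensors of equal shape.
        # Returns the mean absolute pixel difference, averaged over the test set.
        distances = [torch.mean(torch.abs(gen - true)).item() for gen, true in pairs]
        return sum(distances) / len(distances)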

Bibliography

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in
Neural Information Processing Systems, 2014.

[2] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-
shot image recognition,” 2015.

[3] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and
X. Chen, “Improved techniques for training gans,” in Advances in Neural
Information Processing Systems, 2016.

[4] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with
gradient-based learning,” in Shape, Contour and Grouping in Computer Vi-
sion, pp. 319–, Springer-Verlag, 1999.

[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with
deep convolutional neural networks,” 2012.

[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.

[7] “Materials in vessels data set.” http://cs231n.github.io/convolutional-
networks/. Accessed: 2017-11-22.

[8] D. Hubel and T. N. Wiesel, Brain and Visual Perception. Oxford University
Press, 2005.

[9] C. Szegedy, W. Liu, et al., “Going deeper with convolutions,” CoRR,
vol. abs/1409.4842, 2014.

[10] K. He, X. Zhang, et al., “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015.

[11] “One shot learning with siamese networks in pytorch.” https:
//hackernoon.com/one-shot-learning-with-siamese-networks-in-
pytorch-8ddaab10340e. Accessed: 2018-3-11.

[12] M. Arjovsky and L. Bottou, “Towards principled methods for training gener-
ative adversarial networks,” arXiv preprint arXiv:1701.04862, 2017.

[13] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learn-
ing with deep convolutional generative adversarial networks,” arXiv preprint
arXiv:1511.06434, 2015.

[14] M. Arjovsky and L. Bottou, “Towards principled methods for training gener-
ative adversarial networks,” 2017.

[15] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, “Stabilizing training of
generative adversarial networks through regularization,” 2017.

[16] T. White, “Sampling generative networks: Notes on a few effective tech-
niques,” 2016.

[17] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” 2014.

[18] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks,” 2017.

[19] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. C. Courville,
“Augmented cyclegan: Learning many-to-many mappings from unpaired
data,” 2018.

[20] “Improved gan (semi-supervised gan).” https://github.com/Sleepychord/
ImprovedGAN-pytorch. Accessed: 2018-4-5.

[21] “Cyclegan and pix2pix in pytorch.” https://github.com/junyanz/
pytorch-CycleGAN-and-pix2pix. Accessed: 2018-4-15.

[22] Q. Chen and V. Koltun, “Photographic image synthesis with cascaded re-
finement networks,” in IEEE International Conference on Computer Vision
(ICCV), 2017.

[23] “pix2pix.pytorch.” https://github.com/taey16/pix2pix.pytorch.
Accessed: 2018-7-4.
