
A REPORT

ON

“Time Series Forecasting Of Collection Data”

By

Harshit Garg - 2014A3PS0257G

At

Nucleus Software Exports Limited, Noida

A Practice School II Station of

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

(July-December, 2018)
A REPORT

ON

“Time Series Forecasting Of Collection Data”

By

Harshit Garg - 2014A3PS0257G – Electrical & Electronics Engineering

Prepared in partial fulfillment of the


Practice School II Course

At

Nucleus Software Exports Limited, Noida

A Practice School II Station of

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

(July-December, 2018)
ACKNOWLEDGEMENT
I would like to express my special gratitude to the Lending Analytics team, who chose me
to work on their project of implementing a neural network to forecast their collection data.

My mentors, Lakshmi Prasad sir and Prachi ma’am, guided me in learning Deep Learning concepts
and the Java language. I would like to thank Paramjeet ma’am for teaching me web concepts.

I express my deepest thanks to Pankaj sir (Vice President) and Parag Bhise sir (Senior Vice
President) for reviewing my project weekly and giving the necessary advice and guidance to make
the project more useful.

I would like to thank Ritu Arora ma’am for guiding me through the internship and reviewing our
project activity at regular intervals.

I see this opportunity as a big milestone in my career development. I will strive to use the
skills and knowledge I have gained in the best possible way, and I will continue to improve
them in order to attain my desired career objectives.

- Harshit Garg
2014A3PS0257G
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI
Practice School Division

Station: Nucleus Software Exports Ltd. Centre: Noida


Duration: 6 Months Date of Start: 4th July, 2018
Date of Submission: 9th December, 2018
Title of the Project: Time Series Forecasting Of Collection Data

Name of the student: Harshit Garg


ID Number: 2014A3PS0257G
Discipline: B.E. (Hons.) Electrical and Electronics Engineering
Supervisor: Lakshmi Prasad Sir
Designation: Group Product Manager
Contact No.: +919891715670
E-mail ID: lakshmi.prasad@nucleussoftware.com
PS Faculty: Dr. Ritu Arora
Project Area: Data Analytics

Abstract:-
Large amounts of past years’ collection data can be processed through a neural network to
produce forecasts for future periods. That is what this project is about: implementing a neural
network based on an LSTM model with the DL4J library to forecast multiple targets of the
Lending Analytics department’s collection data.
The project is built in Java. The work is divided into four parts: building the neural network
model, creating a dataset iterator file in Java, creating a plot utility file in Java, and creating
the data prediction file. The data is typically multivariate and the prediction involves multiple
targets.
The neural network model uses four layers: an input layer, an output layer and two hidden
layers. Three types of activation functions are used to filter the data accordingly.
The model is an LSTM model trained on past years’ data. With its two hidden layers and its
memory, the LSTM model learns the trend in the data and can forecast future years.
With my mentors’ aid I first analyzed passing univariate data into the neural network and
classifying the data into labeled groups, and then learned to pass multivariate data into the
model using DL4J commands.
The project ultimately aims to forecast any type of multivariate data, given sufficient examples
to train the neural network.

Signature of Student Signature of PS Faculty

Date: 09-12-2018 Date:


BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE
PILANI (RAJASTHAN)

Practice School Division

Response Option Sheet

Station: Nucleus Software Exports Limited Center: Noida


Name: Harshit Garg ID No.-2014A3PS0257G

Title of the Project: Time Series Forecasting Of Collection Data

Code No.  Response Option                                                                  Course No.(s) & Name

1.  A new course can be designed out of this project.                                      ----------
2.  The project can help modification of the course content of some of the existing
    courses.                                                                               ----------
3.  The project can be used directly in some of the existing Compulsory Discipline
    Courses (CDC) / Discipline Courses Other than Compulsory (DCOC) /                      BITS F312,
    Emerging Area (EA), etc. courses.                                                      Neural Networks and Fuzzy Logic
4.  The project can be used in preparatory courses like Analysis and Application
    Oriented Courses (AAOC) / Engineering Science (ES) / Technical Art (TA)
    and Core Courses.                                                                      ----------
5.  This project cannot come under any of the above mentioned options as it relates
    to the professional work of the host organization.                                     ----------

Signature of Student Signature of Faculty

Date: 09-12-2018 Date:


Table of Contents
Sr. No. Contents Page Numbers
1 Introduction 1
2 Activation Functions 2
3 Feedforward neural networks 3
4 Time Series & RNN 10
5 LSTM 14
6 DL4J Library 19
7 Forecasting the collection data 23
8 Sample Collection Data 24
9 Conclusion 26
Table of Figures

Sr. No. Description Page


Figure 1 A single neuron 1
Figure 2 Different Activation Functions 2
Figure 3 An example of feedforward neural network 3
Figure 4 A multilayer perceptron having one hidden layer 5
Figure 5 Forward propagation step in a multilayer perceptron 8
Figure 6 Backward propagation and weight updation step in a multilayer perceptron 8
Figure 7 The MLP network performing better on the same input 9
Figure 8 A two-layer neural network capable of calculating XOR 9
Figure 9 RNN & FFNN 11
Figure 10 Different types of RNN 12
Figure 11 The concept of forward propagation and backward propagation 12
Figure 12 Unrolled RNN 13
Figure 13 The repeating module in a standard RNN contains a single layer 14
Figure 14 The repeating module in a LSTM contains four interactive layers 15
Figure 15 Notations 15
Figure 16 Cell state of LSTM 15
Figure 17 Sigmoid connection 16
Figure 18 Forget gate of a LSTM 16
Figure 19 Input layer gate and tanh layer 17
Figure 20 Dropping unrequired information and adding new information 17
Figure 21 Output layer 18
Figure 22 Model Output 25
Introduction
A Single Neuron

The basic unit of computation in a neural network is the neuron, often called a node or unit. It
receives input from some other nodes, or from an external source and computes an output. Each
input has an associated weight (w), which is assigned on the basis of its relative importance to
other inputs. The node applies a function f (defined below) to the weighted sum of its inputs as
shown in Figure 1 below:

Figure 1: A single neuron

The above network takes numerical inputs X1 and X2 and has weights w1 and w2 associated with
those inputs. Additionally, there is another input 1 with weight b (called the Bias) associated with
it. The output Y from the neuron is computed as shown in the Figure 1. The function f is non-linear
and is called the Activation Function. The purpose of the activation function is to introduce
non-linearity into the output of a neuron. This is important because most real-world data is
non-linear and we want neurons to learn these non-linear representations.
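
As a concrete illustration, the computation performed by this single neuron can be written in a few
lines of Java. This is only an illustrative sketch; the input, weight and bias values below are made
up, and the sigmoid function (defined in the next section) is used as the activation f:

public class SingleNeuron {

    // Sigmoid activation: squashes the weighted sum into the range (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double x1 = 0.7, x2 = 0.3;   // example inputs (hypothetical values)
        double w1 = 0.4, w2 = 0.6;   // weights associated with the inputs
        double b = 0.1;              // bias: the weight attached to the constant input 1

        // Y = f(w1*X1 + w2*X2 + b), as shown in Figure 1
        double y = sigmoid(w1 * x1 + w2 * x2 + b);
        System.out.println("Neuron output Y = " + y);
    }
}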

An Artificial Neural Network (ANN) is a computational model that is inspired by the way
biological neural networks in the human brain process information. Artificial Neural Networks
have generated a lot of excitement in Machine Learning research and industry. Artificial
intelligence, cognitive modelling, and neural networks are information processing paradigms
inspired by the way biological neural systems process data. Artificial intelligence and cognitive
modeling try to simulate some properties of biological neural networks. In the artificial
intelligence field, artificial neural networks have been applied successfully to speech
recognition, image analysis and adaptive control, in order to construct software
agents or autonomous robots.

1
Activation Functions:-
Every activation function (or non-linearity) takes a single number and performs a certain fixed
mathematical operation on it. There are several activation functions you may encounter in practice:

 Sigmoid: takes a real-valued input and squashes it to range between 0 and 1

σ(x) = 1 / (1 + exp(−x))

 tanh: takes a real-valued input and squashes it to the range [-1, 1]

tanh(x) = 2σ(2x) − 1

 ReLU: ReLU stands for Rectified Linear Unit. It takes a real-valued input and thresholds
it at zero (replaces negative values with zero)

f(x) = max (0, x)

The figure below shows each of the above activation functions.

Figure 2: Different activation functions

Importance of Bias: The main function of Bias is to provide every node with a trainable constant
value (in addition to the normal inputs that the node receives).

Why do we use activation functions in neural networks?


An activation function determines the output of a node of the neural network, for example a
yes/no decision. It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending
upon the function). Activation functions can be linear or non-linear.
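
For illustration, the three activation functions above can be written directly in Java (a small
sketch, not part of the project code):

static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }  // output in (0, 1)
static double tanh(double x)    { return Math.tanh(x); }                // output in (-1, 1); equals 2*sigmoid(2x) - 1
static double relu(double x)    { return Math.max(0.0, x); }            // negative inputs are thresholded to zero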

2
Feedforward Neural Network
The feedforward neural network was the first and simplest type of artificial neural network
devised. It contains multiple neurons (nodes) arranged in layers. Nodes from adjacent layers
have connections or edges between them. All these connections have weights associated with
them.

An example of a feedforward neural network is shown in Figure 3.

Figure 3: An example of feedforward neural network

A feedforward neural network can consist of three types of nodes:

1. Input Nodes – The Input nodes provide information from the outside world to the network
and are together referred to as the “Input Layer”. No computation is performed in any of
the Input nodes – they just pass on the information to the hidden nodes.
2. Hidden Nodes – The Hidden nodes have no direct connection with the outside world
(hence the name “hidden”). They perform computations and transfer information from the
input nodes to the output nodes. A collection of hidden nodes forms a “Hidden Layer”.
While a feedforward network will only have a single input layer and a single output layer,
it can have zero or multiple Hidden Layers.
3. Output Nodes – The Output nodes are collectively referred to as the “Output Layer” and
are responsible for computations and transferring information from the network to the
outside world.

3
In a feedforward network, the information moves in only one direction – forward – from the input
nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in
the network (this property of feed forward networks is different from Recurrent Neural Networks
in which the connections between the nodes form a cycle).

Two examples of feedforward networks are given below:

1. Single Layer Perceptron – This is the simplest feedforward neural network and does not
contain any hidden layer.

2. Multi Layer Perceptron – A Multi Layer Perceptron has one or more hidden layers. Multi
Layer Perceptrons are discussed below, since they are more useful than Single Layer
Perceptrons for practical applications today.

Multi Layer Perceptron

A Multi Layer Perceptron (MLP) contains one or more hidden layers (apart from one input and
one output layer). While a single layer perceptron can only learn linear functions, a multi layer
perceptron can also learn non – linear functions.

Figure 4 shows a multi layer perceptron with a single hidden layer. Note that all connections have
weights associated with them, but only three weights (w0, w1, w2) are shown in the figure.

Input Layer: The Input layer has three nodes. The Bias node has a value of 1. The other two
nodes take X1 and X2 as external inputs (which are numerical values depending upon the input
dataset). As discussed above, no computation is performed in the Input layer, so the outputs from
nodes in the Input layer are 1, X1 and X2 respectively, which are fed into the Hidden Layer.

Hidden Layer: The Hidden layer also has three nodes with the Bias node having an output of
1. The output of the other two nodes in the Hidden layer depends on the outputs from the Input
layer (1, X1, X2) as well as the weights associated with the connections (edges). Figure 4 shows
the output calculation for one of the hidden nodes (highlighted). Similarly, the output from other
hidden node can be calculated. Remember that f refers to the activation function. These outputs
are then fed to the nodes in the Output layer.

4
Figure 4: A multi layer perceptron having one hidden layer

Output Layer: The Output layer has two nodes which take inputs from the Hidden layer and
perform similar computations as shown for the highlighted hidden node. The values calculated (Y1
and Y2) as a result of these computations act as outputs of the Multi Layer Perceptron.

Given a set of features X = (x1, x2, …) and a target y, a Multi Layer Perceptron can learn the
relationship between the features and the target, for either classification or regression.

Let’s take an example to understand Multi Layer Perceptrons better. Suppose we have the following
student-marks dataset:

5
The two input columns show the number of hours the student has studied and the mid term marks
obtained by the student. The Final Result column can have two values 1 or 0 indicating whether
the student passed in the final term. For example, we can see that if the student studied 35 hours
and had obtained 67 marks in the mid term, he / she ended up passing the final term.

Now, suppose, we want to predict whether a student studying 25 hours and having 70 marks in the
mid term will pass the final term.

This is a binary classification problem where a multi layer perceptron can learn from the given
examples (training data) and make an informed prediction given a new data point. We will see
below how a multi layer perceptron learns such relationships.

Training our MLP: The Back-Propagation Algorithm

The process by which a Multi Layer Perceptron learns is called the Backpropagation algorithm.
An ANN consists of nodes in different layers; input layer, intermediate hidden layer(s) and the
output layer. The connections between nodes of adjacent layers have “weights” associated with
them. The goal of learning is to assign correct weights for these edges. Given an input vector, these
weights determine what the output vector is.

In supervised learning, the training set is labeled. This means, for some given inputs, we know the
desired/expected output (label).

Backpropagation-Algorithm:
Initially all the edge weights are randomly assigned. For every input in the training dataset, the
ANN is activated and its output is observed. This output is compared with the desired output that
we already know, and the error is “propagated” back to the previous layer. This error is noted and
the weights are “adjusted” accordingly. This process is repeated until the output error is below a
predetermined threshold.
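
The loop described above can be sketched in Java for the simplest possible case: a single sigmoid
neuron trained by gradient descent. In a full multi layer perceptron the error is propagated back
through every layer, but the structure of the loop is the same. The toy dataset (logical AND),
learning rate and error threshold below are made up for illustration:

public class TinyTrainingLoop {
    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};  // toy inputs
        double[] t = {0, 0, 0, 1};                        // desired outputs (labels)

        // Initially all weights (and the bias) are randomly assigned
        double w1 = Math.random() - 0.5, w2 = Math.random() - 0.5, b = Math.random() - 0.5;
        double learningRate = 0.5;
        double error;

        // Repeat until the output error is below a predetermined threshold
        do {
            error = 0.0;
            for (int i = 0; i < x.length; i++) {
                // Activate the network and observe its output
                double y = 1.0 / (1.0 + Math.exp(-(w1 * x[i][0] + w2 * x[i][1] + b)));

                // Compare with the desired output and propagate the error back
                double delta = (t[i] - y) * y * (1.0 - y);   // error gradient at this neuron
                w1 += learningRate * delta * x[i][0];        // adjust the weights accordingly
                w2 += learningRate * delta * x[i][1];
                b  += learningRate * delta;

                error += 0.5 * Math.pow(t[i] - y, 2);
            }
        } while (error > 0.01);

        System.out.println("Learned: w1=" + w1 + ", w2=" + w2 + ", b=" + b);
    }
}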

Once the above algorithm terminates, we have a “learned” ANN which, we consider is ready to
work with “new” inputs. This ANN is said to have learned from several examples (labeled data)
and from its mistakes (error propagation).

6
The Multi Layer Perceptron shown in Figure 5 has two nodes in the input layer (apart from the
Bias node) which take the inputs ‘Hours Studied’ and ‘Mid Term Marks’. It also has a hidden layer
with two nodes (apart from the Bias node). The output layer has two nodes as well – the upper
node outputs the probability of ‘Pass’ while the lower node outputs the probability of ‘Fail’.

In classification tasks, we generally use a Softmax function as the Activation Function in the
Output layer of the Multi Layer Perceptron to ensure that the outputs are probabilities and they
add up to 1. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to
a vector of values between zero and one that sum to one. So, in this case,

Probability (Pass) + Probability (Fail) = 1
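
A minimal Java version of the Softmax function is shown below (an illustrative sketch; the example
scores are made up). Subtracting the maximum score before exponentiating is a standard trick for
numerical stability and does not change the result:

static double[] softmax(double[] scores) {
    double max = Double.NEGATIVE_INFINITY;
    for (double s : scores) max = Math.max(max, s);
    double sum = 0.0;
    double[] out = new double[scores.length];
    for (int i = 0; i < scores.length; i++) {
        out[i] = Math.exp(scores[i] - max);
        sum += out[i];
    }
    for (int i = 0; i < out.length; i++) out[i] /= sum;  // outputs are now probabilities that sum to 1
    return out;
}

For example, softmax(new double[]{2.0, 1.0}) returns approximately {0.73, 0.27}, and 0.73 + 0.27 = 1.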

Step 1: Forward Propagation

All weights in the network are randomly assigned. Let’s consider the hidden layer node marked V in
Figure 5 below. Assume the weights of the connections from the inputs to that node are w1, w2
and w3 (as shown).

The network then takes the first training example as input (we know that for inputs 35 and 67, the
probability of Pass is 1).

 Input to the network = [35, 67]


 Desired output from the network (target) = [1, 0]

Then output V from the node in consideration can be calculated as below (f is an activation
function such as sigmoid):

V = f (1*w1 + 35*w2 + 67*w3)

Similarly, the output from the other node in the hidden layer is also calculated. The outputs of the
two nodes in the hidden layer act as inputs to the two nodes in the output layer. This enables us to
calculate output probabilities from the two nodes in output layer.

Suppose the output probabilities from the two nodes in the output layer are 0.4 and 0.6 respectively
(since the weights are randomly assigned, outputs will also be random). We can see that the
calculated probabilities (0.4 and 0.6) are very far from the desired probabilities (1 and 0
respectively), hence the network in Figure 5 is said to have an ‘Incorrect Output’.

7
Figure 5: Forward propagation step in a multi layer perceptron

Step 2: Back Propagation and Weight Updation

We calculate the total error at the output nodes and propagate these errors back through the network
using Backpropagation to calculate the gradients. Then we use an optimization method such
as Gradient Descent to ‘adjust’ all weights in the network with an aim of reducing the error at the
output layer. This is shown in the Figure 6 below (ignore the mathematical equations in the figure
for now).

Suppose that the new weights associated with the node in consideration are w4, w5 and w6 (after
Backpropagation and adjusting weights).

Figure 6: Backward propagation and weight updation step in a multi layer perceptron

8
If we now input the same example to the network again, the network should perform better than
before since the weights have now been adjusted to minimize the error in prediction. As shown in
Figure 7, the errors at the output nodes now reduce to [0.2, -0.2] as compared to [0.6, -0.4] earlier.
This means that our network has learnt to correctly classify our first training example.

Figure 7: The MLP network performing better on the same input


We repeat this process with all the other training examples in our dataset. Then our network is said
to have learnt those examples.

Figure 8: A two-layer neural network capable of calculating XOR

The numbers within the neurons represent each neuron's explicit threshold (which can be
factored out so that all neurons have the same threshold, usually 1). The numbers that annotate
arrows represent the weight of the inputs. This net assumes that if the threshold is not reached,
zero (not -1) is output. The bottom layer of inputs is not always considered a real neural network
layer.

9
What is Time Series?
A time series is a series of data points indexed (or listed or graphed) in time order. Most
commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it
is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts
of sunspots, and the daily closing value of the stock prices average.
Time series are very frequently plotted via line charts. Time series are used in statistics, signal
processing, pattern recognition, econometrics, mathematical finance, weather
forecasting, earthquake prediction, electroencephalography, control
engineering, astronomy, communications engineering, and largely in any domain of
applied science and engineering which involves temporal measurements.

Time series forecasting is the use of a model to predict future values based on previously
observed values. While regression analysis is often employed in such a way as to test theories
that the current values of one or more independent time series affect the current value of another
time series, this type of analysis of time series is not called "time series analysis", which focuses
on comparing values of a single time series or multiple dependent time series at different points
in time.[1] Interrupted time series analysis is the analysis of interventions on a single time series.

In statistics, prediction is a part of statistical inference. One particular approach to such inference
is known as predictive inference, but the prediction can be undertaken within any of the several
approaches to statistical inference. Indeed, one description of statistics is that it provides a means
of transferring knowledge about a sample of a population to the whole population, and to other
related populations, which is not necessarily the same as prediction over time. When information
is transferred across time, often to specific points in time, the process is known as forecasting.

Recurrent Neural Networks:-

A recurrent neural network (RNN) is a class of artificial neural network where connections
between nodes form a directed graph along a sequence. This allows it to exhibit temporal
dynamic behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their
internal state (memory) to process sequences of inputs. This makes them applicable to tasks such
as unsegmented, connected handwriting recognition or speech recognition.

Because of their internal memory, RNN’s are able to remember important things about the input
they received, which enables them to be very precise in predicting what’s coming next. This is the
reason why they are the preferred algorithm for sequential data like time series, speech, text,
financial data, audio, video, weather and much more because they can form a much deeper
understanding of a sequence and its context, compared to other algorithms.
In a RNN, the information cycles through a loop. When it makes a decision, it takes into
consideration the current input and also what it has learned from the inputs it received previously.

10
The two images below illustrate the difference in the information flow between a RNN and a
Feed-Forward Neural Network.

Figure 9: RNN & FFNN

A usual RNN has a short-term memory. In combination with an LSTM, it also has a long-term
memory, but we will discuss this further below.

Another good way to illustrate the concept of a RNN’s memory is to explain it with an example:
Imagine we have a normal feed-forward neural network and give it the word “neuron” as an input
and it processes the word character by character. At the time it reaches the character “r”, it has
already forgotten about “n”, “e” and “u”, which makes it almost impossible for this type of neural
network to predict what character would come next. A Recurrent Neural Network is able to
remember exactly that, because of its internal memory. It produces an output, copies that output and
loops it back into the network.

Recurrent Neural Networks add the immediate past to the present. Therefore a Recurrent Neural
Network has two inputs, the present and the recent past. This is important because the sequence of
data contains crucial information about what is coming next, which is why a RNN can do things

11
other algorithms can’t. A Feed-Forward Neural Network assigns, like all other Deep Learning
algorithms, a weight matrix to its inputs and then produces the output. RNN’s apply weights to
the current and also to the previous input. Furthermore they also tweak their weights for both
through gradient descent and Backpropagation Through Time. While Feed-Forward Neural
Networks map one input to one output, RNN’s can map one to many, many to many (translation)
and many to one (classifying a voice).

Figure 10: Different Types of RNN

Figure 11: The concept of forward propagation and backward propagation

12
Backpropagation Through Time (BPTT) does Backpropagation on an unrolled Recurrent Neural
Network. Unrolling is a visualization and conceptual tool, which helps us to understand what’s
going on within the network. Most of the time when we implement a Recurrent Neural Network
in the common programming frameworks, they automatically take care of the Backpropagation
but we need to understand how it works, which enables us to troubleshoot problems that come up
during the development process.

We can view a RNN as a sequence of Neural Networks that you train one after another with
backpropagation.

The image below illustrates an unrolled RNN. On the left, we can see the RNN, which is unrolled
after the equal sign. There is no cycle after the equal sign since the different timesteps are
visualized and information gets passed from one timestep to the next. This illustration also
shows why a RNN can be seen as a sequence of Neural Networks.

Figure 12: Unrolled RNN

If you do Backpropagation Through Time, this conceptual unrolling is required, since the error of a
given timestep depends on the previous timestep.

Within BPTT the error is back-propagated from the last to the first timestep, while unrolling all
the timesteps. This allows calculating the error for each timestep, which allows updating the
weights. Note that BPTT can be computationally expensive when you have a high number of
timesteps.

Two issues of standard RNN’s:-


There are two major obstacles RNN’s have or had to deal with. But to understand them, we first
need to know what a gradient is.

13
A gradient is a partial derivative with respect to its inputs. A gradient measures how much the
output of a function changes, if we change the inputs a little bit.

We can also think of a gradient as the slope of a function. The higher the gradient, the steeper the
slope and the faster a model can learn. But if the slope is zero, the model stops learning. A
gradient simply measures the change in all weights with regard to the change in error.

Vanishing Gradients
We speak of "Vanishing Gradients" when the values of a gradient are too small and the model
stops learning, or takes far too long to learn, as a result. This was a major problem in the 1990s
and was much harder to solve than the exploding gradients.

LSTM Networks
Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN,
capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber
(1997). They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering
information for long periods of time is practically their default behavior, not something they struggle
to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural network. In
standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

Figure 13: The repeating module in a standard RNN contains a single layer

LSTMs also have this chain like structure, but the repeating module has a different
structure. Instead of having a single neural network layer, there are four, interacting in a
very special way.

14
Figure 14: The repeating module in a LSTM contains four interacting layers

Notations used:-

Figure 15: Notations


In the above diagram, each line carries an entire vector, from the output of one node to the inputs of
others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes
are learned neural network layers. Lines merging denote concatenation, while a line forking denotes
its content being copied and the copies going to different locations.
The Core Idea behind LSTMs:
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some
minor linear interactions. It’s very easy for information to just flow along it unchanged.

Figure 16: Cell state of LSTM

15
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by
structures called gates.
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net
layer and a pointwise multiplication operation.

Figure 17: Sigmoid connection


The sigmoid layer outputs numbers between zero and one, describing how much of each component
should be let through. A value of zero means “let nothing through,” while a value of one means “let
everything through!”
An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in an LSTM is to decide what information is going to be thrown away from the cell state.
This decision is made by a sigmoid layer called the “forget gate layer.” It looks at ht−1 and xt, and
outputs a number between 0 and 1 for each number in the cell state Ct−1. A 1 represents “completely
keep this” while a 0 represents “completely get rid of this.”
Consider the example of a language model where next word is predicted based on all the previous
ones. In such a problem, the cell state might include the gender of the present subject, so that the
correct pronouns can be used. When a new subject is seen, we want to forget the gender of the old
subject.

Figure 18: Forget Gate of a LSTM

16
The next step is to decide what new information we’re going to store in the cell state. This has two
parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a
tanh layer creates a vector of new candidate values, C~t, that could be added to the state. In the
next step, we’ll combine these two to create an update to the state.

Figure 19: Input layer gate and tanh layer

It is now time to update the old cell state, Ct−1, into the new cell state Ct. The previous steps have
already decided what to do; we just need to actually do it.
We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it ∗ C~t,
which is the new candidate values scaled by how much we decided to update each state value.
In the case of the language model, this is where we’d actually drop the information about the old
subject’s gender and add the new information, as we decided in the previous steps.

Figure 20: Dropping unrequired information and adding new information

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but
will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re
going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1)
and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information
relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject
is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows
next.
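
Putting the walk-through together, the standard LSTM update equations are as follows (using the
notation of the figures above, where [ht−1, xt] denotes the concatenation of the previous output and
the current input, σ is the sigmoid function and ∗ is pointwise multiplication):

ft = σ(Wf · [ht−1, xt] + bf)        (forget gate)
it = σ(Wi · [ht−1, xt] + bi)        (input gate)
C~t = tanh(WC · [ht−1, xt] + bC)    (candidate values)
Ct = ft ∗ Ct−1 + it ∗ C~t           (new cell state)
ot = σ(Wo · [ht−1, xt] + bo)        (output gate)
ht = ot ∗ tanh(Ct)                  (output / new hidden state)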

17
Figure 21: Output Layer

Applications of LSTM include:

 Robot control
 Time series prediction
 Speech recognition
 Rhythm learning
 Music composition
 Grammar learning
 Handwriting recognition
 Human action recognition
 Sign Language Translation
 Protein Homology Detection
 Predicting subcellular localization of proteins
 Time series anomaly detection
 Several prediction tasks in the area of business process management
 Prediction in medical care pathways
 Semantic parsing
 Object Co-segmentation

18
DL4J Library
The DL4J library is used extensively in this collection data forecasting project. DL4J stands for Deep
Learning for Java. Deeplearning4j is a domain-specific language to configure deep neural networks, which
those layers and their hyperparameters.

Hyperparameters are variables that determine how a neural network learns. They include how
many times to update the weights of the model, how to initialize those weights, which activation
function to attach to the nodes, which optimization algorithm to use, and how fast the model
should learn.

This is what one configuration would look like:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .weightInit(WeightInit.XAVIER)
    .activation("relu")
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .updater(new Sgd(0.05))
    // ... other hyperparameters
    .list()
    .backprop(true)
    .build();

With Deeplearning4j, a layer can be added by calling layer on the NeuralNetConfiguration.Builder(),
specifying its place in the order of layers (the zero-indexed layer below is the input layer), the
number of input and output nodes, nIn and nOut, as well as the type: DenseLayer.

Preparing Data for Learning and Prediction


Unlike other machine learning or deep learning frameworks, DL4J treats the tasks of loading
data and training algorithms as separate processes. Data is loaded using DataVec. This gives a lot
more flexibility, and retains the convenience of simple data loading.

Before the algorithm can start learning, data is prepared, even if a trained model is available.
Preparing data means loading it and putting it in the right shape and value range (e.g.
normalization, zero-mean and unit variance). Building these processes from scratch is error
prone, so DataVec should be used wherever possible.

19
Deeplearning4j works with a lot of different data types, such as images, CSV, ARFF and plain text,
and can integrate with Apache Camel.

To use DataVec, we implement the RecordReader interface and use it together with
the RecordReaderDataSetIterator.

Once the DataSetIterator is ready, which is just a pattern that describes sequential access to data,
it can be used to retrieve the data in a format suited for training a neural net model.
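
As an illustration only, loading a CSV file into a DataSetIterator typically looks like the fragment
below. The file name, delimiter, batch size and column indices are hypothetical, and the exact
constructor signatures vary slightly between DataVec versions:

RecordReader recordReader = new CSVRecordReader(1, ',');   // skip 1 header line, comma-delimited
recordReader.initialize(new FileSplit(new File("collection_data.csv")));

// batch size 32, label in column 4, 2 output classes (all values are illustrative)
DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, 32, 4, 2);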

Normalizing Data
Neural networks work best when the data they are fed is normalized, i.e. constrained to a range
between -1 and 1. There are several reasons for this. One is that nets are trained using gradient
descent, and their activation functions usually have an active range somewhere between -1 and
1. Even when using an activation function that doesn’t saturate quickly, it is still good practice to
constrain values to this range to improve performance.

Normalizing data is pretty easy in DL4J. Set the corresponding DataNormalization up as a
preprocessor for the DataSetIterator.

The ImagePreProcessingScaler is a good choice for image data. The NormalizerMinMaxScaler is
a good choice if there is a uniform range along all dimensions of the input data,
and NormalizerStandardize is used in other cases.
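
For example (a sketch, reusing the iterator from the previous fragment):

// Scale every feature into the range [-1, 1] and attach the normalizer as a preprocessor
NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(-1, 1);
normalizer.fit(iterator);               // collect min/max statistics from the training data
iterator.setPreProcessor(normalizer);   // every DataSet returned by the iterator is now normalized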

DataSets, INDArrays and Mini-Batches:-


As the name suggests, a DataSetIterator returns DataSet objects. DataSet objects are containers
for the features and labels of data. But they aren’t constrained to holding just a single example at
once. A DataSet can contain as many examples as needed.

It does that by keeping the values in several instances of INDArray: one for the features of your
examples, one for the labels and two additional ones for masking, if you are using time series
data.

An INDArray is one of the n-dimensional arrays, or tensors, used in ND4J. In the case of the
features, it is a matrix of the size Number of Examples x Number of Features. Even with only a
single example, it will have this shape.
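
A small sketch of how these containers are accessed (the getters shown are the standard DataSet
accessors):

DataSet batch = iterator.next();            // one mini-batch from the DataSetIterator
INDArray features = batch.getFeatures();    // features matrix: Number of Examples x Number of Features
INDArray labels = batch.getLabels();        // labels for the same examples
System.out.println(java.util.Arrays.toString(features.shape()));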

Why doesn’t it contain all of the data examples at once?

This is another important concept for deep learning: mini-batching. In order to produce accurate
results, a lot of real-world training data is often needed. Often that is more data than can fit in
available memory, so storing it in a single DataSet sometimes isn’t possible. But even if there is
enough data storage, there is another important reason not to use all of the data at once.

Since the model is trained using gradient descent, it requires a good gradient to learn how to
minimize error. Using only one example at a time will create a gradient that only takes errors

20
produced with the current example into consideration. This would make the learning behavior
erratic, slow down the learning, and may not even lead to a usable result.

A mini-batch should be large enough to provide a representative sample of the real world (or at
least your data). That means that it should always contain all of the classes that are to be
predicted and that the count of those classes should be distributed in approximately the same way
as they are in the overall data.

Building a Neural Net Model


DL4J gives data scientists and developers tools to build deep neural networks at a high level,
using concepts like layer. It employs a builder pattern in order to build the neural net
declaratively:-

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .updater(new Nesterovs(learningRate, 0.9))
    .list(
        new DenseLayer.Builder().nIn(numInputs).nOut(numHiddenNodes).activation("relu").build(),
        new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD).activation("softmax")
            .nIn(numHiddenNodes).nOut(numOutputs).build()
    )
    .backprop(true)
    .build();

Unlike other frameworks, DL4J splits the optimization algorithm from the updater algorithm.
This allows for flexibility as we seek a combination of optimizer and updater that works best for
the data and problem.

Besides the DenseLayer and OutputLayer there are several other layer types,
like GravesLSTM, ConvolutionLayer, RBM, EmbeddingLayer, etc. Using those layers we can
define simple neural networks, recurrent and convolutional networks.

Training a Model
After configuring the neural network, we will have to train the model. The simplest case is to simply call
the .fit() method on the model configuration with your DataSetIterator as an argument. This will
train the model on all the data once. A single pass over the entire dataset is called an epoch.
DL4J has several different methods for passing through the data more than just once.

21
The simplest way is to reset the DataSetIterator and loop over the fit call as many times as
needed. This way we can train our model for as many epochs as we think is appropriate.
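
A typical training loop therefore looks like the sketch below (conf, nEpochs and trainIterator are
assumed to be defined elsewhere):

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

for (int epoch = 0; epoch < nEpochs; epoch++) {
    model.fit(trainIterator);   // one full pass over the data = one epoch
    trainIterator.reset();      // rewind the iterator before the next epoch
}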

Evaluating Model Performance


As training proceeds, we will want to test how well the model performs. For that test, we will need a
dedicated data set that is not used for training but instead is only used for evaluating
the model. This data should have the same distribution as the real-world data we want to make
predictions about with our model. The reason we can’t simply use our training data for
evaluation is that machine learning methods are prone to overfitting (getting good at making
predictions on the training set, but not performing well on data the model has not seen).

The Evaluation class is used for evaluation. Slightly different methods apply to evaluating
normal feed-forward networks and recurrent networks.
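
For a classification network, a minimal evaluation sketch looks like this (testIterator is assumed to
hold the held-out data):

Evaluation eval = new Evaluation();
while (testIterator.hasNext()) {
    DataSet t = testIterator.next();
    INDArray predicted = model.output(t.getFeatures(), false);  // false = inference mode
    eval.eval(t.getLabels(), predicted);                        // compare predictions with true labels
}
System.out.println(eval.stats());                               // accuracy, precision, recall, F1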

Troubleshooting a Neural Net Model


Building neural networks to solve problems is an empirical process. That is, it requires trial and
error. So we will have to try different settings and architectures in order to find a neural net
configuration that performs well.

DL4J provides a listener facility to help us monitor our network’s performance visually. We can
set up listeners for the model that will be called after each mini-batch is processed. One of the
most often used listeners that DL4J ships out of the box is the ScoreIterationListener.

While the ScoreIterationListener simply prints the current error score for our
network, the HistogramIterationListener starts up a web UI that provides us with a host of
different information that we can use to fine-tune our network configuration.

Key Aspects of DataVec:-


 DataVec uses an input/output format system (similar in some ways to how Hadoop
MapReduce uses InputFormat to determine InputSplits and RecordReaders, DataVec also
provides RecordReaders to Serialize Data)
 Designed to support all major types of input data (text, CSV, audio, image and video)
with these specific input formats
 Uses an output format system to specify an implementation-neutral type of vector format
(ARFF, SVMLight, etc.)
 Can be extended for specialized input formats (such as exotic image formats); i.e. We can
write our own custom input format and let the rest of the codebase handle the
transformation pipeline
 Makes vectorization a first-class citizen
 Built-in transformation tools to convert and normalize data

22
Forecasting the Collection Data:-
Collection data from the company’s database is passed to the neural network to predict the values
for one day ahead.
Prediction is done with multiple targets and many attributes, such as Total amount overdue, Total
amount outstanding, Count of people and information about buckets.
The project is divided into four parts:
1. Building a model with appropriate hidden layers and activation functions (a sketch of such a
configuration is shown after this list).
2. Creating a DataSetIterator file in Java to process the input data.
3. Creating a plot utility function to plot the graph of actual versus predicted values.
4. Creating the prediction file, which runs together with the above files.
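
The report does not list the exact network configuration, but a model of this kind can be sketched
with DL4J’s recurrent layers roughly as follows. The layer sizes, updater and learning rate are
assumptions for illustration; forecasting continuous targets is treated as regression, so the output
layer uses MSE loss with an identity activation:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .weightInit(WeightInit.XAVIER)
    .updater(new Adam(0.01))
    .list()
    .layer(0, new GravesLSTM.Builder().nIn(numFeatures).nOut(64)
        .activation(Activation.TANH).build())
    .layer(1, new GravesLSTM.Builder().nIn(64).nOut(32)
        .activation(Activation.TANH).build())
    .layer(2, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
        .activation(Activation.IDENTITY).nIn(32).nOut(numTargets).build())
    .build();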

Collection data can be primary or secondary.


Primary Data:
Refers to data that does not have any prior existence and is collected directly from the
respondents. It is considered very reliable in comparison to all other forms of data. However, its
reliability may come under scrutiny for various reasons: for example, the researcher may be
biased while collecting the data, the respondents may not feel comfortable answering the
questions, or the researcher may influence the respondents.

In all these scenarios, primary data would not be very dependable. Therefore, primary data
collection should be done with utmost caution and prudence. Primary data helps the researchers
in understanding the real situation of a problem. It presents the current scenario in front of the
researchers; therefore, it is more effective in taking the business decisions.

Secondary Data:
Refers to data that was collected in the past but can be utilized in the present scenario or research
work. Collecting secondary data requires less time than collecting primary data.

23
Sample Collection Data

Here,
Bucket 1% is the percentage of total loan cases which fall in DPD 1-30 (DPD = Days Past Due);
Bucket 2% is the percentage of total loan cases which fall in DPD 31-60;
Self_Curing% is the percentage of total loan cases which are sorted out before the due date;
Amount_Arrer = Total_Loan_Cases * 1000;
Recovery% is the percentage of the arrear amount recovered;
Recovery Amount = Amount_Arrer * (Recovery% * 0.01).
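
As a made-up numerical example of these definitions: if Total_Loan_Cases = 500, then
Amount_Arrer = 500 * 1000 = 500,000; and if Recovery% = 40, then
Recovery Amount = 500,000 * (40 * 0.01) = 200,000.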

This sequential data of the collection department is fed to the neural model. There are 2000 such
rows, of which 90% are used for training and the rest for testing.

So if the number of working days in a month is 22, then for testing we pass examples 1801 to 1822
into the model and the model predicts the 1823rd. Similarly, when we pass data from 1802 to 1823,
it predicts the 1824th. In this way we plot the graph of predictions versus actual data and study its
deviation from the actual values.
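
A sketch of one step of this sliding-window prediction is shown below. The variable names and
shapes are assumptions; DL4J recurrent networks expect input of shape
[miniBatchSize, numFeatures, timeSeriesLength]:

// window holds the 22 most recent working days, e.g. rows 1801-1822, as [1, numFeatures, 22]
INDArray window = Nd4j.create(new int[]{1, numFeatures, 22});
// ... fill the window with the normalized feature values ...

INDArray prediction = model.output(window);                 // shape [1, numTargets, 22]
INDArray nextDay = prediction.get(NDArrayIndex.point(0),
                                  NDArrayIndex.all(),
                                  NDArrayIndex.point(21));  // forecast for the next day (e.g. row 1823)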

This serves as a capacity planning tool for the collection department, which can then use it to
plan its capacity for the next month.

24
Figure 22:- Model Output

The model outputs the above graph for Bucket 1%. The red line shows the predictions and the blue
line the actual values.
The important thing to note is that the pattern of the predictions is almost the same as that of the
actual values, despite some deviations.
Studying the above results, the bank can optimize its collection capacity.

25
Conclusion:-

For building a forecasting model using a neural network in Java, the DL4J library is used
extensively. Prior knowledge of deep learning concepts and neural networks is required.
The model is optimized by choosing appropriate activation functions and the number of hidden
layers.
If the available data is a raw dataset, it is first processed using a DataSetIterator so that it can be
fed to the network. Prediction is done with multiple attributes together. The LSTM model is
trained with thousands of examples before it can forecast data.

In the practical world, neural networks are seen as a vast subject. Many data scientists focus
solely on neural network techniques. Neural networks work particularly well on certain classes
of problems, such as image recognition. Neural network algorithms are very computation
intensive and require highly efficient computing machines; large datasets take a significant
amount of runtime, for example in R. Currently, there is a lot of exciting research going on
around neural networks.

26
