
ABSTRACT

This project deals with extracting information from an image: the key idea is to detect the multiple objects present in an image, classify each object's category and localize it within the image.

The methodology we adopted for this project was a modular approach, in which we tackled the different aspects by implementing the project in various phases. The knowledge base for the same comes from convolutional neural networks.

We successfully created our object detection model and tested it against various parameters
that can affect the performance of the model.

The tools we used for this project were Python, Jupyter Notebook, LabelImg (for labelling the dataset) and OpenCV. Hence, we got to learn to work in an environment (and with a computer language) that was not very familiar to us.

CHAPTER 1
INTRODUCTION

1.1 Overview
Data present in images contain useful information for automatic annotation, indexing, and
structuring of images. Extraction of this information involves detection, localization, tracking,
extraction, enhancement and recognition of the data from a given image.

However, variations in the data due to differences in size, style and orientation, as well as low contrast and complex backgrounds, make the problem of automatic data extraction extremely challenging. While comprehensive surveys of related problems such as face detection, document analysis and image & video indexing can be found, the problem of information retrieval is not well surveyed.

A large number of techniques has been proposed to address this problem, and the purpose of this project is to classify and review these algorithms, discuss benchmark data and performance evaluation, and point out promising directions for future research.

1.2 Motivation

The motivation for doing this project is primarily an interest in undertaking a project in this area of research. The opportunity to learn about a new area of computing not covered in lectures was appealing. Interest in image retrieval has increased in large part due to the rapid growth of the World Wide Web; according to a recent study, 1.8 billion images are shared on the web in a single day. Data present in images and videos contains useful information for automatic annotation, indexing and structuring of images. For this reason, we undertook this project to create a model that will extract the information present in images and videos.

1.3 Objective of the project

The main goal of this project is to design, implement and evaluate a model that will detect multiple objects in an image, localize the objects and classify them according to their classes. The aim is to train the model by labelling the dataset; the trained model will then be used to predict the output.
Our concept includes knowledge about convolutional neural networks, digital image processing and the classification of images. Furthermore, we describe how knowledge of neural networks helps in classifying images and how images are represented in a computer.
1.4 Organization of the report

Chapter 1 contains the overview and motivation of the project, followed by the objective that we need to achieve. Chapter 2 contains the literature from all the research papers studied, along with a conceptual overview and an introduction to the technologies involved in this project. Chapter 3 explains our methodology for this project, with the necessary diagrams and layouts. Chapter 4 contains the results obtained from the different training runs, along with the explanation behind them and their analysis. Finally, Chapter 5 contains the summarized results and explains the work that needs to be, and can be, done to make this model better. At the end of the report are the references and the Annexure section, which contains the full code and model structure.

CHAPTER 2
BACKGROUND MATERIAL

In order to understand the workings and methodologies used in this project, we need to be clear about a few basic concepts. These concepts and their explanations are discussed in this chapter.

2.1 Conceptual Overview

First, we need to learn about neural networks and their related terminology. Later we will understand how they are accessed and modified, and the tools required for the same.

2.1.1 Neural Network

It is a machine learning algorithm built on the principles of the organization and functioning of biological neural networks. The concept arose in 1943 from an attempt by Warren McCulloch and Walter Pitts to simulate the processes occurring in the brain.

Neural networks consist of individual units called neurons. Neurons are arranged in a series of groups called layers, and the neurons in each layer are connected to the neurons of the next layer. Data flows from the input layer to the output layer along these connections. Each individual node performs a simple mathematical calculation and then transmits its result to all the nodes it is connected to.

2.1.1.1 Architecture of Neural Network

A neural network is a set of connected neurons organized in layers:

● input layer: brings the initial data into the system for further processing by subsequent
layers of artificial neurons.

● hidden layer: a layer in between input layers and output layers, where artificial neurons
take in a set of weighted inputs and produce an output through an activation function.

● output layer: the last layer of neurons, which produces the outputs of the network.
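
To make this layer-by-layer computation concrete, here is a minimal NumPy sketch of a forward pass (the layer sizes, random weights and sigmoid activation are illustrative choices, not taken from our model):

    import numpy as np

    def sigmoid(z):
        # activation function: squashes any value into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def layer_forward(x, W, b):
        # each neuron computes a weighted sum of its inputs plus a bias,
        # then passes the result through the activation function
        return sigmoid(W @ x + b)

    # toy network: 3 inputs -> 4 hidden neurons -> 2 outputs
    x = np.array([0.5, -1.2, 3.0])                 # input layer values
    W1, b1 = np.random.randn(4, 3), np.zeros(4)    # hidden layer parameters
    W2, b2 = np.random.randn(2, 4), np.zeros(2)    # output layer parameters
    output = layer_forward(layer_forward(x, W1, b1), W2, b2)
    print(output)                                  # two values in (0, 1)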

2.1.1.2 Types of Neural Network

● Feedforward Neural Network: This is one of the simplest neural networks, where the data or input travels in one direction only. The data passes through the input nodes and exits on the output nodes. There is no back-propagation algorithm, so if the neural network outputs the "wrong" answer, there is no way for it to correct itself.

● Radial Basis Function Neural Network: Radial basis functions consider the distance of a point with respect to a centre. RBF networks have two layers: the features are first combined with the radial basis function in the inner layer, and the outputs of these units are then taken into account when the final output is computed in the next layer.
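
As a small illustration, the response of a Gaussian RBF unit to an input point can be computed as follows (the centres and width are arbitrary example values):

    import numpy as np

    def rbf(x, center, sigma=1.0):
        # Gaussian radial basis function: the response depends only on
        # the distance of the point x from the centre
        return np.exp(-np.linalg.norm(x - center) ** 2 / (2 * sigma ** 2))

    x = np.array([1.0, 2.0])
    centers = [np.array([0.0, 0.0]), np.array([1.0, 2.0])]
    features = [rbf(x, c) for c in centers]   # inner-layer outputs
    print(features)                           # the unit centred on x outputs 1.0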

● Convolutional Neural Network: A convolutional neural network (CNN) uses a variation of the multilayer perceptron. A CNN contains one or more convolutional layers, which can be either completely interconnected or pooled.

Before passing the result to the next layer, the convolutional layer applies a convolution operation to the input. Thanks to this convolution operation, the network can be much deeper but with far fewer parameters.

Due to this ability, convolutional neural networks show very effective results in image and video
recognition, natural language processing, and recommender systems.

Convolutional neural networks also show great results in semantic parsing and paraphrase
detection. They are also applied in signal processing and image classification.
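
To make the preceding description concrete, a minimal CNN can be assembled with the Keras API that ships with TensorFlow (a sketch; the input size, filter counts and 10-class output are illustrative, not our project's architecture):

    from tensorflow.keras import layers, models

    model = models.Sequential([
        # convolutional layer: slides 3x3 filters across the image, which
        # needs far fewer parameters than a fully connected layer would
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
        layers.MaxPooling2D((2, 2)),              # pooling layer
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation='softmax')    # one score per class
    ])
    model.summary()   # note how few parameters the convolutional layers use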

2.1.2 Types of CNN


A naive detector would have to select a huge number of regions, which could computationally blow up. There is one more problem: aspect ratio. Objects can be present in various shapes; a sitting person, for example, has a different aspect ratio than a standing or sleeping person. Therefore, algorithms like R-CNN and YOLO have been developed to find these occurrences, and to find them fast.

2.1.2.1 Region-based Convolutional Neural Network (R-CNN)


It was impossible to run CNNs on the many patches generated by a sliding-window detector. R-CNN solves this problem by using an object proposal algorithm called Selective Search, which reduces the number of bounding boxes fed to the classifier to roughly 2000 region proposals. Selective Search uses local cues like texture, intensity, colour and a measure of insideness to generate all the probable locations of the object. We can then feed these boxes to our CNN-based classifier. Remember that the fully connected part of a CNN takes a fixed-size input, so we resize (without preserving aspect ratio) all the generated boxes to a fixed size (224×224 for VGG) and feed them to the CNN part. Hence, there are three important parts of R-CNN:
I. Run Selective Search to generate probable object regions.
II. Feed these patches to the CNN, followed by an SVM, to predict the class of each patch.
III. Optimize the patches by training a bounding box regression separately.
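
The first of these steps can be sketched with the Selective Search implementation in OpenCV's contrib module (this assumes the opencv-contrib-python package is installed; the image path is a placeholder):

    import cv2

    img = cv2.imread('input.jpg')   # placeholder path to an input image
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()   # fast mode trades some recall for speed
    rects = ss.process()               # (x, y, w, h) region proposals
    print(len(rects), 'region proposals generated')
    # each proposal would then be resized (e.g. to 224x224 for VGG) and
    # fed to the CNN, followed by the SVM classifier (steps II and III)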

2.1.2.2 YOLO (You Only Look Once)


YOLO divides each image into an S × S grid, and each grid cell predicts N bounding boxes together with confidence scores. The confidence reflects the accuracy of the bounding box and whether the bounding box actually contains an object (regardless of class). YOLO also predicts a classification score for each box, for every class seen in training. Combining the box confidence with the class scores gives the probability of each class being present in a predicted box.
In total, S × S × N boxes are predicted. However, most of these boxes have low confidence scores, and if we set a threshold of, say, 30% confidence, we can remove most of them.
Notice that at runtime we run our image through the CNN only once. Hence, YOLO is very fast and can run in real time. Another key difference is that YOLO sees the complete image at once, as opposed to looking only at the generated region proposals of the previous methods, so this contextual information helps it avoid false positives. However, one limitation of YOLO is that it predicts only one class per grid cell; hence, it struggles with very small objects.
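
The thresholding step described above, followed by non-maximum suppression to merge duplicate boxes, can be sketched with OpenCV's dnn module (a sketch that assumes the boxes and confidences have already been collected from the network's S × S × N predictions):

    import cv2
    import numpy as np

    def filter_detections(boxes, confidences, conf_thresh=0.5, nms_thresh=0.3):
        # boxes: list of [x, y, w, h]; confidences: list of floats.
        # NMSBoxes drops boxes below conf_thresh, then suppresses boxes
        # that overlap a higher-scoring box by more than nms_thresh
        idxs = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
        return [boxes[i] for i in np.array(idxs).flatten()]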

2.2.2 Installing TensorFlow (CPU) and OpenCV


The TensorFlow Object Detection API requires using the specific directory structure provided in
its GitHub repository. It also requires several additional Python packages, specific additions to
the PATH and PYTHONPATH variables, and a few extra setup commands to get everything set
up to run or train an object detection model.
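
As an example, the required paths can also be appended from inside Python (the C:\tensorflow1 location is just a placeholder for wherever the models repository was cloned):

    import sys

    # example locations; adjust to your clone of the TensorFlow models repo
    for p in [r'C:\tensorflow1\models',
              r'C:\tensorflow1\models\research',
              r'C:\tensorflow1\models\research\slim']:
        sys.path.append(p)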
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code.

The library has more than 2500 optimized algorithms, which include a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. These algorithms can be used to detect and recognize faces, identify objects, find similar images in an image database, remove red eyes from images taken using flash, follow eye movements, recognize scenery and establish markers to overlay it with augmented reality, etc. OpenCV has a user community of more than 47 thousand people and an estimated number of downloads exceeding 18 million. The library is used extensively in companies, research groups and by governmental bodies.
Fig 2.2.2.1: Object detection using OpenCV

2.2.3 Anaconda Python

The open-source Anaconda Distribution is the easiest way to perform Python/R data science and
machine learning on Linux, Windows, and Mac OS X. With over 11 million users worldwide, it
is the industry standard for developing, testing, and training on a single machine,
enabling individual data scientists to:
1. Quickly download 1,500+ Python/R data science packages
2. Manage libraries, dependencies, and environments with Conda
3. Develop and train machine learning and deep learning models with scikit-learn, TensorFlow, and Theano
4. Analyze data with scalability and performance with Dask, NumPy, pandas, and Numba
5. Visualize results with Matplotlib, Bokeh, Datashader, and Holoviews
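
For instance, an isolated Conda environment for this project might be created as follows (the environment name and package choices are examples, not the exact ones we used):

    conda create -n objdetect python=3.6    # create an isolated environment
    conda activate objdetect
    pip install tensorflow opencv-python    # CPU-only TensorFlow and OpenCV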

2.2.4 LabelImg

LabelImg is a tool that makes it very easy to annotate images. There are plenty of other tools to choose from, but LabelImg seems to be one of the most popular. It is written in Python and uses Qt for its graphical interface.

CHAPTER 3
WORK FLOW

Our methodology is based on a number of different implementations; we took ideas from many different research papers and implementations.

3.1.4 Problem Statement

To retrieve information from an image using object detection techniques.


Our aim is to create an object detection model that will detect multiple objects in an image. The model will detect the objects in an image and identify the types of objects recorded in it. The aim is to train the model by labelling the dataset; the trained model will then be used to predict the output.

3.2 Methodology and Diagrams

3.2.1 System and Framework

● Operating System: Microsoft Windows 10


● CPU: Intel Core i5
● Environment: Anaconda (Jupyter Notebook)
● Programming Language: Python 3.6.3

3.2.2 Methodology

The entire methodology is divided into five basic parts.

● Setting up the object detection directory and virtual environment


● Gathering and labeling dataset
● Creating a label map
● Training & testing the dataset
● Predicting output of new dataset using trained data

3.2.2.1 Setting up the object detection directory and virtual environment

First, we import the required packages; as long as OpenCV and NumPy are installed, the interpreter will breeze past these lines.

We must parse four command line arguments. Command line arguments are processed at runtime and allow us to change the inputs to our script from the terminal (a parsing sketch follows the list). Our command line arguments include:
● --image : The path to the input image. We’ll detect objects in this image using YOLO.
● --yolo : The base path to the YOLO directory. Our script will then load the required
YOLO files in order to perform object detection on the image.
● --confidence : Minimum probability to filter weak detections. We give this a default value of 50% (0.5).
● --threshold : This is our non-maxima suppression threshold, with a default value of 0.3.
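
A minimal argparse sketch for these four arguments (the help strings are ours; the flags mirror the list above):

    import argparse

    ap = argparse.ArgumentParser()
    ap.add_argument('--image', required=True,
                    help='path to the input image')
    ap.add_argument('--yolo', required=True,
                    help='base path to the YOLO directory')
    ap.add_argument('--confidence', type=float, default=0.5,
                    help='minimum probability to filter weak detections')
    ap.add_argument('--threshold', type=float, default=0.3,
                    help='non-maxima suppression threshold')
    args = vars(ap.parse_args())   # e.g. args['image'], args['confidence']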

3.2.2.2 Gathering and labeling dataset

In the first phase we collected a dataset of four items and labelled them. We classified the four items and, using the R-CNN model, trained and tested on the data with a 7:3 split. After training and testing on the images, we ran the Python code on random images and got the output by detecting each object and classifying it with its class name. However, this approach was less efficient, so we developed a new Python script using pre-trained data and a larger dataset.
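
For reference, a 7:3 split of the labelled images can be produced in a few lines (a sketch; the images folder name is a placeholder, and scikit-learn ships with the Anaconda distribution described in Chapter 2):

    import os
    from sklearn.model_selection import train_test_split

    images = sorted(os.listdir('images'))   # placeholder folder of labelled images
    train, test = train_test_split(images, test_size=0.3, random_state=42)
    print(len(train), 'training images,', len(test), 'test images')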

3.2.2.3 Creating a label map

Using the LabelImg tool we labelled the dataset; the tool produces an XML file for each image containing the width, the object boundaries and many other attributes of the image. The label map tells the trainer what each object is by defining a mapping of class names to class ID numbers. Use a text editor to create a new file and save it as labelmap.pbtxt; an example is shown below.
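
For example, a label map for our four item classes has the following form (the class names here are placeholders for the actual items we labelled):

    item {
      id: 1
      name: 'item_one'
    }
    item {
      id: 2
      name: 'item_two'
    }
    # entries for the remaining two classes follow the same pattern,
    # with ids 3 and 4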
3.2.2.4 Training and testing the dataset

With the images labelled, it's time to generate the TFRecords that serve as input data to the TensorFlow training model. We use the xml_to_csv.py and generate_tfrecord.py scripts. First, the image .xml data is used to create .csv files containing all the data for the train and test images. The training routine periodically saves checkpoints, about every five minutes. You can terminate training by pressing Ctrl+C in the command prompt window; wait until just after a checkpoint has been saved before terminating. Training can be stopped and restarted later, and it will resume from the last saved checkpoint. The checkpoint at the highest number of steps will be used to generate the frozen inference graph.
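
The two scripts are typically run from the object_detection folder roughly as follows (the folder layout follows the tutorial we based our setup on and may differ in other setups):

    python xml_to_csv.py
    python generate_tfrecord.py --csv_input=images/train_labels.csv --image_dir=images/train --output_path=train.record
    python generate_tfrecord.py --csv_input=images/test_labels.csv --image_dir=images/test --output_path=test.record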
