
A

Project Report
On

Breast Cancer Detection Using Machine Learning

Submitted in partial fulfillment of the requirements

for the degree of

Bachelor of Engineering
in
Computer
By

Vinay Vilas Patil Roll No.18


Arun Vasantlal Jaiswal Roll No.13
Shubham Satish Sakpal Roll No.20

Supervisor
Prof. Vaibhav Badbe

Technology Personified

Department of Computer Engineering

Innovative Engineers' and Teachers' Education society's


Bharat College of Engineering
Badlapur: - 421504.
(Affiliated to University of Mumbai)

(2018-19)
Technology Personified

Bharat College of Engineering


(Affiliated to the University of Mumbai)
Badlapur: - 421504.

CERTIFICATE
This is to certify that the project titled

“ Breast Cancer Detection Using Machine Learning ”

is a bonafide work done by

Vinay Vilas Patil


Arun Vasantlal Jaiswal
Shubham Satish Sakpal

and is submitted in partial fulfillment of the requirements for the

degree of

Bachelor of Engineering
In
Computer
To the
University of Mumbai

Supervisor
Prof. Vaibhav Badbe

Project Co-ordinator Head of Department Principal

(Prof. Nilesh Yadav) (Prof.Ankur Sharma) (Dr. S.N.Barai)


Project Report Approval for B.E.

This is to certify that the project entitled “Breast Cancer Detection Using Machine
Learning” is a bonafide work done by Vinay Vilas Patil, Arun Vasantlal Jaiswal and Shubham
Satish Sakpal under the supervision of Prof. Vaibhav Badbe. This project has been approved for the
award of Bachelor’s Degree in Computer Engineering, University of Mumbai.

Examiners:
1...............................

2...............................

Supervisors:

1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Principal:

..............................

Date:

Place:
Declaration

I declare that this written submission represents my ideas in my own words and where
others’ ideas or words have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity and have
not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or
from whom proper permission has not been taken when needed.

Vinay Vilas Patil Roll No.-18


Arun Vasantlal Jaiswal Roll No.-13
Shubham Satish Sakpal Roll No-20

Date:
Contents

Abstract

List of Figures

List of Tables

1. Introduction
   1.1 Introduction to Breast Cancer
   1.2 Problem Definition
   1.3 Scope of Project
   1.4 Relevance and Motivation of Project
   1.5 Organization of the Report

2. Review of Literature/Related Work

3. Planning and Formulation

4. Methodology
   4.1 Proposed System
   4.2 Proposed Methodology
   4.3 System Requirements

5. Design of System
   5.1 System Architecture/Design
   5.2 Data Flow Diagrams

6. Conclusion and Future Scope

References
Abstract
Breast cancer (BC) is one of the most common cancers among women worldwide,
accounting for a large share of new cancer cases and cancer-related deaths according to global
statistics, which makes it a significant public health issue. Early diagnosis of
BC can significantly improve the prognosis and chance of survival, as it enables timely
clinical treatment. This project aims to detect the type of breast cancer (malignant
or benign) using K-Nearest Neighbors (K-NN), a machine learning algorithm,
taking cell parameters as input. The quality of the results depends largely on the distance measure and
the value of the parameter “k”, which represents the number of nearest neighbors. The project also
aims to achieve maximum accuracy in the detection of breast cancer on BC data sets. The
study is conducted on the Wisconsin Breast Cancer Dataset (WBCD), collected at the University of
Wisconsin Hospitals and obtained from the UCI repository.
List of Figures

Figure No.   Figure Name

2.1          Artificial Neural Network
2.2          Support Vector Machine
3.1          K-NN Tumors
4.1          Python Data Flow
4.2          Proposed Flow Chart
4.3          Wisconsin Breast Cancer Dataset Attributes
5.1.1        System Architecture
5.2.1        Data Flow Diagram

List of Tables

Table No.    Table Name

3.1.1        Phase I
3.1.2        Phase II
3.1.3        Breast Cancer Data Sets

BREAST CANCER DETECTION USING MACHINE LEARNING

CHAPTER 1

1.1 Introduction:
Breast cancer (BC) is the most common cancer in women, affecting about 10% of all
women at some stage of their lives. Over the past few decades, ML techniques have been
widely used in intelligent healthcare systems, especially for BC detection and
diagnosis.
The breast is made up of different tissue, ranging from very fatty tissue to very dense
tissue. Within this tissue is a network of lobes. Each lobe is made up of tiny, tube-like
structures called lobules that contain milk glands. Tiny ducts connect the glands, lobules, and
lobes, carrying milk from the lobes to the nipple. The nipple is located in the middle of the
areola, which is the darker area that surrounds the nipple. Blood and lymph vessels also run
throughout the breast. Blood nourishes the cells. The lymph system drains bodily waste
products. The lymph vessels connect to lymph nodes, the tiny, bean-shaped organs that help
fight infection.

Breast cancer has been identified as one of the leading causes of death in developing
countries. Earlier detection of cancer can reduce the death rate and shorten the treatment
phase. Because breast cancer is a medical condition, detecting it requires a medical diagnosis. To
make this process much simpler, computer-aided tools are adopted. The objective of this work
is to classify the given data set into the two types of breast tumor, i.e. benign or malignant.

A mass of abnormal tissue is known as a tumor. Breast tumors are classified into
two types: benign, those that are non-cancerous, and malignant, those that are cancerous.
Benign tumors: Generally, these tumors are not aggressive toward surrounding tissue, although they
may occasionally continue to grow. Malignant tumors: These tumors are cancerous and
aggressive because they invade and damage surrounding tissue.

K-Nearest Neighbors (K-NN) is one of the most prominent classification algorithms
because it is simple, effective and, on many problems, more accurate than other classification algorithms.
Being non-parametric, it does not require any assumption about the data distribution.

BHARAT COLLEGE OF ENGINEERING 1



1.2 Problem definition:


Breast cancer is the most common female cancer worldwide, representing nearly a
quarter (23%) of all cancers in women. The global burden of breast cancer is expected to cross
2 million cases by the year 2030, with growing proportions from developing countries. Although
age-standardised incidence rates in India are lower than in the United Kingdom (UK) (25.8
versus 95 per 100,000), mortality rates are nearly as high as those of the UK (12.7 versus 17.1 per 100,000,
respectively). Breast cancer incidence rates within India display a 3–4-fold
variation across the country, with the highest rates observed in the Northeast and in major
metropolitan cities such as Mumbai and New Delhi.

Diagnosis at advanced stages of disease contributes to the high mortality rate among
women due to breast cancer, which can be attributed to low levels of awareness, cumbersome
referral pathways to diagnosis, limited access to effective treatment at regional cancer centres
and incomplete treatment regimens. With the rising breast cancer incidence in India and
disproportionately higher mortality, it is essential to understand the level of cancer literacy,
especially since the average age at diagnosis is 10 years younger than women in Western
countries. An assessment of existing levels of cancer awareness is a pre-requisite for planning
comprehensive health programmes, early detection and treatment campaigns, that effectively
engage communities of women and men.

Distinguishing between the positive (malignant) and negative (benign) classes can be
difficult. Classification is a kind of complex optimization problem, and many ML
techniques have been applied by researchers to solve it. In the
following sections, the classification methods applied to
BC are explained.

We focus on k-nearest neighbor (k-NN) techniques, as they are among the main methods
used in BC diagnosis and prognosis. Researchers strive to find the algorithm that achieves the
most accurate classification result; however, data of variable quality also influences the
result, and the scarcity of data limits how widely an algorithm can be
applied. Most ML techniques are first tested on open-source databases, and
over time a benchmark dataset has arisen in the literature. K-NN is one of the most prominent
classification algorithms because it is simple, effective and, on many problems, more accurate than other
classification algorithms. It does not require any assumption about the data
distribution, as it is a non-parametric algorithm. For these reasons, k-NN is one of
the most interesting algorithms in machine learning. Accuracy alone, however, cannot distinguish
between false positives and false negatives, so it does not show the performance of the k-NN
classifier on the positive and negative classes separately.
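
The limitation of accuracy can be made concrete with the usual confusion-matrix measures. The symbols TP, TN, FP and FN (true positives, true negatives, false positives and false negatives, with malignant as the positive class) and the numbers in the example are introduced here for illustration only and are not results of this project.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
```

For example, on a hypothetical test set of 35 malignant and 65 benign cases, a classifier with TP = 25, FN = 10, TN = 65, FP = 0 and another with TP = 35, FN = 0, TN = 55, FP = 10 both reach 90% accuracy, yet their sensitivities are about 71% and 100% and their specificities are 100% and about 85%.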


1.3 Scope of Project:


Even though medical science has advanced considerably in terms of technology,
there is still a lot to achieve. Breast cancer is one of the most discussed and researched
diseases in medical science. Detection and diagnosis of BC are very tough because
symptoms often do not appear in the early stages. Every
year around 42,000 women die of breast cancer, most of them mature women.

The main scope of this project is to help women diagnose breast
cancer (BC) using their cell reports and to detect the type of tumor, whether it is
non-dangerous and treatable or dangerous, so that they know what measures to
take at an early stage.

The project helps women identify the type of tumor, malignant or
benign. Given data from their cell report, the machine learning algorithm helps to
diagnose it. The K-NN algorithm compares the cell data with the nearest labeled samples in the
training data and assigns the type accordingly, so that preventive measures can be taken at an early stage.


1.4 Motivation of Project:

Breast cancer has been identified as one of the leading causes of death in developing
countries. Earlier detection of cancer can reduce the death rate and shorten the treatment
phase. Because breast cancer is a medical condition, detecting it requires a medical diagnosis. To
make this process much simpler, computer-aided tools are adopted.

According to reports, 23% of women in India are still not aware of breast cancer,
and detection of BC in its early stages is difficult. This is the main reason the disease
becomes severe as it grows. Cancer cells have a high division rate, which is why the disease grows
quite fast. As a result, the mortality rate of women with BC has increased over the years.

Women are often not aware of the early stages of cancer, and this leads to deaths.
Cancer diagnosis after initial treatment is time consuming and the cost of treatment is high, so
it takes time for women to identify the cancer type, and treatment can start only once
the type is known.

To help women identify the cancer type, we are building an application that
uses machine learning techniques, so that they can
be checked in the early stages of the disease simply by entering the values requested by the application
from their reports. This can decrease the death rate due to breast cancer.


1.5 Organization of the Report:

The report is organized as follows:

● Chapter 2 reviews the literature and related work.
● Chapter 3 covers the planning and formulation of the project.
● Chapter 4 describes the methodology of the project:
    ● 4.1 Proposed System
    ● 4.2 Proposed Methodology
    ● 4.3 System Requirements
● Chapter 5 presents the design of the system:
    ● 5.1 System Architecture
    ● 5.2 Data Flow Diagram
● Chapter 6 gives the conclusion and future scope.


CHAPTER 2

Review of Literature/Related Work:

Many studies in the field of breast cancer detection have proposed concepts, methods and
algorithms to detect the disease; some of them are discussed here.

A. Image Processing:

Digital mammographic images are readily available on the internet and can be
downloaded from the respective web addresses. The Digital Database for Screening Mammography
(DDSM), a joint effort of Massachusetts General Hospital,
Sandia National Laboratories and the University of South Florida Computer Science and
Engineering Department, provides approximately 2,500 case studies. The Mammographic Image
Analysis Society (MIAS) Mini-Mammographic Database is another easily accessible source of digital
mammographic images. MIAS is an
organization of UK research groups that has created a database of digital mammograms.

1. Mammography:

Mammography is the most common method of breast imaging. It uses low-dose
X-rays to inspect the human breast. Cancerous masses and calcium deposits look
brighter on the mammogram. This method is good for identifying Ductal Carcinoma In Situ
(DCIS) and calcifications. Currently, mammography is the gold standard method for identifying
early-stage breast cancer before the lesions become clinically palpable. Mammography has
helped to decrease the mortality rate by 25%-30% in screened women when compared with a
control group after 5 to 7 years.

2. MRI:

MRI uses the hydrogen nucleus (a single proton) for imaging because this
nucleus is abundant in water and fat. The magnetic property of the hydrogen nucleus is used to
yield detailed images of any part of the body. The patient being examined is
placed in a magnetic field and a radio-frequency wave is applied to generate high-contrast
images of the breast. In dynamic contrast-enhanced MRI (DCE-MRI), a contrast agent is
injected before the images are captured. This technique has been found to be more complex
than mammography.


B. Artificial Neural Network (ANN):

Artificial neural networks (ANNs), or connectionist systems, are computing systems
vaguely inspired by the biological neural networks that constitute animal brains. The neural
network itself is not an algorithm, but rather a framework for many different machine learning
algorithms to work together and process complex data inputs. Such systems "learn" to perform
tasks by considering examples, generally without being programmed with any task-specific
rules. For example, in image recognition, they might learn to identify images that contain cats
by analyzing example images that have been manually labeled as "cat" or "no cat" and using
the results to identify cats in other images.

In early studies, ANNs gave better diagnostic performance than radiologists
when the network output was compared to the radiologists’ categorical assessment. These studies
used ANNs to predict malignancy with different mammographic elements as inputs, and the
accuracies improved by 3–5% compared with conventional expert
judgment. Thus, ANNs were shown more than twenty years ago to perform well in BC diagnosis
and prognosis. Although ANNs are good predictors in pattern classification
problems, they are not easily explained and are often regarded as “black
boxes”. To give a better understanding of ANNs, a three-phase algorithm was proposed that
unveils their workings by building a weight-decay backpropagation (BP) network, deleting insignificant
connections and extracting rules by recursively discretizing the activation values of the hidden
units. In a series of tests, the rules from this pruned network kept the accuracy as high as that of the
standard ANN. A year later, based on this method, a modified
pruned network was presented with fewer connections between neurons and higher
accuracy.

Fig: 2.1 : Artificial Neural Networks


C. Support Vector Machine (SVM):

In machine learning, support vector machines are supervised learning models with
associated learning algorithms that analyze data used for classification and regression analysis.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier (although methods
such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model
is a representation of the examples as points in space, mapped so that the examples of the
separate categories are divided by a clear gap that is as wide as possible. New examples are
then mapped into that same space and predicted to belong to a category based on which side of
the gap they fall.

The support vector machine is an approach to supervised pattern classification that
has been successfully applied to a wide range of pattern recognition problems; it is also a
training algorithm for learning classification and regression rules from data. SVM is well
suited to working accurately and efficiently with high-dimensional feature spaces. In
addition, SVM is based on strong mathematical foundations and results in simple yet
very powerful algorithms. The standard SVM algorithm builds a binary classifier. A simple
way to build a binary classifier is to construct a hyperplane separating class members from
non-members in the input space. SVM can also find a nonlinear decision function in the input
space by mapping the data into a higher-dimensional feature space and separating it by means of
a maximum-margin hyperplane. The system automatically identifies a subset of informative
points called support vectors and uses them to represent the separating hyperplane, which is
a sparse linear combination of these points.

Fig: 2.2 Support Vector Machine
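
For reference, the maximum-margin hyperplane described above is commonly written as the following optimisation problem for linearly separable data, followed by the resulting decision function. The notation (weight vector w, bias b, training pairs (x_i, y_i) with y_i in {-1, +1}, multipliers alpha_i, kernel K and support-vector index set SV) is standard textbook notation introduced here, not taken from this report.

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n

f(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{i \in SV} \alpha_i\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\Big)
```

Only the support vectors have non-zero alpha_i, which is why the hyperplane is a sparse combination of training points.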


CHAPTER 3

Planning and Formulation:

3.1 Timeline Chart

Activity   Description                            Effort (person-weeks)   Deliverable

Phase I
PI-01      Requirement Analysis                   2 Weeks                 Requirement Gathering
PI-02      Existing System Study and Literature   3 Weeks                 Existing System Study and Literature
PI-03      Technology Selection                   2 Weeks                 Python
PI-04      Modular Specification                  2 Weeks                 Module Specification
PI-05      Design and Modelling                   4 Weeks                 Analysis Report
Total                                             13 Weeks

Table 3.1.1 Phase I

Activity   Description                            Effort (person-weeks)   Deliverable

Phase II
PII-01     Detailed Design                        2 Weeks
PII-02     UI and User Interaction Design         Included in above       UI document
PII-03     Coding & Implementation                15 Weeks                Code Release
PII-04     Testing & Bug Fixing                   4 Weeks                 Test Report
PII-05     Release                                Included in above       System Release
Total                                             13 Weeks                Deployment efforts are extra

Table 3.1.2 Phase II


Classification is a kind of complex optimization problem. Many ML techniques have been
applied by researchers to solve this classification problem. We
focus on k-nearest neighbor (k-NN) techniques, as they are among the main methods used in BC
diagnosis and prognosis. Researchers strive to find the algorithm that achieves the most accurate
classification result; however, data of variable quality also influences the classification
result, and the scarcity of data limits how widely an algorithm can be applied.
Most ML techniques are first tested on open-source databases. Over time, a benchmark
dataset has arisen in the literature: the Wisconsin Breast Cancer Dataset (WBCD). There are also
other BC benchmark data sets, for instance Wisconsin Prognostic Breast Cancer
Chemotherapy (WPBCC) and Wisconsin Diagnostic Breast Cancer (WDBC). ML
techniques that have been used on the WBCD database in BC diagnosis and prognosis show
different levels of accuracy, ranging between 94.36% and 99.90%. Similarly, there are
results with differently modified algorithms on other BC databases. This section attempts to
provide readers with the essential elements of BC diagnosis and prognosis using ML techniques
on WBCD. By using ML techniques to analyse the WBCD database, BC can be diagnosed
accurately based on the 9 attributes listed in Table 3.1.3. The remainder of this
chapter concentrates on how the WBCD has been used to illustrate the promise of
ML algorithms.

No.   Attribute Description            Value Range

1     Clump Thickness                  1 – 10
2     Uniformity of Cell Size          1 – 10
3     Uniformity of Cell Shape         1 – 10
4     Marginal Adhesion                1 – 10
5     Single Epithelial Cell Size      1 – 10
6     Bare Nuclei                      1 – 10
7     Bland Chromatin                  1 – 10
8     Normal Nucleoli                  1 – 10
9     Mitoses                          1 – 10

Table 3.1.3 Breast Cancer Data Sets


Clump thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are
often grouped in multilayers.

Uniformity of cell size/shape: Cancer cells tend to vary in size and shape. That is why these
parameters are valuable in determining whether the cells are cancerous or not.

Marginal adhesion: Normal cells tend to stick together, whereas cancer cells tend to lose this ability,
so loss of adhesion is a sign of malignancy.

Single epithelial cell size: Related to the uniformity mentioned above. Epithelial cells that
are significantly enlarged may be malignant.

Bare nuclei: A term used for nuclei that are not surrounded by cytoplasm (the rest of the
cell). These are typically seen in benign tumours.

Bland chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer
cells the chromatin tends to be coarser.

Normal nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the
nucleolus is usually very small if visible at all. In cancer cells the nucleoli become more
prominent, and sometimes there are more of them.


K-Nearest Neighbor:

The k-nearest neighbors algorithm is one of the most used algorithms in machine learning. It
is an instance-based learning method that does not require a learning phase. The model consists of the training
sample, together with a distance function and a rule that chooses the class based on the
classes of the nearest neighbors. To classify a new element, we
compare it to the other elements using a similarity measure. Its k nearest neighbors are then
considered, and the class that appears most often among them is assigned to the element to be
classified. The neighbors may be weighted by their distance to the new element.

k-NN is one of the most central ML techniques in classification. It is a non-parametric
lazy learning algorithm that classifies objects using
their “k” nearest neighbours. k-NN only considers the neighbours around the object, not the
underlying data distribution, and there is no training phase on the training data.
Figure 3.1 presents an example of k-NN for BC diagnosis and
prognosis: if k = 3, the test sample (circle) is assigned to malignant BC (triangle) because there
are 2 triangles and only 1 square inside the inner circle.

Fig No: 3.1 KNN Malignant and Benign Tumours

If k = 5, the test sample is assigned to benign BC (square). In the figure, the green circle
is the test sample, red triangles are malignant BC and blue
squares are benign BC. k-NN related algorithms have a number of applications in BC
diagnosis and prognosis. The quality of the classification depends on the selection of k. In
2000, the k-NN and fuzzy k-NN algorithms were implemented to classify the WBCD;
values of k from 1 to 15 were considered, and the best performance was obtained
with k = 1.
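
The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names and the six two-dimensional points are hypothetical, chosen so that the prediction changes between k = 3 and k = 5 in the same way as in the figure.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical points: the two closest are malignant, the next three benign.
toy_train = [
    ((1.0, 0.0), "malignant"),
    ((0.0, 1.2), "malignant"),
    ((-1.5, 0.0), "benign"),
    ((0.0, -2.0), "benign"),
    ((2.5, 0.0), "benign"),
    ((5.0, 5.0), "malignant"),
]
query = (0.0, 0.0)
print(knn_predict(toy_train, query, k=3))  # malignant (2 votes to 1)
print(knn_predict(toy_train, query, k=5))  # benign (3 votes to 2)
```

A distance-weighted variant would replace the plain vote count with a sum of weights such as 1/distance per neighbour.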


CHAPTER 4

Methodology

4.1 Proposed System:

The k-nearest neighbors (KNN) algorithm is one of the simplest similarity-based machine
learning algorithms and performs well in many contexts. The value
of k must be chosen a priori; various techniques have been proposed to select it, such as
cross-validation and heuristics. This value should not be a multiple of the number of classes, to avoid
tie votes. Thus, for binary classification, k should be odd so
that a majority necessarily emerges.
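
The tie-vote argument can be illustrated with a short sketch. The helper names `vote` and `candidate_k_values` are hypothetical; the second simply applies the rule stated above (skip multiples of the number of classes), which for two classes leaves the odd values of k.

```python
from collections import Counter

def vote(labels):
    """Return (winner, is_tie) for a list of neighbour labels."""
    counts = Counter(labels).most_common()
    is_tie = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], is_tie

def candidate_k_values(max_k, n_classes=2):
    """Values of k up to max_k that are not multiples of the number of classes."""
    return [k for k in range(1, max_k + 1) if k % n_classes != 0]

# With k = 4 the two classes can split 2-2; with k = 3 a majority always exists.
print(vote(["benign", "malignant", "benign", "malignant"]))
print(vote(["benign", "malignant", "benign"]))
print(candidate_k_values(10))
```

The candidate list produced this way can then be passed to a cross-validation loop that keeps the k with the best validation score.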

• The proposed system is a cancer prediction system built on the k-nearest neighbor
algorithm, with Python as the base language.

• It reduces the error caused by human intervention in cancer prediction and increases the
accuracy of prediction and diagnosis of the disease.

• K-Nearest Neighbour (K-NN) algorithms are used because they provide high levels of accuracy in
prediction.

Fig: 4.1 Python DataFlow


4.2 Proposed Methodology:

The proposed methodology is shown in Figure 4.2. The main idea is to use the k-NN algorithm to
predict the class labels in the test set. Then, for each prediction, the conformal prediction
algorithm is applied to calculate a non-conformity score, which is used to
calculate the confidence of the prediction.

Fig: 4.2 Proposed Method Flowchart

K-Nearest Neighbors (KNN) is a lazy classifier that is widely used in data
mining applications. In this work, we implement the KNN algorithm using Euclidean distance
as the similarity measure, with K = 7.
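
The report does not spell out the conformal prediction step, so the following is only a sketch under stated assumptions: it uses the common nearest-neighbour non-conformity score (distance to same-class neighbours divided by distance to other-class neighbours), an inductive setup with a separate calibration set, and confidence defined as one minus the second-largest p-value. All function names and the toy points are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nonconformity(train, x, label, k):
    """Sum of distances to the k nearest same-label points divided by
    the sum of distances to the k nearest other-label points."""
    same = sorted(dist(x, f) for f, l in train if l == label)[:k]
    other = sorted(dist(x, f) for f, l in train if l != label)[:k]
    return sum(same) / max(sum(other), 1e-12)

def p_values(train, calibration, x, labels, k):
    """Inductive conformal p-value of each candidate label for x."""
    cal_scores = [nonconformity(train, f, l, k) for f, l in calibration]
    result = {}
    for label in labels:
        score = nonconformity(train, x, label, k)
        at_least = sum(1 for s in cal_scores if s >= score)
        result[label] = (at_least + 1) / (len(cal_scores) + 1)
    return result

def predict_with_confidence(train, calibration, x, labels, k=7):
    """Return the label with the largest p-value and 1 - second-largest p-value."""
    p = p_values(train, calibration, x, labels, k)
    ranked = sorted(labels, key=lambda l: p[l], reverse=True)
    return ranked[0], 1.0 - p[ranked[1]]

# Hypothetical two-cluster data; k=2 is used because the toy set is tiny.
train = [((1, 1), "benign"), ((1, 2), "benign"), ((2, 1), "benign"), ((2, 2), "benign"),
         ((8, 8), "malignant"), ((8, 9), "malignant"), ((9, 8), "malignant"), ((9, 9), "malignant")]
calibration = [((1.5, 1.5), "benign"), ((2, 1.5), "benign"),
               ((8.5, 8.5), "malignant"), ((9, 8.5), "malignant")]
label, confidence = predict_with_confidence(
    train, calibration, (1.5, 1.6), ["benign", "malignant"], k=2)
print(label, confidence)
```

With only four calibration points the smallest possible p-value is 1/5, so the confidence cannot exceed 0.8 here; a realistic calibration set would be much larger.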


Breast Cancer Dataset

The studies in this report are conducted on the Wisconsin Breast Cancer Dataset (WBCD) from the UCI
repository [18]. This dataset has 699 clinical cases, each labeled as malignant (cancerous)
or benign (non-cancerous). The numbers of malignant and benign cases are 241 (34.5%) and 458
(65.5%), respectively. The dataset has 16 samples (cases) with missing values;
removing them decreases the sample size to 683. Every sample has 11
columns: the first is the sample ID, the next nine are the attributes in Table 3.1.3, and the last is the class label, which takes two
values: 2 for benign and 4 for malignant. Stratified 10-fold cross-validation
has proved in practice to be one of the best evaluation methods because of its low bias and variance. After dividing the dataset
into ten folds, the first fold is selected for testing and the other nine folds are combined for
training. The numbers of test and training samples are 69 and 614 in each run. The
numbers of positive (malignant) and negative (benign) training samples are 215 and 399.
The standard deviations of the positive and negative classes are 8.269 and 3.143, respectively.
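
The cleaning step described above (dropping samples with missing values and reading the class label) might look as follows. This sketch assumes the comma-separated layout of the UCI file, with 11 fields per row and `?` marking a missing value; the function name `parse_wbcd` and the three sample rows are illustrative.

```python
def parse_wbcd(lines):
    """Parse rows of the form id,f1,...,f9,class into (features, label) pairs.

    Rows containing '?' (a missing value) are skipped. Class 2 is mapped to
    0 (benign) and class 4 to 1 (malignant).
    """
    samples = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 11 or "?" in fields:
            continue
        features = [int(v) for v in fields[1:10]]  # drop the sample ID
        label = 1 if fields[10] == "4" else 0
        samples.append((features, label))
    return samples

# Illustrative rows in the assumed layout; the third has a missing value.
rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1017122,8,10,10,8,7,10,9,7,1,4",
    "1057013,8,4,5,1,2,?,7,3,1,4",
]
data = parse_wbcd(rows)
print(len(data), data[0], data[1])
```

Applied to the full file, this kind of filter is what reduces the 699 cases to 683.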

After that, every sample in the test fold is classified by finding its K nearest samples in the
training set, and the accuracy, sensitivity, and specificity are measured for the
selected test fold. This process is repeated ten times, selecting each fold exactly once for
testing, which gives ten values each for accuracy, sensitivity, and specificity.
To make the outcome more reliable, these steps are repeated 100 times, with
the samples randomly reassigned to the folds each time.
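
The evaluation procedure above can be sketched as follows. The function names are hypothetical, `classify` stands for any classifier taking a training list and one feature vector, and the class counts used in the check (239 malignant and 444 benign after cleaning) are an assumption that is consistent with the 683 samples and the 215/399 training counts reported above.

```python
import random

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def fold_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for one test fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

def repeated_cv(features, labels, classify, n_folds=10, repeats=100):
    """Run stratified k-fold CV `repeats` times; return one metric tuple per fold."""
    results = []
    for r in range(repeats):
        for test_idx in stratified_folds(labels, n_folds, seed=r):
            test = set(test_idx)
            train = [(features[i], labels[i]) for i in range(len(labels)) if i not in test]
            y_true = [labels[i] for i in test_idx]
            y_pred = [classify(train, features[i]) for i in test_idx]
            results.append(fold_metrics(y_true, y_pred))
    return results

# Check the fold sizes on the assumed class counts.
labels = [1] * 239 + [0] * 444
first_fold = stratified_folds(labels)[0]
print(len(first_fold), sum(labels[i] for i in first_fold))  # 69 test samples, 24 malignant
```

Averaging the tuples returned by `repeated_cv` over all folds and repetitions gives the summary values, and their spread gives the stability of the classifier on each class.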

Fig: 4.3 Wisconsin Breast Cancer Dataset Attributes

User Breast Cancer Data

The user will input the parameters of the cell mentioned in the report. These parameters will
decide the type of cancer.
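
A minimal sketch of how the user's values could be checked before classification, assuming the nine attributes and the 1 to 10 range of Table 3.1.3. The names `ATTRIBUTES` and `validate_report` are hypothetical and not part of the project code.

```python
ATTRIBUTES = [
    "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
    "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
    "Bland Chromatin", "Normal Nucleoli", "Mitoses",
]

def validate_report(values):
    """Return the nine attribute values as a list of ints, or raise ValueError."""
    missing = [name for name in ATTRIBUTES if name not in values]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    features = []
    for name in ATTRIBUTES:
        value = int(values[name])
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
        features.append(value)
    return features

# Hypothetical report values entered by a user.
report = dict(zip(ATTRIBUTES, [5, 1, 1, 1, 2, 1, 3, 1, 1]))
print(validate_report(report))
```

The returned list has the same order as the training features, so it can be passed directly to the classifier.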


User Data Analysis

First, the sensitivity, specificity and accuracy for different values of K between
1 and 614 are reported to show the individual performance of the classifier on the positive and
negative classes. After that, the maximum values, minimum values, and standard deviations for the
positive and negative classes are examined to show the stability of the classifier over both
classes. Then the system detects the cancer type for the values given by the user, based on the
training dataset.

There are various factors and variables that define cancer cells. The genomic data is collected
with biological knowledge and stored in a database, collectively called the dataset. This
module’s purpose is to connect to the dataset so that it can be processed to predict cancer. There
are 32 variables that contribute to the tumor’s initiation and progression, which are recorded
and stored in the dataset; the variables include radius, texture, volume, size, etc. of the cancer
cells.


4.3 System Requirements

Hardware:

• Pentium IV or higher (PIV 3.00 GHz recommended)
• 256 MB RAM
• 100+ free hard drive space

Software:

• Web browser: Microsoft Internet Explorer, Mozilla Firefox, Google Chrome
• Python
• Jupyter Notebook
• Anaconda
• Microsoft Excel
• Operating system: Windows XP / Windows 7 / Windows Vista / Ubuntu


Chapter 5

Design Of System

5.1 System Architecture

There are 10 variables that contribute to the tumor’s initiation and progression, which
are recorded and stored in the dataset; they include radius, texture, volume, size, etc.
of the cancer cells. The dataset is loaded as a training set in Python and the k-NN algorithm is
applied to it to obtain the predicted outcome. The input is taken from the user in the form of cell
parameters. The architecture of the system is kept as simple as possible to make it accessible
to a wide range of users and to maintain a simple user interface.

Fig: 5.1.1 System Architecture


5.2 Data Flow Diagram

Fig: 5.2.1 Data Flow Diagram


Chapter 6

Conclusion and future scope

We have proposed a Breast Cancer Detection System based on the KNN algorithm, which is a
machine learning technique. This approach increases the chances of detecting breast cancer in
its early stages, so that women can start treatment early and be cured, or at least reduce the risk of death.
This will in turn reduce the mortality rate and help spread awareness
of breast cancer diagnosis among women.


References
1. Roshanpoor A, Safdari R. The Performance of K-Nearest Neighbors on Malignant and Benign Classes. 2017.
2. Dheeraj R, Hariprasath R, Akshay Kannan V, Nishanth Kumar S. Cancer Prediction Using KNN. 2018.
3. Al-Hadidi MR, Alarabeyyat A, Alhanahnah M. Breast Cancer Detection Using K-Nearest Neighbor
Machine Learning Algorithm. 2016.
4. Medjahed SA, Saadi TA, Benyettou A. Breast Cancer Diagnosis by using k-Nearest
Neighbor with Different Distances and Classification Rules. Int J Comput Appl. 2013;62.
5. Zhang H, Arslan T, Flynn B. A Single Antenna Based Microwave System for Breast
Cancer Detection: Experimental Results. IEEE, 2013.
6. Gupta S, Kumar D, Sharma A. Data Mining Classification Techniques Applied for Breast
Cancer Diagnosis and Prognosis.

