
DISEASE IDENTIFICATION AND TREATMENT USING DATA MINING

A project report submitted in partial fulfillment of the requirements


for the award of the degree of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING

BY

Dhara Advani (BE/25022/14)
Shubham Israni (BE/25064/14)
Sakshi Soni (BE/25088/14)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


BIRLA INSTITUTE OF TECHNOLOGY, MESRA
JAIPUR CAMPUS, JAIPUR
MO-2017
DECLARATION CERTIFICATE
This is to certify that the work presented in the project entitled “Disease Identification and
Treatment using Data mining” in partial fulfillment of the requirement for the award of Degree
of Bachelor of Engineering in Computer Science and Engineering of Birla Institute of
Technology, Mesra, Ranchi, Extension Center Jaipur is an authentic work carried out under my
supervision and guidance.

To the best of my knowledge, the content of this project does not form a basis for the award of any
previous degree to anyone else.

Date:

Mrs. Seema Gaur


(Assistant Professor)
Birla Institute of Technology, Mesra, Ranchi
Extension Centre, Jaipur
Table of Contents

1. Introduction with problem definition, domain, requirements


2. Objectives
3. Software Requirement Specification
3.1 purpose
3.2 Intended Audience and Reading Suggestions
3.3 Product Scope
3.4 Overall Description
3.4.1 Product Perspective
3.4.2 Product Function
3.5 Operating Environment
3.6 Design and Implementation constraints
3.7 Assumption and Dependencies
3.8 System Interface
3.8.1 Use case diagram and characteristic
3.8.2 Software requirements
3.8.3 Hardware requirements
3.8.4 Communication interface
3.9 Functional Requirements
3.9.1 User interface
3.9.2 System Interface
3.10 Other Non-Functional Requirements
3.10.1 Performance
3.10.2 Availability
3.10.3 Security
3.10.4 Maintenance
3.10.5 Scalability
4. Software Design Specification
4.1 Use Case Diagram
4.2 Data Flow Diagram
4.3 System Architecture Description
4.4 Detail Description of Components
4.4.1 Login description
4.4.2 Disease Identification
4.4.3 Treatment Analysis
4.5 Data Representation
4.6 Design Decision and Tradeoffs
5. Implementation
5.1 Algorithm
5.1.1 Classification Algorithm
5.1.2 Bayesian Classifier
5.1.3 Issue with the Bayesian Classifier
5.1.4 Naive Bayesian Classifier
5.2 Main Method
5.3 Output
5.3.1 Output1
5.3.2 Output2
5.4 Decision Tree

6. Description of work to be accomplished further in points


6.1 Treatment Identification
6.2 Login Interface

7. References
1. Introduction

People are more concerned about their health than ever before. MEDLINE, a database containing an
enormous number of life-science articles, is the richest and most widely used source of medical
information. People want fast access to reliable information, delivered in a manner that suits
their habits and workflow. The medical field has grown to such an extent that practitioners need
not only experience but also knowledge of the latest discoveries. There should therefore be a
more efficient and reliable way to find the cause and cure of diseases through computational
technology. Our project addresses this problem.

2. Objective

The goal of machine learning is to construct computer systems that can adapt and learn from
experience. A machine learning approach helps integrate computer-based systems into the healthcare
field in order to obtain accurate results. Here the system deals with the automatic identification
of informative sentences from articles published in medical journals. Our main aim is to apply
machine learning in the medical field and build an application that is capable of automatically
identifying and disseminating disease- and treatment-related information; it also identifies the
semantic relations that exist between diseases and treatments.
The tasks undertaken here are:
1. To identify the disease based on the symptoms of the patient.
2. To find the cures of that disease.
3. To find the best and most effective cure.
4. To extract relevant information about that cure.
3. Software Requirement Specification

3.1 Purpose

The purpose of this document is to give a detailed description of the requirements for the “Disease
Identification and Treatment using Machine Learning” release version 1.0. This Software
Requirements Specification covers only the main system and does not describe the implementation
of the database with which the main system interacts. This document provides:
● A description of the environment in which the application is expected to operate.
● A definition of the application's capabilities.
● A specification of the application's functional and non-functional requirements.
All the requirements stated in this document are slated for implementation in version 1.0, unless
otherwise specified.

3.2 Intended Audience and Reading Suggestions

The intended audience is anyone interested in implementing and learning more about machine
learning and its use in identifying cures for diseases. The document is to be used by the
software engineering professor to evaluate the software’s design and features. The application
maintainers will also review the document to clarify their understanding of what the application
does.

3.3 Product Scope

The system will allow people to find and diagnose their disease simply by giving their symptoms
as input.
Given a large dataset, the tasks undertaken are:
1. To identify the disease based on the symptoms of the patient.
2. To find the cures of that disease.
3. To find the best and most effective cure.
4. To extract relevant information about that cure.
The interface has to be simple to use, as the target end users of the system are non-technical
persons. The system aims to automate the analysis of symptoms, compute the relevant disease, and
present its cure.

3.4 Overall Description

3.4.1 Product Perspective

People are more concerned about their health than ever before. MEDLINE, a database containing an
enormous number of life-science articles, is the richest and most widely used source of medical
information. People want fast access to reliable information, delivered in a manner that suits
their habits and workflow. The medical field has grown to such an extent that practitioners need
not only experience but also knowledge of the latest discoveries. There should therefore be a
more efficient and reliable way to find the cause and cure of diseases through computational
technology. Our project addresses this problem.

3.4.2 Product Functions

The main features of the system are:

● Symptoms will be classified and analyzed using a combination of classification algorithms, and
the best-matching disease will be diagnosed.
● The possible cures of the disease will be found and compared according to success rate, cost,
and side effects using the R language.
● Information about the best cure will be extracted and simplified using natural language
processing.

[Figure: processing pipeline: Load Dataset → Preprocess Data (Formatting, Cleaning) → Apply Data
Mining Algorithm → Result Analysis]
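
The four pipeline stages (load, preprocess, apply algorithm, analyze) can be sketched roughly as
follows. This is only an illustrative sketch in Python 3: the in-memory rows, the function names,
and the simple counting step standing in for the mining algorithm are all assumptions, not part of
the project's code.

```python
from collections import Counter

# Hypothetical raw (disease, symptom) rows with inconsistent formatting.
RAW_ROWS = [
    ("Influenza ", "Fever"),
    ("influenza", " cough"),
    ("Migraine", "Headache"),
    ("migraine", "nausea "),
    ("", "fever"),  # incomplete row, dropped during cleaning
]

def load_dataset():
    """Stage 1: load the dataset (here an in-memory stand-in for a file)."""
    return list(RAW_ROWS)

def preprocess(rows):
    """Stage 2: formatting (trim, lowercase) and cleaning (drop incomplete rows)."""
    formatted = [(d.strip().lower(), s.strip().lower()) for d, s in rows]
    return [(d, s) for d, s in formatted if d and s]

def apply_algorithm(rows):
    """Stage 3: placeholder mining step that counts symptoms per disease."""
    counts = {}
    for disease, symptom in rows:
        counts.setdefault(disease, Counter())[symptom] += 1
    return counts

def analyze(counts):
    """Stage 4: summarize the result as disease -> sorted list of symptoms."""
    return {disease: sorted(c) for disease, c in counts.items()}

summary = analyze(apply_algorithm(preprocess(load_dataset())))
print(summary)
```

Keeping each stage as a separate function makes it possible to swap the placeholder mining step
for a real classifier without touching the loading or cleaning code.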
3.5 Operating Environment

The model will run on both 32-bit and 64-bit machines, under either Linux or Windows.

3.6 Design and Implementation Constraints

In this project we will use Python modules for the data mining techniques. As the dataset, we use
a knowledge database of disease-symptom associations generated by an automated method from the
textual discharge summaries of patients admitted to New York Presbyterian Hospital during 2004.

3.7 Assumptions and Dependencies

The system depends on the database, so it is assumed that the database will be maintained and
administered properly. The model only suggests likely diseases, because it works from the symptoms
supplied as input. Users are assumed to have basic computer skills.

3.8 System Interaction

3.8.1 Use Case Diagram and Characteristics

● Register - A new user registers by providing his/her details, which are stored in the database.
● Login - The user logs in with a username and password, and the login data is verified against
the database.
● Input - The user enters his/her symptoms, and the disease along with its cure is retrieved from
the New York Presbyterian Hospital dataset and shown in a graphical representation.
3.8.2 Software Requirements

● Windows or Linux
● IDLE (Python 2.7.4)
● MS Excel

3.8.3 Hardware Requirements

● 32-bit or 64-bit machine.
● RAM of 2 GB or more.
● Hard disk capacity of 20 GB or more.
● Intel i3 processor (minimum).

3.8.4 Communications Interfaces

No communication interfaces will be required for this model.

3.9 Functional Requirements

3.9.1 User Interface

An existing user logs into the system with a username and password; a new user first registers by
providing the required personal details. Users can then enter their symptoms directly and get a
clear representation of the disease and the cure related to the input. Each user's login details
and previous search results will be kept in a separate database for use in further
recommendations.

3.9.2 System Interface

The system will provide a user name and ID and will maintain user session logs for security and
privacy.
For each user, new or existing, the system shall ask for the symptoms, list the diseases that such
symptoms may indicate, and give the appropriate cure for each as inferred from the dataset.
There will also be a logout button that redirects back to the login/signup page.

3.10 Other Non – Functional Requirements

3.10.1 Performance

The system should recommend the correct result while displaying only a very small percentage of
irrelevant data. Web crawling should also be done properly so as to extract all possible diseases
related to each symptom.

3.10.2 Availability

For the system to be available to users at all times, an online server that provides 24-hour
service every day is required.

3.10.3 Security

The passwords of all users should be stored securely and not exposed to anyone.
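
One common way to meet this requirement is to store a salted hash instead of the password itself.
The sketch below uses PBKDF2 from Python 3's standard `hashlib` module; the function names, the
16-byte salt, and the iteration count are assumptions made for illustration, since the report does
not specify a storage scheme.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, not taken from the report

def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected_digest):
    """Re-derive the digest from the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)

# Only the salt and the digest would be written to the user table.
salt, stored = hash_password("s3cret")
print(verify_password("s3cret", salt, stored))
print(verify_password("wrong", salt, stored))
```

With this scheme the database never holds the plain-text password, so a leaked user table does
not directly expose login credentials.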

3.10.4 Maintenance

The passwords of all users should be stored securely and not exposed to anyone, ensuring a
satisfying experience.

3.10.5 Scalability

The system will recommend the best cure depending on the number of accurate and relevant inputs.
4. Software Design Specification

4.1 Use Case Diagram


4.2 Data Flow Diagram
4.3 System architecture description

This section, the high-level design, is the main focus of the first version of the SDS. It should
give a good view of the exact organization of the system as per the requirements.

4.3.1 Overview of modules / components

This project can be divided into two major components, the client side and the server side, which
are described below:
● On the client side the interface will be API based. This module can be further divided into
the following components: registration, login, identify disease, treatments, and best
treatment. All these components will be in the form of .
● On the server side the main components are the datasets: the historical dataset containing
the medical records, and the target dataset. Everything will be stored in the database, and the
treatment will be extracted from the dataset on the server side.

4.3.2 User interface issues

The GUI will be interactive and self-explanatory. The user-side interface will be optimized for
ease of use.

4.4 Detailed description of components

4.4.1 Login Description

Identification: Login
Type: Module
Purpose: The purpose of this module is to provide entry to the program. Based on the type of
login, the user is provided with various facilities and functionalities.
Function: The main function of this module is to allow the user to use the system.
Subordinates: A general user will have access to the disease detection and treatment
identification mechanism.
Dependencies: Login depends on the server; the system should be connected to the server.
Interfaces: A computer will be used to access the application and log in.
Resources: A regular PC with input and output devices will be required. The PC should have an
operating system with a browser, and it should be connected to the server.
Processing: When a user requests login from the server, two scenarios arise: a new user who
wishes to register is taken to the new-user registration page, while an existing user's username
and password are checked against the database by the server. If all entries are correct, the user
is redirected to the home page.
Data: On the servers, user data will be maintained in a MySQL database with relevant fields, and
it will be matched with the data entered by the user.

4.4.2 Disease Identification

Identification: Disease Identification
Type: Module
Purpose: The purpose of this module is to classify the symptoms in order to identify the disease.
Function: The main function of this module is to run a classification on the symptoms entered to
determine the possible disease indicated by them.
Interfaces: A computer will be used to access the application and the module.
Resources: A regular PC with input and output devices will be required.
Processing: The user will be asked to enter the symptoms through a text field; the classification
algorithm will then search for the symptoms in the MEDLINE dataset and return the closest-matching
disease.
Data: On the servers, user data will be maintained in a MySQL database with relevant fields. When
a new disease is searched, the data will be added to the user's table. When a previously searched
disease is accessed, the data will be retrieved from the database and displayed on the screen.
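
The processing step above can be illustrated with a small sketch. The report does not say how the
"closest match" is measured, so this example assumes Jaccard similarity between the entered
symptoms and each disease's symptom set; the knowledge base, the comma-separated input format, and
all function names are hypothetical (Python 3).

```python
# Hypothetical disease -> symptom sets; a real system would load these from the dataset.
KNOWLEDGE_BASE = {
    "influenza": {"fever", "cough", "fatigue"},
    "migraine": {"headache", "nausea", "light sensitivity"},
    "common cold": {"cough", "sneezing", "sore throat"},
}

def parse_symptoms(text):
    """Split comma-separated text-field input into a normalized set of symptoms."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def closest_diseases(text, kb=KNOWLEDGE_BASE):
    """Return (score, disease) pairs, best match first, ties broken alphabetically."""
    entered = parse_symptoms(text)
    scored = [(jaccard(entered, symptoms), disease) for disease, symptoms in kb.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

ranking = closest_diseases("Fever, cough")
print(ranking[0][1])
```

Returning the full ranking rather than a single disease lets the interface list several
candidate diseases, as the functional requirements describe.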

4.4.3 Treatment Analysis.

Identification: Treatment
Type: Module
Purpose: The purpose of this module is to find the best cure of the disease from the dataset.
Function: The main function of this module is to allow the user to get the treatment of the
disease from the MEDLINE dataset.
Interfaces: A computer will be used to access the module and view the treatment analysis.
Resources: A regular PC with input and output devices will be required. The PC should have an
operating system with a browser, and it should be connected to the server.
Processing: The identified disease will be searched for its treatments using the MEDLINE dataset.
The best one will be identified based on success rate, side effects, etc., and will be presented
to the user.
Data: On the servers, user data will be maintained in a MySQL database with relevant fields, and
all the information regarding a particular problem will be fetched whenever a user tries to
access it.
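
Choosing the best treatment from several criteria can be sketched as a weighted score. This is a
hypothetical illustration in Python 3 (the report plans to do the comparison in R): the criteria
are normalized to the range 0..1, and the weights, treatment names, and figures are invented
placeholders rather than project data.

```python
from dataclasses import dataclass

@dataclass
class Treatment:
    name: str
    success_rate: float  # fraction of successful outcomes, 0..1
    cost: float          # normalized cost, 0 (cheap) .. 1 (expensive)
    side_effects: float  # normalized severity, 0 (none) .. 1 (severe)

# Assumed weights: success rate counts for, cost and side effects count against.
WEIGHTS = {"success_rate": 0.6, "cost": 0.2, "side_effects": 0.2}

def score(t):
    """Higher is better."""
    return (WEIGHTS["success_rate"] * t.success_rate
            - WEIGHTS["cost"] * t.cost
            - WEIGHTS["side_effects"] * t.side_effects)

def best_treatment(treatments):
    """Return the treatment with the highest weighted score."""
    return max(treatments, key=score)

# Placeholder candidates for one identified disease.
candidates = [
    Treatment("treatment A", success_rate=0.80, cost=0.70, side_effects=0.50),
    Treatment("treatment B", success_rate=0.75, cost=0.20, side_effects=0.10),
    Treatment("treatment C", success_rate=0.60, cost=0.10, side_effects=0.05),
]
best = best_treatment(candidates)
print(best.name)
```

In this example the treatment with the highest success rate is not selected, because its cost and
side effects outweigh the small gain; how the criteria should actually be weighted is a design
decision the report leaves open.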
4.5 Data Representation:

4.5.1 Bag-of-Words model:

The bag-of-words model is a simplifying representation used in natural language processing and
information retrieval. In this model, a text (such as a sentence or a document) is represented as
an unordered collection (bag) of its words, disregarding grammar and even word order but keeping
the number of times each word occurs. Recently, the bag-of-words model has also been used for
computer vision.[8] The bag-of-words model is used in some document classification methods. When a
Naïve Bayes classifier is applied to text documents, for example, the model assumes that terms are
conditionally independent given the class. Other methods that use this model are latent Dirichlet
allocation and latent semantic analysis.
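
As a small illustration of the representation, the following Python 3 sketch builds a bag of
words with the standard `collections.Counter`. The tokenizer (lowercase alphabetic runs) and the
example sentences are assumptions chosen for brevity.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each word to its number of occurrences; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

a = bag_of_words("Fever and cough, then more fever")
b = bag_of_words("more fever then cough and fever")

print(a["fever"])  # multiplicity is kept
print(a == b)      # same words in a different order give the same bag
```

The two sentences differ only in word order, so they map to the same bag, which is exactly the
information the model chooses to discard.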

4.6 Design decisions and tradeoffs

The system is designed in a simple and elegant way so that everyone can use it easily. It is
designed to be responsive, so that the interface adapts to the resolution of each device.

5. Implementation

5.1 Algorithm

5.1.1 Classification Algorithm

In the machine learning approach, expertise and previous research provide the guidance for solving
new tasks. The models described should be able to identify informative sentences and the relations
between entities, and they should be chosen so as to achieve high performance. A set of six
representative models can be used as classification algorithms: an adaptive learning model
(AdaBoost); a decision-based model (decision trees); two probabilistic models, Naive Bayes (NB)
and Complement Naive Bayes (CNB), the latter adapted for text with an imbalanced class
distribution; a linear classifier (support vector machine (SVM) with polynomial kernel); and a
classifier that always predicts the majority class in the training data, which serves as a
baseline. These classifiers are chosen to cover both long and short texts. Probabilistic models
based on Naive Bayes are widely used in text classification tasks. Decision trees are used for
short texts. Adaptive learning algorithms focus mainly on hard concepts, such as those that are
underrepresented in the available data because the dataset is unbalanced.
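
The simplest of the six models, the majority-class classifier, is useful as a reference point:
any real classifier should beat it. A minimal sketch in Python 3 follows; the class name, the
label values, and the tiny label lists are hypothetical.

```python
from collections import Counter

class MajorityClassifier:
    """Baseline that ignores the input and always predicts the most frequent training label."""

    def fit(self, labels):
        if not labels:
            raise ValueError("at least one training label is required")
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        """Return the majority label for each of n unseen examples."""
        return [self.majority_] * n

def accuracy(predicted, actual):
    """Fraction of predictions that equal the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical sentence labels.
train_labels = ["informative", "other", "other", "other", "informative"]
test_labels = ["other", "informative", "other", "other"]

baseline = MajorityClassifier().fit(train_labels)
predictions = baseline.predict(len(test_labels))
print(accuracy(predictions, test_labels))
```

On an imbalanced dataset this baseline can reach a high accuracy without learning anything,
which is why the other five models are compared against it.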

5.1.2 Bayesian Classifier

In Bayesian networks (see e.g. Pearl (1988)) statistical dependencies are represented visually as a
graph structure. The idea is that we take into account all information about conditional
independencies and represent a minimal dependency structure of attributes. Each vertex in the
graph corresponds to an attribute and the incoming edges define the set of attributes, on which it
depends. The strength of dependencies is defined by conditional probabilities. For example, if A1
depends on attributes A2 and A3, the model has to define conditional probabilities P(A1|A2, A3)
for all value combinations of A1, A2 and A3.

5.1.3 Issue with the Bayesian Classifier

However, there is a high risk that the selected network structure imposes irrelevant dependencies
while missing dependencies that are actually strong. Once the structure has been selected, the
parameters are learnt from the data. The parameters define the class-conditional distributions
$P(t \mid C = c)$ for all possible data points $t \in S$ and all class values $c$. When a new data
point is classified, it is enough to calculate the class probabilities $P(C = c \mid t)$ by the
Bayes rule:

$$P(C = c \mid t) = \frac{P(C = c)\,P(t \mid C = c)}{P(t)}.$$

In practice, the problem is the large number of probabilities we have to estimate. For example, if
all attributes $A_1, \ldots, A_k$ have $v$ different values and all $A_i$ are mutually dependent,
we have to define $O(v^k)$ probabilities. This means that we also need a large training set to
estimate the required joint probability accurately.
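
Both points can be checked numerically. The Python 3 sketch below applies the Bayes rule to two
classes and then shows how quickly the number of attribute-value combinations grows with k; the
class names and probability values are invented for illustration only.

```python
def posterior(priors, likelihoods):
    """Bayes rule: P(c | t) = P(c) * P(t | c) / P(t), where P(t) sums over all classes."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(t)
    return {c: p / evidence for c, p in joint.items()}

def joint_table_size(k, v):
    """Number of value combinations of k mutually dependent attributes with v values each."""
    return v ** k

# Hypothetical priors P(C = c) and likelihoods P(t | C = c) for one data point t.
priors = {"flu": 0.25, "cold": 0.75}
likelihoods = {"flu": 0.5, "cold": 0.125}

post = posterior(priors, likelihoods)
print(post)

for k in (5, 10, 20):
    print(k, joint_table_size(k, 2))
```

Even with binary attributes the table doubles with every added attribute, so a few dozen symptoms
already require far more training examples than a hospital dataset can provide.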

5.1.4 Naïve Bayesian Classifier

The Naive Bayes model solves this problem. The model complexity is restricted by a strong
independence assumption: we assume that all attributes $A_1, \ldots, A_k$ are conditionally
independent given the class attribute $C$, i.e.
$P(A_1, \ldots, A_k \mid C) = \prod_{i=1}^{k} P(A_i \mid C)$. This Naive Bayes assumption can be
represented as a two-layer Bayesian network (Figure 3), with the class variable $C$ as the root
node and all the other variables $A_1, \ldots, A_k$ as leaf nodes. Now we have to estimate only
$O(kv)$ probabilities per class.
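
To make the factorization concrete, here is a minimal from-scratch sketch of a Naive Bayes
classifier over binary "symptom present" attributes, written in Python 3. It is not the project's
code: the Bernoulli variant, the Laplace smoothing, the class and method names, and the toy
training records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Bernoulli Naive Bayes: each attribute A_i is 'symptom i is present'."""

    def fit(self, records):
        # records: list of (set_of_symptoms, disease_label) pairs
        self.total = len(records)
        self.class_counts = Counter(label for _, label in records)
        self.vocabulary = sorted({s for symptoms, _ in records for s in symptoms})
        self.present = defaultdict(Counter)  # label -> symptom -> number of records
        for symptoms, label in records:
            self.present[label].update(symptoms)
        return self

    def _log_score(self, symptoms, label):
        n_c = self.class_counts[label]
        score = math.log(n_c / self.total)  # log P(C = c)
        for s in self.vocabulary:           # plus sum over i of log P(A_i | C = c)
            p = (self.present[label][s] + 1) / (n_c + 2)  # Laplace-smoothed estimate
            score += math.log(p if s in symptoms else 1 - p)
        return score

    def predict(self, symptoms):
        """Return the class with the highest posterior (P(t) is the same for all classes)."""
        symptoms = set(symptoms)
        return max(self.class_counts, key=lambda c: self._log_score(symptoms, c))

# Hypothetical training records.
records = [
    ({"fever", "cough"}, "influenza"),
    ({"fever", "fatigue"}, "influenza"),
    ({"fever", "cough", "fatigue"}, "influenza"),
    ({"headache", "nausea"}, "migraine"),
    ({"headache"}, "migraine"),
]
model = NaiveBayes().fit(records)
print(model.predict({"fever", "cough"}))
print(model.predict({"headache"}))
```

The model stores one probability per symptom per class, which matches the O(kv) estimate above
with v = 2. Working with log probabilities avoids numerical underflow when many attributes are
multiplied together.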
The following is a snippet of the code using the naïve Bayesian classifier:

5.2 Main method


5.3 Output

5.3.1 Output 1

The following output shows how three symptoms are used to find the disease.

5.3.2 Output 2

The following output is another example of finding the disease with the naïve Bayesian
algorithm.
5.4 Decision Tree

The following decision tree, built from the inputs, shows how the classifier finds the disease on
the basis of the symptoms and the number of symptoms.
6. Work to be accomplished further

6.1 Treatment Identification

This module covers treatment identification: once a specific disease is identified, the
treatments corresponding to it are enumerated and the best among them is presented to the patient.

This requires a separate dataset of all the treatments associated with each disease.

6.2 Login Interface


This module covers user login and the display of the output for the symptoms entered by the
user.

7. References

● IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727,
Volume 16, Issue 3, Ver. VII (May-Jun. 2014), PP 05-12
● https://www.cs.princeton.edu/courses/archive/spr11/cos217/lectures/08DsAlg.pdf
● Dataset - http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html
● https://raw.githubusercontent.com/Aniruddha-Tapas/Predicting-Diseases-From-Symptoms/master/tree-top5.png