
FOOTBALL MATCH DATA ANALYSIS

USING MACHINE LEARNING


A Project Report

Submitted in partial fulfillment of the


requirements for the award of the Degree of
BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY)
By

JAYESH CHANDRAKANT PATEKAR


Seat Number: _______________
Under the esteemed guidance of

Mr. Prabal Deep Das


Assistant Professor, Department of Information Technology

DEPARTMENT OF INFORMATION TECHNOLOGY

VIDYALANKAR SCHOOL OF INFORMATION TECHNOLOGY


(Affiliated to University of Mumbai)
MUMBAI, 400 037
MAHARASHTRA
2018 - 2019
VIDYALANKAR SCHOOL OF INFORMATION TECHNOLOGY
(Affiliated to University of Mumbai)
MUMBAI-MAHARASHTRA-400037
DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE
This is to certify that the project entitled "FOOTBALL MATCH ANALYSIS THROUGH
MACHINE LEARNING" is the bona fide work of JAYESH PATEKAR bearing Seat No:
__________________, submitted in partial fulfillment of the requirements for
the award of the degree of BACHELOR OF SCIENCE in INFORMATION
TECHNOLOGY from the University of Mumbai.

Internal Guide Coordinator

Internal Examiner External Examiner

Date: College Seal Principal


ABSTRACT
We will first apply some preprocessing steps to prepare the data. Then, we will perform a
descriptive analysis of the data to better understand its main characteristics. We will
continue by training different machine learning models using scikit-learn, one of the most
popular Python libraries for machine learning. We will also use a subset of the dataset for
training purposes. Then, we will iterate and evaluate the learned models using unseen data.
Later, we will compare them until we find a good model that meets our expectations. Once
we have chosen the candidate model, we will use it to perform predictions and to create a
simple web application that consumes this predictive model.
ACKNOWLEDGEMENT
I would like to express my gratitude to my project guide, Prof. Prabal Deep Das, for his
constant support and help, as well as to our Principal, Dr. Rohini Kelkar, and the Head of the
Department, Prof. Sanjeela Sagar, for the opportunity to do this wonderful project, which also
helped me do a great deal of research and learn many new things.
I would also like to thank all the teaching and non-teaching staff of Vidyalankar School of
Information Technology for their encouragement and timely help throughout the
project work.
DECLARATION
I hereby declare that the project entitled "FOOTBALL MATCH ANALYSIS THROUGH
MACHINE LEARNING", done at Vidyalankar School of Information Technology, has not
in any case been duplicated for submission to any other university for the award of any degree.
To the best of my knowledge, no one other than me has submitted it to any other university.
The project is done in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF SCIENCE (INFORMATION TECHNOLOGY) and is to be submitted as the
final semester project as part of our curriculum.

Name and Signature of the Student


Table of Contents
Chapter 1 Introduction........................................................................................... 1
1.1 Background.................................................................................................... 1
1.2 Objective........................................................................................................ 2
1.3 Purpose, Scope and Applicability.....................................................................3
1.3.1 Purpose....................................................................................................... 3
1.3.2 Scope.......................................................................................................... 3
1.3.3 Applicability................................................................................................ 3
Chapter 2 Survey of Technologies..........................................................................4
Chapter 3 Requirements and Analysis...................................................................6
3.1 Problem Definition........................................................................................ 6
3.2 Requirement Specification............................................................................7
3.3 Planning and Scheduling.............................................................................. 8
3.4 Software and Hardware Requirement...........................................................9
3.4.1 Hardware Requirement...........................................................................9
3.4.2 Software Requirement............................................................................9
Chapter 4 System Design....................................................................................11
4.1 Object Oriented Design................................................................................. 11
4.1.1 ER Diagram.......................................................................................... 11
4.1.2 Activity Diagram................................................................................... 12
4.1.3 Use-Case Diagram................................................................................ 13
4.1.4 Data Flow Diagram............................................................................... 14
4.1.5 Component Diagram............................................................................ 15
4.1.6 Sequence Diagram................................................................................16
List of Figures
Figure 1: Gantt Chart............................................................................................. 8
Figure 2: ER Diagram........................................................................................... 11
Figure 3: Activity Diagram................................................................................... 12
Figure 4: Use-Case Diagram.................................................................................13
Figure 5: Data Flow Diagram................................................................................14
Figure 6: Component Diagram............................................................................. 15
Figure 7: Sequence Diagram................................................................................ 16
Chapter 1

Introduction

1.1 Background

Machine learning is a branch of artificial intelligence that allows computer systems to
learn directly from examples, data, and experience. By enabling computers to perform
specific tasks intelligently, machine learning systems can carry out complex processes
by learning from data, rather than following pre-programmed rules. Recent years have seen
exciting advances in machine learning, which have raised its capabilities across a suite of
applications. Increasing data availability has allowed machine learning systems to be trained
on a large pool of examples, while increasing computer processing power has supported the
analytical capabilities of these systems. Nowadays, machine learning is used to analyse data
and act on it in many sectors, including sports, healthcare and research.
As one of the most popular sports on the planet, football has always been followed very
closely by a large number of people. In recent years, new types of data have been collected for
many games in various countries, such as play-by-play data including information on each shot
or pass made in a match.
The collection of this data has placed Data Science at the forefront of the football
industry, with many possible uses and applications:
• Match strategy, tactics, and analysis
• Identifying players’ playing styles
• Player acquisition, player valuation, and team spending
• Training regimens and focus
• Injury prediction and prevention using test results and workloads
• Performance management and prediction
• Match outcome and league table prediction
• Tournament design and scheduling
• Betting odds calculation
In particular, the betting market has grown very rapidly in the last decade, thanks to
increased coverage of live football matches as well as easier access to betting websites
through the spread of mobile and tablet devices. Indeed, the football betting industry is
today estimated to be worth between 300 million and 450 million pounds a year.

1.2 Objective

The objective of this project is to analyse football matches based on the dataset.
1. Firstly, we will present the data in a way that shows whether a team wins, loses or
draws a match according to the goal difference.
2. Then we will work with a subset of the data; for example, if we select Germany, we
will work only with Germany's data and the result will show whether Germany is
predicted to win or lose, and so on.
3. In this way, at the end we will show the final result, that is, which team wins the
FIFA Cup.
Different machine learning models will be tested and different model designs
and hypotheses explored in order to maximise the predictive performance
of the model. In order to generate predictions, there are some objectives that we
need to fulfil. Firstly, we need to find good-quality data and sanitize it for use
in our models; to do so, we will need to find suitable data sources.
This will give us access to a large number and variety of statistics,
compared to most of the past research on the subject, where
only the final result of each match is considered.
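
As an illustrative sketch of steps 1 and 2 above, the listing below derives a win/draw/loss label from the goal difference and filters the data for a chosen team using pandas. The column names and sample rows are made-up placeholders for illustration, not the actual dataset schema.

import pandas as pd

# Made-up sample of match data; in the project the rows would come from the
# real dataset (the column names here are assumptions for illustration).
matches = pd.DataFrame({
    "home_team": ["Germany", "Brazil", "Germany"],
    "away_team": ["France", "Germany", "Spain"],
    "home_goals": [2, 1, 0],
    "away_goals": [0, 1, 1],
})

# Step 1: label each match from the goal difference.
diff = matches["home_goals"] - matches["away_goals"]
matches["result"] = diff.apply(
    lambda d: "home win" if d > 0 else ("away win" if d < 0 else "draw")
)

# Step 2: work with the subset of matches involving a chosen team.
team = "Germany"
subset = matches[(matches["home_team"] == team) | (matches["away_team"] == team)]
print(subset)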

1.3 Purpose, Scope and Applicability

1.3.1 Purpose

The purpose of this project is to learn and apply machine learning to real, existing datasets in
the field of sports. Machine learning is an application of artificial intelligence (AI) that provides
systems the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves. The process of learning begins with observations
or data, such as examples, direct experience, or instruction, in order to look for patterns in data
and make better decisions in the future based on the examples that we provide. The primary
aim is to allow computers to learn automatically, without human intervention or assistance, and
adjust their actions accordingly. Machine learning has been used to predict baseball and cricket
matches, but as football is the most popular game, I want to analyse football datasets.

1.3.2 Scope

The main function of the project is to predict which team, amongst all the teams playing for the
FIFA Cup, will win a given match.
It will also predict the teams that win the knockout rounds and the teams that will play the final.
The project will only give results according to the model that is best suited to the program, so
that it produces accurate results quickly.

1.3.3 Applicability

Machine learning has already been used to predict outcomes in many sectors. After the
successful completion of this project, I intend to publish the code, the data and the explanation
on an open and free learning website, which will help people around the world, and students
like us, to learn machine learning and understand its concepts.
The project can be used to predict game results for fun, but it can also be used for sports betting,
as these days many people are involved in sports betting, staking money on a particular team or
player and making money through it.

Chapter 2

Survey of Technologies

1) Python
Python is an interpreted, high-level programming language for general-purpose
programming. It is open-source software. Python features a dynamic type system
and automatic memory management. It supports multiple programming paradigms,
including object-oriented, imperative, functional and procedural, and has a large and
comprehensive standard library.

2) Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive
computational environment for creating Jupyter notebook documents. The term
"notebook" can colloquially refer to many different entities, mainly the Jupyter web
application, the Jupyter Python web server, or the Jupyter document format, depending
on context. A Jupyter Notebook document is a JSON document, following a versioned
schema, containing an ordered list of input/output cells which can hold code,
text (using Markdown), mathematics, plots and rich media; the file usually has the
".ipynb" extension.

3) Scikit-learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the
Python programming language. It features various classification, regression and
clustering algorithms including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.

4) Pandas
In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series. It is free
software released under the three-clause BSD license.

5) Logistic Regression
Logistic regression is a simple algorithm that can be used for binary/multivariate
classification tasks. Logistic regression is the appropriate regression analysis to conduct
when the dependent variable is dichotomous (binary). Like all regression analyses, the
logistic regression is a predictive analysis. Logistic regression is used to describe data
and to explain the relationship between one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent variables.
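
As a minimal, illustrative sketch (not the project's final model), a logistic regression classifier can be trained and used for prediction with scikit-learn as follows. The feature values and labels below are made up purely for demonstration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features (e.g. goal difference, shot difference) and binary labels
# (1 = home win, 0 = otherwise); real values would come from the match dataset.
X = np.array([[2, 5], [0, 1], [-1, -3], [3, 4], [-2, -6], [1, 0]])
y = np.array([1, 0, 0, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)

# Predict the class and the class probabilities for a new, unseen example.
print(model.predict([[1, 2]]))
print(model.predict_proba([[1, 2]]))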

Chapter 3

Requirements and Analysis

3.1 Problem Definition

We face a number of challenges on the path to achieving the objectives we have set out:
1) Data availability & quality:
Finding a public database of football data with the necessary statistical depth to
generate expected goals metrics is an essential part of the project. However, the leading
football data providers do not make their information publicly available. We will need
to scour various public football databases to find one that is suitable for us to use. The
alternative approach, in case we do not find a suitable database, would be to
find websites displaying the required data and use web scraping techniques to create
our own usable database.

2) Research and understanding of prediction landscape:


In order to design our models and test different hypotheses, we will need to
undertake thorough background research into prediction techniques and develop a
mathematical understanding of the various machine learning algorithms that can be used
for our predictions.

3) Testing different models and parameters:


An important challenge will be to make the model training and testing tasks as
quick and easy as possible, in order to test and compare different models.
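
A minimal sketch of such a quick train-and-compare harness is shown below. It assumes a feature matrix X and label vector y; synthetic stand-in data and two illustrative candidate models are used here so the sketch runs on its own, and are not a fixed design choice for the project.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data so the sketch is self-contained; in the project X and y would
# be built from the match statistics.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validation gives a quick, comparable accuracy estimate for each model.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))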

3.2 Requirement Specification

We have obtained a dataset from the Kaggle data science website called the 'Kaggle
European Soccer Database'. This database has been made publicly available and combines
data from three different sources, which have been scraped and collected into a usable
database:
• Match scores, lineups, events: http://football-data.mx-api.enetscores.com/
• Betting odds: http://www.football-data.co.uk/
• Players and team attributes from EA Sports FIFA games: http://sofifa.com/
It includes the following:
• Data from more than 25,000 men’s professional football games
• More than 10,000 players
• From the main football championships of 11 European countries
• From 2008 to 2016
• Betting odds from various popular bookmakers
• Team lineups and formations
• Detailed match events (goals, possession, corners, crosses, fouls, shots, etc.) with
additional information to extract such as event location on the pitch (with coordinates)
and event time during the match.
We will be using the English Premier League portion of this dataset.
Acronyms: https://rstudio-pubs-static.s3.amazonaws.com/179121_70eb412bbe6c4a55837f2439e5ae6d4e.html
Other repositories: https://github.com/rsibi/epl-prediction-2017 (EPL prediction)
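
As a sketch of how this database could be loaded for analysis, assuming the Kaggle download is the usual single SQLite file (here assumed to be named database.sqlite) containing, among others, a Match table; the file name and table name are assumptions made for illustration.

import sqlite3
import pandas as pd

# Path to the downloaded Kaggle European Soccer Database (assumed file name).
conn = sqlite3.connect("database.sqlite")

# Load the match table into a pandas DataFrame for preprocessing and analysis.
matches = pd.read_sql_query("SELECT * FROM Match", conn)
conn.close()

print(matches.shape)
print(list(matches.columns)[:10])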

3.3 Planning and Scheduling

Figure 1: Gantt Chart

3.4 Software and Hardware Requirement

3.4.1 Hardware Requirement


1) Mouse
2) Processor
3) 2 GB of disk space
4) 4 GB RAM

3.4.2 Software Requirement


1) The Operating System
This project can run on any platform including Windows (Windows XP or
higher).
2) Python
Libraries such as:
i. Pandas
In computer programming, pandas is a software library written for
the Python programming language for data manipulation and analysis. In
particular, it offers data structures and operations for manipulating
numerical tables and time series. It is free software released under the three-
clause BSD license.

ii. Sklearn
Scikit-learn (formerly scikits.learn) is a free software machine learning
library for the Python programming language. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and
is designed to interoperate with the Python numerical and scientific libraries
NumPy and SciPy.

iii. Seaborn
Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
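A short plotting sketch using Seaborn and Matplotlib is given at the end of this section.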

iv. Matplotlib
Matplotlib is a plotting library for the Python programming language and
its numerical mathematics extension NumPy. It provides an object-oriented
API for embedding plots into applications using general-purpose GUI
toolkits like Tkinter, wxPython, Qt, or GTK+.

v. NumPy
NumPy is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large
collection of high-level mathematical functions to operate on these arrays.

3) Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based
interactive computational environment for creating Jupyter notebook
documents. The term "notebook" can colloquially refer to many different
entities, mainly the Jupyter web application, the Jupyter Python web server, or
the Jupyter document format, depending on context. A Jupyter Notebook document
is a JSON document, following a versioned schema, containing an ordered
list of input/output cells which can hold code, text (using Markdown),
mathematics, plots and rich media; the file usually has the ".ipynb" extension.

4) Microsoft Excel for database


Microsoft Excel, with updated data analysis tools, can help you track and
visualize data for better management of and insight into large amounts of
information. Microsoft Excel is a spreadsheet developed by Microsoft for
Windows, macOS, Android and iOS. It features calculation, graphing tools,
pivot tables, and a macro programming language called Visual Basic for
Applications.
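
The following is a minimal plotting sketch using the visualization libraries listed above (Seaborn on top of Matplotlib); the goal counts are made up purely for illustration and would come from the dataset in the project.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up goals-per-match values standing in for real data from the dataset.
goals = pd.DataFrame({"goals_per_match": [0, 1, 1, 2, 3, 2, 1, 4, 2, 0]})

# Seaborn builds on Matplotlib, so the figure is displayed with plt.show().
sns.countplot(x="goals_per_match", data=goals)
plt.xlabel("Goals per match")
plt.ylabel("Number of matches")
plt.show()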

Chapter 4

System Design

4.1 Object Oriented Design


4.1.1 ER Diagram

Figure 2: ER Diagram

An entity relationship diagram (ERD) shows the relationships of entity sets stored in a database.
An entity in this context is an object, a component of data. An entity set is a collection of similar
entities. These entities can have attributes that define their properties.
In our project, all the teams and all the goals scored are stored in the single data
resource that we are using. Every goal that is scored is related to exactly one fixture,
and every fixture is related to a team.

4.1.2 Activity Diagram

Figure 3: Activity Diagram

An activity diagram is another important UML diagram for describing the dynamic aspects of a
system. It is essentially a flowchart representing the flow from one activity to
another.
According to our project, the activity diagram above covers six major activities (a code-level
sketch of this flow follows the list):
1. The first activity is importing the data or database required for
performing the analysis.
2. Then we select the columns on which the main analysis is to be performed.
3. From several models, we select a model for our project that gives appropriate results
faster than the other models.
4. Then we check whether the model selected in the previous activity is accurate
and performing well. If it is not accurate, we return to the previous activity
and select a better model.
5. If the model is accurate, we move on to the next activity, which is training the selected
model.
6. Finally, the last activity is obtaining the prediction, i.e. predicting the winner of the
match.
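
A minimal sketch of this flow is given below. A tiny made-up DataFrame stands in for the imported data, and the column names are placeholders; in the project the data would be loaded from the actual dataset, and other models would be tried if the accuracy check fails.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Import the data. In the project this would be pd.read_csv(...) or a SQL
#    query; a tiny made-up table stands in here so the sketch runs on its own.
data = pd.DataFrame({
    "home_goal_diff_avg": [0.5, -0.2, 1.1, -0.8, 0.3, 0.9, -1.2, 0.1],
    "away_goal_diff_avg": [-0.1, 0.4, -0.6, 0.7, 0.0, -0.9, 1.0, 0.2],
    "result": ["H", "A", "H", "A", "H", "H", "A", "A"],
})

# 2. Select the columns on which the analysis is performed (placeholder names).
X = data[["home_goal_diff_avg", "away_goal_diff_avg"]]
y = data["result"]

# 3.-5. Select a model, train it, and check its accuracy on held-out data; if
#       the accuracy were too low, another model would be selected instead.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. Predict the outcome of a new match described by its feature values.
new_match = pd.DataFrame({"home_goal_diff_avg": [0.8], "away_goal_diff_avg": [-0.3]})
print(model.predict(new_match))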

4.1.3 Use-Case Diagram

Figure 4: Use-Case Diagram

Use case diagrams are usually referred to as behaviour diagrams, used to describe a set of
actions (use cases) that some system or systems (the subject) should or can perform in
collaboration with one or more external users of the system (actors).
In our project, the actor is the user, that is, the person who will use the final program to obtain
the predicted results. The user imports and sorts the data of the league, so the league has to be
included, as the data depends on the league. The features, in turn, depend on the imported data.
The user (actor) also has to handle any errors occurring in the program, as well as add and
remove conditions as required by the program.

4.1.4 Data Flow Diagram

Figure 5: Data Flow Diagram

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system, modelling its process aspects. A DFD is often used as a preliminary step
to create an overview of the system without going into great detail, which can later be
elaborated.
The following diagram shows that the data is retrieved from data storage and processed until it
is in the format we need.
The prepared data is then fed to the machine learning algorithms; several models are used, the
one that gives the most accurate answer is carried forward, and that model is then used for the
prediction.

4.1.5 Component Diagram

Figure 6: Component Diagram

A component diagram is a special kind of UML diagram. Its purpose is different from that of
all the other diagrams discussed so far: it does not describe the functionality of the system, but
the components used to realise those functionalities.
The component diagram consists of the following components: Data Ingestion, i.e. adding the
relevant data; Data Transformation, i.e. converting the data into a form convenient for
execution; Model Training, i.e. training the selected model to produce the required results;
Model Testing, i.e. checking whether the selected model does the assigned work properly; and
Model Prediction, which predicts the results from the data. This loop continues until an
accurate model is selected.

4.1.6 Sequence Diagram

Figure 7: Sequence Diagram

Sequence diagrams are sometimes called event diagrams or event scenarios. The following
diagram shows the parallel workings of the football data analysis system.
The user has to open the platform to enter the data and the code.
At the user's request, the platform runs all the required libraries, retrieves the data, analyses it
and produces results. The results are then confirmed and the final prediction is made.

