CERTIFICATE
This is to certify that the project entitled "FOOTBALL MATCH ANALYSIS THROUGH
MACHINE LEARNING" is the bona fide work of JAYESH PATEKAR bearing Seat No:
__________________ submitted in partial fulfilment of the requirements for
the award of degree of BACHELOR OF SCIENCE in INFORMATION
TECHNOLOGY from University of Mumbai.
Introduction
1.1 Background
1.2 Objective
The objective of this project is to analyse football matches using the dataset.
1. First, we will explore the data to see which team wins, loses, or draws a
match according to the goal difference.
2. Then we will work with a subset of the data, e.g. if we select Germany, we
will work only with Germany's data, and the result will show whether
Germany will win or lose, and so on.
3. In this way, at the end we will show the final result: which team wins the
FIFA Cup.
Different machine learning models were tested, and different model designs
and hypotheses were explored in order to maximise the predictive performance
of the model. To generate predictions, there are some objectives we need to
fulfil. First, we need to find good-quality data and sanitize it for use in our
models, which requires identifying suitable data sources. This will give us
access to a large number of varied statistics, compared to most past research
on the subject, where only the final result of each match is considered.
1.3 Purpose, Scope and Applicability
1.3.1 Purpose
The purpose of this project is to learn and apply machine learning to real, existing datasets in
the field of sports. Machine learning is an application of artificial intelligence (AI) that gives
systems the ability to learn and improve from experience automatically, without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves. The process of learning begins with observations
or data, such as examples, direct experience, or instruction, in order to look for patterns in the
data and make better decisions in the future based on the examples that we provide. The primary
aim is to allow computers to learn automatically, without human intervention or assistance, and
to adjust their actions accordingly. Machine learning has already been used to predict baseball
and cricket matches, but as football is the most popular game, I want to analyse football
datasets.
1.3.2 Scope
The main function of the project is to predict which team, amongst all the teams playing for the
FIFA Cup, will win each match.
It will also predict the teams that win the knockout rounds and the teams that will play in the final.
The project will give results using the model best suited to the program, favouring both accurate
and fast results.
1.3.3 Applicability
Machine learning has already been used to predict outcomes in many sectors. After the
successful completion of this project, I will publish all the code, the data, and the explanation
on an open, free learning website, which will help people around the world, including students
like us, to learn machine learning and understand its concepts.
The project can be used to predict game results for fun, but it could also be used for sports
betting, as these days many people are involved in sports betting, betting their money on a
particular team or player and making money from it.
Chapter 2
Survey of Technologies
1) Python
Python is an interpreted, high-level, general-purpose programming language.
It is open-source software. Python features a dynamic type system
and automatic memory management. It supports multiple programming paradigms,
including object-oriented, imperative, functional and procedural, and has a large and
comprehensive standard library.
2) Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based interactive
computational environment for creating Jupyter notebook documents. The term
"notebook" can colloquially refer to several different entities, mainly the Jupyter web
application, the Jupyter Python web server, or the Jupyter document format, depending on
context. A Jupyter Notebook document is a JSON document, following a versioned
schema, containing an ordered list of input/output cells which can hold code,
text (using Markdown), mathematics, plots and rich media, and usually ends with the
".ipynb" extension.
3) Scikit-learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the
Python programming language. It features various classification, regression and
clustering algorithms including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.
4) Pandas
In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series. It is free
software released under the three-clause BSD license.
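As a hedged illustration of the data manipulation pandas provides, the sketch below builds a small table of invented match results (not the project dataset), derives a goal-difference column, and filters the rows for one team:

```python
# Hypothetical sketch: tabular manipulation with pandas (illustrative data only).
import pandas as pd

# A small table of invented match results.
matches = pd.DataFrame({
    "home_team": ["Germany", "Brazil", "Germany"],
    "away_team": ["France", "Spain", "Italy"],
    "home_goals": [2, 1, 1],
    "away_goals": [0, 1, 1],
})

# Derived column: goal difference from the home side's perspective.
matches["goal_diff"] = matches["home_goals"] - matches["away_goals"]

# Filtering: all matches involving Germany, home or away.
germany = matches[(matches["home_team"] == "Germany") |
                  (matches["away_team"] == "Germany")]
print(len(germany))  # 2
```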
5) Logistic Regression
Logistic regression is a simple algorithm that can be used for binary and multinomial
classification tasks. Logistic regression is the appropriate regression analysis to conduct
when the dependent variable is dichotomous (binary). Like all regression analyses, the
logistic regression is a predictive analysis. Logistic regression is used to describe data
and to explain the relationship between one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent variables.
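A minimal sketch of logistic regression with scikit-learn follows; the feature values and labels are invented for illustration and are not taken from the project dataset:

```python
# Hedged sketch: binary classification with logistic regression.
from sklearn.linear_model import LogisticRegression

# Each row: [goal_difference, shots_on_target]; label: 1 = home win, 0 = not.
# These six rows are invented training examples.
X = [[3, 10], [2, 8], [1, 6], [0, 4], [-1, 3], [-2, 2]]
y = [1, 1, 1, 0, 0, 0]

model = LogisticRegression()
model.fit(X, y)

# Predict the outcome for a new match: goal difference 2, 7 shots on target.
print(model.predict([[2, 7]]))  # [1]
```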
Chapter 3
We face a number of challenges on the path to achieving the objectives we have set out:
1) Data availability & quality:
Finding a public database of football data with the necessary statistical depth to
generate expected goals metrics is an essential part of the project. However, the leading
football data providers do not make their information publicly available. We will need
to scour various public football databases to find one that is suitable for our use. The
alternative, should we not find a suitable database, would be to find websites displaying
the required data and use web-scraping techniques to build our own usable database.
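The scraping fallback can be sketched with Python's standard-library HTML parser. The markup below is an invented stand-in; a real site's structure would differ and would need its own parsing rules:

```python
# Hedged sketch of the scraping fallback: extracting match rows from HTML.
from html.parser import HTMLParser

# Invented sample markup standing in for a real results page.
SAMPLE_HTML = """
<table>
  <tr class="match"><td>Germany</td><td>2-0</td><td>France</td></tr>
  <tr class="match"><td>Brazil</td><td>1-1</td><td>Spain</td></tr>
</table>
"""

class MatchParser(HTMLParser):
    """Collects the text of every <td> cell in document order."""
    def __init__(self):
        super().__init__()
        self.cells, self.in_td = [], False
    def handle_starttag(self, tag, attrs):
        self.in_td = (tag == "td")
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

parser = MatchParser()
parser.feed(SAMPLE_HTML)
# Group the flat cell list into (home, score, away) triples.
matches = [tuple(parser.cells[i:i + 3]) for i in range(0, len(parser.cells), 3)]
print(matches)  # [('Germany', '2-0', 'France'), ('Brazil', '1-1', 'Spain')]
```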
3.2 Requirement Specification
We have obtained a dataset from the Kaggle data science website called the "Kaggle
European Soccer Database". This database has been made publicly available and brings
together data from three different sources, which have been scraped and collected into a
usable database:
• Match scores, lineups, events: http://football-data.mx-api.enetscores.com/
• Betting odds: http://www.football-data.co.uk/
• Players and team attributes from EA Sports FIFA games: http://sofifa.com/
It includes the following:
• Data from more than 25,000 men’s professional football games
• More than 10,000 players
• From the main football championships of 11 European countries
• From 2008 to 2016
• Betting odds from various popular bookmakers
• Team lineups and formations
• Detailed match events (goals, possession, corners, crosses, fouls, shots, etc.) with
additional information to extract such as event location on the pitch (with coordinates)
and event time during the match.
We will be using the dataset of English Premier League.
Acronyms: https://rstudio-pubs-static.s3.amazonaws.com/179121_70eb412bbe6c4a55837f2439e5ae6d4e.html
Other repositories: https://github.com/rsibi/epl-prediction-2017 (EPL prediction)
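The dataset ships as a SQLite file, so loading it reduces to a SQL query through pandas. The sketch below stands in a tiny in-memory table for the real Match table; the column names follow the Kaggle database's schema as we understand it, and the labelling step mirrors objective 1 (win/draw/loss from goal difference):

```python
# Hedged sketch of loading match data; a toy in-memory table stands in for
# the dataset's Match table (real file: database.sqlite).
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Match (home_team_goal INTEGER, away_team_goal INTEGER)")
conn.executemany("INSERT INTO Match VALUES (?, ?)", [(2, 0), (1, 1), (0, 3)])

matches = pd.read_sql_query("SELECT * FROM Match", conn)

# Label each match from the home side's goal difference, as in objective 1.
def result(diff):
    return "win" if diff > 0 else ("draw" if diff == 0 else "loss")

matches["result"] = (matches["home_team_goal"] - matches["away_team_goal"]).map(result)
print(list(matches["result"]))  # ['win', 'draw', 'loss']
```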
3.3 Planning and Scheduling
3.4 Software and Hardware Requirement
ii. Sklearn
Scikit-learn (formerly scikits.learn) is a free software machine learning
library for the Python programming language. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and
is designed to interoperate with the Python numerical and scientific libraries
NumPy and SciPy.
iii. Seaborn
Seaborn is a Python data visualization library based on matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
iv. Matplotlib
Matplotlib is a plotting library for the Python programming language and
its numerical mathematics extension NumPy. It provides an object-oriented
API for embedding plots into applications using general-purpose GUI
toolkits like Tkinter, wxPython, Qt, or GTK+.
v. NumPy
NumPy is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large
collection of high-level mathematical functions to operate on these arrays.
3) Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based
interactive computational environment for creating Jupyter notebook
documents. The term "notebook" can colloquially refer to several different
entities, mainly the Jupyter web application, the Jupyter Python web server, or
the Jupyter document format, depending on context. A Jupyter Notebook document
is a JSON document, following a versioned schema, containing an ordered
list of input/output cells which can hold code, text (using Markdown),
mathematics, plots and rich media, and usually ends with the ".ipynb" extension.
Chapter 4
System Designs
Figure 2: ER Diagram
An entity relationship diagram (ERD) shows the relationships of entity sets stored in a database.
An entity in this context is an object, a component of data. An entity set is a collection of similar
entities. These entities can have attributes that define their properties.
In our project, all the teams and all the goals scored are saved in one and only one data
resource. Every goal that is scored is related to exactly one fixture amongst them all, and
every fixture is related to one team.
4.1.2 Activity Diagram
The activity diagram is another important UML diagram for describing the dynamic aspects of a
system. An activity diagram is essentially a flowchart representing the flow from one activity to
another.
According to our project, the activity diagram above shows six major activities:
1. The first activity is importing the data or database required for performing the
analysis.
2. Then we select the columns on which the main analysis is to be performed.
3. From several models, we select the model for our project that gives appropriate results
faster than the other models.
4. Then we check whether the model selected in the previous activity is accurate
and performing well. If it is not accurate, we return to the previous activity
and select a better model.
5. If the model is accurate, we move on to the next activity, which is training the selected
model.
6. Finally, the last activity is getting the prediction, i.e. predicting the winner of the
match.
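Activities 3 to 6 above can be sketched in scikit-learn as a model-selection loop: score each candidate model by cross-validation, keep the best, train it, and predict. The data and candidate models here are placeholders, not the project's final configuration:

```python
# Hedged sketch of activities 3-6: select, validate, train, predict.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared match data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Activities 3-4: evaluate candidate models and keep the best-scoring one.
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0)]
best = max(candidates,
           key=lambda m: cross_val_score(m, X_train, y_train, cv=5).mean())

# Activity 5: train the selected model on the full training set.
best.fit(X_train, y_train)

# Activity 6: predict outcomes for unseen matches and report accuracy.
accuracy = best.score(X_test, y_test)
print(type(best).__name__, round(accuracy, 2))
```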
4.1.3 Use-Case Diagram
Use case diagrams are usually referred to as behaviour diagrams, used to describe a set of
actions (use cases) that some system or systems (the subject) should or can perform in
collaboration with one or more external users of the system (actors).
In our project, an actor is a user: the one who uses the final program to get
the predicted results. The user imports and sorts the data of the league, so the league has to be
included, as the data is dependent on the league. The features likewise depend on the imported
data. The user, or actor, also has to handle the errors occurring in the program, as well
as add and remove conditions as the program requires.
4.1.4 Data Flow Diagram
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system, modelling its process aspects. A DFD is often used as a preliminary step
to create an overview of the system without going into great detail, which can later be
elaborated.
The following diagram shows that the data is retrieved from data storage and processed
until it is in the format we need.
The prepared data is then fed to the machine learning algorithms; several models are
tried, the one giving the most accurate answer is carried forward, and that model is then
used for the prediction.
4.1.5 Component Diagram
The component diagram is a special kind of UML diagram. Its purpose also differs from
all the diagrams discussed so far: it does not describe the functionality of the system but
the components used to build those functionalities.
The following component diagram consists of these components: Data Ingestion, which means
adding data related to the content; Data Transformation, i.e. converting the data into a form
convenient for execution; Model Training, i.e. training the selected model to produce the
required result; Model Testing, which checks whether the selected model does the assigned
work properly; and Model Prediction, which predicts the results from the data. A loop runs
until an accurate model is selected.
4.1.6 Sequence Diagram
Sequence diagrams are sometimes called event diagrams or event scenarios. The following
diagram shows the parallel working of a football data analysis system.
The user opens the platform to enter the data and the code.
At the user's request, the platform loads all the required libraries, retrieves the
data, analyses it, and produces results. The results are then confirmed and the final
prediction is made.