Can Soccer Be Predicted, and If So, How?

John Park

Independent Study

Mrs. Graves

June 7, 2016

What I Already Knew / What I Wanted to Know

"It's Layun. Layun passes it to Ruben Neves, Neves to Brahimi, Brahimi crosses and
SUK SCORES!" Soccer is my passion, and it is a passion shared by billions of people on
this planet. I have watched soccer since I was six and have been in love with the game
ever since. I frequently wake up at 3 or 4 AM on Saturdays and Sundays just to watch my
favorite team play. However, when watching soccer, I have always had one question:
what can I do to predict scores more accurately?
To predict scores, one needs to build a database and create a model on top of it.
There are many variables and aspects that need to be taken into consideration when
doing so, and the more relevant data the database takes into account, the more
credible the prediction becomes. This is where big data comes into play. Big data is a
term describing large volumes of data with specific characteristics that can be
analyzed for insights that lead to better decisions and strategic moves. I really wanted
to find out how to use big data to predict association football (soccer) scores. Big
data also requires coding in order to analyze it and derive useful conclusions, which is
a skill I needed and was already somewhat familiar with. As we see in the movie
Moneyball, Oakland Athletics general manager Billy Beane used analytics to reverse the
fortunes of his struggling MLB team. Since then, data-driven approaches to sports have
continued to grow and evolve.
With my limited knowledge on the topic, I needed to know more about how big data
is used and applied in order to reach a conclusion. What are the characteristics of coding, big
data and databases? How is big data applied to databases (in a global perspective)? Where
does soccer come into play? Why big data, and how can we use this in soccer? What benefit
could big databases provide? In soccer? I made a list of the things I needed and wanted to
know. From that list, I was able to form my research question: Would it be possible to
program an efficient system for sorting a variety of data in order to predict soccer scores with
programming and big data, and how? Later, I was able to form an answer.

The Story of My Search


Let me first say that my research was mostly done at home, online, and that it was fairly
smooth. There was seldom difficulty in finding resources, although some were harder to
interpret. My search took about three weeks, most of which I spent digging through papers and
websites related to the topic. I began my research by doing some background reading using
Google Scholar and websites from big data analytics companies.
To start my research off, I decided to look into the basics of big data. SAS was
especially helpful in my search. As a big data analytics company, SAS provided the
basics of my research and much information that I wouldn't have found elsewhere. On
the SAS webpage, I learned the three aspects of big data: volume, velocity, and variety.
Volume refers to the vast amounts of data generated every second (SAS). Terabytes,
records, transactions, files, etc. all count toward volume. Organizations collect
data from a variety of sources, including business transactions, social media and
information from sensor or machine-to-machine data. In the past, storing data would've
been a problem, but new technologies such as Hadoop have eased the burden (SAS).
Volume doesn't have to be extremely large or particular; it just has to be relevant
(SAS). Velocity refers to the speed at which new data is generated and the speed at
which data moves around (SAS). Big data is created in real time or near real time.
Tags, sensors and smart metering are driving the need to deal with torrents of data in
near real time. Variety refers to the different types of data we can now use. With big
data technology we can now analyze and bring together data of different types, as data
comes in all kinds of formats, from structured, numeric data in traditional databases
to unstructured text documents, email, video, audio, stock ticker data and financial
transactions (SAS).

At first, this made no sense to me. Yes, the mere concepts and possibilities of
big data were interesting enough, but how could this be applied to soccer? After more
research, and writing my two blog posts, I found my answer in a couple of webpages
and statistics articles. Soccer is a game of data; it is crucial that one is up to
date with it to make the best managerial or betting choices. As a result, big data
has to be used efficiently in order to manage the game.
Variables such as the home team's goal differential up to that point in the
season, the away team's goal differential up to that point in the season, the home team's
points from the previous season, the away team's points from the previous season, etc.
can be analyzed as a whole to predict scores (Snyder, 2013). Large databases such as
Transfermarkt can be used to track player conditions, statistics, and various other
information and compile them live in order to make wise predictions. Electronic Arts,
the company that makes the famous FIFA games, already does something like this. I
also realized that betting odds could possibly be used to create a model. On top of
this, teams and analysts nowadays have come up with sophisticated ways of monitoring
and capturing the vast amounts of data from the game. Managers, coaches and athletes
are using data to manage calorie intake, training levels and even fan interaction. An
extreme example is the German national team during the 2014 World Cup. Germany used
Match Insights, a tool developed by the German company SAP, which allowed the coaches
to crawl through complex video and made it simple for them to know what they needed
to win (Veips, 2014). Match Insights utilized data captured from on-field cameras and
helped pinpoint the areas or plays in which players needed improvement, boosting the
efficiency of practice sessions. This, combined with other quantitative and qualitative
data, allowed for more accurate predictions and estimation of opponents. Something that
caught my eye was analyzing social media in order to predict scores. Twitter feed
reactions can be aggregated and compared with live data (Ulmer). In doing so, one can
look to see whether the reactions are positive or negative, and whether there are any
abnormalities. Social media tracking establishes trends and patterns that alert teams
to changing perceptions. If something is going wrong, then the team can react, perhaps
to appease the fans. If an action is popular, maybe the team can continue doing it.
Basically, through my basic research, I reached the conclusion that the task was to
take a large set of data from results and databases and run it through some kind of
analysis.
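As a minimal sketch of how one of these variables could be computed, the Python below builds each team's goal differential "up to that point in the season" match by match. The function name and the match results are hypothetical, made up for illustration, and are not taken from Snyder's thesis.

```python
from collections import defaultdict


def goal_differentials_before_each_match(matches):
    """For each match, return (home_gd, away_gd) as they stood BEFORE kickoff.

    `matches` is a list of (home, away, home_goals, away_goals) tuples in
    chronological order.
    """
    running_gd = defaultdict(int)  # team -> goal differential so far
    features = []
    for home, away, home_goals, away_goals in matches:
        # Record the pre-match values first, so the feature never "sees"
        # the result it is supposed to help predict.
        features.append((running_gd[home], running_gd[away]))
        running_gd[home] += home_goals - away_goals
        running_gd[away] += away_goals - home_goals
    return features


# Hypothetical results: (home team, away team, home goals, away goals)
season = [
    ("A", "B", 2, 0),
    ("B", "C", 1, 1),
    ("C", "A", 0, 3),
    ("A", "B", 1, 2),
]
features = goal_differentials_before_each_match(season)
```

Each pair in `features` could then serve as two input columns for whatever prediction model is chosen, alongside other variables such as points from the previous season.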
After finding this out, I decided to explore how variables and data from soccer can be
processed. One article that particularly helped was a piece on sports and big data analytics
from an Indian news outlet. During the 2014 World Cup, Google, Baidu, and Microsoft all
used big data to predict match results (Soni, 2014). A number of factors played into
this analysis, such as overall team passing in the attacking half, a higher
number of shots and a higher number of goals scored (Soni, 2014). Google then
combined a quantitative measure of home field advantage with team statistics from
Opta to create its system (Soni, 2014). After finding this out, I
realized that something similar could be done, with a computerized algorithm created for
calculating team advantages and disadvantages to predict scores. This led me to deeper
research into computer analysis via coding.
Regarding coding, I quickly found that SQL can be used to build and query databases for
storing data (Kiourtzoglou, 2013). I also learned that HDFS is another option and is the
primary distributed storage used by Hadoop applications. Hadoop MapReduce can also be used,
being a software framework for easily writing applications that process multi-terabyte
data sets in parallel on large clusters of commodity hardware in a reliable, fault-tolerant
manner (Kiourtzoglou, 2013). Apache HBase is another good alternative, providing random,
real-time access to big data and being optimized for hosting very large tables. Apache Oozie
is another service that can be used, being a scalable, reliable and extensible workflow
scheduler system to manage Apache Hadoop jobs. There are a variety of methods one could use
to store data (Kiourtzoglou, 2013). These are some that I found that could be applied to
sports, and soccer in particular. Databases like these can be used to store, analyze and
process data in order to make it more measurable and manageable, and can enable the analyst
to combine multiple factors with the processed data in order to make a wise decision or
discovery.
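To make the SQL option concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table layout, team names and scores are assumptions made up for illustration; a real project would load results from a source such as a league database.

```python
import sqlite3


def average_home_goals(rows, team):
    """Store match rows in a SQL table and query one team's average home goals."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE matches ("
        "home TEXT, away TEXT, home_goals INTEGER, away_goals INTEGER)"
    )
    # Parameterized inserts keep the data separate from the SQL text.
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT AVG(home_goals) FROM matches WHERE home = ?", (team,)
    ).fetchone()[0]
    conn.close()
    return result


# Hypothetical results: (home team, away team, home goals, away goals)
sample_rows = [
    ("Team A", "Team B", 2, 0),
    ("Team A", "Team C", 1, 1),
    ("Team B", "Team A", 0, 3),
]
team_a_home_average = average_home_goals(sample_rows, "Team A")
```

An average like this is the kind of per-team rate that could later feed a statistical model of goals scored.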
Eventually, I hit a roadblock and decided to seek help. Research was proving to be
futile at this point, and practically all of the sources that I was finding made the same
claims. I had three people in mind at first: an ODU professor, my computer science
teacher, and a member of the statistics department at Procter & Gamble. However, all
three interviews fell through, and I eventually interviewed my statistics teacher, Nicole
Giancola. Mrs. Giancola is my statistics teacher this year and has taught me many things
about the wonders of statistics. Although not my initial choice, she was one of the first
names that came up when I was thinking of interviewees. She is experienced in the field of
statistics, and even wrote one of her papers on predicting volleyball scores with
statistics! That made her a great fit for my interview. She told me about her research in
college and how she came up with an equation to predict volleyball scores.
Her theory was that there was a way to tell how successful a team was going to be based
on season statistics: which statistics are most important, what trends they show, hitting
percentage, blocks, etc. She used an ANOVA test to find the regression of all the variables.
X was hitting percentage, U was serve percentage, with the outputted equation being her
product. She also emphasized that one needs to look at a wide spectrum.
Although the interview gave me minimal assistance, I still managed to obtain some
inspiration, and I continued my research. I went back to researching variables and
methods of analysis more deeply. One method of analysis that I specifically went back to
was Twitter. Although I had just glanced over the idea before, I realized that it
could become a full-fledged approach. As I was looking more into Twitter posts, a study from
Cardiff University caught my interest. This model was very innovative considering that most
approaches to the game of association football are not unique. As mentioned in the abstract,
"existing models have not been modernized to account for the 21st century phenomena of big
data that could be used to raise the accuracy of such models" (Smith, 2015). By combining
Twitter with traditional statistical analysis, the researcher successfully created an algorithm
that mapped and predicted outcomes of matches. The researcher combined text classification
with betting odds provided by bookmakers, and found a more accurate model to predict soccer
scores (Smith, 2015).
Smith used the Twitter Streaming API to obtain his data. Tweets can be
collected efficiently any time there is a mention. Smith also used MongoDB to store
this large dataset. MongoDB is an open source document-based database that works well
for storing JSON-like data, which fits the project because tweets are delivered as JSON. He
also used a Python library, specifically Twitter-tap, to save his tweets. Lastly, he used
NLTK, a natural language toolkit, to process the text from the tweets. Smith then
classified the tweets into three categories, win, draw, and lose, by implementing a Naive
Bayes model, a simple yet strong classification model. The Naive Bayes classifier aims to find
out the predicted classification of a tweet, given the words in that tweet and prior knowledge
we have about the frequency of all words the model is trained on, with an associated class
label (Smith, 2015). Smith then used a Poisson distribution model to estimate the chances of
a variety of events happening (given that one knows the average number of times that each
event happens), and discovered that tweets are, in fact, effective indicators for predicting
matches (Smith, 2015).
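To illustrate the idea behind a Naive Bayes text classifier, here is a minimal from-scratch sketch in plain Python. It is not Smith's implementation (which used NLTK); the training tweets are invented, the function names are my own, and add-one (Laplace) smoothing is an assumption chosen to keep unseen word and class combinations from zeroing out a score.

```python
import math
from collections import Counter, defaultdict


def train(labelled_tweets):
    """Count how often each word appears under each class label."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_tweets:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing: every vocabulary word gets a pseudo-count.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training tweets, invented for illustration only.
training = [
    ("we will win easily today", "win"),
    ("great form sure win", "win"),
    ("boring game probably a draw", "draw"),
    ("even teams draw likely", "draw"),
    ("injuries everywhere we will lose", "lose"),
    ("terrible defence going to lose", "lose"),
]
model = train(training)
prediction = classify("sure we win today", model)
```

With so few training tweets the classifier is only a toy; the point is the structure: word frequencies per class, combined with the prior frequency of each class.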


After looking into Twitter analysis more, I realized that many researchers and
companies are interested in this method as well, and that it is a very hot topic in the field
of sports analytics. In fact, Betfuze, a company specializing in bet selection and analysis,
has already created a crowdsourced Twitter sentiment engine called the "Sentimeter" and
combines its output with odds from major bookmakers (Betfuze, 2014). I also found that two
researchers from University College London, Adamides and Kampakis, have done
a fair share of research on this topic. The two researchers used Twitter, especially considering
that it has been a notable source for predictive modeling of stocks and disease spread. They
used a three-month batch of tweets and historical data and measured the effectiveness
of the analysis with Cohen's kappa, and found it to be considerably accurate (Adamides &
Kampakis). The data was obtained through the spring season, and used tweets from
approximately 2 million fans worldwide, tracking hashtags such as #toffees, #arsenal, and
#comeonyoujacks in order to track support, while eliminating potential outliers in the form
of multiple team support and teams with similar names. The data was then processed using
TwitterNLP, combined with historical datasets, and then classified using a Random
Forest method. The method was found to be very accurate and proved to be
credible (Adamides & Kampakis).
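Cohen's kappa measures how much better a set of predictions agrees with the actual results than chance alone would, using kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below computes it in plain Python; the predicted and actual outcomes are hypothetical values for illustration, not data from Adamides and Kampakis.

```python
from collections import Counter


def cohens_kappa(predicted, actual):
    """Cohen's kappa for two equal-length lists of class labels."""
    n = len(actual)
    # Observed agreement: fraction of matches where prediction == result.
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    # Chance agreement: sum over labels of the product of each side's
    # label frequency, as proportions.
    pred_freq = Counter(predicted)
    actual_freq = Counter(actual)
    labels = set(predicted) | set(actual)
    expected = sum(pred_freq[label] * actual_freq[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical predictions versus actual outcomes for eight matches.
predicted = ["win", "win", "draw", "lose", "win", "lose", "draw", "win"]
actual = ["win", "draw", "draw", "lose", "win", "win", "draw", "lose"]
kappa = cohens_kappa(predicted, actual)
```

A kappa of 1 means perfect agreement and a kappa near 0 means the predictions are no better than chance, which is why it is a stricter yardstick than raw accuracy when one outcome is much more common than the others.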
After looking into several facets of predicting matches, gathering information, and
writing my blog posts based on that information, I was satisfied with my results and felt
ready to wrap my research up.

The Search Results


Soccer is becoming a game of statistics. From every pass to every goal, big data
will dictate whether a team wins or not. As a result of my research, I have found that I can
program an efficient system for sorting a variety of data in order to predict soccer scores with
programming and big data. This is very possible. But how?
Big data allows us to analyze data in real time and in big quantities, hence the name
(SAS). The range of ways in which big data can be applied is astounding.
It can be used to analyze wars, genocides, sports, and disasters around the
world. If one efficiently uses a large data set related to sports, results for soccer
can be predicted accurately. Large data sources such as Transfermarkt and Twitter will
help immensely, and be the driving force in a project like this.
Volume, velocity, and variety: these are the three important things to consider
when working with big data. With big data technology one can now analyze and bring
together data of different types (SAS). This is especially relevant for soccer,
where it is crucial that one is up to date with the data to make the best managerial or
betting choices. When predicting soccer scores, there are many ways one can approach
the problem. Variables created in real time during a game can be utilized to predict scores
by compiling them and comparing them among teams (Snyder, 2013). For example, the
German national team during the 2014 World Cup used Match Insights to
analyze team and opponent conditions, predict probable results, and find ways to
prevent negative ones (Veips, 2014). A similar method was also used by search
engines during the 2014 World Cup in order to predict scores (Soni, 2014).
Something that caught my eye was analyzing social media in order to predict
scores. After reading Smith's paper, which was one of the catalysts for the direction of
my project, I became more interested in how applications in sports work with Twitter. I
looked more closely into Smith's paper and other sources using Google Scholar. Twitter
is an interesting beast. Its potential has not yet been fully explored or realized.
Twitter can be combined with other statistical factors and be used to create an accurate
prediction method for soccer (Ulmer). But one must keep in mind that one needs a vast
amount of data in order to carry this out.
One must combine Twitter with other analytical factors in order to create an efficient and
successful system to sort and analyze soccer data. One must use not only factors like text
classification, but other factors such as betting odds as well (Smith, 2015). Also, as Mrs.
Giancola said, one must find a trend, a correlation per se, in a big data set in order to find
the important statistic or to see if the system actually does predict accurately (which can be
done simply by running a test on a previous match and comparing the prediction with the actual
result).
However, one cannot simply take a raw set of data and derive conclusions
from it; one needs to process it first. This is where coding comes into play. Regarding
coding, SQL is used for storing data (Kiourtzoglou, 2013). Hadoop can make the job of
storing and analyzing data easier as well. However, for soccer, as seen in Smith's study,
MongoDB will work better as it stores JSON-like data better. We will also need a tool
to collect and save the data, for example a Python library, and a tool to analyze our results,
such as NLTK, a text analyzer. From there, we can classify the results (in Smith's case, tweets)
into categories by implementing a model such as a Naive Bayes model or a Random Forest model
(Adamides & Kampakis). Lastly, as with many analysis projects, one can use a model, such
as a Poisson distribution model, to estimate the chances of a variety of events (Smith, 2015),
thus reaching a conclusion.
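As a rough sketch of the Poisson step, the probability of a team scoring exactly k goals when it averages lam goals per match is lam^k * e^(-lam) / k!. The Python below turns two such averages into win, draw and loss probabilities. Treating the two teams' goals as independent is a simplifying assumption of this sketch, the scoring rates are hypothetical, and this is not presented as the exact model Smith built.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def outcome_probabilities(home_rate, away_rate, max_goals=10):
    """Return (home win, draw, away win) probabilities.

    Assumes each team's goals follow an independent Poisson distribution and
    ignores scorelines above max_goals, which are extremely unlikely.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win


# Hypothetical average goals per match for the home and away teams.
home_win, draw, away_win = outcome_probabilities(1.6, 1.1)
```

The scoring rates would in practice come from stored match data, and the resulting probabilities could be compared against bookmakers' odds or combined with a tweet classifier's output.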
From Poisson to Python, the statistical analysis is not as simple as it seems. It
requires batches and batches of data: big data. This data then has to be carefully
analyzed and interpreted in a way that is logically sound and plausible. Through this,
one can create a sound way to accurately predict scores and make the
best sporting decision possible.

My Growth As A Researcher
I actually learned a lot about doing research as a result of this project. For one thing,
doing the research took a lot more time than I thought it would. I had to use my time during
the SOL free blocks in order to get the research done. Also, I learned how to contact someone
and conduct an interview. Although I didn't get too much out of this due to the negative
experience I had, I did learn something, and that was all that mattered to me. My writing
skills improved drastically as well. I also learned how to use resources such as Google
Scholar that I hadn't used before but will use in my next research project, and
how to evaluate information from websites and find credible sites. In addition
to these academic benefits, I got a lot of information that will help me continue my project for
my senior capstone.

What Next?
This research project has sparked a lot of interest for my future endeavors as a
student. Statistical analysis is one of them. I think I can make a great senior project with
statistical analysis, as it is so versatile and applies to fields such as the stock market. Maybe I
could even become a statistics researcher? It also made me hopeful for a possible internship with
the Pohang Steelers (Korean soccer teams are generally willing to let high schoolers and
non-qualified persons work with the club for promotion) or with a university, working on sports
analysis or statistics. If I choose that option, I would also like to expand my research to different
leagues and make a concrete product instead of a theoretical one. I would also like to see if
my research actually applies to all leagues in the world.

References
Betfuze. (2014, January). Betfuze. Retrieved May 26, 2016, from
http://betfuze.com/press/BETFUZE-Launch-January2014.pdf

Davenport, T. (n.d.). Three big benefits of big data analytics. Retrieved May 12,
2016, from http://www.sas.com/en_ca/news/sascom/2014q3/Big-data-davenport.html

Kampakis, S., & Adamides, A. (n.d.). Using Twitter to predict football outcomes. Retrieved
May 26, 2016, from https://arxiv.org/ftp/arxiv/papers/1411/1411.1243.pdf

Kiourtzoglou, B. (2013, April 5). What is Big Data Theory to Implementation. Retrieved
May 26, 2016, from https://www.javacodegeeks.com/2013/04/what-is-big-datatheory-to-implementation.html

SAS. (n.d.). What is Big Data and why it matters. Retrieved May 26, 2016, from
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

SAS. (n.d.). What is Hadoop? Retrieved May 12, 2016, from
http://www.sas.com/en_my/insights/big-data/hadoop.html

Smith, K. (2015). Predicting association football match outcomes using social media and
existing models. Retrieved May 26, 2016, from
https://www.cs.cf.ac.uk/PATS2/@archive_file?c=&p=file&p=388&n=final&f=1C1148334_Final_Report.pdf

Snyder, J. (2013, May 3). What Actually Wins Soccer Matches: Prediction of the 2011-2012
Premier League for Fun and Profit. Retrieved April 27, 2016, from
https://homes.cs.washington.edu/~jasnyder/papers/thesis.pdf

Soni, A. (2014, July 13). Germany to win FIFA World Cup 2014; predicts Google, Microsoft
and Baidu. Retrieved May 9, 2016, from http://yourstory.com/2014/07/germanyargentina-fifa-world-cup-2014/

Ulmer, B. (n.d.). Predicting soccer results in the English Premier League. Retrieved April
27, 2016, from http://ttic.uchicago.edu/~kgimpel/papers/sinha dyer gimpel
smith.mlsa13.pdf and http://cs229.stanford.edu/proj2014/Ben Ulmer, Matt Fernandez,
Predicting Soccer Results in the English Premier League.pdf

Veips, L. (2014, September). Big Data Sports: How Data Is Changing The Future Of Sports.
Retrieved May 12, 2016, from http://cloudtweaks.com/2014/09/big-data-big-sportsdata-changing-future-sports/
