Xomnia is a word combining the letter ‘X’ – the unknown – and “Omnia” – Latin for everything. Our team of data scientists and big data engineers are trained to find the undefined – X – in all the relevant data sources – Omnia. This unknown – X – is untapped business value. Combining the X and Omnia you get the Xomnia spirit. Eager, curious and dedicated people, who have the belief that the future is big data.

Want to feel the Xomnia spirit? Follow us

# Blog

XomniaBlogBlog: Who will win the football league? A data-driven prediction

# Blog: Who will win the football league? A data-driven prediction

## Football prediction blog

The Dutch football league, de Eredivisie, is coming into the final stages. Nothing is decided yet, as the prizes will (usually) not be given out until May. The football matches played up to now give an indication which teams are the contenders for the title and which teams have to fear for relegation. In this blog, we will give a data-driven prediction of the winning football team.

Using multilevel modeling we discovered a pattern in the football matches played up to now. This pattern can be used to simulate the remaining matches. Using these simulations we can predict an expected ranking. Our approach to learning this model is the Bayesian one, we will sample possible models using the python package pymc3.

### The model

Football isn’t a complicated game when you think about it. Two teams try to score by playing the ball into the other team’s goal. We model the number of goals for both teams as independent random variables. The parameters, which encode attacking potential, defensive capabilities and home advantages of each team, are learned from the football matches played.

This model is a variation on existing approaches. The hierarchical modeling of the home advantages may be new, we haven’t seen it before. The model is pretty simple and doesn’t take into account current form or the fact that Robin van Persie just joined Feyenoord.

### Results

Before diving into the implementation details, we provide an overview of the results. This is to accommodate all you coaches on the couch who are not necessarily interested in non-centered specification! Let’s first look at the parameters that the model has inferred for the Dutch teams. We sorted the positions on attacking rate.

The attack rate corresponds to the number of goals scored on average, while the defense rate is a score that measures the number of goals prevented. In fact, on average, ignoring the home advantage, this model estimates the number of goals in the next football match between Vitesse and AFC Ajax for example.

\begin{align} \text{goals by Vitesse} &= \exp\left(\text{Vitesse attack rate} – \text{Ajax defense rate}\right) \\ &\approx \exp\left(0.5 – 0.2\right) \approx 1.3 \, \text{goals}\\ \text{goals by Ajax} &= \exp\left(\text{Ajax attack rate} – \text{Vitesse defense rate}\right) \\ &\approx \exp\left(0.7 – 0.1\right) \approx 1.8 \, \text{goals} \end{align}
This is a tight matchup. Which might swing in favor of AFC Ajax.
At the moment, PSV is leading the competition. Our model thinks AFC Ajax does slightly better on both offense and defense. This is not surprising considering that Ajax has scored more and accepted less goals this year. Feyenoord, fifth in the ranking, does not have the scoring potential to challenge the first two. However, they are defending among the best.

Most teams in the Dutch competition seem to lower their defenses at home while increasing their scoring potential, perhaps to offer their home crowd a good show? Only PSV, AFC Ajax, PEC Zwolle, and Heracles Almelo defend better at home.

If we run simulations based on these ratings, the results at the end of this season are predicted as shown below.

According to this model, PSV is the big favorite. They have a seven-point lead over AFC Ajax at the moment. However, with their good scoring potential and solid defense, the model gives AFC Ajax a 15% probability of winning the competition.

Positions are less certain at the bottom and the middle of the pack, where a single victory can make a large difference in terms of ranking. The model predicts a lower final ranking for PEC Zwolle, ranked 6th at the moment. This is a reasonable prediction if you take into account their current goal tally.

The same model is applied to the big leagues in Europe. You can find the results of these models below. It’s interesting to note that the Dutch league is one of the more exciting competitions. The victories of super teams like FC Bayern Munchen, FC Barcelona, and Manchester City seem certain. The most exciting competition is the Italian Serie A, with Juventus being the only slightly more likely to win the competition than SSC Napoli.

### Model evaluation

What this approach does not accommodate for, is the effect of a new trainer appointed during the winter break. No ‘Dick Advocaat –‘ (Sparta Rotterdam) or ‘Erik ten Hag –‘ (AFC Ajax) effect can be predicted. The effect of a transfer is also not estimated directly. Basically, the model infers the strengths over the season so far and extrapolates this over the rest of the season. As such, it is probably too certain of the outcome, as transfer effects, injuries, and changes in form can really make the second half of the season different from what we have seen so far.

The first thing to incorporate in this model would be something that tracks the changes in form over the season. Additionally other effects as for example the weather, referee or time of day that a football match is played. There is also a real effect of fatigue, some teams play multiple football matches in a week because they’re involved in a European League or the National Cup. The effect of fatigue will matter less in the second half of the season. Dutch teams traditionally don’t reach the second round in these competitions. However, it could influence the interpretation of earlier results. It would be interesting to be able to check how big this effect is. The cool thing about this model in pymc3 is that it is relatively easy to expand the model and add extra parameters if you, as a data scientist and part-time football coach, think it is of importance.

A necessary step that we omitted here is an evaluation of this model over the previous years.

That said, these simulations lead to what we think are very reasonable probabilities. Taking well into account the strength of the opponents played and those that are still on schedule. The model might also be useful in practice to give teams an indication of what they should improve on.

It is nice that we needed a relatively small amount of data in order to learn this model. In the age of Big Data, not all data sets are large and this technique offers a great way to build robust models. That also gives some certainty about their uncertainty. Additionally, it is possible to interpret the parameters in the model. This all makes Bayesian multilevel modeling extremely powerful. Thanks also to the NUTS-sampler of course, and the pymc3 developers that makes it so accessible.

As a bonus, we give predictions for the big leagues down here. After that, it will be Nerds Only, as we dive into the implementation. If you’re interested in more details, or up-to-date predictions, feel free to reach out!

“Als ik zou willen dat je het begreep, dan had ik wel beter uitgelegd!” – Johan Cruyff

### Bundesliga

In Germany, their best team is FC Bayern Munchen. Given their current point advantage, we’re very sure they will win. After the first position, things are still exciting with many teams in contention for second place. Funny note, compared to the Dutch league, the home advantages are much larger in Germany.

### Primera Division

In Spain, the top spot is pretty much taken. Atlético Madrid is also pretty certain of its second place. This is not due to their scoring potential, but their defensive capabilities have shown to be rivaled only by FC Barcelona (the number 1).

#### Serie A

This is the most exciting competition. Both Juventus and SSC Napoli are in real contention for the overall win. Teams are not especially defensive when playing at home and the home advantages are not as large as in the German division.

### Premier League

In England, we are confident about Manchester City ending up on the top spot. Again, there are some interesting differences in defensive capabilities. Burnley FC is doing well, despite having a very low goal count. West Bromwich Albion FC, the current lowest ranked team, still have chances of reaching up to 15th or even 12th position.

## Implementation

We implement a Bayesian multilevel model using,pymc3 a package that implements the No-U-Turn-Sampler and is built on Theano. To understand what’s going on requires knowledge of Bayesian modeling and the pymc3 package. If you’re new to that, we recommend for example the online workbook Probabilistic Programming and Bayesian Methods for Hackers.

We view a football match as two independent Poisson processes. Each produces a number of goals for each side. Using an exponential link function, the rate parameters are a sum of attacking rate, respective defense rates, and a bonus on both attacking and defending for the home team.

home goals ~ Poisson(exp(μhome))
away goals ~ Poisson(exp(μaway))

Where μ is the rate parameter in each direction.

# Import the necessary packages   import pymc3 as pm import pandas as pd import seaborn as sns   %pylab inline   np.random.seed(10)

Populating the interactive namespace from numpy and matplotlib

 # read in the data, downloaded from http://www.football-data.co.uk/data.php   comp = pd.read_feather("./data/nl_comp_07-02-2018.feather") comp.sample(10)
hometeam awayteam date fthg ftag played
111 Den Haag Ajax 2017-09-17 1.0 1.0 True
281 Twente Zwolle NaT NaN NaN False
236 PSV Eindhoven NAC Breda NaT NaN NaN False
127 Twente VVV Venlo 2017-08-19 1.0 2.0 True
188 Heracles NAC Breda 2017-11-25 2.0 1.0 True
1 Sparta Rotterdam Roda 2017-01-10 1.0 2.0 True
59 Roda Groningen 2017-10-12 2.0 2.0 True
130 Den Haag Heracles 2017-11-19 4.0 1.0 True
279 Twente Groningen NaT NaN NaN False
229 Heracles AZ Alkmaar NaT NaN NaN False

The data is read from the feather format, and cleaning up has been done to make sure the data types are correct. In particular, we make use of the categorytype in pandas. In this post we use it a lot to translate between index positions in the pymc3 model and the team names.

For our “predictors” we use a theano.shared variable. The predictors contain ‘which team was playing at home’ and ‘which team was playing away’ for each football match. This allows us to change these predictors for the football matches that we want predictions for, and use the compiled model to efficiently sample simulations.

 import theano   n_teams = len(comp.hometeam.cat.categories)   played = comp[comp.played] not_played = comp[~comp.played]   ## create shared predictors hometeam = theano.shared(played.hometeam.cat.codes.astype('int8').values) awayteam = theano.shared(played.awayteam.cat.codes.astype('int8').values)

### A good attack starts with a strong defence

In the first set of definitions below, we see the attack_rate for each team. This corresponds to the number of goals that each team scored in the first half of the season.

However, there is another important aspect to football, defense. In order to win games, a team must not only score goals, it also has to prevent the other team from scoring more than they do. And just as with scoring goals, some teams are better at this.

The first part of the model is the attack and defense rates for each team. These are vectors of length n_teams, and we put a hierarchical prior on these parameters, with weak hyperpriors. Keep in mind these parameters will be translated from log scale by the exponential link function. In this model, a non-centered version is necessary for efficient sampling. You can read the following blog for some background. Important point of note, make sure that the defense rates are centered on zero.

 with pm.Model() as model:   attack_hyper = pm.Normal('attack_hyper', 0, 1) attack_hyper_sd = pm.Gamma('attack_hyper_sd', 2, 2) attack_rate_nc = pm.Normal('attack_rate_nc', 0, 1, shape = n_teams) attack_rate = pm.Deterministic('attack_rate', attack_hyper + attack_rate_nc * attack_hyper_sd)   defense_rate_nc = pm.Normal('defense_rate_nc', 0, 1, shape = n_teams) defense_hyper_sd = pm.Gamma('defense_hyper_sd', 2, 2) defense_rate = pm.Deterministic('defense_rate', defense_rate_nc * defense_hyper_sd)

### The twelfth man

Another established effect on the outcome of a football is the home advantage. 60% of football matches are decided in favor of the home side. We determined that a team has both defensive and attacking qualities. We hypothesize that the home advantage can work on both these qualities. On these, we again use a hierarchical prior, with weak hyperpriors.

 with model:   home_attack_hyper_sd = pm.Gamma('hahsd', 2, 2) home_attack_hyper = pm.Normal('hah', 0, 1) home_attack_nc = pm.Normal('hah_nc', 0, 1, shape = n_teams) home_attack_advantage = pm.Deterministic('home_aadv', home_attack_hyper + home_attack_nc * home_attack_hyper_sd)   home_defense_hyper = pm.Normal('hdh', 0, 1) home_defense_hyper_sd = pm.Gamma("hdhsd", 2, 2) home_defense_nc = pm.Normal('hdh_nc', 0, 1, shape = n_teams) home_defense_advantage = pm.Deterministic('home_dadv', home_defense_hyper + home_defense_nc * home_defense_hyper_sd)

The number of expected goals for a team is the exponential of the sum of the right parameters.

with model:   home_diff = attack_rate[hometeam] - defense_rate[awayteam] + \ home_attack_advantage[hometeam]   away_diff = attack_rate[awayteam] - defense_rate[hometeam] - \ home_defense_advantage[hometeam]

The number of goals is a Poisson parameter with an exponential link. We’re not accounting for any overdispersion, which is probably a weakness of the model. However, it won’t influence the predictions that much.

 with model:   home_goals = pm.Poisson('home_goals', pm.math.exp(home_diff), observed = played.fthg.values) away_goals = pm.Poisson('away_goals', pm.math.exp(away_diff), observed = played.ftag.values)

We use two chains to be able to assess the mixing. Sampling is quite efficient and takes half a minute on our laptop. The trace plot looks healthy!

 with model:   trace = pm.sample(1000, tune=500, njobs=2)   pm.traceplot(trace);

With some data-fu, we reorganize the posterior samples to be able to compare the rates per team. To keep things readable, we only plot the medians, which will disappoint the true Bayesian warriors among you.

 team_names = comp.hometeam.cat.categories post = pm.trace_to_dataframe(trace, varnames=["attack_rate", "defense_rate", "home_aadv", "home_dadv"])   post = post.stack().reset_index() post.columns = "sample", "param", "value"   q = post.param.str.contains("__")   post.loc[q, "team"] = team_names[post[q].param.str.split('__').str.get(-1).astype(int)] post["type"] = post.param.str.split('__').str.get(0)   labels = { "attack_rate": "Attack rate", "defense_rate": "Defense rate", "home_aadv": "Home attack advantage", "home_dadv": "Home defense advantage" }   ax = post.groupby(['type', 'team']).value.median().unstack(0).sort_values("attack_rate").\ rename(columns = labels).plot.bar(subplots = True, figsize = (10, 10), legend = False)   plt.tight_layout()

By setting new values for the shared predictor variables, we can use the compiled model to get some simulated outcomes. This is really quick, taking a couple of seconds for 1000 simulated competitions!

 ## simulate matches   hometeam.set_value(not_played.hometeam.cat.codes.astype('int8').values) awayteam.set_value(not_played.awayteam.cat.codes.astype('int8').values)   with model:   ppc = pm.sample_ppc(trace, samples = 1000)   ppc
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1540.72it/s]

{'away_goals': array([[2, 2, 0, ..., 2, 0, 1],
[3, 1, 0, ..., 1, 0, 1],
[2, 0, 3, ..., 2, 1, 0],
...,
[1, 1, 3, ..., 1, 2, 1],
[0, 1, 0, ..., 1, 0, 3],
[3, 2, 0, ..., 0, 2, 5]]),
'home_goals': array([[1, 1, 1, ..., 3, 0, 5],
[0, 0, 1, ..., 0, 2, 3],
[1, 4, 1, ..., 0, 0, 3],
...,
[2, 2, 1, ..., 0, 0, 1],
[1, 1, 0, ..., 3, 1, 5],
[2, 2, 1, ..., 6, 1, 3]])}
 ppc["home_goals"].shape
(1000, 81)

Now we have a 1000 simulated competitions, which has 81 football matches to be played. Each of these 1000 simulations is the full remaining match schedule. Let’s have a look at 2 simulated outcomes for the home matches of AFC Ajax.

 comp.loc[~comp.played, "ftag"] = ppc["away_goals"][0] comp.loc[~comp.played, "fthg"] = ppc["home_goals"][0]   comp[(comp.hometeam == "Ajax") &amp; ~comp.played]
hometeam awayteam date fthg ftag played
283 Ajax Heracles NaT 4.0 2.0 False
284 Ajax VVV Venlo NaT 3.0 0.0 False
285 Ajax AZ Alkmaar NaT 4.0 5.0 False
286 Ajax Heerenveen NaT 3.0 0.0 False

 comp.loc[~comp.played, "ftag"] = ppc["away_goals"][1] comp.loc[~comp.played, "fthg"] = ppc["home_goals"][1]   comp[(comp.hometeam == "Ajax") &amp; ~comp.played]
hometeam awayteam date fthg ftag played
283 Ajax Heracles NaT 3.0 0.0 False
284 Ajax VVV Venlo NaT 3.0 1.0 False
285 Ajax AZ Alkmaar NaT 1.0 2.0 False
286 Ajax Heerenveen NaT 5.0 0.0 False

There is considerable variance here, but overall the predictions for AFC Ajax‘s home games are pretty positive.

Below, we show the ranking as a result of one simulation.

 def get_standing(comp, ppc, simulation):   comp.loc[~comp.played, "ftag"] = ppc["away_goals"][simulation] comp.loc[~comp.played, "fthg"] = ppc["home_goals"][simulation]   comp["homepoints"] = (3 * (comp.fthg &gt; comp.ftag)).fillna(0) + (1 * (comp.fthg == comp.ftag)).fillna(0) comp["awaypoints"] = (3 * (comp.ftag &gt; comp.fthg)).fillna(0) + (1 * (comp.fthg == comp.ftag)).fillna(0)   standing = comp.groupby("hometeam").homepoints.sum().to_frame().join( comp.groupby("awayteam").awaypoints.sum() ).join( comp.groupby("hometeam").fthg.sum() ).join( comp.groupby("awayteam").ftag.sum() ).join( comp.groupby("hometeam").ftag.sum().rename("homegoals_against") ).join( comp.groupby("awayteam").fthg.sum().rename("awaygoals_against") )   standing["points"] = standing.homepoints + standing.awaypoints standing["goals"] = standing.fthg + standing.ftag standing["goals_against"] = standing.homegoals_against + standing.awaygoals_against standing["tally"] = standing.goals - standing.goals_against   return standing.sort_values(["points", "tally", "goals"], ascending = False).assign( rank = list(range(1, len(standing) + 1)) ).reset_index()   get_standing(comp, ppc, 0)
hometeam homepoints awaypoints fthg ftag homegoals_against awaygoals_against points goals goals_against tally rank
0 PSV Eindhoven 47 37 44.0 40.0 14.0 27.0 84 84.0 41.0 43.0 1
1 Ajax 40 31 47.0 38.0 17.0 17.0 71 85.0 34.0 51.0 2
2 AZ Alkmaar 32 37 34.0 36.0 21.0 17.0 69 70.0 38.0 32.0 3
3 Feyenoord 36 30 35.0 35.0 18.0 17.0 66 70.0 35.0 35.0 4
4 Utrecht 36 25 33.0 27.0 20.0 29.0 61 60.0 49.0 11.0 5
5 Zwolle 32 23 26.0 17.0 20.0 25.0 55 43.0 45.0 -2.0 6
6 Vitesse 27 27 35.0 28.0 22.0 20.0 54 63.0 42.0 21.0 7
7 Den Haag 27 27 25.0 20.0 22.0 22.0 54 45.0 44.0 1.0 8
8 VVV Venlo 24 23 28.0 20.0 25.0 23.0 47 48.0 48.0 0.0 9
9 Groningen 26 15 27.0 23.0 23.0 33.0 41 50.0 56.0 -6.0 10
10 Heracles 33 8 33.0 21.0 28.0 48.0 41 54.0 76.0 -22.0 11
11 Excelsior 16 23 13.0 24.0 19.0 27.0 39 37.0 46.0 -9.0 12
12 Heerenveen 16 21 20.0 29.0 35.0 31.0 37 49.0 66.0 -17.0 13
13 Twente 21 8 22.0 15.0 27.0 32.0 29 37.0 59.0 -22.0 14
14 Willem II 18 11 27.0 19.0 29.0 40.0 29 46.0 69.0 -23.0 15
15 Roda 17 11 22.0 22.0 30.0 41.0 28 44.0 71.0 -27.0 16
16 NAC Breda 16 8 24.0 11.0 38.0 26.0 24 35.0 64.0 -29.0 17
17 Sparta Rotterdam 17 5 17.0 13.0 30.0 37.0 22 30.0 67.0 -37.0 18

Not surprisingly, PSV will win the 2017/2018 league and AFC Ajax will be the runner-up! You can compare the prediction to the current standing. There are all kinds of goal counts here, and we could probably fill a magazine going on about all the details or a tv-show. But instead, let’s generate a heatmap with the final projections, which are an aggregation of 200 simulated seasons like the one above.

 n_teams = len(comp.hometeam.cat.categories)   simulations = pd.concat([get_standing(comp, ppc, i) for i in range(200)]) simulations.groupby("hometeam").rank.value_counts(normalize = True).unstack(1).fillna(0)   rankings_agg = simulations.groupby("hometeam").rank.value_counts(normalize = True).unstack(1).fillna(0) rankings_agg = rankings_agg.assign(expected = rankings_agg @ np.arange(n_teams + 1, 1, -1)).sort_values("expected", ascending = False).drop("expected", axis = "columns")   plt.figure(figsize = (10, 10)) sns.heatmap(rankings_agg, annot=True, fmt=".0%", cbar=False, cmap="hot")   plt.ylabel("") plt.tight_layout()

The visualization above shows, for every team, for every position, the percentage of simulations that the team ended up at that position. Based on this heat map, we can see that the remainder of the season is not as exciting as hoped for. PSV has a very high chance of ending on top. At the bottom, we see that Sparta Rotterdam will most likely have to face relegation. This last conclusion is quite remarkable, as Sparta Rotterdam now trails second to last, Roda JC, with just two points. However, Sparta Rotterdam‘s inability to score goals will break them up in the end.

### Bye!

That was it. If you want to know more, reach out to Xomnia via our contact form! We can help you build models with your own data or provide training, for example in Bayesian modeling.

Cheers – David & Gijs, Data Scientists