# Sport Statistics Case Study

Estimation of final standings in football competitions with premature ending: the case of COVID-19

### Research question

This case study is based on the paper by Gorgi, P., Koopman, S. J. and Lit, R. (2020), which can be downloaded here.

The data for this study is provided by https://www.football-data.co.uk.

The socio-economic impact of COVID-19 on our society has been overwhelming, and sport
events have been no exception. Several ongoing
sport competitions, including some of the main European football competitions, have
experienced a premature ending. The premature ending of a football competition raises
the issue of how to settle its final table. This has created some public debate in the media
(newspapers, radio and TV) and on social media. The final standings of a competition are
important to determine promotions and relegations and to select the teams that take part
in international competitions, for the next season. A possible solution to determine the
final standings is to consider the position of the teams in the table at the time when the
competition has prematurely ended, which we refer to as the incomplete standings. In principle, this should reflect the expected performance in the
remaining games and deliver a fair ranking of the football teams. However, the incomplete
standings suffer some drawbacks for this purpose. The strength of the opposing teams
in the remaining part of the competition may differ among teams. One team may have
already played against all the strong teams in the competition while another team may
still need to face the stronger opponents. This creates an imbalance and favors teams that
have strong opponents left in the games after the premature ending. Another shortcoming
of using the incomplete standings concerns home and away games. The presence of a
significant home ground advantage in football matches is well documented in the literature. Different teams can have a different number of home and
away games left to be played and this would favor teams that have already played more
home games before the premature ending of the competition.
We consider an alternative model-based approach that takes into account
the strength of the opposing teams as well as the home ground advantage. We measure
the performance of the teams by means of a statistical paired-comparison model.

### Statistical method

The approach we propose is designed to take the above factors into consideration. We measure the performance of the teams in the season by means of a paired-comparison model as in Maher (1982).
The performance of the teams is obtained only using the outcomes of matches
that have already been played in the same season.
On the basis of this measured performance,
we determine the final standings using the model-implied expected number of points
in the remaining games.
Our analysis is based on a paired-comparison model
that has been widely used to model and predict football matches,
often extended with regression variables and time-variation in the strength of the teams.
In particular, the paired-comparison model for outcomes of football matches
of Maher (1982) is adopted in many studies, including Dixon and Coles (1997), Koopman and Lit (2015), and Koopman and Lit (2019).
We denote the outcome of a football match between the home team $i$ and the away team $j$ as
a pair of counts $(X_i,Y_j)$ for $i,j\in \{1,\dots,n\}$, $i\neq j$, where $X_i$ is the
number of goals scored by the home team $i$, $Y_j$ is the number of goals scored by the
away team $j$, and $n$ indicates the number of teams in the competition.
We describe the match outcome by means of a bivariate Poisson paired-comparison model
$$(X_i,Y_j)\sim \mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma),$$
where $\mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma)$ denotes
a bivariate Poisson distribution with intensity $\lambda_{ij}^x$ for the home team count,
intensity $\lambda_{ij}^y$ for the away team count, and coefficient $\gamma$ for the dependence
between the two counts.
The probability mass function (pmf) of the bivariate Poisson
$\mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma)$ is given by
\begin{equation}\label{pmf}
\mathbb{P}(X_i=x, Y_j=y)=\frac{(\lambda_{ij}^x)^{x} \, (\lambda_{ij}^y)^{y}}{x! \, y! \, e^{\lambda_{ij}^x+\lambda_{ij}^y+\gamma}}\sum_{k=0}^{\min\{x,y\}}\binom{x}{k}\binom{y}{k}k!\left(\frac{\gamma}{\lambda_{ij}^x\lambda_{ij}^y}\right)^k,
\end{equation}
for $x,y \in \{0,1,2,\dots\}$.
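
As a rough illustration, the pmf can be evaluated directly in a few lines of Python. This is only a sketch: the function name `bivariate_poisson_pmf` and the example values are assumptions made for illustration, not part of the paper or of TSL - SSE. It uses the standard form of the bivariate Poisson pmf, in which each intensity is raised to the power of its own goal count and the sum runs up to $\min\{x,y\}$.

```python
from math import comb, exp, factorial


def bivariate_poisson_pmf(x, y, lam_x, lam_y, gamma):
    """P(X = x, Y = y) under BP(lam_x, lam_y, gamma).

    lam_x, lam_y: positive intensities of the home and away counts.
    gamma: non-negative dependence coefficient.
    """
    # Leading factor: lam_x^x * lam_y^y / (x! y! e^{lam_x + lam_y + gamma})
    scale = exp(-(lam_x + lam_y + gamma)) * lam_x**x * lam_y**y
    scale /= factorial(x) * factorial(y)
    # Finite sum over the shared component, k = 0, ..., min(x, y)
    ratio = gamma / (lam_x * lam_y)
    total = sum(
        comb(x, k) * comb(y, k) * factorial(k) * ratio**k
        for k in range(min(x, y) + 1)
    )
    return scale * total


# Hypothetical intensities, chosen only for illustration
prob_2_1 = bivariate_poisson_pmf(2, 1, lam_x=1.4, lam_y=1.1, gamma=0.1)
print(f"P(home scores 2, away scores 1) = {prob_2_1:.4f}")
```

With $\gamma=0$ the sum reduces to its first term and the pmf factorises into two independent Poisson pmfs, which gives a quick sanity check.
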
The intensities $\lambda_{ij}^x$ and $\lambda_{ij}^y$ determine the difference in expected goals between the home and away teams. The intensities are specified as
$$\lambda_{ij}^x=\exp(\delta+\alpha_i+\beta_j), \quad \text{and} \quad \lambda_{ij}^y=\exp(\alpha_j+\beta_i),$$
where $\alpha_k$ represents the attacking ability of team $k$, $k=i,j$, $\beta_k$ represents
the defending ability of team $k$, $k=i,j$, and $\delta$ is the home ground advantage.
The specification for the intensities originates from Maher (1982).
It accounts for the different strength level of the teams
$(\alpha_k,\beta_k)$, $k=1,\dots,n$, as well as the home ground advantage $\delta$
to determine the probability distribution of the match outcome.
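
The mapping from team parameters to intensities can be sketched as follows. The team labels, the parameter values, and the helper name `intensities` are all assumptions made for illustration. With the plus signs used in the specification above, a larger $\beta_j$ raises the scoring rate of the opponent of team $j$.

```python
from math import exp


def intensities(home, away, alpha, beta, delta):
    """Return (lam_x, lam_y) for a match between `home` and `away`.

    alpha, beta: dicts mapping a team label to its attack and defence
    parameter; delta: home ground advantage.
    """
    lam_x = exp(delta + alpha[home] + beta[away])  # home team intensity
    lam_y = exp(alpha[away] + beta[home])          # away team intensity
    return lam_x, lam_y


# Hypothetical parameter values for two teams
alpha = {"A": 0.3, "B": -0.1}
beta = {"A": -0.2, "B": 0.1}
delta = 0.25

lam_x, lam_y = intensities("A", "B", alpha, beta, delta)
print(f"lam_x = {lam_x:.3f}, lam_y = {lam_y:.3f}")
```
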

In most applications of the bivariate Poisson model for sports data,
the objective is to specify $\alpha_k$ and $\beta_k$
to best predict the outcomes of future matches. For instance, the model can be extended with other covariates that may explain the strength levels of the teams. However, in our current study the objective is to determine a final ranking that reflects
the performance of the teams in the matches that have been played earlier in the season.
We achieve this by estimating the parameters of the model
$(\alpha_1,\dots,\alpha_n,\beta_1,\dots,\beta_n,\delta,\gamma)$ using the method of
maximum likelihood (ML) and only based on the data of the current season.
In this way, the estimated intensities only reflect the performance of the teams in the
current season.
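
The objective function behind the ML step can be sketched as a negative log-likelihood over the matches played so far. This is a minimal sketch under stated assumptions, not the estimation routine of TSL - SSE: the names `bp_pmf` and `neg_log_likelihood`, the toy results, and the parameter values are hypothetical, and no optimiser is included. A numerical optimiser would also need an identifying restriction, because adding a constant to every $\alpha_k$ and subtracting it from every $\beta_k$ leaves all intensities unchanged. Which restriction is used is not stated here.

```python
from math import comb, exp, factorial, log


def bp_pmf(x, y, lam_x, lam_y, gamma):
    """Bivariate Poisson pmf P(X = x, Y = y)."""
    scale = exp(-(lam_x + lam_y + gamma)) * lam_x**x * lam_y**y
    scale /= factorial(x) * factorial(y)
    ratio = gamma / (lam_x * lam_y)
    return scale * sum(
        comb(x, k) * comb(y, k) * factorial(k) * ratio**k
        for k in range(min(x, y) + 1)
    )


def neg_log_likelihood(matches, alpha, beta, delta, gamma):
    """Sum of -log pmf over played matches.

    matches: iterable of (home, away, home_goals, away_goals).
    """
    nll = 0.0
    for home, away, x, y in matches:
        lam_x = exp(delta + alpha[home] + beta[away])
        lam_y = exp(alpha[away] + beta[home])
        nll -= log(bp_pmf(x, y, lam_x, lam_y, gamma))
    return nll


# Hypothetical results in which team A dominates
matches = [("A", "B", 2, 0), ("B", "C", 1, 1), ("C", "A", 0, 3)]

# Parameters that describe A as strong and C as weak ...
alpha = {"A": 0.4, "B": 0.0, "C": -0.4}
beta = {"A": -0.3, "B": 0.0, "C": 0.3}
nll_strong_a = neg_log_likelihood(matches, alpha, beta, delta=0.2, gamma=0.05)

# ... fit these results better than parameters treating all teams as equal
flat = {"A": 0.0, "B": 0.0, "C": 0.0}
nll_flat = neg_log_likelihood(matches, flat, flat, delta=0.2, gamma=0.05)

print(nll_strong_a < nll_flat)
```

ML estimation amounts to searching for the parameter values that make this quantity as small as possible, using only the matches of the current season.
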
Once the parameters have been estimated, we can obtain the expected number of points of each team in the remaining games and construct the final table by using the model-implied expected number of points at the end of the season. In particular, first we calculate the winning probability of the home team $p_h$, the winning probability of the away team $p_a$, and the probability that the match ends with a draw $p_d$ as follows
$$p_h=\sum_{y=0}^\infty\sum_{x=y+1}^\infty \mathbb{P}(X_i=x, Y_j=y),$$
$$p_a=\sum_{x=0}^\infty\sum_{y=x+1}^\infty \mathbb{P}(X_i=x, Y_j=y),$$
and
$$p_d=\sum_{z=0}^\infty \mathbb{P}(X_i=z, Y_j=z),$$
where the expression of the pmf is given above.
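
In practice the infinite sums have to be truncated. The sketch below accumulates the three probabilities over a finite grid of scores. The cut-off `max_goals=25`, the function names, and the example intensities are assumptions made for illustration. For intensities of the size usually seen in football, the probability mass beyond such a cut-off is negligible.

```python
from math import comb, exp, factorial


def bp_pmf(x, y, lam_x, lam_y, gamma):
    """Bivariate Poisson pmf P(X = x, Y = y)."""
    scale = exp(-(lam_x + lam_y + gamma)) * lam_x**x * lam_y**y
    scale /= factorial(x) * factorial(y)
    ratio = gamma / (lam_x * lam_y)
    return scale * sum(
        comb(x, k) * comb(y, k) * factorial(k) * ratio**k
        for k in range(min(x, y) + 1)
    )


def outcome_probabilities(lam_x, lam_y, gamma, max_goals=25):
    """Return (p_h, p_d, p_a), truncating each sum at max_goals."""
    p_h = p_d = p_a = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = bp_pmf(x, y, lam_x, lam_y, gamma)
            if x > y:
                p_h += p   # home win
            elif x == y:
                p_d += p   # draw
            else:
                p_a += p   # away win
    return p_h, p_d, p_a


# Hypothetical intensities for a single remaining match
p_h, p_d, p_a = outcome_probabilities(lam_x=1.6, lam_y=1.1, gamma=0.1)
print(f"p_h = {p_h:.3f}, p_d = {p_d:.3f}, p_a = {p_a:.3f}")
```
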
Based on these probabilities, we calculate the expected number of points of the home and away teams. We consider the system of assigning 3 points to the winning team, 0 to the losing team, and 1 point to each of the teams if the game ends with a draw. This system is the standard in most football competitions. The expected number of points of the home team $ep_{h}$ and the one of the away team $ep_{a}$ are
$$ep_{h} = 3 p_h + p_d, \quad \text{and} \quad ep_{a} = 3 p_a + p_d.$$
Finally, the final table is obtained by summing up the expected number of points
of each team in the remaining games and adding these expected points to the points
of the incomplete table.
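
The construction of the final table can be sketched as follows. The teams, current points, remaining fixtures, and outcome probabilities are made-up numbers used only to show the arithmetic, and the function names are assumptions. In an actual analysis the probabilities would come from the estimated model.

```python
def expected_points(p_h, p_d, p_a):
    """Expected points (home, away) under the 3-1-0 system."""
    return 3 * p_h + p_d, 3 * p_a + p_d


def final_table(current_points, remaining, probs):
    """Add expected points of remaining games to the incomplete table.

    current_points: dict team -> points at the premature ending.
    remaining: list of (home, away) fixtures not yet played.
    probs: dict (home, away) -> (p_h, p_d, p_a).
    Returns a list of (team, points) sorted from first to last.
    """
    totals = dict(current_points)
    for home, away in remaining:
        ep_h, ep_a = expected_points(*probs[(home, away)])
        totals[home] += ep_h
        totals[away] += ep_a
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical incomplete standings: A leads B by one point
current_points = {"A": 50, "B": 49, "C": 30}
# B still has a home game, A still has an away game
remaining = [("B", "C"), ("C", "A")]
probs = {
    ("B", "C"): (0.60, 0.25, 0.15),
    ("C", "A"): (0.50, 0.30, 0.20),
}

table = final_table(current_points, remaining, probs)
for team, points in table:
    print(f"{team}: {points:.2f}")
```

In this toy example team B overtakes team A, because B's remaining home game is worth more expected points than A's remaining away game. This is the kind of correction the incomplete standings cannot make.
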

### Time Series Lab Analysis

We describe the necessary steps to replicate Tables 2 and 3 of Gorgi, P., Koopman, S. J. and Lit, R. (2020) with the use of Time Series Lab - Sports Statistics Edition (TSL - SSE).
We show how to calculate the model-based prediction of the final standings for the French competition in a few simple steps.

The frontpage of TSL - SSE, which is visible right after the program starts, is shown in Figure 1.

After pressing the *Get started* button on the frontpage you will be taken to the *Load data* step of the program.
Click *Load data* and a file selection window opens up.
Navigate to the data folder, which is located in the folder where TSL - SSE is installed.
Ctrl-click the files *F11920.csv* and *F11920_remaining.csv* so that both files are highlighted, followed by clicking the *open* button.
Alternatively, the data can be downloaded from the Research section.
After loading the data, the screen should look like the one in Figure 2.
An indication that the correct dataset is loaded is given in the upper right corner of the screen.
It shows the number of matches per team and the total number of teams in the competition.
For the French competition these are 38 and 20, respectively.
Since not all scheduled matches were played in the 2019-2020 season, many missing values are part of the dataset.

Click the *Step 2* button which leads to the *Model setup* page.
Select the Bivariate Poisson distribution and tick the boxes in front of *Replace missing values with Expectations* and *Print final table*.
A screenshot of the mandatory selections is given in Figure 3.

Click the *Step 3* button which leads to the *Estimate* page.
Click *Estimate* to start model estimation.
After the program is done maximizing the likelihood, output is printed to the *Main page* of the program.
The model-based prediction of the final standings in the French competition is printed on screen as in Figure 4. This printed output matches the results presented in Table 2 of Gorgi, P., Koopman, S. J. and Lit, R. (2020).

# Bibliography

### References

Dixon, M. J. and Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. *Journal of the Royal Statistical Society: Series C (Applied Statistics)*, 46(2):265–280.

Gorgi, P., Koopman, S. J. and Lit, R. (2020). Estimation of final standings in football competitions with premature ending: the case of COVID-19. *Working paper*.

Koopman, S. J. and Lit, R. (2015). A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. *Journal of the Royal Statistical Society: Series A (Statistics in Society)*, 178(1):167–186.

Koopman, S. J. and Lit, R. (2019). Forecasting football match results in national league competitions using score-driven time series models. *International Journal of Forecasting*, 35(2):797–809.

Maher, M. J. (1982). Modelling association football scores. *Statistica Neerlandica*, 36(3):109–118.