Sport Statistics Case Study

Estimation of final standings in football competitions with premature ending: the case of COVID-19

Research question

This case study is based on the paper by Gorgi, P., Koopman, S. J. and Lit, R. (2020), which can be downloaded here.

The data for this study is provided by https://www.football-data.co.uk.

The socio-economic impact of COVID-19 on our society has been overwhelming. Sport events have been no exception and have been heavily affected by the pandemic. Several ongoing sport competitions, including some of the main European football competitions, have ended prematurely. The premature ending of a football competition raises the issue of how to settle its final table. This has prompted public debate in the media (newspapers, radio and TV) and on social media. The final standings of a competition are important because they determine promotions and relegations and select the teams that take part in international competitions in the next season.

A possible solution is to take the position of the teams in the table at the time when the competition ended prematurely, which we refer to as the incomplete standings. In principle, the final standings should reflect the expected performance in the remaining games and deliver a fair ranking of the football teams. However, the incomplete standings suffer from some drawbacks for this purpose. The strength of the opposing teams in the remaining part of the competition may differ among teams. One team may have already played against all the strong teams in the competition while another team may still need to face the stronger opponents. This creates an imbalance and favors teams that have strong opponents left in the games after the premature ending. Another shortcoming of using the incomplete standings concerns home and away games. The presence of a significant home ground advantage in football matches is well documented in the literature. Different teams can have a different number of home and away games left to be played, and this favors teams that have already played more home games before the premature ending of the competition.

We consider an alternative model-based approach that takes into account the strength of the opposing teams as well as the home ground advantage. We measure the performance of the teams by means of a statistical paired-comparison model.

Statistical method

The approach we propose is designed to take the above factors into consideration. We measure the performance of the teams in the season by means of a paired-comparison model as in Maher (1982). The performance of the teams is obtained using only the outcomes of matches that have already been played in the same season. On the basis of this measured performance, we determine the final standings using the model-implied expected number of points in the remaining games. We should emphasize that our analysis is based on a paired-comparison model that has been widely used to model and predict football matches, often extended with regression variables as well as time-variation in the strength of the teams. In particular, the paired-comparison model for outcomes of football matches of Maher (1982) is adopted in many studies, including Dixon and Coles (1997), Koopman and Lit (2015), and Koopman and Lit (2019).

We denote the outcome of a football match between the home team $i$ and the away team $j$ as a pair of counts $(X_i,Y_j)$ for $i,j\in \{1,\dots,n\}$, $i\neq j$, where $X_i$ is the number of goals scored by the home team $i$, $Y_j$ is the number of goals scored by the away team $j$, and $n$ indicates the number of teams in the competition. We describe the match outcome by means of a bivariate Poisson paired-comparison model $$(X_i,Y_j)\sim \mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma),$$ where $\mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma)$ denotes a bivariate Poisson distribution with intensity $\lambda_{ij}^x$ for the home team count, intensity $\lambda_{ij}^y$ for the away team count, and coefficient $\gamma$ for the dependence between the two counts. The superscripts in $\lambda_{ij}^x$ and $\lambda_{ij}^y$ are labels for the home and away counts, not powers. The probability mass function (pmf) of the bivariate Poisson $\mathcal{BP}(\lambda_{ij}^x,\lambda_{ij}^y, \gamma)$ is given by \begin{equation}\label{pmf} \mathbb{P}(X_i=x, Y_j=y)=\frac{(\lambda_{ij}^x)^{x} \, (\lambda_{ij}^y)^{y}}{x! \, y! \, e^{\lambda_{ij}^x+\lambda_{ij}^y+\gamma}}\sum_{k=0}^{\min\{x,y\}}\binom{x}{k}\binom{y}{k}k!\left(\frac{\gamma}{\lambda_{ij}^x\lambda_{ij}^y}\right)^k, \end{equation} for $x,y \in \{0,1,2,\dots\}$.

The intensities $\lambda_{ij}^x$ and $\lambda_{ij}^y$ determine the difference in expected goals between the home and away teams. The intensities are specified as $$\lambda_{ij}^x=\exp(\delta+\alpha_i+\beta_j), \quad \text{and} \quad \lambda_{ij}^y=\exp(\alpha_j+\beta_i),$$ where $\alpha_k$ represents the attacking ability of team $k$, $k=i,j$, $\beta_k$ represents the defending ability of team $k$, $k=i,j$, and $\delta$ is the home ground advantage. The specification for the intensities originates from Maher (1982). It accounts for the different strength levels of the teams $(\alpha_k,\beta_k)$, $k=1,\dots,n$, as well as the home ground advantage $\delta$ to determine the probability distribution of the match outcome.
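As an illustration, the intensities and the pmf can be evaluated with a few lines of Python. This is a minimal sketch and not the implementation used in the paper or in any software package: the function names `intensities` and `bp_pmf` are hypothetical, the pmf is written in its standard form (each intensity raised to the power of the corresponding goal count, with the sum running up to $\min\{x,y\}$), and $\gamma \geq 0$ is assumed.

```python
from math import comb, exp, factorial


def intensities(alpha_i, beta_i, alpha_j, beta_j, delta):
    """Return (lambda_x, lambda_y) for home team i against away team j."""
    # With this sign convention a larger beta_j raises the home intensity.
    lam_x = exp(delta + alpha_i + beta_j)  # home scoring intensity
    lam_y = exp(alpha_j + beta_i)          # away scoring intensity
    return lam_x, lam_y


def bp_pmf(x, y, lam_x, lam_y, gamma):
    """Bivariate Poisson probability of the score x-y."""
    base = (exp(-(lam_x + lam_y + gamma))
            * lam_x**x / factorial(x)
            * lam_y**y / factorial(y))
    ratio = gamma / (lam_x * lam_y)
    series = sum(comb(x, k) * comb(y, k) * factorial(k) * ratio**k
                 for k in range(min(x, y) + 1))
    return base * series


# Hypothetical parameter values, chosen only to exercise the functions.
lam_x, lam_y = intensities(alpha_i=0.2, beta_i=-0.1, alpha_j=0.0, beta_j=0.1, delta=0.3)
print(bp_pmf(2, 1, lam_x, lam_y, gamma=0.1))
```

With $\gamma = 0$ the sum reduces to its $k=0$ term and the pmf factorizes into two independent Poisson probabilities, which gives a quick sanity check for the code.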
In most applications of the bivariate Poisson model for sports data, the objective is to specify $\alpha_k$ and $\beta_k$ to best predict the outcomes of future matches. For instance, the model can be extended with other covariates that may explain the strength levels of the teams. However, in our current study the objective is to determine a final ranking that reflects the performance of the teams in the matches that have been played earlier in the season. We achieve this by estimating the parameters of the model $(\alpha_1,\dots,\alpha_n,\beta_1,\dots,\beta_n,\delta,\gamma)$ using the method of maximum likelihood (ML), based only on the data of the current season. In this way, the estimated intensities only reflect the performance of the teams in the current season.

Once the parameters have been estimated, we can obtain the expected number of points of each team in the remaining games and construct the final table by using the model-implied expected number of points at the end of the season. In particular, we first calculate the winning probability of the home team $p_h$, the winning probability of the away team $p_a$, and the probability that the match ends with a draw $p_d$ as follows $$p_h=\sum_{y=0}^\infty\sum_{x=y+1}^\infty \mathbb{P}(X_i=x, Y_j=y),$$ $$p_a=\sum_{x=0}^\infty\sum_{y=x+1}^\infty \mathbb{P}(X_i=x, Y_j=y),$$ and $$p_d=\sum_{z=0}^\infty \mathbb{P}(X_i=z, Y_j=z),$$ where the expression of the pmf is given above.

Based on these probabilities, we calculate the expected number of points of the home and away teams. We consider the system of assigning 3 points to the winning team, 0 to the losing team, and 1 point to each of the teams if the game ends with a draw. This system is the standard in most football competitions. The expected number of points of the home team $ep_{h}$ and of the away team $ep_{a}$ are $$ep_{h} = 3 p_h + p_d, \quad \text{and} \quad ep_{a} = 3 p_a + p_d.$$ The final table is then obtained by summing the expected number of points of each team over its remaining games and adding these expected points to the points of the incomplete table.
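The calculation of the outcome probabilities and the expected points can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the infinite sums are truncated at `max_goals` goals per team, which should be harmless for typical football intensities, all function names are hypothetical, and the team labels and parameter values in the example are made up rather than estimated from data.

```python
from math import comb, exp, factorial


def bp_pmf(x, y, lam_x, lam_y, gamma):
    """Bivariate Poisson probability of the score x-y (standard form)."""
    base = exp(-(lam_x + lam_y + gamma)) * lam_x**x * lam_y**y / (factorial(x) * factorial(y))
    ratio = gamma / (lam_x * lam_y)
    return base * sum(comb(x, k) * comb(y, k) * factorial(k) * ratio**k
                      for k in range(min(x, y) + 1))


def outcome_probabilities(lam_x, lam_y, gamma, max_goals=25):
    """Approximate (p_h, p_d, p_a); the infinite sums are cut off at max_goals."""
    p_h = p_d = p_a = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = bp_pmf(x, y, lam_x, lam_y, gamma)
            if x > y:
                p_h += p  # home win
            elif x == y:
                p_d += p  # draw
            else:
                p_a += p  # away win
    return p_h, p_d, p_a


def expected_points(lam_x, lam_y, gamma):
    """Expected points (ep_h, ep_a) under the 3-1-0 system."""
    p_h, p_d, p_a = outcome_probabilities(lam_x, lam_y, gamma)
    return 3 * p_h + p_d, 3 * p_a + p_d


def project_table(points, remaining, alpha, beta, delta, gamma):
    """Add the expected points of the remaining fixtures to the incomplete table."""
    final = dict(points)
    for home, away in remaining:
        lam_x = exp(delta + alpha[home] + beta[away])
        lam_y = exp(alpha[away] + beta[home])
        ep_h, ep_a = expected_points(lam_x, lam_y, gamma)
        final[home] += ep_h
        final[away] += ep_a
    return sorted(final.items(), key=lambda item: item[1], reverse=True)


# Made-up example: two teams level on points with one fixture left.
points = {"A": 10, "B": 10}
alpha = {"A": 0.0, "B": 0.0}  # hypothetical attack parameters
beta = {"A": 0.0, "B": 0.0}   # hypothetical defence parameters
for team, pts in project_table(points, [("A", "B")], alpha, beta, delta=0.3, gamma=0.1):
    print(f"{team}: {pts:.2f}")
```

In this toy example the two teams are equally strong, so the only thing separating them in the projected table is the home ground advantage of team A in the remaining fixture.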

Time Series Lab Analysis

We describe the necessary steps to replicate Tables 2 and 3 of Gorgi, P., Koopman, S. J. and Lit, R. (2020) with Time Series Lab - Sports Statistics Edition (TSL - SSE). We show how to calculate the model-based prediction of the final standings for the French competition in a few simple steps.

The frontpage of TSL - SSE, which is visible right after the program starts, is shown in Figure 1.

Figure 1: frontpage of Time Series Lab - Sports Statistics Edition.

After pressing the Get started button on the frontpage you will be taken to the Load data step of the program. Click Load data and a file selection window opens. Navigate to the data folder, which is located in the folder where TSL - SSE is installed. Ctrl-click the files F11920.csv and F11920_remaining.csv so that both files are highlighted, then click the Open button. Alternatively, the data can be downloaded from the Research section. After loading the data, the screen should look like the one in Figure 2. The upper right corner of the screen indicates whether the correct dataset is loaded: it shows the number of matches per team and the total number of teams in the competition. For the French competition these are 38 and 20, respectively. Since not all scheduled matches were played in the 2019-2020 season, the dataset contains many missing values.

Figure 2: load data section of Time Series Lab - Sports Statistics Edition.
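The incomplete standings, which are the starting point of the analysis, can also be computed directly from the played matches. The sketch below assumes a CSV file with the football-data.co.uk column names `HomeTeam`, `AwayTeam`, `FTHG` and `FTAG` (full-time home and away goals) and assumes that unplayed matches have empty score fields. The files shipped with TSL - SSE may be laid out differently. The function name `incomplete_standings` and the sample rows are hypothetical and do not reproduce real results.

```python
import csv
import io
from collections import defaultdict


def incomplete_standings(csv_text):
    """Points per team from played matches; rows without a score are skipped."""
    points = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        home, away = row["HomeTeam"], row["AwayTeam"]
        points[home] += 0  # make sure both teams appear in the table
        points[away] += 0
        if not row["FTHG"] or not row["FTAG"]:
            continue  # match not played, treated as missing
        home_goals, away_goals = int(row["FTHG"]), int(row["FTAG"])
        if home_goals > away_goals:
            points[home] += 3
        elif home_goals < away_goals:
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    return dict(points)


# Made-up rows for three hypothetical teams; the last match was not played.
sample = """HomeTeam,AwayTeam,FTHG,FTAG
A,B,0,3
B,C,1,1
C,A,,
"""
print(incomplete_standings(sample))  # {'A': 0, 'B': 4, 'C': 1}
```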

Click the Step 2 button, which leads to the Model setup page. Select the Bivariate Poisson distribution and tick the boxes in front of Replace missing values with Expectations and Print final table. A screenshot of the required selections is given in Figure 3.

Figure 3: Model settings.

Click the Step 3 button which leads to the Estimate page. Click Estimate to start model estimation. After the program is done maximizing the likelihood, output is printed to the Main page of the program. The model-based prediction of the final standings in the French competition is printed on screen as in Figure 4. This printed output matches the results presented in Table 2 of Gorgi, P., Koopman, S. J. and Lit, R. (2020).

Figure 4: Final results.
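For readers who want to see what maximizing the likelihood involves, the objective function can be sketched in Python. This is an illustration under assumptions and not a description of how TSL - SSE works internally: matches are assumed to be independent, so the log-likelihood is a sum of log pmf values over the played matches, $\gamma \geq 0$ is assumed, the function names are hypothetical, and the toy results are made up.

```python
from math import comb, exp, factorial, log


def bp_pmf(x, y, lam_x, lam_y, gamma):
    """Bivariate Poisson probability of the score x-y (standard form)."""
    base = exp(-(lam_x + lam_y + gamma)) * lam_x**x * lam_y**y / (factorial(x) * factorial(y))
    ratio = gamma / (lam_x * lam_y)
    return base * sum(comb(x, k) * comb(y, k) * factorial(k) * ratio**k
                      for k in range(min(x, y) + 1))


def log_likelihood(matches, alpha, beta, delta, gamma):
    """Log-likelihood of played matches given as (home, away, home_goals, away_goals)."""
    total = 0.0
    for home, away, x, y in matches:
        lam_x = exp(delta + alpha[home] + beta[away])
        lam_y = exp(alpha[away] + beta[home])
        total += log(bp_pmf(x, y, lam_x, lam_y, gamma))
    return total


# Made-up results for two hypothetical teams, not real data.
matches = [("A", "B", 3, 0), ("B", "A", 1, 2)]
beta = {"A": 0.0, "B": 0.0}
strong_a = log_likelihood(matches, {"A": 0.5, "B": 0.0}, beta, delta=0.2, gamma=0.1)
weak_a = log_likelihood(matches, {"A": -0.5, "B": 0.0}, beta, delta=0.2, gamma=0.1)
print(strong_a > weak_a)  # a stronger attack for A fits these toy results better
```

A numerical optimizer would search over all parameters to maximize this function. Because adding a constant to every $\alpha_k$ and subtracting it from every $\beta_k$ leaves the intensities unchanged, some normalization (for example fixing the sum of the $\alpha_k$) is needed for the parameters to be identified; which normalization the software uses is not stated here.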

References

Dixon, M. J. and Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46(2):265–280.

Gorgi, P., Koopman, S. J. and Lit, R. (2020). Estimation of final standings in football competitions with premature ending: the case of COVID-19. Working paper.

Koopman, S. J. and Lit, R. (2015). A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(1):167–186.

Koopman, S. J. and Lit, R. (2019). Forecasting football match results in national league competitions using score-driven time series models. International Journal of Forecasting, 35(2):797–809.

Maher, M. J. (1982). Modelling association football scores. Statistica Neerlandica, 36(3):109–118.