FIFA World Cup 2018; random forests; soccer; sports tournaments; team abilities; Social Sciences (miscellaneous); Decision Sciences (miscellaneous)
Abstract :
In this work, we propose a new hybrid modeling approach for the scores of international soccer matches that combines random forests with Poisson ranking methods. While the random forest is based on the competing teams' covariate information, the ranking methods estimate ability parameters from historical match data that adequately reflect the current strength of the teams. We compare the new hybrid random forest model to its separate building blocks as well as to conventional Poisson regression models with regard to their predictive performance on all matches from the four FIFA World Cups 2002-2014. It turns out that incorporating the team ability parameters from the ranking methods into the random forest as an additional covariate improves the predictive power substantially. Finally, the hybrid random forest is used (in advance of the tournament) to predict the FIFA World Cup 2018. To complete our analysis of the previous World Cup data, the 64 matches of the 2018 tournament serve as an independent validation data set, and we are able to confirm the compelling predictive potential of the hybrid random forest, which clearly outperforms all other methods including the betting odds.
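The abstract describes a model for match scores but does not spell out how predicted scores translate into win, draw, and loss probabilities. The following is a minimal sketch of one common way to do this, assuming (this is not stated in the abstract) that the two teams' goal counts are independent Poisson variables. The function names and the expected-goals values `1.6` and `0.9` are hypothetical and merely stand in for the output of a score model such as the hybrid random forest.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a: float, lam_b: float, max_goals: int = 15):
    """Return (P(A wins), P(draw), P(B wins)) for two independent Poisson scores.

    lam_a and lam_b are hypothetical expected goals for teams A and B.
    The score grid is truncated at max_goals per team and renormalised,
    so the three probabilities sum to one.
    """
    win_a = draw = win_b = 0.0
    for goals_a in range(max_goals + 1):
        p_a = poisson_pmf(goals_a, lam_a)
        for goals_b in range(max_goals + 1):
            p = p_a * poisson_pmf(goals_b, lam_b)
            if goals_a > goals_b:
                win_a += p
            elif goals_a == goals_b:
                draw += p
            else:
                win_b += p
    total = win_a + draw + win_b
    return win_a / total, draw / total, win_b / total


# Illustrative (made-up) expected goals for a single match.
p_win, p_draw, p_loss = outcome_probabilities(1.6, 0.9)
print(f"A wins: {p_win:.3f}, draw: {p_draw:.3f}, B wins: {p_loss:.3f}")
```

Truncating the score grid keeps the computation finite. With expected goals in the range typical for soccer, the probability mass beyond 15 goals per team is negligible, and the renormalisation absorbs what remains.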
Disciplines :
Mathematics; Engineering, computing & technology: Multidisciplinary, general & others
Author, co-author :
Groll, Andreas; TU Dortmund University, Faculty of Statistics, Dortmund, Germany
LEY, Christophe ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Mathematics (DMATH)
Schauberger, Gunther; Technische Universitaet Muenchen, Department of Sport and Health Sciences, Munich, Bavaria, Germany
Van Eetvelde, Hans; Ghent University, Department of Applied Mathematics, Computer Science and Statistics, Campus Sterre, Ghent, Belgium
External co-authors :
yes
Language :
English
Title :
A hybrid random forest to predict soccer matches in international tournaments