Monday, 10 March 2014

Testing of Football Stats: Part 1 - Descriptive

As mentioned in my post last week, I've done some testing of a few different football metrics to gauge how well they describe what has happened and how well they can predict or project what will happen in the future.

My first step was to do a 'broad sweep' using a correlation matrix, this being a grid describing the correlation between multiple variables. With the historical fixture data I've now got (as displayed in these tables - not updated from the weekend yet) I am able to line up every team's goals for/against, expected goals, shots, shots on target, points, etc. chronologically, and then look at the average of each stat from the past x games and/or the average of each team's next x games.

To start with I've looked back and forward 38 games. With the method I've used this doesn't split the data into chunks of one season to the next, but looks at a rolling 38-game trend.
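For anyone curious how the rolling past/next averages can be lined up, here's a minimal sketch in pandas. The numbers and column set are made up, and I've used a window of 3 instead of 38 so the toy data actually produces values:

```python
import pandas as pd

# Hypothetical per-match data for one team, in chronological order
df = pd.DataFrame({
    "GF": [2, 0, 1, 3, 1, 2, 0, 1],                  # goals for
    "xG": [1.4, 0.8, 1.1, 2.2, 0.9, 1.6, 0.5, 1.2],  # expected goals
})

WINDOW = 3  # 38 in the post; 3 here so the toy data produces values

# Average of each stat over the past WINDOW games (including the current one)
past = df.rolling(WINDOW).mean().add_prefix("past_")

# Average over the *next* WINDOW games: reverse, roll, reverse back,
# then shift so the current game itself is excluded
future = df[::-1].rolling(WINDOW).mean()[::-1].shift(-1).add_prefix("next_")

combined = pd.concat([past, future], axis=1)
print(combined)
```

Each row then holds a team's trailing averages alongside its forward-looking ones, ready to be correlated.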

What Just Happened?

The first correlation matrix is descriptive. It's looking at how well stats from the last 38 games correlate with other stats from the last 38 games.

Click = Big

The colour code is Excel's conditional formatting: blue is the strongest positive correlation, red is the strongest negative.
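For anyone who'd rather build the same kind of grid outside of Excel, a correlation matrix is a one-liner in pandas. The columns and values below are purely illustrative:

```python
import pandas as pd

# Toy rolling averages for five teams (hypothetical values)
stats = pd.DataFrame({
    "GF":  [2.1, 1.4, 1.8, 0.9, 1.2],
    "GA":  [1.0, 1.3, 0.8, 1.6, 1.4],
    "xG":  [1.9, 1.3, 1.7, 1.0, 1.1],
    "PPG": [2.2, 1.3, 2.0, 0.8, 1.1],
})

# Pairwise Pearson correlations between every column
corr = stats.corr()
print(corr.round(2))
```

The diagonal is always 1.0 (every stat correlates perfectly with itself), and the off-diagonal cells are what the colour coding highlights.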

The two things I am most interested in are Goals For (GF) and Points Per Game (PPG). Looking straight at goals scored then (the top row), the strongest correlation is with PPG. This means that if you didn't know how many goals a team had scored, the points they'd accumulated would give you the best idea. Looking at Points Per Game (last column), we can see the same is true of points: Goals For is the strongest correlation. This confirms what we already know about football, of course. Goals win games.

The strongest underlying stat for describing Goals For is xGT. This is a secondary version of the Expected Goals model I have which only uses shots on target instead of all shots. I've noticed it is a better descriptor of actual goals in the past and when doing previous testing. However, it doesn't feel right to 'throw away' any shot that is off target or is blocked. A penalty is a good example of why. If a team has a penalty they are expected to score it; a penalty has an xG value in my model of 0.78. However, if the player balloons said penalty over the bar the xGT model rates that as 0.0, as if it never happened.

I can understand why xGT trumps xG as a descriptive metric of goals scored - getting more and better quality shots on target is of course going to lead to more goals - but using xGT would assume that shot accuracy is a sustainable skill, and that is not something I can readily assume at this point. Also, as you'll see in my follow-up post, xGT is not as good a predictor of future goals as xG is, which goes some way to backing up its non-sustainability. However, I do believe shot accuracy could be an important bridge between shots/xG and goals/results, so I will definitely have to investigate this further.
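The penalty example can be sketched in a few lines, assuming xGT simply drops any shot that misses the target. The 0.78 penalty value is from my model above; the other shot values are made up:

```python
# Each shot: (xG value, on_target flag)
shots = [
    (0.78, False),  # penalty ballooned over the bar - off target
    (0.10, True),
    (0.05, False),  # blocked
    (0.30, True),
]

xg = sum(value for value, _ in shots)               # all shots count
xgt = sum(value for value, on in shots if on)       # on-target shots only

print(xg, xgt)  # the missed penalty costs xGT 0.78 of "expectation"
```

The big chance still shows up in xG but vanishes entirely from xGT, which is exactly the information-loss worry described above.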

I've included a couple of my metrics, xGR and xGD (and their xGT versions). xGR is Expected Goals ratio, xGF / (xGF + xGA); xGD is Expected Goal difference, xGF - xGA. These are the two metrics I was most interested in testing as simple, single values to describe a team's overall strength. You can see the correlation between xGR and xGD is almost 1.0, and in practice I believe there's really little difference between them, but of the two xGD does a slightly better job of explaining goals and points, although their individual components look better.
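The two formulas as code, using made-up season totals:

```python
def xgr(xgf: float, xga: float) -> float:
    # Expected Goals ratio: a team's share of the total xG in its matches
    return xgf / (xgf + xga)

def xgd(xgf: float, xga: float) -> float:
    # Expected Goal difference
    return xgf - xga

# Toy example: a team with 60 xG for and 40 against over a season
print(xgr(60.0, 40.0))  # 0.6
print(xgd(60.0, 40.0))  # 20.0
```

Both collapse attack and defence into a single number, which is why they track each other so closely.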

The poorest relationships in the matrix between goals and underlying data are with Goals Against. This again is something I've observed over the last couple of seasons. The correlations aren't bad, but overall they are not as good as with Goals For, particularly with PPG, and the same could be said of actual goals conceded. It doesn't look like conceding goals matters that much to results, with the proviso that you are scoring them too. Sounds obvious, right? But what it means to me is that being a team who can win 4-2 on average is better than being one who wins 1-0 on average.

I've also thrown a few oddball-type metrics in, namely Chance Quality (xGF/shots, xGA/shots) and Goal Conversion (GF%, GA%), as well as xGDO, which is a combination of the latter similar to PDO. These are all the missing links between underlying stats like xG and the real thing. I was curious to see how their correlations compared, and you can see they are all pretty weak. They are obviously really important, especially in big games like a 6-pointer against league rivals, but I think this demonstrates that quantity and quality of chances is still the biggest determinant of overall success, rather than boasting high chance conversion or save stats.
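A rough sketch of the first two metrics, based on my reading of the definitions above (the exact construction of xGDO isn't spelled out here, so it's left out; all inputs are made up):

```python
def chance_quality(xgf: float, shots_for: int) -> float:
    # Average xG per shot taken: higher means better quality chances
    return xgf / shots_for

def conversion(gf: int, shots_for: int) -> float:
    # Goal conversion: the share of shots that became goals
    return gf / shots_for

# Toy season: 55.2 xG and 52 goals from 480 shots
print(chance_quality(55.2, 480))
print(conversion(52, 480))
```

Both are per-shot rates, so they strip out shot volume entirely, which may be part of why they correlate so weakly with overall results.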

There's obviously a lot to look at in this matrix, so if anyone spots anything interesting or contradictory that I've missed please discuss it with me in the Comments section. Also, as mentioned briefly at the beginning, a correlation matrix is only a coarse view of how a bunch of variables stack up - a quick heads-up, if you will. Next up I'll look at What Happens Next, using the same data set to examine correlations between results and stats from a team's last 38 games and their next 38.
