Is there a normal number of goals scored in a season for a striker? To answer this, one may be tempted to just take the mean of the goals scored by every player in a season. If we do this for last season, the mean is 1.83. Of course, this is misleading. There isn't really such a thing as a "normal" number of goals scored in a season.
The reason for this is that goals scored does not have a normal distribution, the bell curve we are used to. For example, if you looked at the distribution of heights in a population, you would see a nice bell curve. Most people are right around the average height, and as you go towards the extremes either way (really short or really tall) you find fewer and fewer people. Therefore, the mean of heights in the population is instructive because it gives us the "normal" or "typical" height.
The problem is, goals scored in a season does not follow a normal distribution. Instead, most players score no goals at all. The next most common number of goals scored last season? Just one goal, of course. The pattern continues as the tallies get higher, and it follows a power law distribution.
While I don't understand a lot of the mathematical intricacies of the power law distribution, I can explain generally what it does. A good example is from earthquakes. Most earthquakes are relatively small. As the size of the earthquake increases, its frequency decreases. Specifically, for every magnitude 4 earthquake there are roughly 10 magnitude 3 earthquakes and roughly 100 magnitude 2 earthquakes. This pattern holds across magnitudes.
My idea was that a similar power law distribution could be used to model the number of goals scored in a season. It turns out that it does a good job, which has a lot of implications for predicting goals scored in a season. Below is a histogram of the number of goals scored from the 2011-2012 EPL season. The data is from the MCFC Analytics Basic dataset. As you can see, a lot of players scored just 1 goal, fewer scored 3, and fewer still scored 10.
So how is it possible to fit a function to this distribution? It turns out that if we take the natural log of both variables (count and goals) and plot the new points, we should see that the points form a line. (Players with no goals drop out at this step, since the log of zero is undefined.) As you can see from the graph below, the linear relationship is pretty strong (R-squared = .91). This suggests that a power law describes the original data well.
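For anyone who wants to try the log-log fit themselves, here is a minimal sketch in Python. It is not the code behind the graph: the function name fit_power_law is my own, and the demo numbers are synthetic counts generated from a made-up power law (k = 200, n = -1.5), not the MCFC data. Goal tallies of zero, and tallies nobody reached, have to be left out before calling it because their logs are undefined.

```python
import math

def fit_power_law(goals, counts):
    """Least-squares line through (ln goals, ln count).

    Returns (n, k, r_squared) for the model count = k * goals**n.
    """
    xs = [math.log(g) for g in goals]
    ys = [math.log(c) for c in counts]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                      # this is n
    intercept = mean_y - slope * mean_x    # this is ln(k)
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, math.exp(intercept), r_squared

# Synthetic demo data (NOT the real season): exact power law with k=200, n=-1.5
demo_goals = list(range(1, 11))
demo_counts = [200.0 * g ** -1.5 for g in demo_goals]

n, k, r2 = fit_power_law(demo_goals, demo_counts)
print(round(n, 3), round(k, 1), round(r2, 3))
```

Because the demo counts lie exactly on a power law, the fit should hand back the k and n that generated them, with an R-squared of essentially 1. Real data will be noisier.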
A power law follows the equation y = k * x^n. In this case, we have the equation count = k * goals^n. The n in this equation is just the slope of the log-log graph above. The slope is -1.63, so n = -1.63. To find k, we raise e to the y-intercept of the log-log graph. The y-intercept is 5.471, so k = exp(5.471) = 161.61. (A note on the arithmetic: exp(5.471) actually comes to about 237.7, and k = 161.61 corresponds to an intercept of about 5.085, so one of these two figures is mistyped. The value k = 161.61 is the one used in the rest of this post.) So the equation that fits the first graph is count = 161.61 * (goals^-1.63). When I plot this equation over the distribution of goals from the first graph, it is clear that there is a nice fit.
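To see why the slope and intercept of the log-log line give n and k, it may help to take logs of the power law directly. This is only a short derivation sketch; the symbol b for the y-intercept is my own label, not something taken from the graph.

```latex
\begin{align*}
\text{count} &= k \cdot \text{goals}^{\,n} \\
\ln(\text{count}) &= \ln k + n \,\ln(\text{goals}) \\
\text{slope} &= n, \qquad b = \ln k \;\Rightarrow\; k = e^{b}
\end{align*}
```

In other words, a power law becomes a straight line once both variables are logged, which is why a strong linear fit on the log-log graph points to a power law in the original data.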
So what are the implications of all this? First, it says that predicting the number of goals scored in a season is very hard to do. Events that follow power law distributions are extremely hard to predict. Earthquakes, for example, are basically impossible to foresee. Even experts cannot reliably predict when an earthquake is going to happen. However, because they know the distribution they can say how often earthquakes of various magnitudes are expected to occur. To extend the analogy to goal scoring, it is very hard to predict the number of goals a given player will score in a season. However, from this analysis we know the overall distribution, so we can predict how often particular goal scoring tallies will happen.
For example, how often do we expect Premiership players to score 30 goals in a season? Well, if we plug in 30 to the model above, we get count = 161.61 * (30^-1.63) = .63. Therefore, we expect about .63 players per season to score 30 goals, which works out to roughly one 30-goal season every 1.6 seasons, or around 6 out of every 10 seasons. If you look, this is consistent with what has happened over the past 10 Premier League seasons.
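The arithmetic is easy to check in a few lines of Python. This is just a sketch: the function name expected_count is my own, and the defaults are the k and n from the fitted equation above.

```python
def expected_count(goals, k=161.61, n=-1.63):
    """Expected number of players per season with this goal tally,
    using the fitted model count = k * goals**n."""
    return k * goals ** n

per_season = expected_count(30)      # expected 30-goal scorers in one season
seasons_between = 1 / per_season     # average gap between such seasons

print(f"{per_season:.2f} players per season, "
      f"about one every {seasons_between:.1f} seasons")
```

Plugging in other tallies, for example expected_count(20), gives the expected number of players per season at that tally in the same way.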
To finish up, I am not saying that we have no idea if Owen Hargreaves has a good chance of outscoring van Persie next season. Obviously, a lot of people predicted that van Persie would lead the league in scoring this season and he has. However, if we take a step back and look at goal scoring from just this past season, we can fit a power law distribution that models goal scoring for 10 whole years.
PS: I'll be traveling for the next 3 months so I won't have consistent internet access and won't be able to make any posts. I'll try to stay updated on Twitter though so I'll still talk there as much as I can.