If you want to see how I created the models, check out this post.

And if you haven’t seen the Economist blog post from a couple weeks back comparing Messi to Ronaldo using the data, read it here.

A lot of people have reached out to me asking for the data or have been trying to manually gather it from the applet. If you’re interested in using the data then just reach out to me at soccerstatistically@gmail.com and I’d be happy to send all the raw data to you, provided you reference this blog when you use it.

Finally, now that the calculator is fixed I can focus on some other work I’ve been doing. I’ve admittedly been absent from posting here for a while. I have a few posts I’ve been working on recently, so expect some new stuff coming soon...

To see how various (FIFA-defined) continents have fared in past World Cups, I used historical World Cup data collected from 11v11.com (here is an example from the United States’ page: http://www.11v11.com/teams/usa/tab/stats/comp/978). These results include all World Cup and World Cup qualifying games, which is what I limited my analysis to. Qualifying games differ a little from World Cup games, but since they are almost always played between countries from the same continent, and I drop intra-continent games anyway, I think it’s OK to include them. What defines a continent is pretty hazy, so I stuck with FIFA’s definitions. This means Australia is counted as part of Asia, among other anomalies, but it’s the most consistent way to divide the world. The continents I ended up using were Africa, Asia, CONCACAF, Europe, Oceania, and South America.

If you want to look at the code I wrote to do the analysis (the data scraping, the actual analysis, and the visualization), head over to https://github.com/fordb/wc-continent-headtohead

There’s nothing too crazy going on in the analysis, just a lot of graphs to look at.

Each one shows how each individual continent has historically fared against every other continent. I focused on goals for (GF) and goals against (GA) per game because these can be compared no matter how many games have occurred between two continents. GF and GA are also a little more representative than wins, losses, and draws; Nate Silver preferred them for this reason when he created his Soccer Power Index.

From these, it’s clear that Europe and South America have historically dominated. Oceania is a bit of an outlier simply because so few countries from there qualify for the World Cup. Oceania teams appear to have dominated CONCACAF teams in the limited games between them, but we probably can’t infer much from such a small sample.

CONCACAF teams have historically scored just under 1.5 goals per game against European teams, and closer to 1.4 goals per game against South American teams, while conceding 2 and just over 2 goals per game, respectively.

How does this compare to this year’s World Cup? To compare I made another set of graphs, this time comparing each continent’s historical GF per game/GA per game vs. their GF per game/GA per game in this World Cup.

CONCACAF countries did not actually score much more than in the past. They fared a little better against African and South American countries, but were about level with their historical results against European countries. However, their GA per game dropped significantly this World Cup: CONCACAF’s (relative) success came from better defense. This isn’t that surprising, especially given Costa Rica’s run.

European countries really struggled scoring goals compared to their historical average, while staying about the same in GA per game.

South America has historically been dominant and stayed that way. The only drop-off was in scoring against CONCACAF countries, as mentioned earlier.

Asian countries stayed about on par with their previous performances, although this was already fairly poor.

African countries gave up more goals, but were about on par with their past goal scoring.

It’s important to keep in mind that while these graphs are interesting, the results from this World Cup should be taken with a grain of salt. There are not many games, which means we might see some strange goals-for or goals-against averages. So while something may look like a big change from previous results, it could simply reflect that only one or two games were played between countries from two continents.

While I am not really interested in betting on soccer myself, odds do provide an interesting estimate of the probability of an outcome occurring. For example, take Arsenal's home game against Chelsea this past year. Bet365 put the odds of an Arsenal victory at 2.38. These decimal odds imply a probability of an Arsenal victory of about 42% (1/2.38). Taking into account that odds makers lower the payouts so that they make money, the adjusted probability of an Arsenal victory is about 41.1%.
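The arithmetic above can be sketched in a few lines. This is only an illustration, and the original analysis was done in R: only the 2.38 home odds come from the post, while the draw and away odds below are invented for the example.

```python
# Convert decimal odds to implied probabilities, then strip out the
# bookmaker's margin (overround) by normalizing. Only the 2.38 home
# odds come from the post; the draw and away odds are invented here.
def implied_probabilities(odds):
    """odds: dict mapping outcome -> decimal odds."""
    raw = {k: 1.0 / v for k, v in odds.items()}        # raw implied probs
    overround = sum(raw.values())                      # > 1 because of margin
    return {k: p / overround for k, p in raw.items()}  # normalized to sum to 1

odds = {"home_win": 2.38, "draw": 3.40, "away_win": 3.10}
probs = implied_probabilities(odds)
print(round(1 / 2.38, 3))           # raw implied probability, about 0.42
print(round(probs["home_win"], 3))  # adjusted down once the margin is removed
```

The adjusted number depends on the draw and away odds, which is why it comes out slightly different from the 41.1% quoted above.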

This is all pretty standard stuff. The odds for relatively evenly matched games like the one above are probably fairly accurate, or at least more accurate than the average person's guess. But what about significant underdogs? What about City against Cardiff? These are a little more difficult to assess. It's clear that Cardiff is the underdog in this game, but by how much? And do odds makers do a good job of assigning implied probabilities to these lopsided games?

To look into this, I defined games with a significant underdog as games where one outcome has an implied probability of occurring under 15%. This could be a home win, an away win, or a draw. If the odds makers are effective at pricing underdog outcomes, we should see these outcomes (as I define them) occurring under 15% of the time, since the implied probability could be anywhere from just above 0% up to 15%.

I used data from Football-Data.co.uk and took the Bet365 odds for the past 5 seasons of the Premiership. After limiting the data to significant underdogs as defined above, I ended up with 643 games with a significant underdog: 110 with a home-team underdog, 441 with an away-team underdog, and 92 where the draw was the underdog outcome. These splits make sense, since away teams are underdogs more often than home teams due to home-field advantage.

Next I calculated what percentage of underdog games resulted in the underdog outcome actually occurring. As mentioned above, these percentages should be under 15% if the odds makers are doing a good job. Below is a graph showing these percentages over the past 5 years, split into home, away, and draw underdogs.

The red dashed line is the 15% cutoff. For draw and away underdogs, we see about what we would expect: the percentage of underdog outcomes occurring fluctuates, but remains under the 15% line. For home underdogs, though, it's a different story. In every year besides 2012, home underdogs actually won more than 15% of the time, in some cases significantly more. This suggests that odds makers underestimate home-field advantage when weaker teams host stronger teams.
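The check described above reduces to a filter-and-count. Here is a minimal Python sketch with invented data (the real analysis, done in R, uses 5 seasons of Bet365 odds from Football-Data.co.uk):

```python
# Filter to "significant underdog" outcomes (implied probability under
# 15%) and compute how often they actually happened. The games list is
# invented for illustration.
def underdog_hit_rate(games, cutoff=0.15):
    """games: list of (implied_probability, outcome_occurred) pairs."""
    underdogs = [(p, hit) for p, hit in games if p < cutoff]
    if not underdogs:
        return None
    return sum(hit for _, hit in underdogs) / len(underdogs)

games = [(0.10, True), (0.12, False), (0.08, False), (0.14, True), (0.30, True)]
print(underdog_hit_rate(games))  # 2 hits in 4 underdog games -> 0.5
```

If the bookmakers are pricing underdogs well, this rate should land under the cutoff; a rate persistently above it is the inefficiency discussed above.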

This evidence points to an inefficiency in the odds for underdog home teams. However, a few caveats should be mentioned. First, there were relatively few home-underdog games (just 110) in the past 5 years, which may be influencing the results. Second, odds are set with payoffs lowered so that the odds makers make money. This means any inefficiency like the one above has to be large enough to overcome that built-in edge. For the first 3 years, this does not seem to be the case for home underdogs.

Overall though, it seems like there could be something here. But don't blame me if this betting strategy doesn't work; who knows whether it will persist into the future.

I think the reason for this lack of progress in the soccer analytics community is threefold:

First, it is very hard to get across new ideas and specific developments in a panel. Panels are not conducive to discussing new research; at best, a panel gives a general view of soccer analytics, one in which we step back and look at progress from a distance. If you feel like you've heard this same general view repeatedly over the past few years, you're not alone. To learn truly new ideas at SSAC, one has to attend the research paper presentations.

Second, not very much has actually been done with soccer analytics in the past 2 years. You may argue with me, and of course some things have been done. But compared to the developments in basketball, the progress is small. Compared with just 2 years ago, the NBA is now heavily influenced by analytical findings. Daryl Morey, general manager of the Houston Rockets, talked about how teams are shooting far more corner threes and scoring more points on high-percentage layups and dunks inside the paint, a shift that largely stems from analytical work done in basketball. There has been no comparably large influence on how soccer is played (at least that I can tell) in the past couple of years.

Third, soccer is genuinely harder to break down analytically than other sports. Baseball can be broken down into individual at-bats; basketball and football can be broken down into individual possessions. Soccer, though, is one fluid game with no easy way to segment it. People have been pointing this out for as long as I can remember.

The thing is, I'm cheating a little bit here. What value does pointing out problems have if I don't offer any solutions? If soccer wants to come out of its analytical dark age, we need to figure this problem out. I'm not saying it's easy; if it were, I'd have some brilliant idea. But I don't. I'm just as stuck as most people seem to be.

A lot of this problem comes from the stigma that statistics somehow degrade soccer. Fans think that analytics takes the beauty out of the game. At SSAC, the legendary Edward Tufte had a perfect quote: "Analytics doesn't take the beauty out of sports. It takes the stupidity out of it."

If data are going to truly influence soccer at the professional level, basketball is a brilliant lead to follow. The revolution started with individuals: bloggers and armchair analysts. Eventually people started to see that there were inefficiencies to exploit, and these individuals began to figure out how to exploit them. Finally, forward-thinking team executives caught on and started to change the way their teams played based on the analysis. Think Billy Beane, Daryl Morey, Theo Epstein, etc. Once other teams saw that this analysis was working, they started to imitate.

[Photo: Daryl Morey, General Manager of the Houston Rockets, at SSAC 2013]

There is no doubt in my mind that inefficiencies exist in soccer that can be exploited with statistics. The most exciting part is that few of these inefficiencies have been found yet, or if they have been found, they haven't been exploited. This low-hanging fruit is waiting to be picked. And once we figure out how to gain a significant enough edge for team executives to notice, they'll catch on and start adapting. This adaptation may be a little slower outside the United States; for whatever reason, the US seems much more open to statistics in sports.

What soccer needs is an Oakland Athletics, or a Houston Rockets. I have no idea when this will happen; it may already be underway, or it may take 10 years. More than that, we need to prove the worth of analytics in soccer, because clubs (understandably) will not invest in it unless they think real value can be gained. I don't think we're there yet, but we're on our way.

But when one club eventually punches well above its weight as a result of an analytical edge, other clubs are going to notice. After that, the dominoes will fall into place and the game will probably start to change, just as it has in basketball in the past 2 years and in baseball before that. The seeds can be planted by individuals, but nothing will grow until clubs actually adopt the work.

**Data**

To create a model that gives the probability of a win, draw, or loss for a club depending on the venue (home/away), the minute, and the goal difference in the game, I used over 4,000 EPL games stretching back to 2001. This is an improvement over the older Outcome Probability Calculator, which only used 1,000 games.

**Methodology**

Even though there are over 4,000 games behind the model, there are inevitably some game scenarios that do not occur often enough to give reliable results. There are not many games in the data where the away team went up by 3 in the first 5 minutes. Because of this, when I plotted outcome probability against the minute of the game, the lines were not exactly smooth. To make the results a little more reliable, and hopefully more consistent with actual games, I applied LOESS regression to smooth the lines. This method has trade-offs: it does a good job of smoothing the data, but it is non-parametric, so unlike an ordinary regression it cannot be expressed as a single equation. Below are two plots, one using LOESS and one showing the raw data.
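The smoothing was done with R's `loess()`. To illustrate the idea behind it (locally weighted linear regression), here is a minimal pure-Python sketch; the implementation details, including the `frac` neighbourhood fraction, are my own simplification, not the post's actual R call:

```python
# A minimal LOESS: for each x0, fit a weighted straight line using the
# k nearest points, with tricube weights, and evaluate it at x0.
def loess(xs, ys, frac=0.5):
    n = len(xs)
    k = max(2, int(frac * n))  # number of neighbours per local fit
    smoothed = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th neighbour
        w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
        # weighted least squares fit of y = a + b*x, evaluated at x0
        sw = sum(w)
        swx = sum(wi * x for wi, x in zip(w, xs))
        swy = sum(wi * y for wi, y in zip(w, ys))
        swxx = sum(wi * x * x for wi, x in zip(w, xs))
        swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
        denom = sw * swxx - swx ** 2
        if abs(denom) < 1e-12:
            smoothed.append(swy / sw)  # degenerate: weighted mean
        else:
            b = (sw * swxy - swx * swy) / denom
            a = (swy - b * swx) / sw
            smoothed.append(a + b * x0)
    return smoothed
```

Because each fit is local, there is no single global equation, which is exactly the trade-off described above.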

**Shiny**

Best of all, I updated the application interface. RStudio's Shiny lets you build simple web applications in R, and the Shiny interface makes the application a little more functional. The Outcome Probability Calculator now displays a graph similar to the one above, with a marker for the specific minute chosen. It's fun to explore different game scenarios.

**Next Steps**

Next I am planning on updating the Expected Points Added application with the new data. I'll also hopefully calculate this for every player in the past 10 years of English Premier League football. It should be interesting to see how this has fluctuated from year to year for specific players.

Here are some of my thoughts:

I really like the emphasis put on the role that luck plays in soccer. People are often inclined to use data to explain everything that happens in soccer. Of course, this is impossible. There will always be events that simply cannot be predicted, and I think this is crucial to keep in mind going forward. Data can give you an edge, but it is never going to be the only factor that leads clubs to championships or promotion. To use an example from the book, no model is ever going to predict a beach ball coming into play and deflecting a shot into the net. This is a glaring example of randomness sneaking into the game to determine a result, but there are countless smaller incidents in every match that affect results.

A fact I found interesting: 99 percent of the time, players didn't touch the ball, and 98.5 percent of the time they ran without it (page 143). Analysts have focused almost all of their attention on the events that occur when the ball is at a player's feet. That means focusing on only about 1% of a player's input in the game. While these are the most important and easiest-to-record parts of the game, it is striking that we ignore almost 99% of a player's activity when evaluating performance. Ignored is the work a player does to get into the right defensive position, the right angle to receive a pass, or the right space to take a shot. These things are extremely difficult to track and analyze, but they are likely just as important as the events that occur when a player actually has the ball.

The part of the book that will have the greatest influence on how clubs behave, at least in my opinion, is the weak link versus strong link analysis (page 218). Basically, the analysis in the book says that improving a club's weakest player is more beneficial than improving a club's strongest player. This is a huge finding, and completely goes against what is "known" in soccer. Instead of signing the big name striker for way too much money, simply upgrading your weak left back to a stronger one can improve your club for a lower cost. Of course, this is not going to be the case every time, but their analysis provides strong evidence that this is the case a lot of the time. If I were employed by a club to analyze signings, this is the first idea I would bring up to improve the club.

Overall, I think Chris and David do a good job of making the book easy to read and entertaining. It would be all too easy to make the analysis complicated (they are Ivy League professors, after all), but they weave a narrative while offering a lot of very good insights. That said, the future of soccer analytics will likely be a bit more complicated. As they point out in Forecast 4 on page 304, mathematical tools like algebraic geometry and network analysis are going to provide the basis of most insights going forward. Instead of ball events, the focus will shift to the more complicated analysis of off-the-ball events like spacing and positioning. These make up the 99% of the game that is generally not recorded or analyzed, as mentioned before, and they are likely the events that will penetrate deepest into the flowing, dynamic, non-stop nature of soccer.

If you haven't yet read it and you're reading this blog, you should probably pick up a copy and read it.

*Data*

I had a hunch that this would not be the case. Specifically, my guess was that more goals would be scored between the 85th and 90th minutes, and fewer in the first 5 minutes of the game. To test this hypothesis, I used data from the Rec.Sport.Soccer Statistics Foundation covering 8 years of the Premiership.

*Methodology*

At first I looked at the number of goals scored in each minute of the game across the 8 years of data. However, it was clear that this breakdown was too granular; the variability was high because there just wasn't enough data for each individual minute. To solve this, I aggregated the data into 5-minute intervals. This way, the data is less specific but gives a clearer picture of what is going on. Below is the graph that I came up with:

As you can see, the x-axis gives the end of the time interval. In other words, the bar over 50 on the x-axis represents the number of goals scored in the 45-50 minute time range. The y-axis is the frequency, or number of goals in that time range.
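The binning and the reference lines can be sketched as follows. This is a Python illustration (the original work was done in R), and `goal_minutes` is an invented sample rather than the actual 8-season RSSSF data:

```python
import statistics
from collections import Counter

# Bin goal minutes into 5-minute intervals; a goal in minute 46-50
# lands in the bin labelled 50 (the interval's end, as on the x-axis).
def bin_goals(goal_minutes, width=5):
    bins = Counter()
    for m in goal_minutes:
        end = ((m - 1) // width + 1) * width
        bins[end] += 1
    return dict(sorted(bins.items()))

goal_minutes = [3, 44, 45, 47, 50, 88, 90, 90]  # invented sample
bins = bin_goals(goal_minutes)
print(bins)  # {5: 1, 45: 2, 50: 2, 90: 3}

# the mean and +/- one standard deviation of the bin counts give the
# three horizontal reference lines on the graph
counts = list(bins.values())
mean, sd = statistics.mean(counts), statistics.stdev(counts)
```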

*Confidence Interval*

There are also 3 lines on the graph. The middle line gives the mean number of goals across all the time intervals; the top and bottom lines are the mean plus and minus one standard deviation. (Strictly speaking this is a reference band rather than a formal confidence interval, but it serves the same purpose here.) I added these because they provide nice reference points for spotting time intervals that can be considered outliers.

*Late First Half Goals*

The first thing that sticks out from the graph is the number of goals scored between minutes 40-45. A lot of goals are scored right before halftime, which is interesting and not something I was necessarily expecting. It is also probably the worst time to give up a goal: instead of going into the break tied or up by one, a club heads into the locker room down by one or level.

*First Versus Second Half Goals*

Another notable feature of the plot is the difference between the height of the bars in the first half and those in the second half. In the first half, the number of goals sits around the lower reference line; in the second half, it sits around the upper one. Clearly, more goals are scored in the second half of games, which is probably not the first time someone has pointed this out.

*Implications*

There are a number of takeaways from this plot. First, the final 5 minutes of the first half are a vital period of the game. Clubs should bear down and play more defensively than usual, especially since you don't want to go into the break having just conceded.

Second, I was thinking that there could be applications to betting. While I don't bet on soccer games myself, the information could be useful for someone placing a bet on the time a goal is scored, or even when the first goal of the game is scored. If you do put a bet on it, just don't come back complaining to me if it doesn't work out!

The reason for this is that goals scored does not follow a normal distribution, the bell curve we are used to. For example, if you looked at the distribution of heights in a population, you would see a nice bell curve: most people are right around the average height, and as you go toward either extreme (really short or really tall) you find fewer and fewer people. The mean height is therefore instructive because it gives us the "normal" or "typical" height.

The problem is, goals scored in a season does not follow a normal distribution. Instead, most players score no goals at all. The next most common tally last season? Just one goal, of course. The counts keep falling off in this fashion, following a power law distribution.

While I don't understand all the mathematical intricacies of the power law distribution, I can explain generally how it behaves. A good example comes from earthquakes. Most earthquakes are relatively small, and as the size increases, the frequency decreases. Specifically, for every magnitude 4 earthquake there are roughly 10 times as many magnitude 3 earthquakes and 100 times as many magnitude 2 earthquakes, and this pattern holds across all magnitudes.

My idea was that a similar power law could model the number of goals scored in a season. It turns out that it does a good job, which has implications for predicting goals scored in a season. Below is a histogram of the number of goals scored in the 2011-2012 EPL season, using the MCFC Analytics Basic dataset. As you can see, a lot of players scored just 1 goal, fewer scored 3, and even fewer scored 10.

So how do we fit a function to this distribution? If we take the natural log of both axes (count and goals) and plot the new points, the points should form a line. As you can see from the graph below, the linear relationship is quite strong (R-squared = 0.91), which implies that the power law fit to the original data is also strong.

A power law follows the equation y=k * x^(n). In this case, we have the equation count = k * goals^(n). The n in the previous equation is just the slope from the log-log graph above. The slope is -1.63, so n=-1.63. To find k, we raise e to the y intercept in the log-log graph. The y intercept is 5.471 so k=exp(5.471)=161.61. So the equation that fits the first graph is count = 161.61 * (goals^-1.63). When I plot this equation over the distribution of goals from the first graph, it is clear that there is a nice fit.
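The fitting procedure above (find the slope and intercept of the log-log points, then take k = e^intercept) can be sketched as follows. The sample counts here are generated from a known power law rather than the actual MCFC data, just to show that the fit recovers the parameters:

```python
import math

# Fit count = k * goals^n by ordinary least squares on the
# log-log points: log(count) = log(k) + n * log(goals).
def fit_power_law(goals, counts):
    lx = [math.log(g) for g in goals]
    ly = [math.log(c) for c in counts]
    m = len(lx)
    mx, my = sum(lx) / m, sum(ly) / m
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    intercept = my - slope * mx
    return math.exp(intercept), slope  # k, n

# synthetic data drawn exactly from count = 160 * goals^-1.6
goals = [1, 2, 3, 5, 10, 20]
counts = [160 * g ** -1.6 for g in goals]
k, n = fit_power_law(goals, counts)
print(round(k, 2), round(n, 2))  # -> 160.0 -1.6
```

With the post's fitted values, `161.61 * 30 ** -1.63` evaluates to about 0.63, the expected number of 30-goal scorers per season used in the example below.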

So what are the implications of all this? First, it says that predicting the number of goals scored in a season is very hard. Events that follow power law distributions are extremely hard to predict; earthquakes, for example, are basically impossible to foresee. Even experts cannot reliably predict *when* an earthquake is going to happen, but because they know the distribution, they can say how often earthquakes of various magnitudes are expected to occur. Extending the analogy to goal scoring: it is very hard to predict any individual player's goals in a season, but from this analysis we know the overall distribution, so we can predict how often various goal-scoring tallies will occur.

For example, how often do we expect a Premiership player to score 30 goals in a season? Plugging 30 into the model gives count = 161.61 * (30^-1.63) = 0.63. In other words, we expect about 0.63 players per season to hit 30 goals, or roughly 6 such players in every 10 seasons. If you look, this is consistent with what has happened over the past 10 Premier League seasons.

To finish up, I am not saying we have no idea whether Owen Hargreaves has a good chance of outscoring van Persie next season; obviously, a lot of people predicted that van Persie would lead the league in scoring this season, and he has. But if we step back and look at goal scoring from just this past season, we can fit a power law distribution that models goal scoring across 10 whole years.

PS: I'll be traveling for the next 3 months so I won't have consistent internet access and won't be able to make any posts. I'll try to stay updated on Twitter though so I'll still talk there as much as I can.

The advanced data contains (x,y) location information for every statistic that is kept. This is valuable, as it tells exactly where each event happened on the pitch. I was interested in how this information could be used, specifically to look at momentum and passing trends.

*Previous Work*

Some work has already been done in the soccer analytics community on trying to quantify and analyze momentum. The Analyse Football blog looked at momentum shifts in this same game, although in a different way, and the Soccer by the Numbers blog has looked at momentum in football much more generally.

*Methodology*

By looking at the exact position of each team's passes during the game, one can tell in what area of the field the game is being played at any point. My aim was to break this down. I looked at the y location of each pass during the game. In the data set, each pass is given an (x,y) location tag, where y measures the pass's position along the length of the pitch, from end line to end line, with 0 < y < 100. In other words, a y value of 100 would be a pass on the other team's end line, and a y value of 50 would be a pass on the halfway line. The x location is also given, but for this analysis I ignored it.

Just graphing the y location of every pass is too noisy to break down and understand. Instead, I took the simple moving average of the y location of the previous 25 passes of both teams. This gives a clearer picture of where the passes are being made throughout the game, and is taken as a proxy for momentum.
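The smoothing step is just a simple moving average. A Python sketch (the original analysis was in R), with `pass_y` an invented sequence rather than the actual MCFC Analytics values:

```python
# Simple moving average over the previous `window` values, the
# momentum proxy described above.
def moving_average(values, window=25):
    out = []
    for i in range(window - 1, len(values)):
        out.append(sum(values[i - window + 1:i + 1]) / window)
    return out

pass_y = [50, 60, 70, 40, 30]           # invented y locations
sma = moving_average(pass_y, window=3)  # small window just for the demo
print([round(v, 1) for v in sma])       # -> [60.0, 56.7, 46.7]
```

The post uses a window of 25 passes; the window size trades responsiveness for smoothness.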

Below is the graph plotting the simple moving average of the pass location throughout the game.

The straight black line down the middle marks the halfway line. When the simple moving average (the jagged black line) crosses it, the previous 25 passes by both teams were centered, on average, at midfield. The blue and red bars show the actual y location of every pass for reference: red bars are passes in Bolton's attacking half and blue bars are passes in City's attacking half. I also marked each team's goals with blue and red dots on the moving-average line. As you can see, the game ended 3-2 in favor of Manchester City.

*Analysis*

What does this graph tell us about the game? City seem to have dominated momentum for most of the game, though the first 3 goals were scored with relatively little momentum either way. After Bolton's first goal (which made it 2-1 to City), City gained momentum, with play sitting consistently in Bolton's defensive half, and possibly as a result, City scored a 3rd goal to make it 3-1. After momentum evened out and City dominated another spell, Bolton clawed back momentum, with most passes coming in City's defensive half. Again possibly as a result, Bolton got a 2nd goal to make it 3-2. Bolton ended the game with most of the momentum, but City held off the attack to finish 3-2 winners.

*Conclusion*

The location data of passes gives us a good proxy for momentum, and this momentum does seem to matter for scoring goals. While the first 3 goals were scored with seemingly no momentum for either team, each of the final 2 goals came while the scoring team was dominating momentum.

When the full advanced data set is available, it will be interesting to run this analysis on more games. This is obviously an imperfect measure, but I think it does its job as a reasonable indicator of momentum. I'd be interested to hear suggestions and comments on the methodology.

If you're an R user and are having trouble dealing with the Advanced MCFC Analytics XML data file, the link above provides the code to pull the data into a data frame in R. After that, it is easy to perform whatever analysis you want on it.

I'll admit the code above is beyond my limited R skill level, but I know that it works. I'm excited to start doing some analysis, although the advanced data set is only for one game from last season at this point.

You can read the full blog post here.

I wanted to point out something interesting that I found while working on the model: betting odds do a relatively poor job of predicting football match outcomes. In other words, the win, draw, and loss probabilities for the home team implied by bookmakers' odds are surprisingly inaccurate.

My hypothesis for why this happens is that football is very unbalanced, especially in the EPL. It is very hard to predict when an upset is going to happen, mostly because these upsets are (seemingly) random.

Using just 4 factors (the home team's goal differential for the season up to that game, the away team's goal differential for the season up to that game, the home team's point total from the previous season, and the away team's point total from the previous season), I could create a model as accurate as the bookmakers'.
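The four factors can be computed directly from a season's results. This Python sketch uses invented fixtures and point totals, not the actual data behind the model:

```python
# Build the four features for one fixture: each side's season-to-date
# goal differential plus last season's points.
def match_features(results, home, away, last_season_points):
    """results: (home_team, away_team, home_goals, away_goals) tuples
    for games already played this season."""
    def goal_diff(team):
        gd = 0
        for h, a, hg, ag in results:
            if h == team:
                gd += hg - ag
            elif a == team:
                gd += ag - hg
        return gd
    return [goal_diff(home), goal_diff(away),
            last_season_points[home], last_season_points[away]]

results = [("Arsenal", "Chelsea", 2, 1), ("Chelsea", "Spurs", 3, 0)]
pts = {"Arsenal": 68, "Chelsea": 75}
print(match_features(results, "Arsenal", "Chelsea", pts))  # -> [1, 2, 68, 75]
```

Any classifier for win/draw/loss could then be trained on these feature vectors; the post doesn't specify which model class was used.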

The question that remains is how much more accurate can the model become with the introduction of new variables? Beyond that, what variables should be used?

I am not sure I know the answers to those questions, but I am going to keep playing around with the data.

Inspired by this post on plotting the frequency of Twitter hashtags over time, I was interested in applying the technique to soccer in some way. While not the most technical analysis, I thought it would be interesting to use this tool to analyze transfer rumors.

To summarize the process quickly, there is a package in R (open-source statistical software) called twitteR which allows you to pull Twitter data. It's actually a fairly easy process, especially if you follow the tutorial linked at the beginning of this post.

As most Twitter users know, there is a seemingly unlimited number of transfer rumors circulating on Twitter, ranging from fairly plausible to pretty ridiculous ("Ronaldo to the Philadelphia Union???"). As a Manchester City supporter, I was curious to look at a few popular transfer rumors related to City.

**Robin van Persie to Manchester City:**

Yes, this is definitely a rumor, and yes, it is probably not going to happen. But I was still curious. Below is a plot of the frequency of the number of tweets that include "Robin van Persie" and "Manchester City". Of course, this is an imperfect method, but it still gives us an idea of what is going on in the Twitter transfer rumor world.

To explain, the graph below measures the number of tweets described above at a 2 hour interval for the past week. This means the height of every line gives us the number of tweets referencing RVP and City in that 2 hour interval.

**Carlos Tevez to AC Milan:**

After Tevez's past season with the club, there are obviously transfer rumors concerning Tevez all over the place. Because of this, it was hard not to want to look at the data on Tevez. I picked AC Milan because it seemed like the club he had the highest likelihood of going to. Like above, I searched for tweets that included "Carlos Tevez" and "AC Milan". The frequency of these tweets, in 2 hour intervals, is plotted below.

You can try to analyze these graphs to find some meaning, but they are more just a fun exercise than anything else. The twitteR package lets you do other cool things, like plot the frequency of Twitter mentions for a user. I did this for another site I write for, EPL Index. They tend to get a lot more mentions than @SoccerStatistic does, so I thought it would be more interesting to plot the frequency of @EPLIndex mentions. Again, the intervals are every 2 hours.

Like I said before, this analysis is not very insightful or ground-breaking, but still pretty cool nonetheless. The possibilities for future analysis like this are almost endless, so if people have good ideas of Twitter data to visualize, I'd love to hear them.

]]>Swansea City, the "Welsh Barcelona"

**Possession, Does it Matter?**

Recently, I've started to question simply looking at percent possession during an entire game. It seems too broad an analysis to yield any significant or insightful results. Inspired by another one of Devin Pleuler's posts, I have looked at possession more specifically within a game. Instead of looking at how possession relates to results as a whole, I've broken down each game to try to understand how keeping the ball within a specific segment of a game changes a club's chances of scoring. Within the EPL, does a long spell of possession mean a team is more likely to score? If the answer is yes, it could point to the fact that possession actually does matter, albeit at a much smaller level.

**Methodology**

To do this, I have broken down passing patterns within specific Premier League games. Instead of looking at possession as a whole, I looked at the last 25 passes and who made them. As the game progresses and passes are made, I found the difference between the two clubs within those last 25 passes. For example, the home team could have completed 15 of the last 25 passes, while the away club completed the other 10. This would give us a pass difference of +5, as the home club completed 5 more passes than the away club. As you might expect, this is a constantly fluctuating number, which flips between positive and negative many times throughout the game.
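The rolling calculation is straightforward. Here is a Python sketch of it (illustrative only; the names are mine, not from the actual analysis code):

```python
def pass_difference(passers, window=25):
    """Rolling pass difference over the last `window` completed passes.

    passers: sequence of 'H'/'A' flags, one per completed pass, in order.
    Returns, for each pass from the `window`-th onward, home passes minus
    away passes within the trailing window (ranges +25 to -25 for 25).
    """
    diffs = []
    for i in range(window, len(passers) + 1):
        recent = passers[i - window:i]
        home = recent.count('H')
        diffs.append(home - (window - home))  # home minus away passes
    return diffs
```

Plotting this series over the match, with goals marked, gives the graphs below; calling it with `window=50` gives the robustness check discussed later in the post.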

My idea is to graph this pass difference number throughout the entire game, and mark where goals occurred. My hypothesis was that a club would tend to score when they had had the majority of the last 25 passes. In other words, having possession would tend to increase a team's probability of scoring.

**RESULTS**

As it turns out, this hypothesis tends to be true. The key word here is definitely *tends*. Of course, there are always going to be goals a team scores off a counter-attack. However, clubs that have dominated the last 25 passes are more likely to score, and clubs that have not had much of the ball recently are less likely to. While you may be thinking to yourself that this is nothing new, it does seem to go against the results saying that possession does not matter. It seems that while *general* possession may not matter, *specific* possession has a big influence on scoring.

Below are some examples of the graphs I was looking at when making this analysis. A positive possession difference indicates that the home team has had the majority of passes within the last 25 passes, while a negative possession difference indicates that the away team has had the majority of recent passes. The vertical red lines indicate when a goal was scored; the ones starting at the midline and going up are goals scored by the home team, while the ones starting at the midline and going down are goals scored by the away team. Clearly, there are a few times when goals seem to have come when a team was losing the recent possession battle. For the most part though, it seems that my hypothesis above holds true.

**GRAPHS**

**Only 25?**

One criticism of this approach may be that looking only at the past 25 passes is too narrow a window. Admittedly, there is nothing special about 25; I just went with that number. To check the robustness of the results, I ran the same analysis with the last 50 passes. It turns out that the difference within the last 50 passes yields very similar results. Of course, the graphs are slightly different, but clubs still tend to score when they have had the majority of the recent possession.

**Conclusion**

Based on the graphs above, it seems that possession *does* matter, though at a much smaller level than what we normally look at. While the overall possession percentage within a game tends to be misleading, more specific possession analysis within the game shows that clubs tend to score when they hold the majority of recent possession. If Stoke City is playing Arsenal, they are likely to be dominated when looking at possession percentage as a whole. In this way, possession is useless for analyzing the game. However, if we break possession down like above, it's likely the case that possession *is* important for Stoke, just in much smaller intervals within the game.

The idea for this post originally came from another blog post written by Chris Anderson (@soccerquant), the writer of the Soccer By the Numbers blog. In that post, Chris compares both the strength and imbalance of 6 of the top European leagues. You can read the post here. My idea was to expand upon this analysis using the extensive and accurate Euro Club Index data, while also looking at more European leagues: the top leagues of 10 different countries. The analysis will be split into two posts. The first looks only at the top division of each country. The second, which will be posted later, will compare strength and imbalance within each country's league structure.

*Methodology*

Strength and imbalance are both tough to measure. While many measures exist, the Euro Club Index provides an accurate measure of the strength of each European club. To find the strength of the different European leagues, I averaged the index value of each club in the top division. Each value is taken from mid-February of each year. This average represents the strength of each league. To measure the imbalance of the league, I calculated the Gini Coefficient of all the clubs in the country's top flight. The Gini Coefficient is basically a measure of dispersion (or in this case imbalance) that is usually used to measure the dispersion of income across a sample. Calculating it is somewhat complex so I won't go into it here, but if you're curious here is an explanation. A higher Gini Coefficient indicates more imbalance in the league.
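For the curious, though, the calculation is fairly compact. Here is a Python sketch using the standard sorted-rank formula (this is my illustration, not the exact code used for the post):

```python
def gini(values):
    """Gini coefficient of a list of positive values (here, the Euro
    Club Index of every club in a league's top flight).

    0 means a perfectly balanced league; values closer to 1 mean a few
    clubs hold nearly all the strength.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Sorted-rank formula: G = 2 * sum(i * x_i) / (n * total) - (n+1)/n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

A league of identical clubs gives 0; the more the index values spread out toward a few dominant clubs, the closer the coefficient gets to 1.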

Below is a graph representing both strength and imbalance of 10 different European leagues. The y-axis represents imbalance, with a larger y value indicating more imbalance. The x-axis represents strength, with a larger x value indicating more strength. Ideally, a league is both strong and balanced. These leagues are, hypothetically, the most exciting to watch.

From the graph, we can divide European leagues into 4 categories:

- Imbalanced and Weak: These leagues are theoretically the least exciting. They are likely highly predictable, but also not very strong. From the graph, we see that Ukraine, Portugal and Scotland all fall into this category.
- Imbalanced and Strong: 2 countries fall into this category: England and Spain. From the graph we can see that, while the English Premier League and La Liga are both very strong, they are also highly imbalanced. This is something you might expect.
- Balanced and Weak: These leagues are less predictable, but also not that strong. Turkey and the Netherlands fall into this category.
- Balanced and Strong: This last category is the best of both worlds. The clubs in the league are very strong, and the league is also balanced. France, Germany and Italy all fall into this category.

Clearly, these are just general categories, as leagues are stronger or weaker to varying degrees. A lot of leagues also sit on the edge of the average dividers, including Portugal, which is only slightly weaker than the average league.

*Historical Context*

After this analysis, I was curious to see how the imbalance and strength of different leagues have changed over time. To answer this question, I graphed the imbalance and strength values of each of the 10 countries over the past 5 years. These graphs show us how leagues have progressed. So that the reader can see the changes more easily, I have split the 10 countries into two groups: the "top" of the 10 European countries (England, France, Germany, Italy and Spain), and the "bottom" of the 10 (Netherlands, Portugal, Scotland, Turkey, and Ukraine).

Below is the graph of the measure of league strength from 2008-2012. You can see that not much change has happened. England momentarily surpassed Spain as the top European league in 2011, but Spain has now retaken a narrow lead in 2012. Additionally, Germany and Italy have switched places in terms of strength. There has also been a slight decline in strength of the top leagues in the past few years. Spain and Italy have both declined somewhat, whereas England has remained relatively constant.

Next is the same graph as above, but for the "bottom" 5. This graph is even more static: the only ordering change is that Ukraine has surpassed Scotland in strength, and each league's strength level has seen only small fluctuations.

What about the progression of league imbalance? The graphs of imbalance levels over the past 5 years are shown below. First, France, Italy, and Germany have all kept a low level of imbalance over the past 5 years. On the other hand, England and Spain have had some interesting fluctuations. Spain has seen a consistent increase in imbalance over the past couple of years. England's path makes sense too: the influx of money into the EPL quickly raised its level of imbalance, which had then been slowly declining since 2008, but this year the EPL reversed that decline and became more imbalanced again. For the bottom 5 leagues, there hasn't been too much change in imbalance except for Portugal, which has seen a constant increase since 2008. In 2008, Portugal had the least imbalance of any of these European leagues; now the Primeira Liga sits somewhere in the middle in terms of imbalance.

*Conclusion*

As Chris points out in his analysis, this kind of work could give clubs insights into how transfers will fare moving between leagues. For example, can we predict how well a top player in Ukraine will play in Spain? Similarly, which is better: 10 goals in the EPL or 15 goals in the Eredivisie? These questions are nearly impossible to answer, but this sort of analysis helps us get closer. Knowing both the strength and imbalance of a league is important for evaluating players across leagues.

The next step for this analysis is to look at how different divisions of leagues vary. What is the differing strength and imbalance from the top division in a country to the bottom division? This sort of analysis can help a club determine how a player in a lower league might fare in a higher division, a question similar to the one I try to answer here.

*Data provided by Infostrada Sports from the Euro Club Index (powered by Hypercube). You can follow both Infostrada Sports accounts on Twitter: @InfostradaLive, @EuroClubIndex.*

I tried to make it as stand-alone as possible. In other words, I wanted people to understand it just by looking at it, without other information. One point: it's interactive, in that you can scroll your mouse over a club's circle and it will give you information on that club. If you are interested in more analysis and how I created it, read below. My site doesn't allow embedded code, so I had to post it on a different site. Here is a link to the visualization.

What is wrong with the normal EPL table? Well, nothing exactly, but it is a bit boring. It is also hard to get a handle on the true distance between teams because the points are only listed as numbers. It is hard for the human mind to visualize the differential between clubs represented by numbers. A much better way is to represent all the clubs on an axis. This way, we can see how the difference between 6th and 7th (1 point) is much smaller than the difference between 2nd and 3rd (5 points). To do this, the y-axis in my visualization is the current points total of each club.

This is nice, but not complete. Tables also display the goal differential of a club, so I also wanted to display that. A similar problem as above exists with goal differential. It is hard to visualize the difference between two clubs' goal differential. To deal with this, the x-axis of the visualization is goal differential. This way we can compare the true distance between two clubs in goal differential.

How can we pack in more information than the average EPL table? I decided that including shots on target per game (data taken from http://www.whoscored.com/) might provide some interesting insight. To show this, I changed the color of the circles representing the clubs based on their shots on target per game. In other words, teams that have more shots on target per game are darker, and teams that have fewer are lighter.

With all the high spending recently, I thought net transfer spending would be an interesting additional metric to represent in the visualization. On my last post, a reader pointed out that scaling the radius of a circle with a metric is misleading, because people naturally read areas, and in that case the area does not represent the true relationship. So this time I made the area of each circle represent net transfer spending.
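The fix for the radius-versus-area problem is a one-liner: scale the radius with the square root of the metric, so the area grows linearly with it. A quick Python sketch (names and the maximum radius are mine, for illustration):

```python
import math

def radius_for_value(value, max_value, max_radius=40.0):
    """Radius for a circle whose *area* is proportional to `value`.

    Since area = pi * r^2, a value twice as large should get a radius
    sqrt(2) times as large, not twice as large.
    """
    return max_radius * math.sqrt(value / max_value)
```

With this scaling, a club with double the net spend of another gets a circle with exactly double the area.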

Finally, I made all of this interactive. If you scroll your mouse over a circle it will tell you the club, the points they have, their net transfers, their goal differential, and their shots on target per game.

I think the graph gives a lot of interesting insights, and packs a lot of information into one graphic. It is also very intuitive, in that you don't have to think about comparing numbers. The visualization is just there, and you can compare without even thinking.

]]>Using data from 1000 EPL games from the RSSSF, I've created this chart using Processing, which you can find below.

Each node (circle) represents a different scoreline that a game can end with. The diameter of the node depends on the number of games that have ended with that scoreline, so bigger nodes mean more games ended that way. For example, "1-0" is the biggest node, because more games ended 1-0 than any other way. On the other hand, the "0-4" node (bottom right hand corner) is tiny because not many games end with the away team winning 4-0.

I should also point out how the progression works. "0-0" is the starting node. All the nodes above that represent the home team scoring first (1-0), and all the nodes below that represent the away team scoring first (0-1). The home team's score is represented by the number to the left of the dash, and the away team's score by the number to the right of the dash.

Another thing: there are scorelines that are repeated in the graph. For example, there are three "2-1" nodes. That's because there are three score progressions that can end that way: 1-0, 2-0, 2-1 is one; 1-0, 1-1, 2-1 is another; and 0-1, 1-1, 2-1 is the final one. The size of each of these "2-1" nodes represents the number of games that ended 2-1 via that specific progression.

Finally, you can see the transition lines have different weights. The weight of each line represents how many games passed through that transition. For example, the line connecting the "0-0" node to the "1-0" node is thicker than the one connecting "0-0" to "0-1" because more games start with the home team scoring first. This is different from the size of the nodes, because it takes into account not just how many games ended at a scoreline, but how many games included it. For example, if a game ended 2-1 but at one point was 2-0, it would count toward the weight of the 2-0 transition line, but not toward the "2-0" node. Weighting the transition lines like this is interesting because you can begin to visualize how the "flow" of scores happens in a game.
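If you're curious how the node and line weights could be computed, here is a rough Python sketch of the counting (the actual chart was built in Processing; this is just an illustration with made-up names):

```python
from collections import Counter

def progression_counts(games):
    """Count node and transition weights for the score-progression chart.

    games: list of score progressions, each a list of (home, away)
    scorelines in order, starting at (0, 0).

    Node weight = games whose final scoreline was reached via exactly
    that path (so the three "2-1" nodes stay distinct).
    Edge weight = games that passed through that transition at all.
    """
    nodes, edges = Counter(), Counter()
    for path in games:
        nodes[tuple(path)] += 1  # the full path identifies the node
        for a, b in zip(path, path[1:]):
            edges[(a, b)] += 1
    return nodes, edges
```

The node counter sizes the circles and the edge counter sets the line weights, which is exactly why a game ending 2-1 via 2-0 fattens the line into "2-0" without growing the "2-0" node.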

]]>Blog: The blog is exactly the same, and is the also home page of the site. Nothing new here.

Statistical Applets: There is now a menu option called Statistical Applets. Under this are two options, Expected Points Added and Outcome Probability Calculator. The first is a table of the EPL leaders in Expected Points Added, a metric I created a while ago that takes into account the true value of each goal when ranking goal scorers. For more information, you can read here. The Outcome Probability Calculator lets you enter information about a team in a game, and then gives you the probability of each type of outcome. For example, you could enter the 34th minute, at home, leading by 1, and see the probability of the team winning, drawing, and losing.

About Me: Just an about me page, with a contact us form link.

If you have any comments or suggestions on the design for the new site, I'd love to hear them. I'm working on adding some more statistical applets to that section in the future, which I'm excited about. Hope you like it!

]]>Michael Essien

What would be the perfect, all-encompassing football statistic? Something that takes into account both offensive and defensive skill. Something that measures what value a player adds to his club. All in all, a statistic that quantifies the individual impact a player has on improving (or worsening) his club's ability to score goals and limit (or not) goals against.

Some people have made attempts at this in the past. One example is OptaJoe's tweets (@OptaJoe) about clubs' winning percentages with and without a player. Here is one: "10 - Since January 2005, Everton have averaged 61 points per season with Arteta playing, compared to 51 points without him. Lynchpin." These statements are simple, easy to understand, and at first glance seem informative. On his blog 5 Added Minutes, Omar Chaudhuri has correctly pointed out that these statements tend to be entirely misleading. As Omar shows, the problem is that they do not control for the strength of the opponent, the venue of the game, or really anything else.

My idea was to create a metric that would control for all of these factors to truly understand every player's worth to their club. Being a big ice hockey fan (specifically of the Boston Bruins, if you are wondering), I thought the plus minus statistic might translate to football. For those of you not familiar with it, plus minus basically measures a club's net goals while that player is on the ice/field. When the team scores a goal while the player is playing, the player's plus minus increases by one. Conversely, when their team concedes a goal while they are playing, their plus minus decreases by one. The idea is that, over the season, the best players will have the highest plus minus.

I faced the same problem as before though, as this does not control for the strength of the opponent, the strength of the team the player is playing with, and where the game is being played. For example, a poor player on a top club would naturally have a higher plus minus than a good player on a poor club.

To fix this, I applied an analysis used in basketball to create an adjusted plus minus statistic. The original was created by Dan Rosenbaum, and if you are interested, the explanation can be found here.

Without going into too many technical details, the adjusted plus minus metric is created using a massive regression. The right-hand-side variables are indicators for every player, while the left-hand side is goals scored. Each observation is a stretch of a game during which no substitutions are made. Each player variable is 1 if the player is playing at home during that stretch, -1 if they are playing away, and 0 if they are not playing. The significance of this methodology is that it controls for each player's team, venue, and opponents. If you want to know more about the methodology, read the link above. The data is from the 2010/2011 season and is provided by Infostrada Sports (@InfostradaLive on Twitter).
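To make the encoding concrete, here's a small Python sketch of how one substitution-free stint becomes a row of the regression's design matrix. This is illustrative only (names are mine); the dependent variable paired with each row would be the goals scored in that stint.

```python
def stint_row(players, home_lineup, away_lineup):
    """Encode one substitution-free stint as a regression row.

    players: fixed list of every player in the league (column order).
    Entry is +1 if the player is on the pitch for the home side during
    the stint, -1 for the away side, and 0 if not playing.
    """
    home, away = set(home_lineup), set(away_lineup)
    return [1 if p in home else -1 if p in away else 0 for p in players]
```

Stacking one such row per stint across the whole season, with goals scored in the stint as the outcome, is the "massive regression" described above.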

The main problem with this, as some, including Albert Larcada (@adlarcada_ESPN), pointed out on Twitter, is that there is multicollinearity in the regression. This arises because, unlike in basketball, there are not many scoring events, so many player variables end up highly correlated in the model. This throws off the adjusted plus minus values for each player, so we should not take much from the results.

With that in mind, here are the results that I came up with. Again, these results are likely not correct, but I thought people might be curious to see them anyways:

Because I (a) spent a lot of time on this and (b) think it is important to post work even when it doesn't work out, I went ahead and wrote this post. Keep in mind that the results above don't really mean much. The values are also not statistically different from 0: the standard errors on all the values are large enough that we cannot say they differ from 0, which is another reason the results are not very reliable. However, I think the adjusted plus minus statistic could be the first step toward metrics that truly capture the actual value of a player. Most statistics used (assists, goals, etc.) can be thrown off because they are highly dependent on the team the player plays for.

One way to fix the multicollinearity problem is to use a different statistic that occurs more often and is highly correlated with goals. I think the best option would be shots on goal. This way, you could create a statistic that controls for the player's team, opponents, and venue, and measures how many net shots on goal occur while they are on the field. Just a thought on something to look at in the future.

*The data used for this is provided by Infostrada Sports (@InfostradaLive on Twitter). Special thanks to Simon Gleave (@SimonGleave on Twitter) for helping me with the data.*

Swansea's Mark Gower is a perfect example of a player highlighted by the chances created statistic.

**Chances Created**

The appeal of this measure is that it values players on weaker teams better than assists do. For a player on a weaker team, it is harder to record assists, since they are playing with teammates who are less likely to score. Chances created is a fairer statistic because it does not depend as much on the strength of a player's teammates. Overall, it can highlight creative players who are often overlooked because they are on weaker teams and do not have as many assists.

**Do Chances Created Actually Matter?**

With all this in mind, I was curious to find the actual worth of the chances created statistic. One way to measure this is to look at how chances created and wins are correlated. To make it a little easier, I looked at the relationship between goals scored and chances created for EPL teams. In other words, do teams that create more chances score more? Do teams that create fewer chances score fewer goals? The answer, in short, is yes, they are correlated. Below is a scatterplot of the relationship. There is a clear positive relationship between chances created and goals in the EPL last season. The coefficient is statistically different from 0 (p < .001), which tells us there is extremely strong evidence of a positive relationship.

**Chance Conversion Percentages**

This is only half the story, though. Some teams get a lot of shots off but still score few goals, either because they are not good at shooting or because they are taking shots with a smaller chance of going in; in other words, they have a poor conversion percentage. The conversion percentage is defined as goals divided by the total number of shots (excluding blocked shots). Below is a scatterplot similar to the one above, this time with conversion percentage on the x-axis. The conversion rates are rounded to 2 decimal places, hence the bunching. Again, this shows a positive relationship between conversion percentage and goals: teams with higher conversion rates tend to score more, and vice versa. This relationship is also statistically different from 0 (p=.002). A quick note: the product of chances created and conversion rate is very close to the number of goals a club has scored. I'm pretty sure the discrepancy comes from including blocked shots in shots attempted, but not in conversion rates.
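To spell the definition out, here is the conversion percentage calculation in Python (a trivial sketch; the function name and the split into on/off-target shots are mine):

```python
def conversion_rate(goals, shots_on_target, shots_off_target):
    """Goals divided by total shots, excluding blocked shots.

    Rounded to 2 decimal places, matching the bunching visible in the
    scatterplot.
    """
    return round(goals / (shots_on_target + shots_off_target), 2)
```

So a club with 60 goals from 400 unblocked shots converts at .15, roughly the Manchester clubs' level last season.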

**EPL 2010-2011, Chances Created and Conversion %**

With this in mind, I created a scatterplot of conversion rates and chances created for EPL teams last season. The plot shows that clubs found scoring success in different ways. The Manchester clubs did it by being efficient scorers; they had conversion percentages of .15 and .16. Chelsea and Tottenham were on the other end of the spectrum with higher chances created, but lower conversion percentages (.12 for both). The graphic also shows that West Ham did not struggle because they were not creating chances; they struggled because they had a low conversion percentage (.10). On the other hand, Birmingham struggled because they failed to create enough chances to score, despite a decent conversion percentage of .12.

**EPL 2011-2012 thus far, Chances Created and Conversion %**

What about this year? Below, I created the same scatterplot as above, this time for the current season. City's dominance is really highlighted. They are leading in both chances created AND conversion percentage, hence the massive number of goals this year. Again, United seem to be scoring because of their high conversion percentage. QPR and United actually have very similar numbers of chances created; United just finish their chances at a much higher percentage. Liverpool stick out because of their high number of chances created, but really low conversion percentage (.09).

**Conclusion**

The bottom line is that chances created and conversion rate are the keys to understanding goal scoring. A club can succeed with a high conversion rate (United) or by creating a lot of chances (Liverpool). A club can really dominate by doing both well (City). The graphic above can also suggest what kind of players each club needs. For example, Manchester United and Newcastle would benefit from picking up a creative midfielder who creates more chances, and Liverpool and QPR would benefit from picking up a more efficient scorer. The scatterplot also tells us why some clubs struggle: Wigan need to raise their conversion percentage (currently a dismal .06) and Stoke need to create more chances. City, on the other hand, should just continue to buy all the best players.

*All data comes from eplindex.com (@EPLIndex)*

Joey Barton, of newly promoted QPR

An aspect of English football that I love, and that does not exist in American sports, is promotion and relegation. It makes not just the race for first exciting, but also the race to avoid relegation. In American sports, last place teams often simply give up, a disappointment for fans.

I wanted to see exactly how promoted and relegated teams fare throughout the season. Some statistical research has already been done on the subject: Omar Chaudhuri, writer of the 5 Added Minutes blog, looks at conversion rates of promoted teams and their corresponding ability to stay in the top flight here. In part 1 of this post, I have looked at how promoted teams have done in their first season in the top flight. My original idea was that teams may struggle early in the season to adjust to the higher level of competition, and eventually even out as the weeks go on and the teams adjust. This also puts the performance of QPR, Swansea, and Norwich into perspective against past promoted teams' performances. I use data on promoted teams from the 2003/2004 to 2010/2011 seasons.

I've created 5 graphs to illustrate the performances of promoted teams. The first one, below, shows how all the promoted clubs' point totals have progressed over the 38 games. On average, promoted teams earn around a point per week. The greenish linear-looking line in the middle is the average. All the other jagged lines are the point totals over the season of promoted clubs. This graph isn't too informative, but is an interesting graphic nonetheless.

The next graph is the same as the one above, but only looks at the three promoted clubs this season in comparison to the average points line and the linear points line. To clarify, the linear line illustrates what would happen if a team earned the same number of points every week and ended up with the average point total for promoted clubs. The average line shows the average points earned through each week of the season. These may sound the same at first, but I will show in the next couple of paragraphs that there is an important distinction. Anyways, the graphic below illustrates that all 3 promoted clubs are faring about as well as the average promoted team. QPR started off a little stronger, but has since returned to the average. Norwich and Swansea both started a little weaker, but have improved to end up just above the average 7 weeks into the season. All 3 teams have 8 points so far, just above the point-per-week average of promoted teams.

Another way of looking at the first graph is through points per game of promoted teams. The graph below shows this. Obviously, clubs' points per game are a little spread out at first. As the season progresses, teams converge toward an average of 1 point per game, as mentioned above. Some clubs have done a little better, and some a little worse, as is evident from the graph.

Next is the graph above, but again looking at the performance of the 3 promoted teams this season. Again, the graph shows that QPR started off the campaign a little stronger, but has since regressed to be even with Norwich and Swansea.

The final and most informative graph shows the cumulative points per game of promoted clubs. This graph answers my question of how promoted teams fare throughout the season. As you can see below, promoted teams seem to struggle up until week 7, then turn it around and do better than their average up until around week 20, after which they hover around the point-per-game mark until the end of the season. There could be a lot of explanations for this trend. Maybe clubs struggle at first, and then adjust to the higher competition? Maybe clubs' transfer window acquisitions (think QPR) start to pay off around week 7? It would be tough to tell what factors are really driving the trend, but the graph does highlight an interesting phenomenon.
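For anyone who wants to reproduce the cumulative points-per-game lines, the calculation is simple. Here's a Python sketch (illustrative only; names are mine):

```python
def cumulative_ppg(results):
    """Cumulative points per game, week by week, from one club's match
    results ('W', 'D' or 'L') in season order."""
    points = {'W': 3, 'D': 1, 'L': 0}
    total, out = 0, []
    for week, result in enumerate(results, start=1):
        total += points[result]
        out.append(total / week)
    return out

def average_across_teams(seasons):
    """Average the week-by-week cumulative PPG over many promoted
    teams, giving the average line in the graphs."""
    return [sum(week) / len(week) for week in zip(*seasons)]
```

Running `cumulative_ppg` for each promoted club's 38 results, then `average_across_teams` over all of them, gives the average curve that dips before week 7 and settles near one point per game.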

I'm still working on doing a similar analysis of clubs that are relegated at the end of the season to analyze how their performance fluctuates throughout the season.

]]>