Some of the players I’ve mentioned were top picks. Brian Carroll was selected 2nd overall in 2000. Taylor Twellman was selected 2nd overall in 2000. Other top players, though, were chosen in the later rounds and still went on to very successful careers: Chris Rolfe was selected 29th overall, and Davy Arnaud was selected 57th overall and scored 54 goals in his career.

Nikolas who?

On the flip side, there are a number of notable draft busts. Nikolas Besagno was selected 1st overall in 2005 and went on to play in only 8 games. Joseph Ngwenya went 3rd overall in 2004 to Salt Lake and scored 18 goals in his career, while Salt Lake passed over Clint Dempsey, Clarence Goodson and Michael Bradley.

This post aims to provide some context around the value of draft positions. This can be helpful for determining a fair trade (“Should I trade up to a higher selection?”) or looking at how clubs have performed in their draft selections (apparently the Rapids have done a pretty crappy job overall).

Club | Avg GP | Avg Mins | Avg Goals | Avg Years | Picks |
---|---|---|---|---|---|
Portland Timbers | 21.83 | 1715.33 | 3.17 | 3.00 | 2 |
Miami Fusion | 17.70 | 1544.61 | 0.24 | 3.67 | 9 |
MetroStars | 15.24 | 1179.28 | 1.60 | 4.74 | 35 |
Philadelphia Union | 17.00 | 1169.65 | 1.68 | 3.78 | 9 |
Toronto FC | 14.70 | 1110.70 | 0.91 | 2.70 | 20 |
Chivas USA | 15.06 | 1095.91 | 0.87 | 2.35 | 20 |
Houston Dynamo | 14.18 | 1018.77 | 1.09 | 4.12 | 16 |
Kansas City Wizards | 13.67 | 1005.15 | 0.97 | 3.88 | 57 |
Los Angeles Galaxy | 13.85 | 996.64 | 1.29 | 5.23 | 65 |
D.C. United | 13.31 | 984.91 | 0.94 | 4.11 | 57 |
F.C. Dallas | 13.95 | 970.85 | 1.15 | 3.73 | 33 |
New England Revolution | 12.68 | 950.84 | 1.40 | 3.50 | 62 |
Columbus Crew | 13.41 | 937.60 | 1.42 | 4.02 | 57 |
Real Salt Lake | 12.72 | 910.79 | 0.78 | 4.10 | 20 |
New York Red Bulls | 12.67 | 879.00 | 0.75 | 3.39 | 18 |
Chicago Fire | 12.61 | 855.73 | 1.14 | 4.19 | 68 |
San Jose Earthquakes | 11.84 | 844.45 | 0.62 | 4.50 | 48 |
Dallas Burn | 11.48 | 803.24 | 1.24 | 3.27 | 33 |
Colorado Rapids | 10.98 | 729.43 | 0.71 | 2.22 | 55 |
Vancouver Whitecaps FC | 9.07 | 588.27 | 0.20 | 3.75 | 4 |
Tampa Bay Mutiny | 8.25 | 445.35 | 0.70 | 1.54 | 13 |
Seattle Sounders FC | 8.53 | 438.61 | 0.71 | 3.17 | 12 |
Sporting Kansas City | 4.67 | 91.00 | 0.00 | 1.00 | 3 |

Relating to the second point, some clubs have done better than others in the draft historically, as can be seen in the table above. Just defining “better” is difficult. As I’ll explain later, in this post I’ve used a number of different metrics to define the value of a draft pick.

This post is not meant to try to predict which players will succeed and which ones will fail. I’m not trying to determine if Cyle Larin will be better than Khiry Shelton. Instead, I just want to look at how players have performed statistically by their rank in the draft. This way we can answer questions like “How many minutes should I expect a 3rd round pick to play in his career?” or “How many goals do I expect a forward selected 8th overall to score?” or “How many years can we expect a GK selected in the 1st round to play?” These are forward-looking questions with answers that are backwards-looking. However, I think it’s a reasonable assumption, or at the very least a good place to start, when trying to estimate the performance of draft picks into the future.

**DATA GATHERING**

This type of analysis requires a fair amount of data. All the scripts I used to collect these data are available in the repository I created on GitHub, and any updates I make going forward will be pushed to this repo. There’s a bunch of code in there that I’ve written to scrape data from online, but for this post I’m focusing on the mls_draft folder and the mls_scrape.py file. I’m not going to go too deep into how I scraped the data, but if you want to learn more about the process, feel free to send me an email.

First, I needed the past SuperDraft selections by team. These data were all available on Wikipedia. The code used to scrape the Wikipedia pages, much of which was based on this blog post, is in the mls_draft.py file in the mls_draft folder. Once the data was scraped and saved to a csv file, the create_draft_stats.sql script uploads it to the MySQL database I set up. With the draft data all set, I also needed the actual career statistics of all MLS players; the mls_scrape.py script handles this. There was a bunch more data cleaning that’s not very exciting (fixing player names, making sure the data is correct, etc.). Eventually I ended up with the necessary data all cleaned and looking pretty. If you see anything that’s incorrect in the data, let me know; most likely I missed some stuff.
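The ugliest part of that cleaning was matching player names between the draft tables and the career stats, since the same player is often spelled differently across sources. Here is a minimal sketch of the kind of normalization involved (`normalize_name` is a hypothetical helper for illustration, not a function from the repo):

```python
import unicodedata

def normalize_name(name):
    """Strip accents, collapse whitespace, and lowercase so draft
    records and career stats can be joined on the same key."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(ascii_only.lower().split())
```

A join key like this catches accents and stray whitespace; truly mismatched names (nicknames, transliterations) still need fixing by hand.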

In order to ensure that the results are reliable, I only looked at draft selections from 2011 and before. Draft selections from 2012 and after haven’t had enough time to come to fruition, so their data is messy and not very informative.

**ANALYSIS**

The first question I was interested in investigating concerned how often selections can be considered successful by round. In other words, if I have a 3rd round draft pick, what is the probability that the selection will be considered a success? As I alluded to earlier, defining success is difficult. Ultimately, I settled on a simple metric: minutes per year. The question of which threshold to choose is definitely an open one. In fact, there is really no right answer, so I tried a couple different values for determining a success.

I first looked at the median minutes per year. The idea is pretty simple: a selection is a success if the player ends up playing more minutes per year than 50% of all draft selections.

Round | Success % |
---|---|
1 | 85.9% |
2 | 65.0% |
3 | 38.0% |
4 | 33.8% |
5 | 14.5% |
6 | 14.8% |

The median is slightly under 150 minutes per year. This actually seems like a pretty good threshold; assuming 34 games per year, that’s about four and a half minutes per game over a season. So really we’re saying a success in the draft is a player who ends up being at least a regular sub. Above are the percentage breakdowns for each round, i.e. what percent of players drafted in each round ended up playing more than 150 minutes per year on average.
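As a concrete sketch of the metric, here is how the round-by-round success rates fall out of a minutes-per-year table. The numbers below are made up for illustration; the real ones come from the scraped career stats:

```python
from statistics import median

# Toy minutes-per-year figures for each draftee, keyed by draft round.
mins_per_year = {
    1: [900, 1500, 200, 2100],
    2: [400, 80, 1200, 150],
    3: [0, 60, 500, 20],
}

# "Success" = playing more minutes per year than the median draftee.
all_values = [v for vals in mins_per_year.values() for v in vals]
threshold = median(all_values)

success_rate = {
    rnd: sum(v > threshold for v in vals) / len(vals)
    for rnd, vals in mins_per_year.items()
}
```

Swapping in a different threshold (say, the 400 minutes per year discussed next) just means replacing the median line with a constant.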

An astute observer might say that teams are expecting more than 5 minutes per game from their selections. In fact, the density plot above suggests another possible threshold for defining success. There seems to be a kink right around 400 minutes, marked in the plot below.

Round | Success % |
---|---|
1 | 75.0% |
2 | 49.7% |
3 | 27.2% |
4 | 22.8% |
5 | 10.9% |
6 | 11.1% |

At that mark the density plot flattens out, so let’s use the 400 minute threshold instead. Above are the percentage breakdowns for each round with the new threshold of 400 minutes per year on average.

Whatever threshold is used, there is a clear picture of how picks tend to perform based on round. The 1st round is clearly the best and with each subsequent round the probability of being deemed a success decreases. While I’m clearly not going to win any awards for that conclusion, it is at least interesting to see exactly how much the likelihood of a player becoming a success decreases. Interestingly enough, there doesn’t seem to be much of a difference in the success probabilities of the 5th and 6th rounds, although there is a good chance this is due to a small sample size, considering a lot of years the draft has had fewer than 5 rounds (in fact, there were only 3 in 2011 and 2 in 2013 and 2014, although I only use years 2011 and before for this analysis).

We can do even more interesting analysis with these data. It’s possible to model the expected value of various statistics as a function of the pick number. Even more, we can break this down by position and answer a bunch of different questions. If a team selects a forward 10th overall, how many goals should we expect them to score in their career? What about the expected number of minutes per year for a defender selected 25th overall?

To answer these questions, I use LOESS. I used LOESS previously to smooth out the raw outcome probability curves for the outcome probability calculator. LOESS, which kinda sorta stands for LOcal regrESSion, is useful here for a couple of reasons. First, it’s simple: it’s about as easy to use as a linear regression. Second, it’s non-linear and good at smoothing out curves, and the curves I’m looking to model here are a bit wacky; LOESS is a simple way to get an accurate representation of the data. The biggest drawback is probably that it’s non-parametric, which means you don’t get a closed-form equation for the fitted curve. But that isn’t the end of the world in this case.
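For the curious, the core of LOESS is small enough to sketch directly: at each point you fit a weighted linear regression over a local neighborhood, with weights that fall off according to the tricube kernel. This is a bare-bones illustration, not the implementation I actually used (production versions like R's loess() add local quadratic fits and robustness iterations):

```python
import numpy as np

def loess(x, y, span=0.5):
    """Fit a local weighted linear regression at each point, using the
    nearest `span` fraction of the data and tricube weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(np.ceil(span * n)))        # neighborhood size
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]            # k nearest neighbors
        h = dist[idx].max() or 1.0            # neighborhood radius
        w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube kernel weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(k), x[idx]])
        # Weighted least squares via sqrt-weight scaling.
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted
```

The `span` parameter controls how wiggly the curve is: a small span hugs the noise, a large one approaches a single straight line.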

Below is a plot of overall pick number versus total minutes played. The clearest takeaway is that there isn’t much of a pattern to the data. Yes, there is a general trend downwards as the pick number increases, as we can see in the LOESS curve in blue. But even “general” might be too strong a word here; there is a lot of noise around that curve.

The same lack of a trend is evident if we look at the pick number versus number of goals for Midfielders and Forwards only. It’s actually even worse, as the tail on the distribution of goals is very long. In fact, the distribution of goals scored seems to follow a Power Law (another plug, apologies). Seeing a lack of a relationship between pick number and both minutes and goals scored is honestly not that surprising. It’s very difficult to predict a player’s success based on their play from before the draft. It’s important to keep that in mind for the analysis ahead.

Using these LOESS curves, it’s possible to determine some “fair” draft pick trades. Let’s say you’re in a club’s front office and you have the first pick in the draft. What is a trade you could make that would be equitable, at least in terms of historical data? The expected minutes for the 1st overall pick is 10,844. If you were offered both the 11th (6,407 expected minutes) and 18th (4,390 expected minutes) overall picks, you’d expect a total of 10,797 minutes, which seems like a pretty fair trade.

What if you’re looking for a goal scorer instead? Let’s say this year you have the 5th overall pick. The expected goals for a Midfielder/Forward picked 5th overall is about 16. For 13th overall it is 10.3 and for 25th overall it is 5.6, also yielding a total of around 16.

This could also be applied to trades of draft picks for current players, all other factors aside. With the 15th overall pick you would expect 5,123 minutes. So if you think the current player you’re trading your 15th overall pick for is worth about that, you should be willing to make the trade.
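That trade arithmetic is easy to wrap in a small helper. The expected-minute figures below are the LOESS estimates quoted in this post; `is_fair` and its 5% tolerance are my own illustrative choices, not a league rule:

```python
# Expected career minutes by overall pick, taken from the LOESS
# estimates quoted in this post (only the picks discussed here).
expected_minutes = {1: 10844, 11: 6407, 15: 5123, 18: 4390}

def package_value(picks, table=expected_minutes):
    """Total expected minutes of a package of picks."""
    return sum(table[p] for p in picks)

def is_fair(give, get, tolerance=0.05):
    """Call a trade 'fair' if the two packages' expected values fall
    within `tolerance` (as a fraction) of each other."""
    a, b = package_value(give), package_value(get)
    return abs(a - b) / max(a, b) <= tolerance
```

By this rule, swapping the 1st overall pick (10,844 expected minutes) for the 11th plus the 18th (10,797 combined) comes out fair, while 1st for 18th alone does not.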

There are a bunch more ways to break these types of trades down. Instead of going through more examples like above, I’m just going to post the results from the LOESS models for both minutes and goals by pick number below. Feel free to download and play around with the data.

Minutes model · Goals model

**CONCLUSION**

With all this analysis in mind, I think it’s important to go back to a point I made at the beginning of this post: nowhere here am I trying to make predictions about specific players. Instead, this is just an overall look at how draft picks have fared. Going back to my original example, if you think Khiry Shelton is going to break MLS scoring records, whether because he’ll fit into your system well, or you’re smarter than everyone else, or he’ll play well in NYC, or whatever else, then you should value him above what my model predicts based on his draft pick alone. That being said, I hope this post at least serves as a solid starting point, or at least enters the discussion, when valuing draft picks. The model is simplistic, but it provides good context around the expected value of draft picks.

If you want to see how I created the models, check out this post.

And if you haven’t seen the Economist blog post from a couple weeks back comparing Messi to Ronaldo using the data, read it here.

A lot of people have reached out to me asking for the data or have been trying to manually gather it from the applet. If you’re interested in using the data then just reach out to me at soccerstatistically@gmail.com and I’d be happy to send all the raw data to you, provided you reference this blog when you use it.

Finally, now that the calculator is fixed I can focus on some other work I’ve been doing. I’ve admittedly been absent from posting here for a while. I have a few posts I’ve been working on recently, so expect some new stuff coming soon...

To see how various (FIFA-defined) continents have done compared to past World Cup results, I used past World Cup data collected from 11v11.com (here is an example from the United States’ page: http://www.11v11.com/teams/usa/tab/stats/comp/978). These results include all World Cup and World Cup qualifying games, which is what I limited my analysis to. World Cup qualifying games are a little different than World Cup games, but since they are almost always between countries from the same continent, and I drop intra-continent games anyway, I think it’s OK to include them. What defines a continent is pretty hazy, so I just stuck with FIFA’s definitions. This means that Australia is actually part of Asia, among other anomalies, but this division of the world is the best way to stay consistent. The continents I ended up using were Africa, Asia, CONCACAF, Europe, Oceania and South America.

If you want to look at the code I wrote to do the analysis (the data scraping, the actual analysis, and the visualization), head over to https://github.com/fordb/wc-continent-headtohead.

There’s nothing too crazy going on in the analysis, just a lot of graphs to look at.

Each one shows how each individual continent has historically fared against every other continent. I focused on just goals against (GA) per game and goals for (GF) per game because these can be compared no matter how many games have occurred between two continents. GA and GF are also a little more representative than wins, losses and draws; Nate Silver preferred GF/GA for this reason when he created his Soccer Power Index.

From these, it’s clear that Europe and South America have historically dominated. Oceania is a bit of an outlier just because of how few countries from there qualify for the World Cup. It seems like Oceania teams have dominated CONCACAF teams in the limited games between them, but we probably can’t infer much from such a small sample size.

CONCACAF teams have historically scored just under 1.5 goals per game against European teams, and closer to 1.4 goals per game against South American teams, compared to giving up 2 and just over 2 goals per game, respectively.

How does this compare to this year’s World Cup? To compare I made another set of graphs, this time comparing each continent’s historical GF per game/GA per game vs. their GF per game/GA per game in this World Cup.

CONCACAF countries did not actually score much more than in the past. They fared a little better against African and South American countries, but were about level with their past results against European countries. However, their GA per game dropped significantly this World Cup: CONCACAF’s (relative) success came from better defense. This isn’t that surprising, especially with Costa Rica.

European countries really struggled scoring goals compared to their historical average, while staying about the same in GA per game.

South America has been dominant and continued to stay dominant. The only drop they had in scoring was against CONCACAF countries, which is what I mentioned earlier.

Asian countries stayed about on par with their previous performances, although this was already fairly poor.

African countries gave up more goals, but were about on par with their past goal scoring.

It’s important to keep in mind that while these graphs are interesting, the results from this World Cup should be taken with a grain of salt. There are not a large number of games, which means that we might see some strange goals for or goals against averages. So while something may seem to be a big change from previous results, it could just be from the fact that there were only 1 or 2 games between countries from two continents.

While I am not really interested in betting on soccer myself, odds do provide an interesting estimate of the probability of an outcome occurring. For example, take Arsenal's home game against Chelsea this past year. Bet365 put the odds of an Arsenal victory at 2.38. These decimal odds imply that they expect the probability of an Arsenal victory to be about 42%. Taking into account that the odds makers usually lower the payouts so that they make money, the adjusted probability of an Arsenal victory is just over 41.1%.
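For reference, converting decimal odds to probabilities works like this. The draw and away prices below are hypothetical, since only the Arsenal price (2.38) is quoted above:

```python
def implied_probability(decimal_odds):
    """Raw probability implied by a decimal (European) price."""
    return 1.0 / decimal_odds

def remove_overround(odds):
    """Normalize the raw probabilities of the outcomes (home/draw/away)
    so they sum to 1, stripping out the bookmaker's margin."""
    raw = [implied_probability(o) for o in odds]
    total = sum(raw)  # > 1 whenever the book has a margin
    return [p / total for p in raw]

raw = implied_probability(2.38)  # ~0.42
home, draw, away = remove_overround([2.38, 3.50, 3.10])
```

The adjusted home probability always comes out a bit below the raw one, which is exactly the margin-removal step described above.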

This is all pretty standard stuff. The odds for relatively evenly matched games like the one above are probably pretty accurate, or at least more accurate than your average person. But what about significant underdogs? What about City against Cardiff? These are a little more difficult to assess. It's clear that Cardiff is an underdog in this game, but how much of an underdog? And do odds makers do a good job of assigning implied probabilities to these lopsided games?

To look into this, I defined games with a significant underdog as games where one outcome has an implied probability of occurring under 15%. This could be a home win, an away win, or a draw. If the odds makers are effective at predicting underdog outcomes, we should see underdog outcomes (as I define them) occurring under 15% of the time, since the implied probability could be anywhere from just above 0% to 15%.
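The check itself is one small function. `underdog_hit_rate` is an illustrative helper, not the code I originally used; it just measures how often outcomes flagged as big underdogs actually happen:

```python
def underdog_hit_rate(games, threshold=0.15):
    """games: (implied probability of the underdog outcome, whether
    that outcome happened) pairs. Returns the share of flagged games
    in which the underdog outcome actually occurred, or None if no
    game qualifies as a significant underdog."""
    flagged = [hit for prob, hit in games if prob < threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)
```

If the books are well calibrated, the returned rate should land below the threshold.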

I used data from Football-Data.co.uk and took the Bet365 odds for the past 5 seasons of the Premiership. After limiting the data to significant underdogs as defined above, I ended up with 643 games with a significant underdog. Of those, 110 were games with a home team underdog, 441 with an away team underdog, and 92 with a draw underdog. These splits make sense: away teams are underdogs more often than home teams due to home field advantage.

Next I calculated what percentage of underdog games resulted in the underdog outcome actually occurring. As mentioned above, these percentages should be under 15% if the odds makers are doing a good job. Below is a graph showing these percentages over the past 5 years, split into home, away, and draw underdogs.

The red dashed line is the 15% cutoff. For draw and away underdogs, we see about what we would expect: the percentage of underdog outcomes occurring fluctuates, but it remains under the 15% line. For home underdogs, though, there is a different story. In every year besides 2012, home underdogs actually won more than 15% of the time, in some cases significantly more. This seems to indicate that odds makers are underestimating home field advantage when weaker teams host stronger teams.

This evidence suggests there is an inefficiency present, specifically in the odds of underdog home teams. However, there are still a few caveats. First, there was a smaller number of home underdog games in the past 5 years, which may be influencing the results somewhat. Second, odds are set so that payouts are somewhat lessened, which is how odds makers make money. This means that if you find an inefficiency like the one above, it has to be large enough to overcome the built-in advantage the odds makers have. For the first 3 years, this does not seem to be the case for home underdogs.

Overall, though, it seems like there could be something here. But don't blame me if this betting strategy doesn't work; who knows if it will persist into the future.

I think the reason for this lack of progress in the soccer analytics community is threefold:

First, it is very hard to get across new ideas and specific developments in a panel. Panels are not conducive to discussions of new research and developments. The panel was able to give a general view of soccer analytics, a view in which we step back and look at progress from a distance. If you think you've heard this general view repeatedly over the past few years, you're not alone. To learn truly new ideas at SSAC, one has to attend the research paper presentations.

Second, not very much has actually been done with soccer analytics in the past 2 years. You may argue with me, and of course some things have been done. But in comparison to the developments in basketball, the progress is small. Compared with just 2 years ago, the NBA is now heavily influenced by analytical findings. Daryl Morey, the general manager of the Houston Rockets, talked about how teams are starting to shoot a lot more corner 3's and score a lot more points from high percentage layups and dunks inside the paint. This realization largely stems from analytical work done in basketball. There has been no such large influence on how soccer is played (at least that I can tell) in the past couple of years.

Third, soccer is genuinely harder to break down analytically than other sports. Baseball can be broken down into individual at bats; basketball and football can be broken down into individual possessions. Soccer, though, is one fluid game with no easy way to break it down. People have been preaching this for as long as I can remember.

The thing is, I'm cheating a little bit here. What value does pointing out problems have if I don't offer any solutions? If soccer wants to come out of its analytical dark age, we need to figure this problem out. I'm not saying it's easy. If it were, I'd have some brilliant idea. But I don't. I'm just as stuck as most people seem to be.

A lot of this problem comes from the stigma that statistics somehow degrade soccer. Fans think that analytics takes the beauty out of soccer. At SSAC, the legendary Edward Tufte had a perfect quote: "Analytics doesn't take the beauty out of sports. It takes the stupidity out of it."

If data are going to truly influence the sport of soccer at the professional level, basketball is a brilliant lead to follow. The revolution started with individuals, bloggers and armchair analysts. Eventually people started to see that there were inefficiencies to exploit, and these individuals started to figure out how to exploit them. Finally, forward-thinking team executives caught on and started to change the way they played based on the analysis. Think Billy Beane, Daryl Morey, Theo Epstein, etc. Once other teams saw that this analysis was working, they started to imitate.

(Pictured: Daryl Morey, General Manager of the Houston Rockets, at SSAC 2013)

There is no doubt in my mind that inefficiencies exist in soccer that can be exploited with statistics. The most exciting part is that few of these inefficiencies have been found yet. Or if they have been found, they haven't been exploited. These low hanging fruit are waiting to be picked. And once we figure out how to gain a significant enough edge for team executives to notice, they'll catch on and start adapting. This adaptation may be a little slower in countries other than the United States. For whatever reason the US seems to be much more open to statistics in sports.

What soccer needs is an Oakland Athletics. Or a Houston Rockets. I have no idea when this will happen; it may already be happening, or it may take 10 years. Even more than that, we need to prove the worth of analytics in soccer, because clubs (understandably) will not invest in analytics unless they think real value can be gained from it. I don't think we're there yet, but we're on our way.

But when one club eventually punches well above its weight as a result of the edge it gets from analytics, other clubs are going to notice. And after that, the dominoes are going to fall into place and the game will probably start to change, just the way it has in basketball in the past 2 years and in baseball before that. The seeds can be planted by individuals, but nothing will grow until there's effective adoption by clubs.

**Data**

To create a model that gives the odds of win, draw, and loss for a club depending on the venue of the game (home/away), the minute, and the goal difference in the game, I used over 4000 EPL games stretching back to 2001. This is an improvement over the older Outcome Probability Calculator, which only used 1000 games.

**Methodology**

Even though there are over 4000 games to base the model on, there are inevitably still some game scenarios that do not occur often enough to get reliable results. There are not many games in the past data where the away team went up by 3 in the first 5 minutes. Because of this, when I plotted the outcome probability against the minute of the game, the line was not exactly smooth. To make the results a little more reliable, and hopefully more consistent with actual games, I applied LOESS to smooth the lines. This method has some positives and negatives: it does a good job of smoothing the data, but it is non-parametric, so the fit cannot be expressed as an equation the way a regular regression can. Below are two plots, one using the LOESS fit and one showing just the raw data.

**Shiny**

Best of all, I updated the application interface. RStudio Shiny lets you make simple web applications through R, and the Shiny interface makes the application a little more functional. The Outcome Probability Calculator now displays a graph similar to the one above, with a marker for the specific minute chosen. It's interesting to mess around and explore different game scenarios.

**Next Steps**

Next I am planning on updating the Expected Points Added application with the new data. I'll also hopefully calculate this for every player in the past 10 years of English Premier League football. It should be interesting to see how this has fluctuated from year to year for specific players.

Here are some of my thoughts:

I really like the emphasis put on the role that luck plays in soccer. People are often inclined to try to use data to explain everything that happens in soccer. Of course, this is impossible. There are always going to be events that simply cannot be predicted, and I think this is crucial to keep in mind going forward. Data can give you an edge, but it is never going to be the only factor that leads clubs to championships or promotion. To use an example from the book, no model is ever going to predict a beach ball coming into play and deflecting a shot into the net. This is a glaring example of randomness sneaking into the game to determine a result, but there are countless other incidents in every match that affect results.

A fact I found interesting: 99 percent of the time players didn't touch the ball, and 98.5 percent of the time they ran without it (page 143). Analysts have focused almost all of their time on the events that occur when the ball is at a player's feet. That's focusing on only 1% of a player's input in the game. While these are the most important and easy to record parts of the game, it is interesting that we ignore almost 99% of a player's activities when looking at their performance. Ignored is the work a player does to get in the right position defensively, or the right angle to receive a pass, or the right space to be able to take a shot. These are extremely difficult to analyze and keep track of, but they are likely just as important as the events that occur when a player actually does have the ball.

The part of the book that will have the greatest influence on how clubs behave, at least in my opinion, is the weak link versus strong link analysis (page 218). Basically, the analysis in the book says that improving a club's weakest player is more beneficial than improving a club's strongest player. This is a huge finding, and completely goes against what is "known" in soccer. Instead of signing the big name striker for way too much money, simply upgrading your weak left back to a stronger one can improve your club for a lower cost. Of course, this is not going to be the case every time, but their analysis provides strong evidence that this is the case a lot of the time. If I were employed by a club to analyze signings, this is the first idea I would bring up to improve the club.

Overall, I think Chris and David do a good job of making the book easy to read and entertaining. I assume it would be all too easy to make the analysis complicated (they are Ivy League professors, after all). However, they do a good job of weaving a narrative while also offering a lot of very good insights. That being said, the future of soccer analytics will likely be a bit more complicated. As they point out themselves in Forecast 4 on page 304, mathematical tools like algebraic geometry and network analysis are going to provide the basis of most insights going forward. Instead of ball events, it is going to be the more complicated analysis of off-the-ball events like spacing and positioning. These off-the-ball events make up the 99% of the game that is generally not recorded or analyzed, as mentioned before. These are likely going to be the events that penetrate deepest into the flowing, dynamic, non-stop nature of soccer.

If you haven't yet read it and you're reading this blog, you should probably pick up a copy and read it.

*Data*

I had a hunch that this would not be the case. Specifically, my guess was that there would be more goals scored between the 85th and 90th minutes, whereas there would be fewer in the first 5 minutes of the game. To test this hypothesis, I used data from the Rec.Sport.Soccer Statistics Foundation page from 8 years of the Premiership.

*Methodology*

At first I looked at the number of goals scored in each individual minute of the game across the 8 years of my data. However, it was clear that this breakdown was too granular; the variability was high because there just wasn't enough data for each individual minute. To solve this, I aggregated the data into 5-minute intervals. This way, the data is less specific but gives a clearer picture of what is going on. Below is the graph I came up with:

As you can see, the x-axis gives the end of the time interval. In other words, the bar over 50 on the x-axis represents the number of goals scored in the 45-50 minute time range. The y-axis is the frequency, or number of goals in that time range.
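That binning convention can be reproduced in a few lines. `bin_goal_minutes` is an illustrative helper, not the code I originally used:

```python
from collections import Counter

def bin_goal_minutes(minutes, width=5):
    """Aggregate raw goal minutes into `width`-minute bins keyed by the
    end of the interval, so a goal in minute 47 lands in the 45-50 bin
    (keyed 50) while a goal in minute 45 stays in the 40-45 bin."""
    bins = Counter()
    for m in minutes:
        end = ((m - 1) // width + 1) * width
        bins[end] += 1
    return dict(bins)
```

The `(m - 1) // width` trick is what keeps an exact-boundary minute like 45 in the earlier bin rather than starting a new one.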

*Confidence Interval*

There are also 3 lines on the graph. The middle line gives the mean number of goals across all the time intervals. The top line is the mean number of goals across all the time intervals plus the standard deviation of the number of goals. The bottom line is the same thing but subtracting one standard deviation. I added these to the graph because they provide nice reference points to help determine time intervals that can be considered outliers.

*Late First Half Goals*

The first thing that sticks out from the graph is the number of goals scored between minutes 40-45. It seems that a lot of goals are scored right before the halftime mark. This is pretty interesting, and something I was not necessarily expecting. It is also probably the worst time to give up a goal; instead of going into the half tied or up by one, a club would instead be going into the locker room down by one or tied.

*First Versus Second Half Goals*

Another piece of the plot worth noting is the difference between the height of the bars in the first half compared to those in the second half. In the first half, the number of goals sits right around the lower line of the interval. In the second half, it sits around the upper line. Clearly, more goals are scored in the second half of games, which is probably not the first time someone has pointed this out.

*Implications*

There are a number of takeaways from this plot. First, the final 5 minutes of the first half is a vital period of the game. Clubs should bear down and play more defensively than usual, especially considering that you don't want to go into the half having just given up a goal.

Second, I was thinking that there could be applications to betting. While I don't bet on soccer games myself, the information could be useful for someone placing a bet on the time a goal is scored, or even when the first goal of the game is scored. If you do put a bet on it, just don't come back complaining to me if it doesn't work out!

**Update 6/10/2015:** It's worth noting that the number of goals scored in the 40-45 minute bin is likely a bit less than what is represented in the graph above. I'm not 100% sure, but it seems that the RSSSF data counts a goal scored in the 2nd minute of first half stoppage time as being scored in the 45th minute, meaning the numbers are likely inflated for that specific bin. Thanks to Jack Heer for pointing this out in the comments.

The reason for this is that goals scored does not follow a normal distribution, the bell curve we are used to. For example, if you looked at the distribution of heights in a population, you would see a nice bell curve. Most people are right around the average height, and as you go towards either extreme (really short or really tall) you find fewer and fewer people. The mean height of the population is therefore instructive because it gives us the "normal" or "typical" height.

The problem is, goals scored in a season does not follow a normal distribution. Instead, most players score no goals at all. The next most common tally last season? Just one goal, of course. The counts keep falling off in this way, following a power law distribution.

While I don't understand a lot of the mathematical intricacies of the power law distribution, I can explain generally how it behaves. A good example comes from earthquakes. Most earthquakes are relatively small, and as the size of an earthquake increases, its frequency decreases. Specifically, for every magnitude 4 earthquake there are 10 times as many magnitude 3 earthquakes and 100 times as many magnitude 2 earthquakes. This pattern holds for all magnitudes.
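That tenfold pattern can be written down directly. A minimal sketch, assuming the textbook rate of ten times as many quakes per one-unit drop in magnitude (the function name is mine, not from any seismology library):

```python
def relative_frequency(magnitude, reference=4):
    """Expected number of earthquakes of a given magnitude for every one
    quake of the reference magnitude, assuming frequency rises tenfold
    with each one-unit drop in magnitude."""
    return 10 ** (reference - magnitude)
```

Here `relative_frequency(3)` gives 10 and `relative_frequency(2)` gives 100, matching the ratios above.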

My idea was that a similar power law distribution could be used to model the number of goals scored in a season. It turns out that it does a good job, which has a lot of implications for predicting goals scored in a season. Below is a histogram of the number of goals scored in the 2011-2012 EPL season. The data is from the MCFC Analytics Basic dataset. As you can see, a lot of players scored just 1 goal, fewer scored 3 goals, and even fewer scored 10.

So how is it possible to fit a function to this distribution? It turns out that if we take the natural log of both axes (count and goals) and plot the new points, the points should form a line. As you can see from the graph below, the linear relationship is pretty strong (R-squared = .91). This implies that the power law fit to the original data is also strong.

A power law follows the equation y = k * x^n. In this case, we have the equation count = k * goals^n, where n is just the slope from the log-log graph above. The slope is -1.63, so n = -1.63. To find k, we raise e to the y-intercept of the log-log graph, which gives k = 161.61. So the equation that fits the first graph is count = 161.61 * goals^(-1.63). When I plot this equation over the distribution of goals from the first graph, it is clear that there is a nice fit.
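The whole fitting procedure can be sketched in a few lines. This is an illustrative Python version of the log-log least-squares fit; the sample data is synthetic rather than the MCFC numbers, and the final line plugs in the article's fitted values (k = 161.61, n = -1.63):

```python
from math import log, exp

def fit_power_law(goals, counts):
    """Fit count = k * goals**n by least squares on the log-log points:
    the slope of the fitted line is n, and e raised to the intercept is k."""
    xs = [log(g) for g in goals]
    ys = [log(c) for c in counts]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    intercept = ybar - slope * xbar
    return exp(intercept), slope  # k, n

# Synthetic data lying exactly on a power law; the fit recovers k and n.
goals = [1, 2, 3, 5, 10]
counts = [160 * g ** -1.6 for g in goals]
k, n = fit_power_law(goals, counts)

# Plugging the article's fitted values into the model: expected number of
# 30-goal players per season.
expected = 161.61 * 30 ** -1.63
```

On data that only approximately follows a power law, like the real goal counts, the recovered k and n are of course estimates rather than exact.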

So what are the implications of all this? First, it says that predicting the number of goals a player will score in a season is very hard to do. Events that follow power law distributions are extremely hard to predict. Earthquakes, for example, are basically impossible to foresee; even experts cannot reliably predict *when* an earthquake is going to happen. However, because they know the distribution, they can say how often earthquakes of various magnitudes are expected to occur. To extend the analogy to goal scoring, it is very hard to predict the number of goals a given player will score in a season. However, from this analysis we know the overall distribution, so we can predict how often particular goal-scoring tallies will happen.

For example, how often do we expect a Premiership player to score 30 goals in a season? Well, if we plug 30 into the model above, we get count = 161.61 * (30^-1.63) = 0.63. In other words, we expect a player to hit 30 goals about 0.63 times per season, or in roughly 6 of every 10 seasons. If you look, this is consistent with what has happened over the past 10 Premier League seasons.

To finish up, I am not saying that we can't tell whether Owen Hargreaves has a good chance of outscoring van Persie next season. Obviously, a lot of people predicted that van Persie would lead the league in scoring this season, and he has. However, if we take a step back and look at goal scoring from just this past season, we can find a power law distribution that is consistent with goal scoring over 10 whole years.

PS: I'll be traveling for the next 3 months, so I won't have consistent internet access and won't be able to make any posts. I'll try to stay active on Twitter, though, so I'll still talk there as much as I can.

The advanced data contains (x,y) location information of every statistic that is kept. This is valuable information, as it obviously tells exactly where each event happened in the game. I was interested in how this information can be used, specifically to look at momentum and passing trends.

*Previous Work*

Some work has already been done in the soccer analytics community on trying to quantify and analyze momentum. The Analyse Football blog looked at momentum shifts in this same game, although in a different way. The Soccer by the Numbers blog looks at momentum in football in a much more general way.

*Methodology*

By looking at the exact position of passes by each team during the game, one can tell in what area of the field the game is being played at a specific point. My aim was to break this down. I looked at the y location of each pass during the game. In the data set, each pass is given an (x,y) location tag, where the y value gives the location of the pass from end line to end line, with 0 ≤ y ≤ 100. In other words, a y value of 100 would be a pass on the other team's end line, and a y value of 50 would be a pass on the half line. The x location is also given, but for the purposes of this analysis I ignored it.

Just graphing the y location of every pass is too noisy to break down and understand. Instead, I took the simple moving average of the y location of the previous 25 passes of both teams. This gives a clearer picture of where the passes are being made throughout the game, and is taken as a proxy for momentum.
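The smoothing step is just a rolling mean. Here is a minimal Python sketch (the original work was done on the MCFC data; the sample input here is fabricated):

```python
from collections import deque

def rolling_pass_location(y_locations, window=25):
    """Simple moving average of the y location of the previous `window`
    passes, emitted once the window has filled. This smooths the raw
    locations into the momentum-style curve described above."""
    recent = deque(maxlen=window)
    averages = []
    for y in y_locations:
        recent.append(y)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages

smoothed = rolling_pass_location([50] * 25 + [100])  # [50.0, 52.0]
```

The `deque` with `maxlen=window` automatically discards the oldest pass as each new one arrives, which is exactly the behavior a simple moving average needs.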

Below is the graph plotting the simple moving average of the pass location throughout the game.

The straight black line down the middle marks a value of 0. When the simple moving average (the jagged black line) crosses it, the average y location of the previous 25 passes is at the halfway line. The blue and red bars show the actual y location of every pass for reference: the red bars are passes in the Bolton offensive half and the blue bars are passes in the City offensive half. I also marked each team's goals with blue and red dots on the simple moving average line. As you can see, the game ended 3-2 in favor of Manchester City.

*Analysis*

What does this graph tell us about the game? It seems that City dominated momentum for most of the game. The first 3 goals, though, were scored when there was relatively little momentum either way. After Bolton's first goal (which made the game 2-1 in favor of City), City gained momentum as the ball stayed in Bolton's defensive half. Possibly as a result of this momentum, City was able to score a 3rd goal, making the score 3-1. After momentum evened out and City enjoyed another spell of dominance, Bolton was able to gain momentum back, with most passes coming in City's defensive half. Again, possibly as a result of this momentum, Bolton got a 2nd goal to make the game 3-2. Finally, Bolton ended the game with most of the momentum, but City was able to hold off the attack and finish the game 3-2.

*Conclusion*

The location data of passes can give us a good proxy for momentum. Additionally, this momentum does seem to matter in terms of scoring goals. While the first 3 goals of the game were scored with seemingly no momentum for either team, the final 2 goals were each scored by the team dominating momentum at the time.

When the full advanced data set is available it will be interesting to do this analysis for more games. Obviously this is an imperfect measure, but I think it serves as a good indicator of momentum. I'd be interested to hear suggestions and comments on the methodology.

If you're an R user and are having trouble dealing with the Advanced MCFC Analytics XML data file, the link above provides the code to pull the data into a data frame in R. After this it is easy to perform whatever analysis you want on it.

I'll admit the code above is beyond my limited R skill level, but I know that it works. I'm excited to start doing some analysis, although the advanced data set is only for one game from last season at this point.

You can read the full blog post here.

I wanted to point out something interesting that I found while working on the model: betting odds do a relatively poor job of predicting football match outcomes. In other words, the percentage likelihoods of a win, draw and loss for the home team implied by the odds set by bookmakers are surprisingly inaccurate.
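For context on what "implied from the odds" means: the probabilities come from inverting the decimal odds and normalising away the bookmaker's margin. A minimal sketch with made-up odds:

```python
def implied_probabilities(home, draw, away):
    """Turn decimal odds into outcome probabilities. The raw inverse odds
    sum to more than 1 (the bookmaker's margin), so they are normalised
    into a proper probability distribution."""
    raw = [1 / home, 1 / draw, 1 / away]
    total = sum(raw)
    return [r / total for r in raw]

# Illustrative odds, not from a real match: home win 2.0, draw 3.5, away 4.0.
probs = implied_probabilities(2.0, 3.5, 4.0)
```

With these sample odds the home win comes out at roughly 48%, and it is this kind of implied figure that the model's predictions are compared against.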

My hypothesis for why this happens is that football is very unbalanced, especially in the EPL. It is very hard to predict when an upset is going to happen, mostly because these upsets are (seemingly) random.

Using just 4 factors in my model (the home team's goal differential for the season up to that game, the away team's goal differential up to that game, the home team's point total from the previous season, and the away team's point total from the previous season), I could create a model that was as accurate as the bookmakers'.

The question that remains is how much more accurate can the model become with the introduction of new variables? Beyond that, what variables should be used?

I am not sure I know the answers to those questions, but I am going to keep playing around with the data.

Inspired by this post on plotting the frequency of Twitter hashtags over time, I was interested in trying to apply the approach to soccer in some way. While not the most technical analysis, I thought it would be interesting to use this tool to analyze transfer rumors.

To summarize the process quickly, there is a package in R (open source statistical software) called twitteR which allows you to pull Twitter data. It's actually a fairly easy process, especially if you follow the tutorial in the link at the beginning of this post.

As most Twitter users know, there is a seemingly unlimited number of transfer rumors circulating Twitter. These range from fairly plausible to pretty ridiculous ("Ronaldo to the Philadelphia Union???"). As a Manchester City supporter, I was curious to look at a few popular transfer rumors related to City.

**Robin van Persie to Manchester City:**

Yes, this is definitely a rumor, and yes, it is probably not going to happen. But I was still curious. Below is a plot of the frequency of tweets that include both "Robin van Persie" and "Manchester City". Of course, this is an imperfect method, but it still gives us an idea of what is going on in the Twitter transfer rumor world.

To explain, the graph below measures the number of such tweets in 2-hour intervals over the past week. The height of each line gives the number of tweets referencing RVP and City in that 2-hour interval.
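The counting behind these plots is just flooring each tweet's timestamp to the start of its 2-hour window. An illustrative Python sketch (the actual pull used the R package mentioned earlier, and these timestamps are invented):

```python
from collections import Counter
from datetime import datetime

def bucket_tweets(timestamps, hours=2):
    """Count tweets per fixed window by flooring each timestamp to the
    start of its window (2-hour windows, as in the graphs)."""
    counts = Counter()
    for t in timestamps:
        window_start = t.replace(hour=t.hour - t.hour % hours,
                                 minute=0, second=0, microsecond=0)
        counts[window_start] += 1
    return counts

# Made-up timestamps standing in for each tweet's created-at field.
times = [datetime(2012, 6, 1, 9, 15), datetime(2012, 6, 1, 9, 50),
         datetime(2012, 6, 1, 11, 5)]
counts = bucket_tweets(times)
```

The two tweets between 8:00 and 10:00 land in the same bucket, while the 11:05 tweet falls into the next one.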

**Carlos Tevez to AC Milan:**

After Tevez's past season with the club, there are obviously transfer rumors concerning Tevez all over the place. Because of this, it was hard not to want to look at the data on Tevez. I picked AC Milan because it seemed like the club he had the highest likelihood of going to. Like above, I searched for tweets that included "Carlos Tevez" and "AC Milan". The frequency of these tweets, in 2 hour intervals, is plotted below.

You can try to analyze these graphs to find some meaning, but they are more a fun exercise than anything else. The twitteR package lets you do other cool things, like plotting the frequency of Twitter mentions for a user. I did this for another site I write for, EPL Index. They tend to get a lot more mentions than @SoccerStatistic does, so I thought it would be more interesting to plot the frequency of @EPLIndex mentions. Again, the intervals are every 2 hours.

Like I said before, this analysis is not very insightful or ground-breaking, but it's still pretty cool. The possibilities for future analysis like this are almost endless, so if people have good ideas for Twitter data to visualize, I'd love to hear them.

Swansea City, the "Welsh Barcelona"

**Possession, Does it Matter?**

Recently, I've started to question simply looking at percent possession during an entire game. It seems too broad an analysis to yield any significant or insightful results. Inspired by another one of Devin Pleuler's posts, I have looked at possession more specifically within a game. Instead of looking at how possession relates to results as a whole, I've broken down each game to try to understand how keeping the ball within a specific segment of a game changes a club's chances of scoring. Within the EPL, does a long spell of possession mean a team is more likely to score? If the answer is yes, it could point to the fact that possession actually does matter, albeit at a much smaller level.

**Methodology**

To do this, I have broken down passing patterns within specific Premier League games. Instead of looking at possession as a whole, I looked at the last 25 passes and who made them. As the game progresses and passes are made, I found the difference between the two clubs in the last 25 passes. For example, the home team could have completed 15 of the last 25 passes, while the away club completed the other 10. This would give us a pass difference of +5, as the home club completed 5 more passes than the away club. As you might expect, this is a constantly fluctuating number, which changes from positive to negative many times throughout the game.
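The pass-difference series can be sketched as a rolling sum. This is an illustrative Python version in which 'H'/'A' flags who made each pass; the pass sequence is invented, not taken from match data:

```python
from collections import deque

def pass_difference(pass_makers, window=25):
    """Home-minus-away completed passes over the last `window` passes.
    `pass_makers` is the sequence of 'H'/'A' flags in pass order; positive
    values mean the home side made more of the recent passes (15 home and
    10 away of the last 25 gives +5)."""
    recent = deque(maxlen=window)
    diffs = []
    for side in pass_makers:
        recent.append(1 if side == 'H' else -1)
        if len(recent) == window:
            diffs.append(sum(recent))
    return diffs
```

For instance, `pass_difference(['H'] * 15 + ['A'] * 10)` returns `[5]`: a single value, since the 25-pass window only fills once.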

My idea is to graph this pass difference number throughout the entire game, and mark where goals occurred. My hypothesis was that a club would tend to score when they had had the majority of the last 25 passes. In other words, having possession would tend to increase a team's probability of scoring.

**RESULTS**

As it turns out, this hypothesis tends to be true. The key word here is definitely *tends*. Of course, there are always going to be goals where a team scores off a counter-attack. However, it seems that clubs that have dominated possession within the last 25 passes are likely to score, while teams that have not had much of the ball recently are less likely to score. While you may be thinking to yourself that this is nothing new, it does seem to go against results saying that possession does not matter. It seems that while *general* possession may not matter, *specific* possession has a big influence on scoring.

Below are some examples of the graphs I was looking at when making this analysis. A positive possession difference indicates that the home team has had the majority of passes within the last 25 passes, while a negative possession difference indicates that the away team has had the majority of recent passes. The vertical red lines indicate when a goal was scored; the ones starting at the midline and going up are goals scored by the home team, while the ones starting at the midline and going down are goals scored by the away team. Clearly, there are a few times when goals seem to have come when a team was losing the recent possession battle. For the most part though, it seems that my hypothesis above holds true.

**GRAPHS**

**Only 25?**

One criticism of this approach may be that the last 25 passes is too small a window. Admittedly, there is nothing special about 25; I just went with that number. To check the robustness of the results, I did the same analysis with the last 50 passes. It turns out that looking at the difference within the last 50 passes yields very similar results. Of course, the graphs are slightly different, but clubs still tend to score when they have had the majority of the recent possession.

**Conclusion**

Based on the graphs above, it seems that possession *does* matter, though at a much smaller level than what we normally look at. While the overall possession percentage within a game tends to be misleading, more specific possession analysis shows that clubs tend to score when they hold the majority of recent possession. If Stoke City is playing Arsenal, they are likely to be dominated in possession percentage as a whole; in that sense, possession is useless for analyzing the game. However, if we break possession down as above, it's likely that possession *is* important for Stoke, just in much smaller intervals within the game.

The idea for this post originally came from another blog post written by Chris Anderson (@soccerquant), the writer of the Soccer By the Numbers blog. In that post, Chris compares both the strength and imbalance of 6 of the top European leagues. You can read the post here. My idea was to expand upon this analysis using the extensive and accurate Euro Club Index data, while also looking at more European leagues. This analysis covers the top leagues of 10 different European countries and will be split into two posts. The first looks only at the top division of each of the 10 countries; the second, which will be posted later, will compare strength and imbalance within each country's league structure.

*Methodology*

Strength and imbalance are both tough to measure. While many measures exist, the Euro Club Index provides an accurate measure of the strength of each European club. To find the strength of the different European leagues, I averaged the index value of each club in the top division, with each value taken from mid-February of each year. This average represents the strength of each league. To measure the imbalance of the league, I calculated the Gini Coefficient of all the clubs in the country's top flight. The Gini Coefficient is basically a measure of dispersion (or in this case imbalance) that is usually used to measure the dispersion of income across a sample. Calculating it is somewhat complex so I won't go into it here, but if you're curious here is an explanation. A higher Gini Coefficient indicates more imbalance in the league.
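For the curious, the Gini Coefficient does have a short closed form over sorted values. An illustrative Python sketch with hypothetical club index values (not actual Euro Club Index numbers):

```python
def gini(values):
    """Gini coefficient of a list of positive values (0 = perfectly
    balanced, approaching 1 = highly concentrated), via the
    rank-weighted-sum formulation over sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical index values for a 4-club league.
balanced = gini([100, 100, 100, 100])  # 0.0
top_heavy = gini([400, 50, 50, 50])    # strength concentrated at one club
```

A perfectly balanced league scores 0, and concentrating strength in one club pushes the value toward 1, which is exactly how imbalance is read off the graphs below.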

Below is a graph representing both strength and imbalance of 10 different European leagues. The y-axis represents imbalance, with a larger y value indicating more imbalance. The x-axis represents strength, with a larger x value indicating more strength. Ideally, a league is both strong and balanced. These leagues are, hypothetically, the most exciting to watch.

From the graph, we can divide European leagues into 4 categories:

- Imbalanced and Weak: These leagues are theoretically the least exciting. They are likely highly predictable, and also not very strong. From the graph, we see that Ukraine, Portugal and Scotland are all of this type.
- Imbalanced and Strong: 2 countries fall into this category: England and Spain. From the graph we can see that, while the English Premier League and La Liga are both very strong, they are also highly imbalanced. This is something you might expect.
- Balanced and Weak: These leagues are less predictable, but also not that strong. Turkey and the Netherlands fall into this category.
- Balanced and Strong: This last category is the best of both worlds. The clubs in the league are very strong, but the league is also balanced. France, Germany and Italy all fall into this category.

Clearly, these are just general categories, as leagues are stronger or weaker to varying degrees. A lot of leagues also sit right on the edge of the average dividers, including Portugal, which is only slightly weaker than the average league.

*Historical Context*

After this analysis, I was curious to see how the imbalance and strength of the different leagues have changed over time. To answer this question, I graphed the imbalance and strength values of each of the 10 countries over the past 5 years. These graphs show us how the leagues have progressed. For the sake of reading the graphs more easily, I have split the 10 countries into two groups: the "top" 5 of the European countries (England, France, Germany, Italy and Spain) and the "bottom" 5 (Netherlands, Portugal, Scotland, Turkey, and Ukraine).

Below is the graph of the measure of league strength from 2008-2012. You can see that not much change has happened. England momentarily surpassed Spain as the top European league in 2011, but Spain has now retaken a narrow lead in 2012. Additionally, Germany and Italy have switched places in terms of strength. There has also been a slight decline in strength of the top leagues in the past few years. Spain and Italy have both declined somewhat, whereas England has remained relatively constant.

Next is the same graph as above, but for the "bottom" 5. This graph is even more stable than the one above: the only change is that Ukraine has surpassed Scotland in strength, and otherwise each league's strength level has seen only small fluctuations.

What about the progression of league imbalance? The graphs of imbalance levels over the past 5 years are shown below. First, France, Italy, and Germany have all kept a low level of imbalance over the past 5 years. On the other hand, England and Spain have had some interesting fluctuations. Spain has seen a consistent increase in imbalance over the past couple of years. England's imbalance has been slowly declining since 2008, which makes sense: the influx of money into the EPL quickly raised the level of imbalance, and it has been settling back down ever since. This year, however, the EPL reversed that decline and became more imbalanced. For the bottom 5 leagues, there hasn't been too much of a change in imbalance except for Portugal, which has seen a constant increase since 2008. In 2008, Portugal had the least imbalance of any of these European leagues; now, the Primeira Liga sits somewhere in the middle.

*Conclusion*

As Chris points out in his analysis, this kind of work could give clubs insights in to how transfers will fare moving between leagues. For example, can we predict how well a top player in Ukraine will play in Spain? Similarly, which is better: 10 goals in the EPL or 15 goals in the Eredivisie? These questions are near impossible to answer, but