Some of the players I’ve mentioned were top picks. Brian Carroll was selected 2nd overall in 2000. Taylor Twellman was selected 2nd overall in 2000. Other top players were chosen in the later rounds but went on to very successful careers: Chris Rolfe was taken 29th overall, and Davy Arnaud, selected 57th overall, scored 54 goals in his career.

Nikolas who?

On the flip side, there are a number of notable draft busts. Nikolas Besagno was selected 1st overall in 2005 and went on to play in only 8 games. Joseph Ngwenya went 3rd overall in 2004 to Salt Lake and scored 18 goals in his career, while Salt Lake passed over Clint Dempsey, Clarence Goodson and Michael Bradley.

This post aims to provide some context around the value of draft positions. This can be helpful for determining a fair trade (“Should I trade up to a higher selection?”) or looking at how clubs have performed in their draft selections (apparently the Rapids have done a pretty crappy job overall).

Club | GP | Mins | Goals | Years | Picks |
---|---|---|---|---|---|
Portland Timbers | 21.83 | 1715.33 | 3.17 | 3.00 | 2 |
Miami Fusion | 17.70 | 1544.61 | 0.24 | 3.67 | 9 |
MetroStars | 15.24 | 1179.28 | 1.60 | 4.74 | 35 |
Philadelphia Union | 17.00 | 1169.65 | 1.68 | 3.78 | 9 |
Toronto FC | 14.70 | 1110.70 | 0.91 | 2.70 | 20 |
Chivas USA | 15.06 | 1095.91 | 0.87 | 2.35 | 20 |
Houston Dynamo | 14.18 | 1018.77 | 1.09 | 4.12 | 16 |
Kansas City Wizards | 13.67 | 1005.15 | 0.97 | 3.88 | 57 |
Los Angeles Galaxy | 13.85 | 996.64 | 1.29 | 5.23 | 65 |
D.C. United | 13.31 | 984.91 | 0.94 | 4.11 | 57 |
F.C. Dallas | 13.95 | 970.85 | 1.15 | 3.73 | 33 |
New England Revolution | 12.68 | 950.84 | 1.40 | 3.50 | 62 |
Columbus Crew | 13.41 | 937.60 | 1.42 | 4.02 | 57 |
Real Salt Lake | 12.72 | 910.79 | 0.78 | 4.10 | 20 |
New York Red Bulls | 12.67 | 879.00 | 0.75 | 3.39 | 18 |
Chicago Fire | 12.61 | 855.73 | 1.14 | 4.19 | 68 |
San Jose Earthquakes | 11.84 | 844.45 | 0.62 | 4.50 | 48 |
Dallas Burn | 11.48 | 803.24 | 1.24 | 3.27 | 33 |
Colorado Rapids | 10.98 | 729.43 | 0.71 | 2.22 | 55 |
Vancouver Whitecaps FC | 9.07 | 588.27 | 0.20 | 3.75 | 4 |
Tampa Bay Mutiny | 8.25 | 445.35 | 0.70 | 1.54 | 13 |
Seattle Sounders FC | 8.53 | 438.61 | 0.71 | 3.17 | 12 |
Sporting Kansas City | 4.67 | 91.00 | 0.00 | 1.00 | 3 |

Relating to the second point, some clubs have done better than others in the draft historically, as can be seen in the table above. Just defining “better” is difficult. As I’ll explain later, in this post I’ve used a number of different metrics to define the value of a draft pick.

This post is not meant to try to predict which players will succeed and which ones will fail. I’m not trying to determine if Cyle Larin will be better than Khiry Shelton. Instead, I just want to look at how players have performed statistically by their rank in the draft. This way we can answer questions like “How many minutes should I expect a 3rd round pick to play in his career?” or “How many goals do I expect a forward selected 8th overall to score?” or “How many years can we expect a GK selected in the 1st round to play?” These are forward-looking questions with answers that are backwards-looking. However, I think it’s a reasonable assumption, or at the very least a good place to start, when trying to estimate the performance of draft picks into the future.

**DATA GATHERING**

In order to do this type of analysis, there was a bunch of data I needed to gather. All the scripts I used to collect these data are available in the repository I created on GitHub. Any updates I make going forward will be pushed to this repo. There’s a bunch of code in there that I’ve written to scrape data from the web, but for this post I’m focusing on the mls_draft folder and the mls_scrape.py file. I’m not going to go too deep into how I scraped the data, but if you want to learn more about the process then feel free to send me an email.

First, I needed the past SuperDraft selections by teams. These data were all available on Wikipedia. The code used to scrape the Wikipedia pages, much of which was based on this blog post, is in the mls_draft.py file in the mls_draft folder. Once the data were scraped and saved to a CSV file, the create_draft_stats.sql script loads them into the MySQL database I set up. With the draft data all set, I also needed the actual career statistics of all MLS players. The mls_scrape.py script handles this. There was a bunch more data cleaning that’s not very exciting (fixing player names, making sure the data is correct, etc.). Eventually I ended up with the necessary data all cleaned and looking pretty. If you see anything that’s incorrect in the data, let me know; most likely I missed some stuff.
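The scraping itself boils down to pulling the rows out of an HTML table. Here is a minimal, self-contained sketch of that step using Python’s standard-library html.parser; the table markup below is a toy stand-in (the real Wikipedia pages are fetched over HTTP and have messier structure):

```python
from html.parser import HTMLParser

class DraftTableParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()

# A toy stand-in for a Wikipedia SuperDraft table; the structure here is
# illustrative, not the actual markup.
html = """
<table>
  <tr><th>Pick</th><th>Team</th><th>Player</th></tr>
  <tr><td>1</td><td>Real Salt Lake</td><td>Nikolas Besagno</td></tr>
  <tr><td>2</td><td>Some Club</td><td>Some Player</td></tr>
</table>
"""
parser = DraftTableParser()
parser.feed(html)
print(parser.rows)
# [['1', 'Real Salt Lake', 'Nikolas Besagno'], ['2', 'Some Club', 'Some Player']]
```

The header row uses th cells, so it is skipped automatically; in practice you would also want to keep it to label the columns before writing out the CSV.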

In order to ensure that the results are reliable, I only looked at draft selections from 2011 and before. Draft selections from 2012 and after haven’t had enough time to come to fruition, so the data is messy and not very informative.

**ANALYSIS**

The first question I was interested in investigating concerned how often selections can be considered successful by round. In other words, if I have a 3rd round draft pick, what is the probability that the selection will be considered a success? As I alluded to earlier, defining success is difficult. Ultimately, I settled on a simple metric: minutes per year. The question of which threshold to choose is definitely an open one. In fact, there is really no right answer, so I tried a couple different values for determining a success.

I first looked at the median minutes per year. The idea is pretty simple: A selection is a success if they end up playing more minutes per year than 50% of draft selections.

Round | Success % |
---|---|
1 | 85.9% |
2 | 65.0% |
3 | 38.0% |
4 | 33.8% |
5 | 14.5% |
6 | 14.8% |

The median works out to slightly under 150 minutes per year. This actually seems like a pretty good threshold; assuming 34 games per year, that’s about 5 minutes per game. So really we’re saying a success in the draft is a player who ends up being at least a regular sub. The table above gives the percentage breakdowns for each round, i.e. what percent of players drafted in each round ended up playing more than 150 minutes per year on average.
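Computing these success percentages is a simple group-by. Here is a minimal sketch, using made-up minutes-per-year figures rather than the real scraped data:

```python
# Hypothetical minutes-per-year figures; the real values come from the
# scraped draft and career data described above.
picks = [
    {"round": 1, "mins_per_year": 2100},
    {"round": 1, "mins_per_year": 90},
    {"round": 2, "mins_per_year": 600},
    {"round": 3, "mins_per_year": 40},
    {"round": 3, "mins_per_year": 200},
]

def success_rate_by_round(picks, threshold):
    """Share of picks in each round above the minutes-per-year threshold."""
    totals, hits = {}, {}
    for p in picks:
        r = p["round"]
        totals[r] = totals.get(r, 0) + 1
        hits[r] = hits.get(r, 0) + (p["mins_per_year"] > threshold)
    return {r: hits[r] / totals[r] for r in totals}

print(success_rate_by_round(picks, 150))
# {1: 0.5, 2: 1.0, 3: 0.5}
```

Swapping the threshold (150 vs. 400) is just a different argument, which is how the two tables in this post differ.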

An astute observer might say that teams are expecting more than 5 minutes per game from their selections. In fact, the density plot above suggests another possible threshold for defining success. There seems to be a kink right around 400 minutes, marked in the plot below.

Round | Success % |
---|---|
1 | 75.0% |
2 | 49.7% |
3 | 27.2% |
4 | 22.8% |
5 | 10.9% |
6 | 11.1% |

At that mark the density plot flattens out. Let’s use the 400 minute threshold instead. The table above gives the percentage breakdowns for each round with the new threshold of 400 minutes per year on average.

Whatever threshold is used, there is a clear picture of how picks tend to perform based on round. The 1st round is clearly the best and with each subsequent round the probability of being deemed a success decreases. While I’m clearly not going to win any awards for that conclusion, it is at least interesting to see exactly how much the likelihood of a player becoming a success decreases. Interestingly enough, there doesn’t seem to be much of a difference in the success probabilities of the 5th and 6th rounds, although there is a good chance this is due to a small sample size, considering a lot of years the draft has had fewer than 5 rounds (in fact, there were only 3 in 2011 and 2 in 2013 and 2014, although I only use years 2011 and before for this analysis).

We can do even more interesting analysis with these data. It’s possible to model the expected value of various statistics as a function of the pick number. Even more, we can break this down by position and answer a bunch of different questions. If a team selects a forward 10th overall, how many goals should we expect them to score in their career? What about the expected number of minutes per year for a defender selected 25th overall?

To answer these questions, I use LOESS. I used LOESS previously to smooth out the raw outcome probability curves for the outcome probability calculator. LOESS, which kinda sorta stands for LOcal regrESSion, is useful here for a couple of reasons. First, it’s simple. It’s as easy to use as a linear regression. Second, it’s non-linear and is useful for smoothing out curves. The curves I’m plotting and looking to model are a bit wacky, and LOESS is a simple way to get an accurate representation of the data. The biggest drawback is probably that it is non-parametric, which means that you can’t use an equation to model the data. But this isn’t the end of the world in this case.
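To make the idea concrete, here is a bare-bones, pure-Python version of the core LOESS step: a tricube-weighted local linear fit at a single point. This is a didactic sketch of the technique (a real analysis should use a library implementation such as R’s loess or statsmodels’ lowess), and the toy data below is invented:

```python
def loess_at(x0, xs, ys, frac=0.5):
    """Estimate y at x0 with a tricube-weighted local linear fit over the
    nearest `frac` share of the data; a bare-bones version of LOESS."""
    n = max(2, int(frac * len(xs)))
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:n]
    h = max(abs(xs[i] - x0) for i in nearest) or 1.0
    w = {i: (1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest}
    # Weighted least squares for y = a + b*x on the local window.
    sw = sum(w.values())
    mx = sum(w[i] * xs[i] for i in nearest) / sw
    my = sum(w[i] * ys[i] for i in nearest) / sw
    sxx = sum(w[i] * (xs[i] - mx) ** 2 for i in nearest)
    sxy = sum(w[i] * (xs[i] - mx) * (ys[i] - my) for i in nearest)
    b = sxy / sxx if sxx else 0.0
    return my + b * (x0 - mx)

# Toy data: expected minutes falling off with pick number, plus noise.
xs = list(range(1, 21))
ys = [11000 - 450 * x + (250 if x % 3 == 0 else -150) for x in xs]
print(round(loess_at(10, xs, ys)))
```

Like the real thing, this only gives fitted values at the points you query; there is no closed-form equation to write down, which is exactly the non-parametric drawback mentioned above.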

Below is a plot of overall pick number versus total minutes played. The clearest takeaway is that there isn’t much of a pattern to the data. Yes, there is a general trend downwards as the pick number increases, as we can see in the LOESS curve in blue. But even “general” might be too strong a word here. There is really a lot of noise around that curve.

The same lack of a trend is evident if we look at the pick number versus number of goals for Midfielders and Forwards only. It’s actually even worse, as the tail on the distribution of goals is very long. In fact, the distribution of goals scored seems to follow a Power Law (another plug, apologies). Seeing a lack of a relationship between pick number and both minutes and goals scored is honestly not that surprising. It’s very difficult to predict a player’s success based on their play from before the draft. It’s important to keep that in mind for the analysis ahead.

Using these LOESS curves, it’s possible to determine some “fair” draft pick trades. Let’s say you’re in a club’s front office and you have the first pick in the draft. What is a trade you can make that would be equitable, at least in terms of historical data? The expected minutes for the 1st overall pick is 10,844. If you were offered both the 11th (6,407 expected minutes) and 18th (4,390 expected minutes) overall picks, you’d expect a total of 10,797 minutes, which seems like a pretty fair trade.

What about if you’re looking for a goal scorer? Let’s say this year you have the 5th overall pick. The expected goals for a Midfielder/Forward for 5th overall is about 16. For 13th overall it is 10.3 and for 25th overall it is 5.6, yielding a total of around 16 also.

This could be applied to trades of draft picks for current players, all other factors aside. With the 15th overall pick you would expect 5,123 minutes. So if you think the current player you are trading your 15th overall pick for is worth about that, you should be willing to make the trade.
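Trade evaluation with these expected-minutes figures is just addition plus a tolerance check. Here is a sketch using only the pick values quoted in this post; the 5% fairness tolerance is my own arbitrary choice:

```python
# Expected career minutes by overall pick, taken from the LOESS model
# figures quoted above (only the picks mentioned in the post are filled in).
expected_minutes = {1: 10844, 11: 6407, 15: 5123, 18: 4390}

def trade_value(picks):
    """Total expected minutes for a package of overall picks."""
    return sum(expected_minutes[p] for p in picks)

def is_fair(give, get, tolerance=0.05):
    """A trade is 'fair' if the packages are within `tolerance` of each other."""
    v_give, v_get = trade_value(give), trade_value(get)
    return abs(v_give - v_get) <= tolerance * max(v_give, v_get)

print(trade_value([11, 18]))     # 10797
print(is_fair([1], [11, 18]))    # True: 10844 vs 10797
```

The same comparison works for the goals-by-pick curves; only the lookup table changes.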

There are a bunch more ways to break these types of trades down. Instead of going through more examples like above, I’m just going to post the results from the LOESS models for both minutes and goals by pick number below. Feel free to download and play around with the data.

Minutes model | Goals model

**CONCLUSION**

With all this analysis in mind, I think it’s important to go back to a point I made at the beginning of this post. Nowhere here am I trying to make predictions about specific players. Instead, this is just an overall look at how draft picks have fared. Going back to my original example, if you think Khiry Shelton is going to break MLS scoring records, whether because he’ll fit into your system well, or you’re smarter than everyone else, or he’ll play well in NYC, or whatever else, then you should value him more highly than my model predicts based on his draft pick. That being said, I hope this post at least serves as a solid starting point, or at least enters the discussion, when valuing draft picks. The model is simplistic, but it provides good context around the expected value of draft picks.

If you want to see how I created the models, check out this post.

And if you haven’t seen the Economist blog post from a couple weeks back comparing Messi to Ronaldo using the data, read it here.

A lot of people have reached out to me asking for the data or have been trying to manually gather it from the applet. If you’re interested in using the data then just reach out to me at soccerstatistically@gmail.com and I’d be happy to send all the raw data to you, provided you reference this blog when you use it.

Finally, now that the calculator is fixed I can focus on some other work I’ve been doing. I’ve admittedly been absent from posting here for a while. I have a few posts I’ve been working on recently, so expect some new stuff coming soon...

To see how various (FIFA-defined) continents have done compared to past World Cup results, I used past World Cup data collected from 11v11.com (here is an example from the United States’ page: http://www.11v11.com/teams/usa/tab/stats/comp/978). These results include all World Cup and World Cup qualifying games, which is what I limited my analysis to. World Cup qualifying games are a little different than World Cup games, but considering they are almost always between countries in the same continent, I think it’s OK because I drop intra-continent games anyway. What defines a continent is pretty hazy, so I just stuck with FIFA’s definitions. This means that Australia is actually a part of Asia, among other anomalies. This division of the world is the best way to stay consistent, though. The continents I ended up using were Africa, Asia, CONCACAF, Europe, Oceania and South America.

If you want to look at the code I wrote to do the analysis (the data scraping, the actual analysis, and the visualization), head over to https://github.com/fordb/wc-continent-headtohead.

There’s nothing too crazy going on in the analysis, just a lot of graphs to look at.

Each one shows how each individual continent has historically fared against every other continent. I focused on goals against (GA) per game and goals for (GF) per game because these can be compared no matter how many games have occurred between two continents. GA and GF are also a little more representative than wins, losses and draws; Nate Silver preferred GF/GA for this reason when he created his Soccer Power Index.
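The aggregation behind these graphs can be sketched as follows: drop intra-continent games, then tally goals and games for each ordered (continent, opponent) pair. The match records below are invented placeholders for the scraped 11v11.com results:

```python
from collections import defaultdict

# Hypothetical match records: (continent_a, continent_b, goals_a, goals_b).
matches = [
    ("Europe", "CONCACAF", 2, 1),
    ("CONCACAF", "Europe", 1, 3),
    ("Europe", "Europe", 1, 1),        # intra-continent: dropped
    ("South America", "CONCACAF", 2, 1),
]

gf = defaultdict(int)     # goals for, keyed by (continent, opponent)
games = defaultdict(int)  # games played, same key

for a, b, ga, gb in matches:
    if a == b:
        continue  # drop intra-continent games, as in the post
    gf[(a, b)] += ga
    gf[(b, a)] += gb
    games[(a, b)] += 1
    games[(b, a)] += 1

gf_per_game = {k: gf[k] / games[k] for k in games}
print(gf_per_game[("CONCACAF", "Europe")])  # (1 + 1) / 2 = 1.0
```

GA per game falls out of the same tallies, since one side’s goals for are the other side’s goals against.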

From these, it’s clear that Europe and South America have historically dominated. Oceania is a little bit of an outlier just because of how few countries from there qualify for the World Cup. It seems like Oceania teams have dominated CONCACAF teams in what limited games have occurred between them. We probably can’t infer much from the small sample size, though.

CONCACAF teams have historically scored just under 1.5 goals per game against European teams, and closer to 1.4 goals per game against South American teams. This compared to giving up 2 and just over 2 goals per game, respectively.

How does this compare to this year’s World Cup? To compare I made another set of graphs, this time comparing each continent’s historical GF per game/GA per game vs. their GF per game/GA per game in this World Cup.

CONCACAF countries did not actually score that much more compared to the past. They fared a little better against African and South American countries, but were about level with their past results against European countries. However, their GA per game dropped significantly this World Cup. CONCACAF’s (relative) success this World Cup came from better defense. This isn’t that surprising, especially with Costa Rica.

European countries really struggled scoring goals compared to their historical average, while staying about the same in GA per game.

South America has been dominant and continued to stay dominant. The only drop they had in scoring was against CONCACAF countries, which is what I mentioned earlier.

Asian countries stayed about on par with their previous performances, although this was already fairly poor.

African countries gave up more goals, but were about on par with their past goal scoring.

It’s important to keep in mind that while these graphs are interesting, the results from this World Cup should be taken with a grain of salt. There are not a large number of games, which means that we might see some strange goals for or goals against averages. So while something may seem to be a big change from previous results, it could just be from the fact that there were only 1 or 2 games between countries from two continents.

While I am not really interested in betting on soccer myself, odds do provide an interesting estimate of the probability of an outcome occurring. For example, take Arsenal's home game against Chelsea this past year. Bet365 put the odds of an Arsenal victory at 2.38. These decimal odds imply that they expect the probability of an Arsenal victory to be about 42%. Taking into account that the odds makers usually lower the payouts so that they make money, the adjusted probability of an Arsenal victory is just over 41.1%.
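The arithmetic here is straightforward: the raw implied probability is the reciprocal of the decimal odds, and dividing each outcome's raw probability by their sum strips out the bookmaker's margin. A sketch in Python; the draw and away odds below are made-up round numbers, not the actual Bet365 line for that game:

```python
def implied_probability(decimal_odds):
    """Raw implied probability from decimal odds."""
    return 1 / decimal_odds

def adjusted_probabilities(home, draw, away):
    """Normalize out the bookmaker's margin (the 'overround'): the raw
    implied probabilities sum to more than 1, so divide each by the total."""
    raw = [1 / home, 1 / draw, 1 / away]
    total = sum(raw)
    return [p / total for p in raw]

# 2.38 is the Arsenal home-win price quoted above; 3.4 and 3.1 for the
# draw and away win are hypothetical.
print(round(implied_probability(2.38), 3))   # 0.42
probs = adjusted_probabilities(2.38, 3.4, 3.1)
print(round(probs[0], 3))                    # a bit above 0.40
```

With the real draw and away prices for that match, the adjusted home probability comes out just over 41%, as quoted above.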

This is all pretty standard stuff. The odds for relatively evenly matched games like the one above are probably pretty accurate, or at least more accurate than your average person. But what about significant underdogs? What about City against Cardiff? These are a little more difficult to assess. It's clear that Cardiff is an underdog in this game, but how much of an underdog? And do odds makers do a good job of assigning implied probabilities to these lopsided games?

To look into this, I defined games with a significant underdog as games where one outcome has an implied probability of occurring under 15%. This could be either a home win, an away win, or a draw. If the odds makers are effective at predicting underdog outcomes, we should see underdog outcomes (as I define them) occurring under 15% of the time, since the implied probability could be anywhere from just above 0% to 15%.

I used data from Football-Data.co.uk and took the Bet365 odds for the past 5 seasons of the Premiership. After limiting the data to significant underdogs as defined above, I ended up with 643 games with a significant underdog. Of those, 110 had a home team underdog, 441 had an away team underdog, and 92 had a draw underdog. These splits make sense: away teams are underdogs more often than home teams due to home field advantage.

Next I calculated what percentage of underdog games resulted in the underdog outcome actually occurring. As mentioned above, these percentages should be under 15% if the odds makers are doing a good job. Below is a graph showing these percentages over the past 5 years, split into home, away, and draw underdogs.
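Here is a sketch of that calculation. The rows mimic the shape of Football-Data.co.uk files, where B365H/B365D/B365A are the Bet365 decimal odds and FTR is the full-time result, but the values themselves are invented:

```python
# Toy rows in the shape of Football-Data.co.uk data; all values invented.
rows = [
    {"B365H": 8.0, "B365D": 4.5, "B365A": 1.4, "FTR": "H"},
    {"B365H": 7.5, "B365D": 4.2, "B365A": 1.45, "FTR": "A"},
    {"B365H": 1.3, "B365D": 5.0, "B365A": 9.0, "FTR": "H"},
]

def underdog_hit_rate(rows, side, cutoff=0.15):
    """Among games where `side` ('H', 'D' or 'A') is a significant underdog
    (implied probability < cutoff), how often did that side actually win?"""
    col = {"H": "B365H", "D": "B365D", "A": "B365A"}[side]
    dogs = [r for r in rows if 1 / r[col] < cutoff]
    if not dogs:
        return None
    return sum(r["FTR"] == side for r in dogs) / len(dogs)

print(underdog_hit_rate(rows, "H"))  # 2 home-underdog games, 1 win -> 0.5
```

Running this per season and per side gives the percentages plotted below.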

The red dashed line is the 15% cutoff. For draw and away underdogs, we see about what we would expect: the percentage of underdog outcomes occurring fluctuates, but it remains under the 15% line. For home underdogs, though, there is a different story. In each year besides 2012, home underdogs actually won more than 15% of the time, in some cases significantly more. This seems to indicate that odds makers are underestimating home field advantage when weaker teams play stronger teams.

This evidence seems to indicate that there is an inefficiency present, specifically in the odds of underdog home teams. However, there are still a few caveats worth mentioning. First, there were relatively few home underdog games in the past 5 years, which may be influencing the results somewhat. Second, odds makers lower payouts slightly so that they can make money. This means that if you find an inefficiency like the one above, it has to be large enough to overcome the built-in advantage that odds makers have when they set the odds. For the first 3 years, this does not seem to be the case for home underdogs.

Overall though, it seems like there could be something here. But don't blame me if this betting strategy doesn't work. Who knows if it will persist into the future or not.

I think the reason for this lack of progress in the soccer analytics community is threefold:

First, it is very hard to get across new ideas and specific developments in a panel. Panels are not conducive to discussions of new research and developments. The panel was able to give a general view of soccer analytics, one in which we step back and look at progress from a distance. If you think you've heard this general view repeatedly over the past few years, you're not alone. To learn truly new ideas at SSAC, one has to attend the research paper presentations.

Second, not very much has actually been done with soccer analytics in the past 2 years. You may argue with me, and of course some things have been done. But in comparison to the developments in basketball, the progress has been slight. Compared with just 2 years ago, the NBA is now heavily influenced by analytical findings. Daryl Morey, the general manager of the Houston Rockets, talked about how teams are starting to shoot a lot more corner 3's and score a lot more points from high-percentage layups and dunks inside the paint. This shift largely stems from analytical work done in basketball. There has been no comparably large influence on how soccer is played (at least that I can tell) in the past couple of years.

Third, soccer is genuinely harder to break down analytically than other sports. Baseball can be broken down into individual at-bats; basketball and football can be broken down into individual possessions. Soccer, though, is one fluid game with no easy way to break it down. People have been preaching this for as long as I can remember.

The thing is, I'm cheating a little bit here. What value does pointing out problems have if I don't offer any solutions? If soccer wants to come out of its analytical dark age, we need to figure this problem out. I'm not saying it's easy. If it were, I'd have some brilliant idea. But I don't. I'm just as stuck as most people seem to be.

A lot of this problem comes from the stigma that statistics somehow degrade soccer. Fans think that analytics takes the beauty out of soccer. At SSAC, the legendary Edward Tufte had a perfect quote: "Analytics doesn't take the beauty out of sports; it takes the stupidity out of it."

If data are going to truly influence the sport of soccer at the professional level, basketball is a brilliant lead to follow. The revolution started with individuals, bloggers and armchair analysts. Eventually people started to see that there were inefficiencies to exploit, and these individuals started to figure out how to exploit them. Finally, forward-thinking team executives caught on and started to change the way they played based on the analysis. Think Billy Beane, Daryl Morey, Theo Epstein, etc. Once other teams saw that this analysis was working, others started to imitate.

Daryl Morey, General Manager of the Houston Rockets, at SSAC 2013

There is no doubt in my mind that inefficiencies exist in soccer that can be exploited with statistics. The most exciting part is that few of these inefficiencies have been found yet. Or if they have been found, they haven't been exploited. These low hanging fruit are waiting to be picked. And once we figure out how to gain a significant enough edge for team executives to notice, they'll catch on and start adapting. This adaptation may be a little slower in countries other than the United States. For whatever reason the US seems to be much more open to statistics in sports.

What soccer needs is an Oakland Athletics. Or a Houston Rockets. I have no idea when this will be. It may already be happening, or it may take 10 years. Beyond that, we need to prove the worth of analytics in soccer: clubs (understandably) will not invest in analytics unless they think real value can be gained from it. I don't think we're there yet, but we're on our way.

But when one club eventually punches well above their weight as a result of the edge they have from analytics, other clubs are going to start to notice. And after that, the dominoes are going to fall into place and the game will probably start to change, just the way it has in basketball in the past 2 years and in baseball before that. The seeds can be planted by individuals. But nothing will grow until there's effective adoption by clubs.

**Data**

To create a model that gives the odds of win, draw, and loss for a club depending on the venue of the game (home/away), the minute, and the goal difference in the game, I used over 4000 EPL games stretching back to 2001. This is an improvement over the older Outcome Probability Calculator, which only used 1000 games.

**Methodology**

Even though there are over 4000 games to base the model on, there are inevitably still some game scenarios that do not occur often enough to get reliable results. There are not many games in the past data where the away team went up by 3 in the first 5 minutes. Because of this, when I plotted the outcome probability against the minute of the game, the line is not exactly smooth. To make the results a little more reliable and hopefully more consistent with actual games, I applied LOESS to smooth the lines. This method has some positives and negatives: it does a good job of smoothing the data, but it is non-parametric, so it cannot be expressed as an equation the way a standard regression can. Below are two plots, one using LOESS and one that is just the raw data.
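Before any smoothing, the raw curves are just empirical frequencies: for each game state, count how often each final result followed. A minimal sketch with invented snapshots (the real model also conditions on venue):

```python
from collections import defaultdict

# Hypothetical in-game snapshots: (minute, home goal difference, final
# result). The real model draws these from 4000+ EPL games.
snapshots = [
    (30, 1, "home_win"), (30, 1, "home_win"), (30, 1, "draw"),
    (30, 0, "draw"), (30, 0, "away_win"),
]

counts = defaultdict(lambda: defaultdict(int))
for minute, diff, result in snapshots:
    counts[(minute, diff)][result] += 1

def raw_outcome_probs(minute, diff):
    """Empirical P(result | minute, goal difference): the jagged curves
    that the LOESS step then smooths."""
    c = counts[(minute, diff)]
    total = sum(c.values())
    return {r: c[r] / total for r in ("home_win", "draw", "away_win")}

print(raw_outcome_probs(30, 1))  # home_win ~ 0.67, draw ~ 0.33, away_win 0.0
```

States with few snapshots (like the away team up 3 after 5 minutes) produce wild estimates, which is exactly why the smoothing step matters.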

**Shiny**

Best of all, I updated the application interface. RStudio's Shiny lets you make simple web applications in R. The Shiny interface makes the application a little more functional. The Outcome Probability Calculator now displays a graph similar to the one above with a marker for the specific minute chosen. It's interesting to mess around and explore different game scenarios.

**Next Steps**

Next I am planning on updating the Expected Points Added application with the new data. I'll also hopefully calculate this for every player in the past 10 years of English Premier League football. It should be interesting to see how this has fluctuated from year to year for specific players.

Here are some of my thoughts:

I really like the emphasis put on the role that luck plays in soccer. People are often inclined to try to use data to explain everything that happens in soccer. Of course, this is impossible. There are always going to be events that simply cannot be predicted, and I think this is crucial to keep in mind going forward. Data can give you an edge, but it is never going to be the only factor that leads clubs to championships or promotion. To use an example from the book, no model is ever going to predict a beach ball coming into play and deflecting a shot into the net. This is a glaring example of randomness sneaking into the game to determine a result, but there are countless other instances in every match that affect results.

A fact I found interesting: 99 percent of the time players didn't touch the ball, and 98.5 percent of the time they ran without it (page 143). Analysts have focused almost all of their time on the events that occur when the ball is at a player's feet. That's focusing on only 1% of a player's input in the game. While these are the most important and easy to record parts of the game, it is interesting that we ignore almost 99% of a player's activities when looking at their performance. Ignored is the work a player does to get in the right position defensively, or the right angle to receive a pass, or the right space to be able to take a shot. These are extremely difficult to analyze and keep track of, but they are likely just as important as the events that occur when a player actually does have the ball.

The part of the book that will have the greatest influence on how clubs behave, at least in my opinion, is the weak link versus strong link analysis (page 218). Basically, the analysis in the book says that improving a club's weakest player is more beneficial than improving a club's strongest player. This is a huge finding, and completely goes against what is "known" in soccer. Instead of signing the big name striker for way too much money, simply upgrading your weak left back to a stronger one can improve your club for a lower cost. Of course, this is not going to be the case every time, but their analysis provides strong evidence that this is the case a lot of the time. If I were employed by a club to analyze signings, this is the first idea I would bring up to improve the club.

Overall, I think Chris and David do a good job of making the book easy to read and entertaining. I assume it would be all too easy to make the analysis complicated (they are Ivy League professors, after all). However, they do a good job of weaving a narrative while also offering a lot of very good insights. That being said, the future of soccer analytics will likely be a bit more complicated. As they point out themselves in Forecast 4 on page 304, mathematical tools like algebraic geometry and network analysis are going to provide the base of most insights going forward. Instead of ball events, it is going to be the more complicated analysis of off-the-ball events like spacing and positioning. These off-the-ball events make up the 99% of the game that is generally not recorded or analyzed, as mentioned before. These are likely going to be the events that penetrate deepest into the flowing, dynamic, non-stop nature of soccer.

If you haven't yet read it and you're reading this blog, you should probably pick up a copy and read it.

*Data*

I had a hunch that this would not be the case. Specifically, my guess was that there would be more goals scored between the 85th and 90th minutes, whereas there would be fewer in the first 5 minutes of the game. To test this hypothesis, I used data from the Rec.Sport.Soccer Statistics Foundation page from 8 years of the Premiership.

*Methodology*

At first I looked at the number of goals scored in each minute of the game across the 8 years of my data. However, it was clear that this breakdown was too granular; the variability was high because there just wasn't enough data to break it down into each individual minute. To solve this, I aggregated the data into intervals of 5 minutes. This way the data is less specific, but it gives a clearer picture of what is going on. Below is the graph that I came up with:

As you can see, the x-axis gives the end of the time interval. In other words, the bar over 50 on the x-axis represents the number of goals scored in the 45-50 minute time range. The y-axis is the frequency, or number of goals in that time range.
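The binning itself is simple: map each goal's minute to the end of its 5-minute interval and count. A sketch with invented goal minutes, using the same end-of-interval labeling as the graph:

```python
import math
from collections import Counter

# Hypothetical goal minutes; the real data covers 8 Premiership seasons.
goal_minutes = [2, 7, 44, 45, 46, 50, 88, 89, 90, 90]

def bin_goals(minutes, width=5):
    """Count goals in 5-minute bins, labeled by the bin's end minute to
    match the graph's x-axis (so minute 46-50 lands in the '50' bin)."""
    bins = Counter()
    for m in minutes:
        end = math.ceil(m / width) * width
        bins[end] += 1
    return dict(sorted(bins.items()))

print(bin_goals(goal_minutes))
# {5: 1, 10: 1, 45: 2, 50: 2, 90: 4}
```

Widening `width` trades granularity for stability, which is exactly the trade-off described above.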

*Confidence Interval*

There are also 3 lines on the graph. The middle line gives the mean number of goals across all the time intervals. The top line is the mean number of goals across all the time intervals plus the standard deviation of the number of goals. The bottom line is the same thing but subtracting one standard deviation. I added these to the graph because they provide nice reference points to help determi