An Analysis of the Performance of Promoted Clubs

October 11, 2011 by Ford Bohrmann

Joey Barton, of newly promoted QPR

One thing I love about English football that does not exist in American sports is promotion and relegation. It makes not just the race for first exciting, but also the race to avoid relegation. In American sports, last place teams often simply give up, a disappointment for fans.

I wanted to see exactly how promoted/relegated teams fared throughout the season. Some statistical research has already been done on the subject: Omar Chaudhuri, writer of the 5 Added Minutes blog, looks at conversion rates of promoted teams and their corresponding ability to stay in the top flight here. In part 1 of this post, I look at how promoted teams have done in their first season in the top flight. My original idea was that teams may struggle early in the season to adjust to the higher level of competition, and eventually even out as the weeks go on and the teams adjust. This also puts the performance of QPR, Swansea, and Norwich into perspective with past promoted teams' performances. I use data from promoted teams from the 2003/2004 to 2010/2011 seasons.

I've created 5 graphs to illustrate the performances of promoted teams. The first one, below, shows how all the promoted clubs' point totals have progressed over the 38 games. On average, promoted teams earn around a point per week. The greenish linear-looking line in the middle is the average. All the other jagged lines are the point totals over the season of promoted clubs. This graph isn't too informative, but is an interesting graphic nonetheless.

The next graph is the same as the one above, but only looks at the three promoted clubs this season in comparison to the average points line and the linear points line. To clarify, the linear line illustrates what would happen if a team earned the same number of points every week and ended up with the average point total for promoted clubs. The average line shows the average points promoted clubs had actually earned by each week of the season. These may sound the same at first, but I will show in the next couple of paragraphs that there is an important distinction. Anyways, the graphic below illustrates that all 3 promoted clubs are faring about as well as the average promoted team does. QPR started off a little stronger, but has since returned to the average. Norwich and Swansea both started a little weaker, but have improved to end up just above the average 7 weeks into the season. All 3 teams have 8 points so far, just above the point per week average of promoted teams.
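To make the difference between the two lines concrete, here is a minimal Python sketch. The function names and the toy point totals are assumptions made up for illustration; they are not the data behind the graphs.

```python
def average_line(seasons):
    """Mean cumulative points across clubs after each game.

    `seasons` is a list of per-club lists, where each inner list holds
    that club's cumulative points after game 1, 2, 3, ...
    """
    n_games = len(seasons[0])
    return [sum(club[g] for club in seasons) / len(seasons) for g in range(n_games)]


def linear_line(seasons):
    """Straight line from zero to the average final point total.

    This is what a club would look like if it earned the same
    number of points every single week.
    """
    n_games = len(seasons[0])
    final_average = average_line(seasons)[-1]
    return [final_average * (g + 1) / n_games for g in range(n_games)]


# Hypothetical 4-game "seasons" for two clubs (cumulative points).
toy_seasons = [
    [0, 1, 4, 4],
    [3, 3, 3, 6],
]
avg = average_line(toy_seasons)
lin = linear_line(toy_seasons)
```

Both lines finish at the same point total, but the average line can sit above or below the linear line along the way, and that gap is what shows when promoted clubs tend to pick up their points.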

Another way of looking at the first graph is by looking at points per game of promoted teams. The graph below shows this. Obviously, early in the season clubs' points per game totals are quite spread out. As the season progresses, teams settle at an average of about 1 point per game, as mentioned above. Some clubs have done a little better, and some a little worse, as is evident from the graph.
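For anyone who wants to build this kind of points per game line themselves, here is a small sketch. It assumes results are coded as "W", "D", and "L" and uses the standard 3 points for a win and 1 for a draw; the sample results are made up.

```python
POINTS = {"W": 3, "D": 1, "L": 0}


def running_points_per_game(results):
    """Points per game after each match, given results like ['L', 'D', 'W']."""
    total = 0
    ppg = []
    for games_played, result in enumerate(results, start=1):
        total += POINTS[result]
        ppg.append(total / games_played)
    return ppg


# Hypothetical start to a season: loss, draw, win, loss.
example = running_points_per_game(["L", "D", "W", "L"])
```

Early values jump around a lot because one result is a big share of the games played, which is why the lines are so spread out at the left of the graph.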

Next is the graph above, but again looking at the performance of the 3 promoted teams this season. Again, the graph shows that QPR started off the campaign a little stronger, but has since regressed to be even with Norwich and Swansea.

The final and most informative graph shows the cumulative points per game of promoted clubs. This graph answers my question of how promoted teams fare throughout the season. As you can see below, promoted teams seem to struggle up until week 7, where they turn it around and do better than their average point total up until around week 20, where they hover around the point per game mark until the end of the season. There could be a lot of explanations for this trend. Maybe clubs struggle at first, and then adjust to the higher competition? Maybe clubs' transfer window acquisitions (think QPR) start to pay off around week 7? It is tough to tell what factors are really driving the trend. However, the graph does highlight the interesting phenomenon.

I'm still working on doing a similar analysis of clubs that are relegated at the end of the season to analyze how their performance fluctuates throughout the season.

tags EPL, promotion, relegation, vi

Expected Points Added (EPA) Leaders Through Week 3

August 30, 2011 by Ford Bohrmann

Below are the Expected Points Added (EPA) leaders for the EPL through week 3. The week 1 leaders can be found in an earlier post here. To reiterate, EPA weights goals based on how important they are to the team's chance of winning the game. This is based on the notion that a go ahead goal in the 90th minute is worth more than the 5th goal in a 5-0 win.

Some interesting things to point out...

While Rooney has 5 goals this season, Welbeck's 2 goals have actually been more beneficial to United. In fact, Rooney doesn't even make the top 15 list above considering most of his goals were in the recent Arsenal blowout.

Dzeko gets to the top of the list by scoring frequently and in important situations. His average goal weight is a solid .51 expected points added, and the fact that he has scored 6 goals is what puts him at the top.

It's still early in the season. Arteta makes third on the list with only 1 goal (a late game winning goal). Soon we'll start to see the top dominated by players who have scored a lot, and in important situations.

tags expected points added, EPL

Expected Points Added (EPA) Data Through EPL Week 1

August 21, 2011 by Ford Bohrmann

Before the season I promised to post Expected Points Added (EPA) totals after each week of the season. Here are the EPA totals from week 1. If you don't know what EPA is, check out a full explanation here.

To summarize it very basically, EPA is the total measure of how much each player's goals add to their team's expected points total. That is why you see some EPAs of 0 below. These players scored goals that added nothing to the team's expected points total. For example, a team is up 3-0 and a player scores a 4th in the 90th minute. This does not technically add to the team's chance of winning, because the team is already very likely to win.

Average Goal Weight (AGW) is just EPA divided by the number of goals a player has scored. This measures how important, on average, a player's goals are. It can show us that a player consistently scores clutch goals (high AGW) or that they are scoring useless goals in blowouts (low AGW).
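As a quick illustration of how EPA and AGW relate, here is a minimal sketch. The function names and the goal weights are hypothetical; each weight stands for the expected points a single goal added.

```python
def expected_points_added(goal_weights):
    """EPA is the sum of the expected points added by each goal."""
    return sum(goal_weights)


def average_goal_weight(goal_weights):
    """AGW is EPA divided by the number of goals (0.0 if no goals)."""
    if not goal_weights:
        return 0.0
    return expected_points_added(goal_weights) / len(goal_weights)


# Hypothetical player: one clutch goal, one blowout goal, one in between.
weights = [1.2, 0.0, 0.3]
epa = expected_points_added(weights)
agw = average_goal_weight(weights)
```

A player with many goals can still have a low AGW if most of the weights are near zero, which is the "useless goals in blowouts" case.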

Dzeko has the highest EPA from his go ahead goal in the 57th minute. This equated to a little more than a point for City. Klasnic, Muamba, and Silva all scored goals that added no expected points for their team.

If you have any questions feel free to ask in the comment section. I'll be super busy this week between moving in to my apartment at school and 3-a-days for preseason but I'll try to keep some posts coming.

tags expected points added, average goal weight, EPL

Refining The Win Probability Statistic

August 03, 2011 by Ford Bohrmann

Last year I was planning on going to the Sloan Sports Conference but ended up not being able to make it. I was thinking about it again this year, and I decided it wouldn't be a bad idea to submit something for this year’s conference. At first I wasn’t going to, but why the hell not? Might as well go for it, I guess.

My win probability added statistic has generated some interest among readers, and I think it gives some pretty interesting insight, so I’ve been working on expanding it. If you have no idea what win probability added is, check out my first post on win probability and another on win probability added. Anyways, thus begins my quest to refine and expand the win probability added statistic for submission to the sports conference. Comments, criticisms, and suggestions are very much appreciated and would help a lot in making it better.

The first change I made was to the name, and it comes from a simple fix. The problem with “win probability added” is that it doesn’t necessarily calculate the win probability added. That’s a little bit problematic. For example, if two teams are tied in the 90th minute, the win probability under my old calculations was .333 for both teams. This doesn’t really make sense, because each team has close to a 0% chance of winning the game, not 1/3. This comes from modeling the statistic after the similar calculation in professional baseball. My fix for the problem is extremely simple: multiply all the values by 3. This changes the statistic from win probability added to expected points added. It makes much more sense now. If a player scores a go ahead goal in the 90th minute, the Expected Points Added (easier to write EPA from now on) is going to be almost 2. If a player scores a tying goal in the 90th minute the EPA would be almost 1. Much simpler and easier this way (originally got the idea from @11tegen11’s similar analysis).
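One way to read the rescaled number is as league points: 3 for a win, 1 for a draw. The sketch below is only an illustration of that reading, not the actual spreadsheet calculation, and the probabilities in it are idealized 90th minute values chosen to match the examples above.

```python
def expected_points(p_win, p_draw):
    """Expected league points given win and draw probabilities."""
    return 3 * p_win + 1 * p_draw


def goal_epa(before, after):
    """Expected points added by a goal.

    `before` and `after` are (p_win, p_draw) tuples for the scoring
    team just before and just after the goal.
    """
    return expected_points(*after) - expected_points(*before)


# Idealized 90th minute cases (assumed probabilities, not measured ones):
# tied game is a certain draw, a lead is a certain win, a deficit a certain loss.
go_ahead_goal = goal_epa(before=(0.0, 1.0), after=(1.0, 0.0))
tying_goal = goal_epa(before=(0.0, 0.0), after=(0.0, 1.0))
```

With real data the probabilities are never exactly 0 or 1, which is why the values above are described as "almost" 2 and "almost" 1.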

After this, I noticed the graphs were not nice easy curves. Even though I took a big sample of games (about 10 years' worth), there isn’t enough data to give a nice curve. To fix this, I just created lines of best fit for each game situation. The home and away graphs for each minute and goal differential are below. Before, there were a few situations that didn’t give a realistic expected point total because there were so few games in that situation (like a 2 goal lead in the 5th minute). Making the nice smooth curves fixes this problem. It also allows me to use equations to calculate EPA instead of the annoying process of referencing a massive Excel chart.
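The post doesn't say what shape the fitted curves take, so the sketch below simply assumes a straight line fitted by ordinary least squares. The function name and the sample numbers are hypothetical; the point is only to show how a fitted equation can replace a lookup table.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up noisy expected points for a home team leading by 1, by minute.
minutes = [10, 30, 50, 70, 90]
observed = [2.20, 2.35, 2.50, 2.80, 2.95]
slope, intercept = fit_line(minutes, observed)


def smoothed_expected_points(minute):
    """Expected points read off the fitted line instead of the raw table."""
    return slope * minute + intercept
```

Once a line (or curve) is fitted, rare situations with only a handful of games borrow strength from the neighboring minutes rather than relying on a tiny sample.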

I think there are a lot of possible paths to take from here. I’m going to recalculate the top goal scorers’ EPA using the equations. It won’t change much, but it’ll be nice to have some continuity because I’ll be calculating EPA week by week for every goal next EPL season.

I’m also working on creating a database of the top goal scorers in the last 10 years in the EPL with their goal totals and their EPA over the years. Looking at goals and EPA over time will hopefully give some insights into clutch (or lack thereof) goal scoring. If some players consistently have very high EPAs and some players consistently have low EPAs, it could be an indicator of clutch goal scoring in football.

Like I said before, I’d love comments and suggestions on ideas for where to go next on the blog, via Twitter, or even email.

tags win probability, viz, win probability added, SSAC

Does More Possession=More Wins in the MLS?

July 27, 2011 by Ford Bohrmann

In the past couple of blog posts I've looked at two common statistics and shown that they are not as meaningful as most people believe. Shots on goal do not predict success very well, and assists favor players on better clubs. In keeping with this theme of misleading statistics in football, I decided to look at possession data. The commonly held notion is that the team that has the ball more (has a possession percentage over 50) is more likely to win. This makes sense. A team with the ball more is more likely to score and less likely to concede. But does the data back it up? Does having more possession than your opponent mean you are more likely to win the game? I looked at the possession data from the MLS season so far. What I found goes completely against what most people would think. So far this season in the MLS, the average possession percentage for teams that have won the game is 48.5%. Teams that win actually possess the ball less. This means the average possession percentage for losing teams is 51.5%.

To get even more specific, I broke down the possession data further. Winning home teams average 50.9% possession, and winning away teams average 43.4% possession. On the other side, losing home teams average 56.6% possession and losing away teams average 49.1% possession. The histograms below illustrate these facts. I found that away teams, on average, have a possession percentage of 47.3%, and home teams have a possession percentage of 52.7%.

So what does all this mean? It seems possession percentage in the MLS does not predict success. Teams that possess the ball more don't win more; they actually lose more. Home teams also have a slight advantage in possession percentage compared with away teams.

What about teams that completely dominate possession? You might think that a team that had the ball much more often than their opponent would be much more likely to win. I defined "dominating possession" as having the ball more than 60% of the time. So far this season, teams that have dominated possession have a record of 10 wins, 19 losses, and 18 ties. Domination in possession? Yes. Domination in wins? No.
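To put that record in more familiar terms, here is a small sketch that turns a win/loss/tie record into a win percentage and points per game. The function name is my own; the record is the one quoted above.

```python
def summarize_record(wins, losses, ties):
    """Return (win percentage, points per game) for a W-L-T record."""
    games = wins + losses + ties
    win_pct = wins / games
    points_per_game = (3 * wins + ties) / games
    return win_pct, points_per_game


# Record of teams with more than 60% possession, as quoted above.
win_pct, ppg = summarize_record(wins=10, losses=19, ties=18)
```

That works out to a win in roughly one of every five games and about a point per game, which is not what you would expect from a side that "dominated".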

This analysis calls into question statements like "the Union had the run of play, they possessed the ball more and deserved the win." It's apparent that in the MLS, possession is not all that important when it comes to winning games. So what's the problem with possession? One reason could be that the best teams do not play possession football. The teams with the most success may play kick and run. Another possibility is that possessing the ball simply doesn't lead to wins. Either way, having the ball more than your opponent does not mean much in the MLS.

tags MLS, possession, stata, viz

Why We Shouldn't Put Much Value in Assists

July 25, 2011 by Ford Bohrmann

Last week I wrote a post on why shots on goal are a misleading statistic. In keeping with the analysis of the problems with some commonly kept statistics in football, I decided to look at assists.

If you think about it, assists are highly misleading. Simply playing with good players boosts your assist total. Similar to shots on goal, not all assists are the same. There are the assists where a player makes a short pass in the midfield that leads to a teammate dribbling through all the opposing defenders and finishing, and the assists where a player makes a beautiful cross where their teammate simply has to tap the ball in the open net. These obviously shouldn't be counted as the same value to the team, yet they are. Hell, I could probably record an assist eventually in the EPL if I played for one of the top teams (OK, maybe an exaggeration but you get the point.)

First, let's look at the assists data for all the teams in the EPL. As the graph below shows, as the point total of a team increases (basically, the better the team is) the assist total also generally increases. This is no surprise. We would expect better teams to score more goals and thus have higher assist totals.

Basically what this means is that the assist statistic should favor players on better teams. Players on better teams play with better teammates and should therefore have more opportunities for assists. Below is a screenshot from the EPL website of the players with the top 20 assist totals.

9 players from top 5 clubs are in the top 20 for assist totals. No players from bottom 3 clubs are in the top 20, with the exception of Blackpool's Charlie Adam who was just signed by Liverpool. It's easy to see assists totals are higher for players on better clubs.

A better statistic that is less influenced by the quality of your teammates is chances created. A chance created is defined as a pass that leads to a shot. These are obviously not as dependent on your teammates and give a fairer and truer assessment of how much of a playmaker that player is for their team.

The next time a club is looking to sign a player based solely on their assists totals, they should take a more in depth look. Assists can tell an inaccurate, or at the least biased, story.

tags stata, assists, shots on goal, EPL, regression

Do Shots on Goal Matter?

July 18, 2011 by Ford Bohrmann

The major point of this blog is to test commonly held notions in football for their validity. After watching the US women lose to Japan yesterday, I started to think about shots on goal. I don't have the exact numbers, but I'm pretty sure the US crushed Japan in the shots on goal category. This made me think, do shots on goal matter? Most people would quickly say yes. It would make sense that more shots on goal mean more chances to score and thus more goals. The only problem is that some things in football just don't make sense. I wanted to see if shots on goal equate to success in two categories: 1.) Do more shots on goal mean more success for a team as a whole? 2.) Do more shots on goal mean more goals for a specific player? To test these questions I used data from the MLS website. As an aside, mls.com has extensive statistics for every season in a bunch of categories. Great to see. Anyways, the data is from the 2010 MLS season.

First question: Do more shots on goal mean more success for a team as a whole?

If this were true, we would expect points to increase as shots on goal increase on a team level. In other words, teams that have more shots on goal would be more successful. The graph below tells us a different story.

The graph shows there is no real relationship between shots on goal and points. Most teams cluster around just under 140 shots on goal on the season. The line of best fit shows a positive relationship, but this relationship is not strong at all. The correlation of the graph is r=.1311. As a reminder, the correlation of a graph tells us how strong the linear relationship is between two variables. The correlation coefficient (the value of r) gives a numerical value of the strength of the relationship. A value of 0 means there is no linear relationship at all, and a value of 1 means there is a perfect positive linear relationship. In this case, the value is .1311, telling us there is a very weak linear relationship.
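For readers who want to compute r themselves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name and the toy numbers are made up for illustration; they are not the MLS data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy examples: a perfect positive line, a perfect negative line,
# and a pattern with no linear trend at all.
perfect_positive = pearson_r([1, 2, 3], [2, 4, 6])
perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])
no_linear_trend = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])
```

Note that r can also be negative, down to -1 for a perfect negative linear relationship, so the strength of a relationship is judged by how far r is from 0 in either direction.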

Second question: Do more shots on goal mean more goals for a specific player?

Similar story for this question: is there a linear increase in the amount of goals as the amount of shots on goal increases? The graph below gives us the answer.

This graph shows a stronger relationship than the graph above, with a value of r of .4722. However, the relationship is still not very strong: a correlation under .5 is generally considered to be a weak relationship. This means that for individual players, shots on goal are not a very good indicator of goals.

Here's my best explanation for why shots on goal are not a very indicative statistic: Not all shots on goal are the same. There are 40 yard weak rollers that the goalie easily saves, and there are 5 yard shots that the keeper barely gets a hand on. There are weak attempts by a center back getting forward and there are breakaways by forwards. In the shots on goal statistic, all of these are counted as equivalent. Obviously this makes no sense. A statistic that would be more indicative of goals scored for both questions I looked at above would be shots on goal inside the box. Shots on goal inside the box would get rid of the shots on goal that have no chance of going in. Not all shots inside the box are the same, so we have somewhat of the same problem as shots on goal. However, I assume there would be a much stronger correlation between shots on goal inside the 18 and points, and between shots on goal inside the 18 and goals by an individual player. Unfortunately, I don't have the data to back up this claim (working on it). If/when I do get the data for shots inside the box I'll post the graph and the correlation between shots on goal in the box and goals.

Even without the data, the point I'm making is still clear: shots on goal do not equate to more success from a team perspective and do not correlate with goals for individual players as strongly as most people assume they do. There are better statistics than shots on goal. This means statements like "New England had 5 more shots on goal than New York, they dominated the game" and "Donovan had 4 shots on goal in the game, he was due for a goal" are not necessarily valid. What if New England had a bunch of shots on goal from outside the 18 that never had a chance of going in? And what if Donovan's shots on goal all were weak rollers? Shots on goal are often misleading.

tags shots on goal, MLS, stata

A Different Look at League Parity: MLB vs. EPL

July 14, 2011 by Ford Bohrmann

I was intrigued after reading a post last month by Chris Anderson on his Soccer By the Numbers Blog. The post compares the competitiveness of different football leagues in Europe. You can find it here. Anderson talks about "uncertainty of the outcome" as a measure of parity. This makes sense, as leagues where the outcome is not a sure thing are more equal.

With uncertainty of the outcome in mind, I took another approach to analyzing the parity in a league: looking at the number of different champions. In the past 10 years, only 3 clubs have won the English Premier League. In baseball, 9 different teams have won the World Series. Of course, this has flaws and is not a complete look at the league. This does suggest that the outcome is not fixed for baseball though.

Does this mean that professional baseball has a more balanced league than the EPL? If you've read Moneyball by Michael Lewis (if you haven't you should) you would know that MLB is facing payroll disparities similar to the ones in the EPL. So why the large difference in the number of winners? The answer is the playoffs.

In baseball, the 6 division winners plus 2 wild cards make the playoffs. There is one best of 5 series, followed by 2 best of 7 series. This adds up to only 11 wins to take home the World Series. Most people say that the playoffs are different than the regular season. They say all previous records are thrown out the window and any team can beat any other team. While there may be some change in the way a team plays when it comes to the playoffs, there is a more important factor at work: a small sample size of games. With such a small sample, it is not uncommon for a less skilled team to simply get lucky and beat a better team. Assume a team has a 30% chance of beating another team in a playoff game. For a best of 5 series, that team has a 16.3% chance of winning. For a best of 7 series, the team has a 12.6% chance of winning. All in all, upsets are not uncommon in the MLB playoffs. These upsets are the force behind the multitude of World Series winning teams this decade.
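The series percentages above can be checked with a short calculation. The sketch below assumes every game is independent and that the underdog has the same chance of winning each game; the function name is my own.

```python
from math import comb


def series_win_probability(p, best_of):
    """Chance of winning a best-of-N series with per-game win chance p.

    The team wins the series on its final win, having lost k games
    along the way, for k = 0 up to (wins_needed - 1).
    """
    wins_needed = best_of // 2 + 1
    q = 1 - p
    return sum(
        comb(wins_needed - 1 + k, k) * p ** wins_needed * q ** k
        for k in range(wins_needed)
    )


best_of_5 = series_win_probability(0.3, 5)  # about 0.163
best_of_7 = series_win_probability(0.3, 7)  # about 0.126
```

The longer the series, the smaller the underdog's chance, which is the same sample size effect in miniature.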

In contrast, we can look at the EPL. The EPL has no playoff system, and the winner is the team with the most points after each team plays 38 games. Effectively, you can look at this as just being one long playoff. Here, the sample size is much bigger: 38 games. Historically, teams have had to win more than 25 games to win the league (with the exception of last season). If we look at an above average team, what is their chance of winning more than 25 games? Let's take Liverpool from last season. For simplicity's sake I will only look at wins for the analysis. This may hurt a team with a lot of draws, but it makes the analysis a lot simpler. Last season they finished 6th with 17 wins. This means they won about 45% of their games. I am also assuming that Liverpool's record is an accurate measure of their ability to win games. In other words, Liverpool really does have a 45% chance of winning a game. The probability of Liverpool winning more than 25 games last year, if they have a 45% chance of winning each game, is .3%. For a team that won 25 games, or 65% of their games (in the past 10 years it has been ManU, Chelsea or Arsenal), the chance of winning more than 25 games is 42%. Because of the bigger sample size, upsets are much less likely in the EPL. Even with a good team like Liverpool (I don't think anyone would say Liverpool winning the league is an upset), the probability of it happening is very low.
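The same kind of check works for the 38 game season. This sketch treats each game as an independent win-or-not trial with a fixed win chance (the same simplification used above, ignoring draws), and the function name is my own.

```python
from math import comb


def prob_more_than(k, n, p):
    """Binomial chance of more than k wins in n games with win chance p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))


# A side that wins about 45% of its games (17 wins from 38 is roughly that).
liverpool_like = prob_more_than(25, 38, 0.45)  # roughly 0.003, i.e. about .3%

# A side whose true win rate matches 25 wins from 38.
title_winner_like = prob_more_than(25, 38, 25 / 38)
```

The second number lands a little above 40%, in the same neighborhood as the figure quoted above; the exact value depends on how the win rate is rounded.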

Baseball's smaller sample size of games in the playoffs allows for upsets and gives the appearance of parity with numerous teams winning the World Series. The EPL's larger sample size and lack of playoffs vastly reduce the chance of an upset, which leads to the same powerhouse teams winning over and over again. John Henry already has two championships this decade with the Red Sox. The way the leagues are set up, a third championship with the Red Sox is more likely than his first with Liverpool.

tags mlb, epl, imbalance, moneyball

WPA and AGW Weekly Updates this Season

July 12, 2011 by Ford Bohrmann

I just added the image on the right of the page ranking the players by their WPA totals. The chart also includes each player's AGW and goal total for the season. I'll update this every week during the EPL season. Explanations of WPA and AGW are below.

WPA: Win Probability Added measures exactly what it sounds like it should: how much a player has added to their team's success through their goals. The way I calculate this is to sum how much each player's goals add to the team's probability of winning. Goals are a flawed statistic because every goal is obviously not worth the same amount. The 5th goal in the 90th minute in a 5-0 win is not important. The 1st goal in the 90th minute in a 1-0 win obviously is very important. To quantify these values I accumulated the total record (wins, losses, and ties) of every game in the past 10 years in the EPL. This way, I could calculate the exact winning percentage at every different game situation for both teams. For example, I know that scoring the 2nd goal to make a game 2-0 at home in the 67th minute increases a team's chance of winning by 10.845983%. WPA takes into account the importance of each goal, and shows how much, overall, a player has added to their team's chance of winning a game through their goals.

AGW: Average Goal Weight is simply how much, on average, the player's goal is worth. Mathematically, it is the player's total WPA divided by the number of goals they have scored. For example, one player may only score 5 goals on the season, whereas another may score 15. However, the first player could have a higher AGW if they tended to score pivotal goals while the second player scored useless goals.

WPA and AGW are not perfect statistics, but they do provide a little more insight into a player's goal scoring ability.

tags expected points added, average goal weight, win probability, win probability added

Answer to my Question via Twitter Posted Earlier

July 11, 2011 by Ford Bohrmann

The question I asked earlier today via my twitter @SoccerStatistic was, "Which statistic correlates best with a team's point total?" The options were goals against, corners, goals for, and shots on target. The answer is extremely surprising to say the least.

Another way to ask the question is "Given the goals against, corners, goals for, or shots on target total for a team in the EPL, which variable would allow you to best predict the point total of the team?" Turns out the answer is not goals for, goals against, or even shots on target. Yep, it's the corner total. This means the number of corners a team accumulates during the season is a better indicator of the team's standing than the other variables. To me, this is mind-boggling. The point of the game is to score more goals than your opponent, yet the number of corners predicts point totals the best.

The way to figure this out is with linear regressions between points and the 4 statistics in question using season totals for EPL teams. A linear regression tells us how strong the linear relationship between two variables is with a number called the correlation coefficient. A value of 0 would mean there is absolutely no linear relationship, and a value of 1 would mean a perfect linear relationship. Below is a chart of the 4 variables and their correlation coefficient values. The absolute values of the correlation coefficients are given, as goals against obviously has a negative relationship with points.
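Here is a rough sketch of how a ranking like this can be produced: compute the correlation between points and each statistic, then sort by absolute value. All the numbers and variable names below are invented for illustration and are not the EPL season totals.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_by_abs_correlation(points, stats):
    """Sort statistics by the absolute value of their correlation with points."""
    scored = [(name, abs(pearson_r(values, points))) for name, values in stats.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented season totals for four teams.
toy_points = [10, 20, 30, 40]
toy_stats = {
    "goals_against": [4, 3, 1, 2],  # negative relationship with points
    "corners": [1, 2, 3, 4],        # perfectly linear in this toy data
}
ranking = rank_by_abs_correlation(toy_points, toy_stats)
```

Taking the absolute value is what lets a negatively related statistic like goals against be compared on the same scale as the positively related ones.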

Corners just edge out goals for and goals against as the strongest relationship. There is only really one explanation I can think of to explain this: Corners result from pressure on the goal, and more corners would mean more pressure on the goal which corresponds with more wins and a higher point total. Still, the fact that the relationship is stronger than the relationships between points and goals for and points and goals against really amazes me.

A few things to point out: First, the best way to really predict a team's success is with their goal differential. However, it is still interesting that corners have the strongest relationship of the 4 variables above. Second, the relationship between corners and points shouldn't be read into too much. This doesn't mean that if a team goes out trying to get more corners they will be more likely to win the game; instead it means that better teams tend to earn more corners based on the way they are playing.

This also leads into something related that I will be working on in the near future. Is the number of goals scored by a player a good indication of the quality of the player? Forwards are the highest paid players in soccer, but what if goal scorers are significantly overvalued? Is it right when we say "Player x is a better player than player y because he scored more goals this season"? I think there are a number of ways to test these questions, so check back in the coming week for some results and analysis.

tags twitter, corners, goals

An Analysis of City Pre/Post Abu Dhabi Using the Transfer Price Index

July 07, 2011 by Ford Bohrmann

Pretty soon I'm going to start writing the Manchester City statistical blog over at http://www.eplindex.com/ (@EPLIndex). I also just read Pay As You Play by Paul Tomkins. If you haven't read it and you're interested in statistics and football, you should really give it a read. The book basically outlines the trend in the EPL that money buys points using what Tomkins calls the Transfer Price Index. More specifically, the higher the cost of the XI (Tomkins refers to this as £XI) the more a team tends to win. Of course, there are exceptions to this, but in general it seems to hold true. Anyways, when I was reading the book I thought it would be a good idea to analyze City using Tomkins's data, especially when I saw that my future fellow City blogger at EPL Index Danny Pugsley (@danny_pugsley) wrote the "Expert View" for the City section. I'm no expert on the analysis that Tomkins does, but I understand a good amount from reading the book. The subject of the book rings especially true for City considering the recent Abu Dhabi takeover and sudden influx of large amounts of cash for the club.

Some notes before the analysis: One, the data I am using is all from the book Pay As You Play, as I mentioned above. Two, make sure to notice some data is missing for years when City was not in the top flight. Three, the data in the book only goes to the 2009/2010 season, so the 2010/2011 season is missing.

Basically, I looked at 3 questions: 1.) Does City really spend more money since the Abu Dhabi takeover? 2.) Does a higher £XI cost equate to success for City in the EPL? 3.) Screw 1 and 2. What if City keeps buying Robinhos?

Does City really spend more money since the Abu Dhabi takeover?

Yeah, really dumb question. Pretty obvious the answer is yes. Below is the graph comparing the league average starting eleven cost and the City starting eleven cost since 1992. In the 2008/2009 season, City's £XI is higher than the league average for the first time since the 1994/1995 season. Remember, Abu Dhabi took over at the start of the 2008/2009 season. For the 2009/2010 season it skyrockets to over £120,000,000. City now has money to spend.

Does a higher £XI equate to success for City in the EPL?

The answer Tomkins gives for EPL clubs in his book is yes. Again, this makes sense. Clubs that are able to spend more on players should be able to produce higher quality sides and win more. I wanted to analyze specifically City's success, so I looked at the data to see if their £XI rank in the EPL follows their league position. In other words, does City succeed more when they spend more? Looking at the graph below, the answer seems to be yes. The league position (green line) generally follows the club's £XI rank (orange line).

Screw 1 and 2. What if City keeps buying Robinhos?

The first two graphs seem to point to inevitable success for City. They have a lot of money and money can buy success, so they'll succeed, right? People will obviously point to some recent not-so-successful expensive purchases. Robinho, Jo, and Santa Cruz are the 3 big ones. They have had start percentages of 47, 16, and 16 respectively, despite a massive total cost of £69,000,000. A good graphic to show the efficiency of purchases is the cost per point used in Pay As You Play. Clubs that are efficient in this regard will have spent less money per point earned, while clubs that are inefficient will do the opposite. The graph shows how much City spent in each year for each point they earned. Not surprisingly, the cost per point has spiked since 2008. This may seem like money is being wasted. While City may not be getting as much bang for their buck, it likely won't matter in terms of success. According to Tomkins, the highest cost per point goes to Chelsea in 2006/2007. They finished in 2nd that year. It seems that simply having a lot of money can trump the inefficiencies shown by the cost per point value. Tomkins even refers to City's high cost per point on page 18: "Manchester City will certainly close the gap for this unwanted honour (although if they win the league, they won't care what people think; they could probably afford to pay £4m or £5m per point if it would guarantee them success)." So yes, City may make some poor purchases like Robinho, Jo, and Santa Cruz in the future. All in all, it doesn't matter that much though. City has so much money that they'll win anyways.
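For anyone who wants to reproduce the idea, here is a minimal sketch of a cost per point calculation. It assumes cost per point is simply a season's squad cost divided by league points earned; the exact definition in Pay As You Play may differ. The function name and the figures are made up for illustration and are not taken from the book.

```python
def cost_per_point(squad_cost_gbp: float, points: int) -> float:
    """Pounds spent per league point earned in one season."""
    if points <= 0:
        raise ValueError("points must be positive")
    return squad_cost_gbp / points


# Hypothetical figures for illustration only (not from the book).
seasons = {
    "season A": (40_000_000, 50),
    "season B": (120_000_000, 67),
}

for name, (cost, pts) in seasons.items():
    print(f"{name}: £{cost_per_point(cost, pts):,.0f} per point")
```

A lower number means a club is getting more points for its money; a spike like City's shows up as a jump from one season to the next.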

tags books, pay as you play, eplindex, mcfc, stata, viz, money, EPL

Fun With Graphs

July 06, 2011Ford Bohrmann

Often graphs can tell us a lot more about certain data than just the numbers themselves. At least they are usually easier to understand. I just downloaded Aaron Nielsen's (@ENBSports) amazing database from the 2010 MLS season and started playing around with it. Here are some interesting graphs I came up with:

This is probably a graph that already exists somewhere, but I made it anyways. It really highlights how much Seattle dominates attendance in the MLS. Also added in a bar for average attendance (between Chicago and Salt Lake) for comparison.

Another graph that highlights domination (in this case probably in a negative sense) of one team over all the others. All other teams fall in the range of 1.4 to 1.8 cards per game. However, it's clear that Toronto is an outlier with 2.17 cards per game.

This graph once again shows domination by one team in a certain statistic. Dallas scored almost 20% of their goals from PK's. That's 1 out of every 5 goals. This almost doubled every other team in the MLS last season, and was 10 times the percentage of Seattle. Hmm. Not exactly sure what the explanation here is. Is Dallas really good at diving? Are they being favored by refs? Are they just getting a lot of chances in the box? Something to look at in the future.

For the percentage of goals scored outside the 18, I took the 2 lowest, 2 highest, and the average. Dallas (likely from their massive share of goals from PK's) and Columbus have the lowest percentages of goals scored from outside the 18. New England and Chivas USA have the two highest percentages of goals scored from outside the 18. This shows not every team is scoring goals the same way in the MLS. Having a high percentage of goals from outside the 18 doesn't exactly mean the team is being creative or is better at long distance shooting. Instead, it more likely tells us that the team struggled to score goals within the 18, where the bulk of goals are scored. Dallas and Columbus were 4th and 5th last year, respectively, while New England and Chivas USA were 13th and 15th, respectively.

tags viz, mls, stata, attendance, cards, pks, shots

Another Look at Referee Bias: Extra Time Given

June 30, 2011Ford Bohrmann

Yesterday I looked at referee bias in this past season for the EPL. It turned out that while referees favored the home team overall in parts of the game like fouls, yellow cards, and red cards, it is more likely due to the advantage the home team has in a game. One statistic I did not look at though, is the amount of extra time given.

Extra time has nothing to do with the relative abilities of the teams or the score of the game like many other parts of soccer do. In theory, it should be an objective amount that does not depend on whether the home or away team is leading. In almost every game, though, you see the home crowd jeering for the ref to end the game if their team is ahead, or cheering even louder for their team to come back if they are trailing. Based on this, referee bias would be present if home teams that are leading have shorter games compared with away teams that are leading. The obvious logic is that the referee gives in to the home team's fans and unconsciously adjusts the extra time he gives.

To do this I looked at the length of the game for home teams that won the game versus length of the game for away teams leading. If there is indeed a referee bias then we should see that the length of games is shorter for home leading teams versus away teams.

Below are histograms (graphs showing the frequency of each dependent variable value) of the length of the game for the two categories above.

We can see the graphs are very similar, except for the tail on the right end of the away win time. This is in accordance with our hypothesis that away teams that are leading face more extra time. It seems refs gave trailing home teams more than 10 minutes of stoppage time more often than they gave that much to trailing away teams.

Like the previous post looking at referee bias, I did statistical analysis to see if the difference was actually statistically significant (in other words, the difference was not due to randomness). The mean length of game for leading home teams was 96.36 minutes, while the mean length of games for leading away teams was 96.56.

Using the data, I ran a two sample t-test. Basically what a t-test does is take into account the number of observations, mean, and standard deviation (measure of spread) of each group and test whether the two means are equal. In the end, the test gives a p-value between 0 and 1. A p-value basically answers the question: if the two means were actually the same (time given for leading teams was the same for home and away), what is the probability that we would see a difference in the means at least as large as the one we actually saw? In this case, a p-value near 0 suggests that the means are different, while a p-value near 1 gives no evidence that they differ. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can reasonably say the means are not the same.
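To make the mechanics concrete, here is a rough sketch of a two sample test using only the Python standard library. Two assumptions to flag: it uses Welch's form of the t statistic (each sample keeps its own variance), and it approximates the p-value with a normal distribution, which is only reasonable for large samples. That is not necessarily the exact test run for this post, and the game lengths below are made up rather than taken from the EPL data.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for mean(a) vs mean(b)."""
    # Welch's form: each sample keeps its own variance.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Normal approximation to the t distribution (assumes large samples).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Made-up game lengths in minutes, not the EPL data from this post.
home_leading = [95.0, 96.5, 97.0, 96.0, 95.5, 97.5, 96.0, 96.5]
away_leading = [96.0, 96.5, 97.5, 96.0, 98.0, 95.5, 97.0, 96.5]

t, p = two_sample_test(home_leading, away_leading)
print(f"t = {t:.3f}, p = {p:.3f}")
```

A p-value above .05 from a run like this would mean the gap between the two means could easily be down to chance.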

After doing the test, the p-value I got was 0.2013. While this suggests that referees are giving more time to trailing home teams, it is not at a statistically significant level. In other words, we cannot conclude that referees give more extra time to trailing home teams compared with trailing away teams. It may seem like there is a bias evident based on the means, but it is not at a statistically significant level.

All in all, referees are doing a good job in terms of not favoring home teams over away teams. Next time someone complains that the ref is favoring the home team, you can just tell them to look at the data.

tags bias, ref, EPL, stata

Checking for Referee Bias in the 2010 EPL Season

June 29, 2011Ford Bohrmann

Referee bias is a hot topic in any sport, not just soccer. People often accuse referees of favoring the home team in matches. The accusation makes sense: with a stadium full of fans rooting for one team, you would think it would be hard not to favor the home team just a little bit.

But is there a bias evident in the data? To look at this, I looked at data from last season in the EPL. Referees have control over a number of parts of the game. The parts I looked at were fouls, yellow cards, red cards, and offsides. If refs exhibit a home bias in the EPL, they would call more fouls and offsides and give more yellow and red cards to the away team. Pretty simple logic. Let's look at the data piece by piece.

Fouls:

Clearly, the graph shows that the average number of fouls is indeed higher for the away team. The away team is called for, on average, 13.04474 fouls, while the home team is called for only 12.09737. That's a difference of about a foul per game. I also ran a two sample t-test to test for significance. Basically what a t-test does is take into account the number of observations, mean, and standard deviation (measure of spread) of each group and test whether the two means are equal. In the end, the test gives a p-value between 0 and 1. A p-value basically answers the question: if the two means were actually the same (fouls were the same for home and away), what is the probability that we would see a difference in the means at least as large as the one we actually saw? A p-value near 0 suggests that the means are different, while a p-value near 1 gives no evidence that they differ. Generally, a p-value of .05 or lower is considered statistically significant, meaning we can reasonably say the means are not the same.

Anyways, the p-value I came up with after running a t-test for home and away fouls was .0003. In other words, we can say that refs called more fouls on the away team at a statistically significant level.

Yellow Cards:

Next I looked at yellow cards. Again, looking at the graph, away teams received way more yellow cards on average than home teams. Specifically, the home team averaged 1.413158 per game, while the away team averaged 1.955263 per game. That's a difference of about .5 per game. Again, I ran a t-test similar to the one above for fouls, this time for yellow cards. This time, the reported p-value was 0 (after rounding). This means more yellow cards were given to away teams than home teams at a statistically significant level.

Red Cards:

Third, I looked at red cards. If there is a home referee bias present we would expect to see more red cards given to away teams. Like fouls and yellow cards above, the bias seems to continue. Looking at the graph, there were definitely more red cards given to away teams on average. Per game, home teams received .0605263 red cards, and away teams received .1184211. In other words, away teams received about twice as many red cards as home teams. Again, are these numbers significant? Turns out, like fouls and yellow cards, they are. The p-value was .0042, again telling us that away teams received more red cards per game at a statistically significant level.

Offsides:

Finally, I looked at offsides. In this case, the home team was actually called more for offsides. Huh? What's going on here? Home teams, on average, were called for offsides 2.35 times per game, while away teams were only called 2.223684 times per game. Are referees not being biased for offsides, while they are for fouls, yellows and reds?

As always, we should check all possible scenarios. One explanation for the differences in these 4 types of calls could come not from referee bias, but from the advantage that home teams have over away teams. Maybe teams that are losing naturally foul more, receive more yellow and red cards, and get called for offsides less. It's obvious that home teams have a big advantage over away teams in the EPL: To name just one statistic, home teams scored, on average, 1.63 goals per game, while away teams scored only 1.01 goals per game. This is a pretty wide margin.

If the apparent bias was actually due to the home team's advantage, then losing teams would follow the same pattern as away teams. In other words, losing teams would be called for more fouls and receive more yellow and red cards. Most importantly, losing teams would be called for offsides less. Well, let's look at the data for losing teams compared with winning teams side by side with the data for away teams compared with home teams.
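The side by side comparison boils down to averaging the same statistic twice, once grouped by venue and once grouped by result. Here is a small sketch of that grouping. The record layout, the `mean_by` helper, and the numbers are all hypothetical, made up for illustration rather than taken from the EPL data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical team-game records: (venue, result, fouls). Not real EPL data.
records = [
    ("home", "win", 11), ("away", "loss", 14),
    ("home", "loss", 13), ("away", "win", 12),
    ("home", "draw", 12), ("away", "draw", 13),
]


def mean_by(records, key_index, value_index=2):
    """Average the value column within each group of the key column."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key_index]].append(row[value_index])
    return {key: mean(values) for key, values in groups.items()}


by_venue = mean_by(records, 0)   # fouls for home vs away
by_result = mean_by(records, 1)  # fouls for win vs draw vs loss
print(by_venue)
print(by_result)
```

If the away and loss averages move together, and the home and win averages do too, that points to game state rather than the referee.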

Fouls:

Disregarding the draws column on the far left, the graphs look similar. Both away teams and losing teams are called for more fouls.

Yellow Cards:

Again, if we look at the loss and win bars, they coincide closely with the away and home bars, respectively.

Red Cards:

Three in a row. The bars look strikingly similar for away versus home and loss versus win.

Offsides: Finally, we should expect losing teams to be called for fewer offsides than winning teams, just like how away teams are called for fewer offsides than home teams...

Look at that! Winning teams are indeed called for more offsides than losing teams.

Conclusion: Based on the first half of the post, it truly appears that referees favor the home team with their calls. In fact, I convinced myself that was the case for a little bit. However, it really comes down to the advantage a home team has in a game instead of any referee bias. While this post doesn't show anything revolutionary about referee bias (admittedly, it would have been pretty cool to make a groundbreaking discovery proving refs favor home teams), it is a good reminder that data can often be deceptive in the way that you look at it. It's important to look at all angles to really understand what is going on beneath the surface before you jump to any conclusions.

Win Probability Graphs and Regressions

June 24, 2011Ford Bohrmann

Earlier in this blog I wrote a post on Win Probability in every possible game situation. I posted the Excel files but they aren't as informative as a graph. I made up graphs for home and away and +2, +1, 0, -1, and -2 goal differentials for every minute. I didn't make up graphs for GD's bigger than that because there is basically no point. The fact that a team has a 99.9% win probability when they are up 4-0 isn't that exciting.

Each graph has the line of best fit and a scatter plot of the data. The equations for those lines are also on the graph along with the r^2 value for correlation. The graphs are below to look at. Some interesting things I noticed:

-Most graphs show a very strong relationship between minute and win probability. The only ones that don't really are when teams are away and are tied, when teams are home and up by 2, and when teams are away and down by 2. Not really sure why these three stick out.

-Some of the graphs have linear relationships, while others are quadratic. Again, not really sure why this is. Why does the win probability when you are at home and tied follow a quadratic curve while the win probability of a team at home and down by 1 is linear? Maybe people have ideas as to why this happens.

-For some of the scenarios (the +2 and -2 GD's for home and away) I didn't start the graph at minute 1 because the data points were a little all over the place. This happens because there are so few data points that the win probabilities are skewed. Example: There aren't many times when a team has a 2-0 lead in the 5th minute.

-I added the graphs of all the goal differentials together for comparison, one for home and one for away. They're interesting to look at.

-Finally, because of this we now have some basic equations to model a team's chance of winning a game. Feel free to use them and check them out.
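For anyone who wants to fit their own line, here is a minimal sketch of a least-squares line of best fit with its r^2 value, written in plain Python. The `linear_fit` name and the data points are made up for illustration; they are not the values behind the graphs in this post. A quadratic fit would need an extra squared term, which this sketch does not cover.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus r^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, r^2 is the squared correlation of x and y.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up win probabilities by minute for a team leading by 1.
minutes = [10, 30, 50, 70, 90]
win_prob = [0.62, 0.68, 0.76, 0.85, 0.97]

slope, intercept, r2 = linear_fit(minutes, win_prob)
print(f"win_prob = {slope:.4f} * minute + {intercept:.4f}, r^2 = {r2:.3f}")
```

Plugging a minute into the fitted equation gives a quick estimate of the win probability for that game situation.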

tags win probability, regression, win probability added

WPA and AGW: Van Persie is overrated

June 23, 2011Ford Bohrmann

Well, maybe the title is a little exaggerated. What I really mean is the value of Van Persie's goals last season are overweighted. On the other hand, Darren Bent's goals were undervalued. The explanation comes from WPA, or "win probability added".

In the last post, I explained win probability. If you haven't read it, check it out here. Because we have a probability for every game situation, I was able to weight goals by the added win probability a team has from that goal. In soccer, it is a little more complicated because teams can tie. To solve this, I use win percentages instead of win probability. To get a team's win percentage you weight a win as 1 point, a draw as 1/3 of a point, and a loss as 0. The sum of these divided by the number of games a team has played gives us the win percentage. I guess in this case it should be win percentage added instead.

The added part comes in by calculating how much a goal adds to a teams win percentage. Here are a couple of examples:

-A goal in the 95th minute to put the home team up by a goal would have a WPA of .6667. A tie game in the 95th minute gives the home team a win percentage of .3333 (almost every time they will draw the game). However, in this case the home team scored. Now the score is 1-0 in the 95th minute. Now the home team's win percentage is almost 1 (almost every time they will win the game). To get the WPA of the goal we subtract the win percentage before the goal (.3333) from the win percentage after the goal (1). This gives us a WPA of .6667.

Basically what WPA does is values goals that are more important to the team. In the example above, that goal is obviously very important to the team. However, a goal in the 90th minute to put a team up by 6 would be worthless to the team. That goal would have a WPA of 0.
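The arithmetic above can be sketched in a few lines of Python. The function names are my own, and the sketch assumes you already have win and draw probabilities for the game situation before and after the goal.

```python
def win_percentage(p_win: float, p_draw: float) -> float:
    """Expected value with a win worth 1, a draw 1/3, and a loss 0."""
    return p_win * 1.0 + p_draw * (1.0 / 3.0)


def wpa(before, after):
    """Change in win percentage caused by a goal.

    `before` and `after` are (p_win, p_draw) pairs for the game situation.
    """
    return win_percentage(*after) - win_percentage(*before)


# The 95th-minute example: tied (certain draw) -> up by one (certain win).
late_winner = wpa(before=(0.0, 1.0), after=(1.0, 0.0))
print(round(late_winner, 4))  # 0.6667

# A goal when the win is already certain adds nothing.
garbage_time = wpa(before=(1.0, 0.0), after=(1.0, 0.0))
print(garbage_time)  # 0.0
```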

I calculated the WPA of the top scorers in the EPL last season (players with more than 10 goals). Interestingly enough, the list shook up a bit. The table is below.

Notably, Darren Bent moves up to first on the list, and Van Persie moves down to 8th. Beyond this, I wanted to know which players tend to score more important goals and which players score non-important goals. Obviously, Van Persie has a higher WPA than most of these players because he scored a lot more goals than them.

The way I did this was to calculate the average WPA of a goal by a player. I called this the Average Goal Weight, or AGW. The list of the AGW versus goals is below.

Not surprisingly, Van Persie moves to the bottom of the list, and Bent stays at the top. So what does all this mean? I don't think it's a good idea to jump to the conclusion that Van Persie is not a good goal scorer. Despite everything, he scored 18 goals last season, which is good no matter how you score them. However, I think AGW is a good supplement to the top goal scorers list. Last season, Bent's goals added, on average, a whole 10 points more to the winning percentage than Van Persie's did.

You shouldn't base your entire assessment of a goal scorer only on AGW. However, I think it's something to take into account.
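AGW as described above is just a player's total WPA divided by his number of goals. A quick sketch, using invented WPA values for two hypothetical strikers rather than the real numbers from the tables:

```python
def average_goal_weight(goal_wpas):
    """Mean WPA per goal for one player."""
    if not goal_wpas:
        raise ValueError("player has no goals")
    return sum(goal_wpas) / len(goal_wpas)


# Invented WPA values per goal for two hypothetical strikers.
players = {
    "Striker A": [0.45, 0.30, 0.50, 0.35],
    "Striker B": [0.10, 0.25, 0.05, 0.20, 0.15, 0.15],
}

# Rank by AGW, highest first.
ranked = sorted(
    players,
    key=lambda name: average_goal_weight(players[name]),
    reverse=True,
)

for name in ranked:
    wpas = players[name]
    print(f"{name}: goals={len(wpas)}, WPA={sum(wpas):.2f}, "
          f"AGW={average_goal_weight(wpas):.3f}")
```

Striker B scores more goals here but ranks lower on AGW, which is the same kind of reshuffle seen with Van Persie and Bent.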

tags EPL, win probability added, win probability, average goal weight

Win Probability Added in Soccer

June 21, 2011Ford Bohrmann

Everyone hears it all the time: A 2-0 lead is the most dangerous lead in soccer. But is it really? Thinking about this led me to wonder exactly how dangerous leads were in soccer. In fact, I wanted to find out what win, loss and draw percentages a team had in all situations. The best way to find this out is to analyze a lot of games and calculate the win, loss, and draw percentages in every possible game situation. To do this, I took into account the venue of the game (home versus away), the goal differential between the teams (team is up by 2, team is up by 1, game is tied, team is down by 1, team is down by 2, etc.) and the minute of the game. I took goal differentials of -5 to 5 and minutes 1-90. I thought these were probably really the most important factors. You could maybe take into account cards too, but this is hard and makes it pretty complicated. Overall, there are 2*11*90 = 1980 combinations of game situations.

The idea relates to WPA in baseball. Basically, WPA is a measurement of how much a play adds to the chance the team wins a game. For example, how much does a 2 run home run help the team's chances in the 6th inning? In soccer, a question would be how much does a goal at home to give you a 2 goal lead in the 67th minute change your winning percentage in the game? Pretty simple concept.

To get the percentages for all of these situations I imported game data from the past 10 years of the EPL in to Excel. My Excel skills are not the best but with some help I was able to eventually get these to convert in to percentages for each game situation mentioned above. The basic idea is this: how often do teams with a 1 goal lead in the 40th minute at home win? How often do they draw? How often do they lose? This was done for every minute and every goal differential both home and away. The results truly tell us how dangerous a variety of leads are.
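The counting idea can be sketched in Python as well. To be clear, this is not the Excel workbook used for this post; the data layout (each match as a list of goal minutes and scoring sides) and the function names are assumptions made for illustration, and the three matches are invented. Unlike the tables in this post, the sketch does not cap goal differentials at -5 to 5.

```python
from collections import defaultdict


def tally_states(matches, max_minute=90):
    """Count final outcomes for every (venue, goal_diff, minute) situation.

    Each match is a list of (minute, side) goals, side being "home" or "away".
    """
    counts = defaultdict(lambda: {"win": 0, "draw": 0, "loss": 0})
    for goals in matches:
        # Final goal difference from the home team's point of view.
        final = sum(1 if side == "home" else -1 for _, side in goals)
        for minute in range(1, max_minute + 1):
            # Home goal difference counting goals scored up to this minute.
            diff = sum(1 if side == "home" else -1
                       for m, side in goals if m <= minute)
            for venue, sign in (("home", 1), ("away", -1)):
                end = sign * final
                outcome = "win" if end > 0 else "draw" if end == 0 else "loss"
                counts[(venue, sign * diff, minute)][outcome] += 1
    return counts


def percentages(counts, venue, goal_diff, minute):
    """Win/draw/loss shares for one situation, or None if never observed."""
    if (venue, goal_diff, minute) not in counts:
        return None
    cell = counts[(venue, goal_diff, minute)]
    total = sum(cell.values())
    return {outcome: n / total for outcome, n in cell.items()}


# Three invented matches, far too few for real estimates.
matches = [
    [(20, "home"), (80, "away")],  # finishes 1-1
    [(30, "home")],                # finishes 1-0
    [(10, "away"), (60, "away")],  # finishes 0-2
]

counts = tally_states(matches)
print(percentages(counts, "home", 1, 40))
```

With 10 seasons of games instead of three, each cell has enough observations for the percentages to mean something.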

Here's an example: The team is away, the game is tied, and it's the 67th minute. Any guesses on the win, draw and loss percentages? Well turns out the team has about a 19% chance of winning, a 51% chance of drawing, and a 30% chance of losing.

We can also test the "2 goal leads are the most dangerous leads" theory. Let's say the team is home and it's the 35th minute. Here are the percentages for 1 and 2 goal leads:

1 goal lead: win: 78%, draw: 16%, loss: 6%
2 goal lead: win: 96%, draw: 2%, loss: 2%

The same holds true for all minutes and both home and away teams. A 2 goal lead is in fact not the most dangerous lead in soccer.

I'm also in the process of making a Java Applet to post here that lets you input the goal differential, venue, and minute, and spits out the win, loss and draw percentages. Again, my Java programming talents are not the best, so no promises on anything getting finished or uploaded soon. I uploaded the actual excel files to a google sites page though if you're curious to look at other percentages. If you want to download the files click here and type in the search bar ".htm" without the quotes to find the files.

Next, I'm planning on relating this more to how WPA is used in baseball by using it to analyze specific players by calculating how much percentage they add to their team winning by scoring goals. Not sure how useful this statistic will actually be, but it's worth a shot.

tags win probability added, EPL
