An Intro To Observational Studies
Observational, sometimes called retrospective, studies are an interesting concept and have many applications in my opinion. You can think of the goal of these studies as being to reverse engineer a statistical experiment as best as possible. Since most real life (observational) data was not set up in some sort of randomized design, we need a way to reduce bias and put control/treatment individuals on equal footing before estimating any sort of causal effects. One technique of doing this is called Matching.
Matching
Matching attempts to “balance” the distribution of covariates between control/treatment individuals. To do this, we first need some sort of distance metric to use between individuals, even if this is a highly dimensional problem. Some common ones are:
- Exact: only match individuals that have the same exact covariate values (useful if binary).
- Propensity score: absolute or linear difference in propensity scores, which is the conditional probabilty of an individual being a treatment given its covariates.
- Euclidean: sum of the normalized distanced for each covariate.
- Mahalanobis: euclidean distance adjusting for covariance in the data.
- Generalized Mahalanobis: a generalized version of the Mahalanobis distance to weight covariates by their importance to producing balance.
Using a caliper we can cap how large these distances can be for each covariate. As we become more stringent with the caliper, we throw more individuals out from consideration from each group, which you will see shortly.
Once we have distances between all individuals, we just need to match! There are a few different ways to do so:
- Nearest neighbor: for a treatment individual, find the closest control (minimize the distance). If there’s a tie, pick randomly.
- Optimal: finds the matching solution that minimizes overal distance. Could allow for one-to-many, where several weighted controls are matched to one treatment (or vice versa).
- GenMatch: a genetic algorithm that attempts to find the Mahalanobis weights that produce the best balance. Again one-to-many is possible.
Example
This example is for those of you who pay attention to the european soccer transfer news (ok ok gossip) like me. I discovered a really unique and extensive soccer data set, developed mainly by Hugo Mathien. I’ve been tinkering with it for a while and then one day I came up with a question I thought was worth investigating: When a player transfers to another team, how can you tell if he actually improved / performed better? The only way to properly test this would be to give him the exact same teammates, opponents, and game conditions for some amount of time, say a season or two, and then see if he scores more goals, gets more assists, makes more tackles, etc. But we don’t live in a world where this is possible, and that’s where an observational study comes in. What if I match games where the treatment is the club the player is representing? Obviously I would want to control for covariates like the overall skill / form of each team in the game, at what point in the season is the game, home or away status, weather, and maybe even betting odds. I believe doing so would allow us to better say if a player really got better after switching teams.
For this post, we’ll look at Gareth Bale. Specifically, we’ll look at his games playing for Tottenham and Real Madrid. We’ll try to find ones that look similar based on the attributes I mentioned above and then we’ll see if there’s a significant difference between his goals per game ratio for each club. See the Notes section at the bottom of this post for additional player and team IDs.
Data Background
As I mentioned, the data I’m using is a small sqlite database I found on
Kaggle here. The game
(Match) data spans from the 2008/2009 to 2015/2016 seasons. It only
contains the league games from 11 european countries, so no
Champions’/Europa League games are included. There’s also a ton of
betting odds for each game.
The data is just over 300MB and can be read into R
like so:
## [1] "Country" "League" "Match"
## [4] "Player" "Player_Attributes" "Team"
## [7] "Team_Attributes" "sqlite_sequence"
Data Prep
The function below helps to aggregate the data for use in our observational study. Some important things it does:
- Get the games our player of interest started in for each club.
- Create the treatment variable (1 if second team, 0 if first team) and parse some XML to get the player’s goals and assists per game.
- Find all player’s most recent overall rating from EA’s FIFA video game. EA updates these so I select the closest rating based on dates leading up to or on the date of the game. Get an average team rating for each team.
- Return a
data.frame
and table with some relevant statistics.
Let’s call this function. Again, see the Notes section at the bottom of this post for additional player and team IDs.
## Tottenham Hotspur Real Madrid CF
## games 128 75
## goals 25 29
## assists 13 17
We can see that Gareth has more goals as a madridista, even while playing less games. But could this be because he’s in an easier league or with better teammates? Let’s use Matching to find out.
Matching
Let’s look at Matching in a way I find intuitive: visually through multivariate plots. Since Matching “balances” covariates, we can think of the joint distributions of our control and treatment groups to become more overlapped the better the balance. So if I use the home and away teams’ average rating for each game, we see something like below. Can you guess where los clasicos are ;)?
If we do a quick and dirty Matching and tighten our caliper, you can see that the distribtuions begin to overlap more. However we throw out more and more individual games while doing so. This is always the tradeoff with Matching.
Now let’s do a more technical Matching. I will match on the following covariates:
- stage = what number game in the league season it is
- B365H = Bet365 home win odds
- B365D = Bet365 draw odds
- B365A = Bet365 away win odds
- home_rating_avg: home team’s average EA FIFA rating
- away_rating_avg: away team’s average EA FIFA rating
In a perfect world where I had a bit more data, I would include a flag for whether the game is home or away, if it’s near another fixture such as a cup game, weather conditions, etc.
## [1] "stage" "B365H" "B365D" "B365A"
## [5] "home_rating_avg" "away_rating_avg"
See my comments below and the Matching
package’s
documention
for more insight as to what these parameters are doing in this optimal
matching scenario:
It is always recommended to check the balance of your covariates using this function. I won’t print this function since it is reallllllly long, but basically you should at the minimum look to see if your variance ratio of treatments to controls got closer to 1 and your Kolmogorov-Smirnov tests are significant.
Results
## [1] 58
## [1] 58
Our matching kept 58 games for each group, totalling to 116 games. What’s the difference in goals per game for Gareth’s matched Tottenham and Real Madrid games?
## [,1]
## [1,] 0.2586207
Wow! So when controlling for “similar” games he played in for Tottenham and Real Madrid, we estimate that Gareth scores on average a quarter of a goal more per game as a madridista! But is this result statistically significant?
## Lower 95 Conf. Level Upper 95 Conf. Level
## [1,] 0.03671373 0.4805277
Using a 95% confidence interval, I would say yes!
Assumptions, Caveats, & Other Considerations
As I mentioned earlier, there are other covariates I could’ve used that
theoretically make sense and could potentially improve the balance. I
could’ve included a propensity score as a covariate or even matched
using GenMatch()
’s genetic algorithm. It’s also important to note that
I’m pushing the boundaries when it comes to sample sizes in this
example. More games for each team would’ve been preferable.
In terms of the causal effect I estimated remember a couple things:
- This effect only applies to the games I matched on and not all the other games. By doing this we are obviously throwing away a lot of data.
- This method is simply a nonparameteric way of estimating a causal effect. “Without an experiment, natural experiment, a discontinuity, or some other strong design, no amount of econometric or statistical modeling can make the move from correlation to causation persuasive” - Jasjeet S. Sekhon.
Resources
I eventually figured out how to join player ratings using the closest date thanks to this post.
For more technical details on Matching and its implemenation in R
check out this
paper.
The kaggle discussion board was helpful in general, particularly on how to parse the XML game event data to get goals and assists by player.
Notes
I hope you find these tables useful for your own study!
## player_api_id player_name birthday height weight
## 1 41411 Alessandro Matri 1984-08-19 00:00:00 182.88 176
## 2 50047 Alexis Sanchez 1988-12-19 00:00:00 170.18 137
## 3 38817 Carlos Tevez 1984-02-05 00:00:00 172.72 157
## 4 30613 Cesc Fabregas 1987-05-04 00:00:00 175.26 163
## 5 51553 Daniel Sturridge 1989-09-01 00:00:00 187.96 168
## 6 37459 David Silva 1986-01-08 00:00:00 170.18 148
## 7 30909 David Villa 1981-12-03 00:00:00 175.26 152
## 8 19243 Diego Costa 1988-10-07 00:00:00 187.96 187
## 9 107417 Eden Hazard 1991-01-07 00:00:00 172.72 163
## 10 15403 Edin Dzeko 1986-03-17 00:00:00 193.04 185
## 11 49677 Edinson Cavani 1987-02-14 00:00:00 182.88 157
## 12 18925 Ezequiel Lavezzi 1985-05-03 00:00:00 172.72 165
## 13 31921 Gareth Bale 1989-07-16 00:00:00 182.88 163
## 14 25759 Gonzalo Higuain 1987-12-10 00:00:00 182.88 181
## 15 41468 Juan Mata 1988-04-28 00:00:00 170.18 139
## 16 40636 Luis Suarez 1987-01-24 00:00:00 182.88 187
## 17 31097 Luka Modric 1985-09-09 00:00:00 175.26 143
## 18 177714 Mario Goetze 1992-06-03 00:00:00 175.26 159
## 19 36378 Mesut Oezil 1988-10-15 00:00:00 182.88 168
## 20 22543 Radamel Falcao 1986-02-10 00:00:00 177.80 159
## 21 93447 Robert Lewandowski 1988-08-21 00:00:00 185.42 174
## 22 30843 Robin van Persie 1983-08-06 00:00:00 187.96 157
## 23 37412 Sergio Aguero 1988-06-02 00:00:00 172.72 163
## 24 35724 Zlatan Ibrahimovic 1981-10-03 00:00:00 195.58 209
## team_api_id team_long_name
## 1 10260 Manchester United
## 2 9825 Arsenal
## 3 8650 Liverpool
## 4 8456 Manchester City
## 5 8586 Tottenham Hotspur
## 6 8455 Chelsea
## 7 9829 AS Monaco
## 8 9847 Paris Saint-Germain
## 9 8639 LOSC Lille
## 10 9823 FC Bayern Munich
## 11 9789 Borussia Dortmund
## 12 8721 VfL Wolfsburg
## 13 8529 Cagliari
## 14 9885 Juventus
## 15 8564 Milan
## 16 9875 Napoli
## 17 8636 Inter
## 18 10267 Valencia CF
## 19 8633 Real Madrid CF
## 20 8634 FC Barcelona
## 21 9906 Atlético Madrid