Introduction to dplyr, potential outcomes, and random assignment

Introduction
Terminology
Loading libraries in R
Importing data
Data cleaning and manipulation in dplyr
Random assignment using randomizr and the switching equation
Group by and summarise
Loops
ggplot2

Introduction

Officially, the only pre-requisite is any course covering (at any level of detail) linear regression.

This course does not assume prior knowledge of R or a background in statistical computing. However, the learning curve is steep. For example, by week two you will need to be comfortable with loops in R.

You will also be expected to submit your problem sets using rmarkdown.

The primary purpose of these sections is therefore to assist you in gaining the computational skills necessary to understand the material, complete your problem sets, and (most importantly) implement experiments in the real world.

Terminology

Estimand, estimator, and estimate

Estimand: our quantity of interest.
Estimator: Statistical procedure (i.e. method) used to make a guess about a parameter. This is a sample statistic, and therefore a random variable. We use the “hat” (\(\hat{\theta}\) vs. \(\theta\)) symbol to denote an estimator.
Estimate: the value of the estimand that results from use of a particular estimator (i.e. the value the estimator takes on).

Loading libraries in R

Below are some useful R preamble functions and libraries that we will use in class regularly. Note that before beginning a new work session or starting or working on a new script, you are stronly encouraged to restart your R session.

The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Knitr is a package for producing dynamic reports (e.g. PDFs) directly with R.

KableExtra is a package that builds flxibile and beautiful tables in R.

randomizr is designed to make conducting field, lab, survey, or online experiments easier by automating the random assignment process.

# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Preamble, libraries, and import
# ______________________________________________________________________________

# Libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(randomizr)

# Set a seed
set.seed(30)

Importing data

First, we are going to import data from a CSV file.

# Import CSV
week1 = read_csv("week1.csv")

# View first 20 rows of data
head(week1, n = 20)

## # A tibble: 20 x 4
##       X1    X2   Yi0   Yi1
##    <dbl> <dbl> <dbl> <dbl>
##  1     0     0     6    17
##  2     0     0     9    12
##  3     1     1     8    17
##  4     1     0    14    13
##  5     1     0    15    14
##  6     1     1     5    15
##  7     0     1    10    13
##  8     1     0     8    16
##  9     1     1     8    16
## 10     0     0    11    15
## 11     0     0     7    15
## 12     0     1     5    16
## 13     1     0     8    13
## 14     1     1    10     9
## 15     1     0    13    19
## 16     0     1    11    11
## 17     0     0    10    16
## 18     1     0     8    12
## 19     1     0     6    18
## 20     1     1     4    19

Data cleaning and manipulation in dplyr

Next we will perform some basic data cleaning and manipulation functions using dplyr.

There are many, many more useful functions in dplyr that are not shown here. Consult “R for Data Science” for a comprehensive overview. If you think a function might exist, or would like to manipulate your data in a particular way but don’t know how, try Googling “how to X in dplyr”!

# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Basic data cleaning and manipulation using dplyr
# ______________________________________________________________________________

# Let's select only the variables we need using the "select" function
week1 %>% select(Yi1, Yi0)

## # A tibble: 100 x 2
##      Yi1   Yi0
##    <dbl> <dbl>
##  1    17     6
##  2    12     9
##  3    17     8
##  4    13    14
##  5    14    15
##  6    15     5
##  7    13    10
##  8    16     8
##  9    16     8
## 10    15    11
## # … with 90 more rows

# We can also remove variables we don't need in the following manner
week1 %>% select(-X1, -X2)

## # A tibble: 100 x 2
##      Yi0   Yi1
##    <dbl> <dbl>
##  1     6    17
##  2     9    12
##  3     8    17
##  4    14    13
##  5    15    14
##  6     5    15
##  7    10    13
##  8     8    16
##  9     8    16
## 10    11    15
## # … with 90 more rows

# We can add variables using the mutate function
week1 %>% mutate(X3 = 1)

## # A tibble: 100 x 5
##       X1    X2   Yi0   Yi1    X3
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     0     6    17     1
##  2     0     0     9    12     1
##  3     1     1     8    17     1
##  4     1     0    14    13     1
##  5     1     0    15    14     1
##  6     1     1     5    15     1
##  7     0     1    10    13     1
##  8     1     0     8    16     1
##  9     1     1     8    16     1
## 10     0     0    11    15     1
## # … with 90 more rows

# We can use "pipes" to complete multiple dplyr procesese at once
week1 %>% 
  mutate(X3 = 1) %>%
  select(X1, X2, X3, Yi0, Yi1)

## # A tibble: 100 x 5
##       X1    X2    X3   Yi0   Yi1
##    <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     0     0     1     6    17
##  2     0     0     1     9    12
##  3     1     1     1     8    17
##  4     1     0     1    14    13
##  5     1     0     1    15    14
##  6     1     1     1     5    15
##  7     0     1     1    10    13
##  8     1     0     1     8    16
##  9     1     1     1     8    16
## 10     0     0     1    11    15
## # … with 90 more rows

# Now let's select only the variable we need and save a new dataframe
week1_clean = week1 %>% select(Yi1, Yi0)

Random assignment using randomizr and the switching equation

The code below uses the randomizer package to perform simple and complete random assignment.

Simple random assignment assigns all subjects to treatment with an equal probability (e.g., by flipping a (weighted) coin for each subject). Why might simple random assignment not be the best idea?

Answer: A different number of subjects might be assigned to each group (e.g., 40 to treatment and 60 to control with N = 100). The number of subjects assigned to treatment is itself a random number.

Complete random assignment allows the researcher to specify exactly how many units are assigned to each condition (e.g., 50 to treatment and 50 to control with N = 100).

# ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
# Random assignment using randomizr and the switching equation
# ______________________________________________________________________________

# We can perform simple random assignment, but this is rare in practice
Z = simple_ra(N = 100) # N = number of observations

# Complete random assignment - all of these functions are equivalent
Z = complete_ra(N = 100) # N = number of observations
Z = complete_ra(N = 100, m = 50) # m = number of units in treatment
Z = complete_ra(N = 100, prob = 0.5) # prob = prob of assignment to treatment

Now let’s do something a bit more complex and combine pipes in dplyr with randomizr. We will do the following:

Create a vector of randomly assigned treatment statuses.
Create a treatment indicator variable \(D\) based on this vector of treatment assignments.
Create a variable of revealed potential outcomes \(Y\) using the switching equation.
Acknowledge the fundamental problem of causal inference by removing the original \(Y_i(0)\) and \(Y_i(1)\) variables from our dataset.
Calculate the average treatment effect across all units in the sample:

(\(E[Y_i(1) | D = 1] - E[Y_i(0) | D = 0]\)).

# Perform functions described above
analysis = week1_clean %>%
  mutate(
    D = complete_ra(N = 100), # Randomly assign treatment statuses and create D variable
    Y = ifelse(D == 1, Yi1, Yi0), # Reveal treatment or control outcomes and create Y variable
    Y = (D * Yi1) + (1 - D) * Yi0, # Or using switching equation
    ) %>% 
  select(D, Y) %>% # Remove full schedule of potential outcomes. Can never see!
  mutate(ATE = mean(Y[D == 1]) - mean(Y[D == 0])) # Calculate ATE

# Show final dataset
analysis

## # A tibble: 100 x 3
##        D     Y   ATE
##    <int> <dbl> <dbl>
##  1     0     6   5.2
##  2     0     9   5.2
##  3     1    17   5.2
##  4     1    13   5.2
##  5     1    14   5.2
##  6     0     5   5.2
##  7     1    13   5.2
##  8     1    16   5.2
##  9     0     8   5.2
## 10     0    11   5.2
## # … with 90 more rows

We could also calculate treatment effects for indvidual units here, but this is a completely hypothetical exercise. Why?

Answer: In reality, we can never see an individual unit’s treatment and control potential outcomes. If a unit is treated, it reveals its treatment outcome. If it is in control, it reveals its control outcome. This fact—that it is impossible to observe the causal effect on a single unit—is known as the Fundamental Problem of Causal Inference.

week1_clean %>% mutate(diff = Yi1 - Yi0)

## # A tibble: 100 x 3
##      Yi1   Yi0  diff
##    <dbl> <dbl> <dbl>
##  1    17     6    11
##  2    12     9     3
##  3    17     8     9
##  4    13    14    -1
##  5    14    15    -1
##  6    15     5    10
##  7    13    10     3
##  8    16     8     8
##  9    16     8     8
## 10    15    11     4
## # … with 90 more rows

Group by and summarise

Now we will use another particularly useful combination of dplyr functions often refered to as “group by summarise.” This allows us to calculate a number of summary statistics for a given dataframe/tibble.

# Create a dataframe showing us groupmeans and standard errors
groupmeans = analysis %>%
  group_by(D) %>%
  summarise(
    groupmean = mean(Y),
    size = n(),
    var = var(Y),
    se = sqrt(var / size),
    ATE = mean(ATE)
    )

Let’s display the output of our function in a pretty manner using Kable. Note that Kable has many more functions for customizing tables for output to both PDF and HTML that you can explore.

kable(groupmeans) %>% kable_styling(position = "center")

D	groupmean	size	var	se	ATE
0	9.2	50	10.285714	0.4535574	5.2
1	14.4	50	9.795918	0.4426267	5.2

Loops

Loops allow you to reduce duplication in your code by repeating a function iteratively. This helps you when you need to do the same thing to multiple inputs, for example repeating the same operation on different rows, columns, or on different datasets.

As a general rule of thumb, you should not copy and past the same line of code to repeat a function more than two or three times, but there will also be times when you need to repeat a function more times than is practically possible by copying lines of code. These kinds of operations are often essential for random sampling and simulation, and therefore experiments.

First, we will practice a basic loop where we create a empty vector and fill it with random draws from a normal distribution. You can create “for loops,” “while loops,” and “repeat loops” in R. For this class you will need a firm understanding of “for loops.”

# Set the number of times we will draw from normal distribution
draws = 10000 # Try to get in the habit of not hardcoding these values directly into loops

# Create an empty vector of length draws to store our results
vector = rep(NA, draws)
normal_dist = rnorm(n = 10000, mean = 100, sd = 10)

# Basic loop drawing from normal distribution
for (i in 1:draws) {
  vector[i] = sample(normal_dist, size = 1, replace = TRUE)
}

Next, we will create a plot of this thing using base R plotting. Note that I am only showing you base R plots so that you can see how much we can improve on this with ggplot in a moment!

hist(vector)

Next week we will practice a procedure for hypothesis testing called randomization inference. This will require you to have a firm understanding of loops, so please ensure that you do, or come to office hours if anything is confusing!

ggplot2

Now we will create a plot using ggplot2—arguably the best and most flexible data visualization platform around. Other languages and packages have sophisticated plotting functions (e.g., Python’s Seaborn, Stata’s plotting functions, Tableau, etc.), but none are as comprehensive, flexible, and developed as ggplot2.

First, let’s recreate the histogram above using ggplot2.

# ggplot requires a dataframe
vector = data.frame(vector)

# Tell R that D is binary - will matter for plotting
analysis = analysis %>% mutate(D = as.factor(D))

# Create histogram
ggplot(vector, aes(vector)) +
  geom_histogram(color = "white", fill = "steelblue2") + 
  xlab("Vector") +
  ylab("Frequency") +
  theme_classic()

Now let’s go step by step through how ggplot works, and construct a visualization of the average treatment effect we calculated above.

ggplot elements can be added iteratively in layers. In other words, you can continually add new elements to or on top of a plot. Let’s start by creating a basic scatterplot of our outcome variable \(Y\) versus our treatment indicator \(D\).

# Create and save plot
ate_plot = 
  ggplot(analysis, aes(x = D, y = Y)) + 
  geom_point()

# Display plot
ate_plot

That’s a bit weird? We have 50 observations in each treatment group, but only 15 observations appear in control and 13 in treatment on the plot. This is because our points are on top of each other, we can give them some wiggle room using the “geom_jitter” function.

# Create and save plot
ate_plot = 
  ggplot(analysis, aes(x = D, y = Y)) + 
  geom_jitter()

# Display plot
ate_plot

Maybe we don’t want that much horizontal jitter. Let’s reduce the width but add on some alpha - which is ggplot speak for transparency - and use a more interesting color than black.

# Create and save plot
ate_plot = 
  ggplot(analysis, aes(x = D, y = Y)) + 
  geom_jitter(height = 1, width = 0.02, alpha = 0.8, color = "steelblue2")

# Display plot
ate_plot

Let’s rename our y axis. Never leave default axis labels. Always try to make things more clear for your reader.

# Create and save plot
ate_plot = ate_plot + 
  ylab("Outcome variable")

# Display plot
ate_plot

Now let’s make the x axis more clear.

# Create and save plot
ate_plot = ate_plot + 
  scale_x_discrete("", labels = c("0" = "Control", "1" = "Treatment"), 
                   expand = c(0.2, 0))

# Display plot
ate_plot

Now let’s add some functions that clean up the visualization.

# Create and save plot
ate_plot = ate_plot + 
  theme_classic() + # Use simple black and white theme
  theme(axis.title = element_text(size = 12)) + # Change axis title text size
  theme(axis.text = element_text(size = 12)) # Change axis label text size

# Display plot
ate_plot

Finally, let’s add a regression line to characterize the average treatment effect. We will get into this later, but the slope of this line is equal to the average treatment effect.

# Create and save plot
ate_plot = ate_plot + 
    stat_smooth(method = "lm", 
              se = FALSE, color = "grey75", size = 0.75, 
              aes(group = 1))

# Display plot
ate_plot

For now we’re keeping things fairly simple, but in the future we will need to create more complex plots. For example, we will want to add confidence intervals to estimates, break charts out by subgroup (called “faceting”) when looking at heterogenous effects, etc. ggplot can handle all of these tasks with ease.