Note that throughout this section we will be using the data from problem set 6.
# Libraries
library(tidyverse)
library(knitr)
library(stargazer)
library(sandwich)
library(foreign)
library(plotly)
library(reshape2)
library(glmnet)
library(caTools)
# Import data
ps6 = read.dta('/Users/trevorincerti/Box Sync/Graduate/Teaching/500/section_notes/data/qog_std_cs_15may13.dta')
# Drop all observations such that wdi_gnipc has missing values.
ps6_clean = ps6 %>%
filter(!is.na(wdi_gnipc))
# Create logged wdi_gnipc variable
ps6_clean$ln_wdi_gnipc = log(ps6_clean$wdi_gnipc)
# Keep variables needed for analysis only
ps6_clean = ps6_clean %>%
select(ccode, cname, fh_ipolity2, ln_wdi_gnipc, ciri_empinx_new)
Interpreting a coefficient in a bivariate linear regression (e.g. \(\beta_1\) in \(Y = \alpha + \beta_1 x_1\)).
\(\beta_1\) is the slope of the BLP with respect to \(x\).
So, for our problem set, this is the slope of the BLP with respect to logged GNI. In other words, it tells us the change in E[Y], or E[Polity], associated with a 1 unit change in logged GNI.
Shown numerically:
ps6_lm = lm(fh_ipolity2 ~ ln_wdi_gnipc, data = ps6_clean)
| | Dependent variable: Polity score |
|---|---|
| Logged GNI per capita | 0.846*** |
| | (0.129) |
| Constant | -0.390 |
| | (1.101) |
| Observations | 187 |
| Adjusted R2 | 0.185 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
[Figure: scatterplot of Polity score against logged GNI per capita with a fitted regression line (`geom_smooth()` using formula 'y ~ x')]
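The figure itself was generated with ggplot2. A sketch of the kind of call that produces it (the exact aesthetics and labels in the original may differ):
# Scatterplot of Polity score against logged GNI per capita with a fitted OLS line
ggplot(ps6_clean, aes(x = ln_wdi_gnipc, y = fh_ipolity2)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Logged GNI per capita", y = "Polity score (fh_ipolity2)")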
Interpreting a coefficient in a multivariate linear regression (\(Y = \alpha + \beta_1x_1 + \beta_2x_2\)).
The coefficient now represents a conditional slope.
Generally, we now interpret \(\beta_1\) as the average effect on \(Y\) of a one unit increase in \(x_1\), holding all other predictors fixed.
In this case, \(\beta_1\) now represents the change in E[Polity] associated with a 1 unit change in logged GNI, holding the Empowerment Rights Index constant.
Expressed as a partial derivative:
\[\beta_1 = \frac{\partial E[Y]}{\partial x_1} \]
ps6_lm2 = lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)
| | Dependent variable: Polity score |
|---|---|
| Logged GNI per capita | 0.142* |
| | (0.072) |
| CIRI Empowerment score | 0.631*** |
| | (0.027) |
| Constant | 0.475 |
| | (0.558) |
| Observations | 187 |
| Adjusted R2 | 0.792 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
If the coefficient is twice (technically 1.96 times) the size of the standard error, we may call our result “statistically significant at the .05 level.”
If our null hypothesis is that \(\beta_1 = 0\), the coefficient divided by the standard error is the t-statistic (i.e. \(t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}\)).
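We can see this relationship directly in the coefficient table that `summary()` returns for the multivariate model estimated above; dividing each estimate by its standard error reproduces the reported t-value:
# Coefficient table: estimate, standard error, t-value, p-value
coef_table = summary(ps6_lm2)$coefficients
coef_table
# The t-statistic is the estimate divided by its standard error
coef_table[, "Estimate"] / coef_table[, "Std. Error"]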
This is very useful for quickly reading regression tables, especially when authors do not report “significance stars.”
Why might an author not report stars (hint: not p-hacking)? If it is hard to tell if the coefficient is twice the size of the standard error, what should we conclude?
A 95% confidence interval is a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter.
To test the null hypothesis, we need to determine whether \(\hat{\beta_1}\), our estimate for \(\beta_1\), is sufficiently far from zero that we can be confident that \(\beta_1\) is non-zero. How do we use a confidence interval to tell this?
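In R, `confint()` reports 95% confidence intervals by default; if the interval for a coefficient does not contain zero, we can reject the null hypothesis that the coefficient is zero at the .05 level. For example, using the models estimated above:
# 95% confidence intervals for the bivariate and multivariate models
confint(ps6_lm)
confint(ps6_lm2)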
For experimental data
Randomization inference considers what would have occurred not only under the random assignment that occurred in the experiment, but under all possible random assignments.
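A minimal sketch of this idea using simulated data (the treatment indicator, outcome, and effect size below are hypothetical, not from the problem set): we compare the observed difference in means to the differences that arise under many re-randomizations of treatment.
# Simulated experiment: 100 units, half assigned to treatment at random
set.seed(123)
n = 100
treat = sample(rep(c(0, 1), n / 2))
y = 1 + 0.5 * treat + rnorm(n)
# Observed difference in means
obs_diff = mean(y[treat == 1]) - mean(y[treat == 0])
# Difference in means under 5,000 alternative random assignments
perm_diff = replicate(5000, {
  treat_perm = sample(treat)
  mean(y[treat_perm == 1]) - mean(y[treat_perm == 0])
})
# Share of assignments with a difference at least as extreme as the observed one
mean(abs(perm_diff) >= abs(obs_diff))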
We will walk through a basic example using OLS. To reduce variability, in most methods multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance. For example, you could split your dataset multiple times by keeping the first 90% of your data in training and the last 10% in test, then change which portion of your data is in test versus training 10 times until you have covered the entire dataset. This is called k-fold cross-validation, where k = the number of folds (a sketch follows the one-split example below).
Basic one-split example
# Split dataset into test and training
sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
train = subset(ps6_clean, sample == TRUE)
test = subset(ps6_clean, sample == FALSE)
# Run OLS on training dataset
ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)
# Add predictions to test dataset
test$ols_pred = predict(ols_train, test)
# Calculate out-of-sample MSE
mse_ols = mean((test$fh_ipolity2 - test$ols_pred)^2)
mse_ols
## [1] 1.471299
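The paragraph above describes k-fold cross-validation. A minimal sketch of a manual 10-fold version using the same model (the fold assignment below is one reasonable way to do it; `cv.glmnet` later in these notes does this internally):
# Manual 10-fold cross-validation
set.seed(123)
k = 10
folds = sample(rep(1:k, length.out = nrow(ps6_clean)))
mse_kfold = rep(NA, k)
for (i in 1:k) {
  # Hold out fold i for validation; fit on the remaining k - 1 folds
  train_k = ps6_clean[folds != i, ]
  test_k = ps6_clean[folds == i, ]
  fit_k = lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train_k)
  mse_kfold[i] = mean((test_k$fh_ipolity2 - predict(fit_k, test_k))^2)
}
# Average MSE across the 10 folds
mean(mse_kfold)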
Example using Monte Carlo cross-validation. The advantage of this method (over k-fold cross-validation) is that the proportion of the training/validation split does not depend on the number of iterations (i.e., the number of partitions). The disadvantage is that some observations may never be selected in the validation subsample, whereas others may be selected more than once. In other words, validation subsets may overlap.
sims = 100
mse_en = rep(NA, sims)
for (i in 1:sims) {
  # Split dataset into test and training
  sample = sample.split(ps6_clean$fh_ipolity2, SplitRatio = .9)
  train = subset(ps6_clean, sample == TRUE)
  test = subset(ps6_clean, sample == FALSE)
  # Run OLS on training dataset
  ols_train <- lm(fh_ipolity2 ~ ln_wdi_gnipc + ciri_empinx_new, data = train)
  # Add predictions to test dataset
  test$ols_pred = predict(ols_train, test)
  # Calculate out-of-sample MSE
  mse_en[i] <- mean((test$fh_ipolity2 - test$ols_pred)^2)
}
mean_mse <- mean(mse_en)
print(mean_mse)
## [1] 2.032445
Consider the following statements:
This is a parametric statistical model with parameters \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would know the exact probability model.]
This is identical to the previous statement except described verbally.
Semi-parametric statistical model with parameters of interest \(\alpha\) and \(\beta\). [If we knew \(\alpha\) and \(\beta\), we would not know the exact probability model, but we would know everything we want to know: the regression function.]
Non-parametric statistical model. [No way to parameterize the model with a finite number of parameters.]
# Create binary outcome variable
ps6_clean$dem = with(ps6_clean, ifelse(fh_ipolity2 > 5, 1, 0))
# OLS
ols = lm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean)
# Logit
logit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean,
family = binomial(link = "logit"))
# Probit
probit = glm(dem ~ ln_wdi_gnipc + ciri_empinx_new, data = ps6_clean,
family = binomial(link = "probit"))
Dependent variable: Democracy

| | OLS | Logit | Probit |
|---|---|---|---|
| Constant | 0.11 | -5.41*** | -3.08*** |
| | (0.12) | (1.68) | (0.93) |
| Logged GNI per capita | -0.02 | 0.19 | 0.10 |
| | (0.02) | (0.19) | (0.11) |
| CIRI Empowerment score | 0.09*** | 0.74*** | 0.42*** |
| | (0.01) | (0.11) | (0.06) |
| Observations | 187 | 187 | 187 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
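Because the logit and probit coefficients are on the log-odds and standard-normal (z) scales respectively, they are not directly comparable to the OLS coefficients. One way to compare the models is through their predicted probabilities of democracy; a quick sketch using the fits above, collected in a helper data frame (`pred_compare` is a name introduced here, not from the original notes):
# Predicted probability of democracy from each model
pred_compare = data.frame(
  cname = ps6_clean$cname,
  pred_ols = predict(ols),
  pred_logit = predict(logit, type = "response"),
  pred_probit = predict(probit, type = "response")
)
head(pred_compare)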
Note that this uses the cv.glmnet function, where "cv" stands for cross-validation. In this case 10-fold cross-validation is used to pick the value of the tuning parameter lambda.
A 10-fold CV will randomly divide your observations into 10 non-overlapping groups/folds of approximately equal size. Each fold is used once as the validation set while the model is fit on the remaining 9 folds. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda that minimizes mean squared error.
# Set up matrices
x = model.matrix(fh_ipolity2 ~ ., ps6_clean)[, -1]
y = ps6_clean$fh_ipolity2
# Run ridge net model
cv_ridge_fit <- cv.glmnet(x, y, alpha = 0)
# Run lasso model
cv_lasso_fit <- cv.glmnet(x, y, alpha = 1)
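To see what the cross-validation selected, we can extract the value of lambda that minimizes cross-validated MSE and the coefficients at that value (`lambda.min` and `coef()` are standard cv.glmnet accessors):
# Lambda values minimizing cross-validated MSE
cv_ridge_fit$lambda.min
cv_lasso_fit$lambda.min
# Coefficients at the MSE-minimizing lambda (lasso can shrink some exactly to zero)
coef(cv_ridge_fit, s = "lambda.min")
coef(cv_lasso_fit, s = "lambda.min")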