R Regression Models

Topics

  • Formula interface for model specification
  • Function methods for extracting quantities of interest from models
  • Contrasts to test specific hypotheses
  • Model comparisons
  • Predicted marginal effects

Setup

Class Structure

  • Informal — Ask questions at any time. Really!
  • Collaboration is encouraged - please spend a minute introducing yourself to your neighbors!

Prerequisites

This is an intermediate R course:

  • Assumes working knowledge of R
  • Relatively fast-paced
  • This is not a statistics course! We will teach you how to fit models in R, but we assume you know the theory behind the models.

Goals

We will learn about the R modeling ecosystem by fitting a variety of statistical models to different datasets. In particular, our goals are to learn about:

  1. Modeling workflow
  2. Visualizing and summarizing data before modeling
  3. Modeling continuous outcomes
  4. Modeling binary outcomes
  5. Modeling clustered data

We will not spend much time interpreting the models we fit, since this is not a statistics workshop. But we will walk you through how model results are organized and orient you to where you can find typical quantities of interest.

Launch an R session

Start RStudio and create a new project:

  • On Windows, click the Start button and search for RStudio. On Mac, RStudio will be in your Applications folder.
  • In RStudio, go to File -> New Project.
  • Choose Existing Directory and browse to the workshop materials directory on your desktop.
  • Choose File -> Open File and select the file with the word “BLANK” in the name.

Packages

You should have already installed the tidyverse and rmarkdown packages onto your computer before the workshop — see R Installation. Now let’s load these packages into the search path of our R session.

library(tidyverse)
library(rmarkdown)

Finally, let's install some packages that will help with modeling:

# install.packages("lme4")
library(lme4)  # for mixed models

# install.packages("emmeans")
library(emmeans)  # for marginal effects

# install.packages("effects")
library(effects)  # for predicted marginal means

Modeling workflow

Before we delve into the details of how to fit models in R, it’s worth taking a step back and thinking more broadly about the components of the modeling process. These can roughly be divided into 3 stages:

  1. Pre-estimation
  2. Estimation
  3. Post-estimation

At each stage, the goal is to complete a different task (e.g., to clean data, fit a model, test a hypothesis), but the process is sequential — we move through the stages in order (though often many times in one project!).

Throughout this workshop we will go through these stages several times as we fit different types of model.

R modeling ecosystem

There are literally hundreds of R packages that provide model fitting functionality. We're going to focus on just two during this workshop — stats, from Base R, and lme4. It's a good idea to look at CRAN Task Views when trying to find a modeling package for your needs, as they provide an extensive curated list. But here's a more digestible table showing some of the most popular packages for particular types of model.

Models              Packages
Generalized linear  stats, biglm, MASS, robustbase
Mixed effects       lme4, nlme, glmmTMB, MASS
Econometric         pglm, VGAM, pscl, survival
Bayesian            brms, blme, MCMCglmm, rstan
Machine learning    mlr, caret, h2o, tensorflow

Before fitting a model

GOAL: To learn about the data by creating summaries and visualizations.

One important part of the pre-estimation stage of model fitting is gaining an understanding of the data we wish to model by creating plots and summaries. Let's do this now.

Load the data

List the data files we’re going to work with:

list.files("dataSets")
## [1] "Exam.rds"          "NatHealth2008MI"   "NatHealth2011.rds" "states.rds"

We’re going to use the states data first, which originally appeared in Statistics with Stata by Lawrence C. Hamilton.

  # read the states data
  states_data <- read_rds("dataSets/states.rds")

  # look at the last few rows
  tail(states_data)
##            state  region     pop  area density metro waste energy miles toxic green house senate csat vsat msat percent expense income high college
## 46       Vermont N. East  563000  9249   60.87  23.4  0.69    232  10.4  1.81 15.17    85     94  890  424  466      68    6738 34.717 80.8    24.3
## 47      Virginia   South 6187000 39598  156.25  72.5  1.45    306   9.7 12.87 18.72    33     54  890  424  466      60    4836 38.838 75.2    24.5
## 48    Washington    West 4867000 66582   73.10  81.7  1.05    389   9.2  8.51 16.51    52     64  913  433  480      49    5000 36.338 83.8    22.9
## 49 West Virginia   South 1793000 24087   74.44  36.4  0.95    415   8.6 21.30 51.14    48     57  926  441  485      17    4911 24.233 66.0    12.3
##  [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

Variable  Description
csat      Mean composite SAT score
expense   Per-pupil expenditures
percent   % of HS graduates taking the SAT
income    Median household income, $1,000s
region    Geographic region: West, N. East, South, Midwest
house     House '91 environmental voting, %
senate    Senate '91 environmental voting, %
energy    Per capita energy consumed, Btu
metro     Metropolitan area population, %
waste     Per capita solid waste, tons

Examine the data

Start by examining the data to check for problems.

  # summary of expense and csat columns, all rows
  sts_ex_sat <- 
      states_data %>% 
      select(expense, csat)
  
  summary(sts_ex_sat)
##     expense          csat     
##  Min.   :2960   Min.   : 832  
##  1st Qu.:4352   1st Qu.: 888  
##  Median :5000   Median : 926  
##  Mean   :5236   Mean   : 944  
##  3rd Qu.:5794   3rd Qu.: 997  
##  Max.   :9259   Max.   :1093
  # correlation between expense and csat
  cor(sts_ex_sat, use = "pairwise") 
##           expense      csat
## expense  1.000000 -0.466298
## csat    -0.466298  1.000000

Plot the data

Plot the data to look for multivariate outliers, non-linear relationships, etc.

  # scatter plot of expense vs csat
  plot(sts_ex_sat)

Obviously, in a real project, you would want to spend more time investigating the data, but we’ll now move on to modeling.

Models with continuous outcomes

GOAL: To learn about the R modeling ecosystem by fitting ordinary least squares (OLS) models. In particular:

  1. Formula representation of a model specification
  2. Model classes
  3. Function methods
  4. Model comparison

Once the data have been inspected and cleaned, we can start estimating models. The simplest models (but those with the most assumptions) are those for continuous and unbounded outcomes. Typically, for these outcomes, we'd use a model estimated using Ordinary Least Squares (OLS), which in R can be fit with the lm() (linear model) function.

To fit a model in R, we first have to convert our theoretical model into a formula — a symbolic representation of the model in R syntax:

# formula for model specification
outcome ~ pred1 + pred2 + pred3

# NOTE the ~ is a tilde

For example, the following model predicts SAT scores based on per-pupil expenditures:

\[ csat_i = \beta_{0}1 + \beta_1expense_i + \epsilon_i \]

We can use lm() to fit this model:

  # fit our regression model
  sat_mod <- lm(csat ~ 1 + expense, # regression formula
                data = states_data) # data 

  # look at the basic printed output
  sat_mod
## 
## Call:
## lm(formula = csat ~ 1 + expense, data = states_data)
## 
## Coefficients:
## (Intercept)      expense  
##   1060.7324      -0.0223
  # get more informative summary information 
  # about coefficients and goodness-of-fit
  summary(sat_mod)
## 
## Call:
## lm(formula = csat ~ 1 + expense, data = states_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -131.81  -38.08    5.61   37.85  136.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1060.73244   32.70090   32.44  < 2e-16
## expense       -0.02228    0.00604   -3.69  0.00056
## 
## Residual standard error: 59.8 on 49 degrees of freedom
## Multiple R-squared:  0.217,  Adjusted R-squared:  0.201 
## F-statistic: 13.6 on 1 and 49 DF,  p-value: 0.000563
  # show only the regression coefficients table 
  summary(sat_mod) %>% coef() 
##                 Estimate  Std. Error  t value    Pr(>|t|)
## (Intercept) 1060.7324439 32.70089674 32.43741 8.87841e-35
## expense       -0.0222756  0.00603711 -3.68978 5.63090e-04

Why is the association between expense & SAT scores negative?

Many people find it surprising that per-pupil expenditure is negatively related to SAT scores. The beauty of multiple regression is that we can try to pull these effects apart. What would the association between expense and SAT scores be if there were no difference among the states in the percentage of students taking the SAT?

  lm(csat ~ 1 + expense + percent, data = states_data) %>% 
  summary() 
## 
## Call:
## lm(formula = csat ~ 1 + expense + percent, data = states_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62.92 -24.32   1.74  15.50  75.62 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)
## (Intercept) 989.8074    18.3958   53.81            < 2e-16
## expense       0.0086     0.0042    2.05              0.046
## percent      -2.5377     0.2249  -11.28 0.0000000000000042
## 
## Residual standard error: 31.6 on 48 degrees of freedom
## Multiple R-squared:  0.786,  Adjusted R-squared:  0.777 
## F-statistic:   88 on 2 and 48 DF,  p-value: <2e-16

The lm class & methods

Okay, we fitted our model. Now what? Typically, the main goal in the post-estimation stage of analysis is to extract quantities of interest from our fitted model. These quantities could be things like:

  1. Testing whether one group is different on average from another group
  2. Generating average response values from the model for interesting combinations of predictor values
  3. Calculating interval estimates for particular coefficients

But before we can do any of that, we need to know more about what a fitted model actually is, what information it contains, and how we can extract from it information that we want to report.

Let’s start by examining the model object:

  class(sat_mod)
## [1] "lm"
  str(sat_mod)
## List of 12
##  $ coefficients : Named num [1:2] 1060.7324 -0.0223
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "expense"
##  $ residuals    : Named num [1:51] 11.1 44.8 -32.7 26.7 -63.7 ...
##   ..- attr(*, "names")= chr [1:51] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:51] -6742.2 220.7 -31.7 29.7 -63.2 ...
##   ..- attr(*, "names")= chr [1:51] "(Intercept)" "expense" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:51] 980 875 965 978 961 ...
##   ..- attr(*, "names")= chr [1:51] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:51, 1:2] -7.14 0.14 0.14 0.14 0.14 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:51] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "expense"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.14 1.33
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 0.0000001
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 49
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = csat ~ 1 + expense, data = states_data)
##  $ terms        :Classes 'terms', 'formula'  language csat ~ 1 + expense
##   .. ..- attr(*, "variables")= language list(csat, expense)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "csat" "expense"
##   .. .. .. ..$ : chr "expense"
##   .. ..- attr(*, "term.labels")= chr "expense"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(csat, expense)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "csat" "expense"
##  $ model        :'data.frame':   51 obs. of  2 variables:
##   ..$ csat   : int [1:51] 991 920 932 1005 897 959 897 892 840 882 ...
##   ..$ expense: int [1:51] 3627 8330 4309 3700 4491 5064 7602 5865 9259 5276 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language csat ~ 1 + expense
##   .. .. ..- attr(*, "variables")= language list(csat, expense)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "csat" "expense"
##   .. .. .. .. ..$ : chr "expense"
##   .. .. ..- attr(*, "term.labels")= chr "expense"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(csat, expense)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "csat" "expense"
##  - attr(*, "class")= chr "lm"
  names(sat_mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"            "df.residual"   "xlevels"       "call"          "terms"         "model"

We can list all the functions that extract particular quantities of interest (called extractor functions) by using the methods() function with the class argument set to the class of the model object:

methods(class = class(sat_mod))
##  [1] add1           addterm        alias          anova          boxcox         case.names     coerce         confint        cooks.distance deviance       dfbeta         dfbetas        drop1          dropterm      
## [15] dummy.coef     Effect         effects        emm_basis      extractAIC     family         formula        fortify        hatvalues      influence      initialize     kappa          labels         logLik        
## [29] logtrans       model.frame    model.matrix   nobs           plot           predict        print          proj           qqnorm         qr             recover_data   residuals      rstandard      rstudent      
## [43] show           simulate       slotsFromS3    summary        variable.names vcov          
## see '?methods' for accessing help and source code

We can also use function methods to get more information about the fit:

  summary(sat_mod) # summary table
## 
## Call:
## lm(formula = csat ~ 1 + expense, data = states_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -131.81  -38.08    5.61   37.85  136.50 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1060.73244   32.70090   32.44  < 2e-16
## expense       -0.02228    0.00604   -3.69  0.00056
## 
## Residual standard error: 59.8 on 49 degrees of freedom
## Multiple R-squared:  0.217,  Adjusted R-squared:  0.201 
## F-statistic: 13.6 on 1 and 49 DF,  p-value: 0.000563
  confint(sat_mod) # confidence intervals
##                   2.5 %       97.5 %
## (Intercept) 995.0175316 1126.4473563
## expense      -0.0344077   -0.0101436
  anova(sat_mod) # ANOVA table  
## Analysis of Variance Table
## 
## Response: csat
##           Df Sum Sq Mean Sq F value   Pr(>F)
## expense    1  48708   48708   13.62 0.000563
## Residuals 49 175306    3578

How does R know which method to call for a given object? R uses generic functions, which provide access to methods. Method dispatch takes place based on the class of the first argument to the generic function. For example, for the generic function summary() and an object of class lm, the method dispatched will be summary.lm(). Methods always take the form generic.class():

methods("summary")
##   [1] summary,ANY-method             summary,DBIObject-method       summary,diagonalMatrix-method  summary,sparseMatrix-method    summary.aareg*                 summary.allFit*               
##   [7] summary.aov                    summary.aovlist*               summary.aspell*                summary.cch*                   summary.check_packages_in_dir* summary.connection            
##  [13] summary.corAR1*                summary.corARMA*               summary.corCAR1*               summary.corCompSymm*           summary.corExp*                summary.corGaus*              
##  [19] summary.corIdent*              summary.corLin*                summary.corNatural*            summary.corRatio*              summary.corSpher*              summary.corStruct*            
##  [25] summary.corSymm*               summary.coxph*                 summary.coxph.penal*           summary.data.frame             summary.Date                   summary.DBIrepdesign*         
##  [31] summary.DBIsvydesign*          summary.default                summary.Duration*              summary.ecdf*                  summary.eff*                   summary.efflatent*            
##  [37] summary.efflist*               summary.effpoly*               summary.emm_list*              summary.emmGrid*               summary.factor                 summary.ggplot*               
##  [43] summary.glht*                  summary.glht_list*             summary.glm                    summary.gls*                   summary.hcl_palettes*          summary.infl*                 
##  [49] summary.Interval*              summary.lm                     summary.lme*                   summary.lmList*                summary.lmList4*               summary.loess*                
##  [55] summary.loglm*                 summary.manova                 summary.matrix                 summary.mcmc*                  summary.mcmc.list*             summary.merMod*               
##  [61] summary.MIresult*              summary.mlm*                   summary.mlm.efflist*           summary.modelStruct*           summary.multinom*              summary.negbin*               
##  [67] summary.nls*                   summary.nlsList*               summary.nnet*                  summary.packageStatus*         summary.pdBlocked*             summary.pdCompSymm*           
##  [73] summary.pdDiag*                summary.pdIdent*               summary.pdLogChol*             summary.pdMat*                 summary.pdNatural*             summary.pdSymm*               
##  [79] summary.Period*                summary.polr*                  summary.POSIXct                summary.POSIXlt                summary.ppr*                   summary.pps*                  
##  [85] summary.prcomp*                summary.prcomplist*            summary.predictorefflist*      summary.princomp*              summary.proc_time              summary.pyears*               
##  [91] summary.ratetable*             summary.reStruct*              summary.rlang_error*           summary.rlang_trace*           summary.rlm*                   summary.shingle*              
##  [97] summary.srcfile                summary.srcref                 summary.stepfun                summary.stl*                  
##  [ reached getOption("max.print") -- omitted 35 entries ]
## see '?methods' for accessing help and source code
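
To see dispatch in action: because sat_mod has class lm, calling the generic summary() on it is equivalent to calling the method summary.lm() directly:

  s1 <- summary(sat_mod)    # generic; dispatches on class(sat_mod)
  s2 <- summary.lm(sat_mod) # calling the method directly
  identical(coef(s1), coef(s2)) # TRUE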

It’s always worth examining whether the class of model you’re fitting has a method for a particular extractor function. Here’s a summary table of some of the most often used extractor functions, which have methods for a wide range of model classes. These are post-estimation tools you will want in your toolbox:

Function      Package         Output
summary()     stats (base R)  standard errors, test statistics, p-values, GOF stats
confint()     stats (base R)  confidence intervals
anova()       stats (base R)  ANOVA table (one model), model comparison (> one model)
coef()        stats (base R)  point estimates
drop1()       stats (base R)  model comparison
predict()     stats (base R)  predicted response values
fitted()      stats (base R)  predicted response values (for observed data)
residuals()   stats (base R)  residuals
fixef()       lme4            fixed-effect point estimates (mixed models only)
ranef()       lme4            random-effect point estimates (mixed models only)
coef()        lme4            empirical Bayes estimates (mixed models only)
allEffects()  effects         predicted marginal means
emmeans()     emmeans         predicted marginal means & marginal effects
margins()     margins         predicted marginal means & marginal effects
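
To make this concrete, here are a few of these extractors applied to the model we fit earlier (output omitted; the expense values passed to predict() are just illustrative):

  coef(sat_mod)             # point estimates
  confint(sat_mod)          # confidence intervals
  head(fitted(sat_mod))     # predicted response values for the observed data
  head(residuals(sat_mod))  # residuals
  predict(sat_mod, newdata = data.frame(expense = c(4000, 6000))) # predictions for new data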

OLS regression assumptions

OLS regression relies on several assumptions, including:

  1. The model includes all relevant variables (i.e., no omitted variable bias).
  2. The model is linear in the parameters (i.e., the coefficients and error term).
  3. The error term has an expected value of zero.
  4. All right-hand-side variables are uncorrelated with the error term.
  5. No right-hand-side variables are a perfect linear function of other RHS variables.
  6. Observations of the error term are uncorrelated with each other.
  7. The error term has constant variance (i.e., homoscedasticity).
  8. (Optional - only needed for inference). The error term is normally distributed.

Investigate assumptions #7 and #8 visually by plotting your model:

  par(mfrow = c(2, 2)) # splits the plotting window into 4 panels
  plot(sat_mod)

Comparing models

Do congressional voting patterns predict SAT scores over and above expense? Fit two models and compare them:

  # fit another model, adding house and senate as predictors
  sat_voting_mod <- lm(csat ~ 1 + expense + house + senate,
                        data = na.omit(states_data))

  summary(sat_voting_mod) %>% coef()
##                 Estimate  Std. Error   t value    Pr(>|t|)
## (Intercept) 1082.9343804 38.63381274 28.030741 1.06779e-29
## expense       -0.0187083  0.00969149 -1.930385 6.00200e-02
## house         -1.4424375  0.60047838 -2.402147 2.05867e-02
## senate         0.4981786  0.51356136  0.970047 3.37326e-01

Why are we using na.omit()? Let’s see what na.omit() does.

# fake data
dat <- data.frame(
  x = 1:5,
  y = c(3, 2, 1, NA, 5),
  z = c(6, NA, 2, 7, 3))
dat
##   x  y  z
## 1 1  3  6
## 2 2  2 NA
## 3 3  1  2
## 4 4 NA  7
## 5 5  5  3
na.omit(dat) # listwise deletion of observations
##   x y z
## 1 1 3 6
## 3 3 1 2
## 5 5 5 3
# also see
# ?complete.cases
dat[with(dat, complete.cases(x, y, z)), ]
##   x y z
## 1 1 3 6
## 3 3 1 2
## 5 5 5 3

To compare models, we must fit them to the same data. This is why we need na.omit(). Now let’s update our first model using na.omit():

  sat_mod <- 
      sat_mod %>%
      update(data = na.omit(states_data))

  # compare using an F-test with the anova() function
  anova(sat_mod, sat_voting_mod)
## Analysis of Variance Table
## 
## Model 1: csat ~ 1 + expense
## Model 2: csat ~ 1 + expense + house + senate
##   Res.Df    RSS Df Sum of Sq     F Pr(>F)
## 1     46 169050                          
## 2     44 149284  2     19766 2.913 0.0649

Exercise 0

Ordinary least squares regression

Use the states.rds data set. Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to:

  1. Examine/plot the data before fitting the model.
  2. Print and interpret the model summary().
  3. plot() the model to look for deviations from modeling assumptions.
  4. Select one or more additional predictors to add to your model and repeat steps 1-3. Is this model significantly better than the model with metro as the only predictor?
Exercise 0 Solution

Use the states.rds data set.

  states <- read_rds("dataSets/states.rds")

Fit a model predicting energy consumed per capita (energy) from the percentage of residents living in metropolitan areas (metro). Be sure to:

  1. Examine/plot the data before fitting the model.
  states_en_met <- 
      states %>% 
      select(metro, energy)

  summary(states_en_met)
##      metro           energy   
##  Min.   : 20.4   Min.   :200  
##  1st Qu.: 47.0   1st Qu.:285  
##  Median : 67.5   Median :320  
##  Mean   : 64.1   Mean   :354  
##  3rd Qu.: 81.6   3rd Qu.:372  
##  Max.   :100.0   Max.   :991  
##  NA's   :1       NA's   :1
  plot(states_en_met)

  cor(states_en_met, use = "pairwise")
##            metro    energy
## metro   1.000000 -0.339745
## energy -0.339745  1.000000
  2. Print and interpret the model summary().
  mod_en_met <- lm(energy ~ metro, data = states)
  summary(mod_en_met)
## 
## Call:
## lm(formula = energy ~ metro, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -215.5  -64.5  -30.9   18.7  584.0 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)
## (Intercept)  501.029     61.814    8.11 0.00000000015
## metro         -2.287      0.914   -2.50         0.016
## 
## Residual standard error: 140 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.115,  Adjusted R-squared:  0.097 
## F-statistic: 6.26 on 1 and 48 DF,  p-value: 0.0158
  3. plot() the model to look for deviations from modeling assumptions.
  par(mfrow = c(2, 2)) # splits the plotting window into 4 panels
  plot(mod_en_met)

  4. Select one or more additional predictors to add to your model and repeat steps 1-3. Is this model significantly better than the model with metro as the only predictor?
  states_en_met_pop_wst <- 
      states %>%
      select(energy, metro, pop, waste)

  summary(states_en_met_pop_wst)
##      energy        metro            pop               waste      
##  Min.   :200   Min.   : 20.4   Min.   :  454000   Min.   :0.540  
##  1st Qu.:285   1st Qu.: 47.0   1st Qu.: 1299750   1st Qu.:0.823  
##  Median :320   Median : 67.5   Median : 3390500   Median :0.960  
##  Mean   :354   Mean   : 64.1   Mean   : 4962040   Mean   :0.989  
##  3rd Qu.:372   3rd Qu.: 81.6   3rd Qu.: 5898000   3rd Qu.:1.145  
##  Max.   :991   Max.   :100.0   Max.   :29760000   Max.   :1.510  
##  NA's   :1     NA's   :1       NA's   :1          NA's   :1
  plot(states_en_met_pop_wst)

  cor(states_en_met_pop_wst, use = "pairwise")
##           energy     metro       pop     waste
## energy  1.000000 -0.339745 -0.184036 -0.252650
## metro  -0.339745  1.000000  0.565356  0.487788
## pop    -0.184036  0.565356  1.000000  0.525571
## waste  -0.252650  0.487788  0.525571  1.000000
  mod_en_met_pop_waste <- lm(energy ~ 1 + metro + pop + waste, data = states)
  summary(mod_en_met_pop_waste)
## 
## Call:
## lm(formula = energy ~ 1 + metro + pop + waste, data = states)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -224.6  -67.5  -31.8   12.7  589.5 
## 
## Coefficients:
##                 Estimate   Std. Error t value  Pr(>|t|)
## (Intercept) 561.66773812  99.04681924    5.67 0.0000009
## metro        -2.07914840   1.16786868   -1.78     0.082
## pop           0.00000165   0.00000481    0.34     0.733
## waste       -83.07494970 104.21963267   -0.80     0.429
## 
## Residual standard error: 142 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.128,  Adjusted R-squared:  0.0707 
## F-statistic: 2.24 on 3 and 46 DF,  p-value: 0.096
  anova(mod_en_met, mod_en_met_pop_waste)
## Analysis of Variance Table
## 
## Model 1: energy ~ metro
## Model 2: energy ~ 1 + metro + pop + waste
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1     48 943103                         
## 2     46 930153  2     12949 0.32  0.728

Interactions & factors

GOAL: To learn how to specify interaction effects and fit models with categorical predictors. In particular:

  1. Formula syntax for interaction effects
  2. Factor levels and labels
  3. Contrasts and pairwise comparisons

Modeling interactions

Interactions allow us to assess the extent to which the association between one predictor and the outcome depends on a second predictor. For example: Does the association between expense and SAT scores depend on the median income in the state?

  # Add the interaction to the model
  sat_expense_by_income <- lm(csat ~ 1 + expense + income + expense:income, data = states_data)

  # same as above, but shorter syntax
  sat_expense_by_income <- lm(csat ~ 1 + expense * income, data = states_data) 

  # Show the regression coefficients table
  summary(sat_expense_by_income) %>% coef() 
##                     Estimate    Std. Error  t value          Pr(>|t|)
## (Intercept)    1380.36423333 172.086252477  8.02135 0.000000000236707
## expense          -0.06384067   0.032700873 -1.95226 0.056878369245142
## income          -10.49785114   4.991463374 -2.10316 0.040832525070882
## expense:income    0.00138465   0.000863553  1.60343 0.115539488252897

Regression with categorical predictors

Let’s try to predict SAT scores from region, a categorical variable. Note that you must make sure R does not think your categorical variable is numeric.

  # make sure R knows region is categorical
  str(states_data$region)
##  Factor w/ 4 levels "West","N. East",..: 3 1 1 3 1 1 2 3 NA 3 ...
  states_data$region <- factor(states_data$region)

  # arguments to the factor() function
  # factor(x, levels, labels)

  levels(states_data$region)
## [1] "West"    "N. East" "South"   "Midwest"
  # Add region to the model
  sat_region <- lm(csat ~ 1 + region, data = states_data) 

  # Show the results
  summary(sat_region) %>% coef() # show the regression coefficients table
##               Estimate Std. Error   t value    Pr(>|t|)
## (Intercept)   946.3077    14.7958 63.957781 1.35258e-46
## regionN. East -56.7521    23.1328 -2.453314 1.80038e-02
## regionSouth   -16.3077    19.9195 -0.818681 4.17190e-01
## regionMidwest  63.7756    21.3559  2.986321 4.51415e-03
  anova(sat_region) # show ANOVA table
## Analysis of Variance Table
## 
## Response: csat
##           Df Sum Sq Mean Sq F value    Pr(>F)
## region     3  82049   27350    9.61 0.0000486
## Residuals 46 130912    2846

So, make sure to tell R which variables are categorical by converting them to factors!

Setting factor reference groups & contrasts

Contrasts is the umbrella term used to describe the process of testing linear combinations of parameters from regression models. All statistical software uses contrasts, but each package has different defaults and its own way of overriding them.

The default contrasts in R are "treatment" contrasts (aka "dummy coding"), where each level within a factor is identified within a matrix of binary 0/1 variables, with the first level chosen as the reference category. They're called "treatment" contrasts because of the typical use case where there is one control group (the reference group) and one or more treatment groups to be compared to the controls. It is easy to change the default contrasts to something other than treatment contrasts (a brief sketch appears at the end of this section), though this is rarely needed. More often, we may want to change the reference group in treatment contrasts or get all pairwise contrasts between factor levels.

  # default treatment (dummy) contrasts
  contrasts(states_data$region)
##         N. East South Midwest
## West          0     0       0
## N. East       1     0       0
## South         0     1       0
## Midwest       0     0       1
  # change the reference group
  states_data$region <- relevel(states_data$region, ref = "Midwest")

  # check the reference group has changed
  contrasts(states_data$region)
##         West N. East South
## Midwest    0       0     0
## West       1       0     0
## N. East    0       1     0
## South      0       0     1
  # refit the model
  mod_region <- lm(csat ~ 1 + region, data = states_data)
  summary(mod_region) %>% coef()
##                Estimate Std. Error  t value    Pr(>|t|)
## (Intercept)   1010.0833    15.4000 65.58993 4.29631e-47
## regionWest     -63.7756    21.3559 -2.98632 4.51415e-03
## regionN. East -120.5278    23.5239 -5.12364 5.79840e-06
## regionSouth    -80.0833    20.3723 -3.93100 2.82601e-04
  # get all pairwise contrasts between means
  mod_region %>%
      emmeans(specs = pairwise ~ region)
## $emmeans
##  region  emmean   SE df lower.CL upper.CL
##  Midwest   1010 15.4 46      979     1041
##  West       946 14.8 46      917      976
##  N. East    890 17.8 46      854      925
##  South      930 13.3 46      903      957
## 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast          estimate   SE df t.ratio p.value
##  Midwest - West        63.8 21.4 46  2.986  0.0226 
##  Midwest - N. East    120.5 23.5 46  5.124  <.0001 
##  Midwest - South       80.1 20.4 46  3.931  0.0016 
##  West - N. East        56.8 23.1 46  2.453  0.0812 
##  West - South          16.3 19.9 46  0.819  0.8453 
##  N. East - South      -40.4 22.2 46 -1.820  0.2774 
## 
## P value adjustment: tukey method for comparing a family of 4 estimates
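
If you do need non-default contrasts, here is a minimal sketch (mentioned above) that switches region to sum-to-zero ("effects") coding by assigning a different contrast matrix to the factor:

  # copy the data, then assign sum-to-zero contrasts to region
  states_sum <- states_data
  contrasts(states_sum$region) <- contr.sum(nlevels(states_sum$region))
  lm(csat ~ 1 + region, data = states_sum) %>% summary() %>% coef()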

Exercise 1

Interactions & factors

Use the states data set.

  1. Add on to the regression equation that you created in Exercise 0 by generating an interaction term and testing the interaction.
  2. Try adding region to the model. Are there significant differences across the four regions?
Exercise 1 Solution

Use the states data set.

  1. Add on to the regression equation that you created in Exercise 0 by generating an interaction term and testing the interaction.
  mod_en_metro_by_waste <- lm(energy ~ 1 + metro * waste, data = states)
  2. Try adding region to the model. Are there significant differences across the four regions?
  mod_en_region <- lm(energy ~ 1 + metro * waste + region, data = states)
  anova(mod_en_metro_by_waste, mod_en_region)
## Analysis of Variance Table
## 
## Model 1: energy ~ 1 + metro * waste
## Model 2: energy ~ 1 + metro * waste + region
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1     46 930683                         
## 2     43 821247  3    109436 1.91  0.142

Models with binary outcomes

GOAL: To learn how to use the glm() function to model binary outcomes. In particular:

  1. The family and link components of the glm() function call
  2. Transforming model coefficients into odds ratios
  3. Transforming model coefficients into predicted marginal means

Logistic regression

Thus far we have used the lm() function to fit our regression models. lm() is great, but limited — in particular, it only fits models for continuous dependent variables. For categorical dependent variables we can use the glm() function.

For these models we will use a different dataset, drawn from the National Health Interview Survey. From the CDC website:

The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. For over 50 years, the U.S. Census Bureau has been the data collection agent for the National Health Interview Survey. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.

Load the National Health Interview Survey data:

  NH11 <- read_rds("dataSets/NatHealth2011.rds")

Logistic regression example

Motivation for a logistic regression model — with a binary response:

  1. Errors will not be normally distributed
  2. Variance will not be homoskedastic
  3. Predictions should be constrained to be on the interval [0, 1]

Anatomy of a generalized linear model:

  # OLS model using lm()
  lm(outcome ~ 1 + pred1 + pred2, 
     data = mydata)

  # OLS model using glm()
  glm(outcome ~ 1 + pred1 + pred2, 
      data = mydata, 
      family = gaussian(link = "identity"))
 
  # logistic model using glm()
  glm(outcome ~ 1 + pred1 + pred2, 
      data = mydata, 
      family = binomial(link = "logit"))

The family argument sets the error distribution for the model, while the link function argument relates the predictors to the expected value of the outcome.

Let’s predict the probability of being diagnosed with hypertension based on age, sex, sleep, and bmi. Here’s the model:

\[ logit(hypev_i) = \beta_{0}1 + \beta_1agep_i + \beta_2sex_i + \beta_3sleep_i + \beta_4bmi_i + \epsilon_i \]

where \(logit(\cdot)\) is the link function, which is equivalent to the log odds of hypev:

\[ logit(hypev_i) = ln \frac{p(hypev_i = 1)}{p(hypev_i = 0)} \]
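
In R, qlogis() and plogis() compute the logit and inverse-logit (logistic) functions, which is handy for moving between the probability scale and the model's linear-predictor scale:

  qlogis(0.75) # log odds of p = 0.75: log(0.75 / 0.25), about 1.1
  plogis(1.1)  # back to the probability scale: about 0.75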

And here’s how we fit this in R. First, let’s clean up the hypertension outcome by making it binary:

  str(NH11$hypev) # check structure of hypev
##  Factor w/ 5 levels "1 Yes","2 No",..: 2 2 1 2 2 1 2 2 1 2 ...
  levels(NH11$hypev) # check levels of hypev
## [1] "1 Yes"             "2 No"              "7 Refused"         "8 Not ascertained" "9 Don't know"
  # collapse all missing values to NA
  NH11$hypev <- factor(NH11$hypev, levels=c("2 No", "1 Yes"))

Now let’s use glm() to estimate the model:

  # run our regression model
  hyp_out <- glm(hypev ~ 1 + age_p + sex + sleep + bmi,
                 data = NH11, 
                 family = binomial(link = "logit"))

  summary(hyp_out) %>% coef()
##                Estimate  Std. Error   z value    Pr(>|z|)
## (Intercept) -4.26946603 0.056494729 -75.57282 0.00000e+00
## age_p        0.06069930 0.000822721  73.77874 0.00000e+00
## sex2 Female -0.14402509 0.026797660  -5.37454 7.67785e-08
## sleep       -0.00703578 0.001639720  -4.29084 1.77998e-05
## bmi          0.01857170 0.000951083  19.52691 6.48517e-85

Odds ratios

Generalized linear models use link functions to relate the average value of the response to the predictors, so raw coefficients are difficult to interpret. For example, the age coefficient of .06 in the previous model tells us that for every one unit increase in age, the log odds of a hypertension diagnosis increase by 0.06. Since most of us are not used to thinking in log odds, this is not too helpful!

One solution is to transform the coefficients to make them easier to interpret. Here we transform them into odds ratios by exponentiating:

  # point estimates
  coef(hyp_out) %>% exp()
## (Intercept)       age_p sex2 Female       sleep         bmi 
##   0.0139893   1.0625794   0.8658660   0.9929889   1.0187452
  # confidence intervals
  confint(hyp_out) %>% exp()
##                 2.5 %    97.5 %
## (Intercept) 0.0125168 0.0156197
## age_p       1.0608738 1.0643008
## sex2 Female 0.8215520 0.9125501
## sleep       0.9897876 0.9961747
## bmi         1.0168514 1.0206501
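
One convenient way to report these together is to bind the point estimates and intervals into a single table:

  # odds ratios and 95% CIs in one table
  cbind(OR = exp(coef(hyp_out)), exp(confint(hyp_out)))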

Predicted marginal means

Instead of reporting odds ratios, we may want to calculate predicted marginal means (sometimes called “least squares means”). These are average values of the outcome at particular levels of the predictors. For ease of interpretation, we want these marginal means to be on the response scale (i.e., the probability scale). We can use the effects package to compute these quantities of interest for us (by default, the numerical output will be on the response scale).

  hyp_out %>% 
      allEffects() %>%
      plot(type = "response") # "response" refers to the probability scale

  # generate a sequence of ages at which to get predictions of the outcome
  age_years <- seq(20, 80, by = 5)
  age_years
##  [1] 20 25 30 35 40 45 50 55 60 65 70 75 80
  eff_df <- 
      hyp_out %>% 
      allEffects(xlevels = list(age_p = age_years)) %>% # override defaults for age
      as.data.frame() # get confidence intervals
  
  eff_df
## $age_p
##    age_p       fit         se     lower     upper
## 1     20 0.0669038 0.00191570 0.0632455 0.0707578
## 2     25 0.0885269 0.00218162 0.0843432 0.0928971
## 3     30 0.1162677 0.00241881 0.1116102 0.1210931
## 4     35 0.1512588 0.00260324 0.1462269 0.1564320
## 5     40 0.1944632 0.00272243 0.1891828 0.1998547
## 6     45 0.2464253 0.00279626 0.2409858 0.2519468
## 7     50 0.3069807 0.00289586 0.3013344 0.3126855
## 8     55 0.3750116 0.00312411 0.3689087 0.3811544
## 9     60 0.4483648 0.00353179 0.4414530 0.4552965
## 10    65 0.5240356 0.00405245 0.5160876 0.5319716
## 11    70 0.5986188 0.00454553 0.5896780 0.6074944
## 12    75 0.6688990 0.00488073 0.6592642 0.6783943
## 13    80 0.7323751 0.00498654 0.7224891 0.7420346
## 
## $sex
##        sex      fit         se    lower    upper
## 1   1 Male 0.299419 0.00418871 0.291274 0.307692
## 2 2 Female 0.270104 0.00371503 0.262885 0.277447
## 
## $sleep
##   sleep      fit         se    lower    upper
## 1     3 0.289994 0.00327643 0.283615 0.296458
## 2    30 0.252489 0.00742700 0.238212 0.267322
## 3    50 0.226866 0.01244618 0.203401 0.252181
## 4    80 0.191984 0.01854946 0.158220 0.230977
## 5   100 0.171095 0.02158429 0.132829 0.217620
## 
## $bmi
##   bmi      fit         se    lower    upper
## 1  10 0.214408 0.00412159 0.206441 0.222597
## 2  30 0.283509 0.00284941 0.277958 0.289127
## 3  60 0.408549 0.00745016 0.394031 0.423227
## 4  80 0.500366 0.01213914 0.476591 0.524140
## 5 100 0.592159 0.01618007 0.560105 0.623448

Exercise 2

Logistic regression

Use the NH11 data set that we loaded earlier.

  1. Use glm() to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (r_maritl). Make sure you only keep the following two levels for everwrk (1 Yes and 2 No). Hint: use the factor() function. Also, make sure to drop any r_maritl levels that do not contain observations. Hint: see ?droplevels.
  2. Predict the probability of working for each level of marital status. Hint: use allEffects().

Note that the data are not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.

Exercise 2 Solution

Use the NH11 data set that we loaded earlier. Note that the data are not perfectly clean and ready to be modeled. You will need to clean up at least some of the variables before fitting the model.

  1. Use glm() to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (r_maritl). Make sure you only keep the following two levels for everwrk (1 Yes and 2 No). Hint: use the factor() function. Also, make sure to drop any r_maritl levels that do not contain observations. Hint: see ?droplevels.
  NH11 <- 
      NH11 %>%
      mutate(everwrk = factor(everwrk, levels = c("1 Yes", "2 No")),
             r_maritl = droplevels(r_maritl)
             )

  mod_wk_age_mar <- glm(everwrk ~ 1 + age_p + r_maritl, 
                        data = NH11,
                        family = binomial(link = "logit"))

  summary(mod_wk_age_mar)
## 
## Call:
## glm(formula = everwrk ~ 1 + age_p + r_maritl, family = binomial(link = "logit"), 
##     data = NH11)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.044  -0.565  -0.439  -0.337   2.731  
## 
## Coefficients:
##                                             Estimate Std. Error z value            Pr(>|z|)
## (Intercept)                                 -0.44025    0.09354   -4.71 0.00000251841905292
## age_p                                       -0.02981    0.00165  -18.12             < 2e-16
## r_maritl2 Married - spouse not in household  0.04968    0.21731    0.23              0.8192
## r_maritl4 Widowed                            0.68362    0.08434    8.11 0.00000000000000052
## r_maritl5 Divorced                          -0.73011    0.11168   -6.54 0.00000000006254929
## r_maritl6 Separated                         -0.12809    0.15137   -0.85              0.3974
## r_maritl7 Never married                      0.34361    0.06922    4.96 0.00000069100229497
## r_maritl8 Living with partner               -0.44358    0.13777   -3.22              0.0013
## r_maritl9 Unknown marital status             0.39548    0.49297    0.80              0.4224
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11082  on 14039  degrees of freedom
## Residual deviance: 10309  on 14031  degrees of freedom
##   (18974 observations deleted due to missingness)
## AIC: 10327
## 
## Number of Fisher Scoring iterations: 5
  2. Predict the probability of working for each level of marital status. Hint: use allEffects().
  mod_wk_age_mar %>% 
      allEffects() %>%
      as.data.frame()
## $age_p
##   age_p       fit         se     lower     upper
## 1    20 0.2758744 0.01144854 0.2540091 0.2988678
## 2    30 0.2204336 0.00748365 0.2061163 0.2354505
## 3    50 0.1347747 0.00318375 0.1286557 0.1411376
## 4    70 0.0790279 0.00305362 0.0732463 0.0852239
## 5    80 0.0598752 0.00312393 0.0540376 0.0662991
## 
## $r_maritl
##                              r_maritl       fit         se     lower     upper
## 1     1 Married - spouse in household 0.1082200 0.00425964 0.1001498 0.1168561
## 2 2 Married - spouse not in household 0.1131082 0.02139317 0.0774606 0.1622753
## 3                           4 Widowed 0.1938109 0.01063476 0.1738136 0.2155087
## 4                          5 Divorced 0.0552439 0.00536166 0.0456288 0.0667436
## 5                         6 Separated 0.0964642 0.01270750 0.0742682 0.1244022
## 6                     7 Never married 0.1461100 0.00745921 0.1320878 0.1613441
## 7               8 Living with partner 0.0722496 0.00890495 0.0566247 0.0917666
## 8            9 Unknown marital status 0.1527008 0.06352845 0.0644084 0.3205573

Multilevel modeling

GOAL: To learn about how to use the lmer() function to model clustered data. In particular:

  1. The formula syntax for incorporating random effects into a model
  2. Calculating the intraclass correlation (ICC)
  3. Model comparison for fixed and random effects

Multilevel modeling overview

  • Multi-level (AKA hierarchical) models are a type of mixed-effects model
  • They are used to model data that are clustered (i.e., non-independent)
  • Mixed-effects models include two types of predictors: fixed effects and random effects
  • Fixed effects – observed levels are of direct interest (e.g., sex, political party, …)
  • Random effects – observed levels are not of direct interest: the goal is to make inferences to a population represented by the observed levels
  • In R, the lme4 package is the most popular for mixed-effects models
  • Use the lmer() function for linear mixed models, glmer() for generalized linear mixed models

The Exam data

The Exam data set contains exam scores of 4,059 students from 65 schools in Inner London. The variable names are as follows:

Variable  Description
school    School ID - a factor.
normexam  Normalized exam score.
standLRT  Standardised LR test score.
student   Student ID (within school) - a factor.

  Exam <- read_rds("dataSets/Exam.rds")

The null model & ICC

As a preliminary step it is often useful to partition the variance in the dependent variable into the various levels. This can be accomplished by running a null model (i.e., a model with a random effects grouping structure, but no fixed-effects predictors).

  # anatomy of lmer() function
  lmer(outcome ~ 1 + pred1 + pred2 + (1 | grouping_variable), 
       data = mydata)
  # null model, grouping by school but no fixed effects
  Norm1 <- lmer(normexam ~ 1 + (1 | school), 
                data = na.omit(Exam))
  summary(Norm1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: normexam ~ 1 + (1 | school)
##    Data: na.omit(Exam)
## 
## REML criterion at convergence: 9962.9
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.875 -0.645  0.004  0.689  3.684 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept) 0.163    0.404   
##  Residual             0.852    0.923   
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.0131     0.0531   -0.25

The intraclass correlation (ICC) is .163/(.163 + .852) = .161, i.e., about 16% of the variance is at the school level.
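
We can compute this directly from the variance components in the summary above:

  # ICC = between-school variance / total variance
  0.163 / (0.163 + 0.852) # about 0.16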

There is no consensus on how to calculate p-values for MLMs, which is why they are omitted from the lme4 output. But if you really need p-values, the lmerTest package will calculate them for you (using the Satterthwaite approximation).
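
For example (assuming the lmerTest package is installed), refitting the null model with lmerTest::lmer(), which has the same interface as lme4::lmer(), adds Satterthwaite-based p-values to the summary:

  # install.packages("lmerTest")
  # summary(lmerTest::lmer(normexam ~ 1 + (1 | school), data = na.omit(Exam)))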

Adding fixed-effects predictors

Here’s a model that predicts exam scores from students’ standardized test scores:

\[ normexam_{ij} = \mu1 + \beta_1standLRT_{ij} + U_{0j} + \epsilon_{ij} \]

where \(U_{0j}\) is the random intercept for the \(j\)th school. Let’s implement this in R using lmer():

  Norm2 <- lmer(normexam ~ 1 + standLRT + (1 | school),
                data = na.omit(Exam))
  summary(Norm2) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: normexam ~ 1 + standLRT + (1 | school)
##    Data: na.omit(Exam)
## 
## REML criterion at convergence: 8483.5
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.673 -0.630  0.023  0.678  3.334 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept) 0.0937   0.306   
##  Residual             0.5692   0.754   
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept) -0.0000431  0.0405496     0.0
## standLRT     0.5669188  0.0132394    42.8
## 
## Correlation of Fixed Effects:
##          (Intr)
## standLRT 0.007

Multiple degree of freedom comparisons

As with lm() and glm() models, you can compare the two lmer() models using the anova() function. With mixed effects models, this will produce a likelihood ratio test.

  anova(Norm1, Norm2)
## Data: na.omit(Exam)
## Models:
## Norm1: normexam ~ 1 + (1 | school)
## Norm2: normexam ~ 1 + standLRT + (1 | school)
##       Df  AIC  BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## Norm1  3 9965 9984  -4979     9959                        
## Norm2  4 8480 8505  -4236     8472  1487      1     <2e-16

Random slopes

Add a random effect of students’ standardized test scores as well. Now, in addition to estimating the distribution of intercepts across schools, we also estimate the distribution of the slope of exam scores on standardized test scores.

  Norm3 <- lmer(normexam ~ 1 + standLRT + (1 + standLRT | school), 
                data = na.omit(Exam)) 
  summary(Norm3) 
## Linear mixed model fit by REML ['lmerMod']
## Formula: normexam ~ 1 + standLRT + (1 + standLRT | school)
##    Data: na.omit(Exam)
## 
## REML criterion at convergence: 8442.8
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.790 -0.625  0.024  0.673  3.438 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  school   (Intercept) 0.0909   0.302        
##           standLRT    0.0164   0.128    0.49
##  Residual             0.5559   0.746        
## Number of obs: 3662, groups:  school, 65
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -0.0125     0.0401   -0.31
## standLRT      0.5609     0.0212   26.47
## 
## Correlation of Fixed Effects:
##          (Intr)
## standLRT 0.360

Test the significance of the random slope

To test the significance of a random slope, just compare models with and without the random slope term using a likelihood ratio test:

  anova(Norm2, Norm3) 
## Data: na.omit(Exam)
## Models:
## Norm2: normexam ~ 1 + standLRT + (1 | school)
## Norm3: normexam ~ 1 + standLRT + (1 + standLRT | school)
##       Df  AIC  BIC logLik deviance Chisq Chi Df    Pr(>Chisq)
## Norm2  4 8480 8505  -4236     8472                           
## Norm3  6 8444 8481  -4216     8432 40.01      2 0.00000000205

Exercise 3

Multilevel modeling

Use the bh1996 dataset:

## install.packages("multilevel")
data(bh1996, package="multilevel")

From the data documentation:

Variables are Leadership Climate (LEAD), Well-Being (WBEING), and Work Hours (HRS). The group identifier is named GRP.

  1. Create a null model predicting wellbeing (WBEING).
  2. Calculate the ICC for your null model.
  3. Run a second multi-level model that adds two individual-level predictors, average number of hours worked (HRS) and leadership skills (LEAD), to the model and interpret your output.
  4. Now add a random effect of average number of hours worked (HRS) to the model and interpret your output. Test the significance of this random term.
Exercise 3 Solution

Use the bh1996 dataset:

  # install.packages("multilevel")
  data(bh1996, package="multilevel")

From the data documentation:

Variables are Leadership Climate (LEAD), Well-Being (WBEING), and Work Hours (HRS). The group identifier is named GRP.

  1. Create a null model predicting wellbeing (WBEING).
  mod_grp0 <- lmer(WBEING ~ 1 + (1 | GRP), data = na.omit(bh1996))
  summary(mod_grp0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: WBEING ~ 1 + (1 | GRP)
##    Data: na.omit(bh1996)
## 
## REML criterion at convergence: 19347.3
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.322 -0.648  0.031  0.718  2.667 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  GRP      (Intercept) 0.0358   0.189   
##  Residual             0.7895   0.889   
## Number of obs: 7382, groups:  GRP, 99
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   2.7743     0.0222     125
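  2. Calculate the ICC for your null model. Using the variance components from the summary above:
  # ICC = between-group variance / total variance
  0.0358 / (0.0358 + 0.7895) # about 0.043, so roughly 4% of the variance is between groups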
  3. Run a second multi-level model that adds two individual-level predictors, average number of hours worked (HRS) and leadership skills (LEAD), to the model and interpret your output.
  mod_grp1 <- lmer(WBEING ~ 1 + HRS + LEAD + (1 | GRP), data = na.omit(bh1996))
  summary(mod_grp1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: WBEING ~ 1 + HRS + LEAD + (1 | GRP)
##    Data: na.omit(bh1996)
## 
## REML criterion at convergence: 17860
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.919 -0.659  0.038  0.704  3.644 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  GRP      (Intercept) 0.0193   0.139   
##  Residual             0.6467   0.804   
## Number of obs: 7382, groups:  GRP, 99
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1.68616    0.06770   24.91
## HRS         -0.03162    0.00438   -7.22
## LEAD         0.50074    0.01281   39.09
## 
## Correlation of Fixed Effects:
##      (Intr) HRS   
## HRS  -0.800       
## LEAD -0.635  0.121
  4. Now add a random effect of average number of hours worked (HRS) to the model and interpret your output. Test the significance of this random term.
  mod_grp2 <- lmer(WBEING ~ 1 + HRS + LEAD + (1 + HRS | GRP), data = na.omit(bh1996))
  anova(mod_grp1, mod_grp2)
## Data: na.omit(bh1996)
## Models:
## mod_grp1: WBEING ~ 1 + HRS + LEAD + (1 | GRP)
## mod_grp2: WBEING ~ 1 + HRS + LEAD + (1 + HRS | GRP)
##          Df   AIC   BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## mod_grp1  5 17848 17882  -8919    17838                        
## mod_grp2  7 17851 17899  -8918    17837  1.22      2      0.543

Wrap-up

Feedback

These workshops are a work in progress, please provide any feedback to: help@iq.harvard.edu

Resources