What I Want

There are several options for calculating expected values in R, but none are completely satisfactory. Here is a list of things I want in such a package:

  • works with a large number of fitted model types
  • simple and concise user interface for calculating expected values
  • simple and flexible interface for comparisons and other transformations
  • simple but customizable plots
  • easy and natural support for interaction tables/plots
  • simple output format to facilitate custom summaries and/or plots

All of these things exist already, but not in any one place. For example, the Zelig package works with a large number of model types, but does not make it easy to construct interaction plots; the effects package has a simple and concise user interface for calculating quantities of interest, but supports a limited number of models and does not provide an easy way to calculate comparisons.

Package comparisons

The comparisons below are designed to provide an overview of currently available packages, to identify good ideas that should be copied, and possibly to find areas where new approaches would be beneficial.

Calculating expected values

The standard way to calculate expected values in R is to use the predict function along with a newdata argument.

data(Prestige, package = "car")
Prestige <- na.omit(Prestige)

m <- lm(prestige ~ income + type, data = Prestige)

xvals <- expand.grid(income = quantile(Prestige$income, c(.25, .50, .75)),
                     type = "bc")

data.frame(xvals, predict(m, newdata = xvals, se.fit = TRUE))
##    income type      fit   se.fit df residual.scale
## 1 4250.50   bc 33.95284 1.267627 94       8.210495
## 2 6035.50   bc 36.45397 1.248200 94       8.210495
## 3 8226.25   bc 39.52364 1.419124 94       8.210495

This is not a terrible interface, and it has the advantage of working with most models. There are, however, a few annoying niggles:

  • averaging over terms (rather than fixing them) requires more work (see the sketch below)
  • need to specify all x-values, even those we only want to hold at the mean/median
  • need to combine the predicted values with the newdata
  • inconsistency from model to model (e.g., no se.fit for some models)
  • no confidence intervals (though we sometimes have standard errors)

Although the predict function is useful, these combined issues are enough to make us look for a more convenient alternative.
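For instance, averaging over the observed values of type (rather than fixing it) and constructing confidence intervals both take several extra steps. A minimal sketch using only base R and the objects defined above (the names incomes, ev, p, and ci are arbitrary):

# Average over the observed values of `type` at each income quantile,
# rather than fixing `type` at a single level
incomes <- quantile(Prestige$income, c(.25, .50, .75))
ev <- sapply(incomes, function(inc) {
    mean(predict(m, newdata = transform(Prestige, income = inc)))
})

# Construct 95% confidence intervals by hand from the standard errors
p <- predict(m, newdata = xvals, se.fit = TRUE)
ci <- data.frame(xvals,
                 fit   = p$fit,
                 lower = p$fit - qt(.975, p$df) * p$se.fit,
                 upper = p$fit + qt(.975, p$df) * p$se.fit)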

Easier predict-like methods

The effects, emmeans, and prediction packages address many of the issues with calculating expected values using the predict function:

library(dplyr)
library(effects)
library(prediction)
library(emmeans)
as.data.frame(Effect(c("income"), m))
##   income      fit       se    lower    upper
## 1   2000 40.40725 1.460364 37.50766 43.30683
## 2   8000 48.81442 0.868663 47.08967 50.53917
## 3  10000 51.61681 1.114859 49.40323 53.83039
## 4  20000 65.62877 3.285165 59.10599 72.15154
## 5  30000 79.64072 5.673440 68.37597 90.90547
emmeans(m, ~ income)
##    income   emmean        SE df lower.CL upper.CL
##  6938.857 48.46063 0.8588977 94 46.75527 50.16599
## 
## Results are averaged over the levels of: type 
## Confidence level used: 0.95
prediction(m)
## Average prediction for 98 observations: 47.3276

By default Effect gives expected values over a range of the focal predictor, emmeans gives expected values at the mean, and prediction gives the average expected value over the observed data. prediction does not allow specifying a focal predictor without also supplying values for it. All three methods allow specifying arbitrary predictor variable values:

as.data.frame(
    Effect(c("income"), m,
           xlevels = list(income = quantile(Prestige$income, c(.25, .5, .75)))))
##    income      fit        se    lower    upper
## 1 4250.50 43.56064 1.0563892 41.46315 45.65812
## 2 6035.50 46.06177 0.8580300 44.35813 47.76541
## 3 8226.25 49.13144 0.8865936 47.37109 50.89179
emmeans(m, ~ income,
        at = list(income = quantile(Prestige$income, c(.25, .5, .75))))
##   income   emmean        SE df lower.CL upper.CL
##  4250.50 44.69371 1.0880065 94 42.53345 46.85398
##  6035.50 47.19485 0.8899893 94 45.42775 48.96194
##  8226.25 50.26452 0.9095417 94 48.45860 52.07043
## 
## Results are averaged over the levels of: type 
## Confidence level used: 0.95
summarise(group_by(prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))),
                   income),
          mean(fitted),
          mean(se.fitted))
## # A tibble: 3 x 3
##    income `mean(fitted)` `mean(se.fitted)`
##     <dbl>          <dbl>             <dbl>
## 1 4250.50       43.56064          1.646961
## 2 6035.50       46.06177          1.548265
## 3 8226.25       49.13144          1.577820

The observant reader will have noticed that Effect and prediction returned the same expected values, while emmeans gave different answers. The difference is that Effect and prediction give us the expected values at various income levels averaging over the observed values of type, while emmeans gives the expected values at various income levels averaging over the expected values for each level of type. Each method can be made to produce the results of any other, but in all cases this requires the user to do part of the calculation.
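To see the difference concretely, both quantities can be reproduced with plain predict; a rough sketch at the median of income:

inc <- quantile(Prestige$income, .5)

# Effect / prediction style: average over the observed values of type
mean(predict(m, newdata = transform(Prestige, income = inc)))

# emmeans style: average the expected values at each level of type
mean(predict(m, newdata = data.frame(income = inc,
                                     type = levels(Prestige$type))))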

It is also worth noting that the prediction method is currently less user-friendly, in that it requires the user to average expected values, and it does not produce confidence intervals by default.

Self-contained systems

Whereas the effects, emmeans, and prediction packages provide methods for calculating expected values for a range of model types, some packages attempt to provide a more self-contained model-fitting ecosystem that includes functionality for computing expected values. Notable examples include the rms and Zelig packages.

The rms package takes a somewhat strange approach to specifying the values at which expected values are calculated. Defaults are set by generating a datadist object and then setting an option that tells rms to use this object for default values when generating predictions. It looks like this:

library(rms)
dd <- datadist(Prestige)
options(datadist = "dd")

m.rms <- ols(prestige ~ income + type, data = Prestige)
Predict(m.rms, income = quantile(Prestige$income, c(.25, .5, .75)))
##    income type     yhat    lower    upper
## 1 4250.50   bc 33.95284 31.43594 36.46974
## 2 6035.50   bc 36.45397 33.97564 38.93230
## 3 8226.25   bc 39.52364 36.70594 42.34135
## 
## Response variable (y): prestige 
## 
## Adjust to: type=bc  
## 
## Limits are 0.95 confidence limits

I dislike almost everything about this, but specifying predictor variable values as named arguments is quite nice.

The Zelig package similarly requires using its own wrapper function to fit the model. X-values can then be set using the setx function, and finally expected values can be calculated using the sim function.

library(Zelig)
m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)

summary(sim(setx(m.z, income = quantile(Prestige$income, c(.25, .5, .75)))))
## 
## [1] 4250.5
## 
## 
##  sim range :
##  -----
## ev
##       mean       sd      50%     2.5%    97.5%
## 1 34.02169 1.249483 34.04534 31.49826 36.39669
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 34.08854 8.262977 34.41826 17.91407 49.43037
## 
## 
## [1] 6035.5
## 
## 
##  sim range :
##  -----
## ev
##       mean       sd      50%    2.5%    97.5%
## 1 36.55236 1.224164 36.54008 34.1714 39.08843
## pv
##         mean       sd      50%     2.5%    97.5%
## [1,] 36.4233 8.090978 36.44102 20.64159 52.00182
## 
## 
## [1] 8226.25
## 
## 
##  sim range :
##  -----
## ev
##       mean       sd     50%    2.5%   97.5%
## 1 39.65828 1.387799 39.6262 37.0493 42.5768
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 39.83956 8.377817 40.00386 23.21353 55.30992

Like rms's Predict, the setx function takes variable names as arguments.

The best parts

Each of the packages reviewed above has advantages and disadvantages relative to the others. If we were to combine the best parts of each approach into a single package it would have the following features:

  • simple data.frame return value, like prediction
  • simple summary method like emmeans
  • natural support for both average (as produced by Effect and prediction) and marginal (as produced by emmeans) expected values
  • variable names as arguments (as done by rms::Predict and Zelig::setx)

Taking these considerations into account, the smargins function calculates expected values averaging over the observed values of non-specified predictors. Thus by default the expected values are the same as those obtained by Effect and prediction:

library(smargins)

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75))))
##    income     mean        sd   median lower_2.5 upper_97.5
## 1 4250.50 43.58728 1.0619481 43.57310  41.53283   45.73539
## 2 6035.50 46.08861 0.8670104 46.07721  44.39327   47.79804
## 3 8226.25 49.15853 0.8919855 49.15437  47.43297   50.87083

Producing expected values at specific values of the predictors, as Zelig and rms do, is also easy; one simply needs to specify values for all the predictors:

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75)),
                 type = "bc"))
##    income type     mean       sd   median lower_2.5 upper_97.5
## 1 4250.50   bc 33.94347 1.262620 33.93652  31.45370   36.46656
## 2 6035.50   bc 36.44140 1.254813 36.44982  34.03017   38.91282
## 3 8226.25   bc 39.50714 1.443390 39.51370  36.73817   42.31521

Obtaining marginal means as computed by emmeans is a simple matter of computing the expected values at each level of type and then averaging. Note, however, that this cannot be done after the fact with the smargins call above: since type was already averaged over the observed data, grouping the result by income via summary's vars argument simply reproduces the Effect-style numbers:

summary(smargins(m,
                 income = quantile(income, c(.25, .5, .75))),
        vars = "income")
##    income     mean        sd   median lower_2.5 upper_97.5
## 1 4250.50 43.54541 1.0704276 43.54427  41.45724   45.68088
## 2 6035.50 46.05210 0.8672191 46.06231  44.37975   47.74713
## 3 8226.25 49.12859 0.8869720 49.13586  47.41033   50.84148

To reproduce the emmeans result we instead compute expected values over the full grid of income and type values and then average over type. The nice thing about this approach is that it is transparent and flexible: users can specify any combination of variables to average over, and can easily do the averaging themselves if desired:

(all.ev <- summary(smargins(m,
                           income = quantile(income, c(.25, .5, .75)),
                           type = unique(type))))
##    income type     mean       sd   median lower_2.5 upper_97.5
## 1 4250.50 prof 59.02741 2.127140 59.03909  54.82912   63.17046
## 2 6035.50 prof 61.52650 1.849445 61.50452  57.85919   65.14716
## 3 8226.25 prof 64.59367 1.605051 64.58917  61.55067   67.77500
## 4 4250.50   bc 33.91644 1.268876 33.90257  31.45407   36.44087
## 5 6035.50   bc 36.41553 1.254604 36.42343  33.98323   38.84080
## 6 8226.25   bc 39.48269 1.428589 39.51877  36.65570   42.26981
## 7 4250.50   wc 41.11035 1.699417 41.10878  37.74646   44.39968
## 8 6035.50   wc 43.60944 1.713946 43.62819  40.25149   46.96603
## 9 8226.25   wc 46.67661 1.873371 46.69679  43.03989   50.28363
(em.inc.ev <- summarise_all(group_by(select(all.ev, -type), income), funs(mean)))
## # A tibble: 3 x 6
##    income     mean       sd   median lower_2.5 upper_97.5
##     <dbl>    <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
## 1 4250.50 44.68473 1.698478 44.68348  41.34322   48.00367
## 2 6035.50 47.18382 1.605998 47.18538  44.03130   50.31799
## 3 8226.25 50.25099 1.635670 50.26824  47.08209   53.44281

The summary method is just a convenience that makes this kind of averaging easier. The simplicity and transparency of the operation are a direct consequence of the decision to return expected values in a simple data.frame structure that is familiar to R users and easy for them to manipulate.
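A quick way to see this is to inspect the raw draws directly; as the custom comparison below suggests, the simulated quantities of interest live in a .smargin_qi column:

ev.draws <- smargins(m, income = quantile(income, c(.25, .5, .75)))
str(ev.draws)  # an ordinary data.frame of simulation draws,
               # including a .smargin_qi column of quantities of interest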

Comparisons and other transformations

Some of the packages reviewed above provide tools for calculating differences among expected values. For example, the emmeans package provides pre-defined contrasts and allows for user-written contrast functions as well:

m.em <- emmeans(m, ~ type, at = list(income = mean(Prestige$income)))

emmeans::contrast(m.em, method = "pairwise")
##  contrast    estimate       SE df t.ratio p.value
##  bc - prof -25.055474 2.302012 94 -10.884  <.0001
##  bc - wc    -7.167155 2.114048 94  -3.390  0.0029
##  prof - wc  17.888319 2.627157 94   6.809  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
last_vs_others <- function(...) {
    data.frame("wc vs other" = c(-.5, -.5, 1))
}

emmeans::contrast(m.em, method = last_vs_others)
##  contrast     estimate       SE df t.ratio p.value
##  wc.vs.other -5.360582 2.088243 94  -2.567  0.0118

The Zelig package also provides limited support for this, allowing differences to be computed between quantities of interest at two different sets of predictor variable values. That is:

library(Zelig)

m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)

summary(sim(m.z,
            setx(m.z, type = "bc"),
            setx(m.z, type = "wc")))
## 
##  sim x :
##  -----
## ev
##       mean       sd     50%     2.5%    97.5%
## 1 37.72659 1.315224 37.7306 35.19607 40.14276
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 37.79103 8.704785 37.74936 20.27688 53.92506
## 
##  sim x1 :
##  -----
## ev
##       mean       sd      50%     2.5%    97.5%
## 1 44.99378 1.793829 45.02816 41.48194 48.69833
## pv
##         mean      sd      50%   2.5%    97.5%
## [1,] 45.1364 8.37292 45.31917 28.423 60.65767
## fd
##       mean       sd      50%     2.5%    97.5%
## 1 7.267191 2.099155 7.268638 3.324846 11.33183
summary(sim(m.z,
            setx(m.z, type = "bc"),
            setx(m.z, type = "prof")))
## 
##  sim x :
##  -----
## ev
##       mean       sd      50%   2.5%    97.5%
## 1 37.69796 1.281604 37.68932 35.258 40.32212
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 37.76897 8.241141 37.96331 22.15938 53.99203
## 
##  sim x1 :
##  -----
## ev
##       mean       sd      50%    2.5%    97.5%
## 1 62.80684 1.800894 62.73903 59.2008 66.13784
## pv
##          mean       sd      50%     2.5%    97.5%
## [1,] 63.21582 8.774306 63.40497 45.70338 79.91007
## fd
##       mean       sd      50%    2.5%    97.5%
## 1 25.10888 2.342465 25.17747 20.6407 29.66534

Unlike emmeans, Zelig provides no simple way to perform arbitrary comparisons (e.g., to compare the average of type = "bc" and type = "prof" to type = "wc").

All in all, current implementations don't do a great job of making it easy to compare expected values in a simple and flexible way. Figuring out a nice interface for this in the smargins package is an ongoing project, but some progress has already been made.

The scompare function gives pairwise differences:

m.sm <- smargins(m, type = unique(type), income = mean(income))
summary(scompare(m.sm, "type"))
##         type       mean       sd     median lower_2.5 upper_97.5
## 1 prof vs bc  17.849108 2.609524  17.868959  12.71594  22.830741
## 2 prof vs wc -25.047503 2.324493 -25.050109 -29.60391 -20.477656
## 3   bc vs wc  -7.198395 2.120790  -7.195647 -11.35260  -3.075413

For custom comparisons, the structure produced by smargins is simple enough that users can perform their own arbitrary transformations. For example, we can compare the average of blue collar and professional with white collar:

wc <- subset(m.sm, type == "wc")$.smargin_qi
bc_prof <- (subset(m.sm, type == "bc")$.smargin_qi +
            subset(m.sm, type == "prof")$.smargin_qi)/2
sumstats(wc - bc_prof)
##       mean         sd     median  lower_2.5 upper_97.5 
##  -5.325356   2.074474  -5.366090  -9.271299  -1.224537

While the simple structure allows users to compute arbitrary transformations without too much difficulty, it would be nice to have some convenience functions for this. The scompare function needs more work, starting with hammering out a description of the desired interface. The emmeans::contrast approach is one source of inspiration; other places to look include car::lht / multcomp::glht and https://github.com/goshevs/effcomp.
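As one point of comparison, the pairwise contrasts computed by scompare above can also be obtained on the coefficient scale with multcomp; a minimal sketch, assuming multcomp is installed:

library(multcomp)

# Pairwise (Tukey) contrasts among the levels of type,
# analogous to scompare(m.sm, "type")
summary(glht(m, linfct = mcp(type = "Tukey")))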

Plots

TODO

Interactions

TODO

Custom displays

TODO