There are several options for calculating expected values in R, but none are completely satisfactory. Here is a list of things I want in such a package:
All of these things exist already, but not in any one place. For example, the Zelig package works with a large number of model types, but does not make it easy to construct interaction plots; the effects package has a simple and concise user interface for calculating quantities of interest, but supports a limited number of models and does not provide an easy way to calculate comparisons.
The comparisons below are designed to provide an overview of currently available packages, to identify good ideas that should be copied, and possibly to find areas where new approaches would be beneficial.
The standard way to calculate expected values in R is to use the predict
function along with a newdata
argument.
data(Prestige, package = "car")
Prestige <- na.omit(Prestige)
m <- lm(prestige ~ income + type, data = Prestige)
xvals <- expand.grid(income = quantile(Prestige$income, c(.25, .50, .75)),
women = mean(Prestige$women),
type = "bc")
data.frame(xvals, predict(m, newdata = xvals, se.fit = TRUE))
## income women type fit se.fit df residual.scale
## 1 4250.50 28.98571 bc 33.95284 1.267627 94 8.210495
## 2 6035.50 28.98571 bc 36.45397 1.248200 94 8.210495
## 3 8226.25 28.98571 bc 39.52364 1.419124 94 8.210495
This is not a terrible interface and it has the advantage of working with most models. There are however a few annoying niggles:
Although the predict
function is useful these combined issues are enough to make us look for a more convenient alternative.
The effects
, emmeans
, and prediction
packages address many of the issues with calculating expected values using the predict
function:
library(dplyr)
library(effects)
library(prediction)
library(emmeans)
as.data.frame(Effect(c("income"), m))
## income fit se lower upper
## 1 2000 40.40725 1.460364 37.50766 43.30683
## 2 8000 48.81442 0.868663 47.08967 50.53917
## 3 10000 51.61681 1.114859 49.40323 53.83039
## 4 20000 65.62877 3.285165 59.10599 72.15154
## 5 30000 79.64072 5.673440 68.37597 90.90547
emmeans(m, ~ income)
## income emmean SE df lower.CL upper.CL
## 6938.857 48.46063 0.8588977 94 46.75527 50.16599
##
## Results are averaged over the levels of: type
## Confidence level used: 0.95
prediction(m)
## Average prediction for 98 observations: 47.3276
By default Effect
gives expected values over a range of predictors, while emmeans
gives expected values at the mean by default. prediction
does not allow for specifying a focal predictor without specifying values for it. All three methods allow for specifying arbitrary predictor variable values:
as.data.frame(
Effect(c("income"), m,
xlevels = list(income = quantile(Prestige$income, c(.25, .5, .75)))))
## income fit se lower upper
## 1 4250.50 43.56064 1.0563892 41.46315 45.65812
## 2 6035.50 46.06177 0.8580300 44.35813 47.76541
## 3 8226.25 49.13144 0.8865936 47.37109 50.89179
emmeans(m, ~ income,
at = list(income = quantile(Prestige$income, c(.25, .5, .75))))
## income emmean SE df lower.CL upper.CL
## 4250.50 44.69371 1.0880065 94 42.53345 46.85398
## 6035.50 47.19485 0.8899893 94 45.42775 48.96194
## 8226.25 50.26452 0.9095417 94 48.45860 52.07043
##
## Results are averaged over the levels of: type
## Confidence level used: 0.95
summarise(group_by(prediction(m, at = list(income = quantile(Prestige$income, c(.25, .5, .75)))),
income),
mean(fitted),
mean(se.fitted))
## # A tibble: 3 x 3
## income `mean(fitted)` `mean(se.fitted)`
## <dbl> <dbl> <dbl>
## 1 4250.50 43.56064 1.646961
## 2 6035.50 46.06177 1.548265
## 3 8226.25 49.13144 1.577820
The observant reader will have noticed that prediction
, and Effect
returned the same expected value, while emmeans
all gave different answers. The difference is thatEffect
and prediction
gives us the expected values at various income levels averaging over the observed values of type, while emmeans
gives the expected values at various income levels averaging over the expected values for each level of type. Each method can be made to produce the results from any other, but in all cases this requires the user to do part of the calculation.
It is also worth noting that the prediction
method is currently less user-friendly, in that it requires the user to average expected values, and it does not produce confidence intervals by default.
Whereas the effects
, emmeans
and prediction
packages provide methods for calculating expected values for a range of model types, some packages attempt to provide a more self-contained model-fitting ecosystem that includes functionality for computing expected values. Notable examples include the rms
and Zelig
packages.
The rms
package takes a somewhat strange approach to specifying the values at which expected values are calculated. Defaults are set by generating a datadist
object, and then setting an option to use this object as the default values when generating predictions. It looks like this:
library(rms)
dd <- datadist(Prestige)
options(datadist = "dd")
m.rms <- ols(prestige ~ income + type, data = Prestige)
Predict(m.rms, income = quantile(Prestige$income, c(.25, .5, .75)))
## income type yhat lower upper
## 1 4250.50 bc 33.95284 31.43594 36.46974
## 2 6035.50 bc 36.45397 33.97564 38.93230
## 3 8226.25 bc 39.52364 36.70594 42.34135
##
## Response variable (y): prestige
##
## Adjust to: type=bc
##
## Limits are 0.95 confidence limits
I dislike almost everything about this, but specifying predictor variable values as named arguments is quite nice.
The Zelig
package similarly requires using it’s own wrapper function to fit the model. Then x-values can be set using the setx
function. Finally, expected values can be calculated using the sim
function.
library(Zelig)
m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)
summary(sim(setx(m.z, income = quantile(Prestige$income, c(.25, .5, .75)))))
##
## [1] 4250.5
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 34.02169 1.249483 34.04534 31.49826 36.39669
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 34.08854 8.262977 34.41826 17.91407 49.43037
##
##
## [1] 6035.5
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 36.55236 1.224164 36.54008 34.1714 39.08843
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 36.4233 8.090978 36.44102 20.64159 52.00182
##
##
## [1] 8226.25
##
##
## sim range :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 39.65828 1.387799 39.6262 37.0493 42.5768
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 39.83956 8.377817 40.00386 23.21353 55.30992
Like the rms
package the setx
function uses variable names as arguments.
Each of the packages reviewed above has advantages and disadvantages relative to the others. If we were to combine the best parts of each approach into a single package it would have the following features:
data.frame
return value, like prediction
summary
method like emmeans
Effect
and prediction
) marginal (as produced by emmeans
) expected valuesrmd::Predict
and Zelig::setx
)Taking these considerations into account, thesmargins
function calculates expected values averaging over the observed values of non-specified predictors. Thus by default the expected values are the same as those obtained by Effect
and prediction
:
library(smargins)
summary(smargins(m,
income = quantile(income, c(.25, .5, .75))))
## income mean sd median lower_2.5 upper_97.5
## 1 4250.50 43.58728 1.0619481 43.57310 41.53283 45.73539
## 2 6035.50 46.08861 0.8670104 46.07721 44.39327 47.79804
## 3 8226.25 49.15853 0.8919855 49.15437 47.43297 50.87083
Producing the expected values at specific values of the predictors as Zelig
and rms
do is also easy, as one simply needs to specify values for all the predictors:
summary(smargins(m,
income = quantile(income, c(.25, .5, .75)),
type = "bc"))
## income type mean sd median lower_2.5 upper_97.5
## 1 4250.50 bc 33.94347 1.262620 33.93652 31.45370 36.46656
## 2 6035.50 bc 36.44140 1.254813 36.44982 34.03017 38.91282
## 3 8226.25 bc 39.50714 1.443390 39.51370 36.73817 42.31521
Obtaining marginal means as computed by emmeans
is a simple matter of computing all the expected values and then averaging as desired. For example the emmeans
result above can be produced with smargins
and summary
as follows:
summary(smargins(m,
income = quantile(income, c(.25, .5, .75))),
vars = "income")
## income mean sd median lower_2.5 upper_97.5
## 1 4250.50 43.54541 1.0704276 43.54427 41.45724 45.68088
## 2 6035.50 46.05210 0.8672191 46.06231 44.37975 47.74713
## 3 8226.25 49.12859 0.8869720 49.13586 47.41033 50.84148
The nice thing about this approach is that it is transparent and flexible. Users can specify any combination of variables to average over. In fact, users can easily do it themselves if desired:
(all.ev <- summary(smargins(m,
income = quantile(income, c(.25, .5, .75)),
type = unique(type))))
## income type mean sd median lower_2.5 upper_97.5
## 1 4250.50 prof 59.02741 2.127140 59.03909 54.82912 63.17046
## 2 6035.50 prof 61.52650 1.849445 61.50452 57.85919 65.14716
## 3 8226.25 prof 64.59367 1.605051 64.58917 61.55067 67.77500
## 4 4250.50 bc 33.91644 1.268876 33.90257 31.45407 36.44087
## 5 6035.50 bc 36.41553 1.254604 36.42343 33.98323 38.84080
## 6 8226.25 bc 39.48269 1.428589 39.51877 36.65570 42.26981
## 7 4250.50 wc 41.11035 1.699417 41.10878 37.74646 44.39968
## 8 6035.50 wc 43.60944 1.713946 43.62819 40.25149 46.96603
## 9 8226.25 wc 46.67661 1.873371 46.69679 43.03989 50.28363
(em.inc.ev <- summarise_all(group_by(select(all.ev, -type), income), funs(mean)))
## # A tibble: 3 x 6
## income mean sd median lower_2.5 upper_97.5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4250.50 44.68473 1.698478 44.68348 41.34322 48.00367
## 2 6035.50 47.18382 1.605998 47.18538 44.03130 50.31799
## 3 8226.25 50.25099 1.635670 50.26824 47.08209 53.44281
The summary
method is just a convenience that makes it easier to do this. The simplicity and transparency of this operation is a direct consequence of the decision to return expected values in a simple data.frame
structure that is familiar to R users and is easy for them to manipulate.
Some of the packages reviewed above provide some tools for calculating differences among expected values. For example the emmeans
package provides some pre-defined contrasts and allows for user written contrast functions as well:
m.em <- emmeans(m, ~ type, at = list(income = mean(Prestige$income)))
emmeans::contrast(m.em, method = "pairwise")
## contrast estimate SE df t.ratio p.value
## bc - prof -25.055474 2.302012 94 -10.884 <.0001
## bc - wc -7.167155 2.114048 94 -3.390 0.0029
## prof - wc 17.888319 2.627157 94 6.809 <.0001
##
## P value adjustment: tukey method for comparing a family of 3 estimates
last_vs_others <- function(...) {
data.frame("wc vs other" = c(-.5, -.5, 1))
}
emmeans::contrast(m.em, method = last_vs_others)
## contrast estimate SE df t.ratio p.value
## wc.vs.other -5.360582 2.088243 94 -2.567 0.0118
The Zelig
package also provides some limited support for this by allowing for computing differences between quantities of interest at two different sets of predictor variable values. That is:
library(Zelig)
m.z <- zelig(prestige ~ income + type, data = Prestige, model = "ls", cite = FALSE)
summary(sim(m.z,
setx(m.z, type = "bc"),
setx(m.z, type = "wc")))
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 37.72659 1.315224 37.7306 35.19607 40.14276
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 37.79103 8.704785 37.74936 20.27688 53.92506
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 44.99378 1.793829 45.02816 41.48194 48.69833
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 45.1364 8.37292 45.31917 28.423 60.65767
## fd
## mean sd 50% 2.5% 97.5%
## 1 7.267191 2.099155 7.268638 3.324846 11.33183
summary(sim(m.z,
setx(m.z, type = "bc"),
setx(m.z, type = "prof")))
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 37.69796 1.281604 37.68932 35.258 40.32212
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 37.76897 8.241141 37.96331 22.15938 53.99203
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## 1 62.80684 1.800894 62.73903 59.2008 66.13784
## pv
## mean sd 50% 2.5% 97.5%
## [1,] 63.21582 8.774306 63.40497 45.70338 79.91007
## fd
## mean sd 50% 2.5% 97.5%
## 1 25.10888 2.342465 25.17747 20.6407 29.66534
Unlike emmeans
there is no simple way to perform arbitrary comparisons (e.g., to compare the average of type = "bc"
and type = "prof"
to type = "wc"
).
All in all, current implementations don’t do a great job of making it easy to compare expected values in a simple and flexibly ways. Figuring out a nice interface for this in the smargins
package is an on-going project, but already some progress has been made.
The scompare
function that gives pairwise differences:
m.sm <- smargins(m, type = unique(type), income = mean(income))
summary(scompare(m.sm, "type"))
## type mean sd median lower_2.5 upper_97.5
## 1 prof vs bc 17.849108 2.609524 17.868959 12.71594 22.830741
## 2 prof vs wc -25.047503 2.324493 -25.050109 -29.60391 -20.477656
## 3 bc vs wc -7.198395 2.120790 -7.195647 -11.35260 -3.075413
For custom comparisons the structure produced by smargins
is simple enough that the user can perform their own arbitrary transformations. For example, we can compare the average of blue collar and professional with white collar:
wc <- subset(m.sm, type == "wc")$.smargin_qi
bc_prof <- (subset(m.sm, type == "bc")$.smargin_qi +
subset(m.sm, type == "prof")$.smargin_qi)/2
sumstats(wc - bc_prof)
## mean sd median lower_2.5 upper_97.5
## -5.325356 2.074474 -5.366090 -9.271299 -1.224537
While the simple structure allows users to compute arbitrary transformations without too much difficulty, it would be nice to have some convenience functions for this. The scompare
function needs more work, starting with conceptual work to hammer out a description of the desired interface. The emmeans::contrast
approach is one source of inspiration; other places to look include car::lht
/ multcomp::glht
and https://github.com/goshevs/effcomp .
TODO
TODO