Given the output of an infer core function, this function will fit
a linear model using stats::glm()
according to the formula and data supplied
earlier in the pipeline. If passed the output of specify()
or
hypothesize()
, the function will fit one model. If passed the output
of generate()
, it will fit a model to each data resample, denoted in
the replicate
column. The family of the fitted model depends on the type
of the response variable. If the response is numeric, fit()
will use
family = "gaussian"
(linear regression). If the response is a 2-level
factor or character, fit()
will use family = "binomial"
(logistic
regression). To fit character or factor response variables with more than
two levels, we recommend parsnip::multinom_reg()
.
infer provides a fit "method" for infer objects, which is a way of carrying
out model fitting as applied to infer output. The "generic," imported from
the generics package and re-exported from this package, provides the
general form of fit()
that points to infer's method when called on an
infer object. That generic is also documented here.
Learn more in vignette("infer")
.
Usage
# S3 method for infer
fit(object, ...)
Arguments
- object
Output from an infer function---likely
generate()
orspecify()
---which specifies the formula and data to fit a model to.- ...
Any optional arguments to pass along to the model fitting function. See
stats::glm()
for more information.
Value
A tibble containing the following columns:
replicate
: Only supplied if the input object had been previously passed togenerate()
. A number corresponding to which resample of the original data set the model was fitted to.term
: The explanatory variable (or intercept) in question.estimate
: The model coefficient for the given resample (replicate
) and explanatory variable (term
).
Details
Randomization-based statistical inference with multiple explanatory
variables requires careful consideration of the null hypothesis in question
and its implications for permutation procedures. Inference for partial
regression coefficients via the permutation method implemented in
generate()
for multiple explanatory variables, consistent with its meaning
elsewhere in the package, is subject to additional distributional assumptions
beyond those required for one explanatory variable. Namely, the distribution
of the response variable must be similar to the distribution of the errors
under the null hypothesis' specification of a fixed effect of the explanatory
variables. (This null hypothesis is reflected in the variables
argument to
generate()
. By default, all of the explanatory variables are treated
as fixed.) A general rule of thumb here is, if there are large outliers
in the distributions of any of the explanatory variables, this distributional
assumption will not be satisfied; when the response variable is permuted,
the (presumably outlying) value of the response will no longer be paired
with the outlier in the explanatory variable, causing an outsize effect
on the resulting slope coefficient for that explanatory variable.
More sophisticated methods that are outside of the scope of this package requiring fewer---or less strict---distributional assumptions exist. For an overview, see "Permutation tests for univariate or multivariate analysis of variance and regression" (Marti J. Anderson, 2001), doi:10.1139/cjfas-58-3-626 .
Reproducibility
When using the infer package for research, or in other cases when exact
reproducibility is a priority, be sure the set the seed for R’s random
number generator. infer will respect the random seed specified in the
set.seed()
function, returning the same result when generate()
ing
data given an identical seed. For instance, we can calculate the
difference in mean age
by college
degree status using the gss
dataset from 10 versions of the gss
resampled with permutation using
the following code.
set.seed(1)
gss %>%
specify(age ~ college) %>%
hypothesize(null = "independence") %>%
generate(reps = 5, type = "permute") %>%
calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.531
## 2 2 -2.35
## 3 3 0.764
## 4 4 0.280
## 5 5 0.350
Setting the seed to the same value again and rerunning the same code will produce the same result.
# set the seed
set.seed(1)
gss %>%
specify(age ~ college) %>%
hypothesize(null = "independence") %>%
generate(reps = 5, type = "permute") %>%
calculate("diff in means", order = c("degree", "no degree"))
## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.531
## 2 2 -2.35
## 3 3 0.764
## 4 4 0.280
## 5 5 0.350
Please keep this in mind when writing infer code that utilizes
resampling with generate()
.
Examples
# fit a linear model predicting number of hours worked per
# week using respondent age and degree status.
observed_fit <- gss %>%
specify(hours ~ age + college) %>%
fit()
observed_fit
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 intercept 40.6
#> 2 age 0.00596
#> 3 collegedegree 1.53
# fit 100 models to resamples of the gss dataset, where the response
# `hours` is permuted in each. note that this code is the same as
# the above except for the addition of the `generate` step.
null_fits <- gss %>%
specify(hours ~ age + college) %>%
hypothesize(null = "independence") %>%
generate(reps = 100, type = "permute") %>%
fit()
null_fits
#> # A tibble: 300 × 3
#> # Groups: replicate [100]
#> replicate term estimate
#> <int> <chr> <dbl>
#> 1 1 intercept 43.4
#> 2 1 age -0.0457
#> 3 1 collegedegree -0.481
#> 4 2 intercept 41.2
#> 5 2 age 0.00565
#> 6 2 collegedegree -0.212
#> 7 3 intercept 40.3
#> 8 3 age 0.0314
#> 9 3 collegedegree -0.510
#> 10 4 intercept 40.5
#> # ℹ 290 more rows
# for logistic regression, just supply a binary response variable!
# (this can also be made explicit via the `family` argument in ...)
gss %>%
specify(college ~ age + hours) %>%
fit()
#> # A tibble: 3 × 2
#> term estimate
#> <chr> <dbl>
#> 1 intercept -1.13
#> 2 age 0.00527
#> 3 hours 0.00698
# more in-depth explanation of how to use the infer package
if (FALSE) {
vignette("infer")
}