This function is a wrapper that calls specify(), hypothesize(), and calculate() consecutively and can be used to calculate observed statistics from data. hypothesize() will only be called if a point null hypothesis parameter is supplied.
Learn more in vignette("infer").
Usage
observe(
  x,
  formula,
  response = NULL,
  explanatory = NULL,
  success = NULL,
  null = NULL,
  p = NULL,
  mu = NULL,
  med = NULL,
  sigma = NULL,
  stat = c("mean", "median", "sum", "sd", "prop", "count", "diff in means",
    "diff in medians", "diff in props", "Chisq", "F", "slope", "correlation", "t", "z",
    "ratio of props", "odds ratio"),
  order = NULL,
  ...
)
Arguments
- x
A data frame that can be coerced into a tibble.
- formula
A formula with the response variable on the left and the explanatory on the right. Alternatively, a response and explanatory argument can be supplied.
- response
The variable name in x that will serve as the response. This is an alternative to using the formula argument.
- explanatory
The variable name in x that will serve as the explanatory variable. This is an alternative to using the formula argument.
- success
The level of response that will be considered a success, as a string. Needed for inference on one proportion, a difference in proportions, and corresponding z stats.
- null
The null hypothesis. Options include "independence", "point", and "paired independence".
  - independence: Should be used with both a response and explanatory variable. Indicates that the values of the specified response variable are independent of the associated values in explanatory.
  - point: Should be used with only a response variable. Indicates that a point estimate based on the values in response is associated with a parameter. Sometimes requires supplying one of p, mu, med, or sigma.
  - paired independence: Should be used with only a response variable giving the pre-computed difference between paired observations. Indicates that the order of subtraction between paired values does not affect the resulting distribution.
- p
The true proportion of successes (a number between 0 and 1). To be used with point null hypotheses when the specified response variable is categorical.
- mu
The true mean (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
- med
The true median (any numerical value). To be used with point null hypotheses when the specified response variable is continuous.
- sigma
The true standard deviation (any numerical value). To be used with point null hypotheses.
- stat
A string giving the type of the statistic to calculate. Current options include "mean", "median", "sum", "sd", "prop", "count", "diff in means", "diff in medians", "diff in props", "Chisq" (or "chisq"), "F" (or "f"), "t", "z", "ratio of props", "slope", "odds ratio", "ratio of means", and "correlation". infer only supports theoretical tests on one or two means via the "t" distribution and one or two proportions via the "z" distribution.
- order
A string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction (or division for ratio-based statistics), where order = c("first", "second") means ("first" - "second"), or the analogue for ratios. Needed for inference on difference in means, medians, proportions, ratios, t, and z statistics.
- ...
To pass options like na.rm = TRUE into functions like mean(), sd(), etc. Can also be used to supply hypothesized null values for the "t" statistic or additional arguments to stats::chisq.test().
See also
Other wrapper functions: chisq_stat(), chisq_test(), prop_test(), t_stat(), t_test()
Other functions for calculating observed statistics: chisq_stat(), t_stat()
Examples
# calculating the observed mean number of hours worked per week
gss %>%
  observe(hours ~ NULL, stat = "mean")
#> Response: hours (numeric)
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 41.4
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
#> Response: hours (numeric)
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 41.4
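# a sketch of the same calculation using the response argument in place of
# a formula; assumes infer is attached and that gss is the data set bundled
# with the package, so the result should match the two calls above
library(infer)
gss %>%
  observe(response = hours, stat = "mean")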
# calculating a t statistic for hypothesized mu = 40 hours worked/week
gss %>%
  observe(hours ~ NULL, stat = "t", null = "point", mu = 40)
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 2.09
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
#> Response: hours (numeric)
#> Null Hypothesis: point
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 2.09
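# a sketch of a point null on a proportion, showing how success and p fit
# together: a z statistic for the share of female respondents under a
# hypothesized p = 0.5. The variable (sex), the success level ("female"),
# and the value 0.5 are illustrative choices, not taken from the text above
library(infer)
gss %>%
  observe(
    response = sex,
    success = "female",
    stat = "z",
    null = "point",
    p = 0.5
  )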
# similarly for a difference in means in age based on whether
# the respondent has a college degree
observe(
  gss,
  age ~ college,
  stat = "diff in means",
  order = c("degree", "no degree")
)
#> Response: age (numeric)
#> Explanatory: college (factor)
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 0.941
# equivalently, calculating the same statistic with the core verbs
gss %>%
  specify(age ~ college) %>%
  calculate("diff in means", order = c("degree", "no degree"))
#> Response: age (numeric)
#> Explanatory: college (factor)
#> # A tibble: 1 × 1
#> stat
#> <dbl>
#> 1 0.941
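# a sketch of a difference in proportions, where both success and order are
# needed: the proportion of respondents with no degree among female
# respondents minus that among male respondents. The variables and levels
# chosen here are illustrative and assume the gss data bundled with infer
library(infer)
gss %>%
  observe(
    college ~ sex,
    success = "no degree",
    stat = "diff in props",
    order = c("female", "male")
  )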
# for a more in-depth explanation of how to use the infer package
if (FALSE) {
  vignette("infer")
}