In this vignette, we’ll walk through conducting an analysis of variance (ANOVA) test using
infer. ANOVAs are used to analyze differences in group means.
Throughout this vignette, we’ll make use of the
gss dataset supplied by
infer, which contains a sample of data from the General Social Survey. See
?gss for more information on the variables included and their source. Note that this data (and our examples on it) is for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
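The printout that follows is in the format produced by dplyr::glimpse(). The call itself is not shown in the text, so this is only a sketch of how it could be reproduced, assuming the dplyr package is installed:

```r
library(infer)  # supplies the gss dataset

# print a compact, column-wise summary of the data
# (assumption: the dplyr package is installed)
dplyr::glimpse(gss)
```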
```
## Rows: 500
## Columns: 11
## $ year    <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 2…
## $ age     <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 5…
## $ sex     <fct> male, female, male, male, male, female, female, female, femal…
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no d…
## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, d…
## $ hompop  <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5…
## $ hours   <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 4…
## $ income  <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or mor…
## $ class   <fct> middle class, working class, working class, working class, mi…
## $ finrela <fct> below average, below average, below average, above average, a…
## $ weight  <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.478…
```
To carry out an ANOVA, we’ll examine the association between age and political party affiliation in the United States. The
age variable is a numerical variable measuring the respondents’ age at the time that the survey was taken, and
partyid is a factor variable with unique values ind, rep, dem, other.
This is what the relationship looks like in the observed data:
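The plot itself is not reproduced here. One way to draw a comparable set of boxplots is sketched below, assuming the ggplot2 package is installed. The axis labels and the object name age_by_party_plot are illustrative choices, not taken from the original figure:

```r
library(infer)    # supplies the gss dataset
library(ggplot2)  # assumption: ggplot2 is installed

# side-by-side boxplots of age, one box per party affiliation
age_by_party_plot <- ggplot(gss, aes(x = partyid, y = age)) +
  geom_boxplot() +
  labs(x = "Political party affiliation", y = "Age (years)")

age_by_party_plot
```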
If there were no relationship, we would expect to see each of these boxplots lining up along the y-axis. It looks like the average age of democrats and republicans is a bit higher than that of independent and other American voters. Is this difference just random noise, though?
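To answer that, we first need the observed statistic. The code that follows refers to an object named observed_f_statistic. A minimal sketch of how it could be computed with specify(), hypothesize() and calculate(), assuming infer is loaded:

```r
library(infer)

# calculate the observed F statistic for age across party affiliations
observed_f_statistic <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F")

observed_f_statistic
```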
The observed \(F\) statistic is 2.4842. Now, we want to compare this statistic to a null distribution, generated under the assumption that age and political party affiliation are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between the two variables.
We can generate() the null distribution using randomization. The randomization approach permutes the response variable relative to the explanatory variable, so that each person’s party affiliation is matched up with a random age from the sample in order to break up any association between the two.
```r
# generate the null distribution using randomization
null_distribution <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")
```
To get a sense for what this distribution looks like, and where our observed statistic falls, we can use visualize():
```r
# visualize the null distribution and test statistic!
null_distribution %>%
  visualize() +
  shade_p_value(observed_f_statistic, direction = "greater")
```
We could also visualize the observed statistic against the theoretical null distribution. Note that we skip the generate() and calculate() steps when using the theoretical approach, and that we now need to provide method = "theoretical" to visualize():
```r
# visualize the theoretical null distribution and test statistic!
gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  visualize(method = "theoretical") +
  shade_p_value(observed_f_statistic, direction = "greater")
```
To visualize both the randomization-based and theoretical null distributions and get a sense of how the two relate, we can pipe the randomization-based null distribution into visualize() and provide method = "both":
```r
# visualize both null distributions and the test statistic!
null_distribution %>%
  visualize(method = "both") +
  shade_p_value(observed_f_statistic, direction = "greater")
```
Either way, it looks like our observed test statistic would be fairly unlikely if there were actually no association between age and political party affiliation. More exactly, we can calculate the p-value:
```r
# calculate the p value from the observed statistic and null distribution
p_value <- null_distribution %>%
  get_p_value(obs_stat = observed_f_statistic, direction = "greater")

p_value
```
```
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.055
```
Thus, if there were really no relationship between age and political party affiliation, the probability that we would see a statistic as extreme as or more extreme than 2.4842 is approximately 0.055.
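To make the meaning of this p-value concrete: with direction = "greater", it should correspond to the share of permuted F statistics that are at least as large as the observed one. The sketch below recomputes both objects so that it runs on its own. The seed and the names manual_p_value and infer_p_value are illustrative assumptions, and because the permutations are random, the result will generally not equal 0.055 exactly:

```r
library(infer)

set.seed(1)  # arbitrary seed, only to make the permutations reproducible

# observed F statistic
observed_f_statistic <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F")

# permutation-based null distribution of F statistics
null_distribution <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# proportion of permuted statistics at least as large as the observed one
manual_p_value <- mean(null_distribution$stat >= observed_f_statistic$stat)

# p-value as reported by infer, for comparison
infer_p_value <- null_distribution %>%
  get_p_value(obs_stat = observed_f_statistic, direction = "greater")

manual_p_value
infer_p_value
```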
The package currently does not supply a wrapper for tidy ANOVA tests.
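One option outside of infer is base R’s aov(), which fits the classical, theory-based one-way ANOVA. This is only a sketch of an alternative, not an infer workflow. The names anova_fit and anova_table are illustrative, and the theory-based p-value it reports need not match the permutation-based one exactly:

```r
library(infer)  # only needed here for the gss dataset

# classical one-way ANOVA of age on party affiliation with stats::aov()
anova_fit <- aov(age ~ partyid, data = gss)

# ANOVA table with degrees of freedom, sums of squares, F value and p-value
anova_table <- summary(anova_fit)
anova_table
```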