11.4. Summary
11.4.1. Terminology Review
Use the flashcards below to help you review the terminology introduced in this chapter.
11.4.2. Key Take-Aways
Introduction
Categorical data is not numerical but instead takes on one of a set of categories.
Categorical data is either ordinal or nominal.
In ordinal data, the categories have a natural ordering. Examples include Likert scale data and income-range data.
In nominal data, the categories do not have a natural ordering. Examples include handedness and country of citizenship.
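As a brief illustration (the Likert responses below are hypothetical), pandas can store ordinal data in an ordered Categorical, which records the natural ordering of the categories:

```python
import pandas as pd

# Hypothetical Likert-scale responses (ordinal categorical data)
responses = ["Agree", "Neutral", "Agree", "Strongly agree", "Disagree"]

# An ordered Categorical records the natural ordering of the categories
likert = pd.Categorical(
    responses,
    categories=["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"],
    ordered=True,
)

# The ordering allows comparisons such as min() and max()
print(likert.min(), likert.max())
```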
Tabulating Categorical Data and Creating a Test Statistic
Two variables of categorical data can be summarized and compared using a (two-way) contingency table.
In a two-way contingency table, one variable is tabulated across the rows, while the other variable is tabulated across the columns.
Each cell contains the number of data points that match the corresponding row and column categories.
Contingency tables often include marginal counts, which are row and column sums.
Pandas has a `crosstab()` function to generate a contingency table given two Series or two columns of a DataFrame. If passed the keyword argument `margins=True`, it will include the marginal counts.
The expected contingency table gives the expected value for each cell if there is no dependence between the two variables. The expected value of a cell can be computed as the product of the marginal row sum and the marginal column sum, divided by the total number of entries in the table.
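The following is a minimal sketch using hypothetical handedness/device data; it builds a two-way contingency table with `crosstab()` and computes one expected cell count from the marginal sums:

```python
import pandas as pd

# Hypothetical data: handedness vs. preferred input device
df = pd.DataFrame({
    "handedness": ["right", "left", "right", "right", "left", "right", "left", "right"],
    "device": ["mouse", "trackpad", "mouse", "trackpad", "mouse", "mouse", "trackpad", "mouse"],
})

# Two-way contingency table; margins=True adds the marginal (row and column) sums
table = pd.crosstab(df["handedness"], df["device"], margins=True)
print(table)

# Expected count for one cell under independence:
# (marginal row sum) * (marginal column sum) / (total number of entries)
expected_right_mouse = table.loc["right", "All"] * table.loc["All", "mouse"] / table.loc["All", "All"]
print(expected_right_mouse)
```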
Scipy.stats has a function `stats.contingency.expected_freq()` that returns the expected table when its argument is a contingency table (without marginal sums).
The normalized squared differences between the observed and expected values in the table cells are of the form
\begin{equation*}
\frac{\left(O_i - E_i\right)^2}{E_i}.
\end{equation*}
In the above, \(O_i\) and \(E_i\) are the observed value and expected value, respectively, for cell \(i\).
The chi-squared statistic is the sum of the normalized squared differences.
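As a short sketch with a hypothetical observed table, the expected table and the chi-squared statistic can be computed as follows:

```python
import numpy as np
from scipy import stats

# Hypothetical observed contingency table (without marginal sums)
observed = np.array([[20, 30],
                     [25, 25]])

# Expected table under independence of the two variables
expected = stats.contingency.expected_freq(observed)

# Chi-squared statistic: sum of the normalized squared differences
chi2_stat = np.sum((observed - expected) ** 2 / expected)
print(expected)
print(chi2_stat)
```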
Null Hypothesis Significance Testing for Contingency Tables
A NHST can be conducted via resampling or analysis.
For resampling, the values of one of the variables should be shuffled at random to break up any dependence between the two variables. The \(p\)-value is then estimated as the relative frequency, over many such shufflings, of seeing a chi-squared statistic at least as large as the one observed in the original data.
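A minimal resampling sketch, using hypothetical handedness/device data, might look like the following:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=4)

# Hypothetical samples of the two categorical variables
handedness = pd.Series(["right", "left", "right", "right", "left",
                        "right", "left", "right", "right", "left"])
device = pd.Series(["mouse", "trackpad", "mouse", "trackpad", "mouse",
                    "mouse", "trackpad", "mouse", "trackpad", "mouse"])

def chi2_statistic(x, y):
    """Chi-squared statistic from the two-way contingency table of x and y."""
    observed = pd.crosstab(x, y).to_numpy()
    expected = stats.contingency.expected_freq(observed)
    return np.sum((observed - expected) ** 2 / expected)

observed_stat = chi2_statistic(handedness, device)

# Shuffle one variable to break up any dependence, then recompute the statistic
num_sims = 10_000
count = 0
for _ in range(num_sims):
    shuffled = rng.permutation(device.to_numpy())
    if chi2_statistic(handedness, shuffled) >= observed_stat:
        count += 1

# Estimated p-value: relative frequency of a statistic at least as large as observed
print(count / num_sims)
```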
For analysis, the chi-squared statistic is modeled as a chi-squared random variable.
The number of degrees of freedom of the chi-squared random variable is the number of degrees of freedom for the table. If the table has \(r\) rows and \(c\) columns, the number of dofs is \begin{equation*} n_{dof} = (r-1)(c-1). \end{equation*}
Then the probability of seeing a value of the chi-squared statistic as large as that observed in the data is equal to the survival function of the chi-squared random variable evaluated at the observed value of the chi-squared statistic.
Using Scipy.stats, we can create a chi-squared distribution with a specified number of degrees of freedom `dof` as `chi_rv = stats.chi2(dof)`.
Be careful to use `stats.chi2()` because `stats.chi()` is a different distribution.
Given the `chi_rv` object and the observed value of the chi-squared statistic `C`, the analytical value for the \(p\)-value is given by `chi_rv.sf(C)`.
An even better approach is to pass the contingency table (without marginal sums) to `stats.chi2_contingency()`, which will compute the \(p\)-value and apply an adjustment called Yates’ continuity correction to give a better estimate of the true \(p\)-value.
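A small sketch with a hypothetical 2x2 table, showing both the manual calculation and `stats.chi2_contingency()`:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table (without marginal sums)
observed = np.array([[20, 30],
                     [25, 25]])

# Manual approach: chi-squared statistic, degrees of freedom, and survival function
expected = stats.contingency.expected_freq(observed)
C = np.sum((observed - expected) ** 2 / expected)
r, c = observed.shape
dof = (r - 1) * (c - 1)
chi_rv = stats.chi2(dof)        # note: stats.chi2(), not stats.chi()
print(chi_rv.sf(C))             # analytical p-value

# Preferred approach: chi2_contingency computes the p-value directly and, by
# default, applies Yates' continuity correction for 2x2 tables
chi2_val, p_value, dof2, expected2 = stats.chi2_contingency(observed)
print(p_value)
```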
Chi-Square Goodness-of-Fit Test
In a chi-square goodness-of-fit test, discrete data values are tabulated in a one-way contingency table, where each cell represents a particular observed value, and the cell’s content is the number of times that value was observed.
The observed values in the one-way contingency table are compared to a reference distribution for the data, and a chi-squared statistic is computed.
In a NHST, the data is compared to some default distribution to see whether observed differences from that distribution could be attributed to randomness. A \(p\)-value is calculated based on the chi-squared statistic, and we either reject or fail to reject the null hypothesis.
For determining whether a particular distribution is a reasonable model for observed data, the parameters of the distribution are estimated from the data. This reduces the number of degrees of freedom by the number of estimated parameters.
A chi-squared statistic is computed from the data, and a \(p\)-value is computed based on how often such a large value of the chi-squared statistic would be seen under the model distribution. If the \(p\)-value is much larger than the significance threshold, the model is consistent with the data.
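As a sketch, here is a goodness-of-fit test of hypothetical die-roll counts against a fair-die reference distribution using `stats.chisquare()`:

```python
import numpy as np
from scipy import stats

# Hypothetical one-way table: counts of each face in 120 rolls of a die
observed = np.array([15, 22, 19, 25, 18, 21])

# Reference distribution: a fair die, so each face is expected 120 / 6 = 20 times
expected = np.full(6, observed.sum() / 6)

# Chi-square goodness-of-fit test; here dof = 6 - 1 = 5
statistic, p_value = stats.chisquare(observed, f_exp=expected)
print(statistic, p_value)

# If parameters of the reference distribution were estimated from the data,
# pass ddof=<number of estimated parameters> to further reduce the degrees of freedom
```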