The math behind a crosstab

The Chi-square test of independence (also called the Chi-squared test) is a standard test of association between two categorical variables. It determines whether there is a statistically significant relationship between the variables. If the two categorical variables are independent of one another, knowing the value of one provides no information about the value of the other. If one depends on the other, it can be worthwhile to examine their relationship.

The crosstab does the Chi-square test for you, but it can be helpful to understand how the system derives the values in question.

We will use gender as our first categorical variable and favorite color as our second.



What is your favorite color?
Gender Yellow Green Blue Red Orange Black Purple Row Total
Female 137 320 754 369 74 159 449 2262
Male 59 343 1188 454 120 155 112 2431
Column Total 196 663 1942 823 194 314 561 4693

Now, we compute the counts we would expect in each cell if the variables were independent. The row and column totals are used to calculate the expected count for each Gender/Color combination: we multiply the row total by the column total, then divide the result by the grand total.

For the Male/Orange combination, that's 2431 * 194, which is 471614, divided by 4693, or about 100.49. Our expected value (rounded to the nearest whole number) is 100. I've put the expected value for each cell in parentheses and used red text to differentiate it.



What is your favorite color?
Gender Yellow Green Blue Red Orange Black Purple Row Total
Female 137 (94) 320 (320) 754 (936) 369 (397) 74 (94) 159 (151) 449 (270) 2262
Male 59 (102) 343 (343) 1188 (1006) 454 (426) 120 (100) 155 (163) 112 (291) 2431
Column Total 196 663 1942 823 194 314 561 4693
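
If you'd like to reproduce the expected counts yourself, here is a minimal Python sketch of the row total * column total / grand total calculation. The variable names (`observed`, `row_totals`, `col_totals`, `expected`) are our own and are only for illustration; this is not the code the crosstab runs.

```python
# Observed counts copied from the table above.
colors = ["Yellow", "Green", "Blue", "Red", "Orange", "Black", "Purple"]
observed = {
    "Female": [137, 320, 754, 369, 74, 159, 449],
    "Male": [59, 343, 1188, 454, 120, 155, 112],
}

# Row totals, column totals, and the grand total.
row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count = row total * column total / grand total,
# rounded to the nearest whole number as in the walkthrough.
expected = {
    gender: [round(row_totals[gender] * col / grand_total) for col in col_totals]
    for gender in observed
}

for gender, counts in expected.items():
    print(gender, dict(zip(colors, counts)))
```

The Male/Orange entry in `expected` corresponds to the 2431 * 194 / 4693 calculation worked through above.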


Now, we will calculate the difference between the actual and expected values for every combination, square that difference, and divide the result by the expected value for that cell. Adding all of those values gives us the test statistic. Using our Male/Orange cell as an example, the actual value is 120. The expected value is 100. The difference is 20, and squaring 20 gives us 400. Dividing 400 by 100 gives us 4. I've placed that in curly brackets and used green text to differentiate it.



What is your favorite color?
Gender Yellow Green Blue Red Orange Black Purple Row Total
Female 137 (94) {19.67} 320 (320) {0} 754 (936) {35.39} 369 (397) {1.97} 74 (94) {4.26} 159 (151) {0.42} 449 (270) {118.67} 2262
Male 59 (102) {18.13} 343 (343) {0} 1188 (1006) {32.93} 454 (426) {1.84} 120 (100) {4} 155 (163) {0.39} 112 (291) {110.11} 2431
Column Total 196 663 1942 823 194 314 561 4693

To get our test statistic, we add all the values in curly brackets (the green text) to get 347.78. That's our Χ² statistic.
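
The whole calculation can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the system's implementation: the function name `chi_square_statistic` and the `round_expected` switch are hypothetical. The switch is there because the walkthrough rounds each expected count to a whole number first, and rounding shifts the result a little.

```python
observed = [
    [137, 320, 754, 369, 74, 159, 449],   # Female
    [59, 343, 1188, 454, 120, 155, 112],  # Male
]


def chi_square_statistic(table, round_expected=False):
    """Sum (actual - expected)^2 / expected over every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, actual in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if round_expected:
                # Mirror the walkthrough, which uses whole-number expected counts.
                expected = round(expected)
            statistic += (actual - expected) ** 2 / expected
    return statistic


stat_rounded = chi_square_statistic(observed, round_expected=True)
stat_unrounded = chi_square_statistic(observed, round_expected=False)
print(f"with rounded expected counts:   {stat_rounded:.2f}")
print(f"with unrounded expected counts: {stat_unrounded:.2f}")
```

With rounded expected counts, the result should land very close to the 347.78 we added up by hand. The unrounded version comes out slightly different.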

Next, we need to calculate our degrees of freedom (df), which depends on how many rows and columns we have. The formula is df = (r - 1) * (c - 1). We have two rows and seven columns, so df = 6.

We need our significance level, α. CivicScience uses a significance level of 0.05 for all crosstabs.

Now, we use a published Chi-square distribution table (PDF) to look up the critical Chi-square value for 6 degrees of freedom and a 0.05 significance level. The critical value is 12.592.

Next, we compare the value of our test statistic (347.78) to the Chi-square value of 12.592. Because 347.78 > 12.592, we reject the hypothesis that gender and favorite color are independent. We can conclude that the variables have a relationship, though we can't say what kind.
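
As a rough sketch, the degrees-of-freedom formula and the comparison look like this in Python. The helper names are hypothetical, and the 12.592 value is simply copied from the table lookup above rather than computed.

```python
def degrees_of_freedom(rows, cols):
    """df = (r - 1) * (c - 1) for a table with r rows and c columns."""
    return (rows - 1) * (cols - 1)


def reject_independence(test_statistic, critical_value):
    """Reject the independence hypothesis when the statistic exceeds the critical value."""
    return test_statistic > critical_value


df = degrees_of_freedom(rows=2, cols=7)

# Critical value for df = 6 at a 0.05 significance level,
# taken from the published Chi-square table (not computed here).
CRITICAL_VALUE_DF6_ALPHA_005 = 12.592

test_statistic = 347.78  # the hand-calculated value from the walkthrough
decision = reject_independence(test_statistic, CRITICAL_VALUE_DF6_ALPHA_005)

print(f"df = {df}")
print(f"reject independence: {decision}")
```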

The same variables are shown in a crosstab. The percentage of difference from the expected value is shown in parentheses if it is significant. The text is red for values that are lower than expected and green for those that are higher.

The variables show a high strength of association, as stated in the summary line at the bottom.


In the summary line, you'll see the number of responses, the significance expressed by the Χ² value, the df (degrees of freedom) value, the p-value, and Tschuprow's T strength of association measure.

Χ² is the Chi-square statistic. This one differs slightly from what we calculated above due to rounding differences.

We also calculated our degrees of freedom (df) above.

The p-value corresponds to the Chi-square statistic. It represents the probability of seeing a Chi-square statistic at least this large if there were no relationship between our categorical variables. The lower the p-value, the more confident we can be that such a relationship exists. Results shown to be statistically significant have been adjusted according to the Benjamini-Hochberg false discovery rate procedure.
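
To see how a p-value relates to the Chi-square statistic, here is a small Python sketch. It relies on a closed form for the upper tail of the Chi-square distribution that only holds when df is even (ours is 6). The function name is hypothetical, the sketch does not apply the Benjamini-Hochberg adjustment, and it is not necessarily how the crosstab computes its p-value.

```python
import math


def chi_square_p_value_even_df(statistic, df):
    """Upper-tail probability of a Chi-square distribution with an even df.

    For df = 2k, P(X >= x) = exp(-x/2) * sum((x/2)^i / i! for i in 0..k-1).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive, even df")
    half = statistic / 2
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


# At the critical value from the table lookup, the p-value sits at about 0.05.
p_at_critical = chi_square_p_value_even_df(12.592, 6)

# For our hand-calculated statistic, the p-value is vanishingly small.
p_for_example = chi_square_p_value_even_df(347.78, 6)

print(f"p at critical value: {p_at_critical:.4f}")
print(f"p for 347.78:        {p_for_example:.3g}")
```

This also shows why comparing the statistic to 12.592 and comparing the p-value to 0.05 lead to the same decision.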

The T coefficient is Tschuprow's (sometimes spelled Chuprov's) T. Tschuprow's T measures the strength of the relationship between the variables. Its value ranges from 0 to 1. The higher the value for Tschuprow's T, the stronger the relationship between the variables.
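
For reference, the textbook definition of Tschuprow's T is the square root of (Χ² / n) divided by the square root of (r - 1) * (c - 1), where n is the number of responses. A minimal Python sketch follows. The function name is hypothetical, and we plug in the hand-calculated 347.78, so the result may not match the summary line exactly.

```python
import math


def tschuprows_t(chi_square, n, rows, cols):
    """Tschuprow's T = sqrt((chi_square / n) / sqrt((rows - 1) * (cols - 1)))."""
    return math.sqrt((chi_square / n) / math.sqrt((rows - 1) * (cols - 1)))


# Values from the walkthrough: hand-calculated statistic, 4693 responses,
# two rows (gender) and seven columns (favorite color).
t_value = tschuprows_t(chi_square=347.78, n=4693, rows=2, cols=7)
print(f"Tschuprow's T = {t_value:.3f}")
```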