Monday, January 05, 2004

(a stats thing, boring and trivial, so much so that i don't really know why i'm posting it)

Today's late afternoon distraction has been some items from the 1998 General Social Survey. Respondents were told "Women are more likely than men to take care of children. I'm going to read several reasons why this might be so. Please tell me how important you think each reason is: very important, important, somewhat important, or not at all important."

The five explanations given were:
1. Women are biologically better-suited to care for children.
2. Women are taught from childhood how to care for children.
3. The way society is set up, women don't have much choice.
4. Men have more freedom to do other things.
5. It is God's will that women care for children.
One might expect that some of these explanations would be inversely correlated with one another: for example, we might think that people who think explanation #1 is "very important" would be more likely to regard explanation #3 as not important, etc. We might likewise expect negative correlations for items #3 and #4. However, when we look at the correlation matrix for these variables, this is not what we observe.

        |  fekids1  fekids2  fekids3  fekids4  fekids5
fekids1 |   1.0000
fekids2 |   0.5270   1.0000
fekids3 |   0.1866   0.2415   1.0000
fekids4 |   0.1536   0.2101   0.4708   1.0000
fekids5 |   0.4501   0.3857   0.2611   0.2216   1.0000
All of the correlations are positive. People who rated any one of these explanations as "very important" were more likely to rate any other explanation as also "very important." Instead of observing some positive and some negative correlations, we end up treating a correlation like it was negative just because it was less positive than other correlations.

This happens all the time with survey data and is cause for at least mild despair. The reason for it is that different people use the answer categories differently. Some people think all of the explanations provided are at least "important," while other people don't find any of them anything more than "somewhat important." Another way of saying this is that even though respondents were given four answer categories, many of them didn't use all four.
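To make the mechanism concrete, here is a minimal sketch in Python using made-up ratings, not the GSS data. It assumes two items on a 1-4 scale with higher meaning more important (a hypothetical coding), and three hypothetical groups of respondents who each stick to a different pair of adjacent categories. Within every group the two items trade off perfectly, yet the pooled correlation comes out positive, because the differences between groups in how the scale is used swamp the trade-off.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: each tuple is one respondent's rating of (item A, item B).
# Every group uses only two adjacent categories, and within a group
# whoever rates A higher rates B lower.
groups = {
    "low":  [(1, 2), (2, 1)],
    "mid":  [(2, 3), (3, 2)],
    "high": [(3, 4), (4, 3)],
}

# Correlation between A and B inside each group.
within = {name: pearson(*zip(*rows)) for name, rows in groups.items()}

# Correlation between A and B with everyone pooled together.
pooled_rows = [row for rows in groups.values() for row in rows]
pooled = pearson(*zip(*pooled_rows))

print(within)
print(round(pooled, 4))
```

The within-group correlations are all negative while the pooled one is positive, which is the same pattern as an item pair we expected to be negative showing up as merely "less positive" in the full sample.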

See what happens to the correlation matrix when we take out all the people who gave the same rating to all five items:

        |  fekids1  fekids2  fekids3  fekids4  fekids5
fekids1 |   1.0000
fekids2 |   0.4685   1.0000
fekids3 |   0.0916   0.1481   1.0000
fekids4 |   0.0582   0.1170   0.4055   1.0000
fekids5 |   0.3990   0.3256   0.1740   0.1271   1.0000

The correlations are still all positive, but they are somewhat smaller than before. Ultimately, there weren't that many people who gave the same answer to all five. Let's also eliminate people who used only two answer categories for all five items--in other words, let's get rid of the effect of there being some people who answered everything in terms of "very important"/"important" and other people who answered everything in terms of "important"/"somewhat important". Here:

        |  fekids1  fekids2  fekids3  fekids4  fekids5
fekids1 |   1.0000
fekids2 |   0.3230   1.0000
fekids3 |  -0.1201  -0.0706   1.0000
fekids4 |  -0.1915  -0.1426   0.2307   1.0000
fekids5 |   0.2436   0.1338  -0.0389  -0.0934   1.0000
There, now we have negative correlations for pairs of items that we might have thought should have negative correlations.
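For anyone who wants to try the two filters at home, here is a rough sketch in Python. It is not the code behind the tables above, and the responses are made up. It assumes each respondent is a tuple of five complete ratings (no missing values), and the names `keep_varied` and `corr_matrix` are just illustrative helpers. Dropping people who gave the same rating to every item amounts to requiring at least two distinct categories; dropping the two-category people as well amounts to requiring at least three.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def keep_varied(rows, min_distinct):
    """Keep respondents who used at least `min_distinct` different categories."""
    return [row for row in rows if len(set(row)) >= min_distinct]

def corr_matrix(rows):
    """Full item-by-item Pearson correlation matrix (rows = respondents)."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]

# Made-up responses: one tuple per respondent, five ratings on a 1-4 scale.
responses = [
    (1, 1, 1, 1, 1),  # same rating for everything
    (2, 2, 2, 2, 2),  # same rating for everything
    (1, 2, 1, 2, 1),  # only two categories used
    (2, 3, 3, 2, 3),  # only two categories used
    (1, 2, 4, 3, 1),  # three or more categories used
    (1, 1, 3, 4, 2),
    (4, 3, 1, 1, 3),
    (3, 4, 2, 1, 4),
]

no_straightliners = keep_varied(responses, 2)  # first filter
three_plus = keep_varied(responses, 3)         # second filter

for row in corr_matrix(three_plus):
    print(" ".join(f"{r:8.4f}" for r in row))
```

With real data, a column that ends up constant after filtering would make `pearson` divide by zero, so that case would need handling.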

Of course, we can't throw people away just because they used only two answer categories. What to do? There is an underutilized solution, but, alas, I don't have time to type it out here. Plus, I wouldn't just want to hand it to the many statistical spies who monitor this weblog.
