Sunday, March 26, 2006

my crazy idea revealed!

As mentioned, I am doing a presentation at the sociology methodology meetings next month, and my intention is to make an argument so radical (for sociology) that I fear I could be judged insane.

The argument is that if quantitative sociologists have enough confidence in their results to publish them in one of the discipline's journals, they should have enough confidence to deposit the code that produces these results in an independent public online archive (like this one) at the time of the article's publication.*

Yes! I am really this crazy!

Everyone agrees that good data-analytic practice implies having a set of code that takes one all the way from a pristine data set to the numbers presented in a paper. This code serves as an implicit technical appendix to any published quantitative article. Since this code is already presumed to exist, why not make it publicly available? Not just "upon request," but available up front. Not just "available on your webpage," but available in a place where it will still exist even if you quit sociology.
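The sort of start-to-finish script described above can be sketched in miniature. (Everything here is hypothetical — the data, the variable names, the recoding rule — and a real sociology project would more likely be a Stata, SAS, or SPSS syntax file; the point is only the shape of the thing: one file, pristine data in, published number out.)

```python
import math

# Stand-in for the pristine data file deposited alongside the article
# (hypothetical values; a real script would read the archived raw file).
RAW = [
    {"educ": 12, "wage": 30000},
    {"educ": 16, "wage": 52000},
    {"educ": 14, "wage": 41000},
    {"educ": 18, "wage": 60000},
]

def analyze(raw):
    """Every step from untouched data to the number reported in the paper."""
    # Recoding decisions live in code, not in undocumented menu clicks.
    recoded = [
        {"college": row["educ"] >= 16, "log_wage": math.log(row["wage"])}
        for row in raw
    ]
    college = [r["log_wage"] for r in recoded if r["college"]]
    other = [r["log_wage"] for r in recoded if not r["college"]]
    # The "number presented in the paper": the college log-wage gap.
    return sum(college) / len(college) - sum(other) / len(other)

if __name__ == "__main__":
    print(round(analyze(RAW), 3))
```

Anyone holding the deposited data and this one file can regenerate the reported figure exactly; nothing depends on an interactive session that lives only in the author's memory.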

And not just as a matter of good individual practice, but as a matter of collective practice. This is something we should insist that researchers do if they want their work to appear in the discipline's major journals. This should be part of the price of admission for publishing.

I look around quantitative sociology and ask: what is the simplest thing that could be normativized or institutionalized that would increase the quality and credibility of quantitative work done in the discipline? I think this is it. Besides which, I find it absurd for sociologists to stand around lamenting how everyone gives economics so much more credibility than sociology, when the flagship journal of economics holds its researchers to this standard and sociology has only a vague and completely toothless statement that researchers should "permit" others to verify their results.

The title of my presentation is "Reproducibility Standards for Quantitative Social Science: Why Not Sociology?" Let me know if you have any reactions to this blog précis.

* If they have custodial rights over the data, they should be depositing that, too, but I don't want the complications and politics that surround data-sharing to be used as grounds to dodge making the code available. The confidentiality and exclusivity arguments that are employed against broader data-sharing evaporate when you focus the standard on the code. (This is, indeed, the only part of my argument that is remotely original.) I do think that people who have custodial rights over data should be expected to say something about the availability of that data. In other words, if a researcher's stance is that "For confidentiality reasons, no extract of these data can be given to outside researchers, even for the purposes of verifying results," I think this is something that the reviewers and audience for the article have a right to know up front.

BTW, the ICPSR-PRA archive instructions are a little misleading in that they make it sound like it's only for depositing data, but you can deposit code there without data. The Murray Archive at Harvard also accepts code, and presumably there are others.


Kieran said...

Rob Gentleman and Duncan Temple Lang have been doing some work on this. Take a look at this paper by Gentleman:

It uses the Sweave literate programming framework (which is available in R) to apply the kind of principle you have in mind. The idea is to provide the material to reproduce the article itself -- the analysis and the writeup are integrated in the Sweave approach. If you're not familiar with how this works, there's a short discussion and example (with code) here:
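For anyone who hasn't seen it, a Sweave file is an ordinary LaTeX document with R code chunks embedded in it; running the file through R executes the chunks and weaves their output into the paper itself. A minimal sketch (the model and variable names are invented for illustration):

```latex
\documentclass{article}
\begin{document}
\section*{Results}

% A code chunk: Sweave runs the R code and inserts its output here.
<<model, echo=TRUE>>=
fit <- lm(wage ~ educ + exper, data = survey)
summary(fit)
@

% Computed quantities can also be woven directly into the prose, so the
% numbers in the text can never drift out of sync with the analysis.
The estimated return to a year of education is
\Sexpr{round(coef(fit)["educ"], 3)}.

\end{document}
```

Running `Sweave("paper.Rnw")` in R produces the `.tex` file that gets compiled into the article, so depositing the `.Rnw` source deposits the analysis and the write-up at once.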

The archive for the Gentleman paper is here.

I predict the reaction you will get from audiences is not "You're crazy" but rather "You first."

jeremy said...

I agree fully with the "You first" thing. My plan is to retro-deposit materials from at least one of my papers between now and the conference.

My officemate is a big Sweave fan.

Anonymous said...

not a crazy idea, just one that will elicit resistance because the reward structure isn't there for people to share code that took lots of time to construct. if you can get published in the top journal without sharing code that took you weeks to write, why share it? (i know one could say "because it's the 'right' thing to do, it will make sociology more legitimate, etc." but that's not going to change behavior unless a stronger incentive is in place). i'm sympathetic to your cause; just pointing out what i see as a practical problem. kudos to you for working on changing things and finding a solution.

Corey said...

Not crazy at all. But it may be impractical.

You may want to consider that a large proportion of sociologists don't use code at all. [Yes, they should use code; their evil stats package [#P##]* has a code language; but many have never been forced to learn how to use that language... they are a point-and-click generation, if you will.]

I spent three long years at ICPSR preparing datasets and providing user support. I think you'd be astonished by the sheer number of sociologists who are unfamiliar and uncomfortable with syntax code. My, did they complain about the archaic "data definition statements." Why, they asked, couldn't we release data in a format that would just load in #P## via the File --> Open dialog?

My colleagues here at West Virginia are a case in point. The #P## users uniformly use the point-and-click interface; whenever I encourage them to try the paste-function and to save syntax files, they stare at me blankly. [But hey, they have tenure and I don't, so I let it drop].

Perhaps Sociological methodologists should follow the lead of the Political Methodologists (Gary King and company) in talking about how results are generated.

* My user support logs from ICPSR suggest that the majority of those calling for assistance use SPSS. SAS and Stata users tend to be more comfortable (and competent) with syntax and coding in general. Despite better software (e.g., Stata) being available, SPSS seems to be the package most people are trained to use. It seems to me that Sociology lags far behind Political Science in terms of diversity in research tools.

P.S. I do recognize that schools like Indiana, Michigan, Wisconsin, Princeton, (fill in your school if it has broken away from the #P## orthodoxy) teach students an array of packages. My point is that more should do so.

P.P.S. I really like the approach Bill Gould of StataCorp recommends for syntax code and file management. Anyone who is new to Stata should invest a hundred bucks in the first programming NetCourse.

Anonymous said...

My only fear about making code publicly available is that others would cut and paste code out of laziness. Syntax is written with a particular theory in mind: I may want to create a variable for one test of a theory, but that variable is not good for other tests of different theories. I would hate to see a situation in which others simply cut and paste syntax for their own use. How would you cite another writer if you used some of their syntax? Should you copyright your syntax?

Brayden said...

Although the idea has the potential to improve the reproducibility of our findings, I think it would do more to improve the transparency (and credibility) of the research process. So many people use different statistical packages that finding someone who uses the same package as you and is also working on creating a similar variable might involve a little luck. But I like the idea a lot, because of what it could do for the credibility of our findings!