Additional comments on "The Devil Is in the Digits" (July 10, 2009)

A key criticism leveled against Alex Scacco's and my Washington Post op-ed on the election in Iran is that we argue that a fair election is unlikely to produce a lot of variation in last-digit frequencies, but then use an inappropriate test in evaluating the data from Iran against this claim. We should have reported the results from a chi-square test, not the probability of particular digits occurring more or less often than expected.

Is a chi-square test the most appropriate statistic for this type of data? Yes. That's exactly why we report the result in the annotated version of our op-ed. (We initially reported only a nearly equivalent test statistic involving the standard deviation of last-digit frequencies, but since then we've clarified that this is the same result one obtains from a chi-square test.)

But is this test the most appropriate one for a general audience? Only if there isn't a more transparent alternative that captures the same intuition and gives the same substantive result. In our view, the test statistic we report is precisely such an alternative.

First, and most crucially, do we get the same substantive result? Some critics have claimed that once we use the appropriate statistical test to check for suspicious last-digit frequencies (a chi-square test), we don't get a significant result. That's one reason why our analysis is allegedly "bogus." But that's incorrect. When we use a chi-square test, the probability that a fair election would produce the last-digit patterns observed in Iran is .077. That's a significant result.
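The mechanics of such a test are straightforward. The sketch below applies a chi-square test to a set of last-digit counts; the counts are hypothetical stand-ins, since the actual Iran returns are not reproduced in this post, so the resulting p-value will not match our .077.

```python
# Chi-square test for uniform last-digit frequencies.
# The digit counts below are hypothetical, NOT the actual Iran data.
from scipy.stats import chisquare

# Hypothetical counts of last digits 0-9 across 120 vote totals
observed = [12, 10, 14, 11, 5, 9, 13, 20, 15, 11]

# Under a fair election each digit should appear 1/10 of the time;
# chisquare defaults to a uniform expected distribution.
stat, p = chisquare(observed)
print(f"chi-square = {stat:.2f}, p = {p:.3f}")
```

A p-value below .1 from this test is what we would flag as suspicious; with these made-up counts the statistic works out to about 11.83 on 9 degrees of freedom.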

Don't take our word for it. Take a look at the latest issue of the American Political Science Review, one of the most highly regarded journals in political science. Every single article that uses statistical methods describes a probability below .1 as a "significant" finding (see Shayo 161, 162; Feddersen et al. 188; Brown and Mobarak 202; Blattman 237, 240; Brown and Earle et al. 274; Denisova et al. 292; all in APSR Vol. 103 No. 2). One article happens to report a p-value of .076, almost identical to the one we obtain from a chi-square test, and says that this indicates a statistically significant result (Feddersen fn. 17). Are all of these statements, which made it past anonymous reviewers and an editorial board of well-regarded political scientists, bogus too?

We can quibble over the exact language we used to describe our finding. I'm happy to concede that we find "significant" as opposed to "strong" evidence, or that a fair election is "significantly unlikely" as opposed to "extremely unlikely" to produce the kind of digit patterns we see in the data from Iran. But the substantive conclusion doesn't change.

This is especially true because the op-ed's strong conclusion rests on a combination of two tests. The statement still stands that fewer than 1 in 200 fair elections will produce both the scarcity of non-adjacent digits and the suspicious deviations in last-digit frequencies we see in the data from Iran, even if we use a chi-square test instead of our test for extreme deviations (see fn. 15 of our annotated op-ed). In other words, using a chi-square test doesn't damage our main point.

That's not particularly surprising, because our test is similar in spirit to a chi-square test. We highlight only those deviations from what we'd expect to see in a fair election that are particularly large. A chi-square test does something similar: in computing the test statistic, we don't just sum the distances between expected and observed counts, we square those distances first. This ensures that large deviations weigh more heavily than small ones (e.g., six too many 7s will appear more suspicious than three too many 7s and three too many 8s).
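The effect of squaring can be seen in a toy calculation. Assuming 120 hypothetical vote totals, so that each last digit is expected 12 times, one concentrated deviation contributes twice as much to the chi-square statistic as the same total deviation split over two digits:

```python
# Why squaring makes one large deviation count more than two small ones.
# Hypothetical setup: 120 vote totals, so each digit is expected 12 times.
expected = 12

# Case A: the digit 7 appears six times too often (one large deviation)
case_a = 6**2 / expected                       # 36/12 = 3.0

# Case B: 7 and 8 each appear three times too often (two small deviations)
case_b = 3**2 / expected + 3**2 / expected     # 18/12 = 1.5

print(case_a, case_b)  # 3.0 1.5
```

Both cases involve six "extra" digits in total, but the concentrated deviation in Case A contributes twice as much to the statistic.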

No wonder then that the results (the p-values) generated by our test correlate highly with the results produced by a chi-square test, with a correlation coefficient of about .88. (I've posted R code to replicate the simulations used to estimate this correlation here.)

So if our test and a chi-square test produce similar results, why not just report the latter? Because we think our test is more accessible to the layperson. We report the probability that one numeral appears particularly frequently and another particularly infrequently in the last digit. With a chi-square test, we would instead have reported the probability of observing a particularly large sum of squared, standardized distances between expected and actual counts of numerals in the last digit. Put differently, we highlight digit frequencies themselves rather than a measure of the variability of digit frequencies. I think most would agree that the former is more transparent to the casual reader than the latter.
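A rough illustration of why the two approaches track each other can be simulated. This is not the posted R code: rather than nested simulation for exact p-values, the sketch correlates the two underlying statistics across simulated fair elections, and the choice of 116 counts per simulation (a 29-row, 4-column table of returns) is an assumption here:

```python
# Simplified sketch: correlate an extreme-frequency statistic (the spread
# between the most and least frequent last digit) with the chi-square
# statistic across simulated fair elections. This stands in for the full
# p-value comparison in the posted R code, which is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_counts = 5000, 116   # 116 vote totals per simulated election

spreads, chisqs = [], []
for _ in range(n_sims):
    digits = rng.integers(0, 10, size=n_counts)   # fair last digits
    counts = np.bincount(digits, minlength=10)
    expected = n_counts / 10
    spreads.append(counts.max() - counts.min())   # extreme-frequency spread
    chisqs.append(((counts - expected) ** 2 / expected).sum())

r = np.corrcoef(spreads, chisqs)[0, 1]
print(f"correlation between the two statistics: {r:.2f}")
```

The two statistics move together strongly, which is the intuition behind the .88 correlation between the p-values of the two tests.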

Let's turn to another criticism. Are we cherry-picking our results because we didn't also look at the frequencies of second-to-last digits? No. Anyone who looks at our pre-Iran working paper on election fraud can see that we've only proven and empirically assessed the claim that numerals in the last digits should occur with equal frequency in a "fair" election. The analysis we had done prior to Iran's election provides no justification for looking at frequencies of second-to-last digits to detect fraud, which is why we didn't run such a test. That's not cherry-picking. That's science.

It's true more generally that if we want to find either a significant or an insignificant result, we can tailor a test to our data that will deliver what we need. But that is precisely what we did not do, and our pre-Iran working paper corroborates this fact.

The analysis in that paper shows that two types of tests are effective, in the sense that they (a) have a theoretical rationale, (b) don't sound an alarm when they shouldn't (in "clean" election data from Sweden), and (c) sound an alarm when they should (in election data from Nigeria that were very probably manipulated). Those two kinds of tests focus on last-digit frequencies and on non-adjacency of last and second-to-last digits. These are exactly the tests we applied to the data from Iran, tests shown to be effective before Iran's election took place.
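The non-adjacency check can be sketched as follows. The vote counts are hypothetical, and the wrap-around convention treating 0 and 9 as adjacent is our assumption for this sketch, not necessarily the paper's:

```python
# Sketch of the non-adjacency check on the last two digits of a vote count.
# A pair counts as "non-adjacent" if the last and second-to-last digits are
# neither identical nor adjacent. Treating 0 and 9 as adjacent (wrap-around)
# is an assumption of this sketch. Counts are assumed to have >= 2 digits.
def non_adjacent(count: int) -> bool:
    last, second = count % 10, (count // 10) % 10
    diff = abs(last - second)
    adjacent = diff == 1 or diff == 9   # wrap-around: 0 and 9 adjacent
    return last != second and not adjacent

totals = [362, 871, 449, 505, 923]      # hypothetical vote counts
share = sum(non_adjacent(t) for t in totals) / len(totals)
print(f"non-adjacent share: {share:.0%}")
```

Under a fair election with this convention, 70 of the 100 equally likely digit pairs are non-adjacent, so roughly 70% of vote counts should pass the check; a markedly lower share is what raises suspicion.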