Fully one in eight psychology papers analyzed by a team of statisticians at Tilburg University, drawn from eight of the field's flagship journals, was found to contain at least one "gross inconsistency": a p-value wrongly reported as significant or non-significant. More damningly, the majority of gross inconsistencies were cases where the correctly calculated p-value would have been non-significant. Half of the more than 30,000 papers analyzed in their study, published last month, included at least one p-value inconsistent with its stated test statistic and degrees of freedom; of the quarter million p-values checked, 9.7% were inconsistent.

The team, led by Michèle B. Nuijten, used statcheck, an R package developed by Nuijten that recomputes p-values reported in APA style, "test statistic (df1, df2) =/< …, p =/< …". This is somewhat limiting: the program cannot check p-values presented in tables, or reports with other information, like an effect size, between the test statistic and the p-value, and it cannot tell whether a statistic determines the conclusions of the article or is just a side result. The authors also had to use a slightly crude method to account for one-tailed tests (that is, tests where the p-value is the probability that chance alone would produce a result as far from the null hypothesis *in that direction* as the experimental result, rather than in either direction): statcheck first assumes a test is two-tailed, and if the reported p-value comes up inconsistent, it recalculates under the assumption that the test is one-tailed. If the result is consistent under that assumption, it is marked consistent.
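That recompute-and-fall-back procedure can be sketched in a few lines. To be clear, this is not statcheck's code, just a minimal illustration for t-tests, assuming SciPy for the t distribution; the regex, the `check` function name, and the rounding rule are my own simplifications.

```python
import re

from scipy import stats

# Matches APA-style t-test reports such as "t(28) = 2.20, p < .05".
# (statcheck's real patterns also cover F, r, chi-square, and z tests.)
APA_T = re.compile(r"t\((\d+)\)\s*=\s*(-?\d*\.?\d+)\s*,\s*p\s*([<=])\s*(\.\d+)")

def check(report):
    """Recheck a reported t-test p-value, roughly as described above:
    recompute assuming two-tailed, then fall back to one-tailed on a
    mismatch. Returns True/False for (in)consistent, None if no
    APA-style t-test was found in the text."""
    m = APA_T.search(report)
    if m is None:
        return None
    df, t = int(m.group(1)), float(m.group(2))
    comparator, p_str = m.group(3), m.group(4)
    p_reported = float(p_str)
    decimals = len(p_str.split(".")[1])

    def matches(p_computed):
        # "p < x" just needs the recomputed value below x; "p = x"
        # must agree after rounding to the reported precision.
        if comparator == "<":
            return p_computed < p_reported
        return round(p_computed, decimals) == p_reported

    p_two = 2 * stats.t.sf(abs(t), df)  # two-tailed p from the t distribution
    if matches(p_two):
        return True
    # Crude one-tailed fallback, as in the paper: halve the p and retry.
    return bool(matches(p_two / 2))
```

For example, "t(28) = 1.80, p < .05" fails the two-tailed check (the recomputed p is about .08) but is rescued by the one-tailed fallback, mirroring the procedure the paper describes.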

Despite these caveats, the methodology seems fairly solid. In the paper's appendix, statcheck is compared against a manual check by one of the co-authors (Wicherts et al.) that had the same purpose. Because of its APA-style specificity, statcheck captured about two thirds of the p-values in the relevant papers, and the two methods found fairly close rates of inconsistency. With one-tailed test detection, statcheck found a higher percentage of errors than Wicherts et al. (7.2% versus 4.3%), but only slightly more gross errors (1% versus .9%). The share of papers with at least one inconsistency was around 50% in all three checks. Some results statcheck flagged as errors that Wicherts et al. did not were cases where the p-value was reported as p = .000, which the authors say should be reported as p < .001 (p = 0 being impossible). Statcheck clearly isn't perfect, but it appears to be close enough to conclude that somewhere near half of the papers analyzed had inconsistent p-values.
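The p = .000 case, at least, is easy to guard against when writing up results. A minimal formatter following that APA convention might look like this (a hypothetical helper of my own, not part of statcheck):

```python
def format_p(p):
    """Report a p-value APA-style: exact to three decimals, but never
    'p = .000' -- anything below .001 becomes 'p < .001' instead."""
    if p < 0.001:
        return "p < .001"
    # APA style drops the leading zero, since p cannot exceed 1.
    return "p = {:.3f}".format(p).replace("0.", ".")
```

So `format_p(0.0004)` yields "p < .001" rather than the impossible "p = .000".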

The percentage of articles with inconsistencies varied by journal from 33.6% to 57.6%, and the mean percentage of inconsistent results per article varied from 6.7% to 14%, as shown in the graph below, from Nuijten et al.

The researchers are fairly generous with explanations for why the mistakes lean toward wrongly reported significance. Their explanations include: researchers double-checking nonsignificant results when they expected significance, but rarely having cause to do the reverse; researchers rounding down but not up; and publication bias favoring significance, so that papers where significance is accidentally found get published while results accidentally marked non-significant never make it into a journal. The paper also finds positive news, contradicting some studies that checked p-values by hand: the proportion of inconsistent p-values does not appear to be increasing, even though conventional wisdom in the social sciences holds that questionable research practices and poor use of statistical techniques are on the upswing. The authors also argue that psychology researchers should start running statcheck on their own work. Hopefully the measure does not become a target, however.

The lesson for journalists, regardless of the trends in p-value usage and the causes of their misuse, is clear: be very, very careful with psychology papers. Given the large replication study in August that found half of psychology studies fail to reproduce, it seems likely that some combination of statistical inconsistencies, publication bias, and unexplained experimenter effects makes a large proportion of psychological studies questionable at best. Nuijten et al. look only at psychology, because other social sciences do not adhere as closely to APA or any other statistical reporting format, but caution when reporting on those fields (and on the life sciences) also seems well advised. Researchers who depend on statistical techniques but are not themselves statisticians seem naturally prone to mistakes, and journalists covering them should try to be at least as savvy, so that we can check their tests for ourselves.

Statcheck is a simple, open source R package. Never write about a study without running a quick statcheck first.