A recent proposal to "redefine statistical significance" (Benjamin et al., Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and that replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P<0.05 to P<0.005. I analyze the veracity of these claims, focusing especially on how Benjamin et al. neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for, the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) the claimed improvements to false positive rate and replication rate in Benjamin et al. (2017) are exaggerated and misleading; (ii) there are plausible scenarios under which the lower cutoff will make the replication crisis worse.
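The headline numbers in such proposals come from a standard screening calculation. A minimal sketch in Python (the prior probability and power values below are hypothetical, chosen only for illustration and not taken from the paper): the false positive risk falls sharply when the cutoff drops from 0.05 to 0.005, but only if reported P-values are taken at face value, i.e., with no P-hacking.

```python
# Illustrative screening model (values hypothetical): among many tested
# hypotheses, a fraction prior_true are genuinely true; tests on true
# hypotheses reach significance with probability `power`, and tests on
# false hypotheses reach significance with probability `alpha`.

def false_positive_risk(alpha, power=0.8, prior_true=0.1):
    """P(hypothesis is false | result is significant) under the assumed model."""
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return false_positives / (false_positives + true_positives)

fpr_05 = false_positive_risk(0.05)    # conventional cutoff -> 0.36
fpr_005 = false_positive_risk(0.005)  # proposed cutoff -> approx. 0.053
```

Under these assumed inputs the lower cutoff cuts the false positive risk by far more than a factor of two; the point of the analysis above is that this arithmetic breaks down once P-hacking distorts the distribution of reported P-values.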
Commentary by Andrew Gelman.
Response by one of the authors (E.J. Wagenmakers) at the Bayesian Spectacles blog.
More discussion by Tim van der Zee.
Caron and Fox tout their proposal as
"the first fully generative and projective approach to sparse graph modelling [...] with a notion of exchangeability that is essential for devising our scalable statistical estimation procedure." (p. 12, emphasis added).In calling theirs the first such approach, the authors brush aside prior work of Barabasi and Albert (1999), whose model is also generative, projective, and produces sparse graphs. The Barabasi–Albert model is not exchangeable, but neither is the authors’. And while the Barabasi–Albert model is inadequate for most statistical purposes, the proposed model is not obviously superior, especially with respect to the highlighted criteria above.
I applaud the authors’ advocacy for subjectivity in statistical practice and appreciate the overall attitude of their proposal.
But I worry that the proposed virtues will ultimately serve as a shield to deflect criticism, much like objectivity and subjectivity
often do now. In other words, won’t acceptance of 'virtue' as a research standard soon be supplanted by the "pursuit to merely appear" virtuous?
I believe Gelman and Hennig when they assert, "[W]e repeatedly encounter publications in top scientific journals that fall foul of these virtues" (p. 27). I’m less convinced, however, that this "indicates [...] that the underlying principles are subtle". This conclusion seems to conflate doing science and publishing science. In fact, I suspect that most scientists are more or less aware of these virtues, and many would agree that these virtues are indeed virtuous for doing science. But I’d expect those same scientists to acknowledge that some of these virtues may be regarded as vices in the publishing game. Just think about the lengths to which journals go to maintain the appearance of objectivity. They achieve this primarily through peer review, which promises transparency, consensus, and impartiality, three of Gelman and Hennig's 'virtues', but rarely delivers any of them. It should be no surprise that a system so obsessed with appearances also tends to reward research that 'looks the part'. As "communication is central to science" (p. 6) and publication is the primary means of scientific communication, is it any wonder that perverse editorial behaviors heavily influence which virtues are practiced and which are merely preached?
Finally, I ask: just as statistical practice is plagued by the "pursuit to merely appear objective", is science not also plagued by the pursuit to 'appear statistical'? Judging from well publicized issues, such as p-hacking (Gelman and Loken, 2014; Nuzzo, 2014; Wasserstein and Lazar, 2016), and my own conversations with scientists, I’d say so. To borrow from Feyerabend (2010, p. 7), "The only principle that does not inhibit progress is: anything goes". So why not simply encourage scientists to make convincing, cogent arguments for their hypotheses however they see fit, without having to check off a list of 'virtues' or run a battery of statistical tests?
Wasserman (2012) invites us to imagine "a world without referees". Instead, I’m envisioning a world without editors, journals, or statistics lording over science and society. Without 'objectivity' obscuring the objective, and without 'virtues' standing in the way of ideals. That world looks pretty good to me.
Commentary on how probabilistic predictions can be both correct and meaningless at the same time, with a focus on the 2016 presidential election.
Some concluding remarks regarding my 2016 article, "The ubiquitous Ewens sampling formula", which was discussed by Arratia, Barbour & Tavaré, Favaro & James, Feng, McCullagh, and Teh.