Replication: Is the glass half full, half empty, or irrelevant?

Ella Rhodes reports on the latest twist in the reproducibility tale.

07 March 2016

The findings of last year’s Reproducibility Project, which aimed to replicate 100 psychology studies but only managed to do so in around 40 per cent of cases, have been thrown into question by a new report. The Open Science Collaboration’s work, which some took as an indication of an ongoing ‘crisis’ in the field, has been openly questioned by a group led by Harvard psychologist Daniel Gilbert.

Gilbert, Gary King and Stephen Pettigrew (also Harvard), alongside Timothy D. Wilson (University of Virginia), wrote the comment piece for Science criticising the original Open Science Collaboration (OSC) study on three grounds: error, power and bias. They suggest the original report drew conclusions that were not supported, owing to statistical errors. Without these errors, they claim, the conclusion might have been that the reproducibility of psychological science is actually quite high.

On the topic of error, Gilbert and his colleagues write that even if an original study shows a true effect, a replication may fail to show that effect because of sampling error, since it is often impossible to reproduce the original experimental population. The OSC used a statistical benchmark that assumed sampling error to be the only source of error in the data; however, as Gilbert and his colleagues point out, the replications tended to differ quite substantially from the original studies.

Some examples of the methods and populations used in a few of the replications are also listed: ‘An original study that measured Americans’ attitudes toward African-Americans was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus was replicated with students who do not commute to school.’

They write that these infidelities in the replications are potential sources of random error beyond what the OSC’s benchmark allowed for. They also point to the low power of the OSC project, which attempted to replicate each study only once. By contrast, the Many Labs Project, led by one of the OSC’s corresponding authors, Brian Nosek, had 36 labs repeatedly replicate 16 original studies, producing 574 replications in all. Gilbert and his colleagues point out that this more powerful method led to a full 85 per cent of the original studies being replicated. If the MLP had used the OSC’s method, what would have happened to this result? The authors say the MLP would have reported a replication rate of only 34 per cent.
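To see why a single, modestly sized replication can miss a true effect, a rough simulation helps. The sketch below is purely illustrative and not drawn from either paper: it assumes a real standardised effect of d = 0.4 and a single replication attempt with 30 participants per group, numbers chosen only to make the power argument concrete.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2016)

# Illustrative assumptions (not taken from the OSC or Gilbert et al. papers):
# a true standardised effect of d = 0.4, one replication attempt per study,
# n = 30 participants per group, two-sided test at alpha = .05.
d, n, alpha, n_sims = 0.4, 30, 0.05, 10_000

def single_replication_succeeds() -> bool:
    """Simulate one two-group replication and test for a significant difference."""
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    diff = treatment.mean() - control.mean()
    se = np.sqrt(control.var(ddof=1) / n + treatment.var(ddof=1) / n)
    z = diff / se
    p = 2 * (1 - norm.cdf(abs(z)))   # normal approximation is fine for a sketch
    return p < alpha

success_rate = np.mean([single_replication_succeeds() for _ in range(n_sims)])
print(f"Proportion of single-shot replications that 'succeed': {success_rate:.0%}")
# Under these assumptions only around a third of replications reach significance,
# even though every simulated effect is real - the gist of the power criticism.
```

Pooling many replications of the same study, as the Many Labs approach does, would detect the same simulated effect far more often, which is why the two designs can yield such different headline replication rates.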

Finally, the authors point to a worrying hint that bias may have been at play in the original study. The OSC asked the authors of the original studies to review the methodology of each replication and to say whether or not they endorsed it. Comparing unendorsed with endorsed methodologies, the endorsed replications were almost four times as likely to succeed. They write: ‘If OSC had limited their analyses to endorsed studies, they would have found that 59.7 per cent... were replicated successfully. In fact, we estimate that if all the replication studies had been high enough in fidelity to earn the endorsement of the original authors, then the rate of successful replication would have been 58.6 per cent... when controlling for relevant covariates.’

The original OSC authors have published a rebuttal, saying: ‘Their very optimistic assessment is based on statistical misconceptions and selective interpretation of correlational data.’ But what have the media and academics made of this criticism?

Writing for The New York Times, Benedict Carey spoke to a researcher at the Wharton School of the University of Pennsylvania, Uri Simonsohn, who has blogged on the topic. He told Carey the original replication paper and the critique used statistical approaches that were 'predictably imperfect' for this kind of analysis. One way to think about the dispute, Simonsohn said, is that 'the original paper found that the glass was about 40 percent full, and the critique argues that it could be 100 percent full. In fact… State-of-the-art techniques designed to evaluate replications say it is 40 percent full, 30 percent empty, and the remaining 30 percent could be full or empty, we can’t tell till we get more data.'

Monya Baker, journalist and editor of Nature Reports Stem Cells, wrote that, according to statistician Andrew Gelman, replications tend to be more reliable guides than the original studies to the existence and strength of effects in psychology: ‘That’s in part because what is published in the original studies tends to be the statistical ‘flukes’ that are left standing after the researchers have cast around to find publishable, positive results. In contrast, for replication projects analysis plans are put in place before a study begins.’

Baker also spoke to Steve Lindsay, a psychologist at the University of Victoria in Canada and interim editor of the journal Psychological Science, who said: ‘We have a lot of reasons to believe that a lot of psychologists have for a long time tended to systematically exaggerate the effects of what they publish’. He added that the real urgency lay in improving bad practices.

The Washington Post’s Amy Ellis Nutt points out that in the original replication effort one study of race involving white students and black students at Stanford University was re-done at the University of Amsterdam. She writes: ‘After realizing their lack of fidelity to the original research, the center's scientists sought to remedy the situation by again repeating their work, this time at Stanford. When they did, Gilbert and his team found, the results were indeed reproducible. But this outcome was never acknowledged in the 2015 study.’

Reporting for The Verge, Jacob Kastrenakes writes that, for now, there is no real answer to which side in this debate is correct. He spoke to John Ioannidis, the Stanford professor who wrote the famous 2005 paper Why Most Published Research Findings Are False. Although not involved with the psychology study or its critique, Ioannidis said: ‘Even the top of the top scientists can disagree in interpretation of what are very solid results.’ He added that the critique didn’t change his reading of the original OSC study, and told Kastrenakes that the OSC may have overestimated the number of reproducible studies, but that constructive debate was useful.

Kastrenakes goes on: ‘Gilbert would argue that, regardless of the field, taking a better approach to replication in the first place should lead to clearer results. "Yes, replicating can be done well, and yes, doing it well is hard," he [Gilbert] writes in an all-caps email to The Verge. "But just because it is hard to do something well does not mean that you should do it badly. This applies both to replication and playing the violin in public."’

On the Mind Hacks blog, Tom Stafford (University of Sheffield) points to a Bayesian reanalysis of the reproducibility project by Alexander Etz. Stafford writes: ‘This take on the project is a great example of how open science allows people to more easily build on your results, as well as being a vital complement to the original report – not least because it stops you naively accepting any simple statistical report of what the reproducibility project ‘means’.’ The analysis is now available as a paper in PLOS ONE.

The paper’s interpretation of the reliability of psychology, as informed by the reproducibility project, is as follows: ‘Overall, 75% of studies gave qualitatively similar results in terms of the amount of evidence provided. However, the evidence was often weak… The majority of the studies (64%) did not provide strong evidence for either the null or the alternative hypothesis in either the original or the replication… We conclude that the apparent failure of the Reproducibility Project to replicate many target effects can be adequately explained by overestimation of effect sizes (or overestimation of evidence against the null hypothesis) due to small sample sizes and publication bias in the psychological literature.’

Katie M Palmer, writing for Wired, said: ‘Emotions are running high. Two groups of very smart people are looking at the exact same data and coming to wildly different conclusions. Science hates that. This is how beleaguered Gilbert feels: When I asked if he thought his defensiveness might have colored his interpretation of this data, he hung up on me.’

Gilbert said to her: ‘Most people assume that when you say the word replication, you’re talking about a study that differed in only minor, uncontrollably minor details.’ Palmer adds: ‘That wasn’t the case in many of the Project’s replications, which depended on a small budget and volunteered time. Some studies were so difficult or expensive to replicate that they just … didn’t get replicated at all, including one of Gilbert’s.’

Andrew D. Wilson and Sabrina Golonka, two psychologists from Leeds Beckett University who tweet as @psychscientists, commented on another brewing replication story: ‘Failure to replicate one version of the task, everyone loses their minds in panic… Plus the media angle is all "game over, man!" even though this is just step 1.’ They referred back to their 2013 blog post, ‘Replication will not save psychology’: ‘Being able to replicate a study is an effect, not a cause of good scientific practice. So the emphasis on replication as a goal has the whole thing backwards. We should actually be focusing on improving the experiments we run in the first place. If we run better experiments, the replicability will take care of itself.’

Whether the glass is half full or half empty, expect this story to run and run… it is of course possible that the paper that said the paper that said that psychology isn’t reliable isn’t reliable isn’t reliable.
