What crisis? – the reproducibility crisis

Ella Rhodes reports from a British Psychological Society debate.

07 June 2016

A huge audience of psychologists, students and researchers was drawn to the British Psychological Society debate in London about the reproducibility and replication crisis in psychology. After Brian Nosek and the Open Science Collaboration outlined the difficulty in reproducing psychological findings, the BPS, the Experimental Psychology Society and the Association of Heads of Psychology Departments hoped to host an upbeat and positive debate in the area.

Chair of the BPS Research Board Daryl O’Connor (Leeds University) said Nosek’s paper, which highlighted issues in psychology’s methodology and statistical approaches, had provided the field with a huge opportunity. He added: ‘The publication of that paper is revolutionary for our discipline, it provides an opportunity to propel us forward, improve our scientific practice and research methods.’

Many of the publishing models used in psychology, pointed out the day’s first speaker Marcus Munafò (University of Bristol), were developed 400 years ago. Could the crisis in psychology be less of a crisis and more of an opportunity to change? he asked. The findings of Nosek’s work, he said, were not surprising to many, as the lack of replicability in psychology has been recognised for years.

He suggested that the issues around reproducibility reflect how humans respond to incentives and to the aspects of their environment that shape their behaviour, as well as the cognitive biases people hold. There are also methodological issues within this crisis: Joshua Carp carried out a systematic review of around 240 fMRI studies and found that much of the methodological detail required to attempt a replication was missing, and that almost no two studies analysed their data in the same way.

Munafò said one main issue has been called the ‘garden of forking paths’ – where researchers start with data and are encouraged to explore it rather than sticking to an original plan of analysis. But when several analyses have been carried out before scientists find that prized p-value of less than .05, it may not mean what they think it means.

So psychology researchers tend to retrofit hypotheses to data, and a culture has been created whereby researchers feel a need to have a narrative in their papers. Munafò said the incentive structures around publishing – some researchers rely on grants for their full income, for example – lead to biases almost inevitably.

These biases have been shown in the literature: for example, people invested in a certain area are more likely than outside observers with no ‘skin in the game’ to think a meta-analysis in their field supports their position. There is also a great deal of distortion in citations, with null results receiving few citations despite their importance.

But despite this Munafò ended on something of a high note. He said the replication crisis in psychology has led to the realisation that a shift in focus is needed from productivity to more quality control. He added: ‘Part of the opportunity is to refresh the way we do science. Like using pre-registration, open access, curating data, these all act as quality-control procedures. The solutions will come about by applying scientific methods to the process of science itself.’

A key issue in psychology’s failure to reproduce results, Dorothy Bishop (University of Oxford) said, was the lack of distinction in the published literature between hypothesis-testing results and exploratory statistical findings. She said although this problem had been recognised for many years, psychologists had sometimes been actively discouraged from taking this on board. This problem, she said, was pointed out by the Dutchman De Groot, whose work has only recently been translated. He pointed out that exploring a dataset and looking at the numbers, then deciding how to analyse them, ‘precludes the interpretability of statistical tests’; in other words, researchers should not be using p-values in this exploratory work.

Bishop gave a hypothetical example – if someone compared an ADHD group with a typical group and found no statistically significant difference, they might then divide the sample into young and old groups instead. However, if they subsequently found a statistically significant p-value it would not mean much. She said it was always important to consider the context in which a p-value is found rather than its significance alone.
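Bishop's hypothetical can be tried out in a few lines of code. The sketch below is an illustration rather than anything presented at the debate: it assumes there is no true difference between the groups, uses a simple z-test with a known standard deviation in place of whatever test a real study would use, and all names and sample sizes are hypothetical. It compares how often the planned whole-sample test comes out 'significant' with how often at least one of the three tests (whole sample, young only, old only) does.

```python
import math
import random


def z_test_p(group_a, group_b, sd=1.0):
    """Two-sided p-value for a difference in means, assuming a known SD."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    se = sd * math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(diff / se) / math.sqrt(2))


def simulate(n_sims=2000, n_per_cell=20, alpha=0.05, seed=7):
    """Return (planned_rate, forked_rate) of 'significant' results under no true effect."""
    rng = random.Random(seed)
    planned_hits = 0
    forked_hits = 0
    for _ in range(n_sims):
        # Every cell is drawn from the same distribution, so any 'effect' is noise.
        adhd_young = [rng.gauss(0, 1) for _ in range(n_per_cell)]
        adhd_old = [rng.gauss(0, 1) for _ in range(n_per_cell)]
        typical_young = [rng.gauss(0, 1) for _ in range(n_per_cell)]
        typical_old = [rng.gauss(0, 1) for _ in range(n_per_cell)]

        # The planned analysis: one comparison across the whole sample.
        p_all = z_test_p(adhd_young + adhd_old, typical_young + typical_old)
        # The forking path: also look within each age band.
        p_young = z_test_p(adhd_young, typical_young)
        p_old = z_test_p(adhd_old, typical_old)

        if p_all < alpha:
            planned_hits += 1
        if min(p_all, p_young, p_old) < alpha:
            forked_hits += 1
    return planned_hits / n_sims, forked_hits / n_sims


if __name__ == "__main__":
    planned, forked = simulate()
    print(f"planned test only: {planned:.3f}")
    print(f"any of three tests: {forked:.3f}")
```

Under these assumptions the planned test should come out 'significant' in roughly 5 per cent of simulated studies, while the rate for 'any of three tests' should be noticeably higher, even though nothing real is there to find.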

She also alluded to the ‘garden of forking paths’ metaphor, saying: ‘When I get to the end of a forking path the chances of getting a significant result is much higher. We have a good chance of finding something that’s not true but looks convincing.’ Although many researchers may think it is right to explore data in this way, it is actually a convoluted way of misleading oneself.

Bishop said using random datasets demonstrated this well, and could be used with students to illustrate this problem. Many ‘significant’ correlations appear between random datasets: when multiple correlations are carried out, one is likely to come up at the p < .05 level. It was important, Bishop said, to overcome our bias towards over-interpreting observed patterns.
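A classroom exercise of the kind Bishop describes might look something like the sketch below. It is an assumption-laden illustration, not her own material: the function names, the choice of 10 variables and 50 observations, and the use of the Fisher z-transformation as a large-sample approximation to the p-value are all hypothetical choices made to keep the example self-contained.

```python
import itertools
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transformation (an approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_significant_pairs(n_vars=10, n_obs=50, alpha=0.05, seed=1):
    """Correlate every pair of pure-noise variables; return (hits, number of pairs)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(range(n_vars), 2))
    hits = sum(
        1
        for i, j in pairs
        if approx_p_value(pearson_r(data[i], data[j]), n_obs) < alpha
    )
    return hits, len(pairs)


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    hits, total = count_significant_pairs()
    print(f"{hits} of {total} correlations between random variables had p < .05")
    print(f"chance of at least one, if tests were independent: "
          f"{familywise_error_rate(total):.2f}")
```

Ten variables give 45 pairwise correlations. If those tests were independent (pairwise correlations are not strictly so, which makes this a rough guide only), the chance of at least one coming up at p < .05 would be about 0.90. Changing the seed shows how often pure noise produces something that looks like a finding.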

She pointed to some possible solutions for avoiding this sort of research methodology: for example, encouraging students to play around with random numbers to see the potential for false positives, distinguishing between hypothesis testing and exploratory statistics, and publishing more replications.

Bishop also suggested institutions could play a role in helping psychology research by, for example, changing incentive structures and moving the focus away from impact factors of journals, potentially by rewarding those who carry out reproducible research and adopt open science practices. She concluded by saying that psychologists were not natural scientists, and so it was far too easy for them to fool themselves.

Chris Chambers (Cardiff University) spoke about his work encouraging journals to take up a pre-registration approach to publishing. This allows researchers to submit the idea for their research, including a detailed methodology, to a journal to be approved for publication prior to carrying out their research. This ensures they stick to the analyses set out at the outset and gives null results a better chance of publication than in ‘traditional’ journals. Another innovative feature of such articles is that results from exploratory statistics can be presented but must be labelled as such. Chambers said he wanted to move emphasis from the importance of results on to the processes that produce them.

International civil-servant-turned-psychology-PhD-candidate Nick Brown (University of Groningen) provided the audience with a wry, sort-of-outsider’s take on the problems within psychology. Among his achievements Brown has translated from Dutch the confessional autobiography of psychological fraudster Diederik Stapel, who has had scores of papers retracted after fabricating data.

Brown pointed out there were some general incentive problems and enablers of bad science dotted throughout all areas of science. He said publication bias, the ‘lottery’ of peer review, journals chasing impact factor and an ‘article publication commune’ where some authors will automatically add colleagues’ names to their papers and vice versa, were just some of these.

However, psychology has some very specific problems. People love stories about themselves, and the popular media love to report them; the constructs psychologists measure, he said, were not externally verifiable and rested on little solid theory. ‘Psychologists run a mile in the face of statistics, most don’t know how to interpret p-values correctly and many psychologists see numbers as a necessary evil,’ he added.

The consequences of bad incentives in psychology, Brown said, were HARKing (hypothesising after results are known), false positives or type 1 error, questionable research practices and outright fraud. Type 1 errors are not a career-limiting issue in psychology: findings are not taken straight out into the field and used to fly a plane, for example. Brown added: ‘We know it’s important, a brick in the wall, but it doesn’t matter if we get it wrong. And while replication is unlikely, no one will discover your type 1 error.’

He said there was also a lack of disincentives for using questionable research practices – some are even asked for during the review process, and a majority of psychologists admit to using them. Although it is impossible to estimate how prevalent outright fraud is within the field, some put it at 5 per cent; Brown suggested it is likely to be more as only the most incompetent fraudsters are caught.

He also pointed to his own work with Heathers, which looked at the summary statistics in 71 papers; 36 of these had errors, and the authors contacted the researchers to ask for their raw data, but few would provide it. Brown said: ‘Maybe the reason we can’t reproduce is because results were reported incorrectly or made up. The most common response to asking for raw data is silence.’

Finally Prateek Buch, a Policy Associate for Sense About Science, gave a fascinating talk about the group’s work looking at government-funded research. He said transparency in science was vital not only in psychology but particularly in policy making.

The government, he said, generates much research evidence, through the civil service or by commissioning outside experts, which is designed to eventually inform public policy. Sense About Science began to wonder how transparent the government was about its own research.

However, when the group asked the government how many studies it had commissioned or published, it was told the government did not know. This is despite the fact that it spends around £2.5 billion per year on in-house or commissioned research.

Sense About Science launched an inquiry into the delayed publication of government-commissioned research, led by Sir Stephen Sedley. In short, it found chaos at the heart of how the government conducts research. One source of delayed publication is the pressure put on government to align the publication of policy research with policy announcements.

Buch said that while the government was using so much research evidence to make policy, this research needed to be transparent. He said: ‘The discussion around a need for greater transparency strikes me as a nice headache to have. At least you’re able to estimate the nature of the problem in psychology. In policy they can’t even make a reliable, quantitative estimate of how much research is missing.’ 

- Watch the event on the BPS YouTube channel. Find discussion of the event on the Twitter hashtag #PsycDebate