This November’s issue of Perspectives on Psychological Science focuses on another critical issue in psychological science: the replicability of psychological research. Given the importance of this issue, the American Psychological Association has made the entire issue freely available here.
To give a brief overview, the main topic discussed in this special issue of Perspectives is a crisis of confidence in research. This crisis is not limited to psychological research; in fact, one of the seminal papers on this topic came from medicine, examining why “most published research findings are false” (Ioannidis, 2005). More recently, replication has become a hot topic in psychology for a variety of reasons. To focus on one in particular, the publication of Daryl Bem’s study demonstrating evidence of ESP in social psychology’s flagship journal, the Journal of Personality and Social Psychology (JPSP), provoked outrage among many psychologists and received considerable attention in the media, even being lambasted in popular comedic outlets such as Comedy Central’s Colbert Report (link).
Since the publication of Bem’s article, several other researchers have failed to replicate its results. So why was it published in the first place? That’s where it gets tricky. A short explanation of significance testing will help clarify this discussion. Most of psychology relies on NHST (Null Hypothesis Significance Testing) to determine whether a given effect is “real” or not. Effects are tested against a null hypothesis (H0) that predicts no difference between conditions, treatments, participants, etc. For example, the null hypothesis for a coin flip would predict that the coin is no more likely to land heads than tails. In other words, the probability that a coin will land heads or tails should be equal: 50%.
Consider this situation, though. Say you flip a coin 100 times and it lands heads 57 times. Does this mean that your coin is more likely to land heads than tails? Common sense would suggest no. Still, a very important question in statistical analysis is how to determine when effects like these are “real.” To do this, NHST relies on probability. If the probability of observing a result at least as extreme as the one obtained (in this case, 57 heads out of 100 flips) falls below a certain threshold when the null hypothesis is true, we reject the null hypothesis and call the finding “significant.” In psychology, this threshold is 5%, or p = .05. In a nutshell, this means that when there is truly no difference between conditions, we would expect to obtain such an extreme result only 5% of the time. It is no coincidence that this threshold equals the likelihood of making a Type I error, or declaring a significant effect when no such effect exists. All other things being equal, the lower the p value, the less likely an effect is due to chance alone. (Caveat: all things are almost never equal in psychological research, which is why p is a poor measure of effect size.)
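To make the coin example concrete, here is a minimal sketch (my own illustration, not from the Perspectives issue) of an exact two-sided binomial test for 57 heads in 100 flips:

```python
# Hypothetical sketch: an exact two-sided binomial test for the
# coin example (57 heads in 100 flips) under H0: P(heads) = 0.5.
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Sum the probabilities of every outcome at least as unlikely
    under H0 as the outcome actually observed."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return sum(pi for pi in pmf if pi <= pmf[k] * (1 + 1e-9))

p_value = binom_two_sided_p(57, 100)
print(round(p_value, 3))  # well above .05, so we fail to reject H0
```

Even 57 heads out of 100 is not extreme enough to reject the fair-coin hypothesis at the .05 level, matching the common-sense intuition above.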
Why is this important? Remember that Bem’s findings were significant; he found that under certain circumstances participants were able to “see the future.” Rejecting the null hypothesis allows a researcher to say with some degree of confidence that the effects measured exist and are not the result of chance. However, consider a psychologist who does not believe Bem’s claims and seeks to demonstrate that the effect does not exist. He or she replicates the study and fails to find a significant effect of ESP. Has the researcher then disproved Bem’s hypothesis? The simple answer: no. The long answer: it’s complicated.
Failure to find an effect (failing to reject H0) differs from rejecting the null. Whereas rejecting the null lets us say with some degree of confidence that the effect exists, failing to reject the null does not mean that the effect does not exist. In other words, the absence of evidence is not evidence of absence. There could be many reasons why someone failed to find a given effect: the replication may not have been “close” enough to the original, it may not have had enough statistical power, or any number of other factors may have intervened. (For a more concrete discussion in relation to my research, look here.)
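The power point can be shown with a small simulation (again my own sketch, not from the article). Suppose an effect is real but modest, say a coin that actually lands heads 60% of the time, and compare how often a small study versus a large study detects it:

```python
# Hypothetical sketch: a real but small effect (a coin biased to land
# heads 60% of the time) is usually missed by an underpowered study.
import math
import random

random.seed(1)

def two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation to the binomial."""
    z = (heads - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def power(n_flips, p_heads=0.6, trials=10_000):
    """Fraction of simulated studies that reject H0 at p < .05."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < p_heads for _ in range(n_flips))
        if two_sided_p(heads, n_flips) < 0.05:
            hits += 1
    return hits / trials

p_small = power(25)   # an underpowered study
p_large = power(400)  # an adequately powered study
print(p_small)  # misses the real effect most of the time
print(p_large)  # detects the same effect almost every time
```

So a single null result from a small replication tells us very little: under these assumptions the small study fails to reject H0 most of the time even though the effect is real.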
Because failures to replicate are less conclusive, they are harder to publish. But the problem goes beyond mere certainty. Replications in general—especially direct (identical) replications—are shunned by major journals. Failures to replicate can always be explained away by other factors, while successful replications, by default, are less novel and interesting, which makes them less marketable in a field where publication space is at a premium. Numerous journals even have explicit policies against accepting replications (JPSP falls into this category). Psychologists are thus faced with a dilemma. They can spend considerable time and resources testing an effect they find questionable, and ultimately fail to publish even if their methodology is sound. Alternatively, they can keep their suspicions to themselves and spend their time more “wisely” pursuing novel questions that are more marketable.
This brings us back to the question of how to weed out “bad” research. While one null result is inconclusive in itself, several null results (as is the case with Bem’s findings) raise questions about the validity of the original finding. As long as the replications are methodologically sound and true to the original study (i.e., they use the same materials and procedures to the maximum degree possible), multiple null findings raise a red flag. This is why the current issue of Perspectives focuses on replication as the “gold standard for ensuring the reliability of published scientific literature” (Frank & Saxe, 2012).
The reasons I’ve discussed for why there is so little replication in psychology (null findings are less conclusive, competition in publishing) are only part of why psychology is currently experiencing a crisis of confidence. Some of the other issues discussed in the journal are briefly (and incompletely) mentioned here for those interested:
- Most psychological studies are underpowered, making it more likely that an experimenter will find a significant effect over the course of several small studies than in one large, adequately powered study (Bakker, van Dijk, & Wicherts, 2012).
- Effect size and sample size are negatively correlated in the majority of meta-analyses, partially due to underpowered studies. This implies a strong tendency to selectively report positive results and overestimate effect sizes (Ferguson & Heene, 2012).
- Research bottlenecks pressure experimenters to alter their results in order to present aesthetically pleasing, publishable findings. These alterations distort the true effects (Giner-Sorolla, 2012).
- Not enough attention is paid to the context in which the experiment takes place. In particular, in behavioral priming studies contextual and social cues affect how participants react during experimental sessions (Klein et al., 2012).
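The first point above can be illustrated with a simulation (my own sketch, not from Bakker et al.): when there is no true effect at all, a researcher who runs several small studies and counts any single significant result will cross the .05 threshold far more often than 5% of the time.

```python
# Hypothetical sketch: with no true effect (a fair coin), running five
# small studies and reporting any significant one inflates the
# false-positive rate well beyond the nominal 5%.
import math
import random

random.seed(0)

def two_sided_p(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation to the binomial."""
    z = (heads - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def any_significant(n_studies=5, n_flips=40):
    """Did any of several small null studies come out 'significant'?"""
    for _ in range(n_studies):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if two_sided_p(heads, n_flips) < 0.05:
            return True
    return False

trials = 10_000
rate = sum(any_significant() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05
```

Each individual study keeps its Type I error near 5%, but giving yourself five chances at significance multiplies the opportunities for a false positive.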
Addressing these problems will require a different approach than the antiquated system psychology currently maintains. However, I can offer some (slightly) optimistic news. While it goes beyond the scope of this article, the special issue of Perspectives is divided into two sections, the first focused on diagnosis and the second on treatment. There are solutions to the problems of replicability in psychological science, but they will require more than a few articles if we are serious about changing the practices that have produced this looming crisis of confidence.