The 14th annual Society for Personality and Social Psychology (SPSP) conference was held in sunny and (relatively) warm New Orleans January 17th – 19th. The inspiration for this blog post came after listening to a particular symposium chaired by Leif Nelson with Uri Simonsohn and Joseph Simmons, the same three authors who in 2011 published a now well-known article on false-positive psychology in Psychological Science.
A false positive occurs when the null hypothesis is incorrectly rejected; that is, when we conclude that a manipulation had an effect when it did not. These authors argued that by exploiting "researcher degrees of freedom" (e.g., reporting only the dependent variables that worked, sampling until significance is reached, selectively using covariates, and reporting only subsets of experimental conditions), researchers push the actual false positive rate in psychology well above its nominal cap of 5% (p = .05).
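To see why even one of these degrees of freedom, sampling until significance, inflates the false positive rate, here is a small simulation of my own (not from the talk). It uses a known-variance z-test rather than a t-test purely to stay dependency-free, and the batch sizes are my arbitrary choice:

```python
import math
import random

def p_value(xs, ys):
    """Two-sided p-value from a two-sample z-test, assuming a known SD of 1
    (a simplification so this sketch needs nothing beyond the stdlib)."""
    nx, ny = len(xs), len(ys)
    z = (sum(xs) / nx - sum(ys) / ny) / math.sqrt(1 / nx + 1 / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_study(rng, looks=(10, 20, 30, 40, 50)):
    """One 'study' that peeks at the data after each batch of subjects and
    stops as soon as p < .05. The null is TRUE here: both groups are drawn
    from the same population, so any 'significant' result is a false positive."""
    xs, ys = [], []
    for n in looks:
        while len(xs) < n:
            xs.append(rng.gauss(0, 1))
            ys.append(rng.gauss(0, 1))
        if p_value(xs, ys) < .05:
            return True  # declared significant despite no real effect
    return False

rng = random.Random(1)
sims = 4000
rate = sum(peeking_study(rng) for _ in range(sims)) / sims
print(f"false positive rate with optional stopping: {rate:.3f}")
```

Even though each individual test uses the .05 threshold, getting five chances to cross it pushes the overall false positive rate well above 5%.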
Are false positive results a big deal in (social) psychology? Well, for starters, in November 2012 an entire issue of Perspectives on Psychological Science was devoted to replication in psychology, and it generated a lot of discussion (you can read my take on it here). If attendance at this symposium is any indication, psychologists are very interested: the session drew enough people to fill one of the convention center's largest rooms. The audience was clearly attentive as well, gasping at all the appropriate moments like a well-mannered crowd of moviegoers.
The symposium centered on an extension of the work on researcher degrees of freedom: p-curve, an analytical tool for examining whether a researcher has exploited researcher degrees of freedom. This practice is also sometimes called "p-hacking," signifying that a researcher might keep modifying their data or analyses until they cross the revered p = .05 threshold.
Leif opened discussion on the p-curve with a simple statement: everyone is a p-hacker. This bears repeating. It's time to quit pretending that we don't posture our data and results in ways that make them more compelling. Many of these decisions are logical (enough) and relatively harmless. However, the more researcher degrees of freedom we exercise, the less certain we can be that our conclusions are warranted. The usefulness of the p-curve lies in its ability to examine sets of p-values and determine whether data have been massaged to the point of losing credibility. Said slightly differently, it's not that we can never use researcher degrees of freedom, especially when they make theoretical sense in a given situation. But using them expressly to achieve statistical significance, especially by compounding them, produces data with no integrity.
Essentially, the p-curve is just the distribution of all the p-values for a given set of related experiments, such as studies on the same effect, from the same researcher, or published in the same journal. When an effect is real, the p-curve will not be flat: most of the obtained p-values will fall at or well below the .01 level, indicating it is highly improbable that the groups (or means, or treatments) were drawn from the same overall population. In contrast, when the null hypothesis is true (i.e., there is no difference between groups), p-values are uniformly distributed across all probabilities. A p of .99 is exactly as likely as a p of .01, because under the null the probability of landing in any 1%-wide interval is 1%. (If you're a visual person, Michael Kraus has some graphs that might explain this part better.)
Why does this matter? Well, it suggests that when the null hypothesis is false, the distribution of p-values will be positively skewed (right skew), with p-values of .01 or smaller occurring disproportionately more often than values of .04 or .05. However, if researchers have engaged in p-hacking, the opposite will be true: p-values close to .05 will occur more frequently than lower p-values, producing a negatively skewed distribution (left skew). Again, the key assumption here is that p-hackers stop exercising researcher degrees of freedom and manipulating their data as soon as they achieve a p-value below .05.
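A quick simulation (again my own sketch, not from the talk) makes the baseline shapes concrete: among the significant p-values a p-curve would plot, a true effect piles them up near zero, while a true null leaves them flat. The function names and the known-variance z-test are my simplifications:

```python
import math
import random

def p_value(xs, ys):
    # Two-sided p-value from a two-sample z-test with known SD = 1
    # (a stdlib-only simplification of the usual t-test).
    nx, ny = len(xs), len(ys)
    z = (sum(xs) / nx - sum(ys) / ny) / math.sqrt(1 / nx + 1 / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def significant_ps(effect, n=20, sims=10000, seed=0):
    """Simulate `sims` studies with n subjects per group and keep only the
    p-values that reach significance, i.e., the ones a p-curve would plot."""
    rng = random.Random(seed)
    ps = []
    for _ in range(sims):
        xs = [rng.gauss(effect, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        p = p_value(xs, ys)
        if p < .05:
            ps.append(p)
    return ps

def share(ps, lo, hi):
    """Fraction of significant p-values falling in the interval (lo, hi]."""
    return sum(lo < p <= hi for p in ps) / len(ps)

real = significant_ps(effect=0.5)  # a true effect of d = 0.5
null = significant_ps(effect=0.0)  # the null is true
print("true effect:  p <= .01:", round(share(real, 0, .01), 2),
      "  .04 < p <= .05:", round(share(real, .04, .05), 2))
print("true null:    p <= .01:", round(share(null, 0, .01), 2),
      "  .04 < p <= .05:", round(share(null, .04, .05), 2))
```

With a real effect, the p ≤ .01 bin dominates (right skew); under the null, each 1%-wide bin holds roughly a fifth of the significant results. A set of studies whose curve instead bulges toward .05 is what the left-skew diagnostic flags.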
Why is this useful? Well, it gives us an estimate of the reliability of the data. To the extent that a set of p-values departs from its expected positive skew, we should question the integrity of those data. If the distribution is highly negatively skewed, this suggests that the researcher has engaged in researcher degrees of freedom.
But is it accurate? I admit I'm drawing my information from the symposium alone and don't have direct access to the mathematical formulas and statistics these assertions are based on. But what I can say is that the authors argued that a) p-curve analysis can be done reliably with relatively few p-values, b) the analysis is quite reliable at detecting altered data, and, perhaps most importantly, c) it runs an extremely low risk of false positives of its own (i.e., flagging a set of studies as suspect when there is no reason to suspect them) when studies are adequately powered. But don't take my word for it: email Uri Simonsohn and ask for a copy of the working paper.
Still, the well-informed reader may be wondering where the new information is. After all, these presenters gave a symposium on false positive psychology (including the p-curve) at the 2012 SPSP conference. What is new is that the authors have worked out a way to use the p-curve to estimate effect size more accurately. While meta-analyses seek to measure the strength of an effect over multiple studies, a common problem researchers encounter is the "file-drawer" effect, where studies that fail to reach significance often go unpublished, left in the file drawers of various labs and offices. This results in an overestimation of effect size, because studies that fail to find effects are left out of the analysis. Even in the best circumstances, when a researcher manages to get hold of some of those file-drawer studies (say, by accessing them from the Open Science Framework or Psych File Drawer), it is highly unlikely that they will really get all of the results, making overestimates almost a guarantee.
While the ideal situation for determining effect size would be a researcher who is "omniscient" and has access to everyone's file drawer, this is simply not possible. However, by using the p-curve to determine which studies are more credible than others (i.e., which give less evidence of p-hacking), effect sizes can be weighted more appropriately, yielding estimates that approximate those of the omniscient scientist. I like to imagine a situation where researchers conducting meta-analyses calculate the known effect size (a product of the published data) and accompany that number with an estimate derived from p-curve analysis. The point is not to determine which effect size is "correct"; both would provide different and useful information about the data, namely, whether the research demonstrates a reliable effect and whether that effect is credible, respectively.
Why is this useful? Again, for psychology to move away from the oversimplified reject/fail-to-reject logic of null hypothesis significance testing that has dominated the field for almost a century, reliable estimates of effect size are necessary. Though meta-analyses are great tools for estimating the strength of a given manipulation, overestimates of effect size can have huge negative consequences when employed in applied settings. For example, discovering a link between height and IQ is fine, but if that relationship is so weak that you would have to grow roughly four feet to gain a ten-point increase in IQ, the effect is practically useless. (You wouldn't want parents stuffing their children's food full of growth hormone.)
Although there’s not nearly enough space to cover the breadth of topics discussed at this symposium, a few other key points are worth repeating:
- Properly power studies with at least n = 50 per cell.
- P-hacking CAN help us learn from the data, but only if we do direct replications.
- Don’t judge the quantity of papers someone has published; judge their quality.
- Label your work as free of p-hacking by using the 21 words: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”
As mentioned previously, the paper this symposium was based on is currently a work in progress. Information about its publication status can be found here. This entry was written entirely from the content of the presentation, and the author takes sole responsibility for any butchering (or oversimplification) of the theory within.