Category Archives: Replication

Rejoinder to Schnall (2014) in Social Psychology

Given widespread interest in our direct replication of Schnall, Benton, and Harvey (2008) our rejoinder may be found here.

The commentary written by Schnall (2014) may be found here.

The original article (Johnson, Cheung, & Donnellan, 2014) may be found here.

Brent Donnellan has blogged about this article here.

Simone Schnall has blogged about this article here; comments may be found here.

In the purpose of maintaining transparency about the communications between us and Dr. Schnall, as well as the special edition editors (Brian Nosek and Daniël Lakens), all email correspondence may be found here. Additional information about the process of pre-registration used in this special issue is also provided.

Conceptual Replication and Random Factors

1 Reply

The debate over replication in psychological science is hardly new. Should psychologists prefer direct replications, which attempt to recreate the original procedures of a study as exactly as possible, or conceptual replications, which attempt to manipulate the theoretical variables in an experiment in novel ways? Most recently, this topic was raised in the context of a special section on behavioral priming in the January 2014 issue of Perspectives on Psychological Science.

Full Disclosure: One of my colleagues (Joseph Cesario) wrote an article that was submitted in this special issue. Though the opinions expressed in this article are my own, I have discussed similar issues with him previously.

I read over the articles in this section and am glad that the researchers have taken up the challenge of addressing concerns that behavioral priming research is nonreplicable. Although this issue is a concern across psychological science, behavior priming has unfortunately become the poster child for failures to replicate, as evidenced by Nobel Laureate Daniel Kahneman’s open letter to priming researchers on a “train wreck looming” for the discipline, and the failure of the Many Labs Replication Project to replicate two behavioral priming experiments.

One article in the special issue stuck out to me, Stroebe and Strack’s (2014; hereafter S & S) manuscript on the illusion of exact replication. In a nutshell, they propose that no replication is ever really exact, and that even if a replication were to be exact it would not establish the generality of the effect. They strongly recommend researchers conduct conceptual replications, which are designed to specifically investigate the extent to which research findings will generalize beyond the experiment itself.

All this is well and good, but it seems to me that an issue that got behavioral priming (and psychological science) in trouble is that conceptual replications are not systematic. What do I mean by that? Consider the following scenario:

A researcher is interested on the effect of message strength on attitude change. The researcher designs a strong version of a particular persuasive message (message “A”) and a weak version of the same message (message “a”). He finds that after reading message A, participants demonstrate more attitude change on attitude measure X, compared to reading message a.

The researcher wants to establish the generality of the effect, so he or she designs different messages (messages “B” and “b”) and finds that, as predicted, B has a greater influence on attitude measure Y. If the experimenter is particularly conscientious, perhaps he or she repeats the experiment again with different messages and attitude variables.

The problem with this scenario two-fold. First, it is not systematic. What the researcher wants is to show that the effect of attitude change is not limited to a particular set of stimuli (i.e., messages), outcome measure (i.e. attitude), or sample (i.e., group of participants). However, this kind of design only allows for generalization across the latter, and not the former. Secondly, it allows for a great deal of researcher degrees of freedom. Researchers can easily drop messages that do not “work” or even whole experiments that do not find the predicted effect.

The second issue has been covered extensively by others (e.g., Simmons, Nelson, Simonsohn, 2011), so I focus my attention on the former, that conceptual replications are often not systematic in their approach. Consider S & S’ thoughts on the matter, “If a conceptual replication using a different operationalization of both constructs had succeeded in supporting the theoretical hypothesis, then our trust in the validity of the underlying theory would have been strengthened” (p. 62, emphasis added).

However, it is easily apparent why this kind of logic is flawed. While changing one set of variables (e.g., the message or attitude change of variable) would help to increase generalizability, changing both sets allows for ambiguity. It is unclear whether the same underlying construct is being measured, and failures to replicate are not very conclusive. In other words, only when I hold one set of variables constant can I meaningfully examine differences in the other set. As a more concrete example, consider the fictional message study. If I vary message type (i.e., messages A vs. B) I can determine if the effect of message strength (A vs. a, and B vs. b) holds for both messages (that is, is generalizable) or is specific to a particular message.

What does this mean for the practical researcher? Well, one tempting answer is to continue to conduct conceptual replications, but only vary one set of variables. However, even this approach is problematic because it still allows for considerable freedom in what variables are used, and more importantly, are reported. If a researcher uses sets of variables A – E, but only A and B work, it is tempting to only report the studies using the former two. Ultimately, this option is less than desirable.

However, there is another option—a better option—that directly tests the assumption of generalizability in a systematic fashion: treating stimuli as a random factor.

Psychologists are typically familiar with what random factors are, but Judd, Westfall, and Kenny (2012) give a concise definition: “random factors are factors whose levels are sampled from some larger population of levels across which the researcher wishes to generalize, whereas fixed factors are those whose levels are exhaustive” (p. 55). The most commonly encountered random factor is participants themselves. Psychologists sample individuals from a larger population of interest with the hope of making inferences that extend to this larger population. No one cares that undergraduates from Michigan State University are more likely to donate money after hearing a persuasive message from a speaker. However, this result is interesting if it can be generalized to other people.

Researchers often want to reach conclusions that generalize over levels of factors, such as the conclusion that people are more likely to be persuaded by strong messages than weak messages. However, instead of sampling from theoretical populations of strong messages and weak messages, they often choose a specific example from each. In the example I gave earlier, this would be like choosing strong/weak messages A/a. However, both of these messages are just one example of strong and weak messages; the researcher could just as easily have chosen messages B/b, C/c, and so on. This makes the experimental results ambiguous. The experimenter wants to conclude that the strength of the message is what influenced its persuasiveness, but differences could also be the result of a particular bias in the stimuli.

Conceptually replicating the fictional message experiment with different message types does little to demonstrate its generalizability, because the number of stimuli needed to generalize over a population is greater than the number typically included in a typical psychological experiment (typically one, two, or four). Judd, Westfall, and Kenny discuss this extensively in their 2012 paper, but the take-home message is that more stimuli is better. When factors are treated as random, sampling more variable stimuli (e.g., participants, messages) imposes higher demands on the researcher, but it provides concrete evidence of generalization.

While I do not disagree with S & S’ position that conceptual replications are useful for the generalization and advancement of psychological science, their encouragement of unsystematic replication undermines the goal of generalization. Psychologists aiming for generalization should consider conducting fewer studies but with more proper methods. Treating stimuli such as message type as a random factor is a way of obtaining conclusions about generalization that are supported by statistical evidence, rather than theoretical conjecture.

Data for Schnall et al. (2008) available

Replicability in Psychological Science: Daniel Kahneman

Leave a reply

The 25th APS Conference section on Good Data Practices and Replicability had several well-known speakers giving some well-intoned advice on improving psychological science. The special section focused on three general topics: increasing the adoption of good data practices, increasing the number and publication of direct replications, and finally editorial responses to encourage the enactment of such practices.

In particular, one speaker stood out: Nobel Prize winning psychologist and professor at Princeton, New Jersey, Daniel Kahneman. Kahneman is no stranger to the current issue of replicability in psychological science. A long time advocate for increasing confidence in psychological results, in 2012 Kahneman penned an open letter to the psychological community, calling for direct replications of well known social priming work in order to restore faith in the veracity of such research.

It comes as no surprise that the bulk of his presentation focused on properly powered studies; that is, ensuring that psychological experiments run enough participants to reach 90% power (obtaining enough data so that researchers have a 90% chance of discovering an effect, if such an effect does exist). This call for larger sample sizes is not particularly new in psychology (e.g., see Cohen, 1988), but it raises an additional question: how large does a sample need to be to reliably find an effect? This answer is complicated, but the short answer is that it is dependent upon effect size: how strong the relationship between independent and dependent variables is.

Sometimes the size of a certain effect is large and easy to measure, even with the naked eye. Consider the effect of sex on height, d = 1.8, a large effect (d > .8) according to Cohen’s (1988) standards. (All effect sizes in the rest of this article are drawn from an insightful article written by Meyer et al., 2001). However, rarely are effects this large. For example, most—if not all—physicians would argue that analgesics have a significant effect on the reduction of pain, yet the relationship between taking ibuprofen and pain relief is only d = 0.3, a “small” effect. Clearly, just because an effect is “small” does not mean that it does not have substantial real world applications.

In contrast, a quick power analysis reveals that a properly powered study aimed at revealing the relationship between sex and height (i.e., 90%) would require 16 subjects, while a properly powered study aimed at establishing the effect of ibuprofen on pain reduction would require almost 30 times more subjects (470 total). What this means in the real world is that failure to find certain effects may often be due to small sample size, rather than the absence of an effect. Since most effects in psychological research are closer to the latter effect (d < .3), it is no surprise that failures to replicate occur when the average sample size is low (~30 participants per condition).

Perhaps more strikingly illustrated, the case for larger sample sizes can be made more clear by examining another association, the effect of gender on weight. Another speaker, Uri Simonsohn, conducted a survey aimed at determining how many participants were necessary to verify certain effects that should be extremely obvious. While the effect on gender and weight is large (d = 1.0), it was not until sampling 47 participants that the difference reached statistical significance. Thus, if an effect a researcher examining is less strong than the relationship between height and weight (virtually all psychological research), experimenters should be at least conducting experiments with 50 participants per cell.

Readers may be wondering why sample size is important, considering that if low sample size leads to decreased power, then Type II errors (failing to find an effect when one is present) would be more likely, but Type I errors would remain unaffected (finding an effect when one is not actually present). While this is technically true, there are several reasons to believe that low sample sizes lead to increased Type I error. On such example is that questionable research practices inflate Type I error rates, in particular the tendency to run participants until statistical significance is achieved (e.g., till t_obt exceeds t_crit).

While a discussion of such practices exceeds the scope of this article (but see here), another noteworthy consequence of small sample sizes is that by definition, only large effects can be detected. This means that when effects are detected using “common” research procedures, they tend to be an inflation of the actual effect size (Ferguson & Heene, 2012). Furthermore, experimenters tend to publish studies that show an effect and ignore those that fail to find an effect (the “file drawer” problem). Recall that according to traditional null-hypothesis significance testing, a probability level of 5% leads to a 1 in 20 chance of rejecting the null when the null hypothesis is true. When researchers selectively choose studies to publish, “significant” results merely due to Type I error can be combined to create a package of studies that looks compelling but is not empirically valid.

What should (or can?) be done about this problem? Obviously larger sample sizes are needed in psychological research, but often this standard is difficult to obtain practically. Kahneman’s proposed solution is a compromise between the theoretical problems of relying on underpowered samples and the realistic problem of limited resources and participant availability. His approach consists of changing the standards for what should be accepted in psychological journals. Speaking generally, currently it is common practice to bundle several small scale studies (e.g., 3-4 studies, n ~ 30 per cell) together in a publication. Kahneman envisions a process where the critical manipulation or theoretical construct is tested in a highly powered (i.e., power > 90%) “flagship” study. Extensions of this basic construct can then be conducted in smaller scale “satellite” studies that may have less power.

This approach would be beneficial in that it would both provide more precise estimates of effect size, and more confidence insofar as the power reached “appropriate” levels. But there are some issues with this approach. First, achieving 90% power may be unattainable for some researchers, especially when effects are very small. (A simple independent groups t-test with an effect size of d = .2 would require over 1000 participants). Furthermore, precise estimates of the size of an effect may not be available, especially if the work is exploratory. Related, due to publication bias, reported effect sizes are likely overestimates of the true effect size, so even studies that are adequately powered (according to a priori estimates) may reveal results that when examined a posterori were less than adequately powered (e.g., power < 90%).

A slightly different problem arises from the assumption that the critical theory can also be tested in a single experiment. It is common in psychological research that a proposed theory yields several testable hypotheses, all of which are important to establishing the overall validity of the theory. It may not always be clear what component (if any) is critical to the validity of the hypothesis being tested.

Problems also arise when considering the problematic circumstance when the "flagship" study fails to support (or even contradicts) the satellite studies. Which study or set of studies is correct? This argument is not a productive one, in that data are not correct or incorrect; the data just are. However, a discrepancy between studies should lead researchers to pause and reexamine their original set of hypotheses and procedures. This problem is perhaps more problematic when the “flagship” study supports the overall theory but the “satellite” extensions provide mixed results. In this case, the proposed changes do nothing to safeguard against selective reporting of studies that work.

Still, despite the problems with the flagship-satellite method, it’s purpose is well guided; journals should expect—require, even—studies to have higher levels of quality. Kahneman’s approach is a heuristic for the real changes that are necessary for psychology to mature as a discipline: sample sizes need to be larger, direct replications required, and transparency in reporting adopted as standard practice.

Power of Suggestion

False Positive Psychology: The p-Curve

3 Replies

The 14th annual Society for Personality and Social Psychology (SPSP) conference was held in sunny and (relatively) warm New Orleans January 17th – 19th. The inspiration for this blog post came after listening to a particular symposium chaired by Leif Nelson with Uri Simonsohn and Joseph Simmons, the same three authors that in 2011 published a now well-known article on false positive psychology in Psychological Science.

A false positive results when the null hypothesis is incorrectly rejected; that is, when we determine that there is an effect of a manipulation when there isn’t one. By using “researcher degrees of freedom” (e.g., reporting only effective dependent variables, sampling till significance, using covariates, and reporting only subsets of experimental conditions), these authors argued that the false positive rate in psychology—which is set at a maximal rate of p = .05—is actually much higher.

Are false positive results a big deal in (social) psychology? Well, for starters in November 2012 there was a whole issue in Current Perspectives devoted to replication in psychology that generated a lot of discussion (you can read my take on it here). If attendance at this symposium is any indication, psychologists are very interested, drawing enough people to fill one of the convention center’s largest rooms. This audience was clearly attentive as well, gasping at all the appropriate moments like a well-mannered crowd of moviegoers.

The symposium centered around an extension of the work on researcher degrees of freedom: p-curve, an analytical tool for examining whether a researcher has utilized researcher degrees of freedom. This process is also sometimes referred to as “p-hacking,” signifying that a researcher might attempt to modify their data or analyses until they reach the revered p = .05 level.

Leif opened discussion on the p-curve with a simple statement: everyone is a p-hacker. This bears repeating. It’s time to quit denying that we don’t posture our data and results in ways that make them more compelling. Many times these decisions are logical (enough) and relatively harmless. However, the more researcher degrees of freedom we engage in, the less certain we can be that our conclusions are warranted. The usefulness of the p-curve lies in its ability to investigate sets of p-values to determine whether data has been massaged to the point where it loses credibility. Said slightly differently, it’s not that we can never utilize researcher degrees of freedom, especially when they may make theoretical sense in a given situation. But using them expressly to achieve statistical significance, especially by compounding them, leads to data that has no integrity.

Essentially, the p-curve is just a representation of all the p-values for a given set of related experiments, such as studies on the same effect, from the same researcher, or included in the same journal. A typical p-curve for a real effect will not be evenly distributed, but most of the obtained ps will fall closer to or much below the p = .01 level, indicating it is highly improbable that both groups (or means, or treatments) were pulled from the same overall population. In contrast, when the null hypothesis is true (i.e., there is no difference between groups), values are equally distributed along all probabilities. This is because it is equally likely to get a p of .99 or .01 when the null is true, as the the probability of obtaining each p is equal to 1%. (If you’re a visual person, Michael Kraus has some graphs that might explain this part better).

Why does this matter? Well, it suggests that when the null hypothesis is false, the distribution of p-values will be positively skewed (right skew), with p values of .01 or smaller occurring disproportionately more frequently than values of .04, or .05. However, if individuals have engaged in p-hacking, the opposite will be true: p-values closer to .05 should occur more frequently than lower p-values, leading to a negatively skewed distribution (left skew). Again, the key assumption here is that p-hackers stop engaging in researcher degrees of freedom and data manipulation when they achieve a p-value of .05.

Why is this useful? Well, it gives us an estimate of the reliability of the data. To the extent that a set of p-values departs from its expected positive skew, we should question the integrity of those data. If the data are highly negative skewed, this would suggest that the researcher has engaged in researcher degrees of freedom.

But is it accurate? I admit I’m drawing my information from the symposium alone and don’t have direct access to the mathematical formulas and stats that these assertions are based off of. But I can say is that the authors argued that a) p-curve analysis can be done reliably with relatively few p-values, b) the analysis is quite reliable at determining altered data, and perhaps most importantly, c) is at extremely low risk for identifying false positives (i.e., misidentifying a set of studies as suspect when there is no reason to suspect them) when studies are adequately powered. But don’t take my word for it, email Uri Simonsohn and ask for a copy of the working paper.

Still, the well-informed reader may be wondering where the new information is. After all, these presenters gave a symposium on false positive psychology (including the p-curve) at the 2012 SPSP conference. What is new is that the authors have determined a way to use the p-curve to more accurately determine effect size. While meta-analyses seek to analyze the strength of an effect over multiple studies, a common problem researchers encounter is the “file-drawer” effect, where studies that fail to reach significance often go unpublished, left in the file-drawers of various labs and offices. This results in an overestimation of effect size, because studies that fail to find effects are often not analyzed. Even in the best circumstances, when a researcher is able to get ahold of some of those file-drawer studies (say, by accessing them from the Open Science Framework or Psych File Drawer), it is highly unlikely that the researcher will really get all of the results, making overestimates almost a guarantee.

While the ideal situation for determining effect size would be a researcher who is “omniscient” and has access to everyone’s file drawer, this is simply not possible. However, by using the p-curve to determine what studies are more credible than others (i.e., they give less evidence of p-hacking), effect sizes can be more appropriately weighed to give effect size estimates that accurately map the effect sizes of the omniscient scientist. The current author likes to imagine a situation where researchers engaging in meta-analyses calculate the known effect size (a product of the published data) and accompany this number with an estimate derived from p-curve analysis. The point is not to determine which effect size is “correct”; both would provide different and useful information about the data, namely whether the research demonstrates a reliable effect, and, whether that effect is credible, respectively.

Why is this useful? Again, for psychology to move away from the oversimplified reject / fail to reject null hypothesis significance testing that has dominated the field for almost a century, reliable estimates of effect size are necessary. Though meta-analyses are great tools for estimating the strength of a certain manipulation, over-estimates of effect size can have huge negative consequences when employed in applied settings. For example, discovering a link between height and IQ is fine, but if that relationship is so weak that you would have to grow roughly four feet to gain a ten point increase in IQ the effect is practically useless. (You wouldn’t want parents stuffing their children’s food full of growth hormone).

Although there’s not nearly enough space to cover the breadth of topics discussed at this symposium, some other key points discussed are worth repeating:

Properly power studies with at least n = 50 per cell.
P-hacking CAN help us learn from the data, but only if we do direct replications.
Don’t judge quantity of papers someone has published, judge the quality.
Label your work as non p-hacking (use the 21 words) “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”

As mentioned previously, the paper that this symposium was based off of is currently a work in progress. Information about its publication status can be found here. This entry was completely written based on content from the presentation, and the author takes sole responsibility for any butchering (or oversimplification) of the theory within.

David J. Johnson

University of Maryland