The debate over replication in psychological science is hardly new. Should psychologists prefer direct replications, which attempt to recreate the original procedures of a study as exactly as possible, or conceptual replications, which attempt to manipulate the theoretical variables in an experiment in novel ways? Most recently, this topic was raised in the context of a special section on behavioral priming in the January 2014 issue of Perspectives on Psychological Science.
Full Disclosure: One of my colleagues (Joseph Cesario) wrote an article that was submitted to this special section. Though the opinions expressed here are my own, I have discussed similar issues with him previously.
I read over the articles in this section and am glad that the researchers have taken up the challenge of addressing concerns that behavioral priming research is nonreplicable. Although this issue is a concern across psychological science, behavioral priming has unfortunately become the poster child for failures to replicate, as evidenced by Nobel Laureate Daniel Kahneman’s open letter to priming researchers on a “train wreck looming” for the discipline, and the failure of the Many Labs Replication Project to replicate two behavioral priming experiments.
One article in the special issue stood out to me: Stroebe and Strack’s (2014; hereafter S & S) manuscript on the illusion of exact replication. In a nutshell, they propose that no replication is ever really exact, and that even if a replication were exact, it would not establish the generality of the effect. They strongly recommend that researchers conduct conceptual replications, which are designed specifically to investigate the extent to which research findings will generalize beyond the experiment itself.
All this is well and good, but it seems to me that an issue that got behavioral priming (and psychological science) in trouble is that conceptual replications are not systematic. What do I mean by that? Consider the following scenario:
A researcher is interested in the effect of message strength on attitude change. The researcher designs a strong version of a particular persuasive message (message “A”) and a weak version of the same message (message “a”). He or she finds that participants demonstrate more attitude change on attitude measure X after reading message A than after reading message a.
The researcher wants to establish the generality of the effect, so he or she designs different messages (messages “B” and “b”) and finds that, as predicted, B has a greater influence than b on attitude measure Y. If the experimenter is particularly conscientious, perhaps he or she repeats the experiment again with different messages and attitude variables.
The problem with this scenario is twofold. First, it is not systematic. What the researcher wants is to show that the effect of message strength on attitude change is not limited to a particular set of stimuli (i.e., messages), outcome measure (i.e., attitude), or sample (i.e., group of participants). However, this kind of design only allows for generalization across participants, and not across stimuli or outcome measures. Second, it allows for a great deal of researcher degrees of freedom. Researchers can easily drop messages that do not “work” or even whole experiments that do not find the predicted effect.
The second issue has been covered extensively by others (e.g., Simmons, Nelson, & Simonsohn, 2011), so I focus my attention on the first: that conceptual replications are often not systematic in their approach. Consider S & S’ thoughts on the matter: “If a conceptual replication using a different operationalization of both constructs had succeeded in supporting the theoretical hypothesis, then our trust in the validity of the underlying theory would have been strengthened” (p. 62, emphasis added).
However, it is easy to see why this kind of logic is flawed. While changing one set of variables (e.g., the message or the attitude measure) would help to increase generalizability, changing both sets at once introduces ambiguity. It is unclear whether the same underlying construct is being measured, and failures to replicate are not very conclusive. In other words, only when I hold one set of variables constant can I meaningfully examine differences in the other set. As a more concrete example, consider the fictional message study. If I vary message type (i.e., messages A vs. B) while keeping the attitude measure the same, I can determine whether the effect of message strength (A vs. a, and B vs. b) holds for both messages (that is, is generalizable) or is specific to a particular message.
What does this mean for the practical researcher? Well, one tempting answer is to continue to conduct conceptual replications, but to vary only one set of variables. However, even this approach is problematic because it still allows for considerable freedom in which variables are used and, more importantly, which are reported. If a researcher uses sets of variables A through E, but only A and B work, it is tempting to report only the studies using those two. Ultimately, this option is less than desirable.
However, there is another option—a better option—that directly tests the assumption of generalizability in a systematic fashion: treating stimuli as a random factor.
Psychologists are typically familiar with what random factors are, but Judd, Westfall, and Kenny (2012) give a concise definition: “random factors are factors whose levels are sampled from some larger population of levels across which the researcher wishes to generalize, whereas fixed factors are those whose levels are exhaustive” (p. 55). The most commonly encountered random factor is participants themselves. Psychologists sample individuals from a larger population of interest with the hope of making inferences that extend to this larger population. No one cares that undergraduates from Michigan State University are more likely to donate money after hearing a persuasive message from a speaker. However, this result is interesting if it can be generalized to other people.
Researchers often want to reach conclusions that generalize over levels of factors, such as the conclusion that people are more likely to be persuaded by strong messages than weak messages. However, instead of sampling from theoretical populations of strong messages and weak messages, they often choose a specific example from each. In the example I gave earlier, this would be like choosing strong/weak messages A/a. However, each of these messages is just one example of a strong or weak message; the researcher could just as easily have chosen messages B/b, C/c, and so on. This makes the experimental results ambiguous. The experimenter wants to conclude that the strength of the message is what influenced its persuasiveness, but differences could also be the result of idiosyncrasies of the particular stimuli chosen.
Conceptually replicating the fictional message experiment with different message types does little to demonstrate its generalizability, because the number of stimuli needed to generalize over a population is greater than the number included in a typical psychological experiment (usually one, two, or four). Judd, Westfall, and Kenny discuss this extensively in their 2012 paper, but the take-home message is that more stimuli are better. When factors are treated as random, sampling more levels of each factor (e.g., more participants, more messages) imposes higher demands on the researcher, but it provides concrete evidence of generalization.
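To make this concrete, here is a minimal simulation sketch in Python. It is my own illustration, not an analysis from Judd, Westfall, and Kenny, and every name and number in it is an assumption chosen for the example (`simulate_experiment`, `false_positive_rates`, 40 participants per condition, a message-to-message standard deviation of 0.5). Message strength has no true effect in the simulated data, yet each message has its own idiosyncratic pull on attitudes. The sketch compares a naive t-test that ignores which message a participant read with a simple by-message test that treats messages as the unit of analysis. The by-message test is a deliberately simplified stand-in for the mixed models one would normally fit to treat stimuli as random.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
# Only the df values needed for k = 1, 4, 20 below are listed (an assumption
# of this sketch); other values of k would need additional entries.
T_CRIT = {6: 2.447, 38: 2.024, 78: 1.991}


def simulate_experiment(rng, k, n_per_condition=40, sd_message=0.5, sd_noise=1.0):
    """Simulate one study in which message strength has NO true effect.

    Each condition uses k messages; participants are split evenly across them
    and each participant reads exactly one message.
    """
    n_per_message = n_per_condition // k
    data = {}
    for condition in ("strong", "weak"):
        messages = []
        for _ in range(k):
            quirk = rng.gauss(0.0, sd_message)  # idiosyncratic pull of this message
            scores = [quirk + rng.gauss(0.0, sd_noise) for _ in range(n_per_message)]
            messages.append(scores)
        data[condition] = messages
    return data


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def false_positive_rates(k, n_sims=2000, seed=1):
    """Return (naive_rate, by_message_rate); the latter is None when k == 1."""
    rng = random.Random(seed)
    naive_hits = 0
    by_message_hits = 0
    for _ in range(n_sims):
        data = simulate_experiment(rng, k)
        # Naive analysis: pool participants and ignore which message they read.
        strong = [y for msg in data["strong"] for y in msg]
        weak = [y for msg in data["weak"] for y in msg]
        if abs(t_statistic(strong, weak)) > T_CRIT[len(strong) + len(weak) - 2]:
            naive_hits += 1
        # By-message analysis: one mean per message, messages are the units.
        if k > 1:
            strong_means = [statistics.fmean(msg) for msg in data["strong"]]
            weak_means = [statistics.fmean(msg) for msg in data["weak"]]
            if abs(t_statistic(strong_means, weak_means)) > T_CRIT[2 * k - 2]:
                by_message_hits += 1
    naive_rate = naive_hits / n_sims
    by_message_rate = by_message_hits / n_sims if k > 1 else None
    return naive_rate, by_message_rate


if __name__ == "__main__":
    for k in (1, 4, 20):
        naive, by_message = false_positive_rates(k)
        print(f"messages per condition = {k:2d}  naive = {naive}  by-message = {by_message}")
```

Under these assumptions, the naive test declares a "significant" effect of message strength far more often than the nominal 5% when there is one message per condition, and the inflation shrinks as the same participants are spread over more messages. The by-message test cannot even be computed with a single message per condition, which is another way of saying that one stimulus provides no information about how much stimuli vary.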
While I do not disagree with S & S’ position that conceptual replications are useful for the generalization and advancement of psychological science, their encouragement of unsystematic replication undermines the goal of generalization. Psychologists aiming for generalization should consider conducting fewer studies with more rigorous methods. Treating stimuli such as message type as a random factor is a way of obtaining conclusions about generalization that are supported by statistical evidence, rather than theoretical conjecture.