Rejoinder to Schnall (2014) in Social Psychology


Given widespread interest in our direct replication of Schnall, Benton, and Harvey (2008), our rejoinder may be found here.

The commentary written by Schnall (2014) may be found here.

The original article (Johnson, Cheung, & Donnellan, 2014) may be found here.

Brent Donnellan has blogged about this article here.

Simone Schnall has blogged about this article here; comments may be found here.

For the purpose of maintaining transparency about the communications between us, Dr. Schnall, and the special issue editors (Brian Nosek and Daniël Lakens), all email correspondence may be found here. Additional information about the process of pre-registration used in this special issue is also provided.


Conceptual Replication and Random Factors


The debate over replication in psychological science is hardly new. Should psychologists prefer direct replications, which attempt to recreate the original procedures of a study as exactly as possible, or conceptual replications, which attempt to manipulate the theoretical variables in an experiment in novel ways? Most recently, this topic was raised in the context of a special section on behavioral priming in the January 2014 issue of Perspectives on Psychological Science.

Full Disclosure: One of my colleagues (Joseph Cesario) wrote an article that was submitted in this special issue. Though the opinions expressed in this article are my own, I have discussed similar issues with him previously.

I read over the articles in this section and am glad that the researchers have taken up the challenge of addressing concerns that behavioral priming research is nonreplicable. Although this issue is a concern across psychological science, behavioral priming has unfortunately become the poster child for failures to replicate, as evidenced by Nobel Laureate Daniel Kahneman’s open letter to priming researchers warning of a “train wreck looming” for the discipline, and the failure of the Many Labs Replication Project to replicate two behavioral priming experiments.

One article in the special issue stuck out to me, Stroebe and Strack’s (2014; hereafter S & S) manuscript on the illusion of exact replication. In a nutshell, they propose that no replication is ever really exact, and that even if a replication were to be exact it would not establish the generality of the effect. They strongly recommend researchers conduct conceptual replications, which are designed to specifically investigate the extent to which research findings will generalize beyond the experiment itself.

All this is well and good, but it seems to me that an issue that got behavioral priming (and psychological science) in trouble is that conceptual replications are not systematic. What do I mean by that? Consider the following scenario:

A researcher is interested in the effect of message strength on attitude change. The researcher designs a strong version of a particular persuasive message (message “A”) and a weak version of the same message (message “a”). He or she finds that after reading message A, participants demonstrate more attitude change on attitude measure X than after reading message a.

The researcher wants to establish the generality of the effect, so he or she designs different messages (messages “B” and “b”) and finds that, as predicted, B has a greater influence on attitude measure Y. If the experimenter is particularly conscientious, perhaps he or she repeats the experiment again with different messages and attitude variables.

The problem with this scenario is two-fold. First, it is not systematic. What the researcher wants to show is that the effect is not limited to a particular set of stimuli (i.e., messages), outcome measure (i.e., attitude), or sample (i.e., group of participants). However, this kind of design only allows for generalization across samples, not across stimuli or outcome measures. Second, it allows for a great deal of researcher degrees of freedom. Researchers can easily drop messages that do not “work,” or even whole experiments that do not find the predicted effect.

The second issue has been covered extensively by others (e.g., Simmons, Nelson, & Simonsohn, 2011), so I focus my attention on the first: conceptual replications are often not systematic in their approach. Consider S & S’s thoughts on the matter: “If a conceptual replication using a different operationalization of both constructs had succeeded in supporting the theoretical hypothesis, then our trust in the validity of the underlying theory would have been strengthened” (p. 62, emphasis added).

However, it is easy to see why this kind of logic is flawed. While changing one set of variables (e.g., the messages or the attitude-change measure) would help to increase generalizability, changing both sets introduces ambiguity. It is unclear whether the same underlying construct is being measured, and failures to replicate are not very conclusive. In other words, only when I hold one set of variables constant can I meaningfully examine differences in the other set. As a more concrete example, consider the fictional message study. If I vary message type (i.e., messages A vs. B), I can determine whether the effect of message strength (A vs. a, and B vs. b) holds for both messages (that is, is generalizable) or is specific to a particular message.

What does this mean for the practical researcher? Well, one tempting answer is to continue to conduct conceptual replications, but only vary one set of variables. However, even this approach is problematic because it still allows for considerable freedom in what variables are used, and more importantly, are reported. If a researcher uses sets of variables A – E, but only A and B work, it is tempting to only report the studies using the former two. Ultimately, this option is less than desirable.

However, there is another option—a better option—that directly tests the assumption of generalizability in a systematic fashion: treating stimuli as a random factor.

Psychologists are typically familiar with what random factors are, but Judd, Westfall, and Kenny (2012) give a concise definition: “random factors are factors whose levels are sampled from some larger population of levels across which the researcher wishes to generalize, whereas fixed factors are those whose levels are exhaustive” (p. 55). The most commonly encountered random factor is participants themselves. Psychologists sample individuals from a larger population of interest with the hope of making inferences that extend to this larger population. No one cares that undergraduates from Michigan State University are more likely to donate money after hearing a persuasive message from a speaker. However, this result is interesting if it can be generalized to other people.

Researchers often want to reach conclusions that generalize over levels of factors, such as the conclusion that people are more likely to be persuaded by strong messages than weak messages. However, instead of sampling from theoretical populations of strong messages and weak messages, they often choose a specific example from each. In the example I gave earlier, this would be like choosing strong/weak messages A/a. However, both of these messages are just one example of strong and weak messages; the researcher could just as easily have chosen messages B/b, C/c, and so on. This makes the experimental results ambiguous. The experimenter wants to conclude that the strength of the message is what influenced its persuasiveness, but differences could also be the result of a particular bias in the stimuli.

Conceptually replicating the fictional message experiment with different message types does little to demonstrate its generalizability, because the number of stimuli needed to generalize over a population is far greater than the number included in a typical psychological experiment (often one, two, or four). Judd, Westfall, and Kenny discuss this extensively in their 2012 paper, but the take-home message is that more stimuli are better. When factors are treated as random, sampling more varied stimuli (e.g., participants, messages) imposes higher demands on the researcher, but it provides concrete evidence of generalization.
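To make the stakes concrete, here is a small stdlib-Python simulation (all numbers, such as the message-effect SD of 0.5, are invented for illustration, not drawn from any real study). When the true strength effect is zero but each message carries its own idiosyncratic effect, an experiment built on a single message pair declares “significant” differences far more often than one that samples many messages per condition:

```python
import random
import statistics

def false_positive_rate(n_messages_per_condition, n_participants=100,
                        n_sims=2000, msg_sd=0.5, noise_sd=1.0, seed=1):
    """Simulate experiments where the true strength effect is zero, but
    each message has its own idiosyncratic effect (a random factor).
    Returns how often a naive two-sample z-test declares significance."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-tailed 5% critical value (normal approximation)
    per_msg = n_participants // n_messages_per_condition
    hits = 0
    for _ in range(n_sims):
        scores = {"strong": [], "weak": []}
        for cond in scores:
            for _ in range(n_messages_per_condition):
                msg_effect = rng.gauss(0, msg_sd)  # idiosyncratic message bias
                for _ in range(per_msg):
                    scores[cond].append(msg_effect + rng.gauss(0, noise_sd))
        m1, m2 = (statistics.mean(scores[c]) for c in ("strong", "weak"))
        v1, v2 = (statistics.variance(scores[c]) for c in ("strong", "weak"))
        se = (v1 / n_participants + v2 / n_participants) ** 0.5
        if abs(m1 - m2) / se > z_crit:
            hits += 1
    return hits / n_sims

fp_one = false_positive_rate(n_messages_per_condition=1)    # one message pair
fp_twenty = false_positive_rate(n_messages_per_condition=20)  # 20 messages/cell
print(fp_one, fp_twenty)
```

With a single message pair the false-positive rate is massively inflated (well above 50% under these assumed parameters), because the message’s idiosyncratic effect is perfectly confounded with condition; sampling twenty messages per condition brings it much closer to the nominal 5%.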

While I do not disagree with S & S’s position that conceptual replications are useful for the generalization and advancement of psychological science, their encouragement of unsystematic replication undermines the goal of generalization. Psychologists aiming for generalization should consider conducting fewer studies with more rigorous methods. Treating stimuli such as message type as a random factor is a way of obtaining conclusions about generalization that are supported by statistical evidence rather than theoretical conjecture.

Data for Schnall et al. (2008) available

Hand Washing

My colleagues (Felix Cheung and Brent Donnellan) and I recently completed a replication of a paper by Schnall, Benton, and Harvey (2008) on the effect of cleanliness on moral judgment. We directly replicated both experiments in the original manuscript; this report will be published in Social Psychology in early 2014.

We also conducted an online replication with a larger sample size (n = 731) in order to obtain more precise parameter estimates. Details on this study (and an overview of the two in-press replications) can be found in a blog entry on Brent Donnellan’s site. The data can be found here.

Lecture on Replication


I will be giving a guest lecture at Michigan State University on June 25th (12:40 – 2:30pm) in Olds Hall, Room 12. All are welcome to attend. The lecture is entitled: Replication in Social Psychology. Here is the abstract:

Replication in psychology is a hot topic. From Daryl Bem’s publication of research in support of ESP to Diederik Stapel’s massive fraud scandal, social psychology has been under the microscope and in the public eye. Now more than ever, we as psychologists need to restore credibility in our field by critically examining our own experimental results to ensure that they are valid. Replication is the gold standard in science, so one key component of this enterprise is encouraging well-conducted, highly powered, direct replications. This lecture will discuss why replication is important, what it can and cannot do, and ultimately what changes need to be made across the discipline to improve psychology as a whole.

You can read more of my posts on this topic here and here.

Replicability in Psychological Science: Daniel Kahneman

The 25th APS Conference section on Good Data Practices and Replicability featured several well-known speakers giving well-intentioned advice on improving psychological science. The special section focused on three general topics: increasing the adoption of good data practices, increasing the number and publication of direct replications, and editorial responses to encourage the enactment of such practices.

In particular, one speaker stood out: Nobel Prize-winning psychologist and Princeton University professor Daniel Kahneman. Kahneman is no stranger to the current debate over replicability in psychological science. A long-time advocate for increasing confidence in psychological results, Kahneman penned an open letter to the psychological community in 2012, calling for direct replications of well-known social priming work in order to restore faith in the veracity of such research.

It comes as no surprise that the bulk of his presentation focused on properly powered studies; that is, ensuring that psychological experiments run enough participants to reach 90% power (obtaining enough data so that researchers have a 90% chance of discovering an effect, if such an effect does exist). This call for larger sample sizes is not particularly new in psychology (e.g., see Cohen, 1988), but it raises an additional question: how large does a sample need to be to reliably find an effect? The answer is complicated, but in short it depends on effect size: how strong the relationship between the independent and dependent variables is.

Sometimes the size of a certain effect is large and easy to measure, even with the naked eye. Consider the effect of sex on height, d = 1.8, a large effect (d > .8) according to Cohen’s (1988) standards. (All effect sizes in the rest of this article are drawn from an insightful article written by Meyer et al., 2001). However, rarely are effects this large. For example, most—if not all—physicians would argue that analgesics have a significant effect on the reduction of pain, yet the relationship between taking ibuprofen and pain relief is only d = 0.3, a “small” effect. Clearly, just because an effect is “small” does not mean that it does not have substantial real world applications.

In contrast, a quick power analysis reveals that a properly powered (i.e., 90%) study aimed at detecting the relationship between sex and height would require only 16 subjects, while a properly powered study aimed at establishing the effect of ibuprofen on pain reduction would require almost 30 times more (470 total). In practice, this means that a failure to find a given effect may often be due to small sample size rather than the absence of the effect. Since most effects in psychological research are closer to the latter (d < .3), it is no surprise that failures to replicate occur when the average sample size is low (~30 participants per condition).
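The sample sizes quoted above can be roughly reproduced with a few lines of stdlib Python. This is a sketch using the standard normal approximation for a two-sample t-test, so it lands slightly below the exact t-based figures (16 and 470) from a full power analysis:

```python
import math

# Standard normal quantiles, hard-coded to keep this stdlib-only:
Z_ALPHA = 1.959964   # two-tailed alpha = .05
Z_POWER = 1.281552   # power = .90

def n_per_group(d, z_alpha=Z_ALPHA, z_power=Z_POWER):
    """Approximate per-group sample size for a two-sample t-test:
    n = 2 * ((z_alpha + z_power) / d) ** 2, rounded up."""
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(1.8))  # sex -> height (d = 1.8): a handful per group
print(n_per_group(0.3))  # ibuprofen -> pain relief (d = 0.3): hundreds per group
```

The same function shows that a d = .2 effect, mentioned later in this post, needs over 1,000 participants in total (more than 500 per group) for 90% power.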

The case for larger sample sizes can be made even more striking by examining another association: the effect of gender on weight. Another speaker, Uri Simonsohn, conducted a survey aimed at determining how many participants were necessary to verify effects that should be extremely obvious. While the effect of gender on weight is large (d = 1.0), it was not until sampling 47 participants that the difference reached statistical significance. Thus, if the effect a researcher is examining is weaker than the relationship between gender and weight (virtually all psychological research), experiments should include at least 50 participants per cell.

Readers may be wondering why sample size is important: if low sample size leads to decreased power, then Type II errors (failing to find an effect when one is present) become more likely, but Type I errors (finding an effect when none is actually present) should remain unaffected. While this is technically true, there are several reasons to believe that low sample sizes lead to increased Type I error rates. One such reason is that questionable research practices inflate Type I error rates, in particular the tendency to run participants until statistical significance is achieved (i.e., until the observed t exceeds the critical value).
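How badly does "run until significant" inflate Type I error? A small stdlib simulation gives a sense of the damage (the batch size and stopping rule here are illustrative assumptions, not a model of any particular study):

```python
import random
import statistics

def significance_rate(max_n=100, batch=10, n_sims=2000, seed=2):
    """Under a true null effect, test after every batch of participants
    per group and stop as soon as the z-test crosses significance.
    Returns the fraction of simulated studies declared 'significant'."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-tailed 5% critical value (normal approximation)
    hits = 0
    for _ in range(n_sims):
        a, b = [], []
        significant = False
        while len(a) < max_n:
            a.extend(rng.gauss(0, 1) for _ in range(batch))
            b.extend(rng.gauss(0, 1) for _ in range(batch))
            se = (statistics.variance(a) / len(a)
                  + statistics.variance(b) / len(b)) ** 0.5
            if abs(statistics.mean(a) - statistics.mean(b)) / se > z_crit:
                significant = True  # researcher stops and reports the result
                break
        hits += significant
    return hits / n_sims

peeking = significance_rate(batch=10)    # test after every 10 participants
one_look = significance_rate(batch=100)  # single test at n = 100 per group
print(peeking, one_look)
```

A single planned test holds the error rate near the nominal 5%, but peeking after every batch of ten roughly triples or quadruples it under these assumptions, even though every individual test uses the "correct" 5% criterion.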

While a discussion of such practices exceeds the scope of this article (but see here), another noteworthy consequence of small sample sizes is that, by definition, only large effects can be detected. This means that when effects are detected using “common” research procedures, the estimates tend to be inflated relative to the actual effect size (Ferguson & Heene, 2012). Furthermore, experimenters tend to publish studies that show an effect and ignore those that fail to find one (the “file drawer” problem). Recall that under traditional null-hypothesis significance testing, a significance level of 5% leads to a 1 in 20 chance of rejecting the null when the null hypothesis is true. When researchers selectively choose studies to publish, “significant” results due merely to Type I error can be combined into a package of studies that looks compelling but is not empirically valid.

What should (or can?) be done about this problem? Obviously larger sample sizes are needed in psychological research, but this standard is often difficult to meet in practice. Kahneman’s proposed solution is a compromise between the theoretical problems of relying on underpowered samples and the realistic problem of limited resources and participant availability. His approach consists of changing the standards for what should be accepted in psychological journals. Currently, it is common practice to bundle several small-scale studies (e.g., 3–4 studies, n ~ 30 per cell) together in a publication. Kahneman envisions a process in which the critical manipulation or theoretical construct is tested in a highly powered (i.e., power > 90%) “flagship” study. Extensions of this basic construct can then be conducted in smaller-scale “satellite” studies that may have less power.

This approach would be beneficial in that it would provide both more precise estimates of effect size and more confidence that studies reached “appropriate” levels of power. But there are some issues with it. First, achieving 90% power may be unattainable for some researchers, especially when effects are very small. (A simple independent-groups t-test with an effect size of d = .2 would require over 1,000 participants.) Furthermore, precise estimates of the size of an effect may not be available, especially if the work is exploratory. Relatedly, due to publication bias, reported effect sizes are likely overestimates of the true effect size, so even studies that are adequately powered according to a priori estimates may turn out, when examined a posteriori, to have been underpowered (e.g., power < 90%).

A slightly different problem arises from the assumption that the critical theory can be tested in a single experiment. It is common in psychological research for a proposed theory to yield several testable hypotheses, all of which are important to establishing the overall validity of the theory. It may not always be clear which component (if any) is critical to the validity of the theory being tested.

Problems also arise when the “flagship” study fails to support (or even contradicts) the satellite studies. Which study or set of studies is correct? This argument is not a productive one, in that data are not correct or incorrect; the data just are. However, a discrepancy between studies should lead researchers to pause and reexamine their original hypotheses and procedures. The concern is even more acute when the “flagship” study supports the overall theory but the “satellite” extensions provide mixed results. In this case, the proposed changes do nothing to safeguard against selective reporting of studies that work.

Still, despite the problems with the flagship-satellite method, its purpose is well guided; journals should expect, and even require, studies to meet higher standards of quality. Kahneman’s approach is a heuristic for the real changes that are necessary for psychology to mature as a discipline: sample sizes need to be larger, direct replications required, and transparency in reporting adopted as standard practice.

Talk at Michigan State University

I will be giving a talk this Friday at the Michigan State University Social and Personality Brown Bag Series:

“Reducing Racial Categorization Through Alternative Cues to Group Membership”

Much research has supported the idea that categorization by race is both automatic and mandatory. However, according to recent evolutionary models of categorization, it is unlikely that humans would have developed cognitive mechanisms to identify race. More likely, humans evolved to encode coalitional membership of individuals, with race serving as an arbitrary cue to group membership. I present data in favor of this hypothesis, and demonstrate that when a more reliable cue to coalition is present, categorization by race is reduced. I also examine the strength of this reduction, and test what conditions might lead individuals to continue categorizing by race.

The talk will be in the Psychology Building, Room 230, from 12:00pm-1:00pm on April 19th.

You can read more about this research here and here.