Replicability in Psychological Science: Daniel Kahneman

The 25th APS Conference section on Good Data Practices and Replicability had several well-known speakers giving some well-intentioned advice on improving psychological science. The special section focused on three general topics: increasing the adoption of good data practices, increasing the number and publication of direct replications, and editorial responses to encourage the enactment of such practices.

In particular, one speaker stood out: Nobel Prize-winning psychologist and Princeton professor Daniel Kahneman. Kahneman is no stranger to the current issue of replicability in psychological science. A longtime advocate for increasing confidence in psychological results, Kahneman penned an open letter to the psychological community in 2012, calling for direct replications of well-known social priming work in order to restore faith in the veracity of such research.

It comes as no surprise that the bulk of his presentation focused on properly powered studies; that is, ensuring that psychological experiments run enough participants to reach 90% power (obtaining enough data so that researchers have a 90% chance of discovering an effect, if such an effect does exist). This call for larger sample sizes is not particularly new in psychology (e.g., see Cohen, 1988), but it raises an additional question: how large does a sample need to be to reliably find an effect? The full answer is complicated, but the short answer is that it depends on effect size: how strong the relationship between the independent and dependent variables is.

Sometimes the size of a certain effect is large and easy to measure, even with the naked eye. Consider the effect of sex on height, d = 1.8, a large effect (d > .8) according to Cohen’s (1988) standards. (All effect sizes in the rest of this article are drawn from an insightful article written by Meyer et al., 2001.) However, effects are rarely this large. For example, most, if not all, physicians would argue that analgesics have a significant effect on the reduction of pain, yet the relationship between taking ibuprofen and pain relief is only d = 0.3, a “small” effect. Clearly, just because an effect is “small” does not mean that it lacks substantial real-world applications.

A quick power analysis makes the contrast concrete: a properly powered study (i.e., 90% power) aimed at revealing the relationship between sex and height would require 16 subjects, while a properly powered study aimed at establishing the effect of ibuprofen on pain reduction would require almost 30 times as many subjects (470 total). What this means in the real world is that failure to find certain effects may often be due to small sample size rather than the absence of an effect. Since most effects in psychological research are closer to the latter effect (d < .3), it is no surprise that failures to replicate occur when the average sample size is low (~30 participants per condition).
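The sample sizes quoted above can be approximately reproduced with a short calculation. The sketch below is not necessarily the procedure used for the article's figures: it assumes an independent-groups t-test with equal group sizes and a two-sided alpha of .05 (none of which the article states), and it uses a normal approximation with a small-sample correction rather than an exact noncentral-t computation. The function name `n_per_group` is made up for this illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.90, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample t-test.

    Assumptions (not stated in the article): equal group sizes, two-sided
    alpha, and a normal approximation plus a small-sample correction term.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = z.inv_cdf(power)           # quantile matching the target power
    n_normal = 2 * (z_alpha + z_power) ** 2 / d ** 2
    # Add z_alpha^2 / 4 to compensate for using z instead of t quantiles.
    return ceil(n_normal + z_alpha ** 2 / 4)


for label, d in [("sex and height", 1.8), ("ibuprofen and pain", 0.3)]:
    total = 2 * n_per_group(d)
    print(f"{label}: d = {d}, total N = {total}")
```

Under these assumptions the totals come out to 16 and 470, in line with the figures in the text. Dedicated power software may differ by a participant or two because it works with the exact t distribution.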

The case for larger sample sizes is illustrated even more strikingly by another association, the effect of gender on weight. Another speaker, Uri Simonsohn, conducted a survey aimed at determining how many participants were necessary to verify certain effects that should be extremely obvious. While the effect of gender on weight is large (d = 1.0), it was not until 47 participants had been sampled that the difference reached statistical significance. Thus, if the effect a researcher is examining is weaker than the relationship between gender and weight (as is true of virtually all psychological research), experimenters should be conducting experiments with at least 50 participants per cell.

Readers may be wondering why sample size is important. After all, if low sample size leads to decreased power, then Type II errors (failing to find an effect when one is present) would be more likely, but Type I errors (finding an effect when one is not actually present) would remain unaffected. While this is technically true, there are several reasons to believe that low sample sizes lead to increased Type I error in practice. One such reason is that questionable research practices inflate Type I error rates, in particular the tendency to run participants until statistical significance is achieved (e.g., until the obtained t exceeds the critical t).
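The effect of running participants until significance is reached can be illustrated with a small simulation. This is a simplified sketch, not a reconstruction of any analysis from the talk: it assumes two groups drawn from the same normal population (so the null hypothesis is true), a known standard deviation of 1 (so a z-test stands in for the t-test), and a hypothetical researcher who starts at 10 participants per group and tests again after every 5 additional participants, up to 50. All of these numbers and the function name are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def false_positive_rate(start_n, max_n, step=5, n_sims=2000, alpha=0.05, seed=1):
    """Share of null-effect experiments declared significant at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        # Both groups come from the same population: any "effect" is spurious.
        a = [rng.gauss(0, 1) for _ in range(start_n)]
        b = [rng.gauss(0, 1) for _ in range(start_n)]
        while True:
            n = len(a)
            z = (fmean(a) - fmean(b)) / sqrt(2 / n)   # known-SD z statistic
            if abs(z) > z_crit:
                false_positives += 1                   # stop at first "hit"
                break
            if n >= max_n:
                break                                  # give up, no effect found
            a.extend(rng.gauss(0, 1) for _ in range(step))
            b.extend(rng.gauss(0, 1) for _ in range(step))
    return false_positives / n_sims


fixed = false_positive_rate(start_n=50, max_n=50)     # one test at n = 50
peeking = false_positive_rate(start_n=10, max_n=50)   # test after every 5 added
print(f"fixed sample size: {fixed:.3f}")
print(f"optional stopping: {peeking:.3f}")
```

With a single planned test, the false positive rate should land near the nominal 5%. With repeated testing it should come out noticeably higher, even though no individual test uses a threshold other than .05.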

While a discussion of such practices exceeds the scope of this article (but see here), another noteworthy consequence of small sample sizes is that, by definition, only large effects can be detected. This means that when effects are detected using “common” research procedures, the estimates tend to be inflated relative to the actual effect size (Ferguson & Heene, 2012). Furthermore, experimenters tend to publish studies that show an effect and ignore those that fail to find one (the “file drawer” problem). Recall that according to traditional null-hypothesis significance testing, a significance level of 5% means a 1 in 20 chance of rejecting the null hypothesis when it is true. When researchers selectively choose studies to publish, “significant” results that are merely due to Type I error can be combined to create a package of studies that looks compelling but is not empirically valid.
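The inflation of effect sizes among small studies that reach significance can also be sketched with a simulation. The setup below is an assumption made for illustration only: a true effect of d = 0.3, 30 participants per group, a known standard deviation of 1 (so a z-test replaces the t-test), and a "file drawer" that keeps only the significant studies. The function name and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def mean_significant_effect(true_d=0.3, n_per_group=30, n_studies=5000,
                            alpha=0.05, seed=1):
    """Average observed effect among simulated studies that reach significance.

    Returns (mean observed d among significant studies, share significant).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)            # standard error of the mean difference
    significant = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        observed_d = fmean(treatment) - fmean(control)   # SD is 1, so this is d
        if abs(observed_d) / se > z_crit:
            significant.append(observed_d)               # the "published" studies
    return fmean(significant), len(significant) / n_studies


mean_d, share = mean_significant_effect()
print(f"share of studies reaching significance: {share:.2f}")
print(f"mean observed d among those studies:    {mean_d:.2f}")
```

Because a study of this size can only reach significance when the observed difference is roughly 0.5 or larger, the average "published" effect in this sketch has to exceed the true value of 0.3 by a wide margin.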

What should (or can?) be done about this problem? Obviously larger sample sizes are needed in psychological research, but this standard is often difficult to meet in practice. Kahneman’s proposed solution is a compromise between the theoretical problems of relying on underpowered samples and the realistic problem of limited resources and participant availability. His approach consists of changing the standards for what should be accepted in psychological journals. Generally speaking, it is currently common practice to bundle several small-scale studies (e.g., 3-4 studies, n ~ 30 per cell) together in a publication. Kahneman envisions a process where the critical manipulation or theoretical construct is tested in a highly powered (i.e., power > 90%) “flagship” study. Extensions of this basic construct can then be conducted in smaller-scale “satellite” studies that may have less power.

This approach would be beneficial in that it would provide both more precise estimates of effect size and more confidence insofar as the power reached “appropriate” levels. But there are some issues with this approach. First, achieving 90% power may be unattainable for some researchers, especially when effects are very small. (A simple independent groups t-test with an effect size of d = .2 would require over 1000 participants.) Furthermore, precise estimates of the size of an effect may not be available, especially if the work is exploratory. Relatedly, due to publication bias, reported effect sizes are likely overestimates of the true effect size, so even studies that are adequately powered (according to a priori estimates) may turn out, when examined a posteriori, to have been less than adequately powered (e.g., power < 90%).
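The last point, that a study planned around an overestimated effect ends up underpowered, can be made concrete with a rough calculation. The scenario is hypothetical and not taken from the article: a published estimate of d = 0.5, a study planned with 86 participants per group on that basis, and a true effect of d = 0.3. The helper `approx_power` is an illustrative normal approximation for a two-sided, equal-n, independent-groups test, not an exact t-based computation.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(true_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    shift = true_d * sqrt(n_per_group / 2)   # expected value of the test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Hypothetical scenario: the literature reports d = 0.5, so the study is
# planned with 86 participants per group (roughly 90% power if d really is 0.5).
planned_n = 86
print(f"power if d = 0.5: {approx_power(0.5, planned_n):.2f}")
print(f"power if d = 0.3: {approx_power(0.3, planned_n):.2f}")
```

In this sketch the study that looked adequately powered on paper has only about a coin-flip chance of detecting the smaller true effect.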

A slightly different problem arises from the assumption that the critical theory can also be tested in a single experiment. It is common in psychological research that a proposed theory yields several testable hypotheses, all of which are important to establishing the overall validity of the theory. It may not always be clear what component (if any) is critical to the validity of the hypothesis being tested.

Problems also arise when the “flagship” study fails to support (or even contradicts) the “satellite” studies. Which study or set of studies is correct? This argument is not a productive one, in that data are not correct or incorrect; the data just are. However, a discrepancy between studies should lead researchers to pause and reexamine their original set of hypotheses and procedures. The situation is perhaps more troubling when the “flagship” study supports the overall theory but the “satellite” extensions provide mixed results. In this case, the proposed changes do nothing to safeguard against selective reporting of studies that work.

Still, despite the problems with the flagship-satellite method, its purpose is well guided: journals should expect, and even require, studies to have higher levels of quality. Kahneman’s approach is a heuristic for the real changes that are necessary for psychology to mature as a discipline: sample sizes need to be larger, direct replications required, and transparency in reporting adopted as standard practice.
