Thursday, January 7, 2016

Averaging can produce misleading standardized effect sizes

Recently, there have been many calls for a focus on effect sizes in psychological research. In this post, I discuss how naively using standardized effect sizes with averaged data can be misleading. This is particularly problematic for meta-analysis, where differences in number of trials across studies could lead to very misleading results.

There are two main types of effect sizes in typical use: raw effect sizes and standardized effect sizes. Raw effect sizes are what you typically see in a plot: for instance, the effect of a priming manipulation might be 30ms. The advantage of raw effect sizes is that they are closer to the process of interest and more interpretable. We all know what it means for something to take 30ms to happen.

Another kind of effect size is the standardized effect size. With a standardized effect size, the raw effect is compared to some measure of variability in the population. For instance, if the standard deviation of children’s heights at age 10 were 3 inches, and a “good” diet had an effect of 1.5 inches on average, we could say that the effect of the diet was .5 standard deviations. This is the logic of Cohen’s \(d\), for instance. The disadvantage of this is that it is more difficult to understand what an effect of “half a standard deviation” means (and the variance-accounted-for statistics such as \(\eta^2\) and \(\omega^2\) are even more difficult to interpret); but standardized effect sizes have many good properties, including a close relationship to the concept of statistical power, comparability across paradigms, and the fact that they can often be computed from reported statistics such as \(t\) and \(F\).

For these reasons, standardized effect sizes are very common in meta-analysis. However, the common practice of averaging over trials in cognitive psychology makes them difficult to compare or even interpret.

Consider a typical cognitive experiment with 30 participants, each performing 10 response time trials in two conditions. Typically each participant’s data will be averaged to form a single, average response time in each condition; these average RTs are then submitted to a repeated measures ANOVA (in fact, some R packages, such as afex, do this automatically).

Hypothetical data are shown in the plot below. These data represent two hypothetical experiments, one with 10 samples per participant and one with 50.

Error bars represent Morey (2008) adjusted within-subject error bars, computed using this code.

I have generated the data so that everything is the same across the two experiments except the scale of the “error”: the scale of the error in experiment 2 is \(1/\sqrt{5}\) times that of experiment 1, because each averaged “observation” represents five times more data. The raw effect size is precisely the same, but our certainty about the effect size is greater in experiment 2. This is exactly as it should be.
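As a rough illustration of why the error scale shrinks by \(1/\sqrt{5}\), here is a minimal Python sketch that simulates trial-level data and averages it per participant. This is not the code used to generate the data above; the function name, the effect of 0.2, and the standard deviations are all assumptions chosen for illustration, and the number of participants is inflated so that the ratio is estimated stably.

```python
import random
import statistics


def simulate_difference_scores(n_participants, n_trials, effect=0.2,
                               trial_sd=1.0, subject_sd=1.0, seed=0):
    """Return one averaged condition difference (B minus A) per participant.

    All parameter values are hypothetical; they are not taken from the post.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_participants):
        baseline = rng.gauss(0.0, subject_sd)  # this participant's overall level
        trials_a = [rng.gauss(baseline, trial_sd) for _ in range(n_trials)]
        trials_b = [rng.gauss(baseline + effect, trial_sd) for _ in range(n_trials)]
        # Averaging over trials is the step that shrinks the "error"
        diffs.append(statistics.fmean(trials_b) - statistics.fmean(trials_a))
    return diffs


diffs_10 = simulate_difference_scores(2000, n_trials=10, seed=1)
diffs_50 = simulate_difference_scores(2000, n_trials=50, seed=2)

mean_10 = statistics.fmean(diffs_10)  # raw effect, 10 trials per condition
mean_50 = statistics.fmean(diffs_50)  # raw effect, 50 trials per condition
sd_ratio = statistics.stdev(diffs_10) / statistics.stdev(diffs_50)

print(f"raw effect: {mean_10:.3f} vs {mean_50:.3f}")
print(f"ratio of error scales: {sd_ratio:.2f} (theory: sqrt(5))")
```

Under these assumptions the two raw effects agree up to sampling noise, while the spread of the averaged difference scores is about \(\sqrt{5}\) times larger with 10 trials than with 50.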

We can now perform the typical repeated measures ANOVAs on these two data sets, using the afex package. The package will, if requested, compute the common partial \(\eta^2\) standardized effect size statistic.

The results of the ANOVA for the “condition” in experiment 1 are:

Effect      df      MSE    F       pes   p.value
condition   1, 29   0.08   7.72    .21   .009

And the results of the ANOVA for the “condition” in experiment 2 are:

Effect      df      MSE    F       pes   p.value
condition   1, 29   0.02   38.58   .57   <.0001

This would not surprise anyone who routinely uses repeated measures ANOVA. Typically, the whole point of running more trials is to get more power. We performed more trials, and we obtained a higher \(F\) value for the comparison of interest.

Notice that the sum of squares for the effect is precisely the same for both experiments. That’s because the raw effect is precisely the same. What is driving the higher \(F\) value is the lower residual mean square (MSE) for the comparison, which is about 5 times smaller in experiment 2. Again, this is what we expect. More trials, less “noise”.

But notice what happens to partial \(\eta^2\). Because the MSE is smaller, the proportion of variance accounted for by the condition effect is larger. This drives the partial \(\eta^2\) from .21 in experiment 1 to .57 in experiment 2. Researchers have previously warned about using partial \(\eta^2\) for comparisons across designs (see for instance, Olejnik & Algina, 2003), but these two experiments appear to have the same design; at least, from the perspective of someone used to only analysing averaged data, they do.
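The link between \(F\) and partial \(\eta^2\) can be checked directly from the reported statistics. The sketch below uses the standard identity that partial \(\eta^2\) equals \(F \cdot df_{effect} / (F \cdot df_{effect} + df_{error})\); the function name is my own, and the inputs are the rounded values from the two ANOVA tables above.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from a reported F statistic.

    Follows from SS_effect / (SS_effect + SS_error) together with
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


pes_exp1 = partial_eta_squared(7.72, 1, 29)
pes_exp2 = partial_eta_squared(38.58, 1, 29)

# Holding the effect mean square fixed, dividing the MSE by 5 multiplies F by 5.
pes_exp1_with_5x_trials = partial_eta_squared(5 * 7.72, 1, 29)

print(round(pes_exp1, 2), round(pes_exp2, 2), round(pes_exp1_with_5x_trials, 2))
```

The third value shows that taking experiment 1 and doing nothing but shrinking its MSE by a factor of 5 reproduces the partial \(\eta^2\) of experiment 2.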

This has the potential to wreak havoc on meta-analyses. Suppose someone combs the literature looking for \(F\) values and computing partial \(\eta^2\) values from the \(F\) values (or, alternatively, Cohen’s \(d\) from \(t\) values). Assume experiment 1 represents a patient group; due to time constraints, the patients only had time for 10 trials per condition. Suppose experiment 2, on the other hand, represents a group of college students, who had time for more trials. The figure below shows the standardized effects in the two experiments.
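The same thing happens if the meta-analyst computes a Cohen’s \(d\) from \(t\) values. As a sketch, assume they use the paired-samples variant \(d_z = t/\sqrt{n}\) (other variants of \(d\) exist; this choice and the function name are assumptions), and recall that with one numerator degree of freedom \(t = \sqrt{F}\).

```python
import math


def cohens_dz_from_t(t_value, n_participants):
    """Paired-samples standardized effect size d_z = t / sqrt(n)."""
    return t_value / math.sqrt(n_participants)


# With 1 numerator degree of freedom, |t| is the square root of F.
t_exp1 = math.sqrt(7.72)
t_exp2 = math.sqrt(38.58)

d_exp1 = cohens_dz_from_t(t_exp1, 30)
d_exp2 = cohens_dz_from_t(t_exp2, 30)

print(f"d_z, experiment 1: {d_exp1:.2f}")
print(f"d_z, experiment 2: {d_exp2:.2f}")
print(f"ratio: {d_exp2 / d_exp1:.2f}")
```

The ratio of the two \(d_z\) values is about \(\sqrt{5}\), even though the raw effect is identical.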

Although the raw effect size is precisely the same across the two experiments, the standardized effect size is radically different, possibly leading to erroneous conclusions. Even if there are no systematic differences in number of trials across experiments with different kinds of groups, this introduces a new source of variability into estimates, as well as making it nearly impossible to interpret the effect size. What is the “true” standardized effect size? It seems difficult to say. How can we solve this problem?

Solution 1: Generalized \(\omega^2\)

One option is generalized \(\omega^2\) (see for instance, Olejnik & Algina, 2003). Instead of using the residual variance to standardize against, generalized \(\omega^2\) standardizes against all measured (as opposed to manipulated) factors. For instance, the variability in participants is a measured source of variability. These sources of variability are assumed to be stable properties of populations and not affected by mere design choices. We can compute generalized \(\omega^2\) again using afex, which yields \(\omega^2_g=0.011\) for experiment 1 and \(\omega^2_g=0.012\) for experiment 2; notice that these are very similar. The effect “looks” smaller, because participants vary quite a bit relative to the size of the effect.
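To make the logic of “standardizing against measured factors” concrete, here is a small Python sketch for a one-way within-subject design. Note that it computes generalized \(\eta^2\), the simpler relative of generalized \(\omega^2\) (the \(\omega^2\) version adds bias corrections that are not shown here). The function name and the toy data are made up for illustration.

```python
def within_subject_effect_sizes(scores):
    """Return (partial eta squared, generalized eta squared).

    scores: one tuple of averaged condition means per participant.
    Assumes a single manipulated within-subject factor.
    """
    n = len(scores)       # participants
    k = len(scores[0])    # conditions
    all_values = [x for row in scores for x in row]
    grand_mean = sum(all_values) / (n * k)

    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_condition = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subj_means)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_error = ss_total - ss_condition - ss_subjects

    # Partial: standardize against residual variability only.
    partial = ss_condition / (ss_condition + ss_error)
    # Generalized: participant variability (a measured factor) joins the denominator.
    generalized = ss_condition / (ss_condition + ss_subjects + ss_error)
    return partial, generalized


# Hypothetical averaged scores for three participants in two conditions.
toy_scores = [(1.0, 2.0), (3.0, 5.0), (5.0, 6.0)]
partial_eta_sq, generalized_eta_sq = within_subject_effect_sizes(toy_scores)
print(partial_eta_sq, generalized_eta_sq)
```

Because the between-participant sum of squares usually dominates the denominator, shrinking the residual term by running more trials changes the generalized statistic far less than it changes the partial one.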

One problem with this approach is that the statistics necessary to compute generalized \(\omega^2\) are not typically reported, meaning that this solution is useless for meta-analysis of existing literature. Ideally, we’d like a way to use reported statistics to at least compare across studies, when designs are similar enough.

Solution 2: Adjust the MSE

If we know the relative numbers of trials across two studies that have the same basic design, we should be able to “adjust” the MSE in the formula for the effect size (whichever effect size it happens to be) for the number of trials. In our example, experiment 2 has five times as many trials as experiment 1; we therefore would expect the MSE of experiment 2 to be one-fifth as large as that for experiment 1. To make the effect size computed from experiment 2 comparable to that from experiment 1, we can multiply its MSE by 5 before applying the formula for the effect size of interest. For partial \(\eta^2\), this leads to an easy adjustment:
\[ \eta^2_2 = \frac{1}{c/\eta^2_1 - c + 1} \] where \(c\) is the adjustment factor, \(\eta^2_1\) is the original partial \(\eta^2\), and \(\eta^2_2\) is the adjusted partial \(\eta^2\).
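For readers who want the intermediate steps, the adjustment can be derived as follows, writing \(SS_{\text{effect}}\) and \(SS_{\text{error}}\) (my notation) for the effect and residual sums of squares in the original analysis. Multiplying the MSE by \(c\) with the degrees of freedom held fixed is the same as multiplying \(SS_{\text{error}}\) by \(c\):
\[ \begin{aligned} \eta^2_1 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} \quad&\Rightarrow\quad \frac{SS_{\text{error}}}{SS_{\text{effect}}} = \frac{1}{\eta^2_1} - 1,\\ \frac{1}{\eta^2_2} = 1 + \frac{c\,SS_{\text{error}}}{SS_{\text{effect}}} &= 1 + c\left(\frac{1}{\eta^2_1} - 1\right) = \frac{c}{\eta^2_1} - c + 1. \end{aligned} \]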

As an example, take our experiment 2, which had a partial \(\eta^2\) of 0.571 and five times as many trials per participant as experiment 1. Applying the formula above yields
\[ \begin{aligned} \eta^2_2 &= \frac{1}{5/0.571 - 5 + 1}\\ &\approx 0.21 \end{aligned} \] which matches the partial \(\eta^2\) from experiment 1 very well.
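The adjustment is easy to wrap in a small helper. The sketch below is a direct transcription of the formula for \(\eta^2_2\); the function and argument names are my own.

```python
def adjust_partial_eta_squared(eta_sq, c):
    """Partial eta squared after multiplying the MSE by the factor c.

    eta_sq: the originally reported partial eta squared
    c:      ratio of trial counts (this study relative to the reference study)
    """
    return 1.0 / (c / eta_sq - c + 1.0)


adjusted_exp2 = adjust_partial_eta_squared(0.571, 5)
print(round(adjusted_exp2, 2))

# Adjusting by c and then by 1/c returns the original value.
round_trip = adjust_partial_eta_squared(adjusted_exp2, 1 / 5)
print(round(round_trip, 3))
```

The same helper works in either direction: a factor below 1 rescales a study with fewer trials up to a reference study with more.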

The problem with this approach is that it uses experiment 1 as a “reference” experiment. It is therefore not clear what the standardized effect size means in this case, except as a way to compare across experiments with similar designs. This may be enough for someone performing a meta-analysis, particularly if they can’t obtain the statistics to compute generalized \(\omega^2\), but as a general reporting solution, it is unsatisfactory.

Wrap up

Although standardized effect sizes have been advocated as a general tool for science and are increasingly reported, they are difficult to interpret because they are affected by trivial, common design decisions. The issues I raise here should be of interest to anyone working with standardized effect sizes, particularly those performing meta-analysis. They affect repeated measures designs with averaged data most acutely; however, between-subjects designs are also affected if each participant contributes an “average” score to the analysis. In the between-subjects case the adjustment would have to be different, but with a large number of trials per participant the problem might be acceptably ignored, if the error in each participant’s score is small enough.
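One way to see the between-subjects point is with a simple variance-components sketch. Assume (these symbols are mine, and the model is an assumption) that participants’ true scores have standard deviation \(\sigma_b\), that each trial adds independent noise with standard deviation \(\sigma_w\), and that each participant’s score is an average over \(n\) trials. For a raw group difference \(\delta\), the standardized effect is then
\[ d = \frac{\delta}{\sqrt{\sigma_b^2 + \sigma_w^2/n}}, \]
which still depends on \(n\), but approaches \(\delta/\sigma_b\) as \(n\) grows. When \(\sigma_w^2/n\) is small relative to \(\sigma_b^2\), differences in the number of trials have little effect on \(d\).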

Added postscript

After a re-read, I want to ensure that I make clear that I'm not implying that the only problem here is with meta-analyses; that's just what drove me to write this post, and how I decided to frame it. But consider this: if an arbitrary decision (driven merely by the resources at hand, such as time or money, or even whim) such as "how many trials will we perform per cell in this experiment?" can cause the standardized effect size to increase by roughly 170% (from .21 to .57), that standardized effect size should not be taken to reveal any psychological "truth" and is useless for drawing substantive conclusions.


  1. Great post. You might also be interested in:

  2. Thanks! You make some nice observations there, which are particularly relevant in this comment: here. I didn't expand on it too much above, but I think you're right about the difficulties inherent in standardised effect sizes. Although standardised effect sizes do allow for comparison across paradigms, given the difficulties, I wonder if the meaningfulness of such comparisons is merely illusory. It would definitely be profitable to at least consider what meta-analysis would look like based solely on raw effect sizes.

    1. Actually, those are not my blog posts, and so not my interesting observations (though I agree with them). So sadly, I cannot take credit.

    2. With a mere 14-month delay: Thanks to jwdink for linking to my blog, and thanks to Richard for his comments!

  3. Interesting post!
    I had recently come across this issue myself, and observed that including random intercepts and slopes in the simulations attenuates the overestimation of the effect sizes (depending on the ratio between the within- and between-subjects variability).

    I have presented this at our labmeeting only (and do not have a blog to summarize this), but I thought it might be useful to share these slides here, in case you would be interested:

    As you highlight in the post-script, I agree one of the most important implications of this is that standardized effect sizes behave in unanticipated ways when researchers first average across trials for each participant before analyzing the data, but this has implications for power analysis, and meta-analysis indeed.

    1. Very nice. It looks like we're thinking along the same lines.

  5. Richard, can you provide intuition/explanation for why the standardized effect sizes are so much lower in Expt 1 vs. Expt 2? I could see it if it were Cohen's d because Expt 1 would have higher same mean difference but higher standard deviation, thus Expt 1's d would be lower than Expt 2's. How does this work for eta-squared though, which according to Wikipedia is SS_treatment / SS_total?

    There, you have SS in both numerator and denominator, so why aren't the ratios in both experiments roughly the same? Is it because the SS_total is similar/same in both experiments, and Expt 1 has higher SS_treatment than does Expt 2?

    Also, would be interested in seeing code that underlies this running example.

    1. edit: "Expt 1 would have higher same mean difference" should be "... have same mean difference"

    2. Still wondering about the specifics behind why the standardized effect size is lower in Expt 1, if anyone has an idea...

  7. Can Solution 2 also be applied when adjusting reliability coefficients?
    e.g. If I measure reliability using the split-half method, I am really only measuring the reliability of a measure with half as many trials as the actual measure used in analysis.
