## Sunday, March 29, 2015

### The TES Challenge to Greg Francis

This post is a follow-up to my previous post, “Statistical alchemy and the 'test for excess significance'”. In the comments on that post, Greg Francis objected to my points about the Test for Excess Significance. I laid out a challenge in which I would use simulation to demonstrate these points. Greg Francis agreed to the details; this post is about the results of the simulations (with links to the code, etc.).

## A challenge

In my previous post, I said this:

Morey: “…we have a bit of a mystery. That $E$ [the expected number of non-significant studies in a set of $n$ studies] equals the sum of the expected [Type II error] probabilities is merely asserted [by Ioannidis and Trikalinos]. There is no explanation of what assumptions were necessary to derive that fact. Moreover, it is demonstrably false.”

Greg Francis replied:

Francis: “…none of your examples of the falseness of the equation are valid because you fix the number of studies to be n, which is inconsistent with your proposed study generation process. Your study generation process works if you let n vary, but then the Ioannidis & Trikalinos formula is shown to be correct…[i]n short, you present impossible sampling procedures and then complain that the formula proposed by Ioannidis & Trikalinos does not handle your impossible situations.”

To which I replied,

Morey: “If you don’t believe me, here’s a challenge: you pick a power and a random seed. I will simulate a very large ‘literature’ according to the ‘experimenter behaviour’ of my choice, importantly with no publication bias or other selection of studies. I will guarantee that I will use a behaviour that will generate experiment set sizes of 5. I will save the code and the ‘literature’ coded in terms of ‘sets’ of studies and how many significant and nonsignificant studies there are. You get to guess what the average number of significant studies are in sets of 5 via I&T’s model, along with a 95% CI (I’ll tell you the total number of such studies). That is, we’re just using Monte Carlo to estimate the expected number of significant studies in sets of experiments n=5; that is, precisely what I&T use as the basis of their model (for the special case of n=5).” “This will answer the question of ‘what is the expected number of nonsignificant studies in a set of n?’”

This challenge will very clearly show that my situations are not “impossible”. I can sample them in a very simple simulation. Greg Francis agreed to the simulation:

Francis: “Clearly at least one of us is confused. Maybe we can sort it out by trying your challenge. Power=0.5, random seed= 19374013”

I further clarified:

Morey: “Before I do this, though, I want to make sure that we agree on what this will show. I want to show that the expected number of nonsignificant studies in a set of n (=5) studies is not what I&T say it is, and hence, the reasoning behind the test is flawed (because ‘excess significance’ is defined as deviation from this expected number). I also want to be clear what the prediction is here: Since the power of the test is .5, according to I&T, the expected number of nonsignificant studies in a set of 5 is 2.5. Agreed?”

…to which Greg Francis agreed.

I have performed this simulation. Before reading on, you should read the web page containing the results:
The table below shows the results of the simulation of 1,000,000 “sets” of studies. All simulated “studies” are published in this simulation; no questionable research practices are involved. The first column shows $n$, and the second column shows the average number of non-significant studies for sets of $n$, which is a Monte Carlo estimate of I&T's $E$. As you can see, it is not 2.5.

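| Total studies ($n$) | Mean nonsig. studies | Expected by TES ($E$) | SD nonsig. studies | Count |
|---|---|---|---|---|
| 1 | 1 | 0.5 | 0 | 499917 |
| 2 | 1 | 1.0 | 0 | 249690 |
| 3 | 1 | 1.5 | 0 | 125269 |
| 4 | 1 | 2.0 | 0 | 62570 |
| 5 | 1 | 2.5 | 0 | 31309 |
| 6 | 1 | 3.0 | 0 | 15640 |
| 7 | 1 | 3.5 | 0 | 7718 |
| 8 | 1 | 4.0 | 0 | 3958 |
| 9 | 1 | 4.5 | 0 | 1986 |
| 10 | 1 | 5.0 | 0 | 975 |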

(I have truncated the table at $n=10$; see the HTML file for the full table.)
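For readers who want to reproduce the shape of this table, here is a minimal Python sketch. It is not the original simulation code. It assumes the experimenter's rule is "keep running studies until the first non-significant result, then stop", which is the rule implied by the table (every set has exactly one non-significant study) and by the discussion in the comments. The function name `simulate_sets` and the reduced number of sets are choices made for this illustration, and because Python's random number generator differs from the one used originally, the counts will not match the table digit for digit even with the same seed.

```python
import random
from collections import defaultdict

def simulate_sets(n_sets, power, seed):
    """Simulate sets of studies; each set stops at its first non-significant study.

    Returns a dict mapping set size n -> [number of sets, total non-significant studies].
    """
    rng = random.Random(seed)
    tally = defaultdict(lambda: [0, 0])
    for _ in range(n_sets):
        n = 0
        nonsig = 0
        while nonsig == 0:
            n += 1
            # A study is significant with probability `power`.
            if rng.random() >= power:
                nonsig += 1
        tally[n][0] += 1
        tally[n][1] += nonsig
    return tally

# Power and seed are the values Greg Francis chose; 200,000 sets keeps the run short.
tally = simulate_sets(n_sets=200_000, power=0.5, seed=19374013)

print("n  mean_nonsig  expected_by_TES  count")
for n in sorted(tally)[:10]:
    count, nonsig = tally[n]
    print(n, nonsig / count, n * 0.5, count)
```

Every row prints a mean of exactly 1 non-significant study, whatever the value of $n$, while the third column grows as $n/2$.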

I also showed that you can change the experimenter's behaviour and make it 2.5. This indicates that the assumptions one makes about experimenter behaviour determine the expected number of non-significant studies in a particular set. Across all sets of studies, the expected proportion of significant studies equals the power. However, how those significant studies are distributed across sets of different sizes is a function of the decision rule.

The expression for the expected number of non-significant studies in a set of $n$ is not correct (without further very strong, unwarranted assumptions).
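A short worked version of this claim, as a sketch under one assumption that is inferred from the table rather than stated above: the experimenter runs studies, each with Type II error probability $\beta$, until the first non-significant result and then stops. The symbols $N$ (the random set size) and $K$ (the number of non-significant studies in the set) are introduced only for this illustration.

The set size is geometric:

$$P(N = n) = (1-\beta)^{n-1}\,\beta, \qquad n = 1, 2, \dots$$

Every set ends with its single non-significant study, so the conditional expectation does not depend on $n$:

$$E[K \mid N = n] = 1 \quad \text{for every } n.$$

I&T's expression would instead give $n\beta$, which agrees with this only when $n = 1/\beta$. Marginally, however, the proportion of non-significant studies is still $\beta$:

$$\frac{E[K]}{E[N]} = \frac{1}{1/\beta} = \beta.$$

With $\beta = 0.5$ and $n = 5$, the conditional expectation is 1 rather than 2.5, even though half of all studies are non-significant in the long run.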

1. The only assumption that is needed is that the set of studies is a representative set of studies with 50% power. Assume that there is an urn with 100 studies and an experimenter draws at random from this urn. The experimenter is expected to draw 50% significant results and 50% non-significant results. That is a standard assumption in statistics and not "a very strong, unwarranted assumption".

2. Yes, but you're not understanding the point. The point is that the test assumes that n is fixed. That's why we agreed in the simulation that n=5. In your example, you don't specify a number of balls drawn from the urn. You're marginalizing over n, which is NOT what the TES does. No one is disputing that half the studies are expected to be significant with power = .5; the simulations clearly show that, and it can be easily proved. What is being disputed, if you read all the material, is that half the studies will be expected to be significant *for every n*.

3. In your description of the challenge, you left out a crucial condition of my agreement. I said, "As long as your procedure for producing studies reports all the studies that are relevant for a theoretical claim and do not use some kind of questionable research practice, then I think we are in agreement."

So what does it mean for your procedure to report "all studies that are relevant for a theoretical claim"? Well, you ran 2,000,661 simulated studies, so that's what would have to be reported with regard to a (hypothetical) theoretical claim being made about this (hypothetical) effect (or set of effects). If all 2,000,661 studies are reported, then we find that the proportion of significant and non-significant studies is 0.500165195 and 0.499834805, respectively. This matches the 0.5 power as specified by the I&T analysis (nothing special here, it's just what we mean by power).

If someone takes a subset of these experiments, say, one of the sets with 7 experiments (6 significant and 1 non-significant) and uses them as support for a theoretical claim about some effect, then that person is ignoring the thousands (or millions) of other studies (half of which are non-significant) that are relevant to the theoretical claim. That person is cherry-picking study results and thereby is making a poor scientific argument for their theoretical claim (indeed, their 7 experiments will overestimate the effect size). A TES analysis (based on the true power of 0.5) would properly report that we should be skeptical about the presented relationship between the theoretical claim and these 7 experiments because, in isolation, they appear too good to be true. If these 7 experiments were combined with the thousands of other experiments, then the scientist could (probably) make a convincing case for the theoretical claim. A TES analysis of the full set would not report any problem.

To summarize, if scientists are working in an environment where there are many investigations of an effect, then their theoretical claims have to reflect that environment. They should not isolate their findings from the larger set of studies. Such isolation is essentially "cherry-picking" of results, and the TES analysis will sometimes pick up on this kind of bias.

Just to be clear, this kind of cherry-picking of results does not imply malicious intent to deceive. It can be very difficult for a scientist to judge whether studies in another lab, with different stimuli, or different subjects really investigate the same topic. I can easily see how a scientist might mistakenly convince themselves that their set of 7 experiments is a good set for their theory while other experiments do not apply. This mistake is similar to model overfitting, and a TES analysis can raise a flag to warn a scientist that they might be engaging in something like that.
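The two proportions quoted in this comment can be checked with a few lines of arithmetic. The sketch below assumes, as the table in the post indicates, that each of the 1,000,000 simulated sets contains exactly one non-significant study; the variable names are illustrative only.

```python
n_sets = 1_000_000          # each set ends with exactly one non-significant study
total_studies = 2_000_661   # total number of simulated studies quoted in the comment

nonsig = n_sets
sig = total_studies - nonsig

prop_sig = sig / total_studies
prop_nonsig = nonsig / total_studies
print(f"significant: {prop_sig:.9f}  non-significant: {prop_nonsig:.9f}")
```

Both sides agree on this marginal figure; the disagreement in the thread is about the expectation conditional on a given set size.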

4. Please stick with the question the simulation was meant to answer. The only reason there are millions of "studies" is that these are *theoretical replications* meant to estimate the expected value through Monte Carlo simulation. Even if I had done the simulation once, the expected value remains the same.

So here's the question that I put to you, and that you agreed to: What is the expected number of nonsignificant studies in a set of 5, in a situation with no QRPs and all studies published? Is it necessarily what I&T said? Yes or no?

We can move on to other issues when we've addressed the question the simulation was meant to answer.

1. I cannot answer your question until you provide me with a simulation where the reported set includes all studies that are relevant to a theoretical claim. Your set of 5 studies is cherry-picked from a much larger set of studies. If you run your simulation one time, then you will most likely not produce a set of 5 studies; half of your simulations will stop after the first study. It is improper to ignore those studies when drawing a conclusion about the theoretical claim.

I get the feeling that you are trying to debunk the TES on the basis of something the TES is not doing. In particular, contrary to your response to replicationindex, the TES does *not* claim that half the studies should be significant for every n. The TES claims that half the studies should be significant for the *full set of studies* that are related to the theoretical claim.

5. What you are saying is clearly inconsistent with I&T's "derivation" of the test. n is the number of "already published" studies. It is therefore known and fixed, but of arbitrary value. Hence, I&T's test assumes that for *each n* the expected number of significant studies is n(1-beta) (if every study has the same beta). This is just straight from I&T's text.

Why would you agree to a simulation where the question is about n=5? Do "all studies that are relevant to a theoretical claim" exist in set sizes 5? What were you thinking the simulation would show?

[previous post deleted for copy-paste error....]

1. I&T proposed to use the TES for meta-analyses, which often pool together many sets of studies (often with only one experiment per set). It seems obvious that I&T would not predict that every set of published studies has to match the expected value, but they would predict that an unbiased set of all relevant studies would match the expected number of significant outcomes. Perhaps I&T were not as clear about this particular point as you would like, but I think their behavior makes their attitude clear.

I agreed to the simulation, with my stipulation, because you seemed to firmly believe you could produce the promised results (and I wanted to better understand why you are so opposed to the TES). What I expected from your simulations is exactly what happened. You were unable to produce the results you promised without suppressing a large number of experimental outcomes. Of course not all sets relevant to a theoretical claim exist in sets of size 5. This is why I did not think your simulations could provide what you needed.

2. I did produce the promised results. I did not "suppress" anything: it was a Monte Carlo simulation! The question at hand was "what is the expected number of nonsignificant results in a set of 5"? That answer, with no QRPs, no "suppressed results", was "exactly 1", or, for the purposes at hand, "not 2.5". This is not disputable. It follows from the definition of the expected value.

Imagine a scientific literature under the stopping rule considered. You pick up a paper and note that it has 5 studies. How many do you expect to be non-significant? Exactly 1. Are there too few non-significant studies? No. There is exactly the right number. Every set has exactly 1! There are more total studies than one would expect, that is true, but a major part of my point is that *how* you look at "too much" significance depends on the process.

"Perhaps I&T were not as clear about this particular point as you would like, but I think their behavior makes their attitude clear." If by "not as clear" you mean "resting on all kinds of unwritten assumptions," I guess. It is strange for you to defend it in this way, given that you read the paper and (naturally!) thought it could be applied in other contexts. These unwritten assumptions were a major part of the "statistical alchemy" blog post.

"When they say that the outcomes from 5 experiments provide support for their theoretical claim, I take that for what it is." Please tell me what this means. I do not see any way that you can divine an "experiment set generating process" from "tak[ing] it for what it is." This conversation is frustrating because I'm trying to be precise and in response I get hand-waving back. Perhaps you can take a set of studies you looked at and show precisely how you determined the experiment generating process and what would have occurred if it had been repeated. [Please note that this is not the same as "what would happen if I replicated each of these experiments." That's not how a p value is computed.]

"But you only get the properties you wanted if you do not include a large number of studies (whether published or not) when considering theoretical claims." This is absolutely not true. The "properties I want" are properties like "expected value" which are not dependent on ignoring anything; they are dependent on the long-run properties of the process.

3. I suspect the condition under which I&T's expression would hold is that n is an ancillary statistic -- that is, its distribution doesn't depend on beta. In the optional stopping case, for instance, n is clearly a random variable whose distribution depends on beta. In general, it would be difficult to defend a requirement that n is ancillary since any focus -- even field-wide -- on "larger" or "more promising" effects would change the distribution of the number of studies done, and hence n would be a function of beta and not ancillary.

Unless they are pre-registered, of course.

4. Here's another you might like: a sufficient condition for I&T's expression is the exchangeability of the studies. This seems to be a nice feature for the TES, in that it captures some notion that the studies are equally (and maximally? not sure) informative. However, *any* consideration of the outcome of past studies in planning future ones would violate the assumption of exchangeability. This is, of course, absurd for most scientific research.

5. How you look at too much significance depends on the theoretical claims of the authors and how their reported studies support those claims. Thus, if the theoretical claims differ, so will the conclusion of excess significance.

Let's make this more concrete with an example of a situation where applying the TES (or the I&T expression) really would be inappropriate. Suppose someone wants to investigate some variation of the false memory effect (maybe they want to study it under water vs. on dry land). To investigate differences, they have to produce the effect on dry land. If the power is modest, not every experiment will produce the FM effect on dry land; and authors may not bother to report those failures. I can imagine situations where such selective reporting about the traditional FM effect does not undermine theoretical conclusions about the comparison between the dry land and under water situations. If the theoretical conclusions are about the differences in the FM effect under water versus on dry land, then it would be inappropriate for the TES to be applied to the outcomes of the dry land part of the studies, which by design are all significant.

However, if the authors use the same set of studies to make conclusions about the strength or robustness of the dry land FM effect, then their publication bias leads to misrepresentations about the size and robustness of the effect; and a TES analysis would be appropriate. Bias is always relative to a theoretical claim.

So, are the outcomes of your n=5 studies being used to support some theoretical claim? If so, then the absence of the other studies is problematic. If not, then the TES would not be applied because what you have is just a set of 5 experiments that are grouped together for unknown reasons.

6. ...and by the same token, why are you saying "if you run your simulation one time, then you will most likely not produce a set of 5 studies", implying that the possible outcomes from the experiment are something other than a set of 5 studies, yet when you compute the p values for the TES, you do not include any other ns as possible outcomes?

7. I said it because it is obviously true for your experiment set generating process. When I compute Ptes, I use the set that corresponds to what was reported by the authors. When they say that the outcomes from 5 experiments provide support for their theoretical claim, I take that for what it is. I have never seen a case where the authors say that the outcomes from 3 experiments would have been sufficient but they go ahead and unnecessarily report 2 more experiments. Maybe the TES would not apply in such a case, I would have to think about it.

Your experiment set generating process seems fine if all studies are included when considering theoretical claims, and the TES will not find fault with such a set of studies. But you only get the properties you wanted if you do not include a large number of studies (whether published or not) when considering theoretical claims. That's publication bias, and the TES will (sometimes) find fault with such a set of studies.

1. [see response above]

8. If your method works for power, how about gambling? You go to a casino and keep playing while you are winning and stop after you lose (50% red, 50% black). Some nights you break even (1 win, 1 loss); all other nights you win more than you lose (2:1, 3:1, etc.). I think you just found a sure way to beat the odds and get rich. Go and try it.

1. You have forgotten the nights where you lose on the first draw and then go home (0 wins 1 loss). That will happen a lot, balancing it out.

2. Yes, as Alexander points out, you're missing something *very important*...

3. Do you really not see that Alex is repeating my argument against your simulations? replicationindex (hi Uli!) was trying to make it obvious how your method of producing a subset of 5 experiments with 4 significant and 1 non-significant outcomes involved many other experiments. If you do not report these other experiments, then there is publication bias in your reporting method. If you do report them, then there are not just 5 experiments but many more (with a success rate of right around one half, just as I&T indicate there should be).

4. "What is the expected number of nonsignificant studies in a set of n=5?" is a question about sets with n=5. What about this is unclear?

5. What you state is not unclear, it is just not relevant to the issues we are discussing. The question the TES asks has never been about what is the expected number of nonsignificant studies in a set of n=5 (or whatever number). It is about the expected number of nonsignificant (or significant) studies in the full set of studies that are related to a theoretical claim. If the authors making a theoretical claim say that there were n=5 experiments, then that is what is used in the TES analysis. If the authors do not share (or consider when making their claims) many other relevant experiments, then they are cherry-picking or suppressing results, which may show up as excess significance.

With regard to the theoretical claim about some effect that is being measured by the experiments, your full set of experiments is not n=5 but much larger. For your simulation to produce your first set with 4 significant and 1 nonsignificant outcome, you had to run 100 studies (which I found by varying M to find the smallest M that produced a set with 4 significant and 1 nonsignificant outcomes). So, if you tell me that these n=5 experiments are the basis for your theoretical claim, you are ignoring the other 95 experiments (publication bias).

If you pick a set of n=5 experiments and make no theoretical claim about them, then there is nothing for TES to consider. It's just 5 experiments that you put together for reasons known only to you. You are correct that the I&T formula would not apply here; and it *is not* applied in such a situation. So, it seems you are making a big fuss about a situation (when no theoretical claim is being made) where the TES should not be (and is not being) used.

6. Hi all,

I just wanted to clarify my comment. It was not a criticism or endorsement of either argument. I have not been following this very closely, maybe I will chime in later.

I simply corrected Uli's claim that you could cook the books, by reminding him of the 0 win 1 loss nights that would happen roughly 50% of the time.

All the best.

9. If you go to a casino and play 5 rounds a night for a year (50% black, 50% red), what is your expected number of wins on an average night? Do you really think it is 4? No, it is 2.5 wins on average. Your sets of 5 studies are not representative sets of 5 studies. They are created by a sampling strategy that leads to a systematic bias for all set sizes. It simply does not follow that the TES is useless because it predicts the expected number of successes in a representative sample of studies (i.e., go to the Casino and play 5 games).

1. If you go into a casino and play until you lose once, then given that you played 5 times, the expected number of wins is exactly 4. ALL sets of 5, under my stopping rule, will be exactly the same. The fact that they have a different expected value than under some other stopping rule *is precisely the point.*

"They are created by a sampling strategy that leads to a systematic bias for all set sizes." YES! Indeed, the expected value is not the same. That's the point!

The whole point is that the conditions under which I&T's expression would be applicable are not spelled out, and do not in fact exist in the literature except for certain conditions (pre-registration), and would not be expected to exist, due to the fact that studies are dependent in particular ways.

Significant effects are more likely to be followed up on or replicated. Effects with large effect sizes will receive more focus. People re-run experiments when the last experiment looked promising. These are natural dependencies between experiments and they will cause the null hypothesis in the TES to be false.

These are analogous to me continuing to play when I've won. That's what happens in science, and it is part of *normal* scientific behavior. A test that tells you that this happens is useless.
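The contrast between the two casino rules in this exchange can be computed exactly by enumerating every win/loss sequence of a given length. This is a minimal sketch; the function names are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
from itertools import product

def conditional_expected_wins(n, p, can_occur):
    """E[wins | n rounds played], enumerating every win/loss sequence of length n.

    `can_occur(seq)` says whether the stopping rule can produce `seq`
    as a complete night of play.
    """
    total_prob = Fraction(0)
    weighted_wins = Fraction(0)
    for seq in product((True, False), repeat=n):
        if not can_occur(seq):
            continue
        prob = Fraction(1)
        for win in seq:
            prob *= p if win else 1 - p
        total_prob += prob
        weighted_wins += prob * sum(seq)
    return weighted_wins / total_prob

def fixed_rounds(seq):
    # Playing a fixed number of rounds: every sequence of that length is possible.
    return True

def stop_at_first_loss(seq):
    # Playing until the first loss: only "all wins, then one loss" is possible.
    return all(seq[:-1]) and not seq[-1]

p = Fraction(1, 2)
fixed = conditional_expected_wins(5, p, fixed_rounds)
stopped = conditional_expected_wins(5, p, stop_at_first_loss)
print(fixed, stopped)
```

The same coin gives 5/2 expected wins in five rounds under the fixed rule and 4 under the stop-at-first-loss rule, which is the dependence on the stopping rule that the comment describes.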

10. "With regard to the theoretical claim about some effect that is being measured by the experiments, your full set of experiments is not n=5 but much larger." No, it isn't; there is no set of experiments! The Monte Carlo simulation was designed for one purpose: to estimate the conditional expectation. This is trivial and could have been done without sampling. Expectation does not require any "full set" of experiments. Suppose I had instead assumed a distribution on n, sampled that, and then sampled significant "studies" based on a binomial model. If it took me 100 studies to sample n=5, would that have any bearing at all on my estimation of the conditional expectation? No, of course not. This is a silly argument.

"If the authors making a theoretical claim say that there were n=5 experiments, then that is what is used in the TES analysis." Yes! This is correct! And the question is, in those 5 experiments, what is the expected number of nonsignificant studies? The answer: it depends! This is a fact.

I really wish you'd interface with the *statistical* argument I'm making. Talking about "replication", "theoretical claims", "pool[ing] together", "relevant studies", is just obfuscating the issue. I'm talking about the properties of a statistical model in the abstract. We're modelling studies as Bernoulli random variables, and "sets" of studies are just sequences of Bernoulli random variables. There's nothing about "theoretical claims" here.

I can't see this conversation continuing unless you choose to discuss the statistical argument in precise, statistical terms. Prove the necessary conditions under which I&T's expression will be applicable. Note that n is random but observed, but of arbitrary value, and therefore the expectation they describe is a conditional expectation. Use a statistical argument, not vague ideas like "theoretical claims" and "relevant studies". These are all irrelevant to the abstract argument at hand; bringing them up simply obfuscates things.

As I noted above, I think that requirement has something to do with exchangeability: that is, that the ordering of the experiments is arbitrary (perhaps in the case of different powers, conditional exchangeability, given a population of true powers). This would make sense; the binomial distribution arises when the underlying Bernoulli RVs are exchangeable, hence any ordering is equally likely. But you can't address my argument without describing the conditions under which I&T's expression is applicable. That's the whole question.

Also, if you could define "tak[ing] it for what it is" for me that would be great. I note that this was in response to a statistical question of mine, so I assume it has a statistical meaning.
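The alternative process mentioned earlier in this comment (first draw $n$ from some distribution, then draw the study outcomes from a binomial model) can be sketched as follows. The choice of a geometric distribution for $n$ with continuation probability 0.5, the function name `simulate_ancillary`, and the number of sets are all assumptions made for this illustration; the only point is that $n$ is drawn without looking at the study outcomes.

```python
import random

def simulate_ancillary(n_sets, power, seed):
    """Draw each set size first, ignoring outcomes, then draw the study outcomes.

    Returns a dict mapping set size n -> (number of sets, total non-significant studies).
    """
    rng = random.Random(seed)
    by_n = {}
    for _ in range(n_sets):
        # Set size: geometric, chosen by coin flips unrelated to any study result.
        n = 1
        while rng.random() < 0.5:
            n += 1
        # Outcomes: n independent studies, each non-significant with probability 1 - power.
        nonsig = sum(rng.random() >= power for _ in range(n))
        count, total = by_n.get(n, (0, 0))
        by_n[n] = (count + 1, total + nonsig)
    return by_n

by_n = simulate_ancillary(n_sets=200_000, power=0.5, seed=19374013)
count5, total5 = by_n[5]
print(count5, total5 / count5)
```

Under this process the mean number of non-significant studies in sets of 5 comes out close to 2.5, and that estimate does not depend on how many sets of other sizes were drawn along the way.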

11. I agree that we seem to be at an impasse again. You seem to think that conditional expectation is what the TES deals with, and I say that it does not. You show a method for producing n=5 studies with certain properties, and when I show that the method is biased for the purpose scientists would use it for, you say that is irrelevant.

I think we are basically back where we started. That's a shame because I think we both put in an honest effort to explain our views and try to understand the other person's argument. I am puzzled that my explanation does not clarify things for you, and I suspect you feel the same.

I do not think this is a situation where simply agreeing to disagree is an appropriate end point, but I think further discussion along the present lines is not going to be fruitful. Maybe someone else can join in with a new characterization of the issues.

1. Here's another try. Suppose that you are walking on a beach and you come across a genie in a bottle. The genie pops out and offers you something:

"I will allow you to skip the whole TES: all you have to do is pick up a paper and ask, and I will tell you whether, in that set, the expected number of nonsignificant results in the observed papers is different from I&T's expression. I will tell you whether the null hypothesis is true or false. I know the true power, so estimating it is not a problem. You will then say that '[t]he set of studies are [sic] simply not scientific' and lack 'internal consistency'. (Francis, 2012)"

This would make your life substantially easier; the genie knows everything, and of course we only use statistics because we don't have access to the underlying truth. Do you take the genie up on their offer?

2. Your proposal highlights that you really do not understand what the TES is doing.

13. Final post:

"Prove the necessary conditions under which I&T's expression will be applicable."

I am not a statistician, so excuse me if I am not using your language, but I understand the basic concepts of statistics well enough.

In my words, any time the set of studies is a random sample of studies from the population of studies with a given power (say 50%), the expected value of significant results is the number of studies multiplied by power (say 10 studies, expected value = 10 * .5 = 5).

This is similar to a coin flip. If a coin flip has a 50% chance to show heads or tails and you flip a coin 10 times, you are expected to get 5 heads and 5 tails.

Now I am aware that you understand this, but you claim that this has nothing to do with the TES and that I am missing your point. So, I agree with you that further discussion is pointless because you don't seem to think that this logic applies to the TES and I don't understand why your simulation is supposed to show a major flaw of the TES.

I think you need to make your argument clearer if you want to make a case that the TES is misleading. Ultimately, the success of the TES or other methods will rest on the ability to make correct predictions about actual experimental outcomes in the real world.
