## Friday, January 30, 2015

### On verbal categories for the interpretation of Bayes factors

As Bayesian analysis is becoming more popular, adopters of Bayesian statistics have had to consider new issues that they did not before. What makes a “good” prior? How do I interpret a posterior? What Bayes factor is “big enough”? Although the theoretical arguments for the use of Bayesian statistics are very strong, new and unfamiliar ideas can cause uncertainty in new adopters. Compared to the cozy certainty of $p<.05$, Bayesian statistics requires more care and attention. In theory, this is no problem at all. But as Yogi Berra said, "In theory there is no difference between theory and practice. In practice there is."

In this post, I discuss the use of verbal labels for magnitudes of Bayes factors. In short, I don't like them, and think they are unnecessary.

Bayes factors have many good characteristics, and have been advocated by many to replace $p$ values from null hypothesis significance tests. Both $p$ values and Bayes factors are continuous statistics, and it seems reasonable to ask how one should interpret the magnitude of the number. I will first address the issue of how the magnitudes of $p$ values are interpreted, then move on to Bayes factors for a comparison.

### Classical and Frequentist statistics

With $p$ values this matter is either very difficult or very easy, depending on whether you're more Fisherian or more Neyman-Pearsonian. Under the Fisherian view, interpretation of the number is difficult. Fisher said, for instance, that:

“Though recognizable as a psychological condition of reluctance, or resistance to the acceptance of a proposition, the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to, and verifiable by, other rational minds. The level of significance in such cases fulfills the conditions of a measure of the rational grounds for the disbelief it engenders…”

“In general tests of significance are based on hypothetical probabilities calculated from their null hypotheses. They do not generally lead to any probability statements about the real world, but to a rational and well-defined measure of reluctance to the acceptance of the hypotheses they test.” (Fisher, in ‘Statistical Methods and Scientific Inference’)

According to Fisher, the worth of a $p$ value is that it is an objective statement about a probability under the null hypothesis. The strength of the evidence against the null hypothesis, however, is not the $p$ value itself; it is somehow translated from the “reluctance” engendered by a particular $p$ value. The definition of a $p$ value itself, as Fisher points out, does not naturally lead to statements about the world. The problem is immediately obvious. How much reluctance should a rational person feel, based on a $p$ value? Who decides what is reasonable and what is not? To be clear, these questions are not meant as critiques of Fisher's viewpoint, with which I sympathize; I only wish to highlight the burden that Fisher's view of $p$ values places on the researcher.

From the Neyman-Pearson (and the hybrid NHST) perspective, this particular problem goes away completely. As a side benefit of Neyman's rejection of epistemology in favor of an action/decision-based view, statistics do not need to have meaning at all. In Neyman's view, statistical tests are methods of deciding between behaviors, with defined (or in some sense optimal) error rates. A $p$ value of less than $0.05$ might, for instance, lead to an acceptance of a particular proposition, automatically. As Neyman says, rejecting both Fisher's account of scientific inductive reasoning and Jeffreys' Bayesian account:

[T]he examination of experimental or observational data is invariably followed by a set of mental processes that involve [] three categories…: (i) scanning of memory and a review of the various sets of relevant hypotheses, (ii) deductions of consequences of these hypotheses and the comparison of these consequences with empirical data, (iii) an act of will, a decision to take a particular action.

It must be obvious that…[the] use [of inductive reasoning] as a basic principle underlying research is unsatisfactory. The beliefs of particular scientists are a very personal matter and it is useless to attempt to norm them by any dogmatic formula. (Neyman, 1957)

Neyman is not suggesting that statistics is completely automatic – after all, one needs to choose one's decision rules, according to what suits one's goals – but the interpretation of the magnitude of $p$ values in relation to rational belief is irrelevant to Neyman. The worth of a $p$ value (or other statistic) is in the decision it determines. The meaning of the number itself is unimportant, or even nonexistent.

Today $p$ values are used opportunistically in both ways. Most people do not know what a $p$ value is, but they can tell you two things:
1. “When $p$ is less than 0.05, I reject the null hypothesis.” (Neyman-Pearson)
2. “When $p$ is very small, it provides a lot of evidence against the null hypothesis.” (Fisher)

These days, it is well known that $p$ values do not serve Fisher's goals particularly well. Even Fisher did not provide a formal link between $p$ values and any sort of rational belief, and as it happens no such link exists. So if one is to use $p$ values, one is left with only Neyman's decision-based account. The $p$ value is uninterpretable (except trivially, via its definition), but as a decision criterion this isn't as much of a worry.

The use of such criteria is comforting. It provides one fewer thing to argue over; if $p<0.05$, then one can no longer argue that the effect isn't there. If research is a game, these sorts of rules provide people with a sense of fairness. Whatever else happens, we all collectively agree that we will not doubt one another's research on the grounds that there is not enough evidence for the effect. The way $p$ values are used, $p<0.05$ means there is "enough" evidence, by definition.

### Bayesian statistics

Researchers who have adopted Bayesian statistics encounter practical hurdles that they did not have previously. Priors require some care to develop and use, but there is no clear analog in classical statistics, except for perhaps the determination of an alternative for power calculation in the Neyman-Pearson paradigm. Likewise, Bayes factors, as changes in relative model odds, have no clear analog.
The closest thing to a Bayes factor in classical statistics is a $p$ value, but in truth the only similarity is that they are both interpreted in terms of evidential strength. As I outlined in a previous post, a Bayes factor is two things:
1. A Bayes factor is the probability (or density) of the observed data in one model compared to another.
2. A Bayes factor is the relative evidence for one model compared to another.

Part of the elegance of Bayes factors is that these two things are the same; a model is preferred in direct proportion to the degree to which it predicted the observed data.
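
To make the first of these two readings concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than something taken from a real study: the data (15 successes in 20 trials), the two models (a point null at $\theta = 0.5$ and an alternative that places a uniform prior on $\theta$), and the function names.

```python
from math import comb


def binomial_likelihood(k: int, n: int, theta: float) -> float:
    """Probability of k successes in n trials for a fixed success rate theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def marginal_likelihood_uniform(k: int, n: int) -> float:
    """Probability of k successes in n trials, averaged over a uniform prior on theta.

    Integrating the binomial likelihood over theta in [0, 1] gives 1 / (n + 1),
    whatever the value of k.
    """
    return 1.0 / (n + 1)


def bayes_factor_uniform_vs_point(k: int, n: int, theta0: float = 0.5) -> float:
    """Ratio of how well each model predicted the data (uniform-prior model on top)."""
    return marginal_likelihood_uniform(k, n) / binomial_likelihood(k, n, theta0)


# Hypothetical data, chosen only for illustration.
k, n = 15, 20
bf = bayes_factor_uniform_vs_point(k, n)
print(f"Bayes factor (uniform prior vs. theta = 0.5): {bf:.2f}")
```

With these made-up data the Bayes factor comes out at roughly 3.2: the observed data were about three times as probable under the uniform-prior model as under the point null, and that ratio is the relative evidence.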

When they encounter Bayes factors, researchers familiar with $p$ values often ask “How big is big enough?” or “What is a 'big' Bayes factor?” Various proponents of Bayes factors have recommended scales that give interpretations of various sizes of Bayes factors. For instance, a Bayes factor of 4 is interpreted as “substantial” evidence by Jeffreys.

Although it is common practice, in my view there are numerous problems with the practice of assigning verbal labels to sizes of Bayes factors. They are not needed, and they actually distort the meaning of Bayes factors.

As noted previously, $p$ values do not have a ready interpretation without a label like “statistically significant,” which means no more and no less than “I choose to reject the null hypothesis.” Bayes factors, on the other hand, do not require any such categorization. They are readily interpretable as either ratios of probabilities, or changes in model odds. They yield the amount of evidence contributed by the data, on top of what was already known (the priors). As Kass and Raftery state:

Probability itself provides a meaningful scale defined by betting, and so these categories are not a calibration of the Bayes factor but rather a rough descriptive statement about standards of evidence in scientific investigation. (Kass and Raftery, 1995, emphasis mine)

Although Kass and Raftery are often cited as recommending guidelines for interpreting the Bayes factor, they did not. They offered a description of scientific behavior, and how they thought this mapped onto Bayes factors. They were neither interpreting the Bayes factor nor were they offering normative guidelines for action on the basis of evidential strength. If Kass and Raftery were wrong about scientific behavior (after all, they did not offer any evidence for their description), if scientific behavior were to change, or if one were to consider another area besides scientific investigation, these numbers would not serve.

But even if Bayes factors do not need to be interpreted, perhaps it might be good to have the verbal categories anyway. I do not think so, for several reasons.

My first objection is that words mean different things to different people, and meanings change over time. Take, for instance, Jeffreys' category “substantial” for Bayes factors of between 3 and 10. This is less evidence than Jeffreys' category of “strong”, which runs from 10 to 30. This seems strange, because the definition of “substantial” in modern use is “of considerable value.” How are “substantial” and “strong” different? Couldn't we reverse these labels and have just as good a scale?

I believe the answer to this puzzle is that a less common use of “substantial” is “has substance.” For instance, I may say that an argument is “substantial” without meaning that I think the argument is strong; I simply mean that it is not trivially wrong. Put another way, it means that I did not think the argument was insubstantial. This is, I believe, what Jeffreys meant. But why should my evaluation of the strength of evidence depend on my knowledge of uncommon uses of common words? Would someone who did not know the less common use of “substantial” take a different view of the evidence, simply because we read different books or used different dictionaries?

Consider also Wetzels and Wagenmakers' (2012) replacement of Jeffreys' “not worth more than a bare mention” with “anecdotal”. Anecdotal evidence has a specific meaning; it is not simply weak evidence. I could have a well-designed, perfectly controlled experiment that nonetheless produces weak evidence for differentiating between two hypotheses of interest. This does not mean that my evidence is anecdotal. Anecdotal evidence is second-hand evidence that has not been substantiated and does not derive from well-controlled experiments.

Here we see the major problem: the use of these verbal labels smuggles arbitrary meaning into the judgement where none is needed. These meanings differ across people, times, and fields. Using such labels adds unnecessary – and perhaps incorrect – baggage to the interpretation of the results of an experiment.

The second objection is that the evaluation of what is “strong” evidence depends on what is being studied, and how. Is 10 kilometers a long way? It is if you're walking, it isn't if you've just started a flight to Australia. In a sense, Bayes factors are the same way; if we're claiming something mundane and plausible, a Bayes factor of 10 may be more than enough. If we're claiming something novel and implausible, a Bayes factor of 10 may not even be a start. Extraordinary claims, as the saying goes, demand extraordinary evidence; we would not regard the same level of evidence as “strong” for all claims.
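
The arithmetic behind this objection is just that posterior odds equal prior odds times the Bayes factor. The sketch below applies the same Bayes factor of 10 to two claims; the prior odds (1:1 for the mundane claim, 1:1000 for the implausible one) and the function name are hypothetical values chosen only to illustrate the point.

```python
from fractions import Fraction


def posterior_odds(prior_odds: Fraction, bayes_factor: Fraction) -> Fraction:
    """Update prior odds by the evidence: posterior odds = prior odds * Bayes factor."""
    return prior_odds * bayes_factor


bayes_factor = Fraction(10)

# Hypothetical prior odds in favor of each claim (assumptions for illustration).
prior_odds_by_claim = {
    "mundane, plausible claim": Fraction(1, 1),
    "novel, implausible claim": Fraction(1, 1000),
}

for claim, prior in prior_odds_by_claim.items():
    post = posterior_odds(prior, bayes_factor)
    # Convert odds to a probability for readers who prefer that scale.
    probability = post / (1 + post)
    print(f"{claim}: prior odds {prior}, posterior odds {post}, "
          f"posterior probability {float(probability):.3f}")
```

Under these assumed priors, the same evidence leaves the mundane claim at odds of 10 to 1 in its favor, while the implausible claim is still at 100 to 1 against.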

The third objection is related to the second, and that is that providing evidential categories allows the researcher to shirk their responsibility for interpreting the strength of the evidence. We do not allow this in other settings. When reviewing a paper that contains evidence for some claim, for instance, it is our duty to evaluate the strength of the evidence in context. We do not, and cannot, demand from editors “standard” papers that we all agree are “strong” evidence; such standard papers do not exist. Providing normative guidelines such as “A Bayes factor of 15 is strong evidence,” though comforting, asks researchers to stop thinking in ways that we would not allow in other contexts. They impose an arbitrary, unjustified homogeneity on judgments of evidential strength.

Finally, a fourth objection is that verbal categories provide the illusion of understanding. Being able to say “A Bayes factor of 3 means that there is anecdotal evidence,” may give a researcher comfort, but does not ultimately show any understanding at all. This provides a dangerous fluency effect, because fluency has been consistently shown to cause people to misjudge their knowledge. Because categories are not actually necessary for the interpretation of Bayes factors, giving them illusory fluency using the labels is likely to hinder, not help, their understanding.

### Do people “need” categories?

All of the previous arguments may be admitted, and yet one might argue that they are substantially weakened by a single fact: that Bayes factors cannot be understood by researchers without them. People cannot think without categories, and so if we do not provide them, people will not be able to interpret Bayes factors.

I think this is self-evidently wrong. It at least requires some sort of evidence to back it up. The use of Bayes factors by researchers is in its early years, and we do not yet know how well people can interpret them in practice.

As evidence that the claim that people need categories for Bayes factors is wrong, one may point to other related quantities for which we do not provide verbal categories. The most obvious is probability. When we teach probability in class, we do not give guidelines about what is a “small” or “large” probability. If a student were to ask, we would naturally say “it depends.” A probability of 1% is small if it represents the probability that we will win a game of chess against an opponent; it is large if it is the probability that we will be killed in an accident tomorrow.

For other similar quantities, too, we do not offer verbal categories. Odds ratios and relative risk are closely related to the Bayes factor, and yet they are used by researchers all the time without the need for contextless categories.

It is often the case that students (or researchers) are unsure about probability. Although verbal categories are never (that I know of) advocated for alleviating misunderstandings or lack of certainty about probability – and rightly so – there are other ways of helping students understand probability. Gerd Gigerenzer's work, in particular, has shown that certain visualizations help students understand, and make use of, probabilities. A similar evidence-based tack can, and should, be taken with Bayes factors. We know a lot about how to teach people about probability, so we should apply that knowledge.

As argued previously, it is possible that through the illusion of fluency, categories may actually harm people's understanding. It would be better to address the root of the problem rather than providing quick fixes for people's discomfort with new methods. The quick fixes may actually backfire.

### Bayes factors as decision statistics

It has been suggested that cut-offs on the Bayes factors are sometimes useful; in particular, when used to stop collecting data. This is a completely different issue from the one addressed above. A rule for behavior does not need an interpretation, and furthermore, the interpretation of a Bayes factor does not depend on the stopping rule. Such a rule is merely a practicality, and there is nothing wrong with using such rules if they are needed.

As an example, I may have a rule for stopping eating, but this is a completely separate question from whether I would judge how much I ate to be “a lot”. I do not need the rule to say I ate a lot, and following such a rule does not make what I ate any more or any less. I might choose such a rule based on what I thought “a lot” was, but the concept of “a lot” is prior to the rule.

In the case of the Bayes factor, such a decision criterion is actually only useful in light of prior odds. We should choose such a criterion such that a Bayes factor that exceeds a particular threshold is likely to convince most people; that is, that it is large enough to overcome most people's biases. Bayes factors in research are used in arguments made for other researchers' benefit; if we end sampling before we have achieved a level of evidence that would overcome others' prior odds, then we have not done enough sampling. Convincing ourselves is not the goal of research, after all. This should make it obvious why even a rule for stopping depends on context, because the context helps us know what a useful amount of evidence is.
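
One way to see how a stopping threshold falls out of prior odds is to ask what Bayes factor would move each reader to some target level of posterior odds. The sketch below is only an illustration under stated assumptions: the three readers' prior odds, the target posterior odds of 3:1, and the function name are all hypothetical.

```python
from fractions import Fraction


def required_bayes_factor(prior_odds: Fraction, target_posterior_odds: Fraction) -> Fraction:
    """Bayes factor needed to move prior_odds to target_posterior_odds.

    Follows from posterior odds = prior odds * Bayes factor.
    """
    return target_posterior_odds / prior_odds


# Hypothetical audience: prior odds in favor of the effect (assumed values).
audience_prior_odds = [Fraction(1, 1), Fraction(1, 5), Fraction(1, 30)]

# Hypothetical target: every reader ends up at 3:1 or better in favor of the effect.
target_posterior_odds = Fraction(3, 1)

thresholds = [
    required_bayes_factor(prior, target_posterior_odds)
    for prior in audience_prior_odds
]

# To convince the whole assumed audience, the most skeptical reader sets the bar.
stopping_threshold = max(thresholds)

for prior, needed in zip(audience_prior_odds, thresholds):
    print(f"prior odds {prior}: needs a Bayes factor of at least {needed}")
print(f"stopping threshold for this audience: {stopping_threshold}")
```

For this assumed audience the required Bayes factors are 3, 15, and 90, so a threshold chosen without reference to who needs convincing could stop sampling far too early or far too late.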

It should also make clear that the Bayes factor is not really the useful decision statistic; rather, the posterior odds are. If an experiment is expensive but would not achieve the levels of evidence necessary to change people's minds, achieving a “strong” Bayes factor is irrelevant.

### Conclusion

This turned out to be a much lengthier post than I anticipated, but summarizing it is easy: although $p$ values need categories or criteria to be interpreted, Bayes factors do not. They have a natural interpretation that directly connects evidence with changes in odds. Furthermore, the use of verbal category labels for Bayes factors is misleading and potentially harmful to learners of Bayesian methods. Teachers of Bayesian statistics should focus on ways of visualizing Bayes factors to help people understand, rather than using the “short-cut” of verbal categories.

1. Maybe a stupid question, but why does the p-value not have a meaning without a label such as "statistically significant"? It tells you something about how often you get a sample like the one you have under the null hypothesis and repeated sampling. It might not make a lot of sense and be difficult to interpret, but it does not require a label. At least not more or less than Bayes factors for which I am inclined to agree that the categories are not needed.

1. I suppose that depends on what you mean by meaning. By meaning, what I meant is something beyond the definition of the number itself. When we use statistics, we want those statistics to have meaning aside from the definition, because it is that meaning that we connect to the statistical question at hand (e.g., are these means different).

In this sense, it is possible to *define* many meaningless statistics; I could, for instance, define and report exp(arcsine(sqrt(p))), but that would not have any meaning to anyone. The p value needs the criterion to achieve its meaning because as it turns out, it cannot be used as a formal measure of evidence (which is the meaning it is usually given). The Bayes factor, also, has both a definition (the ratio of marginal likelihoods) and a meaning in terms of a statistical question (the relative evidence between two models).

This is, of course, parsing words pretty finely, but this is what I meant by the passage to which you refer.

2. The words are parsed finely, but it makes sense and I agree (and scientists should take words seriously). Thanks for the clarification.

3. Bit late to the party here, but I've been thinking about a variation on your second objection, which I think is the most important objection. I fundamentally agree with you about labels, but this feels wrong in my gut. We want labels, regardless of whether we should want labels, because we want a grammar for explicitly articulating 'evidential value'. It seems problematic to make the evidential value of a Bayes factor relative to the claim, because it's difficult to quantify the "extraordinary-ness" of claims, and the meanings of both 'extraordinary' and 'mundane' are subject to your first objection.

Bayes factors are in the business of informing the relative evidence for different models, and while models and claims are very closely related, I wouldn't say a model is a claim and nothing but a claim, or vice versa. If we then instead say that the evidential value of a Bayes factor is relative to the model in the denominator (e.g., a BF of 100 against the Standard Model of physics is different from a BF of 100 against the Big Five) rather than the more qualitatively defined claim, it satisfies the Bayesian intuition and still allows us to articulate the evidential value.

I realize it's a very small change and probably more nitpicking than anything, but it's just something that's been banging around my head.

4. Hi Richard,

I really like this post. Very clear about Fisher, Neyman-Pearson, and Bayesian approaches. I wish I had read it before I wrote my blog.

https://replicationindex.wordpress.com/2015/04/30/replacing-p-values-with-bayes-factors-a-miracle-cure-for-the-replicability-crisis-in-psychological-science/

Nevertheless, I came to the same conclusion that Bayes-Factors are like p-values in that they provide quantitative information that does not lead to qualitatively different inferences. To move from quantitative information to qualitative inferences (true/false), it is necessary to have a decision criterion that is necessarily arbitrary and implies error rates. Neyman-Pearson are the only ones who worked out a theory that provides this information. My question is whether Bayesian statistics has a theory that leads to inferences and a framework to examine error rates. Without such a theory, researchers will fall back on Neyman-Pearson and use Bayes-Factors like p-values. However, Bayes-Factors may not be ideal for this purpose.

1. I think that, ideally, researchers will abstain from Neyman-Pearson thinking and not use Bayes factors like p-values, but rather use Bayes factors like Bayes factors.

The ideal scenario, for me, is that someone publishes a 3:1 Bayes factor. Dr. Ambivalent upgrades her beliefs from 1:1 to 3:1. Dr. Gregarious upgrades beliefs from 10:1 to 30:1. Dr. Skeptic upgrades beliefs from 1:30 to 1:10. They may not all place the same amount of belief in the effect, but they can agree on the strength of evidence and upgrade accordingly.

The trickier point, and the one that I think people really have in mind when asking for a cutoff, is scientific publishing. Is there a Bayes factor that marks some necessary criterion to be "good enough" to publish? I don't think there should be. If the researchers have designed a good experiment and done their best to collect a decent sample size, I think publication is in order regardless of the obtained Bayes factor. The obtained strength of evidence depends on things beyond the experimenters' control (but bless them if they collect more data to try to get stronger evidence!).
