[This is a revised version of a previous post which is superseded by this post.
This revision has corrected derivations, new R/JAGS code, and new diagrams.]
Overview

"Captain, the prior probability of this
character dying and leaving the show
is infinitesimal." 
A primary example of Bayes' rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here's one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is
not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes' rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.
The ratio of the detection and falsealarm rates is the
Bayes factor, hence the lesson of the primary example is: Do not use the Bayes factor alone to make decisions; instead always incorporate the prior probabilities of the options. Unfortunately, this lesson is often forgotten or at least deemphasized in general modelselection settings. In modelselection settings, some people compute the Bayes factor and use it alone as the basis for model selection. The tacit assumption is that the models all have equal prior probabilities. But this assumption of equal prior probabilities seems implausible in most situations, and should be explicitly justified. If we should not rely on the Bayes factor alone to make decisions in disease diagnosis or drug testing, why should we rely on the Bayes factor alone to make decisions in other modelselection applications?
In this post
 I explain the role of the Bayes factor in the simple diseasediagnosis situation. (This also applies to random drug screening.)
 Then I extend to the situation in which the prior probabilities are not certain. I show that in the limit, with utter uncertainty in the prior probabilities, the posterior odds fade to match the Bayes factor. But even with some modest knowledge of the base rates, the posterior odds can be far from the Bayes factor.
 Hierarchical diagrams are provided to illustrate the model comparisons.
 JAGS code is provided for handson experience.
Conclusion: When deciding between models, the decision should not be based on the Bayes factor alone.
Bayes' rule applied to disease diagnosis
A person is randomly selected from a population in which the prior probability of having a particular disease is 1%. Denote the condition of the person by the variable \(M\), which can have two values, either \(M\!\!=\!\!1\) for having the disease or \(M\!\!=\!\!2\) for not having the disease. The prior probability can be written as \(p(M\!\!=\!\!1) = 0.01\), and hence \(p(M\!\!=\!\!2) = 0.99\).
A diagnostic test for the disease can show a "positive" result suggesting that the disease is present, or a "negative" result suggesting that the disease is absent. The result of the test is the datum, denoted \(D\), which can have two values: \(D\!\!=\!\!1\) for "positive" or \(D\!\!=\!\!0\) for "negative". The correctdetection rate (a.k.a. sensitivity) of the test is 97%, meaning \(p( D\!\!=\!\!1  M\!\!=\!\!1 ) = 0.97\). The falsealarm rate of the test is 5%, meaning \(p( D\!\!=\!\!1  M\!\!=\!\!2 ) = 0.05\). The complement of the false alarm rate is called the specificity of the test, which in this example is 95%.
Applying Bayes' rule yields: \[ p( M\!\!=\!\!1 D\!\!=\!\!1 ) = p( D\!\!=\!\!1  M\!\!=\!\!1 ) p( M\!\!=\!\!1 ) / p( D\!\!=\!\!1 ) \] and \[ p( M\!\!=\!\!2 D\!\!=\!\!1 ) = p( D\!\!=\!\!1  M\!\!=\!\!2 ) p( M\!\!=\!\!2 ) / p( D\!\!=\!\!1 ) \] In the above two equations, the denominator \( p( D\!\!=\!\!1 ) \) is the same, computed as \( p( D\!\!=\!\!1 ) = \sum_m p( D\!\!=\!\!1  M\!\!=\!\!m ) p( M\!\!=\!\!m ) \). The ratio of the two equations is \begin{align} \underbrace{\frac{p( M\!\!=\!\!1  D\!\!=\!\!1 )}{p( M\!\!=\!\!2  D\!\!=\!\!1 ) } }_{\mbox{posterior odds}} &= \underbrace{\frac{p( D\!\!=\!\!1  M\!\!=\!\!1 )}{p( D\!\!=\!\!1  M\!\!=\!\!2 )} }_{\mbox{Bayes factor}} \times \underbrace{\frac{p( M\!\!=\!\!1 ) }{ p( M\!\!=\!\!2 )}}_{\mbox{prior odds}} \label{eq:BF}\end{align} Notice from Equation \ref{eq:BF} that the Bayes factor is the ratio of the detection rate to the false alarm rate (when the datum is \(D\!\!=\!\!1\)). The Bayes factor converts the prior odds (i.e., the ratio of prior probabilities) to the posterior odds (i.e., the ratio of posterior probabilities).
When we plug in the numerical values of the example, we get \[ \underbrace{\frac{0.164}{0.836}}_{\mbox{posterior odds}} = \underbrace{\frac{0.97}{0.05}}_{\mbox{Bayes factor}} \times \underbrace{\frac{0.01}{0.99}}_{\mbox{prior odds}} \] or \[ \underbrace{0.196}_{\mbox{posterior odds}} = \underbrace{19.4}_{\mbox{Bayes factor}} \times \underbrace{0.0102}_{\mbox{prior odds}}\] (the displayed values are rounded, whereas the underlying calculations retained many decimal places). Thus, even though the Bayes factor is 19.4 in favor of having the disease, the posterior probability of having the disease is only 0.164 because the prior probability was 0.01.
The reallife lesson of these sorts of examples is that when doing random disease screening or random drug testing, it is not appropriate to use the Bayes factor alone to interpret the result of the diagnostic test. The prior probabilities of the conditions (disease or healthy, using drug or not) must be factored in, and the decision should be based on the posterior probability.
Disease diagnosis is model comparison
The case of disease diagnosis is just a case of model comparison. In model comparison, we have some data and we infer the posterior probabilities of the models given the data. For disease diagnosis, the data consist of a dichotomous value, namely \(D\!\!=\!\!1\) or \(D\!\!=\!\!0\), and the two models are merely two possible underlying states of the world, namely \(M\!\!=\!\!1\) or \(M\!\!=\!\!2\). Disease diagnosis is simply asking which model, disease or healthy, is a better model of the outcome of the diagnostic test.
The moral of the diseasediagnosis example that the Bayes factor alone is not appropriate for selecting the model applies to model comparison generally. Model comparison should be based on the posterior probabilities of the models.

Figure 1: Disease diagnosis as model comparison. At bottom, D is the datum from the diagnostic test, with D=1 denoting "positive" result and D=0 denoting negative result. At left, model M=1 represents disease present. The right bar of its Bernoulli distribution over D=1 has height
labeled by p(D=1M=1), which is the correctdetection rate. At right, model M=2 represents disease absent.The right bar of its Bernoulli distribution over D=1 has height labeled by p(D=1M=2) which
is the falsealarm rate. At the top of the diagram is the base rate of
the diseases. The prior probability of having the disease, M=1, is \(\pi_1\). 
Figure 1 shows disease diagnosis as model comparison analogous to the diagrams in Figures 10.1 and 10.2 of DBDA2E (pp. 267 and 269). The caption provides a description. Essentially, the model for "has disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1M\!\!=\!\!1) ) \) and the model for "no disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1M\!\!=\!\!2) ) \). The prior probability of the model indices is \(M \sim \mbox{dcat}( \pi_1 , 1\!\!\pi_1 )\).
Decision rule in model comparison sometimes uses only the Bayes factor, but should not
One practice in model comparison is to base decisions on the Bayes factor, forgetting about or at least deemphasizing the prior probabilities (and hence the posterior probabilities) of the models. A typical decision rule is,
If the Bayes factor is >3 in favor of one model relative to the other, accept the better model and reject the other.
Applying the Bayesfactoronly decision rule to disease diagnosis would lead us to decide that a person has the disease whenever the diagnostic test is positive because the Bayes factor is 19.4 which is much greater than 3. Using only the Bayes factor for decision making, instead of the posterior probabilities of the models, ignores Bayes' rule and is not Bayesian despite the terminology. A Bayes factor is only Bayesian in the full context of Bayes' rule, which converts prior probabilities to posterior probabilities. (Note that I am not concerned with the value 3 in the decision rule. The decision threshold could be 3 or 10 or whatever. Rather, the problem is not taking into account the prior probabilities of the models.)
The usual justification for using the Bayes factor alone is two fold. First, if the prior probabilities of the models are all equal, then the posterior odds algebraically equal the Bayes factor. (You can see that for yourself from Equation \ref{eq:BF}, above.) Setting the prior probabilities to be equal seems to be an innocuous default for many modelcomparison settings, even if it's an absurd assumption for random screening of diseases or drug usage. Second, even if unequal prior probabilities for the models are assumed, the Bayes factor tells us how to modify those prior odds to arrive at the corresponding posterior odds. Therefore it is informative to report the Bayes factor.
The two justifications must be used carefully. First, I doubt that most modelcomparison settings really are well represented by assuming equal prior probabilities of the models. Setting the prior probabilities equal to each other is an assertion of equality, not an expression of uncertainty; the next section discusses uncertainty. Setting the prior probabilities to 0.50/0.50 should be based on as much prior information as setting the prior probabilities to 0.01/0.99. Second, even if the Bayes factor is reported to summarize the analysis, the Bayes factor should not be the basis for decision making. The decision should be based on the posterior probabilities of the models.
Modeling uncertainty in the prior probabilities of the models
The usual formal setup, as in the equations above, assumes specific exact values for the prior probabilities of the models. That is to say, \(p(M\!\!=\!\!m)\) has a specific exact value such as \(p(M\!\!=\!\!1) = 0.01\) for the "has disease" model. But what if we are uncertain about the prior probabilities of the models? Modeling uncertainty is exactly what Bayesian analysis is all about.
In a previous version of this post, I made a mathematical derivation that seemed reasonable but which did not match my intuition for how the answer should play out. I asked readers to let me know if they could point out errors in the derivation. In a series of emails, Tomasz Miąsko (tomasz.miasko@gmail.com) patiently helped me to reconceptualize the derivation, and he also provided me with a simple JAGS script to verify the result. Thank you, Tomasz! I finally have clarity and here I present an elaborated derivation and JAGS script.
Denote \(p(M\!\!=\!\!1)\) by the parameter \(\pi_1\), which can take on any value from 0 to 1. Of course, \(p(M\!\!=\!\!2) = 1\pi_1\). We express our uncertainty in \(\pi_1\) as a probability distribution \(p(\pi_1)\). For example, \(p(\pi_1)\) could be a beta distribution that has mode \(\omega\) and concentration \(\kappa\). (Hence the shape parameters for the beta distribution are \(a=\omega (\kappa2) + 1\) and \(b= (1\omega) (\kappa2) + 1\).) If we are extremely certain about the prior probability of the model, then \(\kappa\) has a very large value, meaning that the prior distribution on \(\pi_1\) is tightly concentrated at \(\omega\), and the analysis converges to the point value assumed in the equations above. If we are uncertain about the prior probability of the model, then \(\kappa\) is a smallish value, with \(\kappa \ge 2\). When \(\kappa=2\) the prior is a uniform distribution and we have "utter" uncertainty in the value of \(\pi_2\). Figure 2 shows a diagram of the model.

Figure 2: Disease diagnosis with uncertainty in the base rates of the diseases. See Figure 1 for description of the lower part of the diagram. Near the top of the diagram is the base rate of
the diseases, where the prior probability of having the disease is denoted
\(\pi_1\). There is uncertainty in the value of \(\pi_1\) represented by the beta distribution at the top of the diagram. The beta distribution has mode \(\omega\) and concentration \(\kappa\). 
Notice that the overall model involves two parameters: \(M\) and \(\pi_1\). Consider Bayes' rule for the twoparameter space (as pointed out to me by Tomasz Miasko):
\begin{align}
p( M , \pi_1  D )
&= p( D  M , \pi_1 ) p( M , \pi_1 ) / p^*(D)
\notag
\\
&= p( D  M ) p( M  \pi_1 ) p( \pi_1 ) / p^*(D)
\label{eq:pMpi1gD}
\end{align}
where, and this is crucial, \(p^*(D)\) is the double integral over the twoparameter space:
\begin{align}
p^*(D)
&=
\sum_m \int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D  M ) p( M  \pi_1 ) p( \pi_1 )
\label{eq:pD2param}
\end{align}
Notice that \(p^*(D)\) is a constant, not a function of \(\pi_1\).
To compute the posterior probability of \(M\) given data \(D\), we marginalize the twoparameter posterior of Equation \ref{eq:pMpi1gD}:
\begin{align}
p(MD)
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( M , \pi_1  D )
\notag
\\
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D  M ) p( M  \pi_1 ) p( \pi_1 ) / p^*(D)
\notag
\\
&=
\frac{ p( D  M )}{ p^*(D) } \int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( M  \pi_1 ) p( \pi_1 )
\label{eq:pMgD2param}
\end{align}
Notice in Equation \ref{eq:pMgD2param} (above) that \(p^*(D)\) was pulled in front of the integral sign, which is valid because \(p^*(D)\) is a constant. For the two models, Equation \ref{eq:pMgD2param} becomes
\begin{align}
p(M=1D)
&=
\frac{ p( D  M=1 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 \pi_1 p( \pi_1 )
\notag
\\
&\mbox{and}
\notag
\\
p(M=2D)
&=
\frac{ p( D  M=2 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 (1\pi_1) p( \pi_1 )
\notag
\end{align}
The posterior odds are therefore
\begin{align}
\frac{ p(M=1D) }{ p(M=2D) }
&=
\frac{ p( D  M=1 ) }{ p( D  M=2 ) }
\times
\frac{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1 \pi_1 p( \pi_1 ) }{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1 (1\pi_1) p( \pi_1 )}
\label{eq:postOdds2param}
\end{align}
In particular, when the distribution \(p( \pi_1 )\) is uniform with \(p( \pi_1 )=1.0\), the prior odds in Equation \ref{eq:postOdds2param} are 1/1. Thus, when there is complete uncertainty on \(\pi_1\) as expressed by a uniform distribution, then the posterior odds match the Bayes factor.
[What was wrong with the derivation of the previous post superseded by this post? The previous derivation began with the wrong conditional, but I didn't realize it because the notation suppressed the conditioned variable and hid the problem. The previous derivation started with \(p(M=1D)\) without explicitly noting that it was conditional on \(\pi_1\). Therefore it was easy to think, in fact difficult not to think, that the derivation began with \(p(M=1,\pi_1D)\) when in fact it began with \(p(M=1D,\pi_1)\). Thus, when integrating over \(\pi_1\), the result is \(\int d\pi_1 \, p(M=1D,\pi_1) \) and not the desired \(\int d\pi_1 \, p(M=1,\pi_1D) \). Thus, the derivation of the previous post derived some strange quantity and not the posterior probability of \(M\).]
Numerical results from R and JAGS
Here are some results from the R script for JAGS appended below. In all cases, the mode of the prior on \(\pi_1\) is \(\omega=0.01\).
When the prior on \(\pi_1\) is extremely concentrated around \(\omega=0.01\), with \(\kappa=20000\), a run of JAGS produces
P(M=1D)=0.165; P(M=2D)=0.835; P(M=1D)/P(M=2D)=0.198
This result essentially matches the numerical result in the opening paragraphs of this post, which assumed a singular value for \(\pi_1\) at 0.01.
When the prior on \(\pi_1\) is only modestly concentrated around \(\omega=0.01\), with \(\kappa=20\), a run of JAGS produces
P(M=1D)=0.547; P(M=2D)=0.453; P(M=1D)/P(M=2D)=1.21
Notice that the posterior odds are now slightly in favor of having the disease, but are far from the Bayes factor of 19.4.
When the prior on \(\pi_1\) is weakly concentrated around \(\omega=0.01\), with \(\kappa=10\), a run of JAGS produces
P(M=1D)=0.701; P(M=2D)=0.299; P(M=1D)/P(M=2D)=2.34
Notice that the posterior odds are leaning in favor of having the disease, but have still not exceeded a ratio of 3, and are still far from the Bayes factor of 19.4.
When the prior on \(\pi_1\) is completely uncertain around \(\omega=0.01\), that is, a uniform distribution with \(\kappa=2\), a run of JAGS produces
P(M=1D)=0.951; P(M=2D)=0.0493; P(M=1D)/P(M=2D)=19.3
Here, with utter uncertainty in the prior probabilities of the models, the posterior odds match the Bayes
factor, as was commented just after Equation \ref{eq:postOdds2param}.
Recap: The posterior odds of the models are not the same thing as the Bayes factor of the models. Decisions should be based on the posterior probabilities. The Bayes factor does not take into account the prior probabilities of the models (and despite its name the Bayes factor by itself does not even use Bayes' rule). The derivations and numerical examples above have shown that even with a lot of uncertainty in the prior probabilities on models, the posterior odds can be quite different than the Bayes factor.
The posterior odds numerically equal the Bayes factor when (i) there is utter uniform uncertainty in the prior probabilities of the models or (ii) when there is complete prior knowledge that the models have exactly equal prior probability. I suspect that neither of those conditions occurs in reality. [Revised 28Dec2015, 9:35am] The posterior odds numerically equal the Bayes factor when (as a broad sufficient but not necessary condition) the uncertainty of the model prior probability is symmetric and
known to have a mode at exactly 0.5, which I suspect rarely occurs in reality.
The R and JAGS script:
# Original script by Tomasz Miasko 24Dec2015 tomasz.miasko@gmail.com
# Elaborated by John Kruschke 26Dec2015 johnkruschke@gmail.com
rm(list=ls())
library(rjags)
# JAGS model for disease diagnosis example, 27December2015,
# http://doingbayesiandataanalysis.blogspot.com/
# Datum D is outcome of diagnostic test(s) for a single individual.
# D=1 suggests disease is present, D=0 suggests disease is absent.
# pD1gM[m] is probability of D=1 given model index m.
# m=1 means disease is truly present, m=2 means disease is truly absent.
modelString = "
model {
for ( i in 1:length(D) ) {
D[i] ~ dbern( pD1gM[m] ) # pD1gM[1:2] are constants provided in data list.
}
m ~ dcat( pM[1:2] ) # model index probability.
pM[1] ~ dbeta( omega*(kappa2)+1 ,
(1omega)*(kappa2)+1 ) # prior probability for m=1
pM[2] < 1pM[1] # prior probability for m=2
# omega and kappa are constants provided in data list.
}
"
# Specify constants:
# The result of the diagnostic test(s) for a single individual:
D = c(rep(1,1),rep(0,0)) # use D=NA for prior
# Probability D=1 for m=1,2; i.e.,
# correct detection rate and false alarm rate:
pD1gM = c(0.97,0.05) # high value for disease present, low value for absent
# Modal base rate of disease present:
omega = 0.01
# Certainty of modal base rate (kappa >= 2):
kappa = c(2,20,20000)[3]
model = jags.model(
file=textConnection(modelString),
data=list( D=D , pD1gM=pD1gM , omega=omega , kappa=kappa ),
n.chains=4)
update( object=model, n.iter=1000 )
samples = coda.samples(
model=model,
variable.names=c("m","pM"),
n.iter=125000) # total MCMC steps = n.chains*n.iter
chain = as.matrix(samples)
pM1gD = sum(chain[,"m"]==1)/length(chain[,"m"])
pM2gD = sum(chain[,"m"]==2)/length(chain[,"m"])
print( paste0( "For D = ", sum(D) , " 1's and " , length(D)sum(D) , " 0's" ,
": P(M=1D)=" , signif(pM1gD,3) ,
"; P(M=2D)=" , signif(pM2gD,3) ,
"; P(M=1D)/P(M=2D)=", signif(pM1gD/pM2gD,3) ) )
hist( chain[,"pM[1]"] , breaks=51 ,
main=bquote("p( pM[1]  D=" *.(sum(D))* " 1's and "
*.(length(D)sum(D))* " 0's )" ) ,
col="grey" , border="grey" , xlim=c(0,1) )