Sunday, February 19, 2017

Interpreting Bayesian posterior distribution of a parameter: Is density meaningful?


Background: Suppose a researcher is interested in the Bayesian posterior distribution of a parameter because the parameter is directly meaningful in the research domain. This occurs, for example, in psychometrics. Specifically, in item response theory (IRT; for details and an example of Bayesian IRT see this blog post), the data from many test questions (i.e., the items) and many people yield estimates of the difficulties \(\delta_i\) and discriminations \(\gamma_i\) of the items along with the abilities \(\alpha_p\) of the people. That is, the item difficulty is a parameter \(\delta_i\), and the analyst is specifically interested in the magnitude and uncertainty of each item's difficulty. The same is true for the other parameters: the analyst is specifically interested in the magnitude and uncertainty of the discrimination \(\gamma_i\) of every item and of the ability \(\alpha_p\) of every person.

The question: How should the posterior distribution of a meaningful parameter be summarized? We want a number that represents the central tendency of the (posterior) distribution, and numbers that indicate the uncertainty of the distribution. There are two options I'm considering, one based on densities, the other based on percentiles.

Densities. One way of conveying a summary of the posterior distribution is in terms of densities. This seems to be the most intuitive summary, as it directly answers the natural questions from the researcher:
  • Question: Based on the data, what is the most credible parameter value? Answer: The modal (highest density) value. For example, we ask: Based on the data, what is the most credible value for this item's difficulty \(\delta_i\)? Answer: The mode of the posterior is 64.5.
  • Question: Based on the data, what is the range of the 95% (say) most credible values? Answer: The 95% highest density interval (HDI). For example, we ask: Based on the data, what is the range of the 95% most credible values of \(\delta_i\)? Answer: 51.5 to 75.6.
Percentiles. A different way of conveying a summary of the posterior distribution is in terms of percentiles. The central tendency is reported as the 50th percentile (i.e., the median), and the range of uncertainty (that covers 95% of the distribution) is the equal-tailed credible interval from the 2.5 %ile to the 97.5 %ile. When using percentiles, densities are irrelevant, and the shape of the distribution is irrelevant.
An illustration from DBDA2E showing how highest-density intervals and equal-tailed intervals (based on percentiles) are not necessarily equivalent.
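As a text-only companion to that illustration, here is a minimal sketch (Python, not from DBDA2E; the gamma-distributed samples are just an arbitrary stand-in for a skewed posterior) that computes both kinds of summary from the same set of posterior-like samples.

    import numpy as np

    def hdi(samples, cred_mass=0.95):
        """Narrowest interval containing cred_mass of the samples (for a unimodal distribution)."""
        x = np.sort(samples)
        n_in = int(np.ceil(cred_mass * len(x)))
        widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
        lo = int(np.argmin(widths))
        return x[lo], x[lo + n_in - 1]

    rng = np.random.default_rng(1)
    post = rng.gamma(shape=2.0, scale=10.0, size=100_000)   # skewed stand-in for posterior samples

    # Density-based summary: mode (via a coarse histogram) and 95% HDI.
    counts, edges = np.histogram(post, bins=200)
    peak = int(np.argmax(counts))
    mode = 0.5 * (edges[peak] + edges[peak + 1])
    hdi_lo, hdi_hi = hdi(post)

    # Percentile-based summary: median and 95% equal-tailed interval.
    median = np.percentile(post, 50)
    eti_lo, eti_hi = np.percentile(post, [2.5, 97.5])

    print(f"mode   = {mode:.1f},  95% HDI = [{hdi_lo:.1f}, {hdi_hi:.1f}]")
    print(f"median = {median:.1f},  95% ETI = [{eti_lo:.1f}, {eti_hi:.1f}]")
    # For a skewed distribution the two summaries differ: the HDI hugs the mode,
    # while the equal-tailed interval reaches further into the long right tail.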


Some pros and cons:

Density answers what the researcher wants to know: What is the most credible value of the parameter, and what is the range of credible (i.e., high density) values? Those questions simply are not answered by percentiles. On the other hand, density is not invariant under non-linear (but monotonic) transformations of the parameter. By squeezing or stretching different regions of the parameter, the densities can change dramatically, whereas the percentiles stay the same (on the transformed scale). This lack of transformation invariance is the key reason that analysts avoid using densities in abstract, generic models and derivations.
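A quick way to see the transformation issue is to push posterior samples through a monotonic transformation and compare the summaries. The sketch below (Python, illustrative only; the gamma samples and the log transformation are my arbitrary choices) shows that percentile-based summaries simply map through the transformation, whereas the HDI of the transformed parameter is generally not the transformed HDI of the original parameter.

    import numpy as np

    def hdi(samples, cred_mass=0.95):
        x = np.sort(samples)
        n_in = int(np.ceil(cred_mass * len(x)))
        widths = x[n_in - 1:] - x[:len(x) - n_in + 1]
        lo = int(np.argmin(widths))
        return x[lo], x[lo + n_in - 1]

    rng = np.random.default_rng(2)
    theta = rng.gamma(shape=2.0, scale=10.0, size=100_000)  # samples of the original parameter
    phi = np.log(theta)                                     # a monotonic re-expression of it

    # Percentiles commute with the transformation (up to tiny interpolation error):
    print(np.log(np.percentile(theta, [2.5, 50, 97.5])))    # log of theta's percentiles
    print(np.percentile(phi, [2.5, 50, 97.5]))              # percentiles of log(theta): same

    # The HDI does not commute: the log of theta's HDI is not the HDI of log(theta).
    print(np.log(np.array(hdi(theta))))
    print(np.array(hdi(phi)))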

But in applications where the parameters have meaningful interpretations, I don't think researchers are satisfied with percentiles. If you told a researcher, "Well, we cannot tell you what the most probable parameter value is; all we can tell you is the median (50 %ile)," I don't think the researcher would be satisfied. If you told the researcher, "We can tell you that 30% of the posterior falls below this 30th %ile, but we cannot tell you whether values below the 30th %ile have lower or higher probability density than values above the 30th %ile," I don't think the researcher would be satisfied. Lots of parameters in traditional psychometric models have meaningful scales (and aren't arbitrarily non-linearly transformed). Lots of parameters in conventional models have scales that directly map onto the data scales, for example the mean and standard deviation of a normal model (and the data scales are usually conventional and aren't arbitrarily non-linearly transformed). And in spatial or temporal models, many parameters directly correspond to space and time, which (in most terrestrial applications) are not non-linearly transformed.

Decision theory to the rescue? I know there is not a uniquely "correct" answer to this question. I suspect that the pros and cons could be formalized as cost functions in formal decision theory, and then an answer would emerge depending on the utilities assigned to density and transformation invariance. If the cost function depends on densities, then the mode and HDI would emerge as the better basis for decisions. If the cost function depends on transformation invariance, then the median and equal-tailed interval would emerge as the better basis for decisions.

What do you think?

Thursday, February 16, 2017

Equivalence testing (two one-sided tests) and NHST compared with HDI and ROPE


In this blog post I show that frequentist equivalence testing (using the two one-sided tests procedure: TOST) combined with null hypothesis significance testing (NHST) can produce conflicting decisions about the same parameter value; that is, TOST can accept a value while NHST rejects that same value. The Bayesian procedure using the highest density interval (HDI) with a region of practical equivalence (ROPE) never produces that conflict.

The Bayesian HDI+ROPE decision rule.

For a review of the HDI+ROPE decision rule, see this blog post and specifically this picture in that blog post. To summarize:
  • A parameter value is rejected when its ROPE falls entirely outside the (95%) HDI. To "reject" a parameter value merely means that all the most credible parameter values are not practically equivalent to the rejected value. For a parameter value to be "rejected", it is not merely outside the HDI!
  • A parameter value is accepted when its ROPE completely contains the (95%) HDI. To "accept" a parameter value merely means that all the most credible parameter values are practically equivalent to the accepted value. For a parameter value to be "accepted", it is not merely inside the HDI! In fact, parameter values can be "accepted" that are outside the HDI, because reject or accept depends on the ROPE.
The frequentist TOST procedure for equivalence testing.

In the frequentist TOST procedure, the analyst sets up a ROPE and does a one-sided test for being below the high limit of the ROPE and another one-sided test for being above the low limit of the ROPE. If both one-sided tests are significant (i.e., the estimate is significantly below the high limit and significantly above the low limit), the parameter value is "accepted". The TOST is the same as checking that the 90% (not 95%) confidence interval falls inside the ROPE. The TOST procedure is used to decide on equivalence to a ROPE'd parameter value.

To reject a parameter value, the frequentist uses good ol' NHST. In other words, if the parameter value falls outside the 95% CI, it is rejected.
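To make the two frequentist decisions and the Bayesian decision concrete, here is a small sketch (Python; the interval endpoints and the candidate value are hypothetical numbers I made up, chosen so that a TOST/NHST conflict occurs, as in Example 3 below) of the three rules just described.

    def nhst_decision(value, ci95):
        """Classic NHST: reject a candidate value if it falls outside the 95% CI."""
        return "reject" if (value < ci95[0] or value > ci95[1]) else "not rejected"

    def tost_decision(rope, ci90):
        """TOST equivalence test: 'accept' if the 90% CI falls entirely inside the ROPE."""
        return "accept" if (ci90[0] >= rope[0] and ci90[1] <= rope[1]) else "not accepted"

    def hdi_rope_decision(rope, hdi95):
        """HDI+ROPE rule: reject if ROPE and HDI do not overlap; accept if ROPE contains HDI."""
        if rope[1] < hdi95[0] or rope[0] > hdi95[1]:
            return "reject"
        if rope[0] <= hdi95[0] and hdi95[1] <= rope[1]:
            return "accept"
        return "undecided"

    # Hypothetical numbers, chosen so that TOST and NHST conflict:
    value = 0.305                                   # candidate parameter value being assessed
    rope = (value - 0.1, value + 0.1)               # ROPE of +/- 0.1 around that value
    ci95 = hdi95 = (0.20, 0.30)                     # 95% CI arbitrarily set equal to the 95% HDI
    ci90 = (0.21, 0.29)                             # narrower 90% CI used by TOST

    print("NHST:    ", nhst_decision(value, ci95))      # reject  (0.305 lies outside the 95% CI)
    print("TOST:    ", tost_decision(rope, ci90))       # accept  (90% CI lies inside the ROPE) -> conflict
    print("HDI+ROPE:", hdi_rope_decision(rope, hdi95))  # undecided (ROPE overlaps but does not contain the HDI)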

Examples comparing TOST+NHST with HDI+ROPE.

Examples below show the ranges of parameter values rejected, accepted, undecided, or conflicting (both rejected and accepted) by the two procedures.
  • In these cases the ROPE is symmetric around its parameter value, with ROPE limits at \(-0.1\) and \(+0.1\) relative to the central value. These are merely default ROPE limits we might use if the parameter were the effect-size Cohen's \(\delta\), because \(0.1\) is half of a "small" effect size. In general, ROPE limits could be asymmetric and should be chosen in the context of current theory and measurement abilities. The key point is that the ROPE is the same for the TOST and HDI procedures in all the examples.
  • In all the examples, the 95% HDI and the 95% CI are arbitrarily set equal to each other, merely for simplicity of comparison. In general, the 95% HDI and 95% CI will not be equal, especially when the CI is corrected for multiple tests or correctly computed for stopping rules that do not assume a fixed sample size. The 90% CI is set to 0.83 times the width of the 95% CI, as it would be (approximately) for a normal distribution. The exact numerical value does not matter for the qualitative results.
For each example, plots show the parameter values that would be accepted, rejected, undecided, or conflicting: both accepted and rejected. The horizontal axis is the parameter value. The horizontal black bars indicate the CI or HDI. The vertical axis indicates the different decisions.
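The plots themselves are images in the original post. As a rough text-only analogue, the following sketch (Python; the interval endpoints are made-up values following the stated setup, with the 95% CI set equal to the 95% HDI and the 90% CI at 0.83 times its width) sweeps candidate parameter values across the axis and classifies each one under both procedures, here with intervals much narrower than the ROPE as in Example 3.

    import numpy as np

    rope_half_width = 0.1
    ci95 = hdi95 = (-0.02, 0.02)                    # made-up 95% CI, set equal to the 95% HDI
    center = (ci95[0] + ci95[1]) / 2
    half90 = 0.83 * (ci95[1] - ci95[0]) / 2         # 90% CI at 0.83 times the 95% width
    ci90 = (center - half90, center + half90)

    def classify(value):
        """Return the TOST+NHST decision and the HDI+ROPE decision for one candidate value."""
        rope = (value - rope_half_width, value + rope_half_width)
        nhst_reject = value < ci95[0] or value > ci95[1]
        tost_accept = ci90[0] >= rope[0] and ci90[1] <= rope[1]
        if nhst_reject and tost_accept:
            freq = "CONFLICT (accept & reject)"
        elif tost_accept:
            freq = "accept"
        elif nhst_reject:
            freq = "reject"
        else:
            freq = "undecided"
        if rope[1] < hdi95[0] or rope[0] > hdi95[1]:
            bayes = "reject"       # ROPE entirely outside the HDI
        elif rope[0] <= hdi95[0] and hdi95[1] <= rope[1]:
            bayes = "accept"       # ROPE entirely contains the HDI
        else:
            bayes = "undecided"
        return freq, bayes

    for raw in np.arange(-0.30, 0.31, 0.05):
        v = round(float(raw), 2)
        freq, bayes = classify(v)
        print(f"value {v:+.2f}:  TOST+NHST -> {freq:26s}  HDI+ROPE -> {bayes}")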

Example 1: HDI and 90% CI are wider than the ROPE.

In the first example below, the HDI and 90% CI are wider than the ROPE. Therefore no parameter values can be accepted, because the ROPE, no matter what parameter value it is centered on, can never contain the HDI or 90% CI.
Example 1.
Notice above that NHST rejects all parameter values outside the 95% CI, whereas the HDI+ROPE procedure only rejects parameter values that are a half-ROPE away from the 95% HDI. All other parameter values are undecided (neither rejected nor accepted).

Example 2: HDI and 90% CI are a little narrower than the ROPE.

In the next example, below, the HDI and 90% CI are a little narrower than the ROPE. Therefore there are some parameter values whose ROPEs contain the HDI or 90% CI, and those values are "accepted". Notice that the TOST procedure accepts a wider range of parameter values than the HDI+ROPE decision rule, because the 90% CI is narrower than the 95% HDI (which in these examples is arbitrarily set equal to the 95% CI).
Example 2.
Notice above that the TOST procedure for equivalence produces qualitatively similar results to the HDI+ROPE decision rule, but the TOST procedure accepts a wider range of parameter values because its 90% CI is narrower (i.e., less conservative) than the 95% HDI.

Example 3: HDI and 90% CI are much narrower than the ROPE.

The third example, below, has the HDI and CI considerably narrower than the ROPE. This situation might arise when there is a windfall of data, giving high-precision estimates combined with a lenient tolerance for "practical equivalence". Notice that this leads to conflicting decisions from TOST and NHST: there are parameter values that are both accepted by TOST and rejected by NHST. Conflicts like this cannot happen when using the HDI+ROPE decision rule.

Example 3
Notice above that some parameter values are "accepted" for practical purposes even though they fall outside the HDI or CI. This points out the very different meanings of the discrete decision and of the HDI or CI. The Bayesian HDI marks the most credible parameter values, but the green "accept" interval marks all the parameter values to which the most credible values are practically equivalent. The frequentist CI marks the parameter values that would not be rejected, but the green "accept" interval marks all the parameter values that pass the TOST procedure.

The comparison illustrated here was inspired by a recent blog post by Daniel Lakens, which emphasized the similarity of results using TOST and HDI+ROPE. Here I've tried to illustrate at least one aspect of their different meanings.

Wednesday, February 8, 2017

The Bayesian New Statistics - finally published


The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective.

Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty on the other. Among frequentists in psychology, a shift of emphasis from hypothesis testing to estimation has been dubbed “the New Statistics” (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.

Published in Psychonomic Bulletin & Review.
Final submitted manuscript: https://osf.io/preprints/psyarxiv/wntsa/
Published version view-only online (displays some figures incorrectly): http://rdcu.be/o6hd

Published article: http://link.springer.com/article/10.3758/s13423-016-1221-4



Published online: 2017-Feb-08
Corrected proofs submitted: 2017-Jan-16
Accepted: 2016-Dec-16
Revision 2 submitted: 2016-Nov-15
Editor action 2: 2016-Oct-12
Revision 1 submitted: 2016-Apr-16
Editor action 1: 2015-Aug-23

Initial submission: 2015-May-13

Sunday, January 29, 2017

Courses doing Bayesian data analysis, June 2017

Choose from two multi-day workshops doing Bayesian data analysis in June, 2017:
The two workshops cover similar material. The Stats Camp provides a little more time for an extra topic or ad lib demonstrations of variations. Register early!

Wednesday, December 21, 2016

Bayesian assessment of null values


A blog post by Christian Robert considered an ancient (2011!) article titled "Bayesian assessment of null values via parameter estimation and model comparison." Here I'll try to clarify the ideas from way back then through the lens of more recent diagrams from my workshops and a new article.

Terminology: "Bayesian assessment of null values" is supposed to be neutral wording to refer to any Bayesian method for assessing null values. Bayesian "hypothesis testing" is reserved for Bayes factors. Making decisions by posterior interval is not referred to as hypothesis testing and is not equivalent to Bayes factors.

Bayesian hypothesis testing: Suppose we are modeling some data with a model that has parameter δ in which we are currently interested, along with some other parameters. A null hypothesis model can be formulated as a prior on the parameters that puts a "spike" at the null value of δ but is spread out over the other parameters. A non-null alternative model puts a prior on δ that allows non-null values. The two models are indexed by a higher-level discrete parameter M. The entire hierarchy (a mixture model) has all its parameters updated by the data. The following slide from my workshops illustrates:



The Bayes factor (BF) is the shift in model-index probabilities:
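In symbols (my rendering of the standard identity, not the slide itself), with data \(D\) and model index \(M\):
\[
\underbrace{\frac{p(M{=}1 \mid D)}{p(M{=}2 \mid D)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{p(D \mid M{=}1)}{p(D \mid M{=}2)}}_{\text{Bayes factor}}
\;\times\;
\underbrace{\frac{p(M{=}1)}{p(M{=}2)}}_{\text{prior odds}}
\]
That is, the BF is the factor by which the data shift the prior odds of the model indices to the posterior odds.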



Digression: I throw in two usual caveats about using Bayes factors. First, Bayesian model comparison -- for null hypothesis testing or, more generally, for comparing any (non-nested) models -- must use meaningful priors on the parameters in both models for the Bayes factor to be meaningful. Default priors for either model are typically not very meaningful and quite possibly misleading.
Second, the Bayes factor is not the posterior probability of the models. Typically we ultimately want to know the posterior probabilities of the models, and the BF is just a step in that direction.
Link in slide above: http://doingbayesiandataanalysis.blogspot.com/2015/12/lessons-from-bayesian-disease-diagnosis_27.html

Assessing null value through parameter estimation: There's another way to assess null values. This other way focuses on the (marginal) posterior distribution of the parameter in which we're interested. (As mentioned at the outset, this approach is not called "hypothesis testing.") This approach is analogous to frequentist equivalence testing, which sets up a region of practical equivalence (ROPE) around the null value of the parameter:
 
The logic of this approach stems from a direct reading of the meaning of the intervals. We decide to reject the null value when the 95% highest density parameter values are all not practically equivalent to the null value. We decide to accept the null value when the 95% highest density parameter values are all practically equivalent to the null value. Furthermore, we can make direct probability statements about the probability mass inside the ROPE such as, "the probability that the parameter is practically equivalent to the null is 0.017" or "the probability that the parameter is practically equivalent to the null is 0.984."

The ROPE is part of the decision rule, not part of the null hypothesis. The ROPE does not constitute an interval null hypothesis; the null hypothesis here is a point value. The ROPE is part of the decision rule for two main purposes: First, it allows decisions to accept the null (again, analogous to frequentist equivalence testing). Second, it makes the decision rule asymptotically correct: As data sample size increases, the rule will come to the correct decision, either practically equivalent to the null value (within the ROPE) or not (outside the ROPE).

Juxtaposing the two approaches: Notice that the two approaches to assessing null values are not equivalent and have different emphases. The BF focuses on the model index, whereas the HDI and ROPE focus on the parameter estimate:


Therefore the two approaches will not always come to the same decision, though often they will. Neither approach is uniquely "correct;" the two approaches frame the question differently and provide different information. 

Below is an example of the different information provided by hypothesis testing and estimation (for both frequentist and Bayesian analyses). The data are dichotomous, with z=14 successes out of N=18 attempts (e.g., 14 heads in 18 flips). The data are modeled by a Bernoulli distribution with parameter θ. The null value is taken to be θ=0.50. For the Bayesian analysis, the alternative-hypothesis prior is uniform merely for purposes of illustration; a uniform prior is equivalent to dbeta(1,1).
From: Kruschke, J. K. and Liddell, T. (in press 2016), The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.
You can see above that Bayesian hypothesis testing and Bayesian parameter estimation provide very different information. Which approach to use for assessing the null value then comes down to careful interpretation and practicalities. For more discussion please see this article: Kruschke, J. K. and Liddell, T. (2017), The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.
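For readers who want to reproduce the flavor of that comparison numerically, here is a rough sketch (Python with scipy, not from the article; the ROPE from 0.45 to 0.55 is my illustrative choice) of the two kinds of information for z=14 successes in N=18 attempts, with the dbeta(1,1) alternative prior and the null value θ=0.50.

    from scipy import stats
    from scipy.special import beta as beta_fn, comb
    from scipy.optimize import minimize_scalar

    z, N = 14, 18                                     # 14 successes in 18 attempts
    null_theta = 0.50

    # --- Estimation: posterior of theta under the dbeta(1,1) (uniform) prior ---
    posterior = stats.beta(1 + z, 1 + N - z)          # Beta(15, 5)

    def beta_hdi(dist, cred_mass=0.95):
        """Narrowest interval containing cred_mass of a unimodal distribution."""
        def width(low_tail):
            return dist.ppf(low_tail + cred_mass) - dist.ppf(low_tail)
        res = minimize_scalar(width, bounds=(0.0, 1.0 - cred_mass), method="bounded")
        return dist.ppf(res.x), dist.ppf(res.x + cred_mass)

    hdi_lo, hdi_hi = beta_hdi(posterior)
    rope = (0.45, 0.55)                               # illustrative ROPE, my choice
    mass_in_rope = posterior.cdf(rope[1]) - posterior.cdf(rope[0])
    print(f"posterior mode  = {z / N:.3f}")           # mode of Beta(15,5) is 14/18
    print(f"95% HDI         = [{hdi_lo:.3f}, {hdi_hi:.3f}]")
    print(f"P(theta in ROPE {rope} | D) = {mass_in_rope:.3f}")

    # --- Hypothesis testing: Bayes factor, null spike at 0.50 vs uniform alternative ---
    p_D_null = comb(N, z) * null_theta**z * (1 - null_theta)**(N - z)
    p_D_alt = comb(N, z) * beta_fn(z + 1, N - z + 1)  # binomial likelihood integrated over theta
    print(f"BF (alternative vs null) = {p_D_alt / p_D_null:.2f}")   # roughly 4.5 with these inputs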

Friday, December 16, 2016

The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective

UPDATE: Now published, see this post.

Two conceptual distinctions in the practice of data analysis. Rows show point-value hypothesis testing versus estimating magnitude with uncertainty. Columns show frequentist versus Bayesian methods. Cells indicate the typical information provided by each approach. [Figure 1 of Kruschke & Liddell (in press), The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.]
Many people have found the table above to be useful for understanding two conceptual distinctions in the practice of data analysis. The article that discusses the table, and many other issues, is now in press. (It was submitted in mid-May 2015 and was just accepted; a blog post announcing its original version is here, along with many comments.) The in-press version can be found at OSF and at SSRN.

Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty, on the other hand. Among frequentists in psychology a shift of emphasis from hypothesis testing to estimation has been dubbed "the New Statistics" (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.

Wednesday, November 16, 2016

Is it legitimate to view the data and then decide on a distribution for the dependent variable?

An emailer asks,
In Bayesian parameter estimation, is it legitimate to view the data and then decide on a distribution for the dependent variable? I have heard that this is not “fully Bayesian”.
The shortest questions often probe some of the most difficult issues; this is one of those questions.

Let me try to fill in some details of what this questioner may have in mind. First, some examples:
  • Suppose we have some response-time data. Is it okay to look at the response-time data, notice they are very skewed, and therefore model them with, say, a Weibull distribution? Or must we stick with a normal distribution because that was the mindless default distribution we might have used before looking at the data? Or, having noticed the skew in the data and decided to use a skewed model, must we now obtain a completely new data set for the analysis?
  • As another example, is it okay to look at a scatter plot of population-by-time data, notice they have a non-linear trend, and therefore model them with, say, an exponential growth trend? Or must we stick with a linear trend because that was the mindless default trend we might have used before looking at the data? Or, having noticed the non-linear trend in the data and decided to use a non-linear model, must we now obtain a completely new data set for the analysis?
  • As a third example, suppose I’m looking at data about the positions of planets and asteroids against the background of stars. I’m trying to fit a Ptolemaic model with lots of epicycles. After considering the data for a long time, I realize that a completely different model, involving elliptical orbits with the sun at a focus, would describe the data nicely. Must I stick with the Ptolemaic model because it’s what I had in mind at first? Or, having noticed the Keplerian trend in the data, must I now obtain a completely new data set for the analysis?
What is the worry of the unnamed person who said it would not be “fully Bayesian” to look at the data and then decide on the model? I can think of a few possible worries:

One worry might be that selecting a model after considering the data is HARKing (hypothesizing after the results are known; Kerr 1998, http://psr.sagepub.com/content/2/3/196.short). Kerr even discusses Bayesian treatments of HARKing (pp. 206-207), but this is not a uniquely Bayesian problem. In that famous article, Kerr discusses why HARKing may be inadvisable. In particular, HARKing can transform Type I errors (false alarms) into a "confirmed" hypothesis and, from there, into presumed fact. With respect to the three examples above, the skew in the RT data might be a random fluke, the non-linear trend in the population data might be a random fluke, and the better fit by the solar-centric ellipse might be a random fluke. Well, the only cure for Type I errors (false alarms) is replication, as Kerr mentions -- pre-registered replication. There are lots of disincentives to replication attempts, but these disincentives are gradually being mitigated by recent innovations like registered replication reports (https://www.psychologicalscience.org/publications/replication). In the three examples above, there are lots of known replications of skewed RT distributions, exponential growth curves, and elliptical orbits.

Another worry might be that the analyst had a tacit but strong theoretical commitment to a particular model before collecting the data, and then reneged on that commitment by sneaking a peek at the data. With respect to the three examples above, it may have been the case that the researcher had a strong theoretical commitment to normally distributed data, but, having noticed the skew, failed to mention that theoretical commitment and used a skewed distribution instead. And analogously for the other examples. But I think this mischaracterizes the usual situation of generic data analysis. The usual situation is that the analyst has no strong commitment to a particular model and is trying to get a reasonable summary description of the data. To be "fully Bayesian" in this situation, the analyst should set up a vast model space that includes all sorts of plausible descriptive models, including many different noise distributions and many different trends, because this would more accurately capture the prior uncertainty of the analyst than a prior with only a single default model. But doing that would be very difficult in practice, as the space of possible models is infinite. Instead, we start with some small model space, and refine the model as needed.

Bayesian analysis is always conditional on the assumed model space. Often the assumed model space is merely a convenient default. The default is convenient because it is familiar to both the analyst and the audience of the analysis, but the default need not be a theoretical commitment. There are also different goals for data analysis: Describing the one set of data in hand, and generalizing to the population from which the data were sampled. Various methods for penalizing overfitting of noise are aimed at finding a statistical compromise between describing the data in hand and generalizing to other data. Ultimately, I think it only gets resolved by (pre-registered) replication.

This is a big issue over which much ink has been spilled, and the above remarks are only a few off-the-cuff thoughts. What do you think is a good answer to the emailer's question?