Monday, September 29, 2014

Doing Bayesian Data Analysis in Python

Do you prefer Python? Some readers have undertaken to translate the computer programs from Doing Bayesian Data Analysis into Python, including Osvaldo Martin, who has this GitHub site for his ongoing project. If you are interested in what he has done, or if you are interested in contributing, please contact him. If you have your own Python-translation project going, please add a comment to this post. Thank you all for your contributions to make Bayesian methods available in all flavors.

Thursday, September 18, 2014

p value from likelihood ratio test is STILL not the same as p value from maximum likelihood estimate

In yesterday's post, I described two ways for finding a p value for a parameter, and pointed out that the two ways lead to different p values. As an example, I considered the slope parameter in simple linear regression. One way to get a p value for the slope parameter is with a standard likelihood ratio test, for linear model with free slope versus intercept-only model. The other way to get a p value is to generate a sampling distribution of the MLE slope parameter itself, from a null hypothesis that consists of the MLE intercept-only model. Through remarks from commenters, especially Arash Khodadadi and Noah Silbert, I've been prodded to clarify the differences between these approaches. This post points out
  • The likelihood ratio test compares model fits relative to the variance, while the sampling distribution of the MLE slope parameter is on the absolute scale of the data.
  • An example in which the slope parameter has p < .05 two-tailed in the sampling distribution of the MLE slope parameter but has p > .05 (one-tailed) in the sampling distribution of the likelihood ratio.
First, the data (a little different from the previous posts):
The header of the plot, above, indicates the MLE of the linear model, and -2log(LR), a.k.a. G2, rounded to three significant digits.

The null hypothesis is an intercept-only model (beta1=0) with beta0 and sigma set to their MLE values when beta1 is fixed at zero. (Which means, in this case, that beta0 is the mean of y and sigma is the sd of y using N not N-1 in the denominator.) I generated sample data from the null hypothesis using the x values in the actual data. For each sample I computed the MLE of the full model and -2log(LR). The resulting marginal sampling distributions are shown here:
In the left panel, above, the one-tailed p value of MLE beta1 is displayed; multiply it by 2 to get the two-tailed p value. Notice it is different than the p value of the likelihood ratio.

Below is the joint sampling distribution, where each point is a sample from the null hypothesis. There is a new twist to this figure: Each point is color coded for the magnitude of MLE sigma, where blue is the largest MLE sigma in the distribution and red is the smallest MLE sigma in the distribution.
The joint sampling distribution also shows the thresholds for p=.05 (one-tailed or two-tailed), and the actual data statistics are plotted as a "+".

You can see from the joint sampling distribution that MLE beta1 can be large-ish even when -2log(LR) is small-ish when the sample MLE sigma is large-ish (blue points). But the opposite can happen when the sample MLE sigma is small-ish (red points). Thus, a key difference between the measures of the slope parameter is how they deal with the variance. The likelihood ratio compares the free-slope against intercept-only models relative to the variance, while the MLE beta1 considers the slope on the absolute scale of the data, not relative to the variance.

As discussed in yesterday's post, I don't think either test is inherently better than the other. They just ask the question about the slope parameter in different ways. As mentioned yesterday, posing the question in terms of absolute MLE beta1 has direct intuitive interpretation. It's also much easier to use when defining confidence intervals as the range of parameter values not rejected by p<alpha (which is, for me, the most coherent way to define confidence intervals). But that's a topic for another day!




Wednesday, September 17, 2014

p value from likelihood ratio test is not the same as p value from maximum likelihood estimate

In a post of a few hours ago, I pointed out that I was having trouble getting p values to agree for two different methods. Thanks to a suggestion from a reader, there is a resolution: The p values should not agree. This was actually my hunch and hope all along, because it adds another reason never to talk about "the" p value for a set of data, because any data set has many different p values.

First, a recap:
I'm doing Monte Carlo simulation of sampling distributions to compute p values. For example, consider the slope parameter, β1, in simple linear regression. I want to find out if p < .05 for a null hypothesis that β1=0. I'm working with two different ways to compute a p value, and am finding that the results do not agree with each other. Here are the two ways:
  1. Consider the maximum likelihood estimate (MLE) of β1 from the actual data, denoted β1MLEactual, and see where it falls in a sampling distribution of β1MLEnull for simulated data from the null hypothesis.
  2. Consider the likelihood ratio statistic, G2 = -2log(LR) where LR is the ratio of the maximum likelihood of the restricted model with β1=0 over the maximum likelihood of the full model with β1 free. We see where G2actual falls in the sampling distribution of G2null for simulated data from the null hypothesis.
See the previous post for details about how the data from the null hypothesis were generated. But here is a repeat of the old picture of the two sampling distributions:
Notice that the p values are not the same in the two distributions. (See the end of the previous post for reasons that the difference cannot be explained away as some simple artifact.)

Now the news: Arash Khodadadi, an advanced graduate student in Psychological & Brain Sciences at Indiana University, read the post and pointed out to me that not all samples that have G2null > G2actual are the same samples that have β1MLEnull > β1MLEactual. Essentially he was saying that I really should be looking at the joint sampling distribution. So I made a plot, and here it is:
Each point corresponds to a sample from the null hypothesis. The marginals of the joint distribution were plotted in the previous figure. I put lines at the values of β1MLEactual and G2actual, and I color coded the points that exceed those values. Obviously the points contributing to the p value for β1MLE
are quite different than the points contribution to the p value for G2. It is conceivable that the red and blue points would exactly equal each other mathematically (perhaps in a two-tailed version), but I doubt it.

Conclusion: This exercise leads me to conclude that the p values are different because they are referring to different questions about the the null hypothesis. Which one is more meaningful? For me, the sampling distribution of β1MLE makes more direct intuitive contact with what people want to know (in a frequentist framework), namely, Is the observed magnitude of the slope very probable under the null hypothesis? The sampling distribution of G2 is less intuitive, as it is asking, Is the observed ratio, of the probability of the data under the zero-slope model over the probability of the data under the free-slope model, very probable under the null hypothesis? Both questions are meaningful, but the first one asks directly about the magnitude of the slope, whereas the second one asks about the relative probabilities of data under the model structures.

Now the difficult semantic question: When the p values conflict, as they do in this example (i.e., p < .05 for β1MLE, while p > .05 for G2 [and if you prefer two-tailed tests, then we could contrive a case in which the p values conflict there, too]), is the slope "significantly" non-zero? The answer: It depends! It depends on the question you are asking. I think it makes intuitive sense to ask the question about the magnitude of the slope and, therefore, to say in this case that the slope is significantly non-zero. But if you specifically want to ask the model comparison question, then you need the model-comparison p value, and you would conclude that the free-slope model is not significantly better than the zero-slope model.

Addendum: SEE THE FOLLOW-UP POST.

Sampling distribution of maximum lilkelihood estimate - help?

I'm doing Monte Carlo simulation of sampling distributions to compute p values. For example, consider the slope parameter, β1, in simple linear regression. I want to find out if p < .05 for a null hypothesis that β1=0. I'm working with two different ways to compute a p value, and am finding that the results do not agree with each other. Here are the two ways:
  1. Consider the maximum likelihood estimate (MLE) of β1 from the actual data, denoted β1MLEactual, and see where it falls in a sampling distribution of β1MLEnull for simulated data from the null hypothesis.
  2. Consider the likelihood ratio statistic, G2 = -2log(LR) where LR is the ratio of the maximum likelihood of the restricted model with β1=0 over the maximum likelihood of the full model with β1 free. We see where G2actual falls in the sampling distribution of G2null for simulated data from the null hypothesis.
I'm finding that the p values don't agree. What am I doing wrong?

Here's an example with details. Consider the data in this scatter plot:
The maximum-likelihood regression line is shown above in blue, and the maximum-likelihood zero-slope line is shown in red. The MLE parameter values are shown in the header of the plot; in particular β1MLEactual = 5.86. The header also shows that G2actual = 4.41.

For reference, in R a call to lm yields p = 0.068 for the slope coefficient. Let's see if either of the Monte Carlo schemes will agree with that result.

We need to generate Monte Carlo data samples. For this purpose, I use the x values in the actual data, and for each x value I generate a random y value according to the MLE values of the restricted model. The restricted model has parameter values shown in the header of the figure below (rounded to three digits):
Sampling distributions from the null hypothesis. The null-hypothesis parameter values are shown in the header of the
Thus, for each x value in the data, a random value for y is created in R as rnorm(1,mean=158,sd=36.6)using the full-precision MLE values, not just the rounded values shown here. For each simulated set of data, I (well, the computer) computed β1MLEnull and G2null. There were 50,000 simulated data sets to create fairly smooth and stable sampling distributions, shown above.

The sampling distributions above also show the locations of β1MLEactual and G2actual along with their p values. You can see that the p value for β1MLEactual does not match the p value for G2actual, but the latter does match the result of R's lm function.

Maybe the problem is merely Monte Carlo sampling noise. But, no, even much bigger simulations don't help (and the p values stay essentially the same).

Maybe the problem is one-tailed vs two-tailed p values. But, no, two times the MLE p value does not equal the G2 p value.

Maybe the problem is that the null hypothesis σ should be the unbiased estimate of SD(y), using N-1 in the denominator for computing SD, instead of the MLE for SD(y), which effectively uses N in the denominator. But, no, that still doesn't make the p value match. The p value for β1MLE does increase (i.e., the sampling distribution of β1MLEnull widens while the sampling distribution of G2null remains unchanged), but still does not match, either one-tailed or two-tailed.

What am I doing wrong with the sampling distribution of  β1MLEnull?

ADDENDUM: SEE THE FOLLOW-UP POST!

Wednesday, August 20, 2014

Converting combination of random variables to hierarchical form for JAGS (BUGS, Stan, etc.)

An emailer asks:

Hi, John. Long-time listener, first-time caller... I have a model that says X is a function of three (independent) random variables:
X ~ normal(mu, sigma) / uniform(a,a+b) - beta(v,w)
and I also have N random samples of X. Can I use JAGS to estimate mu, sigma, a, b, v and w? Or is this problem outside its scope? Not sure if it's even possible.

My reply:

In other words, you want

x <- r / u - b
where
r ~ normal( rMean , rSD )
u ~ uniform( uLow , uHigh )
b ~ beta( bA , bB )

but the problem is that JAGS won't accept that sort of model specification.

One solution is to change to an equivalent hierarchical formalization:

x ~ normal( rMean/u-b , rSD/u )
u ~ uniform( uLow , uHigh )
b ~ beta( bA , bB )

This hierarchical form is directly implementable in JAGS. But be careful that in JAGS the normal is parameterized by precision, not SD, so you'll need to write x ~ dnorm( rMean/u-b , 1/(rSD/u)^2 ).

To demonstrate that the two forms are equivalent, here is a Monte Carlo sample generated in R from each formalization:

# Specify parameter values and sample size:
rMean=7 ; rSD=2
uLow=3 ; uHigh=5
bA=4 ; bB=3
N = 50000

# Sample x values your original way:
b = rbeta(N, bA,bB)
u = runif(N, uLow,uHigh)
r = rnorm(N, rMean,rSD)
x1 = r/u-b

# Sample x values a new way:
x2 = rep(0,N)
for ( i in 1:N ) {
  b = rbeta(1, bA,bB)
  u = runif(1, uLow,uHigh)
  x2[i] = rnorm(1, rMean/u-b , rSD/u )
}

# Plot results:
layout(matrix(1:2,nrow=2))
xLim=range(c(x1,x2))
hist(x1,xlim=xLim,breaks=21)
text( mean(x1) , 0 , "+" , cex=2 )
hist(x2,xlim=xLim,breaks=21)
text( mean(x2) , 0 , "+" , cex=2 )



The result is shown below, where you can see that the original and new distributions are identical:

Friday, August 15, 2014

How to use MCMC posterior as prior for future data

An emailer writes:

Dear Prof. Kruschke,
Hello. My name is ... and I am ... . I'm trying to apply Bayesian theorem in developing a model of ... . I used your code to estimate posterior distribution without any trouble. Here is my question. Would you kindly let me know how to use the estimated posterior distribution as the prior of another Bayesian update? ...  Any comment would be greatly appreciated. Thank you very much for your time. Regards,
 My reply:

Thanks very much for your interest in the book.

Because MCMC methods require a mathematically specified prior, but generate a Monte Carlo sample of the posterior, you need to either (a) find a reasonable mathematical summary of the MCMC posterior to use as the mathematical prior for the next batch of data or (b) concatenate the previous data with the next batch of data and analyze them together.

Option (b) gives you the mathematically correct answer, but at the cost of a large data set which might slow down the MCMC process, and it demands that the previous and next batch of data are structured the same way and analyzed with the same model.

For option (a), there is no uniquely correct approach. For each parameter, examine its marginal MCMC posterior distribution and summarize it with a reasonable mathematical function. For example, if the parameter is metric on positive values, maybe a gamma distribution would mimic the MCMC distribution well. Use a mathematical distribution with the same central tendency and standard deviation as the MCMC posterior. A peril of mimicking each marginal distribution separately is that you lose all information about the correlation of parameters in the posterior. If the correlation is important for capturing the prior information you want, then you can use a multivariate prior for two or more parameters simultaneously (e.g., dmnorm in JAGS).

Hope this helps.

P.S. (Added an hour or so after the text above.) A similar thing was done in the book with pseudo-priors. See the middle paragraph of p. 251: "The shape and rate constants for the gamma pseudopriors were set so that the gamma distributions had the same mean and standard deviation as the posterior sample of kappa values. For example, consider the parameter kappa0. In the initial run without pseudopriors, a reasonable posterior sample was obtained for model index 2 (i.e., when kappa0 is actually constrained by the data, as shown in the bottom panel of Figure 10.3). That posterior sample had a mean m and a standard deviation s. The conversion from mean and standard deviation to shape and rate parameters was explained in the caption of Figure 9.8. The conversion yields shape and rate values specified in lines 55 to 56 of the code."

Tuesday, August 12, 2014

Stopping and testing intentions in p values

An emailer asks:

I am really interested in Bayesian analysis, but I don't get the issue of sampling intention being so important in frequentist t-tests; if you have 60 values you have 60 values surely - why does your intention matter? The computer does not know what I intended to do; or have I missed the point entirely!?

That is exactly the point -- intuitively (for natural Bayesians) the stopping and testing intentions should not matter (for some things), but for p values the stopping and testing intention is of the essence, sine qua non. In lieu of reading the BEST article, see this video, starting at 7:15 minutes:





One other quick question if I may; how does the BEST package differ from the BayesFactor package?

An appendix of the BEST article explains, but you can instead see this video, starting at 2:35 minutes:


Those video excerpts might make the most sense if you just start with the first and watch them through, all the way...