Let’s say we have data points plotted on the x-y plane. We want to draw a line or curve best fit to the data points. In other words, we aim to draw a curve as close to all the data points through the error bars as possible. Since the shape of the curve is described by mathematical variables that are called parameters, this process is called parameter estimation.

In Astronomy and Astrophysics, such a curve is called the model. A model can be motivated from theory or from observation. There is no fundamental difference in fitting a theoretical model or an empirical model.

Here comes the question: how can we draw this best-fit curve? We have the conventional frequentist approach: we guess a reasonable set of parameters, compute the difference between the curve and the data points, and then minimise (by some efficient algorithms) the difference by changing the shape of the curve, i.e., drawing a new curve. The process is repeated, which is called the “fitting” process, or regression. The curve is said to be “best fit” if the difference is smaller than or approximately equal to certain *ad hoc* values, usually known as the “goodness of fit”.

In frequentist statistics, we obtain a point estimation, e.g., the polynomial coefficients of a polynomial curve. Note that the resulting point estimation depends on the choice of the goodness-of-fit function.

In contrast, the main idea of Bayesian statistics is extremely simple. We don’t need any goodness of fit. What we have is the one and only one formula, the Bayes Rule:

(Probability of the model is true given the observed data) = (Probability of observing the data given the model is true) x (Probability of the model being true).

[up to a normalisation constant called the evidence, under the condition that the integration of the probability of all possible values must equal to 1]

The left hand side is called the posterior, which contains all the information inferred from the data, summarised in a probability distribution of the model parameters. The first term on the right is called the likelihood, and the second term is called the prior. The posterior is completely determined from the likelihood and the prior.

The likelihood is a probability function that connects the physical parameters to the statistical parameters. We need to choose the likelihood function by ourselves, according to the physical processes under consideration. For instance, are we looking at a Poisson process? Are the error bars distributed as a Gaussian? etc.

The prior is a probability function chosen from expert knowledge. In other words, the prior is a probability distribution for each parameter under consideration.

The point is that once the likelihood and the prior are chosen, the shape of the posterior is fixed. There is no fitting process, no iteration, no loops for the code. Instead, we need to find an efficient way to draw the shape of the posterior. Often it is of multi-dimensions, so that we need methods more clever than brutal force.

Fortunately, the so-called Markov chain Monte Carlo comes to rescue. It is a kind of algorithm to efficiently sample the parameter space of the posterior. Once the posterior is displayed, the error bars can be obtained readily by locating the highest posterior density interval (HPDI). The HPDIs are the intervals of possible parameter values containing the highest density of probability. In contrast, frequentists have to rely on performing simulations to obtain the so-called confidence intervals, which are not even the same thing as error bars.

There are a lot of statistical reasons to argue for Bayesian statistics against frequentist statistics. In my viewpoint the main reason is that Bayesian statistics is simple. It directly answers the fundamental question of “how likely is a scientific theory to be true, given the data?” without the need to introduce any extra concepts. Simple. Only the Bayes Rule.

I will illustrate the advantages of Bayesian statistics in the up-coming posts of this “Bayesian Astrostatistics” series.