Nicholas Metropolis, a Greek American physicist, is well known for the “Metropolis chains”. It is a well known fact in Markov chain theory that for an irreducible matrix there is a unique stationary distribution satisfying . A natural question which comes up is the following:
Given a probability distribution on a state space , can we construct a transition matrix for which is its stationary distribution?
The method for sampling from a given probability distribution is called Markov Chain Monte Carlo (In my opinion, this fact by itself is elegant, but not too surprising. The most interesting question is how fast do you converge, but I will elaborate on this in a follow up post.). The main idea behind Markov Chain Monte Carlo is censorship. Let’s start for simplicity by considering a symmetric transition matrix , that is . We will create a chain that will censor moves with some probability. Assume for example that we are in state . We will sample the next state using but we will accept the new state with probability . Therefore, we end up having the following transition matrix: if and . The goal thus has become to pick a matrix such that the matrix has stationary distribution the distribution we want to sample from. Now we will use the following proposition, which is the crucial argument we will use to pick the matrix:
Proposition: Assume that we have a probability distribution on which satisfies the detailed balance equations, i.e., for all pairs , where is a transition matrix. Then, any which satisfies this set of equations is stationary for .
Proof: Recall the definition of a stationary distribution for a transition matrix , i.e., or equivalently . But from the detailed balance equation we can substitute the terms of the sum and obtain the following tautology: .
Now that we know the above, we can pick the matrix so that we obtain the desired probability distribution. Specifically, it suffices to find an matrix such that for every pair of states the detailed balance equations for the censored MC are satisfied. . Since we started out with a symmetric matrix, we obtain . Since the entries of the matrix are probabilities, we have the constraints and . A choice which wastes as much as little moves and satisfies the above equations and constraints is to pick . This results in the transition matrix that we wanted from the beginning. Clearly, if and .
1) An important fact concerning the Metropolis Chain algorithm is that we can easily simulate the chain since the “amount of censorship” we introduce depends on the ratios of .
2) Exactly the same argument applies to the case that is not symmetric. In that case .
3) Despite the fact that the above analysis today is considered standard, it is still a remarkable result. Think about it…
4) A lot of hard work goes into proving that the chain mixes rapidly. More about this aspect in future posts.
5) Many data mining and machine learning papers which I have seen use Metropolis chains, e.g., to perform simulated annealing, but neither prove that their heuristic is there because the problem is NP-hard, nor they prove that their chain mixes rapidly. This is in my opinion something to be avoided.