The Kolmogorov Complexity and Shannon Entropy of a System and How They Are Related

Kolmogorov Complexity and the Shannon Entropy of an information source have different definitions. The Wikipedia pages on Kolmogorov Complexity, Shannon Entropy and Shannon’s source coding theorem are essential and sound background reading, as is:

E. T. Jaynes, “Information Theory and Statistical Mechanics”, Phys. Rev. 106, No. 4, pp. 620-630, May 15, 1957 (especially the first section of this one).

The Jaynes article, as well as the Maths Stack Exchange question “What is the connectivity between Boltzmann’s entropy expression and Shannon’s entropy expression?”, gives a pretty good summary of the relationship between Shannon and Gibbs entropy. The Boltzmann entropy of an ensemble of statistically independent subsystems (or of a gas of particles whose states are statistically independent) is the same as the Gibbs entropy in that case; the two differ otherwise, the Boltzmann entropy then being simply a number calculated from the particle states as though they were statistically independent. A second Jaynes work:

E. T. Jaynes, “Gibbs vs. Boltzmann Entropies”, Am. J. Phys. 33, No. 5, pp. 391-398, May 1965

discusses the difference between the Boltzmann and Gibbs entropies in detail. The point is that the Boltzmann entropy is a different concept from the information theoretic point of view, but it is “accessible” (i.e. it can be calculated from macroscopic measurements of a system, and as such is equivalent to Clausius’s macroscopic definition of entropy), and there are quite strong qualitative arguments as to why it should “realise” the second law of thermodynamics. The true informational entropy, by contrast, cannot change for a deterministic system like the classical statistical mechanical gas, even for irreversible changes, so long as the gas does not come into contact with some other system: its state at any one time then wholly determines its state at all other times.
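For concreteness, the two quantities being contrasted can be written as follows (the notation here is my own shorthand rather than something lifted from the references). If $p_i$ is the probability of the $i^{th}$ microstate of the whole system, the Gibbs entropy is

$$S_G = -k_B \sum\limits_i p_i \log_e p_i$$

whereas the Boltzmann entropy is built from the single-particle distribution alone, $S_B = -N\,k_B \sum\limits_j q_j \log_e q_j$, where $q_j$ is the probability that any one particle is found in its $j^{th}$ state. When the particle states are statistically independent, $p_i$ factorises into a product of $q_j$’s and the two expressions coincide; otherwise they differ.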

So now we look at the Kolmogorov Complexity of a system. This is the length in bits of the shortest possible description of the system with respect to a given descriptive language. The K-complexity is always cited with a descriptive language in mind (it cannot be otherwise, even if this is not explicitly stated). Again, for the gas or system ensemble whose constituents truly are statistically independent, the most succinct description of the constituents’ states is simply one which names each state in full: the knowledge of one particle’s state tells us nothing about that of the others, otherwise they wouldn’t be statistically independent. (Indeed, the situation where the naming of all the particles’ states in full is the shortest possible description can be taken as a definition of statistical independence, if you like.) In this case, the Kolmogorov Complexity and the Shannon entropy approach one another as the number of particles or substates approaches infinity.
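To make the “shortest description with respect to a given descriptive language” idea a little more concrete, here is a crude numerical sketch of my own (not taken from any of the references): it uses zlib compressed length as a rough, upper-bound stand-in for Kolmogorov complexity, the “descriptive language” being, in effect, “strings fed to the zlib decompressor”. A string of statistically independent bytes resists compression, whereas a highly structured string of the same length compresses to a short description:

```python
import os
import zlib

# Two byte strings of the same length: one of i.i.d. random bytes, one with an
# obvious repeating pattern.  The compressed length is only an upper bound on
# the true Kolmogorov complexity, but the contrast is already stark.
n_bytes = 100_000
random_data = os.urandom(n_bytes)          # statistically independent bytes
structured_data = b"ab" * (n_bytes // 2)   # trivially describable pattern

for name, data in (("random", random_data), ("structured", structured_data)):
    compressed = zlib.compress(data, 9)
    print(f"{name:10s}: {len(data)} bytes -> {len(compressed)} bytes compressed")
```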

To understand why the two measures approach one another in the statistically independent case: as the number $N$ of particles gets very large, the numbers of them in each possible state are very near to $N\,p_1,\,N\,p_2,\,\cdots$, where $p_j$ is the probability that a given particle will be in the $j^{th}$ state, and the proportional error between the actual number in state $j$ for a sample of $N$ particles and the number $p_j\,N$ approaches nought as $N\to\infty$ (see my post “Why are the laws of thermodynamics ‘supreme among the laws of Nature’?”, where I discuss this in detail).
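The following small sketch (my own illustration; the state probabilities are an arbitrary choice) shows this proportional error shrinking as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # single-particle state probabilities (arbitrary choice)

for N in (10**3, 10**5, 10**7):
    # Draw N statistically independent particle states and count the occupancies n_j.
    states = rng.choice(len(p), size=N, p=p)
    counts = np.bincount(states, minlength=len(p))
    proportional_error = np.abs(counts - N * p) / (N * p)
    print(f"N = {N:>8d}: max proportional error = {proportional_error.max():.4f}")
```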

So this is how we shall specify the state in detail in a message. We shall need a “foreword” to our message, which tells us the number of particles $n_j$ in each state, without saying which particles are in that state. This is simply a number from $1$ to $N$ for each of a finite number $J$ of possible states. So we need:

$$F_1 = J \log_2 N$$

bits to do this. Once we have the numbers $n_j$ in each state, we write our main message naming which particles are in which state. We can easily see that the fewest number of bits needed is $\log_2 \Omega$, where $\Omega$ is the total number of ways of arranging $N$ particles into partitions of size $n_1,\,n_2,\,\cdots$. We assign a different integer to each possible arrangement: simply naming this integer tells us exactly which arrangement we have. We cannot get by with fewer than $\Omega$ codewords, because then the relationship between the set of all arrangements and the codewords could not be one to one (this is simply a restatement of the pigeonhole principle). So, without knowing how to encode the arrangements, we can see that there is some coding that will encode the message in $\log_2 \Omega$ bits. But now:

$$\Omega = \frac{N!}{n_1!\,n_2!\,\cdots}$$

and by the Stirling approximation we get

$$\log_e \Omega \approx N\log_e N - N - \sum\limits_{j=1}^J {(n_j \log_e n_j - n_j)} = -\sum\limits_{j=1}^J {n_j \log_e \frac{n_j}{N}}$$
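As a numerical sanity check on this Stirling step (a sketch of my own; the occupation numbers are an arbitrary choice), the exact $\log_e\Omega$, computed with the log-gamma function, and the approximation $-\sum_j n_j \log_e(n_j/N)$ agree ever more closely as the counts grow:

```python
from math import lgamma, log

def log_multinomial(counts):
    """Exact log_e of N! / (n_1! n_2! ...), computed via the log-gamma function."""
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)

# Occupation numbers n_j kept in the fixed ratios 0.5 : 0.3 : 0.2 and scaled up.
for scale in (1, 100, 10_000):
    counts = [5 * scale, 3 * scale, 2 * scale]
    N = sum(counts)
    exact = log_multinomial(counts)
    approx = -sum(n * log(n / N) for n in counts)
    print(f"N = {N:>6d}: exact = {exact:14.2f}, Stirling = {approx:14.2f}, "
          f"ratio = {approx / exact:.5f}")
```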

Now we know, by the laws of large numbers, that $\log(N p_j /n_j)\to 0$ as $N\to\infty$, so in the limit of large $N$, the shortest message size in bits per particle is:

$$\frac{\log_2\Omega + F_1+F_2}{N} = \frac{F_1+F_2}{N} - \left(\sum\limits_{j=1}^J {\frac{n_j}{N} \log_2 \frac{n_j}{N}}\right)\to-\sum\limits_{j=1}^J p_j \log_2 p_j$$

bits per symbol, which is the Shannon entropy. Here we have had to add to our “foreword” a description, $F_2$ bits long, of how the descriptive code integer $1\cdots\Omega$ maps to the different arrangements. But this is a slowly growing overhead: it can be shown that its length in bits can be arranged to grow only logarithmically with $N$. So we see that the Kolmogorov Complexity and the Shannon entropy come to mean the same thing, in many important cases, in the thermodynamic limit, as the laws of large numbers take hold and make $n_j = N\,p_j$ effectively exact, and as the near-constant descriptional “overhead” of the foreword is spread over more and more particles.
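Here is a small sketch of that convergence (again my own illustration: it uses the idealised occupancies $n_j = N\,p_j$, counts the foreword as $F_1 = J\log_2 N$ and neglects $F_2$). The total description length per particle approaches the Shannon entropy as $N$ grows:

```python
from math import lgamma, log, log2

# Single-particle state probabilities (arbitrary illustrative choice).
p = [0.5, 0.3, 0.2]
H = -sum(q * log2(q) for q in p)    # Shannon entropy in bits per particle

for N in (10**2, 10**4, 10**6):
    counts = [round(q * N) for q in p]                 # idealised n_j = N p_j
    # Exact log2(Omega) = log2( N! / (n_1! n_2! ...) ) via the log-gamma function.
    log2_omega = (lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)) / log(2)
    foreword = len(p) * log2(N)                        # F_1 = J log2 N; F_2 neglected
    bits_per_particle = (log2_omega + foreword) / N
    print(f"N = {N:>8d}: {bits_per_particle:.4f} bits/particle  (H = {H:.4f})")
```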

It is worth noting that you can work the above into a rigorous proof of Shannon’s noiseless coding theorem: http://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem. If you try to code a message comprising a string of statistically independent symbols using $H-\epsilon$ bits per symbol (where $H$ is the Shannon entropy of the symbol probability distribution), for any $\epsilon>0$ no matter how small, then the probability of “failure” (i.e. the probability that you will “garble” the message) rises to unity as the message length $\to\infty$. Conversely, if you choose to use $H+\epsilon$ bits per symbol, then it can be proven that there exists a coding scheme such that the probability of failure shrinks to nought as the message length rises without bound. Shannon’s definition of entropy is given its “working” practical meaning through the noiseless coding theorem. Jaynes also gives statistical mechanical interpretations and other interesting mathematical properties of the Shannon entropy in the opening sections of “Information Theory and Statistical Mechanics”.
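To see the $H+\epsilon$ half of the theorem at work numerically, here is a minimal sketch (my own, with an arbitrary choice of symbol probabilities): a Huffman code built over blocks of $k$ statistically independent symbols is lossless, and its average length per symbol approaches $H$ from above as $k$ grows:

```python
import heapq
from itertools import product
from math import log2

# Source alphabet and probabilities (arbitrary illustrative choice).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(q * log2(q) for q in p.values())

def huffman_lengths(probs):
    """Codeword lengths of an optimal prefix (Huffman) code for the given probabilities."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(q, i, [s]) for i, (s, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    tie = len(heap)
    while len(heap) > 1:
        q1, _, syms1 = heapq.heappop(heap)
        q2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:        # merging deepens every leaf in both subtrees
            lengths[s] += 1
        heapq.heappush(heap, (q1 + q2, tie, syms1 + syms2))
        tie += 1
    return lengths

print(f"H = {H:.4f} bits/symbol")
for k in (1, 2, 4, 6):
    # Probability of each block of k statistically independent symbols.
    blocks = {}
    for blk in product(p, repeat=k):
        prob = 1.0
        for s in blk:
            prob *= p[s]
        blocks[blk] = prob
    lengths = huffman_lengths(blocks)
    average = sum(blocks[b] * lengths[b] for b in blocks) / k
    print(f"block length k = {k}: {average:.4f} bits/symbol")
```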