How is the Information in a Continuous Variable Limited?

A question on Physics Stack Exchange speculates that the information content of a human brain is infinite, because the brain’s state is described by a continuous quantum state, a “wavefunction”.

As pithily answered by Emilio Pisanty, the Hilbert space of quantum mechanics, i.e. $L^2(\mathbb{R}^N)$, is separable, which means it has a countable basis, even though the wavefunctions themselves are continuous. Put this together with the fact that the energy of a human brain is experimentally observed to be bounded, and it is clear that only a finite number of the discrete energy eigenstates can contribute. So the theoretical quantum state capacity of a brain, although huge, is indeed finite. As Emilio so eloquently put it, “…if a sizable fraction of the electrons in our brain had energies above, say, 1 GeV, we would instantly come apart in a blaze of gamma rays and positrons and whatnot…”. For me, that statement evokes a scene like Dr. Who regenerating, a blaze of energy blasting from his head region (at least in the Christopher Eccleston and later Dr. Whos).

That’s the theoretical limit. The real situation is degraded by noise. The information that can be stored in one continuous variable (i.e. one real number, which in principle encodes $\aleph_0$ bits) is precisely quantified by Shannon’s noisy channel coding theorem.

Let’s suppose we have a normalized real variable $x \in [0,1]$: the interval represents the fact that we have a finite voltage range, or light intensity range, or whatever we use to record our information. We think of “writing down” our information from a source with Shannon entropy $H$ bits per symbol as a value $x\in[0,1]$. When we come to read this value, it has in general been corrupted by noise, so its value will be some other $y\in[0,1]$, and we can think of the write/read cycle of the same variable as a transmission through a noisy channel. Intuitively it makes sense to use only discrete values in the interval to stand for recorded information: the more tightly packed they are, the likelier they are to be confused with their neighbors in the write/read cycle, so we can see that there is going to be some limit here. So we have two discrete probability distributions: $p_X(x_j)$, the distribution of which symbol is written into the real variable, and $p_Y(y_k)$, the distribution of which symbol is read back.
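The packing trade-off described above can be put into numbers with a small sketch. It rests on assumptions that are not in the argument itself: the levels are equally spaced and equally likely, the noise is additive Gaussian with standard deviation `sigma`, the reader picks the nearest level, and the read value is not clipped to $[0,1]$. The function names are purely illustrative.

```python
import math


def q_function(x: float) -> float:
    """Tail probability P(Z > x) for a standard normal variable Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def symbol_error_rate(num_levels: int, sigma: float) -> float:
    """Average probability of misreading a level under nearest-level decoding.

    Assumes equally likely, equally spaced levels on [0, 1] and additive
    Gaussian noise of standard deviation sigma (an illustrative model).
    """
    if num_levels < 2:
        raise ValueError("need at least two levels")
    spacing = 1.0 / (num_levels - 1)
    # A level is misread when the noise exceeds half the spacing.
    # Interior levels can err in two directions, the two edge levels in one,
    # which averages to the factor 2 * (M - 1) / M.
    tail = q_function(spacing / (2.0 * sigma))
    return 2.0 * (num_levels - 1) / num_levels * tail


if __name__ == "__main__":
    sigma = 0.05
    for num_levels in (2, 4, 8, 16, 32):
        rate = symbol_error_rate(num_levels, sigma)
        print(f"{num_levels:2d} levels: misread probability {rate:.3e}")
```

With a fixed noise level, the misread probability climbs steadily as more levels are squeezed into the interval, which is the limit the paragraph above anticipates.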

The noisy channel coding theorem states that the maximum storage capacity $C$, in bits, of this real variable is the supremum, over all possible distributions $p_X(x_j)$ of written symbols, of the mutual information between the written symbol $X$ and the read symbol $Y$, i.e.

$$C = \sup\limits_{p_X} \left(\sum\limits_{x_j}\sum\limits_{y_k} p_{X,Y}(x_j,y_k) \log_2\frac{p_{X,Y}(x_j,y_k)}{p_X(x_j)\,p_Y(y_k)}\right)$$

where $p_{X,Y}(x_j,y_k)$ is the joint distribution of the input $x$ and output $y$ and models the noise corruption of the written variable.
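As a rough numerical illustration of this formula, the sketch below evaluates the double sum for a given input distribution and approximates the supremum with the Blahut-Arimoto iteration. That iteration is a standard numerical method swapped in here for illustration; it is not part of the argument above. The channel is supplied as a hypothetical table `channel[j][k]` of conditional probabilities $P(Y = y_k \mid X = x_j)$, so that $p_{X,Y}(x_j, y_k) = p_X(x_j)\,\mathtt{channel[j][k]}$.

```python
import math


def mutual_information(p_x, channel):
    """Mutual information I(X;Y) in bits; channel[j][k] = P(Y=y_k | X=x_j)."""
    num_out = len(channel[0])
    p_y = [sum(p_x[j] * channel[j][k] for j in range(len(p_x)))
           for k in range(num_out)]
    total = 0.0
    for j, px in enumerate(p_x):
        for k in range(num_out):
            joint = px * channel[j][k]
            if joint > 0.0:  # terms with zero joint probability contribute 0
                total += joint * math.log2(joint / (px * p_y[k]))
    return total


def blahut_arimoto(channel, iterations=200):
    """Approximate sup over p_X of I(X;Y), starting from a uniform input."""
    num_in, num_out = len(channel), len(channel[0])
    p_x = [1.0 / num_in] * num_in
    for _ in range(iterations):
        p_y = [sum(p_x[j] * channel[j][k] for j in range(num_in))
               for k in range(num_out)]
        # Reweight each input by exp of the KL divergence (in nats) between
        # its output row and the current output distribution.
        weights = []
        for j in range(num_in):
            divergence = sum(w * math.log(w / p_y[k])
                             for k, w in enumerate(channel[j]) if w > 0.0)
            weights.append(p_x[j] * math.exp(divergence))
        norm = sum(weights)
        p_x = [w / norm for w in weights]
    return mutual_information(p_x, channel), p_x


if __name__ == "__main__":
    # Two written levels, each misread as the other with probability 0.1.
    flip = 0.1
    symmetric = [[1.0 - flip, flip], [flip, 1.0 - flip]]
    capacity, best_input = blahut_arimoto(symmetric)
    print(f"capacity ~ {capacity:.4f} bits, input distribution {best_input}")
```

For the symmetric two-level example the uniform input is already optimal, and the result should agree with the textbook value $1 - H_2(0.1)$, where $H_2$ is the binary entropy function.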

If the written variable is corrupted by Gaussian noise of variance $\sigma^2$, then we intuitively expect that the number of levels in $[0,1]$ that we can tell apart will be of the order of $\sigma^{-1}$, so we expect that roughly $-\log_2 \sigma$ bits will be storable in the continuous interval. Indeed, if we apply the noisy channel coding theorem above to this situation, we find the Shannon-Hartley theorem, which is the noisy channel coding theorem for an additive Gaussian noise channel:

$$C = \frac{1}{2}\log_2(1 + SNR) = \frac{1}{2}\log_2\left(1 + \frac{1}{\sigma^2}\right)$$

bits per symbol, which approaches our intuitive expression $-\log_2 \sigma$ as $SNR = \sigma^{-2}\to\infty$. Here $SNR$ is the signal-to-noise ratio, which equals $\sigma^{-2}$ because the signal power has been normalized to unity.
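A quick numerical check of this limit, as a sketch with the signal power taken to be unity (the function names are illustrative only):

```python
import math


def gaussian_capacity_bits(sigma: float, signal_power: float = 1.0) -> float:
    """Shannon-Hartley capacity per symbol, C = 0.5 * log2(1 + SNR)."""
    return 0.5 * math.log2(1.0 + signal_power / sigma**2)


def intuitive_bits(sigma: float) -> float:
    """The level-counting estimate -log2(sigma)."""
    return -math.log2(sigma)


if __name__ == "__main__":
    for sigma in (1.0, 0.1, 0.01, 0.001):
        exact = gaussian_capacity_bits(sigma)
        rough = intuitive_bits(sigma)
        print(f"sigma={sigma:<6} C={exact:.6f}  -log2(sigma)={rough:.6f}  "
              f"gap={exact - rough:.2e}")
```

The gap between the two expressions is $\frac{1}{2}\log_2(1+\sigma^2)$, which is half a bit at $\sigma = 1$ and shrinks to nothing as the noise falls.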

It is important to take heed of the remarkable fact that $C$ represents a situation arbitrarily near to perfect, noiseless information storage and is not a “rough measure of storable bits”. That is, the noisy channel coding theorem takes exact account of the possibility of error correcting coding spread over many such information storage variables. It assumes we have a large number of these unit intervals and that we spread our coded information over this large number and deliberately introduce correlations between them through codeword structure so as to detect and correct errors. If we are allowed to do this over an arbitrarily large number of these unit intervals, then the theorem shows us that we can noiselessly encode $C$ bits per continuous variable, with the probability of any errors (after error correction) approaching nought as the number of coded variables gathered into each codeword increases without bound.
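To make the idea of spreading information over many variables concrete, here is a deliberately crude sketch: a repetition code with majority voting. It assumes each stored variable holds one binary level that is misread independently with probability `flip_prob`, a stand-in for the Gaussian channel above. This is not the kind of code the theorem promises, since its rate $1/n$ falls to zero as the block grows, whereas the theorem guarantees codes whose rate stays fixed below $C$. It shows only that correlations across variables let errors be corrected.

```python
import math


def majority_vote_error(block_length: int, flip_prob: float) -> float:
    """Probability that majority voting over a repetition block decodes wrongly."""
    if block_length % 2 == 0:
        raise ValueError("block_length must be odd so that the vote cannot tie")
    threshold = (block_length + 1) // 2  # this many flips overturn the vote
    return sum(
        math.comb(block_length, k)
        * flip_prob**k
        * (1.0 - flip_prob) ** (block_length - k)
        for k in range(threshold, block_length + 1)
    )


if __name__ == "__main__":
    flip_prob = 0.1
    for block_length in (1, 3, 5, 11, 21):
        error = majority_vote_error(block_length, flip_prob)
        print(f"n={block_length:2d}  rate={1 / block_length:.3f} bits/variable  "
              f"decoding error={error:.3e}")
```

The decoding error falls quickly with block length, but only by giving up rate; the content of the theorem is that better codes avoid that sacrifice.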

This is why the theorem is so ingenious: without constructing the code, it shows that there exists one that will come arbitrarily near to achieving perfect storage, as long as we demand fewer than $C$ bits per symbol. It also shows that if we try to store $C+\epsilon$ bits per symbol, for any $\epsilon>0$, then the probability of errors approaches unity as the number of read/write cycles approaches infinity, whichever coding scheme we may use. $C$ truly does represent the exact capacity of a noisy continuous variable.