Ch15 · H — Entropy and the Compression Circle

§1 · Shannon Entropy — The Measure of Surprise

In 1948, Claude Shannon published "A Mathematical Theory of Communication" and defined entropy as the average surprise in a probability distribution. An event that always happens carries zero information: you already knew it was coming. The rarer an event, the more information its occurrence carries: an outcome with probability p has a surprisal of −log₂ p bits, which grows without bound as p approaches 0.

Definition 15.1: Shannon Entropy

H(X) = −Σᵢ pᵢ log₂ pᵢ (bits)
pᵢ = probability of outcome i

Special cases:
pᵢ = 1 for one i (certainty): H = 0 bits (no surprise)
pᵢ = 1/n for all i (uniform): H = log₂ n (maximum)

H(fair coin) = log₂ 2 = 1.000 bit
H(DNA base) = log₂ 4 = 2.000 bits
H(amino acid) = log₂ 20 = 4.322 bits (if uniform)
H(English word) ≈ 11.8 bits (Shannon 1951 estimate for natural language)

Source coding theorem: optimal code length = H(X) bits/symbol
No lossless compression can go below H(X); it is the entropy floor.
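The special cases in Definition 15.1 can be checked numerically. The following is a minimal sketch, not a reference implementation; the function names `shannon_entropy` and `uniform` are illustrative choices, and the biased-coin probabilities are an assumed example.

```python
import math

def shannon_entropy(probs):
    """Average surprisal in bits: sum of p * log2(1/p) over outcomes with p > 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def uniform(n):
    """Uniform distribution over n outcomes."""
    return [1.0 / n] * n

if __name__ == "__main__":
    print(f"certainty   : {shannon_entropy([1.0]):.3f} bits")
    print(f"fair coin   : {shannon_entropy(uniform(2)):.3f} bits")
    print(f"DNA base    : {shannon_entropy(uniform(4)):.3f} bits")
    print(f"amino acid  : {shannon_entropy(uniform(20)):.3f} bits")
    # Assumed example: a coin that lands heads 90% of the time
    print(f"biased coin : {shannon_entropy([0.9, 0.1]):.3f} bits")
```

The biased coin comes out below 1 bit, which is the general pattern: any departure from uniformity lowers H.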

H = 0

Certainty

One outcome has probability 1. No information — nothing new to learn. A paper that only confirms what everyone already knows.

H ≈ 0.5

Structured

A few outcomes dominate. Predictable but not trivial. A paper with a clear thesis and supporting evidence.

H ≈ log₂n − 1

Informative

Near-uniform distribution with slight structure. Maximally informative within a framework. The ideal research paper.

H = log₂n

Maximum Entropy

Uniform distribution — all outcomes equally likely. Pure noise. A paper with no coherent argument, only unstructured data.

§2 · Operator C — Compression as Entropy Reduction

Chapter 4 introduced C as the Compression operator — the first step in G = U ∘ F ∘ K ∘ C. Now we can define it precisely: C reduces the entropy of the data representation while preserving the mutual information with the target claim.

Compression as Entropy Reduction

Raw data X: H(X) = high (many variables, redundancy, noise)
Compressed C(X): H(C(X)) < H(X) (fewer bits, same signal)
Information preserved: I(C(X) ; Y) ≈ I(X ; Y), where Y = the downstream claim you are trying to support
Rate-distortion: min H(C(X)) subject to I(C(X) ; Y) ≥ I_min

Lossless C: H(C(X)) = H(X) (e.g., gzip, with exact reconstruction)
Lossy C: H(C(X)) < H(X) with controlled distortion

Methods section = a lossy compression of the experiment → enough bits to reconstruct the key results, not the entire raw dataset
Results section = decompressed C(X) mapped to human-readable space → tables, figures, statistics are the decoded representation
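The lossless case can be illustrated with Python's standard `zlib` module, which uses the same DEFLATE algorithm as gzip. This is a sketch under stated assumptions: the two inputs (a repeated "ACGT" string and seeded pseudo-random bytes) and the helper name `compressed_size` are invented for illustration. Redundant input shrinks dramatically, while input that is already near 8 bits/byte cannot be squeezed further, and both round-trip exactly.

```python
import random
import zlib

def compressed_size(raw: bytes) -> int:
    """Compress losslessly, verify exact reconstruction, return the compressed length."""
    packed = zlib.compress(raw, 9)
    assert zlib.decompress(packed) == raw  # lossless: nothing is thrown away
    return len(packed)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

redundant = b"ACGT" * (n // 4)                       # highly predictable input
noisy = bytes(rng.getrandbits(8) for _ in range(n))  # close to 8 bits of entropy per byte

if __name__ == "__main__":
    for label, raw in [("redundant", redundant), ("noisy", noisy)]:
        print(f"{label:9s}: {len(raw)} -> {compressed_size(raw)} bytes")
```

A lossy C, by contrast, would discard part of the input on purpose; this sketch does not model that case.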

The classic error in Methods sections: either too high-entropy (every protocol detail, no compression — Methods becomes unreadable) or too low-entropy (so compressed that replication fails — the reader cannot reconstruct the experiment from the description). The optimal Methods section achieves the Shannon bound: minimum bits needed for replication, no more, no fewer.

The Compression Circle: G = U ∘ F ∘ K ∘ C starts by reducing entropy (C), then builds structure through K and F, then expands into Discussion (U). Ch15 closes the loop — the Discussion's claims become the prior for the next paper's C operator, compressing toward what the field now knows. Science is a self-compressing information system.

§3 · Efficient Coding — Biology at Maximum Entropy

Horace Barlow's efficient coding hypothesis (1961) proposes that the nervous system has evolved to encode stimuli in a way that maximises information transmission — operating near maximum entropy given the statistics of the natural environment.

This is why retinal ganglion cells respond strongly to edges (high spatial information content — unexpected luminance changes) and weakly to uniform fields (low information content — predictable). The visual system has learned the entropy structure of natural scenes and adapted its code to match.

Biological Information Densities

DNA (4 bases, ~uniform coding regions): H ≈ 1.9 bits/nucleotide (theoretical max: 2.0 bits)
Amino acid sequence (20 aa, non-uniform): H ≈ 4.2 bits/residue (theoretical max: 4.32 bits)
Human genome total information: ~750 MB at the 2-bit maximum (3×10⁹ bp haploid × 2 bits ÷ 8 bits/byte), or ~710 MB at 1.9 bits/nucleotide
V(D)J recombination (antibody diversity): ~10¹⁵ possible sequences → H ≈ 50 bits per antibody locus
This IS maximum entropy of the immune recognition space
Goal: no antigen should be entirely unpredictable to the system
Drosophila connectome (2023, Schlegel et al.): 130,000 neurons · 50M synapses · ~10⁷ bits of connectivity information
The most information-dense biological structure formally verified
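The genome and antibody figures are simple arithmetic on sequence length and bits per symbol, which can be checked as follows. This is a rough sketch: the rounded genome sizes (3×10⁹ bp haploid, 6×10⁹ bp diploid) are assumptions, and `sequence_megabytes` is an illustrative helper name.

```python
import math

BITS_PER_BYTE = 8

def sequence_megabytes(n_symbols: float, bits_per_symbol: float) -> float:
    """Storage needed for n_symbols at a given information density, in MB (10^6 bytes)."""
    return n_symbols * bits_per_symbol / BITS_PER_BYTE / 1e6

HAPLOID_BP = 3e9  # assumed rounded size of one copy of the human genome
DIPLOID_BP = 6e9  # both copies

if __name__ == "__main__":
    print(f"haploid at 2.0 bits/base: {sequence_megabytes(HAPLOID_BP, 2.0):.1f} MB")
    print(f"haploid at 1.9 bits/base: {sequence_megabytes(HAPLOID_BP, 1.9):.1f} MB")
    print(f"diploid at 1.9 bits/base: {sequence_megabytes(DIPLOID_BP, 1.9):.1f} MB")
    # Entropy of a uniform choice among ~10^15 antibody sequences
    print(f"V(D)J repertoire        : {math.log2(1e15):.1f} bits")
```

The ~750 MB figure corresponds to a single haploid copy stored at the 2-bit maximum; a diploid genome at 1.9 bits/base comes to roughly twice that.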

V(D)J recombination deliberately maximises the entropy of the antibody repertoire, ensuring the immune system has a non-zero probability of recognising any possible antigen. This is the K operator (Chapter 5) serving the C operator: maximum repertoire entropy ensures that any pathogen can be compressed into a specific antibody shape.

Writing Context	H (bits/unit)	Entropy State	Interpretation
Abstract: every sentence predictable from the last	H ≈ 0.2	Too low	Redundant, says nothing new per sentence
Introduction: contextualised claim	H ≈ 1.5	Optimal	Each sentence advances the argument
Results: findings plus statistics	H ≈ 2.5	High-value	Dense with information, warrants careful reading
Discussion: every sentence a new speculation	H ≈ 4.0	Too high	Unanchored — ρ > 1 (Ch11), sorry accumulation (Ch14)

§4 · KL Divergence — How Surprising Is Your Paper?

The Kullback-Leibler divergence D_KL(P ‖ Q) measures how different a distribution P is from an expected distribution Q. It equals the extra bits needed to encode samples from P using a code optimised for Q.

KL Divergence and Paper Novelty

D_KL(paper ‖ field_prior) = Σ p(claim) · log₂ [p(claim) / q(claim)]
p = distribution of claims in your paper
q = distribution of claims the field already expects

D_KL = 0: your paper is exactly what the field predicted (no novelty)
D_KL small: modest advance, confirms with refinements
D_KL moderate: genuine contribution, surprising but in the right direction
D_KL large: paradigm-shifting (or wrong; the reviewer decides)

Mutual information of your paper with the field's unknowns:
I(paper ; gap) = H(gap) − H(gap | paper) = entropy of the open question BEFORE reading minus AFTER reading
A paper's scientific value ≈ I(paper ; gap) bits
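The divergence formula can be computed directly for small discrete distributions. The sketch below is illustrative only: the three-claim distributions (`field_prior`, `refining`, `surprising`) are invented numbers, not measurements of any real field, and `kl_divergence` is a hypothetical helper name.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions over the same claims."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same set of claims")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # claims the paper never makes contribute nothing
        if qi == 0:
            return math.inf   # the paper asserts something the prior rules out entirely
        total += pi * math.log2(pi / qi)
    return total

# Hypothetical weights over three candidate claims
field_prior = [0.70, 0.20, 0.10]   # q: what the field expects
refining    = [0.60, 0.30, 0.10]   # p: a modest shift of emphasis
surprising  = [0.10, 0.20, 0.70]   # p: weight moved onto the least expected claim

if __name__ == "__main__":
    for label, p in [("confirming", field_prior),
                     ("refining", refining),
                     ("surprising", surprising)]:
        print(f"{label:10s}: {kl_divergence(p, field_prior):.3f} bits")
```

Note that D_KL is not symmetric: swapping the paper and the prior generally gives a different value.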

Theorem 15.1 — The Information Value of a Paper

Let Q be the field's prior distribution over claims (what is expected before reading the paper), and P be the paper's claimed distribution (what is asserted after). The paper's information value is:

V(paper) = I(paper ; gap) = H(Q) − H(Q | P)

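where gap = the set of open questions the paper addresses. Here H(Q) is shorthand for H(gap), the entropy of the open question under the field's prior, and H(Q | P) is shorthand for H(gap | paper), the entropy that remains after reading.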

A paper maximises V by: (1) targeting a high-entropy gap H(Q) — an open question with many plausible answers;
(2) reducing H(Q|P) as much as possible — narrowing the remaining uncertainty after reading;
(3) ensuring D_KL(P ‖ Q) is neither 0 (trivial) nor ∞ (incomprehensible);
(4) keeping ρ(P) ≤ 1 (Ch11) — so the reduction in uncertainty is warranted.
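As a toy illustration of the information value V = H(gap) − H(gap | paper), consider an open question with four equally plausible answers and a paper whose result can come out in several ways. The joint tables below are invented for illustration, and `information_value` is a hypothetical helper; this is a sketch of the mutual-information calculation, not a method for scoring real papers.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def information_value(joint):
    """I(paper ; gap) = H(gap) - H(gap | paper), with joint[r][a] = P(result r, answer a)."""
    n_answers = len(joint[0])
    p_gap = [sum(row[a] for row in joint) for a in range(n_answers)]
    h_gap = entropy(p_gap)
    h_gap_given_paper = 0.0
    for row in joint:
        p_result = sum(row)
        if p_result > 0:
            h_gap_given_paper += p_result * entropy([x / p_result for x in row])
    return h_gap - h_gap_given_paper

# Four equally likely answers (H(gap) = 2 bits); rows are possible results of the paper.
uninformative = [[0.125] * 4, [0.125] * 4]               # result says nothing about the answer
halving = [[0.25, 0.25, 0, 0], [0, 0, 0.25, 0.25]]       # result rules out half the answers
decisive = [[0.25, 0, 0, 0], [0, 0.25, 0, 0],
            [0, 0, 0.25, 0], [0, 0, 0, 0.25]]            # result pins down the answer

if __name__ == "__main__":
    for label, joint in [("uninformative", uninformative),
                         ("halving", halving),
                         ("decisive", decisive)]:
        print(f"{label:13s}: V = {information_value(joint):.1f} bits")
```

The value is capped by H(gap): a paper cannot resolve more uncertainty than the open question contained, which is why condition (1) favours high-entropy gaps.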

⬡ LLM Prompt Portal · Chapter 15

PROMPT 8.2 · PAPER ENTROPY AUDIT

Identify High- and Low-Information Sentences

Read your paper one paragraph at a time. For each sentence, estimate its information content:
H ≈ 0 = says something already implied by the previous sentence (redundant — cut or merge)
H ≈ 1 = advances the argument by one logical step (good)
H ≈ 2–3 = introduces a new finding, mechanism, or claim (high value — protect these)
H > 3 = makes a claim that is completely unconnected to the previous sentence (check: is this a new section, or a sorry?)
For each section (Introduction / Methods / Results / Discussion / Conclusion), compute the average H per sentence. Flag any section where average H < 0.5 (too redundant) or average H > 3 (too disconnected).
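Once per-sentence scores have been assigned, the section averages and flags can be tallied mechanically. The sketch below assumes the scores are already available as lists of numbers; the example scores and the name `audit_sections` are hypothetical, and the thresholds are the 0.5 and 3 given in the prompt.

```python
from statistics import mean

LOW, HIGH = 0.5, 3.0  # flagging thresholds taken from the prompt above

def audit_sections(scores_by_section):
    """Return {section: (average H per sentence, flag)} from per-sentence estimates."""
    report = {}
    for section, scores in scores_by_section.items():
        avg = mean(scores)
        if avg < LOW:
            flag = "too redundant"
        elif avg > HIGH:
            flag = "too disconnected"
        else:
            flag = "ok"
        report[section] = (round(avg, 2), flag)
    return report

if __name__ == "__main__":
    # Hypothetical per-sentence estimates for three sections
    example = {
        "Introduction": [0, 0, 1, 0],
        "Results": [2, 3, 2, 1],
        "Discussion": [4, 4, 3, 4],
    }
    for section, (avg, flag) in audit_sections(example).items():
        print(f"{section:12s} avg H = {avg:.2f}  {flag}")
```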

PROMPT 8.3 · KL DIVERGENCE FROM FIELD EXPECTATIONS

Measure Your Paper's Novelty

Answer these questions to estimate D_KL(paper ‖ field_prior):
(a) What would a well-read researcher in your field expect this paper to find, before reading it? List 3–5 expected claims (this is Q).
(b) What does your paper actually claim? List 3–5 actual claims (this is P).
(c) For each claim in P, is it in Q (expected), a refinement of Q (small KL), or new information not in Q (large KL)?
(d) Estimate overall novelty: 0 (expected) → 1 (modest advance) → 2 (significant contribution) → 3 (paradigm-shifting).
(e) Is D_KL appropriate for your venue? High D_KL is right for Nature/Science; moderate D_KL for specialist journals; low D_KL for replication studies.

EXTENSION · EFFICIENT ABSTRACT

Maximise Information per Word

Apply the efficient coding principle to your abstract:
Count the words. For each sentence, ask: if I removed this sentence, how much would a reader's uncertainty about my paper's contribution increase?
This is the mutual information I(sentence ; paper) — the sentence's contribution in bits.
Rank sentences by I value. Keep the top 5 (highest information density). Rewrite the abstract using only those sentences plus the minimum connective tissue needed for coherence.
Target: every sentence should reduce the reader's uncertainty by at least 0.5 bits. No sentence should restate what the previous sentence already implied.
