G3 · Part V · Practitioner  ·  G  ·  D1–D2  ·  Book 3 · 3/6
INFORMATION THEORY · CHAPTER 15

Entropy and the Compression Circle

H(X) measures what you cannot predict. A paper's value is its mutual information with what the field does not yet know.

Week: 16+
Level: D2
Operator: C (revisited)
Focus: Shannon Entropy · KL Divergence · Efficient Coding

§1 · Shannon Entropy — The Measure of Surprise

In 1948, Claude Shannon published "A Mathematical Theory of Communication" and defined entropy as the average surprise in a probability distribution. An event that always happens carries zero information: you already knew it was coming. A rare event carries a lot of information when it does occur, because its surprise, −log₂ p, grows without bound as p approaches 0. Entropy averages that surprise over all outcomes, weighting each by its probability.

Definition 15.1 — Shannon Entropy
H(X) = −Σᵢ pᵢ log₂ pᵢ    (bits)

pᵢ = probability of outcome i

Special cases:
pᵢ = 1 for one i (certainty):   H = 0 bits (no surprise)
pᵢ = 1/n for all i (uniform):   H = log₂ n bits (maximum)

H(fair coin) = log₂ 2 = 1.000 bit
H(DNA base) = log₂ 4 = 2.000 bits (if uniform)
H(amino acid) = log₂ 20 = 4.322 bits (if uniform)
H(English word) ≈ 11.8 bits (Shannon 1951 estimate for natural language)

Source coding theorem: the expected length of an optimal lossless code is at least H(X) bits/symbol, and coding long blocks brings it arbitrarily close to H(X).
No lossless compression can average fewer than H(X) bits/symbol. Entropy is the floor.
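Definition 15.1 and its special cases can be checked numerically. The following is a minimal Python sketch, assuming the distribution is given as a plain list of probabilities; the names `shannon_entropy` and `uniform` are illustrative and not part of the chapter.

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum(p_i * log2(p_i)) in bits; zero-probability outcomes contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log2(p) for p in probs if p > 0)

def uniform(n):
    """Uniform distribution over n outcomes (illustrative helper)."""
    return [1.0 / n] * n

h_certain = shannon_entropy([1.0, 0.0, 0.0])  # certainty: one outcome has p = 1
h_coin = shannon_entropy(uniform(2))          # fair coin
h_dna = shannon_entropy(uniform(4))           # DNA base, if all four are equally likely
h_amino = shannon_entropy(uniform(20))        # amino acid, if all twenty are equally likely
h_biased = shannon_entropy([0.9, 0.1])        # a biased coin sits between 0 and 1 bit

for label, h in [("certainty", h_certain), ("fair coin", h_coin),
                 ("DNA base", h_dna), ("amino acid", h_amino),
                 ("biased coin", h_biased)]:
    print(f"{label:12s} H = {h:.3f} bits")
```

The biased coin is included only to show that any departure from uniform lowers H below log₂ n.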
H = 0 · Certainty
One outcome has probability 1. No information: nothing new to learn. A paper that only confirms what everyone already knows.

H ≈ 0.5 · Structured
A few outcomes dominate. Predictable but not trivial. A paper with a clear thesis and supporting evidence.

H ≈ log₂n − 1 · Informative
Near-uniform distribution with slight structure. Maximally informative within a framework. The ideal research paper.

H = log₂n · Maximum Entropy
Uniform distribution: all outcomes equally likely. Pure noise. A paper with no coherent argument, only unstructured data.

§2 · Operator C — Compression as Entropy Reduction

Chapter 4 introduced C as the Compression operator — the first step in G = U ∘ F ∘ K ∘ C. Now we can define it precisely: C reduces the entropy of the data representation while preserving the mutual information with the target claim.

Compression as Entropy Reduction
Raw data X:        H(X) = high (many variables, redundancy, noise)
Compressed C(X):   H(C(X)) < H(X) (fewer bits, same signal)

Information preserved: I(C(X) ; Y) ≈ I(X ; Y)
where Y = the downstream claim you are trying to support

Rate-distortion: min H(C(X)) subject to I(C(X) ; Y) ≥ I_min

Lossless C: H(C(X)) = H(X) (e.g., gzip: exact reconstruction)
Lossy C:    H(C(X)) < H(X) with controlled distortion

Methods section = a lossy compression of the experiment
  → enough bits to reconstruct the key results, not the entire raw dataset
Results section = decompressed C(X) mapped to human-readable space
  → tables, figures, statistics are the decoded representation
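A toy case makes the two conditions concrete: H(C(X)) < H(X) while I(C(X) ; Y) = I(X ; Y). The sketch below is an assumption-laden illustration, not the chapter's definition of C: X is a pair of independent fair bits (one "signal", one "noise"), Y depends only on the signal bit, and the hypothetical `compress` function simply drops the noise bit.

```python
import math
from collections import Counter
from itertools import product

def entropy(values):
    """Entropy in bits of a list of equally likely observations."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def mutual_information(pairs):
    """I(A;B) = H(A) + H(B) - H(A,B) for a list of equally likely (a, b) pairs."""
    a_values = [a for a, _ in pairs]
    b_values = [b for _, b in pairs]
    return entropy(a_values) + entropy(b_values) - entropy(pairs)

# Toy world (assumption): X = (signal_bit, noise_bit), all four combinations equally likely.
outcomes = list(product([0, 1], repeat=2))

def compress(x):
    """Hypothetical lossy C: keep the signal bit, discard the noise bit."""
    return x[0]

def claim(x):
    """Hypothetical downstream claim Y: depends on the signal bit only."""
    return x[0]

h_x = entropy(outcomes)                                    # 2 bits
h_cx = entropy([compress(x) for x in outcomes])            # 1 bit
i_x_y = mutual_information([(x, claim(x)) for x in outcomes])
i_cx_y = mutual_information([(compress(x), claim(x)) for x in outcomes])

print(f"H(X) = {h_x:.1f}, H(C(X)) = {h_cx:.1f}")
print(f"I(X;Y) = {i_x_y:.1f}, I(C(X);Y) = {i_cx_y:.1f}")
```

Here the compression halves the entropy and loses none of the information about Y. Real data rarely separates this cleanly, which is why the chapter states the condition as an approximation.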

The classic error in Methods sections: either too high-entropy (every protocol detail, no compression — Methods becomes unreadable) or too low-entropy (so compressed that replication fails — the reader cannot reconstruct the experiment from the description). The optimal Methods section achieves the Shannon bound: minimum bits needed for replication, no more, no fewer.

The Compression Circle: G = U ∘ F ∘ K ∘ C starts by reducing entropy (C), then builds structure through K and F, then expands into Discussion (U). Ch15 closes the loop — the Discussion's claims become the prior for the next paper's C operator, compressing toward what the field now knows. Science is a self-compressing information system.

§3 · Efficient Coding — Biology at Maximum Entropy

Horace Barlow's efficient coding hypothesis (1961) proposes that the nervous system has evolved to encode stimuli in a way that maximises information transmission: it strips out the redundancy in sensory input, so the neural code operates near maximum entropy given the statistics of the natural environment.

This is why retinal ganglion cells respond strongly to edges (high spatial information content — unexpected luminance changes) and weakly to uniform fields (low information content — predictable). The visual system has learned the entropy structure of natural scenes and adapted its code to match.

Biological Information Densities
DNA (4 bases, ~uniform coding regions):   H ≈ 1.9 bits/nucleotide (theoretical max: 2.0 bits)

Amino acid sequence (20 aa, non-uniform):   H ≈ 4.2 bits/residue (theoretical max: 4.32 bits)

Human genome total information:   ~710 MB (3×10⁹ bp haploid × 1.9 bits / 8 bits/byte); ~750 MB at the full 2 bits/base. The diploid genome (6×10⁹ bp) doubles both figures.

V(D)J recombination (antibody diversity):
  ~10¹⁵ possible sequences → H ≈ 50 bits per antibody locus (log₂ 10¹⁵ ≈ 49.8, if all sequences are equally likely)
  This IS maximum entropy of the immune recognition space
  Goal: no antigen should be entirely unpredictable to the system

Drosophila connectome (2023, Schlegel et al.):   130,000 neurons · 50M synapses · ~10⁷ bits of connectivity information   The most information-dense biological structure formally verified

V(D)J recombination deliberately maximizes the entropy of the antibody repertoire — ensuring the immune system has a non-zero probability of recognising any possible antigen. This is the K operator (Chapter 5) serving the C operator: maximum repertoire entropy ensures that any pathogen can be compressed into a specific antibody shape.
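The densities quoted above reduce to short arithmetic, sketched here in Python. The haploid length of 3×10⁹ bp is an assumption supplied for the comparison (6×10⁹ bp is the diploid count), and the V(D)J figure assumes all 10¹⁵ sequences are equally likely.

```python
import math

BITS_PER_BYTE = 8

def sequence_megabytes(length_bp, bits_per_base):
    """Storage in megabytes (10^6 bytes) for a sequence at a given information density."""
    return length_bp * bits_per_base / BITS_PER_BYTE / 1e6

HAPLOID_BP = 3e9   # assumption: one copy of the human genome
DIPLOID_BP = 6e9   # both copies

haploid_max_mb = sequence_megabytes(HAPLOID_BP, 2.0)   # theoretical max, 2 bits/base
haploid_est_mb = sequence_megabytes(HAPLOID_BP, 1.9)   # at H ≈ 1.9 bits/nucleotide
diploid_est_mb = sequence_megabytes(DIPLOID_BP, 1.9)

vdj_bits = math.log2(1e15)        # entropy of a uniform choice among 10^15 sequences
amino_max_bits = math.log2(20)    # theoretical max per residue

print(f"haploid, 2.0 bits/base: {haploid_max_mb:.1f} MB")
print(f"haploid, 1.9 bits/base: {haploid_est_mb:.1f} MB")
print(f"diploid, 1.9 bits/base: {diploid_est_mb:.1f} MB")
print(f"V(D)J repertoire:       {vdj_bits:.2f} bits")
print(f"amino acid maximum:     {amino_max_bits:.2f} bits/residue")
```

The roughly 750 MB figure corresponds to one haploid copy at 2 bits per base; counting both copies at 1.9 bits per base gives about 1.4 GB.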

Writing Context | H (bits/unit) | Entropy State | Interpretation
Abstract: every sentence predictable from the last | H ≈ 0.2 | Too low | Redundant, says nothing new per sentence
Introduction: contextualised claim | H ≈ 1.5 | Optimal | Each sentence advances the argument
Results: findings plus statistics | H ≈ 2.5 | High-value | Dense with information, warrants careful reading
Discussion: every sentence a new speculation | H ≈ 4.0 | Too high | Unanchored: ρ > 1 (Ch11), sorry accumulation (Ch14)

§4 · KL Divergence — How Surprising Is Your Paper?

The Kullback-Leibler divergence D_KL(P ‖ Q) measures how far a distribution P departs from a reference distribution Q. It equals the expected number of extra bits per sample needed to encode samples from P using a code optimised for Q. It is not symmetric: D_KL(P ‖ Q) generally differs from D_KL(Q ‖ P).
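The "extra bits" reading can be verified directly: the cross-entropy H(P, Q) is the average code length when samples from P are encoded with a code built for Q, and D_KL(P ‖ Q) = H(P, Q) − H(P). A minimal Python sketch follows; the two example distributions are made up for illustration.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q): average bits per sample from P under a code optimised for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; infinite if P puts mass where Q puts none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf     # P asserts something Q rules out entirely
        total += pi * math.log2(pi / qi)
    return total

# Illustrative distributions over three outcomes (assumed values).
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]

extra_bits = cross_entropy(p, q) - entropy(p)
print(f"D_KL(P || Q) = {kl_divergence(p, q):.4f} bits")
print(f"H(P, Q) - H(P) = {extra_bits:.4f} bits")
print(f"D_KL(Q || P) = {kl_divergence(q, p):.4f} bits")
```

The first two printed values agree, and the third differs, which shows the asymmetry.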

KL Divergence and Paper Novelty
D_KL(paper ‖ field_prior) = Σ p(claim) · log₂ [p(claim) / q(claim)]

p = distribution of claims in your paper
q = distribution of claims the field already expects

D_KL = 0: your paper is exactly what the field predicted (no novelty)
D_KL small: modest advance, confirms with refinements
D_KL moderate: genuine contribution, surprising but in the right direction
D_KL large: paradigm-shifting (or wrong; the reviewer decides)

Mutual information of your paper with the field's unknowns:
I(paper ; gap) = H(gap) − H(gap | paper)
               = entropy of the open question BEFORE reading minus AFTER reading

A paper's scientific value ≈ I(paper ; gap) bits
Theorem 15.1 — The Information Value of a Paper
Let Q be the field's prior distribution over claims (what is expected before reading the paper), and P be the paper's claimed distribution (what is asserted after). The paper's information value is:

V(paper) = I(paper ; gap) = H(Q) − H(Q | P)

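where gap = the set of open questions the paper addresses. Here H(Q) is shorthand for H(gap), the entropy of the gap under the field's prior, and H(Q | P) for H(gap | paper), the entropy that remains once the paper's claims are known.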

A paper maximises V by: (1) targeting a high-entropy gap H(Q) — an open question with many plausible answers;
(2) reducing H(Q|P) as much as possible — narrowing the remaining uncertainty after reading;
(3) ensuring D_KL(P ‖ Q) is neither 0 (trivial) nor ∞ (incomprehensible);
(4) keeping ρ(P) ≤ 1 (Ch11) — so the reduction in uncertainty is warranted.
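The quantity I(paper ; gap) = H(gap) − H(gap | paper) can be computed once a joint distribution over the paper's possible findings and the answers to the open question is written down. The sketch below is a toy model under stated assumptions: the joint tables are invented, and `information_value` is an illustrative name, not an established measure.

```python
import math

def entropy(probs):
    """Entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_value(joint):
    """Return (H(gap), H(gap | paper), I(paper ; gap)).

    joint maps each possible finding of the paper to a list of joint
    probabilities, one per candidate answer to the open question.
    """
    rows = list(joint.values())
    n_answers = len(rows[0])
    gap_marginal = [sum(row[j] for row in rows) for j in range(n_answers)]
    h_gap = entropy(gap_marginal)
    h_gap_given_paper = 0.0
    for row in rows:
        p_finding = sum(row)
        if p_finding > 0:
            posterior = [p / p_finding for p in row]
            h_gap_given_paper += p_finding * entropy(posterior)
    return h_gap, h_gap_given_paper, h_gap - h_gap_given_paper

# Assumed gap: four equally plausible answers, so H(gap) = 2 bits.
narrowing = {"finding_1": [0.25, 0.25, 0.0, 0.0],      # rules out two answers
             "finding_2": [0.0, 0.0, 0.25, 0.25]}
uninformative = {"finding_1": [0.125] * 4,             # independent of the answer
                 "finding_2": [0.125] * 4}

_, _, v_narrowing = information_value(narrowing)
_, _, v_uninformative = information_value(uninformative)
print(f"narrowing paper:     V = {v_narrowing:.1f} bits")
print(f"uninformative paper: V = {v_uninformative:.1f} bits")
```

In this toy model the paper that halves the set of plausible answers is worth 1 bit, and the paper whose findings are independent of the answer is worth 0 bits.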

[Interactive widget: Shannon Entropy Calculator. Sliders set event probabilities (auto-normalized to sum to 1). Readouts: H(X) in bits, H_max in bits (shown as 2.000), efficiency in %, KL divergence from uniform, and entropy state.]
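The calculator's numeric readouts can be reproduced offline. This is a sketch of what such a calculator might compute, not the page's actual implementation: `entropy_readouts` is a hypothetical name, efficiency is assumed to mean H / H_max, and the "State" readout is omitted because its thresholds are not given in the text. It relies on the identity D_KL(P ‖ uniform) = log₂ n − H(P).

```python
import math

def entropy_readouts(slider_values):
    """Normalise raw slider values to probabilities and compute the readouts."""
    total = sum(slider_values)
    if total <= 0 or any(v < 0 for v in slider_values):
        raise ValueError("sliders must be non-negative with a positive sum")
    probs = [v / total for v in slider_values]          # auto-normalise to sum = 1
    h = -sum(p * math.log2(p) for p in probs if p > 0)  # H(X) in bits
    h_max = math.log2(len(probs))                       # entropy of the uniform case
    efficiency = 100.0 * h / h_max if h_max > 0 else 0.0
    return {
        "H": h,
        "H_max": h_max,
        "efficiency_pct": efficiency,
        "kl_from_uniform": h_max - h,                   # D_KL(P || uniform)
    }

even = entropy_readouts([1, 1, 1, 1])     # four equal sliders
skewed = entropy_readouts([4, 2, 1, 1])   # probabilities 0.5, 0.25, 0.125, 0.125

for name, r in [("even", even), ("skewed", skewed)]:
    print(f"{name:7s} H = {r['H']:.3f}  H_max = {r['H_max']:.3f}  "
          f"eff = {r['efficiency_pct']:.1f}%  KL = {r['kl_from_uniform']:.3f}")
```

With four sliders H_max is 2.000 bits, and any unevenness lowers the efficiency and raises the KL divergence from uniform by the same number of bits that H loses.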

⬡ LLM Prompt Portal · Chapter 15

PROMPT 8.2 · PAPER ENTROPY AUDIT
Identify High- and Low-Information Sentences
Read your paper one paragraph at a time. For each sentence, estimate its information content:
H ≈ 0 = says something already implied by the previous sentence (redundant — cut or merge)
H ≈ 1 = advances the argument by one logical step (good)
H ≈ 2–3 = introduces a new finding, mechanism, or claim (high value — protect these)
H > 3 = makes a claim that is completely unconnected to the previous sentence (check: is this a new section, or a sorry?)
For each section (Introduction / Methods / Results / Discussion / Conclusion), compute the average H per sentence. Flag any section where average H < 0.5 (too redundant) or average H > 3 (too disconnected).
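Once per-sentence estimates exist, the per-section averaging and flagging step is mechanical. A minimal Python sketch, assuming the estimates are already collected by hand or by an LLM; the scores below are invented purely to exercise the thresholds from the prompt.

```python
from statistics import mean

TOO_REDUNDANT_BELOW = 0.5    # threshold from the prompt: average H < 0.5
TOO_DISCONNECTED_ABOVE = 3.0 # threshold from the prompt: average H > 3

def audit_sections(scores_by_section):
    """Map each section to (average H per sentence, flag)."""
    report = {}
    for section, scores in scores_by_section.items():
        average = mean(scores)
        if average < TOO_REDUNDANT_BELOW:
            flag = "too redundant"
        elif average > TOO_DISCONNECTED_ABOVE:
            flag = "too disconnected"
        else:
            flag = "ok"
        report[section] = (average, flag)
    return report

# Hypothetical per-sentence estimates for one draft.
draft_scores = {
    "Introduction": [1.0, 1.0, 2.0, 0.0, 1.0],
    "Methods": [0.0, 0.5, 0.0, 1.0, 0.5],
    "Results": [2.0, 3.0, 2.5, 2.5],
    "Discussion": [4.0, 3.5, 4.0, 2.5],
}

report = audit_sections(draft_scores)
for section, (average, flag) in report.items():
    print(f"{section:12s} average H = {average:.2f}  [{flag}]")
```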
PROMPT 8.3 · KL DIVERGENCE FROM FIELD EXPECTATIONS
Measure Your Paper's Novelty
Answer these questions to estimate D_KL(paper ‖ field_prior):
(a) What would a well-read researcher in your field expect this paper to find, before reading it? List 3–5 expected claims (this is Q).
(b) What does your paper actually claim? List 3–5 actual claims (this is P).
(c) For each claim in P, is it in Q (expected), a refinement of Q (small KL), or new information not in Q (large KL)?
(d) Estimate overall novelty: 0 (expected) → 1 (modest advance) → 2 (significant contribution) → 3 (paradigm-shifting).
(e) Is D_KL appropriate for your venue? High D_KL is right for Nature/Science; moderate D_KL for specialist journals; low D_KL for replication studies.
EXTENSION · EFFICIENT ABSTRACT
Maximise Information per Word
Apply the efficient coding principle to your abstract:
Count the words. For each sentence, ask: if I removed this sentence, how much would a reader's uncertainty about my paper's contribution increase?
This is the mutual information I(sentence ; paper) — the sentence's contribution in bits.
Rank sentences by I value. Keep the top 5 (highest information density). Rewrite the abstract using only those sentences plus the minimum connective tissue needed for coherence.
Target: every sentence should reduce the reader's uncertainty by at least 0.5 bits. No sentence should restate what the previous sentence already implied.
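The ranking step can be sketched in Python as well. The per-sentence bit estimates are assumed inputs (judged by the author or an LLM), `tighten_abstract` is a hypothetical helper, and the sentence labels are placeholders. The sketch keeps the highest-scoring sentences but returns them in their original order, so the connective rewrite starts from a coherent sequence.

```python
def tighten_abstract(sentences, keep=5, min_bits=0.5):
    """Select up to `keep` sentences with the highest estimated information.

    sentences: list of (text, estimated_bits) in original order.
    Sentences below `min_bits` are dropped even if they make the top `keep`.
    """
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentences[i][1], reverse=True)
    kept = sorted(i for i in ranked[:keep] if sentences[i][1] >= min_bits)
    return [sentences[i][0] for i in kept]

# Placeholder abstract with invented estimates.
abstract = [
    ("S1: background everyone knows", 0.1),
    ("S2: the open question", 1.2),
    ("S3: restates S2", 0.2),
    ("S4: the method", 1.0),
    ("S5: the main finding", 2.5),
    ("S6: the effect size", 2.0),
    ("S7: the implication", 1.5),
]

selected = tighten_abstract(abstract)
for text in selected:
    print(text)
```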