In 1948, Claude Shannon published "A Mathematical Theory of Communication"
and defined entropy as the average surprise in a probability distribution.
An event that always happens carries zero information: you already knew it was coming.
A rare event carries a lot of information: its surprisal, −log₂ p, grows without bound as p → 0.
Definition 15.1 — Shannon Entropy
H(X) = −Σᵢ pᵢ log₂ pᵢ (bits)
pᵢ = probability of outcome i
Special cases:
pᵢ = 1 for one i (certainty): H = 0 bits (no surprise)
pᵢ = 1/n for all i (uniform): H = log₂ n (maximum)
Source coding theorem: the expected length L of an optimal lossless symbol code satisfies H(X) ≤ L < H(X) + 1 bits/symbol
No lossless code can average fewer than H(X) bits per symbol: entropy is the floor.
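The definition and the source coding bound can be checked numerically. The following is a minimal sketch, not part of the original text: `entropy_bits`, `huffman_lengths`, and `expected_length` are illustrative names, and a binary Huffman code stands in for "an optimal code".

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy H(X) in bits; outcomes with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Codeword length (in bits) a binary Huffman code gives each outcome."""
    if len(probs) == 1:
        return [0]
    # Heap entries: (subtree probability, unique tie-breaker, outcome indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        for i in a + b:  # each merge adds one bit to every codeword below it
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, a + b))
        tie += 1
    return lengths


def expected_length(probs):
    """Average codeword length in bits/symbol under the Huffman code."""
    return sum(p * n for p, n in zip(probs, huffman_lengths(probs)))


certain = [1.0, 0.0, 0.0, 0.0]          # H = 0 bits
uniform = [1 / 8] * 8                   # H = log2(8) = 3 bits
dyadic = [0.5, 0.25, 0.125, 0.125]      # H = 1.75 bits, met exactly by Huffman
skewed = [0.7, 0.2, 0.1]                # Huffman lands between H and H + 1

for name, dist in [("certain", certain), ("uniform", uniform),
                   ("dyadic", dyadic), ("skewed", skewed)]:
    print(name, round(entropy_bits(dist), 4), round(expected_length(dist), 4))
```

For the dyadic distribution the code length equals the entropy; for the skewed one it sits above H(X) but within one bit of it, as the bound requires.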
H = 0
Certainty
One outcome has probability 1. No information — nothing new to learn. A paper that only confirms what everyone already knows.
H ≈ 0.5
Structured
A few outcomes dominate. Predictable but not trivial. A paper with a clear thesis and supporting evidence.
H ≈ log₂ n − 1
Informative
Near-uniform distribution with slight structure. Maximally informative within a framework. The ideal research paper.
H = log₂ n
Maximum Entropy
Uniform distribution — all outcomes equally likely. Pure noise. A paper with no coherent argument, only unstructured data.
§2 · Operator C — Compression as Entropy Reduction
Chapter 4 introduced C as the Compression operator — the first step in
G = U ∘ F ∘ K ∘ C. Now we can define it precisely:
C reduces the entropy of the data representation
while preserving the mutual information with the target claim.
Compression as Entropy Reduction
Raw data X: H(X) = high (many variables, redundancy, noise)
Compressed C(X): H(C(X)) < H(X) (fewer bits, same signal)
Information preserved: I(C(X) ; Y) ≈ I(X ; Y)
where Y = the downstream claim you are trying to support
Constrained objective (in the spirit of rate-distortion theory and the information bottleneck): min H(C(X)) subject to I(C(X) ; Y) ≥ I_min
Methods section = a lossy compression of the experiment
→ enough bits to reconstruct the key results, not the entire raw dataset
Results section = decompressed C(X) mapped to human-readable space
→ tables, figures, statistics are the decoded representation
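A toy case makes the two conditions on C concrete. The sketch below is an illustration under stated assumptions, not the book's definition of C: X is taken to be a pair (signal bit, noise bit) of independent fair coins, Y is the signal bit, and the hypothetical `compress` function simply drops the noise bit.

```python
import math
from collections import defaultdict


def entropy_bits(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal of a joint {(a, b): p} over its first (0) or second (1) slot."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def mutual_information(joint):
    """I(A ; B) = H(A) + H(B) - H(A, B), in bits."""
    return (entropy_bits(marginal(joint, 0))
            + entropy_bits(marginal(joint, 1))
            - entropy_bits(joint))


# Assumed toy data: X = (signal, noise), two independent fair bits; Y = signal.
joint_xy = {((s, n), s): 0.25 for s in (0, 1) for n in (0, 1)}


def compress(x):
    """Hypothetical C: keep the signal bit, discard the noise bit."""
    return x[0]


# Push the joint distribution of (X, Y) through C to get that of (C(X), Y).
joint_cy = defaultdict(float)
for (x, y), p in joint_xy.items():
    joint_cy[(compress(x), y)] += p
joint_cy = dict(joint_cy)

h_x = entropy_bits(marginal(joint_xy, 0))   # H(X)
h_c = entropy_bits(marginal(joint_cy, 0))   # H(C(X))
i_xy = mutual_information(joint_xy)         # I(X ; Y)
i_cy = mutual_information(joint_cy)         # I(C(X) ; Y)
print(h_x, h_c, i_xy, i_cy)
```

Here the representation shrinks from 2 bits to 1 bit while the information about Y stays at 1 bit. Real data rarely separates this cleanly, which is why the text writes ≈ rather than =.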
The classic error in Methods sections: either too high-entropy
(every protocol detail, no compression — Methods becomes unreadable)
or too low-entropy (so compressed that replication fails —
the reader cannot reconstruct the experiment from the description).
The optimal Methods section achieves the Shannon bound:
minimum bits needed for replication, no more, no fewer.
The Compression Circle: G = U ∘ F ∘ K ∘ C starts by reducing entropy (C),
then builds structure through K and F, then expands into Discussion (U).
Ch15 closes the loop — the Discussion's claims become the prior
for the next paper's C operator, compressing toward what the field now knows.
Science is a self-compressing information system.
§3 · Efficient Coding — Biology at Maximum Entropy
Horace Barlow's efficient coding hypothesis (1961) proposes that
the nervous system has evolved to encode stimuli in a way that
maximises information transmission — operating near maximum entropy
given the statistics of the natural environment.
This is why retinal ganglion cells respond strongly to edges
(high spatial information content — unexpected luminance changes)
and weakly to uniform fields (low information content — predictable).
The visual system has learned the entropy structure of natural scenes
and adapted its code to match.
Biological Information Densities
DNA (4 bases, ~uniform coding regions):
H ≈ 1.9 bits/nucleotide (theoretical max: 2.0 bits)
Human genome total information:
~710 MB per haploid copy (3×10⁹ bp × 1.9 bits / 8 bits/byte); about 1.4 GB for the diploid 6×10⁹ bp
V(D)J recombination (antibody diversity):
~10¹⁵ possible sequences → H ≈ 50 bits per antibody locus
This IS maximum entropy of the immune recognition space
Goal: no antigen should be entirely unpredictable to the system
Drosophila connectome (2023, Schlegel et al.):
130,000 neurons · 50M synapses · ~10⁷ bits of connectivity information
The most information-dense biological structure formally verified
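The arithmetic behind the genome and antibody figures can be reproduced in a few lines. This is a rough sketch: the 1.9 bits/nucleotide value is the one quoted above, the haploid size of about 3×10⁹ bp is an assumption added here, and the V(D)J figure treats all 10¹⁵ sequences as equally likely.

```python
import math

BITS_PER_NT = 1.9                  # empirical figure quoted in the text
MAX_BITS_PER_NT = math.log2(4)     # 4 equiprobable bases -> 2.0 bits


def genome_bytes(base_pairs, bits_per_nt=BITS_PER_NT):
    """Information content of a sequence in bytes (8 bits per byte)."""
    return base_pairs * bits_per_nt / 8


haploid_mb = genome_bytes(3e9) / 1e6   # assumed haploid size, in MB
diploid_gb = genome_bytes(6e9) / 1e9   # diploid size from the text, in GB

# Entropy of a uniform choice among ~10^15 antibody sequences.
vdj_bits = math.log2(1e15)

print(f"max per nucleotide: {MAX_BITS_PER_NT:.1f} bits")
print(f"haploid genome:     {haploid_mb:.0f} MB")
print(f"diploid genome:     {diploid_gb:.2f} GB")
print(f"V(D)J repertoire:   {vdj_bits:.1f} bits")
```

The V(D)J value comes out just under 50 bits, and a non-uniform repertoire would carry less.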
V(D)J recombination deliberately maximises the entropy of the antibody repertoire,
ensuring the immune system has a non-zero probability of recognising any possible antigen.
This is the K operator (Chapter 5) serving the C operator:
maximum repertoire entropy ensures that any pathogen can be compressed
into a specific antibody shape.
Writing Context | H (bits/unit) | Entropy State | Interpretation
Abstract: every sentence predictable from the last | | |
§4 · KL Divergence — How Surprising Is Your Paper?
The Kullback-Leibler divergence D_KL(P ‖ Q) measures how different
a distribution P is from an expected distribution Q.
It equals the extra bits needed to encode samples from P
using a code optimised for Q.
D_KL(P ‖ Q) = Σᵢ Pᵢ log₂(Pᵢ / Qᵢ) (bits)
P = distribution of claims in your paper
Q = distribution of claims the field already expects
D_KL = 0: your paper is exactly what the field predicted (no novelty)
D_KL small: modest advance — confirms with refinements
D_KL moderate: genuine contribution — surprising but in the right direction
D_KL large: paradigm-shifting (or wrong — reviewer decides)
Mutual information of your paper with the field's unknowns:
I(paper ; gap) = H(gap) − H(gap | paper)
= entropy of the open question BEFORE reading minus AFTER reading
A paper's scientific value ≈ I(paper ; gap) bits
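The novelty scale above can be given numbers. The sketch below is illustrative only: the three-claim distributions are invented for the example, and `kl_bits` is an assumed helper name implementing the standard discrete KL divergence in bits.

```python
import math


def kl_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same outcomes.

    Returns infinity when P puts mass on an outcome that Q rules out.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf   # P asserts something Q considers impossible
        total += pi * math.log2(pi / qi)
    return total


# Hypothetical field prior Q over three competing claims.
field_prior = [0.70, 0.20, 0.10]

as_expected = [0.70, 0.20, 0.10]   # paper matches the prior exactly
refinement = [0.60, 0.30, 0.10]    # modest shift of weight
surprising = [0.10, 0.20, 0.70]    # weight moves to the unexpected claim

for name, paper in [("as expected", as_expected),
                    ("refinement", refinement),
                    ("surprising", surprising)]:
    print(f"{name:12s} D_KL = {kl_bits(paper, field_prior):.3f} bits")

# A claim the prior assigns zero probability yields an infinite divergence.
print(kl_bits([0.5, 0.5], [1.0, 0.0]))
```

The ordering of the three values mirrors the scale in the text: zero for the expected paper, a small value for the refinement, and a much larger one for the surprising result.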
Theorem 15.1 — The Information Value of a Paper
Let Q be the field's prior distribution over claims (what is expected before reading the paper),
and P be the paper's claimed distribution (what is asserted after).
The paper's information value is:
V(paper) = I(paper ; gap) = H(gap) − H(gap | paper)
where gap = the set of open questions the paper addresses, H(gap) is the entropy of the gap under the prior Q, and H(gap | paper) is the entropy that remains once the paper's P is known.
A paper maximises V by:
(1) targeting a high-entropy gap, i.e. a large H(gap): an open question with many plausible answers;
(2) reducing H(gap | paper) as much as possible, narrowing the uncertainty that remains after reading;
(3) ensuring D_KL(P ‖ Q) is neither 0 (trivial) nor ∞ (incomprehensible);
(4) keeping ρ(P) ≤ 1 (Ch11), so the reduction in uncertainty is warranted.
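Points (1) and (2) can be illustrated with a small calculation. This is a sketch under simplifying assumptions: the distributions are made up, and `uncertainty_reduction` measures the entropy drop for one particular paper, whereas mutual information proper averages that drop over all the papers that could have been written.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def uncertainty_reduction(prior, posterior):
    """Entropy of the gap before reading minus entropy after reading, in bits."""
    return entropy_bits(prior) - entropy_bits(posterior)


# Narrow gap: the field is already 90% sure of the answer.
narrow_prior = [0.9, 0.1]
narrow_posterior = [0.99, 0.01]

# Wide gap: eight equally plausible answers; the paper singles one out.
wide_prior = [1 / 8] * 8
wide_posterior = [0.86] + [0.02] * 7

narrow_value = uncertainty_reduction(narrow_prior, narrow_posterior)
wide_value = uncertainty_reduction(wide_prior, wide_posterior)

print(f"narrow gap: {narrow_value:.2f} bits")
print(f"wide gap:   {wide_value:.2f} bits")
```

Both papers leave the reader fairly confident, but the one addressing the wide gap removes several times more uncertainty, because there was more to remove.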
▶ Shannon Entropy Calculator (interactive)
Four sliders set event probabilities, auto-normalised to sum to 1. Readouts: H(X) in bits, H_max = 2.000 bits, efficiency (%), KL divergence from uniform, and entropy state.
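The calculator's numeric readouts can be approximated offline. The following is a minimal sketch rather than the widget's actual code: `calculator` is a hypothetical name, efficiency is assumed to mean H(X) / H_max, and the "State" label is omitted because its thresholds are not given in the text.

```python
import math


def calculator(slider_values):
    """Normalise raw slider values and compute the calculator's readouts."""
    total = sum(slider_values)
    if total <= 0:
        raise ValueError("at least one slider must be positive")
    probs = [v / total for v in slider_values]        # auto-normalise to sum 1
    h = sum(-p * math.log2(p) for p in probs if p > 0)
    h_max = math.log2(len(probs))                     # entropy of the uniform case
    return {
        "probs": probs,
        "H": h,
        "H_max": h_max,
        "efficiency_pct": 100 * h / h_max if h_max > 0 else 0.0,
        # For a uniform reference U, D_KL(P || U) = H_max - H(P).
        "kl_from_uniform": h_max - h,
    }


readout = calculator([50, 25, 12.5, 12.5])
for key, value in readout.items():
    print(key, value)
```

With four sliders H_max is 2 bits, matching the readout above, and the KL divergence from uniform is simply the shortfall of H(X) from that maximum.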
⬡ LLM Prompt Portal · Chapter 15
PROMPT 8.2 · PAPER ENTROPY AUDIT
Identify High- and Low-Information Sentences
Read your paper one paragraph at a time. For each sentence, estimate its information content:
H ≈ 0 = says something already implied by the previous sentence (redundant — cut or merge)
H ≈ 1 = advances the argument by one logical step (good)
H ≈ 2–3 = introduces a new finding, mechanism, or claim (high value — protect these)
H > 3 = makes a claim that is completely unconnected to the previous sentence (check: is this a new section, or a sorry?)
For each section (Introduction / Methods / Results / Discussion / Conclusion),
compute the average H per sentence.
Flag any section where average H < 0.5 (too redundant) or average H > 3 (too disconnected).
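Once sentences have been scored, the averaging and flagging step is mechanical. The sketch below assumes the scores were assigned by hand (or by an LLM) on the 0 to 3+ scale above; the section scores shown are invented, and `audit` is a hypothetical helper name.

```python
from statistics import mean

LOW, HIGH = 0.5, 3.0   # thresholds taken from the prompt above


def audit(scores_by_section):
    """Return {section: (average H per sentence, flag)}."""
    report = {}
    for section, scores in scores_by_section.items():
        avg = mean(scores)
        if avg < LOW:
            flag = "too redundant"
        elif avg > HIGH:
            flag = "too disconnected"
        else:
            flag = "ok"
        report[section] = (avg, flag)
    return report


# Hypothetical per-sentence estimates for three sections.
scores = {
    "Introduction": [0, 0, 1, 0],
    "Results": [2, 3, 1, 2],
    "Discussion": [4, 4, 3, 4],
}

report = audit(scores)
for section, (avg, flag) in report.items():
    print(f"{section:12s} avg H = {avg:.2f}  {flag}")
```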
PROMPT 8.3 · KL DIVERGENCE FROM FIELD EXPECTATIONS
Measure Your Paper's Novelty
Answer these questions to estimate D_KL(paper ‖ field_prior):
(a) What would a well-read researcher in your field expect this paper to find, before reading it? List 3–5 expected claims (this is Q).
(b) What does your paper actually claim? List 3–5 actual claims (this is P).
(c) For each claim in P, is it in Q (expected), a refinement of Q (small KL), or new information not in Q (large KL)?
(d) Estimate overall novelty: 0 (expected) → 1 (modest advance) → 2 (significant contribution) → 3 (paradigm-shifting).
(e) Is D_KL appropriate for your venue? High D_KL is right for Nature/Science; moderate D_KL for specialist journals; low D_KL for replication studies.
EXTENSION · EFFICIENT ABSTRACT
Maximise Information per Word
Apply the efficient coding principle to your abstract:
Count the words. For each sentence, ask: if I removed this sentence, how much would a reader's uncertainty about my paper's contribution increase?
This is the mutual information I(sentence ; paper) — the sentence's contribution in bits.
Rank sentences by I value. Keep the top 5 (highest information density). Rewrite the abstract using only those sentences plus the minimum connective tissue needed for coherence.
Target: every sentence should reduce the reader's uncertainty by at least 0.5 bits. No sentence should restate what the previous sentence already implied.
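The ranking step can be sketched as follows, assuming each sentence has already been given an estimated information value in bits. The placeholder sentences, the scores, and the `tighten_abstract` name are all hypothetical; the function keeps the highest-scoring sentences but returns them in their original order so the abstract still reads coherently.

```python
def tighten_abstract(scored_sentences, keep=5, min_bits=0.5):
    """Keep the `keep` most informative sentences that meet `min_bits`.

    scored_sentences: list of (sentence, estimated_bits) in original order.
    """
    by_value = sorted(range(len(scored_sentences)),
                      key=lambda i: scored_sentences[i][1],
                      reverse=True)
    kept = sorted(i for i in by_value[:keep]
                  if scored_sentences[i][1] >= min_bits)
    return [scored_sentences[i][0] for i in kept]


# Placeholder abstract with hand-estimated scores.
draft = [
    ("S1: background everyone knows", 0.1),
    ("S2: the open question", 1.5),
    ("S3: restates S2", 0.2),
    ("S4: method", 1.0),
    ("S5: main result", 3.0),
    ("S6: secondary result", 2.0),
    ("S7: implication", 1.2),
]

tightened = tighten_abstract(draft)
print("\n".join(tightened))
```

The connective tissue between the surviving sentences still has to be rewritten by hand.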