Channel Capacity and Capacity-Achieving Priors #
@cite{cover-thomas-2006} @cite{zaslavsky-etal-2019}
A naming system (lexicon) can be viewed as a communication channel: the input is a meaning (color, object, etc.), the output is a word, and the conditional distribution p(w|c) specifies how likely each word is for each meaning. The channel capacity — the maximum mutual information achievable by any input distribution — measures the maximal amount of information about meanings that the lexicon can convey.
@cite{zaslavsky-etal-2019} show that a capacity-achieving prior (CAP) links communicative need p(c) and communicative precision S(c) through a simple linear relationship: −log p(c) = S(c) + log Z.
The CAP condition p(c) ∝ exp(−S(c)) is the first-order (KKT) condition for maximizing I(W;C) over the probability simplex; since I(W;C) is concave in the prior, it is both necessary and sufficient (@cite{cover-thomas-2006}; Blahut 1972; Arimoto 1972).
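A sketch of why the KKT condition takes this form (the standard argument, not part of the formalization below): Bayes' rule turns S(c) into a KL divergence plus log p(c), and the KKT conditions equate that divergence to the capacity on the support of p.

```latex
% Since p(c \mid w) = p(w \mid c)\,p(c)/p(w), averaging the log over p(w \mid c):
-S(c) \;=\; \sum_w p(w \mid c)\log p(c \mid w)
\;=\; D\!\left(p(\cdot \mid c)\,\middle\|\,p(\cdot)\right) + \log p(c).
% KKT at a maximizer: D(p(\cdot \mid c) \,\|\, p(\cdot)) = C^* for all c with p(c) > 0,
% hence -\log p(c) = S(c) + C^*, i.e. p(c) \propto \exp(-S(c)) with \log Z = C^*.
```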
Connection to RSA #
The naming channel p(w|c) is the RSA literal speaker S₀: given a state c, how likely is each utterance w? The posterior p(c|w) is the RSA literal listener L₀. Mutual information I(W;C) measures how much the listener learns from the speaker. Channel capacity is the best-case informativity.
Main definitions #
- NamingChannel: a conditional distribution p(w|c) (stochastic matrix)
- marginalWord: p(w) = Σ_c p(c) · p(w|c)
- posterior: p(c|w) via Bayes' rule
- commPrecision: expected surprisal S(c)
- mutualInfo: I(W;C)
- IsCAP: capacity-achieving prior predicate
- channelCapacity: sup_{p(c)} I(W;C)
Main results #
- cap_linear: at a CAP, −log p(c) = S(c) + log Z
- mutualInfo_eq_log_Z_of_cap: at a CAP, I(W;C) = log Z
- gibbs_inequality: KL divergence is non-negative
- mutualInfo_le_log_card: I(W;C) ≤ log |W|
- channelCapacity_le_log_card: C* ≤ log |W|
Marginal word probability p(w) = Σ_c p(c) · p(w|c).
Equations
- Core.ChannelCapacity.marginalWord nc prior w = ∑ c : C, prior c * nc.encode c w
Instances For
Posterior p(c|w) via Bayes' rule. p(c|w) = p(w|c) · p(c) / Σ_{c'} p(w|c') · p(c').
This is the listener's belief about the meaning after hearing word w (= RSA literal listener L₀).
Equations
- Core.ChannelCapacity.posterior nc prior w c = nc.encode c w * prior c / Core.ChannelCapacity.marginalWord nc prior w
Instances For
Communicative precision (expected surprisal) of a meaning c. S(c) = −Σ_w p(w|c) · log p(c|w).
Lower S(c) means the naming system communicates c more precisely: the listener can recover c from the word with less uncertainty.
Defined in @cite{zaslavsky-etal-2019}.
Equations
- Core.ChannelCapacity.commPrecision nc prior c = -∑ w : W, nc.encode c w * Real.log (Core.ChannelCapacity.posterior nc prior w c)
Instances For
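Averaging S(c) over the prior recovers the conditional entropy H(C|W). This identity is a direct consequence of the definition (it is not among the listed results):

```latex
\sum_c p(c)\,S(c)
\;=\; -\sum_{c,w} p(c)\,p(w \mid c)\,\log p(c \mid w)
\;=\; -\sum_{c,w} p(c,w)\,\log p(c \mid w)
\;=\; H(C \mid W).
```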
Mutual information I(W;C) for a naming channel under prior p(c). I(W;C) = Σ_{c,w} p(c) · p(w|c) · log(p(c|w) / p(c)).
Measures how much information the word conveys about the meaning on average.
Equations
- Core.ChannelCapacity.mutualInfo nc prior = ∑ c : C, ∑ w : W, prior c * nc.encode c w * Real.log (Core.ChannelCapacity.posterior nc prior w c / prior c)
Instances For
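Splitting the log ratio in the definition ties mutualInfo to commPrecision: mutual information is prior entropy minus expected surprisal.

```latex
I(W;C)
= \sum_{c,w} p(c)\,p(w \mid c)\,\log\frac{p(c \mid w)}{p(c)}
= \underbrace{-\sum_c p(c)\log p(c)}_{H(C)}
\;-\; \underbrace{\sum_c p(c)\,S(c)}_{H(C \mid W)}.
```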
A prior is capacity-achieving iff p(c) ∝ exp(−S(c)). This is the necessary and sufficient condition for p(c) to maximize I(W;C) over all priors on C.
The CAP condition from @cite{zaslavsky-etal-2019}; derived from the Blahut-Arimoto algorithm (@cite{cover-thomas-2006}).
Equations
- Core.ChannelCapacity.IsCAP nc prior = ∃ Z > 0, ∀ (c : C), prior c > 0 → prior c = Real.exp (-Core.ChannelCapacity.commPrecision nc prior c) / Z
Instances For
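For intuition, the classical Blahut-Arimoto iteration for channel capacity (as in @cite{cover-thomas-2006}; not formalized here) can be rewritten in the notation above, making the CAP condition appear as its fixed point:

```latex
p_{t+1}(c) \;=\; \frac{p_t(c)\,\exp\!\big(D(p(\cdot \mid c)\,\|\,q_t)\big)}{Z_t},
\qquad q_t(w) = \sum_c p_t(c)\,p(w \mid c).
% Using -S_t(c) = D(p(\cdot \mid c) \,\|\, q_t) + \log p_t(c), where S_t is
% commPrecision under the prior p_t, the update becomes
% p_{t+1}(c) = \exp(-S_t(c))/Z_t, so a CAP is exactly a fixed point.
```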
The CAP linear relation: at a capacity-achieving prior, −log p(c) = S(c) + log Z.
From @cite{zaslavsky-etal-2019}. It means that
communicative need (−log p(c)) and communicative precision (S(c))
are linearly related with slope 1 and intercept log Z.
At a CAP, log Z equals the mutual information (see
mutualInfo_eq_log_Z_of_cap).
Proof: take log of both sides of p(c) = exp(−S(c)) / Z, then negate. Uses log(a/b) = log a − log b and log(exp x) = x.
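The algebra, spelled out:

```latex
p(c) = \frac{e^{-S(c)}}{Z}
\;\Longrightarrow\;
\log p(c) = -S(c) - \log Z
\;\Longrightarrow\;
-\log p(c) = S(c) + \log Z.
```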
Variant of cap_linear taking an existential IsCAP witness.
Marginal word probability is non-negative.
The marginal word distribution sums to 1.
At a CAP, mutual information equals log Z.
Proof: factor p(c) out of the double sum and show each inner sum
Σ_w p(w|c) · log(p(c|w)/p(c)) = log Z. Using sum_encode_log_ratio,
the inner sum equals −S(c) − log p(c). Then cap_linear gives
−log p(c) = S(c) + log Z, so the inner sum = log Z.
This means the normalization constant Z in the CAP condition determines the channel capacity.
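The inner-sum computation in one line, using Σ_w p(w|c) = 1 and cap_linear:

```latex
\sum_w p(w \mid c)\,\log\frac{p(c \mid w)}{p(c)}
= \underbrace{\sum_w p(w \mid c)\log p(c \mid w)}_{-S(c)} - \log p(c)
= -S(c) + \big(S(c) + \log Z\big) = \log Z,
% so I(W;C) = \sum_c p(c)\,\log Z = \log Z.
```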
Gibbs inequality: KL divergence is non-negative. D_KL(p ∥ q) = Σ p(i) log(p(i)/q(i)) ≥ 0 for distributions p, q. Uses the fundamental inequality log x ≤ x − 1.
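The one-line derivation behind gibbs_inequality (terms with p(i) = 0 contribute nothing):

```latex
-D_{\mathrm{KL}}(p \,\|\, q)
= \sum_i p(i)\log\frac{q(i)}{p(i)}
\;\le\; \sum_i p(i)\left(\frac{q(i)}{p(i)} - 1\right)
= \sum_i q(i) - \sum_i p(i) = 0.
```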
Mutual information is bounded by the log of the vocabulary size: I(W;C) ≤ log |W|.
Proof chain: I(W;C) ≤ H(W) ≤ log |W|, where the first inequality uses conditional entropy H(W|C) ≥ 0, and the second uses Gibbs' inequality (KL divergence of p(w) from uniform is non-negative).
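The two inequalities in the chain, written out:

```latex
I(W;C) = H(W) - H(W \mid C) \;\le\; H(W),
\qquad
\log|W| - H(W) = D_{\mathrm{KL}}\!\left(p(w)\,\middle\|\,\tfrac{1}{|W|}\right) \;\ge\; 0.
```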
Channel capacity: the supremum of mutual information over all valid priors. C* = sup_{p(c)} I(W;C) (eq. 3 of @cite{zaslavsky-etal-2019}).
A capacity-achieving prior (CAP) attains this supremum.
Equations
Instances For
Channel capacity is bounded by log of the vocabulary size:
C* ≤ log |W|. Follows directly from mutualInfo_le_log_card.
Mutual information bound for a vocabulary of size k:
I(W;C) ≤ log k. Useful corollary when the word type is Fin k.