Rational Action @cite{luce-1959} #
@cite{cover-thomas-2006} @cite{zaslavsky-hu-levy-2020} @cite{adams-messick-1958}

The mathematical foundation for all soft-rational agents: RSA speakers/listeners, BToM agents, and decision-theoretic actors.
Architecture #
A RationalAction agent selects actions with probability proportional to a
non-negative score function — the Luce choice rule. This is the
unique choice rule satisfying IIA (independence of irrelevant alternatives):
the relative probability of two actions depends only on their scores.
The key mathematical results characterizing this choice rule are:
- Softmax (§2): the exponential parameterization score = exp(α · utility) gives policy = softmax(utility, α). This is the standard form in RSA.
- Gibbs Variational Principle (§3): softmax uniquely maximizes H(p) + α · ⟨p, s⟩ on the probability simplex. This is the mathematical foundation for RSA convergence.
- Maximum Entropy (§4): softmax is the max-entropy distribution subject to an expected-utility constraint. Equivalently, it minimizes free energy (the Boltzmann distribution from statistical mechanics).
- Bayesian Optimality (§5): the Bayesian posterior maximizes expected log-likelihood. This is the listener half of RSA convergence.
A rational action agent selects actions with probability ∝ score(state, action).
This is the Luce choice rule. The score function is unnormalized;
normalization to a proper distribution is derived (see policy).
Key instances:
- RSA L0: score(u, w) = prior(w) · meaning(u, w)
- RSA S1: score(w, u) = rpow(informativity(w, u), α) · exp(-α · cost(u))
- BToM agent: score(state, action) = exp(β · Q(state, action))
- score : State → Action → ℝ
Unnormalized score: policy(a|s) ∝ score(s, a).
Scores are non-negative.
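As a numerical sanity check (not part of the Lean formalization), the Luce rule and its zero-total guard can be sketched in a few lines of Python; the `luce_policy` helper and the toy L0 scores below are illustrative assumptions:

```python
def luce_policy(score, state, actions):
    """Luce choice rule: P(a|s) = score(s, a) / sum over a' of score(s, a').

    Returns 0 for every action when the total score is 0,
    mirroring the guard in the policy definition."""
    total = sum(score(state, a) for a in actions)
    if total == 0:
        return {a: 0.0 for a in actions}
    return {a: score(state, a) / total for a in actions}

# Toy RSA L0 instance: score(u, w) = prior(w) * meaning(u, w)
prior = {"w_some": 0.5, "w_all": 0.5}
meaning = {("some", "w_some"): 1, ("some", "w_all"): 1,
           ("all", "w_some"): 0, ("all", "w_all"): 1}
l0 = luce_policy(lambda u, w: prior[w] * meaning[(u, w)], "some",
                 ["w_some", "w_all"])
assert abs(sum(l0.values()) - 1.0) < 1e-12  # policy sums to 1
```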
Total score across all actions in a state (normalization constant).
Equations
- ra.totalScore s = ∑ a : A, ra.score s a
Normalized policy: P(a|s) = score(s,a) / Σ_a' score(s,a'). Returns 0 for all actions if totalScore = 0.
Equations
- ra.policy s a = if ra.totalScore s = 0 then 0 else ra.score s a / ra.totalScore s
Policy sums to 1 when totalScore is nonzero.
Zero score implies zero policy, regardless of whether totalScore is zero.
Policy respects score equality: equal scores → equal probabilities.
Follows directly from the Luce rule: both sides are score / totalScore
with the same denominator.
When totalScore equals the score of action a, the policy for a is 1.
Used by the compositional proof builder when all other scores are zero,
so totalScore = score a + 0 + ⋯ + 0 = score a, making policy = 1.
Non-strict score ordering rules out the reverse strict policy ordering: score s a₁ ≥ score s a₂ implies ¬(policy s a₂ > policy s a₁). Used by the compositional proof builder for ¬(L1 w₁ > L1 w₂) goals.
Strict policy monotonicity: strictly higher score → strictly higher probability.
Used by rsa_decide to eliminate shared denominator computations: when
comparing policy s a₁ > policy s a₂ (same state), it suffices to show
score s a₁ > score s a₂, skipping the expensive totalScore computation
in the proof term.
Cross-state policy comparison: compares policy values at different states (different denominators). Used for S2 cross-world comparisons where S2(u|w₁) vs S2(u|w₂) have different normalization constants.
The cross-product condition score(s₁,a) * total(s₂) > score(s₂,a) * total(s₁)
is equivalent to score(s₁,a)/total(s₁) > score(s₂,a)/total(s₂) when both
totals are positive.
Cross-state policy comparison with positivity derived from the cross-product.
Like policy_gt_cross but derives the totalScore > 0 conditions from the
cross-product inequality itself: if score(s₁,a) * total(s₂) > score(s₂,a) * total(s₁) ≥ 0,
then score(s₁,a) * total(s₂) > 0, so both score(s₁,a) > 0 and total(s₂) > 0.
And score(s₁,a) ≤ total(s₁), so total(s₁) > 0.
Used by rsa_predict for cross-utterance L1 comparisons where the two sides
have different normalization constants.
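The cross-product trick can be spot-checked numerically; a sketch with made-up scores (not the Lean statement itself):

```python
# Cross-state comparison: with both totals positive,
#   score(s1, a) * total(s2) > score(s2, a) * total(s1)
# holds iff
#   score(s1, a) / total(s1) > score(s2, a) / total(s2).
score = {("s1", "a"): 3.0, ("s1", "b"): 1.0,
         ("s2", "a"): 2.0, ("s2", "b"): 4.0}
total = {s: score[(s, "a")] + score[(s, "b")] for s in ("s1", "s2")}

cross = score[("s1", "a")] * total["s2"] > score[("s2", "a")] * total["s1"]
ratio = score[("s1", "a")] / total["s1"] > score[("s2", "a")] / total["s2"]
assert cross == ratio  # here both sides hold: 3*6 > 2*4 and 3/4 > 2/6
```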
Score-sum ordering implies policy-sum ordering when both sides share the same
state (same denominator). Used by rsa_predict for marginal L1 comparisons
where the worlds being summed differ but the utterance and config are shared.
Finset-sum ordering implies policy-sum ordering when both sides share the
same state (same denominator). Like policy_list_sum_gt but for Finset.sum.
Derives totalScore positivity from the score ordering itself, so no extra hypothesis is needed: if Σ_{F₁} score > Σ_{F₂} score ≥ 0, then some score is positive, so totalScore > 0.
Used by rsa_predict for denominator cancellation in marginal comparisons.
Luce's Choice Axiom #
@cite{luce-1959} showed that the ratio rule P(a|s) = v(a)/Σv(b) is
characterized by the independence of irrelevant alternatives (IIA): the
relative probability of two actions depends only on their scores, not on what
other actions are available.
We formalize:
- The constant ratio rule (Theorem 2): policy(a₁) · score(a₂) = policy(a₂) · score(a₁)
- Choice from subsets (pChoice): restriction of the choice rule to a Finset
- IIA (Axiom 1): ratios in any subset equal score ratios
- The product rule (Theorem 1): P(a, T) = P(a, S) · P(S, T) for S ⊆ T
- Scale invariance (Theorem 5): multiplying all scores by k > 0 preserves policy
- Uniqueness (Theorem 4, forward): proportional scores yield the same policy
Constant Ratio Rule (Theorem 2):
policy(a₁) · score(a₂) = policy(a₂) · score(a₁).
The odds ratio policy(a₁)/policy(a₂) = score(a₁)/score(a₂).
Choice probability from a subset: pChoice(a, T) = score(a) / Σ_{b∈T} score(b).
Returns 0 if a ∉ T or the subset total is 0.
pChoice is non-negative.
pChoice sums to 1 over the subset when the subset total is nonzero.
IIA core: the ratio of pChoice values in any subset equals the score ratio.
For a₁, a₂ ∈ T with score(a₂) > 0:
pChoice(a₁, T) · score(a₂) = pChoice(a₂, T) · score(a₁) (Axiom 1).
IIA (Axiom 1): P(a, S) = P(a, T) / Σ_{b∈S} P(b, T) for S ⊆ T.
Choice probability from a subset is the conditional probability.
Product rule (Theorem 1):
P(a, T) = P(a, S) · P(S, T) for a ∈ S ⊆ T,
where P(S, T) = Σ_{b∈S} score(b) / Σ_{b∈T} score(b).
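The product rule and IIA are easy to check numerically for a concrete ratio scale (the scores below are arbitrary illustrations, and `P` is a local helper):

```python
v = {"x": 2.0, "y": 3.0, "z": 5.0}  # a made-up ratio scale
T = ["x", "y", "z"]
S = ["x", "y"]

def P(a, C):
    """Choice probability from subset C: v(a) / sum over b in C of v(b)."""
    return v[a] / sum(v[b] for b in C)

P_S_T = sum(P(b, T) for b in S)  # P(S, T)
# Product rule (Theorem 1): P(x, T) = P(x, S) * P(S, T)
assert abs(P("x", T) - P("x", S) * P_S_T) < 1e-12
# IIA (Axiom 1): odds within S equal odds within T
assert abs(P("x", S) / P("y", S) - P("x", T) / P("y", T)) < 1e-12
```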
Scale all scores by a positive constant k.
Uniqueness (forward direction, Theorem 4):
If scores are proportional (score'(s,a) = k · score(s,a) for some k > 0),
then both agents have the same policy.
Softmax Function #
The softmax function σ(s, α)ᵢ = exp(α · sᵢ) / Σⱼ exp(α · sⱼ) is the
exponential parameterization of the Luce choice rule. Following Franke & Degen
(submitted), we establish Facts 1–8.
The logistic (sigmoid) function: S(x) = 1 / (1 + exp(−x)).
At α = 0, softmax is uniform.
Luce choice with rpow scores equals softmax over log scores.
f(i)^α / Σⱼ f(j)^α = softmax(log ∘ f, α)(i) when all f(i) > 0.
This is the general identity connecting belief-based RSA (which uses
rpow) to the softmax framework (which uses exp). Every S1 model with
s1Score = rpow(l0, α) inherits all softmax limit theorems via this
identity: as α → ∞, rpow-based Luce choice concentrates on the
argmax of f, i.e., the most informative utterance.
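The identity can be confirmed numerically; a minimal Python sketch (the `softmax` helper and the toy listener values are assumptions, not the library's API):

```python
import math

def softmax(s, alpha):
    z = [math.exp(alpha * x) for x in s]
    t = sum(z)
    return [x / t for x in z]

f = [0.5, 0.3, 0.2]   # e.g. L0 listener probabilities, all positive
alpha = 2.5
# Luce choice with rpow scores: f(i)^alpha / sum of f(j)^alpha
rpow = [x ** alpha for x in f]
luce = [x / sum(rpow) for x in rpow]
# equals softmax over the log scores: softmax(log of f, alpha)
sm = softmax([math.log(x) for x in f], alpha)
assert all(abs(a - b) < 1e-12 for a, b in zip(luce, sm))
```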
Why Softmax? The Fechnerian Characterization #
The exponential parameterization score = exp(α · utility) is not a design
choice — it is the unique transformation connecting Luce's ratio scale to
a utility (interval) scale (§2.A; @cite{adams-messick-1958}).
Ratio vs interval scales. Luce's Axiom 1 (IIA) yields a ratio scale
v: only ratios v(a)/v(b) are meaningful (Theorem 4). Fechner's
psychophysics requires an interval scale u: only differences
u(a) - u(b) are meaningful. The question: how are v and u related?
Derivation. From P(a,b) = v(a)/(v(a)+v(b)), the odds ratio is
v(a)/v(b) = g(u(a) - u(b)) for some function g. Transitivity of ratios
(v(a)/v(c) = [v(a)/v(b)] · [v(b)/v(c)]) forces Cauchy's multiplicative
functional equation: g(s + t) = g(s) · g(t). The unique monotone increasing
solution is g(s) = exp(k · s) (cauchy_mul_exp), giving:
- v(a) = C · exp(k · u(a)) — the ratio scale IS the exponential of utility
- P(a,b) = 1/(1 + exp(-k · (u(a) - u(b)))) — the logistic function
- For n alternatives: P(a|S) = exp(k · u(a)) / Σ_b exp(k · u(b)) — softmax
The parameter k > 0 is the rationality parameter α in RSA.
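A numerical sketch of the derivation's endpoint, with made-up utilities: if v = exp(k · u), binary Luce choice is exactly the logistic of the utility difference.

```python
import math

k = 1.7
u = {"a": 2.0, "b": 0.5}                  # interval-scale utilities (made up)
v = {x: math.exp(k * u[x]) for x in u}    # induced ratio scale

# Binary Luce choice equals the logistic of the utility difference:
p_luce = v["a"] / (v["a"] + v["b"])
p_logistic = 1 / (1 + math.exp(-k * (u["a"] - u["b"])))
assert abs(p_luce - p_logistic) < 1e-12
```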
Cauchy's multiplicative functional equation (classical):
If g : ℝ → ℝ satisfies g(s + t) = g(s) · g(t) and is strictly
monotone increasing, then g(s) = exp(k · s) for some k > 0.
The proof reduces to the additive Cauchy equation via log: setting
h = log ∘ g, the multiplicative equation becomes h(s+t) = h(s) + h(t).
The key lemma (cauchy_monotone_additive_linear) shows that a strictly
monotone additive function must be linear, by density of ℚ in ℝ:
h agrees with x ↦ k·x on rationals (by induction), and any
deviation on an irrational x would violate monotonicity via a
rational witness between x and h(x)/k.
Fechnerian uniqueness (§2.A; @cite{adams-messick-1958}):
If a ratio scale v and interval scale u represent the same
ordering via v(x)/v(y) = g(u(x) - u(y)) for a strictly monotone
multiplicative g, then v is the exponential of u.
This is WHY fromSoftmax uses exp(α · utility): the exponential
is forced by the requirement that log-odds be linear in utility
differences. It is the unique bridge between Luce's ratio scale
(Chapter 1) and Fechner's interval scale (Chapter 2).
Construct a RationalAction from a utility function via softmax.
The score is exp(α · utility(s, a)), so policy = softmax(utility, α).
The exponential parameterization is forced by the Fechnerian characterization
(luce_fechnerian_exp): it is the unique bridge from Luce's ratio scale
to an additive utility scale.
KL Divergence and the Gibbs Variational Principle #
The softmax distribution uniquely maximizes entropy + expected score on the probability simplex. This is the mathematical foundation for RSA convergence (@cite{zaslavsky-hu-levy-2020}, Proposition 1).
Proof strategy #
The Gibbs VP reduces to KL non-negativity via three identities:
- H(p) + KL(p‖q) = -∑ pᵢ log qᵢ (negMulLog + KL term telescope)
- -∑ pᵢ log qᵢ = -α⟨p,s⟩ + log Z (substitute log qᵢ = α sᵢ - log Z)
- H(q) + α⟨q,s⟩ = log Z (softmax self-information)
Combining: H(p) + α⟨p,s⟩ + KL = log Z = H(q) + α⟨q,s⟩, so KL ≥ 0 ⟹ LHS ≤ RHS.
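These identities can be checked numerically for an arbitrary p (a sketch under made-up scores; `H` and `exp_s` are local helpers, not library names):

```python
import math

s = [1.0, 0.2, -0.5]
alpha = 1.3
z = [math.exp(alpha * x) for x in s]
Z = sum(z)
q = [x / Z for x in z]                     # q = softmax(s, alpha)

p = [0.2, 0.5, 0.3]                        # any other distribution
H = lambda d: -sum(x * math.log(x) for x in d if x > 0)  # Shannon entropy
exp_s = lambda d: sum(di * si for di, si in zip(d, s))   # expected score
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# H(p) + alpha*<p,s> + KL(p||q) = log Z = H(q) + alpha*<q,s>
assert abs(H(p) + alpha * exp_s(p) + kl - math.log(Z)) < 1e-12
assert abs(H(q) + alpha * exp_s(q) - math.log(Z)) < 1e-12
assert kl >= 0  # hence H(p) + alpha*<p,s> <= H(q) + alpha*<q,s>
```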
Gibbs' inequality (finite form): KL(p ‖ q) ≥ 0.
For distributions p, q with qᵢ > 0, pᵢ ≥ 0, and Σpᵢ = Σqᵢ: Σᵢ pᵢ · log(pᵢ/qᵢ) ≥ 0
Proof via Mathlib's klFun_nonneg: klFun(x) ≥ 0 for x ≥ 0.
The per-meaning speaker objective: f(s) = Σᵤ [negMulLog(sᵤ) + α · sᵤ · vᵤ].
This is the function that the speaker maximizes for each meaning m, where vᵤ = log L(m|u) is the listener utility of utterance u.
Gibbs Variational Principle: softmax maximizes entropy + expected score.
For any distribution p on ι and scores s : ι → ℝ: H(p) + α·⟨p, s⟩ ≤ H(q) + α·⟨q, s⟩ where q = softmax(s, α) and H(p) = Σ negMulLog(pᵢ).
Softmax Concentration at High Rationality #
As the rationality parameter α → ∞, softmax concentrates all probability mass on the action with highest utility — i.e., softmax converges to argmax. This connects:
- MaxEnt phonology ↔ OT: a MaxEnt grammar with weights w becomes categorical (OT-like) as temperature → 0 (equivalently α → ∞).
- RSA ↔ neo-Gricean pragmatics: a soft-rational RSA speaker becomes a hard-rational Gricean reasoner in the α → ∞ limit.
Proof sketch #
From softmax_odds, we have σᵢ / σⱼ = exp(α(sᵢ − sⱼ)). When sᵢ > sⱼ,
this ratio → ∞ as α → ∞, so σⱼ / σᵢ → 0. Since Σ σₖ = 1, the maximizer's
probability → 1 by squeezing: 1 - σ_max = Σ_{k≠max} σₖ, and each non-maximal
term → 0 (bounded by exp(-α · gap) where gap = sᵢ - sⱼ > 0).
Softmax → argmax limit: as α → ∞, softmax concentrates on the unique
maximizer. For any i_max with s i_max strictly greater than all other
scores, softmax(s, α)_{i_max} → 1.
This is the formal connection between MaxEnt (soft optimization) and OT (hard optimization): OT is the α → ∞ limit of MaxEnt.
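The limit behaves as described in a quick numerical sketch (the `softmax` helper and the scores are arbitrary assumptions):

```python
import math

def softmax(s, alpha):
    z = [math.exp(alpha * x) for x in s]
    t = sum(z)
    return [x / t for x in z]

s = [1.0, 0.8, 0.3]  # unique maximizer at index 0

assert all(abs(p - 1 / 3) < 1e-12 for p in softmax(s, 0.0))  # alpha = 0: uniform
# As alpha grows, mass concentrates on the argmax; each non-maximal
# coordinate is bounded by exp(-alpha * gap) with gap = s[0] - s[k] > 0.
for alpha in (1.0, 10.0, 100.0):
    p = softmax(s, alpha)
    assert p[0] == max(p)
assert softmax(s, 100.0)[0] > 0.999
```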
Shannon entropy: H(p) = -Σᵢ pᵢ log pᵢ.
For a ℚ-valued counterpart suitable for decidable computation, see
Core.InformationTheory.entropy in Linglib/Core/InformationTheory.lean.
Entropy of uniform distribution.
Entropy-regularized objective: G_α(p, s) = ⟨s, p⟩ + (1/α) H(p).
Equations
- Core.entropyRegObjective s α p = ∑ i : ι, p i * s i + 1 / α * Core.shannonEntropy p
Fact 5: Softmax maximizes the entropy-regularized objective.
Proof: gibbs_variational gives H(p) + α⟨p,s⟩ ≤ H(q) + α⟨q,s⟩;
dividing by α > 0 yields the result.
Softmax is the unique maximizer.
Proof: equality in the objective ⟹ KL(p ‖ softmax) = 0 (via speakerObj_plus_kl),
hence p = softmax (via kl_eq_zero_imp_eq).
The log-partition function is convex in α.
Proof: By Hölder's inequality. For 0 < a, b with a + b = 1:
∑ exp(x·sᵢ)^a · exp(y·sᵢ)^b ≤ (∑ exp(x·sᵢ))^a · (∑ exp(y·sᵢ))^b
Since exp(x·sᵢ)^a · exp(y·sᵢ)^b = exp((ax+by)·sᵢ), taking logs gives
logSumExp(s, ax+by) ≤ a·logSumExp(s, x) + b·logSumExp(s, y).
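A numerical spot-check of the convexity claim (all values below are arbitrary; `log_sum_exp` is a local helper):

```python
import math

def log_sum_exp(s, alpha):
    return math.log(sum(math.exp(alpha * x) for x in s))

s = [1.0, -0.3, 0.6]
x, y, a = 0.5, 3.0, 0.25
b = 1 - a
# Convexity: logSumExp(s, a*x + b*y) <= a*logSumExp(s, x) + b*logSumExp(s, y)
lhs = log_sum_exp(s, a * x + b * y)
rhs = a * log_sum_exp(s, x) + b * log_sum_exp(s, y)
assert lhs <= rhs + 1e-12
```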
Log-sum-exp with offsets: log Σ exp(α · sᵢ + rᵢ).
Equations
- Core.logSumExpOffset s r α = Real.log (Core.partitionFnOffset s r α)
Softmax with offsets: exp(α · sᵢ + rᵢ) / Z(α).
Equations
- Core.softmaxOffset s r α i = Real.exp (α * s i + r i) / Core.partitionFnOffset s r α
Derivative of offset log-partition gives weighted expected value:
d/dα log(Σ exp(α·sᵢ + rᵢ)) = Σ softmaxOffset(s,r,α)ᵢ · sᵢ.
The offset log-partition function is convex in α.
Same Hölder argument as logSumExp_convex: the key factoring
exp((ax+by)·sᵢ + rᵢ) = exp(x·sᵢ + rᵢ)^a · exp(y·sᵢ + rᵢ)^b
holds because a + b = 1, absorbing the offset.
The derivative of log conditional likelihood α ↦ (α·sᵧ + rᵧ) − logSumExpOffset
is the observed feature value minus the expected value:
d/dα = sᵧ − Σᵢ softmaxOffset(s,r,α)ᵢ · sᵢ.
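The gradient identity can be checked against a finite difference (a sketch; the helper names and the feature values are assumptions):

```python
import math

def softmax_offset(s, r, alpha):
    z = [math.exp(alpha * si + ri) for si, ri in zip(s, r)]
    t = sum(z)
    return [x / t for x in z]

def log_lik(s, r, alpha, y):
    """Log conditional likelihood: alpha*s[y] + r[y] - logSumExpOffset."""
    Z = sum(math.exp(alpha * si + ri) for si, ri in zip(s, r))
    return alpha * s[y] + r[y] - math.log(Z)

s, r, y, alpha = [2.0, 0.5, -1.0], [0.1, 0.0, 0.3], 0, 0.7
# Analytic gradient: observed minus expected feature value
q = softmax_offset(s, r, alpha)
grad = s[y] - sum(qi * si for qi, si in zip(q, s))
# Central finite difference
h = 1e-6
fd = (log_lik(s, r, alpha + h, y) - log_lik(s, r, alpha - h, y)) / (2 * h)
assert abs(grad - fd) < 1e-5
```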
Bayesian optimality: The normalized posterior maximizes weighted log-likelihood on the probability simplex.
For weights wᵢ ≥ 0 with C = Σwᵢ > 0, and any distribution L on the simplex (Lᵢ > 0, Σ Lᵢ = 1), the normalized posterior p*ᵢ = wᵢ/C satisfies:
Σᵢ wᵢ · log(Lᵢ) ≤ Σᵢ wᵢ · log(p*ᵢ)
This is used for listener optimality: with wᵢ = P(m)·S(u|m), the Bayesian listener L*(m|u) = wᵢ/C maximizes Σ_m P(m)·S(u|m)·log L(m|u).
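A small numerical check of this optimality claim under made-up weights (`weighted_ll` is a local helper):

```python
import math

w = [0.3, 0.6, 0.1]            # weights, e.g. P(m) * S(u|m)
C = sum(w)
p_star = [wi / C for wi in w]  # normalized posterior

def weighted_ll(L):
    """Weighted log-likelihood of a positive distribution L."""
    return sum(wi * math.log(Li) for wi, Li in zip(w, L))

# The posterior beats every other positive distribution tried:
for L in ([1/3, 1/3, 1/3], [0.2, 0.7, 0.1], [0.5, 0.25, 0.25]):
    assert weighted_ll(L) <= weighted_ll(p_star) + 1e-12
```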
Uniqueness of the Ratio Scale #
Theorem 4 (proved earlier as policy_eq_of_proportional) shows that
proportional scores yield the same policy. The converse — that identical
policies force proportional scores — completes the uniqueness characterization:
same policy ↔ proportional scores, with proportionality constant
k = totalScore₂/totalScore₁.
Note: This is distinct from the "Independence-of-Unit Condition" in §1.F (pp. 28–33), which concerns state-dependent transformations f satisfying f(kv) = kf(v). That condition addresses how scale values transform across experimental conditions, not the uniqueness of the scale within a condition.
Converse of Theorem 4 (uniqueness of ratio scale):
If two agents with positive total scores have the same policy,
then their scores are proportional with constant k = total₂/total₁.
Full uniqueness characterization (Theorem 4 and its converse): Two agents with positive total scores have the same policy if and only if their scores are proportional.
Alternative Forms of Axiom 1 #
@cite{luce-1959} proves three equivalent formulations of the choice axiom:
(a) Ratio form: There exists a positive function v such that
P(x, T) = v(x) / Σ_{y∈T} v(y) for all x ∈ T.
(b) Product rule: P(x, T) = P(x, S) · P(S, T) for x ∈ S ⊆ T,
where P(S, T) = Σ_{y∈S} P(y, T).
(c) Pairwise independence: The odds ratio P(x,{x,y}) · P(y,T) = P(y,{x,y}) · P(x,T) — pairwise ratios are preserved in any superset.
The ratio form (a) is the definition of RationalAction; (a)→(b) is
product_rule and (a)→(c) is pChoice_ratio.
A general choice function on finite subsets: the minimal structure for stating Axiom 1 equivalences without assuming a ratio scale a priori.
- P(a, T): probability of choosing a from set T
- Probabilities are non-negative
- Zero probability outside the choice set
Axiom 1, Form (a): ratio scale representation.
There exists v > 0 such that P(x, T) = v(x) / Σ v(y).
Axiom 1, Form (b): product rule.
P(x, T) = P(x, S) · Σ_{y∈S} P(y, T) for x ∈ S ⊆ T.
Axiom 1, Form (c): pairwise independence (IIA).
The odds ratio is preserved in any superset:
P(x,T) · P(y,{x,y}) = P(y,T) · P(x,{x,y}).
(a) → (b): Ratio form implies product rule (Appendix 1).
(a) → (c): Ratio form implies pairwise IIA (Appendix 1).
(c) → (a): Pairwise IIA implies ratio form (Appendix 1). The ratio scale is constructed by fixing a reference element x₀ and setting v(x) = P(x, {x, x₀}) / P(x₀, {x, x₀}). Requires normalization (probabilities sum to 1) and strict positivity for elements in the choice set.
Axiom 1 equivalence (Appendix 1): Ratio form ↔ pairwise IIA (under normalization and positivity).
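The (c) → (a) construction can be sketched numerically: given choice probabilities generated by some hidden ratio scale, pairwise odds against a reference element recover that scale up to a constant (all names below are illustrative):

```python
v_true = {"x0": 1.0, "x": 2.5, "y": 4.0}  # hidden ratio scale

def P(a, C):
    """Choice probabilities induced by the hidden scale."""
    return v_true[a] / sum(v_true[b] for b in set(C))

# Reconstruct the scale from pairwise choices against the reference x0:
v = {x: P(x, {x, "x0"}) / P("x0", {x, "x0"}) for x in v_true}

# v agrees with v_true up to the constant v_true["x0"]:
for x in v_true:
    assert abs(v[x] - v_true[x] / v_true["x0"]) < 1e-12
# and reproduces the full ratio rule on the whole set:
T = list(v_true)
for x in T:
    assert abs(P(x, T) - v[x] / sum(v[y] for y in T)) < 1e-12
```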