Softmax Optimality: The Decision-Theoretic Foundation of Soft-Rational Agents #
The deepest claim in the RSA architecture is that the softmax agent IS the
entropy-regularized expected-utility maximizer. This file makes that connection
explicit by bridging `DecisionProblem` (ℚ) from `DecisionTheory.lean` with
`RationalAction` (ℝ) from `RationalAction.lean`.
Results #
- Construction (§1): A `DecisionProblem` yields a `RationalAction` via softmax over expected utility, bridging the ℚ/ℝ gap.
- Monotonicity (§2): Higher EU ⟹ higher choice probability (α > 0).
- Entropy-regularized optimality (§3): The softmax agent uniquely maximizes ∑ p(a)·EU(a) + (1/α)·H(p) — the Gibbs Variational Principle in decision-theoretic language.
- Hard-max convergence (§4): As α → ∞, the softmax agent converges to the EU-optimal deterministic policy.
- Round-trip (§5): Extracting utility from a softmax agent and reconstructing recovers the same policy.
Expected utility cast to ℝ for interfacing with softmax/RationalAction.
Equations
- Core.expectedUtilityR dp a = ∑ w : W, ↑(dp.prior w) * ↑(dp.utility w a)
Instances For
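The equation above is just a prior-weighted sum over worlds. A minimal numeric sketch in plain Python, illustrating the mathematics rather than the Lean API (the world/action labels and `expected_utility` helper are hypothetical):

```python
# EU(a) = ∑_w prior(w) · utility(w, a), a prior-weighted sum over worlds.
prior = {"w1": 0.25, "w2": 0.75}                      # agent's beliefs (hypothetical)
utility = {("w1", "a"): 1.0, ("w2", "a"): 0.0,        # utility(w, a) table (hypothetical)
           ("w1", "b"): 0.0, ("w2", "b"): 1.0}

def expected_utility(a):
    """Prior-weighted expected utility of action a."""
    return sum(prior[w] * utility[(w, a)] for w in prior)

print(expected_utility("a"))  # 0.25
print(expected_utility("b"))  # 0.75
```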
EU in ℝ is non-negative when utility and prior are non-negative.
ℚ-ordering of EU is preserved under the cast to ℝ.
Softmax agent from a decision problem: score(a) = exp(α · EU(a)).
The state type is Unit because the decision problem's prior already
encodes the agent's beliefs — there is no external state to condition on.
Equations
- Core.RationalAction.fromDecisionProblem dp α = Core.RationalAction.fromSoftmax (fun (x : Unit) (a : A) => Core.expectedUtilityR dp a) α
Instances For
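The construction scores each action by exp(α · EU(a)) and normalizes; because the decision problem's prior already encodes the beliefs, there is nothing to condition on. A numeric sketch of that construction (plain Python illustrating the definition, not the Lean API; `softmax_policy` is a hypothetical name):

```python
import math

def softmax_policy(eu, alpha):
    """Choice distribution p(a) ∝ exp(alpha · EU(a))."""
    # Subtract the max score for numerical stability; this does not change
    # the result, since softmax is translation-invariant.
    m = max(eu.values())
    scores = {a: math.exp(alpha * (v - m)) for a, v in eu.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

eu = {"a": 0.25, "b": 0.75}       # expected utilities (hypothetical)
p = softmax_policy(eu, alpha=2.0)
print(p)                          # "b" gets more mass than "a"
```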
Higher EU implies higher choice probability (α > 0).
Strict version: strictly higher EU implies strictly higher probability.
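Both monotonicity statements follow from the strict monotonicity of exp with a shared normalizer. A numeric check of the weak, strict, and tie cases at once (a sketch, not the Lean proof):

```python
import math, itertools

def softmax_policy(eu, alpha):
    """p(a) ∝ exp(alpha · EU(a)), computed stably."""
    m = max(eu.values())
    s = {a: math.exp(alpha * (v - m)) for a, v in eu.items()}
    z = sum(s.values())
    return {a: x / z for a, x in s.items()}

eu = {"a1": 0.1, "a2": 0.4, "a3": 0.4, "a4": 0.9}  # note the tie a2/a3
p = softmax_policy(eu, alpha=3.0)

for a, b in itertools.product(eu, eu):
    if eu[a] < eu[b]:
        assert p[a] < p[b]            # strictly higher EU ⟹ strictly higher prob
    elif eu[a] == eu[b]:
        assert abs(p[a] - p[b]) < 1e-12  # equal EU ⟹ equal prob
```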
The softmax agent uniquely maximizes entropy-regularized expected utility: p* = argmax_p [∑ p(a)·EU(a) + (1/α)·H(p)].
This is the decision-theoretic content of the Gibbs Variational Principle.
The objective entropyRegObjective from RationalAction.lean is exactly
∑ p(a)·s(a) + (1/α)·H(p) — we just instantiate s = EU.
The softmax agent is the UNIQUE maximizer: any distribution achieving the same objective value must equal the softmax policy.
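The variational claim can be checked numerically: no distribution scores higher on ∑ p(a)·EU(a) + (1/α)·H(p) than the softmax policy. A sketch comparing the softmax optimum against random alternatives (illustrating the mathematics, not the `entropyRegObjective` definition itself):

```python
import math, random

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def objective(p, eu, alpha):
    """Entropy-regularized expected utility: ∑ p(a)·EU(a) + (1/α)·H(p)."""
    return sum(pi * u for pi, u in zip(p, eu)) + entropy(p) / alpha

eu, alpha = [0.2, 0.5, 0.9], 2.0
z = sum(math.exp(alpha * u) for u in eu)
p_star = [math.exp(alpha * u) / z for u in eu]   # the softmax policy

best = objective(p_star, eu, alpha)
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in eu]
    q = [x / sum(w) for x in w]                  # random alternative policy
    assert objective(q, eu, alpha) <= best + 1e-12
```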
As α → ∞, the softmax agent converges to the EU-optimal deterministic policy: the probability of the unique EU-maximizing action → 1.
This is the decision-theoretic content of tendsto_softmax_infty_at_max
from Softmax.Limits.
As α → ∞, any non-optimal action gets probability → 0.
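Numerically, the hard-max limit is easy to see: as α grows, the unique EU-maximizer absorbs all the probability mass. A sketch (again of the mathematics, not the Lean statement):

```python
import math

def softmax_policy(eu, alpha):
    """p(a) ∝ exp(alpha · EU(a)), computed stably."""
    m = max(eu)
    s = [math.exp(alpha * (u - m)) for u in eu]
    z = sum(s)
    return [x / z for x in s]

eu = [0.2, 0.5, 0.9]   # unique maximizer at index 2
for alpha in [1, 10, 100]:
    p = softmax_policy(eu, alpha)
    print(alpha, p[2])  # probability of the best action grows toward 1
```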
Extract the implicit utility function from a softmax-parameterized agent.
If ra = fromSoftmax u α, then toUtility ra α a = u a + const (up to
the translation invariance of softmax).
Instances For
Round-trip for the full agent: fromSoftmax ∘ toUtility recovers the same policy (the softmax is translation-invariant so the constant cancels).
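The round-trip rests on translation invariance: reading off (1/α)·log p(a) recovers u(a) only up to an additive constant, but that constant cancels when the softmax is re-applied. A numeric sketch (the `to_utility` helper is a hypothetical stand-in for `toUtility`, not its Lean definition):

```python
import math

def softmax(u, alpha):
    """p(a) ∝ exp(alpha · u(a)), computed stably."""
    m = max(u.values())
    s = {a: math.exp(alpha * (v - m)) for a, v in u.items()}
    z = sum(s.values())
    return {a: x / z for a, x in s.items()}

def to_utility(p, alpha):
    """(1/alpha) · log p(a): recovers u(a) up to an additive constant."""
    return {a: math.log(x) / alpha for a, x in p.items()}

u, alpha = {"a": 0.1, "b": 0.6, "c": 0.3}, 2.0
p = softmax(u, alpha)
u2 = to_utility(p, alpha)

# The recovered utilities differ from the originals by one shared constant...
consts = {a: u[a] - u2[a] for a in u}
# ...so re-applying softmax yields the identical policy.
p2 = softmax(u2, alpha)
for a in u:
    assert abs(p[a] - p2[a]) < 1e-12
```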