Softmax Optimality: The Decision-Theoretic Foundation of Soft-Rational Agents #
The deepest claim in the RSA architecture is that the softmax agent IS the
entropy-regularized expected-utility maximizer. This file makes that connection
explicit by bridging `DecisionProblem` (ℚ) from `DecisionTheory.lean` with
`RationalAction` (ℝ) from `RationalAction.lean`.
Results #
- Construction (§1): A `DecisionProblem` yields a `RationalAction` via softmax over expected utility, bridging the ℚ/ℝ gap.
- Monotonicity (§2): Higher EU ⟹ higher choice probability (α > 0).
- Entropy-regularized optimality (§3): The softmax agent uniquely maximizes ∑ p(a)·EU(a) + (1/α)·H(p) — the Gibbs Variational Principle in decision-theoretic language.
- Hard-max convergence (§4): As α → ∞, the softmax agent converges to the EU-optimal deterministic policy.
- Round-trip (§5): Extracting utility from a softmax agent and reconstructing recovers the same policy.
Expected utility cast to ℝ for interfacing with softmax/RationalAction.
Equations
- Core.expectedUtilityR dp a = ∑ w : W, ↑(dp.prior w) * ↑(dp.utility w a)
Instances For
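The equation above is just a prior-weighted sum over worlds. A minimal numeric sketch in plain Python, illustrating the mathematics rather than the Lean API (the world/action labels and `expected_utility` helper are hypothetical):

```python
# EU(a) = ∑_w prior(w) · utility(w, a), a prior-weighted sum over worlds.
prior = {"w1": 0.25, "w2": 0.75}                      # agent's beliefs (hypothetical)
utility = {("w1", "a"): 1.0, ("w2", "a"): 0.0,        # utility(w, a) table (hypothetical)
           ("w1", "b"): 0.0, ("w2", "b"): 1.0}

def expected_utility(a):
    """Prior-weighted expected utility of action a."""
    return sum(prior[w] * utility[(w, a)] for w in prior)

print(expected_utility("a"))  # 0.25
print(expected_utility("b"))  # 0.75
```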
EU in ℝ is non-negative when utility and prior are non-negative.
ℚ-ordering of EU is preserved under the cast to ℝ.
Softmax agent from a decision problem: score(a) = exp(α · EU(a)).
The state type is Unit because the decision problem's prior already
encodes the agent's beliefs — there is no external state to condition on.
Equations
- Core.RationalAction.fromDecisionProblem dp α = Core.RationalAction.fromSoftmax (fun (x : Unit) (a : A) => Core.expectedUtilityR dp a) α
Instances For
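The construction scores each action by exp(α · EU(a)) and normalizes; because the decision problem's prior already encodes the beliefs, there is nothing to condition on. A numeric sketch of that construction (plain Python illustrating the definition, not the Lean API; `softmax_policy` is a hypothetical name):

```python
import math

def softmax_policy(eu, alpha):
    """Choice distribution p(a) ∝ exp(alpha · EU(a))."""
    # Subtract the max score for numerical stability; this does not change
    # the result, since softmax is translation-invariant.
    m = max(eu.values())
    scores = {a: math.exp(alpha * (v - m)) for a, v in eu.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

eu = {"a": 0.25, "b": 0.75}       # expected utilities (hypothetical)
p = softmax_policy(eu, alpha=2.0)
print(p)                          # "b" gets more mass than "a"
```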
Higher EU implies higher choice probability (α > 0).
Strict version: strictly higher EU implies strictly higher probability.
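Both monotonicity statements follow from the strict monotonicity of exp with a shared normalizer. A numeric check of the weak, strict, and tie cases at once (a sketch, not the Lean proof):

```python
import math, itertools

def softmax_policy(eu, alpha):
    """p(a) ∝ exp(alpha · EU(a)), computed stably."""
    m = max(eu.values())
    s = {a: math.exp(alpha * (v - m)) for a, v in eu.items()}
    z = sum(s.values())
    return {a: x / z for a, x in s.items()}

eu = {"a1": 0.1, "a2": 0.4, "a3": 0.4, "a4": 0.9}  # note the tie a2/a3
p = softmax_policy(eu, alpha=3.0)

for a, b in itertools.product(eu, eu):
    if eu[a] < eu[b]:
        assert p[a] < p[b]            # strictly higher EU ⟹ strictly higher prob
    elif eu[a] == eu[b]:
        assert abs(p[a] - p[b]) < 1e-12  # equal EU ⟹ equal prob
```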
The softmax agent uniquely maximizes entropy-regularized expected utility: p* = argmax_p [∑ p(a)·EU(a) + (1/α)·H(p)].
This is the decision-theoretic content of the Gibbs Variational Principle.
The objective entropyRegObjective from RationalAction.lean is exactly
∑ p(a)·s(a) + (1/α)·H(p) — we just instantiate s = EU.
The softmax agent is the UNIQUE maximizer: any distribution achieving the same objective value must equal the softmax policy.
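The variational claim can be checked numerically: no distribution scores higher on ∑ p(a)·EU(a) + (1/α)·H(p) than the softmax policy. A sketch comparing the softmax optimum against random alternatives (illustrating the mathematics, not the `entropyRegObjective` definition itself):

```python
import math, random

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def objective(p, eu, alpha):
    """Entropy-regularized expected utility: ∑ p(a)·EU(a) + (1/α)·H(p)."""
    return sum(pi * u for pi, u in zip(p, eu)) + entropy(p) / alpha

eu, alpha = [0.2, 0.5, 0.9], 2.0
z = sum(math.exp(alpha * u) for u in eu)
p_star = [math.exp(alpha * u) / z for u in eu]   # the softmax policy

best = objective(p_star, eu, alpha)
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in eu]
    q = [x / sum(w) for x in w]                  # random alternative policy
    assert objective(q, eu, alpha) <= best + 1e-12
```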
As α → ∞, the softmax agent converges to the EU-optimal deterministic policy: the probability of the unique EU-maximizing action → 1.
This is the decision-theoretic content of tendsto_softmax_infty_at_max
from Softmax.Limits.
As α → ∞, any non-optimal action gets probability → 0.
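Numerically, the hard-max limit is easy to see: as α grows, the unique EU-maximizer absorbs all the probability mass. A sketch (again of the mathematics, not the Lean statement):

```python
import math

def softmax_policy(eu, alpha):
    """p(a) ∝ exp(alpha · EU(a)), computed stably."""
    m = max(eu)
    s = [math.exp(alpha * (u - m)) for u in eu]
    z = sum(s)
    return [x / z for x in s]

eu = [0.2, 0.5, 0.9]   # unique maximizer at index 2
for alpha in [1, 10, 100]:
    p = softmax_policy(eu, alpha)
    print(alpha, p[2])  # probability of the best action grows toward 1
```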
Extract the implicit utility function from a softmax-parameterized agent.
If ra = fromSoftmax u α, then toUtility ra α a = u a + const (up to
the translation invariance of softmax).
Instances For
Round-trip for the full agent: fromSoftmax ∘ toUtility recovers the same policy (the softmax is translation-invariant so the constant cancels).
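The round-trip rests on translation invariance: reading off (1/α)·log p(a) recovers u(a) only up to an additive constant, but that constant cancels when the softmax is re-applied. A numeric sketch (the `to_utility` helper is a hypothetical stand-in for `toUtility`, not its Lean definition):

```python
import math

def softmax(u, alpha):
    """p(a) ∝ exp(alpha · u(a)), computed stably."""
    m = max(u.values())
    s = {a: math.exp(alpha * (v - m)) for a, v in u.items()}
    z = sum(s.values())
    return {a: x / z for a, x in s.items()}

def to_utility(p, alpha):
    """(1/alpha) · log p(a): recovers u(a) up to an additive constant."""
    return {a: math.log(x) / alpha for a, x in p.items()}

u, alpha = {"a": 0.1, "b": 0.6, "c": 0.3}, 2.0
p = softmax(u, alpha)
u2 = to_utility(p, alpha)

# The recovered utilities differ from the originals by one shared constant...
consts = {a: u[a] - u2[a] for a in u}
# ...so re-applying softmax yields the identical policy.
p2 = softmax(u2, alpha)
for a in u:
    assert abs(p[a] - p2[a]) < 1e-12
```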