I am already familiar with categories, functors and monads, so I will skip this.
Compared to next, we now have time-dependence, controls,
and the uncertainty monad is generalised from lists to an arbitrary
monad M.
To compute the flow and trajectories of next, we would
need to know which control should be selected at each step and state: a
policy sequence for n steps.
A non-deterministic system would have multiple edges for the same starting time, state and control, which this picture does not.
A non-deterministic decision network with a single control could look like:
Init (t = 0)
|-[A]-> S0 (t = 1)
\-[A]-> S1 (t = 1)
Informally, solving an SDP is finding the optimal policy across all possible outcomes.
Formally, what it means depends on how we measure reward under
uncertainty: that is, how we compare values of type
M Val.
open import 1Lab.Prelude
open import Meta.Idiom
open import Meta.Bind
open import Data.List
open import Data.Int
open import Data.Rational.Base
open import FPClimate.OUU3
module FPClimate.MM2 where
module _ (M : Effect) ⦃ _ : Bind M ⦄ where
private module M = Effect M
record SDP : Type (lsuc (M.adj lzero)) where
field
X : Nat → Type
Y : (t : Nat) → X t → Type
next : (t : Nat) → (x : X t) → Y t x → M.₀ (X (suc t))
-- The SP monad (note: _>>=_ breaks `uniform`'s invariant)
mutual instance
Map-SP : Map (eff SP)
Map-SP .Map.map f = map λ (a , p) → f a , p
Idiom-SP : Idiom (eff SP)
Idiom-SP .Idiom.pure a = (a , 1) ∷ []
Idiom-SP .Idiom._<*>_ f a = f >>= λ f → a >>= λ a → pure (f a)
Bind-SP : Bind (eff SP)
Bind-SP .Bind._>>=_ s f = s >>= λ (a , p) → f a <&> λ (b , q) → b , p *ℚ q
-- The generation dilemma as an SDP, parametrised by the probability that
-- the control a will take us to state B.
private
data X : Type where
GU GS B BT : X
data Y : Type where
a b : Y
Generation-dilemma : Ratio → SDP (eff SP)
Generation-dilemma risk .SDP.X t = X
Generation-dilemma risk .SDP.Y t GU = Y
Generation-dilemma risk .SDP.Y t _ = ⊤
Generation-dilemma risk .SDP.next t GU a = (GU , 1 -ℚ risk) ∷ (B , risk) ∷ []
Generation-dilemma risk .SDP.next t GU b = pure BT
Generation-dilemma risk .SDP.next t GS _ = pure GS
Generation-dilemma risk .SDP.next t BT _ = pure GS
Generation-dilemma risk .SDP.next t B _ = pure B
Val should be some way to measure the well-being
(wealth?) of a society. In any case, reward should assign
“low” rewards to states B and BT and “high”
rewards to states GU and GS.
A PolicySeq t n specifies how to act for n
steps starting at time t: morally, it is a dependent
function (i : [t..t+n-1]) → Policy i.
Policy sequences grow “backwards” via the _∷_
constructor: if we have policies for n steps starting at
time suc t, and we have a policy for time t,
then we have policies for suc n steps starting at time
t.