Proof that Bellman update is a contraction
Mia Lopez
Following the proof that Bellman update is a contraction from Instructor's resources to Artificial Intelligence: A Modern Approach I fail to understand 1 step. Some definitions and proofs that were used in the final step:
definition of max norm: $||U||=\max_s|U(s)|$
definition of Bellman update: $U(s) = R(s) + \gamma\max_{a\in A(s)}\sum_{s'}P(s'|s,a)U(s')$
$U_i$ - vector of utilities for all states at the $i$th iteration.
statement: $|max_a f(a) - max_a g(a)| \leq max_a|f(a)-g(a)|$
$|(BU_i-BU'_i)(s)| = |R(s) + \gamma\underset{a\in A(s)}{\max}|\sum_{s'}P(s'|s,a)U_i(s') - R(s) \\- \gamma\max_{a\in A(s)}\sum_{s'}P(s'|s,a)U'_i(s')| \\= \gamma|\max_{a\in A(s)}\sum_{s'}P(s'|s,a)U_i(s')-\max_{a\in A(s)}\sum_{s'}P(s'|s,a)U'_i(s')| \\ \leq \gamma \max_{a\in A(s)}|\sum_{s'}P(s'|s,a)U_i(s')-\sum_{s'}P(s'|s,a)U'_i(s')| \\ = \gamma|\sum_{s'}P(s'|s,a^*(s))U_i(s')-\sum_{s'}P(s'|s,a^*(s))U'_i(s')|\\ = \gamma|\sum_{s'}P(s'|s,a^*(s))(U_i(s')-U'_i(s')|$
then the final part of proof where we use the above: $||BU_i-BU'_i|| = \max_s|(BU_i-BU'_i)(s)| \\ \leq \gamma\max_s|\sum_{s'}P(s'|s,a^*(s))(U_i(s')-U'_i(s'))| \\ \leq \gamma\max_s|U_i(s)-U'_i(s)|\\ =\gamma||U_i-U'_i||$
How do I get the inequality $\gamma\max_s|\sum_{s'}P(s'|s,a^*(s))(U_i(s')-U'_i(s'))| \leq \gamma\max_s|U_i(s)-U'_i(s)|$?
$\endgroup$1 Answer
$\begingroup$It's a bit of an old question, but thought I'd answer it, anyways.
Note that (a) $$ P(s'|s,a^*(s)) \ge 0, \sum_{s'} P(s'|s,a^*(s)) = 1 $$ and (b), that $$ U_i(s') - U_i'(s') \le \max_s | U_i(s) - U_i'(s)| $$ using both of these statements yields $$ \begin{align} \sum_{s'} \left(P(s'|s,a^*(s))(U_i(s') - U_i'(s'))\right) &\le \sum_{s'}\left( P(s'|s,a^*(s))\max_s | U_i(s) - U_i'(s)|\right)\\ &=\left( \sum_{s'}P(s'|s,a^*(s))\right)\max_s | U_i(s) - U_i'(s)|\\ &=\max_s | U_i(s) - U_i'(s)| \end{align} $$ for all $s$, from which the final result follows.
$\endgroup$