# Reinforcement Learning: Markov Decision Processes

Optimal control primarily deals with continuous MDPs.
Partially observable problems can be converted into MDPs.

A Markov process is a tuple $\langle S, P \rangle$:
- $S$ is a (finite) set of states
- $P$ is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
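As a concrete illustration, here is a minimal sketch of a Markov chain in Python; the three states and the transition probabilities are invented for the example, not taken from these notes:

```python
import numpy as np

states = ["sunny", "rainy", "cloudy"]     # hypothetical state set S
P = np.array([[0.8, 0.1, 0.1],            # P[i, j] = P[S_{t+1}=j | S_t=i]
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)
s = 0                                     # start in "sunny"
trajectory = [states[s]]
for _ in range(5):
    s = rng.choice(len(states), p=P[s])   # next state depends only on the current state
    trajectory.append(states[s])
print(trajectory)
```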

A Markov reward process is a Markov chain with values.

A Markov Reward Process is a tuple $\langle S, P, R, \gamma \rangle$:
- $S$ is a finite set of states
- $P$ is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $R$ is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$

The return $G_t$ is the total discounted reward from time-step $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.
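The return is easy to compute from a sampled reward sequence; the rewards and the discount factor below are made-up values:

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]                   # sampled R_{t+1}, R_{t+2}, ...
G = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G)                                          # 1.0 + 0.81*2.0 + 0.729*(-1.0) = 1.891
```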

The state value function $v(s)$ of an MRP is the expected return starting from state $s$:
$v(s) = \mathbb{E}[G_t \mid S_t = s]$

Bellman Equation for MRPs:

The value function can be decomposed into two parts:
- the immediate reward $R_{t+1}$
- the discounted value of the successor state, $\gamma v(S_{t+1})$

$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = R_s + \gamma \sum_{s' \in S} P_{ss'} v(s')$
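Treating $v$ as a vector, the decomposition can be applied to every state at once; the two-state MRP below ($P$, $R$, $\gamma$) is an invented example:

```python
import numpy as np

P = np.array([[0.8, 0.2],     # invented transition matrix
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # invented reward function
gamma = 0.9

v = np.zeros(2)               # some current value estimate
v = R + gamma * P @ v         # one backup: immediate reward + discounted successor value
print(v)
```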

The Bellman equation can be written concisely in matrix form, $v = R + \gamma P v$, and solved directly as $v = (I - \gamma P)^{-1} R$. (This matrix form is the one most often used in computation.)

Computational complexity is $O(n^3)$ for $n$ states, so the direct solution is only feasible for small MRPs.
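For a small MRP the direct solution is one linear solve of $(I - \gamma P)v = R$; this sketch reuses the invented two-state MRP from above:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
gamma = 0.9

v = np.linalg.solve(np.eye(2) - gamma * P, R)   # the O(n^3) direct solve
print(v)
```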
There are many iterative methods for large MRPs, e.g.:
- Dynamic programming (a sketch follows the list)
- Monte-Carlo evaluation
- Temporal-Difference learning

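As a sketch of the dynamic-programming approach from the list above, the Bellman backup can be iterated to its fixed point instead of solving the linear system directly:

```python
import numpy as np

P = np.array([[0.8, 0.2],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    v_new = R + gamma * P @ v                # one Bellman backup for every state
    if np.max(np.abs(v_new - v)) < 1e-10:    # converged to the fixed point
        break
    v = v_new
print(v_new)                                 # agrees with the direct solution
```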

A Markov decision process (MDP) is a Markov reward process with decisions (actions). It is an environment in which all states are Markov. An MDP is a tuple $\langle S, A, P, R, \gamma \rangle$, where $A$ is a finite set of actions and the transition matrix and reward function are conditioned on the action: $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ and $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.

A policy $\pi$ is a distribution over actions given states:
$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
MDP policies depend only on the current state (not the history), and policies are stationary (time-independent).
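A stochastic policy can be represented as a table of probabilities with one row per state; the numbers below are invented for illustration:

```python
import numpy as np

pi = np.array([[0.7, 0.3],               # π(.|s=0) over actions {0, 1}
               [0.4, 0.6]])              # π(.|s=1)

rng = np.random.default_rng(0)
s = 0
a = rng.choice(pi.shape[1], p=pi[s])     # draw A_t ~ π(.|S_t): depends only on s
print(a)
```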

The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

The optimal state-value function $v_*(s) = \max_\pi v_\pi(s)$ and the optimal action-value function $q_*(s, a) = \max_\pi q_\pi(s, a)$ specify the best possible performance in the MDP.
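Given a state-value estimate $v$, action values follow from one action-conditioned backup, $q(s, a) = R^a_s + \gamma \sum_{s'} P^a_{ss'} v(s')$; the two-state, two-action MDP below is invented for illustration:

```python
import numpy as np

P = np.array([[[0.9, 0.1],     # P[a, s, s'], invented dynamics
               [0.2, 0.8]],
              [[0.6, 0.4],
               [0.5, 0.5]]])
R = np.array([[1.0, 0.0],      # R[a, s], invented rewards
              [0.5, 2.0]])
gamma = 0.9
v = np.array([3.0, 4.0])       # some state-value estimate

q = R + gamma * (P @ v)        # q[a, s] = R[a, s] + γ Σ_{s'} P[a, s, s'] v[s']
print(q)
```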

Once we know $q_*(s, a)$, we immediately have the optimal policy. An optimal policy can be found by maximising over $q_*(s, a)$:

$\pi_*(a \mid s) = 1$ if $a = \operatorname{argmax}_{a' \in A} q_*(s, a')$, and $\pi_*(a \mid s) = 0$ otherwise.

There is always a deterministic optimal policy for any MDP.
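Extracting such a deterministic policy is a per-state argmax over an action-value table; the $q_*$ values below are invented:

```python
import numpy as np

q_star = np.array([[1.0, 2.5],        # q*(s=0, a) for a in {0, 1}
                   [3.0, 0.5]])       # q*(s=1, a)

pi_star = np.argmax(q_star, axis=1)   # one greedy action per state
print(pi_star)                        # -> [1 0]
```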

Solving the Bellman Optimality Equation:

$v_*(s) = \max_a \left( R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} v_*(s') \right)$

The Bellman optimality equation is non-linear and has no closed-form solution (in general), so iterative solution methods are used:
- Value Iteration (sketched below)
- Policy Iteration
- Q-learning
- Sarsa
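A minimal value-iteration sketch for the invented two-state, two-action MDP used earlier, iterating $v(s) \leftarrow \max_a \big( R^a_s + \gamma \sum_{s'} P^a_{ss'} v(s') \big)$ until convergence:

```python
import numpy as np

P = np.array([[[0.9, 0.1],                 # P[a, s, s'], invented dynamics
               [0.2, 0.8]],
              [[0.6, 0.4],
               [0.5, 0.5]]])
R = np.array([[1.0, 0.0],                  # R[a, s], invented rewards
              [0.5, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * (P @ v)                # backup every (action, state) pair
    v_new = q.max(axis=0)                  # greedy max over actions
    if np.max(np.abs(v_new - v)) < 1e-10:  # converged
        break
    v = v_new
print(v_new, q.argmax(axis=0))             # v* and a greedy (deterministic) optimal policy
```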