Artificial Intelligence (COSE361)

[Artificial Intelligence] Chapter 9. Markov Decision Process (1)

이준언 2023. 10. 25. 15:17

Markov Decision Processes

  • 다음과 같은 요소로 정의
    • A set of states s from S
    • A set of actions a from A
    • A transition function T(s, a, s’)
      • the probability that taking action a in state s leads to s'
      • Also called the model or the dynamics
    • A reward function R(s, a, s’)
    • A start state
    • Maybe a terminal state
  • MDPs are non-deterministic search problems
    • One way to solve them is with expectimax search
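
As a quick illustration, the components above can be written down as plain Python types. This is only a sketch; the names (State, Action, TransitionFn, RewardFn) are illustrative, not from the lecture.

```python
# A minimal sketch of the MDP ingredients as Python type aliases.
from typing import Callable

State = str
Action = str

# T(s, a, s'): the probability that taking action a in state s lands in s'
TransitionFn = Callable[[State, Action, State], float]

# R(s, a, s'): the reward received for that transition
RewardFn = Callable[[State, Action, State], float]
```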

What is Markov about MDPs?

  • Markov
    • generally means that given the present state, the future and the past are (conditionally) independent
    • For Markov decision processes, ‘Markov’ means action outcomes depend only on the current state
    • This is similar to search, where the successor function can depend only on the current state, not the history
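
In symbols, the Markov property says the distribution over the next state is unaffected by anything that happened before the current state and action:

```latex
\[
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, \ldots, S_0=s_0)
  = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
\]
```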

Policies

  • In deterministic single-agent search problems,
    • we wanted an optimal plan
    • or sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy
    • it gives an action for each state
    • An optimal policy is one that maximizes expected utility if followed
    • An explicit policy defines a reflex agent
  • Expectimax didn’t compute entire policies
    • It computed the action for a single state only
  • Example: Racing
    • A robot car wants to travel far, quickly
    • states
      • Cool
      • Warm
      • Overheated
    • actions
      • Slow
      • Fast
    • Going faster gets double reward
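
Since a policy is just a mapping from states to actions, a policy for the racing example above can be written as a plain dictionary. The choice below is only illustrative; whether it is optimal depends on the transition probabilities and rewards.

```python
# One possible policy for the racing car (illustrative, not necessarily optimal):
# drive fast while cool, slow down when warm; Overheated is terminal.
policy = {
    "Cool": "Fast",
    "Warm": "Slow",
    "Overheated": None,  # terminal state: no action to take
}
```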

Discounting

  • It is reasonable to maximize the sum of rewards
  • It is also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially
  • How to discount?
    • Each time we descend a level in the search tree, multiply in the discount once
  • Why discount?
    • Sooner rewards have higher utility than rewards received later
    • It also helps our algorithms converge
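
Concretely, a reward received t steps in the future is worth γ^t times as much as the same reward now. For example, with γ = 0.5 the reward sequence [1, 2, 3] is worth 1 + 0.5·2 + 0.25·3 = 2.75.

```latex
\[
U([r_0, r_1, r_2, \ldots]) \;=\; r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
  \;=\; \sum_{t \ge 0} \gamma^t r_t, \qquad 0 < \gamma \le 1
\]
```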

Stationary Preferences

  • Theorem
    • If we assume stationary preferences (prepending the same reward to two reward sequences does not change which one is preferred),
    • then there are only two ways to define utilities: additive or discounted (written out below)
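
The post omits the equations from the slides; the standard statement of this theorem is roughly the following.

```latex
% Stationary preferences: prepending the same reward r to two reward
% sequences does not change which sequence is preferred.
\[
[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots]
  \;\Longleftrightarrow\;
[r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]
\]

% Under this assumption, utility must be either additive ...
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots
\]

% ... or discounted:
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
\]
```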

Infinite Utilities

  • Problem
    • What if the game lasts forever? Do we get infinite rewards?
  • Solutions
    • Finite horizon (similar to depth-limited search)
      • Terminate episodes after a fixed number of steps T
      • This gives nonstationary policies (the best action can depend on the time left)
    • Discounting
      • use 0<γ<1
      • Smaller γ means smaller ‘horizon’
      • shorter term focus
    • Absorbing state
      • Guarantee that, for every policy, a terminal state will eventually be reached
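
Discounting in particular rules out infinite utilities: even over an infinite horizon, the discounted sum of rewards is bounded by a geometric series.

```latex
\[
\left|\sum_{t \ge 0} \gamma^t r_t\right|
  \;\le\; \sum_{t \ge 0} \gamma^t R_{\max}
  \;=\; \frac{R_{\max}}{1 - \gamma},
  \qquad R_{\max} = \max_{s,a,s'} \bigl|R(s, a, s')\bigr|
\]
```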

Solving MDPs

  • Optimal Quantities
    • The value (utility) of a state s
      • V*(s)
        • expected utility starting in s and acting optimally
    • The value (utility) of a q-state (s, a)
      • Q*(s, a)
        • expected utility starting out having taken action a from state s and (thereafter) acting optimally
    • The optimal policy
      • π*(s)
        • the optimal action to take from state s (the recursive definitions of V*, Q*, and π* are written out after this list)
  • Values of States
    • Fundamental operation
      • compute the expectimax value of each state
      • i.e., the expected utility under optimal action
      • the average of the sum of rewards
      • this is just what expectimax computed
  • Racing Search Tree
    • expectimax
    • Problem 1: states are repeated
      • Idea: compute needed quantities only once
    • Problem 2: the tree goes on forever
      • Idea: do a depth-limited computation, but increase the depth until the change is small
  • Time-Limited Values
    • V_k(s)
      • the optimal value of s
      • if the game ends in k more time steps
      • equivalently, what a depth-k expectimax would compute from s
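
As referenced above, these quantities satisfy the standard recursive definitions, which are what value iteration in the next section computes:

```latex
\[
V^*(s) = \max_a Q^*(s, a)
\]
\[
Q^*(s, a) = \sum_{s'} T(s, a, s')\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]
\]
\[
\pi^*(s) = \arg\max_a Q^*(s, a)
\]
```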

Value Iteration

  • Value Iteration
    • Start with V_0(s) = 0
    • No time steps left means an expected sum of rewards equal to 0
    • Given the vector of V_k(s) values, do one ply of expectimax from each state to get V_(k+1)(s)
    • Repeat until convergence
    • Complexity of each iteration: O(A * S^2)
    • Theorem
      • will converge to unique optimal values
      • Basic idea: approximations get refined towards optimal values
      • Policy may converge long before values do
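
Below is a minimal Python sketch of the algorithm on the racing example. Each update is one ply of expectimax, V_(k+1)(s) = max_a Σ_(s') T(s, a, s') [R(s, a, s') + γ V_k(s')]. The transition probabilities and rewards in the table are illustrative assumptions (the post does not list the lecture's exact numbers), chosen so that driving fast doubles the reward but risks overheating.

```python
# Value iteration on a small racing MDP (numbers are illustrative assumptions).
# T[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9

T = {
    "Cool": {
        "Slow": [(1.0, "Cool", 1.0)],
        "Fast": [(0.5, "Cool", 2.0), (0.5, "Warm", 2.0)],
    },
    "Warm": {
        "Slow": [(0.5, "Cool", 1.0), (0.5, "Warm", 1.0)],
        "Fast": [(1.0, "Overheated", -10.0)],
    },
    "Overheated": {},  # terminal state: no actions available
}

def q_value(V, outcomes, gamma):
    # Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(T, gamma=GAMMA, tol=1e-6):
    V = {s: 0.0 for s in T}  # V_0(s) = 0 for every state
    while True:
        # One ply of expectimax from every state: V_{k+1}(s) = max_a Q_k(s, a)
        V_new = {
            s: max((q_value(V, out, gamma) for out in acts.values()), default=0.0)
            for s, acts in T.items()
        }
        if max(abs(V_new[s] - V[s]) for s in T) < tol:  # repeat until convergence
            return V_new
        V = V_new

def extract_policy(T, V, gamma=GAMMA):
    # The optimal action is the argmax over actions of the Q-value
    return {
        s: max(acts, key=lambda a: q_value(V, acts[a], gamma)) if acts else None
        for s, acts in T.items()
    }

V_star = value_iteration(T)
print(V_star)                     # converged state values
print(extract_policy(T, V_star))  # {'Cool': 'Fast', 'Warm': 'Slow', 'Overheated': None}
```

With these assumed numbers, driving fast from Cool is worth the risk while driving fast from Warm is not; changing the rewards or γ can flip that decision.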

Convergence

  • How do we know the V_k vectors are going to converge?
    • Case 1
      • If the tree has maximum depth M, then V_M holds the actual untruncated values
    • Case 2
      • If the discount is less than 1
      • Sketch: for any state, V_k and V_(k+1) can both be viewed as depth k+1 expectimax results computed over nearly identical search trees
      • The difference:
        • on the bottom layer, V_(k+1) has actual rewards while V_k has zeros
        • that last layer is at best all R_max
        • and at worst all R_min
        • but everything on that layer is discounted by γ^k
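
Putting the sketch together: the two trees differ only in the (discounted) bottom layer, so successive estimates get close geometrically fast and therefore converge.

```latex
\[
\bigl|V_{k+1}(s) - V_k(s)\bigr|
  \;\le\; \gamma^{k} \max_{s,a,s'} \bigl|R(s,a,s')\bigr|
  \;\longrightarrow\; 0 \quad \text{as } k \to \infty
\]
```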

Summary

  • MDPs can model problems where the outcomes of actions are uncertain
  • Solving an MDP means finding the optimal policy
    • The optimal policy provides the optimal action to take at each state
  • Q-states are introduced to model uncertainty in the outcomes of an action
  • Values of states and q-values of q-states are defined recursively
  • Values of states can be computed through the value iteration algorithm