Artificial Intelligence (COSE361)

[Artificial Intelligence] Chapter 9. Markov Decision Process (1)

이준언 2023. 10. 25. 15:17

Markov Decision Processes

  • 다음과 같은 요소로 정의
    • A set of states s from S
    • A set of actions a from A
    • A transition function T(s, a, s’)
      • the probability that taking action a in state s leads to s'
      • Also called the model or the dynamics
    • A reward function R(s, a, s’)
    • A start state
    • Maybe a terminal state
  • MDPs are non-deterministic search problems
    • One way to solve them is with expectimax search
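
As a quick illustration, the components above can be written down as plain Python types. This is only a sketch; the names (State, Action, TransitionFn, RewardFn) are illustrative, not from the lecture.

```python
# A minimal sketch of the MDP ingredients as Python type aliases.
from typing import Callable

State = str
Action = str

# T(s, a, s'): the probability that taking action a in state s lands in s'
TransitionFn = Callable[[State, Action, State], float]

# R(s, a, s'): the reward received for that transition
RewardFn = Callable[[State, Action, State], float]
```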

What is Markov about MDPs?

  • Markov
    • generally means that given the present state, the future and the past are (conditionally) independent
    • For Markov decision processes, ‘Markov’ means action outcomes depend only on the current state
    • This is similar to search, where the successor function can depend only on the current state, not the history
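
In symbols, the Markov property says the distribution over the next state is unaffected by anything that happened before the current state and action:

```latex
\[
P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, \ldots, S_0=s_0)
  = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)
\]
```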

Policies

  • In deterministic single-agent search problems,
    • we wanted an optimal plan
    • or sequence of actions, from start to a goal
  • For MDPs, we want an optimal policy
    • it gives an action for each state
    • An optimal policy is one that maximizes expected utility if followed
    • An explicit policy defines a reflex agent
  • Expectimax didn’t compute entire policies
    • It computed the action for a single state only
  • Example: Racing
    • A robot car wants to travel far, quickly
    • states
      • Cool
      • Warm
      • Overheated
    • actions
      • Slow
      • Fast
    • Going faster gets double reward
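
Since a policy is just a mapping from states to actions, a policy for the racing example above can be written as a plain dictionary. The choice below is only illustrative; whether it is optimal depends on the transition probabilities and rewards.

```python
# One possible policy for the racing car (illustrative, not necessarily optimal):
# drive fast while cool, slow down when warm; Overheated is terminal.
policy = {
    "Cool": "Fast",
    "Warm": "Slow",
    "Overheated": None,  # terminal state: no action to take
}
```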

Discounting

  • It is reasonable to maximize the sum of rewards
  • It is also reasonable to prefer rewards now to rewards later
  • One solution: values of rewards decay exponentially
  • How to discount?
    • Each time we descend a level in the search tree, multiply in the discount once
  • Why discount?
    • Sooner rewards have higher utility than rewards received later
    • It also helps our algorithms converge
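
Concretely, a reward received t steps in the future is worth γ^t times as much as the same reward now. For example, with γ = 0.5 the reward sequence [1, 2, 3] is worth 1 + 0.5·2 + 0.25·3 = 2.75.

```latex
\[
U([r_0, r_1, r_2, \ldots]) \;=\; r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
  \;=\; \sum_{t \ge 0} \gamma^t r_t, \qquad 0 < \gamma \le 1
\]
```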

Stationary Preferences

  • Theorem
    • If we assume stationary preferences (prepending the same reward to two reward sequences does not change which one is preferred),
    • then there are only two ways to define utilities: additive or discounted (written out below)
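
The post omits the equations from the slides; the standard statement of this theorem is roughly the following.

```latex
% Stationary preferences: prepending the same reward r to two reward
% sequences does not change which sequence is preferred.
\[
[a_1, a_2, \ldots] \succ [b_1, b_2, \ldots]
  \;\Longleftrightarrow\;
[r, a_1, a_2, \ldots] \succ [r, b_1, b_2, \ldots]
\]

% Under this assumption, utility must be either additive ...
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots
\]

% ... or discounted:
\[
U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
\]
```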

Infinite Utilities

  • Problem
    • What if the game lasts forever? Do we get infinite rewards?
  • Solutions
    • Finite horizon (similar to depth-limited search)
      • Terminate episodes after a fixed number of steps T
      • This gives nonstationary policies (the best action can depend on the time left)
    • Discounting
      • use 0<γ<1
      • Smaller γ means smaller ‘horizon’
      • shorter term focus
    • Absorbing state
      • Guarantee that, for every policy, a terminal state will eventually be reached
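
Discounting in particular rules out infinite utilities: even over an infinite horizon, the discounted sum of rewards is bounded by a geometric series.

```latex
\[
\left|\sum_{t \ge 0} \gamma^t r_t\right|
  \;\le\; \sum_{t \ge 0} \gamma^t R_{\max}
  \;=\; \frac{R_{\max}}{1 - \gamma},
  \qquad R_{\max} = \max_{s,a,s'} \bigl|R(s, a, s')\bigr|
\]
```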

Solving MDPs

  • Optimal Quantities
    • The value (utility) of a state s
      • V*(s)
        • expected utility starting in s and acting optimally
    • The value (utility) of a q-state (s, a)
      • Q*(s, a)
        • expected utility starting out having taken action a from state s and (thereafter) acting optimally
    • The optimal policy
      • π*(s)
        • the optimal action to take from state s (the recursive definitions of V*, Q*, and π* are written out after this list)
  • Values of States
    • Fundamental operation
      • compute the expectimax value of each state
      • i.e., the expected utility under optimal action
      • the average of the sum of rewards
      • this is just what expectimax computed
  • Racing Search Tree
    • expectimax
    • Problem 1: states are repeated
      • Idea: compute needed quantities only once
    • Problem 2: the tree goes on forever
      • Idea: do a depth-limited computation, but increase the depth until the change is small
  • Time-Limited Values
    • V_k(s)
      • the optimal value of s
      • if the game ends in k more time steps
      • equivalently, what a depth-k expectimax would compute from s
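
As referenced above, these quantities satisfy the standard recursive definitions, which are what value iteration in the next section computes:

```latex
\[
V^*(s) = \max_a Q^*(s, a)
\]
\[
Q^*(s, a) = \sum_{s'} T(s, a, s')\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]
\]
\[
\pi^*(s) = \arg\max_a Q^*(s, a)
\]
```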

Value Iteration

  • Value Iteration
    • Start with V_0(s) = 0
    • No time steps left means an expected sum of rewards equal to 0
    • Given the vector of V_k(s) values, do one ply of expectimax from each state to get V_(k+1)(s)
    • Repeat until convergence
    • Complexity of each iteration: O(A * S^2)
    • Theorem
      • will converge to unique optimal values
      • Basic idea: approximations get refined towards optimal values
      • Policy may converge long before values do
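
Below is a minimal Python sketch of the algorithm on the racing example. Each update is one ply of expectimax, V_(k+1)(s) = max_a Σ_(s') T(s, a, s') [R(s, a, s') + γ V_k(s')]. The transition probabilities and rewards in the table are illustrative assumptions (the post does not list the lecture's exact numbers), chosen so that driving fast doubles the reward but risks overheating.

```python
# Value iteration on a small racing MDP (numbers are illustrative assumptions).
# T[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9

T = {
    "Cool": {
        "Slow": [(1.0, "Cool", 1.0)],
        "Fast": [(0.5, "Cool", 2.0), (0.5, "Warm", 2.0)],
    },
    "Warm": {
        "Slow": [(0.5, "Cool", 1.0), (0.5, "Warm", 1.0)],
        "Fast": [(1.0, "Overheated", -10.0)],
    },
    "Overheated": {},  # terminal state: no actions available
}

def q_value(V, outcomes, gamma):
    # Q(s, a) = sum over s' of T(s, a, s') * (R(s, a, s') + gamma * V(s'))
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)

def value_iteration(T, gamma=GAMMA, tol=1e-6):
    V = {s: 0.0 for s in T}  # V_0(s) = 0 for every state
    while True:
        # One ply of expectimax from every state: V_{k+1}(s) = max_a Q_k(s, a)
        V_new = {
            s: max((q_value(V, out, gamma) for out in acts.values()), default=0.0)
            for s, acts in T.items()
        }
        if max(abs(V_new[s] - V[s]) for s in T) < tol:  # repeat until convergence
            return V_new
        V = V_new

def extract_policy(T, V, gamma=GAMMA):
    # The optimal action is the argmax over actions of the Q-value
    return {
        s: max(acts, key=lambda a: q_value(V, acts[a], gamma)) if acts else None
        for s, acts in T.items()
    }

V_star = value_iteration(T)
print(V_star)                     # converged state values
print(extract_policy(T, V_star))  # {'Cool': 'Fast', 'Warm': 'Slow', 'Overheated': None}
```

With these assumed numbers, driving fast from Cool is worth the risk while driving fast from Warm is not; changing the rewards or γ can flip that decision.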

Convergence

  • How do we know the V_k vectors are going to converge?
    • Case 1
      • If the tree has maximum depth M, then V_M holds the actual untruncated values
    • Case 2
      • If the discount is less than 1
      • Sketch: for any state, V_k and V_(k+1) can both be viewed as depth k+1 expectimax results computed over nearly identical search trees
      • The difference:
        • on the bottom layer, V_(k+1) has actual rewards while V_k has zeros
        • that last layer is at best all R_max
        • and at worst all R_min
        • but everything on that layer is discounted by γ^k
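
Putting the sketch together: the two trees differ only in the (discounted) bottom layer, so successive estimates get close geometrically fast and therefore converge.

```latex
\[
\bigl|V_{k+1}(s) - V_k(s)\bigr|
  \;\le\; \gamma^{k} \max_{s,a,s'} \bigl|R(s,a,s')\bigr|
  \;\longrightarrow\; 0 \quad \text{as } k \to \infty
\]
```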

Summary

  • MDPs can model problems where the outcomes of actions are uncertain
  • Solving an MDP means finding the optimal policy
    • The optimal policy provides the optimal action to take at each state
  • Q-states are introduced to model uncertainty in the outcomes of an action
  • Values of states and q-values of q-states are defined recursively
  • Values of states can be computed through the value iteration algorithm