Reinforcement Learning

Reinforcement Learning vs. Supervised Machine Learning

Reinforcement Learning

  • Reward
  • Partial information
  • Delayed consequence
  • Agent’s choice affects the subsequent data it receives, i.e., non-i.i.d.

Supervised Machine Learning

  • Ground-truth labels
  • Full information
  • Immediate consequence
  • Pre-determined data distribution, i.e., i.i.d.

Definitions

  • Observations: $o_t$, what the agent perceives from the environment at step $t$
  • Actions: $a_t$, the choice the agent makes at step $t$
  • Rewards: $r_t$, the scalar feedback received after taking $a_t$
  • State: $s_t$, a summary of the history sufficient for choosing actions
  • Policy: $\pi(a \mid s)$, the agent's mapping from states to (distributions over) actions
  • State-action value: $Q^{\pi}(s, a)$, the expected return from taking action $a$ in state $s$ and following $\pi$ afterwards
  • State value: $V^{\pi}(s)$, the expected return from state $s$ when following $\pi$

Multi-Armed Bandit

Lower regret bound

$$\liminf_{T\to\infty}\frac{R_T}{\ln T}\;\ge\;\sum_{i:\,\Delta_i>0}\frac{\Delta_i}{\mathrm{KL}(p_i\,\|\,p^*)}$$

where $K$ is the number of arms, $p^*$ is the reward distribution of the best arm, $p_i$ is the reward distribution of arm $i$, and $\Delta_i$ is its gap to the best arm.

Probability inequalities

  • Markov's inequality: for a non-negative random variable $X$ and any $a > 0$,
    $$\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}$$

  • Chebyshev's inequality: for a random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $a > 0$,
    $$\Pr(|X - \mu| \ge a) \le \frac{\sigma^2}{a^2}$$

  • Hoeffding's inequality: if $X_1, \dots, X_n$ are independent random variables with $X_i \in [a_i, b_i]$, $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, and $\mu = \mathbb{E}[\bar{X}]$, then for any $t > 0$ (a numerical sanity check follows this list),
    $$\Pr(\bar{X} - \mu \ge t) \le \exp\!\Big(-\frac{2n^2 t^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big)$$
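As a quick numerical check of Hoeffding's inequality, a minimal Monte-Carlo sketch for $[0,1]$-valued variables (the sample size, threshold, and uniform distribution are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 20_000

# X_i ~ Uniform[0, 1]: bounded in [0, 1] with mean mu = 0.5.
deviations = rng.uniform(size=(trials, n)).mean(axis=1) - 0.5

empirical = np.mean(deviations >= t)     # estimated P(X_bar - mu >= t)
bound = np.exp(-2 * n * t**2)            # Hoeffding bound for [0, 1] variables

print(f"empirical tail {empirical:.5f} <= Hoeffding bound {bound:.5f}")
```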

Bandit problem with the "explore and commit" algorithm

  • Assume $\mu_1, \dots, \mu_k$ are the expected rewards of the $k$ arms
  • Define the gaps $\Delta_i = \mu^* - \mu_i$, where $\mu^* = \max_i \mu_i$
  • Only consider the case when $k = 2$, writing $\Delta$ for the single gap
  • Choose the algorithm (explore and commit) that (a code sketch follows the regret analysis below):
    1. chooses each arm $m$ times
    2. finds the empirically best arm
    3. chooses that arm for the rest of the time

The regret is

$$R_T \le m\Delta + (T - 2m)\,\Delta\,e^{-m\Delta^2/2},$$

since the exploration phase costs at most $m\Delta$ and, by Hoeffding's inequality, the probability of committing to the wrong arm is at most $e^{-m\Delta^2/2}$.

Choosing $m$ to minimize the regret (setting the derivative of the upper bound $m\Delta + T\Delta e^{-m\Delta^2/2}$ to zero), we get the minimum when

$$m = \frac{2}{\Delta^2}\ln\frac{T\Delta^2}{2}.$$

The regret bound is then

$$R_T \le \frac{2}{\Delta}\Big(1 + \ln\frac{T\Delta^2}{2}\Big) = O\!\Big(\frac{\ln(T\Delta^2)}{\Delta}\Big).$$

When $\Delta$ is small, the regret $\frac{T\Delta}{2}$ of the naive algorithm that picks an arm uniformly at random may actually be smaller than the bound above.

So, taking the better of the two,

$$R_T \le \min\Big\{\frac{T\Delta}{2},\; \frac{2}{\Delta}\Big(1 + \ln\frac{T\Delta^2}{2}\Big)\Big\} = O(\sqrt{T}).$$
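A minimal sketch of the two-armed explore-and-commit algorithm with the choice of $m$ derived above, assuming Bernoulli rewards (the means, horizon, and number of runs are illustrative choices, not from the notes):

```python
import numpy as np

def explore_then_commit(mu1, mu2, m, T, rng):
    """Two-armed ETC: pull each arm m times, then commit to the empirically
    best arm for the remaining T - 2m rounds. Returns the pseudo-regret."""
    delta = abs(mu1 - mu2)
    est1 = rng.binomial(1, mu1, size=m).mean()   # exploration phase
    est2 = rng.binomial(1, mu2, size=m).mean()
    wrong = (est1 >= est2) != (mu1 >= mu2)       # committed to the wrong arm?
    return m * delta + (T - 2 * m) * delta * wrong

T, mu1, mu2 = 10_000, 0.6, 0.5                   # gap Delta = 0.1
delta = mu1 - mu2
m = int(np.ceil(2 / delta**2 * np.log(T * delta**2 / 2)))  # m* from the derivation above
regrets = [explore_then_commit(mu1, mu2, m, T, np.random.default_rng(s)) for s in range(100)]
print(f"m = {m}, mean pseudo-regret over 100 runs: {np.mean(regrets):.1f}")
```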

UCB algorithm

Score function:

$$\mathrm{UCB}_i(t) = \hat{\mu}_i + \sqrt{\frac{2\ln t}{n_i}}$$

where

  1. $\hat{\mu}_i$ is the average reward of arm $i$
  2. $t$ is the total number of times the bandit has been played
  3. $n_i$ is the number of times arm $i$ has been pulled

Derivation

By Hoeffding's inequality,

$$\Pr\big(\mu_i \ge \hat{\mu}_i + u\big) \le e^{-2 n_i u^2}.$$

Setting the right-hand side to $\delta$ and solving gives $u = \sqrt{\frac{\ln(1/\delta)}{2 n_i}}$. Then letting $\delta = t^{-4}$ gives $u = \sqrt{\frac{2\ln t}{n_i}}$, so the upper confidence bound is

$$\hat{\mu}_i + \sqrt{\frac{2\ln t}{n_i}},$$

which is exactly the score function of the UCB algorithm.
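A minimal UCB1 sketch using this score function, assuming Bernoulli arms (the means and horizon below are illustrative choices):

```python
import numpy as np

def ucb1(means, T, rng):
    """UCB1: pull each arm once, then always pull the arm with the
    highest score  mu_hat_i + sqrt(2 ln t / n_i)."""
    k = len(means)
    n = np.zeros(k)        # n_i: number of pulls of arm i
    s = np.zeros(k)        # cumulative reward of arm i
    for i in range(k):     # initialization: pull each arm once
        n[i], s[i] = 1, rng.binomial(1, means[i])
    for t in range(k + 1, T + 1):
        score = s / n + np.sqrt(2 * np.log(t) / n)  # the UCB score function
        i = int(np.argmax(score))
        n[i] += 1
        s[i] += rng.binomial(1, means[i])
    return T * max(means) - (n * means).sum()       # pseudo-regret

print(f"pseudo-regret: {ucb1([0.6, 0.5, 0.4], 10_000, np.random.default_rng(0)):.1f}")
```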

Epoch-greedy algorithm

In each epoch, first pull one arm chosen at random (exploration), then for the rest of the epoch pull the empirically best arm (exploitation). A sketch follows this list.

  1. Within each epoch, the exploited arm is fixed
  2. The length of each epoch depends on the current time step
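A rough sketch of the scheme, assuming only the exploration pulls update the reward estimates and using a linearly growing epoch length (both are illustrative choices, not from the notes):

```python
import numpy as np

def epoch_greedy(means, n_epochs, rng):
    """Each epoch: one uniformly random exploration pull, then exploit the
    empirically best arm for `ep` more pulls (fixed within the epoch).
    Only the exploration pulls update the reward estimates."""
    k = len(means)
    n, s = np.zeros(k), np.zeros(k)          # exploration counts / reward sums
    pulls = np.zeros(k)                      # all pulls, for the regret
    for ep in range(1, n_epochs + 1):
        i = int(rng.integers(k))             # explore: one random arm
        n[i] += 1
        s[i] += rng.binomial(1, means[i])
        pulls[i] += 1
        best = int(np.argmax(s / np.maximum(n, 1)))
        pulls[best] += ep                    # exploit a fixed arm for ep pulls
    T = pulls.sum()
    return T * max(means) - (pulls * means).sum()

print(f"pseudo-regret: {epoch_greedy([0.6, 0.5], 200, np.random.default_rng(0)):.1f}")
```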

ε-greedy algorithm

A constant $\varepsilon$ leads to linear regret, while a properly decreasing $\varepsilon_t$ leads to logarithmic regret.

By setting $\varepsilon_t = \min\{1, \frac{cK}{d^2 t}\}$, with $0 < d \le \min_{i:\Delta_i > 0} \Delta_i$, we have logarithmic regret on the order of $\frac{K \ln T}{d^2}$ (see the sketch below).
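A sketch of ε-greedy with this decaying schedule (the constants $c$, $d$ and the Bernoulli arms are illustrative choices):

```python
import numpy as np

def eps_greedy(means, T, rng, c=5.0, d=0.1):
    """epsilon-greedy with the decaying schedule eps_t = min(1, c*K / (d^2 * t))."""
    k = len(means)
    n, s = np.zeros(k), np.zeros(k)
    for t in range(1, T + 1):
        eps = min(1.0, c * k / (d**2 * t))
        if rng.random() < eps or n.min() == 0:
            i = int(rng.integers(k))          # explore: uniform random arm
        else:
            i = int(np.argmax(s / n))         # exploit: empirically best arm
        n[i] += 1
        s[i] += rng.binomial(1, means[i])
    return T * max(means) - (n * means).sum() # pseudo-regret

print(f"pseudo-regret: {eps_greedy([0.6, 0.5], 10_000, np.random.default_rng(0)):.1f}")
```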

Contextual bandit

Each arm $i$ has a feature vector $x_i \in \mathbb{R}^d$, and pulling it yields the reward

$$r = x_i^\top \theta^* + \eta,$$

where $\eta$ is zero-mean noise.

The goal is to estimate $\theta^*$.

Using least squares: stack the observed feature vectors as the rows of $X$ and the observed rewards into $r$; we want $X\theta \approx r$, and minimizing $\|X\theta - r\|^2$ gives

$$\hat{\theta} = (X^\top X)^{-1} X^\top r.$$
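A small numerical sketch of this estimator on synthetic data (the dimension, sample count, and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1_000
theta_star = rng.normal(size=d)                  # unknown parameter (synthetic)

X = rng.normal(size=(n, d))                      # observed feature vectors
r = X @ theta_star + 0.1 * rng.normal(size=n)    # rewards: x^T theta* + noise

# Least-squares estimate: theta_hat = (X^T X)^{-1} X^T r
theta_hat = np.linalg.solve(X.T @ X, X.T @ r)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))
```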

Regret bound

The overall lower regret bound is $\Omega(d\sqrt{T})$.

LinUCB
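LinUCB applies the UCB idea to the linear model above: with the ridge statistics $A = \lambda I + \sum_s x_s x_s^\top$ and $b = \sum_s r_s x_s$, score each arm by $x_i^\top \hat{\theta} + \alpha \sqrt{x_i^\top A^{-1} x_i}$ and pull the argmax. A minimal sketch ($\alpha$, $\lambda$, and the synthetic arms below are illustrative choices):

```python
import numpy as np

def linucb(arms, theta_star, T, rng, alpha=1.0, lam=1.0):
    """LinUCB: ridge estimate theta_hat = A^{-1} b with A = lam*I + sum x x^T,
    b = sum r*x; pull the arm maximizing x^T theta_hat + alpha*sqrt(x^T A^{-1} x)."""
    d = arms.shape[1]
    A, b = lam * np.eye(d), np.zeros(d)
    regret, best_mean = 0.0, (arms @ theta_star).max()
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b
        width = np.sqrt(np.einsum("id,dk,ik->i", arms, A_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + alpha * width))]
        r = x @ theta_star + 0.1 * rng.normal()   # linear reward plus noise
        A += np.outer(x, x)
        b += r * x
        regret += best_mean - x @ theta_star
    return regret

rng = np.random.default_rng(0)
arms = rng.normal(size=(10, 5))                   # 10 arms with 5-d features
print(f"pseudo-regret: {linucb(arms, rng.normal(size=5), 2_000, rng):.1f}")
```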

LinTS

Taking a Bayesian view, treat $\theta$ as Gaussian-distributed; maintain the Gaussian posterior given the observed data, and at each round sample $\tilde{\theta}$ from the posterior and act greedily with respect to it.
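A minimal sketch of this sampling rule, assuming a Gaussian posterior $\mathcal{N}(A^{-1}b,\, v^2 A^{-1})$ over $\theta$ (the posterior scale $v$ and the synthetic data are illustrative choices):

```python
import numpy as np

def lin_ts(arms, theta_star, T, rng, v=0.5, lam=1.0):
    """Linear Thompson sampling: posterior N(theta_hat, v^2 * A^{-1});
    sample theta_tilde each round and pull argmax_i x_i^T theta_tilde."""
    d = arms.shape[1]
    A, b = lam * np.eye(d), np.zeros(d)
    regret, best_mean = 0.0, (arms @ theta_star).max()
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_tilde = rng.multivariate_normal(A_inv @ b, v**2 * A_inv)  # posterior sample
        x = arms[int(np.argmax(arms @ theta_tilde))]
        r = x @ theta_star + 0.1 * rng.normal()
        A += np.outer(x, x)
        b += r * x
        regret += best_mean - x @ theta_star
    return regret

rng = np.random.default_rng(0)
arms = rng.normal(size=(10, 5))
print(f"pseudo-regret: {lin_ts(arms, rng.normal(size=5), 2_000, rng):.1f}")
```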

Perturbation-based

Each time, add a certain perturbation to the observed rewards, so that the greedy choice performs more exploration.
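One simple instance of this idea for the $k$-armed case, as a sketch: act greedily on mean estimates perturbed by Gaussian noise whose scale shrinks with each arm's pull count (the noise form and scale are illustrative choices, not from the notes):

```python
import numpy as np

def perturbed_greedy(means, T, rng, scale=1.0):
    """Perturbation-based exploration: greedy on reward estimates perturbed
    by Gaussian noise that shrinks as an arm gathers more samples."""
    k = len(means)
    n = np.ones(k)                                            # one initial pull per arm
    s = np.array([float(rng.binomial(1, mu)) for mu in means])
    for _ in range(k, T):
        noisy = s / n + scale * rng.normal(size=k) / np.sqrt(n)  # perturbed estimates
        i = int(np.argmax(noisy))
        n[i] += 1
        s[i] += rng.binomial(1, means[i])
    return T * max(means) - (n * means).sum()                 # pseudo-regret

print(f"pseudo-regret: {perturbed_greedy([0.6, 0.5], 10_000, np.random.default_rng(0)):.1f}")
```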