Reinforcement Learning
Reinforcement Learning vs. Supervised Machine Learning
Reinforcement Learning
- Reward
- Partial information
- Delayed consequence
- Agent’s choice affects the subsequent data it receives, i.e., non-i.i.d.
Supervised Machine Learning
- Ground-truth labels
- Full information
- Immediate consequence
- Pre-determined data distribution, i.e., i.i.d.
Definition
- Observations: $o_t$, what the agent perceives from the environment at time $t$
- Actions: $a_t$, the choice the agent makes at time $t$
- Rewards: $r_t$, the scalar feedback received after taking action $a_t$
- State: $s_t$, a summary of the history $h_t = (o_1, a_1, r_1, \dots, o_t)$ used to decide the next action
- Policy: $\pi(a \mid s)$, a mapping from states to (a distribution over) actions
- State-action value: $Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$, with discount factor $\gamma$
- State value: $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s\right]$ (see the sketch below)
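A minimal sketch (not from the original notes) of how these quantities relate in code: Monte Carlo estimation of $V^\pi$ at the start state, assuming a simplified `reset()`/`step(action)` environment API; all names here are illustrative.

```python
import numpy as np

def rollout_return(env, policy, gamma=0.99, max_steps=100):
    """Sample one episode under `policy` and return the discounted return G."""
    s = env.reset()
    G, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)                # policy: state -> action
        s, r, done = env.step(a)     # assumed, simplified env API (illustrative)
        G += discount * r
        discount *= gamma
        if done:
            break
    return G

def mc_state_value(env, policy, n_episodes=1000, gamma=0.99):
    """Monte Carlo estimate of V^pi at the start state: average discounted return."""
    returns = [rollout_return(env, policy, gamma) for _ in range(n_episodes)]
    return float(np.mean(returns))
```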
Multi-Armed Bandit
Lower regret bound: any algorithm suffers worst-case regret $R_T = \Omega(\sqrt{KT})$, where $K$ is the number of arms and $T$ is the horizon.
Probability inequalities
- Markov's inequality: for a non-negative random variable $X$ and any $a > 0$,
  $P(X \ge a) \le \dfrac{\mathbb{E}[X]}{a}$
- Chebyshev's inequality: for a random variable $X$ with mean $\mu$ and variance $\sigma^2$, for any $a > 0$,
  $P(|X - \mu| \ge a) \le \dfrac{\sigma^2}{a^2}$
- Hoeffding's inequality: if $X_1, \dots, X_n$ are independent random variables with $X_i \in [a_i, b_i]$ and $S_n = \sum_{i=1}^n X_i$, then for any $t > 0$,
  $P\big(S_n - \mathbb{E}[S_n] \ge t\big) \le \exp\!\left(-\dfrac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$ (a numerical check follows below)
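A quick numerical sanity check of Hoeffding's inequality (my own illustration, not part of the notes): compare the empirical tail probability of a sum of bounded i.i.d. variables against the bound; the parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 100_000, 5.0            # 50 variables in [0, 1], deviation threshold t = 5

# X_i ~ Uniform(0, 1): a_i = 0, b_i = 1, so E[S_n] = n / 2 and sum (b_i - a_i)^2 = n
S = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)

empirical = np.mean(S - n / 2 >= t)        # empirical tail probability
bound = np.exp(-2 * t**2 / n)              # Hoeffding bound exp(-2 t^2 / n)

print(f"empirical P(S - E[S] >= {t}) = {empirical:.5f}")
print(f"Hoeffding bound              = {bound:.5f}")
```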
Bandit problem with an "exploration and exploitation" algorithm
- assume $\mu_1, \dots, \mu_k$ are the expected rewards of the $k$ arms
- define the suboptimality gap $\Delta_i = \max_j \mu_j - \mu_i$
- only consider the case when $k = 2$, and write $\Delta = |\mu_1 - \mu_2|$
- choose the algorithm (explore-then-commit) that:
    - chooses each arm $m$ times
    - finds the empirically best arm $\hat{i} = \arg\max_i \hat{\mu}_i$
    - chooses the arm $\hat{i}$ for the rest of the time
- if we choose each arm $m$ times, the regret is
  $R_T \le m\Delta + (T - 2m)\,\Delta\, P\big(\hat{\mu}_2(m) \ge \hat{\mu}_1(m)\big) \le m\Delta + T\Delta\, e^{-m\Delta^2/4}$
  (the last step uses Hoeffding's inequality, assuming 1-subgaussian rewards and arm 1 optimal)
- change $m$ to minimize the regret: we get that when $m = \big\lceil \tfrac{4}{\Delta^2}\log\tfrac{T\Delta^2}{4} \big\rceil$,
  the regret bound is $R_T \le \Delta + \tfrac{4}{\Delta}\big(1 + \log\tfrac{T\Delta^2}{4}\big) = O\!\big(\tfrac{\log T}{\Delta}\big)$
- when $\Delta \le 2/\sqrt{T}$, the trivial bound $R_T \le T\Delta \le 2\sqrt{T}$ applies,
  so the worst-case regret scales as $O(\sqrt{T})$ up to logarithmic factors (a simulation sketch follows below)
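A minimal simulation sketch of explore-then-commit for two Bernoulli arms (my own illustration, not from the notes); the arm means, horizon, and choice of $m$ are arbitrary.

```python
import numpy as np

def explore_then_commit(means, T, m, rng):
    """Explore-then-commit: pull each arm m times, then commit to the empirically best one."""
    k = len(means)
    est = np.zeros(k)
    best_mean = max(means)
    pseudo_regret = 0.0
    # exploration phase: each arm m times
    for i in range(k):
        est[i] = rng.binomial(1, means[i], size=m).mean()
        pseudo_regret += m * (best_mean - means[i])
    # commit phase: play the empirically best arm for the rest of the horizon
    chosen = int(np.argmax(est))
    pseudo_regret += (T - k * m) * (best_mean - means[chosen])
    return pseudo_regret

rng = np.random.default_rng(0)
means, T, m = [0.5, 0.6], 10_000, 200      # illustrative values
print("ETC pseudo-regret:", np.mean([explore_then_commit(means, T, m, rng) for _ in range(20)]))
```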
UCB algorithm
score function: $\mathrm{UCB}_i(t) = \hat{\mu}_i + \sqrt{\dfrac{2\log t}{N_i(t)}}$, where
- $\hat{\mu}_i$ is the average reward of arm $i$, and $t$ is the total number of times the bandit has been played
- $N_i(t)$ is the number of times arm $i$ has been pulled
derivation
- consider Hoeffding's inequality: for rewards in $[0,1]$, $P(\hat{\mu}_i + u \le \mu_i) \le e^{-2 N_i u^2}$
- let $e^{-2 N_i u^2} = \delta$, i.e. $u = \sqrt{\dfrac{\log(1/\delta)}{2 N_i}}$
- then let $\delta = t^{-4}$, which gives $u = \sqrt{\dfrac{2\log t}{N_i}}$
- take $\hat{\mu}_i + \sqrt{\dfrac{2\log t}{N_i}}$ as the score: an upper confidence bound on $\mu_i$ that holds with probability at least $1 - t^{-4}$ (see the sketch below)
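A minimal UCB1-style sketch (my own illustration, assuming Bernoulli arms; the exploration term matches the score above).

```python
import numpy as np

def ucb(means, T, rng):
    """UCB1-style algorithm: play the arm with the highest upper confidence bound."""
    k = len(means)
    counts = np.zeros(k)       # N_i(t)
    sums = np.zeros(k)         # cumulative reward of arm i
    pseudo_regret = 0.0
    for t in range(1, T + 1):
        if t <= k:
            arm = t - 1                                            # play each arm once first
        else:
            scores = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(scores))
        counts[arm] += 1
        sums[arm] += rng.binomial(1, means[arm])
        pseudo_regret += max(means) - means[arm]
    return pseudo_regret

rng = np.random.default_rng(0)
print("UCB pseudo-regret:", ucb([0.5, 0.6, 0.7], T=10_000, rng=rng))
```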
epoch-greedy algorithm
In each epoch, first pull one arm chosen uniformly at random as the exploration step, then for the rest of the epoch pull the empirically best arm as the exploitation step.
- the exploited arm is fixed within each epoch
- the length of each epoch depends on the current time step (a rough sketch follows below)
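A rough sketch of this epoch structure (my own illustration; the epoch-length schedule below, growing with the epoch index, is just one plausible choice and not from the notes).

```python
import numpy as np

def epoch_greedy(means, n_epochs, rng):
    """Epoch-greedy: one random exploration pull per epoch, then exploit the empirical best."""
    k = len(means)
    counts, sums = np.zeros(k), np.zeros(k)
    pseudo_regret = 0.0
    for epoch in range(1, n_epochs + 1):
        # exploration step: one uniformly random arm
        arm = int(rng.integers(k))
        counts[arm] += 1
        sums[arm] += rng.binomial(1, means[arm])
        pseudo_regret += max(means) - means[arm]
        # exploitation: the empirically best arm, fixed for the rest of the epoch
        best = int(np.argmax(sums / np.maximum(counts, 1)))
        n_exploit = epoch                    # assumed schedule: epochs grow over time
        pseudo_regret += n_exploit * (max(means) - means[best])
    return pseudo_regret
```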
$\epsilon$-greedy algorithm
- constant $\epsilon$: explore a uniformly random arm with probability $\epsilon$ and play the empirically best arm with probability $1 - \epsilon$; the regret then grows linearly, of order $\epsilon T$
- By setting a decaying schedule such as $\epsilon_t \propto t^{-1/3}$, the regret becomes sublinear, of order $O(T^{2/3})$ up to logarithmic factors (see the sketch below)
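A minimal sketch of $\epsilon$-greedy with a decaying schedule (my own illustration; $\epsilon_t = t^{-1/3}$ is an assumed choice).

```python
import numpy as np

def epsilon_greedy(means, T, rng):
    """Epsilon-greedy with a decaying exploration rate eps_t = t^(-1/3)."""
    k = len(means)
    counts, sums = np.zeros(k), np.zeros(k)
    pseudo_regret = 0.0
    for t in range(1, T + 1):
        eps = t ** (-1 / 3)                              # decaying exploration probability
        if rng.random() < eps or counts.min() == 0:
            arm = int(rng.integers(k))                   # explore
        else:
            arm = int(np.argmax(sums / counts))          # exploit
        counts[arm] += 1
        sums[arm] += rng.binomial(1, means[arm])
        pseudo_regret += max(means) - means[arm]
    return pseudo_regret

rng = np.random.default_rng(0)
print("epsilon-greedy pseudo-regret:", epsilon_greedy([0.5, 0.6, 0.7], T=10_000, rng=rng))
```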
Contextual bandit
- each arm $i$ has a feature vector $x_{t,i} \in \mathbb{R}^d$ at time $t$
- the goal is to estimate the unknown parameter $\theta^*$ of the linear reward model $\mathbb{E}[r_{t,i}] = x_{t,i}^\top \theta^*$
- use (regularized) least squares: let $A_t = \lambda I + \sum_{s \le t} x_s x_s^\top$, $b_t = \sum_{s \le t} r_s x_s$, and $\hat{\theta}_t = A_t^{-1} b_t$
regret bound
- the overall lower regret bound is $\Omega(d\sqrt{T})$
LinUCB
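A minimal LinUCB sketch (my own illustration, assuming the ridge-regression estimate above and a confidence width $\alpha\sqrt{x^\top A^{-1} x}$; $\alpha$, the dimension, and the regularizer are arbitrary).

```python
import numpy as np

def linucb_choose(A, b, contexts, alpha=1.0):
    """Pick the arm maximizing x^T theta_hat + alpha * sqrt(x^T A^{-1} x)."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    scores = [x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x) for x in contexts]
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    """Rank-one update of the design matrix and the response vector."""
    A += np.outer(x, x)
    b += reward * x
    return A, b

# usage: d-dimensional features, lambda * I as the initial design matrix
d, lam = 5, 1.0
A, b = lam * np.eye(d), np.zeros(d)
```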
LinTS
Using Bayes' rule, treat $\theta$ as a random variable: maintain a Gaussian posterior $\mathcal{N}(\hat{\theta}_t, \alpha^2 A_t^{-1})$, sample $\tilde{\theta}_t$ from it at each round, and pull the arm maximizing $x_{t,i}^\top \tilde{\theta}_t$.
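A minimal Thompson-sampling step under this Gaussian posterior (my own illustration; `A`, `b`, and `alpha` are as in the LinUCB sketch).

```python
import numpy as np

def lints_choose(A, b, contexts, alpha=1.0, rng=None):
    """Sample theta_tilde ~ N(theta_hat, alpha^2 * A^{-1}) and act greedily on it."""
    rng = rng or np.random.default_rng()
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    theta_tilde = rng.multivariate_normal(theta_hat, alpha**2 * A_inv)
    scores = [x @ theta_tilde for x in contexts]
    return int(np.argmax(scores))
```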
Perturbation-based
Add a certain amount of perturbation to the observed rewards at each step, so that the greedy choice does more exploration.
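A minimal sketch of a perturbed-greedy (follow-the-perturbed-leader style) choice (my own illustration; the Gaussian perturbation and its scale are assumptions).

```python
import numpy as np

def perturbed_greedy_choose(sums, counts, noise_scale=1.0, rng=None):
    """Act greedily on perturbed empirical means: the added noise induces extra exploration."""
    rng = rng or np.random.default_rng()
    means = sums / np.maximum(counts, 1)
    perturbation = rng.normal(0.0, noise_scale / np.sqrt(np.maximum(counts, 1)))
    return int(np.argmax(means + perturbation))
```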