CS294 Deep Reinforcement Learning Notes


Lectures

The “REPL”

The Roles

  • Given by Agent

    \[\pi_\theta(\mathbf{u}_t \vert \mathbf{o}_t) \,, \mathbf{o}_t \rightarrow \mathbf{x}_t\]
    • \(\pi\) as policy, \(\theta\) as parameters in policy model

      Note
If the action space is discrete, consider using Monte-Carlo Tree Search.

    • \(\mathbf{u}\), \(\mathbf{x}\), \(\mathbf{o}\) as action, ground-truth state and observed state respectively, \(t\) for current time step

  • Given by Environment

    \[\mathrm{p}(\mathbf{x}_{t+1} \vert \mathbf{x}_t, \mathbf{u}_t)\]
    • \(\mathrm{p}\) (stochastic) or \(f\) (deterministic) for the state transition rules built into the environment
  • Defined by User

    \[r(\mathbf{x}, \mathbf{u})\]
    • \(r\) for reward, or \(c := -r\) for cost

Note
Clever audiences may have found the 3 places to apply an NN in an RL model by now.

  • Agent: learn Policies
  • Environment: learn Dynamics
  • Feedback: learn Rewards

The Goal

With constraints (collocation method), the objective becomes

\[\min_{\mathbf{u}_1, \dots, \mathbf{u}_T, \mathbf{x}_1, \dots, \mathbf{x}_T} \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t) \text{ such that } \mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{u}_{t-1}) \hspace{1em}\cdots\cdots\cdots\cdots \, (1)\]

LQR Algorithm

LQR, or Linear Quadratic Regulator.

Goal: solve \(\text{Eq. }(1)\), expanded as

\[\min_{\mathbf{u}_1, \dots, \mathbf{u}_T} c(\mathbf{x}_1, \mathbf{u}_1) + c(f(\mathbf{x}_1, \mathbf{u}_1), \mathbf{u}_2) + \cdots + c(f(f(\dots)\dots), \mathbf{u}_T) \hspace{1em}\cdots\cdots\cdots\cdots \, (1')\]

Here is an example with known linear dynamics and quadratic cost (i.e. all \(\mathbf{C}_t\) and \(\mathbf{F}_t\) are known and can be substituted in).

Linear Dynamics System

\[f(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t \hspace{1em}\cdots\cdots\cdots\cdots \, (2)\]

Note
\(\mathbf{x}\) may contain derivatives of its elements, e.g. velocity \(\dot{x}\) and acceleration \(\ddot{x}\).

Linear Quadratic Cost

\[c(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{C}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{c}_t \hspace{1em}\cdots\cdots\cdots\cdots \, (3)\]

Methodology

Solve symbols backwards (from last action \(\mathbf{u}_T\) to first state \(\mathbf{x}_1\)), fill values forwards (from first action \(\mathbf{u}_1\) to last state \(\mathbf{x}_T\)).

\[\begin{align} &\text{/* 1. Backward recursion */} \\ &\text{for } t = T \text{ to } 1 \text{:} \\ &\hspace{1em} \mathbf{Q}_t = \mathbf{C}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{F}_t \\ &\hspace{1em} \mathbf{q}_t = \mathbf{c}_t + \mathbf{F}_t^T \mathbf{V}_{t+1} \mathbf{f}_t + \mathbf{F}_t^T \mathbf{v}_{t+1} \\ &\hspace{1em} Q(\mathbf{x}_t, \mathbf{u}_t) = \mathrm{const} + \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{Q}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{q}_t \hspace{2em} \text{(total cost from now until end if take } \mathbf{u}_t \text{ from state } \mathbf{x}_t \text{)} \\ &\hspace{1em} \mathbf{u}_t \leftarrow \arg \min_{\mathbf{u}_t} Q(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t \\ &\hspace{1em} \mathbf{K}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t} \\ &\hspace{1em} \mathbf{k}_t = -\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}^{-1} \mathbf{q}_{\mathbf{u}_t} \\ &\hspace{1em} \mathbf{V}_t = \mathbf{Q}_{\mathbf{x}_t, \mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{K}_t + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{x}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{K}_t \\ &\hspace{1em} \mathbf{v}_t = \mathbf{q}_{\mathbf{x}_t} + \mathbf{Q}_{\mathbf{x}_t, \mathbf{u}_t} \mathbf{k}_t + \mathbf{K}_t^T \mathbf{q}_{\mathbf{u}_t} + \mathbf{K}_t^T \mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t} \mathbf{k}_t \\ &\hspace{1em} V(\mathbf{x}_t) = \mathrm{const} + \frac{1}{2} \mathbf{x}_t^T \mathbf{V}_t \mathbf{x}_t + \mathbf{x}_t^T \mathbf{v}_t \hspace{2em} \text{(total cost from now until end from state } \mathbf{x}_t \text{)} \\ &\\ &\because \, \mathbf{x}_1 \text{ is given} \\ &\\ &\text{/* 2. Forward recursion */} \\ &\text{for } t = 1 \text{ to } T \text{:} \\ &\hspace{1em} \mathbf{u}_t = \mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t \\ &\hspace{1em} \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) \end{align}\]
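
The recursion above translates almost directly into NumPy. Below is a minimal sketch (not from the lecture), assuming the time-varying \(\mathbf{F}_t, \mathbf{f}_t, \mathbf{C}_t, \mathbf{c}_t\) are given, the action block \(\mathbf{Q}_{\mathbf{u}_t, \mathbf{u}_t}\) is invertible, and \(\mathbf{V}_{T+1} = 0\); the function name and the double-integrator usage example are made up for illustration.

```python
import numpy as np

def lqr(F, f, C, c, x1, T, n, m):
    """Finite-horizon LQR: backward pass for K_t, k_t, forward pass for x_t, u_t.

    F[t]: (n, n+m), f[t]: (n,)     -- linear dynamics  x_{t+1} = F_t [x; u] + f_t
    C[t]: (n+m, n+m), c[t]: (n+m,) -- quadratic cost   1/2 [x;u]^T C_t [x;u] + [x;u]^T c_t
    """
    V = np.zeros((n, n))           # V_{T+1} = 0: no cost accrues after the horizon
    v = np.zeros(n)
    K, k = [None] * T, [None] * T

    # 1. Backward recursion: solve for the feedback gains.
    for t in reversed(range(T)):
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ V @ f[t] + F[t].T @ v
        Qxx, Qxu = Q[:n, :n], Q[:n, n:]
        Qux, Quu = Q[n:, :n], Q[n:, n:]
        qx, qu = q[:n], q[n:]
        Quu_inv = np.linalg.inv(Quu)
        K[t] = -Quu_inv @ Qux
        k[t] = -Quu_inv @ qu
        V = Qxx + Qxu @ K[t] + K[t].T @ Qux + K[t].T @ Quu @ K[t]
        v = qx + Qxu @ k[t] + K[t].T @ qu + K[t].T @ Quu @ k[t]

    # 2. Forward recursion: fill in values, starting from the known x_1.
    xs, us = [x1], []
    for t in range(T):
        u = K[t] @ xs[-1] + k[t]
        x_next = F[t] @ np.concatenate([xs[-1], u]) + f[t]
        us.append(u)
        xs.append(x_next)
    return xs, us

# Tiny usage example: drive a double integrator towards the origin over T steps.
T, n, m = 20, 2, 1
A, B = np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[0.0], [0.1]])
F = [np.hstack([A, B])] * T
f = [np.zeros(n)] * T
C = [np.eye(n + m)] * T
c = [np.zeros(n + m)] * T
xs, us = lqr(F, f, C, c, x1=np.array([1.0, 0.0]), T=T, n=n, m=m)
```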

Flags

  • You’d rather not have to touch this machinery again
  • It can be solved properly and easily with an NN (and escape from Newton’s Method for good)

Nonlinear Case

Approximate linear-quadratic case using Taylor expansion.

  • Differential dynamic programming (or DDP)
  • Iterative LQR
    • For all reference points, run LQR backward recursion on the approximation, then run the forward recursion on the true nonlinear system (with hacks that will not be discussed in this post)

Learn the Model / Dynamics

Goal

Discover \(p(\mathbf{x}_{t+1} \vert \mathbf{x}_t, \mathbf{u}_t)\).

Model-Based RL v1.5

  1. Explore the playground with a particular goal
    • Run base policy \(\pi_0(\mathbf{u}_t \vert \mathbf{x}_t)\) to collect \(\mathcal{D}=\{ (\mathbf{x}, \mathbf{u}, \mathbf{x}')_i \}\)
  2. Learn from experience to generate \(p_{\pi_0}\), the observed model
    • Learn dynamics model \(f(\mathbf{x}, \mathbf{u})\) to minimize \(\sum_i \lVert f(\mathbf{x}_i, \mathbf{u}_i) - \mathbf{x}_i' \rVert ^2\)
  3. Dream of more adventures
    • Backpropagate through \(f(\mathbf{x}, \mathbf{u})\) to choose actions
  4. Take real actions and verify the learnt experience, i.e. test the observed model \(p_{\pi_0}\) against the ground truth \(p_{\pi_f}\) for mismatches, and try to get even closer to our goal
    • [v1.5] (MPC) Execute those actions and observe consequent states \(\mathbf{x}'\)
    • (DAgger) append the resulting data \(\{ (\mathbf{x}, \mathbf{u}, \mathbf{x}')_j \}\) to \(\mathcal{D}\)
    • [v1.5] Back to step 3 to re-plan actions (correct errors before going off the road; a planning algorithm is required), loop N times (or run N in parallel)
    • Back to step 2
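
A minimal sketch of this v1.5 loop, assuming a toy linear environment with a known cost, and swapping in a least-squares linear dynamics model plus random-shooting MPC in place of a neural network and a proper planner; every name and constant here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, H = 2, 1, 10                       # state dim, action dim, planning horizon
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])        # hypothetical ground-truth dynamics x' = A x + B u

def step(x, u):                          # environment transition (with a little noise)
    return A_true @ x + B_true @ u + 0.01 * rng.standard_normal(n)

def cost(x, u):                          # known cost: drive the state to the origin
    return float(x @ x + 0.01 * u @ u)

# 1. Collect data with a random base policy pi_0.
D = []
x = rng.standard_normal(n)
for _ in range(200):
    u = rng.uniform(-1, 1, m)
    x_next = step(x, u)
    D.append((x, u, x_next))
    x = x_next

for _ in range(5):
    # 2. Fit a linear dynamics model f(x, u) = W^T [x; u] by least squares.
    X = np.array([np.concatenate([x, u]) for x, u, _ in D])
    Y = np.array([xn for _, _, xn in D])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    f_hat = lambda x, u: np.concatenate([x, u]) @ W

    # 3 + 4. MPC by random shooting: plan with the model, execute only the first action.
    x = rng.standard_normal(n)
    for _ in range(50):
        best_u, best_c = None, np.inf
        for _ in range(100):             # candidate action sequences
            us = rng.uniform(-1, 1, (H, m))
            xp, c = x.copy(), 0.0
            for u in us:
                c += cost(xp, u)
                xp = f_hat(xp, u)
            if c < best_c:
                best_c, best_u = c, us[0]
        x_next = step(x, best_u)         # take the real action
        D.append((x, best_u, x_next))    # DAgger-style aggregation
        x = x_next
```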

Model-Based RL v2.0

  1. Explore the playground with a particular goal
    • Run base policy \(\pi_0(\mathbf{u}_t \vert \mathbf{x}_t)\) to collect \(\mathcal{D}=\{ (\mathbf{x}, \mathbf{u}, \mathbf{x}')_i \}\)
  2. Learn from experience to generate \(p_{\pi_0}\), the observed model
    • Learn dynamics model \(f(\mathbf{x}, \mathbf{u})\) to minimize \(\sum_i \lVert f(\mathbf{x}_i, \mathbf{u}_i) - \mathbf{x}_i' \rVert ^2\)
  3. Dream of more adventures
    • Backpropagate through \(f(\mathbf{x}, \mathbf{u})\) into the policy (along the whole decision chain) to optimize \(\pi_\theta(\mathbf{u}_t \vert \mathbf{x}_t)\)
    • Run \(\pi_\theta(\mathbf{u}_t \vert \mathbf{x}_t)\)
  4. Take real actions, and try to get even closer to our goal
    • (DAgger) append the resulting data \(\{ (\mathbf{x}, \mathbf{u}, \mathbf{x}')_j \}\) to \(\mathcal{D}\)
    • Back to step 2

Train Policies (Model-Based RL)

  • No need to replan
  • Potentially better generalization

Goal

\[\min_{\tau, \theta} c(\tau) \text{ such that } \mathbf{u}_t = \pi_\theta(\mathbf{x}_t)\]

Dual Gradient Descent (Augmented Lagrangian)

\[\bar{\mathcal{L}}(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda C(\mathbf{x}) + \rho \lVert C(\mathbf{x}) \rVert ^2\]

Where the last squared error term is the augmented part.

  1. Find \(\mathbf{x}^\star \leftarrow \arg \min_\mathbf{x} \bar{\mathcal{L}}(\mathbf{x}, \lambda)\)
  2. Compute \(\frac{dg}{d\lambda} = \frac{d\bar{\mathcal{L}}}{d\lambda}(\mathbf{x}^\star, \lambda)\)
  3. \[\lambda \leftarrow \lambda + \alpha \frac{dg}{d\lambda}\]

So here by now,

\[\bar{\mathcal{L}}(\tau, \theta, \lambda) = c(\tau) + \sum_{t=1}^T \lambda_t (\pi_\theta(\mathbf{x}_t) - \mathbf{u}_t) + \sum_{t=1}^T \rho_t (\pi_\theta(\mathbf{x}_t) - \mathbf{u}_t)^2\]
  1. Find \(\tau \leftarrow \arg \min_\tau \bar{\mathcal{L}}(\tau, \theta, \lambda)\)
    • Find a good trajectory
    • Optimize \(p(\tau)\) w.r.t. some surrogate \(\tilde{c}(\mathbf{x}_t, \mathbf{u}_t)\)
    • Sequential, may use iLQR
  2. Find \(\theta \leftarrow \arg \min_\theta \bar{\mathcal{L}}(\tau, \theta, \lambda)\)
    • learn the policy by imitating the trajectory
    • Optimize \(\theta\) w.r.t. some supervised objective
    • non-sequential, may use SGD
  3. \[\lambda \leftarrow \lambda + \alpha \frac{dg}{d\lambda}\]
    • Do gradient descent
    • Increment or modify dual variables \(\lambda\)
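
A toy sketch of dual gradient descent with the augmented Lagrangian, on a made-up problem \(\min_x \lVert x - a \rVert^2\) subject to \(x_1 + x_2 = 1\); the penalty weight, step sizes, and iteration counts are arbitrary.

```python
import numpy as np

# Toy problem (hypothetical): minimize f(x) = ||x - a||^2 subject to C(x) = x[0] + x[1] - 1 = 0.
a = np.array([2.0, 0.0])
rho, alpha = 10.0, 0.5                      # penalty weight and dual step size

def f(x):  return np.sum((x - a) ** 2)
def C(x):  return x[0] + x[1] - 1.0

lam, x = 0.0, np.zeros(2)
for _ in range(50):
    # 1. Minimize the augmented Lagrangian L_bar(x, lam) = f + lam*C + rho*C^2 over x
    #    (here: plain gradient steps stand in for a proper inner solver).
    for _ in range(200):
        grad = 2 * (x - a) + (lam + 2 * rho * C(x)) * np.array([1.0, 1.0])
        x = x - 0.01 * grad
    # 2 + 3. Dual ascent: dg/dlambda evaluated at x* is just the constraint value C(x*).
    lam = lam + alpha * C(x)

print(x, C(x))   # x should approach the constrained optimum [1.5, -0.5]
```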

Imitating MPC: PLATO Algorithm

  1. Train \(\pi_\theta(\mathbf{u}_t \vert \mathbf{o}_t)\) from human data \(\mathcal{D} = \{ \mathbf{o}_1, \mathbf{u}_1, \dots, \mathbf{o}_N, \mathbf{u}_N \}\)
  2. Run (in real dynamics, e.g. the real world) \(\hat{\pi}(\mathbf{u}_t \vert \mathbf{o}_t)\) to get dataset \(\mathcal{D}_\pi = \{ \mathbf{o}_1, \dots, \mathbf{o}_M \}\)
    • \[\hat{\pi}(\mathbf{u}_t \vert \mathbf{x}_t) = \arg \min_\hat{\pi} \sum_{t' = t}^T E_\hat{\pi} [ c(\mathbf{x}_{t'}, \mathbf{u}_{t'}) ] + \lambda D_{\mathrm{KL}}(\hat{\pi}(\mathbf{u}_t \vert \mathbf{x}_t) \Vert \pi_\theta(\mathbf{u}_t \vert \mathbf{x}_t))\]
      • Keep the newly generated training trajectories different from, but relatively close to, the original ones (which could come from a human intervening manually via MPC), in order to avoid financially unacceptable training cost from physical damage to the test object / environment (e.g. an autopilot drone bumps into a tree: the model is taught a good lesson, but the drone is broken for good)
  3. Ask computer to label \(\mathcal{D}_\pi\) with actions \(\mathbf{u}_t\)
  4. (DAgger) \(\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi\)

Inverse Dynamics Model

Take the \(\arg\min\) of the dynamics equation residual to recover controls and forces.

Model Representation

Dynamics equation

\[M(\mathbf{x}) \ddot{\mathbf{x}} + C(\mathbf{x}, \dot{\mathbf{x}}) \dot{\mathbf{x}} = B \mathbf{u} + J(\mathbf{x})^T \mathbf{f}\]

Inverse dynamics function

\[f^{-1}(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}) = \arg \min_{\mathbf{u}, \mathbf{f}} \lVert \text{ difference of dynamics equation } \rVert ^2\]

Inverse dynamics residual

Likelihood of being on the trajectory \((\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1})\).

\[r^{-1}(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}) = \min_{\mathbf{u}, \mathbf{f}} \lVert \text{ difference of dynamics equation } \rVert ^2\]

Forward Shooting / Direct Collocation Method

Forward Shooting

\[\min_{\mathbf{u}_1, \dots, \mathbf{u}_T} \sum_t c_t(\mathbf{x}_t) \text{ s.t. } \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)\]
  • Optimize over controls
  • State trajectory is implicit
  • Dynamics is an implicit constraint (always satisfied)

Direct Collocation

\[\min_{\mathbf{x}_1, \dots, \mathbf{x}_T} \sum_t c_t(\mathbf{x}_t) \text{ s.t. } f^{-1}(\mathbf{x}_t, \mathbf{x}_{t+1}) = \mathbf{u}_t \in \mathcal{U}\]
  • Optimize over states
  • Controls and forces are implicit
  • Dynamics is an explicit constraint (can be made hard or soft)

Dynamics with Contact

Problems

  • Discontinuous jumps in contact forces (and their number)
  • No gradient information from inactive contacts

Contact-Invariant Optimization

  • \(c_{t, n} = 1\): end effector \(n\) must be touching something and not sliding
  • \(c_{t, n} = 0\): end effector \(n\) is unconstrained

Dynamics Consistency

Assume all forces are active (contact set is constant), but apply high penalties for use of forces where \(c = 0\).

\[f^{-1}(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}) = \arg \min_{\mathbf{u}, \mathbf{f}} \lVert \text{ difference of dynamics equation } \rVert ^2 + \sum_{i} \lVert \mathbf{f}_i \rVert ^2 / (c_i + \epsilon)\]

Policy Drifting

Supervised learning in RL differs from traditional supervised learning (e.g. recognition tasks) in that the steps are no longer independent of each other in time / sequence. Errors in feed-forward results accumulate, driving the model into states that are off the labeled trajectory.

Solution: Noisy Training Data

Simulate expert policies at states that are slightly off, by generating a correction term pointing in the opposite direction of the error.

  • Input: \(\mathbf{x} + \epsilon\)
  • Output: \(\mathbf{u} + \mathbf{K} \epsilon\)

Decompose Policy Optimization Problem

  • Trajectory optimization
    • Stay close to policy
    \[\min_{\mathbf{x}} \sum_t C(\mathbf{x}_t) + \lVert \pi_\theta(\mathbf{x}_t) - \mathbf{u}_t \rVert ^2\]
  • Regression

    \[\min_\theta \sum_{i, t} \lVert \pi_\theta(\mathbf{x}_{i, t}) - \mathbf{u}_{i, t} \rVert ^2\]

Movement with Model Uncertainty

  • Generate noisy models varying:
    • limb mass
    • limb center of mass
    • contact locations
    • etc.
  • Optimize over multiple state trajectories
  • Single control trajectory
  • Execute control trajectory in open loop

Interactive Policies with Online Model Learning

  • Offline
    • Train policy to output desired next state \(\bar{\mathbf{x}}_{t+1}\)
  • At every timestep
    • Learn robot dynamics on the fly from past observations \(\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)\)
    • Query policy for \(\bar{\mathbf{x}}_{t+1}\)
    • Solve for robot torques \(\mathbf{u}^*\) such that \(\bar{\mathbf{x}}_{t+1} = f(\mathbf{x}_t, \mathbf{u}^*)\)

Markov Decision Processes and Solving Finite Problems

Markov Decision Process

Defined by the following components:

  • \(\mathcal{S}\): state space
  • \(\mathcal{A}\): action space
  • \(P(r, s' \vert s, a)\): a transition probability distribution
    • \(r\): reward
    • \(s\), \(s'\): state and next state

Partially Observed MDPs

  • Instead of observing the full state \(s\), the agent observes \(y\) (previously denoted as \(o\)), with \(y \sim P(y \vert s)\)
  • An MDP can be trivially mapped onto a POMDP
  • A POMDP can be mapped onto an MDP

    \[\tilde{s}_0 = \{ y_0 \} ,\, \tilde{s}_1 = \{ y_0, y_1 \} ,\, \tilde{s}_2 = \{ y_0, y_1, y_2 \} ,\, \dots\]

Problems Involving MDPs

  • Policy optimization: maximize expected reward w.r.t. policy \(\pi\)

    \[\max_\pi \mathbb{E} \bigg[ \sum_{t=0}^\infty r_t \bigg]\]
  • Policy evaluation: compute expected return for fixed policy \(\pi\)

    • return := sum of future rewards in an episode, i.e. a trajectory
      • Discounted return: \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
      • Undiscounted return: \(r_t + r_{t+1} + \cdots + r_{T-1} + V(s_T)\)
    • Performance of policy: \(\eta(\pi) = \mathbb{E} [ \sum_{t=0}^\infty \gamma^t r_t ]\)
    • State value function: \(V^\pi(s) = \mathbb{E} [ \sum_{t=0}^\infty \gamma^t r_t \vert s_0 = s ]\)
    • State-action value function: \(Q^\pi(s, a) = \mathbb{E} [ \sum_{t=0}^\infty \gamma^t r_t \vert s_0 = s, a_0 = a ]\)

Policy Evaluation

Value Iteration: Finite Horizon Case

Same general idea as the LQR algorithm.

  • Problem

    \[\max_{\pi_0, \dots, \pi_{T-1}} \mathbb{E}[ r_0 + r_1 + \cdots + r_{T-1} + V_T(s_T) ]\]
  • Swap maxes and expectations

    \[\max_{\pi_0} \mathbb{E} \bigg[ r_0 + \max_{\pi_1} \mathbb{E} \big[ r_1 + \cdots + \max_{\pi_{T-1}} \mathbb{E} [ r_{T-1} + V_T(s_T) ] \big] \bigg]\]
  • Solve innermost problem

    \[\begin{align} &\text{for each } s \in \mathcal{S} \text{ :} \\ &\hspace{1em} \pi_{T-1}(s), V_{T-1}(s) = \underset{a}{\text{maximize}} \, \mathbb{E}_{s_T} [ r_{T-1} + V_T(s_T) ] \end{align}\]
  • Substitute back into the original problem recursively, given

    \[V_{T-1} = \max_{\pi_{T-1}} \mathbb{E} [ r_{T-1} + V_T(s_T) ]\]
Discounted Setting

Dealing with an infinite, or very large, time horizon.

  • Discount factor \(\gamma \in [0, 1)\), downweights future rewards
  • Discounted return: \(r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots\)
  • Effective time horizon \((1 + \gamma + \gamma^2 + \cdots) = 1 / (1 - \gamma)\)
  • Discounted problem can be obtained by adding transitions to a sink state, where the agent gets stuck and receives zero reward

    \[\tilde{P}(s' \vert s, a) = \begin{cases} P(s' \vert s, a) \text{ with probability } \gamma \\ \text{sink state with probability } 1 - \gamma \end{cases}\]
Infinite-Horizon V.I. Via Finite-Horizon V.I.
  • Pretend there exists finite horizon \(T\), ignore \(r_T, r_{T+1}, \dots\)
    • Error \(\epsilon \leq r_\max \gamma^T / (1 - \gamma)\)
    • Resulting nonstationary policy only suboptimal by \(\epsilon\)
    • \(\pi_0, V_0\) converges to optimal policy as \(T \rightarrow \infty\)
Infinite-Horizon V.I. Via Operator View
  • \(V \in \mathbb{R}^{\#\mathcal{S}}\): \(V\) is a vector with one entry per state (\(\#\mathcal{S}\) dimensions in total)
  • V.I. update is a function \(\mathcal{T}: \mathbb{R}^{\#\mathcal{S}} \rightarrow \mathbb{R}^{\#\mathcal{S}}\), called backup operator

    \[V_{n+1}(s) = [\mathcal{T} V_n](s) = \max_a \mathbb{E}_{s' \vert s, a} [ r + \gamma V_n(s') ]\]
  • \(\mathcal{T}\) is a contraction, that is

    \[V, \mathcal{T} V, \mathcal{T}^2 V, \cdots \rightarrow V^{*, \gamma}\]
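
A tabular sketch of this operator view on a small random MDP; all sizes, distributions, and tolerances are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9                      # small random MDP
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition probabilities
R = rng.uniform(0, 1, (nS, nA))                # expected reward r(s, a)

V = np.zeros(nS)
for _ in range(500):
    # Backup operator: [T V](s) = max_a E_{s'|s,a}[ r + gamma * V(s') ]
    Q = R + gamma * P @ V                      # shape (nS, nA)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:       # contraction => convergence to V*
        break
    V = V_new

pi = Q.argmax(axis=1)                          # greedy policy w.r.t. the converged V
```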

Policy Iteration

Problem: evaluate fixed policy \(\pi\)

\[V^{\pi, \gamma}(s) = \mathbb{E} [ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \vert s_0 = s ]\]

Backwards recursion involves a backup operator \(V_t = \mathcal{T}^\pi V_{t+1}\) where

\[[\mathcal{T}^\pi V](s) = \mathbb{E}_{s' \vert s, a=\pi(s)} [ r + \gamma V(s') ]\]

Alternate between

  1. Evaluate policy \(\pi \Rightarrow V^\pi\)
  2. Set new policy to be greedy policy for \(V^\pi\)

    \[\pi(s) = \arg \max_a \mathbb{E}_{s' \vert s, a} [ r + \gamma V^\pi(s') ]\]

Modified Policy Iteration

Update \(\pi\) to be the greedy policy, then update the value function with \(k\) backups (\(k\)-step lookahead)

\[\begin{align} &\text{Initialize } V^{(0)} \text{.} \\ &\textbf{for } n = 1, 2, \dots \textbf{ do} \\ &\hspace{1em} \pi^{(n+1)} = \mathcal{G} V^{(n)} \\ &\hspace{1em} V^{(n+1)} = (\mathcal{T}^{\pi^{(n+1)}})^k V^{(n)} , \text{ for integer } k \geq 1 \\ &\textbf{end for} \end{align}\]
  • \(k=1\): value iteration
  • \(k=\infty\): policy iteration
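
A tabular sketch of modified policy iteration with \(k\) backups per iteration, reusing the same kind of toy random MDP as above; \(k = 1\) recovers value iteration, a large \(k\) approaches policy iteration.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, k = 5, 3, 0.9, 10               # k backups per iteration (hypothetical MDP)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
R = rng.uniform(0, 1, (nS, nA))

V = np.zeros(nS)
for _ in range(200):
    # Greedy policy improvement: pi = G V
    pi = (R + gamma * P @ V).argmax(axis=1)
    # Partial policy evaluation: apply T^pi exactly k times
    P_pi = P[np.arange(nS), pi]                # (nS, nS) transitions under pi
    R_pi = R[np.arange(nS), pi]                # (nS,) rewards under pi
    for _ in range(k):
        V = R_pi + gamma * P_pi @ V
```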

Policy Gradient Methods

Parameterized Policies

Analogous to classification or regression with input \(s\), output \(a\).

  • Discrete action space (classification): network outputs vector of probabilities
  • Continuous action space (regression): network outputs mean and diagonal covariance of Gaussian

Policy Gradient Methods: Overview

Goal

\[\text{maximize} \, \mathbb{E} [ R \vert \pi_\theta ]\]

Intuitions

Collect a bunch of trajectories, and…

  • Make the good trajectories more probable
  • Make the good actions more probable
  • Push the actions towards good actions

Score Function Gradient Estimator

Have expectation \(\mathbb{E}_{x \sim p(x \vert \theta)} [ f(x) ]\), want gradient w.r.t. \(\theta\).

\[\begin{align} \nabla_\theta \mathbb{E}_x [ f(x) ] &= \nabla_\theta \int dx \, p(x \vert \theta) f(x) \\ &= \int dx \, \nabla_\theta p(x \vert \theta) f(x) \\ &= \int dx \, p(x \vert \theta) \frac{\nabla_\theta p(x \vert \theta)}{p(x \vert \theta)} f(x) \\ &= \int dx \, p(x \vert \theta) \nabla_\theta \log{p(x \vert \theta)} f(x) \\ &= \mathbb{E}_x [ f(x) \nabla_\theta \log{p(x \vert \theta)} ] \,. \end{align}\]

Hack
Swap the order of \(\nabla_\theta\) and \(\int dx\).

Now we have the gradient

\[\hat{g}_i = f(x_i) \nabla_\theta \log{p(x_i \vert \theta)}\]
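
As a quick Monte Carlo sanity check of this estimator (not from the lecture), take a unit-variance Gaussian with mean \(\theta\) and \(f(x) = x^2\), so the true gradient is \(2\theta\); the numbers below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N = 1.5, 200_000                    # Gaussian mean parameter and sample count
x = rng.normal(theta, 1.0, N)              # x ~ p(x | theta) = N(theta, 1)
f = x ** 2                                 # E[f] = theta^2 + 1, so the true gradient is 2*theta

score = x - theta                          # grad_theta log p(x | theta) for a unit-variance Gaussian
g_hat = np.mean(f * score)                 # score function gradient estimator
print(g_hat, 2 * theta)                    # the two should roughly agree
```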

Then substitute \(f(x)\) with whatever you want, say \(R(\tau)\), where \(x\) is substituted with the whole trajectory \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)\), we get

\[\nabla_\theta \mathbb{E}_\tau [ R(\tau) ] = \mathbb{E}_\tau [ \nabla_\theta \log{p(\tau \vert \theta)} R(\tau) ]\]

Then we look at \(p(\tau \vert \theta)\).

\[\begin{align} p(\tau \vert \theta) &= \mu(s_0) \prod_{t=0}^{T-1} [ \pi(a_t \vert s_t, \theta) P(s_{t+1}, r_t \vert s_t, a_t) ] \\ \log{p(\tau \vert \theta)} &= \log{\mu(s_0)} + \sum_{t=0}^{T-1} [ \log{\pi(a_t \vert s_t, \theta)} + \log{P(s_{t+1}, r_t \vert s_t, a_t)} ] \\ \nabla_\theta \log{p(\tau \vert \theta)} &= \nabla_\theta \sum_{t=0}^{T-1} \log{\pi(a_t \vert s_t, \theta)} \\ \nabla_\theta \mathbb{E}_\tau[R] &= \mathbb{E}_\tau \bigg[ R \, \nabla_\theta \sum_{t=0}^{T-1} \log{\pi(a_t \vert s_t, \theta)} \bigg] \end{align}\]

Introduce Baseline

Further reduce variance by introducing a baseline \(b(s_t)\).

\[\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau \bigg[ \sum_{t=0}^{T-1} \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} \bigg( \sum_{t' = t}^{T-1} r_{t'} - b(s_t) \bigg) \bigg]\]
  • Near optimal choice is expected return,

    \[b(s_t) \approx \mathbb{E} [ r_t + r_{t+1} + \cdots + r_{T-1} ]\]

Discounts for Variance Reduction

Introduce discount factor \(\gamma\), which downweights delayed effects of actions on rewards.

\[\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau \bigg[ \sum_{t=0}^{T-1} \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} \bigg( \sum_{t' = t}^{T-1} \gamma^{t' - t} r_{t'} - b(s_t) \bigg) \bigg]\]
  • Want

    \[b(s_t) \approx \mathbb{E} [ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-1-t} r_{T-1} ]\]

“Vanilla” Policy Gradient Algorithm

\[\begin{align} &\text{Initialize policy parameter } \theta, \text{ baseline } b \\ &\textbf{for} \text{ iteration} = 1, 2, \dots \ \textbf{do} \\ &\hspace{1em} \text{Collect a set of trajectories by executing current policy} \\ &\hspace{1em} \text{At each timestep in each trajectory, compute} \\ &\hspace{2em} \text{the} \textit{ return } R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} \text{, and} \\ &\hspace{2em} \text{the} \textit{ advantage estimate } \hat{A}_t = R_t - b(s_t) \text{.} \\ &\hspace{1em} \text{Re-fit the baseline, by minimizing } \| b(s_t) - R_t \| ^2 \\ &\hspace{2em} \text{summed over all trajectories and timesteps.} \\ &\hspace{1em} \text{Update the policy, using a policy gradient estimate } \hat{g} \text{,} \\ &\hspace{2em} \text{which is a sum of terms } \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} \hat{A}_t \text{.} \\ &\textbf{end for} \end{align}\]
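
A minimal sketch of this loop on a made-up single-state MDP with two actions and a softmax policy; the constant baseline, reward probabilities, and learning rate are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                         # logits of a softmax policy over 2 actions
gamma, T, lr = 0.99, 20, 0.05
p_reward = np.array([0.2, 0.8])             # hypothetical per-action reward probabilities
baseline = 0.0                              # a constant baseline suffices for a single-state MDP

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for iteration in range(300):
    # Collect one trajectory with the current policy.
    pi = softmax(theta)
    actions = rng.choice(2, size=T, p=pi)
    rewards = (rng.random(T) < p_reward[actions]).astype(float)

    # Returns R_t = sum_{t' >= t} gamma^{t'-t} r_{t'} and advantage estimates A_t = R_t - b.
    R = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = rewards[t] + gamma * acc
        R[t] = acc
    A = R - baseline

    # Policy gradient estimate: sum_t grad log pi(a_t | s_t) * A_t.
    grad = np.zeros(2)
    for a_t, A_t in zip(actions, A):
        grad_logp = -pi.copy()
        grad_logp[a_t] += 1.0               # d log softmax(theta)[a] / d theta
        grad += grad_logp * A_t
    theta += lr * grad                      # gradient ascent on expected return

    # Re-fit the (constant) baseline to the observed returns.
    baseline = 0.9 * baseline + 0.1 * R.mean()

print(softmax(theta))                       # probability of the better arm should approach 1
```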

Value Functions

  • Q-Function / state-action-value function

    \[Q^{\pi, \gamma}(s, a) = \mathbb{E}_\pi [ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \vert s_0 = s, a_0 = a ]\]
  • State-value function

    \[\begin{align} V^{\pi, \gamma}(s) &= \mathbb{E}_\pi [ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \vert s_0 = s ] \\ &= \mathbb{E}_{a \sim \pi} [ Q^{\pi, \gamma}(s, a) ] \end{align}\]
  • Advantage function

    \[A^{\pi, \gamma}(s, a) = Q^{\pi, \gamma}(s, a) - V^{\pi, \gamma}(s)\]

Policy Gradient Formulas with Value Functions

\[\begin{align} \nabla_\theta \mathbb{E}_\tau [ R ] &= \mathbb{E}_\tau \bigg[ \sum_{t=0}^{T-1} \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} Q^{\pi}(s_t, a_t) \bigg] \\ &= \mathbb{E}_\tau \bigg[ \sum_{t=0}^{T-1} \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} A^\pi(s_t, a_t) \bigg] \\ &\approx \mathbb{E}_\tau \bigg[ \sum_{t=0}^{T-1} \nabla_\theta \log{\pi(a_t \vert s_t, \theta)} A^{\pi, \gamma}(s_t, a_t) \bigg] \end{align}\]

Q-Function Learning Methods

(model-free reinforcement learning method)

Bellman Equations for $Q^\pi$

\[\begin{align} Q^\pi(s_0, a_0) &= \mathbb{E}_{s_1 \sim P(s_1 \vert s_0, a_0)} [ r_0 + \gamma V^\pi(s_1) ] \\ &= \mathbb{E}_{s_1 \sim P(s_1 \vert s_0, a_0)} [ r_0 + \gamma \mathbb{E}_{a_1 \sim \pi} [ Q^\pi(s_1, a_1) ] ] \end{align}\]

Define the Bellman backup operator (operating on Q functions) as follows

\[[\mathcal{T}^\pi Q](s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \vert s_0, a_0)} [ r_0 + \gamma \mathbb{E}_{a_1 \sim \pi}[Q(s_1, a_1)] ]\]

Introducing $Q^\star$

  • Let \(\pi^\star\) denote an optimal policy
  • Define \(Q^\star = Q^{\pi^\star}\), which satisfies \(Q^\star(s, a) = \max_\pi Q^{\pi}(s, a)\)
  • \(\pi^\star\) satisfies \(\pi^\star(s) = \arg \max_a Q^\star(s, a)\)

Therefore,

\[Q^\star(s_0, a_0) = \mathbb{E}_{s_1 \sim P(s_1 \vert s_0, a_0)} [ r_0 + \gamma \max_{a_1} Q^\star(s_1, a_1) ]\]

Iterating the backup gives Q-value iteration:

\[\mathcal{T}^t Q \rightarrow Q^\star \text{ as } t \rightarrow \infty\]

Why $Q$ Rather than $V$

  • Can compute greedy action \(\arg\max_a Q(s, a)\) without knowing \(P\)
  • Can compute unbiased estimator of backup value \([\mathcal{T}Q](s, a)\) without knowing \(P\) using single transition \((s, a, r, s')\)
  • Can compute unbiased estimator of backup value \([\mathcal{T}Q](s, a)\) using off-policy data

(potentially model-free)

Partial Backups

(reduce noise)

  • Partial backup: \(Q \leftarrow \epsilon \widehat{\mathcal{T}Q} + (1 - \epsilon) Q\)
  • Equivalent to gradient step on squared error

    \[\begin{align} Q &\rightarrow Q - \epsilon \nabla_Q \lVert Q - \widehat{\mathcal{T} Q_t} \rVert ^2 / 2 \\ &= Q - \epsilon (Q - \widehat{\mathcal{T} Q_t}) \\ &= (1 - \epsilon) Q + \epsilon \widehat{\mathcal{T} Q_t} \end{align}\]

Sampling-Based Q-Value Iteration

\[\begin{align} &\text{Initialize } Q^{(0)} \\ &\textbf{for } n = 0, 1, 2, \dots \text{ until termination condition } \textbf{do} \\ &\hspace{1em} \text{Interact with the environment for } K \text{ timesteps (including multiple episodes)} \\ &\hspace{1em} Q^{(n+1)} = Q^{(n)} - \epsilon \nabla_Q \sum_{t=1}^{K} \lVert \widehat{\mathcal{T} Q_t} - Q(s_t, a_t) \rVert ^2 / 2 \\ &\textbf{end for} \end{align}\]

Note
Appropriate schedule required for convergence, e.g. \(\epsilon = 1 /n\).
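
A tabular sketch of sampling-based Q-value iteration on a toy random MDP, using the \(\epsilon = 1/n\) step-size schedule from the note (applied globally here as a simplification) and an arbitrary exploration rate; purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] (hypothetical random MDP)
R = rng.uniform(0, 1, (nS, nA))

Q = np.zeros((nS, nA))
s = 0
for n in range(1, 100_000):
    eps = 1.0 / n                                # step-size schedule for convergence
    explore = rng.random() < 0.2                 # behaviour policy: epsilon-greedy exploration
    a = rng.integers(nA) if explore else int(Q[s].argmax())
    s_next = rng.choice(nS, p=P[s, a])
    r = R[s, a]
    backup = r + gamma * Q[s_next].max()         # sample estimate of [T Q](s, a)
    Q[s, a] = (1 - eps) * Q[s, a] + eps * backup # partial backup / stochastic approximation
    s = s_next
```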

Function Approximation / Neural-Fitted Algorithms

  • Parameterize Q function with a neural network \(Q_\theta\)
  • To approximate \(Q \leftarrow \widehat{\mathcal{T}Q}\), do

    \[\underset{\theta}{\text{minimize}} \sum_t \lVert Q_\theta(s_t, a_t) - \widehat{\mathcal{T}Q}(s_t, a_t) \rVert ^2\]

Deep Q-Network

Q-Value Iteration with Function Approximation

\[\begin{align} &\text{Initialize } \theta^{(0)} \\ &\textbf{for } n = 0, 1, 2, \dots \textbf{do} \\ &\hspace{1em} \text{Run policy for } K \text{ timesteps using some policy } \pi^{(n)} \\ &\hspace{1em} g^{(n)} = \nabla_\theta \sum_{t} \bigg( \widehat{\mathcal{T} Q_t} - Q_\theta(s_t, a_t) \bigg) ^2 \\ &\hspace{1em} \theta^{(n+1)} = \theta^{(n)} - \alpha g^{(n)} \\ &\textbf{end for} \end{align}\]

Key Terms

  • Replay memory \(\mathcal{D}\): history of last \(N\) transitions
  • Target network: old Q function \(Q^{(n)}\) that is kept fixed over many (~ 10,000) timesteps while \(Q\) is fitted towards \(\mathcal{T}Q^{(n)}\)

Deep Q-learning with Experience Replay

\[\begin{align*} &\text{Initialize replay memory } \mathcal{D} \text{ to capacity } N \\ &\text{Initialize action-value function } Q \text{ with random weights} \\ &\textbf{for} \text{ episode } = 1, \cdots, M \textbf{ do} \\ &\hspace{1em} \text{Initialize sequence } s_1 = \{ x_1 \} \text{ and preprocessed sequence } \phi_1 = \phi(s_1) \\ &\hspace{1em} \textbf{for } t = 1, \cdots, T \textbf{ do} \\ &\hspace{2em} \text{With probability } \epsilon \text{ select a random action } a_t \\ &\hspace{4em} \text{otherwise select } a_t = \arg\max_a Q^*(\phi(s_t), a; \theta) \\ &\hspace{2em} \text{Execute action } a_t \text{, observe reward } r_t \text{ and next frame } x_{t+1} \\ &\hspace{2em} \text{Set } s_{t+1} = s_t, a_t, x_{t+1} \text{ and preprocess } \phi_{t+1} = \phi(s_{t+1}) \\ &\hspace{2em} \text{Store transition } (\phi_t, a_t, r_t, \phi_{t+1}) \text{ in } \mathcal{D} \\ &\hspace{2em} \text{Sample random minibatch of transitions } (\phi_j, a_j, r_j, \phi_{j+1}) \text{ from } \mathcal{D} \\ &\hspace{2em} \text{Set } y_j = \begin{cases} r_j &\text{for terminal } \phi_{j+1} \\ r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) &\text{for non-terminal } \phi_{j+1} \\ \end{cases} \\ &\hspace{2em} \text{Perform a gradient descent step on } (y_j - Q(\phi_j, a_j; \theta))^2 \\ &\hspace{1em} \textbf{end for} \\ &\textbf{end for} \\ \end{align*}\]
  • \(\phi\): preprocessed input, e.g. image, \(\approx s\)
  • \(\gamma\): our good ol’ discount factor
    • Assuming actions far in the past contribute less to the present, for their influences are indirect
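
A minimal sketch of the two key pieces, replay memory and the target computation; `q_target_fn` is a hypothetical stand-in for the frozen target network (any callable returning per-action Q values).

```python
import random
from collections import deque

import numpy as np

class ReplayMemory:
    """Fixed-capacity replay memory D holding (s, a, r, s', done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def dqn_targets(q_target_fn, r, s_next, done, gamma=0.99):
    """y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q_target(s'_j, a')."""
    q_next = q_target_fn(s_next)          # shape (batch, num_actions), from the frozen network
    return r + gamma * (1.0 - done) * q_next.max(axis=1)
```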

Double Q-learning

  • \[\mathbb{E}_{x_1, x_2}[\max(x_1, x_2)] \geq \max (\mathbb{E}[x_1], \mathbb{E}[x_2])\]
  • Q values are noisy, thus \(r + \gamma \max_{a'} Q(s', a')\) is an overestimate

Solution

Use two networks \(Q_A, Q_B\), and compute \(\arg\max\) with the other network

\[\begin{align} Q_A(s, a) &\leftarrow r + \gamma Q_A(s', \arg\max_{a'} Q_B(s', a')) \\ Q_B(s, a) &\leftarrow r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a')) \end{align}\]
  • The two Q functions have independent noise, which avoids overestimation and converges faster
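
A sketch of the corresponding target computation, mirroring the update written above (select the action with the other network, evaluate it with the one being updated); `q_a_fn` / `q_b_fn` are hypothetical callables returning per-action Q values.

```python
import numpy as np

def double_q_targets(q_a_fn, q_b_fn, r, s_next, done, gamma=0.99):
    """Target for updating Q_A: Q_B selects the action, Q_A evaluates it,
    so the selection noise is independent of the evaluation noise."""
    a_star = q_b_fn(s_next).argmax(axis=1)                 # argmax with the other network
    q_eval = q_a_fn(s_next)[np.arange(len(r)), a_star]     # evaluate the selected action
    return r + gamma * (1.0 - done) * q_eval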

Dueling Net

\[Q(s, a) = V(s) + A(s, a)\]

\(\vert V \vert\) has a larger scale than \(\vert A \vert\) by \(\approx 1 / (1 - \gamma)\), since being in a good state contributes more to the final return than picking a good action in a given state does. But the small differences \(A(s, a) - A(s, a')\) are what determine the policy, so they can easily be swamped by fluctuations in \(\vert V \vert\)!

Solution

Parameterize Q function as follows

\[Q_\theta(s, a) = V_\theta(s) + \underbrace{F_\theta(s, a) - \text{mean}_{a'}F_\theta(s, a')}_{``\text{Advantage''} \text{ part}}\]
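
A one-function sketch of this aggregation, assuming the network has already produced a state value and raw advantage scores (names hypothetical):

```python
import numpy as np

def dueling_q(v, f):
    """Combine a state value v (batch,) and raw advantage scores f (batch, num_actions)
    into Q(s, a) = V(s) + F(s, a) - mean_a' F(s, a'), so the advantage part is zero-mean."""
    return v[:, None] + f - f.mean(axis=1, keepdims=True)
```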

Prioritized Replay

  • Bellman error loss: \(\sum_{i \in \mathcal{D}} \lVert Q_\theta(s_i, a_i) - \hat{Q}_t \rVert ^2 / 2\)
  • Can use importance sampling to favor timestep \(i\) with large gradient, allowing faster backwards propagation of reward info
  • Use last Bellman error \(\vert \delta_i \vert\), where \(\delta_i = Q_\theta(s_i, a_i) - \hat{Q}_t\) as proxy for size of gradient

Practical Tips

  • Use Huber loss on the Bellman error (see the sketch after this list)

    \[\begin{align*} L(x) = \begin{cases} x^2 / 2 &\text{if } \vert x \vert \leq \delta \\ \delta \vert x \vert - \delta^2 / 2 &\text{otherwise} \end{cases} \end{align*}\]
  • Use Double DQN
  • Try your own skills at navigating the environment based on processed frames, in order to test out your data preprocessing (e.g. image down-sampling)
  • Always run at least two different seeds when experimenting
  • Learning rate scheduling: try higher rates in the initial exploration period
  • Try non-standard exploration schedules
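
The Huber loss from the first tip, as a small sketch:

```python
import numpy as np

def huber(x, delta=1.0):
    """Quadratic for small errors, linear for large ones; delta is the switch point."""
    return np.where(np.abs(x) <= delta,
                    0.5 * x ** 2,
                    delta * np.abs(x) - 0.5 * delta ** 2)
```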

Artistic Stylization and Rendering

Methods

  • Procedural Methods
    • Pure mathematical formulas, trying to imitate how real artists think.
      • Pros
        • lovely results
        • very controllable
      • Cons
        • hard to design styles
        • complex to implement
  • Patch-Based Methods
    • Search the source for similar patches / patterns.
  • Neural Methods
    • Pros: good results
    • Cons: little to no control

Color Channel

  • Luminance channel
    • texture and line info
  • Color channel
    • color info

Note
Luminance channel is much more important! Consider image compression with JPEG.

Adding Control to Neural Methods

  • Color control with Luminance style transfer
  • Space control: masking out areas to apply algorithm

Geometry Understanding

Data Labeling Approaches “without Human-Hours”

  • Flickr image tags
    • noisy
    • try to model confidence for tags
  • Online 3D models
    • built with hierarchy in mind
    • possible to use name of parts that come with the downloaded model
      1. embedding geometries
      2. clustering feature vectors
      3. labeling each cluster with name and position in hierarchy

Advanced Topics in Imitation Learning & Safety

Approaches to Providing Demonstrations for a Robotic Arm and a Humanoid

  • kinesthetic teaching
  • teleoperation

Uncertainty-Aware Collision Cost

\[c_\text{collision}(\tau) \propto \text{speed} \cdot \big( \mathbb{E}[p(c_{t+H} \vert \tau)] + \sqrt{\mathrm{Var}[p(c_{t+H} \vert \tau)]} \big)\]

Inverse Reinforcement Learning

Motivation & Definition

In the real world, humans don’t get a score! The reward function is unclear.

Previously, solved by introducing behavioral cloning, but…

  • no reasoning about outcomes or dynamics
  • the expert might have different degrees of freedom

Inverse Optimal Control / Inverse Reinforcement Learning

Infer cost / reward function from expert demonstrations.

Given…

  • state & action space
  • roll-outs from \(\pi^*\)
  • [optional] dynamics model

Goal…

  • recover reward function
  • then use reward to get policy

Problem

  • Underdefined problem
  • Difficult to evaluate a learned cost
  • Demonstrations may not be precisely optimal

Maximum Entropy Inverse RL

Want to have

  • high scores for good trajectories
  • low scores for bad trajectories

Construct

\[p(\tau) \propto e^{-c(\tau)}\]

Goal becomes

\[\min_\pi \mathbb{E}_\pi[c(\tau)] - \mathrm{H}(\pi)\]

Now consider how to minimize the cost

  • cross entropy
\[\begin{align} \theta &= \arg \max_\theta \log{\prod_{\tau_d \in \mathcal{D}} p(\tau_d)} \\ &= \arg \min_\theta - \frac{1}{M} \sum_{\tau_d \in \mathcal{D}} \log{\frac{1}{z} e^{-c(\tau_d)}} \\ &= \arg \min_\theta \frac{1}{M} \sum_{\tau_d} c(\tau_d) + \log{\sum_\tau e^{-c(\tau)}} \\ \end{align}\]

where

  • \[z := \sum_\tau e^{-c(\tau)}\]
  • \(M\) is the number of demonstrations in dataset \(\mathcal{D}\)

Consider using gradient descent method, define \(\mathcal{L}\) by

\[\mathcal{L} := \frac{1}{M} \sum_{\tau_d} c(\tau_d) + \log{\sum_\tau e^{-c(\tau)}}\]

Therefore

\[\nabla_\theta \mathcal{L} = \frac{1}{M} \sum_{\tau_d} \frac{d c(\tau_d)}{d \theta} + \frac{1}{\sum_\tau e^{-c(\tau)}} \sum_\tau \big( e^{-c(\tau)} \cdot (-1) \cdot \frac{d c(\tau)}{d \theta} \big)\]

Note that

\[p(\tau) = \frac{e^{-c(\tau)}}{\sum_\tau e^{-c(\tau)}}\]

Have

\[\nabla_\theta \mathcal{L} = \frac{1}{M} \sum_{\tau_d} \frac{d c(\tau_d)}{d \theta} - \sum_\tau p(\tau \vert \theta, T) \frac{d c(\tau)}{d \theta}\]

where

  • \(T\) is transition dynamics

Substitute trajectories with their states

\[\nabla_\theta \mathcal{L} = \frac{1}{M} \sum_{\tau_d} \frac{d c(\tau_d)}{d \theta} - \sum_s p(s \vert \theta, T) \frac{d c(s)}{d \theta}\]

Note

\[\begin{align} p(\tau) &= p(s_1) \prod_t \pi_\theta(a_t \vert s_t) T(s_{t+1} \vert s_t, a_t) \\ &= p(s_1) \prod_t p(a_t \vert s_t, \theta) p(s_{t+1} \vert s_t, a_t) \end{align}\]

To compute \(p(s \vert \theta, T)\)

  1. Compute the policy \(\pi(a \vert s)\) w.r.t. the current cost \(c_\theta\)
  2. do

    \[\begin{align} &\mu_1(s) = p(s_1 = s) \\ &\textbf{for } t = 1 : T - 1 \textbf{ do} \\ &\hspace{1em} \mu_{t+1}(s) = \sum_{a', s'} \mu_t(s') \pi(a' \vert s') p(s \vert s', a') \\ &\textbf{end for} \end{align}\]

    where

    • \(\mu_t(s)\) is the probability of visiting \(s\) at time \(t\)
    • \(a', s'\) are action / state of previous timestep respectively
  3. \[p(s \vert \theta, T) = \sum_t \mu_t(s)\]
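
A tabular sketch of steps 2 and 3, assuming the policy \(\pi\) from step 1 and the dynamics are given as arrays (shapes noted in the docstring):

```python
import numpy as np

def state_visitation(mu1, pi, P, T):
    """Forward pass from step 2: mu_t(s) is the probability of visiting s at time t.

    mu1: (nS,)         initial state distribution p(s_1 = s)
    pi:  (nS, nA)      policy pi(a | s) computed w.r.t. the current cost c_theta
    P:   (nS, nA, nS)  dynamics p(s | s', a'), indexed [previous state, action, next state]
    """
    nS = mu1.shape[0]
    mu = np.zeros((T, nS))
    mu[0] = mu1
    for t in range(T - 1):
        # mu_{t+1}(s) = sum_{s', a'} mu_t(s') pi(a' | s') p(s | s', a')
        mu[t + 1] = np.einsum('s,sa,sap->p', mu[t], pi, P)
    return mu.sum(axis=0)        # step 3: p(s | theta, T) = sum_t mu_t(s)
```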

Scaling Inverse RL to Deep Cost Functions

Backpropagation :smile:

Inverse RL with Unknown Dynamics

Consider using importance sampling.

\[\begin{align} z &= \sum_\tau e^{-c(\tau)} \\ &\approx \sum_{\tau_s \sim q} \frac{e^{-c(\tau_s)}}{q(\tau_s)} \end{align}\]

For importance sampling, we want to sample trajectories where \(e^{-c(\tau)}\) is large; the variance-minimizing choice is

\[q(\tau) \propto \vert e^{-c(\tau)} \vert = e^{-c(\tau)}\]

The problem is that such a \(q\) is hard to guess up front. Instead, try to construct \(q\) adaptively.

\[q = \arg \min_q \mathbb{E}_q[c(\tau)] - \mathrm{H}(q)\]

Also, combine multiple \(q\)s into a mixture to avoid introducing high variance into the sampled data.

\[v(\tau) = \frac{1}{K} \sum_k q_k(\tau)\]

Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More

\[\text{maximize}_\theta \, \sum_{n=1}^N \frac{\pi_\theta(a_n \vert s_n)}{\pi_{\theta_\text{old}}(a_n \vert s_n)} \hat{A}_n - C \cdot \overline{\mathrm{KL}}_{\pi_{\theta_\text{old}}}(\pi_\theta)\]

Truncated Natural Policy Gradient Algorithm

  • Unconstrained problem: \(\text{maximize} \, L_{\pi_{\theta_\text{old}}}(\pi_\theta) - C \cdot \overline{\mathrm{KL}}_{\pi_{\theta_\text{old}}}(\pi_\theta)\)
\[\begin{align} &\textbf{for} \text{ iteration} = 1, 2, \dots \textbf{ do} \\ &\hspace{1em} \text{Run policy for } T \text{ timesteps or } N \text{ trajectories} \\ &\hspace{1em} \text{Estimate advantage function at all timesteps} \\ &\hspace{1em} \text{Compute policy gradient } g \\ &\hspace{1em} \text{Use } \textit{Conjugate Gradient Method} \text{ (with Hessian-vector products) to compute } H^{-1}g \\ &\hspace{1em} \text{Update policy parameter } \theta = \theta_\text{old} + \alpha H^{-1}g \\ &\textbf{end for} \end{align}\]
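
The conjugate gradient inner step, sketched in NumPy with a Hessian-vector product callback; the explicit 2x2 matrix at the end is only a sanity check, not a real Fisher matrix.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only Hessian-vector products hvp(v) = H v.

    This is the inner step of truncated natural policy gradient / TRPO:
    x approximates H^{-1} g without ever forming H explicitly.
    """
    x = np.zeros_like(g)
    r = g.copy()                 # residual r = g - H x (x = 0 initially)
    p = r.copy()
    r_dot = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        r_dot_new = r @ r
        if r_dot_new < tol:
            break
        p = r + (r_dot_new / r_dot) * p
        r_dot = r_dot_new
    return x

# Sanity check with an explicit symmetric positive-definite matrix:
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, 1.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g))     # True
```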

TRPO

  • Constrained problem: \(\text{maximize} \, L_{\pi_{\theta_\text{old}}}(\pi_\theta)\) subject to \(\overline{\mathrm{KL}}_{\pi_{\theta_\text{old}}}(\pi_\theta) \leq \delta\)
    • Set hyperparameter \(\delta\) rather than \(C\)
\[\begin{align} &\textbf{for} \text{ iteration} = 1, 2, \dots \textbf{ do} \\ &\hspace{1em} \text{Run policy for } T \text{ timesteps or } N \text{ trajectories} \\ &\hspace{1em} \text{Estimate advantage function at all timesteps} \\ &\hspace{1em} \text{Compute policy gradient } g \\ &\hspace{1em} \text{Use } \textit{Conjugate Gradient Method} \text{ (with Hessian-vector products) to compute } H^{-1}g \\ &\hspace{1em} \text{Compute rescaled step } s = \alpha H^{-1} g \text{ with rescaling and line search} \\ &\hspace{1em} \text{Apply update: } \theta = \theta_\text{old} + s \\ &\textbf{end for} \end{align}\]

Alternative Method for Calculating Natural Gradients

  • Fisher information matrix

    \[\frac{\partial^2}{\partial \theta^2} \mathrm{KL}[p_{\theta_\text{old}}, p_\theta] = \mathbb{E}_{x \sim p_{\theta_\text{old}}} \bigg[ \bigg( \frac{\partial}{\partial \theta} \log{p_\theta(x)} \bigg)^T \bigg( \frac{\partial}{\partial \theta} \log{p_\theta(x)} \bigg) \bigg] \bigg\vert_{\theta = \theta_\text{old}}\]
  • In policy optimization setting, instead of forming FIM by differentiating \(\mathrm{KL}\), can explicitly form \(\sum_n \big( \frac{\partial}{\partial \theta} \log{\pi_\theta(a_n \vert s_n)} \big) ^T \big( \frac{\partial}{\partial \theta} \log{\pi_\theta(a_n \vert s_n)} \big)\)


Variance Reduction for Policy Gradient Methods

Reward Shaping

Giving a hint to the model, via a shaping term in the reward / cost function.

\(\tilde{r}(s, a, s') = r(s, a, s') + \gamma \phi(s') - \phi(s)\) for arbitrary potential function \(\phi\).

  • Reward shaping transformation leaves policy gradient and optimal policy invariant
  • Shaping with \(\phi \approx V^\pi\) makes consequences of actions more immediate
  • Shaping, and then ignoring all but the first term, gives policy gradient
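
A tiny sketch of potential-based shaping; `phi` is an arbitrary potential function supplied by the user.

```python
def shape_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based shaping: r~(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s).

    phi is an arbitrary potential function; choosing phi close to V^pi makes the
    consequences of an action show up in the immediate shaped reward.
    """
    return r + gamma * phi(s_next) - phi(s)
```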

Variance Reduction

Previously showed that taking

\[\hat{A}_t = r_t + r_{t+1} + \cdots - b(s_t)\]

for any function \(b(s_t)\), gives an unbiased policy gradient estimator.

\(b(s_t) \approx V^\pi(s_t)\) gives variance reduction.

Value Functions in the Future

(reward-to-go)

Subtracting out baselines, get advantage estimators

\[\begin{align} &\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t) \\ &\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t) \\ &\dots \\ &\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t) \\ \end{align}\]

\(\hat{A}_t^{(1)}\) has low variance but high bias, \(\hat{A}_t^{(\infty)}\) has high variance but low bias.
Using an intermediate \(k\) (i.e. \(\hat{A}_t^{(k)}\)) gives an intermediate amount of bias and variance.
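
A sketch computing \(\hat{A}_t^{(k)}\) for a single trajectory, truncating at the end of the trajectory; array shapes are noted in the docstring.

```python
import numpy as np

def k_step_advantages(rewards, values, k, gamma=0.99):
    """A_hat_t^(k) = r_t + gamma r_{t+1} + ... + gamma^{k-1} r_{t+k-1}
                     + gamma^k V(s_{t+k}) - V(s_t).

    rewards: (T,)   rewards r_0 .. r_{T-1}
    values:  (T+1,) value estimates V(s_0) .. V(s_T), with V(s_T) bootstrapping the tail
    Small k: low variance, high bias; large k: high variance, low bias.
    """
    T = len(rewards)
    A = np.zeros(T)
    for t in range(T):
        horizon = min(k, T - t)                  # truncate at the end of the trajectory
        discounts = gamma ** np.arange(horizon)
        A[t] = (discounts @ rewards[t:t + horizon]
                + gamma ** horizon * values[t + horizon]
                - values[t])
    return A
```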


Assignments

Code can be found on GitHub.

Assignment 1

For code, see GitHub Repo.

Section 1: Getting Set Up

  • Out-Dated Pre-Trained .pkl
    • A bit of a hack: replace every occurrence of v1 (found in expert/ and demo.bash) with v2.
  • mujoco-py Installation Related (tested on Ubuntu 18.04)
    • GL/osmesa.h not found
      • sudo apt install libosmesa6-dev
    • -lGL not found from ld
      • sudo apt install libgl* (* is the *nix wildcard, expand it yourself)
    • patchelf command not found
      • sudo apt install patchelf

Section 2: Warmup

  1. Dump the expert_data in hw1/run_expert.py to a .pkl file; it will be used as training data
  2. Define network
    • Model: simple NN
    • Loss: MSE
    • Optimizer: SGD
  3. Train

Assignment 2

Section 4: Implement Policy Gradient

For code, see GitHub Repo

Networks

  1. Use build_mlp (Section 3) to create policy net, which takes in observations and gives logits of actions
  2. Choose / sample an action based on the calculated logits
  3. Calculate the log probability of the actions against their labels, i.e. the error
  • tf.nn.softmax_cross_entropy_with_logits_v2
    • Measures the probability error in discrete classification tasks in which the classes are mutually exclusive.
    • logits and labels must have the same shape, e.g. [batch_size, num_classes] and the same dtype (either float16, float32 or float64)
    • Backpropagation will happen into both logits and labels. To disallow backpropagation into labels, pass label tensors through tf.stop_gradient before feeding it to this function
  • tf.nn.sparse_softmax_cross_entropy_with_logits
    • Measures the probability error in discrete classification tasks in which the classes are mutually exclusive.
    • A common use case is to have logits of shape [batch_size, num_classes] and labels of shape [batch_size].

Note
View the error from a stochastic perspective: the higher the error, the further the prediction deviates from the most probable location in the space (if you roughly treat the label as that most probable location, or the mean), i.e. less probability (a larger negative number in log space) at the spot where the error is calculated.
In short, \(- \log{(\text{probability})} \Rightarrow \text{error}\).

Computing Q-Values

  • Policy gradient: \(\hat{g} = \mathbb{E}_\tau [ \sum_{t=0}^{T} \nabla \log{\pi(a_t \vert s_t)} \cdot (Q_t - b_t) ]\)
  • Q-value: \(Q_t\)
    • trajectory-based (all along trajectory): \(\sum_{t'=0}^{T} \gamma^{t'} r_{t'}\)
    • reward-to-go (from current timestep to future): \(\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}\)
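
A sketch of both options for a single trajectory of rewards; the function name and signature are mine, not the assignment’s.

```python
import numpy as np

def q_values(rewards, gamma, reward_to_go=True):
    """Q_t for one trajectory of rewards r_0 .. r_T.

    reward_to_go=False: every timestep gets the same full discounted return
                        sum_{t'} gamma^{t'} r_{t'} (trajectory-based).
    reward_to_go=True:  Q_t = sum_{t' >= t} gamma^{t'-t} r_{t'}.
    """
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)
    if not reward_to_go:
        total = np.sum(gamma ** np.arange(T) * rewards)
        return np.full(T, total)
    q = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = rewards[t] + gamma * acc
        q[t] = acc
    return q
```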

Assignment 3

For code, see GitHub Repo

Section 3: Implementation

Bellman Error

  1. Build two Q networks, the exploring policy and the update target respectively
  2. Calculate the Q value of the action actually taken, e.g. by multiplying the one-hot encoded actions with the exploring network’s Q outputs and summing
  3. Calculate the target value \(r + \gamma \max_{a'} Q\) using the target network
  4. The Bellman error is the difference between the exploring network’s Q value and this target value

The rest will be basic RTFM :smile:.