Reinforcement Learning Notes: Two Detailed Derivations of the Policy Gradient Theorem

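  • Originally published as: Reinforcement Learning Notes: Two Detailed Derivations of the Policy Gradient Theorem
  • I recently managed to reproduce LLM-GRPO, but ran into some difficulties reproducing LLM-PPO. While revisiting RL theory, I noticed a point I had never thought through carefully: RL tutorials usually derive the policy gradient theorem in one of two ways. One is fairly simple. The other relies on the occupancy measure of the Markov chain induced by the policy and is more involved. The two can in fact be converted into each other, and each corresponds to several classic policy gradient RL methods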

1. Derivation A: Trajectory-Expectation Form

1.1 Form of the Policy Gradient

  • Policy quality is defined as the expectation $J(\theta)=\mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$ of the return $R(\tau) = \sum_{k=0}^{T-1}\gamma^{k}r_k$ of trajectories induced by the policy. The optimal policy is

    $$\operatorname*{arg\,max}_{\pi_\theta} J(\theta) = \operatorname*{arg\,max}_{\pi_\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]$$

    This derivation yields

    $$\nabla_{\theta} J(\theta)=\mathbb{E}_{\tau\sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} G_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]$$

    where $G_t=\sum_{k=t}^{T-1}\gamma^{k}r_k$ is the return-to-go (RTG) at time $t$. Note that the discount exponent is $k$ rather than $k-t$, so $G_t$ is discounted back to time 0
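As a small illustration of the definition above, here is a sketch that computes every $G_t$ of one trajectory in a single backward pass. The function name `returns_to_go` and the sample rewards are my own choices for illustration; the discount exponent follows the convention used here ($\gamma^{k}$ measured from time 0).

```python
def returns_to_go(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] with G_t = sum_{k=t}^{T-1} gamma**k * r_k."""
    T = len(rewards)
    G = [0.0] * T
    running = 0.0
    # Walk backwards so each G_t reuses the already accumulated tail G_{t+1}.
    for t in reversed(range(T)):
        running += (gamma ** t) * rewards[t]
        G[t] = running
    return G


print(returns_to_go([1.0, 2.0, 3.0], 0.5))  # [2.75, 1.75, 0.75]
```

Here $G_0$ equals the full trajectory return $R(\tau)$, and each later $G_t$ simply drops the rewards collected before time $t$.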

1.2 Derivation

  • The detailed derivation follows. Taking the gradient of $J(\theta)$ and using the log-derivative identity $\nabla_\theta P = P\,\nabla_\theta \log P$ gives

    $$
    \begin{aligned}
    \nabla_{\theta} J(\theta) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)] \\
    & = \nabla_{\theta} \sum_{\tau} R(\tau) P\left(\tau \mid \pi_{\theta}\right) \\
    & = \sum_{\tau} R(\tau) \nabla_{\theta} P \left(\tau \mid \pi_{\theta}\right) \\
    & =\sum_{\tau} R(\tau) P\left(\tau \mid \pi_{\theta}\right) \frac{\nabla_{\theta} P\left(\tau \mid \pi_{\theta}\right)}{P\left(\tau \mid \pi_{\theta}\right)} \\
    & =\sum_{\tau} R(\tau) P\left(\tau \mid \pi_{\theta}\right) \nabla_{\theta}\log P\left(\tau \mid \pi_{\theta}\right) \\
    & =\mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau) \nabla_{\theta}\log P\left(\tau \mid \pi_{\theta}\right)\right]
    \end{aligned} \tag{1}
    $$
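The identity in (1) can be checked numerically on a toy problem. The sketch below replaces "trajectories" with three discrete outcomes drawn from a softmax distribution (an assumption made purely to keep the example small), evaluates $\mathbb{E}[f(x)\nabla_\theta \log p_\theta(x)]$ exactly by summing over outcomes, and compares it with a central finite difference of $\mathbb{E}[f(x)]$. All names and numbers are hypothetical.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_value(theta, f):
    """J(theta) = sum_x p_theta(x) * f(x)."""
    return sum(p * fx for p, fx in zip(softmax(theta), f))


def score_function_gradient(theta, f):
    """E_x[f(x) * grad_theta log p_theta(x)], summed exactly over all outcomes x."""
    p = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for x in range(n):
        for j in range(n):
            # d/d theta_j of log softmax(theta)[x] is 1[x == j] - p[j].
            dlogp = (1.0 if x == j else 0.0) - p[j]
            grad[j] += p[x] * f[x] * dlogp
    return grad


def finite_difference_gradient(theta, f, eps=1e-6):
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((expected_value(up, f) - expected_value(down, f)) / (2 * eps))
    return grad


theta = [0.2, -0.5, 1.0]
f = [1.0, 4.0, -2.0]
g_score = score_function_gradient(theta, f)
g_fd = finite_difference_gradient(theta, f)
max_gap = max(abs(a - b) for a, b in zip(g_score, g_fd))
print(max_gap)  # tiny: the two gradients agree up to finite-difference error
```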

    A trajectory is sampled from the policy together with the environment's state-transition distribution. Introducing the initial state distribution $\rho_0$, we have

    $$
    \begin{aligned}
    P\left(\tau \mid \pi_{\theta}\right) &=\rho_{0}\left(s_{0}\right) \prod_{t=0}^{T-1} P\left(s_{t+1} \mid s_{t}, a_{t}\right) \pi_{\theta}\left(a_{t} \mid s_{t}\right) \\
    \log P\left(\tau \mid \pi_{\theta}\right) &=\log \rho_{0}\left(s_{0}\right)+\sum_{t=0}^{T-1} \log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\sum_{t=0}^{T-1} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)
    \end{aligned}
    $$

    The initial state distribution $\rho_{0}(s_{0})$ and the environment transition probability $P\left(s_{t+1} \mid s_{t}, a_{t}\right)$ do not depend on the policy, i.e. they do not contain $\theta$, so after taking the gradient only the policy terms remain

    $$\nabla_{\theta} \log P\left(\tau \mid \pi_{\theta}\right)=\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$$

    Substituting back into (1) gives

    $$
    \begin{aligned}
    \nabla_{\theta} J(\theta) &=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] \\
    &=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} R(\tau) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
    \end{aligned} \tag{2}
    $$

    Here the total trajectory return $R(\tau)$, the discounted sum of the per-step rewards, multiplies the score $\nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$ at every time $t$. However, the rewards collected before time $t$ are fully determined by the history $\mathcal{H}_{t}=\left(s_{0}, a_{0}, \ldots, s_{t}\right)$, and by the Markov property the action $a_t$ depends on that history only through $s_t$. Those past rewards therefore contribute nothing in expectation, and dropping them lowers the variance. Substituting the expansion of $R(\tau)$ into (2) gives

    $$
    \begin{aligned}
    \nabla_{\theta} J(\theta) &=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} \left(\sum_{k=0}^{T-1}\gamma^{k}r_k\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] \\
    &=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} \left(\underbrace{\sum_{k=0}^{t-1} \gamma^{k} r_{k}}_{\text{past}}+\underbrace{\sum_{k=t}^{T-1} \gamma^{k} r_{k}}_{\text{future}}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] \\
    &=\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} \left(\underbrace{\sum_{k=0}^{t-1} \gamma^{k} r_{k}}_{\text{past}}+G_t\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
    \end{aligned} \tag{3}
    $$

    At each time $t$, condition on the history $\mathcal{H}_{t}$ (tower property of expectation). The past rewards are constants given $\mathcal{H}_{t}$, so

    $$
    \begin{aligned}
    \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \right]
    &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \mathbb{E}_{a_t\sim \pi_\theta(\cdot \mid s_t)}\Big[ \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \,\Big|\, \mathcal{H}_{t} \Big] \right]\\
    &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \mathbb{E}_{a_t\sim \pi_\theta(\cdot \mid s_t)}\Big[ \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right) \,\Big|\, s_{t} \Big] \right] \\
    &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \left( \sum_a\pi_\theta(a \mid s_t) \nabla_{\theta} \log \pi_{\theta}\left(a \mid s_{t}\right) \right) \right] \\
    &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \left( \nabla_{\theta}\sum_a \pi_\theta(a \mid s_t) \right) \right] \\
    &= \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\left(\sum_{k=0}^{t-1} \gamma^{k} r_{k}\right) \nabla_{\theta} 1 \right] \\
    &=0
    \end{aligned}
    $$
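The key step in the chain above is that the score has zero mean under the policy, $\sum_a \pi_\theta(a \mid s)\nabla_\theta \log \pi_\theta(a \mid s) = 0$. A minimal sketch that checks this for a softmax policy in a single state follows; the softmax parameterization and the logits are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_score(logits):
    """sum_a pi(a) * grad_logits log pi(a) for a softmax policy in one state."""
    pi = softmax(logits)
    n = len(logits)
    total = [0.0] * n
    for a in range(n):
        for j in range(n):
            # Gradient of log pi(a) with respect to logit j is 1[a == j] - pi[j].
            total[j] += pi[a] * ((1.0 if a == j else 0.0) - pi[j])
    return total


score_mean = expected_score([0.3, -1.2, 2.0, 0.0])
print(score_mean)  # every entry is zero up to floating-point rounding
```

Any factor that is fixed once the history is known, such as the sum of past rewards, multiplies this zero vector, which is why the "past" term in (3) drops out.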

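    Substituting into (3) removes the past-reward terms and gives the final expression, whose per-trajectory sample is an unbiased gradient estimate. The REINFORCE algorithm optimizes the policy directly with this gradient

    $$
    \boxed{ \nabla_{\theta} J(\theta)=\mathbb{E}_{\tau\sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} G_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] }
    $$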
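To see the boxed result at work, the sketch below enumerates every trajectory of a tiny made-up MDP (2 states, 2 actions, horizon $T=3$), computes $\mathbb{E}_\tau[\sum_t G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)]$ exactly, and compares it with a finite-difference gradient of $J(\theta)$. The MDP numbers, the tabular softmax policy, and the simplification that the reward depends only on $(s, a)$ are all assumptions for illustration.

```python
import itertools
import math

GAMMA, T = 0.9, 3
RHO0 = [0.6, 0.4]                          # initial state distribution
P = [[[0.7, 0.3], [0.2, 0.8]],             # P[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-1.0, 2.0]]              # R[s][a], assumed to ignore s_next


def policy(theta, s):
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]


def trajectories(theta):
    """Yield (probability, states, actions) for every length-T trajectory."""
    for states in itertools.product(range(2), repeat=T):
        for actions in itertools.product(range(2), repeat=T):
            prob = RHO0[states[0]]
            for t in range(T):
                prob *= policy(theta, states[t])[actions[t]]
                if t + 1 < T:
                    prob *= P[states[t]][actions[t]][states[t + 1]]
            yield prob, states, actions


def objective(theta):
    """J(theta) = E_tau[R(tau)], computed by exact enumeration."""
    return sum(prob * sum(GAMMA ** t * R[states[t]][actions[t]] for t in range(T))
               for prob, states, actions in trajectories(theta))


def reinforce_gradient(theta):
    """E_tau[sum_t G_t * grad log pi(a_t|s_t)] with G_t = sum_{k>=t} gamma^k r_k."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for prob, states, actions in trajectories(theta):
        discounted = [GAMMA ** t * R[states[t]][actions[t]] for t in range(T)]
        for t in range(T):
            G_t = sum(discounted[t:])
            pi = policy(theta, states[t])
            for b in range(2):
                dlogp = (1.0 if actions[t] == b else 0.0) - pi[b]
                grad[states[t]][b] += prob * G_t * dlogp
    return grad


def finite_difference(theta, eps=1e-6):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            up[s][b] += eps
            down = [row[:] for row in theta]
            down[s][b] -= eps
            grad[s][b] = (objective(up) - objective(down)) / (2 * eps)
    return grad


theta = [[0.1, -0.3], [0.5, 0.2]]
g_pg = reinforce_gradient(theta)
g_fd = finite_difference(theta)
gap = max(abs(g_pg[s][b] - g_fd[s][b]) for s in range(2) for b in range(2))
print(gap)  # tiny: the return-to-go form matches the true gradient
```

In practice REINFORCE replaces the exact enumeration with sampled trajectories, which keeps the estimate unbiased but makes it noisy.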

2. Derivation B: Standard Policy Gradient Theorem Form

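  • Derivation A yields an unbiased policy gradient expression, and the RL algorithm it corresponds to directly is REINFORCE. Its weakness is that estimating with the Monte Carlo return $G_t$ usually has high variance, and in the episodic setting one typically has to sample long trajectory segments (often all the way to termination) to obtain $G_t$, so sample efficiency is low. To reduce variance and improve sample efficiency:
    1. Introduce the value function $V_\pi(s)$ and the action-value function $Q_\pi(s,a)$, and learn $V, Q$ with bootstrapping methods such as TD / n-step / GAE to improve sample efficiency
    2. Introduce the advantage function $A_\pi(s,a)=Q_\pi(s,a)-V_\pi(s)$, which lowers the variance of the gradient estimate by subtracting a baseline
  • Derivation B accounts for both improvements. It rewrites the trajectory-level objective as an expectation over the (discounted) state visitation distribution, which puts the policy gradient in state-action form. That form fits naturally with the value functions $Q_{\pi_\theta}, V_{\pi_\theta}, A_{\pi_\theta}$ and yields the standard policy gradient theorem. Because it involves an expectation over the state visitation distribution, the derivation is usually more involved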

2.1 Form of the Policy Gradient

  • Policy quality is defined as the expectation of the state value $V_{\pi_\theta}(s)$ under the initial state distribution $\rho_0$, i.e. $J(\theta) = \mathbb{E}_{s \sim \rho_0}[V_{\pi_\theta}(s)]$. The optimal policy is

    $$\operatorname*{arg\,max}_{\pi_\theta} J(\theta) = \operatorname*{arg\,max}_{\pi_\theta} \mathbb{E}_{s \sim \rho_0}[V_{\pi_\theta}(s)]$$

    This derivation yields

    $$\nabla_{\theta} J(\theta) \propto \mathbb{E}_{s \sim d_{\pi_{\theta}}^\gamma,\ a \sim \pi_{\theta}(\cdot \mid s)}\left[Q_{\pi_{\theta}}(s, a) \nabla_{\theta} \log \pi_{\theta}(a \mid s)\right]$$

    where $d_{\pi_{\theta}}^\gamma$ is the normalized discounted occupancy measure (a probability distribution) induced by the policy $\pi_\theta$, and the proportionality constant is $\frac{1}{1-\gamma}$

2.2 Derivation

  • First recall the definitions of the value functions and the Bellman equations

    $$
    \begin{aligned}
    V_{\pi_\theta}(s) &=\sum_{a\in\mathcal{A}}\pi_\theta(a \mid s)Q_{\pi_\theta}(s,a)\\
    Q_{\pi_\theta}(s,a)&=\sum_{s'\in\mathcal{S}}P(s' \mid s,a)\Big(r(s,a,s')+\gamma V_{\pi_\theta}(s')\Big)
    \end{aligned}
    $$
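The two Bellman equations can be turned into a fixed-point iteration. The sketch below runs iterative policy evaluation on a made-up 2-state, 2-action MDP with a fixed policy; the transition table `P`, reward table `R`, policy `PI`, and the number of sweeps are all hypothetical values chosen for illustration.

```python
GAMMA = 0.9
P = [[[0.7, 0.3], [0.2, 0.8]],      # P[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[[1.0, 0.0], [0.0, 0.5]],      # R[s][a][s_next], i.e. r(s, a, s')
     [[-1.0, 0.0], [2.0, 1.0]]]
PI = [[0.5, 0.5], [0.25, 0.75]]     # fixed policy pi(a|s)


def q_from_v(V):
    """Q(s,a) = sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V(s'))."""
    return [[sum(P[s][a][x] * (R[s][a][x] + GAMMA * V[x]) for x in range(2))
             for a in range(2)] for s in range(2)]


def evaluate_policy(num_sweeps=2000):
    """Apply V(s) = sum_a pi(a|s) Q(s,a) repeatedly until it stops changing."""
    V = [0.0, 0.0]
    for _ in range(num_sweeps):
        Q = q_from_v(V)
        V = [sum(PI[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V, q_from_v(V)


V, Q = evaluate_policy()
# How far the result is from satisfying V(s) = sum_a pi(a|s) Q(s,a).
residual = max(abs(V[s] - sum(PI[s][a] * Q[s][a] for a in range(2))) for s in range(2))
print(V, residual)
```

Because the update is a contraction with factor $\gamma < 1$, the iterates converge to the unique pair $(V_{\pi}, Q_{\pi})$ that satisfies both equations.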

    The detailed derivation follows. Taking the gradient of $J(\theta)$ gives

    $$\nabla_\theta J(\theta)=\mathbb{E}_{s_0\sim \rho_0}\big[\nabla_\theta V_{\pi_\theta}(s_0)\big]$$

    Expand the gradient of the state-value function to obtain a recursion in $\nabla_\theta V$. The rewards and transition probabilities do not depend on $\theta$, so inside $Q_{\pi_\theta}$ only $V_{\pi_\theta}(s')$ keeps a gradient

    $$
    \begin{aligned}
    \nabla_{\theta} V_{\pi_{\theta}}(s)
    & =\nabla_{\theta}\left(\sum_{a \in \mathcal{A}} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)\right) \\
    & =\sum_{a \in \mathcal{A}}\left(\nabla_{\theta} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)+\pi_{\theta}(a \mid s) \nabla_{\theta} Q_{\pi_{\theta}}(s, a)\right) \\
    & =\sum_{a \in \mathcal{A}}\left(\nabla_{\theta} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)+\pi_{\theta}(a \mid s) \nabla_{\theta} \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right)\left(r(s,a,s^{\prime})+\gamma V_{\pi_{\theta}}\left(s^{\prime}\right)\right)\right)\\
    & =\sum_{a \in \mathcal{A}}\left(\nabla_{\theta} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)+\gamma \pi_{\theta}(a \mid s) \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right)\right) \\
    & =\underbrace{\sum_{a \in \mathcal{A}}\nabla_{\theta} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)}_{\text{denoted } \phi(s)}+\sum_{a \in \mathcal{A}}\gamma \pi_{\theta}(a \mid s) \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right) \\
    &=\phi(s)+\gamma \sum_{a} \pi_{\theta}(a \mid s) \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right)
    \end{aligned} \tag{4}
    $$

    The policy $\pi_\theta$ induces a Markov chain on the states of the MDP. Define the probability of reaching state $x$ from state $s$ in exactly $k$ steps (the $k$-step transition probability)

    $$
    \begin{aligned}
    &d_{\pi_\theta}(s\to x;k):=\Pr_{\pi_\theta}(s_k=x\mid s_0=s),\quad k=0,1,2,\dots \\
    \text{in particular}\quad & d_{\pi_\theta}(s\to x;1)=P_{\pi_\theta}(x \mid s) = \sum_{a}\pi_\theta(a \mid s)P(x \mid s,a)
    \end{aligned}
    $$

    Unroll the recursion (4) by substituting it into itself repeatedly. Consecutive one-step probabilities combine through $\sum_{s'} d_{\pi_\theta}(s\to s';1)\,d_{\pi_\theta}(s'\to s'';1)=d_{\pi_\theta}(s\to s'';2)$, and the remainder term $\gamma^{k}\sum_{x} d_{\pi_\theta}(s\to x;k)\nabla_\theta V_{\pi_\theta}(x)$ vanishes as $k\to\infty$ when $\gamma<1$ and $\nabla_\theta V_{\pi_\theta}$ is bounded

    $$
    \begin{aligned}
    \nabla_{\theta} V_{\pi_{\theta}}(s)
    & =\phi(s)+\gamma \sum_{a} \pi_{\theta}(a \mid s) \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right) \\
    & =\phi(s)+\gamma \sum_{s^{\prime}} \left(\sum_{a} \pi_{\theta}(a \mid s) P\left(s^{\prime} \mid s, a\right)\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right) \\
    & =\phi(s)+\gamma \sum_{s^{\prime}} d_{\pi_{\theta}}\left(s \to s^{\prime}; 1\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime}\right) \\
    & =\phi(s)+\gamma \sum_{s^{\prime}} d_{\pi_{\theta}}\left(s \to s^{\prime}; 1\right)\left[\phi\left(s^{\prime}\right)+\gamma \sum_{s^{\prime \prime}} d_{\pi_{\theta}}\left(s^{\prime} \to s^{\prime \prime}; 1\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime \prime}\right)\right] \\
    & =\phi(s)+\gamma \sum_{s^{\prime}}d_{\pi_{\theta}}\left(s \to s^{\prime}; 1\right) \phi\left(s^{\prime}\right)+\gamma^{2} \sum_{s^{\prime \prime}} d_{\pi_{\theta}}\left(s \to s^{\prime \prime}; 2\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime \prime}\right) \\
    & =\phi(s)+\gamma \sum_{s^{\prime}} d_{\pi_{\theta}}\left(s \to s^{\prime}; 1\right) \phi\left(s^{\prime}\right)+\gamma^{2} \sum_{s^{\prime \prime}} d_{\pi_{\theta}}\left(s \to s^{\prime \prime}; 2\right) \phi\left(s^{\prime \prime}\right)+\gamma^{3} \sum_{s^{\prime \prime \prime}} d_{\pi_{\theta}}\left(s \to s^{\prime \prime \prime}; 3\right) \nabla_{\theta} V_{\pi_{\theta}}\left(s^{\prime \prime \prime}\right) \\
    & =\cdots \\
    & =\sum_{k=0}^{\infty} \gamma^{k} \sum_{x \in \mathcal{S}} d_{\pi_{\theta}}(s \to x; k) \phi(x)
    \end{aligned}
    $$

    Define the discounted expected number of visits to state $s$ along an infinitely long trajectory induced by the policy $\pi_\theta$ as $\eta(s)=\mathbb{E}_{s_0\sim \rho_0}\left[\sum_{k=0}^{\infty} \gamma^{k} d_{\pi_{\theta}}\left(s_{0} \to s; k\right)\right]$. Then

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &=\mathbb{E}_{s_0\sim \rho_0}\big[\nabla_\theta V_{\pi_\theta}(s_0)\big] \\
    &=\mathbb{E}_{s_0\sim\rho_0}\left[\sum_{k=0}^{\infty}\gamma^k\sum_s d_{\pi_\theta}(s_0\to s;k)\phi(s)\right]\\
    &=\sum_s \left(\mathbb{E}_{s_0 \sim\rho_0} \left[\sum_{k=0}^{\infty}\gamma^k d_{\pi_\theta}(s_0\to s;k)\right]\right)\phi(s)\\
    & =\sum_{s} \eta(s) \phi(s)
    \end{aligned} \tag{5}
    $$

    Here $\eta(s)$ is a discounted visit count, not a probability distribution. What we want is an expectation under some distribution, so that it can be approximated with Monte Carlo sampling. Note that

    $$\sum_{s} \eta(s)=\mathbb{E}_{s_{0}\sim\rho_0}\left[\sum_{k=0}^{\infty} \gamma^{k} \sum_{s} \Pr_{\pi_\theta}\left(s_{k}=s\mid s_{0}\right)\right]=\sum_{k=0}^{\infty} \gamma^{k}=\frac{1}{1-\gamma}$$

    Multiplying by the factor $1-\gamma$ therefore normalizes it into a valid probability distribution, which defines the discounted occupancy measure

    $$d_{\pi_\theta}^\gamma(s):=(1-\gamma)\eta(s) =(1-\gamma)\mathbb{E}_{s_0\sim\rho_0}\left[\sum_{k=0}^{\infty}\gamma^k d_{\pi_\theta}(s_0\to s;k)\right]$$
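The normalization can be checked numerically. The sketch below propagates the state distribution of the induced Markov chain step by step, accumulates $\eta(s)=\sum_k \gamma^k \Pr(s_k=s)$ with the infinite sum truncated after a fixed number of terms, and then rescales by $1-\gamma$. The MDP, the policy, and the truncation length are hypothetical choices for illustration.

```python
GAMMA = 0.9
RHO0 = [0.6, 0.4]                   # initial state distribution
P = [[[0.7, 0.3], [0.2, 0.8]],      # P[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
PI = [[0.5, 0.5], [0.25, 0.75]]     # fixed policy pi(a|s)


def state_transition():
    """P_pi[s][x] = sum_a pi(a|s) P(x|s,a), the one-step probability d(s -> x; 1)."""
    return [[sum(PI[s][a] * P[s][a][x] for a in range(2)) for x in range(2)]
            for s in range(2)]


def discounted_visits(num_steps=2000):
    """eta(s) = sum_k gamma^k Pr(s_k = s), truncated after num_steps terms."""
    P_pi = state_transition()
    dist = RHO0[:]                  # distribution of s_k, starting with k = 0
    eta = [0.0, 0.0]
    weight = 1.0                    # gamma^k
    for _ in range(num_steps):
        eta = [eta[s] + weight * dist[s] for s in range(2)]
        dist = [sum(dist[s] * P_pi[s][x] for s in range(2)) for x in range(2)]
        weight *= GAMMA
    return eta


eta = discounted_visits()
d_gamma = [(1 - GAMMA) * e for e in eta]
print(sum(eta), sum(d_gamma))  # approximately 1/(1-gamma) = 10 and 1
```

One common way to draw a state from $d_{\pi_\theta}^\gamma$ without computing it explicitly is to roll out the policy and stop at each step with probability $1-\gamma$; this is mentioned as a standard trick, not something the derivation here relies on.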

    Substituting into (5) gives

    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &=\sum_s \eta(s)\phi(s) \\
    &=\frac{1}{1-\gamma}\sum_s d_{\pi_\theta}^\gamma(s)\phi(s) \\
    &\propto \sum_s d_{\pi_\theta}^\gamma(s)\phi(s) \\
    &= \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma} \big[\phi(s)\big] \\
    &= \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma} \left[\sum_{a \in \mathcal{A}}\nabla_{\theta} \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)\right] \\
    &= \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma} \left[\sum_{a \in \mathcal{A}}\pi_{\theta}(a \mid s)\frac{\nabla_{\theta} \pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)} Q_{\pi_{\theta}}(s, a)\right]\\
    &= \mathbb{E}_{s \sim d_{\pi_{\theta}}^{\gamma},\ a \sim \pi_{\theta}(\cdot \mid s)}\left[Q_{\pi_{\theta}}(s, a) \nabla_{\theta} \log \pi_{\theta}(a \mid s)\right]
    \end{aligned} \tag{6}
    $$

    This is the standard form of the policy gradient theorem. Actor-Critic algorithms optimize the policy directly with this gradient

    $$
    \boxed{ \nabla_\theta J(\theta)\propto \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma,\ a\sim\pi_\theta(\cdot \mid s)} \Big[Q_{\pi_\theta}(s,a)\nabla_\theta\log\pi_\theta(a \mid s)\Big] }
    $$
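The theorem can be verified end to end on a small example. The sketch below evaluates $\sum_s \eta(s)\sum_a \pi_\theta(a \mid s)\,Q_{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a \mid s)$ (the unnormalized form from (5)) for a tabular softmax policy and compares it with a finite-difference gradient of $J(\theta)=\mathbb{E}_{s\sim\rho_0}[V_{\pi_\theta}(s)]$. The MDP tables, the policy parameterization, and the iteration count `SWEEPS` are assumptions for illustration.

```python
import math

GAMMA, SWEEPS = 0.9, 2000
RHO0 = [0.6, 0.4]
P = [[[0.7, 0.3], [0.2, 0.8]],      # P[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[[1.0, 0.0], [0.0, 0.5]],      # r(s, a, s_next)
     [[-1.0, 0.0], [2.0, 1.0]]]


def policy(theta):
    """Tabular softmax policy: pi[s][a] from logits theta[s][a]."""
    out = []
    for logits in theta:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out


def values(pi):
    """Iterate the Bellman equations to (near) convergence; returns V and Q."""
    V = [0.0, 0.0]
    for _ in range(SWEEPS):
        Q = [[sum(P[s][a][x] * (R[s][a][x] + GAMMA * V[x]) for x in range(2))
              for a in range(2)] for s in range(2)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V, Q


def objective(theta):
    V, _ = values(policy(theta))
    return sum(RHO0[s] * V[s] for s in range(2))


def discounted_visits(pi):
    """eta(s) = sum_k gamma^k Pr(s_k = s) with s_0 ~ rho_0 (truncated sum)."""
    dist, eta, weight = RHO0[:], [0.0, 0.0], 1.0
    for _ in range(SWEEPS):
        eta = [eta[s] + weight * dist[s] for s in range(2)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][x] for s in range(2) for a in range(2))
                for x in range(2)]
        weight *= GAMMA
    return eta


def theorem_gradient(theta):
    pi = policy(theta)
    _, Q = values(pi)
    eta = discounted_visits(pi)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for a in range(2):
            for b in range(2):
                # d log pi(a|s) / d theta[s][b] for a tabular softmax policy.
                dlogp = (1.0 if a == b else 0.0) - pi[s][b]
                grad[s][b] += eta[s] * pi[s][a] * Q[s][a] * dlogp
    return grad


def finite_difference(theta, eps=1e-6):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            up[s][b] += eps
            down = [row[:] for row in theta]
            down[s][b] -= eps
            grad[s][b] = (objective(up) - objective(down)) / (2 * eps)
    return grad


theta = [[0.1, -0.3], [0.5, 0.2]]
g_thm = theorem_gradient(theta)
g_fd = finite_difference(theta)
gap = max(abs(g_thm[s][b] - g_fd[s][b]) for s in range(2) for b in range(2))
print(gap)  # small: the theorem's expression matches the numerical gradient
```

Using $\eta$ rather than $d_{\pi_\theta}^\gamma$ keeps the constant $\frac{1}{1-\gamma}$, so the comparison is an equality and not just a proportionality.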

  • To reduce variance, introduce the advantage function $A_{\pi_\theta}(s,a):=Q_{\pi_\theta}(s,a)-V_{\pi_\theta}(s)$ and use the fact that the baseline term has zero mean

    $$
    \begin{aligned}
    \mathbb{E}_{a\sim\pi_\theta(\cdot \mid s)}\big[V_{\pi_\theta}(s)\nabla_\theta\log\pi_\theta(a \mid s)\big]
    &=V_{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\nabla_\theta\log\pi_\theta(a \mid s) \\
    &=V_{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
    &= V_{\pi_\theta}(s)\nabla_\theta \sum_a \pi_\theta(a \mid s) \\
    &=V_{\pi_\theta}(s) \nabla_\theta 1=0
    \end{aligned}
    $$

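    Subtracting this zero-mean term leaves the expectation unchanged, which gives an unbiased estimator with lower variance. The A2C algorithm optimizes the policy directly with this gradient

    $$
    \boxed{ \nabla_\theta J(\theta)\propto \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma,\ a\sim\pi_\theta(\cdot \mid s)} \Big[A_{\pi_\theta}(s,a)\nabla_\theta\log\pi_\theta(a \mid s)\Big] }
    $$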
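The effect of the baseline can be seen in a single state. The sketch below computes the exact mean and total variance of the one-sample estimator $(Q(a)-b)\nabla\log\pi(a)$ for $b=0$ and for $b=V$, under a softmax policy with made-up logits and action values. It only illustrates this one example; how much variance is removed depends on the numbers.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(logits, Q, baseline):
    """Exact mean and total variance of (Q(a) - baseline) * grad log pi(a), a ~ pi."""
    pi = softmax(logits)
    n = len(logits)
    samples = []
    for a in range(n):
        score = [(1.0 if a == j else 0.0) - pi[j] for j in range(n)]
        samples.append([(Q[a] - baseline) * g for g in score])
    mean = [sum(pi[a] * samples[a][j] for a in range(n)) for j in range(n)]
    variance = sum(pi[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(n))
                   for a in range(n))
    return mean, variance


logits = [0.5, 0.0, -0.5]
Q = [10.0, 11.0, 12.0]
V = sum(p * q for p, q in zip(softmax(logits), Q))   # V(s) = sum_a pi(a|s) Q(s,a)

mean_q, var_q = estimator_stats(logits, Q, baseline=0.0)
mean_a, var_a = estimator_stats(logits, Q, baseline=V)
print(mean_q, mean_a)   # identical up to rounding: the baseline adds no bias
print(var_q, var_a)     # the advantage version has much lower variance here
```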

3. Converting Between the Two Policy Gradients

  • The policy gradient obtained from Derivation A is

    $$
    \boxed{ \nabla_{\theta} J(\theta)=\mathbb{E}_{\tau\sim \pi_{\theta}}\left[\sum_{t=0}^{T-1} G_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] }
    $$

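    Fix a time $t$ and consider the term

    $$\mathbb{E}_{\tau\sim \pi_{\theta}}\left[ G_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]$$

    Condition on $(s_t,a_t)$. Because $G_t=\sum_{k=t}^{T-1}\gamma^{k}r_k$ is discounted from time 0, in the infinite-horizon limit $T\to\infty$ (or with an absorbing zero-reward terminal state) the definition of the action value gives $\mathbb{E}_{\tau\sim \pi_{\theta}} \big[G_{t} \mid s_{t}=s, a_{t}=a\big]=\gamma^{t} Q_{\pi_{\theta}}(s, a)$, so

    $$
    \begin{aligned}
    \mathbb{E}_{\tau \sim \pi_{\theta}}\left[G_{t} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
    & =\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\mathbb{E}\left[G_{t} \mid s_{t}, a_{t}\right] \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] \\
    & =\gamma^{t}\,\mathbb{E}_{\tau \sim \pi_{\theta}}\left[Q_{\pi_{\theta}}\left(s_{t}, a_{t}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right]
    \end{aligned}
    $$

    Summing over all times and grouping by state, with $\sum_{t=0}^{\infty}\gamma^{t}\Pr_{\pi_\theta}(s_t=s)=\eta(s)=\frac{1}{1-\gamma}d_{\pi_\theta}^\gamma(s)$ for $s_0\sim\rho_0$, gives

    $$
    \begin{aligned}
    \nabla_\theta J(\theta)
    &=\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\tau \sim \pi_{\theta}}\left[Q_{\pi_{\theta}}\left(s_{t}, a_{t}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right] \\
    &=\sum_{s}\eta(s)\sum_{a}\pi_\theta(a \mid s)\,Q_{\pi_\theta}(s,a)\nabla_\theta\log\pi_\theta(a \mid s) \\
    &=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d_{\pi_\theta}^\gamma,\ a\sim\pi_\theta(\cdot \mid s)} \Big[Q_{\pi_\theta}(s,a)\nabla_\theta\log\pi_\theta(a \mid s)\Big]
    \end{aligned}
    $$

    which is exactly the standard policy gradient of Derivation B

    $$
    \boxed{ \nabla_\theta J(\theta)\propto \mathbb{E}_{s\sim d_{\pi_\theta}^\gamma,\ a\sim\pi_\theta(\cdot \mid s)} \Big[Q_{\pi_\theta}(s,a)\nabla_\theta\log\pi_\theta(a \mid s)\Big] }
    $$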


https://wxc971231.github.io/MyBlog/2026/02/02/强化学习拾遗_策略梯度定理的两种详细推导/
Author: 云端fff
Published: February 2, 2026