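Reinforcement Learning: The Policy Gradient (PG) Algorithm Family (Built with gymnasium)

1. What is Policy Gradient (PG)?

DQN, covered earlier, is a value-based method: it first learns the Q-value of every action in each state, then picks the action with the largest Q-value.

Policy gradient takes the opposite route:

It trains a policy network \pi_{\theta}(a|s) directly. The network takes the current state s as input and outputs an action (or a probability distribution over actions), and the training objective is to maximize the long-term cumulative reward. Because the network parameters are updated by gradient ascent on that objective, the method is called policy gradient.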

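2. Two kinds of policy in PG

1. Stochastic policy: outputs a probability distribution over actions. For example, with 4 actions it might output [0.1, 0.5, 0.3, 0.1], and an action is sampled according to these probabilities, so exploration is built in.

Representative algorithms: REINFORCE, basic Actor-Critic

2. Deterministic policy: takes the state as input and outputs a single definite action, with no probabilities. It suits continuous actions such as robot-arm control or vehicle speed, but it has no built-in exploration, so noise must be added by hand.

Representative algorithms: DPG, DDPG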
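
The two action-selection styles can be sketched in plain Python. This is only an illustration: the helper names `sample_stochastic_action` and `deterministic_action_with_noise`, the noise scale 0.1, and the bounds [-1, 1] are assumptions made for the example, not part of any library.

```python
import random

def sample_stochastic_action(probs, rng):
    """Stochastic policy: draw an action index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def deterministic_action_with_noise(action, noise_std, low, high, rng):
    """Deterministic policy: add Gaussian exploration noise, then clip to bounds."""
    noisy = action + rng.gauss(0.0, noise_std)
    return max(low, min(high, noisy))

rng = random.Random(0)

# Stochastic policy over 4 discrete actions, as in the example above
probs = [0.1, 0.5, 0.3, 0.1]
counts = [0, 0, 0, 0]
for _ in range(10000):
    counts[sample_stochastic_action(probs, rng)] += 1
frequencies = [c / 10000 for c in counts]  # empirical frequencies approach probs

# Deterministic policy output 0.8, explored with hand-added noise
noisy_action = deterministic_action_with_noise(0.8, 0.1, -1.0, 1.0, rng)
```

Over many draws the empirical frequencies of the stochastic policy approach the distribution it outputs, whereas the deterministic policy only varies because of the noise added on top.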

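3. What Actor and Critic mean

- Actor: the policy network, responsible for choosing actions

- Critic: the value network, which evaluates how good a state or action is. Among the algorithms covered here, AC, A2C, DPG and DDPG have a Critic; REINFORCE does not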

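4. The REINFORCE algorithm

REINFORCE is the earliest and simplest Monte Carlo (MC) policy gradient method.

(1) Network structure

There is only an Actor policy network and no Critic value network.

Input state s -> output a probability distribution over actions (stochastic policy).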

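(2) Full training procedure

1. Collect a complete episode trajectory

The agent interacts with the environment and must run a whole episode (until the game ends or the episode is cut off), storing every step:

Trajectory \tau = \left( s_0, a_0, r_0, ..., s_T, a_T, r_T \right)

2. Compute the actual return G_t at every step (the core of MC)

G_t is the sum of all discounted rewards from time t onward. It serves as the label for how good the action taken at time t really was.

G_t = r_t + \gamma r_{t+1} + ... + \gamma^{T-t} r_T

\gamma is the discount factor (0 \le \gamma \le 1): the further away a reward is, the less weight it gets.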
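
As a small worked example of the return formula, the sketch below computes G_t for every step with the backward recursion G_t = r_t + \gamma G_{t+1}. The function name `discounted_returns` and the reward sequence are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] for one episode, accumulating from the last step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

# A 3-step episode that is rewarded only at the end (FrozenLake-style)
example = discounted_returns([0.0, 0.0, 1.0], gamma=0.9)
# G_2 = 1, G_1 = 0.9 * 1, G_0 = 0.9 * 0.9
```

Steps closer to the rewarded step receive a larger return, which is how a sparse final reward is spread back over the earlier actions.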

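3. Build the policy gradient loss and update the Actor network

Formula:

\nabla_\theta J(\theta) = E\left[ G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]

Plain-language reading of each part:

1) J(\theta): the overall objective, namely the expected cumulative discounted reward of a trajectory generated by the policy. The goal is to maximize J(\theta)

2) \nabla_\theta J(\theta): the gradient of J with respect to the network parameters \theta. Gradient ascent uses it to update the network so that the expected return grows

3) E\left[ ... \right]: an expectation. In practice it is approximated by averaging over sampled steps and trajectories, which reduces random fluctuation

4) G_t: the total discounted return from time t onward, used to judge how good the current action was

5) \nabla_\theta \log \pi_\theta(a_t|s_t): the core term, which points in the direction that raises the probability of action a_t

The whole logic in one sentence:

Use the future return G_t as a score:

  • Good outcome (G_t > 0): step along \nabla_\theta \log \pi, raising the probability of that action
  • Bad outcome (G_t < 0): step in the opposite direction, lowering the probability of that action

4. Apply the gradient-ascent update, clear the trajectory, and collect the next episode.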

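(3) Advantages

The structure is simple, with a single network, and the gradient estimate is unbiased (G_t is the actual sampled return, so no value-estimation error enters).

(4) Disadvantages

1. Updates must wait until the episode ends: training is very slow on games with long episodes

2. The return G_t has very high variance: returns fluctuate a lot from episode to episode, so the gradient swings back and forth and convergence is difficult

3. Updates happen only at the end of an episode, never step by step, and each trajectory is discarded after one update, so sample efficiency is low.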

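(5) Reading the core gradient term

Gradient ascent has a single aim: make the value of the function being optimized larger.

The gradient is the direction in which the function increases fastest (not merely the direction in which it changes fastest). "Ascent" means stepping along the gradient; each step updates the parameters, and the net effect is that the function value keeps growing.

Gradient ascent update rule:

\theta_{new} = \theta_{old} + \alpha \nabla_\theta f, where the learning rate \alpha is positive.

Gradient descent update rule: \theta_{new} = \theta_{old} - \alpha \nabla_\theta f

Applying gradient ascent as \theta_{new} = \theta_{old} + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) makes \log \pi larger, hence \pi larger, meaning the probability of choosing a_t increases.

With G_t added as a weight, the update becomes: \theta_{new} = \theta_{old} + \alpha G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)

If G_t is positive (a gain), \theta moves along \nabla_\theta \log \pi, so \pi_\theta(a_t|s_t) grows and a_t becomes more likely.

If G_t is negative (a loss), \theta moves against \nabla_\theta \log \pi, so \pi_\theta(a_t|s_t) shrinks and a_t becomes less likely.

In practice, the code minimizes the loss -G_t \cdot \log \pi_\theta(a_t|s_t) by gradient descent. Its gradient is -G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t), so descending on it is equivalent to gradient ascent along G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t).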
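
The effect of the sign of G_t can be checked numerically. The sketch below assumes the simplest possible policy, a softmax directly over four logits with no network, for which \nabla \log \pi(a) with respect to logit k is 1[k = a] - \pi(k). The names `softmax` and `reinforce_step` are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, G, lr):
    """One gradient-ascent step on G * log pi(action) for a softmax policy."""
    probs = softmax(logits)
    # d log pi(action) / d logit_k = 1[k == action] - pi(k)
    grad_log_pi = [(1.0 if k == action else 0.0) - probs[k]
                   for k in range(len(logits))]
    return [logits[k] + lr * G * grad_log_pi[k] for k in range(len(logits))]

logits = [0.0, 0.0, 0.0, 0.0]  # uniform policy: every action has probability 0.25
p_before = softmax(logits)[1]
p_after_positive = softmax(reinforce_step(logits, action=1, G=2.0, lr=0.1))[1]
p_after_negative = softmax(reinforce_step(logits, action=1, G=-2.0, lr=0.1))[1]
```

A positive return pushes the probability of the sampled action above 0.25 and a negative return pushes it below, matching the two cases described above.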

(6) Example code

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

# Policy network: outputs a probability distribution over actions
class PolicyNet(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.out = nn.Linear(h1_nodes, out_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        logits = self.out(x)
        prob = F.softmax(logits, dim=-1)
        return prob

class FrozenLakeREINFORCE:
    learning_rate_a = 0.001
    discount_factor_g = 0.9
    ACTIONS = ['L', 'D', 'R', 'U']

    optimizer = None

    def state_to_onehot(self, state, num_states):
        input_tensor = torch.zeros(num_states)
        input_tensor[state] = 1
        return input_tensor

    def print_policy(self, policy_net):
        num_states = policy_net.fc1.in_features

        for s in range(num_states):
            prob = policy_net(self.state_to_onehot(s, num_states)).tolist()
            prob_str = ''
            for p in prob:
                prob_str += f"{p:.2f} "
            prob_str = prob_str.rstrip()
            best_act = self.ACTIONS[np.argmax(prob)]
            print(f"{s:02},{best_act},[{prob_str}]", end=" ")
            if (s + 1) % 4 == 0:
                print()

    def train(self, episodes, render=False, is_slippery=False):
        env = gym.make(
            "FrozenLake-v1",
            map_name="8x8",
            is_slippery=is_slippery,
            render_mode="human" if render else None
        )
        num_states = env.observation_space.n
        num_actions = env.action_space.n

        policy_net = PolicyNet(in_states=num_states, h1_nodes=num_states, out_actions=num_actions)
        self.optimizer = torch.optim.Adam(policy_net.parameters(), lr=self.learning_rate_a)

        print("Random initial policy:")
        self.print_policy(policy_net)

        rewards_per_episode = np.zeros(episodes)

        for ep in range(episodes):
            # Storage for one complete trajectory
            states = []
            actions = []
            rewards = []

            state, _ = env.reset()
            terminated = False
            truncated = False

            # 1. Collect a complete episode trajectory
            while not terminated and not truncated:
                state_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    act_prob = policy_net(state_tensor)
                # Sample an action according to the probabilities
                action = torch.multinomial(act_prob, num_samples=1).item()

                new_state, reward, terminated, truncated, _ = env.step(action)
                states.append(state)
                actions.append(action)
                rewards.append(reward)
                state = new_state

            if reward == 1:
                rewards_per_episode[ep] = 1

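            # 2. Compute the Monte Carlo discounted return G_t for every step
            T = len(rewards)
            G_list = [0] * T
            g = 0
            # Accumulate the discounted return from the last step backwards
            for t in reversed(range(T)):
                g = rewards[t] + self.discount_factor_g * g
                G_list[t] = g

            # 3. Policy gradient update
            loss_sum = 0.0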
            for t in range(T):
                s = states[t]
                a = actions[t]
                Gt = G_list[t]

                s_tensor = self.state_to_onehot(s, num_states)
                act_prob = policy_net(s_tensor)
                log_prob = torch.log(act_prob[a])

                # REINFORCE loss: -G_t * log pi(a|s)
                # Minimizing it by gradient descent is equivalent to gradient ascent on G_t * log pi
                loss = - Gt * log_prob
                loss_sum += loss

            # One update over the whole trajectory
            self.optimizer.zero_grad()
            loss_sum.backward()
            self.optimizer.step()

            # Print training progress
            if (ep + 1) % 200 == 0:
                avg_100 = np.sum(rewards_per_episode[max(0, ep - 99):ep + 1]) / 100
                print(f"Episode {ep + 1}, last 100 avg reward: {avg_100:.2f}")

        env.close()
        torch.save(policy_net.state_dict(), "reinforce_frozenlake.pt")

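        # Plot: number of successful episodes within the most recent 100 episodes
        plt.figure()
        sum_rewards = np.zeros(episodes)
        for x in range(episodes):
            # Window of exactly 100 episodes: indices x-99 .. x
            sum_rewards[x] = np.sum(rewards_per_episode[max(0, x - 99):x + 1])
        plt.plot(sum_rewards)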
        plt.title("REINFORCE Reward Sum (last 100 episodes)")
        plt.savefig("reinforce_reward.png")

        print("\nTrained Policy:")
        self.print_policy(policy_net)

    def test(self, episodes, is_slippery=False):
        env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=is_slippery, render_mode="human")
        num_states = env.observation_space.n
        num_actions = env.action_space.n
        policy_net = PolicyNet(num_states, num_states, num_actions)
        policy_net.load_state_dict(torch.load("reinforce_frozenlake.pt"))
        policy_net.eval()

        print("\nTest Policy:")
        self.print_policy(policy_net)

        for _ in range(episodes):
            state, _ = env.reset()
            terminated = False
            truncated = False

            while not terminated and not truncated:
                s_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    act_prob = policy_net(s_tensor)
                action = torch.argmax(act_prob).item()
                state, reward, terminated, truncated, _ = env.step(action)
        env.close()

if __name__ == "__main__":
    agent = FrozenLakeREINFORCE()
    slippery = False
    # agent.train(6000, is_slippery=slippery)
    agent.test(10, is_slippery=slippery)

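Key points in the code:

1) Design of the PolicyNet policy network

  • The output layer does not produce Q-values; it produces logits, which softmax turns into the action distribution of a stochastic policy
  • DQN outputs Q-values (value-based method) while REINFORCE outputs probabilities (policy gradient method); this is the core difference between the two
  • Actions are sampled with torch.multinomial according to their probabilities, so exploration comes for free and \epsilon-greedy is not needed

2) A complete episode trajectory must be collected before updating

This is a pure MC algorithm with no temporal-difference bootstrapping: the episode must be played to the end, storing state, action and reward at every step, and no single-step update is possible.

3) How the discounted return G_t is computed

G_t = r_t + \gamma r_{t+1} + ... + \gamma^{T-t} r_T

The code iterates backwards from the end of the trajectory, using G_t = r_t + \gamma G_{t+1}.

4) Policy gradient loss

Gradient-ascent objective in theory: \nabla_\theta J = E\left[ G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]

PyTorch optimizers minimize a loss by default, so an equivalent loss is constructed: Loss = -G_t \cdot \log \pi_\theta(a_t|s_t)

5) How the update is applied

The losses of all steps in the trajectory are summed and backpropagated once, giving one update per trajectory. Each sample is used only once and there is no experience replay.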

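5. The Actor-Critic algorithm

(1) Idea behind the improvement

REINFORCE relies solely on the full-trajectory return G_t, whose variance is high, and it has to wait for the episode to end.

AC adds a Critic value network that estimates the future return online, and replaces the full MC return with a temporal-difference (TD) target, so the networks can be updated after every single step.

(2) Division of labor between the two networks

1. Actor (policy network): the stochastic policy \pi_\theta(a|s). It interacts with the environment and chooses actions, and its action choices are improved by gradient ascent;

2. Critic (value network): takes state s as input and outputs a single number V(s), the estimated long-term return from that state, which is used to judge how good an action was

(3) Step-by-step training logic

Step 1: single-step interaction, no need to finish the episode

After every step \left( s, a, r, s_{next}, done \right) the networks are trained immediately, without waiting for the game to end

Step 2: train the Critic first and compute the TD error \delta (which replaces G_t in REINFORCE)

TD target (a one-step bootstrapped estimate of the return):

y_t = r_t + \gamma \cdot V(s_{t+1}) \cdot (1 - done)

TD error \delta = TD target - current state-value estimate

\delta_t = y_t - V(s_t)

\delta is the weight that scores the current action:

\delta > 0 means the action turned out better than the Critic expected; \delta < 0 means it turned out worse.

Critic loss: the squared error L_{critic} = (y_t - V(s_t))^2. Gradient descent on it updates the Critic so that its estimates become more accurate

Step 3: update the Actor policy network with \delta

\nabla_\theta J(\theta) = E\left[ \delta_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]

Now a positive \delta raises the action's probability and a negative \delta lowers it.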
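
A tiny numeric sketch of the TD target and TD error follows. The value estimates (0.5 and 0.8) are invented numbers standing in for Critic outputs, and the function names are illustrative.

```python
def td_target(reward, next_value, gamma, terminated):
    """y_t = r_t + gamma * V(s_{t+1}) * (1 - done)."""
    return reward + gamma * next_value * (0.0 if terminated else 1.0)

def td_error(reward, value, next_value, gamma, terminated):
    """delta_t = y_t - V(s_t)."""
    return td_target(reward, next_value, gamma, terminated) - value

# Ordinary step: no reward yet, but the next state looks better than expected
delta_mid = td_error(reward=0.0, value=0.5, next_value=0.8,
                     gamma=0.9, terminated=False)

# Terminal step that reaches the goal: the bootstrap term is dropped
delta_goal = td_error(reward=1.0, value=0.5, next_value=0.8,
                      gamma=0.9, terminated=True)
```

In both cases \delta is positive, so the Actor update would raise the probability of the action that was taken; on the terminal step V(s_{t+1}) plays no role because of the (1 - done) factor.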

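(4) What AC improves over REINFORCE

1. Supports single-step updates, so learning no longer waits for an episode to finish, which usually speeds up training considerably

2. A TD estimate replaces the full-trajectory return, which lowers gradient variance markedly (at the cost of some bias from the Critic) and makes training smoother

(5) Weaknesses of AC

  • Two models must be trained and one directly affects the other. If the Critic learns poor value estimates, the Actor is unlikely to learn a good policy
  • The version shown here uses a softmax policy over discrete actions, which does not directly handle continuous actions such as robot-arm torques or speeds
  • Online single-step updates use each sample only once, so sample efficiency is low
  • There is no experience replay, so consecutive samples are highly correlated.

(6) Example code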

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

# Actor policy network: outputs a probability distribution over actions
class ActorNet(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.out = nn.Linear(h1_nodes, out_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        logits = self.out(x)
        prob = F.softmax(logits, dim=-1)
        return prob

# Critic value network: outputs a single state value V(s)
class CriticNet(nn.Module):
    def __init__(self, in_states, h1_nodes):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.out = nn.Linear(h1_nodes, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        value = self.out(x)
        return value

class FrozenLakeAC:
    learning_rate_a = 0.001
    learning_rate_c = 0.001
    discount_factor_g = 0.9
    ACTIONS = ['L', 'D', 'R', 'U']

    actor_optim = None
    critic_optim = None

    def state_to_onehot(self, state, num_states):
        input_tensor = torch.zeros(num_states)
        input_tensor[state] = 1
        return input_tensor

    def print_policy(self, actor_net):
        num_states = actor_net.fc1.in_features

        for s in range(num_states):
            prob = actor_net(self.state_to_onehot(s, num_states)).tolist()
            prob_str = ''
            for p in prob:
                prob_str += f"{p:.2f} "
            prob_str = prob_str.rstrip()
            best_act = self.ACTIONS[np.argmax(prob)]
            print(f"{s:02},{best_act},[{prob_str}]", end=" ")
            if (s + 1) % 4 == 0:
                print()

    def train(self, episodes, render=False, is_slippery=False):
        env = gym.make(
            "FrozenLake-v1",
            map_name="8x8",
            is_slippery=is_slippery,
            render_mode="human" if render else None
        )
        num_states = env.observation_space.n
        num_actions = env.action_space.n

        # Initialize the two networks
        actor_net = ActorNet(in_states=num_states, h1_nodes=num_states, out_actions=num_actions)
        critic_net = CriticNet(in_states=num_states, h1_nodes=num_states)

        # Two independent optimizers
        self.actor_optim = torch.optim.Adam(actor_net.parameters(), lr=self.learning_rate_a)
        self.critic_optim = torch.optim.Adam(critic_net.parameters(), lr=self.learning_rate_c)


        print("Random initial policy:")
        self.print_policy(actor_net)

        rewards_per_episode = np.zeros(episodes)

        for ep in range(episodes):
            state, _ = env.reset()
            terminated = False
            truncated = False
            episode_reward = 0

            # Core of AC: interact one step at a time and update right away, without storing the whole trajectory
            while not terminated and not truncated:
                # 1. The Actor chooses an action
                state_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    act_prob = actor_net(state_tensor)
                action = torch.multinomial(act_prob, num_samples=1).item()

                # Step the environment
                new_state, reward, terminated, truncated, _ = env.step(action)
                episode_reward += reward

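                # 2. Update the Critic network
                s_tensor = self.state_to_onehot(state, num_states)
                s_next_tensor = self.state_to_onehot(new_state, num_states)

                v_s = critic_net(s_tensor)
                # Treat the TD target as a constant: no gradient flows through V(s')
                with torch.no_grad():
                    v_s_next = critic_net(s_next_tensor)

                # TD target: bootstrap unless s' is a true terminal state
                # (a time-limit truncation still bootstraps from V(s'))
                done_flag = 1 if terminated else 0
                y_t = reward + self.discount_factor_g * v_s_next * (1 - done_flag)
                td_error = y_t - v_s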

                # Critic loss: squared TD error
                loss_critic = torch.square(td_error)
                self.critic_optim.zero_grad()
                loss_critic.backward()
                self.critic_optim.step()

                # 3. Update the Actor network
                log_prob = torch.log(actor_net(s_tensor)[action])
                loss_actor = -td_error.detach() * log_prob
                self.actor_optim.zero_grad()
                loss_actor.backward()
                self.actor_optim.step()

                # Advance to the next state
                state = new_state

            if episode_reward == 1:
                rewards_per_episode[ep] = 1

            # Print training progress
            if (ep + 1) % 200 == 0:
                avg_100 = np.sum(rewards_per_episode[max(0, ep - 99):ep + 1]) / 100
                print(f"Episode {ep + 1}, last 100 avg reward: {avg_100:.2f}")

        env.close()
        torch.save(actor_net.state_dict(), "ac_actor_frozenlake.pt")
        torch.save(critic_net.state_dict(), "ac_critic_frozenlake.pt")

    def test(self, episodes, is_slippery=False):
        env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=is_slippery, render_mode="human")
        num_states = env.observation_space.n
        num_actions = env.action_space.n
        actor_net = ActorNet(num_states, num_states, num_actions)
        actor_net.load_state_dict(torch.load("ac_actor_frozenlake.pt"))
        actor_net.eval()

        print("\nTest Policy:")
        self.print_policy(actor_net)

        for _ in range(episodes):
            state, _ = env.reset()
            terminated = False
            truncated = False

            while not terminated and not truncated:
                s_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    act_prob = actor_net(s_tensor)
                action = torch.argmax(act_prob).item()
                state, reward, terminated, truncated, _ = env.step(action)
        env.close()

if __name__ == "__main__":
    agent = FrozenLakeAC()
    slippery = False
    # agent.train(6000, is_slippery=slippery)
    agent.test(10, is_slippery=slippery)

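Code walkthrough:

1) Two separate networks

ActorNet: takes a one-hot state and outputs a softmax action distribution (stochastic policy), used to interact with the environment and sample actions

CriticNet: takes a one-hot state and outputs a single scalar V(s) (the state-value function), used to estimate the long-term return of the current state

2) Action selection

During training, torch.multinomial samples actions according to their probabilities, which provides exploration;

During testing, argmax picks the most probable action, turning exploration off

3) Two losses and their backward passes

This implementation keeps two separate losses with two separate optimizers, each running its own zero_grad, backward and step. (The two losses can also be combined into one weighted total, as A2C does.)

1. Critic loss (squared error)

L_{critic} = (y_t - V(s_t))^2

Purpose: pull the Critic's prediction V(s_t) toward the TD target so that value estimates become accurate;

Optimization: gradient descent, minimizing the loss

2. Actor policy gradient loss

L_{actor} = -\delta_t \cdot \log \pi(a_t|s_t)

  • In theory the policy gradient maximizes \delta_t \log \pi by gradient ascent, but PyTorch optimizers minimize by default, so a minus sign turns the objective into a loss
  • td_error.detach() is essential: it treats \delta_t as a constant weight, so the Actor's backward pass does not flow back into the Critic's computation graph and the Actor update leaves the Critic untouched

3. Update order: Critic first, then Actor

The Actor update depends on the TD error \delta_t produced by the Critic. In this code \delta_t is computed before either network is updated, the Critic is then stepped, and the Actor is stepped last using the detached \delta_t.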

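6. The A2C algorithm

(1) Where the algorithm sits and how it evolved

A2C stands for Advantage Actor-Critic, the synchronous advantage actor-critic. It is an on-policy reinforcement learning algorithm.

Evolution:

1. REINFORCE: policy network only, full-trajectory MC return, no value estimate, very high gradient variance

2. Basic one-step Actor-Critic: Actor + Critic networks updated with the one-step TD error. Only one actual reward is used, so the target leans heavily on the Critic and is strongly biased, and samples from a single environment are highly correlated in time

3. A2C: builds on basic AC with these core changes: the advantage function, N-step truncated returns, synchronous batched sampling from multiple environments, and policy entropy regularization. These mitigate the high bias, noisy updates, slow training and insufficient exploration of basic AC, making A2C a stable, easy-to-use baseline in practice.

(2) Core formulas

1. N-step truncated return G_t^{(n)}

Basic AC uses only one actual reward; everything further ahead comes from the Critic's prediction, so the estimate is strongly biased;

A2C collects several steps of actual rewards and uses the value network only at the last step to estimate what follows, balancing bias against variance:

G_t^{(n)} = r_t + \gamma r_{t+1} + ... + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) \cdot (1 - done)

The step count n is a hyperparameter, commonly n=5. A larger n gives higher variance and lower bias; a smaller n gives lower variance and higher bias

2. Advantage function A(s_t, a_t)

State value V(s): the expected long-term return from state s, averaged over the actions the current policy would choose

Action value Q(s,a): the expected long-term return after taking action a in state s and following the policy afterwards

Definition of the advantage: A(s_t,a_t) = Q(s_t,a_t) - V(s_t)

The advantage says how much better the chosen action is than the average for that state.

A > 0: the action beats the average, so its probability should be raised

A < 0: the action falls short of the average, so its probability should be lowered

Q(s_t,a_t) is an expectation over all future rewards and cannot be computed exactly in practice, so the N-step truncated return G_t^{(n)} is used as an approximation of Q(s_t,a_t), giving an advantage estimate that can be computed directly:

A_t = G_t^{(n)} - V(s_t)

With n=1 the N-step return reduces to the one-step TD target G_t^{(1)} = r_t + \gamma V(s_{t+1}), and the advantage equals the TD error of basic one-step AC:

A_t = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t

So the TD error of basic AC is just the one-step advantage estimate. By using the multi-step G_t^{(n)}, A2C reduces the bias of the estimate and tends to train more stably.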
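
The N-step return and the resulting advantage can be worked through with a few numbers. In this sketch `n_step_return` is a hypothetical helper, the rewards and value estimates are invented, and \gamma = 0.5 is chosen only to keep the arithmetic easy to follow.

```python
def n_step_return(rewards, bootstrap_value, gamma, terminated):
    """G_t^(n) from rewards r_t..r_{t+n-1} and the Critic estimate V(s_{t+n}).

    If the episode terminated within these n steps, the bootstrap term is dropped.
    """
    G = 0.0 if terminated else bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# n = 3: 1 + 0.5*0 + 0.25*2 + 0.125*10 = 2.75
g3 = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5, terminated=False)

# n = 1 reduces to the one-step TD target r_t + gamma * V(s_{t+1}) = 1 + 0.5*10
g1 = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5, terminated=False)

# Advantage estimate A_t = G_t^(n) - V(s_t), assuming the Critic says V(s_t) = 4.0
advantage = g3 - 4.0
```

Here the advantage is negative, so the Actor update would lower the probability of the action taken at time t.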

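3. Policy gradient formula

The goal of reinforcement learning is to maximize the expected long-term return J(\theta), updating the policy parameters by gradient ascent

Policy gradient with the advantage as weight:

\nabla_\theta J = E\left[ A_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t) \right]

A_t (the advantage) acts as the weight that sets the direction and size of the update;

Gradient ascent raises the probability of actions with positive advantage and lowers that of actions with negative advantage

4. Loss function (what backpropagation actually runs on)

PyTorch optimizers minimize a loss by default, so an objective to be maximized gets a minus sign, and the terms are combined into a weighted total:

L_{total} = L_{actor} + c_1 \cdot L_{critic} - c_2 \cdot H(\pi)

c_1 is the value-loss weight and c_2 is the entropy-regularization weight

1) Actor policy loss L_{actor}

L_{actor} = -A_t \cdot \log \pi_\theta(a_t|s_t)

2) Critic value loss L_{critic}

L_{critic} = \left( G_t^{(n)} - V_\phi(s_t) \right)^2

3) Policy entropy regularizer H(\pi)

Entropy measures how random the action distribution is; the larger the entropy, the more the policy explores:

H(\pi) = -\sum_a \pi(a|s) \cdot \log \pi(a|s)

Subtracting the entropy term from the total loss is equivalent to maximizing entropy, which encourages continued exploration. Without it the policy can collapse quickly onto a single greedy action and get stuck in a local optimum.

How to read this:

The PyTorch optimizer performs gradient descent, so it pushes -c_2 \cdot H down, which means H is pushed up.

If the policy always picks the same action (greedy), the entropy is small and there is almost no exploration;

If the policy spreads probability evenly over the actions, the entropy is large and exploration is strong.

In one sentence: higher entropy means the agent is more willing to try different actions, so it explores more.

Why does leaving out entropy regularization cause trouble?

Without the entropy term, the policy gradient only cares about raising the probability of actions with positive advantage. It can quickly push the currently best-looking action toward probability 1 and the others toward 0. The model then keeps choosing the same action and stops trying other routes; if that action is only locally optimal, the agent may never find a better policy.

A common refinement is to decay the entropy weight c_2 linearly during training, so the first half emphasizes exploration and the second half settles into exploiting the best policy found. (The example code in this section keeps c_2 fixed.)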
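
The entropy values behind these statements are easy to compute directly. The sketch below also includes one possible linear decay schedule for c_2; `linear_entropy_coeff` and its start/end values are assumptions for illustration and are not taken from the example code.

```python
import math

def entropy(probs):
    """H(pi) = -sum_a pi(a) * log pi(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def linear_entropy_coeff(step, total_steps, start, end):
    """Hypothetical schedule: move c_2 linearly from start to end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximum for 4 actions: log(4)
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly greedy: small entropy
h_greedy = entropy([1.0, 0.0, 0.0, 0.0])       # fully greedy: zero entropy

c2_start = linear_entropy_coeff(0, 1000, start=0.01, end=0.0)
c2_mid = linear_entropy_coeff(500, 1000, start=0.01, end=0.0)
c2_end = linear_entropy_coeff(1000, 1000, start=0.01, end=0.0)
```

The uniform policy reaches the maximum log(4), the nearly greedy one is close to zero, and the schedule halves the entropy weight by the midpoint of training.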

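(3) Step-by-step training flow

1) Initialization

1. Build the two networks: the Actor outputs action probabilities and the Critic outputs the state value V(s)

2. Create several independent parallel environments to obtain less correlated samples, taking the place of experience replay

3. Initialize the optimizer and hyperparameters: discount factor \gamma, rollout length T_{max}, value-loss weight, entropy weight

2) Synchronous batched sampling of N-step trajectories

All parallel environments step in lockstep and collect a fixed T_{max} steps of data;

All transitions \left( s, a, r, s_{next}, done \right) are cached together

The networks are not updated during collection; the update is computed once everything has been gathered

3) Batched computation of N-step returns and advantages

1. The Critic predicts the value of the state at the end of the rollout, and the N-step truncated return G_t^{(n)} of every step is computed backwards from it

2. The Critic predicts the current state value V(s_t)

3. Compute the advantage: A_t = G_t^{(n)} - V(s_t)

4) Compute the combined loss and update with one backward pass

1. Compute the Actor loss, Critic loss and entropy term over the batch and weight them into the total loss

2. Clear the gradients; a single backward pass produces gradients for both networks

3. Apply the update, optimizing the Actor and Critic parameters together

5) Iterate

Clear the trajectory cache and start the next N-step rollout from where the environments currently are, without resetting them in between, and repeat until training ends.

(4) Example code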

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from gymnasium.vector import SyncVectorEnv

# Select the device automatically
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Actor policy network
class ActorNet(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.fc2 = nn.Linear(h1_nodes, h1_nodes)
        self.out = nn.Linear(h1_nodes, out_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        logits = self.out(x)
        prob = F.softmax(logits, dim=-1)
        return prob

# Critic value network
class CriticNet(nn.Module):
    def __init__(self, in_states, h1_nodes):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.fc2 = nn.Linear(h1_nodes, h1_nodes)
        self.out = nn.Linear(h1_nodes, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        value = self.out(x)
        return value

class FrozenLakeA2C:
    learning_rate = 1e-4
    discount_factor_g = 0.99
    critic_coeff = 0.5
    entropy_coeff = 0.001
    T_MAX = 30
    NUM_ENVS = 16
    ACTIONS = ['L', 'D', 'R', 'U']

    optimizer = None
    device = device

    def state_to_onehot(self, state_batch, num_states):
        batch_size = state_batch.shape[0]
        onehot = torch.zeros(batch_size, num_states, device=self.device)
        onehot[torch.arange(batch_size, device=self.device), state_batch] = 1.0
        return onehot

    def train(self, episodes, render=False, is_slippery=False):
        def make_env():
            return gym.make("FrozenLake-v1", map_name="8x8", is_slippery=is_slippery)

        temp_env = make_env()
        num_states = temp_env.observation_space.n
        num_actions = temp_env.action_space.n
        temp_env.close()

        envs = SyncVectorEnv([make_env for _ in range(self.NUM_ENVS)])

        # Move the networks to the selected device
        actor_net = ActorNet(num_states, 64, num_actions).to(self.device)
        critic_net = CriticNet(num_states, 64).to(self.device)
        self.optimizer = torch.optim.Adam(
            list(actor_net.parameters()) + list(critic_net.parameters()),
            lr=self.learning_rate
        )

        rewards_record = []
        total_ep_count = 0

        # Reset once up front. SyncVectorEnv resets finished sub-environments on its own,
        # so each rollout continues from wherever the environments currently are.
        state_batch, _ = envs.reset()
        next_log = 200  # print progress each time this many more episodes have finished

        while total_ep_count < episodes:
            traj_states = []
            traj_actions = []
            traj_rewards = []
            traj_dones = []

            # Collect a rollout of T_MAX steps
            for _ in range(self.T_MAX):
                s_tensor = self.state_to_onehot(torch.from_numpy(state_batch).to(self.device), num_states)
                # Sampling needs no gradient; probabilities are recomputed for the loss below
                with torch.no_grad():
                    act_prob = actor_net(s_tensor)
                action_batch = torch.multinomial(act_prob, num_samples=1).squeeze(-1).cpu().numpy()
                next_state_batch, reward_batch, term_batch, trunc_batch, _ = envs.step(action_batch)
                done_batch = np.logical_or(term_batch, trunc_batch).astype(np.float32)

                traj_states.append(s_tensor)
                traj_actions.append(torch.from_numpy(action_batch).to(self.device))
                traj_rewards.append(torch.from_numpy(reward_batch).float().to(self.device))
                traj_dones.append(torch.from_numpy(done_batch).float().to(self.device))
                state_batch = next_state_batch

            # Success rate among episodes that finished in this rollout (computed on CPU)
            all_r = torch.cat(traj_rewards).cpu()
            all_done = torch.cat(traj_dones).cpu()
            finish_mask = all_done == 1.0
            finish_r = all_r[finish_mask]
            finish_num = finish_r.shape[0]
            success_num = torch.sum(finish_r == 1.0).item()
            if finish_num > 0:
                success_rate = success_num / finish_num
                rewards_record.append(success_rate)
                total_ep_count += finish_num

            # N-step returns: bootstrap from V of the state after the rollout, held constant
            last_s = self.state_to_onehot(torch.from_numpy(state_batch).to(self.device), num_states)
            with torch.no_grad():
                v_last = critic_net(last_s).squeeze(-1)
            G_list = [torch.zeros(self.NUM_ENVS, device=self.device) for _ in range(self.T_MAX)]
            G = v_last

            # Work backwards; (1 - done_t) cuts the bootstrap at episode boundaries
            for t in reversed(range(self.T_MAX)):
                r_t = traj_rewards[t]
                done_t = traj_dones[t]
                G = r_t + self.discount_factor_g * G * (1 - done_t)
                G_list[t] = G

            total_loss = torch.tensor(0.0, device=self.device)
            for t in range(self.T_MAX):
                s_t = traj_states[t]
                a_t = traj_actions[t]
                G_t = G_list[t]
                v_t = critic_net(s_t).squeeze(-1)
                advantage = (G_t - v_t).detach()

                # Critic loss
                loss_critic = torch.square(G_t - v_t)
                # Actor loss
                prob_t = actor_net(s_t)
                log_p = torch.log(prob_t[torch.arange(self.NUM_ENVS, device=self.device), a_t] + 1e-8)
                loss_actor = -advantage * log_p
                # Entropy
                entropy = -torch.sum(prob_t * torch.log(prob_t + 1e-8), dim=-1)
                # Total step loss
                step_loss = loss_actor + self.critic_coeff * loss_critic - self.entropy_coeff * entropy
                total_loss += torch.mean(step_loss)

            # One backward pass for both networks, with gradient clipping
            self.optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(
                list(actor_net.parameters()) + list(critic_net.parameters()),
                max_norm=0.5
            )
            self.optimizer.step()

            # Progress log (the average is over the last 100 rollout success rates)
            if total_ep_count >= next_log:
                if len(rewards_record) >= 100:
                    avg100 = np.mean(rewards_record[-100:])
                    print(f"Total episodes {total_ep_count}, last 100 success rate: {avg100:.2f}")
                next_log += 200

        envs.close()
        # Move the model back to CPU before saving
        torch.save(actor_net.cpu().state_dict(), "a2c_actor_frozenlake.pt")
        plt.figure()
        plt.plot(rewards_record)
        plt.title("A2C 8x8 FrozenLake Success Rate")
        plt.savefig("a2c_curve.png")

    def test(self, episodes, is_slippery=False):
        env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=is_slippery, render_mode="human")
        num_states = env.observation_space.n
        num_actions = env.action_space.n
        actor_net = ActorNet(num_states, 64, env.action_space.n)
        # Load and run inference on CPU for testing
        actor_net.load_state_dict(torch.load("a2c_actor_frozenlake.pt"))
        actor_net.eval()
        cnt = 0
        for _ in range(episodes):
            s, _ = env.reset()
            done = False
            while not done:
                s_tensor = torch.zeros(1, num_states)
                s_tensor[0, s] = 1
                with torch.no_grad():
                    prob = actor_net(s_tensor)
                a = torch.argmax(prob).item()
                s, r, term, trunc, _ = env.step(a)
                done = term or trunc
                if r == 1:
                    cnt += 1
        env.close()
        print(f"Test success rate: {cnt/episodes:.2f}")

if __name__ == "__main__":
    agent = FrozenLakeA2C()
    slippery = False
    # agent.train(120000, is_slippery=slippery)
    agent.test(200, is_slippery=slippery)

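7. The DPG algorithm

(1) Why DPG is needed

The A2C implementation above targets discrete actions: it outputs a probability distribution over actions and selects one by sampling.

In continuous control (robot arms, autonomous driving, robots) an action is a real number in an interval such as a \in [-1,1]. A softmax over discrete actions no longer applies, so the discrete Actor-Critic above cannot be used as is.

DPG (Deterministic Policy Gradient) provides the policy gradient theory for deterministic policies over continuous actions.

(2) Core distinction: stochastic policy vs deterministic policy

1) Stochastic policy (A2C): \pi_\theta(a|s) outputs an action distribution, and the action is obtained by random sampling

2) Deterministic policy (DPG): \mu_\theta(s) takes the state as input and directly outputs one definite action a = \mu_\theta(s)

Advantages: no sampling over actions, a simple gradient computation, a natural fit for continuous control, and lower variance in the gradient estimate

Disadvantage: no built-in random exploration, so exploration noise must be added separately

(3) Core DPG formulas

Objective: maximize the expected long-term return J(\theta) = E_s\left[ Q(s, \mu_\theta(s)) \right]

Deterministic policy gradient theorem:

\nabla_\theta J = E_s\left[ \nabla_a Q(s,a)|_{a=\mu_\theta(s)} \cdot \nabla_\theta \mu_\theta(s) \right]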
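
The chain rule in the theorem can be traced on a toy problem. Everything in this sketch is an assumption chosen for illustration: a known critic Q(s,a) = -(a - 2s)^2 (so the best action is a = 2s), a one-parameter policy \mu_\theta(s) = \theta s, and four sample states.

```python
def dpg_gradient(theta, states):
    """Average of grad_a Q(s, a)|_{a = mu_theta(s)} * grad_theta mu_theta(s)."""
    total = 0.0
    for s in states:
        a = theta * s                   # a = mu_theta(s)
        dq_da = -2.0 * (a - 2.0 * s)    # grad_a of Q(s, a) = -(a - 2s)^2
        dmu_dtheta = s                  # grad_theta of mu_theta(s) = theta * s
        total += dq_da * dmu_dtheta
    return total / len(states)

states = [-1.0, -0.5, 0.5, 1.0]
initial_gradient = dpg_gradient(0.0, states)

# Gradient ascent on theta: the Critic tells the Actor which way to shift the action
theta = 0.0
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states)
```

No action is ever sampled: the Critic's slope with respect to the action says which way to move the action, and the Actor's parameters follow. In this toy setup \theta converges to 2, the value that makes \mu_\theta(s) = 2s.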

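(4) Summary

DPG was designed for continuous action spaces. It replaces the stochastic policy's probability-distribution output with a deterministic numeric output, which avoids sampling over actions and lowers variance. Because the action is deterministic, the policy cannot explore by itself, so noise must be added, and in practice it is usually trained off-policy so that data gathered by the noisy behavior policy can be used (DDPG adds a replay buffer to reuse historical data). The Critic supplies the direction in which to adjust the action, and the Actor follows that direction with a gradient update. DPG on its own tends to be unstable with nonlinear networks, so in practice its successor DDPG is usually used instead.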

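8. The DDPG algorithm

(1) Background

Plain on-policy DPG with neural networks is hard to train stably in practice. Its main weaknesses:

  1. Samples are thrown away after one use, so utilization is very low: each stretch of trajectory is discarded after its update, making interaction costly and convergence slow
  2. The TD target keeps shifting: a single set of networks produces both the prediction and the target, so the target moves with every update and Q-values can oscillate or diverge
  3. Samples from a single environment are strongly correlated and nothing smooths the updates, so convergence is difficult

DDPG = Deep DPG. On top of the deterministic policy gradient theory it adds four engineering measures that turn DPG into a continuous-control algorithm that can be trained stably.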

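(2) Experience replay buffer

This is what makes the algorithm off-policy.

1. Store transition five-tuples \left( s, a, r, s', done \right)

2. Sample random mini-batches for training, breaking the temporal correlation within trajectories

3. Old samples can be reused many times, which greatly improves sample efficiency and fixes the sample waste of on-policy DPG

(3) Two target networks

Plain DPG has one Actor and one Critic; DDPG adds two separate, lagging target networks:

1. Target Actor: produces a stable target action for the next state

2. Target Critic: computes a slowly changing TD target y_t

3. Together they decouple the TD target from the moment-to-moment parameter changes of the online networks, which damps the oscillation of Q-values

(4) Soft update of the target networks

Parameters are not hard-copied; the online weights are blended slowly into the target networks

\theta' \leftarrow \tau \cdot \theta + (1-\tau) \theta', \tau \ll 1

The target parameters always lag behind the online ones, which prevents sudden jumps in the target value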
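
The soft-update rule can be illustrated on plain lists of numbers standing in for network weights. `soft_update` here is a hypothetical helper for this sketch and not the method of the DDPG agent class.

```python
def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * w + (1.0 - tau) * w_target
            for w_target, w in zip(target_params, online_params)]

target = [0.0, 0.0]
online = [1.0, -2.0]

# One update moves the target only 0.5% of the way toward the online weights
after_one = soft_update(target, online, tau=0.005)

# Repeated updates let the target drift smoothly toward the online weights
after_many = target
for _ in range(2000):
    after_many = soft_update(after_many, online, tau=0.005)
```

With \tau = 0.005 a single update barely moves the target, yet after a couple of thousand updates it has nearly caught up, which is the "lagging but following" behavior described above.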

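(5) Off-policy mini-batch training

DPG updates right after collecting a stretch of trajectory. DDPG stores transitions in the replay buffer and, once the buffer holds at least one batch of samples, runs a random mini-batch update after every interaction step, so updates are more frequent and gradients are smoother.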

(6) Core DDPG formulas
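
The source gives no formulas under this heading. As a hedged reconstruction, the standard DDPG update rules are written out here using the notation of the earlier sections. The symbols \phi and \phi' for the online and target Critic parameters, and the mini-batch size N, are notational assumptions introduced here.

```latex
% TD target, computed with the target Actor \mu_{\theta'} and target Critic Q_{\phi'}
y_i = r_i + \gamma \, (1 - done_i) \, Q_{\phi'}\!\left( s'_i, \mu_{\theta'}(s'_i) \right)

% Critic loss over a mini-batch of N transitions sampled from the replay buffer
L_{critic}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q_\phi(s_i, a_i) \right)^2

% Actor loss: minimizing -Q is gradient ascent on the deterministic policy objective
L_{actor}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} Q_\phi\!\left( s_i, \mu_\theta(s_i) \right)

% Soft update of both target networks
\theta' \leftarrow \tau \theta + (1 - \tau) \theta', \qquad
\phi' \leftarrow \tau \phi + (1 - \tau) \phi'
```

Differentiating L_{actor} with respect to \theta by the chain rule gives the deterministic policy gradient \nabla_a Q \cdot \nabla_\theta \mu_\theta from the DPG section, averaged over the mini-batch.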

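(7) Notes on the Pendulum control torque

1) Physical definition of torque

Torque is the turning effect of a force that twists or rotates an object. Unit: N·m (newton-metre)

In the Pendulum environment the rod is attached at one end to a fixed pivot, and torque is applied to swing the rod up and keep it balanced upright.

Torque = force x lever arm. The larger the torque, the larger the angular acceleration of the rod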

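2) Torque has a direction; the sign gives the direction of rotation

Positive torque: a counter-clockwise twist, which accelerates the rod counter-clockwise (the convention documented for Gymnasium's Pendulum-v1)

Negative torque: a clockwise twist, which accelerates the rod clockwise

3) Role in the Pendulum task

Goal: keep the rod pointing straight up

  • If the rod leans left (tilted counter-clockwise), the Agent needs to output a negative torque to pull it back clockwise;
  • If the rod leans right (tilted clockwise), the Agent needs to output a positive torque to pull it back counter-clockwise;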

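4) Meaning of Pendulum's 3-dimensional state

Pendulum provides a 3-dimensional state: [\cos\theta, \sin\theta, \theta'], i.e. angle information plus angular velocity.

1. Why not use the raw angle \theta

The angle \theta is periodic: \theta=\pi and \theta=-\pi are the same physical position, yet the numbers jump by 2\pi. A neural network would treat the two as far apart, which makes learning harder and the gradients erratic.

2. Why use \cos\theta, \sin\theta (2-dimensional position information)

  • They map the angle to a point on the unit circle, which is continuous everywhere with no jumps
  • Together the two values uniquely determine the rod's current orientation
  • \cos distinguishes up from down and \sin distinguishes left from right, removing the discontinuity caused by periodicity

3. Why the angular velocity \theta' is needed (1-dimensional velocity information)

Definition: the change of angle per unit time, \theta' = \frac{d\theta}{dt}. It is the rotation speed: the sign gives the direction and the magnitude gives how fast the rod is turning.

Reinforcement learning models the task as a Markov decision process (MDP), so the current state must contain all the information needed to predict what happens next.

The same rod position can go with entirely different motion: falling to the left, or swinging back toward upright.

4. Summary

\cos\theta + \sin\theta: a precise, continuous description of the rod's current position

Angular velocity: the rod's direction of motion and how fast it is turning.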
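
The discontinuity of the raw angle, and how the (\cos\theta, \sin\theta) encoding removes it, can be verified with a few lines of plain Python. The helper names and the offset `eps` are chosen only for this sketch.

```python
import math

def encode_angle(theta):
    """Map an angle to its point (cos, sin) on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def raw_gap(theta_a, theta_b):
    """Distance between two angles when compared as raw numbers."""
    return abs(theta_a - theta_b)

def encoded_gap(theta_a, theta_b):
    """Euclidean distance between the two encoded points."""
    (ca, sa), (cb, sb) = encode_angle(theta_a), encode_angle(theta_b)
    return math.hypot(ca - cb, sa - sb)

# Two almost identical physical positions on either side of the +pi / -pi seam
eps = 1e-3
theta_a = math.pi - eps
theta_b = -math.pi + eps

gap_raw = raw_gap(theta_a, theta_b)          # close to 2*pi: looks far apart
gap_encoded = encoded_gap(theta_a, theta_b)  # close to 2*eps: correctly near
```

As raw numbers the two angles differ by almost 2\pi, while their encoded points are almost identical, which is the input a network needs in order to treat them as the same position.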

(8) Example code

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gymnasium as gym
from collections import deque
import random
import os

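# ==================== 0. Global device setup ====================
# Use a CUDA GPU when available so tensors and networks run on it; otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🖥️  Using device: {device}")


# ==================== 1. Global hyperparameters (commonly used DDPG settings for Pendulum) ====================
STATE_DIM = 3               # Pendulum state dimension: [cosθ, sinθ, angular velocity]
ACTION_DIM = 1              # Pendulum continuous action dimension: a single torque
ACTION_BOUND = 2.0          # True action range is [-2, 2]; the tanh output in [-1, 1] is scaled by this factor

LR_ACTOR = 1e-3             # Learning rate of the Actor policy network
LR_CRITIC = 1e-3            # Learning rate of the Critic value network (often set equal to or larger than the Actor's)
GAMMA = 0.99                # Discount factor; the closer to 1, the more long-term return matters
TAU = 0.005                 # Soft-update coefficient; a small value keeps the target networks changing slowly
MEMORY_CAPACITY = 100000    # Maximum replay buffer size; the oldest samples are dropped when full
BATCH_SIZE = 64             # Mini-batch size sampled from the replay buffer per update

MAX_EPISODES = 300          # Maximum number of training episodes
MAX_STEPS = 200             # Maximum steps per episode; Pendulum-v1 is truncated at 200 steps by default
EXPLORE_NOISE = 0.1         # Standard deviation of the Gaussian exploration noise added during training

MODEL_SAVE_PATH = "ddpg_pendulum.pth"  # Where the model weights are saved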


# ==================== 2. Network definitions (the two base DDPG networks: Actor and Critic) ====================
# Actor: deterministic policy network; takes a state and outputs a continuous action
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, action_bound):
        super(Actor, self).__init__()
        # 环境动作缩放边界
        self.action_bound = action_bound
        # 三层全连接网络
        self.fc1 = nn.Linear(state_dim, 256)    # 输入层:状态向量
        self.fc2 = nn.Linear(256, 128)          # 隐藏层
        self.fc3 = nn.Linear(128, action_dim)   # 输出层:原始动作值

    def forward(self, state):
        """前向传播,输入批量状态张量,输出缩放后的连续动作"""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # tanh激活约束原始输出 [-1, 1],乘以边界缩放至环境真实动作区间
        return torch.tanh(self.fc3(x)) * self.action_bound

# Critic: action-value network Q(s,a); takes a state and an action and outputs a single scalar Q-value
class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        # 状态单独分支提取特征
        self.fc_s = nn.Linear(state_dim, 256)
        # 动作单独分支提取特征
        self.fc_a = nn.Linear(action_dim, 256)
        # 特征融合层:状态特征+动作特征拼接,维度256+256=512
        self.fc2 = nn.Linear(512, 128)
        # 输出层:单一Q值,代表当前(s,a)的长期折扣回报
        self.fc3 = nn.Linear(128, 1)

    def forward(self, state, action):
        """前向传播:分别提取状态、动作特征后融合,输出Q(s,a)"""
        # 提取状态特征
        s_out = F.relu(self.fc_s(state))
        # 提取动作特征
        a_out = F.relu(self.fc_a(action))
        # 在特征维度拼接状态、动作特征
        x = torch.cat([s_out, a_out], dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)


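# ==================== 3. ReplayBuffer (the off-policy component at the core of DDPG) ====================
# Purpose: 1. Reuse historical samples instead of discarding them after one use as on-policy DPG does
#          2. Random sampling breaks the temporal correlation within trajectories and helps avoid oscillating or diverging gradients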
class ReplayBuffer:
    def __init__(self, capacity):
        # 双端队列存储五元组 (s,a,r,s_next,done),自动FIFO淘汰旧样本
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """存入一条交互转移样本"""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """随机采样一批样本,返回numpy数组格式,后续转GPU张量"""
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = zip(*batch)
        return (np.array(state), np.array(action), np.array(reward, dtype=np.float32),
                np.array(next_state), np.array(done, dtype=np.float32))

    def __len__(self):
        """返回当前缓存样本数量,用于判断是否满足训练条件"""
        return len(self.buffer)


# ==================== 4. DDPG agent class (4 networks: online Actor/Critic + target Actor/Critic) ====================
# The four DDPG additions: replay buffer, two target networks, soft updates, off-policy mini-batch training
class DDPGAgent:
    def __init__(self):
        # 1. 在线网络:实时参与梯度更新,用于环境交互、损失计算
        self.actor = Actor(STATE_DIM, ACTION_DIM, ACTION_BOUND).to(device)
        self.critic = Critic(STATE_DIM, ACTION_DIM).to(device)

        # 2. 目标网络:延迟软更新,仅用于计算稳定TD目标值,无梯度更新
        self.actor_target = Actor(STATE_DIM, ACTION_DIM, ACTION_BOUND).to(device)
        self.critic_target = Critic(STATE_DIM, ACTION_DIM).to(device)

        # 初始化目标网络参数,与在线网络完全一致,保证训练初期目标稳定
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())

        # 独立优化器:Actor、Critic分开优化,各自学习率独立控制
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=LR_ACTOR)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=LR_CRITIC)

        # 初始化经验回放池
        self.memory = ReplayBuffer(MEMORY_CAPACITY)
        # 训练迭代计数器,记录更新次数
        self.learn_step_counter = 0

    def select_action(self, state, add_noise=True):
        """
        根据当前状态输出交互动作
        :param state: 环境返回numpy状态数组
        :param add_noise: True=训练阶段叠加高斯噪声探索;False=测试纯确定性策略
        :return: 可直接传入环境的动作numpy数组
        """
        # numpy状态转GPU张量,增加batch维度 [1, state_dim]
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
        # 前向传播得到网络输出动作
        action_tensor = self.actor(state_tensor)
        # 计算图分离,移回CPU转为numpy,用于gym环境交互
        action = action_tensor.detach().cpu().numpy()[0]

        # 训练阶段添加高斯噪声,实现环境探索(确定性策略无内置随机性,必须人工加噪)
        if add_noise:
            noise = np.random.normal(0, EXPLORE_NOISE, size=ACTION_DIM)
            # 裁剪动作至合法区间,防止超出环境动作限制
            action = np.clip(action + noise, -ACTION_BOUND, ACTION_BOUND)
        return action

    def store_transition(self, state, action, reward, next_state, done):
        """将单步交互五元组存入回放池"""
        self.memory.push(state, action, reward, next_state, done)

    def learn(self):
        """DDPG核心训练更新函数:批量更新Critic、Actor,软更新目标网络"""
        # 回放池样本不足批量大小,不执行训练
        if len(self.memory) < BATCH_SIZE:
            return

        # 1. 从回放池随机采样批量数据
        state_np, action_np, reward_np, next_state_np, done_np = self.memory.sample(BATCH_SIZE)
        # 全部转为GPU张量,适配CUDA加速
        state = torch.FloatTensor(state_np).to(device)
        action = torch.FloatTensor(action_np).to(device)
        reward = torch.FloatTensor(reward_np).unsqueeze(1).to(device)    # 扩充维度 [batch,1]
        next_state = torch.FloatTensor(next_state_np).to(device)
        done = torch.FloatTensor(done_np).unsqueeze(1).to(device)       # 终止标记扩充维度

        # -------------------------- 第一阶段:更新在线Critic价值网络 --------------------------
        # with torch.no_grad(): 冻结目标网络计算图,不更新目标网络参数
        with torch.no_grad():
            # 目标Actor生成下一状态最优动作
            next_action = self.actor_target(next_state)
            # 目标Critic评估下一状态目标Q值
            target_q = self.critic_target(next_state, next_action)
            # DDPG TD目标值公式:y_t = r + γ * Q_target(s', μ_target(s')) * (1-done)
            # done=1回合终止,未来回报置0;done=0保留未来折扣回报
            target_value = reward + GAMMA * (1 - done) * target_q

        # 在线Critic预测当前样本Q值
        current_q = self.critic(state, action)
        # Critic损失:MSE均方误差,缩小预测Q与稳定目标y_t的差距
        critic_loss = F.mse_loss(current_q, target_value)

        # 梯度清零、反向传播、优化器更新Critic参数
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # -------------------------- 第二阶段:更新在线Actor策略网络 --------------------------
        # Actor损失:-E[Q(s, μ(s))],最小化负Q等价最大化长期回报,贴合DPG确定性梯度定理
        new_action = self.actor(state)
        actor_loss = -self.critic(state, new_action).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # -------------------------- 第三阶段:软更新两套目标网络(DDPG独有) --------------------------
        self._soft_update(self.actor_target, self.actor)
        self._soft_update(self.critic_target, self.critic)

        self.learn_step_counter += 1

    def _soft_update(self, target_net, source_net):
        """
        软更新函数:缓慢将在线网络参数同步到目标网络,避免目标值剧烈震荡
        更新公式:target_param = τ * source_param + (1-τ) * target_param
        :param target_net: 待更新的目标网络
        :param source_net: 实时更新的在线网络
        """
        for target_param, source_param in zip(target_net.parameters(), source_net.parameters()):
            target_param.data.copy_(TAU * source_param.data + (1 - TAU) * target_param.data)

    def save_model(self, path):
        """保存4套网络完整权重,支持断点续训"""
        # GPU张量直接保存state_dict,加载时自动映射设备
        torch.save({
            'actor_state_dict': self.actor.state_dict(),
            'critic_state_dict': self.critic.state_dict(),
            'actor_target_state_dict': self.actor_target.state_dict(),
            'critic_target_state_dict': self.critic_target.state_dict(),
        }, path)
        print(f"✅ 模型已保存至 {path}")

    def load_model(self, path):
        """加载保存的模型权重,自动适配当前CPU/CUDA设备"""
        if not os.path.exists(path):
            raise FileNotFoundError(f"模型文件 {path} 不存在,请先训练!")
        # map_location自动将权重映射到当前运行设备
        checkpoint = torch.load(path, map_location=device)
        self.actor.load_state_dict(checkpoint['actor_state_dict'])
        self.critic.load_state_dict(checkpoint['critic_state_dict'])
        self.actor_target.load_state_dict(checkpoint['actor_target_state_dict'])
        self.critic_target.load_state_dict(checkpoint['critic_target_state_dict'])
        print(f"✅ 模型已从 {path} 加载到 {device}")


# ==================== 5. 完整训练流程函数 ====================
def train():
    # 创建Pendulum连续动作环境
    env = gym.make('Pendulum-v1')
    # 初始化DDPG智能体
    agent = DDPGAgent()
    # 记录每回合总奖励,用于监控收敛
    episode_rewards = []

    # 循环训练每一个回合
    for episode in range(MAX_EPISODES):
        # 环境重置,获取初始状态
        state, _ = env.reset()
        episode_reward = 0

        # 单回合内循环交互最多MAX_STEPS步
        for step in range(MAX_STEPS):
            # 训练模式,带噪声探索
            action = agent.select_action(state, add_noise=True)
            # 执行动作与环境交互
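            next_state, reward, terminated, truncated, _ = env.step(action)
            # Pendulum-v1 has no terminal state (terminated stays False); truncated fires at the 200-step time limit
            done = terminated or truncated

            # Store the transition and run one update. Only `terminated` is stored as the done flag,
            # so the TD target still bootstraps from next_state when the episode is merely cut off by the time limit
            agent.store_transition(state, action, reward, next_state, terminated)
            agent.learn()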

            # 状态迭代,累加单步奖励至回合总奖励
            state = next_state
            episode_reward += reward

            # 回合提前终止,跳出步循环
            if done:
                break

        # 保存当前回合总奖励
        episode_rewards.append(episode_reward)
        # 计算近10回合平均奖励,监控收敛效果
        avg_reward = np.mean(episode_rewards[-10:]) if len(episode_rewards) >= 10 else np.mean(episode_rewards)
        print(f"Episode {episode+1:3d} | Reward: {episode_reward:7.2f} | Avg(10): {avg_reward:7.2f}")

        # 提前终止条件:近10轮平均奖励大于-200,摆锤稳定直立,训练完成
        if avg_reward > -200:
            print(f"🎉 在 {episode+1} 回合后达到目标!")
            break

    env.close()
    # 训练结束保存模型权重
    agent.save_model(MODEL_SAVE_PATH)
    return agent, episode_rewards


# ==================== 6. 模型测试推理函数(无噪声纯确定性策略) ====================
def test(render=True):
    # 可视化渲染环境
    env = gym.make('Pendulum-v1', render_mode='human' if render else None)
    agent = DDPGAgent()
    # 加载训练好的模型
    try:
        agent.load_model(MODEL_SAVE_PATH)
    except FileNotFoundError as e:
        print(e)
        return

    state, _ = env.reset()
    total_reward = 0

    # 单回合推理交互,无探索噪声
    for _ in range(MAX_STEPS):
        action = agent.select_action(state, add_noise=False)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        state = next_state
        total_reward += reward

        if done:
            break

    env.close()
    print(f"🧪 测试总奖励: {total_reward:.2f}")
    return total_reward


# ==================== 7. 程序主入口,切换训练/测试模式 ====================
if __name__ == "__main__":
    # ====== 模式开关:True=启动训练,False=加载模型可视化测试 ======
    TRAIN_MODE = False   # True: 训练并保存模型, False: 加载模型并测试

    if TRAIN_MODE:
        train()
    else:
        test(render=True)

Code walkthrough:
Step 1: compute the TD target y_t

with torch.no_grad():
    next_action = self.actor_target(next_state)
    target_q = self.critic_target(next_state, next_action)
    target_value = reward + GAMMA * (1 - done) * target_q

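Corresponding formula:

y_t = r + \gamma (1 - done) \cdot Q_{\phi'}\left ( s', \mu_{\theta'}\left ( s' \right ) \right )

where:

next_action = actor_target(next_state): the target Actor outputs the deterministic action for the next state, a' = \mu _{\theta'}\left ( s' \right )

critic_target(next_state, next_action): the target Critic evaluates the next state-action value Q_{\phi '}\left ( s',a' \right )

(1 - done): when the episode has terminated (done = 1) the bootstrap term is dropped and y_t = r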
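The effect of the (1 - done) mask in the TD target can be checked on plain numbers. The following is a minimal sketch in pure Python; the function name `td_targets` and the sample values are made up for illustration and are not part of the DDPG code above.

```python
# Minimal sketch of the DDPG TD target on plain Python lists (illustrative only).
GAMMA = 0.99  # assumed discount factor for this example


def td_targets(rewards, dones, next_q_values, gamma=GAMMA):
    """y = r + gamma * (1 - done) * Q_target(s', mu_target(s')) for each sample."""
    return [r + gamma * (1.0 - d) * q
            for r, d, q in zip(rewards, dones, next_q_values)]


# Two transitions with the same reward and the same next-state value;
# the second one ends the episode, so its bootstrap term is dropped.
targets = td_targets(rewards=[1.0, 1.0], dones=[0.0, 1.0], next_q_values=[10.0, 10.0])
print(targets)
```

The first target keeps the discounted next-state value, while the second collapses to the immediate reward. This is why the done flag stored in the replay buffer should mark true termination only.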

Step 2: compute the Critic loss L_{critic}

current_q = self.critic(state, action)
critic_loss = F.mse_loss(current_q, target_value)

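Corresponding formula:

L_{critic} = \frac{1}{N}\sum_{i=1}^{N}\left ( Q_\phi \left ( s_i,a_i \right ) - y_i \right )^2

critic(state, action): the online Critic predicts the value of the sampled pair, Q_\phi \left ( s,a \right )

F.mse_loss: minimizes the mean squared error between the predicted Q and the target y_t, so the Critic regresses toward the bootstrapped TD target (an estimate of the long-term return, not the true return itself)

Backpropagation updates only the online Critic parameters \phi; the target was computed under torch.no_grad(), so no gradient reaches the target networks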

Step 3: compute the Actor loss L_{actor}

new_action = self.actor(state)
actor_loss = -self.critic(state, new_action).mean()

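Corresponding formula:

L_{actor} = -\frac{1}{N}\sum_{i=1}^{N} Q_\phi \left ( s_i,\mu _\theta(s_i) \right )

actor(state): the online Actor outputs the current policy's action \mu _\theta(s)

critic(state, new_action): feeding that action to the online Critic gives the value Q_\phi(s,\mu_\theta(s))

Negate and average: gradient descent on -Q is equivalent to maximizing the expected Q, which matches the deterministic policy gradient (DPG) theorem

Only the online Actor parameters \theta are updated, because only actor_optimizer.step() is called. backward() also leaves gradients on the Critic parameters, but critic_optimizer.zero_grad() clears them before the next Critic update, so the Critic weights are not changed in this step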
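To see why minimizing -Q(s, \mu_\theta(s)) moves the Actor toward better actions, here is a toy sketch with hand-written derivatives instead of autograd. Everything in it is an assumption made for illustration: a one-dimensional state and action, a known critic Q(s, a) = -(a - 2s)^2, and a linear actor \mu_\theta(s) = \theta s.

```python
# Toy deterministic policy gradient with hand-written derivatives (illustrative only).
# Assumed setup, not taken from the DDPG code: Q(s, a) = -(a - 2s)^2, mu_theta(s) = theta * s.


def q_value(s, a):
    """Assumed critic: the best action for state s is a = 2s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Derivative of the assumed critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def actor_gradient(theta, states):
    """Mean over states of dQ/da * dmu/dtheta, evaluated at a = mu_theta(s)."""
    grads = [dq_da(s, theta * s) * s for s in states]
    return sum(grads) / len(grads)


theta = 0.0
states = [0.5, 1.0, 1.5]
for _ in range(200):
    # Gradient ascent on Q, i.e. gradient descent on the actor loss -Q
    theta += 0.1 * actor_gradient(theta, states)

mean_q = sum(q_value(s, theta * s) for s in states) / len(states)
print(theta, mean_q)
```

The chain rule dQ/da · d\mu/d\theta is exactly what autograd computes when `actor_loss.backward()` flows through the Critic into the Actor. In this toy setup \theta converges to 2, where the assumed Q is maximal.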

Step 4: soft-update the target networks

self._soft_update(self.actor_target, self.actor)
self._soft_update(self.critic_target, self.critic)

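Corresponding formula:

\theta' \leftarrow \tau \theta + (1-\tau)\theta', \quad \phi' \leftarrow \tau \phi + (1-\tau)\phi'

where \theta', \phi' are the target Actor/Critic parameters and \tau is the soft-update rate (TAU in the code).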
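The soft-update rule used in `_soft_update`, target = \tau · source + (1 - \tau) · target, can be tried on plain lists. This is a minimal sketch; the value TAU = 0.005 and the parameter values are assumptions for this example only.

```python
TAU = 0.005  # assumed soft-update rate for this example


def soft_update(target_params, source_params, tau=TAU):
    """Return the new target parameters: tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]


source = [1.0, -2.0]   # online-network parameters, held fixed here for illustration
target = [0.0, 0.0]    # target-network parameters
n_steps = 100
for _ in range(n_steps):
    target = soft_update(target, source)

# With a fixed source, the remaining gap shrinks geometrically by (1 - tau) per step.
remaining = (1.0 - TAU) ** n_steps
print(target, remaining)
```

With a small \tau the target network lags well behind the online network, which is what keeps the TD target from shifting abruptly between updates.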

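9. The PPO algorithm

(1) Background: shortcomings of earlier PG methods

1) Vanilla policy gradient (REINFORCE)

Formula: \bigtriangledown _\theta J = E\left [ \bigtriangledown _\theta log \pi _\theta(a_t|s_t) \cdot G_t \right ]

Drawback: the gradient has very high variance; the return of a single trajectory fluctuates strongly, so training oscillates easily.

2) A2C (policy gradient with an advantage function)

Introduces the advantage function A_t=G_t - V(s_t), using the state value as a baseline to reduce variance.

A problem remains: the size of each policy update is not controlled. A single oversized step can move the policy far from the one that collected the data, after which training may diverge or collapse.

3) TRPO (Trust Region Policy Optimization)

Constrains the KL divergence between the new and old policies, strictly limiting the update size and preventing policy collapse.

Drawback: solving the constrained problem needs second-order information, which is computationally expensive and complex to implement.

4) PPO

PPO = Proximal Policy Optimization

Core idea: replace TRPO's KL constraint with a simple clipping of the probability ratio. This limits the policy update at low cost, balances stability with ease of implementation, and has made PPO a widely used baseline policy-gradient algorithm in practice.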

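(2) Stochastic policy \pi _\theta(a|s)

  • Discrete actions: output a Softmax probability distribution and sample an action from it
  • Continuous actions: output the mean and variance of a Gaussian distribution and sample a continuous action

Essential difference from DDPG's deterministic policy \mu(s): PPO explores through its own sampling, so no manually added noise is needed.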
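For the discrete case, the "Softmax, then sample" step can be sketched with the standard library alone. The logits below are made up, and `softmax` and `sample_action` are illustrative helpers rather than functions from the example code.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)                    # fixed seed so the run is repeatable
probs = softmax([1.0, 2.0, 0.5, 0.1])     # made-up logits for 4 discrete actions
counts = [0] * len(probs)
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1
print(probs, counts)
```

Every action keeps a non-zero chance of being chosen, which is where the built-in exploration comes from. A deterministic policy would pick the same action every time.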

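(3) Advantage function A_t

A_t=G_t - V_\phi (s_t)

G_t: the discounted return from step t onward along the current trajectory

V_\phi (s_t): the value network's prediction of the expected long-term return from the current state

Meaning: how much better the chosen action is than the policy's average behavior in that state

A_t>0: raise the probability of this action; A_t<0: lower it.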
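The PPO example code later in this section estimates the advantage with GAE rather than with G_t - V(s_t) directly. The sketch below, in pure Python with made-up rewards and critic values, suggests how the two relate. It assumes the episode ends after the last step, so the value after it is 0. Under that assumption, GAE with \lambda = 1 reproduces G_t - V(s_t), and \lambda = 0 gives the one-step TD error.

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + ... computed backwards over one episode."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def gae(rewards, values, gamma, lam):
    """values[t] = V(s_t); the value after the last step is taken as 0 (episode ended)."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9]   # made-up critic predictions
gamma = 0.99

mc_adv = [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
gae_adv = gae(rewards, values, gamma, lam=1.0)
td_adv = gae(rewards, values, gamma, lam=0.0)
print(mc_adv, gae_adv, td_adv)
```

Values of \lambda between 0 and 1 trade the low bias of the Monte Carlo estimate against the low variance of the one-step estimate.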

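(4) Probability ratio between new and old policies r_t(\theta)

r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}

\pi_{\theta_{old}}: the old policy frozen at the time the trajectory was collected; it never receives gradient updates

r_t>1: the new policy favors the current action more than the old policy did; r_t<1: the new policy favors it less

PPO clips this ratio to control how far a single update can move the policy.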
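In practice the ratio is usually computed from log-probabilities, as exp(log \pi_\theta - log \pi_{\theta_{old}}), which is also how the example code in this section does it. A minimal sketch with made-up probabilities follows; `prob_ratio` is an illustrative name.

```python
import math


def prob_ratio(logp_new, logp_old):
    """r = pi_new(a|s) / pi_old(a|s), computed as exp(log pi_new - log pi_old)."""
    return math.exp(logp_new - logp_old)


# Made-up probabilities of the sampled action under the old and the new policy
p_old, p_new = 0.25, 0.30
r = prob_ratio(math.log(p_new), math.log(p_old))
print(r)
```

Working in log space avoids dividing two very small probabilities. Before the first gradient step the new and old policies are identical, so the ratio starts at exactly 1.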

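(5) PPO clipped objective

The core of PPO is a clipped surrogate objective. Its point is less to "bound a numeric value" than to control asymmetrically where the gradient flows. The formula is:

L^{CLIP}(\theta)=E_t\left [ min(r_t(\theta)A_t, clip(r_t(\theta), 1-\epsilon, 1+\epsilon)A_t) \right ]

where \epsilon (commonly 0.1 to 0.2) defines the safe interval [1-\epsilon, 1+\epsilon] for the probability ratio r_t(\theta).

During the parameter update, A_t is treated as a frozen constant (it is computed from samples of the old policy and is not differentiated).

Which branch min selects therefore decides whether any gradient flows at all.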

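Case 1: good action, probability has risen too far

A>0, r>1+\epsilon

min selects the clipped term: (1+\epsilon)A

Gradient: \bigtriangledown _\theta L=0, because r has been replaced by a constant

Meaning: brake. Stop increasing this action's probability, so the policy cannot jump too far in one update

Case 2: bad action, probability has fallen too far

A<0, r<1-\epsilon

min selects the clipped term: (1-\epsilon)A

Gradient: \bigtriangledown _\theta L=0, because r has been replaced by a constant

Meaning: brake. Stop decreasing this action's probability, so the sample is not over-penalized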

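Case 3: bad action, but its probability has shot up

A<0, r>1+\epsilon

min selects the unclipped term: rA

Gradient: \bigtriangledown _\theta L= A \cdot \bigtriangledown _\theta r \neq 0, so the gradient survives, and ascending it lowers r

Meaning: the right to correct is kept. The policy has made a serious mistake here (it assigned too high a probability), and min keeps the original gradient so the probability can be pulled back down.

Why not use clip(r)A alone?

With clip alone, the gradient in Case 3 would be zero: the policy has made a large error on this sample, yet the optimizer receives no signal from it and cannot correct the error through this sample.

This is the point of taking the min in PPO: it works as a one-way safety valve. Once r has left the interval, it stops the update from pushing further in the direction the advantage favors, but it never blocks a move back toward the interval.

Summary: the min in PPO is not merely a truncation of values; it is a switch on the gradient. It cuts the gradient when the ratio has already moved beyond the interval in the direction the advantage favors (Cases 1 and 2), and keeps it when the ratio has moved in the wrong direction (Case 3, and symmetrically A>0 with r<1-\epsilon). This gives a cheap approximation of a trust region.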
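The cases above can be checked numerically. The sketch below evaluates the per-sample clipped objective and its derivative with respect to r in pure Python. The function name `clipped_surrogate` and the sample values are assumptions for illustration; a real implementation would rely on autograd instead of a hand-written derivative.

```python
def clipped_surrogate(r, adv, eps=0.2):
    """Return (objective, d objective / d r) for one sample of L^CLIP."""
    unclipped = r * adv
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    clipped = r_clipped * adv
    if unclipped <= clipped:
        # min picks the unclipped term r * A, so the gradient w.r.t. r is A
        return unclipped, adv
    # min picks the clipped term; r lies outside the interval here, so the gradient is 0
    return clipped, 0.0


cases = {
    "case 1: A>0, r>1+eps": clipped_surrogate(1.5, +1.0),
    "case 2: A<0, r<1-eps": clipped_surrogate(0.5, -1.0),
    "case 3: A<0, r>1+eps": clipped_surrogate(1.5, -1.0),
    "inside the interval":  clipped_surrogate(1.1, +1.0),
}
for name, (value, grad) in cases.items():
    print(f"{name}: objective={value:+.2f}, dL/dr={grad:+.1f}")
```

Cases 1 and 2 report a zero derivative, while Case 3 keeps a derivative equal to A. That is the one-way behavior described above.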

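(6) Full form of the loss function

Matching the total loss computed in the example code (value_coef = c_1, entropy_coef = c_2), the quantity being minimized is:

L(\theta,\phi) = -L^{CLIP}(\theta) + c_1 \cdot E_t\left [ \left ( V_\phi(s_t) - \hat{R}_t \right )^2 \right ] - c_2 \cdot E_t\left [ H\left ( \pi_\theta(\cdot|s_t) \right ) \right ]

where \hat{R}_t is the return target for the Critic (advantage plus value estimate) and H is the policy entropy, whose bonus encourages exploration.

(7) Example code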

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

# -------------------- Network definitions --------------------
class ActorNet(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.out = nn.Linear(h1_nodes, out_actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        logits = self.out(x)
        prob = F.softmax(logits, dim=-1)
        return prob

class CriticNet(nn.Module):
    def __init__(self, in_states, h1_nodes):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)
        self.out = nn.Linear(h1_nodes, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        value = self.out(x)
        return value

# -------------------- PPO Agent --------------------
class FrozenLakePPO:
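    # Hyperparameters (tunable)
    learning_rate_a = 0.001          # actor learning rate
    learning_rate_c = 0.001          # critic learning rate
    discount_factor_g = 0.99         # discount factor gamma
    gae_lambda = 0.95                # GAE smoothing parameter lambda
    clip_epsilon = 0.2               # clipping range epsilon
    ppo_epochs = 10                  # number of passes over each collected trajectory
    batch_size = 64                  # mini-batch size (a shorter trajectory forms one smaller batch)
    entropy_coef = 0.01              # entropy regularization coefficient
    value_coef = 0.5                 # value-loss coefficient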

    ACTIONS = ['L', 'D', 'R', 'U']

    def __init__(self):
        self.actor_optim = None
        self.critic_optim = None

    # ---------- 工具函数 ----------
    def state_to_onehot(self, state, num_states):
        """将整数状态转为 one‑hot 张量"""
        t = torch.zeros(num_states)
        t[state] = 1.0
        return t

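    def print_policy(self, actor_net):
        """Print the current policy: action probabilities and the best action for every state"""
        num_states = actor_net.fc1.in_features
        # Grid width, assuming a square map (4 for "4x4", 8 for "8x8")
        num_cols = int(round(num_states ** 0.5))
        for s in range(num_states):
            with torch.no_grad():
                prob = actor_net(self.state_to_onehot(s, num_states)).tolist()
            prob_str = ' '.join(f"{p:.2f}" for p in prob)
            best_act = self.ACTIONS[np.argmax(prob)]
            print(f"{s:02},{best_act},[{prob_str}]", end=" ")
            # Start a new line at the end of every grid row
            if (s + 1) % num_cols == 0:
                print()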

    # ---------- 训练(PPO) ----------
    def train(self, episodes, render=False, is_slippery=False):
        env = gym.make(
            "FrozenLake-v1",
            map_name="8x8",
            is_slippery=is_slippery,
            render_mode="human" if render else None
        )
        num_states = env.observation_space.n
        num_actions = env.action_space.n

        # 初始化网络
        actor_net = ActorNet(num_states, num_states, num_actions)
        critic_net = CriticNet(num_states, num_states)
        self.actor_optim = torch.optim.Adam(actor_net.parameters(), lr=self.learning_rate_a)
        self.critic_optim = torch.optim.Adam(critic_net.parameters(), lr=self.learning_rate_c)

        print("Random initial policy:")
        self.print_policy(actor_net)

        rewards_per_episode = np.zeros(episodes)

        for ep in range(episodes):
            state, _ = env.reset()
            terminated = False
            truncated = False
            episode_reward = 0

            # ---------- 1. 收集一条完整轨迹 ----------
            # 存储每个时间步的数据
            states = []          # 状态(one‑hot 张量)
            actions = []         # 动作(整数)
            rewards = []         # 奖励(浮点数)
            dones = []           # 是否终止(布尔)
            log_probs_old = []   # 旧策略下的对数概率
            values = []          # 价值网络估计

            while not terminated and not truncated:
                s_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    prob = actor_net(s_tensor)
                    value = critic_net(s_tensor)
                # 采样动作
                dist = Categorical(prob)
                action = dist.sample().item()
                log_prob = dist.log_prob(torch.tensor(action))

                # 环境交互
                new_state, reward, terminated, truncated, _ = env.step(action)
                episode_reward += reward

                # 存储数据
                states.append(s_tensor)
                actions.append(action)
                rewards.append(reward)
                dones.append(terminated or truncated)
                log_probs_old.append(log_prob)
                values.append(value.squeeze())

                state = new_state

            # 收集完毕,计算 GAE 优势
            # 将列表转为张量
            states = torch.stack(states)                 # [T, num_states]
            actions = torch.tensor(actions)              # [T]
            rewards = torch.tensor(rewards, dtype=torch.float32)  # [T]
            dones = torch.tensor(dones, dtype=torch.float32)      # [T]
            log_probs_old = torch.stack(log_probs_old)   # [T]
            values = torch.stack(values)                 # [T]

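            # Compute the advantage A_t at every time step with GAE
            advantages = torch.zeros_like(rewards)
            gae = 0.0
            # Iterate backwards from the last step
            for t in reversed(range(len(rewards))):
                if t == len(rewards) - 1:
                    # No bootstrap after the final step (exact on termination,
                    # an approximation when the episode was cut off by the time limit)
                    next_value = 0.0
                else:
                    next_value = values[t+1]
                delta = rewards[t] + self.discount_factor_g * next_value * (1 - dones[t]) - values[t]
                gae = delta + self.discount_factor_g * self.gae_lambda * (1 - dones[t]) * gae
                advantages[t] = gae

            # Return target = advantage + value (used to train the Critic)
            returns = advantages + values.detach()

            # (Optional) normalize advantages; skip single-step trajectories, where std() would be NaN
            if len(advantages) > 1:
                advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)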

            # 准备数据
            data_size = len(states)
            indices = np.arange(data_size)

            # ---------- 2. 多次更新(PPO 核心) ----------
            for _ in range(self.ppo_epochs):
                np.random.shuffle(indices)
                # 按 batch_size 分批
                for start in range(0, data_size, self.batch_size):
                    end = start + self.batch_size
                    batch_idx = indices[start:end]

                    # 取出 batch 数据
                    batch_states = states[batch_idx]
                    batch_actions = actions[batch_idx]
                    batch_log_probs_old = log_probs_old[batch_idx]
                    batch_advantages = advantages[batch_idx]
                    batch_returns = returns[batch_idx]

                    # 计算当前策略下的概率和对数概率
                    probs = actor_net(batch_states)
                    dist = Categorical(probs)
                    log_probs_new = dist.log_prob(batch_actions)
                    entropy = dist.entropy().mean()

                    # 概率比率
                    ratio = torch.exp(log_probs_new - batch_log_probs_old.detach())

                    # PPO 裁剪目标
                    surr1 = ratio * batch_advantages
                    surr2 = torch.clamp(ratio, 1.0 - self.clip_epsilon, 1.0 + self.clip_epsilon) * batch_advantages
                    actor_loss = -torch.min(surr1, surr2).mean()

                    # Value loss (MSE); squeeze(-1) keeps shape [B] even when the mini-batch holds a single sample
                    values_pred = critic_net(batch_states).squeeze(-1)
                    critic_loss = F.mse_loss(values_pred, batch_returns)

                    # 总损失
                    loss = actor_loss + self.value_coef * critic_loss - self.entropy_coef * entropy

                    # 更新网络
                    self.actor_optim.zero_grad()
                    self.critic_optim.zero_grad()
                    loss.backward()
                    self.actor_optim.step()
                    self.critic_optim.step()

            # 记录成功 episode
            if episode_reward == 1:
                rewards_per_episode[ep] = 1

            # 打印进度
            if (ep + 1) % 200 == 0:
                avg_100 = np.sum(rewards_per_episode[max(0, ep - 99):ep + 1]) / 100
                print(f"Episode {ep + 1}, last 100 avg reward: {avg_100:.2f}")

        env.close()
        torch.save(actor_net.state_dict(), "ppo_actor_frozenlake.pt")
        torch.save(critic_net.state_dict(), "ppo_critic_frozenlake.pt")
        print("Training finished, models saved.")

    # ---------- Testing (greedy action selection) ----------
    def test(self, episodes, is_slippery=False):
        env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=is_slippery, render_mode="human")
        num_states = env.observation_space.n
        num_actions = env.action_space.n
        actor_net = ActorNet(num_states, num_states, num_actions)
        actor_net.load_state_dict(torch.load("ppo_actor_frozenlake.pt"))
        actor_net.eval()

        print("\nTrained Policy:")
        self.print_policy(actor_net)

        for _ in range(episodes):
            state, _ = env.reset()
            terminated = False
            truncated = False
            while not terminated and not truncated:
                s_tensor = self.state_to_onehot(state, num_states)
                with torch.no_grad():
                    prob = actor_net(s_tensor)
                action = torch.argmax(prob).item()
                state, reward, terminated, truncated, _ = env.step(action)
        env.close()


if __name__ == "__main__":
    agent = FrozenLakePPO()
    slippery = False   # 若想测试滑冰环境可改为 True
    # 训练(取消注释以下行)
    # agent.train(2000, is_slippery=slippery)
    # 测试
    agent.test(10, is_slippery=slippery)
