DQN: An In-Depth Guide

Keyword Overview

Core concepts: experience replay, target network, deep neural networks, overestimation
Double DQN, Dueling DQN, PER, Rainbow, gradient clipping

Core Keyword Table

Term | Symbol / Technique | Description
Deep Q-Network | DQN | Combines deep learning with Q-learning
Experience Replay | Replay Buffer | Mechanism for storing and replaying transitions
Target Network | — | Delayed-update copy of the Q-network used for TD targets
Double DQN | DDQN | Mitigates Q-value overestimation
Dueling DQN | — | Separates state value from advantage
Prioritized Experience Replay | PER / SumTree | Sampling priority based on TD error
Rainbow | Rainbow DQN | Integration of multiple DQN improvements
Gradient Clipping | $\|\nabla\|$ | Bounds the gradient norm to prevent exploding gradients
Target Q-value | — | The target value in TD learning
Greedy Policy | — | Selects the action with the highest Q-value

1. Deep Q-Network (DQN) Fundamentals

The Deep Q-Network (DQN) is a landmark in deep reinforcement learning. It was first proposed by the DeepMind team in 2013 and refined in the 2015 Nature paper. DQN's core innovation is to approximate the action-value function $Q(s, a; \theta)$ with a deep neural network, which allows reinforcement learning to handle high-dimensional, continuous input spaces (such as raw image pixels) and to reach human-level performance on Atari games.

1.1 From Tabular Q-Learning to Deep Q-Learning

Traditional Q-learning stores Q-values in a table, which becomes infeasible when the state space is continuous or very large. DQN instead uses a deep neural network as a function approximator:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

where $\theta$ are the parameters of the neural network. The network maps a state (or state-action pair) directly to a Q-value prediction.

1.2 Core Challenges and Solutions

Combining deep neural networks with reinforcement learning faces three core challenges:

DQN's Three Challenges and Their Solutions

  1. Non-stationary data: the training data comes from a constantly changing policy
    • Solution: experience replay
  2. Correlated samples: consecutive samples are highly correlated, which increases variance
    • Solution: experience replay breaks the temporal correlations
  3. Unstable targets: the Q-value target shifts as the network itself is updated
    • Solution: a target network

1.3 The DQN Loss Function

The DQN loss function is based on the temporal-difference error (TD error):

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

where:

  • $D$ is the experience replay buffer
  • $Q(s', a'; \theta^-)$ is the target network
  • $\theta^-$ are the target-network parameters (periodically copied from $\theta$)

1.4 A Complete DQN Algorithm

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym
from collections import deque
import random
 
class DQN(nn.Module):
    """
    Deep Q-Network architecture for Atari-like games.
    """
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()
        
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )
        
        # Compute the flattened size of the convolutional output
        conv_out_size = self._get_conv_out(input_shape)
        
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
    
    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))
    
    def forward(self, x):
        conv_out = self.conv(x)
        return self.fc(conv_out)
 
class ReplayBuffer:
    """Experience Replay Buffer."""
    
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            np.array(states),
            np.array(actions),
            np.array(rewards),
            np.array(next_states),
            np.array(dones)
        )
    
    def __len__(self):
        return len(self.buffer)
 
class DQNAgent:
    """DQN Agent with target network and experience replay."""
    
    def __init__(self, input_shape, n_actions, learning_rate=0.00025,
                 gamma=0.99, epsilon=1.0, epsilon_min=0.01,
                 epsilon_decay=0.995, batch_size=32, replay_capacity=100000,
                 target_update_freq=10000):
        
        self.n_actions = n_actions
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.train_step = 0
        
        # Policy (online) network and target network
        self.policy_net = DQN(input_shape, n_actions)
        self.target_net = DQN(input_shape, n_actions)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        self.target_net.eval()
        
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=learning_rate)
        self.memory = ReplayBuffer(replay_capacity)
    
    def select_action(self, state, training=True):
        """Epsilon-greedy action selection."""
        if training and random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.policy_net(state_tensor)
            return q_values.argmax().item()
    
    def store_transition(self, state, action, reward, next_state, done):
        self.memory.push(state, action, reward, next_state, done)
    
    def update(self):
        """Perform one gradient update step."""
        if len(self.memory) < self.batch_size:
            return
        
        # Sample a batch from the replay buffer
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)
        
        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)
        
        # Current Q-values for the actions actually taken
        q_values = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Target Q-values (computed with the target network)
        with torch.no_grad():
            next_q_values = self.target_net(next_states).max(1)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)
        
        # Compute the loss
        loss = nn.MSELoss()(q_values, target_q_values)
        
        # Backpropagation
        self.optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1.0)
        
        self.optimizer.step()
        self.train_step += 1
        
        # Periodically sync the target network
        if self.train_step % self.target_update_freq == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        
        return loss.item()

2. Experience Replay

Experience replay is one of DQN's key components; it greatly improves sample efficiency and data utilization.

2.1 How It Works

The replay buffer stores the agent's transitions $(s_t, a_t, r_t, s_{t+1}, d_t)$; at training time, mini-batches are drawn at random, which breaks up the correlations between consecutive samples. The basic uniform buffer was shown in Section 1.4; the prioritized variant used in Section 4.3 is sketched below.

class PrioritizedReplayBuffer:
    """Prioritized Experience Replay (PER) implementation (array-based; a SumTree is used at scale)."""
    
    def __init__(self, capacity=100000, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha  # priority exponent: 0 = uniform sampling, 1 = fully prioritized
        self.beta = beta    # importance-sampling exponent (annealed towards 1 in practice)
        self.buffer = []    # stored transitions
        self.pos = 0
        self.priorities = np.zeros(capacity, dtype=np.float32)
    
    def push(self, state, action, reward, next_state, done):
        """Add a transition; new samples receive the current maximum priority."""
        max_priority = self.priorities.max() if self.buffer else 1.0
        transition = (state, action, reward, next_state, done)
        
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        
        self.priorities[self.pos] = max_priority
        self.pos = (self.pos + 1) % self.capacity
    
    def sample(self, batch_size):
        """Sample a batch with probability proportional to priority^alpha."""
        # Sampling probabilities
        priorities = self.priorities[:len(self.buffer)]
        probs = priorities ** self.alpha
        probs /= probs.sum()
        
        # Weighted sampling without replacement
        indices = np.random.choice(len(self.buffer), batch_size, p=probs, replace=False)
        
        # Importance-sampling weights correct the bias introduced by prioritization
        weights = (len(self.buffer) * probs[indices]) ** (-self.beta)
        weights /= weights.max()
        
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones), indices, weights)
    
    def update_priorities(self, indices, td_errors, eps=1e-6):
        """Refresh the priorities of sampled transitions with their new absolute TD errors."""
        for idx, td in zip(indices, td_errors):
            self.priorities[idx] = abs(float(td)) + eps

2.2 Benefits of Experience Replay

  1. Breaks temporal correlations: random sampling makes samples approximately i.i.d.
  2. Improves sample efficiency: each transition can be reused many times
  3. Stabilizes training: reduces sharp shifts in the data distribution

Tuning Experience Replay

  • Buffer size: typically 100K to 1M transitions
  • Batch size: 32 to 128; larger batches tend to be more stable if GPU memory allows
  • Warm-up: fill the buffer with a minimum number of transitions before training starts (see the sketch below)
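
A minimal warm-up sketch, assuming the DQNAgent and Gym env from the sections above and omitting frame preprocessing for brevity; warmup_steps is an illustrative value:

# Warm-up: fill the replay buffer with random-policy transitions before learning begins
warmup_steps = 50000  # illustrative value
state = env.reset()
for _ in range(warmup_steps):
    action = env.action_space.sample()            # uniform random action
    next_state, reward, done, _ = env.step(action)
    agent.store_transition(state, action, reward, next_state, done)
    state = env.reset() if done else next_state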

3. Target Network

The target network addresses the instability of the TD target and is key to DQN's convergence.

3.1 The Root of the Problem

In standard Q-learning, the Q-network is both the estimator and the source of its own target:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta)$$

Whenever the parameters $\theta$ are updated, the target $y$ shifts as well, creating a "moving target" problem.

3.2 The Solution

A delayed copy of the network, the target network $Q(s, a; \theta^-)$, is used to hold the TD target fixed:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

The target-network parameters $\theta^-$ are copied from the policy network every $C$ steps.

3.3 Soft vs. Hard Updates

# Hard update (standard DQN)
if self.train_step % self.target_update_freq == 0:
    self.target_net.load_state_dict(self.policy_net.state_dict())
 
# Soft update (Polyak averaging, as popularized by DDPG-style methods)
tau = 0.005
for target_param, policy_param in zip(self.target_net.parameters(), 
                                        self.policy_net.parameters()):
    target_param.data.copy_(tau * policy_param.data + 
                           (1 - tau) * target_param.data)

4. DQN Variants in Detail

4.1 Double DQN (DDQN)

The problem: standard DQN overestimates Q-values, because the max operator tends to pick actions whose values are already overestimated.

The solution: decouple action selection from value evaluation using the two existing networks:

$$y^{DDQN} = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\right)$$

def double_dqn_update(self, batch):
    """Double DQN update rule."""
    states, actions, rewards, next_states, dones = batch
    
    # Select actions with the policy network
    with torch.no_grad():
        next_actions = self.policy_net(next_states).argmax(1)
        # Evaluate them with the target network
        next_q_values = self.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
    
    target_q = rewards + self.gamma * next_q_values * (1 - dones)
    
    # Update the policy network
    current_q = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.MSELoss()(current_q, target_q)
    
    self.optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 10)
    self.optimizer.step()

4.2 Dueling DQN

Core idea: separate the state value $V(s)$ and the advantage function $A(s, a)$, and recombine them as $Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$:

class DuelingDQN(nn.Module):
    """Dueling Network architecture."""
    
    def __init__(self, input_shape, n_actions):
        super(DuelingDQN, self).__init__()
        
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, 8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1),
            nn.ReLU(),
            nn.Flatten()
        )
        
        conv_out = self._get_conv_out(input_shape)
        self.conv_out_size = conv_out  # stored so subclasses (e.g. RainbowDQN below) can reuse it
        
        # Value stream
        self.value_stream = nn.Sequential(
            nn.Linear(conv_out, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )
        
        # Advantage stream
        self.advantage_stream = nn.Sequential(
            nn.Linear(conv_out, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
    
    def _get_conv_out(self, shape):
        # Infer the flattened conv output size with a dummy forward pass
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))
    
    def forward(self, x):
        features = self.conv(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        
        # Q = V + (A - mean(A))
        q_values = value + (advantage - advantage.mean(dim=1, keepdim=True))
        return q_values
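
A quick shape check of the dueling head (a usage sketch, assuming an Atari-style input of 4 stacked 84x84 frames and 6 actions):

# Usage sketch: the dueling network should output one Q-value per action
net = DuelingDQN(input_shape=(4, 84, 84), n_actions=6)
dummy = torch.zeros(1, 4, 84, 84)   # a batch with one stacked observation
print(net(dummy).shape)             # expected: torch.Size([1, 6])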

4.3 Prioritized Experience Replay (PER)

PER assigns sampling priorities according to the magnitude of the TD error: transition $i$ is sampled with probability $P(i) = p_i^{\alpha} / \sum_k p_k^{\alpha}$, where $p_i = |\delta_i| + \epsilon$, and the resulting bias is corrected with importance-sampling weights $w_i = \left(N \cdot P(i)\right)^{-\beta}$ (normalized by their maximum).
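
A sketch of one PER training step using the buffer from Section 2.1, showing how the importance-sampling weights enter the loss and how priorities are refreshed afterwards; policy_net, target_net, and optimizer are assumed to come from a DQNAgent-style setup as in Section 1.4:

def per_update(per_buffer, policy_net, target_net, optimizer, gamma=0.99, batch_size=32):
    """One gradient step with prioritized sampling and importance-sampling weights (sketch)."""
    states, actions, rewards, next_states, dones, indices, weights = per_buffer.sample(batch_size)

    states, next_states = torch.FloatTensor(states), torch.FloatTensor(next_states)
    actions, rewards = torch.LongTensor(actions), torch.FloatTensor(rewards)
    dones, weights = torch.FloatTensor(dones), torch.FloatTensor(weights)

    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(1)[0]
        targets = rewards + gamma * next_q * (1 - dones)

    td_errors = targets - q_values
    loss = (weights * td_errors.pow(2)).mean()    # importance-sampling-weighted MSE

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refresh the sampled transitions' priorities with the new |TD error|
    per_buffer.update_priorities(indices, td_errors.detach().abs().numpy())
    return loss.item()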

4.4 NoisyNet

NoisyNet replaces ε-greedy exploration with learnable noise parameters in the linear layers: each noisy layer computes $y = (\mu^{w} + \sigma^{w} \odot \varepsilon^{w})\,x + (\mu^{b} + \sigma^{b} \odot \varepsilon^{b})$, where $\mu$ and $\sigma$ are learned and $\varepsilon$ is freshly sampled noise.
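
A minimal NoisyLinear sketch using independent Gaussian noise (the original paper also proposes a factorised-noise variant); in a NoisyNet DQN it would replace the nn.Linear layers of the fully connected head:

class NoisyLinear(nn.Module):
    """Linear layer with learnable noise scales on weights and biases (sketch)."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.empty(out_features).uniform_(-0.1, 0.1))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        if self.training:
            # Fresh noise every forward pass: exploration comes from the learned noise scales
            weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
            bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        else:
            weight, bias = self.weight_mu, self.bias_mu
        return nn.functional.linear(x, weight, bias)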


5. The Rainbow Combination

Rainbow (DeepMind, 2017) integrates seven techniques from the DQN line of work:

Rainbow's Seven Components

Technique | Problem addressed | Reported gain
Double DQN | Q-value overestimation | +11%
Dueling DQN | value decomposition | +9%
Prioritized Replay | sample efficiency | +28%
NoisyNet | exploration efficiency | +8%
Distributional RL | value estimation | +33%
N-step Returns | bias-variance trade-off | +12%
Experience Replay | data correlation | stability

A sketch of a Rainbow-style network head combining the dueling and distributional streams follows:

class RainbowDQN(DuelingDQN):
    """Rainbow DQN combining all improvements."""
    
    def __init__(self, input_shape, n_actions, n_atoms=51, v_min=-10, v_max=10):
        super().__init__(input_shape, n_actions)
        
        self.n_actions = n_actions
        self.n_atoms = n_atoms
        self.v_min = v_min
        self.v_max = v_max
        self.delta_z = (v_max - v_min) / (n_atoms - 1)
        self.z_atoms = torch.linspace(v_min, v_max, n_atoms)
        
        # Distributional stream (operates on the flattened conv features)
        self.dist_stream = nn.Sequential(
            nn.Linear(self.conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions * n_atoms)
        )
    
    def forward(self, x):
        features = self.conv(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        
        # Dueling
        q_base = value + (advantage - advantage.mean(dim=1, keepdim=True))
        
        # Distributional
        dist = self.dist_stream(features).view(-1, self.n_actions, self.n_atoms)
        dist = torch.softmax(dist, dim=2)
        
        return q_base, dist, self.z_atoms
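
Rainbow also uses N-step returns. A minimal sketch of computing an n-step target from a short trajectory; rewards and dones are lists covering the n-step window, and bootstrap_value is the target network's value at the end of the window:

def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """N-step return: discounted reward sum plus a discounted bootstrap value (sketch)."""
    target, discount = 0.0, 1.0
    for r, d in zip(rewards, dones):
        target += discount * r
        if d:                       # the episode ended inside the window: no bootstrap
            return target
        discount *= gamma
    return target + discount * bootstrap_value

# Example: 3-step return with rewards [1, 0, 2], no terminations, bootstrap value 5.0
# -> 1 + 0.99*0 + 0.99**2 * 2 + 0.99**3 * 5.0 ≈ 7.81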

6. Implementation and Tuning Tips

6.1 A Complete Training Loop

def train_dqn(env_name='Breakout-v0', num_frames=50000000, batch_size=32):
    """Complete DQN training procedure."""
    env = gym.make(env_name)
    
    # Preprocessing: grayscale, downscale to 84x84, stack 4 frames
    input_shape = (4, 84, 84)
    agent = DQNAgent(
        input_shape=input_shape,
        n_actions=env.action_space.n,
        learning_rate=0.00025,
        gamma=0.99,
        target_update_freq=10000
    )
    
    episode_rewards = []
    episode_reward = 0.0
    state = preprocess(env.reset())  # custom preprocessing function (not shown)
    
    for frame in range(num_frames):
        # Select an action (epsilon-greedy)
        action = agent.select_action(state)
        
        # Step the environment
        next_state, reward, done, _ = env.step(action)
        next_state = preprocess(next_state)
        
        # Store the transition
        agent.store_transition(state, action, reward, next_state, done)
        state = next_state
        episode_reward += reward
        
        # Gradient update once the buffer holds at least one batch
        if len(agent.memory) >= batch_size:
            agent.update()
        
        # Episode bookkeeping
        if done:
            episode_rewards.append(episode_reward)
            episode_reward = 0.0
            state = preprocess(env.reset())
        
        # Periodic evaluation
        if frame % 250000 == 0:
            evaluate_agent(agent, env)
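
evaluate_agent is referenced above but not defined; a minimal greedy-evaluation sketch (assuming the same preprocess helper) might look like this:

def evaluate_agent(agent, env, n_episodes=5):
    """Run a few episodes with the greedy policy (no exploration) and report the mean return."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = preprocess(env.reset()), False, 0.0
        while not done:
            action = agent.select_action(state, training=False)  # greedy action
            next_state, reward, done, _ = env.step(action)
            state = preprocess(next_state)
            total += reward
        returns.append(total)
    print(f"Evaluation: mean return over {n_episodes} episodes = {np.mean(returns):.1f}")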

6.2 Tuning Guide

DQN Hyperparameter Tuning

Parameter | Recommended value | Notes
Learning rate | 0.00025 | with the Adam optimizer
Discount factor γ | 0.99 | higher values favor long-horizon returns
Batch size | 32 | can be increased if GPU memory allows
Target network update frequency | 10,000 steps | or soft updates with tau = 0.005
Replay buffer size | 1,000,000 | adjust to available memory
Initial ε | 1.0 | start fully random
ε decay | anneal to 0.1 over 1M frames | linear decay (see the helper below)
Gradient clipping | 10.0 | prevents exploding gradients
Training frames | 50M+ | Atari requires a very large amount of training
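
The agent in Section 1.4 uses multiplicative epsilon decay; the linear schedule recommended in the table can be implemented with a small helper (a sketch; frame is the current environment step):

def linear_epsilon(frame, eps_start=1.0, eps_final=0.1, decay_frames=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_final over decay_frames steps."""
    fraction = min(frame / decay_frames, 1.0)
    return eps_start + fraction * (eps_final - eps_start)

# e.g. linear_epsilon(0) == 1.0, linear_epsilon(500_000) == 0.55, linear_epsilon(2_000_000) == 0.1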

6.3 Common Problems and Troubleshooting

Problem | Cause | Fix
NaN Q-values | exploding gradients / abnormal data | clip gradients, check reward scaling (see the snippet below)
Performance degradation | target network updated too frequently | increase the update interval
Insufficient exploration | ε decays too quickly | lengthen the decay schedule
Out of GPU memory | batch too large / network too deep | reduce the batch size, simplify the network
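
On reward scaling: the Nature DQN setup clips rewards to [-1, 1]; a short sketch for the environment step loop (clipped_reward is an illustrative name):

# Clip rewards to [-1, 1], as in the Nature DQN setup, to keep TD targets well-scaled
clipped_reward = float(np.clip(reward, -1.0, 1.0))
agent.store_transition(state, action, clipped_reward, next_state, done)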

7. Mathematical Summary

DQN loss function:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(y - Q(s, a; \theta)\right)^2\right]$$

where the target is $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$.

Double DQN target:

$$y^{DDQN} = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^-\right)$$

Dueling Q-value decomposition:

$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$$

References

  1. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
  2. Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.
  3. Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI, 30(1).
  4. Wang, Z., et al. (2016). Dueling network architectures for deep reinforcement learning. ICML, 1995-2003.
  5. Schaul, T., et al. (2016). Prioritized experience replay. ICLR.
  6. Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI.

DQN ushered in the era of deep reinforcement learning, and many subsequent algorithms build on its core ideas.