Keywords
| Term | Core Concept |
|---|---|
| Adversarial Training | Improving model robustness by training on adversarial examples |
| Min-Max Optimization | The mathematical framework behind adversarial training |
| PGD Adversarial Training | Adversarial training based on projected gradient descent |
| Robustness | A model's invariance to input perturbations |
| Certified Defense | Defense methods with theoretical guarantees |
| Randomized Smoothing | A randomization-based certification method |
| Defensive Distillation | A training method that reduces gradient sensitivity |
| Certified Bound | A theoretical guarantee of robustness |
| Adversarial Regularization | Using adversarial examples as a regularization term |
| Transfer Attack | Generating adversarial examples with a surrogate model |
1. Introduction: Origins and Significance of Adversarial Training
Adversarial training is one of the core defenses against adversarial examples. Its basic idea is direct yet profound: since neural networks can be fooled by adversarial examples, let the model "see" such examples during training so that it learns to resist them. This "fight fire with fire" training paradigm originates from the seminal work of Madry et al. (first posted in 2017, published at ICLR 2018) and quickly became the mainstream approach in adversarial defense research.
Why adversarial training matters
Adversarial training is not only one of the most effective adversarial defenses available today; it is also a theoretical framework for understanding the robustness of deep learning from a game-theoretic perspective. It formalizes adversarial learning as a min-max optimization problem, providing mathematical tools for understanding the decision boundaries of neural networks.
2. The Principle of Adversarial Training: The Min-Max Framework
2.1 From Intuition to Mathematics
The core of adversarial training is to formalize robust optimization as a two-player zero-sum min-max problem:

$$\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_{\infty}\le\epsilon} L\big(f_{\theta}(x+\delta),\,y\big)\right]$$

where:
- $\theta$: model parameters
- $\mathcal{D}$: training data distribution
- $L$: loss function (e.g., cross-entropy)
- $\epsilon$: perturbation budget
- $\max_{\|\delta\|_{\infty}\le\epsilon}$: the inner maximization problem (the attacker)
- $\min_{\theta}$: the outer minimization problem (the defender)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdversarialTraining:
    """
    Adversarial training framework.
    Min-max optimization:
    - Inner loop (maximization): find the most effective adversarial perturbation
    - Outer loop (minimization): train the model to minimize the adversarial loss
    """
    def __init__(self, model, epsilon=0.03, alpha=0.01, num_iter=7):
        self.model = model
        self.epsilon = epsilon    # L-infinity perturbation budget
        self.alpha = alpha        # per-step attack step size
        self.num_iter = num_iter  # number of PGD steps

    def pgd_attack(self, images, labels, targeted=False, target_labels=None):
        """
        PGD attack: generate adversarial examples.
        This solves the inner maximization problem of adversarial training.
        """
        original_images = images.detach().clone()
        # Random initialization inside the allowed perturbation ball
        images = images.detach() + torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        images = torch.clamp(images, 0, 1)
        for _ in range(self.num_iter):
            images.requires_grad = True
            outputs = self.model(images)
            if targeted:
                loss = -F.cross_entropy(outputs, target_labels)
            else:
                loss = F.cross_entropy(outputs, labels)
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                # Gradient-ascent step on the input
                images = images + self.alpha * torch.sign(images.grad)
                # Project back onto the epsilon-ball and the valid pixel range
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        return images.detach()

    def train_step(self, images, labels, optimizer):
        """
        A single adversarial-training step:
        1. Generate adversarial examples (inner maximization)
        2. Train the model on them (outer minimization)
        """
        # Generate adversarial examples
        adversarial_images = self.pgd_attack(images, labels)
        # Train on a mix of clean and adversarial samples
        optimizer.zero_grad()
        # Clean-sample loss
        clean_outputs = self.model(images)
        clean_loss = F.cross_entropy(clean_outputs, labels)
        # Adversarial-sample loss
        adv_outputs = self.model(adversarial_images)
        adv_loss = F.cross_entropy(adv_outputs, labels)
        # Total loss
        total_loss = clean_loss + adv_loss
        total_loss.backward()
        optimizer.step()
        return {
            'clean_loss': clean_loss.item(),
            'adv_loss': adv_loss.item(),
            'total_loss': total_loss.item()
        }
```
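A minimal smoke test for the class above; the toy linear model, batch shapes, and hyperparameters are illustrative placeholders, not values from any paper:

```python
# Hypothetical smoke test for AdversarialTraining (model and data are toy placeholders)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
trainer = AdversarialTraining(model, epsilon=8/255, alpha=2/255, num_iter=7)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images = torch.rand(16, 3, 32, 32)   # fake CIFAR-10-like batch in [0, 1]
labels = torch.randint(0, 10, (16,))
stats = trainer.train_step(images, labels, optimizer)
print(stats)  # {'clean_loss': ..., 'adv_loss': ..., 'total_loss': ...}
```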
2.2 Robustness vs. Accuracy: The Training Trade-off
Adversarial training faces an important trade-off: there is a tension between robustness and standard accuracy.
Standard training: optimizes accuracy on clean data
Adversarial training: optimizes accuracy on adversarial data
Theoretical analysis (Tsipras et al., 2019):
- There exist datasets on which any robust model must have standard accuracy below a certain bound
- Reason: adversarial perturbations can change the "semantics" the model relies on while remaining imperceptible to humans
```python
def analyze_robustness_accuracy_tradeoff():
    """
    Analyze the trade-off between robustness and standard accuracy.
    Theoretical background:
    - Adversarial examples exploit the model's linear sensitivity in high-dimensional space
    - A robust model must be "smooth" near its decision boundary
    - This may sacrifice some capacity to fit clean data
    """
    # Rough empirical observations (CIFAR-10):
    # Standard training:    clean acc ~ 95%, PGD robust acc ~ 0%
    # Adversarial training: clean acc ~ 85%, PGD robust acc ~ 50%
    tradeoffs = {
        'standard_training': {
            'clean_accuracy': 0.95,
            'robust_accuracy_pgd20': 0.0,
            'robust_accuracy_pgd100': 0.0
        },
        'adversarial_training': {
            'clean_accuracy': 0.85,
            'robust_accuracy_pgd20': 0.50,
            'robust_accuracy_pgd100': 0.48
        },
        'TRADES_regularization': {
            'clean_accuracy': 0.90,
            'robust_accuracy_pgd20': 0.55,
            'robust_accuracy_pgd100': 0.52
        },
        'MART_regularization': {
            'clean_accuracy': 0.88,
            'robust_accuracy_pgd20': 0.53,
            'robust_accuracy_pgd100': 0.50
        }
    }
    return tradeoffs
```
3. PGD Adversarial Training in Detail
3.1 A Complete PGD-Training Implementation
Projected Gradient Descent (PGD) adversarial training is currently the most widely used adversarial training method:
```python
class PGDTraining:
    """
    PGD adversarial training.
    Steps:
    1. Generate adversarial examples with PGD
    2. Compute gradients on them and update the model
    3. Repeat until convergence
    """
    def __init__(self, model, epsilon=8/255, alpha=2/255, num_iter=7,
                 lr=0.01, weight_decay=5e-4):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.device = next(model.parameters()).device
        # Optimizer
        self.optimizer = torch.optim.SGD(
            model.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=weight_decay
        )
        # Learning-rate schedule (for the common 110-epoch recipe)
        self.scheduler = torch.optim.lr_scheduler.MultiStepLR(
            self.optimizer,
            milestones=[100, 105],
            gamma=0.1
        )

    def pgd_attack(self, images, labels):
        """PGD attack with a random start."""
        images = images.to(self.device)
        labels = labels.to(self.device)
        original_images = images.detach()
        # Random initialization inside the epsilon-ball
        images = images + torch.zeros_like(images).uniform_(
            -self.epsilon, self.epsilon
        )
        images = torch.clamp(images, 0, 1).detach()
        for _ in range(self.num_iter):
            images.requires_grad = True
            outputs = self.model(images)
            loss = F.cross_entropy(outputs, labels)
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                images = images + self.alpha * images.grad.sign()
                # Project onto the L-infinity ball
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        return images.detach()

    def train_epoch(self, dataloader, epoch):
        """Train for one epoch."""
        self.model.train()
        total_loss = 0
        correct_clean = 0
        correct_adv = 0
        total = 0
        for batch_idx, (images, labels) in enumerate(dataloader):
            images, labels = images.to(self.device), labels.to(self.device)
            # Generate adversarial examples
            adversarial_images = self.pgd_attack(images, labels)
            # Forward pass
            clean_outputs = self.model(images)
            adv_outputs = self.model(adversarial_images)
            # Losses
            clean_loss = F.cross_entropy(clean_outputs, labels)
            adv_loss = F.cross_entropy(adv_outputs, labels)
            loss = clean_loss + adv_loss
            # Backward pass
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            # Statistics
            total_loss += loss.item()
            _, clean_pred = clean_outputs.max(1)
            _, adv_pred = adv_outputs.max(1)
            correct_clean += clean_pred.eq(labels).sum().item()
            correct_adv += adv_pred.eq(labels).sum().item()
            total += labels.size(0)
        self.scheduler.step()
        return {
            'loss': total_loss / len(dataloader),
            'clean_acc': 100. * correct_clean / total,
            'adv_acc': 100. * correct_adv / total,
            'lr': self.optimizer.param_groups[0]['lr']
        }
```
3.2 Theoretical Analysis of PGD Training
```python
def theoretical_analysis_pgd():
    """
    Theoretical observations about PGD adversarial training.
    Key points:
    1. The local maxima PGD finds from different random starts tend to have
       similar loss values, suggesting they are close to the global maximum
    2. A model trained this way gains robustness against L-infinity attacks
       within the training budget
    3. The robustness comes from "smoothing" the decision boundary
    """
    analysis = {
        'local_maxima': """
        Under the L-infinity constraint, the local maxima of the loss
        are typically close to the global maximum in value.
        Reasons:
        - In high-dimensional spaces, local and global maxima often differ little
        - PGD's random initialization helps discover strong adversarial examples
        """,
        'convergence': """
        PGD converges within a limited number of steps:
        - Choose step size alpha >= epsilon / num_iter so the ball can be traversed
        - 7-10 steps are usually enough in practice
        """,
        'robustness_certification': """
        Empirical (not certified) bound:
        - If the model is robust to an epsilon-PGD attack
        - then it is robust to any L-infinity perturbation <= epsilon
        - only under the condition that the attack is optimal,
          which PGD does not guarantee
        """
    }
    return analysis
```
4. Efficiency Problems of Adversarial Training and Their Solutions
4.1 Computational Overhead
The main bottleneck of adversarial training is solving the inner maximization problem. For every training batch, several forward-backward passes are needed to generate the adversarial examples:
Standard training: 1 forward + 1 backward pass
Adversarial training: (1 + K) forward + (1 + K) backward passes
Overhead ratio: roughly (1 + K), where K is the number of PGD steps
```python
def compute_overhead_analysis():
    """
    Per-batch forward/backward pass counts:
    attack passes plus the one training update.
    """
    overheads = {
        'FGSM_attack': {
            'forward_passes': 2,   # 1 attack step + 1 training update
            'backward_passes': 2,
            'overhead_ratio': 2
        },
        'PGD_7': {
            'forward_passes': 8,   # 7 attack steps + 1 training update
            'backward_passes': 8,
            'overhead_ratio': 8
        },
        'PGD_20': {
            'forward_passes': 21,
            'backward_passes': 21,
            'overhead_ratio': 21
        },
        'PGD_100': {
            'forward_passes': 101,
            'backward_passes': 101,
            'overhead_ratio': 101
        }
    }
    return overheads
```
4.2 Free AT: Free Adversarial Training
Free Adversarial Training (Free AT) reduces the overhead by reusing gradients:
```python
class FreeAdversarialTraining:
    """
    Free Adversarial Training (Shafahi et al., 2019).
    Core idea:
    - Each minibatch is replayed m times; every replay does a single
      forward/backward pass whose gradients update BOTH the model
      parameters and the adversarial perturbation
    - The number of passes over the data is divided by m, so the total
      cost stays close to standard training while the adversarial
      strength is comparable to PGD training
    """
    def __init__(self, model, epsilon=8/255, alpha=2/255, num_replays=8):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_replays = num_replays
        # Perturbation carried over between replays (and minibatches)
        self.delta = None

    def train_step(self, images, labels, optimizer):
        """Free AT training step: m replays of the same minibatch."""
        if self.delta is None or self.delta.shape != images.shape:
            self.delta = torch.zeros_like(images)
        loss_value = 0.0
        for _ in range(self.num_replays):
            delta = self.delta.detach().clone().requires_grad_(True)
            images_adv = torch.clamp(images + delta, 0, 1)
            outputs = self.model(images_adv)
            loss = F.cross_entropy(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            # Outer minimization: parameter update
            optimizer.step()
            # Inner maximization: reuse the input gradient from the SAME
            # backward pass to update the perturbation
            with torch.no_grad():
                self.delta = self.delta + self.alpha * delta.grad.sign()
                self.delta = torch.clamp(self.delta, -self.epsilon, self.epsilon)
            loss_value = loss.item()
        return loss_value
```
4.3 Fast AT: Fast Adversarial Training
Fast Adversarial Training (Fast AT) replaces PGD with FGSM for speed:
```python
class FastAdversarialTraining:
    """
    Fast Adversarial Training (Wong et al., 2020).
    Key findings:
    - FGSM (a single-step attack) can train robust models
    - The crucial ingredient is random initialization of the perturbation
    - Training time is close to standard training
    """
    def __init__(self, model, epsilon=8/255, alpha=None):
        self.model = model
        self.epsilon = epsilon
        # Wong et al. recommend a step size slightly larger than epsilon
        self.alpha = alpha if alpha is not None else 1.25 * epsilon
        self.device = next(model.parameters()).device

    def fast_at_attack(self, images, labels):
        """
        Fast AT attack: FGSM with random initialization.
        """
        # Random start inside the epsilon-ball
        delta = torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        delta.requires_grad = True
        outputs = self.model(torch.clamp(images + delta, 0, 1))
        loss = F.cross_entropy(outputs, labels)
        self.model.zero_grad()
        loss.backward()
        with torch.no_grad():
            # Single FGSM step, then project back onto the epsilon-ball
            delta = delta + self.alpha * torch.sign(delta.grad)
            delta = torch.clamp(delta, -self.epsilon, self.epsilon)
            images_adv = torch.clamp(images + delta, 0, 1)
        return images_adv.detach()

    def train_step(self, images, labels, optimizer):
        """
        Fast AT training step.
        Computational cost is nearly the same as standard training.
        """
        # Generate FGSM adversarial examples
        adversarial_images = self.fast_at_attack(images, labels)
        optimizer.zero_grad()
        # Train on the adversarial examples
        outputs = self.model(adversarial_images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()
        return loss.item()
```
Fast AT vs PGD AT
- Fast AT: fast (~1x standard training time), but needs careful hyperparameter tuning
- PGD AT: slow (~7x), but more stable and reliable
- Final robustness can be similar, but Fast AT is more sensitive to hyperparameters; in particular it can suffer from "catastrophic overfitting", where robustness against multi-step attacks suddenly collapses mid-training. A simple safeguard, sketched below, is to monitor PGD robust accuracy on a small holdout batch.
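A minimal sketch of such a monitor, reusing the `pgd_attack` from the `PGDTraining` class in Section 3.1; the helper name, threshold, and batch size are hypothetical, illustrative choices, not values from the papers:

```python
def check_catastrophic_overfitting(model, pgd_attacker, val_images, val_labels,
                                   threshold=0.1):
    """Return True if PGD robust accuracy on a holdout batch has collapsed.

    pgd_attacker: any object exposing pgd_attack(images, labels),
    e.g. the PGDTraining class above.
    """
    model.eval()
    adv = pgd_attacker.pgd_attack(val_images, val_labels)
    with torch.no_grad():
        robust_acc = (model(adv).argmax(1) == val_labels).float().mean().item()
    model.train()
    # If robust accuracy falls below the threshold while FGSM training
    # accuracy stays high, catastrophic overfitting has likely occurred:
    # stop training or roll back to an earlier checkpoint.
    return robust_acc < threshold
```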
4.4 SMART: Self-Adversarial Regularization
SMART (Self-adversarial training with Margin Enhancement) combines several improvements:
```python
class SMARTTraining:
    """
    SMART: Self-adversarial training with Margin Enhancement
    (Jiang et al., 2020)
    Components:
    1. Momentum FGSM to improve adversarial example quality
    2. A margin loss to strengthen the decision boundary
    3. Label smoothing to reduce overfitting
    """
    def __init__(self, model, epsilon=8/255, alpha=1.25*8/255,
                 margin=0.2, label_smoothing=0.1):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.margin = margin
        self.label_smoothing = label_smoothing
        self.device = next(model.parameters()).device

    def momentum_fgsm_attack(self, images, labels):
        """
        Momentum FGSM attack.
        Accumulates gradient directions to stabilize the perturbation.
        """
        delta = torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        # The momentum buffer has the same shape as the input gradient
        momentum = torch.zeros_like(images)
        for _ in range(2):  # 2 iterations
            delta = delta.detach().requires_grad_(True)
            outputs = self.model(images + delta)
            loss = F.cross_entropy(outputs, labels)
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                # Momentum update with L1-normalized gradient (as in MI-FGSM)
                momentum = 0.9 * momentum + delta.grad / (delta.grad.abs().mean() + 1e-10)
                delta = delta + self.alpha * torch.sign(momentum)
                delta = torch.clamp(delta, -self.epsilon, self.epsilon)
                delta = torch.clamp(images + delta, 0, 1) - images
        return (images + delta).detach()

    def margin_loss(self, outputs, labels):
        """
        Margin loss: encourage the true-class logit to exceed the
        runner-up logit by at least `margin`.
        """
        # True-class logits
        target_logits = outputs.gather(1, labels.unsqueeze(1)).squeeze(1)
        # Runner-up logits
        other_logits = outputs.clone()
        other_logits.scatter_(1, labels.unsqueeze(1), float('-inf'))
        second_logits = other_logits.max(dim=1)[0]
        # Margin: true-class logit minus runner-up logit
        margins = target_logits - second_logits
        return F.relu(self.margin - margins).mean()

    def train_step(self, images, labels, optimizer):
        """SMART training step."""
        adversarial_images = self.momentum_fgsm_attack(images, labels)
        optimizer.zero_grad()
        # A single forward pass reused for both loss terms
        outputs = self.model(adversarial_images)
        ce_loss = F.cross_entropy(outputs, labels,
                                  label_smoothing=self.label_smoothing)
        margin_loss = self.margin_loss(outputs, labels)
        loss = ce_loss + 0.5 * margin_loss
        loss.backward()
        optimizer.step()
        return {
            'ce_loss': ce_loss.item(),
            'margin_loss': margin_loss.item(),
            'total_loss': loss.item()
        }
```
5. Adversarial Regularization Methods
5.1 TRADES: Adversarial Regularization
TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) regularizes with a KL divergence:
```python
class TRADESTraining:
    """
    TRADES (Zhang et al., 2019).
    Loss:
        L(x, y, theta) = CE(f_theta(x), y) + beta * KL(f_theta(x) || f_theta(x'))
    where x' is an adversarial example that maximizes the KL term.
    """
    def __init__(self, model, epsilon=8/255, beta=6.0):
        self.model = model
        self.epsilon = epsilon
        self.beta = beta
        self.device = next(model.parameters()).device

    def trades_attack(self, images):
        """Generate x' by maximizing KL(f(x) || f(x')), as in the paper."""
        original_images = images.detach()
        with torch.no_grad():
            clean_probs = F.softmax(self.model(original_images), dim=1)
        images = images + torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        images = torch.clamp(images, 0, 1).detach()
        for _ in range(10):
            images.requires_grad = True
            adv_log_probs = F.log_softmax(self.model(images), dim=1)
            # kl_div(log q, p) computes KL(p || q) = KL(f(x) || f(x'))
            loss = F.kl_div(adv_log_probs, clean_probs, reduction='batchmean')
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                images = images + 2/255 * torch.sign(images.grad)
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        return images.detach()

    def train_step(self, images, labels, optimizer):
        """TRADES training step."""
        adversarial_images = self.trades_attack(images)
        optimizer.zero_grad()
        # Cross-entropy on the clean sample (accuracy term)
        clean_outputs = self.model(images)
        ce_loss = F.cross_entropy(clean_outputs, labels)
        # KL regularizer between clean and adversarial predictions
        adv_log_probs = F.log_softmax(self.model(adversarial_images), dim=1)
        clean_probs = F.softmax(clean_outputs, dim=1)
        kl_loss = F.kl_div(adv_log_probs, clean_probs, reduction='batchmean')
        # Total loss
        loss = ce_loss + self.beta * kl_loss
        loss.backward()
        optimizer.step()
        return {
            'ce_loss': ce_loss.item(),
            'kl_loss': kl_loss.item(),
            'total_loss': loss.item()
        }
```
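A brief usage note for the class above: `beta` directly controls the robustness-accuracy trade-off from Section 2.2 (the TRADES paper recommends `beta = 6.0` for CIFAR-10; the model and data below are placeholders):

```python
# Smaller beta favors clean accuracy; beta = 6.0 is the common robust setting
trainer = TRADESTraining(model, epsilon=8/255, beta=6.0)
stats = trainer.train_step(images, labels, optimizer)
```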
5.2 MART: Misclassification-Aware Adversarial Training
```python
class MARTTraining:
    """
    MART: Misclassification-Aware adveRsarial Training
    (Wang et al., 2020)
    Loss:
        L = BCE(f(x'), y) + beta * KL(f(x) || f(x')) * (1 - p_clean(y))
    The KL term is down-weighted for samples the model already classifies
    correctly and confidently, emphasizing misclassified examples.
    """
    def __init__(self, model, epsilon=8/255, beta=6.0):
        self.model = model
        self.epsilon = epsilon
        self.beta = beta

    def train_step(self, images, labels, optimizer):
        """MART training step."""
        adversarial_images = self.pgd_attack(images, labels)
        optimizer.zero_grad()
        # Outputs on clean and adversarial samples
        clean_outputs = self.model(images)
        adv_outputs = self.model(adversarial_images)
        clean_probs = F.softmax(clean_outputs, dim=1)
        adv_probs = F.softmax(adv_outputs, dim=1)
        true_clean = clean_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        true_adv = adv_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        # BCE term: raise the true-class probability AND suppress the
        # strongest wrong class on the adversarial example
        masked = adv_probs.clone()
        masked.scatter_(1, labels.unsqueeze(1), 0.0)
        top_wrong = masked.max(dim=1)[0]
        bce_loss = (-torch.log(true_adv.clamp(min=1e-12))
                    - torch.log((1.0 - top_wrong).clamp(min=1e-12))).mean()
        # KL term weighted by how unconfident the clean prediction is
        kl = (clean_probs * (torch.log(clean_probs.clamp(min=1e-12))
                             - torch.log(adv_probs.clamp(min=1e-12)))).sum(dim=1)
        kl_loss = (kl * (1.0 - true_clean)).mean()
        # Total loss
        loss = bce_loss + self.beta * kl_loss
        loss.backward()
        optimizer.step()
        return loss.item()

    def pgd_attack(self, images, labels):
        """PGD attack (same as in Section 3)."""
        original_images = images.detach()
        images = images + torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        images = torch.clamp(images, 0, 1).detach()
        for _ in range(7):
            images.requires_grad = True
            outputs = self.model(images)
            loss = F.cross_entropy(outputs, labels)
            self.model.zero_grad()
            loss.backward()
            with torch.no_grad():
                images = images + 2/255 * torch.sign(images.grad)
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        return images.detach()
```
6. Adversarial Training vs. Certified Defense
6.1 Certified Defense Overview
A certified defense (Certified Defense) provides a theoretical guarantee: for a given perturbation budget ε, it proves the model's prediction is unchanged on every point inside the ε-ball around the input.
```python
class CertifiedDefense:
    """
    Certified defense framework.
    Difference from adversarial training:
    - Adversarial training: an empirical defense that stronger attacks may break
    - Certified defense: a theoretical guarantee that holds within the
      certified radius, regardless of the attack
    """
    def __init__(self, model):
        self.model = model
        self.device = next(model.parameters()).device

    def bound_analysis(self, x, y, epsilon):
        """
        Certified-bound analysis for a single linear layer
        (model.weight of shape [num_classes, dim], x of shape [dim]).
        For a linear classifier the exact certified L2 radius is
            min_{j != y} (logit_y - logit_j) / ||w_y - w_j||_2
        """
        W = self.model.weight.data
        b = self.model.bias.data
        logits = W @ x + b
        radii = []
        for j in range(W.size(0)):
            if j == y:
                continue
            margin = logits[y] - logits[j]
            radii.append((margin / torch.norm(W[y] - W[j])).item())
        certified_radius = min(radii)
        # The prediction is certified if the radius exceeds the budget
        return certified_radius, certified_radius >= epsilon
```
6.2 Randomized Smoothing: A Scalable Certification Method
```python
class RandomizedSmoothing:
    """
    Randomized Smoothing (Cohen et al., 2019).
    Certification procedure:
    - Add Gaussian noise to the input
    - Decide the prediction by majority vote
    - Derive a certified L2 radius: R = sigma * Phi^{-1}(p_A),
      where p_A is a lower confidence bound on the top-class probability
    """
    def __init__(self, model, sigma=0.25, num_samples=1000):
        self.model = model
        self.sigma = sigma
        self.num_samples = num_samples
        self.device = next(model.parameters()).device

    def certify(self, x, n0=100, alpha=0.05):
        """
        Certify one input.
        Args:
            x: input sample
            n0: number of samples used to guess the top class
            alpha: failure probability of the certificate
        Returns:
            (predicted_class, certified_radius); radius 0.0 means abstain
        """
        self.model.eval()
        # Stage 1: guess the top class with n0 noisy samples
        counts0 = self._sample_predictions(x, n0)
        top_class = counts0.argmax().item()
        # Stage 2: estimate a lower bound on p_A with more samples
        counts = self._sample_predictions(x, self.num_samples)
        p_lower = self._lower_confidence_bound(
            counts[top_class].item(), self.num_samples, alpha
        )
        if p_lower < 0.5:
            return top_class, 0.0  # abstain: cannot certify
        # Certified L2 radius (Cohen et al., Theorem 1, with p_B = 1 - p_A)
        radius = self.sigma * torch.distributions.Normal(0, 1).icdf(
            torch.tensor(p_lower)
        )
        return top_class, radius.item()

    def _sample_predictions(self, x, num_samples):
        """Class counts over noisy copies of x."""
        x_repeated = x.repeat(num_samples, 1, 1, 1)
        # The Gaussian noise is part of the smoothed classifier's definition
        # (no clamping, to keep the theoretical guarantee exact)
        noise = torch.randn_like(x_repeated) * self.sigma
        with torch.no_grad():
            outputs = self.model(x_repeated + noise)
        predictions = outputs.argmax(dim=1)
        return torch.bincount(predictions, minlength=outputs.size(1)).float()

    def _lower_confidence_bound(self, successes, trials, alpha):
        """One-sided Wilson score lower bound. Cohen et al. use the exact
        Clopper-Pearson bound; Wilson is a simple approximation of it."""
        from statistics import NormalDist
        z = NormalDist().inv_cdf(1 - alpha)
        p_hat = successes / trials
        denominator = 1 + z**2 / trials
        center = (p_hat + z**2 / (2 * trials)) / denominator
        margin = z * ((p_hat * (1 - p_hat) / trials
                       + z**2 / (4 * trials**2)) ** 0.5) / denominator
        return max(0.0, center - margin)
```
Certified vs. empirical robustness
- The certified radius is a theoretical guarantee: it holds even against an attacker who knows the model parameters
- But certified radii are usually conservative and may underestimate actual robustness
- Adversarial training improves empirical robustness, but not necessarily certified robustness. A usage sketch of the smoothing certificate follows.
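A minimal, hypothetical usage sketch of the `RandomizedSmoothing` class above; the model and input are placeholders, and in practice the base classifier should be trained with Gaussian noise augmentation, as Cohen et al. recommend:

```python
# Hypothetical smoke test; a real base classifier should be trained on noisy inputs
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
smoother = RandomizedSmoothing(model, sigma=0.25, num_samples=1000)

x = torch.rand(3, 32, 32)  # a single input in [0, 1]
predicted_class, radius = smoother.certify(x, n0=100, alpha=0.05)
print(predicted_class, radius)  # radius == 0.0 means the certificate abstained
```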
7. A Practical Guide to Adversarial Training
7.1 Recommended Training Configurations
```python
def get_training_recommendations():
    """
    Practical configuration recommendations for adversarial training.
    """
    recommendations = {
        'PGD_adversarial_training': {
            'epsilon': '8/255 (CIFAR-10), 4/255 (ImageNet)',
            'alpha': 'epsilon / 4',
            'num_iter': '7-10',
            'optimizer': 'SGD with momentum 0.9',
            'learning_rate': '0.01-0.1',
            'weight_decay': '5e-4',
            'batch_size': '128-256',
            'training_epochs': '110'
        },
        'fast_adversarial_training': {
            'epsilon': '8/255',
            'alpha': '1.25 * epsilon',
            'random_init': 'True',
            'optimizer': 'SGD with momentum 0.9',
            'learning_rate': '0.1'
        },
        'TRADES': {
            'epsilon': '8/255',
            'beta': '6.0',
            'optimizer': 'Adam or SGD',
            'learning_rate': '0.01'
        }
    }
    return recommendations
```
7.2 Evaluation Protocol
```python
def robust_evaluation_protocol():
    """
    A standard protocol for robustness evaluation.
    """
    evaluation = {
        'white_box_attacks': [
            'PGD-20: PGD with 20 steps',
            'PGD-100: PGD with 100 steps',
            'AutoAttack: ensemble of multiple attacks'
        ],
        'black_box_attacks': [
            'Transfer attack from surrogate model',
            'HopSkipJumpAttack',
            'Square Attack'
        ],
        'certified_robustness': [
            'Randomized Smoothing',
            'CROWN bounds',
            'IBP (Interval Bound Propagation)'
        ],
        'metrics': [
            'Robust Accuracy: accuracy on adversarial examples',
            'Clean Accuracy: accuracy on clean data',
            'Certified Radius: guaranteed perturbation bound'
        ]
    }
    return evaluation
```
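As a concrete instance of the white-box part of this protocol, here is a minimal sketch of measuring PGD robust accuracy over a test loader, reusing the `pgd_attack` from the `PGDTraining` class in Section 3.1; the helper name is hypothetical, and the model, attacker, and loader are assumed to be defined elsewhere:

```python
def evaluate_robust_accuracy(model, attacker, test_loader, device):
    """Robust accuracy: fraction of test samples still classified correctly
    after a PGD attack (e.g., PGD-20 via attacker.num_iter = 20)."""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        adv = attacker.pgd_attack(images, labels)  # needs gradients, so no torch.no_grad here
        with torch.no_grad():
            correct += (model(adv).argmax(1) == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total
```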
8. References
- Madry, A., et al. (2018). "Towards Deep Learning Models Resistant to Adversarial Attacks." ICLR.
- Goodfellow, I. J., et al. (2015). "Explaining and Harnessing Adversarial Examples." ICLR.
- Tsipras, D., et al. (2019). "Robustness May Be at Odds with Accuracy." ICLR.
- Wong, E., & Kolter, J. Z. (2018). "Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope." ICML.
- Wong, E., et al. (2020). "Fast is Better than Free: Revisiting Adversarial Training." ICLR.
- Shafahi, A., et al. (2019). "Adversarial Training for Free!" NeurIPS.
- Zhang, H., et al. (2019). "Theoretically Principled Trade-off between Robustness and Accuracy." ICML.
- Wang, Y., et al. (2020). "Improving Adversarial Robustness Requires Revisiting Misclassified Examples." ICLR.
- Cohen, J. M., et al. (2019). "Certified Adversarial Robustness via Randomized Smoothing." ICML.
- Carlini, N., et al. (2019). “On Evaluating Adversarial Robustness.” arXiv.