Adversarial Attack and Defense in Practice: From FGSM to Certified Robustness
Keywords
| No. | Keyword | English term |
|---|---|---|
| 1 | 对抗样本 | Adversarial Example |
| 2 | FGSM | Fast Gradient Sign Method |
| 3 | PGD | Projected Gradient Descent |
| 4 | C&W攻击 | Carlini & Wagner Attack |
| 5 | 对抗补丁 | Adversarial Patch |
| 6 | 对抗训练 | Adversarial Training |
| 7 | 认证鲁棒性 | Certified Robustness |
| 8 | 迁移攻击 | Transfer Attack |
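I. What Is an Adversarial Example?
1.1 An unsettling phenomenon
In 2014, Christian Szegedy et al. reported a strange property of neural networks: adding noise that is almost invisible to the human eye can flip a CNN's prediction. The best-known illustration comes from the follow-up FGSM paper by Goodfellow et al., in which a panda image plus a tiny perturbation is classified as a "gibbon" with 99.3% confidence.
What does this mean? Deep learning systems that look strong on clean data can be surprisingly brittle: a small, deliberate change to the input is enough to make the model's prediction wrong.
An adversarial example is exactly such an input: hard for a human to tell apart from the original, yet misclassified by the model. In mathematical terms:
Definition of an adversarial example:
Given a classifier f: X → Y and an input x ∈ X,
find a perturbation δ such that:
1. ||δ|| < ε (the perturbation is small and hard for a human to notice)
2. f(x) ≠ f(x + δ) (the model's prediction changes)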
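The two conditions of the definition can be checked mechanically. The following is a minimal sketch in plain Python (no PyTorch required); the toy linear classifier `predict`, its weights, and the helper `is_adversarial` are hypothetical stand-ins chosen for illustration, and the L∞ norm is assumed for ||δ||.

```python
def predict(w, b, x):
    """Toy linear classifier (an assumption for illustration): class 1 if w.x + b > 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def linf_norm(delta):
    """L-infinity norm: the largest absolute per-coordinate change."""
    return max(abs(d) for d in delta)


def is_adversarial(w, b, x, delta, eps):
    """Check both conditions: the perturbation is small and the prediction changes."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    small = linf_norm(delta) < eps
    flipped = predict(w, b, x) != predict(w, b, x_adv)
    return small and flipped


w, b = [1.0, -2.0, 0.5], 0.0
x = [0.2, 0.05, 0.1]            # clean score = 0.15, so class 1
delta = [-0.05, 0.05, -0.05]    # each coordinate moves against the sign of w
print(is_adversarial(w, b, x, delta, eps=0.06))
```

With a smaller budget (for example `eps=0.01`) the same `delta` no longer qualifies, because condition 1 fails even though the prediction still flips.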
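1.2 Why do adversarial examples exist?
Several explanations have been proposed:
Linearity view: although a neural network is nonlinear overall, activations such as ReLU are piecewise linear, so the network behaves linearly over large regions of the input space. In high dimensions, many tiny per-coordinate changes that all follow the gradient direction add up to a large change in the output.
Linear explanation:
output = w · (x + δ) = w · x + w · δ
With δ = ε · sign(w), the extra term is w · δ = ε · ||w||_1.
If w is high-dimensional (say 1,000,000 dimensions),
then even with a very small ||δ||_∞ (0.001),
w · δ can be large (1,000,000 dimensions × average |w_i| of 1 × 0.001 = 1000).
In other words, a tiny per-coordinate perturbation is amplified by the dimensionality.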
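The arithmetic of the linear explanation is easy to reproduce. This sketch assumes weights of magnitude 1 with random signs (a simplification for illustration, not a claim about real networks) and uses the worst-case perturbation δ = ε · sign(w); the function name `linear_output_shift` is hypothetical.

```python
import math
import random


def linear_output_shift(n, eps, seed=0):
    """Return w . delta for delta = eps * sign(w), with weights of magnitude 1."""
    rng = random.Random(seed)
    w = [rng.choice([-1.0, 1.0]) for _ in range(n)]
    delta = [eps * (1.0 if wi > 0 else -1.0) for wi in w]
    # fsum avoids accumulating rounding error over a million terms
    return math.fsum(wi * di for wi, di in zip(w, delta))


for n in (10, 1_000, 1_000_000):
    print(n, round(linear_output_shift(n, eps=0.001), 6))
```

The per-coordinate change stays at 0.001 in every case; only the dimension grows, and the output shift grows linearly with it.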
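Decision-boundary view: a classifier's decision boundary is a complicated surface in input space. Adversarial examples are points deliberately pushed just across that surface.
Data-distribution view: training data is a sparse sample of the input space, and the model only learns reliable behavior near those samples. Adversarial examples appear in regions the training data covers poorly.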
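1.3 Taxonomy of adversarial examples
Adversarial examples can be categorized along several axes:
| Criterion | Types | Characteristics |
|---|---|---|
| Attacker knowledge | White-box / black-box | White-box attackers know the model parameters; black-box attackers only observe inputs and outputs |
| Attack goal | Targeted / untargeted | Targeted attacks force a specific class; untargeted attacks only need a wrong prediction |
| Perturbation scope | Pixel-level / patch-level / physical | A patch modifies only a local region; physical attacks can be printed and deployed |
| Perturbation budget | Small L2 / small L∞ / sparse L0 | Different constraints call for different generation methods |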
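II. FGSM: The Fast Gradient Sign Method
2.1 How the algorithm works
FGSM (Fast Gradient Sign Method), proposed by Goodfellow et al., is the simplest and most classic adversarial attack. Its core idea is:
Take a single step of size epsilon along the sign of the loss gradient.
FGSM algorithm:
x_adv = x + ε · sign(∇_x J(θ, x, y))
where:
- x: original input
- x_adv: adversarial example
- ε: perturbation size (hyperparameter)
- J: loss function
- ∇_x J: gradient of the loss with respect to the input
- sign: elementwise sign function
Why use sign(gradient) rather than the raw gradient? Because the budget is an L∞ bound: each pixel may move by at most ε. Under that constraint, the step that increases the linearized loss the most moves every pixel by the full ε in the direction of its gradient sign. The sign function maps the gradient to ±1, so each pixel goes either up by ε or down by ε.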
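To make the update rule concrete without a deep-learning framework, here is a minimal sketch of FGSM against a logistic-regression model, whose input gradient has the closed form (p - y) · w. The model, its weights, and the helper names are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]   # dJ/dx = (p - y) * w
    return loss, grad


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """x_adv = clip(x + eps * sign(grad_x J), 0, 1)."""
    _, grad = loss_and_input_grad(w, b, x, y)
    return [min(max(xi + eps * sign(gi), 0.0), 1.0) for xi, gi in zip(x, grad)]


w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.5, 0.2, 0.7], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
print(f"loss: {clean_loss:.3f} -> {adv_loss:.3f}")
```

Every coordinate moves by exactly ε, and the loss on the true label goes up, which is all that a single FGSM step promises.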
2.2 Code implementation
"""
FGSM adversarial attack implementation
"""
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
def fgsm_attack(image: torch.Tensor,
epsilon: float,
gradient: torch.Tensor) -> torch.Tensor:
"""
FGSM攻击
参数:
image: 原始图像张量 [C, H, W] 或 [B, C, H, W]
epsilon: 扰动幅度
gradient: 损失对输入的梯度
返回:
对抗样本
"""
# 获取扰动方向(梯度的符号)
perturbation = epsilon * torch.sign(gradient)
# 添加扰动
adversarial_image = image + perturbation
# 裁剪到有效范围(对于图像通常是[0, 1]或[0, 255])
adversarial_image = torch.clamp(adversarial_image, 0, 1)
return adversarial_image
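def compute_adversarial_loss(model: nn.Module,
                             image: torch.Tensor,
                             target: torch.Tensor,
                             targeted: bool = False) -> torch.Tensor:
    """
    Loss that the attack ascends, i.e. the caller steps along +sign(gradient).
    targeted=True:  `target` is the class to reach, so return the negative
                    cross-entropy (ascending it raises the target-class probability)
    targeted=False: `target` is the true label, so return the cross-entropy
                    (ascending it lowers the true-class probability)
    """
    output = model(image)
    loss = F.cross_entropy(output, target)
    return -loss if targeted else loss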
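def fgsm_attack_wrapper(model: nn.Module,
                        image: torch.Tensor,
                        target: torch.Tensor,
                        epsilon: float,
                        targeted: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Full FGSM pipeline: forward pass, backward pass, perturbation.
    Returns:
        (adversarial example, input gradient of the ascended loss)
    """
    # Work on a detached leaf copy so the caller's tensor is left untouched
    image = image.clone().detach().requires_grad_(True)
    # Forward pass
    output = model(image)
    # Loss to ascend: cross-entropy for untargeted attacks,
    # negative cross-entropy toward `target` for targeted attacks
    loss = F.cross_entropy(output, target)
    if targeted:
        loss = -loss
    # Backward pass to get the gradient with respect to the input
    model.zero_grad()
    loss.backward()
    gradient = image.grad.detach()
    # Build the adversarial example
    adversarial_image = fgsm_attack(image.detach(), epsilon, gradient)
    return adversarial_image, gradient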
def evaluate_attack(model: nn.Module,
images: torch.Tensor,
labels: torch.Tensor,
epsilon: float) -> dict:
"""
评估FGSM攻击效果
"""
model.eval()
correct_before = 0
correct_after = 0
for i in range(len(images)):
img = images[i:i+1].clone()
label = labels[i:i+1]
# 原始准确率
with torch.no_grad():
output = model(img)
pred_before = output.argmax(dim=1)
correct_before += (pred_before == label).item()
# 生成对抗样本
adv_img, _ = fgsm_attack_wrapper(model, img, label, epsilon)
# 对抗样本准确率
with torch.no_grad():
output = model(adv_img)
pred_after = output.argmax(dim=1)
correct_after += (pred_after == label).item()
return {
"accuracy_before": correct_before / len(images),
"accuracy_after": correct_after / len(images),
"attack_success_rate": 1 - (correct_after / correct_before) if correct_before > 0 else 0,
"epsilon": epsilon
}
def demo_fgsm():
"""FGSM攻击演示"""
# 演示代码
print("FGSM攻击演示")
print("=" * 60)
print("""
FGSM攻击流程:
1. 输入原始图像 x 和真实标签 y
2. 计算损失 J(θ, x, y) 对输入的梯度 ∇_x J
3. 扰动 δ = ε × sign(∇_x J)
4. 对抗样本 x' = x + δ
5. clip(x', 0, 1) 确保像素值有效
示例:
- 如果某个像素的梯度为正,说明增加该像素会增大损失
- FGSM会把该像素增加 ε
- 反之亦然
""")
# 模拟计算
print("\n模拟计算:")
batch_size, channels, height, width = 1, 3, 224, 224
epsilon = 0.007 # 常用值
# 模拟梯度
gradient = torch.randn(batch_size, channels, height, width)
perturbation = epsilon * torch.sign(gradient)
print(f"扰动幅度 epsilon: {epsilon}")
print(f"梯度范数 (L2): {torch.norm(gradient).item():.4f}")
print(f"扰动范数 (L2): {torch.norm(perturbation).item():.4f}")
print(f"扰动范数 (L∞): {torch.max(torch.abs(perturbation)).item():.4f}")
if __name__ == "__main__":
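    demo_fgsm()
2.3 Strengths and weaknesses of FGSM
Strengths:
- Fast: one forward pass plus one backward pass
- Conceptually simple and easy to understand
- The building block for many stronger attacks
Weaknesses:
- Single-step, so the attack is often weaker than iterative methods
- Performs poorly against some defenses, such as adversarially trained models
- No fine-grained control: every pixel moves by exactly ±ε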
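III. PGD: The Strong Multi-Step Attack
3.1 Why use a multi-step attack?
FGSM attacks in a single step, and in some situations one step is not enough:
- The decision boundary can be complex, so crossing it may take several steps
- Some defenses defeat single-step attacks
- A single large step can overshoot or stall where the loss surface is strongly curved
PGD (Projected Gradient Descent) addresses these problems. It is essentially an iterated FGSM with a projection after every step:
PGD algorithm:
x_0 = x  # original image
for t = 1 to T:
x_t = Π_{x + S}(x_{t-1} + α · sign(∇_x J(θ, x_{t-1}, y)))
where:
- α: per-step size (often α = ε/T or somewhat larger)
- Π: projection that keeps the perturbation inside the allowed set
- S: allowed perturbation set (usually an L∞ ball of radius ε)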
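The role of the projection Π is easiest to see in a few lines of code. The sketch below runs PGD against a logistic-regression model in plain Python; the model, its weights, and the function names are assumptions for illustration, and the random start is omitted to keep the example deterministic.

```python
import math


def loss_and_input_grad(w, b, x, y):
    """Logistic-model cross-entropy and its gradient with respect to the input."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, [(p - y) * wi for wi in w]


def project(x_adv, x, eps):
    """Pi: clip into the L-infinity ball around x, then into the pixel range [0, 1]."""
    out = []
    for xa, xo in zip(x_adv, x):
        xa = min(max(xa, xo - eps), xo + eps)
        out.append(min(max(xa, 0.0), 1.0))
    return out


def pgd(w, b, x, y, eps, alpha, steps):
    x_adv = list(x)                       # x_0 = x (no random start in this sketch)
    for _ in range(steps):
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        stepped = [xa + alpha * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        x_adv = project(stepped, x, eps)
    return x_adv


w, b = [2.0, -3.0, 1.0], 0.1
x, y, eps = [0.5, 0.2, 0.95], 1, 0.1
x_adv = pgd(w, b, x, y, eps=eps, alpha=eps / 4, steps=10)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
print(f"loss: {clean_loss:.3f} -> {adv_loss:.3f}")
```

Ten steps of size ε/4 would travel 2.5ε without the projection; with it, the final perturbation never exceeds ε. For a linear model the result coincides with FGSM, so the benefit of iterating only shows up on nonlinear models.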
3.2 Code implementation
"""
PGD adversarial attack implementation
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
class PGDAttack:
"""
PGD攻击类
PGD是FGSM的迭代版本,通常是最强的L∞攻击
"""
def __init__(self,
model: nn.Module,
epsilon: float = 0.3,
alpha: float = 0.01,
num_iter: int = 40,
random_start: bool = True):
"""
参数:
model: 攻击的目标模型
epsilon: 最大扰动幅度(L∞范数)
alpha: 每步的步长
num_iter: 迭代次数
random_start: 是否从随机点开始
"""
self.model = model
self.epsilon = epsilon
self.alpha = alpha
self.num_iter = num_iter
self.random_start = random_start
def attack(self,
images: torch.Tensor,
labels: torch.Tensor,
targeted: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
"""
执行PGD攻击
"""
# 记录原始图像(用于计算最终扰动)
original_images = images.clone()
# 如果需要,从随机点开始
if self.random_start:
images = images + torch.zeros_like(images).uniform_(
-self.epsilon, self.epsilon
)
# 投影回允许范围
images = torch.clamp(images, 0, 1)
# 迭代攻击
for i in range(self.num_iter):
images.requires_grad = True
# 前向传播
outputs = self.model(images)
# 计算损失
if targeted:
loss = -F.cross_entropy(outputs, labels)
else:
loss = F.cross_entropy(outputs, labels)
# 反向传播
self.model.zero_grad()
loss.backward()
# 一步梯度上升
gradient = images.grad.data
images = images.detach()
images = images + self.alpha * torch.sign(gradient)
# 投影回允许范围
# 确保对抗样本在原始图像的epsilon邻域内
images = torch.clamp(
torch.max(
original_images - self.epsilon,
torch.min(
original_images + self.epsilon,
images
)
),
0, 1
)
return images, original_images
def attack_batch(self,
images: torch.Tensor,
labels: torch.Tensor,
batch_size: int = 32) -> torch.Tensor:
"""
分批攻击
"""
adversarial_images = []
for i in range(0, len(images), batch_size):
batch_images = images[i:i+batch_size]
batch_labels = labels[i:i+batch_size]
adv_images, _ = self.attack(batch_images, batch_labels)
adversarial_images.append(adv_images)
return torch.cat(adversarial_images, dim=0)
class TargetedPGDAttack(PGDAttack):
"""
定向PGD攻击
"""
def __init__(self, *args, target_classes: Optional[torch.Tensor] = None, **kwargs):
super().__init__(*args, **kwargs)
self.target_classes = target_classes
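    def attack(self, images, labels=None, targeted=True):
        # Use the preset target classes, or sample random ones
        if self.target_classes is None:
            # Infer the number of classes from one forward pass on the batch itself,
            # which keeps the probe on the same device as the images
            with torch.no_grad():
                num_classes = self.model(images[:1]).shape[1]
            target_classes = torch.randint(
                0, num_classes, (images.shape[0],), device=images.device
            )
            if labels is not None:
                # Shift any target that equals the true label to the next class,
                # so every target differs from the original label
                clash = target_classes == labels
                target_classes[clash] = (target_classes[clash] + 1) % num_classes
        else:
            target_classes = self.target_classes
        return super().attack(images, target_classes, targeted=True)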
def compare_fgsm_vs_pgd(model, images, labels, epsilon):
"""
对比FGSM和PGD攻击效果
"""
from copy import deepcopy
results = {}
# FGSM攻击
fgsm_images = []
for i in range(len(images)):
img = images[i:i+1].clone()
label = labels[i:i+1]
adv_img, _ = fgsm_attack_wrapper(model, img, label, epsilon)
fgsm_images.append(adv_img)
fgsm_images = torch.cat(fgsm_images, dim=0)
# PGD攻击
pgd_attack = PGDAttack(model, epsilon=epsilon, num_iter=40)
pgd_images, _ = pgd_attack.attack(images.clone(), labels)
# 评估
model.eval()
with torch.no_grad():
# 原始准确率
clean_acc = (model(images).argmax(1) == labels).float().mean().item()
# FGSM后准确率
fgsm_acc = (model(fgsm_images).argmax(1) == labels).float().mean().item()
# PGD后准确率
pgd_acc = (model(pgd_images).argmax(1) == labels).float().mean().item()
results = {
"clean_accuracy": clean_acc,
"fgsm_accuracy": fgsm_acc,
"pgd_accuracy": pgd_acc,
"fgsm_attack_success": 1 - fgsm_acc,
"pgd_attack_success": 1 - pgd_acc,
}
return results
def demo_pgd():
"""PGD攻击演示"""
print("PGD攻击演示")
print("=" * 60)
print("""
PGD vs FGSM:
FGSM(单步):
x' = x + ε · sign(∇J)
PGD(多步):
for t in 1..T:
x_t = clip(x_{t-1} + α·sign(∇J(x_{t-1})))
x_t = project(x_t, x + ε)
为什么PGD更强?
1. 多步迭代能更好地探索决策边界
2. random_start让攻击更难防御
3. 投影操作确保扰动始终有效
常用配置:
- epsilon = 8/255 ≈ 0.031 (ImageNet)
- alpha = epsilon / 4
- num_iter = 10-40
""")
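    # Effect of the iteration count on the attack (synthetic curve, not a measurement)
    import numpy as np  # imported here because this listing's header does not import it
    print("\nEffect of the iteration count on accuracy (simulated):")
    iterations = [1, 5, 10, 20, 40, 100]
    # Synthetic exponential decay, for illustration only
    for iters in iterations:
        # In practice PGD often plateaus after roughly 10-40 iterations
        simulated_acc = 0.95 * (1 - 0.9 * (1 - np.exp(-iters / 10)))
        print(f"  Iteration {iters:3d}: model accuracy ≈ {simulated_acc:.3f}")
if __name__ == "__main__":
    demo_pgd()
3.3 Why is PGD treated as the strongest L∞ attack?
Madry et al. argued, on the basis of extensive experiments rather than a formal proof, that PGD acts as a "universal" first-order adversary under an L∞ constraint:
A model that withstands PGD tends to withstand other attacks that rely only on first-order gradient information.
This matters in practice because it gives defenders a concrete target: PGD is the standard baseline for adversarial training and evaluation. It is not a guarantee, so robustness claims should still be checked against other attacks.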
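IV. The C&W Attack: A Stealthier Low-Norm Attack
4.1 How the C&W attack works
The C&W attack, proposed by Carlini and Wagner in 2016, looks for the adversarial perturbation with the smallest norm that still fools the model:
C&W optimization objective:
minimize ||δ||_p + c · f(x + δ)
subject to: x + δ ∈ [0, 1]^n
where f is a carefully designed loss (targeted form):
f(x') = max(max_{i≠t} Z(x')_i - Z(x')_t, -κ)
- Z(x') are the logits
- t is the target class
- κ ≥ 0 is the confidence margin
- f reaches its minimum -κ once the target logit exceeds every other logit by at least κ
Why C&W often finds smaller perturbations than FGSM/PGD:
- Direct optimization: the perturbation norm is part of the objective, instead of being fixed in advance by ε
- Flexible norms: variants exist for the L0, L2, and L∞ norms
- Better optimization: a tanh change of variables handles the box constraint, and a binary search tunes c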
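Two ingredients of the attack can be illustrated in isolation: the margin loss f on logits and the tanh change of variables that keeps pixels in range. The sketch below is plain Python; the function names and example logits are hypothetical.

```python
import math


def cw_margin_loss(logits, target, kappa=0.0):
    """Targeted C&W loss: max(max_{i != t} Z_i - Z_t, -kappa)."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


def to_image(w):
    """Change of variables x = 0.5 * (tanh(w) + 1), which always lands in [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1.0) for wi in w]


def to_tanh_space(x):
    """Inverse map w = atanh(2x - 1), valid for 0 < x < 1."""
    return [math.atanh(2.0 * xi - 1.0) for xi in x]


print(cw_margin_loss([2.0, 5.0, 1.0], target=0, kappa=0.5))  # target not yet reached
print(cw_margin_loss([6.0, 5.0, 1.0], target=0, kappa=0.5))  # reached with margin >= kappa
print(to_image([-50.0, 0.0, 50.0]))                          # stays within [0, 1]
```

Because the optimizer works on the unconstrained variable w, no clipping step is needed, and the loss stops rewarding larger perturbations as soon as the margin κ is reached.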
4.2 Code implementation
"""
C&W adversarial attack implementation (L2 norm)
"""
import torch
import torch.nn as nn
import torch.optim as optim
from typing import Optional, Callable
class CWAttack:
"""
Carlini & Wagner L2攻击
特点:
1. 优化得到最小范数扰动
2. 比FGSM/PGD更难防御
3. 支持定向和非定向攻击
"""
def __init__(self,
model: nn.Module,
targeted: bool = False,
confidence: float = 0,
initial_const: float = 0.001,
max_iterations: int = 1000,
learning_rate: float = 0.01,
binary_search_steps: int = 9):
self.model = model
self.targeted = targeted
self.confidence = confidence
self.initial_const = initial_const
self.max_iterations = max_iterations
self.learning_rate = learning_rate
self.binary_search_steps = binary_search_steps
def _logits_to_attack_loss(self,
logits: torch.Tensor,
target: torch.Tensor) -> torch.Tensor:
"""
计算C&W损失函数
"""
# 获取真实类和目标类的logits
one_hot_target = torch.zeros_like(logits).scatter_(
1, target.unsqueeze(1), 1
)
# 目标类的logit
target_logits = (logits * one_hot_target).sum(dim=1)
# 非目标类的最大logit
other_logits = logits - one_hot_target * 1e9
max_other_logits = other_logits.max(dim=1)[0]
if self.targeted:
# 定向攻击:目标logit应该大于其他logit
return torch.clamp(max_other_logits - target_logits + self.confidence, min=0)
else:
# 非定向攻击:其他logit不应该大于目标logit
return torch.clamp(target_logits - max_other_logits + self.confidence, min=0)
    def attack(self,
               images: torch.Tensor,
               labels: torch.Tensor,
               verbose: bool = False) -> torch.Tensor:
        """
        Run the C&W L2 attack.
        `labels` holds the target classes when targeted=True, else the true labels.
        """
        device = images.device
        batch_size = images.shape[0]
        # Change of variables: x_adv = 0.5 * (tanh(w) + 1) always lies in [0, 1].
        # Start w at atanh(2x - 1) so the first candidate equals the clean image.
        w_init = torch.atanh((images * 2 - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach()
        # Per-sample binary search bounds for the trade-off constant c
        c_lower = torch.zeros(batch_size, device=device)
        c_upper = torch.full((batch_size,), 1e10, device=device)
        c = torch.full((batch_size,), float(self.initial_const), device=device)
        # Best (smallest-L2) successful adversarial example per sample
        best_adversarial = images.clone()
        best_L2 = torch.full((batch_size,), float('inf'), device=device)
        for search_step in range(self.binary_search_steps):
            if verbose:
                print(f"Binary search step {search_step + 1}/{self.binary_search_steps}")
            # Restart the optimization from the clean image for each value of c
            w = w_init.clone().requires_grad_(True)
            optimizer = optim.Adam([w], lr=self.learning_rate)
            step_success = torch.zeros(batch_size, dtype=torch.bool, device=device)
            for iteration in range(self.max_iterations):
                optimizer.zero_grad()
                adversarial = torch.tanh(w) * 0.5 + 0.5
                # Squared L2 distance to the clean image, per sample
                L2_dist = torch.sum((adversarial - images) ** 2, dim=[1, 2, 3])
                logits = self.model(adversarial)
                attack_loss = self._logits_to_attack_loss(logits, labels)
                # Total loss: distance + c * attack loss
                total_loss = (L2_dist + c * attack_loss).sum()
                total_loss.backward()
                optimizer.step()
                # Record the best successful example seen so far
                with torch.no_grad():
                    success = attack_loss == 0
                    improved = success & (L2_dist < best_L2)
                    best_adversarial[improved] = adversarial[improved]
                    best_L2[improved] = L2_dist[improved]
                    step_success |= success
                if verbose and (iteration + 1) % 200 == 0:
                    print(f"  Iteration {iteration+1}, loss: {total_loss.item():.4f}")
            # Binary search on c: shrink it after a success (favor smaller
            # perturbations), grow it after a failure (favor attack success)
            c_upper = torch.where(step_success, torch.min(c_upper, c), c_upper)
            c_lower = torch.where(step_success, c_lower, torch.max(c_lower, c))
            bounded = c_upper < 1e9
            c = torch.where(bounded, (c_lower + c_upper) / 2, c * 10)
        return best_adversarial
def demo_cw():
"""C&W攻击演示"""
print("C&W攻击演示")
print("=" * 60)
print("""
C&W vs FGSM/PGD:
FGSM/PGD:
- 使用固定步长
- 最小化扰动不是主要目标
- 可能在不需要的地方添加扰动
C&W:
- 直接优化扰动范数
- 最小化 ||δ||₂ + c · f(x+δ)
- 扰动更小、更隐蔽
攻击效果对比(通常):
FGSM < PGD < C&W
(C&W最强,因为直接优化)
""")
print("\n模拟攻击效果:")
print("-" * 40)
print(f"{'攻击方法':<15} {'扰动范数(L2)':<15} {'成功率':<10}")
print("-" * 40)
print(f"{'FGSM':<15} {'0.032':<15} {'78%':<10}")
print(f"{'PGD':<15} {'0.028':<15} {'85%':<10}")
print(f"{'C&W':<15} {'0.021':<15} {'92%':<10}")
if __name__ == "__main__":
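    demo_cw()
V. Black-Box and Transfer Attacks
5.1 How black-box attacks work
In practice the attacker usually does not know the target model's parameters or architecture. A black-box attack is one mounted under exactly these conditions.
Black-box attacks rely on two key properties:
1. Decision/score queries: the attacker can query the model and observe how the output changes with the input, which is enough to estimate a gradient direction.
2. Transferability: the adversarial examples of different models overlap, so an example generated on one model often fools other models as well.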
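The first property, inferring a gradient direction from queries alone, can be sketched with central finite differences. The black-box target below is a hypothetical logistic model whose parameters are hidden from the caller; the helper names are assumptions for illustration.

```python
import math


def make_black_box(w, b):
    """Hypothetical target: returns only a score, hiding w and b from the caller."""
    counter = {"queries": 0}

    def query(x):
        counter["queries"] += 1
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    return query, counter


def estimate_gradient(query, x, h=1e-3):
    """Central finite differences: two queries per input coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((query(plus) - query(minus)) / (2 * h))
    return grad


query, counter = make_black_box(w=[2.0, -3.0, 1.0], b=0.1)
grad = estimate_gradient(query, [0.5, 0.2, 0.7])
print([round(g, 3) for g in grad], "queries:", counter["queries"])
```

The estimate costs two queries per input dimension, which is affordable here but grows quickly for images; that cost is one reason transfer-based attacks, which need no queries to craft the example, are attractive.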
5.2 Transfer attack implementation
"""
Transfer attack: exploiting cross-model transferability
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, List
import numpy as np
class TransferAttack:
"""
基于迁移的对抗攻击
策略:
1. 训练一个替代模型(surrogate model)
2. 在替代模型上生成对抗样本
3. 利用对抗样本的迁移性攻击目标模型
"""
def __init__(self,
surrogate_model: nn.Module,
epsilon: float = 0.03,
num_iter: int = 10,
alpha: float = 0.003):
self.surrogate = surrogate_model
self.epsilon = epsilon
self.num_iter = num_iter
self.alpha = alpha
def generate(self,
images: torch.Tensor,
labels: torch.Tensor,
attack_method: str = "pgd") -> torch.Tensor:
"""
生成可迁移的对抗样本
参数:
attack_method: 'fgsm', 'pgd', 'mim' (momentum iterative)
"""
if attack_method == "fgsm":
return self._fgsm(images, labels)
elif attack_method == "pgd":
return self._pgd(images, labels)
elif attack_method == "mim":
return self._mim(images, labels)
else:
raise ValueError(f"Unknown attack method: {attack_method}")
def _fgsm(self, images, labels):
images.requires_grad = True
output = self.surrogate(images)
loss = F.cross_entropy(output, labels)
loss.backward()
perturbation = self.epsilon * torch.sign(images.grad)
adversarial = images.detach() + perturbation
return torch.clamp(adversarial, 0, 1)
def _pgd(self, images, labels):
adversarial = images.clone()
# 随机初始化
adversarial = adversarial + torch.zeros_like(adversarial).uniform_(
-self.epsilon, self.epsilon
)
for _ in range(self.num_iter):
adversarial.requires_grad = True
output = self.surrogate(adversarial)
loss = F.cross_entropy(output, labels)
loss.backward()
with torch.no_grad():
adversarial = adversarial + self.alpha * torch.sign(adversarial.grad)
adversarial = torch.clamp(adversarial, 0, 1)
# 投影回epsilon范围
adversarial = torch.max(
images - self.epsilon,
torch.min(images + self.epsilon, adversarial)
)
return adversarial
def _mim(self, images, labels):
"""
Momentum Iterative Method
加入动量项提升迁移性
"""
adversarial = images.clone()
momentum = torch.zeros_like(images)
for _ in range(self.num_iter):
adversarial.requires_grad = True
output = self.surrogate(adversarial)
loss = F.cross_entropy(output, labels)
loss.backward()
# 更新动量
with torch.no_grad():
grad = adversarial.grad
momentum = 0.9 * momentum + grad / torch.norm(grad, p=1)
adversarial = adversarial + self.alpha * torch.sign(momentum)
adversarial = torch.clamp(adversarial, 0, 1)
adversarial = torch.max(
images - self.epsilon,
torch.min(images + self.epsilon, adversarial)
)
return adversarial
class EnsembleAttack:
"""
集成攻击:同时攻击多个替代模型
提升迁移成功率和攻击覆盖面
"""
def __init__(self,
models: List[nn.Module],
epsilon: float = 0.03,
num_iter: int = 10):
self.models = models
self.epsilon = epsilon
self.num_iter = num_iter
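    def attack(self,
               images: torch.Tensor,
               labels: torch.Tensor,
               weights: List[float] = None) -> torch.Tensor:
        """
        Ensemble attack.
        Strategy: step along the sign of the weighted average input gradient
        of all surrogate models, staying inside the epsilon-ball.
        """
        if weights is None:
            weights = [1.0 / len(self.models)] * len(self.models)
        # Per-step size chosen so that num_iter steps can just reach the budget
        alpha = self.epsilon / self.num_iter
        adversarial = images.clone().detach()
        for _ in range(self.num_iter):
            adversarial.requires_grad_(True)
            # Weighted sum of the per-model losses; its gradient is the
            # weighted average of the per-model gradients
            loss = 0.0
            for model, weight in zip(self.models, weights):
                model.eval()
                loss = loss + weight * F.cross_entropy(model(adversarial), labels)
            avg_gradient, = torch.autograd.grad(loss, adversarial)
            with torch.no_grad():
                adversarial = adversarial + alpha * torch.sign(avg_gradient)
                # Project back into the epsilon-ball, then into the valid pixel range
                adversarial = torch.max(images - self.epsilon,
                                        torch.min(images + self.epsilon, adversarial))
                adversarial = torch.clamp(adversarial, 0, 1)
        return adversarial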
def evaluate_transferability(transfer_attack: TransferAttack,
source_models: List[nn.Module],
target_model: nn.Module,
images: torch.Tensor,
labels: torch.Tensor):
"""
评估对抗样本的迁移性
"""
results = {}
for i, model in enumerate(source_models):
attack = TransferAttack(model)
# 在源模型上生成对抗样本
adv_images = attack.generate(images, labels)
# 测试在源模型上的成功率
model.eval()
with torch.no_grad():
source_preds = model(adv_images).argmax(1)
source_success = (source_preds != labels).float().mean().item()
# 测试在目标模型上的成功率
with torch.no_grad():
target_preds = target_model(adv_images).argmax(1)
target_success = (target_preds != labels).float().mean().item()
results[f"model_{i}"] = {
"source_attack_success": source_success,
"target_attack_success": target_success
}
return results
def demo_transfer():
"""迁移攻击演示"""
print("迁移攻击演示")
print("=" * 60)
print("""
迁移攻击原理:
1. 可迁移性:
在模型A上生成的对抗样本,
有一定概率也能欺骗模型B
2. 迁移性来源:
- 不同模型在相似数据上学到相似的决策边界
- 对抗样本位于决策边界的"弱点"附近
- 这些弱点在不同模型间有一定重叠
3. 提升迁移性的方法:
- MIM(动量迭代):加入动量项
- 集成攻击:同时攻击多个模型
- 多步攻击:更强的扰动
- 多样化训练:在不同架构上训练替代模型
""")
if __name__ == "__main__":
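    demo_transfer()
VI. Adversarial Patches: Attacks in the Physical World
6.1 What is an adversarial patch?
An adversarial patch does not modify every pixel of the image. Instead, a carefully optimized "patch" is placed in one region of the image, and that alone is enough to make the model misclassify.
This is worrying in practice: a colored sticker placed on a stop sign could cause an autonomous driving system to read it as a different sign.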
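Applying a patch amounts to overwriting a rectangular region: x' = (1 - m) ⊙ x + m ⊙ p, where m is a binary mask. The sketch below shows only that step on a small grayscale image stored as nested lists; the helper `apply_patch` and the example values are hypothetical, and no patch optimization is performed.

```python
def apply_patch(image, patch, top, left):
    """Return a copy of `image` (list of rows) with `patch` pasted at (top, left)."""
    h, w = len(image), len(image[0])
    ph, pw = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + ph > h or left + pw > w:
        raise ValueError("patch does not fit inside the image")
    out = [row[:] for row in image]          # copy so the input is not mutated
    for r in range(ph):
        for c in range(pw):
            out[top + r][left + c] = patch[r][c]
    return out


image = [[0.1] * 6 for _ in range(6)]        # 6x6 grayscale image
patch = [[1.0, 0.0], [0.0, 1.0]]             # 2x2 patch with unconstrained values
patched = apply_patch(image, patch, top=2, left=3)
changed = sum(patched[r][c] != image[r][c] for r in range(6) for c in range(6))
print(f"{changed} of 36 pixels changed")
```

Unlike an ε-bounded perturbation, the changed pixels can take any value, but only a small fraction of the image is touched.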
6.2 Adversarial patch implementation
"""
Adversarial patch attack implementation
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Tuple
class AdversarialPatch:
"""
对抗补丁攻击
目标:在图像中放置一个补丁,使得模型做出错误判断
应用:物理世界攻击(如自动驾驶标志误识别)
"""
def __init__(self,
model: nn.Module,
patch_size: int = 50,
epsilon: float = 2.0,
learning_rate: float = 0.1):
self.model = model
self.patch_size = patch_size
self.epsilon = epsilon
self.lr = learning_rate
    def generate(self,
                 target_class: int,
                 image_shape: Tuple[int, ...],
                 num_iterations: int = 100,
                 show_progress: bool = True) -> np.ndarray:
        """
        Generate an adversarial patch.
        Args:
            target_class: class the patched image should be classified as
            image_shape: image shape (C, H, W)
            num_iterations: number of optimization steps
        Returns:
            patch as an array of shape (C, patch_size, patch_size)
        """
        channels = image_shape[0]
        # Initialize the patch with random noise, in the model's NCHW layout
        patch = torch.rand(1, channels, self.patch_size, self.patch_size,
                           requires_grad=True)
        optimizer = torch.optim.Adam([patch], lr=self.lr)
        target = torch.tensor([target_class])
        for iteration in range(num_iterations):
            optimizer.zero_grad()
            # Paste the patch at a random location on a blank image
            image = self._apply_patch(patch, image_shape)
            # Minimizing the cross-entropy toward the target class
            # maximizes the target-class probability
            output = self.model(image)
            loss = F.cross_entropy(output, target)
            loss.backward()
            optimizer.step()
            # Project back to the valid pixel range
            with torch.no_grad():
                patch.clamp_(0, 1)
            if show_progress and (iteration + 1) % 20 == 0:
                print(f"Iteration {iteration+1}/{num_iterations}, Loss: {loss.item():.4f}")
        return patch.detach().numpy()[0]
    def _apply_patch(self, patch: torch.Tensor,
                     image_shape: Tuple[int, ...]) -> torch.Tensor:
        """
        Paste the patch at a random location on an all-zero image.
        """
        channels, h, w = image_shape
        image = torch.zeros(1, channels, h, w)
        patch_h, patch_w = patch.shape[2:]
        # randint's upper bound is exclusive, so add 1 to allow the last valid offset
        top = np.random.randint(0, h - patch_h + 1)
        left = np.random.randint(0, w - patch_w + 1)
        image[:, :, top:top+patch_h, left:left+patch_w] = patch
        return image
class TargetedPatch:
"""
定向对抗补丁:让模型把任何包含补丁的图像分类为特定类别
"""
def __init__(self, model: nn.Module):
self.model = model
    def train(self,
              source_images: torch.Tensor,
              target_class: int,
              num_epochs: int = 100,
              lr: float = 0.1) -> torch.Tensor:
        """
        Train a printable adversarial patch.
        Expects source_images of shape [N, 3, 224, 224] with values in [0, 1].
        Returns the patch with shape [3, 50, 50].
        """
        # Initialize the patch in channel-first layout to match the images
        patch = torch.rand(3, 50, 50, requires_grad=True)
        optimizer = torch.optim.Adam([patch], lr=lr)
        target = torch.tensor([target_class])
        for epoch in range(num_epochs):
            total_loss = 0
            for image in source_images:
                optimizer.zero_grad()
                # Simplified placement: overwrite the image center
                image_with_patch = image.clone()
                h, w = 50, 50
                top, left = 87, 87  # centers a 50x50 patch in a 224x224 image
                image_with_patch[:, top:top+h, left:left+w] = patch
                # Forward pass
                output = self.model(image_with_patch.unsqueeze(0))
                # Minimizing cross-entropy toward the target class raises its score
                loss = F.cross_entropy(output, target)
                loss.backward()
                optimizer.step()
                # Keep the patch in the valid (printable) pixel range
                with torch.no_grad():
                    patch.clamp_(0, 1)
                total_loss += loss.item()
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}, Avg Loss: {total_loss/len(source_images):.4f}")
        return patch.detach()
def demo_patch():
"""对抗补丁演示"""
print("对抗补丁攻击演示")
print("=" * 60)
print("""
对抗补丁 vs 传统对抗样本:
传统对抗样本:
- 修改整张图像
- 扰动很小(||δ||_∞ < ε)
- 通常不可打印
对抗补丁:
- 只修改局部区域
- 扰动可以是任意值
- 可以打印出来贴在物理世界
应用场景:
1. 自动驾驶:Stop标志上贴彩色贴纸
2. 人脸识别:特殊眼镜/帽子
3. 物体检测:让检测器完全忽略某物体
""")
if __name__ == "__main__":
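    demo_patch()
VII. Defenses
7.1 Adversarial training
Adversarial training is one of the most effective defenses known. The core idea is to train the model on adversarial examples:
Adversarial training:
min_θ E_{(x,y)∈D} max_{δ∈S} L(θ, x+δ, y)
Inner maximization: find the strongest adversarial perturbation
Outer minimization: minimize the loss on those adversarial examples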
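For a linear classifier the inner maximization has a closed form, which makes the min-max structure easy to inspect. The sketch below assumes labels y in {-1, +1}, an L∞ budget, and a loss that decreases with the margin y · (w · x + b); under those assumptions the worst-case perturbation is δ* = -ε · y · sign(w). All names and values are hypothetical, and the brute-force check over box corners is there only to confirm the closed form.

```python
from itertools import product


def margin(w, b, x, y):
    """Signed margin y * (w . x + b) of a linear classifier, y in {-1, +1}."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def worst_case_delta(w, y, eps):
    """Closed-form inner maximizer for a linear model under an L-infinity budget."""
    return [-eps * y * ((wi > 0) - (wi < 0)) for wi in w]


def brute_force_min_margin(w, b, x, y, eps):
    """Check every corner of the eps-box; a linear function attains its minimum at a corner."""
    return min(
        margin(w, b, [xi + di for xi, di in zip(x, corner)], y)
        for corner in product([-eps, eps], repeat=len(x))
    )


w, b = [1.0, -2.0, 0.5], 0.1
x, y, eps = [0.3, -0.2, 0.4], 1, 0.1
delta = worst_case_delta(w, y, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
print(margin(w, b, x, y), margin(w, b, x_adv, y))
```

The outer step of adversarial training would then update w and b to reduce the loss at `x_adv` rather than at `x`. For deep networks no closed form exists, which is why PGD is used to approximate the inner maximization.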
7.2 Adversarial training implementation
"""
Adversarial training implementation
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from typing import Tuple
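class AdversarialTraining:
    """
    Adversarial training: train the model on adversarial examples.
    Common variants:
    1. PGD-AT: train on adversarial examples generated by PGD
    2. TRADES: add a term that keeps predictions on clean and adversarial inputs close
    3. MART: misclassification-aware adversarial training, which treats misclassified examples differently
    """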
def __init__(self,
model: nn.Module,
epsilon: float = 0.031,
num_iter: int = 7,
alpha: float = 0.008,
attack_method: str = "pgd"):
self.model = model
self.epsilon = epsilon
self.num_iter = num_iter
self.alpha = alpha
self.attack_method = attack_method
def _generate_adversarial(self,
images: torch.Tensor,
labels: torch.Tensor) -> torch.Tensor:
"""
生成对抗样本
"""
if self.attack_method == "pgd":
return self._pgd_attack(images, labels)
elif self.attack_method == "fgsm":
return self._fgsm_attack(images, labels)
else:
raise ValueError(f"Unknown attack: {self.attack_method}")
def _fgsm_attack(self, images, labels):
images.requires_grad = True
output = self.model(images)
loss = F.cross_entropy(output, labels)
loss.backward()
with torch.no_grad():
adversarial = images + self.epsilon * torch.sign(images.grad)
adversarial = torch.clamp(adversarial, 0, 1)
return adversarial
def _pgd_attack(self, images, labels):
adversarial = images.clone()
# 随机初始化
adversarial = adversarial + torch.zeros_like(adversarial).uniform_(
-self.epsilon, self.epsilon
)
adversarial = torch.clamp(adversarial, 0, 1)
for _ in range(self.num_iter):
adversarial.requires_grad = True
output = self.model(adversarial)
loss = F.cross_entropy(output, labels)
loss.backward()
with torch.no_grad():
adversarial = adversarial + self.alpha * torch.sign(adversarial.grad)
adversarial = torch.clamp(adversarial, 0, 1)
# 投影
adversarial = torch.max(
images - self.epsilon,
torch.min(images + self.epsilon, adversarial)
)
return adversarial.detach()
def train(self,
train_loader: DataLoader,
test_loader: DataLoader,
epochs: int = 10,
lr: float = 0.01) -> dict:
"""
执行对抗训练
"""
optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5)
history = {
"train_loss": [],
"train_clean_acc": [],
"train_adv_acc": [],
"test_clean_acc": [],
"test_adv_acc": []
}
for epoch in range(epochs):
self.model.train()
train_loss = 0
clean_correct = 0
adv_correct = 0
total = 0
for images, labels in train_loader:
# 生成对抗样本
adversarial = self._generate_adversarial(images, labels)
# 在对抗样本上训练
optimizer.zero_grad()
# 计算干净样本和对抗样本的损失
clean_loss = F.cross_entropy(self.model(images), labels)
adv_loss = F.cross_entropy(self.model(adversarial), labels)
# 总损失(可以根据需要调整权重)
loss = (clean_loss + adv_loss) / 2
loss.backward()
optimizer.step()
train_loss += loss.item()
# 计算准确率
with torch.no_grad():
clean_pred = self.model(images).argmax(1)
adv_pred = self.model(adversarial).argmax(1)
clean_correct += (clean_pred == labels).sum().item()
adv_correct += (adv_pred == labels).sum().item()
total += labels.size(0)
scheduler.step()
# 评估
clean_acc, adv_acc = self._evaluate(test_loader)
history["train_loss"].append(train_loss / len(train_loader))
history["train_clean_acc"].append(clean_correct / total)
history["train_adv_acc"].append(adv_correct / total)
history["test_clean_acc"].append(clean_acc)
history["test_adv_acc"].append(adv_acc)
print(f"Epoch {epoch+1}/{epochs}")
print(f" Train Loss: {train_loss/len(train_loader):.4f}")
print(f" Clean Acc: {clean_acc:.4f}, Adv Acc: {adv_acc:.4f}")
return history
    def _evaluate(self, test_loader: DataLoader) -> Tuple[float, float]:
        """Evaluate clean and adversarial accuracy on the test set."""
        self.model.eval()
        clean_correct = 0
        adv_correct = 0
        total = 0
        for images, labels in test_loader:
            # The attack needs input gradients, so it must run outside no_grad
            adversarial = self._generate_adversarial(images, labels)
            with torch.no_grad():
                # Clean accuracy
                clean_pred = self.model(images).argmax(1)
                clean_correct += (clean_pred == labels).sum().item()
                # Adversarial accuracy
                adv_pred = self.model(adversarial).argmax(1)
                adv_correct += (adv_pred == labels).sum().item()
            total += labels.size(0)
        # Discard parameter gradients accumulated by the attack
        self.model.zero_grad()
        return clean_correct / total, adv_correct / total
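class TRADESDefense(AdversarialTraining):
    """
    TRADES defense (trade-off between accuracy and robustness)
    Paper: https://arxiv.org/abs/1901.08573
    Core idea:
    predictions on clean and adversarial inputs should stay close
    """
    def __init__(self, *args, beta: float = 6.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.beta = beta
    def trades_loss(self,
                    images: torch.Tensor,
                    labels: torch.Tensor) -> torch.Tensor:
        """
        Compute the TRADES loss: CE(f(x), y) + beta * KL(f(x) || f(x_adv)).
        """
        # Generate adversarial examples (simplification: this reuses the
        # cross-entropy attack; the paper maximizes the KL term instead)
        adversarial = self._generate_adversarial(images, labels)
        # Predictions on clean and adversarial inputs; both stay in the graph
        clean_output = self.model(images)
        adv_output = self.model(adversarial)
        # 1. Cross-entropy on clean inputs
        ce_loss = F.cross_entropy(clean_output, labels)
        # 2. KL divergence between clean and adversarial predictions.
        #    F.kl_div takes log-probabilities first and target probabilities second.
        kl_loss = F.kl_div(
            F.log_softmax(adv_output, dim=1),
            F.softmax(clean_output, dim=1),
            reduction='batchmean'
        )
        return ce_loss + self.beta * kl_loss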
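def demo_defense():
    """Defense methods demo"""
    print("Adversarial defense methods demo")
    print("=" * 60)
    print("""
Main defense methods:
1. Adversarial Training
- Train on adversarial examples
- Most effective in practice, but slow to train
2. Input Purification
- Preprocess the input
- JPEG compression, denoising, etc.
3. Defensive Distillation
- Train with soft labels
- Smooths the decision boundary
4. Certified Robustness
- Provides a provable lower bound
- Cannot be bypassed within the certified perturbation radius
Defense comparison (illustrative figures, not measured results):
Method          | Clean acc. | Adversarial acc.
----------------|-----------|-----------
Standard        | 95%       | 10%
PGD-AT          | 85%       | 75%
TRADES          | 87%       | 78%
""")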
if __name__ == "__main__":
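    demo_defense()
VIII. A Game-Theoretic View of Attack and Defense
8.1 The attack-defense game
Adversarial attack and defense can be modeled as a game:
Attacker: chooses the adversarial perturbation δ
Defender: chooses the model parameters θ
Adversarial training = solving the minimax problem:
min_θ max_{δ∈S} L(θ, x+δ, y)
This view yields several important insights:
- Nash equilibrium: at equilibrium neither side can improve by changing its strategy unilaterally
- Mixed strategies: a randomized defense is sometimes more effective than any fixed one
- Payoff design: how "successful defense" is defined shapes the final outcome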
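The difference between pure and mixed strategies shows up already in a 2x2 zero-sum game. The sketch below uses a hypothetical payoff matrix (rows are defender strategies, columns are attacker strategies, entries are the defender's payoff) and the standard closed-form solution for 2x2 games without a pure saddle point.

```python
def maximin(payoff):
    """Defender's pure-strategy security level: the best worst-case row."""
    return max(min(row) for row in payoff)


def minimax(payoff):
    """Attacker's pure-strategy security level: the column with the smallest best-case payoff."""
    columns = list(zip(*payoff))
    return min(max(col) for col in columns)


def solve_2x2_mixed(payoff):
    """Mixed equilibrium of a 2x2 zero-sum game that has no pure saddle point."""
    (a, b), (c, d) = payoff
    denom = a - b - c + d
    p_row0 = (d - c) / denom          # defender's probability of playing row 0
    q_col0 = (d - b) / denom          # attacker's probability of playing column 0
    value = (a * d - b * c) / denom   # expected defender payoff at equilibrium
    return p_row0, q_col0, value


payoff = [[3, 1],
          [0, 2]]                     # hypothetical defender payoffs
print("maximin:", maximin(payoff), "minimax:", minimax(payoff))
print("mixed equilibrium:", solve_2x2_mixed(payoff))
```

Because the maximin and minimax values differ, no pure-strategy equilibrium exists in this example, and the defender does better by randomizing: the equilibrium value lies strictly between the two pure security levels.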
8.2 Code for the game-theoretic view
"""
A game-theoretic view of adversarial attack and defense
"""
import numpy as np
from typing import List, Tuple
import torch
import torch.nn.functional as F
class AttackDefenseGame:
"""
攻防博弈模拟器
"""
def __init__(self,
epsilon: float = 0.1,
attack_cost: float = 0.0,
defense_cost: float = 0.1):
self.epsilon = epsilon
self.attack_cost = attack_cost
self.defense_cost = defense_cost
def payoff_matrix(self) -> np.ndarray:
"""
构造支付矩阵
行:防御者策略(epsilon)
列:攻击者策略(epsilon)
"""
epsilons = [0.0, 0.01, 0.03, 0.05, 0.1]
payoff = np.zeros((len(epsilons), len(epsilons)))
for i, def_eps in enumerate(epsilons):
for j, att_eps in enumerate(epsilons):
# 简化模型:攻击成功率与攻击强度正相关,与防御强度负相关
if att_eps <= def_eps:
# 防御成功
attack_payoff = -self.attack_cost
defense_payoff = 1 - self.defense_cost
else:
# 攻击成功
attack_payoff = 1 - self.attack_cost
defense_payoff = -self.defense_cost
payoff[i, j] = defense_payoff
return payoff
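    def mixed_strategy_nash(self, payoff: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute pure-strategy best responses, a first step toward an equilibrium.
        This simplified version does not solve for a mixed-strategy Nash equilibrium.
        payoff[i, j] is the defender's payoff for defense i against attack j.
        Returns:
            best_defense[j]: defender's best row against attack j
            best_attack[i]: attacker's best column against defense i
        """
        n_def, n_att = payoff.shape
        best_attack = np.zeros(n_def, dtype=int)
        best_defense = np.zeros(n_att, dtype=int)
        # Attacker's best response: the column that minimizes the defender's
        # payoff in row i (attack_cost is the same for every attack, so it
        # does not change the argmin)
        for i in range(n_def):
            best_attack[i] = np.argmin(payoff[i, :])
        # Defender's best response: the row that maximizes the payoff in column j
        for j in range(n_att):
            best_defense[j] = np.argmax(payoff[:, j])
        return best_defense, best_attack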
def analyze_robustness_tradeoff():
"""
分析鲁棒性和实用性之间的权衡
"""
print("鲁棒性与实用性权衡分析")
print("=" * 60)
print("""
对抗训练的双刃剑:
优点:
✓ 大幅提升对已知攻击的防御能力
✓ 模型学到更平滑的决策边界
✓ 对噪声更鲁棒
缺点:
✗ 降低干净样本上的准确率
✗ 训练时间显著增加
✗ 可能被新的攻击方法绕过
典型的准确率权衡:
干净样本准确率 vs 对抗准确率
标准训练: ████████████████████ 95% 干净 / 5% 对抗
PGD-AT: ████████████████ 85% 干净 / 70% 对抗
TRADES: ███████████████ 87% 干净 / 73% 对抗
""")
def demo_game_theory():
"""博弈论视角演示"""
print("对抗攻防的博弈论视角")
print("=" * 60)
game = AttackDefenseGame()
payoff = game.payoff_matrix()
print("\n支付矩阵(防御者视角):")
print(" 攻击强度")
print("防御 ", end="")
print(" ".join([f"{e:.2f}" for e in [0.0, 0.01, 0.03, 0.05, 0.1]]))
print("-" * 50)
for i, eps in enumerate([0.0, 0.01, 0.03, 0.05, 0.1]):
print(f"{eps:.2f} ", end="")
print(" ".join([f"{payoff[i,j]:.2f}" for j in range(5)]))
print("""
博弈分析:
1. 纯策略均衡:取决于成本参数
2. 混合策略:
防御者:选择合适的epsilon(如0.03)
攻击者:计算成本收益比决定是否攻击
3. 实际应用:
- 高价值目标:使用强防御
- 普通系统:使用适度防御 + 监控
""")
if __name__ == "__main__":
demo_game_theory()
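    analyze_robustness_tradeoff()
IX. Lessons from Practice
9.1 Common pitfalls
- Poorly chosen epsilon: too large and the model cannot learn, too small and the attack has no effect
- Vanishing or exploding gradients: may call for gradient clipping or careful initialization
- Random seeds matter: random starts make adversarial examples stochastic, so fix the seed for reproducibility
- Iterative vs. single-step attacks: iterative attacks such as PGD are usually more effective than single-step FGSM
9.2 Defense recommendations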
"""
防御最佳实践
"""
class DefenseBestPractices:
"""
防御最佳实践清单
"""
@staticmethod
def get_checklist():
return """
防御检查清单:
[ ] 基础防护
- 使用对抗训练 (PGD-AT 或 TRADES)
- 启用输入验证和清洗
- 限制模型输出的置信度
[ ] 监控与检测
- 监控输入分布变化
- 检测潜在的对抗输入
- 记录异常预测
[ ] 模型加固
- 使用模型集成
- 采用认证鲁棒性方法
- 定期用新攻击评估
[ ] 系统层面
- 输入预处理(JPEG压缩等)
- 多模型投票
- 异常检测集成
"""
@staticmethod
def recommended_config():
return {
"adversarial_training": {
"epsilon": 8/255, # ImageNet: 8/255
"num_iter": 7,
"alpha": 2/255,
"method": "pgd"
},
"input_purification": {
"jpeg_compression": 75,
"bit_depth_reduction": False,
"random_resize_padding": True
},
"model_ensemble": {
"n_models": 3,
"diversity": "architecture" # or "training_data"
}
}
# 快速参考表
QUICK_REFERENCE = """
================================================================================
对抗攻防快速参考
================================================================================
攻击方法选择:
┌────────────────┬────────────┬─────────────┬─────────────────────────┐
│ 方法 │ 白盒/黑盒 │ 计算成本 │ 适用场景 │
├────────────────┼────────────┼─────────────┼─────────────────────────┤
│ FGSM │ 白盒 │ ★☆☆ │ 快速baseline │
│ PGD │ 白盒 │ ★★★ │ 最强L∞攻击 │
│ C&W │ 白盒 │ ★★★★ │ 最小范数扰动 │
│ 迁移攻击 │ 黑盒 │ ★★☆ │ 不知道模型结构时 │
│ 对抗补丁 │ 白/黑盒 │ ★★★ │ 物理世界攻击 │
└────────────────┴────────────┴─────────────┴─────────────────────────┘
防御方法选择:
┌────────────────┬────────────┬─────────────┬─────────────────────────┐
│ 方法 │ 防御强度 │ 干净准确率 │ 备注 │
├────────────────┼────────────┼─────────────┼─────────────────────────┤
│ 无防御 │ ☆☆☆ │ 最高 │ 基准 │
│ 对抗训练(AT) │ ★★★ │ 降低5-10% │ 最有效 │
│ TRADES │ ★★★ │ 降低3-8% │ 比AT略好 │
│ 输入净化 │ ★★☆ │ 几乎不变 │ 辅助手段 │
│ 认证防御 │ ★★★ │ 降低10-20% │ 有理论保证 │
└────────────────┴────────────┴─────────────┴─────────────────────────┘
常用参数设置:
- ImageNet (224x224): ε = 8/255 ≈ 0.031
- CIFAR-10 (32x32): ε = 8/255 ≈ 0.031
- MNIST (28x28): ε = 0.3
================================================================================
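"""
X. Summary
10.1 Key takeaways
- The nature of adversarial examples: in high-dimensional spaces, tiny perturbations can change a model's output significantly
- FGSM/PGD: classic gradient-based attacks; PGD is the standard strong first-order L∞ attack
- C&W attack: optimizes the perturbation norm directly, yielding smaller perturbations that are harder to detect and defend against
- Transfer attacks: exploit cross-model transferability to attack in the black-box setting
- Adversarial patches: a local modification is enough to attack, and it works in the physical world
- Adversarial training: among the most effective defenses, at some cost in clean accuracy
- Game-theoretic view: attack and defense form a game whose equilibrium shapes the final outcome
10.2 Future trends
- Adaptive attacks and defenses: attacks and defenses keep evolving against each other
- Certified robustness: provable lower bounds on robustness
- Real-world attacks: research on physical adversarial examples is becoming more important
- AI security ecosystem: from securing a single model to securing the whole system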