Key Terms
| Term | Core Concept |
|---|---|
| Adversarial Example | A carefully crafted perturbed input that causes a model to misclassify |
| Fast Gradient Sign Method (FGSM) | A gradient-based, single-step attack |
| Projected Gradient Descent (PGD) | The representative iterative attack method |
| Carlini-Wagner (C&W) Attack | An optimization-based L2 attack |
| Perturbation | The small noise added to the original input |
| White-box Attack | An attack in which the adversary has full knowledge of the model parameters |
| Black-box Attack | An attack in which the adversary can only query the model's inputs and outputs |
| Adversarial Patch | A printable adversarial attack for the physical world |
| Evasion Attack | An attack that evades detection at inference time |
| Poisoning Attack | An attack that injects malicious samples at training time |
1. Introduction: The Discovery of Adversarial Examples
In 2013, Christian Szegedy et al. first formally described adversarial examples in the paper "Intriguing properties of neural networks." They uncovered a startling phenomenon: for a well-performing deep neural network, adding a perturbation that is almost imperceptible to humans to an input image can make the model output a wrong class with high confidence.
Historical Context
The discovery of adversarial examples overturned the conventional explanation that such failures are merely "overfitting" or "poor generalization." It revealed a deeper problem: the decision boundaries of modern neural networks contain systematic flaws, not just statistical error.
2. Mathematical Definition of Adversarial Examples
2.1 Formal Definition
Given a classifier $f$ and an original input $x$, an adversarial example $x' = x + \delta$ satisfies:

$$f(x') \neq f(x) \quad \text{subject to} \quad \|x' - x\|_p \leq \epsilon$$

where $\|\cdot\|_p$ denotes the $L_p$ norm and $\epsilon$ is the threshold below which the perturbation is imperceptible to humans.
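To make the definition concrete, here is a minimal check written against these two conditions; the helper name and the PyTorch `model`/tensor inputs are illustrative assumptions, not part of the original text.

import torch

def is_valid_adversarial_example(model, x, x_adv, epsilon, p=float("inf")):
    """Check both conditions of the definition: misclassification and a bounded L_p perturbation."""
    model.eval()
    with torch.no_grad():
        pred_clean = model(x.unsqueeze(0)).argmax(dim=1).item()
        pred_adv = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    delta = (x_adv - x).flatten()
    # L-infinity by default; any other p supported by torch.norm otherwise
    norm = delta.abs().max().item() if p == float("inf") else torch.norm(delta, p=p).item()
    return (pred_adv != pred_clean) and (norm <= epsilon)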
2.2 Why Adversarial Examples Exist
From the decision-boundary perspective, the existence of adversarial examples can be explained by approximately linear behavior in high-dimensional space. For a weight vector $w$ and a perturbed input $x + \delta$, the (locally linear) model output is:

$$w^\top (x + \delta) = w^\top x + w^\top \delta$$

If the components of $\delta$ share the sign of the corresponding components of $w$ (for example $\delta = \epsilon \cdot \text{sign}(w)$), then $w^\top \delta = \epsilon \|w\|_1$, which grows linearly with the input dimension. So even when $\epsilon$ is tiny, the accumulated change can be large enough to flip the classification.
import numpy as np
import torch
import torch.nn as nn

def explain_adversarial_examples(model, x, epsilon=0.1):
    """
    Numerical illustration of why adversarial examples exist.
    In high-dimensional space, even a perturbation that is tiny per pixel
    can accumulate into an effect large enough to flip the prediction.
    """
    # Get the model's original prediction
    model.eval()
    with torch.no_grad():
        original_pred = model(x.unsqueeze(0)).argmax(dim=1).item()
    # Get the input gradient (perturbation direction)
    x.requires_grad = True
    output = model(x.unsqueeze(0))
    loss = output[0, original_pred]
    loss.backward()
    gradient = x.grad.data
    # Perturbation along the gradient sign direction
    perturbation = epsilon * torch.sign(gradient)
    # Step against the gradient of the predicted-class logit to lower its score
    adversarial_x = x - perturbation
    with torch.no_grad():
        adversarial_pred = model(adversarial_x.unsqueeze(0)).argmax(dim=1).item()
    print(f"Original prediction: {original_pred}, adversarial prediction: {adversarial_pred}")
    print(f"L-inf perturbation norm: {torch.abs(perturbation).max().item():.4f}")
    return adversarial_x, perturbation
# Numerical demonstration of the linear effect in high-dimensional space
def high_dimensional_linear_effect():
    """
    In high-dimensional space:
    - the effect of a random-direction perturbation on a linear output grows only as O(sqrt(d))
    - components orthogonal to the weight direction contribute nothing
    - a sign-aligned perturbation pushes every dimension the same way, so its effect grows as O(d)
    This explains why a perturbation that is tiny per pixel can accumulate into a large effect.
    """
    d = 10000          # assume a 10000-dimensional input space
    num_samples = 1000
    epsilon = 0.01     # per-dimension (L-inf) perturbation budget
    w = np.random.randn(d)  # weights of a linear model
    # Random perturbations within the same L-inf budget
    random_perturbations = np.random.uniform(-epsilon, epsilon, size=(num_samples, d))
    random_effect = np.mean(np.abs(random_perturbations @ w))
    # Sign-aligned perturbation (every dimension pushes in the direction of w)
    aligned_perturbation = epsilon * np.sign(w)
    aligned_effect = np.abs(aligned_perturbation @ w)
    print(f"Average |w . delta| for random perturbations: {random_effect:.4f}")
    print(f"|w . delta| for the sign-aligned perturbation: {aligned_effect:.4f}")
    print(f"Ratio: {aligned_effect / random_effect:.2f}x")

3. FGSM: Fast Gradient Sign Method
3.1 How the Algorithm Works
Goodfellow et al. proposed FGSM (Fast Gradient Sign Method) in 2014, a computationally efficient single-step attack:

$$x_{adv} = x + \epsilon \cdot \text{sign}\left(\nabla_x J(\theta, x, y)\right)$$

where $J(\theta, x, y)$ is the loss function and $\nabla_x J$ is its gradient with respect to the input.
3.2 FGSM Implementation
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=0.03):
    """
    Fast Gradient Sign Method (FGSM) attack.
    Args:
        model: target classifier
        images: input images (B, C, H, W), values in [0, 1]
        labels: ground-truth labels (B,)
        epsilon: perturbation magnitude
    Returns:
        adversarial_images: the adversarial examples
        perturbation: the added perturbation
    """
    # Enable gradient computation with respect to the input
    images.requires_grad = True
    # Forward pass
    outputs = model(images)
    # Compute the loss
    loss = F.cross_entropy(outputs, labels)
    # Backward pass to obtain the input gradient
    model.zero_grad()
    loss.backward()
    # Build the perturbation from the gradient sign
    gradient = images.grad.data
    perturbation = epsilon * torch.sign(gradient)
    # Generate the adversarial examples
    adversarial_images = images + perturbation
    # Keep the perturbed images in the valid pixel range
    adversarial_images = torch.clamp(adversarial_images, 0, 1)
    return adversarial_images, perturbation
def fgsm_targeted_attack(model, images, target_labels, epsilon=0.03):
    """
    Targeted FGSM attack: push the model toward a chosen target class.
    """
    images.requires_grad = True
    outputs = model(images)
    # Targeted attack: step so as to decrease the cross-entropy toward the
    # target class (negating the loss flips the FGSM step direction)
    loss = -F.cross_entropy(outputs, target_labels)
    model.zero_grad()
    loss.backward()
    gradient = images.grad.data
    perturbation = epsilon * torch.sign(gradient)
    adversarial_images = images + perturbation
    adversarial_images = torch.clamp(adversarial_images, 0, 1)
    return adversarial_images, perturbation
FGSM Characteristics
- Computationally efficient: only one forward and one backward pass is needed
- Single-step: faster than iterative methods
- Interpretable: the perturbation direction is determined by the gradient of the loss
- Limitation: much less effective against defended models (e.g. adversarially trained ones); see the usage sketch below
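As a hedged usage sketch (not from the original text): the snippet below assumes a recent torchvision with pretrained weights and uses a random stand-in batch in [0, 1]; it simply counts how many of the model's own predictions are flipped by `fgsm_attack`.

import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # assumed pretrained model
model.eval()

images = torch.rand(8, 3, 224, 224)   # stand-in batch; replace with real data scaled to [0, 1]
with torch.no_grad():
    clean_pred = model(images).argmax(dim=1)

# Attack against the model's own predictions and count how many of them flip
adv_images, _ = fgsm_attack(model, images, clean_pred, epsilon=0.03)
with torch.no_grad():
    adv_pred = model(adv_images).argmax(dim=1)
print(f"Predictions changed by FGSM: {(clean_pred != adv_pred).sum().item()} / {len(images)}")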
4. PGD: Projected Gradient Descent Attack
4.1 How the Algorithm Works
The PGD (Projected Gradient Descent) attack is an iterative strengthening of FGSM and is widely regarded as the strongest first-order attack under an $L_\infty$ norm constraint:

$$x^{t+1} = \Pi_{\mathcal{S}}\left(x^{t} + \alpha \cdot \text{sign}\left(\nabla_x L(\theta, x^{t}, y)\right)\right)$$

where $\Pi_{\mathcal{S}}$ is the projection onto the allowed perturbation set $\mathcal{S}$ (the $\epsilon$-ball around the original input, intersected with the valid pixel range) and $\alpha$ is the step size.
4.2 PGD Implementation
def pgd_attack(model, images, labels, epsilon=0.03, alpha=0.003, num_iter=10,
               targeted=False, target_labels=None):
    """
    Projected Gradient Descent (PGD) attack.
    Args:
        model: target classifier
        images: original images
        labels: ground-truth labels
        epsilon: maximum perturbation norm
        alpha: step size per iteration
        num_iter: number of iterations
        targeted: whether to run a targeted attack
        target_labels: target labels for the targeted attack
    """
    # Keep the original images for the projection step
    original_images = images.detach().clone()
    # Initialization: random starting point (improves attack success rate)
    images = images.detach() + torch.zeros_like(images).uniform_(-epsilon, epsilon)
    images = torch.clamp(images, 0, 1)
    for i in range(num_iter):
        images.requires_grad = True
        outputs = model(images)
        if targeted:
            # Targeted attack: increase the probability of the target class
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            # Untargeted attack: decrease the probability of the true class
            loss = F.cross_entropy(outputs, labels)
        model.zero_grad()
        loss.backward()
        # Update along the gradient sign
        gradient = images.grad.data
        images = images.detach() + alpha * torch.sign(gradient)
        # Project back into the allowed L-inf ball and the valid pixel range
        images = torch.maximum(images, original_images - epsilon)
        images = torch.minimum(images, original_images + epsilon)
        images = torch.clamp(images, 0, 1)
    return images
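# Hedged usage sketch (not from the original text): measuring robust accuracy under the
# L-inf PGD attack above. `model`, `images`, `labels` are assumed to be a PyTorch
# classifier and a labeled batch in [0, 1]; the budget values mirror the defaults above.
def evaluate_pgd_robust_accuracy(model, images, labels, epsilon=0.03, alpha=0.003, num_iter=10):
    model.eval()
    adv_images = pgd_attack(model, images, labels, epsilon=epsilon, alpha=alpha, num_iter=num_iter)
    with torch.no_grad():
        clean_acc = (model(images).argmax(dim=1) == labels).float().mean().item()
        robust_acc = (model(adv_images).argmax(dim=1) == labels).float().mean().item()
    print(f"Clean accuracy: {clean_acc:.2%} | robust accuracy under PGD: {robust_acc:.2%}")
    return robust_acc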
def pgd_attack_l2(model, images, labels, epsilon=1.0, alpha=0.1, num_iter=10,
                  targeted=False, target_labels=None):
    """
    PGD attack under an L2 norm constraint.
    """
    original_images = images.detach().clone()
    # Random initialization inside the L2 ball
    delta = torch.zeros_like(images)
    delta.normal_()
    delta = delta / torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
    delta = delta * torch.rand(images.size(0), 1, 1, 1).to(images.device) * epsilon
    images = (original_images + delta).clamp(0, 1)
    for i in range(num_iter):
        images.requires_grad = True
        outputs = model(images)
        if targeted:
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            loss = F.cross_entropy(outputs, labels)
        model.zero_grad()
        loss.backward()
        gradient = images.grad.data
        # Normalize the gradient to unit L2 norm
        grad_norm = torch.sqrt(torch.sum(gradient ** 2, dim=(1, 2, 3), keepdim=True))
        gradient = gradient / (grad_norm + 1e-10)
        # Take a step, then project back onto the L2 ball
        images = images.detach() + alpha * gradient
        delta = images - original_images
        delta_norm = torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
        delta = delta / (delta_norm + 1e-10) * torch.clamp(delta_norm, max=epsilon)
        images = (original_images + delta).clamp(0, 1)
    return images

5. Carlini-Wagner Attack
5.1 How the Algorithm Works
The Carlini-Wagner (C&W) attack casts the attack as an optimization problem: minimize the perturbation magnitude while guaranteeing that the attack succeeds:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot f(x + \delta) \quad \text{subject to} \quad x + \delta \in [0, 1]^n$$

where $f$ is an auxiliary function that turns the classifier's output into a scalar measure of attack success:

$$f(x') = \max\left(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\right)$$

Here $Z(x')$ are the classifier's logits, $t$ is the target class, and $\kappa$ is a confidence parameter: the hinge saturates at $-\kappa$ once the target logit exceeds every other logit by at least $\kappa$, which encourages high-confidence adversarial examples.
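A tiny numeric illustration of this margin function (toy logits and a hypothetical helper, not from the original text): the value is positive while the target class is losing, and saturates at $-\kappa$ once the target wins by the required margin.

import torch

def cw_margin_f(logits, target, kappa=0.0):
    """C&W margin f for a single example under a targeted attack (illustrative helper)."""
    others = logits.clone()
    others[target] = float("-inf")                 # exclude the target class
    return max((others.max() - logits[target]).item(), -kappa)

logits = torch.tensor([2.0, 5.0, 1.0])             # toy logits for 3 classes
print(cw_margin_f(logits, target=2))               # 4.0: class 2 is still far from winning
print(cw_margin_f(logits, target=1))               # -0.0: class 1 already wins, hinge clips at -kappa = 0
print(cw_margin_f(logits, target=1, kappa=2.0))    # -2.0: wins by at least kappa, loss saturates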
5.2 C&W 攻击实现
class CWL2Attack:
"""
Carlini-Wagner L2 攻击实现
使用变量替换和重参数化技巧将约束优化问题转化为无约束优化
"""
def __init__(self, model, kappa=0, max_iter=1000, learning_rate=0.01):
self.model = model
self.kappa = kappa # 置信度参数
self.max_iter = max_iter
self.lr = learning_rate
def attack(self, images, target_labels, targeted=True):
"""
执行 C&W L2 攻击
"""
batch_size = images.size(0)
device = images.device
# 初始化扰动变量(使用 tanh 变换确保有界)
w = torch.zeros_like(images)
w.requires_grad = True
optimizer = torch.optim.Adam([w], lr=self.lr)
for iteration in range(self.max_iter):
optimizer.zero_grad()
# 重参数化:w -> delta(-1 到 1)
delta = 0.5 * (torch.tanh(w) + 1) - images
# 计算 logits
adv_images = images + delta
logits = self.model(adv_images)
# 辅助函数 f
if targeted:
# 定向攻击:使目标类别 logit 最大
one_hot = F.one_hot(target_labels, num_classes=logits.size(-1)).float()
other_logits = ((1 - one_hot) * logits - one_hot * 1e9)
f = torch.max(other_logits, dim=-1)[0] - logits.gather(1, target_labels.unsqueeze(1)).squeeze()
else:
# 非定向攻击:使真实类别 logit 最小
real_logits = logits.gather(1, target_labels.unsqueeze(1)).squeeze()
other_logits = torch.where(
torch.arange(logits.size(-1), device=device).unsqueeze(0) == target_labels.unsqueeze(1),
torch.tensor(-1e9, device=device),
logits
)
f = real_logits - torch.max(other_logits, dim=-1)[0]
# 扰动幅度(使用变换后的 delta)
delta_reshaped = delta.view(batch_size, -1)
perturbation_norm = torch.sum(delta_reshaped ** 2, dim=-1)
# 损失函数
loss = perturbation_norm + 0.01 * f.sum()
loss.backward()
optimizer.step()
# 生成最终对抗样本
delta = 0.5 * (torch.tanh(w.detach()) + 1) - images
adversarial_images = (images + delta).clamp(0, 1)
return adversarial_images, delta6. 对抗样本的物理世界攻击
6.1 Challenges of Physical Adversarial Examples
Adversarial examples are not limited to the digital domain: they can be printed out and threaten perception systems in the physical world. A physical attack must account for real-world factors such as camera distortion, lighting changes, and viewpoint variation.
class PhysicalAdversarialAttack:
    """
    Physical-world adversarial attack.
    Factors to account for:
    - color distortion after printing
    - non-linear camera sensor response
    - transformations across distances and viewing angles
    - random noise and blur
    """
    def __init__(self, epsilon=0.1, num_augmentations=20):
        self.epsilon = epsilon
        self.num_augmentations = num_augmentations

    def apply_physical_transform(self, images):
        """Simulate physical-world transformations."""
        batch_size = images.size(0)
        device = images.device
        # Random brightness adjustment
        brightness = torch.rand(batch_size, 1, 1, 1).to(device) * 0.4 + 0.8
        images = images * brightness
        # Random contrast adjustment
        contrast = torch.rand(batch_size, 1, 1, 1).to(device) * 0.4 + 0.8
        images = (images - 0.5) * contrast + 0.5
        # Random blur (simulating an out-of-focus camera)
        if torch.rand(1).item() > 0.5:
            kernel_size = 5
            # Approximate the blur with average pooling
            images = F.avg_pool2d(images, kernel_size, stride=1,
                                  padding=kernel_size // 2)
        return images.clamp(0, 1)

    def expectation_over_transformation(self, model, images, labels, criterion):
        """
        EOT (Expectation Over Transformation):
        optimize the adversarial perturbation so that it keeps working
        under a distribution of physical transformations.
        """
        images.requires_grad = True
        # Sample several transformations and average the loss
        total_loss = 0
        for _ in range(self.num_augmentations):
            transformed_images = self.apply_physical_transform(images)
            outputs = model(transformed_images)
            total_loss += criterion(outputs, labels)
        avg_loss = total_loss / self.num_augmentations
        return avg_loss

6.2 Adversarial Patches
An adversarial patch is an adversarial attack that can be printed and deployed in the physical world: placing a localized patch anywhere in the image is enough to fool the classifier.
class AdversarialPatchAttack:
    """
    Adversarial patch attack.
    Core idea:
    - generate a localized patch (it can be any shape)
    - the patch position can be random
    - optimize the patch pattern to maximize its "fooling power"
    """
    def __init__(self, model, patch_size=50, num_classes=1000):
        self.model = model
        self.patch_size = patch_size
        self.num_classes = num_classes

    def create_adversarial_patch(self, target_class, iterations=1000):
        """
        Generate an adversarial patch.
        Args:
            target_class: class the attack should force
            iterations: number of optimization steps
        """
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Random patch initialization
        patch = torch.rand(3, self.patch_size, self.patch_size).to(device)
        patch.requires_grad = True
        optimizer = torch.optim.Adam([patch], lr=0.1)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iterations)
        for i in range(iterations):
            optimizer.zero_grad()
            # Generate a random background image
            background = torch.rand(1, 3, 224, 224).to(device)
            # Pick a random patch location
            h, w = self.patch_size, self.patch_size
            top = torch.randint(0, 224 - h, (1,)).item()
            left = torch.randint(0, 224 - w, (1,)).item()
            # Apply the patch
            patched_images = background.clone()
            patched_images[:, :, top:top+h, left:left+w] = patch
            # Forward pass
            outputs = self.model(patched_images)
            # Targeted loss: minimize cross-entropy toward the target class
            # (i.e. maximize the target-class probability)
            loss = F.cross_entropy(outputs, torch.tensor([target_class]).to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
            # Project the patch back into the valid pixel range
            with torch.no_grad():
                patch.clamp_(0, 1)
            if (i + 1) % 100 == 0:
                prob = F.softmax(outputs, dim=-1)[0, target_class].item()
                print(f"Iter {i+1}, Target prob: {prob:.4f}")
        return patch.detach()

    def apply_patch_to_image(self, image, patch, location='random'):
        """Apply the patch to an image of shape (C, H, W)."""
        _, h, w = image.shape
        patch_h, patch_w = patch.shape[1:]
        if location == 'random':
            top = torch.randint(0, h - patch_h, (1,)).item()
            left = torch.randint(0, w - patch_w, (1,)).item()
        else:
            top, left = location
        patched_image = image.clone()
        patched_image[:, top:top+patch_h, left:left+patch_w] = patch
        return patched_image

Threat Level of Physical Attacks
Adversarial patch attacks have been demonstrated in multiple studies:
- a patch stuck on a stop sign can prevent an autonomous-driving system from recognizing it
- specially crafted eyeglasses can make a face recognition system identify the wearer as a chosen target
- a printed patch can force an image classifier to output a wrong, attacker-chosen class
7. Advanced Adversarial Attack Techniques
7.1 HopSkipJumpAttack
HopSkipJumpAttack is a black-box attack that only needs to query the model's output probabilities or predicted labels:
def hopskipjump_attack(model, original_image, target_class=None,
                       max_queries=10000, epsilon=1.0):
    """
    HopSkipJumpAttack-style decision-based black-box attack (simplified sketch).
    Properties:
    - only needs the model's predicted class (hard label)
    - uses binary search to estimate the decision boundary
    - relatively query-efficient
    """
    device = original_image.device
    dim = original_image.view(-1).shape[0]
    # Prediction on the clean image (only hard-label queries are used from here on)
    with torch.no_grad():
        original_pred = model(original_image.unsqueeze(0)).argmax().item()
    # Initialization: Gaussian noise
    perturbation = torch.randn_like(original_image) * 0.5
    perturbed = (original_image + perturbation).clamp(0, 1)
    for query in range(max_queries):
        # Query the decision for the current perturbed image
        with torch.no_grad():
            current_pred = model(perturbed.unsqueeze(0)).argmax().item()
        # Stop once the attack goal is reached
        if target_class is not None and current_pred == target_class:
            break
        elif target_class is None and current_pred != original_pred:
            break
        # Step-size search (binary-search style): try shrinking the perturbation;
        # keep the shrink only if the example stays adversarial, otherwise shrink less
        step_size = epsilon / 2
        for _ in range(10):
            test_perturbation = perturbation * (1 - step_size)
            test_perturbed = (original_image + test_perturbation).clamp(0, 1)
            with torch.no_grad():
                test_pred = model(test_perturbed.unsqueeze(0)).argmax().item()
            still_adversarial = (target_class is not None and test_pred == target_class) or \
                                (target_class is None and test_pred != original_pred)
            if still_adversarial:
                perturbation = test_perturbation
                perturbed = test_perturbed
                break
            else:
                step_size *= 0.5
        # Gradient-direction estimate from hard labels (finite differences
        # over a random subset of coordinates)
        delta = 0.001
        gradient_estimate = torch.zeros_like(original_image)
        for _ in range(min(dim, 100)):  # sample random coordinates
            idx = torch.randint(0, dim, (1,)).item()
            pos_perturbation = perturbation.clone().view(-1)
            pos_perturbation[idx] += delta
            pos_perturbed = (original_image + pos_perturbation.view_as(original_image)).clamp(0, 1)
            with torch.no_grad():
                pos_pred = model(pos_perturbed.unsqueeze(0)).argmax().item()
            # Mark coordinates whose increase pushes the image across the boundary
            is_adversarial = (target_class is not None and pos_pred == target_class) or \
                             (target_class is None and pos_pred != original_pred)
            gradient_estimate.view(-1)[idx] = 1.0 if is_adversarial else 0.0
        # Update the perturbation along the estimated direction and clip to the budget
        perturbation = perturbation + 0.01 * gradient_estimate
        perturbation = torch.clamp(perturbation, -epsilon, epsilon)
        perturbed = (original_image + perturbation).clamp(0, 1)
    return perturbed, perturbation

8. References
- Szegedy, C., et al. (2013). “Intriguing properties of neural networks.” arXiv:1312.6199.
- Goodfellow, I. J., et al. (2015). “Explaining and Harnessing Adversarial Examples.” ICLR.
- Madry, A., et al. (2018). “Towards Deep Learning Models Resistant to Adversarial Attacks.” ICLR.
- Carlini, N., & Wagner, D. (2017). “Towards Evaluating the Robustness of Neural Networks.” IEEE S&P.
- Kurakin, A., et al. (2016). “Adversarial examples in the physical world.” ICLR Workshop.
- Brown, T. B., et al. (2017). “Adversarial Patch.” arXiv:1712.09665.
- Chen, J., Jordan, M. I., & Wainwright, M. J. (2020). “HopSkipJumpAttack: A Query-Efficient Decision-Based Attack.” IEEE S&P.