关键词
| 术语 | 英文 | 核心概念 |
|---|---|---|
| 对抗样本 | Adversarial Example | 精心设计的导致模型误分类的扰动输入 |
| 快速梯度符号法 | FGSM | 基于梯度的单步攻击方法 |
| 投影梯度下降 | PGD | 迭代式攻击的代表性方法 |
| Carlini-Wagner攻击 | C&W Attack | 优化的 L2 攻击方法 |
| 扰动 | Perturbation | 添加到原始输入的微小噪声 |
| 白盒攻击 | White-box Attack | 攻击者完全了解模型参数的攻击 |
| 黑盒攻击 | Black-box Attack | 攻击者只能访问模型输入输出的攻击 |
| 对抗补丁 | Adversarial Patch | 物理世界中可打印的对抗攻击 |
| 规避攻击 | Evasion Attack | 在推理阶段规避检测的攻击 |
| 投毒攻击 | Poisoning Attack | 在训练阶段注入恶意样本的攻击 |
| 对抗训练 | Adversarial Training | 使用对抗样本增强模型鲁棒性 |
| 蒸馏防御 | Defensive Distillation | 通过知识蒸馏提高模型平滑性 |
| 输入变换 | Input Transformation | 对输入进行预处理防御 |
| 可证明鲁棒性 | Certified Robustness | 可证明防御边界的理论保证 |
1. 引言:对抗样本的发现
2013 年,Christian Szegedy 等人在论文《Intriguing properties of neural networks》中首次正式定义了对抗样本(Adversarial Examples)这一概念。他们发现了一个令人震惊的现象:对于一个表现良好的深度神经网络,可以通过对输入图像添加人类几乎无法察觉的微小扰动,使得模型以高置信度输出错误的分类结果。
历史背景
对抗样本的发现打破了人们对深度学习“过拟合”或“泛化能力不足”的传统认知。它揭示了一个更深层次的问题:现代神经网络的决策边界存在系统性缺陷,而非简单的统计误差。
1.1 对抗样本的直观理解
对抗样本可以被理解为在高维输入空间中,沿着梯度的微小方向“轻轻一推”,模型就完全“跌落”到了错误的分类区域。这个现象有以下几个关键特点:
1. 普遍性:几乎所有现代神经网络都存在对抗样本。无论网络架构多复杂、训练数据多丰富,对抗样本都能被构造出来。
2. 迁移性:在一个模型上生成的对抗样本,往往也能欺骗其他结构不同的模型。这为黑盒攻击提供了可能性。
3. 人类不可察觉性:添加的扰动通常是极其微小的,在视觉上几乎无法被察觉。但对于神经网络而言,这些微小变化足以导致完全不同的输出。
1.2 对抗样本的分类体系
对抗样本可以从多个维度进行分类:
按攻击者知识分类:
- 白盒攻击(White-box):攻击者知道模型的完整信息(架构、参数、梯度等)
- 黑盒攻击(Black-box):攻击者只知道模型的输入输出
- 灰盒攻击(Gray-box):攻击者知道部分信息
按攻击目标分类:
- 非定向攻击(Untargeted):使模型产生任意错误预测
- 定向攻击(Targeted):使模型产生特定的目标预测
按扰动约束分类:
- $L_\infty$ 约束:所有像素的扰动幅度均不超过 $\epsilon$,即 $\|\delta\|_\infty \le \epsilon$
- $L_2$ 约束:扰动的欧几里得范数不超过 $\epsilon$,即 $\|\delta\|_2 \le \epsilon$
- $L_0$ 约束:只修改尽可能少的像素,即限制 $\|\delta\|_0$
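为直观起见,下面给出一段计算三种范数的示意代码(其中的扰动张量 `delta` 仅为构造的示例输入):

```python
import torch

def perturbation_norms(delta: torch.Tensor):
    """计算扰动的 L∞、L2、L0 度量(假设 delta 形状为 (C, H, W))"""
    flat = delta.flatten()
    return {
        "linf": flat.abs().max().item(),       # 最大单元素改动幅度
        "l2": flat.norm(p=2).item(),           # 欧几里得范数
        "l0": int((flat != 0).sum().item()),   # 被修改的元素个数
    }

# 示例:一个只修改单个像素位置的稀疏扰动
delta = torch.zeros(3, 32, 32)
delta[:, 0, 0] = 0.5
print(perturbation_norms(delta))  # L0 很小,但 L∞ 相对较大
```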
2. 对抗样本的数学定义
2.1 形式化定义
给定一个分类器 $f$ 和原始输入 $x$,对抗样本 $x' = x + \delta$ 满足以下条件:

$$f(x') \ne f(x), \quad \|x' - x\|_p \le \epsilon$$

其中 $\|\cdot\|_p$ 表示 $L_p$ 范数,$\epsilon$ 是人类感知阈值。

对于定向攻击,设目标类别为 $t$,则需要满足:

$$f(x') = t, \quad \|x' - x\|_p \le \epsilon$$
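作为对上述定义的程序化注解,下面给出一个判断某个样本是否满足非定向对抗样本条件的最小示意函数(其中的 `model`、`x`、`x_adv` 均为调用方提供的 PyTorch 对象,属于示例假设):

```python
import torch

def is_adversarial(model, x, x_adv, epsilon, p=float("inf")):
    """检查 x_adv 是否满足非定向对抗样本的定义:预测改变且扰动在 Lp 预算内"""
    model.eval()
    with torch.no_grad():
        pred_clean = model(x.unsqueeze(0)).argmax(dim=1)
        pred_adv = model(x_adv.unsqueeze(0)).argmax(dim=1)
    within_budget = torch.norm((x_adv - x).flatten(), p=p) <= epsilon
    return bool((pred_clean != pred_adv).item()) and bool(within_budget.item())
```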
2.2 对抗样本的存在性解释
从决策边界角度,对抗样本的存在可以用高维空间的线性近似来解释。对于一个权重向量 $w$ 和受扰动的输入 $x + \delta$,模型(线性部分)的输出为:

$$w^\top (x + \delta) = w^\top x + w^\top \delta$$

如果 $\delta$ 与 $w$ 逐元素符号相同(例如取 $\delta = \epsilon \cdot \text{sign}(w)$),则 $w^\top \delta = \epsilon \|w\|_1$ 随输入维度线性增长。因此即使 $\|\delta\|_\infty$ 极小,在高维空间中累积的效应也足以改变分类结果。
import numpy as np
import torch
import torch.nn as nn
def explain_adversarial_examples(model, x, epsilon=0.1):
"""
解释对抗样本存在性的数值示例
在高维空间中,即使扰动很小,
累积效应也足以改变分类结果
"""
# 获取模型预测
model.eval()
with torch.no_grad():
original_pred = model(x.unsqueeze(0)).argmax(dim=1).item()
# 获取输入梯度(扰动方向)
x.requires_grad = True
output = model(x.unsqueeze(0))
loss = output[0, original_pred]
loss.backward()
gradient = x.grad.data
# 计算梯度方向上的扰动
perturbation = epsilon * torch.sign(gradient)
adversarial_x = x + perturbation
with torch.no_grad():
adversarial_pred = model(adversarial_x.unsqueeze(0)).argmax(dim=1).item()
print(f"原始预测: {original_pred}, 对抗样本预测: {adversarial_pred}")
print(f"扰动范数 L∞: {torch.abs(perturbation).max().item():.4f}")
return adversarial_x, perturbation
# 高维空间线性效应的数学演示
def high_dimensional_linear_effect():
"""
在高维空间中:
- 随机方向扰动的期望范数增长为 O(√d)
- 但正交于特征方向的扰动可以被忽略
- 梯度的符号方向总是"对齐"的
这解释了为什么微小扰动能累积成大影响
"""
d = 10000 # 假设10000维空间
num_samples = 1000
# 随机扰动的平均绝对值
random_perturbations = np.random.randn(num_samples, d)
avg_magnitude = np.mean(np.abs(random_perturbations))
# 梯度方向扰动(所有维度对齐)
aligned_perturbations = np.ones((num_samples, d)) * 0.01
aligned_magnitude = np.mean(np.abs(aligned_perturbations))
print(f"随机扰动平均幅度: {avg_magnitude:.6f}")
print(f"对齐扰动平均幅度: {aligned_magnitude:.4f}")
print(f"比率: {aligned_magnitude / avg_magnitude:.2f}x")2.3 决策边界与对抗样本
对抗样本的存在与神经网络的决策边界结构密切相关。在高维空间中,决策边界往往呈现出复杂的几何结构,导致存在“尖锐”的角落:
class DecisionBoundaryAnalyzer:
"""
决策边界分析器
分析神经网络决策边界的几何特性
"""
def __init__(self, model):
self.model = model
self.model.eval()
def compute_curvature(self, x, epsilon=0.01, num_directions=100):
"""
计算决策边界的曲率
高曲率区域更可能产生对抗样本
"""
curvatures = []
for _ in range(num_directions):
# 随机方向
direction = torch.randn_like(x)
direction = direction / direction.norm()
# 计算沿方向的二阶导数
x_plus = (x + epsilon * direction).requires_grad_(True)
x_minus = (x - epsilon * direction).requires_grad_(True)
# 一阶导数
loss_plus = self.model(x_plus).max()
loss_minus = self.model(x_minus).max()
grad_plus = torch.autograd.grad(loss_plus, x_plus)[0]
grad_minus = torch.autograd.grad(loss_minus, x_minus)[0]
# 二阶差分近似曲率
curvature = torch.norm(grad_plus - grad_minus) / (2 * epsilon)
curvatures.append(curvature.item())
return np.mean(curvatures)
def find_adversarial_direction(self, x, target_class):
"""
找到指向目标类别的对抗方向
返回:
- 对抗方向的单位向量
- 到达决策边界的估计距离
"""
x.requires_grad = True
# 获取目标类别的logit
output = self.model(x.unsqueeze(0))
target_logit = output[0, target_class]
# 梯度指向增加目标logit的方向
grad = torch.autograd.grad(target_logit, x)[0]
        return grad / (grad.norm() + 1e-8)

3. FGSM:快速梯度符号法
3.1 算法原理
Goodfellow 等人在 2014 年提出 FGSM(Fast Gradient Sign Method),这是一种计算高效的单步攻击方法:
$$x' = x + \epsilon \cdot \text{sign}\left(\nabla_x J(\theta, x, y)\right)$$

其中 $J(\theta, x, y)$ 是损失函数,$\nabla_x J$ 是损失对输入的梯度,$\epsilon$ 控制扰动幅度。
3.2 FGSM 实现
import torch
import torch.nn.functional as F
def fgsm_attack(model, images, labels, epsilon=0.03):
"""
快速梯度符号法(FGSM)攻击
参数:
model: 目标分类器
images: 输入图像 (B, C, H, W)
labels: 真实标签 (B,)
epsilon: 扰动幅度
返回:
adversarial_images: 对抗样本
"""
    # 让输入图像参与梯度计算
    images = images.clone().detach().requires_grad_(True)
# 前向传播
outputs = model(images)
# 计算损失
loss = F.cross_entropy(outputs, labels)
# 反向传播获取梯度
model.zero_grad()
loss.backward()
# 获取梯度并计算扰动
gradient = images.grad.data
perturbation = epsilon * torch.sign(gradient)
# 生成对抗样本
adversarial_images = images + perturbation
# 确保扰动后图像仍在有效范围内
adversarial_images = torch.clamp(adversarial_images, 0, 1)
return adversarial_images, perturbation
def fgsm_targeted_attack(model, images, target_labels, epsilon=0.03):
"""
定向 FGSM 攻击:使模型输出特定目标类别
"""
images.requires_grad = True
outputs = model(images)
# 定向攻击:最大化目标类别的损失
loss = -F.cross_entropy(outputs, target_labels)
model.zero_grad()
loss.backward()
gradient = images.grad.data
perturbation = epsilon * torch.sign(gradient)
adversarial_images = images + perturbation
adversarial_images = torch.clamp(adversarial_images, 0, 1)
    return adversarial_images, perturbation

FGSM 的特点
- 计算效率高:只需一次前向和反向传播
- 单步攻击:相比迭代方法更快
- 可解释性强:扰动方向由损失函数的梯度决定
- 局限性:对于使用防御技术(如对抗训练)的模型效果有限
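下面是一个调用上文 `fgsm_attack` 的最小使用示例(模型与数据均为随机构造的示意对象,实际使用时请替换为真实的分类器与数据):

```python
import torch
import torch.nn as nn

# 示意用的小型分类器:假设输入为 3x32x32 图像,共 10 个类别
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

images = torch.rand(4, 3, 32, 32)       # 一小批"干净"图像
labels = torch.randint(0, 10, (4,))     # 对应的真实标签

adv_images, perturbation = fgsm_attack(model, images, labels, epsilon=0.03)

with torch.no_grad():
    clean_pred = model(images).argmax(dim=1)
    adv_pred = model(adv_images).argmax(dim=1)
print("预测发生改变的样本数:", (clean_pred != adv_pred).sum().item())
print("L∞ 扰动幅度:", perturbation.abs().max().item())
```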
4. PGD:投影梯度下降攻击
4.1 算法原理
PGD(Projected Gradient Descent)攻击是 FGSM 的迭代增强版本,被广泛认为是最强的一阶 $L_\infty$ 范数约束攻击:

$$x^{t+1} = \Pi_{x + \mathcal{S}}\left(x^{t} + \alpha \cdot \text{sign}\left(\nabla_x J(\theta, x^{t}, y)\right)\right)$$

其中 $\Pi_{x + \mathcal{S}}$ 是投影到允许扰动集合 $\mathcal{S}$ 的操作,$\alpha$ 是步长。
4.2 PGD 实现
def pgd_attack(model, images, labels, epsilon=0.03, alpha=0.003, num_iter=10,
targeted=False, target_labels=None):
"""
投影梯度下降(PGD)攻击
参数:
model: 目标分类器
images: 原始图像
labels: 真实标签
epsilon: 最大扰动范数
alpha: 每次迭代的步长
num_iter: 迭代次数
targeted: 是否为定向攻击
target_labels: 定向攻击的目标标签
"""
# 保存原始图像
original_images = images.detach().clone()
# 初始化:随机起点(增加攻击成功率)
images = images.detach() + torch.zeros_like(images).uniform_(-epsilon, epsilon)
images = torch.clamp(images, 0, 1)
for i in range(num_iter):
images.requires_grad = True
outputs = model(images)
if targeted:
# 定向攻击:最大化目标类别概率
loss = -F.cross_entropy(outputs, target_labels)
else:
# 非定向攻击:最小化真实类别概率
loss = F.cross_entropy(outputs, labels)
model.zero_grad()
loss.backward()
# 更新扰动
gradient = images.grad.data
images = images.detach() + alpha * torch.sign(gradient)
# 投影到允许范围
images = torch.maximum(images, original_images - epsilon)
images = torch.minimum(images, original_images + epsilon)
images = torch.clamp(images, 0, 1)
return images
def pgd_attack_l2(model, images, labels, epsilon=1.0, alpha=0.1, num_iter=10,
targeted=False, target_labels=None):
"""
L2 范数约束的 PGD 攻击
"""
original_images = images.detach().clone()
# 随机初始化
delta = torch.zeros_like(images)
delta.normal_()
delta = delta / torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
delta = delta * torch.rand(images.size(0), 1, 1, 1).to(images.device) * epsilon
images = (original_images + delta).clamp(0, 1)
for i in range(num_iter):
images.requires_grad = True
outputs = model(images)
if targeted:
loss = -F.cross_entropy(outputs, target_labels)
else:
loss = F.cross_entropy(outputs, labels)
model.zero_grad()
loss.backward()
gradient = images.grad.data
# L2 归一化梯度方向
grad_norm = torch.sqrt(torch.sum(gradient ** 2, dim=(1, 2, 3), keepdim=True))
gradient = gradient / (grad_norm + 1e-10)
# 更新并投影到 L2 球
images = images.detach() + alpha * gradient
delta = images - original_images
delta_norm = torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
        delta = delta / (delta_norm + 1e-10) * delta_norm.clamp(max=epsilon)
images = (original_images + delta).clamp(0, 1)
    return images

5. Carlini-Wagner 攻击
5.1 算法原理
Carlini-Wagner(C&W)攻击是优化框架下的攻击方法,通过最小化扰动幅度同时保证攻击成功:
其中 是分类器输出转化为标量的辅助函数:
是分类器的 logits 输出, 是目标类别, 是置信度参数。
5.2 C&W 攻击实现
class CWL2Attack:
"""
Carlini-Wagner L2 攻击实现
使用变量替换和重参数化技巧将约束优化问题转化为无约束优化
"""
def __init__(self, model, kappa=0, max_iter=1000, learning_rate=0.01):
self.model = model
self.kappa = kappa # 置信度参数
self.max_iter = max_iter
self.lr = learning_rate
def attack(self, images, target_labels, targeted=True):
"""
执行 C&W L2 攻击
"""
batch_size = images.size(0)
device = images.device
        # 初始化优化变量 w(tanh 变量替换确保对抗图像始终落在 [0,1] 内)
        # 令初始 w 对应原图,使 0.5 * (tanh(w) + 1) = images,即初始扰动为零
        w = torch.atanh((images * 2 - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach()
        w.requires_grad = True
optimizer = torch.optim.Adam([w], lr=self.lr)
for iteration in range(self.max_iter):
optimizer.zero_grad()
# 重参数化:w -> delta(-1 到 1)
delta = 0.5 * (torch.tanh(w) + 1) - images
# 计算 logits
adv_images = images + delta
logits = self.model(adv_images)
# 辅助函数 f
if targeted:
# 定向攻击:使目标类别 logit 最大
one_hot = F.one_hot(target_labels, num_classes=logits.size(-1)).float()
other_logits = ((1 - one_hot) * logits - one_hot * 1e9)
f = torch.max(other_logits, dim=-1)[0] - logits.gather(1, target_labels.unsqueeze(1)).squeeze()
else:
# 非定向攻击:使真实类别 logit 最小
real_logits = logits.gather(1, target_labels.unsqueeze(1)).squeeze()
other_logits = torch.where(
torch.arange(logits.size(-1), device=device).unsqueeze(0) == target_labels.unsqueeze(1),
torch.tensor(-1e9, device=device),
logits
)
f = real_logits - torch.max(other_logits, dim=-1)[0]
# 扰动幅度(使用变换后的 delta)
delta_reshaped = delta.view(batch_size, -1)
perturbation_norm = torch.sum(delta_reshaped ** 2, dim=-1)
        # 损失函数:扰动幅度 + c * 攻击项(kappa 控制攻击项的下界,即置信度)
        loss = perturbation_norm.sum() + 0.01 * torch.clamp(f, min=-self.kappa).sum()
loss.backward()
optimizer.step()
# 生成最终对抗样本
delta = 0.5 * (torch.tanh(w.detach()) + 1) - images
adversarial_images = (images + delta).clamp(0, 1)
        return adversarial_images, delta

6. DeepFool 攻击
6.1 算法原理
DeepFool 由 Seyed-Mohsen Moosavi-Dezfooli 等人在 2016 年提出,是一种迭代攻击方法:每一步把样本向最近的(线性化)决策边界推进,从而逼近最小扰动:
class DeepFool:
"""
DeepFool 攻击
原理:
1. 找到当前点所属决策区域
2. 计算到最近决策边界的距离
3. 沿法向量方向进行最小步长移动
4. 重复直到分类改变
"""
def __init__(self, model, num_classes=10, overshoot=0.02, max_iter=50):
self.model = model
self.num_classes = num_classes
self.overshoot = overshoot
self.max_iter = max_iter
def attack(self, images, labels):
"""
执行 DeepFool 攻击
"""
batch_size = images.size(0)
device = images.device
adversarial_images = images.clone().detach()
perturbed = torch.zeros_like(images)
for idx in range(batch_size):
x = images[idx:idx+1].clone().detach().requires_grad_(True)
original_label = labels[idx].item()
current_label = original_label
iteration = 0
while current_label == original_label and iteration < self.max_iter:
iteration += 1
# 获取模型输出
output = self.model(x)
                # 计算原类别 logit 对输入的梯度
                self.model.zero_grad()
                x.grad = None
                output[0, original_label].backward(retain_graph=True)
                grad_original = x.grad.data.clone()
                # 寻找最小扰动方向
                min_dist = float('inf')
                min_perturbation = None
                min_class = None
                for class_idx in range(self.num_classes):
                    if class_idx == original_label:
                        continue
                    # 计算目标类别 logit 的梯度(先清空上一次累积的梯度)
                    self.model.zero_grad()
                    x.grad = None
                    output[0, class_idx].backward(retain_graph=True)
                    grad_target = x.grad.data.clone()
                    # 扰动方向:指向该类决策边界的法向量 w_k = grad_k - grad_orig
                    perturbation = grad_target - grad_original
                    # 到该边界的线性化距离 |f_orig - f_k| / ||w_k||
                    dist = torch.abs(output[0, original_label] - output[0, class_idx]) / (
                        torch.norm(perturbation) + 1e-10
                    )
                    if dist < min_dist:
                        min_dist = dist
                        min_perturbation = perturbation
                        min_class = class_idx
# 应用扰动
if min_perturbation is not None:
r = (min_dist + 1e-4) * (min_perturbation / torch.norm(min_perturbation))
x = (x + (1 + self.overshoot) * r).detach().requires_grad_(True)
# 检查是否改变分类
with torch.no_grad():
current_label = self.model(x).argmax(dim=1).item()
# 保存对抗样本
adversarial_images[idx] = x.squeeze()
perturbed[idx] = x.squeeze() - images[idx]
        return adversarial_images, perturbed

7. EOT:变换期望攻击
7.1 算法原理
EOT(Expectation Over Transformation)攻击通过在多种图像变换下优化对抗扰动,使攻击对物理变换具有鲁棒性:
class EOTAttack:
"""
EOT(Expectation Over Transformation,变换期望)攻击
核心思想:
在多种随机变换下优化对抗扰动
使得变换后的对抗样本仍然具有攻击性
适用于:
- 物理世界攻击
- 对抗补丁
- 相机传感器攻击
"""
def __init__(self, model, transformations, epsilon=0.1, num_iter=100):
self.model = model
self.transformations = transformations
self.epsilon = epsilon
self.num_iter = num_iter
    def apply_random_transform(self, images):
        """对每张图像随机选择并应用一种变换,保持批大小不变"""
        transformed = []
        for img in images:
            idx = torch.randint(0, len(self.transformations), (1,)).item()
            transformed.append(self.transformations[idx](img))
        return torch.stack(transformed)
def compute_eot_gradient(self, images, labels, target_labels=None):
"""
计算 EOT 梯度
对所有可能的变换计算期望梯度
"""
total_grad = torch.zeros_like(images)
for _ in range(10): # 采样次数
# 应用随机变换
transformed_images = self.apply_random_transform(images)
# 计算梯度
transformed_images.requires_grad = True
outputs = self.model(transformed_images)
            if target_labels is not None:
                loss = -F.cross_entropy(outputs, target_labels)
            else:
                loss = F.cross_entropy(outputs, labels)
loss.backward()
total_grad += transformed_images.grad.data
return total_grad / 10
def attack(self, images, labels, target_labels=None):
"""执行 EOT 攻击"""
adversarial_images = images.clone()
for iteration in range(self.num_iter):
# 计算 EOT 梯度
grad = self.compute_eot_gradient(adversarial_images, labels, target_labels)
            # 沿期望梯度的符号方向小步更新(步长取 epsilon / num_iter)
            step = (self.epsilon / self.num_iter) * torch.sign(grad)
            adversarial_images = adversarial_images + step
            # 投影回原图的 epsilon 球并裁剪到有效像素范围
            adversarial_images = torch.clamp(adversarial_images, images - self.epsilon, images + self.epsilon)
            adversarial_images = torch.clamp(adversarial_images, 0, 1)
        return adversarial_images

8. 对抗补丁(Adversarial Patch)
8.1 补丁攻击的原理
对抗补丁是一种可以在物理世界中打印和使用的对抗攻击,通过在图像任意位置放置一个局部补丁来欺骗分类器。
class AdversarialPatchAttack:
"""
对抗补丁攻击
核心思想:
- 生成一个局部补丁(可以是任意形状)
- 补丁位置可以是随机的
- 优化补丁图案使其具有最大的"欺骗能力"
"""
def __init__(self, model, patch_size=50, num_classes=1000):
self.model = model
self.patch_size = patch_size
self.num_classes = num_classes
def create_adversarial_patch(self, target_class, iterations=1000):
"""
生成对抗补丁
参数:
target_class: 目标攻击类别
iterations: 优化迭代次数
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 随机初始化补丁
patch = torch.rand(3, self.patch_size, self.patch_size).to(device)
patch.requires_grad = True
optimizer = torch.optim.Adam([patch], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iterations)
for i in range(iterations):
optimizer.zero_grad()
# 生成随机背景图像
background = torch.rand(1, 3, 224, 224).to(device)
# 随机放置补丁位置
h, w = self.patch_size, self.patch_size
top = torch.randint(0, 224 - h, (1,)).item()
left = torch.randint(0, 224 - w, (1,)).item()
# 应用补丁
patched_images = background.clone()
patched_images[:, :, top:top+h, left:left+w] = torch.sigmoid(patch)
# 前向传播
outputs = self.model(patched_images)
# 定向损失:最大化目标类别概率
loss = -F.cross_entropy(outputs, torch.tensor([target_class]).to(device))
loss.backward()
optimizer.step()
scheduler.step()
# 裁剪补丁值到有效范围
with torch.no_grad():
patch.clamp_(0, 1)
if (i + 1) % 100 == 0:
prob = F.softmax(outputs, dim=-1)[0, target_class].item()
print(f"Iter {i+1}, Target prob: {prob:.4f}")
return patch.detach()
def apply_patch_to_image(self, image, patch, location='random'):
"""将补丁应用到图像"""
_, _, h, w = image.shape
patch_h, patch_w = patch.shape[1:]
if location == 'random':
top = torch.randint(0, h - patch_h, (1,)).item()
left = torch.randint(0, w - patch_w, (1,)).item()
else:
top, left = location
patched_image = image.clone()
patched_image[:, top:top+patch_h, left:left+patch_w] = patch
return patched_image
class RobustAdversarialPatch:
"""
鲁棒对抗补丁
增强补丁对以下变换的鲁棒性:
- 旋转
- 缩放
- 亮度变化
- 对比度变化
"""
def __init__(self, model, patch_shape=(50, 50)):
self.model = model
self.patch_shape = patch_shape
def random_transform(self, patch, max_rotation=30, max_scale=0.2):
"""随机变换补丁"""
# 随机旋转
angle = torch.rand(1).item() * max_rotation - max_rotation / 2
# 随机缩放
scale = 1 + torch.rand(1).item() * max_scale - max_scale / 2
# 简化的变换(实际应用中需要更复杂的实现)
return patch * scale
def train_robust_patch(self, target_class, epochs=50, batch_size=32):
"""训练鲁棒补丁"""
device = next(self.model.parameters()).device
# 初始化补丁
patch = torch.rand(3, *self.patch_shape, device=device, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=0.1)
for epoch in range(epochs):
total_loss = 0
for _ in range(batch_size):
optimizer.zero_grad()
# 生成目标图像
target_img = torch.rand(1, 3, 224, 224, device=device)
# 应用变换后的补丁
transformed_patch = self.random_transform(patch)
# 随机位置
h, w = self.patch_shape
top = torch.randint(0, 224 - h, (1,)).item()
left = torch.randint(0, 224 - w, (1,)).item()
# 应用
img = target_img.clone()
img[:, :, top:top+h, left:left+w] = torch.sigmoid(transformed_patch)
# 前向传播
output = self.model(img)
# 损失:最大化目标类别
loss = -F.cross_entropy(output, torch.tensor([target_class], device=device))
loss.backward()
optimizer.step()
total_loss += loss.item()
# 归一化补丁
with torch.no_grad():
patch.clamp_(0, 1)
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}, Loss: {total_loss / batch_size:.4f}")
        return patch.detach()

物理攻击的威胁等级
对抗补丁攻击已在多项研究中得到验证:
- 通过贴在 Stop 标志上的补丁使自动驾驶系统无法识别停车标志
- 通过佩戴特制眼镜使面部识别系统误识别为特定目标
- 通过打印的补丁使图像分类器输出错误类别
9. 黑盒攻击技术
9.1 基于查询的黑盒攻击
class QueryBasedBlackBoxAttack:
"""
基于查询的黑盒攻击
假设攻击者只能:
- 输入图像获取模型输出
- 不知道模型内部结构
"""
def __init__(self, target_model, epsilon=0.03, delta=1e-4):
self.model = target_model
self.epsilon = epsilon
self.delta = delta
def estimate_gradient(self, x, labels, num_queries=100):
"""
        基于查询反馈近似梯度方向(简化示意):
        对随机选取的坐标施加小扰动,按预测是否翻转给出 ±1 的方向估计
"""
batch_size = x.size(0)
dim = x.view(batch_size, -1).size(1)
gradient = torch.zeros_like(x)
x_flat = x.view(batch_size, -1)
grad_flat = gradient.view(batch_size, -1)
# 随机选择维度进行估计(加速)
num_estimate = min(num_queries, dim)
selected_dims = torch.randperm(dim)[:num_estimate]
for dim_idx in selected_dims:
# 正向扰动
x_plus = x_flat.clone()
x_plus[:, dim_idx] += self.delta
x_plus = x_plus.view_as(x)
with torch.no_grad():
output_plus = self.model(x_plus)
pred_plus = output_plus.argmax(dim=1)
# 计算梯度估计
for i in range(batch_size):
if pred_plus[i] != labels[i]:
grad_flat[i, dim_idx] = 1
else:
grad_flat[i, dim_idx] = -1
return gradient
def attack(self, images, labels, num_iterations=10):
"""黑盒攻击"""
adversarial_images = images.clone()
for iteration in range(num_iterations):
# 估计梯度
grad = self.estimate_gradient(adversarial_images, labels)
# 更新
adversarial_images = adversarial_images + self.epsilon * torch.sign(grad)
adversarial_images = torch.clamp(adversarial_images, 0, 1)
return adversarial_images
class NaturalEvolutionStrategiesAttack:
"""
自然进化策略(NES)黑盒攻击
使用进化策略估计梯度
"""
def __init__(self, model, population_size=100, learning_rate=0.01):
self.model = model
self.population_size = population_size
self.lr = learning_rate
def estimate_nes_gradient(self, x, target_labels, direction='maximize'):
"""
用NES估计梯度
采样多个扰动,计算期望损失变化
"""
batch_size = x.size(0)
dim = x.numel()
# 采样扰动
noise = torch.randn(self.population_size, dim, device=x.device)
# 计算每个扰动对应的损失
losses = []
for i in range(self.population_size):
perturbed = x.view(batch_size, -1) + 0.01 * noise[i]
perturbed = perturbed.view_as(x)
with torch.no_grad():
output = self.model(perturbed)
loss = F.cross_entropy(output, target_labels)
if direction == 'minimize':
loss = -loss
losses.append(loss.item())
# 加权平均估计梯度
losses = torch.tensor(losses, device=x.device)
weights = losses - losses.mean()
gradient = (1.0 / (0.01 * self.population_size)) * torch.matmul(
weights, noise
).view_as(x)
return gradient
def attack(self, images, labels, epsilon=0.03, num_iterations=10):
"""NES黑盒攻击"""
adversarial_images = images.clone()
for iteration in range(num_iterations):
grad = self.estimate_nes_gradient(adversarial_images, labels)
adversarial_images = adversarial_images + self.lr * grad
adversarial_images = torch.clamp(adversarial_images, images - epsilon, images + epsilon)
adversarial_images = torch.clamp(adversarial_images, 0, 1)
        return adversarial_images

9.2 迁移攻击
class TransferAttack:
"""
迁移攻击
利用模型之间的迁移性:
1. 训练一个替代模型
2. 在替代模型上生成对抗样本
3. 将对抗样本迁移到目标模型
"""
def __init__(self, substitute_model, target_model):
self.substitute = substitute_model
self.target = target_model
def train_substitute(self, training_images, training_labels, epochs=10):
"""
训练替代模型
使用雅可比矩阵增强训练数据
"""
        # 假设 training_images / training_labels 为张量;需要 import copy,
        # 以及 from torch.utils.data import DataLoader, TensorDataset
        substitute = copy.deepcopy(self.substitute)
        optimizer = torch.optim.Adam(substitute.parameters(), lr=0.001)
        loader = DataLoader(TensorDataset(training_images, training_labels),
                            batch_size=32, shuffle=True)
        for epoch in range(epochs):
            for images, labels in loader:
optimizer.zero_grad()
outputs = substitute(images)
loss = F.cross_entropy(outputs, labels)
loss.backward()
optimizer.step()
self.substitute = substitute
return substitute
def augment_with_jacobian(self, images, labels):
"""
使用雅可比矩阵增强训练数据
增加沿梯度方向的样本
"""
augmented = [images]
for img, label in zip(images, labels):
img = img.unsqueeze(0).requires_grad_(True)
output = self.substitute(img)
loss = F.cross_entropy(output, label.unsqueeze(0))
grad = torch.autograd.grad(loss, img)[0]
# 添加正向和负向扰动
augmented.append(img + 0.1 * grad)
augmented.append(img - 0.1 * grad)
return torch.cat(augmented, dim=0)
    def generate_adversarial(self, images, labels, epsilon=0.03, method='fgsm'):
        """
        在替代模型上生成对抗样本(需要替代模型所依据的标签)
        """
        if method == 'fgsm':
            adv_images, _ = fgsm_attack(self.substitute, images, labels, epsilon=epsilon)
            return adv_images
        elif method == 'pgd':
            return pgd_attack(self.substitute, images, labels, epsilon=epsilon)
def transfer(self, images, labels):
"""执行迁移攻击"""
# 在替代模型上生成
        adv_images = self.generate_adversarial(images, labels)
# 测试在目标模型上的效果
with torch.no_grad():
target_output = self.target(adv_images)
        return adv_images, target_output

10. 对抗防御技术
10.1 对抗训练
class AdversarialTraining:
"""
对抗训练
在训练过程中包含对抗样本
提高模型对对抗攻击的鲁棒性
"""
def __init__(self, model, epsilon=0.03, alpha=0.003, num_iter=7):
self.model = model
self.epsilon = epsilon
self.alpha = alpha
self.num_iter = num_iter
def train_step(self, images, labels, optimizer):
"""
对抗训练步骤
策略:使用PGD攻击生成对抗样本进行训练
"""
# 生成对抗样本
adversarial_images = pgd_attack(
self.model, images, labels,
epsilon=self.epsilon,
alpha=self.alpha,
num_iter=self.num_iter
)
# 联合训练:真实样本 + 对抗样本
optimizer.zero_grad()
# 真实样本损失
real_outputs = self.model(images)
real_loss = F.cross_entropy(real_outputs, labels)
# 对抗样本损失
adv_outputs = self.model(adversarial_images)
adv_loss = F.cross_entropy(adv_outputs, labels)
# 总损失
total_loss = real_loss + adv_loss
total_loss.backward()
optimizer.step()
return total_loss.item()
def curriculum_training(self, images, labels, optimizer, epoch, total_epochs):
"""
课程对抗训练
随着训练进行,逐步增加对抗强度
"""
# 动态调整epsilon
progress = epoch / total_epochs
current_epsilon = self.epsilon * (0.5 + 0.5 * progress)
current_alpha = self.alpha * (0.5 + 0.5 * progress)
adversarial_images = pgd_attack(
self.model, images, labels,
epsilon=current_epsilon,
alpha=current_alpha,
num_iter=self.num_iter
)
optimizer.zero_grad()
# 混合训练
outputs = self.model(images)
adv_outputs = self.model(adversarial_images)
# 加权损失
weight = min(1.0, progress)
loss = (1 - weight) * F.cross_entropy(outputs, labels) + \
weight * F.cross_entropy(adv_outputs, labels)
loss.backward()
optimizer.step()
return loss.item()
class TRADESLoss:
"""
TRADES (TRADE-off between robustness and accuracy)
对抗训练的另一种损失函数
同时优化干净样本准确率和鲁棒性
"""
def __init__(self, model, beta=1.0):
self.model = model
self.beta = beta
def compute_loss(self, images, labels, epsilon=0.03):
"""
TRADES 损失
Loss = CE(model(x), y) + β * KL(model(x)||model(x+ε))
"""
# 干净样本的预测
outputs = self.model(images)
# 生成对抗样本
adversarial_images = pgd_attack(
self.model, images, labels,
epsilon=epsilon, num_iter=7
)
# 对抗样本的预测
outputs_adv = self.model(adversarial_images)
# 交叉熵损失
ce_loss = F.cross_entropy(outputs, labels)
# KL散度损失
        # KL(p_clean || p_adv):kl_div 的第一个参数为对数概率,第二个为概率
        kl_loss = F.kl_div(
            F.log_softmax(outputs_adv, dim=1),
            F.softmax(outputs, dim=1),
            reduction='batchmean'
        )
        return ce_loss + self.beta * kl_loss

10.2 输入变换防御
class InputTransformationDefense:
"""
输入变换防御
通过对输入进行预处理来抵御对抗攻击
"""
def __init__(self):
self.transforms = []
def add_transform(self, transform_fn):
"""添加变换"""
self.transforms.append(transform_fn)
def apply_randomization(self, x, num_samples=5):
"""
输入随机化
对输入应用随机变换,平均多次预测
"""
predictions = []
for _ in range(num_samples):
x_transformed = x.clone()
# 随机裁剪
if torch.rand(1).item() > 0.5:
x_transformed = self.random_crop(x_transformed)
# 随机翻转
if torch.rand(1).item() > 0.5:
x_transformed = torch.flip(x_transformed, dims=[3])
# 随机缩放
if torch.rand(1).item() > 0.5:
x_transformed = self.random_scale(x_transformed)
predictions.append(x_transformed)
# 返回原始尺度(假设平均处理)
return x_transformed # 简化:实际需要逆变换
def random_crop(self, x, crop_size=None):
"""随机裁剪"""
if crop_size is None:
crop_size = int(x.size(-1) * 0.9)
h, w = x.size(-2), x.size(-1)
top = torch.randint(0, h - crop_size + 1, (1,)).item()
left = torch.randint(0, w - crop_size + 1, (1,)).item()
return F.interpolate(
x[:, :, top:top+crop_size, left:left+crop_size],
size=(h, w),
mode='bilinear',
align_corners=False
)
def random_scale(self, x, scale_range=(0.9, 1.1)):
"""随机缩放"""
scale = torch.rand(1).item() * (scale_range[1] - scale_range[0]) + scale_range[0]
h, w = x.size(-2), x.size(-1)
scaled = F.interpolate(
x, scale_factor=scale, mode='bilinear', align_corners=False
)
# 裁剪或填充到原始大小
if scale > 1:
# 裁剪中心区域
new_h, new_w = scaled.size(-2), scaled.size(-1)
top = (new_h - h) // 2
left = (new_w - w) // 2
return scaled[:, :, top:top+h, left:left+w]
else:
# 填充
pad_h = (h - scaled.size(-2)) // 2
pad_w = (w - scaled.size(-1)) // 2
return F.pad(scaled, (pad_w, w-scaled.size(-1)-pad_w,
pad_h, h-scaled.size(-2)-pad_h))
class FeatureDenoising:
"""
特征去噪防御
在特征层面去除对抗扰动
"""
def __init__(self, model):
self.model = model
def add_denoising_layer(self, feature_dim):
"""添加去噪层"""
self.denoise = nn.Sequential(
nn.Conv2d(feature_dim, feature_dim, 3, padding=1),
nn.ReLU(),
nn.Conv2d(feature_dim, feature_dim, 3, padding=1)
)
def forward_with_denoise(self, x):
"""带去噪的前向传播"""
features = self.model.extract_features(x)
denoised_features = self.denoise(features)
        return self.model.classify(denoised_features)

10.3 蒸馏防御
class DefensiveDistillation:
"""
防御性蒸馏
使用知识蒸馏增强模型的平滑性
"""
def __init__(self, teacher_model, student_model, temperature=100):
self.teacher = teacher_model
self.student = student_model
self.temperature = temperature
def distill(self, images, labels):
"""
蒸馏训练
使用教师模型的软标签训练学生模型
"""
# 教师模型生成软标签
with torch.no_grad():
soft_labels = F.softmax(self.teacher(images) / self.temperature, dim=1)
# 学生模型学习软标签
student_outputs = self.student(images)
distill_loss = F.kl_div(
F.log_softmax(student_outputs / self.temperature, dim=1),
soft_labels,
reduction='batchmean'
) * (self.temperature ** 2)
# 可选:也包含硬标签损失
hard_loss = F.cross_entropy(student_outputs, labels)
return 0.5 * distill_loss + 0.5 * hard_loss
def train_student(self, dataloader, epochs=20):
"""训练学生模型"""
optimizer = torch.optim.Adam(self.student.parameters(), lr=0.001)
for epoch in range(epochs):
for images, labels in dataloader:
optimizer.zero_grad()
loss = self.distill(images, labels)
loss.backward()
optimizer.step()
        return self.student

10.4 可证明防御
class CertifiedDefense:
"""
可证明鲁棒性防御
提供可证明的防御边界
"""
def __init__(self, model, epsilon=0.1):
self.model = model
self.epsilon = epsilon
def verify_sample(self, x, label, timeout=60):
"""
验证样本的鲁棒性
        通过在扰动区间内二分搜索预测是否翻转,粗略估计鲁棒半径
        (示意性代码,并非严格的 IBP 认证;真正的认证需区间边界传播等方法)
        返回:是否鲁棒的粗略判断 + 搜索得到的上下界
"""
device = next(self.model.parameters()).device
# 初始化上界和下界
lower = x - self.epsilon
upper = x + self.epsilon
lower = torch.clamp(lower, 0, 1)
upper = torch.clamp(upper, 0, 1)
# 迭代优化边界
for iteration in range(10):
# 计算中间点
mid = (lower + upper) / 2
# 计算在mid点的输出
mid_out = self.model(mid)
mid_pred = mid_out.argmax(dim=1)
# 检查是否改变预测
if mid_pred != label:
# 预测改变了,缩小上界
upper = mid
else:
# 预测未变,增大下界
lower = mid
        # 返回粗略的验证结果:若收缩后的上下界处预测均未改变,则认为样本鲁棒
        with torch.no_grad():
            is_robust = bool(
                (self.model(lower).argmax(dim=1) == label).all()
                and (self.model(upper).argmax(dim=1) == label).all()
            )
return is_robust, lower, upper
class IBP_verifier:
"""
区间传播边界(IBP)验证器
用于计算神经网络输出在输入扰动下的界
"""
def __init__(self, model):
self.model = model
def propagate_interval(self, lower, upper):
"""
传播输入区间到输出区间
对于线性层:
- output_lower = W * lower - |W| * (upper - lower) / 2
- output_upper = W * upper + |W| * (upper - lower) / 2
"""
center = (lower + upper) / 2
radius = (upper - lower) / 2
output = self.model(center)
# 简化:假设输出随输入线性变化
# 实际需要更复杂的传播规则
        return output - 0.1, output + 0.1

11. 对抗样本检测
11.1 基于置信度的检测
class AdversarialDetector:
"""
对抗样本检测器
多种检测方法的集合
"""
def __init__(self, model):
self.model = model
self.model.eval()
def confidence_based_detection(self, x, threshold=0.9):
"""
基于置信度的检测
正常样本通常有较高的分类置信度
对抗样本可能有异常的置信度分布
"""
with torch.no_grad():
outputs = self.model(x)
probs = F.softmax(outputs, dim=1)
max_probs = probs.max(dim=1)[0]
# 低置信度可能表明对抗样本
is_adversarial = max_probs < threshold
return is_adversarial, max_probs
def lid_detection(self, x, batch, k=20):
"""
局部内在维度(LID)检测
对抗样本通常具有异常的高LID值
"""
# 获取特征
with torch.no_grad():
features = self.model.extract_features(x)
batch_features = self.model.extract_features(batch)
# 计算每个样本的LID
n_samples = x.size(0)
lids = []
for i in range(n_samples):
# 计算到其他样本的距离
distances = torch.norm(batch_features - features[i:i+1], dim=1)
distances = distances.sort()[0][1:k+1] # k近邻
# LID估计
lid = -k / torch.log(distances / (distances.max() + 1e-10))
lids.append(lid.mean().item())
return torch.tensor(lids)
def mahalanobis_detection(self, x, class_means, class_covs, threshold=0.1):
"""
马氏距离检测
计算样本到各类别分布的马氏距离
对抗样本可能远离所有类别分布
"""
with torch.no_grad():
features = self.model.extract_features(x)
distances = []
for i, (mean, cov) in enumerate(zip(class_means, class_covs)):
# 马氏距离
diff = features - mean
cov_inv = torch.inverse(cov + 1e-6 * torch.eye(cov.size(0)))
mahal = torch.sum(diff @ cov_inv * diff, dim=1)
distances.append(mahal)
distances = torch.stack(distances, dim=1)
min_distances = distances.min(dim=1)[0]
# 大马氏距离可能表明对抗样本
is_adversarial = min_distances > threshold
return is_adversarial, min_distances
class FeatureSqueezeDetector:
"""
特征压缩检测
通过压缩输入并比较输出来检测对抗样本
"""
def __init__(self, model):
self.model = model
def squeeze(self, x, method='bit_depth', bit_depth=4):
"""压缩输入"""
if method == 'bit_depth':
# 位深度压缩
levels = 2 ** bit_depth
return torch.round(x * levels) / levels
elif method == 'jpeg':
# JPEG压缩(简化实现)
return x # 实际需要图像处理库
        elif method == 'median_filter':
            # 中值滤波(PyTorch 没有内置 median_pool2d,这里用 unfold 实现 3x3 中值)
            padded = F.pad(x, (1, 1, 1, 1), mode='reflect')
            patches = padded.unfold(2, 3, 1).unfold(3, 3, 1)
            return patches.contiguous().view(*x.shape, -1).median(dim=-1)[0]
def detect(self, x, threshold=0.1):
"""检测对抗样本"""
# 原始预测
with torch.no_grad():
original_pred = self.model(x).argmax(dim=1)
# 压缩后预测
squeezed_x = self.squeeze(x)
with torch.no_grad():
squeezed_pred = self.model(squeezed_x).argmax(dim=1)
# 预测不一致可能表明对抗样本
is_adversarial = (original_pred != squeezed_pred)
        return is_adversarial, (original_pred != squeezed_pred).float()

12. 对抗攻击的实际应用场景
12.1 自动驾驶系统攻击
class AutonomousVehicleAttack:
"""
自动驾驶系统对抗攻击
目标:欺骗感知系统使车辆做出错误决策
"""
def __init__(self, perception_model):
self.perception = perception_model
def traffic_sign_patch_attack(self, sign_image, target_class):
"""
交通标志补丁攻击
生成贴在交通标志上的对抗补丁
使感知系统误识别标志
"""
patch_size = (50, 50)
# 初始化补丁
patch = torch.rand(3, *patch_size, requires_grad=True)
optimizer = torch.optim.Adam([patch], lr=0.1)
for iteration in range(500):
optimizer.zero_grad()
# 将补丁应用到标志图像
patched_sign = sign_image.clone()
patched_sign[:, :, :patch_size[0], :patch_size[1]] = torch.sigmoid(patch)
# 前向传播
output = self.perception(patched_sign)
# 定向损失
loss = -F.cross_entropy(output, torch.tensor([target_class]))
loss.backward()
optimizer.step()
return patch.detach()
def lane_line_manipulation(self, road_image, target_offset):
"""
车道线操纵
在路面图像上添加扰动
使车辆错误估计车道位置
"""
perturbation = torch.zeros_like(road_image, requires_grad=True)
optimizer = torch.optim.Adam([perturbation], lr=0.01)
for iteration in range(100):
optimizer.zero_grad()
# 应用扰动
modified = road_image + 0.1 * torch.tanh(perturbation)
# 车道线检测
lane_prediction = self.perception.detect_lanes(modified)
# 损失:使预测偏离目标
loss = torch.abs(lane_prediction - target_offset).sum()
loss.backward()
optimizer.step()
return 0.1 * torch.tanh(perturbation.detach())
class LidarSpoofingAttack:
"""
激光雷达欺骗攻击
在点云数据上添加虚假障碍物
"""
def __init__(self, lidar_model):
self.model = lidar_model
def create_false_object(self, point_cloud, object_type='pedestrian'):
"""
在点云中创建虚假物体
参数:
- point_cloud: 原始点云
- object_type: 目标物体类型(行人、车辆等)
"""
# 随机选择虚假物体的位置
num_points = point_cloud.size(1)
# 生成符合物体形状的点
if object_type == 'pedestrian':
# 人形点云
center = torch.tensor([[5.0, 0.0, 0.5]]) # x, y, z
fake_points = self.generate_human_shape(center)
elif object_type == 'vehicle':
# 车辆形状
center = torch.tensor([[10.0, 0.0, 0.5]])
fake_points = self.generate_vehicle_shape(center)
# 将虚假点云添加到原始点云
spoofed_cloud = torch.cat([point_cloud, fake_points], dim=1)
return spoofed_cloud
def generate_human_shape(self, center):
"""生成人形点云"""
# 简化的人形模型
x = center[:, 0] + torch.randn(100) * 0.1
y = center[:, 1] + torch.randn(100) * 0.2
z = center[:, 2] + torch.randn(100) * 1.6 + torch.linspace(0, 1.6, 100)
points = torch.stack([x, y, z], dim=1).unsqueeze(0)
        return points

12.2 面部识别系统攻击
class FaceRecognitionAttack:
"""
面部识别系统攻击
"""
def __init__(self, recognition_model):
self.model = recognition_model
def adversarial_glasses_attack(self, face_image, target_identity):
"""
对抗眼镜攻击
生成佩戴后能使系统误识别为特定身份的眼镜图案
"""
# 眼镜区域掩码
mask = self.get_eyewear_mask(face_image)
# 初始化眼镜图案
glasses = torch.rand(1, 3, 50, 150, requires_grad=True)
optimizer = torch.optim.Adam([glasses], lr=0.1)
for iteration in range(300):
optimizer.zero_grad()
# 应用眼镜
patched_face = face_image.clone()
patched_face = patched_face + mask * torch.tanh(glasses)
# 前向传播获取身份嵌入
embedding = self.model.extract_embedding(patched_face)
target_embedding = self.model.get_identity_embedding(target_identity)
# 损失:最小化与目标身份的嵌入距离
loss = 1 - F.cosine_similarity(embedding, target_embedding).mean()
loss.backward()
optimizer.step()
return torch.tanh(glasses.detach())
def get_eyewear_mask(self, face_image):
"""获取眼镜区域的掩码"""
_, _, h, w = face_image.shape
mask = torch.zeros_like(face_image)
# 假设眼镜在眼睛位置
mask[:, :, int(h*0.35):int(h*0.45), int(w*0.2):int(w*0.8)] = 1.0
return mask
    def universal_adversarial_patch(self, dataloader, num_classes=1000):
        """
        通用对抗补丁
        在给定人脸数据集(dataloader)上训练一个可应用于任意面部的补丁
        """
        # 初始化补丁及其优化器
        patch = torch.rand(3, 100, 100, requires_grad=True)
        optimizer = torch.optim.Adam([patch], lr=0.1)
        # 训练
        for epoch in range(10):
for batch_idx, (images, labels) in enumerate(dataloader):
optimizer.zero_grad()
patched_images = self.apply_random_position(images, patch)
outputs = self.model(patched_images)
# 最大化所有类别的损失
loss = -F.cross_entropy(outputs, labels)
loss.backward()
optimizer.step()
return patch.detach()
    def apply_random_position(self, images, patch):
"""随机位置应用补丁"""
batch_size = images.size(0)
_, _, h, w = images.shape
patch_h, patch_w = patch.shape[1:]
patched = images.clone()
for i in range(batch_size):
top = torch.randint(0, h - patch_h, (1,)).item()
left = torch.randint(0, w - patch_w, (1,)).item()
patched[i:i+1, :, top:top+patch_h, left:left+patch_w] = patch
        return patched

12.3 恶意软件检测规避
class MalwareDetectorEvasion:
"""
恶意软件检测规避攻击
通过修改恶意软件二进制代码
绕过机器学习检测器
"""
def __init__(self, detector):
self.detector = detector
    def feature_manipulation(self, malware_features, target_label=0):
"""
特征操纵攻击
修改恶意软件的特征向量
使其被误分类为良性
"""
perturbation = torch.zeros_like(malware_features, requires_grad=True)
optimizer = torch.optim.Adam([perturbation], lr=0.1)
for iteration in range(100):
optimizer.zero_grad()
# 添加扰动
modified = malware_features + 0.1 * torch.tanh(perturbation)
# 检测
prediction = self.detector(modified)
# 损失:使检测器预测为良性
loss = -F.cross_entropy(prediction, torch.tensor([target_label]))
loss.backward()
optimizer.step()
return 0.1 * torch.tanh(perturbation.detach())
def adversarial_malware_generation(self, benign_sample, constraint='api_call'):
"""
对抗性恶意软件生成
基于良性样本生成能绕过检测的变体
"""
# 获取良性样本的特征
benign_features = self.extract_features(benign_sample)
# 识别可修改的特征
modifiable_features = self.get_modifiable_features(constraint)
# 优化修改
modification = torch.zeros_like(benign_features)
modification.requires_grad = True
optimizer = torch.optim.Adam([modification], lr=0.01)
for iteration in range(200):
optimizer.zero_grad()
# 只修改允许的特征
modified = benign_features.clone()
for feat_idx in modifiable_features:
modified[feat_idx] += 0.1 * torch.tanh(modification[feat_idx])
# 检测
detection_score = self.detector(modified)
# 损失:最小化恶意软件概率
loss = detection_score[:, 1].sum() # 假设第二类是恶意
loss.backward()
optimizer.step()
        return benign_sample + modification.detach()

13. 对抗攻击的评估与基准
13.1 攻击成功率度量
class AttackEvaluator:
"""
攻击评估器
评估对抗攻击的效果
"""
def __init__(self, model):
self.model = model
def evaluate_attack(self, original_images, labels, adversarial_images):
"""
评估攻击效果
返回:
- 攻击成功率
- 平均扰动幅度
- 置信度变化
"""
with torch.no_grad():
# 原始预测
original_output = self.model(original_images)
original_pred = original_output.argmax(dim=1)
original_conf = F.softmax(original_output, dim=1).max(dim=1)[0]
# 对抗预测
adversarial_output = self.model(adversarial_images)
adversarial_pred = adversarial_output.argmax(dim=1)
adversarial_conf = F.softmax(adversarial_output, dim=1).max(dim=1)[0]
# 攻击成功:预测改变
attack_success = (original_pred != adversarial_pred).float()
# 计算指标
success_rate = attack_success.mean().item()
# 扰动幅度
perturbation = adversarial_images - original_images
perturbation_linf = perturbation.view(perturbation.size(0), -1).abs().max(dim=1)[0].mean().item()
perturbation_l2 = perturbation.view(perturbation.size(0), -1).norm(dim=1).mean().item()
# 置信度变化
conf_change = (adversarial_conf - original_conf).mean().item()
return {
'success_rate': success_rate,
'perturbation_linf': perturbation_linf,
'perturbation_l2': perturbation_l2,
'confidence_change': conf_change,
'original_confidence': original_conf.mean().item(),
'adversarial_confidence': adversarial_conf.mean().item()
}
def evaluate_targeted_attack(self, images, target_labels, adversarial_images):
"""
评估定向攻击
检查是否成功攻击到目标类别
"""
with torch.no_grad():
adversarial_pred = self.model(adversarial_images).argmax(dim=1)
# 定向攻击成功
targeted_success = (adversarial_pred == target_labels).float()
return {
'targeted_success_rate': targeted_success.mean().item(),
'original_to_target_conf': None # 可添加更多指标
}
def robustness_curve(self, images, labels, epsilon_range):
"""
计算鲁棒性曲线
测试不同扰动幅度下的攻击成功率
"""
results = []
for epsilon in epsilon_range:
            adversarial_images, _ = fgsm_attack(self.model, images, labels, epsilon)
metrics = self.evaluate_attack(images, labels, adversarial_images)
metrics['epsilon'] = epsilon
results.append(metrics)
        return results

13.2 防御评估基准
class DefenseBenchmark:
"""
防御评估基准
标准化的防御效果评估
"""
def __init__(self, model, defenses):
self.model = model
self.defenses = defenses
self.attacks = ['fgsm', 'pgd', 'cw']
def benchmark(self, test_loader):
"""
运行完整的防御基准测试
"""
results = {}
for defense_name, defense_fn in self.defenses.items():
defense_results = {}
for attack_name in self.attacks:
clean_acc = self.evaluate_clean(self.model, test_loader)
# 应用攻击
adv_loader = self.generate_adversarial(
test_loader, attack_name
)
                # 无防御时的对抗准确率,作为基线
                baseline_robust_acc = self.evaluate_clean(self.model, adv_loader)
                # 应用防御
                defended_loader = self.apply_defense(
                    adv_loader, defense_fn
                )
                # 评估
                defense_acc = self.evaluate_clean(
                    self.model, defended_loader
                )
                defense_results[attack_name] = {
                    'clean_accuracy': clean_acc,
                    'adversarial_accuracy': defense_acc,
                    'robustness_improvement': defense_acc - baseline_robust_acc
                }
results[defense_name] = defense_results
return results
def evaluate_clean(self, model, dataloader):
"""评估干净样本准确率"""
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, labels in dataloader:
outputs = model(images)
predictions = outputs.argmax(dim=1)
correct += (predictions == labels).sum().item()
total += labels.size(0)
return correct / total
def generate_adversarial(self, dataloader, attack_name):
"""生成对抗样本"""
# 实现各种攻击
pass
def apply_defense(self, dataloader, defense_fn):
"""应用防御"""
        pass

14. 对抗样本的可解释性分析
14.1 对抗扰动的空间分析
class AdversarialSpaceAnalyzer:
"""
对抗空间分析器
分析对抗样本在高维空间中的几何特性
"""
def __init__(self, model):
self.model = model
def analyze_direction(self, x, direction):
"""
分析沿特定方向的决策变化
追踪沿对抗方向移动时的预测变化
"""
trajectory = []
step_size = 0.001
current = x.clone()
for i in range(100):
with torch.no_grad():
pred = self.model(current).argmax().item()
conf = F.softmax(self.model(current), dim=1).max().item()
trajectory.append({
'step': i,
'prediction': pred,
'confidence': conf,
'norm': torch.norm(current - x).item()
})
# 移动
current = current + step_size * direction
return trajectory
def find_decision_boundary(self, x1, x2, num_samples=100):
"""
找到两点之间穿过决策边界的位置
用于理解对抗样本如何跨越边界
"""
boundary_points = []
for i in range(num_samples):
alpha = i / num_samples
x_mid = alpha * x1 + (1 - alpha) * x2
with torch.no_grad():
pred = self.model(x_mid.unsqueeze(0)).argmax().item()
boundary_points.append({
'alpha': alpha,
'prediction': pred,
'position': x_mid
})
return boundary_points
def compute_local_geometry(self, x, num_directions=1000):
"""
分析局部几何结构
估计局部区域的曲率和方向
"""
curvatures = []
for _ in range(num_directions):
direction = torch.randn_like(x)
direction = direction / direction.norm()
            # 在正负方向上分别计算梯度(需开启 autograd,不能放在 no_grad 内)
            x_pos = (x + 0.01 * direction).detach().requires_grad_(True)
            x_neg = (x - 0.01 * direction).detach().requires_grad_(True)
            grad_pos = torch.autograd.grad(self.model(x_pos).max(), x_pos)[0]
            grad_neg = torch.autograd.grad(self.model(x_neg).max(), x_neg)[0]
# 曲率近似
curvature = torch.norm(grad_pos - grad_neg)
curvatures.append(curvature.item())
        return np.mean(curvatures)

15. 对抗样本的伦理与安全考量
15.1 负责任的研究实践
class ResponsibleAI:
"""
负责任的AI研究框架
对抗样本研究的伦理指导
"""
@staticmethod
def threat_model_assessment(threat_model):
"""
威胁模型评估
在进行研究前评估潜在风险
"""
assessment = {
'severity': None,
'misuse_potential': None,
'mitigation_needed': True,
'responsible_disclosure': False
}
# 评估攻击的严重性
if threat_model['target'] in ['critical_infrastructure', 'safety_critical']:
assessment['severity'] = 'high'
assessment['responsible_disclosure'] = True
# 评估滥用潜力
if threat_model['scalability'] > 0.8:
assessment['misuse_potential'] = 'high'
return assessment
@staticmethod
def implement_guardrails(attack_code):
"""
实施安全防护措施
确保研究成果不被滥用
"""
safeguards = {
'access_control': 'limit_to_verified_researchers',
'output_filtering': 'prevent_direct_application',
'redaction': 'remove_specific_target_details',
'time_delay': 'embargo_period_for_vendors'
}
        return safeguards

16. 学术引用与参考文献
- Szegedy, C., et al. (2013). “Intriguing properties of neural networks.” arXiv:1312.6199.
- Goodfellow, I. J., et al. (2015). “Explaining and Harnessing Adversarial Examples.” ICLR.
- Madry, A., et al. (2017). “Towards Deep Learning Models Resistant to Adversarial Attacks.” ICLR.
- Carlini, N., & Wagner, D. (2017). “Towards Evaluating the Robustness of Neural Networks.” IEEE S&P.
- Kurakin, A., et al. (2016). “Adversarial examples in the physical world.” ICLR Workshop.
- Brown, T. B., et al. (2017). “Adversarial Patch.” arXiv:1712.09665.
- Chen, J., & Jordan, M. I. (2019). “HopSkipJumpAttack: A Query-Efficient Decision-Based Attack.” IEEE S&P.
- Moosavi-Dezfooli, S. M., et al. (2016). “DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks.” CVPR.
- Athalye, A., et al. (2018). “Obfuscated Gradients Give a False Sense of Security.” ICML.
- Tramèr, F., et al. (2017). “Ensemble Adversarial Training.” arXiv:1705.07204.
- Zhang, H., et al. (2019). “Theoretically Principled Trade-off between Robustness and Accuracy.” ICML.
- Ilyas, A., et al. (2019). “Adversarial Examples Are Not Bugs, They Are Features.” NeurIPS.
- Xie, C., et al. (2019). “Feature Denoising for Improving Adversarial Robustness.” CVPR.
- Cohen, J., et al. (2019). “Certified Adversarial Robustness via Randomized Smoothing.” ICML.
- Dong, Y., et al. (2018). “Boosting Adversarial Attacks with Momentum.” CVPR.