关键词

术语 | 英文 | 核心概念
对抗样本 | Adversarial Example | 精心设计的、能导致模型误分类的扰动输入
快速梯度符号法 | FGSM | 基于梯度的单步攻击方法
投影梯度下降 | PGD | 迭代式攻击的代表性方法
Carlini-Wagner 攻击 | C&W Attack | 基于优化的 L2 攻击方法
扰动 | Perturbation | 添加到原始输入的微小噪声
白盒攻击 | White-box Attack | 攻击者完全了解模型参数的攻击
黑盒攻击 | Black-box Attack | 攻击者只能访问模型输入输出的攻击
对抗补丁 | Adversarial Patch | 物理世界中可打印的对抗攻击
逃逸攻击 | Evasion Attack | 在推理阶段规避检测的攻击
投毒攻击 | Poisoning Attack | 在训练阶段注入恶意样本的攻击

1. 引言:对抗样本的发现

2013 年,Christian Szegedy 等人在论文《Intriguing properties of neural networks》中首次正式定义了对抗样本(Adversarial Examples)这一概念。他们发现了一个令人震惊的现象:对于一个表现良好的深度神经网络,可以通过对输入图像添加人类几乎无法察觉的微小扰动,使得模型以高置信度输出错误的分类结果。

历史背景

对抗样本的发现打破了人们将此类错误简单归因于"过拟合"或"泛化能力不足"的传统认知。它揭示了一个更深层次的问题:现代神经网络的决策边界存在系统性缺陷,而非简单的统计误差。


2. 对抗样本的数学定义

2.1 形式化定义

给定一个分类器 $f$ 和原始输入 $x$,对抗样本 $x'$ 满足以下条件:

$$f(x') \neq f(x) \quad \text{且} \quad \|x' - x\|_p \leq \epsilon$$

其中 $\|\cdot\|_p$ 表示 $L_p$ 范数,$\epsilon$ 是人类感知阈值。
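
下面给出一个示意性的检查函数(其中 model 为假设已加载的分类器,x 与 x_adv 为取值已归一化到 [0, 1] 的图像张量,均非原文给定),以 L∞ 范数为例,在数值上检验某个候选样本是否满足上述定义:

import torch

def is_adversarial_example(model, x, x_adv, epsilon=0.03):
    """
    以 L∞ 范数为例,检验 x_adv 是否满足对抗样本定义:
    (1) 扰动范数 ||x_adv - x||_∞ <= epsilon(人眼难以察觉);
    (2) 模型对 x_adv 的预测与对 x 的预测不同(误分类)。
    """
    model.eval()
    with torch.no_grad():
        pred_clean = model(x.unsqueeze(0)).argmax(dim=1).item()
        pred_adv = model(x_adv.unsqueeze(0)).argmax(dim=1).item()
    within_budget = (x_adv - x).abs().max().item() <= epsilon
    return within_budget and (pred_adv != pred_clean)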

2.2 对抗样本的存在性解释

从决策边界角度,对抗样本的存在可以用高维空间的线性近似来解释。对于一个权重向量 $w$ 和加入扰动 $\eta$ 的输入 $\tilde{x} = x + \eta$,模型的线性响应为:

$$w^\top \tilde{x} = w^\top x + w^\top \eta$$

如果 $\eta$ 与 $w$ 逐维符号相同(例如 $\eta = \epsilon \cdot \mathrm{sign}(w)$),则 $w^\top \eta = \epsilon \|w\|_1$ 会随维度线性累积;即使 $\epsilon$ 极小,在高维空间中也足以改变分类结果。

import numpy as np
import torch
import torch.nn as nn
 
def explain_adversarial_examples(model, x, epsilon=0.1):
    """
    解释对抗样本存在性的数值示例
    
    在高维空间中,即使扰动很小,
    累积效应也足以改变分类结果
    """
    # 获取模型对原始输入的预测
    model.eval()
    with torch.no_grad():
        original_pred = model(x.unsqueeze(0)).argmax(dim=1).item()
    
    # 获取输入梯度(扰动方向):以"降低当前预测类别的 logit"为目标
    x = x.clone().detach()
    x.requires_grad = True
    output = model(x.unsqueeze(0))
    loss = -output[0, original_pred]
    loss.backward()
    
    gradient = x.grad.data
    
    # 沿符号梯度方向施加扰动,并裁剪回有效像素范围
    perturbation = epsilon * torch.sign(gradient)
    adversarial_x = (x + perturbation).clamp(0, 1).detach()
    
    with torch.no_grad():
        adversarial_pred = model(adversarial_x.unsqueeze(0)).argmax(dim=1).item()
    
    print(f"原始预测: {original_pred}, 对抗样本预测: {adversarial_pred}")
    print(f"扰动范数 L∞: {torch.abs(perturbation).max().item():.4f}")
    
    return adversarial_x, perturbation
 
# 高维空间线性效应的数学演示
def high_dimensional_linear_effect():
    """
    在高维空间中:
    - 随机方向的扰动与权重向量的内积期望为 0,仅以 O(√d) 的量级波动
    - 与权重符号逐维对齐的扰动,其内积 ε·Σ|w_i| 随维度 d 线性增长
    
    这解释了为什么每个维度上极小的扰动能累积成足以改变输出的大影响
    """
    d = 10000        # 假设 10000 维输入空间
    epsilon = 0.01   # 每个维度上的扰动幅度
    
    w = np.random.randn(d)  # 模拟线性模型的权重向量
    
    # 随机符号扰动:方向与 w 无关
    random_perturbation = epsilon * np.random.choice([-1, 1], size=d)
    random_effect = abs(w @ random_perturbation)
    
    # 对齐扰动:每个维度的符号与 w 一致
    aligned_perturbation = epsilon * np.sign(w)
    aligned_effect = w @ aligned_perturbation
    
    print(f"随机扰动引起的输出变化: {random_effect:.4f}")
    print(f"对齐扰动引起的输出变化: {aligned_effect:.4f}")
    print(f"比率: {aligned_effect / random_effect:.2f}x")

3. FGSM:快速梯度符号法

3.1 算法原理

Goodfellow 等人在 2014 年提出 FGSM(Fast Gradient Sign Method),这是一种计算高效的单步攻击方法:

$$x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big)$$

其中 $J(\theta, x, y)$ 是损失函数,$\nabla_x J(\theta, x, y)$ 是损失对输入的梯度,$\epsilon$ 控制扰动幅度。

3.2 FGSM 实现

import torch
import torch.nn.functional as F
 
def fgsm_attack(model, images, labels, epsilon=0.03):
    """
    快速梯度符号法(FGSM)攻击
    
    参数:
        model: 目标分类器
        images: 输入图像 (B, C, H, W)
        labels: 真实标签 (B,)
        epsilon: 扰动幅度
    
    返回:
        adversarial_images: 对抗样本
        perturbation: 对应的扰动
    """
    # 复制输入并开启梯度追踪(避免修改调用方张量)
    images = images.clone().detach()
    images.requires_grad = True
    
    # 前向传播
    outputs = model(images)
    
    # 计算损失
    loss = F.cross_entropy(outputs, labels)
    
    # 反向传播获取梯度
    model.zero_grad()
    loss.backward()
    
    # 获取梯度并计算扰动
    gradient = images.grad.data
    perturbation = epsilon * torch.sign(gradient)
    
    # 生成对抗样本
    adversarial_images = images + perturbation
    
    # 确保扰动后图像仍在有效范围内
    adversarial_images = torch.clamp(adversarial_images, 0, 1).detach()
    
    return adversarial_images, perturbation
 
def fgsm_targeted_attack(model, images, target_labels, epsilon=0.03):
    """
    定向 FGSM 攻击:使模型输出特定目标类别
    """
    # 复制输入并开启梯度追踪
    images = images.clone().detach()
    images.requires_grad = True
    
    outputs = model(images)
    
    # 定向攻击:对目标类别的交叉熵取负,随后沿符号梯度方向更新,
    # 等价于最小化目标类别的损失、提高目标类别概率
    loss = -F.cross_entropy(outputs, target_labels)
    
    model.zero_grad()
    loss.backward()
    
    gradient = images.grad.data
    perturbation = epsilon * torch.sign(gradient)
    
    adversarial_images = images + perturbation
    adversarial_images = torch.clamp(adversarial_images, 0, 1).detach()
    
    return adversarial_images, perturbation
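
下面是一段调用上述 fgsm_attack 的评估示意代码(model 与 test_loader 为假设已准备好的分类器和数据加载器,并非原文给定),用于统计攻击前后的分类准确率:

def evaluate_fgsm(model, test_loader, epsilon=0.03, device="cpu"):
    """统计 FGSM 攻击前后的分类准确率(评估示意)"""
    model.eval()
    clean_correct, adv_correct, total = 0, 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        # 干净样本准确率
        with torch.no_grad():
            clean_correct += (model(images).argmax(dim=1) == labels).sum().item()
        # FGSM 对抗样本准确率
        adv_images, _ = fgsm_attack(model, images, labels, epsilon)
        with torch.no_grad():
            adv_correct += (model(adv_images).argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    print(f"干净准确率: {clean_correct / total:.4f}, FGSM 攻击后准确率: {adv_correct / total:.4f}")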

FGSM 的特点

  • 计算效率高:只需一次前向和反向传播
  • 单步攻击:相比迭代方法更快
  • 可解释性强:扰动方向由损失函数的梯度决定
  • 局限性:对于使用防御技术(如对抗训练)的模型效果有限

4. PGD:投影梯度下降攻击

4.1 算法原理

PGD(Projected Gradient Descent)攻击是 FGSM 的迭代增强版本,被广泛认为是最强的一阶 $L_\infty$ 范数约束攻击之一:

$$x^{t+1} = \Pi_{x + \mathcal{S}}\left(x^{t} + \alpha \cdot \mathrm{sign}\big(\nabla_x J(\theta, x^{t}, y)\big)\right)$$

其中 $\Pi_{x+\mathcal{S}}$ 是投影到允许扰动集合 $x + \mathcal{S}$ 的操作,$\alpha$ 是步长。

4.2 PGD 实现

def pgd_attack(model, images, labels, epsilon=0.03, alpha=0.003, num_iter=10, 
               targeted=False, target_labels=None):
    """
    投影梯度下降(PGD)攻击
    
    参数:
        model: 目标分类器
        images: 原始图像
        labels: 真实标签
        epsilon: 最大扰动范数
        alpha: 每次迭代的步长
        num_iter: 迭代次数
        targeted: 是否为定向攻击
        target_labels: 定向攻击的目标标签
    """
    # 保存原始图像
    original_images = images.detach().clone()
    
    # 初始化:随机起点(增加攻击成功率)
    images = images.detach() + torch.zeros_like(images).uniform_(-epsilon, epsilon)
    images = torch.clamp(images, 0, 1)
    
    for i in range(num_iter):
        images.requires_grad = True
        
        outputs = model(images)
        
        if targeted:
            # 定向攻击:最大化目标类别概率
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            # 非定向攻击:最大化真实标签的交叉熵损失,使预测偏离真实类别
            loss = F.cross_entropy(outputs, labels)
        
        model.zero_grad()
        loss.backward()
        
        # 更新扰动
        gradient = images.grad.data
        images = images.detach() + alpha * torch.sign(gradient)
        
        # 投影到允许范围
        images = torch.maximum(images, original_images - epsilon)
        images = torch.minimum(images, original_images + epsilon)
        images = torch.clamp(images, 0, 1)
    
    return images
 
def pgd_attack_l2(model, images, labels, epsilon=1.0, alpha=0.1, num_iter=10,
                   targeted=False, target_labels=None):
    """
    L2 范数约束的 PGD 攻击
    """
    original_images = images.detach().clone()
    
    # 随机初始化
    delta = torch.zeros_like(images)
    delta.normal_()
    delta = delta / torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
    delta = delta * torch.rand(images.size(0), 1, 1, 1).to(images.device) * epsilon
    images = (original_images + delta).clamp(0, 1)
    
    for i in range(num_iter):
        images.requires_grad = True
        
        outputs = model(images)
        
        if targeted:
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            loss = F.cross_entropy(outputs, labels)
        
        model.zero_grad()
        loss.backward()
        
        gradient = images.grad.data
        # L2 归一化梯度方向
        grad_norm = torch.sqrt(torch.sum(gradient ** 2, dim=(1, 2, 3), keepdim=True))
        gradient = gradient / (grad_norm + 1e-10)
        
        # 更新并投影到 L2 球
        images = images.detach() + alpha * gradient
        delta = images - original_images
        delta_norm = torch.sqrt(torch.sum(delta ** 2, dim=(1, 2, 3), keepdim=True))
        delta = delta / (delta_norm + 1e-10) * torch.clamp(delta_norm, max=epsilon)
        images = (original_images + delta).clamp(0, 1)
    
    return images
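
在鲁棒性评估中,通常报告固定扰动预算下的"鲁棒准确率"。下面是一段调用上述 pgd_attack 的示意代码(model 与 test_loader 为假设已准备好的对象,并非原文给定):

def robust_accuracy_pgd(model, test_loader, epsilon=0.03, alpha=0.007,
                        num_iter=20, device="cpu"):
    """在 L∞ 扰动预算 epsilon 下统计 PGD 攻击后的准确率(评估示意)"""
    model.eval()
    correct, total = 0, 0
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        adv_images = pgd_attack(model, images, labels,
                                epsilon=epsilon, alpha=alpha, num_iter=num_iter)
        with torch.no_grad():
            correct += (model(adv_images).argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return correct / total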

5. Carlini-Wagner 攻击

5.1 算法原理

Carlini-Wagner(C&W)攻击是优化框架下的攻击方法,在保证攻击成功的同时最小化扰动幅度:

$$\min_{\delta} \ \|\delta\|_2^2 + c \cdot f(x + \delta), \quad \text{s.t. } x + \delta \in [0, 1]^n$$

其中 $f$ 是将分类器输出转化为标量的辅助函数:

$$f(x') = \max\left(\max_{i \neq t} Z(x')_i - Z(x')_t,\ -\kappa\right)$$

$Z(x')$ 是分类器的 logits 输出,$t$ 是目标类别,$\kappa$ 是置信度参数,$c$ 是权衡两项的系数。

5.2 C&W 攻击实现

class CWL2Attack:
    """
    Carlini-Wagner L2 攻击实现
    
    使用变量替换和重参数化技巧将约束优化问题转化为无约束优化
    """
    
    def __init__(self, model, kappa=0, max_iter=1000, learning_rate=0.01):
        self.model = model
        self.kappa = kappa  # 置信度参数
        self.max_iter = max_iter
        self.lr = learning_rate
    
    def attack(self, images, target_labels, targeted=True):
        """
        执行 C&W L2 攻击
        """
        batch_size = images.size(0)
        device = images.device
        
        # 初始化优化变量 w(使用 tanh 变换确保像素有界)
        # 取 w = atanh(2x - 1),使初始对抗样本恰好等于原始图像
        w = torch.atanh(torch.clamp(images * 2 - 1, -1 + 1e-6, 1 - 1e-6)).detach()
        w.requires_grad = True
        
        optimizer = torch.optim.Adam([w], lr=self.lr)
        
        for iteration in range(self.max_iter):
            optimizer.zero_grad()
            
            # 重参数化:w -> delta(-1 到 1)
            delta = 0.5 * (torch.tanh(w) + 1) - images
            
            # 计算 logits
            adv_images = images + delta
            logits = self.model(adv_images)
            
            # 辅助函数 f(kappa 控制攻击置信度)
            if targeted:
                # 定向攻击:使目标类别 logit 超过其余类别的最大 logit
                one_hot = F.one_hot(target_labels, num_classes=logits.size(-1)).float()
                other_logits = (1 - one_hot) * logits - one_hot * 1e9
                f = torch.max(other_logits, dim=-1)[0] - logits.gather(1, target_labels.unsqueeze(1)).squeeze(1)
            else:
                # 非定向攻击:使真实类别 logit 低于其余类别的最大 logit
                real_logits = logits.gather(1, target_labels.unsqueeze(1)).squeeze(1)
                other_logits = torch.where(
                    torch.arange(logits.size(-1), device=device).unsqueeze(0) == target_labels.unsqueeze(1),
                    torch.tensor(-1e9, device=device),
                    logits
                )
                f = real_logits - torch.max(other_logits, dim=-1)[0]
            f = torch.clamp(f, min=-self.kappa)
            
            # 扰动幅度(L2 范数平方,使用变换后的 delta)
            delta_reshaped = delta.view(batch_size, -1)
            perturbation_norm = torch.sum(delta_reshaped ** 2, dim=-1)
            
            # 损失函数:标量形式(0.01 为固定的权衡系数 c,原论文中通过二分搜索选取)
            loss = (perturbation_norm + 0.01 * f).sum()
            
            loss.backward()
            optimizer.step()
        
        # 生成最终对抗样本
        delta = 0.5 * (torch.tanh(w.detach()) + 1) - images
        adversarial_images = (images + delta).clamp(0, 1)
        
        return adversarial_images, delta
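
下面是一段调用示意(model、images、target_labels 为假设已准备好的对象,并非原文给定;images 取值范围 [0, 1],target_labels 为希望模型误判成的类别),展示如何发起定向 C&W 攻击并报告平均 L2 扰动:

# 调用示意:对一批图像做定向 C&W L2 攻击
attack = CWL2Attack(model, kappa=0, max_iter=500, learning_rate=0.01)
adv_images, delta = attack.attack(images, target_labels, targeted=True)

# 报告平均 L2 扰动幅度
l2_norms = delta.view(delta.size(0), -1).norm(dim=1)
print(f"平均 L2 扰动: {l2_norms.mean().item():.4f}")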

6. 对抗样本的物理世界攻击

6.1 物理对抗样本的挑战

对抗样本不仅存在于数字世界,还可以被打印出来并对物理世界的感知系统造成威胁。物理攻击需要考虑相机畸变、光照变化、视角变换等多种实际因素。

class PhysicalAdversarialAttack:
    """
    物理世界对抗攻击
    
    考虑因素:
    - 打印后的颜色失真
    - 相机传感器非线性响应
    - 不同距离和角度的变换
    - 随机噪声和模糊
    """
    
    def __init__(self, epsilon=0.1, num_augmentations=20):
        self.epsilon = epsilon
        self.num_augmentations = num_augmentations
    
    def apply_physical_transform(self, images):
        """模拟物理世界变换"""
        batch_size = images.size(0)
        device = images.device
        
        # 随机亮度调整
        brightness = torch.rand(batch_size, 1, 1, 1).to(device) * 0.4 + 0.8
        images = images * brightness
        
        # 随机对比度调整
        contrast = torch.rand(batch_size, 1, 1, 1).to(device) * 0.4 + 0.8
        images = (images - 0.5) * contrast + 0.5
        
        # 随机模糊(模拟对焦不准),用平均池化近似模糊核
        if torch.rand(1).item() > 0.5:
            kernel_size = 5
            images = F.avg_pool2d(images, kernel_size, stride=1,
                                  padding=kernel_size // 2)
        
        return images.clamp(0, 1)
    
    def expectation_over_transformation(self, model, images, labels, criterion):
        """
        EOT(期望变换)方法
        
        优化对抗扰动,使其在多种物理变换下都能保持攻击效果
        """
        images.requires_grad = True
        device = images.device
        
        # 多次采样变换,计算期望损失
        total_loss = 0
        for _ in range(self.num_augmentations):
            transformed_images = self.apply_physical_transform(images)
            outputs = model(transformed_images)
            total_loss += criterion(outputs, labels)
        
        avg_loss = total_loss / self.num_augmentations
        return avg_loss
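
expectation_over_transformation 只计算了期望损失;要得到可用的物理扰动,还需将其嵌入类似 PGD 的迭代更新中。下面是一个示意性的 EOT 攻击循环(model、images、labels 为假设对象,非定向攻击,参数取值仅作演示,并非原文给定):

def eot_attack(model, images, labels, epsilon=0.1, alpha=0.01,
               num_iter=40, num_aug=20):
    """EOT 攻击示意:对随机物理变换下的期望损失做符号梯度上升(非定向)"""
    attacker = PhysicalAdversarialAttack(epsilon=epsilon, num_augmentations=num_aug)
    original = images.detach().clone()
    adv = original.clone()

    for _ in range(num_iter):
        adv = adv.detach()
        adv.requires_grad = True
        # 期望损失的蒙特卡洛估计(多次随机物理变换取平均)
        loss = attacker.expectation_over_transformation(model, adv, labels, F.cross_entropy)
        model.zero_grad()
        loss.backward()

        # 沿增大期望损失的方向更新,并投影回 L∞ 球与有效像素范围
        grad = adv.grad.data
        adv = adv.detach() + alpha * torch.sign(grad)
        adv = torch.maximum(adv, original - epsilon)
        adv = torch.minimum(adv, original + epsilon)
        adv = adv.clamp(0, 1)

    return adv.detach()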

6.2 对抗补丁(Adversarial Patch)

对抗补丁是一种可以在物理世界中打印和使用的对抗攻击,通过在图像任意位置放置一个局部补丁来欺骗分类器。

class AdversarialPatchAttack:
    """
    对抗补丁攻击
    
    核心思想:
    - 生成一个局部补丁(可以是任意形状)
    - 补丁位置可以是随机的
    - 优化补丁图案使其具有最大的"欺骗能力"
    """
    
    def __init__(self, model, patch_size=50, num_classes=1000):
        self.model = model
        self.patch_size = patch_size
        self.num_classes = num_classes
    
    def create_adversarial_patch(self, target_class, iterations=1000):
        """
        生成对抗补丁
        
        参数:
            target_class: 目标攻击类别
            iterations: 优化迭代次数
        """
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # 随机初始化补丁
        patch = torch.rand(3, self.patch_size, self.patch_size).to(device)
        patch.requires_grad = True
        
        optimizer = torch.optim.Adam([patch], lr=0.1)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iterations)
        
        for i in range(iterations):
            optimizer.zero_grad()
            
            # 生成随机背景图像
            background = torch.rand(1, 3, 224, 224).to(device)
            
            # 随机放置补丁位置
            h, w = self.patch_size, self.patch_size
            top = torch.randint(0, 224 - h, (1,)).item()
            left = torch.randint(0, 224 - w, (1,)).item()
            
            # 应用补丁(补丁像素在每步优化后会被裁剪到 [0, 1])
            patched_images = background.clone()
            patched_images[:, :, top:top+h, left:left+w] = patch
            
            # 前向传播
            outputs = self.model(patched_images)
            
            # 定向损失:最小化对目标类别的交叉熵(优化器执行最小化,等价于最大化目标类别概率)
            loss = F.cross_entropy(outputs, torch.tensor([target_class]).to(device))
            
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            # 裁剪补丁值到有效范围
            with torch.no_grad():
                patch.clamp_(0, 1)
            
            if (i + 1) % 100 == 0:
                prob = F.softmax(outputs, dim=-1)[0, target_class].item()
                print(f"Iter {i+1}, Target prob: {prob:.4f}")
        
        return patch.detach()
    
    def apply_patch_to_image(self, image, patch, location='random'):
        """将补丁应用到图像"""
        _, _, h, w = image.shape
        patch_h, patch_w = patch.shape[1:]
        
        if location == 'random':
            top = torch.randint(0, h - patch_h, (1,)).item()
            left = torch.randint(0, w - patch_w, (1,)).item()
        else:
            top, left = location
        
        patched_image = image.clone()
        patched_image[:, :, top:top+patch_h, left:left+patch_w] = patch
        
        return patched_image
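
下面是一段端到端的使用示意(classifier 为假设已加载、接受 224×224 输入的分类器,test_image 为形状 (1, 3, 224, 224)、取值 [0, 1] 的图像,目标类别编号仅作示例,均非原文给定):

# 端到端使用示意:假设 test_image 与 classifier、补丁位于同一设备
patch_attack = AdversarialPatchAttack(classifier, patch_size=50)
patch = patch_attack.create_adversarial_patch(target_class=859, iterations=500)  # 859 仅为示例类别编号

patched = patch_attack.apply_patch_to_image(test_image, patch, location='random')
with torch.no_grad():
    pred = classifier(patched).argmax(dim=1).item()
print(f"贴补丁后的预测类别: {pred}")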

物理攻击的威胁等级

对抗补丁攻击已在多项研究中得到验证:

  • 通过贴在 Stop 标志上的补丁使自动驾驶系统无法识别停车标志
  • 通过佩戴特制眼镜使面部识别系统误识别为特定目标
  • 通过打印的补丁使图像分类器输出错误类别

7. 对抗样本的高级攻击技术

7.1 HopSkipJumpAttack

HopSkipJumpAttack 是一种基于决策的黑盒攻击方法,只需要查询模型输出的预测类别(硬标签):

def hopskipjump_attack(model, original_image, target_class=None, 
                       max_queries=10000, epsilon=1.0):
    """
    HopSkipJumpAttack:基于决策的黑盒攻击(简化示意实现)
    
    特点:
    - 只需要访问模型的预测类别(硬标签)
    - 使用二分搜索估计决策边界
    - 查询效率较高
    """
    device = original_image.device
    dim = original_image.view(-1).shape[0]
    
    # 记录模型对原始图像的预测类别,作为攻击是否成功的判断依据
    with torch.no_grad():
        original_pred = model(original_image.unsqueeze(0)).argmax().item()
    
    # 初始化:使用高斯噪声
    perturbation = torch.randn_like(original_image) * 0.5
    perturbed = (original_image + perturbation).clamp(0, 1)
    
    for query in range(max_queries):
        # 获取当前扰动的决策
        with torch.no_grad():
            current_pred = model(perturbed.unsqueeze(0)).argmax().item()
        
        # 检查是否达到目标
        if target_class is not None and current_pred == target_class:
            break
        elif target_class is None and current_pred != original_pred:
            break
        
        # 估计步长(二分搜索)
        step_size = epsilon / 2
        for _ in range(10):
            test_perturbation = perturbation * (1 - step_size)
            test_perturbed = (original_image + test_perturbation).clamp(0, 1)
            
            with torch.no_grad():
                test_pred = model(test_perturbed.unsqueeze(0)).argmax().item()
            
            # 若缩小后的扰动仍能达成攻击目标,则接受并停止搜索;否则减半缩小幅度重试
            if (target_class is not None and test_pred == target_class) or \
               (target_class is None and test_pred != original_pred):
                perturbation = test_perturbation
                perturbed = test_perturbed
                break
            else:
                step_size *= 0.5
        
        # 梯度估计(使用有限差分)
        delta = 0.001
        gradient_estimate = torch.zeros_like(original_image)
        for i in range(min(dim, 100)):  # 随机选择维度
            idx = torch.randint(0, dim, (1,)).item()
            
            pos_perturbation = perturbation.clone().view(-1)
            pos_perturbation[idx] += delta
            pos_perturbed = (original_image + pos_perturbation.view_as(original_image)).clamp(0, 1)
            
            with torch.no_grad():
                pos_pred = model(pos_perturbed.unsqueeze(0)).argmax().item()
            
            gradient_estimate.view(-1)[idx] = (1 if (target_class is not None and pos_pred == target_class) or
                                               (target_class is None and pos_pred != original_pred) else 0)
        
        # 更新扰动
        perturbation = perturbation + 0.01 * gradient_estimate * torch.sign(gradient_estimate)
        perturbation = torch.clamp(perturbation, -epsilon, epsilon)
        perturbed = (original_image + perturbation).clamp(0, 1)
    
    return perturbed, perturbation
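
一个简单的调用示意(model 与 image 为假设对象,并非原文给定;image 形状为 (C, H, W) 且取值在 [0, 1]):

# 非定向黑盒攻击的调用示意
adv_image, perturbation = hopskipjump_attack(model, image, target_class=None,
                                             max_queries=2000, epsilon=0.5)
with torch.no_grad():
    print("原始预测:", model(image.unsqueeze(0)).argmax().item())
    print("攻击后预测:", model(adv_image.unsqueeze(0)).argmax().item())
print("L∞ 扰动幅度:", perturbation.abs().max().item())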

8. 学术引用与参考文献

  1. Szegedy, C., et al. (2013). “Intriguing properties of neural networks.” arXiv:1312.6199.
  2. Goodfellow, I. J., et al. (2015). “Explaining and Harnessing Adversarial Examples.” ICLR.
  3. Madry, A., et al. (2018). “Towards Deep Learning Models Resistant to Adversarial Attacks.” ICLR.
  4. Carlini, N., & Wagner, D. (2017). “Towards Evaluating the Robustness of Neural Networks.” IEEE S&P.
  5. Kurakin, A., et al. (2016). “Adversarial examples in the physical world.” ICLR Workshop.
  6. Brown, T. B., et al. (2017). “Adversarial Patch.” arXiv:1712.09665.
  7. Chen, J., Jordan, M. I., & Wainwright, M. J. (2020). “HopSkipJumpAttack: A Query-Efficient Decision-Based Attack.” IEEE S&P.

9. 相关文档