Adversarial Attack and Defense in Practice: From FGSM to Certified Robustness

Keywords

No.   Term
1     Adversarial example
2     FGSM (Fast Gradient Sign Method)
3     PGD (Projected Gradient Descent)
4     C&W attack (Carlini & Wagner Attack)
5     Adversarial patch
6     Adversarial training
7     Certified robustness
8     Transfer attack

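1. What Is an Adversarial Example?

1.1 An Unsettling Phenomenon

In 2014, Christian Szegedy et al. reported a strange phenomenon: adding noise that is almost invisible to the human eye can make a CNN misclassify an image. The best-known illustration comes from the follow-up work of Goodfellow et al.: after such noise is added to a picture of a panda, the model labels it "gibbon" with 99.3% confidence.

What does this mean? Deep learning systems that look powerful can be surprisingly brittle: a small, carefully chosen change to the input is enough to make the model's prediction wrong.

An adversarial example is exactly this kind of input: hard for a human to tell apart from the original, yet misclassified by the model. In mathematical terms:

Definition of an adversarial example:
Given a classifier f: X → Y and an input x ∈ X,
find a perturbation δ such that:
1. ||δ|| < ε (the perturbation is small and hard for a human to notice)
2. f(x) ≠ f(x + δ) (the model's prediction changes)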
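The two conditions in the definition can be checked mechanically. The following is a minimal sketch in plain Python, assuming an L∞ norm and a hypothetical two-feature linear classifier (`toy_classifier` and its weights are made up for illustration):

```python
from typing import Callable, Sequence


def linf_norm(delta: Sequence[float]) -> float:
    """L-infinity norm: the largest absolute change in any coordinate."""
    return max(abs(d) for d in delta)


def is_adversarial(f: Callable[[Sequence[float]], int],
                   x: Sequence[float],
                   delta: Sequence[float],
                   eps: float) -> bool:
    """True if delta is small (condition 1) and flips the prediction (condition 2)."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return linf_norm(delta) < eps and f(x) != f(x_adv)


def toy_classifier(x: Sequence[float]) -> int:
    """Hypothetical linear classifier: class 1 if x[0] - x[1] > 0, else class 0."""
    w = [1.0, -1.0]
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0


x = [0.52, 0.50]          # close to the decision boundary, classified as 1
delta = [-0.03, 0.03]     # small change that crosses the boundary
print(is_adversarial(toy_classifier, x, delta, eps=0.05))
print(is_adversarial(toy_classifier, x, [0.0, 0.0], eps=0.05))
```

The same check works for any classifier that maps an input to a label; only the norm would change for L2 or L0 threat models.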

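1.2 Why Do Adversarial Examples Exist?

There are several complementary explanations:

Linearity view: a neural network is nonlinear overall, but activation functions such as ReLU are piecewise linear, so the network behaves linearly over large regions of the input space. In a high-dimensional space, many tiny per-coordinate changes that all push along the gradient direction add up to a large change in the output.

Linear explanation:
output = w · (x + δ) = w · x + w · δ

Choose δ = ε · sign(w), so that ||δ||_∞ = ε. Then
w · δ = ε · ||w||_1 = ε · n · m,
where n is the dimension of w and m is the average magnitude of its entries.

If n is large (say 1,000,000), then even with a tiny ε (0.001)
and m ≈ 1, w · δ ≈ 1,000,000 × 0.001 = 1000.

In other words, a perturbation that is tiny in every coordinate is amplified in proportion to the input dimension.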
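The amplification effect is easy to reproduce numerically. The sketch below assumes a weight vector whose entries are all ±1 (an illustrative choice, not a property of real networks) and shows that w · δ grows linearly with the dimension while ||δ||_∞ stays fixed at ε:

```python
def worst_case_shift(w, eps):
    """w · delta for delta = eps * sign(w), which equals eps * ||w||_1."""
    delta = [eps if wi >= 0 else -eps for wi in w]
    return sum(wi * di for wi, di in zip(w, delta))


eps = 0.001
for n in (1_000, 100_000, 1_000_000):
    # Alternating +1 / -1 weights: average magnitude m = 1
    w = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
    print(f"n = {n:>9,d}  ->  w · delta = {worst_case_shift(w, eps):.3f}")
```

Every coordinate of the input changes by only 0.001, yet the change in the output scales with n.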

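Decision-boundary view: the classifier's decision boundary forms a complex surface in input space. Adversarial examples are points that have been deliberately constructed to sit just across that boundary.

Data-distribution view: the training data are only sparse samples of the input space, and the model learns how to behave near those samples. Adversarial examples arise in regions that the training data do not cover well.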

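1.3 Types of Adversarial Examples

Adversarial examples can be classified along several axes:

Criterion             Types                                   Characteristics
Attacker knowledge    White-box / black-box                   White-box attackers know the model parameters; black-box attackers only see inputs and outputs
Attack goal           Targeted / untargeted                   Targeted attacks force a specific class; untargeted attacks only need a wrong prediction
Perturbation scope    Pixel-level / patch-level / physical    A patch modifies only a local region; physical attacks can be printed out
Perturbation norm     Small L2 / small L∞ / sparse L0         Different constraints call for different generation methods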
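To make the last row concrete, here is a small sketch of the three norms applied to two hypothetical perturbations: a dense one that changes every coordinate slightly, and a sparse one that changes a single coordinate a lot.

```python
import math


def l0_norm(delta):
    """Number of coordinates that were changed at all."""
    return sum(1 for d in delta if d != 0)


def l2_norm(delta):
    """Euclidean length of the perturbation."""
    return math.sqrt(sum(d * d for d in delta))


def linf_norm(delta):
    """Largest change in any single coordinate."""
    return max(abs(d) for d in delta)


dense = [0.01] * 100            # every coordinate nudged slightly
sparse = [0.0] * 99 + [1.0]     # one coordinate changed completely

for name, delta in (("dense", dense), ("sparse", sparse)):
    print(name, l0_norm(delta), round(l2_norm(delta), 4), linf_norm(delta))
```

The dense perturbation is small in L∞ but large in L0, and the sparse one is the opposite, which is why an attack tuned for one norm is not automatically efficient under another.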

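2. FGSM: The Fast Gradient Sign Method

2.1 How the Algorithm Works

FGSM (Fast Gradient Sign Method), proposed by Goodfellow et al., is the simplest and most classic adversarial attack. Its core idea is:

Take a single step of size ε along the sign of the gradient of the loss with respect to the input.

FGSM algorithm:
x_adv = x + ε · sign(∇_x J(θ, x, y))

where:
- x: original input
- y: label of x
- θ: model parameters
- x_adv: adversarial example
- ε: perturbation magnitude (a hyperparameter)
- J: loss function
- ∇_x J: gradient of the loss with respect to the input
- sign: sign function

Why use sign(gradient) rather than the gradient itself? Under an L∞ budget ||δ||_∞ ≤ ε, the perturbation that maximizes the first-order approximation of the loss is δ = ε · sign(∇_x J). The sign function maps each gradient entry to ±1 (or 0 where the gradient is exactly zero), so each pixel moves up or down by exactly ε and the whole budget is used.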
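Before the PyTorch version, the update rule can be traced by hand on a model small enough to differentiate analytically. This is a sketch on a hypothetical three-feature logistic regression (weights and input are made up); for this model the gradient of the cross-entropy loss with respect to the input is (p - y) · w.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to x: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """x_adv = clip(x + eps * sign(grad), 0, 1)."""
    grad = input_gradient(w, b, x, y)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]


w, b = [2.0, -3.0, 1.0], 0.0     # hypothetical trained weights
x, y = [0.6, 0.3, 0.2], 1        # input with true label 1

x_adv = fgsm(w, b, x, y, eps=0.1)
print("clean p(class 1):", round(predict_proba(w, b, x), 3))
print("adv   p(class 1):", round(predict_proba(w, b, x_adv), 3))
```

Each feature moves by at most 0.1, in the direction that increases the loss, and the predicted class flips from 1 to 0.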

2.2 Implementation

"""
FGSM adversarial attack implementation
"""
 
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
 
def fgsm_attack(image: torch.Tensor, 
                epsilon: float, 
                gradient: torch.Tensor) -> torch.Tensor:
    """
    One FGSM step.
    
    Args:
        image: original image tensor, [C, H, W] or [B, C, H, W], values in [0, 1]
        epsilon: perturbation magnitude
        gradient: gradient of the loss with respect to the input
    
    Returns:
        The adversarial example.
    """
    # Perturbation direction: the sign of the gradient
    perturbation = epsilon * torch.sign(gradient)
    
    # Add the perturbation
    adversarial_image = image + perturbation
    
    # Clip to the valid range (this code assumes images scaled to [0, 1])
    adversarial_image = torch.clamp(adversarial_image, 0, 1)
    
    return adversarial_image


def compute_adversarial_loss(model: nn.Module,
                             image: torch.Tensor,
                             target: torch.Tensor,
                             targeted: bool = False) -> torch.Tensor:
    """
    Loss whose gradient the attack ascends.
    
    targeted=False: `target` is the true label; ascending the cross-entropy
                    lowers the probability of the true class.
    targeted=True:  `target` is the class to force; ascending the negative
                    cross-entropy raises the probability of that class.
    """
    output = model(image)
    loss = F.cross_entropy(output, target)
    return -loss if targeted else loss


def fgsm_attack_wrapper(model: nn.Module,
                        image: torch.Tensor,
                        target: torch.Tensor,
                        epsilon: float,
                        targeted: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Complete FGSM attack: compute the input gradient, then take one step.
    
    Returns:
        (adversarial example, input gradient)
    """
    # Work on a detached copy so the caller's tensor is left untouched
    image = image.clone().detach().requires_grad_(True)
    
    # Forward pass and attack loss (the sign depends on targeted/untargeted)
    loss = compute_adversarial_loss(model, image, target, targeted)
    
    # Backward pass: gradient of the loss with respect to the input
    model.zero_grad()
    loss.backward()
    gradient = image.grad.detach()
    
    # Generate the adversarial example
    adversarial_image = fgsm_attack(image.detach(), epsilon, gradient)
    
    return adversarial_image, gradient
 
 
def evaluate_attack(model: nn.Module,
                   images: torch.Tensor,
                   labels: torch.Tensor,
                   epsilon: float) -> dict:
    """
    Evaluate the effect of an FGSM attack.
    """
    model.eval()
    
    correct_before = 0
    correct_after = 0
    
    for i in range(len(images)):
        img = images[i:i+1].clone()
        label = labels[i:i+1]
        
        # Accuracy on the clean input
        with torch.no_grad():
            output = model(img)
            pred_before = output.argmax(dim=1)
            correct_before += (pred_before == label).item()
        
        # Generate the adversarial example
        adv_img, _ = fgsm_attack_wrapper(model, img, label, epsilon)
        
        # Accuracy on the adversarial input
        with torch.no_grad():
            output = model(adv_img)
            pred_after = output.argmax(dim=1)
            correct_after += (pred_after == label).item()
    
    return {
        "accuracy_before": correct_before / len(images),
        "accuracy_after": correct_after / len(images),
        # Approximate share of originally correct predictions that the attack flips
        "attack_success_rate": 1 - (correct_after / correct_before) if correct_before > 0 else 0,
        "epsilon": epsilon
    }


def demo_fgsm():
    """FGSM attack demo"""
    print("FGSM attack demo")
    print("=" * 60)
    print("""
    FGSM attack steps:
    
    1. Take the original image x and its true label y
    
    2. Compute the gradient ∇_x J of the loss J(θ, x, y) with respect to the input
    
    3. Perturbation δ = ε × sign(∇_x J)
    
    4. Adversarial example x' = x + δ
    
    5. clip(x', 0, 1) keeps the pixel values valid
    
    Example:
    - If the gradient at a pixel is positive, increasing that pixel increases the loss
    - FGSM therefore increases that pixel by ε
    - A pixel with a negative gradient is decreased by ε
    """)
    
    # Simulated computation (random numbers stand in for a real gradient)
    print("\nSimulated computation:")
    batch_size, channels, height, width = 1, 3, 224, 224
    epsilon = 0.007  # the value used in the panda example of Goodfellow et al.
    
    # Simulated gradient
    gradient = torch.randn(batch_size, channels, height, width)
    perturbation = epsilon * torch.sign(gradient)
    
    print(f"Perturbation magnitude epsilon: {epsilon}")
    print(f"Gradient norm (L2): {torch.norm(gradient).item():.4f}")
    print(f"Perturbation norm (L2): {torch.norm(perturbation).item():.4f}")
    print(f"Perturbation norm (L∞): {torch.max(torch.abs(perturbation)).item():.4f}")


if __name__ == "__main__":
    demo_fgsm()

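2.3 Strengths and Weaknesses of FGSM

Strengths

  • Fast: one forward pass and one backward pass
  • Simple and well motivated
  • The basis of many more elaborate attacks

Weaknesses

  • Single-step, so the attack is often not strong enough
  • Much less effective against some defenses, such as adversarial training
  • Every pixel moves by exactly ±ε, so the perturbation cannot be fine-tuned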

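3. PGD: A Strong Multi-Step Attack

3.1 Why a Multi-Step Attack?

FGSM attacks in a single step, and one step is sometimes not enough:

  1. The decision boundary can be complex and take several steps to cross
  2. Some defenses make single-step attacks fail
  3. The perturbation can get "stuck" after a single step

The PGD (Projected Gradient Descent) attack addresses these problems. It is essentially an iterated FGSM with a projection after every step:

PGD algorithm:
x_0 = x  # original image
for t = 1 to T:
    x_t = Π_{x + S}(x_{t-1} + α · sign(∇_x J(θ, x_{t-1}, y)))

where:
- α: step size per iteration (usually somewhat larger than ε/T, for example α = ε/4 with 10 or more steps, so that the iterates can reach the boundary of the ε-ball)
- Π: projection that keeps the perturbation inside the allowed set
- S: the allowed perturbation set (usually an L∞ ball of radius ε)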
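The loop above can be traced on a tiny model without any deep learning framework. This is a sketch, assuming a hypothetical logistic regression with made-up weights, an L∞ ball for S, and inputs in [0, 1]; the projection Π is a per-coordinate clip.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to x: (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def project(x_adv, x, eps):
    """Projection onto the L-infinity ball around x, then onto [0, 1]."""
    inside_ball = [min(max(v, xi - eps), xi + eps) for v, xi in zip(x_adv, x)]
    return [min(1.0, max(0.0, v)) for v in inside_ball]


def pgd(w, b, x, y, eps, alpha, steps):
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(w, b, x_adv, y)
        stepped = [v + alpha * sign(g) for v, g in zip(x_adv, grad)]
        x_adv = project(stepped, x, eps)
    return x_adv


w, b = [2.0, -3.0, 1.0], 0.0     # hypothetical trained weights
x, y = [0.6, 0.3, 0.2], 1

x_adv = pgd(w, b, x, y, eps=0.1, alpha=0.025, steps=10)
print("clean p(class 1):", round(predict_proba(w, b, x), 3))
print("adv   p(class 1):", round(predict_proba(w, b, x_adv), 3))
```

With α = ε/4 the iterate reaches the boundary of the ε-ball after four steps, and the projection keeps it there for the remaining iterations. For a linear model this gives the same result as FGSM; the benefit of iterating shows up on nonlinear models, where the gradient changes along the path.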

3.2 Implementation

"""
PGD adversarial attack implementation
"""
 
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, Optional
 
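class PGDAttack:
    """
    PGD attack
    
    PGD is the iterated version of FGSM and is widely used as a strong
    first-order L∞ attack.
    """
    
    def __init__(self,
                 model: nn.Module,
                 epsilon: float = 0.3,
                 alpha: float = 0.01,
                 num_iter: int = 40,
                 random_start: bool = True):
        """
        Args:
            model: the model under attack
            epsilon: maximum perturbation magnitude (L∞ norm)
            alpha: step size per iteration
            num_iter: number of iterations
            random_start: whether to start from a random point in the ε-ball
        """
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.random_start = random_start
    
    def attack(self,
              images: torch.Tensor,
              labels: torch.Tensor,
              targeted: bool = False) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Run the PGD attack.
        
        For a targeted attack, `labels` holds the target classes.
        Returns (adversarial images, original images).
        """
        # Keep the clean images: they define the center of the ε-ball.
        # Detached copies leave the caller's tensor untouched.
        original_images = images.clone().detach()
        images = original_images.clone()
        
        # Optionally start from a random point inside the ε-ball
        if self.random_start:
            images = images + torch.zeros_like(images).uniform_(
                -self.epsilon, self.epsilon
            )
            # Project back onto the valid pixel range
            images = torch.clamp(images, 0, 1)
        
        # Iterative attack
        for i in range(self.num_iter):
            images.requires_grad_(True)
            
            # Forward pass
            outputs = self.model(images)
            
            # Loss to ascend: towards the target class, or away from the true class
            if targeted:
                loss = -F.cross_entropy(outputs, labels)
            else:
                loss = F.cross_entropy(outputs, labels)
            
            # Backward pass
            self.model.zero_grad()
            loss.backward()
            
            # One step of gradient ascent
            gradient = images.grad.detach()
            images = images.detach()
            images = images + self.alpha * torch.sign(gradient)
            
            # Projection: stay within the ε-neighborhood of the original
            # image and within the valid pixel range [0, 1]
            images = torch.clamp(
                torch.max(
                    original_images - self.epsilon,
                    torch.min(
                        original_images + self.epsilon,
                        images
                    )
                ),
                0, 1
            )
        
        return images, original_images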
    
    def attack_batch(self,
                    images: torch.Tensor,
                    labels: torch.Tensor,
                    batch_size: int = 32) -> torch.Tensor:
        """
        Attack in mini-batches to limit memory use
        """
        adversarial_images = []
        
        for i in range(0, len(images), batch_size):
            batch_images = images[i:i+batch_size]
            batch_labels = labels[i:i+batch_size]
            
            adv_images, _ = self.attack(batch_images, batch_labels)
            adversarial_images.append(adv_images)
        
        return torch.cat(adversarial_images, dim=0)
 
 
class TargetedPGDAttack(PGDAttack):
    """
    Targeted PGD attack
    """
    
    def __init__(self, *args, target_classes: Optional[torch.Tensor] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.target_classes = target_classes
    
    def attack(self, images, labels=None, targeted=True):
        # Use the preset target classes, or pick random ones
        if self.target_classes is None:
            # Infer the number of classes from the model's output shape
            with torch.no_grad():
                num_classes = self.model(images[:1]).shape[1]
            target_classes = torch.randint(
                0, num_classes, (images.shape[0],), device=images.device
            )
            if labels is not None:
                # Shift any target that equals the true label to the next class
                same = target_classes == labels
                target_classes[same] = (target_classes[same] + 1) % num_classes
        else:
            target_classes = self.target_classes
        
        return super().attack(images, target_classes, targeted=True)
 
 
def compare_fgsm_vs_pgd(model, images, labels, epsilon):
    """
    Compare the effect of FGSM and PGD attacks.
    
    Relies on fgsm_attack_wrapper from the FGSM listing in section 2.2.
    """
    # Evaluation mode for both attack generation and scoring
    model.eval()
    
    # FGSM attack
    fgsm_images = []
    for i in range(len(images)):
        img = images[i:i+1].clone()
        label = labels[i:i+1]
        adv_img, _ = fgsm_attack_wrapper(model, img, label, epsilon)
        fgsm_images.append(adv_img)
    fgsm_images = torch.cat(fgsm_images, dim=0)
    
    # PGD attack
    pgd_attack = PGDAttack(model, epsilon=epsilon, num_iter=40)
    pgd_images, _ = pgd_attack.attack(images.clone(), labels)
    
    with torch.no_grad():
        # Accuracy on clean images
        clean_acc = (model(images).argmax(1) == labels).float().mean().item()
        
        # Accuracy after FGSM
        fgsm_acc = (model(fgsm_images).argmax(1) == labels).float().mean().item()
        
        # Accuracy after PGD
        pgd_acc = (model(pgd_images).argmax(1) == labels).float().mean().item()
    
    results = {
        "clean_accuracy": clean_acc,
        "fgsm_accuracy": fgsm_acc,
        "pgd_accuracy": pgd_acc,
        "fgsm_attack_success": 1 - fgsm_acc,
        "pgd_attack_success": 1 - pgd_acc,
    }
    
    return results
 
 
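def demo_pgd():
    """PGD attack demo"""
    import math
    
    print("PGD attack demo")
    print("=" * 60)
    print("""
    PGD vs FGSM:
    
    FGSM (single step):
    x' = x + ε · sign(∇J)
    
    PGD (multiple steps):
    for t in 1..T:
        x_t = clip(x_{t-1} + α·sign(∇J(x_{t-1})))
        x_t = project(x_t, x + ε)
    
    Why is PGD stronger?
    1. Multiple iterations explore the decision boundary more thoroughly
    2. A random start helps the attack avoid poor starting points
    3. The projection keeps the perturbation within the allowed budget
    
    Common settings:
    - epsilon = 8/255 ≈ 0.031 (a common choice on CIFAR-10)
    - alpha = epsilon / 4
    - num_iter = 10-40
    """)
    
    print("\nEffect of the number of iterations (made-up curve, not a measurement):")
    iterations = [1, 5, 10, 20, 40, 100]
    
    # Illustrative decay curve: accuracy drops quickly at first, then levels off,
    # mimicking the way PGD typically stops improving after a few dozen iterations
    for iters in iterations:
        simulated_acc = 0.95 * (1 - 0.9 * (1 - math.exp(-iters / 10)))
        print(f"  Iterations {iters:3d}: model accuracy ≈ {simulated_acc:.3f}")


if __name__ == "__main__":
    demo_pgd()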

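3.3 Why Is PGD Regarded as the Strongest L∞ Attack?

Madry et al. argued, on the basis of extensive experiments, that PGD is the "strongest" first-order attack under an L∞ constraint:

A model that withstands PGD is expected to withstand other attacks that rely only on first-order gradients.

This matters because it gives adversarial defense a concrete benchmark: PGD is the standard attack to train and evaluate against. The claim is empirical rather than a proven guarantee, so a careful evaluation still includes other attacks, such as C&W and black-box attacks.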

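4. The C&W Attack: A Stealthier Low-Norm Attack

4.1 How the C&W Attack Works

The C&W attack, proposed by Carlini and Wagner in 2016, looks for the adversarial perturbation with the smallest norm that still fools the model:

C&W optimization objective:
minimize ||δ||_p + c · f(x + δ)

subject to: x + δ ∈ [0, 1]^n

where f is a carefully designed loss function:
f(x') = max(max_{i≠t} Z(x')_i - Z(x')_t, -κ)

- Z(x') are the logits
- t is the target class
- κ is the confidence parameter: the loss keeps decreasing until the target logit exceeds every other logit by at least κ
- c > 0 balances perturbation size against attack success and is usually found by binary search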
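The targeted margin loss f(x') = max(max_{i≠t} Z(x')_i - Z(x')_t, -κ) from the C&W paper can be evaluated directly on a list of logits. A minimal sketch follows; the function name `cw_margin_loss` and the example logits are made up for illustration.

```python
def cw_margin_loss(logits, target, kappa=0.0):
    """Targeted C&W loss: positive while some other class still beats the target."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


logits = [2.0, 5.0, 1.0]

# Target class 0 is still losing to class 1 by 3 logits: the attack has work to do
print(cw_margin_loss(logits, target=0))

# Target class 1 already wins by 3 logits; with kappa = 2 the loss bottoms out at -2
print(cw_margin_loss(logits, target=1, kappa=2.0))
```

The lower bound -κ is what stops the optimizer from spending perturbation budget on confidence beyond the requested margin.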

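Why the C&W attack often finds smaller perturbations than FGSM/PGD:

  1. Direct optimization: it minimizes the perturbation norm directly instead of taking fixed-size gradient steps
  2. Flexible norms: variants exist for the L0, L2, and L∞ norms
  3. Better optimization: it uses better initialization and optimization strategies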
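One of those optimization strategies is a change of variables that removes the box constraint x + δ ∈ [0, 1]^n: the attack optimizes an unconstrained variable w and sets x' = 0.5 · (tanh(w) + 1). The sketch below, in plain Python with hypothetical helper names, shows that any w maps into [0, 1] and that starting from w = arctanh(2x - 1) reproduces the original image.

```python
import math


def to_tanh_space(x, margin=1e-6):
    """w = arctanh(2x - 1); the margin avoids arctanh(±1) = ±infinity."""
    return [math.atanh(min(max(2 * xi - 1, -1 + margin), 1 - margin)) for xi in x]


def from_tanh_space(w):
    """x' = 0.5 * (tanh(w) + 1) always lies in [0, 1]."""
    return [0.5 * (math.tanh(wi) + 1) for wi in w]


pixels = [0.0, 0.25, 0.5, 1.0]
w = to_tanh_space(pixels)
print(from_tanh_space(w))                    # approximately the original pixels
print(from_tanh_space([-50.0, 0.0, 50.0]))   # extreme w still gives valid pixels
```

Because the constraint is satisfied by construction, an unconstrained optimizer such as Adam can be applied to w without clipping after each step.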

4.2 Implementation

"""
C&W adversarial attack implementation (L2 norm)
"""
 
import torch
import torch.nn as nn
import torch.optim as optim
from typing import Optional, Callable
 
class CWAttack:
    """
    Carlini & Wagner L2 attack
    
    Properties:
    1. Optimizes for a minimum-norm perturbation
    2. Often harder to defend against than FGSM/PGD
    3. Supports targeted and untargeted attacks
    """
    
    def __init__(self,
                 model: nn.Module,
                 targeted: bool = False,
                 confidence: float = 0,
                 initial_const: float = 0.001,
                 max_iterations: int = 1000,
                 learning_rate: float = 0.01,
                 binary_search_steps: int = 9):
        self.model = model
        self.targeted = targeted
        self.confidence = confidence
        self.initial_const = initial_const
        self.max_iterations = max_iterations
        self.learning_rate = learning_rate
        self.binary_search_steps = binary_search_steps
    
    def _logits_to_attack_loss(self, 
                               logits: torch.Tensor, 
                               target: torch.Tensor) -> torch.Tensor:
        """
        C&W margin loss, shifted so that 0 means "attack succeeded".
        
        `target` holds the target classes for a targeted attack and the
        true labels for an untargeted attack.
        """
        # One-hot mask selecting the logit of the class given in `target`
        one_hot_target = torch.zeros_like(logits).scatter_(
            1, target.unsqueeze(1), 1
        )
        
        # Logit of that class
        target_logits = (logits * one_hot_target).sum(dim=1)
        
        # Largest logit among all other classes
        other_logits = logits - one_hot_target * 1e9
        max_other_logits = other_logits.max(dim=1)[0]
        
        if self.targeted:
            # Targeted: the target logit must exceed every other logit by `confidence`
            return torch.clamp(max_other_logits - target_logits + self.confidence, min=0)
        else:
            # Untargeted: some other logit must exceed the true-class logit by `confidence`
            return torch.clamp(target_logits - max_other_logits + self.confidence, min=0)
    
    def attack(self,
              images: torch.Tensor,
              labels: torch.Tensor,
              verbose: bool = False) -> torch.Tensor:
        """
        Run the C&W attack.
        
        `images` is a [B, C, H, W] batch in [0, 1]; `labels` holds the true
        labels (untargeted) or the target classes (targeted).
        """
        device = images.device
        batch_size = images.shape[0]
        
        # Change of variables: x_adv = 0.5 * (tanh(w) + 1) always lies in [0, 1].
        # w starts at arctanh(2x - 1), so the search starts at the clean image.
        w_init = torch.atanh((images * 2 - 1).clamp(-1 + 1e-6, 1 - 1e-6)).detach()
        
        # Per-sample binary search for the constant c, which balances
        # perturbation size against attack success
        c_lower = torch.zeros(batch_size, device=device)
        c_upper = torch.full((batch_size,), 1e10, device=device)
        c = torch.full((batch_size,), self.initial_const, device=device)
        
        # Best result found so far (falls back to the clean image on failure)
        best_adversarial = images.clone()
        best_L2 = torch.full((batch_size,), float('inf'), device=device)
        
        for search_step in range(self.binary_search_steps):
            if verbose:
                print(f"Binary search step {search_step + 1}/{self.binary_search_steps}")
            
            # Restart the optimization from the clean image for each value of c
            w = w_init.clone().requires_grad_(True)
            optimizer = optim.Adam([w], lr=self.learning_rate)
            
            for iteration in range(self.max_iterations):
                optimizer.zero_grad()
                
                # Adversarial example in image space
                adversarial = torch.tanh(w) * 0.5 + 0.5
                
                # Squared L2 distance per sample
                L2_dist = torch.sum((adversarial - images) ** 2, dim=[1, 2, 3])
                
                # Attack loss
                logits = self.model(adversarial)
                attack_loss = self._logits_to_attack_loss(logits, labels)
                
                # Total loss: squared L2 distance + c * attack loss
                total_loss = L2_dist + c * attack_loss
                
                mean_loss = total_loss.mean()
                mean_loss.backward()
                optimizer.step()
                
                if verbose and (iteration + 1) % 200 == 0:
                    print(f"  Iteration {iteration+1}, loss: {mean_loss.item():.4f}")
            
            # Re-evaluate the final iterate
            with torch.no_grad():
                adversarial = torch.tanh(w) * 0.5 + 0.5
                L2_dist = torch.sum((adversarial - images) ** 2, dim=[1, 2, 3])
                attack_loss = self._logits_to_attack_loss(self.model(adversarial), labels)
            
            # Keep the smallest successful perturbation per sample
            success = attack_loss == 0
            improved = (L2_dist < best_L2) & success
            best_adversarial[improved] = adversarial[improved]
            best_L2[improved] = L2_dist[improved]
            
            # Binary search: on success try a smaller c, on failure a larger one
            c_upper = torch.where(success, torch.min(c_upper, c), c_upper)
            c_lower = torch.where(success, c_lower, torch.max(c_lower, c))
            has_upper = c_upper < 1e9
            c = torch.where(has_upper, (c_lower + c_upper) / 2, c * 10)
        
        return best_adversarial
 
 
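def demo_cw():
    """C&W attack demo"""
    print("C&W attack demo")
    print("=" * 60)
    print("""
    C&W vs FGSM/PGD:
    
    FGSM/PGD:
    - Use a fixed step size
    - Minimizing the perturbation is not the main objective
    - May add perturbation where none is needed
    
    C&W:
    - Optimizes the perturbation norm directly
    - Minimizes ||δ||₂ + c · f(x+δ)
    - Perturbations are smaller and harder to notice
    
    Typical ordering of attack strength (not guaranteed):
    FGSM < PGD < C&W
    (C&W tends to find the smallest perturbations because it optimizes for them directly)
    """)
    
    # The numbers below are made up to illustrate the typical trend;
    # they are not measurements.
    print("\nIllustrative attack results (not measured):")
    print("-" * 40)
    print(f"{'Attack':<15} {'L2 norm':<15} {'Success':<10}")
    print("-" * 40)
    print(f"{'FGSM':<15} {'0.032':<15} {'78%':<10}")
    print(f"{'PGD':<15} {'0.028':<15} {'85%':<10}")
    print(f"{'C&W':<15} {'0.021':<15} {'92%':<10}")


if __name__ == "__main__":
    demo_cw()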

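5. Black-Box Attacks and Transfer Attacks

5.1 How Black-Box Attacks Work

In practice, an attacker usually does not know the target model's parameters or architecture. A black-box attack is one mounted under these conditions.

Black-box attacks exploit two key properties:

1. Decision/gradient queries: the attacker can query the model with inputs and observe its outputs, and infer the gradient direction from how the output changes.

2. Transferability: adversarial examples for different models overlap. An adversarial example generated on one model often fools other models as well.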
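The first property can be illustrated with a finite-difference estimate: query the model at x ± h along each coordinate and use the difference in scores as a gradient estimate. This is a sketch, assuming the attacker can read a scalar score from the model; `black_box_score` is a made-up stand-in for that query interface.

```python
def estimate_gradient(query, x, h=1e-4):
    """Central finite differences: two queries per input coordinate."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((query(plus) - query(minus)) / (2 * h))
    return grad


def black_box_score(x):
    """Hypothetical model score; the attacker sees only its output."""
    return 3.0 * x[0] - 2.0 * x[1] + 0.5 * x[2] ** 2


print(estimate_gradient(black_box_score, [1.0, 1.0, 2.0]))
```

The cost is 2n queries for an n-dimensional input, which is why practical query-based attacks on images estimate the gradient from random directions or a subset of coordinates instead of every pixel.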

5.2 Transfer Attack Implementation

"""
Transfer attack: exploiting the transferability of adversarial examples
"""
 
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple, List
import numpy as np
 
class TransferAttack:
    """
    Transfer-based adversarial attack
    
    Strategy:
    1. Train a surrogate model
    2. Generate adversarial examples on the surrogate model
    3. Use their transferability to attack the target model
    """
    
    def __init__(self,
                 surrogate_model: nn.Module,
                 epsilon: float = 0.03,
                 num_iter: int = 10,
                 alpha: float = 0.003):
        self.surrogate = surrogate_model
        self.epsilon = epsilon
        self.num_iter = num_iter
        self.alpha = alpha
    
    def generate(self,
                images: torch.Tensor,
                labels: torch.Tensor,
                attack_method: str = "pgd") -> torch.Tensor:
        """
        Generate transferable adversarial examples.
        
        Args:
            attack_method: 'fgsm', 'pgd', or 'mim' (momentum iterative)
        """
        if attack_method == "fgsm":
            return self._fgsm(images, labels)
        elif attack_method == "pgd":
            return self._pgd(images, labels)
        elif attack_method == "mim":
            return self._mim(images, labels)
        else:
            raise ValueError(f"Unknown attack method: {attack_method}")
    
    def _project(self, adversarial, images):
        # Project onto the ε-ball around the clean images, then onto [0, 1]
        adversarial = torch.max(
            images - self.epsilon,
            torch.min(images + self.epsilon, adversarial)
        )
        return torch.clamp(adversarial, 0, 1)
    
    def _fgsm(self, images, labels):
        # Detached copy so the caller's tensor is not modified
        images = images.clone().detach().requires_grad_(True)
        output = self.surrogate(images)
        loss = F.cross_entropy(output, labels)
        # Gradient with respect to the input only; parameter grads stay untouched
        grad, = torch.autograd.grad(loss, images)
        
        adversarial = images.detach() + self.epsilon * torch.sign(grad)
        return torch.clamp(adversarial, 0, 1)
    
    def _pgd(self, images, labels):
        images = images.clone().detach()
        
        # Random start inside the ε-ball
        adversarial = images + torch.zeros_like(images).uniform_(
            -self.epsilon, self.epsilon
        )
        adversarial = torch.clamp(adversarial, 0, 1)
        
        for _ in range(self.num_iter):
            adversarial.requires_grad_(True)
            output = self.surrogate(adversarial)
            loss = F.cross_entropy(output, labels)
            grad, = torch.autograd.grad(loss, adversarial)
            
            adversarial = adversarial.detach() + self.alpha * torch.sign(grad)
            adversarial = self._project(adversarial, images)
        
        return adversarial
    
    def _mim(self, images, labels):
        """
        Momentum Iterative Method
        Adds a momentum term to improve transferability
        """
        images = images.clone().detach()
        adversarial = images.clone()
        momentum = torch.zeros_like(images)
        
        for _ in range(self.num_iter):
            adversarial.requires_grad_(True)
            output = self.surrogate(adversarial)
            loss = F.cross_entropy(output, labels)
            grad, = torch.autograd.grad(loss, adversarial)
            
            # Normalize each sample's gradient by its own L1 norm,
            # then accumulate it into the momentum
            l1 = grad.abs().flatten(1).sum(dim=1).clamp_min(1e-12)
            grad = grad / l1.view(-1, *([1] * (grad.dim() - 1)))
            momentum = 0.9 * momentum + grad
            
            adversarial = adversarial.detach() + self.alpha * torch.sign(momentum)
            adversarial = self._project(adversarial, images)
        
        return adversarial
 
 
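class EnsembleAttack:
    """
    Ensemble attack: attack several surrogate models at once
    to improve the transfer success rate and coverage
    """
    
    def __init__(self,
                 models: List[nn.Module],
                 epsilon: float = 0.03,
                 num_iter: int = 10):
        self.models = models
        self.epsilon = epsilon
        self.num_iter = num_iter
    
    def attack(self,
              images: torch.Tensor,
              labels: torch.Tensor,
              weights: List[float] = None) -> torch.Tensor:
        """
        Ensemble attack
        
        Strategy: follow the weighted average of the models' input gradients
        """
        if weights is None:
            weights = [1.0 / len(self.models)] * len(self.models)
        
        images = images.clone().detach()
        adversarial = images.clone()
        # Step size chosen so that num_iter steps can span the ε-ball
        alpha = self.epsilon / self.num_iter
        
        for _ in range(self.num_iter):
            adversarial.requires_grad_(True)
            
            # Weighted sum of the input gradients of all models.
            # torch.autograd.grad returns a fresh gradient per model, so
            # gradients do not accumulate across models.
            avg_gradient = torch.zeros_like(adversarial)
            for model, weight in zip(self.models, weights):
                model.eval()
                output = model(adversarial)
                loss = F.cross_entropy(output, labels)
                grad, = torch.autograd.grad(loss, adversarial)
                avg_gradient += weight * grad
            
            with torch.no_grad():
                adversarial = adversarial + alpha * torch.sign(avg_gradient)
                # Keep the total perturbation within ε and the pixels within [0, 1]
                adversarial = torch.max(
                    images - self.epsilon,
                    torch.min(images + self.epsilon, adversarial)
                )
                adversarial = torch.clamp(adversarial, 0, 1)
        
        return adversarial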
 
 
def evaluate_transferability(transfer_attack: TransferAttack,
                             source_models: List[nn.Module],
                             target_model: nn.Module,
                             images: torch.Tensor,
                             labels: torch.Tensor):
    """
    Evaluate how well adversarial examples transfer.
    
    `transfer_attack` supplies the attack settings (epsilon, num_iter, alpha);
    a separate attack is built for each source model.
    """
    results = {}
    target_model.eval()
    
    for i, model in enumerate(source_models):
        model.eval()
        attack = TransferAttack(
            model,
            epsilon=transfer_attack.epsilon,
            num_iter=transfer_attack.num_iter,
            alpha=transfer_attack.alpha,
        )
        
        # Generate adversarial examples on the source model
        adv_images = attack.generate(images, labels)
        
        # Success rate on the source model
        with torch.no_grad():
            source_preds = model(adv_images).argmax(1)
            source_success = (source_preds != labels).float().mean().item()
        
        # Success rate on the target model
        with torch.no_grad():
            target_preds = target_model(adv_images).argmax(1)
            target_success = (target_preds != labels).float().mean().item()
        
        results[f"model_{i}"] = {
            "source_attack_success": source_success,
            "target_attack_success": target_success
        }
    
    return results
 
 
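def demo_transfer():
    """Transfer attack demo"""
    print("Transfer attack demo")
    print("=" * 60)
    print("""
    How transfer attacks work:
    
    1. Transferability:
       an adversarial example generated on model A
       has some probability of fooling model B as well
    
    2. Where transferability comes from:
       - Different models trained on similar data learn similar decision boundaries
       - Adversarial examples lie near "weak spots" of the decision boundary
       - These weak spots partly overlap across models
    
    3. Ways to improve transferability:
       - MIM (momentum iterative method): add a momentum term
       - Ensemble attack: attack several models at once
       - Multi-step attacks: stronger perturbations
       - Diverse training: train surrogate models with different architectures
    """)


if __name__ == "__main__":
    demo_transfer()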

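6. Adversarial Patches: Attacks in the Physical World

6.1 What Is an Adversarial Patch?

An adversarial patch does not modify the pixels of the whole image. Instead, a carefully designed "patch" is placed in one region of the image, and that alone is enough to make the model predict the wrong class.

This is worrying in practice: a colored sticker on a stop sign could make a vision system of the kind used in autonomous driving misread it as a different sign.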
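Applying a patch amounts to overwriting a rectangular region of the image while leaving every other pixel untouched. The sketch below uses plain nested lists for a single-channel image; the helper name `apply_patch` and the 4×4 example are made up for illustration.

```python
def apply_patch(image, patch, top, left):
    """Return a copy of a 2-D image (list of rows) with `patch` pasted at (top, left)."""
    height, width = len(image), len(image[0])
    patch_h, patch_w = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + patch_h > height or left + patch_w > width:
        raise ValueError("patch does not fit inside the image")

    patched = [row[:] for row in image]           # copy, so the input is unchanged
    for r in range(patch_h):
        patched[top + r][left:left + patch_w] = patch[r]
    return patched


image = [[0.0] * 4 for _ in range(4)]             # blank 4x4 image
patch = [[1.0, 1.0], [1.0, 1.0]]                  # 2x2 patch

for row in apply_patch(image, patch, top=1, left=1):
    print(row)
```

Unlike an ε-bounded perturbation, the pixels inside the patch can take any valid value; the constraint is on where the change happens, not on how large it is.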

6.2 Adversarial Patch Implementation

"""
Adversarial patch attack implementation
"""
 
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from typing import Tuple
 
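class AdversarialPatch:
    """
    Adversarial patch attack
    
    Goal: place a patch in the image so that the model predicts the wrong class
    Use case: physical-world attacks (e.g. misreading traffic signs)
    """
    
    def __init__(self,
                 model: nn.Module,
                 patch_size: int = 50,
                 epsilon: float = 2.0,
                 learning_rate: float = 0.1):
        self.model = model
        self.patch_size = patch_size
        # Not used below: patch pixels are only constrained to [0, 1]
        self.epsilon = epsilon
        self.lr = learning_rate
    
    def generate(self,
                target_class: int,
                image_shape: Tuple[int, ...],
                num_iterations: int = 100,
                show_progress: bool = True) -> np.ndarray:
        """
        Generate an adversarial patch.
        
        Args:
            target_class: class the model should be fooled into predicting
            image_shape: image shape (C, H, W)
            num_iterations: number of optimization steps
        
        Returns:
            The patch as an array of shape (C, patch_size, patch_size).
        """
        # Initialize the patch with random noise (channel-first, as PyTorch expects)
        channels = image_shape[0]
        patch = torch.rand(1, channels, self.patch_size, self.patch_size,
                           requires_grad=True)
        
        # Optimizer
        optimizer = torch.optim.Adam([patch], lr=self.lr)
        target = torch.tensor([target_class])
        
        for iteration in range(num_iterations):
            optimizer.zero_grad()
            
            # Build an image that contains the patch
            image = self._apply_patch(patch, image_shape)
            
            # Minimizing the cross-entropy towards the target class
            # maximizes the probability of that class
            output = self.model(image)
            loss = F.cross_entropy(output, target)
            
            loss.backward()
            optimizer.step()
            
            # Project back onto the valid pixel range
            with torch.no_grad():
                patch.clamp_(0, 1)
            
            if show_progress and (iteration + 1) % 20 == 0:
                print(f"Iteration {iteration+1}/{num_iterations}, Loss: {loss.item():.4f}")
        
        return patch.detach().numpy()[0]
    
    def _apply_patch(self, patch: torch.Tensor, 
                    image_shape: Tuple[int, ...]) -> torch.Tensor:
        """
        Paste the patch at a random position on a blank (all-zero) image.
        """
        channels, h, w = image_shape
        
        # Blank background image
        image = torch.zeros(1, channels, h, w)
        
        # Random position; randint's upper bound is exclusive, hence the + 1
        patch_h, patch_w = patch.shape[2:]
        top = np.random.randint(0, h - patch_h + 1)
        left = np.random.randint(0, w - patch_w + 1)
        
        # Paste the patch (gradients flow back to `patch` through this assignment)
        image[:, :, top:top+patch_h, left:left+patch_w] = patch
        
        return image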
 
 
class TargetedPatch:
    """
    Targeted adversarial patch: make the model classify any image that
    contains the patch as a specific class
    """
    
    def __init__(self, model: nn.Module):
        self.model = model
    
    def train(self,
             source_images: torch.Tensor,
             target_class: int,
             num_epochs: int = 100,
             lr: float = 0.1) -> torch.Tensor:
        """
        Train a printable adversarial patch.
        
        `source_images` is a [N, 3, 224, 224] batch in [0, 1].
        Returns the patch with shape (3, 50, 50).
        """
        # Initialize the patch (channel-first, matching the image layout)
        patch = torch.rand(3, 50, 50, requires_grad=True)
        optimizer = torch.optim.Adam([patch], lr=lr)
        target = torch.tensor([target_class])
        
        for epoch in range(num_epochs):
            total_loss = 0
            
            for image in source_images:
                optimizer.zero_grad()
                
                # Apply the patch to the image
                # Simplified version: overwrite the center of the image
                image_with_patch = image.clone()
                h, w = 50, 50
                top, left = 87, 87  # centers a 50x50 patch in a 224x224 image
                image_with_patch[:, top:top+h, left:left+w] = patch
                
                # Forward pass
                output = self.model(image_with_patch.unsqueeze(0))
                
                # Minimizing this loss maximizes the target-class score
                loss = F.cross_entropy(output, target)
                
                loss.backward()
                optimizer.step()
                
                # Keep the patch a valid (printable) image
                with torch.no_grad():
                    patch.clamp_(0, 1)
                
                total_loss += loss.item()
            
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1}, Avg Loss: {total_loss/len(source_images):.4f}")
        
        return patch.detach()
 
 
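def demo_patch():
    """Adversarial patch demo"""
    print("Adversarial patch attack demo")
    print("=" * 60)
    print("""
    Adversarial patches vs conventional adversarial examples:
    
    Conventional adversarial examples:
    - Modify the whole image
    - Small perturbation (||δ||_∞ < ε)
    - Usually do not survive printing
    
    Adversarial patches:
    - Modify only a local region
    - Pixel values inside the patch are unconstrained
    - Can be printed and placed in the physical world
    
    Application scenarios:
    1. Autonomous driving: colored stickers on a stop sign
    2. Face recognition: special glasses or hats
    3. Object detection: making the detector ignore an object entirely
    """)


if __name__ == "__main__":
    demo_patch()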

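7. Defenses

7.1 Adversarial Training

Adversarial training is one of the most effective defenses. The core idea is to train the model on adversarial examples.

Adversarial training:
min_θ E_{(x,y)~D} max_{δ∈S} L(θ, x+δ, y)

Inner maximization: find the strongest adversarial perturbation
Outer minimization: minimize the loss on the resulting adversarial examples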
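The min-max structure can be seen end to end on a toy problem. This is a sketch, assuming a two-feature logistic regression and a made-up, well-separated dataset; the inner maximization is a single FGSM step (which is the exact L∞ maximizer for a linear model), and the outer minimization is plain SGD on the adversarial inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Inner maximization: worst-case L-infinity perturbation for a linear model."""
    p = sigmoid(dot(w, x) + b)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]


def adversarial_train(data, eps, lr=0.1, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, eps)          # inner max
            p = sigmoid(dot(w, x_adv) + b)
            # Outer min: one SGD step on the loss at the adversarial point
            w = [wi - lr * (p - y) * xa for wi, xa in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b


def robust_accuracy(w, b, data, eps):
    """Share of points still classified correctly after an FGSM perturbation."""
    correct = 0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, eps)
        correct += int((sigmoid(dot(w, x_adv) + b) > 0.5) == bool(y))
    return correct / len(data)


# Hypothetical dataset: class 1 near (1, 1), class 0 near (-1, -1)
data = [([1.0, 1.2], 1), ([0.8, 1.0], 1), ([-1.0, -0.9], 0), ([-1.2, -1.0], 0)]
w, b = adversarial_train(data, eps=0.3)
print("weights:", [round(v, 2) for v in w], "bias:", round(b, 2))
print("robust accuracy:", robust_accuracy(w, b, data, eps=0.3))
```

On a deep network the inner problem has no closed form, which is why the implementation that follows approximates it with several PGD steps per batch.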

7.2 Adversarial Training Implementation

"""
Adversarial training implementation
"""
 
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from typing import Tuple
 
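class AdversarialTraining:
    """
    Adversarial training: let the model learn from adversarial examples
    
    Common strategies:
    1. PGD-AT: train on adversarial examples generated by PGD
    2. TRADES: minimize the clean loss plus a KL term that pulls predictions
       on adversarial examples towards those on clean examples
    3. MART: misclassification-aware adversarial training, which gives
       extra weight to misclassified examples
    """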
    
    def __init__(self,
                 model: nn.Module,
                 epsilon: float = 0.031,
                 num_iter: int = 7,
                 alpha: float = 0.008,
                 attack_method: str = "pgd"):
        self.model = model
        self.epsilon = epsilon
        self.num_iter = num_iter
        self.alpha = alpha
        self.attack_method = attack_method
    
    def _generate_adversarial(self,
                             images: torch.Tensor,
                             labels: torch.Tensor) -> torch.Tensor:
        """
        Generate adversarial examples with the configured attack.
        """
        if self.attack_method == "pgd":
            return self._pgd_attack(images, labels)
        elif self.attack_method == "fgsm":
            return self._fgsm_attack(images, labels)
        else:
            raise ValueError(f"Unknown attack: {self.attack_method}")
    
    def _fgsm_attack(self, images, labels):
        # Detached copy so the training batch itself is not modified
        images = images.clone().detach().requires_grad_(True)
        output = self.model(images)
        loss = F.cross_entropy(output, labels)
        # Gradient with respect to the input only
        grad, = torch.autograd.grad(loss, images)
        
        adversarial = images.detach() + self.epsilon * torch.sign(grad)
        return torch.clamp(adversarial, 0, 1)
    
    def _pgd_attack(self, images, labels):
        images = images.clone().detach()
        
        # Random start inside the ε-ball
        adversarial = images + torch.zeros_like(images).uniform_(
            -self.epsilon, self.epsilon
        )
        adversarial = torch.clamp(adversarial, 0, 1)
        
        for _ in range(self.num_iter):
            adversarial.requires_grad_(True)
            output = self.model(adversarial)
            loss = F.cross_entropy(output, labels)
            grad, = torch.autograd.grad(loss, adversarial)
            
            with torch.no_grad():
                adversarial = adversarial + self.alpha * torch.sign(grad)
                adversarial = torch.clamp(adversarial, 0, 1)
                # Project back onto the ε-ball around the clean images
                adversarial = torch.max(
                    images - self.epsilon,
                    torch.min(images + self.epsilon, adversarial)
                )
        
        return adversarial.detach()
    
    def train(self,
             train_loader: DataLoader,
             test_loader: DataLoader,
             epochs: int = 10,
             lr: float = 0.01) -> dict:
        """
        Run adversarial training.
        """
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5)
        
        history = {
            "train_loss": [],
            "train_clean_acc": [],
            "train_adv_acc": [],
            "test_clean_acc": [],
            "test_adv_acc": []
        }
        
        for epoch in range(epochs):
            self.model.train()
            train_loss = 0
            clean_correct = 0
            adv_correct = 0
            total = 0
            
            for images, labels in train_loader:
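                # Generate adversarial examples against the current weights
                adversarial = self._generate_adversarial(images, labels)
                
                # Reset gradients left over from the previous step or the attack
                optimizer.zero_grad()
                
                # Losses on the clean batch and on the adversarial batch
                clean_loss = F.cross_entropy(self.model(images), labels)
                adv_loss = F.cross_entropy(self.model(adversarial), labels)
                
                # Total loss: equal-weight mix (adjust the weights as needed)
                loss = (clean_loss + adv_loss) / 2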
                
                loss.backward()
                optimizer.step()
                
                train_loss += loss.item()
                
                # Compute training accuracy (after the parameter update)
                with torch.no_grad():
                    clean_pred = self.model(images).argmax(1)
                    adv_pred = self.model(adversarial).argmax(1)
                    clean_correct += (clean_pred == labels).sum().item()
                    adv_correct += (adv_pred == labels).sum().item()
                    total += labels.size(0)
            
            scheduler.step()
            
            # Evaluate on the test set
            clean_acc, adv_acc = self._evaluate(test_loader)
            
            history["train_loss"].append(train_loss / len(train_loader))
            history["train_clean_acc"].append(clean_correct / total)
            history["train_adv_acc"].append(adv_correct / total)
            history["test_clean_acc"].append(clean_acc)
            history["test_adv_acc"].append(adv_acc)
            
            print(f"Epoch {epoch+1}/{epochs}")
            print(f"  Train Loss: {train_loss/len(train_loader):.4f}")
            print(f"  Test Clean Acc: {clean_acc:.4f}, Test Adv Acc: {adv_acc:.4f}")
        
        return history
    
    def _evaluate(self, test_loader: DataLoader) -> Tuple[float, float]:
        """Return (clean accuracy, adversarial accuracy) on test_loader."""
        self.model.eval()
        clean_correct = 0
        adv_correct = 0
        total = 0
        
        for images, labels in test_loader:
            # The attack needs input gradients, so it must run outside no_grad
            adversarial = self._generate_adversarial(images, labels)
            
            with torch.no_grad():
                # Clean accuracy
                clean_pred = self.model(images).argmax(1)
                clean_correct += (clean_pred == labels).sum().item()
                
                # Adversarial accuracy
                adv_pred = self.model(adversarial).argmax(1)
                adv_correct += (adv_pred == labels).sum().item()
            
            total += labels.size(0)
        
        return clean_correct / total, adv_correct / total
 
 
class TRADESDefense(AdversarialTraining):
    """
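    TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)
    Paper: https://arxiv.org/abs/1901.08573
    
    Core idea:
    predictions on clean and adversarial inputs should stay close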
    """
    
    def __init__(self, *args, beta: float = 6.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.beta = beta
    
    def trades_loss(self,
                   images: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
        """
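        Compute the TRADES loss: CE(f(x), y) + beta * KL(f(x) || f(x_adv)).
        """
        # Generate adversarial examples. Note: this reuses the class's
        # cross-entropy attack; the TRADES paper maximizes the KL term instead.
        adversarial = self._generate_adversarial(images, labels)
        
        # Predictions on clean and adversarial inputs. Gradients must flow
        # through the adversarial branch, so no torch.no_grad() here.
        clean_output = self.model(images)
        adv_output = self.model(adversarial)
        
        # TRADES loss
        # 1. Cross-entropy on the clean inputs
        ce_loss = F.cross_entropy(clean_output, labels)
        
        # 2. KL(p_clean || p_adv). F.kl_div expects log-probabilities as its
        #    first argument and probabilities (the target) as its second.
        kl_loss = F.kl_div(
            F.log_softmax(adv_output, dim=1),
            F.softmax(clean_output, dim=1),
            reduction='batchmean'
        )
        
        return ce_loss + self.beta * kl_loss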
 
 
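def demo_defense():
    """Overview of defense methods."""
    print("Adversarial defense methods")
    print("=" * 60)
    print("""
    Main defense methods:
    
    1. Adversarial Training
       - Train on adversarial examples
       - Most effective empirically, but slow to train
       
    2. Input Purification
       - Preprocess the input
       - JPEG compression, denoising, etc.
       
    3. Defensive Distillation
       - Train with soft labels
       - Smooths the decision boundary
       
    4. Certified Robustness
       - Provides a provable lower bound on robust accuracy
       - The guarantee holds only within the certified radius and threat model
    
    Defense comparison (illustrative figures, not measured results;
    actual values depend on dataset, epsilon, and attack):
    Method            | Clean acc. | Adversarial acc.
    ------------------|------------|-----------------
    Standard training | 95%        | 10%
    PGD-AT            | 85%        | 75%
    TRADES            | 87%        | 78%
    """)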
 
 
if __name__ == "__main__":
    demo_defense()
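
The direction of the KL term is easy to get backwards, so a framework-free sketch may help. The NumPy snippet below computes a TRADES-style objective, CE(f(x), y) + β · KL(p_clean || p_adv), from raw logits. The helper names (`softmax`, `kl_divergence`, `trades_objective`) and the example logits are invented for illustration; treat this as a sketch of the objective under those assumptions, not as the reference implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Batch-mean KL(p || q) for row-wise probability vectors."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def trades_objective(clean_logits, adv_logits, labels, beta=6.0):
    """Clean cross-entropy plus beta times KL(p_clean || p_adv)."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    picked = p_clean[np.arange(len(labels)), labels]
    ce = float(np.mean(-np.log(picked + 1e-12)))
    return ce + beta * kl_divergence(p_clean, p_adv)

# Hypothetical logits for a batch of two samples and three classes
clean_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
adv_logits = np.array([[0.8, 1.1, -0.5], [0.4, 0.9, 0.6]])
labels = np.array([0, 1])

print(trades_objective(clean_logits, adv_logits, labels))
print(trades_objective(clean_logits, clean_logits, labels))  # KL term is zero here
```

When the adversarial logits equal the clean logits, the KL term vanishes and the objective reduces to plain cross-entropy, which is why β controls how much robustness is traded against clean accuracy.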

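8. A Game-Theoretic View of Adversarial Attack and Defense

8.1 The Attack-Defense Game

Adversarial attack and defense can be modeled as a two-player game:

Attacker: chooses the adversarial perturbation δ
Defender: chooses the model parameters θ

Adversarial training = solving a minimax problem:
min_θ E_{(x,y)~D} [ max_{δ∈S} L(θ, x+δ, y) ]

where S is the set of allowed perturbations (for example, the L∞ ball ||δ||_∞ ≤ ε) and D is the data distribution.

This view yields several useful insights:

  1. Nash equilibrium: at equilibrium, neither side can improve by changing its strategy unilaterally
  2. Mixed strategies: randomizing the defense is sometimes more effective than any fixed choice
  3. Payoff design: how "defense success" is defined affects the outcome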
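
To make the minimax structure concrete, here is a minimal NumPy sketch for a linear binary classifier with logistic loss. In this special case the inner maximization over the L∞ ball has a closed form, δ* = -ε · y · sign(w), so no iterative attack is needed. The toy data, function names, and hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def logistic_loss(w, b, x, y):
    """Mean logistic loss for labels y in {-1, +1}."""
    margins = y * (x @ w + b)
    return float(np.mean(np.log1p(np.exp(-margins))))

def worst_case_delta(w, y, epsilon):
    """Inner max over the L∞ ball: closed form for a linear model."""
    # Pushing every feature by epsilon against the label lowers the margin most
    return -epsilon * y[:, None] * np.sign(w)[None, :]

def adversarial_train(x, y, epsilon, lr=0.1, steps=300):
    """Outer min: gradient descent on the loss at the worst-case inputs."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        x_adv = x + worst_case_delta(w, y, epsilon)
        margins = y * (x_adv @ w + b)
        # Derivative of log(1 + exp(-m)) with respect to m is -1 / (1 + exp(m))
        coeff = -y / (1.0 + np.exp(margins))
        w = w - lr * (x_adv.T @ coeff) / len(y)
        b = b - lr * float(coeff.mean())
    return w, b

# Hypothetical toy data: two Gaussian blobs in 5 dimensions
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(1.0, 1.0, (100, 5)), rng.normal(-1.0, 1.0, (100, 5))])
y = np.concatenate([np.ones(100), -np.ones(100)])

epsilon = 0.3
w, b = adversarial_train(x, y, epsilon)
clean = logistic_loss(w, b, x, y)
robust = logistic_loss(w, b, x + worst_case_delta(w, y, epsilon), y)
print(f"clean loss {clean:.3f}, worst-case loss {robust:.3f}")
```

For deep networks the inner maximization has no closed form, which is why PGD is used as an approximate solver for it.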

8.2 Implementing the Game-Theoretic View

"""
A game-theoretic view of adversarial attack and defense
"""
 
import numpy as np
from typing import List, Tuple
import torch
import torch.nn.functional as F
 
class AttackDefenseGame:
    """
    Attack-defense game simulator
    """
    
    def __init__(self,
                 epsilon: float = 0.1,
                 attack_cost: float = 0.0,
                 defense_cost: float = 0.1):
        self.epsilon = epsilon
        self.attack_cost = attack_cost
        self.defense_cost = defense_cost
    
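    def payoff_matrix(self) -> np.ndarray:
        """
        Build the defender's payoff matrix.
        Rows: defender strategy (epsilon defended against)
        Columns: attacker strategy (epsilon used by the attack)
        """
        epsilons = [0.0, 0.01, 0.03, 0.05, 0.1]
        
        payoff = np.zeros((len(epsilons), len(epsilons)))
        
        for i, def_eps in enumerate(epsilons):
            for j, att_eps in enumerate(epsilons):
                # Simplified threshold model: the attack succeeds only if its
                # budget exceeds the budget the defender prepared for
                if att_eps <= def_eps:
                    # Defense holds
                    attack_payoff = -self.attack_cost
                    defense_payoff = 1 - self.defense_cost
                else:
                    # Attack succeeds
                    attack_payoff = 1 - self.attack_cost
                    defense_payoff = -self.defense_cost
                
                # Only the defender's payoff is stored; attack_payoff is
                # shown for reference (the two always sum to a constant)
                payoff[i, j] = defense_payoff
        
        return payoff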
    
    def mixed_strategy_nash(self, payoff: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute pure-strategy best responses for both players.
        Despite the name, this is not a mixed-strategy Nash solver.
        """
        n_def, n_att = payoff.shape
        
        # Index of the best reply to each opposing pure strategy
        best_attack = np.zeros(n_def, dtype=int)
        best_defense = np.zeros(n_att, dtype=int)
        
        # Attacker's best response to defender row i: the game is
        # constant-sum, so the attacker minimizes the defender's payoff
        for i in range(n_def):
            best_attack[i] = np.argmin(payoff[i, :])
        
        # Defender's best response to attacker column j
        for j in range(n_att):
            best_defense[j] = np.argmax(payoff[:, j])
        
        return best_defense, best_attack
 
 
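def analyze_robustness_tradeoff():
    """
    Analyze the trade-off between robustness and utility.
    """
    print("Robustness vs. utility trade-off")
    print("=" * 60)
    print("""
    Adversarial training is a double-edged sword:
    
    Pros:
    ✓ Much stronger defense against known attacks
    ✓ The model learns a smoother decision boundary
    ✓ More robust to noise
    
    Cons:
    ✗ Lower accuracy on clean samples
    ✗ Significantly longer training time
    ✗ May be bypassed by new attack methods
    
    Typical accuracy trade-off (illustrative figures, not measured results):
    
    Clean accuracy vs. adversarial accuracy (bar = clean accuracy, one block per 5%)
    
    Standard:    ███████████████████ 95% clean / 10% adversarial
    PGD-AT:      █████████████████ 85% clean / 75% adversarial
    TRADES:      █████████████████ 87% clean / 78% adversarial
    """)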
 
 
def demo_game_theory():
    """Game-theoretic view demo."""
    print("Game-theoretic view of adversarial attack and defense")
    print("=" * 60)
    
    game = AttackDefenseGame()
    payoff = game.payoff_matrix()
    
    eps_values = [0.0, 0.01, 0.03, 0.05, 0.1]
    print("\nPayoff matrix (defender's payoff):")
    print("rows: defense epsilon, columns: attack epsilon")
    print("        " + " ".join(f"{e:>6.2f}" for e in eps_values))
    print("-" * 50)
    for i, eps in enumerate(eps_values):
        print(f"{eps:<8.2f}", end="")
        print(" ".join(f"{payoff[i, j]:>6.2f}" for j in range(len(eps_values))))
    
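    print("""
    
    Game analysis:
    
    1. Pure-strategy equilibrium: depends on the cost parameters
    
    2. Mixed strategies:
       Defender: choose (or randomize over) a suitable epsilon, e.g. 0.03
       Attacker: weigh cost against benefit to decide whether to attack
    
    3. In practice:
       - High-value targets: use strong defenses
       - Ordinary systems: use moderate defenses plus monitoring
    """)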
 
 
if __name__ == "__main__":
    demo_game_theory()
    analyze_robustness_tradeoff()
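
The simulator above only reports pure-strategy best responses. One simple way to approximate mixed strategies for a zero-sum (or constant-sum) matrix game is fictitious play, in which each player repeatedly best-responds to the opponent's empirical play so far. The sketch below is an iterative approximation, not an exact solver; the function name `fictitious_play` and the matching-pennies example are assumptions made for illustration.

```python
import numpy as np

def fictitious_play(payoff, iterations=20000):
    """
    Approximate mixed strategies for a zero-sum matrix game.
    payoff[i, j] is the row player's gain; the column player minimizes it.
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary opening moves
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(iterations):
        # Each player best-responds to the opponent's empirical play so far
        row_choice = int(np.argmax(payoff @ col_counts))
        col_choice = int(np.argmin(row_counts @ payoff))
        row_counts[row_choice] += 1
        col_counts[col_choice] += 1
    row_strategy = row_counts / row_counts.sum()
    col_strategy = col_counts / col_counts.sum()
    value = float(row_strategy @ payoff @ col_strategy)
    return row_strategy, col_strategy, value

# Matching pennies: the equilibrium is known to be uniform play with value 0
matching_pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
row_strategy, col_strategy, value = fictitious_play(matching_pennies)
print(row_strategy, col_strategy, value)
```

Because the defender and attacker payoffs in `AttackDefenseGame` always sum to a constant, the same routine can be applied to the matrix returned by `payoff_matrix()`, with the defender as the row player.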

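9. Lessons from Practice

9.1 Common Pitfalls

  1. Poorly chosen epsilon: too large and the model cannot learn during adversarial training; too small and the attack has no effect
  2. Vanishing or exploding gradients: use gradient clipping or suitable initialization
  3. Random seeds matter: attacks with random starts are stochastic, so fix the seed to reproduce results
  4. Iterative vs. single-step attacks: iterative attacks such as PGD are usually stronger than single-step ones such as FGSM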
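
For pitfall 3, a small seeding helper is usually enough to make random-start attacks repeatable. The sketch below seeds Python, NumPy, and (if it is installed) PyTorch; `set_seed` and `random_start` are hypothetical helper names, and fully deterministic GPU runs may need further framework-specific settings that are not covered here.

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Seed the RNGs that affect random starts and data shuffling."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)  # seeds the PyTorch generators
    except ImportError:
        pass  # PyTorch not installed; only Python and NumPy are seeded

def random_start(shape, epsilon):
    """Uniform random perturbation inside the L∞ ball, as used to start PGD."""
    return np.random.uniform(-epsilon, epsilon, size=shape)

set_seed(0)
first = random_start((2, 3), 8 / 255)
set_seed(0)
second = random_start((2, 3), 8 / 255)
print(np.array_equal(first, second))  # True: same seed, same random start
```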

9.2 Defense Recommendations

"""
Defense best practices
"""
 
class DefenseBestPractices:
    """
    Defense best-practice checklist
    """
    
    @staticmethod
    def get_checklist():
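        return """
        Defense checklist:
        
        [ ] Basic protection
            - Use adversarial training (PGD-AT or TRADES)
            - Enable input validation and sanitization
            - Limit the confidence information the model exposes
            
        [ ] Monitoring and detection
            - Monitor shifts in the input distribution
            - Detect potential adversarial inputs
            - Log anomalous predictions
            
        [ ] Model hardening
            - Use model ensembles
            - Adopt certified-robustness methods
            - Re-evaluate regularly against new attacks
            
        [ ] System level
            - Input preprocessing (JPEG compression, etc.)
            - Multi-model voting
            - Integrated anomaly detection
        """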
    
    @staticmethod
    def recommended_config():
        return {
            "adversarial_training": {
                "epsilon": 8/255,  # L∞ budget on a [0, 1] pixel scale
                "num_iter": 7,
                "alpha": 2/255,
                "method": "pgd"
            },
            "input_purification": {
                "jpeg_compression": 75,
                "bit_depth_reduction": False,
                "random_resize_padding": True
            },
            "model_ensemble": {
                "n_models": 3,
                "diversity": "architecture"  # or "training_data"
            }
        }
 
 
# Quick reference table
 
QUICK_REFERENCE = """
================================================================================
                 Adversarial Attack and Defense Quick Reference
================================================================================
 
Choosing an attack:
┌───────────────────┬─────────────┬──────────────┬──────────────────────────────┐
│ Method            │ Access      │ Compute cost │ Typical use                  │
├───────────────────┼─────────────┼──────────────┼──────────────────────────────┤
│ FGSM              │ White-box   │ ★☆☆          │ Fast baseline                │
│ PGD               │ White-box   │ ★★★          │ Strong first-order L∞ attack │
│ C&W               │ White-box   │ ★★★★         │ Minimum-norm perturbations   │
│ Transfer attack   │ Black-box   │ ★★☆          │ Model internals unknown      │
│ Adversarial patch │ White/black │ ★★★          │ Physical-world attacks       │
└───────────────────┴─────────────┴──────────────┴──────────────────────────────┘
 
Choosing a defense:
┌────────────────────┬────────────┬──────────────────┬──────────────────────────┐
│ Method             │ Robustness │ Clean accuracy   │ Notes                    │
├────────────────────┼────────────┼──────────────────┼──────────────────────────┤
│ No defense         │ ☆☆☆        │ Highest          │ Baseline                 │
│ Adv. training (AT) │ ★★★        │ Drops 5-10%      │ Most effective           │
│ TRADES             │ ★★★        │ Drops 3-8%       │ Slightly better than AT  │
│ Input purification │ ★★☆        │ Nearly unchanged │ Auxiliary measure        │
│ Certified defense  │ ★★★        │ Drops 10-20%     │ Provable guarantee       │
└────────────────────┴────────────┴──────────────────┴──────────────────────────┘
 
Common parameter settings (L∞ budget on a [0, 1] pixel scale):
- ImageNet (224x224): ε = 8/255 ≈ 0.031
- CIFAR-10 (32x32):   ε = 8/255 ≈ 0.031
- MNIST (28x28):      ε = 0.3
 
================================================================================
"""

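The ε values in the quick reference are on a [0, 1] pixel scale, which is easy to misapply when inputs are normalized before reaching the model. The sketch below shows the arithmetic for converting the budget and for checking that a PGD step schedule can reach the edge of the ε-ball. The helper names are hypothetical, and the per-channel standard deviations are the widely used ImageNet values, assumed here purely as an example.

```python
def epsilon_in_normalized_space(epsilon_pixels, std):
    """Per-channel L∞ budget after (x - mean) / std normalization.

    The mean shift cancels out, so only the division by std matters.
    """
    return [epsilon_pixels / s for s in std]

def pgd_can_reach_boundary(epsilon, alpha, num_iter):
    """True if num_iter steps of size alpha can cover the epsilon-ball radius."""
    return alpha * num_iter >= epsilon

epsilon = 8 / 255
imagenet_std = [0.229, 0.224, 0.225]  # assumed example normalization constants

print(round(epsilon, 4))  # 0.0314
print([round(e, 3) for e in epsilon_in_normalized_space(epsilon, imagenet_std)])
print(pgd_can_reach_boundary(epsilon, alpha=2 / 255, num_iter=7))  # True
```

With the recommended configuration (alpha = 2/255, 7 iterations), the total step budget is 14/255, comfortably larger than ε = 8/255, so the attack can reach the boundary of the ball from any starting point inside it.
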
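10. Summary

10.1 Key Takeaways

  1. The nature of adversarial examples: in high-dimensional spaces, tiny perturbations can change a model's output substantially
  2. FGSM/PGD: classic gradient-based attacks; PGD is the standard strong first-order L∞ attack
  3. C&W attack: optimizes the perturbation norm directly, yielding smaller perturbations that are harder to defend against
  4. Transfer attacks: exploit transferability between models to attack in the black-box setting
  5. Adversarial patches: modifying a local region is enough to attack, and this carries over to the physical world
  6. Adversarial training: the most effective empirical defense, at some cost in clean accuracy
  7. Game-theoretic view: attack and defense form a game whose equilibrium shapes the final outcome

10.2 Future Trends

  • Adaptive attacks and defenses: attacks and defenses co-evolve
  • Certified robustness: provable lower bounds on robustness
  • Real-world attacks: research on physical adversarial examples is increasingly important
  • AI security ecosystem: from securing a single model to securing the whole system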
