Keywords

  • Adversarial Training: strengthening model robustness by training on adversarial examples
  • Min-Max Optimization: the mathematical framework of adversarial training
  • PGD Adversarial Training: adversarial training driven by projected gradient descent
  • Robustness: a model's invariance to input perturbations
  • Certified Defense: defense methods with theoretical guarantees
  • Randomized Smoothing: a certification method based on randomization
  • Defensive Distillation: a training method that lowers gradient sensitivity
  • Certified Bound: a theoretical guarantee of robustness
  • Adversarial Regularization: using adversarial examples as a regularization term
  • Transfer Attack: generating adversarial examples via a surrogate model

1. Introduction: Origins and Significance of Adversarial Training

Adversarial training is one of the most important defenses against the threat of adversarial examples. Its basic idea is direct yet profound: if neural networks can be fooled by adversarial examples, let the model "see" such examples during training so that it learns to resist them. This "fight fire with fire" paradigm grew out of early FGSM-based training (Goodfellow et al., 2015), was formalized in Madry et al.'s seminal 2017 work, and quickly became the mainstream approach in adversarial defense research.

Why adversarial training matters

Adversarial training is not only one of the most effective defenses known today; it is also a game-theoretic framework for understanding the robustness of deep learning. By formalizing adversarial learning as a min-max optimization problem, it provides mathematical tools for reasoning about a neural network's decision boundary.


2. The Principle of Adversarial Training: The min-max Framework

2.1 From Intuition to Mathematics

The core of adversarial training is to formalize the robust optimization problem as a two-player zero-sum min-max game:

    min_θ E_{(x,y)~D} [ max_{||δ||_∞ ≤ ε} L(f_θ(x + δ), y) ]

where:

  • θ: the model parameters
  • D: the training data distribution
  • L: the loss function (e.g., cross-entropy)
  • ε: the perturbation budget
  • max_{||δ||_∞ ≤ ε}: the inner maximization problem (the attacker)
  • min_θ: the outer minimization problem (the defender)
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
 
class AdversarialTraining:
    """
    对抗训练框架
    
    min-max 优化:
    - 内层(最大化):找到最有效的对抗扰动
    - 外层(最小化):训练模型以最小化对抗损失
    """
    
    def __init__(self, model, epsilon=0.03, alpha=0.01, num_iter=7):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
    
    def pgd_attack(self, images, labels, targeted=False, target_labels=None):
        """
        PGD攻击:生成对抗样本
        
        这是对抗训练中内层最大化问题的求解方法
        """
        original_images = images.detach().clone()
        
        # Random initialization within the allowed perturbation range
        images = images.detach() + torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        images = torch.clamp(images, 0, 1)
        
        for i in range(self.num_iter):
            images.requires_grad = True
            
            outputs = self.model(images)
            
            if targeted:
                loss = -F.cross_entropy(outputs, target_labels)
            else:
                loss = F.cross_entropy(outputs, labels)
            
            self.model.zero_grad()
            loss.backward()
            
            # Update the perturbation
            with torch.no_grad():
                images = images + self.alpha * torch.sign(images.grad)
                # Project back into the allowed range
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        
        return images.detach()
    
    def train_step(self, images, labels, optimizer):
        """
        对抗训练单步
        
        1. 生成对抗样本(内层最大化)
        2. 用对抗样本训练模型(外层最小化)
        """
        # Generate adversarial examples
        adversarial_images = self.pgd_attack(images, labels)
        
        # Train on a mix of clean and adversarial samples
        optimizer.zero_grad()
        
        # Clean-sample loss
        clean_outputs = self.model(images)
        clean_loss = F.cross_entropy(clean_outputs, labels)
        
        # Adversarial-sample loss
        adv_outputs = self.model(adversarial_images)
        adv_loss = F.cross_entropy(adv_outputs, labels)
        
        # Total loss
        total_loss = clean_loss + adv_loss
        
        total_loss.backward()
        optimizer.step()
        
        return {
            'clean_loss': clean_loss.item(),
            'adv_loss': adv_loss.item(),
            'total_loss': total_loss.item()
        }

2.2 Robustness vs. Accuracy: The Training Trade-off

Adversarial training faces an important trade-off: there is tension between robustness and standard accuracy.

Standard training: optimize accuracy on clean data
Adversarial training: optimize accuracy on adversarial data

Theoretical analysis (Tsipras et al., 2019):
- There exist data distributions on which any robust model must have standard accuracy below a certain bound
- Reason: a robust model must ignore highly predictive but non-robust features, i.e., patterns an adversarial perturbation can flip while leaving human perception unchanged
def analyze_robustness_accuracy_tradeoff():
    """
    分析鲁棒性与标准准确率的权衡
    
    理论背景:
    - 对抗样本利用的是模型在高维空间的线性敏感性
    - 鲁棒模型需要在决策边界附近"平滑"
    - 这可能牺牲对干净数据的拟合能力
    """
    # 实验观察(基于 CIFAR-10):
    # 标准训练:Clean Acc ~ 95%, Adv Acc (PGD) ~ 0%
    # 对抗训练:Clean Acc ~ 85%, Adv Acc (PGD) ~ 50%
    
    tradeoffs = {
        'standard_training': {
            'clean_accuracy': 0.95,
            'robust_accuracy_pgd20': 0.0,
            'robust_accuracy_pgd100': 0.0
        },
        'adversarial_training': {
            'clean_accuracy': 0.85,
            'robust_accuracy_pgd20': 0.50,
            'robust_accuracy_pgd100': 0.48
        },
        'TRADES_regularization': {
            'clean_accuracy': 0.90,
            'robust_accuracy_pgd20': 0.55,
            'robust_accuracy_pgd100': 0.52
        },
        'MART_regularization': {
            'clean_accuracy': 0.88,
            'robust_accuracy_pgd20': 0.53,
            'robust_accuracy_pgd100': 0.50
        }
    }
    
    return tradeoffs

3. PGD Adversarial Training in Detail

3.1 A Complete PGD-Training Implementation

Projected Gradient Descent (PGD) adversarial training is currently the most widely used form of adversarial training:

class PGDTraining:
    """
    PGD对抗训练
    
    步骤:
    1. 使用PGD生成对抗样本
    2. 用对抗样本计算梯度并更新模型
    3. 重复直到收敛
    """
    
    def __init__(self, model, epsilon=8/255, alpha=2/255, num_iter=7,
                 lr=0.01, weight_decay=5e-4):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.num_iter = num_iter
        self.device = next(model.parameters()).device
        
        # Optimizer
        self.optimizer = torch.optim.SGD(
            model.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=weight_decay
        )
        
        # Learning-rate schedule
        self.scheduler = torch.optim.lr_scheduler.MultiStepLR(
            self.optimizer,
            milestones=[100, 105],
            gamma=0.1
        )
    
    def pgd_attack(self, images, labels):
        """PGD攻击"""
        images = images.to(self.device)
        labels = labels.to(self.device)
        
        original_images = images.detach()
        
        # Random initialization inside the L-infinity ball
        images = images + torch.zeros_like(images).uniform_(
            -self.epsilon, self.epsilon
        )
        
        images = torch.clamp(images, 0, 1)
        
        for _ in range(self.num_iter):
            images.requires_grad = True
            
            outputs = self.model(images)
            loss = F.cross_entropy(outputs, labels)
            
            self.model.zero_grad()
            loss.backward()
            
            with torch.no_grad():
                images = images + self.alpha * images.grad.sign()
                # Project back onto the L-infinity ball
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        
        return images.detach()
    
    def train_epoch(self, dataloader, epoch):
        """训练一个epoch"""
        self.model.train()
        total_loss = 0
        correct_clean = 0
        correct_adv = 0
        total = 0
        
        for batch_idx, (images, labels) in enumerate(dataloader):
            images, labels = images.to(self.device), labels.to(self.device)
            
            # Generate adversarial examples
            adversarial_images = self.pgd_attack(images, labels)
            
            # Forward passes
            clean_outputs = self.model(images)
            adv_outputs = self.model(adversarial_images)
            
            # Losses
            clean_loss = F.cross_entropy(clean_outputs, labels)
            adv_loss = F.cross_entropy(adv_outputs, labels)
            loss = clean_loss + adv_loss
            
            # Backward pass
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            
            # Statistics
            total_loss += loss.item()
            _, clean_pred = clean_outputs.max(1)
            _, adv_pred = adv_outputs.max(1)
            correct_clean += clean_pred.eq(labels).sum().item()
            correct_adv += adv_pred.eq(labels).sum().item()
            total += labels.size(0)
        
        self.scheduler.step()
        
        return {
            'loss': total_loss / len(dataloader),
            'clean_acc': 100. * correct_clean / total,
            'adv_acc': 100. * correct_adv / total,
            'lr': self.optimizer.param_groups[0]['lr']
        }

3.2 Theoretical Analysis of PGD Training

def theoretical_analysis_pgd():
    """
    Theoretical analysis of PGD adversarial training
    
    Key takeaways:
    1. The local maxima PGD finds from different starts tend to have similar loss values
    2. A model trained this way exhibits a degree of robustness to L-infinity attacks
    3. Robustness comes from "smoothing" the decision boundary
    """
    analysis = {
        'local_maxima': """
            Under an L-infinity constraint, the local maxima of the loss
            reached by PGD from different starting points tend to have
            comparable values.
            
            Reasons:
            - Empirically, restarts reach maxima of similar quality in high dimensions
            - PGD's random initialization helps discover strong adversarial examples
        """,
        'convergence': """
            PGD makes sufficient progress in a bounded number of steps:
            - step size alpha >= epsilon / num_iter
            - 7-10 steps are usually enough in practice
        """,
        'robustness_certification': """
            Empirical vs. certified robustness:
            - Robustness to epsilon-PGD is only evidence of robustness
              to L-infinity perturbations with norm <= epsilon
            - It becomes a guarantee only if the attack solves the inner
              maximization optimally, which PGD does not certify
        """
    }
    
    return analysis
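
The first claim above can be probed empirically. Below is a minimal sketch (a hypothetical helper, assuming the AdversarialTraining class from Section 2.1 is in scope) that runs PGD from several random starts and compares the final losses; a small spread across restarts is consistent with the claim that PGD's local maxima are of similar quality.

def probe_inner_maximization(model, images, labels, num_restarts=5):
    """
    Sketch: run PGD from several random initializations and compare the
    attained adversarial losses. Assumes the AdversarialTraining class
    from Section 2.1; the function name is illustrative, not from a library.
    """
    attacker = AdversarialTraining(model, epsilon=8/255, alpha=2/255, num_iter=10)
    losses = []
    for _ in range(num_restarts):
        adv = attacker.pgd_attack(images, labels)  # fresh random init per call
        with torch.no_grad():
            losses.append(F.cross_entropy(model(adv), labels).item())
    return {
        'mean_loss': float(np.mean(losses)),
        'max_loss': max(losses),
        'spread': max(losses) - min(losses)
    }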

4. Efficiency Problems of Adversarial Training and Their Solutions

4.1 Computational Overhead

The main bottleneck of adversarial training is solving the inner maximization problem. For each training batch, generating adversarial examples requires multiple forward-backward passes:

Standard training: 1 forward + 1 backward
PGD-K adversarial training: (K + 1) forwards + (K + 1) backwards (K attack steps plus the training pass)

Overhead ratio: roughly K + 1, where K is the number of PGD iterations
def compute_overhead_analysis():
    """
    Per-batch forward/backward counts (attack steps plus the training pass)
    """
    overheads = {
        'FGSM_attack': {
            'forward_passes': 2,   # 1 attack step + 1 training pass
            'backward_passes': 2,
            'overhead_ratio': 2
        },
        'PGD_7': {
            'forward_passes': 8,   # 7 attack steps + 1 training pass
            'backward_passes': 8,
            'overhead_ratio': 8
        },
        'PGD_20': {
            'forward_passes': 21,
            'backward_passes': 21,
            'overhead_ratio': 21
        },
        'PGD_100': {
            'forward_passes': 101,
            'backward_passes': 101,
            'overhead_ratio': 101
        }
    }
    
    return overheads

4.2 Free AT: Free Adversarial Training

Free Adversarial Training (Free AT) cuts the overhead by reusing gradients:

class FreeAdversarialTraining:
    """
    Free Adversarial Training (Shafahi et al., 2019)
    
    Core idea:
    - Each minibatch is replayed m times; every replay does a single
      forward/backward pass whose gradient is reused to update BOTH
      the model parameters and the adversarial perturbation
    - The perturbation is carried over (warm-started) between batches
    
    Cost accounting:
    - m replays multiply the per-epoch cost by m, so the epoch budget
      is divided by m to keep total cost close to standard training
    - Robustness: comparable to standard PGD adversarial training
    """
    
    def __init__(self, model, epsilon=8/255, alpha=8/255, num_replays=8):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha  # the original paper takes a full epsilon-sized step
        self.num_replays = num_replays
        
        # Accumulated perturbation, reused across batches
        self.delta = None
    
    def train_step(self, images, labels, optimizer):
        """
        Free AT training step: m replays of the same minibatch
        
        Each backward pass yields both the parameter gradients (consumed by
        optimizer.step()) and the input gradient (used to update the perturbation)
        """
        if self.delta is None or self.delta.shape != images.shape:
            self.delta = torch.zeros_like(images)
        
        for _ in range(self.num_replays):
            delta = self.delta.detach().clone().requires_grad_(True)
            
            images_adv = torch.clamp(images + delta, 0, 1)
            outputs = self.model(images_adv)
            loss = F.cross_entropy(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # the model update reuses the same backward pass
            
            with torch.no_grad():
                # Ascent step on the input, then project onto the epsilon-ball
                self.delta = delta + self.alpha * delta.grad.sign()
                self.delta = torch.clamp(self.delta, -self.epsilon, self.epsilon)
        
        return loss.item()
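
Because every minibatch is replayed m times, the total passes over the data grow by a factor of m; Free AT therefore shrinks the epoch budget by the same factor to keep overall cost near standard training. A minimal sketch of that bookkeeping (model, train_loader, and device are assumed to exist; the numbers mirror the recommendations in Section 7.1):

# Hypothetical wiring: model, train_loader, and device are assumed to be defined
free_at = FreeAdversarialTraining(model, epsilon=8/255, alpha=8/255, num_replays=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

standard_epochs = 110
# Divide the epoch budget by m so total cost stays near standard training
for epoch in range(standard_epochs // free_at.num_replays):  # 13 epochs for m = 8
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        free_at.train_step(images, labels, optimizer)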

4.3 Fast AT: Fast Adversarial Training

Fast Adversarial Training (Fast AT) accelerates training by replacing PGD with a single-step FGSM attack:

class FastAdversarialTraining:
    """
    Fast Adversarial Training (Wong et al., 2020)
    
    Key findings:
    - FGSM (a single-step attack) can train robust models
    - The crucial ingredients: random initialization and a step size slightly
      larger than epsilon (the paper uses alpha = 1.25 * epsilon)
    - Training time is close to standard training
    """
    
    def __init__(self, model, epsilon=8/255, alpha=None):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha if alpha is not None else 1.25 * epsilon
        self.device = next(model.parameters()).device
    
    def fast_at_attack(self, images, labels):
        """
        Fast AT attack: FGSM from a random start
        """
        # Random initialization inside the epsilon-ball
        delta = torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        delta.requires_grad = True
        
        outputs = self.model(torch.clamp(images + delta, 0, 1))
        loss = F.cross_entropy(outputs, labels)
        
        self.model.zero_grad()
        loss.backward()
        
        with torch.no_grad():
            # FGSM step, then project the perturbation back onto the epsilon-ball
            # (without the projection the total perturbation could reach 2 * epsilon)
            delta = delta + self.alpha * torch.sign(delta.grad)
            delta = torch.clamp(delta, -self.epsilon, self.epsilon)
            images_adv = torch.clamp(images + delta, 0, 1)
        
        return images_adv.detach()
    
    def train_step(self, images, labels, optimizer):
        """
        Fast AT training step
        
        Costs roughly one extra forward/backward pass over standard training
        """
        # Generate FGSM adversarial examples
        adversarial_images = self.fast_at_attack(images, labels)
        
        optimizer.zero_grad()
        
        # Train on the adversarial examples
        outputs = self.model(adversarial_images)
        loss = F.cross_entropy(outputs, labels)
        
        loss.backward()
        optimizer.step()
        
        return loss.item()

Fast AT vs PGD AT

  • Fast AT: fast (~1x standard training time), but needs careful tuning and can fail abruptly through "catastrophic overfitting"
  • PGD AT: slow (~7x), but more stable and reliable
  • The two can reach similar final robustness, but Fast AT is more sensitive to hyperparameters; a minimal wiring sketch for PGD AT follows below
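
As a concrete point of reference, here is a minimal sketch wiring the PGDTraining class from Section 3.1 to CIFAR-10 (the torchvision dataset/model choices and the GPU assumption are illustrative, not from the original text):

import torchvision
import torchvision.transforms as T

# Standard CIFAR-10 augmentation and loading
transform = T.Compose([T.RandomCrop(32, padding=4),
                       T.RandomHorizontalFlip(),
                       T.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=2)

model = torchvision.models.resnet18(num_classes=10).cuda()  # assumes a GPU
trainer = PGDTraining(model, epsilon=8/255, alpha=2/255, num_iter=7, lr=0.1)

for epoch in range(110):
    stats = trainer.train_epoch(train_loader, epoch)
    print(f"epoch {epoch}: loss={stats['loss']:.3f} "
          f"clean={stats['clean_acc']:.1f}% adv={stats['adv_acc']:.1f}%")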

4.4 SMART: Self-Adversarial Regularization

SMART (Self-adversarial training with Margin Enhancement) combines several improvements:

class SMARTTraining:
    """
    SMART: Self-adversarial training with Margin Enhancement
    (Jiang et al., 2020)
    
    Ingredients:
    1. Momentum FGSM to improve adversarial-example quality
    2. A margin loss to strengthen the decision boundary
    3. Label smoothing to reduce overfitting
    """
    
    def __init__(self, model, epsilon=8/255, alpha=1.25*8/255,
                 margin=0.2, label_smoothing=0.1):
        self.model = model
        self.epsilon = epsilon
        self.alpha = alpha
        self.margin = margin
        self.label_smoothing = label_smoothing
        self.device = next(model.parameters()).device
    
    def momentum_fgsm_attack(self, images, labels):
        """
        Momentum FGSM attack
        
        Accumulates a normalized gradient direction across steps
        to stabilize the perturbation (MI-FGSM style)
        """
        delta = torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon).to(self.device)
        momentum = torch.zeros_like(images)  # per-batch momentum buffer
        
        for _ in range(2):  # two attack steps
            delta.requires_grad = True
            
            outputs = self.model(images + delta)
            loss = F.cross_entropy(outputs, labels)
            
            self.model.zero_grad()
            loss.backward()
            
            with torch.no_grad():
                # Momentum update with an L1-normalized gradient
                momentum = 0.9 * momentum + delta.grad / (delta.grad.abs().mean() + 1e-10)
                delta = delta + self.alpha * torch.sign(momentum)
                delta = torch.clamp(delta, -self.epsilon, self.epsilon)
                delta = torch.clamp(images + delta, 0, 1) - images
        
        return (images + delta).detach()
    
    def margin_loss(self, outputs, labels):
        """
        Margin loss
        
        Encourages the true-class logit to exceed all other logits by a margin
        """
        # Logit of the true class
        target_logits = outputs.gather(1, labels.unsqueeze(1)).squeeze(1)
        
        # Highest logit among the other classes
        other_logits = outputs.clone()
        other_logits.scatter_(1, labels.unsqueeze(1), float('-inf'))
        second_logits = other_logits.max(dim=1)[0]
        
        # Margin: true-class logit minus the runner-up
        margins = target_logits - second_logits
        
        # Penalize margins below the threshold
        return F.relu(self.margin - margins).mean()
    
    def train_step(self, images, labels, optimizer):
        """SMART training step"""
        # Generate adversarial examples
        adversarial_images = self.momentum_fgsm_attack(images, labels)
        
        optimizer.zero_grad()
        
        # Single forward pass on the adversarial batch, reused by both losses
        outputs = self.model(adversarial_images)
        
        # Cross-entropy with label smoothing
        ce_loss = F.cross_entropy(outputs, labels,
                                  label_smoothing=self.label_smoothing)
        
        # Margin loss
        margin_loss = self.margin_loss(outputs, labels)
        
        # Total loss
        loss = ce_loss + 0.5 * margin_loss
        
        loss.backward()
        optimizer.step()
        
        return {
            'ce_loss': ce_loss.item(),
            'margin_loss': margin_loss.item(),
            'total_loss': loss.item()
        }

5. Adversarial Regularization Methods

5.1 TRADES: Adversarial Regularization

TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) regularizes training with a KL-divergence term:

class TRADESTraining:
    """
    TRADES: TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization
    (Zhang et al., 2019)
    
    Loss:
    L(x, y, θ) = CE(f_θ(x), y) + β * KL(f_θ(x) || f_θ(x'))
    
    where x' is the adversarial example, found by maximizing the same
    KL term (not the cross-entropy) inside the epsilon-ball
    """
    
    def __init__(self, model, epsilon=8/255, beta=6.0, step_size=2/255, num_steps=10):
        self.model = model
        self.epsilon = epsilon
        self.beta = beta
        self.step_size = step_size
        self.num_steps = num_steps
        self.device = next(model.parameters()).device
    
    def trades_attack(self, images):
        """Inner maximization: ascend KL(f(x) || f(x')) within the epsilon-ball"""
        clean_probs = F.softmax(self.model(images), dim=1).detach()
        
        original_images = images.detach()
        images_adv = original_images + 0.001 * torch.randn_like(images)
        
        for _ in range(self.num_steps):
            images_adv.requires_grad_(True)
            adv_log_probs = F.log_softmax(self.model(images_adv), dim=1)
            loss = F.kl_div(adv_log_probs, clean_probs, reduction='batchmean')
            grad = torch.autograd.grad(loss, images_adv)[0]
            
            with torch.no_grad():
                images_adv = images_adv + self.step_size * torch.sign(grad)
                images_adv = torch.maximum(images_adv, original_images - self.epsilon)
                images_adv = torch.minimum(images_adv, original_images + self.epsilon)
                images_adv = torch.clamp(images_adv, 0, 1)
        
        return images_adv.detach()
    
    def train_step(self, images, labels, optimizer):
        """TRADES training step"""
        # Generate adversarial examples (label-free, KL-driven)
        adversarial_images = self.trades_attack(images)
        
        optimizer.zero_grad()
        
        # Cross-entropy on clean samples
        clean_outputs = self.model(images)
        ce_loss = F.cross_entropy(clean_outputs, labels)
        
        # KL regularizer: KL(f(x) || f(x'))
        adv_log_probs = F.log_softmax(self.model(adversarial_images), dim=1)
        clean_probs = F.softmax(clean_outputs, dim=1)
        kl_loss = F.kl_div(adv_log_probs, clean_probs, reduction='batchmean')
        
        # Total loss
        loss = ce_loss + self.beta * kl_loss
        
        loss.backward()
        optimizer.step()
        
        return {
            'ce_loss': ce_loss.item(),
            'kl_loss': kl_loss.item(),
            'total_loss': loss.item()
        }

5.2 MART: Misclassification-Aware Adversarial Training

class MARTTraining:
    """
    MART: Misclassification Aware adveRsarial Training
    (Wang et al., 2020)
    
    Loss:
    L = BCE(p(x'), y) + β * KL(p(x) || p(x')) * (1 - p_y(x))
    
    where BCE adds a term pushing down the strongest wrong class, and the
    KL regularizer is weighted more heavily on misclassified clean samples
    """
    
    def __init__(self, model, epsilon=8/255, beta=6.0):
        self.model = model
        self.epsilon = epsilon
        self.beta = beta
    def train_step(self, images, labels, optimizer):
        """MART training step"""
        # Generate adversarial examples
        adversarial_images = self.pgd_attack(images, labels)
        
        optimizer.zero_grad()
        
        # Probabilities on clean and adversarial samples
        clean_probs = F.softmax(self.model(images), dim=1)
        adv_probs = F.softmax(self.model(adversarial_images), dim=1)
        
        # BCE term on adversarial samples:
        # -log p_y(x') - log(1 - max_{k != y} p_k(x'))
        true_adv = adv_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        masked = adv_probs.clone()
        masked.scatter_(1, labels.unsqueeze(1), 0.0)
        top_wrong = masked.max(dim=1)[0]
        bce_loss = (-torch.log(true_adv.clamp(min=1e-8))
                    - torch.log((1 - top_wrong).clamp(min=1e-8))).mean()
        
        # Misclassification-aware KL term: KL(p(x) || p(x')) weighted by
        # (1 - p_y(x)), so samples the model misclassifies contribute more
        kl = F.kl_div(torch.log(adv_probs.clamp(min=1e-8)), clean_probs,
                      reduction='none').sum(dim=1)
        true_clean = clean_probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        kl_loss = (kl * (1.0 - true_clean)).mean()
        
        # Total loss
        loss = bce_loss + self.beta * kl_loss
        
        loss.backward()
        optimizer.step()
        
        return loss.item()
    
    def pgd_attack(self, images, labels):
        """PGD攻击"""
        original_images = images.detach()
        images = images + torch.zeros_like(images).uniform_(-self.epsilon, self.epsilon)
        images = torch.clamp(images, 0, 1)
        
        for _ in range(7):
            images.requires_grad = True
            outputs = self.model(images)
            loss = F.cross_entropy(outputs, labels)
            self.model.zero_grad()
            loss.backward()
            
            with torch.no_grad():
                images = images + 2/255 * torch.sign(images.grad)
                images = torch.maximum(images, original_images - self.epsilon)
                images = torch.minimum(images, original_images + self.epsilon)
                images = torch.clamp(images, 0, 1)
        
        return images.detach()

6. Adversarial Training vs. Certified Defense

6.1 Certified Defenses: An Overview

A certified defense (Certified Defense) provides a theoretical guarantee: for a given perturbation budget ε, it proves that the model's prediction cannot be changed by any perturbation inside the ε-ball around the input.

class CertifiedDefense:
    """
    Certified-defense framework
    
    Contrast with adversarial training:
    - Adversarial training: empirical defense; a stronger attack may still break it
    - Certified defense: a mathematical guarantee that holds within the certified
      radius and threat model (typically at the cost of smaller radii)
    """
    
    def __init__(self, model):
        self.model = model
        self.device = next(model.parameters()).device
    
    def bound_analysis(self, x):
        """
        Certified-bound analysis for a binary linear model f(x) = w^T x + b
        
        For linear models the bound is exact: no perturbation delta with
        ||delta||_2 < |f(x)| / ||w||_2 can change sign(f(x + delta))
        """
        w = self.model.weight.data.squeeze()
        b = self.model.bias.data.squeeze()
        
        # Signed distance from x to the decision hyperplane w^T x + b = 0
        score = w @ x + b
        certified_radius = score.abs() / torch.norm(w)
        
        return certified_radius.item()
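
As a sanity check on the closed-form radius, a toy linear model makes the arithmetic concrete (a hypothetical example, not from the original text):

# Toy check of the linear certified radius (hypothetical numbers)
linear = nn.Linear(2, 1)
with torch.no_grad():
    linear.weight.copy_(torch.tensor([[3.0, 4.0]]))  # ||w||_2 = 5
    linear.bias.fill_(0.0)

defense = CertifiedDefense(linear)
x = torch.tensor([1.0, 1.0])
# f(x) = 3 + 4 = 7, so the certified L2 radius is 7 / 5 = 1.4:
# no perturbation with ||delta||_2 < 1.4 can flip sign(f(x + delta))
print(defense.bound_analysis(x))  # ~1.4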

6.2 Randomized Smoothing: A Scalable Certification Method

class RandomizedSmoothing:
    """
    Randomized Smoothing (Cohen et al., 2019)
    
    Certification recipe:
    - Add Gaussian noise to the input
    - Predict by majority vote over the noisy copies
    - If the top class wins with probability p_A > 1/2 (with high confidence),
      the smoothed classifier is certified within L2 radius sigma * Phi^{-1}(p_A)
    """
    
    def __init__(self, model, sigma=0.25, num_samples=1000):
        self.model = model
        self.sigma = sigma
        self.num_samples = num_samples
        self.device = next(model.parameters()).device
    
    def certify(self, x, n0=100, alpha=0.001):
        """
        Certification (follows CERTIFY in Cohen et al., 2019)
        
        Parameters:
        - x: input sample, shape (1, C, H, W)
        - n0: number of samples used only to guess the top class
        - alpha: failure probability of the certificate
        
        Returns:
        - predicted_class: the top class, or -1 to abstain
        - certified_radius: certified L2 radius (0.0 on abstention)
        """
        self.model.eval()
        
        # Step 1: guess the top class from a small sample
        with torch.no_grad():
            counts0 = self._sample_predictions(x, n0)
        top_class = counts0.argmax().item()
        
        # Step 2: lower confidence bound on p_A from a large sample
        with torch.no_grad():
            counts = self._sample_predictions(x, self.num_samples)
        p_lower = self._lower_confidence_bound(
            counts[top_class].item(), self.num_samples, alpha
        )
        
        if p_lower <= 0.5:
            return -1, 0.0  # abstain
        
        # Certified radius: sigma * Phi^{-1}(p_lower)
        normal = torch.distributions.Normal(0.0, 1.0)
        radius = self.sigma * normal.icdf(torch.tensor(p_lower)).item()
        
        return top_class, radius
    
    def _sample_predictions(self, x, num_samples):
        """Count the base classifier's predictions over noisy copies of x"""
        # In practice this should be batched to bound memory; kept simple here
        x_repeated = x.repeat(num_samples, 1, 1, 1).to(self.device)
        
        # Add Gaussian noise
        noise = torch.randn_like(x_repeated) * self.sigma
        x_noisy = torch.clamp(x_repeated + noise, 0, 1)
        
        # Predict
        with torch.no_grad():
            outputs = self.model(x_noisy)
            predictions = outputs.argmax(dim=1)
        
        # Per-class vote counts
        num_classes = outputs.size(1)
        counts = torch.bincount(predictions.cpu(), minlength=num_classes)
        
        return counts
    
    def _lower_confidence_bound(self, successes, trials, alpha):
        """
        One-sided lower confidence bound on a binomial proportion
        
        Cohen et al. use the exact Clopper-Pearson bound; the Wilson score
        bound below is a simple approximation serving the same role
        """
        import math
        from statistics import NormalDist
        
        z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
        n = trials
        p_hat = successes / n
        
        denominator = 1 + z**2 / n
        center = (p_hat + z**2 / (2 * n)) / denominator
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator
        
        return max(0.0, center - margin)

Certified vs. Empirical Robustness

  • A certified radius is a guarantee: within it, no attack succeeds even with full knowledge of the model parameters
  • But certified radii are usually conservative and may understate actual robustness
  • Adversarial training improves empirical robustness, but not necessarily certified robustness; a usage sketch of the class above follows
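
A minimal usage sketch (the model choice and input are assumptions; per Cohen et al., the base classifier must itself be trained with Gaussian noise augmentation for the votes to be meaningful):

import torchvision

model = torchvision.models.resnet18(num_classes=10)  # assumed noise-trained
smoother = RandomizedSmoothing(model, sigma=0.25, num_samples=1000)

x = torch.rand(1, 3, 32, 32)  # placeholder input in [0, 1]
pred, radius = smoother.certify(x, n0=100, alpha=0.001)
if pred == -1:
    print("abstain")
else:
    print(f"predicted class {pred}, certified L2 radius {radius:.3f}")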

7. A Practical Guide to Adversarial Training

7.1 Recommended Training Configurations

def get_training_recommendations():
    """
    对抗训练实用配置建议
    """
    recommendations = {
        'PGD_adversarial_training': {
            'epsilon': '8/255 (CIFAR-10), 4/255 (ImageNet)',
            'alpha': 'epsilon / 4',
            'num_iter': '7-10',
            'optimizer': 'SGD with momentum 0.9',
            'learning_rate': '0.01-0.1',
            'weight_decay': '5e-4',
            'batch_size': '128-256',
            'training_epochs': '110'
        },
        'fast_adversarial_training': {
            'epsilon': '8/255',
            'alpha': '1.25 * epsilon',
            'random_init': 'True',
            'optimizer': 'SGD with momentum 0.9',
            'learning_rate': '0.1'
        },
        'TRADES': {
            'epsilon': '8/255',
            'beta': '6.0',
            'optimizer': 'Adam or SGD',
            'learning_rate': '0.01'
        }
    }
    
    return recommendations

7.2 Evaluation Protocol

def robust_evaluation_protocol():
    """
    鲁棒性评估标准协议
    """
    evaluation = {
        'white_box_attacks': [
            'PGD-20: PGD with 20 steps',
            'PGD-100: PGD with 100 steps',
            'AutoAttack: ensemble of multiple attacks'
        ],
        'black_box_attacks': [
            'Transfer attack from surrogate model',
            'HopSkipJumpAttack',
            'Square Attack'
        ],
        'certified_robustness': [
            'Randomized Smoothing',
            'CROWN bounds',
            'IBP (Interval Bound Propagation)'
        ],
        'metrics': [
            'Robust Accuracy: accuracy on adversarial examples',
            'Clean Accuracy: accuracy on clean data',
            'Certified Radius: guaranteed perturbation bound'
        ]
    }
    
    return evaluation
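
As a concrete instance of the white-box portion of this protocol, here is a minimal PGD-20 evaluation sketch (a hypothetical helper; it assumes a trained model and a test dataloader and reuses the PGD logic from Section 3.1):

def evaluate_robust_accuracy(model, dataloader, epsilon=8/255,
                             alpha=2/255, num_steps=20):
    """
    Sketch of PGD-20 robust accuracy: the fraction of test points still
    classified correctly after a 20-step PGD attack
    """
    model.eval()
    device = next(model.parameters()).device
    correct, total = 0, 0
    
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)
        # Random start inside the epsilon-ball
        x_adv = images + torch.empty_like(images).uniform_(-epsilon, epsilon)
        x_adv = torch.clamp(x_adv, 0, 1)
        
        for _ in range(num_steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), labels)
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()
                x_adv = torch.maximum(x_adv, images - epsilon)
                x_adv = torch.minimum(x_adv, images + epsilon)
                x_adv = torch.clamp(x_adv, 0, 1)
        
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    
    return 100.0 * correct / total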

8. References

  1. Madry, A., et al. (2018). "Towards Deep Learning Models Resistant to Adversarial Attacks." ICLR.
  2. Goodfellow, I. J., et al. (2015). "Explaining and Harnessing Adversarial Examples." ICLR.
  3. Wong, E., & Kolter, J. Z. (2018). "Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope." ICML.
  4. Wong, E., et al. (2020). "Fast Is Better Than Free: Revisiting Adversarial Training." ICLR.
  5. Shafahi, A., et al. (2019). "Adversarial Training for Free!" NeurIPS.
  6. Zhang, H., et al. (2019). "Theoretically Principled Trade-off between Robustness and Accuracy." ICML.
  7. Wang, Y., et al. (2020). "Improving Adversarial Robustness Requires Revisiting Misclassified Examples." ICLR.
  8. Cohen, J. M., et al. (2019). "Certified Adversarial Robustness via Randomized Smoothing." ICML.
  9. Carlini, N., et al. (2019). "On Evaluating Adversarial Robustness." arXiv.
  10. Tsipras, D., et al. (2019). "Robustness May Be at Odds with Accuracy." ICLR.

9. Related Documents