关键词列表

| 术语 | 英文/缩写 | 重要性 |
| --- | --- | --- |
| 投机解码 | Speculative Decoding | ⭐⭐⭐⭐⭐ |
| KV缓存 | KV Cache | ⭐⭐⭐⭐⭐ |
| 量化 | Quantization | ⭐⭐⭐⭐⭐ |
| 模型蒸馏 | Distillation | ⭐⭐⭐⭐ |
| 推理时计算扩展 | Test-time Compute | ⭐⭐⭐⭐ |
| 思维链 | Chain-of-Thought | ⭐⭐⭐⭐ |
| INT4/INT8量化 | INT4/INT8 Quantization | ⭐⭐⭐⭐ |
| 快速注意力 | Flash Attention | ⭐⭐⭐⭐ |
| 稀疏注意力 | Sparse Attention | ⭐⭐⭐⭐ |
| 连续批处理 | Continuous Batching | ⭐⭐⭐⭐ |

推理计算成本优化:技术全景与实战指南

一、推理成本问题的本质

1.1 为什么推理成本如此重要

大语言模型的推理成本是部署和应用的核心瓶颈。与训练成本的一次性投入不同,推理成本是持续性的运营支出,直接影响AI应用的经济可行性。

成本构成分析

| 成本要素 | 影响因素 | 优化空间 |
| --- | --- | --- |
| 计算成本 | 模型参数量、FLOPs | |
| 内存带宽 | 参数访问模式 | |
| 存储I/O | 模型大小、加载频率 | |
| 延迟 | 并行度、批处理效率 | |
| 能耗 | 计算密度 | |
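
为了更直观地看到这些要素如何折算成运营成本,下面给出一个极简的估算脚本(GPU 时价、吞吐量均为假设数字,仅用于演示计算方式):

def cost_per_million_tokens(gpu_hourly_price_usd: float,
                            tokens_per_second: float) -> float:
    """根据 GPU 时价与持续吞吐量估算每百万 token 的推理成本(美元)"""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_price_usd / tokens_per_hour * 1_000_000

# 假设:单卡时价 2 美元/小时,持续吞吐 1000 tokens/s
print(f"${cost_per_million_tokens(2.0, 1000):.3f} / 1M tokens")  # ≈ $0.556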

1.2 推理的数学本质

自回归推理的过程可以形式化为:

给定输入序列 x_1, x_2, ..., x_n,模型需要逐个生成输出 token:y_t ~ p(y_t | x_1,...,x_n, y_1,...,y_{t-1})

每次生成一个新 token 都需要一次完整的前向传播;生成 T 个 token 就需要 T 次前向传播,无法像训练阶段那样对整段序列并行。

关键瓶颈

对于一个 L 层、隐藏维度为 d 的 Transformer 模型,常用的数量级近似为:

  • 参数量:N ≈ 12·L·d²(忽略词表 embedding)
  • 单次前向计算量:每生成一个 token 约需 2N FLOPs
  • 内存占用:权重约为 N × 每参数字节数(FP16 为 2 字节),外加随序列长度线性增长的 KV Cache(详见下文 KV Cache 一节)

下面的脚本按这些近似公式做一个数量级估算。
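
以下是基于上述近似公式的估算脚本,estimate_model_cost 为演示用的假设函数名,配置数字仅取常见 7B 规模作示意:

def estimate_model_cost(num_layers: int, hidden_dim: int, bytes_per_param: int = 2):
    """按 N ≈ 12·L·d² 等近似公式估算参数量、单 token 计算量与权重显存"""
    params = 12 * num_layers * hidden_dim ** 2        # 忽略 embedding
    flops_per_token = 2 * params                      # 每生成一个 token 的前向 FLOPs
    weight_memory_gb = params * bytes_per_param / 1024 ** 3
    return {
        'params_billion': params / 1e9,
        'tflops_per_token': flops_per_token / 1e12,
        'weight_memory_gb': weight_memory_gb,
    }

# 例如 32 层、hidden_dim=4096(接近 7B 规模)
print(estimate_model_cost(num_layers=32, hidden_dim=4096))
# ≈ {'params_billion': 6.44, 'tflops_per_token': 0.013, 'weight_memory_gb': 12.0}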

二、Chain-of-Thought的Token开销

2.1 CoT的双刃剑效应

Chain-of-Thought(思维链)推理显著提升了模型在复杂任务上的表现,但同时也带来了巨大的Token开销。

典型案例分析

def analyze_cot_overhead():
    """
    分析CoT推理的Token开销
    """
    # 标准问答
    standard_prompt = """
    用户: 请问北京的人口是多少?
    回答: 约2189万(2023年数据)
    """
    standard_tokens = 50
    
    # CoT问答
    cot_prompt = """
    用户: 请问北京的人口是多少?
    思考: 
    - 北京是中华人民共和国的首都
    - 根据最新的人口普查数据
    - 2023年北京市常住人口约为2189万人
    - 这个数据来源于北京市统计局
    回答: 约2189万(2023年数据)
    """
    cot_tokens = 150
    
    # 开销分析
    token_overhead = cot_tokens - standard_tokens  # 额外100 tokens
    cost_increase = cot_tokens / standard_tokens  # 3倍
    
    return {
        'standard_tokens': standard_tokens,
        'cot_tokens': cot_tokens,
        'overhead': token_overhead,
        'cost_multiplier': cost_increase,
        'latency_increase': cot_tokens / standard_tokens  # 生成延迟与输出token数近似成正比
    }

复杂推理的Token爆炸

对于多步骤推理任务,Token数量可能呈指数增长:

def estimate_reasoning_tokens(problem_complexity):
    """
    估算推理问题的Token消耗
    """
    # 问题复杂度等级
    levels = {
        'simple': {'steps': 2, 'tokens_per_step': 30},
        'moderate': {'steps': 5, 'tokens_per_step': 50},
        'complex': {'steps': 15, 'tokens_per_step': 80},
        'expert': {'steps': 50, 'tokens_per_step': 120}
    }
    
    config = levels[problem_complexity]
    
    # 基础回答 + 推理过程
    base_tokens = 20
    reasoning_tokens = config['steps'] * config['tokens_per_step']
    
    return {
        'base_tokens': base_tokens,
        'reasoning_tokens': reasoning_tokens,
        'total_tokens': base_tokens + reasoning_tokens,
        'estimated_cost': (base_tokens + reasoning_tokens) * 0.00001  # 假设单价
    }

2.2 Test-time Compute Scaling

OpenAI的"Test-time Compute"(推理时计算扩展)理论指出:模型可以在推理阶段主动分配更多计算资源,以额外的推理开销换取更高的答案质量。

三种Scaling策略

class TestTimeComputeScaler:
    """
    推理时计算扩展策略
    """
    
    def strategy_1_sample_verify(self, model, prompt, num_samples=8):
        """
        策略1:多次采样+验证
        适用于:数学问题、代码生成
        """
        samples = [model.generate(prompt, temperature=0.8) 
                   for _ in range(num_samples)]
        
        # 验证每个答案
        verified_answers = []
        for sample in samples:
            verification = self.verify_answer(prompt, sample)
            if verification['is_correct']:
                verified_answers.append((sample, verification['confidence']))
        
        # 选择最佳答案
        if verified_answers:
            return max(verified_answers, key=lambda x: x[1])[0]
        
        return samples[0]  # 回退到第一个
    
    def strategy_2_process_verbalizer(self, model, prompt):
        """
        策略2:过程监督(PRM)
        适用于:多步骤推理
        """
        # 引导模型生成推理步骤
        reasoning_steps = model.generate_reasoning_steps(prompt)
        
        # 对每个步骤进行置信度评估
        step_confidences = []
        for step in reasoning_steps:
            confidence = model.assess_confidence(prompt, step)
            step_confidences.append((step, confidence))
        
        # 选择置信度最高的推理路径
        best_path = self.select_best_path(step_confidences)
        return best_path['final_answer']
    
    def strategy_3_look_ahead_search(self, model, prompt, beam_width=4, depth=5):
        """
        策略3:前瞻搜索(类似AlphaGo的MCTS)
        适用于:复杂决策任务
        """
        # 蒙特卡洛树搜索
        root = MCTSNode(state={'prompt': prompt, 'response': ''})
        
        for _ in range(100):  # 搜索迭代次数
            # 选择
            node = root.select()
            
            # 扩展
            if not node.is_terminal():
                next_tokens = model.sample_next_tokens(node.state['response'], k=beam_width)
                for token in next_tokens:
                    node.expand(token)
            
            # 模拟
            simulation_result = self.simulate(node)
            
            # 回传
            node.backpropagate(simulation_result)
        
        # 选择最佳行动
        best_child = root.best_child()
        return best_child.state['response']
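
上面三种策略依赖假想的模型接口(verify_answer、MCTSNode 等仅为示意)。下面给出策略 1 最常见的简化形式,即自一致性(Self-Consistency)多数投票,的一个可运行最小示例;其中 fake_generate 只是用随机数模拟采样结果的占位函数:

import random
from collections import Counter

def fake_generate(prompt: str, temperature: float = 0.8) -> str:
    """占位函数:真实场景中应替换为模型的采样式生成调用"""
    return random.choice(["42", "42", "42", "41", "43"])  # 模拟有噪声的答案分布

def self_consistency(prompt: str, num_samples: int = 8) -> str:
    """多次采样后对最终答案做多数投票,用额外计算换取更高正确率"""
    answers = [fake_generate(prompt) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("1 + 41 = ?"))  # 大概率输出 "42"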

三、Speculative Decoding(投机解码)

3.1 核心原理

投机解码是一种革命性的推理加速技术,通过"预测-验证"(draft-then-verify)范式来加速自回归生成。

核心洞察:自回归解码受内存带宽限制,大模型逐个生成token的成本很高;但让大模型并行验证k个候选token的成本与生成1个token相近。因此可以先用"廉价"的Draft模型快速生成一小段候选token,再由"昂贵"的大模型一次前向并行验证,从而摊薄对大模型的调用次数。

工作流程

┌─────────────────────────────────────────────────────────────┐
│                    投机解码工作流程                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  阶段1:投机生成(Drafter)                                  │
│  ┌─────────┐                                                │
│  │ 小模型   │ ──▶ 生成 k 个候选token: [t1, t2, ..., tk]     │
│  │ (快速)   │                                                │
│  └─────────┘                                                │
│                          ↓                                  │
│  阶段2:并行验证(Verifier)                                 │
│  ┌─────────┐                                                │
│  │ 大模型   │ ──▶ 并行验证: [t1, t2, ..., tk]               │
│  │ (准确)   │      计算: P_large(t1), P_large(t2), ...      │
│  └─────────┘                                                │
│                          ↓                                  │
│  阶段3:接受决策                                             │
│  对于每个token ti:                                          │
│    以概率 min(1, P_large(ti)/P_draft(ti)) 接受               │
│    一旦某个ti被拒绝:从修正分布重新采样,丢弃其后的候选        │
│                                                             │
│  预期加速比 ≈ 每轮平均接受token数 + 1,实践中约 2-4x          │
└─────────────────────────────────────────────────────────────┘
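
投机解码相关论文中常见的期望分析是:若单个候选token被接受的概率为 α、草稿长度为 k,则每轮期望产出约 (1-α^(k+1))/(1-α) 个token。下面按该公式粗略估算加速比,其中 draft_cost_ratio(草稿模型单步成本占大模型的比例)为假设参数:

def speculative_speedup(alpha: float, k: int, draft_cost_ratio: float = 0.1) -> float:
    """估算投机解码的理论加速比(忽略并行验证与单步生成之间的细微开销差异)"""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)   # 每轮期望产出的token数
    cost_per_round = 1 + k * draft_cost_ratio                # 1次大模型前向 + k次草稿前向
    return expected_tokens / cost_per_round

print(f"{speculative_speedup(alpha=0.8, k=8):.2f}x")  # 约 2.4x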

3.2 完整实现

import torch
import torch.nn.functional as F
from typing import List, Tuple
 
class SpeculativeDecoder:
    """
    投机解码器实现
    """
    def __init__(self, draft_model, verifier_model, max_draft_tokens=8, 
                 acceptance_threshold=0.8, temperature=1.0):
        self.draft_model = draft_model
        self.verifier_model = verifier_model
        self.max_draft_tokens = max_draft_tokens
        self.acceptance_threshold = acceptance_threshold
        self.temperature = temperature
    
    def draft(self, input_ids: torch.Tensor, 
              draft_kv_cache: dict = None) -> Tuple[List[int], dict]:
        """
        阶段1:投机生成
        """
        draft_tokens = []
        current_kv_cache = {} if draft_kv_cache is None else draft_kv_cache.copy()
        current_input = input_ids
        
        for _ in range(self.max_draft_tokens):
            # 单一token前向传播(轻量)
            logits = self.draft_model.forward_single(
                current_input, 
                **current_kv_cache
            )
            
            # 采样下一个token
            probs = F.softmax(logits[-1] / self.temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            
            # 检查是否生成结束
            if next_token == self.draft_model.eos_token_id:
                break
            
            draft_tokens.append(next_token)
            current_input = torch.tensor([[next_token]])
            
            # 更新KV cache
            current_kv_cache = self.draft_model.update_kv_cache(
                current_kv_cache
            )
        
        return draft_tokens, current_kv_cache
    
    def verify(self, input_ids: torch.Tensor, 
               draft_tokens: List[int],
               verifier_kv_cache: dict = None) -> Tuple[List[int], int]:
        """
        阶段2:并行验证
        返回 (被接受的token列表, 第一个被拒绝的位置;若全部接受则等于len(draft_tokens))
        """
        if not draft_tokens:
            return [], 0
        
        # 构造完整序列
        full_sequence = torch.cat([
            input_ids,
            torch.tensor([[t] for t in draft_tokens])
        ], dim=0)
        
        # 一次前向传播,并行得到所有draft位置的验证logits
        all_logits = self.verifier_model.forward(full_sequence, **(verifier_kv_cache or {}))
        
        # 获取draft模型与verifier模型在每个draft token上的概率
        draft_probs = F.softmax(
            self.draft_model.get_logits(
                torch.tensor([[t] for t in draft_tokens])
            ) / self.temperature, dim=-1
        )
        
        verifier_logits = all_logits[len(input_ids):]
        verifier_probs = F.softmax(verifier_logits / self.temperature, dim=-1)
        
        accepted_tokens = []
        first_rejected = len(draft_tokens)
        
        for i, draft_tok in enumerate(draft_tokens):
            draft_prob = draft_probs[i, draft_tok].item()
            verifier_prob = verifier_probs[i, draft_tok].item()
            
            # 接受决策:要求verifier给出的概率不低于draft概率的一定比例
            # (标准投机采样使用 min(1, p_verifier/p_draft) 的随机接受,这里是其阈值化近似)
            if verifier_prob >= draft_prob * self.acceptance_threshold:
                accepted_tokens.append(draft_tok)
            else:
                first_rejected = i
                break  # 一旦拒绝,其后的draft token全部作废
        
        return accepted_tokens, first_rejected
    
    def generate_with_draft(self, input_ids: torch.Tensor, 
                           max_new_tokens: int = 100) -> List[int]:
        """
        完整的投机解码生成流程
        """
        output_tokens = []
        current_input = input_ids
        total_generated = 0
        
        # KV caches
        draft_kv_cache = {}
        verifier_kv_cache = {}
        
        while total_generated < max_new_tokens:
            # 阶段1:投机生成
            draft_tokens, draft_kv_cache = self.draft(
                current_input, draft_kv_cache
            )
            
            if not draft_tokens:
                # Draft失败,使用verifier直接生成
                logits = self.verifier_model.forward_single(
                    current_input, **verifier_kv_cache
                )
                next_token = torch.argmax(logits[-1]).item()
                output_tokens.append(next_token)
                current_input = torch.tensor([[next_token]])
                total_generated += 1
                continue
            
            # 阶段2:验证
            accepted, rejected_at = self.verify(
                current_input, draft_tokens, verifier_kv_cache
            )
            
            # 添加所有接受的token
            output_tokens.extend(accepted)
            total_generated += len(accepted)
            
            # 如果有被拒绝的token,在verifier的分布上重新生成该位置
            if rejected_at < len(draft_tokens):
                rejected_logits = self.verifier_model.get_logits_after_draft(
                    current_input, draft_tokens[:rejected_at], verifier_kv_cache
                )
                next_token = torch.argmax(rejected_logits[-1]).item()
                output_tokens.append(next_token)
                current_input = torch.tensor([[next_token]])
                total_generated += 1
            else:
                # 所有draft都被接受,继续下一轮投机
                current_input = torch.tensor([[draft_tokens[-1]]])
            
            # 检查是否结束
            if output_tokens[-1] == self.verifier_model.eos_token_id:
                break
        
        return output_tokens
 
class DraftModelSelector:
    """
    Draft模型选择器
    """
    def __init__(self):
        self.draft_options = {
            'tinyllama-1b': {'params': '1B', 'speedup': 3.0},
            'phi-2-2.7b': {'params': '2.7B', 'speedup': 2.5},
            'minicpm-2b': {'params': '2B', 'speedup': 2.8},
        }
    
    def select_draft(self, verifier_model_name: str) -> str:
        """
        根据verifier模型选择合适的draft模型
        """
        # 简化选择逻辑
        if '70b' in verifier_model_name or '65b' in verifier_model_name:
            return 'phi-2-2.7b'  # 大模型用较大draft
        else:
            return 'tinyllama-1b'
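
需要说明的是,上面的实现用阈值对接受判定做了近似;标准投机采样的接受准则是随机化的:对每个候选token以 min(1, p_verifier/p_draft) 的概率接受,被拒绝时从修正分布 norm(max(0, p_verifier - p_draft)) 中重新采样,这样可以严格保持大模型的输出分布。下面用两个玩具分布演示这一步,accept_or_resample 为演示用的辅助函数:

import torch

def accept_or_resample(p_draft: torch.Tensor, p_verifier: torch.Tensor, token: int) -> int:
    """标准投机采样中单个token的接受/重采样步骤"""
    accept_prob = min(1.0, (p_verifier[token] / p_draft[token]).item())
    if torch.rand(1).item() < accept_prob:
        return token  # 接受draft的候选
    # 拒绝:从修正分布 max(0, p_verifier - p_draft) 归一化后重新采样
    residual = torch.clamp(p_verifier - p_draft, min=0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, num_samples=1).item()

p_draft = torch.tensor([0.7, 0.2, 0.1])     # 草稿模型的分布
p_verifier = torch.tensor([0.4, 0.5, 0.1])  # 大模型的分布
print(accept_or_resample(p_draft, p_verifier, token=0))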

四、KV Cache优化

4.1 标准KV Cache的问题

在自回归生成过程中,KV Cache存储了所有已生成token的Key和Value矩阵:

def kv_cache_memory_cost(model_config):
    """
    计算KV Cache的内存消耗
    """
    num_layers = model_config['num_layers']  # e.g., 32
    num_heads = model_config['num_heads']   # e.g., 32
    head_dim = model_config['head_dim']     # e.g., 128
    batch_size = model_config['batch_size'] # e.g., 1
    max_seq_len = model_config['max_seq_len']  # e.g., 4096
    
    # 每个token在单层的KV大小(字节)
    kv_bytes_per_token_per_layer = 2 * num_heads * head_dim * 4  # K + V, float32
    
    # 每个token跨所有层的KV大小(字节)
    kv_bytes_per_token = kv_bytes_per_token_per_layer * num_layers
    
    # 总大小(MB)
    total_size = (kv_bytes_per_token * max_seq_len * batch_size) / (1024**2)
    
    return {
        'per_token_per_layer_kb': kv_bytes_per_token_per_layer / 1024,
        'per_token_mb': kv_bytes_per_token / (1024**2),
        'total_mb': total_size,
        'total_gb': total_size / 1024
    }
 
# 示例计算
example_config = {
    'num_layers': 32,
    'num_heads': 32,
    'head_dim': 128,
    'batch_size': 1,
    'max_seq_len': 4096
}
 
memory = kv_cache_memory_cost(example_config)
print(f"KV Cache总内存: {memory['total_gb']:.2f} GB")

4.2 Flash Attention的高效实现

Flash Attention通过IO-aware优化显著加速注意力计算:

class FlashAttention:
    """
    Flash Attention核心实现(简化版)
    实际实现需要CUDA kernel编程
    """
    
    @staticmethod
    def flash_attention(Q, K, V, scale, block_size=128):
        """
        Flash Attention的分块计算
        
        核心思想:
        1. 将Q, K, V分成小块
        2. 分块计算注意力,避免 materialization 完整注意力矩阵
        3. O(N) 空间复杂度 vs O(N^2)
        """
        seq_len = Q.shape[0]
        d = Q.shape[1]
        
        # 初始化输出和归一化因子
        O = torch.zeros_like(Q)
        l = torch.zeros(seq_len, 1)
        m = torch.full((seq_len, 1), float('-inf'))
        
        # 分块处理
        for i in range(0, seq_len, block_size):
            # 加载Q的一个块(最后一块可能不足block_size)
            Q_i = Q[i:i+block_size]
            rows = Q_i.shape[0]
            
            # 初始化该块的输出、归一化因子与运行最大值
            O_i = torch.zeros_like(Q_i)
            l_i = torch.zeros(rows, 1)
            m_i = torch.full((rows, 1), float('-inf'))
            
            for j in range(0, seq_len, block_size):
                # 加载K, V的块
                K_j = K[j:j+block_size]
                V_j = V[j:j+block_size]
                
                # 计算注意力分数
                S_ij = Q_i @ K_j.T * scale
                
                # 在线softmax:运行最大值取历史值与当前块行最大值的较大者
                m_ij = torch.max(S_ij, dim=-1, keepdim=True).values
                m_new = torch.maximum(m_i, m_ij)
                
                # 按新的最大值做数值稳定的指数
                P_ij = torch.exp(S_ij - m_new)
                
                # 将历史累积量重新缩放到新的最大值,再累加当前块的贡献
                O_i = O_i * torch.exp(m_i - m_new) + P_ij @ V_j
                l_i = l_i * torch.exp(m_i - m_new) + P_ij.sum(dim=-1, keepdim=True)
                m_i = m_new
            
            # 归一化
            O[i:i+block_size] = O_i / l_i
            l[i:i+block_size] = l_i
            m[i:i+block_size] = m_i
        
        return O
 
class PagedKVCache:
    """
    分页KV Cache:类似操作系统的虚拟内存管理
    """
    def __init__(self, num_heads, head_dim, block_size=16, max_blocks=1024):
        self.block_size = block_size
        self.max_blocks = max_blocks
        
        # 块表:记录每个序列占用的KV块
        self.block_table = {}  # {sequence_id: [block_ids]}
        
        # KV存储区:第二维区分K(索引0)与V(索引1)
        self.kv_storage = torch.zeros(
            max_blocks, 2, block_size, num_heads, head_dim
        )
        self.block_usage = [0] * max_blocks         # 每个块已使用的slot数
        self.free_blocks = list(range(max_blocks))  # 空闲块池
    
    def _allocate_block(self):
        """从空闲池中取出一个物理块"""
        if not self.free_blocks:
            raise RuntimeError("KV Cache blocks exhausted")
        return self.free_blocks.pop()
    
    def allocate(self, sequence_id):
        """为新序列分配块表"""
        self.block_table[sequence_id] = []
    
    def append(self, sequence_id, k, v):
        """
        追加新的KV
        """
        block_ids = self.block_table[sequence_id]
        
        # 当前块已满或尚无块时,分配新块
        if not block_ids or self.block_usage[block_ids[-1]] >= self.block_size:
            new_block_id = self._allocate_block()
            block_ids.append(new_block_id)
            self.block_usage[new_block_id] = 0
        
        # 写入KV
        current_block = block_ids[-1]
        slot = self.block_usage[current_block]
        self.kv_storage[current_block, 0, slot] = k
        self.kv_storage[current_block, 1, slot] = v
        self.block_usage[current_block] += 1
    
    def get(self, sequence_id, position):
        """获取特定位置的KV"""
        block_id = position // self.block_size
        slot = position % self.block_size
        actual_block = self.block_table[sequence_id][block_id]
        
        return (
            self.kv_storage[actual_block, 0, slot],
            self.kv_storage[actual_block, 1, slot]
        )
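
下面是上述PagedKVCache的一个简单使用示例(张量形状与块数仅为演示):

cache = PagedKVCache(num_heads=32, head_dim=128, block_size=16, max_blocks=64)
cache.allocate(sequence_id=0)

# 逐token写入K/V,写满一个块后会自动分配第二个块
for _ in range(20):
    k = torch.randn(32, 128)
    v = torch.randn(32, 128)
    cache.append(sequence_id=0, k=k, v=v)

k17, v17 = cache.get(sequence_id=0, position=17)  # 位置17落在第二个物理块内
print(k17.shape, len(cache.block_table[0]))       # torch.Size([32, 128]) 2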

五、量化技术详解

5.1 量化基础理论

量化(Quantization)通过降低权重和激活值的精度来减少内存占用和计算量。

精度对比

| 格式 | 位宽 | 内存压缩比 | 精度损失 |
| --- | --- | --- | --- |
| FP32 | 32bit | 1x | 无(基准) |
| FP16 | 16bit | 2x | 很小 |
| BF16 | 16bit | 2x | 很小 |
| INT8 | 8bit | 4x | 中等 |
| INT4 | 4bit | 8x | 较大 |
| NF4 | 4bit | 8x | 较小(针对正态分布优化) |
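
在进入下面的实现之前,先用一个最小的对称INT8量化往返示例感受精度损失的量级(权重用随机正态分布模拟):

import torch

w = torch.randn(4096, 4096)                      # 模拟一个权重矩阵
scale = w.abs().max() / 127.0                    # 对称量化的scale
w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
w_restored = w_int8.float() * scale              # 反量化

rel_error = (w_restored - w).abs().mean() / w.abs().mean()
print(f"平均相对误差: {rel_error:.4%}")          # 对高斯分布权重通常在1%~2%量级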

5.2 量化实现

import torch
import torch.nn as nn
from typing import Tuple
 
class Quantizer:
    """量化器基类"""
    
    @staticmethod
    def dynamic_quantize(tensor: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, float]:
        """
        逐张量对称量化(weight-only INT8的基本操作)
        将权重量化为int8,激活值保持fp16
        """
        # 计算scale
        scale = tensor.abs().max() / 127.0
        
        # 量化
        quantized = torch.round(tensor / scale).to(torch.int8)
        
        return quantized, scale, 127.0
    
    @staticmethod
    def dequantize(quantized: torch.Tensor, scale: torch.Tensor, zero_point: float = 0) -> torch.Tensor:
        """反量化"""
        return quantized.float() * scale
 
class GPTQQuantizer:
    """
    GPTQ量化:一次性后训练量化
    基于Hessian信息进行最优量化
    """
    def __init__(self, model, bits=4, per_channel=True):
        self.model = model
        self.bits = bits
        self.per_channel = per_channel
    
    def quantize_layer(self, layer: nn.Linear):
        """
        量化单个线性层
        """
        weight = layer.weight.data
        bias = layer.bias.data if layer.bias is not None else None
        
        # 获取权重矩阵的维度
        out_features, in_features = weight.shape
        
        # 分组量化
        group_size = 128 if self.per_channel else in_features
        
        quantized_weights = []
        scales = []
        zeros = []
        
        for i in range(0, in_features, group_size):
            group = weight[:, i:i+group_size]
            
            # 计算最优scale和zero point
            wmin, wmax = group.min(), group.max()
            
            # 4-bit: 16个值
            num_values = 2 ** self.bits
            
            scale = (wmax - wmin) / (num_values - 1)
            zero = wmin
            
            # 量化
            q_weight = torch.round((group - zero) / scale)
            q_weight = torch.clamp(q_weight, 0, num_values - 1)
            
            quantized_weights.append(q_weight.to(torch.uint8))
            scales.append(scale)
            zeros.append(zero)
        
        return {
            'weights': quantized_weights,
            'scales': scales,
            'zeros': zeros,
            'group_size': group_size
        }
 
class AWQQuantizer:
    """
    AWQ量化:Activation-Aware Weight Quantization
    考虑激活分布的量化方法
    """
    def __init__(self, model, bits=4):
        self.model = model
        self.bits = bits
    
    def find_scale(self, weight, activations, n_grid=128):
        """
        寻找最优的量化scale
        """
        best_scale = None
        best_error = float('inf')
        
        scales = torch.linspace(0.1, 10, n_grid)
        
        for s in scales:
            # 缩放权重
            scaled_weight = weight / s
            
            # 量化并反量化
            quantized = self._quantize_tensor(scaled_weight)
            dequantized = self._dequantize_tensor(quantized, s)
            
            # 计算重构误差
            error = ((dequantized - weight) ** 2).mean()
            
            if error < best_error:
                best_error = error
                best_scale = s
        
        return best_scale
    
    def quantize(self, model, calibration_data):
        """
        完整AWQ量化流程
        """
        # 获取激活分布
        activations = self.get_activations(model, calibration_data)
        
        # 对每个线性层进行量化
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                scale = self.find_scale(module.weight.data, 
                                       activations[name])
                # 应用量化
                ...

5.3 FP8量化

class FP8Quantizer:
    """
    FP8量化实现
    E4M3 (4位指数, 3位尾数) vs E5M2 (5位指数, 2位尾数)
    需要PyTorch 2.1+提供的float8 dtype
    """
    
    @staticmethod
    def to_fp8_e4m3(tensor: torch.Tensor) -> torch.Tensor:
        """转换为FP8 E4M3格式"""
        # E4M3: 可表示范围约 [-448, 448], 尾数位多, 精度较高
        max_val = 448.0
        
        # 先clamp到可表示范围再转换
        # (生产实现通常还会配合per-tensor scale把数值缩放到该范围内)
        clamped = torch.clamp(tensor, -max_val, max_val)
        return clamped.to(torch.float8_e4m3fn)
    
    @staticmethod
    def to_fp8_e5m2(tensor: torch.Tensor) -> torch.Tensor:
        """转换为FP8 E5M2格式"""
        # E5M2: 可表示范围约 [-57344, 57344], 动态范围大、精度较低, 适合梯度
        max_val = 57344.0
        
        clamped = torch.clamp(tensor, -max_val, max_val)
        return clamped.to(torch.float8_e5m2)

六、蒸馏技术

6.1 蒸馏原理

知识蒸馏(Knowledge Distillation)通过让小模型学习大模型的行为来压缩模型。

class KnowledgeDistiller:
    """
    知识蒸馏器
    """
    def __init__(self, teacher_model, student_model, temperature=2.0):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = 0.7  # 蒸馏损失权重
    
    def distillation_loss(self, student_logits, teacher_logits):
        """
        蒸馏损失:KL散度
        """
        soft_teacher = F.softmax(teacher_logits / self.temperature, dim=-1)
        soft_student = F.log_softmax(student_logits / self.temperature, dim=-1)
        
        # 缩放后的KL散度
        distillation = F.kl_div(
            soft_student, soft_teacher, reduction='batchmean'
        ) * (self.temperature ** 2)
        
        return distillation
    
    def hard_loss(self, student_logits, labels):
        """硬标签损失"""
        return F.cross_entropy(student_logits, labels)
    
    def total_loss(self, student_logits, teacher_logits, labels):
        """总损失 = α * 蒸馏损失 + (1-α) * 硬损失"""
        distill = self.distillation_loss(student_logits, teacher_logits)
        hard = self.hard_loss(student_logits, labels)
        
        return self.alpha * distill + (1 - self.alpha) * hard
 
class MiniLMv2:
    """
    深层表示蒸馏(简化示意):学生模型对齐教师模型最后几层的隐藏表示
    (MiniLMv2原文蒸馏的是自注意力关系矩阵,此处仅保留层间对齐的思路)
    """
    def __init__(self, teacher_layers, student_layers):
        self.teacher_layers = teacher_layers
        self.student_layers = student_layers
    
    def last_layer_distillation(self, teacher_output, student_output):
        """
        最后一层蒸馏:MSE损失
        """
        return F.mse_loss(student_output, teacher_output)
    
    def multi_layer_distillation(self, teacher_hidden, student_hidden):
        """
        多层蒸馏
        """
        total_loss = 0
        num_layers = min(len(teacher_hidden), len(student_hidden))
        
        for t, s in zip(teacher_hidden[-num_layers:], student_hidden[-num_layers:]):
            # 使用Cosine嵌入损失
            loss = 1 - F.cosine_similarity(t, s, dim=-1).mean()
            total_loss += loss
        
        return total_loss / num_layers
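
下面用随机张量演示一次KnowledgeDistiller损失的计算;teacher_model/student_model仅作占位,这里只需要两组logits:

batch_size, num_classes = 8, 32000
teacher_logits = torch.randn(batch_size, num_classes)
student_logits = torch.randn(batch_size, num_classes)
labels = torch.randint(0, num_classes, (batch_size,))

distiller = KnowledgeDistiller(teacher_model=None, student_model=None, temperature=2.0)
loss = distiller.total_loss(student_logits, teacher_logits, labels)
print(loss)  # 标量张量;训练时student_logits来自学生模型前向,即可backward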

七、优化策略综合对比

| 策略 | 加速比 | 精度损失 | 适用场景 | 实施难度 |
| --- | --- | --- | --- | --- |
| INT8量化 | 2-3x | <5% | 通用部署 | |
| INT4量化 | 4-6x | 5-15% | 边缘部署 | |
| FP8量化 | 1.5-2x | <2% | 新硬件 | |
| 投机解码 | 2-4x | 无 | 流式生成 | |
| Flash Attention | 2-4x | 无 | 通用加速 | 高(需CUDA) |
| KV Cache优化 | 1.5-2x | 无 | 长序列 | |
| 模型蒸馏 | 因模型而异 | 因模型而异 | 模型压缩 | |
| 连续批处理 | 3-10x | 无 | 高吞吐 | |

八、相关主题链接