DeepSpeed训练优化

关键词速览

关键词

DeepSpeedZeRO-1/2/3BF16混合精度FSDPGradient Checkpointing
梯度累积显存优化通信优化梯度裁剪CPU卸载
NVMe卸载Transformer引擎序列并行3D并行训练效率

1. DeepSpeed ZeRO Stages

DeepSpeed的ZeRO(Zero Redundancy Optimizer)是一种革命性的显存优化技术,通过分片技术消除训练过程中的冗余显存占用。ZeRO有三个阶段,每个阶段优化不同层面的显存。

1.1 ZeRO Stage 1:优化器状态分片

ZeRO-1仅分片优化器状态(Optimizer States),每个GPU仅保存1/N的优化器状态:

显存占用对比:
- 原始优化器状态:P * 4 bytes(N=参数量)
- ZeRO-1后:P * 4/N + P * 4 bytes(仅优化器分片)

例如70B模型,8卡:
- 原始:70B * 4 * 8 = 2240GB(不可行)
- ZeRO-1:70B * 4/8 + 70B * 4 = 367.5GB(可行)

配置示例

{
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 1.0
        }
    }
}

1.2 ZeRO Stage 2:+梯度分片

ZeRO-2在ZeRO-1基础上增加梯度分片,进一步减少显存:

ZeRO-2显存占用:
- 优化器状态:P * 4/N
- 梯度:P * 2/N
- 模型参数:P * 2 bytes(完整保留)
- 总计:约 (P * 4/N + P * 4) bytes

70B模型8卡:约140GB(单卡约17.5GB)

配置示例

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e7,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e7,
        "contiguous_gradients": true
    }
}

ZeRO-2使用建议

  • 适用场景:7B-70B模型,单机多卡
  • 推荐配置:stage=2, overlap_comm=true
  • 显存收益:相比原始减少约4-8倍

1.3 ZeRO Stage 3:+参数分片

ZeRO-3对模型参数也进行分片,是最激进的显存优化方案:

ZeRO-3显存占用:
- 优化器状态:P * 4/N
- 梯度:P * 2/N
- 模型参数:P * 2/N(按需获取)

理论上:总显存 ≈ P * 12/N bytes
70B模型8卡:约105GB(单卡约13GB)

配置示例

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "nvme_path": "/local/nvme",
            "buffer_count": 4,
            "buffer_size": 1e8
        },
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    }
}

1.4 ZeRO-Offload:CPU卸载策略

ZeRO-Offload将部分计算卸载到CPU,适合显存有限但需要训练大模型的场景:

Offload优化器状态

{
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 1.0
        }
    }
}

Offload参数到NVMe

{
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/path/to/nvme_store",
            "buffer_count": 8,
            "buffer_size": 2e8,
            "fast_init": true
        },
        "offload_optimizer": {
            "device": "cpu"
        }
    }
}
卸载策略显存节省计算开销适用场景
Offload优化器中等中等模型,单机
Offload参数(CPU)大模型,多机
Offload参数(NVMe)最高超大模型,极限显存

Offload注意事项

  • CPU卸载需要足够的CPU内存(通常每GPU需64GB+)
  • NVMe卸载需要高性能SSD(推荐NVMe Gen4)
  • 卸载会显著增加训练时间,约为正常训练的2-5倍

1.5 ZeRO Stage对比总结

特性ZeRO-1ZeRO-2ZeRO-3
分片内容优化器状态优化器+梯度全部参数
显存节省4x8xNx
通信量1x1.5x2x
实现复杂度
适用模型规模<30B<70B无限制

2. 混合精度训练配置

2.1 精度类型选择

DeepSpeed支持多种精度训练模式:

精度类型显存占用计算速度数值稳定性推荐场景
FP32100%基准最佳基准测试
FP1650%良好通用训练
BF1650%最佳推荐首选
FP825%最快一般H100专用

2.2 BF16混合精度(推荐)

BF16(Brain Float 16)是当前大模型训练的最佳精度选择:

{
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    }
}

BF16优势

  • 与FP32相同的指数位(8位),避免溢出
  • 比FP16更稳定的训练过程
  • Google、Meta等公司训练LLM的标准选择

2.3 FP16混合精度

{
    "bf16": {
        "enabled": false
    },
    "fp16": {
        "enabled": true,
        "auto_cast": false,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}

2.4 梯度缩放配置

{
    "gradient_clipping": 1.0,
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 10
}

3. Gradient Checkpointing

梯度检查点(Gradient Checkpointing)是一种用时间换显存的技术,通过在前向传播时不保存中间激活值,而在反向传播时重新计算。

3.1 工作原理

标准前向传播:
Input -> Layer1(保存激活) -> Layer2(保存激活) -> ... -> LayerN -> Output

使用Gradient Checkpointing:
Input -> Layer1 -> Layer2 -> ... -> LayerN -> Output
          ↑                                    ↓
          └──── 反向传播时重新计算中间激活 ←────┘

3.2 配置方法

{
    "gradient_checkpointing": true,
    "gradient_checkpointing_args": {
        "xformers_attn": true,
        "linear": true
    }
}

3.3 与DeepSpeed集成

{
    "zero_optimization": {
        "stage": 2
    },
    "gradient_clipping": 1.0,
    "fp16": {
        "enabled": false
    },
    "bf16": {
        "enabled": true
    }
}

3.4 显存节省效果

模型规模无Checkpointing有Checkpointing节省比例
7B约14GB约8GB43%
13B约26GB约14GB46%
70B约140GB约60GB57%

4. 通信优化

4.1 Overlap通信计算

DeepSpeed通过异步通信实现计算与通信的重叠:

{
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "allgather_bucket_size": 5e7,
        "reduce_bucket_size": 5e7
    }
}

4.2 通信桶大小调优

参数作用推荐值
allgather_bucket_size参数收集通信量5e7 ~ 2e8
reduce_bucket_size梯度聚合通信量5e7 ~ 2e8
{
    "zero_optimization": {
        "stage": 3,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8
    }
}

4.3 NCCL优化配置

# 环境变量优化
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=P2P
export NCCL_GRAPH_MAPPING_ENABLE=1
export NCCL_TOPOLOGY_FILE=/path/to/topology.file
export NCCL_GRAPH_ENABLE=1
export NCCL_MIN_NCHANNELS=4

4.4 网络拓扑感知

对于多节点训练,DeepSpeed支持读取InfiniBand或RoCE的拓扑文件:

{
    "zero_optimization": {
        "stage": 3,
        "tcp_backend": {
            "enabled": false
        }
    },
    "communication_data_type": "fp16"
}

5. FSDP(Fully Sharded Data Parallel)

FSDP是PyTorch原生的参数分片方案,与DeepSpeed ZeRO-3功能类似但集成方式不同。

5.1 与ZeRO-3对比

特性DeepSpeed ZeRO-3FSDP
集成方式外部引擎PyTorch原生
易用性需配置JSON代码集成
优化空间更多较少
灵活性
PyTorch兼容性一般最佳

5.2 FSDP配置

from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    BackwardPrefetch,
    AutoWrapPolicy,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
 
# 混合精度配置
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
    buffer_dtype=torch.bfloat16,
)
 
# 自动包装策略
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={
        TransformerEncoderLayer,
        TransformerDecoderLayer,
        LlamaDecoderLayer,
        MistralDecoderLayer,
    },
)
 
# 初始化FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # 或 HYBRID_SHARD
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=mp_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
    limit_all_gathers=True,
)

5.3 混合分片策略

from torch.distributed.fsdp import ShardingStrategy
 
# 混合分片:节点内ZeRO,跨节点全分片
HYBRID_SHARD = ShardingStrategy.HYBRID_SHARD
 
# 分组分片:部分参数ZeRO-2
SHARD_GRAD_OP = ShardingStrategy.SHARD_GRAD_OP
 
model = FSDP(
    model,
    sharding_strategy=HYBRID_SHARD,
    process_group=my_process_group,
)

5.4 FSDP检查点保存

from torch.distributed.fsdp import FullOptimStateDictConfig, FullStateDictConfig
 
# 保存(需合并)
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
model.save_state_dict(
    "checkpoint",
    save_policy=save_policy,
)
 
# 加载
load_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
model.load_state_dict(
    "checkpoint",
    load_policy=load_policy,
)

6. 训练效率调优

6.1 批量大小优化

等效批量大小计算

等效batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus

例如:
- micro_batch_size = 1
- gradient_accumulation_steps = 16
- num_gpus = 8
等效batch_size = 1 * 16 * 8 = 128

建议配置

GPU显存推荐batch_sizegradient_accumulation_steps等效batch
24GB13232
40GB21632
80GB4832

6.2 序列长度与打包

{
    "max_seq_len": 4096,
    "sample_packing": true,
    "max_packed_sequences": 16,
    "attention_dropout": 0.0,
    "hidden_dropout": 0.0
}

6.3 优化器选择

优化器显存占用收敛速度推荐场景
AdamW全参数微调
AdamW8bitLoRA/QLoRA
Sophia省钱优先
Lion实验性
{
    "optimizer": {
        "type": "adamw",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    }
}

6.4 学习率调度

{
    "lr_scheduler": {
        "type": "cosine",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 100,
            "min_lr_ratio": 0.1
        }
    }
}

6.5 Transformer引擎

DeepSpeed Transformer引擎提供额外的优化:

{
    "transformer_engine": {
        "enabled": true,
        "layer_type": "transformer",
        "training_tp_factors": 1,
        "inference_tp_factors": 1,
        "kv_channels": 128
    }
}

6.6 完整配置示例

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e7,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e7,
        "contiguous_gradients": true
    },
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false
    },
    "gradient_checkpointing": true,
    "steps_per_print": 10,
    "wall_clock_breakdown": false,
    "zero_allow_untested_optimizer": true
}

相关文档