FSDP vs DeepSpeed

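Accelerate offers flexibility in the choice of training framework by integrating two extremely powerful tools for distributed training, PyTorch FSDP and Microsoft DeepSpeed. This tutorial draws parallels between the two and outlines their potential differences, so that users can switch between the frameworks seamlessly.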

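Configuring Functionalities

Model tensors are split across GPUs in order to scale up model size; FSDP calls this sharding, and DeepSpeed calls it partitioning. FSDP sharding and DeepSpeed ZeRO (partitioning) stages are configured with `--fsdp_sharding_strategy` and `--zero_stage`, respectively. In particular, FSDP `FULL_SHARD` corresponds to DeepSpeed ZeRO stage 3; see the comprehensive mapping between FSDP sharding and DeepSpeed ZeRO settings. The table below summarizes and groups similar settings: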

Group	Framework	Configuration	Example	Restrictions (if any)
sharding / partitioning	FSDP	`--fsdp_sharding_strategy`	`1` (`FULL_SHARD`)	-
sharding / partitioning	DeepSpeed	`--zero_stage`	`3`	-
offload	FSDP	`--fsdp_offload_params`	`true`	all or nothing
offload	DeepSpeed	`--offload_param_device`	`cpu`	-
offload	DeepSpeed	`--offload_optimizer_device`	`cpu`	-
model loading	FSDP	`--fsdp_cpu_ram_efficient_loading`	`true`	-
model loading	DeepSpeed	`--zero3_init_flag`	`true`	only ZeRO 3
efficient checkpointing	FSDP	`--fsdp_state_dict_type`	`SHARDED_STATE_DICT`	-
efficient checkpointing	DeepSpeed	`--zero3_save_16bit_model`	`true`	only ZeRO 3
weights prefetching	FSDP	`--fsdp_forward_prefetch`	`true`	-
weights prefetching	FSDP	`--fsdp_backward_prefetch`	`BACKWARD_PRE`	-
weights prefetching	DeepSpeed	None	-	-
model	FSDP	`--fsdp_auto_wrap_policy`	`TRANSFORMER_BASED_WRAP`	-
model	FSDP	`--fsdp_transformer_layer_cls_to_wrap`	<Layer Class>	Usually not needed
model	DeepSpeed	None	-	Transparent to user
parameters summoning	FSDP	`--fsdp_use_orig_params`	`true`	required for `torch.compile`
parameters summoning	DeepSpeed	None	-	Transparent to user
parameters syncing	FSDP	`--fsdp_sync_module_states`	`true`	-
parameters syncing	DeepSpeed	None	-	-
training	FSDP	None	-	Transparent to user
training	DeepSpeed	`--gradient_accumulation_steps`	`auto`	-
training	DeepSpeed	`--gradient_clipping`	`auto`	-

For detailed descriptions of the settings above, refer to the Accelerate launch documentation.
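The correspondence between sharding strategies and ZeRO stages can be written down as a small lookup. The sketch below is only an illustration and not part of Accelerate: the helper name `equivalent_launch_flags` is hypothetical, and while `FULL_SHARD` to stage 3 is stated in this guide, the entries for `SHARD_GRAD_OP` (stage 2) and `NO_SHARD` (stage 0) are assumptions based on the Accelerate FSDP usage guide.

```python
# Hypothetical lookup: FSDP sharding strategy -> closest DeepSpeed ZeRO stage.
# Only FULL_SHARD -> 3 is stated in this guide; the other two entries are
# assumptions taken from the Accelerate FSDP usage guide.
FSDP_TO_ZERO_STAGE = {
    "FULL_SHARD": 3,     # parameters, gradients and optimizer states are sharded
    "SHARD_GRAD_OP": 2,  # gradients and optimizer states are sharded
    "NO_SHARD": 0,       # plain data parallelism, nothing is sharded
}


def equivalent_launch_flags(fsdp_strategy: str) -> dict:
    """Return the sharding/partitioning launch flag for each framework."""
    if fsdp_strategy not in FSDP_TO_ZERO_STAGE:
        raise ValueError(f"no ZeRO equivalent recorded for {fsdp_strategy!r}")
    return {
        # The table above writes this value in its numeric form, `1` (`FULL_SHARD`).
        "FSDP": ["--fsdp_sharding_strategy", fsdp_strategy],
        "DeepSpeed": ["--zero_stage", str(FSDP_TO_ZERO_STAGE[fsdp_strategy])],
    }


if __name__ == "__main__":
    print(equivalent_launch_flags("FULL_SHARD"))
```

Keeping the mapping in one place makes it easier to switch a launch script from one framework to the other without re-deriving the stage by hand.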

Checkpointing

Note that FSDP can be configured via `--fsdp_state_dict_type` to save either full or sharded checkpoints.

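Offloading

FSDP only allows all-or-nothing offloading (that is, it either offloads parameters, gradients, and optimizer states together, or keeps them all on the GPU), whereas DeepSpeed can offload parameters and optimizer states independently. DeepSpeed additionally supports offloading to NVMe.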
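The difference in granularity is easiest to see side by side. The sketch below contrasts the single FSDP switch with a DeepSpeed `zero_optimization` section that sends parameters and optimizer states to different devices. The NVMe path is a placeholder, and the particular CPU/NVMe split is chosen purely for illustration.

```python
import json

# FSDP: one switch covers parameters, gradients and optimizer states together
# (flag name taken from the table above).
fsdp_launch_args = ["--fsdp_offload_params", "true"]

# DeepSpeed: parameters and optimizer states get independent offload targets.
# "/path/to/nvme" is a placeholder, not a real default.
deepspeed_zero_section = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/path/to/nvme"},
    }
}

if __name__ == "__main__":
    print(" ".join(fsdp_launch_args))
    print(json.dumps(deepspeed_zero_section, indent=2))
```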

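Prefetching

FSDP offers two prefetching settings, `--fsdp_forward_prefetch` and `--fsdp_backward_prefetch`, which improve the overlap of communication and computation at the cost of extra memory; see the FSDP documentation. In DeepSpeed, prefetching is turned on automatically when needed, depending on hyperparameters such as `stage3_param_persistence_threshold` and `stage3_max_reuse_distance`, which can be configured for ZeRO stage 3. Accelerate may set these hyperparameters automatically if you do not set them explicitly in the DeepSpeed config file.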
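As a rough sketch of the two styles, the snippet below shows FSDP's explicit opt-in flags next to a DeepSpeed ZeRO-3 fragment containing the two hyperparameters named above. The values are illustrative assumptions: `"auto"` marks an entry left for the integration to fill in, and the reuse distance shown is just an example number.

```python
# FSDP: prefetching is requested explicitly through launch flags
# (flag names and example values come from the table above).
fsdp_launch_args = [
    "--fsdp_forward_prefetch", "true",
    "--fsdp_backward_prefetch", "BACKWARD_PRE",
]

# DeepSpeed ZeRO-3: no on/off switch; behaviour follows these hyperparameters.
# Values are illustrative assumptions, not recommendations.
zero3_section = {
    "stage": 3,
    "stage3_param_persistence_threshold": "auto",  # left for the integration to set
    "stage3_max_reuse_distance": 1_000_000_000,    # set explicitly by the user
}

# Entries still marked "auto" are the ones that may be filled in automatically.
auto_keys = sorted(key for key, value in zero3_section.items() if value == "auto")

if __name__ == "__main__":
    print("FSDP flags:", " ".join(fsdp_launch_args))
    print("left to auto:", auto_keys)
```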

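Model Loading

FSDP requires an explicit `--fsdp_cpu_ram_efficient_loading true` to activate efficient model loading, whereas transformers activates a similar feature automatically whenever DeepSpeed ZeRO stage 3 is used.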

Model

FSDP requires an explicit `--fsdp_auto_wrap_policy` so that the algorithm can decide how to schedule the all-gather and reduce-scatter operations. With DeepSpeed, this is transparent to the user.
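Why the wrap policy matters for scheduling can be illustrated with a toy model. The sketch below is a simplified picture and not FSDP's implementation: it groups modules into units the way a transformer-based wrap would (one unit per layer of the chosen class, everything else in a shared root unit) and compares the largest unit, which bounds how many parameters a single all-gather has to materialize. Module names, class names, and parameter counts are made up.

```python
from collections import OrderedDict

# Toy model description: (module path, class name, parameter count).
# All names and sizes are made up for illustration.
MODULES = [
    ("embed", "Embedding", 50_000),
    ("layers.0", "DecoderLayer", 120_000),
    ("layers.1", "DecoderLayer", 120_000),
    ("layers.2", "DecoderLayer", 120_000),
    ("head", "Linear", 50_000),
]


def group_into_units(modules, layer_cls_to_wrap):
    """Put each module of the target class in its own unit; the rest share a root unit."""
    units = OrderedDict()
    for path, cls_name, n_params in modules:
        key = path if cls_name == layer_cls_to_wrap else "<root>"
        units[key] = units.get(key, 0) + n_params
    return units


wrapped = group_into_units(MODULES, "DecoderLayer")
unwrapped = group_into_units(MODULES, None)  # no policy: everything lands in one unit

if __name__ == "__main__":
    # The largest unit bounds how many parameters one all-gather brings together.
    print("largest unit with wrapping:   ", max(wrapped.values()))
    print("largest unit without wrapping:", max(unwrapped.values()))
```

With per-layer units, each all-gather only needs one layer's parameters at a time, which is what lets communication overlap with computation on the neighbouring layer.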

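Parameters Summoning

When using `torch.compile`, FSDP requires the `--fsdp_use_orig_params` flag to be set explicitly; see the PyTorch documentation for details. With DeepSpeed, this is transparent to the user.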

Training

DeepSpeed requires the `--gradient_accumulation_steps` and `--gradient_clipping` flags to be specified explicitly. With FSDP, this is transparent to the user.
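The `auto` example values in the table can be read as placeholders that are filled in from the values supplied at launch. The sketch below illustrates that idea only; `resolve_auto` is a hypothetical helper, not Accelerate's code, and the numbers are arbitrary.

```python
# DeepSpeed config fragment with the two training settings left as "auto".
ds_config = {
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# Values as they might be supplied at launch time (arbitrary example numbers).
launch_values = {
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
}


def resolve_auto(config: dict, values: dict) -> dict:
    """Replace every "auto" entry with the explicitly supplied value (hypothetical helper)."""
    resolved = {}
    for key, value in config.items():
        if value == "auto":
            if key not in values:
                raise KeyError(f"{key} is 'auto' but no value was supplied")
            resolved[key] = values[key]
        else:
            resolved[key] = value
    return resolved


if __name__ == "__main__":
    print(resolve_auto(ds_config, launch_values))
```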

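On Differences in Data Precision Handling

To discuss how data precision is handled in FSDP and DeepSpeed, it helps to first review how these frameworks handle model parameters. Before the model and optimizer parameters are distributed across GPUs, parameter preparation first "flattens" them into one-dimensional `torch.Tensor` objects. FSDP and DeepSpeed differ in the dtype in which these flattened parameters are stored, which in turn affects the dtype in which `torch.optim.Optimizer` allocates its tensors. The table below outlines the process for both frameworks; the "Local" column marks steps that happen per GPU, so any memory overhead from upcasting should be understood as amortized over the number of GPUs used.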

Process	Local	Framework	Details
Loading, i.e., `AutoModel.from_pretrained(..., torch_dtype=torch_dtype)`	-	-	-
Preparation, i.e., creation of "flat params"	✅	FSDP	created in `torch_dtype`
Preparation, i.e., creation of "flat params"	✅	DeepSpeed	disregards `torch_dtype`, created in `float32`
Optimizer initialization	✅	FSDP	creates parameters in `torch_dtype`
Optimizer initialization	✅	DeepSpeed	creates parameters in `float32`
Training Step, i.e., forward, backward, reduction	-	FSDP	follows `MixedPrecision`
Training Step, i.e., forward, backward, reduction	-	DeepSpeed	follows `deepspeed_config_file` mixed precision settings
Optimizer (Pre-Step)	✅	FSDP	upcasting (if any) to `torch_dtype`
Optimizer (Pre-Step)	✅	DeepSpeed	upcasted to `float32`
Optimizer (Actual Step)	✅	FSDP	occurs in `torch_dtype`
Optimizer (Actual Step)	✅	DeepSpeed	occurs in `float32`

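To clarify the table above, consider the concrete examples below; the optimizer pre-step and actual step are combined for brevity. FSDP can operate in either of the two modes shown, whereas DeepSpeed can operate in only one.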

Framework	Model Loading (`torch_dtype`)	Mixed Precision	Preparation (Local)	Training	Optimizer (Local)
FSDP	bf16	default (none)	bf16	bf16	bf16
FSDP	bf16	bf16	fp32	bf16	fp32
DeepSpeed	bf16	bf16	fp32	bf16	fp32
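To make the memory consequence of these modes concrete, the sketch below estimates the per-GPU size of the local flat parameters for each row of the table. It is back-of-the-envelope arithmetic under stated assumptions: full sharding across all GPUs, a hypothetical model of 7 billion parameters on 8 GPUs, and only the flat parameters counted (gradients, optimizer states, activations, and buffers are ignored).

```python
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}

# Rows of the table above for a model loaded in bf16:
# (framework, mixed precision) -> dtype of the local flat params and optimizer.
MODES = {
    ("FSDP", "none"): {"preparation": "bf16", "optimizer": "bf16"},
    ("FSDP", "bf16"): {"preparation": "fp32", "optimizer": "fp32"},
    ("DeepSpeed", "bf16"): {"preparation": "fp32", "optimizer": "fp32"},
}


def local_param_bytes(num_params: int, num_gpus: int, dtype: str) -> float:
    """Bytes held per GPU when flat parameters of `dtype` are sharded evenly."""
    return num_params * BYTES_PER_ELEMENT[dtype] / num_gpus


if __name__ == "__main__":
    num_params, num_gpus = 7_000_000_000, 8  # hypothetical model size and GPU count
    for (framework, mixed_precision), dtypes in MODES.items():
        gib = local_param_bytes(num_params, num_gpus, dtypes["preparation"]) / 2**30
        print(f"{framework:9s} mixed_precision={mixed_precision:4s} "
              f"flat params ~ {gib:.2f} GiB per GPU")
```

The fp32 rows cost twice as much per GPU as the pure bf16 row, but because the step is local, that overhead shrinks as more GPUs share the shards.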
