Using Local SGD with Accelerate

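Local SGD is a distributed training technique in which gradients are not synchronized at every step. Instead, each process updates its own copy of the model weights, and after a given number of steps the copies are synchronized by averaging them across all processes. This improves communication efficiency and can substantially speed up training, especially when the machine lacks a fast interconnect such as NVLink. Unlike gradient accumulation, where improving communication efficiency requires increasing the effective batch size, Local SGD does not require changing the batch size or the learning rate or schedule. If needed, Local SGD can also be combined with gradient accumulation.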
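
To make the idea concrete, here is a minimal, framework-free sketch of the technique. It is not the Accelerate implementation: the function name local_sgd, the scalar "model", the quadratic loss, and the noise level are all assumptions chosen for illustration. Each simulated worker takes noisy gradient steps on its own copy of the weight, and the copies are averaged only every sync_every steps.

```python
import random


def local_sgd(num_workers=4, total_steps=32, sync_every=8, lr=0.1, seed=0):
    """Toy local SGD: every worker minimizes (w - target)^2 with noisy gradients."""
    rng = random.Random(seed)
    target = 3.0
    weights = [0.0] * num_workers  # one scalar "model" per simulated worker
    sync_rounds = 0
    for step in range(1, total_steps + 1):
        for i in range(num_workers):
            # Local update: noisy gradient of (w - target)^2, no communication.
            grad = 2.0 * (weights[i] - target) + rng.gauss(0.0, 0.5)
            weights[i] -= lr * grad
        if step % sync_every == 0:
            # The only communication: replace every copy with the average.
            mean = sum(weights) / num_workers
            weights = [mean] * num_workers
            sync_rounds += 1
    return weights, sync_rounds


weights, sync_rounds = local_sgd()
print(weights, sync_rounds)
```

With these assumed settings the workers communicate 4 times over 32 steps, whereas synchronizing gradients at every step would need 32 rounds.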

In this tutorial you will see how to quickly set up Local SGD with Accelerate. Compared with a standard Accelerate setup, it takes only two extra lines of code.

This example uses a very simple PyTorch training loop that accumulates gradients over two batches before each optimizer step:

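```python
# model, optimizer, scheduler, loss_function and training_dataloader
# are assumed to be defined earlier.
device = "cuda"
model.to(device)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    # Scale the loss so the accumulated gradient matches one larger batch.
    loss = loss / gradient_accumulation_steps
    loss.backward()
    # Update the weights only once every gradient_accumulation_steps batches.
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```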
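
The division by gradient_accumulation_steps in the loop above is what makes two accumulated micro-batches behave like one larger batch. The following sketch checks this by hand, without PyTorch, for a one-parameter linear model with a mean squared error loss. The helper mse_grad and the data values are assumptions made up for this illustration, and the equality holds as written only because both micro-batches have the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient computed on the full batch of four samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient obtained by accumulating two micro-batches of two samples.
gradient_accumulation_steps = 2
accumulated_grad = 0.0
for k in range(gradient_accumulation_steps):
    micro_xs = xs[2 * k : 2 * k + 2]
    micro_ys = ys[2 * k : 2 * k + 2]
    # Mirrors `loss = loss / gradient_accumulation_steps` followed by backward().
    accumulated_grad += mse_grad(w, micro_xs, micro_ys) / gradient_accumulation_steps

print(full_batch_grad, accumulated_grad)
```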

Converting it to Accelerate

First, convert the code shown above to use Accelerate, without the LocalSGD or gradient accumulation helpers:

```diff
+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for index, batch in enumerate(training_dataloader):
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
      loss = loss / gradient_accumulation_steps
-     loss.backward()
+     accelerator.backward(loss)
      if (index + 1) % gradient_accumulation_steps == 0:
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()
```

Letting Accelerate handle model synchronization

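All that is left is to let Accelerate handle model parameter synchronization and gradient accumulation. For simplicity, assume the parameters should be synchronized every 8 steps. This is done by adding one with LocalSGD statement and one call to local_sgd.step() after every optimizer step. Note that accelerator.accumulate relies on the gradient_accumulation_steps value passed when the Accelerator is created: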

```diff
+from accelerate.local_sgd import LocalSGD

+local_sgd_steps=8

+with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=local_sgd_steps, enabled=True) as local_sgd:
    for batch in training_dataloader:
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
+           local_sgd.step()
```

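Under the hood, the Local SGD code disables automatic gradient synchronization (but gradient accumulation still works as expected!). Instead, it averages the model parameters every local_sgd_steps steps, and once more at the end of the training loop.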
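
The step counting described here can be sketched with a small stand-in class. This is a toy under stated assumptions, not the library's code: ToyLocalSGD is a hypothetical name, it holds all replicas in one process as plain lists, and its averaging replaces the cross-process reduction that a real distributed run would perform. It only shows when synchronization happens: every local_sgd_steps calls to step(), plus one final time when the with block exits.

```python
class ToyLocalSGD:
    """Hypothetical stand-in that mimics the sync schedule of a local SGD helper."""

    def __init__(self, replicas, local_sgd_steps, enabled=True):
        self.replicas = replicas  # list of per-process weight lists (simulated)
        self.local_sgd_steps = local_sgd_steps
        self.enabled = enabled
        self.num_steps = 0
        self.sync_count = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if self.enabled:
            self._sync_and_average()  # final sync at the end of the loop
        return False

    def step(self):
        self.num_steps += 1
        if self.enabled and self.num_steps % self.local_sgd_steps == 0:
            self._sync_and_average()

    def _sync_and_average(self):
        # Average each parameter position across replicas, then copy it back.
        n = len(self.replicas)
        mean = [sum(values) / n for values in zip(*self.replicas)]
        for replica in self.replicas:
            replica[:] = mean
        self.sync_count += 1


replicas = [[0.0, 0.0], [0.0, 0.0]]
with ToyLocalSGD(replicas, local_sgd_steps=8) as local_sgd:
    for _ in range(20):
        # Stand-in for optimizer.step(): the two replicas drift apart.
        replicas[0][0] += 1.0
        replicas[1][0] += 3.0
        local_sgd.step()

print(replicas, local_sgd.sync_count)
```

In this run the replicas are averaged after steps 8 and 16 and once more on exit, so 20 steps cost three synchronizations in total.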

Limitations

The current implementation supports only basic multi-GPU (or multi-CPU) training and does not work with, for example, DeepSpeed.

References

Although we are not aware of the true origins of this simple approach, the idea of Local SGD is quite old and goes back at least to:

Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). [Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365.](https://arxiv.org/abs/1606.07365)

We credit the term Local SGD to the following paper (although there may be earlier references we are not aware of).

Stich, S. U. (2019). [Local SGD Converges Fast and Communicates Little. ICLR 2019, International Conference on Learning Representations.](https://arxiv.org/abs/1805.09767)
