Performing gradient accumulation with Accelerate
Gradient accumulation is a technique that lets you train on bigger batch sizes than your machine would normally be able to fit into memory. This is done by accumulating gradients over several batches and only stepping the optimizer after a certain number of batches have been performed.
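As a quick illustration (the numbers below are made up and are not part of the example that follows), the batch size the optimizer effectively sees is the per-device batch size multiplied by the number of accumulation steps, and by the number of processes in a distributed setup:

# Illustrative numbers only: the batch size the optimizer effectively "sees"
# grows with the number of accumulation steps (and processes).
per_device_batch_size = 16
gradient_accumulation_steps = 2
num_processes = 1  # > 1 in a distributed setup

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_processes
print(effective_batch_size)  # 32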
While technically standard gradient accumulation code would work fine in a distributed setup, it is not the most efficient way to do it and you may experience considerable slowdowns!
In this tutorial you will see how to quickly set up gradient accumulation and perform it with the utilities provided in Accelerate, which in total adds just one new line of code!
This example will use a very simplistic PyTorch training loop that performs gradient accumulation every two batches:
device = "cuda"
model.to(device)
gradient_accumulation_steps = 2
for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
    loss.backward()
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
Converting it to Accelerate
First, the code shown earlier will be converted to use Accelerate without the special gradient accumulation helper:
+ from accelerate import Accelerator
+ accelerator = Accelerator()
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for index, batch in enumerate(training_dataloader):
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
      loss = loss / gradient_accumulation_steps
+     accelerator.backward(loss)
      if (index+1) % gradient_accumulation_steps == 0:
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()
Letting Accelerate handle gradient accumulation
All that is left now is to let Accelerate handle the gradient accumulation for us. To do so, you should pass a gradient_accumulation_steps parameter when creating the [Accelerator], dictating the number of steps to perform before each call to step() and how the loss should be automatically adjusted during the call to [~Accelerator.backward]:
from accelerate import Accelerator
- accelerator = Accelerator()
+ accelerator = Accelerator(gradient_accumulation_steps=2)
Alternatively, you can pass a gradient_accumulation_plugin parameter to the [Accelerator] object's __init__, which lets you further customize the gradient accumulation behavior. Read more about it in the GradientAccumulationPlugin docs.
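For example, a minimal sketch of the plugin route (only num_steps is shown here; refer to the GradientAccumulationPlugin docs for the remaining options):

from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

# num_steps plays the same role as gradient_accumulation_steps above
plugin = GradientAccumulationPlugin(num_steps=2)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)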
From here you can use the [~Accelerator.accumulate] context manager inside your training loop to automatically perform the gradient accumulation for you! You just wrap it around the entire training part of the code:
- for index, batch in enumerate(training_dataloader):
+ for batch in training_dataloader:
+     with accelerator.accumulate(model):
          inputs, targets = batch
          outputs = model(inputs)
You can remove all the special checks for the step number and the loss adjustment:
- loss = loss / gradient_accumulation_steps
accelerator.backward(loss)
- if (index+1) % gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step()
optimizer.zero_grad()
As you can see, the [Accelerator] keeps track of the batch number you are on, so it automatically knows whether to step through the prepared optimizer and how to adjust the loss.
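If you need to know inside the loop whether the current batch is the one that actually performs the update (for example, to clip gradients or log only on real optimizer steps), you can check accelerator.sync_gradients. Below is a small sketch built on the loop above; the max_norm value is just an illustrative choice:

for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        # True only on the step where the accumulated gradients are applied
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()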
The finished code
Below is the finished implementation for performing gradient accumulation with Accelerate.
from accelerate import Accelerator
accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
To learn more about what magic this wraps around, read the Gradient Synchronization concept guide.
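Roughly speaking, in a multi-GPU DistributedDataParallel setup every backward pass triggers a gradient all-reduce across processes, and the accumulate context manager skips that communication on accumulation-only steps. A hand-rolled sketch of the idea (assuming model is wrapped in DDP; this is only an approximation of the behaviour, not Accelerate's actual implementation):

import contextlib

for index, batch in enumerate(training_dataloader):
    is_update_step = (index + 1) % gradient_accumulation_steps == 0
    # Skip the cross-process gradient sync on accumulation-only steps
    sync_context = contextlib.nullcontext() if is_update_step else model.no_sync()
    with sync_context:
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets) / gradient_accumulation_steps
        loss.backward()
    if is_update_step:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()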
Self-contained example
Here is a self-contained example that you can run to see gradient accumulation in action with Accelerate:
import torch
import copy
from accelerate import Accelerator
from accelerate.utils import set_seed
from torch.utils.data import TensorDataset, DataLoader
# seed
set_seed(0)
# define toy inputs and labels
x = torch.tensor([1., 2., 3., 4., 5., 6., 7., 8.])
y = torch.tensor([2., 4., 6., 8., 10., 12., 14., 16.])
gradient_accumulation_steps = 4
batch_size = len(x) // gradient_accumulation_steps
# define dataset and dataloader
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, batch_size=batch_size)
# define model, optimizer and loss function
class SimpleLinearModel(torch.nn.Module):
    def __init__(self):
        super(SimpleLinearModel, self).__init__()
        self.weight = torch.nn.Parameter(torch.zeros((1, 1)))

    def forward(self, inputs):
        return inputs @ self.weight
model = SimpleLinearModel()
model_clone = copy.deepcopy(model)
criterion = torch.nn.MSELoss()
model_optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
model, model_optimizer, dataloader = accelerator.prepare(model, model_optimizer, dataloader)
model_clone_optimizer = torch.optim.SGD(model_clone.parameters(), lr=0.02)
print(f"initial model weight is {model.weight.mean().item():.5f}")
print(f"initial model weight is {model_clone.weight.mean().item():.5f}")
for i, (inputs, labels) in enumerate(dataloader):
    with accelerator.accumulate(model):
        inputs = inputs.view(-1, 1)
        print(i, inputs.flatten())
        labels = labels.view(-1, 1)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        accelerator.backward(loss)
        model_optimizer.step()
        model_optimizer.zero_grad()
loss = criterion(x.view(-1, 1) @ model_clone.weight, y.view(-1, 1))
model_clone_optimizer.zero_grad()
loss.backward()
model_clone_optimizer.step()
print(f"w/ accumulation, the final model weight is {model.weight.mean().item():.5f}")
print(f"w/o accumulation, the final model weight is {model_clone.weight.mean().item():.5f}")
initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.])
1 tensor([3., 4.])
2 tensor([5., 6.])
3 tensor([7., 8.])
w/ accumulation, the final model weight is 2.04000
w/o accumulation, the final model weight is 2.04000
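Both runs end at the same weight because averaging the four micro-batch gradients equals the full-batch gradient in this toy problem. A quick hand-check of the printed value (not part of the example above):

# With MSE and targets y = 2 * x, the gradient w.r.t. the single weight w
# is 2 * (w - 2) * mean(x ** 2); at w = 0 with lr = 0.02 one step gives 2.04.
xs = [1., 2., 3., 4., 5., 6., 7., 8.]
mean_x2 = sum(v * v for v in xs) / len(xs)  # 25.5
grad = 2 * (0.0 - 2.0) * mean_x2            # -102.0
print(0.0 - 0.02 * grad)                    # 2.04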