Multi-Node Multi-GPU Training
The following section provides Python training scripts and the corresponding job submission scripts for fully distributed multi-node multi-GPU training.
Some organizations use SLURM to manage resources. SLURM is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. If you are using SLURM, you can use the Trainer by submitting a batch script with sbatch; the sbatch scripts for each setup are provided further below. You can also use torchrun to run the training process, which provides a superset of the functionality of torch.distributed.launch. In that case you need to specify the torchrun arguments that set up the distributed environment. The torchrun sbatch scripts for each setup are also included below.
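For illustration, a minimal single-node torchrun launch of the training script defined below might look as follows (a sketch, assuming two local GPUs):
torchrun --nnodes 1 --nproc_per_node 2 train.py --compute_setup auto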
Example Training Script (train.py):
We are going to use the following training script for the examples below. Python Script:
import torch
import argparse
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from perceptrain import TrainConfig, Trainer
from perceptrain.optimize_step import optimize_step
Trainer.set_use_grad(True)
# A __main__ guard is recommended because worker processes may re-import this module.
if __name__ == "__main__":
    # Simple dataset for y = sin(2πx)
    x = torch.linspace(0, 1, 100).reshape(-1, 1)
    y = torch.sin(2 * torch.pi * x)
    dataloader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

    # Small model with one hidden layer of 16 units and ReLU activation to fit y = sin(2πx)
    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    # SGD optimizer with 0.01 learning rate
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Command-line arguments for the distributed setup
    parser = argparse.ArgumentParser()
    parser.add_argument("--nprocs", type=int,
                        default=1, help="Number of processes (GPUs) to use.")
    parser.add_argument("--compute_setup", type=str,
                        default="auto", choices=["cpu", "gpu", "auto"], help="Computational setup.")
    parser.add_argument("--backend", type=str,
                        default="nccl", choices=["nccl", "gloo", "mpi"], help="Distributed backend.")
    args = parser.parse_args()

    # TrainConfig for the Trainer
    train_config = TrainConfig(
        backend=args.backend,
        nprocs=args.nprocs,
        compute_setup=args.compute_setup,
        print_every=5,
        max_iter=50
    )

    trainer = Trainer(model, optimizer, train_config, loss_fn="mse", optimize_step=optimize_step)
    trainer.fit(dataloader)
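To sanity-check the script on a single machine before submitting a cluster job, it can also be launched directly; the values below are illustrative (e.g. two local GPUs):
python3 train.py --nprocs 2 --compute_setup gpu --backend nccl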
Multi-Node Multi-GPU:
For high performance, use multiple GPUs across multiple nodes. The examples below assume two nodes with two GPUs each; these numbers can be adjusted to your needs.
For multi-node training, it is recommended to submit an sbatch script.
SLURM
- We should have one task per node, i.e. ntasks should be equal to the number of nodes (2 in this example).
- nprocs should be equal to the total number of GPUs (the world size), which in this case is 4.
#!/bin/bash
#SBATCH --job-name=multi_node
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G
srun python3 train.py --backend nccl --nprocs 4
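Assuming the script above is saved as multi_node.sh (an illustrative filename), submit it with sbatch; SLURM allocates the two nodes and srun starts one task per node:
sbatch multi_node.sh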
TORCHRUN
Torchrun takes care of setting nprocs based on the cluster setup. We only need to specify the compute_setup, which can be either auto or gpu.
- nnodes for torchrun should be the number of nodes.
- nproc_per_node should be equal to the number of GPUs per node.
Note: We use the first node of the allocated resources on the cluster as the head node. However, any other node can also be chosen.
#!/bin/bash
#SBATCH --job-name=multi_node
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')

export LOGLEVEL=INFO

srun torchrun \
  --nnodes 2 \
  --nproc_per_node 2 \
  --rdzv_id $RANDOM \
  --rdzv_backend c10d \
  --rdzv_endpoint $head_node_ip:29501 \
  train.py --compute_setup auto
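If you are not running under SLURM, the same rendezvous-based launch can be started by hand: run the identical torchrun command on every node, replacing <head_node_ip> with the address of whichever node you pick as the head (a sketch based on the arguments above):
torchrun \
  --nnodes 2 \
  --nproc_per_node 2 \
  --rdzv_id 1234 \
  --rdzv_backend c10d \
  --rdzv_endpoint <head_node_ip>:29501 \
  train.py --compute_setup auto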