Multi Node Multi GPU Training
The following section provides Python training scripts and the corresponding job submission scripts for fully distributed multi-node, multi-GPU training.
Some organizations use SLURM to manage resources. SLURM is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. If you are using SLURM, you can use the `Trainer` by submitting a batch script with sbatch. Further below, we provide the sbatch scripts for each setup.

You can also use `torchrun` to run the training process, which provides a superset of the functionality of `torch.distributed.launch`. Here you need to specify the torchrun arguments to set up the distributed training. We also include the `torchrun` sbatch scripts for each setup below.
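For orientation, the two launch styles look as follows. This is a minimal sketch, not a definitive recipe: `job.sh` is an illustrative batch-script name, `train.py` is the training script shown in the next section, and the argument values assume a single node with two GPUs.

```bash
# Option 1: the perceptrain Trainer spawns the processes itself;
# we only submit a batch script that runs train.py through srun.
sbatch job.sh        # job.sh wraps e.g.: srun python3 train.py --nprocs 2

# Option 2: torchrun spawns one process per GPU and handles the rendezvous.
torchrun --nnodes 1 --nproc_per_node 2 train.py --compute_setup auto
```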
Example Training Script (`train.py`):
We are going to use the following training script for all the examples below.
```python
import argparse

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

from perceptrain import TrainConfig, Trainer
from perceptrain.optimize_step import optimize_step

Trainer.set_use_grad(True)

# Guarding the entry point with __main__ is recommended, since each spawned
# process re-imports this module.
if __name__ == "__main__":
    # Simple dataset for y = sin(2πx)
    x = torch.linspace(0, 1, 100).reshape(-1, 1)
    y = torch.sin(2 * torch.pi * x)
    dataloader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

    # Small model with one hidden layer and ReLU activation to fit y = sin(2πx)
    model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    # SGD optimizer with a 0.01 learning rate
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # TrainConfig: the distributed-training options come from the command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--nprocs", type=int, default=1,
                        help="Number of processes (GPUs) to use.")
    parser.add_argument("--compute_setup", type=str, default="auto",
                        choices=["cpu", "gpu", "auto"], help="Computational setup.")
    parser.add_argument("--backend", type=str, default="nccl",
                        choices=["nccl", "gloo", "mpi"], help="Distributed backend.")
    args = parser.parse_args()

    train_config = TrainConfig(
        backend=args.backend,
        nprocs=args.nprocs,
        compute_setup=args.compute_setup,
        print_every=5,
        max_iter=50,
    )

    trainer = Trainer(model, optimizer, train_config, loss_fn="mse", optimize_step=optimize_step)
    trainer.fit(dataloader)
```
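Before scaling out, the script can be sanity-checked on a single machine. A minimal sketch, assuming two local GPUs:

```bash
# Single-node run with two GPU processes (adjust --nprocs to your machine).
python3 train.py --nprocs 2 --compute_setup auto --backend nccl
```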
Multi-Node Multi-GPU:
This setup targets high performance using multiple GPUs across multiple nodes. We assume two nodes with two GPUs each; these numbers can be adjusted to your needs. For multi-node training, we suggest submitting an sbatch script.
SLURM
- We should have one task per node, i.e. `ntasks` is equal to the number of nodes.
- `nprocs` should be equal to the total number of GPUs (the world size), which in this case is 4.
```bash
#!/bin/bash
#SBATCH --job-name=multi_node
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G

srun python3 train.py --backend nccl --nprocs 4
```
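Submit the script with sbatch (the file name below is illustrative). If you prefer not to hard-code the world size, it can be derived from the SLURM environment; a sketch, assuming `SLURM_GPUS_PER_TASK` is exported because `--gpus-per-task` is set:

```bash
sbatch multi_node_slurm.sh

# Alternative last line of the batch script: compute the world size
# (2 nodes x 2 GPUs per task = 4) instead of hard-coding --nprocs 4.
NPROCS=$(( SLURM_JOB_NUM_NODES * SLURM_GPUS_PER_TASK ))
srun python3 train.py --backend nccl --nprocs "$NPROCS"
```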
TORCHRUN
torchrun takes care of setting `nprocs` based on the cluster setup, so we only need to specify `compute_setup`, which can be either auto or gpu.
- `nnodes` for torchrun should be the number of nodes.
- `nproc_per_node` should be equal to the number of GPUs per node (see the sketch below).
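For reference, these arguments translate into the following bare torchrun invocation, run once on each node outside of a batch script. This is a sketch only; `HEAD_NODE_IP` is a placeholder for the address of the node acting as rendezvous host.

```bash
# Run on each of the two nodes: torchrun starts 2 processes per node, and the
# c10d rendezvous at HEAD_NODE_IP:29501 joins them into a single 4-process job.
torchrun \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_backend c10d \
    --rdzv_endpoint $HEAD_NODE_IP:29501 \
    train.py --compute_setup auto
```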
Note: We use the first node of the allocated resources on the cluster as the head node. However, any other node can also be chosen.
```bash
#!/bin/bash
#SBATCH --job-name=multi_node
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')

export LOGLEVEL=INFO

srun torchrun \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29501 \
    train.py --compute_setup auto
```
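As with the SLURM setup, submit the script with sbatch (the file name is illustrative) and monitor the job; all four ranks log to the job's output file:

```bash
sbatch multi_node_torchrun.sh
squeue -u $USER                # check that the job is running on 2 nodes
tail -f slurm-<jobid>.out      # SLURM's default output file name
```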