# Training on GPU with Trainer
This guide explains how to train models on GPU using `Trainer` from `perceptrain`, covering single-GPU, multi-GPU (single node), and multi-node multi-GPU setups.
## Understanding the Arguments
- `nprocs`: Number of processes to run. To enable multi-processing and launch separate processes, set `nprocs > 1`.
- `compute_setup`: The computational setup used for training. Options include `cpu`, `gpu`, and `auto`.
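As a minimal sketch of how these two arguments are set (all other `TrainConfig` options are left at their defaults):

```python
from perceptrain import TrainConfig

# Run training in a single process; "auto" picks a GPU when one is
# available and falls back to CPU otherwise.
config = TrainConfig(compute_setup="auto", nprocs=1)

# Setting nprocs > 1 instead would launch separate processes for
# distributed training.
```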
For more details on the advanced training options, please refer to the `TrainConfig` documentation.
## Configuring `TrainConfig` for GPU Training

By adjusting `TrainConfig`, you can switch between single-GPU and multi-GPU training setups. Below are the key settings for each configuration:
**Single-GPU Training Configuration:**

- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Optimized backend for GPU training.
- `nprocs=1`: Uses one GPU.
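A sketch of this configuration, together with an assumed usage pattern (the `Trainer(...)` call and `fit()` method below follow a typical trainer interface, and `model`, `optimizer`, `loss_fn`, and `train_loader` are placeholders; consult the `perceptrain` API reference for the exact signature):

```python
from perceptrain import TrainConfig, Trainer

single_gpu_config = TrainConfig(
    compute_setup="gpu",  # or "auto" to allow CPU fallback
    backend="nccl",       # GPU-optimized communication backend
    nprocs=1,             # one process -> one GPU
)

# Assumed usage pattern: `model`, `optimizer`, `loss_fn`, and `train_loader`
# are placeholders you would define for your own problem.
trainer = Trainer(model=model, optimizer=optimizer, config=single_gpu_config, loss_fn=loss_fn)
trainer.fit(train_loader)
```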
**Multi-GPU (Single Node) Training Configuration:**

- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Multi-GPU optimized backend.
- `nprocs=2`: Utilizes 2 GPUs on a single node.
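The corresponding sketch differs only in the number of processes:

```python
from perceptrain import TrainConfig

multi_gpu_config = TrainConfig(
    compute_setup="gpu",
    backend="nccl",  # multi-GPU optimized backend
    nprocs=2,        # two processes -> two GPUs on a single node
)
```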
**Multi-Node Multi-GPU Training Configuration:**

- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Required for multi-node setups.
- `nprocs=4`: Uses 4 GPUs across nodes.
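And a sketch for the multi-node case; how the four processes are placed on the nodes is decided by the launch method (SLURM or `torchrun`, see the examples below):

```python
from perceptrain import TrainConfig

multi_node_config = TrainConfig(
    compute_setup="gpu",
    backend="nccl",  # required for multi-node communication
    nprocs=4,        # four processes / GPUs spread across the nodes
)
```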
## Examples
The next sections provide Python training scripts and the corresponding launch scripts for each setup.
Some organizations use SLURM to manage resources. SLURM is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. If you are using SLURM, you can run the `Trainer` by submitting a batch script with `sbatch`.

You can also use `torchrun` to launch the training process, which provides a superset of the functionality of `torch.distributed.launch`. Here you need to specify the `torchrun` arguments to set up the distributed training. We also include the `torchrun` sbatch scripts for each setup below.
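As an illustration, a multi-node `sbatch` script that launches the training with `torchrun` could look roughly like this (the script name `train.py`, the job name, and the resource numbers are placeholders; adapt the SLURM directives to your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-training      # placeholder job name
#SBATCH --nodes=2                    # two nodes with 2 GPUs each -> 4 GPUs total
#SBATCH --ntasks-per-node=1          # one torchrun launcher per node
#SBATCH --gres=gpu:2                 # request 2 GPUs per node

# Use the first allocated node as the rendezvous endpoint for torchrun.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}

srun torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    train.py  # placeholder training script that builds the Trainer and calls fit
```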