Training on GPU with Trainer
This guide explains how to train models on GPU using Trainer from perceptrain, covering single-GPU, multi-GPU (single node), and multi-node multi-GPU setups.
Understanding Arguments
- `nprocs`: Number of processes to run. To enable multi-processing and launch separate processes, set `nprocs` > 1.
- `compute_setup`: The computational setup used for training. Options include `cpu`, `gpu`, and `auto`.
For more details on these and other advanced training options, please refer to the TrainConfig documentation.
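As a rough illustration of how an `auto` setting is typically resolved, here is a hypothetical helper (not part of perceptrain's API): with `auto`, training runs on the GPU when one is visible and falls back to the CPU otherwise.

```python
def resolve_compute_setup(compute_setup: str, cuda_available: bool) -> str:
    """Hypothetical sketch: map a compute_setup option to a concrete device.

    "auto" picks the GPU when one is available, otherwise the CPU;
    "cpu" and "gpu" are taken literally.
    """
    if compute_setup == "auto":
        return "gpu" if cuda_available else "cpu"
    if compute_setup in ("cpu", "gpu"):
        return compute_setup
    raise ValueError(f"unknown compute_setup: {compute_setup!r}")

print(resolve_compute_setup("auto", cuda_available=False))  # falls back to "cpu"
```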
Configuring TrainConfig for GPU Training
By adjusting TrainConfig, you can switch between single and multi-GPU training setups. Below are the key settings for each configuration:
Single-GPU Training Configuration:
- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Optimized backend for GPU training.
- `nprocs=1`: Uses one GPU.
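A minimal configuration sketch for this case might look as follows. The import path and keyword-argument names (`compute_setup`, `backend`, `nprocs`) are assumptions based on the options described above; check the TrainConfig documentation for the exact signature.

```python
from perceptrain import TrainConfig  # import path assumed; adjust to your install

# Single-GPU training: one process, NCCL backend
config = TrainConfig(
    compute_setup="gpu",   # or "auto" to fall back to CPU when no GPU is visible
    backend="nccl",        # optimized backend for GPU training
    nprocs=1,              # one process, one GPU
)
```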
Multi-GPU (Single Node) Training Configuration:
- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Multi-GPU optimized backend.
- `nprocs=2`: Utilizes 2 GPUs on a single node.
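With `nprocs=2`, one process is launched per GPU, and each process rank is bound to the matching local device. A hypothetical sketch of that rank-to-device mapping (illustrative only, not perceptrain code):

```python
def assign_local_devices(nprocs: int) -> dict[int, str]:
    """Hypothetical sketch: on a single node, process rank r uses GPU r."""
    return {rank: f"cuda:{rank}" for rank in range(nprocs)}

print(assign_local_devices(2))  # {0: 'cuda:0', 1: 'cuda:1'}
```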
Multi-Node Multi-GPU Training Configuration:
- `compute_setup`: Selected training setup (`gpu` or `auto`).
- `backend="nccl"`: Required for multi-node setups.
- `nprocs=4`: Uses 4 GPUs across nodes.
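In multi-node runs, each process has a local rank on its node and a global rank across the whole job. The standard bookkeeping can be sketched as follows (illustrative only; the trainer and the launcher handle this for you):

```python
def global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
    """Global rank of a process: node offset plus local GPU index."""
    return node_rank * gpus_per_node + local_rank

def world_size(nnodes: int, gpus_per_node: int) -> int:
    """Total number of processes across all nodes."""
    return nnodes * gpus_per_node

# Example: 2 nodes x 2 GPUs -> world size 4; GPU 1 on node 1 has global rank 3
print(world_size(2, 2), global_rank(1, 2, 1))  # 4 3
```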
Examples
The following sections provide Python scripts and the corresponding launch scripts for each setup.
Some organizations use SLURM to manage compute resources. SLURM is an open-source, fault-tolerant, and highly scalable cluster-management and job-scheduling system for Linux clusters of all sizes. If you are using SLURM, you can run the Trainer by submitting a batch script with `sbatch`.

You can also use `torchrun` to launch the training processes; it provides a superset of the functionality of `torch.distributed.launch`. In that case, you need to specify the `torchrun` arguments to set up the distributed training environment. We also include `torchrun` sbatch scripts for each setup below.
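As an illustration, a `torchrun` sbatch script for the multi-node setup (2 nodes, 2 GPUs each) could look like the following. Partition names, module loads, and the script name `train.py` are placeholders to adapt to your cluster.

```shell
#!/bin/bash
#SBATCH --job-name=train-gpu
#SBATCH --nodes=2                 # number of nodes
#SBATCH --ntasks-per-node=1       # one torchrun launcher per node
#SBATCH --gres=gpu:2              # 2 GPUs per node

# Rendezvous endpoint: first node in the allocation
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${MASTER_ADDR}:29500" \
    train.py
```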