Data and configurations
TrainConfig(max_iter=10000, print_every=0, write_every=0, checkpoint_every=0, plot_every=0, live_plot_every=0, callbacks=[], log_model=False, root_folder=Path('./qml_logs'), create_subfolder_per_run=False, log_folder=Path('./'), checkpoint_best_only=False, val_every=0, val_epsilon=1e-05, validation_criterion=None, trainstop_criterion=None, batch_size=1, verbose=True, tracking_tool=ExperimentTrackingTool.TENSORBOARD, hyperparams={}, plotting_functions=(), _subfolders=[], nprocs=1, compute_setup='cpu', backend='gloo', log_setup='cpu', dtype=None, all_reduce_metrics=False)
dataclass
Default configuration for the training process.
This class provides default settings for various aspects of the training loop,
such as logging, checkpointing, and validation. The default values for these
fields can be customized when an instance of TrainConfig
is created.
Example:
TrainConfig(max_iter=10000, print_every=0, write_every=0, checkpoint_every=0, plot_every=0, live_plot_every=0, callbacks=[], log_model=False, root_folder='/tmp/train', create_subfolder_per_run=False, log_folder=PosixPath('.'), checkpoint_best_only=False, val_every=0, val_epsilon=1e-05, validation_criterion=None, trainstop_criterion=None, batch_size=1, verbose=True, tracking_tool=<ExperimentTrackingTool.TENSORBOARD: 'tensorboard'>, hyperparams={}, plotting_functions=(), _subfolders=[], nprocs=1, compute_setup='cpu', backend='gloo', log_setup='cpu', dtype=None, all_reduce_metrics=False)
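A minimal usage sketch (the field values are illustrative, and TrainConfig is assumed importable from the package top level):

from pathlib import Path
from perceptrain import TrainConfig

config = TrainConfig(
    max_iter=5000,                   # train for 5000 epochs
    print_every=100,                 # print loss/metrics every 100 epochs
    checkpoint_every=500,            # save checkpoints every 500 epochs
    root_folder=Path("./qml_logs"),  # store artifacts under the default root
)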
all_reduce_metrics = False
class-attribute
instance-attribute
Whether to aggregate metrics (e.g., loss, accuracy) across processes.
When True, metrics from different training processes are averaged to provide consolidated metrics. Note: since aggregation requires a synchronization (all_reduce) operation, it can increase the computation time significantly.
backend = 'gloo'
class-attribute
instance-attribute
Backend used for distributed training communication.
The default is "gloo". Other options include "nccl", which is optimized for GPU-based training, or "mpi", depending on your system and requirements. It should be one of the backends supported by torch.distributed; for further details, see the torch backends documentation.
batch_size = 1
class-attribute
instance-attribute
The batch size to use when processing a list or tuple of torch.Tensors.
This specifies how many samples are processed in each training iteration.
callbacks = field(default_factory=lambda: list())
class-attribute
instance-attribute
List of callbacks to execute during training.
Callbacks can be used for custom behaviors, such as early stopping, custom logging, or other actions triggered at specific events.
checkpoint_best_only = False
class-attribute
instance-attribute
If True, checkpoints are only saved if there is an improvement in the validation metric. This conserves storage by only keeping the best models.
validation_criterion is required when this is set to True.
checkpoint_every = 0
class-attribute
instance-attribute
Frequency (in epochs) for saving model and optimizer checkpoints during training.
Set to 0 to disable checkpointing. This helps in resuming training or recovering models. Note that setting checkpoint_best_only=True will disable this, and only the best checkpoints will be saved.
compute_setup = 'cpu'
class-attribute
instance-attribute
Compute device setup; options are "auto", "gpu", or "cpu".
- "auto": Automatically uses GPU if available; otherwise, falls back to CPU.
- "gpu": Forces GPU usage, raising an error if no CUDA device is available.
- "cpu": Forces the use of CPU regardless of GPU availability.
create_subfolder_per_run = False
class-attribute
instance-attribute
Whether to create a subfolder for each run, named <id>_<timestamp>_<PID>.
This ensures logs and checkpoints from different runs do not overwrite each other, which is helpful for rapid prototyping. If False, training will resume from the latest checkpoint if one exists in the specified log folder.
dtype = None
class-attribute
instance-attribute
Data type (precision) for computations.
Both the model parameters and the dataset will be of the provided precision.
If not specified or None, the default torch precision (usually torch.float32) is used. If the provided dtype is torch.complex128, model parameters will be torch.complex128 and data parameters will be torch.float64.
hyperparams = field(default_factory=dict)
class-attribute
instance-attribute
A dictionary of hyperparameters to be tracked.
This can include learning rates, regularization parameters, or any other training-related configurations.
live_plot_every = 0
class-attribute
instance-attribute
Frequency for live plotting all the metrics in a single dynamic subplot.
Set to 0 to disable.
For more personalized behaviour, such as showing only a subset of the metrics or arranging them over different subplots, leave this parameter at 0, define a LivePlotMetrics callback, and pass it to callbacks.
log_folder = Path('./')
class-attribute
instance-attribute
The log folder for saving checkpoints and tensorboard logs.
This stores the path where all logs and checkpoints are saved for this training session. log_folder takes precedence over root_folder, but it is ignored if create_subfolder_per_run=True (in which case subfolders will be spawned in the root folder).
log_model = False
class-attribute
instance-attribute
Whether to log a serialized version of the model.
When set to True, the model's state will be logged, useful for model versioning and reproducibility.
log_setup = 'cpu'
class-attribute
instance-attribute
Logging device setup; options are "auto" or "cpu".
- "auto": Uses the same device for logging as for computation.
- "cpu": Forces logging to occur on the CPU. This can be useful to avoid potential conflicts with GPU processes.
max_iter = 10000
class-attribute
instance-attribute
Number of training iterations (epochs) to perform.
This defines the total number of times the model will be updated.
In the case of an InfiniteTensorDataset, each epoch will have 1 batch; in the case of a TensorDataset, each epoch will have len(dataloader) batches.
nprocs = 1
class-attribute
instance-attribute
The number of processes to use for training when spawning subprocesses.
For effective parallel processing, set this to a value greater than 1. If nprocs > 1, the training framework will spawn multiple processes (e.g., for distributed or parallel training).
- For a CPU setup, this launches truly parallel processes.
- For a GPU setup, this launches a distributed training routine using the DistributedDataParallel framework from PyTorch.
- In Multi-GPU or Multi-Node-Multi-GPU setups, nprocs should equal the total number of GPUs across all nodes (the world size), or the total number of GPUs to be used.
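As a sketch, a single-node run on two GPUs could be configured as follows (illustrative values; TrainConfig is assumed importable from the package top level):

from perceptrain import TrainConfig

config = TrainConfig(
    max_iter=10000,
    nprocs=2,                 # one process per GPU (world size = 2)
    compute_setup="gpu",      # force GPU usage
    backend="nccl",           # NCCL is optimized for GPU communication
    all_reduce_metrics=True,  # average metrics across the two processes
)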
plot_every = 0
class-attribute
instance-attribute
Frequency (in epochs) for generating and saving figures during training.
Set to 0 to disable plotting.
plotting_functions = field(default_factory=tuple)
class-attribute
instance-attribute
Functions used for in-training plotting.
These are called to generate plots that are logged or saved at specified intervals.
print_every = 0
class-attribute
instance-attribute
Frequency (in epochs) for printing loss and metrics to the console during training.
Set to 0 to disable this output, meaning that metrics and loss will not be printed during training.
root_folder = Path('./qml_logs')
class-attribute
instance-attribute
The root folder for saving checkpoints and tensorboard logs.
The default path is "./qml_logs".
This can be set to a specific directory where training artifacts are to be stored. Checkpoints will be saved inside a subfolder in this directory; subfolders are created based on the create_subfolder_per_run argument.
tracking_tool = ExperimentTrackingTool.TENSORBOARD
class-attribute
instance-attribute
The tool used for tracking training progress and logging metrics.
Options include tools like TensorBoard, which help visualize and monitor model training.
trainstop_criterion = None
class-attribute
instance-attribute
A function to determine if the training process should stop based on a specific stopping metric. If None, training continues until max_iter is reached.
val_epsilon = 1e-05
class-attribute
instance-attribute
A small safety margin used to compare the current validation loss with the best previous validation loss. This is used to determine improvements in metrics.
val_every = 0
class-attribute
instance-attribute
Frequency (in epochs) for performing validation.
If set to 0, validation is not performed.
Note that metrics from validation are always written, regardless of the write_every setting.
Note that an initial validation happens at the start of training (when val_every > 0). For this initial validation:
- initial metrics are written,
- a checkpoint is saved (when checkpoint_best_only=False).
validation_criterion = None
class-attribute
instance-attribute
A function to evaluate whether a given validation metric meets a desired condition.
The validation_criterion has the following format:

def validation_criterion(val_loss: float, best_val_loss: float, val_epsilon: float) -> bool:
    ...

If None, no custom validation criterion is applied.
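For example, a criterion that only counts an improvement when the validation loss beats the previous best by at least val_epsilon could be written as follows (a minimal sketch; the name strict_improvement is illustrative, and TrainConfig is assumed importable from the package top level):

from perceptrain import TrainConfig

def strict_improvement(val_loss: float, best_val_loss: float, val_epsilon: float) -> bool:
    # Improvement only if the new loss undercuts the best loss by the safety margin.
    return val_loss < best_val_loss - val_epsilon

config = TrainConfig(val_every=100, validation_criterion=strict_improvement)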
verbose = True
class-attribute
instance-attribute
Whether to print metrics and status messages during training.
If True, detailed metrics and status updates will be displayed in the console.
write_every = 0
class-attribute
instance-attribute
Frequency (in epochs) for writing loss and metrics using the tracking tool during training.
Set to 0 to disable this logging, which prevents metrics from being logged to the tracking tool. Note that the metrics will always be written at the end of training regardless of this setting.
get_parameters(model)
Retrieve all trainable model parameters in a single vector.
PARAMETER | DESCRIPTION
---|---
model | The input PyTorch model.

RETURNS | DESCRIPTION
---|---
Tensor | A 1-dimensional tensor with the parameters.
Source code in perceptrain/parameters.py
num_parameters(model)
Return the total number of trainable parameters of a model.
set_parameters(model, theta)
Set all trainable parameters of a model from a single vector.
Note that this function assumes prior knowledge of the correct number of parameters in the model.
PARAMETER | DESCRIPTION
---|---
model | The input PyTorch model.
theta | The parameters to assign.
Source code in perceptrain/parameters.py
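Together, these helpers support a flatten/assign round trip. A minimal sketch, assuming they can be imported from perceptrain.parameters (the source module listed above):

import torch
from torch import nn
from perceptrain.parameters import get_parameters, num_parameters, set_parameters

model = nn.Linear(4, 2)
theta = get_parameters(model)                    # all trainable parameters as a 1-d vector
assert theta.numel() == num_parameters(model)    # the vector length matches the count
set_parameters(model, torch.zeros_like(theta))   # write a modified vector back into the model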
optimize_step(model, optimizer, loss_fn, xs, device=None, dtype=None)
Default Torch optimize step with closure.
This is the default optimization step.
PARAMETER | DESCRIPTION
---|---
model | The input model to be optimized.
optimizer | The chosen Torch optimizer.
loss_fn | A custom loss function that returns the loss value and a dictionary of metrics.
xs | The input data. If None, the given model does not require any input data.
device | A target device to run computations on.
dtype | Data type for the computation.

RETURNS | DESCRIPTION
---|---
tuple[Tensor \| float, dict \| None] | A tuple containing the computed loss value and a dictionary with collected metrics.
Source code in perceptrain/optimize_step.py
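A usage sketch, assuming the function is named optimize_step (after the source module above) and that loss_fn is called with the model and the input data:

import torch
from torch import nn
from perceptrain.optimize_step import optimize_step

model = nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def loss_fn(model, xs):
    # Returns the loss value and a dictionary of metrics, as required above.
    x, y = xs
    loss = nn.functional.mse_loss(model(x), y)
    return loss, {"mse": float(loss)}

xs = (torch.rand(8, 2), torch.rand(8, 1))
loss, metrics = optimize_step(model, optimizer, loss_fn, xs)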
DictDataLoader(dataloaders)
dataclass
This class only holds a dictionary of DataLoaders and samples from them.
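A sketch of holding two named loaders and drawing one batch from each (assuming DictDataLoader lives in perceptrain.data and supports iteration, as its description suggests):

import torch
from perceptrain import to_dataloader
from perceptrain.data import DictDataLoader

x1, y1 = torch.rand(10, 1), torch.rand(10, 1)
x2, y2 = torch.rand(10, 1), torch.rand(10, 1)
ddl = DictDataLoader({
    "task_a": to_dataloader(x1, y1, batch_size=5, infinite=True),
    "task_b": to_dataloader(x2, y2, batch_size=5, infinite=True),
})
batch = next(iter(ddl))  # a dict with one batch per key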
GenerativeIterableDataset(proba_dist)
Bases: IterableDataset
Dataset for sampling from a probability distribution.
Samples once per iteration.
PARAMETER | DESCRIPTION
---|---
proba_dist | The probability distribution to be sampled.
Source code in perceptrain/data.py
InfiniteTensorDataset(*tensors)
Bases: IterableDataset
Randomly sample points from the first dimension of the given tensors.
Behaves like a normal torch Dataset, except that it can be sampled from as many times as needed.
Examples:
import torch
from perceptrain.data import InfiniteTensorDataset
x_data, y_data = torch.rand(5,2), torch.ones(5,1)
# The dataset accepts any number of tensors with the same batch dimension
ds = InfiniteTensorDataset(x_data, y_data)
# call `next` to get one sample from each tensor:
xs = next(iter(ds))
Source code in perceptrain/data.py
OptimizeResult(iteration, model, optimizer, loss=None, metrics={}, extra={}, rank=0, device='cpu')
dataclass
OptimizeResult stores many optimization intermediate values.
At the current iteration, we store the model, optimizer, loss value, and metrics. An extra dict can be used for saving other information to be used in callbacks.
device = 'cpu'
class-attribute
instance-attribute
Device on which this result was calculated.
extra = field(default_factory=lambda: dict())
class-attribute
instance-attribute
Extra dict for saving anything else to be used in callbacks.
iteration
instance-attribute
Current iteration number.
loss = None
class-attribute
instance-attribute
Loss value.
metrics = field(default_factory=lambda: dict())
class-attribute
instance-attribute
Metrics that can be saved during training.
model
instance-attribute
Model at iteration.
optimizer
instance-attribute
Optimizer at iteration.
rank = 0
class-attribute
instance-attribute
Rank of the process for which this result was generated.
R3Dataset(proba_dist, n_samples, release_threshold=0.1)
Bases: Dataset
Dataset for R3 sampling (introduced in https://arxiv.org/abs/2207.02338).
This is an evolutionary dataset that updates itself during training based on the fitness values of the samples. It releases samples if the corresponding fitness value is below the threshold and retains them otherwise. The released samples are replaced by new samples generated from a probability distribution.
While this scheme was originally proposed for training physics-informed neural networks, this implementation can be used for any type of data that can be sampled from a probability distribution.
PARAMETER | DESCRIPTION
---|---
proba_dist | Probability distribution function for generating features.
n_samples | Number of samples to generate.
release_threshold | Threshold for releasing samples.
Source code in perceptrain/data.py
update(fitness_values)
Update the dataset by releasing samples below fitness threshold and resampling.
PARAMETER | DESCRIPTION
---|---
fitness_values | The fitness values of the samples.
Source code in perceptrain/data.py
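A training-loop sketch (the proba_dist signature and the fitness values here are hypothetical placeholders; in the R3 scheme, fitness would typically be, e.g., PDE residuals):

import torch
from perceptrain.data import R3Dataset

def proba_dist(n: int) -> torch.Tensor:
    # Hypothetical sampler: n uniform collocation points in [0, 1)
    return torch.rand(n, 1)

ds = R3Dataset(proba_dist, n_samples=100, release_threshold=0.1)
fitness = torch.rand(100)  # placeholder per-sample fitness values
ds.update(fitness)         # release low-fitness samples and resample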
data_to_device(xs, *args, **kwargs)
Utility method to move arbitrary data to 'device'.
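A sketch, assuming the utility forwards its extra arguments to torch.Tensor.to (as the *args, **kwargs signature suggests) and lives in perceptrain.data alongside the classes above:

import torch
from perceptrain.data import data_to_device

xs = [torch.rand(2), torch.rand(3)]
xs_on_cpu = data_to_device(xs, device="cpu")  # moves each tensor to the target device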
to_dataloader(*tensors, batch_size=1, infinite=False, collate_fn=None)
Convert torch tensors to an (infinite) DataLoader.
PARAMETER | DESCRIPTION
---|---
*tensors | Torch tensors to use in the dataloader.
batch_size | Batch size of sampled tensors.
infinite | If True, the dataloader cycles through the data indefinitely, so it can be sampled from as many times as needed (as in the example below).
collate_fn | Function to collate the sampled tensors. Passed to torch.utils.data.DataLoader. If None, defaults to torch.utils.data.default_collate.
Examples:
import torch
from perceptrain import to_dataloader
(x, y, z) = [torch.rand(10) for _ in range(3)]
loader = iter(to_dataloader(x, y, z, batch_size=5, infinite=True))
print(next(loader))
print(next(loader))
print(next(loader))
[tensor([0.3443, 0.8561, 0.1568, 0.7208, 0.0254]), tensor([0.0738, 0.6195, 0.6742, 0.9500, 0.8758]), tensor([0.1853, 0.2132, 0.4902, 0.3378, 0.1490])]
[tensor([0.2239, 0.4870, 0.3889, 0.7024, 0.9865]), tensor([0.0828, 0.4410, 0.6529, 0.2872, 0.1952]), tensor([0.0183, 0.5617, 0.3583, 0.7822, 0.9182])]
[tensor([0.3443, 0.8561, 0.1568, 0.7208, 0.0254]), tensor([0.0738, 0.6195, 0.6742, 0.9500, 0.8758]), tensor([0.1853, 0.2132, 0.4902, 0.3378, 0.1490])]