rising.loading¶

rising.loading provides an alternative DataLoader that extends torch.utils.data.DataLoader by the following:

Seeding of Numpy in each worker process: The seed is generated by numpy in the main process before starting the workers. For reproducibility numpy must be seeded in the main process.
Per-Sample Transforms outside the dataset (optionally with a pseudo batch dimension if the transforms require it). Will be executed within the spawned worker processes before batching.
Batched Transforms for better performance. Will be executed within the worker processes after batching.
Batched GPU-Transforms. Will be executed after syncing results back to main process (i.e. as last transforms) to avoid multiple CUDA initializations.
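
The ordering of the stages listed above can be illustrated with a small, framework-free sketch. It does not use rising or PyTorch; run_pipeline, record and collate are hypothetical names introduced only for this illustration, and the worker/main-process split is indicated in comments rather than implemented.

```python
from typing import Any, Callable, List, Sequence


def run_pipeline(
    samples: Sequence[Any],
    sample_transform: Callable[[Any], Any],
    collate: Callable[[List[Any]], Any],
    batch_transform: Callable[[Any], Any],
    gpu_transform: Callable[[Any], Any],
) -> Any:
    """Hypothetical stand-in that mimics the documented ordering of stages."""
    # 1. Per-sample transforms run first, before batching (worker side).
    transformed = [sample_transform(sample) for sample in samples]
    # 2. Collation merges the samples into one batch (worker side).
    batch = collate(transformed)
    # 3. Batched transforms run after batching (still worker side).
    batch = batch_transform(batch)
    # 4. GPU transforms run last, after results reach the main process.
    return gpu_transform(batch)


stages: List[str] = []  # records the order in which stages are executed


def record(name: str) -> Callable[[Any], Any]:
    """Build an identity transform that logs its stage name."""
    def stage(data: Any) -> Any:
        stages.append(name)
        return data
    return stage


def collate(items: List[Any]) -> List[Any]:
    """Trivial collation: keep the samples in a list."""
    stages.append("collate")
    return list(items)


result = run_pipeline(
    samples=[1, 2, 3],
    sample_transform=record("sample"),
    collate=collate,
    batch_transform=record("batch"),
    gpu_transform=record("gpu"),
)
```

In this sketch the per-sample stage is logged once per sample, while the collate, batch and GPU stages are each logged once per batch.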

Furthermore, it provides a Dataset (based on torch.utils.data.Dataset) that can create subsets of itself from given indices, an AsyncDataset, and several options for collation.

DataLoader¶

class rising.loading.loader.DataLoader(dataset, batch_size=1, shuffle=False, batch_transforms=None, gpu_transforms=None, sample_transforms=None, pseudo_batch_dim=False, device=None, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, auto_convert=True, transform_call=<function default_transform_call>)

Bases: torch.utils.data.DataLoader

A DataLoader that adds batch transforms, per-sample transforms, and numpy seeding of worker processes, all handled outside the dataset.

Note

For reproducibility, numpy and pytorch must be seeded in the main process, as these frameworks are used to generate the seeds for each worker.

Note

len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, an infinite sampler is used, whose __len__() is not implemented, because the actual length depends on both the iterable as well as multi-process loading configurations. So one should not query this method unless they work with a map-style dataset.

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g., a lambda function. See Multiprocessing best practices for more details related to multiprocessing in PyTorch.

Note

The GPU-Transforms for a batch are always executed in the main process after the batch was gathered from subprocesses which apply the CPU-Transformations. The desired workflow is as follows:

Disk -> CPU-Transforms -> GPU-Memory -> GPU-Transforms -> Further GPU Processing (e.g. training a neural network)

Parameters

dataset (Union[Sequence, Dataset]) – dataset from which to load the data
batch_size (int) – how many samples per batch to load (default: 1).
shuffle (bool) – set to True to have the data reshuffled at every epoch (default: False)
batch_transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements
gpu_transforms (Optional[Callable]) – transforms which can be applied to a whole batch (on the GPU). Unlike batch_transforms this is not done in multiple processes, but in the main process on the GPU, because GPUs are capable of non-blocking and asynchronous working. Before executing these transforms all data will be moved to device. This copy is done in a non-blocking way if pin_memory is set to True.
sample_transforms (Optional[Callable]) – transforms applied to each sample (on CPU). These are the first transforms applied to the data, since they are applied on sample retrieval from dataset before batching occurs.
pseudo_batch_dim (bool) – whether the sample_transforms work on batches and thus need a pseudo batch dim of 1 to work correctly.
device (Union[str, device, None]) – the device to move the data to for gpu_transforms. If None: the device will be the current device.
sampler (Optional[Sampler]) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.
batch_sampler (Optional[Sampler]) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
num_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
collate_fn (Optional[Callable]) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
pin_memory (bool) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning section of the PyTorch data loading documentation.
drop_last (bool) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
timeout (Union[int, float]) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
worker_init_fn (Optional[Callable]) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)
transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.

get_batch_transformer()

A getter function for the BatchTransformer

Returns: the initialized BatchTransformer
Return type: BatchTransformer

get_gpu_batch_transformer()

A getter function for the BatchTransformer holding the GPU-Transforms

Returns: the initialized BatchTransformer
Return type: BatchTransformer

get_sample_transformer()

A getter function for the SampleTransformer holding the Per-Sample-Transforms

Returns: the initialized SampleTransformer
Return type: SampleTransformer


default_transform_call¶

rising.loading.loader.default_transform_call(batch, transform)

Default function to call transforms. Mapping and Sequences are unpacked during the transform call. Other types are passed as a positional argument.

Parameters

batch (Any) – current batch which is passed to transforms
transform (Callable) – transform to perform

Returns

transformed batch

Return type

Any
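
The dispatch rule described above can be sketched in plain Python. The function below is an illustrative re-implementation written from the description, not the library's source; sketch_transform_call and scale are hypothetical names.

```python
from collections.abc import Mapping, Sequence
from typing import Any, Callable


def sketch_transform_call(batch: Any, transform: Callable) -> Any:
    """Illustrative dispatch: unpack mappings and sequences, pass the rest."""
    if isinstance(batch, Mapping):
        # Mappings are unpacked as keyword arguments.
        return transform(**batch)
    if isinstance(batch, Sequence):
        # Sequences are unpacked as positional arguments.
        # Note: in this sketch a str also counts as a Sequence.
        return transform(*batch)
    # Any other type is passed as a single positional argument.
    return transform(batch)


def scale(data, factor=2):
    """Hypothetical transform used only for this demonstration."""
    return data * factor


from_mapping = sketch_transform_call({"data": 3, "factor": 10}, scale)
from_sequence = sketch_transform_call((3, 10), scale)
from_scalar = sketch_transform_call(3, scale)
```

A custom function with the same (batch, transform) signature could be passed as transform_call when transforms expect the whole batch as one argument instead of unpacked values.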

BatchTransformer¶

class rising.loading.loader.BatchTransformer(collate_fn, transforms=None, auto_convert=True, transform_call=<function default_transform_call>)

Bases: object

A callable wrapping the collate_fn to enable transformations on a batch-basis.

Parameters

collate_fn (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements
auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)
transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.

__call__(*args, **kwargs)

Apply batch workflow: collate -> augmentation -> default_convert

Parameters

*args – positional batch arguments
**kwargs – keyword batch arguments

Returns

batched and augmented data

Return type

Any
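
The collate -> augmentation -> convert workflow can be sketched as a small callable class. This is a simplified stand-in, not the library's implementation: SketchBatchTransformer, collate_to_dict, double_data and to_tuples are hypothetical, the transform is called on the whole batch directly (the real class routes it through transform_call), and the convert step only stands in for tensor conversion.

```python
from typing import Any, Callable, Dict, List, Optional


class SketchBatchTransformer:
    """Hypothetical stand-in: collate, then transform, then convert."""

    def __init__(
        self,
        collate_fn: Callable,
        transforms: Optional[Callable] = None,
        convert: Optional[Callable] = None,
    ) -> None:
        self.collate_fn = collate_fn
        self.transforms = transforms
        self.convert = convert

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Step 1: merge the samples into a batch.
        batch = self.collate_fn(*args, **kwargs)
        # Step 2: apply the batch transforms, if any were given.
        if self.transforms is not None:
            batch = self.transforms(batch)
        # Step 3: convert the result (placeholder for tensor conversion).
        if self.convert is not None:
            batch = self.convert(batch)
        return batch


def collate_to_dict(samples: List[Dict[str, Any]]) -> Dict[str, list]:
    """Collect the values of every key into one list per key."""
    return {key: [sample[key] for sample in samples] for key in samples[0]}


def double_data(batch: Dict[str, list]) -> Dict[str, list]:
    """Example batch transform: double every entry under 'data'."""
    return {**batch, "data": [2 * value for value in batch["data"]]}


def to_tuples(batch: Dict[str, list]) -> Dict[str, tuple]:
    """Example conversion step: freeze lists into tuples."""
    return {key: tuple(values) for key, values in batch.items()}


transformer = SketchBatchTransformer(collate_to_dict, double_data, to_tuples)
batch = transformer([{"data": 1, "label": 0}, {"data": 2, "label": 1}])
```

Because the wrapper has the same call signature as a collate function, it can be used wherever a collate_fn is expected.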

patch_worker_init_fn¶

rising.loading.loader.patch_worker_init_fn(loader, new_worker_init)

Patches the loader to temporarily have the correct worker init function.

Parameters

loader (DataLoader) – the loader to patch
new_worker_init (Callable) – the new worker init function

Yields

the patched loader

Return type

Generator
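
A temporary patch that yields the patched object is commonly written as a generator-based context manager. The sketch below shows that general pattern with a plain namespace object instead of a real loader; patch_attribute and both init functions are hypothetical, and restoring the original value in a finally block is an assumption about the intended behavior rather than a statement about the library's code.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patch_attribute(obj, name, new_value):
    """Temporarily replace obj.<name> and restore it afterwards."""
    old_value = getattr(obj, name)
    setattr(obj, name, new_value)
    try:
        # Hand the patched object to the body of the with-statement.
        yield obj
    finally:
        # Restore the original attribute even if the body raised.
        setattr(obj, name, old_value)


def original_init(worker_id):
    return f"original-{worker_id}"


def seeded_init(worker_id):
    return f"seeded-{worker_id}"


# A plain namespace stands in for the loader in this sketch.
loader = SimpleNamespace(worker_init_fn=original_init)

with patch_attribute(loader, "worker_init_fn", seeded_init) as patched:
    inside = patched.worker_init_fn(0)

outside = loader.worker_init_fn(0)
```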

patch_collate_fn¶

rising.loading.loader.patch_collate_fn(loader)

Patches the loader to temporarily have the correct collate function

Parameters: loader (DataLoader) – the loader to patch
Yields: the patched loader
Return type: Generator

Dataset¶

class rising.loading.dataset.Dataset(*args, **kwargs)

Bases: torch.utils.data.Dataset

Extension of torch.utils.data.Dataset by a get_subset method which returns a sub-dataset.

get_subset(indices)

Returns a torch.utils.data.Subset of the current dataset based on given indices

Parameters: indices (Sequence[int]) – valid indices to extract subset from current dataset
Returns: the subset of the current dataset
Return type: Subset
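
The idea behind get_subset can be shown without PyTorch. In the sketch below, SubsetView is a simplified stand-in for torch.utils.data.Subset (it maps local indices to indices of the wrapped dataset), and ListDataset is a hypothetical map-style dataset used only for this illustration.

```python
from typing import Any, Sequence


class SubsetView:
    """Stand-in for a subset: maps local indices to original indices."""

    def __init__(self, dataset: Any, indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, idx: int) -> Any:
        # Look up the original index, then fetch from the wrapped dataset.
        return self.dataset[self.indices[idx]]

    def __len__(self) -> int:
        return len(self.indices)


class ListDataset:
    """Hypothetical map-style dataset with a get_subset method."""

    def __init__(self, items: Sequence[Any]) -> None:
        self.items = list(items)

    def __getitem__(self, idx: int) -> Any:
        return self.items[idx]

    def __len__(self) -> int:
        return len(self.items)

    def get_subset(self, indices: Sequence[int]) -> SubsetView:
        # The subset references this dataset; no samples are copied.
        return SubsetView(self, indices)


dataset = ListDataset(["a", "b", "c", "d", "e"])
train = dataset.get_subset([0, 1, 2])
val = dataset.get_subset([3, 4])
train_items = [train[i] for i in range(len(train))]
```

A typical use is splitting one dataset into training and validation parts from two disjoint index lists, as done here.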


AsyncDataset¶

class rising.loading.dataset.AsyncDataset(data_path, load_fn, mode='append', num_workers=0, verbose=False, **load_kwargs)

Bases: rising.loading.dataset.Dataset

A dataset to preload all the data and cache it for the entire lifetime of this class.

Parameters

data_path (Union[Path, str, list]) – the path(s) containing the actual data samples
load_fn (Callable) – function to load the actual data
mode (str) – whether to append the sample to a list or to extend the list by it. Supported modes are: append and extend. Default: append
num_workers (Optional[int]) – the number of workers to use for preloading. 0 means, all the data will be loaded in the main process, while None means, the number of processes will default to the number of logical cores.
verbose (bool) – whether to show the loading progress.
**load_kwargs – additional keyword arguments. Passed directly to load_fn

Warning

If multiprocessing is used to load data, there are restrictions on which load_fn functions are supported; please refer to the dill or pickle documentation.

static _add_item(data, item, mode)

Adds items to the given data list. The actual way of adding these items depends on mode

Parameters

data (list) – the list containing the already loaded data
item (Any) – the current item which will be added to the list
mode (str) – the string specifying the mode of how the item should be added.

Raises

TypeError – No known mode detected

Return type

None
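
The difference between the two modes can be made concrete with a short sketch. add_item below is a hypothetical re-implementation written from the description above (including the documented TypeError for unknown modes); the error message text is an assumption.

```python
from typing import Any, List


def add_item(data: List[Any], item: Any, mode: str) -> None:
    """Add item to data in place, either as one element or element-wise."""
    if mode == "append":
        # The loaded sample becomes a single new entry.
        data.append(item)
    elif mode == "extend":
        # The loaded sample is iterable; each element becomes an entry.
        data.extend(item)
    else:
        raise TypeError(f"Unknown mode detected: {mode}")


# Pretend each load_fn call returned a list of values.
loaded = [[1, 2], [3]]

appended: List[Any] = []
extended: List[Any] = []
for sample in loaded:
    add_item(appended, sample, "append")
    add_item(extended, sample, "extend")

try:
    add_item([], "x", "prepend")
except TypeError as err:
    error_message = str(err)
```

Here append keeps one entry per loaded file, while extend flattens files that each contain several samples into one list.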

_make_dataset(path, mode)

Function to build the entire dataset

Parameters

path (Union[Path, str, list]) – the path(s) containing the data samples
mode (str) – whether to append or extend the dataset by the loaded sample

Returns

the loaded data

Return type

list

load_multi_process(load_fn, path)

Helper function to load dataset with multiple processes

Parameters

load_fn (Callable) – function to load a single sample
path (Sequence) – a sequence of paths which should be loaded

Returns

loaded data

Return type

list

load_single_process(load_fn, path)

Helper function to load dataset with single process

Parameters

load_fn (Callable) – function to load a single sample
path (Sequence) – a sequence of paths which should be loaded

Returns

iterator of loaded data

Return type

Iterator

dill_helper¶

rising.loading.dataset.dill_helper(payload)

Load single sample from data serialized by dill

Parameters: payload (Any) – data which is loaded with dill
Returns: loaded data
Return type: Any

load_async¶

rising.loading.dataset.load_async(pool, fn, *args, callback=None, **kwargs)

Load data asynchronously and serialize data via dill

Parameters

pool (Pool) – multiprocessing pool to use for apply_async()
fn (Callable) – function to load a single sample
*args – positional arguments to dump with dill
callback (Optional[Callable]) – optional callback. defaults to None.
**kwargs – keyword arguments to dump with dill

Returns

reference to obtain data with get()

Return type

Any
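
The pattern of serializing a call, submitting it with apply_async() and later fetching the result with get() can be sketched using only the standard library. This is not the library's code: pickle stands in for dill, a ThreadPool stands in for a process pool, sketch_load_async and pickle_helper are hypothetical names, and bundling the function together with its arguments into one payload is an assumption.

```python
import math
import pickle
from multiprocessing.pool import ThreadPool


def pickle_helper(payload: bytes):
    """Deserialize one payload and execute the call it describes."""
    fn, args, kwargs = pickle.loads(payload)
    return fn(*args, **kwargs)


def sketch_load_async(pool, fn, *args, callback=None, **kwargs):
    """Serialize the call and submit it to the pool without blocking."""
    payload = pickle.dumps((fn, args, kwargs))
    # apply_async returns immediately with a handle for the result.
    return pool.apply_async(pickle_helper, (payload,), callback=callback)


collected = []  # filled by the callback once the result is ready

with ThreadPool(processes=2) as pool:
    handle = sketch_load_async(
        pool, math.sqrt, 16.0, callback=collected.append
    )
    # get() blocks until the result is available.
    value = handle.get(timeout=10)
```

With plain pickle only importable functions can be serialized; dill is typically chosen because it supports a wider range of callables.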

Collation¶


numpy_collate¶

rising.loading.collate.numpy_collate(batch)

Function to collate the samples to a whole batch of numpy arrays. PyTorch Tensors, scalar values and sequences will be cast to arrays automatically.

Parameters

batch (Any) – a batch of samples. In most cases either sequence, mapping or mixture of them

Returns

collated batch with optionally converted type (to numpy.ndarray)

Return type

Any

Raises

TypeError – When batch could not be collated automatically

do_nothing_collate¶

rising.loading.collate.do_nothing_collate(batch)

Returns the batch as is (without any collation).

Parameters: batch (Any) – input batch (typically a sequence, mapping or mixture of those).
Returns: the batch as given to this function
Return type: Any