rising.loading

rising.loading provides an alternative DataLoader that extends torch.utils.data.DataLoader with the following features:

  • Seeding of numpy in each worker process: the seed is generated by numpy in the main process before the workers are started. For reproducibility, numpy must be seeded in the main process.

  • Per-sample transforms outside the dataset (optionally with a pseudo batch dimension if the transforms require it). These are executed in the spawned worker processes before batching.

  • Batched transforms for better performance. These are executed in the worker processes after batching.

  • Batched GPU transforms. These are executed after the results have been synced back to the main process (i.e. as the last transforms) to avoid multiple CUDA initializations.

Furthermore, it provides a Dataset (based on torch.utils.data.Dataset) that can create subsets of itself from given indices, an AsyncDataset, and different options for collation.
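
A minimal usage sketch of the loader follows; the toy dataset and both transforms are illustrative, not part of rising. With the default transform_call, dict samples and dict batches are unpacked as keyword arguments (see default_transform_call below):

    import torch
    from rising.loading import DataLoader


    class ToyDataset(torch.utils.data.Dataset):
        """Toy map-style dataset returning dict samples (illustrative only)."""

        def __init__(self, n: int = 16):
            self.samples = [{"data": torch.rand(1, 32, 32)} for _ in range(n)]

        def __getitem__(self, index):
            return self.samples[index]

        def __len__(self):
            return len(self.samples)


    def scale_sample(data):
        # per-sample transform: applied to each sample before batching
        return {"data": data * 2}


    def normalize_batch(data):
        # batch transform: applied to the collated batch in the workers
        return {"data": (data - data.mean()) / (data.std() + 1e-8)}


    loader = DataLoader(
        ToyDataset(),
        batch_size=4,
        sample_transforms=scale_sample,
        batch_transforms=normalize_batch,
    )

    for batch in loader:
        print(batch["data"].shape)  # torch.Size([4, 1, 32, 32])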

DataLoader

class rising.loading.loader.DataLoader(dataset, batch_size=1, shuffle=False, batch_transforms=None, gpu_transforms=None, sample_transforms=None, pseudo_batch_dim=False, device=None, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, auto_convert=True, transform_call=<function default_transform_call>, **kwargs)[source]

Bases: torch.utils.data.DataLoader

A DataLoader introducing batch transforms, per-sample transforms and numpy seeding for worker processes, all outside the dataset.

Note

For reproducibility, numpy and pytorch must be seeded in the main process, as these frameworks will be used to generate their own seeds for each worker.

Note

The len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, an infinite sampler is used, whose __len__() is not implemented, because the actual length depends on both the iterable and the multi-process loading configuration. One should therefore not query this method unless working with a map-style dataset.

Warning

If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g. a lambda function. See Multiprocessing best practices for more details on multiprocessing in PyTorch.

Note

The GPU transforms for a batch are always executed in the main process, after the batch has been gathered from the subprocesses which apply the CPU transforms. The desired workflow is as follows:

Disk -> CPU-Transforms -> GPU-Memory -> GPU-Transforms -> Further GPU Processing (e.g. training a neural network)

Parameters
  • dataset (Union[Sequence, Dataset]) – dataset from which to load the data

  • batch_size (int) – how many samples per batch to load (default: 1).

  • shuffle (bool) – set to True to have the data reshuffled at every epoch (default: False)

  • batch_transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements

  • gpu_transforms (Optional[Callable]) – transforms which can be applied to a whole batch (on the GPU). Unlike batch_transforms, this is not done in multiple processes, but in the main process on the GPU, because GPUs are capable of non-blocking, asynchronous execution. Before these transforms are executed, all data is moved to device; this copy is done in a non-blocking way if pin_memory is set to True. See the sketch after this parameter list.

  • sample_transforms (Optional[Callable]) – transforms applied to each sample (on CPU). These are the first transforms applied to the data, since they are applied on sample retrieval from dataset before batching occurs.

  • pseudo_batch_dim (bool) – whether the sample_transforms work on batches and thus need a pseudo batch dim of 1 to work correctly.

  • device (Union[str, device, None]) – the device to move the data to for gpu_transforms. If None, the device will be the current device.

  • sampler (Optional[Sampler]) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.

  • batch_sampler (Optional[Sampler]) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

  • collate_fn (Optional[Callable]) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the PyTorch documentation on memory pinning.

  • drop_last (bool) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

  • timeout (Union[int, float]) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Optional[Callable]) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)

  • transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.
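
A sketch of the GPU path, assuming a CUDA device is available; ToyDataset is the illustrative dataset from the earlier sketch and the transform is again a stand-in:

    import torch
    from rising.loading import DataLoader

    def add_noise(data):
        # executed in the main process on the GPU, after the batch has been
        # moved to `device` (non-blocking because pin_memory=True)
        return {"data": data + 0.01 * torch.randn_like(data)}

    if torch.cuda.is_available():
        loader = DataLoader(
            ToyDataset(),
            batch_size=4,
            num_workers=2,
            pin_memory=True,
            gpu_transforms=add_noise,
            device="cuda:0",
        )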

get_batch_transformer()[source]

A getter function for the BatchTransformer

Returns

the initialized BatchTransformer

Return type

BatchTransformer

get_gpu_batch_transformer()[source]

A getter function for the BatchTransformer holding the GPU-Transforms

Returns

the initialized BatchTransformer

Return type

BatchTransformer

get_sample_transformer()[source]

A getter function for the SampleTransformer holding the Per-Sample-Transforms

Returns

the initialized SampleTransformer

Return type

SampleTransformer

default_transform_call

rising.loading.loader.default_transform_call(batch, transform)[source]

Default function to call transforms. Mappings and Sequences are unpacked during the transform call. Other types are passed as a single positional argument.

Parameters
  • batch (Any) – current batch which is passed to transforms

  • transform (Callable) – transform to perform

Returns

transformed batch

Return type

Any
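
The described behavior corresponds roughly to the following logic; this is a sketch reconstructed from the description above, not the verbatim implementation:

    from collections.abc import Mapping, Sequence
    from typing import Any, Callable

    def default_transform_call_sketch(batch: Any, transform: Callable) -> Any:
        if isinstance(batch, Mapping):
            return transform(**batch)  # mappings unpacked as keyword arguments
        if isinstance(batch, Sequence):
            return transform(*batch)   # sequences unpacked as positional arguments
        return transform(batch)        # everything else passed positionally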

BatchTransformer

class rising.loading.loader.BatchTransformer(collate_fn, transforms=None, auto_convert=True, transform_call=<function default_transform_call>)[source]

Bases: object

A callable wrapping the collate_fn to enable transformations on a batch-basis.

Parameters
  • collate_fn (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements

  • auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)

  • transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.

__call__(*args, **kwargs)[source]

Apply batch workflow: collate -> augmentation -> default_convert

Parameters
  • *args – positional batch arguments

  • **kwargs – keyword batch arguments

Returns

batched and augmented data

Return type

Any
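
A usage sketch, wrapping PyTorch's default collate function so that an illustrative transform runs on the collated batch:

    import torch
    from torch.utils.data.dataloader import default_collate
    from rising.loading.loader import BatchTransformer

    def double(data):
        # illustrative batch transform; dict batches are unpacked as kwargs
        return {"data": data * 2}

    transformer = BatchTransformer(default_collate, transforms=double)
    samples = [{"data": torch.ones(3)} for _ in range(4)]
    batch = transformer(samples)  # collate -> transform -> convert
    print(batch["data"].shape)    # torch.Size([4, 3])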

patch_worker_init_fn

rising.loading.loader.patch_worker_init_fn(loader, new_worker_init)[source]

Patches the loader to temporarily have the correct worker init function.

Parameters
  • loader (DataLoader) – the loader to patch

  • new_worker_init (Callable) – the new worker init function

Yields

the patched loader

Return type

Generator
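
Since the function yields the patched loader, it is presumably meant to be used as a context manager; a hedged sketch, where loader is any rising DataLoader:

    from rising.loading.loader import patch_worker_init_fn

    def my_worker_init(worker_id: int) -> None:
        # illustrative worker init function
        print(f"worker {worker_id} started")

    # temporarily replace the loader's worker_init_fn; restored on exit
    with patch_worker_init_fn(loader, my_worker_init) as patched_loader:
        for batch in patched_loader:
            ...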

patch_collate_fn

rising.loading.loader.patch_collate_fn(loader)[source]

Patches the loader to temporarily have the correct collate function.

Parameters

loader (DataLoader) – the loader to patch

Yields

the patched loader

Return type

Generator
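
patch_collate_fn presumably follows the same context-manager pattern; a sketch:

    from rising.loading.loader import patch_collate_fn

    with patch_collate_fn(loader) as patched_loader:
        batch = next(iter(patched_loader))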

Dataset

class rising.loading.dataset.Dataset(*args, **kwargs)[source]

Bases: torch.utils.data.Dataset

Extension of torch.utils.data.Dataset by a get_subset method which returns a sub-dataset.

get_subset(indices)[source]

Returns a torch.utils.data.Subset of the current dataset based on given indices

Parameters

indices (Sequence[int]) – valid indices to extract subset from current dataset

Returns

the subset of the current dataset

Return type

Subset
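
A usage sketch; the concrete dataset is illustrative:

    from rising.loading.dataset import Dataset

    class RangeDataset(Dataset):
        """Toy map-style dataset over a range of integers."""

        def __init__(self, n: int = 10):
            self.values = list(range(n))

        def __getitem__(self, index):
            return self.values[index]

        def __len__(self):
            return len(self.values)

    dset = RangeDataset()
    subset = dset.get_subset([0, 2, 4])  # a torch.utils.data.Subset
    print(len(subset), subset[1])        # 3 2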

AsyncDataset

class rising.loading.dataset.AsyncDataset(data_path, load_fn, mode='append', num_workers=0, verbose=False, **load_kwargs)[source]

Bases: rising.loading.dataset.Dataset

A dataset to preload all the data and cache it for the entire lifetime of this class.

Parameters
  • data_path (Union[Path, str, list]) – the path(s) containing the actual data samples

  • load_fn (Callable) – function to load the actual data

  • mode (str) – whether to append the sample to a list or to extend the list by it. Supported modes are: append and extend. Default: append

  • num_workers (Optional[int]) – the number of workers to use for preloading. 0 means all the data will be loaded in the main process, while None means the number of processes will default to the number of logical cores.

  • verbose (bool) – whether to show the loading progress.

  • **load_kwargs – additional keyword arguments. Passed directly to load_fn

Warning

If using multiprocessing to load data, there are some restrictions on which load_fn() is supported; please refer to the dill or pickle documentation.
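
A construction sketch; load_sample and the file paths are placeholders, and the extra keyword argument illustrates **load_kwargs being forwarded to load_fn:

    import numpy as np
    from rising.loading.dataset import AsyncDataset

    def load_sample(path, scale=1.0):
        # placeholder loader; `scale` arrives via **load_kwargs
        return np.load(path) * scale

    paths = ["sample_0.npy", "sample_1.npy"]  # placeholder paths
    dset = AsyncDataset(paths, load_sample, mode="append", num_workers=4, scale=2.0)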

static _add_item(data, item, mode)[source]

Adds items to the given data list. The actual way of adding these items depends on mode.

Parameters
  • data (list) – the list containing the already loaded data

  • item (Any) – the current item which will be added to the list

  • mode (str) – the string specifying how the item should be added.

Raises

TypeError – No known mode detected

Return type

None

_make_dataset(path, mode)[source]

Function to build the entire dataset

Parameters
  • path (Union[Path, str, list]) – the path(s) containing the data samples

  • mode (str) – whether to append or extend the dataset by the loaded sample

Returns

the loaded data

Return type

list

load_multi_process(load_fn, path)[source]

Helper function to load the dataset with multiple processes.

Parameters
  • load_fn (Callable) – function to load a single sample

  • path (Sequence) – a sequence of paths which should be loaded

Returns

loaded data

Return type

list

load_single_process(load_fn, path)[source]

Helper function to load the dataset with a single process.

Parameters
  • load_fn (Callable) – function to load a single sample

  • path (Sequence) – a sequence of paths which should be loaded

Returns

iterator of loaded data

Return type

Iterator

dill_helper

rising.loading.dataset.dill_helper(payload)[source]

Load a single sample from data serialized by dill.

Parameters

payload (Any) – data which is loaded with dill

Returns

loaded data

Return type

Any

load_async

rising.loading.dataset.load_async(pool, fn, *args, callback=None, **kwargs)[source]

Load data asynchronously and serialize data via dill

Parameters
  • pool (Pool) – multiprocessing pool to use for apply_async()

  • fn (Callable) – function to load a single sample

  • *args – positional arguments to dump with dill

  • callback (Optional[Callable]) – optional callback. Defaults to None.

  • **kwargs – keyword arguments to dump with dill

Returns

reference to obtain data with get()

Return type

Any
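
A sketch of the intended pattern; the loader function and the path are placeholders:

    from multiprocessing import Pool

    import numpy as np
    from rising.loading.dataset import load_async

    def load_sample(path):
        return np.load(path)  # placeholder loader

    with Pool(processes=4) as pool:
        ref = load_async(pool, load_sample, "sample_0.npy")  # returns immediately
        data = ref.get()  # blocks until the sample has been loaded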

Collation

numpy_collate

rising.loading.collate.numpy_collate(batch)[source]

Function to collate samples into a whole batch of numpy arrays. PyTorch tensors, scalar values and sequences will be cast to arrays automatically.

Parameters

batch (Any) – a batch of samples. In most cases either sequence, mapping or mixture of them

Returns

collated batch with optionally converted type (to numpy.ndarray)

Return type

Any

Raises

TypeError – When batch could not be collated automatically

do_nothing_collate

rising.loading.collate.do_nothing_collate(batch)[source]

Returns the batch as is (without any collation).

Parameters

batch (Any) – input batch (typically a sequence, mapping or mixture of those)

Returns

the batch as given to this function

Return type

Any
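
Both collate functions can be passed to the rising DataLoader via collate_fn; a sketch, with auto_convert disabled here so the batches are not converted back to torch.Tensors:

    import numpy as np
    from rising.loading import DataLoader
    from rising.loading.collate import do_nothing_collate, numpy_collate

    samples = [{"data": np.random.rand(3)} for _ in range(8)]

    # collate each batch into numpy arrays instead of torch tensors
    np_loader = DataLoader(samples, batch_size=4, collate_fn=numpy_collate,
                           auto_convert=False)

    # keep each batch as the raw list of samples, without any collation
    raw_loader = DataLoader(samples, batch_size=4, collate_fn=do_nothing_collate,
                            auto_convert=False)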

