rising.loading¶
rising.loading provides an alternative DataLoader that extends torch.utils.data.DataLoader with the following:
- Seeding of numpy in each worker process: the seed is generated by numpy in the main process before the workers are started. For reproducibility, numpy must be seeded in the main process.
- Per-sample transforms outside the dataset (optionally with a pseudo batch dimension if the transforms require it). They are executed within the spawned worker processes before batching.
- Batched transforms for better performance. They are executed within the worker processes after batching.
- Batched GPU transforms. They are executed after the results have been synced back to the main process (i.e. as the last transforms) to avoid multiple CUDA initializations.
It also provides a Dataset (based on torch.utils.data.Dataset) that can create subsets of itself from given indices, an AsyncDataset, and several options for collation.
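The order of the transform stages described above can be illustrated with a small dependency-free sketch. It does not use rising or PyTorch: `run_pipeline` and its arguments are hypothetical names, plain lists stand in for tensors, and everything runs in one process, so it only shows the sequence of stages, not where they execute.

```python
from typing import Callable, Iterator, List, Sequence


def run_pipeline(
    samples: Sequence,
    batch_size: int,
    sample_transform: Callable,
    batch_transform: Callable,
    gpu_transform: Callable,
) -> Iterator[List]:
    """Yield batches, applying each transform stage in the documented order."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        # 1) per-sample transforms: applied before batching
        transformed = [sample_transform(sample) for sample in chunk]
        # 2) collation: here simply a list of the transformed samples
        batch = list(transformed)
        # 3) batched transforms: applied after batching
        batch = batch_transform(batch)
        # 4) GPU transforms: applied last (in rising: in the main process)
        yield gpu_transform(batch)


batches = list(
    run_pipeline(
        samples=[1, 2, 3, 4, 5],
        batch_size=2,
        sample_transform=lambda x: x + 1,
        batch_transform=lambda b: [x * 10 for x in b],
        gpu_transform=lambda b: [x - 1 for x in b],
    )
)
print(batches)
```

Because the dataset size is not divisible by the batch size, the last batch in this sketch holds a single element.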
DataLoader¶
class rising.loading.loader.DataLoader(dataset, batch_size=1, shuffle=False, batch_transforms=None, gpu_transforms=None, sample_transforms=None, pseudo_batch_dim=False, device=None, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, auto_convert=True, transform_call=<function default_transform_call>)
Bases: torch.utils.data.DataLoader
A DataLoader that adds batch transforms, per-sample transforms outside the dataset, and numpy seeds for worker processes.
Note
For reproducibility, numpy and PyTorch must be seeded in the main process, because both frameworks are used to generate the seeds for each worker.
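Why the main process has to be seeded can be shown with a minimal sketch. It uses Python's `random` module as a stand-in for numpy so that it runs without dependencies. The function `make_worker_seeds` and the "base seed plus worker id" scheme are assumptions made for illustration, not necessarily what rising does internally.

```python
import random


def make_worker_seeds(num_workers: int) -> list:
    """Draw one base seed in the main process and derive one seed per worker."""
    # The base seed depends on the state of the main-process generator.
    base_seed = random.randrange(2**31)
    # Assumed scheme: offset the base seed by the worker id.
    return [base_seed + worker_id for worker_id in range(num_workers)]


random.seed(0)                      # seed the main process once
first_run = make_worker_seeds(4)

random.seed(0)                      # same main-process seed ...
second_run = make_worker_seeds(4)   # ... yields the same worker seeds

print(first_run == second_run)
```

Without the call to `random.seed`, the base seed and therefore every worker seed would differ between runs.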
Note
The len(dataloader) heuristic is based on the length of the sampler used. When dataset is an IterableDataset, an infinite sampler is used whose __len__() is not implemented, because the actual length depends on both the iterable and the multi-process loading configuration. Do not query this method unless you work with a map-style dataset.
Warning
If the spawn start method is used, worker_init_fn cannot be an unpicklable object, e.g. a lambda function. See Multiprocessing best practices for more details on multiprocessing in PyTorch.
Note
The GPU transforms for a batch are always executed in the main process, after the batch has been gathered from the subprocesses that apply the CPU transforms. The intended workflow is:
Disk -> CPU-Transforms -> GPU-Memory -> GPU-Transforms -> Further GPU Processing (e.g. training a neural network)
- Parameters
  - dataset (Union[Sequence, Dataset]) – dataset from which to load the data
  - batch_size (int) – how many samples per batch to load (default: 1)
  - shuffle (bool) – set to True to have the data reshuffled at every epoch (default: False)
  - batch_transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements
  - gpu_transforms (Optional[Callable]) – transforms which can be applied to a whole batch (on the GPU). Unlike batch_transforms this is not done in multiple processes, but in the main process on the GPU, because GPUs are capable of non-blocking and asynchronous working. Before executing these transforms all data will be moved to device. This copy is done in a non-blocking way if pin_memory is set to True.
  - sample_transforms (Optional[Callable]) – transforms applied to each sample (on CPU). These are the first transforms applied to the data, since they are applied on sample retrieval from the dataset, before batching occurs.
  - pseudo_batch_dim (bool) – whether the sample_transforms work on batches and thus need a pseudo batch dimension of 1 to work correctly
  - device (Union[str, device, None]) – the device to move the data to for gpu_transforms. If None, the current device is used.
  - sampler (Optional[Sampler]) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.
  - batch_sampler (Optional[Sampler]) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
  - num_workers (int) – how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
  - collate_fn (Optional[Callable]) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
  - pin_memory (bool) – if True, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning section of the PyTorch DataLoader documentation.
  - drop_last (bool) – set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size. If False and the dataset size is not divisible by the batch size, the last batch will be smaller. (default: False)
  - timeout (Union[int, float]) – if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
  - worker_init_fn (Optional[Callable]) – if not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
  - auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)
  - transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.
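The interaction of batch_size and drop_last described in the parameter list comes down to simple integer arithmetic. The helper below is a hypothetical illustration (`num_batches` is not part of rising) and assumes a map-style dataset that is iterated once with the default sampler.

```python
def num_batches(dataset_size: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches per epoch for a map-style dataset."""
    if drop_last:
        # The incomplete last batch is discarded.
        return dataset_size // batch_size
    # The incomplete last batch is kept, so round up.
    return (dataset_size + batch_size - 1) // batch_size


print(num_batches(10, 4, drop_last=False))
print(num_batches(10, 4, drop_last=True))
```

With 10 samples and a batch size of 4, keeping the last batch gives two full batches plus one batch of 2 samples, while dropping it leaves only the two full batches.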
get_batch_transformer()
A getter function for the BatchTransformer.
- Returns
  the initialized BatchTransformer
- Return type
  BatchTransformer
get_gpu_batch_transformer()
A getter function for the BatchTransformer holding the GPU transforms.
- Returns
  the initialized BatchTransformer
- Return type
  BatchTransformer
rising.loading.loader.default_transform_call(batch, transform)
Default function to call transforms. Mappings and Sequences are unpacked during the transform call. Other types are passed as a positional argument.
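The unpacking behaviour described for default_transform_call can be sketched in a few lines of plain Python. `sketch_transform_call` and `add_offset` are hypothetical names used for illustration. The sketch follows the description above and is not copied from the rising source.

```python
from collections.abc import Mapping, Sequence


def sketch_transform_call(batch, transform):
    """Call ``transform`` on ``batch``, unpacking mappings and sequences."""
    if isinstance(batch, Mapping):
        # Mappings are unpacked into keyword arguments.
        return transform(**batch)
    if isinstance(batch, Sequence):
        # Sequences are unpacked into positional arguments.
        return transform(*batch)
    # Any other type is passed as a single positional argument.
    return transform(batch)


def add_offset(data, label):
    """Toy transform: shift the data by one and keep the label."""
    return {"data": data + 1, "label": label}


from_mapping = sketch_transform_call({"data": 1, "label": 0}, add_offset)
from_sequence = sketch_transform_call((1, 0), add_offset)
from_scalar = sketch_transform_call(5, lambda x: x * 2)
```

A custom function with the same `(batch, transform)` signature could be passed as transform_call if a transform expects the whole batch as one argument instead.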
BatchTransformer¶
class rising.loading.loader.BatchTransformer(collate_fn, transforms=None, auto_convert=True, transform_call=<function default_transform_call>)
Bases: object
A callable wrapping the collate_fn to enable transformations on a batch basis.
- Parameters
  - collate_fn (Callable) – merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
  - transforms (Optional[Callable]) – transforms which can be applied to a whole batch. Usually this accepts either mappings or sequences and returns the same type containing transformed elements
  - auto_convert (bool) – if set to True, the batches will always be transformed to torch.Tensors, if possible. (default: True)
  - transform_call (Callable[[Any, Callable], Any]) – function which determines how transforms are called. By default Mappings and Sequences are unpacked during the transform.
patch_worker_init_fn¶
rising.loading.loader.patch_worker_init_fn(loader, new_worker_init)
Patches the loader to temporarily have the correct worker init function.
- Parameters
  - loader (DataLoader) – the loader to patch
  - new_worker_init (Callable) – the new worker init function
- Yields
  the patched loader
- Return type
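Temporarily patching an attribute and yielding the patched object is a typical use of a context manager. The following is a minimal sketch under that assumption. `FakeLoader`, `sketch_patch_worker_init_fn` and `seed_worker` are hypothetical stand-ins so the example runs without PyTorch, and the `try`/`finally` is a design choice of the sketch, not a statement about the rising implementation.

```python
from contextlib import contextmanager


class FakeLoader:
    """Hypothetical stand-in for a DataLoader with a worker_init_fn attribute."""

    def __init__(self, worker_init_fn=None):
        self.worker_init_fn = worker_init_fn


@contextmanager
def sketch_patch_worker_init_fn(loader, new_worker_init):
    """Swap in ``new_worker_init`` for the duration of the ``with`` block."""
    old_init = loader.worker_init_fn
    loader.worker_init_fn = new_worker_init
    try:
        yield loader
    finally:
        # Restore the original function even if the block raised.
        loader.worker_init_fn = old_init


def seed_worker(worker_id):
    """Toy worker init function."""
    return worker_id


loader = FakeLoader(worker_init_fn=None)
with sketch_patch_worker_init_fn(loader, seed_worker) as patched:
    inside = patched.worker_init_fn   # the patched function is active here
after = loader.worker_init_fn         # the original value is restored
```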
Dataset¶
class rising.loading.dataset.Dataset(*args, **kwargs)
Bases: torch.utils.data.Dataset
Extension of torch.utils.data.Dataset by a get_subset method which returns a sub-dataset.
get_subset(indices)
Returns a torch.utils.data.Subset of the current dataset based on the given indices.
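What a subset built from indices looks like can be sketched without PyTorch. `ListDataset` and `SubsetView` below are hypothetical stand-ins for a map-style dataset and for torch.utils.data.Subset. The sketch only illustrates the index remapping, assuming the subset keeps a reference to the full dataset instead of copying the data.

```python
class SubsetView:
    """Stand-in for a subset: maps subset positions to dataset indices."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, index):
        # Position ``index`` in the subset refers to ``indices[index]``
        # in the underlying dataset.
        return self.dataset[self.indices[index]]

    def __len__(self):
        return len(self.indices)


class ListDataset:
    """Hypothetical map-style dataset backed by a list."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, index):
        return self.items[index]

    def __len__(self):
        return len(self.items)

    def get_subset(self, indices):
        return SubsetView(self, indices)


full = ListDataset(["a", "b", "c", "d"])
subset = full.get_subset([3, 1])
print(len(subset), subset[0], subset[1])
```

This pattern is commonly used to split one dataset into training and validation parts by passing disjoint index lists.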
AsyncDataset¶
class rising.loading.dataset.AsyncDataset(data_path, load_fn, mode='append', num_workers=0, verbose=False, **load_kwargs)
Bases: rising.loading.dataset.Dataset
A dataset to preload all the data and cache it for the entire lifetime of this class.
- Parameters
  - data_path (Union[Path, str, list]) – the path(s) containing the actual data samples
  - load_fn (Callable) – function to load the actual data
  - mode (str) – whether to append the sample to a list or to extend the list by it. Supported modes are: append and extend. Default: append
  - num_workers (Optional[int]) – the number of workers to use for preloading. 0 means all the data will be loaded in the main process, while None means the number of processes defaults to the number of logical cores.
  - verbose (bool) – whether to show the loading progress
  - **load_kwargs – additional keyword arguments, passed directly to load_fn
Warning
If multiprocessing is used to load the data, there are restrictions on which load_fn() are supported. Refer to the dill or pickle documentation.
static _add_item(data, item, mode)
Adds items to the given data list. How the items are added depends on mode.
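The difference between the two modes matters when load_fn returns several samples per path. The sketch below uses a hypothetical helper `sketch_add_item`. Returning the list and raising `ValueError` for unknown modes are choices made for this illustration, not documented behaviour.

```python
def sketch_add_item(data: list, item, mode: str) -> list:
    """Add ``item`` to ``data`` either as one element or element by element."""
    if mode == "append":
        # The loaded item becomes a single sample.
        data.append(item)
    elif mode == "extend":
        # The loaded item is an iterable of samples; each one is added.
        data.extend(item)
    else:
        raise ValueError(f"Unsupported mode: {mode!r}")
    return data


loaded = ["sample_1", "sample_2"]   # e.g. what one load_fn call returned
appended = sketch_add_item([], loaded, "append")
extended = sketch_add_item([], loaded, "extend")
print(len(appended), len(extended))
```

With append the dataset would contain one entry per loaded path, with extend one entry per sample inside each loaded item.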
dill_helper¶
Collation¶
rising.loading.collate.numpy_collate(batch)
Collates the samples into a whole batch of numpy arrays. PyTorch tensors, scalar values and sequences are cast to arrays automatically.
- Parameters
  - batch (Any) – a batch of samples. In most cases either a sequence, a mapping or a mixture of them
- Returns
  collated batch with optionally converted type (to numpy.ndarray)
- Return type
  Any
- Raises
  TypeError – when the batch could not be collated automatically
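The structural part of such a collation, turning a list of samples into one container per field, can be sketched in plain Python. `sketch_collate` is a hypothetical stand-in that produces lists where numpy_collate would produce numpy arrays. It only illustrates the recursion over mappings and sequences and the TypeError for unsupported types.

```python
from collections.abc import Mapping, Sequence
from numbers import Number


def sketch_collate(batch):
    """Collate a list of samples field by field (lists instead of arrays)."""
    elem = batch[0]
    if isinstance(elem, (Number, str)):
        # Scalars and strings: one entry per sample.
        return list(batch)
    if isinstance(elem, Mapping):
        # Mappings: collate the values of each key separately.
        return {key: sketch_collate([sample[key] for sample in batch])
                for key in elem}
    if isinstance(elem, Sequence):
        # Sequences: transpose, then collate each position separately.
        return [sketch_collate(list(field)) for field in zip(*batch)]
    raise TypeError(f"Cannot collate elements of type {type(elem)}")


dict_batch = sketch_collate([
    {"data": 1.0, "label": 0},
    {"data": 2.0, "label": 1},
])
tuple_batch = sketch_collate([(1, "a"), (2, "b")])
```

In this sketch a list of dict samples becomes a dict of lists and a list of tuples becomes one list per tuple position.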
rising.loading.collate.do_nothing_collate(batch)
Returns the batch as is (without any collation).
- Parameters
  - batch (Any) – input batch (typically a sequence, mapping or mixture of those)
- Returns
  the batch as given to this function
- Return type
  Any