DataLoader2¶
A new, lightweight DataLoader2 is introduced to decouple the overloaded data-manipulation functionality of torch.utils.data.DataLoader into DataPipe operations. In addition, certain features, such as snapshotting and switching backend services to perform high-performance operations, can only be achieved with DataLoader2.
DataLoader2¶
- class torchdata.dataloader2.DataLoader2(datapipe: Optional[Union[IterDataPipe, MapDataPipe]], datapipe_adapter_fn: Optional[Union[Iterable[Adapter], Adapter]] = None, reading_service: Optional[ReadingServiceInterface] = None)¶
DataLoader2 is used to optimize and execute the given DataPipe graph based on ReadingService and Adapter functions, with support for
- Dynamic sharding for multiprocess and distributed data loading
- Multiple backend ReadingServices
- DataPipe graph in-place modification like shuffle control, memory pinning, etc.
- Snapshot the state of data-preprocessing pipeline (WIP)
- Parameters:
datapipe (IterDataPipe or MapDataPipe) – DataPipe from which to load the data. A deepcopy of this datapipe will be made during initialization, allowing the input to be re-used in a different DataLoader2 without sharing states. Input None can only be used if load_state_dict is called right after the creation of the DataLoader.
datapipe_adapter_fn (Iterable[Adapter] or Adapter, optional) – Adapter function(s) that will be applied to the DataPipe (default: None).
reading_service (ReadingServiceInterface, optional) – defines how DataLoader2 should execute operations over the DataPipe, e.g. multiprocessing/distributed (default: None). A deepcopy of this will be created during initialization, allowing the ReadingService to be re-used in a different DataLoader2 without sharing states.
Note
When a MapDataPipe is passed into DataLoader2, in order to iterate through the data, DataLoader2 will attempt to create an iterator via iter(datapipe). If the object has non-zero-indexed indices, this may fail. Consider using .shuffle() (which converts MapDataPipe to IterDataPipe) or datapipe.to_iter_datapipe(custom_indices).
- __iter__() DataLoader2Iterator[T_co]¶
Return a singleton iterator from the DataPipe graph adapted by ReadingService. The DataPipe will be restored if the serialized state is provided to construct DataLoader2. initialize_iteration and finalize_iterator will be invoked at the beginning and end of the iteration, respectively.
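The note on MapDataPipe iteration can be illustrated with a minimal sketch in plain Python. It assumes the map-style object defines only __getitem__ and __len__, so that iter() falls back to Python's legacy sequence protocol, which requests index 0 first. OneIndexedMap and to_iter are hypothetical stand-ins written for this illustration; they are not part of TorchData, and to_iter only mimics the idea behind to_iter_datapipe(custom_indices).

```python
class OneIndexedMap:
    """Hypothetical stand-in for a map-style DataPipe whose keys start at 1."""

    def __init__(self, items):
        # Store items under keys 1..N instead of 0..N-1.
        self._data = {i + 1: item for i, item in enumerate(items)}

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)


def to_iter(map_style, indices):
    """Yield items by visiting the supplied indices explicitly (illustrative helper)."""
    for index in indices:
        yield map_style[index]


dp = OneIndexedMap(["a", "b", "c"])

# iter() uses the legacy sequence protocol and asks for index 0 first,
# which does not exist in this object, so iteration raises KeyError.
try:
    list(iter(dp))
    iteration_failed = False
except KeyError:
    iteration_failed = True

# Supplying the real indices avoids the zero-based assumption.
recovered = list(to_iter(dp, range(1, len(dp) + 1)))
```

In this sketch, direct iteration fails while iterating over explicit indices succeeds, which is the reason the note suggests converting to an IterDataPipe with custom indices.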
- classmethod from_state(state: Dict[str, Any], reading_service: CheckpointableReadingServiceInterface) DataLoader2[T_co]¶
Create a new DataLoader2 with the DataPipe graph and ReadingService restored from the serialized state.
- load_state_dict(state_dict: Dict[str, Any]) None¶
For the existing DataLoader2, load serialized state to restore the DataPipe graph and reset the internal state of ReadingService.
- seed(seed: int) None¶
Set random seed for DataLoader2 to control determinism.
- Parameters:
seed – Random uint64 seed
- shutdown() None¶
Shuts down ReadingService and cleans up the iterator.
- state_dict() Dict[str, Any]¶
Return a dictionary to represent the state of the data-processing pipeline with keys:
- serialized_datapipe: SerializedDataPipe before ReadingService adaption.
- reading_service_state: The state of ReadingService and adapted DataPipe.
Note:
DataLoader2 doesn’t support torch.utils.data.Dataset or torch.utils.data.IterableDataset. Please wrap each of them with the corresponding DataPipe below:
- torchdata.datapipes.map.SequenceWrapper: torch.utils.data.Dataset
- torchdata.datapipes.iter.IterableWrapper: torch.utils.data.IterableDataset
ReadingService¶
ReadingService specifies the execution backend for the data-processing graph. There are three types of ReadingServices provided in TorchData:
|
|
Default ReadingService to serve the DataPipe graph in the main process, and apply graph settings like determinism control to the graph. |
|
Spawns multiple worker processes to load data from the |
|
Each ReadingService takes the DataPipe graph and rewrites it to support features such as dynamic sharding, sharing random seeds, and snapshotting for multiprocess and distributed data loading. For more detail about those features, please refer to the documentation.
Adapter¶
Adapter is used to configure, modify, and extend the DataPipe graph in DataLoader2. It allows in-place
modification or replacement of the pre-assembled DataPipe graph provided by PyTorch domains. For example, Shuffle(False) can be
provided to DataLoader2, which would disable any shuffle operations in the DataPipe graph.
- class torchdata.dataloader2.adapter.Adapter¶
Adapter base class that follows the Python Callable protocol.
- abstract __call__(datapipe: Union[IterDataPipe, MapDataPipe]) Union[IterDataPipe, MapDataPipe]¶
Callable function that either runs in-place modification of the DataPipe graph, or returns a new DataPipe graph.
- Parameters:
datapipe – DataPipe that needs to be adapted.
- Returns:
Adapted DataPipe or new DataPipe.
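As a rough sketch of the callable protocol described above, the following plain-Python example defines a custom adapter and applies it to a pipeline. The Adapter class here is a stand-in that only mirrors the documented interface so the sketch runs without TorchData installed; real code would inherit from torchdata.dataloader2.adapter.Adapter. Take and apply_adapters are hypothetical names, plain iterables stand in for DataPipes, and applying adapters in list order is an assumption rather than documented behavior.

```python
from abc import ABC, abstractmethod
from itertools import islice


class Adapter(ABC):
    """Stand-in mirroring the documented interface (not the TorchData class)."""

    @abstractmethod
    def __call__(self, datapipe):
        ...


class Take(Adapter):
    """Hypothetical adapter that returns a new pipeline limited to `n` items."""

    def __init__(self, n):
        self.n = n

    def __call__(self, datapipe):
        # Return a new object rather than modifying `datapipe` in place.
        return list(islice(datapipe, self.n))


def apply_adapters(datapipe, adapter_fns):
    """Apply a single adapter or an iterable of adapters, in the order given."""
    if isinstance(adapter_fns, Adapter):
        adapter_fns = [adapter_fns]
    for fn in adapter_fns:
        datapipe = fn(datapipe)
    return datapipe


# Accepts either an iterable of adapters or a single adapter,
# matching the shape of the datapipe_adapter_fn parameter.
adapted = apply_adapters(range(10), [Take(5), Take(3)])
single = apply_adapters(range(4), Take(2))
```

An adapter written this way returns a new pipeline; an in-place variant would instead mutate the object it receives and return that same object.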
Here is the list of Adapters provided by TorchData in torchdata.dataloader2.adapter:
Shuffle DataPipes adapter allows control over all existing Shuffler ( |
|
CacheTimeout DataPipes adapter allows control over timeouts of all existing EndOnDiskCacheHolder ( |
We will also provide more Adapters to cover data-processing options:
- PinMemory: Attach a DataPipe at the end of the data-processing graph that converts output data to torch.Tensor in pinned memory.
- FullSync: Attach a DataPipe to make sure the data-processing graph is synchronized between distributed processes to prevent hanging.
- ShardingPolicy: Modify sharding policy if sharding_filter is present in the DataPipe graph.
- PrefetchPolicy, InvalidateCache, etc.
If you have feature requests about the Adapters you’d like to be provided, please open a GitHub issue. For specific
needs, DataLoader2 also accepts any custom Adapter as long as it inherits from the Adapter class.