Iterable-style DataPipes
An iterable-style dataset is an instance of a subclass of IterableDataset that implements the __iter__() protocol,
and represents an iterable over data samples. This type of dataset is particularly suitable for cases where random
reads are expensive or even improbable, and where the batch size depends on the fetched data.
For example, such a dataset, when iter(iterdatapipe) is called on it, could return a stream of data read from a database,
a remote server, or even logs generated in real time.
This is an updated version of IterableDataset in torch.
- class torchdata.datapipes.iter.IterDataPipe(*args, **kwds)
- Iterable-style DataPipe.
  All DataPipes that represent an iterable of data samples should subclass this. This style of DataPipes is particularly useful when data come from a stream, or when the number of samples is too large to fit them all in memory.
  All subclasses should overwrite __iter__(), which would return an iterator of samples in this DataPipe.
  IterDataPipe is lazily initialized and its elements are computed only when next() is called on its iterator.
  These DataPipes can be invoked in two ways, using the class constructor or applying their functional form onto an existing IterDataPipe (recommended, available to most but not all DataPipes). You can chain multiple IterDataPipe together to form a pipeline that will perform multiple operations in succession.
  Note
  When a subclass is used with DataLoader, each item in the DataPipe will be yielded from the DataLoader iterator. When num_workers > 0, each worker process will have a different copy of the DataPipe object, so it is often desired to configure each copy independently to avoid having duplicate data returned from the workers. get_worker_info(), when called in a worker process, returns information about the worker. It can be used in either the dataset’s __iter__() method or the DataLoader’s worker_init_fn option to modify each copy’s behavior.
  Example
  >>> from torchdata.datapipes.iter import IterableWrapper, Mapper
  >>> dp = IterableWrapper(range(10))
  >>> map_dp_1 = Mapper(dp, lambda x: x + 1)  # Using class constructor
  >>> map_dp_2 = dp.map(lambda x: x + 1)  # Using functional form (recommended)
  >>> list(map_dp_1)
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  >>> list(map_dp_2)
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  >>> filter_dp = map_dp_1.filter(lambda x: x % 2 == 0)
  >>> list(filter_dp)
  [2, 4, 6, 8, 10]
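The note about `num_workers > 0` can be illustrated without any DataLoader machinery. The sketch below is plain Python, not torchdata code: `shard_for_worker` is a hypothetical helper, and in a real pipeline the `worker_id` and `num_workers` values would be read from the `id` and `num_workers` fields of the object returned by `get_worker_info()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(source: Iterable[T], worker_id: int, num_workers: int) -> Iterator[T]:
    """Return an iterator over every num_workers-th element, starting at worker_id.

    Hypothetical helper: it mimics what each worker's copy of a DataPipe
    has to do so that no two workers return the same sample.
    """
    return islice(source, worker_id, None, num_workers)


# Simulate three workers, each holding its own copy of the same source.
num_workers = 3
per_worker = [list(shard_for_worker(range(10), w, num_workers)) for w in range(num_workers)]
print(per_worker)
```

Each simulated worker sees a disjoint slice, and together the slices cover the whole source. Without this kind of per-copy configuration, every worker would iterate over all ten elements and the samples would be returned several times.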
We have different types of Iterable DataPipes:
- Archive - open and decompress archive files of different formats. 
- Augmenting - augment your samples (e.g. adding index, or cycle through indefinitely). 
- Combinatorial - perform combinatorial operations (e.g. sampling, shuffling). 
- Combining/Splitting - interact with multiple DataPipes by combining them or splitting one to many. 
- Grouping - group samples within a DataPipe. 
- IO - interact with file systems or remote servers (e.g. downloading, opening, saving files, and listing the files in directories). 
- Mapping - apply a given function to each element in the DataPipe. 
- Others - perform a miscellaneous set of operations. 
- Selecting - select specific samples within a DataPipe. 
- Text - parse, read, and transform text files and data. 
Archive DataPipes
These DataPipes help you open and decompress archive files of different formats.
| Takes tuples of path and compressed stream of data, and returns tuples of path and decompressed stream of data (functional name:  | |
| Decompresses rar binary streams from an input Iterable DataPipe which contains tuples of path name and rar binary stream, and yields a tuple of path name and extracted binary stream (functional name:  | |
| Opens/decompresses tar binary streams from an Iterable DataPipe which contains tuples of path name and tar binary stream, and yields a tuple of path name and extracted binary stream (functional name:  | |
| Decompresses xz (lzma) binary streams from an Iterable DataPipe which contains tuples of path name and xz binary streams, and yields a tuple of path name and extracted binary stream (functional name:  | |
| Opens/decompresses zip binary streams from an Iterable DataPipe which contains a tuple of path name and zip binary stream, and yields a tuple of path name and extracted binary stream (functional name:  | 
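The rows above all share one shape: tuples of path and compressed stream go in, tuples of path and decompressed stream come out. A minimal sketch of that shape, using only the standard library `gzip` module rather than any torchdata class, might look as follows. The function name `decompress_gzip` and the file name `example.txt.gz` are made up for illustration.

```python
import gzip
import io
from typing import BinaryIO, Iterable, Iterator, Tuple


def decompress_gzip(
    source: Iterable[Tuple[str, BinaryIO]],
) -> Iterator[Tuple[str, BinaryIO]]:
    """Map (path, compressed stream) to (path, decompressed stream), lazily."""
    for path, stream in source:
        # GzipFile wraps the stream; bytes are decompressed only when read.
        yield path, gzip.GzipFile(fileobj=stream, mode="rb")


# Build an in-memory gzip payload so the example needs no files on disk.
payload = io.BytesIO(gzip.compress(b"hello datapipes"))
results = [
    (path, stream.read())
    for path, stream in decompress_gzip([("example.txt.gz", payload)])
]
print(results)
```

Keeping the path next to the stream lets later stages, such as a parser or a filter on file extension, decide what to do with each entry.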
Augmenting DataPipes
These DataPipes help to augment your samples.
| Cycles the specified input in perpetuity by default, or for the specified number of times (functional name:  | |
| Adds an index to an existing DataPipe through enumeration, with the index starting from 0 by default (functional name:  | |
| Adds an index to an existing Iterable DataPipe with (functional name:  | 
Combinatorial DataPipes
These DataPipes help to perform combinatorial operations.
| Generates sample elements using the provided  | |
| Shuffles the input DataPipe with a buffer (functional name:  | 
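Shuffling "with a buffer" matters for streams, because a stream cannot be shuffled globally without reading it all into memory. The following is a generic buffer-based shuffle in plain Python, offered as a sketch of the idea rather than as the library's implementation. The name `buffered_shuffle` and its `seed` parameter are assumptions made for this example.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    source: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

The buffer size trades memory for randomness: with a buffer of 1 the order is unchanged, and an element can never appear in the output more than `buffer_size - 1` positions earlier than it appeared in the input.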
Combining/Splitting DataPipes
These tend to involve multiple DataPipes, combining them or splitting one to many.
| Concatenates multiple Iterable DataPipes (functional name:  | |
| Splits the input DataPipe into multiple child DataPipes, using the given classification function (functional name:  | |
| Creates multiple instances of the same Iterable DataPipe (functional name:  | |
| Zips two IterDataPipes together based on the matching key (functional name:  | |
| Joins the items from the source IterDataPipe with items from a MapDataPipe (functional name:  | |
| Yields one element at a time from each of the input Iterable DataPipes (functional name:  | |
| Takes a Dict of (IterDataPipe, Weight), and yields items by sampling from these DataPipes with respect to their weights. | |
| Takes in a DataPipe of Sequences, unpacks each Sequence, and returns the elements in separate DataPipes based on their position in the Sequence. | |
| Aggregates elements into a tuple from each of the input DataPipes (functional name:  | 
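Two of the operations described above, splitting by a classification function and yielding one element at a time from each input, are easy to picture with plain Python. The sketch below is not torchdata code. `split_by_class` and `round_robin` are hypothetical names, the split is eager (it returns lists) for simplicity, and stopping at the shortest input is an assumption of this sketch.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_by_class(
    source: Iterable[T], num_outputs: int, classifier_fn: Callable[[T], int]
) -> List[List[T]]:
    """Route each item to the output whose index classifier_fn returns."""
    outputs: List[List[T]] = [[] for _ in range(num_outputs)]
    for item in source:
        outputs[classifier_fn(item)].append(item)
    return outputs


def round_robin(*sources: Iterable[T]) -> Iterator[T]:
    """Yield one item from each source in turn.

    Assumption of this sketch: iteration stops as soon as any source is
    exhausted, and an incomplete round is discarded.
    """
    iterators = [iter(s) for s in sources]
    while iterators:
        current_round = []
        for it in iterators:
            try:
                current_round.append(next(it))
            except StopIteration:
                return
        yield from current_round


evens, odds = split_by_class(range(6), 2, lambda x: x % 2)
merged = list(round_robin(evens, odds))
print(evens, odds, merged)
```

Splitting and then interleaving the two halves restores the original order here, which is a convenient sanity check when wiring several pipelines together.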
Grouping DataPipes
These DataPipes help you group samples within a DataPipe.
| Creates mini-batches of data (functional name:  | |
| Creates mini-batches of data from sorted bucket (functional name:  | |
| Collates samples from DataPipe to Tensor(s) by a custom collate function (functional name:  | |
| Groups data from input IterDataPipe by keys which are generated from  | |
| Undoes batching of data (functional name:  | 
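Creating mini-batches and undoing them are inverse operations, which the plain-Python sketch below makes concrete. The helper names `batch` and `unbatch` and the `drop_last` parameter are chosen for this example and are not taken from the rows above. The sketch flattens exactly one level of nesting.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch(source: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Group consecutive items into lists of length batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    current: List[T] = []
    for item in source:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    # Emit the final, shorter batch unless the caller asked to drop it.
    if current and not drop_last:
        yield current


def unbatch(batches: Iterable[Iterable[T]]) -> Iterator[T]:
    """Flatten one level of batching back into a stream of items."""
    for one_batch in batches:
        yield from one_batch


batches = list(batch(range(7), batch_size=3))
full_only = list(batch(range(7), batch_size=3, drop_last=True))
restored = list(unbatch(batches))
print(batches, full_only, restored)
```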
IO DataPipes
These DataPipes help you interact with file systems or remote servers (e.g. downloading, opening, saving files, and listing the files in directories).
| Lists the contents of the directory at the provided  | |
| Opens files from input datapipe which contains fsspec paths and yields a tuple of pathname and opened file stream (functional name:  | |
| Takes in a DataPipe of tuples of metadata and data, saves the data to the target path (generated by the filepath_fn and metadata), and yields the resulting fsspec path (functional name:  | |
| Given path(s) to the root directory, yields file pathname(s) (path + filename) of files within the root directory. | |
| Given pathnames, opens files and yields pathname and file stream in a tuple. | |
| Takes URLs pointing at GDrive files, and yields tuples of file name and IO stream. | |
| Takes file URLs (HTTP URLs pointing to files), and yields tuples of file URL and IO stream. | |
| Lists the contents of the directory at the provided  | |
| Opens files from input datapipe which contains pathnames or URLs, and yields a tuple of pathname and opened file stream (functional name:  | |
| Takes in a DataPipe of tuples of metadata and data, saves the data to the target path which is generated by the  | |
| Takes file URLs (can be HTTP URLs pointing to files or URLs to GDrive files), and yields tuples of file URL and IO stream. | |
| Takes in paths to Parquet files and returns a TorchArrow DataFrame for each row group within a Parquet file (functional name:  | |
| Takes in a DataPipe of tuples of metadata and data, saves the data to the target path generated by the  | 
Mapping DataPipes
These DataPipes apply a given function to each element in the DataPipe.
| Applies a function over each item from the source DataPipe, then flattens the outputs to a single, unnested IterDataPipe (functional name:  | |
| Applies a function over each item from the source DataPipe (functional name:  | 
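The two rows differ only in what happens to the function's output: a plain map keeps one output per input, while the flattening variant expects each output to be iterable and splices its elements into a single stream. A minimal standard-library sketch of that difference, with hypothetical helper names `map_items` and `flat_map`:

```python
from itertools import chain
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_items(source: Iterable[T], fn: Callable[[T], U]) -> Iterator[U]:
    """Apply fn to each item, producing exactly one output per input."""
    return map(fn, source)


def flat_map(source: Iterable[T], fn: Callable[[T], Iterable[U]]) -> Iterator[U]:
    """Apply fn to each item, then flatten the results by one level."""
    return chain.from_iterable(map(fn, source))


sentences = ["lazy data pipes", "stream samples"]
lengths = list(map_items(sentences, len))
words = list(flat_map(sentences, str.split))
print(lengths, words)
```

Both helpers are lazy: nothing is computed until the returned iterator is consumed, which mirrors the lazy evaluation described for IterDataPipe.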
Other DataPipes
A miscellaneous set of DataPipes with different functionalities.
| Takes rows of data, batches a number of them together and creates TorchArrow DataFrames (functional name:  | |
| Indicates when the result of the prior DataPipe will be saved to local files specified by  | |
| Computes and checks the hash of each file, from an input DataPipe of tuples of file name and data/stream (functional name:  | |
| Stores elements from the source DataPipe in memory, up to a size limit if specified (functional name:  | |
| Wraps an iterable object to create an IterDataPipe. | |
| Caches the outputs of multiple DataPipe operations to local files; these operations are typically performance bottlenecks such as downloading and decompressing (functional name:  | |
| Wrapper that allows DataPipe to be sharded (functional name:  | 
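Checking the hash of each file as it streams past is a common guard after a download step. The sketch below shows the general idea with `hashlib`. It is not the library's implementation: the function name `check_hashes`, the choice of SHA-256, the use of raw bytes instead of streams, and the `RuntimeError` on mismatch are all assumptions of this example.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Tuple


def check_hashes(
    source: Iterable[Tuple[str, bytes]], expected: Dict[str, str]
) -> Iterator[Tuple[str, bytes]]:
    """Pass (file name, data) tuples through, failing on a hash mismatch."""
    for name, data in source:
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected[name]:
            raise RuntimeError(f"hash mismatch for {name}")
        yield name, data


files = [("a.bin", b"alpha"), ("b.bin", b"beta")]
expected = {name: hashlib.sha256(data).hexdigest() for name, data in files}

# Untouched files pass through unchanged.
verified = list(check_hashes(files, expected))

# A modified payload is rejected when the stream is consumed.
try:
    list(check_hashes([("a.bin", b"tampered")], expected))
    tamper_detected = False
except RuntimeError:
    tamper_detected = True
print(verified, tamper_detected)
```

Because the check runs inside a generator, the failure surfaces only when the offending element is actually pulled, not when the pipeline is constructed.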
Selecting DataPipes
These DataPipes help you select specific samples within a DataPipe.
| Filters out elements from the source datapipe according to input  | |
| Yields elements from the source DataPipe from the start, up to the specified limit (functional name:  | 
Text DataPipes
These DataPipes help you parse, read, and transform text files and data.
| Accepts a DataPipe consisting of tuples of file name and CSV data stream, reads and returns the contents within the CSV files one row at a time (functional name:  | |
| Accepts a DataPipe consisting of tuples of file name and CSV data stream, reads and returns the contents within the CSV files one row at a time (functional name:  | |
| Reads from JSON data streams and yields a tuple of file name and JSON data (functional name:  | |
| Accepts a DataPipe consisting of tuples of file name and string data stream, and for each line in the stream, yields a tuple of file name and the line (functional name:  | |
| Aggregates lines of text from the same file into a single paragraph (functional name:  | |
| Decodes binary streams from input DataPipe, yields pathname and decoded data in a tuple (functional name:  | |
| Accepts an input DataPipe with batches of data, and processes one batch at a time and yields a Dict for each batch, with  | |
| Given IO streams and their label names, yields bytes with label name in a tuple. |
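Several rows above follow the same convention: the input is a stream of (file name, data stream) tuples, and the file name is carried along with every yielded record. A plain-Python sketch of the line-by-line case is shown below. `read_lines` is a hypothetical helper, the file names are made up, and stripping the trailing newline is an assumption of this example.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def read_lines(source: Iterable[Tuple[str, TextIO]]) -> Iterator[Tuple[str, str]]:
    """For each (file name, text stream), yield (file name, line) per line."""
    for name, stream in source:
        for line in stream:
            # Assumption: drop the trailing newline so records compare cleanly.
            yield name, line.rstrip("\n")


# In-memory streams stand in for opened files.
streams = [
    ("a.txt", io.StringIO("first\nsecond\n")),
    ("b.txt", io.StringIO("third")),
]
lines = list(read_lines(streams))
print(lines)
```

Carrying the file name with each line makes it possible to regroup lines per file later, for example to rebuild paragraphs.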