torchtext.datasets
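General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB

# request a single split; iterating it yields (label, line) pairs
train_iter = IMDB(split='train')

def tokenize(label, line):
    # split the raw text on whitespace; the label is not used here
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)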
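As a follow-up to the loop above, the sketch below counts token frequencies over any iterable of (label, line) pairs. The helper name `count_tokens` and the sample data are hypothetical and not part of torchtext; the sample only mimics the (label, line) shape used above so that the snippet runs without downloading anything.

```python
from collections import Counter


def count_tokens(pairs):
    """Count whitespace-separated tokens over an iterable of (label, line) pairs."""
    counts = Counter()
    for _label, line in pairs:
        counts.update(line.split())
    return counts


# Stand-in data with the same (label, line) shape as the loop above.
# With torchtext installed, IMDB(split='train') could be passed instead.
sample = [
    ("pos", "a great great film"),
    ("neg", "a dull film"),
]
counts = count_tokens(sample)
```

Because the helper only relies on iteration, it should work with a list, a generator, or a dataset iterator alike.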
The following datasets are available:
Text Classification
AG_NEWS
torchtext.datasets.AG_NEWS(root='.data', split=('train', 'test'))
AG_NEWS dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 120000
- test: 7600
Number of classes: 4
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
SogouNews
torchtext.datasets.SogouNews(root='.data', split=('train', 'test'))
SogouNews dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 450000
- test: 60000
Number of classes: 5
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
DBpedia
torchtext.datasets.DBpedia(root='.data', split=('train', 'test'))
DBpedia dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 560000
- test: 70000
Number of classes: 14
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
YelpReviewPolarity
torchtext.datasets.YelpReviewPolarity(root='.data', split=('train', 'test'))
YelpReviewPolarity dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 560000
- test: 38000
Number of classes: 2
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
YelpReviewFull
torchtext.datasets.YelpReviewFull(root='.data', split=('train', 'test'))
YelpReviewFull dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 650000
- test: 50000
Number of classes: 5
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
YahooAnswers
torchtext.datasets.YahooAnswers(root='.data', split=('train', 'test'))
YahooAnswers dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 1400000
- test: 60000
Number of classes: 10
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
AmazonReviewPolarity
torchtext.datasets.AmazonReviewPolarity(root='.data', split=('train', 'test'))
AmazonReviewPolarity dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 3600000
- test: 400000
Number of classes: 2
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
AmazonReviewFull
torchtext.datasets.AmazonReviewFull(root='.data', split=('train', 'test'))
AmazonReviewFull dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 3000000
- test: 650000
Number of classes: 5
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
 
 
IMDb
torchtext.datasets.IMDB(root='.data', split=('train', 'test'))
IMDB dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 25000
- test: 25000
Number of classes: 2
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
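The class counts listed for the text classification datasets can be checked against the data itself. The sketch below is a minimal example, assuming every classification dataset yields (label, line) pairs the way the IMDB loop at the top of this page does; `label_histogram` and the sample data are hypothetical names introduced only for illustration.

```python
from collections import Counter


def label_histogram(pairs):
    """Return a Counter mapping each label to the number of examples carrying it."""
    return Counter(label for label, _line in pairs)


# Stand-in (label, line) pairs; the label values here are made up.
# With torchtext installed, an iterator such as AG_NEWS(split='test')
# could be passed instead (assumed to yield the same pair shape).
sample = [
    (1, "stocks rally on earnings"),
    (2, "team wins the final"),
    (1, "markets close higher"),
]
hist = label_histogram(sample)
num_classes = len(hist)
```

Comparing `num_classes` with the documented number of classes is a quick sanity check after downloading a dataset.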
 
 
Language Modeling
WikiText-2
torchtext.datasets.WikiText2(root='.data', split=('train', 'valid', 'test'))
WikiText2 dataset. Returns the train, valid, and test splits separately.
Number of lines per split:
- train: 36718
- valid: 3760
- test: 4358
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
 
 
WikiText103
torchtext.datasets.WikiText103(root='.data', split=('train', 'valid', 'test'))
WikiText103 dataset. Returns the train, valid, and test splits separately.
Number of lines per split:
- train: 1801350
- valid: 3760
- test: 4358
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
 
 
PennTreebank
torchtext.datasets.PennTreebank(root='.data', split=('train', 'valid', 'test'))
PennTreebank dataset. Returns the train, valid, and test splits separately.
Number of lines per split:
- train: 42068
- valid: 3370
- test: 3761
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
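Language modeling pipelines usually turn the lines of a corpus into one token stream and cut it into fixed-length sequences. The sketch below shows one way to do that, assuming each item yielded by a language modeling dataset is a single line of raw text (the page only states line counts, so the item type is an assumption). `token_stream` and `fixed_length_chunks` are hypothetical helpers, not torchtext functions.

```python
def token_stream(lines):
    """Yield whitespace-separated tokens from an iterable of text lines."""
    for line in lines:
        yield from line.split()


def fixed_length_chunks(tokens, seq_len):
    """Group a token stream into lists of exactly seq_len tokens.

    A trailing chunk shorter than seq_len is dropped.
    """
    chunk = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == seq_len:
            yield chunk
            chunk = []


# Stand-in lines; with torchtext installed, WikiText2(split='train')
# could be passed instead (assumed to yield one string per line).
sample_lines = ["the cat sat", "", "on the mat today"]
chunks = list(fixed_length_chunks(token_stream(sample_lines), seq_len=3))
```

Empty lines contribute no tokens, so they are skipped without special handling.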
 
 
Machine Translation
Multi30k
torchtext.datasets.Multi30k(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
Multi30k dataset.
Reference: http://www.statmt.org/wmt16/multimodal-task.html#task1
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
- language_pair – Tuple or list containing the source and target language. Available options are ('de', 'en') and ('en', 'de').
 
 
IWSLT2016
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
IWSLT2016 dataset.
The available datasets include the following.
Language pairs:
     | 'en' | 'fr' | 'de' | 'cs' | 'ar'
'en' |      |  x   |  x   |  x   |  x
'fr' |  x   |      |      |      |
'de' |  x   |      |      |      |
'cs' |  x   |      |      |      |
'ar' |  x   |      |      |      |
Valid/test sets: ['dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013', 'tst2014']
For additional details refer to the source website: https://wit3.fbk.eu/2016-01
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
- language_pair – Tuple or list containing the source and target language.
- valid_set – A string identifying the validation set.
- test_set – A string identifying the test set.
Examples:
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(train_iter)
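Since each item from a translation iterator unpacks into a source and a target sentence, as in the example above, a common preprocessing step is to drop overly long pairs. The following is a minimal sketch of that idea; `filter_by_length`, its whitespace-based length measure, and the sample sentences are illustrative assumptions rather than torchtext functionality.

```python
def filter_by_length(pairs, max_tokens):
    """Yield (src, tgt) pairs where both sides have at most max_tokens tokens.

    Length is measured by whitespace splitting, which is only a rough proxy
    for the tokenizer a real pipeline would use.
    """
    for src, tgt in pairs:
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            yield src, tgt


# Stand-in sentence pairs with the same (src, tgt) shape as the example above.
sample = [
    ("Guten Morgen", "Good morning"),
    ("Das ist ein sehr langer Satz", "That is a very long sentence"),
]
kept = list(filter_by_length(sample, max_tokens=3))
```

The generator form keeps memory use flat, which matters when the input is a large training split.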
IWSLT2017
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
IWSLT2017 dataset.
The available datasets include the following.
Language pairs:
     | 'en' | 'nl' | 'de' | 'it' | 'ro'
'en' |      |  x   |  x   |  x   |  x
'nl' |  x   |      |  x   |  x   |  x
'de' |  x   |  x   |      |  x   |  x
'it' |  x   |  x   |  x   |      |  x
'ro' |  x   |  x   |  x   |  x   |
For additional details refer to the source website: https://wit3.fbk.eu/2017-01
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
- language_pair – Tuple or list containing the source and target language.
Examples:
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(train_iter)
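The language-pair tables for IWSLT2016 and IWSLT2017 can be encoded as small predicates, so that a `language_pair` value is checked before a download is attempted. This sketch reflects one reading of those tables: IWSLT2016 pairs English with each of the other four languages, and IWSLT2017 pairs any two distinct languages of its five. The function names are hypothetical and not part of torchtext.

```python
IWSLT2016_NON_ENGLISH = {"fr", "de", "cs", "ar"}
IWSLT2017_LANGS = {"en", "nl", "de", "it", "ro"}


def is_iwslt2016_pair(src, tgt):
    """True if exactly one side is 'en' and the other is fr, de, cs, or ar."""
    return (src == "en" and tgt in IWSLT2016_NON_ENGLISH) or (
        tgt == "en" and src in IWSLT2016_NON_ENGLISH
    )


def is_iwslt2017_pair(src, tgt):
    """True if src and tgt are two different languages from the 2017 set."""
    return src in IWSLT2017_LANGS and tgt in IWSLT2017_LANGS and src != tgt


default_pair_ok = is_iwslt2016_pair("de", "en") and is_iwslt2017_pair("de", "en")
```

The default `language_pair=('de', 'en')` passes both checks, which matches the signatures shown above.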
Sequence Tagging
UDPOS
torchtext.datasets.UDPOS(root='.data', split=('train', 'valid', 'test'))
UDPOS dataset. Returns the train, valid, and test splits separately.
Number of lines per split:
- train: 12543
- valid: 2002
- test: 2077
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
 
 
CoNLL2000Chunking
torchtext.datasets.CoNLL2000Chunking(root='.data', split=('train', 'test'))
CoNLL2000Chunking dataset. Returns the train and test splits separately.
Number of lines per split:
- train: 8936
- test: 2012
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
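For sequence tagging data, a tag frequency count is a useful first look at a split. This page does not describe the item layout of the tagging datasets, so the sketch below assumes, purely for illustration, that each example is a sequence of parallel columns with words in column 0 and tags in column 1. `tag_counts` and the sample data are hypothetical.

```python
from collections import Counter


def tag_counts(examples, tag_column=1):
    """Count tags across examples, where each example is a sequence of columns.

    Assumption: columns are parallel lists, e.g. (words, tags); adjust
    tag_column if the real dataset orders its columns differently.
    """
    counts = Counter()
    for columns in examples:
        counts.update(columns[tag_column])
    return counts


# Stand-in examples in the assumed (words, tags) layout.
sample = [
    (["The", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["A", "cat"], ["DET", "NOUN"]),
]
counts = tag_counts(sample)
```

Inspecting one real item with `next(iter(...))` before relying on the column order is advisable.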
 
 
Question Answer
SQuAD 1.0
torchtext.datasets.SQuAD1(root='.data', split=('train', 'dev'))
SQuAD1 dataset. Returns the train and dev splits separately.
Number of lines per split:
- train: 87599
- dev: 10570
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'dev')
 
 
SQuAD 2.0
torchtext.datasets.SQuAD2(root='.data', split=('train', 'dev'))
SQuAD2 dataset. Returns the train and dev splits separately.
Number of lines per split:
- train: 130319
- dev: 11873
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'dev')
 
 
Unsupervised Learning
EnWik9
torchtext.datasets.EnWik9(root='.data', split=('train', ))
EnWik9 dataset. Returns the train split.
Number of lines per split:
- train: 13147026
Parameters:
- root – Directory where the datasets are saved. Default: '.data'
- split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train',)