torchtext.transforms
This module provides common text transforms. They can be chained together using torch.nn.Sequential
or torchtext.transforms.Sequential
to support torch-scriptability.
SentencePieceTokenizer
-
class
torchtext.transforms.
SentencePieceTokenizer
(sp_model_path: str)[source]
Transform for the SentencePiece tokenizer, loaded from a pre-trained sentencepiece model.
Additional details: https://github.com/google/sentencepiece
- Parameters
sp_model_path (str) – Path to pre-trained sentencepiece model
- Example
>>> from torchtext.transforms import SentencePieceTokenizer
>>> transform = SentencePieceTokenizer("spm_model")
>>> transform(["hello world", "attention is all you need!"])
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
- Returns
tokenized text
- Return type
Union[List[str], List[List[str]]]
GPT2BPETokenizer
-
class
torchtext.transforms.
GPT2BPETokenizer
(encoder_json_path: str, vocab_bpe_path: str)[source]
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
- Returns
tokenized text
- Return type
Union[List[str], List[List[str]]]
CLIPTokenizer
-
class
torchtext.transforms.
CLIPTokenizer
(merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None)[source]
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
- Returns
tokenized text
- Return type
Union[List[str], List[List[str]]]
ToTensor
-
class
torchtext.transforms.
ToTensor
(padding_value: Optional[int] = None, dtype: torch.dtype = torch.int64)[source]
Convert input to torch tensor
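- Parameters
padding_value (Optional[int]) – Pad value used to make each sequence in a batch as long as the longest sequence in the batch
dtype (torch.dtype) – torch.dtype of the output tensor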
-
forward
(input: Any) → torch.Tensor[source]
- Parameters
input (Union[List[int], List[List[int]]]) – Sequence or batch of token ids
- Return type
Tensor
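The effect of padding_value on a batch can be illustrated without tensors. The following is a minimal plain-Python sketch, not torchtext's implementation: pad_batch is a hypothetical helper, and it assumes that shorter sequences are padded on the right up to the length of the longest sequence. The ValueError for ragged input without a padding value is also an assumption standing in for whatever error tensor construction would raise.

```python
from typing import List, Optional


def pad_batch(batch: List[List[int]], padding_value: Optional[int] = None) -> List[List[int]]:
    """Hypothetical helper: right-pad every row to the longest row's length."""
    if padding_value is None:
        # Without a padding value, rows must already share one length
        # to form a rectangular batch.
        if len({len(row) for row in batch}) > 1:
            raise ValueError("rows differ in length and no padding_value was given")
        return [list(row) for row in batch]
    longest = max((len(row) for row in batch), default=0)
    return [list(row) + [padding_value] * (longest - len(row)) for row in batch]


padded = pad_batch([[1, 2, 3], [4, 5]], padding_value=0)
print(padded)  # [[1, 2, 3], [4, 5, 0]]
```

The padded rows all have the same length, which is what allows a batch of token ids to be stored as a single 2-D tensor.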
LabelToIndex
-
class
torchtext.transforms.
LabelToIndex
(label_names: Optional[List[str]] = None, label_path: Optional[str] = None, sort_names=False)[source]
Transform labels from string names to ids.
- Parameters
label_names (Optional[List[str]]) – a list of unique label names
label_path (Optional[str]) – a path to a file of unique label names, one label per line. Supply either label_names or label_path,
but not both.
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[str, List[str]]) – Input labels to convert to corresponding ids
- Return type
Union[int, List[int]]
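The mapping from label names to ids can be sketched in plain Python. This is an illustration only, not the library's code: make_label_to_index is a hypothetical helper, and it assumes that ids follow the position of each name in label_names (or in the sorted list when sort_names is true).

```python
from typing import Callable, Dict, List, Union


def make_label_to_index(
    label_names: List[str], sort_names: bool = False
) -> Callable[[Union[str, List[str]]], Union[int, List[int]]]:
    """Hypothetical helper: build a name -> id lookup from unique label names."""
    names = sorted(label_names) if sort_names else list(label_names)
    if len(set(names)) != len(names):
        raise ValueError("label names must be unique")
    index: Dict[str, int] = {name: i for i, name in enumerate(names)}

    def transform(labels: Union[str, List[str]]) -> Union[int, List[int]]:
        # A single label maps to one id; a list maps element by element.
        if isinstance(labels, str):
            return index[labels]
        return [index[label] for label in labels]

    return transform


to_index = make_label_to_index(["neg", "pos"])
print(to_index("pos"))                  # 1
print(to_index(["pos", "neg", "pos"]))  # [1, 0, 1]
```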
Truncate
-
class
torchtext.transforms.
Truncate
(max_seq_len: int)[source]
Truncate input sequence
- Parameters
max_seq_len (int) – The maximum allowable length for input sequence
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch of sequences to be truncated
- Returns
Truncated sequence
- Return type
Union[List[Union[str, int]], List[List[Union[str, int]]]]
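As a rough illustration of the two accepted input shapes, the sketch below truncates either a single sequence or every sequence in a batch. It is a plain-Python stand-in, not torchtext's implementation, and the function name truncate is hypothetical.

```python
from typing import List, Union

Token = Union[str, int]


def truncate(
    seq_or_batch: Union[List[Token], List[List[Token]]], max_seq_len: int
) -> Union[List[Token], List[List[Token]]]:
    """Hypothetical helper: keep at most max_seq_len items per sequence."""
    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: truncate each sequence independently.
        return [seq[:max_seq_len] for seq in seq_or_batch]
    # Single sequence (or empty input): truncate it directly.
    return seq_or_batch[:max_seq_len]


print(truncate([1, 2, 3, 4, 5], 3))        # [1, 2, 3]
print(truncate([["a", "b", "c"], ["d"]], 2))  # [['a', 'b'], ['d']]
```

Sequences already shorter than max_seq_len pass through unchanged.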
AddToken
-
class
torchtext.transforms.
AddToken
(token: Union[int, str], begin: bool = True)[source]
Add token to beginning or end of sequence
- Parameters
token (Union[int, str]) – The token to be added
begin (bool, optional) – Whether to insert token at start or end of sequence, defaults to True
-
forward
(input: Any) → Any[source]
- Parameters
input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch
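The begin flag can be illustrated with a small plain-Python sketch. It is not the library's code; add_token is a hypothetical helper that assumes the token is added to every sequence when the input is a batch.

```python
from typing import List, Union

Token = Union[str, int]


def add_token(
    seq_or_batch: Union[List[Token], List[List[Token]]],
    token: Token,
    begin: bool = True,
) -> Union[List[Token], List[List[Token]]]:
    """Hypothetical helper: prepend (begin=True) or append (begin=False) a token."""

    def add_one(seq: List[Token]) -> List[Token]:
        return [token] + seq if begin else seq + [token]

    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: add the token to each sequence.
        return [add_one(seq) for seq in seq_or_batch]
    return add_one(list(seq_or_batch))


print(add_token([5, 6, 7], 0))                # [0, 5, 6, 7]
print(add_token([5, 6, 7], 2, begin=False))   # [5, 6, 7, 2]
print(add_token([["a"], ["b", "c"]], "<s>"))  # [['<s>', 'a'], ['<s>', 'b', 'c']]
```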
Sequential
-
class
torchtext.transforms.
Sequential
(*args: torch.nn.modules.module.Module)[source]
-
class
torchtext.transforms.
Sequential
(arg: OrderedDict[str, Module])
A container to host a sequence of text transforms.
-
forward
(input: Any) → Any[source]
- Parameters
input (Any) – Input sequence or batch. The input type must be supported by the first transform in the sequence.
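The idea of a container that feeds each transform's output into the next can be sketched with ordinary callables. This is a simplified plain-Python illustration, not torchtext's implementation: chain is a hypothetical helper, and the three steps are stand-ins for a tokenizer, a truncation step, and a token-adding step.

```python
from typing import Any, Callable


def chain(*transforms: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Hypothetical helper: apply the transforms in order, left to right."""

    def run(value: Any) -> Any:
        for transform in transforms:
            # The output of one step becomes the input of the next.
            value = transform(value)
        return value

    return run


pipeline = chain(
    str.split,                        # stand-in tokenizer: split on whitespace
    lambda tokens: tokens[:3],        # stand-in truncation to 3 tokens
    lambda tokens: ["<s>"] + tokens,  # stand-in for adding a start token
)

print(pipeline("attention is all you need"))  # ['<s>', 'attention', 'is', 'all']
```

Because each step only sees the previous step's output, the input passed to the pipeline has to be a type the first transform accepts, matching the note on the input parameter above.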