Transforming and augmenting images¶
Torchvision supports common computer vision transformations in the
torchvision.transforms and torchvision.transforms.v2 modules. Transforms
can be used to transform or augment data for training or inference of different
tasks (image classification, detection, segmentation, video classification).
# Image Classification
import torch
from torchvision.transforms import v2
H, W = 32, 32
img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)
transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = transforms(img)
# Detection (re-using imports and transforms from above)
from torchvision import tv_tensors
img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)
boxes = torch.randint(0, H // 2, size=(3, 4))
boxes[:, 2:] += boxes[:, :2]
boxes = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=(H, W))
# The same transforms can be used!
img, boxes = transforms(img, boxes)
# And you can pass arbitrary input structures
output_dict = transforms({"image": img, "boxes": boxes})
Transforms are typically passed as the transform or transforms argument
to the Datasets.
Start here¶
Whether you’re new to Torchvision transforms, or you’re already experienced with them, we encourage you to start with Getting started with transforms v2 in order to learn more about what can be done with the new v2 transforms.
Then, browse the sections below on this page for general information and performance tips. The available transforms and functionals are listed in the API reference.
More information and tutorials can also be found in our example gallery, e.g. Transforms v2: End-to-end object detection/segmentation example or How to write your own v2 transforms.
Supported input types and conventions¶
Most transformations accept both PIL images and tensor inputs. Both CPU and CUDA tensors are supported. The result of both backends (PIL or Tensors) should be very close. In general, we recommend relying on the tensor backend for performance. The conversion transforms may be used to convert to and from PIL images, or for converting dtypes and ranges.
Tensor images are expected to be of shape (C, H, W), where C is the
number of channels, and H and W refer to height and width. Most
transforms support batched tensor input. A batch of Tensor images is a tensor of
shape (N, C, H, W), where N is the number of images in the batch. The
v2 transforms generally accept an arbitrary number of leading
dimensions (..., C, H, W) and can handle batched images or batched videos.
Dtype and expected value range¶
The expected range of the values of a tensor image is implicitly defined by
the tensor dtype. Tensor images with a float dtype are expected to have
values in [0, 1]. Tensor images with an integer dtype are expected to
have values in [0, MAX_DTYPE] where MAX_DTYPE is the largest value
that can be represented in that dtype. Typically, images of dtype
torch.uint8 are expected to have values in [0, 255].
Use ToDtype to convert both the dtype and
range of the inputs.
V1 or V2? Which one should I use?¶
TL;DR We recommend using the torchvision.transforms.v2 transforms
instead of those in torchvision.transforms. They’re faster and they can do
more things. Just change the import and you should be good to go. Moving
forward, new features and improvements will only be considered for the v2
transforms.
In Torchvision 0.15 (March 2023), we released a new set of transforms available
in the torchvision.transforms.v2 namespace. These transforms have a lot of
advantages compared to the v1 ones (in torchvision.transforms):
- They can transform images but also bounding boxes, masks, or videos. This provides support for tasks beyond image classification: detection, segmentation, video classification, etc. See Getting started with transforms v2 and Transforms v2: End-to-end object detection/segmentation example. 
- They support more transforms like CutMix and MixUp. See How to use CutMix and MixUp.
- They’re faster. 
- They support arbitrary input structures (dicts, lists, tuples, etc.). 
- Future improvements and features will be added to the v2 transforms only. 
These transforms are fully backward compatible with the v1 ones, so if
you’re already using transforms from torchvision.transforms, all you need to
do is update the import to torchvision.transforms.v2. In terms of
output, there might be negligible differences due to implementation differences.
Performance considerations¶
We recommend the following guidelines to get the best performance out of the transforms:
- Rely on the v2 transforms from torchvision.transforms.v2
- Use tensors instead of PIL images 
- Use torch.uint8 dtype, especially for resizing
- Resize with bilinear or bicubic mode 
This is what a typical transform pipeline could look like:
import torch
from torchvision.transforms import v2
transforms = v2.Compose([
    v2.ToImage(),  # Convert to tensor, only needed if you had a PIL image
    v2.ToDtype(torch.uint8, scale=True),  # optional, most inputs are already uint8 at this point
    # ...
    v2.RandomResizedCrop(size=(224, 224), antialias=True),  # Or Resize(antialias=True)
    # ...
    v2.ToDtype(torch.float32, scale=True),  # Normalize expects float input
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
The above should give you the best performance in a typical training environment
that relies on the torch.utils.data.DataLoader with num_workers >
0.
Transforms tend to be sensitive to the input strides / memory format. Some
transforms will be faster with channels-first images while others prefer
channels-last. Like torch operators, most transforms will preserve the
memory format of the input, but this may not always be respected due to
implementation details. You may want to experiment a bit if you’re chasing the
very best performance.  Using torch.compile() on individual transforms may
also help factor out the memory format variable (e.g. on
Normalize). Note that we’re talking about
memory format, not tensor shape.
Note that resize transforms like Resize
and RandomResizedCrop typically prefer
channels-last input and tend not to benefit from torch.compile() at
this time.
Transform classes, functionals, and kernels¶
Transforms are available as classes like
Resize, but also as functionals like
resize() in the
torchvision.transforms.v2.functional namespace.
This is very much like the torch.nn package which defines both classes
and functional equivalents in torch.nn.functional.
The functionals support PIL images, pure tensors, or TVTensors, e.g. both resize(image_tensor) and resize(boxes) are
valid.
Note
Random transforms like RandomCrop will
randomly sample some parameter each time they’re called. Their functional
counterparts (crop()) do not do
any kind of random sampling and thus have a slightly different
parametrization. The get_params() class method of the transform class
can be used to perform parameter sampling when using the functional APIs.
The torchvision.transforms.v2.functional namespace also contains what we
call the “kernels”. These are the low-level functions that implement the
core functionalities for specific types, e.g. resize_bounding_boxes or
resized_crop_mask. They are public, although not documented. Check the
code
to see which ones are available (note that those starting with a leading
underscore are not public!). Kernels are only really useful if you want
torchscript support for types like bounding
boxes or masks.
Torchscript support¶
Most transform classes and functionals support torchscript. For composing
transforms, use torch.nn.Sequential instead of
Compose:
import torch
from torchvision.transforms.v2 import CenterCrop, Normalize
transforms = torch.nn.Sequential(
    CenterCrop(10),
    Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
)
scripted_transforms = torch.jit.script(transforms)
Warning
v2 transforms support torchscript, but if you call torch.jit.script() on
a v2 class transform, you’ll actually end up with its (scripted) v1
equivalent.  This may lead to slightly different results between the
scripted and eager executions due to implementation differences between v1
and v2.
If you really need torchscript support for the v2 transforms, we recommend
scripting the functionals from the
torchvision.transforms.v2.functional namespace to avoid surprises.
Also note that the functionals only support torchscript for pure tensors, which are always treated as images. If you need torchscript support for other types like bounding boxes or masks, you can rely on the low-level kernels.
For any custom transformations to be used with torch.jit.script, they should
be derived from torch.nn.Module.
See also: Torchscript support.
V2 API reference - Recommended¶
Geometry¶
Resizing¶
| 
 | Resize the input to the given size. | 
| 
 | Perform Large Scale Jitter on the input according to "Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation". | 
| 
 | Randomly resize the input. | 
| 
 | Randomly resize the input. | 
Functionals
| 
 | See  | 
Cropping¶
| 
 | Crop the input at a random location. | 
| 
 | Crop a random portion of the input and resize it to a given size. | 
| 
 | Random IoU crop transformation from "SSD: Single Shot MultiBox Detector". | 
| 
 | Crop the input at the center. | 
| 
 | Crop the image or video into four corners and the central crop. | 
| 
 | Crop the image or video into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default). | 
Functionals
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
Others¶
| Horizontally flip the input with a given probability. | |
| Vertically flip the input with a given probability. | |
| 
 | Pad the input on all sides with the given "pad" value. | 
| 
 | "Zoom out" transformation from "SSD: Single Shot MultiBox Detector". | 
| 
 | Rotate the input by angle. | 
| 
 | Random affine transformation of the input, keeping the center invariant. | 
| 
 | Perform a random perspective transformation of the input with a given probability. | 
| 
 | Transform the input with elastic transformations. | 
Functionals
| See  | |
| See  | |
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
Color¶
| 
 | Randomly change the brightness, contrast, saturation and hue of an image or video. | 
| Randomly permute the channels of an image or video | |
| 
 | Randomly distorts the image or video as used in SSD: Single Shot MultiBox Detector. | 
| 
 | Convert images or videos to grayscale. | 
| 
 | Convert images or videos to RGB (if they are not already RGB). | 
| 
 | Randomly convert images or videos to grayscale with a probability of p (default 0.1). | 
| 
 | Blurs image with randomly chosen Gaussian blur kernel. | 
| 
 | Add gaussian noise to images or videos. | 
| 
 | Inverts the colors of the given image or video with a given probability. | 
| 
 | Posterize the image or video with a given probability by reducing the number of bits for each color channel. | 
| 
 | Solarize the image or video with a given probability by inverting all pixel values above a threshold. | 
| 
 | Adjust the sharpness of the image or video with a given probability. | 
| Autocontrast the pixels of the given image or video with a given probability. | |
| 
 | Equalize the histogram of the given image or video with a given probability. | 
Functionals
| 
 | Permute the channels of the input according to the given permutation. | 
| 
 | See  | 
| See  | |
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | See  | 
| 
 | |
| See  | |
| 
 | |
| 
 | See  | 
| 
 | Adjust brightness. | 
| 
 | Adjust saturation. | 
| 
 | Adjust hue | 
| 
 | Adjust gamma. | 
Composition¶
| 
 | Composes several transforms together. | 
| 
 | Apply randomly a list of transformations with a given probability. | 
| 
 | Apply single transformation randomly picked from a list. | 
| 
 | Apply a list of transformations in a random order. | 
Miscellaneous¶
| Transform a tensor image or video with a square transformation matrix and a mean_vector computed offline. | |
| 
 | Normalize a tensor image or video with mean and standard deviation. | 
| 
 | Randomly select a rectangle region in the input image or video and erase its pixels. | 
| 
 | Apply a user-defined function as a transform. | 
| 
 | Remove degenerate/invalid bounding boxes and their corresponding labels and masks. | 
| Clamp bounding boxes to their corresponding image dimensions. | |
| 
 | Uniformly subsample  | 
| 
 | Apply JPEG compression and decompression to the given images. | 
Functionals
| 
 | See  | 
| 
 | See  | 
| 
 | Remove degenerate/invalid bounding boxes and return the corresponding indexing mask. | 
| 
 | See  | 
| See  | |
| 
 | See  | 
Conversion¶
Note
Beware, some of these conversion transforms below will scale the values
while performing the conversion, while some may not do any scaling. By
scaling, we mean e.g. that a uint8 -> float32 would map the [0,
255] range into [0, 1] (and vice-versa). See Dtype and expected value range.
| Convert a tensor, ndarray, or PIL Image to  | |
| Convert all TVTensors to pure tensors, removing associated metadata (if any). | |
| Convert a PIL Image to a tensor of the same type - this does not scale values. | |
| 
 | Convert a tensor or an ndarray to PIL Image | 
| 
 | Converts the input to a specific dtype, optionally scaling the values for images or videos. | 
| 
 | Convert bounding box coordinates to the given  | 
Functionals
| 
 | See  | 
| Convert a  | |
| 
 | Convert a tensor or an ndarray to PIL Image. | 
| 
 | See  | 
| See  | 
Deprecated
| [DEPRECATED] Use  | |
| 
 | [DEPRECATED] Use to_image() and to_dtype() instead. | 
| 
 | [DEPRECATED] Use  | 
| 
 | [DEPRECATED] Use to_dtype() instead. | 
Auto-Augmentation¶
AutoAugment is a common Data Augmentation technique that can improve the accuracy of Image Classification models. Though the data augmentation policies are directly linked to their trained dataset, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. In TorchVision we implemented 3 policies learned on the following datasets: ImageNet, CIFAR10 and SVHN. The new transform can be used standalone or mixed-and-matched with existing transforms:
| 
 | AutoAugment data augmentation method based on "AutoAugment: Learning Augmentation Strategies from Data". | 
| 
 | RandAugment data augmentation method based on "RandAugment: Practical automated data augmentation with a reduced search space". | 
| 
 | Dataset-independent data-augmentation with TrivialAugment Wide, as described in "TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation". | 
| 
 | AugMix data augmentation method based on "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty". | 
CutMix - MixUp¶
CutMix and MixUp are special transforms that are meant to be used on batches rather than on individual images, because they combine pairs of images together. These can be used after the dataloader (once the samples are batched), or as part of a collation function. See How to use CutMix and MixUp for detailed usage examples.
| 
 | Apply CutMix to the provided batch of images and labels. | 
| 
 | Apply MixUp to the provided batch of images and labels. | 
Developer tools¶
| 
 | Decorate a kernel to register it for a functional and a (custom) tv_tensor type. | 
V1 API Reference¶
Geometry¶
| 
 | Resize the input image to the given size. | 
| 
 | Crop the given image at a random location. | 
| 
 | Crop a random portion of image and resize it to a given size. | 
| 
 | Crops the given image at the center. | 
| 
 | Crop the given image into four corners and the central crop. | 
| 
 | Crop the given image into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default). | 
| 
 | Pad the given image on all sides with the given "pad" value. | 
| 
 | Rotate the image by angle. | 
| 
 | Random affine transformation of the image keeping center invariant. | 
| 
 | Performs a random perspective transformation of the given image with a given probability. | 
| 
 | Transform a tensor image with elastic transformations. | 
| 
 | Horizontally flip the given image randomly with a given probability. | 
| 
 | Vertically flip the given image randomly with a given probability. | 
Color¶
| 
 | Randomly change the brightness, contrast, saturation and hue of an image. | 
| 
 | Convert image to grayscale. | 
| 
 | Randomly convert image to grayscale with a probability of p (default 0.1). | 
| 
 | Blurs image with randomly chosen Gaussian blur. | 
| 
 | Inverts the colors of the given image randomly with a given probability. | 
| 
 | Posterize the image randomly with a given probability by reducing the number of bits for each color channel. | 
| 
 | Solarize the image randomly with a given probability by inverting all pixel values above a threshold. | 
| 
 | Adjust the sharpness of the image randomly with a given probability. | 
| 
 | Autocontrast the pixels of the given image randomly with a given probability. | 
| 
 | Equalize the histogram of the given image randomly with a given probability. | 
Composition¶
| 
 | Composes several transforms together. | 
| 
 | Apply randomly a list of transformations with a given probability. | 
| 
 | Apply single transformation randomly picked from a list. | 
| 
 | Apply a list of transformations in a random order. | 
Miscellaneous¶
| 
 | Transform a tensor image with a square transformation matrix and a mean_vector computed offline. | 
| 
 | Normalize a tensor image with mean and standard deviation. | 
| 
 | Randomly selects a rectangle region in a torch.Tensor image and erases its pixels. | 
| 
 | Apply a user-defined lambda as a transform. | 
Conversion¶
Note
Beware, some of these conversion transforms below will scale the values
while performing the conversion, while some may not do any scaling. By
scaling, we mean e.g. that a uint8 -> float32 would map the [0,
255] range into [0, 1] (and vice-versa). See Dtype and expected value range.
| 
 | Convert a tensor or an ndarray to PIL Image | 
| 
 | Convert a PIL Image or ndarray to tensor and scale the values accordingly. | 
| Convert a PIL Image to a tensor of the same type - this does not scale values. | |
| 
 | Convert a tensor image to the given  | 
Auto-Augmentation¶
AutoAugment is a common Data Augmentation technique that can improve the accuracy of Image Classification models. Though the data augmentation policies are directly linked to their trained dataset, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. In TorchVision we implemented 3 policies learned on the following datasets: ImageNet, CIFAR10 and SVHN. The new transform can be used standalone or mixed-and-matched with existing transforms:
| 
 | AutoAugment policies learned on different datasets. | 
| 
 | AutoAugment data augmentation method based on "AutoAugment: Learning Augmentation Strategies from Data". | 
| 
 | RandAugment data augmentation method based on "RandAugment: Practical automated data augmentation with a reduced search space". | 
| 
 | Dataset-independent data-augmentation with TrivialAugment Wide, as described in "TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation". | 
| 
 | AugMix data augmentation method based on "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty". | 
Functional Transforms¶
| 
 | Adjust brightness of an image. | 
| 
 | Adjust contrast of an image. | 
| 
 | Perform gamma correction on an image. | 
| 
 | Adjust hue of an image. | 
| 
 | Adjust color saturation of an image. | 
| 
 | Adjust the sharpness of an image. | 
| 
 | Apply affine transformation on the image keeping image center invariant. | 
| 
 | Maximize contrast of an image by remapping its pixels per channel so that the lowest becomes black and the lightest becomes white. | 
| 
 | Crops the given image at the center. | 
| 
 | Convert a tensor image to the given  | 
| 
 | Crop the given image at specified location and output size. | 
| 
 | Equalize the histogram of an image by applying a non-linear mapping to the input in order to create a uniform distribution of grayscale values in the output. | 
| 
 | Erase the input Tensor Image with given value. | 
| 
 | Crop the given image into four corners and the central crop. | 
| 
 | Performs Gaussian blurring on the image using the given kernel. | 
| 
 | Returns the dimensions of an image as [channels, height, width]. | 
| Returns the number of channels of an image. | |
| 
 | Returns the size of an image as [width, height]. | 
| 
 | Horizontally flip the given image. | 
| 
 | Invert the colors of an RGB/grayscale image. | 
| 
 | Normalize a float tensor image with mean and standard deviation. | 
| 
 | Pad the given image on all sides with the given "pad" value. | 
| 
 | Perform perspective transform of the given image. | 
| 
 | Convert a  | 
| 
 | Posterize an image by reducing the number of bits for each color channel. | 
| 
 | Resize the input image to the given size. | 
| 
 | Crop the given image and resize it to desired size. | 
| 
 | Convert RGB image to grayscale version of image. | 
| 
 | Rotate the image by angle. | 
| 
 | Solarize an RGB/grayscale image by inverting all pixel values above a threshold. | 
| 
 | Generate ten cropped images from the given image. | 
| 
 | Convert PIL image of any mode (RGB, HSV, LAB, etc) to grayscale version of image. | 
| 
 | Convert a tensor or an ndarray to PIL Image. | 
| 
 | Convert a  | 
| 
 | Vertically flip the given image. |