Transformers documentation

Image Processor

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v5.3.0).

An image processor is in charge of loading images (optionally), preparing input features for vision models, and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch and NumPy tensors. It may also include model-specific post-processing, such as converting logits to segmentation masks.

Image processors use a backend-based architecture. The class hierarchy is:

  • BaseImageProcessor — abstract base class (for backward compatibility only; do not instantiate directly)
    • TorchvisionBackend — the default torchvision-backed backend. GPU-accelerated and significantly faster than the PIL backend. All models expose a <Model>ImageProcessor class that inherits from it.
    • PilBackend — the PIL/NumPy alternative backend. Portable, CPU-only. Only available for older models via a <Model>ImageProcessorPil class; useful when exact numerical parity with the original implementation is required.

Both backends expose the same API. Use the backend attribute to inspect which backend a loaded processor uses (e.g. processor.backend == "torchvision").

Use AutoImageProcessor.from_pretrained() with the backend argument to select a backend. When backend is omitted (the default), torchvision is picked when it is installed and PIL is used otherwise. Pass an explicit string to override that choice:

from transformers import AutoImageProcessor

# Default: picks torchvision if available, otherwise pil
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")

# Explicitly request torchvision
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", backend="torchvision")

# Explicitly request PIL
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", backend="pil")

When using the torchvision backend, you can set the device argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.

from torchvision.io import read_image
from transformers import DetrImageProcessor

images = read_image("image.jpg")
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")

[Benchmark charts: speed comparisons between the torchvision and PIL backends for the DETR and RT-DETR models, and their impact on overall inference time.]

These benchmarks were run on an AWS EC2 g5.2xlarge instance with an NVIDIA A10G Tensor Core GPU.

ImageProcessingMixin

class transformers.ImageProcessingMixin

( **kwargs )

This is an image processor mixin used to provide saving/loading functionality for sequential and image feature extractors.

from_pretrained

( pretrained_model_name_or_path: str | os.PathLike cache_dir: str | os.PathLike | None = None force_download: bool = False local_files_only: bool = False token: str | bool | None = None revision: str = 'main' **kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike) — This can be either:

    • a string, the model id of a pretrained image_processor hosted inside a model repo on huggingface.co.
    • a path to a directory containing an image processor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
    • a path or url to a saved image processor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
  • cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model image processor should be cached if the standard cache should not be used.
  • force_download (bool, optional, defaults to False) — Whether or not to force (re-)downloading the image processor files, overriding the cached versions if they exist.
  • proxies (dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
  • token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface).
  • revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.

Instantiate a type of ImageProcessingMixin from an image processor.

Examples:

# We can't instantiate the base class *ImageProcessingMixin* directly, so the examples below use a
# derived class: *CLIPImageProcessor*
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)  # Download image_processing_config from huggingface.co and cache.
image_processor = CLIPImageProcessor.from_pretrained(
    "./test/saved_model/"
)  # E.g. image processor (or model) was saved using *save_pretrained('./test/saved_model/')*
image_processor = CLIPImageProcessor.from_pretrained("./test/saved_model/preprocessor_config.json")
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False
)
assert image_processor.do_normalize is False
image_processor, unused_kwargs = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False, return_unused_kwargs=True
)
assert image_processor.do_normalize is False
assert unused_kwargs == {"foo": False}

save_pretrained

( save_directory: str | os.PathLike push_to_hub: bool = False **kwargs )

Parameters

  • save_directory (str or os.PathLike) — Directory where the image processor JSON file will be saved (will be created if it does not exist).
  • push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
  • kwargs (dict[str, Any], optional) — Additional key word arguments passed along to the push_to_hub() method.

Save an image processor object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.
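The save/load round trip can be sketched without touching the Hub; this assumes transformers (and the image processor's backend dependencies) are installed, and the size value is just an illustrative setting:

```python
import tempfile

from transformers import CLIPImageProcessor

# Build a processor locally (no download), save it, and reload it from disk.
processor = CLIPImageProcessor(size={"shortest_edge": 224})
with tempfile.TemporaryDirectory() as tmp_dir:
    processor.save_pretrained(tmp_dir)  # writes preprocessor_config.json into tmp_dir
    reloaded = CLIPImageProcessor.from_pretrained(tmp_dir)

assert reloaded.size == processor.size
```

from_pretrained() accepts the directory path because save_pretrained() stores the configuration there as preprocessor_config.json.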

BatchFeature

class transformers.BatchFeature

( data: dict[str, typing.Any] | None = None tensor_type: None | str | transformers.utils.generic.TensorType = None skip_tensor_conversion: list[str] | set[str] | None = None )

Parameters

  • data (dict, optional) — Dictionary of lists/arrays/tensors returned by the __call__/pad methods ('input_values', 'attention_mask', etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can pass a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
  • skip_tensor_conversion (list[str] or set[str], optional) — List or set of keys that should NOT be converted to tensors, even when tensor_type is specified.

Holds the output of the pad() and feature-extractor-specific __call__ methods.

This class is derived from a Python dictionary and can be used as a dictionary.
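Because it subclasses a dictionary, a BatchFeature supports ordinary dict operations; a minimal sketch (the pixel_values data here is made up):

```python
import numpy as np

from transformers import BatchFeature

# BatchFeature behaves like a plain dict of model inputs.
features = BatchFeature(data={"pixel_values": [np.zeros((3, 4, 4)), np.ones((3, 4, 4))]})

assert "pixel_values" in features                 # membership test
assert list(features.keys()) == ["pixel_values"]  # key iteration
first_image = features["pixel_values"][0]         # item lookup
```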

convert_to_tensors

( tensor_type: str | transformers.utils.generic.TensorType | None = None skip_tensor_conversion: list[str] | set[str] | None = None )

Parameters

  • tensor_type (str or TensorType, optional) — The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.
  • skip_tensor_conversion (list[str] or set[str], optional) — List or set of keys that should NOT be converted to tensors, even when tensor_type is specified.

Convert the inner content to tensors.

Note: Values that don’t have an array-like structure (e.g., strings, dicts, lists of strings) are automatically skipped and won’t be converted to tensors. Conversion is still attempted for ragged arrays (lists of arrays of different lengths), which may raise errors.
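Conversion behavior can be checked directly; a minimal sketch with made-up data:

```python
import numpy as np

from transformers import BatchFeature

features = BatchFeature(data={"pixel_values": [[0.1, 0.2], [0.3, 0.4]]})
# Convert the inner lists to NumPy arrays; passing tensor_type="np" at
# construction time has the same effect.
features.convert_to_tensors(tensor_type="np")

assert isinstance(features["pixel_values"], np.ndarray)
assert features["pixel_values"].shape == (2, 2)
```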

to

( *args **kwargs ) BatchFeature

Parameters

  • args (Tuple) — Will be passed to the to(...) function of the tensors.
  • kwargs (Dict, optional) — Will be passed to the to(...) function of the tensors. To enable asynchronous data transfer, set the non_blocking flag in kwargs (defaults to False).

Returns

BatchFeature

The same instance after modification.

Send all values to device by calling v.to(*args, **kwargs) (PyTorch only). This should support casting in different dtypes and sending the BatchFeature to a different device.
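A minimal sketch of a dtype cast (device moves such as .to("cuda") work the same way; the data here is made up):

```python
import torch

from transformers import BatchFeature

features = BatchFeature(data={"pixel_values": torch.zeros(1, 3, 4, 4)})
# Cast the floating-point tensor values to half precision.
features = features.to(torch.float16)

assert features["pixel_values"].dtype == torch.float16
```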

BaseImageProcessor

class transformers.BaseImageProcessor

( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

Base class for image processors with an inheritance-based backend architecture.

This class defines the preprocessing pipeline: kwargs validation, input preparation, and dispatching to the backend’s _preprocess method. Backend subclasses (TorchvisionBackend, PilBackend) inherit from this class and implement the actual image operations (resize, crop, rescale, normalize, etc.). Model-specific image processors then inherit from the appropriate backend class.

Architecture Overview

The class hierarchy is:

BaseImageProcessor (this class)
├── TorchvisionBackend (GPU-accelerated, torch.Tensor)
│   └── ModelImageProcessor (e.g. LlavaNextImageProcessor)
└── PilBackend (portable CPU, np.ndarray)
    └── ModelImageProcessorPil (e.g. CLIPImageProcessorPil)

The preprocessing flow is:

__call__() → preprocess() → _preprocess_image_like_inputs() → _prepare_image_like_inputs() (calls process_image per image) → _preprocess() (batch operations: resize, crop, etc.)

  • process_image: Implemented by backends. Converts a single raw input (PIL, NumPy, or Tensor) to the backend’s working format (torch.Tensor or np.ndarray), handles RGB conversion and channel reordering.
  • _preprocess: Implemented by backends. Performs the actual batch processing (resize, center crop, rescale, normalize, pad) and returns a BatchFeature.

Basic Implementation

For processors that only need standard operations (resize, center crop, rescale, normalize), inherit from a backend and define class attributes:

from transformers.image_processing_backends import PilBackend
from transformers.image_utils import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD, PILImageResampling

class MyImageProcessorPil(PilBackend):
    resample = PILImageResampling.BILINEAR
    image_mean = IMAGENET_DEFAULT_MEAN
    image_std = IMAGENET_DEFAULT_STD
    size = {"height": 224, "width": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True

The backend’s _preprocess method handles the standard pipeline automatically.

Custom Processing

For processors that need custom logic (e.g., patch-based processing, multiple input types), override _preprocess in your model-specific processor. The _preprocess method receives already-prepared images (converted to the backend format with channels-first ordering) and performs the actual processing:

class MyImageProcessor(TorchvisionBackend):
    def _preprocess(self, images, do_resize, size, do_normalize, image_mean, image_std, **kwargs):
        # Group images by shape for efficient batched operations
        grouped_images, grouped_images_index = group_images_by_shape(images)
        processed_groups = {}
        for shape, stacked_images in grouped_images.items():
            if do_resize:
                stacked_images = self.resize(stacked_images, size=size)
            if do_normalize:
                stacked_images = self.normalize(stacked_images, mean=image_mean, std=image_std)
            processed_groups[shape] = stacked_images
        processed_images = reorder_images(processed_groups, grouped_images_index)
        return BatchFeature(data={"pixel_values": processed_images})

For processors handling multiple input types (e.g., images + segmentation maps), override _preprocess_image_like_inputs:

def _preprocess_image_like_inputs(
    self,
    images: ImageInput,
    segmentation_maps: ImageInput | None = None,
    **kwargs,
) -> BatchFeature:
    images = self._prepare_image_like_inputs(images, **kwargs)
    batch_feature = self._preprocess(images, **kwargs)

    if segmentation_maps is not None:
        maps = self._prepare_image_like_inputs(segmentation_maps, **kwargs)
        batch_feature["labels"] = self._preprocess(maps, **kwargs).pixel_values

    return batch_feature

Extending Backend Behavior

To customize operations for a specific backend, subclass the backend and override its methods:

from transformers.image_processing_backends import TorchvisionBackend, PilBackend

class MyTorchvisionProcessor(TorchvisionBackend):
    def resize(self, image, size, **kwargs):
        # Custom resize logic for torchvision
        return super().resize(image, size, **kwargs)

class MyPilProcessor(PilBackend):
    def resize(self, image, size, **kwargs):
        # Custom resize logic for PIL
        return super().resize(image, size, **kwargs)

Custom Parameters

To add parameters beyond ImagesKwargs, create a custom kwargs class and set it as valid_kwargs:

class MyImageProcessorKwargs(ImagesKwargs):
    custom_param: int | None = None

class MyImageProcessor(TorchvisionBackend):
    valid_kwargs = MyImageProcessorKwargs
    custom_param = 10  # default value

Key Notes

  • Backend selection is done at the class level: inherit from TorchvisionBackend or PilBackend
  • Backends receive images as torch.Tensor (Torchvision) or np.ndarray (PIL), always channels-first
  • All images have channel dimension first during processing, regardless of backend
  • Arguments not provided by users default to class attribute values
  • Backend classes encapsulate backend-specific logic (resize, normalize, etc.) and can be overridden

center_crop

( image: ndarray size: dict data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs )

Parameters

  • image (np.ndarray) — Image to center crop.
  • size (dict[str, int]) — Size of the output image.
  • data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Center crop an image to (size["height"], size["width"]). If the input size is smaller than crop_size along any edge, the image is padded with 0’s and then center cropped.
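The pad-then-crop behavior can be sketched in plain NumPy (illustrative only, not the library implementation; center_crop_with_pad is a made-up helper name):

```python
import numpy as np

def center_crop_with_pad(image: np.ndarray, height: int, width: int) -> np.ndarray:
    """Center crop a channels-first (C, H, W) image, zero-padding it if too small."""
    _, h, w = image.shape
    # Zero-pad symmetrically along any edge smaller than the crop size.
    pad_h, pad_w = max(height - h, 0), max(width - w, 0)
    image = np.pad(
        image,
        ((0, 0), (pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
    )
    _, h, w = image.shape
    top, left = (h - height) // 2, (w - width) // 2
    return image[:, top : top + height, left : left + width]

# A 2x2 image cropped to 4x4 is zero-padded first, then center cropped.
cropped = center_crop_with_pad(np.ones((3, 2, 2)), 4, 4)
```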

normalize

( image: ndarray mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs ) np.ndarray

Parameters

  • image (np.ndarray) — Image to normalize.
  • mean (float or Iterable[float]) — Image mean to use for normalization.
  • std (float or Iterable[float]) — Image standard deviation to use for normalization.
  • data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Returns

np.ndarray

The normalized image.

Normalize an image. image = (image - image_mean) / image_std.
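The formula amounts to per-channel broadcasting; a NumPy sketch on a channels-first image (the mean/std values are the common ImageNet constants, used here purely as an example):

```python
import numpy as np

image = np.full((3, 2, 2), 0.5, dtype=np.float32)
# Reshape mean/std to (C, 1, 1) so they broadcast over height and width.
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(-1, 1, 1)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(-1, 1, 1)
normalized = (image - mean) / std

assert normalized.shape == image.shape
```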

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] ) ~image_processing_base.BatchFeature

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
  • **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.

Returns

~image_processing_base.BatchFeature

  • data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can pass a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.

process_image

( *args **kwargs )

Process a single raw image into the backend’s working format.

Implemented by backend subclasses (TorchvisionBackend, PilBackend). Converts a raw input (PIL Image, NumPy array, or torch Tensor) to the backend’s internal format (torch.Tensor for Torchvision, np.ndarray for PIL), handles RGB conversion and ensures channels-first ordering.

rescale

( image: ndarray scale: float data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs ) np.ndarray

Parameters

  • image (np.ndarray) — Image to rescale.
  • scale (float) — The scaling factor to rescale pixel values by.
  • data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

Returns

np.ndarray

The rescaled image.

Rescale an image by a scale factor. image = image * scale.
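A one-line NumPy sketch: with the usual scale of 1/255, 8-bit pixel values land in [0, 1]:

```python
import numpy as np

image = np.array([[0, 128, 255]], dtype=np.uint8)
# image * scale promotes uint8 to float and maps [0, 255] into [0.0, 1.0].
rescaled = image * (1 / 255)

assert np.isclose(rescaled.max(), 1.0)
```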

TorchvisionBackend

class transformers.TorchvisionBackend

( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

Torchvision backend for GPU-accelerated batched image processing.

center_crop

( image: torch.Tensor size: SizeDict **kwargs )

Center crop an image using Torchvision.

convert_to_rgb

( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] )

Convert an image to RGB format.

normalize

( image: torch.Tensor mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] **kwargs )

Normalize an image using Torchvision.

pad

( images: list pad_size: SizeDict = None fill_value: int | None = 0 padding_mode: str | None = 'constant' return_mask: bool = False disable_grouping: bool | None = False is_nested: bool | None = False **kwargs )

Pad images using Torchvision with batched operations.

process_image

( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: bool | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None device: typing.Optional[ForwardRef('torch.device')] = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

Process a single image for torchvision backend.

rescale

( image: torch.Tensor scale: float **kwargs )

Rescale an image by a scale factor using Torchvision.

rescale_and_normalize

( images: torch.Tensor do_rescale: bool rescale_factor: float do_normalize: bool image_mean: float | list[float] image_std: float | list[float] )

Rescale and normalize images using Torchvision (fused for efficiency).
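Fusing is possible because both steps are affine: rescale-then-normalize collapses into one multiply-and-subtract. A PyTorch sketch of the math (illustrative, not the library code; the constants are made up):

```python
import torch

image = torch.full((1, 3, 4, 4), 128.0)
rescale_factor = 1 / 255
mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1)

# Two-pass reference: rescale, then normalize.
reference = (image * rescale_factor - mean) / std
# Fused single pass: fold the constants into one scale and one shift.
scale, shift = rescale_factor / std, mean / std
fused = image * scale - shift

assert torch.allclose(fused, reference)
```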

resize

( image: torch.Tensor size: SizeDict resample: PILImageResampling | tvF.InterpolationMode | int | None = None antialias: bool = True **kwargs )

Resize an image using Torchvision.

PilBackend

class transformers.PilBackend

( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

PIL/NumPy backend for portable CPU-only image processing.

center_crop

( image: ndarray size: SizeDict **kwargs )

Center crop an image using NumPy.

convert_to_rgb

( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] )

Convert an image to RGB format.

normalize

( image: ndarray mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] **kwargs )

Normalize an image using NumPy.

pad

( images: list pad_size: SizeDict = None fill_value: int | None = 0 padding_mode: str | None = 'constant' return_mask: bool = False **kwargs )

Pad images to specified size using NumPy.

process_image

( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: bool | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )

Process a single image for PIL backend.

rescale

( image: ndarray scale: float **kwargs )

Rescale an image by a scale factor using NumPy.

resize

( image: ndarray size: SizeDict resample: typing.Union[ForwardRef('PILImageResampling'), ForwardRef('tvF.InterpolationMode'), int, NoneType] = None reducing_gap: int | None = None **kwargs )

Resize an image using PIL/NumPy.
