Transformers documentation
Image Processor
An image processor is in charge of loading images (optionally), preparing input features for vision models, and post-processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch and NumPy tensors. It may also include model-specific post-processing, such as converting logits to segmentation masks.
Image processors use a backend-based architecture. The class hierarchy is:
- BaseImageProcessor — abstract base class (for backward compatibility only; do not instantiate directly)
- TorchvisionBackend — the default torchvision-backed backend. GPU-accelerated and significantly faster than the PIL backend. All models expose a <Model>ImageProcessor class that inherits from it.
- PilBackend — the PIL/NumPy alternative backend. Portable, CPU-only. Only available for older models via a <Model>ImageProcessorPil class; useful when exact numerical parity with the original implementation is required.
Both backends expose the same API. Use the backend attribute to inspect which backend a loaded processor uses (e.g. processor.backend == "torchvision").
Use AutoImageProcessor.from_pretrained() with the backend argument to select a backend. When backend is omitted (the default), torchvision is picked when it is installed and PIL is used otherwise. Pass an explicit string to override that choice:
from transformers import AutoImageProcessor
# Default: picks torchvision if available, otherwise pil
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
# Explicitly request torchvision
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", backend="torchvision")
# Explicitly request PIL
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", backend="pil")
When using the torchvision backend, you can set the device argument to specify the device on which the processing should be done. By default, processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise.
from torchvision.io import read_image
from transformers import DetrImageProcessor
images = read_image("image.jpg")
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")
Here are some speed comparisons between the torchvision and PIL backends for the DETR and RT-DETR models, and how they impact overall inference time:
These benchmarks were run on an AWS EC2 g5.2xlarge instance, utilizing an NVIDIA A10G Tensor Core GPU.
ImageProcessingMixin
This is an image processor mixin used to provide saving/loading functionality for sequential and image feature extractors.
from_pretrained
< source >( pretrained_model_name_or_path: str | os.PathLike cache_dir: str | os.PathLike | None = None force_download: bool = False local_files_only: bool = False token: str | bool | None = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — This can be either:
  - a string, the model id of a pretrained image processor hosted inside a model repo on huggingface.co.
  - a path to a directory containing an image processor file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  - a path or url to a saved image processor JSON file, e.g., ./my_model_directory/preprocessor_config.json.
- cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model image processor should be cached if the standard cache should not be used.
- force_download (bool, optional, defaults to False) — Whether or not to force a (re-)download of the image processor files, overriding any cached versions.
- proxies (dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
- token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface).
- revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.
Instantiate a type of ImageProcessingMixin from an image processor.
Examples:
# We can't instantiate directly the base class *ImageProcessingMixin* so let's show the examples on a
# derived class: *CLIPImageProcessor*
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32"
)  # Download image_processing_config from huggingface.co and cache.
image_processor = CLIPImageProcessor.from_pretrained(
    "./test/saved_model/"
)  # E.g. image processor (or model) was saved using *save_pretrained('./test/saved_model/')*
image_processor = CLIPImageProcessor.from_pretrained("./test/saved_model/preprocessor_config.json")
image_processor = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False
)
assert image_processor.do_normalize is False
image_processor, unused_kwargs = CLIPImageProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", do_normalize=False, foo=False, return_unused_kwargs=True
)
assert image_processor.do_normalize is False
assert unused_kwargs == {"foo": False}
save_pretrained
< source >( save_directory: str | os.PathLike push_to_hub: bool = False **kwargs )
Parameters
- save_directory (str or os.PathLike) — Directory where the image processor JSON file will be saved (will be created if it does not exist).
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- kwargs (dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Save an image processor object to the directory save_directory, so that it can be re-loaded using the
from_pretrained() class method.
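For example, a save/reload round trip through a local temporary directory might look like this (a sketch; it assumes CLIPImageProcessor is available and accepts a do_normalize override):

```python
import tempfile

from transformers import CLIPImageProcessor

# Build a processor with a non-default setting, save it, and reload it locally.
processor = CLIPImageProcessor(do_normalize=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    processor.save_pretrained(tmp_dir)  # writes preprocessor_config.json
    reloaded = CLIPImageProcessor.from_pretrained(tmp_dir)

assert reloaded.do_normalize is False
```

The reloaded processor carries the same configuration as the one that was saved, so overrides survive the round trip.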
BatchFeature
class transformers.BatchFeature
< source >( data: dict[str, typing.Any] | None = None tensor_type: None | str | transformers.utils.generic.TensorType = None skip_tensor_conversion: list[str] | set[str] | None = None )
Parameters
- data (dict, optional) — Dictionary of lists/arrays/tensors returned by the __call__/pad methods ('input_values', 'attention_mask', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
- skip_tensor_conversion (list[str] or set[str], optional) — List or set of keys that should NOT be converted to tensors, even when tensor_type is specified.
Holds the output of the pad() and feature extractor specific __call__ methods.
This class is derived from a Python dictionary and can be used as a dictionary.
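For illustration (using dummy NumPy data in place of real model features), a BatchFeature can be built and queried like a dictionary:

```python
import numpy as np
from transformers import BatchFeature

# Two 3-channel 4x4 "images"; tensor_type="np" stacks the list into one array.
features = BatchFeature(
    data={"pixel_values": [np.zeros((3, 4, 4)), np.ones((3, 4, 4))]},
    tensor_type="np",
)

print(list(features.keys()))           # dict-style access works
print(features["pixel_values"].shape)  # (2, 3, 4, 4)
```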
convert_to_tensors
< source >( tensor_type: str | transformers.utils.generic.TensorType | None = None skip_tensor_conversion: list[str] | set[str] | None = None )
Parameters
- tensor_type (str or TensorType, optional) — The type of tensors to use. If str, should be one of the values of the enum TensorType. If None, no modification is done.
- skip_tensor_conversion (list[str] or set[str], optional) — List or set of keys that should NOT be converted to tensors, even when tensor_type is specified.
Convert the inner content to tensors.
Note: Values that don’t have an array-like structure (e.g., strings, dicts, lists of strings) are automatically skipped and won’t be converted to tensors. Ragged arrays (lists of arrays with different lengths) are still attempted, though they may raise errors during conversion.
to
< source >( *args **kwargs ) → BatchFeature
Parameters
- args (Tuple) — Will be passed to the to(...) function of the tensors.
- kwargs (Dict, optional) — Will be passed to the to(...) function of the tensors. To enable asynchronous data transfer, set the non_blocking flag in kwargs (defaults to False).
Returns
The same instance after modification.
Send all values to device by calling v.to(*args, **kwargs) (PyTorch only). This should support casting in
different dtypes and sending the BatchFeature to a different device.
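For example (PyTorch required; the dummy tensor here is illustrative):

```python
import torch
from transformers import BatchFeature

features = BatchFeature(data={"pixel_values": torch.zeros(1, 3, 4, 4)})

# Cast every tensor value in place; device arguments work the same way,
# e.g. features.to("cuda:0") or features.to("cuda:0", non_blocking=True).
features = features.to(torch.float16)

assert features["pixel_values"].dtype == torch.float16
```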
BaseImageProcessor
class transformers.BaseImageProcessor
< source >( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )
Base class for image processors with an inheritance-based backend architecture.
This class defines the preprocessing pipeline: kwargs validation, input preparation, and dispatching to the
backend’s _preprocess method. Backend subclasses (TorchvisionBackend, PilBackend) inherit from this class
and implement the actual image operations (resize, crop, rescale, normalize, etc.). Model-specific image
processors then inherit from the appropriate backend class.
Architecture Overview
The class hierarchy is:
BaseImageProcessor (this class)
├── TorchvisionBackend (GPU-accelerated, torch.Tensor)
│   └── ModelImageProcessor (e.g. LlavaNextImageProcessor)
└── PilBackend (portable CPU, np.ndarray)
    └── ModelImageProcessorPil (e.g. CLIPImageProcessorPil)
The preprocessing flow is:
__call__() → preprocess() → _preprocess_image_like_inputs() → _prepare_image_like_inputs() (calls process_image per image) → _preprocess() (batch operations: resize, crop, etc.)
- process_image: Implemented by backends. Converts a single raw input (PIL, NumPy, or Tensor) to the backend's working format (torch.Tensor or np.ndarray), handles RGB conversion and channel reordering.
- _preprocess: Implemented by backends. Performs the actual batch processing (resize, center crop, rescale, normalize, pad) and returns a BatchFeature.
Basic Implementation
For processors that only need standard operations (resize, center crop, rescale, normalize), inherit from a backend and define class attributes:
from transformers.image_processing_backends import PilBackend

class MyImageProcessorPil(PilBackend):
    resample = PILImageResampling.BILINEAR
    image_mean = IMAGENET_DEFAULT_MEAN
    image_std = IMAGENET_DEFAULT_STD
    size = {"height": 224, "width": 224}
    do_resize = True
    do_rescale = True
    do_normalize = True
The backend’s _preprocess method handles the standard pipeline automatically.
Custom Processing
For processors that need custom logic (e.g., patch-based processing, multiple input types), override
_preprocess in your model-specific processor. The _preprocess method receives already-prepared images
(converted to the backend format with channels-first ordering) and performs the actual processing:
class MyImageProcessor(TorchvisionBackend):
    def _preprocess(self, images, do_resize, size, do_normalize, image_mean, image_std, **kwargs):
        # Group images by shape for efficient batched operations
        grouped_images, grouped_images_index = group_images_by_shape(images)
        processed_groups = {}
        for shape, stacked_images in grouped_images.items():
            if do_resize:
                stacked_images = self.resize(stacked_images, size=size)
            if do_normalize:
                stacked_images = self.normalize(stacked_images, mean=image_mean, std=image_std)
            processed_groups[shape] = stacked_images
        processed_images = reorder_images(processed_groups, grouped_images_index)
        return BatchFeature(data={"pixel_values": processed_images})
For processors handling multiple input types (e.g., images + segmentation maps), override
_preprocess_image_like_inputs:
def _preprocess_image_like_inputs(
    self,
    images: ImageInput,
    segmentation_maps: ImageInput | None = None,
    **kwargs,
) -> BatchFeature:
    images = self._prepare_image_like_inputs(images, **kwargs)
    batch_feature = self._preprocess(images, **kwargs)

    if segmentation_maps is not None:
        maps = self._prepare_image_like_inputs(segmentation_maps, **kwargs)
        batch_feature["labels"] = self._preprocess(maps, **kwargs).pixel_values

    return batch_feature
Extending Backend Behavior
To customize operations for a specific backend, subclass the backend and override its methods:
from transformers.image_processing_backends import TorchvisionBackend, PilBackend

class MyTorchvisionProcessor(TorchvisionBackend):
    def resize(self, image, size, **kwargs):
        # Custom resize logic for torchvision
        return super().resize(image, size, **kwargs)

class MyPilProcessor(PilBackend):
    def resize(self, image, size, **kwargs):
        # Custom resize logic for PIL
        return super().resize(image, size, **kwargs)
Custom Parameters
To add parameters beyond ImagesKwargs, create a custom kwargs class and set it as valid_kwargs:
class MyImageProcessorKwargs(ImagesKwargs):
    custom_param: int | None = None

class MyImageProcessor(TorchvisionBackend):
    valid_kwargs = MyImageProcessorKwargs
    custom_param = 10  # default value
Key Notes
- Backend selection is done at the class level: inherit from TorchvisionBackend or PilBackend
- Backends receive images as torch.Tensor (torchvision) or np.ndarray (PIL)
- All images have the channel dimension first during processing, regardless of backend
- Arguments not provided by users default to class attribute values
- Backend classes encapsulate backend-specific logic (resize, normalize, etc.) and can be overridden
center_crop
< source >( image: ndarray size: dict data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs )
Parameters
- image (np.ndarray) — Image to center crop.
- size (dict[str, int]) — Size of the output image.
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
Center crop an image to (size["height"], size["width"]). If the input size is smaller than the crop size along any edge, the image is padded with zeros and then center cropped.
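The pad-then-crop behavior can be sketched in plain NumPy (an illustration of the logic for a channels-first array, not the library implementation):

```python
import numpy as np

def pad_then_center_crop(image: np.ndarray, height: int, width: int) -> np.ndarray:
    """Zero-pad any edge shorter than the target, then crop around the center."""
    c, h, w = image.shape
    pad_h, pad_w = max(height - h, 0), max(width - w, 0)
    image = np.pad(
        image,
        ((0, 0), (pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
    )
    _, h, w = image.shape
    top, left = (h - height) // 2, (w - width) // 2
    return image[:, top : top + height, left : left + width]

# A 3x2x8 image cropped to 4x4: the height is padded, the width is cropped.
cropped = pad_then_center_crop(np.ones((3, 2, 8)), height=4, width=4)
assert cropped.shape == (3, 4, 4)
assert cropped[:, 0].sum() == 0  # top row came from zero padding
```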
normalize
< source >( image: ndarray mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs ) → np.ndarray
Parameters
- image (np.ndarray) — Image to normalize.
- mean (float or Iterable[float]) — Image mean to use for normalization.
- std (float or Iterable[float]) — Image standard deviation to use for normalization.
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
Returns
np.ndarray
The normalized image.
Normalize an image. image = (image - image_mean) / image_std.
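As a worked example of the formula (a minimal NumPy sketch using the common ImageNet per-channel mean/std values; not the library call itself):

```python
import numpy as np

# Channels-first image with every pixel at 0.5, normalized per channel:
# normalized = (image - mean) / std
image = np.full((3, 2, 2), 0.5)
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)  # broadcast over H, W
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

normalized = (image - mean) / std
```

Reshaping mean and std to (3, 1, 1) lets NumPy broadcast the per-channel statistics over the spatial dimensions.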
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] ) → ~image_processing_base.BatchFeature
Parameters
- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list[PIL.Image.Image], list[numpy.ndarray], list[torch.Tensor]]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- return_tensors (str or TensorType, optional) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- **kwargs (ImagesKwargs, optional) — Additional image preprocessing options. Model-specific kwargs are listed above; see the TypedDict class for the complete list of supported arguments.
Returns
~image_processing_base.BatchFeature
- data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/NumPy tensors at initialization.
Preprocess an image or batch of images and return a BatchFeature.
rescale
< source >( image: ndarray scale: float data_format: str | transformers.image_utils.ChannelDimension | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs ) → np.ndarray
Parameters
- image (np.ndarray) — Image to rescale.
- scale (float) — The scaling factor to rescale pixel values by.
- data_format (str or ChannelDimension, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  - "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
Returns
np.ndarray
The rescaled image.
Rescale an image by a scale factor. image = image * scale.
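For instance, the typical rescale step maps uint8 pixel values into [0, 1] with a factor of 1/255 (a plain NumPy sketch of the formula, not the library call):

```python
import numpy as np

# image = image * scale, with scale = 1/255 for 8-bit inputs
image = np.array([[0, 128, 255]], dtype=np.uint8)
rescaled = image * (1 / 255)
```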
TorchvisionBackend
class transformers.TorchvisionBackend
< source >( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )
Torchvision backend for GPU-accelerated batched image processing.
center_crop
Center crop an image using Torchvision.
convert_to_rgb
< source >( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] )
Convert an image to RGB format.
normalize
< source >( image: torch.Tensor mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] **kwargs )
Normalize an image using Torchvision.
pad
< source >( images: list pad_size: SizeDict = None fill_value: int | None = 0 padding_mode: str | None = 'constant' return_mask: bool = False disable_grouping: bool | None = False is_nested: bool | None = False **kwargs )
Pad images using Torchvision with batched operations.
process_image
< source >( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: bool | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None device: typing.Optional[ForwardRef('torch.device')] = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )
Process a single image for torchvision backend.
rescale
Rescale an image by a scale factor using Torchvision.
rescale_and_normalize
< source >( images: torch.Tensor do_rescale: bool rescale_factor: float do_normalize: bool image_mean: float | list[float] image_std: float | list[float] )
Rescale and normalize images using Torchvision (fused for efficiency).
resize
< source >( image: torch.Tensor size: SizeDict resample: PILImageResampling | tvF.InterpolationMode | int | None = None antialias: bool = True **kwargs )
Resize an image using Torchvision.
PilBackend
class transformers.PilBackend
< source >( **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )
PIL/NumPy backend for portable CPU-only image processing.
center_crop
Center crop an image using NumPy.
convert_to_rgb
< source >( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] )
Convert an image to RGB format.
normalize
< source >( image: ndarray mean: float | collections.abc.Iterable[float] std: float | collections.abc.Iterable[float] **kwargs )
Normalize an image using NumPy.
pad
< source >( images: list pad_size: SizeDict = None fill_value: int | None = 0 padding_mode: str | None = 'constant' return_mask: bool = False **kwargs )
Pad images to specified size using NumPy.
process_image
< source >( image: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_convert_rgb: bool | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ImagesKwargs] )
Process a single image for PIL backend.
rescale
Rescale an image by a scale factor using NumPy.
resize
< source >( image: ndarray size: SizeDict resample: typing.Union[ForwardRef('PILImageResampling'), ForwardRef('tvF.InterpolationMode'), int, NoneType] = None reducing_gap: int | None = None **kwargs )
Resize an image using PIL/NumPy.