How to use Gemma 3 to perform inference on an image
Hi, I am new here. Could you please share a code snippet showing how to use this Gemma 3 version to perform inference on images? Specifically, I want it to filter images from an input folder into different output folders based on a set of criteria I outline in the prompt. The prompt tells it to answer YES or NO depending on whether an image meets the criteria, and my code then uses that answer to move each image to its appropriate folder.
Here is my prompt:
<image_soft_token>
Analyze the image. Does it meet BOTH criteria: 1. At least 2 football players visible. 2. At least one player performing a clear football action (kick, tackle, dribble, save etc.)? Answer ONLY YES or NO.
Here is the output:
ERROR:root:STEP 3 FAILED: Error during processor preparation for laliga_image_100.jpeg: Prompt contained 0 image tokens but received 1 images.
Traceback (most recent call last):
File "", line 40, in analyze_image_gemma3_transformers
inputs = processor(text=PROMPT_TEXT_CLASSIFY, images=img, return_tensors="pt").to(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/transformers/models/gemma3/processing_gemma3.py", line 122, in __call__
raise ValueError(
ValueError: Prompt contained 0 image tokens but received 1 images.
Found 457 image files in 'Colab_Uploads/Football_Images_Input'.
Starting Gemma 3 Transformers processing loop for 457 images...
Processing time depends on hardware (cuda).
--- DIAGNOSTIC MODE: Processing ONLY the first file: laliga_image_100.jpeg ---
--- Starting analysis for: laliga_image_100.jpeg ---
STEP 2 SUCCESS: Loaded image laliga_image_100.jpeg
DEBUG: Prompt being passed to processor:
<image_soft_token>
Analyze the image. Does it meet BOTH criteria: 1. At least 2 football players visible. 2. At least one player performing a clear football action (kick, tackle, dribble, save etc.)? Answer ONLY YES or NO.
<<<
--- Finished analysis attempt for: laliga_image_100.jpeg ---
--- DIAGNOSTIC MODE: Finished processing laliga_image_100.jpeg ---
--- Gemma 3 Transformers Processing Session Complete ---
Images attempted in this session (Gemma3 TF): 1
- Successfully classified (YES/NO): 0
- Errors (moved to 'Football_Images_Errors_Gemma3_TF'): 1
Images skipped (already processed): 0
Estimated image files remaining in 'Colab_Uploads/Football_Images_Input': 456
Check the 'Football_Images_Errors_Gemma3_TF' folder for Gemma 3 TF processing errors.
Results are in 'Football_Images_Meets_Criteria_Gemma3_TF' and 'Football_Images_Does_Not_Meet_Gemma3_TF'.
Here is Gemini 2.5 Pro's suggestion:
Okay, the added print statement confirms it perfectly.
The Variable is Correct: The "DEBUG: Prompt being passed to processor:" output clearly shows the string does start with <image_soft_token>\n.... So, the variable PROMPT_TEXT_CLASSIFY is correctly updated and passed to the function.
The Processor Fails: Despite receiving the correct prompt string containing the <image_soft_token>, the processor's internal logic (/usr/local/lib/python3.11/dist-packages/transformers/models/gemma3/processing_gemma3.py, line 122) still fails to detect it and incorrectly reports finding 0 image tokens.
Conclusion:
This definitively looks like a bug within the Gemma3Processor implementation in the transformers library specifically for the model handle google/gemma-3/transformers/gemma-3-12b-it-qat-int4-unquantized (or perhaps for Gemma 3 processing in general in the current library version).
The processor is simply not correctly parsing the special token it claims to use (<image_soft_token>) from the text input when an image is also provided.
I am running this in Google Colab. Note: the forum stripped the <> angle brackets from <image_soft_token> in my original paste; I do include the full tag in the prompt. I have restored the brackets above for clarity.
Gemini 2.5 Pro says it is a bug in the transformers library's processor for this model, and that I should report it on their GitHub, but I just want to be sure it isn't actually due to my lack of knowledge. I would be very grateful for any help on this.
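For reference, the YES/NO folder-routing step I describe could be sketched like this (folder names are illustrative; adjust to your setup). Anything other than a clean YES/NO goes to an error folder so no image is silently misfiled:

```python
import shutil
from pathlib import Path

# Illustrative folder names; adjust to your own setup.
MEETS_DIR = Path("Football_Images_Meets_Criteria")
FAILS_DIR = Path("Football_Images_Does_Not_Meet")
ERROR_DIR = Path("Football_Images_Errors")


def route_image(image_path, answer):
    """Move an image to the folder matching the model's YES/NO answer.

    Any answer other than a clean YES or NO is treated as an error so
    nothing is silently misfiled. Returns the destination path.
    """
    answer = (answer or "").strip().upper()
    if answer == "YES":
        dest_dir = MEETS_DIR
    elif answer == "NO":
        dest_dir = FAILS_DIR
    else:
        dest_dir = ERROR_DIR
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(image_path).name
    shutil.move(str(image_path), str(dest))
    return dest
```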
Hi,
Apologies for the late reply, and thanks for bringing this to our attention. Please follow the suggestions below to resolve the issue:
apply_chat_template: This is the most crucial part. Instead of manually adding the <image_soft_token> to the prompt string, create a list of dictionaries called messages. This structure is the official way to define multi-modal inputs for models that support it. The {"type": "image"} dictionary acts as a placeholder; when apply_chat_template is called, the processor correctly inserts the necessary image token and formats the input for the model.
processor.preprocess_images: After applying the chat template, the model still needs the actual image data in a format it can understand. This is done by calling processor.preprocess_images(img). This function handles resizing and normalization, and the resulting tensor of pixel values is added to the inputs dictionary under the key pixel_values.
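Putting these steps together, a minimal sketch might look like the following. Model loading is omitted: classify_image assumes a processor and model created with AutoProcessor.from_pretrained and Gemma3ForConditionalGeneration.from_pretrained, and exact API details may vary with your transformers version:

```python
def build_messages(prompt):
    """Build the chat-template message list. The {"type": "image"} entry
    is a placeholder that apply_chat_template expands into the special
    image tokens the processor expects to find in the text."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def classify_image(processor, model, img, prompt, device="cuda"):
    """Run one YES/NO classification on a PIL image."""
    # Let the processor build the correctly formatted prompt string,
    # including the image token, instead of inserting it by hand.
    text = processor.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True
    )
    # The processor pairs the text with the pixel data (resizing and
    # normalizing the image into pixel_values).
    inputs = processor(text=text, images=img, return_tensors="pt").to(device)
    generated = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip().upper()
```

The returned string can then be fed directly into the folder-routing step described in the original question.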
Thank you so much for your patience.