ColNetraEmbed

ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone with ColBERT-style multi-vector representations.

Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

  • Model Type: Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
  • Architecture: ColPali with Gemma3-4B backbone
  • Embedding Dimension: 128 per token
  • Capabilities: Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
  • Use Case: Visual document retrieval, multilingual document understanding, fine-grained visual search
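
Late-interaction (MaxSim) scoring is simple to express directly. Below is a minimal sketch, assuming one query embedding of shape (num_query_tokens, 128) and one document embedding of shape (num_patches, 128); maxsim_score is an illustrative helper, not a library function:

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # Token-level similarity matrix: (num_query_tokens, num_patches)
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching patch score, then sum over tokens
    return sim.max(dim=1).values.sum()

In practice, processor.score_multi_vector (used in the Quick Start below) applies this scoring across batches of queries and documents.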

Paper

📄 M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Installation

pip install git+https://github.com/adithya-s-k/colpali.git
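
To verify that the install exposes the classes used below, a quick sanity check (assuming the class names from this fork, as used in the Quick Start):

python -c "from colpali_engine.models import ColGemma3, ColGemmaProcessor3; print('ok')"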

Quick Start

import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")

Use Cases

  • Document Retrieval: Search through large collections of visual documents (a top-k ranking sketch follows this list)
  • Visual Question Answering: Answer questions about document content
  • Document Understanding: Extract and match information from scanned documents
  • Cross-lingual Document Search: Multilingual visual document retrieval
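
For the retrieval loop itself, you typically rank every document per query and keep the top-k. A minimal sketch continuing from the Quick Start's scores tensor (variable names follow that example):

import torch

# scores: (num_queries, num_docs) from processor.score_multi_vector
top_k = min(3, scores.shape[1])
values, indices = scores.topk(k=top_k, dim=1)
for q, (vals, idxs) in enumerate(zip(values, indices)):
    print(f"Query {q}: top docs {idxs.tolist()} with scores {[round(v, 2) for v in vals.tolist()]}")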

Model Details

  • Base Model: Gemma3-4B-IT
  • Vision Encoder: SigLIP
  • Training Data: Multilingual document datasets
  • Embedding Strategy: Multi-vector (Late Interaction)
  • Similarity Function: MaxSim (Maximum Similarity)

Performance

ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks, evaluated on Nayana-IR Bench (22 languages) and ViDoRe v2.

Benchmark Results

Nayana-IR Cross-Lingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.637 | 0.700 | 0.610 | 0.610 |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

Nayana-IR Monolingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.670 | 0.764 | 0.645 | 0.686 |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

ViDoRe v2

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| ColNetraEmbed | 0.551 | 0.664 | 0.445 | 0.445 |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

Key Results:

  • ๐Ÿ† Strong multilingual performance with ColBERT-style late interaction
  • ๐Ÿ“ˆ 124% improvement over ColPali-v1.3 on cross-lingual tasks
  • ๐ŸŒ Supports 22 languages across diverse script families
  • ๐Ÿ” Fine-grained matching through token-level MaxSim scoring

Comparison: Multi-vector vs Single-vector

  • ColNetraEmbed (multi-vector): More interpretable, offering token-level attribution (a short attribution sketch follows this list)
  • NetraEmbed (single-vector): Higher accuracy (0.716 vs. 0.637) with a roughly 250× smaller storage footprint
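
The token-level attribution mentioned above falls out of MaxSim directly: for each query token, you can inspect which image patch it matched most strongly. A minimal sketch reusing the Quick Start variables (query_embeddings, image_embeddings; index 0 is illustrative):

# One query, one document, both from the Quick Start
q_emb = query_embeddings[0].float()   # (num_query_tokens, 128)
d_emb = image_embeddings[0].float()   # (num_patches, 128)
sim = q_emb @ d_emb.T                 # (num_query_tokens, num_patches)
best_patches = sim.argmax(dim=1)      # best-matching patch index per query token
print(best_patches.tolist())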

See our paper for comprehensive evaluation and architectural comparisons.

Citation

@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}

License

This model is released under the same license as the base Gemma3 model.

Acknowledgments

This work was supported by compute credits for training, inference, and evaluation from Modal, our compute sponsor. Dataset curation and synthesis were supported by the Meta LLaMA Impact Grant through our Nayana initiative; we thank Meta for their continued support of our research at CognitiveLab.

Built on top of the ColPali framework and Gemma3 architecture.
