ColNetraEmbed

ColNetraEmbed is a state-of-the-art multilingual multimodal embedding model for visual document retrieval, built on a Gemma3 backbone with ColBERT-style multi-vector representations.

Model Description

ColNetraEmbed is a multilingual multimodal embedding model that encodes documents as multi-vector representations using the ColPali architecture. Each image patch is mapped to a contextualized embedding, enabling fine-grained matching between visual content and text queries through late interaction (MaxSim).

  • Model Type: Multilingual Multimodal Embedding Model with ColPali-style Multi-vector representations
  • Architecture: ColPali with Gemma3-4B backbone
  • Embedding Dimension: 128 per token
  • Capabilities: Multilingual, Multimodal (Vision + Text), Multi-vector late interaction
  • Use Case: Visual document retrieval, multilingual document understanding, fine-grained visual search
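
Late-interaction (MaxSim) scoring is simple to express directly. Below is a minimal sketch, assuming one query embedding of shape (num_query_tokens, 128) and one document embedding of shape (num_patches, 128); maxsim_score is an illustrative helper, not a library function:

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # Token-level similarity matrix: (num_query_tokens, num_patches)
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching patch score, then sum over tokens
    return sim.max(dim=1).values.sum()

In practice, processor.score_multi_vector (used in the Quick Start below) applies this scoring across batches of queries and documents.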

Paper

📄 M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Installation

pip install git+https://github.com/adithya-s-k/colpali.git
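
To verify that the install exposes the classes used below, a quick sanity check (assuming the class names from this fork, as used in the Quick Start):

python -c "from colpali_engine.models import ColGemma3, ColGemmaProcessor3; print('ok')"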

Quick Start

import torch
from PIL import Image
from colpali_engine.models import ColGemma3, ColGemmaProcessor3

# Load model and processor
model_name = "Cognitive-Lab/ColNetraEmbed"
model = ColGemma3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = ColGemmaProcessor3.from_pretrained(model_name)

# Load your images
images = [
    Image.open("document1.jpg"),
    Image.open("document2.jpg"),
]

# Define queries
queries = [
    "What is the total revenue?",
    "Show me the organizational chart",
]

# Process and encode
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)  # Shape: (num_images, num_patches, 128)
    query_embeddings = model(**batch_queries)  # Shape: (num_queries, num_tokens, 128)

# Compute similarity scores using MaxSim
scores = processor.score_multi_vector(
    qs=query_embeddings,
    ps=image_embeddings,
)  # Shape: (num_queries, num_images)

# Get best matches
for i, query in enumerate(queries):
    best_idx = scores[i].argmax().item()
    print(f"Query: '{query}' -> Best match: Image {best_idx + 1} (score: {scores[i, best_idx]:.2f})")

Use Cases

  • Document Retrieval: Search through large collections of visual documents (a top-k ranking sketch follows this list)
  • Visual Question Answering: Answer questions about document content
  • Document Understanding: Extract and match information from scanned documents
  • Cross-lingual Document Search: Multilingual visual document retrieval
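
For the retrieval loop itself, you typically rank every document per query and keep the top-k. A minimal sketch continuing from the Quick Start's scores tensor (variable names follow that example):

import torch

# scores: (num_queries, num_docs) from processor.score_multi_vector
top_k = min(3, scores.shape[1])
values, indices = scores.topk(k=top_k, dim=1)
for q, (vals, idxs) in enumerate(zip(values, indices)):
    print(f"Query {q}: top docs {idxs.tolist()} with scores {[round(v, 2) for v in vals.tolist()]}")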

Model Details

  • Base Model: Gemma3-4B-IT
  • Vision Encoder: SigLIP
  • Training Data: Multilingual document datasets
  • Embedding Strategy: Multi-vector (Late Interaction)
  • Similarity Function: MaxSim (Maximum Similarity)

Performance

ColNetraEmbed achieves strong performance on multilingual document retrieval benchmarks, evaluated on Nayana-IR Bench (22 languages) and ViDoRe v2.

Benchmark Results

Nayana-IR Cross-Lingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.637 | 0.700 | 0.610 | 0.610 |
| Jina-Embeddings-v4 | 0.435 | 0.435 | 0.390 | 0.548 |
| ColNomic-Embed-3B | 0.315 | 0.320 | 0.267 | 0.444 |
| ColPali-v1.3 | 0.284 | 0.347 | 0.249 | 0.403 |
| GME-Qwen2-VL-2B | 0.235 | 0.308 | 0.209 | 0.314 |
| ColQwen2.5-v0.2 | 0.143 | 0.160 | 0.127 | 0.220 |
| ColQwen2-v1.0 | 0.050 | 0.065 | 0.038 | 0.109 |

Nayana-IR Monolingual

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColNetraEmbed | 0.670 | 0.764 | 0.645 | 0.686 |
| ColNomic-Embed-3B | 0.534 | 0.603 | 0.515 | 0.546 |
| ColQwen2.5-v0.2 | 0.453 | 0.513 | 0.437 | 0.464 |
| GME-Qwen2-VL-2B | 0.444 | 0.525 | 0.426 | 0.452 |
| ColQwen2-v1.0 | 0.413 | 0.466 | 0.398 | 0.422 |
| ColPali-v1.3 | 0.410 | 0.484 | 0.393 | 0.422 |

ViDoRe v2

| Model | NDCG@5 | Recall@10 | MAP@10 | MRR@10 |
|---|---|---|---|---|
| ColQwen2.5-v0.2 | 0.592 | 0.664 | 0.484 | 0.711 |
| Jina-Embeddings-v4 | 0.576 | 0.686 | - | - |
| GME-Qwen2-VL-2B | 0.574 | 0.630 | 0.466 | 0.690 |
| ColNomic-Embed-3B | 0.556 | 0.633 | 0.451 | 0.672 |
| ColNetraEmbed | 0.551 | 0.664 | 0.445 | 0.445 |
| ColQwen2-v1.0 | 0.545 | 0.640 | 0.438 | 0.653 |
| ColPali-v1.3 | 0.538 | 0.627 | 0.436 | 0.644 |

Key Results:

  • ๐Ÿ† Strong multilingual performance with ColBERT-style late interaction
  • ๐Ÿ“ˆ 124% improvement over ColPali-v1.3 on cross-lingual tasks
  • ๐ŸŒ Supports 22 languages across diverse script families
  • ๐Ÿ” Fine-grained matching through token-level MaxSim scoring

Comparison: Multi-vector vs Single-vector

  • ColNetraEmbed (multi-vector): More interpretable, offering token-level attribution (a short attribution sketch follows this list)
  • NetraEmbed (single-vector): Higher accuracy (0.716 vs. 0.637) with a roughly 250× smaller storage footprint
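
The token-level attribution mentioned above falls out of MaxSim directly: for each query token, you can inspect which image patch it matched most strongly. A minimal sketch reusing the Quick Start variables (query_embeddings, image_embeddings; index 0 is illustrative):

# One query, one document, both from the Quick Start
q_emb = query_embeddings[0].float()   # (num_query_tokens, 128)
d_emb = image_embeddings[0].float()   # (num_patches, 128)
sim = q_emb @ d_emb.T                 # (num_query_tokens, num_patches)
best_patches = sim.argmax(dim=1)      # best-matching patch index per query token
print(best_patches.tolist())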

See our paper for comprehensive evaluation and architectural comparisons.

Citation

@misc{kolavi2025m3druniversalmultilingualmultimodal,
  title={M3DR: Towards Universal Multilingual Multimodal Document Retrieval}, 
  author={Adithya S Kolavi and Vyoman Jain},
  year={2025},
  eprint={2512.03514},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2512.03514}
}

License

This model is released under the same license as the base Gemma3 model.

Acknowledgments

This work was supported by compute credits for training, inference, and evaluation from Modal, our compute sponsor. Dataset curation and synthesis were supported by the Meta LLaMA Impact Grant through our Nayana initiative; we thank Meta for their continued support of our research at CognitiveLab.

Built on top of the ColPali framework and Gemma3 architecture.
