LateOn-Code & ColGrep: LightOn unveils state-of-the-art code retrieval models and code search tooling Feb 12 • 53
LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR Oct 23, 2025 • 73
LightOnOCR-2 🦉
LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family
lightonai/LightOnOCR-2-1B Image-Text-to-Text • 1B • Updated Feb 20 • 661k • 639
lightonai/LightOnOCR-2-1B-bbox Image-Text-to-Text • 1B • Updated Jan 23 • 4.19k • 23
LightOnOCR 2 1B Demo 🐨 Space • Running on Zero • Featured • 108 • Extract text from images or PDFs with OCR
lightonai/LightOnOCR-2-1B-base Image-Text-to-Text • 1B • Updated Jan 21 • 10.3k • 11
LateOn-Code 💻
State-of-the-art late interaction code retrieval models
lightonai/LateOn-Code-edge Sentence Similarity • 16.8M • Updated Feb 12 • 1.76k • 26
lightonai/LateOn-Code Sentence Similarity • 0.1B • Updated Feb 12 • 216 • 25
lightonai/LateOn-Code-edge-pretrain Sentence Similarity • 16.8M • Updated Feb 12 • 7 • 3
lightonai/LateOn-Code-pretrain Sentence Similarity • 0.1B • Updated Feb 13 • 29 • 2
PyLate 🐕
State-of-the-art late interaction models trained using PyLate
lightonai/Reason-ModernColBERT Sentence Similarity • 0.1B • Updated Sep 9, 2025 • 13.5k • 238
lightonai/GTE-ModernColBERT-v1 Sentence Similarity • Updated Jan 21 • 59.6k • 168
lightonai/LateOn-Code-edge Sentence Similarity • 16.8M • Updated Feb 12 • 1.76k • 26
lightonai/LateOn-Code Sentence Similarity • 0.1B • Updated Feb 12 • 216 • 25
Embeddings datasets ⚡️
This collection gathers datasets for embeddings pre-training and fine-tuning.
lightonai/embeddings-pre-training Viewer • Updated Jan 5 • 1.38B • 1.38k • 18
lightonai/nanobeir-multilingual Viewer • Updated Sep 16, 2025 • 522k • 401 • 11
ModernBERT
Bringing BERT into modernity via both architecture changes and scaling
answerdotai/ModernBERT-base Fill-Mask • 0.1B • Updated Jan 15, 2025 • 7.25M • 1.02k
lightonai/GTE-ModernColBERT-v1 Sentence Similarity • Updated Jan 21 • 59.6k • 168
lightonai/Reason-ModernColBERT Sentence Similarity • 0.1B • Updated Sep 9, 2025 • 13.5k • 238
lightonai/modernbert-embed-large Sentence Similarity • 0.4B • Updated May 14, 2025 • 7.6k • 32
RITA 🧿
A suite of autoregressive generative models for protein sequences, with up to 1.2B parameters, trained on over 280 million protein sequences.
lightonai/RITA_s Text Generation • 85.1M • Updated Nov 13, 2024 • 3.03k • 3
lightonai/RITA_m Text Generation • 0.3B • Updated Jan 6, 2025 • 8
lightonai/RITA_l Text Generation • Updated May 19, 2022 • 1.58k
lightonai/RITA_xl Text Generation • 1B • Updated Dec 10, 2024 • 2.58k • 3
ArabicWeb24-ablation-models
900M models trained on 25BT to compare different data processing choices (filtering, sentence dedup, minhash, etc.)
lightonai/ArabicWeb24-ablation-model-v1 Text Generation • Updated Aug 19, 2024 • 8
lightonai/ArabicWeb24-ablation-model-v5 Text Generation • Updated Aug 19, 2024 • 1
ColBERT-Zero 🐶
First large-scale fully pre-trained ColBERT model using only public data, outperforming GTE-ModernColBERT and GTE-ModernBERT
ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models Paper • 2602.16609 • Published Feb 18 • 6
lightonai/ColBERT-Zero Sentence Similarity • 0.1B • Updated Feb 23 • 4.01k • 34
lightonai/ColBERT-Zero-supervised Sentence Similarity • 0.1B • Updated Feb 23 • 44 • 3
lightonai/ColBERT-Zero-unsupervised Sentence Similarity • 0.1B • Updated Feb 23 • 137 • 2
OriOn
Visual long document VLMs based on Mistral-Small-3.1-24B-Instruct-2503 and Qwen3-VL-32B-Instruct
lightonai/OriOn-Qwen 33B • Updated Feb 18 • 36 • 8
lightonai/OriOn-Mistral 24B • Updated Feb 18 • 60 • 3
lightonai/MMLBD-C Viewer • Updated Feb 18 • 1.08k • 176 • 5
lightonai/OriOn-Leaderboard Updated Feb 8 • 1
LightOnOCR 🦉
The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR
lightonai/LightOnOCR-1B-1025 Image-to-Text • Updated Feb 20 • 162k • 247
lightonai/LightOnOCR-0.9B-16k-1025 Updated Feb 20 • 28 • 12
lightonai/LightOnOCR-0.9B-32k-1025 Updated Feb 20 • 151 • 19
LightOnOCR 1B Demo 💬 Space • Running • 42 • Extract text from images or PDFs with OCR
Ettin
A collection of SOTA, open-data, paired encoder-only and decoder-only models ranging from 17M to 1B parameters
Seq vs Seq: An Open Suite of Paired Encoders and Decoders Paper • 2507.11412 • Published Jul 15, 2025 • 31
jhu-clsp/ettin-encoder-17m Fill-Mask • Updated Jul 16, 2025 • 2.51k • 15
jhu-clsp/ettin-encoder-32m Feature Extraction • Updated Jul 18, 2025 • 593 • 11
jhu-clsp/ettin-encoder-150m Fill-Mask • Updated Jul 18, 2025 • 18.4k • 10
PAGnol 🇫🇷
French language models. These models were trained in early 2021 following the scaling laws of the time and using the exact same training data as CamemBERT
lightonai/pagnol-small Text Generation • Updated Mar 21, 2024 • 481 • 1
lightonai/pagnol-medium Text Generation • 0.4B • Updated Jan 6, 2025 • 12 • 1
lightonai/pagnol-large Text Generation • Updated Mar 24, 2024 • 9 • 1
lightonai/pagnol-xl Text Generation • 2B • Updated Nov 7, 2024 • 11 • 1