# intelli-embed-v3

Custom embedding model: a 4-phase, multi-signal fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 (XLM-RoBERTa-large, 1024-dim).

First local model to surpass Azure text-embedding-3-large (Sep 0.520 on GPU fp16 vs. 0.515), beating azure-large on 5/6 OpenMemory-specific metrics. The CPU INT8 ONNX build reaches Sep 0.505 (98% of azure-large) at 11 ms latency.

Built for OpenMemory: optimized for personal-memory storage, retrieval, deduplication, negation detection, and entity search.
## Model Details
| Property | Value |
|---|---|
| Architecture | XLM-RoBERTa-large (24 layers, 1024-dim) |
| Base model | Snowflake/snowflake-arctic-embed-l-v2.0 |
| Pooling | CLS token + L2-normalize |
| Output dimension | 1024 |
| Similarity function | Cosine similarity |
| Vocab | 250,002 (multilingual, XLM-R) |
| Max tokens | 256 (fine-tune) / 8,194 (architecture limit) |
| PyTorch fp32 size | 2.2 GB (model.safetensors) |
| ONNX INT8 size | 542 MB (onnx/model_quantized.onnx) |
| ONNX fp32 size | 2.2 GB (onnx/model.onnx + model.onnx_data) |
| GPU fp16 VRAM | ~2.3 GB |
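Because the pooling stage L2-normalizes every embedding, cosine similarity reduces to a plain dot product. A minimal pure-Python sketch of that equivalence (toy 2-dim vectors standing in for real 1024-dim embeddings):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length, as the model's Normalize() stage does.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Standard cosine similarity for arbitrary (unnormalized) vectors.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# For unit vectors, the dot product equals cosine similarity.
print(round(dot(a, b), 2))  # 0.96
print(round(cosine([3.0, 4.0], [4.0, 3.0]), 2))  # 0.96
```

This is why vector stores can index the raw 1024-dim outputs with inner-product search and still get cosine ranking.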
### Full Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True})
  (2): Normalize()
)
```
## Benchmark Results

Benchmarked on 10 software-engineering positive pairs and 10 negative pairs. Sep = mean(PosSim) − mean(NegSim); higher means cleaner separation between related and unrelated pairs.
| Deployment | Sep | p50 latency | Key advantage |
|---|---|---|---|
| GPU fp16 (sentence-transformers CUDA) | 0.520 | 23 ms | Beats azure-large (0.515) |
| CPU INT8 (ort-node ONNX) | 0.505 | 88 ms | Zero GPU needed |
| Node.js (transformers.js q8) | 0.505 | 11 ms | Fastest, identical quality to CPU INT8 |
| azure-large (cloud baseline) | 0.515 | 97 ms | Cloud reference |
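The Sep statistic in the table is just the gap between the mean positive-pair and mean negative-pair cosine similarity. A sketch of the computation (the similarity scores below are illustrative, not the actual benchmark data):

```python
def separation(pos_sims, neg_sims):
    # Sep = mean(PosSim) - mean(NegSim); larger means the model pushes
    # related and unrelated pairs further apart in cosine space.
    return sum(pos_sims) / len(pos_sims) - sum(neg_sims) / len(neg_sims)

# Illustrative cosine similarities for 10 positive and 10 negative pairs.
pos = [0.82, 0.78, 0.75, 0.80, 0.77, 0.79, 0.81, 0.76, 0.74, 0.78]
neg = [0.28, 0.25, 0.30, 0.27, 0.24, 0.26, 0.29, 0.23, 0.31, 0.27]
print(round(separation(pos, neg), 3))  # 0.51
```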
## Extended OpenMemory Metrics (vs azure-large)

| Metric | v3 GPU | v3 CPU | azure-large | What it measures |
|---|---|---|---|---|
| memSep | 0.541 | 0.533 | 0.509 | Personal-memory discrimination |
| dedupGap | 0.156 | 0.159 | 0.050 | Dedup reliability (>0.15 target) |
| negGap | 0.111 | 0.103 | 0.088 | Negation detection (>0.10 safe) |
| supSim | 0.592 | 0.580 | 0.651 | Supersede zone (0.75–0.92 target) |
| entSep | 0.539 | 0.522 | 0.514 | Entity search quality |
v3 wins 5/6 metrics. Only supSim is lower, and both models fall below its target zone, so the supersede threshold needs tuning either way.
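To illustrate why dedupGap matters downstream: deduplication is typically a single cosine-similarity threshold, and a wide gap means one threshold cleanly splits duplicates from merely related memories. A sketch, where the 0.85 threshold is purely illustrative and not a tuned OpenMemory value:

```python
DEDUP_THRESHOLD = 0.85  # illustrative only; tune per deployment

def is_duplicate(cosine_sim, threshold=DEDUP_THRESHOLD):
    # With a healthy dedupGap (> 0.15), duplicate pairs score well above
    # near-duplicate pairs, so a single threshold separates them reliably.
    return cosine_sim >= threshold

print(is_duplicate(0.93))  # True: same memory, reworded
print(is_duplicate(0.74))  # False: related but distinct memory
```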
## Held-Out Validation

Tested on completely fresh sentences with zero overlap with the training data:
| Provider | Sep (benchmark) | Sep (held-out) | Delta |
|---|---|---|---|
| v3-GPU-fp16 | 0.520 | 0.595 | +14% |
| v3-CPU-INT8 | 0.505 | 0.578 | +14% |
| azure-large | 0.515 | 0.637 | +24% |
Scores improve on held-out data, indicating the benchmark numbers are not inflated by training-set leakage.
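The Delta column is the relative change from benchmark Sep to held-out Sep, and the rounding can be checked directly:

```python
def rel_delta_pct(bench, held_out):
    # Percentage change of held-out Sep relative to benchmark Sep.
    return round((held_out - bench) / bench * 100)

print(rel_delta_pct(0.520, 0.595))  # 14
print(rel_delta_pct(0.505, 0.578))  # 14
print(rel_delta_pct(0.515, 0.637))  # 24
```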
## MTEB Benchmark (English, v2)
Full evaluation on MTEB(eng, v2) — 41 tasks across 7 categories.
Evaluated with mteb v2.8.8 on NVIDIA RTX 4090, FP16, batch_size=256.
| Category | Avg Score | # Tasks |
|---|---|---|
| Classification | 0.7650 | 8 |
| Clustering | 0.4228 | 8 |
| PairClassification | 0.7976 | 4 |
| Reranking | 0.3001 | 1 |
| Retrieval | 0.4931 | 10 |
| STS | 0.8341 | 9 |
| Summarization | 0.3452 | 1 |
| Overall | 0.5654 | 41 |
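The Overall row matches the unweighted mean of the seven category averages (MTEB's mean-by-task-type aggregation) rather than a per-task mean; a quick check:

```python
# Category averages from the table above, in row order.
category_avgs = [0.7650, 0.4228, 0.7976, 0.3001, 0.4931, 0.8341, 0.3452]

# Mean-by-task-type: each category contributes equally,
# regardless of how many tasks it contains.
overall = sum(category_avgs) / len(category_avgs)
print(round(overall, 4))  # 0.5654
```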
### Individual Task Scores
| Task | Category | Score |
|---|---|---|
| AmazonCounterfactualClassification | Classification | 0.7597 |
| ArXivHierarchicalClusteringP2P | Clustering | 0.5530 |
| ArXivHierarchicalClusteringS2S | Clustering | 0.5295 |
| ArguAna | Retrieval | 0.5589 |
| AskUbuntuDupQuestions | PairClassification | 0.6227 |
| BIOSSES | STS | 0.8664 |
| Banking77Classification | Classification | 0.8327 |
| BiorxivClusteringP2P.v2 | Clustering | 0.3622 |
| CQADupstackGamingRetrieval | Retrieval | 0.5932 |
| CQADupstackUnixRetrieval | Retrieval | 0.4309 |
| ClimateFEVERHardNegatives | Retrieval | 0.2524 |
| FEVERHardNegatives | Retrieval | 0.7669 |
| FiQA2018 | Retrieval | 0.3531 |
| HotpotQAHardNegatives | Retrieval | 0.6076 |
| ImdbClassification | Classification | 0.7771 |
| MTOPDomainClassification | Classification | 0.9265 |
| MassiveIntentClassification | Classification | 0.7191 |
| MassiveScenarioClassification | Classification | 0.7614 |
| MedrxivClusteringP2P.v2 | Clustering | 0.3400 |
| MedrxivClusteringS2S.v2 | Clustering | 0.3116 |
| MindSmallReranking | Reranking | 0.3001 |
| SCIDOCS | Retrieval | 0.1818 |
| SICK-R | STS | 0.7885 |
| STS12 | STS | 0.8132 |
| STS13 | STS | 0.8908 |
| STS14 | STS | 0.8527 |
| STS15 | STS | 0.8961 |
| STS17 | STS | 0.8468 |
| STS22.v2 | STS | 0.6754 |
| STSBenchmark | STS | 0.8765 |
| SprintDuplicateQuestions | PairClassification | 0.9579 |
| StackExchangeClustering.v2 | Clustering | 0.4990 |
| StackExchangeClusteringP2P.v2 | Clustering | 0.3935 |
| SummEvalSummarization.v2 | Summarization | 0.3452 |
| TRECCOVID | Retrieval | 0.6636 |
| Touche2020Retrieval.v3 | Retrieval | 0.5224 |
| ToxicConversationsClassification | Classification | 0.7053 |
| TweetSentimentExtractionClassification | Classification | 0.6379 |
| TwentyNewsgroupsClustering.v2 | Clustering | 0.3938 |
| TwitterSemEval2015 | PairClassification | 0.7398 |
| TwitterURLCorpus | PairClassification | 0.8698 |
## Training Strategy

4-phase multi-signal fine-tuning with complementary objectives:

| Phase | Loss | Teacher / Data | Purpose |
|---|---|---|---|
| A | GISTEmbedLoss + MNRL | mxbai-embed-large-v1 + 200k AllNLI + 83 OpenMemory pairs | PosSim↑ NegSim↓ via guided contrastive learning |
| B | MSE + Relational Distillation (3 epochs) | text-embedding-3-large 1024-dim (API, 189k cached) | Full Sep signal + pairwise cosine structure alignment (α=0.3) |
| C | MNRL + CoSENT | 50k NLI contradiction triplets + 20 custom negation pairs + STSB | Negation separation (negGap) + fine-grained similarity ordering |
| D | Hard-Negative MNRL | 13k+ mined triplets (margin=0.210) | Decision-boundary refinement |
### Training Hyperparameters
- Hardware: NVIDIA RTX 4090 (24 GB VRAM)
- Training time: ~2.1 hours total (all 4 phases)
- Batch size: 32 (phases A, C, D), 128 (phase B)
- Learning rate: 2e-5 (phase A), 1e-5 (phase B), 5e-6 (phases C, D)
- Scheduler: Cosine with 5–10% warmup
- Precision: bf16 mixed precision
- Optimizer: AdamW (fused)
- Gradient checkpointing: enabled
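The per-phase schedule above can be collected into a single config mapping. This is a sketch that mirrors the bullet list; the field names are illustrative, not those of the actual training script:

```python
# Per-phase hyperparameters, transcribed from the list above.
PHASES = {
    "A": {"loss": "GISTEmbedLoss+MNRL",           "batch_size": 32,  "lr": 2e-5},
    "B": {"loss": "MSE+RelationalDistillation",   "batch_size": 128, "lr": 1e-5},
    "C": {"loss": "MNRL+CoSENT",                  "batch_size": 32,  "lr": 5e-6},
    "D": {"loss": "HardNegativeMNRL",             "batch_size": 32,  "lr": 5e-6},
}

# Settings shared by all four phases.
COMMON = {
    "scheduler": "cosine",                 # with 5-10% warmup
    "precision": "bf16",
    "optimizer": "adamw_fused",
    "gradient_checkpointing": True,
}
```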
### Framework Versions
- Python: 3.10.6
- Sentence Transformers: 5.2.3
- Transformers: 4.57.6
- PyTorch: 2.6.0+cu124
- ONNX Runtime: 1.22.0 (INT8 quantization)
## Usage

### Python (sentence-transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v3")

queries = model.encode(["What programming language does the user prefer?"])
docs = model.encode(["The user is a senior TypeScript developer who loves React."])

similarity = model.similarity(queries, docs)
print(similarity)  # tensor([[0.62]])
```
### Node.js (transformers.js — recommended for production)

```javascript
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "serhiiseletskyi/intelli-embed-v3",
  { dtype: "q8" } // loads onnx/model_quantized.onnx (542 MB)
);

const output = await extractor("User prefers dark mode", {
  pooling: "cls",
  normalize: true,
});
console.log(output.data.length); // 1024
```
### Node.js (onnxruntime-node — direct ONNX)

```javascript
import * as ort from "onnxruntime-node";

const session = await ort.InferenceSession.create(
  "onnx/model_quantized.onnx",
  { executionProviders: ["cpu"] }
);

// Tokenize with @huggingface/transformers AutoTokenizer, then:
const result = await session.run({ input_ids, attention_mask });
// CLS pooling: result.last_hidden_state[0, 0, :1024], then L2-normalize.
```
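The two post-processing steps in that last comment (take the first-token CLS vector, then L2-normalize it) can be sketched language-agnostically in Python, with nested lists standing in for the `last_hidden_state` tensor of shape [batch, seq_len, hidden]:

```python
import math

def cls_pool_and_normalize(last_hidden_state):
    # last_hidden_state: [batch][seq_len][hidden] nested lists.
    # CLS pooling keeps only each sequence's first-token vector,
    # then L2-normalization scales it to unit length.
    out = []
    for seq in last_hidden_state:
        cls = seq[0]  # first-token (CLS) vector
        norm = math.sqrt(sum(x * x for x in cls))
        out.append([x / norm for x in cls])
    return out

# Toy batch: 1 sequence, 2 tokens, 3-dim hidden states.
batch = [[[3.0, 0.0, 4.0], [9.9, 9.9, 9.9]]]
print(cls_pool_and_normalize(batch)[0])  # [0.6, 0.0, 0.8]
```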
### GPU Server (CUDA fp16)

```shell
pip install sentence-transformers flask
python v3_gpu_server.py --port 8234

# OpenAI-compatible /v1/embeddings endpoint:
curl http://localhost:8234/v1/embeddings \
  -d '{"input": "hello world", "model": "intelli-embed-v3-gpu"}'
```
## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 2.2 GB | PyTorch fp32 weights (for GPU inference or further fine-tuning) |
| `onnx/model_quantized.onnx` | 542 MB | INT8 quantized ONNX (recommended for CPU/Node.js) |
| `onnx/model.onnx` + `model.onnx_data` | 2.2 GB | fp32 ONNX (highest precision, large) |
| `config.json` | — | Model architecture config |
| `tokenizer.json` | 16 MB | Fast tokenizer |
| `1_Pooling/config.json` | — | CLS pooling config |
## Citation

If you use this model, please cite:

```bibtex
@misc{intelli-embed-v3,
  title={intelli-embed-v3: Multi-Signal Fine-Tuned Embedding Model for Personal Memory Systems},
  author={Serhii Seletskyi},
  year={2026},
  url={https://huggingface.co/serhiiseletskyi/intelli-embed-v3},
  note={4-phase fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 for OpenMemory}
}
```
## License

Apache 2.0 (same as the base model, Snowflake/snowflake-arctic-embed-l-v2.0).