intelli-embed-v3

Custom embedding model: 4-phase multi-signal fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 (XLM-RoBERTa-large, 1024-dim).

First local model in our testing to surpass Azure text-embedding-3-large on the separation benchmark (sep=0.520 GPU fp16 vs 0.515), and it beats azure-large on 5 of 6 OpenMemory-specific metrics. The CPU INT8 ONNX build reaches sep=0.505 (98% of azure-large) at 11 ms p50 latency via transformers.js.

Built for OpenMemory — optimized for personal memory storage, retrieval, deduplication, negation detection, and entity search.

Model Details

| Property | Value |
|---|---|
| Architecture | XLM-RoBERTa-large (24 layers, 1024-dim) |
| Base model | Snowflake/snowflake-arctic-embed-l-v2.0 |
| Pooling | CLS token + L2-normalize |
| Output dimension | 1024 |
| Similarity function | Cosine similarity |
| Vocab | 250,002 (multilingual, XLM-R) |
| Max tokens | 256 (fine-tune) / 8,194 (architecture limit) |
| PyTorch fp32 size | 2.2 GB (model.safetensors) |
| ONNX INT8 size | 542 MB (onnx/model_quantized.onnx) |
| ONNX fp32 size | 2.2 GB (onnx/model.onnx + model.onnx_data) |
| GPU fp16 VRAM | ~2.3 GB |

Full Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True})
  (2): Normalize()
)
```
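The pooling head (modules 1 and 2 above) is simple enough to replicate by hand, which is useful when consuming raw ONNX outputs. A minimal sketch with toy dimensions in place of the real 1024:

```python
import math

def cls_pool_and_normalize(last_hidden_state):
    """Replicate modules (1) and (2): take the [CLS] token vector, then L2-normalize.

    last_hidden_state: one sentence's token vectors, shape (seq_len, dim).
    """
    cls = last_hidden_state[0]                 # pooling_mode_cls_token: first token
    norm = math.sqrt(sum(x * x for x in cls))  # L2 norm
    return [x / norm for x in cls]

# Toy 3-token, 4-dim example (the real model emits 1024-dim vectors):
hidden = [[3.0, 4.0, 0.0, 0.0],
          [1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
print(cls_pool_and_normalize(hidden))  # [0.6, 0.8, 0.0, 0.0]
```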

Benchmark Results

Benchmarked on 10 software-engineering positive pairs and 10 negative pairs. Sep = mean(PosSim) − mean(NegSim).

| Deployment | Sep | p50 latency | Key advantage |
|---|---|---|---|
| GPU fp16 (sentence-transformers, CUDA) | 0.520 | 23 ms | Beats azure-large (0.515) |
| CPU INT8 (onnxruntime-node ONNX) | 0.505 | 88 ms | No GPU needed |
| Node.js (transformers.js, q8) | 0.505 | 11 ms | Fastest; identical quality to CPU INT8 |
| azure-large (cloud baseline) | 0.515 | 97 ms | Cloud reference |
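The Sep number is just a difference of mean cosines, and because the model L2-normalizes its outputs, cosine similarity reduces to a dot product. A minimal sketch with hypothetical 2-d vectors standing in for real embeddings:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def separation(pos_pairs, neg_pairs):
    """Sep = mean cosine over positive pairs minus mean cosine over negative pairs.
    Embeddings are assumed L2-normalized, so cosine == dot product."""
    pos = sum(dot(a, b) for a, b in pos_pairs) / len(pos_pairs)
    neg = sum(dot(a, b) for a, b in neg_pairs) / len(neg_pairs)
    return pos - neg

# Hypothetical normalized 2-d embeddings:
pos = [([1.0, 0.0], [1.0, 0.0])]   # identical pair, cosine 1.0
neg = [([1.0, 0.0], [0.0, 1.0])]   # orthogonal pair, cosine 0.0
print(separation(pos, neg))  # 1.0
```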

Extended OpenMemory Metrics (vs azure-large)

| Metric | v3 GPU | v3 CPU | azure-large | What it measures |
|---|---|---|---|---|
| memSep | 0.541 | 0.533 | 0.509 | Personal-memory discrimination |
| dedupGap | 0.156 | 0.159 | 0.050 | Dedup reliability (>0.15 target) |
| negGap | 0.111 | 0.103 | 0.088 | Negation detection (>0.10 safe) |
| supSim | 0.592 | 0.580 | 0.651 | Supersede zone (0.75–0.92 target) |
| entSep | 0.539 | 0.522 | 0.514 | Entity-search quality |

Counting Sep from the main benchmark, v3 wins 5 of 6 metrics. Only supSim is lower, and both models fall below the 0.75–0.92 target zone, so supersede detection requires threshold tuning either way.
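Because both models sit below the supersede target zone, the cutoff has to be calibrated on labeled data rather than taken from the 0.75–0.92 default. A minimal sketch of one possible calibration, with hypothetical similarity values; the actual tuning procedure is not part of this card:

```python
def best_threshold(supersede_sims, unrelated_sims):
    """Pick the cosine cutoff that best splits supersede pairs from unrelated
    pairs (maximizes accuracy over the pooled similarity scores)."""
    candidates = sorted(supersede_sims + unrelated_sims)

    def accuracy(t):
        hits = sum(s >= t for s in supersede_sims) + sum(s < t for s in unrelated_sims)
        return hits / (len(supersede_sims) + len(unrelated_sims))

    return max(candidates, key=accuracy)

# Hypothetical similarities clustered near the observed supSim=0.592:
sup = [0.61, 0.58, 0.64]
unr = [0.30, 0.41, 0.35]
print(best_threshold(sup, unr))  # 0.58
```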

Held-Out Validation

Tested on completely fresh sentences with zero overlap to training data:

| Provider | Sep (benchmark) | Sep (held-out) | Delta |
|---|---|---|---|
| v3-GPU-fp16 | 0.520 | 0.595 | +14% |
| v3-CPU-INT8 | 0.505 | 0.578 | +14% |
| azure-large | 0.515 | 0.637 | +24% |

Scores improve on held-out data, suggesting the benchmark numbers are not inflated by overfitting to the test pairs.

MTEB Benchmark (English, v2)

Full evaluation on MTEB(eng, v2) — 41 tasks across 7 categories. Evaluated with mteb v2.8.8 on NVIDIA RTX 4090, FP16, batch_size=256.

| Category | Avg Score | # Tasks |
|---|---|---|
| Classification | 0.7650 | 8 |
| Clustering | 0.4228 | 8 |
| PairClassification | 0.7976 | 4 |
| Reranking | 0.3001 | 1 |
| Retrieval | 0.4931 | 10 |
| STS | 0.8341 | 9 |
| Summarization | 0.3452 | 1 |
| Overall | 0.5654 | 41 |
Individual Task Scores

| Task | Category | Score |
|---|---|---|
| AmazonCounterfactualClassification | Classification | 0.7597 |
| ArXivHierarchicalClusteringP2P | Clustering | 0.5530 |
| ArXivHierarchicalClusteringS2S | Clustering | 0.5295 |
| ArguAna | Retrieval | 0.5589 |
| AskUbuntuDupQuestions | PairClassification | 0.6227 |
| BIOSSES | STS | 0.8664 |
| Banking77Classification | Classification | 0.8327 |
| BiorxivClusteringP2P.v2 | Clustering | 0.3622 |
| CQADupstackGamingRetrieval | Retrieval | 0.5932 |
| CQADupstackUnixRetrieval | Retrieval | 0.4309 |
| ClimateFEVERHardNegatives | Retrieval | 0.2524 |
| FEVERHardNegatives | Retrieval | 0.7669 |
| FiQA2018 | Retrieval | 0.3531 |
| HotpotQAHardNegatives | Retrieval | 0.6076 |
| ImdbClassification | Classification | 0.7771 |
| MTOPDomainClassification | Classification | 0.9265 |
| MassiveIntentClassification | Classification | 0.7191 |
| MassiveScenarioClassification | Classification | 0.7614 |
| MedrxivClusteringP2P.v2 | Clustering | 0.3400 |
| MedrxivClusteringS2S.v2 | Clustering | 0.3116 |
| MindSmallReranking | Reranking | 0.3001 |
| SCIDOCS | Retrieval | 0.1818 |
| SICK-R | STS | 0.7885 |
| STS12 | STS | 0.8132 |
| STS13 | STS | 0.8908 |
| STS14 | STS | 0.8527 |
| STS15 | STS | 0.8961 |
| STS17 | STS | 0.8468 |
| STS22.v2 | STS | 0.6754 |
| STSBenchmark | STS | 0.8765 |
| SprintDuplicateQuestions | PairClassification | 0.9579 |
| StackExchangeClustering.v2 | Clustering | 0.4990 |
| StackExchangeClusteringP2P.v2 | Clustering | 0.3935 |
| SummEvalSummarization.v2 | Summarization | 0.3452 |
| TRECCOVID | Retrieval | 0.6636 |
| Touche2020Retrieval.v3 | Retrieval | 0.5224 |
| ToxicConversationsClassification | Classification | 0.7053 |
| TweetSentimentExtractionClassification | Classification | 0.6379 |
| TwentyNewsgroupsClustering.v2 | Clustering | 0.3938 |
| TwitterSemEval2015 | PairClassification | 0.7398 |
| TwitterURLCorpus | PairClassification | 0.8698 |

Training Strategy

4-phase multi-signal fine-tuning with complementary objectives:

| Phase | Loss | Teacher / Data | Purpose |
|---|---|---|---|
| A | GISTEmbedLoss + MNRL | mxbai-embed-large-v1 + 200k AllNLI + 83 OpenMemory pairs | PosSim↑ NegSim↓ via guided contrastive learning |
| B | MSE + Relational Distillation (3 epochs) | text-embedding-3-large 1024-dim (API, 189k cached) | Full Sep signal + pairwise cosine structure alignment (α=0.3) |
| C | MNRL + CoSENT | 50k NLI contradiction triplets + 20 custom negation + STSB | Negation separation (negGap) + fine-grained similarity ordering |
| D | Hard-Negative MNRL | 13k+ mined triplets (margin=0.210) | Decision-boundary refinement |
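For reference, the MNRL objective used in phases A, C, and D scores each anchor against every positive in the batch, treating the non-matching positives as in-batch negatives. A minimal sketch of that loss on normalized embeddings (scale=20 is the sentence-transformers default, not necessarily the value used in this run):

```python
import math

def mnrl_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss sketch: cross-entropy over scaled cosines,
    where anchor i's target is positive i and the other positives in the batch
    act as negatives. Inputs are assumed L2-normalized."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, a in enumerate(anchors):
        logits = [scale * dot(a, p) for p in positives]
        log_softmax_i = logits[i] - math.log(sum(math.exp(z) for z in logits))
        losses.append(-log_softmax_i)
    return sum(losses) / len(losses)

anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]   # perfectly aligned pairs -> near-zero loss
print(mnrl_loss(anchors, positives) < 0.01)  # True
```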

Training Hyperparameters

  • Hardware: NVIDIA RTX 4090 (24 GB VRAM)
  • Training time: ~2.1 hours total (all 4 phases)
  • Batch size: 32 (phases A, C, D), 128 (phase B)
  • Learning rate: 2e-5 (phase A), 1e-5 (phase B), 5e-6 (phases C, D)
  • Scheduler: Cosine with 5–10% warmup
  • Precision: bf16 mixed precision
  • Optimizer: AdamW (fused)
  • Gradient checkpointing: enabled

Framework Versions

  • Python: 3.10.6
  • Sentence Transformers: 5.2.3
  • Transformers: 4.57.6
  • PyTorch: 2.6.0+cu124
  • ONNX Runtime: 1.22.0 (INT8 quantization)

Usage

Python (sentence-transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v3")

queries = model.encode(["What programming language does the user prefer?"])
docs = model.encode(["The user is a senior TypeScript developer who loves React."])

similarity = model.similarity(queries, docs)
print(similarity)  # tensor([[0.62]])
```

Node.js (transformers.js — recommended for production)

```javascript
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "serhiiseletskyi/intelli-embed-v3",
  { dtype: "q8" }  // loads onnx/model_quantized.onnx (542 MB)
);

const output = await extractor("User prefers dark mode", {
  pooling: "cls",
  normalize: true,
});
console.log(output.data.length);  // 1024
```

Node.js (onnxruntime-node — direct ONNX)

```javascript
import * as ort from "onnxruntime-node";
import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained(
  "serhiiseletskyi/intelli-embed-v3"
);
const session = await ort.InferenceSession.create(
  "onnx/model_quantized.onnx",
  { executionProviders: ["cpu"] }
);

const enc = await tokenizer("User prefers dark mode");
// Convert transformers.js tensors to onnxruntime-node tensors:
const toOrt = (t) => new ort.Tensor("int64", t.data, t.dims);
const result = await session.run({
  input_ids: toOrt(enc.input_ids),
  attention_mask: toOrt(enc.attention_mask),
});
// CLS pooling: first token of result.last_hidden_state (1024 dims), then L2-normalize
```

GPU Server (CUDA fp16)

```shell
pip install sentence-transformers flask
python v3_gpu_server.py --port 8234
# OpenAI-compatible /v1/embeddings endpoint
curl http://localhost:8234/v1/embeddings \
  -d '{"input": "hello world", "model": "intelli-embed-v3-gpu"}'
```

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 2.2 GB | PyTorch fp32 weights (for GPU inference or further fine-tuning) |
| onnx/model_quantized.onnx | 542 MB | INT8 quantized ONNX (recommended for CPU/Node.js) |
| onnx/model.onnx + model.onnx_data | 2.2 GB | fp32 ONNX (highest precision, large) |
| config.json | — | Model architecture config |
| tokenizer.json | 16 MB | Fast tokenizer |
| 1_Pooling/config.json | — | CLS pooling config |

Citation

If you use this model, please cite:

```bibtex
@misc{intelli-embed-v3,
  title={intelli-embed-v3: Multi-Signal Fine-Tuned Embedding Model for Personal Memory Systems},
  author={Serhii Seletskyi},
  year={2026},
  url={https://huggingface.co/serhiiseletskyi/intelli-embed-v3},
  note={4-phase fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 for OpenMemory}
}
```

License

Apache 2.0 (same as base model Snowflake/snowflake-arctic-embed-l-v2.0)
