# intelli-embed-v3

Custom embedding model: a 4-phase, multi-signal fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 (XLM-RoBERTa-large, 1024-dim).

First local model to surpass Azure text-embedding-3-large (Sep 0.520 on GPU fp16 vs. 0.515), beating azure-large on 5/6 OpenMemory-specific metrics. The CPU INT8 ONNX build reaches Sep 0.505 (98% of azure-large) at 11 ms latency.

Built for OpenMemory: optimized for personal-memory storage, retrieval, deduplication, negation detection, and entity search.
## Model Details
| Property | Value |
|---|---|
| Architecture | XLM-RoBERTa-large (24 layers, 1024-dim) |
| Base model | Snowflake/snowflake-arctic-embed-l-v2.0 |
| Pooling | CLS token + L2-normalize |
| Output dimension | 1024 |
| Similarity function | Cosine similarity |
| Vocab | 250,002 (multilingual, XLM-R) |
| Max tokens | 256 (fine-tune) / 8,194 (architecture limit) |
| PyTorch fp32 size | 2.2 GB (model.safetensors) |
| ONNX INT8 size | 542 MB (onnx/model_quantized.onnx) |
| ONNX fp32 size | 2.2 GB (onnx/model.onnx + model.onnx_data) |
| GPU fp16 VRAM | ~2.3 GB |
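Because the pooling stage L2-normalizes every embedding, cosine similarity reduces to a plain dot product. A minimal pure-Python sketch of that equivalence (toy 2-dim vectors standing in for real 1024-dim embeddings):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length, as the model's Normalize() stage does.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Standard cosine similarity for arbitrary (unnormalized) vectors.
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# For unit vectors, the dot product equals cosine similarity.
print(round(dot(a, b), 2))  # 0.96
print(round(cosine([3.0, 4.0], [4.0, 3.0]), 2))  # 0.96
```

This is why vector stores can index the raw 1024-dim outputs with inner-product search and still get cosine ranking.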
### Full Architecture

```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True})
  (2): Normalize()
)
```
## Benchmark Results

Benchmarked on 10 software-engineering positive pairs and 10 negative pairs. Sep = mean(PosSim) − mean(NegSim); higher means cleaner separation between related and unrelated pairs.
| Deployment | Sep | p50 latency | Key advantage |
|---|---|---|---|
| GPU fp16 (sentence-transformers CUDA) | 0.520 | 23 ms | Beats azure-large (0.515) |
| CPU INT8 (ort-node ONNX) | 0.505 | 88 ms | Zero GPU needed |
| Node.js (transformers.js q8) | 0.505 | 11 ms | Fastest, identical quality to CPU INT8 |
| azure-large (cloud baseline) | 0.515 | 97 ms | Cloud reference |
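The Sep statistic in the table is just the gap between the mean positive-pair and mean negative-pair cosine similarity. A sketch of the computation (the similarity scores below are illustrative, not the actual benchmark data):

```python
def separation(pos_sims, neg_sims):
    # Sep = mean(PosSim) - mean(NegSim); larger means the model pushes
    # related and unrelated pairs further apart in cosine space.
    return sum(pos_sims) / len(pos_sims) - sum(neg_sims) / len(neg_sims)

# Illustrative cosine similarities for 10 positive and 10 negative pairs.
pos = [0.82, 0.78, 0.75, 0.80, 0.77, 0.79, 0.81, 0.76, 0.74, 0.78]
neg = [0.28, 0.25, 0.30, 0.27, 0.24, 0.26, 0.29, 0.23, 0.31, 0.27]
print(round(separation(pos, neg), 3))  # 0.51
```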
## Extended OpenMemory Metrics (vs azure-large)

| Metric | v3 GPU | v3 CPU | azure-large | What it measures |
|---|---|---|---|---|
| memSep | 0.541 | 0.533 | 0.509 | Personal-memory discrimination |
| dedupGap | 0.156 | 0.159 | 0.050 | Dedup reliability (>0.15 target) |
| negGap | 0.111 | 0.103 | 0.088 | Negation detection (>0.10 safe) |
| supSim | 0.592 | 0.580 | 0.651 | Supersede zone (0.75–0.92 target) |
| entSep | 0.539 | 0.522 | 0.514 | Entity search quality |
v3 wins 5/6 metrics. Only supSim is lower, and both models fall below its target zone, so the supersede threshold needs tuning either way.
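To illustrate why dedupGap matters downstream: deduplication is typically a single cosine-similarity threshold, and a wide gap means one threshold cleanly splits duplicates from merely related memories. A sketch, where the 0.85 threshold is purely illustrative and not a tuned OpenMemory value:

```python
DEDUP_THRESHOLD = 0.85  # illustrative only; tune per deployment

def is_duplicate(cosine_sim, threshold=DEDUP_THRESHOLD):
    # With a healthy dedupGap (> 0.15), duplicate pairs score well above
    # near-duplicate pairs, so a single threshold separates them reliably.
    return cosine_sim >= threshold

print(is_duplicate(0.93))  # True: same memory, reworded
print(is_duplicate(0.74))  # False: related but distinct memory
```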
## Held-Out Validation

Tested on completely fresh sentences with zero overlap with the training data:
| Provider | Sep (benchmark) | Sep (held-out) | Delta |
|---|---|---|---|
| v3-GPU-fp16 | 0.520 | 0.595 | +14% |
| v3-CPU-INT8 | 0.505 | 0.578 | +14% |
| azure-large | 0.515 | 0.637 | +24% |
Scores improve on held-out data, indicating the benchmark numbers are not inflated by training-set leakage.
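The Delta column is the relative change from benchmark Sep to held-out Sep, and the rounding can be checked directly:

```python
def rel_delta_pct(bench, held_out):
    # Percentage change of held-out Sep relative to benchmark Sep.
    return round((held_out - bench) / bench * 100)

print(rel_delta_pct(0.520, 0.595))  # 14
print(rel_delta_pct(0.505, 0.578))  # 14
print(rel_delta_pct(0.515, 0.637))  # 24
```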
## MTEB Benchmark (English, v2)
Full evaluation on MTEB(eng, v2) — 41 tasks across 7 categories.
Evaluated with mteb v2.8.8 on NVIDIA RTX 4090, FP16, batch_size=256.
| Category | Avg Score | # Tasks |
|---|---|---|
| Classification | 0.7650 | 8 |
| Clustering | 0.4228 | 8 |
| PairClassification | 0.7976 | 4 |
| Reranking | 0.3001 | 1 |
| Retrieval | 0.4931 | 10 |
| STS | 0.8341 | 9 |
| Summarization | 0.3452 | 1 |
| Overall | 0.5654 | 41 |
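The Overall row matches the unweighted mean of the seven category averages (MTEB's mean-by-task-type aggregation) rather than a per-task mean; a quick check:

```python
# Category averages from the table above, in row order.
category_avgs = [0.7650, 0.4228, 0.7976, 0.3001, 0.4931, 0.8341, 0.3452]

# Mean-by-task-type: each category contributes equally,
# regardless of how many tasks it contains.
overall = sum(category_avgs) / len(category_avgs)
print(round(overall, 4))  # 0.5654
```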
### Individual Task Scores
| Task | Category | Score |
|---|---|---|
| AmazonCounterfactualClassification | Classification | 0.7597 |
| ArXivHierarchicalClusteringP2P | Clustering | 0.5530 |
| ArXivHierarchicalClusteringS2S | Clustering | 0.5295 |
| ArguAna | Retrieval | 0.5589 |
| AskUbuntuDupQuestions | PairClassification | 0.6227 |
| BIOSSES | STS | 0.8664 |
| Banking77Classification | Classification | 0.8327 |
| BiorxivClusteringP2P.v2 | Clustering | 0.3622 |
| CQADupstackGamingRetrieval | Retrieval | 0.5932 |
| CQADupstackUnixRetrieval | Retrieval | 0.4309 |
| ClimateFEVERHardNegatives | Retrieval | 0.2524 |
| FEVERHardNegatives | Retrieval | 0.7669 |
| FiQA2018 | Retrieval | 0.3531 |
| HotpotQAHardNegatives | Retrieval | 0.6076 |
| ImdbClassification | Classification | 0.7771 |
| MTOPDomainClassification | Classification | 0.9265 |
| MassiveIntentClassification | Classification | 0.7191 |
| MassiveScenarioClassification | Classification | 0.7614 |
| MedrxivClusteringP2P.v2 | Clustering | 0.3400 |
| MedrxivClusteringS2S.v2 | Clustering | 0.3116 |
| MindSmallReranking | Reranking | 0.3001 |
| SCIDOCS | Retrieval | 0.1818 |
| SICK-R | STS | 0.7885 |
| STS12 | STS | 0.8132 |
| STS13 | STS | 0.8908 |
| STS14 | STS | 0.8527 |
| STS15 | STS | 0.8961 |
| STS17 | STS | 0.8468 |
| STS22.v2 | STS | 0.6754 |
| STSBenchmark | STS | 0.8765 |
| SprintDuplicateQuestions | PairClassification | 0.9579 |
| StackExchangeClustering.v2 | Clustering | 0.4990 |
| StackExchangeClusteringP2P.v2 | Clustering | 0.3935 |
| SummEvalSummarization.v2 | Summarization | 0.3452 |
| TRECCOVID | Retrieval | 0.6636 |
| Touche2020Retrieval.v3 | Retrieval | 0.5224 |
| ToxicConversationsClassification | Classification | 0.7053 |
| TweetSentimentExtractionClassification | Classification | 0.6379 |
| TwentyNewsgroupsClustering.v2 | Clustering | 0.3938 |
| TwitterSemEval2015 | PairClassification | 0.7398 |
| TwitterURLCorpus | PairClassification | 0.8698 |
## Training Strategy

4-phase multi-signal fine-tuning with complementary objectives:

| Phase | Loss | Teacher / Data | Purpose |
|---|---|---|---|
| A | GISTEmbedLoss + MNRL | mxbai-embed-large-v1 + 200k AllNLI + 83 OpenMemory pairs | PosSim↑ NegSim↓ via guided contrastive learning |
| B | MSE + Relational Distillation (3 epochs) | text-embedding-3-large 1024-dim (API, 189k cached) | Full Sep signal + pairwise cosine structure alignment (α=0.3) |
| C | MNRL + CoSENT | 50k NLI contradiction triplets + 20 custom negation pairs + STSB | Negation separation (negGap) + fine-grained similarity ordering |
| D | Hard-Negative MNRL | 13k+ mined triplets (margin=0.210) | Decision-boundary refinement |
### Training Hyperparameters
- Hardware: NVIDIA RTX 4090 (24 GB VRAM)
- Training time: ~2.1 hours total (all 4 phases)
- Batch size: 32 (phases A, C, D), 128 (phase B)
- Learning rate: 2e-5 (phase A), 1e-5 (phase B), 5e-6 (phases C, D)
- Scheduler: Cosine with 5–10% warmup
- Precision: bf16 mixed precision
- Optimizer: AdamW (fused)
- Gradient checkpointing: enabled
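The per-phase schedule above can be collected into a single config mapping. This is a sketch that mirrors the bullet list; the field names are illustrative, not those of the actual training script:

```python
# Per-phase hyperparameters, transcribed from the list above.
PHASES = {
    "A": {"loss": "GISTEmbedLoss+MNRL",           "batch_size": 32,  "lr": 2e-5},
    "B": {"loss": "MSE+RelationalDistillation",   "batch_size": 128, "lr": 1e-5},
    "C": {"loss": "MNRL+CoSENT",                  "batch_size": 32,  "lr": 5e-6},
    "D": {"loss": "HardNegativeMNRL",             "batch_size": 32,  "lr": 5e-6},
}

# Settings shared by all four phases.
COMMON = {
    "scheduler": "cosine",                 # with 5-10% warmup
    "precision": "bf16",
    "optimizer": "adamw_fused",
    "gradient_checkpointing": True,
}
```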
### Framework Versions
- Python: 3.10.6
- Sentence Transformers: 5.2.3
- Transformers: 4.57.6
- PyTorch: 2.6.0+cu124
- ONNX Runtime: 1.22.0 (INT8 quantization)
## Usage

### Python (sentence-transformers)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("serhiiseletskyi/intelli-embed-v3")

queries = model.encode(["What programming language does the user prefer?"])
docs = model.encode(["The user is a senior TypeScript developer who loves React."])

similarity = model.similarity(queries, docs)
print(similarity)  # tensor([[0.62]])
```
### Node.js (transformers.js — recommended for production)

```javascript
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "serhiiseletskyi/intelli-embed-v3",
  { dtype: "q8" } // loads onnx/model_quantized.onnx (542 MB)
);

const output = await extractor("User prefers dark mode", {
  pooling: "cls",
  normalize: true,
});
console.log(output.data.length); // 1024
```
### Node.js (onnxruntime-node — direct ONNX)

```javascript
import * as ort from "onnxruntime-node";

const session = await ort.InferenceSession.create(
  "onnx/model_quantized.onnx",
  { executionProviders: ["cpu"] }
);

// Tokenize with @huggingface/transformers AutoTokenizer, then:
const result = await session.run({ input_ids, attention_mask });
// CLS pooling: result.last_hidden_state[0, 0, :1024], then L2-normalize.
```
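The two post-processing steps in that last comment (take the first-token CLS vector, then L2-normalize it) can be sketched language-agnostically in Python, with nested lists standing in for the `last_hidden_state` tensor of shape [batch, seq_len, hidden]:

```python
import math

def cls_pool_and_normalize(last_hidden_state):
    # last_hidden_state: [batch][seq_len][hidden] nested lists.
    # CLS pooling keeps only each sequence's first-token vector,
    # then L2-normalization scales it to unit length.
    out = []
    for seq in last_hidden_state:
        cls = seq[0]  # first-token (CLS) vector
        norm = math.sqrt(sum(x * x for x in cls))
        out.append([x / norm for x in cls])
    return out

# Toy batch: 1 sequence, 2 tokens, 3-dim hidden states.
batch = [[[3.0, 0.0, 4.0], [9.9, 9.9, 9.9]]]
print(cls_pool_and_normalize(batch)[0])  # [0.6, 0.0, 0.8]
```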
### GPU Server (CUDA fp16)

```shell
pip install sentence-transformers flask
python v3_gpu_server.py --port 8234

# OpenAI-compatible /v1/embeddings endpoint:
curl http://localhost:8234/v1/embeddings \
  -d '{"input": "hello world", "model": "intelli-embed-v3-gpu"}'
```
## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 2.2 GB | PyTorch fp32 weights (for GPU inference or further fine-tuning) |
| `onnx/model_quantized.onnx` | 542 MB | INT8 quantized ONNX (recommended for CPU/Node.js) |
| `onnx/model.onnx` + `model.onnx_data` | 2.2 GB | fp32 ONNX (highest precision, large) |
| `config.json` | — | Model architecture config |
| `tokenizer.json` | 16 MB | Fast tokenizer |
| `1_Pooling/config.json` | — | CLS pooling config |
## Citation

If you use this model, please cite:

```bibtex
@misc{intelli-embed-v3,
  title={intelli-embed-v3: Multi-Signal Fine-Tuned Embedding Model for Personal Memory Systems},
  author={Serhii Seletskyi},
  year={2026},
  url={https://huggingface.co/serhiiseletskyi/intelli-embed-v3},
  note={4-phase fine-tune of Snowflake/snowflake-arctic-embed-l-v2.0 for OpenMemory}
}
```
## License

Apache 2.0 (same as the base model, Snowflake/snowflake-arctic-embed-l-v2.0).