AI SERVICES · EMBEDDING MODEL OPTIMISATION
Generic embeddings rarely understand your domain. We fix that.
RankSaga fine-tunes embedding models on your corpus and your query distribution, then measures the retrieval lift against a held-out eval set. Our published BEIR work delivered up to 51 percent retrieval improvement. The same methodology runs against customer data inside customer environments.
An off-the-shelf embedding model is trained on the open web. Your domain is not the open web. The gap between general-purpose embeddings and domain-tuned embeddings is often the difference between an AI system that demos and an AI system that ships.
WHY THIS MATTERS
Embedding choice is a system decision, not a vendor choice.
The most common reason an enterprise RAG system retrieves the wrong passage is that the embedding model has never seen text that looks like the customer's corpus. Legal, medical, financial, scientific, and operational language sits outside the distribution that public embedding models were optimised against.
RankSaga's embedding work is empirical. We start with a baseline measurement on your corpus and your real queries. We design a fine-tuning regimen (Multiple Negatives Ranking Loss, hard-negative mining, contrastive training) and the data pipeline that feeds it. Every iteration is measured against a held-out evaluation set, and the regression harness ships with the model.
Our published BEIR benchmarking work demonstrates the methodology in the open. We achieved up to 51 percent retrieval improvement across SciFact, NFCorpus, SCIDOCS, and Quora, and the resulting RankSaga-Optimised-E5-v2 model is on HuggingFace. The same approach, the same engineering team, runs against customer data inside customer environments under NDA.
WHAT WE SHIP
Six concrete pieces of work.
01 / Capability
Baseline Measurement
Recall@k, MRR, and nDCG against a labelled evaluation set drawn from your real queries and your real corpus. The number to beat is the number you have today.
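To make the three metrics concrete, here is a minimal sketch using only the Python standard library. It assumes binary relevance labels; the names `runs` (query id to ranked document ids) and `qrels` (query id to the set of relevant document ids) are illustrative and not part of any RankSaga tooling.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(runs, qrels, k=10):
    """Average each metric over every labelled query in qrels."""
    totals = {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    for query_id, relevant in qrels.items():
        ranked = runs.get(query_id, [])  # a query with no results scores zero
        totals["recall"] += recall_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)  # not truncated at k
        totals["ndcg"] += ndcg_at_k(ranked, relevant, k)
    count = max(len(qrels), 1)
    return {name: value / count for name, value in totals.items()}
```

Running `evaluate` once on the current embedding and once on each candidate gives directly comparable numbers, provided both runs use the same labelled set and the same `k`.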
02 / Capability
Training Data Pipeline
Query mining from search logs, hard-negative selection, synthetic positive generation where supervision is sparse, and the data quality controls that prevent label leakage.
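One common form of hard-negative selection can be sketched as follows: rank the corpus against a query with the current embedding, drop the known positives, and keep the highest-scoring documents that remain. This is an illustration under assumptions, not a description of our production pipeline. The `skip_top` parameter is a hypothetical guard that skips the very closest candidates, which are the most likely to be unlabelled positives.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, num_negatives=2, skip_top=0):
    """Return ids of the non-positive documents most similar to the query.

    doc_vecs maps document id -> embedding from the current (baseline) model.
    """
    scored = sorted(
        (
            (cosine(query_vec, vec), doc_id)
            for doc_id, vec in doc_vecs.items()
            if doc_id not in positive_ids  # never emit a labelled positive
        ),
        reverse=True,  # most similar first
    )
    return [doc_id for _, doc_id in scored[skip_top:skip_top + num_negatives]]
```

Documents that score high but are not labelled relevant are the ones the model currently confuses with the right answer, which is why they carry more training signal than random negatives.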
03 / Capability
Fine-Tuning Regimen
Multiple Negatives Ranking Loss, contrastive learning, in-batch negatives, hard-negative mining, and the LoRA / full-parameter trade-off selected against your latency and storage budget.
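For readers who want to see what Multiple Negatives Ranking Loss with in-batch negatives computes, the sketch below works it out on plain Python lists. For each query, the paired positive is the target and every other positive in the batch serves as a negative. The `scale` value of 20.0 is an assumed temperature-style multiplier on cosine similarity, and the function names are illustrative; a real training run would use a deep-learning framework rather than this reference calculation.

```python
import math

def normalise(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mnr_loss(query_vecs, positive_vecs, scale=20.0):
    """Mean cross-entropy over the batch.

    Row i of the similarity matrix compares query i with every positive;
    the correct "class" for row i is column i.
    """
    queries = [normalise(v) for v in query_vecs]
    positives = [normalise(v) for v in positive_vecs]
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * dot(query, pos) for pos in positives]
        # Numerically stable log-sum-exp over the row.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)
```

The loss is near zero when each query is closest to its own positive and grows when another passage in the batch scores higher, which is why larger batches and harder negatives sharpen the signal.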
04 / Capability
Distillation
Where production cost requires it, distillation of the optimised model to a smaller, faster variant that retains most of the retrieval quality at a fraction of the inference cost.
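One way to express a distillation objective, shown here purely as an assumption-labelled sketch, is to make the student reproduce the teacher's pairwise similarity structure over a batch of texts. Because only similarities are compared, the student can have a smaller embedding dimension than the teacher. The function names are hypothetical, and this is one of several possible objectives rather than the specific one used in any given engagement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_matrix(vecs):
    """Pairwise cosine similarities for one batch of embeddings."""
    return [[cosine(a, b) for b in vecs] for a in vecs]

def distillation_loss(teacher_vecs, student_vecs):
    """Mean squared difference between teacher and student similarity matrices.

    Both inputs embed the same texts in the same order; dimensions may differ.
    """
    teacher = similarity_matrix(teacher_vecs)
    student = similarity_matrix(student_vecs)
    n = len(teacher)
    squared = sum(
        (teacher[i][j] - student[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return squared / (n * n)
```

Whatever objective is used, retention is still confirmed by scoring the student on the held-out evaluation set rather than by reading the training loss.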
05 / Capability
Eval Harness
The regression suite that runs on every change. Recall@k, MRR, nDCG against the held-out set, plus the application-level metric the embedding is supposed to move.
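A regression gate of this kind can be as small as the sketch below: compare the candidate's metrics with the stored baseline and report anything that dropped by more than an agreed tolerance. The metric names and the 0.01 tolerance are assumptions for illustration; the real thresholds are agreed per engagement.

```python
def regression_failures(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is missing or falls more than
    `tolerance` below the baseline, as name -> (baseline, candidate)."""
    failures = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            failures[name] = (base_value, new_value)
    return failures

# Example: nDCG dropped by 0.05, so the change is blocked.
baseline = {"recall@10": 0.80, "mrr": 0.60, "ndcg@10": 0.70}
candidate = {"recall@10": 0.85, "mrr": 0.595, "ndcg@10": 0.65}
failures = regression_failures(baseline, candidate)
if failures:
    print("blocked:", sorted(failures))
```

Wiring a check like this into the build means a model, data, or index change cannot ship while any tracked metric is below the agreed floor.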
06 / Capability
Production Embedding Service
The inference path that serves the embeddings. GPU vs CPU, batch vs request, caching strategy, and the operational posture to keep latency stable under load.
HOW WE OPERATE
Measure first, then train.
01 / Step
Baseline and Goal
We measure the retrieval quality of the current embedding on a labelled evaluation set drawn from your real queries. We agree the metric and the lift target before any training begins.
02 / Step
Train and Measure
Iterative fine-tuning against the eval set. Each round changes one variable (the loss function, the negative-mining strategy, or the data mix) so the contribution to lift is attributable.
03 / Step
Deploy and Monitor
The optimised model ships into your inference path with the regression harness wired in. Quality drift is detected and alerted, not discovered by a customer.
STACK POSTURE
Models, methods, and surfaces.
Base models
E5, BGE, GTE, Jina, Cohere, OpenAI text-embedding-3, and customer-provided base models. Choice driven by your latency, residency, and cost constraints.
Methods
Multiple Negatives Ranking Loss, contrastive learning, hard-negative mining, query mining from logs, synthetic data generation, LoRA / full-parameter fine-tuning.
Eval
BEIR-style evaluation extended to your domain. Recall@k, MRR, nDCG, plus application-level metrics (citation correctness, downstream task completion).
Distillation
Teacher-student distillation to smaller models for production cost. Quality retention measured against the eval set, not assumed.
Deployment
HuggingFace TEI, vLLM, Ollama, custom inference services, or managed endpoints inside the customer's environment.
Open source
RankSaga-Optimised-E5-v2 published on HuggingFace. Methodology and results published in our research record.
PROOF
Published BEIR results, open-source model, live deployments.
RankSaga's optimised E5-v2 embeddings delivered up to 51 percent retrieval improvement across BEIR benchmark datasets. The model, the methodology, and the evaluation framework are all published. The same engineering team operates AI systems in live deployment for the Australian Defence Force (ADF).
RANKSAGA · BEIR BENCHMARK · ADF DEPLOYMENT · 2026
- Up to 51 percent retrieval lift on BEIR datasets.
- Open-source RankSaga-Optimised-E5-v2 model on HuggingFace.
- Published research available via Zenodo.
- Live air-gapped AI deployment for the Australian Defence Force (ADF).
RELATED CAPABILITIES
Where embedding work connects.
Adjacent
Vector Database Management →
The index that stores the embeddings. Architecture and ops to keep retrieval fast.
Adjacent
Semantic Search & Retrieval →
The retrieval system that consumes the embeddings. Re-ranking, hybrid retrieval, query understanding.
Adjacent
Fine-Tuning & Distillation →
The same methodology applied to foundation models, not just embeddings.
QUESTIONS
What customers ask before we start.
How much retrieval lift can we realistically expect?
It depends on the gap between your domain and the embedding model's training distribution. Highly specialised corpora (legal, medical, biotech, defence) regularly see 20-50 percent recall@k improvement after domain fine-tuning. General-business corpora typically see 5-15 percent. We give a real number after the baseline measurement, not before.
Do we need labelled training data?
Helpful but not required. We mine queries and positive pairs from search logs, generate synthetic positives where supervision is sparse, and apply hard-negative mining to sharpen the contrastive signal. Engagements that begin with no labels regularly produce production-quality fine-tuning sets within the first weeks.
How big a model can we afford to fine-tune?
Most production embedding work sits in the 100M-2B parameter range. Smaller models (E5-base, BGE-small) reach very competitive recall after domain tuning and run inference cheaply on CPU. We size the model to your latency, throughput, and cost budget rather than defaulting to the largest available.
Can the optimised embeddings stay inside our environment?
Yes. We fine-tune inside customer VPC, on-premise, or air-gapped environments where required. Training data and the resulting model never leave the customer boundary. The methodology and tooling we bring with us are not sensitive; the model and the data are yours.
What happens to embeddings already in production?
Re-embedding is part of the engagement. We run the new model in parallel against ingestion, build the new index alongside the old one, validate retrieval quality on a held-out set, and cut over only when the new system matches or beats the old one. Rollback is a config flip.
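The cutover rule above can be sketched in a few lines. The `RetrievalConfig` class, its field names, and the index names are hypothetical; the point is only that the old index stays intact, the switch happens when every tracked metric matches or beats the old value, and rollback changes a single setting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    active_index: str    # index that serves live traffic
    previous_index: str  # kept intact so rollback needs no rebuild

def cut_over(config, new_index, old_metrics, new_metrics):
    """Switch to new_index only if it matches or beats the old index on
    every tracked metric; otherwise return the config unchanged."""
    ready = all(
        new_metrics.get(name, float("-inf")) >= value
        for name, value in old_metrics.items()
    )
    if not ready:
        return config
    return RetrievalConfig(active_index=new_index, previous_index=config.active_index)

def roll_back(config):
    """Swap the active and previous indexes."""
    return RetrievalConfig(
        active_index=config.previous_index,
        previous_index=config.active_index,
    )
```

Because both indexes exist side by side until the old one is retired, a rollback does not require re-embedding anything.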
ENGAGE
If your retrieval quality is unmeasured, that is the place to start.
Most enterprise teams have never measured the retrieval quality of their embeddings on their real queries. The measurement itself often reveals where the lift is. Engagements typically begin with that measurement.