RankSaga · AI-Driven Decision Software

AI SERVICES · EMBEDDING MODEL OPTIMISATION

Generic embeddings rarely understand your domain. We fix that.

RankSaga fine-tunes embedding models on your corpus and your query distribution, then measures the retrieval lift against a held-out eval set. Our published BEIR work delivered up to 51 percent retrieval improvement. The same methodology runs against customer data inside customer environments.

An off-the-shelf embedding model is trained on the open web. Your domain is not the open web. The gap between general-purpose embeddings and domain-tuned embeddings is often the difference between an AI system that demos and an AI system that ships.

+51%: BEIR retrieval lift, RankSaga-optimised E5
Open Source: RankSaga-Optimised-E5-v2 on HuggingFace
MNR · Hard Negatives: Methodology we ship
Measured: Every change passes through eval

WHY THIS MATTERS

Embedding choice is a system decision, not a vendor choice.

The most common reason an enterprise RAG system retrieves the wrong passage is that the embedding model has never seen text that looks like the customer's corpus. Legal, medical, financial, scientific, and operational language sits outside the distribution that public embedding models were optimised against.

RankSaga's embedding work is empirical. We start with a baseline measurement on your corpus and your real queries. We design a fine-tuning regimen (Multiple Negatives Ranking Loss, hard-negative mining, contrastive training) and the data pipeline that feeds it. Every iteration is measured against a held-out evaluation set, and the regression harness ships with the model.

Our published BEIR benchmarking work demonstrates the methodology in the open. We achieved up to 51 percent retrieval improvement across SciFact, NFCorpus, SCIDOCS, and Quora, and the resulting RankSaga-Optimised-E5-v2 model is on HuggingFace. The same approach and the same engineering team run against customer data inside customer environments under NDA.

WHAT WE SHIP

Six concrete pieces of work.

01 / Capability

Baseline Measurement

Recall@k, MRR, and nDCG against a labelled evaluation set drawn from your real queries and your real corpus. The number to beat is the number you have today.
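
As a rough illustration of the three metrics named above, the sketch below scores one query with binary relevance labels. The function names and the toy document ids are assumptions made for this example, not part of a published RankSaga toolkit. MRR over an evaluation set is the mean of the per-query reciprocal rank.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retrieval result for one query, best match first.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 4), 4))
```

Averaging these per-query values over the labelled evaluation set gives the baseline figures that later rounds are compared against.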

02 / Capability

Training Data Pipeline

Query mining from search logs, hard-negative selection, synthetic positive generation where supervision is sparse, and the data quality controls that prevent label leakage.
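
Two of these steps can be sketched in a few lines, under simplifying assumptions: embeddings are plain Python lists, similarity is a dot product, and leakage is checked by exact match after light normalisation. The helper names `mine_hard_negatives` and `leaked_queries` are hypothetical. A real pipeline would typically also filter likely false negatives, since a top-ranked unlabelled document may in fact be relevant.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, n=2):
    """Return ids of the n highest-scoring documents that are not labelled positive."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored if doc_id not in positive_ids][:n]

def leaked_queries(train_queries, eval_queries):
    """Queries that appear in both splits once case and whitespace are normalised."""
    def normalise(query):
        return " ".join(query.lower().split())
    train = {normalise(q) for q in train_queries}
    held_out = {normalise(q) for q in eval_queries}
    return train & held_out

# Toy data: "a" is the labelled positive, "b" and "d" are near misses.
query_vec = [1.0, 0.0]
doc_vecs = {"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.1, 0.9], "d": [0.7, 0.3]}
hard_negatives = mine_hard_negatives(query_vec, doc_vecs, positive_ids={"a"}, n=2)

overlap = leaked_queries(
    ["Refund policy", "shipping times"],
    ["refund  POLICY", "warranty length"],
)
```

Any non-empty `overlap` means an evaluation query also appears in training, which would inflate the measured lift.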

03 / Capability

Fine-Tuning Regimen

Multiple Negatives Ranking Loss, contrastive learning, in-batch negatives, hard-negative mining, and a choice between LoRA and full-parameter fine-tuning made against your latency and storage budget.
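
To make the loss concrete, here is a minimal pure-Python sketch of Multiple Negatives Ranking Loss with in-batch negatives. For each query, its paired passage is the positive and every other passage in the batch serves as a negative; the loss is cross-entropy over scaled cosine similarities. The `scale` value of 20 is an assumption for illustration, and a real training run would use a tensor library rather than Python lists.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(queries, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for queries[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, passage) for passage in positives]
        # Numerically stable log-sum-exp over all passages in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])     # pairs match
misaligned = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
```

The aligned batch scores close to zero and the swapped batch scores close to the scale value, which is the gradient signal that pulls matching pairs together.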

04 / Capability

Distillation

Where production cost requires it, distillation of the optimised model to a smaller, faster variant that retains most of the retrieval quality at a fraction of the inference cost.
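
One common way to frame embedding distillation is to train the student so that its query-document similarity scores match the teacher's, which works even when the two models have different embedding dimensions. The sketch below shows that objective only; it is an illustrative assumption, not a description of the specific distillation recipe used in an engagement, and the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(query_vecs, doc_vecs):
    """Score every query against every document."""
    return [[dot(q, d) for d in doc_vecs] for q in query_vecs]

def score_distillation_loss(teacher_q, teacher_d, student_q, student_d):
    """Mean squared error between teacher and student similarity scores."""
    teacher_scores = similarity_matrix(teacher_q, teacher_d)
    student_scores = similarity_matrix(student_q, student_d)
    squared_errors = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_scores, student_scores)
        for t, s in zip(t_row, s_row)
    ]
    return sum(squared_errors) / len(squared_errors)

# Teacher uses 3 dimensions, student uses 2; only the scores are compared.
teacher_q = [[1.0, 0.0, 0.0]]
teacher_d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student_q = [[1.0, 0.0]]

matching = score_distillation_loss(teacher_q, teacher_d, student_q, [[1.0, 0.0], [0.0, 1.0]])
inverted = score_distillation_loss(teacher_q, teacher_d, student_q, [[0.0, 1.0], [1.0, 0.0]])
```

A loss of zero means the student ranks documents exactly as the teacher does on that batch; retained quality still has to be confirmed on the held-out evaluation set.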

05 / Capability

Eval Harness

The regression suite that runs on every change. Recall@k, MRR, nDCG against the held-out set, plus the application-level metric the embedding is supposed to move.
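
A regression gate of this kind can be as simple as comparing a candidate's metrics with the stored baseline and failing the change if any metric drops by more than an agreed tolerance. The sketch below assumes metrics arrive as plain dictionaries; the metric names, values, and the 0.01 tolerance are placeholders.

```python
def regression_failures(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is missing or worse than baseline by more than tolerance."""
    failures = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            failures[name] = (base_value, new_value)
    return failures

# Placeholder numbers for illustration only.
baseline = {"recall@10": 0.80, "mrr": 0.60, "ndcg@10": 0.70}
candidate = {"recall@10": 0.85, "mrr": 0.595, "ndcg@10": 0.65}

failures = regression_failures(baseline, candidate)
if failures:
    print("Blocked:", failures)
```

Here the small MRR dip stays inside the tolerance, while the nDCG drop blocks the change.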

06 / Capability

Production Embedding Service

The inference path that serves the embeddings. GPU vs CPU, batch vs request, caching strategy, and the operational posture to keep latency stable under load.
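
As one illustration of the caching and batching decisions mentioned above, the sketch below wraps a batch embedding function with a small least-recently-used cache, so repeated texts are served from memory and only cache misses reach the model, in a single batch. `CachedEmbedder` and `fake_embed_batch` are hypothetical names; the fake model stands in for a real inference call.

```python
from collections import OrderedDict

class CachedEmbedder:
    """LRU cache in front of a batch embedding function (list of str -> list of vectors)."""

    def __init__(self, embed_batch, max_items=10000):
        self._embed_batch = embed_batch
        self._max_items = max_items
        self._cache = OrderedDict()

    def embed(self, texts):
        # Collect unique cache misses, preserving order, and embed them in one batch.
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        if missing:
            for text, vector in zip(missing, self._embed_batch(missing)):
                self._cache[text] = vector
        results = []
        for text in texts:
            self._cache.move_to_end(text)  # mark as most recently used
            results.append(self._cache[text])
        # Evict least recently used entries once results are collected.
        while len(self._cache) > self._max_items:
            self._cache.popitem(last=False)
        return results

calls = []

def fake_embed_batch(texts):
    """Stand-in for a model call: records the batch and returns one-dimensional vectors."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

embedder = CachedEmbedder(fake_embed_batch, max_items=2)
first = embedder.embed(["aa", "b", "aa"])
second = embedder.embed(["b"])
```

The duplicate in the first request and the whole second request never reach the model, so `calls` holds a single batch.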

HOW WE OPERATE

Measure first, then train.

01 / Step

Baseline and Goal

We measure the retrieval quality of the current embedding on a labelled evaluation set drawn from your real queries. We agree the metric and the lift target before any training begins.

02 / Step

Train and Measure

Iterative fine-tuning against the eval set. Each round changes one variable (the loss function, the negative-mining strategy, or the data mix) so that its contribution to the lift is attributable.
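
The one-variable-per-round discipline can be enforced mechanically by diffing each round's configuration against the previous one. This is a minimal sketch; the configuration keys and values are hypothetical examples.

```python
def changed_keys(previous, current):
    """Sorted list of configuration keys whose values differ between two rounds."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))

def single_change(previous, current):
    """Return the one key that changed, or raise if the round changed zero or several."""
    diff = changed_keys(previous, current)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return diff[0]

# Hypothetical round configurations.
round_1 = {"loss": "mnr", "negative_mining": "random", "data_mix": "logs_only"}
round_2 = {"loss": "mnr", "negative_mining": "hard", "data_mix": "logs_only"}
round_3 = {"loss": "contrastive", "negative_mining": "hard", "data_mix": "logs_plus_synthetic"}

changed = single_change(round_1, round_2)
```

Comparing `round_2` with `round_3` would raise, because both the loss and the data mix changed and the resulting lift could not be attributed to either.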

03 / Step

Deploy and Monitor

The optimised model ships into your inference path with the regression harness wired in. Quality drift is detected and alerted, not discovered by a customer.
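
A drift alert can be sketched as a rolling average of a quality metric compared with the agreed baseline. The class below is a minimal illustration; the window size, the allowed drop, and the idea of scoring a fixed canary query set on a schedule are assumptions rather than a description of a specific monitoring stack.

```python
from collections import deque

class DriftMonitor:
    """Flags drift when the rolling mean of a metric falls too far below baseline."""

    def __init__(self, baseline, max_drop=0.05, window=3):
        self.baseline = baseline
        self.max_drop = max_drop
        self._values = deque(maxlen=window)

    def record(self, value):
        """Add one measurement; return True if the alert condition is met."""
        self._values.append(value)
        if len(self._values) < self._values.maxlen:
            return False  # not enough data for a full window yet
        rolling_mean = sum(self._values) / len(self._values)
        return rolling_mean < self.baseline - self.max_drop

# Placeholder recall@10 readings from a scheduled canary evaluation.
monitor = DriftMonitor(baseline=0.80, max_drop=0.05, window=3)
alerts = [monitor.record(v) for v in [0.80, 0.79, 0.78, 0.70, 0.65]]
```

Averaging over a window keeps a single noisy reading from paging anyone, while a sustained decline still trips the alert.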

STACK POSTURE

Models, methods, and surfaces.

Base models

E5, BGE, GTE, Jina, Cohere, OpenAI text-embedding-3, and customer-provided base models. Choice driven by your latency, residency, and cost constraints.

Methods

Multiple Negatives Ranking Loss, contrastive learning, hard-negative mining, query mining from logs, synthetic data generation, LoRA / full-parameter fine-tuning.

Eval

BEIR-style evaluation extended to your domain. Recall@k, MRR, nDCG, plus application-level metrics (citation correctness, downstream task completion).

Distillation

Teacher-student distillation to smaller models for production cost. Quality retention measured against the eval set, not assumed.

Deployment

HuggingFace TEI, vLLM, Ollama, custom inference services, or managed endpoints inside the customer's environment.

Open source

RankSaga-Optimised-E5-v2 published on HuggingFace. Methodology and results published in our research record.

PROOF

Published BEIR results, open-source model, live deployments.

RankSaga's optimised E5-v2 embeddings delivered up to 51 percent retrieval improvement across BEIR benchmark datasets. The model, the methodology, and the evaluation framework are all published. The same engineering team operates AI systems in live deployment for the Australian Armed Forces.

RANKSAGA · BEIR BENCHMARK · ADF DEPLOYMENT · 2026

  • Up to 51 percent retrieval lift on BEIR datasets.
  • Open-source RankSaga-Optimised-E5-v2 model on HuggingFace.
  • Published research available via Zenodo.
  • Live air-gapped AI deployment for the Australian Armed Forces.

QUESTIONS

What customers ask before we start.

How much retrieval lift can we realistically expect?

It depends on the gap between your domain and the embedding model's training distribution. Highly specialised corpora (legal, medical, biotech, defence) regularly see 20-50 percent recall@k improvement after domain fine-tuning. General-business corpora typically see 5-15 percent. We give a real number after the baseline measurement, not before.

Do we need labelled training data?

Helpful but not required. We mine queries and positive pairs from search logs, generate synthetic positives where supervision is sparse, and apply hard-negative mining to sharpen the contrastive signal. Engagements that begin with no labels regularly produce production-quality fine-tuning sets within the first weeks.

How big a model can we afford to fine-tune?

Most production embedding work sits in the 100M-2B parameter range. Smaller models (E5-base, BGE-small) reach competitive recall after domain tuning and run inference cheaply on CPU. We size the model to your latency, throughput, and cost budget rather than defaulting to the largest available.

Can the optimised embeddings stay inside our environment?

Yes. We fine-tune inside customer VPC, on-premise, or air-gapped environments where required. Training data and the resulting model never leave the customer boundary. The methodology and tooling we bring with us are not sensitive; the model and the data are yours.

What happens to embeddings already in production?

Re-embedding is part of the engagement. We run the new model in parallel against ingestion, build the new index alongside the old one, validate retrieval quality on a held-out set, and cut over only when the new system matches or beats the old one. Rollback is a config flip.
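
The cutover rule and the config-flip rollback can be sketched as follows. The configuration shape, the index names, and the choice of recall@10 as the gating metric are assumptions for illustration.

```python
def cut_over(config, old_metrics, new_metrics, metric="recall@10"):
    """Return a config pointing at the new index only if it matches or beats the old one."""
    if new_metrics[metric] >= old_metrics[metric]:
        return {**config, "active_index": "new"}
    return dict(config)

def rollback(config):
    """Return a config pointing back at the old index; both indexes stay in place."""
    return {**config, "active_index": "old"}

# Hypothetical serving config with both indexes built side by side.
config = {"active_index": "old", "indexes": {"old": "docs_v1", "new": "docs_v2"}}

after_cutover = cut_over(config, {"recall@10": 0.72}, {"recall@10": 0.81})
after_rollback = rollback(after_cutover)
no_cutover = cut_over(config, {"recall@10": 0.72}, {"recall@10": 0.70})
```

Because the old index is kept alongside the new one, rollback changes one field and requires no re-embedding.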

ENGAGE

If your retrieval quality is unmeasured, that is the place to start.

Most enterprise teams have never measured the retrieval quality of their embeddings on their real queries. The measurement itself often reveals where the lift is. Engagements typically begin with that measurement.