AI SERVICES · FINE-TUNING & DISTILLATION
Foundation models are good. Tuned models at the right size are deployable.
RankSaga fine-tunes embedding and foundation models on your domain corpus, then distills them where production cost and latency demand it. LoRA, QLoRA, full-parameter, teacher-student distillation. Methodology selected against the metric the model is supposed to move, not against vendor preference.
The right model for production is rarely the largest. It is the smallest model that meets the quality bar on your evaluation set inside your latency and cost budget. Fine-tuning and distillation are how you find it.
WHY THIS MATTERS
Quality, cost, and latency are not three problems. They are one problem.
Most enterprise teams reach for the largest available model first, encounter the cost and latency reality, and then rebuild the system on a smaller model under deadline pressure. That sequence rarely produces the best system. The better sequence is to fine-tune or distill a smaller model against the actual quality target, then ship it.
RankSaga's fine-tuning practice is empirical. We start with a baseline measurement on your eval set. We pick a base model and a methodology: LoRA when the storage and serving footprint matters, full-parameter when the quality lift requires it, distillation when production cost demands a smaller variant. Every iteration is measured. Every method choice is an experiment with an attributable contribution to the metric.
Our published BEIR work demonstrates the methodology in the open. The RankSaga-Optimised-E5-v2 model on HuggingFace is the result of the same fine-tuning regimen we apply against customer data inside customer environments under NDA.
WHAT WE SHIP
Six concrete pieces of work.
01 / Capability
Embedding Fine-Tuning
Domain tuning of embedding models for retrieval. MNR loss, hard-negative mining, contrastive learning. Methodology behind the BEIR result.
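For illustration, a minimal sketch of an MNR-loss training loop with sentence-transformers. The base model, example pair, and hyperparameters are illustrative, not the BEIR recipe itself:

```python
# Sketch only: MNR-loss fine-tuning of an E5-family embedding model.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("intfloat/e5-base-v2")

# (query, positive passage, mined hard negative); in-batch items
# supply additional soft negatives under MNR loss.
train_examples = [
    InputExample(texts=[
        "query: how do I rotate an API key",
        "passage: API keys are rotated from the console under Settings.",
        "passage: Invoices are issued monthly in arrears.",  # hard negative
    ]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```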
02 / Capability
LLM Fine-Tuning
Foundation model tuning for classification, structured generation, instruction-following on your task. LoRA, QLoRA, and full-parameter selected against the quality and cost target.
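For illustration, a minimal LoRA configuration with Hugging Face PEFT, assuming a transformers-based stack. Base model and hyperparameters are illustrative defaults, not a recommendation:

```python
# Sketch only: attaching a LoRA adapter to a causal LM with PEFT.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=16,                                 # adapter rank: the main quality/footprint dial
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # typically well under 1% of base weights
```

QLoRA is the same adapter over a 4-bit quantised base, trading training-time GPU memory for a small quality risk the eval set has to arbitrate.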
03 / Capability
Distillation
Teacher-student distillation from a large, expensive model to a smaller, faster, cheaper variant. Quality retention measured against the eval set rather than assumed.
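For illustration, one common formulation: a KL loss on temperature-softened logits, which assumes teacher and student share a vocabulary. The output-matching variant is covered in the questions below:

```python
# Sketch only: Hinton-style logit distillation loss.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then pull the student towards the teacher.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # temperature**2 rescales gradients to stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```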
04 / Capability
Training Data Pipeline
Mining from production logs, synthetic data generation, label collection, hard-negative mining. The data quality controls that prevent the training set from becoming the bottleneck.
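For illustration, a minimal hard-negative mining sketch: the current model retrieves top-k passages for a labelled query, and high-scoring non-positives become training negatives. Names and data are illustrative:

```python
# Sketch only: mining hard negatives with the current embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")
corpus = [
    "passage: API keys are rotated from the console under Settings.",
    "passage: Invoices are issued monthly in arrears.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

def mine_hard_negatives(query, positive_idx, k=10):
    q_emb = model.encode(f"query: {query}", convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
    # Passages the model already scores highly, but which are not the
    # labelled positive, make the most informative training negatives.
    return [h["corpus_id"] for h in hits if h["corpus_id"] != positive_idx]
```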
05 / Capability
Evaluation Harness
The eval set, the regression suite, and the application-level metric. Every training run is measured the same way so contributions to lift are comparable across iterations.
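For illustration, the two retrieval metrics named in the questions below, stated as code so there is no ambiguity about what a run is scored on:

```python
# Sketch only: Recall@k and MRR over a ranked result list.
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document; 0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

assert recall_at_k(["d3", "d1", "d9"], ["d1", "d2"], k=3) == 0.5
assert mrr(["d3", "d1", "d9"], ["d1", "d2"]) == 0.5
```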
06 / Capability
Production Inference
Serving the resulting model. vLLM, TGI, TEI, custom inference paths. GPU vs CPU, batch vs request, quantisation strategy, and the operational posture to keep latency stable under load.
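For illustration, the batch path on vLLM, the simplest of the serving options listed. Model name is illustrative; the request-serving path runs the same engine behind an OpenAI-compatible server:

```python
# Sketch only: offline batch inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="auto")
params = SamplingParams(temperature=0.0, max_tokens=256)

outputs = llm.generate(["Summarise the incident report: ..."], params)
print(outputs[0].outputs[0].text)
```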
HOW WE OPERATE
Baseline, train, measure, ship.
01 / Step
Baseline and Constraint
We measure the current quality and agree the latency and cost budget. The training brief is a quality target inside a constraint envelope, not an open-ended research question.
02 / Step
Iterate on the Eval Set
Each iteration changes one variable (the loss function, the data mix, the LoRA rank, the base model) so that the contribution to lift is attributable. The eval is the contract.
03 / Step
Distill and Deploy
Where the quality target is met by a model too large to serve cheaply, we distill. The deployed model is the smallest one that meets the target inside the budget.
STACK POSTURE
Methods, models, and serving.
Methods
LoRA, QLoRA, full-parameter fine-tuning, DPO and ORPO for preference tuning, teacher-student distillation.
Embedding bases
E5, BGE, GTE, Jina, plus customer-provided base models. Choice driven by latency, residency, and cost.
LLM bases
Llama 3.x, Qwen 2.x, Mixtral, plus customer-licensed bases. Selection against the eval target and the serving budget.
Training infrastructure
Customer cloud (AWS, GCP, Azure), customer on-premise GPU clusters, or RankSaga-managed training environments. Data never leaves the customer boundary unless explicitly authorised.
Quantisation
AWQ, GPTQ, GGUF, BitsAndBytes. Quality retention vs throughput trade-off measured rather than assumed; a minimal load sketch follows this section.
Serving
vLLM, TGI, TEI, llama.cpp, Ollama, custom inference services. GPU and CPU deployments selected against the latency and throughput target.
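For illustration, a 4-bit NF4 load via BitsAndBytes, the lightest-touch of the quantisation options above. Model name is illustrative; the retention-vs-throughput call still belongs to the eval set:

```python
# Sketch only: loading a causal LM in 4-bit NF4 via BitsAndBytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb
)
```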
PROOF
Published embedding work, applied at scale.
RankSaga's BEIR fine-tuning work delivered up to 51 percent retrieval improvement and produced the open-source RankSaga-Optimised-E5-v2 model on HuggingFace. The methodology (MNR loss, hard-negative mining, eval-driven iteration) runs against customer data inside customer environments under NDA.
RANKSAGA · BEIR BENCHMARK · 2026
· Up to 51 percent BEIR retrieval lift, published.
· Open-source RankSaga-Optimised-E5-v2 model on HuggingFace.
· Customer fine-tuning runs inside customer VPC, on-premise, and air-gapped environments.
· Same engineering team that operates AI in live ADF deployment.
RELATED CAPABILITIES
Where fine-tuning connects.
Adjacent
Embedding Model Optimisation →
Embedding-specific fine-tuning for retrieval. The capability behind the BEIR result.
Adjacent
AI Evals & Observability →
The eval framework that fine-tuning is measured against. The contract between training and production.
Adjacent
Retrieval-Augmented Generation →
Where a fine-tuned generation model often lives. Grounded answering, refusal, attribution.
QUESTIONS
What customers ask before we start.
When should we fine-tune vs prompt-engineer?
Prompt-engineer first. Fine-tune when prompting alone has plateaued on the eval set, when the cost of context tokens is a real budget constraint, when latency requires a smaller model, or when output structure has to be guaranteed across a high-volume task. We help size that decision honestly rather than pushing the more expensive option.
How much training data do we need?
For embedding fine-tuning, hundreds to a few thousand high-quality positive pairs (often mined from logs) are usually enough. For LLM fine-tuning, a few hundred well-labelled examples are often enough for narrow tasks; a few thousand are needed for broader instruction-following. Synthetic data and active labelling are part of every engagement.
LoRA, QLoRA, or full-parameter?
It depends on the target quality, the deployment surface, and the cost of training. LoRA when the storage and serving footprint matters, QLoRA when training-time GPU memory is the constraint, full-parameter when the quality lift requires it. We measure the trade-off rather than defaulting.
Can the fine-tuned model stay inside our environment?
Yes. Training runs inside customer VPC, on-premise, or air-gapped environments where required. Training data and the resulting model never leave the customer boundary.
How do we know fine-tuning actually worked?
Against the eval set agreed in the discovery sprint. Recall@k and MRR for embeddings, application-level metrics (task accuracy, citation correctness, refusal appropriateness) for LLMs. The eval harness ships with the model and runs as a regression suite on every change.
Can you distill an OpenAI or Anthropic model into a smaller open-weight one?
Yes, in the form of distillation from outputs. The teacher model generates responses to a representative query distribution, the student is fine-tuned to match those responses, and quality retention is measured against the eval set. Common when production cost or latency rules out the teacher.
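For illustration, the data-generation half of that loop. Client, model name, and file path are illustrative; the query set should mirror the production distribution:

```python
# Sketch only: building a student training set from teacher outputs.
import json
from openai import OpenAI

client = OpenAI()  # teacher API; any hosted model with an SDK works the same way
queries = ["Classify this support ticket: ...", "Extract the invoice fields: ..."]

with open("distill_train.jsonl", "w") as f:
    for q in queries:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": q}]
        )
        # Each (prompt, teacher answer) pair becomes one student SFT example.
        record = {"prompt": q, "completion": resp.choices[0].message.content}
        f.write(json.dumps(record) + "\n")
```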
ENGAGE
If your model cost or latency is fighting your quality target, we want to look at it.
The most common fine-tuning engagement we see starts with a quality target the team has hit on a model that is too expensive to serve. The work is to find the smallest model that holds the bar.