AI SERVICES · RETRIEVAL-AUGMENTED GENERATION
RAG that holds up when an auditor asks where the answer came from.
RankSaga builds end-to-end RAG systems for regulated enterprises. Retrieval grounded in your corpus, generation constrained to what was retrieved, citations rendered against the original source, and the evaluation framework that catches a hallucination the moment one slips through.
The hardest thing about a RAG system is not getting it to answer. It is getting it to refuse, to cite, and to be measurably honest about what it does and does not know. That is the line between a demo and a system you can put in front of a regulator.
WHY THIS MATTERS
RAG is a system, not a prompt template.
Almost every team building enterprise RAG starts in the same place: a notebook that stitches together an embedding call, a vector database, and a prompt. The notebook works on the demo question. It fails on the second customer query, the one with an ambiguous referent, the one with conflicting source documents, the one where the right answer is to refuse. That gap is not closed by switching models. It is closed by engineering the surrounding system.
RankSaga's RAG work concentrates on the parts of the pipeline that determine whether the system survives audit and production traffic. Retrieval quality measured against a held-out eval set. Context construction that prevents truncation from losing the answer. Generation prompts that include refusal instructions and structured-output contracts. Citation rendering that links every claim back to the passage it came from. And the evaluation framework that scores groundedness, attribution correctness, and refusal appropriateness on every change.
The work composes with our other capabilities. Vector Database Management owns the index. Embedding Model Optimisation owns the retrieval quality. Semantic Search & Retrieval owns the retrieval pipeline. RAG owns the generation layer that consumes them, and the evaluation layer that keeps the whole thing honest.
WHAT WE SHIP
Six concrete pieces of work.
01 / Capability
Grounded Generation
Generation constrained to the retrieved context. Refusal when the answer is not in the corpus, structured outputs when downstream systems consume them, and the prompt and decoding strategy that minimises hallucination.
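As a rough illustration, the generation contract can be as small as a prompt with an explicit refusal sentinel and a parser that treats anything off-contract as a refusal. A minimal sketch; the prompt wording, JSON fields, and function names are assumptions, not a fixed RankSaga interface:

```python
# Sketch: a grounded-generation prompt with an explicit refusal instruction
# and a structured-output contract. All names and wording are illustrative.
import json

GROUNDED_SYSTEM_PROMPT = """\
Answer ONLY from the numbered context passages below.
If the passages do not fully support an answer, respond with
{"answer": null, "refusal_reason": "not_in_corpus"}.
Return JSON with fields: answer, supporting_passages (list of passage ids).
"""

def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble the generation prompt from the retrieved passages."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    return f"{GROUNDED_SYSTEM_PROMPT}\nContext:\n{context}\n\nQuestion: {question}"

def parse_answer(raw: str) -> dict:
    """Enforce the structured-output contract; treat malformed output as a refusal."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"answer": None, "refusal_reason": "malformed_output"}
    if obj.get("answer") is None:
        obj.setdefault("refusal_reason", "not_in_corpus")
    return obj
```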
02 / Capability
Citation & Attribution
Every claim in a generated answer linked to the source passage it came from. Span-level attribution where the application supports it, paragraph-level where it does not.
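A minimal sketch of what such an attribution record can look like, assuming character-offset spans over plain-text answers and passages; the field names are hypothetical:

```python
# Sketch: a span-level attribution record linking a claim in the answer
# back to the passage it came from. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Attribution:
    claim: str                      # the span of answer text being attributed
    answer_span: tuple[int, int]    # character offsets within the answer
    doc_id: str                     # source document identifier
    passage_id: str                 # chunk identifier within the document
    passage_span: tuple[int, int]   # character offsets within the passage

def render_citation(a: Attribution) -> str:
    """Paragraph-level fallback for applications that cannot highlight spans."""
    return f'"{a.claim}" [source: {a.doc_id}#{a.passage_id}]'
```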
03 / Capability
Context Construction
Token budget management, passage de-duplication, ordering by relevance vs recency, and the chunk-merging logic that prevents the answer from being split across two truncated retrievals.
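A minimal sketch of that packing logic, under simplified assumptions (four characters per token, exact-match de-duplication, highest retrieval score first); the heuristics are illustrative, not the production implementation:

```python
# Sketch: pack retrieved passages into a token budget, de-duplicating exact
# repeats and merging adjacent chunks from the same document. Heuristics are
# illustrative.
def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude estimate; swap in the model's tokenizer

def pack_context(passages: list[dict], token_budget: int) -> list[dict]:
    """passages: [{'doc_id', 'chunk_idx', 'text', 'score'}, ...]"""
    packed: list[dict] = []
    seen: set[str] = set()
    used = 0
    for p in sorted(passages, key=lambda x: x["score"], reverse=True):
        if p["text"] in seen:
            continue  # exact duplicate already packed
        cost = approx_tokens(p["text"])
        if used + cost > token_budget:
            break
        # merge with an already-packed neighbouring chunk from the same doc,
        # so an answer split across two chunks arrives as one passage
        neighbour = next((q for q in packed if q["doc_id"] == p["doc_id"]
                          and abs(q["chunk_idx"] - p["chunk_idx"]) == 1), None)
        if neighbour:
            ordered = sorted([neighbour, p], key=lambda x: x["chunk_idx"])
            neighbour["text"] = "\n".join(x["text"] for x in ordered)
        else:
            packed.append(dict(p))
        seen.add(p["text"])
        used += cost
    return packed
```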
04 / Capability
Hallucination Detection
Groundedness scoring on every generated answer in CI and in production. NLI-based, retrieval-overlap, or LLM-as-judge depending on the latency and cost budget. Drift detection over time.
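As a rough illustration, the retrieval-overlap variant can be sketched in a few lines; the sentence splitter and the 0.6 overlap threshold are assumptions:

```python
# Sketch: the retrieval-overlap detector, the cheapest of the three families
# named above. The 0.6 overlap threshold is an illustrative assumption.
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_groundedness(answer: str, passages: list[str]) -> float:
    """Fraction of answer sentences with sufficient token overlap in some passage."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences or not passages:
        return 0.0
    passage_tokens = [_tokens(p) for p in passages]
    grounded = sum(
        1 for s in sentences
        if (st := _tokens(s)) and max(len(st & pt) / len(st) for pt in passage_tokens) >= 0.6
    )
    return grounded / len(sentences)
```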
05 / Capability
Evaluation Harness
Labelled eval set drawn from real queries. Groundedness, attribution correctness, refusal appropriateness, and the application-level metric (citation correctness, task completion). Regression suite on every change.
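A minimal sketch of the shape of an eval record and the regression gate run on every change, assuming three headline metrics; the field names, behaviour labels, and tolerance are illustrative:

```python
# Sketch: one labelled eval record and the regression gate run on every change.
# Field names, behaviour labels, and the tolerance are illustrative.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    query: str                    # drawn from real production traffic
    expected_behaviour: str       # "answer" | "refuse" | "clarify"
    gold_passage_ids: list[str]   # passages a correct answer must cite

@dataclass
class EvalScores:
    groundedness: float
    attribution_correctness: float
    refusal_appropriateness: float

def regression_gate(current: EvalScores, baseline: EvalScores, tolerance: float = 0.01) -> bool:
    """Fail the change if any headline metric regresses beyond the tolerance."""
    return all(
        getattr(current, m) >= getattr(baseline, m) - tolerance
        for m in ("groundedness", "attribution_correctness", "refusal_appropriateness")
    )
```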
06 / Capability
Production RAG Service
The API, the streaming layer, the caching, the observability, and the runbooks. Latency under load, cost per query, and the operational posture to keep the system stable.
HOW WE OPERATE
Build the eval before you build the system.
01 / Step
Eval First
We label a representative eval set from your real queries before we touch generation. Groundedness, attribution, refusal cases. The eval is the contract between us and the production system.
02 / Step
Build the Pipeline
Retrieval, context construction, generation, citation, refusal logic, all measured against the eval set on every change. Each component change has an attributable contribution to the metric.
03 / Step
Operate Under Load
Production RAG service inside your environment. Drift detection on groundedness, cost and latency observability, and the next round of tuning as the corpus and query mix evolve.
WHAT YOU GET
A RAG system that holds up.
01 / Deliverable
A grounded, cited, audit-ready production system
Inside your environment. Generation constrained to the corpus, citations rendered against source, refusal when the answer is not there.
02 / Deliverable
An evaluation framework you own
Groundedness, attribution, refusal appropriateness, plus the application metric. Regression suite, dashboards, alerting on drift.
03 / Deliverable
Hallucination guardrails in production
Detection wired into the inference path. Bad answers are caught at generation time, not by a customer ticket three weeks later.
04 / Deliverable
Cost and latency observability
Per-query cost, p95 latency, retrieval vs generation split, and the dashboards your team uses to operate the system.
PROOF
The retrieval foundation is published. The generation discipline is the same.
RankSaga's BEIR work demonstrates the retrieval methodology in the open. The same engineering discipline (measured baselines, attributable iteration, regression-tested deployments) runs against customer corpora and production RAG systems inside customer environments under NDA.
RANKSAGA · BEIR BENCHMARK · ADF DEPLOYMENT · 2026
- Published BEIR retrieval methodology and results.
- Same engineering team that operates AI in a live ADF deployment.
- RAG systems shipped inside customer VPC, on-premise, and air-gapped environments.
- Eval-first engagement model with a labelled set as the contract.
RELATED CAPABILITIES
Where RAG connects.
Adjacent
Semantic Search & Retrieval →
The retrieval pipeline RAG depends on. Chunking, hybrid retrieval, re-ranking, query understanding.
Adjacent
AI Evals & Observability →
The eval and monitoring infrastructure that keeps RAG systems honest in production.
Adjacent
Agentic Systems & Tool Use →
When RAG is one tool among several an agent can call.
QUESTIONS
What customers ask before we start.
Why not just use a hosted RAG product?
Hosted RAG products sell defaults. They are excellent when the defaults fit. They are limiting when your corpus has unusual structure, when residency or auditability matters, when you need attribution down to the span level, or when the application metric is downstream of generation in a way the product was not designed for. We build the system around your constraints rather than fitting your constraints to the product.
How do you handle hallucinations?
Three layers. Constrained generation (the prompt and decoding strategy reduce the probability of ungrounded text). Detection (every generated answer is scored for groundedness against the retrieved context). Refusal (when groundedness is below threshold, the system refuses or asks a clarifying question rather than guessing). Each layer is measured against the eval set.
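A minimal sketch of the third layer, assuming a pluggable groundedness scorer from the detection layer; the 0.8 threshold and the refusal payload shape are illustrative:

```python
# Sketch: the refusal layer. When the groundedness score of a drafted answer
# falls below threshold, refuse rather than guess. The threshold and payload
# are illustrative assumptions.
from typing import Callable

def gate_answer(draft: str, passages: list[str],
                scorer: Callable[[str, list[str]], float],
                threshold: float = 0.8) -> dict:
    score = scorer(draft, passages)  # any detector from the detection layer
    if score < threshold:
        return {"answer": None,
                "refusal_reason": "insufficient_grounding",
                "groundedness": score}
    return {"answer": draft, "groundedness": score}
```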
Can the system refuse to answer?
Yes, and for regulated workloads it must. Refusal logic is part of every RAG system we build. The system distinguishes between 'the answer is not in the corpus', 'the question is ambiguous', and 'this question is out of scope', and it returns the right surface for each. Refusal appropriateness is a measured metric.
How do you measure groundedness in production?
We use a combination depending on the latency and cost budget. NLI-based scoring (fast, cheap, lower precision), retrieval-overlap (almost free, lower recall on paraphrase), and LLM-as-judge sampling (slow, expensive, high quality, run on a sampled fraction of traffic). Drift over time is the metric that triggers investigation.
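A minimal sketch of how the cheap and expensive detectors can be combined in the serving path; the 2% sample rate and the queue interface are assumptions:

```python
# Sketch: cheap scoring on every answer, LLM-as-judge on a sampled fraction.
# The sample rate and the queue interface are illustrative assumptions.
import random

def score_in_production(answer: str, passages: list[str],
                        cheap_scorer, judge_queue, sample_rate: float = 0.02) -> float:
    score = cheap_scorer(answer, passages)   # runs on 100% of traffic
    if random.random() < sample_rate:
        judge_queue.put((answer, passages))  # async LLM-as-judge on the sample
    return score
```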
What models do you use for the generation layer?
Foundation models from OpenAI, Anthropic, Google Vertex, AWS Bedrock, Cohere, and Mistral, plus open-weight models (Llama, Qwen, Mixtral) we deploy inside customer environments. Choice is driven by the latency budget, the cost per query, the residency posture, and the quality target on the eval set.
Can you work inside our VPC, on-premise, or air-gapped environment?
Yes. Our defence practice ships air-gapped RAG systems with offline-deployable models. The same engineering team handles VPC and on-premise deployments for commercial customers.
ENGAGE
If your RAG demo is convincing but the production version is not, we want to look at it.
The most common engagement we see starts with a working demo and a stalled production rollout. The gap is almost always evaluation, attribution, and the operational posture. Engagements typically begin with a labelled eval set.