Avishek Anand · TU Delft

Towards Self-Improving Retrieval Augmented Systems


Building retrieval systems with industry. Explaining them in academia.


My research — retrieval-augmented AI systems

Systems that retrieve documents, passages, or knowledge at query time to answer questions, verify claims, and make decisions.


I'm known for Explainable IR

Foundational methods for understanding why retrieval and ranking models decide what they decide.

Algorithms I introduced are now standard tools in the field.


But I have a split personality

For 15+ years, in parallel with explainability: scaling and efficiency of retrieval — inside and alongside industry.

Algorithms running in production at companies you've heard of.


What goes wrong in production RAG today

Three failure modes — all three show up in this single session.

I'll walk through each one with a concrete example.

Failure 1 — The recall ceiling

The relevant document never makes it into the retriever's top-1000 candidates.

The reranker can't fix what it never sees.

Failure 2 — The drift problem

Modern systems try to fix recall with query reformulations.

Generate several rephrasings of the original query. Retrieve for each. Merge.

The intuition: more queries, more recall.

The reality: most reformulations drift.
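The merge step is where drift takes over. A minimal sketch with reciprocal rank fusion (RRF) — document IDs and rankings below are made up for illustration:

```python
# Sketch: fusing per-reformulation result lists with reciprocal rank fusion.
# The document IDs are hypothetical; the point is that several drifting lists
# can outvote one useful list.

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One useful reformulation; three drifty ones that agree on off-intent docs.
useful  = ["d_top_rated", "d_good_value"]
drift_1 = ["d_luxury", "d_tourist_trap"]
drift_2 = ["d_luxury", "d_disneyland"]
drift_3 = ["d_tourist_trap", "d_luxury"]

fused = rrf_fuse([useful, drift_1, drift_2, drift_3])
print(fused[0])  # "d_luxury" — the drifty consensus outranks the useful docs
```

Noise dominates signal precisely because fusion treats every reformulation's vote as equally trustworthy.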


Reformulating "good hotels in Tokyo"

| Reformulation | What it actually retrieves |
| --- | --- |
| "luxury hotels in Tokyo" | drifts toward expensive — wrong for most users |
| "highly rated hotels in Tokyo" | usually most useful |
| "popular hotels in Tokyo" | drifts toward tourist traps |
| "hotels near Tokyo Disneyland" | drifts away from the actual intent |

Concatenate or rank-fuse all of them, and noise dominates signal.

More reformulations → more drift — current query reformulation methods are not the answer.

Failure 3 — The cost wall

The obvious response to failures 1 and 2: use a stronger model.

A reasoning-model reranker reads the document, the query, the constraints — and judges relevance like a human would.

It works. It's also prohibitively expensive.

So you're back to the recall ceiling — gated by whatever the cheap retriever surfaces.

Better rerankers don't fix bad retrievers. They make the gap more expensive to ignore.

What these failures share

All three come from the same root cause:

The system makes one-shot decisions.

The retriever decides once. Reformulation happens once. The reranker reads what it's given.

No stage's signal is fed back to revise the stages before it.


The principle

Feedback from later stages can — and should — drive decisions in earlier stages.

The reranker's relevance estimates can re-prioritize the retriever's candidates.
The downstream generator's confidence can score upstream evidence.
The cost of judgments can budget the depth of search.


In the remaining slides: what feedback to use, how to use feedback, and what to optimize using feedback.

The power of feedback


The reranker is a better estimator of relevance

Learning to Rank has formalized this for over 20 years.

The reranker weights dozens of signals — query–title match, semantic similarity, price, location, review count, image quality — instead of one BM25 score.

Modern era: BERT reads the full document with cross-encoder attention. LLMs judge relevance with human-level reading comprehension.

The reranker is not a different problem from retrieval — it is a richer estimator of the same thing. Better features, more compute, fewer documents.
What feedback should we use — and how do we reorganize our indexes to react to it?

Use the re-ranking signal as feedback.


When the LLM gives many semantically different answers, the evidence is bad.
When it gives consistent answers, the evidence is good.
Uncertainty itself becomes the feedback signal.
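A toy sketch of this signal. Real systems (SUNAR-style) cluster sampled answers by semantic equivalence with an NLI or embedding model; here, normalized exact match is a stand-in, and the sampled answers are invented:

```python
# Sketch: answer consistency as an evidence-quality signal. Normalized string
# match is a toy proxy for semantic clustering of sampled LLM answers.
from collections import Counter

def answer_uncertainty(samples):
    """Fraction of sampled answers that disagree with the majority answer."""
    counts = Counter(s.strip().lower() for s in samples)
    majority = counts.most_common(1)[0][1]
    return 1.0 - majority / len(samples)

good_evidence = ["Paris", "paris", "Paris", "Paris"]
bad_evidence  = ["Paris", "Lyon", "Marseille", "Paris"]

print(answer_uncertainty(good_evidence))  # 0.0 — consistent answers
print(answer_uncertainty(bad_evidence))   # 0.5 — evidence is likely poor
```

High disagreement flags the retrieved evidence for revision — without any ground-truth labels.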

With QUAM and SUNAR, we understood what kind of feedback signals could be used — and how to organize a representation space and explore it cleverly to improve retrieval.
But we are too tied to the representation space. What if there are different aspects of relevance, just like in Learning to Rank?

Different queries need different signals

"Marriott London Bridge" — exact match. Lexical / BM25 wins.
Neighborhood signals just add noise.

"romantic weekend getaway near Amsterdam" — vague. Lexical fails.
Embedding similarity and document neighborhood matter.

"kid-friendly hotel with pool, walking distance to Sagrada Familia, under €200" — multi-constraint. No single signal is enough.


The standard pipeline picks one weighting of these signals at training time.

That weighting is wrong for most queries.

What if the system picked the weighting per query, online, from reranker feedback?


Algorithm 3 — ORE — Online Relevance Estimation

QUAM and SUNAR pick the next batch using a fixed affinity structure.

ORE goes further: it learns online, per query, which signals matter for this query.

A linear stochastic bandit over the candidate pool. Each document is an arm; features are simple, well-known relevance factors; rewards are the expensive ranker's scores.

[8] Rathee, Venktesh V, MacAvaney, Anand. Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets. SIGIR 2025.
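A minimal LinUCB-flavoured sketch of the ORE setup — not the paper's algorithm, just the structure it describes. Feature values and the "expensive ranker" are synthetic:

```python
import numpy as np

# Sketch: a linear bandit over the candidate pool. Each document is an arm
# whose features are cheap relevance signals; the reward of pulling an arm
# is the expensive ranker's score. All data here is synthetic.
rng = np.random.default_rng(0)
n_docs, d = 200, 3
X = rng.random((n_docs, d))             # cheap per-document signals
theta_true = np.array([0.1, 0.8, 0.1])  # this query mostly rewards signal 2
expensive_score = X @ theta_true        # stand-in for the costly ranker

A = np.eye(d)                           # ridge Gram matrix
b = np.zeros(d)
scored = set()
for _ in range(40):                     # budget: 40 expensive ranker calls
    theta = np.linalg.solve(A, b)       # current estimate of signal weights
    A_inv = np.linalg.inv(A)
    ucb = X @ theta + 0.5 * np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
    ucb[list(scored)] = -np.inf         # never re-score a document
    arm = int(np.argmax(ucb))
    r = expensive_score[arm]            # one expensive call
    A += np.outer(X[arm], X[arm])
    b += r * X[arm]
    scored.add(arm)

theta = np.linalg.solve(A, b)
print(np.argmax(theta))  # the bandit learns that signal 2 dominates this query
```

The learned weighting is per-query and online: a fresh bandit starts for the next query, which may reward entirely different signals.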

But signals aren't the only thing we can pick per query

ORE asks: given this query, which relevance signals should I trust?

Now ask the same question one level up.


ReformIR — picking reformulations per query

Failure 2: more reformulations → more drift.

Remember the Tokyo example? Four reformulations, one of them useful, three of them drifty.

The fix isn't fewer reformulations. It's picking which ones to trust, per query, online, from reranker feedback.

[3] Venktesh V, Rathee, Anand. When More Reformulations Hurt: Avoiding Drift using Ranker Feedback. SIGIR 2026.

ReformIR — same idea as ORE, one level up

ORE puts relevance signals in the bandit's feature vector.

ReformIR puts reformulations in the bandit's feature vector.

The reranker always scores against the original query, anchoring the system against drift.

Drifty reformulations get downweighted. Useful ones get upweighted.
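A deliberately simplified sketch of the anchoring idea (assumed mechanics, not the paper's algorithm): weight each reformulation by how well its retrieved documents score under a reranker that always judges against the original query. All scores and IDs are invented:

```python
# Sketch: weighting reformulations by original-query reranker feedback.
# Reranker scores are always w.r.t. "good hotels in Tokyo" — the anchor.

def reformulation_weight(docs, rerank):
    """Mean original-query reranker score of a reformulation's top docs."""
    return sum(rerank[d] for d in docs) / len(docs)

rerank = {"d_top_rated": 0.9, "d_good_value": 0.8,
          "d_luxury": 0.3, "d_tourist_trap": 0.2, "d_disneyland": 0.1}

lists = {"highly rated hotels in Tokyo": ["d_top_rated", "d_good_value"],
         "luxury hotels in Tokyo":       ["d_luxury", "d_tourist_trap"],
         "hotels near Tokyo Disneyland": ["d_disneyland", "d_luxury"]}

weights = {q: reformulation_weight(docs, rerank) for q, docs in lists.items()}
best = max(weights, key=weights.get)
print(best)  # "highly rated hotels in Tokyo" — the non-drifting reformulation
```

Because the reranker never sees the reformulated query text, a reformulation can only earn weight by surfacing documents that satisfy the original intent.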


This is the paradigm shift

Stop fixing the system at training time.

Per-query, online, from reranker feedback:
Which relevance signals to trust → ORE
Which reformulations to keep → ReformIR
Which neighbors to expand → QUAM, SUNAR
[7] QUAM (WSDM 2025) · [15] SUNAR (NAACL 2025) · [8] ORE (SIGIR 2025) · [3] ReformIR (SIGIR 2026)
Retrieval-augmented AI systems need to be feedback-aware and learn on a per-query basis.

Bandits give us the language for all of it

Same principle, different decisions, query by query:

  • Which query formulation to issue → ReformIR
  • Which documents to score → ORE, QUAM, SUNAR
  • Which judgments to spend budget on → all of the above

Not "how do I train a better retriever" — but "how do I deploy compute optimally, query by query?"


The science: subset selection under uncertainty

Every algorithm I just showed reduces to the same question:

Given a large pool and a tiny budget, which items deserve the expensive call?

This is top-m arms identification in stochastic linear bandits.

The bottleneck: the candidate pool is exponentially large. Classical bandit algorithms compare every arm against every other — infeasible at retrieval scale.


Explainable Information Retrieval


Ways in which you can use explanations


EXPLORA — Choosing Explanations

You have a pool of N explanations. In-context learning needs a k-subset — the right k demonstrations decide whether the LLM reasons correctly or not.

The brute-force search: evaluate all subsets with the LLM.
At N = 100, k = 4, that's 3.9 million LLM calls.

Once again, EXPLORA frames this as bandit arm selection.

Each k-subset is an arm. An LLM call is a pull. The arms are exponentially many — classical bandit algorithms that compare every pair are infeasible.

[6] Purohit, Venktesh V, Devalla, Yerragorla, Bhattacharya, Anand. EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning. EMNLP 2024, pp. 5367–5388.

CASE — what EXPLORA leaves open

EXPLORA explores efficiently but never answers a fundamental question:

When have you found the best subset?
How do you know when to stop?

Without confidence information, EXPLORA can't eliminate arms — it keeps spending budget on subsets that are clearly suboptimal.

[4] Purohit, Venktesh V, Bhattacharya, Anand. Sample Efficient Demonstration Selection for In-Context Learning. ICML 2025.

Explain-and-predict isn't always perfect

The classical recipe — extract a rationale, then predict from it — is faithful by construction. But faithfulness is not free.

Recent work shows that the joint optimization can succeed without leaking the label or the feature itself into the rationale — and that the resulting predictors are competitive with their non-interpretable counterparts.

Interpretable retrieval is no longer paying a quality tax —
but the joint problem is delicate, and most pipelines still get it wrong.
Oosterhuis, Lyu, Anand. Local Feature Selection without Label or Feature Leakage for Interpretable Machine Learning Predictions. ICML 2024. arXiv:2407.11778

RAG faithfulness isn't perfect either

LLMs can produce a fluent answer, cite the right passage, and still be unfaithful — the citation explains the answer post hoc rather than supporting how the answer was actually derived.

We call this post-rationalization: the model decides, then dresses the decision in a plausible attribution.

Correct ≠ faithful.
A right answer with the wrong reason is still the wrong system.
[16] Wallat, Heuss, de Rijke, Anand. Correctness is not Faithfulness in Retrieval Augmented Generation Attributions. ICTIR 2025, pp. 22–32. DOI: 10.1145/3731120.3744592

Towards AutoIR

A new subfield: self-improving retrieval systems.


A retrieval system that:

  • generates the data it needs to improve
  • evaluates itself in a way that aligns with deployment
  • optimizes itself, query by query, over time

Personas — generating behavior, not labels

Most synthetic-data work generates labels:
"Is document D relevant to query Q? Yes / No."

That's not how users behave.


We generate personas:

  • the budget traveler vs the luxury traveler
  • the planner vs the last-minute booker
  • the comparison shopper vs the decisive buyer

Each persona issues queries, reformulates, clicks, abandons —
the way users actually do.

We don't simulate labels.
We simulate users.

The system creates the data it needs to improve.

Static benchmarks → continuous self-evaluation.


Where this leaves us

  • Pipelines → systems
  • Retrieval → decisions under uncertainty
  • Static benchmarks → self-improving loops
RAG today is a pipeline.
RAG tomorrow is a system that learns.

I work with a small number of companies a year on exactly this.
If any of this is interesting to you, let's talk afterwards.


References

[1] Rathee, Venktesh V, MacAvaney, Anand. Reproducing Adaptive Reranking for Reasoning-Intensive IR. SIGIR 2026 (to appear).

[2] Rathee, Venktesh V, MacAvaney, Anand. Test-time Corpus Feedback: From Retrieval to RAG. Findings of ACL: EACL 2026, pp. 5637–5656.

[3] Venktesh V, Rathee, Anand. When More Reformulations Hurt: Avoiding Drift using Ranker Feedback. SIGIR 2026 (to appear).

[4] Purohit, Venktesh V, Bhattacharya, Anand. Sample Efficient Demonstration Selection for In-Context Learning. ICML 2025.

[5] Rathee, MacAvaney, Anand. Guiding Retrieval Using LLM-Based Listwise Rankers. ECIR 2025, pp. 230–246. DOI: 10.1007/978-3-031-88708-6_15.

[6] Purohit, Venktesh V, Devalla, Yerragorla, Bhattacharya, Anand. EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning. EMNLP 2024, Miami, FL, pp. 5367–5388. DOI: 10.18653/V1/2024.emnlp-main.310.

[7] Rathee, MacAvaney, Anand. Quam: Adaptive Retrieval through Query Affinity Modelling. WSDM 2025, pp. 954–962. DOI: 10.1145/3701551.3703584.

[8] Rathee, Venktesh V, MacAvaney, Anand. Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets. SIGIR 2025, pp. 2287–2297. DOI: 10.1145/3726302.3729910.


References (cont.)

[9] Yoon, Kim, Kwon, Anand, Hwang. On Listwise Reranking for Corpus Feedback. WSDM 2026, pp. 1273–1277. DOI: 10.1145/3773966.3779404.

[10] Anand, Saha, Venktesh V. Explainable Information Retrieval. ECIR 2025, pp. 254–261. DOI: 10.1007/978-3-031-88720-8_40.

[11] Chungkham, Venktesh V, Setty, Anand. Think Right, Not More: Test-Time Scaling for Numerical Claim Verification. Findings of ACL: EMNLP 2025, pp. 24345–24363.

[12] Heuss, de Rijke, Anand. RankingSHAP — Faithful Listwise Feature Attribution Explanations for Ranking Models. SIGIR 2025, pp. 381–391. DOI: 10.1145/3726302.3729971.

[13] Nanhekhan, Venktesh V, Martin, Vatndal, Setty, Anand. FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking. ECIR 2025, pp. 385–399. DOI: 10.1007/978-3-031-88717-8_28.

[14] Saha, Agarwal, Venktesh V, Anand et al. ir_explain: A Python Library of Explainable IR Methods. SIGIR 2025, pp. 3563–3572. DOI: 10.1145/3726302.3730343.

[15] Venktesh V, Rathee, Anand. SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA. NAACL 2025, pp. 5818–5835. DOI: 10.18653/V1/2025.NAACL-LONG.300.

[16] Wallat, Heuss, de Rijke, Anand. Correctness is not Faithfulness in Retrieval Augmented Generation Attributions. ICTIR 2025, pp. 22–32. DOI: 10.1145/3731120.3744592.


That's it!


First-generation RAG — where it breaks

These systems work. But they break in systematic, reproducible ways.

I've seen the same three failures across search engines, fact-checkers, and recommendation systems.

Retrieval here isn't a step. It's a process.

CASE — what EXPLORA leaves open

EXPLORA explores efficiently but never answers a fundamental question:

When have you found the best subset?
How do you know when to stop?

Without confidence information, EXPLORA can't eliminate arms — it keeps spending budget on subsets that are clearly suboptimal.

CASE (ICML 2025) adds gap-index theory:

  • Tracks a shortlist of challenger arms — subsets still statistically competitive with the current best
  • Arms with a large performance gap below the leader are eliminated early
  • The gap-index bounds the expected number of pulls needed to certify the best arm
  • Algorithm terminates with a provable optimality guarantee

Same idea as EXPLORA, with convergence certificates. Fewer queries. Broader tasks.
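A toy illustration of gap-based elimination (the flavour of argument CASE builds on; the exact CASE procedure differs). Arm means, noise, and bound constants are invented:

```python
import math, random

# Sketch: confidence-interval elimination over three "subset" arms.
# An arm is dropped once its upper bound falls below the leader's lower bound.
random.seed(0)
true_means = [0.2, 0.3, 0.8]               # arm 2 is the best subset
pulls, sums = [0, 0, 0], [0.0, 0.0, 0.0]

def bound(i, t, sign):
    width = math.sqrt(2 * math.log(t + 1) / pulls[i])
    return sums[i] / pulls[i] + sign * width

alive, t = {0, 1, 2}, 0
while len(alive) > 1 and t < 5000:
    for i in list(alive):                  # pull every surviving arm once
        pulls[i] += 1
        sums[i] += true_means[i] + random.gauss(0, 0.1)  # noisy reward
        t += 1
    leader = max(alive, key=lambda i: sums[i] / pulls[i])
    for i in list(alive):
        if i != leader and bound(i, t, +1) < bound(leader, t, -1):
            alive.discard(i)               # certifiably suboptimal: stop paying

print(alive)  # only the best arm survives, well within the pull budget
```

The point of the gap index is exactly this certificate: budget stops flowing to arms that can no longer be best, and the loop terminates with a provable winner.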


EXPLORA — learning to score subsets

A scoring function σ(α, S) predicts the quality of subset S.
α is unknown — the parameters we learn from LLM feedback.

The loop:

  1. Sample a batch of candidate subsets from the pool
  2. Query the LLM on a handful — observe accuracy as reward
  3. Fit α by minimizing prediction loss over observed rewards
  4. Use σ(α, ·) to guide which subsets to evaluate next
  5. Repeat until budget exhausted

No confidence bounds. No elimination. Explores by loss minimization alone.
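The loop above can be sketched with a toy "LLM" whose accuracy on a subset is simulated from hidden per-exemplar utilities — everything here is invented, and σ(α, S) is linear in a bag-of-exemplars vector:

```python
import random

# Sketch of an EXPLORA-style loop: learn a subset scorer from a few simulated
# "LLM" evaluations, then let it guide which subsets to try next.
random.seed(0)
N, k = 12, 3
utility = [random.random() for _ in range(N)]     # hidden per-exemplar value

def llm_accuracy(S):                              # one (expensive) LLM call
    return sum(utility[i] for i in S) / k

alpha = [0.0] * N                                 # learned scorer parameters
observed = []
for _ in range(25):                               # budget: 25 LLM calls
    # 1-2. sample candidate subsets; evaluate the one sigma(alpha,.) likes best
    cands = [random.sample(range(N), k) for _ in range(20)]
    S = max(cands, key=lambda c: sum(alpha[i] for i in c))
    r = llm_accuracy(S)
    observed.append((S, r))
    # 3. fit alpha by squared-loss SGD over observed rewards
    for _ in range(50):
        Si, ri = random.choice(observed)
        err = sum(alpha[i] for i in Si) / k - ri
        for i in Si:
            alpha[i] -= 0.1 * err / k

best_S, best_r = max(observed, key=lambda x: x[1])
print(sorted(best_S), round(best_r, 3))
```

As the slide says: exploration here is driven by loss minimization alone — no confidence bounds, no elimination, no stopping certificate.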

Results on AquaRat · FinQA · GSM8K · TabMwp:
Uses ~11% of the LLM calls required by prior SOTA · +12.24% accuracy

[24] Purohit et al. EXPLORA, EMNLP 2024.

Algorithm 1 — QUAM

Failure 1: the recall ceiling — the relevant doc never made it into the top-1000.

QUAM asks the reranker to teach the retriever:

  1. Rerank a small batch
  2. For high-scoring docs, look up their learned-affinity neighbors
  3. Score neighbors by set affinity — proximity to the set of already-relevant docs
  4. Send the top neighbors to the reranker next
  5. Repeat until budget exhausted

The learned affinity graph encodes co-relevance, not just textual similarity.
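The five steps above can be sketched on a toy affinity graph — the graph, reranker scores, and budget below are synthetic, and real QUAM prioritizes neighbors by set affinity rather than this simple threshold:

```python
# Sketch of a QUAM-style adaptive loop: high-scoring docs pull their
# affinity-graph neighbors into the reranking queue. All data is synthetic.
rerank_score = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.1, "e": 0.95, "f": 0.3}
neighbors = {"a": ["c", "d"], "c": ["e"], "e": ["f"], "b": [], "d": [], "f": []}

frontier = ["a", "b"]        # initial batch from the retriever's top results
seen, results = set(), {}
budget = 5
while frontier and budget > 0:
    doc = frontier.pop(0)
    if doc in seen:
        continue
    seen.add(doc)
    results[doc] = rerank_score[doc]   # one expensive reranker call
    budget -= 1
    if results[doc] > 0.5:             # promising: enqueue affinity neighbors
        frontier.extend(n for n in neighbors[doc] if n not in seen)

ranking = sorted(results, key=results.get, reverse=True)
print(ranking)  # "e" tops the list even though the retriever never returned it
```

This is the recall ceiling being broken from above: the reranker's own scores steer the retriever toward documents it would otherwise never surface.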

[7] Rathee, MacAvaney, Anand. Quam: Adaptive Retrieval through Query Affinity Modelling. WSDM 2025.
