Avishek Anand · TU Delft

Towards Self-Improving Retrieval Augmented Systems


Building retrieval systems with industry. Explaining them in academia.


My research — retrieval-augmented AI systems

Systems that retrieve documents, passages, or knowledge at query time to answer questions, verify claims, and make decisions.


I have built many retrieval systems…

…and I have broken just as many.

I'm known for Explainable IR

Foundational methods for understanding why retrieval and ranking models decide what they decide.

Algorithms I introduced are now standard tools in the field.


But I have a split personality

15+ years alongside explainability: scaling and efficiency of retrieval — inside and alongside industry.

Algorithms running in production at companies you've heard of.

What goes wrong in production RAG today

Three failure modes — all three show up in this single session.

I'll walk through each one with a concrete example.

Failure 1

The reranker can't fix what it never sees: a cascade can only reorder the candidates the first-stage retriever returned.
This is called the bounded recall problem.

A natural solution: query reformulations

Generate several rephrasings of the original query. Retrieve for each. Merge.

Reformulations
"luxury hotels in Tokyo" ✗ drifts
"highly rated hotels in Tokyo" ✓ relevant
"popular hotels in Tokyo" ✗ drifts
"hotels near Tokyo Disneyland" ✗ drifts

Failure 2

More reformulations → more drift — current query reformulation methods are not the answer.

Why not use a stronger model?

The obvious response to failures 1 and 2: use a stronger model.

A reasoning-model reranker reads the document, the query, the constraints — and judges relevance like a human would.

It works. It's also prohibitively expensive.


Failure 3

Better reasoning models incur a higher cost — and there is no sustainable scaling solution yet.

The principle

Feedback from later stages can — and should — drive decisions in earlier stages.

The reranker's relevance estimates can re-prioritize the retriever's candidates.
The downstream generator's confidence can score upstream evidence.
The cost of judgments can budget the depth of search.


In the remaining slides: what feedback to use, how to use feedback, and what to optimize using feedback.

The power of feedback

What feedback should we use — and how do we reorganize our indexes to react to it?

Use the re-ranking signal as feedback.


Re-ranking changes the order — and we throw the signal away

Retriever order: d3 > d2 > d1
Reranker order: d1 > d2 > d3
The reranker has discovered a new ordering.
Today's pipelines use it only to pick the top-k — and throw the rest of the signal away.
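
A minimal sketch of how the discarded signal could be reused, assuming a precomputed document-neighbor graph and an expensive scoring function. The names `adaptive_rerank`, `neighbors`, and `score` are hypothetical, and the alternation scheme is a generic adaptive-reranking loop rather than the exact QUAM or SUNAR procedure.

```python
import heapq
from typing import Callable, Dict, List, Tuple

def adaptive_rerank(
    initial: List[str],
    neighbors: Dict[str, List[str]],
    score: Callable[[str], float],
    budget: int,
    batch: int = 2,
) -> List[Tuple[str, float]]:
    """Spend `budget` reranker calls, alternating between the retriever's
    list and the neighbors of documents the reranker has scored so far."""
    scored: Dict[str, float] = {}
    pool = list(initial)                      # retriever order
    frontier: List[Tuple[float, str]] = []    # (-parent score, doc): best parents first
    use_frontier = False
    while len(scored) < budget and (pool or frontier):
        picked: List[str] = []
        while len(picked) < batch and len(scored) + len(picked) < budget:
            if use_frontier and frontier:
                _, doc = heapq.heappop(frontier)
            elif pool:
                doc = pool.pop(0)
            elif frontier:
                _, doc = heapq.heappop(frontier)
            else:
                break
            if doc not in scored and doc not in picked:
                picked.append(doc)
        for doc in picked:
            scored[doc] = score(doc)          # the expensive call
            for nb in neighbors.get(doc, []):
                if nb not in scored:
                    # Feedback: neighbors inherit priority from the reranker score.
                    heapq.heappush(frontier, (-scored[doc], nb))
        use_frontier = not use_frontier
    return sorted(scored.items(), key=lambda kv: -kv[1])

# Toy example: d1 is relevant but was never retrieved; it is a neighbor of d2.
relevance = {"d1": 0.9, "d2": 0.6, "d3": 0.2}
ranking = adaptive_rerank(["d3", "d2"], {"d2": ["d1"]}, relevance.get, budget=3)
```

In the toy example the reranker's score for d2 pulls its neighbor d1 into the candidate set, so a document outside the retriever's list ends up ranked first.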

When the LLM gives many semantically different answers, the evidence is bad.
When it gives consistent answers, the evidence is good.
Uncertainty itself becomes the feedback signal.
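
One way to turn this into a number, sketched under the assumption that sampled answers have already been grouped into meaning clusters (the clustering step is not shown). `semantic_entropy` and `evidence_score` are illustrative names, not SUNAR's actual estimator.

```python
import math
from collections import Counter
from typing import Hashable, Sequence

def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of sampled answers over meaning clusters."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def evidence_score(cluster_ids: Sequence[Hashable]) -> float:
    """Map entropy to [0, 1]: 1.0 when all samples agree, 0.0 when all differ."""
    n = len(cluster_ids)
    if n <= 1:
        return 1.0
    return 1.0 - semantic_entropy(cluster_ids) / math.log(n)

consistent = evidence_score(["A", "A", "A", "A", "A"])   # answers agree
scattered = evidence_score(["A", "B", "C", "D", "E"])    # answers all differ
```

A high score suggests the retrieved evidence pins the answer down; a low score flags evidence worth replacing or expanding.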

With QUAM [7] and SUNAR [15], we learned which feedback signals can be used, and how to organize a representation space and explore it efficiently to improve retrieval.

But we are too tied to the representation space. What if there are different aspects of relevance, just like in Learning to Rank?

Different queries need different signals

"Marriott London Bridge" — exact match. Lexical / BM25 wins.
Neighborhood signals just add noise.

"romantic weekend getaway near Amsterdam" — vague. Lexical fails.
Embedding similarity and document neighborhood matter.

"kid-friendly hotel with pool, walking distance to Sagrada Familia, under €200" — multi-constraint. No single signal is enough.


The standard pipeline picks one weighting of these signals at training time.

That weighting is wrong for most queries.

What if the system picked the weighting per query, online, from reranker feedback?


ORE — Online Relevance Estimation

QUAM and SUNAR pick the next batch using a fixed affinity structure.

ORE goes further: it learns online, per query, which signals matter for this query.

A linear stochastic bandit over the candidate pool. Each document is an arm; features are simple, well-known relevance factors; rewards are the expensive ranker's scores.

[8] Rathee, Venktesh V, MacAvaney, Anand. Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets. SIGIR 2025.
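
A rough sketch of the bandit described above, assuming a LinUCB-style rule with ridge regression. The function name, `alpha`, and `lam` are illustrative choices, not ORE's published algorithm. Each document is an arm that is pulled at most once, and the learned weights `theta` indicate which cheap signals predict the expensive ranker for this query.

```python
import numpy as np

def online_relevance_estimation(features, expensive_score, budget, alpha=1.0, lam=1.0):
    """features: (n_docs, d) array of cheap relevance signals per document.
    expensive_score(i) -> float: the costly ranker's score for document i.
    Returns (theta, scored): per-query signal weights and the scores paid for."""
    n, d = features.shape
    A = lam * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    scored = {}
    for _ in range(min(budget, n)):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b
        # Upper confidence bound per arm: current estimate plus exploration bonus.
        quad = np.einsum("ij,jk,ik->i", features, A_inv, features)
        ucb = features @ theta + alpha * np.sqrt(np.maximum(quad, 0.0))
        if scored:
            ucb[list(scored)] = -np.inf   # a document is scored at most once
        i = int(np.argmax(ucb))
        r = expensive_score(i)            # the expensive call
        scored[i] = r
        A += np.outer(features[i], features[i])
        b += r * features[i]
    theta = np.linalg.solve(A, b)
    return theta, scored

# Toy query where only the first signal matters to the expensive ranker.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
theta, scored = online_relevance_estimation(
    X, lambda i: float(X[i] @ np.array([1.0, 0.0])), budget=20, lam=0.01
)
```

In the toy query the estimated weights concentrate on the first signal after 20 of 50 possible expensive calls; a different query would yield different weights from the same code.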

But signals aren't the only thing we can pick per query

ORE asks: given this query, which relevance signals should I trust?

Now ask the same question one level up.


ReformIR — picking reformulations per query

Failure 2: more reformulations → more drift.

Remember the Tokyo example? Four reformulations, one of them useful, three of them drifty.

The fix isn't fewer reformulations. It's picking which ones to trust, per query, online, from reranker feedback.

[3] Venktesh V, Rathee, Anand. When More Reformulations Hurt: Avoiding Drift using Ranker Feedback. SIGIR 2026.

ReformIR — same idea as ORE, one level up

ORE puts relevance signals in the bandit's feature vector.

ReformIR puts reformulations in the bandit's feature vector.

The reranker always scores against the original query, anchoring the system against drift.

Drifty reformulations get downweighted. Useful ones get upweighted.


This is the paradigm shift

Stop fixing the system at training time.

Per-query, online, from reranker feedback:
Which relevance signals to trust → ORE
Which reformulations to keep → ReformIR
Which neighbors to expand → QUAM, SUNAR
[7] QUAM (WSDM 2025) · [15] SUNAR (NAACL 2025) · [8] ORE (SIGIR 2025) · [3] ReformIR (SIGIR 2026)

Retrieval-augmented AI systems need to be feedback-aware and learn on a per-query basis.

Bandits give us the language for all of it

Same principle, different decisions, query by query:

  • Which query formulation to issue → ReformIR
  • Which documents to score → ORE, QUAM, SUNAR
  • Which judgments to spend budget on → all of the above

Not "how do I train a better retriever" — but "how do I deploy compute optimally, query by query?"


The science: subset selection under uncertainty

Every algorithm I just showed reduces to the same question:

Given a large pool and a tiny budget, which items deserve the expensive call?

This is top-m arm identification in stochastic linear bandits, specifically disposable bandits.

The bottleneck: the candidate pool is exponentially large. Classical bandit algorithms compare every arm against every other — infeasible at retrieval scale.


Explainable Information Retrieval


EXPLORA — Choosing Explanations

You have a pool of N explanations. In-context learning needs a k-subset — the right k demonstrations decide whether the LLM reasons correctly or not.

The brute-force search: evaluate all subsets with the LLM.
At N = 100, k = 4, that's 3.9 million LLM calls.
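
The count is the binomial coefficient C(N, k). A quick check, assuming one LLM call per candidate subset as the figure above implies (`brute_force_subsets` is an illustrative helper):

```python
import math

def brute_force_subsets(n: int, k: int) -> int:
    """Number of distinct k-subsets of an n-item pool (order ignored)."""
    return math.comb(n, k)

calls = brute_force_subsets(100, 4)
print(calls)  # 3921225
```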

EXPLORA again frames this as bandit arm selection.

Each k-subset is an arm. An LLM call is a pull. The arms are exponentially many — classical bandit algorithms that compare every pair are infeasible.

[6] Purohit, Venktesh V, Devalla, Yerragorla, Bhattacharya, Anand. EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning. EMNLP 2024, pp. 5367–5388.

CASE — what EXPLORA leaves open

EXPLORA explores efficiently but never answers a fundamental question:

When have you found the best subset?
How do you know when to stop?

Without confidence information, EXPLORA can't eliminate arms — it keeps spending budget on subsets that are clearly suboptimal.

[4] Purohit, Venktesh V, Bhattacharya, Anand. Sample Efficient Demonstration Selection for In-Context Learning. ICML 2025.
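
To illustrate what confidence information buys, here is textbook successive elimination for top-m identification with rewards in [0, 1]. This is a generic sketch, not the CASE algorithm; `top_m_by_elimination`, `pull`, and the Hoeffding-style radius are assumptions made for the example.

```python
import math
import random

def top_m_by_elimination(pull, n_arms, m, delta=0.05, max_rounds=5000):
    """Pull every surviving arm once per round; drop an arm as soon as its
    upper confidence bound falls below the lower bound of the m-th best arm."""
    active = list(range(n_arms))
    sums = [0.0] * n_arms
    pulls = 0
    for t in range(1, max_rounds + 1):
        if len(active) <= m:
            break                         # stopping rule: only m arms survive
        for a in active:
            sums[a] += pull(a)
            pulls += 1
        # Hoeffding radius with a union bound over arms and rounds.
        rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {a: sums[a] / t for a in active}
        mth_best = sorted(means.values(), reverse=True)[m - 1]
        active = [a for a in active if means[a] + rad >= mth_best - rad]
    # Surviving arms have equal pull counts, so sums rank them like means.
    active.sort(key=lambda a: sums[a], reverse=True)
    return active[:m], pulls

# Toy example: four Bernoulli arms, find the best two.
rng = random.Random(0)
true_means = [0.9, 0.8, 0.2, 0.1]
best, pulls = top_m_by_elimination(
    lambda a: 1.0 if rng.random() < true_means[a] else 0.0, n_arms=4, m=2
)
```

The confidence radius answers both open questions at once: arms that are clearly suboptimal stop consuming budget, and the search stops when only m arms remain.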

Explain-and-predict isn't always perfect

The classical recipe — extract a rationale, then predict from it — is faithful by construction. But faithfulness is not free.

Recent work shows that the joint optimization can succeed without leaking the label or the feature itself into the rationale — and that the resulting predictors are competitive with their non-interpretable counterparts.

Interpretable retrieval is no longer paying a quality tax —
but the joint problem is delicate, and most pipelines still get it wrong.
Oosterhuis, Lyu, Anand. Local Feature Selection without Label or Feature Leakage for Interpretable Machine Learning Predictions. ICML 2024. arXiv:2407.11778

RAG faithfulness isn't perfect either

LLMs can produce a fluent answer, cite the right passage, and still be unfaithful — the citation explains the answer post hoc rather than supporting how the answer was actually derived.

We call this post-rationalization: the model decides, then dresses the decision in a plausible attribution.

Correct ≠ faithful.
A right answer with the wrong reason is still the wrong system.
Wallat, Heuss, de Rijke, Anand. Correctness is not Faithfulness in Retrieval Augmented Generation Attributions. ICTIR 2025, pp. 22–32. DOI: 10.1145/3731120.3744592

Towards AutoIR

A new subfield: self-improving retrieval systems.


A retrieval system that:

  • generates the data it needs to improve
  • evaluates itself in a way that aligns with deployment
  • optimizes itself, query by query, over time

Personas — generating behavior, not labels

Most synthetic-data work generates labels:
"Is document D relevant to query Q? Yes / No."

That's not how users behave.


We generate personas:

  • the budget traveler vs the luxury traveler
  • the planner vs the last-minute booker
  • the comparison shopper vs the decisive buyer

Each persona issues queries, reformulates, clicks, abandons —
the way users actually do.

We don't simulate labels.
We simulate users.

The system creates the data it needs to improve.

Static benchmarks → continuous self-evaluation.


Where this leaves us

  • Pipelines → systems
  • Retrieval → decisions under uncertainty
  • Static benchmarks → self-improving loops

RAG today is a pipeline.
RAG tomorrow is a system that learns.

We've closed the recall gap

The bandit framing — across QUAM, SUNAR, ORE, ReformIR — lets us recover the documents that classical cascades were leaving on the floor.

Same compute budget. More of the right answers.

The gap we closed → the world we're opening

The gap we closed
Bounded-recall → feedback-aware retrieval
The world we're opening
AutoIR — systems that learn on their own

I work with a small number of companies a year on exactly this.
If any of this is interesting to you, let's talk afterwards.


References

[1] Rathee, Venktesh V, MacAvaney, Anand. Reproducing Adaptive Reranking for Reasoning-Intensive IR. SIGIR 2026 (to appear).

[2] Rathee, Venktesh V, MacAvaney, Anand. Test-time Corpus Feedback: From Retrieval to RAG. Findings of ACL: EACL 2026, pp. 5637–5656.

[3] Venktesh V, Rathee, Anand. When More Reformulations Hurt: Avoiding Drift using Ranker Feedback. SIGIR 2026 (to appear).

[4] Purohit, Venktesh V, Bhattacharya, Anand. Sample Efficient Demonstration Selection for In-Context Learning. ICML 2025.

[5] Rathee, MacAvaney, Anand. Guiding Retrieval Using LLM-Based Listwise Rankers. ECIR 2025, pp. 230–246. DOI: 10.1007/978-3-031-88708-6_15.

[6] Purohit, Venktesh V, Devalla, Yerragorla, Bhattacharya, Anand. EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning. EMNLP 2024, Miami, FL, pp. 5367–5388. DOI: 10.18653/V1/2024.emnlp-main.310.

[7] Rathee, MacAvaney, Anand. Quam: Adaptive Retrieval through Query Affinity Modelling. WSDM 2025, pp. 954–962. DOI: 10.1145/3701551.3703584.

[8] Rathee, Venktesh V, MacAvaney, Anand. Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets. SIGIR 2025, pp. 2287–2297. DOI: 10.1145/3726302.3729910.


References (cont.)

[9] Yoon, Kim, Kwon, Anand, Hwang. On Listwise Reranking for Corpus Feedback. WSDM 2026, pp. 1273–1277. DOI: 10.1145/3773966.3779404.

[10] Anand, Saha, Venktesh V. Explainable Information Retrieval. ECIR 2025, pp. 254–261. DOI: 10.1007/978-3-031-88720-8_40.

[11] Chungkham, Venktesh V, Setty, Anand. Think Right, Not More: Test-Time Scaling for Numerical Claim Verification. Findings of ACL: EMNLP 2025, pp. 24345–24363.

[12] Heuss, de Rijke, Anand. RankingSHAP — Faithful Listwise Feature Attribution Explanations for Ranking Models. SIGIR 2025, pp. 381–391. DOI: 10.1145/3726302.3729971.

[13] Nanhekhan, Venktesh V, Martin, Vatndal, Setty, Anand. FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking. ECIR 2025, pp. 385–399. DOI: 10.1007/978-3-031-88717-8_28.

[14] Saha, Agarwal, Venktesh V, Anand et al. ir_explain: A Python Library of Explainable IR Methods. SIGIR 2025, pp. 3563–3572. DOI: 10.1145/3726302.3730343.

[15] Venktesh V, Rathee, Anand. SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA. NAACL 2025, pp. 5818–5835. DOI: 10.18653/V1/2025.NAACL-LONG.300.

[16] Wallat, Heuss, de Rijke, Anand. Correctness is not Faithfulness in Retrieval Augmented Generation Attributions. ICTIR 2025, pp. 22–32. DOI: 10.1145/3731120.3744592.

That's it!


The reranker is a better estimator of relevance

Learning to Rank has formalized this for over 20 years.

The reranker weights dozens of signals (query–title match, semantic similarity, price, location, review count, image quality) instead of one BM25 score.

Modern era: BERT reads the full document with cross-encoder attention. LLMs judge relevance with human-level reading comprehension.

The reranker is not a different problem from retrieval; it is a richer estimator of the same thing. Better features, more compute, fewer documents.


Lots to discover

[Figure: pipeline loop diagram (booking-diagrams/slide_p40_pipeline_loop.svg)]

What started as a fix for one cascade is the design language for a new kind of system.