When the LLM gives many semantically different answers, the evidence is bad.
When it gives consistent answers, the evidence is good.
Uncertainty itself becomes the feedback signal.
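A minimal sketch of this signal, assuming the answers have already been sampled from the LLM. The `normalize` helper is a hypothetical stand-in for semantic equivalence (a real system would cluster answers by meaning, for example with an entailment model). The entropy over the resulting clusters serves as the uncertainty score: low means consistent answers, high means scattered ones.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering: lowercase, keep alphanumerics."""
    return "".join(ch for ch in answer.lower() if ch.isalnum())


def semantic_entropy(answers: list[str]) -> float:
    """Entropy (in nats) of the empirical distribution over answer clusters."""
    clusters = Counter(normalize(a) for a in answers)
    total = sum(clusters.values())
    return sum((n / total) * math.log(total / n) for n in clusters.values())


consistent = ["Paris", "paris.", "Paris"]      # one cluster: evidence looks good
scattered = ["Paris", "Lyon", "Marseille"]     # three clusters: evidence looks bad

print("consistent:", semantic_entropy(consistent))
print("scattered: ", semantic_entropy(scattered))
```

A retrieval loop could treat a high score as a cue to fetch different evidence, and a low score as a cue to stop.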
"Marriott London Bridge" — exact match. Lexical / BM25 wins.
Neighborhood signals just add noise.
"romantic weekend getaway near Amsterdam" — vague. Lexical fails.
Embedding similarity and document neighborhood matter.
"kid-friendly hotel with pool, walking distance to Sagrada Familia, under €200" — multi-constraint. No single signal is enough.
The standard pipeline picks one weighting of these signals at training time.
That weighting is wrong for most queries.
What if the system picked the weighting per query, online, from reranker feedback?
QUAM and SUNAR pick the next batch using a fixed affinity structure.
ORE goes further: it learns online, per query, which signals matter for this query.
A linear stochastic bandit over the candidate pool. Each document is an arm; features are simple, well-known relevance factors; rewards are the expensive ranker's scores.
ORE asks: given this query, which relevance signals should I trust?
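A minimal sketch of a linear bandit over a candidate pool, under stated assumptions: it is a generic LinUCB-style loop with a ridge estimate, not ORE's actual implementation. The two features (lexical, semantic), the toy pool, and `toy_ranker` are all hypothetical. Each document is scored at most once, and the expensive call is made only `budget` times.

```python
import math

Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: list[Vector], v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def ucb(x: Vector, theta: Vector, a_inv: list[Vector], alpha: float) -> float:
    """Estimated reward plus an exploration bonus for uncertain feature directions."""
    return dot(theta, x) + alpha * math.sqrt(dot(x, mat_vec(a_inv, x)))


def run_bandit(pool: dict[str, Vector], score_fn, budget: int, alpha: float = 1.0):
    """Spend `budget` expensive calls; return learned feature weights and scored ids."""
    dim = len(next(iter(pool.values())))
    # A = I (ridge prior), so A^-1 starts as the identity.
    a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    b = [0.0] * dim
    theta = [0.0] * dim
    scored: list[str] = []

    for _ in range(min(budget, len(pool))):
        candidates = [d for d in pool if d not in scored]
        chosen = max(candidates, key=lambda d: ucb(pool[d], theta, a_inv, alpha))
        x = pool[chosen]
        reward = score_fn(chosen)  # the expensive ranker call

        # Sherman-Morrison: update A^-1 after A <- A + x x^T.
        ax = mat_vec(a_inv, x)
        denom = 1.0 + dot(x, ax)
        a_inv = [[a_inv[i][j] - ax[i] * ax[j] / denom for j in range(dim)]
                 for i in range(dim)]
        b = [bi + reward * xi for bi, xi in zip(b, x)]
        theta = mat_vec(a_inv, b)
        scored.append(chosen)

    return theta, scored


# Hypothetical pool: features are [lexical score, semantic score].
pool = {
    "lex_1": [0.9, 0.1], "lex_2": [0.8, 0.2], "lex_3": [0.7, 0.1],
    "sem_1": [0.2, 0.9], "sem_2": [0.1, 0.8], "sem_3": [0.3, 0.7],
}


def toy_ranker(doc_id: str) -> float:
    """Stand-in for the expensive ranker: for this query only semantics matter."""
    return pool[doc_id][1]


theta, scored = run_bandit(pool, toy_ranker, budget=3)
print("scored:", scored)
print("weights [lexical, semantic]:", theta)
```

On this toy query the learned weight on the semantic feature ends up larger than the lexical one after three expensive calls. An exact-match query would push the weights the other way.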
Now ask the same question one level up.
Failure 2: more reformulations → more drift.
Remember the Tokyo example? Four reformulations, one of them useful, three of them drifty.
The fix isn't fewer reformulations. It's picking which ones to trust, per query, online, from reranker feedback.
ORE puts relevance signals in the bandit's feature vector.
ReformIR puts reformulations in the bandit's feature vector.
The reranker always scores against the original query, anchoring the system against drift.
Drifty reformulations get downweighted. Useful ones get upweighted.
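A rough sketch of the feature construction and the anchored reward, under stated assumptions. The reformulation names, document ids, reranker scores, reciprocal-rank features, and the simple least-mean-squares update are all illustrative choices, not ReformIR's actual design. The selection policy is also omitted: documents are scored in a fixed order to keep the example short.

```python
def reciprocal_rank_features(results: dict[str, list[str]]) -> dict[str, list[float]]:
    """One feature per reformulation: 1/rank of the doc in its list, 0.0 if absent."""
    names = list(results)
    features: dict[str, list[float]] = {}
    for col, name in enumerate(names):
        for rank, doc_id in enumerate(results[name], start=1):
            features.setdefault(doc_id, [0.0] * len(names))[col] = 1.0 / rank
    return features


def learn_weights(results: dict[str, list[str]], rerank_vs_original, lr: float = 0.5):
    """Online update of one weight per reformulation from reranker feedback."""
    names = list(results)
    weights = [0.0] * len(names)
    for doc_id, x in reciprocal_rank_features(results).items():
        # The reward is always the score against the ORIGINAL query (the anchor).
        reward = rerank_vs_original(doc_id)
        error = reward - sum(w * xi for w, xi in zip(weights, x))
        weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return dict(zip(names, weights))


# Hypothetical result lists for two reformulations of the same query.
results = {
    "useful": ["d1", "d2", "d3"],
    "drifty": ["d4", "d5", "d1"],
}
# Hypothetical reranker scores of each document against the original query.
anchor_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.1, "d5": 0.2}

weights_by_reformulation = learn_weights(results, anchor_scores.__getitem__)
print(weights_by_reformulation)
```

Because the reward never looks at the reformulated text, a reformulation only earns weight if the documents it surfaces are relevant to what the user originally asked.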
Stop fixing the system at training time.
| Per-query, online, from reranker feedback: |
|---|
| Which relevance signals to trust → ORE |
| Which reformulations to keep → ReformIR |
| Which neighbors to expand → QUAM, SUNAR |
Same principle, different decisions, query by query:
Not "how do I train a better retriever" — but "how do I deploy compute optimally, query by query?"
Every algorithm I just showed reduces to the same question:
Given a large pool and a tiny budget, which items deserve the expensive call?
This is top-m arms identification in stochastic linear bandits. Specifically, disposable bandits.
The bottleneck: the candidate pool is exponentially large. Classical bandit algorithms compare every arm against every other — infeasible at retrieval scale.
You have a pool of N explanations. In-context learning needs a k-subset — the right k demonstrations decide whether the LLM reasons correctly or not.
The brute-force search: evaluate all $\binom{N}{k}$ subsets.
At N = 100, k = 4, that's 3.9 million LLM calls.
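The figure can be checked directly: the number of k-subsets of an N-item pool is the binomial coefficient, which Python's standard library exposes as `math.comb`. The variable names are illustrative.

```python
import math

n_pool, k = 100, 4

# One LLM evaluation per candidate k-subset of the pool.
n_subsets = math.comb(n_pool, k)
print(f"{n_subsets:,} subsets to evaluate")  # 3,921,225 subsets to evaluate
```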
Again, EXPLORA frames this as bandit arm selection.
Each k-subset is an arm. An LLM call is a pull. The arms are exponentially many — classical bandit algorithms that compare every pair are infeasible.
EXPLORA explores efficiently but never answers a fundamental question:
Without confidence information, EXPLORA can't eliminate arms — it keeps spending budget on subsets that are clearly suboptimal.
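To illustrate what confidence information buys, here is a textbook successive-elimination sketch with Hoeffding confidence intervals. It is not the algorithm from the cited papers: the arm names, the accuracies, and the simulated `pull` are hypothetical, and rewards are assumed to lie in [0, 1]. An arm is dropped as soon as its upper confidence bound falls below the best lower bound, so the remaining budget goes only to subsets that could still be optimal.

```python
import math
import random


def successive_elimination(arms, pull, budget: int, delta: float = 0.05):
    """Pull all surviving arms in rounds; drop arms that are confidently suboptimal."""
    active = list(arms)
    sums = {a: 0.0 for a in active}
    n_arms = len(active)
    rounds = 0
    spent = 0
    while len(active) > 1 and spent + len(active) <= budget:
        for a in active:
            sums[a] += pull(a)  # one expensive evaluation of this subset
        spent += len(active)
        rounds += 1  # every surviving arm has been pulled `rounds` times

        # Hoeffding radius with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4 * n_arms * rounds**2 / delta) / (2 * rounds))
        means = {a: sums[a] / rounds for a in active}
        best_lower = max(means.values()) - radius
        active = [a for a in active if means[a] + radius >= best_lower]
    return active, spent


# Hypothetical arms: each stands for one k-subset of demonstrations,
# with an unknown probability that the LLM answers correctly when using it.
true_accuracy = {"subset_a": 0.9, "subset_b": 0.5, "subset_c": 0.4}
rng = random.Random(0)


def noisy_pull(arm: str) -> float:
    """Simulated LLM call: 1.0 if the answer is correct, else 0.0."""
    return 1.0 if rng.random() < true_accuracy[arm] else 0.0


survivors, calls = successive_elimination(true_accuracy, noisy_pull, budget=2000)
print("survivors:", survivors, "calls used:", calls)
```

With exponentially many subsets, pulling every arm each round is itself infeasible, so this only shows the elimination rule, not a scalable search.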
The classical recipe — extract a rationale, then predict from it — is faithful by construction. But faithfulness is not free.
Recent work shows that the joint optimization can succeed without leaking the label or the feature itself into the rationale — and that the resulting predictors are competitive with their non-interpretable counterparts.
LLMs can produce a fluent answer, cite the right passage, and still be unfaithful — the citation explains the answer post hoc rather than supporting how the answer was actually derived.
We call this post-rationalization: the model decides, then dresses the decision in a plausible attribution.
A new subfield: self-improving retrieval systems.
A retrieval system that:
Most synthetic-data work generates labels:
"Is document D relevant to query Q? Yes / No."
That's not how users behave.
We generate personas:
Each persona issues queries, reformulates, clicks, abandons —
the way users actually do.
The system creates the data it needs to improve.
Static benchmarks → continuous self-evaluation.
The bandit framing — across QUAM, SUNAR, ORE, ReformIR — lets us recover the documents that classical cascades were leaving on the floor.
I work with a small number of companies a year on exactly this.
If any of this is interesting to you, let's talk after.
[1] Rathee, Venktesh V, MacAvaney, Anand. Reproducing Adaptive Reranking for Reasoning-Intensive IR. SIGIR 2026 (to appear).
[2] Rathee, Venktesh V, MacAvaney, Anand. Test-time Corpus Feedback: From Retrieval to RAG. Findings of ACL: EACL 2026, pp. 5637–5656.
[3] Venktesh V, Rathee, Anand. When More Reformulations Hurt: Avoiding Drift using Ranker Feedback. SIGIR 2026 (to appear).
[4] Purohit, Venktesh V, Bhattacharya, Anand. Sample Efficient Demonstration Selection for In-Context Learning. ICML 2025.
[5] Rathee, MacAvaney, Anand. Guiding Retrieval Using LLM-Based Listwise Rankers. ECIR 2025, pp. 230–246. DOI: 10.1007/978-3-031-88708-6_15.
[6] Purohit, Venktesh V, Devalla, Yerragorla, Bhattacharya, Anand. EXPLORA: Efficient Exemplar Subset Selection for Complex Reasoning. EMNLP 2024, Miami, FL, pp. 5367–5388. DOI: 10.18653/V1/2024.emnlp-main.310.
[7] Rathee, MacAvaney, Anand. Quam: Adaptive Retrieval through Query Affinity Modelling. WSDM 2025, pp. 954–962. DOI: 10.1145/3701551.3703584.
[8] Rathee, Venktesh V, MacAvaney, Anand. Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets. SIGIR 2025, pp. 2287–2297. DOI: 10.1145/3726302.3729910.
[9] Yoon, Kim, Kwon, Anand, Hwang. On Listwise Reranking for Corpus Feedback. WSDM 2026, pp. 1273–1277. DOI: 10.1145/3773966.3779404.
[10] Anand, Saha, Venktesh V. Explainable Information Retrieval. ECIR 2025, pp. 254–261. DOI: 10.1007/978-3-031-88720-8_40.
[11] Chungkham, Venktesh V, Setty, Anand. Think Right, Not More: Test-Time Scaling for Numerical Claim Verification. Findings of ACL: EMNLP 2025, pp. 24345–24363.
[12] Heuss, de Rijke, Anand. RankingSHAP — Faithful Listwise Feature Attribution Explanations for Ranking Models. SIGIR 2025, pp. 381–391. DOI: 10.1145/3726302.3729971.
[13] Nanhekhan, Venktesh V, Martin, Vatndal, Setty, Anand. FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking. ECIR 2025, pp. 385–399. DOI: 10.1007/978-3-031-88717-8_28.
[14] Saha, Agarwal, Venktesh V, Anand et al. ir_explain: A Python Library of Explainable IR Methods. SIGIR 2025, pp. 3563–3572. DOI: 10.1145/3726302.3730343.
[15] Venktesh V, Rathee, Anand. SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA. NAACL 2025, pp. 5818–5835. DOI: 10.18653/V1/2025.NAACL-LONG.300.
[16] Wallat, Heuss, de Rijke, Anand. Correctness is not Faithfulness in Retrieval Augmented Generation Attributions. ICTIR 2025, pp. 22–32. DOI: 10.1145/3731120.3744592.
---

# The reranker is a better estimator of relevance

<div style="display:grid; grid-template-columns: 1fr 1fr; gap: 32px; align-items: center;">
<div style="font-size: 22px; line-height: 1.45;">

**Learning to Rank** has formalized this for over 20 years.

The reranker weights dozens of signals (query–title match, semantic similarity, price, location, review count, image quality) instead of one BM25 score.

Modern era: BERT reads the full document with cross-encoder attention. LLMs judge relevance with human-level reading comprehension.

<div class="footnote">The reranker is not a different problem from retrieval: it is a richer estimator of the same thing. Better features, more compute, fewer documents.</div>
<!-- _style: "table { margin: 0 auto; }" -->
---

# Lots to discover

<div style="text-align: center; margin-top: 8px;">
<img src="booking-diagrams/slide_p40_pipeline_loop.svg" style="width: 78%; height: auto;" />
</div>

<div class="closing-line" style="margin-top: 20px;">
What started as a fix for one cascade is the design language for a new kind of system.
</div>