Why Agentic RAG? — Query Routing and Adaptive Retrieval

RAG handles "What's the weather in Seoul?" just fine, but fails at "Analyze how Seoul's weather has changed compared to last year." Why? Because a single vector search simply cannot handle such complex, multi-faceted questions.
Traditional RAG always searches the vector DB for similar documents whenever a question comes in. But real-world questions are far more complex. Some require real-time news, others need SQL queries to extract structured data, and some are general knowledge questions that don't need retrieval at all.
Agentic RAG solves this problem. The LLM analyzes the question, autonomously determines the optimal retrieval strategy, and combines multiple sources to generate an answer. In this article, we cover the first core techniques of Agentic RAG: Query Routing and Adaptive Retrieval.
Series: Part 1 (this post) | Part 2: Self-RAG and Corrective RAG | Part 3: Production Pipelines
If you're new to RAG basics, start with the Temporal RAG and Multi-hop RAG series first. If you're new to agent patterns, begin with Getting Started with AI Agents.
Limitations of Naive RAG
Most RAG tutorials follow this structure: Question → Vector Search → LLM Generation. Simple and effective — but only when the question is simple.
Real-world questions are diverse. Real-time news, SQL aggregations, general knowledge, combinations of multiple sources — yet Naive RAG treats every question the same way. It sends the query to the vector DB and passes the results to the LLM.
Here is a typical Naive RAG implementation.
def naive_rag(query: str) -> str:
    """Simple RAG: always performs only vector search."""
    docs = vector_store.similarity_search(query, k=4)
    context = "\n".join(d.page_content for d in docs)
    return llm.invoke(f"Context:\n{context}\n\nQ: {query}")

# Failure case: "What's the recent revenue trend for OpenAI?"
# → No real-time data in the vector DB, so it generates an incorrect answer

Key Insight: The fundamental limitation of Naive RAG is the assumption that "every question can be solved with vector search." Agentic RAG breaks this assumption by selecting the optimal strategy for each question.
Query Analysis — Intent Classification
The first step of Agentic RAG is Query Analysis. The LLM first determines what type the incoming question is, how complex it is, and which sources are needed.
By leveraging Structured Output, we can handle classification results programmatically. We define the output schema with a Pydantic model and enforce it using OpenAI's response_format.
from pydantic import BaseModel, Field
from typing import Literal
from openai import OpenAI

client = OpenAI()

class QueryAnalysis(BaseModel):
    """Schema for the result of analyzing a user query."""
    intent: Literal["factual", "analytical", "comparison", "temporal", "opinion"]
    complexity: Literal["simple", "multi_hop", "aggregation"]
    requires_retrieval: bool
    suggested_sources: list[Literal["vector_db", "web_search", "sql_db"]]
    sub_queries: list[str] = Field(default_factory=list)

SYSTEM_PROMPT = """You are an expert at analyzing user queries.
Determine the intent, complexity, whether retrieval is required (requires_retrieval),
appropriate sources (suggested_sources), and sub-queries (sub_queries).
Rules:
- Recommend web_search if real-time information is needed.
- Recommend sql_db for numerical/statistical questions.
- Set requires_retrieval=false for general knowledge or concept explanations.
- Decompose complex questions into sub_queries."""

def analyze_query(query: str) -> QueryAnalysis:
    """Analyzes the intent and complexity of a user query."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query}
        ],
        response_format=QueryAnalysis
    )
    return response.choices[0].message.parsed

Let's see what happens when we analyze a few different queries.
# Example 1: Simple fact-checking
result = analyze_query("What is the attention mechanism in transformer models?")
# → intent="factual", complexity="simple",
#   requires_retrieval=True, suggested_sources=["vector_db"]

# Example 2: Real-time information needed
result = analyze_query("What's the recent revenue trend for OpenAI?")
# → intent="temporal", complexity="aggregation",
#   requires_retrieval=True, suggested_sources=["web_search"]

# Example 3: No retrieval needed
result = analyze_query("What's the difference between lists and tuples in Python?")
# → intent="comparison", complexity="simple",
#   requires_retrieval=False, suggested_sources=[]

To summarize the optimal source for each question type, following the routing rules above:
- In-corpus facts (e.g. "What is the attention mechanism?") → vector_db
- Real-time / temporal information (e.g. "OpenAI's recent revenue trend?") → web_search
- Numerical / statistical aggregation → sql_db
- General knowledge and concept explanations (e.g. "lists vs. tuples?") → no retrieval, answer from LLM knowledge
Query Routing — Routing to the Optimal Source
Once the question is analyzed, we need to actually fetch information from the appropriate sources. The Query Router selects and executes retrieval backends based on the analysis results.
First, we prepare three retrieval backends.
import chromadb
from tavily import TavilyClient
import sqlite3

# 1. Vector Search: ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("documents")

def vector_search(query: str, k: int = 4) -> list[str]:
    """Searches for similar documents in the vector DB."""
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]

# 2. Web Search: Tavily
tavily_client = TavilyClient(api_key="tvly-...")

def web_search(query: str) -> list[str]:
    """Performs a real-time web search."""
    response = tavily_client.search(query, max_results=3)
    return [r["content"] for r in response["results"]]

# 3. Text-to-SQL: SQLite
conn = sqlite3.connect("./company.db")

def sql_query(query: str) -> str:
    """Converts natural language to SQL and executes it."""
    # Convert natural language → SQL via LLM
    sql = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Generate SQL based on the following schema.\n"
                "Table: sales(date, product, revenue, region)\n"
                "Output only SQL. No explanations."
            )},
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content.strip()
    # Execute SQL
    cursor = conn.execute(sql)
    rows = cursor.fetchall()
    columns = [desc[0] for desc in cursor.description]
    return f"SQL: {sql}\nResult: {[dict(zip(columns, row)) for row in rows]}"

Now we build the router function. It iterates through the suggested_sources from QueryAnalysis and calls the corresponding backend.
def route_query(analysis: QueryAnalysis, query: str) -> list[str]:
    """Retrieves information from appropriate sources based on the analysis result."""
    results = []
    for source in analysis.suggested_sources:
        if source == "vector_db":
            results.extend(vector_search(query))
        elif source == "web_search":
            results.extend(web_search(query))
        elif source == "sql_db":
            results.append(sql_query(query))
    return results

It may look simple, but this routing alone yields a significant performance improvement over Naive RAG. The key is sending the question to the right source: asking a vector DB about real-time news is like asking a library for today's stock price.
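To see the dispatch behavior in isolation, here is a self-contained sketch that pairs the same routing logic with stubbed backends. StubAnalysis and the stub_* functions are illustrative stand-ins, not the real implementations above:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins: StubAnalysis mimics QueryAnalysis, and the stub
# backends return canned strings instead of hitting Chroma/Tavily/SQLite.
@dataclass
class StubAnalysis:
    suggested_sources: list[str] = field(default_factory=list)

def stub_vector_search(query: str) -> list[str]:
    return [f"[vector] doc about {query}"]

def stub_web_search(query: str) -> list[str]:
    return [f"[web] article about {query}"]

def stub_sql_query(query: str) -> str:
    return f"[sql] aggregate for {query}"

def route_query(analysis: StubAnalysis, query: str) -> list[str]:
    """Same dispatch shape as the router above, over the stub backends."""
    results: list[str] = []
    for source in analysis.suggested_sources:
        if source == "vector_db":
            results.extend(stub_vector_search(query))
        elif source == "web_search":
            results.extend(stub_web_search(query))
        elif source == "sql_db":
            results.append(stub_sql_query(query))
    return results

docs = route_query(StubAnalysis(["web_search", "sql_db"]), "OpenAI revenue")
# docs == ["[web] article about OpenAI revenue", "[sql] aggregate for OpenAI revenue"]
```

Note that list-returning backends are merged with extend while the single-string SQL result is appended, so the context list stays flat regardless of which sources fire.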
Key Insight: The essence of Query Routing is "tool selection." If you think of retrieval sources as tools, this is the same pattern as an Agent deciding which Tool to use.
Adaptive Retrieval — When Retrieval Isn't Needed
Performing retrieval for every question is inefficient. For a question like "What's the difference between HTTP and HTTPS?", the LLM already knows the answer well enough. Running retrieval can actually introduce unnecessary context and degrade answer quality.
Adaptive Retrieval lets the LLM decide whether to retrieve at all. The requires_retrieval field in QueryAnalysis serves this purpose.
def adaptive_rag(query: str) -> str:
    """Determines whether retrieval is needed and retrieves only when necessary."""
    analysis = analyze_query(query)

    # If retrieval is not needed, answer directly from LLM knowledge
    if not analysis.requires_retrieval:
        return client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer accurately using your own knowledge."},
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # If it's a complex question, decompose into sub-queries and retrieve each
    queries = analysis.sub_queries if analysis.sub_queries else [query]
    all_context = []
    for q in queries:
        sub_analysis = analyze_query(q) if q != query else analysis
        all_context.extend(route_query(sub_analysis, q))

    # Generate the final answer based on the collected context
    context_text = "\n---\n".join(all_context)
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer accurately based on the provided context."},
            {"role": "user", "content": f"Context:\n{context_text}\n\nQuestion: {query}"}
        ]
    ).choices[0].message.content

This follows the same principle as the ReAct loop from Agent Part 1. It applies the concept of an Agent deciding "what action to take" to retrieval. Whether to retrieve or not, where to retrieve from, whether to decompose the question — all these decisions are made by the LLM.
Implementing with LangGraph
Structuring the logic so far with LangGraph results in a cleaner, more extensible pipeline. Each step becomes a node, and branching conditions become edges.
If you're new to LangGraph, refer to Agent Part 2: LangGraph in Practice.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgenticRAGState(TypedDict):
    """Defines the state of the Agentic RAG pipeline."""
    query: str                      # Original user question
    analysis: QueryAnalysis | None  # Query analysis result
    documents: list[str]            # Retrieved documents
    generation: str                 # Final generated answer

def analyze_node(state: AgenticRAGState) -> dict:
    """Analyzes the question to determine intent and optimal sources."""
    analysis = analyze_query(state["query"])
    return {"analysis": analysis}

def should_retrieve(state: AgenticRAGState) -> str:
    """Determines whether retrieval is needed and routes to the next node."""
    if state["analysis"].requires_retrieval:
        return "retrieve"
    return "generate_direct"

def retrieve_node(state: AgenticRAGState) -> dict:
    """Retrieves documents from appropriate sources based on the analysis."""
    analysis = state["analysis"]
    query = state["query"]
    all_docs = []
    # If there are sub-queries, retrieve for each one
    queries = analysis.sub_queries if analysis.sub_queries else [query]
    for q in queries:
        sub_analysis = analyze_query(q) if q != query else analysis
        all_docs.extend(route_query(sub_analysis, q))
    return {"documents": all_docs}

def generate_node(state: AgenticRAGState) -> dict:
    """Generates an answer based on the retrieved documents."""
    context = "\n---\n".join(state["documents"])
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer accurately based on the provided context. "
                "Do not speculate about information not in the context."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {state['query']}"}
        ]
    ).choices[0].message.content
    return {"generation": response}

def generate_direct_node(state: AgenticRAGState) -> dict:
    """Generates an answer using only LLM knowledge, without retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer accurately and helpfully using your own knowledge."},
            {"role": "user", "content": state["query"]}
        ]
    ).choices[0].message.content
    return {"generation": response}

# Build the graph
graph = StateGraph(AgenticRAGState)

# Add nodes
graph.add_node("analyze", analyze_node)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_node("generate_direct", generate_direct_node)

# Connect edges
graph.set_entry_point("analyze")
graph.add_conditional_edges(
    "analyze",
    should_retrieve,
    {
        "retrieve": "retrieve",               # Retrieval needed → go to retrieve node
        "generate_direct": "generate_direct"  # No retrieval needed → answer directly
    }
)
graph.add_edge("retrieve", "generate")        # After retrieval → generate answer
graph.add_edge("generate", END)               # Answer complete
graph.add_edge("generate_direct", END)        # Direct answer complete

# Compile and run
app = graph.compile()

# Execution example
result = app.invoke({
    "query": "Analyze how Seoul's weather has changed compared to last year",
    "analysis": None,
    "documents": [],
    "generation": ""
})
print(result["generation"])

The graph flow is: User Question → [analyze] → Retrieval needed? → Yes: [retrieve] → [generate] → END / No: [generate_direct] → END.
The advantage of LangGraph is that each node can be independently tested and swapped out. Adding new sources (GraphRAG, API calls, etc.) to retrieve_node or changing the should_retrieve condition is straightforward.
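Because nodes and edge conditions are plain functions over the state dict, they can be exercised without compiling the graph or making any API calls. A minimal sketch of testing the routing condition, where FakeAnalysis is a hypothetical stand-in for QueryAnalysis:

```python
from dataclasses import dataclass

# FakeAnalysis is a hypothetical stand-in for QueryAnalysis, so the branch
# logic can be tested without Pydantic models or LLM calls.
@dataclass
class FakeAnalysis:
    requires_retrieval: bool

def should_retrieve(state: dict) -> str:
    """Same conditional-edge logic as in the graph above."""
    if state["analysis"].requires_retrieval:
        return "retrieve"
    return "generate_direct"

assert should_retrieve({"analysis": FakeAnalysis(True)}) == "retrieve"
assert should_retrieve({"analysis": FakeAnalysis(False)}) == "generate_direct"
```

The same pattern applies to retrieve_node and the generation nodes: inject stub backends or a stub client, feed a hand-built state dict, and assert on the partial state update each node returns.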
Does Routing Actually Work?
The theory sounds promising, but how much of a difference does it make in practice? We compared Naive RAG and Agentic RAG (with Query Routing) on 4 question types with 100 questions each, 400 questions in total. The evaluation metric is accuracy (based on GPT-4o as judge).
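As a sketch of how such an accuracy number is computed: an LLM judge labels each answer "correct" or "incorrect", and accuracy is the fraction judged correct. The judge_fn signature and the exact-match stub below are assumptions for illustration; the actual judging prompt is not shown in this post.

```python
# judge_fn stands in for the GPT-4o-as-judge call; it is expected to return
# "correct" or "incorrect" for each (question, reference, answer) triple.
def accuracy(samples: list[dict], judge_fn) -> float:
    """Fraction of answers the judge labels 'correct'."""
    verdicts = [
        judge_fn(s["question"], s["reference"], s["answer"]) for s in samples
    ]
    return sum(v == "correct" for v in verdicts) / len(verdicts)

# With an exact-match stub judge in place of the LLM:
stub_judge = lambda q, ref, ans: "correct" if ans == ref else "incorrect"
samples = [
    {"question": "q1", "reference": "A", "answer": "A"},
    {"question": "q2", "reference": "B", "answer": "C"},
]
print(accuracy(samples, stub_judge))  # → 0.5
```

Swapping stub_judge for a real GPT-4o call turns this into the LLM-as-judge setup used for the comparison.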
Here are a few observations.
- In-corpus questions show minimal difference. Naive RAG already performs well when the answer exists in the vector DB.
- Real-time information shows a dramatic difference. Naive RAG hallucinates because the vector DB lacks the relevant information, while Agentic RAG fetches accurate information via web search.
- Structured data is an entirely different ballgame. You simply cannot answer "Top 5 products by revenue last quarter" with vector search.
- No-retrieval questions also show meaningful improvement. Removing unnecessary context allows the LLM to answer more accurately.
The evaluation methodology is covered in detail in RAG Evaluation.
Key Insight: The greatest value of Query Routing is covering areas "where traditional RAG could not answer at all." In-corpus question performance is similar, but it's a game-changer for real-time and structured data.
Preview of the Next Part
We used routing to find the right source, but what happens when the quality of the retrieved documents is poor? What if the search results are irrelevant to the question or contain outdated information?
Part 2 covers two techniques that address this problem.
- Self-RAG: The LLM evaluates the relevance of search results on its own and re-retrieves if necessary
- Corrective RAG (CRAG): Automatically falls back to web search when retrieval results are insufficient
We will build a complete Agentic RAG pipeline: Query Analysis → Routing (Part 1) → Quality Verification (Part 2) → Production Deployment (Part 3).
References
- Gao, Y., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." *arXiv:2312.10997*.
- Asai, A., et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." *arXiv:2310.11511*.
- Yan, S., et al. (2024). "Corrective Retrieval Augmented Generation." *arXiv:2401.15884*.
- LangGraph Official Documentation
- Tavily API Documentation
- Related series: Temporal RAG · Multi-hop RAG · RAG Evaluation · AI Agent Series