Chat API Reference (Stateless & Session-Based Endpoints)
About this document
This page provides complete API documentation for the Chat-with-RAG system, including both stateless and session-based chat endpoints with request/response formats, parameters, and integration examples.
Note: If you landed here directly (for example from documentation hosting or search), start with the repository README to see how to run the system locally and try the interactive demo.
This document covers both the stateless /chat endpoint (used by frontend/chat.html) and the session-based /chat/{session_id} endpoint for server-side conversation management.
Table of Contents
- What this README covers
- High-level data flow
- Request schema
  3.1. Top-level request body
  3.2. params contract
- Backend defaults (Settings)
- Response shape
- Example calls
  6.1. Curl example (minimal)
  6.2. Curl example with processing steps hidden
  6.3. Python example using requests
- Notes for integrators
1. What this README covers
- Scope
  - The stateless `POST /chat` endpoint.
  - The session-based `POST /chat/{session_id}` endpoint.
  - Session management (`POST /chat/session`, `GET /chat/{session_id}/history`).
  - How requests map to `ChatRequest`, `handle_chat`, and `run_pipeline`.
  - The JSON request/response shape, including the `params` contract.
  - How to control processing-stage streaming via `show_processing_steps`.
- Out of scope
  - Ingest endpoints (`/index`, `/mediawiki/url`, `/pdf`, etc.).
  - SSE streaming internals (`backend/stream_stages.py`, `backend/stream_emit.py`).
2. High-level data flow
- Client (browser / external caller)
  - Sends `POST /chat` with a JSON body matching `backend/core/schemas.ChatRequest`.
- FastAPI (`backend/main.py`)
  - Route:
    ```python
    @app.post("/chat", tags=["3. Search & Chat"], summary="5. Chat (stateless)")
    async def chat_with_content(chat_request: ChatRequest):
        ...
    ```
  - Delegates to `ChatManager.handle_chat` (or the module-level `handle_chat`):
    ```python
    result = await asyncio.to_thread(handler, chat_request.model_dump())
    return result
    ```
- Chat manager (`backend/chat/chat_manager.py`)
  - `handle_chat(payload: Dict[str, Any]) -> Dict[str, Any]`:
    - Extracts `message`, `history`, and `params` from the payload.
    - Prepares `deps` and `req`.
    - Calls `run_pipeline(deps=deps, req=req)`.
  - `run_pipeline(...)` contains the full RAG pipeline:
    - query rewrite → retrieve → maybe rerank → summarize history → build prompt → inference → optional tools → final answer + metrics.
- Response
  - `handle_chat` returns a JSON dict shaped like a `ChatResponse`-plus-metrics:
    ```json
    {
      "answer": "...",
      "response": "...",
      "metrics": { ... },
      "turn_metrics": { ... },
      "conversation_totals": { ... },
      "tools_used": [ ... ],
      "rewrite_display": { ... }
    }
    ```
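The handler contract above can be sketched in a few lines. This is a hedged, minimal sketch: `run_pipeline` is stubbed out, and the `deps`/`req` shapes shown here are illustrative assumptions, not the real implementation.

```python
from typing import Any, Dict


def run_pipeline(deps: Dict[str, Any], req: Dict[str, Any]) -> Dict[str, Any]:
    # Stub standing in for the real RAG pipeline (out of scope here).
    return {"answer": f"echo: {req['message']}", "metrics": {}}


def handle_chat(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Extract the three payload fields the chat manager cares about.
    message = payload.get("message", "")
    history = payload.get("history") or []
    params = payload.get("params") or {}

    # Assumed shapes for deps/req; the real code builds richer objects.
    deps = {"params": params}
    req = {"message": message, "history": history}

    result = run_pipeline(deps=deps, req=req)
    # Mirror the answer into "response" for clients that read either key.
    result.setdefault("response", result.get("answer", ""))
    return result
```

The key point is that `handle_chat` is a thin adapter: it normalizes the payload and defers all RAG logic to `run_pipeline`.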
3. Request schema
3.1 Top-level request body
The /chat route uses backend.core.schemas.ChatRequest:
```python
class ChatRequest(BaseModel):
    message: str
    context: List[Dict] = []
    use_web_search: Optional[bool] = None

    # Pass-through of UI parameters and chat bubbles history (stateless UI)
    params: Optional[Dict[str, Any]] = Field(default_factory=dict)
    history: Optional[List[Dict[str, str]]] = Field(default_factory=list)
```
- `message`
  User's current query (required).
- `context`
  Reserved for future use; not required for the stateless path.
- `use_web_search`
  Optional boolean for web search integration. Currently not used by the stateless HTML chat path (`chat.html` sets this to `null`).
- `params`
  Arbitrary dict of pipeline parameters, passed through to `run_pipeline`.
- `history`
  Optional list of prior bubbles:
  ```json
  [
    {"role": "user", "content": "previous question"},
    {"role": "assistant", "content": "previous answer"}
  ]
  ```
3.2 params contract
`params` is a flat dictionary. Common keys:

Retrieval
- `top_k: int | null`
- `score_threshold: float | null`
- `namespace: str | null` - Domain/collection isolation

Summarizer / history window
- `raw_tail_turns: int | null`
- `summarizer_max_input_tokens: int | null`
- `summarizer_max_output_tokens: int | null`

Inference
- `temperature: float | null`
- `top_p: float | null`
- `max_output_tokens: int | null`
- `reasoning_effort: str | null` - Reasoning intensity for reasoning models ("minimal", "low", "medium", "high")
- `render_html: bool | null` - Enable server-side Markdown to HTML rendering

Query rewrite
- `enable_query_rewrite: bool | null`
- `rewrite_confidence_threshold: float | null`
- `rewrite_tail_turns: int | null`
- `rewrite_summary_turns: int | null`

Tools
- `use_tools: bool`

Provider/model overrides (optional)
- `model_keys: object` - Stage-specific model overrides:
  ```json
  {
    "inference": "gemini:gemini-2.5-flash",
    "rewrite": "openai:gpt-4o-mini",
    "rerank": "openai:gpt-4o-mini",
    "summary": "openai:gpt-4o-mini",
    "tools_synth": "openai:gpt-4o-mini"
  }
  ```

UX / observability
- `query_id: str`
- `conversation_id: str`
- `user_id: str` - Optional user identifier for token accounting
- `show_sources: bool` - Source citation display control
- `prompt_domain: str` - Domain for prompt registry resolution

Processing-stage visibility
- `show_processing_steps: bool`
  Controls intermediate SSE stage events (query rewrite, retrieval, rerank, summary, web context, prompt build, generating response, tool calls, tool synthesis). The final "Final Answer" and "Done" stages are always emitted.
4. Backend defaults (Settings)
`backend/core/config.py`:
```python
class Settings(BaseSettings):
    ...
    # 11) Debug / logging controls
    debug_verbose: bool = False
    debug_log_keys: bool = False
    debug_log_truncate_chars: int = 200  # max chars to print when debug_verbose is True
    show_processing_steps: bool = True   # controls whether intermediate SSE processing stages are emitted
```
Resolution in `run_pipeline`:
- If `params["show_processing_steps"]` exists → use that.
- Else → fall back to `settings.show_processing_steps`.
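The resolution order can be sketched as follows (a minimal sketch; the function name `resolve_show_steps` is illustrative, not from the codebase):

```python
def resolve_show_steps(params: dict, settings_default: bool = True) -> bool:
    # The per-request value wins when present, even if it is explicitly False.
    if "show_processing_steps" in params:
        return bool(params["show_processing_steps"])
    # Otherwise fall back to the Settings default (SHOW_PROCESSING_STEPS env).
    return settings_default
```

Note that membership (`in`) is checked rather than truthiness, so a client sending `false` genuinely suppresses the intermediate stages.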
5. Response shape
Typical response:
```json
{
  "answer": "Final answer text",
  "response": "Final answer text",
  "answer_html": "<p>Final answer with HTML formatting</p>",
  "reasoning": "Step-by-step reasoning process...",
  "metrics": {
    "vectors_retrieved": 8
  },
  "turn_metrics": {
    "tokens": {
      "embedding": 1500,
      "rewrite": 120,
      "rerank": 80,
      "inference": 250,
      "reasoning": 100,
      "total": 1950
    },
    "cost": {
      "embedding": 0.003,
      "rewrite": 0.002,
      "rerank": 0.001,
      "inference": 0.005,
      "total": 0.011
    }
  },
  "conversation_totals": {
    "tokens": {"total": 5000},
    "cost": {"total": 0.025},
    "messages": 3
  },
  "tools_used": ["get_weather"],
  "rewrite_display": {
    "enabled": true,
    "triggered": true,
    "accepted": true,
    "original": "Where is it?",
    "rewritten": "Where is Mount Whitney located?",
    "confidence": 0.82,
    "threshold": 0.67,
    "ambiguous": false,
    "reason": "",
    "changed": true
  }
}
```
6. Example calls
6.1. Curl example (minimal)
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain how this RAG chat pipeline works.",
    "use_web_search": false,
    "history": [],
    "params": {
      "top_k": 8,
      "score_threshold": 0.35,
      "temperature": 0.4,
      "max_output_tokens": 300,
      "reasoning_effort": "low",
      "render_html": true,
      "enable_query_rewrite": true,
      "rewrite_confidence_threshold": 0.67,
      "rewrite_tail_turns": 1,
      "use_tools": false,
      "show_processing_steps": true,
      "show_sources": true,
      "namespace": "default",
      "prompt_domain": "default",
      "query_id": "abcd1234",
      "conversation_id": "demo-convo-1",
      "user_id": "user123",
      "model_keys": {
        "inference": "openai:gpt-4o-mini"
      }
    }
  }'
```
6.2. Curl example with processing steps hidden
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain how this RAG chat pipeline works.",
    "use_web_search": false,
    "history": [],
    "params": {
      "top_k": 8,
      "score_threshold": 0.35,
      "temperature": 0.4,
      "max_output_tokens": 300,
      "enable_query_rewrite": true,
      "rewrite_confidence_threshold": 0.67,
      "rewrite_tail_turns": 1,
      "rewrite_summary_turns": 3,
      "use_tools": false,
      "show_processing_steps": false,
      "query_id": "abcd1234",
      "conversation_id": "demo-convo-1"
    }
  }'
```
6.3. Python example using requests
```python
import uuid

import requests

BASE_URL = "http://localhost:8000"


def call_chat(message: str, show_steps: bool = True):
    query_id = uuid.uuid4().hex[:8]
    conversation_id = "demo-conversation-1"

    payload = {
        "message": message,
        "use_web_search": False,
        "history": [],
        "params": {
            "top_k": 8,
            "score_threshold": 0.35,
            "summarizer_max_input_tokens": 400,
            "summarizer_max_output_tokens": 200,
            "raw_tail_turns": 2,
            "temperature": 0.4,
            "top_p": 0.9,
            "max_output_tokens": 300,
            "reasoning_effort": "minimal",
            "render_html": False,
            "enable_query_rewrite": True,
            "rewrite_confidence_threshold": 0.67,
            "rewrite_tail_turns": 1,
            "rewrite_summary_turns": 3,
            "use_tools": False,
            "show_processing_steps": show_steps,
            "show_sources": True,
            "namespace": "default",
            "prompt_domain": "default",
            "query_id": query_id,
            "conversation_id": conversation_id,
            "user_id": "demo-user",
            "model_keys": {
                "inference": "openai:gpt-4o-mini",
                "rewrite": "openai:gpt-4o-mini",
                "rerank": "openai:gpt-4o-mini",
            },
        },
    }

    resp = requests.post(f"{BASE_URL}/chat", json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()

    print("Answer:", data.get("answer") or data.get("response"))
    print("Answer HTML:", data.get("answer_html"))  # When HTML rendering is enabled
    print("Reasoning:", data.get("reasoning"))      # For reasoning models
    print("Metrics:", data.get("metrics"))
    print("Turn metrics:", data.get("turn_metrics"))
    print("Conversation totals:", data.get("conversation_totals"))
    print("Tools used:", data.get("tools_used"))
    print("Rewrite display:", data.get("rewrite_display"))


if __name__ == "__main__":
    call_chat("Give me a short overview of how this RAG chat pipeline works.", show_steps=True)
```
7. Notes for integrators
- `/chat` is ideal for:
  - Browser-based UIs similar to `frontend/chat.html`.
  - External clients that manage their own `conversation_id` and `history`.
- Use `params.show_processing_steps` for per-turn control of intermediate stage visibility, and `settings.show_processing_steps` (or the `SHOW_PROCESSING_STEPS` env var) for global defaults.
- The RAG logic and final answer are unchanged by `show_processing_steps`; it only affects what's emitted on the SSE "Processing steps" stream.
Session-Based Chat API (Stateful /chat/{session_id} Endpoint)
This document describes the session-based chat API that provides server-side conversation state management, ideal for backend integrations, mobile apps, and multi-device scenarios.
Table of Contents
- Session API Overview
- Session Management
  2.1. Create Session
  2.2. Send Message to Session
  2.3. Get Session History
- Model Override Examples
- Stateless vs Session-Based APIs
- Token Accounting & Namespaces
- Session vs Stateless Comparison
1. Session API Overview
The session-based API provides:
- Server-side conversation state - No need to send history in each request
- Automatic context management - Token-aware history truncation
- Multi-device support - Same session accessible from different clients
- Identical pipeline quality - Same RAG processing as stateless endpoint
Key Differences from Stateless API
| Feature | Stateless (`/chat`) | Session-Based (`/chat/{session_id}`) |
|---|---|---|
| History Management | Client sends full history each request | Server maintains history automatically |
| State | No server state | Persistent session state |
| Setup | No setup required | Create session first |
| Use Case | Simple integrations, web UI | Backend systems, mobile apps |
2. Session Management
2.1. Create Session
Endpoint: POST /chat/session
Request: (empty body)
Response:
```json
{
  "session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd"
}
```
Example:
```bash
curl -X POST http://localhost:8000/chat/session
```
2.2. Send Message to Session
Endpoint: POST /chat/{session_id}
Request Schema: Same as stateless /chat endpoint, but history is optional (server manages it)
Response Schema: Same as stateless /chat endpoint
Example - First Message:
```bash
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is Mount Everest?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'
```
Example - Follow-up Message (Context Preserved):
```bash
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How tall is it?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'
```
2.3. Get Session History
Endpoint: GET /chat/{session_id}/history
Response:
```json
{
  "session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd",
  "messages": [
    {"role": "user", "content": "What is Mount Everest?"},
    {"role": "assistant", "content": "Mount Everest is Earth's highest mountain..."},
    {"role": "user", "content": "How tall is it?"},
    {"role": "assistant", "content": "Mount Everest stands at 8,848 meters..."}
  ],
  "created_at": "2024-01-15T10:30:00Z",
  "last_activity": "2024-01-15T10:32:15Z"
}
```
Example:
```bash
curl http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd/history
```
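The full create → send → history flow above can be sketched end-to-end in Python. This is a hedged sketch assuming the backend runs locally on port 8000; the `params` shown are illustrative, and `session_demo` is not a name from the codebase.

```python
import requests

BASE_URL = "http://localhost:8000"


def session_demo() -> None:
    # 1. Create a session (empty request body).
    session_id = requests.post(f"{BASE_URL}/chat/session", timeout=30).json()["session_id"]

    # 2. Send messages; the server carries the history, so we never resend it.
    for question in ("What is Mount Everest?", "How tall is it?"):
        resp = requests.post(
            f"{BASE_URL}/chat/{session_id}",
            json={"message": question, "params": {"top_k": 5}},
            timeout=60,
        )
        resp.raise_for_status()
        print("Answer:", resp.json().get("answer"))

    # 3. Inspect the server-side history (alternating user/assistant turns).
    history = requests.get(f"{BASE_URL}/chat/{session_id}/history", timeout=30).json()
    print(len(history["messages"]), "messages stored")
```

With the backend up, calling `session_demo()` should print both answers and then report four stored messages (two user, two assistant).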
3. Model Override Examples
3.1. Override Inference Model
Use `model_keys` to override models per request:
```bash
curl -X POST http://localhost:8000/chat/fd91c243-1f0f-441a-8ce9-635377ba54a5 \
  -H "Content-Type: application/json" \
  -d '{
    "message": "what is the elevation difference with kilimanjaro?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash"
      }
    }
  }'
```
3.2. Stage-Specific Model Overrides
Override specific pipeline stages:
```bash
curl -X POST http://localhost:8000/chat/session-id \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash",
        "rewrite": "openai:gpt-4o-mini",
        "rerank": "openai:gpt-4o-mini",
        "summary": "openai:gpt-4o-mini",
        "tools_synth": "gemini:gemini-2.5-flash"
      }
    }
  }'
```
3.3. Available Model Keys
| Stage | Example Models |
|---|---|
| inference | `openai:gpt-4o-mini`, `gemini:gemini-2.5-flash`, `openai:gpt-4o` |
| rewrite | `openai:gpt-4o-mini`, `gemini:gemini-2.5-flash` |
| rerank | `openai:gpt-4o-mini`, `openai:gpt-4o` |
| summary | `openai:gpt-4o-mini`, `gemini:gemini-2.5-flash` |
| tools_synth | `openai:gpt-4o-mini`, `gemini:gemini-2.5-flash` |
4. Stateless vs Session-Based APIs
4.1. Key Differences at a Glance
| Feature | Stateless (`/chat`) | Session-Based (`/chat/{session_id}`) |
|---|---|---|
| History Management | Client sends full history each request | Server maintains history automatically |
| State | No server state | Persistent session state |
| Setup | No setup required | Create session first |
| Request Size | Larger (includes history) | Smaller (message only) |
| Use Case | Simple integrations, web UI | Backend systems, mobile apps |
| Token Limits | Client manages context | Server manages context automatically |
4.2. When to Use Which API
| Scenario | Recommended API | Reason |
|---|---|---|
| Web frontend | Stateless (`/chat`) | Simpler, client-managed state |
| Mobile apps | Session-based (`/chat/{session_id}`) | Server-side persistence |
| Backend integrations | Session-based | Automatic context management |
| Multi-device access | Session-based | Shared conversation state |
| Simple API calls | Stateless | No session setup needed |
| Long-running conversations | Session-based | Automatic history management |
4.3. Request/Response Comparison
Stateless Request
```json
{
  "message": "What is Mount Everest?",
  "history": [
    {"role": "user", "content": "Tell me about mountains"},
    {"role": "assistant", "content": "Mountains are..."}
  ],
  "params": {...}
}
```
Session-Based Request
```json
{
  "message": "What is Mount Everest?",
  "params": {...}
}
```
(History is managed automatically by server)
5. Token Accounting & Namespaces
5.1. Namespace Patterns
The system uses different namespace patterns for proper token accounting isolation:
| Approach | Namespace Pattern | Source | Example |
|---|---|---|---|
| Stateless | `user_id:conversation_id` | Request params | `user123:conv456` |
| Session-Based | `session:{session_id}` | Auto-generated | `session:12d8cd79-...` |
5.2. Token Accounting Examples
Stateless Token Accounting
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",
      "conversation_id": "conv456",
      "top_k": 5
    }
  }'
```
- Namespace: `user123:conv456`
- Token tracking: Isolated per conversation
- Use case: Web frontend with multiple conversations per user
Session-Based Token Accounting
```bash
curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",
      "top_k": 5
    }
  }'
```
Here `user_id` is optional and used only for analytics; the namespace comes from the session ID.
- Namespace: `session:12d8cd79-0ee8-4dcd-97a5-5983effcbccd`
- Token tracking: Isolated per session
- Use case: Mobile app with session-based conversations
5.3. Token Accounting Response
Both APIs return token usage metrics in the response:
```json
{
  "answer": "RAG stands for Retrieval-Augmented Generation...",
  "metrics": {
    "tokens": {
      "embedding": 1500,
      "rewrite": 120,
      "rerank": 80,
      "inference": 250
    },
    "cost": {
      "embedding": 0.003,
      "rewrite": 0.002,
      "rerank": 0.001,
      "inference": 0.005
    }
  },
  "turn_metrics": {
    "tokens": {"total": 1950},
    "cost": {"total": 0.011}
  },
  "conversation_totals": {
    "tokens": {"total": 5000},
    "cost": {"total": 0.025},
    "messages": 3
  }
}
```
5.4. Benefits of Namespace Isolation
- Cost Tracking - Monitor tokens per user/conversation/session
- Cache Management - Separate caches for different contexts
- Resource Isolation - Prevent cross-contamination of data
- Usage Analytics - Track patterns per namespace
- Multi-tenant Safety - Ensure data isolation between users
6. Session vs Stateless Comparison
When to Use Session-Based API
The scenario-by-scenario recommendations are the same as the table in section 4.2, "When to Use Which API," above.
Quality and Performance
Both APIs provide:
- Identical RAG pipeline quality
- Same retrieval and reranking
- Same query rewrite logic
- Same LLM inference models
- Same tool execution capabilities
The only difference is history management:
- Stateless: Client sends full history each request
- Session-based: Server maintains and manages history
Session Context Management
The session manager automatically:
- Maintains conversation history across requests
- Applies token limits to prevent context overflow
- Preserves conversation flow for follow-up questions
- Handles context truncation when token limits are exceeded
Context Building Logic:
```python
# From ChatSessionManager.get_context()
context = []
total_tokens = 0
for msg in reversed(messages):                 # walk newest messages first
    msg_tokens = len(msg["content"].split())   # rough whitespace-based token estimate
    if total_tokens + msg_tokens <= max_history_tokens:
        context.append(msg)
        total_tokens += msg_tokens             # accumulate so the budget is enforced
    else:
        break
```