
Chat API Reference (Stateless & Session-Based Endpoints)

About this document

This page provides complete API documentation for the Chat-with-RAG system, including both stateless and session-based chat endpoints with request/response formats, parameters, and integration examples.

Note: If you landed here directly (for example from documentation hosting or search), start with the repository README to see how to run the system locally and try the interactive demo.

This document covers both the stateless /chat endpoint (used by frontend/chat.html) and the session-based /chat/{session_id} endpoint for server-side conversation management.


Table of Contents

  1. What this README covers
  2. High-level data flow
  3. Request schema
    3.1. Top-level request body
    3.2. params contract
  4. Backend defaults (Settings)
  5. Response shape
  6. Example calls
    6.1. Curl example (minimal)
    6.2. Curl example with processing steps hidden
    6.3. Python example using requests
  7. Notes for integrators

1. What this README covers

This first part documents the stateless POST /chat endpoint: the high-level data flow from client to pipeline, the request schema and params contract, backend defaults, the response shape, and runnable curl and Python examples. The session-based /chat/{session_id} endpoints are covered in the second part below.

2. High-level data flow

  1. Client (browser / external caller)
    Sends POST /chat with a JSON body matching backend/core/schemas.ChatRequest.

  2. FastAPI (backend/main.py)
    • Route:
      @app.post("/chat", tags=["3. Search & Chat"], summary="5. Chat (stateless)")
      async def chat_with_content(chat_request: ChatRequest):
          ...
      
    • Delegates to ChatManager.handle_chat (or the module-level handle_chat):
      result = await asyncio.to_thread(handler, chat_request.model_dump())
      return result
      
  3. Chat manager (backend/chat/chat_manager.py)
    • handle_chat(payload: Dict[str, Any]) -> Dict[str, Any]:
      • Extracts message, history, and params from the payload.
      • Prepares deps and req.
      • Calls run_pipeline(deps=deps, req=req).
    • run_pipeline(...) contains the full RAG pipeline:
      • query rewrite → retrieve → maybe rerank → summarize history → build prompt → inference → optional tools → final answer + metrics.
  4. Response
    • handle_chat returns a JSON dict shaped like a ChatResponse-plus-metrics:
      {
        "answer": "...",
        "response": "...",
        "metrics": { ... },
        "turn_metrics": { ... },
        "conversation_totals": { ... },
        "tools_used": [ ... ],
        "rewrite_display": { ... }
      }
      

3. Request schema

3.1 Top-level request body

The /chat route uses backend.core.schemas.ChatRequest:

class ChatRequest(BaseModel):
    message: str
    context: List[Dict] = []
    use_web_search: Optional[bool] = None
    # Pass-through of UI parameters and chat bubbles history (stateless UI)
    params: Optional[Dict[str, Any]] = Field(default_factory=dict)
    history: Optional[List[Dict[str, str]]] = Field(default_factory=list)
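To illustrate the contract, a request body matching these fields and their defaults can be assembled by hand (`make_chat_payload` is a hypothetical helper, not part of the codebase):

```python
from typing import Any, Dict, List, Optional

def make_chat_payload(
    message: str,
    history: Optional[List[Dict[str, str]]] = None,
    params: Optional[Dict[str, Any]] = None,
    use_web_search: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a JSON body matching ChatRequest's fields and defaults."""
    return {
        "message": message,
        "context": [],                     # List[Dict], defaults to []
        "use_web_search": use_web_search,  # Optional[bool], defaults to None
        "params": params or {},            # flat dict of UI parameters
        "history": history or [],          # chat-bubble history (stateless UI)
    }

payload = make_chat_payload("Hello", params={"top_k": 8})
```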

3.2 params contract

params is a flat dictionary. Common keys, grouped by pipeline stage (all appear in the examples later in this document):

Retrieval

  • top_k, score_threshold, namespace

Summarizer / history window

  • summarizer_max_input_tokens, summarizer_max_output_tokens, raw_tail_turns

Inference

  • temperature, top_p, max_output_tokens, reasoning_effort, render_html, prompt_domain

Query rewrite

  • enable_query_rewrite, rewrite_confidence_threshold, rewrite_tail_turns, rewrite_summary_turns

Tools

  • use_tools

Provider/model overrides (optional)

  • model_keys: per-stage "provider:model" strings, e.g. {"inference": "openai:gpt-4o-mini"}

UX / observability

  • show_sources, query_id, conversation_id, user_id

Processing-stage visibility

  • show_processing_steps

Controls intermediate SSE stage events (query rewrite, retrieval, rerank, summary, web context, prompt build, generating response, tool calls, tool synthesis). Final "Final Answer" and "Done" stages are always emitted.
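The visibility rule can be sketched as a filter over stage names (a simplified illustration; the real pipeline emits SSE events, and the stage names below are taken from the list above):

```python
# Stages whose events are always emitted, regardless of the setting.
ALWAYS_EMITTED = {"Final Answer", "Done"}

def visible_stages(stages, show_processing_steps):
    """Return the stage events a client would receive for a given setting."""
    if show_processing_steps:
        return list(stages)
    return [s for s in stages if s in ALWAYS_EMITTED]

pipeline_stages = [
    "Query Rewrite", "Retrieval", "Rerank", "Summary", "Web Context",
    "Prompt Build", "Generating Response", "Final Answer", "Done",
]
```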


4. Backend defaults (Settings)

backend/core/config.py:

class Settings(BaseSettings):
    ...
    # 11) Debug / logging controls
    debug_verbose: bool = False
    debug_log_keys: bool = False
    debug_log_truncate_chars: int = 200  # max chars to print when debug_verbose is True
    show_processing_steps: bool = True  # controls whether intermediate SSE processing stages are emitted

Resolution in run_pipeline:

  1. If params["show_processing_steps"] exists → use that.
  2. Else → fall back to settings.show_processing_steps.
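In code, the resolution order amounts to the following (settings_default stands in for settings.show_processing_steps; the function name is illustrative):

```python
def resolve_show_processing_steps(params, settings_default=True):
    """Request param wins when present; otherwise fall back to Settings."""
    if "show_processing_steps" in params:
        return bool(params["show_processing_steps"])
    return settings_default
```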

5. Response shape

Typical response:

{
  "answer": "Final answer text",
  "response": "Final answer text",
  "answer_html": "<p>Final answer with HTML formatting</p>",
  "reasoning": "Step-by-step reasoning process...",
  "metrics": {
    "vectors_retrieved": 8
  },
  "turn_metrics": {
    "tokens": {
      "embedding": 1500,
      "rewrite": 120,
      "rerank": 80,
      "inference": 250,
      "reasoning": 100,
      "total": 1950
    },
    "cost": {
      "embedding": 0.003,
      "rewrite": 0.002,
      "rerank": 0.001,
      "inference": 0.005,
      "total": 0.011
    }
  },
  "conversation_totals": {
    "tokens": {"total": 5000},
    "cost": {"total": 0.025},
    "messages": 3
  },
  "tools_used": ["get_weather"],
  "rewrite_display": {
    "enabled": true,
    "triggered": true,
    "accepted": true,
    "original": "Where is it?",
    "rewritten": "Where is Mount Whitney located?",
    "confidence": 0.82,
    "threshold": 0.67,
    "ambiguous": false,
    "reason": "",
    "changed": true
  }
}
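For integrators, pulling the most commonly needed fields out of this shape might look like the sketch below (`summarize_response` is an illustrative helper, not part of the codebase):

```python
def summarize_response(data):
    """Pull the fields most integrators need out of a /chat response dict."""
    answer = data.get("answer") or data.get("response") or ""
    turn = data.get("turn_metrics", {})
    tokens = turn.get("tokens", {}).get("total", 0)
    cost = turn.get("cost", {}).get("total", 0.0)
    tools = ", ".join(data.get("tools_used", [])) or "none"
    return f"{answer} [tokens={tokens}, cost=${cost:.3f}, tools={tools}]"

sample = {
    "answer": "Final answer text",
    "turn_metrics": {"tokens": {"total": 1950}, "cost": {"total": 0.011}},
    "tools_used": ["get_weather"],
}
summary = summarize_response(sample)
```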

6. Example calls

6.1. Curl example (minimal)

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain how this RAG chat pipeline works.",
    "use_web_search": false,
    "history": [],
    "params": {
      "top_k": 8,
      "score_threshold": 0.35,
      "temperature": 0.4,
      "max_output_tokens": 300,
      "reasoning_effort": "low",
      "render_html": true,
      "enable_query_rewrite": true,
      "rewrite_confidence_threshold": 0.67,
      "rewrite_tail_turns": 1,
      "use_tools": false,
      "show_processing_steps": true,
      "show_sources": true,
      "namespace": "default",
      "prompt_domain": "default",
      "query_id": "abcd1234",
      "conversation_id": "demo-convo-1",
      "user_id": "user123",
      "model_keys": {
        "inference": "openai:gpt-4o-mini"
      }
    }
  }'

6.2. Curl example with processing steps hidden

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain how this RAG chat pipeline works.",
    "use_web_search": false,
    "history": [],
    "params": {
      "top_k": 8,
      "score_threshold": 0.35,
      "temperature": 0.4,
      "max_output_tokens": 300,
      "enable_query_rewrite": true,
      "rewrite_confidence_threshold": 0.67,
      "rewrite_tail_turns": 1,
      "rewrite_summary_turns": 3,
      "use_tools": false,
      "show_processing_steps": false,
      "query_id": "abcd1234",
      "conversation_id": "demo-convo-1"
    }
  }'

6.3. Python example using requests

import uuid
import requests

BASE_URL = "http://localhost:8000"

def call_chat(message: str, show_steps: bool = True):
    query_id = uuid.uuid4().hex[:8]
    conversation_id = "demo-conversation-1"

    payload = {
        "message": message,
        "use_web_search": False,
        "history": [],
        "params": {
            "top_k": 8,
            "score_threshold": 0.35,
            "summarizer_max_input_tokens": 400,
            "summarizer_max_output_tokens": 200,
            "raw_tail_turns": 2,
            "temperature": 0.4,
            "top_p": 0.9,
            "max_output_tokens": 300,
            "reasoning_effort": "minimal",
            "render_html": False,
            "enable_query_rewrite": True,
            "rewrite_confidence_threshold": 0.67,
            "rewrite_tail_turns": 1,
            "rewrite_summary_turns": 3,
            "use_tools": False,
            "show_processing_steps": show_steps,
            "show_sources": True,
            "namespace": "default",
            "prompt_domain": "default",
            "query_id": query_id,
            "conversation_id": conversation_id,
            "user_id": "demo-user",
            "model_keys": {
                "inference": "openai:gpt-4o-mini",
                "rewrite": "openai:gpt-4o-mini",
                "rerank": "openai:gpt-4o-mini"
            }
        },
    }

    resp = requests.post(f"{BASE_URL}/chat", json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    print("Answer:", data.get("answer") or data.get("response"))
    print("Answer HTML:", data.get("answer_html"))  # When HTML rendering is enabled
    print("Reasoning:", data.get("reasoning"))  # For reasoning models
    print("Metrics:", data.get("metrics"))
    print("Turn metrics:", data.get("turn_metrics"))
    print("Conversation totals:", data.get("conversation_totals"))
    print("Tools used:", data.get("tools_used"))
    print("Rewrite display:", data.get("rewrite_display"))

if __name__ == "__main__":
    call_chat("Give me a short overview of how this RAG chat pipeline works.", show_steps=True)

7. Notes for integrators

  • The final answer text is duplicated in both answer and response; read either.
  • Pass stable query_id, conversation_id, and user_id values so turn metrics and conversation totals accumulate under the right accounting namespace.
  • show_processing_steps controls only intermediate SSE stages; "Final Answer" and "Done" are always emitted.
  • Use params.model_keys to override the provider and model per request and per pipeline stage.

Session-Based Chat API (Stateful /chat/{session_id} Endpoint)

This document describes the session-based chat API that provides server-side conversation state management, ideal for backend integrations, mobile apps, and multi-device scenarios.

Table of Contents

  1. Session API Overview
  2. Session Management
    2.1. Create Session
    2.2. Send Message to Session
    2.3. Get Session History
  3. Model Override Examples
  4. Stateless vs Session-Based APIs
  5. Token Accounting & Namespaces

1. Session API Overview

The session-based API provides:

  • Server-side conversation history, so clients send only the new message each turn
  • A persistent session identified by a server-generated session_id
  • Smaller request payloads than the stateless endpoint
  • Shared conversation state across devices and backend services

Key Differences from Stateless API

| Feature | Stateless (/chat) | Session-Based (/chat/{session_id}) |
| --- | --- | --- |
| History Management | Client sends full history each request | Server maintains history automatically |
| State | No server state | Persistent session state |
| Setup | No setup required | Create session first |
| Use Case | Simple integrations, web UI | Backend systems, mobile apps |

2. Session Management

2.1. Create Session

Endpoint: POST /chat/session

Request: (empty body)

Response:

{
  "session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd"
}

Example:

curl -X POST http://localhost:8000/chat/session

2.2. Send Message to Session

Endpoint: POST /chat/{session_id}

Request Schema: Same as stateless /chat endpoint, but history is optional (server manages it)

Response Schema: Same as stateless /chat endpoint

Example - First Message:

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is Mount Everest?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'

Example - Follow-up Message (Context Preserved):

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How tall is it?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'

2.3. Get Session History

Endpoint: GET /chat/{session_id}/history

Response:

{
  "session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd",
  "messages": [
    {"role": "user", "content": "What is Mount Everest?"},
    {"role": "assistant", "content": "Mount Everest is Earth's highest mountain..."},
    {"role": "user", "content": "How tall is it?"},
    {"role": "assistant", "content": "Mount Everest stands at 8,848 meters..."}
  ],
  "created_at": "2024-01-15T10:30:00Z",
  "last_activity": "2024-01-15T10:32:15Z"
}

Example:

curl http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd/history
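The three session endpoints can be wrapped in small URL and payload helpers (a network-free sketch; the helper names are hypothetical):

```python
BASE_URL = "http://localhost:8000"

def session_create_url(base=BASE_URL):
    return f"{base}/chat/session"               # POST, empty body

def session_message_url(session_id, base=BASE_URL):
    return f"{base}/chat/{session_id}"          # POST, ChatRequest-style body

def session_history_url(session_id, base=BASE_URL):
    return f"{base}/chat/{session_id}/history"  # GET

def session_message_body(message, **params):
    # history may stay empty: the server maintains it for the session.
    return {"message": message, "history": [], "params": params}
```

A caller would then send `requests.post(session_message_url(sid), json=session_message_body("How tall is it?", top_k=5))` and read the same response shape as the stateless endpoint.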

3. Model Override Examples

3.1. Override Inference Model

Use model_keys to override models per request:

curl -X POST http://localhost:8000/chat/fd91c243-1f0f-441a-8ce9-635377ba54a5 \
  -H "Content-Type: application/json" \
  -d '{
    "message": "what is the elevation difference with kilimanjaro?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash"
      }
    }
  }'

3.2. Stage-Specific Model Overrides

Override specific pipeline stages:

curl -X POST http://localhost:8000/chat/session-id \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash",
        "rewrite": "openai:gpt-4o-mini",
        "rerank": "openai:gpt-4o-mini",
        "summary": "openai:gpt-4o-mini",
        "tools_synth": "gemini:gemini-2.5-flash"
      }
    }
  }'

3.3. Available Model Keys

| Stage | Example Models |
| --- | --- |
| inference | openai:gpt-4o-mini, gemini:gemini-2.5-flash, openai:gpt-4o |
| rewrite | openai:gpt-4o-mini, gemini:gemini-2.5-flash |
| rerank | openai:gpt-4o-mini, openai:gpt-4o |
| summary | openai:gpt-4o-mini, gemini:gemini-2.5-flash |
| tools_synth | openai:gpt-4o-mini, gemini:gemini-2.5-flash |
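Each value follows a provider:model format, which can be checked with a small parser (an illustrative helper, assuming the first colon separates provider from model):

```python
def parse_model_key(key):
    """Split a 'provider:model' override into its two parts."""
    provider, _, model = key.partition(":")
    if not provider or not model:
        raise ValueError(f"expected 'provider:model', got {key!r}")
    return provider, model
```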

4. Stateless vs Session-Based APIs

4.1. Key Differences at a Glance

| Feature | Stateless (/chat) | Session-Based (/chat/{session_id}) |
| --- | --- | --- |
| History Management | Client sends full history each request | Server maintains history automatically |
| State | No server state | Persistent session state |
| Setup | No setup required | Create session first |
| Request Size | Larger (includes history) | Smaller (message only) |
| Use Case | Simple integrations, web UI | Backend systems, mobile apps |
| Token Limits | Client manages context | Server manages context automatically |

4.2. When to Use Which API

| Scenario | Recommended API | Reason |
| --- | --- | --- |
| Web frontend | Stateless (/chat) | Simpler, client-managed state |
| Mobile apps | Session-based (/chat/{session_id}) | Server-side persistence |
| Backend integrations | Session-based | Automatic context management |
| Multi-device access | Session-based | Shared conversation state |
| Simple API calls | Stateless | No session setup needed |
| Long-running conversations | Session-based | Automatic history management |

4.3. Request/Response Comparison

Stateless Request

{
  "message": "What is Mount Everest?",
  "history": [
    {"role": "user", "content": "Tell me about mountains"},
    {"role": "assistant", "content": "Mountains are..."}
  ],
  "params": {...}
}

Session-Based Request

{
  "message": "What is Mount Everest?",
  "params": {...}
}

(History is managed automatically by the server.)


5. Token Accounting & Namespaces

5.1. Namespace Patterns

The system uses different namespace patterns for proper token accounting isolation:

| Approach | Namespace Pattern | Source | Example |
| --- | --- | --- | --- |
| Stateless | user_id:conversation_id | Request params | user123:conv456 |
| Session-Based | session:{session_id} | Auto-generated | session:12d8cd79-... |
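The two patterns can be expressed as a single helper (a sketch mirroring the table above, not the actual backend function):

```python
def accounting_namespace(session_id=None, user_id=None, conversation_id=None):
    """Build the token-accounting namespace for either API style."""
    if session_id:                    # session-based: auto-generated pattern
        return f"session:{session_id}"
    if user_id and conversation_id:   # stateless: taken from request params
        return f"{user_id}:{conversation_id}"
    raise ValueError("need a session_id, or both user_id and conversation_id")
```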

5.2. Token Accounting Examples

Stateless Token Accounting

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",
      "conversation_id": "conv456",
      "top_k": 5
    }
  }'

Session-Based Token Accounting

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is RAG?",
    "history": [],
    "params": {
      "user_id": "user123",  # Optional, for analytics
      "top_k": 5
    }
  }'

5.3. Token Accounting Response

Both APIs return token usage metrics in the response:

{
  "answer": "RAG stands for Retrieval-Augmented Generation...",
  "metrics": {
    "tokens": {
      "embedding": 1500,
      "rewrite": 120,
      "rerank": 80,
      "inference": 250
    },
    "cost": {
      "embedding": 0.003,
      "rewrite": 0.002,
      "rerank": 0.001,
      "inference": 0.005
    }
  },
  "turn_metrics": {
    "tokens": {"total": 1950},
    "cost": {"total": 0.011}
  },
  "conversation_totals": {
    "tokens": {"total": 5000},
    "cost": {"total": 0.025},
    "messages": 3
  }
}
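As a sanity check, the per-stage figures in metrics should sum to the totals reported in turn_metrics (`stage_totals` is an illustrative helper):

```python
def stage_totals(metrics):
    """Sum the per-stage token and cost figures from a response's metrics."""
    return {
        "tokens": sum(metrics.get("tokens", {}).values()),
        "cost": round(sum(metrics.get("cost", {}).values()), 6),
    }

metrics = {
    "tokens": {"embedding": 1500, "rewrite": 120, "rerank": 80, "inference": 250},
    "cost": {"embedding": 0.003, "rewrite": 0.002, "rerank": 0.001, "inference": 0.005},
}
totals = stage_totals(metrics)  # should match turn_metrics: 1950 tokens, 0.011 cost
```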

5.4. Benefits of Namespace Isolation

  • Per-conversation token and cost totals never mix across users or conversations
  • Stateless callers control the namespace explicitly via user_id and conversation_id
  • Session namespaces derive from the server-generated session_id, so they are unique by construction


6. Quality, Performance, and Session Internals

6.1. Quality and Performance

Both APIs provide:

  • The same RAG pipeline: query rewrite, retrieval, optional rerank, history summarization, inference, and optional tools
  • The same response shape, including metrics, turn_metrics, and conversation_totals
  • The same params contract, including per-stage model_keys overrides

The only difference is history management: stateless callers send the full history with every request, while session-based callers let the server maintain it.

6.2. Session Context Management

The session manager automatically:

  • Stores each user and assistant message in the session
  • Tracks the session's created_at and last_activity timestamps
  • Builds the model context from the most recent messages that fit within the history token budget

Context Building Logic:

# From ChatSessionManager.get_context() (simplified)
context, total_tokens = [], 0
for msg in reversed(messages):                 # walk newest-first
    msg_tokens = len(msg["content"].split())   # word count as a rough token proxy
    if total_tokens + msg_tokens > max_history_tokens:
        break
    context.append(msg)
    total_tokens += msg_tokens
context.reverse()                              # restore chronological order
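A self-contained, runnable version of that budget loop (word counts stand in for real token counts, as in the snippet above):

```python
def get_context(messages, max_history_tokens):
    """Keep the most recent messages whose approximate token counts fit the budget."""
    context, total = [], 0
    for msg in reversed(messages):                # newest first
        msg_tokens = len(msg["content"].split())  # word count as a token proxy
        if total + msg_tokens > max_history_tokens:
            break
        context.append(msg)
        total += msg_tokens
    context.reverse()                             # restore chronological order
    return context

history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
trimmed = get_context(history, max_history_tokens=3)  # keeps the last two messages
```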