

Technical Overview

About this document

This page explains the architecture and internal design of the Chat-with-RAG system — a modular reference implementation for building Tool-Assisted Retrieval-Augmented Generation (RAG) conversational applications.

Note: If you landed here directly (for example from documentation hosting or search), start with the repository README to see how to run the system locally and try the interactive demo.

🗺️ High‑Level Architecture Diagram

A simplified conceptual overview of the system’s flow:

Ingestion Flow

        +-------------------+
        |   Source Docs     |
        | (HTML/Wiki/PDF)   |
        +---------+---------+
                  |
         Extraction & Processing
                  |
          Chunking & Metadata
                  |
     Estimation / Embedding Generation
                  |
             Qdrant Index
                

Retrieval Flow

        +-------------------+
        |   User Query      |
        +---------+---------+
                  |
  Query Processing (Rewrite / Clarify) 
                  |
              Retrieval
                  |
              Reranking
                  |
      Context Assembly (Context Window)
                  |
            Prompt Building
                  | 
        LLM Reasoning & Tool Calls
                  |
            Final User Response
            


🎯 Purpose and Scope

The RAG Pipeline Chat system is designed to help organizations convert their internal knowledge—such as technical documentation, operational manuals, process descriptions, wikis, and PDF repositories—into an interactive AI‑powered conversational interface. By ingesting large heterogeneous content sets and indexing them with semantic search, the system enables employees, customers, or support agents to query their knowledge base through grounded, auditable, context‑aware chat interactions.

This document is intended for system architects, AI engineers, and collaborators who want to understand the system’s architecture, including ingestion, embedding, retrieval, chat orchestration, and real-time streaming.

Note: While this repository includes example tools and sample datasets, it is designed as a general‑purpose reference architecture rather than a domain‑specific product. Teams are expected to integrate their own content sources, internal tools, APIs, and policies to adapt the platform to their specific workflows and requirements.

What This Is Not

This repository is not a turnkey enterprise product or a drop‑in replacement for organization‑specific knowledge platforms. It provides a modular, extensible reference architecture that teams can adapt, extend, and integrate with their own tools, data sources, workflows, and compliance requirements.

🧩 System Overview

The RAG Pipeline Chat application integrates document ingestion, vector indexing, semantic retrieval, and Large Language Model (LLM)-based reasoning into a unified end-to-end architecture. It is designed to support context-grounded chat interactions over heterogeneous content sets.

Together, these components form a modular, scalable architecture that supports reliable RAG‑augmented conversational experiences.

🧭 Architecture at a Glance

The system is composed of two independent but connected pipelines:

Both pipelines share the same configuration layer, operate against the active Qdrant collection, and rely on the llm-adapter abstraction (PyPI: https://pypi.org/project/vrraj-llm-adapter/) to keep model/provider integration decoupled from core pipeline logic.

🚀 Runtime & Deployment Model

For local development, the RAG Pipeline Chat application runs as a small containerized stack managed by Docker Compose:

The docker-compose.yml file in the repository root defines these services, their ports, and the shared storage volume (qdrant_storage/). Starting the system with Docker Compose launches both services and ensures the backend is automatically connected to Qdrant through configuration settings.

In production environments, teams commonly:

The ingestion, retrieval, and chat pipelines remain fully decoupled from where Qdrant is hosted; only configuration variables need to be adjusted.

📥 Ingestion Pipeline

The ingestion pipeline is responsible for converting raw documents (HTML, MediaWiki, PDF, etc.) into structured vector entries stored in Qdrant. This section outlines the high-level flow and major components without going into low-level implementation details.

✅ 1. Goals

📦 2a. Batch Ingestion

The Ingestion Pipeline provides a batch ingestion mode driven by a JSON specification. A single batch file can describe a heterogeneous set of sources—local files and remote URLs—that are processed in one run.

Each batch definition contains:

When executed, the batch runner orchestrates extraction, chunking, and (optionally) embedding for each item in sequence, emitting per‑document statistics (chunk counts, token usage, and estimated embedding cost) as well as a final aggregate summary for the batch. This enables teams to quickly onboard corporate PDF repositories, internal wiki pages, or mixed documentation sets into a single Qdrant collection through a repeatable, scriptable workflow.
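The per-item orchestration and aggregate accounting described above can be sketched as follows. This is an illustrative sketch only; the function and field names (run_batch, process_item, estimated_cost_usd) are assumptions, not the repository's actual API.

```python
def run_batch(items, process_item):
    """Process each batch item in sequence and aggregate summary stats.

    `process_item` stands in for the extraction -> chunking -> (optional)
    embedding steps and is assumed to return per-document statistics.
    """
    per_doc = []
    totals = {"chunks": 0, "tokens": 0, "estimated_cost_usd": 0.0}
    for item in items:
        stats = process_item(item)
        per_doc.append({"url": item["url"], **stats})
        totals["chunks"] += stats["chunks"]
        totals["tokens"] += stats["tokens"]
        totals["estimated_cost_usd"] += stats["estimated_cost_usd"]
    # Per-document statistics plus a final aggregate summary for the batch
    return {"documents": per_doc, "summary": totals}
```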

Note: The Ingestion Pipeline supports an estimation mode that runs extraction and chunking steps without invoking the embedding model or writing to the index. This mode is useful for:

Estimation mode can be triggered via CLI flags or configuration settings.

Directory Structure for Local Files

When using local file paths in batch processing, the backend expects the following structure:

chat-with-rag/
├── data/
│   └── pdf-files-for-upload/  # Recommended directory for PDFs
│       ├── document1.pdf
│       ├── document2.pdf
│       └── document3.pdf

Path Handling Guidelines

Batch Processing Features

Embedding Provider Limits

When configuring chunk sizes and batch processing, be aware of provider-specific limits:

| Feature | OpenAI (text-embedding-3-small/large) | Gemini (gemini-embedding-001) |
| --- | --- | --- |
| Max Inputs per Request | 2,048 texts | 250 texts |
| Max Tokens per Request | Variable (often restricted by Tier) | 20,000 tokens |
| Max Tokens per Text | 8,191 tokens | 2,048 (or 8,000 on newer models) |
| Truncation Behavior | Manual (must be handled by user) | Silent (automatic) by default |
| Batch API Support | Yes (up to 50,000 requests/file) | No (synchronous only via API) |

Note: These limits affect how you should configure chunk_size and embedding_batch_size in backend/core/config.py. Always check current provider documentation for the latest limits.

Embedding Batch Indexing

To reduce latency and API overhead, the ingestion pipeline batches multiple chunks into a single embeddings call wherever possible:

Batch size is provider-aware and configurable in backend/core/config.py:

embedding_batch_size_default: int = 25
embedding_batch_size_openai: int = 25
embedding_batch_size_gemini: int = 25

The effective behavior is roughly:
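A minimal sketch of this provider-aware selection, mirroring the config fields above; the helper name batch_chunks and the lookup structure are illustrative assumptions, not the repository's implementation.

```python
# Illustrative mapping of the config fields shown above
EMBEDDING_BATCH_SIZES = {
    "openai": 25,   # embedding_batch_size_openai
    "gemini": 25,   # embedding_batch_size_gemini
}
DEFAULT_BATCH_SIZE = 25  # embedding_batch_size_default

def batch_chunks(chunks, provider):
    """Yield chunk batches sized for the given embedding provider."""
    size = EMBEDDING_BATCH_SIZES.get(provider, DEFAULT_BATCH_SIZE)
    for i in range(0, len(chunks), size):
        yield chunks[i:i + size]
```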

Example Batch Configuration

{
  "items": [
    {
      "url": "file:///app/data/pdf-files-for-upload/document1.pdf",
      "doc_type": "pdf",
      "skip_sections": ["References", "External links"]
    },
    {
      "url": "https://en.wikipedia.org/wiki/Example",
      "doc_type": "mediawiki"
    }
  ],
  "max_chunks": 100,
  "estimate": true,
  "force_delete": false
}

Best Practices

  1. Place all PDFs in the data/pdf-files-for-upload directory
  2. Use relative paths when possible for better portability
  3. Start with "estimate": true to preview processing before actual ingestion
  4. Check the web interface’s “View Documents” page to verify successful ingestion

Note: Changing the embedding model requires re-embedding and rebuilding the vector index. See Re-embedding Workflow for the recommended re-ingestion process.

🔄 3. High-Level Flow

At a high level, the Ingestion Pipeline follows this sequence:

  1. Content Source Selection – Identify which documents or URLs should be ingested.
  2. Extraction – Use specialized extractors to pull clean text and structure from each source type.
  3. Chunking & Metadata Construction – Split documents into logical chunks and attach metadata.
  4. Embedding – Convert chunks into vector embeddings using the configured embedding model.
  5. Index Storage (Qdrant) – Upsert embeddings and metadata into the Qdrant collection.

Each of these stages is implemented as a separate component so that they can evolve independently.
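The five-stage sequence can be sketched as a composition of independent callables, one per stage. This is a conceptual sketch under assumed names (ingest, extract, chunk, embed, upsert), not the repository's actual interfaces.

```python
def ingest(source, extract, chunk, embed, upsert):
    """Run one document through the high-level ingestion sequence."""
    text = extract(source)                        # 2. Extraction
    chunks = chunk(text)                          # 3. Chunking & metadata
    vectors = [embed(c["text"]) for c in chunks]  # 4. Embedding
    upsert(zip(vectors, chunks))                  # 5. Index storage (Qdrant upsert)
    return len(chunks)
```

Because each stage is passed in rather than hard-coded, any one of them can evolve independently, which is the design intent stated above.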

🧭 4. Content Source Selection

🧹 5. Extraction

🧬 5.1 Source-Specific Extraction Behavior

Although all extractors normalize into the same internal representation, each source type has additional behavior to preserve as much context as possible:

🧱 5.2 Tables and Structured Data (High-Level)

Tabular content is common in knowledge bases (inventory lists, spec sheets, comparison tables, wiki lists). The ingestion pipeline supports table-aware extraction across source types, but the quality of results depends on the underlying document being structured:

As a result, table-aware ingestion is additive and can be enabled without changing the baseline prose indexing, but it benefits significantly from sources that follow standard formatting conventions.

✂️ 6. Chunking & Metadata

🧲 7. Embedding

Key Features

Token Estimation

Provider-Specific Configuration

Response Metadata

Retry Logic

Core Methods

🗄️ 8. Index Storage (Qdrant)

🧰 Collection Management

The system uses domain-based collection management where each domain is automatically linked to a specific collection and embedding model configuration. This ensures that collections are always paired with the correct embedding model and vector dimensions.

Domain-Based Configuration

The collection name and embedding model are computed dynamically based on the active domain:

# In backend/core/config.py
DOMAIN_EMBEDDING_CONFIG = {
    "default": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small"
    },
    "mountains": {
        "collection_name": "document_index", 
        "embedding_model_key": "openai:embed_small"
    },
    "oceans": {
        "collection_name": "document_index_gemini",
        "embedding_model_key": "gemini:native-embed"
    }
}

# Single change point for domain selection
active_domain: str = "oceans"

Computed Properties

The system uses computed properties to automatically resolve collection and model configuration:

@property
def collection_name(self) -> str:
    """Collection name from active domain configuration"""
    return self.DOMAIN_EMBEDDING_CONFIG[self.active_domain]["collection_name"]

@property  
def embedding_model_key(self) -> str:
    """Embedding model key from active domain configuration"""
    return self.DOMAIN_EMBEDDING_CONFIG[self.active_domain]["embedding_model_key"]

@property
def vector_size(self) -> int:
    """Vector size from embedding_model_key registry capabilities"""
    from llm_adapter import get_model_info
    model_info = get_model_info(self.embedding_model_key)
    return int(model_info.capabilities["dimensions"])

Benefits

Usage Examples
# Switch to oceans domain (Gemini embeddings)
settings.active_domain = "oceans"
# → collection_name = "document_index_gemini"
# → embedding_model_key = "gemini:native-embed" 
# → vector_size = 1536

# Switch to mountains domain (OpenAI embeddings)  
settings.active_domain = "mountains"
# → collection_name = "document_index"
# → embedding_model_key = "openai:embed_small"
# → vector_size = 1536

Collection Creation and Management

When a new domain is used for the first time:

  1. Automatic Creation: Qdrant automatically creates the collection on first write
  2. Correct Dimensions: Uses the vector dimensions from the domain’s embedding model
  3. Consistent Schema: Maintains the same payload schema across all collections
  4. Provider Compatibility: Ensures embedding model and collection dimensions match

This mechanism enables:

All ingestion pipelines (HTML, MediaWiki, PDF, batch) and all retrieval flows always operate against the currently configured domain’s collection.

Collection Management Options

Option A: Create Fresh Collections (Recommended)
Each domain automatically gets its own collection when first used. No manual setup required.

Option B: Clear Existing Collection
Use this if you want to completely clear a collection but keep using the same name.

[!WARNING] This action will permanently delete the collection and all vectors within it. This cannot be undone.

# Activate your environment
source .venv/bin/activate
python scripts/qdrant_scripts/qdrant_ops.py --delete-collection $(python -c "from backend.core.config import settings; print(settings.collection_name)")

🌱 Seed Data and Demo Collection

For local development and exploration, the repository includes a small demo dataset that can be ingested into the default Qdrant collection (document_index) using the standard Makefile targets or ingestion scripts.

In production or enterprise deployments, teams typically ingest their own internal documentation repositories and may disable or replace the demo dataset entirely.

🧹 9. Re-indexing and Maintenance

🧮 Embedding Flow

The embedding flow transforms text chunks produced by the chunking stage into high‑dimensional vector embeddings suitable for semantic retrieval. The embedding layer is provider‑extensible at the code level, but changing the embedding model requires a full re‑embedding and re‑indexing of the corpus, as vectors produced by different models are not directly comparable. The system is designed to be efficient and fully metadata‑preserving.

Note (Re-embedding workflow): If you want to experiment with a different OpenAI embedding model (e.g., text-embedding-3-large), export your existing document URLs, update the embedding model in backend/core/config.py, and then re-ingest the same URLs using the batch ingestion mode. A JSON file ready for batch import can be exported directly from the UI via List Documents → Download JSON.

1. Input and Output

This strict separation ensures that metadata flows through the system unchanged.

2. Model Abstraction Layer

The embedding component wraps the embedding model behind a dedicated interface. This allows:

Note: At this time, stage-level model selection is limited to OpenAI models (embedding, rerank, and inference). The abstraction keeps the ingestion pipeline decoupled from model implementation details and leaves room for additional providers later.

3. Token and Cost Estimation

During estimation mode, the embedding stage computes:

The system performs these calculations without generating any vectors, allowing users to preview ingestion costs before incurring any charges from the embedding model.
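A rough sketch of this estimation-mode accounting. The four-characters-per-token heuristic and the price constant are illustrative assumptions; the actual pipeline uses the provider's tokenizer and current pricing.

```python
PRICE_PER_1K_TOKENS = 0.00002  # example rate; check provider pricing

def estimate_embedding_cost(chunks):
    """Estimate token count and embedding cost without generating vectors.

    Uses a crude ~4 characters-per-token heuristic as a stand-in for a
    real tokenizer.
    """
    tokens = sum(max(1, len(c) // 4) for c in chunks)
    return {
        "chunks": len(chunks),
        "tokens": tokens,
        "estimated_cost_usd": tokens / 1000 * PRICE_PER_1K_TOKENS,
    }
```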

4. Batching and Throughput

The embedding flow processes chunks in batches to improve performance and reduce API overhead. Batching ensures:

The pipeline maintains ordering so metadata remains aligned with each embedding.

5. Metadata Preservation

All metadata generated in earlier stages (e.g., section hierarchy, source URL, chunk ID) is preserved verbatim during embedding. This enables:

Metadata (the Qdrant payload) is extensible and can be configured in the embedding component.

6. Error Handling and Logging

The embedding layer includes:

These controls enhance reliability during large‑scale ingestion.


🔄 Re-embedding Workflow

When you need to change embedding models (e.g., switching from OpenAI to Gemini, or upgrading to a larger model), you must re-embed and rebuild the vector index because vectors produced by different models are not directly comparable.

When Re-embedding is Required

Step 1: Export Existing Document URLs

  1. Navigate to List Documents in the web interface
  2. Use the Download JSON option to export all document URLs and metadata
  3. This creates a batch ingestion file with your existing document sources

Step 2: Update Embedding Configuration

  1. Edit backend/core/config.py
  2. Update the embedding model:
    # Example: Switch to Gemini embeddings
    embedding_model = "gemini:embed"  # Change from "openai:embed_small"
    

Step 3: Clear Existing Collection (Optional)

If you want a completely fresh start:

# Activate your environment
source .venv/bin/activate
python scripts/qdrant_scripts/qdrant_ops.py --delete-collection $(python -c "from backend.core.config import settings; print(settings.collection_name)")

Step 4: Re-ingest with New Model

  1. Use the exported JSON file from Step 1
  2. Run batch ingestion with the new embedding model:
    • Via UI: Upload the JSON file to Process Batch Documents
    • Via API: Use the /process-batch endpoint with the JSON payload
  3. Start with "estimate": true to preview costs and processing behavior
  4. Set "estimate": false for actual ingestion

Step 5: Verify Results

  1. Check the View Documents page to confirm successful ingestion
  2. Test a few queries to ensure retrieval works with new embeddings
  3. Monitor token usage and costs in the application logs

Best Practices

Example Batch Configuration for Re-embedding

{
  "items": [
    {
      "url": "https://en.wikipedia.org/wiki/Mount_Everest",
      "doc_type": "mediawiki"
    },
    {
      "url": "file:///app/data/mountains/k2.pdf",
      "doc_type": "pdf",
      "skip_sections": ["References", "External links"]
    }
  ],
  "max_chunks": 100,
  "estimate": true,
  "force_delete": false
}

Note: The re-embedding process uses the same ingestion pipeline as initial ingestion, ensuring consistent chunking, metadata, and processing across all documents.


Chat Orchestration

This section describes how user queries flow through the multi-stage chat pipeline, including retrieval, reranking, tool calls, and final LLM response construction, and outlines the orchestration sequence and the role of intermediate stages.

1. Query Reception and Validation

2. Key Stages

At a high level, the chat orchestration pipeline processes queries as follows:

  1. Query Reception – Accept the user query from the frontend.
  2. Query Rewrite (optional) – Use heuristics and an LLM to improve query quality for better retrieval; may trigger a clarification request if the query is ambiguous.
  3. Query Embedding – Generate embedding for the (potentially rewritten) query.
  4. Vector Retrieval – Perform semantic search on Qdrant with configurable top-k and score thresholds.
  5. Reranking (conditional) – Apply intelligent decision policy to determine if expensive LLM reranking is needed based on result quality.
  6. Web Context (optional) – Augment with web search results if enabled.
  7. Context Assembly – Combine retrieved chunks, conversation history (chunked + summarized), and web context using token-aware management.
  8. LLM Inference – Generate response using configurable models and prompt registry.
  9. Tool Execution (optional) – Execute tool calls if the LLM requests them and tools are enabled.
  10. Response Synthesis – Merge tool outputs and LLM response into final answer.
  11. Post-processing (optional) – Convert Markdown to HTML if enabled.
  12. SSE Stage Emission – Stream intermediate results to the frontend throughout the pipeline.
  13. Response Delivery – Send final answer with sources, metrics, and metadata.
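The sequence above can be condensed into a small orchestration sketch. Every callable here is a stand-in for a real pipeline stage, and the names (answer, stages, emit) are illustrative, not the repository's API; the emit callback mirrors the SSE stage emission described in step 12.

```python
def answer(query, history, stages, emit):
    """Condensed chat orchestration: rewrite -> retrieve -> rerank -> infer."""
    emit("rewrite")
    q = stages["rewrite"](query)                   # 2. optional rewrite
    emit("retrieve")
    hits = stages["retrieve"](stages["embed"](q))  # 3-4. embed + vector search
    emit("rerank")
    hits = stages["rerank"](q, hits)               # 5. conditional rerank
    emit("infer")
    context = stages["assemble"](hits, history)    # 7. context assembly
    return stages["infer"](q, context)             # 8. LLM inference
```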

3. Query Rewrite (Optional)

The system uses an intelligent query rewrite mechanism that:

Rewrite Decision Logic

Configuration Parameters

4. Vector Retrieval

Performs semantic search using the (potentially rewritten) query:

Key Parameters

5. Intelligent Reranking (Conditional)

Applies a sophisticated decision policy to avoid expensive LLM reranking when unnecessary:

Rerank Decision Policy

The system skips reranking when:

  1. ≤ 1 candidate - Nothing to rerank
  2. Fewer than re_ranker_input_rows (default 5) - Insufficient candidates
  3. Exact match found - High-confidence exact match in top 5 (score ≥ 0.80)
  4. Clear winner detected - Top result meets both criteria:
    • Score ≥ rerank_clear_winner_min_top1 (default 0.65)
    • Margin over 5th result ≥ rerank_clear_winner_min_delta (default 0.15)
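The skip conditions above can be expressed as a small decision function using the documented defaults. The function name and the boolean has_exact_match parameter are illustrative simplifications; scores are assumed to be retrieval scores sorted descending.

```python
RE_RANKER_INPUT_ROWS = 5
EXACT_MATCH_MIN = 0.80          # high-confidence exact match in top 5
CLEAR_WINNER_MIN_TOP1 = 0.65    # rerank_clear_winner_min_top1
CLEAR_WINNER_MIN_DELTA = 0.15   # rerank_clear_winner_min_delta

def should_skip_rerank(scores, has_exact_match=False):
    """Return True when expensive LLM reranking can be skipped."""
    if len(scores) <= 1:
        return True  # nothing to rerank
    if len(scores) < RE_RANKER_INPUT_ROWS:
        return True  # insufficient candidates
    if has_exact_match and scores[0] >= EXACT_MATCH_MIN:
        return True  # high-confidence exact match
    if (scores[0] >= CLEAR_WINNER_MIN_TOP1
            and scores[0] - scores[4] >= CLEAR_WINNER_MIN_DELTA):
        return True  # clear winner over the 5th result
    return False
```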

When Reranking Occurs

Configuration

6. Context Assembly & History Management

Builds the final inference context using multiple sources with token-aware management:

Conversation History Strategy

Uses a chunked history approach for efficient multi-turn conversations:

Context Components

  1. System Instructions - From prompt registry based on domain
  2. Conversation Summary - Accumulated summary of older turns
  3. Recent Conversation - Verbatim turns from current chunk
  4. Retrieved Documents - Reranked search results
  5. Web Context - Optional web search results
  6. User Query - Original (or rewritten) user message

Token Budget Management

Summarization and Token Budgeting

When conversation history exceeds the configured window size, older turns are processed through a summarization stage to maintain context while staying within token limits:

The summarizer processes approximately 4-8 conversation turns (depending on message length) and generates condensed summaries that preserve key context while dramatically reducing token usage. This enables long-running conversations without exceeding model context windows or incurring excessive costs.

5.1 Prompt Registry (YAML)

The chat pipeline uses a YAML-based prompt registry to centralize prompt text and templates.

5.1.1 Registry schema

The registry is structured as:

For each stage, the resolver starts with global_defaults.<stage> and then appends a domain overlay only if it exists at domains.<domain>.<stage>. If a domain is selected but a stage does not define that domain overlay, the stage falls back to the global_defaults behavior.
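The resolution rule can be sketched as below, assuming the registry has been loaded from YAML into a plain dict; the function name and the append-with-newline behavior are illustrative assumptions.

```python
def resolve_prompt(registry, stage, domain):
    """Start from global_defaults.<stage>, append domain overlay if present."""
    base = registry["global_defaults"].get(stage, "")
    overlay = registry.get("domains", {}).get(domain, {}).get(stage)
    # Missing overlay -> fall back to the global_defaults behavior
    return f"{base}\n{overlay}" if overlay else base
```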

5.1.2 Prompt domains (params.prompt_domain)

Each chat request can include params.prompt_domain (set by the frontend UI). The backend resolves the domain for the turn using:

  1. params.prompt_domain (if provided)
  2. settings.prompt_domain_default (fallback)

This same domain value is applied consistently across stages that consult the registry.

5.1.3 Stage coverage

Prompt registry coverage is stage-specific:

5.1.4 Context Injection via Jinja Templates

The prompt registry uses Jinja2 templating to inject conversation history and RAG context into user prompts. This ensures:

Template variables by stage

The registry uses Jinja templates. The orchestrator supplies variables per stage:

Summary does not currently use a templated payload; its registry entry provides the fixed instruction string.

5.1.5 Debug logging

By default, the registry logs which domain was resolved and a short tail snippet of the resolved system_instruction. To log the full resolved prompt text and templates, set:

This is intentionally opt-in because it can log sensitive prompt content and can produce large logs.

7. LLM Inference & Tool Execution

Inference Stage

Tool Execution (Optional)

When enabled and the LLM requests tools:

Configuration

8. Post-Processing & Response Delivery

Markdown to HTML (Optional)

Final Response

Includes:

9. Web Search Context (DuckDuckGo Instant Answer)

The chat orchestration pipeline supports optional web context augmentation backed by the DuckDuckGo Instant Answer API (https://api.duckduckgo.com).

9.1 Two web-search paths

There are two ways web search can affect the final answer:

  1. Automatic Web Context (web_context)
    • Triggered when use_web_search is enabled for the request (request-level flag overrides settings.use_web_search).
    • Executed during the pipeline stage Establish Web Context.
    • Results are injected into the inference prompt as a dedicated user message block:
      • WEB SEARCH RESULTS:
  2. Tool-call Web Search (web_search tool)
    • Triggered when tools are enabled and the LLM calls the web_search tool.
    • Tool results are injected into the synthesis step as a labeled user message:
      • [SOURCE: TOOL - web_search].

Both paths currently share the same extraction logic via backend/chat/web_search.py.

9.2 What is extracted (normalized schema)

DuckDuckGo returns a JSON object that may include an abstract (often from Wikipedia), plus optional result lists. WebSearchClient.search() normalizes the response into a list of items of the form:

Extraction order:

  1. Abstract (preferred when present)
    • Requires both AbstractURL and AbstractText to be present.
  2. Results list (often empty in Instant Answer responses)
    • Iterates data["Results"] entries and maps Text + FirstURL.
  3. RelatedTopics (currently disabled)
    • The Instant Answer payload often returns many RelatedTopics entries which are typically DuckDuckGo topic/category links.
    • This repository currently disables adding RelatedTopics into web_context to keep web augmentation focused and low-noise.
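Applying this extraction order to a raw Instant Answer payload might look like the sketch below. The output item shape ({"title", "url"}) is an assumption; the DuckDuckGo field names (AbstractText, AbstractURL, Results, Text, FirstURL) come from the description above.

```python
def normalize_instant_answer(data):
    """Normalize a DuckDuckGo Instant Answer payload into a flat item list."""
    items = []
    # 1. Abstract (preferred; requires both AbstractURL and AbstractText)
    if data.get("AbstractURL") and data.get("AbstractText"):
        items.append({"title": data["AbstractText"], "url": data["AbstractURL"]})
    # 2. Results list (often empty in Instant Answer responses)
    for entry in data.get("Results", []):
        if entry.get("Text") and entry.get("FirstURL"):
            items.append({"title": entry["Text"], "url": entry["FirstURL"]})
    # 3. RelatedTopics are deliberately skipped (disabled in this repository)
    return items
```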

9.3 Deduplication and caps

After extraction:

9.4 Prompt injection and citations

When web_context is enabled, the inference prompt includes a block labeled:

The model is instructed to cite web-derived facts as:

The final Sources: section can include corresponding web URLs, and (when “used sources” filtering is enabled) only cited web indices are displayed.

10. Postprocessing (Markdown → HTML)

After inference, the system optionally postprocesses the assistant’s text to render rich Markdown content in the chat UI. This stage is additive and controlled by a feature flag.

10.1 Backend rendering (backend/markdown_render.py)

10.2 Frontend rendering (frontend/static/chat.js)

10.3 Integration points

10.4 Benefits


🔄 Session-Based (Stateful) Chat

The system supports both stateless and stateful chat modes. While the frontend uses stateless chat (client-managed history), the session-based API provides server-side conversation state management.

1. Stateless vs Stateful Comparison

| Aspect | Stateless (/chat) | Stateful (/chat/{session_id}) |
| --- | --- | --- |
| History Management | Client sends full history in each request | Server maintains history in session storage |
| State Management | No server state | Server-side session state |
| Use Case | Frontend web apps, simple integrations | Backend integrations, mobile apps, multi-device scenarios |
| Pipeline | Identical RAG pipeline | Identical RAG pipeline |
| Quality | Same retrieval, rewrite, inference quality | Same retrieval, rewrite, inference quality |

2. Session-Based Chat Flow

Step 1: Create Session

curl -X POST http://localhost:8000/chat/session
# Response: {"session_id": "12d8cd79-0ee8-4dcd-97a5-5983effcbccd"}

Step 2: Send First Message

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "What is Mount Everest?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'

Step 3: Send Follow-up Message (Context Preserved)

curl -X POST http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd \
  -H "Content-Type: application/json" \
  -d '{
    "message": "How tall is it?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500
    }
  }'

Step 4: Check Session History

curl http://localhost:8000/chat/12d8cd79-0ee8-4dcd-97a5-5983effcbccd/history

3. Model Override with Session-Based Chat

Override inference model per request using model_keys:

curl -X POST http://localhost:8000/chat/fd91c243-1f0f-441a-8ce9-635377ba54a5 \
  -H "Content-Type: application/json" \
  -d '{
    "message": "what is the elevation difference with kilimanjaro?",
    "history": [],
    "params": {
      "top_k": 5,
      "temperature": 0.7,
      "max_output_tokens": 500,
      "model_keys": {
        "inference": "gemini:gemini-2.5-flash"
      }
    }
  }'

4. Stage-Specific Model Overrides

Override specific pipeline stages:

{
  "params": {
    "model_keys": {
      "inference": "gemini:gemini-2.5-flash",
      "rewrite": "openai:gpt-4o-mini",
      "rerank": "openai:gpt-4o-mini",
      "summary": "openai:gpt-4o-mini"
    }
  }
}

5. Session Context Management

The session manager automatically:

Context Building Logic

# From ChatSessionManager.get_context()
for msg in reversed(messages):
    msg_tokens = len(msg["content"].split())
    if total_tokens + msg_tokens <= max_history_tokens:
        context.append(msg)
    else:
        break

6. Pipeline Consistency

Both stateless and stateful paths use the identical RAG pipeline:

The only difference is the history source:

7. Use Case Recommendations

| Scenario | Recommended Approach |
| --- | --- |
| Web frontend | Stateless (simpler, client-managed) |
| Mobile apps | Stateful (server-side persistence) |
| Backend integrations | Stateful (automatic context management) |
| Multi-device scenarios | Stateful (shared conversation state) |
| Simple API calls | Stateless (no session setup needed) |

🔢 Token Accounting & Namespace Management

1. Namespace Patterns

The system uses different namespace patterns for stateless vs session-based chat to ensure proper token accounting isolation.

Stateless Namespaces

# From handle_chat()
user_id = params.get("user_id", "")
conversation_id = params.get("conversation_id", "")
namespace = f"{user_id}:{conversation_id}" if user_id and conversation_id else conversation_id

Session-Based Namespaces

# From chat_manager.chat()
session_id = params.get("session_id", "")
namespace = f"session:{session_id}" if session_id else ""

2. Token Accounting Isolation

| Approach | Namespace Format | Isolation Level | Example |
| --- | --- | --- | --- |
| Stateless | user123:conv456 | Per conversation | Each conversation tracked separately |
| Session-Based | session:abc123... | Per session | Each session tracked separately |

3. Cache & Resource Management

Namespaces are used for:

4. Implementation Details

Token Accumulation

# Per-namespace token tracking
_CONVO_TOTALS_BY_NS[namespace] = {
    "tokens": {"embedding": 0, "rewrite": 0, "rerank": 0, "inference": 0},
    "cost": {"embedding": 0.0, "rewrite": 0.0, "rerank": 0.0, "inference": 0.0},
    "messages": 0
}

Cache Key Management

# Namespace-aware cache keys
cache_key = f"{namespace}|rewrite|{hash}" if namespace else f"rewrite|{hash}"
_SUMMARY_NS_INDEX[namespace].add(cache_key)

5. Benefits of Namespace Isolation


🔎 Retrieval and Ranking

The retrieval and ranking subsystem identifies the most semantically relevant document chunks to support the LLM’s answer generation. It operates in two phases: vector retrieval and optional LLM-based reranking.

1. Query Embedding

A single embedding is generated for the (potentially rewritten) user query and is used to retrieve the most relevant document chunks from Qdrant.

2. Vector Search (Qdrant)

The system executes a similarity search with:

Tuning note (Top‑K vs cost): Retrieval quality is sensitive to the top_k candidate set. For noisy datasets or ambiguous queries, increasing top_k can improve recall, but it may increase downstream reranking cost (when enabled) and can place additional pressure on the inference context budget. This trade‑off is intentional and configurable.

Each Qdrant result includes:

3. Filtering

Before reranking, results may be filtered by (not implemented yet):

4. Heuristic Reranking

The system applies a lightweight heuristic layer to improve relevance:

5. LLM Reranking (optional based on retrieved context)

For ambiguous retrieval sets, the query and top candidates are passed to a rerank model. This produces refined relevance scores and a reduced top-K set.

6. Final Selection and Context Packaging

The number of retrieved chunks included in the inference prompt is bounded by a configurable inference input limit (i.e., how many context rows/chunks are allowed to be sent to the model), with retrieval and optional reranking providing the candidate set. The inference prompt is then assembled from retrieved chunks, raw tail turns, summarized history, and tool outputs (when applicable) to build the final context for LLM inference.

📡 SSE Streaming

The SSE (Server-Sent Events) subsystem delivers real-time streaming updates from the backend to the browser. It enables the UI to reflect pipeline progress and LLM output incrementally.

1. Endpoint Structure

Each chat request receives a unique stream_id. The frontend connects to:

/stream/{stream_id}

using an EventSource client.
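A per-request stream registry is one simple way to back such an endpoint: each `stream_id` maps to its own async queue that the orchestrator writes to and the SSE endpoint drains. The sketch below is a hypothetical illustration of that pattern, not the system's actual implementation:

```python
import asyncio
import uuid

# Hypothetical registry: one queue of pending events per active stream_id.
STREAMS: dict[str, asyncio.Queue] = {}

def open_stream() -> str:
    """Allocate a short unique stream_id and its event queue."""
    stream_id = uuid.uuid4().hex[:8]
    STREAMS[stream_id] = asyncio.Queue()
    return stream_id

async def demo():
    sid = open_stream()
    await STREAMS[sid].put({"stage": "retrieval"})   # producer side
    return sid, await STREAMS[sid].get()             # consumer side

sid, first_event = asyncio.run(demo())
```

In the real endpoint the consumer side would be the generator feeding the `/stream/{stream_id}` response.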

2. Event Format

The server emits UTF‑8 encoded events of the form:

```
event: message
data: { ... JSON payload ... }
```

Each message corresponds to a pipeline stage or LLM token.
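Serializing one such frame is a single string template: an `event:` line, a `data:` line carrying JSON, and a blank line terminating the frame. A minimal sketch (the helper name is illustrative):

```python
import json

def sse_frame(event, payload):
    """Serialize one SSE frame: event name, JSON data line, blank terminator."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"

frame = sse_frame("message", {"stage": "retrieval", "detail": "searching index"})
```

The trailing blank line matters: `EventSource` clients use it to detect the end of each event.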

3. Stage Emission

The orchestrator uses a shared emit_stage() helper to push structured stage updates. Stages are human-readable and reflect the exact progress in chat orchestration.

4. Token Streaming

During the LLM call, partial tokens are streamed as incremental data: messages. The frontend appends these to the visible response.

5. Disconnect Handling

sse_starlette automatically detects client disconnects. The backend:

🔗 Frontend–Backend Integration

The frontend interacts with the backend through two channels:

  1. REST POST requests for submitting user messages
  2. SSE streams for receiving staged updates and model output

1. Request Lifecycle

2. Handling SSE Messages

The UI:

3. Error Handling

Frontend reacts to:

4. State & History

The UI maintains:

📊 Metrics and Observability

The system tracks detailed metrics across two phases: ingestion-time estimation and chat-time execution.

1. Ingestion-Time Estimation Metrics (Embedding Cost Preview)

During ingestion, the system can run in an estimation-only mode that performs extraction and chunking without generating vectors. In this mode, metrics focus on predicting embedding cost before indexing:
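The cost preview boils down to summing an approximate token count over the planned chunks and multiplying by a unit price. A hedged sketch using the common ~4-characters-per-token heuristic (the price below is a placeholder, not a quoted rate):

```python
def estimate_embedding_cost(chunks, price_per_million_tokens=0.02):
    """Rough embedding-cost preview for a list of text chunks.
    Assumes ~4 characters per token; price is a placeholder."""
    tokens = sum(max(1, len(c) // 4) for c in chunks)
    cost = tokens / 1_000_000 * price_per_million_tokens
    return tokens, cost

tokens, cost = estimate_embedding_cost(["x" * 4000] * 10)
```

Because no vectors are generated, this mode is cheap enough to run before committing to a large ingestion job.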

2. Chat-Time Per-Turn Metrics (Runtime Costs)

During chat execution, the UI displays per-turn metrics that break down token usage and cost by stage:

3. Running Conversation Totals

In addition to per-turn metrics, the system maintains running totals across the conversation during chat sessions:
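Maintaining those totals is a simple accumulation over per-turn metrics; a minimal sketch (field names are illustrative):

```python
class ConversationTotals:
    """Accumulate per-turn token and cost metrics into running totals."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.cost_usd = 0.0

    def add_turn(self, prompt_tokens, completion_tokens, cost_usd):
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.cost_usd += cost_usd

totals = ConversationTotals()
totals.add_turn(1200, 300, 0.004)
totals.add_turn(900, 250, 0.003)
```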

4. Logging and Diagnostics

Logs are structured with per-stage prefixes and include:

The centralized logging configuration in backend/core/logging.py also configures:

Configuration and Settings

Configuration is centralized across multiple components, with model-specific configurations moved to the model registry and application settings in backend/core/config.py. The system is designed so most behavior can be tuned through configuration without requiring code changes.

LLM Provider Abstraction

The system integrates model providers through the external llm-adapter package, which acts as the model abstraction layer for both inference and embeddings.

This abstraction centralizes:

As a result, the chat and ingestion pipelines can switch between supported providers without changing their internal control flow.
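The essence of this decoupling is that pipelines depend only on an interface shape, not on any concrete SDK. A minimal sketch using a structural `Protocol` (the method names here are illustrative, not the llm-adapter API):

```python
from typing import Protocol

class ChatProvider(Protocol):
    """Hypothetical provider interface: what a pipeline stage depends on."""
    def complete(self, prompt: str) -> str: ...
    def embed(self, text: str) -> list[float]: ...

class EchoProvider:
    """Toy provider satisfying the protocol, useful for tests."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"
    def embed(self, text: str) -> list[float]:
        return [float(len(text))]

def answer(provider: ChatProvider, question: str) -> str:
    """Pipeline code written against the interface, not a vendor SDK."""
    return provider.complete(question)

reply = answer(EchoProvider(), "hello")
```

Swapping OpenAI for Gemini then means swapping the provider object, with no change to `answer`'s control flow.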

Model Registry Architecture

The Model Registry is provided by the llm-adapter package and serves as the single source of truth for all model configurations:

Stage-Specific Configuration

Models for each pipeline stage are defined in stage_specs and can be overridden at runtime:

Domain-Based Collection Management

The system uses domain-based configuration to automatically link collections with compatible embedding models:

```python
# In backend/core/config.py
DOMAIN_EMBEDDING_CONFIG = {
    "default": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small"
    },
    "mountains": {
        "collection_name": "document_index",
        "embedding_model_key": "openai:embed_small"
    },
    "oceans": {
        "collection_name": "document_index_gemini",
        "embedding_model_key": "gemini:native-embed"
    }
}

# Single change point for domain selection
active_domain: str = "mountains"  # Default
```

Computed Properties

Collection and model configuration are resolved automatically:

```python
@property
def collection_name(self) -> str:
    return self.DOMAIN_EMBEDDING_CONFIG[self.active_domain]["collection_name"]

@property
def embedding_model_key(self) -> str:
    return self.DOMAIN_EMBEDDING_CONFIG[self.active_domain]["embedding_model_key"]

@property
def vector_size(self) -> int:
    from backend.llm.llm_client import get_model_info
    model_info = get_model_info(model_key=self.embedding_model_key)
    return int(model_info.capabilities.get("dimensions"))
```

Benefits

Key Configuration Categories

Model Selection (Per Stage)

Models are configured via stage-specific keys and can be overridden at runtime:

Retrieval & Ranking

Conversation Management

Inference & Generation

Content Processing

Features & UI

Runtime Overrides

All settings can be overridden per-request through the params object in the /chat API. The UI exposes these controls in the sidebar, allowing users to:

Configuration values are loaded at application startup and can be overridden via environment variables for different deployment environments.
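The startup-time load with environment overrides can be sketched as follows (the field and variable names below are illustrative, not the system's real config surface, which uses backend/core/config.py):

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    """Sketch of settings resolved once at application startup."""
    active_domain: str = "mountains"
    top_k: int = 10

def load_settings() -> Settings:
    """Defaults apply unless an environment variable overrides them."""
    return Settings(
        active_domain=os.getenv("ACTIVE_DOMAIN", "mountains"),
        top_k=int(os.getenv("RETRIEVAL_TOP_K", "10")),
    )

os.environ["RETRIEVAL_TOP_K"] = "25"   # e.g. set in a deployment environment
settings = load_settings()
```

Per-request `params` overrides then layer on top of these startup values without mutating them.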

Error Handling and Stability Guarantees

The system employs multiple layers of protection to prevent runaway computation and ensure graceful degradation.

1. Embedding Safeguards

2. Chat Pipeline Safeguards

3. SSE Safeguards

Extensibility and Organization-Specific Customization

Although this project includes working examples—such as weather and nearby‑airports tools, sample batch ingestion, and seed datasets—it is primarily intended as a modular, general‑purpose RAG architecture that organizations can extend to meet their unique operational needs.

Common customization areas include:

Architecture Summary

The RAG Pipeline Chat system is composed of modular, loosely coupled stages:

User Query → Rewrite → Retrieval → Rerank → Context Assembly → LLM Reasoning → Tool Calls → Final Synthesis

Ingestion flows independently:

Source → Extraction → Chunking → Embedding → Qdrant

This separation ensures maintainability, extensibility, and clear reasoning paths throughout the system.

🗂️ Repository Structure (High-Level)

At a high level, the repository is organized into the following areas:

This structure keeps ingestion, retrieval, chat orchestration, and frontend concerns clearly separated while providing dedicated spaces for scripts, tools, and operational data.

🧑‍💻 Developer & Operator Utilities (Makefile)

The Makefile includes specialized targets essential for debugging, maintenance, and system administration, particularly for the Qdrant vector store. These commands simplify operational tasks by abstracting complex Docker commands and API calls.

Application Start/Stop

| Target | Description | Usage |
| :--- | :--- | :--- |
| `make start` | Starts the full Docker Compose stack (webapp + Qdrant). Recommended for general deployment. | `make start` |
| `make start-hybrid` | Starts the Qdrant container, then runs the FastAPI application in a local Python virtual environment (venv). Recommended for local development/debugging. | `make start-hybrid` |
| `make stop` | Stops and removes the full Docker Compose stack. | `make stop` |
| `make rebuild` | Rebuilds and starts containers with the latest code changes after pulling updates. | `make rebuild` |
| `make stop-hybrid` | Stops the web app and the Qdrant container, releasing their resources. | `make stop-hybrid` |

Core Operations

| Target | Description | Usage |
| :--- | :--- | :--- |
| `make seed` | Ingests sample data into the current Qdrant collection. Requires the local venv to be active. | `make seed` |
| `make seed --force` | Ingests sample data without interactive prompts (recommended for scripts). | `make seed --force` |
| `make smoke-api` | Runs an OpenAI API smoke test to verify `OPENAI_API_KEY` authentication and connectivity. | `make smoke-api` |
| `make start-qdrant` | Starts only the Qdrant vector database container in detached mode. | `make start-qdrant` |
| `make stop-qdrant` | Stops and removes the Qdrant container and its resources. | `make stop-qdrant` |
| `make stop-uvicorn` | Gracefully kills the locally running FastAPI application process (SIGTERM) without affecting Qdrant. | `make stop-uvicorn` |

Qdrant Debugging & Inspection

These targets automatically connect to Qdrant using the configured QDRANT_HOST and QDRANT_PORT settings, falling back to localhost:6333 if not specified.

| Target | Description | Usage Example |
| :--- | :--- | :--- |
| `make qdrant-collections` | Lists all collections currently stored in Qdrant. | `make qdrant-collections` |
| `make qdrant-info` | Shows concise info (status, dimensions, vector count) for a specific collection. | `make qdrant-info COLLECTION=document_index` |
| `make qdrant-indexes` | Shows field indexes (payload schema) for a collection, useful for checking filters. | `make qdrant-indexes COLLECTION=my_data` |
| `make qdrant-logs` | Streams the Qdrant container logs live (`docker compose logs -f qdrant`). | `make qdrant-logs` |

Maintenance & Data Management

| Target | Description |
| :--- | :--- |
| `make qdrant-backup` | Creates a compressed archive (`.tar.gz`) of the local `qdrant_storage/` bind-mount directory. |
| `make my-ip` | Retrieves the current machine's local IP address, useful for connecting to the application from other devices on the same network. |

🧱 Qdrant Operations CLI

In addition to the Makefile targets, the repository includes a Python-based Qdrant operations CLI located at scripts/qdrant_scripts/qdrant_ops.py. This utility provides a simple administrative surface over the active collection and is useful for inspection, backup, and safe maintenance.

Supported operations include:

Example invocations:

```bash
# List distinct payload fields
python scripts/qdrant_scripts/qdrant_ops.py list-fields

# List document titles (with an optional limit)
python scripts/qdrant_scripts/qdrant_ops.py list-titles --limit 50

# Count chunks for a specific base URL
python scripts/qdrant_scripts/qdrant_ops.py count-chunks --base-url "https://en.wikipedia.org/wiki/Mont_Blanc"

# Export the active collection to a JSONL file under data/
python scripts/qdrant_scripts/qdrant_ops.py export -f docs-index-export.jsonl

# Safely truncate the active collection (interactive confirmation)
python scripts/qdrant_scripts/qdrant_ops.py truncate

# Inspect vector configuration (dimensions + distance)
python scripts/qdrant_scripts/qdrant_ops.py vector-dims

# Explicitly target a different collection (e.g., Gemini-backed index)
python scripts/qdrant_scripts/qdrant_ops.py --collection document_index_gemini vector-dims
```

The vector-dims command is especially useful when:

Example outputs:

```
Collection: document_index
Named vectors: no
Vector config:
- default: size=1536, distance=Cosine

Collection: my_multi_vector_collection
Named vectors: yes
Vector config:
- content: size=1536, distance=Cosine
- title:   size=384,  distance=Dot
```

This CLI complements the Makefile targets by providing more granular and scriptable control over the Qdrant collection, and it can be extended with additional commands as operational needs evolve.

✅ Automated Quality Checks (CI Workflow)

The repository includes a lightweight Continuous Integration (CI) workflow to provide fast feedback on code health without pulling in the full Docker/Qdrant stack.

This CI workflow is intentionally minimal: it validates that dependencies install and that all Python modules compile successfully, while keeping runs fast and avoiding the need to start Docker, Qdrant, or external services. It serves as a basic quality gate and a foundation that teams can extend with additional tests, type checking, or linting as needed.

🌐 Browser Compatibility: Secure Context Requirement

If you access the application from another machine using an IP address (e.g., http://192.168.1.10:8000), certain browsers — especially Safari 15–16.1 — treat this as a non‑secure context.

Some Web APIs such as crypto.randomUUID() are available only in secure contexts (https:// or http://localhost). When the frontend attempted to generate a query_id using:

crypto.randomUUID().slice(0, 8)

this caused Safari to throw an error on non-secure IP-based pages, leading to symptoms like:

Fix Implemented

Replaced the direct crypto.randomUUID() call with a compatibility-safe fallback:

```javascript
let queryId;
try {
  if (window.crypto && typeof window.crypto.randomUUID === 'function') {
    queryId = window.crypto.randomUUID().slice(0, 8);
  } else if (window.crypto && window.crypto.getRandomValues) {
    const arr = new Uint32Array(2);
    window.crypto.getRandomValues(arr);
    queryId = (arr[0].toString(16) + arr[1].toString(16)).slice(0, 8);
  } else {
    queryId = Math.random().toString(36).slice(2, 10);
  }
} catch (_) {
  queryId = Math.random().toString(36).slice(2, 10);
}
```

This ensures the chat works on:

Recommendation for Production

To avoid similar issues for end-users:

This ensures maximum compatibility of browser APIs.


🧪 API Examples (Advanced)

API ingestion examples:

- MediaWiki: `POST /mediawiki/url`
  - Body: `{ "url": "https://en.wikipedia.org/wiki/...", "max_chunks": 0, "force_delete": true }`
  - Notes: `max_chunks > 0` limits chunks to that number; `0` or omitted means no user limit. A hard cap (`MAX_CHUNKS_PER_DOC`) is always enforced.
  - Optional: `?estimate=true` query param to return the planned chunk count without indexing.
- Generic URLs/PDFs: `POST /index`
  - Body: `{ "urls": ["https://..."], "doc_type": "HTML" | "PDF", "max_chunks": 0, "force_delete": true, "force_crawl": true }`
  - Behavior: standardized on chunk caps; character-based limits are removed.
  - Optional: `?estimate=true` query param to return the planned chunk count without indexing.
- Structured PDF (keeps sections/headings like MediaWiki): single endpoint `POST /pdf` as a multipart form with fields:
  - `file` (UploadFile, optional) or `url` (string, optional)
  - `max_chunks` (int, default 0), `force_delete` (bool, default true)
  - Optional query: `?estimate=true` to return the planned chunk count only

Examples:

```bash
curl -X POST http://localhost:8000/mediawiki/url \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://en.wikipedia.org/wiki/OpenAI","max_chunks":50,"force_delete":true}'

curl -X POST http://localhost:8000/index \
  -H 'Content-Type: application/json' \
  -d '{"urls":["https://openai.com"],"doc_type":"HTML","max_chunks":100,"force_delete":true}'

# Estimate-only examples
curl -X POST 'http://localhost:8000/mediawiki/url?estimate=true' \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://en.wikipedia.org/wiki/OpenAI","max_chunks":0}'

curl -X POST 'http://localhost:8000/index?estimate=true' \
  -H 'Content-Type: application/json' \
  -d '{"urls":["https://openai.com"],"doc_type":"HTML","max_chunks":0}'

# Structured PDF examples
# Upload a local PDF
curl -X POST 'http://localhost:8000/pdf?estimate=false' \
  -F 'file=@/path/to/file.pdf' \
  -F 'max_chunks=100' \
  -F 'force_delete=true'

# Use a PDF URL, estimate only
curl -X POST 'http://localhost:8000/pdf?estimate=true' \
  -F 'url=https://example.com/file.pdf' \
  -F 'max_chunks=0'
```

© 2025 Rajkumar Velliavitil — All Rights Reserved

📜 License & Usage

This project is source-available for personal, educational, and evaluation purposes.
It is permitted to run, modify, and fork the code for non-commercial use.

Redistribution, sublicensing, or commercial use of this project or derivative works requires explicit written permission from the author.