
Memory & Retrieval-Augmented Generation (RAG)

Memory Architecture

Agents usually use two memory layers:

Short-Term Memory (STM)

  • Holds the current reasoning chain and tool outputs (often thousands of tokens)
  • Lives in the LLM context window or a fast in-memory cache
  • Is cleared or archived after task completion

Long-Term Memory (LTM)

  • Persistent knowledge across sessions: user profiles, past tasks, domain facts
  • Implemented with a Vector DB or another knowledge store
  • Promotion rule: move significant facts from STM to LTM after each milestone
if outcome.is_significant or outcome.reusable:
    embed(summary) → store in vector_db with metadata
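The promotion rule above can be sketched as runnable code. `embed` and `VectorStore` here are toy in-memory stand-ins for a real embedding model and vector DB, and `Outcome` is a hypothetical container for a completed STM milestone:

```python
from dataclasses import dataclass, field

def embed(text: str) -> list[float]:
    # Toy embedding for illustration only: averaged character codes per 4-char window.
    return [sum(ord(c) for c in text[i:i + 4]) / 4 for i in range(0, len(text), 4)]

@dataclass
class VectorStore:
    items: list[dict] = field(default_factory=list)

    def upsert(self, vector: list[float], metadata: dict) -> None:
        self.items.append({"vector": vector, "metadata": metadata})

@dataclass
class Outcome:
    summary: str
    is_significant: bool = False
    reusable: bool = False

def promote_to_ltm(outcome: Outcome, store: VectorStore) -> bool:
    """Promotion rule: move significant or reusable facts from STM to LTM."""
    if outcome.is_significant or outcome.reusable:
        store.upsert(embed(outcome.summary),
                     {"summary": outcome.summary, "source": "stm"})
        return True
    return False
```

In a real agent, `embed` would call an embedding model and `VectorStore.upsert` would write to the vector DB with the same metadata shape.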

RAG Pipeline

Phase 1 — Indexing (offline)

Documents → Chunking → Embedding Model → Vector Database
               │              │                  │
          split into     convert each       store vectors
         text chunks    chunk to vector    with metadata

Phase 2 — Query (runtime)

User Query → Embedding Model → Similarity Search (Vector DB)
                                        │
                                   Top-K Chunks
                                        │
                          ┌─── Context Assembly ───┐
                          │  System Prompt          │
                          │  + Retrieved Chunks     │
                          │  + User Query           │
                          └─────────────────────────┘
                                        │
                                  LLM → Answer

4 RAG components:

  1. Embedding model — converts queries and docs into dense vectors
  2. Vector Database — stores vectors and supports ANN search
  3. Retriever logic — query → embed → top-k match → inject context
  4. Freshness strategy — TTL, streaming updates, or live API fallback

Alternative: Vectorless RAG eliminates embeddings entirely — uses LLM reasoning over hierarchical document structure instead.

Retrieval Methods

Method            Principle                       Best When
Dense retrieval   Neural embeddings (semantics)   Conceptual similarity
Sparse retrieval  BM25 / TF-IDF (keywords)        Exact term matching
Hybrid search     Dense + Sparse (RRF fusion)     Best overall recall
Re-ranking        LLM/cross-encoder on top-100    High precision required

RRF formula: score(d) = Σᵢ 1 / (k + rankᵢ(d)), summing over the retrievers being fused; k = 60 is the conventional constant.
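A minimal implementation of the fusion step, assuming each retriever hands back an ordered list of document IDs:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank_i(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # ranked by dense retriever
sparse = ["d1", "d4", "d3"]   # ranked by BM25
rrf_fuse([dense, sparse])     # → ["d1", "d3", "d4", "d2"]
```

Note that d1 wins despite never ranking first in either list: appearing near the top of both lists beats a single first place, which is exactly the recall benefit hybrid search is after.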

Chunking Strategies

Strategy               Size            Overlap      When
Fixed-size window      512 tokens      50–100       Homogeneous text
Semantic segmentation  Varies          —            Structured documents
Sentence-based         1–5 sentences   1 sentence   FAQ, Q&A
Sliding window         256–1024        25%          Long connected text

Rule: chunk_size <= (context_window - system_prompt_tokens - response_budget) / top_k.
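The budget rule is simple integer arithmetic; worked through for a hypothetical 8K-token context window:

```python
def max_chunk_size(context_window: int, system_prompt_tokens: int,
                   response_budget: int, top_k: int) -> int:
    """Upper bound on chunk size so that top_k retrieved chunks fit in the prompt."""
    return (context_window - system_prompt_tokens - response_budget) // top_k

# 8192-token window, 1000-token system prompt, 1000 tokens reserved
# for the answer, top_k = 5 retrieved chunks:
max_chunk_size(8192, 1000, 1000, 5)  # → 1238
```

So with these (assumed) numbers, 512-token chunks are safely within budget, but 2048-token chunks would overflow the context at top_k = 5.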

Vector Databases

DB        Type          Key Features                                 Scale                Pricing
Pinecone  Managed SaaS  Auto-scaling, hybrid search, HA              Billions of vectors  From $0 (Starter)
Weaviate  OSS / Cloud   GraphQL, knowledge graph, multi-modal        Horizontal scaling   Cloud ~$75/mo; OSS free
Qdrant    OSS / Cloud   Rust engine, metadata filtering, ACID        Distributed          Cloud ~$30/mo; OSS free
Milvus    OSS / Cloud   GPU support, cloud-native, many index types  Billions of vectors  Zilliz ~$0.10/hr; OSS free
Chroma    OSS Library   Python-native, developer-friendly            Moderate scale       Free
FAISS     OSS Library   Fastest ANN, GPU/C++, manual sharding        Single machine       Free

Quick choice guide: Pinecone/Qdrant for production, Weaviate for GraphQL-heavy stacks, Milvus for ultra-scale, Chroma/FAISS for development.

Index types: HNSW (production default, high speed/accuracy), IVF (large datasets, lower RAM), PQ (compression), Flat (dev/exact).

Key embedding models: text-embedding-3-large (OpenAI, 3072d, high quality) · text-embedding-3-small (OpenAI, 1536d, fast) · e5-large-v2 (HuggingFace, 1024d) · all-MiniLM-L6-v2 (HuggingFace, 384d, lightweight).


RAG in Chatbots — Intent Handling & Tag-Based Flow

Wrapping LLM output in structured tags lets the app layer detect intent, track conversation state, and route to the right retriever deterministically, without a separate classifier model.

Tag Schema (System Prompt Instruction)

Always respond in this exact XML format:
<intent>booking | search | faq | cancel | unknown</intent>
<entities>{"slot": "value", ...}</entities>
<state>greeting | slot_filling | confirmation | executing | done</state>
<missing_slots>["unfilled", "required", "slots"]</missing_slots>
<response>User-facing text here.</response>

Example — user says "Book me a flight to Chicago next week":

<intent>booking</intent>
<entities>{"destination": "Chicago", "date": "next week"}</entities>
<state>slot_filling</state>
<missing_slots>["exact_date", "budget", "return_date"]</missing_slots>
<response>What exact date and budget do you have in mind?</response>

Conversation State Machine

greeting → intent_detection ──→ unknown ──→ clarify
                │
                ▼
        slot_filling ◄──────────── missing slots
                │  (all slots filled)
                ▼
        confirmation ── rejected ──→ slot_filling
                │  confirmed
                ▼
        executing (RAG query + tool call)
                │
                ▼
              done

The app reads <state> on each turn, so it does not need an extra LLM call to decide the next step.
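The diagram can be encoded as a plain transition table; the event names here are hypothetical, use whatever your app layer derives from the parsed tags:

```python
# (current_state, event) → next_state, mirroring the state machine above.
TRANSITIONS = {
    ("greeting",     "intent_detected"): "slot_filling",
    ("greeting",     "unknown_intent"):  "clarify",
    ("slot_filling", "slots_missing"):   "slot_filling",
    ("slot_filling", "slots_filled"):    "confirmation",
    ("confirmation", "rejected"):        "slot_filling",
    ("confirmation", "confirmed"):       "executing",
    ("executing",    "completed"):       "done",
}

def next_state(state: str, event: str) -> str:
    """Advance the conversation; unknown events keep the current state."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table in app code (rather than asking the LLM what to do next) makes the flow deterministic and trivially testable.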

Intent → Retriever Routing

INTENT_ROUTER = {
    "booking": {"retriever": flight_hotel_db, "tools": ["search_flights", "search_hotels"]},
    "faq":     {"retriever": faq_vector_db,   "tools": []},
    "cancel":  {"retriever": policy_db,       "tools": ["cancel_booking"]},
    "unknown": {"retriever": general_kb,      "tools": []},
}
# Smaller specialized indices → faster + more precise; no irrelevant context in prompt
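The routing lookup itself is a dict access with a safe default. A self-contained sketch (the retrievers are string placeholders here; real code would hold DB handles as in the table above):

```python
INTENT_ROUTER = {
    "booking": {"retriever": "flight_hotel_db", "tools": ["search_flights", "search_hotels"]},
    "faq":     {"retriever": "faq_vector_db",   "tools": []},
    "cancel":  {"retriever": "policy_db",       "tools": ["cancel_booking"]},
    "unknown": {"retriever": "general_kb",      "tools": []},
}

def route(detected_intent: str) -> dict:
    """Route an intent to its retriever + tool set, falling back to 'unknown'."""
    return INTENT_ROUTER.get(detected_intent, INTENT_ROUTER["unknown"])
```

The fallback matters: any intent the schema does not cover degrades gracefully to the general knowledge base instead of raising.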

Tag Documents at Index Time

vector_db.upsert([
    {"id": "d1", "vector": embed(text), "metadata": {"intent": "faq",     "topic": "cancellation"}},
    {"id": "d2", "vector": embed(text), "metadata": {"intent": "booking", "topic": "flights"}},
])

# Filter retrieval by detected intent tag
results = vector_db.query(
    vector=embed(user_query),
    filter={"intent": {"$eq": detected_intent}},
    top_k=5
)

Slot Filling via LTM

# User: "same trip as last time but in May"
past = ltm.search("previous Chicago booking", top_k=1)
pre_filled = extract_slots(past)
# → {"destination": "Chicago", "airline": "United", "hotel": "Marriott"}
# Only "exact_date" is missing — ask for it once
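Combining the recalled slots with the required set is then a small diff; `merge_slots` is a hypothetical helper:

```python
def merge_slots(pre_filled: dict, required: list[str]) -> tuple[dict, list[str]]:
    """Return the LTM-recalled slots plus whichever required slots are still unfilled."""
    missing = [slot for slot in required if slot not in pre_filled]
    return pre_filled, missing

slots, missing = merge_slots(
    {"destination": "Chicago", "airline": "United", "hotel": "Marriott"},
    ["destination", "airline", "hotel", "exact_date"],
)
# missing → ["exact_date"], so the bot asks exactly one follow-up question
```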

Full Tagged Flow (per Turn)

Turn   App reads                              App does
1      <intent>                               Route to retriever + tool set
2+     <state>slot_filling + <missing_slots>  Ask for next missing slot
N      <state>confirmation                    Show summary, await yes/no
N+1    <state>executing                       RAG query (intent-filtered) + tool call
Final  <state>done                            Close conversation turn