
Memory & Retrieval-Augmented Generation (RAG)

Memory Architecture

Agents usually use two memory layers:

Short-Term Memory (STM)

  • Holds the current reasoning chain and tool outputs (often thousands of tokens)
  • Lives in the LLM context window or a fast in-memory cache
  • Is cleared or archived after task completion

Long-Term Memory (LTM)

  • Persistent knowledge across sessions: user profiles, past tasks, domain facts
  • Implemented with a Vector DB or another knowledge store
  • Promotion rule: move significant facts from STM to LTM after each milestone
if outcome.is_significant or outcome.reusable:
    embed(summary) → store in vector_db with metadata
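The promotion rule above can be sketched as runnable code. `embed` and `VectorStore` here are toy in-memory stand-ins for a real embedding model and vector DB, and `Outcome` is a hypothetical container for a completed STM milestone:

```python
from dataclasses import dataclass, field

def embed(text: str) -> list[float]:
    # Toy embedding for illustration only: averaged character codes per 4-char window.
    return [sum(ord(c) for c in text[i:i + 4]) / 4 for i in range(0, len(text), 4)]

@dataclass
class VectorStore:
    items: list[dict] = field(default_factory=list)

    def upsert(self, vector: list[float], metadata: dict) -> None:
        self.items.append({"vector": vector, "metadata": metadata})

@dataclass
class Outcome:
    summary: str
    is_significant: bool = False
    reusable: bool = False

def promote_to_ltm(outcome: Outcome, store: VectorStore) -> bool:
    """Promotion rule: move significant or reusable facts from STM to LTM."""
    if outcome.is_significant or outcome.reusable:
        store.upsert(embed(outcome.summary),
                     {"summary": outcome.summary, "source": "stm"})
        return True
    return False
```

In a real agent, `embed` would call an embedding model and `VectorStore.upsert` would write to the vector DB with the same metadata shape.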

RAG Pipeline

Phase 1 — Indexing (offline)

Documents → Chunking → Embedding Model → Vector Database
               │              │                  │
          split into     convert each       store vectors
         text chunks    chunk to vector    with metadata

Phase 2 — Query (runtime)

User Query → Embedding Model → Similarity Search (Vector DB)
                                        │
                                   Top-K Chunks
                                        │
                          ┌─── Context Assembly ───┐
                          │  System Prompt          │
                          │  + Retrieved Chunks     │
                          │  + User Query           │
                          └─────────────────────────┘
                                        │
                                  LLM → Answer

4 RAG components:

  1. Embedding model — converts queries and docs into dense vectors
  2. Vector Database — stores vectors and supports ANN search
  3. Retriever logic — query → embed → top-k match → inject context
  4. Freshness strategy — TTL, streaming updates, or live API fallback

Alternative: Vectorless RAG eliminates embeddings entirely — uses LLM reasoning over hierarchical document structure instead.

Retrieval Methods

Method            Principle                       Best When
Dense retrieval   Neural embeddings (semantics)   Conceptual similarity
Sparse retrieval  BM25 / TF-IDF (keywords)        Exact term matching
Hybrid search     Dense + Sparse (RRF fusion)     Best overall recall
Re-ranking        LLM/cross-encoder on top-100    High precision required

RRF formula: score(d) = Σᵢ 1 / (k + rankᵢ(d)), summing over the retrievers being fused; k = 60 is the conventional constant.
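A minimal implementation of the fusion step, assuming each retriever hands back an ordered list of document IDs:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank_i(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # ranked by dense retriever
sparse = ["d1", "d4", "d3"]   # ranked by BM25
rrf_fuse([dense, sparse])     # → ["d1", "d3", "d4", "d2"]
```

Note that d1 wins despite never ranking first in either list: appearing near the top of both lists beats a single first place, which is exactly the recall benefit hybrid search is after.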

Chunking Strategies

Strategy               Size            Overlap      When
Fixed-size window      512 tokens      50–100       Homogeneous text
Semantic segmentation  Varies          —            Structured documents
Sentence-based         1–5 sentences   1 sentence   FAQ, Q&A
Sliding window         256–1024        25%          Long connected text

Rule: chunk_size <= (context_window - system_prompt_tokens - response_budget) / top_k.
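The budget rule is simple integer arithmetic; worked through for a hypothetical 8K-token context window:

```python
def max_chunk_size(context_window: int, system_prompt_tokens: int,
                   response_budget: int, top_k: int) -> int:
    """Upper bound on chunk size so that top_k retrieved chunks fit in the prompt."""
    return (context_window - system_prompt_tokens - response_budget) // top_k

# 8192-token window, 1000-token system prompt, 1000 tokens reserved
# for the answer, top_k = 5 retrieved chunks:
max_chunk_size(8192, 1000, 1000, 5)  # → 1238
```

So with these (assumed) numbers, 512-token chunks are safely within budget, but 2048-token chunks would overflow the context at top_k = 5.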

Vector Databases

DB        Type          Key Features                                 Scale                Pricing
Pinecone  Managed SaaS  Auto-scaling, hybrid search, HA              Billions of vectors  From $0 (Starter)
Weaviate  OSS / Cloud   GraphQL, knowledge graph, multi-modal        Horizontal scaling   Cloud ~$75/mo; OSS free
Qdrant    OSS / Cloud   Rust engine, metadata filtering, ACID        Distributed          Cloud ~$30/mo; OSS free
Milvus    OSS / Cloud   GPU support, cloud-native, many index types  Billions of vectors  Zilliz ~$0.10/hr; OSS free
Chroma    OSS Library   Python-native, developer-friendly            Moderate scale       Free
FAISS     OSS Library   Fastest ANN, GPU/C++, manual sharding        Single machine       Free

Quick choice guide: Pinecone/Qdrant for production, Weaviate for GraphQL-heavy stacks, Milvus for ultra-scale, Chroma/FAISS for development.

Index types: HNSW (production default, high speed/accuracy), IVF (large datasets, lower RAM), PQ (compression), Flat (dev/exact).

Key embedding models: text-embedding-3-large (OpenAI, 3072d, high quality) · text-embedding-3-small (OpenAI, 1536d, fast) · e5-large-v2 (HuggingFace, 1024d) · all-MiniLM-L6-v2 (HuggingFace, 384d, lightweight).


RAG in Chatbots — Intent Handling & Tag-Based Flow

Wrapping LLM output in structured tags lets the app layer detect intent, track conversation state, and route to the right retriever deterministically, without a separate classifier model.

Tag Schema (System Prompt Instruction)

Always respond in this exact XML format:
<intent>booking | search | faq | cancel | unknown</intent>
<entities>{"slot": "value", ...}</entities>
<state>greeting | slot_filling | confirmation | executing | done</state>
<missing_slots>["unfilled", "required", "slots"]</missing_slots>
<response>User-facing text here.</response>

Example — user says "Book me a flight to Chicago next week":

<intent>booking</intent>
<entities>{"destination": "Chicago", "date": "next week"}</entities>
<state>slot_filling</state>
<missing_slots>["exact_date", "budget", "return_date"]</missing_slots>
<response>What exact date and budget do you have in mind?</response>

Conversation State Machine

greeting → intent_detection ──→ unknown ──→ clarify
                │
                ▼
        slot_filling ◄──────────── missing slots
                │  (all slots filled)
                ▼
        confirmation ── rejected ──→ slot_filling
                │  confirmed
                ▼
        executing (RAG query + tool call)
                │
                ▼
              done

The app reads <state> on each turn, so it does not need an extra LLM call to decide the next step.
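The diagram can be encoded as a plain transition table; the event names here are hypothetical, use whatever your app layer derives from the parsed tags:

```python
# (current_state, event) → next_state, mirroring the state machine above.
TRANSITIONS = {
    ("greeting",     "intent_detected"): "slot_filling",
    ("greeting",     "unknown_intent"):  "clarify",
    ("slot_filling", "slots_missing"):   "slot_filling",
    ("slot_filling", "slots_filled"):    "confirmation",
    ("confirmation", "rejected"):        "slot_filling",
    ("confirmation", "confirmed"):       "executing",
    ("executing",    "completed"):       "done",
}

def next_state(state: str, event: str) -> str:
    """Advance the conversation; unknown events keep the current state."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table in app code (rather than asking the LLM what to do next) makes the flow deterministic and trivially testable.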

Intent → Retriever Routing

INTENT_ROUTER = {
    "booking": {"retriever": flight_hotel_db, "tools": ["search_flights", "search_hotels"]},
    "faq":     {"retriever": faq_vector_db,   "tools": []},
    "cancel":  {"retriever": policy_db,       "tools": ["cancel_booking"]},
    "unknown": {"retriever": general_kb,      "tools": []},
}
# Smaller specialized indices → faster + more precise; no irrelevant context in prompt
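The routing lookup itself is a dict access with a safe default. A self-contained sketch (the retrievers are string placeholders here; real code would hold DB handles as in the table above):

```python
INTENT_ROUTER = {
    "booking": {"retriever": "flight_hotel_db", "tools": ["search_flights", "search_hotels"]},
    "faq":     {"retriever": "faq_vector_db",   "tools": []},
    "cancel":  {"retriever": "policy_db",       "tools": ["cancel_booking"]},
    "unknown": {"retriever": "general_kb",      "tools": []},
}

def route(detected_intent: str) -> dict:
    """Route an intent to its retriever + tool set, falling back to 'unknown'."""
    return INTENT_ROUTER.get(detected_intent, INTENT_ROUTER["unknown"])
```

The fallback matters: any intent the schema does not cover degrades gracefully to the general knowledge base instead of raising.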

Tag Documents at Index Time

vector_db.upsert([
    {"id": "d1", "vector": embed(text), "metadata": {"intent": "faq",     "topic": "cancellation"}},
    {"id": "d2", "vector": embed(text), "metadata": {"intent": "booking", "topic": "flights"}},
])

# Filter retrieval by detected intent tag
results = vector_db.query(
    vector=embed(user_query),
    filter={"intent": {"$eq": detected_intent}},
    top_k=5
)

Slot Filling via LTM

# User: "same trip as last time but in May"
past = ltm.search("previous Chicago booking", top_k=1)
pre_filled = extract_slots(past)
# → {"destination": "Chicago", "airline": "United", "hotel": "Marriott"}
# Only "exact_date" is missing — ask for it once
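Combining the recalled slots with the required set is then a small diff; `merge_slots` is a hypothetical helper:

```python
def merge_slots(pre_filled: dict, required: list[str]) -> tuple[dict, list[str]]:
    """Return the LTM-recalled slots plus whichever required slots are still unfilled."""
    missing = [slot for slot in required if slot not in pre_filled]
    return pre_filled, missing

slots, missing = merge_slots(
    {"destination": "Chicago", "airline": "United", "hotel": "Marriott"},
    ["destination", "airline", "hotel", "exact_date"],
)
# missing → ["exact_date"], so the bot asks exactly one follow-up question
```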

Full Tagged Flow (per Turn)

Turn   App reads                              App does
1      <intent>                               Route to retriever + tool set
2+     <state>slot_filling + <missing_slots>  Ask for next missing slot
N      <state>confirmation                    Show summary, await yes/no
N+1    <state>executing                       RAG query (intent-filtered) + tool call
Final  <state>done                            Close conversation turn