RAG (Retrieval-Augmented Generation): How LLMs Access External Knowledge

Retrieval-Augmented Generation (RAG) is a technique where an LLM's response is augmented with relevant information retrieved from an external knowledge base. The typical pipeline: user query → convert to embedding → search a vector database for similar chunks → inject retrieved chunks into the LLM prompt as context → LLM generates an answer grounded in the retrieved information. RAG addresses the problem of LLM knowledge cutoffs and hallucination by giving models access to current, domain-specific data.

Retrieval-Augmented Generation (RAG) is an architecture pattern where a large language model's response is supplemented with relevant information retrieved from an external knowledge base at query time. Rather than relying solely on knowledge encoded during training, the model receives specific, current context to ground its answers.

## The Standard Pipeline

1. **Document ingestion:** Source documents are split into chunks (typically 200-1000 tokens), each chunk is converted to a numerical embedding vector, and these vectors are stored in a vector database (Pinecone, Weaviate, pgvector, Chroma, etc.).
2. **Query processing:** When a user asks a question, the query is converted to an embedding vector with the same embedding model.
3. **Retrieval:** The query embedding is compared against the stored chunk embeddings using cosine similarity or a similar metric, and the top-K most similar chunks are retrieved.
4. **Generation:** The retrieved chunks are injected into the LLM's prompt as context, typically with an instruction like "Answer the question based on the following context." The LLM generates a response grounded in the retrieved information.

## What RAG Solves

**Knowledge cutoff:** LLMs only know what was in their training data. RAG provides access to information that's newer than the model's training date.

**Domain specificity:** A general-purpose LLM doesn't know about your company's internal documentation, product catalog, or proprietary data. RAG lets it answer questions about domain-specific content without fine-tuning.

**Hallucination reduction:** By providing grounded context, RAG reduces (but doesn't eliminate) the tendency of LLMs to generate plausible-sounding but incorrect information.

**Auditability:** Retrieved chunks can be cited as sources, letting users verify the model's claims against the original documents.

## Limitations

RAG quality depends heavily on retrieval quality — if the right chunks aren't retrieved, the model can't use them.
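The four pipeline steps above can be sketched end to end. This is a minimal, self-contained illustration: the `embed` function is a toy hashed term-frequency vector standing in for a real embedding model (in practice you would call an embedding API or a sentence-transformer), and the "vector database" is just an in-memory list.

```python
import math
import re
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word
    # into a fixed-size term-frequency vector.
    vec = [0.0] * dim
    for word, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        vec[hash(word) % dim] += count
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    # Steps 2-3: embed the query, rank chunks by similarity, keep top-K.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Step 4: inject retrieved chunks into the prompt as context.
    context = "\n\n".join(retrieve(query, chunks))
    return (
        "Answer the question based on the following context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The warehouse ships orders Monday through Friday.",
    "Support is available by email at all hours.",
]
print(build_prompt("What is the refund policy?", chunks))
```

With a real embedding model, retrieval would match by meaning rather than shared words, but the shape of the pipeline is identical: embed once at ingestion, embed the query, rank by similarity, assemble the prompt.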
Semantic search via embeddings captures meaning but can miss exact keyword matches. Chunking strategy (how documents are split) significantly affects retrieval quality. The context window limits how many chunks can be included.

## Alternatives

For personal-scale knowledge (hundreds of pages), simpler approaches like the LLM wiki pattern — where an LLM reads an index and follows explicit links between markdown files — can outperform RAG by leveraging explicit relationships rather than embedding similarity.
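Since chunking strategy matters so much, here is one common approach sketched concretely: a sliding window with overlap, so that a sentence falling on a chunk boundary still appears whole in at least one chunk. Word counts stand in for token counts here; a production splitter would count tokens with the embedding model's tokenizer.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Sliding-window chunking: each chunk repeats the last `overlap`
    # words of the previous one, so boundary content is never lost.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```

Smaller chunks retrieve more precisely but lose surrounding context; larger chunks preserve context but dilute the embedding and consume more of the context window. The overlap parameter trades storage for boundary robustness.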


This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons.