Building a RAG System from Scratch: A Hands-On Guide with LangChain and Elasticsearch
Step-by-step tutorial on building a Retrieval Augmented Generation system using LangChain, Elasticsearch vector search, and AWS Bedrock — from document embedding to context-enhanced generation.
Chinese Version: This article is a translated and enhanced version of a CSDN blog post. Read the original in Chinese →
Large Language Models are impressive — until they confidently make something up. This phenomenon, known as hallucination, is the Achilles’ heel of LLMs. The model predicts the next token based on your prompt, and when it lacks reliable data, it fills in the gaps with plausible-sounding nonsense.
RAG (Retrieval Augmented Generation) is the most practical solution to this problem today. Instead of hoping the model “knows” the answer, you give it the relevant documents at query time and ask it to synthesize a response from actual sources.
This article walks through building a complete RAG system from scratch using LangChain, Elasticsearch as the vector store, and AWS Bedrock (Claude) as the LLM. Every code example is production-oriented and uses current library versions.
1. What Is RAG?
RAG (Retrieval Augmented Generation) is a two-phase approach to text generation:
- Retrieval — Given a user query, search an external knowledge base and retrieve the most relevant documents.
- Generation — Feed the retrieved documents as context to an LLM, which generates an answer grounded in that evidence.
The key insight is simple: instead of relying on the model’s parametric knowledge (what it learned during pre-training), you supply non-parametric knowledge (your documents) at inference time. This dramatically reduces hallucination and lets you work with data the model has never seen — proprietary docs, recent updates, internal wikis.
Here is the flow in detail:
- The user submits a question to the RAG system.
- The retriever searches the knowledge base (vector store) for relevant documents.
- The knowledge base returns the top-K most similar document chunks.
- The prompt builder combines the user’s original question with the retrieved context into an augmented prompt.
- The LLM generates an answer grounded in the provided context.
- The system returns the answer — ideally with citations pointing back to the source documents.
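The six steps above can be sketched end to end in a few lines. The retriever and LLM below are illustrative stubs standing in for a real vector store and a real model call (the word-overlap "ranking" is just a placeholder for embedding similarity):

```python
import string

def tokenize(text: str) -> set[str]:
    """Lowercase and strip punctuation — a crude stand-in for real tokenization."""
    return {w.strip(string.punctuation) for w in text.lower().split()}

def stub_retriever(query: str, top_k: int = 2) -> list[str]:
    """Stand-in for vector search: return the top-K 'most relevant' chunks."""
    knowledge_base = [
        "RAG combines retrieval with generation.",
        "S3 Glacier is a low-cost archival storage class.",
        "Chunk overlap prevents information loss at boundaries.",
    ]
    # A real retriever ranks by embedding similarity; here we rank by shared words.
    ranked = sorted(
        knowledge_base,
        key=lambda chunk: len(tokenize(query) & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:top_k]

def stub_llm(prompt: str) -> str:
    """Stand-in for the LLM call: echo the context it was grounded in."""
    return f"Answer based on: {prompt}"

def rag_answer(question: str) -> str:
    context_chunks = stub_retriever(question)   # steps 2-3: retrieve top-K chunks
    context = "\n".join(context_chunks)         # step 4: build the augmented prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return stub_llm(prompt)                     # steps 5-6: generate and return

print(rag_answer("What is RAG?"))
```

The rest of the article replaces each stub with a production component: the retriever becomes Elasticsearch vector search, and the LLM becomes Claude on Bedrock.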
2. The Document Embedding Pipeline
Before the system can retrieve anything, you need to build the knowledge base. This is the indexing phase — a one-time (or periodic) process that converts your documents into searchable vectors.
2.1 Loading Documents
LangChain provides dozens of document loaders for different formats — PDF, Markdown, HTML, plain text, Notion exports, and more. Here is a simple example loading Markdown files:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader(
"./knowledge_base/",
glob="**/*.md",
loader_cls=TextLoader,
loader_kwargs={"encoding": "utf-8"},
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")
2.2 Chunking: Why Size Matters
Raw documents are typically too long to embed as a single vector. You need to split them into chunks — smaller passages that each become one vector in the store.
Chunk size is one of the most impactful parameters in a RAG system. Get it wrong, and retrieval quality suffers no matter how good your embedding model is.
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (128-256 tokens) | Precise retrieval, each chunk is topically focused | Loses surrounding context, may return fragments |
| Medium (512-1024 tokens) | Good balance of precision and context | May include some irrelevant content |
| Large (1024-2048 tokens) | Preserves full context and reasoning chains | Dilutes the signal, lower retrieval precision |
Practical guidance:
- Start with 512 tokens as your baseline chunk size.
- Use 10-20% overlap between chunks to prevent information loss at boundaries.
- For structured documents (Markdown, HTML), use header-aware splitting to respect document structure.
- For code documentation, consider larger chunks (1024+) since code snippets need surrounding context to be useful.
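To make the overlap guidance concrete, here is a dependency-free sketch of fixed-size chunking with overlap, using a whitespace-token list as a rough proxy for real tokenizer output. The window advances by `chunk_size - overlap` tokens, so each boundary region appears in two consecutive chunks:

```python
def chunk_tokens(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 64,  # ~12.5% of chunk_size, inside the 10-20% guideline
) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print(len(chunks))                        # 3 chunks cover 1200 tokens
print(chunks[0][-64:] == chunks[1][:64])  # True: the overlap region is shared
```

With 512-token chunks and a 64-token overlap, a sentence that straddles a boundary is guaranteed to appear whole in at least one chunk as long as it is shorter than the overlap.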
Here is a structure-aware splitter for Markdown documents:
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain_core.documents import Document
def split_markdown(
markdown_text: str,
chunk_size: int = 512,
chunk_overlap: int = 50,
) -> list[Document]:
"""Split a Markdown document by headers first, then by size."""
# Phase 1: Split on Markdown headers to preserve structure
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
md_splitter = MarkdownHeaderTextSplitter(
headers_to_split_on=headers_to_split_on
)
header_splits = md_splitter.split_text(markdown_text)
# Phase 2: Further split large sections by character count
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
)
final_chunks = text_splitter.split_documents(header_splits)
return final_chunks
2.3 Embedding Model Selection
The embedding model converts text chunks into dense vectors. The quality of these vectors directly determines retrieval accuracy. Here are the main options on AWS:
Amazon Titan Text Embeddings v2 — The managed option on Bedrock. Produces 1024-dimensional vectors, supports up to 8,192 input tokens, and requires zero infrastructure. Best choice if you are already on AWS and want simplicity.
Cohere Embed v3 — Also available on Bedrock. Supports multiple languages and has a search_document / search_query input type distinction that can improve retrieval quality.
Open-source alternatives (e.g., sentence-transformers/all-MiniLM-L6-v2, BGE, GTE) — Self-hosted, no per-request cost, but you need to manage the infrastructure (SageMaker endpoint or EC2). Consider these if you have very high throughput or cost constraints.
For this tutorial, we will use Amazon Titan Embeddings via Bedrock:
from langchain_aws import BedrockEmbeddings
embeddings = BedrockEmbeddings(
model_id="amazon.titan-embed-text-v2:0",
region_name="us-east-1",
)
# Quick test
test_vector = embeddings.embed_query("What is RAG?")
print(f"Vector dimensions: {len(test_vector)}") # 1024
Tip: Always use the same embedding model for indexing and querying. Mixing models (e.g., indexing with Titan but querying with Cohere) will produce meaningless similarity scores because the vector spaces are different.
3. Building the Knowledge Base with Elasticsearch
Elasticsearch has supported dense vector search since version 7.3, and version 8.x made it a first-class feature with native kNN search. It is an excellent choice for RAG because you get both vector search and traditional keyword search in one system — enabling hybrid retrieval strategies.
AWS alternative: If you prefer a fully managed service, Amazon OpenSearch Serverless supports the same vector search capabilities with zero operational overhead. The LangChain integration (OpenSearchVectorSearch, from langchain_community.vectorstores) works almost identically to the Elasticsearch one shown below.
3.1 Install Dependencies
pip install "langchain>=0.3.0" langchain-aws langchain-elasticsearch
pip install langchain-community # document loaders
pip install langchain-experimental # SemanticChunker (section 6)
pip install boto3 # AWS SDK
pip install "langserve[server]" fastapi uvicorn # API server (section 5.5)
3.2 Index Documents
Let us put it all together. We will use AWS service documentation as our knowledge base — a more practical example than general-purpose text.
from langchain_aws import BedrockEmbeddings
from langchain_elasticsearch import ElasticsearchStore
from langchain_core.documents import Document
# ── 1. Initialize the embedding model ──────────────────────
embeddings = BedrockEmbeddings(
model_id="amazon.titan-embed-text-v2:0",
region_name="us-east-1",
)
# ── 2. Connect to Elasticsearch ────────────────────────────
ES_URL = "https://your-elasticsearch-host:9200"
ES_API_KEY = "your-api-key"
vector_store = ElasticsearchStore(
embedding=embeddings,
index_name="aws_docs_index",
es_url=ES_URL,
es_api_key=ES_API_KEY,
)
# ── 3. Prepare sample documents (AWS knowledge base) ──────
aws_docs_text = """
# Amazon S3 Storage Classes
## S3 Standard
S3 Standard offers high durability, availability, and performance object storage for
frequently accessed data. It delivers low latency and high throughput, making it
suitable for a wide variety of use cases including cloud applications, dynamic websites,
content distribution, mobile and gaming applications, and big data analytics.
## S3 Intelligent-Tiering
S3 Intelligent-Tiering is the only cloud storage class that delivers automatic storage
cost savings when data access patterns change, without performance impact or operational
overhead. It monitors access patterns and moves objects that have not been accessed for
30 consecutive days to the Infrequent Access tier, delivering 40% cost savings. Objects
that have not been accessed for 90 days move to the Archive Instant Access tier with
68% savings.
## S3 Glacier
Amazon S3 Glacier is a secure, durable, and extremely low-cost Amazon S3 storage class
for data archiving and long-term backup. It is designed to deliver 99.999999999% durability
and provides query-in-place functionality. Retrieval times range from minutes to hours
depending on the retrieval tier selected: Expedited (1-5 minutes), Standard (3-5 hours),
or Bulk (5-12 hours).
## S3 Transfer Acceleration
S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long
distances between your client and an S3 bucket. It leverages Amazon CloudFront globally
distributed edge locations. Data arriving at an edge location is routed to S3 over an
optimized network path, providing 50-500% improvement for cross-region uploads.
## Cross-Region Replication (CRR)
Cross-Region Replication automatically replicates data between buckets across AWS Regions.
CRR helps meet compliance requirements, minimize latency, and increase operational
efficiency. You can replicate objects to a single destination bucket or to multiple
destination buckets in different AWS Regions.
"""
# ── 4. Split and index ─────────────────────────────────────
chunks = split_markdown(aws_docs_text, chunk_size=512, chunk_overlap=50)
print(f"Created {len(chunks)} chunks")
# Add documents to the vector store
vector_store.add_documents(chunks)
print("Documents indexed successfully")
3.3 Test Retrieval
# Configure the retriever with a similarity threshold
retriever = vector_store.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"score_threshold": 0.6, # Only return results above this confidence
"k": 3, # Return top 3 matches
},
)
# Test with a natural language query
results = retriever.invoke("How can I reduce storage costs for infrequently accessed data?")
for i, doc in enumerate(results):
print(f"\n── Result {i+1} ──")
print(doc.page_content[:200])
print(f"Metadata: {doc.metadata}")
Expected output: the S3 Intelligent-Tiering chunk should rank highest, followed by S3 Glacier — both directly relevant to reducing storage costs.
4. The Retrieval Flow at Query Time
Now that we have a populated vector store, let us trace what happens when a user asks a question.
- The user’s query is embedded using the same model that was used to index the documents.
- The query vector is compared against all document vectors using cosine similarity (or dot product).
- The top-K most similar chunks are returned as context.
- The original query and retrieved context are assembled into a prompt template.
- The LLM generates a response grounded in the provided context.
The critical insight: the LLM never searches the vector store directly. It only sees the final prompt containing the query and the pre-retrieved context. The retrieval step is entirely separate from the generation step.
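Steps 1-3 boil down to a nearest-neighbor search over vectors. Here is a dependency-free sketch with toy 3-dimensional embeddings (real embeddings have hundreds or thousands of dimensions, and the vectors here are made up for illustration, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy index: (chunk text, embedding vector) — vectors are illustrative only
index = [
    ("S3 Glacier retrieval tiers", [0.9, 0.1, 0.0]),
    ("S3 Transfer Acceleration",   [0.1, 0.9, 0.1]),
    ("Cross-Region Replication",   [0.0, 0.2, 0.9]),
]

def top_k(query_vector: list[float], k: int = 2) -> list[str]:
    """Rank every indexed chunk by cosine similarity to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# A query vector pointing mostly along the first axis should land on Glacier
print(top_k([0.8, 0.2, 0.1], k=2))
```

Production vector stores replace this brute-force scan with an approximate nearest-neighbor index (Elasticsearch uses HNSW), but the ranking criterion is the same.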
5. Retrieval Augmented Generation: The Complete Pipeline
Let us build the full RAG chain using LangChain and AWS Bedrock.
5.1 Define the Prompt Template
The prompt template is crucial. It tells the LLM how to use the retrieved context and what behavior to follow:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.documents import Document
ANSWER_TEMPLATE = """You are a helpful technical assistant. Answer the user's question
based ONLY on the provided context. If the context does not contain enough information
to answer the question, say "I don't have enough information to answer this question."
Do not make up facts or use knowledge outside the provided context.
[Context]
{context}
[Question]
{question}
Provide a clear, well-structured answer:"""
ANSWER_PROMPT = ChatPromptTemplate.from_template(ANSWER_TEMPLATE)
5.2 Initialize the LLM
from langchain_aws import ChatBedrock
llm = ChatBedrock(
model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
region_name="us-east-1",
model_kwargs={
"max_tokens": 2048,
"temperature": 0.1, # Low temperature for factual answers
},
)
5.3 Build the RAG Chain
# Helper: combine multiple documents into a single context string
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")
def combine_documents(
docs: list[Document],
document_prompt=DEFAULT_DOCUMENT_PROMPT,
separator: str = "\n\n",
) -> str:
"""Combine retrieved documents into a single context string."""
doc_strings = [document_prompt.format(page_content=doc.page_content) for doc in docs]
return separator.join(doc_strings)
# The RAG chain: retriever -> combine -> prompt -> LLM -> parse
rag_chain = (
{
"context": retriever | combine_documents,
"question": RunnablePassthrough(),
}
| ANSWER_PROMPT
| llm
| StrOutputParser()
)
# A baseline chain WITHOUT retrieval — for comparison
baseline_chain = (
ANSWER_PROMPT
| llm
| StrOutputParser()
)
5.4 Compare Results
question = "What are the retrieval time options for S3 Glacier?"
# ── With RAG ────────────────────────────────────────────
rag_answer = rag_chain.invoke(question)
print("=== RAG Answer ===")
print(rag_answer)
# ── Without RAG (baseline) ──────────────────────────────
baseline_answer = baseline_chain.invoke({
"context": "No context available.",
"question": question,
})
print("\n=== Baseline Answer (no retrieval) ===")
print(baseline_answer)
The RAG answer will cite specific retrieval tiers (Expedited: 1-5 minutes, Standard: 3-5 hours, Bulk: 5-12 hours) directly from the indexed documentation. The baseline answer may be correct if the LLM was trained on this information, but it could also hallucinate specific numbers — and you would have no way to verify.
5.5 Serving the RAG Chain as an API
from fastapi import FastAPI
from langserve import add_routes
app = FastAPI(
title="RAG API Server",
version="1.0",
description="RAG system powered by LangChain, Elasticsearch, and AWS Bedrock",
)
# Expose the RAG chain as a REST endpoint
add_routes(app, rag_chain, path="/rag")
# Expose the baseline chain for comparison
add_routes(app, baseline_chain, path="/baseline")
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8080)
Run the server and test it:
# Start the server
python server.py
# Test the RAG endpoint
curl -X POST http://localhost:8080/rag/invoke \
-H "Content-Type: application/json" \
-d '{"input": "How does S3 Transfer Acceleration work?"}'
6. Chunk Size Strategies and Their Impact
Chunk size is not a “set and forget” parameter. Different use cases demand different strategies:
Fixed-Size Chunking
The simplest approach — split text every N tokens with some overlap. Works well for homogeneous documents (e.g., prose, articles).
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""], # Try to split at natural boundaries
)
Semantic Chunking
More advanced — split based on semantic similarity between sentences. Adjacent sentences with high similarity stay together; a significant drop in similarity triggers a split. This produces variable-length chunks that align with topic boundaries.
from langchain_experimental.text_splitter import SemanticChunker
semantic_splitter = SemanticChunker(
embeddings=embeddings,
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=95, # Split at the top 5% similarity drops
)
Parent-Child Chunking
Index small chunks for precise retrieval, but return the larger parent document for context. LangChain supports this via ParentDocumentRetriever:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# Small chunks for retrieval, full documents for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
store = InMemoryStore()
parent_retriever = ParentDocumentRetriever(
vectorstore=vector_store,
docstore=store,
child_splitter=child_splitter,
parent_splitter=parent_splitter,
)
When to use which:
- Fixed-size — Default starting point. Simple, predictable, easy to tune.
- Semantic — When your documents mix multiple topics within the same section.
- Parent-child — When you need pinpoint retrieval but broad context for generation.
7. Embedding Model Comparison
Choosing the right embedding model is as important as chunk size. Here is a practical comparison:
| Model | Dimensions | Max Tokens | Cost | Best For |
|---|---|---|---|---|
| Titan Embeddings v2 | 1024 | 8,192 | ~$0.02/1M tokens | General purpose on AWS |
| Cohere Embed v3 | 1024 | 512 | ~$0.10/1M tokens | Multilingual, search-optimized |
| all-MiniLM-L6-v2 | 384 | 256 | Free (self-hosted) | Low-latency, budget-friendly |
| BGE-large-en-v1.5 | 1024 | 512 | Free (self-hosted) | High accuracy, English-focused |
| GTE-large | 1024 | 512 | Free (self-hosted) | Balanced quality and speed |
Key considerations:
- Dimension count affects storage and search speed. 384 dimensions (MiniLM) is fine for most use cases; 1024 (Titan, BGE) provides slightly better accuracy.
- Max input tokens determines the maximum chunk size you can use. Titan’s 8,192-token limit is exceptionally generous.
- Consistency is non-negotiable. Index and query must use the same model. If you switch models, you must re-index everything.
- Evaluate on your data. Run retrieval benchmarks on a sample of real queries before committing to a model. The MTEB leaderboard is useful, but your domain-specific performance may differ.
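The storage impact of dimension count is easy to estimate: at 4 bytes per float32 component, the raw vector footprint is num_vectors × dimensions × 4 bytes, before any index overhead (which varies by engine and index type). A quick back-of-the-envelope sketch:

```python
def vector_storage_mb(
    num_vectors: int,
    dimensions: int,
    bytes_per_component: int = 4,  # float32; some stores support int8 quantization
) -> float:
    """Raw storage for the vectors alone, excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_component / 1024 / 1024

# One million chunks at the two dimension counts from the table above
for dims in (384, 1024):
    print(f"{dims:>4} dims: {vector_storage_mb(1_000_000, dims):.0f} MB")
```

At a million chunks, 1024-dimensional vectors need roughly 3.9 GB of raw float32 storage versus about 1.5 GB for 384 dimensions — a factor you pay again in memory during search.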
8. RAG vs. Fine-Tuning: When to Use Which
RAG is powerful, but it is not always the right approach. Here is an honest comparison:
When RAG Wins
- Frequently changing knowledge — Your docs get updated weekly or daily. RAG lets you update the knowledge base without retraining.
- Source attribution is required — RAG naturally supports citations because you know exactly which documents informed the answer.
- You need to integrate proprietary data — Internal wikis, customer data, product catalogs. RAG keeps this data in your control.
- Cost sensitivity — No GPU training costs. You only pay for embedding and inference.
- Speed to production — A RAG system can be built in days. Fine-tuning typically takes weeks of experimentation.
When Fine-Tuning Wins
- Specific output format or style — If the model needs to generate code in a specific framework’s idiom, or write in a particular brand voice.
- Implicit knowledge — When the “knowledge” is not factual but behavioral (e.g., “respond like a medical professional”).
- Latency-critical applications — RAG adds retrieval latency. Fine-tuning bakes the knowledge into the model.
- Small, stable knowledge domain — If your knowledge base rarely changes and fits within training data limits.
The Hybrid Approach
In practice, many production systems use both:
- Fine-tune the base model on domain-specific style and terminology.
- RAG for factual, up-to-date knowledge retrieval.
This gives you the best of both worlds: a model that “speaks your language” and stays grounded in current facts.
RAG Limitations to Be Aware Of
- Retrieval quality ceiling — If the retriever cannot find the right documents, the LLM cannot produce a good answer. Garbage in, garbage out.
- Context window limits — You can only stuff so many retrieved chunks into the prompt. For complex questions requiring synthesis across dozens of documents, RAG struggles.
- Latency overhead — Each query requires an embedding call + vector search + LLM call. This adds 200-500ms compared to a direct LLM call.
- Chunk boundary problems — Important information that spans two chunks may be partially retrieved. Overlap and parent-child strategies help but do not fully solve this.
9. Production Checklist
Before deploying your RAG system, run through this checklist:
- Evaluation dataset — Build a set of 50+ question-answer pairs with ground truth. Measure retrieval recall (are the right documents found?) and answer accuracy separately.
- Chunk size tuning — Test at least three chunk sizes (256, 512, 1024) on your evaluation set. The best size depends on your document structure.
- Similarity threshold — Set it too high and you get no results; too low and you get noise. 0.5-0.7 is a reasonable starting range for cosine similarity.
- Fallback behavior — What happens when retrieval returns zero results? The LLM should explicitly say “I don’t know” rather than hallucinate.
- Monitoring — Log retrieval scores, latency, and user feedback. Track cases where users report incorrect answers — these are your evaluation goldmine.
- Re-indexing pipeline — Set up automated re-indexing when source documents change. Stale embeddings are worse than no embeddings.
- Hybrid search — Consider combining vector search with keyword (BM25) search. Elasticsearch supports this natively, and hybrid retrieval often outperforms either approach alone.
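A common way to combine vector and BM25 result lists is Reciprocal Rank Fusion (RRF), which recent Elasticsearch versions also offer natively. Each document scores the sum of 1/(k + rank) across the lists it appears in, so documents ranked well by both retrievers rise to the top. A dependency-free sketch (the document IDs are illustrative):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over all lists."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from the two retrievers
vector_hits = ["doc_glacier", "doc_tiering", "doc_crr"]
bm25_hits = ["doc_glacier", "doc_transfer", "doc_tiering"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # doc_glacier first: top-ranked by both retrievers
```

The constant k=60 (the value used in the original RRF paper) damps the influence of any single top rank; larger values flatten the fusion further.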
Conclusion
RAG is not magic — it is engineering. The core idea is straightforward: give the LLM the right context, and it will give you the right answer. But the devil is in the details: chunk size, embedding model selection, retrieval thresholds, prompt design, and fallback behavior all compound to determine whether your system is reliable or frustrating.
The stack we built in this article — LangChain for orchestration, Elasticsearch for vector storage, and AWS Bedrock (Claude) for generation — is production-ready and scales well. Elasticsearch gives you the flexibility of hybrid search (vector + keyword), and Bedrock eliminates the need to manage GPU infrastructure for either embeddings or generation.
Start simple. Get a working pipeline with fixed-size chunks and basic cosine similarity. Measure its performance on real queries. Then iterate: try semantic chunking, experiment with re-ranking, add metadata filters. Every improvement should be driven by data, not intuition.
The complete source code for this tutorial is available on GitHub.