Retrieval Augmented Generation (RAG) via LangChain

This tutorial shows how to use large language models (LLMs) with the LangChain framework. It covers the basics of Retrieval Augmented Generation (RAG). For an introduction to LangChain, see LangChain expression language (basic usage).

RAG is a technique in which generated answers rely not only on the model’s world knowledge, but also on a provided context. A typical use case is a chatbot that provides interacting users with correct and complete information. Such a chatbot has a defined scope or topic it is familiar with. The framework maps an incoming question to that subject by querying a vector database. Matching documents from the vector database then provide context for the LLM when it responds to the user’s request. In other words, the retrieved documents augment the generation process.
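
Conceptually, this boils down to three steps: retrieve matching documents, augment the prompt with them, and generate the answer. The following minimal sketch illustrates the flow; vector_store and llm are placeholders for the components that are built step by step in the remainder of this tutorial.

def answer_with_rag(question: str, vector_store, llm) -> str:
    """Minimal RAG sketch; vector_store and llm stand for the components built below."""
    # Retrieve: find documents that are semantically close to the question.
    documents = vector_store.similarity_search(query=question, k=3)
    # Augment: add the retrieved documents to the prompt as context.
    context = "\n\n".join(doc.page_content for doc in documents)
    prompt = f"Answer using only the following context:\n\n{context}\n\nQuestion: {question}"
    # Generate: let the chat model respond based on the provided context.
    return llm.invoke(prompt).content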

Documents in a vector database must match the intended use case of the chatbot, for example, instruction manuals for a product series for a technical consultant bot, or the programme of one or more political parties to inform interested citizens. Otherwise, RAG has no benefit. Consider how the chatbot should behave if a request is outside its intended topic. Such a solution should serve a specific audience, not act as unrestricted public access to an LLM.

This tutorial covers the important aspects of a RAG solution and highlights common pitfalls and opportunities for improvement.

After you create a STACKIT AI Model Serving auth token (see Manage auth tokens), provide it as embedding_model_serving_auth_token. See Available shared models to choose models and provide the model name and the URL. Choose an embedding model for the vector database and a chat model for chat interaction with the knowledge database.

import os
from dotenv import load_dotenv
load_dotenv("../.env")
embedding_model = os.environ["STACKIT_MODEL_SERVING_EMBEDDING_MODEL"]
embedding_base_url = os.environ["STACKIT_MODEL_SERVING_EMBEDDING_BASE_URL"]
embedding_model_serving_auth_token = os.environ["STACKIT_MODEL_SERVING_EMBEDDING_AUTH_TOKEN"]

As mentioned above, the core concept of RAG is generating a response with additional context. This context consists of documents that provide information on the requested subject. These documents are fetched from a database designed to search rapidly for suitable content.

We provide an EmbeddingEngine, a wrapper that uses the OpenAI‑compatible endpoint of STACKIT AI Model Serving for embeddings within the LangChain ecosystem.

from langchain_core.embeddings import Embeddings
from openai import OpenAI


class EmbeddingEngine(Embeddings):
    """Wrapper around the OpenAI-compatible embeddings endpoint of STACKIT AI Model Serving."""

    def __init__(self, api_key: str, model: str, base_url: str):
        self._openai_compatible_client = OpenAI(
            api_key=api_key,
            base_url=base_url
        )
        self._model = model

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Embed any number of documents."""
        return [self.embed_query(text=text) for text in texts]

    def embed_query(self, text: str) -> list[float]:
        """Embed text."""
        return self._openai_compatible_client.embeddings.create(
            input=[text], model=self._model
        ).data[0].embedding

    async def aembed_documents(self, texts: list[str]) -> list[list[float]]:
        # Asynchronous embedding is not needed in this tutorial; delegate to the synchronous path.
        return self.embed_documents(texts)

Let’s check the embedding with the STACKIT vision as input:

embedding_engine = EmbeddingEngine(
    api_key=embedding_model_serving_auth_token,
    base_url=embedding_base_url,
    model=embedding_model
)
vec = embedding_engine.embed_query(
    text="Ein unabhängiges Europa - digital, führend."
)
print(
    f"The embedding results in a {len(vec)} dimensional vector representation.\n\nSee the first positions:\n\t{vec[:8]}"
)
# Output
#> The embedding results in a 4096 dimensional vector representation.
#>
#> See the first positions:
#> [0.00975799560546875, 0.007625579833984375, -0.0110931396484375, -0.01410675048828125, -0.0122833251953125, -0.006244659423828125, -0.00787353515625, 0.0283966064453125]

The embedding model lets you choose the vector database. For this tutorial, we use ChromaDB. It is easy to set up within a local directory, making it suitable for fast prototyping and experiments. For production at scale, consider cloud‑native vector stores such as Milvus, Qdrant, or Weaviate. You can also use PostgreSQL, available as a managed service (STACKIT PostgreSQL Flex) via the pgvector extension.
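
For illustration, the following sketch shows how the PostgreSQL route could look with the langchain_postgres integration; it is not needed for the rest of this tutorial, and the connection string is a placeholder that assumes a database with the pgvector extension enabled.

from langchain_postgres import PGVector

# Sketch only: assumes a reachable PostgreSQL instance (e.g. STACKIT PostgreSQL Flex)
# with the pgvector extension enabled; the connection string is a placeholder.
pg_vector_store = PGVector(
    embeddings=embedding_engine,
    collection_name="sonnets",
    connection="postgresql+psycopg://user:password@host:5432/dbname",
)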

Set up a Chroma instance:

from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="sonnets",
    embedding_function=embedding_engine,
    persist_directory="../chroma_langchain_db"
)

Provide documents that can be queried from the vector store. These documents should match the intended use case. For this tutorial, we use public domain literature—specifically Shakespeare’s sonnets.

First, load The Complete Works of William Shakespeare and extract the sonnets as instances of langchain_core.documents.Document. These documents can be provided with any metadata that the specific use case requires, which may include additional information or crucial properties to filter when querying the vector store.

import re
from pathlib import Path
from langchain_core.documents import Document


def consume_sonnets(filename: str | Path) -> list[Document]:
    """Read sonnets from a text file and return them as a list of Documents."""
    start = "THE SONNETS"
    end = "THE END"
    split_marker = "##**##"
    fp = Path(filename)
    _verify_file_path(path=fp)
    with fp.open(mode="r", encoding="utf-8") as f:
        text = f.read()
    sonnets = text.split(start)[-1]
    sonnets = sonnets.split(end)[0]
    # Replace each sonnet number line (such as "  18") with a split marker.
    sonnets = re.compile(r'\n\s+\d+').sub(split_marker, sonnets)
    return [
        Document(
            page_content=sonnet.strip(),
            metadata={
                "author": "Shakespeare, William",
                "number": number
            }
        )
        # Drop the text before the first sonnet and number the sonnets starting at 1.
        for number, sonnet in enumerate(sonnets.split(split_marker)[1:], start=1)
    ]


def _verify_file_path(path: Path) -> None:
    """Raise FileNotFoundError if the specified path does not point to a file."""
    if not path.exists():
        raise FileNotFoundError(f"Could not find file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"Expected {path} to be a file, but found a directory.")


document_sonnets = consume_sonnets(
    Path.cwd().parent / "data" / "the_complete_works_of_william_shakespeare.txt"
)
print(f"Consumed {len(document_sonnets)} of expected 154 sonnets from William Shakespeare.")
# Output
#> Consumed 154 of expected 154 sonnets from William Shakespeare.

The add_documents method of our Chroma instance creates embeddings (using the embedding engine) and persists them along with the documents. The vector store responds with the UUID of each embedded document.

vector_store.add_documents(documents=document_sonnets)
# Output
#> '4f97a9c6-7938-4b5b-a2b9-227f6b9bcb51',
#> '533a390a-de57-4c9e-85f6-407d2682d683',
#> 'd3e21a49-ec40-430f-bbe1-31846c0a5cdd',
...

Check which sonnets best match a given topic. The helper best_matching_sonnets queries the vector store for a topic. random_topic_sonnets chooses a topic from a predefined set.

import random


def best_matching_sonnets(topic: str, amount_of_sonnets: int = 2) -> None:
    """Determine the best matching sonnets for the specified topic."""
    results = vector_store.similarity_search(
        query=topic,
        k=amount_of_sonnets
    )
    print(f"\nBest matching sonnets for the topic <<{topic}>> are:\n" + "===" * 19)
    for result in results:
        print(f"\n{result.page_content}")


def random_topic_sonnets(amount_of_sonnets: int) -> None:
    """Select a random topic and print the best matching sonnets for that topic."""
    topics = ["Youth", "Death", "Ephemerality", "Love", "Desire", "Loneliness"]
    best_matching_sonnets(
        topic=random.choice(topics),
        amount_of_sonnets=amount_of_sonnets
    )


random_topic_sonnets(amount_of_sonnets=2)
# output
#> Best matching sonnets for the topic <<Death>> are:
#> =========================================================
#>
#> No longer mourn for me when I am dead,
#> Than you shall hear the surly sullen bell
#> Give warning to the world that I am fled
#> From this vile world with vilest worms to dwell:
#> Nay if you read this line, remember not,
#> The hand that writ it, for I love you so,
#> That I in your sweet thoughts would be forgot,
#> If thinking on me then should make you woe.
#> O if, I say, you look upon this verse,
#> When I (perhaps) compounded am with clay,
#> Do not so much as my poor name rehearse;
#> But let your love even with my life decay.
#> Lest the wise world should look into your moan,
#> And mock you with me after I am gone.
#>
#> O lest the world should task you to recite,
#> What merit lived in me that you should love
#> After my death, dear love, forget me quite,
#> For you in me can nothing worthy prove.
#> Unless you would devise some virtuous lie,
#> To do more for me than mine own desert,
#> And hang more praise upon deceased I,
#> Than niggard truth would willingly impart:
#> O lest your true love may seem false in this,
#> That you for love speak well of me untrue,
#> My name be buried where my body is,
#> And live no more to shame nor me, nor you.
#> For I am shamed by that which I bring forth,
#> And so should you, to love things nothing worth.

With this Chroma collection in place, you can use this knowledge in a chat completion use case.

This section shows how to provide results from a vector store for use by an LLM. Since we have chosen the works of probably the most famous poet as our example subject, we face a similar issue as in the tutorial on advanced LCEL usage: most questions can be answered from the model’s world knowledge, so there is no dire need for additional context. However, these tutorials aim to demonstrate how to make the LangChain components interact properly, so that you can apply the pattern to your specific use case.

Since the basics of LCEL are covered in the earlier tutorials, the explanations in this section are kept concise.

Regarding the resources of STACKIT AI Model Serving, it is crucial to select a suitable model: chat models and embedding models are not interchangeable. The base_url and the STACKIT AI Model Serving auth token, however, can be the same as before.

model = os.environ["STACKIT_MODEL_SERVING_MODEL"]
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]
model_serving_auth_token = os.environ["STACKIT_MODEL_SERVING_AUTH_TOKEN"]

Two prompts are defined. The first one, topic_prompt, comes up with an appropriate topic, i.e., an LLM-fueled version of the random_topic_sonnets helper used above. The second one, essay_prompt, expects the chosen topic and some sonnets fetched from the vector store and incorporates them into a short essay.

from langchain_core.prompts import ChatPromptTemplate

topic_prompt = ChatPromptTemplate([
    ("system", "You are a helpful AI bot familiar with poetry from the Elizabethan era."),
    ("human", "Suggest a typical topic for the era that would be worth a sonnet. Answer concisely with only the topic in a few words."),
])
essay_prompt = ChatPromptTemplate([
    ("system", "You are a helpful AI bot familiar with poetry from the Elizabethan era."),
    ("human", "Provide a short essay on Elizabethan poetry addressing the topic '{topic}'. Use the following sonnets as sources and to support your points. Do not include extensive knowledge beyond the provided sources.\n\nHere are the sonnets:\n{sonnets}"),
])

Provide utilities for downstream data handling and a helper to adapt the input dictionary between runnables.

from typing import Dict

from pydantic import BaseModel


class LyricSubject(BaseModel):
    topic: str
    essay: str
    sonnets: str


def extract_single_nested_dict(d: Dict[str, str], key_nested_dict: str = "origin_args") -> Dict[str, str]:
    """Extract a nested dictionary and add all of its key-value pairs to the top-level dictionary."""
    nested_dict = d.pop(key_nested_dict, None)
    return {**nested_dict, **d}

Carry inputs forward with a common RunnableParallel construction.

from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough
from langchain_openai import ChatOpenAI

topic_chain = RunnableParallel(
    topic=(
        topic_prompt |
        ChatOpenAI(
            model=model,
            base_url=base_url,
            api_key=model_serving_auth_token,
            temperature=.1,
            frequency_penalty=.05
        ) |
        StrOutputParser()
    ),
    origin_args=RunnablePassthrough()
) | RunnableLambda(extract_single_nested_dict)
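
Invoking topic_chain on its own illustrates the intermediate data shape: the generated topic is merged with the passed-through input arguments (the exact topic string varies per run).

# The result combines the pass-through input with the generated topic,
# e.g. a dictionary of the shape {"amount_of_sonnets": 2, "topic": "..."}.
print(topic_chain.invoke({"amount_of_sonnets": 2}))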

Use the Chroma vector store in get_sonnets so the store can be handled as a RunnableLambda. The final chain generates a topic, finds sonnets, derives an essay, and casts the result to a data class.

def get_sonnets(d: Dict[str, str | int]) -> Dict[str, str | int]:
    """Determine the best matching sonnets for the specified topic and add them to the dictionary."""
    results = vector_store.similarity_search(
        query=d["topic"],
        k=d["amount_of_sonnets"]
    )
    d["sonnets"] = "\n\n".join([r.page_content for r in results])
    return d


essay_chain = RunnableParallel(
    essay=(
        RunnableLambda(get_sonnets) |
        essay_prompt |
        ChatOpenAI(
            model=model,
            base_url=base_url,
            api_key=model_serving_auth_token,
            temperature=.1,
            frequency_penalty=.05
        ) |
        StrOutputParser()
    ),
    origin_args=RunnablePassthrough()
) | RunnableLambda(extract_single_nested_dict)

elizabethan_poetry_essay_chain = topic_chain | essay_chain | RunnableLambda(lambda d: LyricSubject(**d))

Test the chain with two sonnets as input.

lyric_subject = elizabethan_poetry_essay_chain.invoke({"amount_of_sonnets": 2})
print(f"On the topic <<{lyric_subject.topic}>> following essay has been derived:\n\n{lyric_subject.essay}")
# output
#> On the topic <<Fleeting nature of beauty.>> following essay has been derived:
#>
#> The fleeting nature of beauty is a recurring theme in Elizabethan era poetry, as seen in the two provided sonnets. These poems, likely penned by William Shakespeare, lament the transience of beauty and its inevitable decline.
#>
#> In the first sonnet, the speaker highlights the distinction between true beauty and superficial appearance. The rose, with its sweet fragrance, is deemed more beautiful than the canker blooms, which, despite their deep color, lack virtue and eventually fade unloved. This serves as a metaphor for the fleeting nature of human beauty, which, like the rose, is ephemeral and subject to decay. The speaker consoles the "beauteous and lovely youth" that even when their physical beauty fades, their true essence will be preserved through the power of verse.

This concludes the basic RAG tutorial. Key takeaways:

  • Vector stores are key to RAG, and embedding models fuel them.
  • Embedding models differ from chat completion models.
  • Providing an LLM with curated expertise, rather than relying on volatile world knowledge, enables many applications.

Final notes:

  • The data source used here is well-behaved; in real projects, significant effort goes into converting raw data into embeddable documents, including extracting information from tables or presentations and handling layout noise.
  • Topic stability can vary. In production, set thresholds to accept retrieved documents and define how the application proceeds when no suitable context is found; see the sketch after this list.
  • Do not use a local filesystem‑based vector store in production. Use a cloud‑native vector store.
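
As a minimal sketch of such a threshold check, the helper below uses the similarity_search_with_score method of the Chroma store and keeps only documents whose distance score stays below a cut-off; the value of 0.8 is purely illustrative and must be tuned for the embedding model and data at hand.

def retrieve_with_threshold(topic: str, max_distance: float = 0.8, k: int = 2) -> list[Document]:
    """Return only documents whose distance score is below the cut-off."""
    # similarity_search_with_score returns (Document, score) pairs; for Chroma,
    # a lower score means a closer match. The cut-off of 0.8 is only illustrative.
    results = vector_store.similarity_search_with_score(query=topic, k=k)
    accepted = [doc for doc, score in results if score <= max_distance]
    if not accepted:
        # Define here how the application should proceed without suitable context,
        # e.g. return a polite refusal instead of calling the LLM.
        return []
    return accepted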

Many of these issues are addressed in the open‑source STACKIT RAG Template. Consider this tutorial a guided tour of how RAG works and the template as a starting point for deploying your specific RAG use case.