Multi-Modal Retrieval Augmented Generation (RAG)

This tutorial demonstrates how to use large language models (chat and embedding) with vision understanding for retrieval augmented generation (RAG) use cases. It uses the LangChain library, one of many OpenAI-compatible alternatives, and was tested with the following versions: openai==2.15.0, langchain-openai==1.1.7, langchain==1.2.7, langchain-chroma==1.1.0.

After you have created a STACKIT AI Model Serving auth token, as described here, provide it as api_key. You can also decide which models to use and provide their names along with the base URL. This tutorial requires both models (chat and embedding) to support vision input. With that, you are ready to explore what these LLMs can do. If you are switching the LLM provider of an existing application, chances are high that this is all you have to adjust, thanks to the OpenAI API compatibility of the STACKIT AI Model Serving service.

import os
from dotenv import load_dotenv
load_dotenv(".env")
embedding_model_name = os.environ["STACKIT_MODEL_SERVING_EMBEDDING_MODEL"]
chat_model_name = os.environ["STACKIT_MODEL_SERVING_CHAT_MODEL"]
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]
auth_token = os.environ["STACKIT_MODEL_SERVING_API_KEY"]
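
The code above reads this configuration from a .env file. A minimal example of that file might look as follows (the values are placeholders you need to replace; the variable names are the ones used above):

STACKIT_MODEL_SERVING_EMBEDDING_MODEL=<your-embedding-model-name>
STACKIT_MODEL_SERVING_CHAT_MODEL=<your-chat-model-name>
STACKIT_MODEL_SERVING_BASE_URL=<your-model-serving-base-url>
STACKIT_MODEL_SERVING_API_KEY=<your-auth-token>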

Fuel the Vector Store: Multi-Modal Embeddings

Here we define our dataset, which in this example is just the bare minimum needed to demonstrate the capabilities.

# Create a directory for our data
os.makedirs("rag_data", exist_ok=True)
# You should add your own images to the rag_data directory.
# Here we will mock the image creation for demonstration purposes.
from PIL import Image
Image.new("RGB", (100, 100), color="red").save("rag_data/red_square.jpg")
Image.new("RGB", (100, 100), color="blue").save("rag_data/blue_square.jpg")
documents = [
    {
        "type": "text",
        "content": "The color red is often associated with passion, energy, and love.",
    },
    {"type": "image", "path": "rag_data/red_square.jpg"},
    {
        "type": "text",
        "content": "The color blue is often associated with calmness, stability, and sadness.",
    },
    {"type": "image", "path": "rag_data/blue_square.jpg"},
]

We define a MultiModalEmbeddingEngine, a wrapper that uses the OpenAI-compatible endpoint of STACKIT AI Model Serving for embeddings within the LangChain ecosystem. We explicitly implement the embed_image method, which handles image embeddings.

import base64
from typing import List
from langchain_core.embeddings import Embeddings
from openai import OpenAI


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


class MultiModalEmbeddingEngine(Embeddings):
    def __init__(self, api_key: str, model: str, base_url: str):
        self._client = OpenAI(api_key=api_key, base_url=base_url)
        self._model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        """Standard text embedding"""
        return (
            self._client.embeddings.create(input=[text], model=self._model)
            .data[0]
            .embedding
        )

    def embed_image(self, uris: List[str]) -> List[List[float]]:
        """Image embedding"""
        return [
            self._client.embeddings.create(
                input=[],
                extra_body={
                    "messages": [
                        {
                            "role": "user",
                            "content": [
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/jpeg;base64,{encode_image(uri)}"
                                    },
                                }
                            ],
                        }
                    ]
                },
                model=self._model,
            )
            .data[0]
            .embedding
            for uri in uris
        ]

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        # Async implementation if needed
        pass

    async def aembed_query(self, text: str) -> List[float]:
        # Async implementation if needed
        pass


embedding_engine = MultiModalEmbeddingEngine(
    api_key=auth_token,
    base_url=base_url,
    model=embedding_model_name,
)
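
As a quick sanity check, you can embed a piece of text and one of the images directly; assuming the endpoint embeds both modalities into the same vector space, the two vectors should have the same dimensionality:

# Minimal sanity check (not part of the pipeline): embed one text and one image.
text_vector = embedding_engine.embed_query("A small red square.")
image_vector = embedding_engine.embed_image(["rag_data/red_square.jpg"])[0]
print(len(text_vector), len(image_vector))  # both should match the model's embedding size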

Let’s ingest our RAG data into the vector store using our multi-modal embedding engine.

from langchain_chroma import Chroma
from langchain_core.documents import Document

vector_store = Chroma(
    collection_name="multimodal_rag",
    embedding_function=embedding_engine,  # note: this connects our multi-modal embedding engine to the vector store
    persist_directory="./chroma_db",
)

# Ingest documents
for i, doc_info in enumerate(documents):
    if doc_info["type"] == "text":
        vector_store.add_documents(
            documents=[
                Document(
                    page_content=doc_info["content"],
                    metadata={"type": "text", "source": f"document-abc-{i}"},
                )
            ]
        )
    elif doc_info["type"] == "image":
        metadatas = [
            {
                "type": "image",
                "source": f"document-abc-{i}",
                "source_path": doc_info["path"],
            }
        ]
        # note: this calls the embed_image method on our multi-modal embedding engine
        vector_store.add_images(uris=[doc_info["path"]], metadatas=metadatas)

print(f"Vector store created with {vector_store._collection.count()} documents.")

This example shows how to use the vector store for multi-modal semantic search. The query returns both image and text documents.

query_text = "What is the color of passion?"
query_embedding = embedding_engine.embed_query(query_text)
retrieved_docs = vector_store.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding, k=2
)
print("Retrieved documents:")
for doc, score in retrieved_docs:
    print(
        f" - Type: {doc.metadata['type']}, Content: {doc.page_content}, Score: {score}"
    )
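
Since the metadata written during ingestion marks each hit as text or image, you can post-process the results accordingly, for example to locate the original image files. A minimal sketch using the metadata fields defined above:

# Separate text hits from image hits via the metadata set at ingestion time.
for doc, score in retrieved_docs:
    if doc.metadata["type"] == "image":
        print(f"Image hit from {doc.metadata['source_path']} (score: {score})")
    else:
        print(f"Text hit: {doc.page_content} (score: {score})")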

Question Answering using Multi-Modal Retrieval

In this section we build a basic RAG pipeline for multi-modal inputs. We will use a multi-modal chat model along with the multi-modal embedding model from earlier.

from langchain_openai import ChatOpenAI
chat_model = ChatOpenAI(model=chat_model_name, api_key=auth_token, base_url=base_url)
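
Optionally, you can verify that the chat model is reachable with a single call before wiring it into the pipeline (a minimal check, not part of the RAG flow itself):

# Optional connectivity check for the chat model.
print(chat_model.invoke("Reply with the single word: ready").content)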

This example uses LangChain agents to build the RAG pipeline. There are numerous alternative implementations.

from langchain.agents import create_agent
from langchain.agents.middleware import dynamic_prompt, ModelRequest


@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query)
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        f"\n\n{docs_content}"
    )
    return system_message


agent = create_agent(chat_model, tools=[], middleware=[prompt_with_context])

First, we will use a text-only prompt to query the vector store - just like in a text-only RAG pipeline.

text_query = "What is the color of passion?"
prompt_messages = [{"role": "user", "content": text_query}]
for step in agent.stream({"messages": prompt_messages}, stream_mode="values"):
    step["messages"][-1].pretty_print()

Now, let’s prompt the model with text and image input.

from PIL import Image

Image.new("RGB", (100, 100), color="red").save("query_image.jpg")

text_query = "Explain the image meaning."
image_query = encode_image("query_image.jpg")
prompt_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": text_query,
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_query}"},
            },
        ],
    },
]

for step in agent.stream(
    {"messages": prompt_messages},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

This concludes the Multi-Modal RAG tutorial. Key takeaways:

  • RAG is not limited to text: By utilizing multi-modal embedding and chat models, you can map images and text to a shared vector space, enabling semantic search across different media types.
  • Unified Vector Stores: You can store image embeddings alongside text embeddings in the same collection, allowing for hybrid retrieval strategies.
  • Vision-Capable LLMs: Passing retrieved images (as Base64) to a vision-enabled Chat model allows the AI to “see” the context and answer questions based on visual evidence (see the sketch after this list).
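
To illustrate the last point, here is a minimal sketch of how retrieved image documents could be handed back to the vision-capable chat model; it re-encodes the images from the source_path metadata set during ingestion rather than relying on what the vector store returns as page content:

# Sketch: pass retrieved images (as Base64 data URLs) to the vision-capable chat model.
retrieved = vector_store.similarity_search("What is the color of passion?", k=2)
content = [{"type": "text", "text": "Which of the retrieved images fits the query best, and why?"}]
for doc in retrieved:
    if doc.metadata.get("type") == "image":
        image_b64 = encode_image(doc.metadata["source_path"])
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        )
response = chat_model.invoke([{"role": "user", "content": content}])
print(response.content)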

Final notes:

  • Data & Storage Strategy: In this tutorial, we handled images via local paths and direct Base64 encoding. In production, store the actual image binaries in a dedicated Object Storage (like STACKIT Object Storage) and save only the references (URLs) and embeddings in your vector database to keep it lightweight.
  • Latency: Images consume additional compute during preprocessing compared to text-only inputs. Be mindful of the image limits per request and of the latency introduced by encoding and transmitting image data.
  • Production Infrastructure: Do not use a local filesystem‑based vector store in production. Use a cloud‑native vector store to ensure scalability and persistence.

Many of these issues are addressed in the open‑source STACKIT RAG Template. Consider this tutorial a guided tour of how RAG works and the template as a starting point for deploying your specific RAG use case.