Multi-Modal Retrieval Augmented Generation (RAG)

This tutorial demonstrates how to use large language models (chat and embedding) with vision understanding for retrieval augmented generation (RAG) use cases. It uses the LangChain library, one of many OpenAI-compatible alternatives, and was tested with the following versions: openai==2.15.0, langchain-openai==1.1.7, langchain==1.2.7, langchain-chroma==1.1.0.

After you have created a STACKIT AI Model Serving auth token, as described here, provide it as api_key. You can also decide which models to use and provide their names along with the base URL. This tutorial requires both models (chat and embedding) to support vision input. With that, you are ready to explore what these LLMs can do. If you are switching the LLM provider of an existing application, chances are high that this is all you have to adjust, thanks to the OpenAI API compatibility of the STACKIT AI Model Serving service.

import os
from dotenv import load_dotenv
load_dotenv(".env")
embedding_model_name = os.environ["STACKIT_MODEL_SERVING_EMBEDDING_MODEL"]
chat_model_name = os.environ["STACKIT_MODEL_SERVING_CHAT_MODEL"]
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]
auth_token = os.environ["STACKIT_MODEL_SERVING_API_KEY"]
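
The code above reads this configuration from a .env file. A minimal example of that file might look as follows (the values are placeholders you need to replace; the variable names are the ones used above):

STACKIT_MODEL_SERVING_EMBEDDING_MODEL=<your-embedding-model-name>
STACKIT_MODEL_SERVING_CHAT_MODEL=<your-chat-model-name>
STACKIT_MODEL_SERVING_BASE_URL=<your-model-serving-base-url>
STACKIT_MODEL_SERVING_API_KEY=<your-auth-token>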

Fuel the Vector Store: Multi-Modal Embeddings

Here we define our dataset, which in this example is just the bare minimum needed to demonstrate the capabilities.

# Create a directory for our data
os.makedirs("rag_data", exist_ok=True)
# You should add your own images to the rag_data directory.
# Here we will mock the image creation for demonstration purposes.
from PIL import Image
Image.new("RGB", (100, 100), color="red").save("rag_data/red_square.jpg")
Image.new("RGB", (100, 100), color="blue").save("rag_data/blue_square.jpg")
documents = [
    {
        "type": "text",
        "content": "The color red is often associated with passion, energy, and love.",
    },
    {"type": "image", "path": "rag_data/red_square.jpg"},
    {
        "type": "text",
        "content": "The color blue is often associated with calmness, stability, and sadness.",
    },
    {"type": "image", "path": "rag_data/blue_square.jpg"},
]

We define a MultiModalEmbeddingEngine, a wrapper that uses the OpenAI-compatible endpoint of STACKIT AI Model Serving for embeddings within the LangChain ecosystem. We explicitly implement the embed_image method, which handles image embeddings.

import base64
from typing import List
from langchain_core.embeddings import Embeddings
from openai import OpenAI


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


class MultiModalEmbeddingEngine(Embeddings):
    def __init__(self, api_key: str, model: str, base_url: str):
        self._client = OpenAI(api_key=api_key, base_url=base_url)
        self._model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        """Standard text embedding"""
        return (
            self._client.embeddings.create(input=[text], model=self._model)
            .data[0]
            .embedding
        )

    def embed_image(self, uris: List[str]) -> List[List[float]]:
        """Image embedding"""
        return [
            self._client.embeddings.create(
                input=[],
                extra_body={
                    "messages": [
                        {
                            "role": "user",
                            "content": [
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/jpeg;base64,{encode_image(uri)}"
                                    },
                                }
                            ],
                        }
                    ]
                },
                model=self._model,
            )
            .data[0]
            .embedding
            for uri in uris
        ]

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        # Async implementation if needed
        pass

    async def aembed_query(self, text: str) -> List[float]:
        # Async implementation if needed
        pass


embedding_engine = MultiModalEmbeddingEngine(
    api_key=auth_token,
    base_url=base_url,
    model=embedding_model_name,
)
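
As a quick sanity check, you can embed a piece of text and one of the images directly; assuming the endpoint embeds both modalities into the same vector space, the two vectors should have the same dimensionality:

# Minimal sanity check (not part of the pipeline): embed one text and one image.
text_vector = embedding_engine.embed_query("A small red square.")
image_vector = embedding_engine.embed_image(["rag_data/red_square.jpg"])[0]
print(len(text_vector), len(image_vector))  # both should match the model's embedding size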

Let’s ingest our RAG data into the vector store using our multi-modal embedding engine.

from langchain_chroma import Chroma
from langchain_core.documents import Document

vector_store = Chroma(
    collection_name="multimodal_rag",
    embedding_function=embedding_engine,  # note: this connects our multi-modal embedding engine to the vector store
    persist_directory="./chroma_db",
)

# Ingest documents
for i, doc_info in enumerate(documents):
    if doc_info["type"] == "text":
        vector_store.add_documents(
            documents=[
                Document(
                    page_content=doc_info["content"],
                    metadata={"type": "text", "source": f"document-abc-{i}"},
                )
            ]
        )
    elif doc_info["type"] == "image":
        metadatas = [
            {
                "type": "image",
                "source": f"document-abc-{i}",
                "source_path": doc_info["path"],
            }
        ]
        # note: this calls the embed_image method on our multi-modal embedding engine
        vector_store.add_images(uris=[doc_info["path"]], metadatas=metadatas)

print(f"Vector store created with {vector_store._collection.count()} documents.")

This example shows how to use the vector store for multi-modal semantic search. The query returns both image and text documents.

query_text = "What is the color of passion?"
query_embedding = embedding_engine.embed_query(query_text)
retrieved_docs = vector_store.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding, k=2
)
print("Retrieved documents:")
for doc, score in retrieved_docs:
    print(
        f" - Type: {doc.metadata['type']}, Content: {doc.page_content}, Score: {score}"
    )
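
Since the metadata written during ingestion marks each hit as text or image, you can post-process the results accordingly, for example to locate the original image files. A minimal sketch using the metadata fields defined above:

# Separate text hits from image hits via the metadata set at ingestion time.
for doc, score in retrieved_docs:
    if doc.metadata["type"] == "image":
        print(f"Image hit from {doc.metadata['source_path']} (score: {score})")
    else:
        print(f"Text hit: {doc.page_content} (score: {score})")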

Question Answering using Multi-Modal Retrieval

In this section we build a basic RAG pipeline for multi-modal inputs. We will use a multi-modal chat model along with the multi-modal embedding model from earlier.

from langchain_openai import ChatOpenAI
chat_model = ChatOpenAI(model=chat_model_name, api_key=auth_token, base_url=base_url)
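
Optionally, you can verify that the chat model is reachable with a single call before wiring it into the pipeline (a minimal check, not part of the RAG flow itself):

# Optional connectivity check for the chat model.
print(chat_model.invoke("Reply with the single word: ready").content)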

This example uses LangChain agents to build the RAG pipeline. There are numerous alternative implementations.

from langchain.agents import create_agent
from langchain.agents.middleware import dynamic_prompt, ModelRequest


@dynamic_prompt
def prompt_with_context(request: ModelRequest) -> str:
    """Inject context into state messages."""
    last_query = request.state["messages"][-1].text
    retrieved_docs = vector_store.similarity_search(last_query)
    docs_content = "\n\n".join(doc.page_content for doc in retrieved_docs)
    system_message = (
        "You are a helpful assistant. Use the following context in your response:"
        f"\n\n{docs_content}"
    )
    return system_message


agent = create_agent(chat_model, tools=[], middleware=[prompt_with_context])

First, we will use a text-only prompt to query the vector store - just like in a text-only RAG pipeline.

text_query = "What is the color of passion?"
prompt_messages = [{"role": "user", "content": text_query}]
for step in agent.stream({"messages": prompt_messages}, stream_mode="values"):
    step["messages"][-1].pretty_print()

Now, let’s prompt the model with text and image input.

from PIL import Image

Image.new("RGB", (100, 100), color="red").save("query_image.jpg")

text_query = "Explain the image meaning."
image_query = encode_image("query_image.jpg")
prompt_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": text_query,
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_query}"},
            },
        ],
    },
]

for step in agent.stream(
    {"messages": prompt_messages},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()

This concludes the Multi-Modal RAG tutorial. Key takeaways:

  • RAG is not limited to text: By utilizing multi-modal embedding and chat models, you can map images and text to a shared vector space, enabling semantic search across different media types.
  • Unified Vector Stores: You can store image embeddings alongside text embeddings in the same collection, allowing for hybrid retrieval strategies.
  • Vision-Capable LLMs: Passing retrieved images (as Base64) to a vision-enabled Chat model allows the AI to “see” the context and answer questions based on visual evidence (see the sketch after this list).
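
To illustrate the last point, here is a minimal sketch of how retrieved image documents could be handed back to the vision-capable chat model; it re-encodes the images from the source_path metadata set during ingestion rather than relying on what the vector store returns as page content:

# Sketch: pass retrieved images (as Base64 data URLs) to the vision-capable chat model.
retrieved = vector_store.similarity_search("What is the color of passion?", k=2)
content = [{"type": "text", "text": "Which of the retrieved images fits the query best, and why?"}]
for doc in retrieved:
    if doc.metadata.get("type") == "image":
        image_b64 = encode_image(doc.metadata["source_path"])
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
        )
response = chat_model.invoke([{"role": "user", "content": content}])
print(response.content)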

Final notes:

  • Data & Storage Strategy: In this tutorial, we handled images via local paths and direct Base64 encoding. In production, store the actual image binaries in a dedicated Object Storage (like STACKIT Object Storage) and save only the references (URLs) and embeddings in your vector database to keep it lightweight.
  • Latency: Images consume additional compute during preprocessing compared to text-only inputs. Be mindful of the image limits per request and of the latency introduced by encoding and transmitting image data.
  • Production Infrastructure: Do not use a local filesystem‑based vector store in production. Use a cloud‑native vector store to ensure scalability and persistence.

Many of these issues are addressed in the open‑source STACKIT RAG Template. Consider this tutorial a guided tour of how RAG works and the template as a starting point for deploying your specific RAG use case.