Chat with images - vision understanding (basics)
This tutorial shows how to use large language models (LLMs) with vision understanding.
Configure access to STACKIT LLM instances
After you create a STACKIT AI Model Serving auth token (see Manage auth tokens), provide it as api_key. From Available shared models, choose a model and provide the model name and the base URL. This tutorial requires a model that supports vision input.
You are now ready to explore LLM capabilities. If you switch the LLM provider in an existing application, you may only need to adjust the base URL and key because STACKIT AI Model Serving is compatible with the OpenAI API.
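The snippets below read the configuration from an environment file. As a sketch, the ../.env file loaded in the next cell could contain entries like these (the values are placeholders, not real credentials):

STACKIT_MODEL_SERVING_CHAT_MODEL="<name-of-a-vision-capable-model>"
STACKIT_MODEL_SERVING_BASE_URL="https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
STACKIT_MODEL_SERVING_API_KEY="ey..."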
import os
from dotenv import load_dotenv
load_dotenv("../.env")
model = os.environ["STACKIT_MODEL_SERVING_CHAT_MODEL"]  # Select a chat model with vision support from https://support.docs.stackit.cloud/stackit/en/models-licenses-319914532.html
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]  # For example: "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
api_key = os.environ["STACKIT_MODEL_SERVING_API_KEY"]  # For example: "ey..."

Define a helper function that loads an image in base64 encoding for later use.
import base64
def encode_local_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
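As a quick sanity check (assuming an image file img.png sits next to this notebook, as in the example further below), the helper returns a plain base64 string:

preview = encode_local_image("img.png")
print(preview[:40] + "...")  # first characters of the base64 payload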
For this example, use the OpenAI client to call the LLM hosted on STACKIT. Install the library with pip install openai~=1.61.1.
from openai import OpenAI
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)
models = client.models.list()
models
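The listing above returns the models served at the endpoint. As a small sketch (the id fields follow the OpenAI API schema), you can verify programmatically that your configured model is among them:

served_ids = [m.id for m in models.data]  # model identifiers exposed by the endpoint
assert model in served_ids, f"{model} is not served at {base_url}"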
Load an image in base64 and prompt the Vision Language Model (VLM) about its content. Treat the chat conversation as usual, using content parts of type text. Additionally, attach images via the content type image_url, as shown below.
question = "What do you see in the image?"
image_base64 = encode_local_image("img.png")
chat_completion_from_base64 = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                # The MIME type in the data URI must match the loaded file (img.png), hence image/png.
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }],
    model=model,
    max_completion_tokens=50,
)
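Depending on the deployment, the image_url content part may also accept a plain HTTPS URL instead of a data URI. A sketch, assuming the endpoint is allowed to fetch remote images (the URL below is hypothetical):

chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": "https://example.com/img.png"}},  # hypothetical image URL
        ],
    }],
    model=model,
    max_completion_tokens=50,
)

In both cases, the answer is read from choices[0].message.content, as shown next for the base64 variant.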
answer = chat_completion_from_base64.choices[0].message.content

from IPython.display import display, Markdown
display(Markdown(answer))

This concludes the basic tutorial on image understanding with Vision Language Models on STACKIT. Key takeaways:
- Images can be used in chat conversations as base64-encoded strings.
- A Vision Language Model can describe image content and extract information from images.
For advanced goals achievable with LLMs, see our other tutorials.