Chat with images - vision understanding (basics)
This tutorial shows how to use large language models (LLMs) with vision understanding.
Configure access to STACKIT LLM instances
After you create a STACKIT AI Model Serving auth token (see Manage auth tokens), provide it as api_key. From Available shared models, choose a model and provide the model name and the base URL. This tutorial requires a model that supports vision input.
You are now ready to explore LLM capabilities. If you switch the LLM provider in an existing application, you may only need to adjust the base URL and key because STACKIT AI Model Serving is compatible with the OpenAI API.
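The snippets below read the configuration from an environment file. As a sketch, the ../.env file loaded in the next cell could contain entries like these (the values are placeholders, not real credentials):

STACKIT_MODEL_SERVING_CHAT_MODEL="<name-of-a-vision-capable-model>"
STACKIT_MODEL_SERVING_BASE_URL="https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
STACKIT_MODEL_SERVING_API_KEY="ey..."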
import os
from dotenv import load_dotenv
load_dotenv("../.env")
model = os.environ["STACKIT_MODEL_SERVING_CHAT_MODEL"]  # Select a chat model with vision support from https://support.docs.stackit.cloud/stackit/en/models-licenses-319914532.html
base_url = os.environ["STACKIT_MODEL_SERVING_BASE_URL"]  # For example: "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
api_key = os.environ["STACKIT_MODEL_SERVING_API_KEY"]  # For example: "ey..."

Define a helper function that loads an image in base64 encoding for later use.
import base64
def encode_local_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
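As a quick sanity check (assuming an image file img.png sits next to this notebook, as in the example further below), the helper returns a plain base64 string:

preview = encode_local_image("img.png")
print(preview[:40] + "...")  # first characters of the base64 payload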
For this example, use the OpenAI client to call the LLM hosted on STACKIT. Install the library with pip install openai~=1.61.1.
from openai import OpenAI
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)
models = client.models.list()
models
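The listing above returns the models served at the endpoint. As a small sketch (the id fields follow the OpenAI API schema), you can verify programmatically that your configured model is among them:

served_ids = [m.id for m in models.data]  # model identifiers exposed by the endpoint
assert model in served_ids, f"{model} is not served at {base_url}"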
Load an image in base64 and prompt the Vision Language Model (VLM) about its content. Treat the chat conversation as usual, using content parts of type text. Additionally, attach images via the content type image_url, as shown below.
question = "What do you see in the image?"
image_base64 = encode_local_image("img.png")
chat_completion_from_base64 = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                # The MIME type in the data URI must match the loaded file (img.png), hence image/png.
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            },
        ],
    }],
    model=model,
    max_completion_tokens=50,
)
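Depending on the deployment, the image_url content part may also accept a plain HTTPS URL instead of a data URI. A sketch, assuming the endpoint is allowed to fetch remote images (the URL below is hypothetical):

chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": "https://example.com/img.png"}},  # hypothetical image URL
        ],
    }],
    model=model,
    max_completion_tokens=50,
)

In both cases, the answer is read from choices[0].message.content, as shown next for the base64 variant.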
answer = chat_completion_from_base64.choices[0].message.content

from IPython.display import display, Markdown
display(Markdown(answer))

This concludes the basic tutorial on image understanding with Vision Language Models on STACKIT. Key takeaways:
- Images can be used in chat conversations as base64-encoded strings.
- A Vision Language Model can describe image content and extract information from images.
For advanced goals achievable with LLMs, see our other tutorials.