FAQ
We want to give our customers the information they need to get the most out of STACKIT AI Model Serving. This FAQ answers common questions so that you can quickly find solutions and improve your experience. We encourage you to check these FAQs before contacting our support team, as you might find your answer here.
General information
Which clients can be used with the STACKIT AI Model Serving?
STACKIT AI Model Serving provides an OpenAI-compatible API, so it can be used with most OpenAI-compatible clients and SDKs. Only the API base URL and the authentication token need to be adjusted in the client configuration.
Where does my data go?
We do not store any customer data from the requests. Your data belongs solely to you and is not stored or used by us. We do not train any models using your data.
What data is used to train the LLMs?
We serve open-source models only. These models are publicly available on Hugging Face, accompanied by their individual model cards, which provide details on, for example, the training data, training procedure, and model architecture. We do not train these models with any data, nor do we store any customer data.
Which models are offered?
With STACKIT AI Model Serving, we aim to provide state-of-the-art LLMs for our customers. The offered models are selected carefully. An up-to-date table of shared models can be found on Getting Started with Shared Models. We focus on the best available open-source models while keeping a stable portfolio. If you require models beyond the shared-models offering, please create a service request in the STACKIT Help Center.
I need a specific model. Can you serve it for me?
In case our model portfolio does not cover your requirements, please create a service request in the STACKIT Help Center; we are happy to hear about your requirements and find a solution that covers your needs.
Can I use multiple models with a single authentication token? / Do I need different authentication tokens for different model types (e.g. embedding-models, chat-models)?
An authentication token, the STACKIT AI Model Serving Auth Token, is valid for all shared models across all model types. A single token can be used, for example, both to compute embeddings and to answer questions via the chat completions API.
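As a minimal sketch of this, the two requests below share one Authorization header built from a single token. The token value and the embedding model name are placeholders; the chat model name and base URL are taken from this document.

```python
import json

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"
TOKEN = "YOUR_STACKIT_AI_MODEL_SERVING_AUTH_TOKEN"  # placeholder: create one in the STACKIT Portal UI

# The same Authorization header works for every shared model type.
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# Chat completion request with a chat model from this document:
chat_request = {
    "url": f"{BASE_URL}/chat/completions",
    "headers": headers,
    "body": json.dumps({
        "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
        "messages": [{"role": "user", "content": "What is the capital of Germany?"}],
    }),
}

# Embedding request; "your-embedding-model" is a placeholder, pick an
# embedding model from the shared-models table:
embedding_request = {
    "url": f"{BASE_URL}/embeddings",
    "headers": headers,  # identical token, no second credential needed
    "body": json.dumps({
        "model": "your-embedding-model",
        "input": "What is the capital of Germany?",
    }),
}
```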
Why does Nextcloud Assistant respond only after approx. 5 minutes?
Nextcloud Assistant works with background tasks in Nextcloud. By default, these tasks are picked up every 5 minutes. Refer to the official Nextcloud documentation for performance improvements.
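For reference, Nextcloud background jobs are typically driven by a system-cron entry like the one below (the installation path is an assumption; adjust it to your setup). The Nextcloud admin documentation covers this and further tuning options.

```shell
# Crontab entry for the web-server user (e.g. www-data):
# run Nextcloud background jobs every 5 minutes.
# /var/www/nextcloud is an assumed installation path.
*/5 * * * * php -f /var/www/nextcloud/cron.php
```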
Errors
Why does my authentication token (aka API-key) not work?
STACKIT AI Model Serving provides an OpenAI-compatible API. Therefore, the service integrates well with most OpenAI-compatible clients. To use models provided by STACKIT AI Model Serving instead of OpenAI, the following configurations must be adjusted accordingly:
- API base URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
- API key / authentication token / secret key: your STACKIT AI Model Serving Auth Token (refer to Getting Started with the STACKIT Portal UI to create one in the STACKIT Portal UI)
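As a sketch of how these two settings wire into a plain HTTP request (using only the Python standard library; the token value is a placeholder), the helper below builds, but does not send, a chat completion request:

```python
import json
import urllib.request

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def chat_completion_request(token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # the STACKIT AI Model Serving Auth Token
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_completion_request(
    "YOUR_AUTH_TOKEN",  # placeholder
    "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "What is the capital of Germany?",
)
# urllib.request.urlopen(req) would send it once a valid token is set.
```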
How can I resolve a "404 Not Found" error from the API?
This error occurs when a requested resource cannot be found. Most likely this is due to an incorrect `model` parameter in the request body.
Beware that all our models are exclusive to their model type (e.g., chat, embedding). This means a chat model cannot be used to compute embeddings, and vice versa.
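One way to catch this mismatch before it reaches the API is a small client-side guard. This is a hypothetical helper, and the type mapping below is illustrative only; the shared-models table remains the authoritative source.

```python
# Hypothetical guard: keep MODEL_TYPES in sync with the shared-models table.
MODEL_TYPES = {
    "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic": "chat",
    "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8": "chat",
}

ENDPOINT_TYPES = {"chat/completions": "chat", "embeddings": "embedding"}

def check_model_for_endpoint(model: str, endpoint: str) -> None:
    """Fail fast locally instead of receiving a 404 from the API."""
    expected = ENDPOINT_TYPES[endpoint]
    actual = MODEL_TYPES.get(model)
    if actual is None:
        raise ValueError(f"Unknown model {model!r}: check the shared-models list.")
    if actual != expected:
        raise ValueError(f"{model} is a {actual} model and cannot be used with /{endpoint}.")
```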
Refer to the Getting Started with Shared Models documentation for a list of available models and their types.
My requests result in a "LengthFinishReasonError", especially when working with structured output.
This problem can be solved by adjusting the `frequency_penalty` parameter. A value of 0.7 or higher has proven to be sufficient.
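A minimal sketch of such a structured-output request body follows. The model name, prompt, and JSON schema are illustrative; only the `frequency_penalty` value of 0.7 comes from the advice above.

```python
# Structured-output request with repetition damped via frequency_penalty.
# Schema and prompt are illustrative assumptions, not from the original document.
payload = {
    "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "messages": [{"role": "user", "content": "List three German cities as JSON."}],
    "frequency_penalty": 0.7,  # discourages repetition that exhausts the token limit
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "cities",
            "schema": {
                "type": "object",
                "properties": {"cities": {"type": "array", "items": {"type": "string"}}},
                "required": ["cities"],
            },
        },
    },
}
```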
Known issues
Unexpected Tool Calling with Empty `tools` Parameter
We have observed inconsistent behavior in specific Llama-based models when the `tools` parameter is present in the request body but provided as an empty array (`"tools": []`). Contrary to the expected OpenAI-compatible behavior, where an empty tools list should be ignored, these models interpret the presence of the parameter as a signal to enter "tool-calling mode". Consequently, the model may ignore the natural-language prompt and instead generate a hallucinated JSON function call.

This behavior currently affects the following tool-calling enabled models:
- neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
- cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic
Workaround
To avoid hallucinated tool calls, do not send the `tools` parameter with an empty list. If your client logic dynamically constructs requests, ensure that the `tools` key is completely omitted from the JSON payload when no tools are available, rather than passing `[]` or `null`.
Incorrect Request (Causes Hallucination)
Sending an empty array forces the model to attempt a function call.

```python
payload = {
    "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "messages": [{"role": "user", "content": "What is the capital of Germany?"}],
    "tools": [],  # This causes the issue
}
```

Correct Request (Recommended)
Omit the key entirely for standard text generation.

```python
payload = {
    "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "messages": [{"role": "user", "content": "What is the capital of Germany?"}],
}
```