Available Shared Models

The provided model is an 8-bit quantized version of the original Qwen3-VL 235B A22B.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation excels at visually interacting with GUIs on PCs and mobile devices and at completing tasks autonomously. It boosts visual coding by generating Draw.io diagrams or HTML/CSS/JS from images or videos. With advanced spatial perception, it accurately judges object positions, occlusions, and viewpoints, enabling robust 2D and emerging 3D grounding for spatial reasoning and embodied AI applications. Its upgraded visual recognition handles an expansive range of subjects thanks to broader, higher-quality pretraining. Expanded OCR supports 32 languages, handles challenging conditions, parses complex documents, and pairs with text understanding that rivals pure LLMs, enabling unified vision-text comprehension.

  • POST /chat/completions
  • POST /completions
  • GET /models
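
As a rough sketch of how an image-grounded request might look, the snippet below posts an OpenAI-style multimodal message to /chat/completions. The base URL, model identifier, and bearer-token authentication are illustrative assumptions, not values defined on this page.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL; use your deployment's URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "qwen3-vl-235b-a22b",        # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI shown in this screenshot."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```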

The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B.

The Meta Llama 3.3 model is a significantly enhanced 70 billion parameter auto-regressive language model, offering performance comparable to the 405B parameter Llama 3.1 model. It was trained on a new mix of publicly available online data. This model is capable of processing and generating multilingual text, and can also produce code. It uses Grouped-Query Attention (GQA) for improved inference scalability. The model was pretrained on over 15 trillion tokens, and its knowledge cutoff is December 2023. Meta Llama 3.3 supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, and chatbots.

  • POST /chat/completions
  • POST /completions
  • GET /models
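
A minimal chat request against /chat/completions could look like the following sketch; the base URL and model identifier are placeholders, not values defined here.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "llama-3.3-70b",             # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the differences between RAG and fine-tuning."},
    ],
    "max_tokens": 512,
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```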

This model is an open-weight model designed for powerful reasoning, with 4-bit (MXFP4) quantization and a total of 120 billion parameters.

The GPT-OSS 120B model was trained on a broad mix of publicly available data and post-trained for strong reasoning, tool use, and general assistant-style tasks. The model supports long-context processing (up to 131k tokens), produces high-quality text and code, and is designed for agentic applications such as RAG systems, code assistants, and AI tools.

  • POST /chat/completions
  • POST /completions
  • GET /models
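
For the plain /completions endpoint, a hedged sketch of a raw prompt-completion request (again with an assumed base URL and model identifier) might look like this:

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "gpt-oss-120b",              # assumed model identifier
    "prompt": "Write a Python function that reverses a string:\n",
    "max_tokens": 128,
    "temperature": 0.2,
}

resp = requests.post(f"{API_BASE}/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```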

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.

The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, and chatbots.

  • POST /chat/completions
  • POST /completions
  • GET /models
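
The GET /models endpoint listed for each model can be used to discover which model identifiers are actually deployed. A short sketch, assuming the same placeholder base URL and bearer-token auth as in the examples above:

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(f"{API_BASE}/models", headers=headers, timeout=30)
resp.raise_for_status()

# Print the identifier of every available model.
for model in resp.json()["data"]:
    print(model["id"])
```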

The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407.

The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency.

The model is intended for commercial and research use in English, particularly for assistant-like chat applications.

  • POST /chat/completions
  • POST /completions
  • GET /models

The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B.

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

It is optimized for multilingual dialogue use cases and outperforms many available open source and closed chat models on common industry benchmarks.

  • POST /chat/completions
  • POST /completions
  • GET /models

This is an embedding model and has no chat capabilities.

The E5 Mistral 7B Instruct model is a powerful language model that excels at text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it is well-suited for tasks like passage ranking and retrieval. It handles input sequences of up to 4096 tokens, which makes it applicable to longer documents. Because its performance may degrade for other languages, it is recommended for English-only tasks. Overall, E5 Mistral 7B Instruct offers a robust and efficient solution for text embedding in natural language processing applications.

  • POST /embeddings
  • GET /models
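
A sketch of a text embedding request against /embeddings follows; the base URL and model identifier are assumptions, not values defined on this page.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "e5-mistral-7b-instruct",    # assumed model identifier
    "input": [
        "How do transformers handle long sequences?",
        "Transformers apply attention over all pairs of tokens in the input.",
    ],
}

resp = requests.post(f"{API_BASE}/embeddings", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), len(vectors[0]))      # expect 2 vectors of dimension 4096
```

Note that E5-style models typically expect task-specific instructions or prefixes on queries; consult the model card for the exact prompt format.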

This is an embedding model; it computes semantic embedding vectors from chat messages containing text and image content, as well as from pure text input.

Qwen3-VL-Embedding-8B is a state-of-the-art multimodal embedding model developed by the Qwen team. Released in early 2026, it is designed to project various data types — text and images — into a unified semantic vector space. Unlike traditional text-only embedding models, it enables Cross-Modal Retrieval (e.g., using text to search images or using images to find related documents). It supports over 30 languages and offers an embedding dimension of 4096.

  • POST /embeddings
  • GET /models
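
Since the exact request schema for mixed text/image input is not specified on this page, the following is only a guess at a plausible payload shape (message-style input with OpenAI-style image_url parts); the base URL, model identifier, and the entire input structure are assumptions to verify against your deployment's documentation.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# The message-style, mixed text/image input below is an assumed payload
# shape, not a documented schema; check your deployment's docs.
payload = {
    "model": "qwen3-vl-embedding-8b",     # assumed model identifier
    "input": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "A red bicycle leaning against a brick wall"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/bike.jpg"}},
            ],
        }
    ],
}

resp = requests.post(f"{API_BASE}/embeddings", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(len(resp.json()["data"][0]["embedding"]))   # expected: 4096
```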

Model limits ensure the availability of the models to all users and guarantee fair use.

Tokens per minute: The TPM limit is calculated by adding the prompt tokens to the generation tokens, with generation tokens weighted by a factor of 5.
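
A one-line sketch of that weighting: a request's TPM cost is its prompt tokens plus five times its generation tokens.

```python
def weighted_tpm_cost(prompt_tokens: int, generation_tokens: int) -> int:
    """TPM cost as described above: prompt tokens + 5 x generation tokens."""
    return prompt_tokens + 5 * generation_tokens

# A request with a 1,000-token prompt that generates 200 tokens consumes
# 1,000 + 5 * 200 = 2,000 tokens of the per-minute budget.
print(weighted_tpm_cost(1000, 200))  # -> 2000
```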

Requests per minute: The RPM limit caps the number of API requests that can be issued per minute.