Available shared models

The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B.

The Meta Llama 3.3 model is a significantly enhanced 70-billion-parameter auto-regressive language model, offering performance comparable to the 405-billion-parameter Llama 3.1 model. It was trained on a new mix of publicly available online data comprising over 15 trillion tokens, with a knowledge cutoff of December 2023. The model processes and generates multilingual text and can also produce code, and it uses grouped-query attention (GQA) for improved inference scalability. It supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, and chatbots.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input and output are text.
Features: Tool calling enabled
Context length: 128K tokens
Number of parameters: 70.6 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
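Since the service exposes the OpenAI chat API with tool calling enabled, a request can be built with only the Python standard library. The sketch below assembles a tool-calling payload; the model id `llama-3.3-70b-instruct`, the `get_weather` tool, and the key handling are illustrative assumptions — the model ids actually served come from GET /models.

```python
import json
import urllib.request

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def build_chat_request(model, messages, tools=None):
    """Assemble an OpenAI-compatible /chat/completions payload."""
    payload = {"model": model, "messages": messages}
    if tools:
        payload["tools"] = tools
    return payload

# Hypothetical model id and tool; query GET /models for the ids actually served.
payload = build_chat_request(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

def post_json(path, body, api_key):
    """POST a JSON body to the service and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Uncomment with a valid key to actually send the request:
# reply = post_json("/chat/completions", payload, api_key="YOUR_KEY")
```

If the model decides to call the tool, the response carries a `tool_calls` entry instead of plain text, which the client executes and feeds back as a `tool` role message.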

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.

The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, and chatbots.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input is text and image; output is text.
Context length: 37K tokens
Number of parameters: 27.4 billion (16-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Google AI
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
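Because Gemma 3 takes image input through the same OpenAI-compatible chat endpoint, an image is typically embedded in the message as a base64 data URI. A minimal sketch of building such a multimodal message, assuming the standard OpenAI `image_url` content-part convention; the placeholder bytes stand in for a real image file.

```python
import base64

def image_message(text, image_bytes, mime="image/png"):
    """Build a multimodal user message: one text part plus one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Placeholder bytes stand in for a real PNG read with open(path, "rb").read().
msg = image_message("Describe this chart.", b"\x89PNG\r\n")
```

The resulting message goes into the `messages` list of a normal POST /chat/completions request; the reply is plain text, matching the modalities listed above.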

The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407.

The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency.

The model is intended for commercial and research use in English, particularly for assistant-like chat applications.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input and output are text.
Context length: 128K tokens
Number of parameters: 12.2 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
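Besides POST /chat/completions, the endpoint list also includes the legacy POST /completions route, which takes a raw prompt string instead of a message list. A minimal sketch of its payload; the model id is an assumption and the sampling parameters are illustrative defaults, not values mandated by the service.

```python
def build_completion_request(model, prompt, max_tokens=128, temperature=0.7):
    """Assemble a legacy OpenAI-compatible /completions payload."""
    return {
        "model": model,           # hypothetical id; check GET /models
        "prompt": prompt,         # raw string, not a chat message list
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_request(
    "mistral-nemo-instruct-2407",
    "Translate 'Guten Morgen' into French:",
)
```

For an instruct-tuned model like this one, POST /chat/completions is usually the better fit, since the server then applies the model's chat template for you.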

The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B.

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

It is optimized for multilingual dialogue use cases and outperforms many available open source and closed chat models on common industry benchmarks.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Standard
Modalities: Input and output are text.
Features: Tool calling enabled
Context length: 128K tokens
Number of parameters: 8.03 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
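All of the models above share one base URL, so the ids actually served are discovered via GET /models. A minimal sketch of fetching and parsing that list; the sample response shape follows the OpenAI list-of-models convention, and the id `llama-3.1-8b-instruct` is illustrative, so the live call is left commented out.

```python
import json
import urllib.request

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def model_ids(models_response):
    """Extract ids from an OpenAI-style GET /models response."""
    return [m["id"] for m in models_response.get("data", [])]

def fetch_model_ids(api_key):
    """Call GET /models and return the list of served model ids."""
    req = urllib.request.Request(
        BASE_URL + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return model_ids(json.load(resp))

# Illustrative response shape; the real ids may differ.
sample = {"object": "list",
          "data": [{"id": "llama-3.1-8b-instruct", "object": "model"}]}
# ids = fetch_model_ids(api_key="YOUR_KEY")
```

Whatever id this call returns for a model is the value to pass in the `model` field of the chat and completion requests.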

This is an embedding model and has no chat capabilities.

The E5 Mistral 7B Instruct model is a language model specialized for text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it is well-suited for tasks like passage ranking and retrieval. It is recommended for English-only use, as its performance may degrade for other languages. It can handle input sequences of up to 4096 tokens, which makes it suitable for long documents. Overall, the E5 Mistral 7B Instruct model offers a robust and efficient solution for text embedding in natural language processing applications.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Embedding
Category: Embedding-Standard
Modalities: Input is text; output is embeddings.
Features: Tool calling enabled
Maximum input tokens: 4096
Output dimension: 4096
Number of parameters: 7 billion
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 600
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /completions
  • GET /models
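The endpoint list above shows no dedicated embeddings route, but OpenAI-compatible servers conventionally expose POST /embeddings for models of this type — treat that path, and the model id below, as assumptions to verify against GET /models. A sketch of the request payload plus a cosine-similarity helper for comparing the 4096-dimensional vectors the model returns:

```python
import math

def build_embedding_request(model, texts):
    """Assemble an OpenAI-style embeddings payload (path assumed: /embeddings)."""
    return {"model": model, "input": texts}

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical model id; the served id comes from GET /models.
payload = build_embedding_request("e5-mistral-7b-instruct",
                                  ["query: how do embeddings work?"])
```

The `"query: "` prefix follows the E5 family's convention of marking queries and passages differently for retrieval; for symmetric similarity tasks it can be omitted.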