Available shared models
Text Models
Llama 3.3 70B
Full Name: cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic
The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B.
The Meta Llama 3.3 model is a significantly enhanced 70-billion-parameter auto-regressive language model, offering performance comparable to the 405B-parameter Llama 3.1 model. It was trained on a new mix of publicly available online data and can process and generate multilingual text as well as code. The architecture uses Grouped-Query Attention (GQA) for improved inference scalability. The model was trained on over 15 trillion tokens, and its knowledge cutoff is December 2023. It supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, and chatbots.
| URL | https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1 |
| Type | Chat |
| Category | LLM-Plus |
| Modalities | Input and output are text. |
| Features | Tool calling enabled |
| Context length | 128K tokens |
| Number of parameters | 70.6 billion (8-bit quantization) |
| Specification | OpenAI Compatible |
| TPM limit* | 200000 |
| RPM limit** | 80 |
| License | License on Hugging Face |
| Status | Supported |
Available endpoints
- POST /chat/completions
- POST /completions
- GET /models
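As a sketch of how a call to these endpoints might look, the request body below follows the OpenAI `/chat/completions` schema. The prompt, parameter values, and API-key handling are illustrative assumptions, not part of this documentation:

```python
import json

# Base URL of the shared model serving endpoint (from the table above).
BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def build_chat_request(model: str, user_message: str, **params) -> dict:
    """Assemble an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        **params,
    }

body = build_chat_request(
    "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
    "Summarize the Llama 3.3 release in one sentence.",
    max_tokens=128,
    temperature=0.2,
)
payload = json.dumps(body)
```

The payload can then be POSTed to `BASE_URL + "/chat/completions"` with an `Authorization: Bearer <token>` header, either with any HTTP client or by pointing the official `openai` SDK's `base_url` at the URL above.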
Gemma 3 27B
Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.
The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, and chatbots.
| URL | https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1 |
| Type | Chat |
| Category | LLM-Plus |
| Modalities | Inputs are text and images; output is text. |
| Context length | 37K tokens |
| Number of parameters | 27.4 billion (16-bit precision) |
| Specification | OpenAI Compatible |
| TPM limit* | 200000 |
| RPM limit** | 80 |
| License | License on Google AI |
| Status | Supported |
Available endpoints
- POST /chat/completions
- POST /completions
- GET /models
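Since Gemma 3 accepts images, a chat message's content becomes a list of typed parts rather than a plain string. The sketch below uses the OpenAI multimodal message format; the placeholder image bytes and model id are illustrative assumptions:

```python
import base64

# Stand-in bytes for a real image file (normally read from disk).
image_bytes = b"\x89PNG\r\n\x1a\n"
image_b64 = base64.b64encode(image_bytes).decode("ascii")

# Multimodal user message: one text part plus one image part,
# the image embedded as a base64 data URL.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this document?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_b64}"},
        },
    ],
}

body = {
    "model": "gemma-3-27b",  # placeholder; use the id returned by GET /models
    "messages": [message],
}
```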
Mistral-Nemo
Full Name: neuralmagic/Mistral-Nemo-Instruct-2407-FP8
The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407.
The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency.
The model is intended for commercial and research use in English, particularly for assistant-like chat applications.
| URL | https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1 |
| Type | Chat |
| Category | LLM-Plus |
| Modalities | Input and output are text. |
| Context length | 128K tokens |
| Number of parameters | 12.2 billion (8-bit quantization) |
| Specification | OpenAI Compatible |
| TPM limit* | 200000 |
| RPM limit** | 80 |
| License | License on Hugging Face |
| Status | Supported |
Available endpoints
- POST /chat/completions
- POST /completions
- GET /models
Llama 3.1 8B
Full Name: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B.
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
It is optimized for multilingual dialogue use cases and outperforms many available open source and closed chat models on common industry benchmarks.
| URL | https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1 |
| Type | Chat |
| Category | LLM-Standard |
| Modalities | Input and output are text. |
| Features | Tool calling enabled |
| Context length | 128K tokens |
| Number of parameters | 8.03 billion (8-bit quantization) |
| Specification | OpenAI Compatible |
| TPM limit* | 200000 |
| RPM limit** | 80 |
| License | License on Hugging Face |
| Status | Supported |
Available endpoints
- POST /chat/completions
- POST /completions
- GET /models
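Because tool calling is enabled for this model, a `tools` array in the OpenAI function-calling format can be attached to the request. The function name and schema below are invented for illustration:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Chat request that lets the model decide whether to call the tool.
body = {
    "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}
```

If the model chooses to call the tool, the response message carries a `tool_calls` entry with the function name and JSON-encoded arguments instead of plain text.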
Embedding Models
E5 Mistral 7B
Section titled “E5 Mistral 7B”Full Name: intfloat/e5-mistral-7b-instruct
This is an embedding model and has no chat capabilities.
The E5 Mistral 7B Instruct model excels at text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it is well-suited to passage ranking and retrieval, and it handles input sequences of up to 4096 tokens. It is recommended for English-only use, as its performance may degrade for other languages.
| URL | https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1 |
| Type | Embedding |
| Category | Embedding-Standard |
| Modalities | Input is text; output is an embedding vector. |
| Maximum input tokens | 4096 |
| Output dimension | 4096 |
| Number of parameters | 7 Billion |
| Specification | OpenAI Compatible |
| TPM limit* | 200000 |
| RPM limit** | 600 |
| License | License on Hugging Face |
| Status | Supported |
Available endpoints
- POST /completions
- GET /models
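Embedding requests on OpenAI-compatible servers conventionally send an `input` field with one or more texts. The request shape below is a sketch under that assumption, not copied from this documentation:

```python
# Texts to embed; each must stay within the 4096-token maximum
# listed in the table above.
texts = [
    "STACKIT model serving",
    "shared embedding models",
]

# OpenAI-compatible embeddings request body.
body = {
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": texts,
}
```

The response contains one 4096-dimensional vector per input under `data[i].embedding`, in the same order as the inputs.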