Available shared models

The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B.

The Meta Llama 3.3 model is a significantly enhanced 70-billion-parameter auto-regressive language model, offering performance comparable to the 405-billion-parameter Llama 3.1 model. It was trained on a new mix of publicly available online data comprising over 15 trillion tokens, with a knowledge cutoff of December 2023. The model processes and generates multilingual text and can also produce code, and it uses grouped-query attention (GQA) for improved inference scalability. It supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, and chatbots.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input and output are text.
Features: Tool calling enabled
Context length: 128K tokens
Number of parameters: 70.6 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
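Since the service exposes the OpenAI chat API with tool calling enabled, a request can be built with only the Python standard library. The sketch below assembles a tool-calling payload; the model id `llama-3.3-70b-instruct`, the `get_weather` tool, and the key handling are illustrative assumptions — the model ids actually served come from GET /models.

```python
import json
import urllib.request

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def build_chat_request(model, messages, tools=None):
    """Assemble an OpenAI-compatible /chat/completions payload."""
    payload = {"model": model, "messages": messages}
    if tools:
        payload["tools"] = tools
    return payload

# Hypothetical model id and tool; query GET /models for the ids actually served.
payload = build_chat_request(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

def post_json(path, body, api_key):
    """POST a JSON body to the service and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Uncomment with a valid key to actually send the request:
# reply = post_json("/chat/completions", payload, api_key="YOUR_KEY")
```

If the model decides to call the tool, the response carries a `tool_calls` entry instead of plain text, which the client executes and feeds back as a `tool` role message.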

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.

The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, and chatbots.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input is text and image; output is text.
Context length: 37K tokens
Number of parameters: 27.4 billion (16-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Google AI
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
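Because Gemma 3 takes image input through the same OpenAI-compatible chat endpoint, an image is typically embedded in the message as a base64 data URI. A minimal sketch of building such a multimodal message, assuming the standard OpenAI `image_url` content-part convention; the placeholder bytes stand in for a real image file.

```python
import base64

def image_message(text, image_bytes, mime="image/png"):
    """Build a multimodal user message: one text part plus one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Placeholder bytes stand in for a real PNG read with open(path, "rb").read().
msg = image_message("Describe this chart.", b"\x89PNG\r\n")
```

The resulting message goes into the `messages` list of a normal POST /chat/completions request; the reply is plain text, matching the modalities listed above.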

The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407.

The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency.

The model is intended for commercial and research use in English, particularly for assistant-like chat applications.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Plus
Modalities: Input and output are text.
Context length: 128K tokens
Number of parameters: 12.2 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
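Besides POST /chat/completions, the endpoint list also includes the legacy POST /completions route, which takes a raw prompt string instead of a message list. A minimal sketch of its payload; the model id is an assumption and the sampling parameters are illustrative defaults, not values mandated by the service.

```python
def build_completion_request(model, prompt, max_tokens=128, temperature=0.7):
    """Assemble a legacy OpenAI-compatible /completions payload."""
    return {
        "model": model,           # hypothetical id; check GET /models
        "prompt": prompt,         # raw string, not a chat message list
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_request(
    "mistral-nemo-instruct-2407",
    "Translate 'Guten Morgen' into French:",
)
```

For an instruct-tuned model like this one, POST /chat/completions is usually the better fit, since the server then applies the model's chat template for you.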

The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B.

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

It is optimized for multilingual dialogue use cases and outperforms many available open source and closed chat models on common industry benchmarks.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Chat
Category: LLM-Standard
Modalities: Input and output are text.
Features: Tool calling enabled
Context length: 128K tokens
Number of parameters: 8.03 billion (8-bit quantization)
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 80
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /chat/completions
  • POST /completions
  • GET /models
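All of the models above share one base URL, so the ids actually served are discovered via GET /models. A minimal sketch of fetching and parsing that list; the sample response shape follows the OpenAI list-of-models convention, and the id `llama-3.1-8b-instruct` is illustrative, so the live call is left commented out.

```python
import json
import urllib.request

BASE_URL = "https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1"

def model_ids(models_response):
    """Extract ids from an OpenAI-style GET /models response."""
    return [m["id"] for m in models_response.get("data", [])]

def fetch_model_ids(api_key):
    """Call GET /models and return the list of served model ids."""
    req = urllib.request.Request(
        BASE_URL + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return model_ids(json.load(resp))

# Illustrative response shape; the real ids may differ.
sample = {"object": "list",
          "data": [{"id": "llama-3.1-8b-instruct", "object": "model"}]}
# ids = fetch_model_ids(api_key="YOUR_KEY")
```

Whatever id this call returns for a model is the value to pass in the `model` field of the chat and completion requests.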

This is an embedding model and has no chat capabilities.

The E5 Mistral 7B Instruct model is a language model specialized for text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it is well-suited for tasks like passage ranking and retrieval. It is recommended for English-only use, as its performance may degrade for other languages. It can handle input sequences of up to 4096 tokens, which makes it suitable for long documents. Overall, the E5 Mistral 7B Instruct model offers a robust and efficient solution for text embedding in natural language processing applications.

URL: https://api.openai-compat.model-serving.eu01.onstackit.cloud/v1
Type: Embedding
Category: Embedding-Standard
Modalities: Input is text; output is embeddings.
Features: Tool calling enabled
Maximum input tokens: 4096
Output dimension: 4096
Number of parameters: 7 billion
Specification: OpenAI Compatible
TPM limit*: 200,000
RPM limit**: 600
License: License on Hugging Face
Status: Supported
Endpoints:
  • POST /completions
  • GET /models
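The endpoint list above shows no dedicated embeddings route, but OpenAI-compatible servers conventionally expose POST /embeddings for models of this type — treat that path, and the model id below, as assumptions to verify against GET /models. A sketch of the request payload plus a cosine-similarity helper for comparing the 4096-dimensional vectors the model returns:

```python
import math

def build_embedding_request(model, texts):
    """Assemble an OpenAI-style embeddings payload (path assumed: /embeddings)."""
    return {"model": model, "input": texts}

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical model id; the served id comes from GET /models.
payload = build_embedding_request("e5-mistral-7b-instruct",
                                  ["query: how do embeddings work?"])
```

The `"query: "` prefix follows the E5 family's convention of marking queries and passages differently for retrieval; for symmetric similarity tasks it can be omitted.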