Available Shared Models

The provided model is an 8-bit quantized version of the original Qwen3-VL 235B A22B.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.

This generation excels at visually interacting with GUIs on PCs and mobile devices and at completing tasks autonomously. It boosts visual coding by generating Draw.io diagrams or HTML/CSS/JS from images or videos. With advanced spatial perception, it accurately judges object positions, occlusions, and viewpoints, enabling robust 2D and emerging 3D grounding for spatial reasoning and embodied AI applications. Its upgraded visual recognition handles an expansive range of subjects thanks to broader, higher-quality pretraining. Expanded OCR supports 32 languages, handles challenging conditions, parses complex documents, and pairs with text understanding that rivals pure LLMs, enabling unified vision-text comprehension.

  • POST /chat/completions
  • POST /completions
  • GET /models
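
As a rough sketch of how an image-grounded request might look, the snippet below posts an OpenAI-style multimodal message to /chat/completions. The base URL, model identifier, and bearer-token authentication are illustrative assumptions, not values defined on this page.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL; use your deployment's URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "qwen3-vl-235b-a22b",        # assumed model identifier
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI shown in this screenshot."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```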

The provided model is an 8-bit quantized version of the original Meta Llama 3.3 70B.

The Meta Llama 3.3 model is a significantly enhanced 70 billion parameter auto-regressive language model, offering performance comparable to the 405B parameter Llama 3.1 model. It was trained on a new mix of publicly available online data. This model is capable of processing and generating multilingual text, and can also produce code. It uses Grouped-Query Attention (GQA) for improved inference scalability. The model was pretrained on over 15 trillion tokens, and its knowledge cutoff is December 2023. Meta Llama 3.3 supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

The model is intended for assistant-like chat and can be used in a variety of applications, e.g. agentic AI, RAG, code generation, and chatbots.

  • POST /chat/completions
  • POST /completions
  • GET /models
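
A minimal chat request against /chat/completions could look like the following sketch; the base URL and model identifier are placeholders, not values defined here.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "llama-3.3-70b",             # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the differences between RAG and fine-tuning."},
    ],
    "max_tokens": 512,
}

resp = requests.post(f"{API_BASE}/chat/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```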

This model is an open-weight model designed for powerful reasoning, with 4-bit (MXFP4) quantization and a total of 120 billion parameters.

The GPT-OSS 120B model was trained on a broad mix of publicly available data and post-trained for strong reasoning, tool use, and general assistant-style tasks. The model supports long-context processing (up to 131k tokens), produces high-quality text and code, and is designed for agentic applications such as RAG systems, code assistants, and AI tools.

  • POST /chat/completions
  • POST /completions
  • GET /models
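
For the plain /completions endpoint, a hedged sketch of a raw prompt-completion request (again with an assumed base URL and model identifier) might look like this:

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "gpt-oss-120b",              # assumed model identifier
    "prompt": "Write a Python function that reverses a string:\n",
    "max_tokens": 128,
    "temperature": 0.2,
}

resp = requests.post(f"{API_BASE}/completions", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```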

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma 3 models are multimodal, handling text and image input and generating text output. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well-suited for a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning.

The model is intended for assistant-like chat with vision understanding and can be used in a variety of applications, e.g. image understanding, visual document understanding, agentic AI, RAG, code generation, and chatbots.

  • POST /chat/completions
  • POST /completions
  • GET /models
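
The GET /models endpoint listed for each model can be used to discover which model identifiers are actually deployed. A short sketch, assuming the same placeholder base URL and bearer-token auth as in the examples above:

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(f"{API_BASE}/models", headers=headers, timeout=30)
resp.raise_for_status()

# Print the identifier of every available model.
for model in resp.json()["data"]:
    print(model["id"])
```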

The provided model is an 8-bit quantized version of the original Mistral Nemo Instruct 2407.

The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size. The model was trained with a 128k context window on a large proportion of multilingual and code data. It supports multiple languages, including French, German, Spanish, Italian, Portuguese, Russian, Chinese, and Japanese, with varying levels of proficiency.

The model is intended for commercial and research use in English, particularly for assistant-like chat applications.

  • POST /chat/completions
  • POST /completions
  • GET /models

The provided model is an 8-bit quantized version of the original Meta Llama 3.1 8B.

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Meta Llama 3.1 model supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

It is optimized for multilingual dialogue use cases and outperforms many available open source and closed chat models on common industry benchmarks.

  • POST /chat/completions
  • POST /completions
  • GET /models

This is an embedding model and has no chat capabilities.

The E5 Mistral 7B Instruct model is a powerful language model that excels at text embedding tasks, particularly in English. With 32 layers and an embedding size of 4096, it is well-suited for tasks like passage ranking and retrieval. It handles input sequences of up to 4096 tokens, which makes it applicable to longer documents. Because its performance may degrade for other languages, it is recommended for English-only tasks. Overall, E5 Mistral 7B Instruct offers a robust and efficient solution for text embedding in natural language processing applications.

  • POST /embeddings
  • GET /models
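
A sketch of a text embedding request against /embeddings follows; the base URL and model identifier are assumptions, not values defined on this page.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "model": "e5-mistral-7b-instruct",    # assumed model identifier
    "input": [
        "How do transformers handle long sequences?",
        "Transformers apply attention over all pairs of tokens in the input.",
    ],
}

resp = requests.post(f"{API_BASE}/embeddings", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
vectors = [item["embedding"] for item in resp.json()["data"]]
print(len(vectors), len(vectors[0]))      # expect 2 vectors of dimension 4096
```

Note that E5-style models typically expect task-specific instructions or prefixes on queries; consult the model card for the exact prompt format.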

This is an embedding model; it computes semantic embedding vectors from chat messages containing text and image content, as well as from pure text input.

Qwen3-VL-Embedding-8B is a state-of-the-art multimodal embedding model developed by the Qwen team. Released in early 2026, it is designed to project various data types — text and images — into a unified semantic vector space. Unlike traditional text-only embedding models, it enables Cross-Modal Retrieval (e.g., using text to search images or using images to find related documents). It supports over 30 languages and offers an embedding dimension of 4096.

  • POST /embeddings
  • GET /models
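
Since the exact request schema for mixed text/image input is not specified on this page, the following is only a guess at a plausible payload shape (message-style input with OpenAI-style image_url parts); the base URL, model identifier, and the entire input structure are assumptions to verify against your deployment's documentation.

```python
import requests

API_BASE = "https://api.example.com/v1"   # assumed base URL
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# The message-style, mixed text/image input below is an assumed payload
# shape, not a documented schema; check your deployment's docs.
payload = {
    "model": "qwen3-vl-embedding-8b",     # assumed model identifier
    "input": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "A red bicycle leaning against a brick wall"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/bike.jpg"}},
            ],
        }
    ],
}

resp = requests.post(f"{API_BASE}/embeddings", json=payload,
                     headers=headers, timeout=60)
resp.raise_for_status()
print(len(resp.json()["data"][0]["embedding"]))   # expected: 4096
```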

Model limits ensure the availability of the models to all users and guarantee fair use.

Tokens per minute: The TPM limit is calculated by adding the prompt tokens to the generation tokens, with generation tokens weighted by a factor of 5.
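
A one-line sketch of that weighting: a request's TPM cost is its prompt tokens plus five times its generation tokens.

```python
def weighted_tpm_cost(prompt_tokens: int, generation_tokens: int) -> int:
    """TPM cost as described above: prompt tokens + 5 x generation tokens."""
    return prompt_tokens + 5 * generation_tokens

# A request with a 1,000-token prompt that generates 200 tokens consumes
# 1,000 + 5 * 200 = 2,000 tokens of the per-minute budget.
print(weighted_tpm_cost(1000, 200))  # -> 2000
```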

Requests per minute: The RPM limit caps the number of API requests that can be issued per minute.