Rate Limits on shared models

With shared models, STACKIT AI Model Serving serves compute-intensive AI workloads via a shared API. To ensure that resources remain available to all customers, STACKIT applies rate limits on shared models.

The main goals of rate limits in STACKIT AI Model Serving are:

  • Ensure fair use of shared resources among all users.
  • Protect the service against misuse and potential abuse.

How Rate Limits are Applied on Shared Models
In STACKIT AI Model Serving, we apply rate limits on shared models in the following ways:

  • Two dimensions: STACKIT enforces rate limits based on two dimensions
    • Requests per minute (RPM)
    • Tokens per minute (TPM)
  • Per-project basis: Rate limits are aggregated across all STACKIT AI Model Serving authentication tokens associated with a single STACKIT project.
  • Model-specific limits: Rate limits vary depending on the specific model being used, with details available in Getting Started with Shared Models.

For Tokens Per Minute (TPM) limits, the computation is based on the total number of tokens used, counting both prompt tokens and generation tokens (for embedding models, only prompt tokens are counted). Generation tokens are weighted by a factor of 5, i.e. each generation token counts as much as 5 prompt tokens:

TPM = (prompt tokens + 5 * generation tokens) / minute
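The weighting above can be sketched as a small helper (illustrative only; the function name and thresholds are not part of the API):

```python
def weighted_tokens(prompt_tokens: int, generation_tokens: int) -> int:
    """Tokens counted against the TPM limit: each generation token counts 5x."""
    return prompt_tokens + 5 * generation_tokens

# Example: 1,000 prompt tokens and 200 generation tokens in one minute
# consume 2,000 tokens of the TPM budget.
```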

STACKIT also provides a small burst capacity to accommodate short-term spikes in traffic: the rate limits can be exceeded slightly for a brief period. Sustained usage above the limits is still enforced, and requests may be rejected or throttled to prevent abuse.

Response Headers for Rate Limit Information
When you make a request to STACKIT AI Model Serving, the response includes headers that provide useful information about the remaining capacity and rate limit status. You can use these headers to implement efficient retry mechanisms and optimize your application's usage of the service.

Every response contains rate-limit specific headers:

  • x-ratelimit-limit-requests: The RPM limit for your project.
  • x-ratelimit-limit-tokens: The TPM limit for your project.
  • x-ratelimit-remaining-requests: The number of requests remaining before you get rate-limited.
  • x-ratelimit-remaining-tokens: The number of tokens remaining before you get rate-limited.
  • x-ratelimit-reset-requests: The time remaining until the request rate limit resets.
  • x-ratelimit-reset-tokens: The time remaining until the token rate limit resets.

Once you hit a rate limit (RPM or TPM), the API responds with 429 Too Many Requests. Using the x-ratelimit-reset-* headers in the response, you can determine when to retry the request.

With this information, the client-side application can handle rate limiting according to the specific use-case scenario. However, some general strategies and approaches to deal with rate limits are mentioned below.

To build a resilient application, we recommend implementing the following strategies to manage API rate limits gracefully.

When you receive a rate limit error (429 Too Many Requests), retry the request with a progressively longer delay. Adding “jitter” (a small, random amount of time) to this delay prevents multiple clients from retrying simultaneously.
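A minimal sketch of this retry strategy, assuming a hypothetical `send_request` callable that returns an object with a `status_code` attribute (adapt to whatever HTTP client you use):

```python
import random
import time


def retry_with_backoff(send_request, max_retries=5, base=1.0, cap=30.0):
    """Retry on 429 with exponential backoff plus jitter.

    send_request: zero-argument callable returning a response object with a
    .status_code attribute (hypothetical; not part of the STACKIT API).
    """
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 429:
            return response
        # Delay doubles each attempt, capped, plus up to 10% random jitter
        # so that multiple clients do not retry in lockstep.
        delay = min(cap, base * 2 ** attempt)
        time.sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("Still rate-limited after retries")
```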

Use the x-ratelimit-remaining-* headers returned with every API response to track your available quota. Your application can then intelligently pause new requests when the quota is low, avoiding errors before they happen.
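One way to sketch this proactive check (the function name and thresholds are illustrative, not part of the API; tune them to your workload):

```python
def should_pause(headers: dict, min_requests: int = 5, min_tokens: int = 1000) -> bool:
    """Return True if the previous response's x-ratelimit-remaining-* headers
    indicate the quota is nearly exhausted and new requests should wait."""
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", min_requests))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", min_tokens))
    return remaining_requests < min_requests or remaining_tokens < min_tokens
```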

What to do if standard rate limits are insufficient
If the standard rate limits are not sufficient for your application, please submit a Service Request via STACKIT Help Center to discuss possible adjustments.