Architecture of AI Model Experiments

AI Model Experiments on STACKIT provides a fully managed MLflow™ environment. By leveraging an instance-based architecture, teams can maintain strict isolation between projects while enabling seamless collaboration within their own dedicated environments.

Each instance is a self-contained, managed MLflow™ environment. This multi-instance approach ensures that metadata and access are decoupled across different teams or workstreams. Each instance includes:

  • Tracking server: The central orchestration engine that processes incoming logs via the MLflow™ REST API and Python SDK.

  • Metadata database: A STACKIT-managed database that stores non-blob data like experiments, parameters, and metrics.

  • MLflow™ UI: A dedicated web interface assigned to that specific instance for visualizing results, comparing runs, and managing the model registry.

Unlike the metadata, artifacts (models, images, large datasets) are stored within the user's own STACKIT project space:

  • Sovereignty: You retain ownership and control over your binary data.

  • Access: The MLflow™ UI and SDK communicate directly with your storage bucket to upload and retrieve these files.

Security is managed through a combination of STACKIT permissions and application-level tokens:

  • Management: Admins create and manage instances and access tokens, and assign access roles via the STACKIT Portal or API.

  • Logging data: Engineers use tokens to authenticate the Python SDK against the specific instance URI to log their training data.

  • UI access: Each instance UI is accessible via a unique URI. Access is restricted via STACKIT authentication and the user's assigned roles.
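For token-based SDK authentication, MLflow reads standard environment variables; a minimal sketch, where the URI and token values are placeholders supplied by your instance and admin:

```python
import os

# Placeholder values: the URI identifies your instance, the token is issued
# by an admin via the STACKIT Portal or API.
os.environ["MLFLOW_TRACKING_URI"] = "https://<instance-uri>"
os.environ["MLFLOW_TRACKING_TOKEN"] = "<access-token>"

# With these variables set, the MLflow SDK sends the token as a bearer
# credential on every REST call, so training scripts need no extra auth code:
#   import mlflow
#   with mlflow.start_run():
#       ...
```

Setting credentials through the environment keeps tokens out of source code and makes rotation a deployment-level change rather than a code change.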

You have full lifecycle control over your MLflow™ environments using the STACKIT Portal, CLI, APIs, or Terraform:

  • Provisioning: Create new dedicated AI Model Experiments instances in minutes.

  • Token Management: Issue, revoke, or rotate access tokens for your engineering team.

  • User Access: Grant users permission to log in to the UI.

Two roles interact with each instance:

  • The admin: Responsible for the instance lifecycle. They create the server, set access roles, and issue tokens to the team.

  • The engineer: Uses the provided access tokens and the Python SDK to instrument training scripts, enabling the automated logging of parameters, metrics, and metadata from local environments, remote notebooks, or CI/CD pipelines. Beyond data ingestion, the engineer uses the MLflow™ UI to visualize performance trends, perform side-by-side run comparisons, and manage the model lifecycle by versioning successful candidates.