
Architecture


STACKIT Intake is a fully managed service designed for ingesting data streams into Apache Iceberg tables in Dremio. Its architecture is centered around three key, hierarchical concepts: the Intake, the Intake Runner, and the Intake user.

Diagram of STACKIT Intake's architecture

An Intake represents a complete data ingestion pipeline into a single Apache Iceberg table. It ingests data in JSON format and periodically flushes it to the corresponding Dremio table, ensuring each message is inserted exactly once. Flushing introduces a delay of around 5 minutes before messages appear in the table.

An Intake can either write to an existing table or dynamically create one, inferring the schema from the first message it receives. To handle potential downstream outages, such as Dremio maintenance, Intakes buffer messages for up to 24 hours.
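To illustrate what inferring a schema from the first message can look like, here is a minimal sketch. The type mapping and the function name `infer_schema` are assumptions made for illustration; this page does not describe how STACKIT Intake actually maps JSON values to Iceberg column types.

```python
import json


def infer_schema(first_message: str) -> dict:
    """Derive a flat column -> type mapping from one JSON object.

    Hypothetical mapping for illustration only; the real service may
    choose different types and handle nested values differently.
    """
    record = json.loads(first_message)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object at the top level")

    schema = {}
    for field, value in record.items():
        # bool must be checked before int: bool is a subclass of int in Python.
        if isinstance(value, bool):
            schema[field] = "boolean"
        elif isinstance(value, int):
            schema[field] = "long"
        elif isinstance(value, float):
            schema[field] = "double"
        elif isinstance(value, str):
            schema[field] = "string"
        else:
            # null, arrays and nested objects are not covered by this sketch.
            schema[field] = "unknown"
    return schema


schema = infer_schema('{"sensor_id": "a1", "temperature": 21.5, "ok": true, "count": 3}')
```

Because the schema is derived from a single message, it is worth making sure the first message sent to a new Intake carries every field with a representative value.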

An Intake is defined by:

  • Dremio Instance Parameters & User Credentials: URLs for the Iceberg REST Catalog and authentication endpoint, along with a Dremio Personal Access Token (PAT) for user authentication.
  • Iceberg Table Definition: The namespace and name of the target table. If not provided, a default namespace and an auto-generated name are used.
  • Partitioning Information: The option to partition the Iceberg table either by a field in the JSON message or by ingestion date, using the __intake_ts column that Intake adds automatically. If no partitioning option is given, the Intake does not create any partitions in the target table.
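The three partitioning choices can be sketched as a function that returns the partition value for a single record. This is a simplified model, not the service's implementation: the function name, the parameters, and the assumption that `__intake_ts` is an ISO 8601 timestamp string are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def partition_value(
    record: dict,
    partition_field: Optional[str] = None,
    by_ingestion_date: bool = False,
) -> Optional[str]:
    """Return the partition a record would fall into, or None if unpartitioned."""
    if partition_field is not None:
        # Partition by a field taken from the JSON message itself.
        return str(record[partition_field])
    if by_ingestion_date:
        # Assumption: __intake_ts holds an ISO 8601 timestamp with an offset.
        ts = datetime.fromisoformat(record["__intake_ts"])
        return ts.astimezone(timezone.utc).date().isoformat()
    # No partitioning option given: the table has no partitions.
    return None


record = {"region": "eu01", "__intake_ts": "2024-05-01T23:30:00+00:00"}
by_field = partition_value(record, partition_field="region")
by_date = partition_value(record, by_ingestion_date=True)
unpartitioned = partition_value(record)
```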

Each Intake provides two Kafka protocol topics for communication:

  • Intake Topic: The primary topic for sending messages to be ingested.
  • Dead Letter Queue (DLQ) Topic: A topic where undeliverable messages (e.g., non-JSON messages) are temporarily stored for debugging purposes.

An Intake Runner is a dedicated, isolated runtime environment for managing one or more Intakes. It provides a managed, public Kafka Protocol API endpoint for submitting messages. The runner’s capacity is defined by the user and is based on a desired hourly throughput:

  • Maximum number of messages per hour
  • Maximum message size in KiB

The Intake Runner ensures that its Intakes can receive messages up to the allocated capacity and manages the storage for message buffering. The capacity of an Intake Runner can be updated as long as the product of the two maximums increases.
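The capacity update rule can be expressed as a small check. The class and function names below are hypothetical, and the sketch reads "gets larger" as a strict increase of the product; it is a model of the rule as stated here, not the validation logic of the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerCapacity:
    max_messages_per_hour: int
    max_message_size_kib: int

    def hourly_volume_kib(self) -> int:
        # Product of both maximums: the largest data volume per hour.
        return self.max_messages_per_hour * self.max_message_size_kib


def is_allowed_update(current: RunnerCapacity, proposed: RunnerCapacity) -> bool:
    """An update is accepted only if the product of both maximums grows."""
    return proposed.hourly_volume_kib() > current.hourly_volume_kib()


current = RunnerCapacity(max_messages_per_hour=1000, max_message_size_kib=4)
fewer_but_larger = RunnerCapacity(max_messages_per_hour=500, max_message_size_kib=10)
more_but_smaller = RunnerCapacity(max_messages_per_hour=2000, max_message_size_kib=1)

allowed = is_allowed_update(current, fewer_but_larger)
rejected = is_allowed_update(current, more_but_smaller)
```

Under this reading, one maximum may be lowered as long as the other is raised enough for the product to increase.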

An Intake user provides secure technical access credentials for applications to connect to an Intake via the Intake Runner’s Kafka Protocol endpoint using SASL authentication. There are two types of users to control access:

  • intake: Used by applications to send messages to the main Intake topic.
  • dead-letter: A read-only user type specifically for accessing the Dead Letter Queue for debugging.

When creating an Intake user, a strong password is required, which is stored securely and cannot be retrieved. You can update Intake user passwords at any time.
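A client connection with an intake-type user might be configured roughly as follows. This sketch uses configuration keys from the third-party confluent-kafka Python client. The endpoint, port, user name, topic name, TLS setting, and SASL mechanism are all assumptions, since this page does not specify them; take the real values from your Intake Runner and Intake.

```python
import json


def build_client_config(bootstrap: str, username: str, password: str) -> dict:
    """Build a confluent-kafka style configuration for SASL authentication."""
    return {
        "bootstrap.servers": bootstrap,
        "security.protocol": "SASL_SSL",    # assumption: TLS on the public endpoint
        "sasl.mechanism": "SCRAM-SHA-512",  # assumption: mechanism is not stated here
        "sasl.username": username,
        "sasl.password": password,
    }


def encode_message(record: dict) -> bytes:
    """Intakes ingest JSON, so serialize each record as a UTF-8 JSON document."""
    return json.dumps(record).encode("utf-8")


def send(config: dict, topic: str, record: dict) -> None:
    """Send one record; requires the confluent-kafka package to be installed."""
    from confluent_kafka import Producer  # imported lazily, third-party dependency

    producer = Producer(config)
    producer.produce(topic, value=encode_message(record))
    producer.flush()  # block until the message has been delivered or has failed


# Hypothetical placeholder values.
config = build_client_config("runner.example.invalid:9092", "my-intake-user", "change-me")
payload = encode_message({"sensor_id": "a1", "temperature": 21.5})
```

Because the password cannot be retrieved after creation, it is usually loaded from a secret store or an environment variable rather than written into code.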

Intake Runners allocate storage to buffer messages for up to 24 hours. This feature is crucial for handling intermittent delivery problems or planned downtimes, such as Dremio maintenance windows.

Messages that are successfully ingested into an Iceberg table are removed from the buffer. An Intake Runner stops accepting new messages if the volume of undelivered messages exceeds a boundary calculated from the 24-hour buffering period and the defined maximum hourly throughput and message size.

Once messages have been delivered and buffer space is regained, the Intake Runner will resume accepting messages for ingestion until the boundary is reached again.
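The buffering behavior described above can be modeled in a few lines. The sketch assumes the boundary is the plain product of the buffering period, the maximum messages per hour, and the maximum message size. The exact formula is not given on this page, so treat the numbers as an estimate; all names are hypothetical.

```python
BUFFER_HOURS = 24  # buffering period stated in the documentation


def buffer_boundary_kib(max_messages_per_hour: int, max_message_size_kib: int) -> int:
    """Assumed boundary: 24 hours of traffic at full configured capacity."""
    return BUFFER_HOURS * max_messages_per_hour * max_message_size_kib


class BufferModel:
    """Toy model of a runner that rejects messages once the buffer is full."""

    def __init__(self, boundary_kib: int) -> None:
        self.boundary_kib = boundary_kib
        self.undelivered_kib = 0

    def try_accept(self, size_kib: int) -> bool:
        # Reject the message if buffering it would exceed the boundary.
        if self.undelivered_kib + size_kib > self.boundary_kib:
            return False
        self.undelivered_kib += size_kib
        return True

    def mark_delivered(self, size_kib: int) -> None:
        # Delivered messages leave the buffer and free up space again.
        self.undelivered_kib = max(0, self.undelivered_kib - size_kib)


# 1000 messages per hour at up to 4 KiB each.
boundary = buffer_boundary_kib(1000, 4)
```

With these example values the assumed boundary is 96,000 KiB, or roughly 94 MiB of undelivered messages.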

All resources within STACKIT Intake—Intake Runners, Intakes, and Intake users—include a status field to reflect their current deployment state. This field can have one of the following values:

  • active: The resource has been successfully created and is operational.
  • reconciling: The resource is currently in the process of being created or updated.
  • deleting: The resource is being deleted.
  • failed: The creation or operation of the resource has failed.
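Automation that creates or updates resources typically waits until the status leaves `reconciling`. The sketch below shows one way to do that. The `fetch_status` callable is a hypothetical stand-in for however you read the status (CLI, SDK, or API) and is not a real STACKIT function.

```python
import time
from enum import Enum
from typing import Callable


class Status(Enum):
    ACTIVE = "active"
    RECONCILING = "reconciling"
    DELETING = "deleting"
    FAILED = "failed"


def wait_until_settled(
    fetch_status: Callable[[], str],
    max_polls: int = 30,
    interval_s: float = 10.0,
) -> Status:
    """Poll until the resource is no longer reconciling, then return its status."""
    for _ in range(max_polls):
        status = Status(fetch_status())  # raises ValueError on an unknown value
        if status is not Status.RECONCILING:
            return status
        time.sleep(interval_s)
    raise TimeoutError("resource is still reconciling")


# Simulated status sequence in place of a real lookup.
fake_statuses = iter(["reconciling", "reconciling", "active"])
final_status = wait_until_settled(lambda: next(fake_statuses), interval_s=0)
```

The caller still has to check the returned value, because the loop also stops on `failed` and `deleting`.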

Additionally, Intakes provide specific fields to help you detect and diagnose ingestion problems:

  • Undelivered message count: This metric shows the number of messages that have been diverted to the Intake’s Dead Letter Queue (DLQ) because they could not be processed, for example, due to a malformed JSON message.
  • Failure message: This field provides the last error message that occurred during message processing, offering a quick explanation of the problem (e.g., a JSON parsing error).
  • Dead Letter Queue (DLQ) Topic: For more detailed debugging, you can use a dead-letter type Intake user to read the problematic messages directly from the DLQ topic. This allows you to inspect the exact content of the messages that failed to be ingested.
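Once raw payloads have been read from the DLQ topic, a local check can help explain why they were rejected. The sketch below only covers the malformed-JSON case mentioned above. The function names are hypothetical, and the error text it produces will not necessarily match the failure message reported by the service.

```python
import json
from typing import List, Optional


def diagnose(raw: bytes) -> Optional[str]:
    """Return a reason why a payload is not valid JSON, or None if it parses."""
    try:
        json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError as exc:
        return f"not valid UTF-8: {exc.reason}"
    except json.JSONDecodeError as exc:
        return f"JSON parsing error at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


def summarize(payloads: List[bytes]) -> dict:
    """Mimic the two diagnostic fields: a count and the last failure reason."""
    reasons = [reason for reason in map(diagnose, payloads) if reason is not None]
    return {
        "undelivered_message_count": len(reasons),
        "failure_message": reasons[-1] if reasons else None,
    }


# Example payloads as they might be read from a DLQ topic.
summary = summarize([b'{"a": 1}', b"not json", b'{"a": '])
```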

You have several options to manage your STACKIT Intake resources, providing flexibility for different workflows and user preferences:

  • STACKIT Portal: A user-friendly graphical interface for straightforward management.
  • STACKIT CLI: The command-line interface for scripting and automation.
  • STACKIT SDK: A software development kit for integrating management tasks into your own applications.
  • STACKIT API: Direct REST API access for full programmatic control.
  • Terraform: An infrastructure-as-code tool for declarative and version-controlled management.

All these methods support the full lifecycle management of your resources, including:

  • Creating new Intake Runners, Intakes, and Intake users.
  • Updating existing Intake Runners, Intakes, and Intake users.
  • Deleting resources when they are no longer needed.
  • Listing available resources and checking their status for monitoring purposes.