Best practices
This page summarizes recommended best practices when developing and operating data pipelines with STACKIT Workflows.
Following these guidelines will help ensure maintainability, reproducibility, and efficient execution.
Development workflow (DEV → QA → PROD)
A structured development workflow is essential for reliable pipeline delivery.
- Dedicated Git branches
  - Maintain one branch per environment (`dev`, `qa`, `prod`) in your DAG repository.
  - Define a clear merge strategy (e.g., feature branches → `dev` → `qa` → `prod`) to ensure tested changes flow upwards.
  - Airflow instances can be configured in the STACKIT portal to pull from the correct branch automatically when creating or updating a Workflow Instance.
- DEV
  - Prototype DAGs, Spark jobs, and notebooks.
  - Use small datasets or synthetic test data.
  - Expect frequent changes and experiments.
- QA / STAGING
  - Test end-to-end workflows with realistic input data.
  - Validate XComs, parameters, SLAs, and error handling.
  - Verify security, permissions, and integrations.
- PROD
  - Deploy only tested and reviewed DAGs.
  - Apply monitoring, alerting, and logging.
  - Ensure reproducibility through fixed image versions (see below).
Installing missing packages
Sometimes a DAG or task may require additional Python packages.
- ❌ Never install packages in the DAG definition itself
  - Example: `os.system("pip install ...")` in DAG code → bad practice.
  - Causes overhead at every DAG parse and instability in the scheduler.
- ⚠️ Installing inside tasks is possible but slow
  - Example: `pip install ...` inside any Operator (see the sketch after this list).
  - This re-downloads and installs dependencies on each run → not reproducible.
- ✅ Best practice: Create a custom Docker image
  - Extend the official STACKIT Spark image (`schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark`) with your custom dependencies, push it to your own container registry, and request via a support ticket that your registry credentials be added as a secret to the Airflow cluster. A minimal Dockerfile sketch follows after this list.
  - Use this image consistently across DEV/QA/PROD.
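The per-task approach can be illustrated with Airflow's built-in virtualenv support. This is a minimal sketch, assuming a recent Airflow 2.x deployment; the DAG name, package versions, and return value are invented for the example, and every run pays the cost of rebuilding the virtualenv:

```python
# Sketch of the "possible but slow" per-task install pattern using Airflow's
# virtualenv support. Package versions and the DAG are illustrative only;
# the virtualenv is rebuilt on every task run.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def per_task_install_example():
    @task.virtualenv(requirements=["pandas==2.2.2"], system_site_packages=False)
    def transform() -> int:
        # Imports must live inside the task: they only exist in the task's virtualenv.
        import pandas as pd

        return int(pd.DataFrame({"x": [1, 2, 3]})["x"].sum())

    transform()


per_task_install_example()
```

For the recommended approach, a Dockerfile sketch could look as follows. This is not an official template: the base-image tag, package list, and registry path are placeholders, and depending on the base image's default user you may need a `USER root` step before installing packages:

```dockerfile
# Sketch only: replace <pinned-tag>, the package list, and the target registry
# with your own values. Depending on the base image's default user, a
# `USER root` line may be required before installing packages.
FROM schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:<pinned-tag>

# Install additional Python dependencies once, at build time, with pinned versions.
RUN pip install --no-cache-dir pandas==2.2.2 requests==2.32.3
```

Build and push the resulting image to your own registry (for example with `docker build` and `docker push`), then reference exactly that pinned tag from your DAGs in all environments.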
Pinning image versions
To ensure reproducibility and prevent unexpected behavior:
- Always pin image versions in your DAGs.
- Avoid using `:latest` tags in production.
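As an illustration, a common pattern is to keep the pinned image reference in a single module-level constant and reuse it wherever a task or operator accepts an image; the tag below is a made-up example:

```python
# Pinned image reference kept in one place; the tag "3.5.1-1" is invented for
# illustration. Reuse this constant instead of hard-coding ":latest" in DAGs.
SPARK_IMAGE = (
    "schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/"
    "stackit-spark:3.5.1-1"
)
```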
Separating business logic from control logic
As a best practice, we recommend separating orchestration logic (DAGs) from business logic (SQL executed in a database, Python scripts that process data as opposed to DAG/task definitions, Spark logic, etc.). This makes your code easier to test and maintain, because business logic functions can be tested independently of Airflow, for example in the STACKIT Notebooks service. To achieve this separation, put business logic into separate files (e.g. `compute_stats.py` or `kpi_revenue.sql` in the example below) that do not contain any references to Airflow or DAGs.
```
your-dags-repo/
├── dags/                        # Directory for DAG files
│   ├── data_pipeline_dag.py
│   ├── ml_training_dag.py
│   ├── maintenance_dag.py
│   └── tasks/                   # Optional module that contains business logic.
│       ├── clean_timestamps.py  # For Python or Spark jobs, don't import anything from airflow here
│       └── compute_stats.py     # (should not contain words "airflow" or "DAG")
├── include/
│   └── sql
│       └── kpi_revenue.sql
├── plugins/                     # For any custom or community Airflow plugins
│   └── custom_operators.py
└── README.md
```
These files can then be imported and used in your DAG files. Alternatively, business logic can also reside in separate Git repositories, which can be configured on a per-task basis using the `git_sync_*` parameters of the `STACKITSparkScriptOperator` or the `@stackit.spark_kubernetes_task` decorator.
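For illustration, a minimal sketch of this split is shown below, assuming the repository layout above; the function body, DAG parameters, and import path are invented for the example and depend on how your DAG bundle is deployed:

```python
# dags/tasks/compute_stats.py — pure business logic, no Airflow imports.
def compute_stats(values: list[float]) -> dict[str, float]:
    """Return simple summary statistics for a list of numbers."""
    if not values:
        return {"count": 0.0, "mean": 0.0}
    return {"count": float(len(values)), "mean": sum(values) / len(values)}
```

```python
# dags/data_pipeline_dag.py — orchestration only; business logic is imported.
import pendulum
from airflow.decorators import dag, task

# The import path depends on your repository layout and on the dags/ folder
# being on the Python path (hypothetical here).
from tasks.compute_stats import compute_stats


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def data_pipeline():
    @task
    def stats_task() -> dict[str, float]:
        return compute_stats([1.0, 2.0, 3.0])

    stats_task()


data_pipeline()
```

Because `compute_stats` has no Airflow dependency, it can be unit-tested or explored in a notebook without running a scheduler.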
Please check the Tutorials section for an example.