Best practices
This page summarizes recommended best practices when developing and operating data pipelines with STACKIT Workflows.
Following these guidelines will help ensure maintainability, reproducibility, and efficient execution.
Development workflow (DEV → QA → PROD)
A structured development workflow is essential for reliable pipeline delivery.
- Dedicated Git branches
  - Maintain one branch per environment (`dev`, `qa`, `prod`) in your DAG repository.
  - Define a clear merge strategy (e.g., feature branches → `dev` → `qa` → `prod`) to ensure tested changes flow upwards.
  - Airflow instances can be configured in the STACKIT portal to pull from the correct branch automatically when creating or updating a Workflow Instance.
- DEV
  - Prototype DAGs, Spark jobs, and notebooks.
  - Use small datasets or synthetic test data.
  - Expect frequent changes and experiments.
- QA / STAGING
  - Test end-to-end workflows with realistic input data.
  - Validate XComs, parameters, SLAs, and error handling.
  - Verify security, permissions, and integrations.
- PROD
  - Deploy only tested and reviewed DAGs.
  - Apply monitoring, alerting, and logging.
  - Ensure reproducibility through fixed image versions (see below).
Installing missing packages
Sometimes a DAG or task may require additional Python packages.
- ❌ Never install packages in the DAG definition itself
  - Example: `os.system("pip install ...")` in DAG code → bad practice.
  - Causes overhead at every DAG parse and instability in the scheduler.
- ⚠️ Installing inside tasks is possible but slow
  - Example: `pip install ...` inside any Operator (see the sketch after this list).
  - This re-downloads and installs dependencies on each run → not reproducible.
- ✅ Best practice: Create a custom Docker image
  - Extend the official STACKIT Spark image (`schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark`) with your custom dependencies, push it to your own container registry, and request via a support ticket that your registry credentials be added as a secret to the Airflow cluster. A minimal Dockerfile sketch follows after this list.
  - Use this image consistently across DEV/QA/PROD.
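The per-task approach can be illustrated with Airflow's built-in virtualenv support. This is a minimal sketch, assuming a recent Airflow 2.x deployment; the DAG name, package versions, and return value are invented for the example, and every run pays the cost of rebuilding the virtualenv:

```python
# Sketch of the "possible but slow" per-task install pattern using Airflow's
# virtualenv support. Package versions and the DAG are illustrative only;
# the virtualenv is rebuilt on every task run.
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def per_task_install_example():
    @task.virtualenv(requirements=["pandas==2.2.2"], system_site_packages=False)
    def transform() -> int:
        # Imports must live inside the task: they only exist in the task's virtualenv.
        import pandas as pd

        return int(pd.DataFrame({"x": [1, 2, 3]})["x"].sum())

    transform()


per_task_install_example()
```

For the recommended approach, a Dockerfile sketch could look as follows. This is not an official template: the base-image tag, package list, and registry path are placeholders, and depending on the base image's default user you may need a `USER root` step before installing packages:

```dockerfile
# Sketch only: replace <pinned-tag>, the package list, and the target registry
# with your own values. Depending on the base image's default user, a
# `USER root` line may be required before installing packages.
FROM schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:<pinned-tag>

# Install additional Python dependencies once, at build time, with pinned versions.
RUN pip install --no-cache-dir pandas==2.2.2 requests==2.32.3
```

Build and push the resulting image to your own registry (for example with `docker build` and `docker push`), then reference exactly that pinned tag from your DAGs in all environments.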
Pinning image versions
To ensure reproducibility and prevent unexpected behavior:
- Always pin image versions in your DAGs.
- Avoid using `:latest` tags in production.
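As an illustration, a common pattern is to keep the pinned image reference in a single module-level constant and reuse it wherever a task or operator accepts an image; the tag below is a made-up example:

```python
# Pinned image reference kept in one place; the tag "3.5.1-1" is invented for
# illustration. Reuse this constant instead of hard-coding ":latest" in DAGs.
SPARK_IMAGE = (
    "schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/"
    "stackit-spark:3.5.1-1"
)
```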
Separating business logic from control logic
As a best practice, we recommend separating orchestration logic (DAGs) from business logic (SQL executed in a database, Python scripts that process data as opposed to DAG/task definitions, Spark logic, etc.). This makes your code easier to test and maintain, because business logic functions can be tested independently of Airflow, for example in the STACKIT Notebooks service. To achieve this separation, put business logic into separate files (e.g. `compute_stats.py` or `kpi_revenue.sql` in the example below) that do not contain any references to Airflow or DAGs.
```
your-dags-repo/
├── dags/                        # Directory for DAG files
│   ├── data_pipeline_dag.py
│   ├── ml_training_dag.py
│   ├── maintenance_dag.py
│   └── tasks/                   # Optional module that contains business logic.
│       ├── clean_timestamps.py  # For Python or Spark jobs, don't import anything from airflow here
│       └── compute_stats.py     # (should not contain words "airflow" or "DAG")
├── include/
│   └── sql
│       └── kpi_revenue.sql
├── plugins/                     # For any custom or community Airflow plugins
│   └── custom_operators.py
└── README.md
```
These files can then be imported and used in your DAG files. Alternatively, business logic can also reside in separate Git repositories, which can be configured on a per-task basis using the `git_sync_*` parameters of the `STACKITSparkScriptOperator` or the `@stackit.spark_kubernetes_task` decorator.
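For illustration, a minimal sketch of this split is shown below, assuming the repository layout above; the function body, DAG parameters, and import path are invented for the example and depend on how your DAG bundle is deployed:

```python
# dags/tasks/compute_stats.py — pure business logic, no Airflow imports.
def compute_stats(values: list[float]) -> dict[str, float]:
    """Return simple summary statistics for a list of numbers."""
    if not values:
        return {"count": 0.0, "mean": 0.0}
    return {"count": float(len(values)), "mean": sum(values) / len(values)}
```

```python
# dags/data_pipeline_dag.py — orchestration only; business logic is imported.
import pendulum
from airflow.decorators import dag, task

# The import path depends on your repository layout and on the dags/ folder
# being on the Python path (hypothetical here).
from tasks.compute_stats import compute_stats


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def data_pipeline():
    @task
    def stats_task() -> dict[str, float]:
        return compute_stats([1.0, 2.0, 3.0])

    stats_task()


data_pipeline()
```

Because `compute_stats` has no Airflow dependency, it can be unit-tested or explored in a notebook without running a scheduler.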
Please check the Tutorials section for an example.