Using the STACKIT Spark Image with Extra Packages

STACKIT provides ready-to-use Spark images that are optimized for running Spark workloads on Kubernetes. You can use these images directly in your Workflows DAGs and extend them with additional Python packages at runtime. In this tutorial you’ll learn how to configure your DAG to use a STACKIT Spark image and install extra libraries on the fly.

  1. Create a DAG in your project

    dags/my_extra_packages_dag.py

    import pendulum
    from airflow.decorators import dag
    from stackit_workflows.airflow_plugin.decorators import stackit

    # Specify the STACKIT Spark image
    default_kwargs = {
        "image": "schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:spark3.5.3-0.1.2"
    }

    @dag(
        schedule=None,
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,
        tags=["demo"],
        dag_id="07_extra_packages",
    )
    def packages():
        # The STACKIT Spark image provides a writable Python environment.
        # You can install extra libraries at runtime using pip, conda, or mamba.
        @stackit.spark_kubernetes_task(**default_kwargs)
        def tell_jokes():
            import subprocess, sys

            # Install an extra package at runtime inside the Spark container
            subprocess.check_call([sys.executable, "-m", "pip", "install", "Joking"])
            import Joking

            print(Joking.random_joke())

        tell_jokes()

    packages()
    • STACKIT Spark image

      • Defined via the image parameter (spark3.5.3-0.1.2).
      • Provides a maintained Spark runtime plus a writable Python environment.
      • You don’t need to build your own base image to get started.
    • Runtime installation

      • Since the environment is writable, you can install additional libraries with pip, conda, or mamba.
      • mamba is recommended because it resolves dependencies faster and ships optimized binaries.
    • Best practice

      • Runtime installs consume resources each time the pod starts.
      • For tasks you run often, build a custom image based on the STACKIT Spark image with all required libraries pre-installed.
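    The notes above can be combined into a small helper. As a minimal sketch (the helper names `build_install_cmd` and `install` are hypothetical, not part of the STACKIT plugin), a task could prefer mamba when it is on the PATH and fall back to pip via the current interpreter; building the command line separately from running it keeps the choice easy to inspect:

    ```python
    import shutil
    import subprocess
    import sys


    def build_install_cmd(package: str) -> list:
        """Build an install command, preferring mamba when it is available.

        mamba resolves dependencies faster and ships optimized binaries;
        pip via the current interpreter is the portable fallback.
        """
        if shutil.which("mamba"):
            return ["mamba", "install", "-y", package]
        return [sys.executable, "-m", "pip", "install", package]


    def install(package: str) -> None:
        # check_call raises CalledProcessError on a non-zero exit code,
        # so a failed install fails the task early instead of at import time.
        subprocess.check_call(build_install_cmd(package))
    ```

    Inside `tell_jokes`, `install("Joking")` would then replace the direct pip call.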
  2. Push the DAG to your environment and trigger it in Airflow.

    • Check the task logs

      • You’ll see pip fetching and installing the Joking package.
      • The task will then print a random joke from the installed library.
    • Inspect the Spark pod

      • Confirm that the pod is using the STACKIT Spark image you specified.
      • The additional library is available only during this task’s runtime.
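    Because each Spark pod starts fresh from the image, a runtime install vanishes when the pod terminates. A hedged sketch of an idempotent guard (the helper name `ensure_package` is an assumption for illustration) that only installs when the module is not already importable within the pod's lifetime:

    ```python
    import importlib.util
    import subprocess
    import sys
    from typing import Optional


    def ensure_package(package: str, module_name: Optional[str] = None) -> None:
        """Install `package` only if its module cannot already be imported.

        `module_name` covers packages whose import name differs from their
        pip name; it defaults to `package`. This avoids redundant installs
        within one pod, but does not persist anything across pod restarts.
        """
        module = module_name or package
        if importlib.util.find_spec(module) is None:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    ```

    For example, `ensure_package("Joking")` installs on first use, while `ensure_package("json")` is a no-op because the module ships with Python.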