Running Jupyter Notebooks with Papermill in STACKIT Workflows

Before you begin, make sure that:

  • You have an active STACKIT Workflows instance.
  • You have a Git repository connected to your Workflows instance that contains your notebooks.
  • You have a STACKIT Object Storage bucket (if you want to persist notebook execution outputs).

In this guide, you will learn how to automate the execution of Jupyter Notebooks using the STACKITSparkScriptOperator and Papermill.

Papermill is a tool for parameterizing and executing Jupyter Notebooks. Within STACKIT Workflows, the STACKITSparkScriptOperator automatically detects .ipynb files and uses Papermill to run them as Spark jobs, which lets you treat notebooks as production-ready data pipelines.

When developing notebooks that run both interactively in Notebooks and in production via Papermill/Workflows, you can use the STACKIT__PAPERMILL environment variable to detect the execution context. This lets you handle concerns such as secret management, progress bars, and logging differently in each environment.
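For example, a notebook can branch on this variable. A minimal sketch; it only assumes the variable is present in Papermill runs, not any particular value:

```python
import os


def running_under_papermill() -> bool:
    # STACKIT Workflows sets STACKIT__PAPERMILL when the notebook is
    # executed via Papermill; we check only for presence, not a value.
    return "STACKIT__PAPERMILL" in os.environ


if running_under_papermill():
    # Production run: e.g. quiet logging, no progress bars.
    log_level = "WARNING"
else:
    # Interactive run in Notebooks: verbose output is fine.
    log_level = "INFO"
```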

When running a notebook, you need to decide where the executed version (containing the results and logs) should be stored. You can either push it to an S3-compatible bucket or stream the output directly to the Airflow logs (stdout).

To store results in S3, ensure you have an Airflow Connection configured with your Object Storage credentials. In this example, we use a connection ID named s3_papermill_output.
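If you prefer not to create the connection through the Airflow UI, Airflow (2.3 and later) can also resolve a connection from an AIRFLOW_CONN_<CONN_ID> environment variable holding a JSON document. A minimal sketch with placeholder credentials and a purely illustrative endpoint URL:

```python
import json
import os

# Placeholder values: substitute your real Object Storage credentials
# and endpoint; the endpoint URL below is purely illustrative.
connection = {
    "conn_type": "aws",
    "login": "MY_ACCESS_KEY_ID",                    # read as s3_connection.login
    "password": "MY_SECRET_ACCESS_KEY",             # read as s3_connection.password
    "host": "https://object.storage.example.com",   # read as s3_connection.host
}

# Airflow resolves the connection ID "s3_papermill_output" from this variable.
os.environ["AIRFLOW_CONN_S3_PAPERMILL_OUTPUT"] = json.dumps(connection)
```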

The following DAG demonstrates two tasks: one that saves the output notebook to S3 and another that prints the output to the Airflow task logs.

```python
import pendulum

from airflow.decorators import dag
from airflow.models.connection import Connection
from stackit_workflows.workflows_plugin.operators import STACKITSparkScriptOperator

# Fetch S3 credentials from Airflow Secrets
s3_connection = Connection.get_connection_from_secrets("s3_papermill_output")


@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["demo"],
    dag_id="demo_jupyter_notebook",
)
def spark_full_configuration():
    output_s3 = STACKITSparkScriptOperator(
        task_id="papermill_s3",
        # Use a specific Spark image or leave empty for the default
        image="schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:spark3.5.3-0.1.2",
        get_logs=True,
        # Set explicitly or let the operator auto-detect via the file extension
        is_papermill=True,
        # Path relative to your git-sync folder
        script="Demo/scripts/my_spark_notebook.ipynb",
        # Target S3 path for the executed notebook
        papermill_target_folder="s3://internal-airflow-papermill-output/notebooks",
        s3_arguments={
            "AWS_ACCESS_KEY_ID": s3_connection.login,
            "AWS_SECRET_ACCESS_KEY": s3_connection.password,
            "BOTO3_ENDPOINT_URL": s3_connection.host,
        },
        lakehouse_connections=["lakehouse-rest"],
    )
    output_stdout = STACKITSparkScriptOperator(
        task_id="papermill_stdout",
        image="schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:spark3.5.3-0.1.2",
        get_logs=True,
        is_papermill=True,
        script="Demo/scripts/my_spark_notebook.ipynb",
        # Setting the folder to "-" redirects the notebook JSON to the Airflow logs
        papermill_target_folder="-",
        s3_arguments={
            "AWS_ACCESS_KEY_ID": s3_connection.login,
            "AWS_SECRET_ACCESS_KEY": s3_connection.password,
            "BOTO3_ENDPOINT_URL": s3_connection.host,
        },
        lakehouse_connections=["lakehouse-rest"],
    )
    output_stdout >> output_s3


spark_full_configuration()
```
  1. Upload your Notebook

    Ensure your .ipynb file is committed to the Git repository linked to your STACKIT Workflows instance. The script parameter in the operator must match the relative path in the repository. For a sample notebook that uses Papermill and Spark, see Using Spark.

  2. Upload the DAG

    Just like your .ipynb file, commit and push the DAG that runs the notebook (see above). Let's assume it is named demo_jupyter_notebook, as in the example above.

  3. Trigger the DAG

    Navigate to the Airflow UI, locate demo_jupyter_notebook, and trigger it manually.

  4. Monitor Execution

    Click on the papermill_stdout task and select Logs. You will see the Spark session initializing and Papermill executing each cell sequentially.

  5. Verify Output

    Once the papermill_s3 task completes, check your STACKIT Object Storage bucket. You will find a new notebook file containing both the original code and the outputs generated by the run.
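A note on parameterization: Papermill can inject values into a notebook cell carrying the Jupyter cell tag "parameters". Whether and how STACKITSparkScriptOperator passes such parameters is not covered here, but on the notebook side a parameters cell simply defines defaults (the variable names below are hypothetical):

```python
# Contents of the notebook cell tagged "parameters".
# At run time, Papermill inserts a new cell directly after this one
# that overrides these defaults with the supplied values.
run_date = "2021-01-01"             # hypothetical parameter: partition to process
input_path = "s3://my-bucket/raw/"  # hypothetical parameter: source data location
sample_fraction = 0.1               # hypothetical parameter: sampling rate
```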