Running Jupyter Notebooks with Papermill in STACKIT Workflows
Prerequisites
- You have an active STACKIT Workflows instance.
- You have a Git repository connected to your Workflows instance containing your notebooks.
- You have a STACKIT Object Storage bucket (if you want to persist notebook execution outputs).
In this guide, you will learn how to automate the execution of Jupyter Notebooks using the `STACKITSparkScriptOperator` and Papermill.
Overview of Papermill Integration
Papermill is a tool for configuring and executing Jupyter Notebooks. Within STACKIT Workflows, the `STACKITSparkScriptOperator` automatically detects `.ipynb` files and uses Papermill to run them as Spark jobs. This allows you to treat notebooks as production-ready data pipelines.
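The auto-detection mentioned above can be pictured as a simple check on the script's file extension. The following is an illustrative sketch, not the operator's actual implementation:

```python
def is_notebook(script: str) -> bool:
    """Illustrative only: decide whether a script should run via Papermill."""
    return script.lower().endswith(".ipynb")

# Notebooks are routed through Papermill; plain scripts run as regular Spark jobs.
print(is_notebook("Demo/scripts/my_spark_notebook.ipynb"))  # True
print(is_notebook("Demo/scripts/etl_job.py"))               # False
```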
When developing notebooks that run both interactively in Notebooks and in production via Papermill/Workflows, you can use the `STACKIT__PAPERMILL` environment variable to detect the execution context. This lets you handle concerns such as secret management, progress bars, and logging differently in each context.
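For example, a cell near the top of the notebook can branch on this variable. This sketch assumes only that `STACKIT__PAPERMILL` is present in the environment when the notebook runs via Papermill, as described above; the helper name is ours:

```python
import os

def running_under_papermill() -> bool:
    # Set by the Workflows runtime for Papermill executions; absent when the
    # notebook is opened interactively.
    return "STACKIT__PAPERMILL" in os.environ

if running_under_papermill():
    # Production run: read secrets from the injected environment, keep logs quiet
    show_progress_bars = False
else:
    # Interactive development in Notebooks
    show_progress_bars = True
```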
Configuring the Operator
When running a notebook, you need to decide where the executed version (containing the results and logs) should be stored. You can either push it to an S3-compatible bucket or stream the output directly to the Airflow logs (stdout).
Connection Setup
To store results in S3, ensure you have an Airflow Connection configured with your Object Storage credentials. In this example, we use the connection ID `s3_papermill_output`.
| Connection ID | Connection Type | Description |
|---|---|---|
| `s3_papermill_output` | S3 | Credentials for the target bucket where executed notebooks are saved. |
| `lakehouse-rest` | HTTP/Custom | Pre-configured connection for STACKIT Lakehouse integration. |
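One way to create such a connection without the Airflow UI is Airflow's generic `AIRFLOW_CONN_<CONN_ID>` environment-variable convention, which encodes a connection as a URI. The credentials and endpoint below are placeholders for your own Object Storage values, and the `aws` scheme assumes the Amazon provider's connection type (it may differ by Airflow version):

```python
from urllib.parse import quote

# Placeholder credentials; substitute your Object Storage access key pair
# and endpoint host.
access_key = "ACCESS_KEY_ID"
secret_key = "SECRET_ACCESS_KEY"
endpoint_host = "object-storage.example.stackit.cloud"

# Airflow resolves AIRFLOW_CONN_S3_PAPERMILL_OUTPUT to the connection
# with ID "s3_papermill_output".
uri = f"aws://{quote(access_key)}:{quote(secret_key)}@{endpoint_host}"
print(f"AIRFLOW_CONN_S3_PAPERMILL_OUTPUT={uri}")
```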
Implementation Example
The following DAG defines two tasks: one that saves the executed notebook to S3 and one that prints the output to the Airflow task logs.
```python
import pendulum

from airflow.decorators import dag
from airflow.models.connection import Connection

from stackit_workflows.workflows_plugin.operators import STACKITSparkScriptOperator

# Fetch S3 credentials from Airflow Secrets
s3_connection = Connection.get_connection_from_secrets("s3_papermill_output")


@dag(
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["demo"],
    dag_id="demo_jupyter_notebook",
)
def spark_full_configuration():
    output_s3 = STACKITSparkScriptOperator(
        task_id="papermill_s3",
        # Use a specific Spark image or leave empty for the default
        image="schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:spark3.5.3-0.1.2",
        get_logs=True,
        # Set explicitly, or let the operator auto-detect via the file extension
        is_papermill=True,
        # Path relative to your git-sync folder
        script="Demo/scripts/my_spark_notebook.ipynb",
        # Target S3 path for the executed notebook
        papermill_target_folder="s3://internal-airflow-papermill-output/notebooks",
        s3_arguments={
            "AWS_ACCESS_KEY_ID": s3_connection.login,
            "AWS_SECRET_ACCESS_KEY": s3_connection.password,
            "BOTO3_ENDPOINT_URL": s3_connection.host,
        },
        lakehouse_connections=["lakehouse-rest"],
    )

    output_stdout = STACKITSparkScriptOperator(
        task_id="papermill_stdout",
        image="schwarzit-xx-sit-dp-customer-artifactory-docker-local.jfrog.io/stackit-spark:spark3.5.3-0.1.2",
        get_logs=True,
        is_papermill=True,
        script="Demo/scripts/my_spark_notebook.ipynb",
        # Setting the folder to "-" redirects the notebook JSON to the Airflow logs
        papermill_target_folder="-",
        s3_arguments={
            "AWS_ACCESS_KEY_ID": s3_connection.login,
            "AWS_SECRET_ACCESS_KEY": s3_connection.password,
            "BOTO3_ENDPOINT_URL": s3_connection.host,
        },
        lakehouse_connections=["lakehouse-rest"],
    )

    output_stdout >> output_s3


spark_full_configuration()
```

Running the workflow / DAG
1. **Upload your Notebook.** Ensure your `.ipynb` file is committed to the Git repository linked to your STACKIT Workflows instance. The `script` parameter in the operator must match the relative path in the repository. For a sample notebook that uses Papermill and Spark, see Using Spark.
2. **Upload the DAG.** Just like your `.ipynb` file, commit and push the DAG that runs the notebook (see above). Let's assume it is named `demo_jupyter_notebook`, as in the example above.
3. **Trigger the DAG.** Navigate to the Airflow UI, locate `demo_jupyter_notebook`, and trigger it manually.
4. **Monitor Execution.** Click on the `papermill_stdout` task and select Logs. You will see the Spark session initializing and Papermill executing each cell sequentially.
5. **Verify Output.** Once the `papermill_s3` task completes, check your STACKIT Object Storage bucket. You will find a new notebook file containing both the original code and the generated outputs from the run.