Using Spark in JupyterLab

STACKIT Notebooks come pre-configured with Spark support. However, depending on whether your code is running interactively in JupyterLab or being executed via an Airflow Papermill job, the way you initialize the Spark session differs.

This guide explains how to create a Spark session that connects to your Data Lakehouse (STACKIT Dremio/Iceberg).

There are two primary ways your notebook will run:

  1. Interactive Mode: You are manually executing cells in the JupyterLab UI. You need to provide credentials (like a Personal Access Token) to authenticate with the catalog.
  2. Papermill/Production Mode: The notebook is running via STACKIT Workflows (Apache Airflow). The environment is already pre-configured with the necessary tokens and catalog URI.

The following code snippet demonstrates how to handle both environments. It detects the STACKIT__PAPERMILL environment variable to decide whether to use automated configuration or manual credential injection.

import os

import stackit_spark

# 1. Check whether the code is running via Airflow/Papermill
if "STACKIT__PAPERMILL" in os.environ:
    # In Papermill mode, get_spark() automatically retrieves
    # configuration from the environment/operator.
    CATALOG_NAME = "catalog"
    spark = stackit_spark.get_spark()
else:
    # 2. Interactive mode: manual configuration required.
    # These variables should be set in your JupyterLab terminal
    # or environment settings (see the steps below).
    DREMIO_PAT = os.environ["DREMIO_PAT"]
    DREMIO_ADDRESS = os.environ["DREMIO_ADDRESS"]
    CATALOG_URI = os.environ["CATALOG_URI"]

    CATALOG_NAME = "stackit"
    ICEBERG_VERSION = "1.9.1"
    JAR_PATH = os.path.abspath("./authmgr-oauth2-runtime-0.1.2.jar")

    packages = (
        f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-gcp-bundle:{ICEBERG_VERSION}"
    )

    spark = stackit_spark.get_spark(
        additional_config={
            "spark.jars.packages": packages,
            "spark.jars": JAR_PATH,
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.driver.userClassPathFirst": "true",
            # --- Iceberg catalog configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}": "org.apache.iceberg.spark.SparkCatalog",
            f"spark.sql.catalog.{CATALOG_NAME}.type": "rest",
            f"spark.sql.catalog.{CATALOG_NAME}.warehouse": "default",
            f"spark.sql.catalog.{CATALOG_NAME}.uri": CATALOG_URI,
            # Bypass location-existence validation checks
            f"spark.sql.catalog.{CATALOG_NAME}.check-location": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.validation.check-location-exists": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.metadata.check-location-exists": "false",
            # --- OAuth2 authentication ---
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.type": "com.dremio.iceberg.authmgr.oauth2.OAuth2Manager",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-auth": "none",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-endpoint": f"https://{DREMIO_ADDRESS}/oauth/token",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.grant-type": "urn:ietf:params:oauth:grant-type:token-exchange",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-id": "dremio",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.scope": "dremio.all",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token": DREMIO_PAT,
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token-type": "urn:ietf:params:oauth:token-type:dremio:personal-access-token",
            # --- STS/vended credentials configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation": "vended-credentials",
            f"spark.sql.catalog.{CATALOG_NAME}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"spark.sql.catalog.{CATALOG_NAME}.s3.vended-credentials-enabled": "true",
        }
    )

# Verify connectivity and run a few example queries
spark.sql(f"USE {CATALOG_NAME}").show(truncate=False)
spark.sql("SHOW NAMESPACES").show()
spark.sql(f"CREATE TABLE IF NOT EXISTS {CATALOG_NAME}.test.customer (id INT, name STRING, age INT) USING iceberg")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
spark.sql(f"INSERT INTO {CATALOG_NAME}.test.customer VALUES (1, 'Daniel', 30), (2, 'Jane', 25)")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
  1. Set up Environment Variables

    Open a Terminal in JupyterLab and export your Dremio Personal Access Token (PAT).

    export DREMIO_PAT="your-token-here"
    export DREMIO_ADDRESS=your-dremio-url
    export CATALOG_URI=your-catalog-uri

    Alternatively, you can set these environment variables in a code cell via %env DREMIO_PAT=your-token-here. Note that %env stores everything after the = sign literally, so omit the quotes there.
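    To fail fast when one of these variables is missing, you can check them before initializing Spark. A minimal sketch in pure Python (the helper missing_env_vars is not part of stackit_spark; the variable names match the snippet above):

    ```python
    import os

    REQUIRED_VARS = ("DREMIO_PAT", "DREMIO_ADDRESS", "CATALOG_URI")

    def missing_env_vars(env=None, required=REQUIRED_VARS):
        """Return the names of required variables that are unset or empty."""
        if env is None:
            env = os.environ
        return [name for name in required if not env.get(name)]

    missing = missing_env_vars()
    if missing:
        print("Set these variables before starting Spark:", ", ".join(missing))
    ```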

  2. Import Libraries

    The stackit_spark library is a STACKIT-specific wrapper around pyspark. It handles the underlying configuration required to talk to the Spark cluster.

  3. Define your Catalog

    In the code above, we set CATALOG_NAME to stackit in interactive mode (and to catalog in Papermill mode). This acts as a namespace prefix within your Spark SQL queries (e.g., SELECT * FROM stackit.DB.TABLE).
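    Tables can be referenced either fully qualified or relative to a default catalog. A sketch using the test.customer table from the example above:

    ```sql
    -- Fully qualified: catalog.namespace.table
    SELECT * FROM stackit.test.customer;

    -- Or set the default catalog once, then use two-part names
    USE stackit;
    SELECT * FROM test.customer;
    ```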

  4. Execute SQL Queries

    Once the spark object is initialized, you can use spark.sql() to interact with your Iceberg tables exactly like a standard SQL database.
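    Because the configuration above registers IcebergSparkSessionExtensions, row-level statements such as UPDATE and DELETE should also work on Iceberg tables. A sketch using the example table (run each statement via spark.sql() like the others):

    ```sql
    UPDATE stackit.test.customer SET age = 31 WHERE id = 1;
    DELETE FROM stackit.test.customer WHERE id = 2;
    ```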

The configuration uses the Iceberg REST catalog type. This ensures that any table you create or modify via Spark is immediately visible and readable within Dremio, provided both sides point at the same catalog URI and warehouse.
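For example, after the INSERT from the snippet above runs in Spark, the same rows should be queryable from the Dremio SQL runner (the path to the table in Dremio depends on how the catalog source is named there):

```sql
SELECT id, name, age FROM test.customer ORDER BY id;
```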