Using Spark in JupyterLab

STACKIT Notebooks come pre-configured with Spark support. However, depending on whether your code is running interactively in JupyterLab or being executed via an Airflow Papermill job, the way you initialize the Spark session differs.

This guide explains how to create a Spark session that connects to your Data Lakehouse (STACKIT Dremio/Iceberg).

There are two primary ways your notebook will run:

  1. Interactive Mode: You are manually executing cells in the JupyterLab UI. You need to provide credentials (like a Personal Access Token) to authenticate with the catalog.
  2. Papermill/Production Mode: The notebook is running via STACKIT Workflows (Apache Airflow). The environment is already pre-configured with the necessary tokens and catalog URI.

The following code snippet demonstrates how to handle both environments. It detects the STACKIT__PAPERMILL environment variable to decide whether to use automated configuration or manual credential injection.

import os

import stackit_spark

# 1. Check whether the code is running via Airflow/Papermill
if "STACKIT__PAPERMILL" in os.environ:
    # In Papermill mode, get_spark() automatically retrieves
    # configuration from the environment/operator.
    CATALOG_NAME = "catalog"
    spark = stackit_spark.get_spark()
else:
    # 2. Interactive mode: manual configuration required.
    # These variables should be set in your JupyterLab terminal
    # or environment settings (see the steps below).
    DREMIO_PAT = os.environ["DREMIO_PAT"]
    DREMIO_ADDRESS = os.environ["DREMIO_ADDRESS"]
    CATALOG_URI = os.environ["CATALOG_URI"]

    CATALOG_NAME = "stackit"
    ICEBERG_VERSION = "1.9.1"
    JAR_PATH = os.path.abspath("./authmgr-oauth2-runtime-0.1.2.jar")

    packages = (
        f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-gcp-bundle:{ICEBERG_VERSION}"
    )

    spark = stackit_spark.get_spark(
        additional_config={
            "spark.jars.packages": packages,
            "spark.jars": JAR_PATH,
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.driver.userClassPathFirst": "true",
            # --- Iceberg catalog configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}": "org.apache.iceberg.spark.SparkCatalog",
            f"spark.sql.catalog.{CATALOG_NAME}.type": "rest",
            f"spark.sql.catalog.{CATALOG_NAME}.warehouse": "default",
            f"spark.sql.catalog.{CATALOG_NAME}.uri": CATALOG_URI,
            # Bypass location-existence validation checks
            f"spark.sql.catalog.{CATALOG_NAME}.check-location": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.validation.check-location-exists": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.metadata.check-location-exists": "false",
            # --- OAuth2 authentication ---
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.type": "com.dremio.iceberg.authmgr.oauth2.OAuth2Manager",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-auth": "none",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-endpoint": f"https://{DREMIO_ADDRESS}/oauth/token",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.grant-type": "urn:ietf:params:oauth:grant-type:token-exchange",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-id": "dremio",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.scope": "dremio.all",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token": DREMIO_PAT,
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token-type": "urn:ietf:params:oauth:token-type:dremio:personal-access-token",
            # --- STS/vended credentials configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation": "vended-credentials",
            f"spark.sql.catalog.{CATALOG_NAME}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"spark.sql.catalog.{CATALOG_NAME}.s3.vended-credentials-enabled": "true",
        }
    )

# Verify connectivity and run a few example queries
spark.sql(f"USE {CATALOG_NAME}").show(truncate=False)
spark.sql("SHOW NAMESPACES").show()
spark.sql(f"CREATE TABLE IF NOT EXISTS {CATALOG_NAME}.test.customer (id INT, name STRING, age INT) USING iceberg")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
spark.sql(f"INSERT INTO {CATALOG_NAME}.test.customer VALUES (1, 'Daniel', 30), (2, 'Jane', 25)")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
  1. Set up Environment Variables

    Open a Terminal in JupyterLab and export your Dremio Personal Access Token (PAT).

    export DREMIO_PAT="your-token-here"
    export DREMIO_ADDRESS=your-dremio-url
    export CATALOG_URI=your-catalog-uri

    Alternatively, you can set these environment variables in a code cell via %env DREMIO_PAT=your-token-here. Note that %env stores everything after the = sign literally, so omit the quotes there.
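    To fail fast when one of these variables is missing, you can check them before initializing Spark. A minimal sketch in pure Python (the helper missing_env_vars is not part of stackit_spark; the variable names match the snippet above):

    ```python
    import os

    REQUIRED_VARS = ("DREMIO_PAT", "DREMIO_ADDRESS", "CATALOG_URI")

    def missing_env_vars(env=None, required=REQUIRED_VARS):
        """Return the names of required variables that are unset or empty."""
        if env is None:
            env = os.environ
        return [name for name in required if not env.get(name)]

    missing = missing_env_vars()
    if missing:
        print("Set these variables before starting Spark:", ", ".join(missing))
    ```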

  2. Import Libraries

    The stackit_spark library is a STACKIT-specific wrapper around pyspark. It handles the underlying configuration required to talk to the Spark cluster.

  3. Define your Catalog

    In the code above, we set CATALOG_NAME to stackit in interactive mode (and to catalog in Papermill mode). This acts as a namespace prefix within your Spark SQL queries (e.g., SELECT * FROM stackit.DB.TABLE).
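    Tables can be referenced either fully qualified or relative to a default catalog. A sketch using the test.customer table from the example above:

    ```sql
    -- Fully qualified: catalog.namespace.table
    SELECT * FROM stackit.test.customer;

    -- Or set the default catalog once, then use two-part names
    USE stackit;
    SELECT * FROM test.customer;
    ```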

  4. Execute SQL Queries

    Once the spark object is initialized, you can use spark.sql() to interact with your Iceberg tables exactly like a standard SQL database.
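    Because the configuration above registers IcebergSparkSessionExtensions, row-level statements such as UPDATE and DELETE should also work on Iceberg tables. A sketch using the example table (run each statement via spark.sql() like the others):

    ```sql
    UPDATE stackit.test.customer SET age = 31 WHERE id = 1;
    DELETE FROM stackit.test.customer WHERE id = 2;
    ```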

The configuration uses the Iceberg REST catalog type. This ensures that any table you create or modify via Spark is immediately visible and readable within Dremio, provided both sides point at the same catalog URI and warehouse.
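For example, after the INSERT from the snippet above runs in Spark, the same rows should be queryable from the Dremio SQL runner (the path to the table in Dremio depends on how the catalog source is named there):

```sql
SELECT id, name, age FROM test.customer ORDER BY id;
```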