Using Spark in JupyterLab
Introduction
STACKIT Notebooks come pre-configured with Spark support. However, depending on whether your code is running interactively in JupyterLab or being executed via an Airflow Papermill job, the way you initialize the Spark session differs.
This guide explains how to create a Spark session that connects to your Data Lakehouse (STACKIT Dremio/Iceberg).
Interaction Modes
There are two primary ways your notebook will run:
- Interactive Mode: You are manually executing cells in the JupyterLab UI. You need to provide credentials (like a Personal Access Token) to authenticate with the catalog.
- Papermill/Production Mode: The notebook is running via STACKIT Workflows (Apache Airflow). The environment is already pre-configured with the necessary tokens and catalog URI.
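The distinction comes down to a single environment check: STACKIT Workflows sets the `STACKIT__PAPERMILL` variable, while JupyterLab does not. As a minimal sketch (the helper name is ours, for illustration only):

```python
import os

def detect_mode(environ=os.environ) -> str:
    """Return which interaction mode the notebook is running in.

    Airflow/Papermill jobs set STACKIT__PAPERMILL in the environment;
    in an interactive JupyterLab session that variable is absent.
    """
    return "papermill" if "STACKIT__PAPERMILL" in environ else "interactive"
```

The full implementation below uses exactly this check to decide which configuration path to take.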
Implementation
The following code snippet demonstrates how to handle both environments. It detects the STACKIT__PAPERMILL environment variable to decide whether to use automated configuration or manual credential injection.
```python
import os
import stackit_spark

# 1. Check if the code is running via Airflow/Papermill
if "STACKIT__PAPERMILL" in os.environ:
    # In Papermill mode, get_spark() automatically retrieves
    # configuration from the environment/operator.
    CATALOG_NAME = "catalog"
    spark = stackit_spark.get_spark()
else:
    # 2. Interactive Mode: Manual configuration required
    # These variables should be set in your JupyterLab terminal or environment settings
    DREMIO_PAT = os.environ["DREMIO_PAT"]
    DREMIO_ADDRESS = os.environ["DREMIO_ADDRESS"]
    CATALOG_URI = os.environ["CATALOG_URI"]
    CATALOG_NAME = "stackit"
    ICEBERG_VERSION = "1.9.1"

    JAR_PATH = os.path.abspath("./authmgr-oauth2-runtime-0.1.2.jar")

    packages = (
        f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-azure-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VERSION},"
        f"org.apache.iceberg:iceberg-gcp-bundle:{ICEBERG_VERSION}"
    )

    spark = stackit_spark.get_spark(
        additional_config={
            "spark.jars.packages": packages,
            "spark.jars": JAR_PATH,
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "spark.driver.userClassPathFirst": "true",

            # --- Iceberg Catalog Configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}": "org.apache.iceberg.spark.SparkCatalog",
            f"spark.sql.catalog.{CATALOG_NAME}.type": "rest",
            f"spark.sql.catalog.{CATALOG_NAME}.warehouse": "default",
            f"spark.sql.catalog.{CATALOG_NAME}.uri": CATALOG_URI,

            # Bypass location validation checks
            f"spark.sql.catalog.{CATALOG_NAME}.check-location": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.validation.check-location-exists": "false",
            f"spark.sql.catalog.{CATALOG_NAME}.metadata.check-location-exists": "false",

            # --- OAuth2 Authentication ---
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.type": "com.dremio.iceberg.authmgr.oauth2.OAuth2Manager",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-auth": "none",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-endpoint": f"https://{DREMIO_ADDRESS}/oauth/token",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.grant-type": "urn:ietf:params:oauth:grant-type:token-exchange",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.client-id": "dremio",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.scope": "dremio.all",
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token": DREMIO_PAT,
            f"spark.sql.catalog.{CATALOG_NAME}.rest.auth.oauth2.token-exchange.subject-token-type": "urn:ietf:params:oauth:token-type:dremio:personal-access-token",

            # --- STS/Vended Credentials Configuration ---
            f"spark.sql.catalog.{CATALOG_NAME}.header.X-Iceberg-Access-Delegation": "vended-credentials",
            f"spark.sql.catalog.{CATALOG_NAME}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"spark.sql.catalog.{CATALOG_NAME}.s3.vended-credentials-enabled": "true",
        }
    )

# Verify the session by selecting the catalog and listing its namespaces
spark.sql(f"USE {CATALOG_NAME}").show(truncate=False)
spark.sql("SHOW NAMESPACES").show()

# Create an Iceberg table, insert rows, and read them back
spark.sql(f"CREATE TABLE IF NOT EXISTS {CATALOG_NAME}.test.customer (id INT, name STRING, age INT) USING iceberg")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
spark.sql(f"INSERT INTO {CATALOG_NAME}.test.customer VALUES (1, 'Daniel', 30), (2, 'Jane', 25)")
spark.sql(f"SELECT * FROM {CATALOG_NAME}.test.customer ORDER BY id").show()
```

How to use Spark interactively
1. Set up Environment Variables

   Open a Terminal in JupyterLab and export your Dremio Personal Access Token (PAT):

   ```shell
   export DREMIO_PAT="your-token-here"
   export DREMIO_ADDRESS=your-dremio-url
   export CATALOG_URI=your-catalog-uri
   ```

   Alternatively, you can set these environment variables in a code cell via `%env DREMIO_PAT="your-token-here"`.

2. Import Libraries

   The `stackit_spark` library is a STACKIT-specific wrapper around `pyspark`. It handles the underlying configuration required to talk to the Spark cluster.

3. Define your Catalog

   In the code above, we define `CATALOG_NAME` as `stackit`. This acts as a namespace within your Spark SQL queries (e.g., `SELECT * FROM stackit.DB.TABLE`).

4. Execute SQL Queries

   Once the `spark` object is initialized, you can use `spark.sql()` to interact with your Iceberg tables exactly like a standard SQL database.
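Since a missing variable in interactive mode only surfaces as a `KeyError` when the session is built, it can help to fail fast with a clear message. A minimal sketch, reusing the variable names from the snippet above (the helper itself is hypothetical, not part of `stackit_spark`):

```python
import os

REQUIRED_VARS = ("DREMIO_PAT", "DREMIO_ADDRESS", "CATALOG_URI")

def check_interactive_env(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example: with only the PAT set, the other two are reported as missing
check_interactive_env({"DREMIO_PAT": "..."})  # -> ['DREMIO_ADDRESS', 'CATALOG_URI']
```

You could call this at the top of the interactive branch and raise a `RuntimeError` listing the missing names before attempting to create the session.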
Integration with Dremio
The configuration uses the Iceberg REST catalog type. This ensures that any table you create or modify via Spark is immediately visible and readable within Dremio, provided you are using the same warehouse URI.
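Because the catalog speaks the Iceberg REST protocol, you can also sanity-check it outside Spark. A minimal sketch, assuming the catalog serves the spec's `/v1/namespaces` route under `CATALOG_URI` and accepts a bearer token (both helper names are ours, for illustration; adjust the auth header to your setup):

```python
import json
import urllib.request

def namespaces_endpoint(catalog_uri: str) -> str:
    # The Iceberg REST catalog spec lists namespaces under /v1/namespaces
    return catalog_uri.rstrip("/") + "/v1/namespaces"

def list_namespaces(catalog_uri: str, token: str) -> dict:
    # Assumption: the endpoint accepts a bearer token directly;
    # depending on your deployment, a token exchange may be required first.
    req = urllib.request.Request(
        namespaces_endpoint(catalog_uri),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any namespace returned here should match what `SHOW NAMESPACES` reports in Spark and what you see in the Dremio UI.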