DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • How To Work Effectively With JDBC in Java Scripts
  • Getting Started With HarperDB and Java: Your First "Hello, World" Integration
  • The Power of ShardingSphere With Spring Boot
  • Keep Your Application Secrets Secret

Trending

  • Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
  • Beyond Manual Annotation: Engineering Self-Correcting Pseudo-Labeling Pipelines
  • Using LLMs to Automate Data Cleaning and Transformation Pipelines
  • When Snowflake Lies to You: Understanding False Failures in dbt Pipelines
  1. DZone
  2. Data Engineering
  3. Databases
  4. Using SingleStore for Iceberg Catalog Storage

Using SingleStore for Iceberg Catalog Storage

SingleStore recently announced bi-directional support for Apache Iceberg. In this short article, we'll implement an Iceberg catalog using SingleStore and JDBC.

By 
Akmal Chaudhri user avatar
Akmal Chaudhri
DZone Core CORE ·
Jul. 29, 24 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
4.6K Views

Join the DZone community and get the full member experience.

Join For Free

SingleStore recently announced bi-directional support for Apache Iceberg. Iceberg uses catalogs that are an integral part of the Iceberg table format, designed to manage large-scale tabular data in a more efficient and reliable way. Catalogs store metadata and track the location of tables, enabling data discovery, access, and management. Iceberg supports multiple catalog backends, including Hive Metastore, AWS Glue, and Hadoop, and through a database system using JDBC. This allows users to choose the most suitable backend for their specific data infrastructure. In this short article, we'll implement an Iceberg catalog using SingleStore and JDBC.

The notebook file used in this article is available on GitHub.

Introduction

The JDBC catalog in Apache Iceberg is a specialized catalog implementation that uses a relational database system to store metadata about Iceberg tables. This option uses the transactions and scalability of relational database systems to manage and query metadata efficiently. The JDBC catalog provides a good choice for environments where relational database systems are already in use or preferred. The JDBC connection needs to support atomic transactions.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iceberg Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iceberg-demo
  • Size: S-00

We'll make a note of the password and store it in the secrets vault using the name password.

Import the Notebook

We'll download the notebook from GitHub.

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

Run the Notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.

We'll use Apache Spark to create a tiny Iceberg Lakehouse in the SingleStore portal for testing purposes.

For production environments, please use a robust file system for your Lakehouse.

For the SparkSession, we'll need two packages (SingleStore JDBC Client and Iceberg Spark Runtime), as follows:

Python
 
from pyspark.sql import SparkSession

# List of Maven coordinates for all required packages
maven_packages = [
    "com.singlestore:singlestore-jdbc-client:1.2.3",
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2"
]

# Create Spark session with all required packages
spark = (SparkSession
             .builder
             .config("spark.jars.packages", ",".join(maven_packages))
             .appName("Spark Iceberg Catalog Test")
             .getOrCreate()
        )

spark.sparkContext.setLogLevel("ERROR")


In the Iceberg Lakehouse, we'll store the Iris flower data set. We'll first download the Iris CSV file into a Pandas Dataframe and then convert this to a Spark Dataframe.

We'll need to create a SingleStore database to use with Iceberg:

SQL
 
DROP DATABASE IF EXISTS iceberg;
CREATE DATABASE IF NOT EXISTS iceberg;


A quick and easy way to find the connection details for the database is to use the following:

Python
 
from sqlalchemy import *

db_connection = create_engine(connection_url)
url = db_connection.url


The url will contain the host, the port, and the database name. We can use all these details to configure Spark:

Python
 
spark.conf.set("spark.sql.catalog.s2_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.s2_catalog.type", "jdbc")
spark.conf.set("spark.sql.catalog.s2_catalog.warehouse", "warehouse")

# SSL/TLS configuration
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.useSSL", "true")
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.trustServerCertificate", "true")

# JDBC connection URL
spark.conf.set("spark.sql.catalog.s2_catalog.uri", f"jdbc:singlestore://{url.host}:{url.port}/{url.database}")

# JDBC credentials
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.user", "admin")
spark.conf.set("spark.sql.catalog.s2_catalog.jdbc.password", password)


Finally, we can test our setup.

First, we'll store the data from the Spark Dataframe in the Lakehouse, partitioned by Species:

Python
 
(iris_df.write
    .format("iceberg")
    .partitionBy("species")
    .save("s2_catalog.db.iris")
)


Next, we'll check what's stored, as follows:

Python
 
spark.sql("""
    SELECT file_path, file_format, partition, record_count
    FROM s2_catalog.db.iris.files
""").show()


Example output:

Plain Text
 
+--------------------+-----------+-----------------+------------+
|           file_path|file_format|        partition|record_count|
+--------------------+-----------+-----------------+------------+
|warehouse/db/iris...|    PARQUET| {Iris-virginica}|          50|
|warehouse/db/iris...|    PARQUET|    {Iris-setosa}|          50|
|warehouse/db/iris...|    PARQUET|{Iris-versicolor}|          50|
+--------------------+-----------+-----------------+------------+


We can run queries on our tiny Lakehouse:

Python
 
spark.sql("""
    SELECT * FROM s2_catalog.db.iris LIMIT 5
""").show()


We'll now delete all Iris-virginica records:

Python
 
spark.sql("""
    DELETE FROM s2_catalog.db.iris
    WHERE species = 'Iris-virginica'
""")


and check the Lakehouse:

Python
 
spark.sql("""
    SELECT file_path, file_format, partition, record_count
    FROM s2_catalog.db.iris.files
""").show()


Example output:

Plain Text
 
+--------------------+-----------+-----------------+------------+
|           file_path|file_format|        partition|record_count|
+--------------------+-----------+-----------------+------------+
|warehouse/db/iris...|    PARQUET|    {Iris-setosa}|          50|
|warehouse/db/iris...|    PARQUET|{Iris-versicolor}|          50|
+--------------------+-----------+-----------------+------------+


We can also check the metadata stored in SingleStore:

SQL
 
SELECT * FROM iceberg_tables;


Example output:

Plain Text
 
+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| catalog_name | table_namespace | table_name | metadata_location                                                                   | previous_metadata_location                                                          |
+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| s2_catalog   | db              | iris       | warehouse/db/iris/metadata/00001-6ea55045-6162-4462-9f8c-597ddbc5b846.metadata.json | warehouse/db/iris/metadata/00000-39743969-9e4b-4875-81ad-d8310656d28f.metadata.json |
+--------------+-----------------+------------+-------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+


Summary

In this short article, we've seen how to configure SingleStore to manage an Iceberg Lakehouse catalog. Using a simple example, we've run some queries on our Lakehouse and SingleStore has managed the metadata for us using JDBC.

Java Database Connectivity SingleStore

Published at DZone with permission of Akmal Chaudhri. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • How To Work Effectively With JDBC in Java Scripts
  • Getting Started With HarperDB and Java: Your First "Hello, World" Integration
  • The Power of ShardingSphere With Spring Boot
  • Keep Your Application Secrets Secret

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook