
Databricks Delta Lake Using Java


In this article, see the advantages that Databricks' open-source Delta Lake can bring to your next application.


Delta Lake is an open-source release by Databricks that provides a transactional storage layer on top of data lakes. In real-world systems, the data lake can be Amazon S3, Azure Data Lake Store/Azure Blob Storage, Google Cloud Storage, or the Hadoop Distributed File System (HDFS).

Delta Lake acts as a storage layer that sits on top of a data lake and brings additional features that a data lake alone cannot provide.

The key features of Delta Lake are as follows:

ACID transactions: In real-world data engineering applications, many concurrent pipelines for different business data domains read and update the data lake at the same time. Because a plain data lake lacks transaction support, this can compromise data integrity. Delta Lake brings transactional features that follow the ACID properties, keeping the data consistent.

Data Versioning: Delta Lake maintains multiple versions of the data for all transactional operations, which enables developers to work with a specific version whenever required.

Schema Enforcement: This feature is also called schema validation. Delta Lake validates the schema before writing data, ensuring that data types are correct and required columns are present, which prevents invalid data from entering the table and maintains data quality.

Schema Evolution: Schema evolution allows users to easily change a table's current schema to accommodate data that changes over time. It is most commonly used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns. Data engineers and scientists can use this option to add new columns to their existing machine learning production tables without breaking existing models that rely on the old columns.

Efficient data format: All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

Scalable metadata handling: In big data applications, even the metadata can be "big data." Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all of it.

Unified layer for batch and stream processing: A Delta Lake table can act as both a source and a sink for batch processing as well as stream processing. This is an improvement over the traditional Lambda architecture, which requires separate pipelines for batch and streaming (a brief streaming sketch follows this feature list).

Fully compatible with the Spark framework: Delta Lake is fully compatible with and easily integrated into Apache Spark, a highly distributed and scalable big data framework.
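
As a brief illustration of the streaming capability mentioned above, the sketch below reads one Delta table as a streaming source and writes it to another Delta table as a streaming sink. The table paths and checkpoint location are illustrative assumptions, not values from a real deployment.

Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class DeltaStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaStreamingSketch")
                .master("local[*]")
                .getOrCreate();

        // Use an existing Delta table (illustrative path) as a streaming source
        Dataset<Row> source = spark.readStream()
                .format("delta")
                .load("/tmp/delta/events");

        // Continuously append the stream into another Delta table (the sink)
        StreamingQuery query = source.writeStream()
                .format("delta")
                .outputMode("append")
                .option("checkpointLocation", "/tmp/delta/_checkpoints/events-copy")
                .start("/tmp/delta/events-copy");

        query.awaitTermination();
    }
}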

The logical view is represented below:

Application architecture


Generally, we use Scala for developing Spark applications. However, since there is a vast number of developers in the Java community and Java still rules the enterprise world, we can also develop Spark applications in Java: Spark provides complete compatibility for Java programs and runs on the JVM.

We will see how we can leverage Delta Lake features in Java using Spark.
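
The listing below is a minimal sketch of such a program. It assumes a local SparkSession, the Employee POJO shown later in this article, and an illustrative table path of /tmp/delta/employee; the sample records are made-up test data.

Java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeWriter {
    public static void main(String[] args) {
        // Create a local Spark session
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeWriter")
                .master("local[*]")
                .getOrCreate();

        // Initialize the Employee POJO with some test data
        List<Employee> employees = Arrays.asList(
                new Employee(1, "John", "Sales"),
                new Employee(2, "Jane", "Marketing"));

        Dataset<Row> df = spark.createDataFrame(employees, Employee.class);

        // Save the data in Delta Lake format ("delta" instead of "parquet")
        df.write().format("delta").save("/tmp/delta/employee");

        // Read the data back from the Delta table and display it
        spark.read().format("delta").load("/tmp/delta/employee").show();

        spark.stop();
    }
}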

The above Java program uses the Spark framework to read employee data and save it in Delta Lake.

To leverage Delta Lake features, the Spark read and write formats have to be changed from "parquet" to "delta", as shown in the above program.

In this example, we create an Employee POJO and initialize it with some test data. In a real application, this data would be populated from files in the data lake or from external data sources.

The output of the program is shown below:

Initial output

In order to run this as a Maven project, add the below dependencies to the pom.xml:

XML
 




        <dependency>
            <groupId>io.delta</groupId>
            <artifactId>delta-core_2.11</artifactId>
            <version>0.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>



New data can be appended to the existing Delta table using the "append" save mode, as shown below.
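
A minimal sketch of the append, continuing with the same assumed Employee POJO and table path as above:

Java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeAppend {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeAppend")
                .master("local[*]")
                .getOrCreate();

        // New employee records to be added to the existing Delta table
        List<Employee> newEmployees = Arrays.asList(
                new Employee(3, "Mike", "Finance"),
                new Employee(4, "Sara", "Engineering"));

        Dataset<Row> df = spark.createDataFrame(newEmployees, Employee.class);

        // "append" adds the new records instead of failing because the table exists
        df.write().format("delta").mode("append").save("/tmp/delta/employee");

        spark.read().format("delta").load("/tmp/delta/employee").show();

        spark.stop();
    }
}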

The output of the above program is shown below:

The new data is appended to the table. There are now two versions of the data: the first version is the original data, and the table with the appended records forms the new version.

The below program illustrates how to retrieve the data for the first version. The "versionAsOf" option fetches the corresponding version of the data from the Delta Lake table.
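
A minimal sketch of reading the first version (version 0) of the assumed table:

Java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeTimeTravel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeTimeTravel")
                .master("local[*]")
                .getOrCreate();

        // "versionAsOf" reads the table as it was at the given version (0 = original data)
        Dataset<Row> firstVersion = spark.read()
                .format("delta")
                .option("versionAsOf", 0)
                .load("/tmp/delta/employee");

        firstVersion.show();

        spark.stop();
    }
}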

The output corresponds to the first version of data:

Initial program output

We can also overwrite the Delta table using the "overwrite" mode. This replaces the old data with new data in the table, but the versions are still maintained, and the old data can be fetched by its version number.
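
A minimal sketch of an overwrite on the assumed table:

Java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeOverwrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeOverwrite")
                .master("local[*]")
                .getOrCreate();

        List<Employee> employees = Arrays.asList(
                new Employee(10, "Adam", "Support"));

        Dataset<Row> df = spark.createDataFrame(employees, Employee.class);

        // "overwrite" replaces the current data, but the older versions remain
        // in the Delta log and can still be read with "versionAsOf"
        df.write().format("delta").mode("overwrite").save("/tmp/delta/employee");

        spark.read().format("delta").load("/tmp/delta/employee").show();

        spark.stop();
    }
}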

The Employee POJO used in the above examples is given below.
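
A minimal version of such a POJO, with assumed field names empId and empName alongside the deptName field referenced later in this article:

Java

import java.io.Serializable;

public class Employee implements Serializable {

    private int empId;       // assumed field name
    private String empName;  // assumed field name
    private String deptName; // department name, referenced later in this article

    // A no-argument constructor is required for Spark's bean encoding
    public Employee() {
    }

    public Employee(int empId, String empName, String deptName) {
        this.empId = empId;
        this.empName = empName;
        this.deptName = deptName;
    }

    public int getEmpId() { return empId; }
    public void setEmpId(int empId) { this.empId = empId; }

    public String getEmpName() { return empName; }
    public void setEmpName(String empName) { this.empName = empName; }

    public String getDeptName() { return deptName; }
    public void setDeptName(String deptName) { this.deptName = deptName; }
}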

Now, we change one of the field names in the Employee POJO: the "deptName" field is renamed to "deptId".

The new POJO is as follows:
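
A sketch of the modified POJO, with deptName renamed to deptId (the other field names remain the assumed ones from above):

Java

import java.io.Serializable;

public class Employee implements Serializable {

    private int empId;
    private String empName;
    private String deptId; // renamed from deptName

    public Employee() {
    }

    public Employee(int empId, String empName, String deptId) {
        this.empId = empId;
        this.empName = empName;
        this.deptId = deptId;
    }

    public int getEmpId() { return empId; }
    public void setEmpId(int empId) { this.empId = empId; }

    public String getEmpName() { return empName; }
    public void setEmpName(String empName) { this.empName = empName; }

    public String getDeptId() { return deptId; }
    public void setDeptId(String deptId) { this.deptId = deptId; }
}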

Now, we try to append data to the existing Delta table using the modified schema.
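
A minimal sketch of the attempted append with the modified POJO:

Java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeSchemaEnforcement {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeSchemaEnforcement")
                .master("local[*]")
                .getOrCreate();

        // Records built from the modified POJO, which now has a deptId column
        List<Employee> employees = Arrays.asList(
                new Employee(5, "David", "D101"));

        Dataset<Row> df = spark.createDataFrame(employees, Employee.class);

        // Schema enforcement: this write fails with a schema mismatch error,
        // because the existing Delta table has deptName, not deptId
        df.write().format("delta").mode("append").save("/tmp/delta/employee");

        spark.stop();
    }
}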

When we try to run the above program, the schema enforcement feature prevents the data from being written to the Delta table. Delta Lake validates the schema before appending new data to the existing table, and if the schema of the new data does not match, it throws an error describing the mismatch.

The below error occurs when we try to run the above program:

Error output

Now, we will add a new field, "sectionName", to the existing schema of the Employee table and try to append data to the existing Delta Lake table.

Delta Lake's schema evolution feature accommodates the new or modified schema of the data being saved dynamically, without any explicit DDL operations. This is achieved by using the "mergeSchema" option while saving the data to the Delta table, as shown in the below program.
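
A minimal sketch of the schema evolution write, assuming the Employee POJO has been extended with a sectionName field (and a matching constructor):

Java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeSchemaEvolution {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DeltaLakeSchemaEvolution")
                .master("local[*]")
                .getOrCreate();

        // Assumes Employee now has empId, empName, deptName and the new sectionName field
        List<Employee> employees = Arrays.asList(
                new Employee(5, "David", "Sales", "Section-A"));

        Dataset<Row> df = spark.createDataFrame(employees, Employee.class);

        // "mergeSchema" evolves the table schema to include sectionName;
        // existing records get null for the new column
        df.write()
                .format("delta")
                .mode("append")
                .option("mergeSchema", "true")
                .save("/tmp/delta/employee");

        spark.read().format("delta").load("/tmp/delta/employee").show();

        spark.stop();
    }
}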

The output of the above program is shown below:

Program output

As per the above output, the "sectionName" field is merged into the existing schema of the Delta Lake table, and the new column is populated with null values for the records that already existed in the table.

In this way, we can leverage Delta Lake features in real-world applications, bringing transactional, historical data maintenance, and schema management capabilities to existing data lakes.
