Vibe Coding Part 3 — Building a Data Quality Framework in Scala and PySpark

How GenAI-assisted “vibe coding” can simplify building PySpark wrappers over Scala Spark libraries when you know what you’re doing.

By Bipin Patwardhan · Jan. 23, 26 · Opinion
A couple of years ago, I developed a class-based Data Quality (DQ) framework in Scala. Some time later, I became involved in an accelerator framework that covered many data engineering tasks, data quality among them. After considering multiple factors, it was decided that the framework would be built in Python. Because we wanted to handle large datasets and ensure scalability, we decided to implement the data quality framework in PySpark.

Using the Scala-based DQ framework as a stepping stone and foundation, I developed an equivalent DQ framework in PySpark. The framework was successful, with the DQ functionality implemented as a Python library and packaged as a wheel.

After completing the implementation — and after coming across libraries like AWS Deequ — I wanted to understand and overcome the Scala Spark / Python Spark duality: why maintain two versions of the same library, one in Scala and one in Python? I read the relevant documentation and implemented another PySpark DQ library. This time, I wrote it as a wrapper around the underlying and existing Scala Spark library. This too was a success.
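The wrapper approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library described here: the Scala class `com.example.dq.NullCheck` and its `countNulls` method are hypothetical names, and `_jvm` and `_jdf` are PySpark's internal (non-public) handles to the JVM gateway and the underlying Java DataFrame, so they may change between versions and are not available under Spark Connect.

```python
class NullCheck:
    """Thin Python facade over a hypothetical Scala class com.example.dq.NullCheck."""

    def __init__(self, spark):
        # spark._jvm is PySpark's internal Py4J view of the driver JVM.
        # The JAR containing the Scala class must already be on the classpath.
        self._checker = spark._jvm.com.example.dq.NullCheck()

    def count_nulls(self, df, column: str) -> int:
        # df._jdf is the Java DataFrame behind the Python DataFrame. Only this
        # reference crosses the Py4J bridge; the rows stay inside the JVM.
        return int(self._checker.countNulls(df._jdf, column))
```

The Python side holds no DQ logic of its own: it forwards the call and converts the result, so a single Scala implementation can serve both languages.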

The next natural step was to compare the performance of the native PySpark library with that of the Scala Spark library and the PySpark wrapper library. The comparison made sense because Spark is written in Scala: a PySpark program runs in a Python interpreter, which in turn starts a JVM and communicates with it, adding a small delay on top of the overhead introduced by Python itself.

After running a couple of tests, I found that the performance of the PySpark program was (almost) on par with the Scala Spark version. Then the penny dropped. The performance is similar primarily because the heavy lifting of the DQ logic is done in Spark, with Python providing a thin wrapper. Even the native PySpark version of the library is essentially a thin wrapper, with the actual work done by Spark.

With that (long) context in place, let me come back to the flavor of the season: vibe coding. One day, after launching Copilot on a whim, I asked it the following question:

Generate a class to perform arithmetic operations, written in Scala.
This class has to be exposed as Pyspark object.
Generate the pyspark class such that it uses the Scala class


Copilot gave me the following response (as is, no edits):

To create a Scala class for arithmetic operations and expose it to
PySpark, you'll need to follow these steps:

---

### **1. Scala Class for Arithmetic Operations**

Create a Scala class that performs basic arithmetic operations:

```scala
package com.example.arithmetic

class ArithmeticOperations extends Serializable {
  def add(a: Double, b: Double): Double = a + b
  def subtract(a: Double, b: Double): Double = a - b
  def multiply(a: Double, b: Double): Double = a * b
  def divide(a: Double, b: Double): Double = {
    require(b != 0, "Division by zero is not allowed")
    a / b
  }
}
```

Compile this class into a JAR file and make sure it's available
in your Spark environment (e.g., via `--jars` or `spark.jars.packages`).

---

### **2. Expose Scala Class to PySpark via Py4J**

In PySpark, you can access the Scala class using the JVM gateway:

```python
from pyspark.sql import SparkSession

class PyArithmeticOperations:
    JVM
        self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        return self.jvm_obj.add(a, b)

    def subtract(self, a: float, b: float) -> float:
        return self.jvm_obj.subtract(a, b)

    def multiply(self, a: float, b: float) -> float:
        return self.jvm_obj.multiply(a, b)

    def divide(self, a: float, b: float) -> float:
        return self.jvm_obj.divide(a, b)
```

---

### **3. Usage Example in PySpark**

```python
spark = SparkSession.builder \
    .appName("ArithmeticExample") \
    .config("spark.jars", "/path/to/arithmetic-operations.jar") \
    .getOrCreate()

arith = PyArithmeticOperations(spark)

print("Add:", arith.add(10, 5))
print("Subtract:", arith.subtract(10, 5))
print("Multiply:", arith.multiply(10, 5))
print("Divide:", arith.divide(10, 5))
```

---

Would you like help compiling the Scala code into a JAR or
setting up the PySpark environment to test this?


Although I have not compiled or run this code, I am fairly confident it will work as expected once the definition of the Python class is fixed (its constructor line is garbled in the response). From my perspective, this response is a positive example of vibe coding.
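One plausible repair of the Python class is sketched below. It assumes the garbled line was a constructor that takes the Spark session, which is how the usage example in the response calls it. The explicit `float(...)` conversions are an addition not present in the generated code, made so that integer arguments such as `10` and `5` reach the Scala `Double` parameters as doubles. Like the generated code, the sketch relies on the internal `spark._jvm` attribute and has not been verified against a real Spark installation.

```python
class PyArithmeticOperations:
    """Python facade over the Scala class com.example.arithmetic.ArithmeticOperations."""

    def __init__(self, spark):
        # Assumed reconstruction of the missing constructor: instantiate the
        # Scala class through the JVM gateway exposed by the Spark session.
        self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        return self.jvm_obj.add(float(a), float(b))

    def subtract(self, a: float, b: float) -> float:
        return self.jvm_obj.subtract(float(a), float(b))

    def multiply(self, a: float, b: float) -> float:
        return self.jvm_obj.multiply(float(a), float(b))

    def divide(self, a: float, b: float) -> float:
        # The Scala side rejects b == 0 via require(...); Py4J surfaces that
        # failure to Python as an exception.
        return self.jvm_obj.divide(float(a), float(b))
```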

Note that even in this case, I knew exactly what I wanted the GenAI assistant to do. Because I had already written similar code, I was confident in evaluating the generated output.

Remember that giving the assistant proper directions is key. Also remember that a single prompt is very unlikely to solve the problem at hand. You will need to engage with the GenAI assistant and keep refining the instructions until you get what you want, or at least something that satisfies most of your needs and that you can build upon.

Published at DZone with permission of Bipin Patwardhan. See the original article here.