Vibe Coding Part 3 — Building a Data Quality Framework in Scala and PySpark

How GenAI-assisted “vibe coding” can simplify building PySpark wrappers over Scala Spark libraries when you know what you’re doing.

Bipin Patwardhan

Jan. 23, 26 · Opinion

A couple of years ago, I developed a class-based Data Quality (DQ) framework in Scala. Some time later, I joined a team building an accelerator framework that covered many data engineering tasks, data quality among them. After weighing multiple factors, the team decided to build the accelerator in Python. Because we needed to handle large datasets and scale, we chose to implement the data quality framework in PySpark.

Using the Scala-based DQ framework as a stepping stone and foundation, I developed an equivalent DQ framework in PySpark. The framework was successful, with the DQ functionality implemented as a Python library and packaged as a wheel.

After completing the implementation, and after coming across libraries like AWS Deequ, I wanted to understand and overcome the Scala Spark / Python Spark duality: why maintain two versions of the same library, one in Scala and one in Python? I read the relevant documentation and implemented another PySpark DQ library, this time as a wrapper around the existing Scala Spark library. This too was a success.

The next natural step was to compare the performance of the three implementations: the Scala Spark library, the native PySpark library, and the PySpark wrapper library. The comparison made sense because Spark is written in Scala. A PySpark program runs in a Python interpreter, which starts a JVM and communicates with it, adding a small delay on top of the overhead of Python itself.

After running a couple of tests, I found that the performance of the PySpark program was almost on par with the Scala Spark version. Then the penny dropped: performance was similar because the heavy lifting of the DQ logic was being done in Spark, with Python providing only a thin wrapper. Even the native PySpark version of the library is essentially a thin wrapper, with the actual work being done by Spark.
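
A comparison like this can be run with a very small timing harness. The sketch below is only an illustration, not the harness used for the tests described here: `time_runs` and `sample_job` are hypothetical names, and the placeholder workload stands in for a function that would run the DQ checks and force a Spark action.

```python
import statistics
import time


def time_runs(job, repeats=5):
    """Run `job` several times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Placeholder workload (hypothetical). In practice this would be a function
# that invokes one variant of the DQ library and triggers a Spark action.
def sample_job():
    return sum(i * i for i in range(100_000))


median_seconds = time_runs(sample_job)
print(f"median over 5 runs: {median_seconds:.4f}s")
```

Taking the median over several runs reduces the influence of one-off costs such as JVM startup, which would otherwise skew a comparison between the Scala and PySpark variants.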

With that long context out of the way, back to the flavor of the season: vibe coding. One day, after launching Copilot on a whim, I gave it the following prompt:

    Generate a class to perform arithmetic operations, written in Scala.
    This class has to be exposed as Pyspark object.
    Generate the pyspark class such that it uses the Scala class

Copilot gave me the following response (as is, no edits):

To create a Scala class for arithmetic operations and expose it to
PySpark, you'll need to follow these steps:

---

### **1. Scala Class for Arithmetic Operations**

Create a Scala class that performs basic arithmetic operations:

```scala
package com.example.arithmetic

class ArithmeticOperations extends Serializable {
  def add(a: Double, b: Double): Double = a + b
  def subtract(a: Double, b: Double): Double = a - b
  def multiply(a: Double, b: Double): Double = a * b
  def divide(a: Double, b: Double): Double = {
    require(b != 0, "Division by zero is not allowed")
    a / b
  }
}
```

Compile this class into a JAR file and make sure it's available
in your Spark environment (e.g., via `--jars` or `spark.jars.packages`).

---

### **2. Expose Scala Class to PySpark via Py4J**

In PySpark, you can access the Scala class using the JVM gateway:

```python
from pyspark.sql import SparkSession

class PyArithmeticOperations:
    JVM
        self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        return self.jvm_obj.add(a, b)

    def subtract(self, a: float, b: float) -> float:
        return self.jvm_obj.subtract(a, b)

    def multiply(self, a: float, b: float) -> float:
        return self.jvm_obj.multiply(a, b)

    def divide(self, a: float, b: float) -> float:
        return self.jvm_obj.divide(a, b)
```

---

### **3. Usage Example in PySpark**

```python
spark = SparkSession.builder \
    .appName("ArithmeticExample") \
    .config("spark.jars", "/path/to/arithmetic-operations.jar") \
    .getOrCreate()

arith = PyArithmeticOperations(spark)

print("Add:", arith.add(10, 5))
print("Subtract:", arith.subtract(10, 5))
print("Multiply:", arith.multiply(10, 5))
print("Divide:", arith.divide(10, 5))
```

---

Would you like help compiling the Scala code into a JAR or
setting up the PySpark environment to test this?
  

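I have not compiled or run this response, but I am fairly confident it will work as expected once the definition of the Python class is fixed. As generated, the constructor is malformed: a stray `JVM` line sits where an `__init__` method that accepts the Spark session should be. From my perspective, this response is a positive example of vibe coding.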
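
For reference, here is a minimal sketch of how the Python class might look with the constructor repaired. The `__init__` signature is an assumption inferred from the usage example, which calls `PyArithmeticOperations(spark)`. So that the sketch runs without Spark or the compiled JAR, it uses a hypothetical pure-Python stand-in (`_FakeArithmeticOperations` and `fake_spark`) in place of the real session; with a real `SparkSession` and the JAR on the classpath, you would pass the session instead.

```python
from types import SimpleNamespace


class PyArithmeticOperations:
    """Thin Python wrapper that delegates to the Scala ArithmeticOperations class."""

    def __init__(self, spark):
        # Assumed reconstruction of the malformed constructor: instantiate the
        # Scala class through the session's JVM gateway (spark._jvm).
        self.jvm_obj = spark._jvm.com.example.arithmetic.ArithmeticOperations()

    def add(self, a: float, b: float) -> float:
        # Convert to float so the arguments line up with the Scala Double parameters.
        return self.jvm_obj.add(float(a), float(b))

    def divide(self, a: float, b: float) -> float:
        return self.jvm_obj.divide(float(a), float(b))


class _FakeArithmeticOperations:
    """Hypothetical stand-in for the Scala class, used only to run this sketch."""

    def add(self, a, b):
        return a + b

    def divide(self, a, b):
        if b == 0:
            raise ValueError("Division by zero is not allowed")
        return a / b


# Mimic the attribute path spark._jvm.com.example.arithmetic.ArithmeticOperations.
fake_spark = SimpleNamespace(
    _jvm=SimpleNamespace(
        com=SimpleNamespace(
            example=SimpleNamespace(
                arithmetic=SimpleNamespace(
                    ArithmeticOperations=_FakeArithmeticOperations
                )
            )
        )
    )
)

arith = PyArithmeticOperations(fake_spark)
print("Add:", arith.add(10, 5))
print("Divide:", arith.divide(10, 5))
```

The remaining methods (`subtract`, `multiply`) would follow the same delegation pattern. The wrapper holds no logic of its own, which is the point: all the work stays in the Scala class.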

Note that even in this case, I knew exactly what I wanted the GenAI assistant to do. Since I had already written similar code, I was confident in evaluating the generated output.

Remember that giving the assistant proper directions is key. Also remember that a single prompt is very unlikely to solve the problem at hand. You will need to engage with the GenAI assistant and keep refining the instructions until you get what you want, or at least something that satisfies most of your needs and that you can build upon.

Published at DZone with permission of Bipin Patwardhan. See the original article here.

Opinions expressed by DZone contributors are their own.
