
Detailed Guide: Setup Apache Spark Development Environment on Windows 10


In this article, we discuss how to set up a local Apache Spark development environment on Windows 10.


Introduction

In my last article, I covered how to set up and use Hadoop on Windows. This article is all about configuring a local development environment for Apache Spark on Windows OS.

Apache Spark is the most popular cluster computing technology, designed for fast and reliable computation. It provides implicit data parallelism and default fault tolerance. It integrates easily with HIVE and HDFS and provides a seamless experience of parallel data processing. You can read more about Spark at https://spark.apache.org.

By default, Spark SQL projects do not run on Windows OS and require us to perform some basic setup first; that’s all we are going to discuss in this article, as I didn’t find it well documented anywhere on the internet or in books.

This article can also be used for setting up a Spark development environment on Mac or Linux. Just make sure you download the correct version for your OS from Spark's website.

You can refer to the Scala project used in this article from GitHub here: https://github.com/gopal-tiwari/LocalSparkSql.

What to Expect

At the end of this article, you should be able to create/run your Spark SQL projects and spark-shell on Windows OS.

I have divided this article into three parts. You can follow any of the three modes depending on your specific use case.


Spark Local Development Setup Modes

  1. Single Project Access (Single Project Single Connection)

    Setup Time: 15 Minutes
    Functionality: Limited

    • Every project will have its own metastore and warehouse.
    • Databases and tables created by one project will not be accessible by other projects.
    • Only one Spark SQL project can run or execute at a time.
  2. Multi Project Access (Multi Project Single Connection)

    Setup Time: 20 Minutes
    Functionality: Extended

    • Every project will share a common metastore and warehouse.
    • Tables created by one project will be accessible by other projects or spark-shell.
    • It will provide a pseudo-cluster-like feel.
    • Only one Spark SQL project can run or execute at a time.
  3. Full Cluster-Like Access (Multi Project Multi Connection)

    Setup Time: 40 Minutes
    Functionality: Full

    • This configuration is a bit tedious, but a one-time setup will grant you the ability to have multiple connections open to a metastore.
    • There will be no difference between your local system and a cluster in terms of functionality.
    • Databases and tables will be shared among all Spark projects or shells.
    • You can keep multiple spark-shells or Spark projects running at the same time.

Solution for Spark Errors

Many of you may have tried running Spark on Windows and might have faced the following error while running your project:

16/04/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
16/04/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop
binary path java.io.IOException: Could not locate executable null\bin\winutils.exe
in the Hadoop binaries.

This is because your system does not have native Hadoop binaries for Windows OS.

You can build one by following my previous article or download one from https://github.com/cdarlint/winutils.

The below error is also related to the Native Hadoop Binaries for Windows OS.

16/04/03 19:59:10 ERROR util.Shell: Failed to locate the winutils binary in the
hadoop binary path java.io.IOException: Could not locate executable
C:\hadoop\bin\winutils.exe in the Hadoop binaries.

The solution is the same: we need to set up HADOOP_HOME with native Windows binaries.

So, just follow along with this article, and at the end of this tutorial, you should be able to get rid of all of these errors.
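If you would rather fail fast than decode the stack traces above, a small self-check can be run before any Spark code. This is just an illustrative sketch (the object name HadoopHomeCheck is mine, not part of the official setup); it mirrors the lookup Hadoop performs when it searches for winutils.exe:

import java.io.File

object HadoopHomeCheck {
  def main(args: Array[String]): Unit = {
    // HADOOP_HOME can come from the OS or from the IDE's run configuration
    val hadoopHome = sys.env.get("HADOOP_HOME").orElse(sys.props.get("hadoop.home.dir"))
    hadoopHome match {
      case Some(home) =>
        val winutils = new File(home, "bin" + File.separator + "winutils.exe")
        if (winutils.exists())
          println(s"winutils found at: ${winutils.getAbsolutePath}")
        else
          println(s"HADOOP_HOME is set to $home, but bin\\winutils.exe is missing")
      case None =>
        println("HADOOP_HOME is not set; Spark SQL will fail with the winutils error above")
    }
  }
}

Running this before building a SparkSession tells you immediately whether the setup steps below are still needed.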

Downloading the Required Files

  1. Download and install JDK according to your OS and CPU architecture from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.
  2. Install the Scala version depending upon the Spark version you're using from https://www.scala-lang.org/download/all.html.
  3. Download and install 7-zip from https://www.7-zip.org/download.html.
  4. Download and extract Apache Spark using 7-zip from http://spark.apache.org/downloads.html.
  5. Download zip or clone Hadoop Windows binaries from https://github.com/cdarlint/winutils.
  6. If you do not have an IDE installed, please install one. Intellij IDEA is preferred, and you can get the Community edition from https://www.jetbrains.com/idea/download/#section=windows.
  7. Download Microsoft Visual C++ 2010 Redistributable Package if your system does not have these pre-installed:
    1. https://www.microsoft.com/en-us/download/details.aspx?id=5555.
    2. https://www.microsoft.com/en-us/download/details.aspx?id=14632.

For 32-bit (x86) OSs, you need to install only the first package; for 64-bit (x64) OSs, please install both.

For this tutorial, we are assuming that Spark and Hadoop binaries are unzipped in your C:\ drive. However, you can unzip them at any location in your system.

Setup and Installation

JDK

Before we proceed further, let’s make sure your Java setup is done properly and environment variables are updated with Java's installation directory.

To confirm that Java is installed on your machine, just open cmd and type java -version. You should see the version of Java installed on your system.

If you get an error message like “'java' is not recognized as an internal or external command, operable program or batch file,” then please follow the steps below. Otherwise, skip them.

  1. Execute the downloaded JRE and JDK setups and finish the installation with default settings.
  2. Now, open the system environment variables dialogue:
    • For Windows 7, right-click on My Computer and select Properties > Advanced.
    • For Windows 8, go to Control Panel > System > Advanced System Settings.
    • For Windows 10, go to Control Panel > System and Security > System > Advanced System Settings.
  3. Click the Environment Variables button.
  4. Click the New… button in the System Variables section. Then, type JAVA_HOME in the Variable name field and provide your JDK installation path in the Variable value field.
  5. If the path contains spaces, use the shortened path name, for example, C:\Progra~1\Java\jdk1.8.0_74. For Windows users on 64-bit systems:
    • Progra~1 = 'Program Files'
    • Progra~2 = 'Program Files (x86)'
    It should look like the image below:

    Adding new system variable

  6. Now, click OK.
  7. Then, search for the Path variable in the System Variables section of the Environment Variables dialogue box you just opened.
  8. Click the Edit text or New button and add “;%JAVA_HOME%\bin” at the end of the path.

    Adding "%JAVA_HOME%\bin" to PATH

  9. To confirm the Java installation, open a new cmd and type java -version; you should see the version of Java you just installed.

    Checking version of Java installed

    If your command prompt looks somewhat like the image above, you are good to go. Otherwise, you need to check whether your setup version matches your OS architecture (x86, x64). The environment variable paths could also be incorrect.

Scala

  1. You might need to install Scala, depending on your version of Spark.
  2. For this article, we are going to use Spark 2.4.4 and Scala 2.12.10.
  3. Just execute the downloaded scala-2.12.10.msi and follow the installation instructions.
  4. To confirm the Scala installation, open cmd and type scala -version. Your command prompt should look like the image below:

    Checking Scala version

IntelliJ IDEA Setup

  1. You can set up Eclipse with a Scala plugin or just install IntelliJ IDEA.
  2. You can choose to install the Scala plugin from the first-time setup screen or from Settings > Plugins > search for and install Scala.
  3. If you face VPN or proxy issues during the plugin installation, you can opt for an offline plugin installation.
  4. Make sure the File > Settings > Plugins window shows that the Scala plugin is installed, as in the following image:

    Scala plugin installation

Hadoop Home Setup

  1. Open your downloaded GitHub repo https://github.com/cdarlint/winutils (i.e. winutils-master.zip) using 7-zip.
  2. Inside it, you will find the directory hadoop-2.7.7. Extract it to the C:\ drive.
  3. Your C:\hadoop-2.7.7\bin directory should look like the image below:

    C:\hadoop-2.7.7\bin directory

  4. Once extraction is finished, we need to add a new HADOOP_HOME system environment variable.
  5. To add HADOOP_HOME, open the Environment Variables dialogue box, click the New… button in the System Variables section, and fill in the Name and Value text boxes, as shown in the image below:

    Adding HADOOP_HOME variable

Spark Home Setup

  1. Open your downloaded Spark gz file (i.e. spark-2.4.4-bin-hadoop2.7.gz) using 7-zip.
  2. Inside it, you will find the tar file spark-2.4.4-bin-hadoop2.7. Double-click on it and extract the spark-2.4.4-bin-hadoop2.7 directory to the C:\ drive.

    Copying Hadoop file to C:\ drive

  3. Your C:\spark-2.4.4-bin-hadoop2.7 directory should look like the image below:

    C:\spark-2.4.4-bin-hadoop2.7 directory

  4. Once extraction is finished, we need to add the Spark bin directory to the system “Path” environment variable.
  5. Edit the Path variable and add “C:\spark-2.4.4-bin-hadoop2.7\bin,” as shown in the image below:

    Editing environment variables

NOTE: If you do not have administrator access to add environment variables, don’t worry; you can set them in your IDE for each project separately. The process is explained in a later section.

Single Project Access

Now, let’s create a new Scala-Maven project named "LocalSparkSql." Alternatively, you can clone it from GitHub: https://github.com/gopal-tiwari/LocalSparkSql.

The project structure looks like this:

Initial project structure

LocalSparkHiveTest.scala

Scala




package org.connected.spark

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LocalSparkHiveTest {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .enableHiveSupport()
      .master("local")
      .appName("Demo")
      .getOrCreate()

    spark.sql("SHOW DATABASES").show
    spark.sql("CREATE DATABASE IF NOT EXISTS sparkdemo")
    spark.sql("CREATE TABLE IF NOT EXISTS sparkdemo.table1(id INT, name STRING)")
    spark.sql("SHOW DATABASES").show

    import spark.implicits._
    val df: DataFrame = Seq(
      (1, "One"),
      (2, "Two"),
      (3, "Three")
    ).toDF("id", "name")

    df.write.mode(SaveMode.Append).format("hive").saveAsTable("sparkdemo.table1")
    //Thread.sleep(60 * 1000)
    spark.sql("SELECT * FROM sparkdemo.table1").show(false)
    println(spark.sql("select * from sparkdemo.table1").count)
  }
}



pom.xml

XML




<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.connected.spark</groupId>
    <artifactId>LocalSparkSql</artifactId>
    <version>1.0.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
    </dependencies>
</project>



Now, let’s right-click on the “LocalSparkHiveTest” file and click on Run.

Error 1

Exception in thread "main" org.apache.spark.sql.AnalysisException:
java.lang.RuntimeException: java.io.IOException: (null) entry in command string:
 null chmod 0733 C:\tmp\hive;
 

If you see the above error, it means your HADOOP_HOME is not set properly.

If you were not able to set system environment variables due to admin access issues, you can set them at the project level with the few steps given below; otherwise, you can skip to the next instruction.

  1. Go to the upper-right corner of your IntelliJ screen and click Edit Configurations.

    Edit configurations

  2. From the left-side panel, expand Application and choose LocalSparkHiveTest.

    Run/Debug configurations

  3. Now, add “HADOOP_HOME=C:\hadoop-2.7.7” in the Environment variables textbox.

    Adding Environment variables

  4. Click Apply and close the dialog box.

Now, let’s try to run the main object again.

Error 2

Exception in thread "main" org.apache.spark.sql.AnalysisException:
java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir:
 /tmp/hive on HDFS should be writable. Current permissions are: ---------;
 

To resolve this error, we need to open the permissions on the Hive default temp directory.

Here, we need to open your Hadoop home, get into the /bin directory, and execute the below command:

            winutils.exe chmod 777 /tmp/hive 

The granted permissions should be "drwxrwxrwx"; you can check the permission status using the below command:

            winutils.exe ls \tmp\hive 

Checking permissions

Error 3

You might get the following error if your machine does not have the Microsoft Visual C++ 2010 Redistributable Package.

The code execution cannot proceed because MSVCR100.dll was not found.
Reinstalling the program may fix this problem.
 

System error

You can download and install C++ 2010 Redistributable Package from the Downloads section of this article, as I have provided the direct download link there.

Now, let’s execute the project again, and you should be able to create a database and table without any issues.

Output:


Creating database and table

By now, you might have noticed that two new directories have been created under your project’s root folder, metastore_db and spark-warehouse.

Project file structure

What are these directories?

metastore_db:

Spark SQL uses a Hive metastore to manage the metadata of the databases and tables created by users. You can think of it as a small relational database that stores information such as the actual database directory paths, table structures, partition columns, file locations, etc.

By default, Spark ships with embedded Apache Derby support to manage this metadata. More information on Derby can be found at https://db.apache.org/derby/.
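Since the metastore really is just a small relational database, you can open it with plain JDBC and look inside. The sketch below is purely illustrative (the object name MetastorePeek is mine); it assumes the project has been run at least once from its root directory, that the Derby driver is on the classpath (it comes in transitively with spark-hive), and that no Spark application currently holds the embedded Derby lock. It lists the entries recorded in the metastore's DBS table:

import java.sql.DriverManager

object MetastorePeek {
  def main(args: Array[String]): Unit = {
    // Embedded Derby allows only one connection at a time,
    // so make sure no Spark app is using this metastore_db
    val conn = DriverManager.getConnection("jdbc:derby:metastore_db")
    try {
      // DBS is the Hive metastore table that tracks databases
      val rs = conn.createStatement().executeQuery(
        "SELECT NAME, DB_LOCATION_URI FROM APP.DBS")
      while (rs.next())
        println(s"${rs.getString("NAME")} -> ${rs.getString("DB_LOCATION_URI")}")
    } finally conn.close()
  }
}

After running the LocalSparkHiveTest example, you should see the default database plus sparkdemo, each mapped to its warehouse location.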

spark-warehouse:

A warehouse directory is a location where table data is written. By default, Spark creates the directory as spark-warehouse.

You can get more information here https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-hive-metastore.html.
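You can also inspect both of these locations from the Spark side through the catalog API. A minimal sketch (the object name CatalogInspect is mine; it assumes the same local SparkSession setup used earlier in this article):

import org.apache.spark.sql.SparkSession

object CatalogInspect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .master("local")
      .appName("CatalogInspect")
      .getOrCreate()

    // Where table data will be written (the spark-warehouse directory)
    println(spark.conf.get("spark.sql.warehouse.dir"))

    // What the metastore currently knows about
    spark.catalog.listDatabases().show(false)
    spark.catalog.listTables("sparkdemo").show(false)
  }
}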

Now, if we create another new project with the same code we used for the LocalSparkSql project, we will notice that we cannot access the database "sparkdemo" created by the previous project. The reason for this behavior is that metastore_db and spark-warehouse are created afresh for every project and are specific to that project, as they live under the project’s root directory.

However, while one project could be pointed at another project's metastore and warehouse, a better approach is to create a common, separate metastore_db and spark-warehouse directory and share them between multiple projects by adding a few extra configurations to the SparkSession; that’s what we are going to discuss in the next section.


Multi-Project Access

To simulate this behavior, we need to create a new project and try to access a common metastore_db and warehouse.

For this demo, we are going to use directory locations, as below:

For metastore_db: C:\tmp\hive\metastore_db

For the warehouse: C:\tmp\hive\spark-warehouse

Configuring Spark-Project

To let Spark refer to our new common directory, we need to add the following config during the creation of a Spark session object:

spark.sql.warehouse.dir = C:/tmp/hive/spark-warehouse

javax.jdo.option.ConnectionURL = jdbc:derby:;databaseName=C:/tmp/hive/metastore_db;create=true 

The code should look like this:

Scala




val spark: SparkSession = SparkSession.builder()
    .enableHiveSupport()
    .master("local")
    .appName("Demo")
    .config("spark.sql.warehouse.dir", "C:/tmp/hive/spark-warehouse")
    .config("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=C:/tmp/hive/metastore_db;create=true")
    .getOrCreate()



You need to use the above configuration in each of your projects to let projects access databases and tables created by other applications.

Configuring Spark-Shell

Now, to check the validity of the above configurations, we can launch a spark-shell and try to access a “sparkdemo” database and “table1”.

Before we launch spark-shell, we need to configure it to point towards our common metastore and warehouse using the following steps:

  1. Go to the Spark config directory, C:\spark-2.4.4-bin-hadoop2.7\conf.
  2. Rename the existing file “spark-defaults.conf.template” to “spark-defaults.conf” (or create the file if it does not exist).
  3. Add the following two lines at the end of the file: 
    1. spark.driver.extraJavaOptions -Dderby.system.home=C:/tmp/hive 
    2. spark.sql.warehouse.dir C:/tmp/hive/spark-warehouse 
  4. Now, open a new cmd and run spark-shell from the C:\spark-2.4.4-bin-hadoop2.7\bin directory.

Now, let’s try to list all the databases by running SHOW DATABASES and selecting all data from table1 in the shell.

Selecting * from table1

The above result validates that our shared metastore configurations are correct, as we are able to access tables created by the above code.

Now, if you keep the spark-shell open and try to run your Scala project at the same time, you will get the below error:

Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=C:/tmp/hive/metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------

java.sql.SQLException: Failed to start database 'C:/tmp/hive/metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@4262fdeb, see the next exception for details.


This is because we are using Spark’s embedded Derby DB, and one application (i.e. spark-shell) is already directly connected to the shared metastore_db database and has acquired the lock, so another instance cannot be initiated. This situation could have been avoided if we had connected to Derby through a network service instead of connecting directly to the database. But we do not have a Derby server instance running, so we can’t do that in this case.

However, we can create the Hive metastore in a separate relational database and let Spark connect to that database to enable a multi-connection setup. We are going to talk about it in our next section.

Full Cluster-Like Access

To configure a local metastore in this mode, we need to download and install a few extra components, listed below, along with the others specified in the “Downloads” section of this article.

Note: It is not a great idea to directly skip to this section of the article, as you might miss a few important and mandatory steps, so kindly follow the article from the start to get the configuration done properly.

Downloads

  1. Download and Install MySql Community Edition: https://dev.mysql.com/downloads/windows/installer/8.0.html.
  2. If you do not have a SQL GUI Editor, please Install HeidiSql: https://www.heidisql.com/download.php
  3. Download MySql connector jar, depending on your MySql Server Version from https://dev.mysql.com/downloads/connector/j/. Make sure to choose Platform Independent as the Operating System to proceed.
  4. Download or open Hive metastore DDL: https://raw.githubusercontent.com/apache/hive/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql  
  5. Download Hive transactional schema DDL from: https://raw.githubusercontent.com/apache/hive/master/metastore/scripts/upgrade/mysql/hive-txn-schema-2.3.0.mysql.sql

Installation

MySql:

  1. Run the downloaded mysql-installer-community-8.0.*.*.msi.
  2. Choose Setup Type as Custom.
  3. Please choose at least the following features to be installed:
  4. Installing MySQL Server

    Installing MySQL Server

  5. On the next page, click Execute to let the installer download and install any missing components from your system.
  6. Click Next and then Execute to let the installation begin.
  7. After installation, the configuration page will appear; let’s choose Standalone MySQL Server.
  8. Select Development Computer as the config type and leave the other settings as default.
  9. We are setting the root password as "root" for the simplicity of this demo.
  10. Keep the other settings as default and finish the installation.

HeidiSQL

  1. Now, execute the downloaded HeidiSQL_*_Setup.exe and finish the installation with default settings.
  2. Open HeidiSQL and choose the Network type as MariaDB or MySQL.
  3. Provide user and password as root (or your configured username and password) and click on open.
  4. Open a new query window and execute the following statements:
    1. CREATE DATABASE metastore_db; 
    2. USE metastore_db; 
  5. Now, copy the content of the Hive metastore ddl from the hive-schema-2.3.0.mysql.sql file. Link: https://raw.githubusercontent.com/apache/hive/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql.
    1. Remove or comment out the line number 835, as “--  SOURCE hive-txn-schema-2.3.0.mysql.sql;”
    2. Execute the entire DDL on metastore_db.
  6. Now, copy and execute the DDL of the Hive transactional schema from the hive-txn-schema-2.3.0.mysql.sql file. Link: https://raw.githubusercontent.com/apache/hive/master/metastore/scripts/upgrade/mysql/hive-txn-schema-2.3.0.mysql.sql.

Once both the SQL files are executed on your metastore_db successfully, you are all set to configure your projects to access this metastore. So, let’s configure spark-shell and Scala/Java projects using the following steps:

Configuring Spark-Shell

Before we launch spark-shell, we need to configure it to use our common MySQL metastore and warehouse. Please follow these steps to configure the shell:

  1. Go to the Spark config directory C:\spark-2.4.4-bin-hadoop2.7\conf.
  2. Make sure you delete the file “spark-defaults.conf” or comment out the previously added lines for spark.driver.extraJavaOptions and spark.sql.warehouse.dir.
  3. Now, create a new file named hive-site.xml under the conf directory and add the following content:
  4. hive-site.xml 

    XML




    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
            <description>JDBC connect string for a JDBC metastore</description>
        </property>

        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>
        </property>

        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>

        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>root</value>
        </property>

        <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>C:/tmp/hive/spark-warehouse</value>
            <description>location of default database for the warehouse</description>
        </property>
    </configuration>


           
  5. Copy the mysql-connector-java-8.0.18.jar to the C:\spark-2.4.4-bin-hadoop2.7\jars\ directory. You can get it from C:\Program Files (x86)\MySQL\Connector J 8.0\ or use the already downloaded one from the downloads section. Link: https://dev.mysql.com/downloads/connector/j/
  6. Delete the old warehouse directory (i.e. C:\tmp\hive\spark-warehouse). Otherwise, we might get a “Directory already exists” error, as we are re-using the same directory.
  7. Now, open a new command prompt and run spark-shell from the C:\spark-2.4.4-bin-hadoop2.7\bin directory and execute the following code:
  8. Shell




    spark.sql("CREATE DATABASE IF NOT EXISTS sparkdemo")

    spark.sql(
        s"""
            CREATE TABLE IF NOT EXISTS sparkdemo.table2
            (
                id INT,
                name STRING
            )
            PARTITIONED BY(
              date STRING
            )
            STORED AS PARQUET
        """)

    import org.apache.spark.sql.{DataFrame, SaveMode}
    import spark.implicits._

    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    val df2: DataFrame = Seq(
      (1, "One", "2020-01-01"),
      (2, "Two", "2020-01-03"),
      (3, "Three", "2020-01-11")
    ).toDF("id", "name", "date")

    df2.write.mode("overwrite").insertInto("sparkdemo.table2")

    spark.sql("SHOW DATABASES").show
    spark.sql("SELECT * FROM sparkdemo.table2").show
    spark.sql("SHOW PARTITIONS sparkdemo.table2").show



    Output:

    Output from SQL commands

Now, we need to validate that we can open multiple connections to the Hive metastore. Then, we can check that we can run multiple Spark projects concurrently in our local Windows system. Let’s try to access the metastore from a separate project without closing the current spark-shell in our next section.

Configuring a Spark Project

Now, let’s create a new Scala object, LocalMySQLMetastoreTest in our “LocalSparkSql” project.

In this object, we will try to insert a few more records into sparkdemo.table2, which we just created using spark-shell.

Now, we need to provide a few extra configurations related to the MySQL server address and credentials so that the SparkSession instance can connect to our new MySQL metastore.

A configuration sample, along with the full code, is given below:

Sample

Scala




val spark: SparkSession = SparkSession.builder()
  .enableHiveSupport()
  .master("local")
  .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://<MySQL SERVER>:<PORT>/metastore_db")
  .config("javax.jdo.option.ConnectionDriverName", "<DATABASE DRIVER CLASS NAME>")
  .config("javax.jdo.option.ConnectionUserName", "<USER NAME>")
  .config("javax.jdo.option.ConnectionPassword", "<PASSWORD>")
  .getOrCreate()



As we are going to use MySQL JDBC for the connection, we need to add the MySQL JDBC driver as a dependency in the pom.xml, as shown below:

XML




<dependencies>
        <dependency>
             <groupId>mysql</groupId>
             <artifactId>mysql-connector-java</artifactId>
             <version>8.0.18</version>
        </dependency>
..
</dependencies>



Full Code

LocalMySQLMetastoreTest.scala
Scala




import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object LocalMySQLMetastoreTest {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .enableHiveSupport()
      .master("local")
      .appName("Demo")
      .config("spark.sql.warehouse.dir", "C:\\tmp\\hive\\spark-warehouse")
      .config("javax.jdo.option.ConnectionURL", "jdbc:mysql://localhost:3306/metastore_db")
      .config("javax.jdo.option.ConnectionDriverName", "com.mysql.cj.jdbc.Driver")
      .config("javax.jdo.option.ConnectionUserName", "root")
      .config("javax.jdo.option.ConnectionPassword", "root")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()

    import spark.implicits._
    spark.sql("CREATE DATABASE IF NOT EXISTS sparkdemo")
    spark.sql(
      s"""
        CREATE TABLE IF NOT EXISTS sparkdemo.table2
        (
          id INT,
          name STRING
        )
        PARTITIONED BY(
          date STRING
        )
        STORED AS PARQUET
      """)

    val df2: DataFrame = Seq(
      (4, "Four", "2020-01-13"),
      (5, "Five", "2020-01-13"),
      (6, "Six", "2020-01-15")
    ).toDF("id", "name", "date")
    df2.write.mode(SaveMode.Overwrite).insertInto("sparkdemo.table2")
    spark.sql("SELECT * FROM sparkdemo.table2").show
    spark.sql("SHOW PARTITIONS sparkdemo.table2").show
  }
}



pom.xml
XML




<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.connected.spark</groupId>
    <artifactId>LocalSparkSql</artifactId>
    <version>1.0.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.18</version>
        </dependency>
    </dependencies>
</project>



Now, let’s run our LocalMySQLMetastoreTest object. The output should look like the image below:

Output from LocalMySQLMetastoreTest

Note: In some cases, you might not be able to see a newly added partition or data by other spark-shell/projects, as this is expected behavior of Hive/Spark metastore management. You can consider executing ALTER TABLE ADD PARTITIONS or spark.catalog.refreshTable("dbname.tablename") to reflect new data.

After running the project one more time with the same data, I got the following exception when I tried to run spark.sql("SELECT * FROM sparkdemo.table2").show in my old spark-shell:

java.io.FileNotFoundException: File
file:/C:/tmp/hive/spark-warehouse/sparkdemo.db/table2/date=2020-01-13/part-000
00-666cc8ed-b44b-4025-9606-e5d9e660c8db.c000 does not exist

The reason behind this is simple; we added/modified a few extra rows/partitions from outside of the shell, so our spark-shell’s metastore catalog is not aware of these changes.

Running the below code will force a refresh of the metadata and sort out this issue:

Scala




spark.catalog.refreshTable("sparkdemo.table2")



Now, executing spark.sql("SELECT * FROM sparkdemo.table2").show in a shell gives the following updated results:

Updated results

End Notes

I hope this extended demo on setting up a local Spark development environment gave you a full, in-depth understanding of the configurations involved in a local setup. I have tried to cover as many failure scenarios as I could, but if you have any other issues, questions, or suggestions, you can share them in the comments below.

Thanks for reading!



Opinions expressed by DZone contributors are their own.
