Nebula Spark Connector Reader: Principles and Practices
Nebula Spark Connector Reader enables Nebula Graph to work as an extended data source for Spark. In this post, we will focus on the Reader.
What Is Nebula Spark Connector?
Nebula Spark Connector is a custom Spark connector, enabling Spark to read data from and write data to Nebula Graph. Therefore, Nebula Spark Connector is composed of a Reader and Writer. In this post, we will focus on the Reader. The Writer will be introduced next time.
How Nebula Spark Connector Reader Is Implemented
Nebula Spark Connector Reader enables Nebula Graph to work as an extended data source for Spark. With it, Spark can read data from Nebula into DataFrame and then execute the operations such as map and reduce.
Spark SQL allows users to customize data sources and supports extended data sources. The data read by Spark SQL is organized into a distributed dataset in the form of named columns, also called a DataFrame. Spark SQL provides many APIs to facilitate the calculation and conversion of DataFrames. You can use the DataFrame interfaces to manipulate multiple types of data sources.
Spark calls packages under `org.apache.spark.sql` to work with an extended data source. Let's first learn about the interfaces related to extended data sources provided by Spark SQL.
Basic interfaces include:
- `BaseRelation`: Represents a collection of tuples with a known schema. All subclasses that inherit `BaseRelation` must generate a schema in the `StructType` format. In other words, `BaseRelation` defines the format of the data that is read out of the source and stored as a DataFrame in Spark SQL.
- `RelationProvider`: Obtains a list of parameters and returns a new `BaseRelation` based on the specified parameters.
- `DataSourceRegister`: Registers a data source. With it, you do not need to use the fully qualified class name of the data source, but only its alias, namely the custom `shortName()`.
Provider-related interfaces include:

- `RelationProvider`: Generates a custom `relation` from the specified data source; its `createRelation()` method builds a new `relation` based on the given parameters.
- `SchemaRelationProvider`: Generates a new `relation` based on the given parameters and a given schema.
RDD-related interfaces include:

- `RDD[InternalRow]`: After the data source is scanned, the rows are constructed as an `RDD[InternalRow]`, which is then assembled into a DataFrame.
To define an external data source for Spark, some of the preceding methods must be customized according to the source.
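To make the pattern concrete, here is a minimal sketch of a custom read-only data source built on these interfaces. The class names (`DemoRelation`, `DefaultSource`) and the `demo` alias are illustrative and not part of Nebula Spark Connector:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// BaseRelation defines the schema of the data; TableScan supplies the rows.
class DemoRelation(override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(Seq(StructField("name", StringType, nullable = false)))

  // A real connector would scan the external source here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
}

// RelationProvider creates the relation; DataSourceRegister gives it an
// alias so it can be referenced as spark.read.format("demo").
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "demo"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}
```

With the alias registered, `spark.read.format("demo").load()` returns a DataFrame whose schema and rows come from `DemoRelation`.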
In Nebula Spark Connector, Nebula Graph is defined as an external data source of Spark SQL, and `sparkSession.read` is used to read data from Nebula Graph. The following class diagram shows how this function is implemented.
The process is as follows:
- Define the data source `NebulaRelationProvider`. It inherits `RelationProvider` to customize a `relation` and inherits `DataSourceRegister` to register an external data source.
- Define `NebulaRelation` to implement methods that convert the schema and the data of Nebula Graph. In its `getSchema()` method, the Meta Service of Nebula Graph is connected to obtain the schema corresponding to the configured return fields.
- Define `NebulaRDD` to read data from Nebula Graph. Its `compute()` method defines how to read data from Nebula Graph, which mainly involves scanning the Nebula Graph data and converting its data rows into `InternalRow` data of Spark. Each converted `InternalRow` forms a row of the RDD and represents a row of records in Nebula Graph. Row by row, all data of Nebula Graph is read out and assembled into DataFrames.
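The reading path above can be sketched as a simplified RDD skeleton. This is not the actual connector source; the helpers `scanNebulaPartition` and `toInternalRow` are hypothetical stand-ins for the real scan-and-convert logic:

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.InternalRow

// Simplified skeleton: each Spark partition scans a slice of Nebula Graph
// data and converts every Nebula data row into a Spark InternalRow.
class NebulaRDDSketch(sqlContext: SQLContext, numParts: Int)
    extends RDD[InternalRow](sqlContext.sparkContext, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numParts).map { i =>
      new Partition { override def index: Int = i }
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    // 1. Scan the Nebula Graph data that belongs to this partition.
    // 2. Convert each returned data row into an InternalRow.
    scanNebulaPartition(split.index).map(toInternalRow)
  }

  // Hypothetical helpers standing in for the real scan/convert logic.
  private def scanNebulaPartition(part: Int): Iterator[Seq[Any]] = Iterator.empty
  private def toInternalRow(values: Seq[Any]): InternalRow = InternalRow.fromSeq(values)
}
```

The union of all partitions' `InternalRow` iterators is what Spark SQL assembles into the final DataFrame.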
Practicing Spark Connector Reader
Nebula Spark Connector Reader provides you with an interface to programmatically read data from Nebula Graph. One tag or edge type is read at a time. The data is read out as DataFrames.
To use the Reader, clone the `nebula-java` repository from GitHub and compile it. After compilation, copy the JAR package from the `target` directory to your local Maven repository.
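Assuming a standard Maven setup, the clone-and-compile step might look like the following; the `mvn install` goal places the built JAR in the local Maven repository:

```shell
# Clone the repository and build the nebula-spark module.
git clone https://github.com/vesoft-inc/nebula-java.git
cd nebula-java/tools/nebula-spark
mvn clean install -DskipTests
```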
An Example Application
Add the `nebula-spark` dependency to the POM file of your Maven project.
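The dependency declaration would look like the following sketch; the version shown is a placeholder, so check the repository for the current release:

```xml
<dependency>
    <groupId>com.vesoft</groupId>
    <artifactId>nebula-spark</artifactId>
    <!-- Placeholder: use the version you compiled and installed locally. -->
    <version>1.x.y</version>
</dependency>
```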
Use the Spark application to read data from Nebula Graph.
Here is the explanation of the variables:

`nebula(address: String, space: String, partitionNum: String)`

- address: Specifies the IP addresses of the Meta service. Multiple addresses are separated with commas, for example, "ip1:45500,ip2:45500".
- space: Specifies a graph space name of Nebula Graph.
- partitionNum: Specifies the number of Spark partitions. To make sure that each Spark partition holds the data of one partition of the specified graph space, we recommend that you set this variable to the partitionNum that you used when creating the graph space.

`loadVertices(tag: String, fields: String)`

- tag: Specifies a tag of the specified graph space.
- fields: Specifies the properties of the given tag. Multiple properties are separated with commas. To read all the properties of the tag, set it to `*`.

`loadEdges(edge: String, fields: String)`

- edge: Specifies an edge type of the specified graph space.
- fields: Specifies the properties of the given edge type. Multiple properties are separated with commas. To read all the properties of the edge type, set it to `*`.
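Putting the calls together, a read might look like the sketch below. The method names follow the descriptions above; the import path, graph space name `nb`, tag `player`, and edge type `follow` are assumptions for illustration, so check the repository for the current API:

```scala
// Assumed package exposing the implicit reader extensions.
import com.vesoft.nebula.tools.connector._
import org.apache.spark.sql.{DataFrame, SparkSession}

object NebulaReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nebula-reader")
      .master("local[*]")
      .getOrCreate()

    // Read all properties of the tag "player" from graph space "nb".
    val vertexDF: DataFrame = spark.read
      .nebula("127.0.0.1:45500", "nb", "100")
      .loadVertices("player", "*")
    vertexDF.show()

    // Read the property "degree" of the edge type "follow".
    val edgeDF: DataFrame = spark.read
      .nebula("127.0.0.1:45500", "nb", "100")
      .loadEdges("follow", "degree")
    edgeDF.show()

    spark.stop()
  }
}
```

One tag or edge type is read per call, and each call returns a separate DataFrame.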
Here is the repository of Nebula Spark Connector Reader on GitHub: https://github.com/vesoft-inc/nebula-java/tree/master/tools/nebula-spark.
Published at DZone with permission of Nicole Wang. See the original article here.