This topic provides an overview of the Splice Machine in-memory engine, which tremendously boosts OLAP (analytical) query performance. Splice Machine uses Apache Spark™ as its in-memory engine and automatically detects and directs OLAP queries to that engine.
It also presents a brief overview of Spark terminology and concepts.
Note: If you’re not yet familiar with Spark, we recommend visiting the Apache Spark website to learn the basics and to find links to the official Spark documentation.
Apache Spark is an open-source computational engine that manages tasks in a computing cluster. Spark was originally developed at UC Berkeley in 2009, open-sourced in 2010, and later became an Apache project. Spark has been engineered from the ground up for performance, exploiting in-memory computing and other optimizations to provide powerful analysis on very large data sets.
Spark provides numerous performance-oriented features, including:
- Ability to cache datasets in memory for interactive data analysis.
- Integration with a host of data sources.
- Very fast data analysis.
- Easy-to-use APIs for operating on large datasets, including numerous operators for transforming and manipulating data, in Java, Scala, Python, and other languages.
- Numerous high-level libraries, including support for machine learning, streaming, and graph processing.
- Scalability to thousands of nodes.
Spark applications consist of a driver program and some number of worker programs running on cluster nodes. The datasets (RDDs) used by the application are distributed across the worker nodes.
Splice Machine launches Spark queries on your cluster as Spark jobs, each of which consists of some number of stages. Each stage then runs a number of tasks, each of which is a unit of work that is sent to an executor.
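The job, stage, and task hierarchy described above can be sketched with a toy model in plain Python. This is not Spark or Splice Machine code: the `Task`, `run_stage`, and `run_job` names are hypothetical, a thread pool stands in for the executors, and the sketch ignores how Spark actually decides where one stage ends and the next begins.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One unit of work: apply a function to a single partition."""
    partition: List[int]
    func: Callable[[List[int]], List[int]]

    def run(self) -> List[int]:
        return self.func(self.partition)


def run_stage(partitions, func, executors):
    """A stage creates one task per partition and runs the tasks in parallel."""
    tasks = [Task(p, func) for p in partitions]
    # map() preserves input order, so results line up with the partitions.
    return list(executors.map(Task.run, tasks))


def run_job(partitions, stage_funcs, num_executors=2):
    """A job runs its stages in order; each stage consumes the previous output."""
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        for func in stage_funcs:
            partitions = run_stage(partitions, func, executors)
    return partitions


# Hypothetical data: six values split across three partitions.
data = [[1, 2], [3, 4], [5, 6]]
stages = [
    lambda part: [x * 10 for x in part],  # stage 1: scale each element
    lambda part: [sum(part)],             # stage 2: one partial sum per partition
]
result = run_job(data, stages)
print(result)  # [[30], [70], [110]]
```

Each stage here produces three tasks, one per partition, which mirrors the rule that the number of tasks in a stage follows the number of partitions rather than the number of executors.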
The table below contains a brief glossary of the terms you’ll see when using the Splice Machine Management Console:
|Action||A function that returns a value to the driver after running a computation on an RDD. Examples include save and collect functions.|
|Application||A user program built on Spark. Each application consists of a driver program and a number of executors running on your cluster. An application creates RDDs, transforms those RDDs, and runs actions on them. These result in a directed acyclic graph (DAG) of operations, which is compiled into a set of stages. Each stage consists of a number of tasks.|
|DAG||A Directed Acyclic Graph of the operations to run on an RDD.|
|Driver program||This is the process that’s running the main() function of the application and creating the SparkContext object, which sends jobs to executors.|
|Executor||A process that is launched (by the driver program) for an application on a worker node. The executor launches tasks and maintains data for them.|
|Job||A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action.|
|Partition||A subset of the elements in an RDD. Partitions define the unit of parallelism; Spark processes elements within a partition in sequence and multiple partitions in parallel.|
|RDD||A Resilient Distributed Dataset. This is the core programming abstraction in Spark, consisting of a fault-tolerant collection of elements that can be operated on in parallel.|
|Stage||A set of tasks that run in parallel. The stage creates a task for each partition in an RDD, serializes those tasks, and sends those tasks to executors.|
|Task||The fundamental unit of work in Spark; each task fetches input, executes operations, and generates output.|
|Transformation||A function that creates a new RDD from an existing RDD.|
|Worker node||A cluster node that can run application code.|
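To make the difference between a transformation and an action concrete, here is a minimal sketch in plain Python. `ToyRDD` is a hypothetical stand-in, not Spark's RDD class: it only records transformations as a list of pending steps and runs them when an action is called. It processes partitions one after another, whereas Spark would run one task per partition in parallel.

```python
from typing import Callable, List


class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus a list of pending steps."""

    def __init__(self, partitions: List[List[int]], steps=None):
        self.partitions = partitions
        self.steps = steps or []  # recorded operations, applied only on an action

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(self.partitions, self.steps + [("map", f)])

    def filter(self, f: Callable) -> "ToyRDD":
        return ToyRDD(self.partitions, self.steps + [("filter", f)])

    def _compute(self, partition):
        """The work one task would do: run every recorded step on one partition."""
        items = partition
        for kind, f in self.steps:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

    # Actions: run the recorded steps and return a value to the caller.
    def collect(self) -> List[int]:
        return [x for p in self.partitions for x in self._compute(p)]

    def count(self) -> int:
        return len(self.collect())


calls = []  # records when the map function actually runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens = rdd.map(square).filter(lambda x: x % 2 == 0)
pending = len(calls)        # still 0: the transformations have not run
collected = evens.collect() # the action triggers the computation
print(pending, collected)   # 0 [4, 16, 36]
```

The recorded `steps` list plays the role of a very simple, linear DAG: nothing is computed until `collect` or `count` asks for a result, which matches the glossary's description of a job being spawned in response to an action.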