
Splice Machine In-Memory Engine

Get an in-depth overview of the Splice Machine in-memory engine, which uses Apache Spark and automatically detects and directs OLAP queries to that engine.

This topic provides an overview of the Splice Machine in-memory engine, which tremendously boosts OLAP (analytical) query performance. Splice Machine uses Apache Spark™ as its in-memory engine and automatically detects and directs OLAP queries to that engine.
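
To make this concrete, here is a minimal, hypothetical sketch of submitting an analytical query through Splice Machine's JDBC interface from Scala. The table, credentials, host, and the default splicedb database on port 1527 are illustrative assumptions rather than details from this article:

```scala
import java.sql.DriverManager

object SpliceOlapSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection string: substitute your own host, port,
    // database name, and credentials.
    val url = "jdbc:splice://localhost:1527/splicedb;user=splice;password=admin"
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      // A full-table aggregation like this is an OLAP-style query, so Splice
      // Machine routes it to the Spark engine; a single-row lookup by primary
      // key would instead stay on the fast OLTP path.
      val rs = stmt.executeQuery(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
      while (rs.next()) {
        println(s"${rs.getString("region")} -> ${rs.getBigDecimal("total")}")
      }
    } finally {
      conn.close()
    }
  }
}
```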

The rest of this article presents a very brief overview of Spark terminology and concepts.

Note: If you’re not yet familiar with Spark, we recommend visiting the Apache Spark website to learn the basics and for links to the official Spark documentation.

Spark Overview

Apache Spark is an open-source computational engine that manages tasks in a computing cluster. Spark was originally developed at UC Berkeley in 2009, open-sourced in 2010, and later became an Apache project. Spark has been engineered from the ground up for performance, exploiting in-memory computing and other optimizations to provide powerful analysis of very large data sets.

Spark provides numerous performance-oriented features, including:

  • Ability to cache datasets in memory for interactive data analysis.
  • A simple programming abstraction (the Resilient Distributed Dataset, or RDD) for working with distributed data.
  • Integration with a host of data sources.
  • Very fast data analysis.
  • Easy-to-use APIs in Java, Scala, Python, and other languages for operating on large datasets, including numerous operators for transforming and manipulating data.
  • Numerous high-level libraries, including support for machine learning, streaming, and graph processing.
  • Scalability to thousands of nodes.

Spark applications consist of a driver program and some number of worker programs running on cluster nodes. The datasets (RDDs) used by the application are distributed across the worker nodes.
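
As a rough illustration of that structure, the sketch below uses Spark's Scala API (the application name and data are made up) to show the driver creating a SparkContext, distributing a dataset across the workers as an RDD, caching it, and pulling results back with actions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverWorkerSketch {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext; on a real cluster the
    // master is normally supplied by spark-submit, local[*] is just for testing.
    val conf = new SparkConf().setAppName("driver-worker-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize splits the data into partitions that live on the worker nodes.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)

    // cache keeps the partitions in executor memory for interactive reuse.
    val squares = numbers.map(n => n.toLong * n).cache()

    // Actions such as count and sum return values to the driver.
    println(s"count = ${squares.count()}, sum = ${squares.sum()}")

    sc.stop()
  }
}
```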

Spark Terminology

Splice Machine launches Spark queries on your cluster as Spark jobs, each of which consists of some number of stages. Each stage then runs a number of tasks, each of which is a unit of work that is sent to an executor.

Below is a brief glossary of the terms you’ll see when using the Splice Machine Management Console:

Action: A function that returns a value to the driver after running a computation on an RDD. Examples include the save and collect functions.
Application: A user program built on Spark. Each application consists of a driver program and a number of executors running on your cluster. An application creates RDDs, transforms those RDDs, and runs actions on them; these result in a directed acyclic graph (DAG) of operations, which is compiled into a set of stages, each consisting of a number of tasks.
DAG: A directed acyclic graph of the operations to run on an RDD.
Driver program: The process that runs the main() function of the application and creates the SparkContext object, which sends jobs to executors.
Executor: A process launched by the driver program for an application on a worker node. The executor launches tasks and maintains data for them.
Job: A parallel computation consisting of multiple tasks, spawned in response to a Spark action.
Partition: A subset of the elements in an RDD. Partitions define the unit of parallelism; Spark processes elements within a partition in sequence and multiple partitions in parallel.
RDD: A Resilient Distributed Dataset, the core programming abstraction in Spark, consisting of a fault-tolerant collection of elements that can be operated on in parallel.
Stage: A set of tasks that run in parallel. The stage creates a task for each partition in an RDD, serializes those tasks, and sends them to executors.
Task: The fundamental unit of work in Spark; each task fetches input, executes operations, and generates output.
Transformation: A function that creates a new RDD from an existing RDD.
Worker node: A cluster node that can run application code.
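
To tie several of these terms together, here is a small word-count sketch in the spark-shell style, assuming sc is an existing SparkContext and using a made-up HDFS path. The transformations only build up the DAG; the shuffle introduced by reduceByKey starts a new stage, and the final action is what spawns the job:

```scala
// Transformations are lazy: each call only adds an operation to the DAG.
val lines  = sc.textFile("hdfs:///data/books/*.txt")            // hypothetical input path
val pairs  = lines.flatMap(_.split("\\s+")).map(w => (w, 1))     // narrow transformations, same stage
val counts = pairs.reduceByKey(_ + _)                            // shuffle boundary: new stage

// The action spawns a job: the DAG is compiled into stages, each stage runs
// one task per partition, and the tasks execute on the executors.
val top10 = counts.takeOrdered(10)(Ordering.by[(String, Int), Int](p => -p._2))
top10.foreach(println)
```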


