What's New in Apache Spark 2.2?

Learn about the improvements, updates, and new features in Apache Spark 2.2, with a focus on its new and improved, production-ready structured streaming.

The Apache Spark project recently released Spark 2.2. The new version brings improvements to existing components as well as new functionality.

The headline change in this release is structured streaming: its experimental tag has been removed, and it is now marked as production-ready.

At a high level, the changes and improvements include:

  • Production-ready structured streaming
  • Expanded SQL functionality
  • New distributed machine learning algorithms in R
  • Additional algorithms in MLlib and GraphX

Alongside its production-ready status, structured streaming gains several new capabilities:

  • Kafka source and sink: Previously, Kafka was supported only as a source. In this release, Kafka can be used both as a source and as a sink.
  • Kafka improvements: A cached Kafka producer instance is now reused when writing to Kafka sinks, which reduces latency.
  • Additional stateful APIs: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState.
  • Run-once triggers: A query can be triggered to execute a single time and then stop, which lowers cluster costs for workloads that do not need to run continuously.
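As an illustration of the Kafka sink and the run-once trigger together, here is a minimal PySpark sketch that copies one topic to another and then stops. The broker address, topic names, checkpoint directory, and helper function names are all hypothetical, and the sketch assumes the `spark-sql-kafka-0-10` package is on the classpath of an existing `SparkSession`.

```python
"""Sketch: copy one Kafka topic to another with a run-once trigger (Spark 2.2+)."""


def kafka_source_options(bootstrap_servers, topic):
    """Options for reading a topic with the built-in "kafka" source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only applies to the first run; later runs resume from the checkpoint.
        "startingOffsets": "earliest",
    }


def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Options for writing to a topic with the "kafka" sink."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # The checkpoint records processed offsets, so each run-once
        # execution picks up where the previous one stopped.
        "checkpointLocation": checkpoint_dir,
    }


def copy_topic_once(spark, source_opts, sink_opts):
    """Process the data available now, write it to the sink, then stop."""
    events = spark.readStream.format("kafka").options(**source_opts).load()

    # The Kafka sink expects a `value` column (and optionally `key`)
    # of string or binary type.
    out = events.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )

    query = (
        out.writeStream.format("kafka")
        .options(**sink_opts)
        .trigger(once=True)  # run-once trigger
        .start()
    )
    query.awaitTermination()
    return query
```

A scheduler could call `copy_topic_once(spark, kafka_source_options("localhost:9092", "events-in"), kafka_sink_options("localhost:9092", "events-out", "/tmp/checkpoints/copy"))` periodically, so the cluster only needs to be up while each run executes.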

Spark 2.2 also adds a number of SQL features:

  • API updates: Added support for creating Hive tables with DataFrameWriter and Catalog, added support for LATERAL VIEW OUTER explode(), and unified the CREATE TABLE syntax for data source and Hive serde tables. Added the broadcast hints BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries, as well as session-local time zones, which help when machines or users are in different time zones. This release also adds support for ADD COLUMNS with the ALTER TABLE command.
  • Overall performance and stability:
    • Cost-based optimizer: Adds cardinality estimation for filter, join, aggregate, project, and limit/sample operators, and chooses the join order of a multi-way join query based on a cost function. TPC-DS performance also improves through star-schema heuristics.
    • File listing/IO improvements for CSV and JSON.
    • Partial aggregation support of HiveUDAFFunction.
    • A new JVM object-based aggregate operator.
    • A limit on the maximum number of records written per file.
  • Other notable changes:
    • Support for parsing multiline JSON and CSV files.
    • ANALYZE TABLE commands on partitioned tables.
    • Drop staging directories and data files after completion of insertions/CTAS against Hive serde tables.
    • More robust view canonicalization without full SQL expansion.
    • Support for reading data from Hive metastore 2.0/2.1.
    • Removed support for Hadoop 2.5 and earlier.
    • Removed Java 7 support.
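To make a few of these SQL additions concrete, here is a rough PySpark sketch. The tables `orders` and `customers`, their columns, and the helper function names are hypothetical. The configuration keys and the `multiLine` reader option are the names used in Spark 2.2 to the best of my knowledge, so check them against the documentation for your version.

```python
"""Sketch: a few Spark 2.2 SQL additions as session settings and SQL text."""

# Session settings. The values shown are illustrative only.
SESSION_CONF = {
    "spark.sql.session.timeZone": "UTC",             # session-local time zone
    "spark.sql.cbo.enabled": "true",                 # cost-based optimizer
    "spark.sql.cbo.joinReorder.enabled": "true",     # cost-based join reordering
    "spark.sql.files.maxRecordsPerFile": "1000000",  # cap on records per file
}

# Statements against hypothetical tables `orders` and `customers`.
STATEMENTS = [
    # ADD COLUMNS with the ALTER TABLE command.
    "ALTER TABLE orders ADD COLUMNS (coupon STRING)",
    # Column statistics feed the optimizer's cardinality estimates.
    "ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id",
    # Broadcast hint: ask the planner to broadcast the smaller table.
    "SELECT /*+ BROADCAST(c) */ o.id, c.name "
    "FROM orders o JOIN customers c ON o.customer_id = c.id",
    # LATERAL VIEW OUTER keeps rows whose array is empty or NULL.
    "SELECT o.id, item "
    "FROM orders o LATERAL VIEW OUTER explode(o.items) t AS item",
]


def apply_session_conf(spark, conf=None):
    """Apply the settings to an existing SparkSession."""
    for key, value in (conf or SESSION_CONF).items():
        spark.conf.set(key, value)


def run_statements(spark, statements=None):
    """Run each statement and return the resulting DataFrames."""
    return [spark.sql(stmt) for stmt in (statements or STATEMENTS)]


def read_multiline_json(spark, path):
    """Read JSON files in which a single record spans several lines."""
    return spark.read.option("multiLine", True).json(path)
```

The settings and statements are kept as plain data so they can be reviewed or logged before they are sent to a session.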

A major set of changes in Spark 2.2 focuses on advanced analytics and Python. PySpark is now available on PyPI, so it can be installed with pip install pyspark.

A few new algorithms were also added to MLlib and GraphX:

  • Locality sensitive hashing
  • Multiclass logistic regression
  • Personalized PageRank
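As one example from this list, multiclass logistic regression can be requested in PySpark through the `family` parameter of `LogisticRegression`. The sketch below is only an illustration: the helper names and hyperparameter values are hypothetical, and the training DataFrame is assumed to have the default `features` (vector) and `label` (numeric class index) columns.

```python
"""Sketch: multiclass (multinomial) logistic regression with pyspark.ml."""


def multinomial_lr_params(max_iter=50, reg_param=0.01):
    """Keyword arguments for a multinomial LogisticRegression estimator."""
    return {
        "family": "multinomial",  # softmax over all classes
        "maxIter": max_iter,
        "regParam": reg_param,
        "featuresCol": "features",
        "labelCol": "label",
    }


def fit_multiclass(train_df, params=None):
    """Fit the estimator on a DataFrame with `features` and `label` columns."""
    # Imported lazily so the helper above works without a Spark installation.
    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(**(params or multinomial_lr_params()))
    return lr.fit(train_df)
```

The fitted model exposes one row of coefficients per class through `coefficientMatrix`, together with an `interceptVector`.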

Spark 2.2 also adds support for the following distributed algorithms in SparkR:

  • ALS
  • Isotonic regression
  • Multilayer perceptron classifier
  • Random forest
  • Gaussian mixture model
  • LDA
  • Multiclass logistic regression
  • Gradient boosted trees

The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:

  • Structured Streaming API for R
  • Support complete catalog API in R
  • Column functions to_json and from_json
  • Coalesce on DataFrame and coalesce on column
  • Support DataFrame checkpointing
  • Multi-column approxQuantile in R

Some features have been removed or deprecated in this release. For example, support for Python 2.6 has been dropped, and createExternalTable has been deprecated.

Happy learning!

Topics:
apache spark, structured streaming, spark 2.2, big data

Published at DZone with permission of
