Restructuring Big Data With Spark
Big data has evolved, and the need for real-time performance, data governance, and higher efficiency is forcing us to focus more on structure and context.
Big data used to be about storing unstructured data in its raw form. We'd say, “forget about structures and schema; they will be defined when we read the data.” But big data has evolved, and the need for real-time performance, data governance, and higher efficiency is forcing some structure and context back in.
Traditional databases have well-defined schemas that describe the content and the strict relations between the data elements. This makes them extremely complex and rigid. Big data's initial application was to analyze unstructured machine log files, so having rigid schemas was impractical. It then expanded to CSV and JSON files with data extracted (via ETL) from different data sources. All the data was processed in an offline batch manner where latency wasn’t critical.
Big data is now taking its place at the forefront of the business and is being used in real-time decision support systems, online customer engagement, and interactive data analysis where users expect immediate results. Reducing time to insight and moving from batch to real-time is becoming the most critical requirement. Unfortunately, when data is stored as inflated and unstructured text, queries take forever and consume significant CPU, network, and storage resources.
Big data today needs to serve a variety of use cases, users, and content. Data must be accessible and organized for it to be used efficiently. Unfortunately, traditional “data preparation” processes are slow and manual, and they don’t scale. The resulting data sets end up partial and inaccurate and get dumped into the lake without context.
As the focus on data security grows, we need to control who can access the data and when. When data is unorganized, there is no way for us to know if files contain sensitive data, and we cannot block access to individual records or fields/columns.
Structured Data to the Rescue
To address the performance and data wrangling challenges, new file formats like Parquet and ORC were developed. These are highly efficient, compressed binary data structures with flexible schemas. It is now the norm to use Parquet with Hive or Spark since it enables much faster data scanning and allows for reading only the specific columns that are relevant to the query, as opposed to having to go over the entire file.
Using Parquet, one can save up to 80% of storage capacity compared with a text format while making queries 2-3x faster.
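To make the column-pruning idea concrete, here is a minimal sketch in plain Python. It is not Parquet or ORC code; the sample records and the function names (to_columns, total_from_rows, total_from_columns) are made up for illustration. It only shows why a columnar layout lets a query touch fewer values than a row layout.

```python
# Toy illustration only: real Parquet/ORC files add encoding, compression,
# and per-column statistics on top of this basic idea.

rows = [
    {"user": "alice", "country": "US", "amount": 30.0},
    {"user": "bob", "country": "DE", "amount": 12.5},
    {"user": "carol", "country": "US", "amount": 7.5},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_from_rows(records, field):
    """Row layout: every whole record is visited to read one field."""
    values_touched = 0
    total = 0.0
    for record in records:
        values_touched += len(record)  # all fields of the row are loaded
        total += record[field]
    return total, values_touched

def total_from_columns(columns, field):
    """Columnar layout: only the requested column is visited."""
    column = columns[field]
    return sum(column), len(column)

columns = to_columns(rows)
row_total, row_touched = total_from_rows(rows, "amount")
col_total, col_touched = total_from_columns(columns, "amount")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both paths return the same total, but the columnar path reads one value per record instead of every field. On wide tables with dozens of columns, that gap is where much of the scanning benefit comes from.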
The new formats force us to define some structure up front with the option to expand or modify the schema dynamically, unlike older legacy databases. Having such schema and metadata helps in reducing data errors and makes it possible for different users to understand the content of the data and collaborate. With built-in metadata, it becomes much simpler to secure and govern the data and filter or anonymize parts of it.
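As a rough sketch of what an up-front but expandable schema buys us, the plain Python below validates records, fills defaults for fields added later, and masks fields flagged as sensitive. The schema layout (type, default, sensitive flag) and the names evolve, conform, and mask are assumptions made for this example, not the API of any particular format.

```python
# Hypothetical schema: field name -> (type, default, is_sensitive).
SCHEMA_V1 = {
    "user_id": (int, None, False),
    "email": (str, None, True),
}

def evolve(schema, name, field_type, default, sensitive=False):
    """Return a new schema with one added field; old records stay valid."""
    updated = dict(schema)
    updated[name] = (field_type, default, sensitive)
    return updated

def conform(record, schema):
    """Validate a record against the schema, filling defaults for missing fields."""
    result = {}
    for name, (field_type, default, _sensitive) in schema.items():
        value = record.get(name, default)
        if value is not None and not isinstance(value, field_type):
            raise TypeError(
                f"{name}: expected {field_type.__name__}, got {type(value).__name__}"
            )
        result[name] = value
    return result

def mask(record, schema):
    """Hide fields that the schema metadata marks as sensitive."""
    return {
        name: "***" if schema[name][2] and value is not None else value
        for name, value in record.items()
    }

# Add a column later without rewriting records written under the old schema.
SCHEMA_V2 = evolve(SCHEMA_V1, "country", str, "unknown")
old_record = {"user_id": 7, "email": "a@example.com"}
conformed = conform(old_record, SCHEMA_V2)
masked = mask(conformed, SCHEMA_V2)
print(conformed)
print(masked)
```

Because the sensitivity flag lives in the schema rather than in application code, the same masking rule applies to every reader of the data.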
One challenge with the current Hadoop file-based approach, regardless of whether the data is unstructured or structured, is that individual records cannot be updated; writes are limited to bulk data uploads. This means that dynamic and online applications are forced to rewrite an entire file just to modify a single field. Reading or updating an individual record still requires a full scan rather than selective random access. This is also true for what may seem to be sequential data (for example, delayed time series data or historical data adjustments).
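The difference between a bulk rewrite and a selective update can be sketched with fixed-width records on a local file. This is only an illustration in plain Python: the record layout (a 4-byte id plus an 8-byte value) and the helper names are hypothetical, and it says nothing about how HDFS or any specific database lays out data.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width layout: 4-byte int id + 8-byte float value.
RECORD = struct.Struct("<id")

def write_all(path, records):
    """Bulk write: the only option when a format has no record offsets."""
    with open(path, "wb") as f:
        for record_id, value in records:
            f.write(RECORD.pack(record_id, value))

def update_in_place(path, index, value):
    """Seek to one record and overwrite it; returns the bytes written."""
    with open(path, "r+b") as f:
        f.seek(index * RECORD.size)
        record_id, _old = RECORD.unpack(f.read(RECORD.size))
        f.seek(index * RECORD.size)
        return f.write(RECORD.pack(record_id, value))

def read_one(path, index):
    """Random read of a single record without scanning the file."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))

path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_all(path, [(i, float(i)) for i in range(1000)])
bytes_written = update_in_place(path, 500, 99.5)
file_size = os.path.getsize(path)
print(bytes_written, file_size)
print(read_one(path, 500))
```

Updating one record writes a few bytes, while a bulk rewrite would write the whole file again. The gap widens as the file grows.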
Spark Moving to Structured Data
Apache Spark is the fastest-growing analytics platform and can replace many older Hadoop-based frameworks. It is constantly evolving and trying to address the demand for interactive queries on large datasets, real-time stream processing, graphs, and machine learning. Spark has changed dramatically with the introduction of DataFrames, in-memory table constructs that are manipulated in parallel using machine-optimized low-level processing (see Project Tungsten). DataFrames are structured and can be mapped directly to a variety of data sources via a pluggable API, including:
Files such as Parquet, ORC, Avro, JSON, and CSV.
Databases such as MongoDB, Cassandra, MySQL, Oracle, and HP Vertica.
Cloud storage like Amazon S3 and DynamoDB.
DataFrames can be loaded directly from external databases or created from unstructured data by crawling and parsing the text (a long and CPU-/disk-intensive task). DataFrames can be written back to external data sources in a random and indexed fashion if the backend supports such an operation (for example, in the case of a database).
The Spark 2.0 release adds Structured Streaming, expanding the use of DataFrames from batch and SQL to streaming and real-time. This will greatly simplify data manipulation and improve performance. Now we can use streaming, SQL, machine learning, and graph processing semantics over the same data!
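The appeal of using one set of semantics for batch and streaming can be shown with a small conceptual sketch. This is plain Python, not the Spark API: the count_by_key function and the sample page-view events are invented for illustration. The same aggregation runs once over the full data set and then incrementally over small batches, and both arrive at the same result.

```python
from collections import Counter

def count_by_key(records, state=None):
    """Fold records into running per-key counts; state carries prior results."""
    counts = Counter() if state is None else state
    for record in records:
        counts[record["page"]] += 1
    return counts

events = [{"page": p} for p in ["home", "cart", "home", "pay", "home", "cart"]]

# Batch: one pass over the complete data set.
batch_result = count_by_key(events)

# Streaming: the same function applied to micro-batches as they arrive.
stream_state = None
snapshots = []
for start in range(0, len(events), 2):
    stream_state = count_by_key(events[start:start + 2], stream_state)
    snapshots.append(dict(stream_state))  # result is queryable after each batch

print(dict(batch_result))
print(snapshots)
```

The streaming path produces a usable answer after every micro-batch instead of only at the end, which is the practical difference for time to insight.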
Spark is not the only streaming engine moving to structured data. Apache Druid delivers high performance and efficiency by working with structured data and columnar compression.
New applications are designed to process data as it gets ingested and react in seconds or less instead of waiting for hours or days. IoT will drive huge volumes of data which, in some cases, may need to be processed immediately to save or improve our lives. The only way to process such high volumes of data while lowering the time to insight is to normalize, clean, and organize the data as it lands in the data lake and store it in highly efficient dynamic structures. When analyzing massive amounts of data, we do better running over structured and pre-indexed data. This can be orders of magnitude faster.
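As a minimal sketch of normalizing and indexing data as it lands, the plain Python below parses raw text lines into typed records at ingest time and keeps an index by device. The raw line format, the Store class, and its method names are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw format: "<ISO timestamp> <device id> temp=<float>"
RAW_LINES = [
    "2016-07-01T12:00:00 sensor-7 temp=21.5",
    "2016-07-01T12:00:01 sensor-9 temp=19.0",
    "garbage line",
    "2016-07-01T12:00:02 sensor-7 temp=22.0",
]

def parse(line):
    """Turn one raw text line into a typed record, or None if malformed."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].startswith("temp="):
        return None
    try:
        return {
            "ts": datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S"),
            "device": parts[1],
            "temp": float(parts[2][len("temp="):]),
        }
    except ValueError:
        return None

class Store:
    """Append-only record store with an index on device id."""

    def __init__(self):
        self.records = []
        self.by_device = defaultdict(list)  # device -> positions in self.records

    def ingest(self, line):
        """Clean and index a line on arrival; reject it if malformed."""
        record = parse(line)
        if record is None:
            return False
        self.by_device[record["device"]].append(len(self.records))
        self.records.append(record)
        return True

    def temps_for(self, device):
        """Index lookup: touches only the matching records."""
        return [self.records[i]["temp"] for i in self.by_device.get(device, [])]

store = Store()
accepted = [store.ingest(line) for line in RAW_LINES]
print(accepted)
print(store.temps_for("sensor-7"))
```

The parsing cost is paid once at ingest, so later queries look up the index instead of re-parsing and scanning raw text every time.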
With SSDs and Flash at our disposal, there is no reason to re-write an entire file just to update individual fields or records — we’d better harness structured data and only modify the impacted pages.
At the center of this revolution, we have Spark and DataFrames. After years of investment in Hadoop, some of its projects are becoming superfluous and are being displaced by faster and simpler Spark-based applications. Spark engineers made the right choice and opened it up to a variety of external data sources instead of sticking to Hadoop’s approach and forcing us to copy all the data into a crippled and low-performing file system... yes, I’m talking about HDFS.
Published at DZone with permission of Yaron Haviv, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.