51 Big Data Terms You Need to Know


Some of the biggest definitions that you need to know when it comes to big data.


With billions of bytes of data collected daily, it's more important than ever to understand the intricacies of big data. To help bring clarity to the field, we compiled a list, drawn from our recent big data guides, of the terms and definitions we consider most important to know. (By the way, if you're interested in this, you might also be interested in our AI glossary!)

Any terms you think we should add? Let us know in the comments!


Algorithm: A set of rules given to an AI, neural network, or other machine to help it learn on its own; classification, clustering, recommendation, and regression are four of the most popular types.

Apache Flink: An open-source streaming data processing framework. It is written in Java and Scala and is used as a distributed streaming dataflow engine.

Apache Hadoop: An open-source tool to process and store large distributed data sets across machines by using MapReduce.

Apache Kafka: A distributed streaming platform that improves upon traditional message brokers with higher throughput, built-in partitioning, replication, lower latency, and greater reliability.

Apache NiFi: An open-source Java server that enables the automation of data flows between systems in an extensible, pluggable, open manner. NiFi was open-sourced by the NSA.

Apache Spark: An open-source big data processing engine that runs on top of Apache Hadoop, Mesos, or the cloud.

Artificial intelligence: A machine's ability to make decisions and perform tasks that simulate human intelligence and behavior.


Big data: A common term for large amounts of data. To qualify as big data, data must enter the system at high velocity, with wide variety, or in high volume.

Blob storage: An Azure service that stores unstructured data in the cloud as a blob or an object.

Business intelligence: The process of visualizing and analyzing business data for the purpose of making actionable and informed decisions.


Cluster: A subset of data that shares particular characteristics. Can also refer to several machines that work together to solve a single problem.

CoAP: The Constrained Application Protocol is an Internet application protocol for resource-constrained devices that can be translated to HTTP if needed.


Data engineering: The collection, storage, and processing of data so that it can be queried by a data scientist.

Data flow management: The specialized process of ingesting raw device data while managing the flow of thousands of producers and consumers, then performing basic data enrichment, in-stream analysis, aggregation, splitting, schema translation, format conversion, and other initial steps to prepare the data for further business processing.

Data governance: The process of managing the availability, usability, integrity, and security of data within a data lake.

Data integration: The process of combining data from different sources and providing a unified view for the user.

Data lake: A storage repository that holds raw data in its native format.

Data mining: The practice of generating new information by examining and analyzing large databases.

Data operationalization: The process of strictly defining variables into measurable factors.

Data preparation: The process of collecting, cleaning, and consolidating data into one file or data table, primarily for use in analysis.

Data processing: The process of retrieving, transforming, analyzing, or classifying information by a machine.

Data science: A field that explores repeatable processes and methods to derive insights from data.

Data swamp: What a data lake becomes without proper governance.

Data validation: The act of examining data sets to ensure that all data is clean, correct, and useful before it is processed.
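
As a rough illustration of data validation, the sketch below checks each record against two rules before it is passed on for processing. The field names (`device_id`, `temperature_c`) and the accepted temperature range are assumptions made up for this example, not part of any standard.

```python
def validate_record(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    # Hypothetical rule: every record needs a non-empty device_id.
    if not record.get("device_id"):
        problems.append("missing device_id")
    # Hypothetical rule: temperature must be a number in a plausible range.
    temp = record.get("temperature_c")
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        problems.append("temperature_c is not a number")
    elif not -90 <= temp <= 60:
        problems.append("temperature_c out of range")
    return problems


records = [
    {"device_id": "a1", "temperature_c": 21.5},
    {"device_id": "", "temperature_c": 19.0},
    {"device_id": "b7", "temperature_c": "hot"},
]

# Only records with no problems move on to processing.
valid = [r for r in records if not validate_record(r)]
rejected = [r for r in records if validate_record(r)]
```

Returning a list of problems, rather than a simple true or false, makes it easier to report why a record was rejected.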

Data warehouse: A large collection of data from various sources used to help companies make informed decisions.

Device layer: The entire range of sensors, actuators, smartphones, gateways, and industrial equipment that send data streams corresponding to their environment and performance characteristics.


GPU-accelerated databases: Databases that use graphics processing units (GPUs) to parallelize query processing, often used to ingest and analyze streaming data.

Graph analytics: A way to organize and visualize relationships between different data points in a set.


Hadoop: A programming framework for processing and storing big data, particularly in distributed computing environments.


Ingestion: The intake of streaming data from any number of different sources.


MapReduce: A data processing model that filters and sorts data in the Map stage, then performs a function on that data and returns an output in the Reduce stage.
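
The classic way to picture MapReduce is a word count. The sketch below runs the stages in a single Python process, so it only mimics what a framework such as Hadoop distributes across machines; the function names (`map_stage`, `shuffle`, `reduce_stage`) are chosen for this example.

```python
from collections import defaultdict


def map_stage(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_stage(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data big insights", "data lake data swamp"]
pairs = [pair for doc in documents for pair in map_stage(doc)]
counts = dict(reduce_stage(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster, each document (or block of a file) would be mapped on a different machine, and each key's values would be reduced wherever the framework routes them.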

Munging: The process of manually converting or mapping data from one raw form into another format for more convenient consumption.


Normal distribution: A symmetric, bell-shaped probability distribution defined by its mean and standard deviation. The averages of many independent random variables tend toward this distribution as the data set increases in size. Also called a Gaussian distribution or bell curve.

Normalizing: The process of organizing data into tables so that the results of using the database are always unambiguous and as intended.


Parse: To divide data, such as a string, into smaller parts for analysis.
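
As a small example of parsing, the sketch below divides a log line into named parts. The "timestamp level message" layout is an assumption made for illustration; real log formats vary.

```python
def parse_log_line(line):
    """Split a 'timestamp level message' string into named parts."""
    # maxsplit=2 keeps any spaces inside the message intact.
    timestamp, level, message = line.split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


parsed = parse_log_line("2024-01-15T10:32:00Z ERROR disk usage above 90%")
print(parsed["level"])
```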

Persistent storage: A non-volatile place, such as a disk, where data remains saved after the process that created it has ended.

Python: A general-purpose programming language that emphasizes code readability in order to allow programmers to use fewer lines of code to express their concepts.


R: An open-source language primarily used for data visualization and predictive analytics.

Real-time stream processing: A model for analyzing sequences of data by using machines in parallel, though with reduced functionality.

Relational database management system (RDBMS): A system that manages, captures, and analyzes data organized into tables, called relations, whose rows share a common set of attributes.

Resilient distributed dataset: The primary way that Apache Spark abstracts data, where data is stored across multiple machines in a fault-tolerant way.


Shard: An individual partition of a database.
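
One common way to decide which shard a record belongs to is to hash its key. The sketch below is a minimal illustration of that idea, assuming four shards and string keys; it is not how any particular database assigns shards.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for this example


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a string key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Place a few hypothetical user IDs into their shards.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user_id)].append(user_id)
```

A stable hash such as SHA-256 is used here instead of Python's built-in `hash()`, whose results for strings change between interpreter runs.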

Smart data: Digital information that is formatted so it can be acted upon at the collection point before being sent to a downstream analytics platform for further data consolidation and analytics.

Stream processing: The real-time processing of data. The data is processed continuously, concurrently, and record-by-record.
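
To make the record-by-record idea concrete, the sketch below updates a running average each time a new reading arrives instead of waiting for the full data set. The `sensor_stream` generator is a stand-in for a real, unbounded source, and the example runs on a single thread, so it does not show the concurrent side of stream processing.

```python
def sensor_stream():
    """Stand-in for an unbounded source; yields one reading at a time."""
    for reading in [20.0, 22.0, 21.0, 25.0]:
        yield reading


def running_average(stream):
    """Update and emit the average as each record arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


averages = list(running_average(sensor_stream()))
print(averages)  # [20.0, 21.0, 21.0, 22.0]
```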

Structured data: Information with a high degree of organization.


Taxonomy: The classification of data according to a pre-determined system with the resulting catalog used to provide a conceptual framework for easy access and retrieval.

Telemetry: The remote acquisition of information about an object (for example, from an automobile, smartphone, medical device, or IoT device).

Transformation: The conversion of data from one format to another.


Unstructured data: Data that either does not have a pre-defined data model or is not organized in a pre-defined manner.


Visualization: The process of analyzing data and expressing it in a readable, graphical format, such as a chart or graph.


Zones: Distinct areas within a data lake that serve specific, well-defined purposes.


Opinions expressed by DZone contributors are their own.
