Big Data Resources

Introduction to Spring Data Elasticsearch 4.1

Getting started with the latest version of Spring Data Elasticsearch 4.1 using Elasticsearch 7 as a NoSQL database.

June 13, 2021

by Arnošt Havelka

CORE

· 23,160 Views · 9 Likes

Confluent’s Kafka REST Proxy, The Silk Route for Data Movement to Operational Kafka Cluster

In this article, I am going to detail the steps to integrate the prebuilt versions of Confluent REST Proxy with a running multi-broker Apache Kafka cluster.

June 13, 2021

by Gautam Goswami

CORE

· 20,391 Views · 3 Likes

Introducing Cloudera SQL Stream Builder (SSB)

SSB is an improved release of Eventador's SQL Stream Builder with integration into Cloudera Manager, Cloudera Flink, and other streaming tools.

Updated June 6, 2021

by Tim Spann

CORE

· 14,881 Views · 5 Likes

Applications for GPU-Based AI and Machine Learning

We look at some of the most talked-about Artificial Intelligence and Machine Learning areas where graphics processing units (GPUs) play an ever-increasing role.

June 6, 2021

by Kevin Vu

· 16,725 Views · 5 Likes

Oracle BI vs. Tableau: Which Business Intelligence Tool Is Better?

The choice between these two equally good BI tools depends on the scale and complexity of the data and on the enterprise's objectives for its BI implementation.

June 6, 2021

by Raju Shahi

· 9,653 Views · 4 Likes

Deploying CockroachDB on Kubernetes using OpenEBS LocalPV

CockroachDB is a cloud-native SQL database that features both scalability and consistency. The database is designed to withstand data center failures by deploying multiple instances of symmetric nodes in a cluster consisting of several machines, disks, and data centers. Kubernetes' built-in capabilities to scale and survive node failures make it well suited to orchestrate CockroachDB's databases, particularly because Kubernetes simplifies cluster management and helps maintain high availability by replicating data across independent nodes. This guide focuses on how OpenEBS LocalPV devices can be used to persist storage for Kubernetes-hosted CockroachDB clusters.

Introduction to Distributed, Scaled-out Databases

Ever-growing demands for resilience, performance, scalability, and ease of use have led to an explosion of choices for developers and data scientists in search of an open source database to address their needs. Databases are often characterized as either SQL databases, noted for their consistency guarantees, with PostgreSQL and MariaDB considered to be ACID compliant (Atomic, Consistent, Isolated, Durable), or NoSQL databases, noted for their scalability and flexibility but not considered to be either ACID compliant or completely compatible with SQL. More recently, distributed, scaled-out databases were introduced that promise to avoid the trade-offs between SQL and NoSQL databases, allowing for the scalability of NoSQL databases along with the ACID transactions, strong consistency, and relational schemas of SQL databases.

CockroachDB is a distributed database that is built on top of RocksDB as its transactional and key-value store. CockroachDB supports both ACID transactions and vertical and horizontal scalability. With extensive geographical distribution, CockroachDB can maintain availability with controlled latency in case of a disk, machine, or even data center failure.

How CockroachDB works

CockroachDB is deployed in clusters consisting of multiple nodes. Each node is divided into five layers:

The SQL Layer converts client queries to key-value entities by first parsing them against a YACC file and then converting them into an abstract syntax tree. From this tree, the database generates a network of plan nodes containing key-value code. When the plan nodes are executed, they initiate communication with the transaction layer.

The Transaction Layer uses two-phase commits to implement the semantics of ACID transactions. These commits are executed across all nodes in the cluster. A commit involves posting write intents and transaction records, then executing read operations.

The Distribution Layer receives a request from its node once a commit has been made at the transaction layer. It identifies the destination node for the request and forwards the request to that node's replication layer.

The Replication Layer's primary responsibility is creating multiple copies of data across cluster nodes. It also uses the Raft algorithm to ensure consensus between the nodes holding copies of the same data.

The Storage Layer uses RocksDB to store data as key-value pairs.

Although CockroachDB can run on Mac, Linux, and Windows, production instances of CockroachDB are typically run on Linux virtual machines or containers. The database can be orchestrated either in the cloud or in an on-premises setup. Orchestration tools like Kubernetes are considered well suited to running stateful applications.

Orchestrating CockroachDB with Kubernetes Clusters: Before we begin

To understand how CockroachDB is orchestrated on Kubernetes, here are some Kubernetes terms applicable to storage and stateful applications:

A StatefulSet is a collection of Kubernetes Pods viewed as a single stateful unit with its own network identity. A StatefulSet is a stable Kubernetes object that always binds to the same persistent storage when it restarts.

A Persistent Volume is a block-storage-based file system that is bound to a Pod. A volume's lifecycle is not tied to the Pod to which it is attached, and every CockroachDB node can attach to the same persistent volume every time it restarts.

A Certificate Signing Request is a request by a client to have its TLS certificate signed by the Certificate Authority built into Kubernetes by default.

Role-Based Access Control (RBAC) is the system used by Kubernetes to administer access permissions in the cluster. Roles allow users to access certain resources within the cluster.

To use the most up-to-date files, Kubernetes version 1.15 or higher is required to run CockroachDB clusters. The database can be deployed on any Kubernetes distribution, including a local cluster (such as Minikube), Amazon AWS and EKS, and Google GKE and GCE, among others. For persistence and replication, CockroachDB relies on external persistent volumes such as OpenEBS LocalPV.

Installing CockroachDB Operators on OpenEBS LocalPV Devices

When using OpenEBS with CockroachDB, a LocalPV is provisioned on the node where a CockroachDB Pod is attached. The volume uses an unattached block device to store data. The OpenEBS Dynamic LocalPV provisioner can create Kubernetes Local Persistent Volumes using block devices available on the node to persist data, hereafter referred to as OpenEBS LocalPV Device volumes.

Compared to native Kubernetes Local Persistent Volumes, OpenEBS LocalPV Device volumes have the following advantages:

A dynamic volume provisioner as opposed to a static provisioner.

Better management of the block devices used for creating LocalPVs, provided by OpenEBS NDM. NDM offers capabilities such as discovering block device properties, setting up device filters, collecting metrics, and detecting whether block devices have moved across nodes.

Once a volume claims a block device, no other application can use the device for storage. If there are limited block devices in other nodes, nodeSelectors can be used to provision storage for applications on particular cluster nodes. The recommended configuration for CockroachDB clusters is at least three nodes with one unclaimed local SSD per node.

This solution guide takes you through installing CockroachDB Kubernetes operators and then configuring the cluster to use local OpenEBS devices as the storage engines. The guide also highlights how to access the database for SQL queries, and finally demonstrates how to monitor the database using Prometheus and Grafana.

Let us know how you use CockroachDB in production and if you have an interesting use case to share. Also, please check out other OpenEBS deployment guides on common Kubernetes stateful workloads:

Deploying Kafka on Kubernetes
Deploying Elasticsearch on Kubernetes
Deploying WordPress on DigitalOcean Kubernetes
Deploying Magento on Kubernetes
Deploying Percona on Kubernetes
Deploying Cassandra on Kubernetes
Deploying MinIO on Kubernetes
Deploying Prometheus on Kubernetes

This article was originally published at https://blog.mayadata.io/deploying-cockroachdb-on-kubernetes-using-openebs-localpv and has been authorized by MayaData for republication.
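As a rough illustration of the Raft-based replication described in the CockroachDB guide above, the sketch below shows only the majority-quorum arithmetic that Raft-style consensus relies on. It is not CockroachDB's implementation, and the function names (`quorum_size`, `tolerated_failures`, `is_committed`) are hypothetical, chosen for this example.

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - quorum_size(replicas)


def is_committed(acks: int, replicas: int) -> bool:
    """In this simplified model, a write counts as committed once a
    majority of replicas has acknowledged it."""
    return acks >= quorum_size(replicas)


if __name__ == "__main__":
    # Columns: replica count, quorum size, failures tolerated.
    for n in (3, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

With three replicas, a majority of two can still agree after one failure, which is consistent with the guide's recommendation of at least three nodes. An even replica count adds no extra failure tolerance over the odd count below it.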

May 31, 2021

by Sudip Sengupta

CORE

· 13,891 Views · 3 Likes

AWS Serverless Data Lake: Built Real-time Using Apache Hudi, AWS Glue, and Kinesis Stream

In an enterprise system, populating a data lake relies heavily on interdependent batch processes. Today’s business demands high-quality data in minutes or seconds.

May 29, 2021

by Gaurav Gupta

· 13,258 Views · 3 Likes

4 Ways the IoT Creates Intelligent Pipeline Monitoring

IoT sensors make it possible for the pipeline industry to detect and pinpoint leaks more effectively. How can they improve pipeline monitoring?

May 27, 2021

by Emily Newton

· 21,242 Views · 2 Likes

Azure Synapse Analytics – New Insights Into Data Security

Integrated Azure Synapse Workspace helps handle the security of data in one place for all data lakes, data analytics, and warehousing needs, but also requires learning some new concepts.

May 24, 2021

by Piotr Gwiazda

· 8,474 Views · 2 Likes

Creating Live Dashboards With QuickSight

See how you can bring together AWS Lambda, S3 and QuickSight to create a live dashboard of COVID-19 vaccination.

May 20, 2021

by James Sugrue

· 9,321 Views · 4 Likes

Introduction to Apache Kafka With Spring

Introduction to Apache Kafka with Spring.

May 20, 2021

by Otavio Santana

CORE

· 13,225 Views · 12 Likes

Looking for the Best Java Data Computation Layer Tool

This essay is a deep dive into 4 types of data computation layer tools (class libraries) to compare structured data computing capabilities and basic functionalities.

May 20, 2021

by Jerry Zhang

· 7,007 Views · 1 Like

Best Practices for Data Pipeline Error Handling in Apache NiFi

Learn actionable strategies for error management modeling in Apache NiFi data pipelines, and understand the benefits of planning for error handling.

May 19, 2021

by Pieter Humphrey

· 18,359 Views · 8 Likes

Migrate Data Across Kafka Cluster Using mirrormaker2 in Strimzi

In this article, we will discuss a use case where data from one Kafka cluster has to be migrated to another Kafka cluster. We will be using MirrorMaker 2.

Updated May 18, 2021

by Chandra Shekhar Pandey

· 9,721 Views · 2 Likes

Deploying an Apache Kafka mock service with Microcks

Microcks is an open source Kubernetes-native platform for API mocking and testing. You can use the AsyncAPI specification examples to tell Microcks to generate events to Apache Kafka with a simple configuration.

May 17, 2021

by Hugo Guerrero

CORE

· 19,379 Views · 9 Likes

High-Performance Batch Processing Using Apache Spark and Spring Batch

Batch processing deals with large amounts of data: it is a method of running high-volume, repetitive data jobs, each of which performs a specific task.

May 16, 2021

by Reza Ganji

CORE

· 29,642 Views · 7 Likes

Veeva Nitro and AWS SageMaker for Life Sciences Data Scientists

There is a rise in industry-specific data analytics solutions because building up and maintaining a custom data warehouse is difficult.

May 14, 2021

by Istvan Szegedi

· 6,783 Views · 2 Likes

Deploy Elasticsearch on Kubernetes Using OpenEBS LocalPV

Overview

Elastic Stack is a group of open-source tools, including Elasticsearch, that supports data ingestion, storage, enrichment, visualization, and analysis for containerized applications. As a distributed search and analytics engine, Elasticsearch is an open-source tool that ingests application data, indexes it, and then stores it for analytics. Since it gathers large volumes of data while indexing different data types, Elasticsearch is often considered write-heavy. Kubernetes makes it easy to configure, manage, and scale Elasticsearch clusters to handle such dynamic volumes of data. Kubernetes also simplifies the provisioning of resources for Elasticsearch using Infrastructure-as-Code configurations, abstracting cluster management.

While Kubernetes alone cannot store data generated by a cluster, persistent volumes can be used to retain it for future use. To help with this, OpenEBS provisions local persistent volumes (LocalPV) and allows data to be stored on physical disks. Many users have shared their experience of using OpenEBS for local storage management in Kubernetes for Elasticsearch, including the Cloud Native Computing Foundation, ByteDance (TikTok), and Zeta Associates (Lockheed Martin), on the Adopters list maintained by the OpenEBS community.

In this guide, we explore how OpenEBS LocalPV can provision data storage for Elasticsearch clusters. This guide will also cover:

Primary functions of Elastic Stack operators in a Kubernetes cluster
Integrating Elasticsearch operators with Fluentd and Kibana to form the EFK stack
Monitoring Elasticsearch cluster metrics with Prometheus and Grafana

Getting Started with Elasticsearch Analytics

Elasticsearch makes it possible to store and search large amounts of textual, graphical, or numerical data efficiently. Kubernetes makes it easy to manage the connections between Elasticsearch nodes, thereby simplifying the deployment of Elasticsearch on-premises or in hosted cloud environments.

Note that Elasticsearch nodes are different from the Kubernetes nodes of a cluster. While an Elasticsearch node runs a single instance of Elasticsearch, a Kubernetes node is a physical or virtual machine that the orchestrator runs on.

Elasticsearch Cluster Topology

From Kubernetes' point of view, an Elasticsearch node can be considered a Pod. Whenever an Elasticsearch cluster is deployed, three types of Elasticsearch Pods are created:

Master: manages the Elasticsearch cluster
Client: directs incoming traffic to the appropriate Pods
Data: stores cluster data and makes it available

A typical seven-Pod Elasticsearch cluster consists of three master, two client, and two data nodes.

Deploying Elasticsearch involves creating manifest files for each of the cluster's Pods. By connecting to the cluster, OpenEBS creates a visibility tier that enables cluster monitoring, logging, and topology checks for LocalPV storage. Additionally, to enable cluster-wide analytics, the following tools are deployed:

Fluentd: An open-source data collection agent that integrates with Elasticsearch to collect log data, transform it, and then ship it to the Elastic backend. Fluentd is set up on cluster nodes to collect and convert Pod information and send it to the Elasticsearch data Pods for storage and indexing. It is typically set up as a DaemonSet so that it runs on each Kubernetes worker node.

Kibana: Once the cluster is deployed on Kubernetes, it needs to be monitored and managed. Kibana serves as a visualization tool for cluster data; the Elasticsearch client service that Kibana should connect to is provided to its Pods as an environment variable.

Solution Guide

The following solution guide explains the steps and important considerations for deploying Elasticsearch clusters on Kubernetes using OpenEBS Persistent Volumes. By following the guide, you can create persistent storage for the EFK stack on a Kubernetes cluster to which OpenEBS is deployed. The guide includes steps for performing metric checks and performance monitoring for the Elasticsearch cluster using Prometheus and Grafana.

Let us know how you use Elasticsearch in production and if you have an interesting use case to share. Also, please check out other OpenEBS deployment guides on common Kubernetes stateful workloads on our website:

Deploying Kafka on Kubernetes
Deploying WordPress on DigitalOcean Kubernetes
Deploying Magento on Kubernetes
Deploying Percona on Kubernetes
Deploying Cassandra on Kubernetes
Deploying MinIO on Kubernetes
Deploying Prometheus on Kubernetes

This article was originally published at https://blog.mayadata.io/deploy-elasticsearch-on-kubernetes-using-openebs-localpv and has been authorized by MayaData for republication.

May 12, 2021

by Sudip Sengupta

CORE

· 7,927 Views · 3 Likes

How Do AI Systems Identify Duplicate Data?

A discussion of AI concepts, such as comparing records in a database, and how these techniques can be used in conjunction with Salesforce.

May 10, 2021

by Ilya Dudkin

CORE

· 16,177 Views · 3 Likes

Spring Cloud Stream Channel Interceptor

A Channel Interceptor is used to capture a message before being sent or received in order to view or modify it. Learn how a channel interceptor works and how to use it.

May 5, 2021

by Mohammed ZAHID

· 15,382 Views · 4 Likes

The Latest Big Data Topics