
Big Data

Big data comprises datasets that are massive, varied, and complex and that can't be handled with traditional tools. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.

Latest Refcards and Trend Reports
Trend Report: Data Pipelines
Refcard #382: Getting Started With Apache Iceberg
Refcard #371: Data Pipeline Essentials

DZone's Featured Big Data Resources

Refcard #350: Getting Started With Data Lakes

Trend Report: Data Warehousing

Data warehousing has become an absolute must in today's fast-paced, data-driven business landscape. As the demand for informed business decisions and analytics continues to skyrocket, data warehouses are gaining in popularity, especially as more and more businesses adopt cloud-based data warehouses. DZone's 2020 Data Warehousing Trend Report explores data warehouse adoption across industries, including key challenges, cloud storage, and common data tools such as data lakes, data virtualization, and ETL/ELT. In this report, readers will find original research, an exclusive interview with "the father of data warehousing," and additional resources with helpful tips, best practices, and more.

Dagster: A New Data Orchestrator To Bring Data Closer to Business Value
By Miguel Garcia
Techniques You Should Know as a Kafka Streams Developer
By Ludovic Dehon
Streaming Data to RDBMS via Kafka JDBC Sink Connector Without Leveraging Schema Registry
By Gautam Goswami CORE
How To Use Change Data Capture With Apache Kafka and ScyllaDB

In this hands-on lab from ScyllaDB University, you will learn how to use the ScyllaDB CDC source connector to push row-level change events from the tables of a ScyllaDB cluster to a Kafka server.

What Is ScyllaDB CDC?

To recap, Change Data Capture (CDC) is a feature that allows you to not only query the current state of a database's table but also query the history of all changes made to the table. CDC is production-ready (GA) starting from ScyllaDB Enterprise 2021.1.1 and ScyllaDB Open Source 4.3. In ScyllaDB, CDC is optional and enabled on a per-table basis. The history of changes made to a CDC-enabled table is stored in a separate associated table.

You can enable CDC when creating or altering a table using the CDC option, for example:

```
CREATE TABLE ks.t (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};
```

ScyllaDB CDC Source Connector

ScyllaDB CDC Source Connector is a source connector that captures row-level changes in the tables of a ScyllaDB cluster. It is a Debezium connector, compatible with Kafka Connect (with Kafka 2.6.0+). The connector reads the CDC log for the specified tables and produces Kafka messages for each row-level INSERT, UPDATE, or DELETE operation. The connector is fault-tolerant, retrying reads from Scylla in case of failure. It periodically saves its current position in the ScyllaDB CDC log using Kafka Connect offset tracking. Each generated Kafka message contains information about the source, such as the timestamp and the table name.

Note: at the time of writing, there is no support for collection types (LIST, SET, MAP) and UDTs; columns with those types are omitted from generated messages. Stay up to date on this enhancement request and other developments in the GitHub project.

Confluent and Kafka Connect

Confluent is a full-scale data streaming platform that enables you to easily access, store, and manage data as continuous, real-time streams. It expands the benefits of Apache Kafka with enterprise-grade features. Confluent makes it easy to build modern, event-driven applications and gain a universal data pipeline, supporting scalability, performance, and reliability.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. It makes it simple to define connectors that move large data sets in and out of Kafka. It can ingest entire databases or collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency. Kafka Connect includes two types of connectors:

- Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. Source connectors can also collect metrics from application servers and store the data in Kafka topics, making the data available for stream processing with low latency.
- Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems, such as Hadoop, for offline analysis.

Service Setup With Docker

In this lab, you'll use Docker. Please ensure that your environment meets the following prerequisites:

- Docker for Linux, Mac, or Windows (note: running ScyllaDB in Docker is only recommended for evaluating and trying ScyllaDB open source; for the best performance, a regular install is recommended)
- 8 GB of RAM or greater for the Kafka and ScyllaDB services
- docker-compose
- Git

ScyllaDB Install and Init Table

First, you'll launch a three-node ScyllaDB cluster and create a table with CDC enabled.
If you haven't done so yet, download the example from git:

```
git clone https://github.com/scylladb/scylla-code-samples.git
cd scylla-code-samples/CDC_Kafka_Lab
```

This is the docker-compose file you'll use. It starts a three-node ScyllaDB cluster:

```
version: "3"
services:
  scylla-node1:
    container_name: scylla-node1
    image: scylladb/scylla:5.0.0
    ports:
      - 9042:9042
    restart: always
    command: --seeds=scylla-node1,scylla-node2 --smp 1 --memory 750M --overprovisioned 1 --api-address 0.0.0.0
  scylla-node2:
    container_name: scylla-node2
    image: scylladb/scylla:5.0.0
    restart: always
    command: --seeds=scylla-node1,scylla-node2 --smp 1 --memory 750M --overprovisioned 1 --api-address 0.0.0.0
  scylla-node3:
    container_name: scylla-node3
    image: scylladb/scylla:5.0.0
    restart: always
    command: --seeds=scylla-node1,scylla-node2 --smp 1 --memory 750M --overprovisioned 1 --api-address 0.0.0.0
```

Launch the ScyllaDB cluster:

```
docker-compose -f docker-compose-scylladb.yml up -d
```

Wait for a minute or so, and check that the ScyllaDB cluster is up and in normal status:

```
docker exec scylla-node1 nodetool status
```

Next, you'll use cqlsh to interact with ScyllaDB. Create a keyspace and a table with CDC enabled, and insert a row into the table:

```
docker exec -ti scylla-node1 cqlsh
CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 1};
CREATE TABLE ks.my_table (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};
INSERT INTO ks.my_table(pk, ck, v) VALUES (1, 1, 20);
exit
```

The full session looks like this:

```
[guy@fedora cdc_test]$ docker-compose -f docker-compose-scylladb.yml up -d
Creating scylla-node1 ... done
Creating scylla-node2 ... done
Creating scylla-node3 ... done
[guy@fedora cdc_test]$ docker exec scylla-node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load    Tokens  Owns  Host ID                               Rack
UN  172.19.0.3  ?       256     ?     4d4eaad4-62a4-485b-9a05-61432516a737  rack1
UN  172.19.0.2  496 KB  256     ?     bec834b5-b0de-4d55-b13d-a8aa6800f0b9  rack1
UN  172.19.0.4  ?       256     ?     2788324e-548a-49e2-8337-976897c61238  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
[guy@fedora cdc_test]$ docker exec -ti scylla-node1 cqlsh
Connected to  at 172.19.0.2:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor' : 1};
cqlsh> CREATE TABLE ks.my_table (pk int, ck int, v int, PRIMARY KEY (pk, ck, v)) WITH cdc = {'enabled':true};
cqlsh> INSERT INTO ks.my_table(pk, ck, v) VALUES (1, 1, 20);
cqlsh> exit
[guy@fedora cdc_test]$
```
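If you prefer to drive the same setup from code instead of an interactive cqlsh session, the sketch below performs equivalent statements with the Python cassandra-driver, which also works against ScyllaDB. This is a minimal illustration rather than part of the lab; the contact point and port assume the mapping from the compose file above.

```python
# Minimal sketch: create the CDC-enabled table and insert a row from Python.
# Assumes `pip install cassandra-driver` and the 9042 port mapping from
# docker-compose-scylladb.yml above.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS ks.my_table (
        pk int, ck int, v int,
        PRIMARY KEY (pk, ck, v)
    ) WITH cdc = {'enabled': true}
""")

# Because CDC is enabled, every write is also recorded in the table's
# associated CDC log, which the source connector reads later in the lab.
session.execute("INSERT INTO ks.my_table (pk, ck, v) VALUES (%s, %s, %s)", (1, 1, 20))

cluster.shutdown()
```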
Confluent Setup and Connector Configuration

To launch a Kafka server, you'll use the Confluent platform, which provides a user-friendly web GUI to track topics and messages. The Confluent platform provides a docker-compose.yml file to set up the services. Note: this is not how you would use Apache Kafka in production; the example is useful for training and development purposes only.

Get the file:

```
wget -O docker-compose-confluent.yml https://raw.githubusercontent.com/confluentinc/cp-all-in-one/7.3.0-post/cp-all-in-one/docker-compose.yml
```

Next, download the ScyllaDB CDC connector:

```
wget -O scylla-cdc-plugin.jar https://github.com/scylladb/scylla-cdc-source-connector/releases/download/scylla-cdc-source-connector-1.0.1/scylla-cdc-source-connector-1.0.1-jar-with-dependencies.jar
```

Add the ScyllaDB CDC connector to the Confluent connect service plugin directory using a Docker volume. Edit docker-compose-confluent.yml and add the two lines shown below, replacing <directory> with the directory of your scylla-cdc-plugin.jar file:

```
  image: cnfldemos/cp-server-connect-datagen:0.5.3-7.1.0
  hostname: connect
  container_name: connect
+ volumes:
+   - <directory>/scylla-cdc-plugin.jar:/usr/share/java/kafka/plugins/scylla-cdc-plugin.jar
  depends_on:
    - broker
    - schema-registry
```

Launch the Confluent services:

```
docker-compose -f docker-compose-confluent.yml up -d
```

Wait a minute or so, then access http://localhost:9021 for the Confluent web GUI.

Add the ScyllaConnector using the Confluent dashboard by clicking the plugin. Fill in "Hosts" with the IP address of one of the Scylla nodes (you can see it in the output of the nodetool status command) and port 9042, which the ScyllaDB service listens on. The "Namespace" is the keyspace you created earlier in ScyllaDB. Note that it might take a minute or so for ks.my_table to appear.

Test Kafka Messages

You can see that MyScyllaCluster.ks.my_table is the topic created by the ScyllaDB CDC connector. Now, check for Kafka messages from the Topics panel. Select the topic, which matches the keyspace and table name that you created in ScyllaDB.

From the "Overview" tab, you can see the topic info. At the bottom, it shows that this topic is on partition 0. A partition is the smallest storage unit that holds a subset of the records owned by a topic. Each partition is a single log file where records are written in an append-only fashion. Each record in a partition is assigned a sequential identifier called the offset, which is unique for each record within the partition. The offset is an incremental and immutable number maintained by Kafka.

As you already know, the ScyllaDB CDC messages are sent to the ks.my_table topic, and the partition id of the topic is 0. Next, go to the "Messages" tab and enter partition id 0 into the "offset" field. You can see from the Kafka topic messages that the ScyllaDB table INSERT event and its data were transferred to Kafka messages by the Scylla CDC Source Connector. Click on a message to view the full message info: it contains the ScyllaDB table name and keyspace name with the time, as well as the data state before and after the operation. Since this is an insert operation, the data before the insert is null.

Next, insert another row into the ScyllaDB table:

```
docker exec -ti scylla-node1 cqlsh
INSERT INTO ks.my_table(pk, ck, v) VALUES (200, 50, 70);
```

Now, in Kafka, wait a few seconds and you can see the details of the new message.

Cleanup

Once you are done working on this lab, you can stop and remove the Docker containers and images.

To view a list of all container IDs:

```
docker container ls -aq
```

Then you can stop and remove the containers you are no longer using:

```
docker stop <ID_or_Name>
docker rm <ID_or_Name>
```

Later, if you want to rerun the lab, you can follow the steps and use docker-compose as before.
Summary

With the CDC source connector, a Kafka plugin compatible with Kafka Connect, you can capture all the ScyllaDB table row-level changes (INSERT, UPDATE, or DELETE) and convert those events into Kafka messages. You can then consume the data from other applications or perform any other operation with Kafka.
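As a quick follow-on (not part of the original lab), the sketch below shows one way to consume those CDC events outside the Confluent GUI using the kafka-python client. The broker address and topic name are assumptions based on the setup above: the cp-all-in-one compose file exposes the broker on localhost:9092, and the connector in this lab writes to MyScyllaCluster.ks.my_table. Depending on the converter configured for Kafka Connect, the payload may be JSON or Avro; the JSON deserializer here is an assumption.

```python
# Minimal sketch: read the CDC events produced by the connector.
# Assumes `pip install kafka-python` and the broker/topic names used above.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "MyScyllaCluster.ks.my_table",       # topic created by the CDC source connector
    bootstrap_servers="localhost:9092",  # broker exposed by the Confluent compose file
    auto_offset_reset="earliest",        # start from the first CDC event
    value_deserializer=lambda b: json.loads(b.decode("utf-8")) if b else None,
)

for message in consumer:
    # Each record carries the row state before and after the change.
    print(message.partition, message.offset, message.value)
```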

By Guy Shtub
Decentralized Data Mesh With Apache Kafka in Financial Services

Digital transformation requires agility and fast time to market as critical factors for success in any enterprise. Decentralization with a data mesh separates applications and business units into independent domains. Data sharing in real-time with data streaming helps provide information in the proper context to the correct application at the right time. This article explores a case study from the financial services sector where a data mesh was built across countries for loosely coupled data sharing with standardized, enterprise-wide data governance.

Data Mesh: The Need for Real-Time Data Streaming

If there were a buzzword of the hour, it would undoubtedly be "data mesh!" This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable.

Digital Transformation in Financial Services

The new enterprise reality in the financial services sector: innovate or be disrupted! A few initiatives I have seen in banks around the world that leverage real-time data streaming:

- Legacy modernization, e.g., mainframe offloading and replacement with Apache Kafka.
- Middleware modernization with scalable, open infrastructures replacing ETL, ESB, and iPaaS platforms.
- Hybrid cloud data replication for disaster recovery, migration, and other scenarios.
- Transactions and analytics in real-time at any scale.
- Fraud detection in real-time with Kafka and Flink to prevent fraud before it happens.
- Cloud-native core banking to enable modern business processes.
- Digital payment innovation with Apache Kafka as the data hub for cryptocurrency, DeFi, NFT, and Metaverse (beyond the buzz).

Let's look at a practical example from the real world.

Raiffeisen Bank International: A Bank Transformation Across 12 Countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across twelve countries. The universal bank is headquartered in Vienna, Austria, and has decades of experience (and related legacy infrastructure) in retail, corporate and markets, and investment banking. Let's explore the journey of Raiffeisen Bank's digital transformation.

Building a Data Mesh Without Knowing It

Raiffeisen Bank, operating across twelve countries, has all the typical challenges and requirements for data sharing across applications, platforms, and governments. Raiffeisen Bank built a decentralized data mesh enterprise architecture with real-time data sharing as its foundation; the team did not even know the term, because the buzzword did not exist when they started building it. However, there are good reasons for using data streaming as the data hub.

The Enterprise Architecture of RBI's Data Mesh With Data Streaming

The reference architecture includes data streaming as the heart of the infrastructure. It is real-time and scalable, and it decouples independent domains and applications.
Open banking APIs exist for request-response communication (source: Raiffeisen Bank International). Three core principles of the enterprise architecture ensure an agile, scalable, and future-ready infrastructure across the countries:

- API: Internal APIs standardized based on domain-driven design.
- Group integration: Live and connected with eleven countries, 320 APIs available, and constantly increasing.
- EDA: Event-driven reference architecture created and rollout ongoing, with the group layer live with the first use cases.

The combination of data streaming with Apache Kafka and request-response with REST/HTTP is prevalent in enterprise architectures. Having said that, more and more use cases directly leverage a stream data exchange for data sharing across business units or organizations.

Decoupling With Decentralized Data Streaming as the Integration Layer

The whole IT platform and technology stack is built for re-use in the group (source: Raiffeisen Bank International). Raiffeisen Bank's reference architecture has all the characteristics that define a data mesh:

- Loose coupling between applications, databases, and business units with domain-driven design.
- Independent microservices and data products (such as different core banking platforms or individual analytics in each country).
- Data sharing in real-time via a decentralized data streaming platform (fully managed in the cloud where possible, but with freedom of choice for each country).
- Enterprise-wide API contracts (= schemas in the Kafka world).

Data Governance in Regulated Banking Across the Data Mesh

Financial services is a regulated market around the world. PCI, GDPR, and other compliance requirements are mandatory, whether you build monoliths or a decentralized data mesh. Raiffeisen Bank International built its data mesh with data governance, legal compliance, and data privacy in mind from the beginning (source: Raiffeisen Bank International). Here are the fundamental principles of Raiffeisen Bank's data governance strategy:

- A central integration layer for data sharing across the independent groups in real-time for transactional and analytical workloads.
- A cloud-first strategy (when it makes sense), with fully managed Confluent Cloud for data streaming.
- Group-wide standardized event taxonomy and API contracts with a schema registry.
- Group-wide governance with event product owners across the group.
- Platform as a service providing self-service for internal customers within the different groups.

Combining these paradigms and rules enables independent data processing and innovation while still remaining compliant and enabling data sharing across different groups.

Conclusion

Independent applications, domains, and organizations build separate data products in a data mesh. Real-time data sharing across these units with standardized and loosely coupled events is a critical success factor. Each downstream consumer gets the data as needed: real-time, near real-time, batch, or request-response.

The case study from Raiffeisen Bank International showed how to build a powerful and flexible data mesh leveraging cloud-native data streaming powered by Apache Kafka. While this example comes from financial services, the principles and architectures apply to any vertical. The business objects and interfaces look different, but the significant challenges are very similar across industries.

How do you build a data mesh? Do you use batch technology like ETL tools and data lakes, or do you rely on real-time data streaming for data sharing and integration? Comment below with your response.
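To make the idea of standardized, loosely coupled event contracts a bit more concrete, here is a small, hypothetical sketch (not from RBI) of a domain publishing an event to a shared Kafka topic with an explicit schema version carried in the message headers. The topic name, event fields, and header convention are illustrative assumptions; in a real data mesh the contract would be enforced by a schema registry.

```python
# Minimal sketch: a domain publishes a versioned business event to Kafka.
# Assumes `pip install kafka-python` and a broker on localhost:9092.
# Topic, fields, and header-based versioning are illustrative only.
import json
from datetime import datetime, timezone
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {
    "event_type": "PaymentInitiated",
    "payment_id": "pay-42",
    "amount": "120.50",
    "currency": "EUR",
    "occurred_at": datetime.now(timezone.utc).isoformat(),
}

# Consuming domains depend only on the published contract (topic + schema),
# not on the producing application's internals.
producer.send(
    "payments.payment-initiated.v1",
    key=b"pay-42",
    value=event,
    headers=[("schema-version", b"1")],
)
producer.flush()
```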

By Kai Wähner CORE
ChatGPT for Newbies in Data Science

ChatGPT is a cutting-edge artificial intelligence model developed by OpenAI, designed to generate human-like text from the input provided. The model is trained on a massive dataset of text, giving it extensive knowledge of the patterns and relationships in language. With its ability to understand and generate text, ChatGPT can perform a wide range of Natural Language Processing (NLP) tasks, such as language translation, question answering, and text generation.

One of the most famous examples of ChatGPT's capabilities is its use in generating realistic chatbot conversations. Many companies and organizations have used chatbots to interact with customers, providing quick and accurate responses to common questions. Another example is the use of ChatGPT in language translation, where it can automatically translate text from one language to another, making communication easier and more accessible. Another exciting application of ChatGPT is content creation. With its ability to understand and generate text, ChatGPT has been used to create articles, poems, and even song lyrics. For example, OpenAI's GPT-3 can create articles on various topics, from sports to politics, with impressive accuracy and attention to detail.

The success of ChatGPT can be attributed to its use of a transformer architecture, a type of deep learning model that is well suited to NLP tasks involving sequential data like text. Furthermore, the pre-training of ChatGPT on a large corpus of text data gives it a solid foundation of language knowledge, allowing it to perform well on a variety of NLP tasks.

Understanding Natural Language Processing (NLP)

NLP is a subfield of artificial intelligence that deals with the interaction between computers and human language. It is a complex field that combines computer science, computational linguistics, and machine learning to process, understand, and generate human language.

NLP has a long history, dating back to the 1950s and 60s, when early researchers began exploring the use of computers to process and understand natural language. One of the pioneering influences on NLP was the linguist and cognitive scientist Noam Chomsky. Chomsky is widely regarded as the father of modern linguistics, and his work laid part of the foundation for the development of NLP; his theories about the structure of language and humans' innate ability to learn languages have profoundly influenced the field. Another important figure in the history of NLP is John Searle, who developed the Chinese Room argument, which challenged the idea that a machine could truly understand language. Despite this argument, the development of NLP continued to advance, and in the 1990s there was a significant increase in research in the field, leading to new NLP techniques and tools.

Despite its advances, NLP continues to face significant challenges. One of the main difficulties is the complexity of human language, which can vary greatly depending on the context and the speaker. This variability can make it difficult for computers to understand and generate language, as they must recognize the nuances and subtleties of language to perform NLP tasks accurately. Another challenge is the need for labeled training data, which is required to train NLP models. Creating labeled data is time-consuming and labor-intensive, and obtaining high-quality labeled data can be difficult.
This makes it challenging to train NLP models that perform well across a variety of tasks. Despite these challenges, the field of NLP continues to advance, and new techniques and models are constantly being developed. For example, the rise of big data and the availability of large amounts of text have led to more powerful NLP models, like ChatGPT, which can process and generate human-like text.

The Importance of NLP in AI

NLP plays a critical role in the development of artificial intelligence. As mentioned, NLP enables computers to process, understand, and generate human language, which is essential for building AI systems that can interact with humans naturally and intuitively.

One of the key reasons for the importance of NLP in AI is the sheer amount of text data generated daily: emails, social media posts, news articles, and many other forms of text-based information. The ability to process and analyze this text data is critical for a wide range of applications, including sentiment analysis, information extraction, and machine translation, to name a few. NLP also plays a crucial role in conversational AI, allowing computers to engage in natural language conversations with humans. This is a rapidly growing area of AI, and NLP is essential for building chatbots, virtual assistants, and other conversational AI systems that help businesses and organizations interact more efficiently and effectively with their customers.

To illustrate the importance of NLP in AI, consider the example of sentiment analysis. Sentiment analysis is the process of determining the emotion or attitude expressed in a piece of text. This is a critical task in social media analysis, where it is used to gauge public opinion on a particular issue. NLP is used to analyze the text, identify the sentiment, and classify it as positive, negative, or neutral, as in the short sketch below. Another example is information extraction, the process of automatically extracting structured information from unstructured text. This is a critical task in news analysis and business intelligence, where large amounts of unstructured text must be processed and analyzed to gain insights into trends and patterns. NLP is used to analyze the text, identify the relevant information, and extract it in a structured format that can be easily searched.

NLP is a critical component of AI, and its importance will only continue to grow as more text data is generated and the need for AI systems that can process and understand human language increases. The development of NLP has led to significant advances in AI, and it will continue to play a crucial role in shaping the future of AI and of how computers and humans interact.
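As a concrete illustration of the sentiment-analysis task described above, the following sketch uses the Hugging Face transformers pipeline (an assumption; any sentiment classifier would do) to label a few short texts as positive or negative.

```python
# Minimal sentiment-analysis sketch.
# Assumes `pip install transformers` plus a backend such as PyTorch;
# the default pipeline model is downloaded on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

texts = [
    "The new release is fantastic, setup took five minutes.",
    "The dashboard keeps crashing and support never answers.",
]

for text, result in zip(texts, classifier(texts)):
    # Each result is a dict with a label (POSITIVE/NEGATIVE) and a confidence score.
    print(f"{result['label']:8} {result['score']:.2f}  {text}")
```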
How ChatGPT Works

ChatGPT is based on the GPT (Generative Pre-trained Transformer) architecture, introduced in 2018 by researchers at OpenAI, the lab co-founded by Ilya Sutskever and Sam Altman. The key innovation of the GPT architecture was its use of the Transformer network, introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need." The Transformer network was designed to be more computationally efficient and easier to train than previous neural network architectures, and it quickly became the dominant architecture in NLP.

ChatGPT is pre-trained on a large corpus of text data, which includes books, websites, and other forms of text-based information. This pre-training allows ChatGPT to learn language patterns and structures and to generate coherent, natural-sounding text from user input. Pre-training is followed by fine-tuning, where the model is further trained on specific tasks, such as question answering, text generation, and conversation. During fine-tuning, the model is trained on a smaller dataset specific to the task, which allows it to specialize and generate more accurate and relevant text.

Once the model is trained, it can generate text when given an input prompt. The prompt can be a question, a statement, or any other form of text, and the model will generate a response based on what it learned during training. For example, if a user provides the prompt "What is the capital of France?", ChatGPT will generate the response "The capital of France is Paris." This response draws on the relationships between geographical locations and their capitals that the model picked up during pre-training and fine-tuning.

The Transformer Architecture: A Technical Overview

The Transformer architecture is the backbone of the ChatGPT model and is what allows it to generate human-like text. It is called a "Transformer" because it uses self-attention mechanisms to transform the input data into a representation suitable for generating text. The self-attention mechanism allows the model to weigh the importance of different parts of the input, enabling it to generate more accurate and relevant text.

In the Transformer architecture, the input is processed by multiple layers, each using self-attention to transform the input into a new representation. The output of each layer is passed to the next, and this is repeated until the final layer generates the output text. Each layer comprises two sub-layers: a Multi-Head Self-Attention mechanism and a Position-wise Feed-Forward Network. The Multi-Head Self-Attention mechanism weighs the importance of different parts of the input; the Position-wise Feed-Forward Network processes the weighted input and produces a new representation. The Multi-Head Self-Attention mechanism is implemented as a series of attention heads, each of which performs a separate attention computation on the input. The outputs of the heads are combined and passed to the Position-wise Feed-Forward Network, a fully connected network that is computationally efficient and easy to train, which makes it a valuable component of the Transformer architecture.
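To make the self-attention idea above a little less abstract, here is a toy NumPy sketch of scaled dot-product attention, the core operation inside each attention head. The dimensions and random inputs are arbitrary; real models learn the query, key, and value projections during training.

```python
# Toy sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_k, d_v = 4, 8, 8          # 4 tokens, 8-dimensional keys/values (arbitrary sizes)
Q = rng.normal(size=(seq_len, d_k))  # queries: one row per token
K = rng.normal(size=(seq_len, d_k))  # keys
V = rng.normal(size=(seq_len, d_v))  # values

scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

output = weights @ V                 # each token's output is a weighted mix of all value vectors
print(weights.round(2))              # attention weights, one row per token
print(output.shape)                  # (4, 8)
```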
Pre-Training: The Key to ChatGPT's Success

Pre-training is an essential step in creating the ChatGPT model and sets it apart from many other conversational AI systems. Pre-training means training the model on a massive amount of data before fine-tuning it for a specific task. By pre-training the model on a large corpus of text, the model learns the patterns and structures of human language, which makes it more capable of generating human-like text.

ChatGPT was pre-trained on a variety of text sources, including books, news articles, Wikipedia articles, and web pages. The vast amount of text used for pre-training allows the model to learn a wide range of styles and genres, making it well suited to generating text in various contexts. The pre-training data was also curated to expose the model to high-quality, well-written text. This matters because the quality of the pre-training data directly affects the quality of the generated text: if the pre-training data contains errors, grammatical mistakes, or low-quality text, the model will be less capable of generating high-quality text. Pre-training is computationally intensive and requires substantial resources; to pre-train the ChatGPT model, OpenAI used a large cluster of GPUs, allowing the model to be trained in a relatively short time.

Once pre-training is complete, the model is fine-tuned for a specific task. Fine-tuning adjusts the model weights to better suit the task at hand; for example, if the task is to generate conversational text, the model may be fine-tuned to produce more conversational output.

Fine-Tuning: Customizing ChatGPT for Specific Tasks

Fine-tuning is the process of adjusting the weights of the pre-trained ChatGPT model to better suit a specific task. It is essential because it allows the model to be customized for a particular use case, which results in better performance. One of the main challenges of fine-tuning is finding the right amount of data. If too little data is used, the model may not learn the patterns and structures of the task at hand; if too much data is used, the model may overfit the training data and perform poorly on new data. Another challenge is choosing the right hyperparameters, the values that control the model's behavior, such as the learning rate, the number of layers, and the number of neurons. Choosing the right hyperparameters is essential because they significantly affect the model's performance.

To overcome these challenges, researchers and practitioners have developed several techniques. One of the most popular is transfer learning, which uses a pre-trained model as a starting point and then fine-tunes it for a specific task. Transfer learning lets the model take advantage of the knowledge it gained during pre-training, which results in faster and more effective fine-tuning. Another technique is active learning, a semi-supervised method that lets the model learn from both labeled and unlabeled data, so it can learn from a larger amount of data and perform better.

Conclusion: The Future of ChatGPT

In conclusion, ChatGPT is a powerful and sophisticated language model that has revolutionized the field of NLP. With its ability to generate human-like text, ChatGPT has been used in many applications, from conversational agents and language translation to question answering and sentiment analysis.
As AI advances, ChatGPT will likely continue to evolve and become even more sophisticated. Future developments could include improved pre-training techniques, better architectures, and new fine-tuning methods. Additionally, as more data becomes available, ChatGPT should become even more accurate and effective across a wider range of tasks.

However, ChatGPT has drawbacks. One is the possibility of ethical issues arising from use of the model: there are concerns about its potential to generate harmful or biased text, and there is a risk of the model being used for malicious purposes, such as creating fake news or impersonating individuals. Another drawback is the high computational cost of training and using the model, which can be a significant barrier to entry for many organizations, particularly smaller ones that may not have the resources to invest in the necessary hardware and infrastructure.

Despite these drawbacks, the potential benefits of ChatGPT are too great to ignore. As AI continues to evolve, ChatGPT will likely play an increasingly important role in our daily lives. Whether it leads to a future filled with intelligent and helpful conversational agents or a world where the lines between human and machine language become blurred, the future of ChatGPT is exciting and intriguing, filled with possibilities for further development and application.
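For readers who want to try prompting the model programmatically rather than through the chat UI, here is a minimal sketch using the openai Python package as it existed around the time of writing (the 0.x-era ChatCompletion API); the model name, the environment variable holding the API key, and the prompt are assumptions you would adapt to your own account.

```python
# Minimal sketch: send a prompt to the ChatGPT API and print the reply.
# Assumes `pip install openai` (0.x-era API) and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # assumed model name; use any chat-capable model you have access to
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)

print(response.choices[0].message["content"])
```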

By Taras Baranyuk CORE
Introduction to Azure Data Lake Storage Gen2

Built on Azure Blob Storage, Azure Data Lake Storage Gen2 is a suite of features for big data analytics. Data Lake Storage Gen2 combines the capabilities of Azure Data Lake Storage Gen1 and Azure Blob Storage. For instance, Data Lake Storage Gen2 offers scale, file-level security, and file system semantics. Because these capabilities are built on Blob Storage, you also get low-cost, tiered storage with high availability and disaster recovery.

Designed for Enterprise Big Data Analytics

Azure Storage is now the starting point for building enterprise data lakes on Azure, thanks to Data Lake Storage Gen2. Designed from the ground up to support many petabytes of data while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 lets you manage enormous volumes of data with ease.

A key component of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob Storage. For efficient data access, the hierarchical namespace organizes objects and files into a hierarchy of directories. Object storage names often use slashes to simulate a hierarchical directory structure; with Data Lake Storage Gen2, this arrangement becomes real. Operations on a directory, such as renaming or deleting it, become single atomic metadata operations, with no need to enumerate and process every object that shares the directory's name prefix.

Building on Blob Storage improves administration, security, and performance in the following ways:

- Performance: Performance is optimized because you do not need to copy or transform data before analysis. In addition, directory management operations perform far better with the hierarchical namespace than with a flat namespace, which improves job performance.
- Management: Management is simpler because you can organize and manage files using directories and subdirectories.
- Security: Security is enforceable because POSIX permissions can be set on directories or individual files.

Additionally, Data Lake Storage Gen2 is relatively affordable because it is built on low-cost Azure Blob Storage. The additional capabilities further lower the total cost of ownership for running big data analytics on Azure.

Important Characteristics of Data Lake Storage Gen2

- Hadoop-compatible access: Data Lake Storage Gen2 lets you organize and access data much as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver, used to access data, is available in all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and Azure Synapse Analytics.
- Security: The security model for Data Lake Gen2 supports ACLs and POSIX permissions, plus additional granularity specific to Data Lake Storage Gen2. Settings can be configured through Storage Explorer or through frameworks like Hive and Spark.
- Cost-effective: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Features such as the Azure Blob Storage lifecycle reduce costs as data moves through its lifecycle.
- Driver optimization: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.

Scalability

Whether you access it through the Data Lake Storage Gen2 or the Blob Storage interface, Azure Storage is scalable by design and can store and serve many exabytes of data. Throughput at this scale is measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Processing latencies are monitored at the service, account, and file levels and are nearly constant per request.
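As a small illustration of the hierarchical namespace described above, the sketch below uses the azure-storage-file-datalake Python SDK to create a directory, upload a file into it, and rename the directory as a single metadata operation. The connection string, file-system name, and paths are placeholder assumptions; check the SDK documentation for the exact client options in your version.

```python
# Minimal sketch: work with directories in a hierarchical-namespace (ADLS Gen2) account.
# Assumes `pip install azure-storage-file-datalake` and an existing storage account
# with a container ("file system") named "raw"; names and credentials are placeholders.
import os
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
fs = service.get_file_system_client("raw")

# Create a dated directory and upload one small file into it.
directory = fs.get_directory_client("UK/Planes/BA1293/Engine1/2017/08/11/12")
directory.create_directory()
file_client = directory.get_file_client("telemetry.json")
file_client.upload_data(b'{"rpm": 2400}', overwrite=True)

# With the hierarchical namespace, renaming the directory is a single atomic
# metadata operation; no per-blob copy or enumeration is needed.
directory.rename_directory(new_name=f"{fs.file_system_name}/UK/Planes/BA1293/Engine1/2017/08/11/13")
```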
Cost-Effectiveness

Storage capacity and transaction costs are lower because Data Lake Storage Gen2 is built on top of Azure Blob Storage. Unlike with other cloud storage providers, you do not need to move or transform your data before you can analyze it. See Azure Storage pricing for additional details. Features such as the hierarchical namespace also greatly improve the overall performance of many analytics jobs. Because of this performance improvement, processing the same amount of data requires less compute, which lowers the total cost of ownership (TCO) for the entire analytics project.

A Single Service, Many Concepts

Because Data Lake Storage Gen2 is built on Azure Blob Storage, the same shared objects can be described by several concepts. The following objects are identical but are described by different concepts; unless otherwise stated, these terms are directly synonymous:

| Concept | Top-level organization | Lower-level organization | Data |
|---|---|---|---|
| Blobs (general-purpose object storage) | Container | Virtual directory (SDK only; does not provide atomic manipulation) | Blob |
| Azure Data Lake Storage Gen2 (analytics storage) | Container | Directory | File |

Supported Blob Storage Features

Your account has access to Blob Storage capabilities such as diagnostic logging, access tiers, and Blob Storage lifecycle management policies. Most Blob Storage features are fully supported, although some are only supported in preview or not at all. See "Blob Storage feature support in Azure Storage accounts" for details on how each Blob Storage feature is supported with Data Lake Storage Gen2.

Supported Azure Service Integrations

Data Lake Storage Gen2 supports several Azure services, which can be used to ingest data, perform analytics, and produce visual representations. See "Azure services that support Azure Data Lake Storage Gen2" for a list of supported services.

Supported Open-Source Platforms

Several open-source platforms support Data Lake Storage Gen2. See "Open-source platforms that support Azure Data Lake Storage Gen2" for a comprehensive list.

Best Practices for Using Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is not a dedicated service or account type; it is a collection of capabilities for high-throughput analytics workloads. The Data Lake Storage Gen2 documentation provides best practices and guidance for using these capabilities. For all other facets of account administration, such as setting up network security, designing for high availability, and disaster recovery, see the Blob Storage documentation.

Review Feature Support and Known Issues

When setting up your account to use Blob Storage features, apply the following approach.
- To find out whether a feature is fully supported in your account, read the "Blob Storage feature support in Azure Storage accounts" page. In accounts with Data Lake Storage Gen2 enabled, several features are either unsupported or only partially supported. As feature support continues to grow, check this page regularly for changes.
- Check the "Known issues with Azure Data Lake Storage Gen2" article to see whether the functionality you want to use has restrictions or requires specific handling.
- Look through feature articles for guidance that applies specifically to accounts with Data Lake Storage Gen2 enabled.

Understand the Terminology Used in the Documentation

You will notice some minor vocabulary differences as you switch between content sets. For instance, the Blob Storage documentation uses the term "blob" instead of "file." Technically, the data you upload to your storage account becomes a blob there, so the term is accurate; it can nonetheless be confusing if you are used to the term "file." You will also see a file system referred to as a "container." Treat these terms as interchangeable.

Consider Premium

Consider a premium block blob storage account if your workloads demand consistently low latency and/or a high number of input/output operations per second (IOPS). This type of account uses high-performance hardware: data is stored on solid-state drives (SSDs), which are designed for low latency and offer higher throughput than conventional hard drives. Premium performance has higher storage costs but lower transaction costs, so a premium block blob account can be cost-effective if your applications perform a large number of transactions. If your storage account will be used for analytics, we strongly advise pairing Azure Data Lake Storage Gen2 with a premium block blob storage account; this combination of a premium block blob account with a Data Lake Storage-enabled account is the premium tier for Azure Data Lake Storage.

Improve Data Ingestion

When ingesting data from a source system, the source hardware, the source network hardware, or the network connectivity to your storage account can all become bottlenecks.

Source hardware: Choose the hardware carefully, whether you are using virtual machines (VMs) in Azure or on-premises equipment. Pick disk hardware with faster spindles, and consider solid-state drives (SSDs). For network hardware, use the fastest network interface controllers (NICs) you can. Azure D14 VMs are a good choice because they have sufficiently powerful networking and disk hardware.

Network connectivity to the storage account: The network connectivity between your source data and your storage account can occasionally be a bottleneck. When your source data is on-premises, consider a dedicated Azure ExpressRoute link. If your source data is in Azure, performance is best when the data is in the same Azure region as your Data Lake Storage Gen2-enabled account.

Set up data ingestion for maximum parallelism: To get the best performance, use all available throughput by running as many reads and writes in parallel as you can, as in the sketch below.
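The following sketch illustrates parallel ingestion with a thread pool and the azure-storage-file-datalake SDK; the file-system name, local paths, and degree of parallelism are assumptions to tune for your own environment and are not from the original article.

```python
# Minimal sketch: upload many local files in parallel to an ADLS Gen2 file system.
# Assumes `pip install azure-storage-file-datalake`; the connection string, container
# name, and the local "landing" directory are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
fs = service.get_file_system_client("raw")

def upload_one(path: Path) -> str:
    # One upload per file; many of these run concurrently to use the available throughput.
    file_client = fs.get_file_client(f"landing/{path.name}")
    file_client.upload_data(path.read_bytes(), overwrite=True)
    return path.name

local_files = list(Path("landing").glob("*.csv"))
with ThreadPoolExecutor(max_workers=16) as pool:   # tune the worker count to your bandwidth
    for name in pool.map(upload_one, local_files):
        print("uploaded", name)
```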
Structure Your Data Sets

Consider organizing your data's structure ahead of time. File format, file size, and directory layout all affect performance and cost.

File Formats

Data can be ingested in various formats: human-readable formats such as JSON, CSV, or XML, or compressed binary formats such as tar.gz. Data can also arrive in a variety of sizes. It can consist of large files (a few terabytes each), such as an export of a SQL table from your on-premises systems, or of numerous small files (a few kilobytes each), such as real-time events from an Internet of Things (IoT) solution. By selecting an appropriate file format and file size, you can maximize efficiency and cut costs.

Hadoop supports a set of file formats designed specifically for storing and analyzing structured data. Popular formats include Avro, Parquet, and Optimized Row Columnar (ORC). All of these are machine-readable binary file formats; they are compressed to help you control file size, and they are self-describing because each file carries an embedded schema. The way data is stored differs between formats: Avro stores data in a row-based format, whereas Parquet and ORC store data in a columnar fashion.

Consider the Avro file format when your I/O patterns are more write-heavy or your query patterns favor retrieving complete rows of records. For instance, Avro works well with message buses such as Event Hubs or Kafka that write a sequence of events or messages. Consider the Parquet and ORC formats when I/O patterns are more read-heavy or query patterns focus on a subset of the columns in the records; read transactions can then be optimized to retrieve specific columns instead of entire records.

Apache Parquet is an open-source file format optimized for read-heavy analytics pipelines. Its columnar storage format lets you skip over irrelevant data, so queries are much more efficient because they can narrowly target which data to send from storage to the analytics engine. Parquet also provides efficient data encoding and compression schemes that can lower data storage costs, because similar data types (for a column) are stored together. Services such as Azure Synapse Analytics, Azure Databricks, and Azure Data Factory have native support for the Parquet file format.

File Size

Larger files lead to better performance and lower costs. Analytics engines such as HDInsight typically have a per-file overhead covering activities such as listing, checking access, and performing various metadata operations, so storing data as many small files can hurt performance. For better performance, organize your data into larger files, roughly 256 MB to 100 GB in size; some engines and applications may not process files larger than 100 GB efficiently. Increasing file size also reduces transaction costs: read and write operations are billed in 4-megabyte increments, so you are charged for the full increment whether the operation involves 4 megabytes or only a few kilobytes. See Azure Data Lake Storage pricing for details.

Sometimes data pipelines have limited control over the raw data, which arrives as numerous small files. We recommend that your system have a process that combines small files into larger ones for use by downstream applications, as the sketch below illustrates.
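Here is a small, hypothetical compaction sketch using pandas and pyarrow: it merges a directory of small CSV files into one larger Parquet file, the kind of read-optimized output discussed above. The input and output paths are placeholder assumptions.

```python
# Minimal sketch: compact many small CSV files into one larger Parquet file.
# Assumes `pip install pandas pyarrow`; input/output paths are placeholders.
from pathlib import Path
import pandas as pd

small_files = sorted(Path("landing/2017/08/14").glob("*.csv"))

# Read every small file and concatenate them into a single table.
frames = [pd.read_csv(path) for path in small_files]
combined = pd.concat(frames, ignore_index=True)

# Write one larger, columnar, compressed file for downstream analytics engines.
out_dir = Path("curated/2017/08/14")
out_dir.mkdir(parents=True, exist_ok=True)
combined.to_parquet(out_dir / "updates.parquet", compression="snappy", index=False)
print(f"compacted {len(small_files)} files into one Parquet file with {len(combined)} rows")
```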
If you're processing data in real-time, you can use a real-time streaming engine (such as Spark Streaming or Azure Stream Analytics) together with a message broker (such as Event Hubs or Apache Kafka) to save your data as larger files. As you combine small files into bigger ones, consider saving them in a read-optimized format such as Apache Parquet for later processing.

Directory Structure

These are some typical layouts to consider when working with Internet of Things (IoT) data, batch scenarios, or time-series data. Every workload has different requirements for how the data is consumed.

IoT Structure

IoT workloads can ingest large amounts of data spanning numerous products, devices, organizations, and customers. The directory layout needs to be planned ahead of time for organization, security, and efficient processing of the data by downstream consumers. The following layout can serve as a general template:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

For illustration, landing telemetry for an aircraft engine in the UK might look like this:

UK/Planes/BA1293/Engine1/2017/08/11/12/

By placing the date at the end of the directory structure, you can more easily restrict regions and subject matters to specific users and groups. If the date structure came first, securing these regions and subject matters would be far more difficult; for instance, to restrict access to UK data or only to particular planes, you would need to request separate permissions for many directories beneath every hour directory. This arrangement would also rapidly multiply the number of directories over time.

The Structure of a Batch Task

A common batch processing approach is to place incoming data in an "in" directory and, once it has been processed, to place the fresh data in an "out" directory so that other processes can consume it. This directory structure is sometimes used for jobs that only need to process a single file and do not necessarily need massively parallel processing across a big dataset. Like the IoT structure above, a useful directory structure has parent-level directories for things such as region and subject matter (for example, organization, product, or producer). Take date and time into account when designing the structure to enable better organization, filtered searches, security, and automation in processing. The frequency at which data is uploaded or downloaded determines the appropriate level of granularity.

File processing can occasionally fail because of data corruption or unexpected formats. In those situations, a directory structure may benefit from a /bad folder where the files can be moved for further analysis. The batch job may also handle reporting or alerting users about these problematic files so that they can take manual action. Consider the following template arrangement (a small path-building sketch follows):

{Region}/{SubjectMatter(s)}/In/{yyyy}/{mm}/{dd}/{hh}/
{Region}/{SubjectMatter(s)}/Out/{yyyy}/{mm}/{dd}/{hh}/
{Region}/{SubjectMatter(s)}/Bad/{yyyy}/{mm}/{dd}/{hh}/
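The helper below is a small, hypothetical sketch showing how such In/Out/Bad, date-partitioned paths might be generated consistently from code; the region and subject-matter values are illustrative only.

```python
# Minimal sketch: build In/Out/Bad date-partitioned directory paths for a batch job.
from datetime import datetime

def batch_path(region: str, subject: str, stage: str, when: datetime) -> str:
    """Return a path like 'NA/Extracts/In/2017/08/14/09/' for the given stage."""
    if stage not in ("In", "Out", "Bad"):
        raise ValueError("stage must be 'In', 'Out', or 'Bad'")
    return f"{region}/{subject}/{stage}/{when:%Y}/{when:%m}/{when:%d}/{when:%H}/"

run_time = datetime(2017, 8, 14, 9)
print(batch_path("NA", "Extracts", "In", run_time))    # NA/Extracts/In/2017/08/14/09/
print(batch_path("NA", "Extracts", "Bad", run_time))   # NA/Extracts/Bad/2017/08/14/09/
```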
For example, a marketing firm receives daily data extracts of customer updates from its clients in North America. Before and after processing, the data might look like this:

NA/Extracts/ACMEPaperCo/In/2017/08/14/updates_08142017.csv
NA/Extracts/ACMEPaperCo/Out/2017/08/14/processed_updates_08142017.csv

In the typical scenario of batch data being processed directly into databases such as Hive or traditional SQL databases, the output already goes into a separate folder for the Hive table or external database, so there is no need for /In or /Out directories. For instance, daily extracts from customers would land in their respective directories, and a service such as Azure Data Factory, Apache Oozie, or Apache Airflow would then trigger a daily Hive or Spark job to process and write the data into a Hive table.

Data Structure for Time Series

For Hive workloads, partition pruning of time-series data can let some queries read only a subset of the data, which improves performance. Pipelines that ingest time-series data often place their files in named folders. Here is a common example of data organized by date:

/DataSet/YYYY/MM/DD/datafile_YYYY_MM_DD.tsv

Note that the date and time details appear both in the folders and in the filename. The following is a typical format for date and time:

/DataSet/YYYY/MM/DD/HH/mm/datafile_YYYY_MM_DD_HH_mm.tsv

Once again, the choices you make about folder and file organization should be optimized for larger file sizes and a manageable number of files in each folder.

Set Up Security

Start with the recommendations in the "Security considerations for Blob storage" article. You will find best-practice guidance on protecting data behind a firewall, guarding against malicious or accidental deletion, and using Azure Active Directory (Azure AD) as the foundation for identity management. Then, for guidance specific to accounts with Data Lake Storage Gen2 enabled, see the Access control model section of the Azure Data Lake Storage Gen2 documentation. That article explains how to apply security permissions to directories and files in your hierarchical file system using Azure role-based access control (Azure RBAC) roles and access control lists (ACLs).

Ingest, Process, and Analyze

Data can be ingested into a Data Lake Storage Gen2-enabled account from a wide variety of sources and in a variety of ways. You can ingest huge data sets from HDInsight and Hadoop clusters, as well as smaller collections of ad hoc data for prototyping applications. You can ingest streaming data produced by a variety of sources, such as applications, devices, and sensors, and use tools to capture and process this kind of data in real-time, event by event, writing the events into your account in batches. You can also ingest web server logs, which contain details such as the history of page requests. If you want the flexibility to integrate your data-uploading component into your larger big data application, consider using custom scripts or programs to upload log data.

Once the data is available in your account, you can run analyses on it, create visualizations, and even download the data to your local machine or to other repositories, such as an Azure SQL Database or SQL Server instance.

Monitor Telemetry

Operationalizing your service requires careful attention to usage and performance monitoring; examples include operations that occur frequently, have high latency, or cause service throttling. All of the telemetry for your storage account is available through the Azure Storage logs in Azure Monitor.
This feature integrates your storage account with Log Analytics and Event Hubs, and it also lets you archive logs to another storage account. To view the whole collection of metrics, resource logs, and their corresponding schemas, see the Azure Storage monitoring data reference.

Where you store your logs depends on how you intend to access them. For example, you can store your logs in a Log Analytics workspace if you want near real-time access to them and the ability to correlate events in the logs with other metrics from Azure Monitor. Then, query your logs by authoring KQL queries against the StorageBlobLogs table in your workspace. If you want to keep your logs for both near real-time querying and long-term retention, you can set up your diagnostic settings to send logs to both a Log Analytics workspace and a storage account. If you want to access your logs through another query engine, such as Splunk, you can set up your diagnostic settings to send logs to an event hub and then ingest the logs from the event hub into your preferred destination.

Conclusion

Data Lake Storage Gen2 combines the capabilities of Azure Data Lake Storage Gen1 and Azure Blob Storage. For instance, Data Lake Storage Gen2 offers scale, file-level security, and file system semantics. Because these capabilities are built on Blob storage, you also get low-cost, tiered storage with high availability and disaster recovery capabilities.
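As a small, purely illustrative footnote to the directory-layout guidance above, the following Java sketch builds the dated In/Out/Bad paths from the {Region}/{SubjectMatter(s)}/{Stage}/{yyyy}/{mm}/{dd}/{hh}/ template. The helper, its names, and the sample values (reusing the marketing-firm example) are assumptions for illustration, not part of any Azure SDK.

Java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LakePathBuilder {

    // Mirrors the {Region}/{SubjectMatter(s)}/{Stage}/{yyyy}/{mm}/{dd}/{hh}/ template.
    private static final DateTimeFormatter DATE_PART =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    static String path(String region, String subjectMatter, String stage, LocalDateTime ts) {
        return String.format("%s/%s/%s/%s/", region, subjectMatter, stage, DATE_PART.format(ts));
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2017, 8, 14, 9, 0);
        // Hypothetical daily extract, following the marketing-firm example above.
        System.out.println(path("NA", "Extracts/ACMEPaperCo", "In", ts));
        System.out.println(path("NA", "Extracts/ACMEPaperCo", "Out", ts));
        System.out.println(path("NA", "Extracts/ACMEPaperCo", "Bad", ts));
    }
}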

By Sardar Mudassar Ali Khan
Apache Kafka [Video Tutorials]
Apache Kafka [Video Tutorials]

Below is a series of video tutorials about Apache Kafka. After watching these videos, you will know more about Apache Kafka's features, workflow, architecture, fundamentals, and much more. So, let's get started.

Video Tutorials

Apache Kafka Introduction: The three introduction videos cover Apache Kafka's background, messaging systems, advantages, uses, streaming processes, core APIs, and needs.

Apache Kafka Terminologies and Concepts: This video goes into detail about the key terminology associated with Apache Kafka, including brokers; producers, consumers, and consumer groups; partitions; offsets; message ordering; the Connector API; and log anatomy.

Apache Kafka Fundamentals: This video goes into detail about Apache Kafka's architecture, producers and consumers, leaders and followers, brokers, servers, partitions, and clusters.

Apache Kafka Components and Architecture: This video goes deeper into the architecture of Apache Kafka and explains some of its components, clusters, and replication factors.

Apache Kafka Cluster and Components: This video goes further into detail about Apache Kafka's clusters, ecosystem, and components.

Apache Kafka Workflow: These videos explain Apache Kafka's workflow, including pub-sub messaging, consumers and producers, ZooKeeper, brokers, and queue messaging.

Top Ten Apache Kafka Features: The last video looks at the top ten features of Apache Kafka, which include scalability, high volume, data transformations, fault tolerance, reliability, durability, performance, zero downtime, extensibility, and replication.

Conclusion

Thank you for taking a look at this video tutorial series about Apache Kafka. Hopefully, you took away a lot of useful information from these videos and can apply it to your software development career or hobby.

By Ram N
Data Ingestion vs. ETL: Definition, Benefits, and Key Differences
Data Ingestion vs. ETL: Definition, Benefits, and Key Differences

Data fuels insight-driven business decisions for enterprises today, whether for planning, forecasting market trends, predictive analytics, data science, machine learning, or business intelligence (BI). But to be fully useful, data must be extracted from disparate sources in different formats, available in a unified environment across the enterprise, abundantly and readily available, and clean. If executed poorly, incomplete or incorrect data extraction can lead to misleading reports, bogus analytics conclusions, and inhibited decision-making. This is where data ingestion comes in: a process that helps enterprises make sense of the ever-increasing volumes and complexity of data.

What Is Data Ingestion?

Data ingestion is the process of acquiring raw data from various sources and transferring it to a centralized repository. The data is moved from its original location to a system where it can be further processed or analyzed. Sources may include third-party systems such as CRMs, in-house applications, databases, spreadsheets, and information obtained from the internet. The destination for this data is typically a database, data warehouse, data lake, data mart, or a third-party application. As the initial step in data integration, data ingestion enables the incorporation of raw data from various sources and formats, whether structured, unstructured, or semi-structured. Data ingestion can be achieved by scheduling batch jobs to transfer data to a central location at regular intervals or by performing it in real time to continuously capture changes in the data.

There are two main types of data ingestion: batch ingestion and real-time ingestion.

Batch ingestion: Data is ingested in discrete chunks at periodic intervals rather than collected immediately as it is generated. The ingestion process waits until the assigned amount of time has elapsed before transmitting the data from the original source to storage. The data can be batched or grouped based on any logical ordering, simple schedules, or criteria (such as triggering certain conditions).

Real-time ingestion: Ingestion occurs in real time, where each data point is imported immediately as the source creates it. The data is made available for processing as soon as it is needed, which facilitates real-time analytics and decision-making. Real-time ingestion is also called streaming or stream processing.

Advantages of Data Ingestion

Speed: Data ingestion allows data to be imported quickly and efficiently, making it available for further processing and analysis.
Scalability: Data ingestion is highly scalable, allowing large volumes of data to be imported without significant performance degradation.
Flexibility: Data ingestion is highly flexible, allowing data to be imported from a wide variety of sources, including databases, files, and streams.

What Is ETL?

ETL is the process that extracts, transforms, and then loads data to create a uniform format. It is a more specific process whose goal is to deliver data in a format that matches the requirements of the target destination. ETL is not just about changing data for storage; it also includes making sure the process runs smoothly and is managed well. Businesses should put strong ETL practices in place so they can accommodate the changes their teams may need. Just like data ingestion, ETL can be done in two ways: batch ETL and real-time ETL.
Batch ETL

In this method, data is extracted from a data lake and transformed to meet business needs, resulting in a collection of structured or semi-structured data. This process is performed on a large volume of data at a scheduled time.

Real-Time ETL

Real-time ETL is used to facilitate prompt decision-making by delivering insights faster, reducing storage costs, and more. This method also makes it possible to track trends in real time.

Advantages of ETL

Data validation: ETL validates data before loading it into the target system, ensuring accuracy and relevance.
Data quality improvement: ETL improves data quality by cleansing, standardizing, and enriching it.
Data integration: ETL merges data from multiple sources for easy analysis.

Differences: Data Ingestion vs. ETL

Quality of data: While ETL optimizes data for analytics, ingestion is carried out to collect raw data. In other words, when performing ETL, you have to consider how to enhance the data quality for further processing. With ingestion, your target is to collect the data even if it is untidy. Data ingestion does not involve complex practices for organizing information; you only need to add some metadata tags and unique identifiers so the data can be located when needed. ETL, in contrast, structures the information for ease of use with data analytics tools.

Coding needs: Collecting data from various sources into a data lake requires minimal custom coding, as the focus is on bringing in the data rather than ensuring its quality. In contrast, ETL necessitates significant custom coding to extract relevant data, transform it, and store it in a warehouse. This can be time-consuming for companies with multiple data pipelines and may require updating the code whenever a workflow changes. Ingestion, on the other hand, is less affected by internal team changes.

Domain knowledge: Data ingestion requires less expertise than ETL, as it mainly involves pulling data from various sources using APIs or web scraping. ETL involves not only extracting data but also transforming it for further analytics, which requires knowledge of the specific domain and can greatly affect the quality of the insights generated from the data.

Real-time: Data ingestion can involve real-time data storage, but real-time ETL provides additional value through the ability to perform streaming analytics. To achieve this, ETL processes must be optimized for speed and resiliency and able to recover quickly from any interruption. This level of robustness is not as critical in the data ingestion process.

Challenges in the data source: While data ingestion practices may not change quickly, finding reliable sources is important, particularly when working with public data. Using unreliable sources can lead to inaccurate insights and negatively impact business decisions. ETL presents a different set of challenges, with a greater emphasis on pre-processing the data than on the source of the data.

Conclusion

Data ingestion and ETL play distinct roles in the data pipeline. Data ingestion is the act of bringing data into a system, while ETL transforms and loads it into a target location, such as a data warehouse. Both are crucial for guaranteeing the accuracy and completeness of data before analysis. Understanding the distinction, and how the two complement each other, helps optimize the data pipeline. With knowledge of data ingestion and ETL, one can make educated choices about handling and processing data.
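To make the ingestion-versus-ETL distinction above concrete, here is a small, purely illustrative Java sketch under invented assumptions (the record format, metadata tags, and "lake"/"warehouse" targets are made up for the example): the ingestion path lands the raw line with minimal metadata, while the ETL path parses, validates, and reshapes it before loading.

Java
import java.time.Instant;
import java.util.List;
import java.util.Map;

public class IngestVsEtlDemo {

    // Ingestion: keep the record as-is, attach minimal metadata, and land it in the lake.
    static Map<String, Object> ingest(String rawLine, String source) {
        return Map.of(
                "payload", rawLine,
                "source", source,
                "ingestedAt", Instant.now().toString());
    }

    // ETL: extract fields, validate them, and transform the row into the warehouse schema.
    static Map<String, Object> etl(String rawLine) {
        String[] fields = rawLine.split(",");               // extract
        if (fields.length != 3) {
            throw new IllegalArgumentException("bad row");  // validate
        }
        return Map.of(                                      // transform
                "customerId", fields[0].trim(),
                "country", fields[1].trim().toUpperCase(),
                "amountCents", Math.round(Double.parseDouble(fields[2]) * 100));
    }

    public static void main(String[] args) {
        List<String> rows = List.of("42, nl, 19.99", "77, us, 250.00");
        rows.forEach(row -> System.out.println("lake <- " + ingest(row, "crm-export")));
        rows.forEach(row -> System.out.println("dwh  <- " + etl(row)));
    }
}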

By Hiren Dhaduk
Fraud Detection With Apache Kafka, KSQL, and Apache Flink
Fraud Detection With Apache Kafka, KSQL, and Apache Flink

Fraud detection becomes increasingly challenging in a digital world across all industries. Real-time data processing with Apache Kafka became the de facto standard to correlate and prevent fraud continuously before it happens. This article explores case studies for fraud prevention from companies such as Paypal, Capital One, ING Bank, Grab, and Kakao Games that leverage stream processing technologies like Kafka Streams, KSQL, and Apache Flink.

Fraud Detection and the Need for Real-Time Data

Fraud detection and prevention is the adequate response to fraudulent activities in companies (like fraud, embezzlement, and loss of assets because of employee actions). An anti-fraud management system (AFMS) comprises fraud auditing, prevention, and detection tasks. Larger companies use it as a company-wide system to prevent, detect, and adequately respond to fraudulent activities. These distinct elements are interconnected or exist independently. An integrated solution is usually more effective if the architecture considers the interdependencies during planning.

Real-time data beats slow data across business domains and industries in almost all use cases. But there are few better examples than fraud prevention and fraud detection. It is not helpful to detect fraud in your data warehouse or data lake after hours or even minutes, as the money is already lost. This "too late architecture" increases risk, revenue loss, and lousy customer experience.

It is no surprise that most modern payment platforms and anti-fraud management systems implement real-time capabilities with streaming analytics technologies for these transactional and analytical workloads. The Kappa architecture powered by Apache Kafka became the de facto standard, replacing the Lambda architecture.

A Stream Processing Example in Payments

Stream processing is the foundation for implementing fraud detection and prevention while the data is in motion (and relevant) instead of just storing data at rest for analytics (too late). No matter what modern stream processing technology you choose (e.g., Kafka Streams, KSQL, Apache Flink), it enables continuous real-time processing and correlation of different data sets. Often, the combination of real-time and historical data helps find the right insights and correlations to detect fraud with a high probability.

Let's look at a few examples of stateless and stateful stream processing for real-time data correlation with the Kafka-native tools Kafka Streams and ksqlDB. Similarly, Apache Flink or other stream processing engines can be combined with the Kafka data stream. Each option has pros and cons. While Flink might be the better fit for some projects, it is another engine and infrastructure you need to combine with Kafka. Ensure you understand your end-to-end SLAs and requirements regarding latency, exactly-once semantics, potential data loss, etc. Then use the right combination of tools for the job.
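As a rough, hypothetical sketch of the stateless pattern described in the next section, the following Java Kafka Streams topology inspects each payment event one by one and forwards suspicious ones to a separate topic. The topic names ("payments", "possible-fraud"), the comma-separated payload format, and the threshold rule are invented for illustration and are not taken from the case studies in this article.

Java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentFraudFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-fraud-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Key = customer ID, value = raw payment payload (assumptions for this sketch).
        KStream<String, String> payments = builder.stream("payments");
        payments
            .filter((customerId, payload) -> looksSuspicious(payload)) // stateless, event by event
            .to("possible-fraud");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Toy rule: flag payments above an arbitrary amount; malformed events are flagged too.
    private static boolean looksSuspicious(String payload) {
        try {
            double amount = Double.parseDouble(payload.split(",")[1]);
            return amount > 10_000;
        } catch (Exception e) {
            return true;
        }
    }
}

In a real deployment, the filter predicate would typically be replaced by rules or a model lookup, but the stateless shape of the topology stays the same.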
Stateless Transaction Monitoring With Kafka Streams

A Kafka Streams application, written in Java, processes each payment event in a stateless fashion, one by one.

Stateful Anomaly Detection With Kafka and KSQL

A ksqlDB application, written in SQL, continuously analyzes the transactions of the last hour per customer ID to identify malicious behavior.

Kafka and Machine Learning With TensorFlow for Real-Time Scoring for Fraud Detection

A KSQL UDF (user-defined function) embeds an analytic model trained with TensorFlow for real-time fraud prevention.

Case Studies Across Industries

Several case studies exist for fraud detection with Kafka. It is usually combined with stream processing technologies, such as Kafka Streams, KSQL, and Apache Flink. Here are a few real-world deployments across industries, including financial services, gaming, and mobility services:

Paypal processes billions of messages with Kafka for fraud detection.
Capital One looks at events as running its entire business (powered by Confluent), where stream processing prevents $150 of fraud per customer on average per year by catching personally identifiable information (PII) violations in in-flight transactions.
ING Bank started many years ago by implementing real-time fraud detection with Kafka, Flink, and embedded analytic models.
Grab is a mobility service in Asia that leverages fully managed Confluent Cloud, Kafka Streams, and ML for stateful stream processing in its internal GrabDefence SaaS service.
Kakao Games, a South Korean gaming company, uses data streaming to detect and handle anomalies with 300+ patterns through KSQL.

Let's explore the latter case study in more detail.

Deep Dive Into Fraud Prevention With Kafka and KSQL in Mobile Gaming

Kakao Games is a South Korea-based global video game publisher specializing in games across various genres for PC, mobile, and VR platforms. The company presented at Current 2022: The Next Generation of Kafka Summit in Austin, Texas. Here is a detailed summary of their compelling use case and architecture for fraud detection with Kafka and KSQL.

Use Case: Detect Malicious Behavior by Gamers in Real Time

The challenge is evident when you understand the company's history: Kakao Games publishes many games purchased from third-party game studios. Each game has its own log with its own structure and message format. Reliable real-time data integration at scale is required as a foundation for analytical business processes like fraud detection. The goal is to analyze game logs and telemetry data in real time. This capability is critical for preventing and remediating threats or suspicious actions from users.

Architecture: Change Data Capture and Streaming Analytics for Fraud Prevention

The Confluent-powered event streaming platform supports game log standardization. ksqlDB analyzes the incoming telemetry data for in-game abuse and anomaly detection. (Source: Kakao Games, Current 2022 in Austin, Texas)

Implementation: SQL Recipes for Data Streaming With KSQL

Kakao Games detects anomalies and prevents fraud with 300+ patterns through KSQL. Use cases include bonus abuse, multiple account usage, account takeover, chargeback fraud, and affiliate fraud. Kakao Games shared a few of these KSQL statements in its conference talk. (Source: Kakao Games, Current 2022 in Austin, Texas)

Results: Reduced Risk and Improved Customer Experience

Kakao Games can do real-time data tracking and analysis at scale.
The business benefits are faster time to market, increased active users, and more revenue thanks to a better gaming experience.

Conclusion

Ingesting data with Kafka into a data warehouse or a data lake is only part of a good enterprise architecture. Tools like Apache Spark, Databricks, Snowflake, or Google BigQuery enable finding insights within historical data. But real-time fraud prevention is only possible if you act while the data is in motion; otherwise, the fraud has already happened before you detect it.

Stream processing provides a scalable and reliable infrastructure for real-time fraud prevention. The choice of the right technology is essential, but all major frameworks, like Kafka Streams, KSQL, or Apache Flink, are very good. Hence, the case studies of Paypal, Capital One, ING Bank, Grab, and Kakao Games look different, yet they share the same foundation: data streaming powered by the de facto standard Apache Kafka to reduce risk, increase revenue, and improve customer experience.

If you want to learn more about streaming analytics with the Kafka ecosystem, check out how Apache Kafka helps in cybersecurity to create situational awareness and threat intelligence, and how a concrete fraud detection example with Apache Kafka works in the crypto and NFT space.

How do you leverage data streaming for fraud prevention and detection? What does your architecture look like? What technologies do you combine? Let's connect on LinkedIn and discuss it!
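The stateful, per-customer analysis described earlier (the ksqlDB example and Kakao Games' KSQL patterns) can also be expressed with the Kafka Streams DSL. The sketch below is a hypothetical illustration rather than code from the article's case studies: it counts payment events per customer over a one-hour window and emits customers that exceed an arbitrary threshold. It assumes Kafka Streams 3.x and invented topic names.

Java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class HourlyActivityMonitor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-activity-monitor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Key = customer ID (assumption for this sketch).
        KStream<String, String> payments = builder.stream("payments");

        // Stateful step: count events per customer over a tumbling one-hour window.
        KTable<Windowed<String>, Long> hourlyCounts = payments
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                .count();

        hourlyCounts.toStream()
                .filter((windowedCustomer, count) -> count > 100)   // arbitrary anomaly threshold
                .map((windowedCustomer, count) -> KeyValue.pair(windowedCustomer.key(), count))
                .to("suspicious-activity", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}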

By Kai Wähner CORE
Apache Kafka Introduction, Installation, and Implementation Using .NET Core 6
Apache Kafka Introduction, Installation, and Implementation Using .NET Core 6

We will go over Apache Kafka basics, installation, and operation, as well as a step-by-step implementation using a .NET Core 6 web application.

Prerequisites

Visual Studio 2022
.NET Core 6 SDK
SQL Server
Java JDK 11
Apache Kafka

Agenda

Overview of event streaming
Introduction to Apache Kafka
Main concepts and foundation of Kafka
Different Kafka APIs
Use cases of Apache Kafka
Installation of Kafka on Windows 10
Step-by-step implementation

Overview of Event Streaming

Events are the things that happen within our application as we navigate through it. For example, when we sign up on a website and order something, those are events. An event streaming platform records different types of data, such as transactional, historical, and real-time data. The platform is also used to process events and allow different consumers to process the results immediately and in a timely manner. An event-driven platform lets us monitor our business and real-time data from many kinds of devices, such as IoT devices, and more. After analyzing the data, it helps provide a good customer experience based on different types of events and needs.

Introduction to Apache Kafka

Below are a few bullet points that describe Apache Kafka:

Kafka is a distributed event store and stream-processing platform.
Kafka is open source and is written in Java and Scala.
The Apache Software Foundation designed Kafka primarily to handle real-time data feeds and to provide a high-throughput, low-latency platform.
Kafka is an event streaming platform with many capabilities to publish (write) and subscribe to (read) streams of events from different systems. It also stores and processes events durably for as long as we want; by default, Kafka retains events for seven days, but we can increase that as per our needs and requirements.
Kafka is a distributed system made up of servers and clients that communicate via the TCP protocol. It can be deployed on different virtual machines and containers in on-premises and cloud environments as required.
In the Kafka world, a producer sends messages to a Kafka broker. The messages are stored inside topics, and a consumer subscribes to a topic to consume the messages sent by the producer.
ZooKeeper is used to manage Kafka's metadata: it tracks which brokers are part of the Kafka cluster and the partitions of different topics. It also manages the status of Kafka nodes and maintains a list of Kafka topics and messages.

Main Concepts and Foundation of Kafka

1. Event
An event, or record, is the message that we read from and write to the Kafka server. We work in terms of events in our business world, and each event contains a key, a value, a timestamp, and other optional metadata headers. For example:
Key: "Jaydeep"
Value: "Booked BMW"
Event timestamp: "Dec. 11, 2022, at 12:00 p.m."

2. Producer
The producer is a client application that sends messages to the Kafka node or broker.

3. Consumer
The consumer is an application that receives data from Kafka.

4. Kafka Cluster
The Kafka cluster is the set of computers that share the workload with each other for varying purposes.

5. Broker
The broker is a Kafka server that acts as an agent between the producer and the consumer, which communicate via the broker.

6. Topic
Events are stored inside a "topic," which is similar to a folder in which we store multiple files. Each topic has one or more producers and consumers, which write data to and read data from the topic.
Unlike other messaging systems, which remove messages after they are consumed, Kafka persists events, so events in a topic can be read as often as needed.

7. Partitions
Topics are partitioned, meaning a topic is spread over multiple partitions that we create inside the topic. When the producer sends an event to the topic, it is stored inside a particular partition, and the consumer can then read the event from the corresponding topic partition in sequence.

8. Offset
Kafka assigns one unique ID to each message stored inside a topic partition when the message arrives from the producer.

9. Consumer Groups
In the Kafka world, a consumer group acts as a single logical unit.

10. Replica
In Kafka, to make data fault-tolerant and highly available, we can replicate topics across different regions and brokers. So, if something goes wrong with the data in one topic, we can easily recover it from its replica.

Different Kafka APIs

Kafka has five core APIs that serve different purposes:

Admin API: Manages topics, brokers, and other Kafka objects.
Producer API: Writes/publishes events to different Kafka topics.
Consumer API: Receives the messages from the topics that are subscribed to by the consumer.
Kafka Streams API: Performs different types of operations like windowing, joins, and aggregation; in essence, it is used to transform data.
Kafka Connect API: Works as a connector to Kafka, helping different systems connect with Kafka easily. It offers many types of ready-to-use connectors for Kafka.

Use Cases of Apache Kafka

Messaging
User activity tracking
Log aggregation
Stream processing
Real-time data analytics

Installation of Kafka on Windows 10

Step 1: Download and install the Java SDK, version 8 or higher. (Note: I have Java 11, which is why I use that path in all the commands here.)

Step 2: Open and install the EXE.

Step 3: Set the environment variable for Java using the command prompt as admin:

setx -m JAVA_HOME "C:\Program Files\Java\jdk-11.0.16.1"
setx -m PATH "%JAVA_HOME%\bin;%PATH%"

Step 4: After that, download and install Apache Kafka.

Step 5: Extract the downloaded Kafka archive and rename the folder "Kafka."

Step 6: Open D:\Kafka\ and create "zookeeper-data" and "kafka-logs" folders inside it.

Step 7: Next, open the D:\Kafka\config\zookeeper.properties file and add the folder path inside it:

dataDir=D:/Kafka/zookeeper-data

Step 8: After that, open the D:\Kafka\config\server.properties file and change the log path:

log.dirs=D:/Kafka/kafka-logs

Step 9: Save and close both files.

Step 10: Run ZooKeeper:

D:\Kafka> .\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

Step 11: Start Kafka:

D:\Kafka> .\bin\windows\kafka-server-start.bat .\config\server.properties

Step 12: Create a Kafka topic:

D:\Kafka\bin\windows> kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testdata

Step 13: Create a producer and send some messages once the producer and consumer are running:

D:\Kafka\bin\windows> kafka-console-producer.bat --broker-list localhost:9092 --topic testdata

Step 14: Next, create a consumer.
Afterward, you will see the messages sent by the producer:

D:\Kafka\bin\windows> kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic testdata

Step-by-Step Implementation

Let's start with the practical implementation.

Step 1: Create a new .NET Core producer Web API.

Step 2: Configure your application.

Step 3: Provide additional details.

Step 4: Install the following two NuGet packages: Confluent.Kafka and Newtonsoft.Json.

Step 5: Add configuration details inside the appsettings.json file:

JSON
{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*",
  "producerconfiguration": {
    "bootstrapservers": "localhost:9092"
  },
  "TopicName": "testdata"
}

Step 6: Register a few services inside the Program class:

C#
using Confluent.Kafka;

var builder = WebApplication.CreateBuilder(args);

// Add services to the container.
// Bind the "producerconfiguration" section and register it for dependency injection.
var producerConfiguration = new ProducerConfig();
builder.Configuration.Bind("producerconfiguration", producerConfiguration);
builder.Services.AddSingleton<ProducerConfig>(producerConfiguration);

builder.Services.AddControllers();

// Learn more about configuring Swagger/OpenAPI at https://aka.ms/aspnetcore/swashbuckle
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

var app = builder.Build();

// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI();
}

app.UseHttpsRedirection();
app.UseAuthorization();
app.MapControllers();
app.Run();

Step 7: Next, create the CarDetails model class:

C#
using Microsoft.AspNetCore.Authentication;

namespace ProducerApplication.Models
{
    public class CarDetails
    {
        public int CarId { get; set; }
        public string CarName { get; set; }
        public string BookingStatus { get; set; }
    }
}

Step 8: Now, create the CarsController class:

C#
using Confluent.Kafka;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;
using Newtonsoft.Json;
using ProducerApplication.Models;

namespace ProducerApplication.Controllers
{
    [Route("api/[controller]")]
    [ApiController]
    public class CarsController : ControllerBase
    {
        private ProducerConfig _configuration;
        private readonly IConfiguration _config;

        public CarsController(ProducerConfig configuration, IConfiguration config)
        {
            _configuration = configuration;
            _config = config;
        }

        [HttpPost("sendBookingDetails")]
        public async Task<ActionResult> Get([FromBody] CarDetails employee)
        {
            // Serialize the booking details and publish them to the configured Kafka topic.
            string serializedData = JsonConvert.SerializeObject(employee);
            var topic = _config.GetSection("TopicName").Value;

            using (var producer = new ProducerBuilder<Null, string>(_configuration).Build())
            {
                await producer.ProduceAsync(topic, new Message<Null, string> { Value = serializedData });
                producer.Flush(TimeSpan.FromSeconds(10));
                return Ok(true);
            }
        }
    }
}

Step 9: Finally, run the application and send a message.

Step 10: Now, create a "consumer" application. For that, create a new .NET Core console application.

Step 11: Configure your application.

Step 12: Provide additional information.

Step 13: Install the Confluent.Kafka NuGet package.

Step 14: Add the following code, which consumes the messages sent by the producer:

C#
using Confluent.Kafka;

var config = new ConsumerConfig
{
    GroupId = "gid-consumers",
    BootstrapServers = "localhost:9092"
};

using (var consumer = new ConsumerBuilder<Null, string>(config).Build())
{
    consumer.Subscribe("testdata");
    while (true)
    {
        // Poll the topic and print each booking message as it arrives.
        var bookingDetails = consumer.Consume();
        Console.WriteLine(bookingDetails.Message.Value);
    }
}

Step 15: Finally, run the producer and consumer, send a message using the producer app, and you will see the message immediately
inside the consumer console.

Here is the GitHub URL I used in this article.

Conclusion

Here, we discussed an introduction to Apache Kafka, how it works, its benefits, and a step-by-step implementation using .NET Core 6. Happy coding!

By Jaydeep Patil
The 31 Flavors of Data Lineage and Why Vanilla Doesn’t Cut It
The 31 Flavors of Data Lineage and Why Vanilla Doesn’t Cut It

Data lineage, an automated visualization of the relationships for how data flows across tables and other data assets, is a must-have in the data engineering toolbox. Not only is it helpful for data governance and compliance use cases, but it also plays a starring role as one of the five pillars of data observability. Data lineage accelerates a data engineer's ability to understand the root cause of a data anomaly and the potential impact it may have on the business. As a result, data lineage's popularity as a must-have component of modern data tooling has skyrocketed faster than a high schooler with parents traveling out of town for the weekend.

Consequently, almost all data catalogs have introduced data lineage in the last few years. More recently, some big data cloud providers, such as Databricks and Google (as part of Dataplex), have announced data lineage capabilities. It's great to see that so many leaders in the space, like Databricks and Google, realize the value of lineage for use cases across the data stack, from data governance to discovery. But now that there are multiple solutions offering some flavor of data lineage, the question arises, "does it still need to be a required feature within a data quality solution?" The answer is an unequivocal "yes." When it comes to tackling data reliability, vanilla lineage just doesn't cut it. Here's why…

1. Data Lineage Informs Incident Detection and Alerting

Data lineage powers better data quality incident detection and alerting when it's natively integrated within a data observability platform. For example, imagine you have an issue with a table upstream that cascades into multiple other tables across several downstream layers. Do you want your team to get one alert, or do you want to get 15, all for the same incident? The first option accurately depicts the full context along with a natural point to start your root cause analysis. The second option is akin to receiving 15 pages of a book out of order and hoping your on-call data engineer is able to piece together that they are all part of a single story. As a function of data observability, data lineage pieces together this story automatically, identifying which part is the climax and which parts are just falling action.

Not to mention, too many superfluous alerts are the quickest route to alert fatigue, scientifically defined as the point where the data engineer rolls their eyes, shakes their head, and moves on to another task. So when your incident management channel in Slack has more than 25 unread messages, all corresponding to the same incident, are you really getting value from your data observability platform?

One way to help combat alert fatigue and improve incident detection is to set alert parameters to only notify you about anomalies in your most important tables. However, without native data lineage, it's difficult and time-consuming to understand which assets truly are important. One of the keys to operationalizing data observability is to ensure alerts are routed to the right responders, those who best understand the domain and the particular systems in question. Data lineage can help surface and route alerts to the appropriate owners on both the data team and business stakeholder sides of the house.

2. Data Lineage Accelerates Incident Resolution

Data engineers are able to fix broken pipelines and anomalous data faster when data lineage is natively incorporated within the data observability platform.
Without it, you just have a list of incidents and a map of table/field dependencies, neither of which is particularly useful without the other. Without incidents embedded in lineage, those dots aren't connected, and they certainly aren't connected to how data is consumed within your organization. For example, data lineage is essential to the incident triage process. To butcher a proverb, "If a table experiences an anomaly, but no one consumes data from it, do you care?"

Tracing incidents upstream across two different tools is a disjointed process. You don't just want to swim to the rawest upstream table; you want to swim up to the most upstream table where the issue is still present. Of course, once we arrive at our most upstream table with an anomaly, our root cause analysis process has just begun. Data lineage gives you the where but not always the why. Data teams must now determine if it is:

A systems issue: Did an Airflow job not run? Were there issues with permissions in Snowflake?
A code issue: Did someone modify a SQL query or dbt model that mucked everything up?
A data issue: Did a third party send us garbage data filled with NULLs and other nonsense?

Data lineage is valuable, but it is not a silver bullet for incident resolution. It is at its best when it works within a larger ecosystem of incident resolution tools such as query change detection, high correlation insights, and anomalous row detection.

3. A Single Pane of Glass

Sometimes vendors say their solution provides "a single pane of glass" with a bit too much robotic reverence and without enough critical thought toward the value provided. Nice to look at, but not very useful. (Image caption: how I imagine some vendors say, "a single pane of glass.")

In the case of data observability, however, a single pane of glass is integral to how efficient and effective your data team can be in its data reliability workflows. I previously mentioned the disjointed nature of cross-referencing your list of incidents against your map of lineage. Still, it's important to remember data pipelines extend beyond a single environment or solution. It's great to know data moved from point A to point B, but your integration points will paint the full story of what happened to it along the way. Not all data lineage is created equal; the integration points and how those are surfaced are among the biggest differentiators.

For example, are you curious how changes in dbt models may have impacted your data quality? Whether a failed Airflow job created a freshness issue? Whether a table feeds a particular dashboard? Well, if you are leveraging lineage from Dataplex or Databricks to resolve incidents across your environment, you'll likely need to spend precious time piecing together information. Does your team use both Databricks and Snowflake and need to understand how data flows across both platforms? Let's just say I wouldn't hold my breath for that integration anytime soon.

4. The Right Tool for the Right Job

Ultimately, this decision comes down to the advantages of using the right tool for the right job. Sure, your car has a CD player, but it would be pretty inconvenient to sit in your garage every time you'd like to hear some music. Not to mention the sound quality wouldn't be as high, and the integration with your Amazon Music account wouldn't work. The parallel here is the overlap between data observability and data catalog solutions. Yes, both have data lineage features, but they are designed within very different contexts.
For instance, Google developed its lineage features with compliance and governance use cases in mind, and Databricks built its lineage for cataloging and quality across native Databricks environments. So while data lineage may appear similar at first glance (spoiler alert: every platform's graph will have boxes connected by lines), the real magic happens with the double click.

For example, with Databricks, you can start with a high-level overview of the lineage and drill into a workflow. (Note: this would be only internal Databricks workflows, not external orchestrators.) You could then see a failed run time, and another click would take you to the code (not shown). Dataplex data lineage is similar, with a depiction showing the relationships between datasets. The subsequent drill down, which allows you to run an impact analysis, is helpful, but mainly for a "reporting and governance" use case.

A data observability solution should take these high-level lineage diagrams a step further, down to the BI level, which, as previously mentioned, is critical for incident impact analysis. On the drill down, a data observability solution should provide all of the information shown across both tools plus a full history of queries run on the table, their runtimes, and associated jobs from dbt. Key data insights such as reads/writes, schema changes, users, and the latest row count should be surfaced as well. Additionally, tables can be tagged (perhaps to denote their reliability level) and given descriptions (to include information on SLAs and other relevant details).

Taking a step beyond comparing lineage UIs for a moment, it's important to also realize that you need a high-level overview of your data health. A data reliability dashboard, fueled by lineage metadata, can help you optimize your data quality investments by revealing your hot spots, uptime/SLA adherence, total incidents, time-to-fixed by domain, and more.

Conclusion: Get the Sundae

As data has become more crucial to business operations, the data space has exploded with many awesome and diverse tools. There are now 31 flavors instead of your typical spread of vanilla, chocolate, and strawberry. This can be as challenging for data engineers as it is exciting. Our best advice is to not get overwhelmed and to let the use case drive the technology rather than vice versa. Ultimately, you will end up with an amazing, if sometimes messy, ice cream sundae with all of your favorite flavors perfectly balanced.

By Lior Gavish
Apache Kafka vs. Memphis.dev
Apache Kafka vs. Memphis.dev

What Is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform. Based on the abstraction of a distributed commit log, Kafka can handle a great number of events and provides pub/sub functionality.

What Is Memphis.dev?

Memphis is a next-generation message broker: a simple, robust, and durable cloud-native message broker wrapped with an entire ecosystem that enables fast and reliable development of next-generation event-driven use cases. Memphis.dev enables building next-generation applications that require large volumes of streamed and enriched data, modern protocols, zero ops, rapid development, extreme cost reduction, and significantly less dev time for data-oriented developers and data engineers.

General

Parameter | Memphis.dev | Apache Kafka
License | Apache 2.0 | Apache 2.0
Components | Memphis + MongoDB (MDB is being removed) | Kafka + ZooKeeper (ZK is being removed)
Message consumption model | Pull | Pull
Storage architecture | Log | Log

License

Both technologies are available under fully open-source licenses. Memphis also has a commercial distribution with added security, tiered storage, and more.

Components

Kafka uses Apache ZooKeeper™ for consensus and metadata storage. Memphis uses MongoDB for GUI state management only; it will be removed soon, leaving Memphis without any external dependency. Memphis achieves consensus by using Raft.

Message Consumption Model

Both Kafka and Memphis use a pull-based architecture where consumers pull messages from the server, and long-polling is used to ensure new messages are made available instantaneously. Pull-based architectures are often preferable for high-throughput workloads, as they allow consumers to manage their own flow control, fetching only what they need.

Storage Architecture

Kafka uses a distributed commit log as its storage layer. Writes are appended to the end of the log. Reads are sequential, starting from an offset, and data is zero-copied from the disk buffer to the network buffer. This works well for event streaming use cases. Memphis also uses a distributed commit log, called streams (provided by NATS JetStream), as its storage layer, which can be written entirely to the broker's (server's) memory or disk. Memphis also uses offsets but abstracts them completely, so the heavy lifting of keeping a record of the used offsets resides with Memphis and not with the client. Memphis also offers storage tiering for offloading messages to S3-compatible storage for infinite retention and more cost-effective storage. Reads are sequential.

Ecosystem and User Experience

Parameter | Memphis.dev | Apache Kafka
Deployment | Straightforward | Requires deep understanding and design
Enterprise support | Yes | Third parties like Confluent, AWS MSK
Managed cloud offerings | Yes | Third parties like Confluent, AWS MSK
Self-healing | Yes | No
Notifications | Yes | No
Message tracing (aka stream lineage) | Yes | No
Storage balancing | Automatic, based on policy | Manual

Deployment

Kafka is a cluster-based technology with a medium-weight architecture requiring two distributed components: Kafka's own servers (brokers) plus ZooKeeper™ servers. ZooKeeper adds an additional level of complexity, but the community is in the process of removing the ZooKeeper component from Kafka. A "vanilla" Kafka deployment requires a manual binary installation and text-based configuration, as well as configuring OS daemons and internal parameters. Memphis has a lightweight yet robust cloud-native architecture and has been packaged as a container from day one.
It can be deployed using any Docker engine or Docker Swarm, and, for production environments, using Helm for Kubernetes (soon with an operator). Memphis' initial config is already sufficient for production, and optimizations can take place on the fly without downtime. That approach keeps Memphis completely isolated and independent of the infrastructure it is deployed on.

Enterprise Support and Managed Cloud Offering

Enterprise-grade support and managed cloud offerings for Kafka are available from several prominent vendors, including Confluent, AWS (MSK), Cloudera, and more. Memphis provides enterprise support and a managed cloud offering that include features like enhanced security, stream research capabilities, an ML-based resource scheduler for better cost savings, and more.

Self-Healing

Kafka is a robust distributed system, but it requires constant tune-ups, client-made wrappers, management, and tight monitoring. The user or operator is responsible for ensuring it is alive and working as required. This approach has pros and cons: the user can tune almost every parameter, which is often revealed to be a significant burden. One of Memphis' core features is to remove management friction and autonomously ensure the broker is alive and performing well, using periodic self-checks and proactive rebalancing tasks, as well as fencing users off from misusing the system. In parallel, every aspect of the system can be configured on the fly without downtime.

Notifications

Memphis has a built-in notification center that can push real-time alerts based on defined triggers like client disconnections, resource depletion, and schema violations. Apache Kafka does not offer an embedded solution for notifications; this can be achieved via commercial offerings.

Message Tracing (aka Stream Lineage)

Stream lineage is the ability to understand the full path of a message from the very first producer through the final consumer, including the trail and evolution of a message between topics. This ability is extremely handy in a troubleshooting process. Apache Kafka does not provide a native ability for stream lineage. Still, it can be achieved using the OpenTelemetry or OpenLineage frameworks, by integrating third-party applications such as Datadog or Epsagon, or by using Confluent's cloud offering. Memphis provides stream lineage per message, with out-of-the-box visualization for each stamped message, using a header generated by the Memphis SDK.

Availability and Messaging

Parameter | Memphis.dev | Apache Kafka
Mirroring (replication) | Yes | Yes
Multi-tenancy | Yes | No
Ordering guarantees | Consumer group level | Partition level
Storage tiering | Yes | No; in progress (KIP-405)
Permanent storage | Yes | Yes
Delivery guarantees | At least once, exactly once | At least once, exactly once
Idempotency | Yes | Yes
Geo-replication (multi-region) | Yes | Yes

Mirroring (Replication)

Kafka replication means having multiple copies of the data spread across multiple servers/brokers. This helps maintain high availability if one of the brokers goes down and is unavailable to serve requests. Memphis station replication works similarly. During station (=topic) creation, the user can choose the number of replicas, derived from the number of available brokers. Messages will be replicated in a RAID-1 manner across the chosen number of brokers.

Multi-Tenancy

Multi-tenancy refers to a mode of operation of software where multiple independent instances of one or multiple applications operate in a shared environment.
The instances (tenants) are logically isolated and often physically integrated. The best-known examples are SaaS-type applications. Apache Kafka does not natively support multi-tenancy; however, it can be achieved via complex client logic, separate topics, and ACLs. As Memphis pushes to enable the next generation of applications, and especially SaaS-type architectures, Memphis supports multi-tenancy across all layers, from stations (=topics) to security, consumers, and producers, all the way to node selection for complete hardware isolation if needed. It is enabled using namespaces and can be managed in a unified console.

Storage Tiering

Memphis offers a multi-tier storage strategy in its open-source version. Memphis will write messages that have reached the end of their first retention policy to a second retention tier on object storage like S3, for longer (potentially infinite) retention and post-streaming analysis. This feature can significantly help with cost reduction and stream auditing.

Permanent Storage

Both Kafka and Memphis store data durably and reliably, much like a normal database. Data retention is user-configurable per Memphis station or Kafka topic.

Idempotency

Both Kafka and Memphis support idempotent producers by default. On the consumer side, in Kafka it is the client's responsibility to build a retry mechanism that will retransmit a batch of messages exactly once, while in Memphis it is provided natively within the SDK via a parameter called maxMsgDeliveries.

Geo-Replication (Multi-Region)

Common scenarios for geo-replication include:

Disaster recovery
Feeding edge clusters into a central, aggregate cluster
Physical isolation of clusters (such as production vs. testing)
Cloud migration or hybrid cloud deployments
Legal and compliance requirements

Kafka users can set up such inter-cluster data flows with Kafka's MirrorMaker (version 2), a tool that replicates data between different Kafka environments in a streaming manner. Memphis cloud users can create additional Memphis clusters and form a supercluster that asynchronously replicates streamed data between the clusters, along with security settings, consumer groups, unified management, and more.

Features

Parameter | Memphis.dev | Apache Kafka
GUI | Native | Third party
Dead-letter queue | Yes | No
Schema management | Yes | No
Message routing | Yes | Yes, using Kafka Connect and Kafka Streams
Log compaction | Not yet | Yes
Message replay, time travel | Yes | Yes
Stream enrichment | SQL and serverless functions | SQL-based using Kafka Streams
Pull retry mechanism | Yes | Client responsibility

GUI

Multiple open-source GUIs have been developed for Kafka over the years, for example, Kafka-UI. Usually, they cannot sustain heavy traffic and visualization and require separate compute and maintenance. There are also commercial offerings around Kafka that, among other things, provide a robust GUI, such as Confluent, Conduktor, and more. Memphis provides a native, state-of-the-art GUI hosted inside the broker, built to act as a management layer for all Memphis aspects, including cluster config, resources, data observability, notifications, processing, and more.

Dead-Letter Queue

A dead-letter queue is both a concept and a solution that is useful for debugging clients: it lets you isolate and "recycle" unconsumed messages, instead of dropping them, to determine why their processing did not succeed. The Kafka architecture does not support a DLQ within the broker; it is the client's or consumer's responsibility to implement such behavior, for better or worse.
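As an illustration of what that client-side responsibility can look like with the plain Kafka consumer and producer APIs, here is a minimal, hypothetical sketch. The "orders" and "orders-dlq" topic names and the toy processing rule are invented for the example; a real implementation would also handle offsets, retries, and serialization more carefully.

Java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DlqConsumer {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    try {
                        process(record.value()); // hypothetical business logic
                    } catch (Exception e) {
                        // Route the poison message to a DLQ topic instead of dropping it.
                        producer.send(new ProducerRecord<>("orders-dlq", record.key(), record.value()));
                    }
                }
            }
        }
    }

    // Toy validation rule standing in for real processing.
    private static void process(String value) {
        if (!value.startsWith("{")) {
            throw new IllegalArgumentException("not JSON: " + value);
        }
    }
}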
One of Memphis' core building blocks is avoiding unexpected data loss, enabling rapid development, and shortening troubleshooting cycles. Therefore, Memphis provides a native dead-letter solution that acts as the station's recycle bin for various failures, such as unacknowledged messages, schema violations, and custom exceptions.

Schema Management

The most basic building block for controlling and ensuring the quality of data that flows through your organization between different owners is defining well-written schemas and data models. Confluent offers a "Schema Registry," which is a standalone component that provides a centralized repository for schemas and metadata, allowing services to flexibly interact and exchange data with each other without the challenge of managing and sharing schemas between them. However, it requires dedicated management, maintenance, scaling, and monitoring. As part of its open-source version, Memphis presents Schemaverse, which is also embedded within the broker. Schemaverse provides a robust schema store and schema management layer on top of the Memphis broker without standalone compute or dedicated resources. With a unique, modern UI and a programmatic approach, technical and non-technical users can create and define different schemas, attach a schema to multiple stations, and choose whether the schema should be enforced. Furthermore, in contrast to Schema Registry, the client does not need to implement serialization functions, and every schema update takes place during the producers' runtime.

Message Routing

Kafka provides routing capabilities through Kafka Connect and Kafka Streams, including content-based routing, message transformation, and message enrichment. Memphis message routing is similar to the RabbitMQ implementation, using routing keys, wildcards, content-based routing, and more. As in RabbitMQ, it is embedded within the broker and does not require external libraries or tools.

Log Compaction

Compaction was created to support a long-term, potentially infinite record store based on specific keys. Kafka supports native topic compaction, which runs on all brokers. It runs automatically for compacted topics, condensing the log down to the latest version of messages sharing the same key. At the moment, Memphis does not support compaction, but it will in the future.

Message Replay

Message replay is the ability to re-consume committed messages. Kafka supports replay by seeking specific offsets, as consumers have control over resetting the offset. Memphis does not support replay yet but will in the near future (2023).

Stream Enrichment

Kafka, with its Kafka Streams library, allows developers to implement elastic and scalable client applications that can leverage essential stream processing features such as tables, joins, and aggregations across several topics, and export results to multiple sinks via Kafka Connect. Memphis provides similar behavior and more. Embedded inside the broker, Memphis users can create serverless-style functions or complete containerized applications that aggregate several stations and streams, decorate and enrich messages from different sources, write complex functions that cannot be achieved via SQL, and manipulate the schema. In addition, Memphis' embedded connector framework will help push the results directly to a defined sink.

Pull Retry Mechanism

If a consumer fails or cannot acknowledge consumed messages, there should be a retry mechanism that pulls the same offset or batch of messages again.
In Kafka, it is the client's responsibility to implement one. Some key factors must be considered when implementing such a mechanism, like blocking vs. non-blocking behavior, offset tracking, idempotency, and more. In Memphis, the retry mechanism is built in and turned on by default within the SDK and broker. During consumer creation, the parameter maxMsgDeliveries determines how many times the station will redeliver a message if an acknowledgment does not arrive within maxAckTimeMs. The broker itself records the offsets handed out and will expose only the unacknowledged ones to the retry request.

Conclusion

Apache Kafka is a robust and mature system with extensive community support and a proven record for high-throughput use cases and users. Still, it requires micromanagement; troubleshooting can take precious time; and complex client implementations, wrappers, and frequent tuning take the user's focus and resources away from the "main event." Most importantly, it does not scale well as the organization grows and more use cases join in. Wix's Greyhound library is excellent proof of the work needed on top of Kafka. We call Memphis "a next-generation message broker" because it leans toward the user and adapts to the user's scale and requirements, not the other way around. Most of the wrappers, tuning, management overhead, and client-side implementations needed in Kafka are abstracted away from users in Memphis, which provides an excellent solution both for smaller workloads and for more demanding ones, under the same system and with a full ecosystem to support it. Memphis still has mileage to cover, but the immediate benefits already exist and will continue to evolve.
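To make the "client's responsibility" point in the Pull Retry Mechanism section above more tangible, here is a minimal, hypothetical Java sketch of one common approach with the plain Kafka consumer API: on a processing failure, seek back to the failed record's offset so the same messages are polled again, and give up after a fixed number of attempts. The topic name, processing logic, and retry cap are invented for the example, and the bookkeeping is deliberately simplified to a single partition.

Java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RetryingConsumer {
    private static final int MAX_DELIVERIES = 3; // rough analog of Memphis' maxMsgDeliveries

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "retrying-consumer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<Long, Integer> attempts = new HashMap<>(); // offset -> delivery count (single-partition toy example)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                var records = consumer.poll(Duration.ofMillis(500));
                boolean batchOk = true;
                for (ConsumerRecord<String, String> record : records) {
                    int delivery = attempts.merge(record.offset(), 1, Integer::sum);
                    try {
                        process(record.value()); // hypothetical business logic
                    } catch (Exception e) {
                        if (delivery < MAX_DELIVERIES) {
                            // Blocking retry: rewind to the failed offset so it is polled again.
                            consumer.seek(new TopicPartition(record.topic(), record.partition()), record.offset());
                            batchOk = false;
                            break;
                        }
                        // Give up after the cap; a real client might route the message to a DLQ here.
                    }
                }
                if (batchOk) {
                    consumer.commitSync(); // commit only after the whole polled batch was handled
                }
            }
        }
    }

    private static void process(String value) { /* ... */ }
}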

By Yaniv Ben Hemo

Top Big Data Experts


Miguel Garcia

VP of Engineering,
Nextail Labs

Miguel has a great background in leading teams and building high-performance solutions for the retail sector. He is an advocate of platform design as a service and data as a product.

Gautam Goswami

Founder,
DataView

Enthusiastic about learning and sharing knowledge on Big Data, Data Science & related headways including data streaming platforms through knowledge sharing platform Dataview.in. Presently serving as CTO at Irisidea Technologies, Bangalore, India. https://www.irisidea.com/gautam-goswami/

Alexander Eleseev

Full Stack Developer,
First Line Software


Ben Herzberg

Chief Scientist,
Satori

Ben is an experienced hacker, developer, and author. Ben is experienced in endpoint security, behavioral analytics, application security, and data security. Ben filled roles such as the CTO of Cynet, and Director of threat research at Imperva. Ben is the Chief Scientist for Satori, streamlining data access and security with DataSecOps.

The Latest Big Data Topics

Noteworthy Storage Options for React Native Apps!
Peek through the offerings of the key React Native Storage Options and understand which one is most suitable for your specific use case.
March 23, 2023
by Parija Rangnekar
· 573 Views · 1 Like
TopicRecordNameStrategy in Kafka
Here, learn about TopicRecordNameStrategy and its use in Kafka: create one topic and publish different event types to the same topic when they belong to the same entity.
March 23, 2023
by Mahesh Chaluvadi
· 796 Views · 2 Likes
Asynchronous Messaging Service
Asynchronous messaging with Anypoint MQ, a cloud-based message queuing service that enables the smooth and secure movement of data.
March 23, 2023
by Sitanshu Gupta
· 1,700 Views · 1 Like
Ten Easy Steps Toward a Certified Data Scientist Career
This data science certification can give you a leg-up against your competitors and elevate your career trajectory.
March 22, 2023
by Pradip Mohapatra
· 1,214 Views · 2 Likes
How Elasticsearch Works
Discover what Elasticsearch is, how it works, and how to configure, install, and run it. Also, understand its benefits and major use cases.
March 21, 2023
by Ruchita Varma
· 2,288 Views · 1 Like
Create a CLI Chatbot With the ChatGPT API and Node.js
ChatGPT has taken the world by storm and is now available by API. In this article, we build a simple CLI to access ChatGPT with Node.js.
March 21, 2023
by Phil Nash
· 1,907 Views · 2 Likes
Data Science vs. Software Engineering: Understanding the Fundamental Differences
Discover the key differences between Data Science and Software Engineering. Learn about the unique skills and expertise required for each field. Read now!
March 21, 2023
by Muhammad Rizwan
· 1,367 Views · 2 Likes
Old School or Still Cool? Top Reasons To Choose ETL Over ELT
In this article, readers will learn about use cases where ETL (extract, transform, load) is a better choice in comparison to ELT (extract, load, transform).
March 20, 2023
by Hiren Dhaduk
· 1,754 Views · 1 Like
Intro to Graph and Native Graph Databases
The graph database model is more flexible, scalable, and agile than RDBMS, and it’s the optimal data model for applications that harness artificial intelligence and machine learning.
March 20, 2023
by Julia Astashkina
· 1,754 Views · 1 Like
7 ChatGPT Alternatives for the Best AI Chat Experience
Here are the seven best ChatGPT alternatives for the best AI chat experience.
March 20, 2023
by Chetan Suthar
· 745 Views · 1 Like
How Data Scientists Can Follow Quality Assurance Best Practices
Data scientists must follow quality assurance best practices in order to determine accurate findings and influence informed decisions.
March 19, 2023
by Devin Partida
· 2,627 Views · 1 Like
What To Know Before Implementing IIoT
Industrial Internet of Things (IIoT) technology offers many benefits for manufacturers to improve operations. Here’s what to consider before implementation.
March 17, 2023
by Zac Amos
· 3,255 Views · 1 Like
Apache Kafka Is NOT Real Real-Time Data Streaming!
Learn how Apache Kafka enables low latency real-time use cases in milliseconds, but not in microseconds; learn from stock exchange use cases at NASDAQ.
March 17, 2023
by Kai Wähner CORE
· 3,130 Views · 1 Like
Use After Free: An IoT Security Issue Modern Workplaces Encounter Unwittingly
Use After Free is one of the two major memory allocation-related threats affecting C code. It is preventable with the right solutions and security strategies.
March 16, 2023
by Joydeep Bhattacharya CORE
· 2,198 Views · 1 Like
Real-Time Analytics for IoT
If you're searching for solutions for IoT data in the age of customer-centric data, real-time analytics using the Apache Pinot™ database is a great solution.
March 16, 2023
by David G. Simmons CORE
· 2,890 Views · 4 Likes
Transforming the Future With Deep Tech and Big Data
Unlock the power of deep tech and big data to transform your business. Gain a competitive edge and drive innovation across all sectors. Discover more now.
March 16, 2023
by Muhammad Rizwan
· 1,715 Views · 2 Likes
High-Performance Analytics for the Data Lakehouse
New platform enables lakehouse analytics and reduces the cost of infrastructure by conducting analytics without ingesting data into the central warehouse.
March 15, 2023
by Tom Smith CORE
· 1,901 Views · 1 Like
What Is the Difference Between VOD and OTT Streaming?
With the growing popularity of streaming services, this article discusses the distinction between OTT and VOD and their differences.
March 15, 2023
by Anna Smith
· 1,881 Views · 1 Like
Using AI To Optimize IoT at the Edge
Artificial intelligence has the potential to revolutionize the combined application of IoT and edge computing. Here are some thought-provoking possibilities.
March 15, 2023
by Devin Partida
· 2,371 Views · 1 Like
How HPC and AI in Sports Is Transforming the Industry
AI in sports is revolutionizing the industry! Learn how AI-powered systems are analyzing data to help teams improve performance and so much more.
March 15, 2023
by Kevin Vu
· 3,090 Views · 1 Like