Whenever tech companies change their business focus, they face significant technology challenges linked to agility, compliance, maintainability, scalability, and other software quality concerns. These problems are only amplified when a company experiences hypergrowth. The good news is that application modernization can counter these challenges while avoiding a major rewrite of the system (or at least parts of it) by improving the architecture and keeping architectural and design technical debt in check. For these reasons, application modernization is quickly growing in popularity: Konveyor's May 2022 report revealed that around 54% of businesses plan to adopt the process within a year.

Missed deadlines, longer cycle times to add new features, an inability to scale, and higher defect slippage often lead to discussions around the need for application modernization. Legacy technologies, badly designed modules, poor architecture choices, and poor code quality are typically the reasons for these challenges, and re-architecture, rewriting modules, and application modernization are the recommended solutions. But an application modernization initiative should not be launched without serious consideration. In some cases, a focused effort on tech debt reduction for 3-4 months can solve the pressing challenges, or rewriting a few modules over 6-9 months might be all that's necessary. In other cases, however, application modernization is the only approach that will work.

How do you determine whether application modernization is the best route? A solid business case will ensure you spend resources on the right move. But what data points do you need to prepare such a business case? Most frameworks for application modernization have two common steps:

1. Tech Due Diligence: This involves evaluating the application's architecture, technology stack, and business value. The goal is to determine whether the application can be modernized or whether it should be replaced entirely.
2. Modernization Roadmap: This involves creating a detailed plan that outlines the approach, timelines, and resources required for modernization.

Based on my experience creating these artifacts, below are the key details that should be considered for both.

Essentials for Technical Due Diligence

To get a fresh perspective, avoid including reviewers from the original team, and speak to the stakeholders who were involved in earlier decision-making to understand the past business objectives and constraints (both short-term and long-term), the challenges in the current architecture, the current organizational structure, and the development processes. Your tech due diligence must cover the following:

High-Level Architecture

For a 360-degree view of the system, collect information about the following:

Key Architecture Principles: Key architecture principles and decisions are easy to gather as Confluence pages or Architecture Decision Records (ADRs) if your team has a practice of documenting them.
Tools, Technologies, and Frameworks: Along with the cloud platforms, rule engines, and web servers used, get a detailed understanding of the frameworks and libraries used in the system.
Standards and Guidelines: Collate the standards and guidelines followed for technologies like the frontend, APIs, databases, etc.
Non-functional Aspects: Gather information on non-functional aspects like performance, security, and compliance.

Component Design

Most of the time, system components' designs are not documented.
However, a code walkthrough for a use case can help you understand the design.

Design Principles: Find out whether good design principles like SOLID and Domain-Driven Design, and appropriate design patterns, were used.
Data Stores: List all the different data stores used in the product, like RDBMS, key-value stores, and document stores, along with their purpose.
Data Modeling: Each type of data store has different modeling techniques. Review them to check whether the right techniques were used.
Data Privacy: Closely observe the techniques used for data privacy and look for vulnerabilities.

Dev Processes

This step is to understand the processes followed, along with the execution challenges and product quality, both functional and non-functional.

Processes: Clearly understand new feature development, change management, problem management, release management, etc.
Execution Challenges: Teams typically know the problems they face during development, deployment, planning, delivery, decision-making, team balance, and team motivation. You should review these aspects to unearth problems in any of them.
Bug Reports: The bug reports cover trends, analysis, Root Cause Analysis (RCA), and the first-time-right rate.

Team Capability

Technical leads are often unaware of an individual's technical strengths and weaknesses. To counter this challenge, you must find out which developers are highly productive, write good-quality code consistently, and strive to get things right the first time (i.e., build user stories with minimal bugs).

Infrastructure

Gather data on the current cost and the cost trends for the past year, along with other key metrics the team collects, like MTTR, MTTF, uptime, release frequency, rollback frequency, etc.

To summarize, the Tech Due Diligence Report lists the problems and challenges of the current system by considering all aspects: high-level architecture, component design, dev processes followed, team capability, and infrastructure.

Essentials for a Modernization Roadmap

The roadmap helps you de-risk the modernization. Here are the key inputs necessary to create it:

Tech due diligence report.
Future business objectives and constraints.
Planned product roadmap (new features that need to be added to the application).
Non-functional requirements that the system needs to support in the near future and beyond.

Based on these inputs, develop a target architecture that meets most of your requirements. The target architecture needs to meet business goals, which you can divide into two broader categories:

1. Business Sustenance: The ability of the system to:
Add new features quickly at a scale where the development team size is more than 100. As the development team grows, the coordination effort required between different teams also grows manifold, and as a consequence, the cycle time increases.
Detect and fix bugs quickly within committed SLAs.
2. Business Growth: The ability of the system to:
Scale linearly with business growth.
Be compliant with laws and regulations.
Be always available with high reliability.
Be secure.

Apart from these, the system must also support software quality attributes like portability, accessibility, localization, etc. The proposed technology roadmap must cover not only the technical/architectural changes but also the cultural and process changes, while staying aligned with the following:

Development Processes: For planning, execution, code quality, product quality, deployment, documentation, and release, with a feedback loop and continuous improvement baked in.
Tech Debt Management: Processes to be followed to ensure tech debt is kept under control and cleared at regular intervals.
Team Structure: To support development with increased productivity and high morale.

Advantages of a Data-Driven Decision

By gathering and considering all these data points, you ensure the following:

Well-defined, measurable goals for modernization.
Scores on key architectural areas, with explanations.
Gaps and risks in the current architecture, design, and code.
A target architecture, along with the key architecture principles and the reasons for the recommended choices.

You can also produce a Tech Due Diligence report at regular intervals to maintain a 360-degree view of your product's technology across all dimensions. Doing this can extend your product's life and prevent having to scrap the product in the long run.
I explained the concepts and theory behind Data Residency in a previous post. It's time to get our hands dirty and implement it in a simple demo.

The Sample Architecture

In the last section of the previous post, I proposed a sample architecture where location-based routing happens at two different stages:

The API Gateway checks for an existing X-Country header. Depending on its value, it forwards the request to the computed upstream; if no value is found or no value matches, it forwards it to a default upstream.
The application uses Apache ShardingSphere to route again, depending on the data. If the value computed by the API Gateway is correct, the flow stays "in its lane"; if not, it's routed to the correct database, but with a performance penalty as it's outside its lane.

I simplified some aspects:

The theory uses two API Gateway instances. For the demo, I used only one.
Remember that the location isn't set client-side on the first request. It should be returned along with the first response, stored, and reused by the client on subsequent calls. I didn't bother with implementing the client.
I like my demos to be self-contained, so I didn't use any Cloud Provider.

Here's the final component diagram:

The data model is simple: We insert location-specific data on each database:

SQL
INSERT INTO europe.owner VALUES ('dujardin', 'fr', 'Jean Dujardin');
INSERT INTO europe.thingy VALUES (1, 'Croissant', 'dujardin');
INSERT INTO usa.owner VALUES ('wayne', 'us', 'John Wayne');
INSERT INTO usa.thingy VALUES (2, 'Lasso', 'wayne');

Finally, we develop a straightforward RESTful API to fetch thingies:

GET /thingies/
GET /thingies/{id}

Now that we have set the stage, let's check how to implement routing at the two levels.

Routing on Apache ShardingSphere

Apache ShardingSphere offers two approaches: as a library inside the application, ShardingSphere-JDBC, or as a full-fledged deployable component, ShardingSphere-Proxy. You can also combine both. I chose the former because it's the easiest to set up. For a comparison between them, please check this table. The first step is to add the dependency to the POM:

XML
<dependency>
    <groupId>org.apache.shardingsphere</groupId>
    <artifactId>shardingsphere-jdbc-core</artifactId>
    <version>5.3.2</version>
</dependency>

ShardingSphere-JDBC acts as an indirection layer between the application and the data sources. We must configure the framework to use it. For Spring Boot, it looks like the following:

YAML
spring:
  datasource:
    driver-class-name: org.apache.shardingsphere.driver.ShardingSphereDriver #1
    url: jdbc:shardingsphere:absolutepath:/etc/sharding.yml                  #2

1. JDBC-compatible ShardingSphere driver
2. Configuration file

Contrary to what the documentation says, the full prefix is jdbc:shardingsphere:absolutepath. I've opened a PR to fix the documentation.
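Because ShardingSphere-JDBC is exposed as a regular JDBC driver, the indirection is not tied to Spring Boot. As a quick illustration (not part of the original demo), any plain JDBC code pointing at the same configuration file gets the same routing; here, the owner's name is read by column index since the demo only shows the inserted values, not the column names:

Kotlin
import java.sql.DriverManager

fun main() {
    // The ShardingSphere driver reads the data sources and sharding rules from the referenced YAML file
    DriverManager.getConnection("jdbc:shardingsphere:absolutepath:/etc/sharding.yml").use { connection ->
        connection.createStatement().use { statement ->
            // Without a country filter, ShardingSphere queries both shards and merges the results
            val resultSet = statement.executeQuery("SELECT * FROM owner")
            while (resultSet.next()) {
                println(resultSet.getString(3)) // third column holds the owner's name in the demo data
            }
        }
    }
}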
The next step is to configure ShardingSphere itself with the data sources:

YAML
dataSources:                                            #1
  europe:
    dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: org.postgresql.Driver
    jdbcUrl: "jdbc:postgresql://dbeurope:5432/postgres?currentSchema=europe"
    username: postgres
    password: root
  usa:
    dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    driverClassName: org.postgresql.Driver
    jdbcUrl: "jdbc:postgresql://dbusa:5432/postgres?currentSchema=usa"
    username: postgres
    password: root
rules:                                                  #2
  - !SHARDING
    tables:
      owner:                                            #3
        actualDataNodes: europe.owner,usa.owner         #4
        tableStrategy:
          standard:
            shardingColumn: country                     #3
            shardingAlgorithmName: by_country           #5
    shardingAlgorithms:
      by_country:
        type: CLASS_BASED                               #6
        props:
          strategy: STANDARD
          algorithmClassName: ch.frankel.blog.dataresidency.LocationBasedSharding #7

1. Define the two data sources, europe and usa
2. Define rules. Many rules are available; we will only use sharding to split data between the Europe and USA locations
3. Sharding happens on the country column of the owner table
4. Actual shards
5. Algorithm to use. ShardingSphere offers a couple of algorithms out of the box, which generally try to balance data equally between the sources. As we want a particular split, we define our own
6. Set the algorithm type
7. Reference the custom algorithm class

The final step is to provide the algorithm's code:

Kotlin
class LocationBasedSharding : StandardShardingAlgorithm<String> { //1
    override fun doSharding(targetNames: MutableCollection<String>, shardingValue: PreciseShardingValue<String>) =
        when (shardingValue.value) {                              //2
            "fr" -> "europe"
            "us" -> "usa"
            else -> throw IllegalArgumentException("No sharding over ${shardingValue.value} defined")
        }
}

1. Inherit from StandardShardingAlgorithm, where T is the data type of the sharding column. Here, it's country
2. Based on the sharding column's value, return the name of the data source to use

With all of the above, the application will fetch thingies from the relevant data source based on the owner's country.

Routing on Apache APISIX

We should route as early as possible to avoid an application instance in Europe fetching US data. In our case, it translates to routing at the API Gateway stage. I'll use APISIX standalone mode for configuration. Let's define the two upstreams:

YAML
upstreams:
  - id: 1
    nodes:
      "appeurope:8080": 1
  - id: 2
    nodes:
      "appusa:8080": 1

Now, we shall define the routes where the magic happens:

YAML
routes:
  - uri: /thingies*                                     #1
    name: Europe
    upstream_id: 1
    vars: [["http_x-country", "==", "fr"]]              #2
    priority: 3                                         #3
  - uri: /thingies*                                     #4
    name: USA
    upstream_id: 2
    vars: [["http_x-country", "==", "us"]]
    priority: 2                                         #3
  - uri: /thingies*                                     #5
    name: default
    upstream_id: 1
    priority: 1                                         #3

1. Define the route to the Europe-located app
2. APISIX matches the HTTP methods, the URI, and the conditions. Here, the condition is that the X-Country header has the fr value
3. APISIX evaluates matching in priority order, starting with the highest priority. If the request doesn't match, e.g., because the header doesn't have the set value, it evaluates the next route in the priority list
4. Define the route to the USA-located app
5. Define a default route

The first request carries no header; APISIX forwards it to the default route, where ShardingSphere finds the data in the relevant data source. Subsequent requests set the X-Country header because the response to the first request carries the information, and the client has stored it. Remember that it's outside the scope of the demo.
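The client itself stays out of scope, but to make the header round-trip concrete, here is a minimal Kotlin sketch of what such a client could look like. It assumes the location comes back in an X-Country response header; that header name is my assumption, not something the demo implements:

Kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Hypothetical client: store the location returned with the first response and replay it afterwards
val client = HttpClient.newHttpClient()
var country: String? = null // persisted client-side after the first call

fun fetchThingy(id: Int): String {
    val builder = HttpRequest.newBuilder(URI.create("http://localhost:9080/thingies/$id"))
    country?.let { builder.header("X-Country", it) } // reuse the stored location if we already have it
    val response = client.send(builder.build(), HttpResponse.BodyHandlers.ofString())
    // assumption: the response exposes the computed location so the client can remember it
    country = response.headers().firstValue("X-Country").orElse(country)
    return response.body()
}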
In most cases, it's set to the correct location; hence, the request will stay "in its lane." If not, the configured routing will still find the data in the appropriate location, at the cost of the increased latency of fetching data from the other lane.

Observing the Flow in Practice

It's always a good idea to check that the design behaves as expected. We can use OpenTelemetry for this. For more information on how to set up OpenTelemetry in such an architecture, please refer to End-to-end tracing with OpenTelemetry. Note that Apache ShardingSphere supports OpenTelemetry but doesn't provide the binary agent. You need to build it from source. I admit I was too lazy to do it.

Let's start with a headerless request:

Shell
curl localhost:9080/thingies/1

It uses the default route defined in APISIX and returns the correct data, thanks to ShardingSphere. Now, let's set the country to fr, which is correct:

Shell
curl -H 'X-Country: fr' localhost:9080/thingies/1

APISIX correctly forwards the request to the Europe-located app. Finally, imagine a malicious actor changing the header to get their hands on data that is located in the US:

Shell
curl -H 'X-Country: us' localhost:9080/thingies/1

APISIX forwards it to the USA-located app according to the header. However, ShardingSphere still fetches the data from Europe.

Conclusion

In the previous post, I explained the concepts behind Data Residency. In this post, I implemented it within a simple architecture, thanks to Apache APISIX and Apache ShardingSphere. The demo simplifies reality, but it should be an excellent foundation for building your production-grade Data Residency architecture. The complete source code for this post can be found on GitHub.

To go further:

Apache ShardingSphere
Sharding YAML configuration
How to filter route by Nginx builtin variable
Data freshness, sometimes referred to as data timeliness, is the frequency with which data is updated for consumption. It is an important dimension of data quality because recently refreshed data is more accurate and, thus, more valuable. Since it is impractical and expensive to have all data refreshed on a near real-time basis, data engineers ingest and process most analytical data in batches with pipelines designed to update specific data sets at a similar frequency in which they are consumed. Red Ventures director of data engineering, Brandon Beidel, talked to us about this process saying: “We [would] start diving deep into discussions around data quality and how it impacted their day-to-day. I would always frame the conversation in simple business terms and focus on the who, what, when, where, and why. I’d especially ask questions probing the constraints on data freshness, which I’ve found to be particularly important to business stakeholders.” For example, a customer churn dashboard for a B2B SaaS company may only need to be updated once every seven days for a weekly meeting, whereas a marketing dashboard may require daily updates in order for the team to optimize its digital campaigns. Data freshness is important because the value of data decreases exponentially over time. The consequences of ignoring data freshness can be severe. One e-commerce platform lost around $5 million in revenue because their machine learning model that identified out-of-stock items and recommended substitutions was operating on thousands of temporary tables and stale data for six months. How To Measure Data Freshness for Data Quality As previously mentioned, the required level of data freshness is completely contextual to the use case. One way data teams measure data freshness is by the number of complaints they receive from their data consumers over a period of time. While this is a customer-focused approach, it is reactive and has serious disadvantages such as: Corroding data trust; Delaying decision-making and the pace of business operations; Requiring a human in the loop that is familiar with the data (not always the case when powering machine learning models); and If data is external and customer-facing, it creates a risk of churn. A better measurement is the data downtime formula (above), which more comprehensively measures the amount of time the data was inaccurate, missing, or otherwise erroneous. A proactive approach for measuring data freshness is to create service level agreements or SLAs for specific data pipelines. We’ve written a step-by-step guide for creating data SLAs, but in summary: Identify your most important data tables based on the number of reads/writes or their monetary impact on the business. Identify the business owners of those data assets. In other words, who will be most impacted by data freshness or other data quality issues? Ask them how they use their data and how frequently they access it. Create an SLA that specifies how frequently and when the data asset will be refreshed. Implement a means of monitoring when the SLA has been breached and measure how frequently the SLA has been met over a period of time. This can be done through data testing or by using a data observability platform. 
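To make the monitoring step concrete, here is a minimal Python sketch of how SLA adherence could be computed once you record when each refresh actually landed; the deadline and the numbers are made up for illustration:

Python
from datetime import datetime, timedelta

# Hypothetical refresh log: one entry per day with the time the table actually refreshed
refresh_log = {
    "2023-05-01": datetime(2023, 5, 1, 6, 45),
    "2023-05-02": datetime(2023, 5, 2, 8, 30),  # late refresh -> SLA breach
    "2023-05-03": datetime(2023, 5, 3, 7, 10),
}

SLA_DEADLINE = timedelta(hours=8)  # data must be refreshed by 8:00 am each day

def sla_adherence(log: dict[str, datetime]) -> float:
    """Share of days on which the refresh landed before the agreed deadline."""
    on_time = sum(
        1
        for day, refreshed_at in log.items()
        if refreshed_at - datetime.fromisoformat(day) <= SLA_DEADLINE
    )
    return on_time / len(log)

print(f"Freshness SLA met {sla_adherence(refresh_log):.1%} of the time")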
The end result should look something like, “The customer_360 dashboard met its daily data freshness SLA 99.5% of the time over the last 30 days, a 1% increase over the previous 30 days.” Data Freshness Challenges Data teams face numerous challenges in their data freshness quest as a result of the scale, speed, and complexity of data and data pipelines. Here are a few examples: Data sources are constantly changing: Whether internal or external, data engineers are rarely in control of the source emitting the desired data. Changes in schedule or schema can break data pipelines and create data freshness issues. Data consumption patterns change a lot, too: Strategies are adapted, metrics evolve, and departments are reorganized. Without capabilities such as data lineage, it can be difficult to understand what is a key asset (or upstream of an important data product in the context of a data mesh) and what is obsolete clutter. Outside of the smallest companies, identifying relevant data consumers and business stakeholders for each asset is also extremely challenging. This creates a communication chasm between the data and business teams. Data pipelines have a lot of failure points: The more complex moving parts a machine has, the more opportunities for it to break. Data platforms are no exception. The ingestion connector could break, the orchestration job could fail, or the transformation model could be updated incorrectly. Fixing data freshness issues takes a long time: Because there are so many moving parts, troubleshooting data freshness incidents can take data engineers hours–even days. The root cause could be hidden in endless blocks of SQL code, a result of system permission issues, or just a simple data entry error. Data Freshness Best Practices Once you have talked with your key data consumers and determined your data freshness goals or SLAs, there are a few best practices you can leverage to provide the best service or data product possible. The first step is to architect your data pipeline so that the goal is technically feasible (low latency). This is typically a data ingestion decision between batch, micro-batch, or stream processing. However, this could impact any decisions regarding complex transformation models or other data dependencies as well. At this point, you will want to consider layering approaches for detecting, resolving, and preventing data freshness issues. Let’s look at each in turn. Detecting Data Freshness Issues One of the simplest ways to start detecting data freshness issues is to write a data freshness check (test) using SQL rules. For example, let’s assume you are using Snowflake as your data warehouse and have integrated with Notification Services. You could schedule the following query as a Snowflake task which would alert you Monday through Friday at 8:00 am EST when no rows had been added to “your_table” once you have specified the “date_column” with a column that contains the timestamp when the row was added. SQL CREATE TASK your_task_name WAREHOUSE = your_warehouse_name SCHEDULE = 'USING CRON 0 8 * * 1-5 America/New_York' TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS' AS SELECT CASE WHEN COUNT(*) = 0 THEN SYSTEM$SEND_SNS_MESSAGE( 'your_integration_name', 'your_sns_topic_arn', 'No rows added in more than one day in your_table!' ) ELSE 'Rows added within the last day.' 
END AS alert_message FROM your_table WHERE date_column >= DATEADD(DAY, -1, CURRENT_DATE()); The query above looks at rows added, but you could instead use a similar statement to make sure there is at least something matching the current date. Of course, both of these simple checks can be prone to error. SQL CREATE TASK your_task_name WAREHOUSE = your_warehouse_name SCHEDULE = 'USING CRON 0 8 * * 1-5 America/New_York' TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS' AS SELECT CASE WHEN DATEDIFF (DAY, max(last_modified), current_timestamp()) > 0 THEN SYSTEM$SEND_SNS_MESSAGE( 'your_integration_name', 'your_sns_topic_arn', 'No rows added in more than one day in your_table!' ) ELSE 'Max modified date within the last day.' END AS alert_message FROM your_table; You could also use a dbt source freshness block: YAML sources: - name: your_source_name database: your_database schema: your_schema tables: - name: your_table freshness: warn_after: count: 1 period: day loaded_at_field: date_column These are great tools and tactics to use on your most important tables, but what about the tables upstream from your most important tables? Or what if you don't know the exact threshold? What about important tables you are unaware of or for which you failed to anticipate that a freshness check was needed? The truth is data freshness checks don't work well at scale (more than 50 tables or so). One of the benefits of a data observability platform with data lineage is that if there is a data freshness problem in an upstream table that then creates data freshness issues in dozens of tables downstream, you get one cohesive alert rather than disjointed pings telling you your modern data stack is on fire. Resolving Data Freshness Issues The faster you resolve data freshness incidents, the less data downtime and cost you incur. Solve the data freshness issue quickly enough, and it may not even count against your SLA. Unfortunately, this is the most challenging part of dealing with data freshness issues. As previously mentioned, data can break in a near-infinite number of ways. This leaves two options. You can manually hop from tab to tab, checking out the most common system, code, and data issues. However, this takes a lot of time and doesn't guarantee you find the root cause. Our recent survey found it took respondents an average of 15 hours to resolve data incidents once detected! A data observability platform can help teams resolve data freshness issues much quicker with capabilities such as data lineage, query change detection, correlation insights for things like empty queries, and more. Preventing Data Freshness Issues Unfortunately, bad data and data freshness issues are a fact of life for data teams. You can't out-architect bad data. However, you can reduce the number of incidents by identifying and refactoring your problematic data pipelines. Another option, which is a bit of a double-edged data freshness sword, is data contracts. Unexpected schema changes are one of the most frequent causes (along with failed Airflow jobs) of stale data. A data contract architecture can encourage software engineers to be more aware of how service updates can break downstream data systems and facilitate how they collaborate with data engineers. However, data contracts also prevent this bad data from landing in the data warehouse in the first place, which can leave the affected tables without fresh data at all, so the sword cuts both ways. The Bottom Line: Make Your Data Consumers Wildly Happy With Fresh Data When you flip a light switch, you expect there to be light.
When your data consumers visit a dashboard, they expect the data to be fresh–it’s a baseline expectation. Prevent those nasty emails and make your data consumers wildly happy by ensuring when they need the data, it is available and fresh. Good luck!
I have developed a small-scale application that concentrates on a straightforward business scenario of customer reviews for the purpose of illustrating various use cases. This application is implemented using the Spring Boot framework and communicates with a MySQL database via Spring Data JPA, with the code being written in Kotlin. It exposes a simple REST API featuring CRUD operations on reviews. Spoiler alert: The use cases illustrated in this article are intentionally simplistic, intended solely to showcase the integration with ShardingSphere functionalities; the problems discussed here can be solved in various ways and maybe even in better ways, so don't spend too much time thinking "why." So, without further ado, let's dive into the code. Here is the main entity: Kotlin @Entity @Table(name = "reviews") data class Review( var text: String, var author: String, @Column(name = "author_telephone") var authorTelephone: String? = null, @Column(name = "author_email") var authorEmail: String? = null, @Column(name = "invoice_code") var invoiceCode: String? = null, @Column(name = "course_id") var courseId: Int? = null ) : AbstractEntity() And we have the following REST API: Kotlin @RestController @RequestMapping("/api/v1/reviews") class ReviewController(val reviewService: ReviewService) { @GetMapping("/filter", params = ["author"]) fun getReviewsByAuthor(@RequestParam("author") author: String): ResponseEntity<List<Review>> { val reviews = reviewService.findAllByAuthor(author) return ResponseEntity.ok(reviews) } @GetMapping("/filter", params = ["courseId"]) fun getReviewsByCourseId(@RequestParam("courseId") courseId: Int): ResponseEntity<List<Review>> { val reviews = reviewService.findAllByCourseId(courseId) return ResponseEntity.ok(reviews) } @GetMapping("/{id}") fun getReviewById(@PathVariable("id") id: Int): ResponseEntity<Review> { val review = reviewService.findById(id) return ResponseEntity.ok(review) } @PostMapping fun createReview(@RequestBody review: Review): ResponseEntity<Review> { val savedReview = reviewService.save(review) return ResponseEntity.status(HttpStatus.CREATED).body(savedReview) } @DeleteMapping("/{id}") fun deleteReviewById(@PathVariable("id") id: Int): ResponseEntity<Unit> { reviewService.deleteById(id) return ResponseEntity.noContent().build() } } Here is the MySQL container from docker-compose.yml: YAML mysql-master: image: 'bitnami/mysql:latest' ports: - '3306:3306' volumes: - 'mysql_master_data:/bitnami/mysql/data' - ./init.sql:/docker-entrypoint-initdb.d/init.sql environment: - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_DATABASE=reviews-db Note: The init.sql contains DDL for the reviews table. Now if we are to execute some HTTP requests like creating a review (all the mentioned requests are in requests.http).
Plain Text POST http://localhost:8070/api/v1/reviews/ Content-Type: application/json { "text": "This is a great course!", "author": "John Doe", "authorTelephone": "555-1234", "authorEmail": "johndoe@example.com", "invoiceCode": "ABC123", "courseId": 123 } We'll observe the following query (p6spy enabled): Plain Text INFO 16784 --- [nio-8070-exec-1] p6spy : #1681730984533 | took 4ms | statement | connection 2| url jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false insert into reviews (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values ('2023-04-17T14:29:44.450+0300', '2023-04-17T14:29:44.450+0300', 'John Doe', 'johndoe@example.com', '555-1234', 123, 'ABC123', 'This is a great course!', 2); Data Sharding The reviews application discussed above involves a database storing a large number of reviews for courses. As the number of courses and reviews grows, the reviews table in the database can become very large, making it difficult to manage and slowing down performance. To address this issue, we can implement data sharding, which involves breaking up the reviews table into smaller, more manageable pieces called shards. Each shard contains a subset of the data in the reviews table, with each shard being responsible for a specific range of data based on a shard key, such as the course id. Table sharding can help the reviews application manage its growing reviews table more effectively, improving performance and scalability while also making backups and maintenance tasks easier to manage. But, there is always a but – implementing table sharding manually can be a complex and challenging task. It requires a deep understanding of database design and architecture, as well as knowledge of the specific sharding implementation being used. There are a lot of challenges that can arise when implementing table sharding in a Spring Boot application. It is time we met ShardingSphere: Apache ShardingSphere is an ecosystem to transform any database into a distributed database system, and enhance it with sharding, elastic scaling, encryption features and more. Apache ShardingSphere comes in two flavors: ShardingSphere-JDBC is a lightweight Java framework that provides additional services at Java's JDBC layer. ShardingSphere-Proxy is a transparent database proxy, providing a database server that encapsulates database binary protocol to support heterogeneous languages. ShardingSphere provides a range of features, including data sharding, distributed transaction, read/write splitting, high availability, data migration, query federation, data encryption, and shadow database for full-link online load testing scenarios, to help manage large volumes of data and ensure data integrity and security. For now, we'll focus on ShardingSphere-JDBC's data sharding, and we'll use the following dependencies: Kotlin implementation("org.apache.shardingsphere:shardingsphere-jdbc-core:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-core:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-zookeeper:5.3.2") implementation("org.apache.shardingsphere:shardingsphere-cluster-mode-repository-api:5.3.2") Note: The ShardingSphere team had starters for spring boot in 5.1.x versions, but they moved away from starters in favor of consistency in their project and now are recommending using the latest version (non-starter), which can be configured a bit differently, but still fairly simple. 
In my repo, through the commits, you can find examples of spring boot starter configuration too. ShardingSphere-JDBC can be configured mainly in two ways: YAML configuration and Java configuration. I picked the YAML configuration for this article. So at the moment, my datasource configuration from application.yaml looks like this: YAML spring: datasource: username: my_user password: my_password url: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false tomcat: validation-query: "SELECT 1" test-while-idle: true jpa: properties: hibernate: dialect: org.hibernate.dialect.MySQL8Dialect open-in-view: false hibernate: ddl-auto: none To enable ShardingSphere JDBC, we'll have to make it look like this: YAML spring: datasource: driver-class-name: org.apache.shardingsphere.driver.ShardingSphereDriver url: jdbc:shardingsphere:classpath:sharding.yaml jpa: properties: hibernate: dialect: org.hibernate.dialect.MySQL8Dialect open-in-view: false hibernate: ddl-auto: none We specified that the driver being used for the data source will be the ShardingSphereDriver and the url should be picked based on this file sharding.yaml Okay, pretty simple, right? Let's continue; let's create the sharding.yaml YAML dataSources: master: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 mode: type: Standalone repository: type: JDBC rules: - !SHARDING tables: reviews: actualDataNodes: master.reviews_$->{0..1} tableStrategy: standard: shardingColumn: course_id shardingAlgorithmName: inline shardingAlgorithms: inline: type: INLINE props: algorithm-expression: reviews_$->{course_id % 2} allow-range-query-with-inline-sharding: true props: proxy-hint-enabled: true sql-show: true Now let's analyze the most important properties: dataSources.master – here lies the definition of our master data source. mode – which can be either standalone with JDBC type or cluster with Zookeeper type (recommended for production), which is used for configuration information persistence rules– here, we can enable various ShardingSphere features like - !SHARDING tables.reviews – here, we describe the actual tables based on the inline syntax rules, meaning that we'll have two tables reviews_0 and reviews_1 sharded by the course_id column. shardingAlgorithms – here, we describe the manual inline sharding algorithm via a groovy expression telling that the reviews table is divided into two tables based on the course_id column. props – here, we enabled intercepting/formatting sql queries (p6spy can be disabled/commented). Important: Before starting our application, we need to make sure that our defined shards are created, so I created two tables in my database: reviews_0 and reviews_1 (init.sql). 
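The actual DDL lives in the repository's init.sql; purely for illustration, the two shards could be created with something like the following. The column types here are assumptions derived from the entity and the logged insert statement, not the real script:

SQL
CREATE TABLE IF NOT EXISTS reviews_0 (
    id               INT PRIMARY KEY,
    created_at       TIMESTAMP,
    last_modified_at TIMESTAMP,
    text             VARCHAR(1000),
    author           VARCHAR(255),
    author_telephone VARCHAR(50),
    author_email     VARCHAR(255),
    invoice_code     VARCHAR(50),
    course_id        INT
);
-- the second shard has exactly the same structure as the first one
CREATE TABLE IF NOT EXISTS reviews_1 LIKE reviews_0;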
Now we are ready to start our application and do some requests: Plain Text ### POST a new review POST http://localhost:8070/api/v1/reviews/ Content-Type: application/json { "text": "This is a great course!", "author": "John Doe", "authorTelephone": "555-1234", "authorEmail": "johndoe@example.com", "invoiceCode": "ABC123", "courseId": 123 } We can see the following log: Plain Text INFO 35412 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:42:01.8069745, 2023-04-17 15:42:01.8069745, John Doe, johndoe@example.com, 555-1234, 123, ABC123, This is a great course!, 4] If we are to execute one more request with a different payload: Plain Text INFO 35412 --- [nio-8070-exec-8] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_1 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:43:47.3267788, 2023-04-17 15:43:47.3267788, Mike Scott, mikescott@example.com, 555-1234, 123, ABC123, This is an amazing course!, 5] We can notice that both our reviews placed by Mike and John went into reviews_1 table, what if we are to change the course_id to 124 and execute the same POST request again? Plain Text INFO 35412 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 15:44:42.7133688, 2023-04-17 15:44:42.7133688, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 6] We can see that our new review got saved in reviews_0 table. Now we can execute two GET requests based on the course_id Plain Text ### GET reviews by course ID GET http://localhost:8070/api/v1/reviews/filter?courseId=123 GET http://localhost:8070/api/v1/reviews/filter?courseId=124 And observe in the logs how routing between our two tables took place. Plain Text INFO 35412 --- [nio-8070-exec-9] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_1 review0_ where review0_.course_id=? ::: [123] INFO 35412 --- [nio-8070-exec-5] ShardingSphere-SQL: Actual SQL: master ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] The first select was directed to reviews_1 table and the second one to reviews_0 - Sharding in action! Read-Write Splitting Now let's imagine another problem, the reviews application time may experience high stress during peak hours, leading to slow response times and decreased user experience. To address this issue, we can implement read/write splitting to balance the load and improve performance. 
And how lucky we are that ShardingSphere offers us a read/write splitting solution. Read-write splitting involves directing read queries to replica databases and write queries to a master database, ensuring that read requests do not interfere with write requests and that database performance is optimized. Before configuring the read-write splitting solution, we'll have to make some changes to our docker-compose in order to have some replicas for our master db (credits to bitnami for providing this): YAML mysql-master: image: 'bitnami/mysql:latest' ports: - '3306:3306' volumes: - 'mysql_master_data:/bitnami/mysql/data' - ./init.sql:/docker-entrypoint-initdb.d/init.sql environment: - MYSQL_REPLICATION_MODE=master - MYSQL_REPLICATION_USER=repl_user - MYSQL_REPLICATION_PASSWORD=repl_password - MYSQL_ROOT_PASSWORD=master_root_password - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_DATABASE=reviews-db mysql-slave: image: 'bitnami/mysql:latest' ports: - '3306' depends_on: - mysql-master environment: - MYSQL_USER=my_user - MYSQL_PASSWORD=my_password - MYSQL_REPLICATION_MODE=slave - MYSQL_REPLICATION_USER=repl_user - MYSQL_REPLICATION_PASSWORD=repl_password - MYSQL_MASTER_HOST=mysql-master - MYSQL_MASTER_PORT_NUMBER=3306 - MYSQL_MASTER_ROOT_PASSWORD=master_root_password Let's start our containers like this (one master and two replicas): docker-compose up --detach --scale mysql-master=1 --scale mysql-slave=2 Now we need the mapped ports for our slaves. $ docker port infra-mysql-slave-1 3306/tcp -> 0.0.0.0:49923 $ docker port infra-mysql-slave-2 3306/tcp -> 0.0.0.0:49922 Okay, now that we have our master and its replicas in place, we are ready to configure two new dataSources like this: YAML dataSources: master: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:3306/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 slave0: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:49922/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 slave1: dataSourceClassName: com.zaxxer.hikari.HikariDataSource driverClassName: com.mysql.jdbc.Driver jdbcUrl: jdbc:mysql://localhost:49923/reviews-db?allowPublicKeyRetrieval=true&useSSL=false username: my_user password: my_password connectionTimeoutMilliseconds: 30000 idleTimeoutMilliseconds: 60000 maxLifetimeMilliseconds: 1800000 maxPoolSize: 65 minPoolSize: 1 And then, we can add the read-write splitting rule to the rules: YAML - !READWRITE_SPLITTING dataSources: readwrite_ds: staticStrategy: writeDataSourceName: master readDataSourceNames: - slave0 - slave1 loadBalancerName: readwrite-load-balancer loadBalancers: readwrite-load-balancer: type: ROUND_ROBIN Here I think everything is self-explanatory: we have specified the written data source name to be the master and the read data sources to point to our slaves: slave0 and slave1; and we picked a round-robin load balancer algorithm. 
Important: One last change to be made is regarding the sharding rule, which knows nothing about the newly configured read-write splitting rule and points directly to the master: YAML - !SHARDING tables: reviews: actualDataNodes: readwrite_ds.reviews_$->{0..1} Now our sharding we'll be wrapped by the read-write splitting rule, too, and the data source decision will be made before picking the correct table (pay attention to readwrite_ds.reviews_$->{0..1}). Okay, we can start our application, run the same POST request and observe the logs: Plain Text INFO 22860 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:12:07.25473, 2023-04-17 16:12:07.25473, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 7] Nothing surprising here, the sharding still works, and the query took place in the master data source (write data source). But if we are to run a couple of GET requests, we'll observe the following: Plain Text INFO 22860 --- [nio-8070-exec-2] ShardingSphere-SQL: Actual SQL: slave0 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] INFO 22860 --- [nio-8070-exec-4] ShardingSphere-SQL: Actual SQL: slave1 ::: select review0_.id as id1_0_, review0_.created_at as created_2_0_, review0_.last_modified_at as last_mod3_0_, review0_.author as author4_0_, review0_.author_email as author_e5_0_, review0_.author_telephone as author_t6_0_, review0_.course_id as course_i7_0_, review0_.invoice_code as invoice_8_0_, review0_.text as text9_0_ from reviews_0 review0_ where review0_.course_id=? ::: [124] You can observe read-write splitting in action; our write queries take place in master data sources, but our read queries take place in master's replicas (slave0 and slave1) and this while maintaining the correct sharding rule. Data Masking Off to another imaginary problem regarding our application. Imagine that sensitive information such as customer email, phone number, and invoice code may need to be accessed by certain users or applications while remaining hidden from others due to data privacy regulations. To address this issue, we could implement a data masking solution to mask sensitive data when mapping the result or at the SQL level. But as you guessed, why bother? ShardingSphere is here to save the day with another easy-to-enable feature – data masking. 
So let's update our configuration with this new rule: YAML - !MASK tables: reviews: columns: invoice_code: maskAlgorithm: md5_mask author_email: maskAlgorithm: mask_before_special_chars_mask author_telephone: maskAlgorithm: keep_first_n_last_m_mask maskAlgorithms: md5_mask: type: MD5 mask_before_special_chars_mask: type: MASK_BEFORE_SPECIAL_CHARS props: special-chars: '@' replace-char: '*' keep_first_n_last_m_mask: type: KEEP_FIRST_N_LAST_M props: first-n: 3 last-m: 2 replace-char: '*' Let's see what we have here: table.reviews – we defined three masking algorithms for each column mentioned before maskAlgorithms.md5_mask – we specified the MD5 algorithm type for invoice_code maskAlgorithms.mask_before_special_chars_mask – we configured the MASK_BEFORE_SPECIAL_CHARS algorithm for the author_email column, meaning that all the characters before @ symbol will be replaced with the * symbol. maskAlgorithms.keep_first_n_last_m_mask – we configured the KEEP_FIRST_N_LAST_M algorithm for author_telephone column, meaning that only the first 3 and last 2 characters of a telephone number will stay unchanged; everything in between will be masked by the * symbol. Note: You can find a lot of other masking algorithms here. All right, let's start our application and do the same POST request. Plain Text INFO 35296 --- [nio-8070-exec-1] ShardingSphere-SQL: Actual SQL: master ::: insert into reviews_0 (created_at, last_modified_at, author, author_email, author_telephone, course_id, invoice_code, text, id) values (?, ?, ?, ?, ?, ?, ?, ?, ?) ::: [2023-04-17 16:26:51.8188306, 2023-04-17 16:26:51.8188306, Mike Scott, mikescott@example.com, 555-1234, 124, ABC123, This is an amazing course!, 9] We'll see nothing new here; the exact values that we provided in the body are the ones that got in the database. You can check the master/slave db, too, for that. But the magic comes in if we want to execute our GET request, which gives us the following body. JSON [ { "text": "This is an amazing course!", "author": "Mike Scott", "authorTelephone": "555***34", "authorEmail": "*********@example.com", "invoiceCode": "bbf2dead374654cbb32a917afd236656", "courseId": 124, "id": 9, "lastModifiedAt": "2023-04-17T15:44:43" }, ] As you can see, the data stays unchanged in the database, but when queried and delivered, the telephone, email, and invoice codes got masked according to our defined algorithm in the data masking rule. Conclusion That's it for today, folks! I hope this article gave you a good understanding of how ShardingSphere can be used for database sharding, read-write splitting, and data masking with Spring Boot. We covered the basics of ShardingSphere, how to configure it for sharding, how to use read-write splitting, and how to apply data masking to sensitive data. Make sure to check out all the features that ShardingSphere provides and its amazing documentation here. All the code mentioned in this article can be found here Happy coding!
Caches are very useful software components that all engineers must know. A cache is a transversal component that applies to all tech areas and architecture layers, such as operating systems, data platforms, backend, frontend, and other components. In this article, we are going to describe what a cache is and explain specific use cases focusing on the frontend and client side.

What Is a Cache?

A cache can be defined, in a basic way, as an intermediate memory between the data consumer and the data producer that stores and provides data that will be accessed many times by the same or different consumers. For the data consumer, it is a transparent layer whose only visible effect is improved performance. Usually, the reusability of data provided by the data producer is the key to taking advantage of the benefits of a cache. Performance is the other reason to use a cache system, such as an in-memory database, to provide a high-performance solution with low latency, high throughput, and high concurrency.

For example, how many people query the weather on a daily basis, and how many times do they repeat the same query? Let's suppose there are 1,000 people in New York consulting the weather and 50% repeat the same query twice per day. In this scenario, if we can store the result of the first query as close as possible to the user's device, we achieve two benefits: we improve the user experience because the data is provided faster, and we reduce the number of queries to the data producer/server side. The output is a better user experience and a solution that will support more concurrent users on the platform.

At a high level, there are two caching strategies that we can apply in a complementary way:

Client/Consumer Side: The cached data is stored on the consumer or user side, usually in the browser's memory when we are talking about web solutions (also called a private cache).
Server/Producer Side: The cached data is stored in the components of the data producer's architecture.

Caches, like any other solution, have a series of advantages that we are going to summarize:

Application performance: Caches provide faster response times because they can serve data more quickly.
Reduced load on the server side: When we apply caches in front of a system and reuse a piece of data, we avoid queries/requests to the following layer.
Scalability and cost improvement: As data caching gets closer to the consumer, we increase the scalability and performance of the solution at a lower cost. Components closer to the client side are more scalable and cheaper for three main reasons: these components are focused on performance and availability but have weak consistency; they hold only part of the information (the data used most by the users); and in the case of the browser's local cache, there is no cost for the data producer.

The big challenges of caching are data consistency and data freshness: how the data is kept synchronized and up to date across the organization. Depending on the use case, we will have more or fewer restrictions, because caching images is very different from caching inventory stock or sales behavior.

Client-Side Caches

Speaking about the client-side cache, we can have different types of cache that we are going to analyze a little bit in this article:

HTTP Caching: This caching type is an intermediate cache system, as it depends partially on the server.
Cache API: This is a browser API that allows us to cache requests in the browser.
Custom Local Cache: The front-end app controls the cache storage, expiration, invalidation, and updates.

HTTP Caching

This type caches the HTTP requests for any resource (CSS, HTML, images, video, etc.) in the browser, and the browser manages everything related to storage, expiration, validation, fetching, etc., on behalf of the front end. From the application's point of view, it is almost transparent: the application makes a request in the regular way, and the browser does all the "magic."

Caching is controlled using HTTP headers. On the server side, the server adds cache-specific headers to the HTTP response, for example: "Expires: Tue, 30 Jul 2023 05:30:22 GMT." The browser then knows the resource can be cached, and the next time the client (application) requests the same resource, if the request time is before the expiration date, the request is not sent and the browser returns the local copy of the resource.

HTTP caching also lets you control how responses are distinguished, as the same URL can generate different responses (and their caches should be handled separately). For example, for an API endpoint that returns some data (i.e., http://example.com/my-data), we could use a request header such as Accept to specify whether we want the response in JSON or CSV. Therefore, the cache should be stored with the response depending on the request header(s). For that, the server should set a response header such as Vary: Accept to let the browser know that the cache depends on that value. There are a lot of different headers to control the cache flow and behavior, but it is not the goal of this article to go deep into them. They will probably be addressed in another article.

As we mentioned before, this caching type needs the server to set the resources' expiration, validation, etc. So it is not a pure frontend caching method, but it is one of the simplest ways to cache the resources the front-end application uses, and it is complementary to the other approaches mentioned below. Since it is an intermediate cache, we can even delegate it to a "piece" between the client and the server: for example, a CDN, a reverse proxy (for example, Varnish), etc.

Cache API

It is quite similar to the HTTP caching method, but in this case, we control which requests are stored in or retrieved from the cache. We have to manage the cache expiration ourselves (and it's not easy, because these caches were designed to live "forever"). Even though these APIs are available in windowed contexts, they are very much oriented toward usage in a worker context.

This cache is well suited to offline applications. On the first request, we can fetch and cache all the resources needed (images, CSS, JS, etc.), allowing the application to work offline. It is very useful in mobile applications, for example with the use of maps for our GPS systems in addition to weather data. This allows us to have all the information for our hiking route even if we have no connection to the server. One example of how it works in a windowed context:

const url = 'https://catfact.ninja/breeds'
caches.open('v1').then((cache) => {
  cache.match(url).then((res) => {
    if (res) {
      console.log('it is in cache')
      res.json().then((data) => console.log(data))
    } else {
      console.log('it is NOT in cache')
      fetch(url)
        .then((res) => {
          // store the response under its URL so the next match(url) finds it
          cache.put(url, res.clone())
        })
    }
  })
})

Custom Local Cache

In some cases, we will need more control over the cached data and its invalidation (not just expiration). Cache invalidation is more than just checking the max-age of a cache entry.
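As a minimal sketch of this idea (the names and the LocalStorage choice are mine, not from the article's demo), a custom cache entry can carry its issue time so the application can decide when it is stale, and it can also be invalidated explicitly:

const TTL_MS = 10 * 60 * 1000 // consider cached data stale after 10 minutes

function putInCache(key, data) {
  // store the data together with the time it was issued
  localStorage.setItem(key, JSON.stringify({ issuedAt: Date.now(), data }))
}

function getFromCache(key) {
  const raw = localStorage.getItem(key)
  if (!raw) return null
  const { issuedAt, data } = JSON.parse(raw)
  // expired entries are removed and treated as a cache miss
  if (Date.now() - issuedAt > TTL_MS) {
    localStorage.removeItem(key)
    return null
  }
  return data
}

function invalidate(key) {
  // e.g., called when a WebSocket event notifies us that the data changed
  localStorage.removeItem(key)
}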
Imagine the weather app we mentioned above. This app allows the users to update the weather to reflect the real weather in a place. The app needs to do a request per city and transform the temperature values from F to ºC (this is a simple example: calculations can be more expensive in other use cases). To avoid making requests to the server (even if they are cached), we can do all the requests the first time, put all the data together in a data structure convenient for us, and store it, for example, in the browser's IndexedDB, LocalStorage, SessionStorage, or even in memory (not recommended). The next time we want to show the data, we can get it from the cache (not just the resource data but also the computation we did), saving network and computation time.

We can control the expiration of the caches by storing the issue time next to the data, and we can also control the cache invalidation. Imagine now that the user updates a value in their browser. We can just invalidate the cache and redo the requests and calculations next time, or go further and update our local cache with the new data. Or another user can change the value, and the server will send an event to notify all clients of the change. For example, using WebSockets, our front-end application can listen for these events and invalidate or simply update the cache.

This kind of cache requires work on our side to check the caches, handle the events that can invalidate or update them, etc., but it fits very well in a hexagonal architecture where the data is consumed from the API using a port adapter (repository) that can listen to domain events to react to changes and invalidate or update some caches. This is not a generic cache solution. We need to consider whether it fits our use case, as it requires work on the front-end application side to invalidate the caches or to emit and handle data change events. In most cases, HTTP caching is enough.

Conclusion

A cache solution and a good caching strategy should be a must in any software architecture; without them, our solution will be incomplete and probably not optimized. Caches are our best friends, especially in high-performance scenarios. It may seem that the technical cache invalidation process is the challenge, but the biggest challenge is understanding the business scenarios and use cases to identify the requirements in terms of data freshness and consistency, which allow us to design and choose the best strategy. We will talk about other cache approaches for databases, backends, and in-memory databases in the next articles.
Spark is an analytics engine for large-scale data engineering. Despite its long history, it still has its well-deserved place in the big data landscape. QuestDB, on the other hand, is a time-series database with a very high data ingestion rate. Spark desperately needs data, a lot of it, and QuestDB has it: a match made in heaven. Of course, there is pandas for data analytics, but the key here is the expression large-scale. Unlike pandas, Spark is a distributed system and can scale really well. What does this mean exactly? Let's take a look at how data is processed in Spark. For the purposes of this article, we only need to know that a Spark job consists of multiple tasks, and each task works on a single data partition. Tasks are executed in parallel in stages, distributed on the cluster. Stages depend on the previous ones; tasks from different stages cannot run in parallel. The schematic diagram below depicts an example job: In this tutorial, we will load data from a QuestDB table into a Spark application and explore the inner workings of Spark to refine data loading. Finally, we will modify and save the data back to QuestDB. Loading Data to Spark First things first, we need to load time-series data from QuestDB. I will use an existing table, trades, with just over 1.3 million rows. It contains Bitcoin trades spanning 3 days: not exactly a big data scenario, but good enough to experiment with. The table contains the following columns: symbol (SYMBOL), side (SYMBOL), price (DOUBLE), amount (DOUBLE), and timestamp (TIMESTAMP). The table is partitioned by day, and the timestamp column serves as the designated timestamp. QuestDB accepts connections via the Postgres wire protocol, so we can use JDBC to integrate. You can choose from various languages to create Spark applications, and here we will go for Python. Create the script, sparktest.py: Python from pyspark.sql import SparkSession # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # load 'trades' table into the dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "trades") \ .load() # print the number of rows print(df.count()) # do some filtering and print the first 3 rows of the data df = df.filter(df.symbol == 'BTC-USD').filter(df.side == 'buy') df.show(3, truncate=False) Believe it or not, this tiny application already reads data from the database when submitted as a Spark job. Shell spark-submit --jars postgresql-42.5.1.jar sparktest.py The job prints the following row count: Shell 1322570 And these are the first 3 rows of the filtered table: Shell +-------+----+--------+---------+--------------------------+ |symbol |side|price |amount |timestamp | +-------+----+--------+---------+--------------------------+ |BTC-USD|buy |23128.72|6.4065E-4|2023-02-01 00:00:00.141334| |BTC-USD|buy |23128.33|3.7407E-4|2023-02-01 00:00:01.169344| |BTC-USD|buy |23128.33|8.1796E-4|2023-02-01 00:00:01.184992| +-------+----+--------+---------+--------------------------+ only showing top 3 rows Although sparktest.py speaks for itself, it is still worth mentioning that this application has a dependency on the JDBC driver located in postgresql-42.5.1.jar. It cannot run without this dependency; hence, it has to be submitted to Spark together with the application. Optimizing Data Loading With Spark We have loaded data into Spark.
Now we will look at how this was completed and some aspects to consider. The easiest way to peek under the hood is to check QuestDB's log, which should tell us how Spark interacted with the database. We will also make use of the Spark UI, which displays useful insights of the execution, including stages and tasks. Spark Connection to QuestDB: Spark Is Lazy QuestDB log shows that Spark connected three times to the database. For simplicity, I only show the relevant lines in the log: Shell 2023-03-21T21:12:24.031443Z I pg-server connected [ip=127.0.0.1, fd=129] 2023-03-21T21:12:24.060520Z I i.q.c.p.PGConnectionContext parse [fd=129, q=SELECT * FROM trades WHERE 1=0] 2023-03-21T21:12:24.072262Z I pg-server disconnected [ip=127.0.0.1, fd=129, src=queue] Spark first queried the database when we created the DataFrame, but as it turns out, it was not too interested in the data itself. The query looked like this: SELECT * FROM trades WHERE 1=0 The only thing Spark wanted to know was the schema of the table in order to create an empty DataFrame. Spark evaluates expressions lazily and only does the bare minimum required at each step. After all, it is meant to analyze big data, so resources are incredibly precious for Spark. Especially memory: data is not cached by default. The second connection happened when Spark counted the rows of the DataFrame. It did not query the data this time, either. Interestingly, instead of pushing the aggregation down to the database by running SELECT count(*) FROM trades, it just queried a 1 for each record: SELECT 1 FROM trades. Spark adds the 1s together to get the actual count. Shell 2023-03-21T21:12:25.692098Z I pg-server connected [ip=127.0.0.1, fd=129] 2023-03-21T21:12:25.693863Z I i.q.c.p.PGConnectionContext parse [fd=129, q=SELECT 1 FROM trades ] 2023-03-21T21:12:25.695284Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-01.2 [rowCount=487309, partitionNameTxn=2, transientRowCount=377783, partitionIndex=0, partitionCount=3] 2023-03-21T21:12:25.749986Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-02.1 [rowCount=457478, partitionNameTxn=1, transientRowCount=377783, partitionIndex=1, partitionCount=3] 2023-03-21T21:12:25.800765Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-03.0 [rowCount=377783, partitionNameTxn=0, transientRowCount=377783, partitionIndex=2, partitionCount=3] 2023-03-21T21:12:25.881785Z I pg-server disconnected [ip=127.0.0.1, fd=129, src=queue] Working with the data itself eventually forced Spark to get a taste of the table's content too. Filters are pushed down to the database by default. Shell 2023-03-21T21:12:26.132570Z I pg-server connected [ip=127.0.0.1, fd=28] 2023-03-21T21:12:26.134355Z I i.q.c.p.PGConnectionContext parse [fd=28, q=SELECT "symbol","side","price","amount","timestamp" FROM trades WHERE ("symbol" IS NOT NULL) AND ("side" IS NOT NULL) AND ("symbol" = 'BTC-USD') AND ("side" = 'buy') ] 2023-03-21T21:12:26.739124Z I pg-server disconnected [ip=127.0.0.1, fd=28, src=queue] We can see that Spark's interaction with the database is rather sophisticated and optimized to achieve good performance without wasting resources. The Spark DataFrame is the key component that takes care of the optimization, and it deserves some further analysis. What Is a Spark DataFrame? The name DataFrame sounds like a container to hold data, but we have seen earlier that this is not really true. So, what is a Spark DataFrame, then? 
One way to look at Spark SQL, with the risk of oversimplifying it, is that it is a query engine. df.filter(predicate) is really just another way of saying WHERE predicate. With this in mind, the DataFrame is pretty much a query, or actually more like a query plan. Most databases come with functionality to display query plans, and Spark has it too! Let's check the plan for the above DataFrame we just created: Python df.explain(extended=True) Shell == Parsed Logical Plan == Filter (side#1 = buy) +- Filter (symbol#0 = BTC-USD) +- Relation [symbol#0,side#1,price#2,amount#3,timestamp#4] JDBCRelation(trades) [numPartitions=1] == Analyzed Logical Plan == symbol: string, side: string, price: double, amount: double, timestamp: timestamp Filter (side#1 = buy) +- Filter (symbol#0 = BTC-USD) +- Relation [symbol#0,side#1,price#2,amount#3,timestamp#4] JDBCRelation(trades) [numPartitions=1] == Optimized Logical Plan == Filter ((isnotnull(symbol#0) AND isnotnull(side#1)) AND ((symbol#0 = BTC-USD) AND (side#1 = buy))) +- Relation [symbol#0,side#1,price#2,amount#3,timestamp#4] JDBCRelation(trades) [numPartitions=1] == Physical Plan == *(1) Scan JDBCRelation(trades) [numPartitions=1] [symbol#0,side#1,price#2,amount#3,timestamp#4] PushedFilters: [*IsNotNull(symbol), *IsNotNull(side), *EqualTo(symbol,BTC-USD), *EqualTo(side,buy)], ReadSchema: struct<symbol:string,side:string,price:double,amount:double,timestamp:timestamp> If the DataFrame knows how to reproduce the data by remembering the execution plan, it does not need to store the actual data. This is precisely what we have seen earlier. Spark desperately tried not to load our data, but this can have disadvantages too. Caching Data Not caching the data radically reduces Spark's memory footprint, but there is a bit of jugglery here. Data does not have to be cached because the plan printed above can be executed again and again and again... Now imagine how a mere decently-sized Spark cluster could make our lonely QuestDB instance suffer martyrdom. With a massive table containing many partitions, Spark would generate a large number of tasks to be executed parallel across different nodes of the cluster. These tasks would query the table almost simultaneously, putting a heavy load on the database. So, if you find your colleagues cooking breakfast on your database servers, consider forcing Spark to cache some data to reduce the number of trips to the database. This can be done by calling df.cache(). In a large application, it might require a bit of thinking about what is worth caching and how to ensure that Spark executors have enough memory to store the data. In practice, you should consider caching smallish datasets used frequently throughout the application's life. Let's rerun our code with a tiny modification, adding .cache(): Python from pyspark.sql import SparkSession # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # load 'trades' table into the dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "trades") \ .load() \ .cache() # print the number of rows print(df.count()) # print the first 3 rows of the data df.show(3, truncate=False) This time Spark hit the database only twice. First, it came for the schema, the second time for the data: SELECT "symbol","side","price","amount","timestamp" FROM trades. 
Shell 2023-03-21T21:13:04.122390Z I pg-server connected [ip=127.0.0.1, fd=129] 2023-03-21T21:13:04.147353Z I i.q.c.p.PGConnectionContext parse [fd=129, q=SELECT * FROM trades WHERE 1=0] 2023-03-21T21:13:04.159470Z I pg-server disconnected [ip=127.0.0.1, fd=129, src=queue] 2023-03-21T21:13:05.873960Z I pg-server connected [ip=127.0.0.1, fd=129] 2023-03-21T21:13:05.875951Z I i.q.c.p.PGConnectionContext parse [fd=129, q=SELECT "symbol","side","price","amount","timestamp" FROM trades ] 2023-03-21T21:13:05.876615Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-01.2 [rowCount=487309, partitionNameTxn=2, transientRowCount=377783, partitionIndex=0, partitionCount=3] 2023-03-21T21:13:06.259472Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-02.1 [rowCount=457478, partitionNameTxn=1, transientRowCount=377783, partitionIndex=1, partitionCount=3] 2023-03-21T21:13:06.653769Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-03.0 [rowCount=377783, partitionNameTxn=0, transientRowCount=377783, partitionIndex=2, partitionCount=3] 2023-03-21T21:13:08.479209Z I pg-server disconnected [ip=127.0.0.1, fd=129, src=queue] Clearly, even a few carefully placed .cache() calls can improve the overall performance of an application, sometimes significantly. What else should we take into consideration when thinking about performance? Earlier, we mentioned that our Spark application consists of tasks, which work on the different partitions of the data in parallel. So partitioned data means parallelism, which results in better performance. Spark Data Partitioning Now we turn to the Spark UI. It tells us that the job was done in a single task: The truth is, we had already suspected this. The execution plan told us (numPartitions=1), and we did not see any parallelism in the QuestDB logs either. We can display more details about this partition with a bit of additional code: Python from pyspark.sql.functions import spark_partition_id, min, max, count df = df.withColumn("partitionId", spark_partition_id()) df.groupBy(df.partitionId) \ .agg(min(df.timestamp), max(df.timestamp), count(df.partitionId)) \ .show(truncate=False) Shell +-----------+--------------------------+--------------------------+------------------+ |partitionId|min(timestamp) |max(timestamp) |count(partitionId)| +-----------+--------------------------+--------------------------+------------------+ |0 |2023-02-01 00:00:00.078073|2023-02-03 23:59:59.801778|1322570 | +-----------+--------------------------+--------------------------+------------------+ The UI helps us confirm that the data is loaded as a single partition, while QuestDB stores this data in 3 partitions. We should try to fix this. Although it is not recommended, we can try to use DataFrame.repartition(). This call reshuffles data across the cluster while partitioning the data, so it should be our last resort. After running df.repartition(3, df.timestamp), we see 3 partitions, but not exactly the way we expected.
The partitions overlap with one another: Shell +-----------+--------------------------+--------------------------+------------------+ |partitionId|min(timestamp) |max(timestamp) |count(partitionId)| +-----------+--------------------------+--------------------------+------------------+ |0 |2023-02-01 00:00:00.698152|2023-02-03 23:59:59.801778|438550 | |1 |2023-02-01 00:00:00.078073|2023-02-03 23:59:57.188894|440362 | |2 |2023-02-01 00:00:00.828943|2023-02-03 23:59:59.309075|443658 | +-----------+--------------------------+--------------------------+------------------+ It seems that DataFrame.repartition() used hashes to distribute the rows across the 3 partitions. This would mean that all 3 tasks would require data from all 3 QuestDB partitions. Let's try this instead: df.repartitionByRange(3, "timestamp"): Shell +-----------+--------------------------+--------------------------+------------------+ |partitionId|min(timestamp) |max(timestamp) |count(partitionId)| +-----------+--------------------------+--------------------------+------------------+ |0 |2023-02-01 00:00:00.078073|2023-02-01 21:22:35.656399|429389 | |1 |2023-02-01 21:22:35.665599|2023-02-02 21:45:02.613339|470372 | |2 |2023-02-02 21:45:02.641778|2023-02-03 23:59:59.801778|422809 | +-----------+--------------------------+--------------------------+------------------+ This looks better but still not ideal. That is because DataFrame.repartitionByRange() samples the dataset and then estimates the borders of the partitions. What we really want is for the DataFrame partitions to match exactly the partitions we see in QuestDB. This way, the tasks running in parallel in Spark do not cross each other's paths in QuestDB, which is likely to result in better performance. Data source options to the rescue! Let's try the following: Python from pyspark.sql import SparkSession from pyspark.sql.functions import spark_partition_id, min, max, count # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # load 'trades' table into the dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "trades") \ .option("partitionColumn", "timestamp") \ .option("numPartitions", "3") \ .option("lowerBound", "2023-02-01T00:00:00.000000Z") \ .option("upperBound", "2023-02-04T00:00:00.000000Z") \ .load() df = df.withColumn("partitionId", spark_partition_id()) df.groupBy(df.partitionId) \ .agg(min(df.timestamp), max(df.timestamp), count(df.partitionId)) \ .show(truncate=False) Shell +-----------+--------------------------+--------------------------+------------------+ |partitionId|min(timestamp) |max(timestamp) |count(partitionId)| +-----------+--------------------------+--------------------------+------------------+ |0 |2023-02-01 00:00:00.078073|2023-02-01 23:59:59.664083|487309 | |1 |2023-02-02 00:00:00.188002|2023-02-02 23:59:59.838473|457478 | |2 |2023-02-03 00:00:00.565319|2023-02-03 23:59:59.801778|377783 | +-----------+--------------------------+--------------------------+------------------+ After specifying partitionColumn, numPartitions, lowerBound, and upperBound, the situation is much better: the row counts in the partitions match what we saw in the QuestDB logs earlier: rowCount=487309, rowCount=457478 and rowCount=377783. Looks like we did it!
Shell 2023-03-21T21:13:05.876615Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-01.2 [rowCount=487309, partitionNameTxn=2, transientRowCount=377783, partitionIndex=0, partitionCount=3] 2023-03-21T21:13:06.259472Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-02.1 [rowCount=457478, partitionNameTxn=1, transientRowCount=377783, partitionIndex=1, partitionCount=3] 2023-03-21T21:13:06.653769Z I i.q.c.TableReader open partition /Users/imre/Work/dbroot/db/trades~10/2023-02-03.0 [rowCount=377783, partitionNameTxn=0, transientRowCount=377783, partitionIndex=2, partitionCount=3] We can check the Spark UI again; it also confirms that the job was completed in 3 separate tasks, each of them working on a single partition. Sometimes it might be tricky to know the minimum and maximum timestamps when creating the DataFrame. In the worst case, you could query the database for those values via an ordinary connection (a small sketch of how to do that appears a little further below). We have managed to replicate our QuestDB partitions in Spark, but data does not always come from a single table. What if the data required is the result of a query? Can we load that, and is it possible to partition it? Options to Load Data: SQL Query vs. Table We can use the query option to load data from QuestDB with the help of a SQL query: Python # 1-minute aggregated trade data df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("query", "SELECT symbol, sum(amount) as volume, " "min(price) as minimum, max(price) as maximum, " "round((max(price)+min(price))/2, 2) as mid, " "timestamp as ts " "FROM trades WHERE symbol = 'BTC-USD' " "SAMPLE BY 1m ALIGN to CALENDAR") \ .load() Depending on the amount of data and the actual query, you might find that pushing the aggregations down to QuestDB is faster than completing them in Spark. Spark definitely has an edge when the dataset is really large. Now let's try partitioning this DataFrame with the options we used before with the dbtable option. Unfortunately, we will get an error message: Shell Options 'query' and 'partitionColumn' can not be specified together. However, we can trick Spark by just giving the query an alias name. This means we can go back to using the dbtable option again, which lets us specify partitioning.
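As mentioned above, if the lowerBound and upperBound timestamps are not known upfront, one option is to query them over an ordinary connection before building the DataFrame. Below is a minimal sketch of that idea (not part of the original tutorial), assuming the psycopg2 PostgreSQL client is installed and the same connection settings as in the earlier examples; any Postgres-compatible client would work just as well.
Python
import psycopg2

# connect to QuestDB over the Postgres wire protocol (same credentials as above)
conn = psycopg2.connect(host="localhost", port=8812, user="admin",
                        password="quest", dbname="questdb")
try:
    with conn.cursor() as cur:
        # bounds of the designated timestamp column of the 'trades' table
        cur.execute("SELECT min(timestamp), max(timestamp) FROM trades")
        lower_bound, upper_bound = cur.fetchone()
        print(lower_bound, upper_bound)  # feed these into lowerBound/upperBound
finally:
    conn.close()
With the bounds in hand, let's get back to partitioning the query results.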
See the example below: Python from pyspark.sql import SparkSession from pyspark.sql.functions import spark_partition_id, min, max, count # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # load 1-minute aggregated trade data into the dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "(SELECT symbol, sum(amount) as volume, " "min(price) as minimum, max(price) as maximum, " "round((max(price)+min(price))/2, 2) as mid, " "timestamp as ts " "FROM trades WHERE symbol = 'BTC-USD' " "SAMPLE BY 1m ALIGN to CALENDAR) AS fake_table") \ .option("partitionColumn", "ts") \ .option("numPartitions", "3") \ .option("lowerBound", "2023-02-01T00:00:00.000000Z") \ .option("upperBound", "2023-02-04T00:00:00.000000Z") \ .load() df = df.withColumn("partitionId", spark_partition_id()) df.groupBy(df.partitionId) \ .agg(min(df.ts), max(df.ts), count(df.partitionId)) \ .show(truncate=False) Shell +-----------+-------------------+-------------------+------------------+ |partitionId|min(ts) |max(ts) |count(partitionId)| +-----------+-------------------+-------------------+------------------+ |0 |2023-02-01 00:00:00|2023-02-01 23:59:00|1440 | |1 |2023-02-02 00:00:00|2023-02-02 23:59:00|1440 | |2 |2023-02-03 00:00:00|2023-02-03 23:59:00|1440 | +-----------+-------------------+-------------------+------------------+ Looking good. Now it seems that we can load any data from QuestDB into Spark by passing a SQL query to the DataFrame. Do we, really? Our trades table is limited to three data types only. What about all the other types you can find in QuestDB? We expect that Spark will successfully map a double or a timestamp when queried from the database, but what about a geohash? It is not that obvious what is going to happen. As always, when unsure, we should test. Type Mappings I have another table in the database with a different schema. This table has a column for each type currently available in QuestDB. SQL CREATE TABLE all_types ( symbol SYMBOL, string STRING, char CHAR, long LONG, int INT, short SHORT, byte BYTE, double DOUBLE, float FLOAT, bool BOOLEAN, uuid UUID, --long128 LONG128, long256 LONG256, bin BINARY, g5c GEOHASH(5c), date DATE, timestamp TIMESTAMP ) timestamp (timestamp) PARTITION BY DAY; INSERT INTO all_types values('sym1', 'str1', 'a', 456, 345, 234, 123, 888.8, 11.1, true, '9f9b2131-d49f-4d1d-ab81-39815c50d341', --to_long128(10, 5), rnd_long256(), rnd_bin(10,20,2), rnd_geohash(35), to_date('2022-02-01', 'yyyy-MM-dd'), to_timestamp('2022-01-15T00:00:03.234', 'yyyy-MM-ddTHH:mm:ss.SSS')); long128 is not fully supported by QuestDB yet, so it is commented out. Let's try to load and print the data; we can also take a look at the schema of the DataFrame: Python from pyspark.sql import SparkSession # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # create dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "all_types") \ .load() # print the schema print(df.schema) # print the content of the dataframe df.show(truncate=False) Much to my surprise, Spark managed to create the DataFrame and mapped all types. 
Here is the schema: Shell StructType([ StructField('symbol', StringType(), True), StructField('string', StringType(), True), StructField('char', StringType(), True), StructField('long', LongType(), True), StructField('int', IntegerType(), True), StructField('short', ShortType(), True), StructField('byte', ShortType(), True), StructField('double', DoubleType(), True), StructField('float', FloatType(), True), StructField('bool', BooleanType(), True), StructField('uuid', StringType(), True), # StructField('long128', StringType(), True), StructField('long256', StringType(), True), StructField('bin', BinaryType(), True), StructField('g5c', StringType(), True), StructField('date', TimestampType(), True), StructField('timestamp', TimestampType(), True) ]) It looks pretty good, but you might wonder if it is a good idea to map long256 and geohash types to String. QuestDB does not provide arithmetics for these types, so it is not a big deal. Geohashes are basically 32-base numbers, represented and stored in their string format. The 256-bit long values are also treated as string literals. long256 is used to store cryptocurrency private keys. Now let's see the data: Shell +------+------+----+----+---+-----+----+------+-----+----+------------------------------------+ |symbol|string|char|long|int|short|byte|double|float|bool|uuid | +------+------+----+----+---+-----+----+------+-----+----+------------------------------------+ |sym1 |str1 |a |456 |345|234 |123 |888.8 |11.1 |true|9f9b2131-d49f-4d1d-ab81-39815c50d341| +------+------+----+----+---+-----+----+------+-----+----+------------------------------------+ +------------------------------------------------------------------+----------------------------------------------------+ |long256 |bin | +------------------------------------------------------------------+----------------------------------------------------+ |0x8ee3c2a42acced0bb0bdb411c82bb01e7e487689228c189d1e865fa0e025973c|[F2 4D 4B F6 18 C2 9A A7 87 C2 D3 19 4A 2C 4B 92 C4]| +------------------------------------------------------------------+----------------------------------------------------+ +-----+-------------------+-----------------------+ |g5c |date |timestamp | +-----+-------------------+-----------------------+ |q661k|2022-02-01 00:00:00|2022-01-15 00:00:03.234| +-----+-------------------+-----------------------+ It also looks good, but we could omit the 00:00:00 from the end of the date field. We can see that it is mapped to Timestamp and not Date. We could also try to map one of the numeric fields to Decimal. This can be useful if later we want to do computations that require high precision. We can use the customSchema option to customize the column types. 
Our modified code: Python from pyspark.sql import SparkSession # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # create dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "all_types") \ .option("customSchema", "date DATE, double DECIMAL(38, 10)") \ .load() # print the schema print(df.schema) # print the content of the dataframe df.show(truncate=False) The new schema: Shell StructType([ StructField('symbol', StringType(), True), StructField('string', StringType(), True), StructField('char', StringType(), True), StructField('long', LongType(), True), StructField('int', IntegerType(), True), StructField('short', ShortType(), True), StructField('byte', ShortType(), True), StructField('double', DecimalType(38,10), True), StructField('float', FloatType(), True), StructField('bool', BooleanType(), True), StructField('uuid', StringType(), True), # StructField('long128', StringType(), True), StructField('long256', StringType(), True), StructField('bin', BinaryType(), True), StructField('g5c', StringType(), True), StructField('date', DateType(), True), StructField('timestamp', TimestampType(), True) ]) And the data is displayed as: Shell +------+------+----+----+---+-----+----+--------------+-----+----+------------------------------------+ |symbol|string|char|long|int|short|byte|double |float|bool|uuid | +------+------+----+----+---+-----+----+--------------+-----+----+------------------------------------+ |sym1 |str1 |a |456 |345|234 |123 |888.8000000000|11.1 |true|9f9b2131-d49f-4d1d-ab81-39815c50d341| +------+------+----+----+---+-----+----+--------------+-----+----+------------------------------------+ +------------------------------------------------------------------+----------------------------------------------------+ |long256 |bin | +------------------------------------------------------------------+----------------------------------------------------+ |0x8ee3c2a42acced0bb0bdb411c82bb01e7e487689228c189d1e865fa0e025973c|[F2 4D 4B F6 18 C2 9A A7 87 C2 D3 19 4A 2C 4B 92 C4]| +------------------------------------------------------------------+----------------------------------------------------+ +-----+----------+-----------------------+ |g5c |date |timestamp | +-----+----------+-----------------------+ |q661k|2022-02-01|2022-01-15 00:00:03.234| +-----+----------+-----------------------+ It seems that Spark can handle almost all database types. The only issue is long128, but this type is currently a work in progress in QuestDB. When completed, it will be mapped as String, just like long256. Writing Data Back Into the Database There is only one thing left: writing data back into QuestDB. In this example, first, we will load some data from the database and add two new features: a 10-minute moving average, and the standard deviation calculated over the same 10-minute window. Then we will try to save the modified DataFrame back into QuestDB as a new table.
We need to take care of some type mappings, as Double columns are sent to QuestDB as FLOAT8 by default, so we end up with this code: Python from pyspark.sql import SparkSession from pyspark.sql.window import Window from pyspark.sql.functions import avg, stddev, when # create Spark session spark = SparkSession.builder.appName("questdb_test").getOrCreate() # load 1-minute aggregated trade data into the dataframe df = spark.read.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "(SELECT symbol, sum(amount) as volume, " "round((max(price)+min(price))/2, 2) as mid, " "timestamp as ts " "FROM trades WHERE symbol = 'BTC-USD' " "SAMPLE BY 1m ALIGN to CALENDAR) AS fake_table") \ .option("partitionColumn", "ts") \ .option("numPartitions", "3") \ .option("lowerBound", "2023-02-01T00:00:00.000000Z") \ .option("upperBound", "2023-02-04T00:00:00.000000Z") \ .load() # add new features window_10 = Window.partitionBy(df.symbol).rowsBetween(-10, Window.currentRow) df = df.withColumn("ma10", avg(df.mid).over(window_10)) df = df.withColumn("std", stddev(df.mid).over(window_10)) df = df.withColumn("std", when(df.std.isNull(), 0.0).otherwise(df.std)) # save the data as 'trades_enriched' df.write.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "trades_enriched") \ .option("createTableColumnTypes", "volume DOUBLE, mid DOUBLE, ma10 DOUBLE, std DOUBLE") \ .save() All works, but… we soon realize that our new table, trades_enriched, is not partitioned and does not have a designated timestamp, which is not ideal. Obviously, Spark has no idea about QuestDB specifics. It would work better if we created the table upfront and Spark only saved the data into it. We drop the table and re-create it; this time, it is partitioned and has a designated timestamp: SQL DROP TABLE trades_enriched; CREATE TABLE trades_enriched ( volume DOUBLE, mid DOUBLE, ts TIMESTAMP, ma10 DOUBLE, std DOUBLE ) timestamp (ts) PARTITION BY DAY; The table is empty and waiting for the data. We rerun the code; all works, no complaints. The data is in the table, and it is partitioned. One aspect of writing data into the database is whether we are allowed to create duplicates. What if I try to rerun the code again without dropping the table? Will Spark let me save the data this time? No, we get an error: Shell pyspark.sql.utils.AnalysisException: Table or view 'trades_enriched' already exists. SaveMode: ErrorIfExists. The last part of the error message looks interesting: SaveMode: ErrorIfExists. What is SaveMode? It turns out we can configure what should happen if the table already exists. Our options are: errorifexists: the default behavior; an error is returned if the table already exists, so Spark is playing it safe here. append: data will be appended to the rows already present in the table. overwrite: the content of the table will be replaced entirely by the newly saved data. ignore: if the table is not empty, our save operation gets ignored without any error. We have already seen how errorifexists behaves; append and ignore seem simple enough to just work. However, overwrite is not straightforward. The content of the table must be cleared before the new data can be saved.
Spark will delete and re-create the table by default, which means losing partitioning and the designated timestamp. In general, we do not want Spark to create tables for us. Luckily, with the truncate option, we can tell Spark to use TRUNCATE to clear the table instead of deleting it: Python # save the data as 'trades_enriched', overwrite if already exists df.write.format("jdbc") \ .option("url", "jdbc:postgresql://localhost:8812/questdb") \ .option("driver", "org.postgresql.Driver") \ .option("user", "admin").option("password", "quest") \ .option("dbtable", "trades_enriched") \ .option("truncate", True) \ .option("createTableColumnTypes", "volume DOUBLE, mid DOUBLE, ma10 DOUBLE, std DOUBLE") \ .save(mode="overwrite") The above works as expected. Conclusion Our ride might seem bumpy, but we finally have everything working. Our new motto should be "There is a config option for everything!". To summarize what we have found: We can use Spark's JDBC data source to integrate with QuestDB. It is recommended to use the dbtable option, even if we use a SQL query to load data. Always try to specify the partitioning options (partitionColumn, numPartitions, lowerBound, and upperBound) when loading data; the partitions should ideally match the QuestDB partitions. Sometimes it makes sense to cache some data in Spark to reduce the number of trips to the database. It can be beneficial to push work down into the database, depending on the task and how much data is involved. It makes sense to make use of QuestDB's time-series-specific features, such as SAMPLE BY, instead of trying to rewrite them in Spark. Type mappings can be customized via the customSchema option when loading data. When writing data into QuestDB, always specify the desired save mode. It generally works better if you create the table upfront and do not let Spark create it, because this way you can add partitioning and a designated timestamp. If you select the overwrite save mode, you should also enable the truncate option to make sure Spark does not delete the table, so partitioning and the designated timestamp are not lost. Type mappings can be customized via the createTableColumnTypes option when saving data. I mentioned only the config options that are most likely to be tweaked when working with QuestDB; the complete set of options can be found here: Spark data source options. What Could the Future Bring? Overall, everything works, but it would be nice to have a much more seamless integration, where partitioning would be taken care of automagically. Some type mappings could use better defaults, too, when saving data into QuestDB. The overwrite save mode could default to using TRUNCATE. More seamless integration is not impossible to achieve. If QuestDB provided its own JDBCDialect implementation for Spark, the above nuances could be handled. We should probably consider adding this. Finally, there is one more thing we did not mention yet: data locality. That is because, currently, QuestDB cannot run as a cluster. However, we are actively working on a solution — check out The Inner Workings of Distributed Databases for more information. When the time comes, we should ensure that data locality is also considered. Ideally, each Spark node would work on tasks that require partitions loaded from the local (or closest) QuestDB instance. However, this is not something we should be concerned about at this moment... for now, just enjoy data crunching!
Data synchronization is one of the most important aspects of any product. Apache Kafka is one of the most popular choices when designing a system that expects near-real-time propagation of large volumes of data. Even though Kafka has simple yet powerful semantics, working with it requires insight into its architecture. This article summarizes the most important design aspects of Kafka as a broker and of applications that act as data producers or consumers. About Kafka Apache Kafka originated at LinkedIn and was developed as a highly scalable system for distributing telemetry and usage data. Over time, Kafka evolved into a general-purpose streaming data backbone that combines high throughput with low data delivery latencies. Internally, Kafka is a distributed log. A (commit) log is an append-only data structure: producers append data (log records) to its end, and subscribers read the log from the beginning to replay the records. This data structure is used, for example, in a database's write-ahead log. A distributed log means that the actual data structure is not hosted on a single node but is distributed across many nodes to achieve both high availability and high performance. Internals and Terminology Before we jump into how Kafka is used by applications, let's quickly go through the basic terminology and architecture so we understand the guarantees that Kafka provides to its users. A single Kafka node is called a broker. The broker receives messages from producers and distributes them to consumers. Producers send the messages into distributed logs called topics (in traditional messaging, this corresponds to a queue). To scale the performance of a single topic beyond the capacity of a single node, each topic may be split into multiple partitions. To achieve high availability and durability of the stored data, each partition has a leader (performing all read and write operations) and multiple followers. Partitions are assigned to brokers automatically, and the failover of a broker is also automatic and transparent to developers using Kafka. On the backend, the assignment of leader/replica roles is orchestrated using leader election in Apache ZooKeeper or, in newer versions of Kafka, using the KRaft protocol. In the diagram, we can see a Kafka cluster that consists of five brokers. In this scenario, two topics (A and B) were created. Topic A has two partitions, while topic B has only a single partition. The cluster was set up with a replication factor of 3 — this means there are always three copies of the data stored, allowing two nodes to fail without losing the data. A replication factor of 3 is a sane default since it guarantees tolerance of a node failure even during the maintenance of one other broker. You may ask why topic A was divided into two partitions; what is the benefit? First, notice that the leader of Partition 1 is on a different node than the leader of Partition 2. This means that if clients produce/consume data to/from this topic, they may use the disk throughput and performance of 2 nodes instead of 1. On the other hand, there is a cost to this decision: message ordering is guaranteed only within a single partition. Producers and Consumers Now that we have some understanding of how Kafka works internally, let's take a look at how the situation looks from the perspective of producers and consumers. Producer Let's start with the producer.
As mentioned above, replication and the assignment of topics/partitions are a concern of Kafka itself and are not visible to producers or consumers. So the producer only needs to know which topics it wishes to send data to and whether these topics have multiple partitions. In case the topic is partitioned (entity-1), the producer may provide, as part of its code, a "partitioner," which is a simple class that decides which partition a given record belongs to. So in Kafka, partitioning is driven by the producer. In case the producer does not specify any partitioner (but the topic is partitioned), a round-robin strategy is used. Round-robin is completely fine for entities where exact ordering is not important — there is no causal relation between the records. For example, if you have a topic with sensor measurements, these measurements may be sent by the sensors on a scheduled basis, so there is no particular order to the records, and round-robin provides an easy way to balance the records among the individual partitions. Our example with sensors also shows another important detail: there may be multiple producers sending records into one topic. In the diagram above, we see that we have many sensors (producers) creating two types of records: humidity (in green) and CO2 concentration (in red). Each of the records also contains information about the sensor itself, such as its serial number (an integer is used in this example for the sake of simplicity). Because every sensor ever produced is capable of measuring humidity, while only some of the sensors support CO2 measurements, the designers of the system have decided to split the humidity records into two partitions using the serial number of the sensor. Notice that there is strict ordering within each of the humidity partitions (and within the CO2 partition), but there is no ordering of records between the partitions. In other words: B will always be processed before D and E, and A will always be processed before C, but there is no ordering guarantee between B and A (or C). Consumer A Kafka consumer is an application that reads records from a topic. In Kafka, there is one more concept: the consumer group — a set of consumers that cooperate. When there are multiple consumers from the same group subscribed to the same topic, Kafka always distributes the partitions among the consumers of the group in such a way that each partition is read by exactly one of them (a single consumer may read multiple partitions, but one partition will not be read by multiple consumers of the group). In case some of the consumers fail, Kafka will automatically reassign partitions to other consumers (note that consumers do not need to subscribe to all topics). But in case of a failover or switchover, how does Kafka know where to continue? We have already said that a topic contains all the messages (even the messages that were already read). Does this mean that the consumer must read the whole topic again? The answer is that the consumer is able to continue where the previous one stopped. Kafka uses a concept called an offset, which is essentially a pointer to a message in the partition; it stores the position of the last processed message for any given consumer group. Offset While it may seem trivial, the concept of offsets and distributed logs is extremely powerful. It is possible to dynamically add new consumers, and these consumers (starting from offset=0) are able to catch up with the full history of data.
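To make the producer and consumer mechanics above more concrete, here is a minimal sketch (not from the original article) using the kafka-python client; the broker address, topic, and group names are illustrative assumptions.
Python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: key records by sensor serial number so all readings of one sensor
# land in the same partition and keep their relative order.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-humidity", key="42", value={"sensor": 42, "humidity": 0.61})
producer.flush()

# Consumer: a brand-new consumer group starts from the earliest offset
# and catches up with the full history of the topic.
consumer = KafkaConsumer(
    "sensor-humidity",
    bootstrap_servers="localhost:9092",
    group_id="humidity-history-replay",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.partition, record.offset, record.value)
Because the new consumer group has no committed offsets yet, auto_offset_reset="earliest" makes it start from offset 0 and replay the full history.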
With traditional queues, by contrast, a newly added consumer would have to obtain the historical data some other way, because messages are deleted once read in classic messaging. That kind of data sync is more complex: either the producer publishes the messages into the one queue used for increments (and affects all other consumers), or the consumer needs to use some other mechanism (such as REST or another dedicated queue), which creates data synchronization issues, as two independent, unsynchronized mechanisms are used. Another huge benefit is that the consumer may decide at any time to reset the offset and read from the beginning. Why would one do that? First, there is a class of analytical applications (such as machine learning) that requires processing the whole dataset, and an offset reset gives them such a mechanism. Second, it may happen that there is a bug in the consumer that corrupts the data. In this case, the consumer's product team may fix the issue and reset the offset, effectively reprocessing the whole dataset and replacing the corrupt data with the correct one. This mechanism is heavily used in Kappa architecture. Retention and Compaction We stated above that the commit log is append-only, but this does not imply that the log is immutable. In practice, the log keeps the full history of all changes only in certain types of deployments, where it is necessary (for auditing purposes or to have a real Kappa architecture). This strategy is powerful but also has a price. Firstly, performance: the consumer needs to go through huge volumes of data in order to catch up with the head of the log. Secondly, if the log contains any sensitive information, it is hard to get rid of it (which makes this type of log unfriendly to regulations that require data to be erased on request). But in many cases, the logs have some fixed retention — either size- or time-based. In this case, the log contains only a window of messages (and any overflow is automatically erased). Using the log as a buffer keeps the log size reasonable and also ensures that the data does not stay in the log forever (making it easier to adhere to compliance requirements). However, this also makes the log unusable for certain use cases — one of these use cases is when you want all the records to be available to newly subscribed consumers. The last type of log is the so-called compacted log. In a compacted log, each record has not only a value but also a key. Whenever a new record is appended to the topic and there is already a record with the same key, Kafka will eventually compact the log and erase the original record. Be aware that this means that, for a certain time, there will be multiple records with the same key, and the up-to-date value is always in the most recently inserted record. This does not require any additional handling if you already go with at-least-once semantics (delivery of the message is guaranteed, but in case of any uncertainty, for example due to network issues, the message may be delivered multiple times). You can picture the compacted log as a streaming form of a database that allows anyone to subscribe to the newest data. This picture of Kafka is quite accurate, because there is a duality between a stream and a table. Both concepts are merely different views of the same thing — in a SQL database, we also work with tables, but under the hood, there is a commit log. Similarly, any Kafka topic (compacted ones included) can be viewed as a table. In fact, the Kafka Streams library builds on this duality.
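As a hedged illustration of this duality (not from the original article; the topic name, broker address, and record shapes are assumptions), a consumer can materialize a compacted topic into a simple in-memory table by keeping only the latest value per key:
Python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "trends",                              # a compacted topic
    bootstrap_servers="localhost:9092",
    group_id="trends-materializer",
    auto_offset_reset="earliest",
    key_deserializer=lambda k: k.decode("utf-8"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,              # stop iterating when no new records arrive
)

# The "table": latest value per key. Replaying the compacted topic rebuilds it;
# newer records with the same key simply overwrite older ones.
table = {}
for record in consumer:
    table[record.key] = record.value

print(table.get("room-1"))  # latest trend for a given room, if present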
There is even ksqlDB (Kafka SQL), which allows you to issue SQL statements over records in Kafka. In the topology above, we see that the inbound measurement topics (temperature, humidity, CO2…) are normal topics with retention set to seven days. The retention allows the developers to time travel a week back in case they find a bug in their implementation. From these inbound topics, the data is read by two services (each in a separate consumer group). The measurements history service stores the telemetry in a time-series database (long-term storage), which may be used as a source for graphs and widgets in the UI of the system. The trends service aggregates the data (creating 24-hour windows of the measurements in a given room) so it can be used by downstream controllers, and sends the results through a compacted topic. The topic is compacted because there is no need to keep any historical records (only the latest trend is valid). On the other hand, the customer may add a new device (and associated controller) at any time, so we want to ensure that the latest readings for the given room are always present. Patterns and Principles In the previous paragraphs, we presented the basic concepts. In this section, we'll expand on those and discuss a few other Kafka patterns. Eventually Consistent Architecture In a data synchronization architecture based on messaging, we want to ensure that whenever new data is produced in one product, it will be available to all relevant products in near-real-time. This means that if the user creates or modifies some entity in product A and navigates to product B, they should (ideally) see the up-to-date version of this entity. However, since the individual products use multiple independent databases, it is not practical to have a distributed transaction mechanism and atomic consistency between these databases. Instead, we go with eventual consistency. In this model, the data producer is responsible for publishing any record it creates/updates/deletes to Kafka, from which an interested consumer may retrieve the record and store it locally. This propagation between systems takes some time: typically less than a second between the publishing of the record and the moment it becomes available to subscribers. Also, the consumer may optimize writes to its database (e.g., batch writes). During this time period, some of the systems (the replicas) have slightly stale data. It may also happen that some of the replicas will not be able to catch up for some time (downtime, network partition). But structurally, it is guaranteed that all the systems will eventually converge to the same results and will hold a consistent dataset — hence the term "eventual consistency." Optimizing Writes to the Local Database As alluded to in the previous paragraph, consumers may want to optimize writes to their local database. For example, it is highly undesirable to commit on a per-record basis in relational databases, because a transaction commit is a relatively expensive operation. It may be much wiser to commit in batches (say, every 5,000 records or at most every 500 ms, whichever comes first). Kafka is well able to support this, because committing an offset is in the hands of the consumer. Another example is AWS Redshift, a data warehouse/OLAP database in which commits are very expensive. Also, in Redshift, every commit invalidates its query caches.
As a result, the cluster takes the hit of the commit twice: once to perform the commit itself and a second time when all previously cached queries must be re-evaluated. For these reasons, you may want to commit to Redshift (and similar technologies) on a scheduled basis, every X minutes, to limit the blast radius of this action. A last example is NoSQL databases that do not support transactions; there, it may be just fine to stream the data on a per-record basis (obviously, depending on the capabilities of the DB engine). There is one takeaway: different replicas may use slightly different persistence strategies, even if they consume the same data. Always assume that there is a possibility that the other side does not have the data available yet. Referential Integrity Between Topics It is important to understand that since Kafka-based data synchronization is eventually consistent, there is no implicit referential or causal integrity between the individual topics (or partitions). When it comes to referential integrity, consumers should be written so that they expect to receive, for example, measurements for a room whose details they have not received yet. Some consumers may handle this situation by not showing the data at all until all the dimensions are present (for example, you can't turn on the ventilation in a room you do not know about). For other systems, the missing reference is not really an issue: the average temperature in the house will be the same, regardless of the presence of room details. For these reasons, it may be impractical to impose any strict restrictions centrally. Stateful Processing Kafka consumers may require stateful processing — such as aggregation, window functions, and deduplication. Also, the state itself may not be of a trivial size, or there may be a requirement that, in case of a failure, the replica is able to continue almost instantly. In these cases, storing the results in the RAM of the consumer is not the best choice. Luckily, the Kafka Streams library has out-of-the-box support for RocksDB — a high-performance embedded key-value store. RocksDB is able to store the results both in RAM and on disk. Caching Strategy and Parallelism Closely related to stateful processing is the caching strategy. Kafka is, by design, not well suited for the competing-consumers style of work, because each partition is assigned to exactly one processor. If you want to implement competing consumers, you need to create significantly more partitions than there are consumers within the system to emulate the behavior. However, this is not the way parallelism should be handled in Kafka-based systems (in case you really need a job queue of unrelated jobs, you will be much better off with RabbitMQ, SQS, ActiveMQ, and the like). Kafka is a stream processing system, and it is expected that the records in one partition somehow relate to each other and should be processed together. The individual partitions act as data shards, and since Kafka guarantees that each of these partitions will be assigned to exactly one consumer, the consumer can be sure that there is no other competing processor, so it can cache the results as it sees fit in its local cache and does not need to implement any distributed caching (such as Redis). In case the processor fails or crashes, Kafka will just reassign the partition to another consumer, which will populate its local caches and continue.
This design of stream processing is significantly simpler than competing consumers. There is one consideration, though: the partitioning strategy, because it is defined by the producer by default, while different consumers may have multiple, mutually incompatible needs. For this reason, it is common in the Kafka world to re-partition a topic. In our scenario, it would work the following way: in the diagram, we can see that the Trends service produces trends into its topic. This topic is round-robin partitioned and compacted. ProductX, which focuses on large industrial customers, needs to partition the data some other way, for example, by customerId. In this case, ProductX may write a simple application that re-partitions the data (re-partitioning is often managed under the hood by the Kafka Streams library). In other words, it reads the data from the source topic and writes it into another topic, managed by ProductX, which partitions the data differently (per business unit in this case). With this partitioning, ProductX is able to shard the non-overlapping business units to dedicated processing nodes, massively increasing the processing parallelism. The internal ProductX topic may have just a short retention (such as 24 hours) because it does not hold the authoritative copy of the data, and the data can easily be replayed from the original topic if necessary. In Kafka Streams, you may want to join several entities in order to combine the data (this is a common use case). Beware that if you have multiple consumers, you need to have the inbound topics partitioned in the exact same way (the same partitioner, based on the join key, and the same number of partitions). Only this way is it guaranteed that entities with matching join keys will be received by the same consumer (processor). Summary In this article, we discussed the overall architecture of Kafka and how this low-level architecture allows the broker to easily scale horizontally (thanks to partitioning) and to ensure high availability and durability (thanks to the leader/replica design and leader election). We also went through the basics of designing Kafka-based topologies, explained eventual consistency and how it affects the guarantees given to our applications, and learned how to use Kafka's different types of logs and their retention. While Kafka may seem overwhelming at first, it is important to realize that internally it is based on the plain old distributed log. This relatively simple internal structure is what gives Kafka its straightforward semantics, high throughput, and low data propagation latencies: qualities that are crucial for building any data pipeline.
Continuing our Snowflake blog series, after learning about setting up a Snowflake account using System-defined roles, we will explore the best practices for loading data from a file into Snowflake. Let’s begin! Snowflake supports file-based data ingestion through internal and external stages. However, there are various factors to consider when performing data ingestion, including the frequency of data arrival, file sizes, and the data loading techniques used such as copy command, external tables, and Snowpipe, among others. Additionally, the file format used, such as CSV, Parquet, JSON, etc., also plays a critical role in choosing the right approach. Making the wrong choice can result in increased costs and slower performance. This blog provides insights into these approaches to help you select the best one while loading data into Snowflake. Faster Loading of Parquet Files From S3 If you want to load Parquet files from S3 faster into Snowflake, you should not use the COPY command, which is suitable for CSV format files placed in external stages. Instead, it is recommended to use external tables on top of the Parquet file and enable the “vector scan performance flag” scan property. However, to enable this property, you need to contact Snowflake support and have it enabled at the account level. Parallel Operations To improve the efficiency of loading data from stages into Snowflake, it is recommended to create files in the range of 100-250MB with compression. By default, each node can handle 8 threads in parallel. Therefore, if you have 32 files, a Medium warehouse with 4 nodes can process all of them simultaneously, running 32 threads in parallel. It’s important to keep in mind that the performance of parallel operations in Snowflake can also be affected by the complexity of the operations being performed. Additionally, Snowflake may not always run the threads as defined in parallel due to various factors such as resource availability and load balancing. Therefore, it’s important to monitor and adjust the parallelism accordingly to ensure optimal performance. It is important to ensure that the warehouse used for data loading runs for at least a minute to fully utilize the cost for active time. Purging Files From Stage To optimize performance, it is recommended to remove files from the stage after successful loading using the COPY command with the PURGE=True option. This ensures that the staged files are deleted once loaded, which not only improves performance but also eliminates the need for COPY commands to scan the entire bucket path to check for files to load. Loading Large Files in Snowflake Suppose you need to load a single large file in gigabytes, which may or may not contain bad data in some of its rows. In such cases, it is advisable to load the data by ignoring the errors. Failing to do so can result in the wastage of credits. For example, if the data is being processed using a warehouse and an error row is identified after an hour, the entire operation will be aborted, leading to credit wastage. Using “ON_ERROR” as “CONTINUE” will load the good data and ignore the bad rows. However, it is always recommended to load large files by splitting them into smaller chunks so that parallel processing can be utilized using a warehouse. If that’s not possible and you still want to load a large file, it’s recommended to check if it’s okay to continue loading by ignoring the bad data. 
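As a hedged illustration of that approach (the table, stage, file, and connection parameters below are placeholders, not from the original post), the COPY command can be issued through Snowflake's Python connector with ON_ERROR set to CONTINUE and PURGE enabled:
Python
import snowflake.connector

# Illustrative sketch using the snowflake-connector-python package.
conn = snowflake.connector.connect(
    user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT",
    warehouse="LOAD_WH", database="MY_DB", schema="PUBLIC",
)
try:
    conn.cursor().execute("""
        COPY INTO big_table
        FROM @my_stage/large_file.csv.gz
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'   -- load the good rows, skip the bad ones
        PURGE = TRUE            -- remove the staged file after a successful load
    """)
finally:
    conn.close()
ON_ERROR = 'CONTINUE' loads the good rows and skips the bad ones, while PURGE = TRUE implements the stage cleanup recommended earlier.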
Best Practice While Loading JSON Data When loading JSON data into Snowflake, it's important to note that the output from some applications can be a single large array consisting of multiple rows combined. However, due to the limitations of the VARIANT datatype (which has a maximum size of 16 MB), such a huge JSON output cannot be loaded in its entirety. Moreover, it's not appropriate to load a single huge array as a single row. Instead, the array should be split into multiple rows. To achieve this, use the STRIP_OUTER_ARRAY=TRUE option in the FILE_FORMAT parameter of the COPY command. This option removes the outer array and identifies the individual rows, allowing for efficient loading of JSON data into Snowflake. Snowpipe File Loading Best Strategies Loading data using Snowpipe incurs charges for the time taken to load each file placed on the stage, plus the overhead of maintaining the queue of files waiting to be loaded. This means that if a large number of files with different sizes are continuously loaded into the stage location, it could lead to slower loading of the data and increased expenses. To ensure effective loading, it is best to keep file sizes within the 100-250 MB range and to avoid a sudden surge of files in the stage location, which would increase the queue. You can optimize the loading process by placing the files in the stage at a frequency close to the time it takes Snowpipe to load a single file. This will help ensure efficient utilization of resources and reduce costs. Make Use of the Internal Stage Wherever Applicable To optimize data loading in Snowflake, it is recommended to use the internal stage instead of external stages such as S3, as it results in faster loading into the Snowflake table. It is also important to compare the cost of storing data in Snowflake's stage location with that of the object storage of cloud providers. To reduce costs, it is advisable to purge the data from the stage when it is no longer needed. Thank you for reading this article. In the upcoming installment of this series, we will explore the best practices for optimizing the use of Snowflake warehouses and tables.
In this tutorial, we will explore the exciting world of MicroStream, a powerful open-source platform that enables ultrafast data processing and storage. Specifically, we will look at how MicroStream can leverage the new Jakarta Data and NoSQL specifications, which offer cutting-edge solutions for handling data in modern applications. With MicroStream, you can use these advanced features and supercharge your data processing capabilities while enjoying a simple and intuitive development experience. So whether you're a seasoned developer looking to expand your skill set or just starting in the field, this tutorial will provide you with a comprehensive guide to using MicroStream to explore the latest in data and NoSQL technology. MicroStream is a high-performance, in-memory, NoSQL database platform for ultrafast data processing and storage. One of the critical benefits of MicroStream is its ability to achieve lightning-fast data access times, thanks to its unique architecture that eliminates the need for disk-based storage and minimizes overhead. With MicroStream, you can easily store and retrieve large amounts of data in real time, making it an ideal choice for applications that require rapid data processing and analysis, such as financial trading systems, gaming platforms, and real-time analytics engines. MicroStream provides a simple and intuitive programming model, making it easy to integrate into your existing applications and workflows. One of the main differences between MicroStream and other databases is its focus on in-memory storage. While traditional databases rely on disk-based storage, which can lead to slower performance due to disk access times, MicroStream keeps all data in memory, allowing for much faster access. Additionally, MicroStream's unique architecture allows it to achieve excellent compression rates, further reducing the memory footprint and making it possible to store even more data in a given amount of memory. Finally, MicroStream is designed for simplicity and ease of use. It provides a developer-friendly interface and minimal dependencies, making it easy to integrate into your existing development workflow. MicroStream Eliminates the Impedance Mismatch Object-relational impedance mismatch refers to the challenge of mapping data between object-oriented programming languages and relational databases. Object-oriented programming languages like Java or Python represent data using objects and classes, whereas relational databases store data in tables, rows, and columns. This fundamental difference in data representation can lead to challenges in designing and implementing database systems that work well with object-oriented languages. One consequence of the object-relational impedance mismatch is that it can be challenging to maintain consistency between the object-oriented model and the relational database. For example, if an object in an object-oriented system has attributes that are related to one another, mapping those relationships to the relational database schema may be challenging. Additionally, object-oriented systems often support inheritance, which can be tough to represent in a relational database schema. While various techniques and patterns can be used to address the object-relational impedance mismatch, such as object-relational mapping (ORM) tools or database design patterns, these solutions often come with their own trade-offs and may introduce additional complexity to the system.
Ultimately, achieving a balance between object-oriented programming and relational database design requires careful consideration of the specific needs and constraints of the application at hand. MicroStream can help reduce the object-relational impedance mismatch by eliminating the need for a conversion layer between object-oriented programming languages and relational databases. Since MicroStream is an in-memory, NoSQL database platform that stores data as objects, it provides a natural fit for object-oriented programming languages, eliminating the need to map between object-oriented data structures and relational database tables. With MicroStream, developers can work directly with objects in their code without worrying about the complexities of mapping data to a relational database schema. This can result in increased productivity and improved performance, as there is no additional conversion layer to introduce overhead and complexity. Moreover, MicroStream's in-memory storage model ensures fast and efficient data access without expensive disk I/O operations. Data can be stored and retrieved quickly and efficiently, allowing for rapid processing and analysis of large amounts of data. Overall, by eliminating the object-relational impedance mismatch and providing a simple, efficient, and performant way to store and access data, MicroStream helps developers focus on building great applications rather than worrying about database architecture and design. MicroStream can also deliver better performance by avoiding the conversion to and from objects. As the next step on your journey, let's create a MicroProfile application. MicroStream Meets the Jakarta Specifications Now that I have introduced MicroStream, let's create our microservices application using Eclipse MicroProfile. The first step is to go to the Eclipse MicroProfile Starter, where you can define the configuration of your initial project. Your application will be a simple library service using Open Liberty, running with Java 17 and MicroStream. With the project downloaded, we must add the dependency that integrates MicroStream with MicroProfile. This dependency will later move into MicroStream itself, so this coordinate is a temporary home for the integration:

XML
<dependency>
    <groupId>expert.os.integration</groupId>
    <artifactId>microstream-jakarta-data</artifactId>
    <version>${microstream.data.version}</version>
</dependency>

The beauty of this integration is that it works with any vendor that supports MicroProfile 5 or higher. Currently, we're using Open Liberty. This integration enables both Jakarta persistence specifications: NoSQL and Data. Jakarta NoSQL and Jakarta Data are two related specifications developed by the Jakarta EE Working Group, aimed at providing standard APIs for working with NoSQL databases and managing data in Java-based applications. With the project defined, let's create a Book entity. The code below shows the annotated class; we currently use the Jakarta NoSQL annotations.

Java
@Entity
public class Book {

    @Id
    private String isbn;

    @Column("title")
    private String title;

    @Column("year")
    private int year;

    @JsonbCreator
    public Book(@JsonbProperty("isbn") String isbn,
                @JsonbProperty("title") String title,
                @JsonbProperty("year") int year) {
        this.isbn = isbn;
        this.title = title;
        this.year = year;
    }
}

The next step is the Jakarta Data part, where you can define a single interface that provides several capabilities against this database.
Java
@Repository
public interface Library extends CrudRepository<Book, String> {
}

The last step is the resource, where we'll make the service available.

Java
@Path("/library")
@ApplicationScoped
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
public class LibraryResource {

    private final Library library;

    @Inject
    public LibraryResource(Library library) {
        this.library = library;
    }

    @GET
    public List<Book> allBooks() {
        return this.library.findAll().collect(Collectors.toUnmodifiableList());
    }

    @GET
    @Path("{id}")
    public Book findById(@PathParam("id") String id) {
        return this.library.findById(id)
                .orElseThrow(() -> new WebApplicationException(Response.Status.NOT_FOUND));
    }

    @PUT
    public Book save(Book book) {
        return this.library.save(book);
    }

    @DELETE
    @Path("{id}")
    public void deleteBy(@PathParam("id") String id) {
        this.library.deleteById(id);
    }
}

Conclusion Jakarta NoSQL and Jakarta Data are critical specifications that provide a standard set of APIs and tools for managing data in Java-based applications. Jakarta NoSQL enables developers to interact with various NoSQL databases using a familiar interface, while Jakarta Data provides APIs for working with data in multiple formats. These specifications help reduce the complexity and cost of application development and maintenance, enabling developers to achieve greater interoperability and portability across different NoSQL databases and data formats. Furthermore, MicroStream provides a high-performance, in-memory NoSQL database platform that eliminates the need for a conversion layer between object-oriented programming languages and relational databases, reducing the object-relational impedance mismatch and increasing productivity and performance. By combining the power of MicroStream with the standard APIs provided by Jakarta NoSQL and Jakarta Data, developers can create robust and scalable applications that easily handle large amounts of data. The combination of MicroStream, Jakarta NoSQL, and Jakarta Data offers robust tools and specifications for managing data in modern Java-based applications. These technologies help streamline the development process and enable developers to focus on building great applications that meet the needs of their users. Find the source code on GitHub.
In the previous article, we discussed the emergence of Data Lakehouses as the next-generation data management solution designed to address the limitations of traditional data warehouses and Data Lakes. Data Lakehouses combine the strengths of both approaches, providing a unified platform for storing, processing, and analyzing diverse data types. This innovative approach offers the flexibility, scalability, and advanced analytics capabilities that are essential for businesses to remain competitive in today's data-driven landscape. In this article, we will delve deeper into the architecture and components of Data Lakehouses, exploring the interconnected technologies that power this groundbreaking solution. The Pillars of Data Lakehouse Architecture A Data Lakehouse is a comprehensive data management solution that combines the best aspects of data warehouses and Data Lakes, offering a unified platform for storing, processing, and analyzing diverse data types. The Data Lakehouse architecture is built upon a system of interconnected components that work together seamlessly to provide a robust and flexible data management solution. In this section, we discuss the fundamental components of the Data Lakehouse architecture and how they come together to create an effective and convenient solution for the end user. At the core of the Data Lakehouse lies unified data storage. This layer is designed to handle various data types and formats, including structured, semi-structured, and unstructured data. The storage layer's flexibility is enabled through storage formats such as Apache Parquet, ORC, and Delta Lake, which are compatible with distributed computing frameworks and cloud-based object storage services. By unifying data storage, Data Lakehouses allow organizations to easily ingest and analyze diverse data sources without extensive data transformation or schema modifications. Another essential aspect of the Data Lakehouse architecture is data integration and transformation. Data Lakehouses excel at handling data ingestion and transformation from various sources by incorporating built-in connectors and support for a wide array of data integration tools, such as Apache NiFi, Kafka, or Flink. These technologies enable organizations to collect, transform, and enrich data from disparate sources, including streaming data, providing real-time insights and decision-making capabilities. By offering seamless data integration, Data Lakehouses help reduce the complexity and cost associated with traditional data integration processes. Metadata management is a critical component of a Data Lakehouse, facilitating data discovery, understanding, and governance. Data cataloging tools like Apache Hive, Apache Atlas, or AWS Glue allow organizations to create a centralized repository of metadata about their data assets. The comprehensive view of data lineage, schema, relationships, and usage patterns provided by metadata management tools enhances data accessibility, ensures data quality, and enables better compliance with data governance policies. Data processing and analytics capabilities are also integral to the Data Lakehouse architecture. Unified query engines like Apache Spark, Presto, or Dremio provide a single interface for querying data using SQL or other query languages, integrating batch and real-time processing for both historical and live data.
Moreover, Data Lakehouses often support advanced analytics and machine learning capabilities, making it easier for organizations to derive valuable insights from their data and build data-driven applications. Finally, data governance and security are crucial in any data-driven organization. Data Lakehouses address these concerns by providing robust data quality management features like data validation, data lineage tracking, and schema enforcement. Additionally, Data Lakehouses support role-based access control, which enables organizations to define granular access permissions to different data assets, ensuring that sensitive information remains secure and compliant with regulatory requirements. Optimizing Storage Formats for Data Lakehouses In a Data Lakehouse architecture, the storage layer is crucial for delivering high performance, efficiency, and scalability while handling diverse data types. This section will focus on the storage formats and technologies used in Data Lakehouses and their significance in optimizing storage for better performance and cost-effectiveness. Columnar storage formats such as Apache Parquet and ORC are key components of Data Lakehouses. By storing data column-wise, these formats offer improved query performance, enhanced data compression, and support for complex data types. This enables Data Lakehouses to handle diverse data types efficiently without requiring extensive data transformation. Several storage solutions have been developed to cater to the unique requirements of Data Lakehouses. Delta Lake, Apache Hudi, and Apache Iceberg are three notable examples. Each of these technologies has its own set of advantages and use cases, making them essential components of modern Data Lakehouse architectures. Delta Lake is a storage layer project explicitly designed for Data Lakehouses. Built on top of Apache Spark, it integrates seamlessly with columnar storage formats like Parquet. Delta Lake provides ACID transaction support, schema enforcement and evolution, and time travel features, which enhance reliability and consistency in data storage. Apache Hudi is another storage solution that brings real-time data processing capabilities to Data Lakehouses. Hudi offers features such as incremental data processing, upsert support, and point-in-time querying, which help organizations manage large-scale datasets and handle real-time data efficiently. Apache Iceberg is a table format for large, slow-moving datasets in Data Lakehouses. Iceberg focuses on providing better performance, atomic commits, and schema evolution capabilities. It achieves this through a novel table layout that uses metadata more effectively, allowing for faster queries and improved data management. The intricacies of Delta Lake, Apache Hudi, and Apache Iceberg, as well as their unique advantages, are fascinating topics on their own. In one of our upcoming articles, we will delve deeper into these technologies, providing a comprehensive understanding of their role in Data Lakehouse architecture. Optimizing storage formats for Data Lakehouses involves leveraging columnar formats and adopting storage solutions like Delta Lake, Apache Hudi, and Apache Iceberg. These technologies work together to create an efficient and high-performance storage layer that can handle diverse data types and accommodate the growing data needs of modern organizations. 
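To show how a columnar format and a transactional table format come together in practice, here is a minimal, hedged sketch that writes and reads a Delta Lake table using Apache Spark's Java API. The bucket paths are hypothetical, and the example assumes the delta-spark package and an appropriate cloud storage connector are available on the classpath.

Java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeltaLakeExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lakehouse-delta-demo")
                // These two settings enable Delta Lake's SQL support on a stock Spark build.
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate();

        // Append raw events to a Parquet-backed Delta table with ACID guarantees
        // and schema enforcement.
        Dataset<Row> events = spark.read().json("s3a://my-bucket/raw/events/"); // hypothetical source
        events.write().format("delta").mode("append").save("s3a://my-bucket/lakehouse/events");

        // Read the current state of the table, or "time travel" to an earlier version.
        Dataset<Row> current = spark.read().format("delta")
                .load("s3a://my-bucket/lakehouse/events");
        Dataset<Row> firstVersion = spark.read().format("delta")
                .option("versionAsOf", "0")
                .load("s3a://my-bucket/lakehouse/events");

        current.show();
        firstVersion.show();
        spark.stop();
    }
}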
Embracing Scalable and Distributed Processing in Data Lakehouses Data Lakehouse architecture is designed to address modern organizations' growing data processing needs. By leveraging distributed processing frameworks and techniques, Data Lakehouses can ensure optimal performance, scalability, and cost-effectiveness. Apache Spark, a powerful open-source distributed computing framework, is a foundational technology in Data Lakehouses. Spark efficiently processes large volumes of data and offers built-in support for advanced analytics and machine learning workloads. By supporting various programming languages, Spark serves as a versatile choice for organizations implementing distributed processing. Distributed processing frameworks like Spark enable the parallel execution of tasks, which is essential for handling massive datasets and complex analytics workloads. Data partitioning strategies divide data into logical partitions, optimizing query performance and reducing the amount of data read during processing. Resource management and scheduling are crucial for distributed processing in Data Lakehouses. Tools like Apache Mesos, Kubernetes, and Hadoop YARN orchestrate and manage resources across a distributed processing environment, ensuring that tasks are executed efficiently and resources are allocated optimally. In-memory processing techniques significantly improve the performance of analytics and machine learning tasks by caching data in memory instead of reading it from disk. This reduces latency and results in faster query execution and better overall performance. Data Lakehouses embrace scalable and distributed processing technologies like Apache Spark, partitioning strategies, resource management tools, and in-memory processing techniques. These components work together to ensure Data Lakehouses can handle the ever-growing data processing demands of modern organizations. Harnessing Advanced Analytics and Machine Learning in Data Lakehouses Data Lakehouse architectures facilitate advanced analytics and machine learning capabilities, enabling organizations to derive deeper insights and drive data-driven decision-making. This section discusses the various components and techniques employed by Data Lakehouses to support these essential capabilities. First, the seamless integration of diverse data types in Data Lakehouses allows analysts and data scientists to perform complex analytics on a wide range of structured and unstructured data. This integration empowers organizations to uncover hidden patterns and trends that would otherwise be difficult to discern using traditional data management systems. Second, the use of distributed processing frameworks such as Apache Spark, which is equipped with built-in libraries for machine learning and graph processing, enables Data Lakehouses to support advanced analytics workloads. By leveraging these powerful tools, Data Lakehouses allow data scientists and analysts to build and deploy machine learning models and perform sophisticated analyses on large datasets. Additionally, Data Lakehouses can be integrated with various specialized analytics tools and platforms. For example, integrating Jupyter Notebooks and other interactive environments provides a convenient way for data scientists and analysts to explore data, develop models, and share their findings with other stakeholders.
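As a concrete, hedged illustration of the partitioning and in-memory processing techniques described earlier in this section, here is a minimal Spark sketch in Java that repartitions a table by a query key, caches the working set, and writes a partitioned aggregate. The paths and column names are hypothetical assumptions.

Java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DistributedProcessingExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lakehouse-processing-demo")
                .getOrCreate();

        // Read columnar data and repartition it by the key that matches the dominant
        // query pattern, so the work for each customer is colocated on one executor.
        Dataset<Row> orders = spark.read()
                .parquet("s3a://my-bucket/lakehouse/orders")
                .repartition(col("customer_id"));

        // Cache the working set in memory to avoid re-reading from object storage
        // across the multiple analytics passes that typically follow.
        orders.cache();

        // A batch aggregation whose output is laid out on disk partitioned by region,
        // which speeds up later partition pruning.
        orders.groupBy(col("region"), col("customer_id"))
                .agg(sum(col("amount")).alias("lifetime_value"))
                .write()
                .partitionBy("region")
                .parquet("s3a://my-bucket/lakehouse/customer_value");

        spark.stop();
    }
}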
To further enhance the capabilities of Data Lakehouses, machine learning platforms like TensorFlow, PyTorch, and H2O.ai can be integrated to support the development and deployment of custom machine learning models. These platforms provide advanced functionality and flexibility, enabling organizations to tailor their analytics and machine learning efforts to their specific needs. Lastly, real-time analytics and stream processing play an important role in Data Lakehouses. Technologies like Apache Kafka and Apache Flink enable organizations to ingest and process real-time data streams, allowing them to respond more quickly to market changes, customer needs, and other emerging trends. Ensuring Robust Data Governance and Security in Data Lakehouses Data Lakehouses prioritize data governance and security, addressing the concerns of organizations regarding data privacy, regulatory compliance, and data quality. This section delves into the various components and techniques that facilitate robust data governance and security in Data Lakehouses. Data cataloging and metadata management tools play a crucial role in establishing effective data governance within a Data Lakehouse. Tools such as Apache Atlas, AWS Glue, and Apache Hive provide centralized repositories for metadata, enabling organizations to track data lineage, discover data assets, and enforce data governance policies. Fine-grained access control is essential for maintaining data privacy and security in Data Lakehouses. Role-based access control (RBAC) and attribute-based access control (ABAC) mechanisms allow organizations to define and enforce user access permissions, ensuring that sensitive data remains secure and available only to authorized users. Data encryption is another key component of Data Lakehouse security. By encrypting data both at rest and in transit, Data Lakehouses ensure that sensitive information remains protected against unauthorized access and potential breaches. Integration with key management systems like AWS Key Management Service (KMS) or Azure Key Vault further enhances security by providing centralized management of encryption keys. Data Lakehouses also incorporate data quality and validation mechanisms to maintain the integrity and reliability of the data. Data validation tools like Great Expectations, data profiling techniques, and automated data quality checks help identify and address data inconsistencies, inaccuracies, and other issues that may impact the overall trustworthiness of the data. Auditing and monitoring are essential for ensuring compliance with data protection regulations and maintaining visibility into Data Lakehouse operations. Data Lakehouses can be integrated with logging and monitoring solutions like Elasticsearch, Logstash, and Kibana (the ELK Stack) or AWS CloudTrail, providing organizations with a comprehensive view of their data management activities and facilitating effective incident response. By prioritizing data privacy, regulatory compliance, and data quality, Data Lakehouses enable organizations to confidently manage their data assets and drive data-driven decision-making in a secure and compliant manner. Embracing the Data Lakehouse Revolution The Data Lakehouse architecture is a game-changing approach to data management, offering organizations the scalability, flexibility, and advanced analytics capabilities necessary to thrive in the era of big data.
By combining the strengths of traditional data warehouses and Data Lakes, Data Lakehouses empower businesses to harness the full potential of their data, driving innovation and informed decision-making. In this article, we have explored the key components and technologies that underpin the Data Lakehouse architecture, from data ingestion and storage to processing, analytics, and data governance. By understanding the various elements of a Data Lakehouse and how they work together, organizations can better appreciate the value that this innovative approach brings to their data management and analytics initiatives. As we continue our series on Data Lakehouses, we will delve deeper into various aspects of this revolutionary data management solution. In upcoming articles, we will cover topics such as the comparison of Delta Lake, Apache Hudi, and Apache Iceberg – three storage solutions that are integral to Data Lakehouse implementations – as well as best practices for Data Lakehouse design, implementation, and operation. Additionally, we will discuss the technologies and tools that underpin Data Lakehouse architecture, examine real-world use cases that showcase the transformative power of Data Lakehouses, and explore the intricacies and potential of this groundbreaking approach. Stay tuned for more insights and discoveries as we navigate the exciting journey of Data Lakehouse architectures together!