Salesforce Analytics Query Language (SAQL) is a Salesforce proprietary query language designed for analyzing Salesforce native objects and CRM Analytics datasets. SAQL enables developers to query, transform, and project data to facilitate business insights by customizing the CRM dashboards. SAQL is very similar to SQL (Structured Query Language); however, it is designed to explore data within Salesforce and has its own unique syntax which is somewhat like Pig Latin (pig-ql). You can also use SAQL to implement complex logic while preparing datasets using dataflows and recipes. Key Features Key features of SAQL include the following: It enables users to specify filter conditions, and group and summarize input data streams to create aggregated values to derive actionable insights and analyze trends. SAQL supports conditional statements such as IF-THEN-ELSE and CASE. This feature can be used to execute complex conditions for data filtering and transformation. SAQL DATE and TIME-related functions make it much easier to work with date and time attributes, allowing users to execute time-based analysis, like comparing the data over various time intervals. Supports a variety of data transformation functions to cleanse, format, and typecast data to alter the structure of data to suit the requirements SAQL enables you to create complex calculated fields using existing data fields by applying mathematical, logical, or string functions. SAQL provides seamless integration with the Salesforce objects and CRM Analytics datasets. SAQL queries can be used to design visuals like charts, graphs, and dashboards within the Salesforce CRM Analytics platform. The rest of this article will focus on explaining the fundamentals of writing the SAQL queries, and delve into a few use cases where you can use SAQL to analyze the Salesforce data. Basics of SAQL Typical SAQL queries work like any other ETL tool: queries load the datasets, perform operations/transformations, and create an output data stream to be used in visualization. SAQL statements can run into multiple lines and are concluded with a semicolon. Every line of the query works on a named stream, which can serve as input for any subsequent statements in the same query. The following SAQL query can be used to create a data stream to analyze the opportunities booked in the previous year by month. SQL 1. q = load "OpportunityLineItems"; 2. q = filter q by 'StageName' == "6 - Closed Won" and date('CloseDate_Year', 'CloseDate_Month', 'CloseDate_Day') in ["1 year ago".."1 year ago"]; 3. q = group q by ('CloseDate_Year', 'CloseDate_Month'); 4. q = foreach q generate q.'CloseDate_Year' as 'CloseDate_Year', q.'CloseDate_Month' as 'CloseDate_Month', sum(q.'ExpectedTotal__c') as 'Bookings'; 5. q = order q by ('CloseDate_Year' asc, 'CloseDate_Month' asc); 6. q = limit q 2000; Line Number Description 1 This statement loads the CRM analytics dataset named “OpportunityLineItems” into an input stream q. 2 The input stream q is filtered to look for the opportunities closed won in the previous year. This is similar to the WHERE clause in SQL. 3 The statement focuses on grouping the records by the close date year and month so that we can visualize this data by the months. This is similar to the GROUP BY clause in SQL. 4 Statement 4 is selecting the attributes we want to project from the input stream. Here the expected total is being summed up for each group. 
5 Statement 5 is ordering the records by the close of the year and month so that we can create a line chart to visualize this by month. 6 The last statement in the code above focuses on restricting the stream to a limited number of rows. This is mainly used for debugging purposes. Joining Multiple Data Streams The SAQL cogroup function joins input data streams like Salesforce objects or CRM analytics datasets. The data sources being joined should have a related column to facilitate the join. cogroup also supports the execution of both INNER and OUTER joins. For example, if you had two datasets, with one containing sales data and another containing customer data, you could use cogroup to join them based on a common field like customer ID. The resultant data stream contains both fields from both tables. Use Case The following code block can be used for a data stream for NewPipeline and Bookings for the customers. The pipeline built and bookings are coming from two different streams. We can join these two streams by Account Name. SQL q = load "Pipeline_Metric"; q = filter q by 'Source' in ["NewPipeline"]; q = group q by 'AccountName'; q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline'; q1 = load "Bookings_Metric"; q1 = filter q1 by 'Source' in ["Bookings"]; q1 = group q1 by 'AccountName'; q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings'; q2 = cogroup q by 'AccountName', q1 by 'AccountName'; result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline',sum(q1.'Bookings') as 'Bookings'; You can also use a left outer cogroup to join the right data table with the left. This will result in all the records from the left data stream and all the matching records from the right stream. Use the coalesce function to replace all the null values from the right stream with another value. In the example above, if you want to report all the accounts with or without bookings, you can use the query below. SQL q = load "Pipeline_Metric"; q = filter q by 'Source' in ["NewPipeline"]; q = group q by 'AccountName'; q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline'; q1 = load "Bookings_Metric"; q1 = filter q1 by 'Source' in ["Bookings"]; q1 = group q1 by 'AccountName'; q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings'; q2 = cogroup q by 'AccountName' left, q1 by 'AccountName'; result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline', coalesce(sum(q1.'Bookings'), 0) as 'Bookings'; Top N Analysis Using Windowing SAQL enables Top N analysis across value groups using the windowing functions within the input data stream. These functionalities are utilized for deriving the moving averages, cumulative totals, and rankings within the groups. You can specify the set of records where you want to execute these calculations using the “over” keyword. SAQL allows you to specify an offset to identify the number of records before and after the selected row. Optionally you can choose to work on all the records within a partition. These records are called windows. Once the set of records is identified for a window, you can apply an aggregation function to all the records within the defined window. Optionally you can create partitions to group the records based on a set of fields and perform aggregate calculations for each partition independently. 
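Before the full use case that follows, here is a minimal sketch of the windowing syntax, reusing the "Pipeline_Metric" dataset and the 'Region', 'AccountName', and 'ExpectedTotal__c' fields from the other examples in this article. Each account's pipeline is projected next to the total for its whole region, computed with an aggregate over a window that spans every row of the 'Region' partition:
SQL
q = load "Pipeline_Metric";
q = group q by ('Region', 'AccountName');
q = foreach q generate q.'Region' as 'Region', q.'AccountName' as 'AccountName', sum('ExpectedTotal__c') as 'AccountPipeline', sum(sum('ExpectedTotal__c')) over ([..] partition by 'Region') as 'RegionPipeline';
The [..] row range means the window covers all rows in the partition, while a bounded range (for example, [-2..0]) limits it to the current row and the rows just before it.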
Use Case The following SAQL code can be used to prepare data for the percentage contribution of new pipelines for each customer to the total pipeline by the region and the ranking of these customers by the region. SQL q = load "Pipeline_Metric"; q = filter q by 'Source' in ["NewPipeline"]; q = group q by ('Region','AccountName'); q = foreach q generate q.'Region' as 'Region',q.'AccountName' as 'AccountName', ((sum('ExpectedTotal__c')/sum(sum('ExpectedTotal__c')) over ([..] partition by 'Region')) * 100) as 'PCT_PipelineContribution', rank() over ([..] partition by ('Region') order by sum('ExpectedTotal__c') desc ) as 'Rank'; q = filter q by 'Rank' <=5; Data Aggregation: Grand Totals and Subtotals With SAQL SAQL offers rollup and grouping functions to aggregate the data streams based on pre-defined groups. While the rollup construct is used with the group by statement, grouping is used as part of foreach statements while projecting the input data stream. The rollup function aggregates the input data stream at various levels of hierarchy allowing you to create calculated fields on summarized datasets at higher levels of granularity. For example, in case you have datasets by the day, rollup can be used to aggregate the results by week, month, or year. The grouping function is used to group data based on specific dimensions or fields in order to segment the data into meaningful subsets for analysis. For example, you might group sales data by product category or region to analyze performance within each group. Use Case Use the code below to prepare data for the total number of accounts and accounts engaged by the region and theater. Also, add the grand total to look at the global numbers and subtotals for both regions and theaters. SQL q = load "ABXLeadandOpportunities_Metric"; q = filter q by 'Source' == "ABX Opportunities" and 'CampaignType' == "Growth Sprints" and 'Territory_Level_01__c' is not null; q = foreach q generate 'Territory_Level_01__c' as 'Territory_Level_01__c','Territory_Level_02__c' as 'Territory_Level_02__c','Territory_Level_03__c' as 'Territory_Level_03__c', q.'AccountName' as 'AccountName',q.'OId' as 'OId','MarketingActionedOppty' as 'MarketingActionedOppty','AccountActionedAcct' as 'AccountActionedAcct','ADRActionedOppty' as 'ADRActionedOppty','AccountActionedADRAcct' as 'AccountActionedADRAcct'; q = group q by rollup ('Territory_Level_01__c', 'Territory_Level_02__c'); q = foreach q generate case when grouping('Territory_Level_01__c') == 1 then "TOTAL" else 'Territory_Level_01__c' end as 'Level1', case when grouping('Territory_Level_02__c') == 1 then "LEVEL1 TOTAL" else 'Territory_Level_02__c' end as 'Level2', unique('AccountName') as 'Total Accounts',unique('AccountActionedAcct') as 'Engaged',((unique('AccountActionedAcct') / unique('AccountName'))) as '% of Engaged'; q = limit q 2000; Filling the Missing Date Fields You can use the fill() function to create a record for missing date, week, month, quarter, and year records in your dataset. This comes very handy when you want to show the result as 0 for these missing days/weeks/months instead of not showing them at all. Use Case The following SAQL code allows you to track the number of tasks for the sales agents by the days of the week. In case the agents are on PTO you want to show 0 tasks. 
SQL q = load "Tasks_Metric"; q = filter q by 'Source' == "Tasks"; q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])]; q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day'); q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks'; q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc); q = limit q 2000; The code above will be missing two days where there were no tasks created. You can use the code below to fill in the missing days. SQL q = load "Tasks_Metric"; q = filter q by 'Source' == "Tasks"; q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])]; q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day'); q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks'; q = fill q by (dateCols=(MetricDate_Year, MetricDate_Month, MetricDate_Day, "Y-M-D")); q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc); q = limit q 2000; You can also specify the start date and end date to populate the missing records between these dates. Conclusion In the end, SAQL has proven itself as a powerful tool for the Salesforce developer community, empowering them to extract actionable business insights from the CRM datasets using capabilities like filtering, aggregation, windowing, time-analysis, blending, custom calculation, Salesforce integration, and performance optimization. In this article, we have explored various capabilities of this technology and focused on targeted use cases. As a next step, I would recommend continuing your learnings by exploring Salesforce documentation, building your data models using dataflow, and using SAQL capabilities to harness the true potential of Salesforce as a CRM.
Deployment strategies provide a systematic approach to releasing software changes, minimizing risks, and maintaining consistency across projects and teams. Without a well-defined strategy and systematic approach, deployments can lead to downtime, data loss, or system failures, resulting in frustrated users and revenue loss. Before we start exploring different deployment strategies in more detail, let’s take a look at the short overview of each deployment strategy mentioned in this article: All-at-once deployment: This strategy involves updating all the target environments at once, making it the fastest but riskiest approach. In-place deployment: This involves stopping the current application and replacing it with a new version, directly affecting availability. Blue/Green deployment: A zero-downtime approach that involves running two identical environments and switching from old to new Canary deployment: Introduces new changes incrementally to a small subset of users before a full rollout Shadow deployment: Mirrors real traffic to a shadow environment where the new deployment is tested without affecting the live environment All-At-Once Deployment All-at-once deployment strategy, also known as the "Big Bang" deployment strategy, involves simultaneously releasing your application's new version to all servers or environments. This method is straightforward and can be implemented quickly, as it does not require complex orchestration or additional infrastructure. The primary benefit of this approach is its simplicity and the ability to immediately transition all users to the new version of the application. However, the all-at-once method carries significant risks. Since all instances are updated together, any issues with the new release immediately impact all users. There is no opportunity to mitigate risks by gradually rolling out the change or testing it with a subset of the user base first. Additionally, if something goes wrong, the rollback process can be just as disruptive as the initial deployment. Despite these risks, all-at-once deployment can be suitable for small applications or environments where downtime is more acceptable and the impact of potential issues is minimal and is used pretty often. It is also useful in scenarios where applications are inherently simple or have been thoroughly tested to ensure compatibility and stability before release. In-Place (Recreate) Deployment In-place or recreate deployment strategy is another strategy that is used pretty often when developing projects. It is the simplest and does not require additional infrastructure. Its essence lies in the fact that when we deploy a new version, we stop the application and start it with new changes. The disadvantage of this approach is that the service we are updating will experience downtime that will affect its users. Also, in case of problems with new software changes, we might need to roll back the latest changes, which will lead to service downtime. To avoid downtime and be able to roll back changes without it during the deployment process, there are deployment strategies that are created for this purpose and used in the industry. Blue/Green Deployment The first zero downtime deployment strategy we’re going to talk about is the Blue/Green deployment strategy. Its main goal is to minimize downtime and risks while deploying new software versions. This is done by having 2 identical environments of our service. 
One environment contains the original application (the Blue environment) that serves users' requests, and the other environment (the Green environment) is where new software changes are deployed. This allows us to verify and test new changes with near-zero downtime for users and the service, with the ability to safely roll back in case of any problems, except for some cases that we will discuss a bit later. Typically, the process is the following: after verifying and testing the new changes in the Green environment, we reroute traffic from the Blue environment to our identical Green environment with the new changes. Sounds easy, doesn't it? ... it depends. The problem is that we can easily reroute traffic between environments only when our services are stateless. If they interact with any data sources, things get more complicated, and here's why: our identical Green and Blue environments share common data source(s). While sharing data sources such as NoSQL databases or object stores (AWS S3, for example) between our identical environments is easier to accomplish, this is not the case for relational databases, because supporting Blue/Green deployments requires additional effort (NoSQL also might require some effort). Since approaches to handling schema updates without downtime are out of the scope of this article, you can check out the article "Upgrading database schema without downtime" to learn more (if you have any interesting resources on updating schemas without downtime, please share them with us in the comments). A general recommendation: if your services are not stateless and use data sources with schemas, implementing a Blue/Green deployment strategy is not always recommended because of the additional risk and failure points it can introduce, minimizing the benefits of the Blue/Green deployment strategy. But if you've decided that you need to integrate a Blue/Green deployment strategy and your infrastructure is running on Amazon Web Services, you might find this document by AWS on how to implement Blue/Green deployments and their infrastructure useful. Canary Deployment The idea of the Canary deployment strategy is to reduce the risks of deploying new software versions in production by rolling out new changes to users slowly. In the same manner as in the Blue/Green deployment strategy, we roll out the new software version to an identical environment; but instead of completely rerouting traffic from one environment to another, we route, for example, a portion of users to the environment with the new software version using a load balancer. The size of the portion of users getting the new software version, and the criteria used to determine them, may be specific to every company or project. Some roll out new changes only to their internal staff first, some pick users randomly, and some may use algorithms to match users based on certain criteria. Pick whatever best suits your needs. Shadow Deployment The Shadow deployment strategy is the next strategy I personally find interesting. This strategy also uses the concept of identical environments, just as the Blue/Green and Canary deployment strategies do. The main difference is that instead of completely rerouting traffic, or rerouting only a portion of real users, we duplicate the entire traffic to our second environment where the new changes are deployed.
This way, we can test and verify our changes without negatively affecting our users, thus mitigating risks of broken software updates or performance bottlenecks. Conclusion In this article, we walked through five different deployment strategies, each with its own set of advantages and challenges. The all-at-once and in-place deployment strategies stand out for their speed and minimal effort required to deploy new versions of software. While these two strategies will be your go-to deployment strategies in most cases, it’s still useful to understand and know about more complex and resource-intensive strategies. Ultimately, implementing any deployment strategy requires careful consideration of the potential impact on both the system and its users. The choice of deployment strategy should align with your project’s needs, risk tolerance, and operational capabilities.
The Java ORM world is very steady: few libraries exist, and none of them has introduced any breaking change over the last decade. Meanwhile, application architectures have evolved with trends such as Hexagonal Architecture, CQRS, Domain-Driven Design, and Domain Purity. Stalactite tries to be more suitable for these new paradigms by allowing you to persist any kind of class without the need to annotate it or use external XML files: its mapping is made of method references. As a benefit, you get a better view of the entity graph, since the mapping is made through a fluent API that chains your entity relations instead of spreading annotations all over your entities. This is very helpful for seeing the complexity of your entity graph, which impacts its load time as well as memory. Moreover, since Stalactite only fetches data eagerly, we can say that what you see is what you get. Here is a very small example: Java MappingEase.entityBuilder(Country.class, Long.class) .mapKey(Country::getId, IdentifierPolicy.afterInsert()) .mapOneToOne(Country::getCapital, MappingEase.entityBuilder(City.class, Long.class) .mapKey(City::getId, IdentifierPolicy.afterInsert()) .map(City::getName)) First Steps Release 2.0.0 has been out for some weeks and is available as a Maven dependency; below is an example with HSQLDB. For now, Stalactite is compatible with the following databases (mainly in their latest versions): HSQLDB, H2, PostgreSQL, MySQL, and MariaDB. XML <dependency> <groupId>org.codefilarete.stalactite</groupId> <artifactId>orm-hsqldb-adapter</artifactId> <version>2.0.0</version> </dependency> If you're interested in a less database-vendor-dedicated module, you can use the orm-all-adapter module. Just be aware that it will bring in extra modules and extra JDBC drivers, making your artifact heavier. After getting Stalactite as a dependency, the next step is to have a JDBC DataSource and pass it to an org.codefilarete.stalactite.engine.PersistenceContext: Java org.hsqldb.jdbc.JDBCDataSource dataSource = new org.hsqldb.jdbc.JDBCDataSource(); dataSource.setUrl("jdbc:hsqldb:mem:test"); dataSource.setUser("sa"); dataSource.setPassword(""); PersistenceContext persistenceContext = new PersistenceContext(dataSource, new HSQLDBDialect()); Then comes the interesting part: the mapping. Supposing you have a Country, you can quickly set up its mapping through the fluent API, starting with the org.codefilarete.stalactite.mapping.MappingEase class, as such: Java EntityPersister<Country, Long> countryPersister = MappingEase.entityBuilder(Country.class, Long.class) .mapKey(Country::getId, IdentifierPolicy.afterInsert()) .map(Country::getName) .build(persistenceContext); The afterInsert() identifier policy means that the country.id column is an auto-incremented one. Two other policies exist: beforeInsert() for identifiers given by a database sequence (for example), and alreadyAssigned() for entities that have a natural identifier given by business rules. Any non-declared property is considered transient and is not managed by Stalactite. The schema can be generated with the org.codefilarete.stalactite.sql.ddl.DDLDeployer class as such (it will generate it into the PersistenceContext dataSource): Java DDLDeployer ddlDeployer = new DDLDeployer(persistenceContext); ddlDeployer.deployDDL(); Finally, you can persist your entities thanks to the EntityPersister obtained previously; please find the example below. You might notice that you won't find JPA methods in the Stalactite persister.
The reason is that Stalactite is far different from JPA and doesn't aim at being compatible with it: no annotations, no attach/detach mechanism, no first-level cache, no lazy loading, and so on. Hence, the methods go quite straight to their goal: Java Country myCountry = new Country(); myCountry.setName("myCountry"); countryPersister.insert(myCountry); myCountry.setName("myCountry with a different name"); countryPersister.update(myCountry); Country loadedCountry = countryPersister.select(myCountry.getId()); countryPersister.delete(loadedCountry); Spring Integration That was a raw usage of Stalactite; you may also be interested in its integration with Spring to benefit from the magic of its @Repository. Stalactite provides this integration; just be aware that it's still a work-in-progress feature. The approach to activate it is the same as for JPA: enable Stalactite repositories with the @EnableStalactiteRepositories annotation on your Spring application. Then you'll declare the PersistenceContext and EntityPersister as @Bean: Java @Bean public PersistenceContext persistenceContext(DataSource dataSource) { return new PersistenceContext(dataSource); } @Bean public EntityPersister<Country, Long> countryPersister(PersistenceContext persistenceContext) { return MappingEase.entityBuilder(Country.class, long.class) .mapKey(Country::getId, IdentifierPolicy.afterInsert()) .map(Country::getName) .build(persistenceContext); } Then you can declare your repository as such, to be injected into your services: Java @Repository public interface CountryStalactiteRepository extends StalactiteRepository<Country, Long> { } As mentioned earlier, since the paradigm of Stalactite is not the same as JPA (no annotations, no attach/detach mechanism, etc.), you won't find the same methods as in a JPA repository in the Stalactite ones: save: Saves the given entity, either inserting it or updating it according to its persistence state. saveAll: Same as the previous one, but operating on a collection of entities. findById: Tries to find an entity by its id in the database. findAllById: Same as the previous one, but operating on a collection of ids. delete: Deletes the given entity from the database. deleteAll: Same as the previous one, but operating on a collection of entities. Conclusion In this article, we introduced the Stalactite ORM; more information about the configuration, the mapping, and all the documentation is available on the website. The project is open source under the MIT license and shared through GitHub. Thanks for reading; any feedback is appreciated!
Through my years of building services, the RESTful API has been my primary go-to. However, even though REST has its merits, that doesn’t mean it’s the best approach for every use case. Over the years, I’ve learned that, occasionally, there might be better alternatives for certain scenarios. Sticking with REST just because I’m passionate about it — when it’s not the right fit — only results in tech debt and a strained relationship with the product owner. One of the biggest pain points with the RESTful approach is the need to make multiple requests to retrieve all the necessary information for a business decision. As an example, let’s assume I want a 360-view of a customer. I would need to make the following requests: GET /customers/{some_token} provides the base customer information GET /addresses/{some_token} supplies a required address GET /contacts/{some_token} returns the contact information GET /credit/{some_token} returns key financial information While I understand the underlying goal of REST is to keep responses laser-focused for each resource, this scenario makes for more work on the consumer side. Just to populate a user interface that helps an organization make decisions related to future business with the customer, the consumer must make multiple calls. In this article, I’ll show why GraphQL is the preferred approach over a RESTful API here, demonstrating how to deploy Apollo Server (and Apollo Explorer) to get up and running quickly with GraphQL. I plan to build my solution with Node.js and deploy it to Heroku. When To Use GraphQL Over REST? There are several common use cases when GraphQL is a better approach than REST: When you need flexibility in how you retrieve data: You can fetch complex data from various resources, all in a single request. (I will dive down this path in this article.) When the frontend team needs to evolve the UI frequently: Rapidly changing data requirements won’t require the backend to adjust endpoints and cause blockers. When you want to minimize over-fetching and under-fetching: Sometimes REST requires you to hit multiple endpoints to gather all the data you need (under-fetching), or hitting a single endpoint returns way more data than you actually need (over-fetching). When you’re working with complex systems and microservices: Sometimes multiple sources just need to hit a single API layer for their data. GraphQL can provide that flexibility through a single API call. When you need real-time data pushed to you: GraphQL features subscriptions, which provide real-time updates. This is useful in the case of chat apps or live data feeds. (I will cover this benefit in more detail in a follow-up article.) What Is Apollo Server? Since my skills with GraphQL aren’t polished, I decided to go with Apollo Server for this article. Apollo Server is a GraphQL server that works with any GraphQL schema. The goal is to simplify the process of building a GraphQL API. The underlying design integrates well with frameworks such as Express or Koa. I will explore the ability to leverage subscriptions (via the graphql-ws library) for real-time data in my next article. Where Apollo Server really shines is Apollo Explorer, a built-in web interface that developers can use to explore and test their GraphQL APIs. Explorer will be a perfect fit for me, as it allows for the easy construction of queries and the ability to view the API schema in a graphical format.
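The article uses the standalone server below; if you would rather mount Apollo Server inside an existing Express app (one of the frameworks mentioned above), the Apollo Server 4 integration looks roughly like this sketch. The route, port, and trivial schema are placeholders, not part of the article's project:
TypeScript
import { ApolloServer } from '@apollo/server';
import { expressMiddleware } from '@apollo/server/express4';
import express from 'express';
import cors from 'cors';

// Placeholder schema and resolvers; the Customer 360 versions appear later in this article
const typeDefs = `#graphql
  type Query { hello: String }
`;
const resolvers = { Query: { hello: () => 'world' } };

const server = new ApolloServer({ typeDefs, resolvers });
await server.start(); // required before attaching the middleware

const app = express();
// GraphQL lives on /graphql; REST routes can coexist on the same Express app
app.use('/graphql', cors(), express.json(), expressMiddleware(server));
app.listen(4000, () => console.log('GraphQL ready at http://localhost:4000/graphql'));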
My Customer 360 Use Case For this example, let’s assume we need the following schema to provide a 360-view of the customer: TypeScript type Customer { token: String name: String sic_code: String } type Address { token: String customer_token: String address_line1: String address_line2: String city: String state: String postal_code: String } type Contact { token: String customer_token: String first_name: String last_name: String email: String phone: String } type Credit { token: String customer_token: String credit_limit: Float balance: Float credit_score: Int } I plan to focus on the following GraphQL queries: TypeScript type Query { addresses: [Address] address(customer_token: String): Address contacts: [Contact] contact(customer_token: String): Contact customers: [Customer] customer(token: String): Customer credits: [Credit] credit(customer_token: String): Credit } Consumers will provide the token for the Customer they wish to view. We expect to also retrieve the appropriate Address, Contact, and Credit objects. The goal is to avoid making four different API calls for all this information rather than with a single API call. Getting Started With Apollo Server I started by creating a new folder called graphql-server-customer on my local workstation. Then, using the Get Started section of the Apollo Server documentation, I followed steps one and two using a Typescript approach. Next, I defined my schema and also included some static data for testing. Ordinarily, we would connect to a database, but static data will work fine for this demo. Below is my updated index.ts file: TypeScript import { ApolloServer } from '@apollo/server'; import { startStandaloneServer } from '@apollo/server/standalone'; const typeDefs = `#graphql type Customer { token: String name: String sic_code: String } type Address { token: String customer_token: String address_line1: String address_line2: String city: String state: String postal_code: String } type Contact { token: String customer_token: String first_name: String last_name: String email: String phone: String } type Credit { token: String customer_token: String credit_limit: Float balance: Float credit_score: Int } type Query { addresses: [Address] address(customer_token: String): Address contacts: [Contact] contact(customer_token: String): Contact customers: [Customer] customer(token: String): Customer credits: [Credit] credit(customer_token: String): Credit } `; const resolvers = { Query: { addresses: () => addresses, address: (parent, args, context) => { const customer_token = args.customer_token; return addresses.find(address => address.customer_token === customer_token); }, contacts: () => contacts, contact: (parent, args, context) => { const customer_token = args.customer_token; return contacts.find(contact => contact.customer_token === customer_token); }, customers: () => customers, customer: (parent, args, context) => { const token = args.token; return customers.find(customer => customer.token === token); }, credits: () => credits, credit: (parent, args, context) => { const customer_token = args.customer_token; return credits.find(credit => credit.customer_token === customer_token); } }, }; const server = new ApolloServer({ typeDefs, resolvers, }); const { url } = await startStandaloneServer(server, { listen: { port: 4000 }, }); console.log(`Apollo Server ready at: ${url}`); const customers = [ { token: 'customer-token-1', name: 'Acme Inc.', sic_code: '1234' }, { token: 'customer-token-2', name: 'Widget Co.', sic_code: '5678' } ]; const addresses = [ { 
token: 'address-token-1', customer_token: 'customer-token-1', address_line1: '123 Main St.', address_line2: '', city: 'Anytown', state: 'CA', postal_code: '12345' }, { token: 'address-token-22', customer_token: 'customer-token-2', address_line1: '456 Elm St.', address_line2: '', city: 'Othertown', state: 'NY', postal_code: '67890' } ]; const contacts = [ { token: 'contact-token-1', customer_token: 'customer-token-1', first_name: 'John', last_name: 'Doe', email: 'jdoe@example.com', phone: '123-456-7890' } ]; const credits = [ { token: 'credit-token-1', customer_token: 'customer-token-1', credit_limit: 10000.00, balance: 2500.00, credit_score: 750 } ]; With everything configured as expected, we run the following command to start the server: Shell $ npm start With the Apollo server running on port 4000, I used the http://localhost:4000/ URL to access Apollo Explorer. Then I set up the following example query: TypeScript query ExampleQuery { addresses { token } contacts { token } customers { token } } This is how it looks in Apollo Explorer: Pushing the Example Query button, I validated that the response payload aligned with the static data I provided in the index.ts: JSON { "data": { "addresses": [ { "token": "address-token-1" }, { "token": "address-token-22" } ], "contacts": [ { "token": "contact-token-1" } ], "customers": [ { "token": "customer-token-1" }, { "token": "customer-token-2" } ] } } Before going any further in addressing my Customer 360 use case, I wanted to run this service in the cloud. Deploying Apollo Server to Heroku Since this article is all about doing something new, I wanted to see how hard it would be to deploy my Apollo server to Heroku. I knew I had to address the port number differences between running locally and running somewhere in the cloud. I updated my code for starting the server as shown below: TypeScript const { url } = await startStandaloneServer(server, { listen: { port: Number.parseInt(process.env.PORT) || 4000 }, }); With this update, we’ll use port 4000 unless there is a PORT value specified in an environment variable. Using Gitlab, I created a new project for these files and logged into my Heroku account using the Heroku command-line interface (CLI): Shell $ heroku login You can create a new app in Heroku with either their CLI or the Heroku dashboard web UI. For this article, we’ll use the CLI: Shell $ heroku create jvc-graphql-server-customer The CLI command returned the following response: Shell Creating jvc-graphql-server-customer... done https://jvc-graphql-server-customer-b62b17a2c949.herokuapp.com/ | https://git.heroku.com/jvc-graphql-server-customer.git The command also added the repository used by Heroku as a remote automatically: Shell $ git remote heroku origin By default, Apollo Server disables Apollo Explorer in production environments. For my demo, I want to leave it running on Heroku. To do this, I need to set the NODE_ENV environment variable to development. I can set that with the following CLI command: Shell $ heroku config:set NODE_ENV=development The CLI command returned the following response: Shell Setting NODE_ENV and restarting jvc-graphql-server-customer... done, v3 NODE_ENV: development Now we’re in a position to deploy our code to Heroku: Shell $ git commit --allow-empty -m 'Deploy to Heroku' $ git push heroku A quick view of the Heroku Dashboard shows my Apollo Server running without any issues: If you’re new to Heroku, this guide will show you how to create a new account and install the Heroku CLI. 
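One note on what Heroku actually runs: with no Procfile, the Node.js buildpack launches the app with the package.json start script, so whatever npm start does locally is what runs in the dyno. My scripts, following the Apollo getting-started steps, looked roughly like the sketch below; treat the compile step as an assumption that depends on how your TypeScript build is configured:
JSON
{
  "type": "module",
  "scripts": {
    "compile": "tsc",
    "start": "npm run compile && node ./dist/index.js"
  }
}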
Acceptance Criteria Met: My Customer 360 Example With GraphQL, I can meet the acceptance criteria for my Customer 360 use case with the following query: TypeScript query CustomerData($token: String) { customer(token: $token) { name sic_code token }, address(customer_token: $token) { token customer_token address_line1 address_line2 city state postal_code }, contact(customer_token: $token) { token, customer_token, first_name, last_name, email, phone }, credit(customer_token: $token) { token, customer_token, credit_limit, balance, credit_score } } All I need to do is pass in a single Customer token variable with a value of customer-token-1: JSON { "token": "customer-token-1" } We can retrieve all of the data using a single GraphQL API call: JSON { "data": { "customer": { "name": "Acme Inc.", "sic_code": "1234", "token": "customer-token-1" }, "address": { "token": "address-token-1", "customer_token": "customer-token-1", "address_line1": "123 Main St.", "address_line2": "", "city": "Anytown", "state": "CA", "postal_code": "12345" }, "contact": { "token": "contact-token-1", "customer_token": "customer-token-1", "first_name": "John", "last_name": "Doe", "email": "jdoe@example.com", "phone": "123-456-7890" }, "credit": { "token": "credit-token-1", "customer_token": "customer-token-1", "credit_limit": 10000, "balance": 2500, "credit_score": 750 } } } Below is a screenshot from Apollo Explorer running from my Heroku app: Conclusion I recall earlier in my career when Java and C# were competing against each other for developer adoption. Advocates on each side of the debate were ready to prove that their chosen tech was the best choice … even when it wasn’t. In this example, we could have met my Customer 360 use case in multiple ways. Using a proven RESTful API would have worked, but it would have required multiple API calls to retrieve all of the necessary data. Using Apollo Server and GraphQL allowed me to meet my goals with a single API call. I also love how easy it is to deploy my GraphQL server to Heroku with just a few commands in my terminal. This allows me to focus on implementation—offloading the burdens of infrastructure and running my code to a trusted third-party provider. Most importantly, this falls right in line with my personal mission statement: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” – J. Vester If you are interested in the source code for this article, it is available on GitLab. But wait… there’s more! In my follow-up post, we will build out our GraphQL server further, to implement authentication and real-time data retrieval with subscriptions. Have a really great day!
Snowflake is a leading cloud-based data storage and analytics service that provides various solutions for data warehouses, data engineering, AI/ML modeling, and other related services. It has many features and functionalities; one powerful data recovery feature is Time Travel, which allows users to access historical data from the past. It is beneficial when a user comes across any of the below scenarios: Retrieving the previous row or column value before the current DML operation Recovering the last state of data for backup or redundancy Updating or deleting records from the table by mistake Restoring the previous state of the table, schema, or database Snowflake's Continuous Data Protection Life Cycle allows Time Travel within a window of 1 to 90 days; retention of up to 90 days requires the Enterprise edition. Time Travel SQL Extensions Time Travel can be achieved using the OFFSET, TIMESTAMP, and STATEMENT keywords together with the AT or BEFORE clause. Offset If a user wants to retrieve past data or recover a table from an older state using a time parameter, the user can use the query below, where the offset is defined in seconds. SQL SELECT * FROM any_table AT(OFFSET => -60*5); -- For 5 Minutes CREATE TABLE recovered_table CLONE any_table AT(OFFSET => -3600); -- For 1 Hour Timestamp Suppose a user wants to query data from the past or recover a schema as of a specific timestamp. Then, the user can utilize the below query. SQL SELECT * FROM any_table AT(TIMESTAMP => 'Sun, 05 May 2024 16:20:00 -0700'::timestamp_tz); CREATE SCHEMA recovered_schema CLONE any_schema AT(TIMESTAMP => 'Wed, 01 May 2024 01:01:00 +0300'::timestamp_tz); Statement Users can also use any unique query ID to get the data as it was just before that statement ran. SQL SELECT * FROM any_table BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234'); CREATE DATABASE recovered_db CLONE any_db BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234'); The commands below set the data retention time at table creation and then increase or decrease it. SQL CREATE TABLE any_table(id NUMERIC, name VARCHAR, created_date DATE) DATA_RETENTION_TIME_IN_DAYS=90; ALTER TABLE any_table SET DATA_RETENTION_TIME_IN_DAYS=30; If data retention is not required, then we can also use SET DATA_RETENTION_TIME_IN_DAYS=0;. Objects that do not have an explicitly defined retention period inherit the retention from the upper object level. For instance, tables that do not have a specified retention period will inherit the retention period from the schema, and a schema that does not have a retention period defined will inherit it from the database level. The account level is the highest level of the hierarchy and should be set up with 0 days for data retention. Now consider a case where a table, schema, or database is accidentally dropped, causing all the data to be lost. When any data object gets dropped, it's kept in Snowflake's back end until the data retention period expires. For such cases, Snowflake has another great feature that will bring those objects back with the SQL below. SQL UNDROP TABLE any_table; UNDROP SCHEMA any_schema; UNDROP DATABASE any_database; If a user creates a table with the same name as the dropped table, Snowflake creates a new table rather than restoring the old one. When the user runs the above UNDROP command, Snowflake restores the old object. Also, the user needs the appropriate permission or ownership to restore the object.
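To make the recovery path concrete, here is a short sketch of a typical workflow; the schema name and query ID are placeholders. It lists dropped tables that are still inside their retention window, undrops one, and recovers from a bad UPDATE by cloning the pre-statement state and swapping it in:
SQL
SHOW TABLES HISTORY IN SCHEMA my_db.my_schema;       -- dropped tables still within retention appear here
UNDROP TABLE any_table;                              -- restore a dropped table
CREATE TABLE any_table_restored CLONE any_table BEFORE(STATEMENT => '<query_id_of_bad_update>');
ALTER TABLE any_table_restored SWAP WITH any_table;  -- swap the restored copy into place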
After the Time Travel window, if the object isn't retrieved within the data retention period, it is transferred to Snowflake Fail-safe, where users can't query it. The only way to recover the data then is with Snowflake's help, and Fail-safe stores it for a maximum of 7 days. Challenges Time Travel, though useful, has a few challenges, as shown below. Time Travel has a default one-day retention for transient and temporary tables in Snowflake. Objects other than tables, schemas, and databases, such as views, UDFs, and stored procedures, are not supported. If a table is recreated with the same name, referring to the older version under that name requires renaming the current table because, by default, Time Travel will refer to the latest version. Conclusion The Time Travel feature is quick, easy, and powerful. It's always handy and gives users more comfort while operating on production-sensitive data. The great thing is that users can run these queries themselves without having to involve admins. With a maximum retention of 90 days, users have more than enough time to query back in time or fix any incorrectly updated data. In my opinion, it is Snowflake's strongest feature. Reference Understanding & Using Time Travel
In this tutorial, I'll explore how to set up and utilize docTR, the open-source OCR (Optical Character Recognition) solution of the document parsing API startup Mindee. I’ll go through what you need to install docTR on Ubuntu. It accepts PDFs, images, and even a website URL as an input. In this example, I will parse a grocery store receipt. Let’s get started. Setting Up docTR on Ubuntu docTR is compatible with any Linux distribution, macOS, and Windows. It is also available as a Docker image. I will use Ubuntu 22.04 LTS (Jammy Jellyfish) for this tutorial. Hardware-wise, you don’t need anything specific, but if you want to do extensive testing, I recommend using a GPU instance; OVHcloud offers affordable options, with servers starting at less than a dollar per hour. Let’s start by installing Python. At the time of writing, docTR requires Python 3.8 (or higher). Shell sudo apt install -y python3 To avoid messing with system libraries, let’s use a virtual environment. Shell sudo apt install -y python3.10-venv python3 -m venv testing-Mindee-docTR Then we install the OpenGL Mesa 3D Graphics Library, used for the computer vision part of docTR. Shell sudo apt install -y libgl1-mesa-glx We install pango, which is a text layout engine library. Shell sudo apt-get install -y libpango-1.0-0 libpangoft2-1.0-0 Then, we install pip so that we can install docTR. Shell sudo apt install -y python3-pip Finally, we install docTR within our virtual environment. This version is specifically for PyTorch. If you choose to use TensorFlow, change the command accordingly. Shell testing-Mindee-docTR/bin/pip3 install "python-doctr[torch]" Using docTR Now that docTR is installed, let’s start playing with it. In this example, I will test it with a grocery store receipt. You can download the receipt using the command below. Shell wget "https://media.istockphoto.com/id/889405434/vector/realistic-paper-shop-receipt-vector-cashier-bill-on-white-background.jpg?s=612x612&w=0&k=20&c=M2GxEKh9YJX2W3q76ugKW23JRVrm0aZ5ZwCZwUMBgAg=" -O receipt.jpeg Create a testing-docTR.py file and insert the following code into it. Python from doctr.io import DocumentFile from doctr.models import ocr_predictor # Load the grocery receipt doc = DocumentFile.from_images("receipt.jpeg") # Load the OCR model model = ocr_predictor(pretrained=True) # Perform OCR result = model(doc) # Display the OCR result print(result.export()) Note that docTR uses a two-stage approach: First, it performs text detection to localize words. Then, it conducts text recognition to identify all characters in the word. The ocr_predictor function accepts additional parameters to select the text detection and recognition architecture. For simplicity, I used the default ones in this example. You can find information about other models on the docTR documentation. 
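For reference, here is a hedged sketch of what selecting non-default architectures might look like; db_resnet50 and crnn_vgg16_bn are two of the pretrained models listed in the docTR documentation, and result.render() returns a plain-text rendering if you don't need the full JSON export:
Python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_images("receipt.jpeg")

# Explicitly pick the text detection and text recognition architectures
model = ocr_predictor(det_arch="db_resnet50", reco_arch="crnn_vgg16_bn", pretrained=True)

result = model(doc)
print(result.render())  # plain-text output instead of the structured export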
Reading a Receipt Using docTR Now you just need to run your Python script: Shell testing-Mindee-docTR/bin/python3 testing-docTR.py You will get an output such as the one below: JSON {"pages": [{"page_idx": 0, "dimensions": [612, 612], "orientation": {"value": null, "confidence": null}, "language": {"value": null, "confidence": null}, "blocks": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "lines": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "words": [{"value": "RECEIPT", "confidence": 0.9695481061935425, "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]]}]}], "artefacts": []}]}]} Note that I drastically shortened the JSON output for readability and only kept the part showing the “RECEIPT” word. Here is the JSON structure you’d be looking at without truncating the result. I have expanded the part of the tree that I kept in the JSON output. docTR will provide a bunch of information about the document but the important part is about how it breaks down the document into lines, and for each line, provides an array containing the words it detected along with the degree of confidence. Here, we can see it spotted the word RECEIPT with a confidence of 96%. docTR offers an efficient OCR solution that simplifies text recognition processes. Depending on the document type, you may need to change the text detection and text recognition architectures to improve accuracy. Comprehensive docTR documentation is available here. Considerations When Using docTR Deploying docTR entails certain complexities. First, you must create a dataset and train docTR to achieve satisfactory accuracy. This means dealing with data annotation on many images. Since OCR systems typically serve as backend services for other apps, it may be necessary to integrate docTR via an API and scale it according to the app’s needs. docTR does not provide this out of the box, but there are many open-source technologies that can help facilitate this step. Conclusion Document processing technologies have come a long way since the advent of OCR tools, which are limited to character recognition. Intelligent Document Processing (IDP) platforms represent the next step; they utilize OCR (such as docTR) along with additional layers of intelligence like table reconstruction, document classification, and natural language understanding, to achieve better accuracy and precision. Additionally, for those seeking a scalable IDP solution without the complexities of data collection and model training, I recommend trying out Mindee’s latest solution, docTI. This training-free IDP solution leverages Large Language Models (LLMs) to eliminate the need for data collection, annotation, and the model training process. You can use the free-tier plan, configure an instance, and start querying the API in minutes.
Unit testing is a software testing methodology where individual units or components of software are tested in isolation to verify that they function as expected. In Java, it is an essential practice that helps verify code correctness and improve code quality. It ensures that the code works as intended and that new changes do not break existing functionality. Test-Driven Development (TDD) is a test-first approach to software development in short iterations. It is a practice in which a test is written before the real source code. The aim is to write code that passes predefined tests and is therefore well-designed, clean, and free from bugs. Key Concepts of Unit Testing Test automation: Use tools for automatic test running, such as JUnit. Asserts: Statements that confirm an expected result within a test. Test coverage: The percentage of code executed by the tests. Test suites: Collections of test cases. Mocks and stubs: Dummy objects that simulate real dependencies. Unit Testing Frameworks in Java: JUnit JUnit is a simple, open-source, and widely used unit testing framework, one of the most popular in Java. It comes with the annotations, assertions, and tools required to write and run tests. Core Components of JUnit 1. Annotations JUnit uses annotations to define tests and lifecycle methods. These are some of the key annotations: @Test: Marks a method as a test method. @BeforeEach: Denotes that the annotated method should be executed before each @Test method in the current class. @AfterEach: Denotes that the annotated method should be executed after each @Test method in the current class. @BeforeAll: Denotes that the annotated method should be executed once before any of the @Test methods in the current class. @AfterAll: Denotes that the annotated method should be executed once after all of the @Test methods in the current class. @Disabled: Used to disable a test method or class temporarily. 2. Assertions Assertions are used to test the expected outcomes: assertEquals(expected, actual): Asserts that two values are equal. If they are not, an AssertionError is thrown. assertTrue(boolean condition): Asserts that a condition is true. assertFalse(boolean condition): Asserts that a condition is false. assertNotNull(Object obj): Asserts that an object is not null. assertThrows(Class<T> expectedType, Executable executable): Asserts that the execution of the executable throws an exception of the specified type. 3. Assumptions Assumptions are similar to assertions but used in a different context: assumeTrue(boolean condition): If the condition is false, the test is aborted and reported as skipped rather than failed. assumeFalse(boolean condition): The inverse of assumeTrue. 4. Test Lifecycle The lifecycle of a JUnit test runs from initialization to cleanup: @BeforeAll → @BeforeEach → @Test → @AfterEach → @AfterAll This allows for proper setup and teardown operations, ensuring that tests run in a clean state.
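The assumptions above don't get an example later in the article, so here is a minimal sketch; the CI environment variable is just a placeholder condition. When the assumption fails, JUnit aborts the test and reports it as skipped rather than failed:
Java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assumptions.assumeTrue;

import org.junit.jupiter.api.Test;

class AssumptionExampleTest {

    @Test
    void runsOnlyWhenCiVariableIsSet() {
        // Aborts (does not fail) the test when the condition is false
        assumeTrue("true".equals(System.getenv("CI")), "Skipping outside of CI");
        assertEquals(4, 2 + 2);
    }
}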
Example of a Basic JUnit Test Here’s a simple example of a JUnit test class testing a basic calculator: Java import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.Test; import org.junit.jupiter.api.AfterEach; import static org.junit.jupiter.api.Assertions.*; class CalculatorTest { private Calculator calculator; @BeforeEach void setUp() { calculator = new Calculator(); } @Test void testAddition() { assertEquals(5, calculator.add(2, 3), "2 + 3 should equal 5"); } @Test void testMultiplication() { assertAll( () -> assertEquals(6, calculator.multiply(2, 3), "2 * 3 should equal 6"), () -> assertEquals(0, calculator.multiply(0, 5), "0 * 5 should equal 0") ); } @AfterEach void tearDown() { // Clean up resources, if necessary calculator = null; } } Dynamic Tests in JUnit 5 JUnit 5 introduced a powerful feature called dynamic tests. Unlike static tests, which are defined at compile-time using the @Test annotation, dynamic tests are created at runtime. This allows for more flexibility and dynamism in test creation. Why Use Dynamic Tests? Parameterized testing: This allows you to create a set of tests that execute the same code but with different parameters. Dynamic data sources: Create tests based on data that may not be available at compile-time (e.g., data from external sources). Adaptive testing: Tests can be generated based on the environment or system conditions. Creating Dynamic Tests JUnit provides the DynamicTest class for creating dynamic tests. You also need to use the @TestFactory annotation to mark the method that returns the dynamic tests. Example of Dynamic Tests Java import org.junit.jupiter.api.DynamicTest; import org.junit.jupiter.api.TestFactory; import java.util.Arrays; import java.util.Collection; import java.util.stream.Stream; import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.DynamicTest.dynamicTest; class DynamicTestsExample { @TestFactory Stream<DynamicTest> dynamicTestsFromStream() { return Stream.of("apple", "banana", "lemon") .map(fruit -> dynamicTest("Test for " + fruit, () -> { assertEquals(5, fruit.length()); })); } @TestFactory Collection<DynamicTest> dynamicTestsFromCollection() { return Arrays.asList( dynamicTest("Positive Test", () -> assertEquals(2, 1 + 1)), dynamicTest("Negative Test", () -> assertEquals(-2, -1 + -1)) ); } } Creating Parameterized Tests In JUnit 5, you can create parameterized tests using the @ParameterizedTest annotation. You'll need to use a specific source annotation to supply the parameters. Here's an overview of the commonly used sources: @ValueSource: Supplies a single array of literal values. @CsvSource: Supplies data in CSV format. @MethodSource: Supplies data from a factory method. @EnumSource: Supplies data from an Enum. 
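The article walks through @ValueSource, @CsvSource, and @MethodSource next; @EnumSource, the remaining source type, is not shown there, so here is a minimal sketch using java.time.temporal.ChronoUnit as the enum:
Java
import static org.junit.jupiter.api.Assertions.assertNotNull;

import java.time.temporal.ChronoUnit;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.EnumSource;

class EnumSourceTest {

    @ParameterizedTest
    @EnumSource(value = ChronoUnit.class, names = {"DAYS", "HOURS", "MINUTES"})
    void testWithEnumSource(ChronoUnit unit) {
        // One invocation per selected enum constant
        assertNotNull(unit.getDuration());
    }
}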
Example of Parameterized Tests Using @ValueSource Java import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.ValueSource; import static org.junit.jupiter.api.Assertions.assertTrue; class ValueSourceTest { @ParameterizedTest @ValueSource(strings = {"apple", "banana", "orange"}) void testWithValueSource(String fruit) { assertTrue(fruit.length() > 4); } } Using @CsvSource Java import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.CsvSource; import static org.junit.jupiter.api.Assertions.assertEquals; class CsvSourceTest { @ParameterizedTest @CsvSource({ "test,4", "hello,5", "JUnit,5" }) void testWithCsvSource(String word, int expectedLength) { assertEquals(expectedLength, word.length()); } } Using @MethodSource Java import org.junit.jupiter.params.ParameterizedTest; import org.junit.jupiter.params.provider.MethodSource; import java.util.stream.Stream; import static org.junit.jupiter.api.Assertions.assertTrue; class MethodSourceTest { @ParameterizedTest @MethodSource("stringProvider") void testWithMethodSource(String word) { assertTrue(word.length() > 4); } static Stream<String> stringProvider() { return Stream.of("apple", "banana", "orange"); } } Best Practices for Parameterized Tests Use descriptive test names: Leverage @DisplayName for clarity. Limit parameter count: Keep the number of parameters manageable to ensure readability. Reuse methods for data providers: For @MethodSource, use static methods that provide the data sets. Combine data sources: Use multiple source annotations for comprehensive test coverage. Tagging in JUnit 5 Another salient feature in JUnit 5 is tagging: it allows developers to assign their own custom tags to tests. Tags therefore provide a way to group tests and later execute groups selectively by their tag, which is very useful for managing large test suites. Key Features of Tagging Flexible grouping: Multiple tags can be applied to a single test method or class, so flexible grouping strategies can be defined. Selective execution: Tags make it possible to execute only the desired group of tests. Improved organization: Provides an organized way to set up tests for improved clarity and maintainability. Using Tags in JUnit 5 To use tags, you annotate your test methods or test classes with the @Tag annotation, followed by a string representing the tag name. Example Usage of @Tag Java import org.junit.jupiter.api.Tag; import org.junit.jupiter.api.Test; @Tag("fast") class FastTests { @Test @Tag("unit") void fastUnitTest() { // Test logic for a fast unit test } @Test void fastIntegrationTest() { // Test logic for a fast integration test } } @Tag("slow") class SlowTests { @Test @Tag("integration") void slowIntegrationTest() { // Test logic for a slow integration test } } Running Tagged Tests You can run tests with specific tags using: Command line: Pass the -t (or --include-tag) argument to the JUnit Platform Console Launcher, or filter by tag through the build tool, for example: mvn test -Dgroups="fast" IDE: Most modern IDEs like IntelliJ IDEA and Eclipse allow selecting specific tags through their graphical user interfaces. Build tools: Maven and Gradle support specifying tags to include or exclude during the build and test phases. Best Practices for Tagging Consistent tag names: Use a consistent naming convention across your test suite for tags, such as "unit", "integration", or "slow".
Layered tagging: Apply broader tags at the class level (e.g., "integration") and more specific tags at the method level (e.g., "slow"). Avoid over-tagging: Do not add too many tags to a single test, as this can reduce clarity and effectiveness. JUnit 5 Extensions The JUnit 5 extension model allows developers to extend and customize test behavior. Extensions provide a mechanism for adding functionality to tests, modifying the test execution lifecycle, and integrating new features into your tests. Key Features of JUnit 5 Extensions Customization: Modify the behavior of test execution or lifecycle methods. Reusability: Create reusable components that can be applied to different tests or projects. Integration: Integrate with other frameworks or external systems to add functionality like logging, database initialization, etc. Types of Extensions Test Lifecycle Callbacks BeforeAllCallback, BeforeEachCallback, AfterAllCallback, AfterEachCallback. Allow custom actions before and after test methods or test classes. Parameter Resolvers ParameterResolver. Inject custom parameters into test methods, such as mock objects, database connections, etc. Test Execution Condition ExecutionCondition. Enable or disable tests based on custom conditions (e.g., environment variables, OS type). Exception Handlers TestExecutionExceptionHandler. Handle exceptions thrown during test execution. Others TestInstancePostProcessor, TestTemplateInvocationContextProvider, etc. Customize test instance creation, template invocation, etc. Implementing Custom Extensions To create a custom extension, you need to implement one or more of the above interfaces and register the extension on the test class with @ExtendWith. Example: Custom Parameter Resolver A simple parameter resolver that injects a string into the test method: Java import org.junit.jupiter.api.extension.*; public class CustomParameterResolver implements ParameterResolver { @Override public boolean supportsParameter(ParameterContext parameterContext, ExtensionContext extensionContext) { return parameterContext.getParameter().getType().equals(String.class); } @Override public Object resolveParameter(ParameterContext parameterContext, ExtensionContext extensionContext) { return "Injected String"; } } Using the Custom Extension in Tests Java import org.junit.jupiter.api.Test; import org.junit.jupiter.api.extension.ExtendWith; @ExtendWith(CustomParameterResolver.class) class CustomParameterTest { @Test void testWithCustomParameter(String injectedString) { System.out.println(injectedString); // Output: Injected String } } Best Practices for Extensions Separation of concerns: Extensions should have a single, well-defined responsibility. Reusability: Design extensions to be reusable across different projects. Documentation: Document how the extension works and its intended use cases. Unit testing and Test-Driven Development (TDD) offer significant benefits that positively impact software development processes and outcomes. Benefits of Unit Testing Improved Code Quality Detection of bugs: Unit tests detect bugs early in the development cycle, making them easier and cheaper to fix. Code integrity: Tests verify that code changes don't break existing functionality, ensuring continuous code integrity. Simplifies Refactoring Tests serve as a safety net during code refactoring. If all tests pass after refactoring, developers can be confident that the refactoring did not break existing functionality.
Documentation Tests serve as live documentation that illustrates how the code is supposed to be used. They provide examples of the intended behavior of methods, which can be especially useful for new team members. Modularity and Reusability Writing testable code encourages modular design. Code that is easily testable is generally also more reusable and easier to understand. Reduces Fear of Changes A comprehensive test suite helps developers make changes confidently, knowing they will be notified if anything breaks. Regression Testing Unit tests can catch regressions, where previously working code stops functioning correctly due to new changes. Encourages Best Practices Developers tend to write cleaner, well-structured, and decoupled code when unit tests are a priority. Benefits of Test-Driven Development (TDD) Ensures test coverage: TDD ensures that every line of production code is covered by at least one test. This provides comprehensive coverage and verification. Focus on requirements: Writing tests before writing code forces developers to think critically about requirements and expected behavior before implementation. Improved design: The incremental approach of TDD often leads to better system design. Code is written with testing in mind, resulting in loosely coupled and modular systems. Reduces debugging time: Since tests are written before the code, bugs are caught early in the development cycle, reducing the amount of time spent debugging. Simplifies maintenance: Well-tested code is easier to maintain because the tests provide instant feedback when changes are introduced. Boosts developer confidence: Developers are more confident in their changes knowing that tests have already validated the behavior of their code. Facilitates collaboration: A comprehensive test suite enables multiple developers to work on the same codebase, reducing integration issues and conflicts. Helps identify edge cases: Thinking through edge cases while writing tests helps to identify unusual conditions that could be overlooked otherwise. Reduces overall development time: Although TDD may initially seem to slow development due to the time spent writing tests, it often reduces the total development time by preventing bugs and reducing the time spent on debugging and refactoring. Conclusion By leveraging unit testing and TDD in Java with JUnit, developers can produce high-quality software that's easier to maintain and extend over time. These practices are essential for any professional software development workflow, fostering confidence and stability in your application's codebase.
The first step in a Cloud Adoption journey for any enterprise is Application Portfolio Analysis. During this assessment, we see custom in-house (bespoke) applications, Commercial-Off-The-Shelf (COTS) applications, Software-as-a-Service (SaaS) applications, etc. The constitution of these applications in the portfolio varies between enterprises and industries. As an outcome of the assessment, the applications are dispositioned into one of the seven common migration strategies (the 7 R's of migration: Retire, Retain, Refactor, Replatform, Repurchase, Rehost, and Relocate), and a roadmap for cloud migration is derived. While COTS applications are generally perceived as low-hanging fruit during cloud migrations, they come with their own challenges, for example, the currency of the technical stack (operating system, product versions, frameworks, databases, etc.), managing licenses, and adherence to security requirements in the cloud. Understanding these challenges is critical to arriving at migration strategies. Let us dive a bit deeper and uncover them. Challenges and Best Practices From our experiences in cloud migrations, we have observed some common challenges and best practices to mitigate them. These have helped us in successful migrations for many clients. Disposition COTS products are used by enterprises as ready-made solutions for their business needs. However, in many cases, these applications trail other evolving applications and become outdated or difficult to integrate with other modern applications. Some applications do not undergo any upgrades/enhancements and outlive the usual application lifecycle. Migrating such applications becomes challenging and requires due diligence, stakeholder concurrence, etc. Best Practices Perform a comprehensive assessment, considering the business needs, performance, compatibility, risks, and a cost-benefit study. Involving all stakeholders, such as business, IT, and the Independent Software Vendor (ISV), is important for the best outcome of this assessment. Based on the analysis, choose one of the following strategies. Rehost the application and database (lift-and-shift the virtual machine as-is) to an Infrastructure-as-a-Service (IaaS) compute instance on the cloud. Rehost the application to IaaS and re-platform the database to a Platform-as-a-Service (PaaS) instance on the cloud. Replatform to an ISV-managed SaaS solution. This solution can be from the same ISV or from another ISV that provides an easy migration path to their solution. The following challenges will also contribute to the dispositioning of the COTS applications. ISV Support Pure COTS applications are usually straightforward to migrate, provided the vendor supports and certifies the application to run on the target cloud platform. Some ISVs provide customized versions of their product to suit the business needs of an enterprise. For such applications, the vendor has their share of ownership and responsibility to maintain them on the client's premises. While these applications are managed by an application team, the knowledge of the application, the intrinsic details, and the application roadmap lie with the vendor. The ISVs have their own timelines and schedules for their releases, and they would require sufficient notice to have their personnel engaged to support the migration. This impacts the migration plan of the COTS application and its dependencies. Best Practices Start engaging with the ISV early in the migration journey, preferably in the planning stage.
Some ISVs would require a new professional services contract before they can be engaged to provide support. Understand the rules of engagement and ensure the contract clearly details the roles and responsibilities of the ISV. Some COTS products require certification by the ISV to run on the cloud. Understand the requirements and clearly document them, as this would have cost implications. There are also scenarios where enterprises decide to migrate without vendor support due to cost, when they have in-house expertise on the product. Target Cloud Support Most enterprises incorrectly assume the COTS product is compatible with any target platform. However, there are instances where they realize that the products do not work as intended after the migration journey, and the vendor refuses to support the chosen platform/technology stack. Some examples are: The ISV might not have an immediate plan for supporting the target operating system or the target database in the cloud. One of the objectives of moving to the cloud is to provide for High Availability and Disaster Recovery, and some COTS products may not support front-ending by a load balancer or provide support for clustering. Best Practices During assessment, determine the ISV support for the target technology stack. Some applications would require modifications/upgrades to be supported. It’s good to ask specific questions regarding the support, such as: Does the COTS product support Windows 2022 in AWS Cloud? If yes, does it require modifications to support it? Does the COTS product, which currently uses SQL Server on-premises, support migration to RDS SQL Server in AWS Cloud? It’s also good to have the ISV involved in the design of the target architecture. Getting the target architecture approved by the ISV is a good practice. Security Requirements One of the common issues encountered during migrations is that COTS products do not adhere to the security policies defined by the Information Security (InfoSec) team. For example, privileged credentials used by COTS applications are often embedded in cleartext within application configuration files, database tables, scripts, etc. The credentials are often not changed frequently, and the same credentials may also be used across multiple environments or in other applications. This is perceived as a security vulnerability by the InfoSec team. Best Practices Understand from the ISV if there is a direct integration available to a secure vault like Azure Key Vault, AWS Secrets Manager, HashiCorp Vault, etc. Alternatively, the ISV must provide a patch to encrypt the password stored locally on the server or in the database. Provide for rotation of credentials based on policies like the criticality of the application or based on data sensitivity requirements. Some COTS products do not provide support for integration with vaults due to the legacy software stack, and the application cannot be modified. In such cases, an exception from the InfoSec team is sought with a remediation plan, for example, a product upgrade or enhancement, say within 6 months after the migration, as it involves cost, potential changes in integrations with other applications, etc. Licenses ISVs use different types of licensing models for their products. Some offer one-time, perpetual licenses, while others require enterprises to renew the licenses (subscription-based). Similarly, some licenses are tied to the server's metadata (IP address, MAC address, or hostname) while others are portable.
During migrations, another common issue observed is licensing conflict: inadequate licenses restrict the application from functioning in the cloud while the license is already tied to the running on-premises instance, or vice-versa. There could also be a change in the licensing model when the COTS product is moved to the cloud. For example, moving from an on-premises deployment to a SaaS model would require a move from a perpetual licensing model to a subscription-based one. Best Practices Understand the current licensing model and the number of licenses available, and request additional licenses where needed. Some ISVs provide temporary licenses that will allow the application to run simultaneously on-premises and on the target cloud platform. Understand if there are license checks done by the installed software. Some COTS applications send internet egress traffic for license validation. This would help in planning for firewall rules during migration. Team Organization and Coordination In the case of business-owned applications, the IT presence will be limited to providing platform support. Also, for customized COTS products, the ISV is a key participant and contributor in the migration. Involving them late is a common mistake: it causes delays, forces expedited engagement, and in turn proves to be an expensive affair. Along with identifying the contributors from IT (infra and database support), business (testers), etc. to support the migrations, placing the right ownership of actions on the ISV is also important. Best Practices The ISV team should have tasks assigned in the project plan, and it is necessary to communicate the tasks and the timeline to them as early as possible. The ISV should have clearly defined responsibilities. For example, during the installation of the COTS software in the cloud, the application team would perform the installation with support from the ISV, or the ISV themselves might perform the installation. These activities should be listed in the runbook with the ISV listed as the task owner. ISVs might also require access to the cloud environment during migration or for later support. Access requirements for the ISV can be evaluated again during migration and provided for. Integrations While most modern COTS products support enhanced security controls, you will come across a few products that use non-secure ports or integration mechanisms for communication, for example, HTTP ports (80, 8080) or FTP (port 21). In the cloud, one of the security controls enforced is the encryption of data in transit. Additionally, other applications having an affinity with the COTS product may take a modernization path, involving a change in the framework (Struts to Spring Boot), data models (XML to JSON), etc. This may require some changes to the COTS product. Application remediation for such enhancements would require a considerable amount of time and testing. Best Practices Start the identification of these integrations much earlier and work with the ISV for the changes to the COTS application. Factor in effort for comprehensive testing. Despite identifying such requirements early in the migration, it is possible that changes cannot be made to the COTS product due to timelines and various other factors. In such cases, it's normal to get an exception approval from InfoSec for allowing these ports. We can also check with the ISV if there are automation possibilities, like enabling CI/CD pipelines or configuration management,
that will reduce the manual effort and errors in deployment. Automation can also assist in enabling faster recovery in case of an outage. Data Migration Ensuring data remains secure during and after the migration is a significant challenge, as enterprises must consider data encryption, access controls, and compliance requirements such as GDPR, HIPAA, or PCI-DSS. COTS applications often have large volumes of data stored in various formats and structures. Migrating this data to the cloud while maintaining its integrity and consistency can be complex, especially if the data is spread across multiple sources or databases. Best Practices Evaluate the options to encrypt the data in transit and at rest with the ISV and other stakeholders, as it may require changes to the product. Understand the complexity of the data and its structure by doing a thorough analysis. Work with the ISV to determine if there are proprietary tools for migrating the data, and obtain clearance for using the tool from the InfoSec organization. This is a longer process; hence, it is essential to address it very early in the migration life cycle. Plan for incremental data migration to ensure data integrity during cutover to the target. Other Challenges Containerization Containerizing a COTS product is a popular solution, as the application can benefit from isolation, portability, scalability, and efficient utilization of resources. While the benefits are huge, this migration path is usually tricky because the ISVs may not have container images, and even if they agree to build a container image, they may not have the resources to maintain images on a continuous basis. So, it’s necessary to understand these intricacies before proceeding with containerization. Refactor Refactor the COTS application to a custom application. This is usually perceived as a project by itself, driven by a strong business case, and it involves considerable cost, time, and manpower. This could lead to build-vs-buy decisions or even buying and building (customizations). It's advised to take this route only when you have in-house knowledge about the application. Conclusion Every migration provides a lot of learnings and insights to carry forward into successive migrations. Based on our migration experiences for various enterprises across industries, we have shared the challenges and the mitigations that have helped us overcome them. As mentioned earlier, the number of COTS applications varies across enterprises and industries, but the challenges are similar. Addressing these challenges early in the migration cycle will help in the cloud journey while ensuring maximum benefits from the cloud for your application portfolio.
The Advantages of Elastic APM for Observing the Tested Environment My first use of the Elastic Application Performance Monitoring (Elastic APM) solution dates back to 2019, on microservices-based projects for which I was responsible for performance testing. At that time, the first versions of Elastic APM were being released. I was attracted by the easy installation of the agents, the numerous protocols supported by the Java agent (see Elastic supported technologies), including the Apache HttpClient used in JMeter, the agents for other languages (Go, .NET, Node.js, PHP, Python, Ruby), and the quality of the APM dashboards in Kibana. I found the information displayed in the Kibana APM dashboards to be relevant and not too verbose. The Java agent monitoring is simple but displays essential information on the machine's OS and JVM. The open-source aspect and the fact that the main functions of the tool are free were also decisive. I have since generalized the use of the Elastic APM solution in performance environments for all projects. With Elastic APM, I have the timelines of the different calls and exchanges between web services, the SQL queries executed, the exchange of messages via JMS, and monitoring. I also have quick access to errors or exceptions thrown in Java applications. Why Integrate Elastic APM in Apache JMeter By adding Java APM agents to web applications, we get the timelines of the called services in the Kibana dashboards. However, we mainly remain at the level of individual REST API calls, because we do not have the notion of a page. For example, page PAGE01 will make the following API calls: /rest/service1 /rest/service2 /rest/service3 On another page, PAGE02 will make the following calls: /rest/service2 /rest/service4 /rest/service5 /rest/service6 The third page, PAGE03, will make the following calls: /rest/service1 /rest/service2 /rest/service4 In this example, service2 is called on 3 different pages and service4 on 2 pages. If we look in the Kibana dashboard for service2, we will find the union of the calls coming from the 3 pages, but we don't have the notion of a page. We cannot answer "On this page, what is the breakdown of time across the different REST calls?", even though, for a user of the application, the notion of page response time is important. The goal of the jmeter-elastic-apm tool is to add the notion of a page, which already exists in JMeter as the Transaction Controller. This starts in JMeter by creating an APM transaction, and then propagating this transaction identifier (traceparent) with the Elastic agent to the HTTP REST requests to web services, because the APM Agent recognizes the Apache HttpClient library and can instrument it. In the HTTP request, the APM Agent will add the identifier of the APM transaction to the header of the HTTP request. The headers added are traceparent and elastic-apm-traceparent. We start from the notion of the page in JMeter (Transaction Controller) to go to the HTTP calls of the web application (gestdoc) hosted in Tomcat. In the case of an application composed of multiple web services, we will see in the timeline the different web services called over HTTP(s) or JMS and the time spent in each web service. This is an example of technical architecture for a performance test with Apache JMeter and the Elastic APM Agent to test a web application hosted in Apache Tomcat. How the jmeter-elastic-apm Tool Works jmeter-elastic-apm adds Groovy code before a JMeter Transaction Controller to create an APM transaction before a page.
In the JMeter Transaction Controller, we find HTTP samplers that make REST HTTP(s) calls to the services. The Elastic APM Agent automatically adds a new traceparent header containing the identifier of the APM transaction because it recognizes the Apache HttpClient of the HTTP sampler. The Groovy code terminates the APM transaction to indicate the end of the page. The jmeter-elastic-apm tool automates the addition of Groovy code before and after the JMeter Transaction Controller. The jmeter-elastic-apm tool is open source on GitHub (see link in the Conclusion section of this article). This JMeter script is simple with 3 pages in 3 JMeter Transaction Controllers. After launching the jmeter-elastic-apm tool with the ADD action, the JMeter Transaction Controllers are surrounded by Groovy code to create an APM transaction before the JMeter Transaction Controller and close the APM transaction after the JMeter Transaction Controller. In the “groovy begin transaction apm” sampler, the Groovy code calls the Elastic APM API (simplified version): Groovy Transaction transaction = ElasticApm.startTransaction(); Scope scope = transaction.activate(); transaction.setName(transactionName); // contains JMeter Transaction Controller Name In the “groovy end transaction apm” sampler, the Groovy code calls the ElasticApm API (simplified version): Groovy transaction.end(); Configuring Apache JMeter With the Elastic APM Agent and the APM Library Start Apache JMeter With Elastic APM Agent and Elastic APM API Library Declare the Elastic APM Agent (use this URL to find the APM Agent): Add the Elastic APM Agent somewhere in the filesystem (could be in <JMETER_HOME>\lib but not mandatory). In <JMETER_HOME>\bin, modify the jmeter.bat or setenv.bat. Add Elastic APM configuration like so: Shell set APM_SERVICE_NAME=yourServiceName set APM_ENVIRONMENT=yourEnvironment set APM_SERVER_URL=http://apm_host:8200 set JVM_ARGS=-javaagent:<PATH_TO_AGENT_APM_JAR>\elastic-apm-agent-<version>.jar -Delastic.apm.service_name=%APM_SERVICE_NAME% -Delastic.apm.environment=%APM_ENVIRONMENT% -Delastic.apm.server_urls=%APM_SERVER_URL% 2. Add the Elastic APM library: Add the Elastic APM API library (apm-agent-api-<version>.jar) to <JMETER_HOME>\lib. This library is used by the JSR223 Groovy code. Use this URL to find the APM library. Recommendations on the Impact of Adding Elastic APM in JMeter The APM Agent will intercept and modify all HTTP sampler calls, and this information will be stored in Elasticsearch. It is preferable to voluntarily disable the HTTP requests for static elements (images, CSS, JavaScript, fonts, etc.), which can generate a large number of requests but are not very useful in analyzing the timeline. In the case of heavy load testing, it's recommended to change the elastic.apm.transaction_sample_rate parameter to sample only a portion of the calls so as not to saturate the APM Server and Elasticsearch. This elastic.apm.transaction_sample_rate parameter can be declared in <JMETER_HOME>\bin\jmeter.bat or setenv.bat, but also in a JSR223 sampler with a short piece of Groovy code in a setUp thread group. Groovy code that records only 50% of samples: Groovy import co.elastic.apm.api.ElasticApm; // update elastic.apm.transaction_sample_rate ElasticApm.setConfig("transaction_sample_rate","0.5"); Conclusion The jmeter-elastic-apm tool allows you to easily integrate the Elastic APM solution into JMeter and add the notion of a page in the timelines of Kibana APM dashboards.
Elastic APM + Apache JMeter is an excellent solution for understanding how the environment works during a performance test, with simple monitoring, quality dashboards, time breakdown timelines across the different distributed application layers, and the display of exceptions in web services. Over time, the Elastic APM solution only gets better. I strongly recommend it, of course, in a performance testing context, but it also has many advantages in a development environment used by developers or an integration environment used by functional or technical testers. Links Command Line Tool jmeter-elastic-apm JMeter plugin elastic-apm-jmeter-plugin Elastic APM Guides: APM Guide or Application performance monitoring (APM)
Motivation and Background Why is it important to build interpretable AI models? The future of AI is in enabling humans and machines to work together to solve complex problems. Organizations are attempting to improve process efficiency and transparency by combining AI/ML technology with human review. In recent years, with the advancement of AI, AI-specific regulations have emerged, for example, Good Machine Learning Practices (GMLP) in the pharma industry and Model Risk Management (MRM) in the finance industry, along with broader regulations addressing data privacy, such as the EU’s GDPR and California’s CCPA. Similarly, internal compliance teams may also want to interpret a model’s behavior when validating decisions based on model predictions. For instance, underwriters want to learn why a specific loan application was tagged as suspicious by an ML model. Overview What is interpretability? In the ML context, interpretability refers to tracing back which factors contributed to an ML model making a certain prediction. Generally, simpler models are easier to interpret but often produce lower accuracy compared to complex models, like deep learning and transformer-based models, which can capture non-linear relations in the data and often have high accuracy. Loosely defined, there are two types of explanations: Global explanation: Explains the model at an overall level to understand which features have contributed the most to the output. For example, in a finance setting where the use case is to build an ML model to identify customers who are most likely to default, some of the most influential features for making that decision are the customer’s credit score, total number of credit cards, revolving balance, etc. Local explanation: This enables you to zoom in on a particular data point and observe the behavior of the model in that neighborhood. For example, in a sentiment classification of a movie review use case, certain words in the review may have a higher impact on the outcome than others, e.g., “I have never watched something as bad.” What is a transformer model? A transformer model is a neural network that tracks relationships in sequential input, such as the words in a sentence, to learn context and, thus, meaning. Transformer models use an evolving set of mathematical approaches, called attention or self-attention, to find minute relationships between even distant data elements in a series. Refer to Google’s publication for more information. Integrated Gradients Integrated Gradients (IG) is an Explainable AI technique introduced in the paper Axiomatic Attribution for Deep Networks. The paper assigns an attribution value to each input feature, which tells how much that input contributed to the final prediction. IG is a local method and a popular interpretability technique due to its broad applicability to any differentiable model (e.g., text, image, structured data), ease of implementation, computational efficiency relative to alternative approaches, and theoretical justifications. Integrated gradients represent the integral of gradients with respect to the inputs along the path from a given baseline to the input. The integral can be approximated using a Riemann sum or the Gauss-Legendre quadrature rule. Formally, the integrated gradients along the i-th dimension of an input X are defined as follows, where alpha is the scaling coefficient (the equation is from the original paper).
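For reference, a minimal LaTeX rendering of that definition (with F denoting the model and x' the baseline input) is:

$$\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha$$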
The cornerstones of this approach are two fundamental axioms, namely sensitivity and implementation invariance. More information can be found in the original paper. Use Case Now let’s see in action how the Integrated Gradients method can be applied using the Captum package. We will be fine-tuning a question-answering BERT (Bidirectional Encoder Representations from Transformers) model on the SQuAD dataset using the transformers library from Hugging Face; review the notebook for a detailed walkthrough. Steps Load the tokenizer and pre-trained BERT model, in this case, bert-base-uncased Next is computing attributions w.r.t. the BertEmbeddings layer. To do so, define baselines/references and numericalize both the baselines and the inputs. Python def construct_whole_bert_embeddings(input_ids, ref_input_ids, \ token_type_ids=None, ref_token_type_ids=None, \ position_ids=None, ref_position_ids=None): Python input_embeddings = model.bert.embeddings(input_ids, token_type_ids=token_type_ids, position_ids=position_ids) Python ref_input_embeddings = model.bert.embeddings(ref_input_ids, token_type_ids=ref_token_type_ids, position_ids=ref_position_ids) Python return input_embeddings, ref_input_embeddings Now, let's define the question-answer pair as an input to our BERT model: question = "What is important to us?" text = "It is important to us to include, empower and support humans of all kinds." Generate corresponding baselines/references for the question-answer pair The next step is to make predictions; one option is to use LayerIntegratedGradients and compute the attributions with respect to BertEmbeddings. LayerIntegratedGradients represents the integral of gradients with respect to the layer inputs/outputs along the straight-line path from the layer activations at the given baseline to the layer activations at the input. Python start_scores, end_scores = predict(input_ids, \ token_type_ids=token_type_ids, \ position_ids=position_ids, \ attention_mask=attention_mask) Python print('Question: ', question) print('Predicted Answer: ', ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])) Python lig = LayerIntegratedGradients(squad_pos_forward_func, model.bert.embeddings) Output: Question: What is important to us? Predicted Answer: to include , em ##power and support humans of all kinds Visualize attributions for each word token in the input sequence using a helper function Python # storing a couple of samples in an array for visualization purposes Python start_position_vis = viz.VisualizationDataRecord( attributions_start_sum, torch.max(torch.softmax(start_scores[0], dim=0)), torch.argmax(start_scores), torch.argmax(start_scores), str(ground_truth_start_ind), attributions_start_sum.sum(), all_tokens, delta_start) Python print('\033[1m', 'Visualizations For Start Position', '\033[0m') viz.visualize_text([start_position_vis]) Python print('\033[1m', 'Visualizations For End Position', '\033[0m') viz.visualize_text([end_position_vis]) From the results above, we can tell that for predicting the start position, our model focuses more on the question side, more specifically on the tokens ‘what’ and ‘important’. It also has a slight focus on the token sequence ‘to us’ on the text side. In contrast, for predicting the end position, our model focuses more on the text side and has relatively high attribution on the last end position token ‘kinds’.
Conclusion This blog describes how explainable AI techniques like Integrated Gradients can be used to make a deep learning NLP model interpretable by highlighting positive and negative word influences on the outcome of the model. References Axiomatic Attribution for Deep Networks Model Interpretability for PyTorch Towards Better Understanding of Gradient-Based Attribution Methods for Deep Neural Networks