DZone Spotlight

Tuesday, October 22

How to Convert HTML to DOCX in Java

By Brian O'Neill

CORE

There's a far smaller audience of folks who understand the intricacies of HTML document structure than those who understand the user-friendly Microsoft (MS) Word application. Automating HTML-to-DOCX conversions makes a lot of sense if we frequently need to generate well-formatted documents from dynamic web content, streamline reporting workflows, or convert any other web-based information into editable Word documents for a non-technical business audience. Doing this with APIs reduces the time and effort it takes to generate MS Word content for those users.

In this article, we'll review open-source and proprietary API solutions for streamlining HTML-to-DOCX conversions in Java, and we'll explore the relationship between HTML and DOCX file structures that makes this conversion relatively straightforward.

How Similar Are HTML and DOCX Structures?

HTML and DOCX documents serve very different purposes, but they have more in common than we might initially think. They're both markup formats built on XML or XML-like tags, with similar approaches to structuring text on a page:

HTML documents use an XML-like tag structure to organize how content appears in a web browser.
DOCX documents use a series of zipped XML files to collectively define how content appears in the proprietary MS Word application.

Content elements in an HTML document like paragraphs (<p>), headings (<h1>, <h2>, etc.), and tables (<table>) all roughly translate into DOCX iterations of the same concept. For example, an HTML <p> tag corresponds to a <w:p> element in DOCX, and an <h1> heading becomes a <w:p> paragraph whose <w:pStyle> property references a heading style.

Further, in a similar way to how HTML documents often reference CSS stylesheets (e.g., styles.css) for element styling, DOCX documents use a document.xml file to store the content itself and map it to Word styles and settings, which are stored in the styles.xml and settings.xml files respectively within the DOCX archive.
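To make the "zipped XML files" idea concrete, here is a minimal sketch that uses only the JDK's java.util.zip classes. It builds a toy archive in memory with the three part names mentioned above and lists them back. The class and method names are hypothetical, and the archive is deliberately not a valid Word document (a real DOCX package also needs parts such as [Content_Types].xml and relationship files); it only illustrates the container layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class: mimics the layout of a DOCX package.
class DocxContainerSketch {

    // Builds a tiny ZIP archive in memory with DOCX-style part names.
    // NOT a valid Word document; the XML bodies are placeholders.
    static byte[] buildToyPackage() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            addPart(zip, "word/document.xml", "<w:document><w:body><w:p/></w:body></w:document>");
            addPart(zip, "word/styles.xml", "<w:styles/>");
            addPart(zip, "word/settings.xml", "<w:settings/>");
        }
        // The ZIP stream is closed here, so the central directory has been written.
        return buffer.toByteArray();
    }

    // Writes one named XML part into the archive.
    static void addPart(ZipOutputStream zip, String name, String xml) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(xml.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Lists the part names stored in an archive, in the order they were written.
    static List<String> listParts(byte[] archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(listParts(buildToyPackage()));
    }
}
```

Pointing listParts at the bytes of a real .docx file should show the same word/ folder alongside the packaging parts that Word adds.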
Differences Between HTML and DOCX to Consider

It's worth noting that HTML and DOCX files do handle certain types of content quite differently, despite sharing a similar underlying structure. Much of this can be attributed to differences between how web browser applications and the MS Word application interpret information.

The challenges we encounter with HTML-to-DOCX conversions are largely driven by inconsistencies in the way custom styling, media content, and dynamic elements are interpreted. The styling used in native HTML and native DOCX documents is often custom/proprietary, and custom/proprietary HTML styles (e.g., custom fonts) won't necessarily translate into identical DOCX styles when we convert content between those formats. Further, in HTML files, multimedia (e.g., images, videos) is included on any given page as links, whereas DOCX files embed media objects directly. Finally, the dynamic code elements we find on some HTML pages (usually written in JavaScript) won't translate to DOCX at all, given that DOCX is a static format.

Converting HTML to DOCX

When we convert HTML to DOCX, we effectively parse content from HTML elements and subsequently map that content to appropriate DOCX elements. The same occurs in reverse when we make the opposite conversion (a process I've written about in the past). How that parsing and mapping take place depends entirely on how we structure our code, or on which APIs we elect to use in our programming project.

Open-Source Libraries for HTML-to-DOCX Conversions

If we're looking for open-source libraries to make HTML-to-DOCX conversions, we'll go a long way with libraries like jsoup and docx4j. The jsoup library is designed to parse and clean HTML programmatically into a structure that we can easily work with, and the docx4j library offers features capable of mapping HTML tags to their corresponding DOCX elements.
We can also finalize the creation of our DOCX documents with docx4j, literally organizing our mapped HTML elements into a series of XML files and zipping those with a .docx extension. The docx4j library is very similar to Microsoft's OpenXML SDK, only for Java developers instead of C#.

HTML-to-DOCX Conversion Demonstration

If we're looking to simplify HTML-to-DOCX conversions, we can turn our attention to a web API solution that gets in the weeds on our behalf, parsing and mapping HTML into a consistent DOCX result without requiring us to download multiple libraries or write a lot of extra code. The Cloudmersive API client used below is free to use, requiring only a free API key. We'll now walk through example code that we can use to structure our API call.

To begin, we'll install the client using Maven via JitPack. We'll first add the repository to our pom.xml:

XML
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

And after that, we'll add the dependency to our pom.xml:

XML
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>

Next, we'll import the necessary classes to configure the API client, handle exceptions, etc.:

Java
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ConvertWebApi;

Now we'll configure our API client with an API key for authentication:

Java
ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

Finally, we'll create the API instance, prepare our input request, and handle our conversion (while catching any exceptions, of course):

Java
ConvertWebApi apiInstance = new ConvertWebApi();
HtmlToOfficeRequest inputRequest = new HtmlToOfficeRequest(); // HtmlToOfficeRequest | HTML input to convert to DOCX
try {
    byte[] result = apiInstance.convertWebHtmlToDocx(inputRequest);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertWebApi#convertWebHtmlToDocx");
    e.printStackTrace();
}

Once our conversion is complete, we can write the resulting byte[] array to a DOCX file, and we're all finished. We can perform subsequent operations with our new DOCX document, or we can store it for business users to access directly and call it a day.

Conclusion

In this article, we reviewed some of the similarities between HTML and DOCX file structures that make converting between both formats relatively simple and easy to accomplish with code. We then discussed two open-source libraries we could use in conjunction to handle HTML-to-DOCX conversions, and we learned how to call a free proprietary API to handle all our steps in one go.

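As a small follow-up to the last step of the walkthrough above (writing the returned byte[] to a DOCX file), here is a minimal sketch using only java.nio.file. The class name, method name, and file name are hypothetical, and the four placeholder bytes merely stand in for the array an API call would return; they are not a real document.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: persists converted DOCX bytes to disk.
class DocxFileWriter {

    // Writes the bytes to the target path, creating parent folders if needed,
    // and returns the path that was written.
    static Path save(byte[] docxBytes, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, docxBytes);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes (the ZIP signature), standing in for the API result.
        byte[] placeholder = {0x50, 0x4B, 0x03, 0x04};
        Path out = save(placeholder, Files.createTempDirectory("docx-demo").resolve("converted.docx"));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Files.write replaces an existing file by default, which is usually what we want when regenerating a report on a schedule.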
*You* Can Shape Trend Reports: Join DZone's Observability Research + Enter the Prize Drawing!

By Caitlin Candelmo

Hey, DZone Community! We have a survey in progress as part of our original research for the upcoming Trend Report. We would love for you to join us by sharing your experiences and insights (anonymously if you choose); readers just like you drive the content that we cover in our Trend Reports. Check out the details for our research survey below.

Observability and Performance Research

DZone's annual research on application performance dives deeper into the emerging trends and techniques around monitoring and observability, both of which are must-haves to support the performance, reliability, and scalability of today's complex applications and system architectures. Our 10-minute research survey, which will help guide the narrative of our November Observability and Performance Trend Report, explores:

Observability models, techniques, and tools
OpenTelemetry use, benefits, and drawbacks
Performance metrics and degradation root causes
AI analytics capabilities for observability and monitoring

Join the Observability Research

Over the coming months, we will compile, observe, and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

The DZone Content and Community team

Trend Report

Kubernetes in the Enterprise

In 2014, Kubernetes' first commit was pushed to production. And 10 years later, it is now one of the most prolific open-source systems in the software development space. So what made Kubernetes so deeply entrenched within organizations' systems architectures? Its promise of scale, speed, and delivery, that is, and Kubernetes isn't going anywhere any time soon.

DZone's fifth annual Kubernetes in the Enterprise Trend Report dives further into the nuances and evolving requirements for the now 10-year-old platform. Our original research explored topics like architectural evolutions in Kubernetes, emerging cloud security threats, advancements in Kubernetes monitoring and observability, the impact and influence of AI, and more, results from which are featured in the research findings.

As we celebrate a decade of Kubernetes, we also look toward ushering in its future, discovering how developers and other Kubernetes practitioners are guiding the industry toward a new era. In the report, you'll find insights like these from several of our community experts; these practitioners guide essential discussions around mitigating the Kubernetes threat landscape, observability lessons learned from running Kubernetes, considerations for effective AI/ML Kubernetes deployments, and much more.

Refcard #303

API Integration Patterns

By Thomas Jardinet

CORE

Refcard #389

Threat Detection

By Sudip Sengupta

CORE

Building an Interactive Chatbot With Streamlit, LangChain, and Bedrock

In the ever-evolving landscape of AI, chatbots have become indispensable tools for enhancing user engagement and streamlining information delivery. This article will walk you through the process of building an interactive chatbot using Streamlit for the front end, LangChain for orchestrating interactions, and Anthropic's Claude model, hosted on Amazon Bedrock, as the Large Language Model (LLM) backend. We'll dive into the code snippets for both the backend and front end and explain the key components that make this chatbot work.

Core Components

Streamlit frontend: Streamlit's intuitive interface allows us to create a low-code, user-friendly chat interface with minimal effort. We'll explore how the code sets up the chat window, handles user input, and displays the chatbot's responses.
LangChain orchestration: LangChain empowers us to manage the conversation flow and memory, ensuring the chatbot maintains context and provides relevant responses. We'll discuss how LangChain's ConversationSummaryBufferMemory and ConversationChain are integrated.
Bedrock/Claude LLM backend: The true magic lies in the LLM backend. We'll look at how to leverage Amazon Bedrock's Claude foundation model to generate intelligent and contextually aware responses.

Chatbot Architecture

Conceptual Walkthrough of the Architecture

User interaction: The user initiates the conversation by typing a message into the chat interface created by Streamlit. This message can be a question, a request, or any other form of input the user wishes to provide.
Input capture and processing: Streamlit's chat input component captures the user's message and passes it on to the LangChain framework for further processing.
Contextualization with LangChain memory: LangChain plays a crucial role in maintaining the context of the conversation. It combines the user's latest input with the relevant conversation history stored in its memory.
This ensures that the chatbot has the necessary information to generate a meaningful and contextually appropriate response.
Leveraging the LLM: The combined context is then sent to the Bedrock/Claude LLM. This powerful language model uses its vast knowledge and understanding of language to analyze the context and generate a response that addresses the user's input in an informative way.
Response retrieval: LangChain receives the generated response from the LLM and prepares it for presentation to the user.
Response display: Finally, Streamlit takes the chatbot's response and displays it in the chat window, making it appear as if the chatbot is engaging in a natural conversation with the user. This creates an intuitive and user-friendly experience, encouraging further interaction.

Code Snippets

Frontend (Streamlit)

Python
# 1 Import Streamlit and the backend module (saved as chatbot_backend.py)
import streamlit
import chatbot_backend as demo

# 2 Set the title for the chatbot
streamlit.title("Hi, This is your Chatbot")

# 3 Add the LangChain memory to the session cache (session state)
if 'memory' not in streamlit.session_state:
    streamlit.session_state.memory = demo.demo_memory()

# 4 Add the UI chat history to the session cache (session state)
if 'chat_history' not in streamlit.session_state:
    streamlit.session_state.chat_history = []

# 5 Re-render the chat history on every rerun of the script
for message in streamlit.session_state.chat_history:
    with streamlit.chat_message(message["role"]):
        streamlit.markdown(message["text"])

# 6 Chatbot input box
input_text = streamlit.chat_input("Powered by Bedrock")
if input_text:
    with streamlit.chat_message("user"):
        streamlit.markdown(input_text)
    streamlit.session_state.chat_history.append({"role": "user", "text": input_text})

    chat_response = demo.demo_conversation(input_text=input_text, memory=streamlit.session_state.memory)

    with streamlit.chat_message("assistant"):
        streamlit.markdown(chat_response)
    streamlit.session_state.chat_history.append({"role": "assistant", "text": chat_response})

Backend (LangChain and LLM)

Python
import time

import boto3
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryBufferMemory
from langchain_aws import ChatBedrock

# 2a Create the client connection with Bedrock (credentials, region, model_id)
def demo_chatbot():
    boto3_session = boto3.Session(
        # Your aws_access_key_id,
        # Your aws_secret_access_key,
        region_name='us-east-1'
    )
    llm = ChatBedrock(
        model_id="anthropic.claude-3-sonnet-20240229-v1:0",
        client=boto3_session.client('bedrock-runtime'),
        model_kwargs={
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,  # Claude 3 Sonnet's output limit is 4,096 tokens
            "temperature": 0.3,
            "top_p": 0.3,
            "stop_sequences": ["\n\nHuman:"]
        }
    )
    return llm

# 3 Create the ConversationSummaryBufferMemory (llm and max token limit)
def demo_memory():
    llm_data = demo_chatbot()
    memory = ConversationSummaryBufferMemory(llm=llm_data, max_token_limit=20000)
    return memory

# 4 Run one turn of the conversation chain: input text + memory
def demo_conversation(input_text, memory):
    llm_chain_data = demo_chatbot()

    # Initialize ConversationChain with the llm and the shared memory
    llm_conversation = ConversationChain(llm=llm_chain_data, memory=memory, verbose=True)

    # Call the invoke method and time the round trip.
    # ConversationChain stores the turn in memory itself, so no manual save_context is needed.
    llm_start_time = time.time()
    chat_reply = llm_conversation.invoke({"input": input_text})
    llm_elapsed_time = time.time() - llm_start_time
    print(f"LLM call took {llm_elapsed_time:.2f}s")

    return chat_reply.get('response', 'No Response')

Conclusion

We've explored the fundamental building blocks of an interactive chatbot powered by Streamlit, LangChain, and a powerful LLM backend.
This foundation opens doors to endless possibilities, from customer support automation to personalized learning experiences. Feel free to experiment, enhance, and deploy this chatbot for your specific needs and use cases.
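For readers who think in Java, the contextualization step from the walkthrough (combining the latest user input with stored history before calling the model) can be sketched without any framework. This is a simplified illustration under stated assumptions, not LangChain's implementation: the class and method names are hypothetical, and where ConversationSummaryBufferMemory summarizes older turns once a token limit is reached, this sketch simply drops the oldest exchanges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, framework-free sketch of a bounded conversation memory.
class ConversationBufferSketch {
    private final List<String> turns = new ArrayList<>();
    private final int maxExchanges;

    ConversationBufferSketch(int maxExchanges) {
        this.maxExchanges = maxExchanges;
    }

    // Combines the stored history with the latest user input into one prompt string.
    String buildPrompt(String userInput) {
        StringBuilder prompt = new StringBuilder();
        for (String turn : turns) {
            prompt.append(turn).append('\n');
        }
        return prompt.append("Human: ").append(userInput).toString();
    }

    // Records one completed exchange and evicts the oldest exchanges beyond the limit.
    // (A summarizing memory would condense them instead of discarding them.)
    void saveTurn(String userInput, String reply) {
        turns.add("Human: " + userInput);
        turns.add("AI: " + reply);
        while (turns.size() > maxExchanges * 2) {
            turns.remove(0); // oldest human message
            turns.remove(0); // its matching AI reply
        }
    }

    public static void main(String[] args) {
        ConversationBufferSketch memory = new ConversationBufferSketch(2);
        memory.saveTurn("Hi", "Hello! How can I help?");
        System.out.println(memory.buildPrompt("What is Bedrock?"));
    }
}
```

The same shape explains why the frontend keeps the memory object in session state: the history has to survive between Streamlit reruns for the prompt to carry context.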

By Karan Bansal

An Interview About Navigating the Cloud-Native Ecosystem

In this interview with Julian Fischer, CEO of the cloud computing and automation company anynines GmbH, we explore the evolving landscape of cloud-native technologies with a strong focus on the roles of Kubernetes and Cloud Foundry in modern enterprise environments.

About the Interviewee

The interviewee, Julian Fischer, has extensive experience in Cloud Foundry and Kubernetes ops. Julian leads anynines in helping organizations operate applications at scale. Under his guidance, they're also pioneering advancements in managing data services across many Kubernetes clusters via the open-source Klutch project.

The Dominance of Kubernetes

Question: Kubernetes has dominated the container orchestration space in recent years. What key factors have contributed to its success?

Answer: "Kubernetes has indeed taken the lead in container orchestration. It's flexible, and this flexibility allows companies to customize their container deployment and management to fit their unique needs. But it's not just about flexibility. The ecosystem around Kubernetes is robust and ever-growing. Think tools, services, integrations – you name it. This expansive ecosystem is a major draw. Community support is another big factor. The Kubernetes community is large, active, and innovative. And let's not forget about multi-cloud capabilities. Kubernetes shines here. It enables consistent deployments across various cloud providers and on-premises environments. That's huge for companies with diverse infrastructure needs. Lastly, it's efficient. Kubernetes has some pretty advanced scheduling capabilities. This means optimal use of cluster resources."

Question: Despite Kubernetes' popularity, what challenges do organizations face when managing large-scale Kubernetes environments?

Answer: "Well, Kubernetes isn't without its challenges, especially at scale. Complexity is a big one. Ensuring consistent configs across multiple clusters? It's not for the faint of heart.
Resource management becomes a real juggling act as you scale up. You're dealing with compute, storage, network – it all gets more complex. Monitoring is another headache. As your microservices and containers multiply, maintaining visibility becomes tougher. It's like trying to keep track of a thousand moving parts. Security is a constant concern too. Implementing and maintaining policies across a large Kubernetes ecosystem is a full-time job. And then there are all the updates and patches. Keeping a large Kubernetes environment up-to-date is like painting the Golden Gate Bridge. By the time you finish, it's time to start over. It's a never-ending process."

Question: Given Kubernetes' dominance, is there still a place for Cloud Foundry in the cloud-native ecosystem?

Answer: "Absolutely. Cloud Foundry still brings a lot to the table. It's got a different focus. While Kubernetes is all about flexibility, Cloud Foundry is about simplicity and operational efficiency for developers. It streamlines the whole process of deploying and scaling apps. That's valuable. Think about it this way. Cloud Foundry abstracts away a lot of the infrastructure complexity. Developers can focus on code, not on managing the underlying systems. That's powerful. Robust security features, proven track record in large enterprises – these things matter. And here's something interesting: in some large-scale scenarios, Cloud Foundry can actually be more economical. Especially when you're running lots of cloud-native apps. It's all about the right tool for the job."

The Relationship Between Cloud Foundry and Kubernetes

Question: How are the Cloud Foundry and Kubernetes communities working together to bridge these technologies?

Answer: "It's not a competition anymore. The communities are collaborating, and it's exciting to see. There are some really interesting projects in the works. Take Klutch, for example.
It's an open-source tool that's bridging the gap between Cloud Foundry and Kubernetes for data services. Pretty cool stuff."

Figure 1. The open-source Klutch project enables centralized resource management for multi-cluster Kubernetes environments.

"Then there's Korifi. This project is ambitious. It's bringing the Cloud Foundry developer experience to Kubernetes. Imagine getting Cloud Foundry's simplicity with Kubernetes' power. That's the goal. These projects show a shift in thinking. It's not about choosing one or the other anymore. It's about leveraging the strengths of both platforms. That's the future of cloud-native tech."

Question: What factors should organizations consider when choosing between Kubernetes and Cloud Foundry?

Answer: "Great question. There's no one-size-fits-all answer here. First, look at your team. What are they comfortable with? What's their expertise? That matters a lot. Then, think about your applications. What do they need? Some apps are better suited for one platform over the other. Scalability is crucial too. How much do you need to grow? And how fast? Consider your control needs as well. Do you need the fine-grained control of Kubernetes? Or would you benefit more from Cloud Foundry's abstraction? Don't forget about your existing tools and workflows. Integration is key. You want a solution that plays nice with what you already have. It's about finding the right fit for your specific situation."

Question: Can you elaborate on the operational efficiency advantages that Cloud Foundry might offer in certain scenarios?

Answer: "Sure thing. Cloud Foundry can be a real efficiency booster in the right context. It's all about its opinionated approach. This might sound limiting, but in large-scale environments, it can be a blessing. Here's why – Cloud Foundry streamlines a lot of operational aspects. Deployment, scaling, management - it's all simplified. This means less operational overhead.
In some cases, it can lead to significant cost savings. Especially when you're dealing with a large number of applications that fit well with Cloud Foundry's model. But here's the catch. This advantage is context-dependent. It's not a universal truth. You need to evaluate your specific use case. For some, the efficiency gains are substantial. For others, not so much. It's all about understanding your needs and environment."

Looking to the Future of Cloud-Native Technologies

Question: How do you see the future of cloud-native technologies evolving, particularly concerning Kubernetes and Cloud Foundry?

Answer: "The future is exciting. And diverse. We're moving away from the idea that there's one perfect solution for everything. Kubernetes will continue to dominate, no doubt. But Cloud Foundry isn't going anywhere. In fact, I see increased integration between the two. We're likely to see more hybrid approaches. Organizations leveraging the strengths of both platforms. Why choose when you can have both, right? The focus will be on creating seamless experiences. Imagine combining Kubernetes' flexibility with Cloud Foundry's developer-friendly abstractions. That's incredibly powerful, and what we're working towards. Innovation will continue at a rapid pace. We'll see new tools, new integrations. The line between these technologies might even start to blur. It's an exciting time to be in this space."

Question: What advice would you give to organizations trying to navigate this complex cloud-native ecosystem?

Answer: "My advice? Stay flexible. And curious. This field is evolving rapidly. What works today might not be the best solution tomorrow. Start by really understanding your needs. Not just your current needs, but where you're headed. Don't view it as a binary choice. Kubernetes or Cloud Foundry – it doesn't have to be either/or. Consider how they can work together in your stack. Experiment. Start small. See what works for your specific use cases. Invest in your team.
Train them on both technologies. The more versatile your team, the better positioned you'll be. And remember, it's okay to change course. Be prepared to evolve your strategy as the technologies and your needs change. The goal isn't to use the trendiest tech. It's to choose the right tools that solve your problems efficiently. Sometimes that's Kubernetes. Sometimes it's Cloud Foundry. Often, it's a combination of both. Stay focused on your business needs, and let that guide your technology choices."

This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.

By McKinzie Brocail

How to Use Self Join and WITH Clause in Oracle

The Oracle WITH clause is one of the most commonly used techniques to simplify SQL source code and improve performance. In Oracle SQL, the WITH clause, also known as a Common Table Expression (CTE), is a powerful tool that also enhances code readability. WITH is commonly used to define temporary named result sets, also referred to as subqueries or CTEs. These temporary named sets can be referenced multiple times within the main SELECT query. CTEs behave like virtual tables and are very helpful in organizing and modularizing SQL code.

Understanding the WITH Clause Syntax

The usage of the WITH clause is very simple. Give each result set a name, follow it with the AS keyword and a SELECT query in parentheses, and add as many named queries as you want, separated by commas (,). It's good practice to use meaningful names so the result sets are easy to tell apart in the main SELECT.

In terms of internal execution, Oracle can evaluate each named query once and cache (materialize) the result, which is then reused by the main SELECT; the optimizer may also choose to inline the query instead. When the result is materialized, it mimics a materialized view holding intermediate results and reduces redundant calculations, allowing for faster retrieval and processing in subsequent parts of the query.

SQL
WITH cte_name1 AS (SELECT * FROM Table1),
     cte_name2 AS (SELECT * FROM Table2),
     ...
SELECT ...
FROM cte_name1, cte_name2
WHERE ...;

Use Case

In this use case, I am going to talk specifically about how you can effectively combine a self join with a WITH clause, which can help tremendously in performance tuning. Let's take a look at the dataset and the problem statement before we delve into the solution. The scenario is an e-commerce retail chain for which bulk product sales price data needs to be loaded for a particular e-store location.
Imagine that a product can have several price lines meant for regular prices, promotional prices, and BOGO offer prices. In this case, the user is trying to create multiple promotional price lines and is unaware of the mistakes they could make. Through this process, we will detect duplicate data that is functionally redundant and prevent poor-quality data from entering the pricing system. By doing so, we will avoid interface program failures in the pricing repository staging layer, which acts as a bridge between the pricing computation engine and the pricing repository accessed by the e-commerce platform.

TABLE: e_promotions

Price_LINE | UPC_code | Description | Price | Start_DT | End_dt | Row_num | flag
10001 | 049000093322 | Coca-Cola 12 OZ | $6.86 | 01/01/2024 | 09/30/2024 | 1 | 0
10001 | 049000093322 | Coca-Cola 12 OZ | $5.86 | 01/31/2024 | 03/30/2024 | 2 | 0
10001 | 049000028201 | Fanta Pineapple Soda, 20 OZ | $2.89 | 01/01/2024 | 09/30/2024 | 3 | 0
10001 | 054000150296 | Scott 1000 | $1.19 | 01/01/2024 | 09/30/2024 | 4 | 0

PS: This is sample data, but in the real world, there could be thousands or millions of price lines being updated to mark prices down or up on a weekly basis.

The table above captures the UPC codes and the respective items within price line 10001. The issue with this dataset is that the back-office user is trying to create a duplicate line within the same price line through an upload process, without knowing that the data is a duplicate. The intent here is to catch the duplicate record and reject both entries 1 and 2 so that the user can decide which of the two needs to go into the pricing system to be reflected on the website. Using the code below simplifies error detection and also optimizes the stored procedure for better performance.
PLSQL
WITH price_lines AS (
    -- flag is filtered here, so the main query does not need to compare it again
    SELECT rowid AS rid, price_line, upc_code, start_dt, end_dt
    FROM e_promotions
    WHERE price_line = 10001
      AND flag = 0
)
SELECT MIN(a.rid) AS row_id, a.price_line, a.upc_code, a.start_dt, a.end_dt
FROM price_lines a, price_lines b
WHERE a.price_line = b.price_line
  AND a.upc_code = b.upc_code
  AND a.rid <> b.rid
  AND (a.start_dt BETWEEN b.start_dt AND b.end_dt
       OR a.end_dt BETWEEN b.start_dt AND b.end_dt
       OR b.start_dt BETWEEN a.start_dt AND a.end_dt
       OR b.end_dt BETWEEN a.start_dt AND a.end_dt)
GROUP BY a.price_line, a.upc_code, a.start_dt, a.end_dt;

With the code above we did two things at once:

Defined the dataset we need to process a single time using the WITH clause
Added the self join to detect duplicates without writing the table query a second time, which lets Oracle reuse the result and improves the performance of the stored procedure

This is one of the many use cases I have used in the past that gave me significant performance gains in my PL/SQL and SQL coding. Have fun and post your comments if you have any questions!
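The duplicate check in the query boils down to a date-range overlap test between rows that share a UPC. As a rough illustration only (the article's solution is the SQL above), the same rule can be sketched in Java. The class, field, and method names are hypothetical, and the sketch assumes every row has a start date on or before its end date, in which case the four BETWEEN conditions reduce to the two comparisons in overlaps.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory version of the overlap rule used by the self join.
class PriceLineOverlapSketch {

    static final class PriceLine {
        final int rowNum;
        final String upc;
        final LocalDate start;
        final LocalDate end;

        PriceLine(int rowNum, String upc, LocalDate start, LocalDate end) {
            this.rowNum = rowNum;
            this.upc = upc;
            this.start = start;
            this.end = end;
        }
    }

    // Two inclusive date ranges overlap when each one starts on or before the other ends.
    static boolean overlaps(PriceLine a, PriceLine b) {
        return !a.start.isAfter(b.end) && !b.start.isAfter(a.end);
    }

    // Returns the row numbers that conflict with at least one other row for the same UPC.
    static List<Integer> conflictingRows(List<PriceLine> lines) {
        List<Integer> conflicts = new ArrayList<>();
        for (PriceLine a : lines) {
            for (PriceLine b : lines) {
                if (a != b && a.upc.equals(b.upc) && overlaps(a, b)) {
                    conflicts.add(a.rowNum);
                    break; // one conflict is enough to reject the row
                }
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        List<PriceLine> lines = new ArrayList<>();
        lines.add(new PriceLine(1, "049000093322", LocalDate.of(2024, 1, 1), LocalDate.of(2024, 9, 30)));
        lines.add(new PriceLine(2, "049000093322", LocalDate.of(2024, 1, 31), LocalDate.of(2024, 3, 30)));
        lines.add(new PriceLine(3, "049000028201", LocalDate.of(2024, 1, 1), LocalDate.of(2024, 9, 30)));
        lines.add(new PriceLine(4, "054000150296", LocalDate.of(2024, 1, 1), LocalDate.of(2024, 9, 30)));
        System.out.println(conflictingRows(lines)); // rows 1 and 2 share a UPC and overlap
    }
}
```

For large uploads, the nested loop is quadratic, which is one reason to leave this work to the database where the join can be optimized.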

By Sachin More

Mastering Date/Time APIs: Challenges With Java's Calendar and JDK Date/Time APIs

In this article, we'll address four problems covering different date-time topics. These problems are mainly focused on the Calendar API and on the JDK Date/Time API.

Disclaimer: This article is an excerpt from my recent book Java Coding Problems, Second Edition.

Use the following problems to test your programming prowess on date and time. Remember that there usually isn't a single correct way to solve a particular problem. Also, remember that the explanations shown here include only the most interesting and important details needed to solve the problems. Download the example solutions to see additional details and to experiment with the programs.

1. Defining a Day Period

Problem: Write an application that goes beyond AM/PM flags and splits the day into four periods: night, morning, afternoon, and evening. Depending on the given date-time and time zone, generate one of these periods.

Let's imagine that we want to say hello to a friend from another country (in a different time zone) via a message such as Good morning, Good afternoon, and so on based on their local time. So, having access to AM/PM flags is not enough, because we consider that a day (24 hours) can be represented by the following periods:

• 9:00 PM (or 21:00) – 5:59 AM = night
• 6:00 AM – 11:59 AM = morning
• 12:00 PM – 5:59 PM (or 17:59) = afternoon
• 6:00 PM (or 18:00) – 8:59 PM (or 20:59) = evening

Before JDK 16

First, we have to obtain the time corresponding to our friend's time zone. For this, we can start from our local time given as a java.util.Date, java.time.LocalTime, and so on. If we start from a java.util.Date, then we can obtain the time in our friend's time zone as follows:

Java
LocalTime lt = date.toInstant().atZone(zoneId).toLocalTime();

Here, date is a new Date() and zoneId is a java.time.ZoneId. Of course, we can pass the zone ID as a String and use the ZoneId.of(String zoneId) method to get the ZoneId instance.
If we prefer to start from LocalTime.now(), then we can obtain the time in our friend's time zone as follows:

Java
LocalTime lt = LocalTime.now(zoneId);

Next, we can define the day periods as a bunch of LocalTime instances and add some conditions to determine the current period. The following code exemplifies this statement:

Java
public static String toDayPeriod(Date date, ZoneId zoneId) {

    LocalTime lt = date.toInstant().atZone(zoneId).toLocalTime();

    LocalTime morning = LocalTime.of(6, 0);
    LocalTime afternoon = LocalTime.of(12, 0);
    LocalTime evening = LocalTime.of(18, 0);
    LocalTime night = LocalTime.of(21, 0);

    // Each period includes its start time, so 06:00:00 is already "morning"
    // and 21:00:00 is already "night"
    if (lt.isBefore(morning) || !lt.isBefore(night)) {
        return "night";
    } else if (lt.isBefore(afternoon)) {
        return "morning";
    } else if (lt.isBefore(evening)) {
        return "afternoon";
    }

    return "evening";
}

Now, let's see how we can do this in JDK 16+.

JDK 16+

Starting with JDK 16, we can go beyond AM/PM flags via the following strings: in the morning, in the afternoon, in the evening, and at night. These friendly outputs are available via the new pattern, B. This pattern is available starting with JDK 16 via DateTimeFormatter and DateTimeFormatterBuilder (you can find these APIs in Chapter 1, Problem 18, in my book; see the figure below for reference).
So, the following code uses the DateTimeFormatter to exemplify the usage of pattern B, representing a period of the day:

    public static String toDayPeriod(Date date, ZoneId zoneId) {
        ZonedDateTime zdt = date.toInstant().atZone(zoneId);
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MMM-dd [B]");
        return zdt.withZoneSameInstant(zoneId).format(formatter);
    }

Here is an output for Australia/Melbourne:

    2023-Feb-04 at night

You can see more examples in the bundled code. Feel free to challenge yourself to adjust this code to reproduce the result from the first example.

2. Converting Between Date and YearMonth

Write an application that converts a java.util.Date to a java.time.YearMonth and vice versa.

Converting a java.util.Date to JDK 8 java.time.YearMonth can be done based on YearMonth.from(TemporalAccessor temporal). A TemporalAccessor is an interface (more precisely, a framework-level interface) that exposes read-only access to any temporal object including date, time, and offset (a combination of these is also allowed). So, if we convert the given java.util.Date to java.time.LocalDate, then the result of the conversion can be passed to YearMonth.from() as follows:

    public static YearMonth toYearMonth(Date date) {
        return YearMonth.from(date.toInstant()
            .atZone(ZoneId.systemDefault())
            .toLocalDate());
    }

Vice versa can be obtained via Date.from(Instant instant) as follows:

    public static Date toDate(YearMonth ym) {
        return Date.from(ym.atDay(1).atStartOfDay(
            ZoneId.systemDefault()).toInstant());
    }

Well, that was easy, wasn’t it?

3. Converting Between Int and YearMonth

Let’s consider that a YearMonth is given (for instance, 2023-02). Convert it to an integer representation (for instance, 24277) that can be converted back to YearMonth.

Consider that we have YearMonth.now() and we want to convert it to an integer (for example, this can be useful for storing a year/month date in a database using a numeric field).
Check out the solution:

    public static int to(YearMonth u) {
        return (int) u.getLong(ChronoField.PROLEPTIC_MONTH);
    }

The proleptic-month is a java.time.temporal.TemporalField, which basically represents a date-time field such as month-of-year (our case) or minute-of-hour. The proleptic-month starts from 0 and counts the months sequentially from year 0, that is, year * 12 + (month - 1). So, getLong() returns the value of the specified field (here, the proleptic-month) from this year-month as a long. We can cast this long to int since the proleptic-month shouldn’t go beyond the int domain (for instance, for 2023/2 the returned int is 2023 * 12 + 1 = 24277).

Vice versa can be accomplished as follows:

    public static YearMonth from(int t) {
        return YearMonth.of(1970, 1)
            .with(ChronoField.PROLEPTIC_MONTH, t);
    }

You can start from any year/month. Choosing 1970/1 (known as the epoch and the starting point for java.time.Instant) was arbitrary.

4. Converting Week/Year to Date

Problem Statement: Consider that two integers are given representing a week and a year (for instance, week 10, year 2023). Write a program that converts 10-2023 to a java.util.Date via Calendar and to a LocalDate via the WeekFields API. Also, do vice versa: from a given Date/LocalDate extract the year and the week as integers.

Solutions: Let’s consider the year 2023, week 10. The corresponding date is Sun Mar 05 15:15:08 EET 2023 (of course, the time component is relative). Converting the year/week to java.util.Date can be done via the Calendar API as in the following self-explanatory snippet of code:

    public static Date from(int year, int week) {
        Calendar calendar = Calendar.getInstance();
        calendar.set(Calendar.YEAR, year);
        calendar.set(Calendar.WEEK_OF_YEAR, week);
        calendar.set(Calendar.DAY_OF_WEEK, 1); // 1 = Calendar.SUNDAY
        return calendar.getTime();
    }

If you prefer to obtain a LocalDate instead of a Date then you can easily perform the corresponding conversion or you can rely on java.time.temporal.WeekFields.
This API exposes several fields for working with week-of-year, week-of-month, and day-of-week. This being said, here is the previous solution written via WeekFields to return a LocalDate:

    public static LocalDate from(int year, int week) {
        WeekFields weekFields = WeekFields.of(Locale.getDefault());
        return LocalDate.now()
            .withYear(year)
            .with(weekFields.weekOfYear(), week)
            .with(weekFields.dayOfWeek(), 1);
    }

On the other hand, if we have a java.util.Date and we want to extract the year and the week from it, then we can use the Calendar API. Here, we extract the year:

    public static int getYear(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.YEAR);
    }

And here, we extract the week:

    public static int getWeek(Date date) {
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar.get(Calendar.WEEK_OF_YEAR);
    }

Getting the year and the week from a LocalDate is easy thanks to ChronoField.YEAR and ChronoField.ALIGNED_WEEK_OF_YEAR (aligned weeks are counted in 7-day blocks starting on January 1, so the value can differ from the locale-based week-of-year):

    public static int getYear(LocalDate date) {
        return date.get(ChronoField.YEAR);
    }

    public static int getWeek(LocalDate date) {
        return date.get(ChronoField.ALIGNED_WEEK_OF_YEAR);
    }

Of course, getting the week can be accomplished via WeekFields as well:

    return date.get(WeekFields.of(
        Locale.getDefault()).weekOfYear());

Challenge yourself to obtain week/month and day/week from a Date/LocalDate.
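
As a complement to Problem 3, here is a minimal sketch of the YearMonth/int round trip in which the reverse conversion is computed by hand, so the arithmetic behind the proleptic-month is visible. The class and method names (YearMonthInts, toInt, fromInt) are illustrative assumptions and are not taken from the book's bundled code.

```java
import java.time.YearMonth;
import java.time.temporal.ChronoField;

class YearMonthInts {

    // YearMonth -> int: proleptic-month = year * 12 + (month - 1).
    // Math.toIntExact fails loudly instead of silently truncating.
    static int toInt(YearMonth ym) {
        return Math.toIntExact(ym.getLong(ChronoField.PROLEPTIC_MONTH));
    }

    // int -> YearMonth, undoing the formula above by hand.
    // floorDiv/floorMod keep the result correct for negative values too.
    static YearMonth fromInt(int prolepticMonth) {
        int year = Math.floorDiv(prolepticMonth, 12);
        int month = Math.floorMod(prolepticMonth, 12) + 1;
        return YearMonth.of(year, month);
    }

    public static void main(String[] args) {
        int encoded = toInt(YearMonth.of(2023, 2));
        System.out.println(encoded);          // 24277
        System.out.println(fromInt(encoded)); // 2023-02
    }
}
```

Because the integer grows by exactly one per month, values stored this way in a numeric database column also sort chronologically.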

By Anghel Leonard

CORE

Event-Driven vs Event-Sourced: A Common Misunderstanding

In today’s world of software development, systems containing some sort of event constructs are increasing in popularity. While this is primarily driven by message-based communication mechanisms, events are also used in different scopes and contexts. The frequent use of the term “event” leads to confusion, which is often observed in discussions about various software architectures among people who are new to these concepts. The terms “event-driven” and “event-sourced” are often used interchangeably, while in reality, the two are very different concepts. In this article, we are going to explore the key characteristics of each, explain how they differ, and how they complement each other. We will focus on clarifying the key differences, not a deep dive into each concept.

Before we dive in, let’s clarify the definition of an “event” in both event-driven and event-sourced systems. An event is an immutable record describing something that has happened in the past. Therefore, the data that an event contains cannot be changed. Immutability and description of the past are fundamental characteristics of events.

Event-Driven

Event-driven architecture (EDA) is an architectural style that primarily uses events as a communication mechanism between various components of a larger system, or between systems. Events are handled asynchronously in a fire-and-forget manner, i.e. the publisher does not wait for a response from consumers. Other parts of the system that subscribe to the event handle it and can trigger larger workflows composed of further events. For example, a microservice-based system may use events to communicate the occurrence of a fact in one microservice. Those events can be handled by consumers in other microservices, which in turn can emit their own events to communicate their changes. Events are published in a publish-subscribe manner, allowing multiple consumers to handle the same event simultaneously. Each consumer receives its own copy of the event.
The asynchronous nature of EDA can often lead to eventual consistency. This means that the state of the system might not be instantaneously reflected across components. EDA systems typically, but not always, leverage a message broker to transmit events from the publisher to consumers. Components often use two types of events: domain and integration. Domain events are internal to the components that emit them, while integration events are used for communication between components. EDA is typically used by most, if not all, components in a particular system, with each publishing and subscribing to a subset of events.

Event-Sourced

Unlike EDA, which is an architectural style leveraging events for communication, Event Sourcing (ES) is a persistence mechanism. Events are used to reflect changes to the state of an entity (or other constructs such as an aggregate in Domain-Driven Design) and to reconstitute that state. Upon completing a domain operation, a sequence of events is stored as a stream linked to a particular entity. For example, a set of actions performed on a shopping basket may result in the following events:

1. Shopping started
2. Item added
3. Item added
4. Item removed
5. Item added

Upon reconstituting the basket’s state, these events are loaded in chronological order and replayed one by one to update the state. The resulting basket will contain two shopping items.

Event sourcing can be described by the following key characteristics:

• After completing an operation, the entity contains new event(s) reflecting the nature of the change(s), which are then saved to an event store (database).
• Events are strictly linked to a specific entity.
• Typically, events participate in a limited-in-scope, strongly-consistent manner. A sequence of new events will not persist if another process appends events to the stream simultaneously.
• New events are persisted transactionally: either all or none get saved.
• Events are loaded from the stream to reconstitute the entity’s state.
• ES is limited in scope to a single component. A larger system can contain heterogeneous persistence mechanisms, where only a subset uses ES.

Different Yet Complementary

When comparing the characteristics of EDA and ES, it becomes clear that the two are very different. Indeed, they address fundamentally different aspects of software design. While EDA is an architectural style for communication, ES is a persistence mechanism for entities’ state. However, these differences do not mean that the two approaches are mutually exclusive. On the contrary, EDA and ES often go hand in hand, complementing each other.

In such a setup, a component using ES processes business logic and stores its entity’s state in an event-sourced manner. At some point, an event will be published indicating the completion of an action that other components participating in EDA might be interested in. It may be that one particular event, or an aggregation of data from multiple events, needs to be handled elsewhere. In either case, a publisher will map the event(s) to an integration event and publish it, thus allowing its consumption in other components.

The distinct nature of EDA and ES is also their strength: each approach focuses on a different area, but together they contribute to a broader system, enabling an audit trail and fine detail of changes in one component, and leveraging EDA to communicate selected changes to other components.
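
The replay step described in the Event-Sourced section can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Event class, the Type enum, and the item names are hypothetical, and a real event store would add persistence, stream versioning, and concurrency checks that are omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BasketReplay {

    // Hypothetical event types mirroring the shopping basket example
    enum Type { SHOPPING_STARTED, ITEM_ADDED, ITEM_REMOVED }

    // An event is immutable: all fields are final and there are no setters
    static final class Event {
        final Type type;
        final String item; // null for SHOPPING_STARTED

        Event(Type type, String item) {
            this.type = type;
            this.item = item;
        }
    }

    // Reconstitutes the basket state by applying events in stream order
    static List<String> replay(List<Event> stream) {
        List<String> items = new ArrayList<>();
        for (Event e : stream) {
            switch (e.type) {
                case SHOPPING_STARTED: items.clear(); break;
                case ITEM_ADDED: items.add(e.item); break;
                case ITEM_REMOVED: items.remove(e.item); break;
            }
        }
        return Collections.unmodifiableList(items);
    }

    public static void main(String[] args) {
        List<Event> stream = Arrays.asList(
            new Event(Type.SHOPPING_STARTED, null),
            new Event(Type.ITEM_ADDED, "book"),
            new Event(Type.ITEM_ADDED, "pen"),
            new Event(Type.ITEM_REMOVED, "book"),
            new Event(Type.ITEM_ADDED, "lamp"));
        System.out.println(replay(stream)); // [pen, lamp]
    }
}
```

The state is never stored directly; it is always derived from the stream, which is why the same five events always yield a basket with two items.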

By Ernest Sak

VACUUM In Postgres Demystified

Let’s see what VACUUM is in PostgreSQL, how it’s useful, and how it can improve your database performance.

Storage Basics

Before diving into vacuuming, it's important to first understand the fundamentals of data storage in PostgreSQL. While we’ll explain how data is represented internally, we won’t cover every aspect of storage, such as shared memory or buffer pools.

Let's start by examining a table. At its core, data is stored in data files. Each data file is divided into pages, typically 8 kilobytes in size. This structure allows the database to manage growing datasets efficiently without performance issues. PostgreSQL handles these pages individually, so it doesn’t need to load an entire table into memory during a read operation. Internally, these pages are often referred to as blocks, and we’ll use the terms interchangeably throughout this discussion.

Each data page contains the following components:

• Header: 24 bytes long, containing maintenance information for transaction management and pointers to other sections for easy access.
• Line pointers: References to the actual tuples holding the data.
• Free space
• Tuples

Now, let’s look at a diagram:

As shown, tuples are stored starting from the end of the page and grow backward. Thanks to the line pointers, each tuple can be easily located, and new tuples can be added seamlessly to the file.

Every tuple is identified by a Tuple Identifier (TID), which is a pair of numbers: one representing the block (or page) number and the other representing the line pointer number within that page. Here’s what a typical tuple looks like:

As shown, each tuple consists of a header and the actual data. The header contains metadata and several important fields related to transaction management:

• t_xmin: Stores the transaction identifier of the transaction that inserted the tuple.
• t_xmax: Stores the transaction identifier of the transaction that deleted or updated the tuple.
If the tuple hasn’t been deleted or updated, this field will hold a value of 0.

The transaction identifier is a number that increments with every new transaction. The fields t_xmin and t_xmax are essential for determining the visibility of a tuple during a transaction. Essentially, each transaction receives a snapshot that defines which tuples are visible and which should be ignored. This is crucial for Multi-Version Concurrency Control.

Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) enables the database to handle multiple transactions simultaneously, enhancing performance. Instead of blocking transactions when they attempt to access the same data, MVCC allows multiple versions of each entity to exist concurrently, only rolling back transactions that introduce conflicting changes.

This method is especially advantageous when transactions involve modifying unrelated entities or primarily reading data. In such cases, there is seldom a need to roll back a transaction due to invalid changes. Additionally, transactions can proceed independently without waiting for others, which is a key principle in how PostgreSQL manages its data.

However, this approach makes modifications slightly more difficult. Since we don’t want to stop the transactions, we somehow need to control which changes are visible to which transactions. This is determined based on t_xmin and t_xmax fields and visibility maps. This may lead to a situation in which many versions of the same row exist in the database. Let’s see how that’s possible.

Let’s create a table and insert one row into it:

    CREATE TABLE test (id INT);
    INSERT INTO test VALUES (1);

Let’s now start the first transaction and read the table, but don’t commit the transaction yet.

    BEGIN TRANSACTION;
    SELECT * FROM test;

This returns value 1. Let’s now start another transaction that does the following:

    BEGIN TRANSACTION;
    UPDATE test SET id = 2;
    SELECT * FROM test;

This returns value 2.
However, if we now return to the first transaction and rerun the select statement, then we will still get value 1. This indicates that there are two versions of the same row with different visibility to support different transactions.

This is achieved by introducing duplicates and using t_xmin and t_xmax fields. When we first insert the value, the t_xmax of the entity is set to 0, which indicates there is no newer version yet. Later, when we run the UPDATE statement, the t_xmax is changed to indicate the last transaction that should see this row, and a new row is inserted into the same table with t_xmin set to the latest transaction number.

This poses two interesting questions:

• If the second transaction is rolled back, what happens with the newer row?
• On the other hand, what happens with the older row if the second transaction is committed?

Both these situations lead to so-called dead tuples. A dead tuple is a tuple that is not visible to any transaction. It was either removed or updated. Dead tuples degrade performance by wasting disk space and making the database engine read more data when running transactions.

Dead tuples are not removed immediately. Nothing happens when we commit a transaction that creates dead tuples (by deleting rows or updating them) and the modified data is ready to be used by other transactions. Old tuples are only marked for deletion and we need to remove them periodically. This leads to the process called vacuuming.

Vacuuming

As we saw above, we need to periodically curate our tables to remove dead tuples. This is done by vacuuming, which takes care of removing dead tuples to recover the wasted disk space.

Over the years, vacuuming became a periodic maintenance task that does much more than just removing the dead tuples.
It does the following:

• Updates statistics used by the PostgreSQL query planner
• Updates the visibility map used by the index-only scans
• Protects the server from the loss of very old data due to transaction wraparound
• Recovers the disk space for later reuse
• Returns the space to the operating system if possible
• Compacts tables by rewriting them

Vacuuming runs in two modes: regular and full. Regular vacuuming doesn’t try to minimize disk space usage but wants to achieve stability and steady usage of disk space. This should be sufficient for most of our day-to-day operations and we should rarely need a full vacuuming that shrinks the tables.

Vacuuming runs automatically and can be scheduled or run on demand with VACUUM or VACUUM FULL. A typical approach is to let it run automatically or schedule a database-wide VACUUM once a day during an off-peak period (typically at night).

When it comes to updating the statistics, the vacuuming daemon automatically runs the ANALYZE command whenever it decides that the content of a table has changed sufficiently. This may not be enough if we modify fewer rows or if we modify important columns (as the daemon does not know the distribution of the values). Therefore, we may need to manually analyze the tables to keep the statistics up to date. This may be especially useful after batch-loading many rows. We can analyze some tables only or even some columns.

However, the vacuuming daemon does not analyze foreign tables. If we need statistics for them, then we need to analyze them manually. Similarly, the daemon does not analyze the parent partition when children change. We need to analyze them manually.

Yet another thing that vacuuming does is update the visibility map. This is needed to speed up the index-only scans. The auto-vacuuming daemon takes care of that automatically.

Transaction Wraparound

The last activity that vacuuming takes care of is preventing transaction ID wraparound issues.
MVCC depends on the transaction ID (XID) numbers that determine the visibility of rows. Each XID is simply a 32-bit number. In a database with plenty of transactions, we may cause the XID to overflow or wrap around. When this happens, we may run into issues.

Without going much into detail, the row is visible if it had been created before the current transaction started. To determine that, we check whether the t_xmin is less than the current XID. This works as long as XID numbers increase. However, once the transaction number overflows, it starts counting from zero again. Therefore, we may incorrectly conclude that some rows shouldn’t be visible because their t_xmin is greater than the current XID.

To fix that, vacuuming marks rows as frozen, which means that they were created sufficiently far in the past to be handled differently. When determining if a row should be visible, we don’t compare t_xmin for frozen rows; we just assume they were always created in the past.

The remaining question is which rows we should consider frozen. This is determined by the vacuum_freeze_min_age parameter, which defaults to 50 million. If a row was created 50 million transactions ago, then the vacuuming daemon turns the row into a frozen one. We may want to increase this parameter to make freezing happen less often.

Transaction wraparound may cause data loss or force the database to shut down. We should prevent this from happening by curating our databases often.

Seeing It In Action

Different vacuuming activities kick in based on different thresholds. By default, the regular vacuum kicks in when the number of dead rows exceeds 20% of the table plus a fixed threshold of 50 rows, and the analyzing task starts when the number of changed rows exceeds 10% of the table plus 50 rows. Let’s see that in action.
Let’s start by creating a test table:

    CREATE TABLE test(id int);

We can now check when it’s been vacuumed:

    SELECT last_vacuum, last_autovacuum
    FROM pg_stat_all_tables
    WHERE schemaname = 'public' AND relname = 'test';

It should return the following:

    last_vacuum              last_autovacuum
    (null)                   (null)

We can now insert many rows and see that the table still hasn’t been vacuumed:

    INSERT INTO test SELECT * FROM generate_series(1,1000);

    SELECT last_vacuum, last_autovacuum
    FROM pg_stat_all_tables
    WHERE schemaname = 'public' AND relname = 'test';

    last_vacuum              last_autovacuum
    (null)                   (null)

We can now update the table:

    UPDATE test SET id = -id WHERE id < 500;

And this time, the autovacuum should kick in as we modified nearly 50% of the table:

    SELECT last_vacuum, last_autovacuum
    FROM pg_stat_all_tables
    WHERE schemaname = 'public' AND relname = 'test';

    last_vacuum              last_autovacuum
    (null)                   "2024-09-11 15:22:38"

We can also vacuum the table manually:

    VACUUM public.test;

We should now see the following:

    SELECT last_vacuum, last_autovacuum
    FROM pg_stat_all_tables
    WHERE schemaname = 'public' AND relname = 'test';

    last_vacuum              last_autovacuum
    "2024-09-11 15:24:09"    "2024-09-11 15:22:38"

Fine-Tuning

We can adjust many parameters to configure how often the vacuuming runs.

• autovacuum: Controls whether the autovacuum background process is enabled or disabled. By default, this feature is turned on.
• autovacuum_vacuum_threshold: Specifies the minimum number of dead rows (updated or deleted) that must accumulate in a table before the vacuum process is triggered. The default setting is 50.
• autovacuum_analyze_threshold: Defines the minimum number of inserted, updated, or deleted rows needed in a table before the analyze process is initiated. The default value is 50.
• autovacuum_vacuum_scale_factor: A fraction of the table's size that is added to autovacuum_vacuum_threshold when deciding whether to trigger a vacuum. The default is 0.2.
• autovacuum_analyze_scale_factor: A fraction of the table's size that is added to autovacuum_analyze_threshold when deciding whether to trigger an analyze. The default value is 0.1.
• autovacuum_vacuum_cost_delay: The time (in milliseconds) that an autovacuum worker sleeps each time its accumulated I/O cost reaches the cost limit. The default is 2 milliseconds since PostgreSQL 12 (20 milliseconds in earlier versions).
• autovacuum_vacuum_cost_limit: The accumulated I/O cost at which an autovacuum worker pauses for the cost delay. The default is -1, which means the value of vacuum_cost_limit (200 by default) is used.

We can also follow some best practices:

• Don’t run manual VACUUM without a reason: manual vacuum may collide with other activities and cause I/O and CPU spikes.
• If running a VACUUM manually, run it on a table-by-table basis. Do it outside of peak hours.
• Run VACUUM ANALYZE when the distribution of the values changes significantly.
• Run VACUUM ANALYZE after bulk-loading many rows.
• Avoid VACUUM FULL. Run it only when performance degrades significantly.
• Adjust vacuuming thresholds based on your specific needs. Postgres should do quite well without any intervention, so you can rely on your defaults.

Summary

Vacuuming in PostgreSQL is important because it helps manage dead rows left behind after updates or deletes, preventing table bloat and reclaiming storage space. It also keeps queries efficient by updating table statistics for the query planner, and it prevents transaction ID wraparound, which can lead to data loss if not addressed. Regular vacuuming ensures the database remains optimized and functions smoothly.
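
To make the interplay of the threshold and scale-factor parameters concrete, here is a small sketch of the arithmetic in Java. It is not PostgreSQL code: the class and method names are hypothetical, and the constants are simply the default values of autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor quoted in the list above. The rule it models is that a table qualifies for autovacuum once its dead rows exceed threshold + scale factor * number of rows.

```java
class AutovacuumThreshold {

    // Defaults from the Fine-Tuning list (assumed unchanged on the server)
    static final long VACUUM_THRESHOLD = 50;        // autovacuum_vacuum_threshold
    static final double VACUUM_SCALE_FACTOR = 0.2;  // autovacuum_vacuum_scale_factor

    // Number of dead rows above which autovacuum considers the table
    static double vacuumTriggerPoint(long rowCount) {
        return VACUUM_THRESHOLD + VACUUM_SCALE_FACTOR * rowCount;
    }

    static boolean needsVacuum(long rowCount, long deadRows) {
        return deadRows > vacuumTriggerPoint(rowCount);
    }

    public static void main(String[] args) {
        // The demo table: 1,000 rows, 499 of them updated (one dead row each)
        System.out.println(needsVacuum(1000, 499)); // true
        // The same table with only 100 updated rows stays below the trigger point
        System.out.println(needsVacuum(1000, 100)); // false
    }
}
```

For the 1,000-row demo table the trigger point is 50 + 0.2 * 1000 = 250 dead rows, which is why updating 499 rows was enough to wake up the autovacuum daemon.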

By Adam Furmanek

The Battle of Data: Statistics vs Machine Learning

The goal of this article is to investigate the fields of statistics and machine learning and look at the differences, similarities, usage, and ways of analyzing data in these two branches. Both branches of science allow interpreting data; however, they are based on different pillars: statistics on mathematics, and machine learning on computer science.

Introduction

Artificial intelligence, together with machine learning, is presently the technologically advanced means of extracting useful information from the raw data that is changing every day around us. By contrast, statistics, a field of research more than three centuries old, has always been regarded as a core discipline for interpreting collected data and making decisions. Even though both share the goal of studying data, how that goal is achieved and where the focus lies differ between statistics and machine learning. This article seeks to relate the two fields and show how they address the needs of contemporary society as the field of data science expands.

1. Foundations and Definitions

Statistics

This is a branch of mathematics that revolves around the organization, evaluation, analysis, and representation of numerical data. It has developed over three hundred years and finds application in fields such as economics, health sciences, and social studies.

Machine Learning (ML)

This is the area of computer science that involves extracting knowledge from data in order to help systems make decisions in the future. It includes algorithms that are capable of identifying very sophisticated patterns and extending them to novel, unseen data. Machine learning is a much younger field: it has been developing for about 30+ years.

2. Key Differences Between Statistics and Machine Learning

Aspect | Statistics | Machine Learning
Assumptions | Assumes relationships between variables (e.g., alpha, beta) before building models | Makes fewer assumptions, and can model complex relationships without prior knowledge
Interpretability | Focuses on interpretation: parameters like coefficients provide insight into how variables influence outcomes | Focuses on predictive accuracy: often works with complex algorithms (e.g., neural networks) that act as “black boxes”
Data Size | Traditionally works with smaller, structured datasets | Designed to handle large, complex datasets, including unstructured data (e.g., text, images)
Applications | Used in areas like social sciences, economics, and medicine for making inferences about populations | Applied in AI, computer vision, NLP, and recommender systems, focusing on predictive modeling

3. Learning Approaches

Statistics

The methods are static in that they start from an existing proposition: a hypothesis is proposed, and a sample is tested against it to either reject or support it. Often the aim is to limit the bias within the sample when an inference from sample to population is made.

Machine Learning

The methods have an active rather than static outlook. The algorithm is able to recognize patterns in the data without any predefined pattern. Machine learning models are about discovering structure in the data rather than just testing hypotheses.

4. Example: Linear Regression in Both Fields

The same linear regression formula, y = mx + b (or y = ax + b), is common to both statistics and machine learning; however, the methodologies are different: In statistics, as part of analysis and description, the model is constructed so that the target variable is represented as a function of the input variables, with the emphasis on estimating the model parameters.
Machine learning adopts the same model in order to reduce the error between the predicted output and the actual output, whereas the statistical approach is principally directed towards fitting and understanding the parameters.

5. Applications of Statistics vs. Machine Learning

Applications | Statistics | Machine Learning
Social Sciences | Used for sampling to make inferences about large populations | Predictive models for identifying patterns in survey data
Economics and Medicine | Statistical models (e.g., ANOVA, t-tests) to identify significant trends | AI models to predict patient outcomes or stock market trends
Quality Control | Applies hypothesis testing for quality assurance | AI-driven automation in manufacturing for predictive maintenance
Artificial Intelligence (AI) | Less common in AI due to its focus on smaller datasets | Central to AI, including in computer vision and NLP

6. Example Algorithms in Each Field

Statistics Algorithms | Machine Learning Algorithms
Linear Regression | Decision Trees
Logistic Regression | Neural Networks
ANOVA (Analysis of Variance) | Support Vector Machines (SVM)
t-tests, Chi-square tests | k-Nearest Neighbors (KNN)
Hypothesis Testing | Random Forests

7. Handling Data

Statistics

Statistics is most effective with well-defined, clean datasets, where the dependence among the variables is either linear or otherwise known.

Machine Learning

Machine learning does well with big, messy, and unstructured data (such as pictures and videos) that has no predefined format. It can also deal with nonlinear relationships that are often difficult to model with statistical techniques.

Conclusion: Choosing the Right Tool

It is clear that both statistics and machine learning are useful in the analysis of data. However, a decision has to be made about which one to use in which scenario.
Statistics is appropriate when there is a need to analyze data and establish how independent and dependent variables are related, especially when working with lower-dimensional structured data. Machine learning is appropriate when the objective is predictive modeling with vast or unstructured data, and where predictive accuracy takes precedence over explanatory power.

In modern times, these two approaches are usually used together. For example, a data analyst may first explore the data using statistical approaches, then turn to predictive models to refine the prediction.

Summary Table: Statistics vs. Machine Learning

Factor | Statistics | Machine Learning
Approach | Deductive, starts with hypothesis | Inductive, learns patterns from data
Data Type | Structured, smaller datasets | Large, complex, and unstructured datasets
Interpretability | High: focuses on insights from models | Low: models often function as "black boxes"
Application Areas | Economics, social sciences, medicine | AI, computer vision, natural language processing

By understanding both fields, data scientists can choose the right method based on their goals, whether it's interpreting data or making predictions. Ultimately, the integration of statistics and machine learning is the key to unlocking powerful insights from today’s vast and complex datasets.
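
The linear regression example from section 4 can be illustrated with a short Java sketch that fits y = mx + b in both styles. The closed-form least-squares estimate stands in for the statistical view (estimate and read the parameters), and gradient descent on the mean squared error stands in for the machine learning view (iteratively reduce prediction error). The class name, the toy data, the learning rate, and the step count are all assumptions chosen for illustration.

```java
class LinearRegressionTwoWays {

    // Closed-form ordinary least squares: returns {m, b} for y = m*x + b
    static double[] fitClosedForm(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double m = sxy / sxx;
        return new double[] { m, meanY - m * meanX };
    }

    // Batch gradient descent on the mean squared error, starting from m = b = 0
    static double[] fitGradientDescent(double[] x, double[] y,
                                       double learningRate, int steps) {
        int n = x.length;
        double m = 0, b = 0;
        for (int s = 0; s < steps; s++) {
            double gradM = 0, gradB = 0;
            for (int i = 0; i < n; i++) {
                double error = (m * x[i] + b) - y[i];
                gradM += 2 * error * x[i] / n;
                gradB += 2 * error / n;
            }
            m -= learningRate * gradM;
            b -= learningRate * gradB;
        }
        return new double[] { m, b };
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4, 5 };
        double[] y = { 3, 5, 7, 9, 11 }; // generated from y = 2x + 1

        double[] exact = fitClosedForm(x, y);
        System.out.println(exact[0] + " " + exact[1]); // 2.0 1.0

        // Converges to approximately the same m and b
        double[] learned = fitGradientDescent(x, y, 0.05, 5000);
        System.out.println(learned[0] + " " + learned[1]);
    }
}
```

Both routes end at the same line on this toy data; what differs is the emphasis, which is on the estimated parameters in the first case and on the shrinking error in the second.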

By Vasanthi Govindaraj

Mutable vs. Immutable: Infrastructure Models in the Cloud Era

In the world of infrastructure management, two fundamental approaches govern how resources are deployed and maintained: mutable and immutable infrastructure. These approaches influence how updates are made, how infrastructure evolves, and how consistency is ensured across different environments. Mutable infrastructure refers to systems that can be changed or updated after they’ve been initially deployed. This means that configuration changes, software updates, or patches can be applied directly to existing infrastructure resources without replacing them entirely. For instance, a server can be updated by installing new software, tweaking its settings, or increasing its resources. While the server itself stays the same, its configuration evolves over time. Immutable infrastructure works differently. Once it’s deployed, it can’t be changed or updated. Instead of modifying what’s already there, any updates or changes require replacing the existing infrastructure with a new version that includes the updates. For example, if a new version of an application needs to be deployed, new servers are created with the updated settings, while the old servers are shut down or removed. This approach ensures consistency with each deployment and avoids any unexpected issues from lingering changes. Key Differences Between Mutable and Immutable Infrastructure When comparing mutable and immutable infrastructure, several key differences highlight the strengths and trade-offs of each approach. These differences revolve around how changes are handled, how infrastructure consistency is maintained, and the overall impact on operations. Use Case Mutable Infrastructure Immutable Infrastructure Change Management Allows in-place updates, where changes, updates, or patches can be applied directly to running infrastructure without redeploying. This can be faster and more convenient for making incremental adjustments. Does not allow in-place changes. 
Instead, any update requires the creation of a new instance or infrastructure. The old instance is terminated after the new one is successfully deployed. Changes happen by replacing the infrastructure entirely.

Configuration Drift
Mutable: Prone to configuration drift. Over time, as small changes are applied to systems manually or through different tools, the configuration of the system can deviate from its original state. This makes it harder to maintain consistency.
Immutable: Eliminates configuration drift. Every deployment starts fresh with a new environment, ensuring that the infrastructure always matches the desired state and behaves consistently.

Consistency
Mutable: May lead to inconsistencies if manual changes are made, or if different versions of updates are applied across environments. This can result in unexpected behavior, particularly in environments that are long-lived or frequently updated.
Immutable: Highly consistent because each deployment starts from a clean state. This ensures that every instance of the infrastructure is the same, avoiding unexpected issues from differing configurations.

Downtime
Mutable: Can be updated in place, which often leads to shorter or no downtime. This is crucial for systems that require high availability and cannot afford to be redeployed.
Immutable: May involve temporary downtime during the replacement process, depending on the strategy used (e.g., blue-green or rolling deployments). However, modern techniques like blue-green or canary deployments can mitigate this issue by seamlessly transitioning between old and new infrastructure.

Rollback Complexity
Mutable: Rolling back changes can be complex because reverting to a previous configuration may not fully restore the infrastructure to its original state. In most cases, manual intervention may be required.
Immutable: Rollback is simpler. Since new infrastructure is created for every change, rolling back is as easy as redeploying the previous, working version. This reduces the risk of failure when reverting to an earlier state.

Security
Mutable: Security patches are applied directly to running systems, which can leave them vulnerable to attack if patches are delayed or improperly applied. Manual patching introduces risks.
Immutable: More secure because each update replaces the entire system with a fresh instance. This ensures that any vulnerabilities from previous configurations are completely removed, and patched versions are consistently applied.

Operational Overhead
Mutable: Requires more maintenance and monitoring to ensure that updates are properly applied and systems remain secure over time. This can increase operational overhead, especially when managing large, complex environments.
Immutable: Reduces operational overhead by standardizing updates and deployments. Since there’s no need to manage in-place updates, less effort is required to maintain consistency across environments.

Resource Efficiency
Mutable: More resource-efficient in some cases since updates are made to the existing resources without creating new ones. This can save costs and time, especially for updates that don’t require full system redeployment.
Immutable: Requires the creation of new resources for every update, which can lead to increased costs and resource usage, particularly for environments that need frequent updates or scaling. However, automation tools can help manage this process more efficiently.

Use Cases
Mutable: Suited for environments where rapid, in-place updates are needed without full redeployment (e.g., databases, development and testing environments, or legacy systems). It’s also practical for environments where costs need to be minimized by reusing existing infrastructure.
Immutable: Best for environments where consistency, security, and reliability are critical, such as production environments, containerized applications, and microservices. Immutable infrastructure is ideal for organizations that prioritize automation, scaling, and continuous delivery.

Infrastructure Lifespan
Mutable: Often associated with long-lived infrastructure where the same resources are maintained and updated over time. These systems may evolve as they stay active, leading to drift over time.
Immutable: Typically associated with short-lived infrastructure. Resources are frequently replaced with newer versions, reducing the need for ongoing maintenance of the same system.

When to Choose Mutable Infrastructure

While immutable infrastructure is often preferred in modern environments due to its consistency and reliability, there are still several scenarios where choosing mutable infrastructure is more practical and beneficial. Mutable infrastructure offers flexibility for certain use cases where in-place changes, cost-effectiveness, or maintaining long-lived systems are essential. Here are some key situations where you should consider using mutable infrastructure:

Dynamic or evolving environments - In scenarios where the infrastructure needs to frequently adapt to changes, such as in development or testing environments, mutable infrastructure is advantageous. Developers may need to quickly update configurations, modify code, or patch resources without spinning up new instances.

Cost-sensitive environments - Immutable infrastructure can increase operational costs, particularly in cloud environments where spinning up new instances for every update incurs additional expenses. For cost-sensitive environments, such as startups or organizations with tight budgets, mutable infrastructure can be more economical.

Stateful applications - Applications or services that maintain state (such as databases, file systems, or session-oriented services) benefit from mutable infrastructure. In these cases, tearing down and replacing infrastructure can lead to data loss or significant complexity in preserving state.

Legacy systems - Legacy systems often rely on mutable infrastructure due to their architecture and the fact that they weren’t designed with modern immutable practices in mind. Rewriting or migrating legacy applications to immutable infrastructure may be impractical, expensive, or risky, making mutable infrastructure the better choice.

Applications with infrequent updates - For environments or applications where updates are infrequent and the risk of configuration drift is low, mutable infrastructure can be a simple and effective solution. If the system doesn’t require constant adjustments or scaling, maintaining long-lived infrastructure may be sufficient.

Systems that require minimal downtime - Some critical systems cannot afford the downtime that might occur during the replacement of infrastructure, especially in high-availability environments. Mutable infrastructure allows for in-place updates, which can minimize or eliminate downtime altogether.

Systems with complex interdependencies - In environments where services or applications have complex interdependencies, managing immutable infrastructure might become difficult. When numerous components rely on each other, applying changes in place ensures those connections remain intact without needing to redeploy the entire infrastructure.

When to Choose Immutable Infrastructure

Immutable infrastructure has become a popular approach in today's IT world, especially in cloud-native and DevOps environments. It helps prevent configuration drift and guarantees consistent, reliable deployments. However, it's not always the best fit for every situation. There are cases where choosing immutable infrastructure is highly advantageous, particularly when consistency, security, and scalability are crucial priorities.
Here are key situations where choosing immutable infrastructure is the better choice:

Production environments - Immutable infrastructure works exceptionally well in production environments where stability and reliability are key. By avoiding in-place changes, it keeps the production environment consistent and minimizes the risk of unexpected errors from manual updates or configuration drift. This makes it a solid choice when maintaining a reliable system is essential.

Security-conscious environments - In environments where security is a top priority, immutable infrastructure is a more secure option. Since no manual changes are applied after the infrastructure is deployed, there’s less risk of introducing vulnerabilities through untracked or insecure changes.

Microservices architectures - Microservices architectures are built to be modular, making it easy to replace individual components. In these setups, immutable infrastructure plays a key role by ensuring each service is deployed consistently and independently, helping to avoid the risk of misconfigurations. This approach enhances reliability across the system.

CI/CD pipelines and automation - Continuous Integration and Continuous Delivery (CI/CD) pipelines thrive on consistency and automation. Immutable infrastructure complements CI/CD pipelines by ensuring that every deployment is identical and repeatable, reducing the chances of failed builds or broken environments.

Disaster recovery and rollbacks - Immutable infrastructure simplifies disaster recovery and rollbacks. Since infrastructure is not modified after deployment, rolling back to a previous version is as simple as redeploying the last known working configuration. This reduces downtime and makes recovery faster and more reliable.

Scalability and auto-scaling - In environments where scalability is a key requirement, immutable infrastructure supports automatic scaling by creating new instances as needed, rather than modifying existing ones.
This is particularly useful for cloud-native applications or containerized environments where dynamic scaling is a common requirement.

Blue-green or canary deployments - Blue-green and canary deployment strategies are ideal candidates for immutable infrastructure. These deployment methods rely on running the old and new versions side by side (for example, as separate blue and green environments), where the new version is tested before it fully replaces the old one.

Cloud-native and containerized applications - Cloud-native and containerized applications naturally align with immutable infrastructure because they are designed to be stateless, scalable, and disposable. Infrastructure as Code (IaC) tools like Terraform, combined with container orchestration platforms like Kubernetes, benefit greatly from immutable practices.

High-availability systems - High-availability systems that cannot tolerate downtime can benefit from immutable infrastructure. With rolling or blue-green deployments, updates are applied seamlessly, ensuring that there is no disruption to the end user.

Hybrid Approaches: Combining Mutable and Immutable Infrastructure

While immutable infrastructure offers reliability and consistency, and mutable infrastructure provides flexibility and state retention, many organizations can benefit from a hybrid approach. By combining the best of both models, you can build a more flexible infrastructure that adapts to the specific needs of different components in your environment. This balanced approach allows you to address varying requirements more effectively.

A hybrid approach typically involves using immutable infrastructure for stateless services, where consistency and repeatability are essential, and mutable infrastructure for stateful services or legacy systems, where preserving data and flexibility is more important. Terraform, as a powerful Infrastructure as Code (IaC) tool, can manage both models simultaneously, giving you the flexibility to implement a hybrid approach effectively.
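
The hybrid rule described above (replace stateless resources, patch stateful ones in place) can be sketched in a few lines of Java. This is only a conceptual model, not Terraform and not any real provisioning API: the `Resource`, `HybridUpdater`, and `HybridDemo` names, the `res-N` identifiers, and the version strings are all hypothetical and exist purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a deployed resource (not a Terraform or cloud API).
final class Resource {
    final String id;        // identity of the running instance
    final String version;   // software/configuration version it runs
    final boolean stateful; // true for databases, file systems, and similar

    Resource(String id, String version, boolean stateful) {
        this.id = id;
        this.version = version;
        this.stateful = stateful;
    }
}

final class HybridUpdater {
    private int nextId = 1;
    final List<String> log = new ArrayList<>();

    Resource create(String version, boolean stateful) {
        return new Resource("res-" + nextId++, version, stateful);
    }

    // Mutable model: the instance keeps its identity and is patched in place.
    Resource updateInPlace(Resource current, String newVersion) {
        log.add("patched " + current.id + " to " + newVersion);
        return new Resource(current.id, newVersion, current.stateful);
    }

    // Immutable model: a fresh instance is created and the old one is retired.
    Resource replace(Resource current, String newVersion) {
        Resource fresh = create(newVersion, current.stateful);
        log.add("replaced " + current.id + " with " + fresh.id);
        return fresh;
    }

    // Hybrid rule: stateful resources are patched, stateless ones are replaced.
    Resource update(Resource current, String newVersion) {
        return current.stateful
                ? updateInPlace(current, newVersion)
                : replace(current, newVersion);
    }
}

class HybridDemo {
    public static void main(String[] args) {
        HybridUpdater updater = new HybridUpdater();
        Resource web = updater.create("v1", false); // stateless web server
        Resource db = updater.create("v1", true);   // stateful database
        updater.update(web, "v2");
        updater.update(db, "v2");
        updater.log.forEach(System.out::println);
    }
}
```

In this sketch the web server ends up with a new identity after the update while the database keeps its original one, which is the property that lets the stateful component retain its data.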

Use Immutable Infrastructure for Stateless Components

Stateless services and applications that don’t rely on maintaining internal state across sessions are ideal for immutable infrastructure. These services can be replaced or scaled without the need for in-place updates.

Example: With Terraform, you can manage web server deployments behind a load balancer seamlessly. Each time the application is updated, new instances are deployed while the old ones are removed. This ensures that the web servers are always running the latest version of the application, without any risk of configuration drift.

Use Mutable Infrastructure for Stateful Components

Stateful components, such as databases, file systems, or applications that retain session data, require mutable infrastructure to preserve the data across updates. Replacing these components in an immutable model would involve complex data migration and could risk data loss.

Example: Manage a relational database like PostgreSQL using Terraform, where you need to update storage capacity or apply security patches without replacing the database instance. This ensures that the data remains intact while the infrastructure is modified.

Automate Infrastructure Management With Terraform

Terraform’s ability to define infrastructure as code allows you to automate both mutable and immutable deployments. With a hybrid approach, you can use Terraform to manage both types of infrastructure side by side, ensuring consistency where needed and flexibility where required.

Example: Use Terraform to define both immutable and mutable resources in a single deployment plan. For example, use immutable infrastructure for auto-scaling application servers, while managing stateful databases with mutable infrastructure for in-place updates.

Implement Hybrid Scaling Strategies

In environments with both stateless and stateful services, scaling strategies can benefit from a hybrid approach.
Stateless services can be scaled horizontally using immutable infrastructure, while stateful services may require vertical scaling or more complex approaches.

Example: Use Terraform to manage auto-scaling groups for stateless web servers while adjusting database resources (such as memory or CPU) for stateful services through mutable infrastructure. This ensures that both types of services can scale based on their unique needs.

Legacy System Modernization

Many organizations have legacy systems that are critical to their operations but are not easily moved to an immutable infrastructure model. In these cases, a hybrid approach allows organizations to maintain these legacy systems with mutable infrastructure while using immutable infrastructure for newer, cloud-native components.

Example: Use Terraform to manage infrastructure for both legacy and modern applications. The legacy system (e.g., an on-premises ERP) can be maintained with mutable infrastructure, while cloud-native microservices are deployed with immutable infrastructure in the cloud.

Simplifying Disaster Recovery With Hybrid Approaches

A hybrid approach can simplify disaster recovery by leveraging the benefits of both models. Immutable infrastructure can be used for services that need quick rollbacks, while mutable infrastructure can handle systems that need to retain their state after recovery.

Example: In a hybrid cloud environment managed by Terraform, stateless services (like a front-end application) can be redeployed quickly with immutable infrastructure, while stateful services (like a database) can be restored from backups using mutable infrastructure.

Security and Compliance Considerations

Security-sensitive applications benefit from immutable infrastructure due to the reduction in configuration drift and manual changes.
However, some services, especially those involving sensitive data (like customer databases), may require mutable infrastructure for security patching and retention of critical state information. Also, note that compliance evidence has to be updated when immutable infrastructure is used.

Example: Use immutable infrastructure for API gateways and front-end services to ensure they are always deployed with the latest security patches. Simultaneously, manage the back-end databases using mutable infrastructure, allowing for in-place security patches without affecting stored data.

Example: Terraform can be used to manage testing environments with immutable infrastructure, ensuring that each test is performed on a fresh, consistent environment. Meanwhile, long-lived production databases are managed with mutable infrastructure, allowing for updates and scaling without disruption.

Conclusion

A hybrid approach to infrastructure combines the best of both mutable and immutable models, offering flexibility for stateful and legacy systems while ensuring consistency and scalability for stateless services. By using Terraform to manage this blend, organizations can optimize their infrastructure for dynamic needs, balancing reliability, cost-efficiency, and operational flexibility. This approach allows for tailored strategies that meet the unique demands of various applications and services.

By Josephine Eskaline Joyce

CORE

Writing Great Code: The Five Principles of Clean Code

Writing clean code is one of the best ways to make code easy to read, maintain, and improve. Clean code helps reduce errors, improves the quality of the project, and helps other developers and future teams understand and work with the code. Well-organized code reduces mistakes and makes maintenance a lot easier.

1. Use Meaningful Names

Clean code should have meaningful names that describe the purpose of variables, functions, and classes. The name itself should convey what the code does, so that anyone reading it can understand its purpose without referring to any additional documentation.

Example

Bad naming:

    int d; // What does 'd' represent?

Good naming:

    int daysSinceLastUpdate; // Clear, self-explanatory name

Bad method name:

    public void process() {
        // What exactly is this method processing?
    }

Good method name:

    public void processPayment() {
        // Now it's clear that this method processes a payment.
    }

2. Write Small Functions

Functions should be small and focused on doing just one thing. A good rule of thumb is the Single Responsibility Principle (SRP): each function should have only one reason to change.

Example

Bad example (a large function doing multiple things):

    public void processOrder(Order order) {
        validateOrder(order);
        calculateShipping(order);
        processPayment(order);
        sendConfirmationEmail(order);
    }

Good example (functions doing only one thing):

    public void processOrder(Order order) {
        validateOrder(order);
        processPayment(order);
        confirmOrder(order);
    }

    private void validateOrder(Order order) {
        // Validation logic
    }

    private void processPayment(Order order) {
        // Payment processing logic
    }

    private void confirmOrder(Order order) {
        // Send confirmation email
    }

Each function now handles only one responsibility, making it easier to read, maintain, and test.

3. Avoid Duplication

Duplication is one of the biggest problems in messy code. Avoid copying and pasting code. Instead, look for common patterns and extract them into reusable methods or classes.

Example

Bad example (duplicated code):

    double getAreaOfRectangle(double width, double height) {
        return width * height;
    }

    double getAreaOfTriangle(double base, double height) {
        return (base * height) / 2;
    }

Good example (eliminated duplication):

    double getArea(Shape shape) {
        return shape.area();
    }

    abstract class Shape {
        abstract double area();
    }

    class Rectangle extends Shape {
        double width, height;

        @Override
        double area() {
            return width * height;
        }
    }

    class Triangle extends Shape {
        double base, height;

        @Override
        double area() {
            return (base * height) / 2;
        }
    }

Now, the logic for calculating the area is encapsulated in each shape class, eliminating the need for duplicating similar logic.

4. Eliminate Side Effects

Functions should avoid changing state outside their own scope, because such changes cause unintended side effects. When functions have side effects, they become harder to understand and can behave unpredictably.

Example

Bad example (function with side effects):

    public void updateOrderStatus(Order order) {
        if (order.isPaid()) {
            order.setStatus("shipped");
            sendEmailToCustomer(order); // Side effect: sends an email when the status is changed
        }
    }

Good example (no side effects):

    public void updateOrderStatus(Order order) {
        if (order.isPaid()) {
            order.setStatus("shipped");
        }
    }

    public void sendOrderShippedEmail(Order order) {
        // Constant-first comparison avoids a NullPointerException if the status is unset
        if ("shipped".equals(order.getStatus())) {
            sendEmailToCustomer(order);
        }
    }

In the above case, the function's main job is to update the status. Sending an email is another task that is handled in a separate method.

5. Keep Code Expressive

Expressive code makes its intent obvious through its structure, method names, and overall design, so that the reader can easily understand what the code is doing. Comments are rarely needed if the code is clear.

Example

Bad example (unclear code):

    if (employee.type == 1) {
        // Manager
        pay = employee.salary * 1.2;
    } else if (employee.type == 2) {
        // Developer
        pay = employee.salary * 1.1;
    } else if (employee.type == 3) {
        // Intern
        pay = employee.salary * 0.8;
    }

Good example (expressive code):

    if (employee.isManager()) {
        pay = employee.calculateManagerPay();
    } else if (employee.isDeveloper()) {
        pay = employee.calculateDeveloperPay();
    } else if (employee.isIntern()) {
        pay = employee.calculateInternPay();
    }

In the good example above, the method names clearly express the intent. This makes the code easier to read and understand without any comments.

Conclusion

These are some of the key principles for writing clean code: use meaningful names, write small functions, avoid repeating code, remove side effects, and write clear and expressive code. Following these practices makes the code easier to understand and fix.
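
As a follow-up to the expressive-code principle, the pay calculation can be taken one step further by moving the multipliers into an enum, so the if/else chain disappears altogether. The sketch below is one possible variation rather than the article's own design: the `EmployeeType`, `Employee`, and `PayDemo` names and the `payFor` and `calculatePay` methods are hypothetical, while the multipliers (1.2, 1.1, 0.8) are taken from the unclear example above.

```java
// Each employee type carries its own pay multiplier, so no branching is needed.
enum EmployeeType {
    MANAGER(1.2),
    DEVELOPER(1.1),
    INTERN(0.8);

    private final double payMultiplier;

    EmployeeType(double payMultiplier) {
        this.payMultiplier = payMultiplier;
    }

    double payFor(double salary) {
        return salary * payMultiplier;
    }
}

final class Employee {
    private final EmployeeType type;
    private final double salary;

    Employee(EmployeeType type, double salary) {
        this.type = type;
        this.salary = salary;
    }

    // The call site reads as intent: "calculate this employee's pay".
    double calculatePay() {
        return type.payFor(salary);
    }
}

class PayDemo {
    public static void main(String[] args) {
        Employee intern = new Employee(EmployeeType.INTERN, 1000.0);
        System.out.println("Intern pay: " + intern.calculatePay());
    }
}
```

Adding a new employee type then means adding one enum constant instead of touching every conditional that inspects the type.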

By Sasidhar Sunkara

Operationalize a Scalable AI With LLMOps Principles and Best Practices

Organizations are fully adopting Artificial Intelligence (AI) and proving that AI is valuable. Enterprises are looking for valuable AI use cases that abound in their industry and functional areas to reap more benefits. Organizations are responding to opportunities and threats, gaining improvements in sales, and lowering costs. Organizations are recognizing the special requirements of AI workloads and enabling them with purpose-built infrastructure that supports the consolidated demands of multiple teams across the organization. Organizations adopting a shift-left paradigm by planning for good governance early in the AI process will minimize AI efforts for data movement to accelerate model development.

In an era of rapidly evolving AI, data scientists should be flexible in choosing platforms that provide flexibility, collaboration, and governance to maximize adoption and productivity. Let's dive into the workflow automation and pipeline orchestration world. Recently, two prominent terms have appeared in the artificial intelligence and machine learning world: MLOps and LLMOps.

What Is MLOps?

MLOps (Machine Learning Operations) is a set of practices and technology to standardize and streamline the process of construction and deployment of machine learning systems. It covers the entire lifecycle of a machine learning application from data collection to model management. MLOps makes provision for large workloads to accelerate time-to-value. MLOps principles are architected based on DevOps principles to manage applications built with machine learning (ML).

The ML model is created by applying an algorithm to a mass of training data, which will affect the behavior of the model in different environments. Machine learning is not just code; its workflows include three key assets: code, model, and data.

Figure 1: ML solution is comprised of Data, Code, and Model

These assets in the development environment will have the least restrictive access controls and weaker quality guarantees, while those in production will be the highest quality and tightly controlled. In production, the data comes from the real world, where you cannot control how it changes, and this raises several challenges that need to be resolved. For example:

Slow, fragmented, and inconsistent deployment
Lack of reproducibility
Performance reduction (training-serving skew)

To resolve these types of issues, there are combined practices from DevOps, data engineering, and practices unique to machine learning.

Figure 2: MLOps is the intersection of Machine Learning, DevOps, and Data Engineering - LLMOps rooted in MLOps

Hence, MLOps is a set of practices that combines machine learning, DevOps, and data engineering, which aims to deploy and maintain ML systems in production reliably and efficiently.

What Is LLMOps?

The recent rise of Generative AI with its most common form of large language models (LLMs) prompted us to consider how MLOps processes should be adapted to this new class of AI-powered applications. LLMOps (Large Language Models Operations) is a specialized subset of MLOps (Machine Learning Operations) tailored for the efficient development and deployment of large language models. LLMOps ensures that model quality remains high and that data quality is maintained throughout data science projects by providing infrastructure and tools.

Use a consolidated MLOps and LLMOps platform to enable close interaction between data science and IT DevOps to increase productivity and deploy a greater number of models into production faster. Both MLOps and LLMOps bring agility to AI innovation in a project.

LLMOps tools include MLOps tools and platforms, LLMs that offer LLMOps capabilities, and other tools that can help with fine-tuning, testing, and monitoring. Explore more on LLMOps tools.

Differentiate Tasks Between MLOps and LLMOps

MLOps and LLMOps use different processes and techniques for their primary tasks. Table 1 shows a few key tasks and a comparison between the two methodologies:

Task (MLOps vs. LLMOps)

Primary focus
MLOps: Developing and deploying machine-learning models
LLMOps: Specifically focused on LLMs

Model adaptation
MLOps: If employed, it typically focuses on transfer learning and retraining.
LLMOps: Centers on fine-tuning pre-trained models like GPT with efficient methods and enhancing model performance through prompt engineering and retrieval augmented generation (RAG)

Model evaluation
MLOps: Evaluation relies on well-defined performance metrics.
LLMOps: Evaluating text quality and response accuracy often requires human feedback due to the complexity of language understanding (e.g., using techniques like RLHF)

Model management
MLOps: Teams typically manage their models, including versioning and metadata.
LLMOps: Models are often externally hosted and accessed via APIs.

Deployment
MLOps: Deploy models through pipelines, typically involving feature stores and containerization.
LLMOps: Models are part of chains and agents, supported by specialized tools like vector databases.

Monitoring
MLOps: Monitor model performance for data drift and model degradation, often using automated monitoring tools.
LLMOps: Expands traditional monitoring to include prompt-response efficacy, context relevance, hallucination detection, and security against prompt injection threats

Table 1: Key tasks of MLOps and LLMOps methodologies

Adapting MLOps to these implications requires minimal changes to existing tools and processes. Moreover, many aspects do not change:

The separation of development, staging, and production remains the same.
The version control tool and the model registry in the catalog remain the primary channels for promoting pipelines and models toward production.
The data architecture for managing data remains valid and essential for efficiency.
Existing CI/CD infrastructure should not require changes.
The modular structure of MLOps remains the same, with pipelines for model training, model inference, etc.

A summary of key properties of LLMs and the implications for MLOps is given in Table 2.

Key properties of LLMs and their implications for MLOps

LLMs are available in many forms
Property: Proprietary models behind paid APIs; pre-trained models; fine-tuned models.
Implication for MLOps: Projects often develop incrementally, starting from existing, third-party, or open-source models and ending with custom fine-tuned models. This has an impact on the development process.

Prompt engineering
Property: Many LLMs take queries and instructions as input in the form of natural language. Those queries can contain carefully engineered “prompts” to elicit the desired responses.
Implication for MLOps: Designing text templates for querying LLMs is often an important part of developing new LLM pipelines. Many LLM pipelines will use existing LLMs or LLM serving endpoints; the ML logic developed for those pipelines may focus on prompt templates, agents, or “chains” instead of the model itself. The ML artifacts packaged and promoted to production may frequently be these pipelines, rather than models.

Context-based prompt engineering
Property: Many LLMs can be given prompts with examples and context, or additional information to help answer the query.
Implication for MLOps: When augmenting LLM queries with context, it is valuable to use previously uncommon tooling such as vector databases to search for relevant context.

Model size
Property: LLMs are very large deep-learning models, often ranging from gigabytes to hundreds of gigabytes.
Implication for MLOps: Many LLMs may require GPUs for real-time model serving. Since larger models require more computation and are thus more expensive to serve, techniques for reducing model size and computation may be required.

Model evaluation
Property: LLMs are hard to evaluate via traditional ML metrics since there is often no single “right” answer.
Implication for MLOps: Since human feedback is essential for evaluating and testing LLMs, it must be incorporated more directly into the MLOps process, both for testing and monitoring and for future fine-tuning.

Table 2: Key properties of LLMs and implications for MLOps

Semantics of Development, Staging, and Production

An ML solution comprises data, code, and models. These assets are developed, tested, and moved to production through deployments. For each of these stages, we also need to operate within an execution environment. Each of the data, code, models, and execution environments is ideally divided into development, staging, and production.

Data: Some organizations label data as either development, staging, or production, depending on which environment it originated in.

Code: Machine learning project code is often stored in a version control repository, with most organizations using branches corresponding to the lifecycle phases of development, staging, or production.

Model: The model and code lifecycle phases often operate asynchronously, and model lifecycles do not correspond one-to-one with code lifecycles. Hence it makes sense for model management to have its own model registry to manage model artifacts directly. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases.

Semantics: Semantics indicates that when it comes to MLOps, there should always be an operational separation between development, staging, and production environments. More importantly, observe that data, code, and model, which we call assets, will have the least restrictive access controls and the weakest quality guarantees in development, while those in production will be the highest quality and tightly controlled.

Deployment Patterns

Two major patterns can be used to manage model deployment.
The training code (Figure 3, deploy pattern code), which can produce the model, is promoted toward the production environment after the code is developed in the dev environment and tested in the staging environment using a subset of data.

Figure 3: Deploy pattern code

The packaged model (Figure 4, deploy pattern model) is promoted through different environments, and finally to production. Model training is executed in the dev environment. The produced model artifact is then moved to the staging environment for model validation checks, before deployment of the model to the production environment. This approach requires a separate path for deploying ancillary code such as inference and monitoring code: a “deploy code” path where the code for these components is tested in staging and then deployed to production. This pattern is typically used when deploying a one-off model, or when model training is expensive and read-access to production data from the development environment is possible.

Figure 4: Deploy pattern model

The choice of process will also depend on the business use case, maturity of the machine learning infrastructure, compliance and security guidelines, resources available, and what is most likely to succeed for that particular use case. Therefore, it is a good idea to use standardized project templates and strict workflows. Your decisions around packaging ML logic as version-controlled code vs. registered models will help inform your decision about choosing between the deploy models, deploy code, and hybrid architectures.

With LLMs, it is common to package machine-learning logic in new forms. MLflow can be used to package LLMs and LLM pipelines for deployment. Built-in model flavors include:

PyTorch and TensorFlow
Hugging Face Transformers (relatedly, see Hugging Face Transformers’ MLflowCallback)
LangChain
OpenAI API

For other LLM pipelines, MLflow can package them via the MLflow Pyfunc capability, which can store arbitrary Python code.
Figure 5 shows a machine learning operations architecture and process that uses Azure Databricks.

Figure 5: MLOps Architecture (Image source, Azure Databricks)

Key Components of LLM-Powered Applications

The field of LLMOps is quickly evolving. Here are key components and considerations to bear in mind. A single LLM-based application uses some, but not necessarily all, of the following approaches. Any of these approaches can be taken to leverage your data with LLMs.

Prompt engineering is the practice of adjusting the text prompts given to an LLM to extract more accurate or relevant responses from the model. It is very important to craft effective and specialized prompt templates to guide LLM behavior and mitigate risks such as model hallucination and data leakage. This approach is fast and cost-effective and requires no training, but it offers less control than fine-tuning.

Retrieval Augmented Generation (RAG) combines an LLM with external knowledge retrieval. It requires an external knowledge base or database (e.g., a vector database) and a moderate amount of training time (e.g., for computing embeddings). The primary use case of this approach is dynamically updated context and enhanced accuracy, but it significantly increases prompt length and inference computation. RAG LLMs use two systems to obtain external data:

Vector databases: Vector databases help find relevant documents using similarity searches. They can either work independently or be part of the LLM application.

Feature stores: These are systems or platforms to manage and store structured data features used in machine learning and AI applications. They provide organized and accessible data for training and inference processes in machine learning models like LLMs.

Fine-tuning LLMs: Fine-tuning is the process of adapting a pre-trained LLM on a comparatively smaller dataset that is specific to an individual domain or task.
During the fine-tuning process, often only a small number of weights are updated, allowing the model to learn new behaviors and specialize in certain tasks. The advantages of this approach are granular control and high specialization, but it requires labeled data and comes with a computational cost. The term “fine-tuning” can refer to several concepts, with the two most common forms being:

Supervised instruction fine-tuning: This approach involves continuing the training of a pre-trained LLM on a dataset of input-output training examples, typically thousands of them. Instruction fine-tuning is effective for question-answering applications, enabling the model to learn new specialized tasks such as information retrieval or text generation. The same approach is often used to tune a model for a single specific task (e.g., summarizing medical research articles), where the desired task is represented as an instruction in the training examples.

Continued pre-training: This fine-tuning method does not rely on input and output examples but instead uses domain-specific unstructured text to continue the same pre-training process (e.g., next token prediction, masked language modeling). This approach is effective when the model needs to learn new vocabulary or a language it has not encountered before.

Pre-training a model from scratch refers to the process of training a language model on a large corpus of data (e.g., text, code) without using any prior knowledge or weights from an existing model. This is in contrast to fine-tuning, where an already pre-trained model is further adapted to a specific task or dataset. The output of full pre-training is a base model that can be directly used or further fine-tuned for downstream tasks. The advantage of this approach is maximum control, tailored to specific needs, but it is extremely resource-intensive and requires longer training times, from days to weeks.
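Of the approaches above, prompt engineering and RAG are the easiest to illustrate in a few lines. The sketch below is a toy, assuming a three-sentence in-memory corpus (`DOCS`) and a bag-of-words "embedding" in place of a real embedding model and vector database; all function names are hypothetical. It retrieves the most similar document by cosine similarity and places it in a prompt template that constrains the model to the retrieved context.

```python
import math
import re
from collections import Counter
from typing import List

# Assumption: a tiny in-memory corpus standing in for a vector database.
DOCS = [
    "MLflow tracks experiments and packages models for deployment.",
    "Vector databases find relevant documents using similarity search.",
    "Fine-tuning adapts a pre-trained model to a specific domain.",
]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: List[str]) -> str:
    """Prompt template that restricts the model to the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


prompt = build_prompt("How do vector databases find documents?", DOCS)
print(prompt)
```

The retrieved text is appended to every request, which is why RAG increases prompt length and inference cost as noted above, while the instruction to admit insufficient context is one simple way a template can reduce hallucination.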
A good rule of thumb is to start with the simplest approach possible, such as prompt engineering with a third-party LLM API, to establish a baseline. Once this baseline is in place, you can incrementally integrate more sophisticated strategies like RAG or fine-tuning to refine and optimize performance. The use of standard MLOps tools such as MLflow is equally crucial in LLM applications to track performance across iterations of these approaches.

Model Evaluation Challenges

Evaluating LLMs is a challenging and evolving domain, primarily because LLMs often demonstrate uneven capabilities across different tasks. LLMs can be sensitive to prompt variations, demonstrating high proficiency in one task but faltering with slight deviations in prompts. Since most LLMs output natural language, it is very difficult to evaluate the outputs via traditional natural language processing (NLP) metrics. For domain-specific fine-tuned LLMs, popular generic benchmarks may not capture their nuanced capabilities. Such models are tailored for specialized tasks, making traditional metrics less relevant. LLM performance is often evaluated in domains where text is scarce or where evaluation relies on subject matter expert knowledge. In such scenarios, evaluating LLM output can be costly and time-consuming.
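A small experiment shows why traditional metrics struggle with free-form output. The sketch below implements two common string-overlap metrics, exact match and token-level F1, and applies them to a correct paraphrase and a factually wrong answer. The sentences and helper names are made up for illustration.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True only if both strings have the same tokens in the same order."""
    return tokens(prediction) == tokens(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (order-insensitive)."""
    pred, ref = tokens(prediction), tokens(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris is the capital of France"
paraphrase = "The capital city of France is Paris"  # correct, reworded
wrong = "Lyon is the capital of France"  # fluent, but factually wrong

for name, answer in [("paraphrase", paraphrase), ("wrong", wrong)]:
    print(name, exact_match(answer, reference), round(token_f1(answer, reference), 2))
# prints:
# paraphrase False 0.92
# wrong False 0.83
```

Exact match rejects both answers, and token F1 separates the correct paraphrase from the wrong answer by less than 0.1. Neither metric captures factual correctness, which is one reason human feedback and task-specific benchmarks remain necessary.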
Some prominent benchmarks used to evaluate LLM performance include:

BIG-bench (Beyond the Imitation Game Benchmark): A dynamic benchmarking framework, currently hosting over 200 tasks, with a focus on adapting to future LLM capabilities

EleutherAI LM Evaluation Harness: A holistic framework that assesses models on over 200 tasks, merging evaluations like BIG-bench and MMLU, promoting reproducibility and comparability

Mosaic Model Gauntlet: An aggregated evaluation approach, categorizing model competency into six broad domains (shown below) rather than distilling it into a single monolithic metric

LLMOps Reference Architecture

A well-defined LLMOps architecture is essential for managing machine learning workflows and operationalizing models in production environments. The following illustrations show the reference production architecture for LLM-based applications, with key adjustments to the reference architecture from traditional MLOps:

RAG workflow using a third-party API:

Figure 6: RAG workflow using a third-party API (Image Source: Databricks)

RAG workflow using a self-hosted fine-tuned model and an existing base model from the model hub that is then fine-tuned in production:

Figure 7: RAG workflow using a self-hosted fine-tuned model (Image Source: Databricks)

LLMOps: Pros and Cons

Pros

Minimal changes to base model: Most LLM applications make use of existing, pre-trained models, and an internal or external model hub becomes a valuable part of the infrastructure. Adopting one is easy and requires only simple changes.

Easy to model and deploy: The complexities of model construction, testing, and fine-tuning are reduced in LLMOps, enabling quicker development cycles. Deploying, monitoring, and enhancing models also becomes easier. You can leverage expansive language models directly as the engine for your AI applications.
Advanced language models: By utilizing advanced models such as a pre-trained Hugging Face model (e.g., meta-llama/Llama-2-7b, google/gemma-7b) or one from OpenAI (e.g., GPT-3.5-turbo or GPT-4), LLMOps enables you to harness the power of billions or trillions of parameters, delivering natural and coherent text generation across various language tasks.

Cons

Human feedback: Human feedback in monitoring and evaluation loops may be used in traditional ML but becomes essential in most LLM applications. Human feedback should be managed like other data, ideally incorporated into monitoring based on near real-time streaming.

Limitations and quotas: LLMOps comes with constraints such as token limits, request quotas, response times, and output length, affecting its operational scope.

Risky and complex integration: The LLM pipeline will make external API calls, from the model serving endpoint to internal or third-party LLM APIs. This adds complexity, potential latency, and another layer of credential management. Also, integrating large language models as APIs requires technical skills and understanding. Scripting and tool utilization have become integral components, adding to the complexity.

Conclusion

Workload automation is variable and intensive, but it helps close the gap between the data science team and the IT operations team. Planning for good governance early in the AI process will minimize the effort spent on data movement and accelerate model development. The emergence of LLMOps highlights the rapid advancement and specialized needs of the field of generative AI, yet LLMOps is still rooted in the foundational principles of MLOps.
In this article, we have looked at key components, practices, tools, and reference architecture with examples such as:

Major similarities and differences between MLOps and LLMOps
Major deployment patterns to migrate data, code, and models
Semantics of Ops environments such as development, staging, and production
Major approaches to building LLM applications, such as prompt engineering, RAG, fine-tuned models, and pre-trained models, along with key comparisons
LLM serving and observability, including tools and practices for monitoring LLM performance

The end-to-end architecture integrates all components across dev, staging, and production environments. CI/CD pipelines automate deployment upon branch merges.

By Asia Banu Shaik