NodeJS is a very popular platform for building backend services and creating API endpoints. Several large companies use NodeJS in their microservices tech stack, which makes it a very useful platform to learn and know, alongside other popular choices like Java and Python. ExpressJS is a leading framework used for building APIs, and TypeScript provides strict typing support, which is very valuable in large enterprise application development. TypeScript and ExpressJS combined allow the development of robust distributed backend systems. Securing access to such a system is critical. The NodeJS platform offers several options for securing APIs, such as JWT (JSON Web Token), OAuth2, session-based authentication, and more. JWT has seen a rise in adoption due to several key characteristics when it comes to securing APIs. Some of the noteworthy benefits of using JWT to secure APIs are listed below: Stateless authentication: JWT tokens carry all necessary information for authentication with them and don't need any server-side storage. Compact and efficient: JWT tokens are small, allowing for easy transmission over the network. They can be easily sent in HTTP headers. CORS support: As JWT tokens are stateless, they make it easy to handle cross-origin requests. This makes them ideal for Single Page Applications (SPAs) and microservices architectures. Standardized format: JWT tokens follow RFC 7519, the JWT specification, which makes them ideal for cross-platform interoperability. In this tutorial, you will first build a NodeJS-based microservice from scratch using Express and TypeScript. The tutorial implements a library book management system where a user can view as well as edit a catalog of books. Later, you will secure the endpoints of this microservice using JWT. The full code for the tutorial is available on this GitHub link. However, I encourage you to follow along for deeper insights and understanding. Prerequisites To follow along in the tutorial, ensure the below prerequisites are met: Understanding of JavaScript. TypeScript familiarity is a great bonus to have. Understanding of REST API operations, such as GET, POST, PUT, and DELETE. NodeJS and NPM installed on the machine. This can be verified using node -v and npm -v. An editor of choice. Visual Studio Code was used in the development of this tutorial and is a good choice. Initiating the New NodeJS App Create a new folder on your local machine and initiate a new NodeJS application using the commands below: GitHub Flavored Markdown mkdir ts-node-jwt cd ts-node-jwt npm init -y The NodeJS app you will build uses TypeScript and ExpressJS. Install the necessary dependencies using the npm commands below: GitHub Flavored Markdown npm install typescript ts-node-dev @types/node --save-dev npm install express dotenv npm install @types/express --save-dev The next step is to initiate and define the TypeScript configuration.
Use the command below to create a new TypeScript configuration: GitHub Flavored Markdown npx tsc --init At this point, open the project folder in your editor, locate the freshly created tsconfig.json file, and update its content as per below: JSON { "compilerOptions": { "target": "ES6", "module": "commonjs", "strict": true, "esModuleInterop": true, "forceConsistentCasingInFileNames": true, "skipLibCheck": true, "outDir": "./dist" }, "include": ["src"], "exclude": ["node_modules"] } Creating the Basic App With No Authentication Create a folder named src inside the project root, and inside this src directory, create a file named server.ts. This file will contain basic server boot-up code. TypeScript /*Path of the file: project-root/src/server.ts*/ import express from 'express'; import dotenv from 'dotenv'; import router from './routes'; dotenv.config(); const app = express(); const PORT = process.env.PORT || 3000; app.use(express.json()); app.use('/api', router); app.listen(PORT, () => { console.log(`Server is running on http://localhost:${PORT}`); }); Create a directory named routes under src and add a file index.ts to it. This file will hold the routing details and route handlers for the APIs needed to implement the catalog system that you are building. TypeScript /*Path of the file: project-root/src/routes/index.ts*/ import { Router } from 'express'; import { getAllBooks, getBookById, addNewBook, removeBook } from '../controllers/bookController'; const router = Router(); router.get('/books', getAllBooks); router.get('/books/:id', getBookById); router.post('/books', addNewBook); router.delete('/books/:id', removeBook); export default router; Next, you should create a book controller. This controller will hold code for receiving, handling, and responding to actual API calls. Create a controllers directory under src. Add a file named bookController.ts under the src/controllers directory. Add the code below to this file. This controller code receives each individual API call, parses its request when needed, then interacts with the service layer (which you will build in the next steps), and responds to the user.
TypeScript /*Path of the file: project-root/src/controllers/bookController.ts*/ import { Request, Response } from 'express'; import { getBooks, findBookById, addBook, deleteBook } from '../services/bookService'; export const getAllBooks = (req: Request, res: Response): void => { const books = getBooks(); res.json(books); }; export const getBookById = (req: Request, res: Response): void => { const bookId = parseInt(req.params.id); if (isNaN(bookId)) { res.status(400).json({ message: 'Invalid book ID' }); return; } const book = findBookById(bookId); if (!book) { res.status(404).json({ message: 'Book not found' }); return; } res.json(book); }; export const addNewBook = (req: Request, res: Response): void => { const { title, author, publishedYear } = req.body; if (!title || !author || !publishedYear) { res.status(400).json({ message: 'Missing required fields' }); return; } const newBook = { id: Date.now(), title, author, publishedYear }; addBook(newBook); res.status(201).json(newBook); }; export const removeBook = (req: Request, res: Response): void => { const bookId = parseInt(req.params.id); if (isNaN(bookId)) { res.status(400).json({ message: 'Invalid book ID' }); return; } const book = findBookById(bookId); if (!book) { res.status(404).json({ message: 'Book not found' }); return; } deleteBook(bookId); res.status(200).json({ message: 'Book deleted successfully' }); }; The controller interacts with the book service to perform reads and writes on the book database. Create a JSON file as per below with dummy books, which will act as the database. JSON /*Path of the file: project-root/src/data/books.json*/ [ { "id": 1, "title": "To Kill a Mockingbird", "author": "Harper Lee", "publishedYear": 1960 }, { "id": 2, "title": "1984", "author": "George Orwell", "publishedYear": 1949 }, { "id": 3, "title": "Pride and Prejudice", "author": "Jane Austen", "publishedYear": 1813 } ] Next, read these book details in the service file and provide methods for updating the books as well. This code implements an in-memory book database. Add a services directory under src and add a file bookService.ts with the code below. TypeScript /*Path of the file: project-root/src/services/bookService.ts*/ import fs from 'fs'; import path from 'path'; interface Book { id: number; title: string; author: string; publishedYear: number; } let books: Book[] = []; export const initializeBooks = (): void => { const filePath = path.join(__dirname, '../data/books.json'); const data = fs.readFileSync(filePath, 'utf-8'); books = JSON.parse(data); }; export const getBooks = (): Book[] => { return books; }; export const findBookById = (id: number): Book | undefined => { return books.find((b) => b.id === id); }; export const addBook = (newBook: Book): void => { books.push(newBook); }; export const deleteBook = (id: number): void => { books = books.filter((b) => b.id !== id); }; export const saveBooks = (): void => { const filePath = path.join(__dirname, '../data/books.json'); fs.writeFileSync(filePath, JSON.stringify(books, null, 2)); }; The initial version of the application is almost ready. Update the server.ts code to initialize the database and then add a server startup script in the package.json file.
TypeScript /*Path of the file: project-root/src/server.ts*/ import express from 'express'; import dotenv from 'dotenv'; import router from './routes'; import { initializeBooks } from './services/bookService'; dotenv.config(); initializeBooks(); const app = express(); const PORT = process.env.PORT || 3000; app.use(express.json()); app.use('/api', router); app.listen(PORT, () => { console.log(`Server is running on http://localhost:${PORT}`); }); JSON /*Path of the file: project-root/package.json*/ ...rest of file "scripts": { "test": "echo \"Error: no test specified\" && exit 1", "start": "ts-node-dev src/server.ts" }, ...rest of file Finally, start the application by using the command npm start. You should see output like below on the screen, and the server should start. App Running Testing the APIs Without Authentication Now that the server is up, you should be able to test the API. Use a tool such as Postman and access the URL http://localhost:3000/api/books to get responses from the APIs. You should see a response like the one below: API Call Similarly, you can use the API endpoints to update or delete books as well. I have created a Postman collection, which you should be able to import and use inside Postman. You can get it at this link. The API for creating new books is http://localhost:3000/api/books, and the API to delete books is http://localhost:3000/api/books/:id. Implementing JWT Authentication At this point, you are ready to secure the APIs. You will need a list of users who can access the book management APIs. Create a dummy users.json file under the data directory to hold our in-memory users. JSON /*Path of the file: project-root/src/data/users.json*/ [ { "id": 1, "username": "john_doe", "email": "john@example.com", "password": "password1" }, { "id": 2, "username": "jane_doe", "email": "jane@example.com", "password": "password2" } ] Now it is time to create two files, userService.ts and userController.ts, which will hold the logic to provide a route that authenticates a user based on username and password.
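These services sign and later verify tokens with the jsonwebtoken package, which the earlier install commands did not include. Assuming you are following along from scratch, install it along with its type definitions first:
GitHub Flavored Markdown
npm install jsonwebtoken
npm install @types/jsonwebtoken --save-dev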
TypeScript /*Path of the file: project-root/src/services/userService.ts*/ import fs from 'fs'; import path from 'path'; import jwt from 'jsonwebtoken'; interface User { id: number; username: string; email: string; password: string; } let users: User[] = []; export const initializeUsers = (): void => { const filePath = path.join(__dirname, '../data/users.json'); const data = fs.readFileSync(filePath, 'utf-8'); users = JSON.parse(data); }; export const findUserByUsername = (username: string): User | undefined => { return users.find((user) => user.username === username); }; export const generateToken = (user: User): string => { const payload = { id: user.id, username: user.username }; return jwt.sign(payload, process.env.JWT_SECRET || 'secret', { expiresIn: '1h' }); }; TypeScript /*Path of the file: project-root/src/controllers/userController.ts*/ import { Request, Response } from 'express'; import { findUserByUsername, generateToken } from '../services/userService'; export const loginUser = (req: Request, res: Response): void => { const { username, password } = req.body; if (!username || !password) { res.status(400).json({ message: 'Username and password are required' }); return; } const user = findUserByUsername(username); if (!user) { res.status(401).json({ message: 'Invalid username or password' }); return; } if (user.password !== password) { res.status(401).json({ message: 'Invalid username or password' }); return; } const token = generateToken(user); res.json({ token }); }; In the next step, you need to create an authentication middleware function. This function intercepts all the API calls made and validates whether they come from authenticated users or not. Create a directory middleware under src and add a file authMiddleware.ts with the code below. TypeScript /*Path of the file: project-root/src/middleware/authMiddleware.ts*/ import { Request, Response, NextFunction } from 'express'; import jwt from 'jsonwebtoken'; export const authMiddleware = (req: Request, res: Response, next: NextFunction): void => { const token = req.header('Authorization')?.split(' ')[1]; if (!token) { res.status(401).json({ message: 'Access Denied. No token provided.' }); return; } try { jwt.verify(token, process.env.JWT_SECRET || 'secret'); next(); } catch (error) { res.status(400).json({ message: 'Invalid Token' }); } }; Now, it's time to incorporate the authentication logic in each API call. Update the routes file to include the authMiddleware in each API call related to book management, as well as add a route related to login. TypeScript /*Path of the file: project-root/src/routes/index.ts*/ import { Router } from 'express'; import { getAllBooks, getBookById, addNewBook, removeBook } from '../controllers/bookController'; import { loginUser } from '../controllers/userController'; import { authMiddleware } from '../middleware/authMiddleware'; const router = Router(); router.post('/login', loginUser); router.get('/books', authMiddleware, getAllBooks); router.get('/books/:id', authMiddleware, getBookById); router.post('/books', authMiddleware, addNewBook); router.delete('/books/:id', authMiddleware, removeBook); export default router; In the final step, initialize the in-memory user database. Update the server.ts file to make it look like the one below.
TypeScript /*Path of the file: project-root/src/server.ts*/ import express from 'express'; import dotenv from 'dotenv'; import router from './routes'; import { initializeBooks } from './services/bookService'; import { initializeUsers } from './services/userService'; dotenv.config(); const app = express(); const PORT = process.env.PORT || 3000; initializeBooks(); initializeUsers(); app.use(express.json()); // Middleware to parse JSON app.use('/api', router); app.listen(PORT, () => { console.log(`Server is running on http://localhost:${PORT}`); }); Testing the APIs With Authentication Calling the APIs without providing a valid JWT token will now result in the below error from the server. JSON { "message": "Access Denied. No token provided." } Before calling the APIs, you need to authenticate using the URL http://localhost:3000/api/login. Use this URL and provide your username and password. This will give you a valid JWT token, as illustrated below. JWT AUTH You should pass the received JWT with each API call, prepended with the word Bearer in the Authorization header, as highlighted below. This will give you the correct response. Response with JWT Token Conclusion Securing your APIs is a critical step in modern backend system design, so congratulations on securing your APIs with JWT. JWT makes API authentication stateless and scalable. By leveraging JWT, your Node.js and Express APIs are now better equipped to handle real-world security challenges.
Imagine you're working on a complex puzzle. There are two ways to solve it: The first way: You keep rearranging all the pieces directly on the table, moving them around, and sometimes the pieces you've already arranged get disturbed. This is like traditional imperative programming, where we directly modify data and state as we go. The second way: For each step, you take a picture of your progress, and when you want to try something new, you start with a fresh copy of the last successful attempt. No previous work gets disturbed, and you can always go back to any of the prior states. This is functional programming — where we transform data by creating new copies instead of modifying existing data. Functional programming isn't just another programming style — it's a way of thinking that makes your code more predictable, testable, and often, more readable. In this article, we'll break down functional programming concepts in a way that will make you say, "Ah, now I get it!" What Makes Code "Functional"? Let's break down the core concepts that separate functional programming from traditional imperative (or "primitive") programming. 1. Pure Functions: The Heart of FP In functional programming, pure functions are like vending machines. Given the same input (money and selection), they always return the same output (specific snack). They don't: Keep track of previous purchasesModify anything outside themselvesDepend on external factors Code examples: Plain Text // Impure function - Traditional approach class Calculator { // This variable can be changed by any method, making it unpredictable private int runningTotal = 0; // Impure method - it changes the state of runningTotal public int addToTotal(int number) { runningTotal += number; // Modifying external state return runningTotal; } } // Pure function - Functional approach class BetterCalculator { // Pure method - only works with input parameters // Same inputs will ALWAYS give same outputs public int add(int first, int second) { return first + second; } } // Usage example: Calculator calc = new Calculator(); System.out.println(calc.addToTotal(5)); // Output: 5 System.out.println(calc.addToTotal(5)); // Output: 10 (state changed!) BetterCalculator betterCalc = new BetterCalculator(); System.out.println(betterCalc.add(5, 5)); // Always outputs: 10 System.out.println(betterCalc.add(5, 5)); // Always outputs: 10 2. Immutability: Treat Data Like a Contract In traditional programming, we often modify data directly. In functional programming, we treat data as immutable - once created, it cannot be changed. Instead of modifying existing data, we create new data with the desired changes. Plain Text // Traditional approach - Mutable List public class MutableExample { public static void main(String[] args) { // Creating a mutable list List<String> fruits = new ArrayList<>(); fruits.add("Apple"); fruits.add("Banana"); // Modifying the original list - This can lead to unexpected behaviors fruits.add("Orange"); System.out.println(fruits); // [Apple, Banana, Orange] } } // Functional approach - Immutable List public class ImmutableExample { public static void main(String[] args) { // Creating an immutable list List<String> fruits = List.of("Apple", "Banana"); // Instead of modifying, we create a new list List<String> newFruits = new ArrayList<>(fruits); newFruits.add("Orange"); // Original list remains unchanged System.out.println("Original: " + fruits); // [Apple, Banana] System.out.println("New List: " + newFruits); // [Apple, Banana, Orange] } } 3. 
Declarative vs. Imperative: The "What" vs. the "How" Traditional programming often focuses on how to do something (step-by-step instructions). Functional programming focuses on what we want to achieve. Plain Text public class NumberProcessing { public static void main(String[] args) { List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6); // Traditional approach (imperative) - Focusing on HOW List<Integer> evenNumbersImperative = new ArrayList<>(); // Step by step instructions for (Integer number : numbers) { if (number % 2 == 0) { evenNumbersImperative.add(number); } } // Functional approach (declarative) - Focusing on WHAT List<Integer> evenNumbersFunctional = numbers.stream() // Just specify what we want: numbers that are even .filter(number -> number % 2 == 0) .collect(Collectors.toList()); System.out.println("Imperative Result: " + evenNumbersImperative); System.out.println("Functional Result: " + evenNumbersFunctional); } } Why Choose Functional Programming? Predictability: Pure functions always produce the same output for the same input, making code behavior more predictable.Testability: Pure functions are easier to test because they don't depend on external state.Debugging: When functions don't modify the external state, bugs are easier to track down.Concurrency: Immutable data and pure functions make concurrent programming safer and more manageable. Common Functional Programming Patterns Here's a quick look at some common patterns you'll see in functional programming: Plain Text public class FunctionalPatterns { public static void main(String[] args) { List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5); // 1. Map: Transform each number to its doubled value List<Integer> doubled = numbers.stream() .map(number -> number * 2) // transforms each number .collect(Collectors.toList()); System.out.println("Doubled: " + doubled); // [2, 4, 6, 8, 10] // 2. Filter: Keep only even numbers List<Integer> evens = numbers.stream() .filter(number -> number % 2 == 0) // keeps only even numbers .collect(Collectors.toList()); System.out.println("Evens: " + evens); // [2, 4] // 3. Reduce: Sum all numbers int sum = numbers.stream() .reduce(0, (a, b) -> a + b); // combines all numbers into one System.out.println("Sum: " + sum); // 15 } } Conclusion Remember: Just like taking pictures of your puzzle progress, functional programming is about creating clear, traceable transformations of your data. Each step is predictable, reversible, and clean. Start small — try using map, filter, and reduce instead of for loops. Experiment with keeping your data immutable. Soon, you'll find yourself naturally thinking in terms of data transformations rather than step-by-step instructions.
Kubernetes is not new and has been a de facto standard for deployments and CI/CD at most companies for a while. The goal of this article is to make you familiar with all the terms and jargon that Kubernetes experts use, in approximately 5 minutes! Introduction to Kubernetes Kubernetes provides a scalable framework to manage containers, offering features that span basic cluster architecture to advanced workload orchestration. This piece goes over both the basic and advanced features of Kubernetes. It talks about architecture, resource management, layered security, and networking solutions. It ends with a look at service meshes and persistent storage. Building Blocks Kubernetes operates on a cluster architecture comprising control plane nodes and worker nodes. I sometimes like to refer to "worker nodes" as data planes. The control plane coordinates the cluster, with core components such as: API Server: Manages all cluster operations via RESTful APIs. Scheduler: Assigns pods to nodes based on resource availability and policies. Controllers: Align cluster state with desired configurations (e.g., ReplicaSets, Deployments). etcd: Provides a robust key-value store for all cluster data. The data plane hosts application workloads and includes: Kubelet: Manages pod execution and node operations. Kube-proxy: Configures networking rules to connect pods. Container Runtime: Runs containers using tools like containerd, which is open source. Refer to the Kubernetes API reference guide for detailed component insights. Resource Management Kubernetes organizes workloads into logical resources such as: Pods: The smallest deployable unit, often hosting one or more containers. Deployments: Manage stateless workloads with rolling updates and scaling. StatefulSets: Provide persistent storage and ordered scheduling for stateful applications. DaemonSets: Ensure that system-level pods are set up on all nodes. Security Measures Security is a top priority in cloud-native environments, and Kubernetes provides a comprehensive suite of features to address this need. Kubernetes offers several security features, including: RBAC: Role-Based Access Control (RBAC) lets you be very specific about who in the cluster can see and do what, playing a role similar to ACLs (Access Control Lists). Network Policies: Network Policies enable you to regulate communication between Pods, adding an extra layer of protection by isolating sensitive workloads. Secrets: Secrets are a safe way to store sensitive credentials, keeping important information private and avoiding accidental exposure. Networking Kubernetes adopts a flat networking model where all pods communicate seamlessly across nodes. Key networking features include: Services: Expose pods to network traffic, enabling internal and external access. Ingress: Manages external HTTP traffic with routing rules. Network Policies: Control ingress and egress traffic for pods, enhancing security. Service Mesh for Microservices Complex microservices often require advanced communication capabilities beyond Kubernetes services. Service meshes like Istio, Linkerd, and Consul provide: Automated mTLS encryption for secure service-to-service comms. Traffic routing, observability, and even load balancing. Support for A/B testing, circuit breaking, and traffic splitting. These tools eliminate the need for custom-coded communication logic, streamlining development.
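As a concrete illustration of the traffic-splitting capability mentioned above, here is a minimal sketch of an Istio VirtualService that sends 90% of traffic to one version of a service and 10% to another. The service name is hypothetical, and the v1/v2 subsets are assumed to be defined in a corresponding DestinationRule:
YAML
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout            # hypothetical service, used only for illustration
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v1      # stable version receives most of the traffic
          weight: 90
        - destination:
            host: checkout
            subset: v2      # canary version receives a small share
          weight: 10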
Persistent Storage for Stateful Workloads Kubernetes supports persistent storage via the Container Storage Interface (CSI), enabling integration with diverse storage backends. This only makes sense, though, if your application is stateful (or uses StatefulSets). Key resources include: PersistentVolumes (PV): Represent physical or cloud-based storage. PersistentVolumeClaims (PVC): Allow workloads to request storage dynamically. StorageClasses: Simplify storage configuration for diverse workload needs. StatefulSets combined with PVCs ensure data durability even during pod rescheduling. Performance: Azure Kubernetes Service as an Example Performance depends heavily on the size of the container image you are running. Managed solutions like Azure Kubernetes Service provide all of the features mentioned above and offer the reliability of Azure behind them. Azure enhanced its infrastructure to reduce container startup time by pre-caching common base images, so customers can see a 15-20x performance gain on cold container startups [1]. Conclusion As noted earlier, Kubernetes has become an essential tool for organizations embracing cloud-native practices. Its robust, decoupled architecture, combined with strong security features, allows for the efficient deployment and management of containerized applications. For further learning, consult the official Kubernetes documentation. References [1] Hotinger, E. R., Du, B., Antony, S., Lasker, S. M., Garudayagari, S., You, D., Wang, Y., Shah, S., Goff, B. T., Zhang, S., & Microsoft Technology Licensing, LLC. (2019, May 23). US11966769B2 - Container instantiation with union file system layer mounts. Google Patents.
TL;DR: The Lean Tech Manifesto With Fabrice Bernhard: Hands-on Agile #65 Join Fabrice Bernhard to learn how the “Lean Tech Manifesto” solves the challenge of scaling Agile for large organizations and enhances innovation and team autonomy (note: the recording is in English). Lean Tech Manifesto Abstract The release of the Agile Manifesto on February 13th, 2001, marked a revolutionary shift in how tech organizations think about work. By empowering development teams, Agile cut through the red tape in software development and quickly improved innovation speed and software quality. Agile's new and refreshing approach led to its adoption beyond just the scope of a development team, spreading across entire companies far beyond the initial context the manifesto’s original thinkers designed for it. And here lies the problem: the Agile Manifesto was intended for development teams, not for organizations with hundreds or thousands of people. As enthusiasts of Agile, Fabrice and his partner went through phases of excitement and then frustration as they experienced these limitations firsthand while their company grew and their clients became larger. What gave them hope was seeing organizations on both sides of the Pacific, in Japan and California, achieve almost unmatched growth and success while retaining the principles that made the Agile movement so compelling. The “Lean Tech Manifesto” resulted from spending the past 15 years studying these giants and experimenting as they scaled their business. It tries to build on the genius of the original 2001 document but adapt it to a much larger scale. Fabrice shares the connection they identified between Agile and Lean principles and the tech innovations they found the best tech organizations adopt to distribute work and maintain team autonomy. Meet Fabrice Bernhard Fabrice Bernhard is the co-author of The Lean Tech Manifesto and the Group CTO of Theodo, a leading technology consultancy he co-founded with Benoît Charles-Lavauzelle and scaled from 10 people in 2012 to 700 people in 2022. Based in Paris, London, and Casablanca, Theodo uses Agile, DevOps, and Lean to build transformational tech products for clients all over the world, including global companies — such as VF Corporation, Raytheon Technologies, SMBC, Biogen, Colas, Tarkett, Dior, Safran, BNP Paribas, Allianz, and SG — and leading tech scale-ups — such as ContentSquare, ManoMano, and Qonto. Fabrice is an expert in technology and large-scale transformations and has contributed to multiple startups scaling more sustainably with Lean thinking. He has been invited to share his experience at international conferences, including the Lean Summit, DevOpsDays, and CraftConf. The Theodo story has been featured in multiple articles and in the book Learning to Scale at Theodo Group. Fabrice is also the co-founder of the Paris DevOps meetup and an active YPO member. He studied at École Polytechnique and ETH Zürich and lives with his two sons in London. Connect with Fabrice Bernhard on LinkedIn. Video Watch the recording of Fabrice Bernhard’s The Lean Tech Manifesto session now:
The rapid advancement of software systems, fuelled by the adoption of microservices and cloud architectures, has significantly increased complexity and unpredictability. As modern enterprises become more reliant on these distributed systems, the risk of unexpected failures and service disruptions has grown. In response to these challenges, a transformative approach has emerged called Chaos Engineering. Chaos Engineering has gained momentum in software development, with its origins rooted in experiments by tech leaders like Netflix and Amazon. This practice involves deliberately introducing controlled disruptions into production systems to evaluate their resilience and uncover vulnerabilities. However, as software systems continue to evolve, the practice of Chaos Engineering is being reconsidered and refined. The Emergence of Chaos Engineering The origins of Chaos Engineering date back to the 1980s, when Apple used a "monkey" program to simulate random events, helping identify memory issues in Macintosh systems. This concept later evolved as companies like Amazon and Google adopted similar techniques to enhance system reliability by simulating and addressing potential failures. The breakthrough came in 2011 when Netflix formalized Chaos Engineering. By creating a tool called Chaos Monkey, Netflix randomly disabled servers during normal operations to ensure engineers built redundancy and automation into the system. The success of Chaos Monkey led to the development of Netflix's Simian Army, a suite of tools designed to test various failure scenarios, such as network disruptions and database outages, thereby laying the foundation for modern Chaos Engineering. Core Principles of Chaos Engineering Chaos Engineering is grounded in a set of key principles that guide the planning and execution of experiments to evaluate software system resilience: Define the Steady State: The process begins by establishing a clear understanding of the system's normal behavior. This involves identifying measurable key performance indicators (KPIs) that act as a baseline for assessing deviations during experiments. Formulate Hypotheses: Once the steady state is defined, hypotheses are created to predict how the system should behave under specific failure scenarios. These predictions are then tested through controlled disruptions. Simulate Real-World Scenarios: Chaos experiments mimic real-world events such as hardware failures, network latency, or resource exhaustion. These simulations allow engineers to observe system responses and uncover weaknesses in a controlled environment. Conduct Experiments in Production: To ensure realistic results, experiments are carried out in the live production environment rather than in isolated testing setups, providing a true representation of system behavior under actual conditions. Limit the Blast Radius: Despite running tests in production, minimizing user impact is critical. Careful planning, executing experiments during off-peak times, and using backup systems ensure disruptions are contained and quickly recoverable. Benefits of Chaos Engineering Adopting Chaos Engineering provides organizations with several key advantages: Enhanced System Resilience and Reliability: By proactively uncovering and addressing vulnerabilities, Chaos Engineering strengthens critical software systems, ensuring they remain operational and reliable under stress. Mitigation of Financial Risks: Unanticipated outages can result in significant revenue losses and increased operational costs.
Chaos Engineering minimizes these risks by helping organizations prevent disruptions. In-Depth System Insights: Chaos experiments offer a clearer understanding of complex system interactions and dependencies, enabling engineers to make more strategic decisions in system architecture and design. Accelerated Recovery from Failures: Organizations can refine incident response strategies by analyzing system behavior during various failure scenarios, reducing recovery times and improving operational efficiency. Enhanced Customer Experience: Robust, failure-resistant systems foster customer trust and satisfaction, as they deliver consistent and dependable services even in the face of unexpected challenges. Challenges and Limitations While chaos testing offers significant advantages, it also comes with notable limitations and considerations that organizations should carefully address: Impact on Production Systems: Introducing disruptions in a live production environment entails risks, including potential service outages or downtime. Thoughtful planning and risk mitigation are crucial to minimize the effects on users and operations. Complexity and Cost: Conducting chaos testing demands substantial resources, such as specialized tools, robust infrastructure, and skilled professionals. For smaller organizations with limited budgets or expertise, these requirements may pose significant barriers. Potential for Misinterpretation: Without proper analysis, test results can be misleading. A system’s response during a test may not always mirror its behavior in more complex or varied real-world scenarios. Results should be contextualized and validated through repeated testing. Limited Scope and Coverage: Chaos testing may not account for all possible failure scenarios, leaving certain vulnerabilities undetected. Supplementing it with other testing methods, like penetration and security testing, ensures broader coverage. Unpredictability of Outcomes: The inherent randomness of chaos testing makes it challenging to predict specific outcomes, which can hinder efforts to replicate scenarios or consistently validate fixes. Ethical and Regulatory Considerations: Deliberately inducing failures in systems handling sensitive data, critical infrastructure, or financial transactions raises ethical concerns. Compliance with regulations and safeguarding sensitive information is essential. Specialized Skill Requirements: Effective chaos testing requires expertise in distributed systems, fault injection, and system monitoring. Organizations may need to invest in training or hiring skilled professionals to maximize the benefits. Integration with CI/CD Pipelines: Incorporating chaos tests into continuous integration and deployment workflows is complex and demands careful orchestration to avoid disrupting pipelines or delaying releases. Metrics and Measurement Challenges: Defining and accurately measuring metrics that reflect system resilience under stress can be difficult. Metrics must be carefully chosen to yield actionable insights. Organizational Readiness: Successful adoption of chaos testing depends on a culture that views failure as a learning opportunity. This requires collaboration across teams (developers, operations, security) and a commitment to continuous improvement in system reliability.
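To make the preceding principles and challenges more concrete, here is a minimal sketch of what a declarative chaos experiment can look like in practice. It uses Chaos Mesh purely as an illustrative tool (the article itself does not prescribe one), and the namespace and label are hypothetical; scoping the experiment to a single labeled pod is one simple way to limit the blast radius and to version experiments alongside CI/CD pipelines:
YAML
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: checkout-pod-kill        # hypothetical experiment name
  namespace: chaos-testing
spec:
  action: pod-kill               # kill a pod and observe recovery behavior
  mode: one                      # affect only a single matching pod
  selector:
    namespaces:
      - demo                     # hypothetical target namespace
    labelSelectors:
      app: checkout-service      # hypothetical target workload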
Beyond Chaos: Designing for Resilience A cornerstone of the modern approach to software development is embedding resilience directly into system design rather than relying solely on Chaos Engineering to identify and address vulnerabilities. This involves incorporating features like redundancy, fault tolerance, and self-healing mechanisms from the outset, ensuring systems can endure and recover from failures with minimal disruption to users. By adopting this proactive strategy, organizations can minimize the reliance on disruptive Chaos Engineering experiments and prioritize building systems that are inherently robust and scalable. This approach aligns with the growing focus on non-functional requirements such as security, maintainability, and performance, which are often overshadowed by the drive for rapid feature delivery. Observability and Continuous Improvement Modern software development also emphasizes observability with real-time monitoring tools that provide insights into system performance. By integrating observability into development workflows, organizations can identify potential issues early and drive continuous improvement, refining system resilience over time. Empowering Developers: Shifting Left The evolution of Chaos Engineering includes a "shift left" approach, empowering developers to build resilient systems from the start. By equipping developers with tools like Steadybit, organizations enable them to incorporate resilience practices such as retries, circuit breakers, and rolling updates directly into their code. This democratization of Chaos Engineering distributes responsibility for system reliability across the development team, reducing dependency on dedicated Site Reliability Engineering (SRE) teams. The Future of Chaos Engineering As the software industry evolves, Chaos Engineering is poised to become an integral part of a broader strategy for building resilient systems. Rather than existing as an isolated practice, it will increasingly be recognized as one of many techniques within a comprehensive Resilience Engineering framework. This approach will seamlessly combine proactive system design, enhanced observability, continuous improvement, and empowering developers to build reliability into their work. By embracing this holistic perspective, organizations can create software that is not only highly available and scalable but also adaptive, self-healing, and capable of withstanding the unpredictable challenges of today’s digital landscape. Conclusion In conclusion, the role of Chaos Engineering in today’s software landscape is evolving significantly. While it has been instrumental in enhancing system resilience, the industry is shifting towards a more proactive, design-focused approach to building robust systems. By embedding resilience into software design, harnessing observability, promoting continuous improvement, and empowering developers to prioritize reliability, organizations can create systems that are not only scalable and highly available but also adaptive, self-healing, and equipped to handle the uncertainties of the digital world. As software development continues to advance, Chaos Engineering will increasingly integrate with Resilience Engineering to form a unified strategy aimed at delivering reliable, high-performance, and user-focused digital solutions. Adopting this comprehensive approach enables organizations to future-proof their software and sustain a competitive advantage in the rapidly changing technological landscape.
The first time I had to work on a high-performance ETL pipeline for processing terabytes of smart city sensor data, traditional stack recommendations overwhelmed me. Hadoop, Spark, and other heavyweight solutions seemed like bringing a tank to a street race. That's when I discovered Golang, and it fundamentally changed how I approach ETL architecture. Understanding Modern ETL Requirements ETL has undergone a sea change in the last decade. Gone are the days when overnight batch processing was good enough. The applications being written now require real-time processing, streaming, and support for all sorts of data formats while maintaining performance and reliability. Having led data engineering teams for years, I have seen firsthand how traditional ETL solutions struggle to keep pace with today's requirements. Data streams flowing from IoT devices, social media feeds, and real-time transactions result in volumes of data requiring immediate processing. Today, the challenge is not just one of volume but of processing quality data with minimal latency while maintaining system resilience. Hence, performance considerations have become particularly crucial. In one recent project, for example, we had to process over 80,000 messages per second from IoT sensors across smart city infrastructure. There, traditional batch processing wouldn't cut it, and near real-time insights were required to make meaningful decisions on traffic flow and energy consumption. Advantages of Golang for ETL This is where Golang really shines. When we moved from our initial Python-based implementation to Go, the transformation was nothing short of magical. Concurrent processing in Go, particularly goroutines and channels, proved to be an elegant solution to our performance challenges. What I find most impressive about Go is its lightweight threads, called goroutines. Unlike most threading models, they are extremely resource-efficient: you can create thousands with very little overhead. In our smart city project, each sensor stream had its own goroutine to handle it, giving us true parallel processing without the heaviness of managing thread pools or other process overhead. Channel-based data flow provides a clean and efficient way to handle data pipelines in Go. We replaced complex queue management systems with channels, setting up very simple flows of data between the different stages of processing. This made our code simpler and easier to maintain and debug. One of the most underestimated benefits of using Go for ETL is memory management. Go's garbage collector is one of the most finely tuned in the industry, with predictable latency — a critical component of any ETL workload. We no longer needed to worry about memory leaks or sudden garbage collection pauses disrupting our data processing pipeline. Key Features for ETL Operations The standard library does contain some real gems, not least for an ETL developer. The encoding/json and encoding/csv packages cover a great deal of the bases when it comes to data formats, and database/sql lets you work with a range of database systems. The context package is a clean way of dealing with timeouts and cancellations, which are common requirements for keeping pipelines reliable. Although Go's explicit error handling was controversial for us when we started using it, it proved to be a blessing for ETL operations. Explicit and immediate error handling helped us create more reliable pipelines.
We found problems immediately and quickly fixed them, not allowing bad data to propagate further in the system. Here is one of the patterns we commonly use to handle errors robustly in our pipelines: Go type Result struct { Data interface{} Error error } func processRecord(record Data) Result { if err := validate(record); err != nil { return Result{Error: fmt.Errorf("validation failed: %w", err)} } transformed, err := transform(record) if err != nil { return Result{Error: fmt.Errorf("transformation failed: %w", err)} } return Result{Data: transformed} } Common ETL Patterns in Golang Over the course of our projects, we identified some useful patterns for ETL. One of those patterns is the pipeline pattern, which takes full advantage of Go's concurrency features: Go func Pipeline(input <-chan Data) <-chan Result { output := make(chan Result) go func() { defer close(output) for data := range input { result := processRecord(data) output <- result } }() return output } This allows us to easily chain multiple transformation stages, maintaining high throughput with clean error handling. At each stage in this pipeline, we can also add monitoring, logging, and error recovery. Integration Capabilities Integration is fairly painless in Go, thanks to a rich ecosystem of libraries that makes it easy to connect to a wide variety of data sources and destinations. Whether we're pulling data from REST APIs, reading from Kafka streams, or writing to cloud storage, there's usually a well-maintained Go library available to do so. In our smart city project, we used the AWS SDK for Go to stream the processed data directly into S3 while maintaining a real-time view in Redis. The ability to handle multiple outputs with negligible performance impact was impressive. Real-World Implementation Let me give a concrete example from our smart city project. We had to process sensor data coming in through Kafka, transform it, and store it in both S3 for long-term storage and Redis for real-time querying. Here's a simplified version of what our architecture looked like: Data ingestion using Sarama (a Kafka client for Go). Parallel processing using a goroutine pool. Data transformation using protocol buffers. Concurrent writing to S3 and Redis. The results were stunning: a single instance of our Go-based pipeline was processing 80,000 messages a second with sub-second latency. When we needed to scale up to 10 Gbps throughput, we merely deployed multiple instances behind a load balancer. Case Studies and Benchmarks In comparing our Go implementation against the previous Python-based solution, the numbers tell the story: 90% reduction in processing latency, 70% lower CPU utilization, 40% lower memory footprint, and a 60% reduction in cloud infrastructure costs. But probably most importantly, our solution was easy to work with. The entire pipeline, including error handling and monitoring, was implemented in less than 2,000 lines of code. This allowed us to onboard new people onto the project very efficiently. Conclusion Go has proven to be an excellent choice for modern ETL pipelines. The combination of performance, simplicity, and a strong standard library provides the opportunity to create very efficient data processing solutions without the complexity of traditional big data frameworks. To teams considering Go for their ETL needs, I can only advise starting small. Build a simple pipeline handling one data source and one destination.
Get the concurrent processing patterns right (a minimal worker-pool skeleton is sketched at the end of this article), then incrementally build more features and complexity as needed. That is the beauty of Go: your solution naturally grows with your requirements while keeping performance and code clarity intact. ETL is all about getting data from point A to point B in a reliable, maintainable way. From what I've found, Go strikes a perfect balance among these qualities, making it an excellent match for the ETL challenges facing our world today.
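As a companion to that advice, here is a minimal, self-contained sketch of the goroutine worker-pool (fan-out) idea described above. The Data and Result types are simplified stand-ins for the article's own types, and the producer loop is a placeholder for a real source such as a Kafka consumer or file reader:
Go
package main

import (
	"fmt"
	"sync"
)

// Data is a simplified stand-in for a record read from the source system.
type Data struct {
	ID    int
	Value string
}

// Result mirrors the error-carrying pattern shown earlier in the article.
type Result struct {
	Data  Data
	Error error
}

// fanOut starts a fixed pool of workers that all consume from the same input
// channel and write to a shared output channel, closing it when all are done.
func fanOut(input <-chan Data, workers int, process func(Data) Result) <-chan Result {
	output := make(chan Result)
	var wg sync.WaitGroup

	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for d := range input {
				output <- process(d)
			}
		}()
	}

	go func() {
		wg.Wait()
		close(output)
	}()

	return output
}

func main() {
	input := make(chan Data)

	// Producer: placeholder for a real ingestion source (Kafka, files, an API).
	go func() {
		defer close(input)
		for i := 1; i <= 5; i++ {
			input <- Data{ID: i, Value: fmt.Sprintf("reading-%d", i)}
		}
	}()

	// Process records with a small pool of three workers.
	results := fanOut(input, 3, func(d Data) Result {
		// Validation and transformation would happen here.
		return Result{Data: d}
	})

	for r := range results {
		if r.Error != nil {
			fmt.Println("error:", r.Error)
			continue
		}
		fmt.Println("processed:", r.Data.ID)
	}
}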
Retrieval-augmented generation (RAG) applications integrate private data with public data and improve the output of large language models (LLMs), but building one is challenging as private data can be unstructured and siloed. You'll also need a reliable and efficient way to retrieve relevant information from the knowledge base. This might seem like an uphill battle, but it's doable with tools like Milvus and LlamaIndex, which can quickly handle big data and retrieve relevant information, especially when adopted together. What Are Milvus and LlamaIndex? To build a RAG application that optimizes query efficiency, you need a scalable, flexible vector database and an indexing algorithm. Before showing you how to build one, we'll quickly discuss Milvus and LlamaIndex. What Is Milvus? Milvus is an open-source vector database for storing, processing, indexing, and retrieving vector embeddings across various environments. This platform is popular among generative AI developers because of its similarity search in massive datasets of high-dimensional vectors and its high scalability. Besides its scalability and high performance, developers can use it for machine learning (ML) workloads, to build recommendation systems, and to mitigate hallucinations in LLMs. Milvus offers three deployment options: Milvus Lite is a Python library and ultra-lightweight version of Milvus that works great for small-scale local experiments. Milvus Standalone is a single-node deployment that uses a client-server model, the MySQL equivalent of Milvus. Milvus Distributed is Milvus's distributed mode, which adopts a cloud-native architecture and is great for building large-scale vector database systems. What Is LlamaIndex? LlamaIndex is an orchestration framework that simplifies building LLM applications by integrating private, domain-specific, and public data. It achieves this by ingesting external data and storing it as vectors in a vector database to be used for knowledge generation, complex search operations, and reasoning. Besides storage and data ingestion, LlamaIndex comes in handy when indexing and querying data. The enterprise version comprises LlamaCloud and LlamaParse. There's also an open-source package with LlamaHub (their data connectors), Python, and TypeScript packages. What Is RAG? Retrieval-augmented generation (RAG) is an AI technique that combines the strength of generative LLMs with traditional information retrieval systems to enhance accuracy and reliability. This is important because it exposes your LLMs to external, real-time, vector-based information outside their knowledge bases, addressing issues with missing context, inaccuracy, and hallucination. Building a RAG System Using LlamaIndex and Milvus We'll show you how to build a retrieval-augmented generation system using LlamaIndex and Milvus. First, you'll make use of data from the LitBank repository. Then, we'll index the data using the llama_index library and Milvus Lite. Next, we'll process the documents into vector representations using the OpenAI API and, finally, query the data and filter it through the metadata. Prerequisites and Dependencies To follow along with this tutorial, you will need the following: Python 3.9 or higher. Any IDE or code editor. We recommend Google Colab, but you can also use Jupyter Notebook. An OpenAI developer account so you can access your OpenAI API key. Setup and Installation Before building the RAG, you'll need to install all your dependencies.
Shell %pip install pymilvus>=2.4.2 %pip install llama-index-vector-stores-milvus %pip install llama-index These code snippets will install and upgrade the following: pymilvus: the Milvus Python SDK. llama-index-vector-stores-milvus: provides integration between LlamaIndex and the Milvus vector store. llama-index: the data framework for indexing and querying LLMs. Next, you need to set up the OpenAI API to access its advanced language models, which have been trained for various natural language processing (NLP) and image-generation tasks. However, before you can use the OpenAI API, you must create an OpenAI developer account. Visit the API keys section of your OpenAI developer dashboard. Click on “Create a new secret key” to generate an API key. Copy the key. Then, head over to your Google Colab notebook. Python import openai openai.api_key = "OpenAI-API-Key" Generating Data For your dataset, you can use LitBank, a repository of annotated datasets of a hundred works of English-language fiction. For this project, we'll use "The Fall of the House of Usher" by Edgar Allan Poe and "Oliver Twist" by Charles Dickens. To achieve this, create a directory to retrieve and save your data. Python ! mkdir -p 'data/' ! wget 'https://raw.githubusercontent.com/dbamman/litbank/refs/heads/master/original/730_oliver_twist.txt' -O 'data/730_oliver_twist.txt' ! wget 'https://raw.githubusercontent.com/dbamman/litbank/refs/heads/master/original/932_the_fall_of_the_house_of_usher.txt' -O 'data/932_the_fall_of_the_house_of_usher.txt' Then, generate a document from the novel using the SimpleDirectoryReader class from the llama_index library. Python from llama_index.core import SimpleDirectoryReader # load documents documents = SimpleDirectoryReader( input_files=["data/730_oliver_twist.txt"] ).load_data() print("Document ID:", documents[0].doc_id) Indexing Data Next, index your document to reduce search latency and enable semantic similarity search for quick retrieval of relevant documents based on meaning and context. You can do this using the llama_index library. All you need to do is specify the file path and storage configuration and set your vector embedding dimensionality. You'll also set the URI of Milvus Lite as your local file. Alternatively, you can run Milvus via Docker, Kubernetes, or Zilliz Cloud, Milvus’s fully managed cloud solution. These alternatives are best for large projects. Python # Create an index over the documents from llama_index.core import VectorStoreIndex, StorageContext from llama_index.vector_stores.milvus import MilvusVectorStore vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents(documents, storage_context=storage_context) Querying Data You'll need to leverage the indexed documents as a knowledge base for asking questions. This will allow your RAG to have conversational AI capabilities, quickly retrieve relevant answers, and maintain a contextual understanding of conversations. Python query_engine = index.as_query_engine() res = query_engine.query("how did Oliver twist grow up?") print(res) Try asking more questions about the novel. Python res = query_engine.query("What motivates Oliver to ask for more food in the workhouse?") print(res) You can also try more tests, like overwriting previously stored information.
Python from llama_index.core import Document vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents( [Document(text="The number that is being searched for is ten.")], storage_context, ) query_engine = index.as_query_engine() res = query_engine.query("how did Oliver twist grow up?") print(res) Let’s try one more test to add additional data to an already existing index. Python del index, vector_store, storage_context, query_engine vector_store = MilvusVectorStore(uri="./milvus_demo.db", overwrite=False) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents(documents, storage_context=storage_context) query_engine = index.as_query_engine() res = query_engine.query("What is the number?") print(res) res = query_engine.query("how did Oliver twist grow up?") print(res) Filtering Metadata Metadata filtering allows you to narrow search results to those that match specific criteria based on metadata. This way, you can search for documents based on various metadata fields, such as author, date, and tag. This is particularly useful when you have a large dataset and need to find documents that meet certain attributes. You can load both documents using the code snippet below. Python from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters # Load the two documents used before documents_all = SimpleDirectoryReader("./data/").load_data() vector_store = MilvusVectorStore(uri="./milvus_demo.db", dim=1536, overwrite=True) storage_context = StorageContext.from_defaults(vector_store=vector_store) index = VectorStoreIndex.from_documents(documents_all, storage_context) If you only want to retrieve documents from The Fall of the House of Usher, use the following script: Python filters = MetadataFilters( filters=[ExactMatchFilter(key="file_name", value="932_the_fall_of_the_house_of_usher.txt")] ) query_engine = index.as_query_engine(filters=filters) res = query_engine.query("What distinctive physical feature does Roderick Usher exhibit in The Fall of the House of Usher?") print(res) If you only want to use Oliver Twist, you can use this script: Python filters = MetadataFilters( filters=[ExactMatchFilter(key="file_name", value="730_oliver_twist.txt")] ) query_engine = index.as_query_engine(filters=filters) res = query_engine.query("What challenges did Oliver face?") print(res) You can explore the full project code on GitHub along with the interactive Google Colab notebook. Conclusion In this post, you learned how to build a RAG application with LlamaIndex and Milvus. Milvus offers capabilities such as image search, and since Milvus Lite is an open-source project, you can make your own contributions as well.
When we think about software engineers, the focus often lands squarely on technical skills — writing efficient code, solving complex problems, and understanding algorithms. However, this narrow view overlooks a critical element that can make or break a career: personal branding. This oversight is a mistake I made early in my career. I believed my technical abilities alone would lead to success, promotions, and recognition. But over time, I realized that while being skilled at software design and architecture is essential, it is only part of the equation.

Why Personal Branding Matters

Have you ever been passed over for a promotion or a dream project, only to see someone less skilled take the opportunity? If so, you may have wondered what set that person apart. The difference often lies not in technical expertise but in visibility, reliability, and connections.

Even foundational technical books hint at this truth. In A Philosophy of Software Design, John Ousterhout emphasizes the importance of strategic thinking — skills beyond coding. Similarly, Eric Evans' classic Domain-Driven Design delves into strategic DDD, where concepts like ubiquitous language and bounded contexts highlight the importance of human interaction and communication. This reality becomes even more apparent when considering the Agile Manifesto, which prioritizes individuals and interactions over processes and tools. Software is created with, for, and through people. Connections and perceptions matter just as much as code quality because, ultimately, the people you work with shape your career trajectory.

What Is Personal Branding?

Your brand is the perception others have of you. It's how colleagues, managers, and the broader industry view your reliability, expertise, and approachability. Personal branding involves defining and promoting what you stand for as an individual. It reflects your experiences, skills, values, and unique traits. For software engineers, personal branding is crucial because it influences career growth. Your reputation impacts hiring decisions, promotions, and project allocations. As the ancient saying goes, "Caesar's wife must be above suspicion." Perceptions have shaped decisions for centuries, and this principle holds even in the tech-driven modern world.

Benefits of Personal Branding

Investing in your brand offers numerous benefits:

Improved Credibility: Demonstrating your expertise and knowledge establishes trust among peers and stakeholders.
Differentiation: Highlighting what makes you unique sets you apart from others in your field.
Lasting Impressions: A well-defined brand ensures you are remembered for the right reasons.
Better Connections: Strong branding attracts like-minded professionals and opens doors to valuable opportunities.

How to Start Your Branding Today

Building a personal brand doesn't require reinventing yourself or creating a façade. Instead, it's about curating and amplifying your authentic self to showcase your strengths. Here's how to begin:

Understand Yourself: Reflect on your skills, values, and goals. What do you want to be known for? What are your long-term career aspirations? Defining your brand starts with self-awareness.
Audit Your Existing Brand: Search for your online presence. What appears when you do? Are there any inconsistencies in how you are represented across different platforms? This audit can help you identify areas for improvement.
Define Your Audience: Personal branding isn't about appealing to everyone.
Identify the people who matter most in your career — managers, colleagues, domain experts, or potential employers — and tailor your brand to resonate with them.
Leverage Written Content: Write articles, blogs, or social media posts to showcase your knowledge. Sharing your expertise not only boosts your credibility but also expands your reach.
Be Active in Your Network: Build relationships with colleagues and industry peers. Being approachable and helpful creates a positive impression, often leading to new opportunities.
Maintain Consistency: Whether through LinkedIn, GitHub, or personal blogs, ensure your online presence aligns with your professional goals. Use the same tone, messaging, and visuals across platforms.
Remember Non-Technical Audiences: Not everyone influencing your career will understand the technical aspects of your work. Build trust with non-technical stakeholders — managers, HR, and directors — by communicating effectively and showcasing your value.
Enhance Word of Mouth: Be generous with your knowledge and support. A helpful reputation spreads, creating opportunities through recommendations and referrals.

Real-Life Scenarios for Developers Building a Personal Brand

As developers, our work often goes unnoticed unless we actively showcase it. Personal branding helps highlight your skills, connect with others, and create career opportunities. Here's how to start:

Open Source Contributions: Imagine you're contributing to a popular open-source project. Your commits and pull requests improve the codebase and showcase your expertise to a global audience. This visibility can lead to job offers, speaking invitations, or collaborations.
Tech Blogging: A developer who writes about solving a tricky bug or implementing a complex feature can establish themselves as an expert in that area. For example, a blog post detailing how you optimized database queries in a high-traffic application could resonate with others facing similar challenges.
Conference Speaking: Sharing a personal project or a unique approach at a developer conference boosts your confidence and positions you as a thought leader in your field. A talk on how you integrated cutting-edge tools like Kubernetes with CI/CD can inspire others.
GitHub Portfolio: Developers often share side projects, frameworks, or libraries on GitHub. Imagine creating a tool that simplifies a common development pain point — like automating documentation generation for APIs. Your GitHub stars and forks become a testament to your innovation.
Social Media Engagement: A thread on Twitter about debugging a complex issue in JavaScript or a LinkedIn post about lessons learned while scaling a microservices architecture can attract attention from peers and recruiters.
Code Reviews and Mentorship: Providing thoughtful, constructive feedback during code reviews or mentoring junior developers showcases your leadership skills. It builds your internal brand within a team or organization.
Live Coding or Tutorials: Hosting live coding sessions on platforms like YouTube or Twitch to solve problems, build apps, or explore new technologies demonstrates your technical skills and communication ability.

By embracing scenarios like these, developers can naturally weave personal branding into their daily lives, allowing their expertise and passion to shine authentically and impactfully.

Final Thoughts

In software engineering, your technical skills are just the foundation.
To truly stand out, you must cultivate a personal brand that amplifies your strengths, builds trust, and opens doors. By understanding yourself, creating valuable content, and fostering connections, you can shape how others perceive you and take control of your career narrative. Start today — your future self will thank you.
Prometheus is a tool that helps you track how your systems are working. Think of it as a tool that collects numbers about your applications and servers. This guide will help you understand the different types of metrics and how to use them.

The Four Basic Types of Prometheus Metrics

1. Counters - Numbers That Only Go Up

A counter is a number that only goes up or resets to zero on restart, just like a car's odometer that keeps adding miles. It's perfect for tracking things that only increase, like total API requests, error counts, or tasks completed. When a counter resets to zero (like during a system restart), Prometheus can detect this reset and handle calculations correctly. Counters are the simplest metric type and should be used whenever you're counting the total occurrences of something.

Plain Text
# Example of a counter
http_requests_total{method="POST", endpoint="/api/users"} 2387

What to Know
Only increases or resets to zero
Used for counting total events
Common uses: counting requests, errors, completed tasks

Basic Rules
Always add _total to counter names
Use only for numbers that increase
Never use for numbers that need to go down

Real Examples
Plain Text
# Wrong way: Using a counter for current users
active_users 23
# Why it's wrong: Current users can go up OR down, but counters can only go up

# Right way: Using a counter for total logins
user_logins_total{status="success"} 10483
# Why it's right: Total logins only increase, perfect for a counter

2. Gauges - Numbers That Go Up and Down

A gauge is a number that can go both up and down, like a thermometer or fuel gauge in your car. It represents a current value at any point in time, such as memory usage, active requests, or CPU temperature. You can think of a gauge as taking a snapshot of something that changes frequently. Unlike counters, gauges are perfect for metrics that can increase or decrease based on system behavior.

Plain Text
# Example of a gauge
node_memory_usage_bytes{instance="server-01"} 1234456789

What to Know
Can increase or decrease
Shows current value at any time
Good for measuring current state

Basic Rules
Use for values that change up and down
Good for usage and saturation metrics
Don't use for counting total events

Real Examples
Plain Text
# Right way: Tracking CPU temperature
cpu_temperature_celsius{core="0"} 54.5
# Why it's right: Temperature naturally goes up and down

# Right way: Current database connections
db_connections_current{database="users"} 47
# Why it's right: Active connections change both up and down

3. Histograms - Tracking Value Ranges

A histogram groups measurements into ranges (called buckets), like sorting test scores into A, B, C, D, and F grades. It automatically tracks how many values fall into each range, plus keeps a count of all values and their sum. Histograms are especially useful for measuring things like request duration or response size, where you want to understand the distribution of values. The key feature of histograms is that they let you calculate percentiles later using the histogram_quantile function.
Plain Text
# Example of a histogram
http_request_duration_seconds_bucket{le="0.1"} 24054   # Requests faster than 0.1s
http_request_duration_seconds_bucket{le="0.5"} 33444   # Requests faster than 0.5s
http_request_duration_seconds_bucket{le="1.0"} 34001   # Requests faster than 1.0s

What to Know
Groups values into ranges (buckets)
Creates count and sum automatically
Helps calculate percentiles

Basic Rules
Pick ranges that make sense for your data
Good for response times and sizes
Don't create too many ranges (it uses more memory)

Real Examples
Plain Text
# Wrong way: Too many buckets
api_response_time_bucket{le="0.1"} 100
api_response_time_bucket{le="0.2"} 150
api_response_time_bucket{le="0.3"} 180
# Why it's wrong: Too many small buckets use extra memory and don't add value

# Right way: Meaningful bucket sizes
api_response_time_bucket{le="0.5"} 1000   # Half second
api_response_time_bucket{le="1.0"} 1500   # One second
api_response_time_bucket{le="2.0"} 1700   # Two seconds
# Why it's right: Buckets match meaningful response time targets

4. Summaries - Calculating Percentiles

A summary is similar to a histogram but calculates percentiles directly when collecting the data, like having a calculator that immediately tells you your test score's ranking in the class. It tracks the total count and sum like a histogram, but instead of buckets, it stores exact percentile values (like the 50th, 90th, and 99th percentiles). Summaries are more resource-intensive than histograms because they calculate percentiles on the fly, but they provide more accurate percentile calculations. Use summaries when you need exact percentiles and can't calculate them later.

Plain Text
# Example of a summary
http_request_duration_seconds{quantile="0.5"} 0.05   # 50% of requests
http_request_duration_seconds{quantile="0.9"} 0.1    # 90% of requests
http_request_duration_seconds_count 34010            # Total count

What to Know
Calculates exact percentiles
Includes total count and sum
Uses more computer resources than histograms

Basic Rules
Use when you need exact percentiles
Consider histograms for most cases
Be careful with labels (they use memory)

Common Mistakes to Avoid

1. Counter vs. Gauge Confusion

Plain Text
# Wrong way: Using counter for temperature
temperature_total{location="room"} 25
# Why wrong: Temperature goes up and down, counters can't go down

# Right way: Using gauge for temperature
temperature{location="room"} 25
# Why right: Gauges can show current temperature properly

2. Too Many Labels

Plain Text
# Wrong way: Too much detail
http_requests_total{user_id="12345", path="/api/users", method="GET", status="200", browser="chrome"}
# Why wrong: Creates too many combinations, uses lots of memory

# Right way: Important details only
http_requests_total{path="/api/users", method="GET", status="200"}
# Why right: Keeps useful information without too many combinations

Simple Recipes for Common Tasks

Calculating Rates

Plain Text
# Request rate per second over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
(rate(http_errors_total[5m]) / rate(http_requests_total[5m])) * 100

Tracking Resource Usage

Plain Text
# Average memory usage by application
avg(process_memory_bytes) by (app_name)

# Maximum CPU usage in the last hour
max_over_time(cpu_usage_percent[1h])

Key Points to Remember

1. Use Counters when:
Counting total events
Tracking errors
Measuring completed tasks

2. Use Gauges when:
Measuring current values
Tracking things that go up and down
Showing resource usage
3. Use Histograms when:
Measuring response times
Looking at value ranges
Needing approximate percentiles

4. Use Summaries when:
Needing exact percentiles
Willing to use more computer resources
Unable to calculate percentiles later

Start with counters and gauges for basic monitoring. Add histograms and summaries when you need to track response times or understand how your values are spread out. Remember, good monitoring starts with choosing the right type of metric for what you want to measure.
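The examples above show metrics as Prometheus exposes them. To tie the four types back to application code, here is a minimal instrumentation sketch. It assumes the prometheus_client Python library, which the article itself does not prescribe, and all metric names, labels, and values are illustrative. Note that the Python client's Summary exposes only a count and a sum (no quantiles), so it is included mainly to show the observe pattern.

Python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: only ever increases; the name ends in _total by convention
REQUESTS_TOTAL = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "endpoint"]
)

# Gauge: a current value that can go up or down
DB_CONNECTIONS = Gauge("db_connections_current", "Open database connections", ["database"])

# Histogram: observations sorted into cumulative buckets chosen to match latency targets
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    buckets=(0.1, 0.5, 1.0, 2.0),
)

# Summary: tracks count and sum of observations (no quantiles in the Python client)
RESPONSE_SIZE = Summary("http_response_size_bytes", "HTTP response size in bytes")

def handle_request():
    REQUESTS_TOTAL.labels(method="GET", endpoint="/api/users").inc()
    DB_CONNECTIONS.labels(database="users").set(random.randint(10, 60))
    with REQUEST_LATENCY.time():               # records elapsed time into the buckets
        time.sleep(random.uniform(0.05, 0.3))  # simulated work
    RESPONSE_SIZE.observe(random.randint(200, 2000))

if __name__ == "__main__":
    start_http_server(8000)   # exposes metrics at http://localhost:8000/metrics
    while True:
        handle_request()

Once this process is running, Prometheus can scrape http://localhost:8000/metrics, and queries such as rate(http_requests_total[5m]) from the recipes above can be run against the resulting series.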
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Observability and Performance: The Precipice of Building Highly Performant Software Systems.

AIOps plays a crucial role in streamlining the operational load, improving overall performance, and enhancing the security of highly distributed and complex applications. By introducing the AI and ML capabilities of AIOps into observability workflows, manual effort is saved via automation of incident detection, root cause analysis, and self-healing capabilities. As the complexity of a system grows and the volume of data increases, the efficacy of an AIOps integration improves. In this article, we will analyze AIOps capabilities to modernize and optimize observability workflows.

A Brief Review of Observability Workflows and AIOps

This section will discuss the key components of observability workflows and AIOps with examples. AIOps can be used in existing observability workflows to make them smarter.

Figure 1. AIOps in observability workflows

Key Components of Observability Workflows

Observability workflows provide a deep understanding of and visibility into a complex distributed system. This enables software teams to proactively detect issues, enhance security, optimize application performance, and scale the system.

Table 1. Observability workflow components

Component | Details
Data collection | Collect data from various sources (e.g., logs, metrics, service traces)
Data processing | Unify and standardize collected data
Data ingestion | Ingest collected and processed data into the platform for further analysis
Data storage | Store ingested data in high-volume storage
Data visualization | Visualize stored data using commonly available tools
Reporting | Use various methods to trigger tickets and notify corresponding stakeholders
Incident management | Achieve automated or manual incident analysis and resolution based on tickets; various rules are configured to take required actions
Behavior monitoring | Use the data to analyze actor behavior and identify any malicious activity
Continuous improvements | Use past data for root cause analysis to improve observability workflows

Observability workflows enable a scattered team working on complex and highly distributed systems to take necessary action by employing the above methodologies to ensure the high availability of these systems. A few advantages include:

Real-time communication monitoring. Observability workflows enable software teams to collect data from various distributed systems in real time to gain insight into the application.
Microservices monitoring. Observability workflows enable the monitoring of complex distributed systems by ingesting data from these systems and creating a report on the collected data.
Automated incident management. Observability workflows enable teams to offload the manual workload of identifying and resolving any software incident that could impact customers.

Key Components of AIOps

The key components of AIOps overlap with observability workflows.
Apart from the key components of observability workflows listed in Table 1, below are the key components of an AIOps system:

AIOps can use natural language processing to gain better insight into the system, applying ML techniques for anomaly detection, predictive analysis, and behavior learning.
Using AI and ML techniques, AIOps has a better healing capability than traditional observability workflows.
AI and ML provide a better self-learning ability for ensuring security and compliance, and they also help in preemptive detection and incident response.
With self-learning capabilities analyzing a vast amount of data, historical patterns, and predictive signals, AIOps works with higher accuracy.

There are various tools (e.g., Elasticsearch, Logstash, Kibana, Kedro) and MLOps practices that can work together to create an AIOps observability workflow. These tools play a crucial role in various segments like collection, processing, storage, AI/ML, incident management, and reporting. The AIOps framework can be created using these various tools together, and once a continuous pipeline is set up, it can be used in observability workflows. These pipelines will expose interaction points of the workflows.

AIOps for Observability Workflows

The core of both AIOps and observability workflows is data from various sources, ingestion, storage, monitoring, mitigation, and continuous learning. AIOps consumes the huge volume of data produced by various microservices, processes it, and uses it to continuously improve the AI/ML system. This, when combined with observability workflows, enhances the overall performance of the system. There are various overlapping components between these two systems, like data collection, data processing, data ingestion, data storage, data visualization, and continuous learning. It is easier to extend observability workflows to utilize AIOps for improved functionality.

Key Components of AIOps for Observability Workflows

The AIOps components described in Figure 1 provide AI/ML intelligence capabilities to existing observability workflows by using ML models. These models are continuously trained on the data and actions using a feedback loop that enhances their capabilities over time. AIOps components leverage ML models to provide intelligent, automated remediation actions and evolving recommendations (Figure 1). Apart from the components mentioned in Table 1, the AIOps components for observability workflows are:

Table 2. AIOps components for observability workflows

Component | Details
Detection | The component to detect anomalies in data
Recognition | Recognize common patterns in data
ML models | The core of AIOps, performing ML-related activities
Analysis | Analyze data using ML models
Recommendation | Use data to generate recommendations
Remediation | Use ML model analysis to automatically remediate detected issues
Feedback | Use ML model output and actions to retrain the model
Behavior monitoring | Use data to analyze actor behavior and identify any malicious activity
Continuous improvements | Use past data for root cause analysis to improve observability workflows

How the Key Components Interact

A data ingestion tool, when applied with an AI/ML solution, provides an intelligent platform for observability. The components of AIOps for observability workflows work together to provide a unified experience of a continuously evolving AI/ML-based system. At the core of these interactions are data, an AI model, continuous learning, and prediction.
Input data is sanitized, aggregated, and then fed to an AI/ML model, which analyzes it for possible risks, mitigation, and reporting. There are various open-source platforms (e.g., ELK, Prometheus, OpenTelemetry) that can be combined with AIOps platforms to provide a unified experience.

Benefits and Challenges of AIOps for Observability Workflows

AIOps introduces various benefits and increases the efficiency of a software development team. The benefits of utilizing observability workflows with AIOps are:

AIOps in an observability workflow will detect issues faster, resulting in quicker resolution.
With an ever-evolving system in place, less manual intervention is required.
AIOps' full automation supported by AI/ML will lead to improved productivity.
The security landscape is an ever-growing area, and using AIOps will strengthen the security of the system.
AIOps will result in enhanced compliance in a distributed system.
AIOps' self-healing ability will improve the system's scalability and uptime.

However, these systems have various challenges associated with them, such as:

The cost of implementation in terms of AI/ML inference, continuous learning pipelines, infrastructure, etc., is higher compared to plain observability workflows.
AIOps requires niche knowledge of AI/ML models, training, retraining, etc., leading to a steeper learning curve.
AIOps introduces an extra component of AI/ML, which leads to higher complexity of the overall system.
The AIOps system has a risk of bias associated with the data that is fed into the ML model.
Once implemented, there could be a high chance of false positives due to the configuration of the AI/ML models.
AIOps still has a skills gap issue; it requires an understanding of AI/ML, ops, and observability.
It takes time to realize the return on investment for an AIOps system.

Enhance Existing Observability Workflows With AIOps

AIOps can be implemented in a new observability workflow, and it can also be added to an existing workflow, given the amount of overlap between the two systems and the components that can be reused during the upgrade. To implement an AIOps solution in an existing workflow, a systematic and modular approach is the most effective way to avoid issues and ensure a smooth rollout. At a high level, the implementation process can be:

Study and identify the existing components of the workflow.
Outline the reusable components once the systems are identified.
Design the AIOps system with these components.
Introduce the new components (e.g., AI/ML, continuous learning, monitoring, mitigation).
Implement a proof of concept (POC) on non-production environments using existing components.
Plan a phased rollout for user acceptance testing once the POC is implemented successfully.
Plan a production rollout.

With a phased approach, any existing/legacy observability workflow can be made more intelligent using AIOps.

Conclusion

When combined with observability workflows, AIOps improves the capabilities of any complex software system. For instance, AIOps in observability workflows strengthens the system's security by proactively learning about and mitigating security threats using AI and ML capabilities. This approach enhances the flexibility of complex distributed systems by implementing a more versatile, self-learning, evolving solution rather than a hardcoded, rule-based, traditional observability workflow.
By utilizing the vast amount of data generated by various distributed systems, together with a feedback loop, AIOps can predict issues proactively and more accurately, thus lowering service downtime, time to detect issues, and time to implement fixes, while providing better scalability. AIOps does not stop there: over time, it raises the operational excellence of distributed systems and minimizes manual dependence for scaling, detecting, and fixing issues. This will prove useful looking forward; as cyber threats evolve and data volumes grow larger, AIOps will continue to be pivotal for enhancing such complex systems.

This is an excerpt from DZone's 2024 Trend Report, Observability and Performance: The Precipice of Building Highly Performant Software Systems. Read the Free Report
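As an illustration of the detection component described in Table 2 (and not part of the original article), the sketch below shows one of the simplest possible anomaly detectors that an AIOps pipeline might apply to a processed metric stream: a rolling z-score over recent values. The window size, threshold, and simulated latency figures are arbitrary assumptions chosen for demonstration only.

Python
from collections import deque
import math
import random

def make_detector(window=60, threshold=3.0):
    """Flag a value as anomalous if it deviates from the rolling mean
    by more than `threshold` standard deviations over the last `window`
    samples. Both parameters are illustrative, not recommendations."""
    history = deque(maxlen=window)

    def check(value):
        is_anomaly = False
        if len(history) == window:
            mean = sum(history) / window
            std = math.sqrt(sum((x - mean) ** 2 for x in history) / window)
            if std > 0 and abs(value - mean) / std > threshold:
                is_anomaly = True
        history.append(value)
        return is_anomaly

    return check

# Simulated latency stream: steady values with one injected spike
check = make_detector()
for i in range(200):
    latency = random.gauss(0.2, 0.02) if i != 150 else 2.5
    if check(latency):
        print(f"Anomaly at sample {i}: latency={latency:.2f}s")

In a real AIOps workflow, this detection step would sit behind the data processing and ingestion components and feed its findings into the analysis, recommendation, and remediation components described above, with the feedback loop retraining or retuning the model over time.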