The final step in the SDLC, and arguably the most crucial, is the testing, deployment, and maintenance of development environments and applications. DZone's category for these SDLC stages serves as the pinnacle of application planning, design, and coding. The Zones in this category offer invaluable insights to help developers test, observe, deliver, deploy, and maintain their development and production environments.
In the SDLC, deployment is the final lever that must be pulled to make an application or system ready for use. Whether it's a bug fix or new release, the deployment phase is the culminating event to see how something works in production. This Zone covers resources on all developers’ deployment necessities, including configuration management, pull requests, version control, package managers, and more.
The cultural movement that is DevOps — which, in short, encourages close collaboration among developers, IT operations, and system admins — also encompasses a set of tools, techniques, and practices. As part of DevOps, the CI/CD process incorporates automation into the SDLC, allowing teams to integrate and deliver incremental changes iteratively and at a quicker pace. Together, these human- and technology-oriented elements enable smooth, fast, and quality software releases. This Zone is your go-to source on all things DevOps and CI/CD (end to end!).
A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
The Testing, Tools, and Frameworks Zone encapsulates one of the final stages of the SDLC as it ensures that your application and/or environment is ready for deployment. From walking you through the tools and frameworks tailored to your specific development needs to leveraging testing practices to evaluate and verify that your product or application does what it is required to do, this Zone covers everything you need to set yourself up for success.
Automated Testing
The broader rise in automation has paved the way for advanced capabilities and time savings for developers and tech professionals, especially when it comes to testing. There are increasingly more conversations around how to transition tests to an automated cadence as well as a deeper push toward better automated testing integration throughout the SDLC. Solutions such as artificial intelligence (AI) and low code play an important role in implementing tests for development and testing teams, expanding test coverage and eliminating time spent on redundant tasks. It's a win-win-win.

In DZone's 2023 Automated Testing Trend Report, we further assess current trends related to automated testing, covering everything from architecture and test-driven development to observed benefits of AI and low-code tools. The question is no longer "Should we automate tests?" but rather "How do we better automate tests and integrate them throughout CI/CD pipelines to ensure high degrees of test coverage?" This question will be examined through our original research, expert articles from DZone Community members, and other insightful resources. As part of our December 2023 re-launch, we've added updates to the Solutions Directory and more.
In Kubernetes, a Secret is an object that stores sensitive information like a password, token, or key. One of the good practices for Kubernetes secret management is to use a third-party secrets store provider to manage secrets outside of the clusters and to configure pods to access those secrets. There are plenty of such third-party solutions available in the market, such as:

- HashiCorp Vault
- Google Cloud Secret Manager
- AWS Secrets Manager
- Azure Key Vault
- Akeyless

These third-party solutions, a.k.a. External Secrets Managers (ESMs), implement secure storage, secret versioning, fine-grained access control, auditing, and logging.

The External Secrets Operator (ESO) is an open-source solution used for secure retrieval and synchronization of secrets from the ESM. The secrets retrieved from the ESM are injected into the Kubernetes environment as native Secret objects. Thus, ESO enables application developers to use Kubernetes Secret objects backed by enterprise-grade external secrets managers. An ESO implementation in a Kubernetes cluster primarily requires two resources:

- ClusterSecretStore, which specifies how to access the External Secrets Manager
- ExternalSecret, which specifies what data is to be fetched and stored as a Kubernetes Secret object

Secret retrieval is a one-time activity, but synchronization of secrets generates traffic at regular intervals. So it's important to follow best practices (listed below) that can optimize ESO traffic to the external secrets management systems.

Defining a Refresh Interval for the ExternalSecret Object

Long-lived static secrets pose a security risk that can be addressed by adopting a secret rotation policy. Each time a secret gets rotated in the ESM, the change should be reflected in the corresponding Kubernetes Secret object. ESO supports automatic secret synchronization for such situations: secrets get synchronized after a specified time frame, called the "refresh interval," which is part of the ExternalSecret resource definition. It is advisable to opt for an optimal refresh interval value; e.g., a secret that's not likely to be modified often can have a refresh interval of one day instead of one hour or a few minutes. Remember, the more aggressive the refresh interval, the more traffic it will generate.

Defining a Refresh Interval for the ClusterSecretStore Object

The refresh interval defined in the ClusterSecretStore (CSS) is the frequency with which the CSS validates itself with the ESM. If the refresh interval is not specified while defining a CSS object, the default refresh interval (which is specific to the ESM API implementation) is used. The default CSS refresh interval has been found to be very aggressive; i.e., the interaction with the ESM happens very frequently. For example, describing a sample CSS (with HashiCorp Vault as the ESM) that has no refresh interval value in its definition shows a refresh interval of five minutes, implying the resource is approaching the ESM every five minutes and generating avoidable traffic.

The refresh interval attribute gets missed in most CSS definitions because:

- There is a discrepancy between the default value of the refresh interval for an ExternalSecret object and that for a ClusterSecretStore object, which can inadvertently lead to an un-optimized ClusterSecretStore implementation. The default refresh interval for the ExternalSecret object is ZERO, which signifies that refresh is disabled; i.e., the secret never gets synchronized automatically. The default refresh interval for the ClusterSecretStore object, on the other hand, is ESM-specific; e.g., it is five minutes in the HashiCorp Vault scenario cited above.
- The refresh interval attribute is not present in the prominent samples/examples on the internet (e.g., check the ClusterSecretStore documentation). One can gain insight into this attribute via the command kubectl explain clustersecretstore.spec.

The significance of defining a refresh interval for the CSS can be realized by monitoring the traffic generated via a CSS object without a refresh interval in a test cluster that does not have any ESO object.
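To make this concrete, below is a minimal sketch of a ClusterSecretStore that sets the refresh interval explicitly. It assumes HashiCorp Vault as the ESM (matching the example above); the server address, secret path, and token Secret reference are placeholders, and the exact field names and units should be confirmed against your ESO version with kubectl explain clustersecretstore.spec.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  # Explicit store refresh interval (in seconds on recent ESO versions).
  # Leaving this out falls back to an aggressive, provider-specific default.
  refreshInterval: 86400
  provider:
    vault:
      server: "https://vault.example.com:8200"  # placeholder Vault address
      path: "secret"                            # placeholder KV mount
      version: "v2"
      auth:
        tokenSecretRef:
          name: vault-token                     # placeholder Secret holding a Vault token
          key: token
          namespace: vault-ns                   # placeholder; required for cluster-scoped stores
```

With the store validating itself roughly once a day instead of every few minutes, the background traffic to the ESM drops accordingly, without affecting how often individual ExternalSecret objects refresh their data.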
Using Cluster-Scoped External Secrets Over Namespace-Scoped External Secrets

The first ESO release was made in May 2021. Back then, the only option was to use the namespace-scoped ExternalSecret resource. So, even if the stored secret was global, an ExternalSecret object had to be defined for each namespace. ExternalSecret objects across all namespaces would get synchronized at the defined refresh interval, thereby generating traffic. The larger the number of namespaces, the more traffic they would generate. There was a dire need for a global ExternalSecret object accessible across different namespaces. To fill this gap, the cluster-level external secret resource, ClusterExternalSecret (CES), was introduced in April 2022 (v0.5.0). Opting for ClusterExternalSecret over ExternalSecret (where applicable) can avoid redundant traffic generation. A sample CES manifest, specific to HashiCorp Vault and a Kubernetes image pull secret, is shown below:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: "sre-cluster-ext-secret"
spec:
  # The name to be used on the ExternalSecrets
  externalSecretName: sre-cluster-es
  # This is a basic label selector to select the namespaces to deploy ExternalSecrets to.
  # You can read more about them here:
  # https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#resources-that-support-set-based-requirements
  namespaceSelector:  # mandatory -- not adding this will expose the external secret
    matchLabels:
      label: try_ces
  # How often the ClusterExternalSecret should reconcile itself.
  # This will decide how often to check and make sure that the ExternalSecrets exist in the matching namespaces.
  refreshTime: "10h"
  # This is the spec of the ExternalSecrets to be created
  externalSecretSpec:
    secretStoreRef:
      name: vault-backend
      kind: ClusterSecretStore
    target:
      name: sre-k8secret-cluster-es
      template:
        type: kubernetes.io/dockerconfigjson
        data:
          .dockerconfigjson: "{{ .dockersecret | toString }}"
    refreshInterval: "24h"
    data:
      - secretKey: dockersecret
        remoteRef:
          key: imagesecret
          property: dockersecret
```

Conclusion

By following the best practices listed above, the traffic from the External Secrets Operator to the External Secrets Manager can be reduced significantly.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.

In the past, before CI/CD and Kubernetes came along, deploying software was a real headache. Developers would build stuff on their own machines, then package it and pass it to the operations team to deploy it on production. This approach would frequently lead to delays, miscommunications, and inconsistencies between environments. Operations teams had to set up the deployments themselves, which increased the risk of human errors and configuration issues. When things went wrong, rollbacks were time-consuming and disruptive. Also, without automated feedback and central monitoring, it was tough to keep an eye on how builds and deployments were progressing or to identify production issues.

With the advent of CI/CD pipelines combined with Kubernetes, deploying software is smoother. Developers can simply push their code, which triggers builds, tests, and deployments. This enables organizations to ship new features and updates more frequently and reduce the risk of errors in production. This article explains the CI/CD transformation with Kubernetes and provides a step-by-step guide to building a pipeline.

Why CI/CD Should Be Combined With Kubernetes

CI/CD paired with Kubernetes is a powerful combination that makes the whole software development process smoother. Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. CI/CD pipelines, on the other hand, automate how we build, test, and roll out software. When you put them together, you can deploy more often and faster, boost software quality with automatic tests and checks, cut down on the chance of pushing out buggy code, and get more done by automating tasks that used to be done by hand.

CI/CD with Kubernetes helps developers and operations teams work better together by giving them a shared space to do their jobs. This teamwork lets companies deliver high-quality applications rapidly and reliably, gaining an edge in today's fast-paced world. Figure 1 lays out the various steps:

Figure 1. Push-based CI/CD pipeline with Kubernetes and monitoring tools

There are several benefits in using CI/CD with Kubernetes, including:

- Faster and more frequent application deployments, which help in rolling out new features or critical bug fixes to the users
- Improved quality by automating testing and incorporating quality checks, which helps in reducing the number of bugs in your applications
- Reduced risk of deploying broken code to production since CI/CD pipelines can conduct automated tests and roll back deployments if any problems exist
- Increased productivity by automating manual tasks, which can free developers' time to focus on important projects
- Improved collaboration between development and operations teams since CI/CD pipelines provide a shared platform for both teams to work

Tech Stack Options

There are different options available if you are considering building a CI/CD pipeline with Kubernetes.
Some of the popular ones include:

- Open-source tools such as Jenkins, Argo CD, Tekton, Spinnaker, or GitHub Actions
- Enterprise tools, including but not limited to, Azure DevOps, GitLab CI/CD, or AWS CodePipeline

Deciding whether to choose an open-source or enterprise platform to build efficient and reliable CI/CD pipelines with Kubernetes will depend on your project requirements, team capabilities, and budget.

Impact of Platform Engineering on CI/CD With Kubernetes

Platform engineering builds and maintains the underlying infrastructure and tools (the "platform") that development teams use to create and deploy applications. When it comes to CI/CD with Kubernetes, platform engineering has a big impact on making the development process better. It does so by hiding the complex parts of the underlying infrastructure and giving developers self-service options.

Platform engineers manage and curate tools and technologies that work well with Kubernetes to create a smooth development workflow. They create and maintain CI/CD templates that developers can reuse, allowing them to set up pipelines without thinking about the details of the infrastructure. They also set up rules and best practices for containerization, deployment strategies, and security measures, which help maintain consistency and reliability across different applications. What's more, platform engineers provide ways to observe and monitor applications running in Kubernetes, which let developers find and fix problems and make improvements based on data.

By building a strong platform, platform engineering helps dev teams zero in on creating and rolling out features without getting bogged down by the complexities of the underlying tech. It brings together developers, operations, and security teams, which leads to better teamwork and faster progress in how things are built.

How to Build a CI/CD Pipeline With Kubernetes

Regardless of the tech stack you select, you will often find similar workflow patterns and steps. In this section, I will focus on building a CI/CD pipeline with Kubernetes using GitHub Actions.

Step 1: Setup and prerequisites

- GitHub account – needed to host your code and manage the CI/CD pipeline using GitHub Actions
- Kubernetes cluster – create one locally (e.g., minikube) or use a managed service from Amazon or Azure
- kubectl – Kubernetes command line tool to connect to your cluster
- Container registry – needed for storing Docker images; you can either use a cloud provider's registry (e.g., Amazon ECR, Azure Container Registry, Google Artifact Registry) or set up your own private registry
- Node.js and npm – install Node.js and npm to run the sample Node.js web application
- Visual Studio/Visual Studio Code – IDE platform for making code changes and submitting them to a GitHub repository

Step 2: Create a Node.js web application

Using Visual Studio, create a simple Node.js application with a default template.
If you look inside, the generated server.js file will look like this:

```javascript
// server.js
'use strict';

var http = require('http');
var port = process.env.PORT || 1337;

http.createServer(function (req, res) {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('Hello from kubernetes\n');
}).listen(port);
```

Step 3: Create a package.json file to manage dependencies

Inside the project, add a new package.json file to manage dependencies:

```json
{
  "name": "nodejs-web-app1",
  "version": "0.0.0",
  "description": "NodejsWebApp",
  "main": "server.js",
  "author": {
    "name": "Sunny"
  },
  "scripts": {
    "start": "node server.js",
    "test": "echo \"Running tests...\" && exit 0"
  },
  "devDependencies": {
    "eslint": "^8.21.0"
  },
  "eslintConfig": {}
}
```

Step 4: Build a container image

Create a Dockerfile to define how to build your application's Docker image:

```dockerfile
# Dockerfile
# Use the official Node.js image from Docker Hub
FROM node:14

# Create and change to the app directory
WORKDIR /usr/src/app

# Copy package.json and package-lock.json
COPY package*.json ./

# Install dependencies
RUN npm install

# Copy the rest of the application code
COPY . .

# server.js reads the port from the PORT environment variable (defaulting to 1337),
# so set it to match the exposed port
ENV PORT=3000

# Expose the port the app runs on
EXPOSE 3000

# Command to run the application
CMD ["node", "server.js"]
```

Step 5: Create a Kubernetes Deployment manifest

Create a deployment.yaml file to define how your application will be deployed in Kubernetes:

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nodejs-app
  template:
    metadata:
      labels:
        app: nodejs-app
    spec:
      containers:
        - name: nodejs-container
          image: nodejswebapp
          ports:
            - containerPort: 3000
          env:
            - name: NODE_ENV
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: nodejs-service
spec:
  selector:
    app: nodejs-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
  type: LoadBalancer
```

Step 6: Push code to GitHub

Create a new code repository on GitHub, initialize the repository, commit your code changes, and push it to your GitHub repository:

```shell
git init
git add .
git commit -m "Initial commit"
git remote add origin "<remote git repo url>"
git push -u origin main
```

Step 7: Create a GitHub Actions workflow

Inside your GitHub repository, go to the Actions tab and create a new workflow (e.g., main.yml) in the .github/workflows directory. In the GitHub repository settings, create Secrets under Actions for Docker and the Kubernetes cluster — these are used in your workflow to authenticate:
```yaml
# main.yml
name: CI/CD Pipeline

on:
  push:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Node.js
        uses: actions/setup-node@v2
        with:
          node-version: '14'

      - name: Install dependencies
        run: npm install

      - name: Run tests
        run: npm test

      - name: Build Docker image
        run: docker build -t <your-docker-image> .

      - name: Log in to Docker Hub
        uses: docker/login-action@v1
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: <your-docker-image-tag>

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up kubectl
        uses: azure/setup-kubectl@v1
        with:
          version: 'latest'

      - name: Set up Kubeconfig
        run: echo "${{ secrets.KUBECONFIG }}" > $HOME/.kube/config

      - name: Deploy to Kubernetes
        run: kubectl apply -f deployment.yaml
```

Step 8: Trigger the pipeline and monitor

Modify server.js and push it to the main branch; this triggers the GitHub Actions workflow. Monitor the workflow progress: it installs the dependencies, sets up npm, builds the Docker image and pushes it to the container registry, and deploys the application to Kubernetes. Once the workflow has completed successfully, you can access your application running inside the Kubernetes cluster. You can leverage open-source monitoring tools like Prometheus and Grafana for metrics.

Deployment Considerations

There are a few deployment considerations to keep in mind when developing CI/CD pipelines with Kubernetes to maintain security and make the best use of resources:

Scalability

- Use horizontal pod autoscaling to scale your application's Pods based on how much CPU, memory, or custom metrics are needed. This helps your application work well under varying loads (see the sketch after these considerations).
- When using a cloud-based Kubernetes cluster, use the cluster autoscaler to change the number of worker nodes as needed to ensure enough resources are available and no money is wasted on idle resources.
- Ensure your CI/CD pipeline incorporates pipeline scalability, allowing it to handle varying workloads as per your project needs.

Security

- Scan container images regularly to find security issues. Add tools for image scanning into your CI/CD pipeline to stop deploying insecure code.
- Implement network policies to limit how Pods and services talk to each other inside a Kubernetes cluster. This cuts down on ways attackers could get in.
- Set up secrets management using Kubernetes Secrets or external key vaults to secure and manage sensitive info such as API keys and passwords.
- Use role-based access control to control access to Kubernetes resources and CI/CD pipelines.

High availability

- Through multi-AZ or multi-region deployments, you can set up your Kubernetes cluster in different availability zones or regions to keep it running during outages.
- Pod disruption budgets help you control how many Pods can be down during planned disruptions (like fixing nodes) or unplanned ones (like when nodes fail).
- Implement health checks to monitor the health of your Pods and automatically restart any that fail, to maintain availability.

Secrets management

- Store API keys, certificates, and passwords as Kubernetes Secrets, which are encrypted and added to Pods.
- You can also consider external secrets management tools like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault if you need dynamic secret generation and auditing.
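To make the scalability and availability items above concrete, here is a minimal sketch of a HorizontalPodAutoscaler and a PodDisruptionBudget targeting the nodejs-deployment from Step 5. The replica bounds, CPU target, and minAvailable value are illustrative assumptions, not recommendations for any particular workload.

```yaml
# hpa-pdb.yaml (illustrative sketch)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nodejs-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nodejs-deployment       # the Deployment created in Step 5
  minReplicas: 3
  maxReplicas: 10                 # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # assumed CPU utilization target
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nodejs-pdb
spec:
  minAvailable: 2                 # keep at least two Pods up during voluntary disruptions
  selector:
    matchLabels:
      app: nodejs-app             # matches the labels in deployment.yaml
```

Applying this alongside deployment.yaml (for example, with kubectl apply -f hpa-pdb.yaml) lets Kubernetes scale the Pods up under CPU pressure while keeping a floor of available replicas during node maintenance.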
Conclusion

Leveraging CI/CD pipelines with Kubernetes has become a must-have approach in today's software development. It revolutionizes the way teams build, test, and deploy apps, leading to more efficiency and reliability. By using automation, teamwork, and the strength of container management, CI/CD with Kubernetes empowers organizations to deliver high-quality software at speed. The growing role of AI and ML will likely have an impact on CI/CD pipelines — such as smarter testing, automated code reviews, and predictive analysis to further enhance the development process. When teams adopt best practices, keep improving their pipelines, and stay attentive to new trends, they can get the most out of CI/CD with Kubernetes, thus driving innovation and success.

This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.
Full stack development is often likened to an intricate balancing act, where developers are expected to juggle multiple responsibilities across the front end, back end, database, and beyond. As the definition of full stack development continues to evolve, so too does the approach to debugging. Full stack debugging is an essential skill for developers, as it involves tracking issues through multiple layers of an application, often navigating domains where one's knowledge may only be cursory. In this blog post, I aim to explore the nuances of full stack debugging, offering practical tips and insights for developers navigating the complex web of modern software development. Notice that this is an introductory post focusing mostly on the front-end debugging aspects. In the following posts, I will dig deeper into the less familiar capabilities of front-end debugging.

Full Stack Development: A Shifting Definition

The definition of full stack development is as fluid as the technology stacks themselves. Traditionally, full-stack developers were defined as those who could work on both the front end and back end of an application. However, as the industry evolves, this definition has expanded to include aspects of operations (OPS) and configuration. The modern full-stack developer is expected to submit pull requests that cover all parts required to implement a feature — backend, database, frontend, and configuration. While this does not make them an expert in all these areas, it does require them to navigate across domains, often relying on domain experts for guidance. I've heard it said that full-stack developers are "jack of all trades, master of none." However, the full quote probably better represents the reality: "Jack of all trades, master of none, but better than a master of one."

The Full Stack Debugging Approach

Just as full stack development involves working across various domains, full stack debugging requires a similar approach. A symptom of a bug may manifest in the front end, but its root cause could lie deep within the backend or database layers. Full stack debugging is about tracing these issues through the layers and isolating them as quickly as possible. This is no easy task, especially when dealing with complex systems where multiple layers interact in unexpected ways. The key to successful full stack debugging lies in understanding how to track an issue through each layer of the stack and identifying common pitfalls that developers may encounter.

Frontend Debugging: Tools and Techniques

It Isn't "Just console.log"

Frontend developers are often stereotyped as relying solely on console.log for debugging. While this method is simple and effective for basic debugging tasks, it falls short when dealing with the complex challenges of modern web development. The complexity of frontend code has increased significantly, making advanced debugging tools not just useful but necessary. Yet, despite the availability of powerful debugging tools, many developers continue to shy away from them, clinging to old habits.

The Power of Developer Tools

Modern web browsers come equipped with robust developer tools that offer a wide range of capabilities for debugging frontend issues. These tools, available in browsers like Chrome and Firefox, allow developers to inspect elements, view and edit HTML and CSS, monitor network activity, and much more. One of the most powerful, yet underutilized, features of these tools is the JavaScript debugger.
The debugger allows developers to set breakpoints, step through code, and inspect the state of variables at different points in the execution. However, the complexity of frontend code, particularly when it has been obfuscated for performance reasons, can make debugging a challenging task.

We can launch the browser developer tools from the Firefox menu, and Chrome has an equivalent option. I prefer working with Firefox, as I find its developer tools more convenient, but both browsers have similar capabilities. Both have fantastic debuggers; unfortunately, many developers limit themselves to console printing instead of exploring this powerful tool.

Tackling Code Obfuscation

Code obfuscation is a common practice in frontend development, employed to protect proprietary code and reduce file sizes for better performance. However, obfuscation also makes the code difficult to read and debug. Fortunately, both Chrome and Firefox developer tools provide a feature to de-obfuscate code, making it more readable and easier to debug. By clicking the curly brackets button in the toolbar, developers can transform a single line of obfuscated code into a well-formed, debuggable file.

Another important tool in the fight against obfuscation is the source map. Source maps are files that map obfuscated code back to its original source code, including comments. When generated and properly configured, source maps allow developers to debug the original code instead of the obfuscated version. You can use code like this in the JavaScript file to point at the source map file:

//@sourceMappingURL=myfile.js.map

For this to work in Chrome, we need to ensure that "Enable JavaScript source maps" is checked in the developer tools settings. Last I checked, it was on by default, but it doesn't hurt to verify.

Debugging Across Layers

Isolating Issues Across the Stack

In full stack development, issues often manifest in one layer but originate in another. For example, a frontend error might be caused by a misconfigured backend service or a database query that returns unexpected results. Isolating the root cause of these issues requires a methodical approach, starting from the symptom and working backward through the layers. A common strategy is to reproduce the issue in a controlled environment, such as a local development setup, where each layer of the stack can be tested individually. This helps to narrow down the potential sources of the problem. Once the issue has been isolated to a specific layer, developers can use the appropriate tools and techniques to diagnose and resolve it.

The Importance of System-Level Debugging

Full stack debugging is not limited to the application code. Often, issues arise from the surrounding environment, such as network configurations, third-party services, or hardware limitations. A classic example that we ran into a couple of years ago was a production problem where a WebSocket connection would frequently disconnect. After extensive debugging, Steve discovered that the issue was caused by the CDN provider (Cloudflare) timing out the WebSocket after two minutes — something that could only be identified by debugging the entire system, not just the application code. System-level debugging requires a broad understanding of how different components of the infrastructure interact with each other.
It also involves using tools that can monitor and analyze the behavior of the system as a whole, such as network analyzers, logging frameworks, and performance monitoring tools.

Embracing Complexity

Full stack debugging is inherently complex, as it requires developers to navigate multiple layers of an application, often dealing with unfamiliar technologies and tools. However, this complexity also presents an opportunity for growth. By embracing the challenges of full stack debugging, developers can expand their knowledge and become more versatile in their roles. One of the key strengths of full stack development is the ability to collaborate with domain experts. When debugging an issue that spans multiple layers, it is important to leverage the expertise of colleagues who specialize in specific areas. This collaborative approach not only helps to resolve issues more efficiently but also fosters a culture of knowledge sharing and continuous learning within the team.

As technology continues to evolve, so too do the tools and techniques available for debugging. Developers should strive to stay up to date with the latest advancements in debugging tools and best practices. Whether it's learning to use new features in browser developer tools or mastering system-level debugging techniques, continuous learning is essential for success in full stack development.

Conclusion

Full stack debugging is a critical skill for modern developers. We mistakenly think it requires a deep understanding of both the application and its surrounding environment; I disagree. By mastering the tools and techniques discussed in this post and the upcoming posts, developers can more effectively diagnose and resolve issues that span multiple layers of the stack. Whether you're dealing with obfuscated frontend code, misconfigured backend services, or system-level issues, the key to successful debugging lies in a methodical, collaborative approach. You don't need to understand every part of the system; you just need the ability to eliminate the impossible.
With the rapid development of Internet technology, server-side architectures have become increasingly complex. It is now difficult to rely solely on the personal experience of developers or testers to cover all possible business scenarios. Therefore, real online traffic is crucial for server-side testing. TCPCopy [1] is an open-source traffic replay tool that has been widely adopted by large enterprises. While many use TCPCopy for testing in their projects, they may not fully understand its underlying principles. This article provides a brief introduction to how TCPCopy works, with the hope of assisting readers.

Architecture

The architecture of TCPCopy has undergone several upgrades, and this article introduces the latest 1.0 version. As shown in the diagram below, TCPCopy consists of two components: tcpcopy and intercept. tcpcopy runs on the online server, capturing live TCP request packets, modifying the TCP/IP header information, and sending them to the test server, effectively "tricking" the test server. intercept runs on an auxiliary server, handling tasks such as relaying response information back to tcpcopy.

Figure 1: Overview of the TCPCopy architecture

The simplified interaction process is as follows:

1. tcpcopy captures packets on the online server.
2. tcpcopy modifies the IP and TCP headers, spoofing the source IP and port, and sends the packet to the test server. The spoofed IP address is determined by the -x and -c parameters set at startup.
3. The test server receives the request and returns a response packet with the destination IP and port set to the spoofed IP and port from tcpcopy.
4. The response packet is routed to the intercept server, where intercept captures and parses the IP and TCP headers, typically returning only empty response data to tcpcopy.
5. tcpcopy receives and processes the returned data.

Technical Principles

TCPCopy operates in two modes: online and offline. The online mode is primarily used for real-time capturing of live request packets, while the offline mode reads request packets from pcap-format files. Despite the difference in working modes, the core principles remain the same. This section provides a detailed explanation of TCPCopy's core principles from several perspectives.

1. Packet Capturing and Sending

The core functions of tcpcopy can be summarized as "capturing" and "sending" packets. Let's begin with packet capturing. How do you capture real traffic from the server? Many people may feel confused when first encountering this question. In fact, Linux operating systems already provide the necessary functionality; a solid understanding of advanced Linux network programming is all that's needed. The initialization of packet capturing and sending in tcpcopy is handled in the tcpcopy/src/communication/tc_socket.c file. Next, we will introduce the two methods tcpcopy uses for packet capturing and packet sending.

Raw Socket

A raw socket can receive packets from the network interface card on the local machine. This is particularly useful for monitoring and analyzing network traffic. The code for initializing raw socket packet capturing in tcpcopy is shown below; this method supports capturing packets at both the data link layer and the IP layer.
```c
int tc_raw_socket_in_init(int type)
{
    int        fd, recv_buf_opt, ret;
    socklen_t  opt_len;

    if (type == COPY_FROM_LINK_LAYER) {
        /* Copy ip datagram from Link layer */
        fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));
    } else {
        /* Copy ip datagram from IP layer */
#if (TC_UDP)
        fd = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
#else
        fd = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
#endif
    }

    if (fd == -1) {
        tc_log_info(LOG_ERR, errno, "Create raw socket to input failed");
        fprintf(stderr, "Create raw socket to input failed:%s\n", strerror(errno));
        return TC_INVALID_SOCK;
    }

    recv_buf_opt = 67108864;
    opt_len = sizeof(int);

    ret = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &recv_buf_opt, opt_len);
    if (ret == -1) {
        tc_log_info(LOG_ERR, errno, "Set raw socket(%d)'s recv buffer failed");
        tc_socket_close(fd);
        return TC_INVALID_SOCK;
    }

    return fd;
}
```

The code for initializing the raw socket for sending packets is shown below. First, it creates a raw socket at the IP layer and informs the protocol stack not to prepend an IP header of its own.

```c
int tc_raw_socket_out_init(void)
{
    int fd, n;

    n = 1;

    /*
     * On Linux when setting the protocol as IPPROTO_RAW,
     * then by default the kernel sets the IP_HDRINCL option and
     * thus does not prepend its own IP header.
     */
    fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

    if (fd == -1) {
        tc_log_info(LOG_ERR, errno, "Create raw socket to output failed");
        fprintf(stderr, "Create raw socket to output failed: %s\n", strerror(errno));
        return TC_INVALID_SOCK;
    }

    /*
     * Tell the IP layer not to prepend its own header.
     * It does not need setting for linux, but *BSD needs
     */
    if (setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &n, sizeof(n)) < 0) {
        tc_socket_close(fd);
        tc_log_info(LOG_ERR, errno, "Set raw socket(%d) option \"IP_HDRINCL\" failed", fd);
        return TC_INVALID_SOCK;
    }

    return fd;
}
```

To construct the complete packet and send it to the target server:

- dst_addr is filled with the target IP address.
- The IP header is populated with the source and destination IP addresses.
- The TCP header is filled with the source port, destination port, and other relevant information.

Pcap

Pcap is an application programming interface (API) provided by the operating system for capturing network traffic, with its name derived from "packet capture." On Linux systems, pcap is implemented via libpcap, and most packet capture tools, such as tcpdump, use libpcap for capturing traffic. Below is the code for initializing packet capture with pcap.
```c
int tc_pcap_socket_in_init(pcap_t **pd, char *device, int snap_len,
        int buf_size, char *pcap_filter)
{
    int                 fd;
    char                ebuf[PCAP_ERRBUF_SIZE];
    struct bpf_program  fp;
    bpf_u_int32         net, netmask;

    if (device == NULL) {
        return TC_INVALID_SOCK;
    }

    tc_log_info(LOG_NOTICE, 0, "pcap open,device:%s", device);

    *ebuf = '\0';

    if (tc_pcap_open(pd, device, snap_len, buf_size) == TC_ERR) {
        return TC_INVALID_SOCK;
    }

    if (pcap_lookupnet(device, &net, &netmask, ebuf) < 0) {
        tc_log_info(LOG_WARN, 0, "lookupnet:%s", ebuf);
        return TC_INVALID_SOCK;
    }

    if (pcap_compile(*pd, &fp, pcap_filter, 0, netmask) == -1) {
        tc_log_info(LOG_ERR, 0, "couldn't parse filter %s: %s",
                pcap_filter, pcap_geterr(*pd));
        return TC_INVALID_SOCK;
    }

    if (pcap_setfilter(*pd, &fp) == -1) {
        tc_log_info(LOG_ERR, 0, "couldn't install filter %s: %s",
                pcap_filter, pcap_geterr(*pd));
        pcap_freecode(&fp);
        return TC_INVALID_SOCK;
    }

    pcap_freecode(&fp);

    if (pcap_get_selectable_fd(*pd) == -1) {
        tc_log_info(LOG_ERR, 0, "pcap_get_selectable_fd fails");
        return TC_INVALID_SOCK;
    }

    if (pcap_setnonblock(*pd, 1, ebuf) == -1) {
        tc_log_info(LOG_ERR, 0, "pcap_setnonblock failed: %s", ebuf);
        return TC_INVALID_SOCK;
    }

    fd = pcap_get_selectable_fd(*pd);

    return fd;
}
```

The code for initializing packet sending with pcap is as follows:

```c
int tc_pcap_snd_init(char *if_name, int mtu)
{
    char pcap_errbuf[PCAP_ERRBUF_SIZE];

    pcap_errbuf[0] = '\0';
    pcap = pcap_open_live(if_name, mtu + sizeof(struct ethernet_hdr),
            0, 0, pcap_errbuf);
    if (pcap_errbuf[0] != '\0') {
        tc_log_info(LOG_ERR, errno, "pcap open %s, failed:%s",
                if_name, pcap_errbuf);
        fprintf(stderr, "pcap open %s, failed: %s, err:%s\n",
                if_name, pcap_errbuf, strerror(errno));
        return TC_ERR;
    }

    return TC_OK;
}
```

Raw Socket vs. Pcap

Since tcpcopy offers two methods, which one is better? When capturing packets, we are primarily concerned with the specific packets we need. If the capture configuration is not set correctly, the system kernel might capture too many irrelevant packets, leading to packet loss, especially under high traffic pressure. After extensive testing, it has been found that when using the pcap interface to capture request packets, the packet loss rate in live environments is generally higher than when using raw sockets. Therefore, tcpcopy defaults to using raw sockets for packet capture, although the pcap interface can also be used (with the --enable-pcap option), which is mainly suited for high-end pfring captures and captures after switch mirroring. For packet sending, tcpcopy uses the raw socket output interface by default, but it can also send packets via pcap_inject (using the --enable-dlinject option). The choice of which method to use can be determined based on performance testing in your actual environment.

2. TCP Protocol Stack

We know that the TCP protocol is stateful. Although the packet-sending mechanism was explained earlier, without establishing an actual TCP connection, the sent packets cannot be truly received by the testing service. In everyday network programming, we typically use the TCP socket interfaces provided by the operating system, which abstract away much of the complexity of TCP states. However, in tcpcopy, since we need to modify the source IP and destination IP of the packets to deceive the testing service, the APIs provided by the operating system are no longer sufficient. As a result, tcpcopy implements a simulated TCP state machine, representing the most complex and challenging aspect of its codebase.
The relevant code, located in tcpcopy/src/tcpcopy/tc_session.c, handles crucial tasks such as simulating TCP interactions, managing network latency, and emulating upper-layer interactions.

Figure 2: Classic TCP state machine overview

In tcpcopy, a session is defined to maintain information for different connections. Different captured packets are processed accordingly:

- SYN packet: Represents a new connection request. tcpcopy assigns a source IP, modifies the destination IP and port, then sends the packet to the test server. At the same time, it creates a new session to store all states of this connection.
- ACK packet:
  - Pure ACK packet: To reduce the number of sent packets, tcpcopy generally doesn't send pure ACKs.
  - ACK packet with payload (indicating a specific request): tcpcopy finds the corresponding session and sends the packet to the test server. If the session is still waiting for the response to the previous request, it delays sending.
- RST packet: If the current session is waiting for the test server's response, the RST packet is not sent. Otherwise, it's sent.
- FIN packet: If the current session is waiting for the test server's response, it waits; otherwise, the FIN packet is sent.

3. Routing

After tcpcopy sends the request packets, their journey may not be entirely smooth:

- The IP of the request packet is forged and is not the actual IP of the machine running tcpcopy. If some machines have rp_filter (reverse path filtering) enabled, it will check whether the source IP address is trustworthy. If the source IP is untrustworthy, the packet will be discarded at the IP layer.
- If the test server receives the request packet, the response packet will be sent to the forged IP address. To ensure these response packets don't mistakenly go back to the client with the forged IP, proper routing configuration is necessary. If the routing isn't set up correctly, the response packet won't be captured by intercept, leading to incomplete data exchange.
- After intercept captures the response packet, it extracts the response headers, discards the actual data, and returns only the headers and other necessary information to tcpcopy. When necessary, it also merges the returned information to reduce the impact on the network of the machine running tcpcopy.

4. Intercept

For those new to tcpcopy, it might be puzzling — why is intercept necessary if we already have tcpcopy? While intercept may seem redundant, it actually plays a crucial role. You can think of intercept as the server-side counterpart of tcpcopy, with its name itself explaining its function: an "interceptor." But what exactly does intercept need to intercept? The answer is the response packet from the test service.

If intercept were not used, the response packets from the test server would be sent directly to tcpcopy. Since tcpcopy is deployed in a live environment, this means the response packets would be sent directly to the production server, significantly increasing its network load and potentially affecting the normal operation of the live service. With intercept, by spoofing the source IP, the test service is led to "believe" that these spoofed IP clients are accessing it. Intercept also performs aggregation and optimization of the response packet information, further ensuring that the live environment at the network level is not impacted by the test environment.

intercept is an independent process that, by default, captures packets using the pcap method.
During startup, the -F parameter needs to be passed, for example, "tcp and src port 8080," following libpcap's filter syntax. This means that intercept does not connect directly to the test service but listens on the specified port, capturing the return data packets from the test service and interacting with tcpcopy.

5. Performance

tcpcopy uses a single-process, single-thread architecture based on an epoll/select event-driven model, with the related code located in the tcpcopy/src/event directory. By default, epoll is used during compilation, though you can switch to select with the --select option. The choice of method can depend on the performance differences observed during testing. Theoretically, epoll performs better when handling a large number of connections.

In practical use, tcpcopy's performance is directly tied to the amount of traffic and the number of connections established by intercept. The single-threaded architecture itself is usually not a performance bottleneck (for instance, Nginx and Redis both use single-threaded + epoll models and can handle large amounts of concurrency). Since tcpcopy only establishes connections directly with intercept and does not need to connect to the test machines or occupy port numbers, tcpcopy consumes few resources, with the main impact being network bandwidth consumption.

```c
static tc_event_actions_t tc_event_actions = {
#ifdef TC_HAVE_EPOLL
    tc_epoll_create,
    tc_epoll_destroy,
    tc_epoll_add_event,
    tc_epoll_del_event,
    tc_epoll_polling
#else
    tc_select_create,
    tc_select_destroy,
    tc_select_add_event,
    tc_select_del_event,
    tc_select_polling
#endif
};
```

Conclusion

TCPCopy is an excellent open-source project. However, due to the author's limitations, this article only covers the core technical principles of TCPCopy, leaving many details untouched [2]. Nevertheless, I hope this introduction provides some inspiration to those interested in TCPCopy and traffic replay technologies!

References

[1] GitHub: session-replay-tools/tcpcopy
[2] Mobile Test Development: "A Brief Analysis of the Principle of TCPCopy, a Real-Time Traffic Playback Tool"
As organizations adopt microservices and containerized architectures, they often realize that they need to rethink their approach to basic operational tasks like security or observability. It makes sense: in a world where developers, rather than operations teams, are keeping applications up and running, and where systems are highly distributed, ephemeral, and interconnected, how can you take the same approach you have in the past?

From a technology perspective, there has been a clear shift to open source standards, especially in the realm of observability. Protocols like OpenTelemetry and Prometheus, and agents like Fluent Bit, are now the norm: according to the 2023 CNCF survey, Prometheus usage increased to 57% adoption in production workloads, with OpenTelemetry and Fluent both at 32% adoption in production. But open source tools alone can't help organizations transform their observability practices. As I've had the opportunity to work with organizations who have solved the challenge of observability at scale, I've seen a few common trends in how these companies operate their observability practices. Let's dig in.

Measure Thyself — Set Smart Goals With Service Level Objectives

Service Level Objectives were introduced to a wide audience by the Google SRE book in 2016 with great fanfare. But I've found that many organizations don't truly understand them, and even fewer have implemented them. This is unfortunate because they are secretly one of the best ways to predict failures. SLOs (Service Level Objectives) are specific goals that show how well a service should perform, like aiming for 99.9% uptime. SLIs (Service Level Indicators) are the actual measurements used to see if the SLOs are met; think about tracking the percentage of successful requests. Error budgeting is the process of allowing a certain amount of errors or downtime within the SLOs, which helps teams balance reliability and new features and ensures they don't push too hard at the risk of making things unstable. For example, a 99.9% availability SLO over a 30-day window leaves an error budget of roughly 43 minutes of downtime for that month. Having SLOs on your key services and using error budgeting allows you to identify impending problems and act on them.

One of the most mature organizations that I've seen practicing SLOs is Doordash. For them, the steaks are high (pun intended). If they have a high SLO burn for a service, that could lead to a merchant not getting a food order on time, right, or at all. Or it could lead to a consumer not getting their meal on time or experiencing errors in the app.

Getting started with SLOs doesn't need to be daunting. My colleague recently wrote up her tips on getting started with SLOs. She advises keeping SLOs practical and achievable, starting with the goals that truly delight customers. Start small by setting an SLO for a key user journey. Collaborate with SREs and business users to define realistic targets. Be flexible and adjust SLOs as your system evolves.
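To make SLIs a bit more concrete, here is a minimal sketch of a Prometheus recording rule that computes an availability SLI and flags SLO burn. It assumes a hypothetical checkout service exposing standard http_requests_total counters; the metric names, job label, and thresholds are placeholders rather than a recommended setup.

```yaml
# slo-rules.yaml (illustrative sketch)
groups:
  - name: checkout-availability-slo
    rules:
      # SLI: fraction of non-5xx requests over the last 5 minutes
      - record: sli:checkout_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="checkout", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))

      # Page when the SLI dips below the 99.9% objective, i.e., the error budget is burning
      - alert: CheckoutAvailabilityBudgetBurn
        expr: sli:checkout_availability:ratio_rate5m < 0.999
        for: 10m
        labels:
          severity: page
```

Production-grade setups usually layer multi-window burn-rate alerts on top of this, but even a simple rule like the one above turns the SLO from a slide-deck number into something the on-call rotation can actually act on.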
Embrace Events — The Only Constant in Your Cloud-Native Environment Is Change

In DevOps, things are always changing. We're constantly shipping new code, turning features on and off, updating our infrastructure, and more. This is great for innovation and agility, but it also introduces change, which opens the door for errors. Plus, the world outside our systems is always shifting too, from what time of day it is to what's happening in the news. All of this can make it hard to keep everything running smoothly. These everyday events that result in changes are the most common causes of issues in production systems. And the challenge is that these changes are initiated by many different types of systems, from feature flag management to CI/CD, cloud infrastructure, security, and more. Interestingly, 67% of organizations don't have the ability to identify the change(s) in their environments that caused performance issues, according to the Digital Enterprise Journal.

The only way to stay on top of all of these changes is to connect them into a central hub to track them. When people talk about "events" as a fourth type of telemetry, outside of metrics, logs, and traces, this is typically what they mean. One organization I've seen do this really well is Dandy Dental. They've found that the ability to understand change in their system, and quickly correlate it to changes in behavior, has made debugging a lot faster for developers. Making a habit of understanding what changed has allowed Dandy to improve their observability effectiveness.

Adopt Hypothesis-Driven Troubleshooting — Enable Any Developer to Fix Issues Faster

When a developer begins troubleshooting an issue, they start with a hypothesis. Their goal is to quickly prove or disprove that hypothesis. The more context they have about the issue, the faster they can form a good hypothesis to test. If they have multiple hypotheses, they will need to test each one in order of likelihood to determine which one is the culprit. The faster a developer can prove or disprove a hypothesis, the faster they can solve the problem.

Developers use observability tools both to form their initial hypotheses and to prove or disprove them. A good observability tool will give the developer the context they need to form a likely hypothesis. A great observability tool will make it as easy as possible for a developer with any level of expertise or familiarity with the service to quickly form a likely hypothesis and test it. Organizations that want to improve their MTTR can start by shrinking the time to create a hypothesis. Tooling that provides the on-call developer with highly contextual alerts that immediately focus them on the relevant information can help shrink this time.

The other advantage of explicitly taking a hypothesis-driven troubleshooting approach is concurrency. If the issue is high severity, or has significant complexity, they may need to call in more developers to help them concurrently prove or disprove each hypothesis to speed up troubleshooting time. An AI software company we work with uses hypothesis-driven troubleshooting. I recently heard a story about how they were investigating a high error rate on a service and used their observability tool to narrow it down to two hypotheses. Within 10 minutes they had proven their first hypothesis to be correct: the errors were all occurring in a single region that had missed the most recent software deploy.

Taking the Next Step

If you're committed to taking your observability practice to the next level, these tried-and-true habits can help you take the initial steps forward. All three of these practices are areas that we're passionate about. If you'll be at KubeCon and want to discuss this more, please come say hello!

This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
The global developer population is expected to reach 28.7 million people by 2024, surpassing the entire population of Australia. Among such a large group, achieving unanimous agreement on anything is remarkable. Yet, there's widespread consensus on one point: good technical documentation is crucial and saves considerable time. Some even consider it a cornerstone of engineering success, acting as a vital link between ideas, people, and visions. Despite this, many developers battle daily with poor, incomplete, or inaccurate documentation. It's a common grievance in the tech community, where developers lament the hours spent on documentation, searching scattered sources for information, or enduring unnecessary meetings to piece together disjointed details.

Vadim Kravcenko, in his essay on Healthy Documentation, highlights a pervasive issue:

"The constant need to have meetings is a symptom of a deeper problem — a lack of clear, accessible, and reliable documentation. A well-documented workflow doesn't need an hour-long session for clarification. A well-documented decision doesn't need a room full of people to understand its rationale. A well-documented knowledge base doesn't need a group huddle whenever a new member joins the team."

Documentation, especially that of system architecture, is often seen as a burdensome afterthought, fraught with tedious manual diagramming and verbose records spread across various platforms. It's important to highlight that bad documentation is not just a source of frustration for developers, but it also has a very tangible business impact. After all, time is money. When developers waste time manually recording information or looking for something in vain, they are being diverted from building new features, optimizing performance, and, in general, producing value for end users. This article examines the evolving requirements of modern system architecture documentation and how system architecture observability might be a way to reduce overhead for teams and provide them with the information they need when they need it.

Why System Architecture Documentation Is Important

System documentation is crucial as it captures all aspects of a software system's development life cycle, from initial requirements and design to implementation and deployment. There are two primary benefits of comprehensive system architecture documentation:

1. Empowers All Stakeholders While Saving Time

System design is inherently collaborative, requiring inputs from various stakeholders to ensure the software system meets all business and technical requirements while remaining feasible and maintainable. Documentation serves different needs for different stakeholders:

- New Team Additions: Comprehensive documentation helps new members quickly understand the system's architecture, technical decisions, and operational logic, facilitating smoother and faster onboarding.
- Existing Engineering Team: Serves as a consistent reference, guiding the team's implementation efforts and reducing the frequency of disruptive clarification meetings.
- Cross-Functional Teams: Enables teams from different functional areas to understand the system's behavior and integration points, which is crucial for coordinated development efforts.
- Security Teams and External Auditors: Documentation provides the necessary details for compliance checks, security audits, and certifications, detailing the system's structure and security measures.
Effective documentation ensures that all team members, regardless of their role, can access and utilize crucial project information, enhancing overall collaboration and efficiency. 2. Persisted, Single Source of Company Knowledge A dynamic, comprehensive repository of system knowledge helps mitigate risks associated with personnel changes, code redundancy, and security vulnerabilities. It preserves critical knowledge, preventing the 'single point of failure' scenario where departing team members leave a knowledge vacuum. This central source of truth also streamlines problem-solving and minimizes time spent on context-switching, duplicated efforts, and unnecessary meetings. By centralizing system information that is otherwise scattered across various platforms — like Jira, GitHub, Confluence, and Slack — teams can avoid the pitfalls of fragmented knowledge and ensure that everyone has access to the latest, most accurate system information. Modern Systems Have Outgrown Traditional Documentation The requirements for system architecture documentation have evolved dramatically from 20 or even 10 years ago. The scale, complexity, and distribution of modern systems render traditional documentation methods inadequate. Previously, a team might grasp a system's architecture, dependencies, and integrations by reviewing a static diagram, skimming the codebase, and browsing through some decision records. Today, such an approach is insufficient due to the complexity and dynamic nature of contemporary systems. Increased Technological Complexity Modern technologies have revolutionized system architecture. The rise of distributed architectures, cloud-native applications, SaaS, APIs, and composable platforms has added layers of complexity. Additionally, the aging of software and the proliferation of legacy systems necessitate continual evolution and integration. This technological diversity and modularity increase interdependencies and complicate the system's communication structure, making traditional diagramming tools inadequate for capturing and understanding the full scope of system behaviors. Accelerated System Evolution The adoption of Agile methodologies and modern design practices like Continuous and Evolutionary Architecture has significantly increased the rate of change within software systems. Teams have to update their systems to reflect changes in external infrastructure, new technologies, evolving business requirements, or a plethora of other aspects that might change during the lifetime of any software system. That's why a dynamic documentation approach that can keep pace with rapid developments is necessary. Changing Engineering Team Dynamics The globalization of the workforce and the demand from users for global, scalable, and performant applications have led to more distributed engineering teams. Coordinating across different cross-functional teams, offices, and time zones introduces numerous communication challenges. The opportunity for misunderstandings and failures becomes an order-N-squared problem: adding a 10th person to a team adds 9 new lines of communication to worry about (a quick calculation below makes this concrete). That's also reflected in the famous Fred Brooks quote from The Mythical Man-Month: "Adding [human] power to a late software project makes it later." This complexity is compounded by the industry's high turnover rate, with developers often changing roles every 1 to 2 years, underscoring the necessity for robust, accessible documentation.
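To make the order-N-squared point concrete: with n people there are n(n - 1)/2 pairwise communication lines, so a team growing from 9 to 10 members gains 9 new lines. A quick sketch of the arithmetic:

```python
def communication_lines(team_size: int) -> int:
    """Number of pairwise communication lines in a team of the given size: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2

print(communication_lines(9))    # 36
print(communication_lines(10))   # 45
print(communication_lines(10) - communication_lines(9))  # 9 new lines from the 10th person
```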
New Requirements of System Architecture Documentation System architecture documentation should be accurate, current, understandable, maintainable, easy to access, and relevant. Despite these goals, traditional documentation methods have often fallen short due to several inherent challenges: Human Error and Inconsistencies: Relying on individuals, whether software architects, technical writers, or developers, to document system architecture introduces human error, inconsistencies, and quickly outdated information. These issues are compounded by barriers such as interpersonal communication gaps, lack of motivation, insufficient technical writing skills, or time constraints. Documentation as Code: While self-documenting code is a step forward, using comments to clarify code logic can only provide so much clarity. It lacks critical contextual information like decision rationales or system-wide implications. Fragmented Tooling: Documentation generators that scan source code and other artifacts can create documentation based on predefined templates and rules. However, these tools often provide fragmented views of the system, requiring manual effort to integrate and update disparate pieces of information. The complexity and dynamism of modern software systems intensify these documentation challenges. In response, new requirements have emerged: Automation: Documentation processes need to minimize manual effort, allowing for the automatic creation and maintenance of diagrams, component details, and decision records. Tools should enable the production of interactive, comprehensive visuals quickly and efficiently. Reliability and Real-Time Updates: Documentation must not only be reliable but also reflect real-time system states. This is essential to empowering engineers to make accurate, informed decisions based on the current state of the system. This immediacy helps troubleshoot issues efficiently and prevents wasted effort on tasks based on outdated information. Collaborative Features: Modern tooling must support both synchronous and asynchronous collaboration across distributed teams, incorporating features like version control and advanced search capabilities to manage and navigate documentation easily. In today's fast-paced software development environment, documentation should evolve alongside the systems it describes, facilitating seamless updates without imposing additional overhead on engineering teams. Observability Could Solve the Biggest Pain Points Leveraging observability could be the key to keeping system architecture documentation current while significantly reducing the manual overhead for engineering teams. The growing adoption of open standards, such as OpenTelemetry (OTel), is crucial here. These standards enhance interoperability among various tools and platforms, simplifying the integration and functionality of observability infrastructures. Imagine a scenario where adding just a few lines of code to your system allows a tool to automatically discover, track, and detect drift in your architecture, dependencies, and APIs (a minimal sketch of what those few lines can look like follows below). Such technology not only exists but is becoming increasingly accessible. Building software at scale remains a formidable challenge. It's clear that merely increasing the number of engineers or pursuing traditional approaches to technical documentation doesn't equate to better software — what's needed are more effective tools. Developers deserve advanced tools that enable them to visualize, document, and explore their systems' architecture effortlessly.
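As a rough illustration of the "few lines of code" mentioned above, and not any particular vendor's product, here is a minimal OpenTelemetry tracing setup for a Flask service. The service name and collector endpoint are placeholder assumptions; the point is that once spans flow to a collector, tooling can derive service maps, dependencies, and drift from them instead of relying on hand-drawn diagrams.

```python
# Minimal sketch: a few lines of OpenTelemetry instrumentation that emit traces
# an architecture-discovery or observability tool could consume.
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

provider = TracerProvider()
provider.add_span_processor(
    # The OTLP endpoint is a placeholder; point it at your collector or backend.
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

app = Flask("inventory-service")          # hypothetical service name
FlaskInstrumentor().instrument_app(app)   # every route now reports spans automatically
```

From telemetry like this, dependencies and API call patterns can be reconstructed continuously rather than redrawn by hand.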
Just as modern technology has exponentially increased the productivity of end-users, innovative tools for system design and documentation are poised to do the same for developers, transforming their capacity to manage and evolve complex systems.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC. In recent years, observability has re-emerged as a critical aspect of DevOps and software engineering in general, driven by the growing complexity and scale of modern, cloud-native applications. The transition toward microservices architecture as well as complex cloud deployments — ranging from multi-region to multi-cloud, or even hybrid-cloud, environments — has highlighted the shortcomings of traditional methods of monitoring. In response, the industry has standardized on logs, metrics, and traces as the three pillars of observability to provide a more comprehensive sense of how the application and the entire stack are performing. We now have a plethora of tools to collect, store, and analyze various signals to diagnose issues, optimize performance, and respond to incidents. Yet anyone working with Kubernetes will still say that observability in Kubernetes remains challenging. Part of it comes from the inherent complexity of working with Kubernetes, but the fact of the matter is that logs, metrics, and traces alone don't make up observability. Also, the vast ecosystem of observability tooling does not necessarily equate to ease of use or high ROI, especially given today's renewed focus on cost. In this article, we'll dive into some considerations for Kubernetes observability, the challenges of implementing it and some potential solutions, and the oft-forgotten aspect of developer experience in observability. Considerations for Kubernetes Observability When considering observability for Kubernetes, most have a tendency to dive straight into tool choices, but it's advisable to take a hard look at what falls under the scope of things to "observe" for your use case. Within Kubernetes alone, we already need to consider: Cluster components – API server, etcd, controller manager, scheduler; Node components – kubelet, kube-proxy, container runtime; Other resources – CoreDNS, storage plugins, Ingress controllers; Network – CNI, service mesh; Security and access – audit logs, security policies; Application – both internal and third-party applications. And most often, we inevitably have components that run outside of Kubernetes but interface with many applications running inside. Most notably, we have databases ranging from managed cloud offerings to external data lakes. We also have things like serverless functions, queues, or other Kubernetes clusters that we need to think about. Next, we need to identify the users of Kubernetes as well as the consumers of these observability tools. It's important to consider these personas because building for an internal-only cluster vs. a multi-tenant SaaS cluster may involve different requirements (e.g., privacy, compliance). Also, depending on the team composition, the primary consumers of these tools may be developers or dedicated DevOps/SRE teams who will have different levels of expertise with not only these tools but with Kubernetes itself. Only after considering the above factors can we start to talk about what tools to use. For example, if most applications are already on Kubernetes, using a Kubernetes-focused tool may suffice, whereas organizations with lots of legacy components may elect to reuse an existing observability stack.
Also, a large organization with various teams mostly operating as independent verticals may opt to use their own tooling stacks, whereas a smaller startup may opt to pay for an enterprise offering to simplify the setup across teams. Challenges and Recommendations for Observability Implementation After considering the scope and the intended audience of our observability stack, we're ready to narrow down the tool choices. Largely speaking, there are two options for implementing an observability stack: open source and commercial/SaaS. Open-Source Observability Stack The primary challenge with implementing a fully open-source observability solution is that there is no single tool that covers all aspects. Instead, what we have are ecosystems or stacks of tools that cover different aspects of observability. One of the more popular stacks, from the Prometheus and Grafana Labs suite of products, includes: Prometheus for scraping metrics and alerting; Loki for collecting logs; Tempo for distributed tracing; and Grafana for visualization. While the above setup does cover a vast majority of observability requirements, these tools still operate as individual microservices and do not provide the same level of uniformity as a commercial or SaaS product. But in recent years, there has been a strong push to at least standardize on OpenTelemetry conventions to unify how metrics, logs, and traces are collected. Since OpenTelemetry is a framework that is tool agnostic, it can be used with many popular open-source tools like Prometheus and Jaeger. Ideally, architecting with OpenTelemetry in mind will make it easier to standardize how telemetry data is generated, collected, and managed, given the growing list of compliant open-source tools. However, in practice, most organizations will already have established tools or in-house versions of them — whether that is the EFK (Elasticsearch, Fluentd, Kibana) stack or Prometheus/Grafana. Instead of forcing a new framework or tool, apply the ethos of standardization and improve what telemetry data is collected and how it is stored. Finally, one of the common challenges with open-source tooling is dealing with storage. Some tools like Prometheus cannot scale without offloading storage to another solution like Thanos or Mimir. But in general, it's easy to forget to monitor the health of the observability tooling itself and scale the back end accordingly. More telemetry data does not necessarily equal more signal, so keep a close eye on the volume and optimize as needed. Commercial Observability Stack On the commercial offering side, we usually have agent-based solutions where telemetry data is collected by agents running as DaemonSets on Kubernetes. Nowadays, almost all commercial offerings have a comprehensive suite of tools that combine into a seamless experience connecting logs to metrics to traces in a single user interface. The primary challenge with commercial tools is controlling cost. This usually comes down to the cardinality introduced by tags and metadata. In the context of Kubernetes, every Pod has tons of metadata related to not only Kubernetes state but the state of the associated tooling as well (e.g., annotations used by Helm or ArgoCD). This metadata then gets ingested as additional tags and data fields by the agents. Since commercial tools have to index all the data to make telemetry queryable and sortable, increased cardinality from additional dimensions (usually in the form of tags) causes issues with performance and storage.
This directly results in higher cost to the end user. Fortunately, most tools now allow the user to control which tags to index and even downsample data to avoid getting charged for repetitive data points that are not useful. Be aggressive with filters and pipeline logic to only index what is needed; otherwise, don't be surprised by the ballooning bill. Remembering the Developer Experience Regardless of the tool choice, one common pitfall that many teams face is over-optimizing for ops usage and neglecting the developer experience when it comes to observability. Despite the promise of DevOps, observability often falls under the realm of ops teams, whether that be platform, SRE, or DevOps engineering. This makes it easy for teams to build for what they know and what they need, over-indexing on infrastructure and not investing as much in application-level telemetry. This ends up alienating developers, who invest less time in observability or become too reliant on their ops counterparts for setup and debugging. To make observability truly useful for everyone involved, don't forget about these points: Access. It's usually more of a problem with open-source tools, but make sure access to logs, dashboards, and alerts is not gated by unnecessary approvals. Ideally, having quick links from existing mediums like IDEs or Slack can make tooling more accessible. Onboarding. It's rare for developers to go through the same level of onboarding in learning how to use any of these tools. Invest some time to get them up to speed. Standardization vs. flexibility. While a standard format like JSON is great for indexing, it may not be as human readable and is filled with extra information. Think of ways to present information in a usable format. At the end of the day, the goals of developers and ops teams should be aligned. We want tools that are easy to integrate, with minimal overhead, that produce intuitive dashboards and actionable, contextual information without too much noise. Even with the best tools, you still need to work with developers who are responsible for generating telemetry and also acting on it, so don't neglect the developer experience entirely. Final Thoughts Observability has been a hot topic in recent years due to several key factors, including the rise of complex, modern software coupled with DevOps and SRE practices to deal with that complexity. The community has moved past the simple notion of monitoring to defining the three pillars of observability as well as creating new frameworks to help with the generation, collection, and management of this telemetry data. Observability in a Kubernetes context has remained challenging so far given the large scope of things to "observe" as well as the complexity of each component. In the open-source ecosystem, we have seen significant fragmentation across specialized tools that are only now being integrated under a standard framework. On the commercial side, we have great support for Kubernetes, but cost control has been a huge issue. And to top it off, lost in all of this complexity is the developer experience of feeding data into, and using the insights from, the observability stack. But as the community has done before, tools and experience will continue to improve. We already see significant research and advances in how AI technology can improve observability tooling and experience. Not only do we see better data-driven decision making, but generative AI technology can also help surface information better in context to make tools more useful without too much overhead.
This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.
What Is Jenkins, and Why Does It Matter? In the world of software development, speed and efficiency are everything. That's where Jenkins, a popular open-source automation server, steps in. Jenkins plays a key role in streamlining workflows by automating the building, testing, and deployment of code — tasks that would otherwise take up countless developer hours. But why does Jenkins matter in the larger context of DevOps and CI/CD (Continuous Integration/Continuous Deployment)? Well, if you're part of a development team, you're likely familiar with these terms. DevOps aims to break down barriers between development and operations teams, enabling faster, more reliable software releases. CI/CD pipelines, in turn, automate the process of integrating new code and delivering updates to users, minimizing downtime and reducing errors. Jenkins, being one of the oldest and most widely adopted CI/CD tools, has been a cornerstone of this shift. It enables teams to automate everything from building the code to testing and deploying it, helping companies deliver updates more efficiently. However, as newer tools like GitHub Actions and CircleCI enter the scene, you might be wondering: is Jenkins still relevant in 2024? In this article, you’ll learn why Jenkins remains a critical tool in many enterprise environments and how it stacks up against newer alternatives. The Role of Jenkins in DevOps, Build, and Release Engineering Jenkins has a long and influential history in the world of software development. Originally developed as Hudson in 2004, Jenkins emerged as a leading open-source tool for automating parts of the software development lifecycle (SDLC), particularly within the DevOps ecosystem. DevOps practices focus on reducing the time between writing code and delivering it to production while ensuring high quality. Jenkins fits into this philosophy by enabling teams to automate repetitive tasks like code integration, testing, and deployment. Figure 1: Jenkins and its ecosystem across different stages of the SDLC One of Jenkins’ key roles is in the Continuous Integration (CI) process. CI is a development practice where developers frequently merge their code changes into a shared repository, often multiple times a day. Jenkins automates this process by fetching the latest code, compiling it, and running tests to ensure everything works before deploying the changes. This level of automation allows teams to catch issues early, avoiding painful, last-minute fixes. Jenkins’ importance extends to Continuous Deployment (CD) as well. Once a build has passed the necessary tests, Jenkins can automate the deployment of that code to various environments — whether it’s staging, production, or anywhere in between. This makes it a central tool for DevOps and build engineering, helping teams maintain a steady, efficient pipeline from development to production. By automating these crucial stages, Jenkins eliminates manual steps, increases efficiency, and ensures that code gets shipped faster and more reliably. Even as newer tools emerge, Jenkins’ ability to streamline workflows and its flexibility in handling large-scale projects has made it a staple in enterprise environments. Figure 2: Different stages of SDLC Jenkins Strengths: Enterprise Adoption and Plugin Ecosystem One of Jenkins' biggest strengths lies in its extensive plugin ecosystem. Jenkins offers over 1,800 plugins, allowing teams to customize and extend the functionality of the tool to suit their specific needs. 
This plugin architecture makes Jenkins incredibly flexible, particularly for large enterprises that require bespoke workflows and integrations across a variety of development environments, testing frameworks, and deployment pipelines. This flexibility is why Jenkins is widely adopted by enterprises. Its plugins enable teams to integrate with virtually any tool or service in the software development lifecycle, from source control systems like Git to cloud providers such as AWS and Google Cloud to notification services like Slack. Jenkins is designed to be adaptable, which is particularly valuable in complex projects where multiple tools need to work together seamlessly. Another key strength is Jenkins' scalability. Jenkins can handle thousands of jobs across distributed environments, making it a popular choice for large organizations with massive, concurrent build pipelines. Whether it’s managing a simple application or a sprawling microservices architecture, Jenkins' ability to scale ensures it can meet the demands of the most complex development operations. Jenkins' open-source nature also plays a major role in its popularity. It has a strong and active community that continuously contributes to the project, maintaining its relevance and expanding its capabilities over time. This community-driven approach means that when enterprises run into roadblocks, there’s usually already a plugin, guide, or support solution available. In short, Jenkins’ rich plugin ecosystem, scalability, and open-source backing make it a powerhouse for enterprises looking to automate their CI/CD processes in a highly customizable way. Jenkins Weaknesses: Stateful Architecture and Challenges with GitOps One of Jenkins’ most significant weaknesses is its reliance on a stateful architecture. Unlike modern CI/CD tools that are designed to be stateless, Jenkins stores its build information and job configurations on the file system, without a dedicated database. This lack of a centralized state management system can lead to issues, especially when scaling Jenkins across multiple environments or instances. The result is a fragile system that requires careful handling to avoid inconsistencies and failures in large-scale, distributed setups. Jenkins’ incompatibility with GitOps principles also limits its appeal in cloud-native and Kubernetes-focused environments. GitOps revolves around the idea of using Git as the single source of truth for infrastructure and application deployment. Modern CI/CD tools, such as Argo Workflows and Argo CD, are designed with GitOps in mind, offering seamless, declarative workflows that allow teams to manage infrastructure and applications using Git repositories. Jenkins, on the other hand, struggles to adapt to this approach due to its stateful nature and the complexity of configuring pipelines that align with GitOps principles. As the industry moves towards containerization and Kubernetes-native CI/CD pipelines, Jenkins’ architecture often proves to be a hurdle. While it can be made to work in Kubernetes environments, it’s far from ideal. Jenkins requires a complex web of plugins and manual configurations to support Kubernetes workflows, whereas tools like Argo and Tekton are built specifically for these environments, providing native support and a more intuitive user experience. 
Ultimately, Jenkins’ reliance on stateful architecture, difficulty scaling, and lack of GitOps-friendly workflows are key reasons why many teams have opted for more modern, Kubernetes-native alternatives like Argo Workflows and Argo CD. Comparison: Jenkins vs GitHub Actions vs CircleCI vs Argo CD As the landscape of CI/CD tools evolves, teams have more options than ever to build, test, and deploy their applications. Tools like GitHub Actions, CircleCI, and Argo CD have emerged as strong contenders in the modern, cloud-native development world. Let’s compare these tools to Jenkins to understand their strengths and weaknesses. Jenkins: Flexibility and Customisation, But High Complexity Jenkins has long been a go-to tool for enterprise-grade customization. Its extensive plugin ecosystem gives teams unparalleled flexibility to build highly tailored CI/CD pipelines. Jenkins excels in environments where deep integration with multiple systems and complex, distributed builds are required. However, Jenkins’ plugin complexity and maintenance burden often outweigh its benefits, especially in Kubernetes-native workflows. Each plugin adds layers of configuration and dependency management, making it hard to maintain over time. Additionally, Jenkins' stateful architecture makes it a less natural fit for cloud-native environments, where stateless and GitOps-based approaches are becoming the norm. GitHub Actions: Seamless GitHub Integration, Built for Simplicity GitHub Actions is a relatively new CI/CD tool designed with simplicity in mind, making it especially attractive to developers who are already using GitHub for version control. Its tight integration with GitHub makes setting up CI/CD pipelines straightforward, with workflows defined through YAML files stored in the same repositories as your code. This makes GitHub Actions easy to use for small-to-medium projects or teams that prefer a lightweight solution. GitHub Actions also natively supports containerized and Kubernetes workflows, making it a viable option for cloud-native teams. However, it lacks the deep customization and scalability that Jenkins offers, which can be a limitation for more complex, enterprise-grade projects. CircleCI: Simplicity With Strong Kubernetes Support CircleCI offers a cloud-native, container-centric approach to CI/CD that aligns well with modern development practices. Its interface is intuitive, and it supports parallel testing, automatic scaling, and strong Kubernetes integration out of the box. Teams using CircleCI benefit from faster setup times and a cleaner experience than Jenkins, especially for cloud-native or microservices-based architectures. CircleCI also offers built-in support for Docker and Kubernetes, which makes it easier to configure and deploy pipelines in cloud environments. However, CircleCI can become expensive as teams scale, and while it’s simpler to manage than Jenkins, it doesn’t offer the same degree of customization for large, highly complex workflows. Argo CD: GitOps-Native and Kubernetes-Centric Argo CD is a Kubernetes-native CI/CD tool built with GitOps principles in mind. It operates by using Git repositories as the source of truth for both infrastructure and application deployment. Argo CD makes managing deployments in Kubernetes clusters highly efficient, as the entire state of the application is version-controlled and automated using Git commits. For teams adopting Kubernetes and containerization as core elements of their infrastructure, Argo CD is one of the best tools available. 
It offers declarative, Git-driven workflows that simplify the process of deploying and scaling applications across cloud environments. Unlike Jenkins, which struggles with GitOps and Kubernetes integration, Argo CD is purpose-built for these use cases. However, Argo CD is more specialized — it focuses solely on deployment and doesn’t cover the entire CI/CD process, such as continuous integration (CI). Teams often pair Argo CD with other tools like Argo Workflows or CircleCI to handle CI tasks. While it excels in the Kubernetes space, it may not be the right choice for organizations with less emphasis on containerization. Key Takeaways Jenkins is most suitable for large enterprises that require deep customization and integration with legacy systems. However, its complexity and lack of native Kubernetes support are significant drawbacks.GitHub Actions is ideal for teams already embedded in GitHub, offering a simple, integrated solution for small-to-medium projects, with native Kubernetes support but limited scalability for complex workflows.CircleCI offers a cloud-native CI/CD solution focusing on containerization and Kubernetes scalability and ease of use, although with potentially higher costs as projects grow.Argo CD is the most Kubernetes-centric option, thriving in environments that follow GitOps principles. While it excels in Kubernetes-native deployments, it requires additional tools for a complete CI/CD pipeline. Why Jenkins Still Has a Place in 2024 Despite the rise of modern, cloud-native CI/CD tools like GitHub Actions and CircleCI, Jenkins remains a heavyweight in the continuous integration and delivery space. Holding an estimated 44%-46% of the global CI/CD market in 2023, Jenkins continues to be widely adopted, with more than 11 million developers and over 200,000 active installations across various industries (CD Foundation)(CloudBees). This widespread usage reflects Jenkins' strong position in enterprise environments, where its robust plugin ecosystem and extensive customization options continue to deliver value. One of Jenkins' major strengths is its extensibility. With over 1,800 plugins, Jenkins can integrate deeply with legacy systems, internal workflows, and various third-party tools, making it an essential part of many large-scale and complex projects (CloudBees). In industries where infrastructure and application delivery rely on specialized or customized workflows — such as finance, healthcare, and manufacturing — Jenkins' ability to adapt to unique requirements remains unmatched. This flexibility is a key reason why Jenkins is still preferred in enterprises that have heavily invested in their CI/CD pipelines. Moreover, Jenkins continues to see substantial growth in its usage. Between 2021 and 2023, Jenkins Pipeline usage increased by 79%, while overall job workloads grew by 45% (CD Foundation)(CloudBees). These numbers indicate that, even in the face of newer competition, Jenkins is being used more frequently to automate complex software delivery processes. Another factor contributing to Jenkins' staying power is its open-source nature and community support. With thousands of active contributors and corporate backing from major players like AWS, IBM, and CloudBees, Jenkins benefits from a large knowledge base and ongoing development (CD Foundation)(CloudBees). This ensures that Jenkins remains relevant and adaptable to emerging trends, even if its architecture is not as cloud-native as some of its newer competitors. 
While Jenkins may not be the go-to for modern Kubernetes or GitOps-focused workflows, it continues to play a critical role in on-premise and hybrid environments where companies require greater control, customization, and integration flexibility. Its deep entrenchment in enterprise systems and ongoing improvements ensure that Jenkins still has a crucial place in the CI/CD ecosystem in 2024 and beyond.
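To ground the "integration flexibility" point, here is a small, hedged sketch of driving Jenkins from the outside via its remote access API, the kind of glue scripting that keeps Jenkins embedded in larger on-premise and hybrid workflows. The server URL, job name, credentials, and parameter below are placeholders; with an API token, basic authentication is typically sufficient.

```python
# Hedged sketch: trigger a parameterized Jenkins job from a script using the
# remote access API. All identifiers below are placeholders.
import requests

JENKINS_URL = "https://jenkins.example.com"   # placeholder server
JOB_NAME = "deploy-service"                   # placeholder job

response = requests.post(
    f"{JENKINS_URL}/job/{JOB_NAME}/buildWithParameters",
    auth=("ci-bot", "jenkins-api-token"),     # user + API token (placeholders)
    params={"GIT_REF": "main"},               # hypothetical job parameter
    timeout=10,
)
response.raise_for_status()
# Jenkins responds with a Location header pointing at the queued build item.
print("Queued:", response.headers.get("Location"))
```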
My name is Siarhei Misko, an iOS Team Lead with over eight years of experience. Today, I want to walk you through the essential topic of observability in mobile applications, focusing on iOS. We are going to take a closer look at how we can implement observability more easily and why it is so important for improving app performance and providing a better user experience. Why Observability Matters The fundamental question is: how do we ensure our application is stable and working fine? Conventionally, the answer would point to testing. We run QA tests using various test cases and test environments, and on specific devices. The thing is, these tests are synthetic and mostly incapable of reflecting real-world scenarios. In the real world, applications run in much more complex environments, and this is where observability bridges the gap between test conditions and real-world functionality. Observability is not just another technical buzzword; the concept comes from engineering and systems analysis. In the parlance of software, and particularly mobile app development, observability describes a system's ability to report on its internal state and operations. In other words, the system is designed so that it can actually tell you what is going wrong. Observability isn't just about collecting traces, metrics, and logs. Those are the tools that enable observability, but aggregating all of those signals into one focal point is what true observability is. Teams can monitor trends through this aggregation, visualize them in dashboards, and even trigger alerts when incident thresholds are reached. This sort of observability enables people not only to respond quickly when an issue emerges but also to make predictions before things start to go wrong. Imagine a system that tells you exactly what went wrong and why. That's the power of observability. With great observability in place, teams can quickly identify the root causes of issues, minimize downtime, and, by doing so, improve the overall user experience. Observability in Practice: Learning from Failures Let's take a very real example: consider an iOS project with absolutely no observability features such as analytics, logging, or crash reporting. In this imaginary case, the team launches a new release, and voilà: users start reporting crashes — sometimes when opening particular screens, or sometimes totally at random. The team tries to reproduce the issue on their own, but it all works fine in their environment. With no clear data to work from, they're ultimately just guessing, hoping their fixes solve the problem. If this app had been using a crash reporting tool, it would have instantly captured crash data, like stack traces and device information, giving the team the information they need to work out how to solve the issue. Apple provides some crash-tracking functionality by default, but it is pretty basic. That's why many developers use third-party solutions, such as Firebase Crashlytics, which not only captures crashes but also helps identify their potential causes. Expanding Beyond Crashes: Performance and Feature Flags Crashes in and of themselves are not the only problem that can lead to dissatisfied users. In fact, many non-crash issues can be far worse. The more complex a system becomes, the more variables come into play: different devices, OS versions, and feature flags.
Think feature flags: a product manager mistypes or misconfigures something within their feature flags, and all of a sudden parts of that functionality aren't available. Maybe everything seems fine on the server side, and maybe crash reporting shows nothing, yet it still affects users. Observability enables engineering teams to keep track of how features actually run under real-world conditions, such as flag configurations, device compatibility, and in-app settings. Another challenge is testing: you can never simulate every possible usage scenario. That's where observability steps in: it helps you build metrics on how your app is doing, even under very specific conditions, such as on particular phone models, OS versions, or certain app settings. Key Metrics for iOS Observability Where would you start when it comes to implementing observability in iOS apps? Following are three key areas to focus on: Important user stories: Focus on critical flows such as registration, login, payment processing, and sending messages. This helps you observe not only whether these flows work but also how well they perform. Dynamic parameters: Many features operate based on dynamically changing parameters, such as country codes or feature flags. Observability helps you capture errors caused by misconfigured parameters. External dependencies: An application may depend on backend services or third-party APIs. If those dependencies fail, observability ensures you can detect the issue and take action before it affects users. Implementing Observability in iOS: Architectural Solutions When it comes to iOS observability, architectural considerations become important. A well-structured app makes observability tooling much easier to implement. You should also explore best practices from fields such as DevOps and backend development — quite often, these offer valuable insights for mobile apps. Almost all observability is based on logging. While metrics indicate the overall health of a system, logs capture an in-depth view of what happened inside the app. Logging comes with its own problems, chief among them overhead. Here are a few ways to manage logs on iOS: Live or online logging: In this mode, logs are sent right after they are recorded. Though this approach is common in backend systems, it tends to result in data overload for mobile apps. On-demand logging: Logs are sent either manually by the user or automatically due to certain triggers; examples include a flag set for a particular login or a push notification. This is useful for troubleshooting issues in a focused manner. Triggered logging: Logs are sent only after an event occurs, such as a crash or an error. Whichever approach you choose, there are ways to handle your logs properly. For instance, it is not necessary to record every log at all times; doing so will overload the device's storage, which you don't want. Use log file rotation to limit file size and apply log levels consistently. Make sure to include only the most critical logs in your release builds. Logging in iOS: Tools and Techniques Log collection in iOS can be done in a number of ways. Apple's OSLog is quite capable, but it comes with a set of limitations. For example, retrieving logs from extensions or from past sessions of the app is not possible.
While OSLog is well integrated with iOS, accessing the logs themselves is somewhat cumbersome, so it is not the best option for real-time analysis. Instead, consider logging to a text file or using third-party solutions like Firebase Crashlytics. These give you more flexibility, such as attaching extra data at the time of a crash, like user roles or subscription statuses. Firebase itself is a great starting point, but beware of some of its limitations, which include data sampling and the inability to run user-created queries. Optimizing Logs and Metrics for Performance Logs and metrics can be quite verbose, so some optimization is in order. One way to achieve this is to log only important events — not every event, but only key system states. Instead of trying to log all user interactions, like mouse movements and keystrokes, track high-level events in the app, such as screen transitions or API requests. This reduces the data to a more digestible volume and simplifies any analysis that needs to be conducted on it. In more complex cases, such as when processing Notification Service Extensions, observability can provide step-by-step process tracking. Notification extensions have a very tight memory limit of 24 MB, and exceeding this limit may kill the process. Observability enables you to trace every step of the process with confidence that no steps were skipped and no unnecessary overhead occurred. Leveraging Architectural Patterns for Observability Finally, let me address how architectural patterns can contribute to observability. The coordinator pattern turns out to be really good at knowing exactly how the user navigates through the app — it handles all the transitions between screens and thus gives a clear view of where the user has been and where the app currently is. You can monitor something like app start time — for instance, how long it takes for the app to transition from launch to some target state. Or you can observe user flow completion times, such as how long it takes for users to make a payment. If users suddenly take longer to complete a flow, observability allows you to find the problem much faster. You can also apply user behavior tracking metrics, such as Annoyed Tap — a metric that tracks the number of taps on a single UI element in a short period of time, which may indicate frustration or a malfunctioning feature. Conclusion By implementing observability in your iOS applications, you ensure stability, performance, and a much better user experience for your app. With a well-thought-out implementation of logs and metrics, alongside supportive architectural patterns, you get real-time insight into the behavior of your app and are able to detect and fix problems much faster. Keep in mind that observability is not a replacement for classic testing, but a strong boost that makes your app resilient under real-world conditions. Thanks for reading, and good luck implementing observability in your iOS projects!
Are you ready to start your journey on the road to collecting telemetry data from your applications? Great observability begins with great instrumentation! In this series, you'll explore how to adopt OpenTelemetry (OTel) and how to instrument an application to collect tracing telemetry. You'll learn how to leverage out-of-the-box automatic instrumentation tools and understand when it's necessary to explore more advanced manual instrumentation for your applications. By the end of this series, you'll have an understanding of how telemetry travels from your applications to the OpenTelemetry Collector, and be ready to bring OpenTelemetry to your future projects. Everything discussed here is supported by a hands-on, self-paced workshop authored by Paige Cruz. The previous article explored how developers are able to manually instrument their applications to generate specific metadata for our business to derive better insights faster. In this article, we'll look at the first part of how to link metrics to trace data using exemplars. It is assumed that you followed the previous articles in setting up both OpenTelemetry and the example Python application project, but if not, go back and see the previous articles as it's not covered here. Before we dive into linking our metrics to traces with exemplars, let's look at what exactly an exemplar is. Exemplar Basics As defined by Google Cloud Observability: "Exemplars are a way to associate arbitrary data with metric data. You can use them to attach non-metric data to measurements. One use of exemplars is to associate trace data with metric data." Remember, we use trace data to give a detailed view of a single request through our systems and metrics are used to provide an aggregated systems view. Exemplars are a way to combine the two, such that once you narrow your troubleshooting efforts to a single trace, you can then explore the associated metrics that the exemplar provides. This also works from metrics to traces, for example, allowing you to jump from, "Hey, what's that weird spike?" on a metric chart directly to a trace associated with that context in a single click. Our goal is tying metrics to traces via exemplars, so using the CNCF open-source projects, we'll instrument metrics with Prometheus, which has stable APIs and an extensive ecosystem of instrumented libraries. While OpenTelemetry also provides a metrics SDK, it's marked as mixed and under active development. The Python SDK does not yet support exemplars, so let's keep our traces instrumented with OpenTelemetry. The plan we will follow in Part One is to first configure a Prometheus instance to gather metrics from our example application, second to instrument our example application to collect metrics, and verifying that this is collecting metrics. In Part Two, we'll then implement an exemplar connecting these metrics to existing trace data we've been collecting from our example application, and finally, verify all this work in Jaeger tooling. Prometheus Configuration Within the example Python application we've previously installed, we find a configuration file for Prometheus in metrics/prometheus/prometheus.yml that looks like this: global: scrape_interval: 5s scrape_configs: - job_name: "prometheus" static_configs: - targets: ["localhost:9090"] - job_name: "hello-otel" static_configs: - targets: ["localhost:5000"] The default use of Prometheus is to monitor targets over HTTP and scrape their endpoints. 
Our initial configuration is scraping Prometheus itself as the target at the endpoint localhost:9090 and looking for hello-otel as a target at the endpoint localhost:5000. Using this configuration, we can now build and start our Prometheus instance with: $ podman build -t workshop-prometheus:v2.54.1 -f ./metrics/prometheus/Buildfile-prom $ podman run -p 9090:9090 workshop-prometheus:v2.54.1 --config.file=/etc/prometheus/prometheus.yml We can verify our targets are configured correctly by opening the Prometheus console Targets page at http://localhost:9090/targets, noting that Prometheus is scraping itself and waiting for the hello-otel application to come online: For now, stop the Prometheus instance by using CTRL-C, as we'll later be running the application and Prometheus instance together in a single pod. Collecting Metrics The next step is to start collecting metrics from our Python example application. To do this, we can use the prometheus-flask-exporter, which provides an easy getting-started experience. We start by opening the build file metrics/Buildfile-metrics and adding the command to install the prometheus_flask_exporter as shown here in bold: FROM python:3.12-bullseye WORKDIR /app COPY requirements.txt requirements.txt RUN pip install -r requirements.txt RUN pip install opentelemetry-api \ opentelemetry-sdk \ opentelemetry-exporter-otlp \ opentelemetry-instrumentation-flask \ opentelemetry-instrumentation-jinja2 \ opentelemetry-instrumentation-requests \ prometheus-flask-exporter COPY . . CMD [ "flask", "run", "--host=0.0.0.0"] Now, to start using it in our example application, we open the file metrics/app.py and add the prometheus-flask-exporter import, specifically PrometheusMetrics, as shown in bold below: ... from opentelemetry.trace import set_tracer_provider from opentelemetry.sdk.trace import TracerProvider from opentelemetry.sdk.trace.export import SimpleSpanProcessor from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter from opentelemetry.instrumentation.flask import FlaskInstrumentor from opentelemetry.instrumentation.jinja2 import Jinja2Instrumentor from opentelemetry.instrumentation.requests import RequestsInstrumentor from prometheus_flask_exporter import PrometheusMetrics ... Search further down this same file, metrics/app.py, and verify that the programmatic Flask metric instrumentation has been added as shown in bold here: ... provider = TracerProvider() processor = SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")) provider.add_span_processor(processor) set_tracer_provider(provider) app = Flask("hello-otel") FlaskInstrumentor().instrument_app(app) Jinja2Instrumentor().instrument() RequestsInstrumentor().instrument() metrics = PrometheusMetrics(app) ... Now that we have it all configured, let's try it out. Verifying Metrics to Traces We need to start verification by (re)building the application image as follows: $ podman build -t hello-otel:metrics -f metrics/Buildfile-metrics ...
Successfully tagged localhost/hello-otel:metrics 81039de9e73baf0c2ee04d75f7c4ed0361cd97cf927f46020e295a30ec34af8f Now, to run this, we use a pod configuration with our example application, a Prometheus instance, and the Jaeger tooling to visualize our telemetry data, as follows: $ podman play kube metrics/app_pod.yaml Once this has started, we can open a browser and make several requests to each of the pages listed below to generate metrics and tracing data: http://localhost:8001, http://localhost:8001/rolldice, and http://localhost:8001/doggo. I mentioned previously that Prometheus uses a pull model where a Prometheus server scrapes a /metrics endpoint. Let's check our example application's metrics endpoint to verify that the prometheus-flask-exporter is working by opening http://localhost:8001/metrics in a browser and confirming we're seeing metrics prefixed with flask_* and python_* as follows: Now, when we check whether the hello-otel application is available to be scraped by Prometheus by opening the Prometheus targets page at http://localhost:9090/targets, we see that both Prometheus and the example application are in the UP state (the application was DOWN previously): This concludes Part One. These examples use code from a Python application that you can explore in the provided hands-on workshop. What's Next? This article, part one of two, started the journey of linking metrics to our trace data with exemplars using Prometheus and OpenTelemetry with our example application. In our next article, part two, we'll finish the journey by implementing an exemplar to link these collected metrics to their relevant traces.
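As a rough preview of where Part Two is heading, and only as a generic sketch rather than the wiring the series will actually use, an exemplar pairs a metric observation with the trace that produced it. Assuming prometheus_client and an active OpenTelemetry span, that pairing can look something like this (the metric name is illustrative, and exemplars are only exposed when metrics are served in the OpenMetrics format):

```python
# Generic illustration of an exemplar: attach the current trace ID to a metric
# observation so a metric spike can be linked back to a concrete trace.
from opentelemetry import trace
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram("hello_otel_request_seconds", "Request latency in seconds")

def record_latency(seconds: float) -> None:
    ctx = trace.get_current_span().get_span_context()
    exemplar = None
    if ctx.is_valid:
        # 32-character hex trace ID; small enough for the exemplar label size limit
        exemplar = {"trace_id": format(ctx.trace_id, "032x")}
    REQUEST_LATENCY.observe(seconds, exemplar=exemplar)
```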