Part 1: The Challenge and the Workload

FinOps is an evolving practice for delivering maximum value from cloud investments. Most organizations adopting FinOps focus on highly tactical, highly visible activities: they analyze applications after deployment to understand and then optimize their cloud usage and cost. While this approach clearly demonstrates benefits, it falls short of FinOps' potential because workloads effectively have to be built multiple times. Shifting FinOps left means you build once; this not only reduces your cloud bill but also increases innovation within your business by freeing up your resources.

In this series, we walk through an example solution and show how to implement a shift-left approach to FinOps, demonstrating techniques to discover and validate cost optimizations throughout a typical cloud software development lifecycle.

Part 1: The Challenge and the Workload
Part 2: Creating and Implementing the Cost Model
Part 3: Cost Optimization Techniques for Infrastructure
Part 4: Cost Optimization Techniques for Applications
Part 5: Cost Optimization Techniques for Data
Part 6: Implementation/Case Study Results

The Challenge

In its current form, this evolving discipline has three iterative phases: Inform, Optimize, and Operate. The Inform phase provides the visibility needed to create shared accountability. The Optimize phase identifies efficiency opportunities and determines their value. The Operate phase defines and implements processes that achieve the goals of technology, finance, and business.

FinOps Phases

However, with modern cloud pricing calculators and workload planning tools, it is possible to get visibility into your complete cloud cost well before anything is built, without having to go through the development process. The cost of development, deployment, and operations can be determined from the architecture, services, and technical components.

The current architecture method involves understanding the scope and requirements. The personas involved and the functional requirements are captured as use cases. The non-functional requirements are captured as qualities (security, performance, scalability, and availability) and constraints. Based on the functional and non-functional requirements, a candidate architecture is proposed.

Existing architecture method and FinOps activities

As soon as a candidate architecture is proposed, we add a phase to build a FinOps model for it. In this step, we shift some of the FinOps activities left into the architecture phase itself. The candidate architecture is reviewed through the lens of FinOps for optimizations. This goes through iterations and refinement of the architecture to arrive at an optimal solution cost without compromising on any of the functional or non-functional aspects.

Shift-left FinOps model for creating a working architecture

Building a FinOps cost model is very similar to shifting security left in a DevOps pipeline by creating a threat model up front. Creating a FinOps model for the solution is an iterative process. It starts with establishing an initial baseline cost for the candidate architecture. The solution components are then reviewed for cost optimization. In some cases, teams may need to perform a proof of engineering to obtain cost estimates or projections.
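To make the idea of a baseline cost model concrete, here is a minimal sketch in Python. The component names and unit prices are illustrative assumptions, not actual cloud list prices; in practice you would pull rates from your provider's pricing calculator or API and iterate on the model during each FinOps review.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str             # e.g., "ingestion-queue" (illustrative names only)
    unit_cost_usd: float  # assumed monthly unit price, not a real list price
    units: float          # expected quantity (instances, GB, millions of requests)

def baseline_cost(components: list[Component]) -> float:
    """Sum the estimated monthly cost of a candidate architecture."""
    return sum(c.unit_cost_usd * c.units for c in components)

# Hypothetical candidate architecture for the example workload
candidate = [
    Component("ingestion-queue", unit_cost_usd=0.40, units=50),     # per million messages
    Component("processing-cluster", unit_cost_usd=70.0, units=6),   # per node-month
    Component("object-storage", unit_cost_usd=0.023, units=5000),   # per GB-month
    Component("reporting-db", unit_cost_usd=250.0, units=1),        # per instance-month
]

print(f"Baseline monthly estimate: ${baseline_cost(candidate):,.2f}")
# Each review iteration adjusts components or their attributes (instance sizes,
# storage tiers, data formats) and recomputes the baseline for comparison.
```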
The cost optimization techniques need to be applied at various levels, or layers, to arrive at a working architecture. They can be divided as follows:

Cost Optimization Techniques for Infrastructure
Cost Optimization Techniques for Applications
Cost Optimization Techniques for Data

The Workload

Workloads can be very different; however, when they are viewed as functional components, you can apply similar optimization approaches and techniques to maximize efficiency and value. Most workloads involve some form of data input, processing, and output, so the example we will use is a cloud-native application that ingests data, processes it to enrich and analyze it, and then outputs the data along with reports and insights for a user. We take a cloud-agnostic approach and break the workload and optimization techniques into the following components:

Infrastructure: The compute, storage, and networking, including the resources, services, and their associated attributes.
Application: The application design and architecture, covering how the application behaves and functions on the infrastructure.
Data: The data itself, along with how it is formatted and handled throughout the workload.

These methods and techniques for each component and layer are discussed in detail with the help of an example. The workload for this example is a cloud-native application that performs domain-specific data ingestion, processing, and analysis. The structured and semi-structured data is enriched and analyzed further to create reports and insights for the end user. The application is architected as a deployment that can leverage services in multiple clouds, for instance, AWS and GCP.

Candidate architecture for the representative cloud-native application

Conclusion

FinOps is a practice that gives the enterprise a better way to manage its cloud spending. Shifting FinOps left creates more opportunities to save costs earlier in the software development lifecycle. It involves introducing a few simple steps before the solution architecture is pushed into detailed design and implementation: creating a FinOps cost model and iterating through it to ensure that you have applied the cost optimization techniques at the infrastructure, application, and data layers and components. By shifting FinOps left, you can optimize your overall cloud expenses. In Part 2 of this series, we will create and implement the cost model.
In today's fast-evolving technology landscape, the integration of Artificial Intelligence (AI) into Internet of Things (IoT) systems has become increasingly prevalent. AI-enhanced IoT systems have the potential to revolutionize industries such as healthcare, manufacturing, and smart cities. However, deploying and maintaining these systems can be challenging due to the complexity of the AI models and the need for seamless updates and deployments. This article is tailored for software engineers and explores best practices for implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines for AI-enabled IoT systems, ensuring smooth and efficient operations.

Introduction to CI/CD in IoT Systems

CI/CD is a software development practice that emphasizes the automated building, testing, and deployment of code changes. While CI/CD has traditionally been associated with web and mobile applications, its principles can be applied effectively to AI-enabled IoT systems. These systems often consist of multiple components, including edge devices, cloud services, and AI models, making CI/CD essential for maintaining reliability and agility.

Challenges in AI-Enabled IoT Deployments

AI-enabled IoT systems face several unique challenges:

Resource Constraints: IoT edge devices often have limited computational resources, making it challenging to deploy resource-intensive AI models.
Data Management: IoT systems generate massive amounts of data, and managing this data efficiently is crucial for AI model training and deployment.
Model Updates: AI models require periodic updates to improve accuracy or adapt to changing conditions. Deploying these updates seamlessly to edge devices is challenging.
Latency Requirements: Some IoT applications demand low-latency processing, necessitating efficient model inference at the edge.

Best Practices for CI/CD in AI-Enabled IoT Systems

Version Control: Implement version control for all components of your IoT system, including AI models, firmware, and cloud services. Use tools like Git to track changes and collaborate effectively. Create separate repositories for each component, allowing for independent development and testing.
Automated Testing: Implement a comprehensive automated testing strategy that covers all aspects of your IoT system. This includes unit tests for firmware, integration tests for AI models, and end-to-end tests for the entire system. Automation ensures that regressions are caught early in the development process.
Containerization: Use containerization technologies like Docker to package AI models and application code. Containers provide a consistent environment for deployment across various edge devices and cloud services, simplifying the deployment process.
Orchestration: Leverage container orchestration tools like Kubernetes to manage the deployment and scaling of containers across edge devices and cloud infrastructure. Kubernetes ensures high availability and efficient resource utilization.
Continuous Integration for AI Models: Set up CI pipelines specifically for AI models. Automate model training, evaluation, and validation. This ensures that updated models are thoroughly tested before deployment, reducing the risk of model-related issues.
Edge Device Simulation: Simulate edge devices in your CI/CD environment to validate deployments at scale. This allows you to identify potential issues related to device heterogeneity and resource constraints early in the development cycle.
Edge Device Management: Implement device management solutions that facilitate over-the-air (OTA) updates. These solutions should enable remote deployment of firmware updates and AI model updates to edge devices securely and efficiently.
Monitoring and Telemetry: Incorporate comprehensive monitoring and telemetry into your IoT system. Use tools like Prometheus and Grafana to collect and visualize performance metrics from edge devices, AI models, and cloud services. This helps detect issues and optimize system performance.
Rollback Strategies: Prepare rollback strategies in case a deployment introduces critical issues. Automate the rollback process to quickly revert to a stable version in case of failures, minimizing downtime.
Security: Security is paramount in IoT systems. Implement security best practices, including encryption, authentication, and access control, at both the device and cloud levels. Regularly update and patch security vulnerabilities.

CI/CD Workflow for AI-Enabled IoT Systems

Let's illustrate a CI/CD workflow for AI-enabled IoT systems:

Version Control: Developers commit changes to their respective repositories for firmware, AI models, and cloud services.
Automated Testing: Automated tests are triggered upon code commits. Unit tests, integration tests, and end-to-end tests are executed to ensure code quality.
Containerization: AI models and firmware are containerized using Docker, ensuring consistency across edge devices.
Continuous Integration for AI Models: AI models undergo automated training and evaluation. Models that pass predefined criteria are considered for deployment (a minimal sketch of such a validation gate appears at the end of this article).
Device Simulation: Simulated edge devices are used to validate the deployment of containerized applications and AI models.
Orchestration: Kubernetes orchestrates the deployment of containers to edge devices and cloud infrastructure based on predefined scaling rules.
Monitoring and Telemetry: Performance metrics, logs, and telemetry data are continuously collected and analyzed to identify issues and optimize system performance.
Rollback: In case of deployment failures or issues, an automated rollback process is triggered to revert to the previous stable version.
Security: Security measures, such as encryption, authentication, and access control, are enforced throughout the system.

Case Study: Smart Surveillance System

Consider a smart surveillance system that uses AI-enabled cameras for real-time object detection in a smart city. Here's how CI/CD principles can be applied:

Version Control: Separate repositories for camera firmware, AI models, and cloud services enable independent development and versioning.
Automated Testing: Automated tests ensure that camera firmware, AI models, and cloud services are thoroughly tested before deployment.
Containerization: Docker containers package the camera firmware and AI models, allowing for consistent deployment across various camera models.
Continuous Integration for AI Models: CI pipelines automate AI model training and evaluation. Models meeting accuracy thresholds are considered for deployment.
Device Simulation: Simulated camera devices validate the deployment of containers and models at scale.
Orchestration: Kubernetes manages container deployment on cameras and cloud servers, ensuring high availability and efficient resource utilization.
Monitoring and Telemetry: Metrics on camera performance, model accuracy, and system health are continuously collected and analyzed.
Rollback: Automated rollback mechanisms quickly revert to the previous firmware and model versions in case of deployment issues.
Security: Strong encryption and authentication mechanisms protect camera data and communication with the cloud.

Conclusion

Implementing CI/CD pipelines for AI-enabled IoT systems is essential for ensuring the reliability, scalability, and agility of these complex systems. Software engineers must embrace version control, automated testing, containerization, and orchestration to streamline development and deployment processes. Continuous monitoring, rollback strategies, and robust security measures are critical for maintaining the integrity and security of AI-enabled IoT systems. By adopting these best practices, software engineers can confidently deliver AI-powered IoT solutions that drive innovation across various industries.
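As a concrete illustration of the model-validation gate described in the workflow above, here is a minimal, hedged sketch. The threshold value, the metrics file path, and the evaluate_model helper are hypothetical placeholders; a real pipeline would plug in its own evaluation harness and CI tooling, using the script's exit code to allow or block the deployment stage.

```python
import json
import sys

ACCURACY_THRESHOLD = 0.92  # assumed acceptance criterion, not a standard value

def evaluate_model(metrics_path: str) -> float:
    """Read an accuracy score produced by the training job (hypothetical file format)."""
    with open(metrics_path) as f:
        return float(json.load(f)["accuracy"])

def main() -> int:
    accuracy = evaluate_model("artifacts/metrics.json")  # placeholder path
    if accuracy < ACCURACY_THRESHOLD:
        print(f"Model rejected: accuracy {accuracy:.3f} below {ACCURACY_THRESHOLD}")
        return 1  # non-zero exit fails the CI stage, blocking deployment
    print(f"Model accepted: accuracy {accuracy:.3f}")
    return 0  # the pipeline proceeds to containerize and deploy the model

if __name__ == "__main__":
    sys.exit(main())
```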
IT teams have been observing applications for their health and performance since the beginning. They observe the telemetry data (logs, metrics, traces) emitted by an application or microservice using various observability tools and make informed decisions about scaling, maintaining, or troubleshooting applications in the production environment. If observability is nothing new and there is a plethora of monitoring and observability tools available in the market, why bother with OpenTelemetry? What makes it special, such that it is being so widely adopted? And most importantly, what is in it for developers, DevOps, and SRE folks? Well, let us find out.

What Is OpenTelemetry?

OpenTelemetry provides open-source standards and formats for collecting and exporting telemetry data from microservices for observability purposes. This standardized way of collecting data helps DevOps and SRE engineers use any compatible observability backend of their choice to observe services and infrastructure, without being locked into a vendor.

OpenTelemetry diagram for microservices deployed in a Kubernetes cluster

OpenTelemetry is both a set of standards and an open-source project that provides components, such as collectors and agents, for implementing them. Besides that, OpenTelemetry offers APIs, SDKs, and data specifications that application developers can use to standardize the instrumentation of their application code. (Instrumentation is the process of adding observability libraries/dependencies to application code so that it emits logs, traces, and metrics.)

Why Is OpenTelemetry Good News for DevOps and SREs?

The whole observability process starts with application developers. Typically, they instrument application code with the proprietary library or agent provided by the observability backend the IT team plans to use. For example, say the IT team wants to use Dynatrace as its observability tool. Application developers then use code/SDKs from Dynatrace to instrument (i.e., to generate and export telemetry data from) all the applications in the system. This fetches and feeds data in the format Dynatrace is compatible with.

But this is where the problem lies. The observability requirements of DevOps and SREs seldom stay the same. As their needs evolve, they will have to switch between observability vendors or may want to use more than one tool. But since all the applications are instrumented with the current vendor's proprietary code, switching becomes a nightmare:

The new vendor may prefer collecting telemetry data in a format (a tracing format, for example) that is not compatible with the existing vendor's. That means developers will have to rewrite the instrumentation code for all applications, with severe overhead in terms of cost, developer effort, and potential service disruptions, depending on the deployments and infrastructure.
Non-compatible formats also cause problems with historical data when switching vendors. That is, it becomes hard for DevOps and SREs to analyze performance before and after the migration.

This is where OpenTelemetry proves helpful, and this is the reason it is being widely adopted. OpenTelemetry prevents such vendor lock-in by standardizing how telemetry data is collected and exported. With OpenTelemetry, developers can send the data to one or more observability backends, open-source or proprietary, as it supports most of the leading observability tools.
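As a taste of what this looks like for developers, here is a minimal sketch of manual instrumentation using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages). The service name, function, and attribute are illustrative; the console exporter simply stands in for an OTLP exporter pointed at a collector or a vendor backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; in production the ConsoleSpanExporter would
# typically be swapped for an OTLP exporter that ships spans to a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def get_order(order_id: str) -> dict:
    # Each call produces a span with attributes that any compatible backend can query.
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)  # illustrative attribute
        return {"id": order_id, "status": "shipped"}

print(get_order("42"))
```

Because the instrumentation depends only on the OpenTelemetry API, swapping the backend later means changing the exporter configuration, not the application code.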
OpenTelemetry Components and Workflow

OpenTelemetry provides vendor-agnostic components that work together to fetch, process, and export telemetry data to various backends. There are three major components: the instrumentation library, the OpenTelemetry Collector, and exporters.

Instrumentation Library

OpenTelemetry provides SDKs and libraries for application developers to instrument their code manually or automatically. They support many popular programming languages, such as Java, Python, Ruby, Rust, JavaScript, and more. The instrumentation library is evolving, and developers should check the status of each telemetry data component in the instrumentation library for the programming language they use; the OpenTelemetry docs update this frequently. The status at the time of writing is shown below:

Status of programming language-specific telemetry data support in OpenTelemetry

For Kubernetes workloads, the OpenTelemetry Operator for Kubernetes can be used to inject auto-instrumentation libraries.

OpenTelemetry Collector (OTC)

The collector has receiver, processor, and exporter components, which gather, process, and export telemetry data from instrumented applications or infrastructure to observability backends for visualization (refer to the image below). It can receive and export data in various formats, such as its native format (the OpenTelemetry Protocol, or OTLP), Prometheus, Jaeger, and more.

OpenTelemetry Collector components and workflow

The OTC can be deployed as an agent, either as a sidecar container that runs alongside the application container or as a DaemonSet that runs on each node, and it can be scaled in or out depending on data throughput. The OpenTelemetry Collector is not mandatory, since OpenTelemetry is designed to be modular and flexible: IT teams can pick the receivers, processors, and exporters of their choice or even add custom ones.

Exporters

Exporters allow developers to configure any compatible backend they want to send the processed telemetry data to. Both open-source and vendor-specific exporters are available; some of them, such as the Apache SkyWalking, Prometheus, Datadog, and Dynatrace exporters, are part of the contrib projects. You can see the complete list of vendors who provide exporters here.

The Difference Between Trace Data Collected by OpenTelemetry and Istio

In a distributed system, tracing is the process of monitoring and recording the lifecycle of a request as it goes through different services in the system. It helps DevOps and SREs visualize the interactions between services and troubleshoot issues like latency. Istio is one of the most popular service meshes providing distributed tracing for observability purposes. In Istio, application containers are accompanied by sidecar containers, i.e., Envoy proxies. The proxy intercepts traffic between services and provides telemetry data for observability (refer to the image below).

Istio sidecar architecture and observability

Although both OpenTelemetry and Istio provide tracing data, there is a slight difference between them. Istio focuses on the lifecycle of a request as it traverses multiple services in the system (the networking layer), while OpenTelemetry, provided the application is instrumented with the OpenTelemetry library, focuses on the lifecycle of a request as it flows through an application (the application layer), interacting with various functions and modules. For example, let us say service A is talking to service B, and the communication has latency issues.
Istio can show you which service causes the latency and by how much. While this information is enough for DevOps and SREs, it will not help developers debug the part of the application that is causing the problem. This is where OpenTelemetry tracing helps. Since the application is instrumented with the OpenTelemetry library, OpenTelemetry tracing can provide details about the specific function of the application that is causing the latency. To put it another way, Istio gives traces from outside the application, while OpenTelemetry tracing provides traces from within the application. Istio tracing is good for troubleshooting problems at the networking layer, while OpenTelemetry tracing helps troubleshoot problems at the application level (see the sketch at the end of this article for an illustration).

OpenTelemetry for Microservices Observability and Vendor Neutrality

Enterprises adopting a microservices architecture have applications distributed across the cloud, with respective IT teams maintaining them. By instrumenting applications with OpenTelemetry libraries and SDKs, those IT teams are free to choose any compatible observability backend, and that choice does not affect the Ops/SRE teams' ability to have central visibility into all the services in the system. OpenTelemetry supports a variety of data formats and integrates seamlessly with most open-source and vendor-specific monitoring and observability tools, which also makes switching between vendors painless.

Get Started With OpenTelemetry for Istio Service Mesh

Watch the following video to learn how to get started with OpenTelemetry for Istio service mesh to achieve observability-in-depth. Additionally, you can go through the blog post, "Integrate Istio and Apache Skywalking for Kubernetes Observability," where the OpenTelemetry collector is used to scrape Prometheus endpoints.
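To illustrate the application-layer view described above, here is a minimal sketch, again assuming the opentelemetry-sdk Python package; the service, function names, and sleep are illustrative. The nested child span is what lets you see that a specific function inside service B, rather than service B as a whole, is where the time goes.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("service-b")  # illustrative service name

def enrich_order(order_id: str) -> None:
    # Child span: pinpoints this function if it is the source of latency.
    with tracer.start_as_current_span("enrich_order"):
        time.sleep(0.4)  # stand-in for a slow lookup

def handle_request(order_id: str) -> None:
    # Parent span: roughly the request-level view a mesh proxy sees from outside.
    with tracer.start_as_current_span("handle_request"):
        enrich_order(order_id)

handle_request("order-123")
```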
Tools and platforms form the backbone of seamless software delivery in the ever-evolving world of Continuous Integration and Continuous Deployment (CI/CD). For years, Jenkins has been the stalwart, powering countless deployment pipelines and standing as the go-to solution for many DevOps professionals. But as the tech landscape shifts toward cloud-native solutions, AWS CodePipeline emerges as a formidable contender. Offering deep integration with the expansive AWS ecosystem and the agility of a cloud-based platform, CodePipeline is redefining the standards of modern deployment processes. This article dives into the transformative power of AWS CodePipeline, exploring its advantages over Jenkins and showing why many are switching to this cloud-native tool.

Brief Background About CodePipeline and Jenkins

At its core, AWS CodePipeline is Amazon Web Services' cloud-native continuous integration and continuous delivery service, allowing users to automate the build, test, and deployment phases of their release process. Tailored to the vast AWS ecosystem, CodePipeline leverages other AWS services, making it a seamless choice for teams already integrated with AWS cloud infrastructure. It promises scalability, ease of maintenance, and enhanced security, characteristics inherent to many managed AWS services. On the other side of the spectrum is Jenkins, an open-source automation server with a storied history. Known for its flexibility, Jenkins has garnered immense popularity thanks to its extensive plugin system. It is a tool that has grown with the CI/CD movement, evolving from a humble continuous integration tool into a comprehensive automation platform that can handle everything from build to deployment and more. Together, these two tools represent two distinct eras and philosophies in the CI/CD domain.

Advantages of AWS CodePipeline Over Jenkins

1. Integration with AWS Services
AWS CodePipeline: Offers native, out-of-the-box integration with a plethora of AWS services, such as Lambda, EC2, S3, and CloudFormation. This facilitates smooth, cohesive workflows, especially for organizations already using AWS infrastructure.
Jenkins: While integration with cloud services is possible, it usually requires third-party plugins and additional setup, potentially introducing more points of failure or compatibility issues.

2. Scalability
AWS CodePipeline: As part of the AWS suite, it natively scales with the demands of the deployment pipeline. There is no need for manual intervention, ensuring consistent performance even during peak loads.
Jenkins: Scaling requires manual adjustments, such as adding agent nodes or reallocating resources, which can be both time-consuming and resource-intensive.

3. Maintenance
AWS CodePipeline: As a managed service, AWS handles all updates, patches, and backups. This ensures that the latest features and security patches are always in place without user intervention.
Jenkins: Requires periodic manual updates, backups, and patching. Additionally, plugins can introduce compatibility issues or security vulnerabilities, demanding regular monitoring and adjustments.

4. Security
AWS CodePipeline: Benefits from AWS's comprehensive security model. Features like IAM roles, secret management with AWS Secrets Manager, and fine-grained access controls ensure robust security standards.
Jenkins: Achieving a similar security level necessitates additional configurations, plugins, and tools, which can sometimes introduce more vulnerabilities or complexities.
5. Pricing and Long-Term Value
AWS CodePipeline: Operates on a pay-as-you-go model, ensuring you only pay for what you use. This can be cost-effective, especially for variable workloads.
Jenkins: While the software itself is open-source, maintaining a Jenkins infrastructure (servers, electricity, backups, etc.) incurs steady costs, which can add up in the long run, especially for larger setups.

When Might Jenkins Be a Better Choice?

Extensive Customization Needs: With its rich plugin ecosystem, Jenkins provides a wide variety of customization options. For unique CI/CD workflows or specialized integration needs, including integration with non-AWS services, Jenkins' vast array of plugins can be invaluable.
On-Premise Solutions: Organizations with stringent data residency or regulatory requirements might prefer on-premise solutions. Jenkins offers the flexibility to be hosted on local servers, providing complete control over data and processes.
Existing Infrastructure and Expertise: Organizations with an established Jenkins infrastructure and a team well-versed in its intricacies might find transitioning to another tool costly and time-consuming. The learning curve associated with a new platform and the migration effort can be daunting, so the team needs to weigh the transition against the other items on its roadmap.

Final Takeaways

In the ever-evolving world of CI/CD, selecting the right tool can be the difference between seamless deployments and daunting processes. Both AWS CodePipeline and Jenkins have carved out their specific roles in this space, yet as the industry shifts toward cloud-native solutions, AWS CodePipeline emerges at the forefront. With its seamless integration within the AWS ecosystem, innate scalability, and reduced maintenance overhead, it represents a future-facing approach to CI/CD. While Jenkins has served many organizations admirably and offers vast customization, the modern tech landscape is ushering in a preference for streamlined, cloud-centric solutions like AWS CodePipeline. The path from development to production is critical, and while the choice of tools will vary based on organizational needs, AWS CodePipeline's advantages are compelling for those looking toward a cloud-first future. As we navigate the challenges and opportunities of modern software delivery, AWS CodePipeline offers an efficient, scalable, and secure solution that is well worth considering.
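As a small illustration of the AWS-native integration discussed above, here is a hedged sketch that drives CodePipeline programmatically with boto3. The pipeline name is a placeholder, and it assumes AWS credentials and IAM permissions are already configured; the status values shown in the comment are a subset of those the API can return.

```python
import time

import boto3

PIPELINE_NAME = "my-app-pipeline"  # placeholder; substitute your pipeline's name

codepipeline = boto3.client("codepipeline")

# Trigger a new run of the pipeline (e.g., from a script or chat-ops command).
execution_id = codepipeline.start_pipeline_execution(name=PIPELINE_NAME)["pipelineExecutionId"]

# Poll until the run finishes; statuses include InProgress, Succeeded, and Failed.
while True:
    execution = codepipeline.get_pipeline_execution(
        pipelineName=PIPELINE_NAME, pipelineExecutionId=execution_id
    )["pipelineExecution"]
    if execution["status"] != "InProgress":
        print(f"Pipeline finished with status: {execution['status']}")
        break
    time.sleep(15)
```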
Delivering new features and updates to users without causing disruptions or downtime is a crucial challenge in the fast-paced world of software development, and rolling out changes always carries the risk of introducing bugs or causing downtime. The blue-green deployment strategy, popular in the DevOps movement, is an answer to this problem: by utilizing parallel environments and careful traffic routing, it enables uninterrupted software delivery with minimal disruption. In this article, we will explore the concept, principles, benefits, and best practices of blue-green deployment, shedding light on how it can empower organizations to release software with confidence.

Understanding Blue-Green Deployment

Blue-green deployment is a software deployment strategy for reducing risk and downtime when releasing new versions or updates of an application. It entails running two parallel instances of the same production environment: the "blue" environment represents the current stable version, while the "green" environment hosts the new version. With this configuration, you can switch between the two environments without disrupting end users.

The fundamental idea behind blue-green deployment is to keep user traffic routed to the blue environment, protecting the production system's stability and dependability, while the green environment is set up and thoroughly tested. Developers and QA teams can validate the new version before it is made available to end users. The deployment process typically involves the following steps:

Initial Deployment: The blue environment is the initial production environment running the stable version of the application. Users access the application through this environment, and it serves as the baseline for comparison with the updated version.
Update Deployment: The updated version of the application is deployed to the green environment, which mirrors the blue environment in terms of infrastructure, configuration, and data. The green environment remains isolated from user traffic initially.
Testing and Validation: The green environment is thoroughly tested to ensure that the updated version functions correctly and meets the desired quality standards. This includes running automated tests, performing integration tests, and potentially conducting user acceptance testing or canary releases.
Traffic Switching: Once the green environment passes all the necessary tests and validations, the traffic routing mechanism is adjusted to start directing user traffic from the blue environment to the green environment. This switch can be accomplished using techniques such as DNS changes, load balancer configuration updates, or reverse proxy settings.
Monitoring and Verification: Throughout the deployment process, both the blue and green environments are monitored to detect any issues or anomalies.
Monitoring tools and observability practices help identify performance problems, errors, or inconsistencies in real time, ensuring the health and stability of the application in the green environment.
Rollback and Cleanup: In the event of unexpected issues or unsatisfactory results, a rollback strategy can be employed to switch traffic back to the blue environment, reverting to the stable version. Additionally, any resources or changes made in the green environment during the deployment process may need to be cleaned up or reverted.

The advantages of blue-green deployment are numerous. By maintaining parallel environments, organizations can significantly reduce downtime during deployments. They can also mitigate risk by thoroughly testing the updated version before exposing it to users, allowing for quick rollbacks if issues arise. Blue-green deployment also supports scalability testing, continuous delivery practices, and experimentation with new features. Overall, it is a valuable approach for organizations seeking seamless software updates, minimal user disruption, and a reliable, efficient deployment process.

Benefits of Blue-Green Deployment

Blue-green deployment offers several significant benefits for organizations looking to deploy software updates with confidence and minimize the impact on users:

Minimized Downtime: Blue-green deployment significantly reduces downtime during the deployment process. By maintaining parallel environments, organizations can prepare and test the updated version (green environment) alongside the existing stable version (blue environment). Once the green environment is deemed stable and ready, the switch from blue to green can be accomplished seamlessly, resulting in minimal or no downtime for end users.
Rollback Capability: Blue-green deployment provides the ability to roll back quickly to the previous version (blue environment) if issues arise after the deployment. In the event of unforeseen problems or performance degradation in the green environment, organizations can redirect traffic back to the blue environment, ensuring a swift return to a stable state without impacting users.
Risk Mitigation: With blue-green deployment, organizations can mitigate the risk of introducing bugs, errors, or performance issues to end users. By maintaining two identical environments, the green environment can undergo thorough testing, validation, and user acceptance testing before live traffic is directed to it. This reduces the risk of impacting users with faulty or unstable software and increases overall confidence in the deployment process.
Scalability and Load Testing: Blue-green deployment facilitates load testing and scalability validation in the green environment without affecting production users. Organizations can simulate real-world traffic and user loads in the green environment to evaluate the performance, scalability, and capacity of the updated version. This helps identify potential bottlenecks or scalability issues before exposing them to the entire user base, ensuring a smoother user experience.
Continuous Delivery and Continuous Integration: Blue-green deployment aligns well with continuous delivery and continuous integration (CI/CD) practices. By automating the deployment pipeline and integrating it with version control and automated testing, organizations can achieve a seamless and streamlined delivery process.
CI/CD practices enable faster and more frequent releases, reducing time-to-market for new features and updates.
Flexibility for Testing and Experimentation: Blue-green deployment provides a controlled environment for testing and experimentation. Organizations can use the green environment to test new features, conduct A/B testing, or gather user feedback before fully rolling out changes. This allows for data-driven decision-making and the ability to iterate and improve software based on user input.
Improved Reliability and Fault Tolerance: By maintaining two separate environments, blue-green deployment enhances reliability and fault tolerance. If the infrastructure fails in one of the environments, the other environment can continue to handle user traffic seamlessly. This redundancy ensures that the overall system remains available and minimizes the impact of failures on users.

Implementing Blue-Green Deployment

To implement blue-green deployment successfully, organizations need to follow a series of steps and considerations: setting up parallel environments, managing infrastructure, automating deployment pipelines, and establishing efficient traffic routing mechanisms. Here is a step-by-step guide:

Duplicate Infrastructure: Duplicate the infrastructure required to support the application in both the blue and green environments. This includes servers, databases, storage, and any other components necessary for the application's functionality. Ensure that the environments are identical to minimize compatibility issues.
Automate Deployment: Implement automated deployment pipelines to ensure consistent and repeatable deployments. Automation tools such as Jenkins, Travis CI, or GitLab CI/CD can help automate the deployment process. Create a pipeline that includes steps for building, testing, and deploying the application to both the blue and green environments.
Version Control and Tagging: Adopt proper version control practices to manage different releases effectively. Use a version control system like Git to track changes and create clear tags or branches for each environment. This helps in identifying and managing the blue and green versions of the software.
Automated Testing: Implement comprehensive automated testing to validate the functionality and stability of the green environment before routing traffic to it. Include unit tests, integration tests, and end-to-end tests in your testing suite. Automated tests help catch issues early in the deployment process and provide a higher level of confidence in the green environment.
Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic between the blue and green environments. Popular options include DNS switching, reverse proxies, and load balancers. Configure the routing mechanism to gradually shift traffic from the blue environment to the green environment, allowing for a controlled transition.
Monitoring and Observability: Implement robust monitoring and observability practices to gain visibility into the performance and health of both environments. Monitor key metrics, logs, and user feedback to detect any anomalies or issues. Utilize monitoring tools like Prometheus, Grafana, or the ELK Stack to ensure real-time visibility into the system.
Incremental Rollout: Adopt an incremental rollout approach to minimize risks and ensure a smoother transition.
Gradually increase the percentage of traffic routed to the green environment while monitoring the impact and collecting feedback. This allows for early detection of issues and a quick response before the entire user base is affected.
Rollback Strategy: Have a well-defined rollback strategy in place to revert to the stable blue environment if issues arise in the green environment. This includes updating the traffic routing mechanism to redirect traffic back to the blue environment. Ensure that the rollback process is well documented and can be executed quickly to minimize downtime.
Continuous Improvement: Regularly review and improve your blue-green deployment process. Collect feedback from the deployment team, users, and stakeholders to identify areas for enhancement. Analyze metrics and data to optimize the deployment pipeline, automate more processes, and enhance the overall efficiency and reliability of the blue-green deployment strategy.

By following these implementation steps and considering key aspects such as infrastructure duplication, automation, version control, testing, traffic routing, monitoring, and continuous improvement, organizations can successfully implement blue-green deployment. This approach allows for seamless software updates, minimized downtime, and the ability to roll back if necessary, providing a robust and efficient deployment strategy.

Best Practices for Blue-Green Deployment

Blue-green deployment is a powerful strategy for seamless software delivery and for minimizing risks during the deployment process. To make the most of this approach, consider the following best practices:

Version Control and Tagging: Implement proper version control practices to manage different releases effectively. Clearly label and tag the blue and green environments to ensure easy identification and tracking of each version. This helps maintain a clear distinction between the stable and updated versions of the software.
Automated Deployment and Testing: Leverage automation for deployment pipelines to ensure consistent and repeatable deployments. Automation helps streamline the process and reduces the chance of human error. Implement automated testing at different levels, including unit tests, integration tests, and end-to-end tests, to verify the functionality and stability of the green environment before routing traffic to it.
Infrastructure Duplication: Duplicate the infrastructure and set up identical environments for blue and green. This includes replicating servers, databases, and any other dependencies required for the application. Keeping the environments as similar as possible ensures a smooth transition without compatibility issues.
Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic from the blue environment to the green environment seamlessly. Popular techniques include DNS switching, reverse proxies, and load balancers. Carefully configure and test these mechanisms to ensure they handle traffic routing accurately and efficiently (a sketch of one such mechanism appears at the end of this article).
Incremental Rollout: Consider adopting an incremental rollout approach rather than switching all traffic from blue to green at once. Gradually increase the percentage of traffic routed to the green environment while closely monitoring the impact. This allows for real-time feedback and a rapid response to any issues that may arise, minimizing the impact on users.
Canary Releases: Implement canary releases by deploying the new version to a subset of users or a specific geographic region before rolling it out to the entire user base. Canary releases allow you to collect valuable feedback and perform additional validation in a controlled environment. This approach helps mitigate risk and ensures a smoother transition to the updated version.
Rollback Strategy: Always have a well-defined rollback strategy in place. Despite thorough testing and validation, issues may still occur after the deployment. Having a rollback plan ready allows you to quickly revert to the stable blue environment if necessary, ensuring minimal disruption to users and maintaining continuity of service.
Monitoring and Observability: Implement comprehensive monitoring and observability practices to gain visibility into the performance and health of both the blue and green environments. Monitor key metrics, logs, and user feedback to identify any anomalies or issues. This allows for proactive detection and resolution of problems, enhancing the overall reliability of the deployment process.

By following these best practices, organizations can effectively leverage blue-green deployment to achieve rapid and reliable software delivery. The careful implementation of version control, automation, traffic routing, and monitoring ensures a seamless transition between versions while minimizing the impact on users and mitigating risk.

Conclusion

Blue-green deployment is a potent method for ensuring smooth and dependable releases. By maintaining two parallel environments and shifting user traffic gradually, organizations can minimize risk, cut down on downtime, and boost confidence in their new releases. The approach enables thorough testing, validation, and scalability evaluation, and it aligns naturally with continuous delivery principles and CI/CD practices. Its advantages include decreased downtime, rollback capability, risk reduction, scalability testing, flexibility for testing and experimentation, and increased reliability. With the appropriate infrastructure in place, automated deployment pipelines, effective traffic routing mechanisms, and the best practices discussed in this article, organizations can deliver software updates with confidence, reduce the risk of deployment-related disruptions, and offer their users a reliable, high-quality experience throughout the deployment process.
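To make the incremental traffic-shift idea concrete, here is a hedged sketch using boto3 and an AWS Application Load Balancer with weighted target groups. The listener and target group ARNs are placeholders, the stage percentages and pauses are illustrative, and other platforms (DNS weighting, service meshes, reverse proxies) achieve the same effect with different APIs.

```python
import time

import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example"      # placeholder
BLUE_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/blue/example"   # placeholder
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/green/example" # placeholder

def set_weights(green_pct: int) -> None:
    """Route green_pct% of traffic to the green environment, the rest to blue."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": BLUE_TG_ARN, "Weight": 100 - green_pct},
                    {"TargetGroupArn": GREEN_TG_ARN, "Weight": green_pct},
                ]
            },
        }],
    )

# Incremental rollout: shift traffic in stages, pausing to watch metrics.
for pct in (10, 25, 50, 100):
    set_weights(pct)
    print(f"{pct}% of traffic now routed to green")
    time.sleep(300)  # in practice, gate each step on monitoring and alerts

# Rollback is the same call in reverse: set_weights(0) sends all traffic back to blue.
```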
Have you ever wondered which cloud service provider can elevate your software product engineering to new heights with its DevOps offerings? If you haven't, get ready, because we're about to explore every nook and cranny of the two leading cloud service providers: Azure and AWS. But this is not just another tech comparison. We'll dive deep into how each DevOps service aligns with your team's skills, how it complements your existing infrastructure, and, most importantly, how it enhances your business strategy. So, if you're eager to discover how Azure DevOps and AWS DevOps can elevate your software product development, sit back, grab your favorite cup of coffee, and let's set out on this comparison.

Azure DevOps Services: An Overview

Azure DevOps is a comprehensive suite of cloud-based services provided by Microsoft. It provides a wide range of tools and services that support individuals and companies with everything from planning and coding to building, testing, and deploying software products. The core components of Azure DevOps are:

1. Azure Boards: Azure Boards enables you to track and manage work items, backlogs, and project progress. This gives the team visibility and lets everyone know how the project is moving along.
2. Azure Repos: A Git repository hosting service for version control of your software projects. It also offers a convenient way to collaborate with everyone working on the project.
3. Azure Pipelines: The pipeline is where the automation happens and from where the software product is deployed. Azure Pipelines is a continuous integration and continuous deployment (CI/CD) service that automates build and release pipelines.
4. Azure Test Plans: A testing service that enables Azure DevOps engineers to test software products either through automation or a manual approach. It also helps them manage and track tests in order to keep a clear record of errors and successful tests.
5. Azure Artifacts: A package management solution that enables developers to share their code efficiently and manage all their packages in one place. It also lets developers publish packages to their feeds and share them within the team, across organizations, and even publicly.

AWS DevOps Services: An Overview

AWS DevOps is an ecosystem of cloud-based services provided by Amazon Web Services (AWS). It offers various flexible services designed to let organizations streamline and automate software product development and delivery with agility and efficiency. AWS DevOps has similar features to Azure DevOps, but it distinguishes itself with its own set of unique capabilities. For instance, AWS DevOps integrates seamlessly with other AWS services, enabling DevOps engineers to set up infrastructure, monitor it, and efficiently manage features and services on the AWS cloud platform. The core components of AWS DevOps are:

1. AWS CodeCommit: Similar to Azure Repos, AWS CodeCommit is a managed, Git-based source control service for securely storing and versioning code.
2. AWS CodeBuild: A fully managed build service that compiles source code, runs tests, and produces software packages.
3. AWS CodePipeline: An automated CI/CD service that allows applications to be updated rapidly.
It integrates with other AWS services and gives organizations full flexibility to deliver software products end to end.
4. AWS CodeDeploy: A deployment service that automates application deployments to various environments.
5. AWS CodeStar: A unified development service to quickly develop, build, and deploy applications on AWS.

Azure DevOps vs. AWS DevOps – The Battle for DevOps Dominance!

Both Azure DevOps and AWS DevOps are powerful and widely used platforms for managing the software product engineering lifecycle and DevOps practices. Here is a detailed comparison highlighting some of the key differences between the two:

Popularity and Adoption
Azure DevOps: Widely used by organizations of all sizes and especially popular among enterprises invested in the Microsoft ecosystem.
AWS DevOps: Also widely adopted, especially by companies using Amazon Web Services as their cloud platform.

Ease of Use
Azure DevOps: Offers a user-friendly web interface with well-integrated tools, making it easy for teams familiar with Microsoft products to get started.
AWS DevOps: Has a learning curve, especially for teams new to AWS, but offers comprehensive documentation and resources to simplify adoption.

Scalability and Flexibility
Azure DevOps: Can scale up to 10,000 concurrent users and 2 GB of storage per user.
AWS DevOps: Can scale up to millions of users and offers unlimited storage.

Infrastructure
Azure DevOps: Azure, which initially started as a PaaS, streamlines the process for developers to build and scale applications effortlessly, eliminating concerns about the underlying infrastructure.
AWS DevOps: Offers services that IT operations can easily understand and use to support their on-demand computing and storage requirements.

Service Integration
Azure DevOps: Enables users to integrate Azure services like Azure App Service, SQL databases, and Azure VMs. It also streamlines the SDLC by integrating with Jenkins and other third-party tools.
AWS DevOps: Allows users to integrate various AWS services such as S3, EC2, and Elastic Beanstalk.

Managing Packages in Software
Azure DevOps: Azure has a package manager, Azure Artifacts, to manage packages like Maven and NuGet.
AWS DevOps: You need to integrate third-party tools like Artifactory to manage packages.

Version Control System
Azure DevOps: Git and Team Foundation Version Control (TFVC).
AWS DevOps: Git only.

CI/CD Pipeline Capabilities
Azure DevOps: Supports both build and release pipelines and also offers YAML-based pipeline configuration.
AWS DevOps: Provides CI/CD pipelines with customizable configurations using YAML or a visual designer.

Build Agents
Azure DevOps: Hosted and self-hosted agents are available for build and deployment tasks.
AWS DevOps: AWS CodeBuild uses managed build servers, and you can also bring your own custom build environment.

Deployment Capabilities
Azure DevOps: Deployment to Azure, on-premises infrastructure, and third-party cloud providers.
AWS DevOps: Deployment to AWS services and on-premises infrastructure.

Compliance
Azure DevOps: Complies with various industry standards, including SOC 1, SOC 2, ISO 27001, and HIPAA.
AWS DevOps: Also adheres to numerous security and compliance standards.

Security
Azure DevOps: Role-based access control, granular permissions, and secure pipelines.
AWS DevOps: AWS Identity and Access Management (IAM) for access control and secure resource management.

Best Features
Azure DevOps: Offers features like Kanban workflows, boards, and a massive extension ecosystem.
AWS DevOps: By utilizing AWS services, AWS DevOps can easily automate the deployment of all code.

Pricing
Azure DevOps: Offers various plans based on the number of users and build minutes.
AWS DevOps: Pay-as-you-go pricing based on the usage of AWS Developer Tools.
Hey everyone, this is by far my most-read article, and given the changing landscape of DevOps tools, I thought that nearly three years later, heading into 2023, it was worth a refresh! For those of you who read the original article, I have updated it with a few small changes to my categorization of the tools list since last time and added new key players to most of those categories (along with a few I missed the first time). There are so many test tools across all the various tech stacks and DevOps stacks that I can't possibly put them all in here, but this time I tried to add some tools used in different worlds (JavaScript, Kubernetes, Java, front end, etc.).

Before getting to the updated article, here is a summary of some key trends we have seen across our 30+ projects and from our friends at Rhythmic Technologies in the industry:

We have seen an explosion of tools around microservices, particularly in the Kubernetes space. Tools like Envoy provide a layer 7 proxy, and Istio and Linkerd provide a layer 7 service mesh. At this point I consider these more purely operational tools, so I have excluded them from my list, but this may change in the next update!
For well-run shops we have seen a lot of "less is more" lately: simpler pipelines, smaller test suites, fewer stages, fewer environments. A lot more Docker, too; not necessarily a unified approach to Docker, but more of it!
For mismanaged shops we have seen a significant growth in complexity, mostly around microservices; instead of DevOps being used to reduce and manage cloud spend, it seems to have the opposite effect in these shops.
Simplified CI/CD approaches are on the rise: more GitHub Actions and Bitbucket Pipelines, and DevOps/IaC code moving into the repositories it supports (yay!).
Code coverage for infrastructure as code has gone up dramatically in the last two years. Coverage of low-level infrastructure (VPCs, related networking, security config, IAM) remains low, but it has improved a bunch.

Last week a few of my very senior colleagues and I were remarking how many new DevOps tools are emerging and how it's getting harder and harder every day to keep track of them and where they fit into the world. I asked several of them where these tools (Ansible, Terraform, Salt, Chef, Bamboo, CloudFormation) fit in. Why would I use one vs. the other? Are they even the same thing? Am I missing a major player? I got back the same blank stares and questions that I had. So, I thought I would do some research, read, and try to make sense of it for all of us so we could classify products into categories or uses with which we are all familiar.

Before we start to talk about DevOps tools and categories, let's take a step back and discuss a few basic (but often overloaded) terms and what they mean.

Computer/Server: A physical device that features a Central Processing Unit (CPU), has memory (RAM) and local storage (disk), and runs an operating system.
Virtual Machine: An emulation of a computer system running on a host computer; it can typically be isolated from other operating systems in terms of CPU, memory, and disk usage.
Containers: A packaging of software and all its dependencies so that it can run uniformly and consistently on any infrastructure. Docker containers are the most popular. They allow you to package up a bunch of stuff (your software, configurations, and other software) for easy deployment and shipping. You can think of containers as the next evolution of virtualization (after virtual machines).
Network Device: A piece of hardware that routes network traffic between devices. Examples include routers, load balancers, and firewalls.

Software: Code that is written and runs on an operating system.

DevOps: Traditionally there was "development" (you build it) and there was "operations" (we will run it), and everything in between the two was subject to how a shop worked. Starting around 2010 and achieving near ubiquity around 2018, the DevOps idea is "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality."

When you think about what goes into building and running a non-trivial system, there is actually a lot involved. Here is a list of the traditional items to consider:

1. Obtaining the computer/server hardware
2. Configuring the computer/server hardware (operating systems, network wiring, etc.)
3. Monitoring the computer/server hardware
4. Obtaining the network devices (load balancers, firewalls, routers, etc.)
5. Configuring the network devices
6. Monitoring the network devices
7. Constructing the software
8. Building the software
9. Testing the software
10. Packaging the software
11. Deploying/releasing the software
12. Monitoring the software

Before DevOps, we used to have four distinct teams doing this work:

- Developer: performed #7, #8, and sometimes #10
- QA: performed #9 and sometimes #11
- System Administrator: performed #1, #2, #3, and #12
- Network Administrator: performed #4, #5, and #6

For the configuration of the hardware, network devices, and software, each team would likely use its own set of scripts and tools and, in many cases, would do things manually to make a "software release" happen. With the advent of DevOps, for me the key idea was breaking down these walls and making everyone part of one team, bringing consistency to the way all things are configured, deployed, and managed.

Cloud: Defining the most overloaded term in the history of information technology is tough, but I like the t-shirt that says, "There is no cloud, it's just someone else's computer." Initially, when cloud services started, they really were just someone else's computer (or a VM running on their computer), or storage. Over time, they have evolved into this plus many, many value-added services. The hardware has been mostly abstracted away; you can't buy a hardware device in most cloud services these days, but you can buy a service provided by the hardware devices.

Infrastructure as Code (IAC): A concept that allows us to define a complete setup of all the items in our data center, including VMs, containers, and network devices, through definition or configuration files. The idea is that I can create some configurations and some scripts, run them using one of the tools we are about to discuss, and they will automatically provision all of our services in the data center. CI/CD was a precursor to IAC: for years we have been automating our build/test/integration/deploy cycle, and doing the same for our cloud infrastructure was a natural extension. This brought cost reduction, faster time to market, and less risk of human error. With the advent of IAC, many traditional development tools could now be used for managing infrastructure. Categories of tools (listed below) like software repositories, build tools, CI/CD, code analyzers, and testing tools that were traditionally used by software developers could now be used by DevOps engineers to build and maintain infrastructure.
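To make the "definition or configuration files" idea concrete, here is a minimal sketch of an AWS CloudFormation template in YAML that declares a single S3 bucket; the logical resource name is chosen purely for illustration. Handing this file to CloudFormation provisions the bucket with no manual console steps, and running it again leaves the bucket unchanged.

YAML
# A minimal infrastructure-as-code example: one declarative file, one provisioned resource
AWSTemplateFormatVersion: '2010-09-09'
Description: Declares a single versioned S3 bucket
Resources:
  ExampleArtifactBucket:        # logical resource name, chosen for this illustration
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled

The same idea scales up to whole environments: networks, load balancers, clusters, and permissions all described in files that live in source control alongside the application code.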
So now that we have some basic vocabulary defined, let me get back to the task of categorizing DevOps tools in a way that makes it easier to determine what can be used for what.

Monitoring tools: These allow the monitoring of hardware and software. They typically include watchers that watch processes and log files to ensure the health of systems. There is a lot of activity around Grafana/Prometheus as a replacement for Nagios. SaaS monitoring platforms have also proliferated, even though many are based on Prometheus/ELK under the hood.

Containerization and orchestration tools: These configure, coordinate, and manage computer systems and software, and frequently include "automation" and "workflow" as part of their services. Kubernetes is a very popular orchestration tool focused on containers. Terraform is a very popular orchestration tool with a broader focus, including cloud orchestration. Each cloud provider also offers its own tooling, including CloudFormation, GCP Deployment Manager, and ARM templates.

Configuration management: Configuration management tools and databases typically store all the information about your hardware and software items and provide a scripting and/or templating system for automating common tasks. There are many players in this space; the traditional ones are Chef, Puppet, and Salt.

Deployment tools: These help with the deployment of software. Many CI tools are also CD (continuous deployment) tools that assist with deployment. Traditionally, Capistrano has been widely used in Ruby and Maven in Java. All of the orchestration tools also support some sort of deployment.

Continuous integration tools: These are configured so that each time you check code into a repository, the software is built, deployed, and tested. This usually improves quality and time to market. The most popular tools in this market are GitHub Actions, CircleCI, Bitbucket Pipelines, Jenkins, Travis, and TeamCity (a minimal workflow sketch follows below).

Testing tools: These are used to manage tests and test automation, including performance and load testing. Test tools specifically for JavaScript and Kubernetes have also exploded onto the market.

Code analyzer/review tools: These look for errors in code, code formatting and quality issues, and test coverage. They vary from language to language. SonarQube is a popular tool in this space, as are various "linting" tools.

Build tools: Some software requires compiling before it can be packaged or used; traditional build tools include Make, Ant, Maven, and MSBuild.

Version control systems: Tools to manage versions of software. Git is the most widely used tool today, and there are many cloud-hosted options for it; GitHub, GitLab, Bitbucket, and Azure DevOps dominate the market.

Of course, like any other set of products, the categories are not necessarily clean. Many tools cross categories and provide features from two or more of them. Below is my attempt to show most of the very popular tools and visualize where they fit in terms of these categories. As you can see, several players like Ansible, Terraform, and the cloud tools (AWS, GCP, and Azure) are trying to span the deployment, configuration management, and orchestration categories with their offerings. The older toolset of Puppet, Chef, and SaltStack is focused on configuration management and automation but has expanded into orchestration and deployment.
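To make the continuous integration category concrete, here is a minimal sketch of a GitHub Actions workflow that builds and tests a project on every push and pull request. The build and test commands are placeholders; a real pipeline would use whatever the tech stack calls for.

YAML
# .github/workflows/ci.yml (illustrative; commands are placeholders)
name: CI
on:
  push:
    branches: [main]
  pull_request:
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # fetch the repository
      - name: Build
        run: make build             # placeholder build step
      - name: Test
        run: make test              # placeholder test step

The same shape, a trigger plus a list of steps, carries over to Bitbucket Pipelines, CircleCI, and the other tools in this category.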
There are also tools like GitLab, GitHub, Bitbucket, and Azure DevOps that are trying to span nearly every category of DevOps. I hope this overview helps you understand the basics of DevOps, the categories of tools available, and how the various products on the market today help in one or more of these categories. At Solution Street we have used many of these tools over the years; for us there is no single "go-to" tool we use in all cases. What we use depends on the technologies involved, where the system is hosted (and where it may be hosted in the future), as well as the talents and makeup of the team.

Further Reading

- Why we use Terraform
- Best cloud infrastructure automation tools
- What is DevOps
In my previous posting, I explained how to run Ansible scripts using a Linux virtual machine on Windows Hyper-V. This article aims to ease novices into Ansible IAC by way of an example: booting one's own out-of-cloud Kubernetes cluster. As such, the intricacies of the steps required to boot a local K8s cluster are beyond the scope of this article. The steps can, however, be studied at the GitHub repo where the Ansible scripts are checked in. The scripts were tested on Ubuntu 20, running virtually on Windows Hyper-V. Network connectivity was established via an external virtual network switch on an ethernet adaptor shared between virtual machines but not with Windows. Dynamic memory was switched off from the Hyper-V UI. An SSH service daemon was pre-installed to give Ansible a tty terminal from which to run commands.

Bootstrapping the Ansible User

Repeatability through automation is a large part of DevOps; it cuts down on human error, after all. Ansible therefore requires a standard way to establish a terminal on the various machines under its control. This can be achieved using a public/private key pair for SSH authentication. The keys can be generated for an elliptic curve algorithm as follows:

ssh-keygen -f ansible -t ecdsa -b 521

The Ansible script to create an account and match it to the keys is:

YAML
---
- name: Bootstrap ansible
  hosts: all
  become: true
  tasks:
    - name: Add ansible user
      ansible.builtin.user:
        name: ansible
        shell: /bin/bash
      become: true

    - name: Add SSH key for ansible
      ansible.posix.authorized_key:
        user: ansible
        key: "{{ lookup('file', 'ansible.pub') }}"
        state: present
        exclusive: true # to allow revocation
        # Join the key options with comma (no space) to lock down the account:
        key_options: "{{ ','.join([ 'no-agent-forwarding', 'no-port-forwarding', 'no-user-rc', 'no-x11-forwarding' ]) }}" # noqa jinja[spacing]
      become: true

    - name: Configure sudoers
      community.general.sudoers:
        name: ansible
        user: ansible
        state: present
        commands: ALL
        nopassword: true
        runas: ALL # ansible user should be able to impersonate someone else
      become: true

Ansible is declarative, and this snippet depicts a series of tasks that ensure that:

- the ansible user exists;
- the keys are added for SSH authentication; and
- the ansible user can execute with elevated privilege using sudo.

Towards the top is something very important, and it might go unnoticed under a cursory gaze:

hosts: all

What does this mean? The answer to this puzzle can be easily explained with the help of the Ansible inventory file:

YAML
masters:
  hosts:
    host1:
      ansible_host: "192.168.68.116"
      ansible_connection: ssh
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
comasters:
  hosts:
    co-master_vivobook:
      ansible_connection: ssh
      ansible_host: "192.168.68.109"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
workers:
  hosts:
    client1:
      ansible_connection: ssh
      ansible_host: "192.168.68.115"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
    client2:
      ansible_connection: ssh
      ansible_host: "192.168.68.130"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible

It is the register of all machines the Ansible project is responsible for.
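Before running the bootstrap play against this inventory, it can help to confirm that Ansible can actually reach every machine listed in it. A minimal connectivity-check play, as a sketch (the file name is hypothetical), could look like this:

YAML
# ping_check.yml (hypothetical) - confirms SSH connectivity to every inventory host
---
- name: Verify that all inventory hosts are reachable
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping each host over the established SSH connection
      ansible.builtin.ping:

It would be run with something like ansible-playbook ping_check.yml -i atomika/atomika_inventory.yml --ask-pass, since the dedicated key pair is not installed at this point yet.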
Since our example project concerns a high-availability K8s cluster, the inventory consists of sections for the masters, co-masters, and workers. Each section can contain more than one machine. The root-enabled account atmin on display here was created by Ubuntu during installation.

The answer to the question should now be clear: the hosts key above specifies that every machine in the inventory will have an account called ansible created according to the specification of the YAML. The command to run the script is:

ansible-playbook --ask-pass bootstrap/bootstrap.yml -i atomika/atomika_inventory.yml -K

The locations of the user-bootstrapping YAML and the inventory file are specified. The command, furthermore, requests password authentication for the user from the inventory file. The -K switch, in turn, asks for the superuser password to be prompted; it is required by tasks that are specified to run as root and can be omitted should the script be run as root. Upon successful completion, one should be able to log in to the machines using the private key of the ansible user:

ssh ansible@172.28.110.233 -i ansible

Note that since this account is not for human use, the bash shell is not enabled. Nevertheless, one can access the home of root (/root) using 'sudo ls /root'. The user account can now be changed to ansible and the location of the private key added for each machine in the inventory file:

YAML
host1:
  ansible_host: "192.168.68.116"
  ansible_connection: ssh
  ansible_user: ansible
  ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
  ansible_ssh_private_key_file: ./bootstrap/ansible

One Master To Rule Them All

We are now ready to boot the K8s master:

ansible-playbook atomika/k8s_master_init.yml -i atomika/atomika_inventory.yml --extra-vars='kubectl_user=atmin' --extra-vars='control_plane_ep=192.168.68.119'

The content of atomika/k8s_master_init.yml is:

YAML
# k8s_master_init.yml
- hosts: masters
  become: yes
  become_method: sudo
  become_user: root
  gather_facts: yes
  connection: ssh

  roles:
    - atomika_base

  vars_prompt:
    - name: "control_plane_ep"
      prompt: "Enter the DNS name of the control plane load balancer?"
      private: no
    - name: "kubectl_user"
      prompt: "Enter the name of the existing user that will execute kubectl commands?"
      private: no

  tasks:
    - name: Initializing Kubernetes Cluster
      become: yes
      # command: kubeadm init --pod-network-cidr 10.244.0.0/16 --control-plane-endpoint "{{ ansible_eno1.ipv4.address }}:6443" --upload-certs
      command: kubeadm init --pod-network-cidr 10.244.0.0/16 --control-plane-endpoint "{{ control_plane_ep }}:6443" --upload-certs
      # command: kubeadm init --pod-network-cidr 10.244.0.0/16 --upload-certs
      run_once: true
      # delegate_to: "{{ k8s_master_ip }}"

    - pause: seconds=30

    - name: Create directory for kube config of {{ ansible_user }}.
      become: yes
      file:
        path: /home/{{ ansible_user }}/.kube
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: 0755

    - name: Copy /etc/kubernetes/admin.conf to user home directory /home/{{ ansible_user }}/.kube/config.
      copy:
        src: /etc/kubernetes/admin.conf
        dest: /home/{{ ansible_user }}/.kube/config
        remote_src: yes
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0640'

    - pause: seconds=30

    - name: Remove the cache directory.
      file:
        path: /home/{{ ansible_user }}/.kube/cache
        state: absent

    - name: Create directory for kube config of {{ kubectl_user }}.
      become: yes
      file:
        path: /home/{{ kubectl_user }}/.kube
        state: directory
        owner: "{{ kubectl_user }}"
        group: "{{ kubectl_user }}"
        mode: 0755

    - name: Copy /etc/kubernetes/admin.conf to user home directory /home/{{ kubectl_user }}/.kube/config.
      copy:
        src: /etc/kubernetes/admin.conf
        dest: /home/{{ kubectl_user }}/.kube/config
        remote_src: yes
        owner: "{{ kubectl_user }}"
        group: "{{ kubectl_user }}"
        mode: '0640'

    - pause: seconds=30

    - name: Remove the cache directory.
      file:
        path: /home/{{ kubectl_user }}/.kube/cache
        state: absent

    - name: Create Pod Network & RBAC.
      become_user: "{{ ansible_user }}"
      become_method: sudo
      become: yes
      command: "{{ item }}"
      with_items:
        - kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

    - pause: seconds=30

    - name: Configure kubectl command auto-completion for {{ ansible_user }}.
      lineinfile:
        dest: /home/{{ ansible_user }}/.bashrc
        line: 'source <(kubectl completion bash)'
        insertafter: EOF

    - name: Configure kubectl command auto-completion for {{ kubectl_user }}.
      lineinfile:
        dest: /home/{{ kubectl_user }}/.bashrc
        line: 'source <(kubectl completion bash)'
        insertafter: EOF
...

From the hosts keyword, one can see that these tasks are only enforced on the master node. However, two things are worth explaining.

The Way Ansible Roles

The first is the inclusion of the atomika_base role towards the top:

YAML
roles:
  - atomika_base

The official Ansible documentation states that "Roles let you automatically load related vars, files, tasks, handlers, and other Ansible artifacts based on a known file structure." The atomika_base role is included in all three of the Ansible YAML scripts that maintain the masters, co-masters, and workers of the cluster. Its purpose is to lay the base by making sure that tasks common to all three member types have been executed. As stated above, an Ansible role follows a specific directory structure that can contain file templates, tasks, and variable declarations, amongst other things. The Kubernetes and containerd versions are, for example, declared in the role's variables YAML:

YAML
k8s_version: 1.28.2-00
containerd_version: 1.6.24-1

In short, development can be fast-tracked through the use of roles that the Ansible community has developed and open-sourced at Ansible Galaxy.

Dealing the Difference

The second thing of interest is that although variables can be passed in from the command line using the --extra-vars switch, as seen higher up, Ansible can also be programmed to prompt when a value is not set:

YAML
vars_prompt:
  - name: "control_plane_ep"
    prompt: "Enter the DNS name of the control plane load balancer?"
    private: no
  - name: "kubectl_user"
    prompt: "Enter the name of the existing user that will execute kubectl commands?"
    private: no

Here, prompts are specified to ask for the user that should have kubectl access and the IP address of the control plane.
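For fully unattended runs where neither prompts nor long command lines are wanted, the same values could also be supplied from a small vars file. A sketch, with a hypothetical file name and the values taken from the command shown earlier:

YAML
# master_vars.yml (hypothetical) - pass it with:
#   ansible-playbook atomika/k8s_master_init.yml -i atomika/atomika_inventory.yml --extra-vars "@master_vars.yml"
control_plane_ep: "192.168.68.119"
kubectl_user: "atmin"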
Should the script execute without error, the state of the cluster should be:

atmin@kxsmaster2:~$ kubectl get pods -o wide -A
NAMESPACE      NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-flannel   kube-flannel-ds-mg8mr                1/1     Running   0          114s    192.168.68.111   kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-bkzgd             1/1     Running   0          3m31s   10.244.0.6       kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-vzkw2             1/1     Running   0          3m31s   10.244.0.7       kxsmaster2   <none>           <none>
kube-system    etcd-kxsmaster2                      1/1     Running   0          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-apiserver-kxsmaster2            1/1     Running   0          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-controller-manager-kxsmaster2   1/1     Running   7          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-proxy-69cqq                     1/1     Running   0          3m32s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-scheduler-kxsmaster2            1/1     Running   7          3m45s   192.168.68.111   kxsmaster2   <none>           <none>

All the pods required to make up the control plane run on the one master node. Should you wish to run a single-node cluster for development purposes, do not forget to remove the taint that prevents scheduling on the master node(s):

kubectl taint node --all node-role.kubernetes.io/control-plane:NoSchedule-

However, a cluster consisting of one machine is not a true cluster. This will be addressed next.

Kubelets of the Cluster, Unite!

Kubernetes, as an orchestration automaton, needs to be resilient by definition. Consequently, developers and a buggy CI/CD pipeline should not touch the master nodes by scheduling load on them. Kubernetes therefore increases resilience by expecting multiple worker nodes to join the cluster and carry the load:

ansible-playbook atomika/k8s_workers.yml -i atomika/atomika_inventory.yml

The content of k8s_workers.yml is:

YAML
# k8s_workers.yml
---
- hosts: workers, vmworkers
  remote_user: "{{ ansible_user }}"
  become: yes
  become_method: sudo
  gather_facts: yes
  connection: ssh

  roles:
    - atomika_base

- hosts: masters
  tasks:
    - name: Get the token for joining the nodes with the Kubernetes master.
      become_user: "{{ ansible_user }}"
      shell: kubeadm token create --print-join-command
      register: kubernetes_join_command

    - name: Generate the secret for joining the nodes with the Kubernetes master.
      become: yes
      shell: kubeadm init phase upload-certs --upload-certs
      register: kubernetes_join_secret

    - name: Copy join command to local file.
      become: false
      local_action: copy content="{{ kubernetes_join_command.stdout_lines[0] }} --certificate-key {{ kubernetes_join_secret.stdout_lines[2] }}" dest="/tmp/kubernetes_join_command" mode=0700

- hosts: workers, vmworkers
  # remote_user: k8s5gc
  # become: yes
  # become_method: sudo
  become_user: root
  gather_facts: yes
  connection: ssh

  tasks:
    - name: Copy join command to worker nodes.
      become: yes
      become_method: sudo
      become_user: root
      copy:
        src: /tmp/kubernetes_join_command
        dest: /tmp/kubernetes_join_command
        mode: 0700

    - name: Join the worker nodes with the master.
      become: yes
      become_method: sudo
      become_user: root
      command: sh /tmp/kubernetes_join_command
      register: joined_or_not

    - debug:
        msg: "{{ joined_or_not.stdout }}"
...

There are two blocks of tasks: one with tasks to be executed on the master and one with tasks for the workers. This ability of Ansible to direct blocks of tasks to different member types is vital for cluster formation. The first block extracts and augments the join command from the master, while the second block executes it on the worker nodes.
The top and bottom portions of the console output can be seen here:

janrb@dquick:~/atomika$ ansible-playbook atomika/k8s_workers.yml -i atomika/atomika_inventory.yml
[WARNING]: Could not match supplied host pattern, ignoring: vmworkers

PLAY [workers, vmworkers] *************************************************************************

TASK [Gathering Facts] ****************************************************************************
ok: [client1]
ok: [client2]
...........................................................................
TASK [debug] **************************************************************************************
ok: [client1] => {
    "msg": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Starting the kubelet\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n\nThis node has joined the cluster:\n* Certificate signing request was sent to apiserver and a response was received.\n* The Kubelet was informed of the new secure connection details.\n\nRun 'kubectl get nodes' on the control-plane to see this node join the cluster."
}
ok: [client2] => {
    "msg": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Starting the kubelet\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n\nThis node has joined the cluster:\n* Certificate signing request was sent to apiserver and a response was received.\n* The Kubelet was informed of the new secure connection details.\n\nRun 'kubectl get nodes' on the control-plane to see this node join the cluster."
}

PLAY RECAP ****************************************************************************************
client1 : ok=3  changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
client1 : ok=23 changed=6 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
client2 : ok=23 changed=6 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
host1   : ok=4  changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Four tasks were executed on the master node to determine the join command, while 23 tasks ran on each of the two clients to ensure they joined the cluster. The tasks from the atomika_base role account for most of the worker tasks.
The cluster now consists of the following nodes, with the master hosting the pods making up the control plane:

atmin@kxsmaster2:~$ kubectl get nodes -o wide
NAME         STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8xclient1   Ready    <none>          23m   v1.28.2   192.168.68.116   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24
kxsclient2   Ready    <none>          23m   v1.28.2   192.168.68.113   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24
kxsmaster2   Ready    control-plane   34m   v1.28.2   192.168.68.111   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24

With Nginx deployed, the following pods will be running on the various members of the cluster:

atmin@kxsmaster2:~$ kubectl get pods -A -o wide
NAMESPACE      NAME                                 READY   STATUS    RESTARTS        AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default        nginx-7854ff8877-g8lvh               1/1     Running   0               20s   10.244.1.2       kxsclient2   <none>           <none>
kube-flannel   kube-flannel-ds-4dgs5                1/1     Running   1 (8m58s ago)   26m   192.168.68.116   k8xclient1   <none>           <none>
kube-flannel   kube-flannel-ds-c7vlb                1/1     Running   1 (8m59s ago)   26m   192.168.68.113   kxsclient2   <none>           <none>
kube-flannel   kube-flannel-ds-qrwnk                1/1     Running   0               35m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-pqp2s             1/1     Running   0               37m   10.244.0.9       kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-rh577             1/1     Running   0               37m   10.244.0.8       kxsmaster2   <none>           <none>
kube-system    etcd-kxsmaster2                      1/1     Running   1               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-apiserver-kxsmaster2            1/1     Running   1               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-controller-manager-kxsmaster2   1/1     Running   8               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-proxy-bdzlv                     1/1     Running   1 (8m58s ago)   26m   192.168.68.116   k8xclient1   <none>           <none>
kube-system    kube-proxy-ln4fx                     1/1     Running   1 (8m59s ago)   26m   192.168.68.113   kxsclient2   <none>           <none>
kube-system    kube-proxy-ndj7w                     1/1     Running   0               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-scheduler-kxsmaster2            1/1     Running   8               37m   192.168.68.111   kxsmaster2   <none>           <none>

All that remains is to expose the Nginx pod to the outside world using a NodePort, a LoadBalancer, or an Ingress (a minimal sketch follows after the conclusion). Maybe more on that in another article...

Conclusion

This posting explained the basic concepts of Ansible using scripts that boot up a K8s cluster. The reader should now grasp enough concepts to understand tutorials and search-engine results, and to make a start at using Ansible to set up infrastructure using code.
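As a pointer for that final step, here is a minimal sketch of a NodePort Service that could expose the Nginx deployment above. The selector assumes the deployment carries the default app=nginx label; adjust it to match the actual labels in your cluster.

YAML
apiVersion: v1
kind: Service
metadata:
  name: nginx-nodeport          # hypothetical service name
spec:
  type: NodePort
  selector:
    app: nginx                  # assumed label on the Nginx pods
  ports:
    - port: 80                  # cluster-internal port of the service
      targetPort: 80            # container port served by Nginx
      nodePort: 30080           # any free port in the default 30000-32767 range

Once applied with kubectl apply -f, Nginx should answer on port 30080 of any node's IP address.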
If you follow the platform engineering trend, you'll have heard people talking about paved paths and golden paths. They're sometimes used as synonyms but can also reflect different approaches. In this article, I discuss the critical difference between paved paths and golden paths in platform engineering.

Paved Paths

If you were a city planner designing a park, you'd need to provide areas for people to stop and routes to pass through. The perfect park is a public space you can see, stroll through, and use for recreational activities. One of the tricky parts of park design is where to place the paths. People who want to take a leisurely walk prefer a winding scenic stroll with pleasant views. But people passing through from the coffee shop to their office prefer more direct routes. Most parks offer either winding trails or a series of direct paths forming a giant X crisscrossing the park.

As an alternative to planning the routes through the park, you can let people use it for a while. People crossing the park will wear tracks in the grass that indicate where paths may be most helpful. They literally vote with their feet. Building these paths after the demand for a route is visible means you're more likely to put them in the right place, though you can't please everyone. This approach is also dangerous, as Sam Walter Foss warned. His 1895 poem tells how a playful calf influences the design of a large city: the city's main road gets built around the trail the calf made through the woods some 300 years earlier.

Paved Paths in Software

You can use the paved path technique in software. You can observe how users currently achieve a goal and use what you find to generate a design for the software. Before people created software source control systems, the source wall was a common way to avoid change collisions. To create a source wall, you'd write each file name on a sticky note and add it to the wall. If you wanted to edit a file, you'd go to the source wall and find the file you wanted to change. If the sticky note was on the wall, you could take it back to your desk and make your edit. If you couldn't find the sticky note, you had to wait for its return before making your change. This manual process meant your changes would never clash, and you'd never overwrite another developer's changes.

The first source control system paved this path. You'd check out a file, and the system would prevent another developer from changing it until you checked it back in. This pattern was the paved path equivalent of the source wall. If you use a modern source control system, you'll notice it doesn't work this way. That's because something better has replaced the paved path: a golden path.

Golden Paths

Going back to the city park example, if you had a design in mind for the use of different spaces, you might want to tempt people to take a slightly longer route that lets you make better use of the overall space. Instead of optimizing the park for commuters, you want to balance the many different uses. In this case, you'd need to find ways to attract people to your preferred route to avoid them damaging the grass and planting. In Brisbane, the park in the South Bank area features just such a path. Instead of offering an efficient straight line between common destinations, it has sweeping curves along its entire length. The path has a decorative arbor that provides shelter from the hot sun and light showers.
Instead of attempting to block other routes with fences, people are attracted to the path because they can stay cool or dry. The Brisbane Grand Arbor walk is 150 meters longer than a straight-line route, but it creates spaces for restaurants, a pond, a rainforest walk, and a lagoon. Golden paths are a system-level design technique. They're informed by a deep understanding of the different purposes of the space.

Golden Paths in Platform Engineering

In platform engineering, golden paths are just like Brisbane's Grand Arbor. Instead of forcing developers to do things a certain way, you design the internal developer platform to attract developers by reducing their burden and removing pain points. It's the optimal space between anything goes and forced standardization. Golden paths provide a route toward alignment. Say you have 5 teams, all using different continuous integration tools. As a platform engineer, you'd work out the best way to build, test, and package all the software and provide this as a golden path. It needs to be better than what developers currently do and easy to adopt, as you can't force it on a team. The teams that adopt the golden path have an easy life as far as their continuous integration activities are concerned. Nothing makes a platform more attractive than seeing happy users.

When done well, an internal developer platform may feel like a paved path to the developers, but it should reduce the overall cognitive load. This often involves both consolidation and standardization. You won't solve all developer pain at once. Platform engineers will need to go and see what pain exists and think about how they might design a product that will remove it. When you start this journey, it's worth understanding the patterns and anti-patterns of platform engineering.

Take the High Road

World champion weightlifter Jerzy Gregorek once said: "Hard choices, easy life. Easy choices, hard life." You need to make many hard choices to create a great internal developer platform. You have to decide which problems the platform will solve and which it won't. You need to determine when a feature should flex to meet the needs of a development team and when you should let them strike out on their own path. These hard choices are the difference between a golden path and a paved path. With a paved path, you can reduce the burden on developers, but the pain just moves into your platform team. A golden path reduces the total cognitive load for everyone by dedicating the platform team to eliminating it.

Happy deployments!
The rise of Kubernetes, cloud native, and microservices spawned major changes in the architectures and abstractions that developers use to create modern applications. In this multi-part series, I talk with some of the leading experts across various layers of the stack, from networking infrastructure to application infrastructure and middleware to telemetry data and modern observability concerns, to understand the emergent platform engineering patterns that are affecting developer workflow around cloud native. The first participant in our series is Thomas Graf, CTO and co-founder of Isovalent and the creator of Cilium, an open source, cloud-native solution for providing, securing, and observing network connectivity between workloads, fueled by the revolutionary kernel technology eBPF.

Q: We are nearly a decade into containers and Kubernetes (K8s was first released in September 2014). How would you characterize how things look different today than ten years ago, especially in terms of the old world of systems engineers and network administrators and a big dividing line between these operations concerns and the developers on the other side of the wall? What do you think are the big changes that DevOps and the evolution of platform engineering and site reliability engineering have ushered in, especially from the networking perspective?

A: Platform engineering has brought traditional systems engineers and network administrators a lot closer to developers. The rise of containers has simplified deployment not only for developers but also for platform engineering teams. Instead of serving machines, we are finally hosting applications. Unlike serverless, Kubernetes preserved some of the existing infrastructure abstractions, thus offering a more approachable evolutionary step. This allowed systems engineers and network administrators to step up and evolve into platform engineering, and with them, they brought decades of experience in how to operate enterprise infrastructure.

At the same time, platform engineering has brought a radical modernization of the networking layer. Application teams look at the network like they look at the internet: a giant, untrusted connectivity plane connecting everyone and everything. This requires platform engineering to rethink network security and bring in micro-segmentation, zero-trust security concepts, and mutual authentication. At the same time, this new, exciting world has to be connected to the world of existing infrastructure, which requires mapping the world of identity-based network security to the old world of virtual networks, MPLS, and ACLs.

Q: The popularity of the major cloud service platforms and all of the thrust behind the last ten years of SaaS applications and cloud native created a ton of new abstractions for the level at which developers are able to interact with underlying cloud and network infrastructure. How has this trend of raising the abstraction for interacting with infrastructure affected developers, specifically?

A: The cost of developing an initial MVP for a new application has decreased enormously. A small team of developers can take an application to early product maturity within weeks or months. This is achieved with automation on the cloud infrastructure side, managed databases and cloud services, and the composability of microservices.
The cost of this shortcut is typically paid later in the form of the cloud bill, challenges in portability across cloud providers, and the inevitable consequence of having to develop a proper multi-cloud strategy as different application teams start developing on different cloud platforms. Developers are rightfully not concerned about infrastructure abstraction and infrastructure security early on; time to market is everything. Platform engineering teams then typically come in and help port the application to a Kubernetes platform to start standardizing it to corporate needs by elevating security and monitoring standards, decoupling dependencies, and preparing the new application for scale.

Q: What are the areas where it makes sense for developers to have to really think about underlying systems versus the ones where having a high degree of instrumentation or customization ("shift left") is going to be very important?

A: Every couple of years, there is a new term for what is a pretty logical software development practice: test early. Iterative development, agile development, test-driven development, and now shift-left. The combination of system-aware development and abstraction-based development has always been crucial, but equally important is to shift concern for resilience and supportability to the left. Looking back at the Apollo mission, all these concepts played a significant role. A lot of the software was obviously written in a system-specific language to be as efficient as possible. The navigation business logic, however, used an abstracted language in order to be able to perform complex vector computation. Last but not least, it was the concept of resilience that allowed the lander to overcome a faulty sensor that overloaded the computer, which would have prevented all required system components from grabbing enough CPU time.

Q: Despite the obvious immense popularity of cloud native, much of the world's infrastructure (especially in highly regulated industries) is still running in on-prem data centers. What does the future hold for all this legacy infrastructure and the millions of servers humming along in data centers? What are the implications for managing this mixed cloud-native infrastructure together with legacy data centers over time?

A: As with any other transformation, cloud native will take much longer than anticipated, but the benefits are so fundamental that anybody not undergoing the transformation is at risk of being disrupted from a technology perspective. The worlds of cloud native and data centers not only have to get to know each other but have to move in together for the foreseeable future. The typical enterprise-grade data center requirements have already come to cloud native. What we are now seeing is that some of the cloud-native concepts, such as further automation, better declarative approaches, and cleaner abstractions, are flowing from cloud native back into the world of data centers. Cloud-native solutions will have to learn to live in data centers, running as appliances, and typical data center requirements will have to be met in the world of the cloud in order for the two worlds to be able to talk to each other.

Q: What do you think are some of the modern checklist items that developers care most about in terms of their workflow and how platform engineering makes their lives more productive?
Broadly speaking, what are the conditions that are most desirable versus least desirable in terms of the build environment and toolchains that modern developers care about?

A: The checklist has changed quite a bit over the last few years. In the beginning, the checklist was all about getting one of each to try and form the best possible stack out of hundreds of possible options. As this usual phase of an early technology stack matures, the checklist has changed to limit the number of moving pieces and instead focus on core values such as developer efficiency, operational complexity, security risks, and total cost. This has led to a shift to managed offerings, wider platforms covering more aspects, and a focus on day-2 operational aspects and long-term cost over exclusively building the best possible platform. The least desirable outcome in building a platform for developers is to ramp it all up but fail to make it sustainable operationally and economically.

Q: What is the significance of eBPF in this overall context of the evolution of platform engineering and SRE patterns?

A: eBPF has become the amazing hidden helper below deck, making everything better and faster. Its magical value comes from its programmability. The operating system has become a little bit like hardware, which is really hard to change. Making the operating system agile and programmable again allows software infrastructure to keep up with the changing demands of platform engineering technology like Kubernetes. Fundamentally, eBPF is not only able to solve problems really well; even more importantly, it is a tremendous time-to-market hack for infrastructure and security software, and platform engineering is driven by continued innovation as platforms are still being built up and requirements keep piling up.

Q: What does Cilium give platform teams beyond the built-in capabilities of eBPF? What is the relationship between the two technologies, and what should platform engineering teams be doing with Cilium today (if they are not already)?

A: eBPF is fundamentally designed for kernel developers, and early adopters of eBPF were typically companies running their own kernel teams. Think of eBPF as you think of FPGAs or GPUs in the context of AI. As an enterprise, you can't go out there and buy a bunch of FPGAs or GPUs, stuff them into your data center, and then simply benefit from it. You will need people to build something with it. Cilium takes eBPF and utilizes it to implement the core networking, security, and observability needs of platform engineering teams. It does so without requiring platform engineering teams to learn how to build it themselves with eBPF. Kubernetes has created a whole new set of challenges in how to connect and secure workloads inside and outside of Kubernetes. eBPF is incredible at solving the problems of the new world of Kubernetes and is equally capable of translating that new world to the old legacy.

Q: How would you describe the overall evolution of networking, from the old scale-up days to distributed computing and commodity hardware, to virtual machines, then SDNs, and now where we are today? What do you think are the coolest trends in network infrastructure to watch today?

A: Networking has evolved over the years along with the needs of applications. What is interesting is that we are probably in the middle of one of the most significant shifts in networking but haven't fully realized it yet. Looking back, networking was all about connecting machines physically.
With Google and distributed computing, it became obvious that virtualization would play a massive role. As a consequence, software-defined networking was the networking shift that came along with it. But it still connected machines, and it inherited the vast majority of its building blocks from the physical networking world. Cloud networking took that network virtualization technology and added APIs and automation in front of it. With the rise of containers and Kubernetes, networking is now changing fundamentally, as we no longer connect machines; we connect applications. A modern cloud-native networking layer looks more like a messaging bus than a network to developers, but without requiring your applications to change in any way and while continuing to meet the strong security and performance requirements of a typical enterprise network. The cloud-native shift in networking will not just impact the Kubernetes networking layer. It will touch all aspects of connectivity, from L3/L4 north-south load balancers, network firewalls, and VPNs all the way up to L7 WAFs and L7 east-west load balancers.
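To ground the idea of connecting applications rather than machines, here is a minimal sketch of a label-based CiliumNetworkPolicy. The namespace, labels, and port are assumptions chosen for illustration: the policy selects workloads by identity (their labels) rather than by IP address or machine.

YAML
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api   # hypothetical policy name
  namespace: demo               # assumed namespace
spec:
  endpointSelector:
    matchLabels:
      app: api                  # the policy applies to pods labeled app=api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend       # only pods labeled app=frontend may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP

Because the rule is expressed in terms of workload identity, it keeps working as pods are rescheduled and their IP addresses change.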