Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!
It's super important today to keep things secure and make sure everything is running as it should. AWS Identity and Access Management (IAM) helps with this by letting you manage who can get into what parts of your AWS account. One cool thing about IAM is that it lets you give permissions to different parts or people in your account without having to share sensitive info like passwords. Today, I'm going to talk about using Terraform, a tool that lets you set up infrastructure through code, to create and set up these IAM roles easily. Understanding AWS IAM Roles and Terraform Before we get into how to use Terraform for setting up IAM roles in AWS, it's key to grasp what AWS IAM roles and Terraform are all about. In your AWS account, you can create IAM roles, which are basically identities with certain permissions attached. These roles let you give specific rights to different parts of your AWS setup without any hassle. On the flip side, Terraform is a tool that lets you manage your infrastructure through code instead of doing everything manually. It works smoothly with services such as those offered by AWS, thanks to the Terraform AWS provider. The Basics of AWS IAM Roles IAM roles in AWS are key to controlling who gets to do what with AWS resources. When you set up an IAM role, you decide on the permissions and rules that outline which actions can or cannot be taken. These guidelines might come from two places: AWS-managed policies, which are ready-made sets of rules provided by AWS, or customer-managed policies that you create yourself based on your needs. On top of these, there's also the option to add inline policies right onto IAM roles for more specific control. Mostly, IAM roles let certain services, apps, or even other AWS accounts borrow permissions temporarily by assuming a role. Introduction to Terraform for AWS Terraform is a tool that lets you set up your infrastructure through code, making it easier to manage and provision resources just by describing what you want. With Terraform, all you need to do is tell it how you'd like your setup to look and let it take care of setting everything up for you. When working with AWS services, the Terraform AWS provider comes into play. This part of Terraform is made just for dealing with AWS resources, including IAM roles. It gives you various resources and data sources so that managing IAM roles becomes straightforward using Terraform's coding approach. By combining the powers of both Terraform and the AWS provider, handling IAM roles becomes not only simpler but also something that can be done consistently as your needs grow or change. Setting up Your Environment for Terraform Before diving into using Terraform to set up IAM roles for your AWS account, there's some groundwork you need to do. First off, getting the Terraform CLI on your computer is a must. It doesn't matter what kind of computer you're using because there's a version of the CLI that'll work for it. On top of that, make sure you've got the AWS Command Line Interface (CLI) ready and loaded with your AWS credentials — we're talking about your access key and secret access key here. With these steps out of the way, Terraform can smoothly talk to your AWS account and get everything set up just right. Installing Terraform To get Terraform set up, head over to the official Terraform website and grab the newest version they've got. They have versions ready for different types of computers like Windows, macOS, and Linux.
After picking the right one for your computer, just follow what they say on how to put it in. When you're done installing it, you can make sure everything's working fine by typing terraform --version into where you type commands on your computer. This will show you which version of Terraform is now running on your machine. Also, if you want to check that your setup with Terraform is good to go or see what changes will happen before actually making them, use the terraform plan command. It helps validate your configuration and gives a sneak peek at what applying those settings will do. Configuring AWS CLI and Terraform To set up the AWS CLI and Terraform, you need to enter your AWS access key and secret access key. These keys let both tools connect with your AWS account so they can do tasks for you. By typing aws configure, you'll be asked to input your access and secret keys along with choosing a default region. The AWS provider in Terraform will then use these details automatically when working with AWS services. You have other choices too, like setting these credentials through environment variables or by using an AWS credentials file. Making sure everything is set up right means you'll be able to make IAM roles without any trouble, thanks to having the correct permissions from your AWS account and Terraform AWS provider setup. Creating AWS IAM Roles With Terraform With your setup ready, you can begin to craft AWS IAM roles with Terraform. This step includes laying out your Terraform configuration and penning down the code that outlines the role name, policy attachments, and other important settings. Through a variety of resources and data sources provided by Terraform, it's possible to define IAM roles and handle their configurations efficiently. By using Terraform for this task, creating IAM roles becomes a process you can replicate easily and scale up as needed while keeping everything consistent across your infrastructure setup. Defining Your Terraform Configuration When setting up your Terraform configuration, you need to lay out what you want your IAM roles to look like. This means picking a role name, attaching policies, and adjusting any other important settings. With Terraform, there are a bunch of resources and data sources that help you work with AWS services and set up your infrastructure just how you need it. The AWS provider in Terraform has special resources for handling IAM roles like aws_iam_role and aws_iam_policy_attachment. Plus, if you're looking to grab details on existing IAM roles or other bits of AWS resources, data sources are there for the taking. Writing Terraform Code for IAM Role Creation In Terraform, you use a special kind of code that tells the computer exactly how you want your infrastructure to look. When setting up IAM roles, this means writing out what each role should be called and what rules it follows using Terraform's language. For instance, with the aws_iam_role resource in Terraform, you can spell out an IAM role and give it a name. After that, by using something called aws_iam_policy_attachment, you can stick certain policies onto that role. This way of doing things lets you keep track of your IAM roles easily since everything is written down clearly in code form. It also makes working together on projects smoother because everyone can see and understand the setups being used without confusion. Applying Terraform Configuration to Create IAM Roles After you've written your Terraform code, it's time to use the terraform apply command.
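To make that concrete, here's a minimal sketch of what such a configuration might look like. The role name, region, and attached policy below are illustrative assumptions, not taken from the article; a configuration file like this is exactly what the terraform apply command acts on:

HCL
# Minimal sketch: an IAM role that the EC2 service can assume,
# plus an AWS-managed policy attached to it.
provider "aws" {
  region = "us-east-1" # assumed region for this example
}

resource "aws_iam_role" "example" {
  name = "example-role" # hypothetical role name

  # Trust policy: lets EC2 instances assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# Attach a ready-made AWS-managed policy to the role
resource "aws_iam_policy_attachment" "example" {
  name       = "example-attachment"
  roles      = [aws_iam_role.example.name]
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}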
Running terraform apply will look over what you wrote and set up the IAM roles in AWS just like you wanted. With Terraform, making sure these IAM roles have all the right settings and policy attachments is a breeze. It gives you full control over managing these roles, so if anything gets changed by someone or something else, Terraform will notice the drift and help fix it back to how it should be based on your original setup. Best Practices for Managing IAM Roles With Terraform When you're handling IAM roles with Terraform, it's smart to stick to some key rules so everything stays safe and easy to manage. For starters, organize your Terraform projects in a way that makes them easy to reuse and update. This means putting your code into different folders, keeping track of changes with version control, and making good use of Terraform modules. On top of this, make sure your IAM roles are locked down tight by sticking with the usual IAM policies, being careful about who gets what permissions, and regularly checking and tweaking those policies as needed. Doing all this stuff right from the get-go ensures that everything related to IAM is well-managed using Terraform, including its modules. Structuring Your Terraform Projects Keeping your Terraform code neat and tidy is super important. A good way to do this is by putting different bits of your code into folders based on what they're for. For instance, you might have one folder for IAM roles, another for EC2 instances, and yet another for S3 buckets. With everything in its own place, it's a breeze to find and tweak whatever resource you need. On top of that, using something like Git helps keep track of all the changes you make over time; think of it as a safety net letting you go back if anything goes sideways. Also, diving into Terraform modules can be a game-changer because they let you package up common setups so you can use them again without starting from scratch every single time—kinda like having building blocks ready to go whenever needed. Securing Your IAM Roles Making sure your IAM roles are safe is key to keeping your AWS setup secure. A good way to do this is by sticking with the standard IAM policies that AWS offers. These come ready-made to safely grant access to the usual AWS services and resources you might need. On top of that, it's important to keep a close eye on what permissions these IAM roles have. Make it a habit to check and tweak the policies tied to them now and then. This helps make sure they can only do what they really need to, cutting down on the chances someone could get in who shouldn't be able to. With careful attention paid toward securing your IAM roles, you're taking big steps toward avoiding security problems and making sure everything runs smoothly. Advanced Terraform Techniques for IAM Roles Terraform is really good at handling IAM roles in AWS, making things a lot easier. With Terraform, you can automate the whole process of setting up, changing, and removing IAM roles. You just tell Terraform how you want your IAM roles to look through code, and it does all the work to make sure everything matches what you asked for. This way, everything stays consistent and mistakes are less likely when setting up your IAM roles. On top of that, Terraform keeps track of all changes made to your IAM role configurations, so if something goes wrong or doesn't work out as planned, rolling back those changes is pretty straightforward.
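Since modules keep coming up as a best practice, here's a hedged sketch of what consuming such a module might look like. The module path, name, and input variables are hypothetical, invented purely for illustration:

HCL
# Hypothetical module call: packages a common IAM role setup for reuse.
# The source path and the input variable names are illustrative assumptions.
module "app_role" {
  source = "./modules/iam-role"

  role_name   = "app-role"
  policy_arns = ["arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"]
}

The same module could then be instantiated with different inputs for each environment or AWS account, which is exactly the kind of reuse the next section digs into.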
Using Terraform Modules for IAM Roles Terraform really shines when you dive into its modules, especially for setting up IAM roles. Think of modules like a toolbox that lets you pack away bits of code you use often so you can easily grab them for different projects or settings. This way, instead of writing the same stuff over and over again, you just set it up once in a module and then reuse it wherever needed. It's all about making things more streamlined and keeping your setup consistent across various AWS accounts without having to redo work. Plus, with these modules, tweaking your infrastructure by adding or taking away IAM roles becomes a breeze. So basically, using Terraform's modules means less hassle managing configurations while ensuring everything stays neat and tidy. Automating Role Updates With Terraform With Terraform, keeping IAM roles up to date becomes a breeze because it lets you automate updates. By defining how you want your IAM roles to look in the code, and then using the terraform plan command, you get a sneak peek at what changes will happen. This step is great because it means you can double-check everything looks right before making any moves on your AWS account. Happy with what you see? Just run terraform apply, and just like that, your IAM roles are updated automatically in your AWS account — no need for manual tweaks or worrying about mistakes slipping through. Plus, when there's a new version of Terraform out, upgrading is straightforward so that you won't miss out on cool new features or important bug fixes. Troubleshooting Common Issues Terraform makes setting up and handling IAM roles in AWS easier, but it's not always smooth sailing. Sometimes you might run into problems like getting the IAM policies wrong, bumping into issues with roles or policies that are already there, or hitting snags when you're trying to apply your Terraform setup. To get around these bumps, it helps a lot to know how Terraform does its thing step by step and to take a close look at the terraform plan output for any heads-ups on errors or things to watch out for. On top of that, peeking into the debug logs can shed some light on what's going wrong. Sticking to best practices and asking for tips from other people using Terraform can also be a big help in fixing common troubles with IAM roles and Terraform. Debugging Terraform Execution When you're fixing problems with IAM roles and using Terraform, it's really important to get how Terraform does its thing. With the terraform plan command, you can see ahead of time what changes will happen to your IAM roles. By looking at what the plan shows, you can spot any troubles or clashes that might pop up when you actually apply these changes. If something goes wrong or if there are warnings, checking out the debug logs is a smart move because they give more details on what's happening step by step in Terraform. These logs are super helpful for figuring out exactly where things went sideways. So, by digging into these debug logs and getting how everything works in Terraform from start to finish, sorting out issues with IAM roles becomes way easier. Resolving Common Errors With IAM Roles and Terraform When you're setting up IAM roles in AWS with Terraform, it's pretty common to run into some hiccups during the terraform apply step. You might bump into issues like clashes with roles or policies that are already there, IAM policies not set up right, or even problems directly from the AWS IAM service itself.
To get past these errors, it's key to take a close look at what the error message from Terraform is telling you and follow any steps it suggests for fixing things. This could mean tweaking your IAM role setup a bit, changing your IAM policies so everything plays nice together, or maybe reaching out to the folks who manage AWS IAM if there’s something bigger going on their end. By sticking to best practices and really understanding how to tackle these errors head-on, sorting out any troubles with your configuration should be totally doable. To wrap things up, using Terraform to set up IAM roles in AWS makes the whole process of handling permissions a lot smoother. When you mix what AWS IAM can do with how Terraform lets you manage infrastructure through code, setting up secure and effective roles becomes simpler. It's important to stick to good ways of organizing your Terraform work and keeping those IAM roles safe so that your security stays strong. Getting into deeper stuff like working with Terraform modules and making role updates automatic will make everything more scalable and flexible. If you run into problems or have questions, looking at common troubleshooting tips and FAQs is a big help in getting past any bumps in the road smoothly. Keeping on top of managing your IAM roles with Terraform is key for making sure your AWS setup runs as well as it can.
Navigating toward a cloud-native architecture can be both exciting and challenging. The expectation of learning valuable lessons should always be top of mind as design becomes a reality. In this article, I wanted to focus on an example where my project seemed like a perfect serverless use case, one where I’d leverage AWS Lambda. Spoiler alert: it was not. Rendering Fabric.js Data In a publishing project, we utilized Fabric.js — a JavaScript HTML5 canvas library — to manage complex metadata and content layers. These complexities included spreads, pages, and templates, each embedded with fonts, text attributes, shapes, and images. As the content evolved, teams were tasked with updates, necessitating the creation of a publisher-quality PDF after each update. We built a Node.js service to run Fabric.js, generating PDFs and storing resources in AWS S3 buckets with private cloud access. During a typical usage period, over 10,000 teams were using the service, with each individual contributor sending multiple requests to the service as a result of manual page saves or auto-saves driven by the Angular client. The service was set up to run as a Lambda in AWS. The idea of paying at the request level seemed ideal. Where Serverless Fell Short We quickly realized that our Lambda approach wasn’t going to cut it. The spin-up time turned out to be the first issue. Not only was there the time required to start the Node.js service but preloading nearly 100 different fonts that could be used by those 10,000 teams caused delays too. We were also concerned about Lambda’s processing limit of 250 MB of unzipped source code. The initial release of the code was already over 150 MB in size, and we still had a large backlog of feature requests that would only drive this number higher. Finally, the complexity of the pages — especially as more elements were added — demanded increased CPU and memory to ensure quick PDF generation. After observing the usage for first-generation page designs completed by the teams, we forecasted the need for nearly 12 GB of RAM. Currently, AWS Lambdas are limited to 10 GB of RAM. Ultimately, we opted for dedicated EC2 compute resources to handle the heavy lifting. Unfortunately, this decision significantly increased our DevOps management workload. Looking for a Better Solution Although I am no longer involved with that project, I’ve always wondered if there was a better solution for this use case. While I appreciate AWS, Google, and Microsoft providing enterprise-scale options for cloud-native adoption, what kills me is the associated learning curve for every service. The company behind the project was a smaller technology team. Oftentimes teams in that position struggle with adoption when it comes to using the big three cloud providers. The biggest challenges I continue to see in this regard are: A heavy investment in DevOps or CloudOps to become cloud-native. Gaining a full understanding of what appears to be endless options. Tech debt related to cost analysis and optimization. Since I have been working with the Heroku platform, I decided to see if they had an option for my use case. Turns out, they introduced large dynos earlier this year. For example, with their Performance-L RAM Dyno, my underlying service would get 50x the compute power of a standard Dyno and 30 GB of RAM. The capability to write to AWS S3 has been available from Heroku for a long time too. 
V2 Design in Action Using the Performance-L RAM dyno in Heroku would be no different (at least operationally) than using any other dyno in Heroku. To run my code, I just needed a Heroku account and the Heroku command-line interface (CLI) installed locally. After navigating to the source code folder, I would issue a series of commands to log in to Heroku, create my app, set up my AWS-related environment variables, and run up to five instances of the service using the Performance-L dyno with auto-scaling in place:

Shell
heroku login
heroku apps:create example-service
heroku config:set AWS_ACCESS_KEY_ID=MY-ACCESS-ID AWS_SECRET_ACCESS_KEY=MY-ACCESS-KEY
heroku config:set S3_BUCKET_NAME=example-service-assets
heroku ps:scale web=5:Performance-L-RAM
git push heroku main

Once deployed, my example-service application can be called via standard RESTful API calls. As needed, the auto-scaling technology in Heroku could launch up to five instances of the Performance-L dyno to meet consumer demand. I would have gotten all of this without having to spend a lot of time understanding a complicated cloud infrastructure or worrying about cost analysis and optimization. Projected Gains As I thought more about the CPU and memory demands of our publishing project — during standard usage seasons and peak usage seasons — I saw how these performance dynos would have been exactly what we needed. Instead of crippling our CPU and memory when the requested payload included several Fabric.js layers, we would have had enough horsepower to generate the expected image, often before the user navigated to the page containing the preview images. We wouldn't have had size constraints on our application source code, which would inevitably have hit AWS Lambda's limitations within the next 3 to 4 sprints. The time required for our DevOps team to learn Lambdas first and then switch to EC2 hit our project's budget pretty noticeably. And even then, those services weren't cheap, especially when spinning up several instances to keep up with demand. But with Heroku, the DevOps investment would be considerably reduced and placed into the hands of software engineers working on the use case. Just like any other dyno, it's easy to use and scale up the performance dynos either with the CLI or the Heroku dashboard. Conclusion My readers may recall my personal mission statement, which I feel can apply to any IT professional: "Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else." — J. Vester In this example, I had a use case that required a large amount of CPU and memory to process complicated requests made by over 10,000 consumer teams. I walked through what it would have looked like to fulfill this use case using Heroku's large dynos, and all I needed was a few CLI commands to get up and running. Burning out your engineering and DevOps teams is not your only option. There are alternatives available to relieve the strain. By taking the Heroku approach, you avoid the steep learning curve that often comes with cloud adoption from the big three. Even better, the tech debt associated with cost analysis and optimization never sees the light of day. In this case, Heroku adheres to my personal mission statement, allowing teams to focus on what is likely a mountain of feature requests to help product owners meet their objectives. Have a really great day!
The explicit behavior of IaC version managers is quite crucial. It is especially critical in the realm of Terraform and OpenTofu because tool upgrades might destroy or corrupt all managed infrastructure. To protect users from unexpected updates, all version managers have to work clearly and without any internal wizardry that cannot be explained without a deep dive into the sources. Tenv is a versatile version manager for OpenTofu, Terraform, Terragrunt, and Atmos, written in Go and developed by the tofuutils team. This tool simplifies the complexity of handling different versions of these powerful tools, ensuring developers and DevOps professionals can focus on what matters most — building and deploying efficiently. Tenv is the successor of tofuenv and tfenv. During tenv's development, our team discovered quite an unpleasant surprise involving Terragrunt and tenv that could have created serious issues. On a fresh install of a Linux system, when one of our users attempted to run Terragrunt, the execution ended up utilizing OpenTofu instead of Terraform, with no warning in advance. In a production environment, this might cause serious Terraform state corruption; luckily, it was a testing environment. Before we look at the root cause of this issue, I need to explain how tenv works. Tenv manages all tools by wrapping them in an additional binary that serves as a proxy for the original tool. This means you can't install Terraform or OpenTofu on an ordinary Linux machine alongside tenv (except in the NixOS case). Our tool supplies a binary with the same name as the managed tool (Terraform / OpenTofu / Terragrunt / Atmos), within which we implement the proxy pattern. This was required since it simplifies version management and allows us to add capabilities such as automatic version discovery and installation handling. So, knowing that tenv is based on a downstream proxy architecture, we are ready to return to the problem. Why was our user's execution performed using OpenTofu rather than Terraform? The answer has two parts: Terragrunt started to use OpenTofu as the default IaC tool; however, this was not a major release. Instead, it shipped as a patch, and users didn't expect any differences in behavior. The original problem may be found here. When Terragrunt called OpenTofu in the new default behavior, it used tenv's proxy to check the required version of OpenTofu and install it automatically. Although the TERRAGRUNT_TFPATH setting might control the behavior, users were unaware of the Terragrunt breaking change and were surprised to see OpenTofu at the end of execution. But why did OpenTofu execute if users did not have it on their system? Here we are dealing with the second issue. At the start of tenv development, we replicated many features from the tfenv tool. One of these features was automatic tool installation, which is controlled by the TFENV_AUTO_INSTALL environment variable and is enabled by default. Tenv likewise has the TENV_AUTO_INSTALL variable, which was also true by default until the case described above was discovered. Users who used Terraform / OpenTofu without Terragrunt via tenv may have encountered the auto-install when, for example, switching the version of the tool with the following commands:

Shell
tenv tf use 1.5.3
tenv tofu use 1.6.1

The use command installed the required version even if it wasn't present on the operating system locally.
After a brief GitHub discussion, our team decided to disable auto-install by default and release this minor change as a new, major version of tenv. We made no major changes to the program, did not update the framework or the language version, and only updated the default value of the variable; we decided that users should understand that one of the most often utilized and crucial behaviors had changed. Interestingly, during the discussion we disagreed about whether users actually read the README.md or documentation, but whether you like it or not, it's true that people don't read the docs unless they're in trouble. As the tofuutils team, we cannot accept the possibility that a user will mistakenly utilize OpenTofu in a real-world production environment and break the state or the cloud environment. Finally, I'd like to highlight a few points once more: Implement intuitive behavior in your tool. Consider user experience and keep in mind that many people don't read manuals. Do not worry about releasing a major version if you made a breaking change. In programming, explicit is preferable to implicit, especially when dealing with state-sensitive tools.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. When it comes to software engineering and application development, cloud native has become commonplace in many teams' vernacular. When people survey the world of cloud native, they often come away with the perspective that the entire process of cloud native is for large enterprise applications. A few years ago, that may have been the case, but with the advancement of tooling and services surrounding systems such as Kubernetes, the barrier to entry has been substantially lowered. Even so, does adopting cloud-native practices for applications consisting of a few microservices make a difference? Just as cloud native has become commonplace, the shift-left movement has made inroads into many organizations' processes. Shifting left is a focus on application delivery from the outset of a project, where software engineers are just as focused on the delivery process as they are on writing application code. Shifting left implies that software engineers understand deployment patterns and technologies as well as implement them earlier in the SDLC. Shifting left using cloud native with microservices development may sound like a definition containing a string of contemporary buzzwords, but there's real benefit to be gained in combining these closely related topics. Fostering a Deployment-First Culture Process is necessary within any organization. Processes are broken down into manageable tasks across multiple teams with the objective being an efficient path by which an organization sets out to reach a goal. Unfortunately, organizations can get lost in their processes. Teams and individuals focus on doing their tasks as best as possible, and at times, so much so that the goal for which the process is defined gets lost. Software development lifecycle (SDLC) processes are not immune to this problem. Teams and individuals focus on doing their tasks as best as possible. However, in any given organization, if individuals on application development teams are asked how they perceive their objectives, responses can include: "Completing stories" "Staying up to date on recent tech stack updates" "Ensuring their components meet security standards" "Writing thorough tests" Most of the answers provided would demonstrate a commitment to the process, which is good. However, what is the goal? The goal of the SDLC is to build software and deploy it. Whether it be an internal or SaaS application, deploying software helps an organization meet an objective. When presented with the statement that the goal of the SDLC is to deliver and deploy software, just about anyone who participates in the process would say, "Well, of course it is." Teams often lose sight of this "obvious" directive because they're far removed from the actual deployment process. A strategic investment in the process can close that gap. Cloud-native abstractions bring a common domain and dialogue across disciplines within the SDLC. Kubernetes is a good basis upon which cloud-native abstractions can be leveraged. Not only does Kubernetes' usefulness span applications of many shapes and sizes, but when it comes to the SDLC, Kubernetes can also be the environment used on systems ranging from local engineering workstations, through the entire delivery cycle, and on to production.
Bringing the deployment platform all the way "left" to an engineer's workstation has everyone in the process speaking the same language, and deployment becomes a focus from the beginning of the process. Various teams in the SDLC may look at "Kubernetes Everywhere" with skepticism. Work done on Kubernetes in reducing its footprint for systems such as edge devices has made running Kubernetes on a workstation very manageable. Introducing teams to Kubernetes through automation allows them to iteratively absorb the platform. The most important thing is building a deployment-first culture. Plan for Your Deployment Artifacts With all teams and individuals focused on the goal of getting their applications to production as efficiently and effectively as possible, how does the evolution of application development shift? The shift is subtle. With a shift-left mindset, there aren't necessarily a lot of new tasks, so the shift is where the tasks take place within the overall process. When a detailed discussion of application deployment begins with the first line of code, existing processes may need to be updated. Build Process If software engineers are to deploy to their personal Kubernetes clusters, are they able to build and deploy enough of an application that they're not reliant on code running on a system beyond their workstation? And there is more to consider than just application code. Is a database required? Does the application use a caching system? It can be challenging to review an existing build process and refactor it for workstation use. The CI/CD build process may need to be re-examined to consider how it can be invoked on a workstation. For most applications, refactoring the build process can be accomplished in such a way that the goal of local build and deployment is met while also using the refactored process in the existing CI/CD pipeline. For new projects, begin by designing the build process for the workstation. The build process can then be added to a CI/CD pipeline. The local build and CI/CD build processes should strive to share as much code as possible. This will keep the entire team up to date on how the application is built and deployed. Build Artifacts The primary deliverables for a build process are the build artifacts. For cloud-native applications, this includes container images (e.g., Docker images) and deployment packages (e.g., Helm charts). When an engineer is executing the build process on their workstation, the artifacts will likely need to be published to a repository, such as a container registry or chart repository. The build process must be aware of context. Existing processes may already be aware of their context with various settings for environments ranging from test and staging to production. Workstation builds become an additional context. Given the awareness of context, build processes can publish artifacts to workstation-specific registries and repositories. For cloud-native development, and in keeping with the local workstation paradigm, container registries and chart repositories are deployed as part of the workstation Kubernetes cluster. As the process moves from build to deploy, maintaining build context includes accessing resources within the current context. Parameterization Central to this entire process is that key components of the build and deployment process definition cannot be duplicated based on a runtime environment. 
For example, if a container image is built and published one way on the local workstation and another way in the CI/CD pipeline, how long will it be before they diverge? Most likely, they diverge sooner than expected. Divergence in a build process will create a divergence across environments, which leads to divergence in teams and results in the eroding of the deployment-first culture. That may sound a bit dramatic, but as soon as any code forks — without a deliberate plan to merge the forks — the code eventually becomes, for all intents and purposes, unmergeable. Parameterizing the build and deployment process is required to maintain a single set of build and deployment components. Parameters define build context such as the registries and repositories to use. Parameters define deployment context as well, such as the number of pod replicas to deploy or resource constraints. As the process is created, lean toward over-parameterization. It's easier to maintain a parameter as a constant than to extract a parameter from an existing process.

Figure 1. Local development cluster

Cloud-Native Microservices Development in Action In addition to the deployment-first culture, cloud-native microservices development requires tooling support that doesn't impede the day-to-day tasks performed by an engineer. If engineers can be shown a new pattern for development that allows them to be more productive with only a minimum-to-moderate level of understanding of new concepts, while still using their favorite tools, the engineers will embrace the paradigm. While engineers may push back or be skeptical about a new process, once the impact on their productivity is tangible, they will be energized to adopt the new pattern. Easing Development Teams Into the Process Changing culture is about getting teams on board with adopting a new way of doing something. The next step is execution. Shifting left requires that software engineers move from designing and writing application code to becoming an integral part of the design and implementation of the entire build and deployment process. This means learning new tools and exploring areas in which they may not have a great deal of experience. Human nature tends to resist change. Software engineers may look at this entire process and think, "How can I absorb this new process and these new tools while trying to maintain a schedule?" It's a valid question. However, software engineers are typically fine with incorporating a new development tool or process that helps them and the team without drastically disrupting their daily routine. Whether beginning a new project or refactoring an existing one, adoption of a shift-left engineering process requires introducing new tools in a way that allows software engineers to remain productive while iteratively learning the new tooling. This starts with automating and documenting the build out of their new development environment — their local Kubernetes cluster. It also requires listening to the team's concerns and suggestions as this will be their daily environment. Dev(elopment) Containers The Development Containers specification is a relatively new advancement based on an existing concept in supporting development environments. Many engineering teams have leveraged virtual desktop infrastructure (VDI) systems, where a developer's workstation is hosted on a virtualized infrastructure.
Companies that implement VDI environments like the centralized control of environments, and software engineers like the idea of a pre-packaged environment that contains all the components required to develop, debug, and build an application. What software engineers do not like about VDI environments is network issues where their IDEs become sluggish and frustrating to use. Development containers leverage the same concept as VDI environments but bring it to a local workstation, allowing engineers to use their locally installed IDE while being remotely connected to a running container. This way, the engineer gets the experience of local development while the workload runs in a container. Development containers do require an IDE that supports the pattern. What makes the use of development containers so attractive is that engineers can attach to a container running within a Kubernetes cluster and access services as configured for an actual deployment. In addition, development containers support a first-class development experience, including all the tools a developer would expect to be available in a development environment. From a broader perspective, development containers aren't limited to local deployments. When configured for access, cloud environments can provide the same first-class development experience. Here, the deployment abstraction provided by containerized orchestration layers really shines.

Figure 2. Microservice development container configured with dev containers

The Synergistic Evolution of Cloud-Native Development Continues There's a synergy across shift-left, cloud-native, and microservices development. They present a pattern for application development that can be adopted by teams of any size. Tooling continues to evolve, making practical use of the technologies involved in cloud-native environments accessible to all involved in the application delivery process. It is a culture change that entails a change in mindset while learning new processes and technologies. It's important that teams aren't burdened with a collection of manual processes where they feel their productivity is being lost. Automation helps ease teams into the adoption of the pattern and technologies. As with any other organizational change, upfront planning and preparation are important. Just as important is involving the teams in the plan. When individuals have a say in change, ownership and adoption become a natural outcome. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Simplicity is a key selling point of cloud technology. Rather than worrying about racking and stacking equipment, configuring networks, and installing operating systems, developers can just click through a friendly web interface and quickly deploy an application. Of course, that friendly web interface hides serious complexity, and deploying an application is just the first and easiest step toward a performant and reliable system. Once an application grows beyond a single deployment, issues begin to creep in. New versions require database schema changes or added components, and multiple team members can change configurations. The application must also be scaled to serve more users, provide redundancy to ensure reliability, and manage backups to protect data. While it might be possible to manage this complexity using that friendly web interface, we need automated cloud orchestration to deliver consistently at speed. There are many choices for cloud orchestration, so which one is best for a particular application? Let's use a case study to consider two key decisions in the trade space: The number of different technologies we must learn and manage Our ability to migrate to a different cloud environment with minimal changes to the automation However, before we look at the case study, let's start by understanding some must-have features of any cloud automation. Cloud Orchestration Must-Haves Our goal with cloud orchestration automation is to manage the complexity of deploying and operating a cloud-native application. We want to be confident that we understand how our application is configured, that we can quickly restore an application after outages, and that we can manage changes over time with confidence in bug fixes and new capabilities while avoiding unscheduled downtime. Repeatability and Idempotence Cloud-native applications use many cloud resources, each with different configuration options. Problems with infrastructure or applications can leave resources in an unknown state. Even worse, our automation might fail due to network or configuration issues. We need to run our automation confidently, even when cloud resources are in an unknown state. This key property is called idempotence, which simplifies our workflow as we can run the automation no matter the current system state and be confident that successful completion places the system in the desired state. Idempotence is typically accomplished by having the automation check the current state of each resource, including its configuration parameters, and applying only necessary changes. This kind of smart resource application demands dedicated orchestration technology rather than simple scripting. Change Tracking and Control Automation needs to change over time as we respond to changes in application design or scaling needs. As needs change, we must manage automation changes as dueling versions will defeat the purpose of idempotence. This means we need Infrastructure as Code (IaC), where cloud orchestration automation is managed identically to other developed software, including change tracking and version management, typically in a Git repository such as this example. Change tracking helps us identify the source of issues sooner by knowing what changes have been made. 
For this reason, we should modify our cloud environments only by automation, never manually, so we can know that the repository matches the system state — and so we can ensure changes are reviewed, understood, and tested prior to deployment. Multiple Environment Support To test automation prior to production deployment, we need our tooling to support multiple environments. Ideally, we can support rapid creation and destruction of dynamic test environments because this increases confidence that there are no lingering required manual configurations and enables us to test our automation by using it. Even better, dynamic environments allow us to easily test changes to the deployed application, creating unique environments for developers, complex changes, or staging purposes prior to production. Cloud automation accomplishes multi-environment support through variables or parameters passed from a configuration file, environment variables, or on the command line. Managed Rollout Together, idempotent orchestration, a Git repository, and rapid deployment of dynamic environments bring the concept of dynamic environments to production, enabling managed rollouts for new application versions. There are multiple managed rollout techniques, including blue-green deployments and canary deployments. What they have in common is that a rollout consists of separately deploying the new version, transitioning users over to the new version either at once or incrementally, then removing the old version. Managed rollouts can eliminate application downtime when moving to new versions, and they enable rapid detection of problems coupled with automated fallback to a known working version. However, a managed rollout is complicated to implement as not all cloud resources support it natively, and changes to application architecture and design are typically required. Case Study: Implementing Cloud Automation Let's explore the key features of cloud automation in the context of a simple application. We'll deploy the same application using both a cloud-agnostic approach and a single-cloud approach to illustrate how both solutions provide the necessary features of cloud automation, but with differences in implementation and various advantages and disadvantages. Our simple application is based on Node, backed by a PostgreSQL database, and provides an interface to create, retrieve, update, and delete a list of to-do items. The full deployment solutions can be seen in this repository. Before we look at differences between the two deployments, it's worth considering what they have in common: Use a Git repository for change control of the IaC configuration Are designed for idempotent execution, so both have a simple "run the automation" workflow Allow for configuration parameters (e.g., cloud region data, unique names) that can be used to adapt the same automation to multiple environments Cloud-Agnostic Solution Our first deployment, as illustrated in Figure 1, uses Terraform (or OpenTofu) to deploy a Kubernetes cluster into a cloud environment. Terraform then deploys a Helm chart, with both the application and PostgreSQL database. Figure 1. Cloud-agnostic deployment automation The primary advantage of this approach, as seen in the figure, is that the same deployment architecture is used to deploy to both Amazon Web Services (AWS) and Microsoft Azure. The container images and Helm chart are identical in both cases, and the Terraform workflow and syntax are also identical. 
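The common features listed above mentioned configuration parameters for adapting one automation to multiple environments. As a hedged sketch (the variable names and defaults here are illustrative assumptions, not taken from the example repository), that support typically starts with variable declarations like these; note how a variable such as node_count can then be consumed by the cluster resource shown below:

HCL
# Illustrative parameters that let one configuration serve many environments
variable "region" {
  description = "Cloud region to deploy into"
  type        = string
  default     = "eastus" # assumed default for the Azure case
}

variable "environment" {
  description = "Unique environment name, e.g., dev, staging, or prod"
  type        = string
}

variable "node_count" {
  description = "Number of Kubernetes worker nodes"
  type        = number
  default     = 2
}

Passing a different -var or -var-file value on the terraform command line then produces a distinct environment from the same automation.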
Additionally, we can test container images, Kubernetes deployments, and Helm charts separately from the Terraform configuration that creates the Kubernetes environment, making it easy to reuse much of this automation to test changes to our application. Finally, with Terraform and Kubernetes, we're working at a high level of abstraction, so our automation code is short but can still take advantage of the reliability and scalability capabilities built into Kubernetes. For example, an entire Azure Kubernetes Service (AKS) cluster is created in about 50 lines of Terraform configuration via the azurerm_kubernetes_cluster resource:

HCL
resource "azurerm_kubernetes_cluster" "k8s" {
  location = azurerm_resource_group.rg.location
  name     = random_pet.azurerm_kubernetes_cluster_name.id
  ...
  default_node_pool {
    name       = "agentpool"
    vm_size    = "Standard_D2_v2"
    node_count = var.node_count
  }
  ...
  network_profile {
    network_plugin    = "kubenet"
    load_balancer_sku = "standard"
  }
}

Even better, the Helm chart deployment is just five lines and is identical for AWS and Azure:

HCL
resource "helm_release" "todo" {
  name       = "todo"
  repository = "https://book-of-kubernetes.github.io/helm/"
  chart      = "todo"
}

However, a cloud-agnostic approach brings additional complexity. First, we must create and maintain configuration using multiple tools, requiring us to understand Terraform syntax, Kubernetes manifest YAML files, and Helm templates. Also, while the overall Terraform workflow is the same, the cloud provider configuration is different due to differences in Kubernetes cluster configuration and authentication. This means that adding a third cloud provider would require significant effort. Finally, if we wanted to use additional features such as cloud-native databases, we'd first need to understand the key configuration details of that cloud provider's database, then understand how to apply that configuration using Terraform. This means that we pay an additional price in complexity for each native cloud capability we use. Single Cloud Solution Our second deployment, illustrated in Figure 2, uses AWS CloudFormation to deploy an Elastic Compute Cloud (EC2) virtual machine and a Relational Database Service (RDS) cluster: Figure 2. Single cloud deployment automation The biggest advantage of this approach is that we create a complete application deployment solution entirely in CloudFormation's YAML syntax. By using CloudFormation, we are working directly with AWS cloud resources, so there's a clear correspondence between resources in the AWS web console and our automation. As a result, we can take advantage of the specific cloud resources that are best suited for our application, such as RDS for our PostgreSQL database. This use of the best resources for our application can help us manage our application's scalability and reliability needs while also managing our cloud spend. The tradeoff in exchange for this simplicity and clarity is a more verbose configuration. We're working at the level of specific cloud resources, so we have to specify each resource, including items such as routing tables and subnets that Terraform configures automatically.
The resulting CloudFormation YAML is 275 lines and includes low-level details such as egress routing from our VPC to the internet:

YAML
TodoInternetRoute:
  Type: AWS::EC2::Route
  Properties:
    DestinationCidrBlock: 0.0.0.0/0
    GatewayId: !Ref TodoInternetGateway
    RouteTableId: !Ref TodoRouteTable

Also, of course, the resources and configuration are AWS-specific, so if we wanted to adapt this automation to a different cloud environment, we would need to rewrite it from the ground up. Finally, while we can easily adapt this automation to create multiple deployments on AWS, it is not as flexible for testing changes to the application, as we have to deploy a full RDS cluster for each new instance. Conclusion Our case study enabled us to exhibit key features and tradeoffs for cloud orchestration automation. There are many more than just these two options, but whatever solution is chosen should use an IaC repository for change control and a tool that provides idempotence and support for multiple environments. Within that cloud orchestration space, our deployment architecture and our tool selection will be driven by the importance of portability to new cloud environments compared to the cost in additional complexity. This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC. Cloud native and observability are an integral part of developers' lives. Understanding their responsibilities within observability at scale helps developers tackle the challenges they are facing on a daily basis. There is more to observability than just collecting and storing data, and developers are essential to surviving these challenges. Observability Foundations Gone are the days of monitoring a known application environment, debugging services within our development tooling, and waiting for new resources to deploy our code to. All of this has become dynamic, agile, and quickly available with auto-scaling infrastructure in the final production deployment environments. Developers are now striving to observe everything they are creating, from development to production, often owning their code for the entire lifecycle. The tooling from days of old, such as Nagios and HP OpenView, can't keep up with constantly changing cloud environments that contain thousands of microservices. The infrastructure for cloud-native deployments is designed to dynamically scale as needed, making it even more essential for observability platforms to help condense all that data noise and detect trends that lead to downtime before it happens. Splintering of Responsibilities in Observability Cloud-native complexity not only changed the developer world but also impacted how organizations are structured. The responsibilities of creating, deploying, and managing cloud-native infrastructure have split into a series of new organizational teams. Developers are being tasked with more than just code creation and are expected to adopt more hybrid roles within some of these new teams. Observability teams have been created to focus on specific aspects of the cloud-native ecosystem to provide their organization a service within the cloud infrastructure. In Table 1, we can see the splintering of traditional roles in organizations into these teams with specific focuses.
Table 1. Who's who in the observability game

Team | Focus | Maturity goals
DevOps | Automation and optimization of the app development lifecycle, including post-launch fixes and updates | Early stages: developer productivity
Platform engineering | Designing and building toolchains and workflows that enable self-service capabilities for developers | Early stages: developer maturity and productivity boost
CloudOps | Provides organizations proper (cloud) resource management, using DevOps principles and IT operations applied to cloud-based architectures to speed up business processes | Later stages: cloud resource management, costs, and business agility
SRE | All-purpose role aiming to manage reliability for any type of environment; a full-time job avoiding downtime and optimizing performance of all apps and supporting infrastructure, regardless of whether it's cloud native | Early to late stages: on-call engineers trying to reduce downtime
Central observability team | Responsible for defining observability standards and practices, delivering key data to engineering teams, and managing tooling and observability data storage | Later stages: define monitoring standards and practices; deliver monitoring data to engineering teams; measure reliability and stability of monitoring solutions; manage tooling and storage of metrics data

To understand how these teams work together, imagine a large, mature, cloud-native organization that has all the teams featured in Table 1: The DevOps team is the first line for standardizing how code is created, managed, tested, updated, and deployed. They work with toolchains and workflows provided by the platform engineering team. DevOps advises on new tooling and/or workflows, creating continuous improvements to both. A CloudOps team focuses on cloud resource management and getting the most out of the budgets spent on the cloud by the other teams. An SRE team is on call to manage reliability, avoiding downtime for all supporting infrastructure in the organization. They provide feedback for all the teams to improve tools, processes, and platforms. The overarching central observability team sets the observability standards for all teams to adhere to, delivering the right observability data to the right teams and managing tooling and data storage. Why Observability Is Important to Cloud Native Today, cloud native usage has seen such growth that developers are overwhelmed by their vast responsibilities that go beyond just coding. The complexity introduced by cloud-native environments means that observability is becoming essential to solving many of the challenges developers are facing. Challenges Increasing cloud-native complexity means that developers are providing more code faster and passing more rigorous testing to ensure that their applications work at cloud-native scale. These challenges expanded the need for observability within what was traditionally the developers' coding environment. Not only do they need to provide code and testing infrastructure for their applications, they are also required to instrument that code so that business metrics can be monitored. Over time, developers learned that fully automating metrics was overkill, with much of that data being unnecessary. This led developers to fine-tune their instrumentation methods and turn to manual instrumentation, where only the metrics they needed were collected. Another challenge arises when decisions are made to integrate existing application landscapes with new observability practices in an organization.
Another challenge arises when decisions are made to integrate existing application landscapes with new observability practices in an organization. The time developers spend manually instrumenting existing applications so that they provide the needed data to an observability platform is an often-overlooked burden. New observability tools designed to help with metrics, logs, and traces are introduced to the development teams — leading to more challenges for developers. Often, these tools are mastered by only a few, leading to siloed knowledge and to organizations paying premium prices for advanced observability tools that end up being used as little more than toys.

Finally, when exploring the data ingested from our cloud infrastructure, the first thing that becomes obvious is that we don't need to keep everything that is being ingested. We need control over our telemetry data and the ability to find out what is unused by our observability teams. There are some questions we need to answer about how we can:

- Identify ingested data not used in dashboards or alerting rules, nor touched in ad hoc queries by our observability teams
- Control telemetry data with aggregation and rules before we put it into expensive, longer-term storage
- Use only the telemetry data needed to support the monitoring of our application landscape

Tackling the flood of cloud data in such a way as to filter out the unused telemetry data, keeping only that which is applied to our observability needs, is crucial to making this data valuable to the organization. A rough sketch of how unused metrics might be identified follows below.
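As one illustration of the first question, the sketch below cross-references the metric names stored in a Prometheus-compatible TSDB against the queries found in exported Grafana dashboard JSON files. The URL, file paths, and the simplifying assumption that dashboards are the only consumers are all hypothetical; alert rules and ad hoc queries would need the same treatment.

Python

# Hypothetical sketch: list metrics that exist in the TSDB but never
# appear in any dashboard query. Assumes a Prometheus-compatible API
# and a directory of exported Grafana dashboard JSON files.
import pathlib
import requests

PROM_URL = "http://prometheus.internal:9090"   # hypothetical endpoint
DASHBOARD_DIR = pathlib.Path("./dashboards")   # exported dashboard JSON

# 1. All metric names known to the TSDB.
resp = requests.get(f"{PROM_URL}/api/v1/label/__name__/values", timeout=30)
all_metrics = set(resp.json()["data"])

# 2. The raw text of every dashboard, queries included.
dashboard_text = " ".join(
    path.read_text() for path in DASHBOARD_DIR.glob("*.json")
)

# 3. A metric is "possibly unused" if its name never shows up in any
#    dashboard. Crude substring matching, but enough for a first pass.
possibly_unused = sorted(m for m in all_metrics if m not in dashboard_text)
print(f"{len(possibly_unused)} of {len(all_metrics)} metrics look unused:")
for name in possibly_unused[:20]:
    print(" -", name)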
Cloud Native at Scale

The use of cloud-native infrastructure brings a lot of flexibility, but at scale, the small complexities can become overwhelming. This is due to the premise of cloud native, where we describe how our infrastructure should be set up, how our applications and microservices should be deployed, and, finally, how it all scales automatically when needed. This approach reduces our control over how our production infrastructure reacts to surges in customer usage of an organization's services.

Empowering Developers

Empowering developers starts with platform engineering teams that focus on developer experiences. We create developer experiences in our organization that treat observability as a priority, dedicating resources to creating a telemetry strategy from day one. In this culture, we're setting up development teams for success with cloud infrastructure, using observability alongside testing, continuous integration, and continuous deployment. Developers not only own the code they deliver but are now encouraged and empowered to create, test, and own the telemetry data from their applications and microservices. This is a brave new world where they are the owners of their work, providing agility and consensus within the various teams working on cloud solutions. Rising to the challenges of observability in a cloud native world is a success metric for any organization, and they can't afford to get it wrong. Observability needs to be front of mind with developers, considered a first-class citizen in their daily workflows, and consistently helping them with the challenges they face.

Artificial Intelligence and Observability

Artificial intelligence (AI) has risen in popularity not only within developer tooling but also in the observability domain. The application of AI in observability falls into one of two use cases:

- Monitoring machine learning (ML) solutions or large language model (LLM) systems
- Embedding AI into observability tooling itself as an assistant

The first case is when you want to monitor specific AI workloads, such as ML or LLMs. These can be further split into two situations you might want to monitor: the training platform and the production platform. Training infrastructure and the processes involved can be approached just like any other workload: easy-to-achieve monitoring using instrumentation and existing methods, such as observing specific traces through a solution. This is not the complete monitoring process that goes with these solutions, but out-of-the-box observability solutions are quite capable of supporting infrastructure and application monitoring for these workloads.

The second case is when AI assistants, such as chatbots, are included in the observability tooling that developers are exposed to. This is often in the form of a code assistant, such as one that helps fine-tune a dashboard or query our time series data ad hoc. While these are nice to have, organizations are very mindful of developer usage when inputting queries that include proprietary or sensitive data. It's important to understand that training these tools might include using proprietary data in their training sets, or even the data developers input, to further train the agents for future query assistance. Predicting the future of AI-assisted observability is not easy, as organizations consider their data one of their top-valued assets and will continue to protect its usage outside of their control, even when that usage could help improve tooling. To that end, one direction that might help adoption is to have agents trained only on in-house data, but that means the training data is smaller than that of publicly available agents.

Cloud-Native Observability: The Developer Survival Pattern

While we spend a lot of time on tooling as developers, we all understand that tooling is not always the fix for the complex problems we face. Observability is no different, and while developers are often exposed to the mantra of metrics, logs, and traces for solving their observability challenges, this is not a path to follow without considering the big picture. The amount of data generated in cloud-native environments, especially at scale, makes it impossible to continue collecting all data. This flood of data, the challenges that arise, and the inability to sift through the information to find the root causes of issues become detrimental to the success of development teams. It would be more helpful if developers were supported with just the right amount of data, in just the right forms, and at the right time to solve issues. One does not mind observability if solutions to problems are found quickly, situations are remediated faster, and developers are satisfied with the results. If this can be done with one log line, two spans from a trace, and three metric labels, then that's all we want to see.

To do this, developers need to know when issues arise with their applications or services, preferably before they happen. They start troubleshooting with data that their instrumented applications have already determined will succinctly point to areas within the offending application. Tooling should allow the investigating developer to see dashboards reporting visual information that directs them to the problem and the potential moment it started. It is crucial for developers to be able to remediate the problem, perhaps by rolling back a code change or deployment, so the application can continue to support customer interactions. Figure 1 illustrates the path taken by cloud native developers when solving observability problems.
The last step for any developer is to determine how issues encountered can be prevented going forward.

Figure 1. Observability pattern

Conclusion

Observability is essential for organizations to succeed in a cloud native world. The splintering of responsibilities in observability, along with the challenges that cloud-native environments bring at scale, cannot be ignored. Understanding the challenges that developers face in cloud native organizations is crucial to achieving observability happiness. Empowering developers, providing ways to tackle observability challenges, and understanding how the future of observability might look are the keys to handling observability in modern cloud environments.

DZone Refcard resources:

- Full-Stack Observability Essentials by Joana Carvalho
- Getting Started With OpenTelemetry by Joana Carvalho
- Getting Started With Prometheus by Colin Domoney
- Getting Started With Log Management by John Vester
- Monitoring and the ELK Stack by John Vester

This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.

2024 and the dawn of cloud-native AI technologies marked a significant jump in computational capabilities. We're experiencing a new era where artificial intelligence (AI) and platform engineering converge to transform cloud computing landscapes. AI is now merging with cloud computing, and we're entering an age where AI transcends traditional boundaries, offering scalable, efficient, and powerful solutions that learn and improve over time. Platform engineering provides the backbone for these AI systems to operate seamlessly within cloud environments. This shift entails designing, implementing, and managing the software platforms that serve as the fertile ground for AI applications to flourish. Together, the integration of AI and platform engineering in cloud-native environments is not just an enhancement but a transformative force, redefining the very fabric of how services are delivered, consumed, and evolved in the digital cosmos.

The Rise of AI in Cloud Computing

Azure and Google Cloud are pivotal solutions in cloud computing technology, each offering a robust suite of AI capabilities that cater to a wide array of business needs. Azure brings to the table its AI Services and Azure Machine Learning, a collection of AI tools that enable developers to build, train, and deploy AI models rapidly while leveraging its vast cloud infrastructure. Google Cloud, on the other hand, shines with its AI Platform and AutoML, which simplify the creation and scaling of AI products and integrate seamlessly with Google's data analytics and storage services. These platforms empower organizations to integrate intelligent decision-making into their applications, optimize processes, and provide insights that were once beyond reach.

A quintessential case study illustrating the successful implementation of AI in the cloud is that of the Zoological Society of London (ZSL), which utilized Google Cloud's AI to tackle the biodiversity crisis. ZSL's "Instant Detect" system harnesses AI on Google Cloud to analyze, in real time, vast amounts of images and sensor data from wildlife cameras across the globe. This system enables rapid identification and categorization of species, transforming how conservation efforts are conducted by providing precise, actionable data and leading to more effective protection of endangered species. Implementations such as ZSL's not only showcase the technical prowess of cloud AI capabilities but also underscore their potential to make a significant positive impact on critical global issues.

Platform Engineering: The New Frontier in Cloud Development

Platform engineering is a multifaceted discipline: the strategic design, development, and maintenance of software platforms that support more efficient deployment and application operations. It involves creating a stable and scalable foundation that gives developers the tools and capabilities needed to develop, run, and manage applications without the complexity of maintaining the underlying infrastructure. The scope of platform engineering spans the creation of internal development platforms, automation of infrastructure provisioning, implementation of continuous integration and continuous deployment (CI/CD) pipelines, and ensuring the platforms' reliability and security. In cloud-native ecosystems, platform engineers play a pivotal role.
They are the architects of the digital landscape, responsible for constructing the robust frameworks upon which applications are built and delivered. Their work involves creating abstractions on top of cloud infrastructure to provide a seamless development experience and operational excellence.

Figure 1. Platform engineering from the top down

Platform engineers enable teams to focus on creating business value by abstracting away complexities related to environment configurations, resource scaling, and service dependencies. They ensure that the underlying systems are resilient, self-healing, and deployable consistently across various environments. The convergence of DevOps and platform engineering with AI tools is an evolution that is reshaping the future of cloud-native technologies. DevOps practices are enhanced by AI's ability to predict, automate, and optimize processes. AI tools can analyze data from development pipelines to predict potential issues, automate root cause analyses, and optimize resources, leading to improved efficiency and reduced downtime. Moreover, AI can drive intelligent automation in platform engineering, enabling proactive scaling, self-tuning of resources, and personalized developer experiences. This synergy creates a dynamic environment where the speed and quality of software delivery are continually advancing, setting the stage for more innovative and resilient cloud-native applications.

Synergies Between AI and Platform Engineering

AI-augmented platform engineering introduces a layer of intelligence to automate processes, streamline operations, and enhance decision-making. Machine learning (ML) models, for instance, can parse through the massive datasets generated by cloud platforms to identify patterns and predict trends, allowing for real-time optimizations. AI can automate routine tasks such as network configurations, system updates, and security patches; these automations not only accelerate workflows but also reduce human error, freeing up engineers to focus on more strategic initiatives. There are various examples of AI-driven automation in cloud environments, such as intelligent systems that analyze application usage patterns and automatically adjust computing resources to meet demand without human intervention. The resulting cost savings and performance improvements provide exceptional value to an organization. AI-operated security protocols can autonomously monitor and respond to threats more quickly than traditional methods, significantly enhancing the security posture of the cloud environment.

Predictive analytics and ML are particularly transformative in platform optimization. They allow for anticipatory resource management, where systems forecast loads and scale resources accordingly. ML algorithms can optimize data storage, intelligently archiving or retrieving data based on usage patterns and access frequencies.

Figure 2. AI resource autoscaling

Moreover, AI can oversee and adjust platform configurations, ensuring that the environment is continuously refined for optimal performance. These predictive capabilities are not limited to resource management; they also extend to predicting application failures, user behavior, and even market trends, providing insights that can inform strategic business decisions. The proactive nature of predictive analytics means that platform engineers can move from reactive maintenance to a more visionary approach, crafting platforms that are not just robust and efficient but also self-improving and adaptive to future needs. The sketch below illustrates the forecasting idea in miniature.
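To ground the anticipatory-scaling idea, here is a deliberately small sketch: a linear trend fitted over recent CPU samples predicts the load a few minutes out, and the replica count is sized to that forecast rather than to the current reading. The sample data, thresholds, window sizes, and the apply_replicas hook mentioned in the comments are all hypothetical placeholders, not any provider's API.

Python

# Minimal predictive-autoscaling sketch: forecast CPU a few minutes
# ahead with a least-squares trend, then size replicas to the forecast.
from statistics import linear_regression  # Python 3.10+

# Hypothetical telemetry: average CPU utilization (%) sampled each minute.
cpu_samples = [42, 45, 44, 48, 51, 55, 58, 61, 63, 67]

def forecast_cpu(samples: list[float], minutes_ahead: int) -> float:
    """Fit a straight line to the recent samples and extrapolate."""
    slope, intercept = linear_regression(range(len(samples)), samples)
    return slope * (len(samples) - 1 + minutes_ahead) + intercept

def desired_replicas(current: int, predicted_cpu: float,
                     target_cpu: float = 60.0, max_replicas: int = 20) -> int:
    # Same shape as the classic utilization-ratio scaling rule, but fed
    # the *predicted* utilization so scaling happens before the surge.
    return max(1, min(max_replicas, round(current * predicted_cpu / target_cpu)))

predicted = forecast_cpu(cpu_samples, minutes_ahead=5)
replicas = desired_replicas(current=4, predicted_cpu=predicted)
print(f"predicted CPU in 5 min: {predicted:.1f}% -> scale to {replicas} replicas")
# A real controller would now call a hypothetical apply_replicas(replicas).

Production systems would use far richer models than a straight line, but the control loop — forecast, compare to target, act early — has the same shape.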
Changing Landscapes: The New Cloud Native

The landscape of cloud native and platform engineering is rapidly evolving, particularly with leading cloud service providers like Azure and Google Cloud. This evolution is largely driven by the growing demand for more scalable, reliable, and efficient IT infrastructure, enabling businesses to innovate faster and respond to market changes more effectively.

On Azure, Microsoft has been investing heavily in Azure Kubernetes Service (AKS) and serverless offerings, aiming to provide more flexibility and ease of management for cloud-native applications. Azure's emphasis on DevOps, through tools like Azure DevOps and Azure Pipelines, reflects a strong commitment to streamlining the development lifecycle and enhancing collaboration between development and operations teams. Azure's focus on hybrid cloud environments, with Azure Arc, allows businesses to extend Azure services and management to any infrastructure, fostering greater agility and consistency across different environments.

Google Cloud, for its part, has been leveraging its expertise in containerization and data analytics to enhance its cloud-native offerings. Google Kubernetes Engine (GKE) stands out as a robust, managed environment for deploying, managing, and scaling containerized applications on Google's infrastructure. Google Cloud's approach to serverless computing, with products like Cloud Run and Cloud Functions, lets developers build and deploy applications without worrying about the underlying infrastructure. Google's commitment to open-source technologies and its leading-edge work in AI and ML integrate seamlessly into its cloud-native services, providing businesses with powerful tools to drive innovation.

Both Azure and Google Cloud are shaping the future of cloud native and platform engineering by continuously adapting to technological advancements and changing market needs. Their focus on Kubernetes, serverless computing, and seamless integration between development and operations underlines a broader industry trend toward more agile, efficient, and scalable cloud environments.

Implications for the Future of Cloud Computing

AI is set to revolutionize cloud computing, making cloud-native technologies more self-sufficient and efficient. Advanced AI will oversee cloud operations, enhancing performance and cost effectiveness while enabling services to self-correct. Yet integrating AI presents ethical challenges, especially concerning data privacy and decision-making bias, and poses risks that require solid safeguards. As AI reshapes cloud services, sustainability will be key; future AI must be energy efficient and environmentally friendly to ensure responsible growth.

Kickstarting Your Platform Engineering and AI Journey

To effectively adopt AI, organizations must nurture a culture oriented toward learning and prepare by auditing their IT setup, pinpointing AI opportunities, and establishing data management policies. Further:

- Upskilling in areas such as machine learning, analytics, and cloud architecture is crucial.
- Launching AI integration through targeted pilot projects can showcase the potential and inform broader strategies.
- Collaborating with cross-functional teams and selecting cloud providers with compatible AI tools can streamline the process.
- Balancing innovation with consistent operations is essential for embedding AI into cloud infrastructures.

Conclusion

Platform engineering with AI integration is revolutionizing cloud-native environments, enhancing their scalability, reliability, and efficiency. By enabling predictive analytics and automated optimization, AI ensures cloud resources are effectively utilized and services remain resilient. Adopting AI is crucial for future-proofing cloud applications, and it necessitates foundational adjustments and a commitment to upskilling. The advantages include staying competitive and quickly adapting to market shifts. As AI evolves, it will further automate and refine cloud services, making continued investment in AI a strategic choice for forward-looking organizations.

This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.

The cloud-native application protection platform (CNAPP) model is designed to secure applications that leverage cloud-native technologies. Applications outside its scope are typically legacy systems that were not designed to operate within modern cloud infrastructures. In practice, then, CNAPP covers the security of containerized applications, serverless functions, and microservices architectures, possibly running across different cloud environments.

Figure 1. CNAPP capabilities across different application areas

A good way to understand the goal of the security practices in CNAPPs is to look at the threat model, i.e., the attack scenarios against which applications are protected. Understanding these scenarios helps practitioners grasp the aim of the features in CNAPP suites. Note also that the threat model might vary according to the industry, the usage context of the application, etc. In general, the threat model follows from the dynamic and distributed nature of cloud-native architectures. Such applications present a large attack surface and face an intricate threat landscape, mainly because of the complexity of their execution environment. In short, the model typically accounts for unauthorized access, data breaches due to misconfigurations, inadequate identity and access management policies, or simply vulnerabilities in container images or third-party libraries. Also, due to the ephemeral and scalable characteristics of cloud-native applications, CNAPPs require real-time mechanisms to ensure consistent policy enforcement and threat detection. This is to protect applications from automated attacks and advanced persistent threats. Some common threats and occurrences are shown in Figure 2:

Figure 2. Typical threats against cloud-native applications

Overall, the scope of the CNAPP model is quite broad, and vendors in this space must cover a significant number of security domains to meet the needs of the entire model. Let's review the specific challenges that CNAPP vendors face and the opportunities to improve the breadth of the model to address an extended set of threats.

Challenges and Opportunities When Evolving the CNAPP Model

To keep up with the evolving threat landscape and the complexity of modern organizations, the evolution of the CNAPP model yields both significant challenges and opportunities. Both are briefly summarized in Table 1:

Table 1. Challenges and opportunities with evolving the CNAPP model

| Challenges | Opportunities |
|------------|---------------|
| Integration complexity – connect tools, services, etc. | Automation – AI and orchestration |
| Technological changes – tools must continually evolve | Proactive security – predictive and prescriptive measures |
| Skill gaps – tools must be friendly and efficient | DevSecOps – integration with DevOps security practices |
| Performance – security has to scale with complexity | Observability – extend visibility to the SDLC's left and right |
| Compliance – region-dependent, evolving landscape | Edge security – control security beyond the cloud |

Challenges

The integration challenges that vendors face due to the scope of the CNAPP model are compounded by rapid technological change: cloud technologies are continuously evolving, and vendors need to design tools that are user friendly.
Managing the complexity of cloud technology via simple, yet powerful, user interfaces allows organizations to cope with the notorious skill gaps in teams that result from rapid technology evolution. An important aspect of the security measures delivered by CNAPPs is that they must be efficient enough not to impact the performance of the applications. In particular, when applications scale, security measures should continue to perform gracefully. This is a general struggle with security: it should be as transparent as possible, yet responsive and effective. An often industry-rooted challenge is regulatory compliance. The global expansion of data protection regulations requires organizations to comply with evolving regulatory frameworks. For vendors, this means maintaining a wide perspective on compliance and incorporating these requirements into their tools' capabilities.

Opportunities

In parallel, there are significant opportunities for CNAPPs to evolve to address the challenges. Taming complexity is an important factor to tackle head on to expand the scope of the CNAPP model. For that purpose, automation is a key enabler. For example, there is a significant opportunity to leverage artificial intelligence (AI) to accelerate routine tasks, such as policy enforcement and anomaly detection. The implementation of AI for operational automation is particularly important to address the previously mentioned scalability challenges. This capability enhances analytics and threat intelligence, particularly to offer predictive and prescriptive security capabilities (e.g., to advise users on the necessary settings in a given scenario). With such new AI-enabled capabilities, organizations can effectively address the skill gap by offering guided remediation, automated policy recommendations, and comprehensive visibility. The sketch below shows the kind of simple anomaly detection this automation builds on.
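As a toy illustration of the anomaly detection mentioned above, the following sketch flags unusual spikes in per-minute request counts using a z-score over a sliding window. Real CNAPP detectors are far more sophisticated; the data, window size, and threshold here are hypothetical.

Python

# Toy anomaly detector: flag request-rate spikes that sit more than
# three standard deviations above the recent sliding-window mean.
from statistics import mean, stdev

def find_spikes(counts: list[int], window: int = 10, threshold: float = 3.0):
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (counts[i] - mu) / sigma > threshold:
            spikes.append((i, counts[i]))  # (minute index, request count)
    return spikes

# Hypothetical per-minute request counts with one burst at the end.
requests_per_minute = [120, 118, 125, 122, 119, 121, 124, 120, 123, 118,
                       122, 119, 121, 950]
for minute, count in find_spikes(requests_per_minute):
    print(f"possible automated attack at minute {minute}: {count} requests")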
An interesting opportunity closer to the code stage is integrating DevSecOps practices. While a CNAPP aims to protect cloud-native applications across their lifecycle, DevSecOps, in contrast, embeds security practices that liaise between development, operations, and security teams. Enabling DevSecOps in the context of the CNAPP model covers areas such as integration with source code management tools and CI/CD pipelines. This integration helps detect vulnerabilities early and ensures that security is baked into the product from the start. Also, providing developers with real-time feedback on the security implications of their activities helps educate them on security best practices and thus reduces the organization's exposure to threats. The main goal here is to "shift left" the approach, improving observability and helping reduce the cost and complexity of fixing security issues later in the development cycle.

A last and rather forward-thinking opportunity is to evolve the model so that it extends to securing an application at "the edge," i.e., where it is executed and accessed. A common use case is access to a web application from a user device via a browser. The current CNAPP model does not explicitly address security here, and this opportunity should be seen as an extension of the operation stage to further "shield right" the security model.

Technology Trends That Can Reshape CNAPP

The shift-left and shield-right opportunities (and the related challenges) reviewed in the last section can be addressed by the technologies exemplified here.

Firstly, the enablement of DevSecOps practices is an opportunity to further shift the security model to the left of the SDLC, moving security earlier in the development process. Current CNAPP practices already include looking at source code and container vulnerabilities. More often than not, visibility over these development artifacts starts once they have been pushed from the development laptop to a cloud-based repository. By using a secure implementation of cloud development environments (CDEs), from a CNAPP perspective, observability across performance and security can start in the development environment itself, rather than in online DevOps tool suites such as CI/CD and code repositories.

Secondly, enforcing security for web applications at the edge is an innovative concept when looked at from the perspective of the CNAPP model. This can be realized by integrating an enterprise browser into the model. For example:

- Security measures that aim to protect against insider threats can be implemented on the client side with mechanisms very similar to how mobile applications are protected against tampering.
- Measures to protect web apps against data exfiltration and prevent the display of sensitive information can be activated by injecting a security policy into the browser.
- Automation of security steps allows organizations to extend their control over web apps (e.g., using robotic process automation).

Figure 3. A control component (left) fetches policies to secure app access and browsing (right)

Figure 4 shows the impact of a secure CDE implementation and an enterprise browser on CNAPP security practices. The use of both technologies enables security to become a boon for productivity, as automation plays the dual role of simplifying user-facing security processes while increasing productivity.

Figure 4. CNAPP model and DevOps SDLC augmented with secure cloud development and browsing

Conclusion

The CNAPP model and the tools that implement it should evolve their coverage in order to add resilience to new threats. The technologies discussed in this article are examples of how coverage can be improved to the left and further to the right of the SDLC. The goal of increasing coverage is to give organizations more control over how they implement and deliver security in cloud-native applications across business scenarios.

This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.

In today's cloud computing landscape, businesses are embracing the dynamic world of hybrid and multi-cloud environments, seamlessly integrating infrastructure and services from multiple cloud vendors. This shift away from a single provider is driven by the need for greater flexibility, redundancy, and the freedom to leverage the best features from each provider to create tailored solutions. Furthermore, the rise of cloud-native technologies is reshaping how we interact with the cloud. Containerization, serverless, artificial intelligence (AI), and edge computing are pushing the boundaries of what's possible, unlocking a new era of innovation and efficiency. But with these newfound solutions comes a new responsibility: cost optimization. The complexities of hybrid and multi-cloud environments, coupled with the dynamic nature of cloud-native deployments, require a strategic approach to managing cloud costs. This article dives into the intricacies of cloud cost management in this new era, exploring strategies, best practices, and frameworks to get the most out of your cloud investments.

The Role of Containers in Vendor Lock-In

Vendor lock-in occurs when a company becomes overly reliant on a specific cloud provider's infrastructure, services, and tools. This can have a great impact on both agility and cost. Switching to a different cloud provider can be a complex and expensive process, especially as apps become tightly coupled with the vendor's proprietary offerings. Additionally, vendor lock-in can keep you from negotiating better pricing options or accessing the latest features offered by other cloud providers.

Containers are recognized for their portability and their ability to package applications for seamless deployment across different cloud environments by encapsulating an application's dependencies within a standardized container image (as seen in Figure 1). This means that you can, in theory, move your containerized application from one cloud provider to another without significant code modifications. This flexibility affords greater cost control, as you're able to leverage the competitive nature of the cloud landscape to negotiate the best deals for your business.

Figure 1. Containerization explained

With all that being said, complete freedom from vendor lock-in remains a myth, even with containers. While application code may be portable, configuration management tools, logging services, and other aspects of your infrastructure might still be tied to a specific vendor's offerings. An approach that leverages open-source solutions whenever possible can maximize the portability benefits of containers and minimize the risk of vendor lock-in.

The Importance of Cloud Cost Management

With evolving digital technologies, where startups and enterprises alike depend on cloud services for their daily operations, efficient cloud cost management is essential. To maximize the value of your cloud investment, understanding and controlling cloud costs not only prevents budget overruns but also ensures that resources are used optimally. The first step in effective cloud cost management is understanding your cloud bill. Most cloud providers now offer detailed billing reports that break down your spending by service, resource type, and region. Familiarize yourself with these reports and identify the primary cost drivers for your environment. Common cost factors include:

- Transfer rates
- Storage needs
- Compute cycles consumed by your services

A short sketch of how such a breakdown might be pulled programmatically follows below.
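For AWS specifically, a cost breakdown like the one described above can be pulled with the Cost Explorer API. This is a minimal sketch using boto3; the date range is a placeholder, and credentials are assumed to be configured already.

Python

# Minimal sketch: last month's AWS spend grouped by service, via the
# Cost Explorer API. Assumes AWS credentials are already configured.
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print the top cost drivers, largest first.
groups = response["ResultsByTime"][0]["Groups"]
groups.sort(key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]), reverse=True)
for g in groups[:10]:
    amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{g['Keys'][0]:<40} ${amount:,.2f}")

Other providers expose equivalent billing-export or cost-query APIs; the point is that the "know your bill" step can be automated rather than read off a console once a month.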
Once you have an understanding of these drivers, the next step is to identify and eliminate cloud waste. Wasteful cloud spending is often attributed to unused or underutilized resources, which can easily accumulate if you leave them running overnight or on weekends, and this can significantly inflate your cloud bill. You can eliminate this waste by leveraging tools like autoscaling to automatically adjust resources based on demand. Additionally, overprovisioning (allocating more resources than necessary) can be another major cost driver. Practices such as rightsizing, where you adjust the scale of your cloud resources to match demand, can lead to significant cost savings. Continuous monitoring and analysis of resource utilization are necessary to ensure that each service is fitted to its needs, neither over- nor under-provisioned; the sketch below shows one simple way to spot rightsizing candidates.
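As one way to find rightsizing candidates on AWS, the sketch below checks each running EC2 instance's average CPU over the past two weeks via CloudWatch and flags the quiet ones. The 10% threshold and two-week window are arbitrary illustrative choices, and low CPU alone is only a hint, not proof, that an instance is oversized.

Python

# Sketch: flag running EC2 instances whose two-week average CPU is
# below 10% - likely candidates for rightsizing or shutdown.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        stats = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
            StartTime=now - timedelta(days=14),
            EndTime=now,
            Period=86400,           # one datapoint per day
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        if points:
            avg = sum(p["Average"] for p in points) / len(points)
            if avg < 10.0:
                print(f"{inst['InstanceId']} ({inst['InstanceType']}): "
                      f"avg CPU {avg:.1f}% - rightsizing candidate")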
Finally, most cloud providers now offer cost-saving programs that can help optimize your spending. These may include reserved instances, where you get discounts for committing to a specific resource for a fixed period, or Spot instances, which let you use unused capacity at a significantly lower price. Taking advantage of such programs requires a deep understanding of your current and projected usage to select the most beneficial option. Effective cloud cost management is not just about cutting costs but also about optimizing cloud usage in a way that aligns with organizational goals and strategies.

Selecting the Best Cloud Options for Your Organization

As a one-size-fits-all approach doesn't really exist when working with the cloud, choosing the best options for your specific needs is paramount. Below are some strategies that can help.

Assessing Organizational Needs

A thorough assessment of your organizational needs involves analyzing your workload characteristics, scalability, and performance requirements. For example, mission-critical applications with high resource demands might need different cloud configurations than static web pages. You can evaluate your current usage patterns and future project needs using machine learning and AI. Security and compliance needs are equally important considerations. Certain industries face regulatory requirements that dictate data-handling and processing protocols. Identifying a cloud provider that meets these security and compliance standards is non-negotiable for protecting sensitive information. This initial assessment will help you identify which cloud services are suitable for your business needs and implement a proactive approach to cloud cost optimization.

Evaluating Cloud Providers

Once you have a clear understanding, the next step is to compare the offerings of different cloud providers. Evaluate their services based on key metrics, such as performance, cost efficiency, and the quality of customer support. Take advantage of free trials and demos to test drive their services and better assess their suitability. The final decision often comes down to one question: adopt a single- or multi-cloud strategy? Each approach offers specific advantages and disadvantages, so the optimal choice depends on your specific needs and priorities. Table 1 compares the key features of single-cloud and multi-cloud strategies to help you make an informed decision.

Table 1. Single- vs. multi-cloud approaches

| Feature | Single-Cloud | Multi-Cloud |
|---------|--------------|-------------|
| Simplicity | Easier to manage; single point of contact | More complex to manage; requires expertise in multiple platforms |
| Cost | Potentially lower costs through volume discounts | May offer lower costs overall by leveraging the best pricing models from different providers |
| Vendor lock-in | High; limited flexibility to switch providers | Low; greater freedom to choose and switch providers |
| Performance | Consistent performance if the provider is chosen well | May require optimization for performance across different cloud environments |
| Security | Easier to implement and maintain consistent security policies | Requires stronger security governance to manage data across multiple environments |
| Compliance | Easier to comply with regulations if provider offerings align with needs | May require additional effort to ensure compliance across different providers |
| Scalability | Scalable within the chosen provider's ecosystem | Offers greater horizontal scaling potential by leveraging resources from multiple providers |
| Innovation | Limited to innovations offered by the chosen provider | Access to a wider range of innovations and features from multiple providers |

Modernizing Cloud Tools and Architectures

Having selected the right cloud options and established a solid foundation for cloud cost management, you need to ensure your cloud environment is optimized for efficiency and cost control. This requires a proactive approach that continuously evaluates and modernizes your cloud tools and architectures. Here is a practical framework for cloud modernization and continuous optimization:

1. Assessment – Analyze your current cloud usage using cost management platforms, and identify inefficiencies and opportunities for cost reduction. Pinpoint idle or underutilized resources that can be scaled down or eliminated.
2. Planning – Armed with these insights, define clear goals and objectives for your efforts. These goals might include reducing overall cloud costs by a specific percentage, optimizing resource utilization, or improving scalability. Once you establish your goals, choose the optimization strategies that will help you achieve them.
3. Implementation – Now it is time to put your plan into action. This can mean implementing cost-saving measures like autoscaling, which automatically adjusts your resources based on demand. Cloud cost management platforms can also play a crucial role by providing real-time visibility and automated optimization recommendations.
4. Monitoring and optimization – Cloud modernization is an ongoing process that requires continuous monitoring and improvement. Regularly review your performance metrics, cloud costs, and resource utilization to adapt your strategies as needed.

Figure 2. A framework for modernizing cloud environments

By following this framework, you can systematically improve your cloud environment and make sure it remains cost effective.

Conclusion

Cloud technologies offer many benefits for businesses of all sizes. However, without a strategic approach to cost management, these benefits can be overshadowed by unexpected expenses. By following the best practices in this article, from understanding your cloud requirements and selecting the best cloud options to continuously optimizing your tools and architectures, you can keep your cloud journey under financial control.
Looking ahead, the future of cloud computing looks exciting, as serverless, AI, and edge computing promise to unlock even greater agility, scalability, and efficiency. Staying informed about these advancements, new pricing models, and emerging tools will be essential to maximizing the value of your cloud investment. Cost optimization is not a one-time endeavor but an ongoing process that requires continuous monitoring, adaptation, and a commitment to extracting the most value from your cloud resources.

This is an excerpt from DZone's 2024 Trend Report, Cloud Native: Championing Cloud Development Across the SDLC.
A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model's ability to autonomously learn and adapt in production as new data is added. In practice, data scientists often work with Jupyter notebooks for development and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

- Amazon SageMaker: A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
- AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and how to train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use Case

For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes. This post doesn't go into the details of the model but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution Overview

The following diagram summarizes the approach for the retraining pipeline. The workflow contains the following elements:

- AWS Glue crawler: You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
- AWS Glue triggers: Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
- AWS Glue job: An AWS Glue job encapsulates a script that connects to source data, processes it, and writes it to a target location.
- AWS Glue workflow: An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components.

When the workflow is triggered, it follows the chain of operations described in the preceding image. The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validation sets in AWS Glue jobs. The training step runs in a Python shell within an AWS Glue job, which starts a training job in Amazon SageMaker based on a set of hyperparameters. When the training job is complete, an endpoint is created and hosted on Amazon SageMaker. This AWS Glue job takes a few minutes to complete because it waits until the endpoint reaches InService status; the sketch below shows roughly what that step looks like.
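The training-and-deploy step could look roughly like the boto3 sketch below. The job name, role ARN, S3 paths, instance type, and hyperparameters are placeholders for illustration, not values from this solution's actual TrainingJob.py; only the ECR image URI is the one used later in this walkthrough.

Python

# Rough sketch of what the Glue Python shell training job does: start a
# SageMaker training job, then wait for it (and later the endpoint) to
# be ready. All names, ARNs, and S3 paths below are placeholders.
import boto3

sm = boto3.client("sagemaker")
job_name = "blazingtext-dbpedia-demo"  # hypothetical job name

sm.create_training_job(
    TrainingJobName=job_name,
    AlgorithmSpecification={
        "TrainingImage": "433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/sagemaker-execution-role",  # placeholder
    HyperParameters={"mode": "supervised", "epochs": "10"},
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/dbpedia/train/",  # placeholder
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/dbpedia/output/"},
    ResourceConfig={"InstanceType": "ml.c5.xlarge", "InstanceCount": 1,
                    "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)

# Block until training finishes; a second waiter would later confirm
# the deployed endpoint is InService before the Glue job exits.
sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=job_name)
# ... create model, endpoint config, and endpoint, then:
# sm.get_waiter("endpoint_in_service").wait(EndpointName=job_name)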
At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of your application. You can also use the queue to trigger an action, such as sending emails to data scientists to signal the completion of training, adding records to management or log tables, and more.

Setting up the Environment

To set up the environment, complete the following steps:

1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
3. Download the following code into your local directory.

Organization of Code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
  --glue_scripts
    --DataExtractionJob.py
    --DataProcessingJob.py
    --MessagingQueueJob.py
    --TrainingJob.py
  --base_resources.template
  --deploy.sh
  --glue_resources.template

The code directory is divided into three parts:

- AWS CloudFormation templates: The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to the ARNs later in the workflow.
- AWS Glue scripts: The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL scripts as well as the model training and deployment scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
- Bash script: A wrapper script, deploy.sh, is the entry point for running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as part of the workflow using crawlers.

Implementing the Solution

Complete the following steps:

1. Go to the deploy.sh file and replace algorithm_image name with <ecr_path> based on your Region. The following code example is a path for Region us-west-2:

Shell
algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

2. Enter the following code in your terminal:

Shell
sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.

3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
4. On the AWS Glue console, manually start the pipeline. In a production scenario, you can trigger this through a UI or automate it by scheduling the workflow to run at a prescribed time (a programmatic trigger is sketched below). The workflow provides a visual of the chain of operations and the dependencies between the jobs.
5. To begin the workflow, in the Workflow section, select DevMLWorkflow. From the Actions drop-down menu, choose Run.
6. View the progress of your workflow on the History tab and select the latest RUN ID.
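Instead of the console, the same workflow could be started and watched from code. This is a hypothetical sketch using boto3's Glue client with the DevMLWorkflow name from the walkthrough above.

Python

# Hypothetical programmatic alternative to the console steps: start the
# Glue workflow and poll its run status until it finishes.
import time
import boto3

glue = boto3.client("glue")
workflow_name = "DevMLWorkflow"  # name used in this walkthrough

run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
print("Started workflow run:", run_id)

while True:
    run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
    status = run["Status"]  # e.g., RUNNING, COMPLETED, STOPPED, ERROR
    print("Status:", status)
    if status != "RUNNING":
        break
    time.sleep(60)  # the full run takes roughly 30 minutes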
The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion. After the workflow is successful, open the Amazon SageMaker console. Under Inference, choose Endpoints. The following screenshot shows that the endpoint deployed by the workflow is ready. Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.

Cleaning Up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

Python
def delete_resources(self):
    endpoint_name = self.endpoint

    # The endpoint, its configuration, and the model all share the
    # endpoint name in this pipeline; delete each in turn.
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception:
        print("Model endpoint deletion failed")

    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception:
        print("Endpoint config deletion failed")

    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception:
        print("Model deletion failed")

Conclusion

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.
Abhishek Gupta, Principal Developer Advocate, AWS
Daniel Oh, Senior Principal Developer Advocate, Red Hat
Pratik Prakash, Principal Solution Architect, Capital One