When it comes to infrastructure provisioning, including AWS EKS clusters, Terraform is the first tool that comes to mind, and learning Terraform is much easier than setting up the infrastructure manually. That said, would you rather use the traditional approach to set up the infrastructure, or would you prefer to use Terraform? More specifically, would you rather create an EKS cluster using Terraform and have a Terraform Kubernetes deployment in place, or use the manual method and leave room for human error?

As you may already know, Terraform is an open-source Infrastructure as Code (IaC) tool that lets you manage hundreds of cloud services through a uniform CLI and uses declarative configuration files to codify cloud APIs. In this article, we won't go into all the details of Terraform. Instead, we will focus on Terraform Kubernetes deployment: the steps to provision an EKS cluster using Terraform, and how this approach saves time and reduces the human errors that can occur with a traditional or manual approach to application deployment.

Prerequisites for Terraform Kubernetes Deployment

Before we provision an EKS cluster using Terraform, there are a few tools you need to have on hand. You must have an AWS account, and Terraform must be installed on your host machine, since we are going to create the EKS cluster using the Terraform CLI on the AWS cloud. Let's take a look at the prerequisites for this setup and how to install them.

AWS Account: If you don't have an AWS account, you can register for a Free Tier account and use it for test purposes.

IAM Admin User: You must have an IAM user with the AmazonEKSClusterPolicy and AdministratorAccess permissions, along with its access key and secret key. We will use this IAM user's credentials to provision the EKS cluster; the keys you create for this user are used to connect to the AWS account from the CLI (command-line interface). When working on production clusters, grant only the required access and avoid admin privileges.

EC2 Instance: We will use an Ubuntu 18.04 EC2 instance as the host machine for executing our Terraform code. You may use another machine, but you will need to verify which commands are compatible with it when installing the required packages. You can also use your personal computer instead; this step is optional.

Access to the Host Machine: Connect to the EC2 instance and install the unzip package:

ssh -i "<key-name.pem>" ubuntu@<public-ip-of-the-ec2-instance>
sudo apt-get update -y
sudo apt-get install unzip -y

If you are using your personal computer, you may not need to connect to an EC2 instance; however, the installation commands may differ.

Terraform: To create an EKS cluster using Terraform, you need Terraform on your host machine. Use the following commands to install Terraform on an Ubuntu 18.04 EC2 machine; visit HashiCorp's official website for installation instructions for other platforms:

sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install terraform
Verify the installation:

terraform --version

AWS CLI: There is not much to do with the aws-cli itself, but we will use it to check the details of the IAM user whose credentials are configured in the terminal. To install it, use the commands below:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

Kubectl: We will use the kubectl command against the Kubernetes cluster to view the resources in the EKS cluster we are going to create. Install kubectl on the Ubuntu 18.04 EC2 machine using the commands below:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client

DOT: This is completely optional. The terraform graph command generates a visual representation of a configuration or execution plan, and its output is in DOT format, which can easily be converted into an image using the dot tool provided by Graphviz. To install it, execute:

sudo apt install graphviz

Export Your AWS Access and Secret Keys for the Current Session: If the session expires, you will need to export the keys again in the terminal. (There are other ways to supply credentials that allow aws-cli to interact with AWS.)

export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<YOUR_AWS_DEFAULT_REGION>

Here, replace <YOUR_AWS_ACCESS_KEY_ID> with your access key, <YOUR_AWS_SECRET_ACCESS_KEY> with your secret key, and <YOUR_AWS_DEFAULT_REGION> with the default region for your aws-cli.

Check the Details of the IAM User Whose Credentials Are Being Used: This displays the details of the user whose keys you used to configure the CLI in the step above:

aws sts get-caller-identity

Architecture

The architecture looks as follows. A VPC will be created with three public subnets and three private subnets spread across three availability zones. Traffic from the private subnets routes through a NAT gateway, and traffic from the public subnets routes through an internet gateway. Kubernetes cluster nodes are created as part of auto-scaling groups and reside in the private subnets. The public subnets can be used to create bastion servers for connecting to the private nodes. You can change the VPC CIDR in the Terraform configuration files if you wish; if you are just getting started, though, we recommend following the article without making unfamiliar changes to the configuration, to avoid errors.

Highlights

This article will help you provision an EKS cluster using Terraform and deploy a sample Node.js application. When creating the EKS cluster, other AWS resources such as a VPC, subnets, a NAT gateway, an internet gateway, and security groups will also be created in your AWS account. The article is divided into two parts: the creation of an EKS cluster using Terraform, and the deployment of a sample Node.js application in that cluster using Terraform. First, we will create the EKS cluster; after that, we will deploy the sample application on it. Throughout, we use Terraform modules to create the VPC and its components, along with the EKS cluster.
Here is a list of some of the Terraform elements we'll use:

Module: A module is a group of .tf and/or .tf.json files kept together in a directory. Modules are containers for a set of resources that are used together, and they are Terraform's main mechanism for packaging and reusing resource configurations. EKS: We will use the terraform-aws-modules/eks/aws module to create our EKS cluster and its components. VPC: We will use the terraform-aws-modules/vpc/aws module to create our VPC and its components.

Data: A data source is accessed via a data block and allows Terraform to use information defined outside of Terraform, defined within another Terraform configuration, or computed by functions.

Provider: A provider is a Terraform plugin that allows Terraform to communicate with cloud providers, SaaS providers, and other APIs. Terraform configurations must declare the providers they require so that Terraform can install and use them. Kubernetes: The Kubernetes provider is used to manage Kubernetes resources. Before it can be used, it must be configured with appropriate credentials; in this example, we pass host, token, and cluster_ca_certificate in the provider block. AWS: The AWS provider is used to manage AWS resources. It must also be configured with credentials; in this case, we export the AWS access and secret keys in the terminal.

Resource: Resources are the most important element of the Terraform language. Each resource block describes one or more infrastructure objects. A resource block declares a resource of a specific type with a specific local name; together, the type and name identify the resource and must be unique within a module. kubernetes_namespace: used to create namespaces in our EKS cluster. kubernetes_deployment: used to create deployments in our EKS cluster. kubernetes_service: used to create services in our EKS cluster. aws_security_group: used to create the security groups for the instances that form part of our EKS cluster. random_string: generates a random permutation of alphanumeric (and optionally special) characters; we use it as a suffix in our EKS cluster name.

Output: Output values expose information about your infrastructure to other Terraform configurations and make it available on the command line; they are the Terraform equivalent of return values in programming languages. Each output value a module exports must be declared with an output block.

required_version: The required_version setting accepts a version constraint string that determines which Terraform versions are compatible with your configuration. It affects only the Terraform CLI version.

Variable: Terraform variables, like variables in any programming language, let you tweak the behavior of Terraform modules without modifying the module's source code. This makes your modules composable and reusable across multiple Terraform configurations. Values for variables can be passed in when calling a module or running Terraform.

Locals: A local value assigns a name to an expression, allowing you to use it multiple times within a module without repeating it.
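To make these definitions concrete, here is a minimal, hypothetical sketch, separate from the cluster code later in this article, showing a variable, a local value, a data source, a module call, and an output working together (the names and values are illustrative only):

variable "region" {
  description = "AWS region to deploy into"
  default     = "us-east-1"
}

locals {
  # A local value names an expression so it can be reused without repetition.
  name_prefix = "demo-${var.region}"
}

# A data source pulls in information defined outside this configuration.
data "aws_availability_zones" "available" {}

# A module call packages a set of resources for reuse.
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name = "${local.name_prefix}-vpc"
  cidr = "10.0.0.0/16"
  azs  = data.aws_availability_zones.available.names
}

# An output exposes a value on the CLI and to other configurations.
output "vpc_id" {
  value = module.vpc.vpc_id
}

The real configuration in the following sections uses the same building blocks, just with more inputs.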
Before we go ahead and create an EKS cluster using Terraform, let's take a look at why Terraform is a good choice.

Why Provision and Deploy With Terraform?

It's natural to wonder why you would provision an EKS cluster using Terraform when you can achieve the same thing with the AWS Console, the AWS CLI, or other tools. Here are a few of the reasons:

Unified workflow: If you're already using Terraform to deploy infrastructure to AWS, your EKS cluster can be integrated into that workflow, and Terraform can also be used to deploy applications into the cluster. Full lifecycle management: Terraform not only creates resources, it also updates and deletes tracked resources without you having to inspect the API to find them. Graph of relationships: Terraform recognizes dependencies between resources via a relationship graph. If, for example, an AWS Kubernetes cluster requires specific VPC and subnet configurations, Terraform will not attempt to create the cluster if the VPC and subnets fail to be created with the required configuration.

How To Create an EKS Cluster Using Terraform

In this part of the article, we will provision an EKS cluster using Terraform. Along the way, dependent resources such as the VPC, subnets, NAT gateway, internet gateway, and security groups will also be created, and we will deploy an Nginx application with Terraform. Note: you can find all of the relevant code in my GitHub repository.

Before you create the EKS cluster with the following steps, you need to set up and make note of a few things. Clone the GitHub repository into your home directory:

cd ~/
pwd
git clone <repository-url>

After you clone the repo, change your present working directory to ~/DevOps/aws/terraform/terraform-kubernetes-deployment/:

cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/

The Terraform files for creating the EKS cluster live in one folder, and the Terraform files for deploying the sample Node.js application live in another. eks-cluster: this folder contains all of the .tf files required for creating the EKS cluster. nodejs-application: this folder contains the .tf file required for deploying the sample Node.js application.

Now, let's proceed with the creation of the EKS cluster. Verify your present working directory; it should be ~/DevOps/aws/terraform/terraform-kubernetes-deployment/, since you cloned the repository in the previous step. Then change to the eks-cluster folder:

cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/eks-cluster

You should now have all the required files. If you haven't cloned the repo, you are free to create the required .tf files in a new folder instead.

Create your eks-cluster.tf file with the content below. Here, we use the terraform-aws-modules/eks/aws module to create our EKS cluster. Let's look at the most important inputs that you may want to change depending on your requirements. source: informs Terraform of the location of the module's source code. version: it is recommended to pin an explicit, acceptable version to avoid unexpected or unwanted changes. cluster_name: the cluster name, passed in from a local value.
cluster_version: the EKS cluster version. subnets: the list of subnets in which nodes will be created; for this example, nodes are created in the private subnets, whose IDs are passed in from the module that creates them. vpc_id: the VPC in which the EKS cluster will be created; the value is passed from the module in which the VPC is created. instance_type: change this value if you want to create worker nodes with another instance type. asg_desired_capacity: the desired number of nodes in each auto-scaling worker node group.

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "17.24.0"
  cluster_name    = local.cluster_name
  cluster_version = "1.20"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  workers_group_defaults = {
    root_volume_type = "gp2"
  }

  worker_groups = [
    {
      name                          = "worker-group-1"
      instance_type                 = "t2.small"
      additional_userdata           = "echo nothing"
      additional_security_group_ids = [aws_security_group.worker_group_mgmt_one.id]
      asg_desired_capacity          = 2
    },
    {
      name                          = "worker-group-2"
      instance_type                 = "t2.medium"
      additional_userdata           = "echo nothing"
      additional_security_group_ids = [aws_security_group.worker_group_mgmt_two.id]
      asg_desired_capacity          = 1
    },
  ]
}

data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

Create a kubernetes.tf file with the following content. Here, we use the Terraform Kubernetes provider to create Kubernetes objects such as a namespace, a deployment, and a service. We are creating these resources for testing purposes only; in the following steps, we will also deploy a sample application using Terraform. These are the resources we will create in the EKS cluster. kubernetes_namespace: creates a namespace that the other Kubernetes objects will be placed in. kubernetes_deployment: creates a deployment with two replicas of the Nginx pod. kubernetes_service: creates a service of type LoadBalancer that is used to access our Nginx application.

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  token                  = data.aws_eks_cluster_auth.cluster.token
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
}

resource "kubernetes_namespace" "test" {
  metadata {
    name = "nginx"
  }
}

resource "kubernetes_deployment" "test" {
  metadata {
    name      = "nginx"
    namespace = kubernetes_namespace.test.metadata.0.name
  }
  spec {
    replicas = 2
    selector {
      match_labels = {
        app = "MyTestApp"
      }
    }
    template {
      metadata {
        labels = {
          app = "MyTestApp"
        }
      }
      spec {
        container {
          image = "nginx"
          name  = "nginx-container"
          port {
            container_port = 80
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "test" {
  metadata {
    name      = "nginx"
    namespace = kubernetes_namespace.test.metadata.0.name
  }
  spec {
    selector = {
      app = kubernetes_deployment.test.spec.0.template.0.metadata.0.labels.app
    }
    type = "LoadBalancer"
    port {
      port        = 80
      target_port = 80
    }
  }
}

Create an outputs.tf file with the following content. This exports structured data about our resources and makes information about our EKS infrastructure available to other Terraform configurations; output values are the equivalent of return values in other programming languages.
output "cluster_id" {
  description = "EKS cluster ID."
  value       = module.eks.cluster_id
}

output "cluster_endpoint" {
  description = "Endpoint for EKS control plane."
  value       = module.eks.cluster_endpoint
}

output "cluster_security_group_id" {
  description = "Security group ids attached to the cluster control plane."
  value       = module.eks.cluster_security_group_id
}

output "kubectl_config" {
  description = "kubectl config as generated by the module."
  value       = module.eks.kubeconfig
}

output "config_map_aws_auth" {
  description = "A kubernetes configuration to authenticate to this EKS cluster."
  value       = module.eks.config_map_aws_auth
}

output "region" {
  description = "AWS region"
  value       = var.region
}

output "cluster_name" {
  description = "Kubernetes Cluster Name"
  value       = local.cluster_name
}

Create a security-groups.tf file with the following content. Here, we define the security groups that will be used by our worker nodes.

resource "aws_security_group" "worker_group_mgmt_one" {
  name_prefix = "worker_group_mgmt_one"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "10.0.0.0/8",
    ]
  }
}

resource "aws_security_group" "worker_group_mgmt_two" {
  name_prefix = "worker_group_mgmt_two"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "192.168.0.0/16",
    ]
  }
}

resource "aws_security_group" "all_worker_mgmt" {
  name_prefix = "all_worker_management"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
}

Create a versions.tf file with the following content. Here, we define the required providers and their version constraints, along with the required Terraform version.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.20.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "3.1.0"
    }
    local = {
      source  = "hashicorp/local"
      version = "2.1.0"
    }
    null = {
      source  = "hashicorp/null"
      version = "3.1.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.0.1"
    }
  }

  required_version = ">= 0.14"
}

Create a vpc.tf file with the following content. For the purposes of our example, we define a 10.0.0.0/16 CIDR for a new VPC that will be created and used by the EKS cluster, along with public and private subnets. To create the VPC, we use the terraform-aws-modules/vpc/aws module. All of our resources will be created in the us-east-1 region; if you want to use any other region for the EKS cluster, simply change the value assigned to the "region" variable.
variable "region" {
  default     = "us-east-1"
  description = "AWS region"
}

provider "aws" {
  region = var.region
}

data "aws_availability_zones" "available" {}

locals {
  cluster_name = "test-eks-cluster-${random_string.suffix.result}"
}

resource "random_string" "suffix" {
  length  = 8
  special = false
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name                 = "test-vpc"
  cidr                 = "10.0.0.0/16"
  azs                  = data.aws_availability_zones.available.names
  private_subnets      = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets       = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = "1"
  }
}

At this point, you should have the following six files in your current directory, all in one folder: eks-cluster.tf, kubernetes.tf, outputs.tf, security-groups.tf, versions.tf, and vpc.tf. (In the next step, you will deploy a sample Node.js application from a separate folder; if you cloned the repo, this is already taken care of.)

To create the EKS cluster, first execute the following command to initialize the current working directory containing the Terraform configuration (.tf) files:

terraform init

Execute the following command to determine the desired state of all the resources defined in the above .tf files:

terraform plan

Before we go ahead and create our EKS cluster, let's take a look at the terraform graph command. This is an optional step that you can skip if you wish; it simply generates a visual representation of our execution plan. Once you execute the following command, you will get the output in a graph.svg file, which you can open on your personal computer or online with an SVG viewer:

terraform graph -type plan | dot -Tsvg > graph.svg

We can now create the resources using the .tf files in our current working directory. Execute the following command; it takes around ten minutes to complete, so feel free to go have a cup of coffee. Once the command completes successfully, the EKS cluster is ready:

terraform apply

After the terraform apply command completes successfully, you should see output listing the resources that were created. You can now go to the AWS Console and verify the resources created as part of the EKS cluster: the EKS cluster itself, the VPC, the EC2 instances, and the EC2 Auto Scaling groups. You can check other resources in the same way.

Now, if you try to use the kubectl command to connect to the EKS cluster and control it, you will get an error, because you do not yet have the kubeconfig file that is used for authentication:

kubectl get nodes

To resolve this error, retrieve the access credentials, i.e., update the ~/.kube/config file for the cluster, so that kubectl is configured to connect to the EKS cluster. Execute the following command from the directory where all the .tf files used to create the EKS cluster are located.
aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name)

Now you are ready to connect to your EKS cluster and check the nodes in the Kubernetes cluster:

kubectl get nodes

Check the resources such as pods, deployments, and services that are available in the Kubernetes cluster across all namespaces:

kubectl get pods -A
kubectl get deployments -A
kubectl get services -A
kubectl get ns

You should see the namespace, pods, deployment, and service we created with Terraform. We have now created an EKS cluster using Terraform and deployed Nginx with Terraform. Next, let's deploy a sample Node.js application using Terraform in the same EKS cluster. This time, we will keep the Kubernetes object files in a separate folder so the Node.js application can be managed independently, allowing us to deploy and/or destroy it without affecting the EKS cluster.

Deploying a Sample Node.js Application on the EKS Cluster Using Terraform

In this part of the article, we will deploy a sample Node.js application and its dependent resources: a namespace, deployments, and services. We use publicly available Docker images for the sample Node.js application and the MongoDB database. Now, let's go ahead with the deployment.

Change your present working directory to ~/DevOps/aws/terraform/terraform-kubernetes-deployment/nodejs-application/ if you have cloned my repository:

cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/nodejs-application/

You should now have the required file. If you haven't cloned the repo, feel free to create the required .tf file in a new folder; note that we are using a separate folder here.

Create a sample-nodejs-application.tf file with the following content. Here, we use the Terraform Kubernetes provider to deploy the sample Node.js application. We will create a namespace, the Node.js deployment and its service of type LoadBalancer, and a MongoDB deployment along with its service of type ClusterIP.
provider "kubernetes" {
  config_path = "~/.kube/config"
}

resource "kubernetes_namespace" "sample-nodejs" {
  metadata {
    name = "sample-nodejs"
  }
}

resource "kubernetes_deployment" "sample-nodejs" {
  metadata {
    name      = "sample-nodejs"
    namespace = kubernetes_namespace.sample-nodejs.metadata.0.name
  }
  spec {
    replicas = 1
    selector {
      match_labels = {
        app = "sample-nodejs"
      }
    }
    template {
      metadata {
        labels = {
          app = "sample-nodejs"
        }
      }
      spec {
        container {
          image = "learnk8s/knote-js:1.0.0"
          name  = "sample-nodejs-container"
          port {
            container_port = 80
          }
          env {
            name  = "MONGO_URL"
            value = "mongodb://mongo:27017/dev"
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "sample-nodejs" {
  metadata {
    name      = "sample-nodejs"
    namespace = kubernetes_namespace.sample-nodejs.metadata.0.name
  }
  spec {
    selector = {
      app = kubernetes_deployment.sample-nodejs.spec.0.template.0.metadata.0.labels.app
    }
    type = "LoadBalancer"
    port {
      port        = 80
      target_port = 3000
    }
  }
}

resource "kubernetes_deployment" "mongo" {
  metadata {
    name      = "mongo"
    namespace = kubernetes_namespace.sample-nodejs.metadata.0.name
  }
  spec {
    replicas = 1
    selector {
      match_labels = {
        app = "mongo"
      }
    }
    template {
      metadata {
        labels = {
          app = "mongo"
        }
      }
      spec {
        container {
          image = "mongo:3.6.17-xenial"
          name  = "mongo-container"
          port {
            container_port = 27017
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "mongo" {
  metadata {
    name      = "mongo"
    namespace = kubernetes_namespace.sample-nodejs.metadata.0.name
  }
  spec {
    selector = {
      app = kubernetes_deployment.mongo.spec.0.template.0.metadata.0.labels.app
    }
    type = "ClusterIP"
    port {
      port        = 27017
      target_port = 27017
    }
  }
}

At this point, you should have one file in your current directory: sample-nodejs-application.tf.

To initialize the current working directory containing our Terraform configuration (.tf) file, execute the following command:

terraform init

Next, execute the following command to determine the desired state of all the resources defined in the above .tf file:

terraform plan

Before we deploy the sample Node.js application, let's generate a visual representation of our execution plan, the same way we did while creating the EKS cluster. This is an optional step that you can skip if you don't wish to see the graph. Once you execute the following command, you will get the output in a graph.svg file, which you can open on your personal computer or online with an SVG viewer:

terraform graph -type plan | dot -Tsvg > graph.svg

The next step is to deploy the sample Node.js application using the .tf file in our current working directory. Execute the following command; this time, there is no need to go for a cup of tea, since it should not take more than a minute to complete. Once the command finishes successfully, the sample Node.js application is deployed in the EKS cluster using Terraform:

terraform apply

Once the terraform apply command completes successfully, you can verify the objects that have been created using the commands below:

kubectl get pods -A
kubectl get deployments -A
kubectl get services -A

You should see the namespace, pods, deployments, and services that were created for the sample Node.js application. You can now access the application using the DNS name of the LoadBalancer: upload an image and add a note to it. Once you've published it, the note is saved in the database. It's important to note that we did not use any persistent storage for the application, so the data will not be retained if the pod is recreated or deleted. To retain data, try using a PersistentVolume for the database.
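As a rough illustration of that suggestion, and not part of the repository's code, a PersistentVolumeClaim could be declared with the same Terraform Kubernetes provider and mounted into the MongoDB deployment; the claim name, storage size, and mount path below are assumptions:

resource "kubernetes_persistent_volume_claim" "mongo-data" {
  metadata {
    name      = "mongo-data"
    namespace = kubernetes_namespace.sample-nodejs.metadata.0.name
  }
  spec {
    access_modes = ["ReadWriteOnce"]
    resources {
      requests = {
        storage = "5Gi"
      }
    }
  }
}

# Inside the mongo deployment's template spec, the claim would then be
# referenced as a volume and mounted into the container, for example:
#
#   volume {
#     name = "mongo-data"
#     persistent_volume_claim {
#       claim_name = kubernetes_persistent_volume_claim.mongo-data.metadata.0.name
#     }
#   }
#
#   volume_mount {          # inside the container block
#     name       = "mongo-data"
#     mount_path = "/data/db"
#   }

The cluster also needs a StorageClass capable of dynamically provisioning the underlying volume; EKS typically ships with a default gp2 StorageClass backed by EBS.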
We have now deployed a sample Node.js application that is publicly accessible over the LoadBalancer DNS using Terraform. This completes the creation of the EKS cluster and the deployment of the sample Node.js application with Terraform.

Clean Up the Resources We Created

It's always better to delete resources once you're done with your tests, since this saves costs. To clean up the resources and delete the sample Node.js application and the EKS cluster, follow the steps below.

Destroy the sample Node.js application using the following command (if you get the error "Error: Inconsistent dependency lock file," run terraform init and then retry):

terraform destroy

Once the command succeeds, validate that the Node.js application has been destroyed:

kubectl get pods -A
kubectl get deployments -A
kubectl get services -A

You should see that all of the sample Node.js application resources have been deleted. Now, destroy the EKS cluster using the following command from the eks-cluster folder (again, run terraform init first if you get "Error: Inconsistent dependency lock file"):

terraform destroy

Once the command succeeds, you can go to the AWS console to verify that the resources have been deleted. There you have it! We have successfully deleted the EKS cluster as well as the sample Node.js application.

Conclusion of Terraform Kubernetes Deployment

Elastic Kubernetes Service (EKS) is a managed Kubernetes service provided by AWS that takes the complexity and overhead out of provisioning and optimizing a Kubernetes cluster for development teams. An EKS cluster can be created using a variety of methods, but choosing the best approach is critical to improving the infrastructure management lifecycle. Terraform is an Infrastructure as Code (IaC) tool that allows you to create, modify, and version cloud and on-premises resources in a secure and efficient manner. With the Terraform Kubernetes deployment approach, you can automate the creation of the EKS cluster and gain additional control over the entire infrastructure management process through code, and the deployment of Kubernetes objects into the cluster can be managed with the Terraform Kubernetes provider.
In this article, we will explore Azure Observability: the difference between monitoring and observability, its components, and common patterns and anti-patterns. Azure Observability is a powerful set of services provided by Microsoft Azure that allows developers and operations teams to monitor, diagnose, and improve the performance and availability of their applications. With Azure Observability, you can gain deep insights into the performance and usage of your applications and quickly identify and resolve issues.

Azure Monitoring and Azure Observability

Azure Monitoring and Azure Observability are related but different concepts in the Azure ecosystem. Azure Monitor is a service that provides a centralized location for collecting and analyzing log data from Azure resources and other sources. It includes features for collecting data from Azure services such as Azure Virtual Machines, Azure App Services, and Azure Functions, as well as data from other sources such as Windows Event Logs and custom logs. The service also includes Azure Log Analytics, which is used to analyze the log data and create custom queries and alerts.

Azure Observability, on the other hand, is a broader concept that encompasses a set of services provided by Azure for monitoring, diagnosing, and improving the performance and availability of your applications. It includes Azure Monitor but also encompasses other services such as Azure Application Insights, Azure Metrics, and Azure Diagnostics. In short, Azure Monitor provides log data collection and analysis, while Azure Observability is a broader set of services that provides deep insight into the performance and availability of your application. Azure Observability is built on top of Azure Monitor and integrates with other services to provide a comprehensive view of your application's performance.

Key Components of Azure Observability

One of the key components of Azure Observability is Azure Monitor. This service provides a centralized location for collecting and analyzing log data from Azure resources and other sources, including Azure services such as Azure Virtual Machines, Azure App Services, and Azure Functions, as well as other sources such as Windows Event Logs and custom logs. This gives you a comprehensive view of your environment and helps you understand how your resources are performing.

Another important component is Azure Log Analytics. This service is used to analyze the log data collected by Azure Monitor and to create custom queries and alerts. It uses a query language called Kusto, which is optimized for large-scale data analysis. With Azure Log Analytics, you can easily search and filter through large amounts of log data and create custom queries and alerts to notify you of specific events or issues.

Azure Application Insights is another service provided by Azure Observability. It provides deep insights into the performance and usage of your applications and can be used to track requests, exceptions, and performance metrics and to create custom alerts. With Azure Application Insights, you can better understand how your users interact with your applications and identify and resolve issues quickly.

Azure Metrics allows you to collect and analyze performance data from your applications and services, including CPU usage, memory usage, and network traffic. This gives you a real-time view of your resources' performance and allows for proactive monitoring.

Finally, Azure Diagnostics is a service used to diagnose and troubleshoot issues in your applications and services. It includes features for collecting diagnostic data, such as performance counters, traces, and logs, and for analyzing that data to identify the root cause of issues. With Azure Diagnostics, you can quickly identify and resolve issues in your applications and services and ensure that they are performing optimally.
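For readers following the Terraform thread from the earlier sections, here is a minimal, hypothetical sketch of how a few of these components (a Log Analytics workspace, a workspace-based Application Insights resource, and a metric alert wired to an action group) might be declared with the azurerm provider; the resource names, email address, metric, and threshold are placeholders rather than recommendations:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "observability" {
  name     = "rg-observability-demo"
  location = "East US"
}

# Central Log Analytics workspace that Azure Monitor data lands in.
resource "azurerm_log_analytics_workspace" "main" {
  name                = "law-observability-demo"
  location            = azurerm_resource_group.observability.location
  resource_group_name = azurerm_resource_group.observability.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Workspace-based Application Insights for application telemetry.
resource "azurerm_application_insights" "app" {
  name                = "appi-observability-demo"
  location            = azurerm_resource_group.observability.location
  resource_group_name = azurerm_resource_group.observability.name
  workspace_id        = azurerm_log_analytics_workspace.main.id
  application_type    = "web"
}

# Example alert: notify someone when failed requests exceed a threshold.
resource "azurerm_monitor_action_group" "oncall" {
  name                = "observability-oncall"
  resource_group_name = azurerm_resource_group.observability.name
  short_name          = "oncall"

  email_receiver {
    name          = "primary"
    email_address = "oncall@example.com"
  }
}

resource "azurerm_monitor_metric_alert" "failed_requests" {
  name                = "failed-requests-alert"
  resource_group_name = azurerm_resource_group.observability.name
  scopes              = [azurerm_application_insights.app.id]
  description         = "Fires when failed requests exceed the threshold."

  criteria {
    metric_namespace = "microsoft.insights/components"
    metric_name      = "requests/failed"
    aggregation      = "Count"
    operator         = "GreaterThan"
    threshold        = 5
  }

  action {
    action_group_id = azurerm_monitor_action_group.oncall.id
  }
}

The same components can, of course, be created from the Azure portal or CLI; the point is only that they are ordinary resources that can be provisioned and versioned like any other.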
Example: Flow of Observability Data From an Azure Serverless Architecture

An example of using Azure Observability to monitor and improve the performance of an application would involve the following steps:

Enabling Azure Monitor for your application: This involves configuring Azure Monitor to collect log data from your application, such as requests, exceptions, and performance metrics. This data can be collected from Azure services such as Azure App Services, Azure Functions, and Azure Virtual Machines.

Analyzing log data with Azure Log Analytics: Once data is collected, you can use Azure Log Analytics to analyze the log data and create custom queries and alerts. For example, you can create a query to identify all requests that returned a 500 error code and create an alert to notify you when this happens.

Identifying and resolving performance issues: With the data collected and analyzed, you can use Azure Application Insights to identify and resolve performance issues. For example, you can use the performance metrics collected by Azure Monitor to identify slow requests and use Azure Diagnostics to collect additional data, such as traces and logs, to understand the root cause of the issue.

Monitoring your resources: With Azure Metrics, you can monitor your resources' performance and understand the impact on the application. This gives you a real-time view of your resources and allows for proactive monitoring.

Setting up alerts: Azure Monitor, Azure Log Analytics, and Azure Application Insights can all raise alerts, so you can be notified of issues or potential issues and act before they become a problem for your users.

Continuously monitoring and improving: After resolving the initial issues, you should continue to monitor your application using Azure Observability to ensure that it is performing well and to identify any new issues that may arise. This allows you to continuously improve the performance and availability of your application.

Observability Patterns

Azure Observability supports a variety of patterns that can be used to monitor and improve the performance of your application. Some of the key patterns and metrics include:

Logging: Collecting log data such as requests, exceptions, and performance metrics, and analyzing it using Azure Monitor and Azure Log Analytics. This can be used to identify and troubleshoot issues in your application and to create custom queries and alerts that notify you of specific events or issues.

Tracing: Collecting trace data such as request and response headers and analyzing it using Azure Diagnostics. This can be used to understand the flow of requests through your application and to identify and troubleshoot issues with specific requests.

Performance monitoring: Collecting performance metrics such as CPU usage, memory usage, and network traffic and analyzing this data using Azure Metrics.
This can be used to identify and troubleshoot issues with the performance of your application and resources.

Error tracking: Collecting and tracking errors and exceptions and analyzing this data using Azure Application Insights. This can be used to identify and troubleshoot issues with specific requests and to understand how errors are impacting your users.

Availability monitoring: Collecting and monitoring data related to the availability of your application and resources, such as uptime and response times, and analyzing this data using Azure Monitor. This can be used to identify and troubleshoot issues with the availability of your application.

Custom metrics: Collecting custom metrics that are specific to your application and analyzing this data using Azure Monitor and Azure Log Analytics. This can be used to track key performance indicators (KPIs) for your application and to create custom alerts.

All of these patterns and metrics can be used together to gain a comprehensive understanding of the performance and availability of your application and to quickly identify and resolve issues. Additionally, the Azure Observability services are integrated, so you can easily correlate different data sources and get a holistic view of your application's performance.

While Azure Observability provides a powerful set of services for monitoring, diagnosing, and improving the performance of your applications, there are also some common mistakes (anti-patterns) that should be avoided to get the most out of these services. Here are a few examples:

Not collecting enough data: Collecting insufficient data makes it difficult to diagnose and troubleshoot issues and can lead to incomplete or inaccurate analysis. Make sure to collect all the relevant data for your application, including logs, traces, and performance metrics, so that you have a comprehensive view of your environment.

Not analyzing the data: Collecting data is not enough; you need to analyze it and act on it. Not analyzing the data can lead to missed opportunities to improve the performance and availability of your applications. Make sure to use Azure Monitor and Azure Log Analytics to analyze the data, identify patterns and issues, and act on what you find.

Conclusion

In summary, the Azure Observability architecture is a set of services for data collection, data analysis, and troubleshooting. It provides a comprehensive set of capabilities that allow you to monitor, diagnose, and improve the performance and availability of your applications. With Azure Observability, you can gain deep insights into your environment and quickly identify and resolve issues, ensuring that your applications are always available and performing at their best.
A few weeks ago, Kelsey Hightower wrote a tweet and held a live discussion on Twitter about whether or not it's a good idea to run a database on Kubernetes. This happened to be incredibly timely for me, since we at QuestDB are about to launch our own cloud database service (built on top of k8s)!

"Rubbing Kubernetes on Postgres Won't Turn it Into Cloud SQL": You can run databases on Kubernetes because it's fundamentally the same as running a database on a VM. The biggest challenge is understanding that rubbing Kubernetes on Postgres won't turn it into Cloud SQL. https://t.co/zdFobm4ijy

One of the biggest takeaways from this discussion is that there seems to be a misconception about the features that k8s actually provides. While newcomers to k8s may expect it to handle complex application lifecycle features out of the box, it in fact only provides a set of cloud-native primitives (or building blocks) for you to configure and use to deploy your workloads. Any functionality outside of these core building blocks needs to be implemented somehow, in additional orchestration code (usually in the form of an operator) or configuration.

K8s Primitives

When working with databases, the obvious concern is data persistence. Earlier in its history, k8s really shined at orchestrating stateless workloads, but support for stateful workloads was limited. Eventually, primitives like StatefulSets, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs) were developed to help orchestrate stateful workloads on top of the existing platform.

PersistentVolumes are abstractions that allow for the management of raw storage, ranging from local disk to NFS, cloud-specific block storage, and more. These work in concert with PersistentVolumeClaims, which represent requests for a pod to access the storage managed by a PV. A user can bind a PVC to a PV to make an ownership claim on a set of raw disk resources encompassed by the PV. Then you can add that PVC to any pod spec as a volume, effectively allowing you to mount any kind of persistent storage medium to a particular workload. The separation of PV and PVC also allows you to fully control the lifecycle of your underlying block storage, including mounting it to different workloads or freeing it altogether once the claim expires.

StatefulSets manage the lifecycles of pods that require more stability than what exists in other primitives like Deployments and ReplicaSets. By creating a StatefulSet, you can guarantee that when you remove a pod, the storage managed by its mounted PVCs does not get deleted along with it. You can imagine how useful this property is if you're hosting a database! StatefulSets also allow for ordered deployment, scaling, and rolling updates, all of which create more predictability (and thus stability) in our workloads. This, too, goes hand-in-hand with what you want out of your database's infrastructure.
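To make those primitives concrete, here is a minimal, hypothetical sketch of a single-replica Postgres StatefulSet with a volume claim template, written with the Terraform Kubernetes provider used elsewhere in this compilation; the names, image, password handling, and storage size are placeholders, not a production-ready configuration (a matching headless Service named "postgres" is also assumed):

resource "kubernetes_stateful_set" "postgres" {
  metadata {
    name = "postgres"
  }
  spec {
    service_name = "postgres"
    replicas     = 1

    selector {
      match_labels = {
        app = "postgres"
      }
    }

    template {
      metadata {
        labels = {
          app = "postgres"
        }
      }
      spec {
        container {
          name  = "postgres"
          image = "postgres:14"

          # Placeholder only; a real deployment would read this from a Secret.
          env {
            name  = "POSTGRES_PASSWORD"
            value = "change-me"
          }

          port {
            container_port = 5432
          }

          # The PVC created from the claim template below is mounted here,
          # so the data directory survives pod rescheduling.
          volume_mount {
            name       = "data"
            mount_path = "/var/lib/postgresql/data"
          }
        }
      }
    }

    # Each replica gets its own PersistentVolumeClaim, which is not deleted
    # when the pod is removed.
    volume_claim_template {
      metadata {
        name = "data"
      }
      spec {
        access_modes = ["ReadWriteOnce"]
        resources {
          requests = {
            storage = "10Gi"
          }
        }
      }
    }
  }
}

Even with this in place, everything beyond basic persistence (backups, failover, upgrades, tuning) still has to be orchestrated separately, which is exactly the gap discussed next.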
What Else?

While StatefulSets, PVs, and PVCs do quite a bit of work for us, there are still many administration and configuration actions that you need to perform on a production-level database. For example, how do you orchestrate backups and restores? These can get quite complex when dealing with high-traffic databases that include functionality such as WALs. What about clustering and high availability? Or version upgrades? Are these operations zero-downtime? Every database deals with these features in different ways, many of which require precise coordination between components to succeed. Kubernetes alone can't handle this. For example, you can't easily have a StatefulSet automatically set up your average RDBMS in read-replica mode without some additional orchestration.

Not only do you have to implement many of these features yourself, but you also need to deal with the ephemeral nature of Kubernetes workloads. To ensure peak performance, you have to guarantee that the k8s scheduler places your pods on nodes that are already pre-tuned to run your database, with enough free resources to run it properly. If you're dealing with clustering, how are you handling networking to ensure that database nodes are able to connect to each other (ideally in the same cloud region)? This brings me to my next point...

Pets, Not Cattle

Once you start accounting for things like node performance-tuning and networking, along with the requirement to store data persistently in-cluster, all of a sudden your infrastructure starts to grow into a set of carefully groomed pet servers instead of nameless herds of cattle. But one of the main benefits of running your application in k8s is exactly the ability to treat your infrastructure like cattle instead of pets! All of the most common abstractions like Deployments, Ingresses, and Services, along with features like vertical and horizontal autoscaling, are made possible because you can run your workloads on a high-level set of infrastructure components, so you don't have to worry about your physical infrastructure layer. These abstractions allow you to focus more on what you're trying to achieve with your infrastructure instead of how you're going to achieve it.

Then Why Even Bother With K8s?

Despite these rough edges, there are plenty of reasons to want to run your database on k8s. There's no denying that k8s' popularity has increased tremendously over the past few years across both startups and enterprises. The k8s ecosystem is under constant development, so its feature set continues to expand and improve regularly. And its operator model allows end users to programmatically manage their workloads by writing code against the core k8s APIs to automatically perform tasks that would previously have to be done manually. K8s allows for easy GitOps-style management, so you can leverage battle-tested software development practices when managing infrastructure in a reproducible and safe way. While vendor lock-in still exists in the world of k8s, its effect can be minimized to make it easier for you to go multi-cloud (or even swap one provider for another).

So what can we do if we want to take advantage of all the benefits that k8s has to offer while using it to host our database?

What Do You Need to Build an RDS on K8s?

Toward the end of the live chat, someone asked Kelsey, "what do you actually need to build an RDS on k8s?" He jokingly answered with expertise, funding, and customers. While we're certainly on the right track with these at QuestDB, I think this can be better phrased: you need to implement Day 2 Operations to get to what a typical managed database service would provide.

Day 2 Operations

Day 2 Operations encompass many of the items that I've been discussing: backups, restores, stop/start, replication, high availability, and clustering. These are the features that differentiate a managed database service from a simple database hosted on k8s primitives, which is what I would call a Day 1 Operation.
While k8s and its ecosystem can make it very easy to install a database in your cluster, you're eventually going to need to start thinking about Day 2 Operations once you get past the prototype phase. Here, I'll go into more detail about what makes these operations so difficult to implement and why special care must be taken when implementing them, whether by a database admin or a managed database service provider.

Stop/Start

Stopping and starting databases is a common operation in today's DevOps practices and is a must-have for any fully featured managed database service. It is pretty easy to find at least one reason for wanting to stop and start a database. For example, you may want a database for running integration tests on a pre-defined schedule, or maybe you have a shared instance that's used by a development team for live QA before merging a commit. You could always create and delete database instances on demand, but it is sometimes easier to keep a reference to a static database connection string and URL in your test harness or orchestration code.

While stop/start can be automated in k8s (perhaps by simply setting a StatefulSet's replica count to 0), there are still other aspects to consider. If you're shutting down a database to save some money, will you also be spinning down any infrastructure? If so, how can you ensure that this infrastructure will be available when you start the database back up? K8s provides primitives like node affinity and taints to help solve this problem, but everyone's infrastructure provisioning situation and budget are different, and there's no one-size-fits-all approach.

Backup and Restore

One interesting point that Kelsey made in his chat was that having the ability to start an instance from scratch (moving from a stopped to a running state) is not trivial. Many challenges need to be solved, including finding the appropriate infrastructure to run the database, setting up network connectivity, mounting the correct volume, and ensuring data integrity once the volume has been mounted. In fact, this is such an in-depth topic that Kelsey compares going from 0 to 1 running instance to an actual backup-and-restore test. If you can indeed spin up an instance from scratch while loading up pre-existing data, you have successfully completed a live restore test!

Even if you have restores figured out, backups have their own complexities. K8s provides some useful building blocks like Jobs and CronJobs, which you can use to take a one-off backup or create a backup schedule, respectively. But you need to ensure that these jobs are configured correctly in order to access raw database storage, or, if your database allows you to perform a backup using a CLI, the jobs also need secure access to credentials just to connect to the database in the first place. From an end-user standpoint, you need an easy way to manage existing backups, which includes creating an index, applying data retention policies, and applying RBAC policies. Again, while k8s can help us build out these backup-and-restore components, a lot of these features end up being built on top of the infrastructure primitives that k8s provides.
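As a rough illustration of the CronJob idea, again leaning on the Terraform Kubernetes provider used earlier in this compilation, a scheduled logical backup might be declared like the sketch below; the image, schedule, command, and target are placeholders, and a real job would also need credentials from a Secret and somewhere durable to put the dump:

resource "kubernetes_cron_job_v1" "nightly_backup" {
  metadata {
    name = "nightly-backup"
  }
  spec {
    schedule                      = "0 2 * * *"
    concurrency_policy            = "Forbid"
    successful_jobs_history_limit = 3

    job_template {
      metadata {}
      spec {
        template {
          metadata {}
          spec {
            restart_policy = "OnFailure"

            container {
              name  = "pg-dump"
              image = "postgres:14"
              # Placeholder command: this dumps to the container filesystem
              # only. A real job would stream the dump to object storage or a
              # mounted volume and read connection credentials from a Secret.
              command = ["/bin/sh", "-c", "pg_dump -h postgres -U postgres mydb > /tmp/backup.sql"]
            }
          }
        }
      }
    }
  }
}

Even this small sketch hints at the coordination problem: the schedule, credentials, storage target, retention, and restore path all live outside of what k8s gives you for free.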
Replication, HA, and Clustering

These days, you can get very far by simply vertically scaling your database; the performance of modern databases can be sufficient for almost anyone's use case if you throw enough resources at the problem. But once you've reached a certain scale, or require features like high availability, there is a reason to enable some of the more advanced database management features like clustering and replication.

Once you start down this path, the amount of infrastructure orchestration complexity can increase exponentially. You need to start thinking more about networking and physical node placement to achieve your desired goal. If you don't have a centralized monitoring, logging, and telemetry solution, you're now going to need one if you want to easily diagnose issues and get the best performance out of your infrastructure. Based on its architecture and feature set, every database has different options for enabling clustering, many of which require intimate knowledge of the inner workings of the database to choose the correct settings. Vanilla k8s knows nothing of these complexities; instead, these all need to be orchestrated by an administrator or operator (human or automated). If you're working with production data, changes may need to happen with close-to-zero downtime.

This is where managed database services shine. They can make some of these features as easy to configure as a single web form with a checkbox or two and some input fields. Unless you're willing to invest the time into developing these solutions yourself, or you can leverage existing open-source solutions where they exist, sometimes it's worth giving up some level of control in exchange for automated expert assistance when configuring a database cluster.

Orchestration

For your Day 2 Operations to work as they would in a managed database service such as RDS, they need to not only work but also be automated. Luckily for us, there are several ways to build automation around a database on k8s.

Helm and YAML Tools Won't Get Us There

Since k8s configuration is declarative, it can be very easy to get from 0 to 1 with traditional YAML-based tooling like Helm or cdk8s. Many industry-leading k8s tools install into a cluster with a simple helm install or kubectl apply command. These are sufficient for Day 1 Operations and non-scalable deployments. But as soon as you move into more vendor-specific Day 2 Operations that require coordination across system components, the usefulness of traditional YAML-based tools degrades quickly, since some imperative programming logic is required.

Provisioners

One pattern that you can use to automate database management is a provisioner process. We've even used this approach to build v1 of our managed cloud solution. When a user wants to make a change to an existing database's state, our backend sends a message to a queue that is eventually picked up by a provisioner. The provisioner reads the message, uses its contents to determine which actions to perform on the cluster, and performs them sequentially. Where appropriate, each action contains a rollback step in case of a kubectl apply error, to leave the infrastructure in a predictable state. Progress is reported back to the application on a separate gossip queue, providing almost-immediate feedback to the user on the progress of each state change. While this has grown to be a powerful tool for us, there is another way to interact with the k8s API that we are now starting to leverage...
Operators

K8s has an extensible Operator pattern that you can use to manage your own Custom Resources (CRs) by writing and deploying a controller that reconciles the current cluster state into its desired state, as specified by CR YAML spec files applied to the cluster. This is also how the functionality of the basic k8s building blocks is implemented, which further emphasizes how powerful this model can be.

Operators have the ability to hook into the k8s API server and listen for changes to resources inside a cluster. These changes get processed by a controller, which then kicks off a reconciliation loop where you can add your custom logic to perform any number of actions, ranging from simple resource existence checks to complex Day 2 Operations. This is an ideal solution to our management problem: we can offload much of our imperative code into a native k8s object, and database-specific operations appear as seamless as the standard set of k8s building blocks. Many existing database products use operators to accomplish this, and more are currently in development (see the Data on Kubernetes community for more information on these efforts).

As you can imagine, coordinating activities like backups, restores, and clustering inside a mostly stateless and idempotent reconciliation loop isn't easy. Even if you follow best practices by writing a variety of simple controllers, each managing its own clearly defined CR, the reconciliation logic can still be error-prone and time-consuming to write. While frameworks like Operator SDK exist to help you scaffold your operator, and libraries like Kubebuilder provide a set of incredibly useful controller libraries, it's still a lot of work to undertake.

K8s Is Just a Tool

At the end of the day, k8s is a single tool in the DevOps engineer's toolkit. These days, it's possible to host workloads in a variety of ways: using managed services (PaaS), k8s, VMs, or even a bare-metal server. The tool that you choose depends on a variety of factors, including time, experience, performance requirements, ease of use, and cost. While hosting a database on k8s might be a fit for your organization, it could just as easily create even more overhead and instability if not done carefully.

Implementing the Day 2 features that I described above is time-consuming and costly to get right. Testing is incredibly important, since you want to be absolutely sure that your (and your customers') precious data is kept safe and accessible when it's needed. If you just need a reliable database to run your application on top of, all of the work required to run a database on k8s might be too much for you to undertake. But if your database has strong k8s support (most likely via an operator), or you are doing something unique (and at scale) with your storage layer, it might be worth looking into managing your stateful databases on k8s. Just be prepared for a large time investment, and ensure that you have the requisite in-house knowledge (or support) so that you can be confident you're performing your database automation activities correctly and safely.
In this blog post, you will learn how to deploy a Go application to AWS App Runner using the Go platform runtime. You will start with an existing Go application on GitHub and deploy it to AWS App Runner. The application is based on the URL shortener application (with some changes) that persists data in DynamoDB. Introduction AWS App Runner is a robust and user-friendly service that simplifies the deployment process of web applications in the AWS Cloud. It offers developers an effortless and efficient way to deploy their source code or container image directly to a scalable and secure web application without requiring them to learn new technologies or choose the appropriate compute service. One of the significant benefits of using AWS App Runner is that it connects directly to the code or image repository, enabling an automatic integration and delivery pipeline. This eliminates the need for developers to go through the tedious process of manually integrating their code with AWS resources. For developers, AWS App Runner simplifies the process of deploying new versions of their code or image repository. They can easily push their code to the repository, and App Runner will automatically take care of the deployment process. On the other hand, for operations teams, App Runner allows for automatic deployments every time a new commit is pushed to the code repository or a new container image version is added to the image repository. App Runner: Service Sources With AWS App Runner, you can create and manage services based on two types of service sources: Source code (covered in this blog post) Source image Source code is nothing but your application code that App Runner will build and deploy. All you need to do is point App Runner to a source code repository and choose a suitable runtime that corresponds to a programming platform version. App Runner provides platform-specific managed runtimes (for Python, Node.js, Java, Go, etc.). The AWS App Runner Go platform runtime makes it easy to build and run containers with web applications based on a Go version. You don’t need to provide container configuration and build instructions such as a Dockerfile. When you use a Go runtime, App Runner starts with a managed Go runtime image, which is based on the Amazon Linux Docker image and contains the runtime package for a version of Go and some tools. App Runner uses this managed runtime image as a base image and adds your application code to build a Docker image. It then deploys this image to run your web service in a container. Let’s Get Started Make sure you have an AWS account and install the AWS CLI. 1. Create a GitHub Repo for the URL Shortener Application Clone this GitHub repo and then upload it to a GitHub repository in your account (keep the same repo name, i.e., apprunner-go-runtime-app): git clone https://github.com/abhirockzz/apprunner-go-runtime-app 2. Create a DynamoDB Table To Store URL Information Create a table named urls. Choose the following: Partition key named shortcode (data type String) On-Demand capacity mode 3. Create an IAM Role With DynamoDB-Specific Permissions export IAM_ROLE_NAME=apprunner-dynamodb-role aws iam create-role --role-name $IAM_ROLE_NAME --assume-role-policy-document file://apprunner-trust-policy.json Before creating the policy, update the dynamodb-access-policy.json file to reflect the DynamoDB table ARN.
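For reference, a least-privilege dynamodb-access-policy.json for this application could look roughly like the sketch below. The account ID, region, and action list are placeholders; the authoritative version lives in the repository you cloned, and you should align the actions with what the application actually performs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/urls"
    }
  ]
}
```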
aws iam put-role-policy --role-name $IAM_ROLE_NAME --policy-name dynamodb-crud-policy --policy-document file://dynamodb-access-policy.json Deploy the Application to AWS App Runner If you have an existing AWS App Runner GitHub connection and want to use that, skip to the Repository selection step. 1. Create an AWS App Runner GitHub Connection Open the App Runner console and choose Create service. Create AWS App Runner Service On the Source and deployment page, in the Source section, for Repository type, choose Source code repository. Under Connect to GitHub, choose Add new, and then, if prompted, provide your GitHub credentials. Add GitHub connection In the Install AWS Connector for GitHub dialog box, if prompted, choose your GitHub account name. If prompted to authorize the AWS Connector for GitHub, choose Authorize AWS Connections. Choose Install. Your account name appears as the selected GitHub account/organization. You can now choose a repository in your account. 2. Repository Selection For Repository, choose the repository you created: apprunner-go-runtime-app. For Branch, choose the default branch name of your repository (for example, main). Configure your deployment: In the Deployment settings section, choose Automatic, and then choose Next. Choose GitHub repo 3. Configure Application Build On the Configure build page, for the Configuration file, choose Configure all settings here. Provide the following build settings: Runtime: Choose Go 1 Build command: Enter go build main.go Start command: Enter ./main Port: Enter 8080 Choose Next. Configure runtime info 4. Configure Your Service Under Environment variables, add an environment variable. For Key, enter TABLE_NAME, and for Value, enter the name of the DynamoDB table (urls) that you created before. Add environment variables Under Security > Permissions, choose the IAM role that you had created earlier (apprunner-dynamodb-role). Add IAM role for App Runner Choose Next. On the Review and create page, verify all the details you’ve entered, and then choose Create and deploy. If the service is successfully created, the console shows the service dashboard, with a Service overview of the application. Verify URL Shortener Functionality The application exposes two endpoints: To create a short link for a URL Access the original URL via the short link First, export the App Runner service endpoint as an environment variable: export APP_URL=<enter App Runner service URL> # example export APP_URL=https://jt6jjprtyi.us-east-1.awsapprunner.com 1. Invoke It With a URL That You Want to Access via a Short Link curl -i -X POST -d 'https://abhirockzz.github.io/' $APP_URL # output HTTP/1.1 200 OK Date: Thu, 21 Jul 2022 11:03:40 GMT Content-Length: 25 Content-Type: text/plain; charset=utf-8 {"ShortCode":"ae1e31a6"} You should get a JSON response with a short code and see an item in the DynamoDB table as well. You can continue to test the application with other URLs that you want to shorten! 2. Access the URL Associated With the Short Code Enter the following in your browser http://<enter APP_URL>/<shortcode>. For example, when you enter https://jt6jjprtyi.us-east-1.awsapprunner.com/ae1e31a6, you will be redirected to the original URL. You can also use curl. 
Here is an example: export APP_URL=https://jt6jjprtyi.us-east-1.awsapprunner.com curl -i $APP_URL/ae1e31a6 # output HTTP/1.1 302 Found Location: https://abhirockzz.github.io/ Date: Thu, 21 Jul 2022 11:07:58 GMT Content-Length: 0 Clean up Once you complete this tutorial, don’t forget to delete the following resources: DynamoDB table App Runner service Conclusion In this blog post, you learned how to go from a Go application in your GitHub repository to a complete URL shortener service deployed to AWS App Runner!
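If you prefer the CLI for the clean-up step, commands along these lines should work (a sketch; substitute your own region and the service ARN returned by aws apprunner list-services):

```shell
# Delete the DynamoDB table used by the URL shortener
aws dynamodb delete-table --table-name urls

# Delete the App Runner service (replace the ARN with your own)
aws apprunner delete-service \
  --service-arn arn:aws:apprunner:us-east-1:123456789012:service/apprunner-go-runtime-app/xxxxxxxx
```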
The Gartner hype cycle, illustrated below, can be applied to most aspects of technology: As new innovations enter their respective cycles, expectations are eventually realized—leading to some level of adoption. The goal for every innovation is to reach the plateau of productivity where consumers have determined that the reward of adopting the innovation far outweighs any known risks. At the same time, there is a point where the plateau of productivity begins to diminish, leading to an exodus away from that innovation. One simple example would be pagers (or beepers), which were common before mobile phones/devices reached the plateau of productivity. As technologists, we strive to deliver features, frameworks, products, or services that increase the plateau of productivity. The same holds true for the ones that we use. Recently, I felt like my current hosting platform began falling off the plateau of productivity. In fact, a recent announcement made me wonder if it was time to consider other options. Since I had a positive experience using the Render PaaS, I wanted to look at how easily I could convert one of my Heroku applications, adopt PostgreSQL, and migrate to Render. I’m describing that journey in this two-part series: Part 1: We’ll focus on migrating my backend services (Spring Boot and ClearDB MySQL). Part 2: We’ll focus on porting and migrating my frontend Angular client. Why I Chose Render If you have never heard of Render before, check out some of my previous publications: Using Render and Go for the First Time Under the Hood: Render Unified Cloud Purpose-Driven Microservice Design Launch Your Startup Idea in a Day How I Used Render To Scale My Microservices App With Ease What I find exciting about Render is that they continue to climb the slope of enlightenment while actively providing a solid solution for adopters recognizing the plateau of productivity. As I’ve noted in my articles, Render offers a “Zero DevOps” promise. This perfectly aligns with my needs since I don’t have the time to focus on DevOps tasks. The Heroku platform has several things that I am not too fond of: Daily restarts led to unexpected downtime for one of my services. Entry-level (all I really need) Postgres on Heroku allows for four hours of downtime per month. Pricing levels, from a consumer perspective, don’t scale well. From a pricing perspective, I am expecting to see significant cost savings after migrating all of my applications and services from Heroku to Render. What is more amazing is that I am getting better memory and CPU for that price, with linear scaling as my application footprint needs to grow. Converting a Single Service As I noted above, this is part one of a two-part series, and I’ll focus on the service tier in this article. The service I want to convert has the following attributes: Spring Boot RESTful API Service Heroku CloudAMQP (RabbitMQ) Message Broker Heroku ClearDB (MySQL) Database (single schema) Okta Integration On the Render PaaS side, the new service design will look like this: Render Web Service hosting Spring Boot RESTful API Service (via Docker) Render Private Service hosting RabbitMQ Message Broker (via Docker) Render Postgres with the ability for multiple schemas to exist Okta Integration Below is a side-by-side comparison of the two ecosystems: My high-level plan of attack for the conversion is as follows: Prepare Heroku for Conversion Before getting started, it is recommended to put all existing Heroku services into Maintenance Mode. 
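Enabling Maintenance Mode is a one-liner per application with the Heroku CLI (the app names below are placeholders for your own):

```shell
# Put each Heroku app being migrated into maintenance mode
heroku maintenance:on --app my-spring-boot-api
heroku maintenance:on --app my-other-service
```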
This will prohibit any consumers from accessing the applications or services. While the source code should already be backed up and stored in a git-based repository, it is a good idea to make sure a database backup has been successfully created. Conversion of Heroku Services From a conversion perspective, I had two items to convert: the service itself and the ClearDB (MySQL) database. The conversion of my Spring Boot RESTful service did not involve much work. In fact, I was able to leverage the approach I used for a previous project of mine. For the database, I needed to convert from MySQL to PostgreSQL. My goal was to use Render’s Heroku Migrator to easily migrate Heroku Postgres to Render Postgres, but I needed to convert from MySQL to PostgreSQL first. Initially, I started down the pgloader path, which seemed to be a common approach for the database conversion. However, using my M1-chip MacBook Pro led to some unexpected issues. Instead, I opted to use NMIG to convert MySQL to PostgreSQL. For more information, please check out the “Highlights From the Database Conversion” section below. Create Services in Render After converting the database and the Spring Boot RESTful service running inside Docker, the next step was to create a Render Web Service for the Spring Boot RESTful API service. This was as easy as creating the service, giving it a name, and pointing to the appropriate repository for my code in GitLab. Since I also needed a RabbitMQ service, I followed these instructions to create a RabbitMQ Private Service running on Render. This included establishing a small amount of disk storage to persist messages that have not been processed. Finally, I created the necessary environment variables in the Render Dashboard for both the Spring Boot RESTful API service and the RabbitMQ message broker. Initialize and Validate the Services The next step was to start my services. Once they were running and the APIs were validated using my Postman collection, I updated my client application to point to the new Render service location. Once everything was up and running, my Render Dashboard appeared as shown below: Next Steps All that remained at this point was to delete the databases still running on Heroku and remove the migrated services from the Heroku ecosystem. When using Heroku, any time I merged code into the master branch of my service repository, the code was automatically deployed, provided I used GitLab CI/CD to deploy to Heroku in my source repository. However, there is no need to add code to the source file repository with Render. I simply needed to specify the Build & Deploy Branch in the Render Dashboard for the service: I love the Zero DevOps promise. Highlights From the Database Conversion By following the steps above, the conversion from Heroku to Render was smooth and successful. The biggest challenge for me was the conversion of data. At a high level, this mostly boiled down to a series of commands executed from the terminal of my MacBook Pro. First, I started a local Postgres instance via Docker: docker run --publish 127.0.0.1:5432:5432 --name postgres -e POSTGRES_PASSWORD=dbo -d postgres Next, I created a database called “example” using the following command (or pgAdmin): createdb example For converting my ClearDB (MYSQL) instance running on Heroku to my example Postgres database running locally, I used NMIG, which is a Node.js-based database conversion utility. 
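If you want to follow the same route, NMIG is installed from source; the steps are roughly the following sketch (consult the NMIG README for the current instructions, as they may change):

```shell
# Clone and build NMIG, a Node.js-based MySQL-to-PostgreSQL conversion tool
git clone https://github.com/AnatolyUss/nmig.git
cd nmig
npm install
npm run build
```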
After installing NMIG, I set up the config.json file with database endpoint information and credentials, and then I ran: /path/to/nmig$ npm start Next, I backed up the data to a file using the following command: pg_dump -Fc --no-acl --no-owner -h localhost -U postgres example > example.dump Rather than go through the hassle of creating a signed URL in AWS, I just used the pgAdmin client to import the backup into a newly created Postgres instance on Heroku. With the Postgres instance running and data validated, I created a new Postgres database on the Render PaaS. Then all I had to do was issue the following command: pg_restore --verbose --no-acl --no-owner -d postgres://username:password@hostname.zone-postgres.render.com/example example.dump Lessons Learned Along the Way Looking back on my conversion from Heroku to Render, here are some lessons I learned along the way: I had a minor issue with the Postgres database updating the date/time value to include the DST offset. This may have been an issue with my original database design, but I wanted to pass this along. In my case, the impacted column is only used for Date values, which did not change for me. I included a database column named END in one of my tables, which caused a problem when either Postgres or Hibernate attempted to return a native query. The service saw the END column name and injected it as a SQL keyword. I simply renamed the column to fix this issue, which I should have known not to do in the first place. With Render, I needed to make the RabbitMQ service a Private Service because the Web Service option does not expose the expected port. However, with this approach, I lost the ability to access the RabbitMQ admin interface, since Private Services are not exposed externally. It looks like Render plans to address this feature request. All in all, these minor hurdles weren’t significant enough to impact my decision to migrate to Render. Conclusion The most important aspect of Gartner’s plateau of productivity is providing products, frameworks, or services that allow consumers to thrive and meet their goals. The plateau of productivity is not intended to be flashy or fashionable—in a metaphorical sense. When I shared this conclusion with Ed, a Developer Advocate at Render, his response was something I wanted to share: “Render is pretty avowedly not trying to be ‘fashionable.’ We're trying to be unsurprising and reliable.” Ed’s response resonated deeply with me and reminded me of a time when my former colleague told me my code came across as “boring” to him. His comment turned out to be the greatest compliment I could have received. You can read more here. In any aspect of technology, the decision on which provider to select should always match your technology position. If you are unsure, the Gartner hype cycle is a great reference point, and you can get started with a subscription to their service here. I have been focused on the following mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” - J. Vester When I look at the Render PaaS ecosystem, I see a solution that adheres to my mission statement while residing within my hype cycle preference. What makes things better is that I fully expect to see a 44% savings in my personal out-of-pocket costs—even more as my services need to scale vertically. 
For those considering hosting solutions, I recommend adding Render to the list of providers for review and analysis. You can get started for free by following this link. The second part of this series will be exciting. I will demonstrate how to navigate away from paying for my static client written in Angular and take advantage of Render’s free Static Sites service using either Vue or Svelte. Which framework will I choose … and why? Have a really great day!
Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino was designed to handle data warehousing, ETL, and interactive analytics: querying large amounts of data and producing reports. Alluxio is an open-source data orchestration platform for large-scale analytics and AI. Alluxio sits between compute frameworks such as Trino and Apache Spark and various storage systems like Amazon S3, Google Cloud Storage, HDFS, and MinIO. This is a tutorial for deploying Alluxio as the caching layer for Trino using the Iceberg connector. Why Do We Need Caching for Trino? A small fraction of the petabytes of data you store is generating business value at any given time. Repeatedly scanning the same data and transferring it over the network consumes time, compute cycles, and resources. This issue is compounded when pulling data from disparate Trino clusters across regions or clouds. In these circumstances, caching solutions can significantly reduce the latency and cost of your queries. Trino has a built-in caching engine, Rubix, in its Hive connector. While this system is convenient as it comes with Trino, it is limited to the Hive connector and has not been maintained since 2020. It also lacks security features and support for additional compute engines. Trino on Alluxio Alluxio connects Trino to various storage systems, providing APIs and a unified namespace for data-driven applications. Alluxio allows Trino to access data regardless of the data source and transparently cache frequently accessed data (e.g., commonly used tables) into Alluxio distributed storage. Using Alluxio Caching via the Iceberg Connector Over MinIO File Storage We’ve created a demo that shows how to configure Alluxio to use write-through caching with MinIO. This is achieved by using the Iceberg connector and making a single change to the location property on the table from the Trino perspective. In this demo, Alluxio is run on separate servers; however, it’s recommended to run it on the same nodes as Trino. This means that all the configurations for Alluxio will be located on the servers where Alluxio runs, while Trino’s configuration remains unaffected. The advantage of running Alluxio externally is that it won’t compete for resources with Trino, but the disadvantage is that data will need to be transferred over the network when reading from Alluxio. It is crucial for performance that Trino and Alluxio are on the same network. To follow this demo, copy the code located here. Trino Configuration Trino is configured identically to a standard Iceberg configuration. Since Alluxio is running external to Trino, the only configuration needed is at query time and not at startup. Alluxio Configuration The configuration for Alluxio can all be set using the alluxio-site.properties file. To keep all configurations colocated in the docker-compose.yml file, we are setting them using Java properties via the ALLUXIO_JAVA_OPTS environment variable. This tutorial also refers to the master node as the leader and the workers as followers. Master Configurations alluxio.master.mount.table.root.ufs=s3://alluxio/ The leader exposes ports 19998 and 19999, the latter being the port for the web UI. Worker Configurations alluxio.worker.ramdisk.size=1G alluxio.worker.hostname=alluxio-follower The follower exposes ports 29999 and 30000, and sets up a shared memory used by Alluxio to store data.
This is set to 1G via the shm_size property and is referenced from the alluxio.worker.ramdisk.size property. Shared Configurations Between Leader and Follower alluxio.master.hostname=alluxio-leader # Minio configs alluxio.underfs.s3.endpoint=http://minio:9000 alluxio.underfs.s3.disable.dns.buckets=true alluxio.underfs.s3.inherit.acl=false aws.accessKeyId=minio aws.secretKey=minio123 # Demo-only configs alluxio.security.authorization.permission.enabled=false The alluxio.master.hostname needs to be set on all nodes, leaders and followers. Most of the shared configs point Alluxio to the underfs, which is MinIO in this case. alluxio.security.authorization.permission.enabled is set to “false” to keep the Docker setup simple. Note: This is not recommended in a production or CI/CD environment. Running Services First, you want to start the services. Make sure you are in the trino-getting-started/iceberg/trino-alluxio-iceberg-minio directory. Now, run the following command: docker-compose up -d You should expect to see the following output. Docker may also have to download the Docker images before you see the “Created/Started” messages, so there could be extra output: [+] Running 10/10 ⠿ Network trino-alluxio-iceberg-minio_trino-network Created 0.0s ⠿ Volume "trino-alluxio-iceberg-minio_minio-data" Created 0.0s ⠿ Container trino-alluxio-iceberg-minio-mariadb-1 Started 0.6s ⠿ Container trino-alluxio-iceberg-minio-trino-coordinator-1 Started 0.7s ⠿ Container trino-alluxio-iceberg-minio-alluxio-leader-1 Started 0.9s ⠿ Container minio Started 0.8s ⠿ Container trino-alluxio-iceberg-minio-alluxio-follower-1 Started 1.5s ⠿ Container mc Started 1.4s ⠿ Container trino-alluxio-iceberg-minio-hive-metastore-1 Started Open Trino CLI Once this is complete, you can log into the Trino coordinator node. We will do this by using the exec command to run the trino CLI executable on that container. Notice the container id is trino-alluxio-iceberg-minio-trino-coordinator-1, so the command you will run is: docker container exec -it trino-alluxio-iceberg-minio-trino-coordinator-1 trino When you start this step, you should see the trino cursor once the startup is complete. It should look like this when it is done: trino> To best understand how this configuration works, let’s create an Iceberg table using a CTAS (CREATE TABLE AS) query that pushes data from one of the TPC connectors into an Iceberg table that points to MinIO. The TPC connectors generate data on the fly so we can run simple tests like this. First, run a command to show the catalogs to see the tpch and iceberg catalogs since these are what we will use in the CTAS query: SHOW CATALOGS; You should see that the Iceberg catalog is registered. MinIO Buckets and Trino Schemas Upon startup, the following command is executed on an initialization container that includes the mc CLI for MinIO. This creates a bucket in MinIO called /alluxio, which gives us a location to write our data to, and we can tell Trino where to find it: /bin/sh -c " until (/usr/bin/mc config host add minio http://minio:9000 minio minio123) do echo '...waiting...' && sleep 1; done; /usr/bin/mc rm -r --force minio/alluxio; /usr/bin/mc mb minio/alluxio; /usr/bin/mc policy set public minio/alluxio; exit 0; " Note: This bucket will act as the mount point for Alluxio, so the schema directory alluxio://lakehouse/ in Alluxio will map to s3://alluxio/lakehouse/.
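For reference, the Trino side of the demo is just a standard Iceberg catalog pointing at the Hive metastore and MinIO; it would look roughly like the sketch below. The property names are assumptions that can vary between Trino versions, so treat the catalog file in the demo repository as authoritative:

```properties
# etc/catalog/iceberg.properties (illustrative)
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
hive.s3.endpoint=http://minio:9000
hive.s3.aws-access-key=minio
hive.s3.aws-secret-key=minio123
hive.s3.path-style-access=true
```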
Querying Trino Let’s move to creating our SCHEMA that points us to the bucket in MinIO and then run our CTAS query. Back in the terminal, create the iceberg.lakehouse SCHEMA. This will be the first call to the metastore to save the schema location in the Alluxio namespace. Notice that we need to specify the hostname alluxio-leader and port 19998, since we did not set Alluxio as the default file system. Take this into consideration if you want Alluxio caching to be the default usage and transparent to users managing DDL statements: CREATE SCHEMA iceberg.lakehouse WITH (location = 'alluxio://alluxio-leader:19998/lakehouse/'); Now that we have a SCHEMA that references the bucket where we store our tables in Alluxio, which syncs to MinIO, we can create our first table. Optional: To view your queries as they run, log into the Trino UI using any username (it doesn’t matter since no security is set up). Move the customer data from the tiny generated TPCH data into MinIO using a CTAS query. Run the following query, and if you like, watch it running on the Trino UI: CREATE TABLE iceberg.lakehouse.customer WITH ( format = 'ORC', location = 'alluxio://alluxio-leader:19998/lakehouse/customer/' ) AS SELECT * FROM tpch.tiny.customer; Go to the Alluxio UI and the MinIO UI, and browse the Alluxio and MinIO files. You will now see a lakehouse directory that contains a customer directory holding the data written by Trino to Alluxio, which Alluxio in turn writes to MinIO. Now that there is a table under Alluxio and MinIO, you can query this data by running the following: SELECT * FROM iceberg.lakehouse.customer LIMIT 10; How can we be sure that Trino is actually reading from Alluxio and not MinIO? Let’s delete the data in MinIO and run the query again just to be sure. Once you delete this data, you should still see data returned. Stopping Services Once you complete this tutorial, the resources used for this exercise can be released by running the following command: docker-compose down Conclusion At this point, you should have a better understanding of Trino and Alluxio, how to get started with deploying Trino and Alluxio, and how to use Alluxio caching with an Iceberg connector and MinIO file storage. I hope you enjoyed this article. Be sure to like this article and comment if you have any questions!
Migrating to the cloud can be a daunting task, but with the right plan and execution, it can be a seamless process. AWS offers various services that can help you with your migration, but it's important to be aware of the best practices and pitfalls to avoid. This blog post will discuss the best practices and common pitfalls to avoid when migrating to the AWS cloud. Best Practices Plan Your Migration Before you begin, it's important to plan your migration. This includes identifying your current environment, defining your migration goals, and determining which applications and data you want to migrate. Planning your migration will help you identify potential challenges and provide a roadmap for a successful migration. Assess Your Current Environment Before migrating to the cloud, it's important to assess your current environment. This includes identifying your current infrastructure, applications, and data. Assessing your current environment will help you identify what needs to be migrated and what can be left behind. For example, you can use AWS Application Discovery Service. It automatically discovers and collects information about your application's infrastructure, including servers, databases, and dependencies. Choose the Right Migration Strategy AWS offers seven migration strategies (initially, it was six), which include: Retire Retain Relocate Lift and shift (Rehost) Repurchase Re-platform Refactor Theoretically, there are seven strategies, but I will discuss the more common approaches to migrating applications to the cloud. Always choose the migration strategy that best fits your needs. For example, lift and shift is a good option for simple, infrequently used applications, while refactoring is a better option for more complex applications. 1. Lift and Shift: This is the most basic migration strategy and involves simply moving existing applications and workloads to the cloud without any significant changes. This approach is best for simple, stateless applications that do not require significant changes to operate in the cloud. It is also known as "rehosting" because the goal is to move the application as is, with minimal changes. This approach can be done using the AWS Application Migration Service (AWS MGN), which is the best way to migrate on-premises physical or virtual servers to AWS. After migration, you can use AWS Elastic Beanstalk, AWS EC2, or AWS Auto Scaling. This approach is relatively quick and simple, but it may not provide optimal performance or cost savings in the long run, as the applications may not be fully optimized for the cloud. 2. Re-architecture: This approach involves making significant changes to the architecture of the application to take full advantage of the cloud. This may include breaking down monolithic applications into microservices, using containers and Kubernetes for orchestration, and using cloud-native services such as AWS Lambda and Amazon SNS. This approach is best for complex and large applications that require significant changes to operate efficiently in the cloud. It takes longer than lift and shift and requires a deep understanding of the application and the cloud. 3. Replatforming: This approach involves moving an existing application to a new platform, such as moving a Java application to .NET. It is best for organizations that want to move to a new technology platform that is not supported on-premises and to take advantage of the benefits of the new platform.
AWS services like AWS Elastic Beanstalk, AWS ECS, and AWS RDS can be used to deploy the new platform in the cloud. 4. Hybrid: This approach involves running some workloads on-premises and some in the cloud. This approach is best for organizations that have strict compliance or security requirements that prevent them from moving all their workloads to the cloud. This approach is also good for organizations that have complex interdependencies between on-premise and cloud-based workloads. It also allows organizations to take a more gradual approach to migration, moving workloads to the cloud as they become ready. AWS services like AWS Direct Connect and AWS VPN can be used to create a secure and reliable connection between on-premise and cloud-based resources. AWS EKS and Storage gateway, AWS Outposts are good examples to work in a hybrid cloud. 5. Cloud-native: This approach involves building new applications using cloud-native services and architectures from the ground up. This approach is best for organizations that are starting new projects and want to take full advantage of the scalability and elasticity of the cloud. This approach requires a deep understanding of cloud-native services and architectures and is generally more complex than lift and shift or re-platforming. AWS App Runner, AWS Fargate, and ECS can be used to implement cloud-native services. Test Your Migration Once your migration plan is in place, it's important to test your migration. This includes testing your applications, data, and infrastructure. Testing your migration will help you identify any issues and ensure that your applications and data are working as expected in the cloud. Monitor and Optimize Your Migration After your migration is complete, it's important to monitor and optimize your migration. This includes monitoring your applications, data, and infrastructure to ensure that they are working as expected in the cloud. It also includes optimizing your cloud resources to reduce costs and improve performance. Avoid Pitfalls Avoid vendor lock-in: Take advantage of open-source and cross-platform tools and technologies to avoid being locked into a single vendor's ecosystem. Avoid the pitfall of not testing: One of the common pitfalls of cloud migration is not testing the migration properly. It is important to test the migration thoroughly to ensure that all applications and data are working as expected in the cloud. Another pitfall is not considering security: Another common pitfall of cloud migration is not considering security. It's important to ensure that your applications and data are secure in the cloud. This includes securing your data in transit and at rest and ensuring that your applications are secure. Not considering scalability: Another pitfall of cloud migration is not considering scalability. It's important to ensure that your applications and data are scalable in the cloud. This includes ensuring that your applications and data can handle an increase in traffic and usage. Not considering cost: Another pitfall of cloud migration is not considering the cost. It's important to ensure that your migration is cost-effective and that you are not over-provisioning resources. Not considering compliance: Another pitfall of cloud migration is not considering compliance. It's important to ensure that your migration complies with any relevant laws and regulations. Finally, train your team on the new tools and technologies that they'll be using in the cloud. 
In conclusion, migrating to the AWS cloud requires planning, testing, monitoring, and optimization. Avoiding the pitfalls mentioned above, and following the best practices, will help ensure a successful migration. Additionally, it is important to keep security, scalability, cost, and compliance in mind throughout the migration process.
When we combine the cloud with IaC tools like Terraform and continuous deployment, we get the almost magical ability to create resources on demand. For all its benefits, however, the cloud has also introduced a set of difficulties, one of which is estimating cloud costs accurately. Cloud providers have complex cost structures that are constantly changing. AWS, for example, offers 536 types of EC2 Linux machines. Many of them have similar names and features. Take for example "m6g.2xlarge" and "m6gd.2xlarge" — the only difference is that the second comes with an SSD drive, which will add $60 to the bill. Often, making a mistake in defining your infrastructure can cause your bill to balloon at the end of the month. It’s so easy to go above budget. We can set up billing alerts, but there are no guarantees that they will work. Alerts can happen during the weekend or be delayed, making us shoot past our budget in a few hours. So, how can we avoid this problem and use the cloud with confidence? Enter Infracost Infracost is an open-source project that helps us understand how and where we’re spending our money. It gives a detailed breakdown of actual infrastructure costs and calculates how changes impact them. Basically, Infracost is a git diff for billing. Infracost has two versions: a VS Code add-on and a command-line program. Both do the same thing: parse Terraform code, pull the current price points from a cloud pricing API, and output an estimate. You can use Infracost's pricing API for free or host your own. The paid tier includes a cloud dashboard to track changes over time. We can see the estimates right in the IDE: Real-time cost estimation on VS Code. Or as comments in pull requests or commits: Cost change information in the PR. Setting Up Infracost To try out Infracost, we’ll need the following: An Infracost API key: You can get one by signing up for free at Infracost.io. The Infracost CLI installed on your machine Some Terraform files Once the CLI tool is installed, run infracost auth login to retrieve the API key. Now we’re ready to go. The first command we’ll try is infracost breakdown. It analyzes Terraform plans and prints out a cost estimate. The --path variable must point to the folder containing your Terraform files. For example, imagine we want to provision an "a1.medium" EC2 instance with the following: provider "aws" { region = "us-east-1" skip_credentials_validation = true skip_requesting_account_id = true } resource "aws_instance" "myserver" { ami = "ami-674cbc1e" instance_type = "a1.medium" root_block_device { volume_size = 100 } } At current rates, this instance costs $28.62 per month to run: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_instance.myserver ├─ Instance usage (Linux/UNIX, on-demand, a1.medium) 730 hours $18.62 └─ root_block_device └─ Storage (general purpose SSD, gp2) 100 GB $10.00 OVERALL TOTAL $28.62 If we add some extra storage (600 GB of EBS), the cost increases to $155.62, as shown below: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_instance.myserver ├─ Instance usage (Linux/UNIX, on-demand, a1.medium) 730 hours $18.62 ├─ root_block_device │ └─ Storage (general purpose SSD, gp2) 100 GB $10.00 └─ ebs_block_device[0] ├─ Storage (provisioned IOPS SSD, io1) 600 GB $75.00 └─ Provisioned IOPS 800 IOPS $52.00 OVERALL TOTAL $155.62 Infracost can also calculate usage-based resources like AWS Lambda.
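For completeness, the extra storage shown in the second breakdown corresponds to adding an ebs_block_device block to the same instance, roughly like this (the device name is a placeholder):

```hcl
resource "aws_instance" "myserver" {
  ami           = "ami-674cbc1e"
  instance_type = "a1.medium"

  root_block_device {
    volume_size = 100
  }

  # The 600 GB io1 volume with 800 provisioned IOPS from the breakdown above
  ebs_block_device {
    device_name = "/dev/sdf"
    volume_type = "io1"
    volume_size = 600
    iops        = 800
  }
}
```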
Let's see what happens when we swap the EC2 instance for serverless functions: provider "aws" { region = "us-east-1" skip_credentials_validation = true skip_requesting_account_id = true } resource "aws_lambda_function" "my_lambda" { function_name = "my_lambda" role = "arn:aws:lambda:us-east-1:account-id:resource-id" handler = "exports.test" runtime = "nodejs12.x" memory_size = 1024 } Running infracost breakdown yields a total cost of 0 dollars: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_lambda_function.my_lambda ├─ Requests Monthly cost depends on usage: $0.20 per 1M requests └─ Duration Monthly cost depends on usage: $0.0000166667 per GB-seconds OVERALL TOTAL $0.00 That can’t be right unless no one uses our Lambda function, which is precisely what the tool assumes by default. We can fix this by providing an estimate via a usage file. We can create a sample usage file with this command: $ infracost breakdown --sync-usage-file --usage-file usage.yml --path . We can now provide estimates by editing usage.yml. The following example consists of 5 million requests with an average runtime of 300 ms: resource_usage: aws_lambda_function.my_lambda: monthly_requests: 5000000 request_duration_ms: 300 We’ll tell Infracost to use the usage file with --usage-file to get a proper cost estimate: $ infracost breakdown --path . --usage-file usage.yml Name Monthly Qty Unit Monthly Cost aws_lambda_function.my_lambda ├─ Requests 5 1M requests $1.00 └─ Duration 1,500,000 GB-seconds $25.00 OVERALL TOTAL $26.00 That’s much better. Of course, this is accurate as long as our usage file is correct. If you’re unsure, you can integrate Infracost with the cloud provider and pull the utilization metrics from the source. Git Diff for Cost Changes Infracost can save results in JSON by providing the --format json and --out-file options. This gives us a file we can check in source control and use as a baseline. $ infracost breakdown --path . --format json --usage-file usage.yml --out-file baseline.json We can now compare changes by running infracost diff. Let’s see what happens if the Lambda execution time goes from 300 to 350 ms: $ infracost diff --path . --compare-to baseline.json --usage-file usage.yml ~ aws_lambda_function.my_lambda +$4.17 ($26.00 → $30.17) ~ Duration +$4.17 ($25.00 → $29.17) Monthly cost change for TomFern/infracost-demo/dev Amount: +$4.17 ($26.00 → $30.17) Percent: +16% As you can see, the impact is a 16% increase. Integrating Infracost With CI/CD We’ve seen how this tool can help us estimate cloud costs. That’s valuable information, but what role does Infracost take in continuous integration? To answer that, we must understand what infracost comment does. The comment command takes a JSON file generated by infracost diff and posts its contents directly into GitHub, Bitbucket, or GitLab. Thus, by running Infracost inside CI, we make relevant cost information available to everyone on the team. Infracost comments on the cost difference in a GitHub commit. If you want to learn how to configure CI/CD to run Infracost on every update, check out this tutorial: How to Run Infracost on Semaphore. Working With Monorepos You will likely have separate Terraform files for each subproject if you work with a monorepo. In this case, you should add an infracost config file at the project's root. This allows you to specify the project names and where Terraform and usage files are located. You can also set environment variables and other options. 
version: 0.1 projects: - path: dev usage_file: dev/infracost-usage.yml env: NODE_ENV: dev - path: prod usage_file: prod/infracost-usage.yml env: AWS_ACCESS_KEY_ID: ${PROD_AWS_ACCESS_KEY_ID} AWS_SECRET_ACCESS_KEY: ${PROD_AWS_SECRET_ACCESS_KEY} NODE_ENV: production When a config file is used, you must replace the --path argument with --config-file in all your commands. Establishing Policies One more trick Infracost has up its sleeve is enforcing policies. Policies are rules that evaluate the output of infracost diff and stop the CI pipeline if a resource goes over budget. This feature allows managers and team leads to enforce limits. When a policy fails, the CI/CD pipeline stops with an error, preventing the infrastructure from being provisioned. When a policy is in place, Infracost warns us if any limits are exceeded. Infracost implements policies using Open Policy Agent (OPA), which uses the Rego language to encode policy rules. Rego has a ton of features, and it’s worth digging in to learn it thoroughly, but for our purposes, we only need to learn a few keywords: deny[out]: defines a new policy rule that fails if the out object has failed: true. msg: defines the error message shown when the policy fails. out: defines the logic that makes the policy pass or fail. input: references the contents of the JSON object generated with infracost diff. The following example shows a policy that fails when the total monthly cost exceeds $1,000: # policy.rego package infracost deny[out] { # define a variable maxMonthlyCost = 1000.0 msg := sprintf( "Total monthly cost must be less than $%.2f (actual diff is $%.2f)", [maxMonthlyCost, to_number(input.totalMonthlyCost)], ) out := { "msg": msg, "failed": to_number(input.totalMonthlyCost) >= maxMonthlyCost } } Here is another example that fails if the cost difference is equal to or greater than $500: package infracost deny[out] { # maxDiff defines the threshold that you require the cost estimate to be below maxDiff = 500.0 msg := sprintf( "Total monthly cost diff must be less than $%.2f (actual diff is $%.2f)", [maxDiff, to_number(input.diffTotalMonthlyCost)], ) out := { "msg": msg, "failed": to_number(input.diffTotalMonthlyCost) >= maxDiff } } You can experiment and try several examples online on the OPA playground. To enforce a policy, you must add the --policy-path option to any of the infracost comment commands like this: curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh checkout infracost diff --path . --usage-file usage.yml --compare-to baseline.json --format json --out-file /tmp/infracost-diff-commit.json infracost comment github --path=/tmp/infracost-diff-commit.json --repo=$SEMAPHORE_GIT_REPO_SLUG --commit=$SEMAPHORE_GIT_SHA --github-token=$GITHUB_API_KEY --policy-path policy.rego --behavior=update Conclusion The power to spin up resources instantly is a double-edged sword: a typo in a Terraform file can be a costly mistake. Staying proactive when managing our cloud infrastructure is essential to sticking to the budget and avoiding nasty surprises at the end of the month. If you’re already automating deployment with continuous deployment and managing services with Terraform, you may as well add Infracost to the mix to make more informed decisions and impose spending limits. Setting this up takes only a few minutes and can save thousands of dollars down the road.
Authentication in the Age of SaaS and Cloud Let's start with the differences between authentication and authorization. People tend to lump these concepts together as auth, but they're two distinct processes. Authentication describes the process of finding out that you are who you say you are. In the past, we used user IDs and passwords. These days it's much more common to use magic links or multi-factor authentication, etc. but, it's the same process. Authentication used to be the responsibility of the operating system that logs you in once you provide a password. But over the past 15 years, as we moved into the age of SaaS and cloud, that changed. The first generation of SaaS and cloud apps had to reinvent this process because there were no longer any operating systems to ask to authenticate the user's identity. In the course of the last 15 years, we started to work together as an industry to develop standards around authentication, like OAuth2, OpenID connect, and SAML. We’ve started to use JWTs and so on. Today, no one has to build a log-in system if they don't want to. Numerous developer services can help you do this. Overall, you can say that we've successfully moved identity from on-premises to the realm of SaaS in the cloud. Authorization, on the other hand, has not transitioned to the cloud. Authorization, or access control, is the process of discerning what you can see and do once you're logged in. Unlike authentication, authorization is a problem that is far from being solved. The problem is that there aren’t any industry standards for authorization. You can apply some patterns like role-based access control (RBAC) and attribute-based access control (ABAC), but there are no standards because authorization is a domain-specific problem. There aren't any developer services either. Can you think of a Twilio or a Stripe for authorization? And because there are no standards, or developer services to speak of, companies lose agility because they have to spend time building an in-house authorization system and go through the pain that entails. You have to think about the opportunity cost. How much will it cost you to spend time developing and maintaining an in-house access control system, instead of focusing on your value propositions? And, unfortunately, when companies do this themselves they do it poorly. This is the reason that broken access control ranks #1 in the top 10 security issues listed by the open web application security project (OWASP). It seems like we really dug ourselves into a pretty big hole and now it's time to dig ourselves back out. Cloud-Native Authorization Let's look at how we got here. There have been three transitions that have affected the world of software in general and authorization in particular: 1. Transition to SaaS: Authentication made the move successfully, but access control hasn’t. If we dig into why, we see that back in the day, when applications just talked to the operating system, we had a directory, like LDAP. In this directory, you had groups, with users assigned to those groups. Those groups would typically map to your business application roles and things were pretty simple. But now, we don't have an operating system or a global directory that we can query, so every application has to reinvent the authorization process. 2. Rise of microservices: We’ve seen an architectural shift moving from monolithic applications into microservices. Back when we had monoliths, authorization happened at one time and in one place in the code. 
Today, we have multiple microservices, and each microservice has to do its own authorization. We also have to think about authorizing interactions between microservices, so that only the right interaction patterns are allowed. 3. Zero-trust: The move from the perimeter-based security approach to zero trust security. With zero trust, a lot of the responsibility for security moved away from the environment and into the application. We have a new world order now where everything is in the cloud, everything is a microservice, and zero trust is a must. Unfortunately, not all applications have caught up with this new paradigm, and when we compare well-architected applications to poorly architected ones we clearly see five anti-patterns and five corresponding best practices emerge. Five Best Practices of Cloud-Native Access Control 1. Purpose-Built Authorization Service Today, every service authorizes on its own. If each microservice has to worry about its authorization, each microservice is likely to do it a little bit differently. So when you want to change the authorization behavior across your entire system, you have to think about how each microservice has to be updated and how the authorization logic works in that microservice, which becomes very difficult as you add more microservices to your system. The best practice that we want to consider is to extract the authorization logic out of the microservices and create a purpose-built microservice that will only deal with authorization. In the past couple of years, large organizations have begun publishing papers that describe how their purpose-built authorization system works. It all started with the Google Zanzibar paper that describes how they built the authorization system for Google Drive and other services. Other companies followed and described how they built their purpose-built authorization service and a distributed system around it. These include Intuit’s AuthZ paper, Airbnb's Himeji, Carta’s AuthZ, and Netflix’s PAS. We are now starting to distill these learnings and are putting them into software. 2. Fine-Grained Access Control The second anti-pattern is baking coarse-grained roles into your application. We often see this in applications where you have roles, such as "admin," "member," and "viewer." These roles are baked directly into the application code and as developers add more permissions, they try to cascade those permissions into these existing roles, which makes the authorization model hard to fine-tune. The best practice, in this case, is to start with a fine-grained authorization model that applies the principle of least privilege. The goal is to give a user only the permissions that they need, no more and no less. This is important because when the identity is compromised — and this is not a question of if, it's a question of when — we can limit the damage that this compromised identity can potentially cause by limiting the permissions that we assign to the roles that we specify. 3. Policy-Based Access Management The third anti-pattern that we see is authorization "spaghetti code," where developers have a sprinkled "switch" and "if" statements all around the code that governs the authorization logic. That's a bad idea and costs a lot when you want to change the way that authorization happens across your system. The best practice here is to maintain a clear separation of duties and keep the authorization-related logic in an authorization policy. 
By separating policy from application code, we ensure that the developer team is responsible for developing the app and the application security team is responsible for securing it. 4. Real-Time Access Checks The fourth anti-pattern is using stale permissions in an access token. This tends to occur in the early life of an app, when developers leverage scopes and then bake those scopes into the access token. Here’s a scenario: a user who has an "admin" scope logs in. That scope is baked into the access token, and wherever that user interacts with our system using an unexpired access token with the "admin" scope, that user has admin privileges. Why is this bad? Because if we want to remove the "admin" scope from the user and invalidate it, we’ll run into a hurdle. As long as the user holds an unexpired access token, they're going to have access to all the resources that the access token grants them. You simply cannot have a fine-grained access control model using access tokens. Even if the issuer of the access token has visibility into what resources the user can access, it’s impractical to stuff those entitlements into an access token. Let's say we have a collection of documents and we want to give a user read permission to a document. Which document are we talking about? All documents? Only a few of them? Clearly this approach doesn’t scale. The best practice here is never to assume that the access token has the permissions that we need and instead to have real-time access checks that take into account the identity context, the resource context, and the permission before we grant access to a protected resource. 5. Centralized Decision Logs Lastly, unauthorized access is not a question of if, it's a question of when. With that said, companies tend to neglect to maintain consistent authorization logs, which limits their ability to trace unauthorized incidents. The best practice is to have fine-grained, centralized authorization logs. We need to monitor and log everything in a centralized location that we can analyze downstream to get a better understanding of what’s happening in our system. Fine-Grained Access Control Patterns Let's talk a little bit more about fine-grained access control and how it came to be. Access Control Lists (ACL) Back in the 80s, operating systems would define permissions, such as "read," "write," and "execute," on files and folders. This pattern was called access control lists (ACLs). With ACLs, you can answer questions like: "Does Alice have `read` access to this file?" Role-Based Access Control (RBAC) RBAC, or role-based access control, came around in the 90s and early 2000s with the advent of directories like LDAP and Active Directory. These directories give you the ability to create groups and then assign users to groups, which typically correspond to a particular role in a business application. An admin would assign a user to a group to give them the appropriate permissions, and everything was done in one console. With RBAC, you can answer questions like: "Is Bob in the `Sales admin` role?" Attribute-Based Access Control (ABAC) The next evolution was attribute-based access control (ABAC), and that's where we started to move away from coarse roles and toward fine-grained access control. In the early 2000s and 2010s, we saw standards like XACML define how to construct fine-grained authorization policies. You could define permissions based on attributes, including user attributes (e.g. the department the user was in) and resource attributes (e.g.
what folder is the user trying to access?), or maybe even environmental attributes, (e.g. what is the user's geography? what is the current time and day?)With ABAC, you can answer questions like: "Is Mallory in the `Sales` department? Is the document in the `Sales` folder? And is it currently working hours in the US?" Relationship-Based Access Control (ReBAC) Last, but not least, there is the Zanzibar paper and a new authorization model called relationship-based access control (ReBAC). In this model, you define a set of subjects (typically your users or groups), a set of objects (such as organizations, directories, folders, or maybe tenants). Then you define whether a particular subject has a relationship with an object. A "viewer," "admin," or "editor" would be relationships between a user and a folder object, for example. With ReBAC, you can answer very complex questions, by traversing this relationship graph that is formed by objects, subjects, and relationships. Two Approaches to Fine-Grained Access Control Two ecosystems have emerged around the concept of fine-grained access control: 1. “Policy-as-Code”: In this paradigm, we express policies as a set of rules written in the Rego language. This is the successor to ABAC, where the Open Policy Agent (OPA) project is popular. OPA is a general-purpose decision engine and it's built for policy-based access management and ABAC. However, it has disadvantages: the language you write policies in, Rego, is a Datalog-derived language that has a high learning curve. It also doesn't help you with modeling application-specific authorization. And because it is truly general purpose, you have to build everything from rudimentary building blocks. OPA also leaves the difficult problem of getting relevant user and resource data to the decision engine as an exercise for the developer implementing the project. Getting data to the decision engine is crucial because the decision has to take place in milliseconds since it's in the critical path of every application request. And that means that you have to solve this distributed systems problem yourself if you really want to build an authorization system on top of OPA. All in all, OPA is a really good place to start, but you're going to face some hurdles. 2. “Policy-as-data”: The policy isn't stored as a set of rules, but rather it is ingrained in the data structure. The relationship graph itself is a very opinionated model, so you don't have to design the model yourself. We have "subjects," "objects," and "relationships" which gives you a lot of flexibility. If your domain model looks like Google Drive, with folders, files, and users, it's a really good place to start. On the other hand, it is still an immature ecosystem, with many competing open-source implementations. It's also difficult to combine it with other authorization models, like RBAC and ABAC. The Rise of Policy-Based Access Management Policy-based access management, as the name suggests, lifts the authorization logic out of the application code and into a policy that is its own artifact. Here’s an example of a policy written in Rego: This is really where the principle of least privilege comes into play. You can see that we're denying access, for example, in the allowed clause, until we have enough proof to grant it. The policy is going to return "allowed = false" unless we have some reason to change that to "allowed = true." On line 9, we can see that we're checking that the user's department is "Operations." 
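Here is a minimal sketch of what such a Rego policy might look like; the package name and input attribute paths are assumptions, so the line numbering mentioned above may not correspond exactly:

```rego
package sample.authz

# Deny by default: "allowed" stays false unless a rule below proves otherwise.
default allowed = false

# Grant access only when the authenticated user's department is "Operations".
allowed {
    input.user.attributes.department == "Operations"
}
```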
The express.js sketch above shows an endpoint that passes the user identity, verified by the JWT-checking middleware, to the authorizer. If the allowed decision from the policy comes back true, the middleware passes the request on to the next function; if not, it returns a 403.

There are many reasons to separate authorization logic from your application code. It gives you a separate artifact that can be stored and versioned exactly the same way we version application code: every policy change becomes part of a git change log, which provides an audit trail for the policy. With the authorization logic separated from the application code, we also adhere to the principle of separation of duties: the security team can manage the authorization policy while the development team focuses on the application. And once the policy is its own artifact, we can build it into an immutable image and sign it to maintain a secure supply chain. Here's a link to the open-source project that does just that: https://github.com/opcr-io/policy.

Real-Time Enforcement Is a Distributed Systems Problem

Real-time access checks are critical for modern authorization. Authorization is a hard problem to solve because, done correctly, it is a distributed systems problem, and distributed systems problems are not trivial. The first challenge is that our authorization service has to authorize locally because it sits in the critical path of every single application request: authorization happens for every request that tries to access a protected resource, so it requires 100% availability and millisecond latency, and for that to happen, it needs to be performed locally. At the same time, we want to manage our authorization policies and data globally, and make sure the data we base our authorization decisions on is fresh and consistent across all of our local authorizers. For this we need a control plane that manages our policies, user directory, and resource data, and ensures that every change is distributed to the edge in real time. We also want to aggregate decision logs from all the local authorizers and stream them to our preferred logging system.

Conclusion

Cloud-native authorization is a complex problem that has yet to be entirely solved, and as a result, every cloud application is reinventing the wheel. Based on our conversations, we've identified five anti-patterns and corresponding best practices for application authorization in the age of cloud computing. First, you want your authorization to be fine-grained, using whichever access control pattern fits your application and organization best, whether that is RBAC, ABAC, ReBAC, or a combination thereof. Second, you want to separate concerns and extract your access control logic from the application code into a policy that is handed over to the security team. Third, it is crucial to perform real-time access checks based on fresh user and resource information. Fourth, you want to manage all of your users, resources, policies, and relationships in one place to increase agility. Lastly, you want to collect and store authorization decision logs for compliance.
When cloud computing burst onto the scene in 2006 with the launch of AWS, it would have been hard to imagine how big it would eventually become. More than 15 years later, cloud computing has come a long way, and yet, in my view, it is only just getting started on realizing its true potential.

Why do I think this way? Recently, I came across a Gartner study that contained a couple of mind-boggling facts: more than 85% of organizations will embrace a cloud-first principle by 2025, and over 95% of new digital workloads in 2025 will be deployed on cloud-native platforms, up from 30% in 2021. Of course, numbers can be misleading. But when you are talking about 85% of organizations and 95% of all new digital workloads, that is a lot. Even if the figures are off the mark by a few points, they are still huge.

My curiosity was piqued, and I started digging into the trends that might fuel this anticipated growth. Naturally, as a software developer, I'm always interested in knowing where the industry is headed, because that's how we can prepare and hope to keep ourselves relevant. After doing some reading, I have formed an initial idea of the broad trends that are driving cloud computing and will continue to do so in the coming years. However, before I share them with you, I want to make a few points about cloud computing that can help us understand these trends much better. You may or may not agree with the points I make. Either way, I'd love it if you share your views in the comments section below.

What Made the Cloud So Popular?

I believe that the seeds of the future are sown in the past, and this holds true for cloud computing as well. So what made the cloud so popular? In my view, cloud computing democratized the ability to build applications on world-class infrastructure. You no longer need to be a multi-billion-dollar organization with an army of engineers to create applications used by millions of people. Even a startup working out of a garage can do it.

So, what was stopping the same thing from happening in the pre-cloud era? For starters, the pre-cloud era could also be labeled the on-premise era, meaning organizations typically managed their own IT infrastructure and resources. If you wanted to create an application and make it available to the world in the pre-cloud era, you had to purchase, install, and maintain hardware and software in-house. This arrangement had a couple of big technical implications:

Management of IT infrastructure such as servers, storage, and networking lay solely on the shoulders of the organization's workforce. I still vividly remember a senior colleague's anecdote from those earlier days, when he even had to fix network cables after a broken connection brought their application down in the middle of the night and the network vendor was not available.

The IT systems were not scalable on demand, since organizations were limited by physical resources. If you expected higher demand, you had to go out and buy more resources, and for that you had to be really good at predicting demand, or else you would incur extra costs for no reason.

Of course, when do higher-ups in organizations worry about tech issues unless there is a threat to the company's bottom line? In this case, however, there were threats. For starters, on-premise computing is a costly business: you need a significant up-front investment in hardware and software to build a data center.
Initially, big companies loved this situation: it was a huge barrier to entry for smaller players. However, once the genie was out of the bottle and the first cloud offerings came on the scene, the huge cost of on-premise computing became a liability. Suddenly, the army of engineers hired just to keep the infrastructure running started to look like a money-guzzling machine. In more disruptive industries, startups with a skeleton crew of software engineers were leap-frogging established players by using early cloud tools to drive faster innovation and reduce time-to-market. This meant a loss of market share and growth opportunities for the big companies. Of course, none of this happened in a single day, a month, or even a year. But slowly and steadily, large organizations also started to steer their ships in the direction of cloud computing, and once that happened, there was no looking back.

Evolution of Cloud Computing

It was like discovering an untapped oil field right next to your front door.

So, Where Are Things Headed?

Predicting the direction of a particular technology can be a fool's errand. In 2006, not even the creators of AWS would have predicted the kind of growth they have seen. But it's important to make an educated guess so that we are better prepared for what's coming in the next few years. Here are a few broad trends I'm tracking:

Hybrid Cloud Adoption

This one's a biggie, as it is driven mostly by the large organizations that run the world: think big banks, government organizations, and mega-corporations. The trend is largely driven by growing regulatory and legal requirements around data and increasing privacy concerns all across the world. In a hybrid cloud setup, companies keep a mix of capabilities across external cloud platforms as well as in-house setups. The idea is to use the public platforms for new, innovative products but keep core business capabilities or data in in-house data centers so that they don't run afoul of government regulations. Since it involves big money, I feel hybrid cloud adoption is only going to grow. Already, big cloud providers are rolling out products to support this vision: Red Hat has been offering its flagship OpenShift platform as an on-premise solution for many years now, Microsoft has launched Azure Arc to cater to hybrid and multi-cloud requirements, and Google has launched Anthos, a platform that promises a single, consistent way of managing Kubernetes workloads across public and on-premises clouds.

Multi-Cloud Adoption

"Don't keep all your eggs in one basket." Organizations are increasingly taking this adage to heart and exploring a multi-cloud approach. A large part of multi-cloud adoption is driven by risk mitigation. For example, the fintech organization Form3 was compelled to go for a multi-cloud setup when regulators questioned the portability of its platform in case AWS went down. However, some of the shift to multi-cloud is also a result of increased competition and broader service offerings from different cloud vendors. Even beyond the Big 3, there are dozens of other cloud providers offering all manner of cloud services to lure customers with cost or features. Organizations are spoiled for choice and are trying to get the best ROI for every piece of their infrastructure. I feel this trend is going to accelerate in the future. The difficulties of managing a multi-cloud setup could have curtailed this movement.
However, instead of getting bogged down, the demand for multi-cloud and hybrid cloud setups has spurred a number of new trends, such as the rise of infrastructure-as-code tools and the concept of platform engineering. I will discuss more of them in upcoming posts.

Serverless Computing

One of the main factors that worked in favor of the cloud in the early days was cost savings. The idea that you could launch a product with close-to-zero costs was hard to beat and created a tremendous rush toward cloud adoption. In my view, serverless has the potential to make even traditional cloud offerings look costly. Though a few years have passed since serverless options were launched by most major cloud providers, I feel we are only at the beginning of the serverless revolution. Since serverless computing allows companies to run code without provisioning or managing servers, it is extremely attractive to organizations that want to save on costs and move faster. With the tremendous rise in the number of SaaS startups and an inflationary environment with rising interest rates, the cost of running your system is a big issue. Organizations are looking to achieve product-market fit without burning through too much cash, and serverless computing looks like a good deal with its pay-per-use model and little to no maintenance expenditure.

AI-Driven Cloud

Apart from cloud computing, the last decade or so has seen another major trend spread like wildfire: the rise of machine learning and artificial intelligence. As AI seeps into more and more areas and supports real requirements, it is already promising to augment cloud services in really interesting ways. For example, AI-driven cloud services could make autonomous decisions about when to scale up or down based on an intuitive understanding of demand rather than fixed rules. Again, this boils down to monetary benefits, with the promise of better cost utilization. Of course, we can only hope that one of these services doesn't turn into Skynet any time soon! Either way, I'll be keeping a close eye on this developing trend.

Containers as a Service

Containers on the cloud started quite early, with Amazon launching ECS. Of course, managing a bunch of containers isn't the easiest thing out there. The surging popularity of Kubernetes has changed the landscape of container orchestration, and in no time, all major and minor cloud providers were offering managed Kubernetes services. This is one area where both big and small organizations are lapping up the opportunity. After all, everyone wants to reap the benefits of containerization without acquiring the headache of managing containers themselves. As developers, it is definitely important to keep abreast of this trend.

That's It for Now!

In the end, I feel we are living in interesting times when it comes to cloud computing. The technology is at the right level of maturity: mainstream enough to support a large base of innovation, yet not so dormant that things become boring and static. To top it off, cloud computing is also fueling other trends in areas such as microservices architecture, DevOps, infrastructure-as-code, and platform engineering.
Boris Zaikin, Senior Software Cloud Architect, Nordcloud GmbH
Ranga Karanam, Best-Selling Instructor on Udemy with 1 Million Students, in28Minutes.com
Samir Behara, Senior Cloud Infrastructure Architect, AWS
Pratik Prakash, Master Software Engineer (SDE-IV), Capital One