Cloud Architecture

Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!

DZone's Featured Cloud Architecture Resources

Keep Your Application Secrets Secret

By Istvan Zoltan Nagy
There is a common problem most backend developers face at least once in their careers: where should we store our secrets? It appears simple enough; there are plenty of services focusing on this very issue, so we just need to pick one and get on to the next task. Sounds easy, but how can we pick the right solution for our needs? We should evaluate our options to see more clearly.

The Test

For the demonstration, we can take a simple Spring Boot application as an example. This will be perfect for us because it is one of the most popular technology choices on the backend today. In our example, we will assume we need to use a MySQL database over JDBC; therefore, our secrets will be the connection URL, driver class name, username, and password. This is only a proof of concept; any dependency would do as long as it uses secrets. We can easily generate such a project using Spring Initializr. We will get the DataSource auto-configured and then create a bean that will do the connection test. The test can look like this:

Java

@Component
public class MySqlConnectionCheck {

    private final DataSource dataSource;

    @Autowired
    public MySqlConnectionCheck(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void verifyConnectivity() throws SQLException {
        try (final Connection connection = dataSource.getConnection()) {
            query(connection);
        }
    }

    private void query(Connection connection) throws SQLException {
        final String sql = "SELECT CONCAT(@@version_comment, ' - ', VERSION()) FROM DUAL";
        try (final ResultSet resultSet = connection.prepareStatement(sql).executeQuery()) {
            resultSet.next();
            final String value = resultSet.getString(1);
            //write something that will be visible on the Gradle output
            System.err.println(value);
        }
    }
}

This class will establish a connection to MySQL and make sure we are, in fact, using MySQL, as it will print the MySQL version comment and version. This way we would notice our mistake even if an auto-configured H2 instance was used by the application. Furthermore, if we generate a random password for our MySQL Docker container, we can make sure we are using the instance we wanted, validating that the whole configuration works properly. Back to the problem, shall we?

Storing Secrets

The Easy Way

The most trivial option is to store the secrets together with the code, either hard-coded or as a configuration property, using some profiles to be able to use separate environments (dev/test/staging/prod). As simple as it is, this is a horrible idea, as many popular sites had to learn the hard way over the years. These “secrets” are anything but secret. As soon as someone gets access to a repository, they will have the credentials to the production database. Adding insult to injury, we won’t even know about it! This is the most common cause of data breaches. A good indicator of the seriousness of the situation is how common secret scanning offerings have become on GitHub, GitLab, Bitbucket, and other services hosting Git repositories.

The Right Way

Now that we see what the problem is, we can start to look for better options. There is one common thing we will notice in all the solutions we can use: they want us to store our secrets in an external service that will keep them secure. This comes with a lot of benefits these services can provide, such as:

- Solid access control.
- Encrypted secrets (and sometimes more, like certificates and keys).
- Auditable access logs.
- A way to revoke access/rotate secrets in case of a suspected breach.
- Natural separation of environments, as they are part of the stack (one secrets manager per env).

Sounds great, did we solve everything? Well, it is not that simple. We have some new questions we need to answer first:

- Who will host and maintain these?
- Where should we put the secrets we need for authentication when we want to access the secrets manager?
- How will we run our code locally on the developer laptops?
- How will we run our tests on CI?
- Will it cost anything?

These are not trivial questions, and their answers depend very much on the solution we want to use. Let us review them one by one in the next section.

Examples of Secrets Managers

In all cases below, we will introduce the secrets manager as a new component of our stack, so if we had an application and a database, it would look like the following diagram.

HashiCorp Vault

If we go for the popular open-source option, HashiCorp Vault, then we can either self-host or use their managed service, HCP Vault. Depending on the variant we select, we may or may not have some maintenance effort already, but it answers the first question. Answering the rest should be easy as well. Regarding the authentication piece, we can use, for example, the AppRole Auth Method, using environment variables to provide the necessary credentials to our application instances in each environment. Regarding the local and CI execution, we can simply configure and run a Vault instance in dev server mode on the machine where the app should run and pass the necessary credentials using environment variables, similarly to the live app instances. As these are local vaults providing access to throw-away dev databases, we should not worry too much about their security, and we should avoid storing meaningful data in them. To avoid spending a lot of effort on maintaining these local/CI vault instances, it can be a clever idea to store their contents in a central location and let each developer update their vault using a single command every now and then. Regarding the cost, it depends on a few things. If you can go with the self-hosted open-source option, you should worry only about the VM cost (and the time spent on maintenance); otherwise, you might need to figure out how you can optimize the license/support cost.

Cloud-Based Solutions

If we are hosting our services with one of the three big cloud providers, we have even more options. AWS, Azure, and Google Cloud all offer a managed secrets manager service. Probably because of the nature of the problem, AWS Secrets Manager, Azure Key Vault, and Google Cloud Secret Manager share many similarities. For example, each of them:

- Stores versioned secrets.
- Logs access to the service and its contents.
- Uses solid authentication and authorization features.
- Is well integrated with other managed services of the same provider.
- Provides an SDK for developers in some popular languages.

At the same time, we should keep in mind that these are still hugely different services. Some of the obvious differences are the API they use for communication and the additional features they provide. For example, Azure Key Vault can store secrets, keys, and certificates, while AWS and GCP provide separate managed services for these additional features. Thinking about the questions we wanted to answer, all of them can answer the first two the same way: they are managed services, and the managed identity solution of the cloud provider they belong to is the most convenient, secure way to access them.
Thanks to this, we do not need to bother storing secrets/tokens in our application configuration, just the URL of the secrets manager, which is not considered a secret. Regarding the cost, AWS and GCP charge by the number of secrets and the number of API calls, while Azure only charges for the latter. In general, they are very reasonably priced, and we can sleep better at night knowing our security posture is a bit better.

Trouble starts when we try to answer the remaining two questions dealing with the local and CI use cases. All three solutions can be accessed from the outside world (given the proper network configuration), but simply punching holes in a firewall and sharing the same secrets manager credentials is not an ideal solution. There are situations when doing so is simply not practical, such as the following cases:

- Our team is scattered around the globe working from home, so we would not be able to use strong IP restrictions, or we would need a constant VPN connection just to build/test the code. Needing an internet connection for tests is bad enough, but using a VPN constantly while at work puts additional stress on the infrastructure and the team at the same time.
- Our CI instances spawn with random IPs from an unknown range, so we cannot set proper IP restrictions. A similar case to the previous one.
- We cannot trust the whole team with the secrets of the shared secrets manager. For example, in the case of open-source projects, we cannot run around and share a secrets manager instance with the rest of the world.
- We need to change the contents of the secrets manager during the tests. When this happens, we are risking isolation problems between each developer and CI instance. We cannot launch a different secrets manager instance for each person and process (or test case), as that would not be very scalable.
- We do not want to pay extra for the additional secrets managers used in these cases.

Can We Fake It Locally?

Usually, this would be the moment when I start to search for a suitable test double and formulate plans about using that instead of the real service locally and on CI. What do we expect from such a test double?

- Behave like the real service would, including in exceptional situations.
- Be actively maintained to reduce the risk of lagging behind in case of API version changes in the real service.
- Have a way to initialize the content of the secrets manager double on start-up, to avoid needing additional code in the application.
- Allow us to synchronize the secret values between the team and CI instances to reduce maintenance cost.
- Let us start and throw away the test double easily, both locally and on CI.
- Not use a lot of resources.
- Not introduce additional dependencies to our application if possible.

I know about third-party solutions ticking all the boxes in the case of AWS and Azure, while I have failed to locate one for GCP.

Solving the Local Use Case for Each Secrets Manager in Practice

It is finally time for us to roll up our sleeves and get our hands dirty. How should we modify our test project to be able to use our secrets manager integrations locally? Let us see for each of them.

HashiCorp Vault

Since we can run the real thing locally, getting a test double is pointless.
We can simply integrate Vault using the Spring Vault module by adding a property source:

Java

@Component("SecretPropertySource")
@VaultPropertySource(value = "secret/datasource", propertyNamePrefix = "spring.datasource.")
public class SecretPropertySource {

    private String url;
    private String username;
    private String password;
    private String driverClassName;

    // ... getters and setters ...
}

As well as a configuration for the “dev” profile:

Java

@Configuration
@Profile("dev")
public class DevClientConfig extends AbstractVaultConfiguration {

    @Override
    public VaultEndpoint vaultEndpoint() {
        final String uri = getEnvironment().getRequiredProperty("app.secrets.url");
        return VaultEndpoint.from(URI.create(uri));
    }

    @Override
    public ClientAuthentication clientAuthentication() {
        final String token = getEnvironment().getRequiredProperty("app.secrets.token");
        return new TokenAuthentication(token);
    }

    @Override
    public VaultTemplate vaultTemplate() {
        final VaultTemplate vaultTemplate = super.vaultTemplate();
        final SecretPropertySource datasourceProperties = new SecretPropertySource();
        datasourceProperties.setUrl("jdbc:mysql://localhost:15306/");
        datasourceProperties.setDriverClassName("com.mysql.cj.jdbc.Driver");
        datasourceProperties.setUsername("root");
        datasourceProperties.setPassword("16276ec1-a682-4022-b859-38797969abc6");
        vaultTemplate.write("secret/datasource", datasourceProperties);
        return vaultTemplate;
    }
}

We need to be careful, as each bean depending on the fetched secret values (or the DataSource) must be marked with @DependsOn("SecretPropertySource") to make sure it will not be populated earlier during start-up, while the Vault-backed PropertySource is not yet registered. As for the reason we used a “dev”-specific profile, it was necessary because of two things:

- The additional initialization of the vault contents on start-up.
- The simplified authentication, as we are using a simple token instead of the aforementioned AppRole.

Performing the initialization here solves the worries about the maintenance of the vault contents, as the code takes care of it, and we did not need any additional dependencies either. Of course, it would have been even better if we used some Docker magic to add those values without ever needing to touch Java. This might be an improvement for later. Speaking of Docker, the Docker Compose file is simple, as seen below:

YAML

version: "3"
services:
  vault:
    container_name: self-hosted-vault-example
    image: vault
    ports:
      - '18201:18201'
    restart: always
    cap_add:
      - IPC_LOCK
    entrypoint: vault server -dev-kv-v1 -config=/vault/config/vault.hcl
    volumes:
      - config-import:/vault/config:ro
    environment:
      VAULT_DEV_ROOT_TOKEN_ID: 00000000-0000-0000-0000-000000000000
      VAULT_TOKEN: 00000000-0000-0000-0000-000000000000
  # ... MySQL config ...
volumes:
  config-import:
    driver: local
    driver_opts:
      type: "none"
      o: "bind"
      device: "vault"

The key points to remember are the dev mode in the entry point, the volume config that will allow us to add the configuration file, and the environment variables baking in the dummy credentials we will use in the application. As for the configuration, we need to set in-memory mode and configure an HTTP endpoint without TLS:

disable_mlock = true
storage "inmem" {}

listener "tcp" {
  address     = "0.0.0.0:18201"
  tls_disable = 1
}

ui = true
max_lease_ttl = "7200h"
default_lease_ttl = "7200h"
api_addr = "http://127.0.0.1:18201"
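As noted above, it would be even nicer to seed the dev Vault without touching Java. As a rough sketch of that idea (not part of the original example project), a one-off script could write the same keys the @VaultPropertySource expects; the key names and dummy token simply mirror the values used above:

Shell

#!/bin/bash
# Hypothetical seeding script, assuming the dev-mode Vault from the Compose file above
# is reachable on port 18201 with the dummy root token.
export VAULT_ADDR='http://127.0.0.1:18201'
export VAULT_TOKEN='00000000-0000-0000-0000-000000000000'

# Write the keys the SecretPropertySource expects under secret/datasource (KV v1).
vault kv put secret/datasource \
  url='jdbc:mysql://localhost:15306/' \
  driverClassName='com.mysql.cj.jdbc.Driver' \
  username='root' \
  password='16276ec1-a682-4022-b859-38797969abc6'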
A more complex application might need some changes in the vault configuration or the Docker Compose content; however, for this simple example, we should be fine. Running the project should produce the expected output:

MySQL Community Server - GPL - 8.0.32

We are done with configuring Vault for local use. Setting it up for tests should be even simpler using the things we have learned here. Also, we can simplify some of the steps there if we decide to use the relevant Testcontainers module.

Google Cloud Secret Manager

As there is no readily available test double for Google Cloud Secret Manager, we need to make a trade-off. We can choose from the following three options:

- We can fall back to the easy option for the local/CI case, disabling the logic that fetches the secrets for us in any real environment. In this case, we will not know whether the integration works until we deploy the application somewhere.
- We can decide to use some shared Secret Manager instances, or even let every developer create one for themselves. This can solve the problem locally, but it is inconvenient compared to the solution we wanted, and we would need to avoid running our CI tests in parallel and clean up perfectly in case the content of the Secret Manager must change on CI.
- We can try mocking/stubbing the necessary endpoints of the Secret Manager ourselves. WireMock can be a good start for the HTTP API, or we can even start from nothing. It is a worthy endeavor for sure, but it will take a lot of time to do well. Also, if we do this, we must consider the ongoing maintenance effort.

As the decision will require quite different solutions for each, there is not much we can solve in general.

AWS Secrets Manager

Things are better in the case of AWS, where LocalStack is a tried-and-true test double with many features. Chances are that if you are using other AWS managed services in your application, you will be using LocalStack already, making this even more appealing. Let us make some changes to our demo application to demonstrate how simple it is to implement the AWS Secrets Manager integration as well as to use LocalStack locally.

Fetching the Secrets

First, we need a class that will know the names of the secrets in the Secrets Manager:

Java

@Configuration
@ConfigurationProperties(prefix = "app.secrets.key.db")
public class SecretAccessProperties {

    private String url;
    private String username;
    private String password;
    private String driver;

    // ... getters and setters ...
}

This will read the configuration and let us conveniently access the names of each secret by a simple method call.
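For illustration only (the original article does not show this file), the configuration backing SecretAccessProperties could map each property to the secret names created by the LocalStack init script shown later:

Properties files

# Hypothetical application-dev.properties entries; the values must match the
# secret names created in Secrets Manager by the init script below.
app.secrets.key.db.url=database-connection-url
app.secrets.key.db.username=database-username
app.secrets.key.db.password=database-password
app.secrets.key.db.driver=database-driver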
Next, we need to implement a class that will handle communication with the Secrets Manager:

Java

@Component("SecretPropertySource")
public class SecretPropertySource extends EnumerablePropertySource<Map<String, String>> {

    private final AWSSecretsManager client;
    private final Map<String, String> mapping;
    private final Map<String, String> cache;

    @Autowired
    public SecretPropertySource(SecretAccessProperties properties,
                                final AWSSecretsManager client,
                                final ConfigurableEnvironment environment) {
        super("aws-secrets");
        this.client = client;
        mapping = Map.of(
                "spring.datasource.driver-class-name", properties.getDriver(),
                "spring.datasource.url", properties.getUrl(),
                "spring.datasource.username", properties.getUsername(),
                "spring.datasource.password", properties.getPassword()
        );
        environment.getPropertySources().addFirst(this);
        cache = new ConcurrentHashMap<>();
    }

    @Override
    public String[] getPropertyNames() {
        return mapping.keySet().toArray(new String[0]);
    }

    @Override
    public String getProperty(String property) {
        if (!Arrays.asList(getPropertyNames()).contains(property)) {
            return null;
        }
        final String key = mapping.get(property);
        //not using computeIfAbsent to avoid locking map while the value is resolved
        if (!cache.containsKey(key)) {
            cache.put(key, client
                    .getSecretValue(new GetSecretValueRequest().withSecretId(key))
                    .getSecretString());
        }
        return cache.get(key);
    }
}

This PropertySource implementation knows how each secret name translates to the Spring Boot configuration properties used for the DataSource configuration, self-registers as the first property source, and caches the result whenever a known property is fetched. We need to use the @DependsOn annotation the same way as in the Vault example to make sure the properties are fetched in time.

As we need to use basic authentication with LocalStack, we need to implement one more class, which will only run in the “dev” profile:

Java

@Configuration
@Profile("dev")
public class DevClientConfig {

    @Value("${app.secrets.url}")
    private String managerUrl;
    @Value("${app.secrets.accessKey}")
    private String managerAccessKey;
    @Value("${app.secrets.secretKey}")
    private String managerSecretKey;

    @Bean
    public AWSSecretsManager secretClient() {
        final EndpointConfiguration endpointConfiguration =
                new EndpointConfiguration(managerUrl, Regions.DEFAULT_REGION.getName());
        final BasicAWSCredentials credentials =
                new BasicAWSCredentials(managerAccessKey, managerSecretKey);
        return AWSSecretsManagerClientBuilder.standard()
                .withEndpointConfiguration(endpointConfiguration)
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .build();
    }
}

Our only goal with this service is to set up a suitable AWSSecretsManager bean just for local use.
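To make the @DependsOn note concrete, here is how the connection-check bean from the introduction could be annotated; this is an illustrative repetition of that class rather than an additional class from the original project:

Java

// Illustrative repetition of the connection-check bean, now marked as described above so it is not
// created before the secret-backed property source registers itself.
@Component
@DependsOn("SecretPropertySource")
public class MySqlConnectionCheck {

    private final DataSource dataSource;

    @Autowired
    public MySqlConnectionCheck(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // ... verifyConnectivity() and query() exactly as shown at the beginning of the article ...
}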
Setting Up the Test Double

With the coding done, we need to make sure LocalStack is started using Docker Compose whenever we start our Spring Boot app locally and stopped when we are done. Starting with the Docker Compose part, we need it to start LocalStack and use the built-in mechanism for running an initialization script when the container starts, using the approach shared here. To do so, we need a script that can add the secrets:

Shell

#!/bin/bash
echo "########### Creating profile ###########"
aws configure set aws_access_key_id default_access_key --profile=localstack
aws configure set aws_secret_access_key default_secret_key --profile=localstack
aws configure set region us-west-2 --profile=localstack

echo "########### Listing profile ###########"
aws configure list --profile=localstack

echo "########### Creating secrets ###########"
aws secretsmanager create-secret --endpoint-url=http://localhost:4566 \
  --name database-connection-url \
  --secret-string "jdbc:mysql://localhost:13306/" --profile=localstack || echo "ERROR"
aws secretsmanager create-secret --endpoint-url=http://localhost:4566 \
  --name database-driver \
  --secret-string "com.mysql.cj.jdbc.Driver" --profile=localstack || echo "ERROR"
aws secretsmanager create-secret --endpoint-url=http://localhost:4566 \
  --name database-username \
  --secret-string "root" --profile=localstack || echo "ERROR"
aws secretsmanager create-secret --endpoint-url=http://localhost:4566 \
  --name database-password \
  --secret-string "e8ce8764-dad6-41de-a2fc-ef905bda44fb" --profile=localstack || echo "ERROR"

echo "########### Secrets created ###########"

This will configure the bundled AWS CLI inside the container and perform the necessary HTTP calls to port 4566, where the container listens. To let LocalStack use our script, we will need to start our container with a volume attached. We can do so using the following Docker Compose configuration:

YAML

version: "3"
services:
  localstack:
    container_name: aws-example-localstack
    image: localstack/localstack:latest
    ports:
      - "14566:4566"
    environment:
      LAMBDA_DOCKER_NETWORK: 'my-local-aws-network'
      LAMBDA_REMOTE_DOCKER: 0
      SERVICES: 'secretsmanager'
      DEFAULT_REGION: 'us-west-2'
    volumes:
      - secrets-import:/docker-entrypoint-initaws.d:ro
  # ... MySQL config ...
volumes:
  secrets-import:
    driver: local
    driver_opts:
      type: "none"
      o: "bind"
      device: "localstack"

This will set up the volume, start LocalStack with the “secretsmanager” feature active, and map port 4566 from the container to port 14566 on the host so that our AWSSecretsManager can access it using the following configuration:

Properties files

app.secrets.url=http://localhost:14566
app.secrets.accessKey=none
app.secrets.secretKey=none

If we run the project, we will see the expected output:

MySQL Community Server - GPL - 8.0.32

Well done, we have successfully configured our local environment. We can easily replicate these steps for the tests as well. We can even create multiple throw-away containers from our tests, for example, using Testcontainers.

Azure Key Vault

Implementing the Azure Key Vault solution will look like a cheap copy-paste job after the AWS Secrets Manager example we have just implemented above.

Fetching the Secrets

We have the same SecretAccessProperties class for the same reason. The only meaningful difference in SecretPropertySource is the fact that we are using the Azure SDK.
The changed method will be this:

Java

@Override
public String getProperty(String property) {
    if (!Arrays.asList(getPropertyNames()).contains(property)) {
        return null;
    }
    final String key = mapping.get(property);
    //not using computeIfAbsent to avoid locking map while the value is resolved
    if (!cache.containsKey(key)) {
        cache.put(key, client.getSecret(key).getValue());
    }
    return cache.get(key);
}

The only missing piece is the “dev”-specific client configuration that will create a dummy token and an Azure Key Vault SecretClient for us:

Java

@Configuration
@Profile("dev")
public class DevClientConfig {

    @Value("${app.secrets.url}")
    private String vaultUrl;
    @Value("${app.secrets.user}")
    private String vaultUser;
    @Value("${app.secrets.pass}")
    private String vaultPass;

    @Bean
    public SecretClient secretClient() {
        return new SecretClientBuilder()
                .credential(new BasicAuthenticationCredential(vaultUser, vaultPass))
                .vaultUrl(vaultUrl)
                .disableChallengeResourceVerification()
                .buildClient();
    }
}

With this, the Java-side changes are complete; we can add the missing configuration and the application is ready:

Properties files

app.secrets.url=https://localhost:10443
app.secrets.user=dummy
app.secrets.pass=dummy

The file contents are self-explanatory: we have some dummy credentials for the simulated authentication and a URL for accessing the vault.

Setting Up the Test Double

Although setting up the test double will be like the LocalStack solution we implemented above, it will not be the same. We will use Lowkey Vault, a fake that implements the API endpoints we need and more. As Lowkey Vault provides a way for us to import the vault contents using an attached volume, we can start by creating an import file containing the properties we will need:

{
  "vaults": [
    {
      "attributes": {
        "baseUri": "https://{{host}}:{{port}}",
        "recoveryLevel": "Recoverable+Purgeable",
        "recoverableDays": 90,
        "created": {{now 0}},
        "deleted": null
      },
      "keys": {},
      "secrets": {
        "database-connection-url": {
          "versions": [
            {
              "vaultBaseUri": "https://{{host}}:{{port}}",
              "entityId": "database-connection-url",
              "entityVersion": "00000000000000000000000000000001",
              "attributes": {
                "enabled": true,
                "created": {{now 0}},
                "updated": {{now 0}},
                "recoveryLevel": "Recoverable+Purgeable",
                "recoverableDays": 90
              },
              "tags": {},
              "managed": false,
              "value": "jdbc:mysql://localhost:23306/",
              "contentType": "text/plain"
            }
          ]
        },
        "database-username": {
          "versions": [
            {
              "vaultBaseUri": "https://{{host}}:{{port}}",
              "entityId": "database-username",
              "entityVersion": "00000000000000000000000000000001",
              "attributes": {
                "enabled": true,
                "created": {{now 0}},
                "updated": {{now 0}},
                "recoveryLevel": "Recoverable+Purgeable",
                "recoverableDays": 90
              },
              "tags": {},
              "managed": false,
              "value": "root",
              "contentType": "text/plain"
            }
          ]
        },
        "database-password": {
          "versions": [
            {
              "vaultBaseUri": "https://{{host}}:{{port}}",
              "entityId": "database-password",
              "entityVersion": "00000000000000000000000000000001",
              "attributes": {
                "enabled": true,
                "created": {{now 0}},
                "updated": {{now 0}},
                "recoveryLevel": "Recoverable+Purgeable",
                "recoverableDays": 90
              },
              "tags": {},
              "managed": false,
              "value": "5b8538b6-2bf1-4d38-94f0-308d4fbb757b",
              "contentType": "text/plain"
            }
          ]
        },
        "database-driver": {
          "versions": [
            {
              "vaultBaseUri": "https://{{host}}:{{port}}",
              "entityId": "database-driver",
              "entityVersion": "00000000000000000000000000000001",
              "attributes": {
                "enabled": true,
                "created": {{now 0}},
                "updated": {{now 0}},
                "recoveryLevel": "Recoverable+Purgeable",
                "recoverableDays": 90
              },
              "tags": {},
              "managed": false,
              "value": "com.mysql.cj.jdbc.Driver",
              "contentType": "text/plain"
            }
          ]
        }
      }
    }
  ]
}

This is a Handlebars template that allows us to use placeholders for the host name, port, and the created/updated/etc. timestamp fields. We must use the {{port}} placeholder as we want to make sure we can use any port when we start our container, but the rest of the placeholders are optional; we could have just written a literal there. See the quick start documentation for more information. Starting the container has a similar complexity as in the case of the AWS example:

YAML

version: "3"
services:
  lowkey-vault:
    container_name: akv-example-lowkey-vault
    image: nagyesta/lowkey-vault:1.18.0
    ports:
      - "10443:10443"
    volumes:
      - vault-import:/import/:ro
    environment:
      LOWKEY_ARGS: >
        --server.port=10443
        --LOWKEY_VAULT_NAMES=-
        --LOWKEY_IMPORT_LOCATION=/import/keyvault.json.hbs
  # ... MySQL config ...
volumes:
  vault-import:
    driver: local
    driver_opts:
      type: "none"
      o: "bind"
      device: "lowkey-vault/import"

We need to notice almost the same things as before: the port number is set, and the Handlebars template will use the server.port parameter and localhost by default, so the import should work once we have attached the volume using the same approach as before. The only remaining step is configuring our application to trust the self-signed certificate of the test double, which is used for providing an HTTPS connection. This can be done by using the PKCS#12 store from the Lowkey Vault repository and telling Java that it should be trusted:

Groovy

bootRun {
    systemProperty("javax.net.ssl.trustStore", file("${projectDir}/local/local-certs.p12"))
    systemProperty("javax.net.ssl.trustStorePassword", "changeit")
    systemProperty("spring.profiles.active", "dev")
    dependsOn tasks.composeUp
    finalizedBy tasks.composeDown
}

Running the project will log the expected string as before:

MySQL Community Server - GPL - 8.0.32

Congratulations, we can run our app without the real Azure Key Vault. Same as before, we can use Testcontainers for our tests; but in this case, the Lowkey Vault module is a third-party module from the Lowkey Vault project home, so it is not in the list provided by the Testcontainers project.

Summary

We have established that keeping secrets in the repository defeats the purpose. Then, we have seen multiple options for solving the problem identified at the beginning, so we can select the best secrets manager depending on our context. Also, we can tackle the local and CI use cases using the examples shown above. The full example projects can be found on GitHub here.
Multi-Tenant Architecture for a SaaS Application on AWS

By Alfonso Valdes
SaaS applications are the new normal nowadays, and software providers are looking to transform their applications into Software as a Service applications. For this, the solution is to build a multi-tenant SaaS architecture.

Have you ever wondered how Slack, Salesforce, AWS (Amazon Web Services), and Zendesk can serve multiple organizations? Does each one have its own unique, custom cloud software per customer? For example, have you ever noticed that, on Slack, you have your own URL, “yourcompanyname.slack.com”? Most people think that, in the background, a particular environment was created for each organization (application or codebase), and believe that Slack customers each have their own server/app environment. If this is you, you might have assumed they have a repeatable process to run thousands of apps across all their customers. Well, no. The real solution is a multi-tenant architecture on AWS for a SaaS application.

Let’s start with this impressive fact: 70% of all web apps are considered SaaS applications, according to IDC Research. So, if you know about SaaS architecture and multi-tenancy, you are probably covering 70% of the web app architecture landscape of the future.

“70% of all web apps are SaaS, but only a few of them are multi-tenant.”

This article showcases an overview of the strategies, challenges, and constraints that DevOps engineers and software developers are likely to face when architecting a multi-tenant SaaS application.

What Is Multi-Tenant Architecture?

First of all, you need to understand what single-tenant and multi-tenant architectures are:

- Single-tenant architecture (siloed model): a single architecture per organization, where the application has its own infrastructure, hardware, and software ecosystem. Let’s say you have ten organizations; in this case, you would need to create ten standalone environments, and your SaaS application, or company, would function as a single-tenant architecture. Additionally, it implies more cost, more maintenance, and a level of difficulty in updating across the environments.
- Multi-tenant architecture: an ecosystem or model where a single environment can serve multiple tenants utilizing a scalable, available, and resilient architecture. The underlying infrastructure is completely shared, logically isolated, and has fully centralized services. The multi-tenant architecture evolves according to the organization or subdomain (organization.saas.com) that is logged into the SaaS application, and is totally transparent to the end user.

Bear in mind that in this article, we will discuss two multi-tenant architecture models, one for the application layer and one for the database layer.

Multi-Tenant Benefits

The adoption of a multi-tenant architecture approach will bring extensive, valuable benefits for your SaaS application. Let’s go through the main contributions:

- A reduction of server infrastructure costs: Instead of creating a SaaS environment per customer, you include one application environment for all your customers. This enables your AWS hosting costs to be dramatically reduced, from hundreds of servers to a single one.
- One single source of truth: Let’s say again that you have a customer using your SaaS. Imagine how many code repositories you would have per customer: at least 3-4 branches per customer, which would mean a lot of overhead and misaligned code releases. Even worse, visualize the process of deploying your code to the entire farm of tenants; it is extremely complicated. This is unviable and time-consuming. With a multi-tenant SaaS architecture, you avoid this type of conflict: you’ll have one codebase (the source of truth) and a code repository with a few branches (dev/test/prod). By following the practices below, with a single command (one-click deployment), you can perform the deployment process in a few seconds.
- Cost reductions in development and time-to-market: Cost reduction comes from a sequence of decisions, such as having a single codebase, a SaaS platform environment, a multi-tenant database architecture, centralized storage, APIs, and following the Twelve-Factor Methodology. All of them will allow you to reduce development labor costs and time-to-market, and gain operational efficiencies.

SaaS Technology Stack for an Architecture on AWS

To build a multi-tenant architecture, you need to integrate the correct web stack (OS, language, libraries, and services) with AWS technologies. This is just the first step towards creating a next-generation multi-tenant architecture. Even though we will surface a few other multi-tenant architecture best practices, this article is primarily oriented around this AWS SaaS web stack. Let’s dive into our SaaS technology stack on AWS.

Programming Language

It doesn’t really matter which language platform you select. What is vital is that your application can scale, utilizes multi-tenant architecture best practices and cloud-native principles, and uses a language well known by the open-source community. The latest trend for building SaaS applications is Python + React + AWS. Another “variant” is Node.js + React + AWS, but in the end, the common denominators are always AWS and React. If you are a financial, ML, or AI company with complex algorithms or backend work, I’d say you should go for Python. On the other hand, if you are using modern technologies like real-time chats, mini feeds, or streaming, then go for Node.js. There is a market in the banking sector that is leveraging Java, but that’s for established enterprises; any new SaaS application is better off with the mentioned web stack. Again, this is just what I’ve noticed as a trend, and what the community is demanding. Note: This data comes from a survey we performed a few months ago for financial services and SaaS companies.

Ideal Languages

Cloud Provider

As a team of DevOps experts, I’ve noticed a cloud variation in the last two years that corresponds to these percentages: 70% of our DevOps implementations are based on AWS, 25% on Azure, and 5% go to GCP and DigitalOcean. Each year the trend is similar, with the exception that Azure is gradually growing over the years. Those are not only my words, but also ideas supported by multiple DevOps partners. So, I strongly recommend deploying your SaaS application on AWS. It has a number of benefits; every day there is a new service available for you and a new feature that facilitates your development and deployment.

Microservices

If you are planning to leverage the cloud, you must leverage cloud-native principles. One of these principles is to incorporate microservices with Docker.
Make sure your SaaS application is built on microservices, which bring multiple benefits, including flexibility, standardization, easier troubleshooting, problem isolation, and portability. Just like the cloud, Docker and microservices have transformed the IT ecosystem and will stay for a long while.

Container Orchestration Platform

This is a complicated and abstract decision; there are three options in AWS to manage, orchestrate, and create a microservice cluster environment:

- Amazon ECS: The natural Amazon container orchestration system in the AWS ecosystem. (Highly recommended for startups and small-to-medium SaaS.)
- AWS Fargate: Almost serverless; pricing and management are per task, with minimal operational effort vs. ECS. Some studies conducted by our DevOps team show that, in terms of performance, Fargate can be slower than ECS, so for that particular case I would recommend Amazon ECS instead of Fargate. Another thought is that if your team is made up purely of developers and you are not planning to hire a DevOps engineer, perhaps Fargate is the way to go.
- Amazon EKS: A managed service that makes Kubernetes on AWS easy to manage. Use Amazon EKS instead of deploying a Kubernetes cluster on EC2 instances and setting up the Kubernetes networking and worker nodes yourself. (Recommended for large SaaS apps and a sophisticated DevOps and web development team.)

Database

The default database will be PostgreSQL with Amazon RDS. However, if you have a senior development team and are projecting high traffic for your SaaS application (or even hundreds of tenants), I strongly recommend architecting your database with MongoDB. In addition, utilize the best practices about multi-tenant databases mentioned below. In this case, I would go for Amazon RDS with PostgreSQL, or DynamoDB (MongoDB).

“If you are projecting high traffic for your SaaS application, you’d better architect your database with MongoDB.”

GraphQL or Amazon AppSync

GraphQL is a query language and an alternative to a RESTful API for your database services. This modern ecosystem is adopted as a middleman between the client and the database server. It allows you to retrieve database data faster, mitigate over-fetching from databases, retrieve exactly the data needed from the GraphQL schema, and maintain development speed by iterating more quickly than with a RESTful service. Transforming a monolithic backend application into a multi-tenant microservice architecture is the perfect time to leverage GraphQL or AppSync. Hence, when transforming your SaaS application, don’t forget to include GraphQL! Note: I didn’t include this service in the AWS SaaS architecture diagram because it can be implemented in multiple ways and would require an in-depth explanation of the topic.

Automation

You need a mechanism to trigger or launch new tenants/organizations and attach them to your multi-tenant SaaS architecture. Let’s say you have a new client that just subscribed to your SaaS application: how do you include this new organization inside your environment, database, and business logic? You need an automated process to launch new tenants; this is where Infrastructure as Code (IaC) comes in. This script/procedure should live within a Git/Bitbucket repository, one of the fundamental DevOps principles. A strong argument for leveraging automation and IaC is that you need a mechanism to automate your SaaS application for your code deployments.
Along the same lines, automate the provisioning of new infrastructure for your dev/test environments.

Infrastructure as Code and Automation Tools

It doesn’t matter which Infrastructure as Code tool you use; both Terraform and CloudFormation are useful, do the same job, and are well known by the DevOps community. I don’t have a winner; they are both good.

- Terraform (from HashiCorp): A popular cloud-agnostic tool, used widely across DevOps communities. It is easier to find DevOps engineers with this skill.
- Amazon CloudFormation: AWS’s built-in automation tool, so it is easier to integrate with Amazon Web Services. Whenever a new Amazon technology is released, compatibility with CloudFormation tends to arrive sooner than with Terraform. Rely on an AWS CloudFormation expert to automate and release in a secure manner.

Message Queue System (MQS)

The common MQS options are Amazon SQS, RabbitMQ, and Celery. What I suggest here is to utilize the service that requires the least operational effort from you; in this case, Amazon SQS. There are multiple situations where you need asynchronous communication: from delaying or scheduling a task, to increasing reliability and persistence for critical web transactions, to decoupling your monolithic or microservice application, and, most importantly, using a queue system to connect event-driven serverless applications (AWS Lambda functions).

Caching System

Amazon ElastiCache is a caching and data storage system that is fully scalable, available, and managed. It aims to improve application performance with distributed caching and in-memory data structure stores. It is an in-memory key-value store for the Memcached and Redis engines. With a few clicks, you can run this AWS component as a fully managed service. It is essential to include a caching system in your SaaS application.

Cloud Storage System

Use Amazon S3 and the Amazon CloudFront CDN for your static content. All static content, including images, media, and HTML, will be hosted on Amazon S3, the cloud storage system with practically infinite storage and elasticity. In front of Amazon S3, we will include Amazon CloudFront; integrating this pair is vital in order to cache all the static content and reduce bandwidth costs.

SaaS Web Stack: Multi-Tenant SaaS Architecture Example on AWS

Types of Multi-Tenant SaaS Architectures

One of the most important questions in multi-tenant adoption is which multi-tenant architecture best suits your SaaS application on AWS. We will explore the two layers needed to enable your application to act as a real SaaS platform, since it is paramount to decide which multi-tenant architecture you’ll incorporate in each of them: the application layer and the database layer. These two types of multi-tenant architecture are:

- Application layer multi-tenancy.
- Database layer multi-tenancy.

The Application Layer Multi-Tenancy

The application layer is an architectural design that enables hosting for tenants and is primarily delivered for Software as a Service applications (SaaS apps). In this first model, the application layer is commonly shared among multiple customers.

Monolithic Architecture for SaaS

If you haven’t seen this article before—or if you have already developed and architected your own SaaS application—I’m sure you have fallen into this approach. The monolithic components include EC2 instances in the web tier, the app tier, and Amazon RDS with MySQL for your database. The monolithic architecture is not a bad approach, except that you waste resources massively in the mentioned tiers.
Somewhere between 50% and 70% of your CPU/RAM usage is wasted due to the nature of the monolithic (cloud) architecture.

Monolithic Architecture Diagram

Microservices Architecture for SaaS With Containers and Amazon ECS

Microservices are a recommended type of architecture since they provide a balance between modernization and maximum use of available cloud resources (EC2 instances and compute units). They also introduce a decomposed system with more granular services (microservices). We won’t go deep into the benefits of microservices since they are widely covered in the community. However, I recommend utilizing the formula of multi-tenant architecture + AWS services + microservices + Amazon ECS as the container orchestrator; they can be the perfect match. Mainly, consider that Amazon ECS requires less effort to configure your cluster and gives more NoOps to your DevOps team.

“By 2022, 90% of all new apps will feature microservices architectures that improve the ability to design, debug, update, and leverage third-party code; 35% of all production apps will be cloud-native.” (Source: Forbes, 2019)

With a talented team, this use case scenario is the best multi-tenant SaaS architecture approach. Along the same lines, it covers the main attributes of SaaS software and architecture, including agility, innovation, repeatability, reduced cycle time, cost efficiency, and manageability.

The Perfect Match: Multi-tenant architecture + AWS services + microservices + Amazon ECS (as the container orchestrator).

Microservices Architecture Diagram

Kubernetes Architecture for SaaS With Amazon EKS

You may be wondering: what about Kubernetes or Amazon EKS? Well, Kubernetes is another alternative for a microservice architecture that adds an extra layer of complexity to the SaaS equation. However, you can overcome this complexity by leveraging Amazon EKS (Amazon Elastic Container Service for Kubernetes), the managed Kubernetes service from Amazon, which is the de facto service for the Kubernetes community. What sets this option apart from the rest of the architectures is that it provides the use of namespaces. This attribute helps isolate every tenant and its environment within the corresponding Kubernetes cluster. In this sense, you don’t have to create a different cluster per tenant (you could, but that would serve a different approach). By using ResourceQuota, you can limit the resources used per namespace and avoid creating noise for the other tenants. Another point to consider is that if you would like to isolate your namespaces, you need to include Kubernetes network policies because, by default, networking is open and traffic can flow across namespaces and containers. Here is a comparison of Amazon ECS vs. Kubernetes. If you have an enterprise SaaS, I recommend controlling your microservices via Amazon EKS or Kubernetes, since it allows more granular control.

So, what would a Kubernetes multi-tenant architecture look like? Here is a simple Kubernetes multi-tenant architecture, siloed by its respective namespaces.

Kubernetes Architecture Diagram

A simple multi-tenant architecture with Kubernetes, siloed by Kubernetes namespaces.
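To make the namespace idea concrete, here is a minimal sketch (not from the original article) of the per-tenant isolation objects described above; the tenant name and quota numbers are made-up placeholders:

YAML

# Hypothetical per-tenant isolation objects; names and limits are placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-org1
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-org1-quota
  namespace: tenant-org1
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
---
# Default-deny ingress from other namespaces: only pods in this namespace may talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-org1-isolation
  namespace: tenant-org1
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}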
Serverless Architecture for SaaS on AWS

The dream of any AWS architect is to create a multi-tenant SaaS architecture with a serverless approach. That dream can come true for a DevOps or SaaS architect, but it adds a fair amount of complexity as a tradeoff. Additionally, it requires a reasonable amount of collaboration time with your dev team, extensive application code changes, and a transformative serverless mindset. That said, in a few years it may be the ultimate solution; it all depends on the talent, capabilities, and use case. A serverless SaaS architecture enables applications to obtain more agility and resilience with less development effort: a truly NoOps ecosystem.

At a high level, what are the new parts of this next-generation serverless SaaS architecture? Every call becomes an isolated tenant call, going either to a logical service (a Lambda function) or to the database, with Amazon API Gateway as the entry point of the serverless SaaS application. Now that you have decoupled every logical service, the authentication and authorization module needs to be handled by a third-party service like Amazon Cognito, which is in charge of identifying the tenant, user, tier, and IAM tenant role, and of bringing back an STS token with these attributes. In particular, the API Gateway will route all the tenant functions to the correct Lambda functions matching the STS token. Here is a diagram of a multi-tenant architecture example for AWS SaaS applications that use serverless.

Serverless Architecture Diagram

The Database Layer Multi-Tenancy

The multi-tenancy concept comes with different architecture layers. We have already advocated for the multi-tenant application layer and its variants. Now, it is time to explore multi-tenancy in the database layer, which is another aspect to discover. Paradoxically, the easiest and most cost-effective multi-tenant database architecture is pure database multi-tenancy. The database layer is the opposite of the previous model, the application layer: here, the DB layer is kept in common among tenants, and the application layer is isolated. As a next step, you need to evaluate which multi-tenant database architecture to pursue: tables, schemas, or siloed databases. When choosing your database architecture, there are multiple criteria to assess:

- Scalability: number of tenants, storage per tenant, workload.
- Tenant isolation.
- Database costs: per-tenant costs.
- Development complexity: changes in schemas, queries, etc.
- Operational complexity: database clustering, updating tenant data, database administration, and maintenance.

Single Database: A Table Per Tenant

A table per tenant in a single database refers to pure database multi-tenancy and a pooled model. This database architecture is the common, default solution of DevOps engineers and software architects. It is very cost-effective when you are a small startup or have a few dozen organizations, and it consists of leveraging a table per organization within a database schema. There are specific trade-offs for this architecture, including the sacrifice of data isolation, noise among tenants, and performance degradation, meaning one tenant can overuse compute and RAM resources at the expense of another. Lastly, every table name carries its own tenant ID, which is very straightforward to design and architect.

In regard to data isolation and compliance, let’s say that one of your developers makes a mistake and brings the wrong tenant’s information to your customer. Imagine the data breach. Please ensure you never expose information from more than one tenant. That’s why, for compliance-focused SaaS applications, this architecture model is not recommended; however, it is widely used because of its cost-effectiveness.
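As a rough illustration of why tenant scoping matters in this pooled, table-per-tenant model (this sketch is not from the original article), a data-access helper can resolve the per-tenant table name only from a vetted list so that a request can never read another organization’s table; the class and table names here are hypothetical:

Java

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Set;
import javax.sql.DataSource;

// Hypothetical repository for a pooled model where each tenant has its own table, e.g. orders_org1.
public class TenantOrderRepository {

    private final DataSource dataSource;
    // Whitelist of known tenant IDs; prevents injection through the table name.
    private final Set<String> knownTenantIds;

    public TenantOrderRepository(DataSource dataSource, Set<String> knownTenantIds) {
        this.dataSource = dataSource;
        this.knownTenantIds = knownTenantIds;
    }

    public int countOrders(String tenantId) throws SQLException {
        if (!knownTenantIds.contains(tenantId)) {
            throw new IllegalArgumentException("Unknown tenant: " + tenantId);
        }
        // Every table name carries the tenant ID, so the query can only touch this tenant's data.
        final String sql = "SELECT COUNT(*) FROM orders_" + tenantId;
        try (Connection connection = dataSource.getConnection();
             ResultSet resultSet = connection.prepareStatement(sql).executeQuery()) {
            resultSet.next();
            return resultSet.getInt(1);
        }
    }
}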
Alternative Single-Tenant Database Architecture

A shared table in a single schema in a single database. Perfect for DynamoDB. (We didn’t cover this approach—FYI.)

Single Database: A Schema Per Tenant

A schema per tenant in a single database, also known as the bridge model, is a multi-tenant database approach that is still very cost-effective and more secure than pure tenancy (the DB pooled model), since you stay on a single database, with the exception that each tenant is isolated in its own database schema. If you are concerned about data partitioning, this solution is slightly better than the previous one (a table per tenant). Similarly, it is simple to manage multiple schemas in your application code configuration. One important distinction to notice is that having more than 100 schemas or tenants within a database can cause a lag in database performance; hence, it is recommended to split the database into two (adding the second database as a replica). The best database tool for this approach is PostgreSQL, which supports multiple schemas without much complexity. Lastly, this schema-per-tenant strategy shares resources, compute, and storage across all its tenants. As a result, it produces noisy tenants that utilize more resources than expected.

Database Server Per Tenant

Also called the siloed model, where you need a database instance per customer. Expensive, but the best for isolation and security compliance. This technique is significantly more costly than the rest of the multi-tenant database architectures, but it complies with security regulations and is the best for performance, scalability, and data isolation. This pattern uses one database server per tenant, meaning that if the SaaS app has 100 tenants, there will be 100 database servers, which is extremely costly. When PCI, HIPAA, or SOC 2 compliance is needed, it is vital to utilize a siloed database model, or at least find a workaround with the correct IAM roles, the best container orchestration (either Kubernetes or Amazon ECS namespaces), a VPC per tenant, and encryption everywhere.

Multi-Tenant Database Architecture Tools

- Amazon RDS with PostgreSQL (best option).
- DynamoDB (a great option for a single-tenant database with a single shared table).
- Amazon RDS with MySQL.
- GraphQL: as described previously, use it in front of any of these databases to increase the speed of data retrieval and development as an alternative to a RESTful API, which helps offload requests from the backend servers.

Application Code Changes

Once you have selected your multi-tenant strategy for every layer, let’s consider what needs to change at the code level. If you have decided to adopt Django (Python) for your SaaS application, you need a few tweaks to align your current application with your multi-tenant architecture at the database and application layers. Fortunately, web application languages and frameworks are able to capture the URL or subdomain that comes with the request. The ability to obtain this information (the subdomain) at runtime is critical to handling dynamic subdomains for your multi-tenant architecture. We won’t cover in depth which lines of code you need to include in your Django application, or in any other framework, but at least I’ll let you know which items should be considered in this section.

Python Django Multi-Tenancy in a Nutshell

- Add an app called tenant.py: a class for tenantAwareModel with multiple pool classes.
- How to identify tenants: you need to give each tenant a subdomain; to do so, you need to make a few DNS changes and Nginx/Apache tweaks, and add a utility method (utils.py). Then, whenever you have a request, you can use this method to get the tenant.
- Determine how to extract the tenant using the host header (subdomain).
- Admin isolation.

Note: The previous code suggestions could change depending on the architecture.

Wildcard DNS Subdomain: URL-Based SaaS Platform

Basically, every organization must have its own subdomain, and subdomains are quite useful for identifying organizations. Per tenant, it is a unique dedicated space, environment, and custom application (at least logically); for example, “org1.saas.com,” “org2.saas.com,” and so on. This URL structure will dynamically provision your multi-tenant SaaS application, and this DNS change will facilitate the identification, authentication, and authorization of every tenant. Another workaround is the path-based-per-tenant approach, which is not recommended; for example, “app.saas.com/org1/…,” “app.saas.com/org2/…,” and so on.

So, the following is required in this particular section:

- A wildcard record should be in place in your DNS management records. This wildcard subdomain redirects all routes to your multi-tenant architecture (either to the load balancer, application server, or cluster endpoint). Similarly, a CNAME record labeled (*) pointing to your “app.saas.com” or “saas.com/login.” An asterisk (*) means a wildcard for your app domain.
- As a final step, another (A) record pointing your “app.saas.com” domain to your Amazon ECS cluster, ALB, or IP.

DNS Records Entries

*.saas.com      CNAME      app.saas.com
app.saas.com    A          1.2.3.4
  OR
app.saas.com    A (alias)  balancer.us-east-1.elb.amazonaws.com

Note: An (A) alias record is used when you are utilizing an ALB/ELB (load balancer) from AWS.

Web Server Setup With NGINX Configuration

Let’s move down to your web server, specifically Nginx. At this stage, you will need to configure your nginx.conf and server blocks (virtual hosts). Set up a wildcard vhost for your Nginx web server; make sure it is an alias (ServerAlias) and a catch-all wildcard site. You don’t have to create a subdomain virtual host in Nginx per tenant; instead, you need to set up a single wildcard virtual host for all your tenants. Naturally, the wildcard pattern will match your subdomains and route accordingly to the correct and unique path of your SaaS app document root.

SSL Certificates

Just don’t forget to deal with the certificates for your tenant subdomains. You will need to add them either in the CloudFront CDN, the load balancer, or your web server. Note: This solution can also be accomplished using the Apache web server.
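As a minimal illustration of the catch-all idea above (this configuration is not from the original article), a single Nginx server block can match every tenant subdomain and pass the captured value to the application; the domain, certificate paths, and upstream name are placeholders:

Nginx

# Hypothetical catch-all server block; domain, certificate paths, and upstream are placeholders.
server {
    listen 443 ssl;
    # The wildcard pattern matches org1.saas.com, org2.saas.com, and so on.
    server_name ~^(?<tenant>[a-z0-9-]+)\.saas\.com$;

    ssl_certificate     /etc/nginx/ssl/wildcard.saas.com.crt;
    ssl_certificate_key /etc/nginx/ssl/wildcard.saas.com.key;

    location / {
        # Pass the captured subdomain to the application so it can resolve the tenant.
        proxy_set_header X-Tenant $tenant;
        proxy_set_header Host $host;
        proxy_pass http://app_backend;
    }
}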
Code Deployments With CI/CD Another crucial aspect to consider is how to deploy your code releases across tenants and your multiple environments (dev, test, and prod). You will need a continuous integration and continuous delivery (CI/CD) process to streamline code releases across all environments and tenants. If you follow the previous best practices, it won't be difficult. Tools to Embrace CI/CD CI/CD tools: Jenkins, CircleCI, or AWS CodePipeline (along with CodeBuild and CodeDeploy). My advice: If you have a sophisticated DevOps team and want a widely known tool, go for Jenkins; otherwise, go for CircleCI. If you want to keep leveraging AWS technologies exclusively, go for AWS CodePipeline. If you're working in compliance-heavy, banking, or otherwise regulated environments, go for GitLab. DevOps Automation: Automate Your New Tenant Creation Process How are you creating new tenants for each subscription? Identify the process for launching new tenants into your SaaS environment. You need to trigger a script that launches or attaches the new tenant environment to your existing multi-tenant architecture; in other words, automate the setup of new tenants. This can happen after a customer registers on your onboarding page, or you may trigger the script manually. Automation Tools Terraform (recommended). AWS CloudFormation (ideally handled by an AWS CloudFormation-certified team). Ansible. Note: Make sure you apply Infrastructure as Code principles here. Siloed Compute and Siloed Storage How will your architecture be isolated from other tenants? Every layer of the SaaS application needs to be isolated. The customer workflow touches multiple layers (pages, backend, networking, front end, storage, and more), so what is your isolation strategy? Keep the following aspects in mind: IAM roles per function or microservice. Amazon S3 security policies. VPC isolation. Amazon ECS/Kubernetes namespace isolation. Database isolation (tenant per table/schema/siloed database). Tenant Compute Capacity Have you considered how many SaaS tenants each environment can support? Suppose you have 99 tenants and compute/database load is almost at its limit: do you have an environment ready to support new tenants? What about the databases? Suppose a particular customer wants an isolated tenant environment for its SaaS application: how would you support an extra tenant environment that is separated from the rest of the multi-tenant architecture? Would you do it? What are the implications? Think through a scenario for this aspect. Tenant Clean-Up What do you do with tenants that are idle or no longer used? You could run a clean-up process for any tenant that has been inactive for a prolonged period, or remove unused resources and tenants by hand, but either way you need a defined process or an automation script. Final Thoughts Multi-tenant architecture and SaaS applications on AWS: what a topic we just covered! You now understand the whole multi-tenant SaaS architecture cycle end to end, including server configuration, code, and which architecture to pursue at every IT layer. As you can see, there is no one-size-fits-all solution in this ecosystem. There are multiple variants for each IT layer, whether fully multi-tenant, partially multi-tenant, or siloed tenants. It comes down to what you need, your budget, complexity, and the expertise of your DevOps team.
I strongly recommend going for microservices (ECS/EKS) and a partially multi-tenant SaaS approach in the application and database layers. Also embrace cloud-native principles and, finally, adopt the multi-tenant architecture best practices and considerations described in this article. That being said, brainstorm your AWS SaaS architecture first by thinking about how to gain agility, cost efficiency, and lower IT labor costs, and by leveraging a nearshore collaboration model, which adds another layer of cost savings. In that regard, automation with Terraform and CloudFormation is our best choice. Even better, most of our AWS/DevOps projects follow PCI, HIPAA, and SOC 2 regulations. If you are a fintech, healthcare, or SaaS company, you know this type of requirement should be included in your processes.
Deploy a Kubernetes Application With Terraform and AWS EKS

When it comes to infrastructure provisioning, including the AWS EKS cluster, Terraform is the first tool that comes to mind. Learning Terraform is much easier than setting up the infrastructure manually. That said, would you rather use the traditional approach to set up the infrastructure, or would you prefer to use Terraform? More specifically, would you rather create an EKS cluster using Terraform and have Terraform Kubernetes deployment in place, or use the manual method, leaving room for human errors? As you may already know, Terraform is an open-source Infrastructure as Code (IaC) software platform that allows you to manage hundreds of cloud services using a uniform CLI approach and uses declarative configuration files to codify cloud APIs. In this article, we won't go into all the details of Terraform. Instead, we will be focusing on Terraform Kubernetes deployment. In summary, we will be looking at the steps to provision an EKS cluster using Terraform. Also, we will go through how Terraform Kubernetes deployment helps save time and reduce human errors, which can occur when using a traditional or manual approach for application deployment. Prerequisites for Terraform Kubernetes Deployment Before we proceed and provision an EKS cluster using Terraform, there are a few commands (or tools) you need to have in mind and on hand. First off, you must have an AWS account, and Terraform must be installed on your host machine, seeing as we are going to create an EKS cluster using the Terraform CLI on the AWS cloud. Now, let's take a look at the prerequisites for this setup and help you install them: AWS Account If you don't have an AWS account, you can register for a "Free Tier Account" and use it for test purposes. IAM Admin User You must have an IAM user with AmazonEKSClusterPolicy and AdministratorAccess permissions, as well as its secret and access keys. We will be using the IAM user credentials to provision an EKS cluster using Terraform. The keys you create for this user will be used to connect to the AWS account from the CLI (command line interface). When working on production clusters, only provide the required access and avoid granting admin privileges. EC2 Instance We will be using an Ubuntu 18.04 EC2 instance as the host machine to execute our Terraform code. You may use another machine; however, you will need to verify which commands are compatible with your host machine to install the required packages. The first step is installing the required packages on your machine. You can also use your personal computer to install the required tools; this step is optional. Access to the Host Machine Connect to the EC2 instance and install the unzip package: ssh -i "<key-name.pem>" ubuntu@<public-ip-of-the-ec2-instance>. If you are using your personal computer, you may not need to connect to the EC2 instance; however, in that case, the installation commands may differ. sudo apt-get update -y. sudo apt-get install unzip -y. Terraform To create an EKS cluster using Terraform, you need to have Terraform on your host machine. Use the following commands to install Terraform on an Ubuntu 18.04 EC2 machine. Visit HashiCorp's official website to view the installation instructions for other platforms. sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl. curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -. sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main". sudo apt-get update && sudo apt-get install terraform.
terraform --version. AWS CLI There is not much to do with the AWS CLI itself; however, we need it to check the details of the IAM user whose credentials will be used from the terminal. To install it, use the commands below: curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip". unzip awscliv2.zip. sudo ./aws/install. Kubectl We will be using the kubectl command against the Kubernetes cluster to view the resources in the EKS cluster we want to create. Install kubectl on the Ubuntu 18.04 EC2 machine using the commands below: curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl". curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256". echo "$(cat kubectl.sha256) kubectl" | sha256sum --check. sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl. kubectl version --client. DOT Note: This is completely optional. We will be using it to convert the output of the terraform graph command. The output of the terraform graph command is in DOT format, which can easily be converted into an image using the dot tool provided by Graphviz. The terraform graph command is used to generate a visual representation of a configuration or execution plan. To install the dot command, execute: sudo apt install graphviz. Export Your AWS Access and Secret Keys for the Current Session If the session expires, you will need to export the keys again in the terminal. There are other ways to provide your keys so the AWS CLI can interact with AWS: export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>. export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>. export AWS_DEFAULT_REGION=<YOUR_AWS_DEFAULT_REGION>. Here, replace <YOUR_AWS_ACCESS_KEY_ID> with your access key, <YOUR_AWS_SECRET_ACCESS_KEY> with your secret key, and <YOUR_AWS_DEFAULT_REGION> with the default region for your AWS CLI. Check the Details of the IAM User Whose Credentials Are Being Used Basically, this will display the details of the user whose keys you used to configure the CLI in the above step: aws sts get-caller-identity. Architecture The architecture should appear as follows. A VPC will be created with three public subnets and three private subnets. Traffic from private subnets will route through the NAT gateway, and traffic from public subnets will route through the internet gateway. Kubernetes cluster nodes will be created as part of auto-scaling groups and will reside in private subnets. Public subnets can be used to create bastion servers that can be used to connect to private nodes. Three public subnets and three private subnets will be created in three different availability zones. You can change the VPC CIDR in the Terraform configuration files if you wish. If you are just getting started, we recommend following the article without making any unfamiliar changes to the configuration to avoid human errors. Highlights This article will help you provision an EKS cluster using Terraform and deploy a sample NodeJs application. When creating an EKS cluster, other AWS resources such as the VPC, subnets, NAT gateway, internet gateway, and security groups will also be created in your AWS account. This article is divided into two parts: The creation of an EKS cluster using Terraform. Deployment of a sample NodeJs application in the EKS cluster using Terraform. First, we will create an EKS cluster; after that, we will deploy a sample NodeJs application on it using Terraform. In this article, we use Terraform modules for creating the VPC and its components, along with the EKS cluster.
Here is a list of some of the Terraform elements we’ll use: Module: A module is made up of a group of of .tf and/or .tf.jsonfiles contained in a directory. Modules are containers for a variety of resources that are used in conjunction. Terraform’s main method of packaging and reusing resource configurations is through modules. EKS: We will be using a terraform-aws-modules/eks/aws module to create our EKS cluster and its components. VPC: We will be using a terraform-aws-modules/vpc/aws module to create our VPC and its components. Data: The Data source is accessed by a data resource, which is specified using a data block and allows Terraform to use the information defined outside of Terraform, defined within another Terraform configuration, or changed by functions. Provider: Provider is a Terraform plugin that allows users to communicate with cloud providers, SaaS providers, and other APIs. Terraform configurations must specify the required providers in order for Terraform to install and use them. Kubernetes: The Kubernetes provider is used to communicate with Kubernetes’ resources. Before the provider can be utilized, it must be configured with appropriate credentials. For this example, we will use host, token, and cluster_ca_certificate in the provider block. AWS: The AWS provider is used to communicate with AWS’ resources. Before the provider can be utilized, it must be configured with the appropriate credentials. In this case, we will export AWS access and secret keys to the terminal. Resource: The most crucial element of the Terraform language is resources. Each resource block describes one or more infrastructure items. A resource block specifies a particular kind of resource with a specific local name. The resource type and name serve to identify a resource and, therefore, must be distinct inside a module. kubernetes_namespace: We will be using a kubernetes_namespace to create namespaces in our EKS cluster. kubernetes_deployment: We will be using a kubernetes_deployment to create deployments in our EKS cluster. kubernetes_service: We will be using a kubernetes_service to create services in our EKS cluster. aws_security_group: We will be using aws_security_group to create multiple security groups for instances that will be created as part of our EKS cluster. random_string: Random string is a resource that generates a random permutation of alphanumeric and special characters. We will be creating a random string that will be used as a suffix in our EKS cluster name. Output: Output values provide information about your infrastructure to other Terraform setups and make said information available on the command line. In computer languages, output values are equivalent to return values. An output block must be used to declare each output value that a module exports. Required_version: The required_version parameter accepts a version constraint string that determines which Terraform versions are compatible with your setup. Only the Terraform CLI version is affected by the required version parameter. Variable: Terraform variables, including variables in any other programming language, allow you to tweak the characteristics of Terraform modules without modifying the module’s source code. This makes your modules composable and reusable by allowing you to exchange modules across multiple Terraform settings. Variables can include past values, which can be utilized when calling the module or running Terraform. 
Locals: A local value gives an expression a name, allowing you to use it numerous times within a module without having to repeat it. Before we go ahead and create an EKS cluster using Terraform, let's take a look at why Terraform is a good choice. Why Provision and Deploy With Terraform? It's normal to wonder "why provision an EKS cluster using Terraform?" when we can simply achieve the same with the AWS Console, AWS CLI, or other tools. Here are a few of the reasons why: Unified workflow: If you're already using Terraform to deploy infrastructure to AWS, your EKS cluster can be integrated into that process. Also, Terraform can be used to deploy applications into your EKS cluster. Full lifecycle management: Terraform not only creates resources, but also updates and deletes tracked resources without you having to inspect the API to find those resources. Graph of relationships: Terraform recognizes resource dependencies via a relationship graph. If, for example, an AWS Kubernetes cluster requires specific VPC and subnet configurations, Terraform will not attempt to create the cluster if the VPC and subnets fail to be created with the required configuration. How To Create an EKS Cluster Using Terraform In this part of the article, we will provision an EKS cluster using Terraform. While doing this, other dependent resources like the VPC, subnets, NAT gateway, internet gateway, and security groups will also be created, and we will also deploy an Nginx application with Terraform. Note: You can find all of the relevant code in my GitHub repository. Before you create an EKS cluster with Terraform using the following steps, you need to set up and make note of a few things: Clone the GitHub repository in your home directory: cd ~/ pwd git clone After you clone the repo, change your present working directory to ~/DevOps/aws/terraform/terraform-kubernetes-deployment/ cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/ The Terraform files for creating an EKS cluster must be in one folder, and the Terraform files for deploying a sample NodeJs application must be in another. eks-cluster: This folder contains all of the .tf files required for creating the EKS cluster. nodejs-application: This folder contains the .tf file required for deploying the sample NodeJs application. Now, let's proceed with the creation of an EKS cluster using Terraform: Verify your present working directory; it should be ~/DevOps/aws/terraform/terraform-kubernetes-deployment/ as you have already cloned my GitHub repository in the previous step. cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/ Next, change your present working directory to ~/DevOps/aws/terraform/terraform-kubernetes-deployment/eks-cluster cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/eks-cluster You should now have all the required files. If you haven't cloned the repo, you are free to create the required .tf files in a new folder. Create your eks-cluster.tf file with the content below. In this case, we are using the module from terraform-aws-modules/eks/aws to create our EKS cluster. Let's take a look at the most important inputs that you may want to consider changing depending on your requirements. source: This informs Terraform of the location of the source code for the module. version: It is recommended to explicitly specify an acceptable version number to avoid unexpected or unwanted changes. cluster_name: The cluster name is passed from the local variable.
cluster_version: This defines the EKS cluster version. subnets: This specifies the list of subnets in which nodes will be created. For this example, we will be creating nodes in the private subnets. Subnet IDs are passed here from the module in which the subnets are created. vpc_id: This specifies the VPC in which the EKS cluster will be created. The value is passed from the module in which the VPC is created. instance_type: You can change this value if you want to create worker nodes with any other instance type. asg_desired_capacity: Here, you can specify the desired number of nodes that you want in your auto-scaling worker node groups. module "eks" { source = "terraform-aws-modules/eks/aws" version = "17.24.0" cluster_name = local.cluster_name cluster_version = "1.20" subnets = module.vpc.private_subnets vpc_id = module.vpc.vpc_id workers_group_defaults = { root_volume_type = "gp2" } worker_groups = [ { name = "worker-group-1" instance_type = "t2.small" additional_userdata = "echo nothing" additional_security_group_ids = [aws_security_group.worker_group_mgmt_one.id] asg_desired_capacity = 2 }, { name = "worker-group-2" instance_type = "t2.medium" additional_userdata = "echo nothing" additional_security_group_ids = [aws_security_group.worker_group_mgmt_two.id] asg_desired_capacity = 1 }, ] } data "aws_eks_cluster" "cluster" { name = module.eks.cluster_id } data "aws_eks_cluster_auth" "cluster" { name = module.eks.cluster_id } Create a kubernetes.tf file with the following content. Here, we are using the Terraform Kubernetes provider to create Kubernetes objects, such as a namespace, deployment, and service, using Terraform. We are creating these resources for testing purposes only. In the following steps, we will also be deploying a sample application using Terraform. Here are all the resources we will be creating in the EKS cluster. kubernetes_namespace: We will be creating a namespace (the resource is labeled "test" and creates a namespace named "nginx") and using it to create the other Kubernetes objects in the EKS cluster. kubernetes_deployment: We will be creating a deployment with two replicas of the Nginx pod. kubernetes_service: A service of type LoadBalancer will be created and used to access our Nginx application. provider "kubernetes" { host = data.aws_eks_cluster.cluster.endpoint token = data.aws_eks_cluster_auth.cluster.token cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data) } resource "kubernetes_namespace" "test" { metadata { name = "nginx" } } resource "kubernetes_deployment" "test" { metadata { name = "nginx" namespace = kubernetes_namespace.test.metadata.0.name } spec { replicas = 2 selector { match_labels = { app = "MyTestApp" } } template { metadata { labels = { app = "MyTestApp" } } spec { container { image = "nginx" name = "nginx-container" port { container_port = 80 } } } } } } resource "kubernetes_service" "test" { metadata { name = "nginx" namespace = kubernetes_namespace.test.metadata.0.name } spec { selector = { app = kubernetes_deployment.test.spec.0.template.0.metadata.0.labels.app } type = "LoadBalancer" port { port = 80 target_port = 80 } } } Create an outputs.tf file with the following content. This is done to export the structured data related to our resources. It will provide information about our EKS infrastructure to other Terraform setups. Output values are equivalent to return values in other programming languages.
output "cluster_id" { description = "EKS cluster ID." value = module.eks.cluster_id } output "cluster_endpoint" { description = "Endpoint for EKS control plane." value = module.eks.cluster_endpoint } output "cluster_security_group_id" { description = "Security group ids attached to the cluster control plane." value = module.eks.cluster_security_group_id } output "kubectl_config" { description = "kubectl config as generated by the module." value = module.eks.kubeconfig } output "config_map_aws_auth" { description = "A kubernetes configuration to authenticate to this EKS cluster." value = module.eks.config_map_aws_auth } output "region" { description = "AWS region" value = var.region } output "cluster_name" { description = "Kubernetes Cluster Name" value = local.cluster_name } Create a security-groups.tf file with the following content. In this case, we are defining the security groups, which will be used by our worker nodes. resource "aws_security_group" "worker_group_mgmt_one" { name_prefix = "worker_group_mgmt_one" vpc_id = module.vpc.vpc_id ingress { from_port = 22 to_port = 22 protocol = "tcp" cidr_blocks = [ "10.0.0.0/8", ] } } resource "aws_security_group" "worker_group_mgmt_two" { name_prefix = "worker_group_mgmt_two" vpc_id = module.vpc.vpc_id ingress { from_port = 22 to_port = 22 protocol = "tcp" cidr_blocks = [ "192.168.0.0/16", ] } } resource "aws_security_group" "all_worker_mgmt" { name_prefix = "all_worker_management" vpc_id = module.vpc.vpc_id ingress { from_port = 22 to_port = 22 protocol = "tcp" cidr_blocks = [ "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", ] } } Create a versions.tf file with the following content. Here, we are defining version constraints along with a provider as AWS. terraform { required_providers { aws = { source = "hashicorp/aws" version = ">= 3.20.0" } random = { source = "hashicorp/random" version = "3.1.0" } local = { source = "hashicorp/local" version = "2.1.0" } null = { source = "hashicorp/null" version = "3.1.0" } kubernetes = { source = "hashicorp/kubernetes" version = ">= 2.0.1" } } required_version = ">= 0.14" } Create a vpc.tf file with the following content. For the purposes of our example, we have defined a 10.0.0.0/16 CIDR for a new VPC that will be created and used by the EKS cluster. We will also be creating public and private subnets. To create the VPC, we will be using the terraform-aws-modules/vpc/aws module. All of our resources will be created in the us-east-1 region. If you want to use any other region for creating the EKS cluster, all you need to do is change the value assigned to the "region" variable.
variable "region" { default = "us-east-1" description = "AWS region" } provider "aws" { region = var.region } data "aws_availability_zones" "available" {} locals { cluster_name = "test-eks-cluster-${random_string.suffix.result}" } resource "random_string" "suffix" { length = 8 special = false } module "vpc" { source = "terraform-aws-modules/vpc/aws" version = "3.2.0" name = "test-vpc" cidr = "10.0.0.0/16" azs = data.aws_availability_zones.available.names private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"] public_subnets = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"] enable_nat_gateway = true single_nat_gateway = true enable_dns_hostnames = true tags = { "kubernetes.io/cluster/${local.cluster_name}" = "shared" } public_subnet_tags = { "kubernetes.io/cluster/${local.cluster_name}" = "shared" "kubernetes.io/role/elb" = "1" } private_subnet_tags = { "kubernetes.io/cluster/${local.cluster_name}" = "shared" "kubernetes.io/role/internal-elb" = "1" } } At this point, you should have six files in your current directory (as can be seen below). All of these files should be in one directory. In the next step, you will learn to deploy a sample NodeJs application and, to do so, you will need to create a new folder. That said, if you have cloned the repo, you won't need to worry: eks-cluster.tf kubernetes.tf outputs.tf security-groups.tf versions.tf vpc.tf To create an EKS cluster, execute the following command to initialize the current working directory containing the Terraform configuration files (.tf files). terraform init Execute the following command to determine the desired state of all the resources defined in the above .tf files. terraform plan Before we go ahead and create our EKS cluster, let's take a look at the terraform graph command. This is an optional step that you can skip if you wish. Basically, we are simply generating a visual representation of our execution plan. Once you execute the following command, you will get the output in a graph.svg file. You can try to open the file on your personal computer or online with SVG Viewer. terraform graph -type plan | dot -Tsvg > graph.svg We must now create the resources using the .tf files in our current working directory. Execute the following command. This will take around ten minutes to complete. Feel free to go have a cup of coffee. Once the command is successfully completed, the EKS cluster will be ready. terraform apply After the terraform apply command is successfully completed, you should see the output as depicted below. You can now go to the AWS Console and verify the resources created as part of the EKS cluster: the EKS cluster itself, the VPC, EC2 instances, and EC2 Auto Scaling groups. You can check for other resources in the same way. Now, if you try to use the kubectl command to connect to the EKS cluster and control it, you will get an error, because the kubeconfig file used for authentication has not been updated for this cluster yet. kubectl get nodes To resolve the error mentioned above, retrieve the access credentials, i.e., update the ~/.kube/config file for the cluster, and automatically configure kubectl so that you can connect to the EKS cluster using the kubectl command. Execute the following command from the directory where all your .tf files used to create the EKS cluster are located.
aws eks --region $(terraform output -raw region) update-kubeconfig --name $(terraform output -raw cluster_name) Now, you are ready to connect to your EKS cluster and check the nodes in the Kubernetes cluster using the following command: kubectl get nodes Check the resources, such as pods, deployments, and services, that are available in the Kubernetes cluster across all namespaces: kubectl get pods -A kubectl get deployments -A kubectl get services -A kubectl get ns In the screenshot above, you can see the namespace, pods, deployment, and service we created with Terraform. We have now created an EKS cluster using Terraform and deployed Nginx on it with Terraform. Now, let's deploy a sample NodeJs application in the same EKS cluster using Terraform. This time, we will keep the Kubernetes object files in a separate folder so the NodeJs application can be managed independently, allowing us to deploy and/or destroy the NodeJs application without affecting the EKS cluster. Deploying a Sample NodeJs Application on the EKS Cluster Using Terraform In this part of the article, we will deploy a sample NodeJs application and its dependent resources, including a namespace, deployment, and service. We use publicly available Docker images for the sample NodeJs application and the MongoDB database. Now, let's go ahead with the deployment. Change your present working directory to ~/DevOps/aws/terraform/terraform-kubernetes-deployment/nodejs-application/ if you have cloned my repository. cd ~/DevOps/aws/terraform/terraform-kubernetes-deployment/nodejs-application/ You should now have the required file. If you haven't cloned the repo, feel free to create the required .tf file in a new folder. Please note that we are using a separate folder here. Create a sample-nodejs-application.tf file with the following content. In this case, we are using the Terraform Kubernetes provider to deploy a sample NodeJs application. We will be creating a namespace, a NodeJs deployment and its service of type LoadBalancer, and a MongoDB deployment and its service of type ClusterIP.
provider "kubernetes" { config_path = "~/.kube/config" } resource "kubernetes_namespace" "sample-nodejs" { metadata { name = "sample-nodejs" } } resource "kubernetes_deployment" "sample-nodejs" { metadata { name = "sample-nodejs" namespace = kubernetes_namespace.sample-nodejs.metadata.0.name } spec { replicas = 1 selector { match_labels = { app = "sample-nodejs" } } template { metadata { labels = { app = "sample-nodejs" } } spec { container { image = "learnk8s/knote-js:1.0.0" name = "sample-nodejs-container" port { container_port = 80 } env { name = "MONGO_URL" value = "mongodb://mongo:27017/dev" } } } } } } resource "kubernetes_service" "sample-nodejs" { metadata { name = "sample-nodejs" namespace = kubernetes_namespace.sample-nodejs.metadata.0.name } spec { selector = { app = kubernetes_deployment.sample-nodejs.spec.0.template.0.metadata.0.labels.app } type = "LoadBalancer" port { port = 80 target_port = 3000 } } } resource "kubernetes_deployment" "mongo" { metadata { name = "mongo" namespace = kubernetes_namespace.sample-nodejs.metadata.0.name } spec { replicas = 1 selector { match_labels = { app = "mongo" } } template { metadata { labels = { app = "mongo" } } spec { container { image = "mongo:3.6.17-xenial" name = "mongo-container" port { container_port = 27017 } } } } } } resource "kubernetes_service" "mongo" { metadata { name = "mongo" namespace = kubernetes_namespace.sample-nodejs.metadata.0.name } spec { selector = { app = kubernetes_deployment.mongo.spec.0.template.0.metadata.0.labels.app } type = "ClusterIP" port { port = 27017 target_port = 27017 } } } At this point, you should have one file in your current directory, as depicted below: sample-nodejs-application.tf To initialize the current working directory containing our Terraform configuration file (.tf file) and deploy the sample NodeJs application, execute the following command: terraform init Next, execute the following command to determine the desired state of all the resources defined in the above .tf file: terraform plan Before we deploy the sample NodeJs application, let's try generating a visual representation of our execution plan in the same way we did while creating the EKS cluster. Again, this is an optional step that you can skip if you don't wish to see the graph. Once you execute the following command, you will get the output in a graph.svg file. You can try to open the file on your personal computer or online with SVG Viewer. terraform graph -type plan | dot -Tsvg > graph.svg The next step is to deploy the sample NodeJs application using the .tf file in our current working directory. Execute the following command; this time, don't go and have a cup of tea, seeing as it should not take more than one minute to complete. Once the command is successfully completed, the sample NodeJs application will be deployed in the EKS cluster using Terraform: terraform apply Once the terraform apply command is successfully completed, you should see the following output: You can now verify the objects that have been created using the commands below: kubectl get pods -A kubectl get deployments -A kubectl get services -A In the above screenshot, you can see the namespace, pods, deployment, and service that were created for the sample NodeJs application. You can also access the sample NodeJs application using the DNS of the LoadBalancer. Upload an image and add a note to it.
Once you've published the image, it will be saved in the database. It's important to note that, in this case, we did not use any type of persistent storage for the application; therefore, the data will not be retained after the pod is recreated or deleted. To retain data, try using a PersistentVolume for the database. We just deployed a sample NodeJs application that is publicly accessible over the LoadBalancer DNS using Terraform. This completes the creation of the EKS cluster and the deployment of the sample NodeJs application using Terraform. Clean Up the Resources We Created It's always better to delete the resources once you're done with your tests, seeing as this saves costs. To clean up the resources and delete the sample NodeJs application and EKS cluster, follow the steps below: Destroy the sample NodeJs application using the following command (run terraform init first if you get the error "Error: Inconsistent dependency lock file"): terraform destroy You should see the following output once the above command is successful. Validate whether or not the NodeJs application has been destroyed: kubectl get pods -A kubectl get deployments -A kubectl get services -A In the above screenshot, you can see that all of the sample NodeJs application resources have been deleted. Now, let's destroy the EKS cluster using the following command (again, run terraform init first if you get the error "Error: Inconsistent dependency lock file"): terraform destroy You will see the following output once the above command is successful. You can now go to the AWS console to verify whether or not the resources have been deleted. There you have it! We have just successfully deleted the EKS cluster, as well as the sample NodeJs application. Conclusion of Terraform Kubernetes Deployment Elastic Kubernetes Service (EKS) is a managed Kubernetes service provided by AWS, which takes the complexity and overhead out of provisioning and optimizing a Kubernetes cluster for development teams. An EKS cluster can be created using a variety of methods; nevertheless, using the best possible approach is critical to improving the infrastructure management lifecycle. Terraform is one of the Infrastructure as Code (IaC) tools that allows you to create, modify, and version control cloud and on-premises resources in a secure and efficient manner. You can use the Terraform Kubernetes deployment method to create an EKS cluster with Terraform, automating the creation process of the EKS cluster and gaining additional control over the entire infrastructure management process through code. The creation of the EKS cluster and the deployment of Kubernetes objects can also be managed using the Terraform Kubernetes provider.

By Rahul Shivalkar
Azure Observability

In this article, we will explore Azure Observability, the difference between monitoring and observability, its components, different patterns, and antipatterns. Azure Observability is a powerful set of services provided by Microsoft Azure that allows developers and operations teams to monitor, diagnose, and improve the performance and availability of their applications. With Azure Observability, you can gain deep insights into the performance and usage of your applications and quickly identify and resolve issues. Azure Monitoring and Azure Observability Azure Monitoring and Azure Observability are related but different concepts in the Azure ecosystem. Azure Monitor is a service that provides a centralized location for collecting and analyzing log data from Azure resources and other sources. It includes features for collecting data from Azure services such as Azure Virtual Machines, Azure App Services, and Azure Functions, as well as data from other sources such as Windows Event Logs and custom logs. The service also includes Azure Log Analytics, which is used to analyze the log data and create custom queries and alerts. Azure Observability, on the other hand, is a broader concept that encompasses a set of services provided by Azure for monitoring, diagnosing, and improving the performance and availability of your applications. It includes Azure Monitor but also encompasses other services such as Azure Application Insights, Azure Metrics, and Azure Diagnostics. Azure Monitor is a service that provides log data collection and analysis, while Azure Observability is a broader set of services that provides deep insights into the performance and availability of your application. Azure Observability is built on top of Azure Monitor, and it integrates with other services to provide a comprehensive view of your application's performance. Key Components of Azure Observability One of the key components of Azure Observability is Azure Monitor. This service provides a centralized location for collecting and analyzing log data from Azure resources and other sources. It includes features for collecting data from Azure services such as Azure Virtual Machines, Azure App Services, and Azure Functions, as well as data from other sources such as Windows Event Logs and custom logs. This allows you to have a comprehensive view of your environment and understand how your resources are performing. Another important component of Azure Observability is Azure Log Analytics. This service is used to analyze the log data collected by Azure Monitor and to create custom queries and alerts. It uses a query language called Kusto, which is optimized for large-scale data analysis. With Azure Log Analytics, you can easily search and filter through large amounts of log data and create custom queries and alerts to notify you of specific events or issues. Azure Application Insights is another service provided by Azure Observability. This service provides deep insights into the performance and usage of your applications. It can be used to track requests, exceptions, and performance metrics and to create custom alerts. With Azure Application Insights, you can gain a better understanding of how your users interact with your applications and identify and resolve issues quickly. Azure Metrics is another service provided by Azure Observability. It allows you to collect and analyze performance data from your applications and services, including CPU usage, memory usage, and network traffic.
This will give you a real-time view of your resources' performance and allow for proactive monitoring. Finally, Azure Diagnostics is a service that is used to diagnose and troubleshoot issues in your applications and services. It includes features for collecting diagnostic data, such as performance counters, traces, and logs, and for analyzing that data to identify the root cause of issues. With Azure Diagnostics, you can quickly identify and resolve issues in your applications and services and ensure that they are performing optimally. Example: Flow of Observability Data From an Azure Serverless Architecture An example of using Azure Observability to monitor and improve the performance of an application would involve the following steps: Enabling Azure Monitor for your application: This involves configuring Azure Monitor to collect log data from your application, such as requests, exceptions, and performance metrics. This data can be collected from Azure services such as Azure App Services, Azure Functions, and Azure Virtual Machines. Analyzing log data with Azure Log Analytics: Once data is collected, you can use Azure Log Analytics to analyze the log data and create custom queries and alerts. For example, you can create a query to identify all requests that returned a 500 error code and create an alert to notify you when this happens. Identifying and resolving performance issues: With the data collected and analyzed, you can use Azure Application Insights to identify and resolve performance issues. For example, you can use the performance metrics collected by Azure Monitor to identify slow requests and use Azure Diagnostics to collect additional data, such as traces and logs, to understand the root cause of the issue. Monitoring your resources: With Azure Metrics, you can monitor your resources' performance and understand the impact on the application. This will give you a real-time view of your resources and allow for proactive monitoring. Setting up alerts: Azure Monitor, Azure Log Analytics, and Azure Application Insights can all be used to set up alerts; this way, you can be notified of any issues or potential issues. This will allow you to act before they become a problem for your users. Continuously monitoring and improving: After resolving the initial issues, you should continue to monitor your application using Azure Observability to ensure that it is performing well and to identify any new issues that may arise. This allows you to continuously improve the performance and availability of your application. Observability Patterns Azure Observability provides a variety of patterns that can be used to monitor and improve the performance of your application. Some of the key patterns and metrics include: Logging: Collecting log data such as requests, exceptions, and performance metrics and then analyzing this data using Azure Monitor and Azure Log Analytics. This can be used to identify and troubleshoot issues in your application and to create custom queries and alerts to notify you of specific events or issues. Tracing: Collecting trace data such as request and response headers and analyzing this data using Azure Diagnostics. This can be used to understand the flow of requests through your application and to identify and troubleshoot issues with specific requests. Performance monitoring: Collecting performance metrics such as CPU usage, memory usage, and network traffic and analyzing this data using Azure Metrics.
This can be used to identify and troubleshoot issues with the performance of your application and resources. Error tracking: Collecting and tracking errors and exceptions and analyzing this data using Azure Application Insights. This can be used to identify and troubleshoot issues with specific requests and to understand how errors are impacting your users. Availability monitoring: Collecting and monitoring data related to the availability of your application and resources, such as uptime and response times, and analyzing this data using Azure Monitor. This can be used to identify and troubleshoot issues with the availability of your application. Custom metrics: Collecting custom metrics that are specific to your application and analyzing this data using Azure Monitor and Azure Log Analytics. This can be used to track key performance indicators (KPIs) for your application and to create custom alerts. All these patterns and metrics can be used together to gain a comprehensive understanding of the performance and availability of your application and to quickly identify and resolve issues. Additionally, the Azure Observability services are integrated, so you can easily correlate different data sources and have a holistic view of your application's performance. While Azure Observability provides a powerful set of services for monitoring, diagnosing, and improving the performance of your applications, there are also some common mistakes (anti-patterns) that should be avoided to get the most out of these services. Here are a few examples of Azure Observability anti-patterns: Not collecting enough data: Collecting insufficient data makes it difficult to diagnose and troubleshoot issues and can lead to incomplete or inaccurate analysis. Make sure to collect all the relevant data for your application, including logs, traces, and performance metrics, to ensure that you have a comprehensive view of your environment. Not analyzing the data: Collecting data is not enough; you need to analyze it and act on it. Not analyzing the data can lead to missed opportunities to improve the performance and availability of your applications. Make sure to use Azure Monitor and Azure Log Analytics to analyze the data, identify patterns and issues, and act. Conclusion In summary, the Azure Observability architecture is a set of services that allows for data collection, data analysis, and troubleshooting. It provides a comprehensive set of services that allows you to monitor, diagnose, and improve the performance and availability of your applications. With Azure Observability, you can gain deep insights into your environment and quickly identify and resolve issues, ensuring that your applications are always available and performing at their best.
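As a closing illustration of the logging pattern and the 500-error query described above, here is a minimal Python sketch using the azure-monitor-query and azure-identity packages. The workspace ID and the AppRequests table and column names are placeholders and depend on how your workspace-based Application Insights resource is set up; this is a hedged example, not a prescribed implementation.

Python
# Minimal sketch: query a Log Analytics workspace for recent failed requests.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

KQL = """
AppRequests
| where ResultCode == "500"
| summarize failures = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=KQL,
    timespan=timedelta(hours=24),
)

for table in response.tables:
    for row in table.rows:
        print(row)

The same query can be saved as a log alert rule in Azure Monitor so you are notified automatically instead of polling from a script.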

By Shripad Barve
Running Databases on Kubernetes

A few weeks ago, Kelsey Hightower wrote a tweet and held a live discussion on Twitter about whether it's a good idea or not to run a database on Kubernetes. This happened to be incredibly timely for me since we at QuestDB are about to launch our own cloud database service (built on top of k8s)! "Rubbing Kubernetes on Postgres Won't Turn it Into Cloud SQL" You can run databases on Kubernetes because it's fundamentally the same as running a database on a VM. The biggest challenge is understanding that rubbing Kubernetes on Postgres won't turn it into Cloud SQL. https://t.co/zdFobm4ijy One of the biggest takeaways from this discussion is that there seems to be a misconception about the features that k8s actually provides. While newcomers to k8s may expect that it can handle complex application lifecycle features out-of-the-box, it, in fact, only provides a set of cloud-native primitives (or building blocks) for you to configure and use to deploy your workflows. Any functionality outside of these core building blocks needs to be implemented somehow in additional orchestration code (usually in the form of an operator) or config. K8s Primitives When working with databases, the obvious concern is data persistence. Earlier in its history, k8s really shined in the area of orchestrating stateless workloads, but support for stateful workflows was limited. Eventually, primitives like StatefulSets, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs) were developed to help orchestrate stateful workloads on top of the existing platform. PersistentVolumes are abstractions that allow for the management of raw storage, ranging from local disk to NFS, cloud-specific block storage, and more. These work in concert with PersistentVolumeClaims that represent requests for a pod to access the storage managed by a PV. A user can bind a PVC to a PV to make an ownership claim on a set of raw disk resources encompassed by the PV. Then, you can add that PVC to any pod spec as a volume, effectively allowing you to mount any kind of persistent storage medium to a particular workload. The separation of PV and PVC also allows you to fully control the lifecycle of your underlying block storage, including mounting it to different workloads or freeing it all together once the claim expires. StatefulSets manage the lifecycles of pods that require more stability than what exists in other primitives like Deployments and ReplicaSets. By creating a StatefulSet, you can guarantee that when you remove a pod, the storage managed by its mounted PVCs does not get deleted along with it. You can imagine how useful this property is if you're hosting a database! StatefulSets also allow for ordered deployment, scaling, and rolling updates, all of which create more predictability (and thus stability) in our workloads. This is also something that seems to go hand-in-hand with what you want out of your database's infrastructure. What Else? While StatefulSets, PVs, and PVCs do quite a bit of work for us, there are still many administration and configuration actions that you need to perform on a production-level database. For example, how do you orchestrate backups and restores? These can get quite complex when dealing with high-traffic databases that include functionality such as WALs. What about clustering and high availability? Or version upgrades? Are these operations zero-downtime? Every database deals with these features in different ways, many of which require precise coordination between components to succeed. 
Kubernetes alone can't handle this. For example, you can't have a StatefulSet automatically set up your average RDBMS in a read-replica mode very easily without some additional orchestration. Not only do you have to implement many of these features yourself, but you also need to deal with the ephemeral nature of Kubernetes workloads. To ensure peak performance, you have to guarantee that the k8s scheduler places your pods on nodes that are already pre-tuned to run your database, with enough free resources to run it properly. If you're dealing with clustering, how are you handling networking to ensure that database nodes are able to connect to each other (ideally in the same cloud region)? This brings me to my next point... Pets, Not Cattle Once you start accounting for things like node performance-tuning and networking, along with the requirement to store data persistently in-cluster, all of a sudden, your infrastructure starts to grow into a set of carefully groomed pet servers instead of nameless herds of cattle. But one of the main benefits of running your application in k8s is the exact ability to treat your infrastructure like cattle instead of pets! All of the most common abstractions like Deployments, Ingresses, and Services, along with features like vertical and horizontal autoscaling, are made possible because you can run your workloads on a high-level set of infrastructure components, so you don't have to worry about your physical infrastructure layer. These abstractions allow you to focus more on what you're trying to achieve with your infrastructure instead of how you're going to achieve it. Then Why Even Bother With K8s? Despite these rough edges, there are plenty of reasons to want to run your database on k8s. There's no denying that k8s' popularity has increased tremendously over the past few years across both startups and enterprises. The k8s ecosystem is under constant development so that its feature set continues to expand and improve regularly. And its operator model allows end users to programmatically manage their workloads by writing code against the core k8s APIs to automatically perform tasks that would previously have to be done manually. K8s allows for easy GitOps-style management so you can leverage battle-tested software development practices when managing infrastructure in a reproducible and safe way. While vendor lock-in still exists in the world of k8s, its effect can be minimized to make it easier for you to go multi-cloud (or even swap one for another). So what can we do if we want to take advantage of all the benefits that k8s has to offer while using it to host our database? What Do You Need to Build an RDS on K8s? Toward the end of the live chat, someone asked Kelsey, "what do you actually need to build an RDS on k8s?" He jokingly answered with expertise, funding, and customers. While we're certainly on the right track with these at QuestDB, I think that this can be better phrased in that you need to implement Day 2 Operations to get to what a typical managed database service would provide. Day 2 Operations Day 2 Operations encompass many of the items that I've been discussing; backups, restore, stop/start, replication, high availability, and clustering. These are the features that differentiate a managed database service from a simple database hosted on k8s primitives, which is what I would call a Day 1 Operation. 
While k8s and its ecosystem can make it very easy to install a database in your cluster, you're going to eventually need to start thinking about Day 2 Operations once you get past the prototype phase. Here, I'll jump into more detail about what makes these operations so difficult to implement and why special care must be taken when implementing them, either by a database admin or a managed database service provider. Stop/Start Stopping and starting databases is a common operation in today's DevOps practices and is a must-have for any fully-featured managed database service. It is pretty easy to find at least one reason for wanting to stop and start a database. For example, you may want to have a database used for running integration tests that run on a pre-defined schedule. Or you maybe have a shared instance that's used by a development team for live QA before merging a commit. You could always create and delete database instances on-demand, but it is sometimes easier to have a reference to a static database connection string and URL in your test harness or orchestration code. While stop/start can be automated in k8s (perhaps by simply setting a StatefulSet's replica count to 0), there are still other aspects that need to be considered. If you're shutting down a database to save some money, will you also be spinning down any infrastructure? If so, how can you ensure that this infrastructure will be available when you start the database backup? K8s provides primitives like node affinity and taints to help solve this problem, but everyone's infrastructure provisioning situation and budget are different, and there's no one-size-fits-all approach to this problem. Backup and Restore One interesting point that Kelsey made in his chat was that having the ability to start an instance from scratch (moving from a stopped -> running state), is not trivial. Many challenges need to be solved, including finding the appropriate infrastructure to run the database, setting up network connectivity, mounting the correct volume, and ensuring data integrity once the volume has been mounted. In fact, this is such an in-depth topic that Kelsey compares going from 0 -> 1 running instance to an actual backup-and-restore test. If you can indeed spin up an instance from scratch while loading up pre-existing data, you have successfully completed a live restore test! Even if you have restores figured out, backups have their own complexities. K8s provides some useful building blocks like Jobs and CronJobs, which you can use if you want to take a one-off backup or create a backup schedule, respectively. But you need to ensure that these jobs are configured correctly in order to access raw database storage. Or if your database allows you to perform a backup using a CLI, then these jobs also need secure access to credentials to even connect to the database in the first place. From an end-user standpoint, you need an easy way to manage existing backups, which includes creating an index, applying data retention policies, and RBAC policies. Again, while k8s can help us build out these backup-and-restore components, a lot of these features are built on top of the infrastructure primitives that k8s provides. Replication, HA, and Clustering These days, you can get very far by simply vertically scaling your database. The performance of modern databases can be sufficient for almost anyone's use case if you throw enough resources at the problem. 
But once you've reached a certain scale or require features like high availability, there is a reason to enable some of the more advanced database management features like clustering and replication. Once you start down this path, the amount of infrastructure orchestration complexity can increase exponentially. You need to start thinking more about networking and physical node placement to achieve your desired goal. If you don't have a centralized monitoring, logging, and telemetry solution, you're now going to need one if you want to easily diagnose issues and get the best performance out of your infrastructure. Based on its architecture and feature set, every database can have different options for enabling clustering, many of which require intimate knowledge of the inner workings of the database to choose the correct settings. Vanilla k8s know nothing of these complexities. Instead, these all need to be orchestrated by an administrator or operator (human or automated). If you're working with production data, changes may need to happen with close-to-zero downtime. This is where managed database services shine. They can make some of these features as easy to configure as a single web form with a checkbox or two and some input fields. Unless you're willing to invest the time into developing these solutions yourself, or leverage existing open-source solutions if they exist, sometimes it's worth giving up some level of control for automated expert assistance when configuring a database cluster. Orchestration For your Day 2 Operations to work as they would in a managed database service such as RDS, they need to not just work but also be automated. Luckily for us, there are several ways to build automation around your database on k8s. Helm and YAML Tools Won't Get Us There Since k8s configuration is declarative, it can be very easy to get from 0 -> 1 with traditional YAML-based tooling like Helm or cdk8s. Many industry-leading k8s tools install into a cluster with a simple helm install or kubectl apply command. These are sufficient for Day 1 Operations and non-scalable deployments. But as soon as you start to move into more vendor-specific Day 2 Operations that require more coordination across system components, the usefulness of traditional YAML-based tools starts to degrade quickly since some imperative programming logic is required. Provisioners One pattern that you can use to automate database management is a provisioner process. We've even used this approach to build v1 of our managed cloud solution. When a user wants to make a change to an existing database's state, our backend sends a message to a queue that is eventually picked up by a provisioner. The provisioner reads the message, uses its contents to determine which actions to perform on the cluster, and performs them sequentially. Where appropriate, each action contains a rollback step in case of a kubectl apply error to leave the infrastructure in a predictable state. Progress is reported back to the application on a separate gossip queue, providing almost-immediate feedback to the user on the progress of each state change. While this has grown to be a powerful tool for us, there is another way to interact with the k8s API that we are now starting to leverage... 
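Before moving on to that, here is a rough Python illustration of the provisioner pattern just described: a queue consumer that turns each state-change message into a sequence of kubectl calls, with a rollback step per action. This is a hedged sketch, not QuestDB's actual code; the message shape and the report_progress/rollback helpers are assumptions.

Python
import json
import subprocess

def handle_message(raw_message: str) -> None:
    """Apply the actions for one requested state change, rolling back on failure."""
    msg = json.loads(raw_message)            # e.g. {"instance_id": "...", "actions": [...]}
    for action in msg["actions"]:            # each action names a rendered manifest file
        try:
            subprocess.run(
                ["kubectl", "apply", "-f", action["manifest"]],
                check=True, capture_output=True, text=True,
            )
            report_progress(msg["instance_id"], action, status="applied")    # assumed helper
        except subprocess.CalledProcessError as err:
            rollback(action)                                                  # assumed helper
            report_progress(msg["instance_id"], action, status="failed",
                            detail=err.stderr)
            break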
Operators K8s has an extensible Operator pattern that you can use to manage your own Custom Resources (CRs) by writing and deploying a controller that reconciles your current cluster state into its desired state, as specified by CR YAML spec files that are applied to the cluster. This is also how the functionality of the basic k8s building blocks are implemented, which just further emphasizes how powerful this model can be. Operators have the ability to hook into the k8s API server and listen for changes to resources inside a cluster. These changes get processed by a controller, which then kicks off a reconciliation loop where you can add your custom logic to perform any number of actions, ranging from simple resource existence to complex Day 2 Operations. This is an ideal solution to our management problem; we can offload much of our imperative code into a native k8s object, and database-specific operations appear to be as seamless as the standard set of k8s building blocks. Many existing database products use operators to accomplish this, and more are currently in development (see the Data on Kubernetes community for more information on these efforts). As you can imagine, coordinating activities like backups, restores, and clustering inside a mostly stateless and idempotent reconciliation loop isn't the easiest. Even if you follow best practices by writing a variety of simple controllers, with each managing its own clearly-defined CR, the reconciliation logic can still be very error-prone and time-consuming to write. While frameworks like Operator SDK exist to help you with scaffolding your operator and libraries like Kubebuilder provide a set of incredibly useful controller libraries, it's still a lot of work to undertake. K8s Is Just a Tool At the end of the day, k8s is a single tool in the DevOps engineer's toolkit. These days, it's possible to host workloads in a variety of ways, using managed services (PaaS), k8s, VMs, or even running on a bare metal server. The tool that you choose depends on a variety of factors, including time, experience, performance requirements, ease of use, and cost. While hosting a database on k8s might be a fit for your organization, it just as easily could create even more overhead and instability if not done carefully. Implementing the Day 2 features that I described above is time-consuming and costly to get right. Testing is incredibly important since you want to be absolutely sure that your (and your customers') precious data is kept safe and accessible when it's needed. If you just need a reliable database to run your application on top of, then maybe all of the work required to run a database on k8s might be too much for you to undertake. But if your database has strong k8s support (most likely via an operator), or you are doing something unique (and at scale) with your storage layer, it might be worth it to look more into managing your stateful databases on k8s. Just be prepared for a large time investment and ensure that you have the requisite in-house knowledge (or support) so that you can be confident that you're performing your database automation activities correctly and safely.
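To make the stop/start and backup building blocks above a bit more concrete: stopping a database that runs as a StatefulSet can be as simple as kubectl scale statefulset my-database --replicas=0, and a scheduled backup can be sketched with a CronJob like the one below. This is only a rough sketch under assumptions of my own (the image, Secret, and PersistentVolumeClaim names are hypothetical); the hard parts described above, such as credential handling, retention, indexing, and regular restore testing, still have to be layered on top.

YAML
# Nightly logical backup using the CronJob building block discussed above.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-nightly-backup
spec:
  schedule: "0 2 * * *"              # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:15     # should match the server's major version
              envFrom:
                - secretRef:
                    name: db-credentials   # supplies PGHOST, PGUSER, PGPASSWORD, PGDATABASE
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump -Fc "$PGDATABASE" > /backups/db-$(date +%F).dump
              volumeMounts:
                - name: backup-volume
                  mountPath: /backups
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: db-backups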

By Steve Sklar
Deploying Go Applications to AWS App Runner: A Step-By-Step Guide

In this blog post, you will learn how to deploy a Go application to AWS App Runner using the Go platform runtime. You will start with an existing Go application on GitHub and deploy it to AWS App Runner. The application is based on the URL shortener application (with some changes) that persists data in DynamoDB. Introduction AWS App Runner is a robust and user-friendly service that simplifies the deployment process of web applications in the AWS Cloud. It offers developers an effortless and efficient way to deploy their source code or container image directly to a scalable and secure web application without requiring them to learn new technologies or choose the appropriate compute service. One of the significant benefits of using AWS App Runner is that it connects directly to the code or image repository, enabling an automatic integration and delivery pipeline. This eliminates the need for developers to go through the tedious process of manually integrating their code with AWS resources. For developers, AWS App Runner simplifies the process of deploying new versions of their code or image repository. They can easily push their code to the repository, and App Runner will automatically take care of the deployment process. On the other hand, for operations teams, App Runner allows for automatic deployments every time a new commit is pushed to the code repository or a new container image version is added to the image repository. App Runner: Service Sources With AWS App Runner, you can create and manage services based on two types of service sources: Source code (covered in this blog post) Source image Source code is nothing but your application code that App Runner will build and deploy. All you need to do is point App Runner to a source code repository and choose a suitable runtime that corresponds to a programming platform version. App Runner provides platform-specific managed runtimes (for Python, Node.js, Java, Go, etc.). The AWS App Runner Go platform runtime makes it easy to build and run containers with web applications based on a Go version. You don’t need to provide container configuration and build instructions such as a Dockerfile. When you use a Go runtime, App Runner starts with a managed Go runtime image which is based on the Amazon Linux Docker image and contains the runtime package for a version of Go and some tools. App Runner uses this managed runtime image as a base image and adds your application code to build a Docker image. It then deploys this image to run your web service in a container. Let’s Get Started Make sure you have an AWS account and install the AWS CLI. 1. Create a GitHub Repo for the URL Shortener Application Clone this GitHub repo and then upload it to a GitHub repository in your account (keep the same repo name, i.e., apprunner-go-runtime-app): git clone https://github.com/abhirockzz/apprunner-go-runtime-app 2. Create a DynamoDB Table To Store URL Information Create a table named urls. Choose the following: Partition key named shortcode (data type String) On-Demand capacity mode 3. Create an IAM Role With DynamoDB-Specific Permissions export IAM_ROLE_NAME=apprunner-dynamodb-role aws iam create-role --role-name $IAM_ROLE_NAME --assume-role-policy-document file://apprunner-trust-policy.json Before creating the policy, update the dynamodb-access-policy.json file to reflect the DynamoDB table ARN (a sample policy sketch is shown below).
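For reference, the dynamodb-access-policy.json file is expected to contain an IAM policy scoped to the table's ARN. The snippet below is a minimal sketch (the region and account ID are placeholders, and the exact action list in the repository may differ):

JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/urls"
    }
  ]
}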
aws iam put-role-policy --role-name $IAM_ROLE_NAME --policy-name dynamodb-crud-policy --policy-document file://dynamodb-access-policy.json Deploy the Application to AWS App Runner If you have an existing AWS App Runner GitHub connection and want to use that, skip to the Repository selection step. 1. Create an AWS App Runner GitHub Connection Open the App Runner console and choose Create service. Create AWS App Runner Service On the Source and deployment page, in the Source section, for Repository type, choose Source code repository. Under Connect to GitHub, choose Add new, and then, if prompted, provide your GitHub credentials. Add GitHub connection In the Install AWS Connector for GitHub dialog box, if prompted, choose your GitHub account name. If prompted to authorize the AWS Connector for GitHub, choose Authorize AWS Connections. Choose Install. Your account name appears as the selected GitHub account/organization. You can now choose a repository in your account. 2. Repository Selection For Repository, choose the repository you created: apprunner-go-runtime-app. For Branch, choose the default branch name of your repository (for example, main). Configure your deployment: In the Deployment settings section, choose Automatic, and then choose Next. Choose GitHub repo 3. Configure Application Build On the Configure build page, for the Configuration file, choose Configure all settings here. Provide the following build settings: Runtime: Choose Go 1 Build command: Enter go build main.go Start command: Enter ./main Port: Enter 8080 Choose Next. Configure runtime info 4. Configure Your Service Under Environment variables, add an environment variable. For Key, enter TABLE_NAME, and for Value, enter the name of the DynamoDB table (urls) that you created before. Add environment variables Under Security > Permissions, choose the IAM role that you had created earlier (apprunner-dynamodb-role). Add IAM role for App Runner Choose Next. On the Review and create page, verify all the details you’ve entered, and then choose Create and deploy. If the service is successfully created, the console shows the service dashboard, with a Service overview of the application. Verify URL Shortener Functionality The application exposes two endpoints: To create a short link for a URL Access the original URL via the short link First, export the App Runner service endpoint as an environment variable: export APP_URL=<enter App Runner service URL> # example export APP_URL=https://jt6jjprtyi.us-east-1.awsapprunner.com 1. Invoke It With a URL That You Want to Access via a Short Link curl -i -X POST -d 'https://abhirockzz.github.io/' $APP_URL # output HTTP/1.1 200 OK Date: Thu, 21 Jul 2022 11:03:40 GMT Content-Length: 25 Content-Type: text/plain; charset=utf-8 {"ShortCode":"ae1e31a6"} You should get a JSON response with a short code and see an item in the DynamoDB table as well. You can continue to test the application with other URLs that you want to shorten! 2. Access the URL Associated With the Short Code Enter the following in your browser http://<enter APP_URL>/<shortcode>. For example, when you enter https://jt6jjprtyi.us-east-1.awsapprunner.com/ae1e31a6, you will be redirected to the original URL. You can also use curl. 
Here is an example: export APP_URL=https://jt6jjprtyi.us-east-1.awsapprunner.com curl -i $APP_URL/ae1e31a6 # output HTTP/1.1 302 Found Location: https://abhirockzz.github.io/ Date: Thu, 21 Jul 2022 11:07:58 GMT Content-Length: 0 Clean up Once you complete this tutorial, don’t forget to delete the following resources: DynamoDB table App Runner service Conclusion In this blog post, you learned how to go from a Go application in your GitHub repository to a complete URL shortener service deployed to AWS App Runner!
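As an aside: if you prefer to keep the build settings in the repository rather than entering them in the console, App Runner also supports a configuration file (apprunner.yaml) at the repository root. A sketch mirroring the settings used in this walkthrough might look like the following; treat the exact schema as something to verify against the App Runner documentation:

YAML
version: 1.0
runtime: go1
build:
  commands:
    build:
      - go build main.go
run:
  command: ./main
  network:
    port: 8080
  env:
    - name: TABLE_NAME
      value: urls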

By Abhishek Gupta
Time For Me To Fly… To Render

The Gartner hype cycle, illustrated below, can be applied to most aspects of technology: As new innovations enter their respective cycles, expectations are eventually realized—leading to some level of adoption. The goal for every innovation is to reach the plateau of productivity where consumers have determined that the reward of adopting the innovation far outweighs any known risks. At the same time, there is a point where the plateau of productivity begins to diminish, leading to an exodus away from that innovation. One simple example would be pagers (or beepers), which were common before mobile phones/devices reached the plateau of productivity. As technologists, we strive to deliver features, frameworks, products, or services that increase the plateau of productivity. The same holds true for the ones that we use. Recently, I felt like my current hosting platform began falling off the plateau of productivity. In fact, a recent announcement made me wonder if it was time to consider other options. Since I had a positive experience using the Render PaaS, I wanted to look at how easily I could convert one of my Heroku applications, adopt PostgreSQL, and migrate to Render. I’m describing that journey in this two-part series: Part 1: We’ll focus on migrating my backend services (Spring Boot and ClearDB MySQL). Part 2: We’ll focus on porting and migrating my frontend Angular client. Why I Chose Render If you have never heard of Render before, check out some of my previous publications: Using Render and Go for the First Time Under the Hood: Render Unified Cloud Purpose-Driven Microservice Design Launch Your Startup Idea in a Day How I Used Render To Scale My Microservices App With Ease What I find exciting about Render is that they continue to climb the slope of enlightenment while actively providing a solid solution for adopters recognizing the plateau of productivity. As I’ve noted in my articles, Render offers a “Zero DevOps” promise. This perfectly aligns with my needs since I don’t have the time to focus on DevOps tasks. The Heroku platform has several things that I am not too fond of: Daily restarts led to unexpected downtime for one of my services. Entry-level (all I really need) Postgres on Heroku allows for four hours of downtime per month. Pricing levels, from a consumer perspective, don’t scale well. From a pricing perspective, I am expecting to see significant cost savings after migrating all of my applications and services from Heroku to Render. What is more amazing is that I am getting better memory and CPU for that price, with linear scaling as my application footprint needs to grow. Converting a Single Service As I noted above, this is part one of a two-part series, and I’ll focus on the service tier in this article. The service I want to convert has the following attributes: Spring Boot RESTful API Service Heroku CloudAMQP (RabbitMQ) Message Broker Heroku ClearDB (MySQL) Database (single schema) Okta Integration On the Render PaaS side, the new service design will look like this: Render Web Service hosting Spring Boot RESTful API Service (via Docker) Render Private Service hosting RabbitMQ Message Broker (via Docker) Render Postgres with the ability for multiple schemas to exist Okta Integration Below is a side-by-side comparison of the two ecosystems: My high-level plan of attack for the conversion is as follows: Prepare Heroku for Conversion Before getting started, it is recommended to put all existing Heroku services into Maintenance Mode. 
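If you use the Heroku CLI, enabling maintenance mode (and capturing a fresh database backup while you are at it) is a one-liner per application; the app name below is a placeholder:

Shell
heroku maintenance:on --app my-heroku-app
heroku pg:backups:capture --app my-heroku-app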
This will prohibit any consumers from accessing the applications or services. While the source code should already be backed up and stored in a git-based repository, it is a good idea to make sure a database backup has been successfully created. Conversion of Heroku Services From a conversion perspective, I had two items to convert: the service itself and the ClearDB (MySQL) database. The conversion of my Spring Boot RESTful service did not involve much work. In fact, I was able to leverage the approach I used for a previous project of mine. For the database, I needed to convert from MySQL to PostgreSQL. My goal was to use Render’s Heroku Migrator to easily migrate Heroku Postgres to Render Postgres, but I needed to convert from MySQL to PostgreSQL first. Initially, I started down the pgloader path, which seemed to be a common approach for the database conversion. However, using my M1-chip MacBook Pro led to some unexpected issues. Instead, I opted to use NMIG to convert MySQL to PostgreSQL. For more information, please check out the “Highlights From the Database Conversion” section below. Create Services in Render After converting the database and the Spring Boot RESTful service running inside Docker, the next step was to create a Render Web Service for the Spring Boot RESTful API service. This was as easy as creating the service, giving it a name, and pointing to the appropriate repository for my code in GitLab. Since I also needed a RabbitMQ service, I followed these instructions to create a RabbitMQ Private Service running on Render. This included establishing a small amount of disk storage to persist messages that have not been processed. Finally, I created the necessary environment variables in the Render Dashboard for both the Spring Boot RESTful API service and the RabbitMQ message broker. Initialize and Validate the Services The next step was to start my services. Once they were running and the APIs were validated using my Postman collection, I updated my client application to point to the new Render service location. Once everything was up and running, my Render Dashboard appeared as shown below: Next Steps All that remained at this point was to delete the databases still running on Heroku and remove the migrated services from the Heroku ecosystem. When using Heroku, any time I merged code into the master branch of my service repository, the code was automatically deployed, provided I used GitLab CI/CD to deploy to Heroku in my source repository. However, there is no need to add code to the source file repository with Render. I simply needed to specify the Build & Deploy Branch in the Render Dashboard for the service: I love the Zero DevOps promise. Highlights From the Database Conversion By following the steps above, the conversion from Heroku to Render was smooth and successful. The biggest challenge for me was the conversion of data. At a high level, this mostly boiled down to a series of commands executed from the terminal of my MacBook Pro. First, I started a local Postgres instance via Docker: docker run --publish 127.0.0.1:5432:5432 --name postgres -e POSTGRES_PASSWORD=dbo -d postgres Next, I created a database called “example” using the following command (or pgAdmin): createdb example For converting my ClearDB (MYSQL) instance running on Heroku to my example Postgres database running locally, I used NMIG, which is a Node.js-based database conversion utility. 
After installing NMIG, I set up the config.json file with database endpoint information and credentials, and then I ran: /path/to/nmig$ npm start Next, I backed up the data to a file using the following command: pg_dump -Fc --no-acl --no-owner -h localhost -U postgres example > example.dump Rather than go through the hassle of creating a signed URL in AWS, I just used the pgAdmin client to import the backup into a newly created Postgres instance on Heroku. With the Postgres instance running and data validated, I created a new Postgres database on the Render PaaS. Then all I had to do was issue the following command: pg_restore --verbose --no-acl --no-owner -d postgres://username:password@hostname.zone-postgres.render.com/example example.dump Lessons Learned Along the Way Looking back on my conversion from Heroku to Render, here are some lessons I learned along the way: I had a minor issue with the Postgres database updating the date/time value to include the DST offset. This may have been an issue with my original database design, but I wanted to pass this along. In my case, the impacted column is only used for Date values, which did not change for me. I included a database column named END in one of my tables, which caused a problem when either Postgres or Hibernate attempted to return a native query. The service saw the END column name and injected it as a SQL keyword. I simply renamed the column to fix this issue, which I should have known not to do in the first place. With Render, I needed to make the RabbitMQ service a Private Service because the Web Service option does not expose the expected port. However, with this approach, I lost the ability to access the RabbitMQ admin interface, since Private Services are not exposed externally. It looks like Render plans to address this feature request. All in all, these minor hurdles weren’t significant enough to impact my decision to migrate to Render. Conclusion The most important aspect of Gartner’s plateau of productivity is providing products, frameworks, or services that allow consumers to thrive and meet their goals. The plateau of productivity is not intended to be flashy or fashionable—in a metaphorical sense. When I shared this conclusion with Ed, a Developer Advocate at Render, his response was something I wanted to share: “Render is pretty avowedly not trying to be ‘fashionable.’ We're trying to be unsurprising and reliable.” Ed’s response resonated deeply with me and reminded me of a time when my former colleague told me my code came across as “boring” to him. His comment turned out to be the greatest compliment I could have received. You can read more here. In any aspect of technology, the decision on which provider to select should always match your technology position. If you are unsure, the Gartner hype cycle is a great reference point, and you can get started with a subscription to their service here. I have been focused on the following mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” - J. Vester When I look at the Render PaaS ecosystem, I see a solution that adheres to my mission statement while residing within my hype cycle preference. What makes things better is that I fully expect to see a 44% savings in my personal out-of-pocket costs—even more as my services need to scale vertically. 
For those considering hosting solutions, I recommend adding Render to the list of providers for review and analysis. You can get started for free by following this link. The second part of this series will be exciting. I will demonstrate how to navigate away from paying for my static client written in Angular and take advantage of Render’s free Static Sites service using either Vue or Svelte. Which framework will I choose … and why? Have a really great day!

By John Vester
Get Started With Trino and Alluxio in Five Minutes

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino was designed to handle data warehousing, ETL, and interactive analytics, aggregating large amounts of data and producing reports. Alluxio is an open-source data orchestration platform for large-scale analytics and AI. Alluxio sits between compute frameworks such as Trino and Apache Spark and various storage systems like Amazon S3, Google Cloud Storage, HDFS, and MinIO. This is a tutorial for deploying Alluxio as the caching layer for Trino using the Iceberg connector. Why Do We Need Caching for Trino? A small fraction of the petabytes of data you store is generating business value at any given time. Repeatedly scanning the same data and transferring it over the network consumes time, compute cycles, and resources. This issue is compounded when pulling data from disparate Trino clusters across regions or clouds. In these circumstances, caching solutions can significantly reduce the latency and cost of your queries. Trino has a built-in caching engine, Rubix, in its Hive connector. While this system is convenient as it comes with Trino, it is limited to the Hive connector and has not been maintained since 2020. It also lacks security features and support for additional compute engines. Trino on Alluxio Alluxio connects Trino to various storage systems, providing APIs and a unified namespace for data-driven applications. Alluxio allows Trino to access data regardless of the data source and transparently cache frequently accessed data (e.g., tables commonly used) into Alluxio distributed storage. Using Alluxio Caching via the Iceberg Connector Over MinIO File Storage We’ve created a demo that shows how to configure Alluxio to use write-through caching with MinIO. This is achieved by using the Iceberg connector and making a single change to the location property on the table from the Trino perspective. In this demo, Alluxio is run on separate servers; however, it’s recommended to run it on the same nodes as Trino. This means that all the configurations for Alluxio will be located on the servers where Alluxio runs, while Trino’s configuration remains unaffected. The advantage of running Alluxio externally is that it won’t compete for resources with Trino, but the disadvantage is that data will need to be transferred over the network when reading from Alluxio. It is crucial for performance that Trino and Alluxio are on the same network. To follow this demo, copy the code located here. Trino Configuration Trino is configured identically to a standard Iceberg configuration. Since Alluxio is running external to Trino, the only configuration needed is at query time and not at startup. Alluxio Configuration The configuration for Alluxio can all be set using the alluxio-site.properties file. To keep all configurations colocated on the docker-compose.yml, we are setting them using Java properties via the ALLUXIO_JAVA_OPTS environment variable. This tutorial also refers to the master node as the leader and the workers as followers. Master Configurations alluxio.master.mount.table.root.ufs=s3://alluxio/ The leader exposes ports 19998 and 19999, the latter being the port for the web UI. Worker Configurations alluxio.worker.ramdisk.size=1G alluxio.worker.hostname=alluxio-follower The follower exposes ports 29999 and 30000, and sets up shared memory used by Alluxio to store data.
This is set to 1G via the shm_size property and is referenced from the alluxio.worker.ramdisk.size property. Shared Configurations Between Leader and Follower alluxio.master.hostname=alluxio-leader # Minio configs alluxio.underfs.s3.endpoint=http://minio:9000 alluxio.underfs.s3.disable.dns.buckets=true alluxio.underfs.s3.inherit.acl=false aws.accessKeyId=minio aws.secretKey=minio123 # Demo-only configs alluxio.security.authorization.permission.enabled=false The alluxio.master.hostname needs to be on all nodes, leaders and followers. The majority of the shared configs point Alluxio to the underfs, which is MinIO in this case. alluxio.security.authorization.permission.enabled is set to “false” to keep the Docker setup simple. Note: This is not recommended in a production or CI/CD environment. Running Services First, you want to start the services. Make sure you are in the trino-getting-started/iceberg/trino-alluxio-iceberg-minio directory. Now, run the following command: docker-compose up -d You should expect to see the following output. Docker may also have to download the Docker images before you see the “Created/Started” messages, so there could be extra output: [+] Running 10/10 ⠿ Network trino-alluxio-iceberg-minio_trino-network Created 0.0s ⠿ Volume "trino-alluxio-iceberg-minio_minio-data" Created 0.0s ⠿ Container trino-alluxio-iceberg-minio-mariadb-1 Started 0.6s ⠿ Container trino-alluxio-iceberg-minio-trino-coordinator-1 Started 0.7s ⠿ Container trino-alluxio-iceberg-minio-alluxio-leader-1 Started 0.9s ⠿ Container minio Started 0.8s ⠿ Container trino-alluxio-iceberg-minio-alluxio-follower-1 Started 1.5s ⠿ Container mc Started 1.4s ⠿ Container trino-alluxio-iceberg-minio-hive-metastore-1 Started Open Trino CLI Once this is complete, you can log into the Trino coordinator node. We will do this by using the exec command and running the trino CLI executable as the command we run on that container. Notice the container id is trino-alluxio-iceberg-minio-trino-coordinator-1, so the command you will run is: docker container exec -it trino-alluxio-iceberg-minio-trino-coordinator-1 trino When you start this step, you should see the trino cursor once the startup is complete. It should look like this when it is done: trino> To best understand how this configuration works, let’s create an Iceberg table using a CTAS (CREATE TABLE AS) query that pushes data from one of the TPC connectors into Iceberg that points to MinIO. The TPC connectors generate data on the fly so we can run simple tests like this. First, run a command to show the catalogs to see the tpch and iceberg catalogs since these are what we will use in the CTAS query: SHOW CATALOGS; You should see that the Iceberg catalog is registered. MinIO Buckets and Trino Schemas Upon startup, the following command is executed on an initialization container that includes the mc CLI for MinIO. This creates a bucket in MinIO called /alluxio, which gives us a location to write our data to, and we can tell Trino where to find it: /bin/sh -c " until (/usr/bin/mc config host add minio http://minio:9000 minio minio123) do echo '...waiting...' && sleep 1; done; /usr/bin/mc rm -r --force minio/alluxio; /usr/bin/mc mb minio/alluxio; /usr/bin/mc policy set public minio/alluxio; exit 0; " Note: This bucket will act as the mount point for Alluxio, so the schema directory alluxio://lakehouse/ in Alluxio will map to s3://alluxio/lakehouse/.
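For reference, this is roughly how those properties end up on the Alluxio containers: docker-compose.yml passes them as Java system properties through the ALLUXIO_JAVA_OPTS environment variable mentioned earlier. The snippet below is a trimmed sketch rather than the repository's actual compose file, and the image tag is an assumption:

YAML
  alluxio-leader:
    image: alluxio/alluxio:2.9.0     # tag is an assumption; use the one from the repo
    command: master
    environment:
      ALLUXIO_JAVA_OPTS: >-
        -Dalluxio.master.hostname=alluxio-leader
        -Dalluxio.master.mount.table.root.ufs=s3://alluxio/
        -Dalluxio.underfs.s3.endpoint=http://minio:9000
        -Dalluxio.underfs.s3.disable.dns.buckets=true
        -Daws.accessKeyId=minio
        -Daws.secretKey=minio123
    ports:
      - "19998:19998"
      - "19999:19999"   # web UI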
Querying Trino Let's move on to creating our SCHEMA that points to the bucket in MinIO and then run our CTAS query. Back in the terminal, create the iceberg.lakehouse SCHEMA. This will be the first call to the metastore to save the schema location in the Alluxio namespace. Notice that we need to specify the hostname alluxio-leader and port 19998 since we did not set Alluxio as the default file system. Take this into consideration if you want Alluxio caching to be the default usage and transparent to users managing DDL statements: CREATE SCHEMA iceberg.lakehouse WITH (location = 'alluxio://alluxio-leader:19998/lakehouse/'); Now that we have a SCHEMA that references the bucket where we store our tables in Alluxio, which syncs to MinIO, we can create our first table. Optional: To watch your queries run, log into the Trino UI using any username (it doesn’t matter since no security is set up). Move the customer data from the tiny generated TPCH data into MinIO using a CTAS query. Run the following query, and if you like, watch it running on the Trino UI: CREATE TABLE iceberg.lakehouse.customer WITH ( format = 'ORC', location = 'alluxio://alluxio-leader:19998/lakehouse/customer/' ) AS SELECT * FROM tpch.tiny.customer; Go to the Alluxio UI and the MinIO UI, and browse the Alluxio and MinIO files. You will now see a lakehouse directory that contains a customer directory holding the data written by Trino to Alluxio and, in turn, by Alluxio to MinIO. Now that there is a table under Alluxio and MinIO, you can query this data with the following: SELECT * FROM iceberg.lakehouse.customer LIMIT 10; How are we sure that Trino is actually reading from Alluxio and not MinIO? Let's delete the data in MinIO and run the query again just to be sure. Once you delete this data, you should still see data return. Stopping Services Once you complete this tutorial, the resources used for this exercise can be released by running the following command: docker-compose down Conclusion At this point, you should have a better understanding of Trino and Alluxio, how to get started with deploying Trino and Alluxio, and how to use Alluxio caching with an Iceberg connector and MinIO file storage. I hope you enjoyed this article. Be sure to like this article and comment if you have any questions!

By Brian Olsen
AWS Cloud Migration: Best Practices and Pitfalls to Avoid

Migrating to the cloud can be a daunting task, but with the right plan and execution, it can be a seamless process. AWS offers various services that can help you with your migration, but it's important to be aware of the best practices and pitfalls to avoid. This blog post will discuss the best practices and common pitfalls to avoid when migrating to the AWS cloud. Best Practices Plan Your Migration Before you begin, it's important to plan your migration. This includes identifying your current environment, defining your migration goals, and deciding which applications and data you want to migrate. Planning your migration will help you identify potential challenges and provide a roadmap for a successful migration. Assess Your Current Environment Before migrating to the cloud, it's important to assess your current environment. This includes identifying your current infrastructure, applications, and data. Assessing your current environment will help you identify what needs to be migrated and what can be left behind. For example, you can use AWS Application Discovery Service. It automatically discovers and collects information about your application's infrastructure, including servers, databases, and dependencies. Choose the Right Migration Strategy AWS offers seven migration strategies (initially, it was six), which include: Retire Retain Relocate Lift and shift (Rehost) Repurchase Re-platform Refactor Theoretically, there are seven strategies, but I will discuss the more common approaches to migrating applications to the cloud. Always choose the migration strategy that best fits your needs. For example, lift and shift is a good option for simple, infrequently used applications, while refactoring is a better option for more complex applications. 1. Lift and Shift: This is the most basic migration strategy and involves simply moving existing applications and workloads to the cloud without any significant changes. This approach is best for simple, stateless applications that do not require significant changes to operate in the cloud. This approach is also known as "lift and shift" or "rehosting" because the goal is to move the application as is, with minimal changes. This approach can be done using the AWS MGN service. It is the best way to migrate any on-prem physical or virtual servers to AWS. After migration, you can use AWS Elastic Beanstalk, AWS EC2, or AWS Auto Scaling. This approach is relatively quick and simple, but it may not provide optimal performance or cost savings in the long run, as the applications may not be fully optimized for the cloud. 2. Re-architecture: This approach involves making significant changes to the architecture of the application to take full advantage of the cloud. This may include breaking down monolithic applications into microservices, using containers and Kubernetes for orchestration, and using cloud-native services such as AWS Lambda and Amazon SNS. This approach is best for complex and large applications that require significant changes to operate efficiently in the cloud. This approach takes longer than lift and shift and requires a deep understanding of the application and the cloud. 3. Replatforming: This approach involves moving an existing application to a new platform, such as moving a Java application to .NET. This approach is best for organizations that want to move to a new technology platform that is not supported on-premises and to take advantage of the benefits of the new platform.
AWS services like AWS Elastic Beanstalk, AWS ECS, and AWS RDS can be used to deploy the new platform in the cloud. 4. Hybrid: This approach involves running some workloads on-premises and some in the cloud. This approach is best for organizations that have strict compliance or security requirements that prevent them from moving all their workloads to the cloud. This approach is also good for organizations that have complex interdependencies between on-premise and cloud-based workloads. It also allows organizations to take a more gradual approach to migration, moving workloads to the cloud as they become ready. AWS services like AWS Direct Connect and AWS VPN can be used to create a secure and reliable connection between on-premise and cloud-based resources. AWS EKS and Storage gateway, AWS Outposts are good examples to work in a hybrid cloud. 5. Cloud-native: This approach involves building new applications using cloud-native services and architectures from the ground up. This approach is best for organizations that are starting new projects and want to take full advantage of the scalability and elasticity of the cloud. This approach requires a deep understanding of cloud-native services and architectures and is generally more complex than lift and shift or re-platforming. AWS App Runner, AWS Fargate, and ECS can be used to implement cloud-native services. Test Your Migration Once your migration plan is in place, it's important to test your migration. This includes testing your applications, data, and infrastructure. Testing your migration will help you identify any issues and ensure that your applications and data are working as expected in the cloud. Monitor and Optimize Your Migration After your migration is complete, it's important to monitor and optimize your migration. This includes monitoring your applications, data, and infrastructure to ensure that they are working as expected in the cloud. It also includes optimizing your cloud resources to reduce costs and improve performance. Avoid Pitfalls Avoid vendor lock-in: Take advantage of open-source and cross-platform tools and technologies to avoid being locked into a single vendor's ecosystem. Avoid the pitfall of not testing: One of the common pitfalls of cloud migration is not testing the migration properly. It is important to test the migration thoroughly to ensure that all applications and data are working as expected in the cloud. Another pitfall is not considering security: Another common pitfall of cloud migration is not considering security. It's important to ensure that your applications and data are secure in the cloud. This includes securing your data in transit and at rest and ensuring that your applications are secure. Not considering scalability: Another pitfall of cloud migration is not considering scalability. It's important to ensure that your applications and data are scalable in the cloud. This includes ensuring that your applications and data can handle an increase in traffic and usage. Not considering cost: Another pitfall of cloud migration is not considering the cost. It's important to ensure that your migration is cost-effective and that you are not over-provisioning resources. Not considering compliance: Another pitfall of cloud migration is not considering compliance. It's important to ensure that your migration complies with any relevant laws and regulations. Finally, train your team on the new tools and technologies that they'll be using in the cloud. 
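On the cost pitfall in particular, it helps to put a guardrail in place before the first workload moves. For example, a monthly cost budget with an alert can be created from the CLI; the account ID, amount, and e-mail address below are placeholders, and the JSON payloads are kept to the minimum required fields:

Shell
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName":"migration-monthly","BudgetLimit":{"Amount":"1000","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"team@example.com"}]}]'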
In conclusion, migrating to the AWS cloud requires planning, testing, monitoring, and optimization. Avoiding the pitfalls mentioned above, and following the best practices, will help ensure a successful migration. Additionally, it is important to keep security, scalability, cost, and compliance in mind throughout the migration process.

By Rahul Nagpure
Taming Cloud Costs With Infracost

When we combine the cloud with IaC tools like Terraform and continuous deployment we get the almost magical ability to create resources on demand. For all its benefits, however, the cloud has also introduced a set of difficulties, one of which is estimating cloud costs accurately. Cloud providers have complex cost structures that are constantly changing. AWS, for example, offers 536 types of EC2 Linux machines. Many of them have similar names and features. Take for example "m6g.2xlarge" and "m6gd.2xlarge" — the only difference is that the second comes with an SSD drive, which will add $60 dollars to the bill. Often, making a mistake in defining your infrastructure can cause your bill to balloon at the end of the month. It’s so easy to go above budget. We can set up billing alerts, but there are no guarantees that they will work. Alerts can happen during the weekend or be delayed, making us shoot past our budget in a few hours. So, how can we avoid this problem and use the cloud with confidence? Enter Infracost Infracost is an open-source project that helps us understand how and where we’re spending our money. It gives a detailed breakdown of actual infrastructure costs and calculates how changes impact them. Basically, Infracost is a git diff for billing. Infracost has two versions: a VSCode addon and a command line program. Both do the same thing: parse Terraform code, pull the current cost price points from a cloud pricing API, and output an estimate. You can use Infracost pricing API for free or host your own. The paid tier includes a cloud dashboard to track changes over time. We can see the estimates right in the IDE: Real-time cost estimation on VSCode. Or as comments in pull requests or commits: Cost change information in the PR. Setting Up Infracost To try out Infracost, we’ll need the following: An Infracost API key: You can get one by signing up for free at Infracost.io. The Infracost CLI installed in your machine Some Terraform files Once the CLI tool is installed, run infracost auth login to retrieve the API key. Now we’re ready to go. The first command we’ll try is infracost breakdown. It analyzes Terraform plans and prints out a cost estimate. The --path variable must point to the folder containing your Terraform files. For example, imagine we want to provision an "a1.medium" EC2 instance with the following: provider "aws" { region = "us-east-1" skip_credentials_validation = true skip_requesting_account_id = true } resource "aws_instance" "myserver" { ami = "ami-674cbc1e" instance_type = "a1.medium" root_block_device { volume_size = 100 } } At current rates, this instance costs $28.62 per month to run: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_instance.myserver ├─ Instance usage (Linux/UNIX, on-demand, a1.medium) 730 hours $18.62 └─ root_block_device └─ Storage (general purpose SSD, gp2) 100 GB $10.00 OVERALL TOTAL $28.62 If we add some extra storage (600GB of EBS), the cost increases to $155.52, as shown below: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_instance.myserver ├─ Instance usage (Linux/UNIX, on-demand, a1.medium) 730 hours $18.62 ├─ root_block_device │ └─ Storage (general purpose SSD, gp2) 100 GB $10.00 └─ ebs_block_device[0] ├─ Storage (provisioned IOPS SSD, io1) 600 GB $75.00 └─ Provisioned IOPS 800 IOPS $52.00 OVERALL TOTAL $155.62 Infracost can also calculate usage-based resources like AWS Lambda. 
Let's see what happens when we swap the EC2 instance for serverless functions: provider "aws" { region = "us-east-1" skip_credentials_validation = true skip_requesting_account_id = true } resource "aws_lambda_function" "my_lambda" { function_name = "my_lambda" role = "arn:aws:lambda:us-east-1:account-id:resource-id" handler = "exports.test" runtime = "nodejs12.x" memory_size = 1024 } Running infracost breakdown yields a total cost of 0 dollars: $ infracost breakdown --path . Name Monthly Qty Unit Monthly Cost aws_lambda_function.my_lambda ├─ Requests Monthly cost depends on usage: $0.20 per 1M requests └─ Duration Monthly cost depends on usage: $0.0000166667 per GB-seconds OVERALL TOTAL $0.00 That can’t be right unless no one uses our Lambda function, which is precisely what the tool assumes by default. We can fix this by providing an estimate via a usage file. We can create a sample usage file with this command: $ infracost breakdown --sync-usage-file --usage-file usage.yml --path . We can now provide estimates by editing usage.yml. The following example consists of 5 million requests with an average runtime of 300 ms: resource_usage: aws_lambda_function.my_lambda: monthly_requests: 5000000 request_duration_ms: 300 We’ll tell Infracost to use the usage file with --usage-file to get a proper cost estimate: $ infracost breakdown --path . --usage-file usage.yml Name Monthly Qty Unit Monthly Cost aws_lambda_function.my_lambda ├─ Requests 5 1M requests $1.00 └─ Duration 1,500,000 GB-seconds $25.00 OVERALL TOTAL $26.00 That’s much better. Of course, this is accurate as long as our usage file is correct. If you’re unsure, you can integrate Infracost with the cloud provider and pull the utilization metrics from the source. Git Diff for Cost Changes Infracost can save results in JSON by providing the --format json and --out-file options. This gives us a file we can check in source control and use as a baseline. $ infracost breakdown --path . --format json --usage-file usage.yml --out-file baseline.json We can now compare changes by running infracost diff. Let’s see what happens if the Lambda execution time goes from 300 to 350 ms: $ infracost diff --path . --compare-to baseline.json --usage-file usage.yml ~ aws_lambda_function.my_lambda +$4.17 ($26.00 → $30.17) ~ Duration +$4.17 ($25.00 → $29.17) Monthly cost change for TomFern/infracost-demo/dev Amount: +$4.17 ($26.00 → $30.17) Percent: +16% As you can see, the impact is a 16% increase. Integrating Infracost With CI/CD We’ve seen how this tool can help us estimate cloud costs. That’s valuable information, but what role does Infracost take in continuous integration? To answer that, we must understand what infracost comment does. The comment command takes a JSON file generated by infracost diff and posts its contents directly into GitHub, Bitbucket, or GitLab. Thus, by running Infracost inside CI, we make relevant cost information available to everyone on the team. Infracost comments on the cost difference in a GitHub commit. If you want to learn how to configure CI/CD to run Infracost on every update, check out this tutorial: How to Run Infracost on Semaphore. Working With Monorepos You will likely have separate Terraform files for each subproject if you work with a monorepo. In this case, you should add an infracost config file at the project's root. This allows you to specify the project names and where Terraform and usage files are located. You can also set environment variables and other options. 
version: 0.1 projects: - path: dev usage_file: dev/infracost-usage.yml env: NODE_ENV: dev - path: prod usage_file: prod/infracost-usage.yml env: AWS_ACCESS_KEY_ID: ${PROD_AWS_ACCESS_KEY_ID} AWS_SECRET_ACCESS_KEY: ${PROD_AWS_SECRET_ACCESS_KEY} NODE_ENV: production When the config file is involved, you must replace the --path argument with --config-file in all your commands. Establishing Policies One more trick Infracost has up its sleeve is enforcing policies. Policies are rules that evaluate the output of infracost diff and stop the CI pipeline if a resource goes over budget. This feature allows managers and team leads to enforce limits. When the policy fails, the CI/CD pipeline stops with an error, preventing the infrastructure from being provisioned. When a policy is in place, Infracost warns us if any limits are exceeded. Infracost implements policies using Open Policy Agent (OPA), which uses the Rego language to encode policy rules. Rego has a ton of features, and it’s worth digging in to learn it thoroughly, but for our purposes, we only need to learn a few keywords: deny[out] defines a new policy rule that fails if the out object has failed: true msg: defines the error message shown when the policy fails. out: defines the logic that makes the policy pass or fails. input: references the contents of the JSON object generated with infracost diff The following example shows a policy that fails when the total budget exceeds $1,000: # policy.rego package infracost deny[out] { # define a variable maxMonthlyCost = 1000.0 msg := sprintf( "Total monthly cost must be less than $%.2f (actual diff is $%.2f)", [maxMonthlyCost, to_number(input.totalMonthlyCost)], ) out := { "msg": msg, "failed": to_number(input.totalMonthlyCost) >= maxMonthlyCost } } This is another example that fails if the cost difference is equal to or greater than $500. package infracost deny[out] { # maxDiff defines the threshold that you require the cost estimate to be below maxDiff = 500.0 msg := sprintf( "Total monthly cost diff must be less than $%.2f (actual diff is $%.2f)", [maxDiff, to_number(input.diffTotalMonthlyCost)], ) out := { "msg": msg, "failed": to_number(input.diffTotalMonthlyCost) >= maxDiff } } You can experiment and try several examples online on the OPA playground. To enforce a policy, you must add the --policy-path option in any of the infracost comment commands like this: curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh checkout infracost diff --path . --usage-file usage.yml --compare-to baseline.json --format json --out-file /tmp/infracost-diff-commit.json infracost comment github --path=/tmp/infracost-diff-commit.json --repo=$SEMAPHORE_GIT_REPO_SLUG --commit=$SEMAPHORE_GIT_SHA --github-token=$GITHUB_API_KEY --policy-path policy.rego --behavior=update Conclusion The power to spin up resources instantly is a double-edged knife: a typo in a Terraform file can be a costly mistake. Staying proactive when managing our cloud infrastructure is essential to sticking to the budget and avoiding nasty surprises at the end of the month. If you’re already automating deployment with continuous deployment and managing services with Terraform, you may as well add Infracost to the mix to make more informed decisions and impose spending limits. Setting this up takes only a few minutes and can save thousands of dollars down the road.

By Tomas Fernandez
The Evolution of Cloud-Native Authorization

Authentication in the Age of SaaS and Cloud Let's start with the differences between authentication and authorization. People tend to lump these concepts together as auth, but they're two distinct processes. Authentication describes the process of finding out that you are who you say you are. In the past, we used user IDs and passwords. These days it's much more common to use magic links or multi-factor authentication, etc. but, it's the same process. Authentication used to be the responsibility of the operating system that logs you in once you provide a password. But over the past 15 years, as we moved into the age of SaaS and cloud, that changed. The first generation of SaaS and cloud apps had to reinvent this process because there were no longer any operating systems to ask to authenticate the user's identity. In the course of the last 15 years, we started to work together as an industry to develop standards around authentication, like OAuth2, OpenID connect, and SAML. We’ve started to use JWTs and so on. Today, no one has to build a log-in system if they don't want to. Numerous developer services can help you do this. Overall, you can say that we've successfully moved identity from on-premises to the realm of SaaS in the cloud. Authorization, on the other hand, has not transitioned to the cloud. Authorization, or access control, is the process of discerning what you can see and do once you're logged in. Unlike authentication, authorization is a problem that is far from being solved. The problem is that there aren’t any industry standards for authorization. You can apply some patterns like role-based access control (RBAC) and attribute-based access control (ABAC), but there are no standards because authorization is a domain-specific problem. There aren't any developer services either. Can you think of a Twilio or a Stripe for authorization? And because there are no standards, or developer services to speak of, companies lose agility because they have to spend time building an in-house authorization system and go through the pain that entails. You have to think about the opportunity cost. How much will it cost you to spend time developing and maintaining an in-house access control system, instead of focusing on your value propositions? And, unfortunately, when companies do this themselves they do it poorly. This is the reason that broken access control ranks #1 in the top 10 security issues listed by the open web application security project (OWASP). It seems like we really dug ourselves into a pretty big hole and now it's time to dig ourselves back out. Cloud-Native Authorization Let's look at how we got here. There have been three transitions that have affected the world of software in general and authorization in particular: 1. Transition to SaaS: Authentication made the move successfully, but access control hasn’t. If we dig into why, we see that back in the day, when applications just talked to the operating system, we had a directory, like LDAP. In this directory, you had groups, with users assigned to those groups. Those groups would typically map to your business application roles and things were pretty simple. But now, we don't have an operating system or a global directory that we can query, so every application has to reinvent the authorization process. 2. Rise of microservices: We’ve seen an architectural shift moving from monolithic applications into microservices. Back when we had monoliths, authorization happened at one time and in one place in the code. 
Today, we have multiple microservices, and each microservice has to do its own authorization. We also have to think about authorizing interactions between microservices, so that only the right interaction patterns are allowed. 3. Zero-trust: The move from the perimeter-based security approach to zero trust security. With zero trust, a lot of the responsibility for security moved away from the environment and into the application. We have a new world order now where everything is in the cloud, everything is a microservice, and zero trust is a must. Unfortunately, not all applications have caught up with this new paradigm, and when we compare well-architected applications to poorly architected ones we clearly see five anti-patterns and five corresponding best practices emerge. Five Best Practices of Cloud-Native Access Control 1. Purpose-Built Authorization Service Today, every service authorizes on its own. If each microservice has to worry about its authorization, each microservice is likely to do it a little bit differently. So when you want to change the authorization behavior across your entire system, you have to think about how each microservice has to be updated and how the authorization logic works in that microservice, which becomes very difficult as you add more microservices to your system. The best practice that we want to consider is to extract the authorization logic out of the microservices and create a purpose-built microservice that will only deal with authorization. In the past couple of years, large organizations have begun publishing papers that describe how their purpose-built authorization system works. It all started with the Google Zanzibar paper that describes how they built the authorization system for Google Drive and other services. Other companies followed and described how they built their purpose-built authorization service and a distributed system around it. These include Intuit’s AuthZ paper, Airbnb's Himeji, Carta’s AuthZ, and Netflix’s PAS. We are now starting to distill these learnings and are putting them into software. 2. Fine-Grained Access Control The second anti-pattern is baking coarse-grained roles into your application. We often see this in applications where you have roles, such as "admin," "member," and "viewer." These roles are baked directly into the application code and as developers add more permissions, they try to cascade those permissions into these existing roles, which makes the authorization model hard to fine-tune. The best practice, in this case, is to start with a fine-grained authorization model that applies the principle of least privilege. The goal is to give a user only the permissions that they need, no more and no less. This is important because when the identity is compromised — and this is not a question of if, it's a question of when — we can limit the damage that this compromised identity can potentially cause by limiting the permissions that we assign to the roles that we specify. 3. Policy-Based Access Management The third anti-pattern that we see is authorization "spaghetti code," where developers have a sprinkled "switch" and "if" statements all around the code that governs the authorization logic. That's a bad idea and costs a lot when you want to change the way that authorization happens across your system. The best practice here is to maintain a clear separation of duties and keep the authorization-related logic in an authorization policy. 
By separating policy from application code, we ensure that the developer team is responsible for developing the app and the application security team is responsible for securing it. 4. Real-Time Access Checks The fourth anti-pattern is using stale permissions in an access token. This tends to occur in the early life of an app, when developers leverage scopes and then bake those scopes into the access token. Here’s a scenario: a user that has an "admin" scope logs in. That scope is baked into the access token and wherever that user interacts with our system using an unexpired access token with the "admin" scope, that user has admin privileges. Why is this bad? Because if we want to remove the "admin" scope from the user and invalidate it, we’ll run into a hurdle. As long as the user holds an unexpired access token, they're going to have access to all the resources that the access token grants them. You simply cannot have a fine-grained access control model using access tokens. Even if the issuer of the access token has visibility into what resources the user can access, it’s impractical to stuff those entitlements in an access token. Let's say we have a collection of documents and we want to give a user read permission on a document. Which document are we talking about? All documents? Only a few of them? Clearly this approach doesn’t scale. The best practice here is never to assume that the access token has the permissions that we need and instead have real-time access checks that take into account the identity context, the resource context, and the permission before we grant access to a protected resource. 5. Centralized Decision Logs Lastly, unauthorized access is not a question of if, it's a question of when. With that said, companies tend to neglect to maintain consistent authorization logs, which limits their ability to trace unauthorized incidents. The best practice is to have fine-grained, centralized authorization logs. We need to monitor and log everything in a centralized location that we can analyze downstream to get a better understanding of what’s happening in our system. Fine-Grained Access Control Patterns Let's talk a little bit more about fine-grained access control and how it came to be. Access Control Lists (ACL) Back in the 80s, operating systems would define permissions, such as "read," "write," and "execute" on files and folders. This pattern was called access control lists (ACL). With ACL, you can answer questions like: "Does Alice have `read` access to this file?" Role-Based Access Control (RBAC) RBAC, or role-based access control, came around in the 90s and early 2000s with the advent of directories like LDAP and Active Directory. These directories give you the ability to create groups and then assign users to groups, which typically correspond to a particular role in a business application. An admin would assign a user to a group to give them the appropriate permissions, and everything was done in one console. With RBAC, you can answer questions like: "Is Bob in the `Sales admin` role?" Attribute-Based Access Control (ABAC) The next evolution was attribute-based access control (ABAC) and that's where we started to move away from coarse roles and toward fine-grained access control. In the early 2000s and 2010s, we saw standards like XACML define how to construct fine-grained authorization policies. You could define permissions based on attributes, including user-attributes (e.g. the department the user was in) and resource attributes (e.g.
(e.g., what folder is the user trying to access?), or even environmental attributes (e.g., what is the user's geography? what is the current time and day?). With ABAC, you can answer questions like: "Is Mallory in the `Sales` department? Is the document in the `Sales` folder? And is it currently working hours in the US?"

Relationship-Based Access Control (ReBAC)

Last, but not least, there is the Zanzibar paper and a newer authorization model called relationship-based access control (ReBAC). In this model, you define a set of subjects (typically your users or groups) and a set of objects (such as organizations, directories, folders, or tenants). Then you define whether a particular subject has a relationship with an object: "viewer," "admin," or "editor" would be relationships between a user and a folder object, for example. With ReBAC, you can answer very complex questions by traversing the relationship graph formed by objects, subjects, and relationships.

Two Approaches to Fine-Grained Access Control

Two ecosystems have emerged around the concept of fine-grained access control:

1. "Policy-as-Code": In this paradigm, we express policies as a set of rules written in the Rego language. This is the successor to ABAC, and the popular project in this ecosystem is Open Policy Agent (OPA). OPA is a general-purpose decision engine built for policy-based access management and ABAC. However, it has disadvantages: Rego, the language you write policies in, is a Datalog-derived language with a steep learning curve. It also doesn't help you model application-specific authorization, and because it is truly general-purpose, you have to build everything from rudimentary building blocks. OPA also leaves the hard problem of getting relevant user and resource data to the decision engine as an exercise for the developer implementing the project. Getting data to the decision engine is crucial because the decision has to happen in milliseconds, since it sits in the critical path of every application request. That means you have to solve this distributed systems problem yourself if you really want to build an authorization system on top of OPA. All in all, OPA is a good place to start, but you are going to face some hurdles.

2. "Policy-as-data": The policy isn't stored as a set of rules; rather, it is ingrained in the data structure. The relationship graph itself is a very opinionated model, so you don't have to design the model yourself: you have "subjects," "objects," and "relationships," which gives you a lot of flexibility. If your domain model looks like Google Drive, with folders, files, and users, it's a really good place to start. On the other hand, this is still an immature ecosystem with many competing open-source implementations, and it is difficult to combine with other authorization models like RBAC and ABAC.

The Rise of Policy-Based Access Management

Policy-based access management, as the name suggests, lifts the authorization logic out of the application code and into a policy that is its own artifact. Here's an example of a policy written in Rego (a minimal sketch is shown below, along with the express.js endpoint that consults it). This is really where the principle of least privilege comes into play: you can see that we deny access in the allowed clause until we have enough proof to grant it. The policy returns "allowed = false" unless we have some reason to change that to "allowed = true," and in the body of the allowed rule we check that the user's department is "Operations."
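What follows is a minimal sketch of such a policy and of the express.js endpoint that consults it, not the author's exact code. The package name, the shape of the input document, and the route and helper names (checkJwt, isAllowed) are illustrative assumptions.

Rego

package sample.authz

# Deny by default: "allowed" stays false until a rule below proves otherwise.
default allowed = false

# Grant access only when the caller's department is Operations.
allowed {
    input.user.department == "Operations"
}

Assuming a hypothetical isAllowed helper that sends the identity and resource context to the policy engine and returns its decision, the express.js endpoint might look roughly like this:

JavaScript

const express = require("express");
const app = express();

// checkJwt (e.g., express-jwt) verifies the token and attaches the identity to req.user;
// isAllowed asks the policy engine for a decision. Both modules are hypothetical here.
const checkJwt = require("./middleware/check-jwt");
const isAllowed = require("./authorizer");

// Authorization middleware: let the request through only if the policy allows it.
const authz = async (req, res, next) => {
  const allowed = await isAllowed({
    user: req.user,
    resource: req.params.id,
    action: "read",
  });
  return allowed ? next() : res.status(403).send("Forbidden");
};

app.get("/api/documents/:id", checkJwt, authz, (req, res) => {
  // Only reached when the policy evaluated to allowed = true.
  res.json({ id: req.params.id });
});

app.listen(8080);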
If the user's department is indeed "Operations," the allowed rule evaluates to true and access is granted. Now let's take a quick look at the application code after we've extracted the authorization logic into the policy. The express.js endpoint sketched above is responsible for passing along the user identity, verified by the JWT-checking middleware. If the allowed decision from the policy returns true, the authorization middleware passes the request on to the next function; if not, it returns a 403.

There are many reasons to separate authorization logic from your application code. It gives you a separate artifact that can be stored and versioned exactly the same way we version application code: every policy change becomes part of the git change log, which provides an audit trail for the policy. Additionally, with the authorization logic separated from the application code, we adhere to the principle of separation of duties: the security team can manage the authorization policy while the development team focuses on the application. And once we have the policy artifact, we can build it into an immutable image and sign it to maintain a secure supply chain. Here's a link to an open-source project that does just that: https://github.com/opcr-io/policy.

Real-Time Enforcement Is a Distributed Systems Problem

Real-time access checks are critical for modern authorization. Authorization is a hard problem to solve because, done correctly, it is a distributed systems problem, and distributed systems problems are not trivial.

The first challenge is that our authorization service has to authorize locally, because it sits in the critical path of every single application request: authorization happens for every request that tries to access a protected resource, and it requires 100% availability and millisecond latency. For that to happen, authorization needs to be performed locally.

We want to authorize locally but manage our authorization policies and data globally, and we want to make sure the data we base our authorization decisions on is fresh and updated across all of our local authorizers. For this, we need a control plane that manages our policies, user directory, and resource data, and ensures that every change is distributed to the edge in real time. We also want to aggregate decision logs from all the local authorizers and stream them to our preferred logging system.

Conclusion

Cloud-native authorization is a complex problem that has yet to be entirely solved, and as a result, every cloud application is reinventing the wheel. Based on our conversations, we've identified five anti-patterns and corresponding best practices for application authorization in the age of cloud computing. First, you want your authorization to be fine-grained, using whichever access control pattern fits your application and organization best, whether that is RBAC, ABAC, ReBAC, or a combination thereof. Second, you want to separate concerns and extract your access control logic from the application code into a policy that is handed over to the security team. Third, it is crucial to perform real-time access checks based on fresh user and resource information. Fourth, you want to manage all of your users, resources, policies, and relationships in one place to increase agility. Lastly, you want to collect and store authorization decision logs for compliance.

By Noa Shavit
Why the Cloud Revolution Is Just Getting Started

When cloud computing burst onto the scene in 2006 with the launch of AWS, it would have been hard to imagine how big it would eventually become. More than 15 years later, cloud computing has come a long way. And yet, in my view, it is only just getting started on realizing its true potential.

Why do I think this way? Recently, I came across a Gartner study that contained a couple of mind-boggling facts:

  • More than 85% of organizations will embrace a cloud-first principle by 2025.
  • Over 95% of new digital workloads in 2025 will be deployed on cloud-native platforms, up from 30% in 2021.

Of course, such projections can be off. But when you are talking about 85% of organizations and 95% of all new digital workloads, that is a lot, and even if the figures miss the mark by a few points, they are still huge. My curiosity was piqued, and I started digging into the trends that might fuel this anticipated growth. Naturally, as a software developer, I'm always interested in knowing where the industry is headed, because that's how we can prepare and hope to keep ourselves relevant.

After doing some reading, I have formed an initial idea of the broad trends that are driving cloud computing and will continue to do so in the coming years. Before I share them, I want to make a few points about cloud computing that help put the trends in context. You may or may not agree with them; either way, I'd love it if you share your views in the comments section below.

What Made the Cloud So Popular?

I believe the seeds of the future are sown in the past, and that must hold true for cloud computing as well. So what made the cloud so popular? In my view, cloud computing democratized the ability to build applications on world-class infrastructure. You no longer need to be a multi-billion-dollar organization with an army of engineers to create applications used by millions of people; even a startup working out of a garage can do it.

So what stopped the same thing from happening in the pre-cloud era? For starters, the pre-cloud era could also be called the on-premise era, meaning organizations typically managed their own IT infrastructure and resources. If you wanted to create an application and make it available to the world, you had to purchase, install, and maintain hardware and software in-house. This arrangement had a couple of big technical implications:

  • Management of IT infrastructure such as servers, storage, and networking lay solely on the shoulders of the organization's workforce. I still vividly remember a senior colleague's anecdote from those earlier days: he once had to fix network cables himself when a broken connection brought their application down in the middle of the night and the network vendor was not available.
  • The IT systems could not scale with demand, since organizations were limited by physical resources. If you expected higher demand, you had to go out and buy more resources, and for that you had to be really good at predicting demand, or else you would incur extra costs for no reason.

Of course, when do higher-ups in an organization worry about tech issues unless there is a threat to the company's bottom line? In this case, though, there were threats. For starters, on-premise computing is a costly business: you need a significant up-front investment in hardware and software to build a data center.
Initially, big companies loved this situation; it acted as a huge barrier to entry for smaller players. However, once the genie was out of the bottle and cloud offerings came on the scene, the huge cost associated with on-premise computing became a liability. Suddenly, the army of engineers hired just to keep the infrastructure running started to look like expensive overhead. In more disruptive industries, startups with a skeleton crew of software engineers were leap-frogging established players by using the early cloud tools to drive faster innovation and reduce time-to-market. This meant a loss of market share and growth opportunities for the big companies. Of course, none of this happened in a single day, month, or even year, but slowly and steadily, large organizations also started to steer their ships toward cloud computing. Once that happened, there was no looking back.

Evolution of Cloud Computing

It was like discovering an untapped oil field right next to your front door.

So, Where Are Things Headed?

Predicting the direction of a particular technology can be a fool's errand; in 2006, not even the creators of AWS would have predicted the kind of growth they have seen. But it is still important to make an educated guess so that we are better prepared for what's coming in the next few years. Here are a few broad trends I'm tracking:

Hybrid Cloud Adoption

This one's a biggie, as it is driven mostly by the large organizations that run the world: big banks, government agencies, and mega-corporations. The trend is largely driven by growing regulatory and legal requirements around data and rising privacy concerns across the world. In a hybrid cloud setup, companies keep a mix of capabilities across external cloud platforms and in-house environments. The idea is to use the public platforms for new, innovative products while keeping core business capabilities or data in in-house data centers so that they don't run afoul of government regulations. Since it involves big money, I feel hybrid cloud adoption is only going to grow. Big cloud providers are already rolling out products to support this vision: Red Hat has offered its flagship OpenShift platform as an on-premise solution for many years; Microsoft has launched Azure Arc to cater to hybrid and multi-cloud requirements; and Google has launched Anthos, a platform that promises a single, consistent way of managing Kubernetes workloads across public and on-premises clouds.

Multi-Cloud Adoption

"Don't put all your eggs in one basket." Organizations are increasingly taking this adage to heart and exploring a multi-cloud approach. A large part of multi-cloud adoption is driven by risk mitigation. For example, the fintech organization Form3 was compelled to go multi-cloud when regulators questioned the portability of its platform in case AWS went down. Some of the shift, however, is also a result of increased competition and broader service offerings from different cloud vendors. Even beyond the Big 3, there are dozens of other cloud providers offering all manner of services to lure customers with cost or features. Organizations are spoiled for choice and are trying to get the best ROI for every piece of their infrastructure. I feel this trend is going to accelerate. The difficulties of managing a multi-cloud setup could have curtailed this movement.
However, instead of getting bogged down by those difficulties, the demand for multi-cloud and hybrid-cloud setups has spurred a number of new trends, such as the rise of infrastructure-as-code tools and the concept of platform engineering. I will discuss these in upcoming posts.

Serverless Computing

One of the main factors that worked in favor of the cloud in the early days was cost savings. The idea that you could launch a product with close-to-zero costs was hard to beat and created a tremendous rush toward cloud adoption. In my view, serverless has the potential to make even traditional cloud offerings look costly. Though a few years have passed since the major cloud providers launched serverless options, I feel we are only at the beginning of the serverless revolution. Because serverless computing lets companies run code without provisioning or managing servers, it is extremely attractive for organizations that want to save on costs and move faster. With the tremendous rise in the number of SaaS startups and an inflationary environment with rising interest rates, the cost of running your system is a big issue. Organizations are looking to achieve product-market fit without burning through too much cash, and serverless computing looks like a good deal with its pay-per-use model and little to no maintenance expenditure.

AI-Driven Cloud

Apart from cloud computing, the last decade or so has also seen another major trend spread like wildfire: the rise of machine learning and artificial intelligence. As AI seeps into more and more areas and supports real requirements, it is already promising to augment cloud services in interesting ways. For example, AI-driven cloud services could make autonomous decisions about when to scale up or down based on a learned understanding of demand rather than fixed rules. Again, this boils down to monetary benefits and the promise of better cost utilization. Of course, we can only hope that none of these services turns into Skynet any time soon! Either way, I'll be keeping a close eye on this developing trend.

Containers as a Service

Containers on the cloud started off quite early, with Amazon launching ECS. Of course, managing a bunch of containers isn't the easiest thing out there, but the surging popularity of Kubernetes has changed the landscape of container orchestration, and in no time virtually every major and minor cloud provider was offering a managed Kubernetes service. This is one area where both big and small organizations are lapping up the opportunity: everyone wants to reap the benefits of containerization without taking on the headache of managing containers themselves. As developers, it is definitely important to keep abreast of this trend.

That's It for Now!

In the end, I feel we are living in interesting times when it comes to cloud computing. The technology is at the right level of maturity: mainstream enough to support a large base of innovation, yet not so settled that things have become boring and static. To top it off, cloud computing is also fueling other trends in areas such as microservices architecture, DevOps, infrastructure-as-code, and platform engineering.

By Saurabh Dashora CORE

Top Cloud Architecture Experts


Boris Zaikin

Senior Software Cloud Architect,
Nordcloud GmbH

Certified Software and Cloud Architect Expert who is passionate about building solutions and architectures that solve complex problems and bring value to the business. He has solid experience designing and developing complex solutions on the Azure, Google, and AWS clouds, and expertise in building distributed systems and frameworks based on Kubernetes, Azure Service Fabric, and similar technologies. His solutions run successfully in domains such as green energy, fintech, aerospace, and mixed reality. His areas of interest include enterprise cloud solutions, edge computing, high-load web APIs and applications, multi-tenant distributed systems, and Internet-of-Things solutions.

Ranga Karanam

Best Selling Instructor on Udemy with 1 MILLION Students,
in28Minutes.com

Ranga Karanam is a best-selling Udemy instructor and hands-on architect with more than 15 years of programming experience. He is the author of the book "Mastering Spring 5.0". He created in28Minutes to build great courses, runs a successful YouTube channel with more than 80K subscribers, and teaches Udemy courses with more than 300K students.

Samir Behara

Senior Cloud Infrastructure Architect,
AWS

Samir Behara builds software solutions using cutting-edge technologies. He is a Microsoft Data Platform MVP with over 15 years of IT experience. Samir is a frequent speaker at technical conferences and the co-chapter lead of the Steel City SQL Server User Group. He writes at www.samirbehara.com.

Pratik Prakash

Master Software Engineer (SDE-IV),
Capital One

Pratik is an experienced solution architect and a passionate engineer with hands-on multi-cloud and data science expertise. He is an open-source advocate and enjoys community participation around Java, Kotlin, and the cloud-native world. He has worked at UnitedHealth Group as a senior staff engineer and at Fidelity International as an expert engineer, where he led engineering teams through cloud-native application modernization.

The Latest Cloud Architecture Topics

Cloud Modernization Strategies for Coexistence with Monolithic and Multi-Domain Systems: A Target Rollout Approach
This article discusses a strategy for modernizing complex domains with monolithic databases through four phases.
March 21, 2023
by Neeraj Kaushik
· 539 Views · 3 Likes
Introduction to Spring Cloud Kubernetes
In this article, we will explore the various features of Spring Cloud Kubernetes, its benefits, and how it works.
March 21, 2023
by Aditya Bhuyan
· 1,401 Views · 1 Like
Integrate AWS Secrets Manager in Spring Boot Application
A guide for integration of AWS Secrets Manager in Spring Boot. This service will load the secrets at runtime and keep the sensitive information away from the code.
March 21, 2023
by Aakash Jangid
· 2,345 Views · 1 Like
Old School or Still Cool? Top Reasons To Choose ETL Over ELT
In this article, readers will learn about use cases where ETL (extract, transform, load) is a better choice in comparison to ELT (extract, load, transform).
March 20, 2023
by Hiren Dhaduk
· 1,232 Views · 1 Like
Strategies for Kubernetes Cluster Administrators: Understanding Pod Scheduling
This guide will equip you with the knowledge and skills necessary to master the art of pod scheduling.
March 20, 2023
by shishir khandelwal
· 1,687 Views · 1 Like
Use AWS Controllers for Kubernetes To Deploy a Serverless Data Processing Solution With SQS, Lambda, and DynamoDB
Discover how to use AWS Controllers for Kubernetes to create a Lambda function, SQS, and DynamoDB table and wire them together to deploy a solution.
March 20, 2023
by Abhishek Gupta CORE
· 1,809 Views · 1 Like
Spring Cloud
In this article, we will discuss the various modules of Spring Cloud and how they can be used to build cloud-native applications.
March 20, 2023
by Aditya Bhuyan
· 1,646 Views · 2 Likes
Mission-Critical Cloud Modernization: Managing Coexistence With One-Way Data Sync
Implementing a one-way data sync with AWS Data Migration Service: Challenges and coexistence strategy during complex cloud modernization with the phased rollout.
March 20, 2023
by Neeraj Kaushik
· 3,680 Views · 3 Likes
AWS CodeCommit and GitKraken Basics: Essential Skills for Every Developer
AWS CodeCommit can be easily integrated with GitKraken GUI to streamline developer workflow. This enables efficient code management, version control, and more.
March 19, 2023
by Mohamed Fayaz
· 2,807 Views · 1 Like
Seamless Integration of Azure Functions With SQL Server: A Developer's Perspective
Explore this article that provides a practical guide to integrating Azure Functions with SQL Server using C#.
March 17, 2023
by Naga Santhosh Reddy Vootukuri
· 3,358 Views · 1 Like
Fargate vs. Lambda: The Battle of the Future
Fargate and Lambda are two popular serverless computing options available within the AWS ecosystem. This blog aims to take a deeper look into the battle.
March 17, 2023
by William Talluri
· 3,384 Views · 1 Like
What To Know Before Implementing IIoT
Industrial Internet of Things (IIoT) technology offers many benefits for manufacturers to improve operations. Here’s what to consider before implementation.
March 17, 2023
by Zac Amos
· 2,917 Views · 1 Like
Apache Kafka Is NOT Real Real-Time Data Streaming!
Learn how Apache Kafka enables low latency real-time use cases in milliseconds, but not in microseconds; learn from stock exchange use cases at NASDAQ.
March 17, 2023
by Kai Wähner CORE
· 2,893 Views · 1 Like
AWS IP Address Management
Readers will learn about AWS IPAM, its advanced benefits, granular access controls, and how it can help improve security and optimize IP address utilization.
March 16, 2023
by Rahul Nagpure
· 2,421 Views · 1 Like
Multi-Cloud Integration
In this post, take a look at the top integration services that AWS, Google Cloud, and Azure provide as well as the benefits and drawbacks of each service.
March 16, 2023
by Boris Zaikin CORE
· 3,803 Views · 1 Like
Container Security: Don't Let Your Guard Down
To comprehend the security implications of a containerized environment, it is crucial to understand the fundamental elements of a container deployment network.
March 16, 2023
by Akanksha Pathak
· 4,209 Views · 2 Likes
Differentiate With Google Cloud Cortex Framework
The Cortex Framework for SAP is a robust framework that helps businesses streamline data integrations and facilitate advanced analytics-driven decision-making.
March 15, 2023
by Kamal Bhargava
· 2,131 Views · 1 Like
How To Handle Secrets in Docker
DevOps engineers must handle secrets with care. In this series, we summarize best practices for leveraging secrets with your everyday tools, including code.
March 15, 2023
by Keshav Malik
· 3,953 Views · 1 Like
Building a RESTful API With AWS Lambda and Express
Build a RESTful API using AWS Lambda and Express.js with this quick and easy-to-follow tutorial. Deploy your RESTful API with confidence using Node.js.
March 15, 2023
by Preet Kaur
· 3,668 Views · 2 Likes
All the Cloud’s a Stage and All the WebAssembly Modules Merely Actors
In this post, we’ll take a look at the notion of actors creating more actors and how wasmCloud accomplishes the same goals but without manual supervision tree management.
March 15, 2023
by Kevin Hoffman
· 2,757 Views · 1 Like
