[This article by Daniel Bryant comes to you from the DZone Guide to Cloud Development - 2015 Edition.]
Everyone is talking about moving to the cloud these days, regardless of whether the business actually focuses on software development. We software creators are constantly being challenged by the leading “cloud native” companies, such as Netflix and Amazon, to take better advantage of the unique environment provided by the cloud. Whether an application is being migrated to the cloud (a “lift and shift”), built as a greenfield project, or developed as a hybrid of migration and enhancement, cloud developers face a common set of challenges.
This article aims to accomplish two things: first, to identify and discuss these challenges; and second, to offer advice on how to overcome them.
Flying High (and Low) in the Cloud - Changing the Game
If done correctly, cloud-based deployments are a game-changer, and can enable extremely rapid deployment of software that can be scaled elastically to meet demand. However, this flexibility is countered by several constraints placed on the architecture and design of software. Common problems experienced when building software for the cloud can be broadly grouped into the following categories:
- Difficulty in communicating the new design and deployment topology that results from embracing the cloud
- Problems with creating an architecture that works in harmony with (and leverages) cloud infrastructure
- Lack of testing in a cloud-based environment
- Lack of understanding of the underlying cloud fabric and its properties
- Limited strategy for application and platform monitoring
- Difficulty in designing for bizarre (and partial) cloud infrastructure failure modes
The remainder of this article will discuss these issues in greater depth, and also provide strategies for overcoming or mitigating the accompanying challenges.
Core Design Principles - Inspiration from Cosmic Law and Order
This article introduces the cloud DHARMA development principles, which have been designed to be used in much the same way as the SOLID software development principles. The word dharma can be found in Hindu and Buddhist texts, and signifies behaviors that are considered to be in accord with the order that makes life and the universe possible. Although this underlying definition may also be relevant, the word has been chosen primarily as a mnemonic to represent the following cloud development guidelines:
- Documented (just enough)
- Highly cohesive/loosely coupled (all the way down)
- Automated from commit to cloud
- Resource aware
- Monitored thoroughly
- Antifragile
Documentation - Not Just for Waterfall Projects
Documentation is often shunned in software projects that are run using an agile methodology, despite many prominent members of the community, including Simon Brown and Bob Martin, constantly attempting to reiterate the benefits of “just enough” documentation. Whether a greenfield software project is being designed for a cloud-native deployment, or an existing application is being “lifted and shifted,” the current trend is to split large code bases into smaller microservices. This, combined with the unique opportunities and challenges presented by a cloud environment, should convince architects, developers, and operators that lightweight documentation is essential in order to communicate our design and intentions.
The key purpose of documentation in a cloud-based project is to provide a map of the application and deployment territory. Architectural views should be captured in diagrams, especially if the application has been adapted architecturally to take advantage of cloud features, such as elastic scaling or multi-region failover, as these features necessarily impact the runtime operation of a system. Simon Brown’s C4 model of architecture provides a great start in learning how to create such diagrams.
The creation of a physical infrastructure diagram is also essential. This should highlight operational features that are important to the ops, dev, and QA teams. For example, private subnets, firewalls, load-balancers, replicated data stores, and service clustering all have a large impact on the design and configuration of software.
Other essential documentation for a cloud application includes a list of software components, their purpose (and contract), and initialization instructions. Any discrepancies between the approach used to run services in a local development environment in comparison with the production cloud stack should be clearly noted. Brief documented highlights of service state and caching are also often useful, and provide essential information during configuration and debugging sessions. All of this information can typically be summarized in a code repository README file, and suggested section headings for this documentation can be found below:
- Component description and responsibilities
- Component initialization instructions
- Profiles available (i.e., modes of operation)
- Component external interactions (i.e., collaborations)
- State characteristics (e.g., stateless, mandatory sticky sessions)
- Data store and cache interactions (in/out of process, eviction policy, etc.)
- API documentation (e.g. Swagger, Thrift IDL, etc.)
- Failure modes of component
- Developer highlights (i.e. classes of interest)
- Decision log (core architectural changes)
Highly Cohesive/Loosely Coupled - Good Architecture All the Way Down
Creating an appropriate architecture that supports the functioning and evolution of a software application is essential regardless of the deployment environment, but is especially important when building software that will take advantage of the flexible (but volatile) cloud fabric. Anyone performing a “lift and shift” will potentially have fewer options for establishing a good architecture, but an analysis should at least be performed, and any areas of the system that may cause friction with a cloud environment should be noted. For example, the volatile nature of the cloud means that services that are highly coupled to a dedicated IP address can cause problems, as can services that are not capable of being clustered or surviving a restart.
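One common way to reduce coupling to a dedicated IP address is to resolve collaborators from configuration at start-up instead of hard-coding them. The following is a minimal sketch of that idea, not a prescribed implementation: the `<PREFIX>_HOST` and `<PREFIX>_PORT` variable naming scheme, the `ServiceEndpoint` type, and the `load_endpoint` helper are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceEndpoint:
    """Where a collaborating service can be reached (hypothetical type)."""
    host: str
    port: int


def load_endpoint(prefix, env=None):
    """Read <PREFIX>_HOST and <PREFIX>_PORT from the environment.

    The variable naming scheme is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    host = env.get(f"{prefix}_HOST")
    port = env.get(f"{prefix}_PORT")
    if not host or not port:
        # Fail fast at start-up rather than on the first request.
        raise RuntimeError(f"missing {prefix}_HOST or {prefix}_PORT")
    return ServiceEndpoint(host=host, port=int(port))
```

A service would call something like `load_endpoint("ORDERS_DB")` once at start-up. Supplying a DNS name rather than a raw IP address as the host value means a replaced instance can be picked up without a code change.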
A good architecture is evident “all the way down” a software system, and in a cloud application this will most likely include the public-facing API, services (and internal APIs), components, and the code itself. The core measures of cohesion and coupling can be applied at every level. It is difficult to escape the current trend toward implementing a system by decomposing functionality into small services, and architectural guidelines such as those presented at 12factor.net and the book Building Microservices are well worth understanding. Martin Abbott and Michael Fisher’s Scalability Rules is also essential reading for any cloud developer.
Automated From Commit to Cloud - The Pipeline to Production
The creation of a build pipeline that supports continuous delivery is beneficial for many software projects. This is especially important when developing applications for the cloud, as typically developers will create software using an environment that has a radically different hardware and operating system configuration than the production stack. The goal of a build pipeline is to take code committed to a version control system such as Git, create the corresponding build artifacts and provision the required infrastructure, and exercise (or “torture”) the resulting applications. Only artifacts that survive their journey through the complete build pipeline are flagged as being suitable for production deployment. A new version of the software emerging from the pipeline should ideally add value to end-users, and must definitely not introduce any regressions in existing functional or cross-functional requirements. Jez Humble and Dave Farley’s Continuous Delivery book is an essential reference for learning more about the benefits and implementation of a build pipeline.
A build pipeline for a cloud project must push the modified application into a cloud environment as soon as possible. This cannot be emphasised enough! It is not uncommon for developers to be creating software on an ultra-powerful laptop running Mac OS X, but the corresponding production infrastructure is a micro-powered cloud instance running Debian. It should go without saying that performance (and potentially even functionality) will be very different across these two environments. Accordingly, running automated acceptance and performance tests against a cloud-based deployment of the application is essential.
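As an illustration of the kind of performance gate a pipeline stage might run against the cloud deployment, the sketch below times repeated calls and checks a percentile against a latency budget. The helper names, the 95th percentile, and the 0.5 second budget are assumptions chosen for the example; `call` stands in for whatever request the acceptance test actually issues.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies(call, runs=20, clock=time.perf_counter):
    """Invoke `call` `runs` times and return each duration in clock units."""
    durations = []
    for _ in range(runs):
        start = clock()
        call()
        durations.append(clock() - start)
    return durations


def within_budget(durations, pct=95, budget_seconds=0.5):
    """True when the chosen percentile is at or under the budget."""
    return percentile(durations, pct) <= budget_seconds
```

A pipeline step could fail the build when `within_budget(measure_latencies(issue_request))` is false, where `issue_request` is a hypothetical function that calls the deployed application. Running the same check on a developer laptop and on the target instance type makes the difference between the two environments visible early.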
Cloud infrastructure and compute resources are often defined with provisioning/configuration management utilities (such as Chef, Ansible, Puppet), and the resulting artifacts may be packaged as images or containers (with Packer or Docker). This infrastructure as code should also be tested as part of the build pipeline. Vendor-specific testing tools, such as Chef’s Foodcritic and Puppet’s rspec-puppet, are the operational code equivalent of unit and integration testing, and tools such as ServerSpec can be used for acceptance testing infrastructure deployed into a cloud environment. It is also beneficial at the infrastructural testing phase of the build pipeline to simulate failure modes that can be experienced in a heavily virtualized and networked environment.
Resource Aware - Noisy Neighbors and Mechanical Sympathy
A core change when moving to develop software for the cloud is the fact that practically everything within this environment is virtualized, and the vast majority of communication occurs over a network. At first glance this may appear to be an obvious statement, but developers are often used to creating software on computing infrastructure that is not multi-tenant. When software is running on compute resources that are not shared, there are no “noisy neighbours” attempting to compete for contended physical resources, such as the CPU, memory, and disk access. In addition, traditional applications typically communicate with a database over dedicated low-latency network channels, but in the cloud this channel may be shared by many other services. In high availability configurations, it is not uncommon for a database to be running in a different data center than the machine making a request (although this should be avoided), and this only adds to the communication overhead.
Werner Vogels, CTO of Amazon, is famous for stating that “everything fails all the time in the cloud.” The flexibility and cost benefits of using virtualized commodity hardware within a public cloud have a clear trade-off: every infrastructure resource must be treated as ephemeral and volatile. The challenges introduced by using cloud fabric must be countered by cultivating “mechanical sympathy,” or put another way, developing an understanding of the hardware fabric onto which you are deploying applications. Key skills that every developer, QA specialist, and operator must develop when deploying applications to the cloud include:
- Deep understanding of virtualization—Hypervisors, steal time and resource contention
- Good comprehension of computer networking—TCP/IP, DNS, and the OSI model
- Good knowledge of caching—Reverse proxies, distributed caches, and CDNs
- Expert Linux skills—Including diagnostic tools like top, vmstat and tcpdump
Monitored Thoroughly - If It Moves, Graph It...
Potentially the biggest operational challenge with moving an application to the cloud is the new monitoring infrastructure required. Although a good monitoring strategy and implementation is beneficial in any environment, it is essential with a volatile, multi-tenant and virtualised fabric such as that offered by a public cloud. Basic operational details about each compute instance should be monitored, for example utilizing collectd or Munin to collect CPU usage, memory statistics, and disk performance. This data can then be shipped to time-series datastores such as InfluxDB or OpenTSDB, graphed by tools such as Grafana or Cacti, and used for on-call alerting by applications such as Nagios or Zabbix. Data stores and middleware should also be monitored, and modern applications typically expose these metrics out of the box via a series of mechanisms. For example, MySQL exposes a proprietary API, Solr exposes statistics via JMX, and RabbitMQ exposes data via an HTTP interface.
All software applications running on these compute resources should also be monitored, whether they have been developed in-house or not. Key metrics for an application can be exposed using frameworks such as Codahale’s Metrics or StatsD. These frameworks allow developers to specify status flags, counters, and gauges in code, which can emit information such as the current service health status, transaction throughput, or cache statistics. If you are operating in a microservice environment, then distributed tracing is also essential in order to learn how each ingress request is handled as it passes through your application stack. Tools such as Twitter’s Zipkin or AppDynamics Application Performance Monitor are prime candidates for this. Finally, centralized logging is also essential in order to avoid having to manually SSH into multiple locations, and the ELK stack of Elasticsearch, Logstash and Kibana is becoming a de facto standard for achieving this goal (and can easily be experienced via a Vagrant box).
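To show how little code is needed to emit counters and gauges, here is a minimal sketch of a StatsD-style client. It relies on the plain-text StatsD line format (`name:value|type`, with `c` for counters, `g` for gauges and `ms` for timers) sent over UDP, conventionally to port 8125. The class itself, the default address, and the metric names in the usage note are assumptions for illustration; a production service would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line: '<name>:<value>|<type>'."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Fire-and-forget UDP sender (hypothetical minimal client)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._address = (host, port)
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, line):
        try:
            self._socket.sendto(line.encode("ascii"), self._address)
        except OSError:
            # Metrics must never take the application down with them.
            pass

    def increment(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def gauge(self, name, value):
        self._send(format_metric(name, value, "g"))

    def timing(self, name, milliseconds):
        self._send(format_metric(name, milliseconds, "ms"))
```

A service might call `client.increment("orders.completed")` after each transaction and `client.gauge("cache.entries", size)` on a timer. Because UDP is connectionless, an unavailable metrics collector does not block or fail the request path.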
Antifragile - Robust is Not the Opposite of Fragile
The fabric of the cloud provides a set of unique challenges in terms of increased contention, volatility, and transience. Software must clearly be designed to handle the “fragile” nature of the underlying infrastructure. The initial design of a system to counter fragility often leads to the creation of a robust system, which can typically withstand heavy load and recover from failure. However, the methods by which a robust system achieves these properties may not be optimal for the post-Web 2.0 generation of users, who expect applications to be constantly available and highly responsive.
A robust web server that simply refuses any connection over a preset maximum limit will provide a good experience for those already connected, but what about the additional users who can’t connect? The same can be said of an e-commerce site where failure in the recommendation component automatically shuts down the entire site. A user will never receive an incorrect recommendation, but is this really the response to the problem that the business would expect (or want)?
Applications must go further than being robust; they must be “antifragile.” As a foundation for antifragility, cloud-native software must be created using fault-tolerant design patterns such as timeouts, retries, bulkheads, and circuit breakers. Michael Nygard’s book “Release It!” provides a great overview of all of these concepts.
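The circuit breaker is perhaps the easiest of these patterns to sketch. The simplified version below opens after a number of consecutive failures, rejects calls while open, and allows a single trial call once a reset timeout has passed. It is an illustration of the pattern rather than the implementation from “Release It!” or any particular library; the thresholds, class names, and the `call_with_fallback` helper are assumptions, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed.

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open (or re-open after a failed trial) and restart the timer.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_fallback(breaker, func, fallback):
    """Return func() via the breaker, or `fallback` if it fails or is rejected."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

Applied to the e-commerce example, the recommendation lookup could be wrapped as `call_with_fallback(breaker, fetch_recommendations, [])`, where `fetch_recommendations` is a hypothetical function. The page then renders without recommendations instead of failing entirely.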
The elastic nature of compute resources truly allows antifragile behavior. Software can be designed to take advantage of this elasticity by rapidly scaling compute power to meet increased demand or reduce costs. The vast majority of cloud vendor APIs and SDKs make this operation trivial, and many vendors provide an automated approach to handling this common use case. Additional tools for creating an antifragile application include using message queues, such as RabbitMQ or Kafka, which enable asynchronous communication and the buffering of events, and the introduction of eventual consistency into datastores, for example using Cassandra or Riak.
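As a rough illustration of the scaling decision itself, the sketch below applies a simple proportional rule: scale the instance count by the ratio of observed load to target load, then clamp the result to configured bounds. This is a generic rule assumed for the example, not any specific vendor's algorithm, and the function name and default bounds are hypothetical. Managed auto-scaling services normally make this calculation for you.

```python
import math


def desired_instances(current, observed_load, target_load, minimum=1, maximum=10):
    """Proportional scaling rule, clamped to [minimum, maximum].

    `observed_load` and `target_load` are per-instance averages in the same
    unit, for example CPU percent or requests per second.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    proposed = math.ceil(current * observed_load / target_load)
    return max(minimum, min(maximum, proposed))
```

With four instances averaging 90% CPU against a 60% target, the rule proposes six instances; at 30% it proposes two, which is how elasticity can reduce cost as well as absorb demand.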
As a first step towards understanding the principle of antifragility, it is well worth consulting the classic texts on distributed computing principles, such as the eight fallacies of distributed computing and notes on distributed systems for young bloods. Netflix is the poster child of antifragility within software, and every cloud developer should take a tour of their public GitHub account, if only to take inspiration from the plethora of solutions provided.
Summary - The DHARMA Checklist
This article has attempted to identify the common challenges experienced when developing and deploying software applications to the cloud. The cloud DHARMA principles are designed to act as a checklist when designing and implementing software that will be deployed onto a cloud environment. The principles state that attention must be focused on documenting architecture and deployment topologies, utilizing good software design principles “all the way down” the stack, creating a comprehensive build pipeline, increasing awareness of the underlying fabric of the cloud, implementing a comprehensive monitoring strategy, and building antifragile systems. Although there are many challenges encountered when developing applications for the cloud, the benefits make this an attractive environment for many software applications. Simply remember that the cloud is a completely different animal than a traditional datacenter.