Software design and architecture focus on the development decisions made to improve a system's overall structure and behavior in order to achieve essential qualities such as modifiability, availability, and security. The Zones in this category are available to help developers stay up to date on the latest software design and architecture trends and techniques.
Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!
Containers allow applications to run quicker across many different development environments, and a single container encapsulates everything needed to run an application. Container technologies have exploded in popularity in recent years, leading to diverse use cases as well as new and unexpected challenges. This Zone offers insights into how teams can solve these challenges through its coverage of container performance, Kubernetes, testing, container orchestration, microservices usage to build and deploy containers, and more.
Integration refers to the process of combining software parts (or subsystems) into one system. An integration framework is a lightweight utility that provides libraries and standardized methods to coordinate messaging among different technologies. As software connects the world in increasingly complex ways, integration makes it all possible by facilitating app-to-app communication. Learn more about this necessity for modern software development by keeping a pulse on industry topics such as integrated development environments, API best practices, service-oriented architecture, enterprise service buses, communication architectures, integration testing, and more.
A microservices architecture is a development method for designing applications as modular services that seamlessly adapt to a highly scalable and dynamic environment. Microservices help solve complex issues such as speed and scalability, while also supporting continuous testing and delivery. This Zone will take you through breaking down the monolith step by step and designing a microservices architecture from scratch. Stay up to date on the industry's changes with topics such as container deployment, architectural design patterns, event-driven architecture, service meshes, and more.
Performance refers to how well an application conducts itself compared to an expected level of service. Today's environments are increasingly complex and typically involve loosely coupled architectures, making it difficult to pinpoint bottlenecks in your system. Whatever your performance troubles, this Zone has you covered with everything from root cause analysis, application monitoring, and log management to anomaly detection, observability, and performance testing.
The topic of security covers many different facets within the SDLC. From focusing on secure application design to designing systems to protect computers, data, and networks against potential attacks, it is clear that security should be top of mind for all developers. This Zone provides the latest information on application vulnerabilities, how to incorporate security earlier in your SDLC practices, data governance, and more.
Cloud Native
Cloud native has been deeply entrenched in organizations for years now, yet it remains an evolving and innovative solution across the software development industry. Organizations rely on a cloud-centric state of development that allows their applications to remain resilient and scalable in this ever-changing landscape. Amidst market concerns, tool sprawl, and the increased need for cost optimization, there are few conversations more important today than those around cloud-native efficacy at organizations.

Google Cloud breaks down "cloud native" into a set of primary pillars: containers and orchestration, microservices, DevOps, and CI/CD. For DZone's 2024 Cloud Native Trend Report, we further explored these pillars, focusing our research on learning how nuanced technology and methodologies are driving the vision for what cloud native means and entails today. The articles, contributed by experts in the DZone Community, bring the pillars into conversation via topics such as automating the cloud through orchestration and AI, using shift left to improve delivery and strengthen security, surviving observability challenges, and strategizing cost optimizations.
High-Load Systems: Overcoming Challenges in Social Network Development
Efficient data synchronization is crucial in high-performance computing and multi-threaded applications. This article explores an optimization technique for scenarios where frequent writes to a container occur in a multi-threaded environment. We'll examine the challenges of traditional synchronization methods and present an advanced approach that significantly improves performance for write-heavy environments. The method in question is beneficial because it is easy to implement and versatile, unlike pre-optimized containers that may be platform-specific, require special data types, or bring additional library dependencies.

Traditional Approaches and Their Limitations

Imagine a scenario where we have a cache of user transactions:

C++
struct TransactionData
{
    long transactionId;
    long userId;
    unsigned long date;
    double amount;
    int type;
    std::string description;
};

std::map<long, std::vector<TransactionData>> transactionCache; // key - userId

In a multi-threaded environment, we need to synchronize access to this cache. The traditional approach might involve using a mutex:

C++
class SimpleSynchronizedCache
{
public:
    void write(const TransactionData& transaction)
    {
        std::lock_guard<std::mutex> lock(cacheMutex);
        transactionCache[transaction.userId].push_back(transaction);
    }

    std::vector<TransactionData> read(const long& userId)
    {
        std::lock_guard<std::mutex> lock(cacheMutex);
        try
        {
            return transactionCache.at(userId);
        }
        catch (const std::out_of_range& ex)
        {
            return std::vector<TransactionData>();
        }
    }

    std::vector<TransactionData> pop(const long& userId)
    {
        std::lock_guard<std::mutex> lock(cacheMutex);
        auto userNode = transactionCache.extract(userId);
        return userNode.empty() ? std::vector<TransactionData>() : std::move(userNode.mapped());
    }

private:
    std::map<long, std::vector<TransactionData>> transactionCache;
    std::mutex cacheMutex;
};

As system load increases, especially with frequent reads, we might consider using a shared_mutex:

C++
class CacheWithSharedMutex
{
public:
    void write(const TransactionData& transaction)
    {
        std::lock_guard<std::shared_mutex> lock(cacheMutex);
        transactionCache[transaction.userId].push_back(transaction);
    }

    std::vector<TransactionData> read(const long& userId)
    {
        std::shared_lock<std::shared_mutex> lock(cacheMutex);
        try
        {
            return transactionCache.at(userId);
        }
        catch (const std::out_of_range& ex)
        {
            return std::vector<TransactionData>();
        }
    }

    std::vector<TransactionData> pop(const long& userId)
    {
        std::lock_guard<std::shared_mutex> lock(cacheMutex);
        auto userNode = transactionCache.extract(userId);
        return userNode.empty() ? std::vector<TransactionData>() : std::move(userNode.mapped());
    }

private:
    std::map<long, std::vector<TransactionData>> transactionCache;
    std::shared_mutex cacheMutex;
};

However, when the load is primarily generated by writes rather than reads, the advantage of a shared_mutex over a regular mutex becomes minimal. The lock will often be acquired exclusively, negating the benefits of shared access. Moreover, let's imagine that we don't use read() at all — instead, we frequently write incoming transactions and periodically flush the accumulated transaction vectors using pop(). As pop() involves reading with extraction, both write() and pop() operations would modify the cache, necessitating exclusive access rather than shared access.
Thus, the shared_lock becomes entirely useless in terms of optimization over a regular mutex, and maybe even performs worse — its more intricate implementation is now used for the same exclusive locks that a faster regular mutex provides. Clearly, we need something else.

Optimizing Synchronization With the Sharding Approach

Given the following conditions:
1. A multi-threaded environment with a shared container
2. Frequent modification of the container from different threads
3. Objects in the container can be divided for parallel processing by some member variable

Regarding point 3, in our cache, transactions from different users can be processed independently. While creating a mutex for each user might seem ideal, it would lead to excessive overhead in maintaining so many locks. Instead, we can divide our cache into a fixed number of chunks based on the user ID, in a process known as sharding. This approach reduces the overhead and yet allows parallel processing, thereby optimizing performance in a multi-threaded environment.

C++
class ShardedCache
{
public:
    ShardedCache(size_t shardSize) :
        _shardSize(shardSize),
        _transactionCaches(shardSize)
    {
        std::generate(
            _transactionCaches.begin(),
            _transactionCaches.end(),
            []() { return std::make_unique<SimpleSynchronizedCache>(); });
    }

    void write(const TransactionData& transaction)
    {
        // route each transaction to its shard by user ID
        _transactionCaches[transaction.userId % _shardSize]->write(transaction);
    }

    std::vector<TransactionData> read(const long& userId)
    {
        return _transactionCaches[userId % _shardSize]->read(userId);
    }

    std::vector<TransactionData> pop(const long& userId)
    {
        return _transactionCaches[userId % _shardSize]->pop(userId);
    }

private:
    const size_t _shardSize;
    std::vector<std::unique_ptr<SimpleSynchronizedCache>> _transactionCaches;
};

This approach allows for finer-grained locking without the overhead of maintaining an excessive number of mutexes. The division can be adjusted based on system architecture specifics, such as the size of a thread pool that works with the cache, or the hardware concurrency. Let's run tests where we check how sharding accelerates cache performance by testing different partition sizes.

Performance Comparison

In these tests, we aim to do more than just measure the maximum number of operations the processor can handle. We want to observe how the cache behaves under conditions that closely resemble real-world scenarios, where transactions occur randomly. Our optimization goal is to minimize the processing time for these transactions, which enhances system responsiveness in practical applications. The implementation and tests are available in the GitHub repository.

C++
#include <thread>
#include <functional>
#include <condition_variable>
#include <random>
#include <chrono>
#include <iostream>
#include <fstream>
#include <array>

#include "SynchronizedContainers.h"

const auto hardware_concurrency = (size_t)std::thread::hardware_concurrency();

class TaskPool
{
public:
    template <typename Callable>
    TaskPool(size_t poolSize, Callable task)
    {
        for (size_t i = 0; i < poolSize; ++i)
        {
            _workers.emplace_back(task);
        }
    }

    ~TaskPool()
    {
        for (auto& worker : _workers)
        {
            if (worker.joinable())
                worker.join();
        }
    }

private:
    std::vector<std::thread> _workers;
};

template <typename CacheImpl>
class Test
{
public:
    template <typename ... CacheArgs>
    Test(const int testrunsNum, const size_t writeWorkersNum,
        const size_t popWorkersNum, const std::string& resultsFile,
        CacheArgs&& ... cacheArgs) :
        _cache(std::forward<CacheArgs>(cacheArgs)...),
        _writeWorkersNum(writeWorkersNum),
        _popWorkersNum(popWorkersNum),
        _resultsFile(resultsFile),
        _testrunsNum(testrunsNum),
        _testStarted(false)
    {
        std::random_device rd;
        _randomGenerator = std::mt19937(rd());
    }

    void run()
    {
        for (auto i = 0; i < _testrunsNum; ++i)
        {
            runSingleTest();
            logResults();
        }
    }

private:
    void runSingleTest()
    {
        {
            std::lock_guard<std::mutex> lock(_testStartSync);
            _testStarted = false;
        }

        // these pools won't just fire as many operations as they can,
        // but will emulate real-time occurring requests to the cache in a multithreaded environment
        auto writeTestPool = TaskPool(_writeWorkersNum, std::bind(&Test::writeTransactions, this));
        auto popTestPool = TaskPool(_popWorkersNum, std::bind(&Test::popTransactions, this));

        _writeTime = 0;
        _writeOpNum = 0;
        _popTime = 0;
        _popOpNum = 0;

        {
            std::lock_guard<std::mutex> lock(_testStartSync);
            _testStarted = true;
            _testStartCv.notify_all();
        }
    }

    void logResults()
    {
        std::cout << "===============================================" << std::endl;
        std::cout << "Writing operations number per sec:\t" << _writeOpNum / 60. << std::endl;
        std::cout << "Writing operations avg time (mcsec):\t" << (double)_writeTime / _writeOpNum << std::endl;
        std::cout << "Pop operations number per sec: \t" << _popOpNum / 60. << std::endl;
        std::cout << "Pop operations avg time (mcsec): \t" << (double)_popTime / _popOpNum << std::endl;

        std::ofstream resultsFilestream;
        resultsFilestream.open(_resultsFile, std::ios_base::app);
        resultsFilestream << _writeOpNum / 60. << "," << (double)_writeTime / _writeOpNum << ","
            << _popOpNum / 60. << "," << (double)_popTime / _popOpNum << std::endl;

        std::cout << "Results saved to file " << _resultsFile << std::endl;
    }

    void writeTransactions()
    {
        {
            std::unique_lock<std::mutex> lock(_testStartSync);
            _testStartCv.wait(lock, [this] { return _testStarted; });
        }

        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

        // hypothetical system has around 100k currently active users
        std::uniform_int_distribution<> userDistribution(1, 100000);

        // delay up to 5 ms for every thread not to start simultaneously
        std::uniform_int_distribution<> waitTimeDistribution(0, 5000);
        std::this_thread::sleep_for(std::chrono::microseconds(waitTimeDistribution(_randomGenerator)));

        for (
            auto iterationStart = std::chrono::steady_clock::now();
            iterationStart - start < std::chrono::minutes(1);
            iterationStart = std::chrono::steady_clock::now())
        {
            auto generatedUser = userDistribution(_randomGenerator);
            TransactionData dummyTransaction = {
                5477311,
                generatedUser,
                1824507435,
                8055.05,
                0,
                "regular transaction by " + std::to_string(generatedUser) };

            std::chrono::steady_clock::time_point operationStart = std::chrono::steady_clock::now();
            _cache.write(dummyTransaction);
            std::chrono::steady_clock::time_point operationEnd = std::chrono::steady_clock::now();

            ++_writeOpNum;
            _writeTime += std::chrono::duration_cast<std::chrono::microseconds>(operationEnd - operationStart).count();

            // make span between iterations at least 5ms
            std::this_thread::sleep_for(iterationStart + std::chrono::milliseconds(5) - std::chrono::steady_clock::now());
        }
    }

    void popTransactions()
    {
        {
            std::unique_lock<std::mutex> lock(_testStartSync);
            _testStartCv.wait(lock, [this] { return _testStarted; });
        }

        std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

        // hypothetical system has around 100k currently active users
        std::uniform_int_distribution<> userDistribution(1, 100000);

        // delay up to 100 ms for every thread not to start simultaneously
        std::uniform_int_distribution<> waitTimeDistribution(0, 100000);
        std::this_thread::sleep_for(std::chrono::microseconds(waitTimeDistribution(_randomGenerator)));

        for (
            auto iterationStart = std::chrono::steady_clock::now();
            iterationStart - start < std::chrono::minutes(1);
            iterationStart = std::chrono::steady_clock::now())
        {
            auto requestedUser = userDistribution(_randomGenerator);

            std::chrono::steady_clock::time_point operationStart = std::chrono::steady_clock::now();
            auto userTransactions = _cache.pop(requestedUser);
            std::chrono::steady_clock::time_point operationEnd = std::chrono::steady_clock::now();

            ++_popOpNum;
            _popTime += std::chrono::duration_cast<std::chrono::microseconds>(operationEnd - operationStart).count();

            // make span between iterations at least 100ms
            std::this_thread::sleep_for(iterationStart + std::chrono::milliseconds(100) - std::chrono::steady_clock::now());
        }
    }

    CacheImpl _cache;

    std::atomic<long> _writeTime;
    std::atomic<long> _writeOpNum;
    std::atomic<long> _popTime;
    std::atomic<long> _popOpNum;

    size_t _writeWorkersNum;
    size_t _popWorkersNum;
    std::string _resultsFile;
    int _testrunsNum;
    bool _testStarted;
    std::mutex _testStartSync;
    std::condition_variable _testStartCv;
    std::mt19937 _randomGenerator;
};

void testCaches(const size_t testedShardSize, const size_t workersNum)
{
    if (testedShardSize == 1)
    {
        auto simpleImplTest = Test<SimpleSynchronizedCache>(
            10, workersNum, workersNum,
            "simple_cache_tests(" + std::to_string(workersNum) + "_workers).csv");

        simpleImplTest.run();
    }
    else
    {
        auto shardedImplTest = Test<ShardedCache>(
            10, workersNum, workersNum,
            "sharded_cache_" + std::to_string(testedShardSize) + "_tests(" + std::to_string(workersNum) + "_workers).csv",
            testedShardSize);

        shardedImplTest.run();
    }
}

int main()
{
    std::cout << "Hardware concurrency: " << hardware_concurrency << std::endl;

    std::array<size_t, 7> testPlan = { 1, 4, 8, 32, 128, 4096, 100000 };

    for (size_t i = 0; i < testPlan.size(); ++i)
    {
        testCaches(testPlan[i], 4 * hardware_concurrency);
    }

    // additional tests with diminished load to show limits of optimization advantage
    std::array<size_t, 4> additionalTestPlan = { 1, 8, 128, 100000 };

    for (size_t i = 0; i < additionalTestPlan.size(); ++i)
    {
        testCaches(additionalTestPlan[i], hardware_concurrency);
    }
}

We observe that with 2,000 writes and 300 pops per second (with a concurrency of 8) — which are not very high numbers for a high-load system — optimization using sharding significantly accelerates cache performance, by orders of magnitude. However, evaluating the significance of this difference is left to the reader, as, in both scenarios, operations took less than a millisecond. It's important to note that the tests used a relatively lightweight data structure for transactions, and synchronization was applied only to the container itself. In real-world scenarios, data is often more complex and larger, and synchronized processing may require additional computations and access to other data, which can significantly increase the time of the operation itself. Therefore, we aim to spend as little time on synchronization as possible. The tests do not show a significant difference in processing time as the shard size increases. The greater the size, the bigger the maintenance overhead, so how low should we go? I suspect that the minimal effective value is tied to the system's concurrency, so for modern server machines with much greater concurrency than my home PC, a shard size that is too small won't yield optimal results.
I would love to see the results on other machines with different concurrency that may confirm or disprove this hypothesis, but for now I assume it is optimal to use a shard size that is several times larger than the concurrency. You can also note that the largest size tested — 100,000 — effectively matches the approach mentioned earlier of assigning a mutex to each user (in the tests, user IDs were generated within the range of 100,000). As can be seen, this did not provide any advantage in processing speed, and this approach is obviously more demanding in terms of memory.

Limitations and Considerations

So, we determined an optimal shard size, but this is not the only thing that should be considered for the best results. It's important to remember that such a difference compared to a simple implementation exists only because we are attempting to perform a sufficiently large number of transactions at the same time, causing a "queue" to build up. If the system's concurrency and the speed of each operation (within the mutex lock) allow operations to be processed without bottlenecks, the effectiveness of sharding optimization decreases. To demonstrate this, let's look at the test results with reduced load: at 500 writes and 75 pops per second (with a concurrency of 8), the difference is still present, but it is no longer as significant. This is yet another reminder that premature optimizations can complicate code without significantly impacting results. It's crucial to understand the application requirements and expected load. Also, it's important to note that the effectiveness of sharding heavily depends on the distribution of values of the chosen key (in this case, user ID). If the distribution becomes heavily skewed, we may revert to performance more similar to that of a single mutex — imagine all of the transactions coming from a single user.

Conclusion

In scenarios with frequent writes to a container in a multi-threaded environment, traditional synchronization methods can become a bottleneck. By leveraging the fact that the data can be processed in parallel and distributed predictably by a specific key, and by implementing a sharded synchronization approach, we can significantly improve performance without sacrificing thread safety. This technique can prove effective for systems dealing with user-specific data, such as transaction processing systems, user session caches, or any scenario where data can be logically partitioned based on a key attribute. As with any optimization, it's crucial to profile your specific use case and adjust the implementation accordingly. The approach presented here provides a starting point for tackling synchronization challenges in write-heavy, multi-threaded applications. Remember, the goal of optimization is not just to make things faster, but to make them more efficient and scalable. By thinking critically about your data access patterns and leveraging the inherent structure of your data, you can often find innovative solutions to performance bottlenecks.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC. A decade ago, Google introduced Kubernetes to simplify the management of containerized applications. Since then, it has fundamentally transformed the software development and operations landscape. Today, Kubernetes has seen numerous enhancements and integrations, becoming the de facto standard for container orchestration. This article explores the journey of Kubernetes over the past 10 years, its impact on the software development lifecycle (SDLC) and developers, and the trends and innovations that will shape its next decade. The Evolution of Kubernetes Kubernetes, often referred to as K8s, had its first commit pushed to GitHub on June 6, 2014. About a year later, on July 21, 2015, Kubernetes V1 was released, featuring 14,000 commits from 400 contributors. Simultaneously, the Linux Foundation announced the formation of the Cloud Native Computing Foundation (CNCF) to advance state-of-the-art technologies for building cloud-native applications and services. After that, Google donated Kubernetes to the CNCF, marking a significant milestone in its development. Kubernetes addressed a critical need in the software industry: managing the lifecycle of containerized applications. Before Kubernetes, developers struggled with orchestrating containers, leading to inefficiencies and complexities in deployment processes. Kubernetes brought advanced container management functionality and quickly gained popularity due to its robust capabilities in automating the deployment, scaling, and operations of containers. While early versions of Kubernetes introduced the foundation for container orchestration, the project has since undergone significant improvements. Major updates have introduced sophisticated features such as StatefulSets for managing stateful applications, advanced networking capabilities, and enhanced security measures. The introduction of Custom Resource Definitions (CRDs) and Operators has further extended its functionality, allowing users to manage complex applications and workflows with greater ease. In addition, the community has grown significantly over the past decade. According to the 2023 Project Journey Report, Kubernetes now has over 74,680 contributors, making it the second-largest open-source project in the world after Linux. Over the years, Kubernetes has seen numerous enhancements and integrations, becoming the de facto standard for container orchestration. The active open source community and the extensive ecosystem of tools and projects have made Kubernetes an essential technology for modern software development. It is now the "primary container orchestration tool for 71% of Fortune 100 companies" (Project Journey Report). Kubernetes' Impact on the SDLC and Developers Kubernetes abstracts away the complexities of container orchestration and allows developers to focus on development rather than worry about application deployment and orchestration. The benefits and key impacts on the SDLC and developer workflows include enhanced development and testing, efficient deployment, operational efficiency, improved security, and support for microservices architecture. Enhanced Development and Testing Kubernetes ensures consistency for applications running across testing, development, and production environments, regardless of whether the infrastructure is on-premises, cloud based, or a hybrid setup. 
This level of consistency, along with the capability to quickly spin up and tear down environments, significantly accelerates development cycles. By promoting portability, Kubernetes also helps enterprises avoid vendor lock-in and refine their cloud strategies, leading to a more flexible and efficient development process. Efficient Deployment Kubernetes automates numerous aspects of application deployment, such as service discovery, load balancing, scaling, and self-healing. This automation reduces manual effort, minimizes human error, and ensures reliable and repeatable deployments, reducing downtime and deployment failures. Operational Efficiency Kubernetes efficiently manages resources by dynamically allocating them based on the application's needs. It ensures operations remain cost effective while maintaining optimal performance and use of computing resources by scheduling containers based on resource requirements and availability. Security Kubernetes enhances security by providing container isolation and managing permissions. Its built-in security features allow developers to build secure applications without deep security expertise. Such built-in features include role-based access control, which ensures that only authorized users can access specific resources and perform certain actions. It also supports secrets management to securely store and manage sensitive information like passwords and API keys. Microservices Architecture Kubernetes has facilitated the adoption of microservices architecture by enabling developers to deploy, manage, and scale individual microservices independently. Each microservice can be packaged into a separate container, providing isolation and ensuring that dependencies are managed within the container. Kubernetes' service discovery and load balancing features enable communication between microservices, while its support for automated scaling and self-healing ensures high availability and resilience. Predictions for the Next Decade After a decade, it has become clear that Kubernetes is now the standard technology for container orchestration that's used by many enterprises. According to the CNCF Annual Survey 2023, the usage of Kubernetes continues to grow, with significant adoption across different industries and use cases. Its reliability and flexibility make it a preferred choice for mission-critical applications, including databases, CI/CD pipelines, and AI and machine learning (ML) workloads. As a result, there is a growing demand for new features and enhancements, as well as simplifying concepts for users. The community is now prioritizing improvements that not only enhance user experiences but also promote the sustainability of the project. Figure 1 illustrates the anticipated future trends in Kubernetes, and below are the trends and innovations expected to shape Kubernetes' future in more detail. Figure 1. Future trends in Kubernetes AI and Machine Learning Kubernetes is increasingly used to orchestrate AI and ML workloads, supporting the deployment and management of complex ML pipelines. This simplifies the integration and scaling of AI applications across various environments. Innovations such as Kubeflow — an open-source platform designed to optimize the deployment, orchestration, and management of ML workflows on Kubernetes — enable data scientists to focus more on model development and less on infrastructure concerns. 
According to the recent CNCF open-source project velocity report, Kubeflow appeared on the top 30 CNCF project list for the first time in 2023, highlighting its growing importance in the ecosystem. Addressing the resource-intensive demands of AI introduces new challenges that contributors are focusing on, shaping the future of Kubernetes in the realm of AI and ML. The Developer Experience As Kubernetes evolves, its complexity can create challenges for new users. Hence, improving the user experience is crucial moving forward. Tools like Backstage are revolutionizing how developers work with Kubernetes and speeding up the development process. The CNCF's open-source project velocity report also states that "Backstage is addressing a significant pain point around developer experience." Moreover, the importance of platform engineering is increasingly recognized by companies. This emerging trend is expected to grow, with the goal of reducing the learning curve and making it easier for developers to adopt Kubernetes, thereby accelerating the development process and improving productivity. CI/CD and GitOps Kubernetes is revolutionizing continuous integration and continuous deployment (CI/CD) pipelines through the adoption of GitOps practices. GitOps uses Git repositories as the source of truth for declarative infrastructure and applications, enabling automated deployments. Tools like ArgoCD and Flux are being widely adopted to simplify the deployment process, reduce human error, and ensure consistency across environments. Figure 2 shows the integration between a GitOps operator, such as ArgoCD, and Kubernetes for managing deployments. This trend is expected to grow, making CI/CD pipelines more robust and efficient. Figure 2. Kubernetes GitOps Sustainability and Efficiency Cloud computing's carbon footprint now exceeds the airline industry, making sustainability and operational efficiency crucial in Kubernetes deployments. The Kubernetes community is actively developing features to optimize resource usage, reduce energy consumption, and enhance the overall efficiency of Kubernetes clusters. CNCF projects like KEDA (Kubernetes event-driven autoscaling) and Karpenter (just-in-time nodes for any Kubernetes cluster) are at the forefront of this effort. These tools not only contribute to cost savings but also align with global sustainability goals. Hybrid and Multi-Cloud Deployments According to the CNCF Annual Survey 2023, multi-cloud solutions are now the norm: Multi-cloud solutions (hybrid and other cloud combinations) are used by 56% of organizations. Deploying applications across hybrid and multi-cloud environments is one of Kubernetes' most significant advantages. This flexibility enables organizations to avoid vendor lock-in, optimize costs, and enhance resilience by distributing workloads across multiple cloud providers. Future developments in Kubernetes will focus on improving and simplifying management across different cloud platforms, making hybrid and multi-cloud deployments even more efficient. Increased Security Features Security continues to be a top priority for Kubernetes deployments. The community is actively enhancing security features to address vulnerabilities and emerging threats. These efforts include improvements to network policies, stronger identity and access management (IAM), and more advanced encryption mechanisms. 
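To make the network policy portion of these efforts more concrete, below is a minimal sketch of a Kubernetes NetworkPolicy that restricts which Pods may reach a sensitive workload; the namespace, names, and labels are illustrative assumptions rather than anything prescribed by the report.

YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-api-ingress    # hypothetical policy name
  namespace: payments           # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: payments-api         # applies to Pods carrying this (assumed) label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend    # only Pods labeled as frontend may connect
      ports:
        - protocol: TCP
          port: 8080

By default, Pods accept traffic from anywhere; once a policy like this selects them, ingress becomes deny-by-default and only the allowlisted peers can connect.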
For instance, the 2024 CNCF open-source project velocity report highlighted that Keycloak, which joined CNCF last year as an incubating project, is playing a vital role in advancing open-source IAM, backed by a large and active community.

Edge Computing

Kubernetes is playing a crucial role in the evolution of edge computing. By enabling consistent deployment, monitoring, and management of applications at the edge, Kubernetes significantly reduces latency, enhances real-time processing capabilities, and supports emerging use cases like IoT and 5G. Projects like KubeEdge and K3s are at the forefront of this movement. We can expect further optimizations for lightweight and resource-constrained environments, making Kubernetes even more suitable for edge computing scenarios.

Conclusion

Kubernetes has revolutionized cloud-native computing, transforming how we develop, deploy, and manage applications. As Kelsey Hightower noted in Google's Kubernetes Podcast, "We are only halfway through its journey, with the next decade expected to see Kubernetes mature to the point where it 'gets out of the way' by doing its job so well that it becomes naturally integrated into the background of our infrastructure." Kubernetes' influence will only grow, shaping the future of technology and empowering organizations to innovate and thrive in an increasingly complex landscape.

References:
"10 Years of Kubernetes" by Bob Killen et al., 2024
CNCF Annual Survey 2023 by CNCF, 2023
"As we reach mid-year 2024, a look at CNCF, Linux Foundation, and top 30 open source project velocity" by Chris Aniszczyk, CNCF, 2024
"Orchestration Celebration: 10 Years of Kubernetes" by Adrian Bridgwater, 2024
"Kubernetes: Beyond Container Orchestration" by Pratik Prakash, 2022
"The Staggering Ecological Impacts of Computation and the Cloud" by Steven Gonzalez Monserrate, 2022

This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.

In recent years, observability has re-emerged as a critical aspect of DevOps and software engineering in general, driven by the growing complexity and scale of modern, cloud-native applications. The transition toward microservices architecture as well as complex cloud deployments — ranging from multi-region to multi-cloud, or even hybrid-cloud, environments — has highlighted the shortcomings of traditional methods of monitoring. In response, the industry has standardized utilizing logs, metrics, and traces as the three pillars of observability to provide a more comprehensive sense of how the application and the entire stack are performing. We now have a plethora of tools to collect, store, and analyze various signals to diagnose issues, optimize performance, and respond to incidents. Yet anyone working with Kubernetes will still say that observability in Kubernetes remains challenging. Part of it comes from the inherent complexity of working with Kubernetes, but the fact of the matter is that logs, metrics, and traces alone don't make up observability. Also, the vast ecosystem of observability tooling does not necessarily equate to ease of use or high ROI, especially given today's renewed focus on cost. In this article, we'll dive into some considerations for Kubernetes observability, challenges of and some potential solutions for implementing it, and the oft-forgotten aspect of developer experience in observability.

Considerations for Kubernetes Observability

When considering observability for Kubernetes, most have a tendency to dive straight into tool choices, but it's advisable to take a hard look at what falls under the scope of things to "observe" for your use case. Within Kubernetes alone, we already need to consider:
Cluster components – API server, etcd, controller manager, scheduler
Node components – kubelet, kube-proxy, container runtime
Other resources – CoreDNS, storage plugins, Ingress controllers
Network – CNI, service mesh
Security and access – audit logs, security policies
Application – both internal and third-party applications

And most often, we inevitably have components that run outside of Kubernetes but interface with many applications running inside. Most notably, we have databases ranging from managed cloud offerings to external data lakes. We also have things like serverless functions, queues, or other Kubernetes clusters that we need to think about. Next, we need to identify the users of Kubernetes as well as the consumers of these observability tools. It's important to consider these personas as building for an internal-only cluster vs. a multi-tenant SaaS cluster may have different requirements (e.g., privacy, compliance). Also, depending on the team composition, the primary consumers of these tools may be developers or dedicated DevOps/SRE teams who will have different levels of expertise with not only these tools but with Kubernetes itself. Only after considering the above factors can we start to talk about what tools to use. For example, if most applications are already on Kubernetes, using a Kubernetes-focused tool may suffice, whereas organizations with lots of legacy components may elect to reuse an existing observability stack.
Also, a large organization with various teams mostly operating as independent verticals may opt to use their own tooling stack, whereas a smaller startup may opt to pay for an enterprise offering to simplify the setup across teams.

Challenges and Recommendations for Observability Implementation

After considering the scope and the intended audience of our observability stack, we're ready to narrow down the tool choices. Largely speaking, there are two options for implementing an observability stack: open source and commercial/SaaS.

Open-Source Observability Stack

The primary challenge with implementing a fully open-source observability solution is that there is no single tool that covers all aspects. Instead, what we have are ecosystems or stacks of tools that cover different aspects of observability. One of the more popular tech stacks, drawn from Prometheus and Grafana Labs' suite of products, includes:
Prometheus for scraping metrics and alerting
Loki for collecting logs
Tempo for distributed tracing
Grafana for visualization

While the above setup does cover a vast majority of observability requirements, these tools still operate as individual microservices and do not provide the same level of uniformity as a commercial or SaaS product. But in recent years, there has been a strong push to at least standardize on OpenTelemetry conventions to unify how to collect metrics, logs, and traces. Since OpenTelemetry is a framework that is tool agnostic, it can be used with many popular open-source tools like Prometheus and Jaeger. Ideally, architecting with OpenTelemetry in mind will make standardization of how to generate, collect, and manage telemetry data easier with the growing list of compliant open-source tools. However, in practice, most organizations will already have established tools or in-house versions of them — whether that is the EFK (Elasticsearch, Fluentd, Kibana) stack or Prometheus/Grafana. Instead of forcing a new framework or tool, apply the ethos of standardization and improve what and how telemetry data is collected and stored. Finally, one of the common challenges with open-source tooling is dealing with storage. Some tools like Prometheus cannot scale without offloading storage to another solution like Thanos or Mimir. But in general, it's easy to forget to monitor the observability tooling health itself and scale the back end accordingly. More telemetry data does not necessarily equal more signals, so keep a close eye on the volume and optimize as needed.

Commercial Observability Stack

On the commercial offering side, we usually have agent-based solutions where telemetry data is collected from agents running as DaemonSets on Kubernetes. Nowadays, almost all commercial offerings have a comprehensive suite of tools that combine into a seamless experience to connect logs to metrics to traces in a single user interface. The primary challenge with commercial tools is controlling cost. This usually comes in the form of exposing cardinality from tags and metadata. In the context of Kubernetes, every Pod has tons of metadata related to not only Kubernetes state but the state of the associated tooling as well (e.g., annotations used by Helm or ArgoCD). These metadata then get ingested as additional tags and data fields by the agents. Since commercial tools have to index all the data to make telemetry queryable and sortable, increased cardinality from additional dimensions (usually in the form of tags) causes issues with performance and storage.
This directly results in higher cost to the end user. Fortunately, most tools now allow the user to control which tags to index and even downsample data to avoid getting charged for repetitive data points that are not useful. Be aggressive with filters and pipeline logic to only index what is needed; otherwise, don't be surprised by the ballooning bill.

Remembering the Developer Experience

Regardless of the tool choice, one common pitfall that many teams face is over-optimizing for ops usage and neglecting the developer experience when it comes to observability. Despite the promise of DevOps, observability often falls under the realm of ops teams, whether that be platform, SRE, or DevOps engineering. This makes it easy for teams to build for what they know and what they need, over-indexing on infrastructure and not investing as much in application-level telemetry. This ends up alienating developers, who either invest less time in observability or become too reliant on their ops counterparts for setup or debugging. To make observability truly useful for everyone involved, don't forget about these points:
Access. It's usually more of a problem with open-source tools, but make sure access to logs, dashboards, and alerts is not gated by unnecessary approvals. Ideally, having quick links from existing mediums like IDEs or Slack can make tooling more accessible.
Onboarding. It's rare for developers to go through the same level of onboarding in learning how to use any of these tools. Invest some time to get them up to speed.
Standardization vs. flexibility. While a standard format like JSON is great for indexing, it may not be as human readable and is filled with extra information. Think of ways to present information in a usable format.

At the end of the day, the goals of developers and ops teams should be aligned. We want tools that are easy to integrate, with minimal overhead, that produce intuitive dashboards and actionable, contextual information without too much noise. Even with the best tools, you still need to work with developers who are responsible for generating telemetry and also acting on it, so don't neglect the developer experience entirely.

Final Thoughts

Observability has been a hot topic in recent years due to several key factors, including the rise of complex, modern software coupled with DevOps and SRE practices to deal with that complexity. The community has moved past the simple notion of monitoring to defining the three pillars of observability as well as creating new frameworks to help with the generation, collection, and management of telemetry data. Observability in a Kubernetes context has remained challenging so far given the large scope of things to "observe" as well as the complexity of each component. With the open source ecosystem, we have seen a large fragmentation of specialized tools that is just now integrating into a standard framework. On the commercial side, we have great support for Kubernetes, but cost control has been a huge issue. And to top it off, lost in all of this complexity is the developer experience in helping feed data into and using the insights from the observability stack. But as the community has done before, tools and experience will continue to improve. We already see significant research and advances in how AI technology can improve observability tooling and experience. Not only do we see better data-driven decision making, but generative AI technology can also help surface information better in context to make tools more useful without too much overhead.
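As a closing illustration of the "index only what is needed" advice above, here is a minimal, hypothetical Prometheus scrape configuration that drops two well-known high-cardinality kube-state-metrics series before they are stored; the job name is an assumption, and the same filtering idea applies to commercial agents' pipeline rules.

YAML
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    metric_relabel_configs:
      # drop label- and annotation-dump series that inflate cardinality
      - source_labels: [__name__]
        regex: "kube_pod_labels|kube_pod_annotations"
        action: drop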
This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC. Read the Free Report
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.

Kubernetes has become a cornerstone in modern infrastructure, particularly for deploying, scaling, and managing artificial intelligence and machine learning (AI/ML) workloads. As organizations increasingly rely on machine learning models for critical tasks like data processing, model training, and inference, Kubernetes offers the flexibility and scalability needed to manage these complex workloads efficiently. By leveraging Kubernetes' robust ecosystem, AI/ML workloads can be dynamically orchestrated, ensuring optimal resource utilization and high availability across cloud environments. This synergy between Kubernetes and AI/ML empowers organizations to deploy and scale their ML workloads with greater agility and reliability. This article delves into the key aspects of managing AI/ML workloads within Kubernetes, focusing on strategies for resource allocation, scaling, and automation specific to this platform. By addressing the unique demands of AI/ML tasks in a Kubernetes environment, it provides practical insights to help organizations optimize their ML operations. Whether handling resource-intensive computations or automating deployments, this guide offers actionable advice for leveraging Kubernetes to enhance the performance, efficiency, and reliability of AI/ML workflows, making it an indispensable tool for modern enterprises.

Understanding Kubernetes and AI/ML Workloads

In order to effectively manage AI/ML workloads in Kubernetes, it is important to first understand the architecture and components of the platform.

Overview of Kubernetes Architecture

Kubernetes architecture is designed to manage containerized applications at scale. The architecture is built around two main components: the control plane (coordinator nodes) and the worker nodes. Figure 1. Kubernetes architecture. For more information, or to review the individual components of the architecture in Figure 1, check out the Kubernetes Documentation.

AI/ML Workloads: Model Training, Inference, and Data Processing

AI/ML workloads are computational tasks that involve training machine learning models, making predictions (inference) based on those models, and processing large datasets to derive insights. AI/ML workloads are essential for driving innovation and making data-driven decisions in modern enterprises:
Model training enables systems to learn from vast datasets, uncovering patterns that power intelligent applications.
Inference allows these models to generate real-time predictions, enhancing user experiences and automating decision-making processes.
Efficient data processing is crucial for transforming raw data into actionable insights, fueling the entire AI/ML pipeline.

However, managing these computationally intensive tasks requires a robust infrastructure. This is where Kubernetes comes into play, providing the scalability, automation, and resource management needed to handle AI/ML workloads effectively, ensuring they run seamlessly in production environments.

Key Considerations for Managing AI/ML Workloads in Kubernetes

Successfully managing AI/ML workloads in Kubernetes requires careful attention to several critical factors. This section outlines the key considerations for ensuring that your AI/ML workloads are optimized for performance and reliability within a Kubernetes environment.
Resource Management Effective resource management is crucial when deploying AI/ML workloads on Kubernetes. AI/ML tasks, particularly model training and inference, are resource intensive and often require specialized hardware such as GPUs or TPUs. Kubernetes allows for the efficient allocation of CPU, memory, and GPUs through resource requests and limits. These configurations ensure that containers have the necessary resources while preventing them from monopolizing node capacity. Additionally, Kubernetes supports the use of node selectors and taints/tolerations to assign workloads to nodes with the required hardware (e.g., GPU nodes). Managing resources efficiently helps optimize cluster performance, ensuring that AI/ML tasks run smoothly without over-provisioning or under-utilizing the infrastructure. Handling resource-intensive tasks requires careful planning, particularly when managing distributed training jobs that need to run across multiple nodes. These workloads benefit from Kubernetes' ability to distribute resources while ensuring that high-priority tasks receive adequate computational power. Scalability Scalability is another critical factor in managing AI/ML workloads in Kubernetes. Horizontal scaling, where additional Pods are added to handle increased demand, is particularly useful for stateless workloads like inference tasks that can be easily distributed across multiple Pods. Vertical scaling, which involves increasing the resources available to a single Pod (e.g., more CPU or memory), can be beneficial for resource-intensive processes like model training that require more power to handle large datasets. In addition to Pod autoscaling, Kubernetes clusters benefit from cluster autoscaling to dynamically adjust the number of worker nodes based on demand. Karpenter is particularly suited for AI/ML workloads due to its ability to quickly provision and scale nodes based on real-time resource needs. Karpenter optimizes node placement by selecting the most appropriate instance types and regions, taking into account workload requirements like GPU or memory needs. By leveraging Karpenter, Kubernetes clusters can efficiently scale up during resource-intensive AI/ML tasks, ensuring that workloads have sufficient capacity without over-provisioning resources during idle times. This leads to improved cost efficiency and resource utilization, especially for complex AI/ML operations that require on-demand scalability. These autoscaling mechanisms enable Kubernetes to dynamically adjust to workload demands, optimizing both cost and performance. Data Management AI/ML workloads often require access to large datasets and persistent storage for model checkpoints and logs. Kubernetes offers several persistent storage options to accommodate these needs, including PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). These options allow workloads to access durable storage across various cloud and on-premises environments. Additionally, Kubernetes integrates with cloud storage solutions like AWS EBS, Google Cloud Storage, and Azure Disk Storage, making it easier to manage storage in hybrid or multi-cloud setups. Handling large volumes of training data requires efficient data pipelines that can stream or batch process data into models running within the cluster. This can involve integrating with external systems, such as distributed file systems or databases, and using tools like Apache Kafka for real-time data ingestion. 
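To ground the persistent storage options mentioned above, here is a minimal sketch of a PersistentVolumeClaim that a training Pod could mount for datasets and model checkpoints; the claim name, storage class, and size are illustrative assumptions.

YAML
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data           # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3         # assumes an AWS EBS-backed StorageClass is available
  resources:
    requests:
      storage: 200Gi            # sized for datasets and checkpoints in this sketch

A training Job or Pod would then reference the claim through a volumes entry and mount it via volumeMounts, keeping checkpoints durable even if the Pod is rescheduled.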
Properly managing data is essential for maintaining high-performance AI/ML pipelines, ensuring that models have quick and reliable access to the data they need for both training and inference. Deployment Automation Automation is key to managing the complexity of AI/ML workflows, particularly when deploying models into production. CI/CD pipelines can automate the build, test, and deployment processes, ensuring that models are continuously integrated and deployed with minimal manual intervention. Kubernetes integrates well with CI/CD tools like Jenkins, GitLab CI/CD, and Argo CD, enabling seamless automation of model deployments. Tools and best practices for automating AI/ML deployments include using Helm for managing Kubernetes manifests, Kustomize for configuration management, and Kubeflow for orchestrating ML workflows. These tools help standardize the deployment process, reduce errors, and ensure consistency across environments. By automating deployment, organizations can rapidly iterate on AI/ML models, respond to new data, and scale their operations efficiently, all while maintaining the agility needed in fast-paced AI/ML projects. Scheduling and Orchestration Scheduling and orchestration for AI/ML workloads require more nuanced approaches compared to traditional applications. Kubernetes excels at managing these different scheduling needs through its flexible and powerful scheduling mechanisms. Batch scheduling is typically used for tasks like model training, where large datasets are processed in chunks. Kubernetes supports batch scheduling by allowing these jobs to be queued and executed when resources are available, making them ideal for non-critical workloads that are not time sensitive. Kubernetes Job and CronJob resources are particularly useful for automating the execution of batch jobs based on specific conditions or schedules. On the other hand, real-time processing is used for tasks like model inference, where latency is critical. Kubernetes ensures low latency by providing mechanisms such as Pod priority and preemption, ensuring that real-time workloads have immediate access to the necessary resources. Additionally, Kubernetes' HorizontalPodAutoscaler can dynamically adjust the number of pods to meet demand, further supporting the needs of real-time processing tasks. By leveraging these Kubernetes features, organizations can ensure that both batch and real-time AI/ML workloads are executed efficiently and effectively. Gang scheduling is another important concept for distributed training in AI/ML workloads. Distributed training involves breaking down model training tasks across multiple nodes to reduce training time, and gang scheduling ensures that all the required resources across nodes are scheduled simultaneously. This is crucial for distributed training, where all parts of the job must start together to function correctly. Without gang scheduling, some tasks might start while others are still waiting for resources, leading to inefficiencies and extended training times. Kubernetes supports gang scheduling through custom schedulers like Volcano, which is designed for high-performance computing and ML workloads. Latency and Throughput Performance considerations for AI/ML workloads go beyond just resource allocation; they also involve optimizing for latency and throughput. Latency refers to the time it takes for a task to be processed, which is critical for real-time AI/ML workloads such as model inference. 
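As a hypothetical sketch of the HorizontalPodAutoscaler mentioned above, and one way to keep inference latency manageable as request volume grows, the following manifest scales an assumed inference Deployment on CPU utilization; the names and thresholds are illustrative, and production setups often scale on custom latency or queue-depth metrics instead.

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa                 # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service           # assumed Deployment serving the model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # scale out before the serving Pods saturate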
Ensuring low latency is essential for applications like online recommendations, fraud detection, or any use case where real-time decision making is required. Kubernetes can manage latency by prioritizing real-time workloads, using features like node affinity to ensure that inference tasks are placed on nodes with the least network hops or proximity to data sources. Throughput, on the other hand, refers to the number of tasks that can be processed within a given time frame. For AI/ML workloads, especially in scenarios like batch processing or distributed training, high throughput is crucial. Optimizing throughput often involves scaling out workloads horizontally across multiple Pods and nodes. Kubernetes' autoscaling capabilities, combined with optimized scheduling, ensure that AI/ML workloads maintain high throughput — even as demand increases. Achieving the right balance between latency and throughput is vital for the efficiency of AI/ML pipelines, ensuring that models perform at their best while meeting real-world application demands. A Step-by-Step Guide: Deploying TensorFlow Sentiment Analysis Model on AWS EKS In this example, we demonstrate how to deploy a TensorFlow-based sentiment analysis model using AWS Elastic Kubernetes Service (EKS). This hands-on guide will walk you through setting up a Flask-based Python application, containerizing it with Docker, and deploying it on AWS EKS using Kubernetes. Although many tools are suitable, TensorFlow was chosen for this example due to its popularity and robustness in developing AI/ML models, while AWS EKS provides a scalable and managed Kubernetes environment that simplifies the deployment process. By following this guide, readers will gain practical insights into deploying AI/ML models in a cloud-native environment, leveraging Kubernetes for efficient resource management and scalability. Step 1: Create a Flask-based Python app setup Create a Flask app (app.py) using the Hugging Face transformers pipeline for sentiment analysis: Shell from flask import Flask, request, jsonify from transformers import pipeline app = Flask(__name__) sentiment_model = pipeline("sentiment-analysis") @app.route('/analyze', methods=['POST']) def analyze(): data = request.get_json() result = sentiment_model(data['text']) return jsonify(result) if __name__ == '__main__': app.run(host='0.0.0.0', port=5000) Step 2: Create requirements.txt Shell transformers==4.24.0 torch==1.12.1 flask jinja2 markupsafe==2.0.1 Step 3: Build Docker image Create a Dockerfile to containerize the app: Shell FROM python:3.9-slim WORKDIR /app COPY requirements.txt requirements.txt RUN pip install -r requirements.txt COPY . . CMD ["python", "app.py"] Build and push the Docker image: Shell docker build -t brainupgrade/aiml-sentiment:20240825 . 
docker push brainupgrade/aiml-sentiment:20240825

Step 4: Deploy to AWS EKS with Karpenter

Create a Kubernetes Deployment manifest (deployment.yaml):

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-analysis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sentiment-analysis
  template:
    metadata:
      labels:
        app: sentiment-analysis
    spec:
      containers:
        - name: sentiment-analysis
          image: brainupgrade/aiml-sentiment:20240825
          ports:
            - containerPort: 5000
          resources:
            requests:
              aws.amazon.com/neuron: 1   # request one AWS Inferentia chip
            limits:
              aws.amazon.com/neuron: 1
      tolerations:
        - key: "aiml"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"

Apply the Deployment to the EKS cluster:

Shell
kubectl apply -f deployment.yaml

Karpenter will automatically scale the cluster and launch an inf1.xlarge EC2 instance based on the resource specification (aws.amazon.com/neuron: 1). Karpenter also installs the appropriate device drivers for this specialized EC2 instance type, which is optimized for deep learning inference, featuring four vCPUs, 16 GiB RAM, and one Inferentia chip. The reference Karpenter spec is as follows:

YAML
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "16"
  provider:
    instanceProfile: eksctl-KarpenterNodeInstanceProfile-<cluster-name>
    securityGroupSelector:
      karpenter.sh/discovery: <cluster-name>
    subnetSelector:
      karpenter.sh/discovery: <cluster-name>
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - inf1.xlarge
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
  ttlSecondsAfterEmpty: 30

Step 5: Test the application

Once deployed and exposed via an AWS Load Balancer or Ingress, test the app with the following cURL command:

Shell
curl -X POST -H "Content-Type: application/json" -d '{"text":"I love using this product!"}' https://<app-url>/analyze

This command sends a sentiment analysis request to the deployed model endpoint: https://<app-url>/analyze.

Challenges and Solutions

Managing AI/ML workloads in Kubernetes comes with its own set of challenges, from handling ephemeral containers to ensuring security and maintaining observability. In this section, we will explore these challenges in detail and provide practical solutions to help you effectively manage AI/ML workloads in a Kubernetes environment.

Maintaining State in Ephemeral Containers

One of the main challenges in managing AI/ML workloads in Kubernetes is handling ephemeral containers while maintaining state. Containers are designed to be stateless, which can complicate AI/ML workflows that require persistent storage for datasets, model checkpoints, or intermediate outputs. For maintaining state in ephemeral containers, Kubernetes offers PVs and PVCs, which enable long-term storage for AI/ML workloads, even if the containers themselves are short-lived.

Ensuring Security and Compliance

Another significant challenge is ensuring security and compliance. AI/ML workloads often involve sensitive data, and maintaining security at multiple levels — network, access control, and data integrity — is crucial for meeting compliance standards. To address security challenges, Kubernetes provides role-based access control (RBAC) and NetworkPolicies. RBAC ensures that users and services have only the necessary permissions, minimizing security risks. NetworkPolicies allow for fine-grained control over network traffic, ensuring that sensitive data remains protected within the cluster.
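As a small illustration of that last point, the sketch below restricts ingress to the sentiment-analysis Pods from the earlier example so that only approved clients can reach the model endpoint. The api-gateway label is a hypothetical stand-in for whatever workloads should be allowed to call the service:

YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sentiment-analysis-ingress
spec:
  podSelector:
    matchLabels:
      app: sentiment-analysis      # the inference Pods deployed above
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway     # hypothetical label for permitted callers
      ports:
        - protocol: TCP
          port: 5000               # the Flask app's port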
Observability in Kubernetes Environments

Additionally, observability is a key challenge in Kubernetes environments. AI/ML workloads can be complex, with numerous microservices and components, making it difficult to monitor performance, track resource usage, and detect potential issues in real time. Monitoring and logging are essential for observability in Kubernetes. Tools like Prometheus and Grafana provide robust solutions for monitoring system health, resource usage, and performance metrics. Prometheus can collect real-time metrics from AI/ML workloads, while Grafana visualizes this data, offering actionable insights for administrators. Together, they enable proactive monitoring, allowing teams to identify and address potential issues before they impact operations.

Conclusion

In this article, we explored the key considerations for managing AI/ML workloads in Kubernetes, focusing on resource management, scalability, data handling, and deployment automation. We covered essential concepts like efficient CPU, GPU, and TPU allocation, scaling mechanisms, and the use of persistent storage to support AI/ML workflows. Additionally, we examined how Kubernetes uses features like RBAC and NetworkPolicies and tools like Prometheus and Grafana to ensure security, observability, and monitoring for AI/ML workloads.

Looking ahead, AI/ML workload management in Kubernetes is expected to evolve with advancements in hardware accelerators and more intelligent autoscaling solutions like Karpenter. Integration of AI-driven orchestration tools and the emergence of Kubernetes-native ML frameworks will further streamline and optimize AI/ML operations, making it easier to scale complex models and handle ever-growing data demands.

For practitioners, staying informed about the latest Kubernetes tools and best practices is crucial. Continuous learning and adaptation to new technologies will empower you to manage AI/ML workloads efficiently, ensuring robust, scalable, and high-performance applications in production environments.

This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.

Kubernetes is driving the future of cloud computing, but its security challenges require us to adopt a full-scale approach to ensure the safety of our environments. Security is not a one-size-fits-all solution; security is a spectrum, influenced by the specific context in which it is applied. Security professionals in the field rarely declare anything as entirely secure, but always as more or less secure than alternatives. In this article, we are going to present various methods to bolster the security of your containers.

Understanding and Mitigating Container Security Threats

To keep your containerized systems secure, it's important to understand the threats they face. Just like a small leak can sink a ship, even a tiny vulnerability can cause big issues. This section will help you gain a deeper understanding of container security and will provide guidance on how to mitigate the threats that come with it.

Core Principles of Container Security

Attackers often target containers to hijack their compute power — a common example is to gain access for unauthorized cryptocurrency mining. Beyond this, a compromised container can expose sensitive data, including customer information and workload details. In more advanced attacks, the goal is to escape the container and infiltrate the underlying node. If the attacker succeeds, they can move laterally across the cluster, gaining ongoing access to critical resources such as user code, processing power, and valuable data across other nodes.

One particularly dangerous attack method is container escape, where an attacker leverages the fact that containers share the host's kernel. If they gain elevated privileges within a compromised container, they could potentially access data or processes in other containers on the same host. Additionally, the Kubernetes control plane is a prime target. If an attacker compromises one of the control plane components, they can manipulate the entire environment, potentially taking it offline or causing significant disruption. Furthermore, if the etcd database is compromised, attackers could alter or destroy the cluster, steal secrets and credentials, or gather enough information to replicate the application elsewhere.

Defense in Depth

Maintaining a secure container environment requires a layered strategy that underscores the principle of defense in depth. This approach involves implementing multiple security controls at various levels. By deploying overlapping security measures, you create a system where each layer of defense reinforces the others. This way, even if one security measure is breached, the others continue to protect the environment.

Figure 1. Defense-in-depth strategy

Understanding the Attack Surface

Part of the security strategy is understanding and managing the attack surface, which encompasses all potential points of exploitation, including container images, runtime, orchestration tools, the host, and network interfaces. Reducing the attack surface means simplifying the system and minimizing unnecessary components, services, and code. By limiting what is running and enforcing strict access controls, you decrease the opportunities for vulnerabilities to exist or be exploited, making the system more secure and harder for attackers to penetrate.
Common Threats and Mitigation Strategies

Let's shift our focus to the everyday threats in container security and discover the tools you can immediately put to work to safeguard your systems.

Vulnerable Container Images

Relying on container images with security vulnerabilities poses significant risks as these vulnerable images often include outdated software or components with publicly known vulnerabilities. A vulnerability, in this context, is essentially a flaw in the code that malicious actors can leverage to trigger harmful outcomes. An example of this is the infamous Heartbleed flaw in the OpenSSL library, which allowed attackers to access sensitive data by exploiting a coding error. When such flaws are present in container images, they create opportunities for attackers to breach systems, leading to potential data theft or service interruptions.

Best practices to secure container images include the following:

- To effectively reduce the attack surface, start by using minimal base images that include only the essential components required for your application. This approach minimizes potential vulnerabilities and limits what an attacker can exploit. Tools like Docker's FROM scratch or distroless images can help create these minimal environments.
- Understanding and managing container image layers is crucial as each layer can introduce vulnerabilities. By keeping layers minimal and only including what is necessary, you reduce potential attack vectors. Use multi-stage builds to keep the final image lean, and regularly review and update your Dockerfiles to remove unnecessary layers.
- It's important to avoid using unverified or outdated images. Unverified images from public repositories may contain malware, backdoors, or other malicious components. Outdated images often have unpatched vulnerabilities that attackers can exploit. To mitigate these risks, always source images from trusted repositories and regularly update them to the latest versions.

Insecure Container Runtime

An insecure container runtime is a critical threat as it can lead to privilege escalation, allowing attackers to gain elevated access within the system. With elevated access, attackers can disrupt services by modifying or terminating critical processes, causing downtime and impacting the availability of essential applications. They can gain full control over the container environment, manipulating configurations to deploy malicious containers or introduce malware, which can be used as a launchpad for further attacks.

Best practices for hardening the container runtime include the following:

- Implementing strict security boundaries and adhering to the principle of least privilege are essential for protecting the container runtime. Containers should be configured to run with only the permissions they need to function, minimizing the potential impact of a security breach. This involves setting up role-based access controls.
- Admission control is a critical aspect of runtime security that involves validating and regulating requests to create or update containers in the cluster. By employing admission controllers, you can enforce security policies and ensure that only compliant and secure container configurations are deployed. This can include checking for the use of approved base images, ensuring that security policies are applied, and verifying that containers are not running as root. Tools like Open Policy Agent (OPA) can be integrated into your Kubernetes environment to provide flexible and powerful admission control capabilities.
Here's an example of an OPA policy that acts as a gatekeeper, ensuring no container runs with root privileges:

Rego
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  input.request.object.spec.containers[_].securityContext.runAsUser == 0
  msg = "Containers must not run as root."
}

There are a few practices to avoid when securing the container runtime:

- If a container running as root is compromised, an attacker can gain root-level access to the host system, potentially leading to a full system takeover.
- When containers have unrestricted access to host resources, like the file system, network, or devices, a compromised container could exploit this access to tamper with the host system, steal sensitive data, or disrupt other services.

To prevent such scenarios, use tools like seccomp and AppArmor. These tools can restrict the system calls that containers make and enforce specific security policies. By applying these controls, you can confine containers to their intended operations, protecting the host system from potential breaches or unauthorized activities.

Misconfigured Kubernetes Settings

Misconfigured Kubernetes settings are a significant threat as they expose the cluster to attacks through overly permissive network policies, weak access controls, and poor secrets management:

- Overly permissive network policies enable attackers to intercept and tamper with data.
- Weak access controls allow unauthorized users to perform administrative tasks, disrupt services, and alter configurations.
- Poor secrets management exposes sensitive information like API keys and passwords, enabling attackers to escalate privileges.

Best practices for secure Kubernetes configuration are as follows:

- The risk of transmitting sensitive information without protection is that it can be intercepted or tampered with by malicious actors during transit. To mitigate this risk, secure all communication channels with transport layer security (TLS). Kubernetes offers tools like cert-manager to automate the management and renewal of TLS certificates. This ensures that communication between services remains encrypted and secure, thereby protecting your data from interception or manipulation.
- Network policies control the traffic flow between Pods and services in a Kubernetes cluster. By defining network policies, you can isolate sensitive workloads and reduce the risk of lateral movement in case of a compromise. Use Kubernetes' native NetworkPolicy resource to create rules that enforce your desired network security posture.
- On the other hand, it's important to avoid exposing unnecessary application ports. Exposure of ports provides multiple entry points for attackers, making the cluster more vulnerable to exploits.
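Pulling several of these runtime and configuration controls together, a hardened Pod spec might look roughly like the sketch below. The names and image are illustrative, and it assumes the runtime's default seccomp profile is sufficient for the workload:

YAML
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app               # illustrative name
spec:
  automountServiceAccountToken: false   # don't mount API credentials unless the app needs them
  containers:
    - name: app
      image: registry.example.com/app:1.4.2   # pinned, scanned image (hypothetical)
      ports:
        - containerPort: 8080      # expose only the port the app actually uses
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]            # drop all Linux capabilities
        seccompProfile:
          type: RuntimeDefault     # restrict system calls via the runtime's default profile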
CI/CD Security

CI/CD pipelines are granted extensive permissions, ensuring they can interact closely with production systems and manage updates. However, this extensive access also makes CI/CD pipelines a significant security risk. If compromised, attackers can exploit these broad permissions to manipulate deployments, introduce malicious code, gain unauthorized access to critical systems, steal sensitive data, or create backdoors for ongoing access.

There are several best practices to implement when securing CI/CD. The first best practice is ensuring that once a container image is built and deployed, it is immutable. We always want to make sure the Pod is running on exactly what we intended. It also helps in quickly identifying and rolling back to previous stable versions if a security issue arises, maintaining a reliable and predictable deployment process. Implementing immutable deployments involves several key steps to ensure consistency and security:

- Assign unique version tags to each container image build, avoiding mutable tags like "latest," and use Infrastructure-as-Code tools like Terraform or Ansible to maintain consistent setups.
- Configure containers with read-only file systems to prevent changes post-deployment.
- Implement continuous monitoring with tools like Prometheus and runtime security with Falco to help detect and alert on unauthorized changes, maintaining the security and reliability of your deployments.

Another best practice is implementing image vulnerability scanning in CI/CD. Vulnerability scanners meticulously analyze the components of container images, identifying known security flaws that could be exploited. Beyond just examining packages managed by tools like DNF or apt, advanced scanners also inspect additional files added during the build process, such as those introduced through Dockerfile commands like ADD, COPY, or RUN. It's important to include both third-party and internally created images in these scans as new vulnerabilities are constantly emerging. To guarantee that images are thoroughly scanned for vulnerabilities before deployment, scanning tools like Clair or Trivy can be directly embedded into your CI/CD pipeline.

Do not store sensitive information directly in the source code (e.g., API keys, passwords) as this increases the risk of unauthorized access and data breaches. Use secrets management tools like SOPS, AWS Secrets Manager, or Google Cloud Secret Manager to securely handle and encrypt sensitive information.

Conclusion

Regularly assessing and improving Kubernetes security measures is not just important — it's essential. By implementing the strategies we introduced above, organizations can protect their Kubernetes environments, ensuring that containerized applications are more secure and resilient against challenges. In the future, we anticipate that attackers will develop more sophisticated methods to specifically bypass Kubernetes' built-in security features. As organizations increasingly rely on Kubernetes for critical workloads, attackers will likely invest time in uncovering new vulnerabilities or weaknesses in Kubernetes' security architecture, potentially leading to breaches that are more difficult to detect and mitigate.

The path to a secure Kubernetes environment is clear, and the time to act is now. Prioritize security to safeguard your future.

This is an excerpt from DZone's 2024 Trend Report, Kubernetes in the Enterprise: Once Decade-Defining, Now Forging a Future in the SDLC.
In this interview with Julian Fischer, CEO of the cloud computing and automation company anynines GmbH, we explore the evolving landscape of cloud-native technologies with a strong focus on the roles of Kubernetes and Cloud Foundry in modern enterprise environments.

About the Interviewee

The interviewee, Julian Fischer, has extensive experience in Cloud Foundry and Kubernetes ops. Julian leads anynines in helping organizations operate applications at scale. Under his guidance, they're also pioneering advancements in managing data services across many Kubernetes clusters via the open-source Klutch project.

The Dominance of Kubernetes

Question: Kubernetes has dominated the container orchestration space in recent years. What key factors have contributed to its success?

Answer: "Kubernetes has indeed taken the lead in container orchestration. It's flexible, and this flexibility allows companies to customize their container deployment and management to fit their unique needs. But it's not just about flexibility. The ecosystem around Kubernetes is robust and ever-growing. Think tools, services, integrations – you name it. This expansive ecosystem is a major draw. Community support is another big factor. The Kubernetes community is large, active, and innovative. And let's not forget about multi-cloud capabilities. Kubernetes shines here. It enables consistent deployments across various cloud providers and on-premises environments. That's huge for companies with diverse infrastructure needs. Lastly, it's efficient. Kubernetes has some pretty advanced scheduling capabilities. This means optimal use of cluster resources."

Question: Despite Kubernetes' popularity, what challenges do organizations face when managing large-scale Kubernetes environments?

Answer: "Well, Kubernetes isn't without its challenges, especially at scale. Complexity is a big one. Ensuring consistent configs across multiple clusters? It's not for the faint of heart. Resource management becomes a real juggling act as you scale up. You're dealing with compute, storage, network – it all gets more complex. Monitoring is another headache. As your microservices and containers multiply, maintaining visibility becomes tougher. It's like trying to keep track of a thousand moving parts. Security is a constant concern too. Implementing and maintaining policies across a large Kubernetes ecosystem is a full-time job. And then there are all the updates and patches. Keeping a large Kubernetes environment up-to-date is like painting the Golden Gate Bridge. By the time you finish, it's time to start over. It's a never-ending process."

Question: Given Kubernetes' dominance, is there still a place for Cloud Foundry in the cloud-native ecosystem?

Answer: "Absolutely. Cloud Foundry still brings a lot to the table. It's got a different focus. While Kubernetes is all about flexibility, Cloud Foundry is about simplicity and operational efficiency for developers. It streamlines the whole process of deploying and scaling apps. That's valuable. Think about it this way. Cloud Foundry abstracts away a lot of the infrastructure complexity. Developers can focus on code, not on managing the underlying systems. That's powerful. Robust security features, proven track record in large enterprises – these things matter. And here's something interesting—in some large-scale scenarios, Cloud Foundry can actually be more economical. Especially when you're running lots of cloud-native apps. It's all about the right tool for the job."
The Relationship Between Cloud Foundry and Kubernetes

Question: How are the Cloud Foundry and Kubernetes communities working together to bridge these technologies?

Answer: "It's not a competition anymore. The communities are collaborating, and it's exciting to see. There are some really interesting projects in the works. Take Klutch, for example. It's an open-source tool that's bridging the gap between Cloud Foundry and Kubernetes for data services. Pretty cool stuff."

Figure 1. The open-source Klutch project enables centralized resource management for multi-cluster Kubernetes environments.

"Then there's Korifi. This project is ambitious. It's bringing the Cloud Foundry developer experience to Kubernetes. Imagine getting Cloud Foundry's simplicity with Kubernetes' power. That's the goal. These projects show a shift in thinking. It's not about choosing one or the other anymore. It's about leveraging the strengths of both platforms. That's the future of cloud-native tech."

Question: What factors should organizations consider when choosing between Kubernetes and Cloud Foundry?

Answer: "Great question. There's no one-size-fits-all answer here. First, look at your team. What are they comfortable with? What's their expertise? That matters a lot. Then, think about your applications. What do they need? Some apps are better suited for one platform over the other. Scalability is crucial too. How much do you need to grow? And how fast? Consider your control needs as well. Do you need the fine-grained control of Kubernetes? Or would you benefit more from Cloud Foundry's abstraction? Don't forget about your existing tools and workflows. Integration is key. You want a solution that plays nice with what you already have. It's about finding the right fit for your specific situation."

Question: Can you elaborate on the operational efficiency advantages that Cloud Foundry might offer in certain scenarios?

Answer: "Sure thing. Cloud Foundry can be a real efficiency booster in the right context. It's all about its opinionated approach. This might sound limiting, but in large-scale environments, it can be a blessing. Here's why – Cloud Foundry streamlines a lot of operational aspects. Deployment, scaling, management - it's all simplified. This means less operational overhead. In some cases, it can lead to significant cost savings. Especially when you're dealing with a large number of applications that fit well with Cloud Foundry's model. But here's the catch. This advantage is context-dependent. It's not a universal truth. You need to evaluate your specific use case. For some, the efficiency gains are substantial. For others, not so much. It's all about understanding your needs and environment."

Looking to the Future of Cloud-Native Technologies

Question: How do you see the future of cloud-native technologies evolving, particularly concerning Kubernetes and Cloud Foundry?

Answer: "The future is exciting. And diverse. We're moving away from the idea that there's one perfect solution for everything. Kubernetes will continue to dominate, no doubt. But Cloud Foundry isn't going anywhere. In fact, I see increased integration between the two. We're likely to see more hybrid approaches. Organizations leveraging the strengths of both platforms. Why choose when you can have both, right? The focus will be on creating seamless experiences. Imagine combining Kubernetes' flexibility with Cloud Foundry's developer-friendly abstractions. That's incredibly powerful, and what we're working towards.
Innovation will continue at a rapid pace. We'll see new tools, new integrations. The line between these technologies might even start to blur. It's an exciting time to be in this space."

Question: What advice would you give to organizations trying to navigate this complex cloud-native ecosystem?

Answer: "My advice? Stay flexible. And curious. This field is evolving rapidly. What works today might not be the best solution tomorrow. Start by really understanding your needs. Not just your current needs, but where you're headed. Don't view it as a binary choice. Kubernetes or Cloud Foundry – it doesn't have to be either/or. Consider how they can work together in your stack. Experiment. Start small. See what works for your specific use cases. Invest in your team. Train them on both technologies. The more versatile your team, the better positioned you'll be. And remember, it's okay to change course. Be prepared to evolve your strategy as the technologies and your needs change. The goal isn't to use the trendiest tech. It's to choose the right tools that solve your problems efficiently. Sometimes that's Kubernetes. Sometimes it's Cloud Foundry. Often, it's a combination of both. Stay focused on your business needs, and let that guide your technology choices."

This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
In our industry, few pairings have been as exciting and game-changing as the union of artificial intelligence (AI) and machine learning (ML) with cloud-native environments. It's a union designed for innovation, scalability, and yes, even cost efficiency. So put on your favorite Kubernetes hat and let's dive into this dynamic world where data science meets the cloud!

Before we explore the synergy between AI/ML and cloud-native technologies, let's set a few definitions:

- AI: A broad concept referring to machines mimicking human intelligence.
- ML: The process of "teaching" a machine to perform specific tasks and generate accurate output through pattern identification.
- Cloud native: A design paradigm that leverages modern cloud infrastructure to build scalable, resilient, and flexible applications – picture microservices in Docker containers orchestrated by Kubernetes and continuously deployed by CI/CD pipelines.

The Convergence of AI/ML and Cloud Native

What are some of the benefits of implementing AI and ML in cloud-native environments?

Scalability

Ever tried to manually scale an ML model as it gets bombarded with a gazillion requests? Not fun. But with cloud-native platforms, scaling becomes as easy as a Sunday afternoon stroll in the park. Kubernetes, for instance, can automatically scale pods running your AI models based on real-time metrics, which means your AI model can perform well even under duress.

Agility

In a cloud-native world, a microservices architecture means your AI/ML components can be developed, updated, and deployed independently. This modularity fosters agility, which lets you innovate and iterate rapidly, and without fear of breaking the entire system. It's like being able to swap out parts of the engine of your car while driving to update them—except much safer.

Cost Efficiency

Serverless computing platforms (think AWS Lambda, Google Cloud Functions, and Azure Functions) allow you to run AI/ML workloads only when needed. No more paying for idle compute resources. It's the cloud equivalent of turning off the lights when you leave a room—simple, smart, and cost-effective. It's also particularly advantageous for intermittent or unpredictable workloads.

Collaboration

Cloud-native environments make collaboration a breeze for data scientists, developers, and operations teams. With centralized repositories, version control, and CI/CD pipelines, everyone can work harmoniously on the same ML lifecycle. It's the tech equivalent of a well-coordinated kitchen in a highly-rated-on-Yelp restaurant.

Trending Applications of AI/ML in Cloud Native

While most of the general public is familiar with AI/ML technologies through interactions with generative AI chatbots, fewer realize the extent to which AI/ML has already enhanced their online experiences.

AI-Powered DevOps (AIOps)

By supercharging DevOps processes with AI/ML, you can automate incident detection, root cause analysis, and predictive maintenance. Additionally, integrating AI/ML with your observability tools and CI/CD pipelines enables you to improve operational efficiency and reduce service downtime.

Kubernetes + AI/ML

Kubernetes, the long-time de facto platform for container orchestration, is now also the go-to for orchestrating AI/ML workloads. Projects like Kubeflow simplify the deployment and management of machine learning pipelines on Kubernetes, which means you get end-to-end support for model training, tuning, and serving.
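To make the scalability point above concrete, here is a minimal HorizontalPodAutoscaler sketch for a hypothetical inference Deployment; the name and thresholds are illustrative:

YAML
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-inference          # hypothetical Deployment serving the model
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70%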
Edge Computing

Edge computing processes AI/ML workloads closer to where data is generated, which dramatically reduces latency. By deploying lightweight AI models at edge locations, organizations can perform real-time inference on devices such as IoT sensors, cameras, and mobile devices – even your smart fridge (because why not?).

Federated Learning

Federated learning lets organizations collaboratively train AI models without sharing raw data. It's a great solution for industries that have strict privacy and compliance regulations, such as healthcare and finance.

MLOps

MLOps integrates DevOps practices into the machine learning lifecycle. Tools like MLflow, TFX (TensorFlow Extended), and Seldon Core make continuous integration and deployment of AI models a reality. Imagine DevOps, but smarter.

Because Integration Challenges Keep Things Interesting

Of course, none of this comes without its challenges.

Complexity

Integrating AI/ML workflows with cloud-native infrastructure isn't for the faint of heart. Managing dependencies, ensuring data consistency, and orchestrating distributed training processes requires a bit more than a sprinkle of magic.

Latency and Data Transfer

For real-time AI/ML applications, latency can be a critical concern. Moving tons of data between storage and compute nodes introduces delays. Edge computing solutions can mitigate this by processing data closer to its source.

Cost Management

The cloud's pay-as-you-go model is great—until uncontrolled resource allocation starts nibbling away at your budget. Implementing resource quotas, autoscaling policies, and cost monitoring tools is your financial safety net.

AI/ML Practices That Could Help Save the Day

- Modularize! Design your AI/ML applications using the principles of microservices. Decouple data preprocessing, model training, and inference components to enable independent scaling and updates.
- Leverage managed services: Cloud providers offer AI/ML services to simplify infrastructure management and accelerate development.
- Observe your models: Integrate your AI/ML workloads with observability tools – having access to metrics about resource usage, model performance, and system health can help you proactively detect and address issues.
- Secure your data and models: Use encryption, access controls, and secure storage solutions to protect sensitive data and AI models.

In Summary

The integration of AI/ML technologies in cloud-native environments offers scalability, agility, and cost efficiency, while enhancing collaboration across teams. However, navigating this landscape comes with its own set of challenges, from managing complexity to ensuring data privacy and controlling costs. There are trends to keep an eye on, such as edge computing—a literal edge of glory for real-time processing—AIOps bringing brains to DevOps, and federated learning letting organizations share the smarts without sharing the data.

The key to harnessing these technologies lies in best practices: think modular design, robust monitoring, and a sprinkle of foresight through observability tools. The future of AI/ML in cloud-native environments isn't just about hopping on the newest tech bandwagon. It's about building systems so smart, resilient, and adaptable, you'd think they were straight out of a sci-fi movie (hopefully not Terminator). Keep your Kubernetes hat on tight, your algorithms sharp, and your cloud synced – and let's see what's next!
This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
In environments with AWS Cloud workloads, a proactive approach to vulnerability management involves shifting from traditional patching to regularly deploying updated Secure Golden Images. This approach is well-suited to a modern Continuous Integration and Continuous Delivery (CI/CD) environment, where the goal is rapid, automated deployment — and doing this with AMIs (Amazon Machine Images) ensures that every instance benefits from consistent security updates.

Creating the Golden Image

The first step to securing your EC2 environment is building a Secure Golden Image (SGI) — a pre-configured AMI that serves as the baseline for deploying secure EC2 instances. An SGI should include:

- AWS-updated kernels: Using the latest AWS-supported kernel ensures you're starting with a secure, updated OS. The latest AWS kernels also support Kernel Live Patching, which allows for updates without rebooting, minimizing downtime.
- AWS Systems Manager (SSM): Enabling SSM eliminates the need for traditional SSH access, a significant attack vector. With Session Manager, you can securely access and manage instances without SSH keys, reducing risk.
- Baseline security configurations: The image should be hardened following security best practices. This includes encryption, restrictive network access, secure IAM role configuration, and logging integration with AWS CloudTrail and AWS GuardDuty for monitoring and alerting.

Vulnerability Scanning and Image Hardening

After building your golden image, leverage tools to scan for vulnerabilities and misconfigurations. Integrating these scans into your CI/CD pipeline ensures that every new deployment based on the golden image meets your security standards.

Keeping the Golden Image Patched and Updated

One of the most important aspects of using a golden image strategy is maintaining it. In a dynamic cloud environment, vulnerabilities evolve continuously, requiring frequent updates. Here are some key steps to keep your golden images up to date:

- Release new secure golden images at a regular cadence: Releasing new Secure Golden Images (SGIs) at a regular cadence — whether monthly or quarterly — ensures consistent security updates and a reliable fallback if issues arise. Automating the process using AWS services like EC2 Image Builder helps streamline AMI creation and management, reducing manual errors (see the sketch after this list). A regular and consistent release schedule guarantees your infrastructure stays secure and up to date, aligning with best practices for vulnerability management and continuous deployment.
- Archive and version control: It's important to maintain the version history for your AMIs. This allows for easy rollback if necessary and ensures compliance during security audits by demonstrating how you manage patching across your instances.
- Continuous monitoring: While a golden image provides a secure baseline, vulnerabilities can still emerge in running applications. Use tools to monitor the health of your deployed EC2 instances and ensure compliance with security policies.
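For teams using EC2 Image Builder, the hardening and validation steps can be expressed as a component document along these lines. This is a rough sketch with illustrative step names and commands, not a complete hardening baseline:

YAML
name: sgi-baseline-hardening          # illustrative component name
description: Apply OS updates and basic hardening for the Secure Golden Image
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: UpdateOS
        action: UpdateOS              # pull in the latest OS packages and kernel
      - name: DisableSSHPasswordAuth
        action: ExecuteBash
        inputs:
          commands:
            - sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  - name: validate
    steps:
      - name: CheckSSMAgent
        action: ExecuteBash
        inputs:
          commands:
            - systemctl is-active amazon-ssm-agent   # confirm SSM is available instead of SSH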
Patching vs. Golden Image Deployment: A Thoughtful Debate

When debating whether to adopt a golden image strategy versus traditional patching, it's essential to weigh the pros and cons of both methods. Patching, while effective for quick fixes, can create inconsistencies over time, especially when patches are applied manually or across multiple servers. This can lead to configuration drift, library drift, package drift, and so on, where each server has a slightly different configuration, making it difficult to maintain a consistent security posture across your infrastructure. Manual patching also introduces the risk of missing patches or creating security gaps if updates are not applied in time.

On the other hand, golden image deployment offers consistency and uniformity. By standardizing the creation and deployment of hardened AMIs, you eliminate these drifts entirely. Every instance spun up from a golden image starts with the same secure baseline, ensuring that all EC2 instances are protected by the same set of patches and security configurations. This is particularly valuable in CI/CD environments, where automation and rapid deployment are priorities.

However, golden image deployment can take longer than traditional patching, especially in environments where uptime is critical. Rebuilding and redeploying AMIs requires careful coordination and orchestration, particularly for live production environments. Automation through tools like EC2 Image Builder and blue/green deployment strategies can help reduce downtime, but the upfront effort to automate these processes is more complex than simply applying a patch.

A balanced approach would be to deploy Secure Golden Images (SGIs) at regular intervals — such as monthly or quarterly — to maintain consistency and uniformity across your EC2 instances, preventing configuration drift. In between these regular SGI deployments, manual patching can be applied in special cases where critical vulnerabilities arise. This strategy combines the best of both worlds: regular, reliable updates through golden images, and the flexibility to address urgent issues through patching.

In summary, patching may be faster in certain emergency situations, but over time, it can lead to inconsistencies. A golden image strategy, while requiring more initial setup and automation, ensures long-term consistency and security. For organizations with cloud-native architectures and a DevOps approach, adopting a golden image strategy aligns better with modern security and CI/CD practices.
When we talk about security in cloud-native applications, broken access control remains one of the most dangerous vulnerabilities. The OWASP Top 10 lists it as the most prevalent security risk today, and for good reason: the impact of mismanaged permissions can lead to catastrophic outcomes like data breaches or ransomware attacks. For CISOs, addressing broken access control isn't just a technical challenge—it's a strategic priority that touches nearly every aspect of an organization's security posture.

As part of my job as the VP of Developer Relations at Permit.io, I have consulted with dozens of CISOs and security engineering leaders, from small garage startup founders to Fortune 100 enterprise security staff. This article brings together the perspective I gathered from these conversations, guiding you through the broken access control challenges in cloud-native applications.

Understanding the Threat

At its core, broken access control occurs when unauthorized users gain access to parts of an application they shouldn't be able to see or modify. This vulnerability can manifest in several ways: from users gaining admin privileges they shouldn't have to attackers exploiting weak session management to move laterally within a system. What makes this threat particularly dangerous in cloud-native environments is the complexity of modern application architectures. Microservices, third-party APIs, and distributed resources create a multifaceted ecosystem where data flows across various services. Each connection is a potential point of failure. CISOs must ensure that access control mechanisms are ironclad—every request to access sensitive data or perform critical operations must be carefully evaluated and tightly controlled.

The Three Pillars of Access Control

Addressing broken access control requires a comprehensive strategy built on three key pillars: authentication, permissions, and session management. Each plays a critical role in securing cloud-native applications:

- Authentication: This is the first line of defense, ensuring that users are who they claim to be. Strong authentication methods like multi-factor authentication (MFA) can drastically reduce the risk of unauthorized access.
- Permissions: Even after authentication, not all users should have equal access. Permissions dictate what authenticated users can do. In cloud-native apps, fine-grained permissions are essential to prevent privilege escalation and data leakage.
- Session Management: Proper session management ensures that once a user is authenticated and authorized, their activities are monitored, and their access remains limited to the session's scope. Poor session management can allow attackers to hijack sessions or escalate privileges.

Why Permissions Matter More Than Ever

While all three pillars are crucial, permissions are the backbone of modern access control. In a cloud-native environment, where services and resources are distributed across different infrastructures, managing permissions becomes exponentially more challenging. A one-size-fits-all approach, like assigning simple roles (e.g., Admin, User), isn't sufficient. Today's applications require a more nuanced approach to permissions management.

Fine-Grained Authorization

To prevent unauthorized access, organizations should implement fine-grained authorization models. These models allow for more precise control by evaluating multiple attributes—such as a user's role, location, or even payment method—before granting access.
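To illustrate what evaluating such attributes can look like in practice, here is a hypothetical policy snippet for the kind of SaaS scenario described next. The schema is invented purely for illustration and does not correspond to any particular product's policy format:

YAML
# Hypothetical, illustrative policy format (not a real product schema)
policies:
  - id: premium-report-export
    description: Allow exporting reports only for active premium subscribers
    resource: reports
    action: export
    conditions:
      user.role: ["admin", "analyst"]        # coarse role check
      user.subscription.tier: "premium"      # attribute synced from the billing system
      user.subscription.status: "active"     # revoked automatically if payment lapses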
This granular level of control is necessary to avoid both horizontal and vertical privilege escalation. For example, imagine a SaaS product with different pricing tiers. A user's access to features shouldn't just depend on their role (e.g., admin or regular user) but also on their subscription level, which should automatically update based on their payment status in an external payment application. Implementing fine-grained permissions ensures that only users who have paid for premium features can access them, even if they have elevated roles within the system.

The Importance of Least Privilege

A critical part of permissions management is enforcing the principle of least privilege. Simply put, users should have the minimal level of access required to perform their tasks. This principle is especially important in cloud-native applications, where microservices may expose sensitive data across various parts of the system. For example, a developer working on one service shouldn't have full access to every service in the environment. Limiting access in this way reduces the risk of an attacker exploiting one weak point to gain broader access. It also prevents insider threats, where an internal user might misuse their privileges.

Managing Sessions to Contain Threats

While permissions control access to features and data, session management ensures that users' activities are properly constrained during their session. Strong session management practices include limiting session duration, detecting unusual behavior, and ensuring that session tokens are tightly secured. Session hijacking, where attackers steal a user's session token and take over their session, is a common attack vector in cloud-native environments. Implementing session timeouts, MFA for high-risk actions, and token revocation mechanisms can help mitigate these risks. Effective session management also includes ensuring that users cannot escalate their privileges within the session. For example, a user who starts a session with standard permissions shouldn't be able to gain admin-level privileges without re-authenticating.

The CISO's Role in Securing Access Control

For a CISO, the challenge of preventing broken access control goes beyond simply setting policies. It involves fostering collaboration between security teams, developers, and product managers. This ensures that access control isn't just a checkbox in compliance reports but a living, adaptive process that scales with the organization's needs.

A Strategic Approach to Collaboration

CISOs must ensure that developers have the resources and tools they need to build secure applications without becoming bottlenecks in the process. Traditional access control systems often put too much burden on developers, requiring them to manually write permission logic into the code. This not only slows down development, but also introduces the risk of human error. Instead, CISOs should promote a culture of collaboration where security, development, and product teams can work together on defining and managing access control policies. By implementing automated and scalable tools, CISOs can empower teams to enforce security policies effectively while maintaining agility in the development process.

Authorization-as-a-Service

One of the most effective ways to manage permissions in a scalable and secure manner is through authorization-as-a-service solutions.
These platforms can provide a centralized, no-code interface for defining and managing authorization policies, making it easier for non-technical stakeholders to be involved in the process. By leveraging these tools, organizations can reduce their reliance on developers to manually manage permissions. This not only speeds up the process, but also ensures that permissions are consistently enforced across all services. With real-time policy updates, automated monitoring, and auditability features, authorization-as-a-service platforms allow organizations to stay agile while maintaining strong access control measures.

The flexibility of these solutions also allows for easier scaling as the application and user base grow, ensuring that permission models can evolve without requiring significant re-engineering. Additionally, having a no-code UI allows for rapid adjustments to access policies in response to changing business needs or security requirements, without creating unnecessary dependencies on development teams.

Conclusion

Preventing broken access control vulnerabilities in cloud-native applications is a critical priority for CISOs. It requires a strategic focus on fine-grained permissions, the principle of least privilege, and robust session management. Collaboration across teams and the adoption of modern tools like authorization-as-a-service platforms can greatly simplify this complex challenge, enabling organizations to secure their environments without sacrificing speed or flexibility. By addressing these areas, CISOs can help ensure that their organizations remain resilient to access control vulnerabilities while empowering their teams to manage permissions effectively and securely.

This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
As organizations put artificial intelligence and machine learning (AI/ML) workloads into continuous development and production deployment, they need to have the same levels of manageability, speed, and accountability as regular software code. The popular way to deploy these workloads is Kubernetes, and the Kubeflow and KServe projects enable them there. Recent innovations like the Model Registry, ModelCars feature, and TrustyAI integrations in this ecosystem are delivering these improvements for users who rely on AI/ML. These, and other improvements, have made open source AI/ML ready for use in production. More improvements are coming in the future.

Better Model Management

AI/ML analyzes data and produces output using machine learning "models," which consist of code, data, and tuning information. In 2023, the Kubeflow community identified a key requirement to have better ways of distributing tuned models across large Kubernetes clusters. Engineers working on Red Hat's OpenShift AI agreed and started work on a new Kubeflow component, Model Registry.

"The Model Registry provides a central catalog for developers to index and manage models, their versions, and related artifacts metadata," explained Matteo Mortari, Principal Software Engineer at Red Hat and Kubeflow contributor. "It fills a gap between model experimentation and production activities, providing a central interface for all users to effectively collaborate on ML models."

The AI/ML model development journey, from initial experimentation to deployment in production, requires coordination between data scientists, operations staff, and users. Before Model Registry, this involved coordinating information scattered across many places in the organization – even email! With Model Registry, system owners can implement efficient machine learning operations (MLOps), letting them deploy directly from a dedicated component. It's an essential tool for researchers looking to run many instances of a model across large Kubernetes clusters. The project is currently in Alpha and was included in the recent Kubeflow 1.9 release.

Faster Model Serving

Kubeflow makes use of the KServe project to "serve," or run, models on each server in the Kubernetes cluster. Users care a great deal about latency and overhead when serving models: they want answers as quickly as possible, and there's never enough GPU power. Many organizations have service level objectives (SLOs) for response times, particularly in regulated industries.

"One of the challenges that we faced when we first tried out LLMs on Kubernetes was to avoid unnecessary data movements as much as possible," said Roland Huss, Senior Principal Software Engineer at Red Hat and KServe and Knative contributor. "Copying over a multi-gigabyte model from an external storage can take several minutes, which adds to the already lengthy startup of an inference service. Kubernetes itself knows how to deal with large amounts of data when it comes to container images, so why not piggyback on those matured techniques?"

This thinking led to the development of Modelcars, a passive "sidecar" container holding the model data for KServe. That way, a model needs to be present only once on a cluster node, regardless of how many replicas are accessing it. Container image handling is a very well explored area in Kubernetes, with sophisticated caching and performance optimization. The result has been faster startup times for serving models, and greatly reduced disk space requirements for cluster nodes.
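For readers who want to see the shape of this, an InferenceService that pulls its model from an OCI image might look roughly like the sketch below. The name, model format, and registry path are illustrative, and it assumes the cluster's KServe installation has the Modelcars feature enabled and a serving runtime that matches the declared format:

YAML
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model            # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn              # hypothetical format; use whatever runtime your cluster provides
      storageUri: oci://registry.example.com/models/sentiment:1.0   # model packaged as an OCI image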
Huss also pointed out that Kubernetes 1.31 recently introduced an image volume type that allows the direct mount of OCI images. When that feature is generally available, which may take a year, it can replace Modelcars for even better performance. Right now, Modelcars is available in KServe v0.12 and above.

Safer Model Usage

AI/ML systems are complex, and it can be difficult to figure out how they arrive at their output. Yet it's important to ensure that unexpected bias or logic errors don't create misleading results. TrustyAI is a new open source project which aims to bring "responsible AI" to all stages of the AI/ML development lifecycle.

"The TrustyAI community strongly believes that democratizing the design and research of responsible AI tooling via an open source model is incredibly important in ensuring that those affected by AI decisions – nowadays, basically everyone – have a say in what it means to be responsible with your AI," stated Rui Vieira, Senior Software Engineer at Red Hat and TrustyAI contributor.

The project uses an approach where a core of techniques and algorithms, mostly focused on AI explainability, metrics, and guardrails, can be integrated at different stages of the lifecycle. For example, a Python TrustyAI library can be used through Jupyter notebooks during the model experimentation stage to identify biases. The same functionality can also be used for continuous bias detection of production models by incorporating the tool as a pipeline step before model building or deployment. TrustyAI is in its second year of development, and KServe supports it.

Future AI/ML Innovations

With these features and tools, and others, development and deployment of AI/ML models is becoming more consistent, reliable, efficient, and verifiable. As with other generations of software, this allows organizations to adopt and customize their own open source AI/ML stacks that would have been too difficult or risky before.

The Kubeflow and KServe community is working hard on the next generation of improvements, usually in the Kubernetes Serving Working Group (WG Serving). This includes the LLM Serving Catalog, to provide working examples for popular model servers and explore recommended configurations and patterns for inference workloads. WG Serving is also exploring the LLM Instance Gateway to more efficiently serve distinct LLM use cases on shared model servers running the same foundation model, allowing requests to be scheduled across pools of model servers.

The KServe project is working on features to support very large models. One is multi-host/multi-node support for models which are too big to run on a single node/host. Support for "speculative decoding," another in-development feature, speeds up large model execution and improves inter-token latency in memory-bound LLM inference. The project is also developing "LoRA adapter" support, which permits serving already trained models with in-flight modifications via adapters to support distinct use cases instead of re-training each of them from scratch before serving. The KServe community is also working on an Open Inference Protocol extension to GenAI Task APIs that provides community-maintained protocols to support various GenAI task-specific APIs. The community is also working closely with WG Serving to integrate with efforts like the LLM Instance Gateway and to provide KServe examples in the Serving Catalog. These and other features are in the KServe Roadmap.
The author will be delivering a keynote about some of these innovations at KubeCon's Cloud Native AI Day in Salt Lake City. Thanks to all of the ingenuity and effort being poured into open source AI/ML, users will find the experience of building, running, and training models to keep getting more manageable and performant for many years to come.

This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.