Agile, Waterfall, and Lean are just a few of the project-centric methodologies for software development that you'll find in this Zone. Whether your team is focused on goals like achieving greater speed, having well-defined project scopes, or using fewer resources, the approach you adopt will offer clear guidelines to help structure your team's work. In this Zone, you'll find resources on user stories, implementation examples, and more to help you decide which methodology is the best fit and apply it in your development practices.
If you’re an experienced developer or a seasoned computer user, you might recall a time when having 128MB of RAM was considered a luxury. For those who don’t know, are too young, or started using computers more recently, let me put this into perspective: the original Max Payne video game required just 128MB of RAM, as did GTA: Vice City. This meant that a computer needed only this amount of memory to run the game alongside all other programs. But that’s not the point. The point is that these games were, and still are, considered significant milestones in gaming. Yes, their graphics are now dated. Yes, they don’t offer the same level of immersion or the rich gameplay mechanics found in modern titles. However, from an algorithmic perspective, they are not fundamentally different from modern games like Assassin’s Creed or Red Dead Redemption. The latter, by the way, requires 12GB of RAM and 120GB of storage — roughly 100 times the resources Max Payne or GTA: Vice City needed. And that's a lot.

This isn’t just about games; it applies to nearly all software we use daily. It’s no longer surprising to see even basic applications, like a messaging app, consuming 1GB of RAM. But why is this happening? Why would a messenger require ten times more memory than an open-world game?

The Problem: Software Crisis

The problem isn’t new — it’s been around for more than 50 years. In fact, there’s even a dedicated Wikipedia article about it: "Software Crisis." As far back as the 1960s, people were already struggling with the gap between hardware capabilities and our ability to produce effective software. It’s a significant chapter in the history of software development, yet it remains surprisingly under-discussed. Without getting into too much detail, two major NATO Software Engineering Conferences were held in 1968 and 1969 to address these issues. These conferences led to the formalization of terms like "software engineering" and "software acceptance testing" that have since become commonplace. Acceptance testing, in particular, had already been used extensively in aerospace and defense systems before gaining broader recognition. Just imagine: this issue was already highlighted before development of the UNIX operating system had even begun!

Despite the billions and billions of lines of code written since then, we still often see people disgruntled with certain approaches, programming languages, libraries, etc. Statements like "JavaScript sucks!", "You don't need Kubernetes!", and "Agile is a waste of time!" are often heard from frustrated developers. These claims are usually lavishly accompanied by advice and personal preferences, such as "You should always use X instead of Y" or "Stay away from X, or you'll regret it." Another popular recommendation is to apply certain design patterns and principles, with the promise that following them will make your code cleaner, more maintainable, more readable for others, and lead to more stable releases.

I agree that there is some truth to these claims. Design patterns, in particular, have genuinely improved the "language" developers use to talk to each other. For instance, mentioning "adapter" to another developer immediately conveys a common understanding of its purpose, even across different backgrounds. Without this shared vocabulary, explaining such concepts would require much more effort.
But why, despite having an abundance of design patterns, concepts, numerous architectures, and best practices, are we still struggling to deliver high-quality software on time?

Thoughts About "Where Does Bad Code Come From?"

In his video "Where Does Bad Code Come From?", Casey Muratori argues that there is still a lot of software left to build precisely because so much of today's software is low quality. It seems that while hardware development has advanced significantly, our software practices have not kept pace. Over the past 30 years, we’ve seen the rise of countless new programming languages, sophisticated frameworks, engines, libraries, and third-party services. Yet, for an individual developer, programming is still as challenging as it was back then. A developer needs to read the existing code (often a substantial amount), modify it or write a new piece of software, execute it, debug it, and, finally, ship it. While we have improved execution (with modern programming languages) and the debugging experience, and there are certainly plenty of tools for shipping code (thanks to CI pipelines and DevOps), reading and writing code remain the major bottlenecks. The human brain can only process so much information, and this limitation is a significant obstacle in software development.

Returning to Casey's video, he compares the programming process to navigation. You start at a known point with a defined final destination, but the journey itself is uncertain. You have to go through the entire process to reach the final point. You don't necessarily know what the result will look like in the end, but you will know once you get there, so the quality of the result becomes your reference point. The process itself is complex and usually full of unknowns. We see real-life evidence of this all the time. Think about how often you read news of a company working on certain software, only to decide midway to rewrite everything from scratch in a different language. This highlights how, during development, teams can get so lost and overwhelmed by challenges that even throwing all the work out of the window and going back to the starting point seems like a good option.

Conclusion

I agree with Casey that we are still in the early stages of understanding how to produce high-quality software in a controllable environment. And the key to achieving this is to focus on our abilities as humans; specifically, how we read and write software. Our main focus should be on reducing the cognitive load of programming. That means producing software with a fundamentally different approach than we use today. Instead of reading and writing pieces of text in an editor, we should aim for a higher level of abstraction. Instead of dealing with raw text and attaching pieces of code like stapling sheets of paper, we need an environment that lets us modify software in a predictable, controllable way aligned with the developer's intent. At the same time, I don't think we will abandon editing raw text entirely; rather, we should be able to move between different levels of abstraction with ease. This is similar to how we moved from punched cards to assembly languages and then to high-level programming languages. It is still possible to descend to the level of machine code and write software there.
That would be a waste of time and energy in the majority of cases, but what matters is the ability to move to a lower level of abstraction when needed. In certain cases that ability is invaluable, and in the future it will be a core skill for every developer.
Can you rely on pure Scrum to transform your organization and deliver value? Not always. While Scrum excels in simplicity and flexibility, applying it “out of the box” often falls short in corporate contexts due to limitations in product discovery, scaling, and portfolio management. This article explores the conditions under which pure Scrum thrives, the organizational DNA required to support it, and practical scenarios where it works best — along with a candid look at where it struggles. Discover whether pure Scrum is a realistic approach for your team and how thoughtful adaptation can unlock its true potential.

Pure Scrum Constraints

“Pure Scrum,” as described in the Scrum Guide, is an idiosyncratic framework that helps create customer value in a complex environment. However, five main issues are challenging its general corporate application:

1. Pure Scrum focuses on delivery: How can we avoid running in the wrong direction by building things that do not solve our customers’ problems?
2. Pure Scrum ignores product discovery in particular and product management in general. If you think of the popular Double Diamond model, Scrum covers only the right-hand diamond: delivery, not discovery.
3. Pure Scrum is designed around one team supporting one product or service.
4. Pure Scrum does not address portfolio management. It is not designed to align and manage multiple product initiatives or projects to achieve strategic business objectives.
5. Pure Scrum is based on far-reaching team autonomy: The Product Owner decides what to build, the Developers decide how to build it, and the Scrum team self-manages.

While constant feedback loops, from Product Backlog refinement to the Daily Scrum to the Sprint Review to the Retrospective, help with the delivery and risk-mitigation focus, the lack of “product genes” is more challenging. The idea that the Product Owner knows what is worth building is unconventional. Consequently, many practitioners, particularly at the management level, are skeptical about this level of faith. As a result, most Product Owners are told what to do: requirements, deadlines, feature factories — you name it. Also, having just one Scrum team is rare unless you’re working for a startup in its infancy. Most of the time, multiple teams develop products. Scrum scaling frameworks like LeSS or Nexus try to remedy the issue, often with limited success. Closely related is pure Scrum’s lack of any alignment layer or process; that absence might also be where SAFe has its most outstanding “achievement” in corporate settings, if you like to use that term in conjunction with it. Finally, pure Scrum’s management or leadership approach, focused on autonomy and agency at the team level, does not reflect typical organizational structures, which in many cases still resemble the practices of the industrial paradigm.

The question is obvious: Under what circumstances can pure Scrum, or Scrum out of the box, thrive?

The Organizational Ecosystem of Pure Scrum

Considering the previously identified constraints, we can say that pure Scrum isn’t just a framework — it’s an organizational philosophy that thrives only in specific cultural environments. Think of it like a delicate plant that requires the right conditions to flourish. In these rare environments, teams operate with a trust and openness that transforms Scrum from a set of practices into a living, breathing approach to creating value.
The Ideal Organizational DNA

The most fertile ground — to stick with the plant metaphor — for pure Scrum exists in organizations characterized by a radical commitment to collaboration and continuous learning. These are typically younger, technology-driven companies where innovation isn’t just encouraged — it’s expected: Software product companies, digital service creators, and cutting-edge research and development teams represent the sweet spot. What sets these organizations apart is their ability to embrace uncertainty. Unlike traditional businesses obsessed with predictability, these companies understand that true innovation requires comfort with controlled experimentation. Their leadership doesn’t just tolerate failure; they see it as a crucial learning mechanism.

Size and Structure Matter

Pure Scrum finds its most natural home in smaller organizations — typically those under 250 employees. These companies possess an agility that allows rapid communication, minimal bureaucratic friction, and the ability to pivot quickly. The organizational structure is typically flat, with decision-making distributed across teams rather than concentrated in hierarchical management layers.

Cultural Non-Negotiables

For pure Scrum to truly work, an organization must cultivate:

- Psychological safety, where team members can speak up without fear,
- A genuine commitment to empirical process control,
- Leadership that understands and actively supports Agile principles,
- Funding models that support iterative, incremental delivery,
- A cultural tolerance for controlled failure and rapid learning — a failure culture.

The Counterpoint: Where Scrum Struggles

By contrast, pure Scrum suffocates in environments like heavily regulated industries, traditional manufacturing firms stuck in the industrial paradigm of the 20th century, and bureaucratic government agencies. These organizations are typically characterized by:

- Strict processes focused on resilience,
- Top-down decision-making, often influenced by politics,
- Resistance to transparency,
- Punishment-oriented cultures that discourage experimentation.

Examples Where Pure Scrum May Work

Pure Scrum is most applicable in organizations where the complexity of the problem space aligns with Scrum’s emphasis on iterative development and rapid feedback loops, and where the organizational context does not introduce constraints that require heavy customization. Here are practical scenarios and industries where pure Scrum can work well:

- Single-Product Focus in Early Scaling Organizations: Pure Scrum thrives in organizations that have grown beyond the startup phase but are not yet burdened by large-scale portfolio management. For example, a SaaS company with one main product and a dedicated team can effectively use Scrum to focus on continuous delivery while fostering alignment through the framework's inherent transparency.
- Internal Development Teams in Larger Organizations: Departments or units within larger organizations that operate with clear boundaries and minimal dependency on other teams are also well-suited for pure Scrum. For instance, an innovation hub within a legacy organization experimenting with AI-powered tools can avoid the misalignment issues that often plague scaled Scrum setups.
- New Product Lines in Established Enterprises: When a larger enterprise launches a new product line with a dedicated, self-contained team, pure Scrum can provide the structure needed to iterate quickly and get airborne to start learning in the marketplace.
For example, an e-commerce giant rolling out a subscription-based feature can use pure Scrum to ship incremental changes while keeping the focus on customer feedback and delivery speed.
- Teams With Minimal External Dependencies: Pure Scrum works best where teams control their destiny — such as product teams that own the entire development pipeline from ideation to deployment, covering both the problem and the solution space. For instance, a team building a customer-facing app with its own backend can succeed with pure Scrum, as external delays and cross-team coordination are minimized.
- Organizations Transitioning From Waterfall to Agile: Pure Scrum is an excellent entry point for organizations transitioning from traditional waterfall methodologies to Agile. By focusing clearly on Scrum’s foundational principles, such as delivering shippable Increments and prioritizing transparency, these vanguard teams can build a strong, agile culture before introducing complexities like scaling or hybrid approaches.

The common thread in these examples is autonomy and clarity of purpose. Pure Scrum struggles when faced with dependencies, misaligned incentives, or competing priorities, but it excels when teams are empowered to self-manage, focus on a single goal, and deliver customer-centric value in iterative cycles.

Conclusion

At its core, pure Scrum is less a project management framework and more a reflection of an organization’s fundamental approach to creating value. It requires a profound shift from seeing work as a series of prescribed steps to viewing it as a continuous journey of discovery and adaptation. The most successful implementations, therefore, aren’t about perfectly following a set of rules but about embracing a mindset of continuous improvement, radical transparency, and genuine collaboration. Ultimately, applying pure Scrum may lead to identifying an organization’s idiosyncratic way of creating value and thus abandoning the original framework in the long run, which is fine: We are not paid to practice Scrum but to solve our customers’ problems within the given constraints while contributing to the organization’s sustainability. In all other cases, you will struggle to apply Scrum out of the box in a corporate context; the five constraints sketched above will take their toll and require much “engineering” to utilize Scrum's advantages while adapting to organizational requirements. The good news is that an undogmatic, skilled approach to adapting Scrum will avoid creating another botched version of the “corporate Scrum” we have all come to despise.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Observability and Performance: The Precipice of Building Highly Performant Software Systems.

In a world where organizations are considering migrating to the cloud or have already migrated their workloads to it, ensuring that all critical workloads run seamlessly is a complicated task. As organizations scale their infrastructure, maintaining system uptime, performance, and resilience becomes increasingly challenging. Observability plays a crucial role in monitoring, collecting, and analyzing system data to gain insights into the health and performance of services. However, observability is not just about uptime and availability; it also intersects with security. Managing site reliability involves addressing security concerns such as data breaches, unauthorized access, and misconfigurations, all of which can lead to system downtime or compromise. Observability often comes with a powerful toolset that allows site reliability engineering (SRE) and security teams to collaborate, detect potential threats in real time, and ensure that both performance and security objectives are met. This article reviews the biggest security challenges in managing site reliability, examines how observability can help mitigate these risks, and explores critical areas such as incident response and how observability can be unified with security practices to build more resilient, secure systems.

The Role of Observability in Security

Observability is a critical instrument that helps both security and SRE teams by providing real-time insights into system behavior.

Unified Telemetry for Proactive Threat Detection

Observability unifies telemetry data — logs, traces, and metrics — into a centralized system, providing comprehensive visibility across the entire infrastructure. This convergence of data is essential for both site reliability and security. By monitoring this unified telemetry, teams can proactively detect anomalies that may indicate potential threats, such as system failures, misconfigurations, or security breaches. SRE teams may use this data to identify issues that could affect system availability, while security teams can use the same data to uncover patterns that suggest a cyberattack. For example, abnormal spikes in CPU usage may indicate a denial-of-service attack, and unexpected traffic from unknown IPs could be a sign of unauthorized access attempts.

Incident Detection and Root Cause Analysis

Effective incident detection and root cause analysis are critical for resolving both security breaches and performance issues. Observability empowers SRE and cybersecurity teams with the data needed to detect, analyze, and respond to a wide range of incidents. Logs provide detailed records of actions leading up to an incident, traces illustrate how transactions flow through the system, and metrics spotlight unusual patterns that may indicate anomalies. Observability integrated with automated systems enables faster detection of and response to diverse cybersecurity incidents:

- Data exfiltration: Observability detects unusual data access patterns and spikes in outbound traffic, limiting data loss and regulatory risks.
- Insider threats: Continuous monitoring identifies suspicious access patterns and privilege escalations, allowing swift mitigation of insider risks.
- Malware infiltration:
Anomalies in resource usage or unauthorized code execution indicate potential malware, enabling quick containment and limiting system impact.
- Lateral movement: Unexpected cross-system access reveals attacker pathways, helping contain threats before they reach critical systems.

Automated observability shortens detection and response times, minimizing downtime and strengthening system security and performance.

Monitoring Configuration and Access Changes

One of the critical benefits of observability is its ability to monitor configuration changes and user access in real time. Configuration drift — when system configurations deviate from their intended state — can lead to vulnerabilities that expose the system to security risks or reliability issues. Observability platforms track these changes and alert teams when unauthorized or suspicious modifications are detected, enabling rapid responses before any damage is done.
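To make the idea concrete, here is a minimal sketch of drift detection, assuming a declared desired state and a way to read the live configuration; the function names and config keys are hypothetical, not the API of any particular observability platform:

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of a configuration, independent of key order."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def detect_drift(desired: dict, live: dict) -> list[str]:
    """Return the keys whose live values deviate from the desired state."""
    drifted = [k for k in desired if live.get(k) != desired[k]]
    # Also flag keys that appeared in the live config but were never declared.
    drifted += [k for k in live if k not in desired]
    return drifted

# Hypothetical example: compare a declared baseline with what is running.
desired = {"tls": "required", "admin_port_open": False, "log_level": "info"}
live = {"tls": "required", "admin_port_open": True, "log_level": "debug"}

if fingerprint(desired) != fingerprint(live):
    for key in detect_drift(desired, live):
        # In a real pipeline this would raise an alert, not just print.
        print(f"ALERT: configuration drift detected in '{key}'")
```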
How Observability Can Be Unified With Security

The integration of observability with security is essential for ensuring both the reliability and safety of cloud environments. By embedding security directly into observability pipelines and fostering collaboration between SRE and security teams, organizations can more effectively detect, investigate, and respond to potential threats.

Security-First Observability

Embedding security principles into observability pipelines is a key strategy for uniting observability with security. Security-first observability ensures that the data generated from logs, metrics, and traces is encrypted and accessible only to authorized personnel, using access control mechanisms such as role-based access control.

Figure 1. Observability data encrypted in transit and at rest

Additionally, security teams can leverage SRE-generated telemetry to detect vulnerabilities or attack patterns in real time. By analyzing data streams that contain information on system performance, resource usage, and user behavior, security teams can pinpoint anomalies indicative of potential threats, such as brute-force login attempts or distributed denial-of-service (DDoS) attacks, all while maintaining system reliability.

SRE and Security Collaboration

Collaboration between SRE and security teams is essential for creating a unified approach to observability. One of the best ways to foster this collaboration is by developing joint observability dashboards that combine performance metrics with security alerts. These dashboards provide a holistic view of both system health and security status, allowing teams to identify anomalies related to both performance degradation and security breaches simultaneously. Another key collaboration point is integrating observability tools with security information and event management (SIEM) systems. This integration enables the correlation of security incidents with reliability events, such as service outages or configuration changes. For instance, if an unauthorized configuration change leads to an outage, both security and SRE teams can trace the root cause through the combined observability and SIEM data, enhancing incident response effectiveness.

Incident Response Synergy

Unified observability also strengthens incident response capabilities, allowing for quicker detection of and faster recovery from security incidents. Observability data, such as logs, traces, and metrics, provides real-time insights that are crucial for detecting and understanding security breaches. When suspicious activities (e.g., unauthorized access, unusual traffic patterns) are detected, observability data can help security teams isolate the affected systems or services with precision.

Figure 2. Automating security response based on alerts generated

Moreover, automating incident response workflows based on observability telemetry can significantly reduce response times. For instance, if an intrusion is detected in one part of the system, automated actions such as isolating the compromised components or locking down user accounts can be triggered immediately, minimizing the potential damage. By integrating observability data into security response systems, organizations can ensure that their response is both swift and efficient.

Penetration Testing and Threat Modeling

Observability also strengthens proactive security measures like penetration testing and threat modeling. Penetration testing simulates real-world attacks, and observability tools provide a detailed view of how those attacks affect system behavior. Logs and traces generated during these tests help security teams understand the attack path and identify vulnerabilities. Threat modeling anticipates potential attack vectors by analyzing system architecture. Observability ensures that these predicted risks are continuously monitored in real time. For example, if a threat model identifies potential vulnerabilities in APIs, observability tools can track API traffic and detect any unauthorized access attempts or suspicious behavior. By unifying observability with penetration testing and threat modeling, organizations can detect vulnerabilities early, improve system resilience, and strengthen their defenses against potential attacks.

Mitigating Common Threats in Site Reliability With Observability

Observability is essential for detecting and mitigating threats that can impact site reliability. By providing real-time insights into system performance and user behavior, observability enables proactive responses to potential risks. Table 1 reviews how observability helps address common threats:

Table 1. Common threats and mitigation strategies

Threat: Preventing service outages from cyberattacks
Mitigation strategy: Use real-time observability data to identify and mitigate DDoS attacks before they impact service availability; monitor performance metrics continuously to detect and prevent service-level agreement (SLA) violations.

Threat: Preventing data breaches
Mitigation strategy: Continuously monitor for signs of data exfiltration or compromise within the telemetry stream; use observability to detect exfiltration attempts early, with a clear difference in detection capabilities between environments with and without observability.

Threat: Handling insider threats
Mitigation strategy: Leverage system-level observability data to detect anomalous actions by authorized users that indicate potential insider threats; use observability data for forensic analysis and audits after an insider attack to trace user activities and system changes.

Threat: Automation for incident resolution
Mitigation strategy: Implement automated alerting and self-healing processes that trigger based on observability insights to ensure rapid incident resolution and maintain uptime.
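As a concrete illustration of the last row, here is a minimal sketch of an alert-driven self-healing loop. It assumes a generic alert feed and responder actions; the names (`isolate_host`, `lock_account`, the alert fields) are hypothetical placeholders, not the API of any particular SIEM or SOAR product:

```python
# Hypothetical response actions; in practice these would call your
# orchestration or cloud provider APIs.
def isolate_host(host: str) -> None:
    print(f"isolating host {host} from the network")

def lock_account(user: str) -> None:
    print(f"locking user account {user}")

# Map alert types to automated responses agreed on by SRE and security.
PLAYBOOK = {
    "intrusion_detected": lambda alert: isolate_host(alert["host"]),
    "credential_stuffing": lambda alert: lock_account(alert["user"]),
}

def handle(alert: dict) -> None:
    """Dispatch one observability alert to its automated response."""
    action = PLAYBOOK.get(alert["type"])
    if action:
        action(alert)  # respond within seconds, before a human is paged
    else:
        print(f"no automation for {alert['type']}; escalating to on-call")

# Example: alerts as they might arrive from an observability pipeline.
for alert in [{"type": "intrusion_detected", "host": "web-42"},
              {"type": "unusual_latency", "service": "checkout"}]:
    handle(alert)
```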
Building a Secure SRE Pipeline With Observability

Integrating observability into SRE and security workflows creates a robust pipeline that enhances threat detection and response. This section outlines key components for building an effective and secure SRE pipeline.

End-to-End Integration

To build a secure SRE pipeline, it is essential to seamlessly integrate observability tools with existing security infrastructure, such as SIEM, security orchestration, automation, and response (SOAR), and extended detection and response (XDR) platforms. This integration allows for comprehensive monitoring of system performance alongside security events.

Figure 3. Security and observability platform integration with automated response

By creating a unified dashboard, teams can gain visibility into both reliability metrics and security alerts in one place. This holistic view enables faster detection of issues, improves incident response times, and fosters collaboration between SRE and security teams.

Proactive Monitoring and Auto-Remediation

Leveraging artificial intelligence (AI) and machine learning (ML) within observability systems allows for the analysis of historical data to predict potential security or reliability issues before they escalate. For example, by learning from historical data, AI and ML can identify patterns and anomalies in system behavior. Additionally, automated remediation processes can be triggered when specific thresholds are met, allowing for quick resolution without manual intervention.

Custom Security and SRE Alerts

A secure SRE pipeline requires tailored alerting systems that combine security and SRE data. By customizing alerts to focus on meaningful insights, teams can ensure they receive relevant notifications that prioritize critical issues. For instance, alerts can notify SRE teams of security misconfigurations that could impact system performance, or notify security teams of performance issues that could indicate a potential security incident. This synergy ensures that both teams are aligned and can respond to incidents swiftly, maintaining a balance between operational reliability and security.

Conclusion

As organizations and their environments grow in complexity, integrating observability with security is crucial for effective site reliability management. Observability provides the real-time insights needed to detect threats, prevent incidents, and maintain system resilience. By aligning SRE and security efforts, organizations can proactively address vulnerabilities, minimize downtime, and respond swiftly to breaches. Unified observability not only enhances uptime but also strengthens security, making it a key component in building reliable, secure systems. In an era when both performance and security are critical, this integrated approach is essential for success.

This is an excerpt from DZone's 2024 Trend Report, Observability and Performance: The Precipice of Building Highly Performant Software Systems.
I visited a small local grocery store which happens to be in a touristy part of my neighborhood. If you’ve ever traveled abroad, then you’ve probably visited a store like that to stock up on bottled water without paying for the overpriced hotel equivalent. This was one of those stores. To my misfortune, my visit happened to coincide with a group of tourists arriving all at once to buy beverages and warm up (it’s winter!). It just so happens that selecting beverages is often much faster than buying fruit — the reason for my visit. So after I had selected some delicious apples and grapes, I ended up waiting in line behind 10 people. And there was a single cashier to serve us all. The tourists didn’t seem to mind the wait (they were all chatting in line), but I sure wish the store had more cashiers so I could get on with my day faster.

What Does This Have to Do With System Performance?

You’ve probably experienced a similar situation yourself and have your own tale to tell. It happens so frequently that we sometimes forget how applicable these situations are to other domains, including distributed systems. Sometimes when you evaluate a new solution, the results don’t meet your expectations. Why is latency high? Why is the throughput so low? Those are two of the top questions that pop up every now and then. Many times, the challenges can be resolved by optimizing your performance testing approach and by better maximizing your solution’s potential. As you’ll see, improving the performance of a distributed system is a lot like ensuring speedy checkouts in a grocery store. This blog covers 7 performance-focused steps for you to follow as you evaluate distributed systems performance.

Step 1: Measure Time

With groceries, the first step towards any serious performance optimization is to precisely measure how long it takes for a single cashier to scan a barcode. Some goods, like bulk fruits that require weighing, may take longer to scan than products in industrial packaging. A common misconception is that processing happens in parallel. It does not (note: we’re not referring to capabilities like SIMD and pipelining here). Cashiers do not serve more than a single person at a time, nor do they scan your products’ barcodes simultaneously. Likewise, a single CPU in a system will process one work unit at a time, no matter how many requests are sent to it. In a distributed system, consider all the different work units you have and execute them in an isolated way against a single shard. Execute your different items with single-threaded execution and measure how many requests per second the system can process. Eventually, you may learn that different requests get processed at different rates. For example, if the system is able to process a thousand 1 KB requests/sec, the average latency is 1 ms. Similarly, if throughput is 500 requests/sec for a larger payload size, then the average latency is 2 ms.
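A minimal sketch of this kind of single-threaded measurement might look as follows; `send_request` is a hypothetical stand-in for one synchronous call against a single shard of the system under test:

```python
import time

def send_request(payload: bytes) -> None:
    """Hypothetical placeholder for one synchronous request to a single shard."""
    time.sleep(0.001)  # pretend the shard takes ~1 ms per 1 KB request

def measure(payload: bytes, duration_s: float = 5.0) -> None:
    # One request in flight at a time: no parallelism, no batching.
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        send_request(payload)
        count += 1
    elapsed = time.perf_counter() - start
    throughput = count / elapsed
    # With a single in-flight request, average latency = 1 / throughput.
    print(f"{len(payload)} B payload: {throughput:,.0f} req/s, "
          f"avg latency {1000 / throughput:.2f} ms")

measure(b"x" * 1024)       # e.g., ~1,000 req/s -> ~1 ms average latency
measure(b"x" * 64 * 1024)  # larger payloads typically process more slowly
```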
Step 2: Find the Saturation Point

A cashier is never scanning barcodes all the time. Sometimes, they will be idle waiting for customers to place their items onto the checkout counter, or waiting for payment to complete. This introduces delays you’ll typically want to avoid. Likewise, every request your client submits against a system incurs, for example, network round trip time — and you will always pay a penalty under low concurrency. To eliminate this idleness and further increase throughput, simply increase the concurrency. Do it in small increments until you observe that the throughput saturates and the latency starts to grow. Once you reach that point, congratulations! You have effectively reached the system’s limits. In other words, unless you manage to get your work items processed faster (for example, by reducing the payload size) or tune the system to work more efficiently with your workload, you won’t achieve gains past that point. You definitely don’t want to find yourself in a situation where you are constantly pushing the system against its limits, though. Once you reach the saturation area, fall back to lower concurrency numbers to account for growth and unpredictability.

Step 3: Add More Workers

If you live in a busy area, grocery store demand might be beyond what a single cashier can sustain. Even if the store happened to hire the fastest cashier in the world, they would still be busy as demand/concurrency increases. Once the saturation point is reached, it is time to hire more workers. In the distributed systems case, this means adding more shards to the system to scale throughput under the latency you’ve previously measured. This leads us to the following formula:

Number of Workers = Target Throughput / Single Worker Limit

You already discovered the performance limits of a single worker in the previous exercise. To find the total number of workers you need, simply divide your target throughput by how much a single worker can sustain under your defined latency requirements. Distributed systems like ScyllaDB scale linearly, which simplifies the math (and the total cost of ownership [TCO]). In fact, as you add more workers, chances are that you’ll achieve even higher rates than under a single worker. The reason is network IRQs, which are out of scope for this write-up (but see this perftune docs page for some details).

Step 4: Increase Parallelism

Think about it. The total time to check out an order is driven by the number of items in a cart divided by the speed of a single cashier. Instead of putting all the pressure on a single cashier, wouldn’t it be far more efficient to divide the items in your shopping cart (our work) and distribute them among friends who could then check out in parallel? Sometimes the number of work items you need to process might not be evenly split across all available cashiers. For example, if you have 100 items to check out, but there are only 5 cashiers, then you would route 20 items per counter. You might wonder: “Why shouldn’t I instead route only 5 customers with 20 items each?” That’s a great question — and you probably should do that, rather than having the store’s security kick you out. When designing real-time, low-latency OLTP systems, however, you mostly care about the time it takes for a single work unit to get processed. Although it is possible to “batch” multiple requests against a single shard, it is far more difficult (though not impossible) to consistently accomplish that in such a way that every item is owned by that specific worker. The solution is to always dispatch individual requests one at a time. Keep concurrency high enough to overcome external delays like client processing time and network RTT, and introduce more clients for higher parallelism.
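Here is a minimal sketch of that dispatch pattern using asyncio: each work unit is sent as its own request, with a semaphore keeping a fixed number in flight; `send_request` is again a hypothetical stand-in for a call to the system under test:

```python
import asyncio

async def send_request(item: int) -> None:
    """Hypothetical placeholder for one request to the distributed system."""
    await asyncio.sleep(0.001)  # simulate ~1 ms of service time

async def run(items: range, concurrency: int) -> None:
    # Dispatch each work unit as its own request (no artificial batching)
    # while capping how many are in flight at any moment.
    sem = asyncio.Semaphore(concurrency)

    async def dispatch(item: int) -> None:
        async with sem:
            await send_request(item)

    await asyncio.gather(*(dispatch(i) for i in items))

# Keep enough requests in flight to hide client-side and network delays.
asyncio.run(run(range(1_000), concurrency=50))
```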
Step 5: Avoid Hotspots

Even after multiple cashiers get hired, it sometimes happens that a long line of customers queues behind a handful of them. More often than not, you should be able to find less busy — or even totally free — cashiers simply by walking through the hallway. This is known as a hotspot, and it often gets triggered by unbounded concurrency. It manifests in multiple ways. A common situation is a traffic spike on a few popular items (load). That momentarily causes a single worker to queue a considerable number of requests. Another example: low cardinality (uneven data distribution) prevents you from fully benefiting from an increased workforce. There’s also another commonly overlooked situation that frequently arises. It happens when you dispatch too much work for a single worker to coordinate, and that single worker depends on other workers to complete the task. Let’s get back to the shopping analogy: Assume you’ve found yourself on a blessed day as you approach the checkout counters. All cashiers are idle and you can choose any of them. After most of your items get scanned, you say, “Dear Mrs. Cashier, I want one of those whiskies sitting in your locked closet.” The cashier then calls for another employee to pick up your order. A few minutes later, you realize: “Oops, I forgot to pick up my toothpaste,” and another idling cashier nicely goes and picks it up for you. This approach introduces a few problems. First, your payment needs to be aggregated by a single cashier — the one you ran into when you approached the checkout counter. Second, although we parallelized, the “main” cashier will be idle waiting for the others' completion, adding delays. Third, further delays may be introduced between each additional individual request's completion: for example, when the keys to the locked closet are held by only a single employee, the total latency will be driven by the slowest response. Consider the following pseudocode:
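A rough Python-flavored sketch of the pattern; the helpers (`pick_one_shard`, `ask`, `gather`) are illustrative names only, not a real API:

```python
# Anti-pattern: one "coordinator" worker fans out and waits on other workers.
async def checkout(cart):
    coordinator = pick_one_shard()          # the cashier you happened to pick
    scanned = await coordinator.scan(cart)  # most items are handled here

    # Each extra item is delegated to a different worker, but the
    # coordinator sits idle until every delegate reports back.
    whisky = coordinator.ask(closet_worker, "whisky")
    toothpaste = coordinator.ask(idle_cashier, "toothpaste")
    extras = await gather(whisky, toothpaste)  # latency = slowest response

    # Payment is still aggregated by the single coordinator.
    return await coordinator.pay(scanned + extras)
```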
See that? Don’t do that. The previous pattern works nicely when there is a single work unit (or shard) to route requests to. Key-value caches are a great example of how multiple requests can get pipelined together for higher efficiency. As we introduce sharding into the picture, this becomes a great way to undermine your latencies, given the previously outlined reasons.

Step 6: Limit Concurrency

When more clients are introduced, it’s like customers inadvertently ending up at the supermarket during rush hour. Suddenly, they can easily end up in a situation where many clients all decide to queue at a handful of cashiers. You previously discovered the maximum concurrency at which a single shard can service requests. These are hard numbers and — as you observed during small-scale testing — you won’t see any benefits if you try to push requests further. The formula goes like this:

Concurrency = Throughput * Latency

If a single shard sustains up to 5K ops/second at an average latency of 1 ms, then you can execute up to 5 concurrent in-flight requests at all times. Later, you added more shards to scale that throughput. Say you scaled to 20 shards for a total throughput goal of 100K ops/second. Intuitively, you would think that your maximum useful concurrency becomes 100. But there’s a problem. Introducing more shards to a distributed system doesn’t increase the maximum concurrency that a single shard can handle. To continue the shopping analogy, a single cashier will continue to scan barcodes at a fixed rate — and if several customers line up waiting to be serviced, their wait time will increase. To mitigate (though not necessarily prevent) that situation, divide the maximum useful concurrency among the number of clients. For example, if you’ve got 10 clients and a maximum useful concurrency of 100, then each client should be able to queue up to 10 requests across all available shards. This generally works when your requests are evenly distributed. However, it can still backfire when you have a certain degree of imbalance. Say all 10 clients decided to queue at least one request at the same shard. At a given point in time, that shard’s concurrency would climb to 10, double our initially discovered maximum concurrency. As a result, latency increases, and so does your P99. There are different approaches to prevent that situation. The right one to follow depends on your application and use case semantics. One option is to limit your client concurrency even further to minimize its P99 impact. Another strategy is to throttle at the system level, allowing each shard to shed requests as soon as it queues past a certain threshold.

Step 7: Consider Background Operations

Cashiers do not work at their maximum speed at all times. Sometimes, they inevitably slow down. They drink water, eat lunch, go to the restroom, and eventually change shifts. That’s life! It is now time for real-life production testing. Apply what you’ve learned so far and observe how the system behaves over long periods of time. Distributed systems often need to run background maintenance activities (like compactions and repairs) to keep things running smoothly. In fact, that’s precisely why I recommended that you stay away from the saturation area at the beginning of this article. Background tasks inevitably consume system resources and are often tricky to diagnose. I commonly receive reports like “We observed a latency increase due to compactions,” only to find out later the actual cause was something else; for example, a spike in queued requests to a given shard. Irrespective of the cause, don’t try to “throttle” system tasks. They exist and need to run for a reason. Throttling their execution will likely backfire on you eventually. Yes, background tasks slow down a given shard momentarily (that’s normal!). Your application should simply prefer other, less busy replicas (or cashiers) when that happens.

Applying These Steps

Hopefully, you are now empowered to address questions like “Why is latency high?” or “Why is throughput so low?” As you start evaluating performance, start small. This minimizes costs and gives you fine-grained control during each step. If latencies are suboptimal at a small scale, it either means you are pushing a single shard too hard or that your expectations are off. Do not engage in larger-scale testing until you are happy with the performance a single shard gives you. Once you feel comfortable with the performance of a single shard, scale capacity accordingly. Keep an eye on concurrency at all times and watch out for imbalances, mitigating or preventing them as needed. When you find yourself in a situation where throughput no longer increases but the system is idling, add more clients to increase parallelism.
History knows many examples of brilliant ideas that were conceived in a garage and started in a somewhat chaotic manner. But it also knows just as many examples of equally brilliant ventures failing because of simple mistakes and a general lack of a systematic approach. I suggest you have a look at four basic questions that can get you some decent insurance against chaos. Get yourself and your team through them — and build your project’s foundation layer by layer. And remember: Amat Victoria Curam — Victory Loves Preparation.

Question 1: Who Needs Your Project?

Identifying your target audience is a critical step that should never be skipped. Without it, your project is at risk of failure even before it starts. You may ask why defining a target audience matters to a developer. The answer is simple and straightforward: your audience is the fundamental factor that defines everything else, from technology stack to product features. When you know who is going to use your product and how they will use it, you can optimize it accordingly. For instance, if you're building a progressive web app (PWA), offline functionality is important, since many users will be using it with an unstable internet connection or without one entirely. It shouldn't just give them a "white screen": you should at least warn the user that an internet connection is required or give them the option to do something without one. Or, if you know your users’ busiest time slots, you can design a robust system to handle peak traffic without spending too much to make it bulletproof 24/7. All in all, without understanding your audience, you face all imaginable risks: from technical setbacks to developing something that sees little demand.

So, how can you understand who your users are? Start by analyzing competitors. Look at their audience, their social media, and the blogs and forums where their products are discussed. Notice which demographics engage most, and read their feedback. There may be unmet needs you can address in your own project, potentially drawing users to what you have to offer. If your product is truly unique, there are still ways to identify your target audience. For instance, surveys can gauge interest and help you accumulate valuable data across user groups. Create short, focused surveys with questions about people’s needs, interests, and preferences. Post these on social media, send them via email, or use survey platforms like Google Forms or SurveyMonkey. Run test ad campaigns targeting different user groups to gauge their interest and draw their attention. This will show you which audience segments respond best to your ideas. You can also create "personas" – user profiles that include age, gender, interests, profession, goals, and challenges. Personas can help you refine messaging and prioritize features for different segments.

Question 2: Why Do THEY Need It and Why Do YOU Need It?

Now that you know your target audience, the next step is determining whether they truly need your project. This comes down to a simple question: why would they need it? To explore this, ask yourself, your team, and your prospective audience several key questions:

- What challenges do your potential users face daily, and what specific problems does your project address for them?
- Why aren’t existing solutions meeting their needs? Competitors’ products may be inconvenient, expensive, or complicated.
Knowing their weaknesses helps you create a unique value proposition.
- How exactly is your project going to make users’ lives easier, and what user experience do you aim to deliver? For example, it could automate routine tasks, improve customer interactions, or provide access to valuable information. Understanding the product’s primary tasks enables you to focus on the core functions that matter most to users.

If you’re also managing business or marketing tasks, answering these questions can help you devise a strategy that speaks to your audience. In some cases, it may also help you persuade investors or partners of your project’s potential value. There is also another side of the goal topic that is equally important for your future project. Besides defining user-oriented goals, you should also ask yourself and your team what the project's goals are from your own perspective. In other words, why do you need this project? What are its business goals? Do you plan to keep it small but beautiful? Or is your ambition to outshine famous unicorns? Is it intended just to generate some income to support yourself and your team, or do you plan to attract additional investment and scale it up? Knowing these objectives is crucial to keeping yourself and your team highly committed. It will also help you draft your monetization strategy (if any) and shape your marketing and growth plans. But most of all, it will give you a roadmap of what is truly important so you can focus on worthy priorities.

Question 3: How Are You Going to Develop the Project?

Sometimes, a project’s success hinges on choosing the right technologies. Stability, scalability, and maintenance of your product — and especially the development costs — all depend on selecting suitable tools. This makes it essential to determine the technologies you are going to employ. Here’s how to narrow down your tech stack selection:

Step 1: Define Your Project Goals and Requirements
Before choosing technologies, clarify how the product will be used, the expected number of users, data processing and storage needs, and the devices it is going to support. At this point, certain technologies may already seem more suitable than others.

Step 2: Assess Your Development Team’s Competencies
Choose technologies that align with the team’s expertise. If the team is familiar with specific frameworks or programming languages, leaning toward those can save time and resources. When developers are well-versed in a technology, the likelihood of bugs and delays decreases significantly.

Step 3: Consider Project Timelines
Even if your team is proficient in a particular technology, other options might allow faster, more affordable development. It’s also essential to account for the project’s future growth: popular, well-supported technologies reduce the risk of being stuck with outdated solutions.

Question 4: Do You Have Everything Prepared?

Web project success relies not only on technology and functionality but also on the team’s effectiveness. So, before beginning development, it’s crucial to ensure all technical and organizational resources are ready and configured. Here is a typical checklist to run through before you dive into a development frenzy:

- Set up a task board and workflows. Create a Kanban or Scrum board, define work stages (e.g., Backlog, In Progress, Code Review, Testing, Done), and assign tasks. Make sure everyone knows their roles and task deadlines.
- Organize chat channels.
Create project channels in a messaging app like Slack to facilitate instant communication.
- Establish repositories on GitHub/GitLab/Bitbucket and set up the basic project structure. Ensure all team members have proper access and understand the branching strategy. Require reviews before merging to minimize errors and maintain code quality. Configure main branches (e.g., ‘main’, ‘develop’) and add protection policies to prevent direct pushes.
- Set up Docker images for all project components (frontend, backend, databases, services) and use Docker Compose to streamline local development.
- Implement CI/CD systems (e.g., Jenkins, GitLab CI/CD, CircleCI). Automate code builds, testing, and deployment. Prepare unit and integration tests to run on every commit. Automate deployments for each development stage.
- Create a wiki or an internal knowledge base (e.g., Confluence, GitLab Wiki). Document all project information, architecture, requirements, deployment instructions, and development standards in one location. If possible, automate documentation updates (e.g., generating API documentation with Swagger) to keep your knowledge base in line with development progress.

Love and respect are probably the most important keywords for any human life. When it comes to work, it is crucial to love and respect what you do and to care for the people you work with, as well as for your users and their needs. But you should also never forget about loving and respecting yourself and your own ideas. Good preparation is ultimately a natural tribute to these emotions. Your project is your brainchild, so it deserves to be planted in a properly arranged environment. And you deserve to see it prosper and live happily ever after.
TL;DR: The Lean Tech Manifesto With Fabrice Bernhard — Hands-on Agile #65

Join Fabrice Bernhard on how the “Lean Tech Manifesto” solves the challenge of scaling Agile for large organizations and enhances innovation and team autonomy (note: the recording is in English).

Lean Tech Manifesto Abstract

The release of the Agile Manifesto on February 13th, 2001, marked a revolutionary shift in how tech organizations think about work. By empowering development teams, Agile cut through the red tape in software development and quickly improved innovation speed and software quality. Agile's new and refreshing approach led to its adoption beyond just the scope of a development team, spreading across entire companies far beyond the initial context the manifesto’s original thinkers designed it for. And here lies the problem: the Agile Manifesto was intended for development teams, not for organizations with hundreds or thousands of people. As enthusiasts of Agile, Fabrice and his partner went through phases of excitement and then frustration as they experienced these limitations firsthand while their company grew and their clients became larger. What gave them hope was seeing organizations on both sides of the Pacific, in Japan and California, achieve almost unmatched growth and success while retaining the principles that made the Agile movement so compelling. The “Lean Tech Manifesto” resulted from spending the past 15 years studying these giants and experimenting while scaling their own business. It tries to build on the genius of the original 2001 document but adapt it to a much larger scale. Fabrice shares the connection they identified between Agile and Lean principles and the tech innovations they found the best tech organizations adopt to distribute work and maintain team autonomy.

Meet Fabrice Bernhard

Fabrice Bernhard is the co-author of The Lean Tech Manifesto and the Group CTO of Theodo, a leading technology consultancy he co-founded with Benoît Charles-Lavauzelle and scaled from 10 people in 2012 to 700 people in 2022. Based in Paris, London, and Casablanca, Theodo uses Agile, DevOps, and Lean to build transformational tech products for clients all over the world, including global companies — such as VF Corporation, Raytheon Technologies, SMBC, Biogen, Colas, Tarkett, Dior, Safran, BNP Paribas, Allianz, and SG — and leading tech scale-ups — such as ContentSquare, ManoMano, and Qonto. Fabrice is an expert in technology and large-scale transformations and has contributed to multiple startups scaling more sustainably with Lean thinking. He has been invited to share his experience at international conferences, including the Lean Summit, DevOpsDays, and CraftConf. The Theodo story has been featured in multiple articles and in the book Learning to Scale at Theodo Group. Fabrice is also the co-founder of the Paris DevOps meetup and an active YPO member. He studied at École Polytechnique and ETH Zürich and lives with his two sons in London. Connect with Fabrice Bernhard on LinkedIn.

Video

Watch the recording of Fabrice Bernhard’s The Lean Tech Manifesto session now:
People may perceive Agile methodology and hard deadlines as two incompatible concepts. The word “Agile” is often associated with flexibility, adaptability, iterations, and continuous improvement, while “deadline” is mostly about fixed dates, finality, and time pressure. Although the latter may sound threatening, project teams can prioritize non-negotiable deadlines while modifying those that are flexible. The correct approach is the key. In this article, we’ll analyze how deadlines are perceived within an Agile framework and what techniques can help successfully manage deadlines in Agile-driven projects.

Immersing Into the Vision of a Powerful Methodology

RAD, Scrumban, Lean, XP, AUP, FDD... do these words sound familiar? If you’re involved in IT, you surely must have heard them before. They are all about Agile. This methodology presupposes splitting the software creation process within a project into small iterations called sprints (each typically lasting two to three weeks). Agile enables regular delivery of a working product increment as an alternative to a single extensive software rollout. It also fosters openness to change, quick feedback for continuous IT product enhancement, and more intensive communication between teams. This approach is ideal for complex projects with dynamic requirements, frequent functionality updates, and the need for continuous alignment with user feedback.

Grasping How Time Limitations Are Woven Into an Agile-Driven Landscape

Although Agile emphasizes flexibility, that doesn’t mean deadlines can be neglected. They must be addressed with the same level of responsibility and attention, but with a more adaptable mindset. Because sprints are short, unforeseen issues or alterations are contained within a specific sprint. This helps mitigate the risk of delaying the entire project and simplifies problem-solving, as only a limited part of the project is impacted at a time. Moreover, meeting deadlines in Agile projects relies heavily on accurate task estimation. If estimates are off the mark, project teams risk either falling behind schedule because of overcommitting or spending time aimlessly due to an insufficient workload for the sprint. If such situations happen even once, team members must reevaluate their approach to estimating tasks to better align them with team capacity.

Proven Practices for Strategic Navigation of Time Constraints

Let’s have a closer look at a number of practices for ensuring timely releases throughout the entire Agile development process and keeping project teams moving in the right direction:

1. Foster a Steady Dialogue

Most Agile frameworks support specific ceremonies that ensure transparency and keep team members and stakeholders informed of all project circumstances, thus effectively managing deadlines. For instance, during a daily stand-up meeting, project teams discuss current progress, objectives, and the quickest and most impactful ways of overcoming hurdles to complete all sprint tasks on time. A backlog refinement meeting is another pivotal activity, during which a product owner reviews tasks in the backlog to confirm that prioritized activities are completed before each due date. A retrospective meeting held after each sprint analyzes completed work and considers improved approaches for addressing problems in the future to minimize their effect on hitting deadlines.
2. Set Up Obligatory Sprint Planning

Before each sprint, a product owner or a Scrum master needs to conduct a sprint planning meeting, during which they collaborate with software developers to estimate the effort for each task and decide which items from the backlog should be completed next. To achieve this, they analyze what objectives should be attained during the sprint, what techniques will be used to fulfill them, and who will be responsible for each backlog item. This helps ensure that team members continuously progress towards specific goals, have clarity about upcoming activities, and deliver high-quality output while staying on schedule.

3. Promote Clarity for Everyone

Meeting deadlines requires a transparent work environment where everyone has quick access to the current project status, especially in distributed teams. Specific tools, such as Kanban boards or task cards, contribute to achieving this. They provide a flexible shared space with a convenient overview of the entire workflow, with highlighted priorities and due dates. This enables team members to prioritize critical tasks without delay, control task completion time, and take full accountability for their work.

4. Implement a Resilient Change Management Framework

The ability to swiftly and proficiently process modifications in scope or objectives within a sprint directly impacts a team’s ability to adhere to time constraints. Change-handling workflows enable teams to manage adjustments continuously, reducing the risk of downtime or missed deadlines. Key project contributors, product owners, and Scrum masters can formulate a prioritization system to define which alterations should be addressed first. They should also discuss how each adjustment affects milestones and the end goal.

5. Create a Clear Definition of Done

The definition of done is a win-win practice that establishes straightforward criteria for marking tasks as complete (for example: code reviewed, automated tests passing, and the increment deployed to a staging environment). When everyone understands these criteria, teams deliver higher-quality work aligned with shared standards, minimize the chance of last-minute rework, and reduce the accumulation of technical debt on the project.

6. Follow Time Limits

To enhance task execution, team leaders can adopt time limits — for example, restricting daily stand-ups to 15 minutes. This keeps discussions focused and free of distractions, helping teams meet deadlines.

Final Thoughts

Navigating deadlines in Agile projects is a fully attainable goal that requires an effective strategy. By incorporating practices such as regular communication, sprint planning, transparency, a change management approach, a definition of done, and timeboxing, specialists can successfully accomplish short- and long-term targets without compromising set deadlines.
Being a developer comes with many benefits. Often, there's the opportunity for continued on-the-job learning, a high salary, and the ability to work remotely, which means developers can live wherever the cost of living gives them the biggest bang for the buck. But programming isn't all fun and games. Like any other career, development has its downsides. Here are the things that developers hate most about being developers.

1. Too Much Screen Time

Programming, by nature, requires quite a lot of screen time. This is the one thing that Guardian Software Systems developer Erica Gilhuber hates about being a developer:

"Even with taking frequent breaks and a bunch of other eyestrain-prevention techniques, my eyes are begging me to just stare at the clouds by the end of the day." — Erica Gilhuber

So, how can you overcome too much screen time? Blue-light glasses can be a great help, Gilhuber said. Also, "the Windows night-light setting is awesome. Set it more intensely than you think; it won't look orange after a couple days." She also uses dark mode everywhere she can — OS settings, themes, browser extensions, etc.

2. Clients Who Don't Understand Tech But Think They Do

Devin Ceartas, owner of NacreData, dislikes working with clients who don't understand tech. It's one thing to have family members requesting help troubleshooting printers on holidays, but in the workplace, non-tech-savvy clients can make your job difficult. Even worse, he said, are "clients who think they understand tech and don't," especially clients who are sure they know exactly how something should be implemented because they read one thing somewhere or watched a video.

"As an independent developer, my future work is largely dependent on having satisfied customers and their recommendations or references. But 'satisfied' isn't just doing what you ask me to do, competently and affordably."

Non-technical clients likely won't understand the difficulties developers face when transforming ideas into reality, he said. For example, the new feature on your website or app fits into an existing ecosystem, and "in the end, you'll judge me by the impact my work has on the totality." Then, there's the fact that the experience the client's users have may vary considerably because they work with different devices and use tech in different ways.

The latest development tools, such as frameworks, may entice clients but not necessarily the developers themselves. Ceartas said, "Yes, that new JavaScript library you read about makes the interface really pop," but the one you're already using can do almost the same thing without slowing your site down with another large code download. Or yes, that four-tier cascading dropdown menu would allow you to access every page of your site from the home page and works great on your high-resolution laptop, but have you tried using it on the phone?

Programmers need to find ways of working with clients and managers who fail to comprehend the technical aspects of a project. "Sometimes the client and I never see eye to eye, and I just do it their way or ditch the project," Ceartas said. But often, it can be worked out by pulling out only what it is about the tech they've stumbled across that they like. This can be something like: "I think I hear you saying that you really like the responsiveness of this example code you found. That's a wonderful example to have.
I'd love to use that concept while also optimizing the page load time so the overall user experience improves."

Time and Money Problems

SoftwareMill Scala software engineer Bartłomiej Żyliński said it's challenging to work with businesses that "do not know what they want, but it has to be ready yesterday." Often, there's a disconnect between what a client wants and what's possible based on budget, time, and resources. What seems like an easy job on paper might be too costly or time-intensive to actually implement. Still, good communication can ease the pains of clients who don't fully grasp the technological side of business projects.

3. Context Switching That Breaks Your Momentum

Developers must juggle different tasks. In computing, context switching means storing and restoring the state of a thread or process so its execution can be resumed later; for developers, the human equivalent is just as costly. As Over-C CTO James Sugrue said, "Context switching — it wears you out sometimes." Context switching is necessary when stuff breaks "at inopportune moments, like weekends or when you're trying to relax, [or] when you're minding your own business and you start worrying about some bug," Sugrue said. Simple but effective strategies, such as maintaining a balanced calendar, can reduce the impact of context switching.

4. Depending on Technology and Teams for Success

While technology can make a developer's life easier, it can also present hurdles. Dave Fecak, founder of both Resume Raiders and the Philadelphia-area Java Users' Group, laments that, as a developer, he's not in full control. There are dependencies on things he didn't build, including tools, APIs, open-source products, other people who have access to his code, etc. "No matter how well you do, there's a chance somebody else can screw it up," Fecak said. The technology and processes that enable development are both a gift and a curse. Regardless of your programming prowess, you depend on many outside factors that ultimately impact projects.

5. Implementing Someone Else's Designs

Programmers may seem like magicians to non-developers, but they're not clairvoyant. Vedcraft founder Ankur Kumar explained that what he hates about programming is implementing "designs thought of by someone else." There can be a disconnect between the idea on paper and what's possible in the real world. This is further complicated when working with clients who don't understand or respect the technical limitations that developers face. Similarly, working with someone else's code, especially legacy code, can be a huge pain point for devs.

6. Old Coding Practices That Become the Newest Thing

Although programming has risen quite a bit in popularity, it's by no means new. Mad Botter founder Michael Dominick is bothered by how some ancient coding concepts resurface as new ideas.

"In general, I love software development, but the one thing that has begun to irk me over the years is venerable old ideas being presented as brand new." — Michael Dominick

In particular, this has been a trend with functional programming, something that has been around since at least the 1970s "but was recently rediscovered by developers of my generation and younger," Dominick said. It's great to see these older technologies being embraced by new developers, he said, but "there's also a lot of value in understanding the fundamentals of where these methodologies and technologies came from within their historical context."
7. Doing Someone Else's Job

It's not uncommon for the work you do after signing your contract to differ somewhat from the job description. But doing someone else's job is another story. "I hate doing my manager's job," Abstract software engineer Pam Selle said. She's seen managers push all organizing work onto individual contributors rather than providing direction about what an individual contributor needs to do.

"If managers manage, it lets me work as an engineer, which is why I have the job I have." — Pam Selle

Generally, a manager is in a managerial position for a reason: to manage. However, many individual contributors are forced to manage or even perform their manager's job for them. It's often said that employees don't leave companies; they leave managers. Having to do your manager's job in addition to your own may eventually lead to burnout. Selle said there are solutions to this problem. "As much as large corporations get flak for, say, having a 'Jira of Jiras,' managers working out how to keep a flow of work coming down the pipe for engineers really allows engineers to focus on the engineering and do better work because of it."

Can You Live With the Downsides?

With the promise of high salaries, and because almost every company is a software company whether by design or not, programming continues to be an enticing career. But like any other job, development isn't without its downsides. Whether it's clients who don't understand technology, the frustration of implementing someone else's (sometimes sloppy) designs, too much screen time, or dealing with poor management, something will likely come up that makes coding seem like less than a dream job. However, if you're willing to put up with some annoyances and find solutions to mitigate problems, development can be a satisfying career choice.
There are many reasons why Kubernetes is a popular container runtime platform for distributed applications. One of them is the portability and flexibility it provides to IT architects. However, service discovery, infrastructure reliability, and security are known challenges that come along with these benefits. Challenges create opportunities, and many tools have emerged to mitigate the common problems faced by teams running containerized applications on Kubernetes. A service mesh is a pattern that aims to address some of these challenges when architecting an application on Kubernetes. By providing a dedicated infrastructure layer that facilitates service discovery and governs how applications share information with each other, it delivers security, tracing, monitoring, and traffic control. Dapr, which stands for Distributed Application Runtime, is an open-source, portable, event-driven runtime designed to make it easier for developers to build resilient, stateless, and stateful microservice applications. Similar to service meshes, Dapr provides features such as discoverability, secure service-to-service communication, and distributed tracing. These overlapping features often raise the question of when you should choose Dapr or a service mesh to make your distributed system architecture more robust. This is a multi-part series, and this first part unpacks that decision. Instead of focusing on a single tool for the job, we will dive deeper into the two technologies, understanding where they overlap and, most importantly, where their strengths can be combined to achieve your microservices goals.

The Sidecar Pattern

Before looking at how Dapr and service meshes work in detail, we need to understand what a sidecar is, as it is a pattern leveraged by both technologies. Sidecars are containers or processes deployed alongside an application to extend its functionality and provide isolation. A sidecar is a completely independent piece of software that takes over responsibilities such as monitoring, logging, and configuring network services from the application code. Exactly one sidecar process or replica runs alongside each application replica and, when running on Kubernetes, it is typically deployed within the same pod as the application. When using the sidecar pattern, applications do not communicate directly with each other but instead through their corresponding sidecars.

Two Dapr-enabled applications communicating through their associated sidecars

Understanding Service Meshes

A service mesh is a dedicated layer within your application architecture that manages service discovery and communication in distributed applications. This layer provides features including monitoring, logging, tracing, and traffic control. Service meshes focus on the networking layer, and because they require no programming skills or application code modifications, they are typically managed by system operators. Alongside service discovery, service meshes provide networking features like load balancing, advanced traffic management, mTLS encryption, and security policies to control access to application endpoints. Service meshes also provide comprehensive auditing and debugging features through monitoring metrics, system performance analysis, error and latency calculations, and distributed tracing. These features are delivered through network proxies that are responsible for routing requests between applications.
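On Kubernetes, this proxy-next-to-the-app layout is what both technologies build on: each application pod carries a second container alongside the application container. The snippet below is a minimal sketch of such a pod; the names and images are hypothetical placeholders rather than any particular mesh's actual injection output.

YAML

apiVersion: v1
kind: Pod
metadata:
  name: orders                       # hypothetical application pod
spec:
  containers:
    - name: app                      # the application container
      image: example.io/orders:1.0   # hypothetical image
      ports:
        - containerPort: 8080
    - name: proxy-sidecar            # the sidecar: traffic in and out is routed through it
      image: example.io/proxy:1.0    # hypothetical image

In practice, service meshes usually inject this second container automatically at deployment time, so teams rarely declare it by hand.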
The proxies act as a gateway between the networking layer and the applications, forcing all traffic to be routed through the service mesh. Because these proxies are co-located processes or containers that run alongside the applications, service meshes commonly follow the sidecar pattern. Service meshes rely on two components: the control plane and the data plane. The control plane is where service endpoints, network settings, load balancing configurations, and routing rules are registered. This information is shared with the data plane, where the sidecar proxies are hosted alongside the applications. Popular service meshes include Linkerd, Istio, and Cilium.

Service mesh architecture

Dapr vs. Service Mesh: Overlapping Features

Dapr provides a set of APIs that let developers build distributed applications while abstracting the underlying infrastructure and applying industry best practices for observability, security, and resiliency. The APIs focus on facilitating microservice development by providing building blocks for service invocation, state management, pub/sub messaging, stateful workflows, actors, and more. Dapr also traditionally relies on the sidecar pattern to provide service communication with mTLS, metrics collection, distributed tracing, and resiliency. Unlike service meshes, whose primary audience is operations teams, Dapr is developer-centric: developers build it into application code through native HTTP/gRPC clients or SDK integrations. Dapr and service meshes share many of the same features, commonly executed at different layers of the system architecture. Here are some of the overlapping features:

Secure Service-to-Service Communication

Dapr provides end-to-end security with the service invocation API, with the ability to authenticate applications using token-based auth and restrict access using policies. Applications are typically scoped to namespaces for deployment, and traffic can be encrypted end-to-end using mTLS with no extra configuration in code. Dapr provides service discovery and invocation by name through the concept of “Application IDs,” which is a developer-centric concern: through Dapr’s service invocation API, developers call a method on an application using its App ID, allowing for easily readable code. Importantly, application IDs (or names) provide identity for the application, so service discovery resolves to wherever the application is currently running, and security can be enforced between applications by name. This is not possible with a service mesh that operates only at the network level.

Dapr service-to-service secure communication

Service meshes handle service-to-service communication by providing a dedicated infrastructure layer that manages and optimizes the interactions between microservices. This layer provides service discovery by managing service endpoints and makes apps more resilient by rerouting requests if they fail. Service meshes also provide traffic management capabilities through load balancing and traffic splitting; encryption via mTLS; and policy enforcement, such as access control and rate limiting.
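As a concrete illustration of the Dapr side of this, the snippet below is a minimal sketch of a Dapr Configuration resource that denies service invocation by default and allows only a named application; the app ID, namespace, and trust domain shown are hypothetical values for illustration.

YAML

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: appconfig
spec:
  accessControl:
    defaultAction: deny        # block service invocation unless a policy allows it
    trustDomain: "public"
    policies:
      - appId: checkout        # hypothetical caller application ID
        defaultAction: allow   # let this app invoke methods
        trustDomain: "public"
        namespace: "default"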
Observability

Both service meshes and Dapr provide observability into distributed systems. Service meshes operate at the network level and provide traces for the network calls between applications, along with metrics and network activity logs. Dapr also provides metrics, tracing, and logs for all of the APIs used within the system. Since Dapr provides a layer between the infrastructure and the applications, it can offer insights at both the application and the infrastructure level, something that service meshes lack. Observability with Dapr goes beyond service communication. For asynchronous messaging in particular, Dapr provides observability into pub/sub calls using trace IDs written into the CloudEvents envelope. This means that metrics and tracing with Dapr are more extensive than with a service mesh for applications that use both service-to-service invocation and pub/sub to communicate.

Both service meshes and Dapr can leverage popular tracing solutions like Jaeger and Zipkin for distributed tracing, which helps in monitoring and troubleshooting microservices by visualizing the flow of requests through different applications. Service meshes collect and visualize traces, which represent the path of a request through various microservices. These distributed tracing solutions are commonly included as part of the service mesh deployment, enabling automatic tracing of service-to-service communication. Options like Linkerd and Istio provide dashboards that give a visual representation of their metrics. Dapr also supports multiple solutions for distributed tracing, sending trace data using the OpenTelemetry and Zipkin protocols. This allows developers to monitor and visualize the interactions between Dapr-enabled microservices and their associated infrastructure resources. Developers who want a full understanding of their Dapr applications can leverage Diagrid Conductor. Conductor provides full control of your Dapr environment, with access to a comprehensive set of features that help developers and system administrators manage the current health of their workloads. It also provides advisories that show how to operate Dapr through a well-architected lens.

Dapr applications in Diagrid Conductor’s App Graph

Resiliency Through Retries

Both Dapr and service meshes handle resiliency through retries by implementing policies that automatically retry failed requests, ensuring that transient errors and network noise do not disrupt the system. Dapr provides a robust mechanism for handling retries through its resiliency policies. Users can define retry policies in a configuration file, specifying parameters like the retry strategy (constant or exponential), the duration between retries, the maximum interval, and the maximum number of retries. The code snippet below contains a Dapr resiliency policy that configures retries.

YAML

spec:
  policies:
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 10
      retryForever:
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely

Service meshes, like Istio, also handle resiliency through retries by providing built-in features for retrying failed requests. These retries can be configured with parameters such as the number of attempts, the interval between retries, and the conditions under which retries should be attempted.
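For comparison, the snippet below is a minimal sketch of what such a retry policy looks like in Istio, expressed as a VirtualService; the service name is a hypothetical placeholder.

YAML

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
    - orders                          # hypothetical service
  http:
    - route:
        - destination:
            host: orders
      retries:
        attempts: 3                   # retry a failed request up to 3 times
        perTryTimeout: 2s             # timeout for each individual attempt
        retryOn: 5xx,connect-failure  # conditions that trigger a retry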
Unique and overlapping features of Dapr and service meshes

Dapr, Service Mesh, or Both?

So, should you be using Dapr, a service mesh, or both? The answer depends on your system requirements.

Applications With a Variety of Developer-Centric Needs

If you are building microservices and need a set of building blocks that handle developer-centric needs for state management, pub/sub messaging, service invocation, and workflows, Dapr is the best choice. Dapr can also handle network security and observability requirements; in most cases like these, a service mesh may not be required.

Polyglot Applications With Cloud Dependencies

Dapr is likely the best choice for microservices that are built in multiple programming languages and need to communicate with many backing cloud services. With SDKs and APIs that abstract away how these underlying infrastructure resources work, developers do not need to worry about the intricacies of using Amazon DynamoDB or Azure Cosmos DB for state management, for example. Systems deployed on multiple clouds also benefit from the modularity and code abstraction that Dapr provides.

Architectures Where mTLS Needs to Be Enforced for All Application Communication

If your solution requires complex network-level security that needs to be enforced with fine-grained control, network-level policies, and mTLS encryption for all applications — Dapr-enabled or not — you likely require a service mesh. Dapr provides access control and mTLS; however, it cannot provide these capabilities for apps that do not have Dapr sidecars. The picture below shows an architecture where Dapr is used for its developer-purposed APIs while a service mesh is leveraged for mTLS communication between all applications.

Architecture where mTLS is enforced by a Service Mesh

Multi-Cluster Connectivity

Another common case is when your microservices are spread across multiple Kubernetes clusters. These architectures often require load balancing, advanced traffic routing, and traffic splitting, making service meshes the best option here. The architecture below relies on a service mesh for load balancing and on Dapr for developer-purposed APIs.

Multi-cluster architecture where load balancing is handled by a Service Mesh

A related use case is traffic splitting for A/B testing scenarios; a service mesh is preferred here, as Dapr does not provide this capability.
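As an illustration of what such traffic splitting looks like, the snippet below is a minimal sketch of a weighted Istio route that sends a small share of traffic to a new version; the service name is hypothetical, and the v1/v2 subsets are assumed to be defined in a separate DestinationRule.

YAML

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
    - checkout          # hypothetical service
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 90    # 90% of traffic stays on the current version
        - destination:
            host: checkout
            subset: v2
          weight: 10    # 10% goes to the version under test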
Architectures That Mix Dapr-Enabled Applications With Regular Apps

Typically, you would use a service mesh with Dapr where a corporate policy requires traffic on the network to be encrypted for all applications. For example, you may be using Dapr in only part of your system, and other services and processes that are not using Dapr also need their traffic encrypted. In this scenario, a service mesh is the better option, and you should most likely enable mTLS and distributed tracing on the service mesh and disable them in Dapr.

Conclusion

Kubernetes is a great platform for distributed applications. The challenges that come with its portability and flexibility can be solved by using service meshes and Dapr to make your architecture robust, resilient, and secure. Where you require capabilities unique to each, you will find it useful to run Dapr alongside a service mesh; in other cases, the security, observability, and resiliency features of Dapr alone may be enough. There is no limitation on combining both tools, and it is common for Dapr to be deployed with service meshes like Istio, Linkerd, and others. Understanding the overlapping features and making sure they are not enabled in both technologies is the key to a successful deployment. Part two of this blog series will provide a step-by-step solution covering both developer-centric needs and fine-grained network requirements, focusing on how to avoid common errors that can lead to an operational nightmare. See you there.

When I was a child, I loved making pancakes with my grandmother. As time went on, I became a web developer, and now, instead of pancakes, I create various web projects with my colleagues. Every time I start a new project, I’m haunted by one question: How can I make this development "tasty" not only for the user but also for my colleagues who will work on it? This is a crucial question because, over time, the development team may change, or you might decide to leave the project and hand it over to someone else. The code you create should be clear and engaging for those who join the project later. Moreover, you should avoid a situation where the current developers are dissatisfied with the final product yet have to keep adding new "ingredients" (read: functions) to satisfy the demands of the "restaurant owner."

Important note: Before I describe my recipe, I want to point out that methods vary across teams and, of course, depend on their preferences. However, as we know, some people have rather peculiar tastes, so I believe it's worth reiterating even the simplest truths.

Selecting the Ingredients: Choose Your Technology Stack

Before you start cooking a dish, you usually check what ingredients you already have. If something is missing, you go to the store or look for alternative ways to acquire them, like venturing out to pick them in the woods. The web development process is similar: before starting work on a new project, you need to understand what resources you currently have and what you want to achieve in the end. To prepare for creating your technological masterpiece, it helps to answer a series of questions:

What are the functional requirements for the project? What exactly needs to be done?
Is anything known about the expected load on your future product?
Do you have any ready-made solutions that can be reused?
If you’re working in a team: What knowledge and skills does each team member possess?
Is the programming language/framework you want to use for development still relevant?
How difficult and expensive is it to find developers who specialize in the chosen technologies?

Even if you’re working on the project alone, always remember the "bus factor" — the risk associated with losing key personnel. Anything can happen to anyone (even you), so it’s crucial to prepare in advance for any hypothetical issues.

No Arbitrary Action: Stick to Coding Standards

How about trying oatmeal with onions? It’s hard to believe, but I once had such a dish in kindergarten. This memory is vividly etched in my mind, and it taught me an important lesson. Coding standards were invented for the same reason as “compatibility standards” for food ingredients. They are meant to improve code readability and understanding by all developers on the team. They spare us debates about the best way to write code, which constructs to use, and how to structure it. When everyone follows the same rules, the code becomes easier to read, understand, and maintain (and maintenance becomes cheaper this way). But that's not the only reason for having standards: adhering to them helps reduce the number of bugs and errors. For instance, strict rules for using curly braces can prevent situations where an operation is accidentally left outside a conditional block. Line length restrictions make code more readable, and consistent rules for writing conditions in an if statement help avoid logical errors.
Strict rules for data types and type casting in loosely typed languages also help prevent many runtime errors. Coding standards reduce dependency on specific developers, which is also good for the developers themselves, since they won't be bothered with silly questions during their vacation. In popular programming languages, there are generally accepted coding standards supported by the language's development team and the community around it. For example, PEPs (Python Enhancement Proposals) are maintained and managed by the Python developer community under the guidance of the Python Software Foundation (PSF). PSR (PHP Standards Recommendations) is a set of standards developed by PHP-FIG (the PHP Framework Interoperability Group) for PHP. Golang has stricter coding standards maintained by the language's developers. However, each development team or company may have its own standards in addition to (or instead of) those supported by the community. There can be many reasons for this; for example, the main codebase might have been written long before any standards were established, making it too costly to rewrite. To maintain uniformity in the codebase, the rules may be adjusted.

There are tools for automatically checking standards, known as static code analyzers. These tools generally have a wide range of functionality that can be further expanded and customized. They can also detect errors in the code before it is released to production. Examples of such tools include PHPStan, Psalm, and PHP_CodeSniffer for PHP; Pylint, Flake8, and Mypy for Python; and golint and go vet for Golang. There are also tools that automatically fix code and bring it up to existing standards where possible. Much of this work no longer requires the manual labor and resources it once did.

Keep Ingredients Fresh: Constant Refactoring

What happens if you don't keep the code fresh, and how can it lose its freshness? Programming languages and libraries (these are the ingredients) get updated, and old ingredients rot. Establish rules for keeping the code fresh, use automation tools, and update libraries. This advice may seem obvious, but it's frequently neglected: make sure your project's dependencies and server software are constantly monitored and regularly updated. This is especially important since outdated or insecure code presents an easy target for attackers. Just as with code checking and fixing, you don't have to update everything manually; numerous automation tools can assist. For instance, GitHub’s Dependabot automatically identifies outdated or vulnerable dependencies and proposes updates (see the sketch below). It's also vital to automate the renewal of security certificates. Expired certificates can cause significant issues, but automating this process is straightforward. For instance, if you're using Let's Encrypt certificates, Certbot can automate their renewal. The same concept applies to server software. For larger projects with multiple servers, tools like Puppet, Salt, Ansible, or Chef can handle updates. For those working with Linux, especially Debian/Ubuntu-based systems, Unattended Upgrades can manage this efficiently.
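To illustrate the Dependabot configuration mentioned above, here is a minimal sketch of a repository's .github/dependabot.yml; the package ecosystem is an assumption for illustration and would be swapped for npm, composer, gomod, and so on as appropriate.

YAML

version: 2
updates:
  - package-ecosystem: "pip"   # assumed ecosystem for a Python project
    directory: "/"             # where the dependency manifests live
    schedule:
      interval: "weekly"       # how often Dependabot checks for updates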
Taste (And Test) Along the Way

A good chef usually tastes the dish at different stages of preparation to ensure everything is going according to plan. In a similar fashion, a professional developer should check not just the final result but also the intermediate results using tests. Testing is often associated with merely detecting bugs. Indeed, it catches errors and unexpected behaviors before the product reaches users, improving overall quality and reducing the likelihood of issues down the line. But its importance goes well beyond that. Effective testing is crucial for delivering high-quality, dependable, and well-understood code:

Code comprehension: Writing test scenarios demands a deep understanding of the code’s architecture and functionality, leading to better insights into how the program operates and how different parts interact.
Supplemental documentation: Tests can also serve as practical examples of how functions and methods are used, helping to document the project’s capabilities and providing new team members with real-world use cases.

Clearly, achieving 100% test coverage of complex code is unrealistic. Therefore, developers must focus on testing critical functions and essential code segments, and knowing when to stop is key to avoiding an endless cycle of testing. Testing can also consume significant resources, especially during the early stages of development, so it’s important to strike a balance between the necessity of testing and the available time and resources.
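One common way to "taste along the way" is to run the test suite automatically on every change. The snippet below is a minimal sketch of a GitHub Actions workflow that does this, assuming a Python project with a requirements.txt and a pytest test suite; adjust the toolchain for your own stack.

YAML

# .github/workflows/tests.yml
name: tests
on: [push, pull_request]       # run on every push and pull request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt   # assumed dependency manifest
      - run: pytest                            # assumed test runner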
Chef’s Logbook: Add a Pinch of Documentation

It’s common knowledge that many famous foods, like mozzarella, nachos, and even french fries, were discovered by accident. Others took decades of trial and error to develop. In both cases, all of them would have remained one-off products if knowledge about them had not been passed on. It is the same with tech: every project needs proper documentation. The lack of such paperwork makes it much harder to identify and fix errors, complicates maintenance and updates, and slows down the onboarding of new team members. While teams lacking documentation get bogged down in repetitive tasks, projects with well-structured documentation demonstrate higher efficiency and reliability. According to the 2023 Stack Overflow Developer Survey, 90.36% of respondents rely on technical documentation to understand the functionality of technologies. Yet, even with documentation, they often struggle to find the information they need, turning to other resources like Stack Overflow (82.56%) and blogs (76.69%). Research by Microsoft shows that developers spend an average of 1-2% of their day (8-10 minutes) on documentation, and 10.3% report that outdated documentation forces them to waste time searching for answers.

The importance of documentation is also a significant concern for the academic community, as evidenced by the numerous scientific publications on the topic. Researchers from HAN University of Applied Sciences and the University of Groningen in the Netherlands identified several common issues with technical documentation:

Developer productivity is measured solely by the amount of working software.
Documentation is seen as wasteful if it doesn’t immediately contribute to the end product.
Informal documentation, often used by developers, is difficult to understand.
Developers often maintain a short-term focus, especially in continuous development environments.
Documentation is frequently out of sync with the actual software.

These “practices” should be avoided at all costs in any project, but it is not always up to the developers. Getting rid of these bad habits often requires changes to planning, management, and the long-term vision of the entire company, from top management to junior dev staff.

Conclusion

As you can see, tech project development (including, but not limited to, the web) has a lot in common with cooking. Proper recipes, fresh and carefully selected ingredients, meticulous compliance with standards, and checking intermediate and final results — sticking to this checklist is equally essential for a chef and for a developer. And, of course, both should ideally have a strong vision, passion for what they do, and a healthy appetite for innovation. I am sure you recognize yourself in this description. Happy cooking!