Observability and Application Performance
Making data-driven decisions, as well as business-critical and technical considerations, first comes down to the accuracy, depth, and usability of the data itself. To build the most performant and resilient applications, teams must stretch beyond monitoring into the world of data, telemetry, and observability. As a result, you'll gain a far deeper understanding of system performance, enabling you to tackle key challenges that arise from the distributed, modular, and complex nature of modern technical environments.

Today, and moving into the future, it's no longer about monitoring logs, metrics, and traces alone; instead, it's more deeply rooted in a performance-centric team culture, end-to-end monitoring and observability, and the thoughtful usage of data analytics.

In DZone's 2023 Observability and Application Performance Trend Report, we delve into emerging trends, covering everything from site reliability and app performance monitoring to observability maturity and AIOps, in our original research. Readers will also find insights from members of the DZone Community, who cover a selection of hand-picked topics, including the benefits and challenges of managing modern application performance, distributed cloud architecture considerations and design patterns for resiliency, observability vs. monitoring and how to practice both effectively, SRE team scalability, and more.
In the dynamic landscape of modern technology, the realm of Incident Management stands as a crucible where professionals are tested and refined. Incidents, ranging from minor hiccups to critical system failures, are not mere disruptions but opportunities for growth and learning. Within this crucible, we have traversed the challenging terrain of Incident Management. The collective experiences and insights offer a treasure trove of wisdom, illuminating the path for personal and professional development.

In this article, we delve deep into the core principles and lessons distilled from the crucible of Incident Management. Beyond the technical intricacies lies a tapestry of skills and virtues: adaptability, resilience, effective communication, collaborative teamwork, astute problem-solving, and a relentless pursuit of improvement. These are the pillars upon which successful incident response is built, shaping not just careers but entire mindsets and approaches to life's challenges. Through real-world anecdotes and practical wisdom, we unravel the transformative power of Incident Management. Join us on this journey of discovery, where each incident is not just a problem to solve but a stepping stone towards personal and professional excellence.

Incident Management Essentials: Navigating Through Challenges

Incident Management is a multifaceted discipline that requires a strategic approach and a robust set of skills to navigate through various challenges effectively. At its core, Incident Management revolves around the swift and efficient resolution of unexpected issues that can disrupt services, applications, or systems. One of the fundamental aspects of Incident Management is the ability to prioritize incidents based on their impact and severity. This involves categorizing incidents into different levels of urgency and criticality, akin to triaging patients in a hospital emergency room.
By prioritizing incidents appropriately, teams can allocate resources efficiently, focus efforts where they are most needed, and minimize the overall impact on operations and user experience.

Clear communication channels are another critical component of Incident Management. Effective communication ensures that all stakeholders, including technical teams, management, customers, and other relevant parties, are kept informed throughout the incident lifecycle. Transparent and timely communication not only fosters collaboration but also instills confidence in stakeholders that the situation is being addressed proactively.

Collaboration and coordination are key pillars of successful incident response. Incident Management often involves cross-functional teams working together to diagnose, troubleshoot, and resolve issues. Collaboration fosters collective problem-solving, encourages knowledge sharing, and enables faster resolution times. Additionally, establishing well-defined roles, responsibilities, and escalation paths ensures a streamlined and efficient response process.

Proactive monitoring and alerting systems play a crucial role in Incident Management. Early detection of anomalies, performance issues, or potential failures allows teams to intervene swiftly before they escalate into full-blown incidents. Implementing robust monitoring tools, setting up proactive alerts, and conducting regular health checks are essential proactive measures to prevent incidents or mitigate their impact.

Furthermore, incident documentation and post-mortem analysis are integral parts of Incident Management. Documenting incident details, actions taken, resolutions, and lessons learned not only provides a historical record but also facilitates continuous improvement. Post-incident analysis involves conducting a thorough root cause analysis, identifying contributing factors, and implementing corrective measures to prevent similar incidents in the future.
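The impact-and-severity triage described above can be sketched in a few lines of Python. The priority labels, fields, and the two-by-two matrix here are illustrative assumptions for the sketch, not part of any standard:

```python
# Illustrative incident triage: map impact and urgency to a priority label,
# mirroring the "hospital emergency room" analogy described in the text.
from dataclasses import dataclass

# Hypothetical priority matrix: (impact, urgency) -> label.
PRIORITY = {
    ("high", "high"): "P1",  # critical outage: all hands on deck
    ("high", "low"): "P2",
    ("low", "high"): "P3",
    ("low", "low"): "P4",    # backlog-level annoyance
}

@dataclass
class Incident:
    title: str
    impact: str   # "high" or "low": how many users/services are affected
    urgency: str  # "high" or "low": how fast it degrades if left alone

def triage(incident: Incident) -> str:
    """Return a priority label so responders work the most critical items first."""
    return PRIORITY[(incident.impact, incident.urgency)]

queue = [
    Incident("checkout returns 500s for all users", "high", "high"),
    Incident("stale avatar cache", "low", "low"),
]
# "P1" sorts before "P4" lexicographically, so this works the queue in order.
for inc in sorted(queue, key=triage):
    print(triage(inc), "-", inc.title)
```

A real triage policy would also weigh customer tier, contractual SLAs, and on-call capacity, but the core idea is the same: a small, explicit mapping that everyone on the team can read and argue about.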
In essence, navigating through challenges in Incident Management requires a blend of technical expertise, strategic thinking, effective communication, collaboration, proactive monitoring, and a culture of continuous improvement. By mastering these essentials, organizations can enhance their incident response capabilities, minimize downtime, and deliver superior customer experiences.

Learning from Challenges: The Post-Incident Analysis

The post-incident analysis phase is a critical component of Incident Management that goes beyond resolving the immediate issue. It serves as a valuable opportunity for organizations to extract meaningful insights, drive continuous improvement, and enhance resilience against future incidents. Here are several key points to consider during the post-incident analysis:

Root Cause Analysis (RCA)

Conducting a thorough RCA is essential to identify the underlying factors contributing to the incident. This involves tracing back the chain of events, analyzing system logs, reviewing configurations, and examining code changes to pinpoint the root cause accurately. RCA helps in addressing the core issues rather than just the symptoms, thereby preventing recurrence.

Lessons Learned Documentation

Documenting lessons learned from each incident is crucial for knowledge management and organizational learning. Capture insights, observations, and best practices discovered during the incident response process. This documentation serves as a valuable resource for training new team members, refining incident response procedures, and avoiding similar pitfalls in the future.

Process Improvement Recommendations

Use the findings from post-incident analysis to recommend process improvements and optimizations. This could include streamlining communication channels, revising incident response playbooks, enhancing monitoring and alerting thresholds, automating repetitive tasks, or implementing additional failover mechanisms.
Continuous process refinement ensures a more effective and efficient incident response framework.

Cross-Functional Collaboration

Involve stakeholders from various departments, including technical teams, management, quality assurance, and customer support, in the post-incident analysis discussions. Encourage open dialogue, share insights, and solicit feedback from diverse perspectives. Collaborative analysis fosters a holistic understanding of incidents and promotes collective ownership of incident resolution and prevention efforts.

Implementing Corrective and Preventive Actions (CAPA)

Based on the findings of the post-incident analysis, prioritize and implement corrective actions to address the immediate vulnerabilities or gaps identified. Additionally, develop preventive measures to mitigate similar risks in the future. CAPA initiatives may include infrastructure upgrades, software patches, security enhancements, or policy revisions aimed at strengthening resilience and reducing incident frequency.

Continuous Monitoring and Feedback Loop

Establish a continuous monitoring mechanism to track the effectiveness of implemented CAPA initiatives. Monitor key metrics such as incident recurrence rates, mean time to resolution (MTTR), customer satisfaction scores, and overall system stability. Solicit feedback from stakeholders and iterate on improvements to refine incident response capabilities over time.

By embracing a comprehensive approach to post-incident analysis, organizations can transform setbacks into opportunities for growth, innovation, and enhanced operational excellence. The insights gleaned from each incident serve as stepping stones towards building a more resilient and proactive incident management framework.

Enhancing Post-Incident Analysis With AI

The integration of Artificial Intelligence is revolutionizing Post-Incident Analysis, offering advanced capabilities that significantly augment traditional approaches.
Here's how AI can elevate the post-incident analysis (PIA) process:

Pattern Recognition and Incident Detection

AI algorithms excel in analyzing extensive historical data to identify patterns indicative of potential incidents. By detecting anomalies in system behavior or recognizing error patterns in logs, AI efficiently flags potential incidents for further investigation. This automated incident detection streamlines identification efforts, reducing manual workload and response times.

Advanced Root Cause Analysis (RCA)

AI algorithms are adept at processing complex data sets and correlating multiple variables. In RCA, AI plays a pivotal role in pinpointing the root cause of incidents by analyzing historical incident data, system logs, configuration changes, and performance metrics. This in-depth analysis facilitated by AI accelerates the identification of underlying issues, leading to more effective resolutions and preventive measures.

Predictive Analysis and Proactive Measures

Leveraging historical incident data and trends, AI-driven predictive analysis forecasts potential issues or vulnerabilities. By identifying emerging patterns or risk factors, AI enables proactive measures to mitigate risks before they escalate into incidents. This proactive stance not only reduces incident frequency and severity but also enhances overall system reliability and stability.

Continuous Improvement via AI Insights

AI algorithms derive actionable insights from post-incident analysis data. By evaluating the effectiveness of implemented corrective and preventive actions (CAPA), AI offers valuable feedback on intervention impact. These insights drive ongoing process enhancements, empowering organizations to refine incident response strategies, optimize resource allocation, and continuously enhance incident management capabilities.
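As a toy illustration of the pattern-recognition idea, anomalies in a metric stream can be flagged with nothing more than a rolling z-score. Production AIOps tooling uses far richer models; the window size, threshold, and latency series below are arbitrary choices for the sketch:

```python
# Minimal anomaly flagging over a metric series using a rolling z-score:
# a sample is flagged when it deviates by more than `threshold` standard
# deviations from the trailing `window` of observations.
from statistics import mean, stdev

def flag_anomalies(series, window=10, threshold=3.0):
    """Yield (index, value) pairs that look anomalous versus recent history."""
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            yield i, series[i]

# Synthetic request-latency samples (ms) with a sudden spike at the end.
latency_ms = [102, 99, 101, 100, 98, 103, 101, 100, 99, 102,
              101, 100, 450]
print(list(flag_anomalies(latency_ms)))  # only the 450 ms spike is flagged
```

Even this crude detector captures the essential workflow: learn "normal" from history, score new observations against it, and surface outliers for human investigation.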
Integrating AI into Post-Incident Analysis empowers organizations with data-driven insights, automation of repetitive tasks, and proactive risk mitigation, fostering a culture of continuous improvement and resilience in Incident Management.

Applying Lessons Beyond Work: Personal Growth and Resilience

The skills and lessons gained from Incident Management are highly transferable to various aspects of life. For instance, adaptability is crucial not only in responding to technical issues but also in adapting to changes in personal circumstances or professional environments. Teamwork teaches collaboration, conflict resolution, and empathy, which are essential in building strong relationships both at work and in personal life. Problem-solving skills honed during incident response can be applied to tackle challenges in any domain, from planning a project to resolving conflicts.

Resilience, the ability to bounce back from setbacks, is a valuable trait that helps individuals navigate through adversity with determination and a positive mindset. Continuous improvement is a mindset that encourages individuals to seek feedback, reflect on experiences, identify areas for growth, and strive for excellence. This attitude of continuous learning and development not only benefits individuals in their careers but also contributes to personal fulfillment and satisfaction.

Dispelling Misconceptions: What Lessons Learned Isn't

We highlight common misconceptions about lessons learned, clarifying that it's not about:

Emergency mindset: Lessons learned don't advocate for a perpetual emergency mindset but emphasize preparedness and maintaining a healthy, sustainable pace in incident response and everyday operations.

Assuming all situations are crises: It's essential to discern between true emergencies and everyday challenges, avoiding unnecessary stress and overreaction to non-critical issues.
Overemphasis on structure and protocol: While structure and protocols are important, rigid adherence can stifle flexibility and outside-the-box thinking. Lessons learned encourage a balance between following established procedures and embracing innovation.

Decisiveness at the expense of deliberation: Rapid decision-making is crucial during incidents, but rushing decisions can lead to regrettable outcomes. It's about finding the right balance between acting swiftly and ensuring thorough deliberation to avoid hasty or ill-informed decisions.

Short-term focus: Lessons learned extend beyond immediate goals and short-term fixes. It promotes a long-term perspective, strategic planning, and continuous improvement to address underlying issues and prevent recurring incidents.

Minimizing risk to the point of stagnation: While risk mitigation is important, excessive risk aversion can lead to missed opportunities for growth and innovation. Lessons learned encourage a proactive approach to risk management that balances security with strategic decision-making.

One-size-fits-all approach: Responses to incidents and lessons learned should be tailored to the specific circumstances and individuals involved. Avoiding a one-size-fits-all approach ensures that solutions are effective, relevant, and scalable across diverse scenarios.

Embracing Growth: Conclusion

In conclusion, Incident Management is more than just a set of technical processes or procedures. It's a mindset, a culture, and a journey of continuous growth and improvement. By embracing the core principles of adaptability, communication, teamwork, problem-solving, resilience, and continuous improvement, individuals can not only excel in their professional roles but also lead more fulfilling and meaningful lives.
Origin of Cell-Based Architecture

In the rapidly evolving domain of digital services, the need for scalable and resilient architectures (ones that recover quickly from failure) has never been greater. The introduction of cell-based architecture marks a pivotal shift tailored to meet the surging demands of hyper-scaling (an architecture's ability to scale rapidly in response to fluctuating demand). This methodology has become a foundation for digital success. It's a strategy that empowers tech behemoths like Amazon and Facebook, along with service platforms such as DoorDash, to skillfully navigate the tidal waves of digital traffic during peak moments and serve millions of users worldwide without a hitch.

Consider the surge Amazon faces on Prime Day or the global traffic spike Facebook navigates during significant events. Similarly, DoorDash's quest to flawlessly handle a flood of orders showcases a recurring theme: the critical need for an architecture that scales vertically and horizontally, expanding capacity without sacrificing system integrity or the user experience.

In the current landscape, where startups frequently encounter unprecedented growth rates, the dream of scaling quickly can become a nightmare of scalability issues. Hypergrowth, rapid expansion that surpasses expectations, presents a formidable challenge, risking a company's collapse if it fails to scale efficiently. This challenge birthed the concept of hyperscaling, emphasizing an architecture's nimbleness in adapting and growing to meet dynamic demands. Essential to this strategy are extensive parallelization and rigorous fault isolation, ensuring companies can scale without succumbing to the pitfalls of rapid growth.

Cell-based architecture emerges as a beacon for applications and services where downtime is not an option.
In scenarios where every second of inactivity spells significant reputational or financial loss, this architectural paradigm proves invaluable. It is especially crucial for:

Applications requiring uninterrupted operation to ensure customer satisfaction and maintain business continuity.

Financial services vital for maintaining economic stability.

Ultra-scale systems where failure is an unthinkable option.

Multi-tenant services requiring segregated resources for specific clients.

This architectural innovation was developed in direct response to the increasing needs of modern, rapidly expanding digital services. It provides a scalable, resilient framework supporting continuous service delivery and operational superiority.

Understanding Cell-Based Architecture

What Exactly Is Cell-Based Architecture?

Cell-based architecture is a modern approach to creating digital services that are both scalable and resilient, taking cues from the principles of distributed systems and microservices design patterns. This architecture breaks down an extensive system into smaller, independent units called cells. Each cell is self-sufficient, containing a specific segment of the system's functionality: data storage, compute, application logic, and dependencies. This modular setup allows each cell to be scaled, deployed, and managed independently, enhancing the system's ability to grow and recover from failures without widespread impact.

Drawing an analogy to urban planning, consider cell-based architecture akin to a well-designed metropolis where each neighborhood operates autonomously, equipped with its own services and amenities, yet contributes to the city's overall prosperity. In times of disruption, such as a power outage or a water main break, only the affected neighborhood experiences downtime while the rest of the city thrives.
Just as a single neighborhood can experience disruption without paralyzing the entire city, a cell encountering an issue in this architectural framework does not trigger a system-wide failure. This ensures the digital service remains robust and reliable, maintaining high uptime and resilience. In short, each cell is a self-contained unit with its own data storage and computing power, and because cells operate independently, a problem in one does not spread to the rest of the system.

Fig. 1: Cell-Based Architecture

Key Components

Cell: Akin to neighborhoods, cells are the foundational building blocks of this architecture. Each cell is an autonomous microservice cluster with resources capable of handling a subset of service responsibilities. A cell is a stand-alone version of the application with its own computing power, load balancer, and databases. This setup allows each cell to operate independently, making it possible to deploy, monitor, and maintain them separately. This independence means that if one cell runs into problems, it doesn't affect the others, which helps the system to scale effectively and stay robust.

Cell Router: Cell routers play a critical role similar to a city's traffic management system. They dynamically route requests to the most appropriate cell based on factors such as load, geographic location, or specific service requirements. By efficiently balancing the load across various cells, cell routers ensure that each request is processed by the cell best suited to handle it, optimizing system performance and the user experience, much like how traffic lights and signs direct the flow of vehicles to ensure smooth transit within a city.
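A cell router of the kind described above can be sketched as a deterministic hash lookup that also steers around unhealthy cells. The cell names and static health map below are hypothetical stand-ins for real service discovery and health checking:

```python
# Toy cell router: deterministically map a customer to a "home" cell,
# but fall back to the next healthy cell if the preferred one is down.
# Cell names and the HEALTHY map are illustrative, not a real deployment.
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]
HEALTHY = {"cell-a": True, "cell-b": True, "cell-c": False, "cell-d": True}

def route(customer_id: str) -> str:
    """Pick a home cell by stable hash, then walk forward past unhealthy cells."""
    # md5 gives a hash that is stable across processes and restarts
    # (unlike Python's builtin hash(), which is salted per process).
    start = int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % len(CELLS)
    for offset in range(len(CELLS)):
        cell = CELLS[(start + offset) % len(CELLS)]
        if HEALTHY.get(cell):
            return cell
    raise RuntimeError("no healthy cell available")

# The same customer always lands on the same cell while health is unchanged,
# which keeps their data and traffic pinned to one blast-radius boundary.
print(route("customer-42"))
```

Stable assignment matters here: if routing were random per request, a customer's state would be smeared across cells and the isolation benefit would be lost.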
Inter-Cell Communication Layer: Despite the autonomy of individual cells, cooperation between them is essential for handling tasks across the system. The Inter-Cell Communication Layer facilitates secure and efficient message exchange between cells. This layer acts as the public transportation system of our city analogy, connecting different neighborhoods (cells) to ensure seamless collaboration and unified service delivery across the entire architecture. It ensures that even as cells operate independently, they can still work together effectively, mirroring how different parts of a city are connected yet function cohesively.

Control Plane: The control plane is a critical component of cell-based architecture, acting as the central hub for administrative operations. It oversees tasks such as setting up new cells (provisioning), shutting down existing cells (de-provisioning), and moving customers between cells (migrating). This ensures that the infrastructure remains responsive to the system's and its users' needs, allowing for dynamic resource allocation and seamless service continuity.

Why and When to Use Cell-Based Architecture?

Why Use It?

Cell-based architecture offers a robust framework for efficiently scaling digital services, guaranteeing their resilience and adaptability during expansion. Below is a breakdown of its advantages:

Higher Scalability: By defining and managing the capacity of each cell, you can add more cells to scale out (handle growth by adding more system components, such as databases and servers, and spreading the workload evenly). This avoids hitting the resource limits that come with scaling up (accommodating growth by increasing the size of a system's component, such as a database, server, or subsystem). As demand grows, you add more cells, each a contained unit with known capacities, making the system inherently scalable.

Safer Deployments: Deployments and rollbacks are smoother with cells.
You can deploy changes to one cell at a time, minimizing the impact of any issues. Canary cells can be used to test new deployments under actual conditions with minimal risk, providing a safety net for broader deployment.

Easy Testability: Testing large, spread-out systems can be challenging, especially as they get bigger. With cell-based architecture, however, each cell is kept to a manageable size, making it much simpler to test how it behaves at its largest capacity. Testing a whole large service can be too expensive and complex, but testing just one cell is doable: you can simulate the largest workload the cell is expected to handle, comparable to the heaviest load a single customer might place on your application. This makes it practical and cost-effective to ensure each cell runs smoothly.

Lower Blast Radius: Cell-based architecture limits the spread of failures by isolating issues within individual cells, much like neighborhoods in a city. This division ensures that a problem in one cell doesn't affect the entire system, maintaining overall functionality. Each cell operates independently, minimizing any single incident's impact area, or "blast radius," akin to the regional isolation seen in large-scale services. This setup enhances system resilience by keeping disruptions contained and preventing widespread outages.

Fig. 2: Cell-based architecture services exhibit enhanced resilience to failures and feature a reduced blast radius compared to traditional services

Improved Reliability and Recovery

Higher Mean Time Between Failures (MTBF): Cell-based architecture increases the system's reliability by reducing how often problems occur. This design keeps each cell small and manageable, allowing for regular checks and maintenance, smoothing operations and making them more predictable. With customers distributed across different cells, any issues affect only a limited set of requests and users.
Changes are tested on just a few cells at a time, making it easy to revert without widespread impact. For example, if you have customers divided across ten cells, a problem in one cell affects only 10% of your customers. This controlled approach to managing changes and addressing issues quickly means the system experiences fewer disruptions, leading to a more stable and reliable service.

Lower Mean Time to Recovery (MTTR): Recovery is quicker and more straightforward with cells, since you deal with a smaller, contained issue rather than a system-wide problem.

Higher Availability: Cell-based architecture can lead to fewer and shorter failures, improving the overall uptime of your service. Even though there might be more potential points of failure (each cell could theoretically fail), the impact of each failure is significantly reduced, and failures are easier to fix.

When to Use It?

Here's a brief guide to help you understand when it's advantageous to use this architectural strategy:

High-Stakes Applications: If downtime could severely impact your customers, tarnish your reputation, or result in substantial financial loss, a cell-based approach can safeguard against widespread disruptions.

Critical Economic Infrastructure: Cell-based architecture ensures continuous operation for the financial services industry (FSI), where workloads are pivotal to economic stability.

Ultra-Scale Systems: Systems too large or critical to fail, those that must maintain operation under almost any circumstance, are prime candidates for cell-based design.

Stringent Recovery Objectives: Cell-based architecture offers quick recovery capabilities for workloads requiring a Recovery Point Objective (RPO) of less than 5 seconds and a Recovery Time Objective (RTO) of less than 30 seconds.

Multi-Tenant Services with Dedicated Needs: For services where tenants demand fully dedicated resources, assigning each tenant its own cell ensures isolation and dedicated performance.
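The ten-cell example above is simple division: with N equally sized cells, one failed cell touches roughly 1/N of customers. A quick sketch of that arithmetic, assuming customers are spread evenly (real distributions are rarely perfectly even):

```python
# Back-of-the-envelope blast radius: with N equal cells, k failed cells
# affect about k/N of customers. With N == 10 this reproduces the
# "problem in one cell affects only 10% of your customers" example.
def blast_radius(total_cells: int, failed_cells: int = 1) -> float:
    """Fraction of customers affected, assuming an even customer spread."""
    return failed_cells / total_cells

print(f"{blast_radius(10):.0%} of customers affected")   # ten cells
print(f"{blast_radius(100):.0%} of customers affected")  # a hundred cells
```

The same arithmetic also shows the trade-off: more cells shrink the blast radius, but each additional cell adds operational overhead, which is part of the cost discussed later.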
Although cell-based architecture brings considerable benefits to handling critical workloads, it also comes with its own hurdles, such as heightened complexity, elevated costs, the necessity for specialized tools and practices, and the need for investment in a routing layer. For a more in-depth analysis of these challenges, please see the "Weighing the Scales: Benefits and Challenges" section.

Implementing Cell-Based Architecture

This section highlights critical design factors that come into play while designing and implementing a cell-based architecture.

Designing a Cell

Cell design is a foundational aspect of cell-based architecture, where a system is divided into smaller, self-contained units known as cells. Each cell operates independently with its own resources, making the entire system more scalable and resilient. To embark on cell design, identify distinct functionalities within your system that can be isolated into individual cells. This might involve grouping services by their operational needs or user base. Once you've defined these boundaries, equip each cell with the necessary resources, such as databases and application logic, to ensure it can function autonomously. This setup facilitates targeted scaling and recovery and minimizes the impact of failures, as issues in one cell won't spill over to others. Implementing effective communication channels between cells and establishing comprehensive monitoring are crucial steps to maintain system cohesion and oversee cell performance. By systematically organizing your architecture into cells, you create a robust framework that enhances the manageability and adaptability of your system.

Here are a few ideas on cell design that can be leveraged to bolster system resilience:

Distribute Cells Across Availability Zones: By positioning cells across different availability zones (AZs), you can protect your system against the failure of a single data center or geographic location.
This geographical distribution ensures that even if one AZ encounters issues, other cells in different AZs can continue to operate, maintaining overall system availability and reducing the risk of complete service downtime.
- Implement Redundant Cell Configurations: Creating redundant copies of cells within and across AZs can further enhance resilience. This redundancy means that if one cell fails, its responsibilities can be immediately taken over by a duplicate cell, minimizing service disruption. This approach requires careful synchronization between cells to ensure data consistency but significantly improves fault tolerance.
- Design Cells for Autonomous Operation: Ensuring that each cell can operate independently, with its own set of resources, databases, and application logic, is crucial. This independence allows cells to be isolated from failures elsewhere in the system. Even if one cell experiences a problem, it won't spread to others, localizing the impact and making it easier to identify and rectify issues.
- Use Load Balancers and Cell Routers Strategically: Integrating load balancers and cell routers that are aware of cell locations and health statuses can help efficiently redirect traffic away from troubled cells or AZs. This dynamic routing capability allows for real-time adjustments to traffic flow, directing users to the healthiest available cells and balancing the load to prevent overburdening any single cell or AZ.
- Facilitate Easy Cell Replication and Deployment: Design cells with replication and redeployment in mind. In case of a cell or AZ failure, having mechanisms for quickly spinning up new cells in alternative locations can be invaluable. Automation tools and templates for cell deployment can expedite this process, reducing recovery times and enhancing overall system resilience.
- Regularly Test Failover Processes: Regular testing of cell failover processes, including simulated failures and recovery drills, can ensure that your system responds as expected during actual outages. These tests can reveal potential weaknesses in your cell design and failover strategies, allowing for continuous improvement of system resilience.

By incorporating these ideas into your cell design, you can create a more resilient system capable of withstanding various failure scenarios while minimizing the impact on service availability and performance.

Cell Partitioning

Cell partitioning is a crucial technique in cell-based architecture. It focuses on dividing a system's workload among distinct cells to optimize performance, scalability, and resilience. It involves categorizing and directing user requests or data to specific cells based on predefined criteria. This process ensures no cell becomes overwhelmed, enhancing system reliability and efficiency.

How Cell Partitioning Can Be Done:

- Identify Partition Criteria: Determine the basis for distributing workloads among cells. Typical criteria include geographic location, user ID, request type, or date range. This step is pivotal in defining how the system categorizes and routes requests to the appropriate cells.
- Implement Routing Logic: Develop a routing mechanism within the cell router or API gateway that uses the identified criteria to direct incoming requests to the correct cell. This might involve dynamic decision-making algorithms that consider current cell load and availability.
- Continuous Monitoring and Adjustment: Regularly monitor the performance and load distribution across cells. Use this data to adjust partitioning criteria and routing logic to maintain optimal system performance and scalability.
Partitioning Algorithms

Several algorithms can be utilized for effective cell partitioning, each with its strengths and tailored to different types of workloads and system requirements:

- Consistent Hashing: Requests are distributed based on the hash values of the partition key (e.g., user ID), ensuring even workload distribution and minimal reorganization when cells are added or removed.
- Range-Based Partitioning: Divides data into ranges (e.g., alphabetical or numerical) and assigns each range to a specific cell. This is ideal for ordered data, allowing efficient query operations.
- Round Robin: This method distributes requests evenly across all available cells in a cyclic manner. It is straightforward and helpful in achieving a basic level of load balancing.
- Sharding: Similar to range-based partitioning but more complex, sharding involves splitting large databases into smaller, faster, more easily managed parts, or "shards," each handled by a separate cell.
- Dynamic Partitioning: Adjusts partitioning in real time based on workload characteristics or system performance metrics. This approach requires advanced algorithms capable of analyzing system states and making immediate adjustments.

By thoughtfully implementing cell partitioning and choosing the appropriate algorithm, you can significantly enhance your cell-based architecture's performance, scalability, and resilience. Regular review and adjustment of your partitioning strategy ensures it continues to meet your system's evolving needs.

Implementing a Cell Router

In cell-based architecture, the cell router is crucial for steering traffic to the correct cells, ensuring efficient workload management and scalability. An effective cell router hinges on two key elements: traffic routing logic and failover strategies, which maintain system reliability and optimize performance.
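To make the partitioning discussion concrete, here is a minimal Java sketch of consistent hashing, the first algorithm listed above, as a cell router might use it to map a partition key to a cell. The cell names, the CRC32 hash, and the virtual-node count are illustrative choices rather than a prescribed implementation; a production router would typically rely on a hardened library.

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Minimal consistent-hash ring: maps a partition key (e.g., a user ID) to a cell.
// Each cell owns many "virtual nodes" on the ring to smooth the distribution;
// adding or removing a cell only remaps the keys on that cell's ring positions.
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 100;
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public void addCell(String cellId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(cellId + "#" + i), cellId);
        }
    }

    public void removeCell(String cellId) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(cellId + "#" + i));
        }
    }

    public String cellFor(String partitionKey) {
        // Walk clockwise to the first virtual node at or after the key's hash.
        SortedMap<Long, String> tail = ring.tailMap(hash(partitionKey));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String value) {
        CRC32 crc = new CRC32();
        crc.update(value.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

A router holding such a ring would call cellFor(userId) on every request; because ownership is spread over many virtual nodes, adding or removing a cell moves only roughly 1/N of the keys.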
Implementing Traffic Routing Logic

Start by defining the criteria for how requests are directed to various cells, including the users' geographic location, the type of request, and the specific services needed. The aim is to reduce latency and evenly distribute the load. Employ dynamic routing that adapts to cell availability and workload changes in real time, possibly through integration with a service discovery tool that monitors each cell's status and location.

Establishing Failover Strategies

Solid failover processes are essential for the cell router to ensure the system's dependability. Should any cell become unreachable, the router must automatically reroute traffic to the next available cell, requiring minimal manual intervention. This is achieved by implementing health checks across cells to swiftly identify and respond to failures, thus keeping the user experience smooth and the service highly available, even during cell outages.

Fig 3. The cell router ensures a smooth user experience by redirecting traffic to healthy cells during outages, maintaining uninterrupted service availability

For the practical implementation of a cell router, you can take one of the following approaches:

- Load Balancers: Use cloud-based load balancers that dynamically direct traffic based on specific request attributes, such as URL paths or headers, according to set rules.
- API Gateways: An API gateway can serve as the primary entry point for all incoming requests and route them to the appropriate cell based on configured logic.
- Service Mesh: A service mesh offers a network layer that facilitates efficient service-to-service communication and routes requests based on policies, service discovery, and health status.
- Custom Router Service: Developing a custom service allows routing decisions based on detailed request content, current cell load, or bespoke business logic, offering tailored control over traffic management.
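As a sketch of the custom router option, the hypothetical class below combines key-based routing with the failover strategy described above: a request normally goes to its key's home cell, and traffic fails over to the next healthy cell when health checks mark the home cell down. The class and method names are invented for illustration and do not come from any specific framework.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical custom cell router: picks the cell assigned to a partition key,
// failing over clockwise to the next healthy cell when the home cell is down.
public class CellRouter {
    private final List<String> cells;
    private final Map<String, Boolean> health = new ConcurrentHashMap<>();

    public CellRouter(List<String> cells) {
        this.cells = cells;
        cells.forEach(c -> health.put(c, true));
    }

    // Called by a background health-check loop (not shown here).
    public void markHealth(String cellId, boolean healthy) {
        health.put(cellId, healthy);
    }

    public String route(String partitionKey) {
        int start = Math.floorMod(partitionKey.hashCode(), cells.size());
        // Try the home cell first, then the next healthy cell in order.
        for (int i = 0; i < cells.size(); i++) {
            String candidate = cells.get((start + i) % cells.size());
            if (health.getOrDefault(candidate, false)) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy cells available");
    }
}
```

A background loop polling each cell's health endpoint would drive markHealth; a production router would typically also weigh cell load, locality, and sticky assignments rather than hashing alone.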
Choosing the right implementation strategy for a cell router depends on specific needs, such as the granularity of routing decisions, integration capabilities with existing systems, and management simplicity. Each method provides varying degrees of control, complexity, and adaptability to cater to distinct architectural requirements.

Cell Sizing

Cell sizing in a cell-based architecture refers to determining each cell's optimal size and capacity to ensure it can handle its designated workload effectively without becoming overburdened. Proper cell sizing is crucial for several reasons:

- Balanced Load Distribution: Correctly sized cells help achieve a balanced distribution of workloads across the system, preventing any single cell from becoming a bottleneck.
- Scalability: Well-sized cells can scale more efficiently. As demand increases, the system can add more cells or adjust resources within existing cells to accommodate growth.
- Resilience and Recovery: Smaller, well-defined cells can isolate failures more effectively, limiting the impact of any single point of failure. This makes the system more resilient and simplifies recovery processes.
- Cost Efficiency: Optimizing cell size helps utilize resources more efficiently, avoiding unnecessary expenditure on underutilized capacity.

How Is Cell Sizing Done?

Cell sizing involves a careful analysis of several factors:

- Workload Analysis: Understand the nature and volume of each cell's workload. This includes peak demand times, data throughput, and processing requirements.
- Resource Requirements: Based on the workload analysis, estimate the resources (CPU, memory, storage) each cell needs to operate effectively under various conditions.
- Performance Metrics: Consider key performance indicators (KPIs) that define successful cell operation. This could include response times, error rates, and throughput.
- Scalability Goals: Define how the system should scale in response to increased demand.
This will influence whether cells should be designed to scale up (increase resources in a cell) or scale out (add more cells).
- Testing and Adjustment: Validate cell size assumptions by testing under simulated workload conditions. Monitoring real-world performance and adjusting as needed is a continuous part of cell sizing.

Effective cell sizing often involves a combination of theoretical analysis and empirical testing. Starting with a best-guess estimate based on workload characteristics and adjusting based on observed performance ensures that cells remain efficient, responsive, and cost-effective as the system evolves.

Cell Deployment

Cell deployment in a cell-based architecture is the process of distributing and managing your application's workload across multiple self-contained units called cells. This strategy ensures scalability, resilience, and efficient resource use. Here's a concise guide on how it's typically done and the technology choices available for effective implementation.

How Is Cell Deployment Done?

- Automated Deployment Pipelines: Start by setting up automated deployment pipelines. These pipelines handle your application's packaging, testing, and deployment to various cells. Automation ensures consistency, reduces errors, and enables rapid deployment across cells.
- Blue/Green Deployments: Use blue/green deployment strategies to minimize downtime and reduce risk. By deploying the new version of your application to a separate environment (green) while keeping the current version (blue) running, you can switch traffic to the latest version once it's fully ready and tested.
- Canary Releases: Gradually roll out updates to a small subset of cells or users before making them available system-wide. This allows you to monitor the impact of changes and roll them back if necessary without affecting all users.
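The canary release step above can be sketched as a deterministic percentage gate. The class below is a hypothetical illustration, not a feature of any particular deployment tool: the same partition key always lands on the same side of the gate, so a given user sees a consistent version while you ramp the percentage up, and dropping it to zero rolls everyone back.

```java
// Hypothetical canary gate: deterministically routes a small, stable
// percentage of partition keys to the canary version of a cell.
public class CanaryGate {
    private final int canaryPercent;

    public CanaryGate(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    // The same key always gets the same answer, so a user's experience
    // stays consistent for the duration of the rollout.
    public boolean inCanary(String partitionKey) {
        int bucket = Math.floorMod(partitionKey.hashCode(), 100);
        return bucket < canaryPercent;
    }
}
```

A typical rollout would ramp canaryPercent through something like 1, 5, 25, 100 while watching per-cell error rates and latency at each step.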
Technology Choices for Cell Deployment:

- Container Orchestration Tools: Tools such as Kubernetes, AWS ECS, and Docker Swarm are crucial for orchestrating cell deployments, enabling the encapsulation of applications into containers for streamlined deployment, scaling, and management across various cells.
- CI/CD Tools: Continuous Integration and Continuous Deployment (CI/CD) tools such as Jenkins, GitLab CI, CircleCI, and AWS CodePipeline facilitate the automation of testing and deployment processes, ensuring that new code changes can be efficiently rolled out.
- Infrastructure as Code (IaC): Tools like Terraform and AWS CloudFormation allow you to define your infrastructure in code, making it easier to replicate and deploy cells across different environments or cloud providers.
- Service Meshes: Service meshes like Istio or Linkerd provide advanced traffic management capabilities, including canary deployments and service discovery, which are crucial for managing communication and cell updates.

By leveraging these deployment strategies and technologies, you can achieve a high degree of automation and control in your cell deployments, ensuring your application remains scalable, reliable, and easy to manage.

Cell Observability

Cell observability is crucial in a cell-based architecture to ensure you have comprehensive visibility into each cell's health, performance, and operational metrics. It allows you to monitor, troubleshoot, and optimize the system effectively, enhancing overall reliability and user experience.

Implementing Cell Observability

To achieve thorough cell observability, focus on three key areas: logging, monitoring, and tracing. Logging captures detailed events and operations within each cell. Monitoring tracks key performance indicators and health metrics in real time. Tracing follows requests as they move through the cells, identifying bottlenecks or failures in the workflow.
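As a small illustration of the logging and tracing points above, the sketch below tags every log event with the emitting cell's ID and a trace ID, so a central pipeline can slice health signals per cell and stitch a request's path across cells. The JSON field names and the CellLogger class are assumptions for illustration, not a specific logging library's API.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

// Illustrative structured-log line that tags every event with its cell ID,
// so a central pipeline (e.g., an ELK stack) can aggregate health per cell.
public class CellLogger {
    private final String cellId;

    public CellLogger(String cellId) {
        this.cellId = cellId;
    }

    public String format(String level, String message, String traceId) {
        Map<String, String> fields = new TreeMap<>();
        fields.put("ts", Instant.now().toString());
        fields.put("cell", cellId);       // which cell emitted the event
        fields.put("level", level);
        fields.put("trace_id", traceId);  // lets tracing stitch hops across cells
        fields.put("msg", message);
        StringBuilder sb = new StringBuilder("{");
        fields.forEach((k, v) -> sb.append('"').append(k).append("\":\"").append(v).append("\","));
        sb.setLength(sb.length() - 1);    // drop the trailing comma
        return sb.append('}').toString();
    }
}
```

With every cell emitting the same fields, a dashboard can group error rates by the cell field and spot an unhealthy cell before its users do.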
Technology Choices for Cell Observability:

- Logging Tools: Solutions like Elasticsearch, Logstash, and Kibana (the ELK Stack) or Splunk provide powerful logging capabilities, allowing you to aggregate and analyze logs from all cells centrally.
- Monitoring Solutions: Prometheus, coupled with Grafana for visualization, offers robust monitoring capabilities with support for custom metrics. Cloud-native services like Amazon CloudWatch or Google Cloud Operations (formerly Stackdriver) provide integrated monitoring solutions tailored for applications deployed on their respective cloud platforms.
- Distributed Tracing Systems: Tools like Jaeger, Zipkin, and AWS X-Ray enable distributed tracing, helping you to understand the flow of requests across cells and identify latency issues or failures in microservice interactions.
- Service Meshes: Service meshes such as Istio or Linkerd inherently offer observability features, including monitoring, logging, and tracing requests between cells without requiring changes to your application code.

By leveraging these tools and focusing on comprehensive observability, you can ensure that your cell-based architecture remains performant, resilient, and capable of supporting your application's dynamic needs.

Weighing the Scales: Benefits and Challenges

Adopting cell-based architecture transforms the structural and operational dynamics of digital services. Breaking down a service into independently scalable and resilient units (cells) offers a robust framework for managing complexity and ensuring system availability. However, this architectural paradigm also introduces new challenges and complexities. Here's a deeper dive into the technical advantages and considerations.

Benefits

- Horizontal Scalability: Unlike traditional scale-up approaches, cell-based architecture enables horizontal scaling by adding more cells.
This method alleviates common bottlenecks associated with centralized databases or shared resources, allowing for linear scalability as user demand increases.
- Fault Isolation and Resilience: The architecture's compartmentalized design ensures that failures are contained within individual cells, significantly reducing the system's overall blast radius. This isolation enhances the system's resilience, as issues in one cell can be mitigated or repaired without impacting the entire service.
- Deployment Agility: Leveraging cells allows for incremental deployments and feature rollouts, akin to implementing rolling updates across microservices. This granularity in deployment strategy minimizes downtime and enables a more flexible response to market or user demands.
- Simplified Operational Complexity: While the initial setup is complex, the ongoing operation and management of cells can be more straightforward than in monolithic architectures. Each cell's autonomy simplifies monitoring, troubleshooting, and scaling efforts, as operational tasks can be executed in parallel across cells.

Challenges (Considerations)

- Architectural Complexity: Transitioning to or implementing cell-based architecture demands a meticulous design phase, focusing on defining cell boundaries, data partitioning strategies, and inter-cell communication protocols. This complexity requires a deep understanding of distributed systems principles and may necessitate a shift in development and operational practices.
- Resource and Infrastructure Overhead (Higher Cost): Each cell operates with its own set of resources and infrastructure, potentially leading to increased overhead compared to shared-resource models. Optimizing resource utilization and cost-efficiency becomes paramount, especially as the number of cells grows.
- Inter-Cell Communication Management: Ensuring coherent and efficient communication between cells without introducing tight coupling or significant latency is a critical challenge.
Developers must design a communication layer that supports the necessary interactions while maintaining cells' independence and avoiding performance bottlenecks.
- Data Consistency and Synchronization: Maintaining data consistency across cells, especially in scenarios requiring global state or real-time data synchronization, adds another layer of complexity. Implementing strategies like event sourcing, CQRS (Command Query Responsibility Segregation), or distributed sagas may be necessary to address these challenges.
- Specialized Tools and Practices: Operating a cell-based architecture requires specialized operational tools and practices to effectively manage multiple instances of workloads.
- Routing Layer Investment: A robust cell routing layer is essential for directing traffic appropriately across cells, necessitating additional investment in technology and expertise.

Navigating the Trade-Offs

Opting for cell-based architecture involves navigating these trade-offs and evaluating whether the benefits of scalability, resilience, and operational agility outweigh the complexities of implementation and management. It is most suitable for services requiring high availability, those undergoing rapid expansion, or systems where modular scaling and failure isolation are critical.

Best Practices and Pitfalls

Best Practices

Adopting a cell-based architecture can significantly enhance the scalability and resilience of your applications. Here are streamlined best practices for implementing this approach effectively:

Begin With a Solid Foundation
- Treat Your Current Setup as Cell Zero: View your existing system as the initial cell, gradually introducing traffic routing and distribution across new cells.
- Launch with Multiple Cells: Implement more than one cell from the beginning to quickly learn and adapt to the operational dynamics of a cell-based environment.
Plan for Flexibility and Growth
- Implement a Cell Migration Mechanism Early: Prepare for the need to move customers between cells, ensuring you can scale and adjust without disruption.

Focus on Reliability
- Conduct a Failure Mode Analysis: Identify and assess potential failures within each cell and their impact, developing strategies to ensure robustness and minimize cross-cell effects.

Ensure Independence and Security
- Maintain Cell Autonomy: Design cells to be self-sufficient, with dedicated resources and clear ownership, possibly by a single team.
- Secure Communication: Use versioned, well-defined APIs for cell interactions and enforce security policies at the API gateway level.
- Minimize Dependencies: Keep inter-cell dependencies low to preserve the architecture's benefits, such as fault isolation.

Optimize Deployment and Operations
- Avoid Shared Resources: Each cell should have its own data storage to eliminate global state dependencies.
- Deploy in Waves: Introduce updates and deployments in phases across cells for better change management and quick rollback capabilities.

By following these practices, you can leverage cell-based architecture to create scalable, resilient, manageable, and secure systems ready to meet the challenges of modern digital demands.

Common Pitfalls

While cell-based architecture offers significant advantages for scalability and resilience, it also introduces specific challenges and pitfalls that organizations need to be aware of when adopting this approach:

Complexity in Management and Operations
- Increased Operational Overhead: Managing multiple cells can introduce complexity in deployment, monitoring, and operations, requiring robust automation and orchestration tools to maintain efficiency.
- Consistency Management: Ensuring data consistency across cells, especially in stateful applications, can be challenging and might require sophisticated synchronization mechanisms.
Initial Setup and Migration Challenges
- Complex Migration Process: Transitioning to a cell-based architecture from a traditional setup can be complex, requiring careful planning to avoid service disruption and data loss.
- Steep Learning Curve: Teams may face a learning curve in understanding cell-based concepts and best practices, necessitating training and potentially slowing initial progress.

Design and Architectural Considerations
- Cell Isolation: Properly isolating cells to prevent failure propagation requires meticulous design, failing which the system might not fully realize the benefits of fault isolation.
- Optimal Cell Size: Determining the optimal size for cells can be tricky, as overly small cells may lead to inefficiencies, while overly large cells might compromise scalability and resilience.

Resource Utilization and Cost Implications
- Potential for Increased Costs: If not carefully managed, the duplication of resources across cells can lead to increased operational costs.
- Underutilization of Resources: Balancing resource allocation to prevent underutilization while avoiding over-provisioning requires continuous monitoring and adjustment.

Networking and Communication Overhead
- Network Complexity: Cell-based architecture may introduce additional network complexity, including the need for sophisticated routing and load-balancing strategies.
- Inter-Cell Communication: Ensuring efficient and secure communication between cells, especially in geographically distributed setups, can introduce latency and requires secure, reliable networking solutions.

Security and Compliance
- Security Configuration: Each cell's need for individual security configurations can complicate enforcing consistent security policies across the architecture.
- Compliance Verification: Verifying that each cell complies with regulatory requirements can be more challenging in a distributed architecture, requiring robust auditing mechanisms.

Scalability vs. Cohesion Trade-Off
- Dependency Management: While minimizing dependencies between cells enhances fault tolerance, it can also lead to challenges in maintaining application cohesion and consistency.
- Data Duplication: Avoiding shared resources may result in data duplication and synchronization challenges, impacting system performance and consistency.

Organizations should invest in robust planning, adopt comprehensive automation and monitoring tools, and ensure ongoing team training to mitigate these pitfalls. Understanding these challenges upfront can help design a more resilient, scalable, and efficient cell-based architecture.

Cell-Based Wins in the Real World

Cell-based architecture has become essential for managing scalability and ensuring system resilience, from high-growth startups to tech giants like Amazon and Facebook. This architectural model has been adopted across various industries, reflecting its effectiveness in handling large-scale, critical workloads. Here's a brief look at how DoorDash, Slack, and Roblox have implemented cell-based architecture to address their unique challenges.

DoorDash's Transition to Cell-Based Architecture

Faced with the demands of hypergrowth, DoorDash migrated from a monolithic system to a cell-based architecture, marking a pivotal shift in its operational strategy. This transition, known as Project SuperCell, was driven by the need to efficiently manage fluctuating demand and maintain consistent service reliability across diverse markets. By leveraging AWS's cloud infrastructure, DoorDash was able to isolate failures within individual cells, preventing widespread system disruptions. This significantly enhanced its ability to scale resources and maintain service reliability, even during peak times, demonstrating the transformative potential of adopting a cell-based approach.
Slack's Migration to Cell-Based Architecture

Slack underwent a major shift to a cell-based architecture to lessen the impact of gray failures and boost service redundancy. This move, prompted by a review of a network outage that revealed the risks of depending solely on a single availability zone, aims to confine failures more effectively and minimize the extent of potential site outages. With the adoption of isolated services in each availability zone, Slack has enabled its internal services to function independently within each zone, curtailing the fallout from outages and speeding up the recovery process. This significant redesign has markedly improved Slack's system resilience, underscoring cell-based architecture's role in ensuring high service availability and quality.

Roblox's Strategic Shift to Cellular Infrastructure

Roblox's shift to a cell-based architecture showcases its response to rapid growth and the need to support over 70 million daily active users with reliable, low-latency experiences. Roblox created isolated clusters within its data centers by adopting a cellular infrastructure, enhancing system resilience through service replication across cells. This setup allowed for the deactivation of non-functional cells without disrupting service, effectively containing failures. The move to cellular infrastructure has significantly boosted Roblox's system reliability, enabling the platform to offer always-on, immersive experiences worldwide. This strategy highlights the effectiveness of cell-based architecture in managing large-scale, dynamic workloads and maintaining high service quality as platforms expand.

These examples from DoorDash, Slack, and Roblox illustrate the strategic value of cell-based architecture in addressing the challenges of scale and reliability.
By isolating workloads into independent cells, these companies have achieved greater scalability, fault tolerance, and operational efficiency, showcasing the effectiveness of this approach in supporting dynamic, high-demand services.

Key Takeaways

Cell-based architecture represents a transformative approach for organizations aiming to achieve hyper-scalability and resilience in the digital era. Companies like Amazon, Facebook, DoorDash, and Slack have demonstrated its efficacy in managing hypergrowth and ensuring uninterrupted service by segmenting systems into independent, self-sufficient cells. This architectural strategy facilitates dynamic scaling and robust fault isolation but also demands careful consideration of increased complexity, resource allocation, and the need for specialized operational tools. As businesses continue to navigate the demands of digital growth, the adoption of cell-based architecture emerges as a strategic solution for sustaining operational integrity and delivering consistent user experiences amidst the ever-evolving digital landscape.

Acknowledgments

This article draws upon the collective knowledge and experiences of industry leaders and practitioners, including insights from technical blogs, case studies from companies like Amazon, Slack, and DoorDash, and contributions from the wider tech community.

References

- https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.html
- https://github.com/wso2/reference-architecture/blob/master/reference-architecture-cell-based.md
- https://newsletter.systemdesign.one/p/cell-based-architecture
- https://highscalability.com/cell-architectures/
- https://www.youtube.com/watch?v=ReRrhU-yRjg
- https://slack.engineering/slacks-migration-to-a-cellular-architecture/
- https://blog.roblox.com/2023/12/making-robloxs-infrastructure-efficient-resilient/
Serverless architectures have emerged as a paradigm-shifting approach to building fast, scalable, and cost-efficient applications. While serverless architectures provide unparalleled flexibility, they also introduce new challenges in terms of monitoring and troubleshooting. In this article, we'll explore how Quarkus integrates with AWS X-Ray and how using a Jakarta CDI interceptor can keep your code clean while adding custom instrumentation.

Quarkus and AWS Lambda

Quarkus is a Java-based framework tailored for GraalVM and HotSpot, which results in an amazingly fast boot time while having an incredibly low memory footprint. It offers near-instant scale-up and high-density memory utilization, which can be very useful for container orchestration platforms like Kubernetes or serverless runtimes like AWS Lambda. Building AWS Lambda functions can be as easy as starting a Quarkus project, adding the quarkus-amazon-lambda dependency, and defining your AWS Lambda handler function.

```xml
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-amazon-lambda</artifactId>
</dependency>
```

An extensive guide on how to develop AWS Lambda functions with Quarkus can be found in the official Quarkus AWS Lambda Guide.

Enabling X-Ray for Your Lambda Functions

Quarkus provides out-of-the-box support for X-Ray, but you will need to add a dependency to your project and configure some settings to make it work with GraalVM/native compiled Quarkus applications. Let's first start by adding the quarkus-amazon-lambda-xray dependency.

```xml
<!-- adds dependency on required x-ray classes and adds support for graalvm native -->
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-amazon-lambda-xray</artifactId>
</dependency>
```

Don't forget to enable tracing for your Lambda function; otherwise, it won't work. One way of doing that is by setting the tracing argument to active within your AWS CDK code.

```java
function = Function.Builder.create(this, "feed-parsing-function")
    ...
    .memorySize(512)
    .tracing(Tracing.ACTIVE)
    .runtime(Runtime.PROVIDED_AL2023)
    .logRetention(RetentionDays.ONE_WEEK)
    .build();
```

After the deployment of your function and a function invocation, you should be able to see the X-Ray traces from within the CloudWatch interface. By default, it will show you some basic timing information for your function, like the initialization and the invocation duration.

Adding More Instrumentation

Now that the dependencies are in place and tracing is enabled for our function, we can enrich the traces in X-Ray by leveraging the X-Ray SDK's TracingInterceptor. For instance, for the SQS and DynamoDB clients, you can explicitly set the interceptor inside the application.properties file.

```properties
quarkus.dynamodb.async-client.type=aws-crt
quarkus.dynamodb.interceptors=com.amazonaws.xray.interceptors.TracingInterceptor
quarkus.sqs.async-client.type=aws-crt
quarkus.sqs.interceptors=com.amazonaws.xray.interceptors.TracingInterceptor
```

After putting these properties in place, redeploying, and executing the function, the TracingInterceptor will wrap around each API call to SQS and DynamoDB and record the call details as part of the trace. This is very useful for debugging purposes, as it allows you to validate your code and check for any mistakes. Requests to AWS services are part of the pricing model, so if you make a mistake in your code and make too many calls, it can become quite costly.

Custom Subsegments

With the AWS SDK TracingInterceptor configured, we get information about the calls to the AWS APIs, but what if we want to see information about our own code or remote calls to services outside of AWS? The Java SDK for X-Ray supports the concept of adding custom subsegments to your traces. You can add subsegments to a trace by adding a few lines of code to your own business logic, as you can see in the following code snippet.
```java
public void someMethod(String argument) {
    // wrap in subsegment
    Subsegment subsegment = AWSXRay.beginSubsegment("someMethod");
    try {
        // Your business logic
    } catch (Exception e) {
        subsegment.addException(e);
        throw e;
    } finally {
        AWSXRay.endSubsegment();
    }
}
```

Although this is trivial to do, it will become quite messy if you have a lot of methods you want to apply tracing to. This isn't ideal, and it would be better if we didn't have to mix our own code with the X-Ray instrumentation.

Quarkus and Jakarta CDI Interceptors

The Quarkus programming model is based on the Lite version of the Jakarta Contexts and Dependency Injection 4.0 specification. Besides dependency injection, the specification also describes other features like:

Lifecycle Callbacks — A bean class may declare lifecycle @PostConstruct and @PreDestroy callbacks.
Interceptors — Used to separate cross-cutting concerns from business logic.
Decorators — Similar to interceptors, but because they implement interfaces with business semantics, they are able to implement business logic.
Events and Observers — Beans may also produce and consume events to interact in a completely decoupled fashion.

As mentioned, CDI interceptors are used to separate cross-cutting concerns from business logic. As tracing is a cross-cutting concern, this sounds like a great fit. Let's take a look at how we can create an interceptor for our AWS X-Ray instrumentation.

How to Create an Interceptor for AWS X-Ray Instrumentation

We start with defining our interceptor binding, which we will call XRayTracing. Interceptor bindings are intermediate annotations that may be used to associate interceptors with target beans.
```java
package com.jeroenreijn.aws.quarkus.xray;

import jakarta.annotation.Priority;
import jakarta.interceptor.InterceptorBinding;

import java.lang.annotation.Retention;

import static java.lang.annotation.RetentionPolicy.RUNTIME;

@InterceptorBinding
@Retention(RUNTIME)
@Priority(0)
public @interface XRayTracing { }
```

The next step is to define the actual interceptor logic, which is the code that will add the additional X-Ray instructions for creating the subsegment and wrapping it around our business logic.

```java
package com.jeroenreijn.aws.quarkus.xray;

import com.amazonaws.xray.AWSXRay;
import jakarta.interceptor.AroundInvoke;
import jakarta.interceptor.Interceptor;
import jakarta.interceptor.InvocationContext;

@Interceptor
@XRayTracing
public class XRayTracingInterceptor {

    @AroundInvoke
    public Object tracingMethod(InvocationContext ctx) throws Exception {
        AWSXRay.beginSubsegment("## " + ctx.getMethod().getName());
        try {
            return ctx.proceed();
        } catch (Exception e) {
            AWSXRay.getCurrentSubsegment().addException(e);
            throw e;
        } finally {
            AWSXRay.endSubsegment();
        }
    }
}
```

An important part of the interceptor is the @AroundInvoke annotation, which means that this interceptor code will be wrapped around the invocation of our own business logic. Now that we've defined both our interceptor binding and our interceptor, it's time to start using it. Every method that we want to create a subsegment for can now be annotated with the @XRayTracing annotation.

```java
@XRayTracing
public SyndFeed getLatestFeed() {
    InputStream feedContent = getFeedContent();
    return getSyndFeed(feedContent);
}

@XRayTracing
public SyndFeed getSyndFeed(InputStream feedContent) {
    try {
        SyndFeedInput feedInput = new SyndFeedInput();
        return feedInput.build(new XmlReader(feedContent));
    } catch (FeedException | IOException e) {
        throw new RuntimeException(e);
    }
}
```

That looks much better. Pretty clean, if I say so myself.
Based on the hierarchy of subsegments for a trace, X-Ray will be able to show a nested tree structure with the timing information.

Closing Thoughts

The integration between Quarkus and X-Ray is quite simple to enable. The developer experience of defining the interceptors on a per-client basis is really good out of the box. With the help of CDI interceptors, you can keep your code clean without worrying too much about X-Ray-specific code inside your business logic. An alternative to building your own interceptor might be to start using AWS Powertools for Lambda (Java). Powertools for Java is a great way to boost your developer productivity, but as it can be used for more than X-Ray, I'll save it for another post.
Alternative Text: This comic depicts an interaction between two characters and is split into four panes. In the upper left pane, Character 1 enters the scene with a slightly agitated expression and comments to Character 2, "Your PR makes SQL injection possible!" Character 2, who is typing away at their computer, responds happily, "Wow, that wasn't even my intention," as if Character 1 has paid them a compliment. In the upper right pane, Character 1, now with an increasingly agitated expression, says, "I mean, your code is vulnerable." Character 2, now standing and facing Character 1, is almost proudly embarrassed at what they take as positive feedback and replies, "Stop praising me, I get shy." In the lower-left pane, Character 1, now shown with sharp teeth and a scowl, points a finger at Character 2 and shouts clearly, "Vulnerable is bad!" Character 2 seems shocked at this statement, standing with their mouth and eyes wide open. In the lower right and final pane of the comic, Character 2, smiling once again, replies with the comment, "At least it can do SQL injection!" Character 1 stares back at Character 2 with a blank expression.
In today's digital world, mobile apps play a crucial role in our daily lives. They serve a range of purposes, from transactions and online shopping to social interactions and work efficiency, making them essential. However, with their widespread use comes an increased risk of security threats. Ensuring the security of an app requires a comprehensive approach, from development practices to continuous monitoring. Prioritizing security is key to safeguarding your users and upholding the trustworthiness of your app. Remember, security is an ongoing responsibility rather than a one-time task. Stay updated on emerging risks and adjust your security strategies accordingly. The following sections discuss the importance of security measures and outline the steps for developing a secure mobile app.

What Is Mobile App Security and Why Does It Matter?

Mobile app security involves practices and precautions to shield apps from vulnerabilities, attacks, and unauthorized entry. It encompasses elements such as data safeguarding, authentication processes, authorization mechanisms, secure coding principles, and encryption techniques.

The Significance of Ensuring Mobile App Security

User Trust: Users expect their personal information to be kept safe when using apps. A breach would damage trust and reputation.
Compliance With Laws and Regulations: Most countries have laws to protect data, such as the GDPR, which organizations are required to adhere to. Not following these regulations could result in penalties.
Financial Consequences: Security breaches can lead to financial losses from remediation costs, compensation, and recovery efforts.
Sustaining Business Operations: A compromised app has the potential to disrupt business functions and affect revenue streams.

Guidelines for Developing a Secure Mobile App

Creating a secure application entails various crucial steps aimed at fortifying the app against possible security risks. The following is a detailed roadmap for constructing such an app.

1. Recognize and Establish Security Requirements

Prior to commencing development, outline the security prerequisites specific to your app. Take into account aspects like authentication, data storage, encryption, and access management.

2. Choose a Reliable Cloud Platform

Choose a cloud service provider that offers strong security functionality. Popular choices include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

3. Ensure Safe Development Practices

• Educate developers on secure coding methods to steer clear of vulnerabilities such as SQL injection, cross-site scripting (XSS), and insecure APIs.
• Conduct routine code reviews to detect security weaknesses at an early stage.

4. Implement Authentication and Authorization Measures

• Employ robust authentication methods like multi-factor authentication (MFA) for heightened user login security.
• Utilize role-based access control (RBAC) to assign permissions based on user roles, limiting access to sensitive functionality.

5. Safeguard Data Through Encryption

• Use HTTPS for communication between the application and server for in-transit encryption.
• Encrypt sensitive data stored in databases or files for at-rest encryption.

6. Ensure the Security of APIs

• Validate input, employ API keys, and set up rate limiting for API security.
• Handle user authentication and authorization securely with the OAuth and OpenID Connect protocols.

7. Conduct Regular Security Assessments

• Perform penetration testing periodically to identify vulnerabilities.
• Leverage automated scanning tools to detect security issues efficiently.

8. Monitor Activities and Respond to Incidents

• Keep track of app behavior in real time to spot any irregularities or anomalies promptly.
• Have a plan in place for handling security incidents.

What Is Involved in Mobile Application Security Testing?

Implementing robust security testing methods is crucial for ensuring the integrity and resilience of mobile applications.
Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and Mobile App Penetration Testing are fundamental approaches that help developers identify and address security vulnerabilities. These methodologies not only fortify the security posture of apps but also contribute to maintaining user trust and confidence. Let's delve deeper into each of these testing techniques to understand their significance in securing mobile apps effectively.

Static Application Security Testing (SAST)

This method involves identifying security vulnerabilities in applications during the development stage. It entails examining the application's source code or binary without executing it, which helps detect security flaws early in the development process. SAST scans the codebase for vulnerabilities like injection flaws, broken authentication, insecure data storage, and other typical security issues. Automated scanning tools are used to analyze the code and pinpoint problems such as hardcoded credentials, improper input validation, and exposure of sensitive data. By detecting security weaknesses before deployment, SAST allows developers to make necessary improvements to enhance the application's security stance. Integrating SAST into the development workflow also aids in meeting industry standards and regulatory mandates. In essence, SAST strengthens mobile application resilience against cyber threats by protecting information and upholding user confidence in today's interconnected environment.

Dynamic Application Security Testing (DAST)

This method is used to test the security of apps while they are running, assessing their security in real time. Unlike static analysis, which looks at the app's source code, DAST evaluates how the app behaves in a running environment. DAST tools emulate real-world attacks by interacting with the app as a user would, sending different inputs and observing the reactions.
By analyzing how the app operates at runtime, DAST can pinpoint security issues such as injection vulnerabilities, weak authentication measures, and improper error handling. DAST mainly focuses on uncovering vulnerabilities that may not be obvious from examining the code. Some common techniques used in DAST include fuzz testing, where the app is bombarded with unexpected inputs to reveal vulnerabilities, and penetration testing conducted by ethical hackers to exploit security flaws. By using DAST, developers can detect vulnerabilities that malicious actors could exploit to compromise the confidentiality, integrity, or availability of an app's data. Integrating DAST into mobile app development allows developers to find and fix security weaknesses before deployment, thereby reducing the chances of security breaches and strengthening application security.

Mobile App Penetration Testing

This proactive approach is employed to pinpoint weaknesses and vulnerabilities in apps. Simulating real-world attacks is part of assessing the security stance of an application and its underlying infrastructure. Penetration tests can be conducted manually by cybersecurity experts or automated using specialized tools and software. The testing procedure includes several phases:

Reconnaissance: Gather details about the application's structure, features, and possible attack paths.
Vulnerability Scanning: Use automated tools to pinpoint security vulnerabilities in the app.
Exploitation: Attempt to exploit identified vulnerabilities to gain entry or elevate privileges.
Post-Exploitation: Document the consequences of breaches and offer recommendations for mitigation.

Mobile App Penetration Testing helps organizations uncover and rectify security weaknesses and reduces the risk of data breaches, financial harm, and damage to reputation. By evaluating the security of their apps, companies can enhance their security standing and maintain the confidence of their clients.
By combining the above methodologies, mobile app security testing helps identify and rectify security vulnerabilities during the development process, ensuring that mobile apps are strong, resilient, and protected against cybersecurity risks. This helps safeguard user data and maintain user trust in today's interconnected world.

Common Mobile App Security Threats

Data Leakage

Data leakage refers to the unauthorized exposure of sensitive information stored or transmitted via mobile apps. This poses significant risks for both individuals and companies, including identity theft, financial scams, damage to reputation, and legal ramifications. For individuals, data leaks can compromise details such as names, addresses, social security numbers, and financial information, impacting their privacy and security. Moreover, leaks of health or personal data can tarnish someone's reputation and well-being. On the business front, data leaks can result in financial losses, regulatory fines, and erosion of customer trust. Breaches involving customer data can harm a company's image, leading to customer loss, which can affect revenue and competitiveness. Failure to secure sensitive information can also lead to severe consequences and penalties, especially in regulated industries like healthcare, finance, or e-commerce. Therefore, implementing robust security measures is crucial to protect information and maintain user trust in mobile apps.

Man-in-the-Middle (MITM) Attacks

Man-in-the-Middle (MITM) attacks happen when someone secretly intercepts and alters communication between two parties. In the context of apps, this involves a hacker inserting themselves between a user's device and the server, allowing them to spy on shared information. MITM attacks are risky, potentially leading to data theft and identity fraud, as hackers can access login credentials, financial transactions, and personal data.
To prevent MITM attacks, developers should use encryption methods such as HTTPS/TLS, while users should avoid public Wi-Fi networks and consider using VPNs for added security. Remaining vigilant and taking precautions are essential in protecting against MITM attacks.

Injection Attacks

Injection attacks pose significant security risks to apps, as malicious actors exploit vulnerabilities to insert and execute unauthorized code. Common examples include SQL injection and JavaScript injection. During these attacks, perpetrators tamper with input fields to inject commands, gaining unauthorized access to data or disrupting app functions. Injection attacks can lead to data breaches, data tampering, and system compromise. To prevent these attacks, developers should enforce input validation, use parameterized queries, and adhere to secure coding practices. Regular security assessments and tests are crucial for pinpointing and addressing vulnerabilities in apps.

Insecure Authentication

Insecure authentication methods can lead to vulnerabilities, opening the door to unauthorized entry and data breaches. Common issues include weak passwords, absence of two-factor authentication, and improper session management. Cyber attackers exploit these weaknesses to impersonate users, access data unlawfully, or seize control of user accounts. A compromised authentication system jeopardizes user privacy, data accuracy, and accessibility, posing risks to individuals and organizations. To address this risk, developers should implement security measures such as two-factor authentication and session tokens. Regular updates and enhancements to security protocols are crucial to stay ahead of evolving threats.

Data Storage

Ensuring secure data storage is crucial in today's technology landscape, especially for apps. It's vital to protect sensitive information and financial records to prevent unauthorized access and data breaches.
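As a minimal sketch of what at-rest encryption can look like in plain Java, the following uses AES-GCM from the standard javax.crypto APIs. The class name, key handling, and layout (random IV prepended to the ciphertext) are illustrative choices of ours; a real mobile app should keep keys in a platform keystore rather than generating them ad hoc.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class CryptoSketch {

    private static final int IV_LEN = 12;    // 96-bit IV, the recommended size for GCM
    private static final int TAG_BITS = 128; // authentication tag length in bits

    // Generate a fresh 256-bit AES key (illustrative; real apps use a keystore).
    public static SecretKey newKey() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        return gen.generateKey();
    }

    // Encrypt with AES-GCM; the random IV is prepended to the ciphertext.
    public static byte[] encrypt(byte[] plaintext, SecretKey key) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plaintext);
        byte[] out = new byte[IV_LEN + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_LEN);
        System.arraycopy(ct, 0, out, IV_LEN, ct.length);
        return out;
    }

    // Decrypt: split off the IV, then authenticate and decrypt the remainder.
    public static byte[] decrypt(byte[] ivAndCiphertext, SecretKey key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_LEN));
        return cipher.doFinal(ivAndCiphertext, IV_LEN, ivAndCiphertext.length - IV_LEN);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] ct = encrypt("card number 1234".getBytes(StandardCharsets.UTF_8), key);
        System.out.println(new String(decrypt(ct, key), StandardCharsets.UTF_8));
    }
}
```

Because GCM is an authenticated mode, any tampering with the stored ciphertext makes decryption fail rather than silently returning corrupted data.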
Secure data storage includes encrypting information both at rest and in transit using strong encryption methods and secure storage techniques. Moreover, setting up access controls and authentication procedures, and conducting regular security checks, are essential to uphold the confidentiality and integrity of stored data. By prioritizing these data storage practices and security protocols, developers can ensure that user information remains shielded from risks and vulnerabilities.

Faulty Encryption

Faulty encryption and flawed security measures can lead to vulnerabilities within apps, putting sensitive data at risk of unauthorized access and misuse. If encryption algorithms are weak or not implemented correctly, encrypted data could be easily decoded by malicious actors. Poor key management, like storing encryption keys insecurely, worsens these threats. Additionally, security protocols lacking proper authentication or authorization controls create opportunities for attackers to bypass security measures. The consequences of inadequate encryption and security measures can be substantial and can include data breaches, financial losses, and a decline in user trust. To address these risks effectively, developers should prioritize strong encryption algorithms, secure key management practices, and thorough security protocols in their mobile apps.

The Unauthorized Use of Device Functions

The misuse of device capabilities within apps presents a security concern, putting user privacy and device security at risk. Malicious apps or attackers could exploit weaknesses to access features like the camera, microphone, or GPS without permission, leading to privacy breaches. This unauthorized access may result in monitoring, unauthorized audio/video recording, and location tracking, compromising user confidentiality. Additionally, unauthorized use of device functions could allow attackers to carry out activities such as sending premium SMS messages or making calls that incur costs or violate privacy.
To address this issue effectively, developers should enforce strict permission controls and carefully evaluate third-party tools and integrations to prevent misuse of device capabilities.

Reverse Engineering and Altering Code

Altering the code within apps can pose security risks and put the app's integrity and confidentiality at risk. Bad actors might decompile the code to find weaknesses, extract data, or alter its functions for malicious purposes. These activities allow attackers to bypass security measures, insert malicious code, or create vulnerabilities, leading to data breaches, unauthorized access, and financial harm. Moreover, tampering with code can enable hackers to circumvent licensing terms or protections for developers' intellectual property, impacting their revenue streams. To effectively address this threat, developers should employ techniques like code obfuscation to obscure the code's meaning and make it harder for attackers to decipher. They should also establish safeguards during the app's operation and regularly audit the codebase for any signs of tampering or unauthorized modifications. These proactive measures help mitigate the risks associated with code alteration and maintain the app's security and integrity.

Third-Party Collaborations

Third-party collaborations in apps bring both advantages and risks. While connecting with third-party services can improve features and user satisfaction, it also exposes the app to security threats and privacy issues. Thoroughly evaluating third-party partners, following security protocols, and regular monitoring are essential steps to manage these risks. Neglecting to assess third-party connections can lead to data breaches, compromised user privacy, and harm to the app's reputation. Therefore, developers should be cautious and diligent when entering into collaborations with third parties to safeguard the security and credibility of their apps.
Social Manipulation Strategies

Social manipulation strategies present a security risk for apps by leveraging human behavior to mislead users and jeopardize their safety. Attackers can use methods like phishing emails, deceptive phone calls, or misleading messages to trick users into sharing sensitive data like passwords or financial information. Moreover, these tactics can influence user actions, such as clicking on links or downloading apps containing malware. Such strategies erode user trust and may lead to data breaches, identity theft, or financial scams. To address this, it's important for users to understand social manipulation tactics and be cautious when dealing with suspicious requests, messages, or links in mobile apps. Developers should also incorporate security measures like two-factor authentication and anti-phishing tools to safeguard users against social engineering attacks.

Conclusion

Always keep in mind that security is an ongoing responsibility and not a one-time job. Stay informed about threats and adapt your security measures accordingly. Developing a secure app is crucial for safeguarding user data, establishing trust, and averting security breaches.
People initially became interested in blockchain several years ago after learning about it as a decentralized digital ledger. It supports transparency because no one can change information stored on it once added. People can also watch transactions as they happen, further enhancing visibility. But how does blockchain support the integrity of cloud-stored data?

3 Ways Blockchain Supports the Integrity of Cloud-Stored Data

1. Protecting and Facilitating the Sharing of Medical Records

Technological advancements have undoubtedly improved the ease of sharing medical records between providers. When patients go to new healthcare facilities, all involved parties can easily see those individuals' histories, treatments, test results, and more. Such records keep everyone updated about what's happened to patients, which significantly reduces the likelihood of redundancies and confusion that could extend a health management timeline. Cloud computing has also accelerated information-sharing efforts within healthcare and other industries. It allows medical professionals to access and collaborate through scalable platforms. Many healthcare workers also appreciate how they can access cloud apps from anywhere. That convenience supports physicians who must travel for continuing medical education events, travel nurses, surgeons who split their time between multiple hospitals, and others who often work from numerous locations. However, despite these cloud computing benefits, a security-related downside is that these platforms use a centralized infrastructure to allow record sharing across users. That characteristic leaves cloud tools open to data breaches. In one case, researchers proposed addressing this shortcoming with a blockchain architecture to authenticate users and enable opportunities for sharing medical records securely.
The group prioritized blockchain due to its immutability while seeking to create a system that allowed patients and their providers to share and store medical records securely. The researchers also wanted to design something that was not at risk of data loss or other failures. The researchers implemented so-called "special recognition keys" to identify medical-related specifics, such as identifying doctors, patients, and hospitals. When testing their system, some of the metrics studied included the time to complete a transaction and how well the communication-related attributes performed. The outcomes suggested the researchers' approach worked far better than existing solutions.

2. Improving Access Control

Data breaches can be costly, catastrophic events. Although there's no single solution for preventing them, people can make meaningful progress by focusing on access control. One of the most convenient things about the cloud is that it allows all authorized users to access content regardless of their location. However, as the number of people engaging with a cloud platform increases, so does the risk of compromised credentials that could allow hackers to enter networks and wreak havoc. Many corporate leaders have prioritized cloud-first strategies. That approach can strengthen cybersecurity because service providers have numerous security features to supplement internal measures. Additionally, cloud-based backup capabilities facilitate faster data recovery if cyberattacks occur. However, research suggests that some access control practices used by cloud administrators have significant shortcomings that could make cyberattacks more likely. For example, one study about access management for cloud platforms found that 49% of administrators store passwords in a spreadsheet. That's a huge security risk for many reasons, but it also highlights the need for better password hygiene practices. Fortunately, the blockchain is well-positioned to solve this problem.
In one example, researchers developed a blockchain system that uses attribute-based encryption technology to improve how cloud users access content. The setup also contains an audit contract that dynamically manages who can use the cloud and when. The team built a fine-grained, searchable system that maintained access control, strengthened cloud security, and achieved the desired results without excessive computing power. Results also showed this system increased storage capacity. When the group performed a security analysis on their blockchain creation, they found it stopped chosen-plaintext attacks and cybersecurity breaches based on guessed keywords. A theoretical examination and associated experiments suggested this tool worked better from a computing power and storage efficiency perspective than comparable alternatives.

3. Curbing Emerging Technologies' Potential Threats

Even as new technologies show tremendous progress and excite people about the future, some individuals specifically investigate how they could harm others through technological advancements. Developments associated with ChatGPT and other generative AI tools are excellent examples. Indeed, these chatbots can save people time by assisting them with tasks such as idea generation or outline creation. However, because these tools create believable-sounding paragraphs in seconds, some cybercriminals use generative artificial intelligence (genAI) chatbots to write phishing emails much faster than before. It's easy to imagine the ramifications of a cybercriminal who writes a convincing phishing message and uses it to access someone's cloud-stored information. ChatGPT runs on a cloud platform built by OpenAI, which created the chatbot. A lesser-known issue affecting data integrity is that OpenAI uses interactions with the tool to train future versions of the algorithms.
People can opt out of having their conversations become part of the training, but many people haven't done so or don't know how. As workers eagerly tested ChatGPT and similar tools, some committed potential security breaches without realizing it. Consider if a web developer enters a proprietary code string into ChatGPT and asks the tool for help debugging it. That seemingly minor decision could result in sensitive information becoming part of training data and no longer being carefully protected by the developer's employer. Some leaders quickly established rules for appropriate usage or banned generative AI tools to address these threats. A February 2024 study also showed that some workers kept entering sensitive information when using ChatGPT despite knowing the associated risks. It's still unclear how the blockchain will support data integrity for people using cloud-based generative AI tools, but many professionals are upbeat about the potential.

Conclusion: Using Blockchain for Cloud Data Protection

Entities ranging from government agencies to e-commerce stores use cloud platforms daily. These options are incredibly convenient because they eliminate geographical barriers and allow people to use them through an active internet connection anywhere in the world. However, many cloud tools store sensitive data, such as health records or payment details. Since cloud platforms hold such a wealth of information, hackers will likely continue targeting them. Although most cloud providers have built-in security features, cybercriminals continually seek ways to circumvent such protections. The examples here show why the blockchain is an excellent candidate for much-needed additional safeguards.
What Is the C4 Model?

The C4 model is a hierarchical framework designed to help software architects and developers visualize and communicate the essential aspects of software architecture in a clear and structured way. Unlike traditional diagramming approaches that often result in cluttered and overly complex diagrams, the C4 model focuses on simplicity and abstraction to convey architectural concepts effectively. The next question is which tool you use to create said diagrams. You can use Visio, draw.io, PlantUML, even PowerPoint, or whatever tool you normally use for creating diagrams. However, these tools do not check whether naming, relations, etc. are used consistently across the different diagrams. Besides that, it might be difficult to review new versions of diagrams because it is not clear which changes were made. In order to solve these problems, Simon Brown, the author of the C4 model, created Structurizr.

What Is Structurizr?

Structurizr allows you to create diagrams as code. Based on the code, Structurizr visualizes the diagrams for you, and you can interact with the visualization. Because the diagrams are maintained in code, you can add them to your version control system (git), and changes in the diagrams are tracked and can be easily reviewed. In a previous article, some features of Structurizr were explored. Structurizr Lite was used, which supports only one workspace. However, if you have a more diverse system landscape, Structurizr Lite is not sufficient anymore. You will have multiple workspaces, one for every software system. You also probably want an overview of your entire system landscape. In this article, you will explore how you can use Structurizr to maintain not only the software architecture of one system but your entire system landscape as code. Sources used in this blog can be found at GitHub.
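To give a flavor of what "diagrams as code" looks like, a minimal workspace definition in the Structurizr DSL might look like the following sketch. The names are illustrative and not taken from the repository belonging to this blog.

```
workspace "Example" "A hypothetical single-system workspace" {
    model {
        user = person "User" "Someone using the system."
        system = softwareSystem "Example System" "A hypothetical software system." {
            webapp = container "Web Application" "Serves the UI."
        }
        user -> webapp "Uses"
    }
    views {
        systemContext system "SystemContext" {
            include *
            autoLayout
        }
        container system "Containers" {
            include *
            autoLayout
        }
    }
}
```

From this single text file, Structurizr renders both a System Context and a Container diagram, and every change to it shows up as a reviewable diff in version control.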
Prerequisites

Prerequisites for this blog are:

Basic knowledge of the C4 model
Basic knowledge of Docker
Basic knowledge of Structurizr
Linux is used — if you are using a different operating system, you will need to adjust the commands accordingly

Installation

As mentioned before, Structurizr Lite cannot be used for this scenario. Instead, you need to install Structurizr on-premises. Create a data directory in the root of the repository. This directory will be mapped as a volume in the Docker container. If you have followed along with the previous blog, ensure that you clean the data directory first. With Structurizr Lite, it is intended that you can edit files in this data directory; with Structurizr on-premises, it is advised not to alter the files in the data directory. Structurizr on-premises should be run on a separate server, and a normal user should not have access to the data directory anyway. Execute the following command from within the root of the repository:

```shell
$ docker run -it --rm -p 8080:8080 -v ./data:/usr/local/structurizr structurizr/onpremises
```

Navigate in your browser to http://localhost:8080, log in with the default user structurizr and password password, and the Structurizr webpage is shown.

Single Workspace

First, let's see how you can create a single workspace with Structurizr on-premises. Click New workspace, and an empty workspace is created. Unlike with Structurizr Lite, it is not possible anymore to edit files on your host machine. So, how can you upload your DSL files to the workspace? In order to do so, you need the Structurizr CLI. At the moment of writing, v2024.02.22 is the latest version, which can be downloaded as a zip file from GitHub. Unpack the zip file and add the directory to your path. You will upload the latest version of the software system from the previous blog. The DSL is located in the workspaces/3-basic-styles directory. Navigate to this directory. To push the DSL to Structurizr, you will make use of the push command.
The push command needs some parameters, which can be found in the settings of the Structurizr workspace. You need the information shown under API details. Below this information, the parameters can easily be copied. Execute the following command, replacing the parameters for your situation:

Shell
$ structurizr.sh push -url http://localhost:8080/api -id 1 -key 2607de22-7ce0-4eb1-9f28-1e7e9979121a -secret 09528dfd-0c0a-4380-85cb-766b8da5e1dc -workspace workspace.dsl
Pushing workspace 1 to http://localhost:8080/api
 - creating new workspace
 - parsing model and views from /<path to project directory>/MyStructurizrPlanet/workspaces/3-basic-styles/workspace.dsl
 - merge layout from remote: true
 - storing previous version of workspace in null
 - pushing workspace
Getting workspace with ID 1
Putting workspace with ID 1
{"success":true,"message":"OK","revision":2}
 - finished

If everything goes well, the DSL is pushed successfully. The System Context and Container diagrams are now added to the workspace.

Workspace Features

In this section, some interesting features of Structurizr on-premises are shown.

5.1 Version Control
Every upload automatically creates a new version. It is also possible to retrieve an older version.

5.2 Error Checking
The Inspections item in the left menu gives you an overview of errors in your DSL.

5.3 Reviews
When you open a diagram, you can create a review. When creating the review, you can choose which diagrams need to be reviewed, what kind of review you are requesting, and whether unauthenticated access is allowed or not. The reviewer can add comments, of course. Next to the Public review text, a link to a checklist is present which can help you execute the review.

Create System Landscape Using DSL Only

The above examples consist of diagrams for a single software system. Often, multiple software systems are used in an organization. These software systems interact with each other and thus together form a system landscape.
Each team will be responsible for its own software system diagrams, but it is also necessary to have a diagram containing the larger picture. Let’s explore whether this is possible using Structurizr. You will be using an example based on the enterprise example provided in the Structurizr GitHub repository. The files can be found in workspaces/4-system-landscape. Create a new workspace via the UI, navigate to the 4-system-landscape directory, and push the customer-service DSL to this workspace.

Shell
$ structurizr.sh push -url http://localhost:8080/api -id 2 -key f24fe705-a508-4f8d-9cf7-3fc7b323f293 -secret 02c6597f-c750-47e0-9b88-f6e26fccdf38 -workspace customer-service/workspace.dsl

In the same way, create a workspace for the invoice-service and the order-service. Push the corresponding DSL to each workspace. A separate system-landscape DSL is present, which uses a plugin to create the relationships between the software systems. Create a workspace for this DSL and push it.

Shell
$ structurizr.sh push -url http://localhost:8080/api -id 5 -key cb18cabb-61c7-4c3a-a58e-2e97ff0fa285 -secret a638aa99-73cd-427d-8188-3788e678129f -workspace system-landscape/workspace.dsl

This creates the system landscape overview. However, two issues are encountered with this view:

It is not possible to click on, for example, the Order Service in order to open the software system diagram for the Order Service.
The DSL of the Customer Service does not define the relationships with the Order Service and the Invoice Service, as can be seen in the diagram below. It would be nice if this inconsistency were reported in one way or another.

I asked a question about this on GitHub and used the answer to create a solution that can be found in the following paragraphs.

Create System Landscape Using Java

The solution to the problem with the absence of links to the different services is to make use of the Java Structurizr library. With this library, you have much more control to achieve the desired functionality.
I used the source code from the example in the Structurizr repository and added it to the directory workspaces/5-system-landscape. The pom file contains the necessary dependencies to run the code, and the maven-assembly-plugin is added to create a fat jar. The code executes the following steps:

Create a workspace for the system landscape.
Create workspaces for each service.
Generate the system landscape by parsing the workspace metadata, creating the necessary relationships, adding a link to the services, and creating a view for the system landscape.

Execute the following command from within the workspaces/5-system-landscape directory in order to build the fat jar.

Shell
$ mvn clean package

Run the code, and an error occurs.

Shell
$ java -jar target/mystructurizrplanet-1.0-SNAPSHOT-jar-with-dependencies.jar
Mar 02, 2024 11:41:12 AM com.structurizr.api.AdminApiClient createWorkspace
SEVERE: com.structurizr.api.StructurizrClientException: The API key is not configured for this installation - please refer to the documentation
Exception in thread "main" com.structurizr.api.StructurizrClientException: com.structurizr.api.StructurizrClientException: The API key is not configured for this installation - please refer to the documentation
	at com.structurizr.api.AdminApiClient.createWorkspace(AdminApiClient.java:109)
	at com.mydeveloperplanet.mystructurizrplanet.CreateSystemLandscape.main(CreateSystemLandscape.java:30)
Caused by: com.structurizr.api.StructurizrClientException: The API key is not configured for this installation - please refer to the documentation
	at com.structurizr.api.AdminApiClient.createWorkspace(AdminApiClient.java:105)
	... 1 more

To use the Java library, you need to use an API key. This API key is disabled by default. To enable it, you need to add a file structurizr.properties to your data directory. In the properties file, you set the API key to its bcrypt-encoded value.
Properties files
structurizr.apiKey=$2a$10$ekjju1h3fC1y2YAln7wqxuJ.q0gBjQoFPX/Wvmzr.L5aIdoqvUIwa

Add read permissions to the file.

Shell
$ chmod o+r data/structurizr.properties

Restart the Docker container and execute the jar file again.

Shell
$ java -jar target/mystructurizrplanet-1.0-SNAPSHOT-jar-with-dependencies.jar
Mar 02, 2024 11:50:03 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 7
Mar 02, 2024 11:50:04 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: Putting workspace with ID 7
Mar 02, 2024 11:50:04 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: {"success":true,"message":"OK","revision":2}
Mar 02, 2024 11:50:04 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 8
Mar 02, 2024 11:50:04 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: Putting workspace with ID 8
Mar 02, 2024 11:50:04 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: {"success":true,"message":"OK","revision":2}
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 9
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: Putting workspace with ID 9
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: {"success":true,"message":"OK","revision":2}
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 1
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 2
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 3
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 4
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 5
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 6
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 7
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 8
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 9
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient getWorkspace
INFO: Getting workspace with ID 6
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: Putting workspace with ID 6
Mar 02, 2024 11:50:05 AM com.structurizr.api.WorkspaceApiClient putWorkspace
INFO: {"success":true,"message":"OK","revision":2}

If you open the system landscape workspace, it is now possible to double-click one of the services, and you will be navigated to the corresponding service. Great, but there are some caveats to mention:

This source code creates new workspaces every time you run it. This is just an example of what is possible using the Java library. If you want to update existing workspaces, you will need to alter the source code for this purpose.
The source code contains a hardcoded API key in plain text. You should not do this in a production environment.

Validate Relationships

Is it possible to validate the relationships using the Java library? Yes, it is. An example of the source code can be found in the directory workspaces/6-validate-relationships. This code validates offline whether the DSL contains the correct relationships. It is only intended to prove that the validation can be done. To use this in production, the source code needs to be made more robust. Build the code and run it.
Shell
$ mvn clean package
$ java -jar target/validaterelationships-1.0-SNAPSHOT-jar-with-dependencies.jar
missing relation in CustomerService {2 | Order Service | } ---[Manages customer data using]---> {4 | Customer Service | }
missing relation in CustomerService {3 | Invoice Service | } ---[Gets customer data from]---> {4 | Customer Service | }

The validation finds the two errors in the Customer Service. Add the relationships to the Customer Service DSL.

Plain Text
model {
    !extend customerService {
        api = container "Customer API"
        database = container "Customer Database"
        api -> database "Reads from and writes to"
        orderService -> customerService "Gets customer data from"
        invoiceService -> customerService "Gets customer data from"
    }
}

Build the code and run it. The errors are gone, and the relationships are visible in the Customer Service if you run the code from the previous paragraph.

Conclusion

Structurizr offers many features to get a grip on your software architecture. It also allows you to generate a system landscape and to implement several customizations, e.g., custom validation checks. You need to learn the Java Structurizr library, but the learning curve is not very steep.
In today’s text, I want to take a closer look at Server-Sent Events (or SSE for short) and WebSockets. Both are good and battle-tested approaches to data exchange. I will start with a short characteristic of both tools — what they are and what they offer. Then, I will compare them according to eight categories, which, in my opinion, are the most crucial for modern-day systems. The categories are as follows:

Communication Direction
Underlying Protocol
Security
Simplicity
Performance
Message Structure
Ease of Adoption
Tooling

In contrast to my previous comparison, which compared REST and gRPC, I will not proclaim any winner or grant points per category. Instead, in the Summary paragraph, you will find a kind of TL;DR table. The table contains the key differences between both technologies in the above-mentioned areas.

The Why

Unlike REST, both SSE and WebSockets are more use-case-focused. In this particular case (or cases), the main focus point of both concepts is providing a “real-time” communication medium. Because of their specific focus, they are less popular than REST, which is a more generic and one-size-fits-all type of tool. Nevertheless, both SSE and WebSockets offer an interesting set of possibilities and a slight refreshment from the classic REST approach to solving problems. In my opinion, it is good to be aware of them and find some space for them in our toolbox, as they may come in handy one day, providing you with a simpler solution to quite a complex problem, especially when you need “real-time” updates or when your app requires a more push-oriented approach. Besides comparing and describing them here, I also want to make them more popular.

What Is WebSockets?

In short, WebSockets is a communication protocol that provides bi-directional communication between server and client using a single long-lived TCP connection. Thanks to this feature, we do not have to constantly poll the server for new data.
Instead, the data is exchanged between interested parties in “real time.” Each message is either binary data or Unicode text. The protocol was standardized in 2011 by the IETF in the form of RFC 6455. The WebSocket protocol is distinct from HTTP, but both are located at layer 7 of the OSI model and depend on TCP at layer 4. The protocol has its own set of URI schemes, which work in a similar manner to the HTTP schemes "http" and "https":

ws - Indicates that the connection is not secured with TLS
wss - Indicates that the connection is secured with TLS

What is more, non-secure WebSocket connections (ws) should not be opened from secure sites (https). Similarly, secure WebSocket connections (wss) should not be opened from non-secure sites (http). On the other hand, WebSocket, by design, works on HTTP ports 443 and 80 and supports HTTP concepts like proxies and intermediaries. Additionally, the WebSocket handshake uses an HTTP Upgrade header to upgrade the protocol from HTTP to WebSocket. The biggest disadvantage of WebSocket as a protocol is security. WebSocket is not restricted by the same-origin policy, which may make CSRF-like attacks a lot easier.

What Is Server-Sent Events?

SSE is a technology that allows a web server to send updates to a web page. It is a part of the HTML5 specification and, similarly to WebSockets, utilizes a single long-lived HTTP connection to send data in “real time.” On a conceptual level, it is quite an old technology, with its theoretical background dating back to 2004. The first approach to implementing SSE was conducted in 2006 by the Opera team. SSE is supported by most modern browsers — Microsoft Edge added SSE support in January 2020. It can also take full advantage of HTTP/2, which addresses one of the biggest issues of SSE by practically eliminating the connection limit imposed by HTTP/1.1.
By definition, Server-Sent Events has two basic building blocks:

EventSource - An interface based on the WHATWG specification and implemented by the browser; it allows the client (a browser in this case) to subscribe to events.
Event stream - A protocol that describes the standard plain-text format of events sent by the server, which must be followed for the EventSource client to understand and propagate them.

According to the specification, events can carry arbitrary text data and an optional ID, and are delimited by newlines. They even have their own unique MIME type: text/event-stream. Unfortunately, Server-Sent Events is designed to support only text-based messages, and although we can send events with a custom format, in the end, the message must be a UTF-8 encoded string. What is more, SSE provides two very interesting features:

Automatic reconnection - If the client disconnects unexpectedly, EventSource periodically tries to reconnect.
Automatic stream resume - EventSource automatically remembers the last received message ID and will automatically send a Last-Event-ID header when trying to reconnect.

The Comparison

Communication Direction

Probably the biggest difference between the two is their way of communication. SSE provides only one-way communication — events can only be sent from the server to the client. WebSockets provides full two-way communication, enabling interested parties to exchange information and react to any events from both sides. I would say that both of the approaches have their pros and cons, with a set of dedicated use cases for each. On one hand, if you just need to push a stream of constant updates to the client, then SSE would be a more suitable choice. On the other hand, if you need to react in any way to one of those events, then WebSocket may be more beneficial.
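The event stream format described above is simple enough to parse by hand. The following is a simplified sketch, not a spec-complete EventSource implementation — it handles only the data, id, and event fields and ignores edge cases such as CRLF line endings and the retry field:

```javascript
// Sketch: parse a text/event-stream payload into event objects.
function parseEventStream(text) {
  return text
    .split('\n\n')                                 // events are separated by a blank line
    .filter((block) => block.trim() !== '')
    .map((block) => {
      const event = { data: [] };
      for (const line of block.split('\n')) {
        if (line.startsWith(':')) continue;        // comment line, ignored
        const idx = line.indexOf(':');
        const field = idx === -1 ? line : line.slice(0, idx);
        let value = idx === -1 ? '' : line.slice(idx + 1);
        if (value.startsWith(' ')) value = value.slice(1);
        if (field === 'data') event.data.push(value);
        else if (field === 'id' || field === 'event') event[field] = value;
      }
      event.data = event.data.join('\n');          // multiple data lines join with \n
      return event;
    });
}

const events = parseEventStream('id: 1\ndata: hello\n\ndata: part one\ndata: part two\n\n');
// events[0] -> { data: 'hello', id: '1' }
// events[1] -> { data: 'part one\npart two' }
```

In the browser, you never write this yourself — EventSource does the parsing and hands you ready-made message events.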
In theory (and practice), all the things that can be done with SSE can also be done with WebSockets, but then we are entering areas like support, simplicity of the solution, or security. I will describe all of these areas and more in the following paragraphs. Additionally, using WebSocket in all cases can be significant overkill, and an SSE-based solution may just be simpler to implement.

Underlying Protocol

Here comes another big difference between both technologies. SSE fully relies on HTTP and has support for both HTTP/1.1 and HTTP/2. In contrast, WebSocket uses its own custom protocol — surprise, surprise — the WebSocket protocol. In the case of SSE, utilizing HTTP/2 solves one of the major issues of SSE — the maximum parallel connection limit. HTTP/1.1, by its specification, limits the number of parallel connections. This behavior may lead to a problem called head-of-line blocking. HTTP/2 addresses this issue via the introduction of multiplexing, which solves HOL blocking at the application layer. However, head-of-line blocking may still occur at the TCP level. As for the WebSocket protocol, I described it in some detail just a few paragraphs above. Here, I would just reiterate the most important points. The protocol is somewhat different from classic HTTP despite using an HTTP Upgrade header to initialize the WebSocket connection and effectively change communication protocols. Nevertheless, it also uses TCP as a base and is fully compatible with HTTP. The biggest drawback of the WebSocket protocol is its security.

Simplicity

In general, setting up an SSE-based integration is simpler than its WebSocket counterpart. The most important reason behind it is the nature of communication utilized by each technology. The one-directional communication of SSE and its push model make it simpler on a conceptual level.
Combined with the automatic reconnection and stream-resume support provided out of the box, the number of things that we have to take care of is significantly reduced. With all of these features, SSE may also be viewed as a way to reduce the coupling between client and server. Clients just need to know an endpoint that produces the events. Nevertheless, in such a case, the client can only receive messages, so if we want to send any type of information back to the server, we need another communication medium, which may greatly complicate things. In the case of WebSockets, things are somewhat more complicated. First of all, we need to handle the connection upgrade from HTTP to the WebSocket protocol. Albeit the simplest thing here, it is another thing we need to remember. The second problem comes from the bi-directional nature of WebSockets. We have to manage the state of a particular connection and handle all possible exceptions occurring while processing a message. For example, what if the processing of one of the messages throws an exception on the server side? Next comes the problem of handling reconnections, which, in the case of WebSockets, we have to handle ourselves. There is also a problem that impacts both technologies — long-running connections. Both technologies need to maintain long-lived open connections to send a continuous stream of events. Managing such connections, especially on a large scale, can be a challenge, as we can quite quickly run out of resources. Additionally, they may require special configuration like extended timeouts and are more exposed to any network connection problems.

Security

In the case of SSE, there is nothing special about security, as it utilizes the plain old HTTP protocol as the transport medium. All standard HTTP advantages and disadvantages apply to SSE, as simple as that. On the other hand, security is one of the biggest drawbacks of the WebSocket protocol as a whole.
For a start, there is no such thing as the same-origin policy for WebSockets, so there are no restrictions as to the place we want to connect to via WebSockets. There is even a specific type of attack aimed at exploiting this vulnerability: cross-site WebSocket hijacking. If you want to dive more into the topic of the same-origin policy and WebSocket, here is an article that may be interesting for you. Besides that, there are no protocol-specific security loopholes in WebSockets. I would say that in both cases, all the standard security best practices apply, so just be careful while you are implementing your solution.

Performance

I would say that both of the technologies are on equal footing as to performance. No theoretical performance limitations come from either of the technologies themselves. However, I would say that SSE can be faster in terms of the pure number of messages sent per second, as it works on a fire-and-forget principle. WebSocket also needs to handle responses coming from the client. The only thing that can impact the performance of both of them is the underlying client we are using in our application and its implementation. Check, read the documentation, run custom stress tests, and you may end up with very interesting insights about the tool you are using or your entire system.

Message Structure

The message structure is probably another one of the most important differences between the protocols. As I mentioned above, SSE is a pure text protocol. We can send messages with different formats and contents, but in the end, everything ends up as UTF-8 encoded text. No complex format or binary data is possible. WebSocket, on the other hand, can handle both text and binary messages, giving us the possibility to send images, audio, or just regular files. Just remember that processing files can have a significant overhead.

Ease of Adoption

Here, both of the technologies are at a very similar stage.
There are plenty of tools for adding WebSockets and Server-Sent Events support, on both the client and the server side. Most of the established programming languages have more than one such library. Without going into too much detail, I have prepared a table summarizing the basic libraries adding WebSockets and SSE support.

Java: Spring SSE/WebSockets, Quarkus SSE/WebSockets
Scala: Akka SSE/WebSockets, Play WebSockets
JavaScript: EventSource, total.js SSE/WebSockets, Socket.io
Python: Starlette, FastAPI

As you can see, we have plenty of well-established choices if we want to add SSE or WebSockets integration to our application. Of course, this is only a minuscule sample picked from all the libraries; many more are out there. The real problem may be finding the one most suitable for your particular use case.

Tooling

Automated Tests

As far as I know, there are no automated testing tools for either SSE or WebSockets. However, you can relatively easily achieve similar functionality with the use of Postman and collections. Postman supports both Server-Sent Events and WebSockets. With the use of some magic originating from Postman collections, you can prepare a set of tests verifying the correctness of your endpoints.

Performance Tests

In the case of performance tests, you can go with either JMeter or Gatling. As far as I am aware, these are the two most mature tools for overall performance testing. Of course, both of them also support SSE (JMeter, Gatling) and WebSockets (JMeter, Gatling). There are also other tools like sse-perf (SSE only), Testable, or k6 (WebSockets only). Out of all of these tools, I would personally recommend Gatling or k6. Both seem to have the best user experience and be the most production-ready.

Documentation

Strictly speaking, there are no tools dedicated solely to documenting either SSE or WebSockets. On the other hand, there is a tool called AsyncAPI, which can be used in just this way for both concepts.
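To give an impression of what that looks like, here is a minimal AsyncAPI document for a hypothetical WebSocket update channel. The server URL, channel name, and payload are invented for this example; they are not from any real service:

```yaml
asyncapi: '2.6.0'
info:
  title: Updates Service
  version: '1.0.0'
servers:
  production:
    url: ws://example.com
    protocol: ws
channels:
  /updates:
    subscribe:
      summary: Receive update events pushed by the server.
      message:
        name: update
        contentType: application/json
        payload:
          type: object
          properties:
            id:
              type: string
            text:
              type: string
```

The same document structure works for SSE-style channels as well; only the server protocol and bindings differ.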
Unfortunately, OpenAPI seems not to support either SSE or WebSockets.

Summary

As I promised, the summary will be quick and simple — look below. I think that the table above is quite a nice and compact summary of the topic and the article as a whole. The most important difference is the communication direction, as it determines the possible use cases of a particular technology. It will probably have the biggest impact on choosing one over the other. Message structure can also be a very important category when it comes to choosing a particular way of communication. Allowing only plain-text messages is a very significant drawback for Server-Sent Events. If you ever need help choosing between SSE and WebSockets or dealing with any other kind of technical problem, just let me know. Thank you for your time.

Reviewed by: Michał Matłoka, Michał Grabowski
Hi, DZone Community — we'd love for you to join us for our Cloud Native Research Survey! This year, we're combining our annual cloud and Kubernetes research into one 12-minute survey that dives further into these topics as they relate to one another and to the intersection of security, observability, AI, and more. Our 2024 cloud native research questions cover:

Microservices, container orchestration, and tools/solutions
Kubernetes use cases, pain points, and security measures
Cloud infrastructure, costs, tech debt, and security threats
AI for release management and monitoring/observability

DZone members and readers just like you drive the research that we cover in our Trend Reports, and this is where we could use your anonymous perspectives! Oh, and don't forget to enter the $750 raffle at the end of the survey! Five random people will be selected to each receive a $150 e-gift card. We're asking for ~12 minutes of your time to share your experience.

Participate in Our Research

Over the coming months, we will compile and analyze data from hundreds of respondents, and our observations will be featured in the "Key Research Findings" of our Cloud Native (May) and Kubernetes in the Enterprise (September) Trend Reports. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. Your responses help shape the narrative of our Trend Reports, so we truly cannot do this without you. We thank you in advance for your help!

—The DZone Publications team
Cross-Origin Resource Sharing (CORS) often becomes a stumbling block for developers attempting to interact with APIs hosted on different domains. The challenge intensifies when direct server configuration isn't an option, pushing developers towards alternative solutions like the widely used cors-anywhere. However, less known is the capability of NGINX's proxy_pass directive to handle not only local domains and upstreams but also external sources. This is how the idea was born to write a universal (with some reservations) NGINX config that supports any given domain.

Understanding the Basics and Setup

CORS is a security feature that restricts web applications from making requests to a different domain than the one that served the web application itself. This is a crucial security measure to prevent malicious websites from accessing sensitive data. However, when legitimate cross-domain requests are necessary, properly configuring CORS is essential. The NGINX proxy server offers a powerful solution to this dilemma. By utilizing NGINX's flexible configuration system, developers can create a proxy that handles CORS preflight requests and manipulates headers to ensure compliance with CORS policies. Here's how:

Variable Declaration and Manipulation

With the map directive, NGINX allows the declaration of new variables based on existing global ones, incorporating regular expression support for dynamic processing. For instance, extracting a specific path from a URL can be achieved, allowing for precise control over request handling. Thus, when requesting http://example.com/api, the $my_request_path variable will contain api.

Header Management

NGINX facilitates the addition of custom headers to responses via add_header and to proxied requests through proxy_set_header. Simultaneously, proxy_hide_header can be used to conceal headers received from the proxied server, ensuring only the necessary information is passed back to the client.
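For example, a map block extracting the first path segment and exposing it as a response header could look like the following. This is a sketch: the variable name $my_request_path follows the text, while the regex, port, and header name are assumptions rather than the article's exact configuration:

```nginx
# Sketch: derive $my_request_path from the request URI and expose it as a header.
map $request_uri $my_request_path {
    default            "";
    ~^/(?<path>[^/?]+) $path;   # http://example.com/api -> "api"
}

server {
    listen 8080;

    location / {
        add_header X-Request-Path $my_request_path always;
        # ... proxying directives go here ...
    }
}
```

Because map blocks live at the http level and are evaluated lazily, this costs nothing for requests that never read the variable.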
We now have an X-Request-Path header containing api.

Conditional Processing

Utilizing the if directive, NGINX can perform actions based on specific conditions, such as returning a predetermined response code for OPTIONS method requests, streamlining the handling of CORS preflight checks.

Putting It All Together

First, let’s declare $proxy_uri, which we will extract from $request_uri. In short, it works like this: when requesting http://example.com/example.com, the $proxy_uri variable will contain https://example.com. From the resulting $proxy_uri, we extract the part that will match the Origin header. For the Forwarded header, we need to process two variables at once. The processed X-Forwarded-For header is already built into NGINX. Now we can move on to declaring our proxy server. We get a minimally working proxy server, which can process the CORS preflight request and add the appropriate headers.

Enhancing Security and Performance

Beyond the basic setup, further refinements can improve security and performance:

Hiding CORS Headers: When NGINX handles CORS internally, it is beneficial to hide these headers from client responses to prevent exposure of server internals.
Rate Limit Bypassing: It would also be nice to pass the client’s IP along, to avoid tripping rate limits that can occur when several users access the same resource through the proxy.
Disabling Caching: And finally, for dynamic content or sensitive information, disabling caching is a best practice, ensuring data freshness and privacy.

Conclusion

This guide not only demystifies the process of configuring NGINX to handle CORS requests but also equips developers with the knowledge to create a robust, flexible proxy server capable of supporting diverse application needs. Through careful configuration and understanding of both CORS policies and NGINX's capabilities, developers can overcome cross-origin restrictions, enhance application performance, and ensure data security.
This advanced understanding and application of NGINX not only solves a common web development hurdle but also showcases the depth of skill and innovation possible when navigating web security and resource-sharing challenges.