A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Data freshness, sometimes referred to as data timeliness, is the frequency with which data is updated for consumption. It is an important dimension of data quality because recently refreshed data is more accurate and, thus, more valuable. Since it is impractical and expensive to have all data refreshed on a near real-time basis, data engineers ingest and process most analytical data in batches, with pipelines designed to update specific data sets at roughly the same frequency at which they are consumed. Brandon Beidel, director of data engineering at Red Ventures, talked to us about this process, saying: “We [would] start diving deep into discussions around data quality and how it impacted their day-to-day. I would always frame the conversation in simple business terms and focus on the who, what, when, where, and why. I’d especially ask questions probing the constraints on data freshness, which I’ve found to be particularly important to business stakeholders.” For example, a customer churn dashboard for a B2B SaaS company may only need to be updated once every seven days for a weekly meeting, whereas a marketing dashboard may require daily updates in order for the team to optimize its digital campaigns. Data freshness is important because the value of data decreases exponentially over time. The consequences of ignoring data freshness can be severe. One e-commerce platform lost around $5 million in revenue because its machine learning model that identified out-of-stock items and recommended substitutions was operating on thousands of temporary tables and stale data for six months. How To Measure Data Freshness for Data Quality As previously mentioned, the required level of data freshness is completely contextual to the use case. One way data teams measure data freshness is by the number of complaints they receive from their data consumers over a period of time. While this is a customer-focused approach, it is reactive and has serious disadvantages, such as: corroding data trust; delaying decision-making and the pace of business operations; requiring a human in the loop who is familiar with the data (not always the case when powering machine learning models); and, if the data is external and customer-facing, creating a risk of churn. A better measurement is the data downtime formula, which more comprehensively measures the amount of time the data was inaccurate, missing, or otherwise erroneous (roughly, the number of incidents multiplied by the time it takes to detect and resolve them). A proactive approach for measuring data freshness is to create service level agreements, or SLAs, for specific data pipelines. We’ve written a step-by-step guide for creating data SLAs, but in summary: Identify your most important data tables based on the number of reads/writes or their monetary impact on the business. Identify the business owners of those data assets. In other words, who will be most impacted by data freshness or other data quality issues? Ask them how they use their data and how frequently they access it. Create an SLA that specifies how frequently and when the data asset will be refreshed. Implement a means of monitoring when the SLA has been breached and measure how frequently the SLA has been met over a period of time. This can be done through data testing or by using a data observability platform.
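To make these measurements concrete, here is a minimal Python sketch that computes a daily freshness SLA compliance percentage and a rough data downtime figure over a 30-day window. The refresh log, the two simulated missed days, and the incident durations are all made-up illustrations, and the downtime number simply sums time to detection and time to resolution across incidents.

Python
from datetime import date, timedelta

window_start = date(2023, 5, 1)
window = [window_start + timedelta(days=i) for i in range(30)]

# Hypothetical refresh log: days on which the customer_360 table actually
# refreshed (in practice this would come from warehouse metadata such as a
# last_modified column). Two missed days are simulated here.
refresh_dates = set(window) - {window[7], window[21]}

days_met = sum(1 for day in window if day in refresh_dates)
sla_compliance = days_met / len(window) * 100
print(f"Daily freshness SLA met {sla_compliance:.1f}% of the last 30 days")

# Rough data downtime over the same window: for each incident, hours to
# detect plus hours to resolve (illustrative numbers).
incidents = [(4, 10), (2, 6)]  # (time to detection, time to resolution)
data_downtime_hours = sum(ttd + ttr for ttd, ttr in incidents)
print(f"Data downtime: {data_downtime_hours} hours")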
The end result should look something like, “The customer_360 dashboard met its daily data freshness SLA 99.5% of the time over the last 30 days, a 1% increase over the previous 30 days.” Data Freshness Challenges Data teams face numerous challenges in their data freshness quest as a result of the scale, speed, and complexity of data and data pipelines. Here are a few examples: Data sources are constantly changing: Whether internal or external, data engineers are rarely in control of the source emitting the desired data. Changes in schedule or schema can break data pipelines and create data freshness issues. Data consumption patterns change a lot, too: Strategies are adapted, metrics evolve, and departments are reorganized. Without capabilities such as data lineage, it can be difficult to understand what is a key asset (or upstream of an important data product in the context of a data mesh) and what is obsolete clutter. Outside of the smallest companies, identifying relevant data consumers and business stakeholders for each asset is also extremely challenging. This creates a communication chasm between the data and business teams. Data pipelines have a lot of failure points: The more complex moving parts a machine has, the more opportunities for it to break. Data platforms are no exception. The ingestion connector could break, the orchestration job could fail, or the transformation model could be updated incorrectly. Fixing data freshness issues takes a long time: Because there are so many moving parts, troubleshooting data freshness incidents can take data engineers hours–even days. The root cause could be hidden in endless blocks of SQL code, a result of system permission issues, or just a simple data entry error. Data Freshness Best Practices Once you have talked with your key data consumers and determined your data freshness goals or SLAs, there are a few best practices you can leverage to provide the best service or data product possible. The first step is to architect your data pipeline so that the goal is technically feasible (low latency). This is typically a data ingestion decision between batch, micro-batch, or stream processing. However, this could impact any decisions regarding complex transformation models or other data dependencies as well. At this point, you will want to consider layering approaches for detecting, resolving, and preventing data freshness issues. Let’s look at each in turn. Detecting Data Freshness Issues One of the simplest ways to start detecting data freshness issues is to write a data freshness check (test) using SQL rules. For example, let’s assume you are using Snowflake as your data warehouse and have integrated with Notification Services. You could schedule the following query as a Snowflake task, which would alert you Monday through Friday at 8:00 am ET when no rows have been added to “your_table” in the last day, once you have replaced “date_column” with a column that contains the timestamp when each row was added.

SQL
CREATE TASK your_task_name
  WAREHOUSE = your_warehouse_name
  SCHEDULE = 'USING CRON 0 8 * * 1-5 America/New_York'
  TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS'
AS
SELECT
  CASE
    WHEN COUNT(*) = 0 THEN SYSTEM$SEND_SNS_MESSAGE(
      'your_integration_name',
      'your_sns_topic_arn',
      'No rows added in more than one day in your_table!'
    )
    ELSE 'Rows added within the last day.'
  END AS alert_message
FROM your_table
WHERE date_column >= DATEADD(DAY, -1, CURRENT_DATE());

The query above looks at rows added, but you could instead use a similar statement to make sure there is at least some data matching the current date. Of course, both of these simple checks can be prone to error.

SQL
CREATE TASK your_task_name
  WAREHOUSE = your_warehouse_name
  SCHEDULE = 'USING CRON 0 8 * * 1-5 America/New_York'
  TIMESTAMP_INPUT_FORMAT = 'YYYY-MM-DD HH24:MI:SS'
AS
SELECT
  CASE
    WHEN DATEDIFF(DAY, MAX(last_modified), CURRENT_TIMESTAMP()) > 0 THEN SYSTEM$SEND_SNS_MESSAGE(
      'your_integration_name',
      'your_sns_topic_arn',
      'No rows added in more than one day in your_table!'
    )
    ELSE 'Max modified date within the last day.'
  END AS alert_message
FROM your_table;

You could also use a dbt source freshness block:

YAML
sources:
  - name: your_source_name
    database: your_database
    schema: your_schema
    tables:
      - name: your_table
        freshness:
          warn_after:
            count: 1
            period: day
        loaded_at_field: date_column

These are great tools and tactics to use on your most important tables, but what about the tables upstream from your most important tables? Or what if you don’t know the exact threshold? What about important tables you are unaware of or never anticipated needing a freshness check? The truth is data freshness checks don’t work well at scale (more than 50 tables or so). One of the benefits of a data observability platform with data lineage is that if there is a data freshness problem in an upstream table that then creates data freshness issues in dozens of tables downstream, you get one cohesive alert rather than disjointed pings telling you your modern data stack is on fire. Resolving Data Freshness Issues The faster you resolve data freshness incidents, the less data downtime and cost you incur. Solve the data freshness issue quickly enough, and it may not even count against your SLA. Unfortunately, this is the most challenging part of dealing with data freshness issues. As previously mentioned, data can break in a near-infinite number of ways. This leaves two options. You can manually hop from tab to tab, checking the most common system, code, and data issues. However, this takes a lot of time and doesn’t guarantee you find the root cause. Our recent survey found it took respondents an average of 15 hours to resolve data incidents once detected! Alternatively, a data observability platform can help teams resolve data freshness issues much more quickly with capabilities such as data lineage, query change detection, correlation insights for things like empty queries, and more. Preventing Data Freshness Issues Unfortunately, bad data and data freshness issues are a fact of life for data teams. You can’t out-architect bad data. However, you can reduce the number of incidents by identifying and refactoring your problematic data pipelines. Another option, which is a bit of a double-edged data freshness sword, is data contracts. Unexpected schema changes are one of the most frequent causes (along with failed Airflow jobs) of stale data. A data contract architecture can encourage software engineers to be more aware of how service updates can break downstream data systems and facilitate how they collaborate with data engineers. However, data contracts also prevent this data from landing in the data warehouse in the first place, which can itself create freshness gaps, so they can cut both ways. The Bottom Line: Make Your Data Consumers Wildly Happy With Fresh Data When you flip a light switch, you expect there to be light.
When your data consumers visit a dashboard, they expect the data to be fresh–it’s a baseline expectation. Prevent those nasty emails and make your data consumers wildly happy by ensuring when they need the data, it is available and fresh. Good luck!
The "500 Internal Server Error" message is all too familiar to website owners and web developers. This issue can happen on any web server, including Nginx. It is a frustrating error that can stop your website from functioning and give users a bad experience. In this post, we will walk through how to resolve the 500 Internal Server Error in Nginx so you can quickly restore your website's functionality. What Does “500 Internal Server Error” Mean? The HTTP status code 500 Internal Server Error indicates that the web server ran into trouble but cannot identify the exact problem. Numerous factors, such as a server configuration issue, a conflict between a plugin or theme, or a problem with the code, can cause this error. Step 1: Examine the Server Logs Examining the server logs is the first step in resolving the 500 Internal Server Error in Nginx. The server logs can tell you what triggered the error and help you resolve the problem. They are written to the Nginx error log file, which is typically found at /var/log/nginx/error.log. If you're using a hosting control panel, you may use a file manager or SSH to access the file. Step 2: Increase the PHP Memory Limit The next step is to raise the PHP memory limit if the server logs don't provide any helpful information. This error may appear if your website exceeds the PHP memory limit. To raise the PHP memory limit, modify the php.ini file and include the following line: memory_limit = 256M Step 3: Verify Plugin or Theme Conflicts Checking for plugin or theme conflicts is the next step if raising the PHP memory limit doesn't fix the problem. Conflicts between plugins or themes might be the cause of this issue. To check for incompatibilities, disable all plugins and switch to the default theme. Then reactivate each plugin and switch back to each theme one at a time until you identify the one that is at fault. Step 4: Examine the Code for Syntax Mistakes The next step is to look for syntax mistakes in the code if there aren't any conflicts with plugins or themes. A syntax error in the code can also trigger this error. Go over the code line by line to look for syntax mistakes; a code editor with syntax highlighting can help you spot any issues. Step 5: Reinstall Nginx Reinstalling Nginx is the last resort if none of the preceding steps fix the problem. This error may appear if there is an issue with the Nginx installation itself. To reinstall, remove the current version of Nginx and install the most recent version. Conclusion Although dealing with the 500 Internal Server Error in Nginx can be unpleasant, the steps above should help you locate and resolve the problem. Always remember to examine the server logs first since they often provide useful details about what went wrong. If you're still having issues, don't hesitate to seek help from a skilled web developer or the hosting support team. Correcting this problem ensures your website offers a better user experience for your visitors.
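As a small companion to Step 1, here is a hypothetical Python helper that prints the most recent entries from the Nginx error log. The default /var/log/nginx/error.log path and the number of lines shown are assumptions and may differ on your server or hosting panel.

Python
from pathlib import Path

# Default Nginx error log location on many Linux distributions; adjust if
# your server or hosting panel writes logs elsewhere.
ERROR_LOG = Path("/var/log/nginx/error.log")

def tail_error_log(path: Path = ERROR_LOG, lines: int = 20) -> list[str]:
    """Return the last `lines` entries of the Nginx error log."""
    if not path.exists():
        return [f"Log file not found: {path}"]
    with path.open("r", errors="replace") as handle:
        return handle.readlines()[-lines:]

if __name__ == "__main__":
    for entry in tail_error_log():
        print(entry.rstrip())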
Developers, architects, and application teams are constantly chasing technical debt. For better or worse, it’s a nagging problem that too often gets kicked down the road until it’s too late and application development slows down, new features slip, test cycles increase, and costs ramp up. In the most public situations, the applications tip over completely, like we’ve seen most recently at Southwest Airlines, Twitter, and the FAA, and at others that never get publicized — but you know who you are. Technical debt takes on various forms, from source code smells to security risks to the more serious issue of architectural technical debt. Great tools exist to scan for source code quality and security, but tracking, baselining, and detecting architectural drift have been hard due to the lack of observability, tooling, and best practices. What exactly is architectural technical debt and why should I care? If you are an architect or developer responsible for maintaining and extending an older Java or .NET monolith, you probably are already deeply familiar with the problem. A monolithic application is actually defined by its architectural (monolithic) pattern, which carries with it dense dependencies, long dependency chains, and, in essence, a big ball of mud that is opaque to any architect trying to understand and track it. This is the essence of architectural technical debt: the class entanglements, deep dependencies, dead code, long dependency chains, dense topologies, and lack of common code libraries that plague monoliths, older applications, and more recently even microservices that have begun to resemble monoliths themselves. Architectural Observability Software architects, up to this point, have lacked the observability and tooling to understand, track, and manage technical debt from their perspective. Architectural debt is NOT source code quality or cyclomatic complexity, although these are critical technical debt elements to track and manage. The problems cut much deeper, as these structural issues directly affect product quality, feature delivery lead time, and testing times. Academic research has highlighted how analyzing dependencies provides a primary predictor of the complexity of rework, refactoring, and application modernization. Architectural observability shines a light on application black boxes and ball-of-mud apps, making the opaque transparent, so architects can shift left into the ongoing software development lifecycle. This allows them to manage, monitor, and fix structural anomalies on an iterative, continuous basis before they blow up into bigger issues. Observable architecture starts with the tooling to first establish a baseline, set thresholds, and check for architectural drift to proactively detect critical anomalies. Five Critical Forms of Architectural Debt to Track Getting in front of architectural debt is challenging, but it’s never too late to start. Many monoliths have been lifted and shifted to the cloud over the last decade and should be your first targets. There are five critical factors to analyze, track, and set a plan to fix. Dead Code: The hardest kind of dead code to find is reachable legacy code residing inside applications and common libraries that is obsolete or no longer accessed by any current user flows. It’s often identified as “zombie code” as it lurks in the shadows and no developer really wants to touch it. Finding it requires a mix of dynamic and static analysis to determine whether the code is present but never accessed in production.
Dead code is different from “unreachable code” in that the code is in fact technically reachable but actually no longer used. Dead code can develop and spread over time, bloating and complicating refactoring and modernization efforts. Service Creep: Set a baseline service topology either through manual or automated means. Itemize the core business services and common services within the application, preferably in a shared location that the entire team can track. Routinely audit the app structure to see if new services have been added or deleted and whether that occurred for the proper business or technical reason. Common Classes: One of the critical aspects of preparing for a refactoring or re-architecting project is identifying common classes that should comprise core platform services that act as a shared common library. This critical modernization best practice will reduce duplicate code and dependencies, collecting common services in one place. Routinely observe the application to check for new common classes that should be added to a common library to prevent further technical debt from building up. Service Exclusivity: Once you’ve extracted one or more microservices from a monolith, baselining the exclusivity of those services and looking for architectural drift will flag future technical debt early. Measure and baseline service exclusivity to determine the percentage of independent classes and resources of a service, and alert when new dependencies are introduced that expand architectural technical debt. High-Debt Classes: Some classes carry much more technical debt than others. Analyze and set “high-debt” class scores based on the dependents, dependencies, and size to determine the best candidates for refactoring, which will have the highest impact on reducing your technical debt (a toy scoring sketch follows at the end of this section). Proactive architectural oversight with automated tooling will enable architects to be in front of these types of changes by setting schedules for observation and analysis and by configuring baseline measurements and thresholds. Architecture Drift Management Continuous modernization requires architects to take on a much more proactive role not only at the initial design of an application or when it’s time to re-architect or refactor, but all the way through the lifecycle of their apps. Architecture drift management gives architects the observability and instrumentation they need to stay on top of their architecture, modernize over time, and avoid the next tech debt disaster.
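Here is that toy scoring sketch in Python: it ranks classes by a naive "high-debt" score built from fan-in (dependents), fan-out (dependencies), and size. The class names, numbers, and weighting are entirely hypothetical; real architectural observability tooling would derive these inputs from static and dynamic analysis of the application.

Python
# Toy dependency data: class -> (dependents, dependencies, lines of code)
classes = {
    "OrderManager": (42, 31, 2800),
    "InvoiceHelper": (5, 4, 300),
    "LegacyBillingFacade": (63, 48, 5200),
    "EmailFormatter": (2, 1, 150),
}

def debt_score(dependents: int, dependencies: int, loc: int) -> float:
    # Naive weighting: coupling dominates, size is a secondary factor.
    return dependents * 2 + dependencies * 2 + loc / 1000

ranked = sorted(classes.items(), key=lambda item: debt_score(*item[1]), reverse=True)
for name, metrics in ranked:
    print(f"{name}: debt score {debt_score(*metrics):.1f}")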
The old engineering adage “Don’t touch it, it works” is terrible. Don’t listen to it. It might be OK at a small scale, but as time goes by, the bit rot spreads through your code and servers, polluting everything. Large swaths of your system become “no-man's-land.” As you’re developing a new system, you must always “touch it,” and we must make sure we hire engineers who aren’t afraid to do so. Yes, I get it. I said that sentence frequently in the past. I understand the motivation. Management doesn’t care about the bit rot in our future. They care about the here and now. Why are you wasting time on this feature? It’s working. Don’t you have enough on your plate already? Are you the Marie Kondo of coding? Does this code not spark joy? It’s more like a bad apple in a barrel. Bad code and forbidden zones tend to grow and metastasize. A living project needs to be fully accessible by the current team. It can keep working without that, but that makes every future step painful. When we have a flexible team with a relatively small and familiar code base, touching everything isn’t challenging. It’s easy in that case. The Legacy Project The hard part is touching code in legacy projects. As a consultant, I had to do that often. How do you enter a project with a million lines of code and start refactoring? The nice thing is that we’re all alike. The engineers who built the project were trained with similar books and similar thought processes. Once you understand their logic, you can understand why they did something. But a large part of the difficulty is in the tooling. Projects that were built 20 years ago used tools that are no longer available. The code might no longer compile on a modern IDE. Our immediate reflex would be to try to use an old IDE and old tooling. That might be a mistake. Old tools keep the stale bit rot. This is an opportunity. Revisit the project and update the tools. A few years ago I did some work for an older C++ codebase. I didn’t understand the code base, but the original developers built it in an older version of Visual Studio. Getting it to work on my Mac with LLVM and VS Code helped me visualize the moving pieces more clearly. Once I had a debugger up and running, fixing the bugs and weird issues became trivial. I can’t say I fully understood that codebase. But the process of porting and updating the tools exposed me to many nuances and issues. When You Can’t The flip side of that was cases where an existing legacy system is a customer requirement. I had to implement integrations with legacy systems that were external black boxes. We didn’t need to touch their code, but we needed to interface with these systems and rely on their behaviors. This is a very challenging situation. Our solution in those cases was to create a mock of the system so we could simulate and test various scenarios. In one such situation, we wrote an app that sent requests and saved responses from such a “black box” to create a simple recorder. We then used the recordings as the basis for tests in our implementation. This might not be an option since sometimes the black box is directly wired to production (directly to the stock market in one case). My rules for dealing with such a black box are: A single isolated module handles all the connections — that way we can build uniform workarounds for failures. We can use a physically isolated microservice, which is ideal for this specific case. Expose results using asynchronous calls — this prevents deadlocks and overloading a legacy system.
We can use a queue to map causes of failure, and error handling is simpler since a failure simply won’t invoke the result callback. We need to code defensively. Use circuit breakers, logging, and general observability tooling. Expect failure in every corner since this will be the most contentious part of the project. Once we wrap that legacy system, we need to trigger alerts on failures. Some failures might not bubble up to the user interface and might trigger retries that succeed. This can be a serious problem. E.g., if a stock market purchase command fails, a trader might press retry, which issues a new, successful command. But the original command might retry implicitly in the legacy system, and we can end up with two purchases. Such mistakes can be very costly and originate from that black box. Without reviewing the legacy code fully and understanding it, we can make no guarantee. What we can do is respond promptly and accurately to failures of this type. Debuggability is important in these situations, hence the importance of observability and isolation around such a black box. Confidence Through Observability In the past, we used to watch the server logs whenever we pushed a new release, waiting for user complaints to pour in. Thanks to observability, we’re the first to know about a problem in production. Observability flipped the script. Unfortunately, there’s a wide chasm between knowing about a problem and understanding or fixing it, and some problems aren’t even noticed. If we look at the observability console, we might notice an anomaly that highlights a problem, but it might not trigger an alert even though a regression has occurred. A good example of that would be a miscalculation. A change to the application logic can report wrong results, and this is very unlikely to show up in the observability data. In theory, tests should catch such issues, but tests are very good at verifying that the failures we predicted don’t happen; they don’t check for unexpected bugs. E.g., we might allocate a field size for financial calculations that works great for our developers based in the USA. However, a customer in Japan working in yen might have far larger numbers and experience a regression because of that limit. We can debug such issues with developer observability tools, but when we deeply integrate with legacy systems, we must apply the fail-fast principle rigorously so the observability layer knows about the problem. We need to assert expectations and check for conditions not just in tests, but in the production code. Here, an actual error is better than a stealthy bug. A lot of focus has been given to the non-null capabilities of languages. But the design-by-contract concepts pioneered in languages like Eiffel have gone out of fashion. This is understandable; it’s hard and awkward to write that sort of code. Checked exceptions are often the most hated feature of the Java language. Imagine having to write all the constraints you expect for every input. Not to mention dependencies on the environmental state. This isn’t tenable, and enforcing these checks at runtime would be even more expensive. However, this is something we can consciously do at the entry points to our module or microservice (a small sketch of such checks appears after the summary below). The fail-fast principle is essential when integrating with legacy systems because of the unpredictable nature of the result. Summary In the 90s I used to take a bus to my job. Every day as I walked to the office I would pass by a bank machine and every time it would reboot as I came close.
This was probably part of their cycling policy; banks have a culture of rebooting machines on a schedule to avoid potential issues. One morning I went by the machine and it didn’t reboot. I did what every good programmer/hacker would do; I pulled out my card and tried to use it. It instantly rebooted and wouldn’t take my card, but the fact that my instinct was to “try” is good. Even if it isn’t the smartest thing in the world, we need to keep code accessible and fresh. Legacy code isn’t a haunted house and we shouldn’t be afraid.
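Here is the sketch of fail-fast entry-point checks promised above, in Python. The wrapper function, the order fields, and the limits are hypothetical; the point is simply that invalid input should raise a loud error before it ever reaches the legacy black box.

Python
from dataclasses import dataclass

@dataclass
class PurchaseOrder:
    symbol: str
    quantity: int
    limit_price: float

MAX_QUANTITY = 10_000  # illustrative business limit

def submit_order(order: PurchaseOrder) -> None:
    """Single entry point to a hypothetical legacy trading black box: validate, then forward."""
    # Fail fast on anything the legacy system might silently mishandle.
    if not order.symbol or not order.symbol.isalpha():
        raise ValueError(f"invalid symbol: {order.symbol!r}")
    if not 0 < order.quantity <= MAX_QUANTITY:
        raise ValueError(f"quantity out of range: {order.quantity}")
    if order.limit_price <= 0:
        raise ValueError(f"non-positive limit price: {order.limit_price}")
    # ...forward to the isolated legacy integration module here...

submit_order(PurchaseOrder(symbol="ACME", quantity=100, limit_price=12.5))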
Platform engineering is the discipline of building and maintaining a self-service platform for developers. The platform provides a set of cloud-native tools and services to help developers deliver applications quickly and efficiently. The goal of platform engineering is to improve developer experience (DX) by standardizing and automating most of the tasks in the software delivery lifecycle (SDLC). Instead of context switching between tasks like provisioning infrastructure, managing security, and climbing tool learning curves, developers can focus on coding and delivering the business logic using automated platforms. Platform engineering has an inward-looking perspective as it focuses on optimizing developers in the organization for better productivity. Organizations benefit greatly from developers working at the optimum level because it leads to faster release cycles. The platform makes it happen by providing everything developers need to get their code into production so they do not have to wait for other IT teams for infrastructure and tooling. The self-service platform that makes developers' day-to-day activities more effortless and autonomous is called an internal developer platform (IDP). What Is an Internal Developer Platform (IDP)? An IDP is a platform that comprises self-service cloud-native tools and technologies which developers can use to build, test, deploy, monitor, or do almost anything regarding application development and delivery with as little overhead as possible. Platform engineers or platform teams build it after consulting the developers and understanding their unique challenges and workflows. After discussing and implementing Kubernetes CI/CD pipelines and GitOps solutions for many large hi-tech enterprises, we realized a typical IDP would consist of the five pillars below: CI/CD platforms for automated deployments (Jenkins, Docker Hub, Argo CD, Devtron, Spinnaker) Container orchestration platforms for managing containers (Kubernetes, Nomad, Docker Swarm) Security management tools for authentication, authorization, and secret management (HashiCorp Vault, AWS Secrets Manager, Okta Identity Cloud) Infrastructure as code (IaC) tools for automated infrastructure provisioning (Terraform, Ansible, Chef, AWS CloudFormation) Observability stacks for workloads and applications visualization across all the clusters (Devtron Kubernetes dashboard, Prometheus, Grafana, ELK stack) The platform team designs the IDP to be easy for developers to use, with a minimal learning curve. IDPs can help reduce developers' cognitive load and improve DX by automating repetitive tasks, reducing maintenance overhead, and eliminating the need for endless scripting. An IDP enables development teams to independently manage resources, infrastructure needs, deployments, and rollbacks by providing a self-service platform. This increases developer autonomy and accountability, reduces dependencies, and streamlines the development cycle. Why Is Platform Engineering Important? Platform engineering can help organizations reap several internal (developers) and external (end users) benefits: Improved developer experience (DX): The plethora of cloud-native tools increases the cognitive load of developers, as it takes a good amount of time to decide which one to use for their specific use cases and master it. Platform engineering solves this and improves DX by providing a simplified, standardized set of tools and services that suit developers’ unique workflows. Increased productivity: The IDP provides everything developers need to get their code tested and deployed in a self-service manner. This reduces delays in different stages of the SDLC, like waiting for someone to provision the infrastructure to deploy to, for example. Platform engineering ensures developer productivity by helping them focus mainly on the core development work. Standardization by design: IT teams use a variety of tooling in a typical software organization, varying from team to team. Maintaining and keeping track of things becomes complex in such a situation. Platform engineering solves this by standardizing the tools and services, making it easier to resolve bottlenecks because the platform is identical for every developer. Faster releases: The platform team ensures developers are working on delivering the business logic by providing toolchains that are easily consumable, reusable, and configurable. Developers are very productive as a result, which accelerates time-to-market for features and innovations, reliably and securely. Implementing a successful platform team in an organization and leveraging the above benefits requires following some common principles. Treating the platform as a product is one of them. Platform as a Product One of the core principles of platform engineering is productizing the platform. The platform team needs to employ a product management mindset to design and maintain a platform that is not only user-friendly but meets the expectations and needs of the customers (app developers). It starts with collecting data points around the problems developers have and identifying which areas to improve. This could be improving deployment frequency, reducing the change failure rate, improving reliability and security, improving DX, etc. It is important to note that building a platform is all about building a core product that solves common challenges most teams have. It is not about solving the problems of a single team but providing the product across multiple teams to solve the same set of problems. For example, if multiple teams require the same piece of infrastructure, it makes sense for the platform team to work on that shared piece and distribute it. This idea of reusing the platform and repeatability is crucial as it allows for standardization, consistency, and scalability in application delivery. As in product management, the platform team owns the product, chooses certain metrics, and continues taking customer feedback to improve the user experience. The platform's product roadmap evolves with respect to feedback, and it accommodates the changing needs and desires of the customers. Roles and Responsibilities of Platform Engineers The primary role of a platform engineer is to design and maintain a self-service platform (IDP) and provide platform services for developers.
It starts with engaging with the developers and understanding their pain points: Listen to the Customers Interview developers and different IT teams to understand their engineering landscape and challenges and to know what they are optimizing for. They may be trying to build an effective CI/CD pipeline or implement better access control, among many other challenges around software delivery. Prioritize Identify common challenges most teams share and prioritize solving them over problems individual teams face. For example, if most teams find it hard to store and retrieve secrets securely, it is ideal to prioritize and solve them for everyone. Platform Designing Design IDP with required tools that would solve those problems for users, along with documentation to enable developers to self-serve resources and infrastructure. Adopting a secret management tool would solve challenges around securely managing secrets in the above case. Part of platform designing also includes writing scripts to automate routine development tasks, such as spinning up new environments and provisioning infrastructure to reduce errors and friction points in the development flow. Metrics Choose specific metrics around the goals to measure the platform's effectiveness. For example, if the goal is to improve DX, the metrics include engagement scores, team feedback, etc. Similarly, the metrics will change if the goal is to reduce the change failure rate or to increase deployment frequency. Gather Feedback and Maintain the Platform Continue listening to the customers and watch the metrics. Gather user feedback to add new tools to the platform and optimize for a better user experience. This also includes staying up-to-date with emerging tools and technologies in the DevOps and cloud infrastructure space and adopting them if necessary. It is easy to confuse the roles of a DevOps engineer or SRE with that of a platform engineer since they all manage the underlying infrastructure and support software development teams. Although there are certain overlapping responsibilities between all these roles, each differs from the others with its unique focus. Platform Engineering vs. DevOps DevOps is a philosophy that brought a cultural shift to SDLC to improve software delivery speed and quality. DevOps facilitated collaboration and communication between development and ops teams and accelerated automation to streamline deployments. Platform engineering — a practice rather than a philosophy — can be considered the next iteration of DevOps as it shares some core principles of DevOps: collaboration (with Ops), continuous improvement, and automation. The daily tasks of a platform team and DevOps differ from each other in some aspects. DevOps use certain tools and automation to streamline getting the code to production, managing it, and observing it using logging and monitoring tools. They mostly work on building an effective CI/CD pipeline. Platform engineers take all the tools used by DevOps and integrate them into a shared platform, which different IT teams can use on an enterprise level. This eliminates the need for teams to configure and manage infrastructure and tooling on their own and saves significant time, effort, and resources. Platform engineers also create the documentation and optimize the platform so developers can self-serve the tools and infrastructure in their workflow. Platform teams are required only in matured companies with many different IT teams using complex tools and infrastructure. 
Naturally, a dedicated platform team to manage the complexity will become necessary in such an engineering landscape. The platform team builds and manages the infrastructure, helping DevOps speed up continuous delivery. However, it is common for the DevOps team to perform platform engineering tasks (configuring Terraform, for example) in startups. Platform Engineering vs. SRE Site reliability engineers (SREs) focus on ensuring the application is reliable, secure, and always available. They work with developers and Ops teams to create systems or infrastructure that support delivering highly reliable applications. SREs also perform capacity planning and infrastructure scaling and manage and respond to incidents so that the platform meets required service level objectives (SLOs). On the other hand, platform engineering manages complex infrastructure and builds an efficient platform for developers to optimize SDLC. While both work on platforms and their roles sound similar, their goals differ. The major difference between platform engineering and SRE regards whom they face and cater their services to. SREs face end users and ensure the application is reliable and available for them. Platform engineers face internal developers and focus on improving their developer experience. The daily tasks of both teams differ with respect to these goals. Platform engineering provides the underlying infrastructure for rapid application delivery, while SREs do the same to deliver highly reliable and available applications. SREs work more on troubleshooting and incident response, and platform engineers focus on complex infrastructure and enabling developer self-service. To achieve their respective goals, both SREs and platform teams use different tools in their workflows. SREs mostly use monitoring and logging tools like Prometheus or Grafana to detect anomalies in real-time and to set automated alerts. Platform teams work with different sets of tools spanning various stages of the software delivery process, such as container orchestration tools, CI/CD pipeline tools, and IaC tools. All in all, SREs and platform teams work on building a reliable and scalable infrastructure with different goals but with some overlapping between the tools they use. How To Implement Platform Engineering in an Organization A platform team will not be an immediate requirement in a startup with a few engineers. Once the organization grows to multiple IT teams and starts dealing with complex tooling and infrastructure, it is ideal to have platform engineers to manage the complexity. Create the Role (Head/VP of Engineering) Top-level engineers like the VP or Head of Engineering usually create the role of a platform engineer when developers spend more time configuring the tools and infrastructure rather than delivering the business logic. They would find that most IT teams are solving the same problems, like spinning up a new environment, which lags the delivery process. So the Head of Engineering would define the scope of platform engineering, identify the areas of responsibility, and create the role of a platform engineer/team. Create an Internal Developer Platform (Platform Engineers/Team) The platform engineer starts by building the logs of the infrastructure and tools that are already used in the organization. Then they would interview developers and understand their challenges and build the internal developer platform with tools and services that solve problems on an enterprise level. 
They will build the platform in a way that is flexible and facilitates different architectures and deployment styles. Platform engineers also create documentation and conduct training sessions to help developers self-serve the platform. It is ideal for platform engineers to have a developer background so they know what it is like to be a developer and understand the challenges better. Onboard Users (Application Developers) Once the platform is ready, platform engineers onboard application developers. It will require internal marketing and letting teams know of the platform and what it can solve. The best way to onboard users is to pull them to the platform rather than throw the platform at them. This can be done by starting with a small team and helping them overcome a challenge. For example, help a small team optimize CI/CD pipeline and provide the best experience possible in the process. Word-of-mouth from early adopters will have a positive ripple effect throughout the organization, which will help onboard more users to the platform. Platform engineering does not stop at onboarding the users. It is a continuous process where the platform accommodates emerging tools and technologies and the changing needs and requirements of the users. Conclusion: Platform Engineering With Open-Source Tools Selecting an open-source platform that is built to enable platform engineers with a standardized toolchain that helps developers accelerate software delivery is important. Devtron is one such platform that helps developers by automating CI/CD platform, security, and observability for end-to-end SDLC.
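As a small illustration of the "scripts to automate routine development tasks" mentioned earlier, here is a hypothetical Python sketch of a self-service helper a platform team might expose for spinning up an ephemeral environment. The function name, the namespace naming convention, and the choice to shell out to kubectl are assumptions for illustration, not a description of any particular IDP.

Python
import subprocess

def create_ephemeral_environment(team: str, branch: str) -> str:
    """Create an isolated Kubernetes namespace for a feature branch (illustrative)."""
    namespace = f"{team}-{branch}".lower().replace("/", "-")[:63]
    # A real IDP would also apply quotas, network policies, and baseline tooling.
    subprocess.run(["kubectl", "create", "namespace", namespace], check=True)
    subprocess.run(
        ["kubectl", "label", "namespace", namespace, f"owner={team}", "ephemeral=true"],
        check=True,
    )
    return namespace

# Example self-service call a developer might make from a CLI or portal:
# create_ephemeral_environment("payments", "feature/new-checkout")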
This article outlines some of the key factors to consider while choosing whether to build or buy Incident Management software. When your organization realizes that it needs an Incident Management System (IMS), the first question is almost always, "Build or Buy?" Superficially, the requirements seem simple, and being a technical organization, you probably have the skills you need as well. With your deep knowledge of your internal setup, you can surely build one that's best suited to your needs. This may seem like a solid argument for building your own IMS; however, there are some hidden factors that you may not have considered. In this blog, we look at the costs involved in building your own IMS and help you determine if the return on investment (ROI) makes it worth building one. First, let's quickly look at some advantages of building your own IMS. The biggest advantage is that you can build an IMS that suits your needs perfectly. If your organization does not use Sensu, for example, then why build support for it? Instead, you can directly integrate with any on-premise monitoring tool you have in place. If you restrict access to your production network, an off-the-shelf SaaS IMS will be difficult to use. An in-house IMS will not face such issues. Now that we have had a look at the advantages, let's look at the disadvantages. Budgeting for an In-House Incident Management System (IMS) When building your own IMS, it can be very difficult to estimate the total cost of ownership. In general, it is easier to get approval for a one-time cost rather than an open-ended project. The trouble with building an IMS is that, more often than not, the cost estimates do not account for long-term maintenance, usability, and reliability. While getting budgetary approval for the IMS, it's hard to communicate the benefits the system will bring. This is because many of the benefits of having a strong IMS platform in place are qualitative. While the on-call experience and effectiveness of engineers will definitely improve, it is hard to measure the benefits quantitatively. Convincing your management to increase the budget may become harder as time passes and additional features are required. Many organizations may not have the measuring tools necessary to decisively prove the return on investment. You may build your own dashboard that tracks the MTTR (Mean Time To Respond), but unless such metrics were already being tracked, it will be a hard sell to convince management. Off-the-shelf systems, on the other hand, often don't have high upfront costs and require little commitment. A small pilot of a commercial product is an easier sell than a potentially long and expensive development project. But is building one, in fact, expensive? Let's break it down: Development Costs This includes the cost of assigning programmers to the task, the tools required to build the IMS, and the infrastructure to test and deploy it. Given a list of features, it is possible to estimate this particular cost, and it is reasonably straightforward to get funding for these expenses. Maintenance Costs Like any other piece of software, there will be maintenance costs associated with building an IMS. The costs associated with maintenance include fixing the bugs that crop up during the development and use of the IMS. You will also need to factor in costs as your requirements grow - these could be changes in the production applications, databases, vendor tools, or any other dependencies.
As the underlying software is updated, you will also need to consider the associated security fixes. This involves setting aside time to ensure that any newly discovered security vulnerabilities won't compromise your system. In certain scenarios, you may also need to hire external contractors to validate your security. Since it is a mission-critical piece of software that alerts you to any problems in your entire infrastructure, you cannot neglect its maintenance. You cannot afford to delay patching any vulnerabilities or making critical fixes to your IMS. Therefore you will need to set aside dedicated and continuous engineering capacity for the maintenance of the IMS. Even if it is a part-time team, there must be someone available at short notice to make any critical fixes required. This team and the overhead of maintaining it will likely be your single highest cost and the most difficult to sustain. Opportunity Cost This is one of the hidden costs that are harder to measure. While developing your own IMS, you will take away engineering capacity from other aspects of your organization. These people could have been working on your organization's product instead of working on the IMS. Now that we have looked at the cost of developing your own in-house platform let us look at the cost incurred if you opt for an off-the-shelf incident management platform. Factors to Consider for an Off-the-Shelf Incident Management Platform Usually, off-the-shelf platforms are more expensive to develop because they have to be more flexible in terms of feature sets and be able to scale to a higher number of users. Fortunately, you will end up paying only a fraction of that cost because it is shared among all the customers of the product. In fact, if you have a small team, you can get many features free of cost from several incident management platforms. In general, for a particular feature set, the cost of acquisition will be far lower with off-the-shelf systems. Deployment and Training Cost Off-the-shelf systems are usually quite flexible, but you may have to spend some time and effort adapting your systems to them. You may have to change some of your processes or deprecate old, unsupported monitoring tools, for example. This also includes any training costs for the users in your organization. Usability and Features Due to the competitive nature of the market, any off-the-shelf incident management platform will need to keep up and add features to ensure it does not fall behind. An in-house platform often stops being developed as soon as basic minimum functionality is in place. In-house platforms can have poor usability as they are built in an ad-hoc fashion by SREs without input from UX professionals. A better user interface ensures more efficiency and ease of use. Any external product will already have been used by hundreds if not thousands of users in other organizations and, therefore, will have a highly optimized layout. An external platform will also have the added benefit of a customer support team to answer any queries not covered by the support documentation. Conclusion These were the costs and benefits of having an in-house versus an external system. If you factor in the hidden costs, compliance, and support issues, unless you are operating at the scale of Google or Facebook or are operating an esoteric system that is incompatible with external tools, investing in an in-house incident management platform makes little sense. 
However, in the majority of cases, be it a growing or a small SRE team in a large organization, an off-the-shelf solution is significantly desirable. For most organizations, the return on investment is not substantial enough to warrant planning and developing an in-house incident management system.
Infrastructure as Code started off with the commoditization of virtual machines sometime around the mid-2000s. As with many things in the cloud infra space, Amazon played a key role in making IaC popular. The launch of AWS CloudFormation in 2011 made IaC an essential DevOps practice. But what makes IaC such a conversation-inspiring topic these days? Probably because developers like to handle their infrastructure just like their application code: version controlled, tested, and deployed automatically. Yes, IaC lets you treat your infrastructure just like your application code. This is a big step up from the pre-IaC situation where configuring servers was a task reserved for a few select team members. Here’s what things look like from a big-picture point of view, no matter the tool you use.
The big-picture view of infrastructure as code tools
A typical IaC workflow consists of the below steps: Developers define and write the infrastructure code and put it in a version control tool like GitHub. This code goes through the pull request and peer-review magic, ultimately ending up in a release branch. After that, the IaC tool takes over and creates the infrastructure in the cloud or on-premises. With IaC, you are no longer dependent on a secret society of extremely important engineers who own the keys to the application’s infrastructure. With IaC, you can write and test code that makes creating, updating, and deleting infrastructure a matter of pushing a button. But the IaC landscape is daunting. Even though the industry is still nascent, there must be hundreds of tools out there. And more are coming. In this post, I cut through the clutter and divide the IaC landscape into broad categories that can help you make a more informed choice. How To Think About IaC Tools? The problem with IaC tools is that there are so many of them. Also, each one of them is marketed as the best out there. This makes it tough for developers to understand and select the right tools for their requirements. To begin with, stop looking at these tools in isolation. Instead, look at them based on three broad categories. 1 - Based on the Language The first wave of IaC tools relied on a Domain-Specific Language or DSL to describe the infrastructure. Tools such as Ansible, Terraform & Chef fall under the DSL category. For example, Terraform uses a special language known as HashiCorp Configuration Language (HCL) to describe the resources. Here’s an example to define an EC2 instance in Terraform using HCL:

provider "aws" {
  region  = "us-west-2"
  profile = "terraform-user"
}

resource "aws_instance" "hello_aws" {
  ami           = "ami-0ceecbb0f30a902a6"
  instance_type = "t2.micro"

  tags = {
    Name = "HelloAWS"
  }
}

On the other end of the spectrum, we have IaC tools that use existing programming languages to describe the infrastructure. Tools such as Pulumi & AWS CDK fall into this category. Here’s how the EC2 code is written for Pulumi with TypeScript.

TypeScript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

const instance = new aws.ec2.Instance("my-ec2-instance", {
    ami: "ami-0c55b159cbfafe1f0", // Amazon Linux 2 AMI
    instanceType: "t2.micro",
    tags: {
        Name: "HelloAWS",
    },
});

export const publicIp = instance.publicIp;

For reference, Pulumi supports a bunch of languages such as TypeScript, Python, Go, C#, Java and even YAML. What’s the takeaway? If you don’t want to learn a new domain-specific language for writing infra code, go for an IaC tool that works with a familiar language.
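Since Pulumi also supports Python, roughly the same instance can be sketched in that language as well. This is a minimal illustration: the AMI ID is reused from the TypeScript example above, and the resource names are placeholders.

Python
import pulumi
import pulumi_aws as aws

# Roughly the same EC2 instance as the TypeScript example above.
instance = aws.ec2.Instance(
    "my-ec2-instance",
    ami="ami-0c55b159cbfafe1f0",  # Amazon Linux 2 AMI
    instance_type="t2.micro",
    tags={"Name": "HelloAWS"},
)

pulumi.export("public_ip", instance.public_ip)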
This is particularly useful for teams that don’t have a dedicated DevOps role and rely on application developers to dabble with infrastructure. If you are a team lead, you can make things easy for them by going with a tool that supports a familiar programming language. 2 - Based on the Approach You can also differentiate IaC tools on the basis of their provisioning approach - push and pull. Push-based tools use a centralized server to push configuration changes to target machines. For example, Ansible & Terraform Enterprise. Pull-based tools use a centralized server to store configuration information and rely on agents on the target machines to pull the latest configuration from the server. For example, Puppet & Chef. Here’s a diagram that illustrates the difference between these two approaches in the context of Ansible and Puppet.
Push vs Pull-based IaC with Ansible & Puppet
In Ansible, a central node uses SSH to connect to host systems and configures them. Basically, the configuration is pushed to the host systems. In Puppet, an agent installed on the host system pulls the manifest from the Puppet master and applies the configuration on the machine. What’s the takeaway? If you don’t want to install additional software on your host machines, go for a tool that supports a push-based approach. However, if you want things to be more automated, pull-based tools shine as your host machines are automatically configured as soon as they are ready. 3 - Based on the Philosophy Lastly, you can also group IaC tools based on their philosophy - declarative and imperative. Declarative IaC is like setting a destination in your GPS and letting the tool figure out the best route to get you there. You simply declare the desired state of your infrastructure and let the IaC tool do the rest. Tools like Terraform & AWS CloudFormation follow this philosophy. Here’s what declarative code looks like in Terraform:

resource "aws_instance" "hello_aws" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name = "HelloAWS"
  }
}

Imperative IaC is all about giving instructions to the IaC tool on how to reach the destination. You have to describe the exact steps needed to reach the desired state of the infrastructure. Tools like Ansible & Puppet follow this philosophy. Here’s what imperative code looks like in Ansible:

YAML
- name: Install Apache
  apt:
    name: apache2
    state: present

- name: Configure Apache
  template:
    src: /path/to/httpd.conf.j2
    dest: /etc/httpd/conf/httpd.conf
    mode: '0644'

What’s the takeaway? If you want to make your IaC code easy to write and reason about, go for a tool that supports the declarative philosophy. It will make your life as a developer easy. However, if you want flexibility, go for imperative tools. In an imperative tool, you can control each step the tool is going to take to provision the infrastructure. Over to You Are you planning to use IaC in your project? Also, which type of IaC tool do you find preferable: DSL or non-DSL? Push or pull? Declarative or imperative? Write your replies in the comments section. If you found today’s post useful, consider sharing it with friends and colleagues.
Site Reliability Engineering (SRE) is a systematic and data-driven approach to improving the reliability, scalability, and efficiency of systems. It combines principles of software engineering, operations, and quality assurance to ensure that systems meet performance goals and business objectives. This article discusses the key elements of SRE, including reliability goals and objectives, reliability testing, workload modeling, chaos engineering, and infrastructure readiness testing. It also discusses the importance of SRE in improving user experience, system efficiency, scalability, and reliability, and in achieving better business outcomes.

SRE is an emerging field that seeks to address the challenge of delivering high-quality, highly available systems. It is a proactive and systematic approach to reliability optimization characterized by the use of data-driven models, continuous monitoring, and a focus on continuous improvement. SRE combines software engineering and IT operations, applying the principles of DevOps with a focus on reliability. The goal of SRE is to automate repetitive tasks and to prioritize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

The benefits of adopting SRE include increased reliability, faster resolution of incidents, reduced mean time to recovery, improved efficiency through automation, and increased collaboration between development and operations teams. In addition, organizations that adopt SRE principles can improve their overall system performance, increase the speed of innovation, and better meet the needs of their customers.

SRE 5 Whys

1. Why Is SRE Important for Organizations?

SRE ensures high availability, performance, and scalability of complex systems, leading to improved user experience and better business outcomes.

2. Why Is SRE Necessary in Today's Technology Landscape?

Systems and infrastructure have become increasingly complex and prone to failures, and organizations need a reliable and efficient approach to manage these systems.

3. Why Does SRE Involve Combining Software Engineering and Systems Administration?

Both disciplines bring unique skills and expertise to the table. Software engineers have a deep understanding of how to design and build scalable and reliable systems, while systems administrators have a deep understanding of how to operate and manage these systems in production.

4. Why Is Infrastructure Readiness Testing a Critical Component of SRE?

Infrastructure readiness testing ensures that the infrastructure is prepared to support the desired system reliability goals. By testing the capacity and resilience of infrastructure before it is put into production, organizations can avoid critical failures and improve overall system performance.

5. Why Is Chaos Engineering an Important Aspect of SRE?

Chaos engineering tests the system's ability to handle and recover from failures in real-world conditions.
By proactively identifying and fixing weaknesses, organizations can improve the resilience and reliability of their systems, reducing downtime and increasing confidence in their ability to respond to failures.

Key Elements of SRE

- Reliability Metrics, Goals, and Objectives: Defining the desired reliability characteristics of the system and setting reliability targets.
- Reliability Testing: Using reliability testing techniques to measure and evaluate system reliability, including disaster recovery testing, availability testing, and fault tolerance testing.
- Workload Modeling: Creating mathematical models to represent system reliability, including Little's Law and capacity planning.
- Chaos Engineering: Intentionally introducing controlled failures and disruptions into production systems to test their ability to recover and maintain reliability.
- Infrastructure Readiness Testing: Evaluating the readiness of an infrastructure to support the desired reliability goals of a system.

Reliability Metrics In SRE

Reliability metrics are used in SRE to measure the quality and stability of systems, as well as to guide continuous improvement efforts.

- Availability: The proportion of time a system is available and functioning correctly. It is often expressed as a percentage and calculated as the total uptime divided by the total time the system is expected to be running.
- Response Time: The time it takes for the infrastructure to respond to a user request.
- Throughput: The number of requests that can be processed in a given time period.
- Resource Utilization: The utilization of the infrastructure's resources, such as CPU, memory, network, heap, caching, and storage.
- Error Rate: The number of errors or failures that occur during the testing process.
- Mean Time to Recovery (MTTR): The average time it takes to recover from a system failure or disruption, which provides insight into how quickly the system can be restored after a failure occurs.
- Mean Time Between Failures (MTBF): The average time between failures for a system. MTBF helps organizations understand how reliable a system is over time and can inform decision-making about when to perform maintenance or upgrades.

Reliability Testing In SRE

- Performance Testing: Evaluating the response time, processing time, and resource utilization of the infrastructure to identify any performance issues under a business-as-usual (1x) load.
- Load Testing: Simulating real-world user traffic and measuring the performance of the infrastructure under a heavy (2x) load.
- Stress Testing: Applying more load than the expected maximum (3x load) to test the infrastructure's ability to handle unexpected traffic spikes.
- Chaos or Resilience Testing: Simulating different types of failures (e.g., network outages, hardware failures) to evaluate the infrastructure's ability to recover and continue operating.
- Security Testing: Evaluating the infrastructure's security posture and identifying any potential vulnerabilities or risks.
- Capacity Planning: Evaluating the current and future hardware, network, and storage requirements of the infrastructure to ensure it has the capacity to meet growing demand.

Workload Modeling In SRE

Workload modeling is a crucial part of SRE, which involves creating mathematical models to represent the expected behavior of systems. A minimal sketch of such a model follows; the formula it relies on, Little's Law, is explained in detail right after.
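Here is a small TypeScript sketch of this kind of workload model. The interface, the function names, and the 50-request concurrency limit are assumptions made purely for this illustration, not part of any standard library or of the article's tooling:

TypeScript

// Little's Law: the average number of requests in the system (L)
// equals the average arrival rate (λ) multiplied by the average
// time each request spends in the system (W): L = λ * W.

interface WorkloadModel {
  arrivalRatePerSecond: number;   // λ, requests arriving per second
  avgTimeInSystemSeconds: number; // W, average time a request spends in the system
}

// Average concurrency implied by the workload (Little's Law).
function averageInFlightRequests(model: WorkloadModel): number {
  return model.arrivalRatePerSecond * model.avgTimeInSystemSeconds;
}

// Rough headroom check against a known concurrency limit (e.g., a worker pool size).
function hasHeadroom(model: WorkloadModel, maxConcurrent: number): boolean {
  return averageInFlightRequests(model) <= maxConcurrent;
}

// Example: 200 requests/minute with a 2-second average response time.
const model: WorkloadModel = {
  arrivalRatePerSecond: 200 / 60, // ≈ 3.33 requests/second
  avgTimeInSystemSeconds: 2,
};

console.log(averageInFlightRequests(model).toFixed(1)); // ≈ 6.7 requests in flight on average
console.log(hasHeadroom(model, 50));                    // true: well within a 50-request limit

The point of the sketch is simply that arrival rate and response time together determine average concurrency, which can then be compared against known capacity limits.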
Little's Law is a key principle in this area. It states that the average number of items in a system (L) is equal to the average arrival rate (λ) multiplied by the average time each item spends in the system (W):

L = λ × W

This formula can be used to estimate how many requests are in flight, on average, under different traffic conditions.

Example: Consider a system that receives an average of 200 requests per minute, with an average response time of 2 seconds. Converting the arrival rate to consistent units gives λ = 200/60 ≈ 3.33 requests per second, so:

L = λ × W ≈ 3.33 requests/second × 2 seconds ≈ 6.7 requests

On average, roughly seven requests are in the system at any given moment. If the concurrency implied by expected traffic approaches or exceeds what the infrastructure can actually sustain, queues build up, response times grow, and reliability degrades. By using the right workload modeling, organizations can determine the maximum workload their systems can handle, take proactive steps to scale their infrastructure, and identify potential issues and design solutions before they become real problems.

Tools and techniques used for modeling and simulation:

- Performance Profiling: Monitoring the performance of an existing system under normal and peak loads to identify bottlenecks and determine the system's capacity limits.
- Load Testing: Simulating real-world user traffic to test the performance and stability of an IT system. Load testing helps organizations identify performance issues and ensure that the system can handle expected workloads.
- Traffic Modeling: Creating a mathematical model of the expected traffic patterns on a system. The model can be used to predict resource utilization and system behavior under different workload scenarios.
- Resource Utilization Modeling: Creating a mathematical model of the expected resource utilization of a system, which can likewise be used to predict behavior under different workload scenarios.
- Capacity Planning Tools: Various tools automate the process of capacity planning, including spreadsheet tools, predictive analytics tools, and cloud-based tools.

Chaos Engineering and Infrastructure Readiness in SRE

Chaos engineering and infrastructure readiness are important components of a successful SRE strategy. Both involve intentionally inducing failures and stress into systems to assess their strength and identify weaknesses. Infrastructure readiness testing verifies the system's ability to handle failure scenarios, while chaos engineering tests the system's recovery and reliability under adverse conditions.

The benefits of chaos engineering include improved system reliability, reduced downtime, and increased confidence in the system's ability to handle real-world failures. By proactively identifying and fixing weaknesses, organizations can avoid costly downtime, improve customer experience, and reduce the risk of data loss or security breaches. Integrating chaos engineering into DevOps practices (CI/CD) ensures that systems are thoroughly tested and validated before deployment.

Methods of chaos engineering typically involve running experiments or simulations on a system to stress and test its various components, identify any weaknesses or bottlenecks, and assess its overall reliability. A simple experiment of this kind is sketched below.
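The sketch is deliberately generic TypeScript: the instance model, the terminateInstance stub, and the health-check URL are hypothetical placeholders standing in for whatever chaos tooling and infrastructure APIs a team actually uses.

TypeScript

// Hypothetical chaos experiment: terminate one random instance and
// measure how long the service takes to report healthy again.

interface Instance {
  id: string;
}

// Placeholder for a real infrastructure API call (cloud SDK, chaos tool, etc.).
async function terminateInstance(instance: Instance): Promise<void> {
  console.log(`Terminating ${instance.id} (placeholder only)`);
}

async function isHealthy(healthCheckUrl: string): Promise<boolean> {
  try {
    const res = await fetch(healthCheckUrl);
    return res.ok;
  } catch {
    return false;
  }
}

async function runInstanceTerminationExperiment(
  instances: Instance[],
  healthCheckUrl: string,
  timeoutMs = 5 * 60 * 1000,
): Promise<number> {
  // Pick a random victim, inject the failure, then poll for recovery.
  const victim = instances[Math.floor(Math.random() * instances.length)];
  const start = Date.now();
  await terminateInstance(victim);

  while (Date.now() - start < timeoutMs) {
    if (await isHealthy(healthCheckUrl)) {
      return Date.now() - start; // observed recovery time in milliseconds
    }
    await new Promise((resolve) => setTimeout(resolve, 5000)); // wait 5s between checks
  }
  throw new Error("System did not recover within the experiment timeout");
}

The observed recovery time from an experiment like this can then be compared against the MTTR targets discussed earlier.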
In practice, this is done by introducing controlled failures, such as network partitions, simulated resource exhaustion, or random process crashes, and observing the system's behavior and response.

Example Scenarios for Chaos Testing

- Random Instance Termination: Selecting and terminating an instance from a cluster to test the system's response to the failure.
- Network Partition: Partitioning the network between instances to simulate a network failure and assess the system's ability to recover.
- Increased Load: Increasing the load on the system to test its response to stress and observing any performance degradation or resource exhaustion.
- Configuration Change: Altering a configuration parameter to observe the system's response, including any unexpected behavior or errors.
- Database Failure: Simulating a database failure by shutting it down and observing the system's reaction, including any errors or unexpected behavior.

By conducting both chaos experiments and infrastructure readiness testing, organizations can deepen their understanding of system behavior and improve their resilience and reliability.

Conclusion

SRE is a critical discipline for organizations that want to deliver highly reliable, highly available systems. By adopting SRE principles and practices, organizations can improve system reliability, reduce downtime, and improve the overall user experience.
Software has become an essential part of our daily lives, from the apps on our phones to the programs we use at work. However, software, like all things, has a lifecycle, and as it approaches its end-of-life (EOL), it poses risks to the security, privacy, and performance of the systems on which it runs. End-of-life software is software that no longer receives security updates, bug fixes, or technical support from the vendor. This article looks at reducing end-of-life software risks while protecting your systems and data.

Tips to Minimize End-of-Life Software Risks

Organizations can mitigate the risk associated with end-of-life technologies by researching modern technologies, developing a timeline for transitioning to the latest technology, training staff on new features and capabilities, and creating budget plans. Additionally, organizations should consider buying technology with longer life cycles and investing in backup or redundant systems in case of any problems or delays in transitioning to the new technology. Beyond that, the following tips can help organizations retire their end-of-life software more quickly and efficiently:

1. Track Status of End-of-Life Software

By knowing their software well, teams understand details like how it behaves, how it is operated, and what it depends on. But understanding how to keep it running after support has ended is critical. That is why developers should prepare a clear plan for end-of-life software. This plan should include the following:

- Identifying which software is at risk
- Assessing the challenges
- Implementing mitigation strategies
- Switching to open-source alternatives

2. Give Adequate Time to Planning

While planning an end-of-life software life cycle, it is necessary to consider a few core aspects, like training, implementation, and adoption. Carefully plan the timeline, accounting for supply-chain issues that often cause unnecessary delays. When dealing with end-of-life support issues, begin planning early and set dates for the projects you intend to execute. Knowing important dates will assist an organization in better planning, risk management, and reducing unforeseen budget expenses. In addition, have an accurate maintenance plan in place; third-party maintenance support can be valuable here, since third-party service providers offer services like hardware replacement and repair of critical parts to keep end-of-life products running even after their official expiry date.

3. Evaluate Your Investments

Planning for EOL solutions allows organizations to rethink how they use existing technology. It also helps organizations ascertain the viability of transitioning to an alternative solution, like the cloud, and makes it easier to review business challenges and understand how alternative solutions may resolve them efficiently and cost-effectively. Companies can boost employee effectiveness with a hybrid workforce and simplify network management by transitioning to cloud-based software. Because the conversion may take significant time and money, planning should give IT managers enough time to make an informed strategic decision.

4. Try to Keep the Tech Debt Low

Developers often have the wrong notion that their legacy applications or software will keep running smoothly even without upgrades. The reality is quite the opposite.
EOL software eventually stops working with modern technologies, and upgrading it can require new hardware for compatibility, firmware updates, or third-party application compatibility work, all of which results in high tech debt. Here are some effective tips to reduce tech debt:

- Find the codebase areas that increase maintenance costs
- Restructure the codebase by breaking it into small pieces
- Invest in automated testing so changes to the codebase can be made safely
- Keep proper documentation to track code changes in real time
- Use low-code development platforms to build complex software

5. Use Compatibility Testing

Compatibility tests ensure that software performs successfully across all platforms, lowering the risk of failure and removing the hurdle of releasing an application full of compatibility bugs. Compatibility testing examines how well software performs with modern operating systems, networks, databases, hardware platforms, etc., and it allows developers to detect and eliminate errors before the release of the final product. LambdaTest, BrowserStack, BrowseEmAll, and TestingBot are popular compatibility testing tools that developers widely use.

6. Adopt Best Cybersecurity Practices

Organizations need to identify any possible vulnerabilities and take appropriate measures to minimize risks. A good place to start is evaluating current IT policies to determine whether they include strategies for disposing of software. Additionally, it is important to ensure that any sensitive data files are safely removed from the system and stored or transmitted using encryption. To further enhance cybersecurity, individuals should adhere to password strength policies, regularly change passwords, and comply with relevant regulations.

7. Avoid Waiting for a Long Time

End-of-life software is a time bomb waiting to explode, and waiting until the last minute leads to disaster. The sooner you identify obsolete software and replace it, the better. If you are not sure where to begin, a few things can get you started:

- Stay updated on relevant industry trends and news
- Constantly track new update releases from the vendor
- Visit the vendor's website and check the software lifecycle section
- Scan your EOL software environment by using AppDetectivePro

8. Get Ready for the Price Hike

Vendors often increase prices as software approaches its end-of-life date because they know customers are less likely to switch products at that stage. Companies must therefore be prepared for a price increase by budgeting for it in advance. They should also research better alternatives to the end-of-life software to avoid overpaying.

In a Nutshell

Organizations can implement any of the above solutions to combat EOL software risks. Additionally, they can adopt software development practices that prioritize maintainability and long-term support to avoid end-of-life scenarios, including code maintainability reviews, regular software updates, and documentation that aids in the long-term support of the software. Overall, taking proactive steps to mitigate the risks associated with end-of-life software is critical to reducing the likelihood of security breaches, system failures, and other issues.
Incident management has evolved considerably over the last couple of decades. Traditionally limited to just an on-call team and an alerting system, it has evolved today to include automated incident response combined with a complex set of SRE workflows.

Importance of Reliability

While the number of active internet users and people consuming digital products has been rising for a while, it is the combination of increased user expectations and competitive digital experiences that has led organizations to deliver highly reliable products and services. The bottom line is that customers have the right to expect reliable software and the right to expect the product to work when they need it, and it is the responsibility of organizations to build reliable products.

That said, no software can be 100% reliable; even achieving 99.9% reliability is a monumental task. As engineering infrastructure grows more complex by the day, incidents become inevitable. Triaging and remediating issues quickly, with minimal impact, is what makes all the difference.

From the Vault: Recapping Incidents and Outages from the Past

Let's look back at some notable outages from the past that have had a major impact on both businesses and end users alike.

- October 2021: A mega outage took down Facebook, WhatsApp, Messenger, Instagram, and Oculus VR for almost five hours, and no one could use any of those products during that time.
- November 2021: A downstream effect of a Google Cloud outage led to outages across multiple GCP products, which also indirectly impacted many non-Google companies.
- December 2022: An incident corresponding to Amazon's search issue impacted at least 20% of all global users for almost an entire day.
- January 2023: Most recently, the Federal Aviation Administration (FAA) suffered an outage due to failed scheduled maintenance, causing 32,578 flights to be delayed and a further 409 to be cancelled. Needless to say, the monetary impact was massive, and share prices of numerous U.S. air carriers fell steeply in the immediate aftermath.

Reliability Trends as of 2023

These are just a few of the major outages that have impacted users on a global scale. In reality, incidents such as these are not uncommon and are far more frequent. While businesses and business owners bear the brunt of such outages, the impact is experienced by end users too, resulting in a poor user/customer experience (UX/CX). Here are some interesting stats related to poor CX/UX:

- It takes 12 positive user experiences to make up for one unresolved negative experience
- 88% of web visitors are less likely to return to a site after a bad experience
- Even a 1-second delay in page load can cause a 7% loss in customers

That is why resolving incidents quickly is critical. But the (literal) million-dollar question is: how do you deal with incidents effectively? Let's address this by first probing the challenges of incident management.

State of Incident Management Today

Evolving business and user needs have directly impacted incident management practices.

- Increasingly complex systems have led to increasingly complex incidents. The use of public cloud and microservices architecture has made it difficult to find out what went wrong, e.g., which service is impacted and whether the outage has an upstream or downstream impact on other services. As a result, incidents themselves have become complex.
- User expectations have grown considerably due to increased dependency on technology. Widespread adoption has made users more comfortable with technology and, as a result, unwilling to put up with any kind of downtime or bad experience.
- Tool sprawl amid evolving business needs adds to the complexity. The increasing number of tools within the tech stack, added to address complex requirements and use cases, only makes incident management harder.

"...you want teams to be able to reach for the right tool at the right time, not to be impeded by earlier decisions about what they think they might need in the future." - Steve McGhee, Reliability Advocate, SRE, Google Cloud

Evolution of Incident Management

Over the years, the scope of activities associated with incident management has only been growing, and most of the evolution that has taken place can be bucketed into one of four categories: technology, people, process, and tools.

Technology
- 15 years ago: Most teams ran monolithic applications; these were easy-to-operate systems with little sophistication.
- 7 years ago: Sophisticated distributed systems were the norm in medium-to-large organizations, with growing adoption of microservices architecture and public clouds.
- Today: Even the smallest teams run complex, distributed apps, with widespread adoption of microservices architecture and public cloud services.

People
- 15 years ago: Large operations teams with manual workloads; a basic on-call team with low-skilled labor.
- 7 years ago: Smaller, more efficient ops teams with partially automated workloads; dedicated incident response teams with basic automation to notify on-call.
- Today: Fewer members in operations, but fully automated workloads; dedicated response teams with instant and diverse notifications for on-call.

Process
- 15 years ago: Manual processes (with very little or no automation); less stringent SLAs; customers more accepting of outages.
- 7 years ago: Improved automation in systems architecture; more stringent SLAs; customers less accepting of outages.
- Today: Heavy reliance on automation due to prevailing system complexity; strict SLAs; little to no tolerance for outages.

Tools
- 15 years ago: Less tooling involved; basic monitoring/alerting solutions in place.
- 7 years ago: Improved operations tooling with IaC; advanced monitoring/alerting with increased automation.
- Today: Heavy operations tooling; specialized tools associated with the observability world.

Problems Adjusting to Modern Incident Management

Now is the ideal time to address the issues that are holding engineering teams back from doing incident management the right way.

Managing Complexity

Service ownership and visibility are the foremost factors preventing engineering teams from making the most of their time during incident triage. This is a result of the adoption of distributed applications, in particular microservices. A sprawling number of services makes it hard to track service health and the respective owners, and tool sprawl (a great number of tools within the tech stack) makes it even more difficult to track dependencies and ownership.

Lack of Automation

Achieving a respectable amount of automation is still a distant dream for most incident response teams. Automating their entire infrastructure stack through incident management will make a great deal of difference in improving MTTA and MTTR.
The tasks that are still manual, with great potential for automation during incident response, are:

- Quickly notifying the on-call team of service outages or service degradation
- Automating incident escalations to more senior or experienced responders and stakeholders
- Providing the appropriate conference bridge for communication and documenting incident notes

Poor Collaboration

Poor collaboration during an incident is a major reason response teams cannot do what they do best. The process of informing members within the team, across teams, within the organization, and outside the organization must be simplified and organized. Activities that can improve with better collaboration are:

- Bringing visibility of service health to team members, internal and external stakeholders, customers, etc., with a status page
- Maintaining a single source of truth in regard to incident impact and incident response
- Doing root cause analysis, postmortems, or incident retrospectives in a blameless way

Lack of Visibility into Service Health

One of the most important (and responsible) activities for the response team is to facilitate complete transparency about incident impact, triage, and resolution to internal and external stakeholders as well as business owners. The problems:

- Absence of a platform, such as a status page, that can keep all stakeholders informed of impact timelines and resolution progress
- Inability to track the health of the dependent upstream/downstream services, and not just the affected service

Now, the timely question to probe is: what should engineering teams start doing? And how can organizations support them in their reliability journey?

What Can Engineering Leaders/Teams Do to Mitigate the Problem?

The facets of incident management today can be broadly classified into three categories:

- On-call alerting
- Incident response (automated and collaborative)
- Effective SRE

Addressing the difficulties and devising appropriate processes and strategies around these categories can help engineering teams improve their incident management by 90%. That certainly sounds ambitious, so let's understand it in more detail.

On-Call Alerting and Routing

On-call is the foundation of a good reliability practice. There are two main aspects to on-call alerting, and they are highlighted below.

a. Centralizing Incident Alerting and Monitoring

The crucial aspect of on-call alerting is the ability to bring all alerts into a single, centralized command center. This is important because a typical tech stack is made up of multiple alerting tools monitoring different services (or parts of the infrastructure), put in place by different users. An ecosystem that can bring such alerts together makes incident management much more organized.

b. On-Call Scheduling and Intelligent Routing

While organized alerting is a great first step, effective incident response is all about having an on-call schedule in place and routing alerts to the concerned on-call responder, and, in case of non-resolution or inaction, escalating to the most appropriate engineer (or user).

Incident Response (Automated and Collaborative)

While on-call scheduling and alert routing are the fundamentals, it is incident response that gives structure to incident management.

a. Alert Noise Reduction and Correlation

Oftentimes, teams get notified of unnecessary events.
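Deduplication and suppression of this noise are normally handled by the alerting platform itself, but the underlying idea is simple. Below is a minimal, illustrative TypeScript sketch; the alert shape, the grouping key (service plus alert name), and the 15-minute window are assumptions chosen for the example, not a prescription:

TypeScript

interface Alert {
  service: string; // e.g., "checkout-api"
  name: string;    // e.g., "HighErrorRate"
  firedAt: number; // epoch milliseconds
  message: string;
}

const DEDUP_WINDOW_MS = 15 * 60 * 1000; // suppress repeats within 15 minutes

// Remembers the last time each (service, alert name) pair paged someone.
const lastNotified = new Map<string, number>();

// Returns true if the alert should page the on-call responder,
// false if it should be folded into the already-open incident.
function shouldNotify(alert: Alert): boolean {
  const key = `${alert.service}:${alert.name}`;
  const previous = lastNotified.get(key);
  if (previous !== undefined && alert.firedAt - previous < DEDUP_WINDOW_MS) {
    return false; // duplicate of a recent alert, so suppress the page
  }
  lastNotified.set(key, alert.firedAt);
  return true;
}

// Example: three alerts for the same symptom produce a single page.
const now = Date.now();
[0, 60_000, 120_000].forEach((offset) => {
  const alert: Alert = {
    service: "checkout-api",
    name: "HighErrorRate",
    firedAt: now + offset,
    message: "5xx rate above 2%",
  };
  console.log(shouldNotify(alert)); // true, false, false
});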
More commonly, during the process of resolution, engineers get notified about similar and related alerts that are better addressed as one collective incident rather than individually. With the right practices in place, incident and alert fatigue can be handled with automation rules for suppressing and deduplicating alerts.

b. Integration and Collaboration

Integrating the tools in the infrastructure stack into the response process is possibly the simplest and easiest way to organize incident response. Collaboration can improve by establishing integrations with:

- ITSM tools for ticket management
- ChatOps tools for communication
- CI/CD tools for deployment and quick rollback

Effective SRE

Engineering reliability into a product requires the entire organization to adopt the SRE mindset and buy into the ideology. While on-call is at one end of the spectrum, SRE (site reliability engineering) can be thought of as being at the other end.

But what exactly is SRE? For starters, SRE should not be confused with what DevOps stands for. While DevOps focuses on principles, SRE emphasizes activities. SRE is fundamentally about taking an engineering approach to systems operations in order to achieve better reliability and performance. It puts a premium on monitoring, tracking bugs, and creating systems and automation that solve problems in the long term. While Google was the birthplace of SRE, many top technology companies, such as LinkedIn, Netflix, Amazon, Apple, and Facebook, have adopted it and benefited greatly from doing so.

POV: Gartner predicts that, by 2027, 75% of enterprises will use SRE practices organization-wide, up from 10% in 2022.

What difference will SRE make? Users today expect nothing but the very best, and an explicit focus on SRE practices will help in:

- Providing a delightful user experience (or customer experience)
- Improving feature velocity
- Providing fast and proactive issue resolution

How does SRE add value to the business? SRE adds a ton of value to any business that is digital-first. Some of the key points:

- It provides an engineering-driven and data-driven approach to improving customer satisfaction
- It enables you to measure toil and save time for strategic tasks
- It leverages automation
- It encourages learning from incident retrospectives
- It promotes communication through status pages

The bottom line is that reliability has evolved: you have to be proactive and preventive, and teams will have to fix things faster and keep getting better at it. On that note, let's look at the different SRE aspects that engineering teams can adopt for better incident management.

a. Automated Response Actions

Automating manual tasks and eliminating toil is one of the fundamental truths on which SRE is built. Be it automating workflows with runbooks or automating response actions, SRE is a big advocate of automation, and response teams benefit widely from having this in place.

b. Transparency

SRE advocates providing complete visibility into the health status of services, which can be achieved through the use of status pages. It also puts a premium on greater transparency and visibility of service ownership within the organization.

c. Blameless Culture

During an incident, SRE stresses blaming the process, not the individuals involved. This blameless culture goes a long way in fostering a healthy team culture and promoting team harmony.
This process of conducting RCAs is known as incident retrospectives or postmortems.

d. SLO and Error Budget Tracking

This is all about using a metric-driven approach to balance reliability and innovation. It encourages the use of SLIs (service level indicators) to keep track of service health. By actively tracking SLIs, teams can keep SLOs and error budgets in check and avoid breaching customer SLAs.
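To make the error budget idea concrete, here is a minimal TypeScript sketch. The 99.9% SLO target, the 30-day window, the request counts, and the 90% release-freeze threshold are illustrative assumptions for this sketch, not figures taken from the article:

TypeScript

// Error budget: the amount of unreliability an SLO allows.
// For an availability SLO, budget = 1 - SLO target.

const SLO_TARGET = 0.999; // 99.9% of requests should succeed
const WINDOW_DAYS = 30;   // rolling 30-day SLO window

interface WindowStats {
  totalRequests: number;
  failedRequests: number;
}

// Fraction of the error budget consumed so far in the window.
function errorBudgetConsumed(stats: WindowStats): number {
  const allowedFailures = stats.totalRequests * (1 - SLO_TARGET);
  return allowedFailures === 0 ? 0 : stats.failedRequests / allowedFailures;
}

// Simple policy: freeze risky releases once most of the budget is gone.
function shouldFreezeReleases(stats: WindowStats, threshold = 0.9): boolean {
  return errorBudgetConsumed(stats) >= threshold;
}

// Example: 10 million requests this window, 7,500 of them failed.
// Allowed failures = 10,000,000 * 0.001 = 10,000, so 75% of the budget is used.
const stats: WindowStats = { totalRequests: 10_000_000, failedRequests: 7_500 };

console.log(`${(errorBudgetConsumed(stats) * 100).toFixed(0)}% of the ${WINDOW_DAYS}-day error budget consumed`);
console.log(shouldFreezeReleases(stats)); // false: still under the 90% freeze threshold

Tracking budget consumption like this is what lets teams trade off release velocity against reliability in an objective, metric-driven way.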
Samir Behara
Senior Cloud Infrastructure Architect,
AWS
Shai Almog
OSS Hacker, Developer Advocate and Entrepreneur,
Codename One
JJ Tang
Co-Founder,
Rootly
Sudip Sengupta
Technical Writer,
Javelynn