A developer's work is never truly finished once a feature or change is deployed. There is always a need for ongoing maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure can handle varying loads, to improving software and data quality, to tackling incident management, quality assurance, and more.
As of now, I have published the first three videos of the course and will publish the fourth tomorrow. I plan to publish two videos per week on YouTube to maximize the impact, but here I'll only blog one lesson per week to avoid oversaturation. I shot about four hours of video content and still haven't finished the second module out of eight. This course will be very detailed and intensive. I'm working on the course platform right now, making good progress, and hope to publish it here soon. It might already be live by the time you read this! It's important to me that the user experience for the course platform is satisfying, so I avoided DRM as much as possible. Hopefully, people will respect that and the effort I'm putting into this. I'm building the course platform on Spring Boot 3 with GraalVM, so it will ideally be lean, fast, and pretty simple. I plan to host the videos on YouTube as unlisted videos to give a decent viewing experience; my reasoning here is low overhead and excellent performance. This means developers could betray my trust and share the unlisted videos. I hope they don't. I would also like to make the course free in a few years; with YouTube, I can do that simply by making the videos public. The following video discusses control flow in debugging. It starts with the basics but can get pretty deep with features like jump to line and force return. These are tools that can significantly change the way we debug code. Don't forget to check out my book and subscribe to the YouTube channel for future videos!

Transcript

Welcome back to the second part of Debugging at Scale, where you can learn the secret tricks of debugging. In this section, we'll discuss the most basic aspect of debugging. We hit a breakpoint. Now what? Well, this is where debuggers let us control the flow to investigate how everything works in a controlled environment. So what's on the agenda for today? We'll discuss stepping over and into code; I hope most of this list is familiar to you. The last two items, where we disrupt the control flow, might not be familiar to you. I'm pretty sure most of you aren't familiar with the last item on the agenda. How do I know? Stay tuned and find out!

Step Over, Into, Out, and Force

Step over is the most basic control flow operation. We let the code in the line execute, and then we can inspect the results in the variable pane. It's simple and easy. In this case, I just pressed the button here a couple of times, but I could also just press F8 to get the same effect… Next, we'll discuss two distinct operations, step into and the related step out. Step into goes into the method we invoke. Notice that if there's no method to go into, step into will act like step over. We have two step-into operations: the regular one and force step into, which normally behaves the same way. We need the force version when we want to step into an API that IntelliJ will normally skip. We can press F7 to step into a method. We can press Shift-F7 to force step into. When we've finished looking at a method and don't care about the rest, we can step out. This executes the rest of the method and returns. Notice that if we have a breakpoint before the return, it would still stop at the breakpoint, as we see in this case. We can press this button here to step out, or we can press Shift-F8 to do the same thing.

Continue and Run To Cursor

Continue proceeds with the execution of the program until a breakpoint is hit again. This is also called resume. It's a simple feature that we use a lot.
You can continue by pressing the special version of the play button here. The shortcut is also helpful since we use it a lot; it's F9. Run to cursor lets us skip lines that are uninteresting and reach the code that matters. We can set a breakpoint at that line to get the same effect, but this is sometimes more convenient as it removes the need to set and unset a breakpoint. We can press this button to run to the cursor, or we can use Alt-F9 as the shortcut for this feature.

Force Return and Throw Exception

This feature is known as force return in IntelliJ IDEA. To see the force return option, we right-click on the stack trace and see a set of options. An even more interesting option is drop frame, which I'll show soon. Notice the throw exception option, which is identical to force return, but it throws an exception from the method. Once I click this option, I'm shown a dialog to enter the return value from the method. This lets me change the value returned from the method, which is very useful when I want to debug a hard-to-reproduce bug. Imagine a case where a failure happens to a customer, but you can't reproduce it. In this case, I can simulate what the customer might be experiencing by returning a different value from the method. Here the value is a boolean variable, so it's simple. But your code might return an object; using this technique, you can replace that object with an arbitrary value. A good example would be null; what if this method returned null? Would it reproduce the problem my user is experiencing? Similarly, throw exception lets us reproduce edge cases, such as throwing an exception due to an arbitrary failure. Once we press OK, we return with a different value. In this case, I was at the end of the method, but I could have done it at the start of the method and skipped the execution of the method entirely. This lets us simulate cases where a method might fail, but we want to mock its behavior. That can make sense if we can't reproduce the behavior seen by the customer. We can simulate it by using tools like this.

Drop Frame

Drop frame is almost as revolutionary, but it's also more of a "neat trick." Here I stepped into a method by mistake. Oops, I didn't want to do that. I wanted to change something before stepping in… Luckily, there's drop frame. We saw that I can reach it in the right-click menu; you can also click here to trigger it. Drop frame effectively drops the stack frame. It's an undo operation. But it isn't exactly that. It can't undo the state changes that occurred within the method we stepped into. So if you stepped into the method and variables that aren't on the stack were changed, they would remain changed. Variables on the stack are those variables that the method declares or accepts as arguments; those will be reset. However, if one of those variables points at an object, then that object resides in the heap, and changes there can't be reset by unwinding the stack. This is still a very useful feature, similar to force return, except that it returns to the current line, not the next line, so it won't return a value. And it gets even better than that!

Jump to Line

Jump to Line is a secret feature in IntelliJ. It works, but developers don't know about it. You need to install the Jump to Line plugin to use it. Since it has a relatively low install count, I assume people just don't know it exists, because this is a must-have plugin. It will change the way you debug!
With Jump to Line, we can move the current instruction pointer to a different position in the method. We can drag the arrow on the left to bring execution to a new location. Notice that this works in both directions; I can move the current execution back and forth. This doesn't execute the code in between; it literally moves the current instruction to a new position. It's so cool, and I have to show it again… If you see a bug, just drag the execution back and reproduce it. You can change variable values and reproduce the issue over and over until you fully understand it. We can skip over code that's failing, etc. This is spectacular. I don't need to recompile the code. If you've ever accidentally stepped over a line ("oops, that's not what I wanted"), then stopped the debugger and started from scratch, this is the plugin for you. It has happened to everyone! We can just drag the execution back and have a do-over. It's absolutely fantastic!

Finally

In the next video, we will briefly discuss the watch. We will dig much deeper into it in the sixth video in the series. So stay tuned! If you have any questions, please use the comments section. Thank you!
Today I want to write about why comments are so important in software development and how you can improve them. For me, the most disappointing thing about otherwise good source code is when it is insufficiently commented. This happens with closed source code in the same way as with open source code. In the worst case, and this is usually the most common, the source code is not commented at all. Especially in open-source projects, this shows disdain for the community reading your code. But even in internal software projects, I consider missing comments to be bad behavior towards your colleagues, which should be avoided. Of course, we all know the funny answers in this context: "Good code doesn't need any comments" or "Take a look into the source code if you don't understand something." Is it laziness or ignorance not to write comments? No, I believe that many developers simply do not know how to write good comments. For this reason, I would like to explain a very simple rule that can really improve your source code through comments.

Don't Explain to Me the What and How!

The biggest mistake you can make when writing comments is assuming you have to explain what you are doing. In this case, you write comments that explain the obvious.

Java
/**
 * Builder Class to build Events
 */
public class EventBuilder {
    ....
}

As long as the name of a class or method is not absolutely meaningless, the name should explain the usage of the class or method. So please do not explain that a Builder Class builds an object or a getter method returns an object. Writing such a comment is a waste of time. By the way, the same applies to those who have to read this kind of useless comment. The second example of bad comments is explaining how the code works.

Java
/**
 * Build the event and return it.
 */
public T build() {
    ....
    return event;
}

Of course, you shouldn't assume that the users reading your code are complete idiots. You can assume that other developers can code just as well as you do. And explaining software patterns in the comment of a method or class is pointless. So now we have seen what bad comments are. But what does a good comment look like?

Start With Why

There is a great book by Simon Sinek with the title "Start with Why: How Great Leaders Inspire Everyone to Take Action". The core message of this book is that you should always start by explaining the "why". The "what" and "how" usually explain things in more detail later. In software development, the "what" and "how" are indeed in the code and, in many cases, need no comments. So starting with the "why" can make source code comments much more useful – for yourself and for others. Starting with "why" makes you think about what you're actually doing. So, going back to my previous examples, a comment may look like this:

Java
/**
 * When using Events in the client, the EventBuilder may help
 * to construct them in the correct way more easily.
 */
public class EventBuilder {
    ....
}

Instead of describing what the class is, answer the question of why you have invented it. The same is true for methods. As explained before, if I want to understand the "how," then I can read the source code in detail. But why does the class or method exist? This is what a comment is for:

Java
/**
 * To display the summary of an invoice in the client,
 * this method calculates the summary including all taxes.
 */
public float summarizeInvoice() {
    ....
    return summary;
}

I don't want to go into more detail.
It's obvious that the better you explain why you have written a class or method, the more it helps others, and your future self, to understand your code. So when you start writing your next piece of code, think about why you're doing it and put that into your class or method header comment. You'll see – your colleagues and your community will love it and will much prefer using your code.
Shifting left has brought massive benefits for testing, security and DevOps, but not for tech debt and issue tracking. Yet, technical debt is one of the main reasons development teams struggle or fail to deliver. Let's face it – developers don't want to work through an endless backlog of tech debt. It just sucks. Tech debt hampers productivity, which impacts morale. There's nothing worse than slowly grinding to a halt as a team. In this article, we'll answer the following questions:

- What is shifting left?
- What does it mean to shift technical debt left?
- What are the effects of not shifting technical debt left?
- How can you go about shifting technical debt left?

What Is Shifting Left?

In general, when we say we're "shifting left", we're talking about moving tasks earlier in a process's timeline. By doing this, we lay foundations for better code quality in the future, and avoid the even greater effort of fixing things later. The concept arose back when software development was mapped out on boards with requirements on the left-hand side, followed by design, coding, testing and finally delivery. But testing late in the development process meant finding out about problems late, too. That meant delays, costs, compromised products and unhappy people. As products or features become more concrete, it becomes harder and more time-consuming to fix them. So they moved software testing earlier in the development cycle. They shifted it left. The practice helps teams be more proactive rather than reactive. "Shifting left" has already transformed a range of disciplines, like…

- DevOps (e.g. devs deploy by making a docker image and pulling it into a hosting platform. Shift it left by setting up one-click deploy.)
- IT Service Management (e.g. customers reset their passwords by submitting a helpdesk request. Shift it left by building a self-serve password reset tool.)
- Security (e.g. a team manually reviews software components for security risks. Shift it left by bringing in automatic Software Composition Analysis tools.)
- A11y (e.g. a11y features like ARIA roles and attributes are manually reviewed. Shift it left by bringing in an a11y linter.)

So, What Does Shifting Left Mean in the Context of Technical Debt?

Tech teams are already shifting plenty of stuff left. We can also benefit from shifting technical debt left. Now, technical debt is an inevitable part of the software development process. It has the potential to harm or help, just like financial debt. Technical debt accumulates when engineers take shortcuts in order to ship faster, and often, that's necessary. Engineers and team leaders must manage technical debt effectively to ensure the debt is repaid with minimum impact. When we shift technical debt left, we're building in tools and better processes to help us track tech debt. This enables us to effectively measure it, prioritise it and ultimately fix it further down the line.

What Happens When We Don't Shift Tech Debt Left?

In every sprint, we face the tradeoff between shipping new features and paying off some tech debt. But choosing always to ship new features is unsustainable. The way the end game plays out is that we eventually become technically bankrupt. That means rewriting huge chunks of code, which can take months. But tech debt causes meta-problems that extend well beyond the engineering team. By reducing quality and increasing time to market, we're impacting the features and deadlines that marketers and salespeople need in order to hit their goals.
Performance issues and downtime increase workloads for Customer Support, Success and Account Management. And of course, no engineer wants to spend a tonne of time dealing with tech debt. They don't improve at their job. They get bored. They leave. That's rubbish for them, for the team and also for HR – it costs a bomb and can take forever* to hire new engineers.

*Well, it feels like it, anyway.

Find out more about how tech debt impacts everybody here.

How Do We Shift Tech Debt Left?

To recap, when we shift tech debt left, we're moving tasks associated with accumulating and managing tech debt earlier in the software development life cycle. Dealing with tech debt on an ongoing basis and at the earliest possible point helps push quality upstream and avoid unruly tech debt. The key practice involved with shifting tech debt left is creating a robust system to track tech debt. This lays the groundwork for far more straightforward prioritisation and fixing of tech debt further down the line.

Shifting Technical Debt Left Starts With Tracking Tech Debt Properly

Tech teams typically struggle with tracking tech debt effectively. This manifests itself in a few ways.

1. Issues don't get tracked. Pretty much every developer hates the usual PM tools for tracking issues, like Jira. These tools force them to context switch constantly, breaking their flow and serving more as a distraction that keeps them from their code.
2. Tracked issues lack detail. Flicking backwards and forwards between the issue tracker and the IDE is annoying, which makes it unlikely we're going to properly signpost all the relevant lines and files of code.
3. Issues get forgotten about. That's right – the graveyard of tickets gathering dust in the backlog. Most teams have one!

These problems all compound each other. Vague tickets make reviewing the backlog pointless. Issues don't get tracked because the developer doesn't think anybody will ever bother to look at them. So what can we do? Well, to make tracking tech debt a continuous habit, we need to make it easy for tech teams to:

- Report and log problems.
- Easily add detail to them.
- See codebase problems.

Our goal is to ensure high-quality tickets get created. We can do this by…

- Minimising context switching – this can be achieved through extensions which live in the code editor itself.
- Making it easy to link specific lines of code and files to issues.
- Using a tool to make issues visible from the code editor.

Tools like Stepsize's VSCode and JetBrains extensions help engineers track technical debt directly from the editor and add codebase issues to your sprint. This method allows you to link issues to code, get visibility on your tech debt, and collaborate with your teammates on maintenance issues without leaving your editor and switching contexts.

Next Steps: Prioritizing and Fixing Tech Debt

Now that issues are being tracked on a regular basis, we can have a meaningful strategy to prioritise and fix issues once they've been logged. Now we've shifted our tech debt left, our next steps are to:

- Create a space for your team to discuss your codebase issues.
- Integrate your tech debt work into your existing workflow.

It's crucial to allocate a sensible tech debt budget regularly – for example, 20% of each sprint or regular refactoring sprints – so that engineers can pay back tech debt. Getting this right is a win-win situation. You'll avoid technical bankruptcy and ship features faster. We've written about the three best tactics for fighting and preventing technical debt here.
The Bottom Line

The only way to fix your technical debt problem is to start shifting it left. Tracking needs to be a continuous process that happens as early as possible in the software development lifecycle. It's already happened in places from security to a11y, and it's time we shift our focus to tech debt. When it isn't managed well, tech debt can have a widespread and serious impact on codebase health, team morale and ultimately business prosperity. We can avoid this by adopting three core practices to shift tech debt left. These are to track tech debt issues efficiently, prioritise and fix tech debt routinely, and use metrics to detect and shrink tech debt. Shifting tech debt left needs to be the next frontier of making software development efficient, reliable and enjoyable.
Precise endeavors must be done to exact standards in clean environments. Surgeons scrub in, rocket scientists work in clean rooms, and data scientists…well, we try our best. We've all heard the platitude, "garbage in, garbage out," so we spend most of our time doing the most tedious part of the job: data cleaning. Unfortunately, no matter how hard we scrub, poor data quality is often too pervasive and invasive for a quick shower. Our research across the data stacks of more than 150 organizations shows an average of 70 impactful data incidents a year for every 1,000 tables in an environment. These incidents invade exploratory data analysis; they invade model training and validation; and they invade the model's inference data post-deployment, creating drift. Model accuracy doesn't start or end with data cleaning in your notebook with the few tables you use to inform, train, and validate your model. It starts with the ETL pipeline and the instant you choose what to measure to solve your problem. Let's walk through a semi-hypothetical scenario that contains real examples we've seen in the wild to highlight some common failure points. We'll then discuss how they can be avoided with an organizational commitment to high-quality data.

Imagine This

You're a data scientist with a swagger working on a predictive model to optimize a fast-growing company's digital marketing spend. After diligent data exploration, you import a few datasets into your Python notebook.

Exploratory Data Analysis

Because your company is dashboard-crazy, and it's easier than ever for the data engineering team to pipe in data to accommodate ad-hoc requests, the discovery was challenging. The data warehouse is a mess and devoid of semantic meaning. Without clear data lineage, you wasted time merging and cleaning data after failing to notice a downstream table with far more data sources already combined. That stung almost as much as when you noticed you almost left out a key dataset, but you console yourself that even the greats make those mistakes from time to time.

Model Design

You see the LinkedIn ad click data has 0.1% NULLs, so you impute the missing values to the median of the feature column. That is neat and tidy in your notebook, but following model deployment, the integration between LinkedIn and the marketing automation platform wasn't reauthorized. The NULLs in the production dataset have now jumped to 90%, causing this imputation to be much more frequent and based on a smaller, less accurate sample. Your model also uses data inferred by another machine learning model for ad spend optimization that a former colleague built. Unfortunately, they built the model on thousands of temporary tables before leaving the company. It's broken, on autopilot, and losing the company millions, but you don't know that.

Not re-authorizing an expiring connection between data sources is a very common cause of data downtime. Image courtesy of Marketo.

Model Training and Validation

You carefully separate out your holdout set to avoid contamination and ensure the data you use to validate your model will not overlap with the training data. Unbeknownst to you, the training data contains a table of aggregated website visitor data with columns that haven't been updated in a month. It turns out the marketing operations team upgraded to Google Analytics 4 to get ahead of the July 2023 deadline, which changed the data schema. This caused the automated ETL pipeline to spin up a completely new table, breaking the dependencies of the aggregated table.
As a result, your training set doesn't contain the last month of data, which included statistically significant shifts in browsing behavior and buying patterns as a result of the changing macroeconomic environment. In both digital marketing and data, the only constant is change: in this case, a changing schema messing with the pipelines on which your model depends.

Model Deployment

Your model is deployed and silently suffering from significant drift. Facebook changed its data delivery to every 12 hours instead of every 24. Your team's ETLs were set to pick up data only once per day, so this meant that suddenly half of the campaign data that was being sent wasn't getting processed or passed downstream, skewing your new user metrics away from "paid" and toward "organic." Because your model was constantly training on new data, this shift in classes degraded your model's performance as it began to overfit for organic user acquisition. Since this happened after the data cleaning and after the model was built, you were unaware of this imbalance that needed to be corrected.

Model Review

All of these issues mean your predictive model had no impact on digital advertising performance. You've now lost the trust of the marketing team and senior executives. After all, they were skeptical in the first place. How could they trust a predictive black box when they see errors in their weekly reports and their dashboards crash twice a month? Justifying additional headcount and investment in your team has now become much more difficult, even though the model's failure was not your fault. Does anything in this story sound familiar? While this specific tale might be a fabrication, stories like the one you just read are all too common among modern data teams. So, what can be done to avoid outcomes like this? Let's take a look at how a commitment to data quality might help our data scientist yield better results.

Data Cleaning for the Modern Data Stack

A data scientist cannot and should not be responsible for cleaning every table in the data warehouse on a continuous basis. However, we do need to partner with our data engineering colleagues to create an environment fit for data science. Just like a chef understands the supply chain for her ingredients, we should understand the supply chain of our data. Every organization has a different mix of data sources, and each one runs its pipelines slightly differently. Some pipe everything into a central data warehouse or lake, while others operate separate environments for raw and prepared data with a layer of analytical engineers in between. Most can be better at clearing out legacy datasets.

Issues in the data supply chain can create headaches for data scientists. Image courtesy of Chad Sanderson.

The destination of the data and its organization matter, as they impact your exploratory data analysis. However, the journey the data takes matters just as much, as each path introduces a different set of risks for your models. In addition to thinking at the dataset level with the traditional six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, and uniqueness), it's time to start thinking at the pipeline level around data freshness, volume, schema, and distribution anomalies. You can do this by building your own anomaly detectors (here's a three-part series that shows how) or by leveraging a commercial data observability solution.
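If you go the homegrown route, a first pass doesn't need to be sophisticated: compare each new batch's row count against a trailing baseline and flag large deviations. The sketch below is illustrative only; it assumes a hypothetical daily row-count feed and a simple z-score threshold, whereas production detectors also need to handle seasonality, freshness, schema, and distribution checks.

TypeScript
// Minimal volume-anomaly check: flag a day whose row count deviates sharply
// from the trailing mean. The feed and field names are hypothetical.
interface DailyRowCount {
  tableName: string;
  date: string; // e.g. "2023-01-15"
  rowCount: number;
}

function isVolumeAnomalous(
  history: DailyRowCount[], // trailing window for one table, oldest first
  latest: DailyRowCount,
  zThreshold = 3
): boolean {
  const counts = history.map((d) => d.rowCount);
  const mean = counts.reduce((sum, c) => sum + c, 0) / counts.length;
  const variance =
    counts.reduce((sum, c) => sum + (c - mean) ** 2, 0) / counts.length;
  const stdDev = Math.sqrt(variance);

  // A perfectly flat history would divide by zero; treat any change as anomalous.
  if (stdDev === 0) return latest.rowCount !== mean;

  return Math.abs(latest.rowCount - mean) / stdDev > zThreshold;
}

A check like this, run after each pipeline execution and wired to an alert, could have surfaced the halved Facebook volume described above well before the model drifted.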
The advantage of a commercial solution is that instead of constantly updating and adjusting your custom monitor thresholds, you can count on a baseline of coverage across all data assets the minute they're added, while also being able to add custom monitoring rules whenever necessary. Monitoring all your production pipelines and tables will not only make your job easier — it will also make your models more accurate. For example, by monitoring the distributions of values in the data, you can quickly see the typical ranges for a dataset rather than having to manually conduct multiple explorations and analyses to answer questions like what the historical percentage of unique values has been. These monitors could also alert you to sudden changes in the proportion of each data class, as we saw in the Facebook example. Monitoring for anomalies in the volume of data being ingested by your pipelines can ensure your model is always ingesting the minimum number of samples it needs to predict outcomes with high accuracy. Data observability solutions also include data lineage and light cataloging features that can help during the discovery and exploratory data analysis process by surfacing relationships between objects and identifying related incidents.

Data lineage within a data observability platform showing table dependencies.

With a clean data environment, you can return your focus to creating precise, trusted ML models that drive business value. This article was co-written with Ryan Kearns, data scientist at Monte Carlo, and Elor Arieli, Ph.D., data science lead at Monte Carlo.
Render is a one-stop shop for web application infrastructure, simplifying the hosting of static content, databases, microservices, and more. Whether you're maintaining a monolith application or have adopted a microservice pattern — and you're not yet ready to create an entire SRE team — Render is a good choice for your infrastructure and hosting needs. In this article, we'll focus on debugging in Render. We'll build a simple Node.js microservice with a PostgreSQL database. We'll deploy these two components with Render and then demonstrate two different ways to debug our service and database. We'll use Log Streams for accessing syslogs and Datadog for observability metrics and alerts. Are you ready? Let's dive in.

Hosting a Node.js App With Render

The intent of this article is not to do a deep dive into building a Node.js microservice application. However, if you have some experience with Node.js, TypeScript, Express, and PostgreSQL, you'll likely find additional value in the code presented here. Our mini-demo application is a user management API with several typical endpoints for retrieving, creating, updating, or deleting users. Behind the scenes, the service performs query operations on the PostgreSQL database. You can find the complete application code here.

Set Up a Datadog Account

Before we can set up our PostgreSQL instance, we'll need to get our hands on a Datadog API key. You can sign up for a free trial account and then follow these instructions to obtain an API key. We'll discuss Datadog in more detail soon.

Deploy a PostgreSQL Instance

We'll start with the PostgreSQL instance since we'll need database credentials before we can use our Users service. To create an instance for your project, log in to Render and click the New + button in the top right corner. Then, click on PostgreSQL to start the process. This drops you into a form to configure your database: specify a name for your instance, along with a database name and user name for access. For the "Datadog API Key" field, it's important to add the key now because it's difficult to add it after the database has been created. Paste in the API key that you generated from the previous step. Click Create Database and wait for the initialization to complete. Render will display the credentials for your new database. You'll need those later when setting up your Users service.

Initialize Database Schema

Note also that we'll need to initialize the schema in this new database, creating a users table with the proper columns. We'll do this via the node-pg-migrate package. The instructions for initializing the database are in the code repository's README file. However, we'll bundle this step into our application's startup command on Render.

Deploy Users Microservice

Deploying a microservice with Render is straightforward, especially if you host your code in GitLab or GitHub. After you connect your code hosting account to Render, return to your Render dashboard. Click on the New + button and select Web Service. Select the code repository with your microservice. Each input field is well-documented on this page. For our example Node.js service, we want to ensure the following:

- The "Environment" should be set to Node.
- The "Build Command" should be yarn install && yarn build so that all the dependencies are installed and the final system is built.
- The "Start Command" (not shown) for this particular project will run our database migration and start up our application. It should be: yarn migrate up --no-reject-unauthorized && yarn start:prod.
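Because the start command runs yarn migrate up before starting the app, every deploy applies any pending migrations from the repository before serving traffic. As a rough sketch of what such a migration looks like (illustrative only; the real schema ships with the example code), a minimal node-pg-migrate migration for the users table could be:

TypeScript
import { MigrationBuilder } from 'node-pg-migrate';

// Illustrative migration: creates a minimal users table.
// The actual columns in the demo repository may differ.
export async function up(pgm: MigrationBuilder): Promise<void> {
  pgm.createTable('users', {
    id: 'id', // shorthand for an auto-incrementing primary key
    email: { type: 'varchar(255)', notNull: true, unique: true },
    first_name: { type: 'varchar(100)', notNull: true },
    last_name: { type: 'varchar(100)', notNull: true },
    created_at: {
      type: 'timestamp',
      notNull: true,
      default: pgm.func('current_timestamp'),
    },
  });
}

export async function down(pgm: MigrationBuilder): Promise<void> {
  pgm.dropTable('users');
}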
In addition, the "Advanced" menu contains additional fields that we'll need to fill in for this project. We need to set five environment variables for database connectivity:

- DB_HOST
- DB_NAME
- DB_USER
- DB_PASSWORD
- DATABASE_URL — This is the database connection string used by node-pg-migrate. Its value is postgres://DB_USER:DB_PASSWORD@DB_HOST:5432/DB_NAME, with the actual values for these placeholders filled in. For example: postgres://john:Ue473C@dpg-1g50-a.oregon-postgres.render.com:5432/mydb

These credentials should come from the PostgreSQL instance you created in the previous step. Lastly, make sure to set the "Health Check Path" to /health to facilitate zero-downtime deploys and health monitoring within Render. Click Create Web Service. Render will work its magic and deploy this service from your repository!

Debugging With Datadog

Datadog is a cloud-based SaaS that provides extensive log aggregation and searching tools as well as monitoring and alerting capabilities. Whether you have a single service or ten thousand, Datadog is a useful tool for gaining a deeper understanding of how your resources are running. We've already connected our PostgreSQL instance to Datadog by including the Datadog API key during its creation. This provides monitoring capabilities for many different metrics on this database instance. The full list of supported metrics for PostgreSQL is available here. Let's look at how to use these metrics for monitoring and alerting.

Exploring Metrics

Log into your Datadog account and navigate to the Metrics page. The top line of the Metrics Explorer shows the metric query toolbar with various dropdown and input boxes for filtering what metrics are displayed in the graph. The default metric, system.cpu.user, is shown. If your database was configured correctly with the API key, then you should see a line graph displaying the percentage of time the CPU spent running user space processes. When you click on the "from" dropdown menu, you'll see two potential sources for your database's metrics: metrics can be shown based on a specific database within a host, or across the entire host. For our example, there is only one database instance running on the host. Therefore, with this particular metric, there's no difference between the database-specific and the host-specific metric. If you want to investigate a different metric, select the first input box and enter system.mem. The auto-complete will show different memory-related metrics. If you select system.mem.used, you'll start to see the database host's memory that is in use. Notice that the dialog provides a description of the selected metric. The PostgreSQL-specific metrics that are available in Datadog are also interesting. For example, we can monitor the number of active connections to the database at any time. In this query, the "from" field uses the single-database instance, and Datadog sums the data based on the database-id property on the metric. Monitoring is a useful debugging tool. If you're experiencing slow database response with your application, you can view the current metrics in Datadog. By correlating PostgreSQL metrics like query times or connection counts with application response times, you might surface a poorly written query associated with a specific endpoint or specific code that is improperly cleaning up database connections. Access to these metrics may provide the key to solving an issue. Of course, resolving an issue usually depends on your awareness that the issue is occurring.
Fortunately, Datadog provides significant resources for alerting you when your systems start performing poorly. You can configure these alerts to monitor the metrics that you choose, pushing notifications to various places. Let's talk about alerts as a tool for debugging web applications.

Alerts in Datadog

Select Monitors from the Datadog navigation bar, and then select New Monitor and choose the Metric monitor type. We'll create a monitor to alert us when the number of active database connections exceeds a threshold. Select Threshold Alert and then define the metric query just as you would on the metrics page. We want to use the postgres.connections metric from our single database instance. Lastly, we select the min by threshold. If the "minimum number of connections" is above a threshold, this indicates that there are currently too many connections on this database instance. That's when we should be alerted. Next, define the threshold values that trigger an alert. This is a contrived example, but it means that Datadog will trigger a warning if it detects 12 connections to the database over the past five minutes. Meanwhile, we'll receive a full alert on 20 connections to the database over the past five minutes. Finally, we set up the notifications to occur when an alert is triggered. For example, Datadog can send notifications to Jira, Slack, or a custom webhook endpoint. Setting up notifications is beyond the scope of this article, but the Datadog documentation for notifications is clear. Alerts are important for debugging a web application, as they raise awareness that a problem is happening, pointing you to the relevant metrics describing the current state of the system. The metrics we've looked at so far are mostly of the "outside the system looking in" variety. Next, let's look at how to use logs to understand what is happening inside of the system itself.

Debugging With Log Streams

In Node.js development, using console.log is a common debugging technique. Seeing log output is easy when you've deployed your Node.js application to your local machine, but it's challenging if you've deployed it to the cloud. This is where Log Streams help. Render Log Streams enable the exporting of syslogs from your deployed services to an external log aggregator. Datadog is one example, but Render supports others too, including Papertrail and LogTail.

Set Up syslog Export

To set up syslog export, log into your Render account and go to your Account Settings page. Click on Log Streams. Add the endpoint for your log aggregator, along with an API token (if any). For our example, we'll send our syslogs to Datadog. In Datadog, navigate to the Logs tab. This brings you to the Log Search dashboard where you can see a live tail of all the logs being ingested by your Datadog account. This includes logs emitted by our database instance, our Node.js microservice, and the Render build process for our microservice. To narrow down the logs, we can add the hostname to the query search bar. The result will show only those logs coming from our Users microservice instance (both the application logs and the Render build logs). Logs emitted from the application via console.log or a logger such as Pino will show up here, aiding in the debugging effort. You can trace entire requests through your systems by following them through your various services' logs.
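To make that concrete, here's roughly the kind of logging a route handler in the Users service could emit at different severities. This is a minimal sketch; the route, query, and messages are illustrative rather than taken from the demo repository.

TypeScript
import express, { Request, Response } from 'express';
import { Pool } from 'pg';

const app = express();
// DATABASE_URL is the same connection string we configured as an environment variable.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/users/:id', async (req: Request, res: Response) => {
  const userId = req.params.id;
  console.info('GET /users/:id called', { userId }); // info-level breadcrumb

  try {
    const result = await pool.query('SELECT * FROM users WHERE id = $1', [userId]);
    if (result.rowCount === 0) {
      console.warn('User not found', { userId }); // expected, but worth noting
      return res.status(404).json({ error: 'Not found' });
    }
    return res.json(result.rows[0]);
  } catch (err) {
    // Error-level log with context, so it's easy to find in the log aggregator.
    console.error('Failed to fetch user', { userId, error: (err as Error).message });
    return res.status(500).json({ error: 'Internal error' });
  }
});

app.listen(Number(process.env.PORT) || 3000);

Because each line carries the user ID and an explicit severity, it's easy to filter for these messages in Datadog's Log Search and follow a single request across services.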
As data flows through your system from your API endpoints to the service layer and then to the database layer, you can use logs to track down the root cause of issues.

Leveraging Logs for Debugging

How might you use logs from your Node.js application to help you with debugging? Here are a few helpful suggestions:

- Emit a log when catching an exception. Using try-catch is good practice because it ensures graceful handling when an error is thrown. In the catch block, you can do more than just handle the exception; emit a log message with helpful contextual information. Then, query for these kinds of messages in your log management system to track down the root cause.
- Use other severities besides log. When using console, you're not limited to just using console.log. You can also use functions like console.info, console.debug, and console.error. Doing so will help you differentiate the severity of what you're logging, making it easier to query or filter for specific issues.
- Set up alerts on certain log conditions. Most log management systems have an alerting mechanism, allowing you to set conditions on incoming logs that trigger a notification. For example, if your Node.js application logs every HTTP response code (such as 200, 401, and 500), then you can set up an alert to notify you when a certain response code is seen too frequently. You could set up an alert to tell you when a 500 or a 401 code has been seen more than five times in ten minutes, indicating a possible issue that needs to be dealt with immediately.

Conclusion

In our mini-project, we built a small microservice application using a simple Node.js backend service and a PostgreSQL database. We deployed both pieces to Render so that we could explore system debugging through Render. We looked at two different ways of debugging:

- Sending observability metrics to Datadog, working in conjunction with alerts and notifications.
- Using Log Streams to export syslogs from your different application components, which we also viewed in Datadog.

When issues with your microservice application arise, your team needs to respond quickly to reduce or prevent downtime. Effective debugging techniques coupled with alerts and notifications can equip you for fast response. If you're deploying your application to Render, then you have quick and simple facilities — a Datadog integration and Log Streams — to provide real-time debugging insights and alerts.
In the world of reliability engineering, folks talk frequently about "incident response teams." But they rarely explain what, exactly, an incident response team looks like, how it's structured, or which roles organizations should define for incident response. That's a problem because your incident response team is only as effective as the roles that go into it. Without the right structure and responsibilities, you risk leaving gaps in your incident response plan that could undercut your team's ability to respond quickly and efficiently to all aspects of an incident. This article explains how to define incident response roles in order to build a team that works as effectively and efficiently as possible.

What Is an Incident Response Team?

Before defining incident response roles, let's take a look at what an incident response team does collectively. An incident response team is a group of personnel who respond to incidents that disrupt IT resources (and that also, by extension, disrupt the business). Designating staff to be part of an incident response team is important because you don't want to waste time in the midst of an incident trying to decide who needs to do what to handle the problem. By creating an incident response team ahead of time, you have a group of experts who are prepared to respond quickly to issues whenever they arise.

Structuring Incident Response Roles

Of course, the exact nature of incident response teams varies from one organization to the next. So do the titles given to different roles within incident response. But in general, most incident response teams include the following core roles.

Incident Commander

The incident commander or incident manager is basically the executive in charge of incident response (although he or she need not be an actual executive at your company). The person in this role is the lead decision-maker and is responsible for overseeing the rest of the incident response team. Incident managers often come from a technical background, but having people skills and management experience is just as important for this role as technical expertise.

Technical Lead

Technical leads are in charge of managing the technical dimensions of incident response. Their main responsibility is determining what went wrong, devising a remediation strategy, and implementing the fix as efficiently as possible. To do this work well, technical leads should interface with other incident response roles to ensure that technical problem-solving aligns with other priorities, like minimizing disruptions to customers and protecting the business's brand.

Subject Matter Experts

Subject matter experts, who are overseen by the team's technical lead, provide the technical expertise and labor required to work through an incident. Depending on which types of systems you are supporting, you may need a variety of subject matter experts who are prepared to respond to different types of issues. For example, you may want to define one subject matter expert role for a networking engineer, another for a storage or database engineer, and another for a software engineer. Each of these areas of expertise may be necessary when responding to an incident.

Customer Relations Lead

The customer relations lead is the role in charge of managing the customer-impacting aspects of an incident. This person is responsible for determining how an incident affects customers and helping the rest of the team to proceed in a way that results in the best possible customer experience.
Communications Lead

The communications lead oversees communications with the public about an incident. This person will typically come from a PR background, but an ability to understand how IT systems impact business operations and branding is critical, too. Note that the communications lead doesn't manage communications within the incident response team itself. That's a job that is typically overseen by the assistant incident manager, with help from the scribe.

Scribe

Documentation is always important, and incident response is no exception. The scribe role addresses this requirement by recording information about incident response processes as they unfold. They can also help in generating postmortem reports.

Other Potential Incident Response Roles

Beyond the core incident response roles defined above, you may want to consider adding some other roles to your team, depending on your priorities and the type of business you support.

Social Media Lead

Although the communications lead will oversee public communications about an incident, companies with a large social media presence may benefit from assigning someone to oversee social media communications specifically during incident response.

Partner Lead

If your company depends extensively on relationships with partners, consider creating an incident response role that will take the lead in communicating with partners during the incident and help to minimize the impact of the incident on partner relationships.

Security Lead

Although security should be everyone's responsibility, it can be easy in the midst of a hectic incident response process to make mistakes or fail to follow best practices regarding security. Designating a security lead for your team helps to avoid these risks by ensuring that there is someone whose main goal is to enforce security during all stages of the response.

Legal or Compliance Lead

Not all incidents have repercussions for compliance or could lead to legal issues, but some do. Consider creating a legal or compliance lead role to help the team manage these aspects of incident response.

Conclusion: The Best Incident Response Team Is a Flexible Team

Some parting advice: There are many incident response roles you can define, but not every company needs every role. Your main priority when defining roles should be to create a team that is agile and flexible, while also covering all of the areas of expertise that the team is likely to need. After all, incidents by definition are problems that you don't predict ahead of time, and defining an agile set of roles is the best means of preparing for whichever incidents may come your way.
A few years ago, when working as a software developer building and maintaining internal platform components for a cloud company, I deleted an application from production as part of a deprecation. I had double and triple-checked references and done my due diligence communicating with the company. Within minutes, though, our alerting and monitoring systems began to flood our Slack channels in a deluge of signals telling me something wasn't working. The timing was pretty clear; I had broken production. In medical dramas, the moment when things are about to go wrong is unmistakable. Sounds are muffled. High-pitched, prolonged beeps take over your ears. Vision blurs. When alarms sound, or danger is near, something takes over within you. Blood drains from your head, heat rises in your body, and your hands sweat as you begin to process the situation. Sometimes you confront the issue, sometimes you try to get as far away as possible, and sometimes, you just freeze. In my case, with red dashboards and a sudden influx of noise, I had turned into the surgical intern holding a scalpel for the first time over a critical patient, with no idea what to do. Incidents are scary, but they don't have to be. Doctors and surgeons undergo years of training to maintain their composure when approaching high pressure, highly complex, and high stakes problems. They have a wealth of experience to draw from in the form of their attendings and peers. They have established priorities and mental checklists to help them address the most pressing matters first: stop the bleeding and then fix the damage. As cloud software becomes part of the critical path of our lives, incident response practices at an individual and organizational level are becoming formalized disciplines, as evidenced by the growth of site reliability engineering. As we grow collectively more experienced, incident response becomes less of an unfamiliar, stressful, or overwhelming experience and more like something you've trained and prepared for. As I've responded to incidents over my career, I've collected a few heuristics that have helped me turn my fight, flight, or freeze response into a reliable incident response practice:

- Understand what hurts for your users
- Be kind to yourself and others
- Information is key
- Focus on sustainable response
- Stop the bleeding
- Apply fixes one at a time
- Know your basics

Understand What Hurts for Your Users

In times of emergency, you learn what is truly important. When creating dashboards and alerts, it's very easy to alert on just about anything. Who doesn't want to know when their systems are not working as expected? This is a trick question. You want to know when people can't use your system as expected. When all your alerts are going off, and all your graphs are red, it becomes important to distinguish between signal and noise. The signal you want to prioritize is user impact. Ask yourself: What pain are users experiencing? How widespread is the issue? What is the business impact? Revenue loss? Data loss? Are we in violation of our Service Level Agreements (SLAs)? At the beginning of an incident, make a best effort estimate of impact based on the data that you have at hand. You can continue to assess the impact throughout the response time and even after an incident is resolved. If you're working in an organization with a mature incident response practice, questions of impact will be codified and easy to answer. Otherwise, you can rely on metrics, logs, support tickets, or product data to make an educated guess.
Use your impact estimate to determine the appropriate response, and share your reasoning with other responders. Some companies have defined severity scales that dictate who should get paged and when. If not, keep in mind that the greater the impact, the more extreme the response should be. In the emergency room, the more severe and life-threatening your ailment, the faster you will be treated. A large wound will require stitches, while a scratch will require a band-aid. When I broke production by deleting a deprecated application, the first things I turned to were our customer-facing API response rates. The user error rate was over 45%; a user would encounter an error every other time they tried to do something in our product. Our users couldn't view their accounts, pay their invoices, or do anything at all. With this information, I looked at our published severity scale and determined this required an immediate response, even though we were outside of business hours. I worked with our cloud operations team to get an incident channel created and the response initiated.

Be Kind To Yourself and Others

In today's world of complex distributed systems and unreliable networks, it is inevitable that one day, you will break production. It's often seen as a rite of passage when you join new teams. It can happen to even the most seasoned of engineers; a very senior engineer I once worked with broke our whole application because they had forgotten to copy and paste a closing tag in some HTML. When it does happen, be kind to yourself and remember that you are not to blame. We work in complex systems that are not strictly technological. They are surrounded by humans and human processes that intersect them in messy ways. Outages are just the culmination of small mistakes that, in isolation, are not a big deal. In responding to and learning from incidents, we are figuring out how to make our systems better and improve the processes that surround them. In the moment of the incident and in the postmortem, a process for learning from incidents, strive to remain "blameless." To act blamelessly means that we assume responders made the best decision possible with the available information at the time. When something goes wrong, it is easy to point fingers. After the fact, it's easy to identify more optimal decisions when you have all the information and have had time to analyze things outside of the heat of the moment. If we point fingers and assume we could have made better decisions, we close the opportunity to evaluate our weaknesses with candor. When blame is spread, discourse stops. Incident postmortems are discussions intended to understand how an incident happened, how to prevent it in the future, and how to respond better in the future, not a court in which we declare who is guilty and who is innocent. By being kind to yourself and others through the spirit of blamelessness, we learn and improve together.

Information Is Key

An incident can be unpredictable, and you never know what kind of information may be helpful to responders. Whether it is a daily standup or a monthly business review, we curate information for our audiences because these are well-known, well-controlled situations. However, an ongoing incident is not the right place to filter new information. Surfacing information throughout an incident serves two purposes. It gives responders more information to use as they make decisions, and it lets non-responders know the status of what is going on.
If there is an ongoing incident owned by another team and you notice abnormal behavior in your metrics and logs, surfacing the fact that parts of your system are affected by the outage can help determine the breadth of impact and can influence the response. One day, our alerts indicated that people were experiencing latency from our services, and we started an incident. Around the same time, a second team told us about similar latency issues they had noticed. With this information, we were able to determine more quickly that the issue was really in the database and were able to page the correct team, resulting in a faster resolution. The other side of the coin is balance. If too many teams are surfacing the same information, it can easily become overwhelming for those managing the incident. Having asynchronous incident communications readily available throughout the incident in a Slack channel or something similar can help keep various stakeholders like support or account representatives informed of the incident status. During our incidents, someone will periodically provide a situation report that will detail a rough timeline as well as the steps that have been taken to date. These sorts of updates help keep responders focused and stakeholders informed for a quick and effective resolution. As an added bonus, having the communication documented as it happened is very helpful in postmortems. The incident communications will give you a good idea of the timeline, the decisions taken, and the resolutions.

Focus on Sustainable Response

In life and throughout an incident, it's important to focus on the things you can control. This is especially relevant when you experience downtime because of a third-party vendor or a dependency on another team. Incident response is like your body's stress response. It can give you the capability to accomplish great things in the name of self-preservation, but it's not good to be in that state for a prolonged period of time. These periods of heightened stress can leave you exhausted afterward, and maintaining them is a sure recipe for burning yourself out. When you cannot do anything to directly impact the outcome of an incident, it's time for you to stand down and let others take the lead. One day, a large portion of our customers could not log in or sign up for new accounts because of an outage with a downstream vendor. Based on our severity scale, we would need to be working 24/7 to resolve the outage, but the only way to mitigate the issue would be to move to a new vendor. This was a monumental task that was not likely to be completed with quality late into the evening after a long work day. Keeping responders engaged would have been a sure way to burn them out and reduce response quality the next day. We made the call to stand down from the incident while we waited for a response from the vendor the next morning. We had workarounds in place for customers that would allow them to move forward in most cases, so we did the right thing and waited till morning. On the other hand, in the morning, it was evident that we wouldn't be able to get a timely resolution from the vendor. With the mounting burden on support and the growing impact of lost signups, we turned to fixes that we had control over and began the process of changing vendors. Using our knowledge of customer and support impact, we prioritized the vendor change for the areas that would unblock the most customers instead of trying to mitigate every failing system.
Breaking down the problem into smaller pieces made it possible to take on this difficult task and spread out the work. This brought our time to resolution down and allowed us to make better decisions for later migrations without exhausting the incident response team.

Stop the Bleeding

When you're in incident response mode, center your efforts on addressing user pain first: how can we best alleviate the impact on our users? A mitigation is a fix you can implement that will restore functionality or reduce the impact of a malfunctioning part. This could mean a rollback of a recent deployment, a manual workaround with support, or a temporary configuration change. When we develop features or enhance existing ones, we are designing, architecting, and refactoring for mid- to long-term stability. When you're finding mitigations, you may do things that you wouldn't normally do because they are inherently short-sighted, but they provide relief to users while you find a more permanent, stable solution. In an operating room, a surgeon may clamp an artery to control bleeding while they repair an organ, knowing full well that the artery cannot remain clamped indefinitely. You want to be able to develop fixes while your system is in a stable, if not fully functioning, state.

During my application deletion fiasco, we were able to determine that only half of our live instances were trying to connect to the deleted application. Instead of trying to get the deleted application re-deployed, or fixing the configuration for the broken services, we decided to route traffic away from the faulty instances. This left us temporarily with services running in only one region, but it allowed our users to continue to use the product while we found the permanent fix. We were able to introduce a new configuration and test that the deployments would work before rerouting traffic to them. It took us 30 minutes to route traffic away and another 60 minutes to fix the instances and reroute traffic to them, leading to only 30 minutes of downtime as opposed to what could have been 90 minutes.

Apply Fixes One at a Time

Incidents that have a singular cause are relatively easy to approach: you find the mitigation, you apply it, and then you fix the problem. However, not all problems stem from a single cause. Symptoms may hide other problems, and issues that are benign on their own may be problematic when combined with other factors. The relationships between all these factors may not even be evident at first. Distributed systems are large and complex, and a single person might not be able to thoroughly understand the full breadth of the system. To find a root cause in these cases, we must rely on examining the results of controlled input to better understand what is going on; in essence, performing an experiment to confirm that a certain fix produces a certain result. If you make two concurrent fixes, how do you know which one fixed the issue? How do you know one of the solutions didn't make things worse?

In another outage, one of our services was suddenly receiving large bursts of traffic, leading to latency in our database calls. Our metrics showed that the database calls were taking a long time, but the database metrics showed that the queries were completing within normal performance thresholds. We couldn't even find the source of the traffic. After a great deal of digging, and a deep dive into the inner workings of TCP, we found the issue! Our database connection pool was not configured well for bursty traffic.
We prepared to deploy a fix. In the spirit of addressing user pain and working in the areas they could control, another team was investigating the issue in parallel. They had discovered that a deployment of theirs coincided with when the outage had begun and were preparing to do a rollback. In the spirit of surfacing information, both teams were coordinating through the incident channel. Before we applied either fix, someone suggested that we apply one fix first to see if it addressed the problem. We moved to apply the change to the connection pool, and to our joy and then immediate dismay, we had fixed the original issue but not the customer outage. Our service still couldn't handle the volume of traffic it was receiving. At that point, the other team applied their rollback, and the traffic returned to normal. By applying these fixes separately, we discovered both a connection pool misconfiguration and a bug that was causing an application to call our service many more times than it needed to. If we had simply rolled back the deployment, it is possible that similar traffic in the future would have caused our service to fail again, creating another outage. With methodical application of fixes, you can better identify root causes in complex distributed systems.

Know Your Basics

Distributed systems are hard. As a beginner or even a more seasoned engineer, fully understanding them at scale is not something our brains are made to do. These systems have a wide breadth, the pieces are complex, and they are constantly changing. However, all things in nature follow patterns, and distributed systems are no exception. You can use these patterns to your advantage to ask the right questions. Distributed systems will often have centralized logging, metrics, and tracing. Microservices and distributed monoliths will often have API gateways or routers that provide a singular and consistent customer-facing interface to the disparate services that back them. These distributed services will likely make use of queueing mechanisms, cache stores, and databases. By having a high-level understanding of your implemented architecture, even without knowing all its complexity and nuance, you can engage the people on your team or at your company who do. A general surgeon knows how a heart works but may consult with a cardiothoracic surgeon if they find that the case requires more specialized knowledge. If you are familiar with the high-level architectural patterns in your application, you can ask the right questions to find the people with the information you need.

Parting Thoughts

We rely on healthcare professionals to treat us when the complex systems that are our bodies don't work as we expect them to. We've come to rely on them as people who will methodically break down what happens in our bodies, put us back together, and heal our pain. As software developers, we don't directly hold lives in our hands the way health workers do, but we must recognize that the world is becoming more and more dependent on the systems we build. People are building their lives around our systems with varying degrees of impact. We build entertainment systems like games and social media, but we also build systems that pay people, help them pay their bills, and coordinate transportation. When a game is down, maybe we go for a walk. When an outage prevents a check from being disbursed, it could mean the difference between making rent and becoming homeless.
If someone cannot pay their obligations due to system downtime, it may have huge repercussions on their life. Responsible practice of our craft, including incident response, is how we acknowledge our responsibility to those who depend on us. Use these heuristics to center yourself on the people who rely on your systems. For their sake, keep calm and respond.
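To make the severity-scale idea discussed above slightly more concrete, here is a minimal, hypothetical sketch; the thresholds, tier names, and response descriptions are invented for illustration and not taken from any real runbook. It maps an observed user error rate to a response level, the kind of lookup a published severity scale formalizes.

Java

// Hypothetical severity scale: maps an observed user error rate to a response level.
public enum Severity {
    SEV1("Page on-call immediately, at any hour"),
    SEV2("Page on-call during business hours and notify stakeholders"),
    SEV3("File a ticket and handle it in normal prioritization");

    private final String response;

    Severity(String response) {
        this.response = response;
    }

    public String response() {
        return response;
    }

    /** Classify an incident from the fraction of user requests that fail (0.0 to 1.0). */
    public static Severity fromErrorRate(double errorRate) {
        if (errorRate >= 0.25) return SEV1;
        if (errorRate >= 0.05) return SEV2;
        return SEV3;
    }
}

With such a mapping, an error rate of 0.45, like the one in the deleted-application story, lands in the highest tier and justifies an immediate, out-of-hours response.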
Books on bad programming habits take up a fraction of the shelf space dedicated to best practices. We know what good habits are – or we pay convincing lip service to them – but we lack the discipline to prevent falling into bad habits. Especially when writing test code, it is easy for good intentions to turn into bad habits, which will be the focus of this article. But first, let's get the definitions right.

An anti-pattern isn't simply the absence of any structured approach, which would amount to no preparation, no plan, no automated tests, and just hacking the shortest line from brainwave to source code. That chaotic approach is more like non-pattern programming. Anti-patterns are still patterns, just unproductive ones, according to the official definition. The approach must be structured and repeatable, even when it is counter-productive. Secondly, a more effective, documented, and proven solution to the same problem must be available. Many (in)famous anti-patterns consistently flout one or more good practices. Spaghetti code and the god object testify to someone's ignorance or disdain of the principles of loose coupling and cohesion, respectively. Education can fix that. More dangerous, however, are the folks who never fell out of love with over-engineering since the day they read about the Gang of Four, because doing too much of a good thing is the anti-pattern that rules them all. It's much harder to fight because it doesn't feel like you're doing anything wrong.

Drawn To Complexity Like Moths to a Flame – With Similar Results

In the same way that you can overdose on almost anything that is beneficial in small doses, you can overdo any programming best practice. I don't know many universally great practices, only suitable and less suitable solutions to the programming challenge at hand. It always depends. Yet developers remain drawn to complexity like moths to a flame, with similar results. The usefulness of the SOLID design principles has a sweet spot, after which it's time to stop; it doesn't follow an upward curve where more is always better. Extreme dedication to the single responsibility principle gives you an explosion of specialized boilerplate classes that do next to nothing, leaving you clueless as to how they work together. The open/closed principle makes sense if you're maintaining a public API, but for a work-in-progress, it's better to augment existing classes than to create an overly complex inheritance tree. Dependency inversion? Of course, but you don't need an interface if there will only ever be one private implementation, and you don't always need Spring Boot to create an instance of it.

The extreme fringes on opposite sides of the political spectrum have more in common with each other than they have with the reasonable middle. Likewise, no pattern at all gives you the same unreadable mess as well-intentioned over-engineering, only a lot quicker and cheaper. Cohesion gone mad gives you FizzBuzzEnterpriseEdition, while too little of it gives you the god object. Let's turn, then, to test code and the anti-patterns that can turn the effort into an expensive sinkhole. Unclarity about the purpose of testing is already a massive anti-pattern before a single line of code is written. It's expressed in the misguided notion that any testing is better than no testing. It isn't. Playing foosball is better than ineffectual testing, because the former is better for team morale and doesn't lull you into a false sense of security. You must be clear on why you write tests in the first place.
First Anti-Pattern: Unaware of the Why

Well, the purpose of writing tests is to bring Sonar coverage up to 80%, isn't it? I'm not being entirely sarcastic. Have you never inherited a large base of untested legacy code that has been working fine for years but is languishing at an embarrassing 15% test coverage? Now suddenly the powers that be decide to tighten the quality metrics. You can't deploy unless coverage is raised by 65 percentage points, so the team spends several iterations writing unit tests like mad. It's a perverse incentive, but it happens all the time: catch-up testing. Here are three reasons that hopefully make more sense.

First, tests should validate specifications. They verify what the code is supposed to do, which comes down to producing the output that the stakeholders asked for. A developer who isn't clear on the requirements can only write a test that confirms what the code already does, and only from inspecting the source code. Extremely uncritical developers will write a test confirming that two times four equals ten because that's what the (buggy) code returns. This is what can happen when you rush to improve coverage on an inherited code base and don't take the time to fully understand the what and why.

Secondly, tests must facilitate clean coding, never obstruct it. Only clean code keeps maintenance costs down, gets new team members quickly up to speed, and mitigates the risk of introducing bugs. Developing a clean codebase is a highly iterative process where new insights lead to improvements. That means constant refactoring. As the software grows, it's fine to change your mind about implementation details, but you can only improve code comfortably that way if you minimize the risk that your changes break existing functionality. Good unit tests warn you immediately when you introduce a regression, but not if they're slow or incomplete.

Thirdly, tests can serve as a source of documentation for the development team. No matter how clean your code, complex business logic is rarely self-explanatory if all you have is the code. Descriptive scenarios with meaningful data and illustrative assertions show the relevant input permutations much more clearly than any wiki can. And they're always up to date.

Second Anti-Pattern: London School Orthodoxy

I thank Vladimir Khorikov for pointing out the distinction between the London and the classical school of unit testing. I used to be a Londoner, but now I'm convinced that unit tests should primarily target public APIs. Only this way can you optimize the encapsulated innards without constantly having to update the tests. Test suites that get in the way of refactoring are often tightly coupled to implementation details. As long as you can get sufficient execution speed and coverage, I find no compelling reason for a rigid one-to-one mapping between source classes and corresponding test classes. Such an approach forces you to emulate every external dependency's behavior with a mocking framework. This is expensive to set up and positively soul-crushing if the classes under test have very little salient business logic. A case in point:

Java

@RestController
public class FriendsController {

    @Autowired
    FriendService friendService;

    @Autowired
    FriendMapper friendMapper;

    @GetMapping("/api/v1/friends")
    public List<FriendDto> getAll() {
        return friendMapper.map(friendService.retrieveAll());
    }
}

This common Controller/Service layered architecture makes perfect sense: cohesion and loose coupling are taken care of.
The Controller maps the network requests, (de)serializes input/output, and handles authorization. It delegates to the Service layer, which is where all the exciting business logic normally takes place. CRUD operations are performed through an abstraction of the database layer, injected into the service layer. Not much seems to go on in this simple example, but that's because the framework does the heavy lifting. If you leave Spring out of the equation, there is precious little to test, especially when you add advanced features like caching and repositories generated from interfaces. Boilerplate and configuration do not need unit tests. And yet I keep seeing things like this:

Java

@ExtendWith(MockitoExtension.class)
public class FriendsControllerTest {

    @Mock
    FriendService friendService;

    @Mock
    FriendMapper friendMapper;

    @InjectMocks
    FriendsController controller;

    @Test
    void retrieve_friends() {
        // arrange
        var friendEntities = List.of(new Friend("Jenny"));
        var friendDtos = List.of(new FriendDto("Jenny"));
        Mockito.doReturn(friendEntities).when(friendService).retrieveAll();
        Mockito.doReturn(friendDtos).when(friendMapper).map(eq(friendEntities));

        // act
        var friends = controller.getAll();

        // assert
        Assertions.assertThat(friends).hasSize(1);
        Assertions.assertThat(friends.get(0).getName()).isEqualTo("Jenny");
    }
}

A test like this fails on three counts: it's too concerned with implementation details to validate specifications, it's too simplistic to have any documentary merit, and, being tightly bound to the implementation, it certainly does not facilitate refactoring. Even a "Hello, World!"-level example like this takes four lines of mock setup. Add more dependencies with multiple interactions, and 90% of the code (and your time!) is taken up with tedious mock setup. What matters most is that Spring is configured with the right settings, and only a component test that spins up the environment can verify that. If you include a test database, it can cover all three classes without any mocking, unless you need to connect to an independent service outside the component under test.

Java

@SpringBootTest
@AutoConfigureMockMvc
class FriendsControllerTest {

    @Autowired
    MockMvc mockMvc;

    @Test
    @WithAnonymousUser
    void get_friends() throws Exception {
        mockMvc.perform(get("/api/v1/friends"))
               .andExpect(content().string("[{\"name\": \"Jenny\"}]"));
    }
}

Third Anti-Pattern: Trying to Test the Untestable

The third anti-pattern I want to discuss rears its head when you try to write tests for complex business functionality without refactoring the source. Say we have a 1500-line monster class with one deceptively simple public method. Give it your outstanding debt and last year's salary slips, and it tells you how much you're worth.

Java

public int getMaxLoanAmountEuros(List<SalaryReport> last12SalarySlips, List<Debt> outstandingDebt);

It's different from the previous example in two important ways:

The code under test centers on business logic and has high cyclomatic complexity, requiring many scenarios to cover all the relevant outcomes.
The code is already there and testing is late to the party, meaning that by definition you can't work in a test-driven approach.

That in itself is an anti-pattern: we're writing tests as an afterthought, only to validate what the code does. The code may be very clean, with all complexity delegated to multiple short private methods and sub-classes to keep things readable. Sorry, but if there are no unit tests, it's more likely to be a big ball of mud. Either way, we can't reduce the essential complexity of the business case.
The only way to reach full coverage of all the code under this single public method is by inventing scenarios with different input (the salary and debt information). I have never been able to do that comfortably without serious refactoring: delegating complex, isolated portions to their own classes and writing dedicated tests per class (a short sketch of this approach follows at the end of this article). If you're terrified of breaking things, changing the access level of private methods to default scope is the safer option. If the test is in the same package, you can then write focused tests for these methods, but it's a controversial strategy, best avoided: you're breaking encapsulation and wading knee-deep in implementation details, which makes future refactoring even harder. You have to proceed carefully when you tamper with untested production code, but there is no good alternative other than a full rewrite. Writing thorough unit tests for messy code that you inherited without the privilege to refactor is painful and ineffectual, because untested code is often too cumbersome to test. Such untestable code is by definition bad code, and we should not settle for bad code. The best way to avoid that situation is to be aware of the valid reasons why we test in the first place. You will then never write tests as an afterthought.

Green Checkbox Addiction

The productive path to establishing and maintaining effective test automation is not easy, but at least the good habits make more sense than the well-intentioned yet harmful anti-patterns I pointed out here. I leave you with a funny one, which you might call green checkbox addiction: the satisfaction of seeing an all-green suite, regardless of whether the tests make any sense. That's the false sense of security I mentioned earlier, which makes bad testing worse than no testing at all. It's like the productivity geeks who create spurious tasks in their to-do lists for the dopamine spike they get when checking them off. Very human, and very unproductive.
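As promised, here is a minimal, hypothetical sketch of the refactoring approach for the loan example; the class, method, and figures are invented for illustration, not taken from a real loan engine. One isolated rule is pulled out of the monster class into its own small class, which can then be covered by focused unit tests without any mocking.

Java

import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.assertThat;

// In production code: one isolated rule, extracted from the monster class.
public class RepaymentCapacityCalculator {

    private static final double MAX_DEBT_RATIO = 0.35;

    /**
     * Returns the monthly amount available for loan repayment, given the
     * average net salary and the existing monthly debt payments.
     */
    public int monthlyRepaymentCapacityEuros(int averageMonthlySalaryEuros,
                                             int existingMonthlyDebtEuros) {
        double budget = averageMonthlySalaryEuros * MAX_DEBT_RATIO;
        return (int) Math.max(0, budget - existingMonthlyDebtEuros);
    }
}

// In the test sources: a focused test for the extracted class, no mocks, no monster.
class RepaymentCapacityCalculatorTest {

    private final RepaymentCapacityCalculator calculator = new RepaymentCapacityCalculator();

    @Test
    void capacity_is_salary_share_minus_existing_debt() {
        assertThat(calculator.monthlyRepaymentCapacityEuros(3000, 500)).isEqualTo(550);
    }

    @Test
    void capacity_never_goes_negative() {
        assertThat(calculator.monthlyRepaymentCapacityEuros(1000, 900)).isEqualTo(0);
    }
}

Repeat this for each isolated portion of the logic, and the public getMaxLoanAmountEuros method shrinks to orchestration code that needs only a handful of end-to-end scenarios.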
Motivated by some recent M&A news and general productivity pressures in times of tight budgets, we present some anti-patterns in the use of engineering metrics and give an overview of how to use metrics for productivity insights instead. Let us start with what to avoid:

Anti-Patterns of Engineering Metrics

Lines of code written: While lines of code can be a proxy for how much code you have to maintain (and therefore for the cost it incurs), it has proven to be a terrible metric for productivity. This shouldn't need much explanation, but: output does not necessarily equate to outcome. Different programming languages have different levels of verbosity, and simply including some open-source packages or pasting in code from Stack Overflow does not make you more productive. And, of course, solving hard problems often does not require a lot of code but a lot of thinking, exploration, and collaboration.

Complexity or salience of code: Complex code is not good code; someone else has to read, understand, and maintain it. Salient code (let's interpret this as "most notable") can be worth noting, but that does not make you a good team player or show any consistency. One key point here: software development is a team sport. Every company has some rock stars who are able to do things that others cannot, but that does not mean you only want rock stars who cannot work with others. This is especially true if you have more than one star with similar traits.

Number of PRs/reviews: Now, this is a bit more of a grey area. Of course, if you can churn out a lot of pull requests (PRs), this typically means the work is associated with some feature spec (at least one would hope so), so it is at least going in a direction that is perceived as customer value. But on their own, PRs are again a proxy for the amount of work that requires maintenance, and a metric that is prone to gaming. You can do a lot of small PRs and "clog up" the review and merge pipeline, do a lot of cursory reviews (LGTM), or worse, add confusing and useless comments. Adding value to a customer feature is good; generating activity for its own sake is not.

Bug fixes: Someone fixes a zillion bugs? This can be great, but two questions come to mind: Why are there so many bugs in the first place? And do all of these "bugs" need fixing? The unfortunate truth is that no software is perfect, and there is always far more to do than there is time for.

Number of hours worked: This shouldn't need much explanation either, but hours do not equate to time well spent. Sure, if you only work one hour a day, you must be truly exceptional (and not hardcore) to continuously produce high value. But working 14 hours does not mean you are at your most productive, especially once you run out of steam (or coffee). All good things need the right effort without going to extremes; sometimes that means more coding, and sometimes more exploration, thinking, and discussing.

The Value of Data-Driven Engineering

Now, we are the last ones who would discourage the use of metrics (ahem…), but metrics are not the goal; insights into what is going on in the organization and continuous improvement loops are. Firstly, metrics are almost always a poor way to manage individuals. If you need metrics to understand whether an employee gives you the value you are expecting, then you are already in trouble. We understand that large organizations need some sense of objectivity, but using any of the above is probably a poor choice.
Also, metrics in themselves do not solve underlying issues. This brings us to the next point: metrics are valuable for highlighting trends, anomalies, imbalances, and bottlenecks. This is particularly true for processes, workflows, and team/group aggregates. It would be unthinkable to run your ops without watching some numbers, and the same goes for sales or marketing. But the intent is always to improve and streamline processes and create less disruption. As such, metrics can provide great signals for managing organizational engineering productivity and removing friction from the system.

For development teams, it makes sense to measure some numbers and trends. For instance: Where in the development process do we get stuck or waste time? Does our build system hold us back? Are we overloaded with reviews? Do we have QA issues? Do our sprints get re-scoped mid-flight all the time? Are there too many context switches? All these things are essential, especially for good managers who want to assist their teams. Managers should remove roadblocks and distractions, not micromanage lines of code.

As for trends, there are some indicators of whether teams are "in the flow" or not. How are features/PRs tracking over time? How much work do we usually get done (feature tickets, or even PRs as a proxy)? Do we follow the process (reviews and approvals)? Do our security checks run? How reliable are our builds? The question then is: do we see spikes, anomalies, or unhealthy trends? These indicate issues with infrastructure, processes, and team health. Those warrant fixing and, in turn, create happier and more productive teams and a smooth flow in software delivery.

Reducing Friction, Not Micro-Managing

The key to using metrics is to gain insight into, and confidence in, processes, teams, and workflows in a way that is otherwise hard to obtain. The engineering productivity objective should be to minimize friction in the software delivery pipeline. This means the steps from planning a feature to delivering it to the customer, and the reaction time to customer feedback and requests, should be quick, smooth, and reliable. Any process or principle bottleneck significantly outweighs a single contributor's shortcomings and should therefore be the major focus. Modern engineering leaders use the 4 Key Metrics to structure their engineering measurements and get better visibility into their teams and delivery processes. These metrics are:

Velocity of delivery (how fast)
Throughput of delivery (how much)
Quality of delivery (how good)
Impediments to delivery (how risky)

Recently, a new wave of engineering intelligence platforms has emerged to support teams and their engineering managers with the visibility and insights required to create such healthy and productive organizations. By shifting the focus away from individual performance metrics and toward reducing friction on the way to productivity goals, companies have achieved double-digit percentage cost savings within a short period of time. We will dive deeper into this in a future article.
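As a parting, hypothetical illustration of the kind of signal meant above (the record, class, and sample data are invented for the example), here is a sketch that computes the median commit-to-deploy lead time, a simple proxy for velocity of delivery. A worsening trend in this number is exactly the friction signal worth investigating.

Java

import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Hypothetical record of a single change: when it was committed and when it reached production.
record Change(Instant committedAt, Instant deployedAt) {}

public class LeadTimeReport {

    /** Median commit-to-deploy lead time over a set of changes. */
    static Duration medianLeadTime(List<Change> changes) {
        List<Duration> leadTimes = changes.stream()
                .map(c -> Duration.between(c.committedAt(), c.deployedAt()))
                .sorted(Comparator.naturalOrder())
                .toList();
        return leadTimes.get(leadTimes.size() / 2); // simple median; assumes a non-empty list
    }

    public static void main(String[] args) {
        List<Change> lastWeek = List.of(
                new Change(Instant.parse("2023-05-01T10:00:00Z"), Instant.parse("2023-05-01T16:00:00Z")),
                new Change(Instant.parse("2023-05-02T09:00:00Z"), Instant.parse("2023-05-03T09:00:00Z")),
                new Change(Instant.parse("2023-05-03T14:00:00Z"), Instant.parse("2023-05-03T18:00:00Z")));
        System.out.println("Median lead time: " + medianLeadTime(lastWeek).toHours() + "h");
    }
}

Fed weekly with real deployment records, a report like this highlights trends in the delivery pipeline without ever ranking individual engineers.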
From experience, most of us know how difficult it is to express what we mean when talking about quality. Why is that so? There are many different views on quality, and every one of them has its importance. What has to be defined for our project is something that fits its needs and works within its budget. Striving for perfectionism can be counterproductive if a project is to be completed successfully. We will start from a research paper written by B. W. Boehm in 1976 called "Quantitative evaluation of software quality." Boehm highlights the different aspects of software quality and the right context for them. Let's have a deeper look into this topic.

When we discuss quality, we should focus on three topics: code structure, implementation correctness, and maintainability. Many managers care only about the first two aspects, not about maintenance. This is dangerous, because enterprises will not invest in bespoke development just to use the application for only a few years. Depending on the complexity of the application, the price of creation could reach hundreds of thousands of dollars, so it's understandable that the expected business value of such an investment is estimated to be high. A lifetime of 10 years or more in production is very typical. To keep the benefits, adaptations will be mandatory, which implies a strong focus on maintenance. Clean code doesn't automatically mean your application is easy to change. A very accessible article touching on this topic was written by Dan Abramov. Before we go further into how maintenance could be defined, we will discuss the first point: the structure.

Scaffolding Your Project

An often underestimated aspect in development organizations is the lack of a standard for project structures. A fixed definition of where files have to be placed helps team members find points of interest quickly. Such a meta-structure for Java projects is defined by the build tool Maven. More than a decade ago, companies tested Maven and then adapted the tool to the folder structures already established in their projects. This resulted in heavy maintenance tasks, because more and more infrastructure tools for software development came into use. Those tools operate on the standard that Maven defines, meaning that every customization affects the success of integrating a new tool or exchanging an existing tool for another.

Another aspect to look at is the company-wide defined META architecture. Where possible, every project should follow the same META architecture. This will reduce the time it takes a new developer to join an existing team and reach its level of productivity. This META architecture has to remain open for adaptation, which can be achieved in two simple steps: don't be concerned with too many details, and follow the KISS (keep it simple, stupid) principle. A classical pattern that violates the KISS principle is when standards get heavily customized. George Schlossnagle describes a very good example of the effects of strong customization in his book "Advanced PHP Programming." In chapter 21, he explains the problems created for the team when they adapted the original PHP core instead of following the recommended way via extensions. The result was that every update of the PHP version had to be patched manually to carry their own adaptations of the core forward. In conjunction, structure, architecture, and KISS already define three quality gates, which are easy to implement.
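For reference, the standard directory layout that Maven expects, and that the surrounding tooling mentioned above assumes, looks like this:

pom.xml                 build configuration at the project root
src/main/java           production sources
src/main/resources      production configuration and other resources
src/test/java           test sources
src/test/resources      test configuration and fixtures
target/                 generated build output, never checked in

Sticking to this layout means new tools and new team members find everything exactly where they expect it.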
The open-source project TP-CORE, hosted on GitHub, concerns itself with the aforementioned structure, architecture, and KISS. There you can find their approach to putting these ideas into practice. This small Java library rigidly follows the Maven convention with its directory structure. For fast compatibility detection, releases are defined by semantic versioning. A layered structure was chosen as its architecture. Their main architectural decisions can be summarized as follows: each layer is defined by its own package, and the files within it also follow a strict naming rule. No special prefix or suffix is used. The functionality Logger, for example, is declared by an interface called Logger, and the corresponding implementation is LogbackLogger. The API interfaces can be found in the package "business," and the implementation classes are located in the package "application." Naming like ILogger and LoggerImpl should be avoided. Imagine a project that was started 10 years ago, whose LoggerImpl was based on Log4J. Now a new requirement arises, and the log level needs to be changeable at run time. To solve this challenge, the Log4J library could be replaced with Logback. Now it is understandable why it is a good idea to name the implementation class after the interface, combined with the implementation detail: it makes maintenance much easier! The same convention can also be found within the Java standard API: the interface List is implemented by ArrayList, and, obviously, the interface is not labeled IList nor the implementation ListImpl.

To summarize this short section: a full set of measurement rules was defined to describe our understanding of structural quality. From experience, this description should be short. If other people can easily comprehend your intentions, they will willingly accept your guidance and defer to your knowledge. In addition, the architect will be much faster in detecting rule violations.

Measure Your Success

The most difficult part is keeping the code clean. Some advice is not bad per se but, in the context of your project, may not prove useful. In my opinion, the most important rule is to always activate the compiler warnings, no matter which programming language you use! All compiler warnings have to be resolved when a release is prepared. Companies dealing with critical software, like NASA, strictly apply this rule in their projects, with great success. Coding conventions about naming, line length, and API documentation, like JavaDoc, can easily be defined and checked by tools like Checkstyle. This process can run fully automated during your build. Be careful: even if the code checkers pass without warnings, this does not mean that everything is working optimally. JavaDoc, for example, is problematic. With an automated Checkstyle check, it can be ensured that this API documentation exists, but we have no idea about the quality of those descriptions.

There should be no need to discuss the benefits of testing here; let us rather take a walk through test coverage. The industry standard of 85% covered code should be followed, because coverage below 85% will not reach the complex parts of your application, while 100% coverage just burns through your budget fast without yielding higher benefits. A prime example of this is the TP-CORE project, whose test coverage is mostly between 92% and 95%. This was done to explore what is realistically achievable. As already explained, the business layer contains just interfaces, defining the API.
This layer is explicitly excluded from the coverage checks. Another package, called internal, contains hidden implementations like the SAX DocumentHandler. Because of the dependencies the DocumentHandler is bound to, it is very difficult to test this class directly, even with mocks. This is unproblematic, given that the class is intended for internal usage only. In addition, the class is implicitly tested by the implementation that uses the DocumentHandler. To reach higher coverage, it could also be an option to exclude all internal implementations from the checks, but it is always a good idea to observe the implicit coverage of those classes to detect aspects you may be unaware of. Besides the low-level unit tests, automated acceptance tests should also be run. Paying close attention to these points can avoid a variety of problems, but never trust those fully automated checks blindly! Regularly repeated manual code inspections will always be mandatory, especially when working with external vendors. In our talk at JCON 2019, we demonstrated how easily test coverage can be faked. To detect other vulnerabilities, you can additionally run checkers like SpotBugs and others. Tests don't prove that an application is free of failures, but they do define the expected behavior of the implemented functionality.

For a while now, SCM suites like GitLab or Microsoft Azure have supported pull requests, introduced long ago by GitHub. These workflows are nothing new; IBM Synergy used to apply the same technique. A Build Manager was responsible for merging the developers' changes into the codebase. In rapid succession, all the revisions produced by the developers were simply added to the repository by the Build Manager, who did not have sufficiently deep knowledge to judge the implementation quality; the usual practice was merely to ensure that the build was not broken and that the compile always produced an artifact. Enterprises have rediscovered this as a strategy for handling pull requests, and managers now often decide to use pull requests as a quality gate. In my personal experience, this slows down productivity because it takes time until the changes are available in the codebase. Understanding the branch and merge mechanism helps you decide on a simpler branch model, like release branch lines, on which tools like SonarQube can operate to observe the overall quality goal. If a project needs an orchestrated build, with a defined order in which artifacts have to be created, you have a strong hint that refactoring is needed. The coupling between classes and modules is often underestimated. It is very difficult to visualize the bindings of modules automatically, but you will find out very quickly what happens when loose coupling is violated, because the complexity of your build logic increases.

Repeat Your Success

Rest assured, changes will happen! It is a challenge to keep your application open for adjustments. Several of the previous recommendations have implicit effects on future maintenance. Good source quality simplifies the endeavor of being prepared, but there is no guarantee. In the worst case, the end of the product lifecycle (EOL) is reached when mandatory improvements or changes can no longer be realized, for example because of an eroded code base. As already mentioned, loose coupling brings numerous benefits with respect to maintenance and reuse. Reaching this goal is not as difficult as it might look.
In the first place, try to avoid the inclusion of third-party libraries as much as possible. Just to check whether a String is empty or null, it is unnecessary to depend on an external library; those few lines are quickly written yourself. A second important point to consider in relation to external libraries: use only one library to solve a given problem. If your project deals with JSON, decide on one implementation and don't incorporate various artifacts. These two points heavily impact security: a third-party artifact you can avoid using will never be the source of a security leak. Once the decision is made for an external implementation, try to cover its usage in your project by applying design patterns like proxy, facade, or wrapper (a small sketch follows the checklist below). This makes a later replacement easier, because the code changes are not spread around the whole codebase. You don't need to change everything at once if you follow the earlier advice on how to name the implementation class and provide an interface. Even though an SCM is designed for collaboration, there are limitations when more than one person is editing the same file. Using a design pattern to hide information allows you to roll out your changes iteratively.

Conclusion

As we have seen, a nonfunctional requirement is not that difficult to describe. With a short checklist, you can clearly define the important aspects of your project. It is not necessary to check all points for every code commit in the repository; this would in all probability just raise costs without yielding higher benefits. Running a full check around a day before the release is an effective way to keep quality in an agile context and will help you recognize where optimization is necessary. The points of interest (POI) for securing quality are the revisions in the code base for a release. This gives you a comparable statistic and helps improve estimations. Of course, in this short article, it is almost impossible to cover all aspects of quality. We hope our explanation helps you link the theory to best practice through examples. In conclusion, this should be your main takeaway: a high level of automation in your infrastructure, like continuous integration, is extremely helpful, but it does not free you from manual code reviews and audits.

Checklist

Follow common standards
KISS: keep it simple, stupid!
Use the same directory structure across projects
Keep the META architecture simple, so it can be reused as much as possible in other projects
Define and follow coding styles
When a release is prepared, no compiler warnings are accepted
Keep test coverage at or above 85%
Avoid third-party libraries as much as possible
Don't support more than one technology for a specific problem (e.g., JSON)
Cover foreign code with a design pattern
Avoid strong object/module coupling

Further Reading

Software Design Principles DRY and KISS
Code Quality: Honing Your Craft
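As promised above, here is a minimal, hypothetical sketch of hiding an external library behind a facade; the interface, class, and method names are invented for illustration. The rest of the codebase depends only on the small JsonMapper interface, so swapping the underlying JSON library later touches a single class.

Java

import com.fasterxml.jackson.databind.ObjectMapper;

// The only abstraction the rest of the codebase sees.
public interface JsonMapper {
    String toJson(Object value);
    <T> T fromJson(String json, Class<T> type);
}

// Implementation named after the wrapped library, following the naming advice above.
class JacksonJsonMapper implements JsonMapper {

    private final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public String toJson(Object value) {
        try {
            return objectMapper.writeValueAsString(value);
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize " + value, e);
        }
    }

    @Override
    public <T> T fromJson(String json, Class<T> type) {
        try {
            return objectMapper.readValue(json, type);
        } catch (Exception e) {
            throw new IllegalStateException("Could not deserialize " + json, e);
        }
    }
}

If the team later decides to switch to another JSON library, only JacksonJsonMapper changes; every caller of JsonMapper remains untouched.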
Samir Behara, Senior Cloud Infrastructure Architect, AWS
Shai Almog, OSS Hacker, Developer Advocate and Entrepreneur, Codename One
JJ Tang, Co-Founder, Rootly
Sudip Sengupta, Technical Writer, Javelynn