A Methodical Approach to Measuring ROI on Incident Response Systems
This article provides a simplified thought process for creating a model that any process/quality control engineer at your factory/plant/workshop can work with.
Often, MSMEs operating small-scale machine workshops struggle to map and prioritize their technology platform adoption needs, especially when they learn about broad concepts such as IT-OT integration, secure and seamless data extraction, and availability in cloud platforms for visualization and decision making. They get the concept but lack the analytical toolsets to quantify what it all means for them.
The perceived benefits of data visualization (the “dashboarding” or “digital/virtual cockpit views”) and of the corresponding task-driven process automation, involving connectors and middleware that promise to connect assets, machines, tooling, and sensors, should be quantified against on-field realities to justify funding.
Often, this is where a relative lack of analytical tooling hinders decision making.
- What if I invest in data extraction and corresponding visualization of machine/asset/tooling and operator performance indicators?
- What kinds of incidents are interfering with day-to-day productivity?
- What is the opportunity cost of not being able to address such incidents at present?
- Which incidents will it allow me to foresee and pre-empt?
- What kind of reduction in incidents and their severity am I expected to experience post-investment?
- To what extent will I be able to reduce or transfer associated risks?
- What kind of productivity gains might I be able to achieve post-investment?
Answering these concerns will help you translate the benefits showcased in product demonstrations and client testimonials to your own and very specific scenario.
These answers must be discovered through analytical modeling techniques customized to your factory/plant/workshop realities.
Here, I will lay out a simplified analytical framework based on FMEA (Failure Modes and Effects Analysis) techniques that will allow you to lay out, categorize, prioritize, and quantify the incidents encountered and the corresponding control needs.
Next, based on this model, you will be able to work towards quantifying the impact that can be achieved through the incorporation of IoT/AR platforms.
Just a disclaimer that this article aims to showcase a simplified thought process behind putting together a model that any process/quality control engineer at your factory/plant/workshop can work with. This is by no means an exhaustive approach but should, at the very least, allow you to know what is at stake if the platform/solution is not implemented.
The Model
The model follows a 7-step process. Again, your own model can have fewer or more steps depending on how complex you want it to be.
Step 1: List All Assets and Record Potential Incidents Based on Present Capabilities To Detect Them
- This step will allow you to document the “AS IS” scenario.
- Please make a list of all assets you wish to consider in scope and their respective MTBF (Mean Time Between Failures) values.
- When making a list, include attributes such as type, sub-type if any, followed by the description of each incident (starting categories can be either failure, malfunction, or a deviation from normal).
- Take time to describe each potential incident in a separate column.
- To quantify the impact of each included incident, you can use an RPN scale. RPN (Risk Priority Number) comprises 3 components, namely Severity, Occurrence, and Detection. A scale of 1–10 is provided for each of these component scores. An example reference table is presented below this section for ready reference.
- Referring to the reference table, for each asset/machine/tool, you should be able to assign a corresponding Occurrence, Detection, and Severity score.
By doing this,
- You will have mapped the potential risks your assets are subject to that interfere with your daily production processes.
- You will know how likely these risks are to become real.
- You will have a reason to become alarmed about your inability to detect one or more of these.
Here is an example of an RPN reference table. Ideally, a reference table should be computed for each device/component. From its MTBF value, we know roughly how often a device can be expected to fail on average.
For instance, if a device has an MTBF value of 8760 hours, it can be expected to fail about once a year on average.
Next, since we need to use a scale of 1–10, scale back likelihood values (starting from 1 at the bottom) as shown in the third column of Figure 1.
Below is an example of worked out scales for each of these.
Figure 3: RPN Reference table scale explanation- For Occurrence
Note that the opportunities column is optional. It just presents a more human-readable interpretation expressed in terms of count per second.
The above table is again based on the same reference timescale of 12 months. You may need to change this, considering the MTBF values of your assets. However, bear in mind that likelihoods are ratios expressed relative to the scale.
For instance, specifically for the Detection scale, a value of 1 indicates near certainty that an incident of interest, such as a failure event, will be detected. Hence, 1 or 100% is assigned to the corresponding likelihood column value.
You can start with a general MTBF period and compute the reference scale for Occurrence from it.
The climb from 0% to 100% can be defined as per your specific process/asset performance needs.
Likelihoods are just that: how likely or unlikely it is for your present-day systems to detect a given incident (which itself can be a very severe or not so severe one).
You will also need the Detection scale so that you can reflect on whether your factory/plant/assembly units or individual assets/machines have the capability to detect and alert about noteworthy incidents as laid out earlier.
Now, back to our Asset/Incident Map.

You should now be in a position to map incidents ranked by their RPN scores. The RPN score is simply the product of the Occurrence, Detection, and Severity scores.
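As a rough illustration of the ranking described above, here is a minimal Python sketch. The `Incident` class, the asset names, and the scores are hypothetical placeholders, not values from any real reference table; only the RPN product itself comes from the model.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One row of the asset/incident map (field names are assumptions)."""
    asset: str
    description: str
    occurrence: int  # 1-10 score from your reference table
    detection: int   # 1-10 score from your reference table
    severity: int    # 1-10 score from your reference table

    def rpn(self) -> int:
        # RPN is the product of the three component scores.
        for score in (self.occurrence, self.detection, self.severity):
            if not 1 <= score <= 10:
                raise ValueError("scores must be on the 1-10 scale")
        return self.occurrence * self.detection * self.severity


# Hypothetical entries, for illustration only.
incident_map = [
    Incident("CNC lathe", "Spindle overheating", 7, 6, 8),
    Incident("Air compressor", "Pressure drop", 4, 3, 5),
    Incident("Conveyor", "Belt misalignment", 5, 8, 4),
]

# Highest RPN first: these are the incidents to look at first.
ranked = sorted(incident_map, key=lambda inc: inc.rpn(), reverse=True)
for inc in ranked:
    print(f"{inc.asset:15} {inc.description:22} RPN={inc.rpn()}")
```

In a spreadsheet, the same thing is one extra column multiplying the three score columns, followed by a descending sort.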
One last thing to note: while the reference table uses a general 12-month period, when computing the Occurrence score for each type of asset we must take its respective MTBF value into account and translate that into the Occurrence scale of 1–10.
For instance, for a given asset, an average of one failure per month (an MTBF of roughly 730 hours) corresponds to an Occurrence score of 7.
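The translation from MTBF to an Occurrence score could be sketched as below. The thresholds in `OCCURRENCE_SCALE` are purely hypothetical; they are chosen only so that about one failure per month lands on 7, as in the example above. Your own reference table should replace them.

```python
HOURS_PER_YEAR = 8760  # 12-month reference period, in hours

# (minimum expected failures per 12 months, Occurrence score).
# Hypothetical thresholds: replace with your own reference table.
OCCURRENCE_SCALE = [
    (365, 10), (104, 9), (52, 8), (12, 7), (6, 6),
    (4, 5), (2, 4), (1, 3), (0.5, 2),
]


def failures_per_year(mtbf_hours: float) -> float:
    """Average number of failures expected in the reference period."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_YEAR / mtbf_hours


def occurrence_score(mtbf_hours: float) -> int:
    """Look up the 1-10 Occurrence score for a given MTBF."""
    rate = failures_per_year(mtbf_hours)
    for min_rate, score in OCCURRENCE_SCALE:
        if rate >= min_rate:
            return score
    return 1  # rarer than the lowest threshold


print(occurrence_score(730))   # about one failure per month
print(occurrence_score(8760))  # about one failure per year
```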
Step 2: Categorization of RPN Scores and Corresponding Visualization
This is an optional step but can come in handy where you want to categorize asset/incident maps based on their computed RPN scores. You can choose to assign literal categories such as Category A, B, C, and so on to plot a nice histogram with it.
This tells you which incident-asset mix you must pay attention to: assets prone to incidents that occur often, are hard to detect, and can be severe enough to derail day-to-day operations and productivity.
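A minimal sketch of this categorization, assuming eight bands labeled A (highest RPN) to H (lowest). The band edges and the sample RPN scores are hypothetical and should be tuned to your own spread of scores.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical band edges on the 1-1000 RPN range: 7 edges give 8 bands.
BAND_EDGES = [50, 100, 150, 200, 300, 400, 600]
BAND_LABELS = ["H", "G", "F", "E", "D", "C", "B", "A"]  # H lowest, A highest


def rpn_category(rpn: int) -> str:
    """Return the category letter for an RPN score."""
    if not 1 <= rpn <= 1000:
        raise ValueError("RPN must lie between 1 and 1000")
    return BAND_LABELS[bisect_right(BAND_EDGES, rpn)]


# Hypothetical RPN scores taken from an asset/incident map.
rpn_scores = [336, 160, 60, 48, 420, 180]

# Count incidents per category and print a simple text histogram.
histogram = Counter(rpn_category(r) for r in rpn_scores)
for label in sorted(histogram):
    print(f"Category {label}: {'#' * histogram[label]}")
```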
Step 3: Compute Costs for Responding to Incidents
After you have mapped the various incident types, you need to figure out the cost implications of responding to such incidents in the present situation. In other words, how do you quantify your plant’s/workshop’s incident responses and the burden they place on your operating budget/working capital needs?
The steps are quite straightforward.
- A. List each machine/asset.
- B. List resource types associated with installation, repair, servicing, including process specialists as required to address downtime/disruptive incidents.
- C. Next, enter the hourly/daily/monthly billing rate for each of these resources.
- D. Optionally, you can lay out the RPN categories to compute incident recovery costs for each category.
Essentially, you are computing what it would take to recover from an incident: how many resources you would need, for how long, and at what cost.
Here is an example:
This captures the kind of resources it will take, the corresponding time lost, and the expenses to recover from a given incident. Recovery can be in the form of repairing a fault or, in case of severe malfunction, replacing the asset entirely.
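To make the arithmetic concrete, here is a small sketch of a per-incident recovery cost built from resource types, headcount, hours, and billing rates. The roles, hours, and rates are invented for illustration, and the `ResourceLine` structure is an assumption about how you might lay out your own sheet.

```python
from dataclasses import dataclass


@dataclass
class ResourceLine:
    """One resource type needed to recover from an incident (hypothetical)."""
    role: str
    headcount: int
    hours: float
    hourly_rate: float

    def cost(self) -> float:
        return self.headcount * self.hours * self.hourly_rate


def recovery_cost(lines: list[ResourceLine]) -> float:
    """Total cost to recover from one incident."""
    return sum(line.cost() for line in lines)


def recovery_hours(lines: list[ResourceLine]) -> float:
    """Total person-hours spent recovering from one incident."""
    return sum(line.headcount * line.hours for line in lines)


# Hypothetical repair of one machine fault.
spindle_repair = [
    ResourceLine("Maintenance technician", headcount=2, hours=4.0, hourly_rate=30.0),
    ResourceLine("Process specialist", headcount=1, hours=2.0, hourly_rate=75.0),
]

print(recovery_cost(spindle_repair))   # cost in your currency
print(recovery_hours(spindle_repair))  # person-hours
```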
Part I: Summarize Baseline Costs
You should now be able to summarize baseline costs to repair as follows. These are hourly costs per incident.
You will also need to model the cost of eventual replacement. Here is an example.
Part II: Prepare Markdowns for RPN Categorization
You would also need to consider the relative markdown depending on the RPN score categorization we arrived at earlier. This is because not every incident is identical, and hence cost and time to recover from it will also vary accordingly.
Part III: Quantify the Likelihoods of Occurrence for Each Incident Category
Additionally, you will also need to quantify the likelihood of occurrence for each incident category, along with the likelihood of a replacement if such failure occurs. This is essentially a translation of the FMEA scale to hourly likelihoods.
Note: Depending on the need, the time scale can be daily or per minute. Hourly is used here because MTBF values are typically expressed in hours.
Part IV: Estimate MTBF and Serviceable Life
Based on the above likelihoods, the MTBF and the corresponding serviceable life period can be estimated. The incident table we built should be used to compute them, so that the incident counts and likelihoods used for cost impact estimation are consistent with your incident-asset map.
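One common way to do this, sketched below, is to estimate MTBF as observed operating hours divided by the number of recorded failures, and to approximate the hourly likelihood of failure as 1/MTBF. This is an assumption on my part about the estimation method; the approximation is reasonable only when the MTBF is much longer than one hour. The figures are hypothetical.

```python
def estimate_mtbf(operating_hours: float, failure_count: int) -> float:
    """MTBF estimate: observed operating hours per recorded failure."""
    if failure_count <= 0:
        raise ValueError("need at least one recorded failure to estimate MTBF")
    return operating_hours / failure_count


def hourly_failure_likelihood(mtbf_hours: float) -> float:
    """Approximate likelihood of a failure in any given hour (1 / MTBF)."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


# Hypothetical: one machine observed for half a year (4380 h), 6 failures logged.
mtbf = estimate_mtbf(4380, 6)
p_hour = hourly_failure_likelihood(mtbf)
print(f"MTBF: {mtbf:.0f} h, hourly likelihood: {p_hour:.5f}")
```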
Part V: Quantify the Likelihoods of Detecting Such Occurrences
As with occurrence, hourly likelihoods for the detection of incidents should also be computed. This is again based on the incident-asset map prepared earlier and corresponding RPN categorization.
Part VI: Attribute Costs Proportional to Incident Severity
You would also need to attribute the cost associated with recovering from each documented incident based on its severity classification. After all, some incidents can be severe enough to result in complete asset replacement, while others can be handled with minor repair work.
Considering the previously computed hourly cost to repair a given machine/asset/component, the following table provides the percentage of that absolute cost that each categorization carries, based on the incident entries you made earlier.
Note: These reflect individual magnitude relative to the hourly repair cost. Depending on your incident-asset map, an asset may not have all A-H categorizations. But whichever category happens to be present, the scale down magnitude can be looked up from this table.
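The lookup described in the note can be sketched as follows. The percentages in `SEVERITY_SHARE` are hypothetical stand-ins for your own table, and an asset only needs entries for the categories actually present in its incident-asset map.

```python
# Hypothetical share of the hourly repair cost carried by each category.
SEVERITY_SHARE = {
    "A": 1.00, "B": 0.85, "C": 0.70, "D": 0.55,
    "E": 0.40, "F": 0.30, "G": 0.20, "H": 0.10,
}


def scaled_repair_cost(hourly_repair_cost: float, category: str) -> float:
    """Scale the baseline hourly repair cost by the category's share."""
    return hourly_repair_cost * SEVERITY_SHARE[category]


def costs_for_asset(hourly_repair_cost: float, categories: list[str]) -> dict[str, float]:
    """Scaled costs for only the categories present on a given asset."""
    return {c: scaled_repair_cost(hourly_repair_cost, c) for c in categories}


# Hypothetical asset with a baseline repair cost of 200 per hour and
# incidents mapped to categories A, C, and G only.
print(costs_for_asset(200.0, ["A", "C", "G"]))
```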
Step 4
Part I: Compute Costs for Responding To Incidents
Now that you have estimated the costs and likelihoods, this model needs to be applied to your specific operational scenario, i.e., Plant, Types of Machines, and Quantity for each. Here is an example.
Part II: Compute Costs for Responding To Incidents
For each of the mapped assets and corresponding quantities, we need to compute:
- Cost of Repair work
- Cost of Replacement
This should be straightforward using the previously computed MTBF values.
The expected quantity of incidents can be arrived at by multiplying the number of assets by the number of mapped incidents per asset and by the joint likelihood of occurrence and detection of such incidents.
Qty. of Incidents = No. of Assets × Mapped Incidents per Asset × Likelihood of Occurrence × Likelihood of Detection, at any given hour.
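The formula above can be sketched in a few lines of Python. The input numbers are hypothetical. Multiplying the two likelihoods treats occurrence and detection as independent, which is a simplifying assumption of this sketch.

```python
def expected_incidents_per_hour(
    n_assets: int,
    incidents_per_asset: int,
    p_occurrence: float,
    p_detection: float,
) -> float:
    """Expected incident quantity at any given hour, per the formula above."""
    for p in (p_occurrence, p_detection):
        if not 0.0 <= p <= 1.0:
            raise ValueError("likelihoods must lie between 0 and 1")
    return n_assets * incidents_per_asset * p_occurrence * p_detection


# Hypothetical plant: 10 machines, 3 mapped incidents each,
# hourly likelihood of occurrence 0.001, likelihood of detection 0.5.
per_hour = expected_incidents_per_hour(10, 3, 0.001, 0.5)

# Scale up to a planning period, for example one year of continuous operation.
per_year = per_hour * 8760
print(f"{per_hour:.3f} per hour, about {per_year:.1f} per year")
```

Multiplying the yearly quantity by the per-incident repair or replacement cost then gives an expected annual cost.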
Step 5: Sensitivity Analysis
Once you estimate the incident quantities and their costs, you can then perform a sensitivity analysis to understand what kind of erroneous variations your model may produce if the FMEA category scale ratings were overstated or understated.
Hence, you need to perform this analysis for each of the Occurrence, Detection, and Severity ratings you entered in Step 1. The objective is to arrive at the cost impact of misclassification.
For example, if your systems were to detect incidents correctly and earlier, you would avoid or mitigate associated failure and downtime risks.
You should be able to quantify at the end of this stage:
A. How many incidents (failure or replacement) can be uncovered per unit change in classification, irrespective of the present magnitude of the occurrence or detection rating.
B. Cost of each unit of misclassified severity.
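As a simplified starting point, the sketch below shifts each rating by one unit in either direction and reports the resulting change in RPN. It only covers the RPN itself; extending it to incident quantities and costs means swapping the `rpn` function for your own cost function. The example ratings are hypothetical.

```python
def rpn(occurrence: int, detection: int, severity: int) -> int:
    return occurrence * detection * severity


def rpn_sensitivity(occurrence: int, detection: int, severity: int, step: int = 1) -> dict:
    """Change in RPN when one rating is off by +/- step (clamped to 1-10)."""
    ratings = {"occurrence": occurrence, "detection": detection, "severity": severity}
    base = rpn(**ratings)
    deltas = {}
    for name in ratings:
        for shift in (-step, step):
            shifted = dict(ratings)
            # Keep the shifted rating on the 1-10 scale.
            shifted[name] = min(10, max(1, ratings[name] + shift))
            deltas[(name, shift)] = rpn(**shifted) - base
    return deltas


# Hypothetical incident rated Occurrence 7, Detection 6, Severity 8.
for (name, shift), delta in rpn_sensitivity(7, 6, 8).items():
    print(f"{name:10} {shift:+d} -> RPN change {delta:+d}")
```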
Step 6: Business Impact of Downtime
This step requires both collaboration and systematic information mapping to collect data from various systems such as ERP, CRM, and Accounting functions along with day to day production planning and execution information as obtained from MES/PLC/Automation systems.
Essentially, you are trying to discover the impact of downtime on your day-to-day order fulfillment and the financial impact of fulfillment delays.
A simplified approach such as one outlined below can come in handy to make initial estimations.
Part I: Start With Impact Estimation on OEE
When you are modeling downtime, you are considering the impact on the availability of assets/machines/tooling to execute jobs as necessary to meet orders. Hence, the present capability to estimate Availability, Performance, and Quality, considering the product/component mix that your factory/plant/unit is expected to produce, is crucial.
Let us look at each of the three-constituent metrics for OEE score computation.
- Availability: This indicates the portion of time your machines/assets/tools are available to produce, relative to the planned production time: typically, your plant’s working shift times.
- Performance: This is the output your assets are able to deliver at present, in contrast with what they could deliver if operating conditions were ideal/optimal.
- Quality: This indicates the proportion of good quality products among total output.
The OEE metric is then computed as the product of the above three metrics.
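Here is a minimal sketch of that computation. The shift length, run time, and unit counts are hypothetical, and the helper function names are my own.

```python
def availability(run_time: float, planned_time: float) -> float:
    """Share of planned production time the asset was actually running."""
    return run_time / planned_time


def performance(actual_output: float, ideal_output: float) -> float:
    """Actual output relative to output under ideal conditions."""
    return actual_output / ideal_output


def quality(good_units: float, total_units: float) -> float:
    """Share of good units among all units produced."""
    return good_units / total_units


def oee(a: float, p: float, q: float) -> float:
    """OEE is the product of Availability, Performance, and Quality."""
    for ratio in (a, p, q):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("each ratio must lie between 0 and 1")
    return a * p * q


# Hypothetical shift: 480 planned minutes, 420 minutes running,
# 380 units produced against an ideal 400, of which 361 were good.
score = oee(availability(420, 480), performance(380, 400), quality(361, 380))
print(f"OEE = {score:.1%}")
```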
Part II: Model the Loss in OEE
Once you estimate the current OEE metric levels, the next step is to model loss in %.
- Lay out the product mix and weekly/monthly/annual order volume for each of your plants.
- Figure out how many assets/machines/tooling each product/job to be produced has to undergo. This may also involve sequential/parallel run computations.
- Then, it should be fairly straightforward to compute drop in OEE corresponding to given productivity losses. A simple way is to create slabs of % decline and plug in numbers to compute the net economic impact.
Here is an example:
Part III: Model Opportunity Costs
The opportunity costs themselves can comprise one or more of the following:
- Orders delayed
- Orders canceled
- Shipment delays
- Penalties/damages payable owing to delayed or substandard deliveries
- Inventory holding and restocking costs
So, essentially you are trying to estimate the net opportunity cost of productivity loss owing to downtime.
A good metric to use is the economic burden of a 1% decline in OEE.
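One rough way to put a number on this is sketched below. It assumes that the value of output scales linearly with OEE and reads "1%" as one percentage point of OEE; both are simplifying assumptions. The output value, the current OEE, and the extra opportunity cost per point (penalties, holding costs, and so on) are hypothetical.

```python
def value_per_oee_point(annual_output_value: float, current_oee_pct: float) -> float:
    """Output value attributable to one percentage point of OEE (linear assumption)."""
    if not 0 < current_oee_pct <= 100:
        raise ValueError("current OEE must be a percentage between 0 and 100")
    return annual_output_value / current_oee_pct


def decline_slabs(
    annual_output_value: float,
    current_oee_pct: float,
    slabs=(1, 2, 5, 10),
    extra_cost_per_point: float = 0.0,
) -> dict:
    """Net economic impact for each slab of OEE decline, in percentage points."""
    per_point = value_per_oee_point(annual_output_value, current_oee_pct)
    per_point += extra_cost_per_point  # e.g. penalties, restocking (hypothetical)
    return {slab: per_point * slab for slab in slabs}


# Hypothetical plant: 1,580,000 of annual output value at 79% OEE,
# plus 2,500 of other opportunity costs per point lost.
impact = decline_slabs(1_580_000, 79, extra_cost_per_point=2_500)
for slab, cost in impact.items():
    print(f"{slab:>2} point decline -> {cost:,.0f}")
```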
Step 7: Operator Productivity
In Step 3, we estimated the costs and likelihoods of responding to and recovering from downtime/failure incidents.
If you are evaluating technology platforms such as augmented reality, you are thinking about the ways in which your operators, engineers, technicians, and support personnel will be aided to do their respective jobs more effectively.
IIoT and AR initiatives go hand in hand.
You need to draw real-time operational and usage information from your production assets, and you need to put it in the hands of your staff in a timely manner for decision making and execution.
What if you can derive real-time information about a sudden rise in temperature or drop in effluent levels? Are you able to provide the necessary contextual insights to your operations and support personnel in a form they can consume in real time? This is where AR meets IIoT initiatives.
Part I: Modeling the Impact of AR
The impact of augmented reality can be laid out in the following simple aspects.
- Greater Contextual Awareness to know scope and factors.
- Faster Diagnosis of the situation at hand.
- Access to relevant standard operating procedures to pick, choose, and apply.
- Near real-time collaboration with the rest of the team as necessary.
- Decreased time to execute the response.
Here is an example with possible % improvement assigned. To build a conservative estimate, you can begin with minimal improvement objectives such as below.
Part II
You will also need the effort estimate in terms of hours needed to repair/recover from the incident. We calculated that in Step 3.
Here is a worked-out summary in terms of hours saved post introduction of Augmented Reality.
Figure 19: Cost and time comparison
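A comparison of this kind can be sketched as below. The phases, baseline hours, improvement percentages, and hourly rate are hypothetical and deliberately conservative; the baseline hours would come from your Step 3 effort estimates.

```python
# Hypothetical baseline hours per recovery phase for one incident.
baseline_hours = {
    "diagnosis": 2.0,
    "procedure lookup": 1.0,
    "collaboration": 1.5,
    "execution": 4.0,
}

# Hypothetical, conservative improvement assigned to each phase with AR.
improvement = {
    "diagnosis": 0.10,
    "procedure lookup": 0.05,
    "collaboration": 0.05,
    "execution": 0.05,
}


def hours_saved(baseline: dict, gains: dict) -> float:
    """Hours saved per incident; phases without an assigned gain save nothing."""
    return sum(hours * gains.get(phase, 0.0) for phase, hours in baseline.items())


HOURLY_RATE = 40.0  # hypothetical blended billing rate

before = sum(baseline_hours.values())
saved = hours_saved(baseline_hours, improvement)
after = before - saved
print(f"Hours per incident: {before:.2f} before, {after:.3f} after AR")
print(f"Cost saved per incident: {saved * HOURLY_RATE:.2f}")
```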
Summarizing Results
We walked through the 7 steps necessary to build an ROI model. Many additional aspects such as Energy savings, Emission savings can be added as well. You may also want to add the existing IT/OT operational costs (hosting, licensing, and development/support costs, to name a few) to understand realistic returns.
Here is an example of a worked-out OEE waterfall chart.
What this model is expected to unravel is threefold.
- What is the present level of awareness, or rather the lack of it, of failure/downtime/repair incidents that impact day-to-day productivity?
- What is the resulting impact on the OEE metric if a given IIoT platform investment is made? Alternatively, what is the minimum improvement that such a platform must deliver for you to feel a positive impact? This can be twofold.
- What is the improvement in awareness about incidents that went undetected before?
- What is the reduction in failure incidents post platform implementation?
- What additional impact is your augmented reality initiative expected to have in terms of reduced time and cost to recover from incidents and a corresponding uptick in the OEE metric?