Why 99% Accuracy Isn't Good Enough: The Reality of ML Malware Detection

ML models need to be complemented with traditional detection techniques for malware detection to work in real enterprise environments, due to the "base rate problem."

By Udbhav Prasad · Aniesh Chawla · Jun. 16, 25 · Analysis

The threat of malware in enterprises is evolving each year. As enterprises expand their digital footprint through remote work and cloud adoption, their attack surface increases, making them more vulnerable to targeted malware campaigns. The FBI's 2023 Internet Crime Report showed that Business Email Compromise (BEC) scams alone caused over USD 2.9 billion in losses, investment fraud losses rose by 38% to USD 4.57 billion, and ransomware caused USD 59.6 million in losses.

Other reports paint similarly bleak pictures of the state of enterprise security today. The 2024 IBM Cost of a Data Breach Report shows that the average cost of a data breach jumped 10% to USD 4.88 million, and that organizations using AI in incident prevention saved USD 2.2 million on average. More than half of breached organizations report severe security staffing shortages, a 26.2% increase over the prior year. AI tools can help fill that gap.

Malware Detection Techniques

Digital forensic techniques are traditionally signature-based, using static hashes like SHA-256 and MD5, byte-pattern-matching techniques like YARA rules, or digests derived from static analysis like import hashes. Threat actors evade these techniques by deploying polymorphic malware that changes its code structure with each infection while maintaining functionality. Ransomware also encrypts its payload with different keys each time it spreads. In general, signatures require prior knowledge of the malware and are usually implemented by matching a file's signature against datasets of known-good or known-bad signatures, as in the sketch below.
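
For illustration, here is a minimal Python sketch of hash-based signature matching. The digest set is hypothetical; production scanners match against large, curated threat-intelligence feeds rather than an in-memory set. Note that a single flipped byte changes the hash entirely, which is exactly the weakness polymorphic malware exploits.

Python
 
import hashlib
from pathlib import Path

# Hypothetical set of known-bad SHA-256 digests (placeholder value).
KNOWN_BAD_SHA256 = {
    "0000000000000000000000000000000000000000000000000000000000000000",
}

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large binaries don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def is_known_malware(path: Path) -> bool:
    return sha256_of(path) in KNOWN_BAD_SHA256

print(is_known_malware(Path(__file__)))  # False: this script's digest isn't in the set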

Machine Learning Approaches

Modern security software employs machine learning approaches to detect unknown and unseen malware. CrowdStrike’s 2024 Global Threat Report highlights the increasing use of ML techniques to detect novel threats and uncover hidden patterns, and for analyzing the large-scale, evolving datasets associated with modern attacks, including cloud-conscious intrusions. Generative AI tools have lowered the entry barrier to the threat landscape for less sophisticated threat actors, and facilitate social engineering and information operations campaigns. This reinforces the need for AI-based counter techniques to combat these new attack vectors.

Recent neural network-based malware detection techniques using transfer learning with CNNs and LSTMs have yielded models with over 99% accuracy and over 99% precision. In this article, we'll show why this isn't sufficient in practice by discussing how the base rate (the proportion of files that are actually malware) is a critical factor in determining the rate of false positive detections. We'll also discuss how these challenges can be mitigated in practice.

Evaluating Machine Learning Models

Key Metrics and Terminology

The performance of a machine learning classifier is typically described with terms like confusion matrix, precision, and recall. Let's ground them in a simple example. Suppose an ML model predicts whether a particular file is malware. We run the model on 20 files, of which 6 actually contain malware and 14 are benign. The model flags 3 files as malware and 17 as not malware; of the 3 flagged files, only 2 actually contain malware.

The following table summarizes the example:

Total Files = 20        | Predicted Malware | Predicted Not Malware
Actual Malware = 6      | 2                 | 4
Actual Not Malware = 14 | 1                 | 13

Now we define the following terms:

  • Confusion Matrix: The matrix shown above is the confusion matrix. This is the main performance summary for any classification model.
  • Precision: As the name suggests, it tells how precise the algorithm's positive predictions are. It is the ratio of true positives to all predicted positives; in the example above it is 2/3, since 2 of the 3 files flagged as malware actually are malware. Precision is also called positive predictive value.
  • Recall/True Positive Rate: This tells us the sensitivity of the prediction, i.e., how many of the actual positives the algorithm found. It is the ratio of true positives to all actual positives. In our case it is 2/6 = 1/3.
  • False Positive Rate: When a file is not malware, how often does the model predict it to be malware? In our example, it is 1/14.
  • Accuracy: The fraction of all predictions that are correct. In the example above it is (2 + 13)/20 = 75%.
  • Misclassification Rate: The opposite of accuracy, i.e., misclassification rate = 1 - accuracy = 25%.
  • ROC (Receiver Operating Characteristic): The curve obtained by plotting TPR against FPR as the decision threshold varies.
  • AUC (Area Under the ROC Curve): The higher the AUC, the better the model. A random model has an AUC of 0.5; a model with an AUC below 0.5 is worse than a random classifier.
  • Prevalence: The percentage of positive samples in the data, which tells us how imbalanced the classes are. In the example above it is the percentage of files that contain malware, i.e., 6/20 = 30%.

A perfect model has a TPR of 1, an FPR of 0, and an AUC of 1. The sketch below computes the metrics above for our example.
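
To make these definitions concrete, here is a minimal Python sketch (our own, not from any particular library) that computes the metrics for the 20-file example above.

Python
 
# Confusion-matrix counts from the 20-file example above.
tp, fp, fn, tn = 2, 1, 4, 13
total = tp + fp + fn + tn

precision = tp / (tp + fp)      # 2/3  ~ 0.667 (positive predictive value)
recall = tp / (tp + fn)         # 2/6  ~ 0.333 (true positive rate)
fpr = fp / (fp + tn)            # 1/14 ~ 0.071 (false positive rate)
accuracy = (tp + tn) / total    # 15/20 = 0.75
prevalence = (tp + fn) / total  # 6/20  = 0.30

print(f"precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f} "
      f"accuracy={accuracy:.2f} prevalence={prevalence:.2f}")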

Early detection of malware is critical for companies. Mistakenly flagging a benign file as malware is less costly, since an analyst can manually mark it as clean, while a missed infection can be catastrophic. Thus, it is important to improve the recall metric of malware detection models.

The Base Rate Problem

Malware datasets have the same issue as datasets for rare diseases or credit card fraud: the base rate of the positive class is very low. If a dataset contains only 0.01% malware and the detection model marks every file as not malware, the model is 99.99% accurate yet misses all the malware. Hence, it is important to understand the data and the techniques for handling a low base rate. Accounting for the base rate, precision becomes:

Plain Text
 
                      TruePositiveRate x BaseRate
Precision = ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
            TruePositiveRate x BaseRate + FalsePositiveRate x (1 - BaseRate)


Deploying Malware Detectors in the Enterprise

Enterprise Data Volumes

A 2022 study by the Ponemon Institute found that the average U.S. enterprise manages approximately 135,000 endpoint devices. Notably, nearly half of these devices are either undetected by IT departments or running outdated operating systems, posing significant security risks. Each device contains thousands of files, most of which are completely benign. Suppose 0.01% of files are actually malware, and the prediction model has a 99.99% true positive rate and a 0.1% false positive rate. Plugging these into the formula above gives a precision of about 9%, which means 91% of the alarms are false positives! This highlights the importance of very low false positive rates in systems with a very low base rate.
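
The arithmetic is easy to check against the precision formula from the previous section; a minimal sketch (the function name is ours):

Python
 
def precision_with_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Base-rate-adjusted precision: P(actual malware | flagged as malware)."""
    return (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))

# 0.01% base rate, 99.99% TPR, 0.1% FPR -> roughly 9% precision.
p = precision_with_base_rate(tpr=0.9999, fpr=0.001, base_rate=0.0001)
print(f"precision = {p:.1%}")  # ~9.1%: about 91% of alarms are false positives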

Implementation Challenges

Given the challenge of low base rates, the training approach for malware detection models needs to be adapted. Oversampling minority-class samples or undersampling majority-class samples can help lower the rate of false positives, along with using cost-sensitive learning algorithms that weigh errors differently, as in the sketch below.
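
As one illustration, scikit-learn's class_weight parameter implements cost-sensitive learning by penalizing errors on the rare class more heavily. This is a minimal sketch; the toy data, feature count, and class ratio are hypothetical, chosen only to show the mechanism.

Python
 
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 10,000 benign samples, 10 malware samples,
# each described by 8 numeric features (e.g., static-analysis statistics).
X_benign = rng.normal(0.0, 1.0, size=(10_000, 8))
X_malware = rng.normal(1.5, 1.0, size=(10, 8))
X = np.vstack([X_benign, X_malware])
y = np.array([0] * 10_000 + [1] * 10)

# class_weight="balanced" reweights errors inversely to class frequency,
# so missing one of the 10 malware samples costs as much as ~1,000 benign errors.
clf = LogisticRegression(class_weight="balanced", max_iter=1_000)
clf.fit(X, y)

print(clf.predict_proba(X_malware[:3])[:, 1])  # malware-class probabilities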

Hybrid approaches that pair signature-based techniques with anomaly detection can complement the pure classification offered by ML models. Careful tuning of the prediction-time decision threshold based on operational requirements can also increase the effective accuracy of these models; one way to pick such a threshold is sketched below.
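
A minimal sketch of threshold tuning: given validation scores and labels, choose the lowest decision threshold whose false positive rate on benign files stays under an operational budget. The scores and the budget value here are hypothetical.

Python
 
import numpy as np

def threshold_for_fpr_budget(scores, labels, max_fpr=0.001):
    """Return the smallest threshold whose FPR on benign files <= max_fpr."""
    benign_scores = scores[labels == 0]
    # The (1 - max_fpr) quantile of benign scores caps the false positive rate.
    return float(np.quantile(benign_scores, 1.0 - max_fpr))

# Hypothetical validation scores: benign files cluster low, malware high.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.beta(2, 8, 5_000), rng.beta(8, 2, 50)])
labels = np.concatenate([np.zeros(5_000), np.ones(50)]).astype(int)

t = threshold_for_fpr_budget(scores, labels, max_fpr=0.001)
flagged = scores >= t
print(f"threshold={t:.3f}, alerts={flagged.sum()}")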

Future Directions

Several promising research directions may help address the fundamental base rate challenge in enterprise malware detection. One approach is developing hierarchical detection systems that use lightweight models for initial screening, followed by more sophisticated analysis only for suspicious files, as in the sketch below.
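
A minimal sketch of such a two-stage pipeline; both stage functions are hypothetical stand-ins, one for a cheap screening model and one for an expensive detailed analyzer such as a sandbox.

Python
 
# Hypothetical two-stage (hierarchical) detection pipeline.
def cheap_screen(file_bytes: bytes) -> float:
    """Fast, low-cost suspicion score; a stand-in for a lightweight model."""
    return min(1.0, file_bytes.count(b"\x90") / 100)  # toy heuristic

def deep_analysis(file_bytes: bytes) -> bool:
    """Expensive verdict; a stand-in for sandboxing or a heavy model."""
    return b"malicious-marker" in file_bytes  # toy stand-in

def detect(file_bytes: bytes, screen_threshold: float = 0.5) -> bool:
    # Most files exit cheaply at stage one; only suspicious ones pay
    # for the expensive second stage.
    if cheap_screen(file_bytes) < screen_threshold:
        return False
    return deep_analysis(file_bytes)

print(detect(b"\x00" * 1_000))                       # benign-looking: screened out cheaply
print(detect(b"\x90" * 200 + b"malicious-marker"))   # escalated and flagged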

A related approach is to use active learning techniques, which select the most informative samples for human analysis, as in the uncertainty-sampling sketch below. These approaches could help security teams validate potential malware and feed that knowledge back into detection models. This could be particularly valuable for enterprises dealing with large volumes of files but limited security staff.
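
A common active-learning heuristic is uncertainty sampling: route the files whose predicted probabilities sit closest to the decision boundary to human analysts. A minimal sketch, with hypothetical model outputs:

Python
 
import numpy as np

def most_uncertain(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the samples closest to the 0.5 decision boundary."""
    return np.argsort(np.abs(probs - 0.5))[:budget]

probs = np.array([0.02, 0.48, 0.97, 0.51, 0.10, 0.55])  # hypothetical model outputs
print(most_uncertain(probs, budget=2))  # [3 1]: queue these two for analysts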

The continued development of these techniques, along with careful attention to operational requirements and performance metrics discussed earlier, will be crucial for building malware detection systems that can perform effectively despite the inherent base rate challenges in enterprise environments.


Opinions expressed by DZone contributors are their own.
