XAI for Fraud Detection Models

We explore why explainability matters in fraud detection models and how it helps us understand the different patterns of fraud in a system.

By Kalpan Dharamshi · Feb. 28, 25 · Analysis

One might ask: why should I worry about what happens behind the scenes as long as my model delivers high-precision results?

In this article, we look closely at model reasoning and try to answer that question. More importantly, we will see how it helps us build insight into evolving fraud patterns.

eXplainable AI (XAI) has been around for quite a while, but it has not created much of a buzz in the industry. Now, with the arrival of the DeepSeek-R1 reasoning model, there is growing interest in models that not only make highly accurate predictions but also provide some reasoning about how those predictions were made.

XAI research has shown that a model that accurately identifies fraudulent transactions is not necessarily accurate in its reasoning. XAI gives system users the insight and confidence that the model is not only working as expected but also reaching its decisions for the right reasons. In the following sections, we use simple XAI techniques and unsupervised learning to illustrate the approach.

Methodology

We use a publicly available fraud dataset with anonymized features and build a simple classifier that detects fraud with decent accuracy. The model is then used to calculate the feature importance that drives its fraud decisions.

Next, we use SHapley Additive exPlanations (SHAP) to determine the importance of the features that drive our fraud vs. non-fraud decisions. Amazon SageMaker Clarify uses the same concept for its explanations. Here is a cool paper for readers who would like to understand more about it.

Finally, once we have the SHAP values for our features, we use an unsupervised learning technique to categorize the different types of fraudulent transactions in our dataset. The resulting clusters represent the fraud patterns in the dataset, and businesses can use them to monitor and understand these patterns easily.
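To make the idea behind SHAP concrete, here is a minimal sketch that computes exact Shapley values for a tiny, made-up scoring function by averaging each feature's marginal contribution over every feature ordering. The function `toy_score`, its weights, and the interaction bonus are all hypothetical and are not part of the fraud model; the shap library uses much faster, model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(value_fn, n_features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        present = set()
        prev = value_fn(frozenset(present))
        for j in order:
            present.add(j)
            cur = value_fn(frozenset(present))
            totals[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return [t / len(orderings) for t in totals]


# Hypothetical toy "fraud score": each present feature adds a weight, and
# features 0 and 1 together add an interaction bonus.
WEIGHTS = [0.10, 0.20, 0.05]


def toy_score(present):
    score = sum(WEIGHTS[j] for j in present)
    if 0 in present and 1 in present:
        score += 0.30
    return score


phi = shapley_values(toy_score, 3)
base = toy_score(frozenset())
full = toy_score(frozenset({0, 1, 2}))

print("per-feature contributions:", [round(p, 4) for p in phi])
print("sum of contributions:", round(sum(phi), 4))
print("score minus baseline:", round(full - base, 4))
```

The interaction bonus is split equally between the two features that create it, and the contributions add up to the difference between the full score and the baseline. That additive property is what lets us read SHAP values as per-feature weights for a single decision.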

Experiment and Results

We start by installing libraries like scikit-learn, shap, and pandas.

We check for missing values in our dataset and try to understand the data distribution. Fraud datasets are typically imbalanced, which means that normal transactions far outnumber fraudulent ones. In our dataset, about 0.2% of transactions are identified as fraud, and the rest are non-fraud. In this example, 0 indicates a normal transaction, and 1 indicates a fraudulent transaction.

Distribution of a target variable
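A quick sketch shows why this imbalance matters when judging a model. The labels below are synthetic and only mirror the roughly 0.2% fraud rate described above; they are not the real dataset. A "model" that always predicts the majority class looks excellent on accuracy while catching no fraud at all.

```python
from collections import Counter

# Synthetic labels mirroring a roughly 0.2% fraud rate (not the real data)
labels = [0] * 998 + [1] * 2

counts = Counter(labels)
fraud_rate = counts[1] / len(labels)

# A baseline that always predicts "normal" (class 0)
predictions = [0] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
caught = sum(1 for p, t in zip(predictions, labels) if p == 1 and t == 1)
fraud_recall = caught / counts[1]

print(f"fraud rate: {fraud_rate:.1%}")
print(f"accuracy:   {accuracy:.1%}")
print(f"recall:     {fraud_recall:.1%}")
```

This is why the evaluation that follows looks at precision and recall on the fraud class instead of overall accuracy.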

Below, we have a simple random forest classifier that predicts fraudulent transactions with 93% precision and 75% recall on the fraud class. That performance is reasonable for us to start our explanation process and determine the feature weights that are primarily used to identify fraud.

Python
 
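from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# df is the fraud dataset loaded as a pandas DataFrame; 'Class' is the label
y = df['Class']
# Use every column except the label and 'Time' as model input
X = df.drop(['Class', 'Time'], axis=1)
features = X.columns

model = RandomForestClassifier(n_estimators=5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=287)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))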
Plain Text
 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85297
           1       0.93      0.75      0.83       146

    accuracy                           1.00     85443
   macro avg       0.97      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443
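As a worked example of how the fraud-class numbers in the report relate to each other, the sketch below derives precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical: they were chosen only because they are consistent with the rounded report above (146 fraud cases in the test set). The actual confusion matrix is not shown in this article.

```python
def fraud_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive (fraud) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts consistent with the rounded report, not the real matrix:
# tp = frauds caught, fp = normal transactions flagged, fn = frauds missed
tp, fp, fn = 110, 8, 36

precision, recall, f1 = fraud_metrics(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, a recall of 0.75 means that roughly one in four fraudulent transactions in the test set is missed, even though almost every flagged transaction is truly fraudulent.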


Next, we extract SHAP values for all the fraudulent transactions in the dataset. We will apply an unsupervised clustering algorithm to the SHAP values to generalize the different underlying reasons for fraud. Please note that computing the SHAP values is time-consuming.

Python
 
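import shap

explainer = shap.TreeExplainer(model)

# Explain only the fraudulent transactions (Class == 1)
shap_values = explainer(X[y == 1])
# Keep class-1 contributions; assumes values shaped (samples, features, classes)
fraud_shap_values = shap_values.values[:, :, 1]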


We use a dimensionality reduction technique, t-SNE, to visualize the higher-dimensional SHAP values in two dimensions. We pass the results to a clustering algorithm, k-means, to identify fraud patterns in our dataset. The silhouette score and the elbow technique are used to identify the optimal value of k.

Python
 
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

X = fraud_shap_values

# Project the SHAP values onto two dimensions
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
print(tsne.kl_divergence_)

common_params = {
    "n_init": "auto",
    "random_state": 42,
}

sil = []
kmax = 10

# The silhouette score is undefined for a single cluster, so start at k = 2
for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k, **common_params).fit(X_tsne)
    labels = kmeans.labels_
    sil.append(silhouette_score(X_tsne, labels, metric='euclidean'))

plt.plot(range(2, kmax + 1), sil)
plt.xlabel("K")
plt.ylabel("Silhouette Score")
plt.title("Silhouette score by number of clusters")
plt.show()
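To show what the silhouette score measures, here is a small pure-Python sketch of the same quantity on made-up 2D points. The helper `mean_silhouette`, the points, and the candidate labelings are all illustrative assumptions; in the experiment itself the score comes from scikit-learn.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well-separated groups of made-up points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Candidate labelings for k = 2 and k = 3 (the latter splits one group)
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}

scores = {k: mean_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

A score close to 1 means points sit much closer to their own cluster than to the nearest other cluster, so the k with the highest score is a natural candidate for the number of fraud patterns.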

Python
 
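# After the loop above, k equals kmax. Here we pick the k with the highest
# silhouette score (an assumption; the chosen value is not stated in the text)
k = sil.index(max(sil)) + 2
y_pred = KMeans(n_clusters=k, **common_params).fit_predict(X_tsne)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_pred)
plt.title("Optimal Number of Clusters")
plt.show()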

Optimal number of clusters

Finally, in the last step of our process, we identify the features that carry the most weight for the frauds in our dataset. We plot a bar graph of the five most heavily weighted features for each fraud category.

Python
 
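import pandas as pd

# Assumption: explanation_df (not defined in the original) holds one row of SHAP
# values per fraudulent transaction plus its k-means cluster label in 'Class'
explanation_df = pd.DataFrame(fraud_shap_values, columns=features).assign(Class=y_pred)

for i in range(k):
    # Keep cluster i only; drop the label and 'Amount' before plotting
    cluster_data = explanation_df[explanation_df['Class'] == i].drop(['Class', 'Amount'], axis=1)
    shap.summary_plot(cluster_data.to_numpy(), cluster_data, plot_type='bar', feature_names=list(cluster_data.columns), max_display=5)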


The SHAP summary plot highlights various attributes contributing to different types of fraud in our dataset. 

Mean graph (1/2) 

Mean graph (2/2)
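The bar plots rank features by their mean absolute SHAP value within a cluster. The same ranking can be sketched without any plotting library. Everything in this example is illustrative: the feature names `V1` to `V3`, the SHAP values, and the cluster labels are made up and are not results from the dataset.

```python
def top_features_by_cluster(shap_rows, cluster_labels, feature_names, top_n=5):
    """Rank features by mean absolute SHAP value within each cluster."""
    grouped = {}
    for row, label in zip(shap_rows, cluster_labels):
        grouped.setdefault(label, []).append(row)
    ranking = {}
    for label, rows in grouped.items():
        mean_abs = [
            sum(abs(r[j]) for r in rows) / len(rows)
            for j in range(len(feature_names))
        ]
        order = sorted(range(len(feature_names)), key=lambda j: mean_abs[j], reverse=True)
        ranking[label] = [(feature_names[j], mean_abs[j]) for j in order[:top_n]]
    return ranking


# Made-up SHAP values for four transactions and three hypothetical features
names = ["V1", "V2", "V3"]
shap_rows = [
    [0.40, -0.05, 0.01],
    [0.30, 0.02, -0.03],
    [0.01, -0.50, 0.10],
    [-0.02, 0.45, 0.20],
]
clusters = [0, 0, 1, 1]

ranking = top_features_by_cluster(shap_rows, clusters, names, top_n=2)
for label, top in ranking.items():
    print(label, [(name, round(value, 3)) for name, value in top])
```

Because the absolute value is taken before averaging, a feature that pushes some transactions strongly toward fraud and others strongly away from it still ranks highly.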

Conclusion

Above, we have shown two types of fraudulent transactions in our dataset. If we look closely, most of the top five factors contributing to the two types of fraud are different. Business users can easily interpret the graphs and understand which combination of features drives the model's decision for each type of fraud.

The clustering of SHAP values helps us to identify various patterns of fraud in the system. Without reasoning capabilities, it would be difficult for end users to understand any new or evolving patterns of fraud or why a certain transaction is fraudulent.

Hope you guys liked the article and that it helped you learn something new!

