Finding Needles in a Haystack With Graph Databases and Machine Learning
Learn how we used machine learning to train an algorithm that can identify phone callers as fraudsters, pranksters, or salespeople.
You know a technology has reached a tipping point when your kids ask about it. This happened recently when my eighth-grade daughter asked, "What is machine learning and why is it so important?"
Answering her question, I explained how machine learning is part of AI, where we teach machines to reason and learn like human beings. I used the example of fraud detection. In many ways, catching fraud is like finding needles in a haystack — you must sort and make sense of massive amounts of data in order to find your "needles" or, in this case, your fraudsters.
Consider a phone company that has billions of calls occurring in its network on a weekly basis. How can we identify signs of fraudulent activity from a mountain — or haystack — of calls? This is where machine learning comes in.
Of course, my daughter was ready with a solution to the problem: "Why not use a powerful magnet to draw out the needles from the haystack?"
She's right. When it comes to training a machine to spot fraudsters, we need to provide it with a more powerful magnet for drawing them out. Our magnet, in this case, is the ability to identify behaviors and patterns of likely fraudsters. Using this, a machine is more adept at recognizing suspicious phone call patterns and is able to separate them from the billions of calls made by regular people who comprise our haystack of data.
Current Machine Training Approaches Are Missing the Mark
Let's use this example to consider current approaches for identifying fraudsters based on machine learning. Supervised machine learning algorithms need training data: in this case, phone calls confirmed as coming from fraudsters. The current approach has two problems, one with the quantity and one with the quality of the training data.
Confirmed fraudulent activity in phone networks currently constitutes less than 0.01% of total call volume, so the quantity of training data with confirmed fraud activity is tiny. A small quantity of training data, in turn, results in poor accuracy for the machine learning algorithms.
Features or attributes for finding a fraudster are based on simple analyses. In this case, they include calling history of particular phones to other phones that may be in or out of the network, the age of a pre-paid SIM card, the percentage of one-directional calls made (cases where the call recipient did not return a phone call), and the percentage of rejected calls. These simplistic features tend to result in a lot of false positives. It's no wonder when you consider how, in addition to a fraudster, these features may also fit the behavior of a salesperson or a prankster!
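To make these traditional features concrete, here is a minimal Python sketch that computes two of them, the percentage of one-directional calls and the percentage of rejected calls, for a single phone. The record layout `(caller, callee, was_rejected)`, the sample data, and the function name `traditional_features` are assumptions made for illustration; they are not taken from the operator's actual system.

```python
# Hypothetical call records: (caller, callee, was_rejected).
calls = [
    ("A", "B", False),
    ("B", "A", False),  # B returns A's call
    ("A", "C", True),   # C rejects A's call and never calls back
    ("A", "D", False),  # D answers but never calls back
]

def traditional_features(calls, phone):
    """Return (pct_one_directional, pct_rejected) for calls placed by `phone`."""
    outgoing = [(callee, rejected)
                for caller, callee, rejected in calls if caller == phone]
    if not outgoing:
        return 0.0, 0.0
    # Phones that have called `phone` at least once.
    callers_of_phone = {caller for caller, callee, _ in calls if callee == phone}
    # A call is one-directional when its recipient never called back.
    one_directional = sum(1 for callee, _ in outgoing
                          if callee not in callers_of_phone)
    rejected = sum(1 for _, was_rejected in outgoing if was_rejected)
    total = len(outgoing)
    return 100.0 * one_directional / total, 100.0 * rejected / total

pct_one_directional, pct_rejected = traditional_features(calls, "A")
```

In this toy data, two of A's three calls are never returned and one is rejected. A busy salesperson could easily produce the same numbers, which is why features like these cause false positives.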
Training the Machine for Fraud Detection by Building a Better Magnet With Graph Features
A large mobile operator uses TigerGraph, the next-generation graph database with real-time deep link analytics, to address the deficiencies of current approaches for training machine learning algorithms. The solution analyzes over ten billion calls for 460 million mobile phones and generates 118 features for each mobile phone. These are based on deeper analysis of calling history and go beyond immediate recipients for calls.
The diagram below illustrates how the graph database identifies a phone as a "good" or a "bad" phone. A bad phone requires further investigation to determine whether it belongs to a fraudster.
Figure 1: Detecting phone-based fraud by analyzing network or graph relationship features.
"Good" Phones vs. "Bad" Phones
A customer with a good phone calls other subscribers, and the majority of their calls are returned. This helps to indicate familiarity or trusted relationships between the users. A good phone also regularly calls a set of other phones (say, every week or month), and this group of phones is fairly stable over a period of time ("stable group").
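One simple way to picture a stable group is as the set of phones that a subscriber calls in every period of an observation window. The Python sketch below uses that definition; the weekly bucketing, the `(caller, callee)` record layout, and the function name `stable_group` are assumptions for illustration, and a production system may well use a looser rule (for example, "called in most weeks").

```python
# Hypothetical call logs, one list of (caller, callee) pairs per week.
calls_by_week = [
    [("A", "B"), ("A", "C"), ("X", "B")],
    [("A", "B"), ("A", "D"), ("X", "E")],
    [("A", "B"), ("A", "C"), ("X", "F")],
]

def stable_group(calls_by_week, phone):
    """Return the phones that `phone` called in every week of the window."""
    weekly_sets = [
        {callee for caller, callee in week if caller == phone}
        for week in calls_by_week
    ]
    if not weekly_sets:
        return set()
    # Only phones present in every weekly set survive the intersection.
    return set.intersection(*weekly_sets)

group_a = stable_group(calls_by_week, "A")  # A calls B every week
group_x = stable_group(calls_by_week, "X")  # X never calls the same phone twice
```

Here phone A has a non-empty stable group, while phone X, which calls a different number each week, has an empty one.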
Another feature indicating good phone behavior is when a phone calls another that has been in the network for many months or years and receives calls back. We also see a high number of calls between the good phone, the long-term phone contact, and other phones within a network calling both these numbers frequently. This indicates many in-group connections for our good phone.
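As a rough illustration, in-group connections can be approximated by counting how many pairs of a phone's contacts are also in contact with each other. This is an assumed, simplified definition chosen for the sketch; the article does not spell out the exact formula, and the names `in_group_connections` and `edges` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical undirected "has been in contact" edges between phones.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D"), ("E", "F")]

def in_group_connections(edges, phone):
    """Count pairs of `phone`'s contacts that are also in contact with each other."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    contacts = sorted(neighbors[phone])
    # Check every pair of contacts for a direct connection between them.
    return sum(1 for u, v in combinations(contacts, 2) if v in neighbors[u])

count_a = in_group_connections(edges, "A")
count_e = in_group_connections(edges, "E")
```

Phone A's contacts (B, C, and D) talk to one another, so A scores above zero; phone E has a single contact and scores zero.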
Lastly, a good phone is often involved in a three-step friend connection: our good phone calls another phone (phone 2), phone 2 calls a third phone (phone 3), and the good phone is also in direct contact with phone 3. This closed loop indicates a circle of trust and interconnectedness.
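The three-step friend connection described above amounts to finding a closed triangle in the call graph. Here is a minimal Python sketch of that check, assuming directed `(caller, callee)` records and treating "in direct contact" as a call in either direction; the function name `three_step_friends` and the sample data are hypothetical.

```python
# Hypothetical directed call records: (caller, callee).
calls = [("A", "B"), ("B", "C"), ("A", "C"), ("F", "B")]

def three_step_friends(calls, phone):
    """Return (phone2, phone3) pairs where phone -> phone2 -> phone3
    and `phone` is also in direct contact with phone3."""
    outgoing = {}  # caller -> set of phones it called
    contact = {}   # phone -> set of phones it called or was called by
    for caller, callee in calls:
        outgoing.setdefault(caller, set()).add(callee)
        contact.setdefault(caller, set()).add(callee)
        contact.setdefault(callee, set()).add(caller)
    direct = contact.get(phone, set())
    result = set()
    for phone2 in outgoing.get(phone, set()):
        for phone3 in outgoing.get(phone2, set()):
            # The loop closes only if phone3 is a direct contact of `phone`.
            if phone3 != phone and phone3 in direct:
                result.add((phone2, phone3))
    return result

friends_a = three_step_friends(calls, "A")  # A -> B -> C, and A calls C
friends_f = three_step_friends(calls, "F")  # F -> B -> C, but F never talks to C
```

Phone A closes the loop through B and C, while phone F reaches C only through B and so has no three-step friend connection.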
By analyzing such call patterns between phones, TigerGraph can easily identify bad phones, which are phones likely involved with scams. These phones have short calls with multiple good phones but receive no calls back. They also do not have a stable group of phones called on a regular basis (representing an "empty stable group"). When a bad phone calls a long-term customer in the network, the call is not returned. The bad phone also receives many rejected calls and lacks three-step friend relationships.
As a graph database platform, TigerGraph computes the 118 new features, which correlate highly with good and bad phone behavior, for each of the 460 million mobile phones in our use case. In turn, it generates roughly 54 billion new feature values (118 × 460 million) as training data to feed machine learning algorithms.
This has led to a dramatic improvement in the accuracy of machine learning for fraud detection, resulting in fewer false positives (non-fraudulent phones marked as potential fraudster phones) as well as fewer false negatives (phones involved in fraud that weren't marked as such).
Improving Machine Learning Accuracy With Graph-Based Features
To see how graph-based features improve accuracy for machine learning, let's consider an example (Figure 2) using profiles for four mobile users: Tim, Sarah, Fred, and John.
Figure 2: Improving accuracy for machine learning with graph features.
Traditional calling history features (such as the age of the SIM card used, the percentage of one-directional calls, and the percentage of total calls rejected by their recipients) result in flagging three out of four of our customers (Tim, Fred, and John) as likely or potential fraudsters, as they look very similar based on these features. Graph-based features with analysis of deep link or multi-hop relationships across phones and subscribers help machine learning classify Tim as a prankster, John as a salesperson, and Fred as a likely fraudster. Let's consider how.
In the case of Tim, he has a stable group, which means he is unlikely to be a sales guy since salespeople call different numbers each week. Tim doesn't have many in-group connections, which means he is likely calling strangers. He also doesn't have any three-step friend connections, which suggests that the strangers he is calling aren't related to one another. It is very likely that Tim is a prankster based on these features.
Let's consider John, who doesn't have a stable group, which means he is calling new potential leads every day. He calls people with many in-group connections. As John presents his product or service, some of the call recipients are most likely introducing him to other contacts if they think the product or service would be interesting or relevant to them. John is also connected via three-step friend relations, indicating that he is closing the loop as an effective sales guy, navigating the friends or colleagues of his first contact within a group, as he reaches the final buyer for his product or service. The combination of these features classifies John as a salesperson.
In the case of Fred, he doesn't have a stable group, nor does he interact with a group that has many in-group connections. Plus, he does not have three-step friend relations among the people he calls. This makes him a very likely candidate for investigation as a phone scam artist or fraudster.
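The reasoning for Tim, John, and Fred can be summarized as a small lookup over three yes/no graph features. The Python sketch below is only a hand-written stand-in for what a trained model would learn from the full feature set; the class `GraphFeatures`, the function `describe`, and the "regular subscriber" row (based on the earlier description of good phones) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GraphFeatures:
    has_stable_group: bool
    many_in_group_connections: bool
    has_three_step_friends: bool

# Hand-written rules mirroring the three profiles discussed above.
# A real system would learn such patterns from training data instead.
LABELS = {
    (True, True, True): "regular subscriber",
    (True, False, False): "likely prankster",     # Tim's profile
    (False, True, True): "likely salesperson",    # John's profile
    (False, False, False): "likely fraudster",    # Fred's profile
}

def describe(features):
    """Map a phone's graph features to a rough label, or 'unclear'."""
    key = (features.has_stable_group,
           features.many_in_group_connections,
           features.has_three_step_friends)
    return LABELS.get(key, "unclear")

tim = GraphFeatures(True, False, False)
john = GraphFeatures(False, True, True)
fred = GraphFeatures(False, False, False)
labels = [describe(p) for p in (tim, john, fred)]
```

Feature combinations not covered by the article fall through to "unclear", which is where a trained classifier, rather than fixed rules, earns its keep.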
Going back to our original analogy, we are able to find our needle in the haystack (in our case, Fred, the potential fraudster) by using graph analysis to improve the accuracy of machine learning. This is achieved by using the graph database framework to model data in a way that allows more features to be identified and considered when analyzing our haystack of data. The machine, in turn, is trained with more data, and more accurate data, making it smarter and more successful in recognizing potential scam artists and fraudsters.
Training Machine Learning With Graph Features for Other Use Cases
Graph features generated in real time by TigerGraph are being used for a host of use cases beyond identifying phone-based scams. These include training machine learning to detect various other types of anomalous behavior, including credit card-related fraud, which affects all merchants selling products or services via eCommerce, and money laundering violations, which span the entire financial services ecosystem, including banks, payment providers, and newer cryptocurrencies such as Bitcoin and Ripple.
eCommerce companies are also using graph-based features to create product recommendations based on a customer's buying behavior, other customers in their extended network and also those who have similar buying preferences. These new features are fed as training data to the machine learning algorithms to improve accuracy for future recommendations.
Published at DZone with permission of Gaurav Deshpande, DZone MVB.
Opinions expressed by DZone contributors are their own.