Detecting Fraud With Arcade Analytics and Big Data


We investigate how organizations use Arcade Analytics and big data processes to detect fraudulent behavior.


What Is Fraud Detection?

By fraud detection, we mean the process of identifying actual or expected fraud within an organization.

Telephone companies, insurance companies, banks, and e-commerce platforms are examples of industries that use massive data analysis techniques to prevent fraud.

In this scenario, every organization faces a twofold challenge: being good at detecting known types of traditional fraud by searching for well-known patterns, and being good at uncovering new patterns and new kinds of fraud.

We usually can categorize fraud detection according to the following aspects:

  • Proactive and Reactive
  • Manual and Automated

Why Fraud Detection Is Important

According to an economic crime survey performed by PwC in 2018, fraud is a billion-dollar business and it is increasing every year: half (49 percent) of the 7,200 companies they surveyed had experienced fraud of some kind.

Most of the fraud involves cell phones, tax return claims, insurance claims, credit cards, supply chains, retail networks, and purchase dependencies.

Investing in fraud detection can have the following benefits:

  • Promptly react to fraudulent activities.
  • Reduce exposure to fraudulent activities.
  • Reduce the economic damage caused by fraud.
  • Recognize the vulnerable accounts more exposed to fraud.
  • Increase trust and confidence of the shareholders of the organization.

A good fraudster can work around basic fraud detection techniques, so developing new detection strategies is very important for any organization. Fraud detection must be considered a complex and ever-evolving process.

Phases and Techniques

The fraud detection process starts with a high-level data overview, with the goal of discovering some anomalies and suspicious behaviors inside the dataset, e.g. we could be interested in looking for weird credit card purchases. Once we have found the anomalies we have to recognize their origin, because each of them could be due to fraud, but also to errors in the dataset or just missing data.

This fundamental step is called data validation, and it consists of detecting errors, correcting incorrect data, and filling in missing data.
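The three validation steps just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the field names, and the correction rules (treating a negative amount as a sign error, filling a missing amount with the median of the valid ones) are hypothetical and are not taken from Arcade Analytics or from any real dataset.

```python
from statistics import median

# Hypothetical raw records; None marks a missing amount.
raw_transactions = [
    {"id": 1, "card": "A", "amount": 25.0},
    {"id": 2, "card": "A", "amount": -40.0},  # error: negative amount
    {"id": 3, "card": "B", "amount": None},   # missing value
    {"id": 4, "card": "B", "amount": 60.0},
    {"id": 5, "card": "B", "amount": 35.0},
]


def validate(transactions):
    """Return (cleaned records, list of detected issues)."""
    # Step 1: error detection.
    issues = []
    for t in transactions:
        if t["amount"] is None:
            issues.append((t["id"], "missing amount"))
        elif t["amount"] < 0:
            issues.append((t["id"], "negative amount"))

    # The fill value is the median of the amounts that passed the checks.
    valid = [t["amount"] for t in transactions
             if t["amount"] is not None and t["amount"] >= 0]
    fill_value = median(valid)

    # Steps 2 and 3: correct wrong values and fill in missing ones.
    cleaned = []
    for t in transactions:
        amount = t["amount"]
        if amount is None:
            amount = fill_value   # assumed fill rule: median
        elif amount < 0:
            amount = abs(amount)  # assumed correction rule: sign error
        cleaned.append({**t, "amount": amount})
    return cleaned, issues


cleaned, issues = validate(raw_transactions)
print(issues)
print([t["amount"] for t in cleaned])
```

In a real pipeline the correction rules would come from domain experts, and every detected issue would be logged so that genuine anomalies are not silently "repaired" away.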

Once the data is cleaned up, the real phase of data analysis can start; after the analysis is completed all the results must be validated, reported, and graphically presented.

To recap, the main steps in the detection process are the following:

  • Data collection.
  • Data preparation.
  • Data analysis.
  • Report and presentation of results.

Arcade Analytics fits very well here, as it is a tool that allows us to create captivating and effective reports and to share the results of a specific analysis very easily, by dividing the data between different widgets in complex dashboards.

The main widget is the Graph Widget. It allows users to visually see the connections within their datasets and find meaningful relationships. Moreover, all the widgets present in the same dashboard can be connected in order to make them interact with each other. In this way, we will be able to see bidirectional interactions between the graphs, data tables, and the traditional charts widgets in the resulting dashboard.

The chart distributions will be computed according to the partial datasets of the corresponding primary widgets, making the final report dynamic and interactive.

But that is not all: Arcade can be useful for several techniques in the data analysis process. Let’s see how!

Data analysis often relies on automated processes that exploit statistical methods and artificial intelligence techniques, which are commonly classified as supervised or unsupervised.

Among the statistical methods, we can use:

  • Data processing.
  • Statistical parameters calculation, relevant for the specific domain.
  • Models and probability distributions.
  • Time series analysis.
  • Clustering and classification of entities, in order to find associations and patterns among data.

Arcade offers several tools to perform single and multi-series analyses. They exploit an efficient full-text search engine and inverted indices, which ensure good performance when computing statistical parameters and distributions on the whole data source.

Credit Card Distribution

Global Transactions/Orders Distribution

This method is well suited to identifying statistical classifications and inferring rules. These rules can then be used to define rule-based classifiers: supervised learning algorithms that use If (fulfills certain conditions) and Then (appropriate category) rules.
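To make the If/Then idea concrete, here is a minimal sketch of a rule-based classifier in Python. The rules, thresholds, and field names (`amount`, `country`, `home_country`) are hypothetical and chosen only for illustration; in practice they would be inferred from the distributions observed in the data.

```python
# Each rule pairs an "If" condition with a "Then" category.
# Thresholds and field names are illustrative assumptions.
RULES = [
    (lambda t: t["amount"] > 5000,
     "suspicious: large amount"),
    (lambda t: t["country"] != t["home_country"] and t["amount"] > 1000,
     "suspicious: large foreign purchase"),
]


def classify(transaction):
    """Return the category of the first rule whose condition holds."""
    for condition, category in RULES:
        if condition(transaction):
            return category
    return "normal"


print(classify({"amount": 6000, "country": "IT", "home_country": "IT"}))
print(classify({"amount": 1500, "country": "FR", "home_country": "IT"}))
print(classify({"amount": 40, "country": "IT", "home_country": "IT"}))
```

Rules are evaluated in order, so the most specific or most severe conditions should come first.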

Moreover, Arcade offers good support for time series analysis: by using the timeline feature you can see your data in the form of a graph and how it changes over time.


In this way, we can analyze when the relationships between specific items or entities appeared by exploiting a time filtering window to narrow the temporal analysis to a specific and customizable range.


Then you can interact with this analysis by zooming in and out: by changing the granularity you can perform a simple top-down temporal analysis, starting from a wider view that shows at a glance how the events are distributed over time.

Obviously, in each perspective, you can move back and forth through time.
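The same two operations, narrowing the analysis to a time window and changing the granularity, can be sketched outside of Arcade with plain Python. The event timestamps and the helper names `in_window` and `bucket` below are hypothetical; this is not how Arcade's timeline is implemented, only an illustration of the idea.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps at which relationships appeared.
events = [
    datetime(2019, 1, 5, 10, 30),
    datetime(2019, 1, 5, 18, 0),
    datetime(2019, 1, 20, 9, 15),
    datetime(2019, 2, 3, 12, 0),
    datetime(2019, 3, 1, 8, 0),
]

# One strftime pattern per supported granularity.
FORMATS = {"month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%d %H:00"}


def in_window(events, start, end):
    """Keep only the events inside the half-open window [start, end)."""
    return [e for e in events if start <= e < end]


def bucket(events, grain):
    """Count events per time bucket at the requested granularity."""
    return Counter(e.strftime(FORMATS[grain]) for e in events)


# Wide view first, then zoom into January at daily granularity.
by_month = bucket(events, "month")
january = in_window(events, datetime(2019, 1, 1), datetime(2019, 2, 1))
by_day = bucket(january, "day")
print(dict(by_month))
print(dict(by_day))
```

Starting from the monthly counts and then re-bucketing a single month by day mirrors the top-down analysis: the coarse view shows where activity concentrates, the fine view shows when exactly it happened.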


Besides the statistical methods, it can be helpful to add automated processes such as:

  • Data mining.
  • Pattern recognition.
  • Machine learning and prediction to implement proactive rules.

These unsupervised methods do not require samples of fraudulent transactions, so they are useful in all scenarios where there is no prior knowledge of the classes of transactions, or when we want to extend these categories in order to recognize previously undiscovered fraud.
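As a small illustration of an unsupervised technique, the sketch below flags unusual purchase amounts without any labeled fraud examples. It uses a modified z-score based on the median absolute deviation (MAD), a common robust outlier test; this is a technique chosen for the example, not a method attributed to Arcade, and the amounts and the 3.5 threshold are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is
    the median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median: nothing can be ranked
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical purchase amounts for one credit card.
amounts = [20, 22, 19, 25, 21, 23, 950]
print(mad_outliers(amounts))
```

Because the median and MAD are barely affected by a few extreme values, the single very large purchase stands out clearly, whereas a mean-based score could be distorted by the outlier itself.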

Importance of Human Interaction

In these scenarios we often encounter the concept of Fraud Analytics, commonly understood as a combination of automated analytics technologies and techniques with human interaction. In fact, we cannot do without the involvement of domain experts, for two main reasons:

  • A high number of false positives: not all transactions detected as fraudulent are actually fraud. Generally, detection systems based on the best algorithms result in too many false positives, even though they are able to identify a high percentage of the actual fraudulent transactions (up to about 99 percent). Thus, all the results must be validated in order to exclude the false positives from the first result.
  • High computing time due to the complexity of the algorithms, especially in prediction scenarios: when an algorithm's execution time grows exponentially with the input size, monolithic execution is not a good approach, because it could take far too long on big inputs. Thus, a progressive approach is adopted, which reduces the required computation time by combining specific resolution models and automated calculations with human interaction. Intermediate results are presented to the system designer during the computation, who then decides step by step which direction the analysis should take. In this way, entire execution branches can be omitted, yielding a good gain in performance.
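The first point is easy to underestimate, so here is a short worked example in Python. Only the 99 percent detection rate comes from the text above; the total number of transactions, the share of fraud, and the 1 percent false alarm rate are hypothetical figures chosen to show the effect.

```python
def alert_breakdown(total, fraud_count, detect_pct, false_alarm_pct):
    """Return (true positives, false positives, precision) for a detector.

    detect_pct: percentage of fraudulent transactions that are flagged.
    false_alarm_pct: percentage of legitimate transactions that are flagged.
    """
    legit_count = total - fraud_count
    true_pos = fraud_count * detect_pct // 100
    false_pos = legit_count * false_alarm_pct // 100
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


# Assumed scenario: 1,000,000 transactions, 1,000 of them fraudulent,
# 99% of fraud detected, 1% of legitimate transactions flagged by mistake.
true_pos, false_pos, precision = alert_breakdown(1_000_000, 1_000, 99, 1)
print(true_pos, false_pos, round(precision, 3))
```

Under these assumptions the detector raises 990 correct alerts and 9,990 false ones, so only about 9 percent of the alerts are real fraud. Because fraud is rare, even a small false alarm rate on the legitimate majority swamps the true alerts, which is why the results must still be validated by people.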

For both of these purposes, a visual tool is needed. Arcade Analytics turns out to be very appropriate for these tasks, thanks to the features already shown and the expressive power of the graph model.

How a Graph Perspective Can Help

A graph perspective can be very useful in fraud detection use cases because, as we already said, most of the computation relies on pattern recognition. We can use these patterns to find and retrieve all the unusual behaviors we are looking for, without needing to write complex join queries. Arcade supports different graph query languages, based on:

  • the pattern matching approach: the Cypher query language proposed by Neo4j and the MATCH statement of the OrientDB SQL query language are fully supported in Arcade. This is a great approach when we need to rely on several patterns to detect fraud.
  • the graph traversal approach, which makes it very simple to explore the graph and any information of actual interest. Gremlin is a good example of this kind of language.

Moreover, one of the most attractive features of Arcade Analytics is that it allows users to query data from a relational database and easily visualize it as a graph, exploring the connections inside it without any migration and in only a few simple steps.


Now, let’s have a look at a very common pattern recognized as a potentially fraudulent activity that is often missed by traditional fraud detection systems, and in what way Arcade Analytics can help us in the analysis of all the instances matching this specific schema.

First-Party Fraud Detection

First of all, let’s define what first-party fraud detection is. From definitions.uslegal.com, we get the following definition:

“First party fraud refers to fraud that is committed by an individual or group of individuals on their own account by opening an account with no intention of repayment. A first party fraud applicant uses synthetic identification or they generally misrepresent their real identity by lying to creditors on application forms, or by using false or proxy addresses. A first party fraud is different from a third party fraud or identity fraud because in third party fraud the perpetrator of fraud uses another person's identifying information. First party fraud includes advances fraud, bust out fraud, friendly fraud, application fraud, and sleeper fraud.”

In the last few years, the number of third-party fraud cases, based on identity theft, has been decreasing, while cases of misrepresented identities and false personal information have been growing.

Many different scenarios can be categorized as first-party fraud, from the simplest to the most complicated.

The following can be a simple example:

John Smith opens a new credit card account, maxes out his credit line, defaults, and then disappears without a trace. In this scenario, Mr. Smith used his own credentials, with minor variations in his contact data, to deliberately defraud the credit card company.

But we can also encounter a group of two or more people organized into a fraud ring, where there is a subset of legitimate contact information, like telephone numbers and addresses.

This data is combined to create several synthetic identities that the ring members use to open fraudulent accounts. These new accounts then have access to credit lines, credit cards, overdraft protection, personal loans, etc.

The accounts are used normally, with regular purchases and timely payments, and, for this reason, banks increase the revolving credit lines over time, due to the observed, responsible credit behavior.
One day the account holders coordinate their activity, max out all of their credit lines, and disappear.

This ring schema can be detected as suspicious behavior and, if it is recognized and validated in time, big losses can be avoided.

Here is a sample graph instance in Arcade matching this ring pattern:

[Figure: sample graph instance in Arcade matching the fraud ring pattern]

This pattern can be searched for in the data through a specific query, loaded into Arcade Analytics, and investigated in depth by human experts in order to prevent this kind of fraud.
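To show what such a search looks for, here is a minimal Python sketch that finds candidate rings as groups of accounts linked, directly or transitively, by shared contact data. It is not the query used in Arcade (where the pattern would typically be expressed in Cypher, OrientDB MATCH, or Gremlin); the account records, field names, and the `find_rings` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical accounts and their declared contact data.
accounts = {
    "acc1": {"phone": "555-0101", "address": "1 Main St"},
    "acc2": {"phone": "555-0101", "address": "9 Oak Ave"},
    "acc3": {"phone": "555-0199", "address": "9 Oak Ave"},
    "acc4": {"phone": "555-0142", "address": "7 Pine Rd"},
}


def find_rings(accounts, min_size=2):
    """Group accounts connected through shared phone numbers or addresses."""
    # Index every (field, value) pair to the accounts that use it.
    by_value = defaultdict(set)
    for acc, info in accounts.items():
        for field, value in info.items():
            by_value[(field, value)].add(acc)

    # Two accounts are neighbors if they share at least one value.
    neighbors = defaultdict(set)
    for group in by_value.values():
        for acc in group:
            neighbors[acc] |= group - {acc}

    # Depth-first search to collect connected components.
    rings, seen = [], set()
    for start in accounts:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            acc = stack.pop()
            if acc in component:
                continue
            component.add(acc)
            stack.extend(neighbors[acc] - component)
        seen |= component
        if len(component) >= min_size:
            rings.append(sorted(component))
    return rings


print(find_rings(accounts))
```

Here acc1 and acc2 share a phone number and acc2 and acc3 share an address, so the three accounts form one candidate ring even though acc1 and acc3 have nothing in common directly. Such a group is only a candidate: shared contact data is also normal for families or flatmates, which is why the matches still need review by human experts.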

To conclude, Arcade Analytics can make a valuable contribution to a complex fraud detection system by covering different roles throughout the whole process.

I hope this post was helpful and interesting. Stay tuned!
