Fraud Detection: Approaches and Pitfalls

With the explosive growth of data, businesses can now examine the very minutiae of their processes. This includes details on which user took what action in which system, ranging from mission-critical payments and asset transactions to the utterly mundane, such as users logging in to their own computers. On the one hand, this trove of data is a gold mine for those looking for real-time business process monitoring. On the other, it is absolutely overwhelming. To give an example: we log more than 1.5 million personal data accesses per month on the Regix component of the Bulgarian e-government, or more than 50,000 daily, and compared with our other clients this is on the very low end! Volumes on the order of millions daily are commonplace.

The overwhelming majority of such transactions are legitimate and business-critical, and must be processed efficiently. However, some of them are not legitimate or “normal”. These anomalous ones signify fraud attempts, exceptions and errors, or radical changes of behavior. All of these are of business interest and may necessitate action. Given the huge amount of data (usually contained in logs), automatic anomaly detection turns out to be a critical business capability. While such systems are not uncommon, their performance is often subpar. In general, there are four approaches, whose usefulness depends on the specific data and use case:

  • Rule-based
  • Supervised Learning
  • Unsupervised Learning
  • Ensemble Models


Rule-based approaches are probably the oldest and most time-tested ones. They consist of defining certain rules and labeling actions that do not match them as anomalous and potentially worth checking. For example, if a credit card transaction is more than ten times larger than the average for this customer, a notification is raised. Rules vary from the traditional statistical ones (e.g. flag all transactions more than 3 standard deviations from the mean as suspect) to business rules (block a credit card after three wrong PINs are entered consecutively). This approach is the most flexible and allows human experts to apply their subject matter expertise, but it is also the most difficult and time-consuming to implement well, as it requires the painstaking definition of a rule for every possible anomaly. Needless to say, if the experts omit a rule, the corresponding anomalies will go undetected and nobody will suspect it.
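
The three example rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the rule names and the sample history are hypothetical, and a real system would read a customer's history from its own data store.

```python
from statistics import mean, stdev

def exceeds_customer_average(amount, history, factor=10.0):
    """Business rule: amount is more than `factor` times the customer's average."""
    return bool(history) and amount > factor * mean(history)

def beyond_three_sigma(amount, history, k=3.0):
    """Statistical rule: amount lies more than k standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(amount - mu) > k * sigma

def should_block_card(pin_attempts, limit=3):
    """Business rule: block after `limit` consecutive wrong PINs.

    `pin_attempts` is a list of booleans in time order, True meaning a correct PIN.
    """
    streak = 0
    for correct in pin_attempts:
        streak = 0 if correct else streak + 1
        if streak >= limit:
            return True
    return False

# Hypothetical rule registry: every rule takes (amount, history).
AMOUNT_RULES = {
    "ten_times_average": exceeds_customer_average,
    "three_sigma": beyond_three_sigma,
}

def triggered_rules(amount, history):
    """Return the names of all amount rules that flag this transaction."""
    return [name for name, rule in AMOUNT_RULES.items() if rule(amount, history)]

# Made-up past transaction amounts for one customer.
history = [20.0, 25.0, 30.0, 22.0, 28.0]
print(triggered_rules(300.0, history))  # flagged by both rules
print(triggered_rules(27.0, history))   # flagged by none
```

Note that each rule only catches what it was written for: a transaction that is anomalous in some other way passes silently, which is exactly the omission risk described above.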

Supervised Learning approaches use machine learning to detect anomalous and potentially fraudulent behavior. Essentially, a database of labeled transactions (normal/fraud) is used together with an appropriate algorithm (such as neural networks, k-Nearest Neighbors or Support Vector Machines) to create a statistical model that distinguishes between normal and fraudulent transactions. As new transactions come in, the model classifies them. If a transaction is suspect, a notification is raised. This class of models seems to work reasonably well but comes with two major drawbacks. First, labeled data is often unavailable. Even when it is available, its accuracy is not guaranteed: some undetected fraud is labeled as legitimate. Second, some of these algorithms are computationally very intensive, especially during model training, and those resources need to be provisioned.
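
As a minimal sketch of the supervised idea, here is a k-Nearest Neighbors classifier in plain Python. The two features (transaction amount as a multiple of the customer's average, and the number of failed logins beforehand) and all the labeled rows are made up for illustration; in practice the features would also need to be brought to comparable scales before distances are computed.

```python
import math
from collections import Counter

def knn_classify(labeled, query, k=3):
    """Return the majority label among the k labeled points nearest to `query`.

    `labeled` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled history: (amount / customer average, failed logins before).
labeled = [
    ((1.0, 0.0), "normal"),
    ((0.8, 0.0), "normal"),
    ((1.2, 1.0), "normal"),
    ((0.9, 0.0), "normal"),
    ((9.0, 3.0), "fraud"),
    ((12.0, 2.0), "fraud"),
    ((10.0, 4.0), "fraud"),
]

for transaction in [(1.1, 0.0), (11.0, 3.0)]:
    label = knn_classify(labeled, transaction)
    if label == "fraud":
        print("notify:", transaction)
```

The sketch also makes the first drawback tangible: if one of the "normal" rows were in fact undetected fraud, the classifier would faithfully learn the wrong answer.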

Unsupervised Learning approaches also take advantage of recent advances in machine learning and leverage large amounts of data. However, they do not need a label for each transaction. They assume that fraudulent events are a tiny percentage of all events and try to create a model of normal behavior. As new events come in, they are compared against this “normal” behavior, and if the difference is large enough, they are labeled as anomalous. Standard models here range from simple k-Means clustering to more involved methods such as Principal Component Analysis, One-class Support Vector Machines, or time-series methods such as ARIMA. The key advantage of these approaches is that they can mine a huge amount of data with no prior knowledge of which events are fraudulent and can act independently of human judgment. In addition, they are generally more lightweight during model building but may be more computationally expensive during testing, that is, when new events are scored.
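
The idea of modeling “normal” behavior without labels can be sketched as follows. Note that this is deliberately not one of the named algorithms: it is a single-centroid simplification (roughly k-Means with one cluster), in which an event is anomalous when its standardized distance from the center of past events exceeds a threshold learned from those same events. The feature choice (accesses per hour, distinct records touched) and the sample data are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def fit_normal_profile(events, quantile=0.99):
    """Learn a profile of normal behavior from unlabeled events.

    Returns (score, threshold): `score(event)` is the standardized distance
    from the center of the training events, and `threshold` is the distance
    found at the given quantile of the training events.
    """
    columns = list(zip(*events))
    center = [mean(col) for col in columns]
    scale = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero

    def score(event):
        return math.sqrt(
            sum(((x - c) / s) ** 2 for x, c, s in zip(event, center, scale))
        )

    distances = sorted(score(e) for e in events)
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return score, distances[index]

# Made-up unlabeled events: (accesses per hour, distinct records touched).
events = [(50, 10), (52, 11), (48, 9), (51, 10), (49, 12),
          (50, 11), (53, 10), (47, 9), (50, 10), (51, 11)]

score, threshold = fit_normal_profile(events)

def is_anomalous(event):
    return score(event) > threshold

print(is_anomalous((50, 10)))   # close to the center of past behavior
print(is_anomalous((400, 90)))  # far from anything seen before
```

The quantile encodes the assumption that anomalies are rare: with `quantile=0.99`, roughly one percent of the training events themselves are allowed to fall beyond the threshold.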

Ensemble (hybrid) models try to take the best of all worlds. They combine different algorithms to achieve greater accuracy of anomaly detection. It has long been established in the statistics and machine-learning communities that a combination of different algorithms tends to produce superior results. It is no surprise, then, that state-of-the-art fraud detection hinges on an ensemble of different approaches. For example, we at LogSentinel combine unsupervised learning with rule-based fraud detection to merge the power of machine learning with the versatility of domain expertise. The key intuition behind ensemble models is that each algorithm captures only a few of the key features of the data well. Combining a number of different algorithms thus gives a more comprehensive view of the data at hand. The main advantage of ensemble models is their improved accuracy, but this comes at the price of increased computational cost and less intuitive interpretation.
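
One simple way to combine detectors, shown here purely as an illustration and not as a description of any production system, is to have each detector return a score between 0 and 1 and take a weighted average. The two detectors, the weights and the sample amounts below are all hypothetical.

```python
from statistics import mean, pstdev

def zscore_detector(history):
    """Statistical detector: scaled distance from the mean, capped at 1."""
    mu, sigma = mean(history), pstdev(history) or 1.0

    def detect(amount):
        # 3 or more standard deviations away counts as fully anomalous.
        return min(abs(amount - mu) / sigma / 3.0, 1.0)

    return detect

def ratio_rule_detector(history, factor=10.0):
    """Rule-based detector: 1 if amount exceeds `factor` times the average."""
    average = mean(history)

    def detect(amount):
        return 1.0 if amount > factor * average else 0.0

    return detect

def ensemble_score(detectors, weights, amount):
    """Weighted average of the individual detector scores."""
    weighted = sum(w * d(amount) for d, w in zip(detectors, weights))
    return weighted / sum(weights)

# Made-up past transaction amounts for one customer.
history = [20.0, 25.0, 30.0, 22.0, 28.0]
detectors = [zscore_detector(history), ratio_rule_detector(history)]
weights = [0.5, 0.5]

for amount in (26.0, 40.0, 300.0):
    print(amount, round(ensemble_score(detectors, weights, amount), 3))
```

An amount of 40.0 illustrates the point of combining: the statistical detector considers it clearly unusual while the business rule does not fire, so the ensemble reports an intermediate score. It also illustrates the interpretability cost, since a single number no longer says which detector raised the concern.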


New trends in operations and risk management place ever greater reliance on data-driven decision-making. The capability to automatically review all activity in your organization and flag those instances that merit attention is crucial. This is especially true of fraud and exception detection, which have a direct bearing on the firm’s bottom line. The trend is also reflected in recent regulation such as the Second Payment Services Directive (PSD2) and its associated guidelines, which explicitly ask payment operators to have a couple of simultaneously operating fraud detection systems. Even without regulation, businesses are well advised to keep vigilant track of anomalous behavior and take corrective action if they truly mean to achieve operational excellence.