The Data Leakage Nightmare in AI
This article will discuss how data leakage can occur, its consequences, and how industries, governmental institutions, and individuals may handle these concerns.
Nowadays, we think of artificial intelligence as the solution to many problems and as a tool that can help humanity achieve great things faster and with less effort. That view is not far from the truth, but it is important to be aware of the issues that may arise along the way and how they can affect us and our environment.
Among the issues with artificial intelligence (AI from now on), one of the most relevant is called “data leakage.” This refers to a machine learning problem in which the data used to train the model (the technique that we use to predict an output from an input data set) contains information it should not have, which leads to an overestimation of the model’s usefulness when it is run with real data.
In this article, we will go through how data leakage can occur, its consequences, and how industries, governmental institutions, and individuals may handle these concerns.
As was already mentioned in the introduction, data leakage is an issue that can arise during the implementation of a machine-learning model. The issue arises when the model has information that will not be available at the prediction time (in production). When this happens, the model will perform well under development and training conditions, but it will underperform when used with production data.
Types of Data Leakage
Leakage can take different forms, depending on what information is actually leaking. Two common ones are feature leakage and training leakage; let’s go through them to learn more.
Feature leakage is the easy case: it is caused by a column in the data that explicitly gives the model information about what it is trying to predict, even though that information will not be available at prediction time. For example, if we are developing a model to predict a user’s ad clicks per year and the data has a field dailyUserAdClicks, we are leaking information, because that field will not exist for a new user in production. The model learns to rely on it during training and then underperforms in the real case, where it is missing.
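One way to guard against this kind of leak is to keep an explicit list of the fields that exist at prediction time and drop everything else before training. The sketch below is only an illustration in plain Python: the allow-list `AVAILABLE_AT_PREDICTION`, the helper functions, the target name `yearlyUserAdClicks`, and the sample record are all hypothetical; only `dailyUserAdClicks` comes from the example above.

```python
# Hypothetical allow-list: fields we assume exist for a brand-new user in production.
AVAILABLE_AT_PREDICTION = {"age", "country", "signupChannel"}


def select_safe_features(record, target):
    """Return (features, label), keeping only fields available at prediction time."""
    features = {k: v for k, v in record.items() if k in AVAILABLE_AT_PREDICTION}
    return features, record[target]


def leaked_fields(record, target):
    """List fields that are neither the target nor available at prediction time."""
    return sorted(
        k for k in record if k != target and k not in AVAILABLE_AT_PREDICTION
    )


# Made-up training record, for illustration only.
row = {
    "age": 31,
    "country": "AR",
    "signupChannel": "search",
    "dailyUserAdClicks": 4,       # leaks the target: yearly clicks are roughly 365x this
    "yearlyUserAdClicks": 1460,   # the value the model is supposed to predict
}

features, label = select_safe_features(row, target="yearlyUserAdClicks")
print(leaked_fields(row, target="yearlyUserAdClicks"))
print(features, label)
```

An allow-list is a deliberate choice here: a new column is excluded by default until someone confirms it is available in production.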
In machine learning, the data available at the development stage is usually split into the so-called training, testing, and cross-validation (CV) sets. Training leakage happens when information crosses the boundaries between those sets.
Training leakage can happen in multiple ways. One of them is applying some kind of operation (normalization, scaling, etc.) to the whole data set, including the test and CV splits. The statistics behind that operation are then computed partly from data the model is not supposed to have seen, so the validation stage is conditioned by information that would not be available in production.
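The scaling case can be sketched in a few lines. This is a minimal illustration in plain Python, not a production pipeline: the helpers `fit_scaler` and `transform` and the sample values are made up, and standardization is just one example of such an operation. The point is that the statistics are fitted on the training split only and then reused for the other splits.

```python
import statistics


def fit_scaler(values):
    """Compute the mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Made-up feature values, already split into training and test sets.
train = [10.0, 12.0, 14.0, 16.0]
test = [100.0, 120.0]  # unseen data with a very different range

# Leaky: statistics computed over train + test, so test data shapes the training inputs.
leaky_mean, leaky_std = fit_scaler(train + test)

# Safer: fit on the training split only, then reuse those statistics everywhere.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(leaky_mean, mean)
```

With the leaky variant, the training inputs shift whenever the test set changes, which is exactly the dependency that will not exist in production.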
Another case of training leakage occurs when working with time-based information. The issue comes when the data set is randomly split for training, testing, and CV. With a random split, records from later dates can end up in the training set while earlier ones land in the test set, so information from future events conditions the model’s predictions. That information from the future will not be available in a production environment.
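A common alternative to a random split for time-based data is a chronological one, where everything in the training set is older than everything in the test set. The sketch below assumes each record carries a `timestamp` field; the function name `chronological_split`, the 80/20 ratio, and the sample rows are hypothetical choices made for illustration.

```python
from datetime import date


def chronological_split(rows, train_fraction=0.8):
    """Sort rows by timestamp and split so that training data precedes test data."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up daily observations, for illustration only.
rows = [{"timestamp": date(2023, 1, day), "value": day * 1.5} for day in range(1, 11)]

train, test = chronological_split(rows)

# Every training row is older than every test row, so no future information leaks in.
assert max(r["timestamp"] for r in train) < min(r["timestamp"] for r in test)
print(len(train), len(test))
```

The same idea extends to a three-way split: the CV set takes the period after training, and the test set takes the most recent period.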
Consequences of Data Leakage in AI
As we have seen, the development practices used during a machine learning process can produce models that look better in development than they really are. If we think of a future where most of our applications are based on services that use artificial intelligence, which in turn works on underlying machine learning models, the accuracy of these models becomes a worrisome matter.
There are many vectors where data leakage could be a dangerous factor in this eventual future, starting with bad development practices and ending with bad actors trying to modify the behavior of systems. We have to be conscious that a lot of these systems are going to be based on the information we are harvesting today, which is being used in an unregulated environment. For example, AI is already being used in medical applications without any formal regulation on the data that they rely on to give results.
Luckily, some governmental institutions are aware of this matter and are taking action on it. One example is the Artificial Intelligence Act, a regulation proposed by the European Commission that aims to regulate this technology. This is a good starting point, and it is good to see that there is awareness, but the implications of the technology are far too big to rely on one or a couple of institutions. The proposal aims to categorize applications based on their risk profile, with the high-risk ones highly regulated and the least risky ones not regulated at all. This categorization can be really dangerous since it will always be subjective. For example, some could say AI in social networks is not risky, while others could argue that its impact on social development is huge.
Then we have the dark side of the matter, where bad actors come into play. Europe might have its own policy for AI, but some other parts of the world might never reach that point, meaning that someone with bad intentions could do there what is considered unethical or illegal somewhere else. And even in a regulated environment, there are ways to corrupt systems that could lead to catastrophes. Hacks occur on a daily basis nowadays. Given a technological infrastructure based on data, it’s very likely that this data will be a target for bad actors to manipulate at will.
After going through some technical details of data leakage, how it can occur, and its possible consequences for our future, I think it’s really important to think about how the software industry faces this. From a developer’s point of view, we must ensure good practices. For example, a leakage bug in a model could let a hacker pollute data sets, and those data sets could be the starting point for many other services that base their applications on this model’s output. From a regulator’s point of view, it is important to categorize not only applications but also data sources and access to this data.
There are a huge number of dangers and things to take into account to reduce the chances of technology-related disasters, and we are probably not even close to knowing the majority of them yet. But it’s key to keep an open mind and treat these matters as delicately as their consequences demand; rushing the development and application of such technologies could be a huge mistake.
Published at DZone with permission of Felipe Ferrari.