The Data Leakage Nightmare in AI

This article will discuss how data leakage can occur, its consequences, and how industries, governmental institutions, and individuals may handle these concerns.

By Felipe Ferrari · Jan. 31, 23 · Analysis

Nowadays, we think of artificial intelligence as the solution to many problems and as a tool that can help humanity achieve great things faster and with less effort. Those expectations are not far from the truth, but it is important to be aware of the issues that can arise along the way and of how those issues can affect us and our environment.

Among the issues with artificial intelligence (AI from now on), one of the most relevant is called “data leakage.” This is a machine learning problem in which the data used to train the model (the system we use to predict an output from an input data set) contains unexpected information, leading to an overestimation of the model’s usefulness when it is run on real data.

In this article, we will go through how data leakage can occur, its consequences, and how industries, governmental institutions, and individuals may handle these concerns.

Data Leakage

As mentioned in the introduction, data leakage is an issue that can arise while implementing a machine learning model. It occurs when the model is trained with information that will not be available at prediction time (in production). When this happens, the model performs well under development and training conditions but underperforms on production data.

Types of Data Leakage

Leakage can take different forms, depending on what information is actually leaking. The two most common are feature leakage and training leakage; let’s go through them in turn.

Feature Leakage

This is the simpler case: it is caused by including a column in the training data that explicitly gives the model information about what it is trying to predict, information that will not be available at prediction time. For example, if we are developing a model to predict a user’s yearly ad clicks and the training data contains a field dailyUserAdClicks, we are leaking information, because that field will not exist for a new user in production. Since the field is present during training but absent in production, the model will underperform in the real case.
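
A minimal sketch of this (in Python with scikit-learn, on synthetic data; the column names, including dailyUserAdClicks, follow the hypothetical example above) shows how a leaked feature inflates the offline evaluation:

```python
# Hypothetical sketch of feature leakage: dailyUserAdClicks is derived from the
# target and would not exist for a new user in production.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "ageGroup": rng.integers(1, 6, n),
    "pagesVisited": rng.integers(1, 50, n),
})
# Target: yearly ad clicks, loosely driven by the legitimate features.
df["yearlyAdClicks"] = 5 * df["pagesVisited"] + rng.normal(0, 20, n)
# Leaky column: essentially the target expressed per day.
df["dailyUserAdClicks"] = df["yearlyAdClicks"] / 365

y = df["yearlyAdClicks"]
for label, cols in [("with leak", ["ageGroup", "pagesVisited", "dailyUserAdClicks"]),
                    ("without leak", ["ageGroup", "pagesVisited"])]:
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(label, "R^2:", round(r2_score(y_te, model.predict(X_te)), 3))
# The "with leak" model scores near-perfectly offline, but that score cannot be
# reproduced in production, where dailyUserAdClicks is unavailable.
```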

Training Leakage

In machine learning, certain techniques are used to split the available data at the development stage into the so-called training, test, and cross-validation (a.k.a. CV) sets.

Training leakage can happen in multiple ways. One of them is applying some operation (normalization, scaling, etc.) to the whole data set, including the test and CV splits, before training. Doing so lets statistics from the test data condition the validation stage, which would not be the case in production.
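
A rough illustration of the difference, again in Python with scikit-learn on synthetic data (the data and model are only illustrative): the key is whether preprocessing statistics are computed on the full data set or only on the training split.

```python
# Sketch of preprocessing leakage: the scaler should never be fitted on rows
# that will later be used for testing or cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Leaky: the scaler is fitted on ALL rows, so test-set statistics influence
# how the training data is transformed.
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
print("leaky accuracy:", clf.score(scaler.transform(X_te), y_te))

# Leak-free: every preprocessing step is fitted only on the training split,
# which a Pipeline enforces automatically (also inside cross-validation).
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print("clean accuracy:", pipe.score(X_te, y_te))
```

On toy data like this the two numbers barely differ, but on real data, and especially inside cross-validation, fitting transformers on the full data set can noticeably inflate the validation score.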

Another case of training leakage occurs when working with time-based information. If a time-sensitive data set is split randomly into training, test, and CV sets, the model can end up being trained on events that happen after the ones it is evaluated on, so information from the future conditions its predictions. That future information will not be available in a production environment, which makes the offline evaluation overly optimistic.
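
As a small sketch (Python with scikit-learn; the artificial series and the model are only illustrative), compare a random split with a chronological one:

```python
# Sketch of temporal leakage: with a shuffled split, the model is trained on
# points that lie between (and after) the test points in time.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(1)
series = np.cumsum(rng.normal(size=1000))   # a slowly drifting time series
X = series[:-1].reshape(-1, 1)              # feature: previous value
y = series[1:]                              # target: next value

model = RandomForestRegressor(n_estimators=50, random_state=0)
shuffled = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
chronological = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5))

# The shuffled score is optimistic: neighbors of every test point were seen
# during training. The chronological score is closer to production behavior.
print("random split R^2:       ", shuffled.mean().round(3))
print("chronological split R^2:", chronological.mean().round(3))
```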

Consequences of Data Leakage in AI

As we have seen, the development practices used during a machine learning project can lead to outcomes that are far from ideal. If we imagine a future where most of our applications rely on AI services built on underlying machine learning models, the real accuracy of those models becomes a genuinely worrisome matter.

There are many vectors through which data leakage could become a dangerous factor in that future, from bad development practices to bad actors deliberately trying to modify the behavior of systems. We have to be conscious that many of these systems will be built on the information we are harvesting today, which is being collected and used in a largely unregulated environment. For example, AI is already being used in medical applications without any formal regulation of the data those applications rely on to produce results.

Luckily, some governmental institutions are aware of this matter and are taking action. One example is the Artificial Intelligence Act, a regulation proposed by the European Union that aims to govern this technology. This is a good starting point, and it is good to see that there is awareness, but the implications of the technology are far too big to rely on one or a couple of institutions. The current proposal categorizes applications by risk profile, with high-risk applications heavily regulated and the least risky ones not regulated at all. This categorization can be dangerous because it will always be subjective: some could say AI in social networks is not risky, while others could argue that its impact on social development is enormous.

Then we have the dark side of the matter, where bad actors come into play. Europe might have its own policy for AI, but other parts of the world may never reach that point, meaning that someone with bad intentions could operate outside what is considered ethical or legal elsewhere. Even in a regulated environment, there are ways to corrupt systems that could lead to catastrophes. Hacks occur on a daily basis; given a technological infrastructure built on data, that data is very likely to become a target that bad actors will try to manipulate at will.

Conclusion

After going through some technical details of data leakage, how it can occur, and its possible consequences for our future, I think it is important to consider how the software industry faces this. From a developer’s point of view, we must follow good practices: a leakage bug in a model could, for example, let an attacker pollute the data sets of the many other services that build their applications on that model’s output. From a regulator’s point of view, it is important to categorize not only applications but also data sources and access to that data.

There are a huge number of dangers and considerations involved in narrowing the possibility of disasters related to these technologies, and we are probably not even close to knowing most of them. But it is key to keep an open mind and treat these matters with as much care as their consequences demand; rushing the development and application of such technologies could be a huge mistake.


Published at DZone with permission of Felipe Ferrari. See the original article here.

Opinions expressed by DZone contributors are their own.
