DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • How to Prevent Data Loss in C#
  • Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 2]

Trending

  • Understanding MCP Architecture: LLM + API vs Model Context Protocol
  • Invisible Failures in S/4HANA Conversions (And Why Teams Miss Them)
  • How AI Coding Assistants Are Changing Developer Flow
  • AI in Software Development: A Mirror, Not a Magic Wand
  1. DZone
  2. Data Engineering
  3. Data
  4. Data Privacy Mechanisms and Attacks

Data Privacy Mechanisms and Attacks

In this article, learn more about Data Privacy Mechanisms based on anonymization and perturbation, and gain insight into possible attacks.

By 
Dr. Preeti Goel user avatar
Dr. Preeti Goel
·
Oct. 03, 22 · Analysis
Likes (3)
Comment
Save
Tweet
Share
5.6K Views

Join the DZone community and get the full member experience.

Join For Free

A large volume of sensitive personal data is collected on a day-to-day basis by organizations like hospitals, financial institutions, government agencies, social networks, and insurance companies. The purpose of the organization can be to extract and publish useful information about the data while protecting the private information of individuals.

Data privacy laws, regulations, and issues come up when an organization wants to share or publish private data of individuals while at the same time wanting to protect their personally identifiable information.

There are two main research focuses for the implementation of data privacy.

  • Perturbation: Addition of noise (i.e., differential privacy)
  • Anonymization: Anonymize data records (i.e., k-anonymity, l-diversity, t-closeness)

The aim is to release or publish the private data such that there can be guarantees to the participating individuals that they cannot be identified while the released data is still useful. 

Records in a data set are uniquely identified by a set of identifiers associated with it. The first step towards anonymization is the removal of these identifiers. For example, if the name, phone number, and address fields are removed from the medical data published by a hospital, it can be assumed to be anonymous. However, it is not sufficient as there are still link attacks that can be done to uniquely identify individuals. 

For example, two sets of data can be obtained from different sources like a voter registration list and the anonymous data released by medical organizations from which the unique identifiers have been removed. These two can be linked based on their common attributes (zip code, birth date, gender, etc.) to identify individuals and their medical conditions. 

 

Links in data based on common attributes

k-anonymity is a privacy protection model where the information of each individual cannot be distinguished from at least k-1 individuals whose information also appears in the database. k-anonymity provides a “blend in the crowd" approach to privacy; however, it is still subject to link and homogeneity attacks. There are attacks possible in which case k-anonymity fails to provide privacy which are:

  • Homogeneity Attack: This is possible when the sensitive attribute lacks diversity in its values. For example, if the attacker knows that he is trying to find information for his neighbor (zip code 3024) Mark who is nearly 33-year-old, then looking at the 2-anonymized in Table 3, he knows that Mark has heart disease.
  • Background knowledge attack: This is when the attacker has background knowledge. For example, if the attacker knows that Mark, his 25-year-old neighbor (zip code:3087), is seriously ill and was taken in an ambulance, then by looking at Table 3, he knows that he has either flu or cancer. Since he knows he is seriously ill and in the hospital for a long time, he knows that he has cancer.

Data tables providing info for homogeneity and background knowledge attacksPrivacy algorithms like l-diversity and t-closeness protect against some of these attacks, but they still have limitations. For example, l-diversity does not protect against probabilistic inference attacks and similarity attacks.

  • Similarity attack: Since l-diversity does not consider the semantic meanings of sensitive values there could be blocks with very similar values of the sensitive attribute hence providing little or no privacy.

The Differential Privacy model claims that it provides a stronger privacy guarantee than the above-discussed models, as it does not require any assumptions about the data and protects against attackers who know all but one row in the database. It provides privacy guarantees regardless of the background knowledge and computational power of the attacker. Differential privacy originated as a model for interactive databases and seeks to limit the knowledge that users obtain from query responses. It was later extended to privacy-preserving data publishing.

For data publishing, both t-closeness and differential privacy limit disclosure, as an attacker’s prior knowledge is limited to their prior knowledge about the distribution of confidential attributes.

The Differential Privacy model provides privacy guarantees where the overall information gained by an adversary does not change, whether an individual opts to be present or absent in the data. The objective is as follows: if a disclosure occurs when an individual participates in the database, then the same disclosure also occurs with a similar probability (within a small multiplicative factor) even when the individual does not participate. Differential privacy attempts to break the strong correlation between the output of a function (or algorithm) from its set of inputs such that a small change in the input data set should not cause huge variations in the output.

However, there are challenges to implementing differential private systems. The first is how to adapt to time series data or highly co-related data. For time series data, the composition property adds up the noise added at each timestamp to generate a huge amount of noise in the series. It is known that differential privacy does not protect against attackers' auxiliary information. The biggest challenge is privacy vs utility, which is how to reduce noise while still maintaining the utility of the results. It is a challenge to tune the data privacy approach based on the tradeoff between utility and privacy, and create metrics to evaluate the model quality measuring utility loss and privacy gain, and strike the right balance.

Data (computing)

Opinions expressed by DZone contributors are their own.

Related

  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • How to Prevent Data Loss in C#
  • Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever
  • Production Database Migration or Modernization: A Comprehensive Planning Guide [Part 2]

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook