Zero-Trace Paradigm: Emerging Technologies in Personal Data Anonymization
Get the lowdown on the advantages and downsides of emerging data anonymization mechanisms that lay the groundwork for the zero-trace paradigm.
Emerging technologies like homomorphic encryption and zero-knowledge proofs can definitely help organizations approach zero-trace personal data anonymization. These and similar techniques can bring datasets to a near-zero-trace status, even achieving it in limited cases. There’s a major force that’s acting against efforts at implementing the zero-trace paradigm, though, and it’s difficult to discuss this paradigm without delving into what undermines it.
Put simply, personal data has to come from somewhere before anonymization can even be considered. For very large datasets of personal information covering broad swaths of the population, the data broker industry remains the most practical source for many applications. Unfortunately, data brokers have proven themselves incapable of securing the almost exclusively plaintext datasets they hold.
This affects attempts at achieving the zero-trace status in two ways: in some cases, it renders such endeavors moot as equivalent datasets are made readily available elsewhere; in others, the presence of data broker databases provides the grist for re-identification attacks based on data correlation, something to which traditional anonymization and pseudonymization techniques are particularly vulnerable.
Why Is Personal Data Anonymization So Important, and So Difficult?
Personal information is the fuel that sustains what the American scholar Shoshana Zuboff described as “surveillance capitalism.” In an age in which many of us have moved much of our everyday lives online, personal information can include data points never even dreamt of in pre-digital times. Having personal information like your name, phone number and address publicly available is nothing new; previous generations lived with phone books for decades.
But the vast stores of personal data held by data brokers, big tech companies and others these days go far beyond that. Behavioral patterns (including shopping and media consumption habits), religious and political beliefs, sexual preferences, location histories, social and professional networks, not to mention court records, governmental records, and property records – these are all fair game.
In short, virtually everything you see and do online and all that this betrays about you, as well as everything the public record contains on your offline life, is out there. It’s being bought and sold in plaintext and refined and analyzed every minute of every day. Such data in the wrong hands can lead to everything from identity theft and stalking to more insidious forms of discrimination and harassment.
Personal data is increasingly treated as a commodity, and companies today need it like their predecessors needed whale oil. Whether today’s surveillance capitalism is justifiable, or even sustainable, it is a fact of life and we need ways for companies to exploit personal data without infringing on individuals’ right to privacy.
The difficulty with traditional anonymization and pseudonymization techniques is twofold. When applied lightly enough to leave personal data useful for analysis, they’re easily broken through various re-identification techniques, such as correlation attacks, which benefit from the ready availability of personal data through other channels like data brokers. When applied so aggressively that re-identification is unlikely, they render the data unusable (and even then, the data might still be de-anonymized in the future).
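To make the correlation attack concrete, here is a minimal sketch in Python. Every record, field name, and the `link` helper are invented for illustration. The sketch assumes an attacker holds an auxiliary dataset (of the kind a data broker might sell) that shares a few quasi-identifiers with a pseudonymized release.

```python
# Toy linkage (correlation) attack. All records below are invented.

# A "pseudonymized" release: names removed, quasi-identifiers kept for analysis.
released = [
    {"zip": "02139", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_year": 1984, "sex": "F", "diagnosis": "migraine"},
]

# Auxiliary data: names plus the same quasi-identifiers (hypothetical broker file).
auxiliary = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "zip": "60614", "birth_year": 1975, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(released_rows, auxiliary_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, released_row) pairs whose quasi-identifiers match uniquely."""
    index = {}
    for row in auxiliary_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate: the record is re-identified
            matches.append((names[0], row))
    return matches


reidentified = link(released, auxiliary)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

Coarsening the quasi-identifiers (dropping the ZIP code, say) would defeat this particular join, but it also removes detail an analyst may need, which is the trade-off described above.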
What Is the Zero-Trace Paradigm?
This search for ways to simultaneously anonymize and exploit personal data, taken to its natural conclusion, has led to the creation of the zero-trace paradigm. “Zero trace” refers to the idea that no traces of personal identifiers are left available in a given dataset, making re-identification virtually if not actually impossible. At the same time, the paradigm requires that datasets remain usable (for generating insights, making predictions, etc.) in their anonymized states.
In short, the zero-trace paradigm requires the best of both worlds: unbreakable anonymization and uninhibited exploitation. It’s safe to assume that no non-trivial dataset has been brought into a perfect zero-trace state – there’s always a compromise to be negotiated. With technologies like quantum computing seemingly just around the corner, future-proofing the anonymization aspect is another source of difficulty.
Technologies That Bring Us Closer to a True Zero-Trace Status
Zero trace is a paradigm, a goal, and an ideal. There is a whole host of established, new, and emerging technologies that bring or promise to bring us closer to a completely non-traceable data environment.
Differential Privacy
Differential privacy refers to a mathematical framework that can provide formal, provable privacy guarantees. At its core is a simple yet powerful concept: carefully calibrated statistical noise is introduced into data or query results to mask individual contributions while preserving varying degrees of overall statistical validity.
The privacy guarantees offered by this technique are probabilistic: the presence or absence of any individual data subject in a dataset cannot be determined from the results of an analysis beyond some controllable probability threshold. This threshold is expressed as the privacy budget or epsilon value. The smaller the epsilon value, the stronger the privacy protection.
Smaller epsilon values come at the expense of diminishing data utility. This is the key limitation of differential privacy: it doesn’t escape the trade-off between data utility and privacy, meaning that the accuracy of analyses breaks down as privacy is strengthened.
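The trade-off between epsilon and accuracy can be seen in a short Python sketch of the Laplace mechanism, one standard way of adding calibrated noise to a counting query. The sample ages and the helper names (`laplace_noise`, `private_count`, `mean_abs_error`) are made up for this example, and a production system should rely on a vetted differential privacy library rather than hand-rolled floating-point sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of this query is 1.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(records, predicate, epsilon, rng, trials=2000):
    """Average distance between the noisy answer and the true count."""
    true_count = sum(1 for record in records if predicate(record))
    total = sum(
        abs(private_count(records, predicate, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


def over_40(age):
    return age > 40


rng = random.Random(0)  # seeded only to make the demo repeatable
ages = [23, 35, 41, 52, 29, 67, 38, 45, 31, 58]  # invented sample data

for epsilon in (0.1, 1.0, 10.0):
    error = mean_abs_error(ages, over_40, epsilon, rng)
    print(f"epsilon={epsilon}: mean absolute error ~ {error:.2f}")
```

The expected absolute error of Laplace noise equals its scale, here 1/epsilon, so each tenfold reduction in epsilon costs roughly a tenfold loss of accuracy on this query.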
Homomorphic Encryption
Homomorphic encryption is a technology that allows us to do what was once seemingly impossible: it facilitates the performance of computations on encrypted data without requiring the decryption of that data. The homomorphism in this type of encryption comes from mathematical structures that preserve certain algebraic relations between plaintext and ciphertext versions of the data in question.
Third parties can perform calculations on homomorphically encrypted data without gaining access to the underlying content. The major drawback of this approach is the huge computational overhead involved in working with homomorphically encrypted datasets. Operations on such datasets can be thousands of times slower than on their plaintext equivalents.
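As a rough illustration of computing on ciphertexts, the sketch below implements a toy version of the Paillier cryptosystem in Python. Paillier is only additively homomorphic (a partially homomorphic scheme, not fully homomorphic encryption), and the tiny primes, the salary figures, and the function names are assumptions chosen for readability. It is not secure and needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets


def keygen(p, q):
    """Build a toy Paillier key pair from two primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**message mod n^2 equals 1 + message * n.
    return ((1 + message * n) % n_sq) * pow(r, n, n_sq) % n_sq


def decrypt(n, private_key, ciphertext):
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n) * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


n, private_key = keygen(1009, 1013)  # small primes, illustration only
c1 = encrypt(n, 52000)  # invented salary figures
c2 = encrypt(n, 61000)

# A third party can combine c1 and c2 without ever holding the private key.
c_sum = add_encrypted(n, c1, c2)
print(decrypt(n, private_key, c_sum))
```

Even in this toy, every encryption and addition is a modular exponentiation or multiplication over numbers the size of n squared, which hints at where the overhead of real schemes with 2048-bit or larger moduli comes from.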
Secure Multi-Party Computation (SMPC)
Secure multi-party computation is closely related to homomorphic encryption and also allows third parties to operate on data without requiring access to identifiers within that data. Specifically, it allows multiple parties to jointly compute functions over their inputs while keeping those inputs private. This could allow multiple healthcare providers, for example, to analyze patient records across institutions without exposing actual, individual records.
SMPC also has applications in finance, where competitive analyses can be performed without revealing proprietary information. The drawback is similar to that of homomorphic encryption: the computation and communication overheads of these protocols make the process slow and resource-intensive.
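One common building block of SMPC is additive secret sharing, sketched below in Python for the simplest joint function, a sum. The three hospital counts, the modulus, and the function names are illustrative assumptions. The sketch also assumes honest parties and simulates all of them inside one process, whereas a real protocol would exchange the shares over a network.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this prime


def share(value, n_parties):
    """Split value into n random-looking shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    """Simulate parties jointly computing the sum of their private inputs."""
    n = len(private_inputs)
    # Party i splits its input and hands share j to party j.
    distributed = [share(value, n) for value in private_inputs]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published partial sums reveal the total, not the individual inputs.
    return sum(partial_sums) % MODULUS


# Three hospitals with invented patient counts for some condition.
hospital_counts = [120, 75, 310]
print(secure_sum(hospital_counts))
```

Each party sees only uniformly random shares of the others' inputs, so no single participant learns another's count. The price is the extra messaging: n parties exchange on the order of n squared shares to compute a single sum.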
Federated Learning and Edge Computing
Federated learning represents a different kind of paradigm shift from the above technologies, introducing the concept of decentralization into the mix. It’s also the first technology on this list to explicitly depend on machine learning and related disciplines. Rather than having raw data centralized for analysis, federated learning has the algorithm go to where the data resides. This eliminates the need for data transfers to central servers, drastically reducing the risk of exposure.
A central server sends the model to devices participating in the analysis. Each device then trains the model on local data. Under this approach, only model updates (rather than raw data) are sent back to improve the primary model. This architecture keeps raw data on the device while still leveraging collective intelligence.
The applications of federated learning are many: from improving onscreen keyboard predictions on mobile devices to allowing smartphone health apps to contribute to medical research without sharing individuals’ sensitive health data. There are, at least for now, drawbacks to this approach, though.
Model updates can sometimes leak information about training data, creating the need for additional privacy-preserving techniques like differential privacy to be employed in parallel with federated learning. Federated learning, like homomorphic encryption and SMPC, also introduces some efficiency challenges, as model training is distributed across devices of varying computational capacity and connection throughput.
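The server-to-device round trip described in this section can be sketched as federated averaging on a deliberately tiny model. The Python below assumes a one-variable linear model, three simulated devices with invented data points, and hypothetical helper names. It omits the client sampling, secure aggregation, and noise that practical deployments typically add.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Train y = w * x + b on one device's local data and return the new weights."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def federated_round(global_weights, devices):
    """One round: every device trains locally, the server averages the results."""
    updates = [(local_update(global_weights, data), len(data)) for data in devices]
    total = sum(size for _, size in updates)
    # Weight each device's result by how much data it trained on.
    w = sum(weights[0] * size for weights, size in updates) / total
    b = sum(weights[1] * size for weights, size in updates) / total
    return w, b


# Each inner list stays on its own simulated device; the points lie on y = 2x + 1.
devices = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    [(1.0, 3.0), (3.0, 7.0)],
    [(0.5, 2.0), (1.5, 4.0), (2.5, 6.0), (3.5, 8.0)],
]

global_weights = (0.0, 0.0)
for _ in range(50):
    global_weights = federated_round(global_weights, devices)

print(global_weights)
```

Only the pair `(w, b)` ever leaves a device in this sketch. That is also exactly the channel through which information about the local data can leak, which is why such updates are often combined with noise or encrypted aggregation.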
Zero-Knowledge Proofs (ZKPs) and Blockchain Applications
Zero-knowledge proofs take yet another approach to preserving both privacy and utility. They allow one party to prove to another that a statement is true without revealing any information beyond the validity of that statement.
Using ZKPs, you could prove that you meet an age requirement without revealing your birthdate. Other applications could be proving that you possess a password without revealing that password or even reporting near misses on a construction site. Combining this technology with a blockchain opens up the way for anonymous yet verifiable transactions. A cryptocurrency like Zcash does something like this by confirming transactions without revealing the sender, recipient or amount transferred.
The main drawbacks of ZKPs are, again, computational efficiency and accessibility: in spite of standards like ZKProof, ZKP-based operations remain resource-heavy and largely opaque to non-cryptographers.
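For a feel of what such a proof looks like, the Python sketch below runs the Schnorr identification protocol, a classic interactive proof of knowledge of a secret number. It is zero-knowledge only against a verifier who picks challenges honestly. The group parameters are toy-sized assumptions and the function names are illustrative; real systems use much larger groups and, for blockchain use cases, non-interactive constructions.

```python
import secrets

# Toy group: P = 2 * Q + 1 with P and Q prime; G generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4


def keygen():
    """Return a secret x and the public value y = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def commit():
    """Prover picks a fresh random nonce r and sends t = G^r mod P."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)


def respond(x, r, challenge):
    """Prover answers the challenge; the random r masks the secret x."""
    return (r + challenge * x) % Q


def verify(y, t, challenge, s):
    """Verifier checks G^s == t * y^challenge (mod P) without learning x."""
    return pow(G, s, P) == (t * pow(y, challenge, P)) % P


def run_protocol(x, y):
    """Run one commit / challenge / response exchange between the two roles."""
    r, t = commit()
    challenge = secrets.randbelow(Q)  # chosen by the verifier
    s = respond(x, r, challenge)
    return verify(y, t, challenge, s)


secret, public = keygen()
print(run_protocol(secret, public))
```

The verifier only ever sees `t`, the challenge, and `s`. Because `r` is fresh and random for every run, `s` on its own is uniformly distributed and says nothing about the secret, yet a prover who does not know the secret fails the check for any non-zero challenge.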
What the Future May Hold
To talk about the future, we need to take into account the broader context. This is because, while technologies aimed at implementing the zero-trace paradigm will continue to improve by leaps and bounds, they don’t do so in a vacuum. It’s difficult to discuss personal data without mentioning data protection laws like the GDPR and CCPA.
Anonymization techniques can operate at cross purposes to such data privacy laws by limiting legally required transparency, for example. They can also be undermined, either directly or indirectly, by the vast stores of personal data – covering many of the same data subjects – held and distributed by data brokers.
In these ways, the purely technical aspects of personal data anonymization are made interdependent with broader legal and market trends.
Opinions expressed by DZone contributors are their own.