Privacy-Preserving Marketplace for Data
While big data, AI, and machine learning are making great strides, the ability to securely share the knowledge gleaned is still not quite available.
Data sharing is inevitable in the current digital ecosystem. Individuals and entities end up sharing data, knowingly or otherwise, related to identity, transactions, personal preferences, etc. This is aptly called ‘Digital trail or exhaust’ and needs to be carefully controlled. If privacy is not preserved, data sharing in the digital economy poses significant risks to individuals and organizations.
However, there are clear benefits to sharing data, such as better insights that result in superior services and more value for the data provider. A good example is the traffic congestion shown on Google Maps: the data required to determine congestion is provided by commuters through telecom operators. Internet companies such as Google, Facebook, and Twitter are well-known providers of value-added services at zero cost to customers. These organizations are effectively monetizing the data collected through these services.
Data should be seen as the raw material for the value-added services offered. The availability of raw data is a must-have for the current economy. Several data monetization models have emerged in which a ‘trusted third party’ ends up owning an incredible amount of data. Individuals and organizations providing the data are unfortunately not well informed about the mechanisms or the risks associated with this [i]. The concept of privacy-preserving data sharing may seem like a death blow to these businesses. Digital identity is a case in point.
This is changing very rapidly from both perspectives. On one side, governments across the globe are realizing the need to regulate this process by providing basic protection to individuals in the form of data privacy laws, or by creating infrastructure for data governance and authenticity (e.g. GDPR in the EU) [ii]. It is also becoming clear that the ‘trusted third parties’ cannot be trusted completely, as the data breaches at Equifax, the Sony PlayStation Network, and Target clearly highlight [iii]. The challenges arising from data breaches or single points of failure can have a devastating impact on the digital economy.
On the other side, recent developments in privacy-preserving anonymization techniques, big data, decentralized databases, distributed ledgers, and decentralized apps may provide the crucial tradeoff. This article is an attempt to connect the dots and propose a solution.
Privacy-Preserving Data Sharing
What is data privacy? Privacy is the privilege of having some control over how personal information is collected and used. Information privacy is the capacity of an individual or group to stop information about themselves from becoming known to people other than those to whom they give the information. One serious user privacy issue is the identification of personal information during transmission over the Internet.
It is worth understanding the difference between data privacy and security.
Privacy is the appropriate use of a user’s information.
Security is the confidentiality, integrity, and availability of data.
Privacy is the ability to decide what information from an individual goes where.
Security offers a confidence that these decisions are implemented.
Privacy is the consumer’s right to safeguard their information from any other parties.
Security provides the appropriate confidentiality to protect the data provider.
It is possible to have poor privacy and good security.
It is difficult to have good privacy without adequate security.
The payment processing industry has been at the forefront of this. The adoption of EMV, the PCI DSS standard, and tokenization are prominent examples.
Several privacy-preserving technology solutions have emerged. To name a few:
- Identity-based anonymization
This topic has been well researched, and several advanced algorithms are now available [iv].
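As an illustration of what anonymization can look like in practice, here is a minimal sketch of one widely used approach from that literature: generalizing quasi-identifiers and then checking k-anonymity (every released record shares its quasi-identifier values with at least k-1 others). The field names, the 10-year age bands, and the 3-digit ZIP prefix are assumptions chosen for the example, not something the article prescribes.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers and drop the direct identifier (name)."""
    low = (record["age"] // 10) * 10
    return {
        "age_band": f"{low}-{low + 9}",          # hypothetical 10-year bands
        "zip_prefix": record["zip"][:3] + "**",  # hypothetical 3-digit prefix
        "diagnosis": record["diagnosis"],        # sensitive value, kept as-is
    }


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Made-up sample data for illustration only.
raw = [
    {"name": "A", "age": 34, "zip": "41101", "diagnosis": "flu"},
    {"name": "B", "age": 37, "zip": "41105", "diagnosis": "asthma"},
    {"name": "C", "age": 52, "zip": "41230", "diagnosis": "flu"},
    {"name": "D", "age": 58, "zip": "41299", "diagnosis": "diabetes"},
]

released = [generalize(r) for r in raw]
k_raw = k_anonymity(raw, ["age", "zip"])
k_released = k_anonymity(released, ["age_band", "zip_prefix"])
print(k_raw, k_released)
```

In the raw data every (age, zip) pair is unique, so a single record can be singled out; after generalization each record shares its quasi-identifiers with one other record. A real deployment would also need to consider attacks that k-anonymity alone does not stop.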
Apart from privacy, there are several other features which those working in a data sharing infrastructure would love to have.
- Decentralized: avoids a single point of failure.
- Highly available.
- Provides data security.
- Control mechanism for data ownership and sharing (defines who owns the data and who is authorized to share after appropriate anonymization).
- Data Marketplace: structure for incentivizing data sharing and sharing costs.
- Directories and Token Vaults (where individuals can create tokens).
- Access based on privileges (e.g. special privileges to regulators and government agencies).
- Ability to run sophisticated data analytics and machine learning.
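The token vault and privilege-based access items above can be sketched together. The following is a minimal in-memory illustration, assuming a hypothetical `TokenVault` class and made-up role names; a production vault would need persistent, encrypted storage and proper authentication.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to real values."""

    def __init__(self, authorized):
        self._authorized = set(authorized)  # roles allowed to detokenize
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a stable random token that reveals nothing about the value."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(16)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token, role):
        """Only privileged roles (e.g. a regulator) may recover the value."""
        if role not in self._authorized:
            raise PermissionError(f"role {role!r} may not detokenize")
        return self._token_to_value[token]


vault = TokenVault(authorized={"regulator"})
card = "4111-1111-1111-1111"  # made-up sample value
token = vault.tokenize(card)

# Data consumers work with the token; a regulator can recover the original.
recovered = vault.detokenize(token, role="regulator")
try:
    vault.detokenize(token, role="analyst")
    analyst_blocked = False
except PermissionError:
    analyst_blocked = True
```

The token is random rather than derived from the value, so it cannot be reversed without access to the vault.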
Centralized vs Decentralized Approach
A centralized system has the following inherent drawbacks:
- Creates a single point of failure.
- Does not provide immutability, and thus changes can be made to the database.
- Can lead to monopoly and unfair pricing.
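The immutability drawback can be made concrete with a hash-chained log, the basic idea underlying distributed ledgers: each entry stores the hash of the previous one, so a silent edit to any earlier entry is detectable. This is a simplified single-node sketch with hypothetical function names, not the design of any particular ledger product.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log, payload):
    """Add an entry that is linked to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log):
    """Recompute every link; any edited entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"dataset": "traffic", "price": 10})
append(log, {"dataset": "identity", "price": 25})
ok_before = verify(log)

log[0]["payload"]["price"] = 1  # a silent edit to an earlier entry
ok_after = verify(log)
print(ok_before, ok_after)
```

On its own this only makes tampering evident. Preventing a single operator from rewriting the whole chain is what replication across independent participants adds.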
The advent of distributed ledgers (e.g. Hyperledger, BigchainDB), open source databases, and decentralized applications (DApps) provides the potential to create a decentralized infrastructure which can not only address the issues above but also provide the following features/advantages:
- Federation of permissioned members ensuring proof of stake.
- Open source technologies.
- Benign and Byzantine fault tolerant system.
- Ability to mitigate Sybil attacks.
- Create an incentive and cost-sharing mechanism using digital currency, assets, and smart contracts.
- Provide proof of process (e.g. proof of data validation and authenticity).
- Provide consent-based access to data.
These features can truly create a ‘marketplace for data’ [v] with multiple, competing data providers and data consumers. With the appropriate access mechanism and logic for de-anonymization, regulators and government agencies can have privileged access that provides the required visibility into this data (financial crime investigation [vi], anti-money laundering, etc.). Banks, insurance companies, and other e-commerce players can participate as data providers as well as consumers [vii].
The proposed data infrastructure can also create a mechanism which allows individuals to share digital content selectively and securely with other participants with complete control over what they share.
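Selective, owner-controlled sharing of this kind can be sketched as a consent registry that filters each record down to the fields an owner has granted to a given consumer. The class name, participant names, and fields below are hypothetical; in the proposed infrastructure this logic would more plausibly live in a smart contract or access-control layer.

```python
class ConsentRegistry:
    """Hypothetical registry of which fields each owner shares with whom."""

    def __init__(self):
        self._grants = {}  # (owner, consumer) -> set of permitted field names

    def grant(self, owner, consumer, fields):
        self._grants.setdefault((owner, consumer), set()).update(fields)

    def revoke(self, owner, consumer):
        self._grants.pop((owner, consumer), None)

    def view(self, owner, consumer, record):
        """Return only the fields the owner has consented to share."""
        allowed = self._grants.get((owner, consumer), set())
        return {k: v for k, v in record.items() if k in allowed}


registry = ConsentRegistry()
profile = {"name": "Asha", "dob": "1990-01-01", "income": 52000}  # sample data

registry.grant("asha", "bank_x", ["name", "dob"])
shared = registry.view("asha", "bank_x", profile)
unknown = registry.view("asha", "insurer_y", profile)

registry.revoke("asha", "bank_x")
after_revoke = registry.view("asha", "bank_x", profile)
```

The default is to share nothing: a consumer without an explicit grant, or one whose grant was revoked, sees an empty record.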
A privacy-preserving, decentralized infrastructure can provide avenues to monetize data via value added services such as a data analytics computation platform.
The section below provides a logical and conceptual/technical view of the proposed solution. It is assumed that there will be a federated model (no need for a trusted third party) of participants who join based on certain agreed-upon protocols. Initially, a lead agency would need to create the charter for the participants. Data providers and consumers will need to be ‘permissioned’ to join the consortium.
The role of Lead Agency can be performed by regulators/government agencies to kick-start the process and ensure that the charter created for the consortium is both fair and complies with regulations or the law of the land.
Technology Stack Options
Here is a view of the technology stack which is [viii]:
Case for Decentralized, Shared Service for Identity Management
For the digital ecosystem to survive, a robust digital identity management system is a must. A comprehensive digital identity management system with broad coverage is still elusive. Identity management systems need to store various attributes related to identity.
Some important characteristics of the desired solution are as follows:
- Privacy-preserving and secure.
- Avoids single point of failure.
- Ability to manage:
  - Ownership of data.
  - Data validation processes.
  - Selective and secure sharing of data.
  - Comprehensive attributes.
  - The lifecycle of identity management.
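One way to combine data validation with selective sharing of identity attributes is salted hash commitments: a validator publishes a hash per attribute, and the owner later reveals only chosen attributes together with their salts. The sketch below is an assumption about how this could be done, with hypothetical helper names; it omits the validator's signature over the commitments, which a real system would need.

```python
import hashlib
import hmac
import json
import secrets


def _digest(name, value, salt):
    """Hash one attribute with its salt, using an unambiguous encoding."""
    data = json.dumps([salt, name, value]).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def commit(attributes):
    """Return (public commitments, private salts) for all attributes."""
    salts = {name: secrets.token_hex(16) for name in attributes}
    commitments = {n: _digest(n, v, salts[n]) for n, v in attributes.items()}
    return commitments, salts


def disclose(attributes, salts, names):
    """The owner reveals only the chosen attributes and their salts."""
    return {n: (attributes[n], salts[n]) for n in names}


def verify(commitments, disclosure):
    """A consumer checks revealed values against the published commitments."""
    return all(
        n in commitments and hmac.compare_digest(commitments[n], _digest(n, v, s))
        for n, (v, s) in disclosure.items()
    )


identity = {"name": "Asha", "dob": "1990-01-01", "tax_id": "X123"}  # sample data
commitments, salts = commit(identity)

proof = disclose(identity, salts, ["name", "dob"])  # tax_id stays hidden
valid = verify(commitments, proof)

forged = {"dob": ("1985-01-01", salts["dob"])}
forged_valid = verify(commitments, forged)
```

The per-attribute salt keeps undisclosed values from being guessed by hashing likely candidates against the published commitments.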
There has been rapid advancement in the development of wide-coverage identity management systems. Several organizations, both private and state controlled, have emerged in the last couple of years. Notable examples are the Aadhaar (UIDAI) database in India [ix] and private organizations such as IdentityMind Global, Trulioo, etc. [x]
These organizations are providing critical solutions for identity verification, fraud detection, P2P payments, etc. While these platforms are a definite improvement over the current solutions, there is a need for further improvements.
Platforms such as KYC Chain [xi] have attempted to provide some of those improvements by creating a potential decentralized platform.
A decentralized data marketplace for identity could be a long-term solution providing wide coverage. Such common identity data can help address key issues faced by governments, regulators, banks, and law enforcement agencies without compromising data privacy.
Case for Decentralized Data Sharing for Health Care Industry
The availability of quality healthcare data can do wonders for society. Data such as genome data can provide insights which can benefit individuals, research organizations, and pharma companies.
Collecting genomic data through genome sequencing and cheaper “SNP arrays” is important both for scientific research and for commerce involving genome sequencing and human health. It is potentially of particular benefit to personal genomic medicine. While numerous databases already exist to capture genomic data and to use it in science and commerce, current schemes to accumulate and distribute that data are insufficiently secure (or just altogether open!).
This applies not just to genomes or individual DNA sequences, but also to other health-related information. Such data can support preventive care and the delivery of critical medical help.
Privacy preservation and security are equally critical elements of this solution. Any breach or misuse of medical information can be disastrous.
Availability of this data along with powerful analytical and machine learning algorithms can make this platform extremely useful to individuals, the healthcare industry, insurance companies, and government.
Several decentralized solutions are emerging. One notable example is “Gene-Chain”, a solution by Encrypgen for enhancing privacy, security, and utility in genomic databases [xii].
Realizing the benefits of big data, machine learning, and AI assumes the availability of quality data covering a large population. A decentralized, privacy-preserving data marketplace can address several issues in sharing data without compromising privacy. The development of various technologies related to anonymization, distributed ledgers, and distributed databases provides the potential to deliver this.
[i] Battery Status Not Included: Assessing Privacy in Web Standards, by Arvind Narayanan, Steven Englehardt, et al. (2016)
[iii] http://www.nytimes.com/2011/05/01/business/01stream.html?& New York Times article by Natasha Singer
[iv] Big Data Privacy: A Technological Perspective and Review, by Priyank Jain, Manasi Gyanchandani, and Nilay Khare (2016); Enhancing Cloud Security Using Data Anonymization, by Jeff Sedayao (2012); Hitachi Develops Technology to Anonymize Encrypted Personal Data (2016); Enigma: Decentralized Computation Platform with Guaranteed Privacy, by Guy Zyskind, Oz Nathan, and Alex 'Sandy' Pentland (2016); A Precautionary Approach to Big Data Privacy, by Arvind Narayanan, Joanna Huey, and Edward W. Felten (2015)
[v] https://dataprivacylab.org/people/sweeney/new.html by Dr. Latanya Sweeney (2016)
[vii] Could banks be the consumers’ data champion? BANKNXT by Chris Skinner
[viii] BigchainDB: A Scalable Blockchain Database, by Trent McConaghy, Rodolphe Marques, Andreas Muller, and others (2016); Enhancing Cloud Security Using Data Anonymization, by Jeff Sedayao (2012); Privacy-Preserving Machine Learning Algorithms for Big Data Systems, by Kaihe Xu, Hao Yue, Linke Guo, Yuanxiong Guo, and Yuguang Fang (2015)
[ix] (https://uidai.gov.in/, n.d.)
[x] https://www.identitymindglobal.com/ , https://www.trulioo.com/
[xi] (https://kyc-chain.com/, n.d.)
[xii] (https://www.encrypgen.com/, n.d.)
Published at DZone with permission of Mohan Joshi. See the original article here.
Opinions expressed by DZone contributors are their own.