Can Artificial Data Help With Data Privacy?

Do you think artificial data holds the key to the long-awaited data liberation? Before you answer, check out this research from MIT.

Recent work at MIT is opening up a new path for solutions that reconcile data privacy with data exploitation. The research indicates that companies can now let their data science and advanced analytics teams work with artificial data. This could advance research in areas as varied as healthcare, climate studies, transportation, energy, and finance by using technologies suited both to unprecedented amounts of data and to the new demands of data privacy rules.

At a time when digital transformation is a big part of most major companies’ road maps, data plays a major role at the heart of their business development strategies, both through the technologies it involves (Hadoop, NoSQL, in-memory computing, cloud computing, deep learning, etc.) and through the immense business opportunities it offers (new business models, cost rationalization, dealing with competition, adapting to market demand, etc.). These companies fully recognize that their potential for innovation and improvement, and their ability to reinvent themselves and survive, depend on their ability to adopt data analytics solutions.

An Exabyte Represents Approximately 36,000 Years Spent Watching HD Videos

Every day, we produce 2.5 exabytes of data, roughly what could be stored on 40,000,000 smartphones of 64 GB each.

By 2025, the total amount of data might reach 163 zettabytes. It becomes almost impossible to conceive of such a huge figure, but let’s still try. A standard USB stick holds 32 GB. One zettabyte represents 1,000,000,000,000 GB, or about 31.25 billion such USB sticks. How can we cope with the challenge that this volume of data represents for finding insights? How can we make sure that we take advantage of the value of this data?
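The orders of magnitude above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, that is, 1 GB = 10^9 bytes, 1 EB = 10^18 bytes, and 1 ZB = 10^21 bytes; the helper name `devices_needed` is purely illustrative.

```python
# Decimal (SI) storage units, expressed in bytes (assumption: not binary GiB/EiB/ZiB).
GB = 10**9
EB = 10**18
ZB = 10**21

def devices_needed(total_bytes: int, device_capacity_gb: int) -> int:
    """Number of devices of the given capacity needed to hold total_bytes, rounded up."""
    capacity = device_capacity_gb * GB
    return -(-total_bytes // capacity)  # ceiling division on integers

daily_bytes = 25 * EB // 10                      # 2.5 EB produced per day
phones_per_day = devices_needed(daily_bytes, 64)  # 64 GB smartphones
gb_per_zettabyte = ZB // GB
sticks_per_zettabyte = devices_needed(ZB, 32)     # 32 GB USB sticks
sticks_for_2025 = devices_needed(163 * ZB, 32)

print(f"64 GB phones filled per day: {phones_per_day:,}")
print(f"GB in one zettabyte:         {gb_per_zettabyte:,}")
print(f"32 GB sticks per zettabyte:  {sticks_per_zettabyte:,}")
print(f"32 GB sticks for 163 ZB:     {sticks_for_2025:,}")
```

Under these assumptions, the daily 2.5 exabytes fill a little over 39 million 64 GB phones, which matches the rounded figure of 40 million.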

We need to start by accepting that simply placing more data in a storage system creates no value. All it does is fill a data swamp instead of building a data lake.

As a result, 60% of big data projects are expected to fail and end up abandoned, yet there’s tremendous potential value buried in a firm’s data ecosystem. Often, this value can’t be capitalized upon because it has yet to be recognized. Keep in mind that 90% of data is really fresh: it is less than two years old. We know that the biggest companies are leading big data analytics projects in-house to capitalize on their data insights, yet a survey of 1,800 European and North American companies revealed that only 4% of these initiatives are actually successful.

I believe that cognitive computing may be a new resource that can help us tackle this issue. Cognitive computing opens up a new era of appealing prospects, including with regard to data privacy protection and confidentiality.

Cognitive Computing Is the Simulation of the Human Thinking Process in a Computer Model

In fact, more and more companies are under pressure to deliver immediate insight from good data in order to beat the competition. As a result, many of them are turning to the development of cognitive information systems.

It’s all about “making good use of data in order to design intelligent and self-learning systems that will support humans by using complex thinking, searching, and analysis techniques to identify and suggest credible operational choices.”

Cognitive systems are complex information processing solutions capable of acquiring, operating on, and passing along knowledge to humans or to other systems. At the same time, they can take action on perceived (seen, heard, or felt), calculated (mathematical), and reasoned conclusions derived from the same data. These systems rely on a wide range of scientific disciplines, including linguistics, neuroscience, and artificial intelligence.

Going forward, these systems will have to take data privacy restrictions into account, as privacy is becoming a major issue for companies around the globe.

Data Teams Now Have to Address the Severe Constraints Raised by Data Privacy

International data protection rules, and more specifically the European regulation (GDPR), are forcing companies to define what information is shared with which groups while also protecting personal information against the risks of theft, disclosure, or any other type of compromise. To comply with this regulation, data scientists, developers, and even business specialists should not work on real data, personal/identifiable data, or sensitive data.

But if we do that going forward, how do we extract value from the data that’s left to us? How do we design models? And how do we make forward-looking predictions?

Data masking, a technique that replaces original characters with random ones, could begin to address this issue. Eight out of ten companies operate “home-grown” data masking tools in order to protect their sensitive data. These tools generally rely heavily on encryption, combination, and substitution methods. Even if these techniques do make data unintelligible and help to protect privacy (for bank account numbers, for example), applying them across an entire data warehouse is barely feasible. It would make the data unusable for analysts and hinder the building of predictive models.
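To make the idea of character substitution concrete, here is a minimal sketch of format-preserving masking. It is an illustration under assumptions, not any particular vendor's tool: the function name `mask_value` and the sample account number are made up, and only ASCII letters and digits are replaced.

```python
import random
import string

def mask_value(value: str, rng: random.Random) -> str:
    """Replace every ASCII letter and digit with a random one of the same kind.

    Separators such as spaces or dashes are kept so the format stays recognizable.
    """
    masked = []
    for ch in value:
        if ch in string.digits:
            masked.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            masked.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            masked.append(rng.choice(string.ascii_lowercase))
        else:
            masked.append(ch)  # keep spaces, dashes, and any other character as-is
    return "".join(masked)

rng = random.Random(2024)  # fixed seed only to make the sketch reproducible
account = "FR76 3000 6000 0112 3456 7890 189"  # made-up account number
masked_account = mask_value(account, rng)
print(masked_account)
```

The masked value keeps the shape of the original but carries none of its meaning. Because each call draws fresh random characters, the same account masked in two different tables will generally no longer match, which hints at why masking a whole data warehouse gets in the way of analysis and modeling.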

The best solution lies elsewhere. MIT researchers continue to open new doors in this area:

“Companies can now take their data warehouses or databases and create synthetic versions of them.” K. Veeramachaneni, Laboratory for Information and Decision Systems (LIDS), MIT. 

In the Synthetic Data Vault (SDV) white paper, MIT researchers and members of the Data to AI lab describe a machine learning system that produces synthetic data, that is, artificial data.

The idea, therefore, is to get rid of the privacy constraints attached to real data so that data workers (i.e., data scientists, developers, analysts, and statisticians) can fully harness the data for all kinds of testing, modeling, and analysis, and even share it with a third party.

Their approach involves modeling databases with the aim of producing samples, data sets, and even comprehensive databases made of artificial data. The goal is to generate data that has the same properties as the original data but is free of the specifics that make it personal, sensitive, or private.

This approach goes beyond previous research in this area, which was limited to building samples and statistics. Those earlier methods narrowed the range of possible applications, since the resulting data suffered from a lack of diversity, richness, and volume.

The SDV Creates Data Models to Generate Synthetic Databases

The SDV iterates through the many possible relations in the database to create a model for the entire database. A new multivariate modeling approach is then used to model the data.
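The article does not detail the SDV's internals, so the following is only a toy illustration of the general principle of "fit a multivariate model, then sample artificial rows from it". It is not the SDV's algorithm or API. It assumes a single table with two numeric columns, fits a bivariate Gaussian, and draws synthetic rows from it; the "real" table is itself generated for the example, and all names are hypothetical.

```python
import math
import random

def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n artificial rows that follow the fitted distribution."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 and z2 gives y the fitted correlation with x.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

rng = random.Random(42)  # fixed seed only to make the sketch reproducible

# Stand-in for a "real" table of (age, income); these values are invented.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    income = 1000 * age + rng.gauss(0, 8000)
    real.append((age, income))

model = fit_bivariate_gaussian(real)
synthetic = sample_synthetic(model, 2000, rng)
synthetic_model = fit_bivariate_gaussian(synthetic)

print(f"real correlation:      {model['rho']:.2f}")
print(f"synthetic correlation: {synthetic_model['rho']:.2f}")
```

The synthetic rows are new draws rather than copies of the original records, yet an analyst studying the relationship between the two columns should reach similar conclusions on either table. A real system would also have to handle categorical columns, non-Gaussian distributions, and relations across tables, which this sketch ignores.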

This solution was tested in practice. About 40 data scientists, divided into various working groups, developed predictive models on either real data or synthetic data. Up to 70% of the features that data scientists developed for predictive models using only synthetic data performed the same as or better than those developed using the original dataset.

This demonstrated that synthetic data can successfully be used in place of real data.

Today, generating artificial data could be a powerful tool for addressing data privacy issues. It could also end up being an important asset in successfully preparing and implementing big data analytics projects, since it makes it possible to synthesize small and large data volumes for accurate testing without distorting the data. This fits well with data scientists’ and analysts’ expectations.

Do you think artificial data holds the key to the long-awaited data liberation?

Shout out to Niki for your help. Thank you so much for your support.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
