Remember when "big data" meant a terabyte or so? If you don't, I suspect I know why — you blinked... and you missed it.
It's hard to believe there was a time when the total volume of customer data coursing through the networks of United Airlines was a couple of terabytes. Today, a decent desktop solid state drive stores that much. Today's predictive analytics projects are dealing with petabytes or exabytes of data, and I've seen predictions that by 2025 human genome sequencing will be working with zettabytes of data per year (a zettabyte is 10^21 bytes).
For some reason, as big data grew and matured and hit the Gartner Hype Cycle, people decided that only words starting with the letter "V" could be used to describe data. It started with the "three Vs" — volume, velocity, and variety. Then, someone added variability. Then veracity. Then validity, and vulnerability, and volatility, and on and on.
Today, it's time to add another V, but this one deserves more serious contemplation than the previous new Vs of big data. It's time to talk about the "virtue" of data and ask, Is the usage of this data virtuous? This impacts far more than characteristics or capabilities of those teeny bits of digital information. We're talking about the ethics of how all that data is used and when it may not be appropriate to use it.
Consider just a few examples of the scenarios that are emerging — both real and hypothetical.
The Target Pregnancy Fiasco
If you haven't heard about this story (where have you been?), it's a doozy... but also a perfect example of the "data virtue" problem we're facing.
A few years back, Target built an analytical model for predicting pregnancy — almost too good, if you ask me. A data scientist identified 25 products that, when purchased together, indicated a woman might be pregnant. From a business perspective, this was great information! It meant Target could send personalized promotions and transform this likely-pregnant individual into a solid customer for years.
But as Target learned the hard way, this use of big data could also lead to inadvertent exposure of private information. A man walked into a Target store furiously clutching coupons that had been mailed to his teenage daughter, congratulating her on her pregnancy and offering discounts on diapers! She actually was pregnant, and that was unwelcome news to this grandfather-to-be. It was a public relations disaster for Target, and it raised serious questions about unintended consequences of big data analytics.
The Preexisting Condition Conundrum
Amid all the bickering and political backstabbing going on in the U.S. Congress today over health care, an interesting new question has been ignored: What should be done about preexisting health conditions that are only predicted? For example, what happens if an insurance company uses big data to build a predictive analytics model that determines a customer is likely to develop an illness or suffer a catastrophic event, such as cardiac arrest, and then uses that information to deny coverage? The discussion today focuses only on preexisting conditions we already know about.
The Genomic Editing Issue
As we get closer and closer to unlocking the human genome, it's only a matter of time before big data and predictive analytics can also be applied to sort out possible outcomes for unborn babies. It's a no-brainer that we'd want to start identifying the indicators for autism, Alzheimer's, and other chronic ailments such as psoriatic arthritis. Yet the question arises: Why stop there? As a culture, do we also begin addressing less critical concerns such as physical appearance, skin and hair color, and so on? The data will be there, the answers will be there; it's up to us to decide how and when to use them.
Organ Donation Analyzed
Let's also consider something on the more positive side of the spectrum: using analytics to increase the effectiveness of organ donation. The United Network for Organ Sharing (UNOS), a Talend customer, is using data and algorithms to optimize the matching of patients with transplantable organs. Analytics lets physicians compare an organ's history and other vital data with each patient's history and vital data so that they can make a better decision. In this case, the "data virtue" question is not whether to use the technology but how best to expand it to other areas of patient care.
Let's Talk About This Before It Becomes Taboo
It's not up to me or Talend or the open-source community to shape opinions about how to deal with the complex ethical issues emerging with the growth of big data analytics. But I think it's time to start an honest discussion.
I believe we're already starting to see a formal government response to the societal backlash over questions of data virtue (GDPR, for example), and as the scope of what big data can unlock enters its renaissance over the next couple of years, this topic is going to become even more critical. At the end of the day, we must remember that customers are providing this information, sometimes very personal information, for something in return, and when that value doesn't match up, we have a serious problem.
Part of the way we can combat this is by highlighting the virtuous uses of big data, because the media cycles tend to highlight only the negative. To do that, we'd like to hear from you. Please send your thoughts, ideas, or examples of the ethical dilemmas you're wrestling with. We'll share them with the community and our customers and do some serious thinking about how to maximize the virtues of what we're all engaged in. In the meantime, I'll be sure to start championing the value of using big data the right way.