Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Unstructured Data Is an Oxymoron

DZone's Guide to

Unstructured Data Is an Oxymoron

People tend to use the phrase unstructured data to mean non-tabular data. But non-tabular data still has a structure. So what exactly is unstructured data?

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Strictly speaking, “unstructured data” is a contradiction in terms. Data must have structure to be comprehensible. By “unstructured data” people usually mean data with a non-tabular structure.

Tabular data is data that comes in tables. Each row corresponds to a subject, and each column corresponds to a kind of measurement. This is the easiest data to work with.

Non-tabular data could mean anything other than tabular data, but in practice it often means text, or it could mean data with a graph structure or some other structure.

More Productive Discussions

My point here isn’t to quibble over language usage but to offer a constructive suggestion: say what structure data has, not what structure it doesn’t have.

Discussions about “unstructured data” are often unproductive because two people can use the term with two different ideas of what it means, and think they’re in agreement when they’re not. Maybe an executive and a sales rep shake hands on an agreement that isn’t really an agreement.

Eventually, there will have to be a discussion of what structure data actually has rather than what structure it lacks, and to what degree that structure is exploitable. Having that discussion sooner rather than later can save a lot of money.

Free Text Fields

One form of “unstructured” data is free text fields. These fields are not free of structure. They usually contain prose, written in a particular language, or, at most, in a small number of languages. That’s a start. There should be a more exploitable structure from context. Is the text a pathology report? A Facebook status? A legal opinion?

Clients will ask how to de-identify free text fields. You can’t. If the text is truly free, it could be anything, by definition. But if there’s some known structure, then maybe there’s some practical way to anonymize the data, especially if there’s some tolerance for error.

For example, a program may search for and mask probable names. Such a program would find “Elizabeth” but might fail to find “the queen.” Since there are only a couple queens [1], this would be a privacy breach. Such software would also have false positives, such as masking the name of the ocean liner Queen Elizabeth 2. [2]

Notes

[1] The Wikipedia list of current sovereign monarchs lists only two women, Queen Elizabeth II of the UK and Queen Margrethe II of Denmark.

[2] The ship, also known as QE2, is Queen Elizabeth 2, while the monarch is Queen Elizabeth II.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
big data ,data structure ,unstructured data ,structured data

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}