{{announcement.body}}
{{announcement.title}}

What Mick Jagger Taught Me About Data Extraction From Tables

DZone 's Guide to

What Mick Jagger Taught Me About Data Extraction From Tables

In this article, we work to better understand how to better extract data from tables and find clear insights.

· Big Data Zone ·
Free Resource

If you try sometime, you might find you get what you need.

You can’t always get what you want.

You can’t always get what you want.

You can’t always get what you want.

But, if you try sometime, you might find…

You get what you need.

Last night, I went to a reception. And, when I saw this woman with a glass of wine in her hand, I immediately thought of how wrong The Rolling Stones were.

Specifically, I considered how wrong they were when it comes to extracting data from tables.

See, you CAN always get what you want... AND what you need. Faster.

Even if it doesn’t seem like you can.

Even if you have data that doesn’t seem consumable.

But, first... you have to truly understand the problem you’re up against.

Let’s face it. Extracting information from tables to feed an automation process is as complex as it is difficult.

When even human readers struggle to understand the information presented in tables... when most are challenged if performing a mental operation (like adding) is needed to grasp all the necessary information... it may seem like Mick and friends were right. 

“You can’t always get what you want” is further validated when we recognize that automating the information extraction process from tables is a challenge few have fully conquered.

Why? It’s because when it comes to tables: 

  • Data is difficult to extract.
  • The extracted data is rarely consumable.
  • Information is not uniform throughout the table.
  • Extracted information can be missing or invalid.
  • Contextual information is usually lost.

As a result, tables almost always get the manual workflow treatment. That’s you trying sometimes —getting workers who could be doing more valuable work to extract — so you get what you need — the data.

But, what if there was a way to get both?

What if you could always get what you want when it comes to

extracting all that valuable data trapped in tables?

The way to address any challenge is not just to identify the problem but truly understand it.

So let’s dive in on how to extract data from tables, then clean and transform that data so it can be consumed by an automation system or an ML platform.

That way, you’ll get what you want — the speed and cost-savings of automation — and what you need —the extracted gold trapped in your tables.

What Is a Table?

I know. Sounds silly, right? That is, until you consider that there’s no standard for what makes a table...no universal definition.

Tables are an intuitive and universal way of presenting large sets of data, findings, and information.  

What’s more? A table is more than its data.

Tables contain a variety of data and information (e.g. words, digits, formulas, or images) and are embedded in a variety of document types (e.g. plain text, image, handwritten, or web pages).

But wait, there’s more!

We can’t forget relationships and context.

And there’s where tables are unique. 

A table displays multi-dimensional information using a two-dimensional, linear format. It’s a set of data and data context presented in a non-standard format.

Because of this unique feature — displaying not only sets of information, but their relationships and context — tables present a challenge for data extraction. And, that challenge expands when you consider there’s no universal format for a table: You might have three suppliers send you three different tables in three different formats...all trying to measure and show the same thing.

The Challenges of Tables

So let’s call out all the challenges tables present. That way, we can see how they may be overcome. That way, we can see if we CAN get what we want, as well as what we need.

Challenge 1: No standard structural layouts or visual relationships

The structure of a table is determined by the structure and the relationships of its cells.

Tables capture multi-dimensional information using a two-dimensional, linear format. There is no standard table layout. For example:

  • How lines are used.
  • How formatted (e.g., bold or italicized) text is used.
  • Header location: Headers can be in two places — the top row or the first column.
  • Use of borders: A table have or may not have a border, which makes it hard to locate
  • Variation in styles: separators for cells, rows and columns
  • Nested tables: one table inside another
  • Cells spanning across columns and rows show hierarchical grouping of data
  • Word wrap and column merge allows content to span across multiple lines and cells
  • Multi-page tables that are used for long data displays. In some cases, headers repeat
  • Tables that float and have text around them

Challenge 2: Data visualization for humans…only

Formatting is for visual consumption by humans, not technology. Often, table design is poor. Consider the header rows that aren’t explicit, but rather implied based on the table or the supporting text.

Challenge 3: Cell content that uses various formats (letters, numbers, symbols, etc)

When cell values are presented using different syntactic representation patterns — like symbols, images, text, abbreviations, or mathematical notations — extraction requires knowledge of all possible presentation patterns.

Challenge 4: Multiple languages in cell content

The table and its cells may use different languages or domain-specific jargon.

Challenge 5: Cell content varies in density and formats

The content of the cells can be numbers or text. But what happens when cell content is dense, containing ambiguous, short chunks of text with the use of acronyms and abbreviations? To decode tables, text must be made more clear with abbreviations and acronyms fully defined.

Challenge 6: Document types may vary

A document and table can be in a PDF, text, image, HTML or other format. Some formats are more challenging than others. For example, the PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis.

How to Get What You Want...AND What You Need

This list may not be all-inclusive, but it’s a good start. When it comes to data extraction, tables are tough.

When you want to free your workforce to focus on more value-based action (and when DON’T you want that?), you want to be able to enable the automation and ML processes to take the action instead.

To do this takes just four steps.

Once you understand the challenges tables present...and how to overcome them, getting your prescription filled is a whole lot better than standing in line at the Chelsea drugstore with that Mr. Jimmy character.

Topics:
automation, big data, data extraction, table

Published at DZone with permission of Artificial Intelligent . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}