
How to Tackle Big Data With Natural Language Processing


Big data is daunting and can have a lot of insight buried inside it. NLP can help by teaching machines to analyze large datasets.


Natural language processing (NLP) is an exciting frontier of research that products such as Siri, Alexa, and Google Home have tapped into to bring a new level of interaction to their users. To get the most out of NLP, we must look at how this kind of processing can help us, what we intend to gain from it, and how we get from raw data to the final product. If you're only just beginning to look at NLP, it can be overwhelming, but by breaking the process down into more manageable parts, we can navigate the topic with ease.

Starting With the Basics

The basic processing we're looking at is how to turn regular, everyday text into something a computer can understand. From it, we can extract things like jargon, slang, and even the speaker's style. This processing takes the Unicode characters and separates them into words, phrases, sentences, and other linguistic units using techniques such as tokenization, decompounding, and lemmatization. With these strategies, we can start to pick apart the language and even determine which language it is from the words, spelling, and punctuation present. Before we can build the language up for use, we must first break it down and analyze its component parts so we can understand how it works.
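
As a minimal sketch, here is how that breakdown step might look in Python with NLTK; the library choice, the sample sentence, and the model downloads are illustrative assumptions rather than anything prescribed by this article:

```python
# A minimal sketch of the basic breakdown step using NLTK (assumed available):
# sentence splitting, tokenization, and lemmatization on a toy string.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time downloads of the data NLTK needs (safe to re-run).
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The servers were running slowly. Users reported errors while logging in."

sentences = sent_tokenize(text)                  # split into sentences
tokens = [word_tokenize(s) for s in sentences]   # split each sentence into words

lemmatizer = WordNetLemmatizer()
lemmas = [[lemmatizer.lemmatize(tok.lower()) for tok in sent] for sent in tokens]

print(sentences)
print(lemmas)
```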

Figuring Out the Scope

Looking at a large block of text can make it difficult to determine what exactly the text is about, even for a human. Do we need to know the general gist of the text, or is it more prudent to figure out what's being said within the text body itself? This is the distinction between macro understanding and micro understanding. NLP is limited by cost and time factors, and certain levels of processing are simply not feasible within those constraints. Once we have an idea of what scope we're aiming for, we can move on to extraction.
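
To make the distinction concrete, here is a hedged sketch (the sample sentence is made up) showing the same text viewed both ways: the macro view throws away word order, while the micro view keeps tokens in sequence for later structural analysis:

```python
# Macro understanding discards word order (a bag of words is enough for gist),
# while micro understanding keeps the ordered token sequence.
from collections import Counter

text = "Acme Corp acquired Beta Labs for $2 million in March."
tokens = text.rstrip(".").split()

# Macro view: an unordered term histogram, enough for classification or topics.
macro_view = Counter(tok.lower() for tok in tokens)

# Micro view: the ordered token sequence, needed for entity or relation extraction.
micro_view = list(enumerate(tokens))

print(macro_view.most_common(3))
print(micro_view)
```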

Extraction of Content for Processing

Macro understanding gives us the general gist of the document we're processing. We can use it for classification, topic extraction, summarization of legal documents, semantic search, duplicate detection, and keyword or key phrase extraction. If we're after micro understanding, we can use processing to read deeper into the text itself and extract acronyms and their meanings or the proper names of people and companies. In micro understanding, word order is extremely important and must be preserved.
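
As one possible sketch, spaCy can cover both levels; this assumes the en_core_web_sm model has been installed (python -m spacy download en_core_web_sm), and the input sentence is invented for illustration:

```python
# A minimal sketch of both extraction levels with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp acquired Beta Labs for $2 million in March, CEO Jane Doe said.")

# Micro understanding: named entities, where word order matters.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Macro understanding: rough key-phrase candidates from noun chunks.
keyphrases = {chunk.text.lower() for chunk in doc.noun_chunks}
print(keyphrases)
```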

Back Trace Availability

Once we've extracted data from a particular document, we'll want to make sure we know where that data came from. Having a link back to the source document can save lots of time in the long run. This tracing helps track down possible errors in the text, and if one of those source documents is updated to a newer version, the changes can be reflected in the extracted information with a minimum of reprocessing, which saves time and processing power.
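
One way to keep that back trace is to attach source and span metadata to every extracted item. This is a hedged sketch; the Extraction record and the toy extractor are hypothetical, not a specific tool's API:

```python
# Every extracted item carries a reference to its source document and the
# character span it came from, so a newer document version only triggers
# reprocessing of the affected records.
from dataclasses import dataclass

@dataclass
class Extraction:
    value: str          # the extracted text, e.g. an entity or key phrase
    source_id: str      # identifier or URL of the source document
    source_version: str
    char_start: int     # offset of the match inside the source text
    char_end: int

def extract_names(doc_id: str, version: str, text: str) -> list[Extraction]:
    # Toy "extractor": records every capitalized word with its provenance.
    results = []
    for word in text.split():
        clean = word.strip(".,")
        if clean.istitle():
            start = text.find(word)
            results.append(Extraction(clean, doc_id, version, start, start + len(clean)))
    return results

print(extract_names("doc-42", "v2", "Acme Corp filed its report in March."))
```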

Human Feedback

The best way to develop an NLP system that adapts is to teach it to listen to feedback from the people who created the language in the first place: humans themselves. Feedback from people about how an NLP system performs should be collected and used to steer it toward what we want it to do.
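
A hedged sketch of such a feedback loop might look like the following; classify, record_feedback, and retrain are placeholder names rather than a real library's API:

```python
# Reviewer corrections are stored alongside the model's predictions and
# periodically folded back into the training set.
feedback_log = []

def classify(text: str) -> str:
    # Stand-in for the real model; always guesses "other" here.
    return "other"

def record_feedback(text: str, predicted: str, corrected: str) -> None:
    feedback_log.append({"text": text, "predicted": predicted, "label": corrected})

def retrain(training_data: list[dict]) -> None:
    # In a real system this would refit the model on original data plus corrections.
    print(f"retraining on {len(training_data)} labeled examples")

prediction = classify("Invoice #1234 is overdue")
record_feedback("Invoice #1234 is overdue", prediction, corrected="billing")

if len(feedback_log) >= 1:   # in practice, batch by size or on a schedule
    retrain(feedback_log)
```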

Keeping Ahead of the Curve

Constant quality analysis is crucial to ensuring that an NLP system fulfills its role and adapts to the world around it. Creating an NLP system is essentially teaching a computer how to learn from its mistakes and how to gather feedback to improve itself. By itself, big data is daunting and repetitive and can have a lot of insight buried inside it. By developing an NLP system, you give a computer a task that it is well suited to do while at the same time teaching it to think like a human in its extraction process. It's the best of both worlds.
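
One simple way to run that quality analysis (a sketch with made-up labels) is to score the system's predictions against a small hand-labeled sample and watch precision and recall over time:

```python
# Compare the system's output against a hand-labeled sample and compute
# precision/recall for one label of interest.
gold      = ["billing", "support", "billing", "other", "support"]
predicted = ["billing", "billing", "billing", "other", "support"]

label = "billing"
tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(f"precision={precision:.2f} recall={recall:.2f}")
```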
