Apache NiFi + Apache OpenNLP With Organizations and Flow Files

Using a custom processor to extract named entities (names, locations, dates, and organizations) from FlowFile content as it streams through Apache NiFi.

Updating the Apache OpenNLP Community Apache NiFi Processor to Support Flow Files

In this new release, we add the ability to read content from the FlowFile and analyze it for locations, dates, organizations, and names. We use the pre-trained Apache OpenNLP 1.5 models that are available for download. They do a decent job, and you can build new models as needed. I also changed the processor to output one attribute per entity type, each holding a String list of the locations, organizations, dates, or names found.
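To illustrate the "one attribute per type" output described above, here is a minimal sketch in plain Java. It is not the processor's actual code: the class `EntityAttributes`, the method `toAttributes`, and the `nlp_` + type naming rule are assumptions made for illustration (only `nlp_names` is named in this article), and the entity lists are assumed to have already been produced by the OpenNLP models.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: collapses per-type entity lists into one attribute per type. */
class EntityAttributes {

    /**
     * Builds one attribute per entity type, for example
     * "nlp_names" -> "Ada Lovelace,Alan Turing".
     * Types with no matches are skipped, so no empty attributes are produced.
     */
    static Map<String, String> toAttributes(Map<String, List<String>> entitiesByType) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : entitiesByType.entrySet()) {
            List<String> values = entry.getValue();
            if (values == null || values.isEmpty()) {
                continue;
            }
            // Assumed naming rule: "nlp_" + type. Only nlp_names appears in the article.
            attributes.put("nlp_" + entry.getKey(), String.join(",", values));
        }
        return attributes;
    }
}
```

In a real processor, a map like this would typically be attached to the FlowFile as attributes. Joining with a plain comma keeps the value easy to read, but an entity that itself contains a comma becomes ambiguous downstream.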

I put out a new release, built around Apache NiFi 1.6.0.

Source and NAR Download

You can check out the source code on GitHub.

Download the pre-trained models for your language here

I chose English (en).

In a future release, I may add Money, Time, and Percentage to the lists we extract if there is interest.

A Final JSON File Produced

{"created_at":"Thu May 10 16:55:17 +0000 2018","id":994621913115840512,"id_str":"994621913115840512","text":"Inflated 3D Convnet or I3D model trained for action recognition on kinetics-400. https:\/\/t.co\/4Udj1jTSVp","source":"\u003ca href=\"https:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2496666240,"id_str":"2496666240","name":"Brent Arnichec","screen_name":"luckflow","location":"San Francisco, CA","url":"http:\/\/emulai.com","description":"#ArtificialIntelligence #MachineLearning #DeepLearning #IoT #fintech #Bigdata #Technology #Science #Robotics #DL #tech #Blockchain #Computing #AI","translator_type":"none","protected":false,"verified":false,"followers_count":146,"friends_count":711,"listed_count":14,"favourites_count":1,"statuses_count":822,"created_at":"Thu May 15 16:21:13 +0000 2014","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"E81C4F","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2496666240\/1498327723","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/4Udj1jTSVp","expanded_url":"https:\/\/www.tensorflow.org\/hub\/modules\/deepmind\/i3d-kinetics-400\/1","display_url":"tensorflow.org\/hub\/modules\/de\u2026","indices":[81,104]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1525971317925"}

Example Output

The Main Flow for Trying Out the NLP Processor

Set Your Models

New NLP Processor Documentation

Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse it or do additional processing on these fields.
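Since nlp_names arrives as a single comma-delimited String, downstream code usually needs to split it back into individual values. The following is a rough sketch of one way to do that in plain Java; the class `NlpAttributeParser` and the method `splitValues` are hypothetical names, not part of the processor, and the sketch assumes that no entity value contains a comma itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reading a comma-delimited attribute such as nlp_names. */
class NlpAttributeParser {

    /**
     * Splits a comma-delimited attribute value into trimmed, non-empty entries.
     * Returns an empty list when the attribute is missing (null) or blank.
     */
    static List<String> splitValues(String attributeValue) {
        List<String> values = new ArrayList<>();
        if (attributeValue == null) {
            return values;
        }
        for (String part : attributeValue.split(",")) {
            String trimmed = part.trim();
            // Skip empty fragments left by doubled or trailing commas.
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }
}
```

For example, `NlpAttributeParser.splitValues("Ada Lovelace, Alan Turing")` yields a two-element list. If your entities can contain commas (as in "San Francisco, CA"), a different delimiter or a JSON array would be a safer encoding.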

High-Level Flow
