Apache NiFi + Apache OpenNLP With Organizations and Flow Files

Using a custom processor to extract natural language processing (NLP) entities (names, locations, dates, and organizations) from files in a stream.

Updating the Apache OpenNLP Community Apache NiFi Processor to Support Flow Files

In this new release, the processor can read the content of a FlowFile and analyze it for locations, dates, organizations, and names. It uses the pre-trained Apache OpenNLP 1.5 models that are available for download. These do a decent job, and you can build new models as needed. I also changed the processor to output one attribute per entity type, each holding a String list of the locations, organizations, dates, or names that were found.
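To make the "one attribute per type" output concrete, here is a minimal Python sketch of the grouping step. It is not the processor's actual code. The attribute name nlp_names comes from this article; the other attribute names, the entity type labels, the de-duplication, and the sample entities are assumptions made up for illustration.

```python
# Hypothetical mapping from entity type to attribute name.
# Only nlp_names is mentioned in the article; the rest are assumed.
ATTRIBUTE_FOR_TYPE = {
    "person": "nlp_names",
    "location": "nlp_locations",
    "date": "nlp_dates",
    "organization": "nlp_organizations",
}


def entities_to_attributes(entities):
    """Group (entity_type, text) pairs into one comma-delimited string per type."""
    grouped = {}
    for entity_type, text in entities:
        attribute = ATTRIBUTE_FOR_TYPE.get(entity_type)
        if attribute is None:
            continue  # skip types this sketch does not map
        values = grouped.setdefault(attribute, [])
        if text not in values:  # assumption: keep the first occurrence only
            values.append(text)
    return {name: ",".join(values) for name, values in grouped.items()}


# Made-up entities, as an NLP step might report them.
found = [
    ("person", "Ada Lovelace"),
    ("location", "London"),
    ("organization", "Royal Society"),
    ("location", "London"),
    ("date", "1843"),
]
attributes = entities_to_attributes(found)
print(attributes)
```

Each key in the resulting dictionary would correspond to one FlowFile attribute, and each value to the String list for that entity type.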

I put out a new release, built around Apache NiFi 1.6.0.

Source and NAR Download

You can check out the source code on GitHub.

Download the pre-trained Apache OpenNLP 1.5 models for your language.

I chose English (en).
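Before pointing the processor at a model directory, it can help to confirm the files are actually there. The sketch below lists the file names of the English name-finder models from the Apache OpenNLP 1.5 model set and reports any that are missing. Whether the processor needs exactly these four files is an assumption, and the directory you pass in is up to you.

```python
import os

# File names of the English name-finder models in the OpenNLP 1.5 model set.
# Assumption: these four are the ones the processor needs.
ENGLISH_NER_MODELS = [
    "en-ner-person.bin",
    "en-ner-location.bin",
    "en-ner-date.bin",
    "en-ner-organization.bin",
]


def missing_models(model_dir, expected=ENGLISH_NER_MODELS):
    """Return the expected model files that are not present in model_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Check the current directory; replace "." with your own model directory.
print(missing_models("."))
```

An empty list means every expected model file was found.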

In a future release, I may add Money, Time, and Percentage to the lists we extract if there is interest.

A Final JSON File Produced

{"created_at":"Thu May 10 16:55:17 +0000 2018","id":994621913115840512,"id_str":"994621913115840512","text":"Inflated 3D Convnet or I3D model trained for action recognition on kinetics-400. https:\/\/t.co\/4Udj1jTSVp","source":"\u003ca href=\"https:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2496666240,"id_str":"2496666240","name":"Brent Arnichec","screen_name":"luckflow","location":"San Francisco, CA","url":"http:\/\/emulai.com","description":"#ArtificialIntelligence #MachineLearning #DeepLearning #IoT #fintech #Bigdata #Technology #Science #Robotics #DL #tech #Blockchain #Computing #AI","translator_type":"none","protected":false,"verified":false,"followers_count":146,"friends_count":711,"listed_count":14,"favourites_count":1,"statuses_count":822,"created_at":"Thu May 15 16:21:13 +0000 2014","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"E81C4F","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2496666240\/1498327723","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/4Udj1jTSVp","expanded_url":"https:\/\/www.tensorflow.org\/hub\/modules\/deepmind\/i3d-kinetics-400\/1","display_url":"tensorflow.org\/hub\/modules\/de\u2026","indices":[81,104]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1525971317925"}

Example Output

The Main Flow for Trying Out the NLP Processor

Set Your Models

New NLP Processor Documentation

Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse this or do additional processing on these fields.
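If you do want to parse a comma-delimited field such as nlp_names downstream, a small helper is enough. This is a minimal Python sketch, not part of the processor; the function name split_attribute and the sample value are made up for illustration.

```python
def split_attribute(value):
    """Split a comma-delimited attribute into a list of trimmed, non-empty values."""
    if not value:
        return []  # covers both None and the empty string
    return [item.strip() for item in value.split(",") if item.strip()]


# Made-up attribute value with stray whitespace and empty entries.
names = split_attribute("Ada Lovelace, Charles Babbage,,")
print(names)
```

One caveat with this approach: an entity that itself contains a comma, such as "San Francisco, CA", would be split into two values, so a plain split is only safe when the extracted entities are comma-free.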

High-Level Flow

Topics:
apache nifi, apache opennlp, nlp, text processing, big data
