Apache NiFi + Apache OpenNLP With Organizations and Flow Files
Using a custom processor to extract natural language processing entities (names, locations, dates, and organizations) from files in a stream.
Updating the Apache OpenNLP Community Apache NiFi Processor to Support Flow Files
In this new release, we add the ability to read content from the FlowFile and analyze it for locations, dates, organizations, and names. We use the pre-trained Apache OpenNLP 1.5 models that are available for download. These do a decent job, and you can build new models as needed. I also changed the processor to output one attribute per entity type, each holding a String list of the locations, organizations, dates, or names that were found.
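The one-attribute-per-type output described above can be sketched roughly as follows. This is a minimal Python illustration, not the processor's actual Java code: the function name, the `nlp_` prefix for types other than `nlp_names`, and the sample entities are all assumptions made for the example.

```python
def entities_to_attributes(entities, prefix="nlp_"):
    """Collapse (entity_type, text) pairs into one comma-delimited string per type.

    Attribute names are built as prefix + entity_type. Only nlp_names is
    mentioned in the article; the other names are hypothetical.
    """
    grouped = {}
    for entity_type, text in entities:
        # Keep entities in the order they were found, grouped by type.
        grouped.setdefault(entity_type, []).append(text)
    return {prefix + t: ",".join(values) for t, values in grouped.items()}


# Made-up entity hits, standing in for what a name finder might return.
sample_entities = [
    ("names", "Ada Lovelace"),
    ("locations", "London"),
    ("names", "Alan Turing"),
    ("organizations", "Apache Software Foundation"),
    ("dates", "May 10"),
]

attributes = entities_to_attributes(sample_entities)
for name, value in attributes.items():
    print(name, "=", value)
```

Each resulting value is a single String, which is why downstream steps may need to split it again before working with individual entities.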
I put out a new release, built around Apache NiFi 1.6.0.
Source and NAR Download
You can check out the source code on GitHub.
Download the pre-trained models for your language here.
I chose English (en).
In a future release, I may add Money, Time, and Percentage to the lists we extract if there is interest. Organization, previously on that list, is already covered by this release.
A Final JSON File Produced
{"created_at":"Thu May 10 16:55:17 +0000 2018","id":994621913115840512,"id_str":"994621913115840512","text":"Inflated 3D Convnet or I3D model trained for action recognition on kinetics-400. https:\/\/t.co\/4Udj1jTSVp","source":"\u003ca href=\"https:\/\/ifttt.com\" rel=\"nofollow\"\u003eIFTTT\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2496666240,"id_str":"2496666240","name":"Brent Arnichec","screen_name":"luckflow","location":"San Francisco, CA","url":"http:\/\/emulai.com","description":"#ArtificialIntelligence #MachineLearning #DeepLearning #IoT #fintech #Bigdata #Technology #Science #Robotics #DL #tech #Blockchain #Computing #AI","translator_type":"none","protected":false,"verified":false,"followers_count":146,"friends_count":711,"listed_count":14,"favourites_count":1,"statuses_count":822,"created_at":"Thu May 15 16:21:13 +0000 2014","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"E81C4F","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/878328407003496450\/i2Ii4dAz_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2496666240\/1498327723","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"quote_count":0,"reply_count":0,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/4Udj1jTSVp","expanded_url":"https:\/\/www.tensorflow.org\/hub\/modules\/deepmind\/i3d-kinetics-400\/1","display_url":"tensorflow.org\/hub\/modules\/de\u2026","indices":[81,104]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1525971317925"}
Example Output
The Main Flow for Trying Out the NLP Processor
Set Your Models
New NLP Processor Documentation
Here is the schema to use to process this data. Note that nlp_names is a String of comma-delimited values. You may want to parse it or do additional processing on these fields.
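As one way to do that parsing, here is a minimal Python sketch that splits a comma-delimited attribute value such as nlp_names into a list. The function name and the sample value are hypothetical, and the sketch assumes that no single entity contains a comma of its own.

```python
def parse_nlp_attribute(value):
    """Split a comma-delimited attribute value into trimmed, non-empty items."""
    if not value:
        # A missing or empty attribute yields no entities.
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


# Hypothetical attribute value, for illustration only.
names = parse_nlp_attribute("Ada Lovelace, Alan Turing")
print(names)
```

Because the values are joined with plain commas, an entity such as "San Francisco, CA" would be split into two items by this approach, so a different delimiter or extra handling may be needed for such data.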
High-Level Flow