
Integrating lucene-geo-gazetteer for Geo Enrichment With Apache NiFi


In this post, a data scientist explores enriching Twitter data with geocoding, using an open source tool built on Apache projects.


lucene-geo-gazetteer is a very handy tool from the Apache Tika, Apache Lucene, and Apache OpenNLP ecosystem. It builds a fast, searchable index of geo data from a large all-countries dataset (allCountries.txt), then exposes a REST API that is easy to integrate into a flow.

I have connected it to a NiFi flow that enriches Twitter data with geo data.

Example NiFi Flow to Convert Twitter Locations Into Geo Information

Downloading the Countries' Data and Building the Geo Indexes

Calling the Local Geo Server

Example JSON Data Returned

Let's pull out the fields we want after the split.

Let's build a new JSON file containing just the fields we want, including the new geo ones.

Example JSON Processed

{  
   "msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…",
   "unixtime":"1516754942404",
   "friends_count":"4293",
   "sentiment":"NEGATIVE",
   "geolongitude":"-98.5",
   "hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]",
   "listed_count":"520",
   "tweet_id":"955965632402485248",
   "user_name":"Lee Weiden",
   "favourites_count":"12454",
   "source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
   "vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n",
   "placename":"United States",
   "media_url":"[]",
   "sentiment2":"Negative\n",
   "retweet_count":"0",
   "user_mentions_name":"[]",
   "geo":"",
   "urls":"[]",
   "countryCode":"US",
   "user_url":"",
   "place":"",
   "timestamp":"1516754942404",
   "geolatitude":"39.76",
   "coordinates":"",
   "handle":"LeeWeiden",
   "profile_image_url":"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg",
   "time_zone":"Eastern Time (US & Canada)",
   "ext_media":"[]",
   "statuses_count":"186127",
   "followers_count":"1461",
   "location":"United States",
   "time":"Wed Jan 24 00:49:02 +0000 2018",
   "user_mentions":"[]",
   "user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."
}
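Every value in the processed record is a string, including the coordinates and the JSON-encoded lists such as hashtags. If a downstream consumer needs native types, a conversion step along these lines could help. This is a minimal Python sketch, not part of the NiFi flow; the helper name `typed_view` and the choice of fields to convert are assumptions, and the sample is a shortened subset of the record above.

```python
import json


def typed_view(record):
    """Return a copy of the record with a few string fields converted."""
    out = dict(record)
    # Coordinates arrive as strings such as "39.76"; empty or missing becomes None.
    for key in ("geolatitude", "geolongitude"):
        out[key] = float(record[key]) if record.get(key) else None
    # Counters arrive as strings such as "4293".
    for key in ("friends_count", "followers_count", "retweet_count"):
        out[key] = int(record[key]) if record.get(key) else None
    # List fields arrive as JSON text such as "[\"AI\",\"IoT\"]".
    for key in ("hashtags", "urls", "user_mentions"):
        out[key] = json.loads(record[key]) if record.get(key) else []
    return out


# Shortened subset of the processed record shown above.
sample = {
    "placename": "United States",
    "countryCode": "US",
    "geolatitude": "39.76",
    "geolongitude": "-98.5",
    "friends_count": "4293",
    "followers_count": "1461",
    "retweet_count": "0",
    "hashtags": "[\"CyberSecurity\",\"AI\"]",
    "urls": "[]",
    "user_mentions": "[]",
}

if __name__ == "__main__":
    print(typed_view(sample))
```

Fields that are not listed in the helper, such as placename and countryCode, pass through unchanged.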

Test the API

http://localhost:8765/api/search?s=Hightstown&s=New+Jersey 
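The same request can be built from code. The sketch below uses only the Python standard library and assumes the server is running locally on port 8765, as in the URL above. The helper names `build_search_url` and `search` are illustrative and are not part of lucene-geo-gazetteer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the REST server runs locally on the port used in this article.
BASE_URL = "http://localhost:8765/api/search"


def build_search_url(terms, base_url=BASE_URL):
    """Return the search URL with one repeated `s` parameter per term."""
    return base_url + "?" + urlencode([("s", term) for term in terms])


def search(terms, base_url=BASE_URL, timeout=10):
    """Call the gazetteer and return the decoded JSON response."""
    with urlopen(build_search_url(terms, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the URL; call search(...) once the server is up.
    print(build_search_url(["Hightstown", "New Jersey"]))
```

`urlencode` handles the escaping, so a term such as "New Jersey" is sent as `New+Jersey` without manual string handling.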

Build the Index From All Countries Dataset

./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt 

Run the REST Server

 ./src/main/bin/lucene-geo-gazetteer -server 

Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex 
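When the server is started from a script, it can be useful to wait until the port from the log above accepts connections before sending requests. The following Python sketch shows one way to do that; `wait_for_port` is a hypothetical helper, and the default host, port, and timings are assumptions based on this article's setup.

```python
import socket
import time


def wait_for_port(host="localhost", port=8765, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that the embedded Tomcat is listening. It does not prove that the index has finished loading, so a real search request is still the most reliable check.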

Example Call

 http://localhost:8765/api/search?s=Hightstown&s=New+Jersey 

Example Results

{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
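The response maps each search term to a list of matches. A flow needs to reduce that to a few flat attributes, which could look roughly like the Python sketch below. Taking the first match as the best one is an assumption, as is the mapping to the `placename`, `countryCode`, `geolatitude`, and `geolongitude` names, which is inferred from the processed JSON record shown earlier. The function name `geo_attributes` is illustrative.

```python
import json

# Sample response copied from the example results above.
SAMPLE = (
    '{"Hightstown":[{"name":"Hightstown","countryCode":"US",'
    '"admin1Code":"NJ","admin2Code":"021","latitude":40.26955,'
    '"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey",'
    '"countryCode":"US","admin1Code":"NJ","admin2Code":"",'
    '"latitude":40.16706,"longitude":-74.49987}]}'
)


def geo_attributes(response_text, term):
    """Return the first match for `term` as flat string attributes, or {}."""
    matches = json.loads(response_text).get(term) or []
    if not matches:
        return {}
    best = matches[0]  # Assumption: the first match is the one to keep.
    return {
        "placename": best["name"],
        "countryCode": best["countryCode"],
        # Stored as strings to match the all-string processed record.
        "geolatitude": str(best["latitude"]),
        "geolongitude": str(best["longitude"]),
    }


if __name__ == "__main__":
    print(geo_attributes(SAMPLE, "Hightstown"))
```

A term with no matches yields an empty dictionary, so the caller can leave the geo fields out rather than write placeholder values.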

Example Schema

{
 "type": "record",
 "name": "twitter",
 "fields": [
  {
   "name": "msg",
   "type": "string"
  },
  {
   "name": "unixtime",
   "type": "string"
  },
  {
   "name": "friends_count",
   "type": "string"
  },
  {
   "name": "sentiment",
   "type": "string"
  },
  {
   "name": "geolongitude",
   "type": "string"
  },
  {
   "name": "hashtags",
   "type": "string"
  },
  {
   "name": "listed_count",
   "type": "string"
  },
  {
   "name": "tweet_id",
   "type": "string"
  },
  {
   "name": "user_name",
   "type": "string"
  },
  {
   "name": "favourites_count",
   "type": "string"
  },
  {
   "name": "source",
   "type": "string"
  },
  {
   "name": "vadersentiment",
   "type": "string"
  },
  {
   "name": "placename",
   "type": "string"
  },
  {
   "name": "media_url",
   "type": "string"
  },
  {
   "name": "sentiment2",
   "type": "string"
  },
  {
   "name": "retweet_count",
   "type": "string"
  },
  {
   "name": "user_mentions_name",
   "type": "string"
  },
  {
   "name": "geo",
   "type": "string"
  },
  {
   "name": "urls",
   "type": "string"
  },
  {
   "name": "countryCode",
   "type": "string"
  },
  {
   "name": "user_url",
   "type": "string"
  },
  {
   "name": "place",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "timestamp",
   "type": "string",
   "doc": "Type inferred from '\"1516754942404\"'"
  },
  {
   "name": "geolatitude",
   "type": "string",
   "doc": "Type inferred from '\"39.76\"'"
  },
  {
   "name": "coordinates",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "handle",
   "type": "string",
   "doc": "Type inferred from '\"LeeWeiden\"'"
  },
  {
   "name": "profile_image_url",
   "type": "string",
   "doc": "Type inferred from '\"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'"
  },
  {
   "name": "time_zone",
   "type": "string",
   "doc": "Type inferred from '\"Eastern Time (US & Canada)\"'"
  },
  {
   "name": "ext_media",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "statuses_count",
   "type": "string",
   "doc": "Type inferred from '\"186127\"'"
  },
  {
   "name": "followers_count",
   "type": "string",
   "doc": "Type inferred from '\"1461\"'"
  },
  {
   "name": "location",
   "type": "string",
   "doc": "Type inferred from '\"United States\"'"
  },
  {
   "name": "time",
   "type": "string",
   "doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'"
  },
  {
   "name": "user_mentions",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "user_description",
   "type": "string",
   "doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'"
  }
 ]
}
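Because every field in this schema is a string, a quick sanity check on a record before it reaches a schema-aware step is straightforward. The sketch below is a plain-Python approximation that only checks field presence and string type; it is not a full Avro validator, and `check_record` and the four-field schema subset are illustrative assumptions.

```python
import json


def check_record(schema, record):
    """Return a list of problems found when comparing record to schema."""
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            problems.append("missing field: " + name)
        elif field["type"] == "string" and not isinstance(record[name], str):
            problems.append("not a string: " + name)
    return problems


# Subset of the schema above, limited to the geo fields for brevity.
schema = json.loads("""
{
 "type": "record",
 "name": "twitter",
 "fields": [
  {"name": "placename", "type": "string"},
  {"name": "countryCode", "type": "string"},
  {"name": "geolatitude", "type": "string"},
  {"name": "geolongitude", "type": "string"}
 ]
}
""")

if __name__ == "__main__":
    good = {"placename": "United States", "countryCode": "US",
            "geolatitude": "39.76", "geolongitude": "-98.5"}
    print(check_record(schema, good))
```

An empty list means the record has every field the schema names and each value is a string.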

https://github.com/chrismattmann/lucene-geo-gazetteer

Topics:
big data, apache nifi, lucene, opennlp, geo data
