Integrating lucene-geo-gazetteer for Geo Enrichment With Apache NiFi

In this post, a data scientist explores enriching Twitter data with geocoding, using an open source tool built on Apache projects.

By Tim Spann · Mar. 22, 18 · Tutorial

lucene-geo-gazetteer is a tool from the Apache Tika, Apache Lucene, and Apache OpenNLP ecosystem. It builds a fast index of geo data from a large dataset covering all countries, then exposes that index through a REST API that is easy to integrate into a flow.

I connected it to a NiFi flow to enrich Twitter data with geo data.

Example NiFi Flow To Convert Twitter Locations Into Geo Information

Downloading the Countries' Data and Building the Geo Indexes

Calling the Local Geo Server

Example JSON Data Returned

Let's pull out the fields we want after the split

Let's build a new JSON file of just the fields we want, including the new geo ones.

Example JSON Processed

{  
   "msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…",
   "unixtime":"1516754942404",
   "friends_count":"4293",
   "sentiment":"NEGATIVE",
   "geolongitude":"-98.5",
   "hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]",
   "listed_count":"520",
   "tweet_id":"955965632402485248",
   "user_name":"Lee Weiden",
   "favourites_count":"12454",
   "source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
   "vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n",
   "placename":"United States",
   "media_url":"[]",
   "sentiment2":"Negative\n",
   "retweet_count":"0",
   "user_mentions_name":"[]",
   "geo":"",
   "urls":"[]",
   "countryCode":"US",
   "user_url":"",
   "place":"",
   "timestamp":"1516754942404",
   "geolatitude":"39.76",
   "coordinates":"",
   "handle":"LeeWeiden",
   "profile_image_url":"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg",
   "time_zone":"Eastern Time (US & Canada)",
   "ext_media":"[]",
   "statuses_count":"186127",
   "followers_count":"1461",
   "location":"United States",
   "time":"Wed Jan 24 00:49:02 +0000 2018",
   "user_mentions":"[]",
   "user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."
}
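In the flow itself this merge is done by NiFi processors. The Python below is only a minimal sketch of the resulting data shape, assuming one gazetteer candidate has already been chosen. The helper name `enrich_tweet` is hypothetical, and the field mapping (name to placename, latitude to geolatitude, and so on) is inferred from the record above rather than taken from the flow definition.

```python
import json


def enrich_tweet(tweet, match):
    """Return a copy of `tweet` with geo fields added as strings.

    `match` is one gazetteer candidate with the keys
    name, countryCode, latitude, and longitude.
    """
    enriched = dict(tweet)  # leave the caller's dict untouched
    enriched["placename"] = str(match["name"])
    enriched["countryCode"] = str(match["countryCode"])
    # The processed record stores every value as a string, so do the same here.
    enriched["geolatitude"] = str(match["latitude"])
    enriched["geolongitude"] = str(match["longitude"])
    return enriched


if __name__ == "__main__":
    tweet = {"handle": "LeeWeiden", "location": "United States"}
    match = {
        "name": "United States",
        "countryCode": "US",
        "latitude": 39.76,
        "longitude": -98.5,
    }
    print(json.dumps(enrich_tweet(tweet, match), indent=2))
```

Keeping the values as strings mirrors the processed record, where even counts and coordinates are quoted.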

Test The API

http://localhost:8765/api/search?s=Hightstown&s=New+Jersey 
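The same request can be built programmatically. This is a minimal sketch in Python, assuming the server is reachable at localhost:8765 as in this post. The helper name `build_search_url` is hypothetical. Each place name is sent as its own repeated `s` parameter, matching the URL above.

```python
from urllib.parse import urlencode

# Assumption: the gazetteer server listens on localhost:8765, as in this post.
BASE_URL = "http://localhost:8765/api/search"


def build_search_url(place_names, base_url=BASE_URL):
    """Return a search URL with one repeated `s` parameter per place name."""
    # A list of pairs keeps the parameter order and allows repeated keys.
    query = urlencode([("s", name) for name in place_names])
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(build_search_url(["Hightstown", "New Jersey"]))
```

`urlencode` encodes the space in "New Jersey" as `+`, so the result has the same form as the test URL.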

Build the Index From All Countries Dataset

./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt 

Run the REST Server

 ./src/main/bin/lucene-geo-gazetteer -server 

Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex 
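When the server is started from a script, it can help to wait until the port from the log above (8765) accepts connections before sending requests. A minimal sketch, with a hypothetical helper name `wait_for_port`; note that an open port only shows Tomcat is listening, not that the index has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before `timeout` seconds elapse.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the server in this post, the call would be `wait_for_port("localhost", 8765)`.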

Example Call

 http://localhost:8765/api/search?s=Hightstown&s=New+Jersey 

Example Results

{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
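The response maps each search term to a list of candidate places. Here is a minimal sketch of reading it in Python, using the example response above as sample data. The helper name `best_match` is hypothetical, and taking the first candidate is an assumption for illustration, not a ranking the API is known to guarantee.

```python
import json

# Sample data copied from the example results in this post.
SAMPLE_RESPONSE = (
    '{"Hightstown":[{"name":"Hightstown","countryCode":"US",'
    '"admin1Code":"NJ","admin2Code":"021","latitude":40.26955,'
    '"longitude":-74.52321}],'
    '"New Jersey":[{"name":"New Jersey","countryCode":"US",'
    '"admin1Code":"NJ","admin2Code":"","latitude":40.16706,'
    '"longitude":-74.49987}]}'
)


def best_match(response_text, place_name):
    """Return the first candidate for `place_name`, or None if there is none."""
    results = json.loads(response_text)
    candidates = results.get(place_name) or []
    return candidates[0] if candidates else None


if __name__ == "__main__":
    match = best_match(SAMPLE_RESPONSE, "Hightstown")
    print(match["countryCode"], match["latitude"], match["longitude"])
```

Returning `None` for an unknown term lets a caller route unmatched locations separately instead of failing.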

Example Schema

{
 "type": "record",
 "name": "twitter",
 "fields": [
  {
   "name": "msg",
   "type": "string"
  },
  {
   "name": "unixtime",
   "type": "string"
  },
  {
   "name": "friends_count",
   "type": "string"
  },
  {
   "name": "sentiment",
   "type": "string"
  },
  {
   "name": "geolongitude",
   "type": "string"
  },
  {
   "name": "hashtags",
   "type": "string"
  },
  {
   "name": "listed_count",
   "type": "string"
  },
  {
   "name": "tweet_id",
   "type": "string"
  },
  {
   "name": "user_name",
   "type": "string"
  },
  {
   "name": "favourites_count",
   "type": "string"
  },
  {
   "name": "source",
   "type": "string"
  },
  {
   "name": "vadersentiment",
   "type": "string"
  },
  {
   "name": "placename",
   "type": "string"
  },
  {
   "name": "media_url",
   "type": "string"
  },
  {
   "name": "sentiment2",
   "type": "string"
  },
  {
   "name": "retweet_count",
   "type": "string"
  },
  {
   "name": "user_mentions_name",
   "type": "string"
  },
  {
   "name": "geo",
   "type": "string"
  },
  {
   "name": "urls",
   "type": "string"
  },
  {
   "name": "countryCode",
   "type": "string"
  },
  {
   "name": "user_url",
   "type": "string"
  },
  {
   "name": "place",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "timestamp",
   "type": "string",
   "doc": "Type inferred from '\"1516754942404\"'"
  },
  {
   "name": "geolatitude",
   "type": "string",
   "doc": "Type inferred from '\"39.76\"'"
  },
  {
   "name": "coordinates",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "handle",
   "type": "string",
   "doc": "Type inferred from '\"LeeWeiden\"'"
  },
  {
   "name": "profile_image_url",
   "type": "string",
   "doc": "Type inferred from '\"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'"
  },
  {
   "name": "time_zone",
   "type": "string",
   "doc": "Type inferred from '\"Eastern Time (US & Canada)\"'"
  },
  {
   "name": "ext_media",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "statuses_count",
   "type": "string",
   "doc": "Type inferred from '\"186127\"'"
  },
  {
   "name": "followers_count",
   "type": "string",
   "doc": "Type inferred from '\"1461\"'"
  },
  {
   "name": "location",
   "type": "string",
   "doc": "Type inferred from '\"United States\"'"
  },
  {
   "name": "time",
   "type": "string",
   "doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'"
  },
  {
   "name": "user_mentions",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "user_description",
   "type": "string",
   "doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'"
  }
 ]
}
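Because every field in this schema is a plain string, a record can be sanity-checked with a few lines of Python before it is sent downstream. The sketch below uses a trimmed copy of the schema (four fields only) and a hypothetical helper `check_record`. It is a quick illustration, not a substitute for validating with a real Avro library.

```python
# Trimmed copy of the schema above, kept short for illustration.
SCHEMA = {
    "type": "record",
    "name": "twitter",
    "fields": [
        {"name": "msg", "type": "string"},
        {"name": "countryCode", "type": "string"},
        {"name": "geolatitude", "type": "string"},
        {"name": "geolongitude", "type": "string"},
    ],
}


def check_record(schema, record):
    """Return a list of problems found when comparing `record` to `schema`.

    Only handles flat schemas whose field types are simple type names.
    """
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif field["type"] == "string" and not isinstance(record[name], str):
            problems.append(f"field {name} is not a string")
    return problems


if __name__ == "__main__":
    record = {"msg": "hello", "countryCode": "US", "geolatitude": 39.76}
    for problem in check_record(SCHEMA, record):
        print(problem)
```

A check like this catches the most common mismatch with this schema: coordinates left as numbers instead of strings.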

https://github.com/chrismattmann/lucene-geo-gazetteer
