Integrating lucene-geo-gazetteer for Geo Enrichment With Apache NiFi
In this post, a data scientist explores enriching Twitter data with geocoding using an open source tool made by Apache.
lucene-geo-gazetteer is a very cool Apache Tika, Apache Lucene, and Apache OpenNLP tool that builds a fast index of geo data from a large list of all countries' data. It then provides a REST API that we can easily integrate into a flow.
I have connected it to a NiFi flow that enriches Twitter data with geo data.
Example NiFi Flow To Convert Twitter Locations Into Geo Information
Downloading the Countries' Data and Building the Geo Indexes
Calling the Local Geo Server
Example JSON Data Returned
Let's pull out the fields we want after the split.
Let's build a new JSON file with just the fields we want, including the new geo ones.
Example JSON Processed
{
"msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…",
"unixtime":"1516754942404",
"friends_count":"4293",
"sentiment":"NEGATIVE",
"geolongitude":"-98.5",
"hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]",
"listed_count":"520",
"tweet_id":"955965632402485248",
"user_name":"Lee Weiden",
"favourites_count":"12454",
"source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
"vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n",
"placename":"United States",
"media_url":"[]",
"sentiment2":"Negative\n",
"retweet_count":"0",
"user_mentions_name":"[]",
"geo":"",
"urls":"[]",
"countryCode":"US",
"user_url":"",
"place":"",
"timestamp":"1516754942404",
"geolatitude":"39.76",
"coordinates":"",
"handle":"LeeWeiden",
"profile_image_url":"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg",
"time_zone":"Eastern Time (US & Canada)",
"ext_media":"[]",
"statuses_count":"186127",
"followers_count":"1461",
"location":"United States",
"time":"Wed Jan 24 00:49:02 +0000 2018",
"user_mentions":"[]",
"user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."
}
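The geo fields in the record above (placename, countryCode, geolatitude, geolongitude) sit alongside the original tweet fields, and every value is a string. Outside NiFi, that merge step could be sketched roughly as follows. The function name `add_geo_fields` is hypothetical, and the sketch assumes a gazetteer match shaped like the REST results shown later in this post (name, countryCode, latitude, longitude); it is not the flow's actual implementation.

```python
def add_geo_fields(tweet, match):
    """Return a copy of the tweet record with geo fields added as strings.

    `match` is assumed to be one gazetteer result with the keys
    name, countryCode, latitude, and longitude.
    """
    enriched = dict(tweet)  # leave the caller's record untouched
    enriched["placename"] = str(match["name"])
    enriched["countryCode"] = str(match["countryCode"])
    # The example record stores coordinates as strings, so convert here.
    enriched["geolatitude"] = str(match["latitude"])
    enriched["geolongitude"] = str(match["longitude"])
    return enriched


if __name__ == "__main__":
    tweet = {"handle": "LeeWeiden", "location": "United States"}
    match = {"name": "United States", "countryCode": "US",
             "latitude": 39.76, "longitude": -98.5}
    print(add_geo_fields(tweet, match))
```

Keeping every value as a string matches the all-string schema used for this record later in the post.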
Test The API
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
Build the Index From All Countries Dataset
./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt
Run the REST Server
./src/main/bin/lucene-geo-gazetteer -server
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex
Example Call
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
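The call repeats the `s` query parameter once per place name. A minimal Python sketch of building such a URL is shown here. The helper name `build_search_url` is hypothetical, and the host and port are simply those from the example call.

```python
from urllib.parse import urlencode

# Base URL taken from the example call; adjust for your own server.
BASE_URL = "http://localhost:8765/api/search"


def build_search_url(place_names, base_url=BASE_URL):
    """Return a search URL with one `s` parameter per place name."""
    # A list of (key, value) pairs lets the same key repeat, and
    # urlencode encodes spaces as '+', as in the example call.
    query = urlencode([("s", name) for name in place_names])
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(build_search_url(["Hightstown", "New Jersey"]))
```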
Example Results
{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
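The response is a JSON object keyed by each queried name, with a list of matches per name. As a rough sketch of pulling out one match per name, the snippet below takes the first entry in each list. Treating the first entry as the best match is an assumption, and the function name `first_matches` is hypothetical.

```python
import json

# The example results from above, used as sample input.
SAMPLE_RESPONSE = (
    '{"Hightstown":[{"name":"Hightstown","countryCode":"US",'
    '"admin1Code":"NJ","admin2Code":"021","latitude":40.26955,'
    '"longitude":-74.52321}],'
    '"New Jersey":[{"name":"New Jersey","countryCode":"US",'
    '"admin1Code":"NJ","admin2Code":"","latitude":40.16706,'
    '"longitude":-74.49987}]}'
)


def first_matches(response_text):
    """Map each queried name to its first match, or None if there is none."""
    results = json.loads(response_text)
    return {
        name: (matches[0] if matches else None)
        for name, matches in results.items()
    }


if __name__ == "__main__":
    for name, match in first_matches(SAMPLE_RESPONSE).items():
        print(name, match["latitude"], match["longitude"])
```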
Example Schema
{
"type": "record",
"name": "twitter",
"fields": [
{
"name": "msg",
"type": "string"
},
{
"name": "unixtime",
"type": "string"
},
{
"name": "friends_count",
"type": "string"
},
{
"name": "sentiment",
"type": "string"
},
{
"name": "geolongitude",
"type": "string"
},
{
"name": "hashtags",
"type": "string"
},
{
"name": "listed_count",
"type": "string"
},
{
"name": "tweet_id",
"type": "string"
},
{
"name": "user_name",
"type": "string"
},
{
"name": "favourites_count",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "vadersentiment",
"type": "string"
},
{
"name": "placename",
"type": "string"
},
{
"name": "media_url",
"type": "string"
},
{
"name": "sentiment2",
"type": "string"
},
{
"name": "retweet_count",
"type": "string"
},
{
"name": "user_mentions_name",
"type": "string"
},
{
"name": "geo",
"type": "string"
},
{
"name": "urls",
"type": "string"
},
{
"name": "countryCode",
"type": "string"
},
{
"name": "user_url",
"type": "string"
},
{
"name": "place",
"type": "string",
"doc": "Type inferred from '\"\"'"
},
{
"name": "timestamp",
"type": "string",
"doc": "Type inferred from '\"1516754942404\"'"
},
{
"name": "geolatitude",
"type": "string",
"doc": "Type inferred from '\"39.76\"'"
},
{
"name": "coordinates",
"type": "string",
"doc": "Type inferred from '\"\"'"
},
{
"name": "handle",
"type": "string",
"doc": "Type inferred from '\"LeeWeiden\"'"
},
{
"name": "profile_image_url",
"type": "string",
"doc": "Type inferred from '\"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'"
},
{
"name": "time_zone",
"type": "string",
"doc": "Type inferred from '\"Eastern Time (US & Canada)\"'"
},
{
"name": "ext_media",
"type": "string",
"doc": "Type inferred from '\"[]\"'"
},
{
"name": "statuses_count",
"type": "string",
"doc": "Type inferred from '\"186127\"'"
},
{
"name": "followers_count",
"type": "string",
"doc": "Type inferred from '\"1461\"'"
},
{
"name": "location",
"type": "string",
"doc": "Type inferred from '\"United States\"'"
},
{
"name": "time",
"type": "string",
"doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'"
},
{
"name": "user_mentions",
"type": "string",
"doc": "Type inferred from '\"[]\"'"
},
{
"name": "user_description",
"type": "string",
"doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'"
}
]
}
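Every field in the schema above is typed as a string, and the "Type inferred from" doc entries suggest it was inferred from a sample record. A comparable all-string schema could be generated from a flat record with a few lines of Python. This is only a sketch under that assumption, not the tool that produced the schema above, and the function name `all_string_schema` is hypothetical.

```python
import json


def all_string_schema(record, name="twitter"):
    """Build an Avro-style record schema that types every field as string."""
    return {
        "type": "record",
        "name": name,
        # One field entry per key, in the record's own key order.
        "fields": [{"name": key, "type": "string"} for key in record],
    }


if __name__ == "__main__":
    sample = {"msg": "hello", "geolatitude": "39.76", "countryCode": "US"}
    print(json.dumps(all_string_schema(sample), indent=2))
```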