Yelp: Reverse-Geocoding Businesses to Extract Detailed Location Information

In this code-heavy tutorial, see how to use the Yelp Open Dataset to extract more detailed location information for businesses.

I've been playing around with the Yelp Open Dataset and wanted to extract more detailed location information for each business.

This is an example of the JSON representation of one business:

$ head -n1 dataset/business.json | jq .
{
  "business_id": "FYWN1wneV18bWNgQjJ2GNg",
  "name": "Dental by Design",
  "neighborhood": "",
  "address": "4855 E Warner Rd, Ste B9",
  "city": "Ahwatukee",
  "state": "AZ",
  "postal_code": "85044",
  "latitude": 33.3306902,
  "longitude": -111.9785992,
  "stars": 4,
  "review_count": 22,
  "is_open": 1,
  "attributes": {
    "AcceptsInsurance": true,
    "ByAppointmentOnly": true,
    "BusinessAcceptsCreditCards": true
  },
  "categories": [
    "Dentists",
    "General Dentistry",
    "Health & Medical",
    "Oral Surgeons",
    "Cosmetic Dentists",
    "Orthodontists"
  ],
  "hours": {
    "Friday": "7:30-17:00",
    "Tuesday": "7:30-17:00",
    "Thursday": "7:30-17:00",
    "Wednesday": "7:30-17:00",
    "Monday": "7:30-17:00"
  }
}
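
Since `head -n1` returns a complete record, the file appears to hold one JSON object per line. A minimal sketch of pulling out just the fields needed for reverse geocoding, using a trimmed copy of the record above (`extract_lat_long` is a hypothetical helper name used only for illustration):

```python
import json

# A trimmed copy of the record shown above; the real file is assumed
# to hold one such object per line.
sample_line = (
    '{"business_id": "FYWN1wneV18bWNgQjJ2GNg", "city": "Ahwatukee", '
    '"latitude": 33.3306902, "longitude": -111.9785992}'
)


def extract_lat_long(line):
    """Return (business_id, (latitude, longitude)), or None if a coordinate is missing."""
    item = json.loads(line)
    lat, lon = item.get("latitude"), item.get("longitude")
    # Compare against None rather than truthiness so that 0.0 stays valid.
    if lat is None or lon is None:
        return None
    return item["business_id"], (lat, lon)


print(extract_lat_long(sample_line))
```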

The businesses reside in different countries, so I wanted to extract the area/county/state and the country for each of them. I found the reverse-geocoder library, which is perfect for this problem.

You give the library a lat/long or a list of lat/longs, and it returns a list with one entry per point. Each entry describes the nearest known place: its lat/long, its name, its admin regions, and its country code. It's much quicker to pass in a list of lat/longs than to call the function once per point, so we'll do that.

We can write the following code to extract location information for a list of lat/longs:

import reverse_geocoder as rg

# Map each business_id to its (latitude, longitude).
lat_longs = {
    "FYWN1wneV18bWNgQjJ2GNg": (33.3306902, -111.9785992),
    "He-G7vWjzVUysIKrfNbPUQ": (40.2916853, -80.1048999),
    "KQPW8lFf1y5BT2MxiSZ3QA": (33.5249025, -112.1153098)
}

# keys() and values() iterate in the same order, so the IDs line up
# with the coordinates passed to the single batched search call.
business_ids = list(lat_longs.keys())
locations = rg.search(list(lat_longs.values()))

for business_id, location in zip(business_ids, locations):
    print(business_id, lat_longs[business_id], location)

This is the output we get from running the script:

$ python blog.py 
Loading formatted geocoded file...
FYWN1wneV18bWNgQjJ2GNg (33.3306902, -111.9785992) OrderedDict([('lat', '33.37088'), ('lon', '-111.96292'), ('name', 'Guadalupe'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
He-G7vWjzVUysIKrfNbPUQ (40.2916853, -80.1048999) OrderedDict([('lat', '40.2909'), ('lon', '-80.10811'), ('name', 'Thompsonville'), ('admin1', 'Pennsylvania'), ('admin2', 'Washington County'), ('cc', 'US')])
KQPW8lFf1y5BT2MxiSZ3QA (33.5249025, -112.1153098) OrderedDict([('lat', '33.53865'), ('lon', '-112.18599'), ('name', 'Glendale'), ('admin1', 'Arizona'), ('admin2', 'Maricopa County'), ('cc', 'US')])
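
Note that the returned `lat` and `lon` describe the matched place rather than the business itself, and they come back as strings. As a rough sanity check, a haversine sketch can estimate how far a match is from the original point. This assumes a spherical Earth with a mean radius of 6371 km, and `haversine_km` is a hypothetical helper, not part of the library:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(point_a, point_b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# First business and its match, copied from the output above.
business = (33.3306902, -111.9785992)
match = {"lat": "33.37088", "lon": "-111.96292"}

# The library returns coordinates as strings, so convert before doing maths.
matched_point = (float(match["lat"]), float(match["lon"]))
distance_km = haversine_km(business, matched_point)
print(f"{distance_km:.1f} km")
```

A gap of a few kilometres would be consistent with the first business being listed under Ahwatukee by Yelp but matched to Guadalupe here.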

It seems to work fairly well! Now, we just need to tweak our script to read the values from the Yelp JSON file and generate a new JSON file containing the locations:

import json

import reverse_geocoder as rg

# Map each business_id to its coordinates and the city recorded by Yelp.
lat_longs = {}

with open("dataset/business.json") as business_json:
    # The file holds one JSON object per line, so stream it line by line
    # instead of loading every line into memory with readlines().
    for line in business_json:
        item = json.loads(line)
        # Compare against None so that a valid coordinate of 0 is not skipped.
        if item["latitude"] is not None and item["longitude"] is not None:
            lat_longs[item["business_id"]] = {
                "lat_long": (item["latitude"], item["longitude"]),
                "city": item["city"]
            }

# Look up every point in a single batched call.
business_ids = list(lat_longs.keys())
locations = rg.search([value["lat_long"] for value in lat_longs.values()])

# Pair each result with its business_id, relying on matching order.
result = {}
for business_id, location in zip(business_ids, locations):
    result[business_id] = {
        "country": location["cc"],
        "name": location["name"],
        "admin1": location["admin1"],
        "admin2": location["admin2"],
        "city": lat_longs[business_id]["city"]
    }

with open("dataset/businessLocations.json", "w") as business_locations_json:
    json.dump(result, business_locations_json, indent=4, sort_keys=True)

And that's it!
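
As a follow-up sketch, the generated file can be loaded back and summarised, for example by counting businesses per country or per county. The snippet below uses a small in-memory sample shaped like the script's output, with values taken from the three lookups shown earlier and the `city` field left out, rather than the real file. `count_by` is a hypothetical helper:

```python
from collections import Counter

# Shaped like dataset/businessLocations.json, trimmed to the three
# lookups shown earlier. To use the real file instead, one could run:
#   with open("dataset/businessLocations.json") as f:
#       locations = json.load(f)
locations = {
    "FYWN1wneV18bWNgQjJ2GNg": {
        "country": "US", "name": "Guadalupe",
        "admin1": "Arizona", "admin2": "Maricopa County",
    },
    "He-G7vWjzVUysIKrfNbPUQ": {
        "country": "US", "name": "Thompsonville",
        "admin1": "Pennsylvania", "admin2": "Washington County",
    },
    "KQPW8lFf1y5BT2MxiSZ3QA": {
        "country": "US", "name": "Glendale",
        "admin1": "Arizona", "admin2": "Maricopa County",
    },
}


def count_by(entries, field):
    """Count how many businesses share each value of the given field."""
    return Counter(entry[field] for entry in entries.values())


print(count_by(locations, "country"))
print(count_by(locations, "admin2"))
```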

Published at DZone with permission of
