Designing a Spacial Index
Note: This tutorial is by Nick Dimiduk and Amandeep Khurana, authors of HBase in Action
Geographic Information Systems (GIS) is an interesting area of exploration because it poses two significant challenges: latency at scale and modeling spatial locality. Spatial locality in GIS data is tricky. This article, based on chapter 8 from HBase in Action, explains an algorithm called the geohash, which is a solution to this problem.
Let's say you're visiting NYC and need an Internet connection. Where's the nearest Wi-Fi hotspot?How might an HBase application help you answer that question? What kind of design decisions go into solving that problem, and how can it be solved in a scalable way?
You know you want fast access to a relevant subset of the data. To achieve that, let's start with two simple and related goals:
- You want to store location points on disk such that points close to each other in space are close to each other on disk.
- You want to retrieve as few points as possible when responding to a query.
Achieving these goals will allow you to build a highly responsive online application against a spatial dataset. The main tool HBase gives you for tackling both of these goals is the rowkey. We have a hunch it won't meet the first design goal. It doesn't hurt to try, especially if you can learn something along the way. It's easy to implement, so let's evaluate the basic compound rowkey before trying anything more elaborate.
Let's start by looking at the data. New York City has an open-data program and publishes many datasets. One of those datasets is a listing of all the city's Wi-Fi hotspots. We don't expect you to be familiar with GIS or GIS data, so we've preprocessed it a bit. Here's a sample of that data:
X Y ID NAME 1 -73.96974759 40.75890919 441 Fedex Kinko's 2 -73.96993203 40.75815170 442 Fedex Kinko's 3 -73.96873588 40.76107453 463 Smilers 707 4 -73.96880474 40.76048717 472 Juan Valdez NYC 5 -73.96974993 40.76170883 219 Startegy Atrium and Cafe 6 -73.96978387 40.75850573 388 Barnes & Noble 7 -73.96746533 40.76089302 525 McDonalds 8 -73.96910155 40.75873061 564 Public Telephone 9 -73.97000655 40.76098703 593 Starbucks
You've processed the data into a tab-separated text file. The first line contains the column names. The columns X and Y are the longitude and latitude values, respectively. Each record has an ID, a NAME, and a number of other columns.
A great thing about GIS data is that it lends itself nicely to pictures! Unlike other kinds of data, no aggregations are required to build a meaningful visualization—just throw points on a map and see what you have. This sample data looks like figure 1.
Figure 1 Find the Wi-Fi. Geographic data wants to be seen, so draw it on a map. Here's a sampling of the full dataset—a handful of places to find a Wi-Fi connection in Midtown Manhattan.
Based on the goals outlined earlier, you now have a pretty good spot-check for your schema designs. Remember goal #1. Point 388 is close to point 441 on the map, so those records should be close to each other in the database. As goal #2 states, if you want to retrieve those two points, you shouldn't have to retrieve point 219 as well.
Now you know the goals and you know the data. It's time to take a stab at the schema. Design of the rowkey is the single most important thing you can do in your HBase schema, so let's start there.
Starting with a compound rowkey
Concatenating X- and Y-axis values as the rowkey won't cut it for an efficient schema. We cited the D.C. versus BogotÃ¡ example as proof. Let's see why. Sort the sample records first by longitude, then latitude, and connect the dots. Figure 2 does just that. When you store data this way, scanning returns results ordered from 1 to 9. Notice in particular the distance between steps 6 and 7, and steps 8 and 9. This sorting results in lots of hopping between the northern and southern clusters because of sorting first on longitude, then latitude.
Figure 2 A naive approach to spatial schema design: concatenated axes values. This schema fails the first objective of mapping spatial locality to record locality.
This schema design does okay with goal #1, but that's likely because the data sample is small. Goal #2 is also poorly represented. Every jump from the northern cluster to the southern cluster represents retrieval of data you don't need. That's precisely the problem you see in this schema design. Points close to each other aren't necessarily close together as records in HBase. When you translate this to a rowkey scan, you have to retrieve every possible point along the latitude for the desired longitude range. Of course, you could work around this flaw in the design. Perhaps you could build a latitude filter, implemented as a RegexStringComparator attached to a RowFilter. At least that way you could keep all that extra data from hitting the client. That's not ideal, though. A filter is reading records out of the store in order to execute the filter logic. It would be better to never touch that data, if you can avoid it.
This schema design, placing one dimension ahead of the other in the rowkey, also implies an ordered relationship between dimensions that doesn't exist. You can do better. To do so, you need to learn about a trick the GIS community devised for solving these kinds of problems: the geohash.
Introducing the geohash
As the previous example shows, longitude and latitude are equally important in defining the location of a point. In order to use them as the basis for the spatial index, you need an algorithm to combine them. Such an algorithm will create a value based on the two dimensions. That way, two values produced by the algorithm are related to each other in a way that considers both dimensions equally. The geohash does exactly this.
A geohash is a function that turns some number of values into a single value. For it to work, each of those values must be from a dimension with a fixed domain. In this case, you want to turn both longitude and latitude into a single value. The longitude dimension is bounded by the range [-180.0, 180.0], and the latitude dimension is bounded by the range [-90.0, 90.0]. There are a number of ways to reduce multiple dimensions to a single one, but we're using the geohash here because its output preserves spatial locality.
A geohash isn't a flawless encoding of the input data. For you audiophiles, it's a bit like an MP3 of your source recording. The input data is mostly there, but only mostly.
Like an MP3, you must specify a precision when calculating a geohash. You'll use 12 geohash characters for the precision because that's the highest precision you can fit in an 8-byte Long and still represent a meaningful character string. By truncating characters from the end of the hash, you get a less precise geohash and a correspondingly less precise selection of the map. Where full precision represents a point, partial precision gives you an area on the map, effectively a bounding box around an area in space. Figure 3 illustrates the decreasing precision of a truncated geohash.
Figure 3 Truncating a geohash. By dropping characters from the end of a geohash, you drop precision from the space that hash represents. A single character goes a long way.
For a given geohash prefix, all points within that space match the common prefix. If you can fit your query inside a geohash prefix's bounding box, all matching points will share a common prefix. That means you can use HBase's prefix scan on the rowkeys to get back a set of points that are all relevant to the query. That accomplishes goal #1. But as figure 3 illustrates, if you have to choose an overly generous precision, you'll end up with much more data than you need. That violates goal #2. You need to work around these edge cases, but we'll cover that a little later. For now, let's look at some real points.
Consider these three locations: LaGuardia Airport (40.77° N, 73.87° W), JFK International Airport (40.64° N, 73.78° W), and Central Park (40.78° N, 73.97° W). Their coordinates geohash to the values dr5rzjcw2nze , dr5x1n711mhd , and dr5ruzb8wnfr, respectively. You can look at those points on the map in figure 4 and see that Central Park is closer to LaGuardia than JFK. In absolute terms, Central Park to LaGuardia is about 5 miles, whereas Central Park to JFK is about 14 miles.
Figure 4 Relative distances. When viewed on a map, it's easy to see that the distance between Central Park and JFK is much farther than the distance between Central Park and LaGuardia. This is precisely the relationship you want to reproduce with your hashing algorithm.
Because they're closer to each other spatially, you expect Central Park and LaGuardia to share more common prefix characters than Central Park and JFK. Sure enough:
$ sort <(echo "dr5rzjcw2nze"; echo "dr5x1n711mhd"; echo "dr5ruzb8wnfr") dr5ruzb8wnfrdr5rzjcw2nzedr5x1n711mhd
Now that you understand how a geohash can work for you, we'll show you how to calculate one. Don't worry; you won't be hashing all these points by hand. With HBase, it's useful to understand how it works in order to use it effectively. Likewise with the geohash, understanding how it's constructed will help you understand its edge cases.
Understand the geohash
The geohashes you've seen are all represented as character strings in the Base32 encoding alphabet. In reality, the geohash is a sequence of bits representing an increasingly precise subdivision of longitude and latitude.
For instance, 40.78° N is a latitude. It falls in the upper half of the range [-90.0, 90.0], so its first geohash bit is 1. Its second bit is 0 because 40.78 is in the lower half of the range [0.0, 90.0]. The third range is [0.0, 45.0], and 40.78 falls in the upper half, so the third bit is 1.
The contribution provided by each dimension is calculated by halving the value range and determining which half the point resides in. If the point is greater than or equal to the midpoint, it's a 1-bit. Otherwise, it's a 0-bit. This process is repeated, again cutting the range in half and selecting a 1 or 0 based on where the target point lies. This binary partitioning is performed on both the longitude and latitude values. Rather than using the bit sequence from each dimension independently, the encoding weaves the bits together to create the hash. The spatial partitioning is why geohashes have the spatial locality property. The weaving of bits from each dimension is what allows for the prefix-match precision trickery.
Now that you understand how each component is encoded, let's calculate a full value. This process of bisecting the range and selecting a bit is repeated until the desired precision is achieved. A bit sequence is calculated for both longitude and latitude, and the bits are interwoven, longitude first, then latitude, out to the target precision.
Figure 5 illustrates the process. Once the bit sequence is calculated, it's encoded to produce the final hash value.
Figure 5 Constructing a geohash. The first 3 bits from longitude and latitude are calculated and woven to produce a geohash of 6-bit precision. The example data we discussed previously executed this algorithm out to 7 Base32 characters, or 35-bit precision.
Now that you understand why the geohash is useful to you and how it works, let's plug it in for your rowkey.
Using the geohash as a spatially aware rowkey
The geohash makes a great choice for the rowkey because it's inexpensive to calculate and the prefix gets you a long way toward finding nearest neighbors. Let's apply it to the sample data, sort by geohash, and see how you're doing on prefixes. We've calculated the geohash for each point using a library and added it to the data. All of the data in the sample is relatively close together, so you expect a good deal of prefix overlap across the points:
GEOHASH X Y ID NAME 1 dr5rugb9rwjj -73.96993203 40.75815170 442 Fedex Kinko's 2 dr5rugbge05m -73.96978387 40.75850573 388 Barnes & Noble 3 dr5rugbvggqe -73.96974759 40.75890919 441 Fedex Kinko's 4 dr5rugckg406 -73.96910155 40.75873061 564 Public Telephone 5 dr5ruu1x1ct8 -73.96880474 40.76048717 472 Juan Valdez NYC 6 dr5ruu29vytq -73.97000655 40.76098703 593 Starbucks 7 dr5ruu2y5vkb -73.96974993 40.76170883 219 Startegy Atrium and Cafe 8 dr5ruu3d7x0b -73.96873588 40.76107453 463 Smilers 707 9 dr5ruu693jhm -73.96746533 40.76089302 525 McDonalds
Sure enough, you get five characters of common prefix. That's not bad at all! This means you're a long way toward the distance query and goal #1 with a simple range scan. For context, figure 6 puts this data on a map.
Figure 6 Seeing prefix matches in action. If the target search is in this area, a simple rowkey scan will get the data you need. Not only that, but the order of results makes a lot more sense than the order in figure 2.
This is much better than the compound rowkey approach, but it's by no means perfect. All these points are close together, within a couple blocks of each other. Why are you only matching on 5 of 12 characters? We would hope data this spatially close would be stored much closer together. Thinking back to figure 3, the difference in size of the spatial area covered by a prefix scan of five versus six versus seven characters is significant—far more than a couple of blocks. You'd make strides toward goal #2 if you could make two scans of prefix six rather than a single scan of prefix five. Or better still, how about five or six scans of prefix seven? Let's look at figure 7, this time with more perspective.
The geohash boxes for both the six-character and seven-character geohash prefixes are overlaid.
Figure 7 Prefix matches with geohash overlay. Lots of additional, unnecessary area is introduced into the query result by using the 6-character prefix. An ideal implementation would use only 7-character prefixes to minimize the amount of extra data transmitted over the wire.
Compared to the target query area, the six-character prefix match areas are huge. Worse still, the query spans two of those larger prefixes. Seen in this context, those five characters of common prefix include far more data than you need. Relying on prefix match results in scanning a huge amount of extra area. Of course, there's a trade-off. If your data isn't dense at this precision level, executing fewer, longer scans isn't such a big deal. The scans don't return too much superfluous data, and you can minimize the remote procedure call (RPC) overhead. If your data is dense, running more, shorter scans will reduce the number of excess points transported over the wire. Plus, if there's one thing that computers are getting good at these days, it's parallelism. Execute each of those shorter scans on its own CPU core, and the query is still as fast as the slowest scan.
Let's scroll the map over to a different part of Manhattan, not far from the space we've explored thus far. Look at figure 8. Notice that the geohash of the center box has six characters (dr5ruz) of prefix in common with the boxes to its east, southeast, and south. But there are only five characters (dr5ru) in common with the west and southwest boxes. If five characters of common prefix is bad, then the prefix match with the entire northern row is abysmal, with only two characters (dr) in common!
Figure 8 Visualizing the geohash edge case. The encoding isn't perfect; this is one such case. Imagine a nearest-neighbor search falling on the point under the arrow in this illustration. It's possible you'll find a neighbor in a tile with only two characters of common prefix.
This doesn't happen every time, but it does happen with a surprisingly high frequency. As a counterexample, all eight neighbors of the southeast box (dr5ruz9) share a common six-character prefix.
The geohash is effective, but you can't use a simple naive prefix match either. Based on these figures, it looks like the optimal approach for the data is to scan the center tile and its eight neighbors. This approach will guarantee correct results while minimizing the amount of unnecessary network IO. Luckily, calculating those neighbors is a simple bit-manipulation away. Explaining the details of that manipulation is beyond the scope of our interest, so we'll trust the geohash library to provide that feature.
Not all linearization techniques are created equal
The geohash is approximating the data space. That is, it's a function that computes a value on a single output dimension based on input from multiple dimensions. In this case, the dimensionality of the input is only 2, but you can imagine how this could work for more. This is a form of linearization, and it's not the only one. Other techniques such as the Z-order curve and the Hilbert curve are also common. These are both classes of space-filling curves: curves defined by a single, uninterrupted line that touches all partitions of a space. None of these techniques can perfectly model a two-dimensional plane on a one-dimensional line and maintain the relative characteristics of objects in those spaces. We choose the geohash because, for our purposes, its error cases are less bad than the others.
This article was as much about GIS as about HBase. Remember, HBase is just a tool. To use it effectively, you need to know both the tool and the domain in which you want to apply it. The geohash trick proves that point. A little domain knowledge can go a long way.
3. Base32 is an encoding used to represent a binary value as a sequence of ASCII characters. Note that although geohash uses an alphabet of characters similar to that of Base32, the geohash spec doesn't follow the Base32 RFC. Learn more about Base32 at http://en.wikipedia.org/wiki/Base32. ↩
4. When including a direction, degrees of latitude are measured from 0.0 to 90.0 with the northern hemisphere corresponding to positive values and the southern hemisphere to negative values on the absolute latitude range. Likewise, degrees of longitude are measured from 0.0 to 180.0 with the eastern hemisphere indicating positive values and western hemisphere indicating negative values. ↩
6. The Z-order curve is extremely similar to the geohash, involving the interleaving of bits. Read more at http: //en.wikipedia.org/wiki/Z-order_curve. ↩
8. Read more about space-filling curves at http://en.wikipedia.org/wiki/Space-filling_curves. ↩