Handling Big Data with HBase Part 5: Data Modeling (or, Life Without SQL)
This is the fifth of a series of blogs introducing Apache HBase. In the fourth part, we saw the basics of using the Java API to interact with HBase to create tables, retrieve data by row key, and do table scans. This part discusses how to design schemas in HBase.

HBase has nothing resembling the rich query capability of SQL in relational databases. It forgoes that capability, along with relationships, joins, and so on, to focus on providing scalability, good performance, and fault tolerance. So when working with HBase, you need to design row keys and table structure (rows and column families) to match the data access patterns of your application. This is the opposite of what you do with relational databases, where you start with a normalized schema and separate tables, and then use SQL joins to combine data in the ways you need. With HBase you design your tables specifically for how they will be accessed by applications, so you need to think much more up front about data access. You are much closer to the bare metal with HBase than with relational databases, which abstract away implementation details and storage mechanisms. However, for applications that need to store massive amounts of data with inherent scalability, good performance characteristics, and tolerance to server failures, the potential benefits can far outweigh the costs.

In the last part on the Java API, I mentioned that when scanning data in HBase the row key is critical, since it is the primary means of restricting the rows scanned. Typically you create a scan using start and stop row keys, and optionally add filters to further restrict the rows and columns returned. To have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data.
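Because the start row of a scan is inclusive and the stop row is exclusive, a common trick is to derive the stop key from a prefix by incrementing its last character. Here is a minimal sketch; the class and method names are my own for illustration, and the HBase client calls (which need a running cluster) are shown only as comments:

```java
public class ScanBoundaries {

    // Given a row key prefix, compute the exclusive stop key that bounds
    // all row keys starting with that prefix: increment the final character.
    // (Hypothetical helper; HBase also ships utilities such as PrefixFilter
    // that can serve a similar purpose.)
    public static String stopKeyFor(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    public static void main(String[] args) {
        String start = "abc";              // inclusive
        String stop = stopKeyFor(start);   // "abd", exclusive
        System.out.println(start + " .. " + stop);

        // With the HBase client from the earlier parts of this series,
        // the scan would look something like:
        // Scan scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(stop));
        // ResultScanner scanner = table.getScanner(scan);
    }
}
```

Every row key beginning with the prefix sorts before the incremented stop key, so the scan covers exactly the prefix range.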
In the blog and people examples we've seen so far, the row keys were designed to allow scanning via the most common data access patterns.

For the blogs, the row key was simply the posting date. That permits scans in ascending order of blog entries, which is probably not the most common way to view blogs; you'd rather see the most recent ones first. A better row key design is a reverse-order timestamp, which you can compute as (Long.MAX_VALUE - timestamp), so that scans return the most recent blog posts first. This makes it easy to scan specific time ranges, for example to show all blogs from the past week or month, which is a typical way to navigate blog entries in web applications.

For the people table, we used a composite row key composed of last name, first name, middle initial, and a (unique) person identifier to distinguish people with the exact same name, separated by dashes. For example, Brian M. Smith with identifier 12345 would have the row key smith-brian-m-12345. Scans of the people table can then use start and stop rows designed to retrieve people with a specific last name, with last names starting with specific letter combinations, or with the same last name and first-name initial. For example, to find people whose last name is Smith and whose first name begins with B, you could use the start row key smith-b and the stop row key smith-c (the start row key is inclusive while the stop row key is exclusive, so the stop key smith-c ensures all Smiths with first names starting with "B" are included).

As you can see, HBase supports the notion of partial keys, meaning you do not need to know the exact key, which provides more flexibility when creating scans. You can combine partial-key scans with filters to retrieve only the specific data needed, thus optimizing retrieval for the data access patterns specific to your application.
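The reverse-order timestamp idea can be sketched as follows; the class and method names are my own for illustration. Zero-padding the result to a fixed width matters, because HBase compares row keys lexicographically as bytes:

```java
public class BlogRowKeys {

    // Reverse-order timestamp: Long.MAX_VALUE - timestamp, zero-padded to
    // 19 digits (the width of Long.MAX_VALUE) so lexicographic order matches
    // numeric order. Newer posts produce smaller keys and so sort first.
    public static String reverseTimestampKey(long timestampMillis) {
        return String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        long older = 1372947910000L;  // an earlier posting time
        long newer = 1373212245000L;  // a later posting time
        // The newer post's key sorts before the older post's key,
        // so a plain scan returns the most recent entries first.
        System.out.println(reverseTimestampKey(newer));
        System.out.println(reverseTimestampKey(older));
    }
}
```

A scan over a time range then just uses the reverse keys of the range's end and start as the scan's start and stop rows.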
So far the examples have involved only single tables containing one type of information and no related information. HBase has no foreign key relationships as in relational databases, but because it supports rows with up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row: a "wide" table design. It is called "wide" because you store all the information related to a row together, in as many columns as there are data items.

In our blog example, you might want to store comments for each entry. The "wide" way to design this is to add a column family named comments, with columns whose qualifiers are the comment timestamps; the comment columns would look like comments:20130704142510 and comments:20130707163045. Even better, when HBase retrieves columns it returns them in sorted order, just like row keys. So to display a blog entry and its comments, you can retrieve all the data from one row by asking for the content, info, and comments column families. You could also add a filter to retrieve only a specific number of comments, enabling pagination.

The people table's column families could likewise be redesigned to store contact information such as separate addresses, phone numbers, and email addresses, allowing all of a person's information to be kept in one row. This kind of design works well when the number of columns is relatively modest, as blog comments and a person's contact information would be. If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user's emails, transactions, or sensor readings across multiple rows (a "tall" design) and design the row keys to allow efficient scanning and pagination.
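The wide comment design can be sketched as follows. The helper class is my own for illustration; the key point is that fixed-width yyyyMMddHHmmss qualifiers make HBase's sorted column order the same as chronological order. The HBase client write (which needs a running cluster) is shown only as a comment:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TreeSet;

public class WideBlogRow {

    // Comment qualifiers are timestamps formatted as yyyyMMddHHmmss, so
    // lexicographic (sorted) order is also chronological order.
    public static String commentQualifier(Date when) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(when);
    }

    public static void main(String[] args) {
        // TreeSet sorts lexicographically, mimicking how HBase returns columns.
        TreeSet<String> qualifiers = new TreeSet<>();
        qualifiers.add(commentQualifier(new Date(1373212245000L)));
        qualifiers.add(commentQualifier(new Date(1372947910000L)));
        System.out.println(qualifiers);  // earliest comment first

        // With the HBase client, adding a comment to a blog row would look
        // something like:
        // Put put = new Put(rowKey);
        // put.add(Bytes.toBytes("comments"),
        //         Bytes.toBytes(commentQualifier(new Date())),
        //         Bytes.toBytes("Great post!"));
        // table.put(put);
    }
}
```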
For an inbox, the row key might look like (user ID)-(reverse-order timestamp), which would permit easy scanning and pagination of a user's inbox, while for financial transactions the row key might be (account number)-(reverse-order timestamp). This kind of design is called "tall" because you spread information about the same thing (e.g. readings from the same sensor, transactions in an account) across multiple rows. It is something to consider when there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors.

Designing row keys and table structures is a key part of working with HBase, and will continue to be, given HBase's fundamental architecture. There are other things you can do to add alternative schemes for data access. For example, you could implement full-text searching via Apache Lucene, either within rows or external to HBase (search Google for HBASE-3529). You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables; for example, in our people table the composite row key consists of the name and a unique identifier, but if we wanted to access people by birth date, telephone area code, email address, or any number of other ways, we could add secondary indexes to enable that form of interaction. Note, however, that adding secondary indexes is not something to be taken lightly: every time you write to the "main" table (e.g. people), you must also update all the secondary indexes! (Yes, this is something that relational databases do very well, but remember that HBase is designed to accommodate a lot more data than traditional RDBMSs were.)

Conclusion to Part 5

In this part of the series, we got an introduction to schema design in HBase (without relations or SQL).
Even though HBase is missing some of the features found in traditional RDBMSs, such as foreign keys and referential integrity, multi-row transactions, multiple indexes, and so on, many applications that need HBase's inherent scalability can benefit from using it. As with anything complex, there are tradeoffs to be made. In the case of HBase, you give up some richness in schema design and query flexibility, but you gain the ability to scale to massive amounts of data by (more or less) simply adding servers to your cluster.

In the next and last part of this series, we'll wrap up and mention a few (of the many) things we didn't cover in these introductory blogs.

References

HBase web site, http://hbase.apache.org/
HBase wiki, http://wiki.apache.org/hadoop/Hbase
HBase Reference Guide, http://hbase.apache.org/book/book.html
HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide
Google Bigtable paper, http://labs.google.com/papers/bigtable.html
Hadoop web site, http://hadoop.apache.org/
Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide
Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk
Sample code, https://github.com/sleberknight/basic-hbase-examples
Updated October 11, 2022