Apache HBase in System Design
Do not miss this exclusive book on Binary Tree Problems. Get it now for free.
In this article, we will learn when Apache HBase might be a good choice of database for your software. Since HBase is a NoSQL database, we begin by exploring the design choice between choosing SQL or NoSQL databases for your application. After this, we explore the specific features of HBase not found in all NoSQL databases, and close with a summary of when and when not to use HBase in your software.
Table of contents
SQL and NoSQL
One can store certain kinds of data by organising it neatly in groups of tables, where each table contains a bunch of rows and columns. People in 1970s decided to make it even more structured, forcing values in the same column to be a single type and size. Often, tables will share one or more columns, and that's how two tables are related together. Without these relationships, we wouldn't be able to form connections between different pieces of information in different tables, therefore there is a lot of emphasis on the relationships in this form of data storage, aptly called relational database management system. Some people simply call it SQL, which is actually the language used to manipulate the relational database.
Here is an example of how tables are related in this model. Let's say you have a table called 'wishlist'.
Index | Thing |
---|---|
1 | iPhone X |
2 | Red dress |
3 | Movie ticket |
Thing | Price |
---|---|
iPhone X | $1000 |
Laptop | $1500 |
Food | $15 |
Red dress | $500 |
Vacation | $2000 |
Movie ticket | $7 |
Now say you have to find the price of the third thing in your wishlist. First you will look up the row where index is 3 in wishlist table, after which you will read the value inside 'things' column. The thing column also exists in the prices table, so you will again look up the thing (note how you are changing tables) in the prices table and get the price. When you changed tables during this process, you are traversing a relationship between the two tables. Another way to find the price of the third thing is to join the two tables together, and get the price for your intended index. Relational databases optimise for that too. SQL shines when you can organise your data neatly in tables like above, and when you want to do lots of joins and other generic operations across tables.
However, it doesn't scale as well as the rival NoSQL databases. These databases don't store data in as structured fashion and by sacrificing some features of the relational model, they achieve better flexibility and can be used to store greater amounts of data across multiple machines, making them a better candidate for BigData. A NoSQL database called Redis stores key-value pairs, but the values can be anything. These non-relational databases can split their data across a lot of computers based on the key. This distributes loads across multiple machines, and you can add more machines later to get even better performance.
You can also have multiple copies of data on different machines incase something goes wrong and one of the copies gets damaged. Such scalability makes non-relational (also called NoSQL databases) ideal for situations where you are storing data in the magnitude of hundreds of terabytes or petabytes, and want to split the load across multiple machines.
HBase
HBase is an open-source NoSQL database based upon Google's Bigtable. BigTable is different from other NoSQL databases as it stores structured data, rather than unstructured data. The data is organised in rows and columns. However, while RDBMS imposes strict type rules on a column, and there can only be one value in a single column, in BigTable (and HBase) you can store different types of values , multiple values or even leave a slot empty. It is therefore less strict than RDBMS about what goes inside a column. Columns are organised in groups called column families, which are supposed to have the same type for certain optimisations to work. For example, BigTable or HBase will decide to compress data from a certain column based on the type of the data. Data from the same column families is grouped together for ease of access.
Ever since it's advent in 2004, BigTable has been used in many of Google's apps including Google Earth, Finance, and Youtube. It's also used by google to store indexed webpages. Each row stores a different webpage indexed by URL. Columns store different aspects of the webpage, such as links, or images etc. BigTable also adds another z-index, which stores different versions in time for the same webpage. Accessing some webpage by URL in this web storage is very fast, after which you can choose to go to some specific snapshot in time of this webpage or look at certain aspects of the page (like all the <p> tags).
All of this is stored on multiple machines as BigTable writes to Google's distributed file system which takes care of the low-level storage. HBase too, functions on top of the distributed and open-source Hadoop file system (also by Apache). Writing to Hadoop file system allows HBase to act as a distributed database without concerning itself with the low-level issues of distributing data across machines, or duplicating data to make sure it stays intact even when one of the copies goes bad.
When Facebook opted for using HBase for their Messenger platform, they made the decision because of scalability reasons too.
HBase in Facebook
Facebook was handling 135 Billion messages a month when the article was written. It also suited Facebook's messenger data model quite well. For example, message searching could be implemented as a search on the userid (stored in row), word (column) and offset (z-axis). If Alice told you about her birthday using facebook messenger, you could do a search for "birthday" in her messages incase you would forget, which would be the same as searching the whole z-axis in row "alice" and column "birthday".
HBase in Adobe
When Adobe shifted to HBase, they needed a structured storage that had access times under 50ms and could grow to handle large amounts of data. They didn't want any downtime or data loss (Hadoop managed at least the data loss part), and could process the data using MapReduce.
So when should I use HBase ?
If you are scaling for storing huge amounts of data, or will have to do so in the near future, you might want to switch to a NoSQL database and HBase might be a potential candidate. You will want to know your data access patterns beforehand, however, because only then can you structure your database in a way that makes access fast.
If you find yourself creating similar tables as SQL in HBase, and doing a lot of joins, HBase might not be a good choice for you (unless you restructure your data for it). HBase is not optimised for cross-table operations like joins, unlike SQL and will be slower at it.
To be using HBase properly, you will have to predict how your database will be used, and restructure your database to a few large tables, optimised for the access patterns that you have predicted. Given that you are able to do this, you can scale to process petabytes of data, and increase your scale just by adding more machines.
HBase is a particularly good solution among the NoSQL databases, when you want to get the features of NoSQL but not sacrifice on the structuredness that relational DBMS provides you with its rows and columns. Organising data in columns allows you to associate related data together for optimised access and storage, employing tricks like compression where suitable.
Sign up for FREE 3 months of Amazon Music. YOU MUST NOT MISS.