Hadoop and HBase in production

December 28th, 2009 | by bryan

At the beginning of December, I ported several portions of the ReadPath infrastructure from MySQL to HBase with Hadoop. So far it’s been a complete success.

At first I focused most of my attention on HBase, seeing it as a way to scale systems beyond a single MySQL instance without the headaches of partitioning. However, once things went into production, I discovered that being able to kick off Map/Reduce jobs over parts of the dataset is a huge advantage.

The first two areas to be ported were the content archive and the dictionary. ReadPath still keeps items that are less than 4 weeks old in a large MySQL database, although the plan is to eventually move everything over to HBase. Items that are older than 4 weeks are removed from the primary database and inserted into an 8-node HBase/Hadoop cluster. The content table works very well with HBase’s access patterns and can be fetched at speeds on par with MySQL.
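To make the split concrete, here is a minimal sketch of the 4-week rule. The post doesn't show ReadPath's actual archiving code, so the function name and the `(item_id, published_at)` tuple shape are hypothetical; only the 4-week cutoff comes from the text above.

```python
from datetime import datetime, timedelta

ARCHIVE_AFTER = timedelta(weeks=4)  # cutoff taken from the post


def split_for_archive(items, now):
    """Split (item_id, published_at) pairs into (recent, to_archive) id lists.

    Hypothetical helper: 'recent' would stay in MySQL, 'to_archive' would be
    written to HBase and then deleted from the primary database.
    """
    cutoff = now - ARCHIVE_AFTER
    recent, to_archive = [], []
    for item_id, published_at in items:
        # Items published before the cutoff are older than 4 weeks.
        if published_at < cutoff:
            to_archive.append(item_id)
        else:
            recent.append(item_id)
    return recent, to_archive
```

In a real system the move would presumably write to HBase first and delete from MySQL only after the write is confirmed, so a failed run never loses an item.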

The personalized content scoring features of ReadPath depend on having a good measurement of term frequencies. To support this, there is a dictionary of all of the terms used in the content database along with their frequencies. The initial implementation of the dictionary wasn’t scaling properly, so it was converted to a Map/Reduce job that stores its data in HBase. The dictionary processing went from a system that was having trouble keeping up with the incoming stream of content (ReadPath adds ~1,500 new items / minute) to one that can completely rebuild a dictionary from 250 million content items in under 3 hours (this equates to ~1,400,000 items / minute).
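The shape of a term-frequency job like this can be sketched in a few lines. This is an in-process illustration of the map and reduce phases, not the actual ReadPath job and not the Hadoop API; the tokenizer regex and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def map_terms(item_text):
    """Map phase: emit a (term, 1) pair for every token in one content item."""
    for term in TOKEN.findall(item_text.lower()):
        yield term, 1


def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)


def build_dictionary(items):
    """Run the map phase over every item, then reduce, all in one process."""
    pairs = (pair for text in items for pair in map_terms(text))
    return reduce_counts(pairs)
```

On a cluster, the framework would group the emitted pairs by term between the two phases, and the reducers would write each total to an HBase row instead of returning a dict. Because every item is mapped independently, the work spreads across all nodes, which is where the rebuild speed comes from.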

One of the main things keeping me from pulling the trigger on porting to HBase was concern about data loss. In my first day of playing with HBase, a bad server took out the .META. table, which resulted in the complete loss of the HBase tables. I pulled that server and haven’t had any data loss since, but I have also made good use of the HBase Exporter Map/Reduce job, which dumps the contents of your tables to HDFS. The dump can then be easily restored if for some reason the HBase tables become corrupted. These backup and restore techniques are actually much easier than the standard systems used for MySQL at the scale that ReadPath had gotten to.
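For reference, a backup wrapper around that job might look something like the sketch below. It assumes the "Exporter" mentioned above is the `org.apache.hadoop.hbase.mapreduce.Export` class that ships with HBase (with `Import` as its counterpart) and that it is launched by class name through the `hbase` script; the exact invocation varies between HBase versions, and the table name and HDFS path in the usage note are hypothetical.

```python
# Assumption: Export/Import are run by class name via the `hbase` launcher.
# Check the invocation against the HBase version actually deployed.
EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"
IMPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Import"


def export_command(table, hdfs_dir):
    """Build the argv that dumps `table` to `hdfs_dir` on HDFS."""
    return ["hbase", EXPORT_CLASS, table, hdfs_dir]


def import_command(table, hdfs_dir):
    """Build the argv that loads a previous dump back into `table`."""
    return ["hbase", IMPORT_CLASS, table, hdfs_dir]
```

The lists can be passed to `subprocess.run(..., check=True)` from a nightly job, for example `export_command("content", "/backups/content/2009-12-28")`. Building the argv separately from running it keeps the command easy to log and test.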

Next steps include porting the entire content system over to HBase and looking at using HBase for the link graph system that ReadPath needs to sort items. The link graph is a much more difficult system: its read/write pattern is completely random, which blows away any caching. In preliminary tests, the system ends up being disk bound. Of course, the current system has grown larger than a single MySQL instance can hold and is disk bound as well, so having 8+ disks is better than 1 disk.
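A small simulation shows why a random access pattern defeats caching. It is a toy model, not a measurement of the ReadPath link graph: it assumes uniformly random reads and a plain LRU cache, and the key counts in the usage note are made up.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_keys, cache_size, num_reads, seed=0):
    """Simulate uniformly random reads against an LRU cache; return the hit rate."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(num_reads):
        key = rng.randrange(num_keys)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / num_reads
```

With uniform random reads, the hit rate settles near `cache_size / num_keys`, so `lru_hit_rate(10_000, 1_000, 50_000)` comes out at roughly one hit in ten. Every miss is a disk seek, which is why the system ends up disk bound unless the working set fits in memory.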

3 Responses to “Hadoop and HBase in production”

By stack on Jan 4, 2010

    Nice post

By Jean-Daniel Cryans on Jan 4, 2010

    That’s a very nice write up Bryan, care to add ReadPath to HBase’s PoweredBy page? Here it is: http://wiki.apache.org/hadoop/Hbase/PoweredBy

    About that link graph, depending on the size of the data set and the available RAM, you could try to keep that whole table in memory (or at least the data you need most often) by using IN_MEMORY column families.

1 Trackback(s)

Jan 12, 2010: Hadoop and HBase in production › ec2base
