Posts

Showing posts from May, 2012

Optimizing HBase MapReduce scans (for Hive)

By targeting data locality, full table scans of HBase using MapReduce across 373 million records are reduced from 19 minutes to 2.5 minutes. We've been posting about HBase performance, with all of those posts based on the PerformanceEvaluation tools supplied with HBase. This has helped us understand many characteristics of our system, but in some ways it has sidetracked our tuning: we investigated channel bonding to increase inter-machine bandwidth, believing that bandwidth was our primary limitation. While that will help for many things (e.g. the copy between mappers and reducers), a key usage pattern involves full table scans of HBase (spawned by Hive), and in a well set up environment network traffic should be minimal for this. Here I describe how we approached this problem, and the results.

The environment

We run Ganglia for cluster monitoring (and ours is public) and Puppet to provision machines. As an aside, without these tools or an equivalent ...

Hive 0.9 with HBase 0.90

Hive 0.9.0 was released at the beginning of this month and it contains a lot of very nice improvements. Thanks to all involved! Unfortunately it drops compatibility with HBase 0.90.x due to two issues which introduced a dependency on HBase 0.92: https://issues.apache.org/jira/browse/HIVE-2748 and https://issues.apache.org/jira/browse/HIVE-2764. Fortunately these were relatively easy to revert, so that's what we did, because we wanted all the 0.9.0 goodness on our HBase 0.90.4 cluster (CDH3u3). I've forked Hive on GitHub and reverted the parts of those two issues (HIVE-2748, HIVE-2764) that were causing problems. For all those "stuck" with HBase 0.90 (e.g. CDH3 users) we've also deployed this custom Hive HBase Handler to our own Maven repository and will maintain that for the foreseeable future. You can just download the jar file and use it in your projects or use our Maven repository: gbif-thirdparty http://repository.gbif.org/content/repositories/thirdpart...