Optimizing HBase MapReduce scans (for Hive)
By targeting data locality, we reduced full table scans of HBase using MapReduce across 373 million records from 19 minutes to 2.5 minutes.

We've been posting some blog entries about HBase performance, all based on the PerformanceEvaluation tools supplied with HBase. This has helped us understand many characteristics of our system, but in some ways it has sidetracked our tuning: we investigated channel bonding to increase inter-machine bandwidth, believing that bandwidth was our primary limitation. While that will help for many things (e.g. the copy between mappers and reducers), a key usage pattern involves full table scans of HBase (spawned by Hive), and in a well set up environment network traffic should be minimal for this. Here I describe how we approached this problem, and the results.

The environment

We run Ganglia for cluster monitoring (and ours is public) and Puppet to provision machines. As an aside, without these tools or an equivalent ...