Posts

Showing posts from July, 2012

Getting started with DataCube on HBase

This tutorial provides a quick introduction to using DataCube, a Java-based OLAP cube library with a pluggable storage engine, open sourced by Urban Airship. In this tutorial, we make use of the built-in HBase storage engine. In a small database, much of this would be trivial using aggregate functions (SUM(), COUNT(), etc.). As the volume grows, one often precalculates these metrics, which brings its own set of consistency challenges. As one outgrows a database, as GBIF has, we need to look for new mechanisms to manage these metrics. The features of DataCube that make this attractive to us are: a manageable process to modify the cube structure; a higher-level API to develop against; and the ability to rebuild the cube with a single pass over the source data. For this tutorial we will consider the source data as classical DarwinCore occurrence records, where each record represents the metadata associated with a species observation event, e.g.: ID, Kingdom, ScientificName, Country, IsoC...
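To make the idea of precalculated metrics concrete, here is a minimal in-memory sketch in Python. It is not the DataCube API and does not touch HBase; it only illustrates how counts for every combination of dimensions can be built in a single pass over the source records. The record values, the `DIMENSIONS` tuple, and the `build_cube` and `count` helpers are all hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical occurrence records; the fields loosely follow the
# Kingdom and Country columns mentioned in the excerpt.
RECORDS = [
    {"id": 1, "kingdom": "Animalia", "country": "DK"},
    {"id": 2, "kingdom": "Animalia", "country": "DK"},
    {"id": 3, "kingdom": "Plantae", "country": "DK"},
    {"id": 4, "kingdom": "Animalia", "country": "ES"},
]

# Dimensions to roll up by (an assumption for this sketch).
DIMENSIONS = ("kingdom", "country")


def build_cube(records, dimensions=DIMENSIONS):
    """Precompute counts for every combination of dimensions in one pass."""
    cube = Counter()
    for record in records:
        # size 0 yields the empty key, i.e. the grand total
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, record[d]) for d in dims)
                cube[key] += 1
    return cube


def count(cube, **filters):
    """Look up a precomputed count; unknown combinations count as zero."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube[key]


cube = build_cube(RECORDS)
print(count(cube))                                    # all records
print(count(cube, kingdom="Animalia"))                # one dimension
print(count(cube, kingdom="Animalia", country="DK"))  # two dimensions
```

Reads become simple key lookups, at the cost of writing several counters per record. Rebuilding from scratch is just another call to `build_cube` over the source data, which is the single-pass property described above.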

Optimizing Writes in HBase

I've written a few times about our work to improve the scanning performance of our cluster (parts 1, 2, and 3), since our highest priority for HBase is being able to serve requests for downloads of occurrence records (which require a full table scan). But now that the scanning is working nicely, we need to start writing new records into our occurrence table, as well as cleaning raw data and interpreting it into something more useful for the users of our data portal. That processing is built as Hive queries that read from and write back to the same HBase table. And while it was working fine on small test datasets, it all blew up once I moved the process to the full dataset. Here's what happened and how we fixed it. Note that we're using CDH3u3, with the addition of Hive 0.9.0, which we patched to support HBase 0.90.4.

The problem

Our processing consists of Hive queries, which run as Hadoop MapReduce jobs. When the mappers were running they would eventually fail (repe...