Posts

Launch of the Canadensys explorer

At Canadensys we have already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog. We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network. The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are also available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF). The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself instead. We hope it is intuitive. For the best user experienc...

Taxonomic Trees in PostgreSQL

Setting aside pro parte synonyms, taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by three models, known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example on Stack Overflow, the slides by Lorenzo Alberton or Bill Karwin, or a Postgres-specific performance comparison between the adjacency model and a nested set. Checklist Bank At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At the core there is a single table, name_usage, which contains records each representing a single taxon in the tree [note: in this post I am using the term taxon broadly, covering both accepted taxa and synonyms]. It primarily uses the adjacency model with a single foreign key, parent_fk, which is null for the root elements of the tree. The simplified diagram of the main tables looks like ...
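To make the adjacency model concrete, here is a minimal sketch in plain Python, assuming name_usage rows reduced to (id, parent_id) pairs; the ids and taxon names in the comments are illustrative, not real Checklist Bank data:

```python
# Adjacency-list model: each row points at its parent; roots have parent None.
from collections import defaultdict

rows = [
    (1, None),   # a root taxon (parent_fk IS NULL)
    (2, 1),
    (3, 2),
    (4, 2),
    (5, 4),
]

# Index children by parent id so subtrees can be walked without rescanning.
children = defaultdict(list)
for usage_id, parent_id in rows:
    children[parent_id].append(usage_id)

def descendants(usage_id):
    """Depth-first walk over the adjacency list, yielding all descendants."""
    for child in children[usage_id]:
        yield child
        yield from descendants(child)

print(list(descendants(2)))  # → [3, 4, 5]
```

The same traversal can be pushed into the database with a recursive query, which is exactly the kind of operation the materialized path and nested set models try to avoid at read time.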

Faster HBase - hardware matters

As I've written earlier, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:

- Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and ZooKeeper
- Zookeeper (c1n1): ZooKeeper for this cluster, master for our other cluster
- Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)

Hardware:

- c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad-core), 8GB RAM, 2x 500GB SATA 5.4K
- c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x 1TB SATA 7.2K

Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved. Figure 1: Scan performance of new cluster (2x 1gig ethernet). So, 1 million records/seco...

Optimizing HBase MapReduce scans (for Hive)

By targeting data locality, full table scans of HBase using MapReduce across 373 million records are reduced from 19 minutes to 2.5 minutes. We've been posting some blogs about HBase performance, all based on the PerformanceEvaluation tools supplied with HBase. This has helped us understand many characteristics of our system, but in some ways it has sidetracked our tuning: we investigated channel bonding to help increase inter-machine bandwidth, believing it was our primary limitation. While that will help for many things (e.g. the copy between mappers and reducers), a key usage pattern involves full table scans of HBase (spawned by Hive), and in a well set-up environment network traffic should be minimal for this. Here I describe how we approached this problem, and the results. The environment We run Ganglia for cluster monitoring (and ours is public) and Puppet to provision machines. As an aside, without these tools or an equivalent ...
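The core of the data-locality idea can be sketched in a few lines: schedule each map task on a node that already hosts the region it will scan, so reads stay local instead of crossing the network. This is a hedged illustration, not the actual Hadoop scheduler; the region and host names are made up (the c4n* names echo the cluster above):

```python
# Map each HBase region to the node hosting it (normally obtained from the
# region locations, which MapReduce input splits expose as preferred hosts).
region_hosts = {
    "region-001": "c4n1",
    "region-002": "c4n2",
    "region-003": "c4n1",
}

def assign_tasks(region_hosts, free_slots):
    """Greedy assignment preferring the node that already holds the region."""
    assignments = {}
    for region, host in region_hosts.items():
        if free_slots.get(host, 0) > 0:
            assignments[region] = host          # local read: no network hop
            free_slots[host] -= 1
        else:
            # Fall back to any node with capacity; this region's data will
            # have to travel over the wire.
            fallback = next(h for h, n in free_slots.items() if n > 0)
            assignments[region] = fallback
            free_slots[fallback] -= 1
    return assignments

print(assign_tasks(region_hosts, {"c4n1": 1, "c4n2": 2}))
```

When most tasks land on the "local" branch, a full table scan reads from local disks and the network carries little beyond coordination traffic, which is why the earlier focus on channel bonding missed the real bottleneck.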

Hive 0.9 with HBase 0.90

Hive 0.9.0 was released at the beginning of this month and it contains a lot of very nice improvements. Thanks to all involved! Unfortunately it drops compatibility with HBase 0.90.x due to two issues which introduced a dependency on HBase 0.92: https://issues.apache.org/jira/browse/HIVE-2748 and https://issues.apache.org/jira/browse/HIVE-2764. Fortunately these were relatively easy to revert, so that's what we did, because we wanted all the 0.9.0 goodness on our HBase 0.90.4 cluster (CDH3u3). I've forked Hive on GitHub and reverted the parts of those two issues (HIVE-2748, HIVE-2764) that were causing problems. For all those "stuck" with HBase 0.90 (e.g. CDH3 users) we've also deployed this custom Hive HBase handler to our own Maven repository and will maintain it for the foreseeable future. You can just download the jar file and use it in your projects or use our Maven repository: gbif-thirdparty http://repository.gbif.org/content/repositories/thirdpart...

HBase Performance Evaluation continued - The Smoking Gun

Update: See also part 1 and part 3. In my last post I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class. I wasn't happy with our conclusions, which could largely be summed up as "we're not sure what's wrong, but it seems slow". So in the grand tradition of people with itches that wouldn't go away, I kept scratching. Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the last post) using a 300M-row table built with PerformanceEvaluation sequentialWrite 300, and tested with PerformanceEvaluation scan 300. I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts. And to recap our hardware setup: we have 3 regionservers and a separate master. The first unsettling Ganglia metric that kept me digging was of ethernet bytes_in and bytes_out. I'll recreate those here: Figure 1 - bytes_in (MB/...
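Some back-of-envelope arithmetic shows why ethernet traffic is a plausible suspect for a scan of this size. The 300M-row figure is from the test setup above; the roughly 1000-byte value per row and the 1 GbE link speed are assumptions for illustration:

```python
# Rough sizing of the scan test, under assumed per-row and link figures.
rows = 300_000_000
bytes_per_row = 1000                          # assumed value size per row
total_bytes = rows * bytes_per_row            # ≈ 300 GB moved per full scan

link_bytes_per_sec = 125_000_000              # one 1 GbE port ≈ 125 MB/s
regionservers = 3

# If every byte crossed the network once, spread evenly over 3 regionservers:
seconds = total_bytes / (regionservers * link_bytes_per_sec)
print(f"{total_bytes / 1e9:.0f} GB, ~{seconds / 60:.0f} min if network-bound")
```

Under those assumptions a single network-bound pass over the table would take on the order of 13 minutes, so sustained saturation of bytes_in/bytes_out during each test burst would be exactly the smoking gun the charts are looking for.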

Performance Evaluation of HBase

Update: See also the follow-up posts: part 2 and part 3. In the last post Lars talked about setting up Ganglia for monitoring our Hadoop and HBase installations. That was in preparation for giving HBase a solid testing run to assess its suitability for hosting our index of occurrence records. One of the important features in our new Data Portal will be the "Download" function that lets people download occurrences matching some search criteria; currently that process is a very manual and labour-intensive one, so automating it will be a big help to us. Using HBase it would be implemented as a full table scan, which is why I've spent some time testing our scan performance. Anyone who has been down this road will probably have encountered the myriad opinions on what will improve performance (some of them conflicting), along with the seemingly endless parameters that can be tuned in a given cluster. The overall result of that kind of research is: "Yo...