Showing posts from June, 2012

Launch of the Canadensys explorer

At Canadensys we had already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog. We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network. The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are also available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF). The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself instead. We hope it is intuitive. For the best user experienc...

Taxonomic Trees in PostgreSQL

Pro parte synonyms aside, taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by three models, known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example on Stack Overflow, the slides by Lorenzo Alberton or Bill Karwin, or a Postgres-specific performance comparison between the adjacency model and the nested set.

Checklist Bank

At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At its core is a single table, name_usage, whose records each represent a single taxon in the tree [note: in this post I use the term taxon broadly, covering both accepted taxa and synonyms]. It primarily uses the adjacency model, with a single foreign key parent_fk that is null for the root elements of the tree. The simplified diagram of the main tables looks like ...
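To make the adjacency model concrete, here is a minimal sketch of a name_usage table with a parent_fk column and a recursive subtree query. The column set and the sample taxa are made up for illustration; SQLite is used so the example is self-contained, but the same WITH RECURSIVE query runs unchanged on PostgreSQL.

```python
import sqlite3

# Minimal adjacency-list sketch. Table and column names (name_usage,
# parent_fk) follow the post; everything else is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE name_usage (
  id        INTEGER PRIMARY KEY,
  parent_fk INTEGER REFERENCES name_usage(id),  -- NULL for root taxa
  name      TEXT NOT NULL
);
INSERT INTO name_usage VALUES
  (1, NULL, 'Plantae'),
  (2, 1,    'Magnoliophyta'),
  (3, 2,    'Asteraceae'),
  (4, 3,    'Taraxacum officinale');
""")

# Fetch a taxon and all of its descendants with a recursive CTE,
# walking parent_fk links downward from the starting id.
rows = conn.execute("""
WITH RECURSIVE subtree AS (
  SELECT id, parent_fk, name FROM name_usage WHERE id = ?
  UNION ALL
  SELECT c.id, c.parent_fk, c.name
  FROM name_usage c JOIN subtree s ON c.parent_fk = s.id
)
SELECT name FROM subtree
""", (2,)).fetchall()

print([r[0] for r in rows])  # Magnoliophyta and its descendants
```

The appeal of the adjacency model is that inserts and moves touch a single row; the cost is that subtree reads need a recursive query like the one above, which is where the materialized path and nested set models trade the other way.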

Faster HBase - hardware matters

As I've written earlier, I've been spending some time evaluating the performance of HBase using PerformanceEvaluation. My earlier conclusions amounted to: bond your network ports and get more disks. So I'm happy to report that we got more disks, in the form of 6 new machines that together make up our new cluster:

Master (c1n4): HDFS NameNode, Hadoop JobTracker, HBase Master, and Zookeeper
Zookeeper (c1n1): Zookeeper for this cluster, master for our other cluster
Slaves (c4n1..c4n6): HDFS DataNode, Hadoop TaskTracker, HBase RegionServer (6 GB heap)

Hardware:
c1n*: 1x Intel Xeon X3363 @ 2.83GHz (quad-core), 8GB RAM, 2x 500GB SATA 5.4K
c4n*: Dell R720XD, 2x Intel Xeon E5-2640 @ 2.50GHz (6-core), 64GB RAM, 12x 1TB SATA 7.2K

Obviously the new machines come with faster everything and lots more RAM, so first I bonded two ethernet ports and then ran the tests again to see how much we had improved:

Figure 1: Scan performance of new cluster (2x 1gig ethernet)

So, 1 million records/seco...
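For readers who want to try the same port bonding, here is one common way it is configured on a Debian-style system with the ifenslave package; this is a sketch only, and the interface names, address, and bonding mode are assumptions, not the configuration actually used on our cluster (802.3ad link aggregation also requires matching switch support).

```
# /etc/network/interfaces fragment: bond eth0 and eth1 into bond0
auto bond0
iface bond0 inet static
    address 10.0.0.5          # hypothetical address
    netmask 255.255.255.0
    bond-slaves eth0 eth1     # the two physical ports being bonded
    bond-mode 802.3ad         # LACP; balance-rr is another option
    bond-miimon 100           # link monitoring interval in ms
```

With two bonded 1-gig ports the theoretical ceiling for a single host is roughly 2 Gbit/s, which is the "2x 1gig ethernet" configuration the scan figure refers to.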