Posts

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

If you are a regular user of the GBIF data portal at http://data.gbif.org , or keep an eye on the numbers given at http://www.gbif.org , you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be? The main reason is that software and processing upgrades have made it easier to spot duplicates and old, no longer published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:

- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that ar...

The GBIF Registry is now dataset-aware!

This post continues the series of posts highlighting the latest updates to the GBIF Registry. To recap: in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided background on the GBIF Network, explained how Network entities are now stored in a database instead of a UDDI system, and introduced the Registry's new web application and API. A month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase. Even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement, which happened in September 2012: the Registry is now dataset-aware! Being dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. In case the reader isn't aware, DiGIR and BioCASE are wrapper tools us...

IPT v2.0.4 released

Today the GBIF Secretariat announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website here. Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing patches. Add to that all the issue activity, and things have been busy. In total, 108 issues were addressed in this version: 38 defects, 35 enhancements, 7 other, 5 patches, 18 won't fix, 4 duplicates, and 1 considered invalid. These are detailed in the issue tracking system. So what exactly has changed, and why? Here's a quick rundown. One thing that kept coming up again and again in version 2.0.3 was that users were unwittingly installing the IPT in test mode, thinking they were running in production. After regist...

Getting started with DataCube on HBase

This tutorial blog provides a quick introduction to using DataCube, a Java-based OLAP cube library with a pluggable storage engine, open-sourced by Urban Airship. In this tutorial we make use of the built-in HBase storage engine. In a small database much of this would be trivial using aggregating functions (SUM(), COUNT() etc.). As the volume grows, one often precalculates these metrics, which brings its own set of consistency challenges. As one outgrows a database, as GBIF is doing, we need to look for new mechanisms to manage these metrics. The features of DataCube that make it attractive to us are:

- A manageable process for modifying the cube structure
- A higher-level API to develop against
- The ability to rebuild the cube with a single pass over the source data

For this tutorial we will consider the source data as classic Darwin Core occurrence records, where each record represents the metadata associated with a species observation event, e.g.: ID, Kingdom, ScientificName, Country, IsoC...
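To illustrate the cube idea in miniature (this is a hypothetical sketch of the concept, not the DataCube API itself): each incoming occurrence record increments a counter for every combination of dimensions we want to be able to query later, so reads never need to scan the raw records.

```python
from collections import defaultdict

def increment(cube, record):
    """Bump the count for each dimension bucket this record falls into."""
    kingdom = record["Kingdom"]
    country = record["IsoCountryCode"]
    cube[("kingdom", kingdom)] += 1
    cube[("country", country)] += 1
    cube[("kingdom_country", kingdom, country)] += 1

cube = defaultdict(int)
records = [
    {"Kingdom": "Animalia", "IsoCountryCode": "DK"},
    {"Kingdom": "Animalia", "IsoCountryCode": "DE"},
    {"Kingdom": "Plantae",  "IsoCountryCode": "DK"},
]
for r in records:
    increment(cube, r)

print(cube[("kingdom", "Animalia")])                # 2
print(cube[("kingdom_country", "Plantae", "DK")])   # 1
```

DataCube does essentially this at scale, with HBase as the counter store and batching/flushing to keep write throughput sane; the rebuild-with-one-pass property follows directly from the fact that the cube is just a deterministic fold over the source records.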

Optimizing Writes in HBase

I've written a few times about our work to improve the scanning performance of our cluster (parts 1, 2, and 3), since our highest priority for HBase is being able to serve requests for downloads of occurrence records (which require a full table scan). Now that scanning is working nicely, we need to start writing new records into our occurrence table, as well as cleaning raw data and interpreting it into something more useful for the users of our data portal. That processing is built as Hive queries that read from and write back to the same HBase table. While it worked fine on small test datasets, it all blew up once I moved the process to the full dataset. Here's what happened and how we fixed it. Note that we're using CDH3u3, with the addition of Hive 0.9.0, which we patched to support HBase 0.90.4. The problem: our processing consists of Hive queries that run as Hadoop MapReduce jobs. When the mappers were running they would eventually fail (repe...

Launch of the Canadensys explorer

At Canadensys we have already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog. We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network. The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF) as well. The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself. We hope it is intuitive. For the best user experienc...

Taxonomic Trees in PostgreSQL

Setting aside pro parte synonyms, taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by three models, known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example on Stack Overflow, the slides by Lorenzo Alberton or Bill Karwin, or a Postgres-specific performance comparison between the adjacency model and a nested set.

Checklist Bank

At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At the core there is a single table, name_usage, which contains records each representing a single taxon in the tree [note: in this post I am using the term taxon broadly, covering both accepted taxa and synonyms]. It primarily uses the adjacency model, with a single foreign key parent_fk which is null for the root elements of the tree. The simplified diagram of the main tables looks like ...
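The adjacency model described above is easy to picture with a toy example. The sketch below uses hypothetical rows from a simplified name_usage table (just id, parent_fk and a name; the real table has many more columns) and walks the tree depth-first, which is the kind of traversal the adjacency model makes you do one level at a time:

```python
from collections import defaultdict

# Hypothetical simplified name_usage rows: (id, parent_fk, name).
# parent_fk is None for root elements, mirroring the adjacency model.
rows = [
    (1, None, "Plantae"),
    (2, 1,    "Magnoliophyta"),
    (3, 2,    "Fagaceae"),
    (4, 2,    "Rosaceae"),
]

children = defaultdict(list)   # parent id -> list of child ids
names = {}
roots = []
for id_, parent_fk, name in rows:
    names[id_] = name
    if parent_fk is None:
        roots.append(id_)
    else:
        children[parent_fk].append(id_)

def walk(node, depth=0):
    """Depth-first traversal returning an indented outline of the tree."""
    lines = [("  " * depth) + names[node]]
    for child in children[node]:
        lines.extend(walk(child, depth + 1))
    return lines

for root in roots:
    print("\n".join(walk(root)))
```

In SQL terms, each level of this walk corresponds to a self-join (or one step of a recursive query) on parent_fk, which is exactly the trade-off the model comparisons linked above discuss: simple, consistent writes at the cost of multi-step reads for whole subtrees.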