Posts

Upgrading our cluster from CDH4 to CDH5

A little over a year ago we wrote about upgrading from CDH3 to CDH4, and now the time had come to upgrade from CDH4 to CDH5. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful.

The Cluster

Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager.

Upgrade CDH Manager

The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The Cloudera documentation is excellent, so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (a minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a 4th old machine to the "masters" group...

Multimedia in GBIF

We are happy to announce another long-awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly 700 thousand occurrences with multimedia indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As requested by many people, the media type is available as a new filter in the occurrence search and subsequently in downloads. For example, you can now easily find all audio recordings of birds.

[Image: UAM:Mamm:11470 - Eumetopias jubatus - skull]

If you follow the details page of any of those records, you can see that sound files show up as simple links to the media file. We do the same for video files and currently have no plans to embed a media player in our portal. This is different from images, which are shown in a dedicated gallery you might already have encountered on species pages. On the left you can s...
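Since the media type is exposed as a search filter, it can also be used through the occurrence search web service. As a rough sketch (the endpoint and parameter names here are assumptions based on the public GBIF v1 API, and the taxonKey value is purely illustrative, not taken from the post), a filtered query URL can be built like this:

```python
from urllib.parse import urlencode

# Assumed GBIF occurrence search endpoint; adjust if the API differs.
GBIF_OCCURRENCE_API = "https://api.gbif.org/v1/occurrence/search"

def media_search_url(media_type, **filters):
    """Build an occurrence-search URL filtered by media type
    (e.g. StillImage, MovingImage, Sound per the DC type vocabulary)."""
    params = {"mediaType": media_type, **filters}
    # Sort so the generated URL is stable regardless of keyword order.
    return f"{GBIF_OCCURRENCE_API}?{urlencode(sorted(params.items()))}"

# Hypothetical example: all sound recordings under some bird taxonKey.
url = media_search_url("Sound", taxonKey=212)
```

The same parameters carry over into download requests, which is what makes the filter usable for bulk exports as well as browsing.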

IPT v2.1 – Promoting the use of stable occurrenceIDs

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes:

- Stricter controls for the Darwin Core occurrenceID to improve stability of record-level identifiers network wide
- Ability to support Microsoft Excel spreadsheets natively
- Japanese translation thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan

With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16. The most significant new feature added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report. This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The chang...
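The identifier check described above is straightforward to sketch. The following is a minimal illustration (not the IPT's actual implementation): given a list of Darwin Core records, it finds rows with a missing occurrenceID and values that occur more than once, the two conditions that make publishing fail.

```python
from collections import Counter

def check_occurrence_ids(records):
    """Return (missing_rows, duplicate_ids) for a list of Darwin Core
    occurrence records, each represented as a dict keyed by term name."""
    ids = [str(r.get("occurrenceID", "") or "").strip() for r in records]
    # Row indexes whose occurrenceID is absent or blank.
    missing_rows = [i for i, v in enumerate(ids) if not v]
    # IDs used by more than one record.
    counts = Counter(v for v in ids if v)
    duplicate_ids = sorted(v for v, n in counts.items() if n > 1)
    return missing_rows, duplicate_ids

# A dataset like this would fail to publish: row 2 has no ID,
# and "urn:occ:1" appears twice.
rows = [{"occurrenceID": "urn:occ:1"},
        {"occurrenceID": "urn:occ:2"},
        {},
        {"occurrenceID": "urn:occ:1"}]
```

Logging the offending row numbers and values, as the IPT's publication report does, gives publishers enough to fix the source data before retrying.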

Lots of columns with Hive and HBase

We're in the process of rolling out a long-awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 terms from Darwin Core, because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like! Or so we thought.

Our occurrence download service gets a lot of use, and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads work as an Oozie workflow that executes a Hive query against an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase. Here's an example of how to write a Hive table definition for an HBase-backed tab...
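The excerpt cuts off before the table definition itself, but the general shape of an HBase-backed Hive table is worth sketching. The table name, column names and column family below are illustrative, not GBIF's actual schema:

```sql
-- A Hive table whose storage is an existing HBase table.
-- ":key" maps to the HBase row key; "o" is a hypothetical column family.
CREATE EXTERNAL TABLE occurrence_hbase (
  id INT,
  scientific_name STRING,
  kingdom STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,o:scientific_name,o:kingdom'
)
TBLPROPERTIES ('hbase.table.name' = 'occurrence');
```

From a table like this, a plain HDFS-backed copy can then be materialized (e.g. with a `CREATE TABLE ... AS SELECT`), which is what makes the full-table scans behind downloads fast.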

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on http://gbrds.gbif.org. Previously, when a dataset got registered in the GBRDS registry (for example using an IPT) it wasn't immediately visible in the portal; it took several weeks, until after rollover took place. In October, GBIF launched its new portal on www.gbif.org. During the launch we indicated that real-time data management would be starting up in November. We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry.

What does this mean for you?

- Any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately, because the portal and new registry are fully integrated
- The GBRDS web application (http://gbrds.gbif.org) is no longer visible, si...

GBIF Backbone in GitHub

For a long time I have wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder named after the canonical name, containing 2 files:

- README.md, a simple markdown file that gets rendered by GitHub and shows the basic attributes of a taxon
- data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API

The filesystem represents the taxonomic classification, and taxon folders are nested accordingly; for example, the species Amanita arctica is represented as:

This is just a first experimental s...

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

A whale named Physeter macrocephalus

[Image: Physeter catodon / Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)]

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over w...
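Since the post is about validating names against the forthcoming portal web service, a minimal sketch of how such a lookup might be requested is useful. The endpoint and parameter names below are assumptions modelled on the public GBIF v1 API, not taken from the post:

```python
from urllib.parse import urlencode

# Assumed endpoint for matching a name against the GBIF backbone;
# the path and parameters are illustrative.
SPECIES_MATCH_API = "https://api.gbif.org/v1/species/match"

def match_url(scientific_name, strict=False):
    """Build a request URL asking the backbone for the best match
    (exact, fuzzy, or higher-rank fallback) for a scientific name."""
    params = {"name": scientific_name, "strict": str(strict).lower()}
    return f"{SPECIES_MATCH_API}?{urlencode(params)}"

url = match_url("Physeter macrocephalus")
```

In non-strict mode a service like this can fall back to a fuzzy or higher-rank match, which is exactly the behaviour a data-cleaning pipeline wants when names have been misspelled or moved between genera.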