Posts

IPT v2.1 – Promoting the use of stable occurrenceIDs

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes:

- Stricter controls for the Darwin Core occurrenceID to improve the stability of record-level identifiers network wide
- Ability to support Microsoft Excel spreadsheets natively
- Japanese translation, thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan

With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16. The most significant new feature added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails and the problem records are logged in the publication report. This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The chang...
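The uniqueness check described above can be sketched like this (a minimal Python illustration, not the IPT's actual implementation; the record layout and function name are assumptions):

```python
from collections import Counter

def validate_occurrence_ids(records):
    """Check that every record has a non-empty, unique occurrenceID.

    Returns the row numbers of missing IDs and the values of duplicated
    IDs, mirroring what a publication report might log. `records` is a
    list of dicts keyed by Darwin Core terms.
    """
    ids = [r.get("occurrenceID", "").strip() for r in records]
    missing = [row for row, value in enumerate(ids, start=1) if not value]
    counts = Counter(value for value in ids if value)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"missing_rows": missing, "duplicates": duplicates}

records = [
    {"occurrenceID": "urn:catalog:A:1"},
    {"occurrenceID": ""},
    {"occurrenceID": "urn:catalog:A:1"},
]
report = validate_occurrence_ids(records)
# -> {'missing_rows': [2], 'duplicates': ['urn:catalog:A:1']}
```

With a report like this, publishing can be aborted and the offending rows surfaced to the publisher rather than silently shipping unstable identifiers.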

Lots of columns with Hive and HBase

We're in the process of rolling out a long-awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 terms from Darwin Core, because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like! Or so we thought. Our occurrence download service gets a lot of use, and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads run as an Oozie workflow that executes a Hive query against an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase. Here's an example of how to write a Hive table definition for an HBase-backed tab...
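The truncated example likely resembled the standard Hive-HBase table mapping, which looks roughly like this (the table name, column family and qualifiers below are invented for illustration; the storage handler and mapping syntax are Hive's documented HBase integration):

```sql
-- Hive table backed by an existing HBase table; the first Hive column
-- maps to the HBase row key via the special ":key" token.
CREATE EXTERNAL TABLE occurrence_hbase (
  id INT,
  scientific_name STRING,
  country_code STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,o:scientific_name,o:country_code"
)
TBLPROPERTIES ("hbase.table.name" = "occurrence");
```

From a table like this, the faster HDFS-backed copy can then be produced with an `INSERT OVERWRITE` or `CREATE TABLE ... AS SELECT` into a plain Hive table, which is what the download workflow scans.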

The new (real-time) GBIF Registry has gone live

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on http://gbrds.gbif.org. Previously, when a dataset was registered in the GBRDS registry (for example using an IPT), it wasn't visible in the portal until after rollover took place, often several weeks later. In October, GBIF launched its new portal on www.gbif.org. During the launch we indicated that real-time data management would be starting up in November. We are excited to inform you that today we made the first step towards making this a reality, by enabling the live operation of the new GBIF registry.

What does this mean for you?

- Any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately, because the portal and new registry are fully integrated
- The GBRDS web application (http://gbrds.gbif.org) is no longer visible, si...

GBIF Backbone in GitHub

For a long time I wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try?

A GitHub filesystem taxonomy

I decided to export each taxon in the backbone as a folder that is named according to the canonical name, containing 2 files:

- README.md, a simple markdown file that gets rendered by GitHub and shows the basic attributes of a taxon
- data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API

The filesystem represents the taxonomic classification, and taxon folders are nested accordingly; for example, the species Amanita arctica is represented as: This is just a first experimental s...
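An export along these lines can be sketched in a few lines of Python (a hedged illustration: the function name and the taxon fields shown are assumptions, not the exact GBIF species API payload):

```python
import json
import tempfile
from pathlib import Path

def export_taxon(root, classification, taxon):
    """Write one backbone taxon as a folder holding README.md and data.json.

    `classification` is the list of parent canonical names, so folders
    nest to mirror the taxonomic tree.
    """
    folder = Path(root).joinpath(*classification, taxon["canonicalName"])
    folder.mkdir(parents=True, exist_ok=True)
    # README.md renders on GitHub and shows the basic attributes
    readme = "# {canonicalName}\n\nRank: {rank}\n".format(**taxon)
    (folder / "README.md").write_text(readme)
    # data.json holds the full record for machine consumption
    (folder / "data.json").write_text(json.dumps(taxon, indent=2))
    return folder

root = tempfile.mkdtemp()
folder = export_taxon(
    root,
    ["Fungi", "Basidiomycota", "Agaricales", "Amanitaceae", "Amanita"],
    {"canonicalName": "Amanita arctica", "rank": "SPECIES"},
)
# folder ends in .../Amanita/Amanita arctica and contains both files
```

Because each taxon is just a nested folder of plain text and JSON, git can diff, version and merge changes to the tree like any other repository content.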

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

A whale named Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons)

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names correspond to our modern understanding of that species, such as the dispute over w...
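GBIF now exposes this kind of name matching at its species match endpoint; a minimal client sketch might look like the following (the endpoint path matches the current public API, which postdates this post, and the parameter choices are assumptions):

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually issue the request

# Current GBIF name-matching endpoint; the post predates this exact URL.
MATCH_API = "https://api.gbif.org/v1/species/match"

def match_url(name, **params):
    """Build a species-match request URL for one scientific name."""
    return MATCH_API + "?" + urlencode({"name": name, **params})

url = match_url("Physeter catodon", verbose="true")
# -> 'https://api.gbif.org/v1/species/match?name=Physeter+catodon&verbose=true'
```

Fetching that URL returns JSON describing the best backbone match, which is how a synonym like Physeter catodon can be resolved to its accepted name.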

IPT v2.0.5 Released - A melhor versão até o momento! ("The best version yet!")

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here. As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes. The most notable feature enhancements include:

- A resource can now be configured to publish automatically on an interval (see the "Automated Publishing" section in the User Manual)
- The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English
- An IPT can be configured to back up each DwC-Archive version published (see "Archival Mode" in the User Manual)
- Each resource version now has a resolvable URL (see the "Versioned Page" section in the User Manual)
- A filterable, pageable, and sortable resource overview table in v2.0.5
- The order of columns in published DwC-Archives is always the same between versio...

Migrating our hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

- general performance improvements in the entire Hadoop/HBase stack
- continued support from the community/user list (a non-trivial concern: anybody asking questions on the user groups and mailing lists about problems with older clusters is invariably asked to upgrade before people are interested in tackling the problem)
- multi-threaded compactions (the need for which we concluded in this post)
- table-based region balancing (rather than just cluster-wide)

We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into t...