Posts

Probably Turboveg's best-kept secret

Turboveg is one of the most widely used programs for managing vegetation data. Probably its best-kept secret is that it can export vegetation data in Darwin Core Archive (DwC-A) format, a standard format that enables quick and easy integration with other resources on GBIF.org. Turboveg v2 converts vegetation data into species occurrence data packaged as a DwC-A. Now, thanks to an eight-month collaboration between GBIF and Stephan Hennekens (Turboveg's developer), v3 will convert vegetation data into sampling event data packaged as a DwC-A, a much more faithful and useful representation of the data. [Figure: Screenshot of Turboveg v3 prototype] Turboveg is an easy-to-install and easy-to-use Windows program for storing, managing, visualizing and exporting vegetation data (relevés). A relevé is a list of the plants in a delimited plot of vegetation, with information on species cover and on substrate and other abiotic features in order to make as complete as...

Updating the GBIF Backbone

The taxonomy employed by GBIF for organising all occurrences into a consistent view has remained unchanged since 2013. We have been working on a replacement for some time and are pleased to introduce a preview in this post. The work is rather complex and tries to establish an automated process to build a new backbone, which we aim to run on a regular, probably quarterly, basis. We would like to release the new taxonomy rather soon and improve the backbone iteratively. Large regressions should be avoided initially, but it is quite hard to evaluate all the changes between two large taxonomies with 4 to 5 million names each. We are therefore seeking feedback and help to discover oddities of the new backbone. Relevance & Challenges Every occurrence record in GBIF is matched to a taxon in the backbone. Because occurrence records in GBIF cover the whole tree of life and names may come from all possible, often outdated, taxonomies, it is important to...

Reprojecting coordinates according to their geodetic datum

For a long time Darwin Core has had a term to declare the exact geodetic datum used for a given coordinate. Quite a few data publishers in GBIF have used dwc:geodeticDatum for some time to publish the datum of their location coordinates. Until now GBIF has treated all coordinates as if they were in WGS84, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly, locations given in a different datum, for example NAD27 or AGD66, were displaced a little on GBIF maps. This so-called “datum shift” is not dramatic, but can be up to a few hundred metres depending on the location and datum. The University of Colorado has a nice visualization of the impact. At GBIF we interpret the geodeticDatum and reproject all coordinates as well as we can into the single datum WGS84. To do this there are two main steps: parse and interpret the given verbatim geodetic datum and then do the actual transformation based on the known g...
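The first of the two steps, interpreting a verbatim datum string, could look roughly like the following. This is a minimal sketch and not GBIF's actual parser: the function name `parse_datum` and the small lookup table are assumptions made for illustration, and a real implementation would need to recognise far more spellings. The EPSG codes are the standard geographic CRS codes for the datums named above.

```python
import re

# Hypothetical lookup table: normalised datum label -> EPSG code of the
# corresponding geographic coordinate reference system.
KNOWN_DATUMS = {
    "WGS84": 4326,
    "NAD27": 4267,
    "NAD83": 4269,
    "AGD66": 4202,
}


def parse_datum(verbatim):
    """Return an EPSG code for a verbatim dwc:geodeticDatum value, or None."""
    if not verbatim:
        return None
    text = verbatim.strip().upper()
    # Accept explicit codes such as "EPSG:4267".
    match = re.fullmatch(r"EPSG:\s*(\d+)", text)
    if match:
        return int(match.group(1))
    # Drop spaces, hyphens and underscores so that "WGS 84" and "wgs-84"
    # both normalise to "WGS84".
    key = re.sub(r"[\s_\-]+", "", text)
    return KNOWN_DATUMS.get(key)
```

A value that cannot be interpreted returns `None`, which leaves the caller free to fall back to an assumed WGS84 and flag the record. The second step, the coordinate transformation itself, would typically be delegated to a projection library.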

Simplified Downloads

Since its re-launch in 2013 gbif.org has supported the downloading of occurrence data using an arbitrary query, with the download file provided as a Darwin Core Archive file whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, in cases where people only need the occurrence data in its simplest form, the DwC-A format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format: a zip file that contains only a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format: From this p...
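As an illustration of how little code the simple format needs, here is a sketch that reads a zip containing one TAB-separated file using only the Python standard library. The file name `occurrence.csv`, the three column names, the UTF-8 encoding and the example row are assumptions made up for the demonstration, not a description of the real download layout.

```python
import csv
import io
import zipfile


def read_simple_download(zip_bytes):
    """Read the single TAB-separated file in a zip and return a list of row dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError("expected exactly one file in the archive")
        with archive.open(names[0]) as raw:
            # Assumes UTF-8 text; quoting is disabled because columns are
            # separated purely by TAB characters.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(reader)


def build_example_zip():
    """Build a tiny in-memory archive with made-up data for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "occurrence.csv",
            "gbifID\tscientificName\tcountryCode\n1\tPuma concolor\tUS\n",
        )
    return buffer.getvalue()


rows = read_simple_download(build_example_zip())
```

Each row comes back as a dictionary keyed by the header line, so the same function works whichever columns the file actually contains.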

Don't fill your HDFS disks (upgrading to CDH 5.4.2)

Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen. The Machine Configuration The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different than our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes where each one has 13 disks dedicated to HDFS storage: 12 are 1TB and one is 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks. Pre Upgrade Status We...

Improving the GBIF Backbone matching

In GBIF, occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and create consistent metrics and searches according to a single classification and synonymy. Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone. In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment. 1. Name parsing of undetermined species In occurrences we see many names with a partly undetermined name such as Lucanus spec. Erroneously these rank markers have been treated as real s...
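The idea of recognising rank markers in undetermined names can be sketched as below. This is only an illustration of the general approach, not the GBIF name parser: the function `split_undetermined` and the marker set `INDET_MARKERS` are assumptions, and real names need much more careful handling.

```python
# Assumed set of markers that signal an undetermined species, compared
# case-insensitively and without a trailing dot.
INDET_MARKERS = {"sp", "spec", "spp", "indet"}


def split_undetermined(name):
    """Return (determined part of the name, True if a rank marker was found)."""
    tokens = name.split()
    for i, token in enumerate(tokens):
        # Skip the first token (the genus) and stop at the first marker,
        # so the marker is never treated as a species epithet.
        if i > 0 and token.rstrip(".").lower() in INDET_MARKERS:
            return " ".join(tokens[:i]), True
    return " ".join(tokens), False
```

With this sketch, `Lucanus spec.` yields the genus `Lucanus` plus a flag, so a matcher could look up the genus only instead of fuzzily matching `spec` against species names.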

IPT v2.2 – Making data citable through DataCite

GBIF is pleased to release IPT 2.2, now capable of automatically connecting with either DataCite or EZID to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use. DataCite integration explained DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals (1): establish easier access to research data on the Internet; increase acceptance of research data as citable contributions to the scholarly record; and support research data archiving to permit results to be verified and re-purposed for future study. EZID is hosted by the California Digital Library (a founding member of DataCite) and adds services on top of the DataCite DOI infrastructure such as their own easy-to-use programming interface. To integrate with DataCite and further these three goals for biodiversity ...