Posts

Simplified Downloads

Since its re-launch in 2013, gbif.org has supported downloading occurrence data using an arbitrary query, with the download provided as a Darwin Core Archive (DwC-A) file whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable to be referenced in external resources. However, when people only need the occurrence data in its simplest form, the DwC-A format adds complexity that can make the data hard to use. Because of that, we now support a new download format: a zip file that contains a single file with the most commonly used fields/terms, where columns are separated by the TAB character. This makes it much easier to import the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format: From this p...
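As a rough illustration of why the simple format is easy to consume, the sketch below reads a zip that holds one TAB-separated file using only the Python standard library. The file name, the column names and the UTF-8 encoding are assumptions made for the example, not details confirmed by the post.

```python
import csv
import io
import zipfile


def make_sample_zip():
    """Build a small in-memory zip that mimics the simple download layout.

    The file name and columns are hypothetical placeholders.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "occurrence.csv",
            "gbifID\tspecies\tcountryCode\n"
            "1\tPuma concolor\tUS\n"
            "2\tLucanus cervus\tDK\n",
        )
    return buffer.getvalue()


def read_simple_download(zip_bytes):
    """Return the rows of the single TAB-separated file in the zip as dicts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError("expected exactly one file in the archive")
        with archive.open(names[0]) as raw:
            # Assumption: the file is UTF-8 encoded and starts with a header row.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # QUOTE_NONE keeps quote characters in field values untouched.
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(reader)


rows = read_simple_download(make_sample_zip())
for row in rows:
    print(row["gbifID"], row["species"], row["countryCode"])
```

For a real download, the bytes would come from the downloaded file on disk instead of `make_sample_zip()`.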

Don't fill your HDFS disks (upgrading to CDH 5.4.2)

Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen. The Machine Configuration The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different from our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes, each with 13 disks dedicated to HDFS storage: 12 are 1 TB and one is 512 GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks. Pre Upgrade Status We...

Improving the GBIF Backbone matching

In GBIF, occurrence records are matched to a taxon in a backbone taxonomy using the species match API. This is important to reduce spelling variations and to create consistent metrics and searches according to a single classification and synonymy. Over the past years we have been alerted to various bad matches. Most of the reported issues refer to a false fuzzy match for a name missing in our backbone. In order to improve the taxonomic classification of occurrence records, we are undertaking two activities. The first is to improve the algorithms we use to fuzzily match names, and the second will be to improve the algorithms used to assemble the backbone taxonomy itself. Here I explain some of the work currently underway to tackle the former, which is visible on the test environment. 1. Name parsing of undetermined species In occurrences we see many names with a partly undetermined name such as Lucanus spec. Erroneously these rank markers have been treated as real s...

IPT v2.2 – Making data citable through DataCite

GBIF is pleased to release IPT 2.2, now capable of automatically connecting with either DataCite or EZID to assign DOIs to datasets. This new feature makes biodiversity data easier to access on the Web and facilitates tracking its re-use. DataCite integration explained DataCite specialises in assigning DOIs to datasets. It was established in 2009 with three fundamental goals (1): establish easier access to research data on the Internet; increase acceptance of research data as citable contributions to the scholarly record; and support research data archiving to permit results to be verified and re-purposed for future study. EZID is hosted by the California Digital Library (a founding member of DataCite) and adds services on top of the DataCite DOI infrastructure such as their own easy-to-use programming interface. To integrate with DataCite and further these three goals for biodiversity ...

Upgrading our cluster from CDH4 to CDH5

A little over a year ago we wrote about upgrading from CDH3 to CDH4 and now the time had come to upgrade from CDH4 to CDH5. The short version: upgrading the cluster itself was easy, but getting our applications to work with the new classpaths, especially MapReduce v2 (YARN), was painful. The Cluster Our cluster has grown since the last upgrade (now 12 slaves and 3 masters), and we no longer had the luxury of splitting the machines to build a new cluster from scratch. So this was an in-place upgrade, using CDH Manager. Upgrade CDH Manager The first step was upgrading to CDH Manager 5.2 (from our existing 4.8). The Cloudera documentation is excellent so I don't need to repeat it here. What we did find was that the management service now requests significantly more RAM for its monitoring services (minimum "happy" config of 14GB), to the point where our existing masters were overwhelmed. As a stopgap we've added a 4th old machine to the "masters" group...

Multimedia in GBIF

We are happy to announce another long-awaited improvement to the GBIF portal. Our portal test environment now shows multimedia items and their metadata associated with occurrences. As of today we have nearly 700 thousand occurrences with multimedia indexed. Based on the Dublin Core type vocabulary we distinguish between images, videos and sound files. As many people have requested, the media type is available as a new filter in the occurrence search and subsequently in downloads. For example, you can now easily find all audio recordings of birds. [Image: UAM:Mamm:11470 - Eumetopias jubatus - skull] If you follow the link to the details page of any of those records, you can see that sound files show up as simple links to the media file. We do the same for video files and currently do not have plans to embed any media player in our portal. Images are handled differently: they are shown in a dedicated gallery that you might already have encountered on species pages. On the left you can s...

IPT v2.1 – Promoting the use of stable occurrenceIDs

GBIF is pleased to announce the release of the IPT 2.1 with the following key changes: stricter controls for the Darwin Core occurrenceID to improve stability of record-level identifiers network-wide; the ability to support Microsoft Excel spreadsheets natively; and a Japanese translation, thanks to Dr. Yukiko Yamazaki from the National Institute of Genetics (NIG) in Japan. With this update, GBIF continues to refine and improve the IPT based on feedback from users, while carrying out a number of deliverables that are part of the GBIF Work Programme for 2014-16. The most significant new feature that has been added in this release is the ability to validate that each record within an Occurrence dataset has a unique identifier. If any missing or duplicate identifiers are found, publishing fails, and the problem records are logged in the publication report. This new feature will support data publishers who use the Darwin Core term occurrenceID to uniquely identify their occurrence records. The chang...
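The uniqueness check described in this excerpt can be sketched in a few lines of Python. This is only an illustration of the idea (flag missing and duplicate identifiers), not the IPT's actual implementation; the function name, the record layout and the sample values are hypothetical.

```python
from collections import Counter


def check_occurrence_ids(records, id_field="occurrenceID"):
    """Return (missing, duplicates) for a list of record dicts.

    missing:    indexes of records whose identifier is absent or blank
    duplicates: sorted identifier values that occur more than once
    """
    missing = []
    counts = Counter()
    for index, record in enumerate(records):
        value = (record.get(id_field) or "").strip()
        if not value:
            missing.append(index)
        else:
            counts[value] += 1
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical sample records, for illustration only.
sample = [
    {"occurrenceID": "urn:catalog:A:1"},
    {"occurrenceID": "urn:catalog:A:1"},
    {"occurrenceID": ""},
    {"occurrenceID": "urn:catalog:A:2"},
    {},
]
missing, duplicates = check_occurrence_ids(sample)
if missing or duplicates:
    print("publishing would fail:", missing, duplicates)
```

A publishing step built on such a check would stop when either list is non-empty and write the offending records to its report.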