faulker045

Posts

Lots of columns with Hive and HBase

March 04, 2014

We're in the process of rolling out a long-awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like! Or so we thought. Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. Our downloads run as an Oozie workflow that executes a Hive query against an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the query's scan - using an HBase-backed Hive table takes something like 4-5x as long. But to generate that HDFS table we need to start from a Hive table that _is_ backed by HBase. Here's an example of how to write a Hive table definition for an HBase-backed tab...

The new (real-time) GBIF Registry has gone live

October 28, 2013

For the last 4 years, GBIF has operated the GBRDS registry with its own web application on http://gbrds.gbif.org. Previously, when a dataset was registered in the GBRDS registry (for example using an IPT), it did not become visible in the portal until after rollover took place, which could take several weeks. In October, GBIF launched its new portal on www.gbif.org. During the launch we indicated that real-time data management would be starting up in November. We are excited to inform you that today we took the first step towards making this a reality by enabling the live operation of the new GBIF registry. What does this mean for you? Any dataset registered through GBIF (using an IPT, web services, or manually by liaison with the Secretariat) will be visible in the portal immediately, because the portal and new registry are fully integrated; the GBRDS web application (http://gbrds.gbif.org) is no longer visible, si...

GBIF Backbone in GitHub

October 24, 2013

For a long time I wanted to experiment with using GitHub as a tool to browse and manage the GBIF backbone taxonomy. Encouraged by similar sentiments from Rod Page, I thought it would be nice to use git to keep track of versions and allow external parties to fork parts of the taxonomic tree and push back changes if desired. To top it off there is the great GitHub Treeslider to browse the taxonomy, so why not give it a try? A GitHub filesystem taxonomy: I decided to export each taxon in the backbone as a folder named according to its canonical name, containing 2 files: README.md, a simple Markdown file that gets rendered by GitHub and shows the basic attributes of a taxon; and data.json, a complete JSON representation of the taxon as it is exposed via the new GBIF species API. The filesystem represents the taxonomic classification and taxon folders are nested accordingly, for example the species Amanita arctica is represented as: This is just a first experimental s...

Validating scientific names with the forthcoming GBIF Portal web service API

July 22, 2013

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously. A whale named Physeter macrocephalus Physeter catodon Physeter macrocephalus (photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons). Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over w...

IPT v2.0.5 Released - A melhor versão até o momento! (The best version yet!)

May 22, 2013

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here. As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes. The most notable feature enhancements include: A resource can now be configured to publish automatically on an interval (see "Automated Publishing" section in User Manual). The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and of course English. An IPT can be configured to back up each DwC-Archive version published (see "Archival Mode" in User Manual). Each resource version now has a resolvable URL (see "Versioned Page" section in User Manual). Filterable, pageable, and sortable resource overview table in v2.0.5. The order of columns in published DwC-Archives is always the same between versio...

Migrating our hadoop cluster from CDH3 to CDH4

May 14, 2013

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our Hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons: general performance improvements in the entire Hadoop/HBase stack; continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters is invariably asked to update before people are interested in tackling the problem); multi-threaded compactions (the need for which we established in this post); and table-based region balancing (rather than just cluster-wide). We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into t...

Data cleaning: Using MySQL to identify XML breaking characters

February 08, 2013

Sometimes publishers have problems with data resources that contain control characters, which break the XML response if they are included. Identifying and removing these characters can be a daunting task, especially if the dataset contains thousands of records. Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, so it may enter the database table without having been through a verification or cleaning process. The locality string can be copy/pasted from a file into the locality column, or the data itself can be mass loaded from a file, or it can be bulk inserted – each of these methods carries a risk that unintended characters enter the table. Even if you have time and are meticulous, you could miss certain control characters because they are invisible...
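
To make the problem concrete, here is a minimal sketch of spotting such characters. It is written in Python rather than MySQL, so it is not the method the post goes on to describe; the sample rows and the function names are made up for illustration. It only looks for the C0 control characters that XML 1.0 does not allow (everything below U+0020 except tab, line feed and carriage return) and ignores other illegal code points such as unpaired surrogates.

```python
import re

# C0 control characters that XML 1.0 forbids: everything below U+0020
# except tab (U+0009), line feed (U+000A) and carriage return (U+000D).
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_breaking_chars(text):
    """Return (position, code point) pairs for each forbidden control character."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in XML_ILLEGAL.finditer(text)]


def strip_breaking_chars(text):
    """Return the text with the forbidden control characters removed."""
    return XML_ILLEGAL.sub("", text)


# Hypothetical locality values keyed by a made-up record id.
rows = {
    1: "5 km N of Copenhagen",
    2: "Near the\x0b old bridge",   # contains an invisible vertical tab
    3: "Rio\x00 Negro\ttrail",      # contains a NUL; the tab is legal in XML
}

# Report only the records that would break an XML response.
flagged = {rid: find_breaking_chars(value)
           for rid, value in rows.items()
           if XML_ILLEGAL.search(value)}

for rid, hits in flagged.items():
    print(rid, hits, repr(strip_breaking_chars(rows[rid])))
```

Reporting the position and code point, rather than silently stripping, lets a publisher check whether the character replaced something meaningful (for example a line break) before cleaning the record.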