Posts

Validating scientific names with the forthcoming GBIF Portal web service API

This guest post was written by Gaurav Vaidya, Victoria Tersigni and Robert Guralnick, and is being cross-posted to the VertNet Blog. David Bloom and John Wieczorek read through drafts of this post and improved it tremendously.

[Figure: a whale named Physeter macrocephalus (or is it Physeter catodon?); photograph by Gabriel Barathieu, reused under CC-BY-SA from the Wikimedia Commons]

Validating scientific names is one of the hardest parts of cleaning up a biodiversity dataset: as taxonomists' understanding of species boundaries changes, the names attached to them can be synonymized, moved between genera or even have their Latin grammar corrected (it's Porphyrio martinicus, not Porphyrio martinica). Different taxonomists may disagree on what to call a species, whether a particular set of populations makes up a species, subspecies or species complex, or even which of several published names corresponds to our modern understanding of that species, such as the dispute over w...
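The kind of lookup described above can be sketched against GBIF's species-match web service as it eventually shipped (the `/v1/species/match` endpoint on `api.gbif.org`). A minimal, offline sketch: the URL construction follows that endpoint's `name` parameter, and the sample response below is illustrative of the response shape (`matchType`, `synonym`, `species`, `scientificName`), not real service output.

```python
from urllib.parse import urlencode, urlunsplit

API_HOST = "api.gbif.org"  # assumption: the web service host for the portal API


def match_url(name):
    """Build a fuzzy-match request URL for a scientific name."""
    query = urlencode({"name": name, "verbose": "true"})
    return urlunsplit(("https", API_HOST, "/v1/species/match", query, ""))


def interpret(response):
    """Summarise a match response: was the name found, and is it a synonym?"""
    if response.get("matchType") == "NONE":
        return "no match"
    if response.get("synonym"):
        # For synonyms the service also reports the accepted name.
        return "synonym of %s" % response.get("species", "an accepted name")
    return "accepted as %s" % response.get("scientificName", "?")


# Illustrative response fragment (shape only, not real service output):
sample = {
    "matchType": "EXACT",
    "synonym": True,
    "species": "Physeter macrocephalus",
    "scientificName": "Physeter catodon Linnaeus, 1758",
}

print(match_url("Physeter catodon"))
print(interpret(sample))  # → synonym of Physeter macrocephalus
```

In a real validation pass you would issue an HTTP GET against the built URL for each distinct name in the dataset and flag anything that comes back as `NONE` or as a synonym for review.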

IPT v2.0.5 Released - The best version yet!

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here. As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes. The most notable feature enhancements include:

- A resource can now be configured to publish automatically on an interval (see the "Automated Publishing" section in the User Manual)
- The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and, of course, English
- An IPT can be configured to back up each DwC-Archive version published (see "Archival Mode" in the User Manual)
- Each resource version now has a resolvable URL (see the "Versioned Page" section in the User Manual)

[Figure: the filterable, pageable and sortable resource overview table in v2.0.5]

- The order of columns in published DwC-Archives is always the same between versio...

Migrating our hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our Hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

- general performance improvements in the entire Hadoop/HBase stack
- continued support from the community/user list (a non-trivial concern: anybody asking on the user groups and mailing lists about problems with an older cluster is invariably asked to upgrade before people are interested in tackling the problem)
- multi-threaded compactions (the need for which we concluded in this post)
- table-based region balancing (rather than just cluster-wide)

We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into t...

Data cleaning: Using MySQL to identify XML breaking characters

Sometimes publishers have problems with data resources that contain control characters that will break the XML response if they are included. Identifying and removing these characters can be a daunting task, especially if the dataset contains thousands of records. Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, thereby entering the database table without necessarily having been through a verification or cleaning process. The locality string can be copy/pasted from a file into the locality column, or the data can be mass-loaded from a file (e.g. with LOAD DATA INFILE) or bulk inserted; each of these methods carries the risk that unintended characters enter the table. Even if you have time and are meticulous, you could miss certain control characters because they are invisible...
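For context on what "XML breaking" means here: XML 1.0 only permits the characters #x9, #xA, #xD, #x20–#xD7FF, #xE000–#xFFFD and #x10000–#x10FFFF, so any other control character in a locality string will invalidate the serialised response. A minimal Python sketch of spotting and stripping such characters (not the MySQL approach the post itself describes):

```python
def invalid_xml_chars(text):
    """Return (index, codepoint) pairs for characters XML 1.0 forbids."""
    def allowed(ch):
        cp = ord(ch)
        # The Char production from the XML 1.0 specification.
        return cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF \
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not allowed(ch)]


def strip_invalid_xml_chars(text):
    """Drop the forbidden characters so the field can be serialised safely."""
    # An empty list from invalid_xml_chars(ch) means the character is allowed.
    return "".join(ch for ch in text if not invalid_xml_chars(ch))


# Two invisible control characters (NUL and backspace) hiding in a locality:
locality = "5 km N of Uppsala\x00\x08"
print(invalid_xml_chars(locality))        # → [(17, '0x0'), (18, '0x8')]
print(strip_invalid_xml_chars(locality))  # → 5 km N of Uppsala
```

The same allowed-ranges test can be translated into a SQL WHERE clause or a one-off cleanup script run against an export of the table.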

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

If you are a regular user of the GBIF data portal at http://data.gbif.org, or keep an eye on the numbers given at http://www.gbif.org, you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be? The main reason is that software and processing upgrades have made it easier to spot duplicates and old, no-longer-published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:

- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that ar...

The GBIF Registry is now dataset-aware!

This post continues the series of posts highlighting the latest updates to the GBIF Registry. To recap: in April 2011, Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided background on the GBIF Network, explained how Network entities are now stored in a database instead of a UDDI system, and introduced its new web application and API. A month later, Jose wrote another post, 2011 GBIF Registry Refactoring, that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase. Now, even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement, which happened in September 2012: the Registry is now dataset-aware! Being dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. Just in case the reader isn't aware, DiGIR and BioCASE are wrapper tools us...
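In practical terms, dataset-awareness means a client can enumerate the individual datasets behind an endpoint rather than seeing one opaque installation. A hedged sketch of consuming such a listing: the `offset`/`limit`/`endOfRecords` paging shape mirrors the convention GBIF's web services use, but the fetch function and dataset titles here are stand-ins, not real Registry calls or data.

```python
def collect_datasets(fetch_page, limit=2):
    """Collect dataset titles from a paged, registry-style listing.

    fetch_page(offset, limit) stands in for an HTTP GET against the
    Registry; it must return a dict with 'results' and 'endOfRecords'.
    """
    titles, offset = [], 0
    while True:
        page = fetch_page(offset, limit)
        titles.extend(d["title"] for d in page["results"])
        if page["endOfRecords"]:
            return titles
        offset += limit


# Fake pages imitating the paging convention (illustrative titles only).
PAGES = [
    {"results": [{"title": "Herbarium A"}, {"title": "Bird observations B"}],
     "endOfRecords": False},
    {"results": [{"title": "Fish collection C"}],
     "endOfRecords": True},
]


def fake_fetch(offset, limit):
    return PAGES[offset // limit]


print(collect_datasets(fake_fetch))
# → ['Herbarium A', 'Bird observations B', 'Fish collection C']
```

Swapping `fake_fetch` for a real HTTP call against the Registry's dataset listing turns this into a complete crawler over everything behind a DiGIR or BioCASE installation.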