Posts

IPT v2.0.5 Released - The best version yet!

The GBIF Secretariat is proud to release version 2.0.5 of the Integrated Publishing Toolkit (IPT), available for download on the project website here. As with every release, it's your chance to take advantage of the most requested feature enhancements and bug fixes. The most notable feature enhancements include:

- A resource can now be configured to publish automatically on an interval (see the "Automated Publishing" section in the User Manual)
- The interface has been translated into Portuguese, making the IPT available in five languages: French, Spanish, Traditional Chinese, Portuguese and, of course, English
- An IPT can be configured to back up each DwC-Archive version published (see "Archival Mode" in the User Manual)
- Each resource version now has a resolvable URL (see the "Versioned Page" section in the User Manual)
- A filterable, pageable, and sortable resource overview table in v2.0.5
- The order of columns in published DwC-Archives is always the same between versio...

Migrating our Hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our Hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

- general performance improvements in the entire Hadoop/HBase stack
- continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing lists about problems with an older cluster is invariably asked to update before people are interested in tackling the problem)
- multi-threaded compactions (the need for which we concluded in this post)
- table-based region balancing (rather than just cluster-wide)

We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into t...

Data cleaning: Using MySQL to identify XML breaking characters

Sometimes publishers have problems with data resources that contain control characters that will break the XML response if they are included. Identifying and removing these characters can be a daunting task, especially if the dataset contains thousands of records. Publishers that share datasets through the DiGIR and TAPIR protocols are especially vulnerable to text fields that contain polluted data. Information about locality (http://rs.tdwg.org/dwc/terms/index.htm#locality) is often quite rich and can be copied from diverse sources, thereby entering the database table possibly without having been through any verification or cleaning process. The locality string can be copied and pasted from a file into the locality column, the data can be mass-loaded from a file, or it can be bulk inserted - each of these methods carries the risk that unintended characters enter the table. Even if you have time and are meticulous, you could miss certain control characters because they are invisible...
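The post goes on to use MySQL queries for this; purely as a sketch of the underlying idea (and not taken from the post), here is a short Python snippet that flags the C0 control characters the XML 1.0 specification forbids (everything below 0x20 except tab, newline and carriage return). The locality string and function names are invented for illustration.

```python
import re

# Control characters that are invalid in XML 1.0 documents.
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are allowed.
XML_INVALID = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def find_invalid(value):
    """Return (position, codepoint) pairs for XML-breaking characters."""
    return [(m.start(), ord(m.group())) for m in XML_INVALID.finditer(value)]

def clean(value):
    """Replace XML-breaking characters with a space."""
    return XML_INVALID.sub(" ", value)

# A locality string polluted with an invisible vertical tab (0x0B):
locality = "5 km N of Copenhagen\x0bDenmark"
print(find_invalid(locality))  # -> [(20, 11)]
print(clean(locality))         # -> "5 km N of Copenhagen Denmark"
```

The same character ranges can be expressed in a MySQL `REGEXP` or `REPLACE()` call once you know which codepoints to hunt for, which is what the approach described here boils down to.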

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"

If you are a regular user of the GBIF data portal at http://data.gbif.org, or keep an eye on the numbers given at http://www.gbif.org, you may have noticed that the number of indexed records took a dip, from well over 389m records to a little more than 383m. Why would that be? The main reason for this is that software and processing upgrades have made it easier to spot duplicates and old, no-longer-published versions of records and datasets. Since the previous version of the data index, some major removal of such records has taken place:

- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archive, and in the process identified and removed duplicate records in the published source data. As an additional effect, the use of Darwin Core Archives in publishing allows the indexing process to automatically remove records from the index that ar...

The GBIF Registry is now dataset-aware!

This post continues the series of posts that highlight the latest updates to the GBIF Registry. To recap, in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided a background to the GBIF Network, explained how Network entities are now stored in a database instead of a UDDI system, and how it has a new web application and API. Then a month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase. Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement, which happened in September 2012: the Registry is now dataset-aware! Being dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. Just in case the reader isn't aware, DiGIR and BioCASE are wrapper tools us...

IPT v2.0.4 released

Today the GBIF Secretariat announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website here. Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing some patches. Add to that all the issue activity, and things have been busy. In total, 108 issues were addressed in this version: 38 defects, 35 enhancements, 7 other, 5 patches, 18 won't-fix, 4 duplicates, and 1 considered invalid. These are detailed in the issue tracking system. So what exactly has changed, and why? Here's a quick rundown. One thing that kept coming up again and again in version 2.0.3 was that users were unwittingly installing the IPT in test mode, thinking that they were running in production. After regist...

Getting started with DataCube on HBase

This tutorial blog provides a quick introduction to using DataCube, a Java-based OLAP cube library with a pluggable storage engine, open sourced by Urban Airship. In this tutorial, we make use of the built-in HBase storage engine. In a small database, much of this would be trivial using aggregating functions (SUM(), COUNT(), etc.). As the volume grows, one often precalculates these metrics, which brings its own set of consistency challenges. As one outgrows a database, as GBIF has, we need to look for new mechanisms to manage these metrics. The features of DataCube that make it attractive to us are:

- a manageable process to modify the cube structure
- a higher-level API to develop against
- the ability to rebuild the cube with a single pass over the source data

For this tutorial we will consider the source data as classical Darwin Core occurrence records, where each record represents the metadata associated with a species observation event, e.g.: ID, Kingdom, ScientificName, Country, IsoC...
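DataCube itself is a Java library, and the tutorial goes on to use its real API; purely to illustrate the rollup idea behind an OLAP cube (not the DataCube API), here is a minimal Python sketch that counts occurrence records at every combination of the chosen dimensions, so that totals, per-kingdom counts, and per-kingdom-per-country counts all come from one pass over the data. The records and dimension names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def rollup_keys(record, dims):
    """Yield one cube key per rollup: every subset of the dimensions,
    including the empty subset (the grand total)."""
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            yield tuple((d, record[d]) for d in subset)

def build_cube(records, dims):
    """Single pass over the source data, incrementing every rollup."""
    cube = Counter()
    for rec in records:
        for key in rollup_keys(rec, dims):
            cube[key] += 1
    return cube

records = [
    {"Kingdom": "Animalia", "IsoCountryCode": "DK"},
    {"Kingdom": "Animalia", "IsoCountryCode": "US"},
    {"Kingdom": "Plantae",  "IsoCountryCode": "DK"},
]
cube = build_cube(records, ["Kingdom", "IsoCountryCode"])
print(cube[()])                          # grand total -> 3
print(cube[(("Kingdom", "Animalia"),)])  # Animalia records -> 2
print(cube[(("Kingdom", "Animalia"), ("IsoCountryCode", "DK"))])  # -> 1
```

Rebuilding the cube is then just rerunning `build_cube` over the source records, which is the "single pass" property listed above; DataCube provides the same idea with declared dimensions and rollups backed by HBase increments.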