Posts

Decoupling components

Recent blog posts have introduced some of the registry and portal processing work under development at GBIF. Here I'd like to introduce some of the research underway to improve the overall processing workflows by identifying well-defined components and decoupling unnecessary dependencies. The goal is to improve the robustness, reliability and throughput of the data indexing performed for the portal. Key to the GBIF portal is the crawling, processing and indexing of the content shared through the GBIF network, which is currently performed by the Harvesting and Indexing Toolkit (HIT). Today the HIT operates largely as follows:

1. Synchronise with the registry to discover the technical endpoints
2. Allow the administrator to schedule the harvest and processing of an endpoint, as follows:
   - Initiate a metadata request to discover the datasets at the endpoint
   - For each resource, initiate a request for the inventory of distinct scientific names
   - Process ...
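The crawl workflow described above can be sketched roughly as follows. This is only an illustrative sketch: the function and field names here are hypothetical placeholders, not the actual HIT or registry API.

```python
# Hypothetical sketch of the HIT crawl workflow; Endpoint/dataset
# structures and function names are illustrative, not the real API.

def synchronise_with_registry(registry):
    """Step 1: discover the technical endpoints."""
    return registry["endpoints"]

def harvest_endpoint(endpoint):
    """Steps 2a-2b: metadata request, then a per-resource inventory
    of distinct scientific names."""
    results = []
    for dataset in endpoint["datasets"]:       # stands in for a metadata request
        names = sorted(set(dataset["names"]))  # inventory of distinct names
        results.append((dataset["title"], names))
    return results

# Toy registry standing in for real web-service responses
registry = {"endpoints": [{"datasets": [
    {"title": "Herbarium A",
     "names": ["Abies alba", "Abies alba", "Picea abies"]}
]}]}

for endpoint in synchronise_with_registry(registry):
    for title, names in harvest_endpoint(endpoint):
        print(title, names)
```

The point of the decoupling work is that each of these steps could become an independent component rather than one monolithic scheduler loop.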

The Phantom Records Menace

For a data administrator, going to the web test interface of a data publisher can be incredibly useful when one needs to compare the data that was collected using the Harvesting and Indexing Toolkit (HIT) with what is available from the publisher. In a perfect world the transfer of records would happen without a glitch, but when we eventually get fewer (or more!) records than we asked for, the search/test interfaces can be a real help (for instance the PyWrapper querying utilities). Sometimes GBIF will index a resource that, for no apparent reason, returns fewer records than expected from the line count that the HIT performs automatically. In this particular case there also appear to be several identical records on top of that – which we are made aware of by the HIT, which warns us that there are multiple records with the same "holy triplet": institution code, collection code and catalogue number. Now what happens when a request goes out for this name range: Abies alba Mill. - Achillea millefolium L. foll...
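The duplicate check the HIT warns about can be sketched as follows. The Darwin Core-style field names are illustrative; this is not the HIT's actual implementation.

```python
from collections import Counter

def duplicate_triplets(records):
    """Return the (institutionCode, collectionCode, catalogueNumber)
    triplets that occur more than once -- the condition that triggers
    the HIT's duplicate-record warning."""
    counts = Counter(
        (r["institutionCode"], r["collectionCode"], r["catalogueNumber"])
        for r in records
    )
    return {t: n for t, n in counts.items() if n > 1}

records = [
    {"institutionCode": "NHM", "collectionCode": "BOT", "catalogueNumber": "1"},
    {"institutionCode": "NHM", "collectionCode": "BOT", "catalogueNumber": "1"},
    {"institutionCode": "NHM", "collectionCode": "BOT", "catalogueNumber": "2"},
]
print(duplicate_triplets(records))  # the first triplet appears twice
```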

2011 GBIF Registry Refactoring

For the past couple of months, I have been working closely with another GBIF developer (and also fellow blog writer), Federico Mendez, on development tasks for the GBIF Registry application. This post provides an overview of the work being done in this area. First, I would like to explain the nuts and bolts of the current Registry application (the one online), and then the additions/modifications it has "suffered" during 2011 (these modifications have not yet been deployed). As stated in The evolution of the GBIF Registry blog post, in 2010 the Registry entered a new stage in its development by moving to a single DB, an enhanced web service API, and a web user interface. On top of this, an admin-only web interface was created so that we could do internal curation of the data inside the Secretariat. Hibernate was chosen as the preferred persistence framework, and the Data-Access-Object (DAO) classes were coded with the HQL necessary to provide an interface to th...
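The Registry's DAOs are Java classes backed by Hibernate and HQL; as a language-neutral sketch of the same Data-Access-Object pattern (hypothetical entity and method names, not the Registry's actual classes), the idea looks like this:

```python
# Sketch of the DAO pattern only; the real Registry DAOs are Java,
# with Hibernate sessions and HQL queries behind this interface.

class Organisation:
    def __init__(self, key, name):
        self.key, self.name = key, name

class OrganisationDao:
    """Hides the persistence layer behind a small query interface,
    as the HQL-backed DAO classes do for the Registry database."""
    def __init__(self, store):
        self._store = store  # stands in for a Hibernate session

    def get(self, key):
        return self._store.get(key)

    def list_by_name(self, fragment):
        # In spirit: "from Organisation where name like :fragment"
        return [o for o in self._store.values() if fragment in o.name]

store = {1: Organisation(1, "GBIF Secretariat"),
         2: Organisation(2, "NHM London")}
dao = OrganisationDao(store)
print([o.name for o in dao.list_by_name("GBIF")])
```

The value of the pattern is that callers never see the query language or the session handling, which is what made the 2011 refactoring of the persistence layer possible without touching the callers.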

Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers

This post is the second part of OAI-PMH Harvesting at GBIF. That post explained how the different OAI-PMH services are harvested. The subject of this post is to introduce the overall architecture of the index created from the information gathered from those services. Let's start by justifying why we needed a metadata index at GBIF: one of the main requirements we had was to allow end-users to search datasets. To enable this, the system provides two main search functionalities: full-text search and advanced search. For both functionalities the system displays a list of datasets containing the following information: title, provider, description (abstract) and a hyperlink to view the full metadata document in the original format (DIF, EML, etc.) provided by the source; all that information was collected by the harvester. The results of any search had to be displayed with, among others, two specific features: highlighting the text that matched the search criteria, and group/...
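The real index implements these two features with Solr's highlighting and grouping components; this tiny in-memory sketch only illustrates the two behaviours (match highlighting and grouping of results by provider), not the Solr implementation.

```python
# Illustrative only: the production system delegates highlighting and
# grouping to Solr rather than doing it in application code.

def search(datasets, term):
    """Full-text search over title + description, returning hits with
    <em>-highlighted matches, grouped by provider."""
    grouped = {}
    for d in datasets:
        text = d["title"] + " " + d["description"]
        if term.lower() in text.lower():
            hit = dict(d)
            hit["highlight"] = text.replace(term, "<em>" + term + "</em>")
            grouped.setdefault(d["provider"], []).append(hit)
    return grouped

datasets = [
    {"title": "Danish fungi", "description": "fungi of Denmark", "provider": "P1"},
    {"title": "Bird atlas", "description": "bird observations", "provider": "P2"},
    {"title": "More fungi", "description": "fungi records", "provider": "P1"},
]
results = search(datasets, "fungi")
```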

Software quality control at GBIF

We've not only set up Hadoop here at GBIF but also introduced a few other new things. With the growing software development team, we've felt the need to put some control measures in place to guarantee the quality of our software and to make the development process more transparent, both for us at GBIF and hopefully for other interested parties as well. GBIF projects have always been open source and hosted at their Google Code sites (e.g. GBIF Occurrencestore or the IPT). So in theory it was always possible for everyone to check and review every commit. We've now set up a Jenkins server, however, that does continuous integration for us, which means that every time a change is made to one of our projects it is checked out and a full build is run, including all tests, code quality measurements (I'm going to get back to those later), web site creation (e.g. Javadocs) and publishing of the results to our Maven repository. This is the first step in our new process. Every ...

Here be dragons - mapping occurrence data

One of the most compelling ways of viewing GBIF data is on a map. While name lists and detailed text are useful if you know what you're looking for, a map can give you the overview you need to start honing your search. I've always liked playing with maps in web applications, and recently I had the chance to add the functionality to our new Hadoop/Hive processing that answers the question "what species occurrence records exist in country x?". Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors - typically typos, and often one or both of lat and long reversed. Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA. Note the common mistakes, which result in glaring errors: reversed longitude produces the near-perfect mirror over China; reversed latitude produces a faint image over the Pacific off the coast of Chile; re...
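The kind of check implied by Map 1 - does a record's verbatim point fall in the claimed country, and if not, does a sign flip or an axis swap explain it? - can be sketched like this. The bounding box below is a rough contiguous-US box chosen purely for illustration; the real processing uses proper country geometries, not this helper.

```python
# Illustrative coordinate sanity check; USA box is a rough
# contiguous-US approximation, not GBIF's actual country geometry.

USA = {"min_lat": 24.0, "max_lat": 50.0, "min_lng": -125.0, "max_lng": -66.0}

def in_box(lat, lng, box):
    return (box["min_lat"] <= lat <= box["max_lat"]
            and box["min_lng"] <= lng <= box["max_lng"])

def diagnose(lat, lng, box):
    """Return the first plausible explanation for an out-of-box point."""
    if in_box(lat, lng, box):
        return "ok"
    if in_box(lat, -lng, box):
        return "longitude sign reversed"  # the mirror image over China
    if in_box(-lat, lng, box):
        return "latitude sign reversed"   # the image off the coast of Chile
    if in_box(lng, lat, box):
        return "lat/long swapped"
    return "unexplained"

print(diagnose(38.9, 77.0, USA))  # a Washington DC record with lng sign flipped
```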

The GBIF Spreadsheet Processor - an easy option to publish data

Most data publishers in the GBIF Network use software wrappers to make data available on the web. To set up those tools, an institution or an individual usually needs a certain degree of technical capacity, and this more or less raises the threshold for publishing biodiversity data. Imagine an entomologist who deals with collections and monographs every day; the only things s/he uses on a PC are Word and Excel. S/he has no student to help, but is keen to share the data before s/he retires. What is s/he going to do? One of our tools is built to support this kind of scenario - the GBIF Darwin Core Archive Spreadsheet Processor; usually we just call it "the Spreadsheet Processor." The Spreadsheet Processor is a web application with which one can:

- Use the templates provided on the web site;
- Fill in and upload (or email) the xls file;
- Get a Darwin Core Archive file as the result.

This is a pretty straightforward approach to preparing data for publishing, because the learning curve...
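The output of those three steps is a Darwin Core Archive: a zip containing a data file plus a meta.xml descriptor mapping columns to Darwin Core terms. The sketch below builds a drastically simplified archive with a single mapped term; real archives produced by the Spreadsheet Processor also carry full EML metadata and a complete field mapping.

```python
import io
import zipfile

# Minimal meta.xml: one Occurrence core file with an id column and a
# single Darwin Core term mapping. Real descriptors map many terms.
META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="\\t" linesTerminatedBy="\\n" ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

def build_archive(rows):
    """rows: (occurrenceID, scientificName) tuples from the spreadsheet."""
    data = "occurrenceID\tscientificName\n" + "".join(
        f"{i}\t{n}\n" for i, n in rows)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("occurrence.txt", data)
        z.writestr("meta.xml", META)
    return buf.getvalue()

archive = build_archive([("1", "Abies alba Mill.")])
```

The entomologist never sees any of this, of course - the Processor produces the archive from the uploaded spreadsheet, which is exactly the point.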