Posts

BioCASe now producing DarwinCore Archives

Guest post from Jörg Holetschek, Botanic Garden and Botanical Museum Berlin-Dahlem. For years, the traditional way of sharing occurrence data with GBIF has been web-service-based: data publishers use one of the existing provider software packages (DiGIR, BioCASe or TapirLink) to expose their data as a DiGIR-, BioCASe- or TAPIR-compliant web service. Biodiversity networks such as GBIF use harvesters to crawl and index the records published by these services, an approach that works well for small and medium-sized datasets but runs into difficulties when record numbers hit the millions: harvesting can take days and puts a heavy load on both the publisher and the crawler. To overcome this, GBIF recently introduced Darwin Core Archives, which store all information of a dataset to be published in a single file. Because GBIF ingests this file directly, the time-consuming back-and-forth communication between data provider and harvester is eliminated, speeding up the process and reducing load f...
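A Darwin Core Archive is essentially a zipped folder holding the data as delimited text files plus a small XML descriptor, meta.xml, that maps each column to a Darwin Core term. A minimal sketch of such a descriptor (file name and column order are illustrative, not taken from any particular dataset):

```xml
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        encoding="UTF-8" fieldsTerminatedBy="\t"
        linesTerminatedBy="\n" ignoreHeaderLines="1">
    <files>
      <!-- the actual data, one record per line -->
      <location>occurrence.txt</location>
    </files>
    <!-- column 0 holds the record identifier -->
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
  </core>
</archive>
```

Since the archive is self-describing, a consumer like GBIF can ingest it in a single pass instead of issuing thousands of web-service requests.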

Updating a customized IPT

This post originally appeared on the Canadensys blog and is a follow-up to the post Customizing the IPT. As mentioned at the very end of that post, I face a problem whenever I want to install a new version of the GBIF Integrated Publishing Toolkit: installing it will overwrite all my customized files! Luckily, Tim Robertson gave me a hint on how to solve this: a shell script to reapply my customization. Here's how it works (for Mac and Linux systems only). Comparing the customized files with the default files: first of all, I need to compare my customized files with the files from the new IPT, since they might have changed to include new functionality or fix bugs. So, I installed the newest version of the IPT on my localhost, opened the default files and compared them with mine. Although there are tools to compare files, I mostly did this manually. The biggest change in version 2.0.3 was the addition of localization, for which I'm using a different UI, so I h...
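The reapply script boils down to copying each customized file back over the freshly deployed webapp, preserving relative paths. A minimal sketch under assumed directory names (the helper name and locations are hypothetical; the actual script may differ):

```shell
#!/bin/sh
# Hypothetical helper: reapply customized IPT files after an upgrade.
# Usage: reapply_customization <customization-dir> <deployed-webapp-dir>
reapply_customization() {
  src="$1"; dst="$2"
  # walk the customization tree and copy every file into the same
  # relative location inside the freshly deployed webapp
  ( cd "$src" && find . -type f | while IFS= read -r f; do
      mkdir -p "$dst/$(dirname "$f")"
      cp "$f" "$dst/$f"
    done )
}
```

Keeping the customized files in their own directory tree, mirroring the webapp layout, is what makes a one-liner like this possible after every upgrade.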

Bug fixing in the GBIF Data Portal

Despite our current efforts to develop a new Portal v2, our current data portal at data.gbif.org has not been left unattended. Bug fixes are made periodically based on feedback sent to us by our user community. To keep our community informed, this post summarizes the most important fixes and enhancements of the past months: The data portal's main page now shows the total number of occurrence records with coordinates, along with the total count of records (non-georeferenced and georeferenced). Decimal coordinate searches were not working properly: when a user refined a coordinate search to use decimals, the data portal returned an erroneous count of occurrence records. This issue has been fixed; details here. Any feedback e-mail message sent from an occurrence or a taxon page now includes the original sender's e-mail address in the CC field. Previously the sender's e-mail address was not included in the feedback e-mail, which represen...

Important Quality Boost for GBIF Data Portal

Improvements speed up processing, “clean” name and location data, and enable checklist publishing. [This is a reposting from the GBIF news site.] A major upgrade to enhance the quality and usability of data accessible through the GBIF Data Portal has gone live. The enhancements are the result of a year’s work by developers at the Copenhagen-based GBIF Secretariat, in collaboration with colleagues throughout the worldwide network. They respond to a range of issues, including the need for quicker ‘turnaround’ time between entering new data and their appearance on the portal; filtering out inaccurate or incorrect locations and names for species occurrences; and enabling species checklists to be indexed as datasets accessible through the portal. After a testing period, the changes now apply to the more than 312 million biodiversity data records currently indexed from some 8,500 datasets and 340 publishers worldwide. Key improvements include: • processing time for data has fallen f...

Integration tests with DBUnit

Database-driven JUnit tests. As part of our migration to a solid, general testing framework, we are now using DbUnit for database integration tests of our database service layer with JUnit (on top of Liquibase for the DDL). Creating a DbUnit test file: as it can be painful to maintain a relational test dataset with many tables, I decided to dump a small, existing Postgres database into the DbUnit XML structure, namely FlatXML. It turned out to be less simple than I had hoped. First I created a simple exporter script in Java that dumps the entire DB into XML. Simple. The first problem I stumbled across was a column named "order", which caused a SQL exception. It turns out DbUnit needs to be configured for specific databases, so I ended up using three configurations to both dump and read the files: use Postgres-specific types, double-quote column and table names, and enable case-sensitive table & column names (now that we use quoted names, Postgres be...
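For reference, a FlatXML dataset uses one element per row, with the table as the element name and the columns as attributes. A tiny sketch (table and column names are made up for illustration) shows why a column named "order" is troublesome: it is an SQL reserved word, so DbUnit must quote it when it rebuilds the INSERT statements:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<dataset>
  <!-- one element per row; attributes carry the column values -->
  <name_usage id="1" scientific_name="Puma concolor" order="Carnivora"/>
  <name_usage id="2" scientific_name="Abies alba" order="Pinales"/>
  <!-- an empty element declares a table that has no rows -->
  <distribution/>
</dataset>
```

Without the double-quoting and case-sensitivity settings, DbUnit emits `INSERT INTO name_usage (id, scientific_name, order) ...`, which Postgres rejects.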

GBIF Portal: Geographic interpretations

The new portal processing is about to go into production, and during testing I was drawing some metrics on the revised geographic interpretation. It is a simple issue, but many records have coordinates that contradict the country that the record claims to be in. Some illustrations of this were previously shared by Oliver. The challenge is twofold. Firstly, we see many variations in the country name, which need to be interpreted. Some examples for Argentina (there are hundreds of variations per country):

Argent.
Argentina
Argentiana
N Argentina
N. Argentina
ARGENTINA
ARGENTINIA
ARGENTINNIA
"ARGENTINIA"
""ARGENTINIA""

etc. We have abstracted the parsing code into a separate Java library which makes use of basic algorithms and dictionary files to help interpret the results. This library might be useful for other tools requiring similar interpretation, or for data cleaning efforts, and will be maintained over time as it will be in use in ...
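As a rough sketch of the dictionary-based approach (the function and dictionary names here are invented; the real library is Java and has its own API), interpretation boils down to aggressively normalising the raw string and then looking it up in a dictionary of known variants:

```python
# Hypothetical sketch of dictionary-based country-name interpretation;
# the actual GBIF parser library uses its own dictionaries and rules.
import re

# dictionary file contents: normalised variant -> ISO country code
VARIANTS = {
    "ARGENTINA": "AR",
    "ARGENT": "AR",
    "ARGENTIANA": "AR",   # common misspelling
    "ARGENTINIA": "AR",   # common misspelling
    "ARGENTINNIA": "AR",  # common misspelling
}

def normalise(raw: str) -> str:
    """Uppercase, strip quotes and periods, drop compass qualifiers."""
    s = raw.upper().strip()
    s = re.sub(r'["\'.]', "", s)      # drop quotes and periods
    s = re.sub(r"^[NSEW]\s+", "", s)  # drop qualifiers like "N "
    return s.strip()

def interpret_country(raw: str):
    """Return the ISO code for a raw country string, or None."""
    return VARIANTS.get(normalise(raw))

print(interpret_country('""ARGENTINIA""'))  # -> AR
print(interpret_country("N. Argentina"))    # -> AR
```

In practice the dictionaries cover every country plus observed misspellings, and unmatched values fall through to fuzzier algorithms; the sketch only shows the exact-lookup core.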

Group synergy

During the last few weeks we have been intensively designing and implementing what will become the new data portal. Oliver nicely described the new stage our team has entered in his last blog post, Portal v2 - There will be cake. Personally, I think this has truly been a group experience, as we have decided to change our working paradigm. Normally each of us would have worked on a different component and we would later try to integrate everything; now all of us focus on a single subcomponent and drive our efforts into it. From my point of view, the main advantage of this is that we reduce the Bus Factor that we, as a small group of developers, are quite exposed to. Communication within our team has increased, as we are all on the same page now. As a general overview, the portal v2 will consist of different subcomponents (or sub-projects) that will need to interact with each other to come up with a consolidated "view" for...