Posts

Important Quality Boost for GBIF Data Portal

Improvements speed processing, “clean” name and location data, and enable checklist publishing. [This is a reposting from the GBIF news site.] A major upgrade to enhance the quality and usability of data accessible through the GBIF Data Portal has gone live. The enhancements are the result of a year’s work by developers at the Copenhagen-based GBIF Secretariat, in collaboration with colleagues throughout the worldwide network. They respond to a range of issues, including the need for a quicker ‘turnaround’ time between entering new data and its appearance on the portal; filtering out inaccurate or incorrect locations and names for species occurrences; and enabling species checklists to be indexed as datasets accessible through the portal. After a testing period, the changes now apply to the more than 312 million biodiversity data records currently indexed from some 8,500 datasets and 340 publishers worldwide. Key improvements include: • processing time for data has fallen f...

Integration tests with DBUnit

Database-driven JUnit tests As part of our migration to a solid, general testing framework, we are now using DbUnit for database integration tests of our database service layer with JUnit (on top of Liquibase for the DDL). Creating a DbUnit test file As it can be painful to maintain a relational test dataset with many tables, I decided to dump a small, existing Postgres database into the DbUnit XML structure, namely FlatXML. It turned out to be less simple than I had hoped. First I created a simple exporter script in Java that dumps the entire DB into XML. Simple. The first problem I stumbled across was a column named "order", which caused a SQL exception. It turns out DbUnit needs to be configured for specific databases, so I ended up using three configurations to both dump and read the files:

- Use Postgres-specific types
- Double-quote column and table names
- Enable case-sensitive table & column names (now that we use quoted names, Postgres be...
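The three tweaks above can be sketched with DbUnit's `DatabaseConfig` as follows. This is a minimal configuration sketch, not our actual exporter: the connection URL and file name are placeholders, but the `DatabaseConfig` properties and the `PostgresqlDataTypeFactory` are DbUnit's own.

```java
import java.io.FileOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;

import org.dbunit.database.DatabaseConfig;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.xml.FlatXmlDataSet;
import org.dbunit.ext.postgresql.PostgresqlDataTypeFactory;

public class FlatXmlExporter {
  public static void main(String[] args) throws Exception {
    Connection jdbc = DriverManager.getConnection(
        "jdbc:postgresql://localhost/testdb", "user", "pass");  // placeholder URL
    IDatabaseConnection conn = new DatabaseConnection(jdbc);
    DatabaseConfig config = conn.getConfig();

    // 1. Use Postgres-specific types instead of the default data type factory
    config.setProperty(DatabaseConfig.PROPERTY_DATATYPE_FACTORY,
        new PostgresqlDataTypeFactory());
    // 2. Double-quote column and table names, so reserved words like "order" work
    config.setProperty(DatabaseConfig.PROPERTY_ESCAPE_PATTERN, "\"?\"");
    // 3. Quoted identifiers are case sensitive in Postgres
    config.setProperty(DatabaseConfig.FEATURE_CASE_SENSITIVE_TABLE_NAMES, Boolean.TRUE);

    // Dump the entire database into a FlatXML file
    FlatXmlDataSet.write(conn.createDataSet(), new FileOutputStream("dataset.xml"));
  }
}
```

The same three settings are needed again on the `IDatabaseConnection` used to read the file back in a test, since the quoting applies to both export and import.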

GBIF Portal: Geographic interpretations

The new portal processing is about to go into production, and during testing I was drawing some metrics on the revised geographic interpretation. It is a simple issue, but many records have coordinates that contradict the country that the record claims to be in. Some illustrations of this were previously shared by Oliver. The challenge is twofold. Firstly, we see many variations in the country name, which need to be interpreted. Some examples for Argentina are given below (there are 100s of variations per country):

- Argent.
- Argentina
- Argentiana
- N Argentina
- N. Argentina
- ARGENTINA
- ARGENTINIA
- ARGENTINNIA
- "ARGENTINIA"
- ""ARGENTINIA""
- etc.

We have abstracted the parsing code into a separate Java library, which makes use of basic algorithms and dictionary files to help interpret the results. This library might be useful for other tools requiring similar interpretation, or data cleaning efforts, and will be maintained over time as it will be in use in ...
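To give a flavour of the dictionary-backed normalisation described above, here is a minimal sketch. The class, method, and dictionary entries are hypothetical illustrations, not the library's actual API; the real library combines such dictionaries with further algorithms.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of dictionary-backed country name interpretation.
public class CountryInterpreter {
  // A tiny excerpt of a variant -> canonical name dictionary
  private static final Map<String, String> DICTIONARY = Map.of(
      "ARGENTINA", "Argentina",
      "ARGENT", "Argentina",
      "ARGENTIANA", "Argentina",
      "ARGENTINIA", "Argentina",
      "ARGENTINNIA", "Argentina");

  /** Returns the canonical country name, or null when nothing matches. */
  public static String interpret(String verbatim) {
    if (verbatim == null) return null;
    // Strip quotes and dots, collapse whitespace, uppercase
    String norm = verbatim.replaceAll("[\"'.]", " ")
        .replaceAll("\\s+", " ")
        .trim()
        .toUpperCase(Locale.ROOT);
    // Drop leading compass qualifiers such as "N" or "NORTH"
    norm = norm.replaceFirst("^(N|S|E|W|NORTH|SOUTH|EAST|WEST) ", "");
    return DICTIONARY.get(norm);
  }

  public static void main(String[] args) {
    System.out.println(interpret("\"\"ARGENTINIA\"\""));  // Argentina
    System.out.println(interpret("N. Argentina"));        // Argentina
  }
}
```

Even this naive sketch collapses all the Argentina variants listed above to one canonical value, which is why a shared, maintained dictionary library pays off across tools.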

Group synergy

During the last few weeks we have been intensively designing and implementing what will become the new data portal. Oliver nicely described the new stage our team has entered in his last blog post, Portal v2 - There will be cake. In my personal opinion, this has truly been a group experience, as we have decided to change our way of working. Normally each of us would have worked on a different component and later tried to integrate everything, but now we have taken the approach of all focusing on one subcomponent and driving our efforts into it. From my point of view, the main advantage of this is that we avoid the Bus Factor, to which we, as a small group of developers, are quite exposed. Communication has increased within our team as we are all on the same page now. As a general overview, portal v2 will consist of different subcomponents (or sub-projects) that need to interact with each other to come up with a consolidated "view" for...

Portal v2 - There will be cake

The current GBIF data portal was started in 2007 to provide access to the network's biodiversity data - at the time that meant a federated search across 220 providers and 76 million occurrence records. While that approach has served us well over the years, many features that have been requested for the portal weren't addressable in the current architecture. Combined with the fact that we're now well over 300 million occurrence records, with millions of new taxonomic records to boot, it becomes clear that a new portal is needed. After a long consultation process with the wider community, the initial requirements of a new portal have been determined, and I'm pleased to report that work has officially started on its design and development. For the last 6 months or so the development team has been working on improving our rollover process, making registry improvements, developing the IPT, and other disparate tasks. The new portal marks an important milestone in ou...

VertNet and the GBIF Integrated Publishing Toolkit

(A guest post from our friends at VertNet, cross-posted from the VertNet blog) This week we’d like to discuss the current and future roles of the GBIF Integrated Publishing Toolkit (IPT) in VertNet. IPT is a Java-based web application that allows a user to publish and share biodiversity data sets from a server. Here are some of the things IPT can do:

- Create Darwin Core Archives. In our post about data publishing last week, we wrote about Darwin Core being the “language of choice” for VertNet. IPT allows publishers to create Darwin Core data records from either files or databases and to export them in zipped archive files that contain exactly what is needed by VertNet for uploading.
- Make data available for efficient indexing by GBIF. VertNet has an agreement with its data publishers that, by participating, they will also publish data through GBIF. GBIF keeps our registry of data providers and uses this registry to find and update data periodically from the original so...

Darwin Core Archives for Species Checklists

GBIF has long had an ambition to support the sharing of annotated species checklists through the network. Realising this ambition has been frustrated by the lack of a data exchange standard of sufficient scope and simplicity to promote publication of this type of resource. In 2009, the Darwin Core standard was formally ratified by TDWG (Biodiversity Information Standards). The addition of new terms, and a means of expressing these terms in a simplified and extensible text-based format, paved the way for the development of a profile for exchanging species checklists known as the Global Names Architecture (GNA) Profile. Species checklists published in this format can be zipped into single, portable 'archive' files. Here I introduce two example archives that illustrate the flexible scope of the format. The first represents a very simple species checklist while the second is a more richly documented taxonomic catalogue. The contents ...
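As an illustration of what such an archive contains (the file name and column choices below are an invented example, not taken from the two archives described in this post), a very simple checklist archive might hold a tab-delimited `taxon.txt` core file plus a `meta.xml` descriptor along these lines:

```xml
<!-- meta.xml: describes how to read the tab-delimited core file -->
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxon.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
  </core>
</archive>
```

The simplicity is the point: a richer taxonomic catalogue uses the same structure, just with more fields and additional extension files declared in the same descriptor.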