Posts

Darwin Core Archives for Species Checklists

GBIF has long had an ambition to support the sharing of annotated species checklists through the network. Realising this ambition has been frustrated by the lack of a data exchange standard with sufficient scope and simplicity to promote publication of this type of resource. In 2009, the Darwin Core standard data set was formally ratified by TDWG (Biodiversity Information Standards). The addition of new terms, and a means of expressing these terms in a simplified and extensible text-based format, paved the way for the development of a data exchange profile for species checklists, known as the Global Names Architecture (GNA) Profile. Species checklists published in this format can be zipped into single, portable 'archive' files. Here I introduce two example archives that illustrate the flexible scope of the format. The first represents a very simple species checklist, while the second is a more richly documented taxonomic catalogue. The contents ...

Configuring Drupal and some modules for ticketing emails

We at the Secretariat receive enquiries via helpdesk[at]gbif[dot]org, portal[at]gbif[dot]org and info[at]gbif[dot]org every day, or I would say, almost every hour. Some of them are provider-specific questions that need special attention from staff, while others are FAQs. We have been thinking about how to manage questions and issues better, so that by adding a little bit of structure to the collaborative workflow, we can: 1. Make sure questions are answered satisfactorily; 2. Estimate how many man-hours have been spent, or evaluate performance; 3. Improve the efficiency of helpdesk activities. To achieve these, we need software that meets these requirements: 1. Case management for incoming emails; 2. A Q&A cycle that can be completed solely by email. Web forms are good but not necessary in the beginning; 3. Easily configured knowledge base articles; 4. Graphical reports that show helpdesk performance; 5. Automatic escalation of case status. We looked for options from Open Source Help Desk List....

Using C3P0 with MyBatis

The problem: In our rollover process, which turns our raw harvested data into the interpreted occurrences you can see on our portal, we now have a step that calls a Web Service to turn geographical coordinates into country names. We use this to enrich and validate the incoming data. This step usually took about three to four hours, but last week it stopped working altogether without any changes to the Web Service or the input data. We've spent a lot of time trying to find the problem, and while we still can't say for sure what the exact problem is or was, we've found a fix that works for us and that also allows us to make some assumptions about the cause of the failure. The Web Service is a project called geocode-ws, a very simple project that uses MyBatis to call a PostgreSQL (8.4.2) and PostGIS (1.4.0) database, which does the GISy work of finding matches. Our process started out fine. The first few million calls to the Web Service wer...

Indexing occurrences data - using Lucene technology

The GBIF Occurrence Index collects, stores and parses data gathered from different sources to provide fast and accurate access to biodiversity occurrence data. The purpose of having a GBIF Index is to optimize the speed, relevance and performance of the search functionality that will be implemented by the new GBIF portal architecture. Currently, GBIF provides search functionality in its Data Portal, supported by a semi-denormalized relational index database design, which allows users to find occurrence information by specifying filters to refine the expected results. That design was envisioned to support the use cases of the current GBIF Data Portal (a Web application). For the next generation of the GBIF platform, a new set of requirements must be met, and it is possible that the current index will not be able to support them. The most relevant of those requirements are: scheduling of batch exports, full text search, realtime faceted search and probably new schemas of data sharing with other ...

Customizing the IPT

One of my responsibilities as the Biodiversity Informatics Manager for Canadensys is to develop a data portal giving access to all the biodiversity information published by the participants of our network. A huge portion of this task can now be done with the GBIF Integrated Publishing Toolkit version 2, or IPT. The IPT allows us to host biodiversity resources, manage their data and metadata, and register them with GBIF so they can appear on the GBIF data portal, all of which are targets we want to achieve. Best of all, most management can be done by the collection managers themselves. I have tested the IPT thoroughly and I am convinced the GBIF development team has done an excellent job creating a stable tool I can trust. This post explains how I have customized our IPT installation to integrate it with our other Canadensys websites. Background: Our Canadensys community portal is powered by WordPress (MySQL, PHP), while our data portal - which before the IPT installation only consisted of...

Working with Scientific Names

Dealing with scientific names is an important, regular part of our work at GBIF. Scientific names are highly structured strings with a syntax governed by a nomenclatural code. Unfortunately there are different codes for botany, zoology, bacteria, viruses and even cultivar names. When dealing with names we often do not know which code or classification they belong to, so we need a representation that is as code-agnostic as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily the botanical and zoological names, which are quite similar in their basic form. The ParsedName class: Our ParsedName class provides us with the following core properties: genusOrAbove, infraGeneric, specificEpithet, rankMarker, infraSpecificEpithet, authorship, year, bracketAuthorship, bracketYear. These allow us to represent regular names properly. For example Agalinis purpurea var. borealis (Berg.) Peterson 1987 is represented as genusOrAbove=Aga...
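
To make the idea of a structured name concrete, here is a minimal sketch in Python of a record carrying the nine properties listed above. It is not the GBIF ParsedName class itself: the snake_case field names, the two rendering methods and the way the example name is split into fields are all assumptions made for illustration, and infraGeneric is stored but not rendered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedName:
    """Illustrative stand-in for a parsed scientific name (hypothetical, not GBIF's class)."""
    genus_or_above: str
    infra_generic: Optional[str] = None
    specific_epithet: Optional[str] = None
    rank_marker: Optional[str] = None
    infra_specific_epithet: Optional[str] = None
    authorship: Optional[str] = None
    year: Optional[str] = None
    bracket_authorship: Optional[str] = None
    bracket_year: Optional[str] = None

    def canonical_name(self) -> str:
        # Name parts without any authorship; empty parts are skipped.
        parts = (self.genus_or_above, self.specific_epithet,
                 self.rank_marker, self.infra_specific_epithet)
        return " ".join(p for p in parts if p)

    def full_name(self) -> str:
        # Simplified rendering: canonical name, bracketed (original) author,
        # then the combination author and year, all separated by spaces.
        parts = [self.canonical_name()]
        bracket = " ".join(p for p in (self.bracket_authorship, self.bracket_year) if p)
        if bracket:
            parts.append(f"({bracket})")
        parts.extend(p for p in (self.authorship, self.year) if p)
        return " ".join(parts)


# Assumed field assignment for the example name used in the text.
name = ParsedName(
    genus_or_above="Agalinis",
    specific_epithet="purpurea",
    rank_marker="var.",
    infra_specific_epithet="borealis",
    bracket_authorship="Berg.",
    authorship="Peterson",
    year="1987",
)
print(name.canonical_name())
print(name.full_name())
```

Keeping each part in its own field means the same record can be rendered with or without authorship, which is one way a representation can stay neutral between the botanical and zoological conventions.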

Are you the keymaster?

As I mentioned previously, I'm starting work on evaluating HBase for our occurrence record needs. In the last little while that has meant coming up with a key structure and/or schema that optimizes reads for one major use case of the GBIF data portal: a user request to download an entire record set, including raw records as well as interpreted ones. The most common form of this request looks like "Give me all records for <taxon>", e.g. "Give me all records for Family Felidae". So far I'm concentrating more on lookup and retrieval than on writing or data storage optimization, so the schema I'm using is two column families, one for verbatim columns and one for interpreted (for a total of about 70 columns). The question of which key to use for HTable's single indexed column is what we need to figure out. For all these examples we assume we know the backbone taxonomy id of the taxon concept in question (i.e. Family Felidae is id 123456). Option 1 Key: nativ...