Posts

Customizing the IPT

One of my responsibilities as the Biodiversity Informatics Manager for Canadensys is to develop a data portal giving access to all the biodiversity information published by the participants of our network. A huge portion of this task can now be done with the GBIF Integrated Publishing Toolkit version 2, or IPT. The IPT allows you to host biodiversity resources, manage their data and metadata, and register them with GBIF so they can appear on the GBIF data portal, all of which are goals we want to achieve. Best of all, most of the management can be done by the collection managers themselves. I have tested the IPT thoroughly and I am convinced the GBIF development team has done an excellent job creating a stable tool I can trust. This post explains how I have customized our IPT installation to integrate it with our other Canadensys websites. Background Our Canadensys community portal is powered by WordPress (MySQL, PHP), while our data portal - which before the IPT installation only consisted of...

Working with Scientific Names

Dealing with scientific names is an important and regular part of our work at GBIF. Scientific names are highly structured strings with a syntax governed by a nomenclatural code. Unfortunately there are different codes for botany, zoology, bacteria, viruses and even cultivar names. When dealing with a name we often do not know which code or classification it belongs to, so we need a representation that is as code-agnostic as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily botanical and zoological names, which are quite similar in their basic form. The ParsedName class Our ParsedName class provides us with the following core properties: genusOrAbove, infraGeneric, specificEpithet, rankMarker, infraSpecificEpithet, authorship, year, bracketAuthorship, bracketYear. These allow us to represent regular names properly. For example Agalinis purpurea var. borealis (Berg.) Peterson 1987 is represented as genusOrAbove=Aga...
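A minimal sketch of how a holder for those core properties might look. The field names are taken from the post; the buildName() assembly logic is an illustrative assumption, not the actual GBIF ParsedName implementation.

```java
// Sketch of a parsed-name holder using the properties listed above.
// buildName() reassembles the name string, skipping any empty parts.
public class ParsedNameSketch {
    String genusOrAbove;
    String infraGeneric;
    String specificEpithet;
    String rankMarker;
    String infraSpecificEpithet;
    String authorship;
    String year;
    String bracketAuthorship;
    String bracketYear;

    /** Reassembles the full name with authorship from the parsed parts. */
    String buildName() {
        StringBuilder sb = new StringBuilder(genusOrAbove);
        if (specificEpithet != null) sb.append(' ').append(specificEpithet);
        if (rankMarker != null) sb.append(' ').append(rankMarker);
        if (infraSpecificEpithet != null) sb.append(' ').append(infraSpecificEpithet);
        if (bracketAuthorship != null) {
            sb.append(" (").append(bracketAuthorship);
            if (bracketYear != null) sb.append(' ').append(bracketYear);
            sb.append(')');
        }
        if (authorship != null) {
            sb.append(' ').append(authorship);
            if (year != null) sb.append(' ').append(year);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The example from the post: Agalinis purpurea var. borealis (Berg.) Peterson 1987
        ParsedNameSketch pn = new ParsedNameSketch();
        pn.genusOrAbove = "Agalinis";
        pn.specificEpithet = "purpurea";
        pn.rankMarker = "var.";
        pn.infraSpecificEpithet = "borealis";
        pn.bracketAuthorship = "Berg.";
        pn.authorship = "Peterson";
        pn.year = "1987";
        System.out.println(pn.buildName());
        // → Agalinis purpurea var. borealis (Berg.) Peterson 1987
    }
}
```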

Are you the keymaster?

As I mentioned previously I'm starting work on evaluating HBase for our occurrence record needs. In the last little while that has meant coming up with a key structure and/or schema that optimizes reads for one major use case of the GBIF data portal - a user request to download an entire record set, including raw records as well as interpreted ones. The most common form of this request looks like "Give me all records for ", e.g. "Give me all records for Family Felidae". So far I'm concentrating more on lookup and retrieval than on writing or data storage optimization, so the schema I'm using has two column families, one for verbatim columns and one for interpreted ones (for a total of about 70 columns). The question of which key to use for HTable's single indexed column is what we need to figure out. For all these examples we assume we know the backbone taxonomy id of the taxon concept in question (i.e. Family Felidae is id 123456). Option 1 Key: nativ...
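To illustrate why the key choice matters for the "all records for taxon X" lookup, here is one hypothetical option (not necessarily the one GBIF settled on): a composite row key of backbone taxon id plus occurrence id, so that all records for a taxon form a contiguous range that a single prefix scan can retrieve.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: composite row key = 4-byte taxon id + 8-byte occurrence id,
// both big-endian so that byte-lexicographic ordering (HBase's row ordering)
// groups all occurrences of one taxon together.
public class OccurrenceKey {

    static byte[] rowKey(int taxonId, long occurrenceId) {
        return ByteBuffer.allocate(12).putInt(taxonId).putLong(occurrenceId).array();
    }

    /** Scan start key: the lowest possible key for this taxon. */
    static byte[] startKey(int taxonId) {
        return ByteBuffer.allocate(4).putInt(taxonId).array();
    }

    /** Scan stop key (exclusive): the first key of the next taxon id. */
    static byte[] stopKey(int taxonId) {
        return ByteBuffer.allocate(4).putInt(taxonId + 1).array();
    }
}
```

A scan from startKey(123456) to stopKey(123456) would then return every occurrence of Family Felidae without touching any other rows.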

The organisational structure and the endorsement process - if you're an IPT administrator

During the Expert Workshop last week in Copenhagen, we had a session on configuring the IPT to reflect different organisational structures. I think it's worth explaining that part in a blog post here, since some of our readers would like to help deploy the IPT in the GBIF Network. It usually starts with questions like these: Why am I asked for the password of the organisation that I choose when registering my IPT? Why am I asked again when I want to add an additional organisation? The short answer is that having the organisation's password means you have permission from that organisation, and the organisation is aware of the fact that you're registering an IPT against it. So, why is this the way of registering an organisation? The organisational structure Remember that the GBIF Network is not only a common pool of shared biodiversity data; to form such a pool, it is also the social network in which biodiversity data publishers interact. The IPT serves as the te...

Synchronizing occurrence records

This post should be read in line with Tim's post about Decoupling Components, as it takes for granted some information written there. During the last week, I've been learning and working with some of the technologies related to the decoupling of components we want to accomplish. Specifically, I've been working with the Synchronizer component of the event-driven architecture Tim described. Right now, the synchronizer takes the responses from the resources and gets those responses into the occurrence store (MySQL as of today, but not final). But there is more to it: the responses from the resources typically come from DiGIR, TAPIR and BioCASe providers, which render their responses in XML format. So how does all this data end up in the occurrence store? Well, fortunately my colleague Oliver Meyn wrote a very useful library to unmarshal all these XML chunks into nice and simple objects, so on my side I just have to worry about calling all those getter methods. A...
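A sketch of what that unmarshalling step looks like in principle: turn an XML chunk from a provider response into a plain object whose getters the synchronizer can call. The element names and the RawRecord class here are hypothetical, not the actual library Oliver wrote.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical unmarshaller: XML chunk in, simple getter-bearing object out.
public class Unmarshaller {

    static class RawRecord {
        private final String scientificName;
        private final String catalogNumber;
        RawRecord(String scientificName, String catalogNumber) {
            this.scientificName = scientificName;
            this.catalogNumber = catalogNumber;
        }
        public String getScientificName() { return scientificName; }
        public String getCatalogNumber() { return catalogNumber; }
    }

    static RawRecord unmarshal(String xml) throws Exception {
        // Parse the provider response and pull out the fields we care about.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return new RawRecord(
            doc.getElementsByTagName("scientificName").item(0).getTextContent(),
            doc.getElementsByTagName("catalogNumber").item(0).getTextContent());
    }

    public static void main(String[] args) throws Exception {
        RawRecord r = unmarshal(
            "<record><scientificName>Felis catus</scientificName>"
            + "<catalogNumber>ABC-1</catalogNumber></record>");
        System.out.println(r.getScientificName()); // → Felis catus
    }
}
```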

Querying Solr using a pure AJAX application

This is the third (and final) post related to the GBIF Metacatalogue Project. The first two were dedicated to explaining how the data is harvested and how that information is stored in Apache Solr. Those posts can be consulted at: OAI-PMH Harvesting at GBIF; Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers. One of the nicest features of Solr is that most of its functionality is exposed via a REST API. This API can be used for different operations: deleting documents, posting new documents and, most importantly, querying the index. In cases where the index is self-contained (i.e. it doesn't depend on external services or storage to return valuable information), a very thin application client without any mediator is a viable option. In general terms, a "mediator" is a layer that handles the communication between the user interface and Solr; in some cases that layer (possibly) manipulates the information before sending it to the user interface. Metadata Web appl...
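The REST request such a thin client issues is just a URL against Solr's select handler. The core name and field below are assumptions for illustration; wt=json asks Solr for a JSON response that browser-side JavaScript can consume directly.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of building the kind of query URL an AJAX client would send to Solr.
// Base URL, core name and field are hypothetical.
public class SolrQueryUrl {

    static String selectUrl(String solrBase, String query, int rows) {
        return solrBase + "/select?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&wt=json&rows=" + rows;
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr/metadata", "title:fish", 10));
        // → http://localhost:8983/solr/metadata/select?q=title%3Afish&wt=json&rows=10
    }
}
```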

Simple wallboard display with Scala and Lift at GBIF

This week we hit 300 million indexed occurrence records. As you can see in the picture, we have a monitor set up that shows our current record count. It started as an idea a few weeks ago, but while at the Berlin Buzzwords conference (we were at about 298 million then) I decided it was time to do something about it. I've been playing around with Scala a bit in the last few months, so this was a good opportunity to try Lift, a web framework written in Scala. In the end it turns out that very little code was needed to create an auto-updating counter. There are three components: We've got a DBUpdater object that uses Lift's Schedule (it used to be called ActorPing, which caused some confusion for me) to update its internal count of raw occurrence records every ten seconds. The beauty is that there is just one instance of this no matter how many clients are looking at the webpage. The second part is a class that acts as a Comet adaptor called RawOccurrenceRecordCo...
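The real DBUpdater is a Scala object driven by Lift's Schedule; the same single-instance polling pattern can be sketched in plain Java, with a LongSupplier standing in for the database count query.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// One shared updater refreshes a cached record count every ten seconds,
// no matter how many clients read it - an illustrative sketch, not Lift code.
public class DbUpdaterSketch {
    private final AtomicLong count = new AtomicLong();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    DbUpdaterSketch(LongSupplier countQuery) {
        // Run immediately, then every ten seconds.
        scheduler.scheduleAtFixedRate(
            () -> count.set(countQuery.getAsLong()), 0, 10, TimeUnit.SECONDS);
    }

    /** Cheap read of the cached count; clients never hit the database. */
    long currentCount() { return count.get(); }

    void shutdown() { scheduler.shutdown(); }
}
```

Because every page view reads the same cached AtomicLong, the database sees one count query per interval instead of one per client.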