Posts

The organisational structure and the endorsement process - if you're an IPT administrator

During the Expert Workshop last week in Copenhagen, we had a session on configuring the IPT to reflect different organisational structures. I think it's worth explaining that part in a blog post here, since some of our readers would like to help deploy the IPT in the GBIF Network. It usually starts with questions like these: Why am I asked for the password of the organisation that I choose when registering my IPT? Why am I asked again when I want to add an additional organisation? The short answer is that by having the organisation's password, you show that you have permission from that organisation, and that the organisation is aware you are registering an IPT against it. So why is registration done this way? The organisational structure: Remember that the GBIF Network is not only a common pool of shared biodiversity data; to form such a pool, it is also the social network in which biodiversity data publishers interact. The IPT serves as the te...

Synchronizing occurrence records

This post should be read alongside Tim's post about Decoupling Components, as it takes for granted some information written there. During the last week, I've been learning/working with some of the technologies related to the decoupling of components we want to accomplish. Specifically, I've been working with the Synchronizer component of the event-driven architecture Tim described. Right now, the synchronizer takes the responses from the resources and gets those responses into the occurrence store (MySQL as of today, but not final). But there is more to it: the responses from the resources typically come from DiGIR, TAPIR and BioCASe providers, which render their responses in XML format. So how does all this data end up in the occurrence store? Well, fortunately my colleague Oliver Meyn wrote a very useful library to unmarshal all these XML chunks into nice and simple objects, so on my side I just have to worry about calling all those getter methods. A...
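The unmarshalling step described above can be sketched roughly as follows — a minimal illustration in Python (the actual library is Java, and the XML element names here are hypothetical stand-ins for what a provider response might contain):

```python
# Sketch of unmarshalling an XML response chunk into simple objects.
# Element names ("record", "scientificName", ...) are hypothetical.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class OccurrenceRecord:
    scientific_name: str
    latitude: float
    longitude: float

def unmarshal(xml_chunk: str) -> list:
    """Turn one XML response chunk into plain objects with simple getters."""
    root = ET.fromstring(xml_chunk)
    records = []
    for rec in root.iter("record"):
        records.append(OccurrenceRecord(
            scientific_name=rec.findtext("scientificName", default=""),
            latitude=float(rec.findtext("decimalLatitude", default="0")),
            longitude=float(rec.findtext("decimalLongitude", default="0")),
        ))
    return records

sample = """<response>
  <record>
    <scientificName>Puma concolor</scientificName>
    <decimalLatitude>9.55</decimalLatitude>
    <decimalLongitude>-83.75</decimalLongitude>
  </record>
</response>"""

records = unmarshal(sample)
print(records[0].scientific_name)  # → Puma concolor
```

Once the chunks are plain objects like this, the synchronizer side only needs to call getters and write rows to the occurrence store.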

Querying Solr using a pure AJAX application

This is the third (and final) post related to the GBIF Metacatalogue Project. The first two explained how the data is harvested and how that information is stored in Apache Solr: OAI-PMH Harvesting at GBIF; Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers. One of the nicest features of Solr is that most of its functionality is exposed via a REST API. This API can be used for different operations: deleting documents, posting new documents and, most importantly, querying the index. In cases where the index is self-contained (i.e. it doesn't depend on external services or storage to return valuable information), a very thin client application without any mediator is a viable option. In general terms, a "mediator" is a layer that handles the communication between the user interface and Solr; in some cases that layer (possibly) manipulates the information before sending it to the user interface. Metadata Web appl...
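To make the "thin client, no mediator" idea concrete, here is a sketch in Python of building the kind of /select request a browser-side AJAX client would issue directly against Solr — the host, core name and query field are hypothetical:

```python
# Build a Solr /select URL, as a thin client querying the REST API
# directly would. Base URL and core name ("metadata") are assumptions.
from urllib.parse import urlencode

def solr_select_url(base="http://localhost:8983/solr/metadata", **params):
    """Return a /select URL; an AJAX client would simply fetch this."""
    query = {"wt": "json", "q": "*:*"}  # JSON response suits AJAX clients
    query.update(params)
    return f"{base}/select?{urlencode(query)}"

url = solr_select_url(q="title:occurrence", rows=10)
print(url)
```

The same URL shape works from JavaScript in the browser; Solr answers with JSON that the client renders without any intermediate server layer.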

Simple wallboard display with Scala and Lift at GBIF

This week we hit 300 million indexed occurrence records. As you can see in the picture, we have a monitor set up that shows us our current record count. It started as an idea a few weeks ago, but while at the Berlin Buzzwords conference (we were at about 298 million then) I decided it was time to do something about it. I've been playing around with Scala a bit over the last few months, so this was a good opportunity to try Lift, a web framework written in Scala. In the end it turns out that very little code was needed to create an auto-updating counter. There are three components: We've got a DBUpdater object that uses Lift's Schedule (it used to be called ActorPing, which caused some confusion for me) to update its internal count of raw occurrence records every ten seconds. The beauty is that there is just one instance of this no matter how many clients are looking at the webpage. The second part is a class that acts as a Comet adaptor called RawOccurrenceRecordCo...
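The single shared updater is the interesting part of the pattern — one scheduled job refreshes a count that any number of clients read. A rough Python analogue (the original is Scala/Lift; the fetch function and interval here are stand-ins):

```python
# One shared, periodically-refreshed counter, regardless of client count.
# fetch_count and the 10-second interval are hypothetical stand-ins.
import threading

class DBUpdater:
    """Single instance holding the current record count."""
    def __init__(self, fetch_count, interval=10.0):
        self._fetch = fetch_count
        self._interval = interval
        self._count = fetch_count()
        self._lock = threading.Lock()

    def _tick(self):
        with self._lock:
            self._count = self._fetch()          # refresh from the DB
        t = threading.Timer(self._interval, self._tick)
        t.daemon = True                          # don't block shutdown
        t.start()

    def start(self):
        self._tick()

    @property
    def count(self):
        with self._lock:
            return self._count

updater = DBUpdater(fetch_count=lambda: 300_000_000, interval=10.0)
updater.start()
print(updater.count)  # → 300000000
```

In Lift, the Comet machinery then pushes each new count out to every connected browser, so the clients never poll the database themselves.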

Buzzword compliance

Over the last few years a number of new technologies have emerged (inspired largely by Google) to help wrangle Big Data. Things like Hadoop, HBase, Hive, Lucene, Solr and a host of others are becoming the "buzzwords" for handling the type of data that we at the secretariat are working with. As a number of our previous posts here have shown, the GBIF dev team is wholeheartedly embracing these new technologies, and we recently went to the Berlin Buzzwords conference (as a group) to get a sense of how the broader community is using these tools. My particular interest is in HBase, which is a style of database that can handle "millions of columns and billions of rows". Since we're optimistic about the continued growth of the number of occurrence records indexed by GBIF, it's not unreasonable to think about 1 billion (10^9) indexed records within the medium term, and while our current MySQL solution has held up reasonably well so far (now closing in on 300 mil...

Getting started with Avro RPC

Apache Avro is a data exchange format started by Doug Cutting of Lucene and Hadoop fame. A good introduction to Avro is on the Cloudera blog, so an introduction is not the intention of this post. Avro is surprisingly difficult to get into, as it lacks the most basic "getting started" documentation for a newcomer to the project. This post serves as a reminder to myself of what I did, and hopefully helps others get the hello world working quickly. If people find it useful, let's fill it out and submit it to the Avro wiki! Prerequisites: knowledge of Apache Maven. Start by adding the Avro Maven plugin to the pom. This is needed to compile the Avro schema definitions into Java classes:

  <plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.5.1</version>
    <executions>
      <execution>
        <id>schemas</id>
        <phase>generate-sources...
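For orientation, the kind of schema definition the plugin compiles might look like this — a minimal record schema with hypothetical names, since the excerpt doesn't show the post's actual schema:

```json
{
  "namespace": "example.avro",
  "type": "record",
  "name": "Greeting",
  "fields": [
    {"name": "message", "type": "string"}
  ]
}
```

Saved as an .avsc file (typically under src/main/avro), the plugin turns it into a generated Java class during the generate-sources phase.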

MySQL: A speed-up of over 9000 times using partitioning

I wanted to write about a MySQL performance optimization using partitioning, as I recently applied it to the Harvesting and Indexing Toolkit's (HIT) log table. The log table was already using a composite index (an index on multiple columns), but as this table grew bigger and bigger (>50 million records), queries were being answered at a turtle's pace. To set things up, imagine that the HIT application has a log page that allows the user to tail the latest log messages in almost real time. Behind the scenes, the application queries the log table every few seconds for the most recent logs, and the effect is a running view of the logs. The tail query used looks like this:

  mysql> select * from log_event where id >= 'latest id' and datasource_id = 'datasource_id' and level >= 'log level' order by id desc;

In effect this query asks: "give me the latest logs for datasource with id X having at least a certain log level". Partitioning basically divides a table into d...
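For illustration only, here is one way such a table could be partitioned — a hedged sketch, since the excerpt doesn't show the actual DDL, and the column types, partition scheme and partition count are all assumptions:

```sql
-- Hypothetical layout: partition log_event by datasource_id so the tail
-- query only touches one partition. MySQL requires the partitioning
-- column to appear in every unique key, hence the composite primary key.
CREATE TABLE log_event (
  id            BIGINT  NOT NULL,
  datasource_id INT     NOT NULL,
  level         INT     NOT NULL,
  message       TEXT,
  PRIMARY KEY (id, datasource_id)
)
PARTITION BY KEY (datasource_id)
PARTITIONS 16;
```

With a layout like this, the `datasource_id = X` predicate in the tail query lets MySQL prune every other partition before it starts scanning, which is where a speed-up of this magnitude comes from.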