
Showing posts from April, 2011

The evolution of the GBIF Registry

The GBIF Registry has evolved over time to become an important tool in GBIF's day-to-day work. But before going into this post, a basic understanding of the GBIF Network model is needed. GBIF is a decentralised network whose entities are related to each other in various ways. At the top level there are GBIF Participant Nodes, which typically are countries or thematic networks that coordinate their domain. These Nodes endorse one or more Organisations or Institutions within their domain, and each Organisation owns one or more Resources exposed through the GBIF Network. Each Resource is also typically associated with a Technical Access Point, which is the URL used to access its data. There are also other entities, such as IPT Installations, which are deployed inside specific organisations but are not resources themselves; they publish Resources that might be owned by other organisations. A quick view of GBIF's network model can b...
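The relationships described above — Nodes endorsing Organisations, Organisations owning Resources with access points, and IPT installations publishing on behalf of others — can be sketched in a few data classes. This is a simplification for illustration only (hypothetical class and field names), not the Registry's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Resource:
    name: str
    access_point: Optional[str] = None  # URL where the resource's data is served

@dataclass
class Organisation:
    name: str
    resources: List[Resource] = field(default_factory=list)  # resources it owns

@dataclass
class IPTInstallation:
    host: Organisation
    # Resources it publishes; these may be owned by other organisations
    published: List[Resource] = field(default_factory=list)

@dataclass
class Node:
    name: str  # a country or thematic network
    endorsed: List[Organisation] = field(default_factory=list)
```

Walking from a Node down to an access point is then a matter of following these containment links.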

OAI-PMH Harvesting at GBIF

GBIF has been my first experience in the bio-informatics world; my first assignment was developing an OAI-PMH harvester. This post will introduce the OAI-PMH protocol and how we are gathering XML documents from different sources; in a future post I'll give an introduction to the index that we have built using those documents.

The main goal of this project was to develop the infrastructure needed across the GBIF network to support the management and delivery of metadata, enabling potential end users to discover which datasets are available and to evaluate the appropriateness of those datasets for particular purposes. In the GBIF context, resources are datasets, loosely defined as collections of related data whose granularity is determined by the data custodian/provider.

OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) is a platform-independent framework for both metadata publishers and metadata consumers. The most important concepts of this protocol...
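To give a feel for the harvesting side, here is a minimal sketch assuming a generic OAI-PMH endpoint and the standard `oai_dc` metadata format. The endpoint URL and helper names are illustrative, not GBIF's actual harvester code:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # A resumptionToken continues a partial list; per the spec it must be
        # the only argument besides the verb.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

def parse_titles(xml_text):
    """Extract Dublin Core titles from a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter(DC_NS + "title")]

def find_resumption_token(xml_text):
    """Return the resumptionToken if the list is incomplete, else None."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    return token.text if token is not None else None
```

A harvester would loop: fetch the URL, parse the records, and keep issuing requests with the returned resumption token until none is present.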

Cleanup of occurrence records

Lars here. Like Oliver, I started at GBIF in October 2010 and have no biology background either, so my first step here was to set up the infrastructure Tim mentioned before, but I've already written about that (at length). To continue the series of blog posts started by Oliver, and in no particular order, I'll talk about what we are doing to process the incoming data — the task I was given after the Hadoop setup was done. During our rollover we process Occurrence records. Millions of them: about 270 million at the moment, and we expect this to grow significantly over the next few months and years. It is only natural that there is bound to be bad data in there for various reasons, ranging from simple typos to misconfigured publishing tools and transfer errors. The more we know about the domain and the data, the more we are able to fix. Any input is appreciated on how we could do better on this part of our ...
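As an illustration of the kind of checks such processing involves, a few basic quality flags on a raw record might look like the sketch below. The field names are real Darwin Core terms, but the flag names and thresholds are hypothetical, and the actual GBIF processing is far more extensive:

```python
def basic_issues(record):
    """Flag common quality problems in a raw occurrence record (illustrative sketch)."""
    issues = []
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            issues.append("COORDINATES_OUT_OF_RANGE")
        elif lat == 0 and lon == 0:
            # 0/0 is in the Gulf of Guinea and very often a publishing default
            issues.append("ZERO_COORDINATES")
    except ValueError:
        issues.append("COORDINATES_NOT_NUMERIC")
    year = record.get("year", "")
    if year and (not year.isdigit() or not (1600 <= int(year) <= 2011)):
        # upper bound assumes no future-dated collecting events (as of this post)
        issues.append("IMPLAUSIBLE_YEAR")
    return issues
```

Records that pass such checks unchanged go straight into the index; flagged ones can be corrected where the fix is obvious, or reported back to the publisher.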

Reworking the Portal processing

The GBIF Data Portal has provided a gateway to discover and access the content shared through the GBIF network for some years, without major change. As the amount of data has grown, GBIF has scaled vertically (i.e. scaling up) to maintain performance levels; this is becoming unmanageable with the current processing routines due to the number of SQL statements issued against the database. As GBIF content grows, the indexing infrastructure must change to scale out accordingly. I have been monitoring and evaluating alternative technologies for some time, and a few months ago GBIF initiated the redevelopment of the processing routines. This current area of work does not increase the functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to: reduce the latency between a record changing on the publisher side and being reflected in the index; reduce the amount of (wo)man-hours needed to c...

Lucene for searching names in our new common taxonomy

Oliver here - I'm one of the new developers at GBIF, having started in October 2010. With no previous experience in biology or biological classification, you can bet it's been a steep learning curve in my time here, but at the same time it's very nice to be learning about a domain that's real, valuable and permanent, rather than yet another fleeting e-commerce, money-trading or "social media" application! One of the features of GBIF's Data Portal is searching primary occurrence data via a backbone taxonomy. For example, let's say you're interested in snow leopards and would like to plot all current and historical occurrences of this elusive cat on a world map. Let's further say that Richard Attenborough suggested to you that the snow leopard's scientific name is "Panthera uncia". You would ask the data portal for all records about Panthera uncia and expect to see all occurrences of snow leopards. Unfortunately biol...
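The portal's actual name matching is built on Lucene, but the underlying idea — matching a possibly misspelled or variant name against a backbone taxonomy by string similarity — can be sketched with Python's standard-library `difflib`. The tiny backbone list below is illustrative only:

```python
import difflib

# A toy backbone taxonomy; the real one holds millions of names
backbone = ["Panthera uncia", "Panthera leo", "Panthera tigris", "Uncia uncia"]

def match_name(query, names, cutoff=0.8):
    """Return the backbone names most similar to the query string,
    best match first, ignoring candidates below the similarity cutoff."""
    return difflib.get_close_matches(query, names, n=3, cutoff=cutoff)
```

So a query for the misspelled "Pantera uncia" would still land on "Panthera uncia". Lucene brings the same fuzzy idea plus tokenisation, ranking and index-backed speed, which is what makes it viable at the scale of a full backbone taxonomy.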

The first drafts of the Data Publishing Manuals are available for feedback

Since Darwin Core was officially ratified by Biodiversity Information Standards (TDWG) in November 2009, a few tools have been developed by GBIFS to leverage the standard data format, a.k.a. the Darwin Core Archive, to facilitate data mobilisation. These tools include the Darwin Core Archive Assistant, the GBIF Spreadsheet Processor and some validators that users can use to produce standard-compliant files for data exchange or publishing purposes. Also, the IPT has recently been upgraded to version 2 to fully support publishing of metadata, occurrence data and taxonomic data using Darwin Core Archives. Accompanying these development efforts, a suite of documents has also been prepared to instruct users on not only the usage of individual software tools, but also how to make data available within the GBIF Network. Given the tool options we have in the biodiversity information world, we organised these materials according to the kind of content that users want to publish, and present a document map for ...

Can IPT2 handle big datasets now?

One of IPT1's most serious problems was its inability to handle large datasets. For example, a dataset with only half a million records (relatively small compared to some of the biggest in the GBIF network) slowed the application down to such a degree that even the most patient users were throwing their hands up in dismay. Anyway, I wanted to see for myself whether the IPT's problems with large datasets have been overcome in the newest version: IPT2. Here's what I did to run the test: First, I connected to a MySQL database and used a “select * from … limit …” query to define my source data, totalling 24 million records (the same number of records as a large dataset coming from Sweden). Next, I mapped 17 columns to Darwin Core occurrence terms, and once this was done I was able to start the publication of a Darwin Core Archive (DwC-A). The publication took just under 50 minutes to finish, processing approximately 500,000 records per minute. Take a look at the screensho...
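The quoted throughput is easy to sanity-check: 24 million records in just under 50 minutes works out to a little over 480,000 records per minute, consistent with the "approximately 500,000" figure:

```python
records = 24_000_000   # size of the test dataset
minutes = 50           # publication took just under this
rate = records / minutes  # lower bound on records per minute, since the run was faster
```

At this rate, even the largest datasets in the network at the time would publish in an hour or two rather than stalling the application.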

The GBIF Development Team

Recently the GBIF development group has been asked to communicate more about the work being carried out in the secretariat. To quote one message: " IMHO, simply making all these discussions public via a basic mailing list could help people like me ... have a better awareness of what's going on... We could add our comments / identify possible drawbacks / make some "scalability tests"... In fact I'm really eager to participate to this process " (developer in Belgium). To kick things off, we plan to make better use of this blog and have set a target of posting 2-3 times a week. This is a technical blog, so the anticipated audience includes developers, database administrators and those interested in following the details of GBIF software development. We have always welcomed external contributors to this blog and invite any developers working on publishing content through the GBIF network, or developing tools that make use of content discoverable and accessible...