Posts

HBase Performance Evaluation continued - The Smoking Gun

Update: See also part 1 and part 3. In my last post I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class. I wasn't happy with our conclusions, which could largely be summed up as "we're not sure what's wrong, but it seems slow". So in the grand tradition of people with itches that won't go away, I kept scratching. Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the last post) using a 300M-row table built with PerformanceEvaluation sequentialWrite 300, and tested with PerformanceEvaluation scan 300. I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts. And to recap our hardware setup: we have 3 regionservers and a separate master. The first unsettling ganglia metric that kept me digging was ethernet bytes_in and bytes_out. I'll recreate those here: Figure 1 - bytes_in (MB/...
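The build and scan runs described in the excerpt can be sketched roughly as below. This assumes the patched PerformanceEvaluation jar from the earlier post is on the classpath of a node with a configured HBase client; the exact invocation may differ by HBase version.

```shell
# Rough sketch of the test run described above, assuming the patched
# PerformanceEvaluation jar from the earlier post is on the classpath.
# With this class, the numeric argument is the number of clients, each
# writing 1M rows -- so 300 clients yields the 300M-row table.

# Build the 300M-row test table
hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 300

# Scan it (the post repeats this three times, producing three bursts
# of activity in the Ganglia charts)
hbase org.apache.hadoop.hbase.PerformanceEvaluation scan 300
```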

Performance Evaluation of HBase

Update: See also followup posts: part 2 and part 3. In the last post Lars talked about setting up Ganglia for monitoring our Hadoop and HBase installations. That was in preparation for giving HBase a solid testing run to assess its suitability for hosting our index of occurrence records. One of the important features in our new Data Portal will be the "Download" function that lets people download occurrences matching some search criteria. Currently that process is a very manual and labour-intensive one, so automating it will be a big help to us. Using HBase it would be implemented as a full table scan, and that's why I've spent some time testing our scan performance. Anyone who has been down this road will probably have encountered the myriad opinions on what will improve performance (some of them conflicting), along with the seemingly endless parameters that can be tuned in a given cluster. The overall result of that kind of research is: "Yo...

Monitoring Hadoop and HBase

We're getting serious in our Hadoop adoption. The first process (our so-called "rollover") is now in production and uses Hadoop, Hive, Oozie and various other parts of the Hadoop ecosystem. Our next step is evaluating HBase and its performance on our (small and aging) cluster. To do that properly, and to fix a rather embarrassing situation, we first had to get proper monitoring up and running for our cluster. So far we've only had Cacti stats for OS-level things (CPU, I/O, etc.), but we were missing actual Hadoop statistics. So we've now set up Ganglia at GBIF, and the best news is that it's public and uses the very latest Ganglia 3.3, which was released only a few days ago in February 2012. The setup was relatively painless; Ganglia was just nice to work with. To get monitoring of HBase working we had to apply HBASE-4854, because it's not included in our Hadoop distribution (CDH3u2). Thanks to Lars George for the hint. So we can happily report that Ganglia 3...

BioCASe now producing DarwinCore Archives

Guest post from Jörg Holetschek, Botanic Garden and Botanical Museum Berlin-Dahlem. The traditional way of sharing occurrence data with GBIF has been web-service-based for years. Data publishers have used one of the existing provider software packages (DiGIR, BioCASe or TAPIR Link) to expose their data as a DiGIR-, BioCASe- or TAPIR-compliant web service. Biodiversity networks such as GBIF used harvesters to crawl and index the records published by these services, an approach that works fine for small and medium-sized datasets but runs into difficulties when record numbers hit the millions: harvesting can take days and puts a heavy load on both the publisher and the crawler. To overcome this, GBIF recently introduced DarwinCore Archives, which store all the information of a dataset to be published in a single file. Having GBIF ingest this file directly eliminates the time-consuming back-and-forth communication between data provider and harvester, speeding up the process and reducing load f...

Updating a customized IPT

This post originally appeared on the Canadensys blog and is a follow-up to the post Customizing the IPT. As mentioned at the very end of my post about customizing the IPT, I face a problem when I want to install a new version of the GBIF Integrated Publishing Toolkit: installing it will overwrite all my customized files! Luckily Tim Robertson gave me a hint on how to solve this: a shell script to reapply my customization. Here's how it works (for Mac and Linux systems only). Comparing the customized files with the default files: First of all, I need to compare my customized files with the files from the new IPT, as they might have changed to include new functionality or fix bugs. So I installed the newest version of the IPT on my localhost, opened the default files and compared them with my files. Although there are tools to compare files, I mostly did this manually. The biggest change in version 2.0.3 was the addition of localization, for which I'm using a different UI, so I h...
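The compare-then-reapply workflow described above could look roughly like this self-contained sketch. The directory names (`custom_backup`, `ipt_webapp`) and the file name (`header.ftl`) are purely illustrative, not the actual IPT layout; the real script would point at the deployed webapp directory.

```shell
# Hypothetical sketch of reapplying customizations after an IPT upgrade.
# Directory and file names here are illustrative, not the real IPT layout.
CUSTOM_DIR=custom_backup
IPT_DIR=ipt_webapp

# Simulate a fresh install (default file) alongside our saved customization
mkdir -p "$CUSTOM_DIR" "$IPT_DIR"
echo "default header" > "$IPT_DIR/header.ftl"
echo "customized header" > "$CUSTOM_DIR/header.ftl"

# First inspect what the new release changed relative to our copy
# (diff exits non-zero when files differ, so tolerate that)
diff -u "$CUSTOM_DIR/header.ftl" "$IPT_DIR/header.ftl" || true

# Then copy each customized file over the freshly installed default
for f in "$CUSTOM_DIR"/*; do
  cp "$f" "$IPT_DIR/$(basename "$f")"
done
```

The `diff -u` step is the manual comparison from the post; only once you are satisfied that the defaults have not changed in ways your customization misses should the copy loop run.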

Bug fixing in the GBIF Data Portal

Despite our current efforts to develop a new Portal v2, our current data portal at data.gbif.org has not been left unattended. Bug fixes are made periodically based on feedback sent to us by our user community. To keep our community informed, this post summarizes the most important fixes and enhancements made in the past months: The data portal's main page now shows the total number of occurrence records with coordinates, along with the total count of records (georeferenced and non-georeferenced). Decimal coordinate searches were not working properly: when a user wanted to refine their coordinate searches to use decimals, the data portal returned an erroneous count of occurrence records. The issue was fixed; details here. Any feedback e-mail message sent from an occurrence or a taxon page now includes the original sender's email address in the CC field. Previously the sender's email address was not included in the feedback email, which represen...

Important Quality Boost for GBIF Data Portal

Improvements speed up processing, “clean” name and location data, and enable checklist publishing. [This is a reposting from the GBIF news site.] A major upgrade to enhance the quality and usability of data accessible through the GBIF Data Portal has gone live. The enhancements are the result of a year’s work by developers at the Copenhagen-based GBIF Secretariat, in collaboration with colleagues throughout the worldwide network. They respond to a range of issues, including the need for quicker ‘turnaround’ time between entering new data and their appearance on the portal; filtering out inaccurate or incorrect locations and names for species occurrences; and enabling species checklists to be indexed as datasets accessible through the portal. After a testing period, the changes now apply to the more than 312 million biodiversity data records currently indexed from some 8,500 datasets and 340 publishers worldwide. Key improvements include: • processing time for data has fallen f...