Open Source Data Warehousing?
The economic downturn is causing an increasing interest in using open source (OS) solutions for BI. One of my previous blogposts already raised the issue of the missing pieces in open source analytical databases, but nevertheless more and more companies are using OS databases for data warehouse purposes. In fact, recent Gartner research indicated that almost 18% of the surveyed organizations are using MySQL as a data warehouse database. This is a bit strange, because when we look at all the options available in commercial general purpose databases to support data warehousing, almost none of these are part of MySQL. I´m talking about partitioning, bitmapped indexes, materialized views, parallel loading and query processing, support for SQL Windowing functions, etc. Even the OS databases that have been designed for BI purposes lack most of these options. The diagram below shows what I mean. It lists the DWH/performance options for the most used OS databases, including the perhaps less well known column oriented LucidDB and MonetDB.
So a lot of sad faces, especially when you look at materialized views and bitmap indexes. Note that compression and MPP support is not yet included here, and the question is whether this would make a big difference. None of these products have compression or MPP capabalities; only EnterpriseDB´s GridSQL (based on PostgreSQL) is an OS MPP alternative. The biggest question however is: does it really matter? Can these products deliver a decent performance for BI like applications regardless of all the missing functionality? The answer might surprise you, and though I only ran part of the TPC-H benchmark (which measures query performance in typical BI scenarios), the results speak for themselves:
Please note that this was only a small dataset (2GB) and all installations were 'plain vanilla' i.e. no tuning, tweaking etc. The machine was a Windows 2003 server running on a single Athlon 64 3500+ processor with 2GB ram and a single 7200 rpm harddrive. The results are however in sync with other TPC-H tests run by MonetDB and LucidDB, also on bigger data sets. What might be surprising is the big difference between MySQL 5.0 and 5.1 which can be easily explained: 5.0 used the InnoDB storage engine, 5.1 used MyIsam which is better suited (out of the box that is!) for typical DWH workloads. And yes, MySQL, LucidDB and MonetDB scored (a lot) better than product 'X' of which I cannot reveal the name due to legal constraints.
The conclusion here is obvious: do not rule out OS DB´s for data warehousing just because their functionality doesn´t look very good or incomplete on paper. First run your own benchmarks and draw your own conclusions. You may find that OS DB's prove to be an adequate solution in your situation which can potentially save a lot of money.
ps. I got Infobright up and running now, so expect some initial benchmark results here shortly. I´m also interested in running a benchmark with GridSQL on EC2 but I need a little help with that. If you´re interested, just send an email to Jos at tholis.com
InfoBright update 2009-02-09
I loaded the 1 GB TPC-H dataset and started with query 9 which was by far the slowest in my previous benchmark. Back then it took MonetDB 7.4 seconds, LucidDB 73 secs and MySQL 5.1 2447 secs on a 2GB dataset on my old computer. Well, it's taking InfoBright 128 seconds on a 1GB dataset on an 8 GB Quadcore machine. So roughly double the hardware speed, half the data volume and it's still blown out of the water by both LucidDB and MonetDB. Indeed a lot faster than standard MySQL with MyIsam but pretty dissapointing when compared to the other OS column stores. Other dissappointing fact: a lot of the TPC-H queries use sum(expression) statements, not just simple sum(column) selects. Guess what? IB doesn't support sum(expression) queries yet. They're working on that feature right now, so until then we need to rewrite the statements using subqueries (doing the calcs in the subquery, then sum them in the main query). Don't know whether TPC rules allow this, but at least I got Q1 running and returning within 1.19 secs. And yes, that is fast. In fact, it's exactly what MonetDB reported in their sf 1 results. Next steps: load the SF10 dataset, rewrite the queries and compare them to the posted MonetDB results. Is that comparable then? Yep. MonetDB used a Q6600 with 8GB ram and 2*500 GB disks running Fedora. I'm using a Q6600 with 8GB ram and 4*750 disks (raid5) running Ubuntu. So that's close enough. I'll post the results in my next blog.
Topic: Open Source Data Warehousing?
Virtuoso Open Source in comparison?
As far as I know, VOS features:
Analytical fct: no
Materialized views: yes
Bitmap index: yes
Full-text index/search: yes
Clustering: no (only in the licensed edition)
Replication: no (only in the licensed edition)
Re: Virtuoso Open Source in comparison?
Good writeup - update?
That said, would it be possible to update the table, as it is now 10+ months old?
LucidDB has a Pentaho bulk load utility now too:
It's currently a plugin, but it's being integrated into PDI as a standard step for an upcoming release.
It looks like LucidDB has improved quite a bit from when I tried it out.
I also wanted to point out that things will change a lot once the data no longer fits into memory. In tests I ran with a tuned PostgreSQL database where the data was a lot more than memory, PostgreSQL outperformed MonetDB, which ran for hours.
It should do even better today with the 8.4 development HEAD of PostgreSQL, which significantly improves TPC-H performance for a couple of the really slow ones (I think one was NOT IN).