Open Source Data Warehousing?

08/02/2009 12:37

The economic downturn is driving growing interest in open source (OS) solutions for BI. One of my previous blog posts already raised the issue of the missing pieces in open source analytical databases, but nevertheless more and more companies are using OS databases for data warehouse purposes. In fact, recent Gartner research indicated that almost 18% of the surveyed organizations are using MySQL as a data warehouse database. This is a bit strange, because when we look at all the options available in commercial general-purpose databases to support data warehousing, almost none of these are part of MySQL. I'm talking about partitioning, bitmap indexes, materialized views, parallel loading and query processing, support for SQL windowing functions, etc. Even the OS databases that have been designed for BI purposes lack most of these options. The diagram below shows what I mean. It lists the DWH/performance options for the most used OS databases, including the perhaps less well-known column-oriented LucidDB and MonetDB.

So a lot of sad faces, especially when you look at materialized views and bitmap indexes. Note that compression and MPP support are not yet included here, and the question is whether this would make a big difference. None of these products have compression or MPP capabilities; only EnterpriseDB's GridSQL (based on PostgreSQL) is an OS MPP alternative. The biggest question, however, is: does it really matter? Can these products deliver decent performance for BI-like applications regardless of all the missing functionality? The answer might surprise you, and though I only ran part of the TPC-H benchmark (which measures query performance in typical BI scenarios), the results speak for themselves:

Please note that this was only a small dataset (2 GB) and all installations were 'plain vanilla', i.e. no tuning, tweaking etc. The machine was a Windows 2003 server running on a single Athlon 64 3500+ processor with 2 GB RAM and a single 7200 rpm hard drive. The results are, however, in sync with other TPC-H tests run by MonetDB and LucidDB, also on bigger data sets. What might be surprising is the big difference between MySQL 5.0 and 5.1, which can be easily explained: 5.0 used the InnoDB storage engine, 5.1 used MyISAM, which is better suited (out of the box, that is!) for typical DWH workloads. And yes, MySQL, LucidDB and MonetDB scored (a lot) better than product 'X', of which I cannot reveal the name due to legal constraints.

The conclusion here is obvious: do not rule out OS DBs for data warehousing just because their functionality looks weak or incomplete on paper. First run your own benchmarks and draw your own conclusions. You may find that OS DBs prove to be an adequate solution in your situation, which can potentially save a lot of money.
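If you want to run your own numbers, a minimal timing harness could look something like the sketch below. It is not the setup I used for the results above: it uses Python with an in-memory SQLite database as a stand-in, and the `time_query` helper, the tiny `lineitem` sample rows and the two queries are all made up for illustration. For a real test you would swap in a connection to the database you are evaluating and the actual TPC-H queries.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, runs=3):
    """Run `sql` `runs` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Fetch all rows so the full cost of the query is measured.
        conn.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in data: a few invented rows shaped like a slice of TPC-H lineitem.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_returnflag TEXT, l_quantity REAL, l_extendedprice REAL)"
)
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?)",
    [("A", 17, 21168.23), ("N", 36, 45983.16), ("R", 8, 13309.60), ("N", 28, 28955.64)],
)

# Hypothetical query set; replace with the queries you care about.
queries = {
    "count_rows": "SELECT COUNT(*) FROM lineitem",
    "qty_by_flag": "SELECT l_returnflag, SUM(l_quantity) FROM lineitem GROUP BY l_returnflag",
}

results = {name: time_query(conn, sql) for name, sql in queries.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the median of a few runs is a deliberate choice here: it dampens the effect of a cold cache on the first run without hiding it completely.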

P.S. I got Infobright up and running now, so expect some initial benchmark results here shortly. I'm also interested in running a benchmark with GridSQL on EC2, but I need a little help with that. If you're interested, just send an email to Jos at tholis.com

Infobright update 2009-02-09

I loaded the 1 GB TPC-H dataset and started with query 9, which was by far the slowest in my previous benchmark. Back then it took MonetDB 7.4 seconds, LucidDB 73 seconds and MySQL 5.1 2447 seconds on a 2 GB dataset on my old computer. Well, it's taking Infobright 128 seconds on a 1 GB dataset on an 8 GB quad-core machine. So roughly double the hardware speed, half the data volume, and it's still blown out of the water by both LucidDB and MonetDB. Indeed a lot faster than standard MySQL with MyISAM, but pretty disappointing when compared to the other OS column stores.

Another disappointing fact: a lot of the TPC-H queries use sum(expression) statements, not just simple sum(column) selects. Guess what? IB doesn't support sum(expression) queries yet. They're working on that feature right now, so until then we need to rewrite the statements using subqueries (doing the calcs in the subquery, then summing them in the main query). I don't know whether TPC rules allow this, but at least I got Q1 running and returning within 1.19 secs. And yes, that is fast. In fact, it's exactly what MonetDB reported in their SF1 results.

Next steps: load the SF10 dataset, rewrite the queries and compare them to the posted MonetDB results. Is that comparable then? Yep. MonetDB used a Q6600 with 8 GB RAM and 2*500 GB disks running Fedora. I'm using a Q6600 with 8 GB RAM and 4*750 disks (RAID 5) running Ubuntu. So that's close enough. I'll post the results in my next blog.
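To make the sum(expression) workaround concrete, here is a small sketch of the idea: compute the expression per row in a subquery, then apply a plain sum(column) in the outer query. It runs against an in-memory SQLite database from Python purely as a stand-in (SQLite handles both forms, so it can only show that the two are equivalent, not how Infobright behaves). The column names follow TPC-H's lineitem table; the four rows and the `priced` and `disc_price` aliases are invented, and this is not the exact rewrite I ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem ("
    "l_returnflag TEXT, l_extendedprice REAL, l_discount REAL)"
)
# Invented sample rows, chosen so the arithmetic is exact in floating point.
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?)",
    [("A", 100.0, 0.25), ("A", 200.0, 0.0), ("N", 50.0, 0.5), ("N", 80.0, 0.75)],
)

# Original form: the aggregate wraps an expression, as in TPC-H Q1.
direct = conn.execute(
    """
    SELECT l_returnflag,
           SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price
    FROM lineitem
    GROUP BY l_returnflag
    ORDER BY l_returnflag
    """
).fetchall()

# Rewritten form: the subquery does the per-row calculation,
# so the outer query only needs a simple SUM over one column.
rewritten = conn.execute(
    """
    SELECT l_returnflag,
           SUM(disc_price) AS sum_disc_price
    FROM (SELECT l_returnflag,
                 l_extendedprice * (1 - l_discount) AS disc_price
          FROM lineitem) AS priced
    GROUP BY l_returnflag
    ORDER BY l_returnflag
    """
).fetchall()

print(direct)
print(rewritten)
assert direct == rewritten
```

The rewrite changes only where the expression is evaluated, not what is summed, which is why both queries return the same totals per return flag.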

