Sizing up MySQL

Having gotten my hands on a copy of the AOL Search data, I decided to try my hand at a bit of data mining. Of course, the first thing that jumped to my mind was to import the data into a RDBMS of some sort. I have immediate access to two -- MySQL and Microsoft SQL Server. Both are installed on similarly-specd machines.

It never surprises me how willing people are to tell you their (obviously uninformed) opinions about Microsoft Products. Certain people will completely disregard a Microsoft product, and select a far inferior product simply because it is "open source". Take Microsoft's SQL Server vs. MySQL for example.

In many situations, MySQL is a great database server. If you're dealing with very small datasets (i.e. less than 100MB), in situations where there are very few modifications to the datasets (e.g., serving dynamic web pages, with no updates to the database), MySQL is fine. Anything more, and it will start to show its weaknesses.

For my example, I'm using one tenth (212MB) of the actual AOL dataset (about 2GB). Even in this range, MySQL shows that it's clearly not suitable for dealing with more than 100MB of data. This dataset has 3,558,411 rows, in 5 columns (AnonID, Query, QueryTime, ItemRank, ClickUrl). The table schema I'm using has an index on AnonId, Query, QueryTime and ClickUrl.

Importing data, with indexes
MySQL: 36 minutes ("LOAD DATA INFILE")
MS SQL: 3 minutes (Data Transformation Services)

MySQL: 13 mins 34 secs
Shell: 1 min 24 secs
MS SQL: 32 seconds

For comparison, I tried obtaining the same results using the shell commands cut, sort and uniq (marked above as 'Shell'). Even for this fairly basic query, on a comparitively small dataset, MySQL is performing over 25 times slower than SQL Server, and over 9 times slower than using basic shell tools. I daren't think how it copes with the full 2GB set.

The conclusion? If you need a RDBMS for any real work, look past MySQL. You'll be regretting it for years if you're stuck with it.