A tutorial with kdb+: the hidden secret for big data at speed
This will be a quick lesson on downloading kdb+ and running a few queries in it. Kdb+ is a lightweight database (seriously, the binary download is 300 KB) used where performance over vast amounts of data is critical. From quant firms trading financial securities to F1 teams streaming telemetry data from their race cars, it shows up in a variety of highly demanding real-world applications.
We will go through the quick process of setting up kdb+, seeding a sample database with 10 million rows, and querying our dummy data. We will also use a built-in command to time our queries, which should make it evident just how snappy kdb+ can be.
Kdb+ is queried with a proprietary language called q, and the syntax should be fairly familiar to anybody who has worked with SQL before.
Let's get it installed
If you head on over to https://kx.com/download/, you will be prompted to download the 32-bit version of kdb+, which, at the time of this blog post, is free. Fill out a few details about yourself and pick your OS.
Keep in mind that since we are using the free 32-bit version, the database is limited to 4 GB of memory. Once you have downloaded the binary, follow the commands below to open an interactive console session with q, where we will create an in-memory database.
cd ~/Downloads
cp -r q ~/.
q/m32/q
After copying your install to your home directory, you may want to update your bash profile so you can call q more easily. The commands above should have opened kdb+ as follows:
KDB+ 3.6 2018.05.17 Copyright (C) 1993-2018 Kx Systems
m32/ 8()core 8192MB mdzingle michaels-macbook-pro-2.local 192.168.9.103 NONEXPIRE

Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)
Getting familiar with the syntax
The q language, which is built on the programming language K, is an array programming language. Let's dive into some common operations!
q)"Hello world!"
"Hello world!"
q) 5*5
25
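Since q is an array language, the same arithmetic also works on whole lists at once. Here is a minimal sketch of that idea; the names a and b are just placeholders for this example, and the lines are written as a small script, so show is used to print each result.

```q
/ a and b are hypothetical example lists
a:1 2 3
b:10 20 30
show a+b     / element-wise addition: 11 22 33
show 2*a     / a single number is applied to every element: 2 4 6
show sum a   / aggregate over the list: 6
```

No loops are needed, which is a large part of why q code tends to be so short.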
Division is a little different: it uses the percent sign (%), which most other programming languages use for the remainder (modulo) operation. The result is always a float, which q marks explicitly with a trailing f.
q) 6%2
3f
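If you actually want integer division or a remainder, q spells those out as the keywords div and mod. A quick sketch contrasting the three, written as a script with show to print each result:

```q
show 7%2       / float division: 3.5
show 7 div 2   / integer division: 3
show 7 mod 2   / remainder: 1
```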
Order of operations
q)til 25
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
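One thing worth knowing under this heading: q has no operator precedence. Expressions are evaluated right to left, so multiplication does not bind tighter than addition the way it does in most languages. The sketch below shows the effect, along with til, which returns the integers from 0 up to one less than its argument.

```q
show 2*3+4     / 3+4 is evaluated first, then multiplied by 2: 14
show (2*3)+4   / parentheses force the multiplication first: 10
show 2*til 5   / til 5 is 0 1 2 3 4, so this prints 0 2 4 6 8
```

When in doubt, add parentheses; they behave exactly as you would expect.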
Creating a database in memory
Below we will create a database for a fictitious outdoor advertising company in the United States. It will hold information about the company's various advertising options in multiple cities, totaling ten million rows. Much more than that and we start to run into the memory limits of the 32-bit architecture; the 64-bit version of kdb+, on the other hand, wouldn't even break a sweat with a dataset this small.
q)n:10000000 / number of rows
q)item:`billboard`bench`busstop`taxi;
q)city:`newyork`losangeles`chicago`miami`nashville;
q)table:([]time:asc n?0D0;n?item;amount:n?100;n?city); / ([] ...) builds a table
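To see what each piece of that table definition does, it can help to build a tiny version first. The sketch below assumes a hypothetical five-row table called small: n?item picks n random values from the list, n?100 picks n random integers below 100, n?0D0 generates n random timespans, and asc sorts them.

```q
n:5
item:`billboard`bench`busstop`taxi
/ small is a hypothetical name used only for this example
small:([] time:asc n?0D0; item:n?item; amount:n?100)
show count small   / number of rows: 5
show cols small    / column names: `time`item`amount
show meta small    / column types
show small
```

Because the values are random, the rows will differ on every run, but the shape of the table stays the same.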
Querying your database
q)select from table where item = `billboard
time                 item      amount city
------------------------------------------------
0D00:00:00.015509873 billboard 25     nashville
0D00:00:00.029128789 billboard 31     losangeles
0D00:00:00.046589970 billboard 97     chicago
0D00:00:00.146770477 billboard 37     nashville
0D00:00:00.178091973 billboard 51     newyork
0D00:00:00.206637382 billboard 80     newyork
0D00:00:00.304061919 billboard 31     nashville
0D00:00:00.334337353 billboard 33     nashville
0D00:00:00.336992740 billboard 24     nashville
0D00:00:00.364009290 billboard 51     miami
0D00:00:00.404664874 billboard 71     newyork
0D00:00:00.454855710 billboard 9      newyork
0D00:00:00.502994656 billboard 6      nashville
0D00:00:00.505851209 billboard 20     losangeles
0D00:00:00.518806278 billboard 15     miami
0D00:00:00.573040544 billboard 14     nashville
0D00:00:00.694786012 billboard 10     losangeles
0D00:00:00.714580714 billboard 87     chicago
0D00:00:00.729406625 billboard 96     nashville
0D00:00:00.772013515 billboard 11     newyork
..
q)select sum amount by city from table
city      | amount
----------| --------
chicago   | 99067942
losangeles| 98960981
miami     | 99083812
nashville | 99067080
newyork   | 98951392
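The same select syntax extends to multiple filters and multiple aggregates. Here is a small sketch on a hypothetical four-row table named ads, so the results are easy to check by hand; the column names total and n are also just made up for this example.

```q
/ ads is a hypothetical hand-written table, not the 10 million row one
ads:([] item:`billboard`bench`billboard`taxi; amount:10 20 30 40; city:`miami`miami`chicago`miami)
/ several where conditions are separated by commas
show select from ads where item=`billboard, city=`miami
/ several aggregates per group; i is the implicit row index
show select total:sum amount, n:count i by city from ads
/ filter first, then group
show select avg amount by item from ads where city=`miami
```

The first query should return the single miami billboard row, and the second should report a total of 30 for chicago and 70 for miami.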
How fast are we talking?
If you have been following along, you may have noticed the console is extremely responsive for how large our randomly sampled database is. Just how fast is it though? Kdb+ has a simple command, \t, that runs a query and tells you in milliseconds how long it took. We will use it on the last query, which sums the amount for each city:
q)\t select sum amount by city from table
133
Only 133 milliseconds to sum up our results for 10 million rows. Fairly impressive numbers.
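A single timing can be noisy, so it may be worth repeating the query. As a rough sketch, \t:n runs an expression n times and reports the total time, and \ts reports both time in milliseconds and memory used in bytes. The table t below is a hypothetical one-million-row stand-in; the \t lines are meant to be entered as you would at the q) prompt, and the numbers you get will depend on your machine.

```q
/ t is a hypothetical smaller table for timing experiments
t:([] city:1000000?`newyork`chicago`miami; amount:1000000?100)
/ time a single run, in milliseconds
\t select sum amount by city from t
/ total time for 100 runs; divide by 100 for an average
\t:100 select sum amount by city from t
/ time in milliseconds and space in bytes
\ts select sum amount by city from t
```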
Where to go next
This was just a tutorial to dip your toes in the subject and acquaint yourself with some of the impressive benchmarks that can be achieved with this software. To take these ideas further, you may be interested in working with kdb+ from disk. Kx, the company that owns kdb+, has some decent articles detailing more of the capabilities of this system, and I advise anybody interested in low-latency and sequence-based data to check them out.
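As a first taste of working from disk, here is a minimal sketch that writes a table to a file and reads it back. The table t and the file name ads_table are assumptions made up for this example, and this only covers the simplest case of saving a whole table as a single file.

```q
/ t and the file name ads_table are hypothetical
t:([] city:`miami`chicago; amount:10 20)
/ write the table to a file in the current directory
`:ads_table set t
/ read it back into a new variable
t2:get `:ads_table
show t~t2   / 1b if the round trip preserved the table
```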