
A tutorial with kdb+: the hidden secret for big data at speed

August 1st, 2018 kdb-database

This is a quick lesson on downloading kdb+ and running a few queries in it. Kdb+ is a lightweight database (seriously, the binary download is 300 KB) used where performance over vast amounts of data is critical. From quant firms trading financial securities to F1 teams streaming telemetry from their race cars, it powers a range of highly demanding real-world applications.

We will go through the quick process of setting up kdb+, seeding a sample database with 10 million rows, and querying our dummy data. We will also use a built-in command to time our queries, which should make it evident just how snappy kdb+ can be.

Kdb+ is driven by a proprietary query language called q, and its query syntax should look fairly familiar to anybody who has worked with SQL before.

Let's get it installed

If you head over to https://kx.com/download/, you will be prompted to download the 32 bit version of kdb+, which at the time of this blog post is free. Fill out a few details about yourself and pick your OS.

Keep in mind that since we are using the free 32 bit version, the database is limited to 4 GB of memory. Once you have downloaded the binary, follow the code samples below to open an interactive console session with q and create an in-memory database.


cd ~/Downloads
cp -r q ~/.
q/m32/q

After copying your install to your home directory, you may want to update your bash profile so you can call q more easily. The above command should have opened kdb+ as follows:


KDB+ 3.6 2018.05.17 Copyright (C) 1993-2018 Kx Systems
m32/ 8()core 8192MB mdzingle michaels-macbook-pro-2.local 192.168.9.103 NONEXPIRE  

Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)

Getting familiar with the syntax

The q language, which is built on the programming language k, is an array programming language. Let's dive into some common operations!


q)"Hello world!"
"Hello world!"

Addition

q)5+5
10

Multiplication

q)5*5
25

Division

Division is a little different: q uses the % character, which most other programming languages reserve for modulo (finding remainders). The result is always returned as a float, which q marks explicitly with a trailing f.

q)6%2
3f
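
Since % is taken by division, you may wonder how to get a whole-number quotient or a remainder. A minimal sketch using the built-in div and mod keywords:

```q
q)7%2              / % always returns a float
3.5
q)7 div 2          / integer quotient
3
q)7 mod 2          / remainder
1
```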

Order of operations

q)(6+2)%3
2.666667
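
The parentheses in the example above matter more than they would in most languages. q has no operator precedence: an expression is evaluated from right to left, so multiplication does not bind tighter than addition. A short illustration:

```q
q)2*3+4            / right to left: 3+4 first, then 2*7
14
q)(2*3)+4          / parentheses force the multiplication first
10
```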

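Looping

q rarely needs an explicit loop. Instead, til generates the integers from 0 up to one less than its argument as a list that other operations can work on all at once: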

q)til 25
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
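
Because q is an array language, most work that would be a loop elsewhere is expressed as an operation on a whole list. A small sketch of what that looks like with til (the lambda {x*x} is just an illustrative function):

```q
q)2*til 5                / arithmetic applies to every element
0 2 4 6 8
q)sum til 25             / aggregate over the whole list
300
q){x*x} each til 5       / each applies a function item by item
0 1 4 9 16
```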

Creating a database in memory

Below we will create a database for a fictitious outdoor advertising company in the United States. It will hold information about the company's advertising options in multiple cities, totaling ten million rows. Much more than that and we start to run into the memory limits of the 32 bit architecture. Kdb+ itself, on the other hand, wouldn't even break a sweat with a dataset this small.


q)n:10000000
q)item:`billboard`bench`busstop`taxi;
q)city:`newyork`losangeles`chicago`miami`nashville;
q)table:([]time:asc n?0D0;n?item;amount:n?100;n?city);
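
The workhorse in the snippet above is the ? operator: with an integer on the left, it draws that many random values, either below a number or from a list. The table columns for item and city take their names from the variables they are built from. Here is a quick sketch for poking at the pieces and sanity checking the result (the random draws will differ on every run, so no output is shown for them):

```q
q)5?100                              / five random integers below 100
q)5?`billboard`bench`busstop`taxi    / five random picks from a symbol list
q)count table                        / number of rows
10000000
q)meta table                         / column names and type codes
q)3#table                            / first three rows
```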

Querying your database


q)select from table where item = `billboard
time                 item      amount city      
------------------------------------------------
0D00:00:00.015509873 billboard 25     nashville 
0D00:00:00.029128789 billboard 31     losangeles
0D00:00:00.046589970 billboard 97     chicago   
0D00:00:00.146770477 billboard 37     nashville 
0D00:00:00.178091973 billboard 51     newyork   
0D00:00:00.206637382 billboard 80     newyork   
0D00:00:00.304061919 billboard 31     nashville 
0D00:00:00.334337353 billboard 33     nashville 
0D00:00:00.336992740 billboard 24     nashville 
0D00:00:00.364009290 billboard 51     miami     
0D00:00:00.404664874 billboard 71     newyork   
0D00:00:00.454855710 billboard 9      newyork   
0D00:00:00.502994656 billboard 6      nashville 
0D00:00:00.505851209 billboard 20     losangeles
0D00:00:00.518806278 billboard 15     miami     
0D00:00:00.573040544 billboard 14     nashville 
0D00:00:00.694786012 billboard 10     losangeles
0D00:00:00.714580714 billboard 87     chicago   
0D00:00:00.729406625 billboard 96     nashville 
0D00:00:00.772013515 billboard 11     newyork   
..

q)select sum amount by city from table
city      | amount  
----------| --------
chicago   | 99067942
losangeles| 98960981
miami     | 99083812
nashville | 99067080
newyork   | 98951392
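
The same select, by and where building blocks combine in the ways you would expect from SQL. A few more queries you might try against the same table, offered as a sketch rather than a full tour of the query syntax (results depend on the random data, so none are shown):

```q
q)select avg amount by item from table where city=`chicago
q)select count i by item from table                      / row count per item
q)select from table where item=`billboard, amount>90     / comma separates conditions
```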

How fast are we talking?

If you have been following along, you may have noticed the console is extremely responsive for how large our randomly generated database is. Just how fast is it though? Kdb+ has a simple system command, \t, that runs a query and tells you in milliseconds how long it took. We will run it on the last query, which sums the amounts for each city:


q)\t select sum amount by city from table
133

Only 133 milliseconds to sum up our results for 10 million rows. Fairly impressive numbers.
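
A single run can be noisy, so it may be worth repeating the measurement. As a sketch, \t:100 runs the query 100 times and reports the total elapsed milliseconds, and \ts reports both the elapsed milliseconds and the bytes of memory used. Your numbers will vary with hardware, so none are shown here:

```q
q)\t:100 select sum amount by city from table
q)\ts select sum amount by city from table
```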

Where to go next

This was just a tutorial to dip your toes in the subject and acquaint yourself with some of the impressive benchmarks that can be achieved with this software. To take these ideas further, you may be interested in working with kdb+ from disk. Kx, the company that owns kdb+, has some decent articles detailing more of the capabilities of this system, and I advise anybody interested in low latency and sequence based data to check them out.
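
As a first taste of working from disk, here is a minimal sketch that writes a small slice of the table to a file and reads it back. The file name sample is just an illustrative choice, and this is only the simplest way to persist a table, not the layout you would pick for a large production dataset:

```q
q)`:sample set 1000#table     / write the first 1000 rows to a file named sample
`:sample
q)sample:get `:sample         / read it back into memory
q)count sample
1000
```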
