This is being discussed over at Hacker News, check it out.
In developing software for large data sets (billions of records, terabytes in size) the way you store your data in memory is critical – and you want your data in memory if you want to be able to analyse it quickly (e.g. minutes not days).
Any data structure that relies on pointers for each data element quickly becomes unworkable due to the overhead of pointers. On a 64 bit system, with one pointer for each data element across a billion records you have just blown near 8GB of memory just in pointers.
Thus there is a need for compact data structures that still have fast access characteristics.
The challenge is to come up with the fastest data structure that meets the following requirements:
Use less memory than an array in all circumstances
Fast Seek is more important than Fast Access
Seek and Access must be better than O(N).
Where Seek and Access are defined as:
Access (int index): Return me the value at the specified index ( like array[idx] ).
Seek (int value): Return me all the Indexes that match value.
(The actual return type of Seek is a little different, but logically the same. What we need to return is a bitmap where a bit set to 1 at position X means that value was found at index X. This allows us to combine results using logical ANDs rather than intersections as detailed here)
The first solution which we compare all others too is the humble integer array.
Memory: 4 bytes for each integer, making space complexity O(N).
Access: Nice and fast, O(1).
Seek: Awful! We need to check every item in the list and is O(N). Not good enough!
This is a type of “succinct data structure”, which is pretty incredible. It actually blew my mind when I first read about it here.
The structure computes a list of all the available keys (distinct values, ordered of length K), then stores a bit array that is N bits long (the number of values). It then needs Log_{2}(K) such bit arrays, which I will refer to as rows. Essentially the bitmap is a representation of a binary search tree.
The data structure takes a bit of work to get your head around, so I will explain it with an example. Say we have the data:
3 2 4 6 2 6 (Keys: 2, 3, 4, 6)
If we store the keys, we can re-write our data as:
1 0 2 3 0 3
Simply by storing the indexes to our keys array (our first element in our data is 3, which can be found at keys[1]). Easy yeah?
Now we build out first bitmap row. Looking at our new data set, our minimum is 0, our maximum is 3, so if we do (max – min) / 2 we get 1, which we will use as our middle value. So to write out first row, we will write a 0 if the value in our index is less than or equal to our middle value (in this case 1) and 1 if it’s greater.
This means our first row is:
0 0 1 1 0 1
For our next row we group all our ones and zeroes together, such that we have a) and b), where a) is made up of all the zeroes and b) is made up of all the ones.
Group a) is now made up of the following values 1 0 1 which we use to repeat the same process as the first row. We now use the “middle” value from the last row as our “maximum”, and keep the same minimum. We then calculate our new “middle” (max – min) / 2 = (0 – 1) / 2 = 0. This means for Group a) we need to write a zero for each 0 and a one for each 1 (that’s worked out nicely hasn’t it!). Therefore the bitmap for Group a) is:
1 0 1
Group b) is made up of the following values: 2 2 3, so we can repeat the same process as above. Here our max is 3, min is 2 and our middle value is 2. This means our final bitmap for Group b) is:
0 0 1
Meaning our representation of the 6 values is now:
0 0 1 1 0 1 (first row)
1 0 1 0 0 1 (Group a + Group b)
Awesome, our 6 values are now represented by 4 integers and 12 bits. You can see how for each additional value we add (provided we don’t add more keys) only adds a pair of bits to our storage size, making this structure very efficient.
Access(index)
Accessing these structures is a little more complex than accessing an array. Essentially we walk down the “rows” checking whether the bit at the corresponding index is set. The key here is that at each level our index needs to be modified to compensate for the fact we have “grouped” values at each level.
So if we are looking for the value at index 3 (which we know to be 2), we first inspect the bit at index 3 of row 0.
0 0 1 1 0 1
Since this is a 1, our index needs to change to be the first index of group B. This is done by counting the number of Zeroes in the first complete row, which gives us the point where our group starts. By then also tracking how many “1” bits are set before my index, I can adjust the index to check for the next row.
The bad news is that this means counting a lot of bits. The good news is than Nehalem processors have an SSE4.2 instruction that enables us to count 0’s and 1’s VERY quickly (popcnt). The bad news is that the indexes are not necessarily aligned to words, so additional work is required in aligning bitmaps to word boundaries (which I have not done).
Seek(value)
Seek is much more complex, but works by traversing our keys list as if it was a binary search tree. Here we first take the middle value of our keys, and compare it to our value. If our value is larger than the middle value, our bitmask tells us which values are also larger. For example you can think of the first row as telling us which numbers are bigger than 1 (where 1 is set), and which are smaller (where 0 is set).
The goal of the seek to return a bitmap the same length as our rows, with a 1 set wherever the value matches the corresponding index.
So if the value we are searching for is larger than the middle value, our temporary mask becomes the first row. If it was smaller, then we flip the bits on the first row.
We are then directed to either group a) or group b) depending on the result of the first comparison (group a if its smaller, or group b if its larger. At this point we find out what the “middle” value is for this group and compare it to the value we are seeking. If our seek value is larger than this value, we can just use bits as is. If its smaller or equal then we need to invert it. We then use this as a “group mask” which we use to refine our temporary mask.
The refine operation simply looks at all the 1 bits in our temporary mask, and works out what number bit it is in the mask if there were no 0’s present. (thus in “0 0 1 0 1 0 ”, the index of the first 1 would become 0, and second 1 would be 1). We then use the “group mask”, which tells us which of those set bits need to be set to one. Take the following example:
0 0 1 1 0 1 (temp mask)
0 1 0 (group mask – note it is 3 bits long the same as the number of 1’s in temp mask)
0 0 0 1 0 0 (result, the first and third ones are set to 0).
I have implemented the code to do this, but it is far from efficient (if you like making C code fast, this may be a good project for you!). In my implementation I also store the “rank indexes” (the start of each group) so as to prevent the entire bitmap from needing to be scanned all the time, which makes it faster but use more memory.
Hybrid (tmap)
Given the complexity of Wavelet tree’s and its disappointing performance, I developed a hybrid, which stores values in the same way as a wavelet tree, but does not group values. This makes accesses much faster as the index doesn’t change. It also makes seeks much faster than an array, as we just do a logical AND down the rows. Since our bitmaps are implemented in 64bit integers, and we have only Log_{2}(K) rows, we can test (64 / (Log2(K) -1)) values per operation.
So how do they perform?
Size
Values | array | wtree | tmap |
1,000 | 4,016 | 9,016 | 3,956 |
10,000 | 40,016 | 23,180 | 16,680 |
100,000 | 400,016 | 135,560 | 129,160 |
1,000,000 | 4,000,016 | 1,260,700 | 1,254,200 |
10,000,000 | 40,000,016 | 12,510,700 | 12,504,200 |
Access Performance
Values | array | wtree | tmap |
1,000 | 1 | 3,956 | 2 |
10,000 | 1 | 16,680 | 2 |
100,000 | 1 | 129,160 | 10 |
1,000,000 | 1 | 1,254,200 | 100 |
10,000,000 | 30 | 12,504,200 | 1,160 |
Seek Performance
Values | array | wtree | tmap |
1,000 | 2 | 20 | 1 |
10,000 | 2 | 260 | 1 |
100,000 | 100 | 2,630 | 40 |
1,000,000 | 940 | 26,430 | 460 |
10,000,000 | 10,520 | 278,500 | 8,420 |
There is no doubt that on paper, the Wavelet tree structure should perform the best out of all the structures, but as we can see here, its performance is AWFUL. I have no doubt that there is plenty of room for optimization (particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.
So the challenge remains open. Can you make a faster, smaller array than those described here?
You can view the source code for this project at https://github.com/siganakis/