This post is a second part of Ultimate Sets and Maps for Java. In the first part, we discussed memory-efficient implementations of sets and maps. These data structures efficiently support *contains(key)** *operation. In this part of the article, we discuss more advanced querying:

- How to efficiently test that a collection of items meets a filtering criteria
*contains(key1) AND contains(key2) AND …. AND contains (keyN)* - How to efficiently test that a collection of items meets a filtering criteria (
*contains(key11) OR contains(key12) OR … ) AND (**contains(key21) OR contains(key22) OR … ) AND …* - How to efficiently eliminate duplicates from a collection of items?

# Querying Sets

In this section, we discuss two structures: Bloom filter and Reflector. These structures are useful when one has a number of collections (e.g. documents) and need to select all collections that meet a filtering criteria which is a boolean expression over *contains(key)* predicates.

The Bloom filter can be used for filters like *contains(key1) AND contains(key2) AND …. AND contains (keyN)*, i.e. for containsAll() operation.

Reflector is a structure that designed for filters like ( *contains(key11) OR contains(key12) OR … ) AND ( contains(key21) OR contains(key22) OR … ) AND …*

## Bloom Filter

The Bloom filter is a probabilistic data structure that is used in some caching systems and databases. We briefly describe the main properties of the Bloom filter, more information can be found on Wikipedia and in Mitzenmacher’s Probability and Computing .

- The nature of the Bloom filter is close to the hash table. For any element it generates a hash – a
*signature*– which is simply an array of bits. Signature of a set of elements is a bitwise OR of signatures of elements. - Let us have a set
`A`of elements and filter`F`which is also a set of one or more elements. We want to check if`A.containsAll(F)`. In order to do this, we generate signatures`s`_{A}and`s`_{F}, for`A`and`F`, respectively. If`s`_{A}contains 1’s at all positions that correspond to signatures of elements from`F,`i.e.`s`_{A}`&&``s`_{F}`==``s`_{F}, then this testifies that`A`contains`F`. - Testing by means of signatures
*does not*guarantee a correct answer, because we operate with hashes which are inherently eligible for collisions. If`A`doesn’t contains all elements from`F`then the Bloom filter can testify that`s`_{A}passed filter`s`_{F}, i.e.*false positives*are possible. Nevertheless,*true negatives*are not possible. If we need a correct answer then we need to re-check result provided by the Bloom filter. In other words, the Bloom filter partially sorts out sets that do not meet the filter`F`and, consequently, decreases the number of direct checks.

Efficiency of the Bloom filter (probability of false positives) depends on lengths of signature and a number of signed elements. We provide a simple implementation of the Bloom filter with 64-bit signatures that can be used for efficient signing of about 10 elements:

public class BloomFilter64 { private int capacity; private int k; private int maxLoops; public BloomFilter64(int capacity) { this.capacity = capacity; k = (int)(64.0 / capacity * Math.log(2)); // theoretically optimal formula for weight of signature maxLoops = k * 10; } public long signature(int[] elements) { long filter = 0; for(Integer e : elements) { filter |= signature(e); } return filter; } private long signature(int e) { long fingerprint = 0; int i; for(i = 0; i < maxLoops && Long.bitCount(fingerprint) != k; i++) { fingerprint |= (1L) << (hash(e+i) % 64); } if(i == maxLoops) { throw new IllegalStateException("Failed to generate signature"); } return fingerprint; } private int hash(int e) { // Robert Jenkins' 32 bit integer hash function e = (e+0x7ed55d16) + (e<<12); e = (e^0xc761c23c) ^ (e>>19); e = (e+0x165667b1) + (e<<5); e = (e+0xd3a2646c) ^ (e<<9); e = (e+0xfd7046c5) + (e<<3); e = (e^0xb55a4f09) ^ (e>>16); return e; } public boolean test(long filter, long signature) { return (filter & signature) == signature; } }

Usage of `BloomFilter64` is demonstrated in the following snippet:

BloomFilter64 bloom = new BloomFilter64(10); // it is expected that filter will sign about 10 elements long signature = bloom.signature(new int[] {1, 56, 87, 2345, 92}); // sign 5 elements long filterA = bloom.signature(new int[] {56, 87}); long filterB = bloom.signature(new int[] {87, 2345, 777}); bloom.test(signature, filterA); // return true - signature contains 56 and 87 bloom.test(signature, filterB); // return false - signature doesn't contain 777

Let us evaluate performance of the Bloom filter. We generate a collection of sets, each one contains `setSize` elements that are randomly selected from the set of random integers of size `elementPoolSize`. Then we create a filter randomly selecting one of the sets and truncating it to `filterSize` elements. Obviously, the longer the filter the higher its selectivity. We measure how much time does it take to iterate over all sets and test each one against the filter using the following two approaches:

- Both sets and filter are stored as
`HashSet<Integer>`and testing is performed via`HashSet#containsAll()`operation. - Each set and filter is associated with a signature generated via
`BloomFilter#signature`. We first test set’s signature against filter’s signature using`BloomFilter#test()`operation and, if passed, we test the set against the filter using`HashSet#containsAll()`to filter out potential false positives.

We repeat the experiment multiple times and average measurements over different filters and sets.

In this test schema, we can vary filter selectivity via `elementPoolSize` and `filterSize` parameters for fixed `setSize`. For example, if `setSize=10`, `filterSize=1`, and `elementPoolSize=20` then selectivity will be about 50% because filter is selected from 20 possible values and each 10-element set contains one half of possible values. Let us choose relatively small `elementPoolSize=20` and `setSize=10` to achieve a comprehensive range of selectivities and measure throughput (filtrations/sec) of both techniques for different `filterSize`:

## Filter size (elements) |
## True filter selectivity (%) |
## Bloom filter selectivity (%) |
## HashSet throughput (filtrations/sec) |
## Bloom Set throughput (filtrations/sec) |
## BloomSet/HashSet throughput ratio |
---|---|---|---|---|---|

1 | 50.01528 | 50.63274 | 3417634 | 5217845 | 1.53 x |

2 | 23.69485 | 28.23952 | 2857142 | 6402048 | 2.24 x |

3 | 10.53322 | 15.63630 | 2871088 | 12113870 | 4.22 x |

4 | 4.33940 | 7.75789 | 2714809 | 19704433 | 7.26 x |

5 | 1.62736 | 3.94698 | 2675585 | 32626427 | 12.19 x |

10 | 0.00057 | 0.02942 | 2688172 | 173913043 | 64.70 x |

One can see that Bloom filter is a very effective technique comparing to the standard hash table. Nevertheless, it’s efficiency depends on many factors (variation of set sizes, cost of false positives etc) and one should carefully adjust parameters to achieve good results.

## Reflector

Reflector is a simple structure for testing sets with filters like ( *contains(key11) OR contains(key12) OR … ) AND ( contains(key21) OR contains(key22) OR … ) AND …*. The basic idea is that we have a bit flag for each OR’ed group and map each attribute to the corresponding flag. So, if the filter passes then all flags are set because each group in the filter toggles a dedicated flag. We introduce the following constraints to keep the implementation simple:

- Maximum number of OR’ed groups is 63. This allows us to use single
`long`variable for flags storage. - Elements of set and filter (attributes) are positive integers in range
`(1, N]`and`N`is so small that it’s possible to allocate array of`N`integers. - Groups cannot have common elements, i.e. a particular attribute appears in the filter only once.

These constraints can be easily removed if necessary: long variable can be replaced by `BitSet` and hash table can be used if `N` is large. Let us take a look at the implementation:

public class Reflector { private long[] index; private int groupNumber; public Reflector(int maxElement, int[][] filter) { index = new long[maxElement]; for(int g = 0; g < filter.length; g++) { for(int e = 0; e < filter[g].length; e++) { index[filter[g][e]] = 1 << (g + 1); } } groupNumber = filter.length; } public boolean test(int[] set) { long footprint = 0; for(int i = 0; i < set.length; i++) { footprint |= index[set[i]]; } return BitUtils.testLSB(footprint, groupNumber); } } class BitUtils { private static long[] lsbMask; static { lsbMask = new long[64]; for(int i = 1; i < 64; i++) { lsbMask[i] = lsbMask[i-1] | (1 << i); } } public static boolean testLSB(long value, int bits) { return (value & lsbMask[bits]) == lsbMask[bits]; } }

- Refelector instance is created for each filter and can be used to test (filter) multiple sets.
- Filter is passed to the constructor as an array of arrays, each inner array represents an OR’ed group. For example, filter
`(1 OR 3) AND (6 OR 7 OR 8)`can be created as**new int[][] {new int[]{1,3}, new int[]{6,7,8}}** `BitUtils.testLSB(v, k)`method returns true if`k`less significant bits are all ones in`v`and false otherwise.

It’s clear that Reflector is very efficient in case of small sets and complex filters because its performance does not depend on size of the filter (number of OR’ed groups and size of each group).

# Duplicates Removal

In this section we discuss how to efficiently process a bunch of collections removing duplicates from each of them. There are different approaches to this problem:

- Pass each collection through hash table.
- If the total number of possible elements is predefined then it is possible to allocate a bit set where each position represents one element and traverse collection setting corresponding bit to 1 for each element.
- Sort each collection and traverse sorted list excluding duplicates.

`StackSet` is a data structure that combines the first two approaches eliminating overheads on allocation and cleanup of temporary memory. StackSet can be described as follows:

`StackSet`is the open addressing hash table where all`elements`are linked in order of their insertion by means of`pointers`.`add()`method inserts an element to the table and target a corresponding`pointer`to the previous element.`lastWritePosition`points to the last inserted element.`poll()`method returns an element which is pointed by`lastWritePosition`, cleans its position, and switches`lastWritePosition`to the previous element.- if one inserts several elements and polls
`StackSet`till the`EMPTY`state then`StackSet`instance is empty and ready to be reused.

public class StackSet { private int elements[]; private int pointers[]; private int lastWritePosition = -1; private int addedElements = 0; private int capacity; public static final int EMPTY = -1; private static final int FREE = -1; public StackSet(int capacity, double loadFactor) { this(MathUtils.nextPrime((int)(capacity / loadFactor))); this.capacity = capacity; } private StackSet(int size) { elements = new int[size]; Arrays.fill(elements, FREE); pointers = new int[size]; this.capacity = size; } public void add(int key) { int position = indexOfKey(key); if(position >= 0) { if(addedElements > capacity) { throw new IllegalStateException("Set is full"); } elements[position] = key; pointers[position] = lastWritePosition; lastWritePosition = position; addedElements++; } } public int poll() { if(lastWritePosition < 0) { addedElements = 0; return EMPTY; } int result = elements[lastWritePosition]; elements[lastWritePosition] = FREE; lastWritePosition = pointers[lastWritePosition]; return result; } private int indexOfKey(int e) { final int length = elements.length; int startPosition = e % length; // the first hash function int decrement = 1 + (e % (length-2)); // the second hash function while (elements[startPosition] != e && elements[startPosition] != FREE) { startPosition -= decrement; if (startPosition < 0) { startPosition += length; } } if(elements[startPosition] == FREE) { return startPosition; } else { return -1; } } }

Usage of `StackSet` is illustrated below:

StackSet stackSet = new StackSet(capacity, loadFactor); for(int[] array : arrays) { for(int i = 0; i < array.length; i++) { stackSet.add(array[i]); } int numberOfDistinctElements = 0; while(stackSet.poll() != StackSet.EMPTY) { numberOfDistinctElements++; } // numberOfDistinctElements counted the number of distinct elements in array }

Note that `stackSet` is created only once and this single instance is used for processing of `arrays` collection. In this simplified example we need to know the `capacity` of StackSet (maximum number of distinct elements is `array` or at least maximum `array` length) before calculations. In practice StackSet can grow dynamically by means of rehasing as any hash table.

We compare performance of `StackSet` with `HashSet` and duplicates removal via sorting. The results for different array sizes and duplicates amount are shown below:

## Duplicates (%) |
## Array Size (elements) |
## HashSet throughput (arrays/sec) |
## StackSet throughput (arrays/sec) |
## Sorting throughput (arrays/sec) |
---|---|---|---|---|

90 | 10 | 2,085,070 | 7,194,244 | 6,361,323 |

90 | 100 | 236,854 | 798,403 | 323,572 |

90 | 1000 | 21,121 | 83,368 | 22,441 |

50 | 10 | 1,653,986 | 4,573,519 | 3,881,987 |

50 | 100 | 191,846 | 539,956 | 174,779 |

50 | 1000 | 16,728 | 59,031 | 14,507 |

10 | 10 | 1,419,144 | 4,149,377 | 3,522,987 |

10 | 100 | 170,619 | 502,891 | 173,726 |

10 | 1000 | 15,656 | 55,509 | 14,584 |

We see that `StackSet` is effective enough, although performance of sorting is very close for small arrays. It is worth noting that `HashSet` generates a lot of garbage during its work:

whereas `StackSet` does not allocate dynamic memory at all: