This post is a second part of the Ultimate Sets and Maps for Java article. See the first part.
Querying of Sets
In this section we discuss two structures: Bloom filter and Reflector. These structures are useful for testing of multiple sets against the same filter. The Bloom filter is convenient for filters in form A and B and ..., i.e. for containsAll() operation. Reflector can be used for filters in form (A or B or ..) and (C or D or ..) and ...
Bloom Filter
The Bloom filter is a probabilistic data structure that is used in some caching systems and databases. We briefly describe the main properties of the Bloom filter, more information can be found on Wikipedia and in Mitzenmacher’s Probability and Computing .
- The nature of the Bloom filter is close to hash table. For any element it generates a special hash – signature – which is simply an array of bits. Signature of the set of elements is a bitwise OR of separate signatures of such elements.
- Let us have a set A of elements and filter F which is also a set with one or more elements. We want to check is A.containsAll(F). In order to do this, we generate signaturessA and sF, for A and F, respectively. If sA contains 1′s at all positions that correspond to signatures of elements from F, i.e. sA && sF == sF, then this testifies that A contains F.
- Testing via signatures doesn’t not guarantee the correct answer, because we operate with hashes which are inherently eligible for collisions. If A doesn’t contains all elements from F then the Bloom filter can testify that sA passed filter sF, i.e. false positives are possible. Nevertheless, true negatives are not possible. If we need the correct answer then we need to re-check result provided by the Bloom filter. In other words, the Bloom filter simply sorts out some part of sets which don’t pass the filter F and decreases the number of direct checks.
Efficiency of the Bloom filter (probability of false positives) depends on the length of signature and number of signed elements. We provide a simple implementation of the Bloom filter with 64-bit signatures which can be used for efficient signing about 10 elements:
public class BloomFilter64 {
private int capacity;
private int k;
private int maxLoops;
public BloomFilter64(int capacity) {
this.capacity = capacity;
k = (int)(64.0 / capacity * Math.log(2)); // theoretically optimal formula for weight of signature
maxLoops = k * 10;
}
public long signature(int[] elements) {
long filter = 0;
for(Integer e : elements) {
filter |= signature(e);
}
return filter;
}
private long signature(int e) {
long fingerprint = 0;
int i;
for(i = 0; i < maxLoops && Long.bitCount(fingerprint) != k; i++) {
fingerprint |= (1L) << (hash(e+i) % 64);
}
if(i == maxLoops) {
throw new IllegalStateException("Failed to generate signature");
}
return fingerprint;
}
private int hash(int e) { // Robert Jenkins' 32 bit integer hash function
e = (e+0x7ed55d16) + (e<<12);
e = (e^0xc761c23c) ^ (e>>19);
e = (e+0x165667b1) + (e<<5);
e = (e+0xd3a2646c) ^ (e<<9);
e = (e+0xfd7046c5) + (e<<3);
e = (e^0xb55a4f09) ^ (e>>16);
return e;
}
public boolean test(long filter, long signature) {
return (filter & signature) == signature;
}
}
Usage of BloomFilter64 is demonstrated in the following snippet:
BloomFilter64 bloom = new BloomFilter64(10); // it is expected that filter will sign about 10 elements
long signature = bloom.signature(new int[] {1, 56, 87, 2345, 92}); // sign 5 elements
long filterA = bloom.signature(new int[] {56, 87});
long filterB = bloom.signature(new int[] {87, 2345, 777});
bloom.test(signature, filterA); // return true - signature contains 56 and 87
bloom.test(signature, filterB); // return false - signature doesn't contain 777
Let us evaluate performance of the Bloom filter. We generate a collection of sets, each one contains setSize elements which are randomly selected from the set of random integers of size elementPoolSize. After this we randomly select one set and truncate it to filterSize elements. The truncated set form a filter. Obviously, the longer the filter the higher its selectivity. We measure how much time it takes to iterate over all sets and test each one against the filter using the following two approaches:
- Both sets and filter are stored as HashSet<Integer> and testing is performed via HashSet#containsAll() operation.
- Each set and filter is associated with a signature generated via BloomFilter#signature. We first test set’s signature against filter’s signature using BloomFilter#test()operation and, if test passed, we test set against filter using HashSet#containsAll() to filter out potential false positives.
Of course, we repeat this experiment multiple times and average measurements over different filters and sets.
In this test schema, we can vary filter selectivity via elementPoolSize and filterSize parameters for fixed setSize. For example, if setSize=10, filterSize=1, andelementPoolSize=20 then selectivity will be about 50% because filter is selected from 20 possible values and each 10-element set contains one half of possible values. Let us choose relatively small elementPoolSize=20 and setSize=10 to achieve a comprehensive range of selectivities and measure throughput (filtrations/sec) of both techniques for differentfilterSize:
Filter size (elements) |
True filter selectivity (%) |
Bloom filter selectivity (%) |
HashSet throughput (filtrations/sec) |
Bloom Set throughput (filtrations/sec) |
BloomSet/HashSet throughput ratio |
|---|---|---|---|---|---|
| 1 | 50.01528 | 50.63274 | 3417634 | 5217845 | 1.53 |
| 2 | 23.69485 | 28.23952 | 2857142 | 6402048 | 2.24 |
| 3 | 10.53322 | 15.63630 | 2871088 | 12113870 | 4.22 |
| 4 | 4.33940 | 7.75789 | 2714809 | 19704433 | 7.26 |
| 5 | 1.62736 | 3.94698 | 2675585 | 32626427 | 12.19 |
| 10 | 0.00057 | 0.02942 | 2688172 | 173913043 | 64.70 |
One can see that Bloom filter is a very effective technique for the discussed problem in comparison with the standard hash table. Nevertheless, it’s efficiency depends on many factors (variation of set sizes, cost of false positives etc) and one should carefully adjust parameters to achieve good results.
Reflector
Reflector is a quite obvious structure for testing sets against filters in form (A or B or ..) and (C or D or ..) and ... The basic idea is that we have a bit flag for each OR’ed group and map each attribute to the corresponding flag. So, if filter passed then all flags are set because each group in the filter triggers a dedicated flag. We introduce the following constraints to keep our implementation simple:
- Maximum number of OR’ed groups is 63. This allows us to use single long variable for flags storage.
- Elements of set and filter (attributes) are positive integers in range (1, N] and N is so small that it’s possible to allocate array of N integers.
- Groups can’t have common elements, i.e. particular attribute can appear in filter only once.
These constraints can be easily overcome: long variable can be replaced by BitSet and hash table can be used if N is large. Let us take a look at the implementation preceded by several comments:
- Refelector instance is created for each filter and can be used to test (filter) multiple sets.
- Filter is passed to the constructor as an array of arrays, each inner array represents an OR’ed group. For example, filter (1 OR 3) AND (6 OR 7 OR 8) can be created asnew int[][] {new int[]{1,3}, new int[]{6,7,8}
- BitUtils.testLSB(v, k) method returns true if k less significant bits are all ones in v and false otherwise.
public class Reflector {
private long[] index;
private int groupNumber;
public Reflector(int maxElement, int[][] filter) {
index = new long[maxElement];
for(int g = 0; g < filter.length; g++) {
for(int e = 0; e < filter[g].length; e++) {
index[filter[g][e]] = 1 << (g + 1);
}
}
groupNumber = filter.length;
}
public boolean test(int[] set) {
long footprint = 0;
for(int i = 0; i < set.length; i++) {
footprint |= index[set[i]];
}
return BitUtils.testLSB(footprint, groupNumber);
}
}
class BitUtils {
private static long[] lsbMask;
static {
lsbMask = new long[64];
for(int i = 1; i < 64; i++) {
lsbMask[i] = lsbMask[i-1] | (1 << i);
}
}
public static boolean testLSB(long value, int bits) {
return (value & lsbMask[bits]) == lsbMask[bits];
}
}
It’s clear that Reflector is very efficient in case of small sets and complex filters because its performance doesn’t depend on size of filter (number of OR’ed groups and size of each group).
Duplicates Removal and Stack Set
In this section we discuss the following problem. There are many collections that contain duplicates and it’s necessary to remove duplicates from each collection. Typically, collections are formed in runtime, e.g. by merging of other collections. There are different approaches for this problem:
- Pass each collection through hash table, i.e. use HashSet.
- If the total number of possible elements is predefined then it’s possible to allocate a bit set where each position represents one element and traverse collection setting corresponding bit to 1 for each element.
- Sort each collection and traverse sorted list excluding duplicates.
The code below presents a StackSet which combines first two approaches eliminating overheads on allocation and clean up of temporary structures. It’s logic may be described as follows:
- StackSet is an open addressing hash table where all elements are linked in order of their insertion by means of pointers.
- add() method inserts an element to the table and target a corresponding pointer to the previous element. lastWritePosition points to the last inserted element.
- poll() method returns an element which is pointed by lastWritePosition, cleans it’s position, and switches lastWritePosition to the previous element.
- if one inserts several elements and polls StackSet till the EMPTY state then StackSet instance is empty and ready to be reused.
public class StackSet {
private int elements[];
private int pointers[];
private int lastWritePosition = -1;
private int addedElements = 0;
private int capacity;
public static final int EMPTY = -1;
private static final int FREE = -1;
public StackSet(int capacity, double loadFactor) {
this(MathUtils.nextPrime((int)(capacity / loadFactor)));
this.capacity = capacity;
}
private StackSet(int size) {
elements = new int[size];
Arrays.fill(elements, FREE);
pointers = new int[size];
this.capacity = size;
}
public void add(int key) {
int position = indexOfKey(key);
if(position >= 0) {
if(addedElements > capacity) {
throw new IllegalStateException("Set is full");
}
elements[position] = key;
pointers[position] = lastWritePosition;
lastWritePosition = position;
addedElements++;
}
}
public int poll() {
if(lastWritePosition < 0) {
addedElements = 0;
return EMPTY;
}
int result = elements[lastWritePosition];
elements[lastWritePosition] = FREE;
lastWritePosition = pointers[lastWritePosition];
return result;
}
private int indexOfKey(int e) {
final int length = elements.length;
int startPosition = e % length; // the first hash function
int decrement = 1 + (e % (length-2)); // the second hash function
while (elements[startPosition] != e && elements[startPosition] != FREE) {
startPosition -= decrement;
if (startPosition < 0) {
startPosition += length;
}
}
if(elements[startPosition] == FREE) {
return startPosition;
} else {
return -1;
}
}
}
Possible usage of StackSet is demonstrated below:
StackSet stackSet = new StackSet(capacity, loadFactor);
for(int[] array : arrays) {
for(int i = 0; i < array.length; i++) {
stackSet.add(array[i]);
}
int numberOfDistinctElements = 0;
while(stackSet.poll() != StackSet.EMPTY) {
numberOfDistinctElements++;
}
// numberOfDistinctElements counted the number of distinct elements in array
}
Note that stackSet is created only once and this single instance is used for processing of arrays collection. In this simplified example we need to know the capacity of StackSet (maximum number of distinct elements is array or at least maximum array length) before calculations. In practice StackSet can grow dynamically by means of rehasing as any hash table.
We compare performance of StackSet with HashSet and duplicates removal via sorting. The results for different array sizes and duplicates amount are shown below:
Duplicates in original arrays (%) |
Array Size (elements) |
HashSet throughput (arrays/sec) |
StackSet throughput (arrays/sec) |
Sorting throughput (arrays/sec) |
|---|---|---|---|---|
| 90 | 10 | 2,085,070 | 7,194,244 | 6,361,323 |
| 90 | 100 | 236,854 | 798,403 | 323,572 |
| 90 | 1000 | 21,121 | 83,368 | 22,441 |
| 50 | 10 | 1,653,986 | 4,573,519 | 3,881,987 |
| 50 | 100 | 191,846 | 539,956 | 174,779 |
| 50 | 1000 | 16,728 | 59,031 | 14,507 |
| 10 | 10 | 1,419,144 | 4,149,377 | 3,522,987 |
| 10 | 100 | 170,619 | 502,891 | 173,726 |
| 10 | 1000 | 15,656 | 55,509 | 14,584 |
We see that StackSet is effective enough, although performance of sorting is very close for short arrays. Obvious but important point here is that HashSet generates a lot of garbage during its work:
whereas StackSet doesn’t allocate memory at all:



Posted on January 1, 2012
0