Some time ago our team had been requested to develop several Java components for structured information retrieval. After the initial research, we concluded that standard approaches like inverted indexes are not well applicable to our problem because of specific business requirements. As a result we faced a necessity to design our own custom indexes and index processors with the following properties:

- Memory footprint of indexes should be as low as possible
- Index processing should be efficient in CPU utilization and memory allocation (produce a small amount of garbage objects)
- Indexes are mainly immutable i.e. performance of index building and modification is not significant
- Indexes should be partitionable i.e. it should be possible to distribute an index across the cluster

The details of our business requirements are irrelevant, as well as a particular structure of the designed indexes, but it’s important that we ended up with the following problem statement:

*There is a set of documents and each document is a collection of attributes or a group of such collections. It’s necessary to store and process all documents one by one (e.g. there exist admixtures of business logic that do not allow to leverage inverted indexes). The following questions are in focus:*

**How to store sets?**It’s necessary to store sets of items in immutable collections without duplicates (`Set`semantics). Small memory footprint is a must, efficient iteration over collection and fast`contains()`operation are also required. This problem is very close to building compact and immutable`Map`with fast`get()`operation (just need to associate each key with a value).**How to query sets?**This generic problem can be described as follows: There is a collection of documents and each document`D`is a set of attributes {`A,B,..,X,Y`}. Select all the documents that meet an attribute filter. The filter is an boolean expression that permits (or not) a set of attributes.**How to build sets?**This question is about duplicates removal. Another problem statement is merging of several potentially intersecting sets into the single collection without duplicates.

*A few more assumptions:*

*Collections are designed to store IDs (document IDs, attribute IDs etc), so we focus our attention on primitive types assuming that keys are positive integers.**Most collections are immutable or have immutable capacity, high performance of*`add()`operation is not a goal. The`remove()`operation is an extremely unusual case, we will simply omit this aspect for the sake of brevity.

From an implementation perspective, these tasks assumes intensive usage of `Maps` and `Sets`, but standard Java collections like `HashSet` are not always the best choice because of the large memory footprint (objects instead of primitive types) and intensive memory consumption in runtime (creation of new Map.Entry objects etc). The alternative frameworks like Colt look promising, but also don’t cover all our needs.

Eventually we developed our own collections. In this article we briefly discuss simplified versions of our structures (there is no need in the detailed analysis because all the algorithms are well known) and present results of performance testing. The main goal is to share the sources of the basic implementations that may be useful for developers who work with custom data structures.

# Compact Sets

In this section we discuss several data structures that can be used for compact set storage. Most of them are based on hashing, although this is not the only choice.

## Open Addressing Hash Set

In this section we present a typical implementation of open addressing hash table with double hashing. This is a standard algorithm (for example, see Cormen’s Introduction in Algorithms) that is widely used in many libraries like fastutil, Trove, or Colt. We provide several comments on implementation without detailed description of the algorithm:

- For simplicity, the capacity is fixed. This doesn’t affect the design deeply because capacity can be extended by means of rehashing.
- Addresses are generated using the double hashing technique.
- We use a table
`keys`that has a size always equal to a prime number. This guarantee that hash functions in`indexOfKey()`iterate all the table searching`FREE`position for a new element. `MathUtils.nextPrime(m)`returns a prime number that is greater or equal than`m`. Implementation is shown in the Appendix A.- Typical load factor is
`0.5`. This means that we need to allocate array of`2n`integers to store`n`elements. This is pretty compact. Moreover, the structure typically performs well for the load factors of about`0.7-0.9`, although the greater load factor, the worse the performance.

public class OpenAddressingSet { protected int keys[]; protected int capacity; protected int size; protected static final int FREE = -1; public OpenAddressingSet(int capacity, double loadFactor) { this.capacity = capacity; int tableSize = MathUtils.nextPrime( ((int)(capacity / loadFactor)) ); keys = new int[tableSize]; Arrays.fill(keys, FREE); } public boolean contains(int key) { return indexOfKey(key) >= 0; } public boolean add(int key) { if(size >= capacity) { throw new IllegalArgumentException("Set is full"); } boolean result; result = addInternal(key); if(result) { size++; } return result; } protected boolean addInternal(int key) { int position = indexOfKey(key); boolean added = false; if(position < 0) { position = -position-1; added = true; } keys[position] = key; return added; } private int indexOfKey(int key) { final int length = keys.length; int startPosition = key % length; // the first hash function int step = 1 + (key % (length-2)); // the second hash function while (keys[startPosition] != key && keys[startPosition] != FREE) { startPosition -= step; if (startPosition < 0) { startPosition += length; } } if(keys[startPosition] == FREE) { return -startPosition-1; } return startPosition; } }

Performance of `contains()` operation may be increased by additional reordering of elements during the insertion. One possible approach is so-called Brent algorithm. It is not widely used, but Knuth provides a sufficient description in The Art of Computer Programming Vol 3. The algorithm is quite complex, so we simply provide the implementation (variables are named in accordance with Knuth’s notation):

public class BrentOpenAddressingSet extends OpenAddressingSet { public BrentOpenAddressingSet(int capacity, double loadFactor) { super(capacity, loadFactor); } protected boolean addInternal(int key) { final int length = keys.length; int h1 = key % length; // the first hash function int h2 = 1 + (key % (length-2)); // the second hash function int scanStep = 0; int scanCursor = h1; boolean isInserted = false; boolean isPositionFree; while (!isInserted) { isPositionFree = keys[scanCursor] == FREE; if(isPositionFree || keys[scanCursor] == key) { keys[scanCursor] = key; isInserted = true; } else { scanCursor -= h2; if (scanCursor < 0) { scanCursor += length; } boolean isOptimized = false; isPositionFree = keys[scanCursor] == FREE; if((!isPositionFree && keys[scanCursor] != key) && scanStep >= 2) { int rank = scanStep - 1; for(int j = 0; j < rank && !isOptimized; j++) { int p_j = (h1 - j * h2) % length; if(p_j < 0) { p_j += length; } int c_j = 1 + (keys[p_j] % (length-2)); int p_kj = (p_j - (rank - j)*c_j) % length; if(p_kj < 0) { p_kj += length; } isPositionFree = keys[p_kj] == FREE; if(isPositionFree || keys[p_kj] == key) { keys[p_kj] = keys[p_j]; keys[p_j] = key; isOptimized = isInserted = true; } } } isPositionFree = keys[scanCursor] == FREE; if(!isOptimized && (isPositionFree || keys[scanCursor] == key)) { keys[scanCursor] = key; isInserted = true; } } scanStep++; } return isInserted; } }

## Cuckoo Hash Set

Open addressing hash tables have very small memory footprint, but performance of the lookup has certain pitfalls. The algorithms described in the previous section jump from one position in the table to another until the requested key is found or a free position is found (i.e. there is no such element in the table – unsuccessful lookup). Under some circumstances, the series of jumps can become too long, especially for unsuccessful lookups. The Cuckoo Hashing offers a trade off between the memory consumption and performance/stability of the lookups. It typically uses two tables instead of one, but both successful and unsuccessful lookups are performed without iterative jumps (compare `contains` methods in `OpenAddressingSet` and `CuckooSet`). We provide a simple implementation of the Cuckoo Set below. Let us briefly comment the several aspects:

- The proposed implementation uses multiplicative hash functions with “magic” multipliers
`H1, H2`that are discussed, for example, in The Art of Computer Programming Vol 2 . - In the open addressing hash tables we don’t need to resize the table, given that the capacity is predefined and fixed. In
`CuckooSet`we can’t guarantee that insertion will find a place for new element, even after multiple permutations of the existing elements (maximum number of permutations is regulated by`insertRounds`). So, increase of the table size and rehashing is an integral part of the implementation.

public class CuckooSet { private int[] keysT1; private int[] keysT2; private int tableLength; private int hashShift; private int rehashCounter = 0; private double loadFactor; private int capacity; private int insertRounds; private static final long H1 = 2654435761L; private static final long H2 = 0x6b43a9b5L; private static final long INT_MASK = 0x07FFFFFFF; private static final int FREE = -1; public CuckooSet(int capacity, double loadFactor, int insertRounds) { this.loadFactor = loadFactor; this.insertRounds = insertRounds; initializeCapacity(capacity, loadFactor); } public boolean add(int key) { if(contains(key)) { return false; } for(int i = 0; i < insertRounds; i++) { int h1 = h1(key); int register = keysT1[h1]; keysT1[h1] = key; if(register == FREE) { return true; } key = register; int h2 = h2(key); register = keysT2[h2]; keysT2[h2] = key; if(register == FREE) { return true; } key = register; } rehash(); return add(key); } public boolean contains(int key) { return keysT1[h1(key)] == key || keysT2[h2(key)] == key; } public int getRehashCounter() { return rehashCounter; } private void rehash() { int[] oldT1 = keysT1; int[] oldT2 = keysT2; int oldTableLength = tableLength; initializeCapacity(capacity * 2, loadFactor); for(int i = 0; i < oldTableLength; i++) { if(oldT1[i] != FREE) { add(oldT1[i]); } if(oldT2[i] != FREE) { add(oldT2[i]); } } rehashCounter++; } private void initializeCapacity(int capacity, double loadFactor) { this.capacity = capacity; tableLength = (int)(capacity / loadFactor); hashShift = 32 - (int)Math.ceil(Math.log(tableLength) / Math.log(2)); keysT1 = new int[tableLength]; keysT2 = new int[tableLength]; Arrays.fill(keysT1, FREE); Arrays.fill(keysT2, FREE); } private int h1(int key) { return ((int)( ((int)(key * H1) >> hashShift) & INT_MASK) ) % tableLength; } private int h2(int key) { return ((int)( ((int)(key * H2) >> hashShift) & INT_MASK) ) % tableLength; } }

## Array Set

We have already discussed several hash-based approaches for sets representation. To have a complete picture, we should mention another one very simple technique. The `ArraySet` is a straightforward implementation of set – it stores keys in the sorted array, `contains()` is implemented as a binary search. The shortcomings are obvious: inefficient `add()` operation and relatively low performance of binary search for large arrays. However, we can expect good performance and excellent memory footprint for small sets.

public class ArraySet { private int[] array; private int size = 0; public ArraySet(int capacity) { array = new int[capacity]; Arrays.fill(array, -1); } public boolean add(int key) { int index = Arrays.binarySearch(array, 0, size, key); if (index < 0) { int insertIndex = -index-1; if(size < array.length - 1) { if(insertIndex < size) { System.arraycopy(array, insertIndex, array, insertIndex + 1, size - insertIndex); } array[insertIndex] = key; } else { int[] newArray = new int[array.length + 1]; System.arraycopy(array, 0, newArray, 0, insertIndex); System.arraycopy(array, insertIndex, newArray, insertIndex + 1, array.length - insertIndex); newArray[insertIndex] = key; array = newArray; } size++; return true; } return false; } public int get(int position) { return array[position]; } public int size() { return size; } public boolean contains(int key) { return Arrays.binarySearch(array, key) >= 0; } }

## Performance Evaluation

We start with a couple of words about testing environment. We use 64-bit JDK 1.6 u20 under Ubuntu Linux 9.10, 2.93GHz Intel Core Quad processor. All performance tests were executed in a single-thread mode. The absolute values of throughput may be used to understand the order of magnitudes, but it’s more correct to consider these values as abstract points that can be compared one with other.

Our primary goal was to design sets with the small memory footprint, so we compare sizes of all the structures. In the HotSpot JVM size of `OpenAddressingSet` or`BrentOpenAddressingSet` is about `elements * 4 / load factor` bytes, size of `CuckooSet` is about `2 * elements * 4 / load factor` bytes, size of `ArraySet` is about `elements * 4`. This means that for 10000 elements `ArraySet` uses `40kB`, `OpenAddressingSet` and `CuckooSet` use `80kB` and `160kB`, respectively, for load factor of `0.5`. Size of `HashSet` is about `610k`for 10000 integer keys.

Let us evaluate performance of the described implementations on random keys. First, we measure performance of the set population (`add` operation) for different set sizes. In this test, memory for all structures is preallocated and the load factor for `HashSet` is `0.75` (default). For other implementations load factors is `0.5`. It’s clear that `OpenAddressingSet` and `CuckooSet`significantly outperform `HashSet` under these circumstances:

Performance of the lookup operation is depicted below. We test successful (requested key is in the set, marked with `(S)`), and unsuccessful (key is not in the set, marked with `(U)`) lookups separately because most algorithms are sensitive to this.

One can see that `OpenAddressingSet` and `CuckooSet` work well, but `BrentOpenAddressingSet` provides almost no gain. It’s also clear that `ArraySet` is very good solution for small sets.

Finally, we compare performance of `OpenAddressingSet` and `CuckooSet` for load factors `0.5` and `0.9`. Here we see that `BrentOpenAddressingSet` outperforms `OpenAddressingSet` for high load factors:

## Appendix A

public class MathUtils { public static int nextPrime(int n) { int candidate = n; while( !isPrime(candidate) ) { candidate++; } return candidate; } public static boolean isPrime(int n) { if((n % 2) == 0 || n % 3 == 0) { return n == 2 || n == 3; } int upperSearchLimit = (int)Math.ceil(Math.sqrt(n)); for(int i = 6; i-1 <= upperSearchLimit; i += 6) { if(n % (i-1) == 0 || n % (i+1) == 0) { return false; } } return true; } }

very informative. thanks

Hello, because of a personal project I have to implement my own set, and this articles really points me the right direction on how it should be made. But I’m left with this one question when you mention “MathUtils.nextPrime(m) returns a prime number which is greater or equal than m. Implementation is shown in Appendix A.” Where can i find this Appendix A? I would like to know how the nextPrime function was implemented.

Alberto,

Thanks for point this out – I’ve added Appendix A to the end of the article. Nevertheless, this is a very basic and potentially slow implementation. A precalculated table of exponentially growing prime numbers is typically used in practice.

Another interesting prime-selection implementation uses a precomputed table optimized for doubling hash tables/sets. It is constructed such that for any size, a prime can be found within 11% of the desired capacity. You can find it in GNU Trove’s PrimeFinder (http://grepcode.com/file/repo1.maven.org/maven2/net.sf.trove4j/trove4j/3.0.2/gnu/trove/impl/PrimeFinder.java#PrimeFinder)

Sorry, I’m a little late on this. In my experience, people use a prime-sized hash table for one reason only: their hash functions are terrible.

If you do take a typical prime-sized hash table of integers and do instruction-level profiling of access times, you’ll often find that there is invariably one instruction which dominates execution time: the modulo-hash-table-size instruction. Since the table size isn’t predictable statically, you need to do a machine division here. I’ve had programs where 20% of the run time was spent on this one instruction, even taking I/O into account!

Alwaysuse a good hash function. Then you can make your hash table a power of two in size, and implement the modulo operation with a bit mask.good article.. thanks for sharing.. 🙂