<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Highly Scalable Blog</title>
	<atom:link href="http://highlyscalable.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://highlyscalable.wordpress.com</link>
	<description>Articles on Big Data, HPC, and Highly Scalable Software Engineering</description>
	<lastBuildDate>Wed, 05 Jun 2013 12:22:34 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='highlyscalable.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Highly Scalable Blog</title>
		<link>http://highlyscalable.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://highlyscalable.wordpress.com/osd.xml" title="Highly Scalable Blog" />
	<atom:link rel='hub' href='http://highlyscalable.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Distributed Algorithms in NoSQL Databases</title>
		<link>http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/</link>
		<comments>http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/#comments</comments>
		<pubDate>Tue, 18 Sep 2012 16:45:37 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[cassandra]]></category>
		<category><![CDATA[consistency]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[protocol]]></category>
		<category><![CDATA[sharding]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=675</guid>
		<description><![CDATA[Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I try to provide a more or less systematic description of techniques related to distributed operations in NoSQL databases.</p>
<p>In the rest of this article we study a number of distributed activities, like replication or failure detection, that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:</p>
<ul>
<li>Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs revolve around data consistency, so this section is devoted to <strong>data replication</strong> and <strong>data repair</strong>.</li>
<li>Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to <strong>distribute or rebalance data</strong> in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.</li>
<li>System Coordination. Coordination techniques like <strong>leader election</strong> are used in many databases to implement fault-tolerance and strong data consistency. However, even decentralized databases typically track their global state and <strong>detect failures and topology changes</strong>. This section describes several important techniques that are used to keep the system in a coherent state.</li>
</ul>
<h2>Data Consistency</h2>
<p>It is well known and fairly obvious that in geographically distributed systems or other environments with probable network partitions or delays it is not generally possible to maintain high availability without sacrificing consistency, because isolated parts of the database have to operate independently in case of network partition. This fact is often referred to as the CAP theorem. However, consistency is a very expensive thing in distributed systems, so it can be traded off not only against availability. It is often involved in multiple tradeoffs. To study these tradeoffs, we first note that consistency issues in distributed systems are induced by the replication and the spatial separation of coupled data, so we have to start with the goals and desired properties of replication:</p>
<ul>
<li>Availability. Isolated parts of the database can serve read/write requests in case of network partition.</li>
<li>Read/Write latency. Read/Write requests are processed with minimal latency.</li>
<li>Read/Write scalability. Read/Write load can be balanced across multiple nodes.</li>
<li>Fault-tolerance. Ability to serve read/write requests does not depend on availability of any particular node.</li>
<li>Data persistence. Node failures within certain limits do not cause data loss.</li>
<li>Consistency. Consistency is a much more complicated property than the previous ones, so we have to discuss different options in detail. It is beyond the scope of this article to go deeply into theoretical consistency and concurrency models, so we use a very lean framework of simple properties.
<ul>
<li>Read-Write consistency. From the read-write perspective, the basic goal of a database is to minimize the replica convergence time (how long it takes to propagate an update to all replicas) and to guarantee eventual consistency. Besides these weak guarantees, one can be interested in stronger consistency properties:
<ul>
<li>Read-after-write consistency. The effect of a write operation on data item X will always be seen by a successive read operation on X.</li>
<li>Read-after-read consistency. If some client reads the value of a data item X, any successive read operation on X will always return that same or a more recent value.</li>
</ul>
</li>
</ul>
<ul>
<li>Write-Write consistency. Write-write conflicts appear in case of database partition, so a database should either handle these conflicts somehow or guarantee that concurrent writes will not be processed by different partitions. From this perspective, a database can offer different consistency models:
<ul>
<li>Atomic Writes. If a database provides an API where a write request can only be an independent atomic assignment of a value, one possible way to avoid write-write conflicts is to pick the “most recent” version of each entity. This guarantees that all nodes will end up with the same version of data irrespective of the order of updates, which can be affected by network failures and delays. The data version can be specified by a timestamp or an application-specific metric. This approach is used, for example, in Cassandra.</li>
<li>Atomic Read-modify-write. Applications often do a read-modify-write sequence instead of independent atomic writes. If two clients read the same version of data, modify it and write back concurrently, the latest update will silently override the first one in the atomic writes model. This behavior can be semantically inappropriate (for example, if both clients add a value to a list). A database can offer at least two solutions:
<ul>
<li>Conflict prevention. Read-modify-write can be thought of as a particular case of a transaction, so distributed locking or consensus protocols like PAXOS [20, 21] can both serve as a solution. This is a generic technique that can support both atomic read-modify-write semantics and arbitrary isolated transactions. An alternative approach is to prevent distributed concurrent writes entirely and route all writes of a particular data item to a single node (global master or shard master). To prevent conflicts, a database must sacrifice availability in case of network partitioning and stop all but one partition. This approach is used in many systems with strong consistency guarantees (e.g. most RDBMSs, HBase, MongoDB).</li>
<li>Conflict detection. A database tracks concurrent conflicting updates and either rolls back one of the conflicting updates or preserves both versions for resolution on the client side. Concurrent updates are typically tracked by using vector clocks [19] (which can be thought of as a generalization of optimistic locking) or by preserving an entire version history. This approach is used in systems like Riak, Voldemort, CouchDB.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
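<p>To make the conflict detection idea more tangible, here is a minimal sketch of how two vector clocks can be compared. The class name <em>VectorClock</em>, the <em>Order</em> enum and the fixed number of replicas are assumptions made for illustration; this is not the API of Riak, Voldemort or any other database mentioned above.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Arrays;

/** Illustrative vector clock: one logical counter per replica (cluster size assumed fixed). */
final class VectorClock {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    private final int[] ticks;

    VectorClock(int... ticks) {
        this.ticks = ticks.clone();
    }

    /** Returns a new clock that records one more update accepted by the given replica. */
    VectorClock tick(int replica) {
        int[] next = ticks.clone();
        next[replica]++;
        return new VectorClock(next);
    }

    /** Compares component by component; both clocks are assumed to have the same length. */
    Order compare(VectorClock other) {
        boolean smaller = false;
        boolean larger = false;
        for (int i = 0; i &lt; ticks.length; i++) {
            if (ticks[i] &lt; other.ticks[i]) smaller = true;
            if (ticks[i] &gt; other.ticks[i]) larger = true;
        }
        if (smaller &amp;&amp; larger) return Order.CONCURRENT; // neither version descends from the other
        if (smaller) return Order.BEFORE;
        if (larger) return Order.AFTER;
        return Order.EQUAL;
    }

    @Override
    public String toString() {
        return Arrays.toString(ticks);
    }
}
</pre>
<p>If two clients both start from the version [0, 0, 0] and write through replicas 0 and 1, they produce [1, 0, 0] and [0, 1, 0]. The comparison reports these versions as concurrent, which is the signal to either roll one of them back or keep both for client-side resolution.</p>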
<p>Now let’s take a closer look at commonly used replication techniques and classify them in accordance with the described properties. The first figure below depicts logical relationships between different techniques and their coordinates in the system of the consistency-scalability-availability-latency tradeoffs. The second figure illustrates each technique in detail.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/consistency-plot-3.png"><img class="aligncenter size-full wp-image-798" title="consistency-plot-3" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/consistency-plot-3.png?w=594"   /></a></p>
<div id="attachment_768" class="wp-caption aligncenter" style="width: 496px"><a href="http://highlyscalable.files.wordpress.com/2012/09/consistency-catalog.png"><img class="size-full wp-image-768" title="consistency-catalog" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/consistency-catalog.png?w=594"   /></a><p class="wp-caption-text">Replication factor 4. It is assumed that read/write coordinator can be either an external client or a proxy node within a database.</p></div>
<p>Let’s go through all these techniques moving from weak to strong consistency guarantees:</p>
<ul>
<li>(A, Anti-Entropy) The weakest consistency guarantees are provided by the following strategy. The writer updates an arbitrarily selected replica. The reader reads any replica and sees the old data until a new version is propagated via a background anti-entropy protocol (more on anti-entropy protocols in the next section). The main properties of this approach are:
<ul>
<li>High propagation latency makes it quite impractical for data synchronization, so it is typically used only as an auxiliary background process that detects and repairs unplanned inconsistencies. However, databases like Cassandra use anti-entropy as a primary way to propagate information about database topology and other metadata.</li>
<li>Consistency guarantees are poor: write-write conflicts and read-write discrepancies are very probable even in absence of failures.</li>
<li>Superior availability and robustness against network partitions. This schema provides good performance because individual updates are replaced by asynchronous batch processing.</li>
<li>Persistence guarantees are weak because new data are initially stored on a single replica.</li>
</ul>
</li>
<li>(B) An obvious improvement of the previous schema is to send an update to all (available) replicas asynchronously as soon as the update request hits any replica. It can be considered as a kind of targeted anti-entropy.
<ul>
<li>In comparison with pure anti-entropy, this greatly improves consistency with a relatively small performance penalty. However, formal consistency and persistence guarantees remain the same.</li>
<li>If some replica is temporarily unavailable due to network failures or node failure/replacement, updates should eventually be delivered to it by the anti-entropy process.</li>
</ul>
</li>
<li>(C) In the previous schema, failures can be handled better using the hinted handoff technique [8]. Updates that are intended for unavailable nodes are recorded on the coordinator or any other node with a hint that they should be delivered to a certain node as soon as it becomes available. This improves persistence guarantees and replica convergence time.</li>
<li>(D, Read One Write One) Since the carrier of hinted handoffs can fail before the deferred updates are propagated, it makes sense to enforce consistency by so-called read repairs. Each read (or randomly selected reads) triggers an asynchronous process that requests a digest (a kind of signature/hash) of the requested data from all replicas and reconciles inconsistencies if detected. We use the term Read One Write One for the combination of techniques A, B, C and D – none of them provides strict consistency guarantees, but together they are efficient enough to be used in practice as a self-contained approach.</li>
<li>(E, Read Quorum Write Quorum) The strategies above are heuristic enhancements that decrease replica convergence time. To provide guarantees beyond eventual consistency, one has to sacrifice availability and guarantee an overlap between read and write sets. A common generalization is to write synchronously to W replicas instead of one and touch R replicas during reading.
<ul>
<li>First, this allows one to manage persistence guarantees by setting W&gt;1.</li>
<li>Second, this improves consistency for R+W&gt;N because the synchronously written set will overlap with the set that is contacted during reading (in the figure above W=2, R=3, N=4), so the reader will touch at least one fresh replica and select it as a result. This guarantees consistency if read and write requests are issued sequentially (e.g. by one client, read-your-writes consistency), but does not guarantee global read-after-read consistency. Consider the example in the figure below to see why reads can be inconsistent. In this example R=2, W=2, N=3. However, writing to two replicas is not transactional, so clients can fetch both old and new values while the write is still in progress:</li>
</ul>
</li>
</ul>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/consistency-concurrent-quorum.png"><img class="aligncenter size-full wp-image-769" title="consistency-concurrent-quorum" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/consistency-concurrent-quorum.png?w=594"   /></a></p>
<ul>
<ul>
<li>Different values of R and W allow one to trade write latency and persistence for read latency and vice versa.</li>
<li>Concurrent writers can write to disjoint quorums if W&lt;=N/2. Setting W&gt;N/2 guarantees immediate conflict detection in the Atomic Read-modify-write with rollbacks model.</li>
<li>Strictly speaking, this schema is not tolerant of network partitions, although it tolerates failures of separate nodes. In practice, heuristics like sloppy quorum [8] can be used to sacrifice the consistency provided by a standard quorum schema in favor of availability in certain scenarios.</li>
</ul>
<li>(F, Read All Write Quorum) The problem with read-after-read consistency can be alleviated by contacting all replicas during reading (reader can fetch data or check digests). This ensures that a new version of data becomes visible to the readers as soon as it appears on at least one node. Network partitions of course can lead to violation of this guarantee.</li>
<li>(G, Master-Slave) The techniques above are often used to provide either Atomic Writes or Read-modify-write with Conflict Detection consistency levels. To achieve a Conflict Prevention level, one has to use a kind of centralization or locking. The simplest strategy is to use master-slave asynchronous replication. All writes for a particular data item are routed to a central node that executes write operations sequentially. This makes the master a bottleneck, so it becomes crucial to partition data into independent shards to be scalable.</li>
<li>(H, Transactional Read Quorum Write Quorum and Read One Write All) The quorum approach can also be reinforced by transactional techniques to prevent write-write conflicts. A well-known approach is to use the two-phase commit protocol. However, two-phase commit is not perfectly reliable because coordinator failures can cause resource blocking. The PAXOS commit protocol [20, 21] is a more reliable alternative, but at the price of a performance penalty. A small step forward and we end up with the Read One Write All approach where writes update all replicas in a transactional fashion. This approach provides strong fault-tolerant consistency but at the price of performance and availability.</li>
</ul>
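<p>The quorum arithmetic used in techniques E through H can be summarized in a few lines. The helper below is only a sketch of the counting argument (any R replicas and any W replicas out of N share at least R+W-N members); the class and method names are hypothetical and do not correspond to any database API.</p>
<pre class="brush: java; title: ; notranslate">
/** Back-of-the-envelope quorum arithmetic for N replicas (illustrative helper only). */
final class Quorum {
    private Quorum() {}

    /** Minimum number of replicas shared by any read set of size r and any write set of size w. */
    static int minOverlap(int n, int r, int w) {
        return Math.max(0, r + w - n);
    }

    /** True if every read set is guaranteed to contain at least one freshly written replica. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w &gt; n;
    }

    /** True if two concurrent writers can never assemble write quorums on disjoint replicas. */
    static boolean writesAlwaysCollide(int n, int w) {
        return 2 * w &gt; n;
    }
}
</pre>
<p>For the configuration from the catalog figure (N=4, R=3, W=2) the read and write sets always share at least one replica, but two writers can still pick disjoint pairs of replicas because W is not greater than N/2.</p>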
<p>It is worth noting that the analysis above highlights a number of tradeoffs:</p>
<ul>
<li><strong>Consistency-availability tradeoff</strong>. This strict tradeoff is formalized by the CAP theorem. In case of network partition, a database should either stop all partitions except one or accept the possibility of data conflicts.</li>
<li><strong>Consistency-scalability tradeoff</strong>. One can see that even read-write consistency guarantees impose serious limitations on replica set scalability, and write-write conflicts can be handled in a relatively scalable fashion only in the Atomic Writes model. The Atomic Read-modify-write model introduces short causal dependencies between data and this immediately requires global locking to prevent conflicts. This shows that <em>even a slight spatial or causal dependency between data entries or operations could kill scalability</em>, so separation of data into independent shards and <a title="NoSQL Data Modeling Techniques" href="http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/">careful data modeling</a> is extremely important for scalability.</li>
<li><strong>Consistency-latency tradeoff</strong>. As shown above, there is a tendency towards Read-All and Write-All techniques when strong consistency or persistence guarantees are provided by a database. These guarantees clearly come at the cost of higher request latency. Quorum techniques are a middle ground.</li>
<li><strong>Failover-consistency/scalability/latency tradeoff</strong>. It is interesting that contention between failover and consistency/scalability/latency is not really severe. Failures of up to N/2 nodes can often be tolerated with a reasonable performance/consistency penalty. However, this tradeoff is visible, for example, in the difference between the two-phase commit and PAXOS protocols. Another example of this tradeoff is the ability to lift certain consistency guarantees like read-your-writes using sticky sessions, which complicate failover [22].</li>
</ul>
<h3>Anti-Entropy Protocols, Gossips</h3>
<p>Let us start our study with the following problem statement:</p>
<p><em>There is a set of nodes and each data item is replicated to a subset of nodes. Each node serves update requests even if there is no network connection to other nodes. Each node periodically synchronizes its state with other nodes in such a way that if no updates take place for a long time, all replicas will gradually become consistent. How should this synchronization be organized – when is synchronization triggered, how is a peer to synchronize with chosen, what is the data exchange protocol? Let us assume that two nodes can always merge their versions of data by selecting the newest version or preserving both versions for further application-side resolution.</em></p>
<p>This problem appears both in data consistency maintenance and in synchronization of a cluster state (propagation of the cluster membership information and so on). Although the problem above can be solved by means of a global coordinator that monitors a database and builds a global synchronization plan or schedule, decentralized databases take advantage of a more fault-tolerant approach. The main idea is to use well-studied epidemic protocols [7] that are relatively simple, provide a pretty good convergence time, and can tolerate almost any failures or network partitions. Although there are different classes of epidemic algorithms, we focus on anti-entropy protocols because of their intensive usage in NoSQL databases.</p>
<p>Anti-entropy protocols assume that synchronization is performed on a fixed schedule &#8211; every node regularly chooses another node at random or by some rule and exchanges database contents, resolving differences. There are three flavors of anti-entropy protocols: push, pull, and push-pull. The idea of the push protocol is to simply select a random peer and push the current state of data to it. In practice, it is quite silly to push the entire database, so nodes typically work in accordance with the protocol which is depicted in the figure below.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/gossips.png"><img class="aligncenter size-full wp-image-770" title="gossips" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/gossips.png?w=594"   /></a></p>
<p>Node A, the initiator of synchronization, prepares a digest (a set of checksums) which is a fingerprint of its data. Node B receives this digest, determines the difference between the digest and its local data and sends a digest of the difference back to A. Finally, A sends an update to B and B updates itself. Pull and push-pull protocols work similarly, as shown in the figure above.</p>
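<p>The push exchange described above can be sketched as follows. This is a simplified illustration, not the protocol of any particular database: the digest maps each key to a plain numeric version instead of a checksum, the newest version always wins, and the names <em>GossipNode</em>, <em>digest</em>, <em>staleKeys</em> and <em>pushTo</em> are hypothetical.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of one push round; versions are plain long counters, an assumption made for brevity. */
final class GossipNode {
    /** A value together with the version that decides which copy is newer. */
    private static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) { this.value = value; this.version = version; }
    }

    private final Map&lt;String, Versioned&gt; data = new HashMap&lt;&gt;();

    /** Stores the value unless the node already holds the same or a newer version. */
    void put(String key, String value, long version) {
        Versioned current = data.get(key);
        if (current == null || current.version &lt; version) {
            data.put(key, new Versioned(value, version));
        }
    }

    String get(String key) {
        Versioned v = data.get(key);
        return v == null ? null : v.value;
    }

    /** Step 1 (initiator): summarize local state as key to version instead of shipping values. */
    Map&lt;String, Long&gt; digest() {
        Map&lt;String, Long&gt; digest = new HashMap&lt;&gt;();
        for (Map.Entry&lt;String, Versioned&gt; e : data.entrySet()) {
            digest.put(e.getKey(), e.getValue().version);
        }
        return digest;
    }

    /** Step 2 (receiver): list the keys for which the initiator holds something newer. */
    List&lt;String&gt; staleKeys(Map&lt;String, Long&gt; remoteDigest) {
        List&lt;String&gt; stale = new ArrayList&lt;&gt;();
        for (Map.Entry&lt;String, Long&gt; e : remoteDigest.entrySet()) {
            Versioned local = data.get(e.getKey());
            if (local == null || local.version &lt; e.getValue()) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    /** Steps 1 to 3 combined: push only the entries that the peer reported as stale. */
    void pushTo(GossipNode peer) {
        for (String key : peer.staleKeys(digest())) {
            Versioned v = data.get(key);
            peer.put(key, v.value, v.version);
        }
    }
}
</pre>
<p>Note that a push round is one-directional: data that exists only on the receiver is not transferred to the initiator. A pull round would be the same exchange with the roles swapped, and push-pull would run both.</p>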
<p>Anti-entropy protocols provide reasonably good convergence time and scalability. The following figure shows simulation results for propagation of an update in a cluster of 100 nodes. On each iteration, each node contacts one randomly selected peer.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/epidemic-dynamics.png"><img class="aligncenter size-full wp-image-771" title="epidemic-dynamics" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/epidemic-dynamics.png?w=594"   /></a></p>
<p>One can see that the pull style provides better convergence than the push style, and this can be proven theoretically [7]. Also, push has a problem with a “convergence tail” when a small percentage of nodes remains unaffected during many iterations, although almost all nodes are already touched. The push-pull approach greatly improves efficiency in comparison with the original push or pull techniques, so it is typically used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.</p>
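<p>A simulation of this kind is easy to reproduce. The sketch below is not the code that produced the figure above; it is a toy model under stated assumptions: a single update starts on node 0, every node contacts exactly one random peer per round, and all exchanges in a round see the state at the beginning of that round. The class name <em>EpidemicSimulation</em> and the seed parameter are hypothetical.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Random;

/** Toy simulation of one update spreading through a cluster; all parameters are illustrative. */
final class EpidemicSimulation {
    enum Style { PUSH, PULL, PUSH_PULL }

    /** Returns the number of rounds until every node has the update, starting from node 0. */
    static int roundsToConverge(int nodes, Style style, long seed) {
        Random random = new Random(seed);
        boolean[] updated = new boolean[nodes];
        updated[0] = true;
        int updatedCount = 1;
        int rounds = 0;
        while (updatedCount &lt; nodes) {
            boolean[] next = updated.clone(); // exchanges within a round see the state at its start
            for (int node = 0; node &lt; nodes; node++) {
                int peer = random.nextInt(nodes - 1);
                if (peer &gt;= node) peer++; // pick a peer other than the node itself
                boolean push = style != Style.PULL &amp;&amp; updated[node];
                boolean pull = style != Style.PUSH &amp;&amp; updated[peer];
                if (push) next[peer] = true;
                if (pull) next[node] = true;
            }
            updated = next;
            updatedCount = 0;
            for (boolean u : updated) {
                if (u) updatedCount++;
            }
            rounds++;
        }
        return rounds;
    }
}
</pre>
<p>Calling <em>roundsToConverge(100, Style.PUSH, seed)</em> for a range of seeds and comparing the results with the PULL and PUSH_PULL styles gives a rough feel for the dynamics. In the push style the number of updated nodes can at most double per round, so a cluster of 100 nodes can never converge in fewer than 7 rounds.</p>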
<p>Although these techniques look pretty simple, there are many studies [5] regarding the performance of anti-entropy protocols under different constraints. One can leverage knowledge of the network topology to replace random peer selection with a more efficient schema [10], or adjust transmit rates or use advanced rules to select the data to be synchronized if the network bandwidth is limited [9]. Computation of a digest can also be challenging, so a database can maintain a journal of recent updates to facilitate digest computation.</p>
<h3>Eventually Consistent Data Types</h3>
<p>In the previous section we assumed that <em>two nodes can always merge their versions of data</em>. However, reconciliation of conflicting updates is not a trivial task and it is surprisingly difficult to make all replicas to converge to a semantically correct value. A well-known example is that deleted items can resurface in the Amazon Dynamo database [8].</p>
<p>Let us consider a simple example that illustrates the problem: a database maintains a logically global counter and each database node can serve increment/decrement operations. Although each node can maintain its own local counter as a single scalar value, these local counters cannot be merged by simple addition/subtraction. Consider an example: there are 3 nodes A, B, and C and the increment operation was applied 3 times, once per node. If A pulls the value from B and adds it to its local copy (A becomes 2), then C pulls from B (C becomes 2) and C pulls from A, then C ends up with the value 4, which is incorrect. One possible way to overcome these issues is to use a data structure similar to a vector clock [19] and maintain a pair of counters for each node [1]:</p>
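<pre class="brush: java; title: ; notranslate">
/** State-based counter: one slot per node for increments and one for decrements. */
class Counter {
   private final int[] plus;   // increments observed per node
   private final int[] minus;  // decrements observed per node
   private final int nodeId;   // index of the local node

   Counter(int nodeId, int totalNodes) {
      this.nodeId = nodeId;
      this.plus = new int[totalNodes];
      this.minus = new int[totalNodes];
   }

   void increment() {
      plus[nodeId]++;
   }

   void decrement() {
      minus[nodeId]++;
   }

   int get() {
      return sum(plus) - sum(minus);
   }

   // Element-wise maximum: merging the same state twice has no additional effect
   void merge(Counter other) {
      for (int i = 0; i &lt; plus.length; i++) {
         plus[i] = Math.max(plus[i], other.plus[i]);
         minus[i] = Math.max(minus[i], other.minus[i]);
      }
   }

   private static int sum(int[] values) {
      int total = 0;
      for (int v : values) total += v;
      return total;
   }
}
</pre>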
<p>Cassandra uses a very similar approach to provide counters as a part of its functionality [11]. It is possible to design more complex eventually consistent data structures that can leverage either state-based or operation-based replication principles. For example, [1] contains a catalog of such structures that includes:</p>
<ul>
<li>Counters (increment and decrement operations)</li>
<li>Sets (add and remove operations)</li>
<li>Graphs (addEdge/addVertex, removeEdge/removeVertex operations)</li>
<li>Lists (insertAt(position) and removeAt(position) operations)</li>
</ul>
<p>However, eventually consistent data types are often limited in functionality and impose performance overheads.</p>
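<p>The limitation in functionality can be seen in one of the simplest replicated sets, commonly known as a two-phase set. The sketch below is a minimal illustration written for this discussion, with the hypothetical class name <em>TwoPhaseSet</em>; removed items are kept forever as tombstones so that a merge cannot bring them back, and the price is that a removed item can never be added again.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.HashSet;
import java.util.Set;

/** Sketch of a state-based two-phase set: removals are remembered forever as tombstones. */
final class TwoPhaseSet {
    private final Set&lt;String&gt; added = new HashSet&lt;&gt;();
    private final Set&lt;String&gt; removed = new HashSet&lt;&gt;();

    void add(String item) {
        added.add(item);
    }

    /** Only items that this replica has seen can be removed. */
    void remove(String item) {
        if (added.contains(item)) removed.add(item);
    }

    boolean contains(String item) {
        return added.contains(item) &amp;&amp; !removed.contains(item);
    }

    /** Union of both sets; the order and the number of merges do not matter. */
    void merge(TwoPhaseSet other) {
        added.addAll(other.added);
        removed.addAll(other.removed);
    }
}
</pre>
<p>Because the tombstone always wins, a deleted item cannot resurface after replicas exchange state, but the set of tombstones only grows, which is an example of the overhead mentioned above.</p>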
<h2>Data Placement</h2>
<p>This section is dedicated to algorithms that control data placement inside a distributed database. These algorithms are responsible for mapping between data items and physical nodes, migration of data from one node to another and global allocation of resources like RAM throughout the database.</p>
<h3>Rebalancing</h3>
<p>Let us start with a simple protocol that is aimed at providing outage-free data migration between cluster nodes. This task arises in situations like cluster expansion (new nodes are added), failover (some node goes down), or rebalancing (data has become unevenly distributed across the nodes). Consider the situation that is depicted in section (A) of the figure below – there are three nodes and each node contains a portion of data (we assume a key-value data model without loss of generality) that is distributed across the nodes according to an arbitrary data placement policy:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/rebalancing.png"><img class="aligncenter size-full wp-image-772" title="rebalancing" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/rebalancing.png?w=594"   /></a></p>
<p>If one does not have a database that supports data rebalancing internally, one will probably deploy several instances of the database on each node, as shown in section (B) of the figure above. This allows one to perform a manual cluster expansion by turning a separate instance off, copying it to a new node, and turning it on, as shown in section (C). Although an automatic database is able to track each record separately, many systems including MongoDB, Oracle Coherence, and the upcoming Redis Cluster use the described technique internally, i.e. they group records into shards, which are the minimal units of migration, for the sake of efficiency. It is quite obvious that the number of shards should be quite large in comparison with the number of nodes to provide an even load distribution. An outage-free shard migration can be done according to a simple protocol that redirects the client from the exporting to the importing node during the migration of a shard. The following figure depicts a state machine for the get(key) logic as it is going to be implemented in Redis Cluster:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/redis-rebalancing-protocol.png"><img class="aligncenter size-full wp-image-773" title="redis-rebalancing-protocol" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/redis-rebalancing-protocol.png?w=594"   /></a></p>
<p>It is assumed that each node knows the topology of the cluster and is able to map any key to a shard and a shard to a cluster node. If the node determines that the requested key belongs to a local shard, then it looks it up locally (the upper square in the picture above). If the node determines that the requested key belongs to another node X, then it sends a permanent redirection command to the client (the lower square in the figure above). Permanent redirection means that the client is able to cache the mapping between the shard and the node. If a shard migration is in progress, the exporting and the importing nodes mark this shard accordingly and start to move its records, locking each record separately. The exporting node first looks up the key locally and, if it is not found, redirects the client to the importing node, assuming that the key has already been migrated. This is a one-time redirect and should not be cached. The importing node processes such redirects locally, but regular queries are permanently redirected until the migration is completed.</p>
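<p>The routing rules above can be condensed into a small sketch. It follows only the description in this article, not the actual Redis Cluster source or wire protocol; the class <em>ShardNode</em>, the <em>Reply</em> type, the shard count of 16 and the <em>oneTimeRedirect</em> flag are all hypothetical names introduced for illustration.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.HashMap;
import java.util.Map;

/** Sketch of the get(key) routing rules described above; all names are hypothetical. */
final class ShardNode {
    enum Kind { VALUE, NOT_FOUND, PERMANENT_REDIRECT, ONE_TIME_REDIRECT }

    /** Outcome of a lookup: either a value or the node the client should contact next. */
    static final class Reply {
        final Kind kind;
        final String payload; // the value, or the target node name for redirects
        Reply(Kind kind, String payload) { this.kind = kind; this.payload = payload; }
    }

    static final int SHARDS = 16; // assumed shard count, kept small for the example

    final String name;
    final Map&lt;Integer, String&gt; shardOwner = new HashMap&lt;&gt;();    // topology: shard to node
    final Map&lt;Integer, String&gt; exportingTo = new HashMap&lt;&gt;();   // local shards being moved away
    final Map&lt;Integer, String&gt; importingFrom = new HashMap&lt;&gt;(); // shards being moved to this node
    final Map&lt;String, String&gt; records = new HashMap&lt;&gt;();

    ShardNode(String name) { this.name = name; }

    static int shardOf(String key) {
        return Math.floorMod(key.hashCode(), SHARDS);
    }

    /** oneTimeRedirect is true when the client was sent here by an exporting node. */
    Reply get(String key, boolean oneTimeRedirect) {
        int shard = shardOf(key);
        if (importingFrom.containsKey(shard)) {
            // Until the migration completes, regular queries still belong to the exporting node.
            return oneTimeRedirect ? lookup(key)
                    : new Reply(Kind.PERMANENT_REDIRECT, importingFrom.get(shard));
        }
        if (!name.equals(shardOwner.get(shard))) {
            return new Reply(Kind.PERMANENT_REDIRECT, shardOwner.get(shard));
        }
        Reply local = lookup(key);
        if (local.kind == Kind.NOT_FOUND &amp;&amp; exportingTo.containsKey(shard)) {
            // The record may already have been moved; the client must not cache this redirect.
            return new Reply(Kind.ONE_TIME_REDIRECT, exportingTo.get(shard));
        }
        return local;
    }

    private Reply lookup(String key) {
        String value = records.get(key);
        return value == null ? new Reply(Kind.NOT_FOUND, null) : new Reply(Kind.VALUE, value);
    }
}
</pre>
<p>In this sketch the topology maps are filled in by hand. A client would cache the target of a PERMANENT_REDIRECT reply and retry there, while after a ONE_TIME_REDIRECT it would repeat the request on the importing node with the flag set and leave its cached mapping untouched.</p>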
<h3>Sharding and Replication in Dynamic Environments</h3>
<p>The next question we have to address is how to map records to physical nodes. A straightforward approach is to have a table of key ranges where each range is assigned to a node or to use procedures like <em>NodeID = hash(key) % TotalNodes</em>. However, modulus-based hashing does not explicitly address cluster reconfiguration because addition or removal of nodes causes complete data reshuffling throughout the cluster. As a result, it is difficult to handle replication and failover.</p>
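<p>The amount of reshuffling is easy to quantify. The following sketch counts how many keys change their node when a modulus-based cluster is resized. To keep the arithmetic exact, it assumes that the hashed keys are simply the integers 0, 1, 2 and so on, which stands in for a uniform hash function; the class name <em>ModuloPlacement</em> is hypothetical.</p>
<pre class="brush: java; title: ; notranslate">
/** Counts how many keys change nodes when a modulus-based cluster is resized (illustrative). */
final class ModuloPlacement {
    private ModuloPlacement() {}

    /** NodeID = hash(key) % TotalNodes, with the hash value passed in directly. */
    static int nodeFor(int hashedKey, int totalNodes) {
        return Math.floorMod(hashedKey, totalNodes);
    }

    /** Number of hashed keys in the range [0, keys) whose node changes after the resize. */
    static int movedKeys(int keys, int oldNodes, int newNodes) {
        int moved = 0;
        for (int hashedKey = 0; hashedKey &lt; keys; hashedKey++) {
            if (nodeFor(hashedKey, oldNodes) != nodeFor(hashedKey, newNodes)) moved++;
        }
        return moved;
    }
}
</pre>
<p>With these assumptions, growing a cluster from 3 to 4 nodes moves 9,000 out of 12,000 keys (75%), because a key stays in place only if its hash gives the same remainder for both divisors. An ideal placement would move only the quarter of the data that the new node has to take over.</p>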
<p>There are different ways to enhance the basic approach from the replication and failover perspectives. The most famous technique is consistent hashing. There are many descriptions of the consistent hashing technique on the web, so I provide a basic description just for the sake of completeness. The following figure depicts the basic ideas of consistent hashing:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/consistent-hashing.png"><img class="aligncenter size-full wp-image-774" title="consistent-hashing" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/consistent-hashing.png?w=594"   /></a></p>
<p>Consistent hashing is basically a mapping schema for a key-value store – it maps keys (hashed keys are typically used) to physical nodes. The space of hashed keys is an ordered space of binary strings of a fixed length, so each range of keys can be assigned to some node, as depicted in figure (A) for 3 nodes, namely, A, B, and C. To cope with replication, it is convenient to close the key space into a ring and traverse it clockwise until all replicas are mapped, as shown in figure (B). In other words, item Y should be placed on node B because its key corresponds to B’s range, the first replica should be placed on C, the second replica on A and so on.</p>
<p>The benefit of this schema is efficient addition and removal of a node because it causes data rebalancing only in neighboring sectors. As shown in figure (C), addition of the node D affects only item X but not Y. Similarly, removal (or failure) of the node B affects Y and the replica of X, but not X itself. However, as pointed out in [8], the dark side of this benefit is vulnerability to overloads – all the burden of rebalancing is handled by neighbors only and forces them to replicate high volumes of data. This problem can be alleviated by mapping each node not to a single range, but to a set of ranges, as shown in figure (D). This is a tradeoff – it avoids skew in loads during rebalancing while keeping the total rebalancing effort reasonably low in comparison with modulus-based mapping.</p>
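<p>A ring with several ranges per node can be sketched on top of a sorted map. The code below is a minimal illustration under stated assumptions: CRC32 stands in for a production-quality hash function, a key belongs to the first ring position at or after its own hash, and the names <em>HashRing</em>, <em>rangesPerNode</em> and <em>nodesFor</em> are hypothetical.</p>
<pre class="brush: java; title: ; notranslate">
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Sketch of a hash ring with several ranges per node; CRC32 is only a stand-in hash. */
final class HashRing {
    private final TreeMap&lt;Long, String&gt; ring = new TreeMap&lt;&gt;(); // ring position to node
    private final int rangesPerNode;

    HashRing(int rangesPerNode) {
        this.rangesPerNode = rangesPerNode;
    }

    private static long position(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Places the node at several pseudo-random positions so that it owns several ranges. */
    void addNode(String node) {
        for (int i = 0; i &lt; rangesPerNode; i++) {
            ring.put(position(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    /** Walks clockwise from the key's position and collects the first distinct nodes. */
    List&lt;String&gt; nodesFor(String key, int replicas) {
        List&lt;String&gt; result = new ArrayList&lt;&gt;();
        if (ring.isEmpty()) return result;
        Long start = ring.ceilingKey(position(key));
        if (start == null) start = ring.firstKey(); // wrap around the end of the key space
        List&lt;String&gt; clockwise = new ArrayList&lt;&gt;(ring.tailMap(start, true).values());
        clockwise.addAll(ring.headMap(start, false).values());
        for (String node : clockwise) {
            if (!result.contains(node)) result.add(node);
            if (result.size() == replicas) break;
        }
        return result;
    }
}
</pre>
<p>The first element returned by <em>nodesFor</em> is the primary node and the following elements are the replicas. When a node is added, a key either keeps its primary node or moves to the new node, so no data is shuffled between the existing nodes.</p>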
<p>Maintaining a complete and coherent view of the hashing ring may be problematic in very large deployments. Although this is not a typical problem for databases because their clusters are relatively small, it is interesting to study how data placement was combined with network routing in peer-to-peer networks. A good example is the Chord algorithm [2], which trades completeness of each node’s view of the ring for efficiency of query routing. The Chord algorithm is similar to consistent hashing in the sense that it uses the concept of a ring to map keys to nodes. However, each node maintains only a short list of peers at exponentially growing offsets on the logical ring (see the picture below). This allows one to locate a key in a few network hops using a kind of binary search:</p>
<p style="text-align:center;"><a href="http://highlyscalable.files.wordpress.com/2012/09/chord.png"><img class="aligncenter  wp-image-775" title="chord" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/chord.png?w=416&#038;h=300" height="300" width="416" /></a></p>
<p>This figure depicts a cluster of 16 nodes and illustrates how node A looks up a key that is physically located on node D. Part (A) depicts the route, and part (B) depicts the partial views of the ring held by nodes A, B, and C. More information about data replication in decentralized systems can be found in [15].</p>
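<p>The routing idea can be illustrated with a small simulation. This is a simplified sketch and not the full Chord protocol from [2]: there is no node join, stabilization, or failure handling, and the names <code>ChordRing</code>, <code>between</code>, and <code>lookup</code> are made up for this example. The simulation computes each node's short peer list on demand from the global list of nodes, whereas a real node would store only its own list.</p>

```python
import bisect

M = 4            # identifier bits, so the ring has 2**M = 16 positions
RING = 2 ** M


def between(x, a, b):
    """True if x lies on the clockwise arc (a, b] of the ring."""
    if a == b:
        return True  # the arc wraps around the whole ring
    return (x - a) % RING != 0 and (b - a) % RING >= (x - a) % RING


class ChordRing:
    def __init__(self, node_ids):
        self.ids = sorted(node_ids)

    def successor(self, point):
        """First node at or after `point`, moving clockwise."""
        i = bisect.bisect_left(self.ids, point % RING)
        return self.ids[i % len(self.ids)]

    def fingers(self, node):
        """Peers at exponentially growing offsets: node + 1, + 2, + 4, ..."""
        return [self.successor(node + 2 ** i) for i in range(M)]

    def lookup(self, start, key):
        """Return the list of nodes visited while routing `key` from `start`."""
        node, path = start, [start]
        while True:
            fingers = self.fingers(node)
            if between(key, node, fingers[0]):
                # the key falls between this node and its successor
                path.append(fingers[0])
                return path
            # otherwise jump to the farthest known peer that precedes the key
            next_node = fingers[0]
            for f in fingers:
                if between(f, node, key) and f != key:
                    next_node = f
            node = next_node
            path.append(node)


ring = ChordRing(range(16))
print(ring.lookup(0, 13))   # [0, 8, 12, 13]
```

<p>Each hop at least halves the remaining clockwise distance to the key, which is why the number of hops grows logarithmically with the size of the ring.</p>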
<h3>Multi-Attribute Sharding</h3>
<p>Although consistent hashing offers an efficient data placement strategy when data items are accessed by a primary key, things become much more complex when querying by multiple attributes is required. A straightforward approach (used, for example, in MongoDB) is to distribute data by primary key regardless of the other attributes. As a result, queries that restrict the primary key can be routed to a limited number of nodes, but other queries have to be processed by all nodes in the cluster. This skew in query efficiency leads us to the following problem statement:</p>
<p><em>There is a set of data items, and each item has a set of attributes along with their values. Is there a data placement strategy that limits the number of nodes that must be contacted to process a query that restricts an arbitrary subset of the attributes?</em></p>
<p>One possible solution was implemented in the HyperDex database. The basic idea is to treat each attribute as an axis in a multidimensional space and map blocks in the space to physical nodes. A query corresponds to a hyperplane that intersects a subset of blocks in the space, so only this subset of blocks should be touched during the query processing. Consider the following example from [6]:</p>
<p style="text-align:center;"><a href="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding.png"><img class="aligncenter  wp-image-776" title="hyperspace-sharding" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding.png?w=366&#038;h=296" height="296" width="366" /></a></p>
<p>Each data item is a user account with three attributes: First Name, Last Name, and Phone Number. These attributes are treated as a three-dimensional space, and one possible data placement strategy is to map each octant to a dedicated physical node. Queries like “First Name = John” correspond to a plane that intersects 4 octants, hence only 4 nodes need to be involved in processing. Queries that restrict two attributes correspond to a line that intersects two octants, as shown in the figure above, hence only 2 nodes need to be involved in processing.</p>
<p>The problem with this approach is that the number of blocks in the space grows exponentially with the number of attributes. As a result, queries that restrict only a few attributes tend to involve many blocks and, consequently, many servers. One can alleviate this by splitting a data item with many attributes into several sub-items and mapping them to several independent subspaces instead of one large hyperspace:</p>
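<p>The octant example can be reproduced with a short sketch. It is not HyperDex code: the attribute names, the one-bit-per-axis hashing in <code>axis_half</code>, and the representation of a node as a tuple of three bits are assumptions chosen to mirror the figure. An attribute that a query does not restrict leaves both halves of its axis open, so every unrestricted attribute doubles the number of nodes to contact.</p>

```python
import hashlib
import itertools

ATTRIBUTES = ("first_name", "last_name", "phone")


def axis_half(value):
    """Hash an attribute value to one of the two halves of its axis."""
    return hashlib.md5(value.encode("utf-8")).digest()[0] % 2


def node_for(item):
    """An item lives in exactly one octant, identified by three bits."""
    return tuple(axis_half(item[a]) for a in ATTRIBUTES)


def nodes_for_query(query):
    """Octants that may hold matches for a query on a subset of attributes."""
    choices = [
        (axis_half(query[a]),) if a in query else (0, 1)
        for a in ATTRIBUTES
    ]
    return list(itertools.product(*choices))


user = {"first_name": "John", "last_name": "Smith", "phone": "555-0100"}
print(len(nodes_for_query({"first_name": "John"})))                        # 4
print(len(nodes_for_query({"first_name": "John", "last_name": "Smith"})))  # 2
print(node_for(user) in nodes_for_query({"first_name": "John"}))           # True
```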
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding-2.png"><img class="aligncenter size-full wp-image-777" title="hyperspace-sharding-2" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding-2.png?w=594"   /></a></p>
<p>This provides a narrower query-to-nodes mapping but complicates coordination, because one data item becomes scattered across several independent subspaces with their own physical locations, so transactional updates become necessary. More information about this technique and its implementation details can be found in [6].</p>
<h3>Passivated Replicas</h3>
<p>Some applications with heavy random reads require all data to fit in RAM. In these cases, sharding with independent master-slave replication of each shard (as in MongoDB) typically requires at least twice the amount of RAM, because each chunk of data is stored both on a master and on a slave. A slave should have the same amount of RAM as its master in order to replace the master in case of failure. However, shards can be placed in such a way that the amount of required RAM is reduced, assuming that the system tolerates short outages or performance degradation in case of failures.</p>
<p>The following figure depicts 4 nodes that host 16 shards; primary copies are stored in RAM and replicas are stored on disk:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/replica-passivation.png"><img class="aligncenter size-full wp-image-778" title="replica-passivation" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/replica-passivation.png?w=594&#038;h=429" height="429" width="594" /></a></p>
<p>The gray arrows highlight replication of shards from node #2. Shards from the other nodes are replicated symmetrically. The red arrows depict how the passivated replicas will be loaded into RAM if node #2 fails. Even distribution of replicas throughout the cluster allows one to keep only a small memory reserve that is used to activate replicas in case of failure. In the figure above, the cluster is able to survive a single node failure with only 1/3 of RAM in reserve. It is worth noting that replica activation (loading from disk into RAM) takes some time and causes temporary performance degradation or unavailability of the corresponding data during failure recovery.</p>
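<p>The arithmetic behind the memory reserve can be checked with a toy placement. The sketch below is only an illustration: the node names, the round-robin placement in <code>place_replicas</code>, and the choice of 3 primary shards per node (picked so that the replicas divide evenly among the 3 peers) are assumptions and are not taken from the figure.</p>

```python
from collections import Counter


def place_replicas(nodes, shards_per_node):
    """Return {shard: (primary_node, replica_node)} with replicas spread evenly."""
    placement = {}
    for p, primary in enumerate(nodes):
        others = nodes[p + 1:] + nodes[:p]
        for s in range(shards_per_node):
            shard = "%s-%d" % (primary, s)
            # on-disk replicas go round-robin to the other nodes
            placement[shard] = (primary, others[s % len(others)])
    return placement


def activations_after_failure(placement, failed):
    """Count the on-disk replicas each surviving node has to load into RAM."""
    return Counter(replica for primary, replica in placement.values()
                   if primary == failed)


nodes = ["n1", "n2", "n3", "n4"]
shards_per_node = 3
placement = place_replicas(nodes, shards_per_node)
load = activations_after_failure(placement, "n2")
# extra RAM a survivor needs, relative to the RAM used by its own primaries
reserve = max(load.values()) / shards_per_node
print(dict(load), reserve)
```

<p>Under these assumptions every surviving node activates one extra shard on top of its three primaries, which corresponds to a reserve of one third. With conventional in-memory slaves, each node would need a reserve equal to its full primary data set.</p>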
<h2>System Coordination</h2>
<p>In this section we discuss a couple of techniques that relate to system coordination. Distributed coordination is an extremely large area that has been a subject of intensive study for several decades. In this article we, of course, consider only a couple of applied techniques. A comprehensive description of distributed locking, consensus protocols, and other fundamental primitives can be found in numerous books and web resources [17, 18, 21].</p>
<h3>Failure Detection</h3>
<p>Failure detection is a fundamental component of any fault-tolerant distributed system. Practically all failure detection protocols are based on heartbeat messages, which are a pretty simple concept: monitored components periodically send a heartbeat message to the monitoring process (or the monitoring process polls the monitored components), and the absence of heartbeat messages for a long time is interpreted as a failure. However, real distributed systems impose a number of additional requirements that should be addressed:</p>
<ul>
<li>Automatic adaptation. Failure detection should be robust to temporary network failures and delays and to dynamic changes in the cluster topology, workload, or bandwidth. This is a fundamentally difficult problem because there is no way to distinguish a crashed process from a slow one [13]. As a result, failure detection is always a tradeoff between the failure detection time (how long it takes to detect a real failure) and the false-alarm probability. The parameters of this tradeoff should be adjusted dynamically and automatically.</li>
<li>Flexibility. At first glance, a failure detector should produce a boolean output: a monitored process is considered to be either alive or dead. Nevertheless, it can be argued that a boolean output is insufficient in practice. Let us consider an example from [12] that resembles Hadoop MapReduce. There is a distributed application that consists of a master and several workers. The master has a list of jobs and submits them to the workers. The master can distinguish different “degrees of failure”. If the master starts to suspect that some worker has gone down, it stops submitting new jobs to this worker. Next, as time goes by and no heartbeat messages arrive, the master resubmits the jobs that were running on this worker to other workers. Finally, the master becomes completely confident that the worker is down and releases all corresponding resources.</li>
<li>Scalability and robustness. Failure detection as a system process should scale as well as the system itself does. It should also be robust and consistent, i.e. all nodes in the system should have a consistent view of running and failed processes even in case of communication problems.</li>
</ul>
<p>A possible way to address the first two requirements is the so-called Phi Accrual Failure Detector [12], which is used with some modifications in Cassandra [16]. The basic workflow is as follows (see the figure below):</p>
<ul>
<li>For each monitored resource, the detector collects the arrival times Ti of heartbeat messages.</li>
<li>The mean and variance of the intervals between recent arrivals are continuously computed (over a sliding window of size W) in the Statistics Estimation block.</li>
<li>Assuming that the distribution of these intervals is known (the figure below contains a formula for the normal distribution), one can compute the probability of observing the current heartbeat delay (the difference between the current time t_now and the last arrival time Tc). This probability is a measure of confidence in a failure. As suggested in [12], this value can be rescaled using a logarithmic function for the sake of usability. In this case, output 1 means that the likelihood of a mistake is about 10%, output 2 means 1%, and so on.</li>
</ul>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/phi-accrual-failure-detector.png"><img class="aligncenter size-full wp-image-779" title="phi-accrual-failure-detector" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/phi-accrual-failure-detector.png?w=594"   /></a></p>
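<p>The workflow above can be sketched as follows. This is a minimal illustration under the normal-distribution assumption and is not the code used in Cassandra; the class name <code>PhiAccrualDetector</code>, the <code>window_size</code> parameter, and the small lower bound on the standard deviation are assumptions of this example. The output is the negative decimal logarithm of the probability that a heartbeat arrives later than the delay observed so far.</p>

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size=100):
        # sliding window with the last W intervals between heartbeats
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival time of a heartbeat message."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level for the delay observed at time `now`."""
        if not self.intervals:
            return 0.0  # not enough data to suspect anything yet
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        variance = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(variance), 1e-3)  # guard against zero variance
        delay = now - self.last_arrival
        # probability that an interval is longer than `delay` (normal tail)
        p_later = 0.5 * math.erfc((delay - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.1, 2.9, 4.0, 5.1, 6.0):
    detector.heartbeat(t)
# the suspicion level grows as the silence after the last heartbeat gets longer
for now in (6.5, 7.0, 7.5):
    print(now, round(detector.phi(now), 2))
```

<p>The caller, rather than the detector, decides which value of phi is high enough to act on, so the master from the example above can use a low threshold to stop submitting jobs and a higher one to release resources.</p>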
<p>The scalability requirement can be addressed to a significant degree by hierarchically organized monitoring zones, which prevent flooding of the network with heartbeat messages [14], and by synchronization of different zones via a gossip protocol or a central fault-tolerant repository. This approach is illustrated below (there are two zones, and all six failure detectors talk to each other via a gossip protocol or a robust repository like ZooKeeper):</p>
<p style="text-align:center;"><a href="http://highlyscalable.files.wordpress.com/2012/09/monitoring-zones.png"><img class="aligncenter  wp-image-780" title="monitoring-zones" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/monitoring-zones.png?w=336&#038;h=293" height="293" width="336" /></a></p>
<h3>Coordinator Election</h3>
<p>Coordinator election is an important technique for databases with strict consistency guarantees. First, it allows one to organize failover of a master node in master-slave systems. Second, it allows one to prevent write-write conflicts in case of network partition by terminating partitions that do not include a majority of nodes.</p>
<p>The bully algorithm is a relatively simple approach to coordinator election. MongoDB uses a version of this algorithm to elect leaders in replica sets. The main idea of the bully algorithm is that each member of the cluster can declare itself a coordinator and announce this claim to the other nodes. Other nodes can either accept this claim or reject it by entering the competition for being a coordinator. A node that does not face any further contention becomes the coordinator. Nodes use some attribute to decide who wins and who loses. This attribute can be a static ID or some recency metric like the last transaction ID (the most up-to-date node wins).</p>
<p>An example of the bully algorithm execution is shown in the figure below. A static ID is used as the comparison metric: the node with the greater ID wins.</p>
<ol>
<li>Initially, five nodes are in the cluster and node 5 is the globally accepted coordinator.</li>
<li>Let us assume that node 5 goes down and nodes 3 and 2 detect this simultaneously. Both nodes start the election procedure and send election messages to the nodes with greater IDs.</li>
<li>Node 4 kicks nodes 2 and 3 out of the competition by sending OK. Node 3 kicks out node 2.</li>
<li>Imagine that node 1 now detects the failure of node 5 and sends an election message to all nodes with greater IDs.</li>
<li>Nodes 2, 3, and 4 kick out node 1.</li>
<li>Node 4 sends an election message to node 5.</li>
<li>Node 5 does not respond, so node 4 declares itself the coordinator and announces this fact to all other peers.</li>
</ol>
<p><a href="http://highlyscalable.files.wordpress.com/2012/09/bully-algorithm.png"><img class="aligncenter size-full wp-image-781" title="bully-algorithm" alt="" src="http://highlyscalable.files.wordpress.com/2012/09/bully-algorithm.png?w=594&#038;h=506" height="506" width="594" /></a></p>
<p>The coordinator election process can count the number of nodes that participate in it and check that a majority of the cluster nodes (more than half) take part. This guarantees that only one partition can elect a coordinator in case of a network partition.</p>
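<p>The election walk-through and the majority check can be condensed into a small simulation. This is a sketch and not MongoDB's election code: message passing and timeouts are replaced by set lookups, and the function name <code>bully_election</code> and its parameters are made up for this example. Static IDs are used as the comparison metric.</p>

```python
def bully_election(initiator, cluster, alive):
    """Return the elected coordinator ID, or None if there is no majority.

    cluster -- IDs of all configured nodes
    alive   -- IDs that can be reached in the initiator's partition
    """
    # refuse to elect anyone unless more than half of the cluster is reachable
    if len(cluster) >= 2 * len(alive):
        return None
    candidate = initiator
    while True:
        # the candidate sends an election message to every node with a greater ID
        responders = [n for n in cluster if n > candidate and n in alive]
        if not responders:
            # nobody answered OK, so the candidate declares itself the coordinator
            return candidate
        # every responder kicks the candidate out and runs its own election;
        # following any of them leads to the same winner, here we follow the lowest
        candidate = min(responders)


cluster = [1, 2, 3, 4, 5]
print(bully_election(2, cluster, alive={1, 2, 3, 4}))  # 4
print(bully_election(1, cluster, alive={1, 2}))        # None
```

<p>In the second call the partition contains only two of five nodes, so it does not elect a coordinator even though node 2 would otherwise win.</p>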
<h2>References</h2>
<ol>
<li><a href="http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf">M. Shapiro et al. A Comprehensive Study of Convergent and Commutative Replicated Data Types</a></li>
<li><a href="http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">I. Stoica et al. Chord: A Scalable Peer-to-peer  Lookup Service  for Internet Applications</a></li>
<li><a href="http://www.ssrc.ucsc.edu/Papers/honicky-ipdps04.pdf">R. J. Honicky, E.L.Miller. Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution</a></li>
<li><a href="http://cs-www.cs.yale.edu/homes/shah/pubs/thesis.pdf">G. Shah. Distributed Data Structures for Peer-to-Peer Systems</a></li>
<li><a href="http://sbrc2010.inf.ufrgs.br/resources/presentations/tutorial/tutorial-montresor.pdf">A. Montresor, Gossip Protocols for Large-Scale Distributed Systems</a></li>
<li><a href="http://hyperdex.org/papers/hyperdex.pdf">R. Escriva, B. Wong, E.G. Sirer. HyperDex: A Distributed, Searchable Key-Value Store</a></li>
<li><a href="http://net.pku.edu.cn/~course/cs501/2009/reading/1987-SPDC-Epidemic%20algorithms%20for%20replicated%20database%20maintenance.pdf">A. Demers et al. Epidemic Algorithms for Replicated Database Maintenance</a></li>
<li><a href="http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf">G. DeCandia, et al. Dynamo: Amazon’s Highly Available Key-value Store</a></li>
<li><a href="http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf">R. van Renesse et al. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols</a></li>
<li><a href="http://www.hcs.ufl.edu/pubs/CC2000.pdf">S. Ranganathan et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters</a></li>
<li><a href="http://www.slideshare.net/kakugawa/distributed-counters-in-cassandra-cassandra-summit-2010">http://www.slideshare.net/kakugawa/distributed-counters-in-cassandra-cassandra-summit-2010</a></li>
<li><a href="http://cassandra-shawn.googlecode.com/files/The%20Phi%20Accrual%20Failure%20Detector.pdf">N. Hayashibara, X. Defago, R. Yared, T. Katayama.  The Phi Accrual Failure Detector</a></li>
<li><a href="http://www.cs.mcgill.ca/~carl/impossible.pdf">M.J. Fischer, N.A. Lynch, and M.S. Paterson. Impossibility of Distributed Consensus with One Faulty Process</a></li>
<li><a href="http://ddg.jaist.ac.jp/pub/HCK02.pdf">N. Hayashibara, A. Cherif, T. Katayama. Failure Detectors for Large-Scale Distributed Systems</a></li>
<li>M. Leslie, J. Davies, and T. Huffman. A Comparison Of Replication Strategies for Reliable Decentralised Storage</li>
<li><a href="http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">A. Lakshman, P.Malik. Cassandra &#8211; A Decentralized Structured Storage System</a></li>
<li>N. A. Lynch.  Distributed Algorithms</li>
<li>G. Tel. Introduction to Distributed Algorithms</li>
<li><a href="http://basho.com/blog/technical/2010/04/05/why-vector-clocks-are-hard/">http://basho.com/blog/technical/2010/04/05/why-vector-clocks-are-hard/</a></li>
<li><a href="http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf">L. Lamport. Paxos Made Simple</a></li>
<li><a href="http://www.cs.duke.edu/courses/fall07/cps212/consensus.pdf">J. Chase. Distributed Systems, Failures, and Consensus </a></li>
<li><a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">W. Vogels. Eventually Consistent – Revisited</a></li>
<li><a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf">J. C. Corbett et al. Spanner: Google’s Globally-Distributed Database</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/09/featured.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/featured.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/consistency-plot-3.png" medium="image">
			<media:title type="html">consistency-plot-3</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/consistency-catalog.png" medium="image">
			<media:title type="html">consistency-catalog</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/consistency-concurrent-quorum.png" medium="image">
			<media:title type="html">consistency-concurrent-quorum</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/gossips.png" medium="image">
			<media:title type="html">gossips</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/epidemic-dynamics.png" medium="image">
			<media:title type="html">epidemic-dynamics</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/rebalancing.png" medium="image">
			<media:title type="html">rebalancing</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/redis-rebalancing-protocol.png" medium="image">
			<media:title type="html">redis-rebalancing-protocol</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/consistent-hashing.png" medium="image">
			<media:title type="html">consistent-hashing</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/chord.png" medium="image">
			<media:title type="html">chord</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding.png" medium="image">
			<media:title type="html">hyperspace-sharding</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/hyperspace-sharding-2.png" medium="image">
			<media:title type="html">hyperspace-sharding-2</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/replica-passivation.png" medium="image">
			<media:title type="html">replica-passivation</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/phi-accrual-failure-detector.png" medium="image">
			<media:title type="html">phi-accrual-failure-detector</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/monitoring-zones.png" medium="image">
			<media:title type="html">monitoring-zones</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/09/bully-algorithm.png" medium="image">
			<media:title type="html">bully-algorithm</media:title>
		</media:content>
	</item>
		<item>
		<title>Speeding Up Hadoop Builds Using Distributed Unit Tests</title>
		<link>http://highlyscalable.wordpress.com/2012/08/14/speeding-up-hadoop-builds-distributed-parallel-unit-tests-on-jenkins/</link>
		<comments>http://highlyscalable.wordpress.com/2012/08/14/speeding-up-hadoop-builds-distributed-parallel-unit-tests-on-jenkins/#comments</comments>
		<pubDate>Tue, 14 Aug 2012 15:46:11 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Case Studies]]></category>
		<category><![CDATA[Jenkins]]></category>
		<category><![CDATA[ci]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[jenkins]]></category>
		<category><![CDATA[junit]]></category>
		<category><![CDATA[maven]]></category>
		<category><![CDATA[pig]]></category>
		<category><![CDATA[unit test]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=678</guid>
		<description><![CDATA[We recently worked with one of the Hadoop vendors on the continuous integration system for Hadoop core and other Hadoop-related projects like Pig, Hive, HBase. One of the challenges we faced was very slow automatic tests &#8212; full unit/integration test suite takes more than 2 hours for Hadoop core and more than 9 hours for [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=highlyscalable.wordpress.com&#038;blog=30930683&#038;post=678&#038;subd=highlyscalable&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>We recently worked with one of the Hadoop vendors on a continuous integration system for Hadoop core and other Hadoop-related projects like Pig, Hive, and HBase. One of the challenges we faced was very slow automated tests: the full unit/integration test suite takes more than 2 hours for Hadoop core and more than 9 hours for Apache Pig. Although there are different ways to alleviate this problem (dividing tests into suites, optimizing tests by tweaking timeouts and sleeps, etc.), we decided to start with a quick solution that immediately and drastically improves CI efficiency: distributed parallel test execution. In this article I describe the technique we used to speed up a Pig build from 9 hours to 1 hour 30 minutes using 6 Jenkins nodes. This technique is generic and can be considered a general way to speed up Maven or Ant builds on a Jenkins CI server or other CI systems.</p>
<h2>Solution Overview</h2>
<p>Basically, the problem boils down to the following. There are a number of Jenkins slave nodes, and we have to split all JUnit tests into batches, run all batches in parallel on the available slaves, and aggregate the test results into a single report. The last two tasks (parallel execution and aggregation) can be solved using built-in Jenkins functionality, namely multi-configuration jobs, also known as matrix builds. A multi-configuration job allows one to configure a standard Jenkins job and specify a set of slave servers on which this job is to be executed. Jenkins is capable of running an instance of the job on all specified slaves in parallel, passing the slave ID as a build parameter, and aggregating JUnit test results into a single report. On our build server, the configuration matrix for a job is as simple as this:<br />
<a href="http://highlyscalable.files.wordpress.com/2012/08/configuration-matrix.png"><img src="http://highlyscalable.files.wordpress.com/2012/08/configuration-matrix.png?w=594" alt="" title="configuration-matrix"   class="aligncenter size-full wp-image-708" /></a></p>
<p>Test splitting is a trickier task. A straightforward approach is to obtain a list of test cases and cut it into equal pieces. This is definitely better than nothing, but execution time can vary significantly from batch to batch, especially in the presence of long-running tests. Our preliminary experiments showed that parallelizing Pig builds in this way is not very efficient: some batches can run two or more times slower than others. To cope with this issue, we decided to collect statistics about test durations and assemble batches such that the difference between their expected execution times is minimal and, consequently, the total build time is minimal. The next section is devoted to the implementation details of this approach.</p>
<h2>Build Steps on Jenkins</h2>
<p>One of our goals was to keep the implementation as simple as possible, so we came up with a design where each node executes a number of steps sequentially (as a single script) and independently of the other nodes. The only information this script receives from the Jenkins server is a node ID. Each instance of the multi-configuration job on each node includes the following steps:</p>
<ol>
<li>A list of available JUnit tests is obtained.</li>
<li>Statistics about previous test runs are loaded from the central store.</li>
<li>Available tests are divided into batches according to the statistics.</li>
<li>A batch is selected according to the node ID and submitted to ant/maven as a build parameter.</li>
<li>JUnit reports are parsed, and test statistics are extracted and saved to the central shared store.</li>
</ol>
<p>In this section a Python implementation of each step is shown in a simplified form; details like error handling and logging are omitted for the sake of readability.</p>
<p>First, we prepare an initial list of tests by scanning sources in the workspace:</p>
<pre class="brush: python; title: ; notranslate">
import fnmatch
import os

#[ COLLECT A TEST POOL
test_pool = set()
for root, dirnames, filenames in os.walk(&quot;./test&quot;):
   for filename in fnmatch.filter(filenames, 'Test*.java'):
      # the test name is the file name without the .java extension
      test_pool.add(os.path.splitext(filename)[0])
#]
</pre>
<p>Second, we load test statistics from the shared store. We use MySQL as the database, but one can also use a version control system to store the statistics along with the sources. The statistics are initially empty.</p>
<pre class="brush: python; title: ; notranslate">
import MySQLdb

#[ LOAD TEST STATISTICS
job_name = &quot;Pig_gd-branch-0.9&quot;
db = MySQLdb.connect(...)
cursor = db.cursor()
# query parameters are passed as a sequence, hence the one-element tuple
cursor.execute(&quot; SELECT test_name, duration FROM test_stat WHERE job_name=%s &quot;, (job_name,))
test_statistics_data = cursor.fetchall()
test_statistics = dict(test_statistics_data) # maps a test name to its last known duration
db.close()
#]
</pre>
<p>The third step is a scheduling step that selects the tests that have to be executed on the current node. We have to split the test pool into a fixed number of disjoint batches such that the difference between their execution times is minimal. We don&#8217;t need an optimal solution; a simple <a href="http://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm" title="greedy algorithm">greedy algorithm</a> is good enough in practice. This step produces a set of files with the test names:</p>
<pre class="brush: python; title: ; notranslate">
import random

# SPLIT_FACTOR (the number of batches) and base_dir are set elsewhere in the script
random.seed(1234) # fix seed to produce identical results on all nodes

#[ PREPARE SPLITS, GREEDY ALGORITHM
test_splits = [ [] for i in range(SPLIT_FACTOR) ]
test_times = [0] * SPLIT_FACTOR
# longest tests first; ties are broken by name so that all nodes see the same order
for test in sorted(test_pool, key=lambda test : (-test_statistics.get(test, 0), test)):
    # select a split with minimal expected execution time
    split_index = test_times.index(min(test_times))
    test_duration = test_statistics.get(test, 0)
    if not test_duration: # if statistics is unavailable, select a random split
        split_index = random.randint(0, SPLIT_FACTOR - 1)
    test_splits[split_index].append(test)
    test_times[split_index] += test_duration

for split_id, split in enumerate(test_splits):
    with open(base_dir + 'upar-split.%d' % split_id, 'w') as f:
        for test in split: # write ant's include mask to a file
            f.write(&quot;**/%s.java\n&quot; % test)
#]
</pre>
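<p>To see why the greedy step helps, it can be compared with cutting the list into equal pieces on a toy example. The sketch below is a self-contained restatement of the idea and not the production script; the function names and the test durations (in seconds) are hypothetical.</p>

```python
def greedy_split(durations, batches):
    """Assign each test, longest first, to the currently lightest batch."""
    splits = [[] for _ in range(batches)]
    totals = [0] * batches
    for test in sorted(durations, key=lambda t: (-durations[t], t)):
        lightest = totals.index(min(totals))
        splits[lightest].append(test)
        totals[lightest] += durations[test]
    return splits, totals


def equal_pieces(durations, batches):
    """Cut the alphabetical list of tests into pieces of equal length."""
    tests = sorted(durations)
    size = -(-len(tests) // batches)  # ceiling division
    splits = [tests[i * size:(i + 1) * size] for i in range(batches)]
    totals = [sum(durations[t] for t in split) for split in splits]
    return splits, totals


# hypothetical durations in seconds
durations = {"TestA": 600, "TestB": 500, "TestC": 100,
             "TestD": 100, "TestE": 50, "TestF": 50}
print(equal_pieces(durations, 2)[1])  # [1200, 200]
print(greedy_split(durations, 2)[1])  # [700, 700]
```

<p>The build finishes when the slowest batch finishes, so in this example the greedy split would take 700 seconds instead of 1200.</p>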
<p>As soon as the splits are ready, the slave name is mapped to a batch ID and the build is executed for this batch (fortunately, Pig&#8217;s build system allows one to submit a file with test filters as a build parameter). A similar thing can be done for Maven builds. The following piece of bash code does this part of the work:</p>
<pre class="brush: bash; title: ; notranslate">
case $SLAVEID in
Slave-Alpha)   JOBID=0;; # Slave-Alpha is a Jenkins node ID
Slave-Beta)    JOBID=1;;
Slave-Gamma)   JOBID=2;;
Slave-Delta)   JOBID=3;;
Slave-Epsilon) JOBID=4;;
Slave-Zeta)    JOBID=5;;
*) echo &quot;Unknown slave: $SLAVEID&quot;; exit 1;; # no batch is assigned to this node
esac
ant -Dtest.junit.output.format=xml clean test -Dtest.all.file=upar-split.${JOBID}
</pre>
<p>The final step is to parse test results and update test statistics in the DB. This is also quite trivial:</p>
<pre class="brush: python; title: ; notranslate">
import glob
import os
import re

#[ UPDATE TEST STATISTICS
db = MySQLdb.connect(...)
cursor = db.cursor()
path = &quot;./build/test/logs/&quot;
for infile in glob.glob( os.path.join(path, 'TEST-*.xml') ):
   with open(infile) as f:
      text = f.read()
   # the suite duration is stored in the time attribute of the testsuite element
   duration = re.search(
        r&quot;&lt;testsuite[^&gt;]*time=\&quot;([0-9\.]+)\&quot;&quot;,
        text, flags=re.DOTALL)
   # the report name ends with the test class name; drop the prefix, the package and the extension
   test_name = os.path.basename(infile)[len('TEST-'):-len('.xml')].split('.')[-1]
   cursor.execute(
        &quot;REPLACE INTO test_stat(job_name,test_name,duration) VALUES(%s,%s,%s)&quot;,
        (job_name, test_name, float(duration.group(1))) )
db.commit() # MySQLdb does not autocommit by default
db.close()
#]
</pre>
<h2>Results</h2>
<p>According to our experiments, the described technique allows one to achieve a very even load distribution among the nodes and, consequently, minimize the total build time. An example of the build duration distribution for Pig build is shown in the screenshot below (monolithic build takes more than 9 hours):<br />
<a href="http://highlyscalable.files.wordpress.com/2012/08/test-duration.png"><img src="http://highlyscalable.files.wordpress.com/2012/08/test-duration.png?w=594" alt="" title="test-duration"   class="aligncenter size-full wp-image-717" /></a><br />
It should be noted that the real production implementation takes care of a few more issues:</p>
<ul>
<li><em>Split stability.</em> Jenkins nodes can differ in performance, and large changes in the test-to-node mapping can lead to unpredictable results. For this reason it&#8217;s preferable to have a relatively stable mapping procedure, i.e. changes in execution time for a few tests should not lead to completely new batches. This can be achieved by using thresholds and deliberate coarsening of the statistics that are used in the computations.</li>
<li><em>Cohesion of artifacts.</em> All instances of the multi-configuration job are executed in parallel and work independently. It is theoretically possible for two nodes to check out different revisions of artifacts or sources and, consequently, start with different test pools. This can be alleviated in multiple ways, including distribution of the test pool via the central store.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/08/14/speeding-up-hadoop-builds-distributed-parallel-unit-tests-on-jenkins/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/08/featured.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/08/featured.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/08/configuration-matrix.png" medium="image">
			<media:title type="html">configuration-matrix</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/08/test-duration.png" medium="image">
			<media:title type="html">test-duration</media:title>
		</media:content>
	</item>
		<item>
		<title>Fast Intersection of Sorted Lists Using SSE Instructions</title>
		<link>http://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/</link>
		<comments>http://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/#comments</comments>
		<pubDate>Tue, 05 Jun 2012 12:58:19 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[simd]]></category>
		<category><![CDATA[sse]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=516</guid>
		<description><![CDATA[Intersection of sorted lists is a cornerstone operation in many applications including search engines and databases because indexes are often implemented using different types of sorted structures. At GridDynamics, we recently worked on a custom database for realtime web analytics where fast intersection of very large lists of IDs was a must for good performance. From a functional [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=highlyscalable.wordpress.com&#038;blog=30930683&#038;post=516&#038;subd=highlyscalable&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Intersection of sorted lists is a cornerstone operation in many applications including search engines and databases because indexes are often implemented using different types of sorted structures. At GridDynamics, we recently worked on a custom database for realtime web analytics where fast intersection of very large lists of IDs was a must for good performance. From a functional point of view, we mainly needed standard boolean query processing, so it was possible to use Solr/Lucene as a platform. However, it was interesting to evaluate the performance of alternative approaches. In this article I describe several useful techniques that are based on SSE instructions and provide results of performance testing for Lucene, Java, and C implementations. I&#8217;d like to mention that in this study we focused on the general case where the selectivity of the intersection is low or unknown and optimization techniques like <a href="http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html">skip lists</a> are not necessarily beneficial.</p>
<h2>Scalar Intersection</h2>
<p>Our starting point is a simple element-by-element intersection algorithm (also known as Zipper). Its implementation in C is shown below and does not require lengthy explanations:</p>
<pre class="brush: cpp; title: ; notranslate">
#define int32 unsigned int

// A, B - operands, sorted arrays
// s_a, s_b - sizes of A and B
// C - result buffer
// return size of the result C
size_t intersect_scalar(int32 *A, int32 *B, size_t s_a, size_t s_b, int32 *C) {
	size_t i_a = 0, i_b = 0;
	size_t counter = 0;

	while(i_a &lt; s_a &amp;&amp; i_b &lt; s_b) {
		if(A[i_a] &lt; B[i_b]) {
			i_a++;
		} else if(B[i_b] &lt; A[i_a]) {
			i_b++;
		} else {
			C[counter++] = A[i_a];
			i_a++; i_b++;
		}
	}
	return counter;
}
</pre>
<p>Performance of this procedure both in C and Java will be evaluated in the last section. I believe that it is possible to improve this approach using a branchless implementation, but I had no chance to try it out.</p>
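<p>A minimal sketch of what such a branchless variant might look like is given below. It is an illustration only and was not part of the measurements: the names <em>intersect_branchless</em> and <em>intersect</em> are hypothetical, and the idea is simply to add the results of the comparisons (0 or 1) to the indexes instead of branching on them.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchless variant of the zipper: the comparison results (0 or 1) are
// added to the indexes instead of selecting a branch.
// C must have room for at least min(s_a, s_b) elements.
size_t intersect_branchless(const uint32_t *A, const uint32_t *B,
                            size_t s_a, size_t s_b, uint32_t *C) {
	size_t i_a = 0, i_b = 0, counter = 0;
	while (i_a < s_a && i_b < s_b) {
		const uint32_t a = A[i_a], b = B[i_b];
		C[counter] = a;        // speculative write, kept only if a == b
		counter += (a == b);
		i_a += (a <= b);
		i_b += (b <= a);
	}
	return counter;
}

// Convenience wrapper (hypothetical helper) that sizes the output buffer.
std::vector<uint32_t> intersect(const std::vector<uint32_t> &A,
                                const std::vector<uint32_t> &B) {
	std::vector<uint32_t> C(A.size() < B.size() ? A.size() : B.size());
	const size_t n = intersect_branchless(A.data(), B.data(),
	                                      A.size(), B.size(), C.data());
	assert(n <= C.size());
	C.resize(n);
	return C;
}
```

<p>The loop body still contains the loop condition itself, so the sketch is branchless only with respect to the three-way comparison. Whether it is actually faster depends on the compiler and on how predictable the data is.</p>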
<h2>Vectorized Intersection</h2>
<p>It is intuitively clear that the performance of intersection may be improved by processing multiple elements at once using SIMD instructions. Let us start with the following question: how can one find and extract common elements in two short sorted arrays (let&#8217;s call them segments)? The SSE instruction set allows one to do a pairwise comparison of two segments of four 32-bit integers each using one instruction (the <a href="http://msdn.microsoft.com/en-us/library/tcww73d7.aspx">_mm_cmpeq</a> intrinsic) that produces a bit mask highlighting the positions of equal elements. If one has two 4-element registers, A and B, it is possible to obtain a mask of common elements by comparing A with different cyclic shifts of B (the left part of the figure below) and OR-ing the masks produced by each comparison (the right part of the figure):</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/05/rolling-processing1.png"><img class="aligncenter size-full wp-image-621" title="rolling-processing" src="http://highlyscalable.files.wordpress.com/2012/05/rolling-processing1.png?w=594&#038;h=376" alt="" width="594" height="376" /></a></p>
<p>The resulting <em>comparison mask</em> highlights the required elements in the segment A. This 128-bit mask can be transformed to a 4-bit value (the <em>shuffling mask index</em>) using the <a href="http://msdn.microsoft.com/en-us/library/4490ys29.aspx">_mm_movemask</a> intrinsic. Once this short mask of common elements is obtained, we have to copy out the common elements efficiently. This can be done by shuffling the original elements according to a shuffling mask that is looked up in a precomputed dictionary using the shuffling mask index (i.e. each of the 16 possible 4-bit shuffling mask indexes is mapped to some permutation). All common elements should be placed at the beginning of the register; in this case the register can be copied in one shot to the output buffer C, as shown in the figure above.</p>
<p>The described technique gives us a building block that can be used for intersection of long sorted lists. This process is somewhat similar to the scalar intersection:</p>
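<p>To make the two steps concrete, here is a scalar model of a single step, written as a sketch without any SSE intrinsics. The names <em>common_mask4</em> and <em>compact4</em> are hypothetical, and the segment values in the usage note are made up for illustration:</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one vectorized step (an illustrative sketch, not SSE code).
// Bit i of the returned mask is set if a[i] occurs anywhere in b; this is what
// OR-ing the four comparisons of a with the cyclic shifts of b yields.
unsigned common_mask4(const uint32_t a[4], const uint32_t b[4]) {
	unsigned mask = 0;
	for (int shift = 0; shift < 4; shift++) {     // one cyclic shift of b per pass
		for (int i = 0; i < 4; i++) {             // pairwise comparison
			mask |= (unsigned)(a[i] == b[(i + shift) % 4]) << i;
		}
	}
	return mask;
}

// Moves the elements of a selected by the mask to the front of out, which is
// the job of the shuffle looked up by the shuffling mask index.
// Returns the number of selected elements (the weight of the mask).
size_t compact4(const uint32_t a[4], unsigned mask, uint32_t out[4]) {
	size_t n = 0;
	for (int i = 0; i < 4; i++) {
		if (mask & (1u << i)) out[n++] = a[i];
	}
	return n;
}
```

<p>For the segments {2, 5, 11, 14} and {1, 2, 7, 11}, the mask is binary 0101 (decimal 5), and the compaction writes 2 and 11 to the beginning of the output.</p>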
<p><a href="http://highlyscalable.files.wordpress.com/2012/05/main.png"><img class="aligncenter size-full wp-image-581" title="main" src="http://highlyscalable.files.wordpress.com/2012/05/main.png?w=594" alt=""   /></a></p>
<p>In this example, during the first cycle we compare the first 4-element segments (highlighted in red) and copy out the common elements (2 and 11). Similarly to the scalar intersection algorithm, we can move the pointer for the list B forward because the tail element of the compared segments is smaller in B (11 vs 14). In the second cycle (in green) we compare the first segment of A with the second segment of B; the intersection is empty, and we move the pointer for A. And so on. In this example, we need 5 comparisons to process two lists of 12 elements each.</p>
<p>The complete implementation of the described techniques is shown below:</p>
<pre class="brush: cpp; title: ; notranslate">
static __m128i shuffle_mask[16]; // precomputed dictionary

size_t intersect_vector(int32 *A, int32 *B, size_t s_a, size_t s_b, int32 *C) {
	size_t count = 0;
	size_t i_a = 0, i_b = 0;

	// trim lengths to be a multiple of 4
	size_t st_a = (s_a / 4) * 4;
	size_t st_b = (s_b / 4) * 4;

	while(i_a &lt; st_a &amp;&amp; i_b &lt; st_b) {
		//[ load segments of four 32-bit elements
		__m128i v_a = _mm_load_si128((__m128i*)&amp;A[i_a]);
		__m128i v_b = _mm_load_si128((__m128i*)&amp;B[i_b]);
		//]

		//[ move pointers
		int32 a_max = _mm_extract_epi32(v_a, 3);
		int32 b_max = _mm_extract_epi32(v_b, 3);
		i_a += (a_max &lt;= b_max) * 4;
		i_b += (a_max &gt;= b_max) * 4;
		//]

		//[ compute mask of common elements
		int32 cyclic_shift = _MM_SHUFFLE(0,3,2,1);
		__m128i cmp_mask1 = _mm_cmpeq_epi32(v_a, v_b);    // pairwise comparison
		v_b = _mm_shuffle_epi32(v_b, cyclic_shift);       // shuffling
		__m128i cmp_mask2 = _mm_cmpeq_epi32(v_a, v_b);    // again...
		v_b = _mm_shuffle_epi32(v_b, cyclic_shift);
		__m128i cmp_mask3 = _mm_cmpeq_epi32(v_a, v_b);    // and again...
		v_b = _mm_shuffle_epi32(v_b, cyclic_shift);
		__m128i cmp_mask4 = _mm_cmpeq_epi32(v_a, v_b);    // and again.
		__m128i cmp_mask = _mm_or_si128(
				_mm_or_si128(cmp_mask1, cmp_mask2),
				_mm_or_si128(cmp_mask3, cmp_mask4)
		); // OR-ing of comparison masks
		// convert the 128-bit mask to the 4-bit mask
		int32 mask = _mm_movemask_ps((__m128)cmp_mask);
		//]

		//[ copy out common elements
		__m128i p = _mm_shuffle_epi8(v_a, shuffle_mask[mask]);
		_mm_storeu_si128((__m128i*)&amp;C[count], p);
		count += _mm_popcnt_u32(mask); // a number of elements is a weight of the mask
		//]
	}

	// intersect the tail using scalar intersection
	...

	return count;
}
</pre>
<p>The described implementation uses the <em>shuffle_mask</em> dictionary to map the mask of common elements to the shuffling parameter. Building of this dictionary is straightforward (each bit in the mask corresponds to 4 bytes in the register):</p>
<pre class="brush: cpp; title: ; notranslate">
// a simple implementation, we don't care about performance here
int getBit(int value, int position); // defined below
void prepare_shuffling_dictionary() {
	for(int i = 0; i &lt; 16; i++) {
		int counter = 0;
		char permutation[16];
		memset(permutation, 0xFF, sizeof(permutation));
		for(char b = 0; b &lt; 4; b++) {
			if(getBit(i, b)) {
				permutation[counter++] = 4*b;
				permutation[counter++] = 4*b + 1;
				permutation[counter++] = 4*b + 2;
				permutation[counter++] = 4*b + 3;
			}
		}
		__m128i mask = _mm_loadu_si128((const __m128i*)permutation);
		shuffle_mask[i] = mask;
	}
}

int getBit(int value, int position) {
    return ( ( value &amp; (1 &lt;&lt; position) ) &gt;&gt; position);
}
</pre>
<h2>Partitioned Vectorized Intersection</h2>
<p>The SSE 4.2 instruction set offers the <a href="http://msdn.microsoft.com/en-us/library/bb514080.aspx">PCMPESTRM</a> instruction, which allows one to compare two segments of eight 16-bit values each and obtain a bit mask that highlights common elements. This sounds like an extremely efficient approach for intersection of sorted lists, but in its basic form it is limited to lists of 16-bit values. Many applications need wider values, so a workaround was recently suggested by Benjamin Schlegel et al. in <a href="http://www.adms-conf.org/p1-SCHLEGEL.pdf">this article</a>. The main idea is to store indexes in a partitioned format, where elements with the same most significant bits are grouped together. This approach also has limited applicability because each partition should contain a sufficient number of elements, i.e. it works well in the case of very large lists or a favorable distribution of the values.</p>
<p>Each partition has a header that includes a prefix, which represents the most significant bits that are common to all elements in the partition, and the number of elements in the partition. The following figure illustrates the partitioning process:<br />
<a href="http://highlyscalable.files.wordpress.com/2012/05/partitioning.png"><img class="aligncenter size-full wp-image-664" title="partitioning" src="http://highlyscalable.files.wordpress.com/2012/05/partitioning.png?w=594" alt=""   /></a></p>
<p>The partitioning procedure that converts 32-bit values into 16-bit values is shown in the code snippet below:</p>
<pre class="brush: cpp; title: ; notranslate">
// A - sorted array
// s_a - size of A
// R - partitioned sorted array
size_t partition(int32 *A, size_t s_a, int16 *R) {
	int16 high = 0;
	size_t partition_length = 0;
	size_t partition_size_position = 1;
	size_t counter = 0;
	for(size_t p = 0; p &lt; s_a; p++) {
		int16 chigh = _high16(A[p]); // upper 16 bits
		int16 clow = _low16(A[p]);   // lower 16 bits
		if(chigh == high &amp;&amp; p != 0) { // add element to the current partition
			R[counter++] = clow;
			partition_length++;
		} else { // start new partition
			R[counter++] = chigh; // partition prefix
			R[counter++] = 0;     // reserve place for partition size
			R[counter++] = clow;  // write the first element
			R[partition_size_position] = partition_length;
			partition_length = 1; // reset counters
			partition_size_position = counter - 2;
			high = chigh;
		}
	}
	R[partition_size_position] = partition_length;

	return counter;
}
</pre>
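<p>The layout can be checked with a small portable round trip. The sketch below is not the code used in the measurements: <em>partition16</em> and <em>unpartition16</em> are hypothetical names, and it assumes that no partition holds more than 65535 elements, so that the size fits into a single 16-bit slot.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable sketch of the partitioned layout: [prefix, size, low 16 bits...]
// repeated for every distinct prefix. Assumes at most 65535 elements per
// partition so that the size fits into one 16-bit slot.
std::vector<uint16_t> partition16(const std::vector<uint32_t> &A) {
	std::vector<uint16_t> R;
	size_t size_pos = 0;                       // index of the current size slot
	for (size_t p = 0; p < A.size(); p++) {
		const uint16_t high = (uint16_t)(A[p] >> 16);
		const uint16_t low = (uint16_t)(A[p] & 0xFFFF);
		if (p == 0 || high != (uint16_t)(A[p - 1] >> 16)) {
			R.push_back(high);                 // partition prefix
			size_pos = R.size();
			R.push_back(0);                    // partition size, updated below
		}
		R.push_back(low);
		R[size_pos]++;
	}
	return R;
}

// Inverse transformation: restores the original 32-bit values.
std::vector<uint32_t> unpartition16(const std::vector<uint16_t> &R) {
	std::vector<uint32_t> A;
	size_t i = 0;
	while (i + 1 < R.size()) {
		const uint32_t prefix = R[i];
		const size_t size = R[i + 1];
		assert(i + 2 + size <= R.size());
		for (size_t j = 0; j < size; j++) {
			A.push_back((prefix << 16) | R[i + 2 + j]);
		}
		i += 2 + size;
	}
	return A;
}
```

<p>For example, the values 1, 2, 65541, 65543 and 196608 fall into three partitions with prefixes 0, 1 and 3, and the partitioned form is 0, 2, 1, 2, 1, 2, 5, 7, 3, 1, 0.</p>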
<p>A pair of partitions can be intersected using the following procedure, which computes a mask of common elements using the <a href="http://msdn.microsoft.com/en-us/library/bb514080.aspx">_mm_cmpestrm</a> intrinsic and then shuffles these elements similarly to the vectorized intersection procedure that was described in the previous section.</p>
<pre class="brush: cpp; title: ; notranslate">
size_t intersect_vector16(int16 *A, int16 *B, size_t s_a, size_t s_b, int16 *C) {
	size_t count = 0;
	size_t i_a = 0, i_b = 0;

	size_t st_a = (s_a / 8) * 8;
	size_t st_b = (s_b / 8) * 8;

	while(i_a &lt; st_a &amp;&amp; i_b &lt; st_b) {
		__m128i v_a = _mm_loadu_si128((__m128i*)&amp;A[i_a]);
		__m128i v_b = _mm_loadu_si128((__m128i*)&amp;B[i_b]);

		__m128i res_v = _mm_cmpestrm(v_b, 8, v_a, 8,
				_SIDD_UWORD_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_BIT_MASK);
		int r = _mm_extract_epi32(res_v, 0);
		__m128i p = _mm_shuffle_epi8(v_a, shuffle_mask16[r]);
		_mm_storeu_si128((__m128i*)&amp;C[count], p);
		count += _mm_popcnt_u32(r);

		int16 a_max = _mm_extract_epi16(v_a, 7);
		int16 b_max = _mm_extract_epi16(v_b, 7);
		i_a += (a_max &lt;= b_max) * 8; // a segment holds eight 16-bit elements
		i_b += (a_max &gt;= b_max) * 8;
	}

	// intersect the tail using scalar intersection
	...

	return count;
}
</pre>
<p>The whole intersection algorithm looks similar to the scalar intersection. It receives two partitioned operands, iterates over the headers of partitions, and calls the intersection of particular partitions if their prefixes match:</p>
<pre class="brush: cpp; title: ; notranslate">
// A, B - partitioned operands
size_t intersect_partitioned(int16 *A, int16 *B, size_t s_a, size_t s_b, int16 *C) {
	size_t i_a = 0, i_b = 0;
	size_t counter = 0;

	while(i_a &lt; s_a &amp;&amp; i_b &lt; s_b) {
		if(A[i_a] &lt; B[i_b]) {
			i_a += A[i_a + 1] + 2;
		} else if(B[i_b] &lt; A[i_a]) {
			i_b += B[i_b + 1] + 2;
		} else {
			C[counter++] = A[i_a]; // write partition prefix
			int16 partition_size = intersect_vector16(&amp;A[i_a + 2], &amp;B[i_b + 2],
						A[i_a + 1], B[i_b + 1], &amp;C[counter + 1]);
			C[counter++] = partition_size; // write partition size
			counter += partition_size;
			i_a += A[i_a + 1] + 2;
			i_b += B[i_b + 1] + 2;
		}
	}
	return counter;
}
</pre>
<p>The output of this procedure is also a partitioned vector that can be used in further operations.</p>
<h2>Performance Evaluation</h2>
<p>Performance of the described techniques was evaluated for intersection of sorted lists of 1 million elements each, with an average intersection selectivity of about 30%. All evaluated methods except partitioned vectorized intersection do not require specific properties of the values in the lists. For partitioned vectorized intersection, values were selected from the range [0, 3M] to provide relatively large partitions.</p>
<p>In the case of Lucene, a corpus of documents with two fields was generated to provide the mentioned index sizes and selectivity; RAMDirectory was used. Intersection was done using a standard Boolean query with top hits limited to 1 to prevent generation of a large result set. Of course, this is not a fair comparison because Lucene is much more than a list intersector, but it is still interesting to try it out.</p>
<p>Performance testing was done on an ordinary Linux desktop with 2.8GHz cores. JDK 1.6 and gcc 4.5.2 (with the -O3 option) were used.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/05/performance1.png"><img class="aligncenter size-full wp-image-625" title="performance" src="http://highlyscalable.files.wordpress.com/2012/05/performance1.png?w=594" alt=""   /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/05/featured2.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/featured2.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/rolling-processing1.png" medium="image">
			<media:title type="html">rolling-processing</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/main.png" medium="image">
			<media:title type="html">main</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/partitioning.png" medium="image">
			<media:title type="html">partitioning</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/performance1.png" medium="image">
			<media:title type="html">performance</media:title>
		</media:content>
	</item>
		<item>
		<title>Probabilistic Data Structures for Web Analytics and Data Mining</title>
		<link>http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/</link>
		<comments>http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/#comments</comments>
		<pubDate>Tue, 01 May 2012 14:11:46 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[cardinality]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[web analitics]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=468</guid>
		<description><![CDATA[Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in the areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=highlyscalable.wordpress.com&#038;blog=30930683&#038;post=468&#038;subd=highlyscalable&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade precision of the estimations for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact – sometimes astonishingly compact – replacement of raw data in stream-based computing.</p>
<p>I would like to thank Mikhail Khludnev and Kirill Uvaev, who reviewed this article and provided valuable suggestions.</p>
<p>Let us start with a simple example that illustrates capabilities of probabilistic data structures:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png"><img class="aligncenter size-full wp-image-484" title="probabilistic-sizes" src="http://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png?w=594" alt=""   /></a></p>
<p>Suppose we have a data set that is simply a heap of ten million random integer values, and we know that it contains no more than one million distinct values (there are many duplicates). The picture above depicts the fact that this data set occupies 40MB of memory (10 million 4-byte elements). This is a ridiculously small size for Big Data applications, but it is a reasonable choice for showing all the structures to scale. Our goal is to convert this data set to compact structures that allow one to process the following queries:</p>
<ul>
<li>How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?</li>
<li>What are the most frequent elements (the terms “heavy hitters” and “top-k elements” are also used)?</li>
<li>What are the frequencies of the most frequent elements?</li>
<li>How many elements belong to the specified range (range query, in SQL it looks like  SELECT count(v) WHERE v &gt;= c1 AND v &lt; c2)?</li>
<li>Does the data set contain a particular element (membership query)?</li>
</ul>
<p>The picture above shows (to scale) how much memory different representations of the data set will consume and which queries they support:</p>
<ul>
<li>A straightforward approach for cardinality computation and membership query processing is to maintain a sorted list of IDs or a hash table. This approach requires at least 4MB because we expect up to 10^6 distinct 4-byte values; the actual size of a hash table will be even larger.</li>
<li>A straightforward approach for frequency counting and range query processing is to store a map like (value -&gt; counter) for each element. It requires a table of 7MB that stores values and counters (24-bit counters are sufficient because we have no more than 10^7 occurrences of each element).</li>
<li>With probabilistic data structures, a membership query can be processed with a 4% error rate (false positive answers) using only 0.6MB of memory if the data is stored in a Bloom filter.</li>
<li>Frequencies of the 100 most frequent elements can be estimated with 4% precision using a Count-Min Sketch structure that uses about 48KB (12k integer counters, based on the experimental result), assuming that the data is skewed in accordance with a Zipfian distribution, which models natural texts, many types of web events, and network traffic well. A group of several such sketches can be used to process range queries.</li>
<li>The 100 most frequent items can be detected with 4% error (96 of 100 are determined correctly, based on the experimental results) using a Stream-Summary structure, also assuming a Zipfian distribution of the item probabilities.</li>
<li>Cardinality of this data set can be estimated with 4% precision using either a Linear Counter or a Loglog Counter. The former uses about 125KB of memory and its size is a linear function of the cardinality; the latter requires only 2KB and its size is almost constant for any input. It is possible to combine several linear counters to estimate the cardinality of the corresponding union of sets.</li>
</ul>
<p>A number of probabilistic data structures are described in detail in the following sections, although without excessive theoretical explanations – detailed mathematical analysis of these structures can be found in the original articles. The preliminary remarks are:</p>
<ul>
<li>For some structures like the Loglog Counter or the Bloom filter, there exist simple and practical formulas that allow one to determine parameters of the structure on the basis of the expected data volume and the required error probability. Other structures like Count-Min Sketch or Stream-Summary have a complex dependency on the statistical properties of the data, and experiments are the only reasonable way to understand their applicability to real use cases.</li>
<li>It is important to keep in mind that the applicability of the probabilistic data structures is not strictly limited to the queries listed above or to a single data set. On the contrary, structures populated by different data sets can often be combined to process complex queries, and other types of queries can be supported by using customized versions of the described algorithms.</li>
</ul>
<h2>Cardinality Estimation: Linear Counting</h2>
<p>Let us start with a very simple technique that is called Linear Counting. Basically, a linear counter is just a bit set, and each element in the data set is mapped to a bit. This process is illustrated in the following code snippet:</p>
<pre class="brush: java; title: ; notranslate">
class LinearCounter {
	BitSet mask = new BitSet(m) // m is a design parameter

	void add(value) {
		int position = hash(value) // map the value to the range 0..m-1
		mask.set(position) // sets a bit in the mask to 1
	}
}
</pre>
<p>Let’s call the ratio of the number of distinct items in the data set to m the <em>load factor</em>. It is intuitively clear that:</p>
<ul>
<li>If the load factor is much less than 1, the number of collisions in the mask will be low and the weight of the mask (the number of 1’s) will be a good estimation of the cardinality.</li>
<li>If the load factor is higher than 1, but not very high, many different values will be mapped to the same bits. Hence the weight of the mask is not a good estimation of the cardinality. Nevertheless, it is possible that there exists a function that allows one to estimate the cardinality on the basis of the weight (the real cardinality will always be greater than the weight).</li>
<li>If the load factor is very high (for example, 100), it is very probable that all bits will be set to 1 and it will be impossible to obtain a reasonable estimation of the cardinality on the basis of the mask.</li>
</ul>
<p>If so, we have to pose the following two questions:</p>
<ul>
<li>Is there a function that maps the weight of the mask to the estimation of the cardinality, and what does this function look like?</li>
<li>How to choose m on the basis of the expected number of the unique items (or upper bound) and the required estimation error?</li>
</ul>
<p>Both questions were addressed in [1]. The following table contains key formulas that allow one to estimate cardinality as a function of the mask weight and choose parameter m by required bias or standard error of the estimation:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-formulas.png"><img class="aligncenter size-full wp-image-490" title="linear-counting-formulas" src="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-formulas.png?w=594" alt=""   /></a></p>
<p>The last two equations cannot be solved analytically to express m or load factor as a function of bias or standard error, but it is possible to tabulate numerical solutions. The following plots can be used to determine the load factor (and, consequently, m) for different capacities.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-guidelines.png"><img class="aligncenter size-full wp-image-491" title="linear-counting-guidelines" src="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-guidelines.png?w=594&#038;h=780" alt="" width="594" height="780" /></a></p>
<p>The rule of thumb is that a load factor of 10 can be used for large data sets even if very precise estimation is required, i.e. memory consumption is about 0.1 bits per unique value. This is more than two orders of magnitude more efficient than the explicit indexing of 32- or 64-bit identifiers, but memory consumption grows linearly as a function of the expected cardinality (n), i.e. the capacity of the counter.</p>
<p>It is important to note that several independently computed masks for several data sets can be merged as a bitwise OR to estimate the cardinality of the union of the data sets. This opportunity is leveraged in the following case study.</p>
<h4><strong>Case Study</strong></h4>
<p>There is a system that receives events on user visits from different internet sites. This system enables analysts to query the number of unique visitors for a specified date range and site. Linear Counters can be used to aggregate information about registered visitor IDs for each day and site; the masks for each day are saved, and a query can be processed by bitwise OR-ing the daily masks.</p>
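<p>A minimal C++ sketch of this scheme is shown below. It assumes the estimator that is commonly given for linear counting, n = -m * ln(V), where V is the fraction of bits that are still zero; this should be checked against the formulas in the table above. The class layout, the <em>merge</em> method and the SplitMix64-style <em>mix</em> hash are illustrative choices rather than a reference implementation.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a linear counter with the estimator n = -m * ln(V), where V is
// the fraction of bits in the mask that are still zero (assumed formula).
class LinearCounter {
public:
	explicit LinearCounter(size_t m) : bits_(m, false) {}

	void add(uint64_t value) { bits_[mix(value) % bits_.size()] = true; }

	// Bitwise OR of two masks of the same size: the result describes the union.
	void merge(const LinearCounter &other) {
		assert(bits_.size() == other.bits_.size());
		for (size_t i = 0; i < bits_.size(); i++) {
			if (other.bits_[i]) bits_[i] = true;
		}
	}

	double estimate() const {
		size_t zeros = 0;
		for (bool b : bits_) zeros += !b;
		if (zeros == 0) return INFINITY;  // mask is saturated, m was too small
		return -(double)bits_.size() * std::log((double)zeros / bits_.size());
	}

private:
	// Stand-in hash (the SplitMix64 finalizer); any well-mixed hash would do.
	static uint64_t mix(uint64_t x) {
		x += 0x9e3779b97f4a7c15ULL;
		x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
		x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
		return x ^ (x >> 31);
	}

	std::vector<bool> bits_;
};
```

<p>In terms of the case study, one such counter would be kept per day and site, and a query over a date range would merge copies of the daily counters before calling <em>estimate</em>. Visitors that appear on several days set the same bits, so they are not counted twice.</p>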
<h2>Cardinality Estimation: Loglog Counting</h2>
<p>The Loglog algorithm [2] is a much more powerful and much more complex technique than the Linear Counting algorithm. Although some aspects of the Loglog algorithm are pretty complex, the basic idea is simple and ingenious.</p>
<p>In order to understand the principles of the Loglog algorithm we should start with one general observation. Let us imagine that we hashed each element in the data set and these hashed values are presented as binary strings. We can expect that about one half of the strings will start with 1, one quarter will start with 01, and so on. Let’s call the position of the first 1-bit (i.e. the number of leading zeros plus one) the rank. Finally, one or a few values will have some maximum rank r, as shown in the figure below.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/05/log-log-counter.png"><img class="aligncenter size-full wp-image-599" title="log-log-counter" src="http://highlyscalable.files.wordpress.com/2012/05/log-log-counter.png?w=594" alt=""   /></a></p>
<p>From this consideration it follows that 2^r can be treated as a rough estimate of the cardinality, but a very unstable one – r is determined by one or a few items and the variance is very high. However, it is possible to overcome this issue by using multiple independent observations and averaging them. This technique is shown in the code snippet below. Incoming values are routed to a number of buckets by using their first bits as a bucket address. Each bucket maintains the maximum rank of the received values:</p>
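<pre class="brush: java; title: ; notranslate">
class LogLogCounter {
	static final int H = 32;	// H is a design parameter (hash length in bits)
	final int k;			// k is a design parameter (bits of the bucket address)
	final int m;			// m = 2^k buckets
	final byte[] estimators;	// etype is a design parameter, byte is used here

	LogLogCounter(int k) {		// assumes 1 &lt;= k &lt; H
		this.k = k;
		this.m = 1 &lt;&lt; k;
		this.estimators = new byte[m];
	}

	void add(Object value) {
		int hashedValue = hash(value);
		int bucket = getBits(hashedValue, 0, k);
		estimators[bucket] = (byte) Math.max(
			estimators[bucket],
			rank( getBits(hashedValue, k, H) )
		);
	}

	int hash(Object value) {	// H-bit hash: hashCode() scrambled by the MurmurHash3 finalizer
		int h = value.hashCode();
		h = (h ^ (h &gt;&gt;&gt; 16)) * 0x85ebca6b;
		h = (h ^ (h &gt;&gt;&gt; 13)) * 0xc2b2ae35;
		return h ^ (h &gt;&gt;&gt; 16);
	}
	int getBits(int value, int start, int end) {	// bits [start, end), counted from the most significant bit
		return (int) ((value &gt;&gt;&gt; (H - end)) &amp; ((1L &lt;&lt; (end - start)) - 1));
	}
	int rank(int value) {		// position of the first 1-bit in the (H - k)-bit remainder
		return Integer.numberOfLeadingZeros(value) - k + 1;
	}
}
</pre>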
<p>This implementation requires the following parameters to be determined:</p>
<ul>
<li>H – sufficient length of the hash function (in bits)</li>
<li>k – number of bits that determine a bucket, 2^k is a number of buckets</li>
<li>etype – type of the estimator (for example, byte), i.e. how many bits are required for each estimator</li>
</ul>
<p>The auxiliary functions are specified as follows:</p>
<ul>
<li>hash(value) – produces H-bit hash of the value</li>
<li>getBits(value, start, end) – crop bits between start and end positions from the value and return an integer number that is assembled from this bits</li>
<li>rank(value) – compute position of first 1-bit in the value, i.e. rank(1&#8230;b) is 1, rank (001&#8230;b) is 3, rank (00001&#8230;b) is 5 etc.</li>
</ul>
<p>The following table provides the estimation formula and equations that can be used to determine numerical parameters of the Loglog Counter:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/log-log-formulas.png"><img class="aligncenter size-full wp-image-494" title="log-log-formulas" src="http://highlyscalable.files.wordpress.com/2012/04/log-log-formulas.png?w=594" alt=""   /></a></p>
<p>These formulas are very impressive. One can see that the number of buckets is relatively small for most of the practically interesting values of the standard error of the estimation. For example, 1024 estimators provide a standard error of 4%. At the same time, the length of the estimator is a very slowly growing function of the capacity: 5-bit buckets are enough for cardinalities up to 10^11, and 8-bit buckets (etype is byte) can support practically unlimited cardinalities. This means that less than 1KB of auxiliary memory may be enough to process gigabytes of data in real-life applications! This is a fundamental phenomenon that was revealed and theoretically analyzed in [7]: <em>It is possible to recover an approximate value of cardinality, using only a (small and) constant memory.</em></p>
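<p>For readers who prefer code to tables, the estimation step can be sketched as follows. The sketch assumes the estimator and the constants published in [2]: the estimate is alpha * m * 2^(average of the bucket values) with the asymptotic constant alpha of about 0.39701, and the standard error is about 1.30 / sqrt(m). The class name is illustrative, and the input is an array like the estimators array of the LogLogCounter above.</p>
<pre class="brush: java; title: ; notranslate">
class LogLogEstimate {
	static final double ALPHA = 0.39701;	// asymptotic bias-correction constant from [2]

	// alpha * m * 2^(mean of the bucket estimators)
	static double estimate(byte[] estimators) {
		int m = estimators.length;
		double sum = 0;
		for (byte e : estimators) {
			sum += e;
		}
		return ALPHA * m * Math.pow(2, sum / m);
	}

	// approximate standard error of the estimate for m buckets
	static double standardError(int m) {
		return 1.30 / Math.sqrt(m);
	}
}
</pre>
<p>For m = 1024 the standardError method gives about 0.041, which agrees with the 4% figure quoted above, and 1024 five-bit estimators occupy 640 bytes.</p>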
<p>Each estimator of a Loglog counter is essentially a record about a single (rarest) element among those routed to its bucket.</p>
<p>More recent developments on cardinality estimation are described in [9] and [10]. This <a href="http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html">article</a> also provides a good overview of the cardinality estimation techniques.</p>
<h4><strong>Case Study</strong></h4>
<p>There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.). The straightforward approaches to the implementation of this system are:</p>
<ul>
<li>Log all events in a large storage like Hadoop and compute unique visitors periodically using heavy MapReduce jobs or similar batch processing.</li>
<li>Maintain some kind of inverted index like (site -&gt; {visitor IDs}) where visitor IDs are stored as a hash table or sorted array. The number of unique users is the length of the corresponding index entry.</li>
</ul>
<p>If the number of users and criteria is high, both solutions require a very large amount of data to be stored, maintained, or processed. As an alternative, a LoglogCounter structure can be maintained for each criterion. In this case, thousands of criteria and hundreds of millions of visitors can be tracked using a very modest amount of memory.</p>
<h4><strong>Case Study</strong></h4>
<p>There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.). It is required to compute the 100 most popular sites using the number of unique visitors as the metric of popularity. Popularity should be computed every day on the basis of the data for the last month, i.e. every day a one-day partition is added and another one is removed from the scope. As in the previous case study, straightforward solutions to this problem require a lot of resources if the data volume is high. On the other hand, one can create a fresh set of per-site Loglog counters every day and keep each set for 30 days, so that 30 sets of counters are active at any moment. The monthly figure for a site is then obtained by merging its 30 daily counters, taking the maximum of each bucket across the days. This approach can be very efficient because of the tiny memory footprint of the Loglog counter, even for millions of unique visitors.</p>
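<p>The rolling window for a single site can be sketched as follows. This is an illustration under stated assumptions rather than a reference design: the class name MonthlyWindow and its methods are hypothetical, each daily counter is reduced to its bare estimator array, and bucket and rank are assumed to be computed from the hash exactly as in the LogLogCounter above.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.ArrayDeque;
import java.util.Deque;

class MonthlyWindow {
	static final int DAYS = 30;
	final int m;						// number of buckets per counter
	final Deque&lt;byte[]&gt; days = new ArrayDeque&lt;&gt;();	// daily estimator arrays, newest last

	MonthlyWindow(int m) { this.m = m; }

	// opens a fresh daily counter; the oldest one leaves the scope once 30 are active
	void startDay() {
		if (days.size() == DAYS) days.removeFirst();
		days.addLast(new byte[m]);
	}

	// records a rank observed for a bucket in today's counter
	void observe(int bucket, int rank) {
		byte[] today = days.getLast();
		today[bucket] = (byte) Math.max(today[bucket], rank);
	}

	// estimators of the union of all active days: bucket-wise maximum
	byte[] merged() {
		byte[] result = new byte[m];
		for (byte[] day : days) {
			for (int i = 0; i &lt; m; i++) {
				result[i] = (byte) Math.max(result[i], day[i]);
			}
		}
		return result;
	}
}
</pre>
<p>Because the maximum is taken per bucket, the merged array is the same as if all visits of the month had been added to one counter, so removing the oldest day never requires recounting.</p>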
<h2>Frequency Estimation: Count-Min Sketch</h2>
<p>Count-Min sketches are a family of memory-efficient data structures that allow one to estimate frequency-related properties of the data set, e.g. estimate frequencies of particular elements, find top-K frequent elements, perform range queries (where the goal is to find the sum of frequencies of elements within a range), and estimate percentiles.</p>
<p>Let’s focus on the following problem statement: there is a set of values with duplicates, and it is required to estimate the frequency (the number of duplicates) of each value. Estimations for relatively rare values can be imprecise, but frequent values and their absolute frequencies should be determined accurately.</p>
<p>The basic idea of Count-Min Sketch [3] is quite simple and somewhat similar to Linear Counting. A Count-Min sketch is simply a two-dimensional array (d x w) of integer counters. When a value arrives, it is mapped to one position in each of the d rows using d different and preferably independent hash functions. The counters at these positions are incremented. This process is shown in the figure below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/count-min-sketch.png"><img class="aligncenter size-full wp-image-495" title="count-min-sketch" src="http://highlyscalable.files.wordpress.com/2012/04/count-min-sketch.png?w=594&#038;h=250" alt="" width="594" height="250" /></a></p>
<p>It is clear that if the sketch is large in comparison with the cardinality of the data set, almost every value will get an independent counter and the estimation will be precise. Nevertheless, this case is absolutely impractical: it is much better to simply maintain a dedicated counter for each value by using a plain array or hash table. To cope with smaller sketches, the Count-Min algorithm estimates the frequency of a given value as the minimum of the corresponding counters in each row, because the estimation error is never negative (each occurrence of a value always increases its counters, and collisions can only cause additional increments). A practical implementation of the Count-Min sketch is provided in the following code snippet. It uses simple hash functions as suggested in [4]:</p>
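<pre class="brush: java; title: ; notranslate">
import java.util.Random;

class CountMinSketch {
	final int d, w;			// d and w are design parameters
	final long[][] estimators;
	final long[] a, b;
	final long p = 2147483647L;	// hashing parameter, a prime number (2^31-1)

	CountMinSketch(int d, int w) {
		this.d = d;
		this.w = w;
		estimators = new long[d][w];
		a = new long[d];
		b = new long[d];
		initializeHashes();
	}

	void initializeHashes() {
		Random random = new Random();
		for (int i = 0; i &lt; d; i++) {
			a[i] = 1 + random.nextInt((int) p - 1);	// random in range 1..p-1
			b[i] = random.nextInt((int) p);		// random in range 0..p-1
		}
	}

	void add(long value) {
		for (int i = 0; i &lt; d; i++)
			estimators[i][ hash(value, i) ]++;
	}

	long estimateFrequency(long value) {
		long minimum = Long.MAX_VALUE;
		for (int i = 0; i &lt; d; i++)
			minimum = Math.min(
				minimum,
				estimators[i][ hash(value, i) ]
			);
		return minimum;
	}

	int hash(long value, int i) {
		long x = Math.floorMod(value, p);	// reduce first so that a[i] * x cannot overflow
		return (int) (((a[i] * x + b[i]) % p) % w);
	}
}
</pre>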
<p>The dependency between the sketch size and accuracy is shown in the table below. It is worth noting that the width of the sketch limits the magnitude of the error, while the height (also called depth) controls the probability that the estimation exceeds this limit:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/05/count-min-formulas1.png"><img class="aligncenter size-full wp-image-646" title="count-min-formulas" src="http://highlyscalable.files.wordpress.com/2012/05/count-min-formulas1.png?w=594" alt=""   /></a><br />
Accuracy of the Count-Min sketch depends on the ratio between the sketch size and the total number of registered events. This means that the Count-Min technique provides significant memory gains only for skewed data, i.e. data where items have very different probabilities. This property is illustrated in the figures below.</p>
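<p>As a rough guide to sizing, the sketch below assumes the standard bounds from [3]: with width w = ceil(e / epsilon) and depth d = ceil(ln(1 / delta)), the estimate exceeds the true frequency by more than epsilon * N with probability at most delta, where N is the total number of registered events. The class and method names are illustrative.</p>
<pre class="brush: java; title: ; notranslate">
class CountMinSizing {
	// width for an additive error of at most epsilon * N
	static int width(double epsilon) {
		return (int) Math.ceil(Math.E / epsilon);
	}

	// depth for the error bound to hold with probability at least 1 - delta
	static int depth(double delta) {
		return (int) Math.ceil(Math.log(1 / delta));
	}
}
</pre>
<p>For example, epsilon = 0.01 and delta = 0.01 give a sketch of 5 rows by 272 columns. Read in reverse, a 3&#215;64 sketch corresponds to epsilon of about 0.042 and delta of about 0.05.</p>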
<p>Two experiments were done with a Count-Min sketch of size 3&#215;64, i.e. 192 counters in total. In the first case the sketch was populated with a moderately skewed data set of 10k elements with about 8500 distinct values (element frequencies follow a Zipfian distribution, which models, for example, the distribution of words in natural texts). The real histogram (for the most frequent elements only; it has a long flat tail on the right that was truncated in this figure) and the histogram recovered from the sketch are shown in the figure below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist01.png"><img class="aligncenter size-full wp-image-499" title="count-min-hist01" src="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist01.png?w=594" alt=""   /></a></p>
<p>It is clear that Count-Min sketch cannot track frequencies of 8500 elements using only 192 counters in the case of low skew of the frequencies, so the estimated histogram is very inaccurate.</p>
<p>In the second case the sketch was populated with a relatively highly skewed data set of 80k elements, also about 8500 distinct values. The real and estimated histograms are presented in the figure below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist02.png"><img class="aligncenter size-full wp-image-500" title="count-min-hist02" src="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist02.png?w=594" alt=""   /></a></p>
<p>One can see that the result is more accurate, at least for the most frequent items. In general, the applicability of Count-Min sketches is not a straightforward question, and the best thing that can be recommended is an experimental evaluation of each particular case. Theoretical bounds on Count-Min sketch accuracy on skewed data and measurements on real data sets are provided in [6].</p>
<h2>Frequency Estimation: Count-Mean-Min Sketch</h2>
<p>The original Count-Min sketch performs well on highly skewed data, but on low or moderately skewed data it is not so efficient because of poor protection from a high number of hash collisions: the Count-Min sketch simply selects the minimal (least distorted) estimator. As an alternative, a more careful correction can be done to compensate for the noise caused by collisions. One possible correction algorithm was suggested in [5]. It estimates the noise for each hash function as the average value of all counters in the row that corresponds to this function (except the counter that corresponds to the query itself), subtracts it from the estimation for this hash function, and, finally, computes the median of the estimations for all hash functions. Given that the sum of all counters in a sketch row equals the total number of added elements, we obtain the following implementation:</p>
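<pre class="brush: java; title: ; notranslate">
class CountMeanMinSketch {
	// initialization and addition procedures as in CountMinSketch
	// (d, w, estimators and hash are the same as there)
	long n;		// n is total number of added elements, incremented on every add

	long estimateFrequency(long value) {
		long[] e = new long[d];
		for (int i = 0; i &lt; d; i++) {
			long sketchCounter = estimators[i][ hash(value, i) ];
			long noiseEstimation = (n - sketchCounter) / (w - 1);
			e[i] = sketchCounter - noiseEstimation;
		}
		return median(e);
	}

	// middle element of the sorted estimations (upper median if d is even)
	long median(long[] e) {
		java.util.Arrays.sort(e);
		return e[e.length / 2];
	}
}
</pre>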
<p>This enhancement can significantly improve accuracy of the Count-Min structure. For example, compare the histograms below with the first histograms for Count-Min sketch (both techniques used a sketch of size 3&#215;64 and 8500 elements were added to it):</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/count-mean-min-hist01.png"><img class="aligncenter size-full wp-image-501" title="count-mean-min-hist01" src="http://highlyscalable.files.wordpress.com/2012/04/count-mean-min-hist01.png?w=594" alt=""   /></a></p>
<h2>Heavy Hitters: Count-Min Sketch</h2>
<p>Count-Min sketches are applicable to the following problem: find all elements in the data set whose frequencies are greater than k percent of the total number of elements in the data set. The algorithm is straightforward:</p>
<ul>
<li>Maintain a standard Count-Min sketch during the scan of the data set and put all elements into it.</li>
<li>Maintain a heap of top elements, initially empty, and a counter N of the total number of already processed elements.</li>
<li>For each element in the data set:
<ul>
<li>Put the element to the sketch</li>
<li>Estimate the frequency of the element using the sketch. If the frequency is greater than the threshold (k*N), then put the element into the heap. The heap should be periodically or continuously cleaned up to remove elements that no longer meet the threshold.</li>
</ul>
</li>
</ul>
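<p>The steps above can be sketched as follows. This is a simplified illustration, not a tuned implementation: the sketch dimensions (4 x 256), the seed parameter, and the class name are assumptions, k is passed as a fraction rather than a percentage, and a plain hash map that is cleaned on every update stands in for the heap.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class HeavyHitters {
	final int d = 4, w = 256;			// sketch dimensions (illustrative)
	final long p = 2147483647L;			// prime number 2^31-1
	final long[][] counters = new long[d][w];
	final long[] a = new long[d], b = new long[d];
	final Map&lt;Long, Long&gt; candidates = new HashMap&lt;&gt;();	// stands in for the heap
	final double k;					// threshold as a fraction of N
	long n;						// N, number of processed elements

	HeavyHitters(double k, long seed) {
		this.k = k;
		Random random = new Random(seed);
		for (int i = 0; i &lt; d; i++) {
			a[i] = 1 + random.nextInt((int) p - 1);
			b[i] = random.nextInt((int) p);
		}
	}

	int hash(long value, int i) {
		return (int) (((a[i] * Math.floorMod(value, p) + b[i]) % p) % w);
	}

	void add(long value) {
		n++;
		long estimate = Long.MAX_VALUE;
		for (int i = 0; i &lt; d; i++) {
			// put the element into the sketch and read back its counter
			long counter = ++counters[i][hash(value, i)];
			estimate = Math.min(estimate, counter);
		}
		if (estimate &gt; k * n) candidates.put(value, estimate);
		// continuous clean-up: drop candidates that no longer meet the threshold
		candidates.values().removeIf(f -&gt; f &lt;= k * n);
	}
}
</pre>
<p>Since the sketch never underestimates, an element whose true frequency exceeds k*N at its last occurrence is always among the candidates; collisions can only add false candidates, not lose real ones.</p>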
<p>In general, the top-k problem makes sense only for skewed data, so usage of Count-Min sketches is reasonable in this context.</p>
<h4><strong>Case Study</strong></h4>
<p>There is a system that tracks traffic by IP address, and it is required to detect the most traffic-intensive addresses. This problem can be solved using the algorithm described above, but not directly, because we need to track the total traffic for each address rather than the frequency of items. Nevertheless, there is a simple solution: counters in the CountMinSketch implementation can be incremented not by 1, but by the absolute amount of traffic for each observation (for example, the size of an IP packet if the sketch is updated for each packet). In this case, the sketch will track the amount of traffic for each address, and a heap with the most traffic-intensive addresses can be maintained as described above.</p>
<h2>Heavy Hitters: Stream-Summary</h2>
<p>Count-Min Sketch and similar techniques are not the only family of structures that allow one to estimate frequency-related metrics. Another large family of algorithms and structures that deal with frequency estimation is counter-based techniques. The Stream-Summary algorithm [8] belongs to this family. Stream-Summary allows one to detect the most frequent items in the data set and estimate their frequencies with an explicitly tracked estimation error.</p>
<p>Basically, Stream-Summary traces a fixed number of elements (the number of slots) that are presumably the most frequent ones. If one of these elements occurs in the stream, the corresponding counter is increased. If a new, non-traced element appears, it replaces the least frequent traced element, and the kicked-out element becomes non-traced.</p>
<p>The figure below illustrates how Stream-Summary with 3 slots works for the input stream {1,2,2,2,3,1,1,4}. Stream-Summary groups all traced elements into buckets where each bucket corresponds to a particular frequency, i.e. to the number of occurrences. Additionally, each traced element has an “err” field that stores the maximum potential error of the estimation.</p>
<ol>
<li>Initially there is only the 0-bucket and there are no elements attached to it.</li>
<li>Input : <strong>1</strong>. A new bucket for frequency 1 is created and the element is attached to it. Potential error is 0.</li>
<li>Input : <strong>2</strong>. The element is also attached to the bucket 1.</li>
<li>Input : <strong>2</strong>. The corresponding slot is detached from bucket 1 and attached to the newly created bucket 2 (element 2 occurred twice).</li>
<li>Input : <strong>2</strong>. The corresponding slot is detached from bucket 2 and attached to the newly created bucket 3. Bucket 2 is deleted because it is empty.</li>
<li>Input : <strong>3</strong>. The element is attached to the bucket 1 because it is the first occurrence of 3.</li>
<li>Input : <strong>1</strong>. The corresponding slot is moved to bucket 2 because this is the second occurrence of the element 1.</li>
<li>Input : <strong>1</strong>. The corresponding slot is moved to bucket 3 because this is the third occurrence of the element 1.</li>
<li>Input : <strong>4</strong>. The element 4 is not traced, so it kicks out element 3 and replaces it in the corresponding slot. The number of occurrences of element 3 (which is 1) becomes the potential estimation error for element 4. After this, the corresponding slot is moved to bucket 2, as if this were the second occurrence of element 4.</li>
</ol>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/stream-summary.png"><img class="aligncenter size-full wp-image-502" title="stream-summary" src="http://highlyscalable.files.wordpress.com/2012/04/stream-summary.png?w=594" alt=""   /></a></p>
<p>The estimation procedure for the most frequent elements and their frequencies is straightforward because of the simple internal design of the Stream-Summary structure: one just needs to scan the elements in the buckets that correspond to the highest frequencies. Moreover, Stream-Summary is able not only to provide estimates, but also to tell whether these estimates are exact (guaranteed) or not. Computation of these guarantees is not trivial; the corresponding algorithms are described in [8].</p>
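<p>The update rule from the walkthrough above can be sketched in a few lines. Note the simplification: instead of the linked frequency buckets of the real Stream-Summary structure, this version keeps a plain hash map and finds the least frequent element by a linear scan, so it reproduces the counters and the err values but not the constant-time updates. The class name is illustrative.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.HashMap;
import java.util.Map;

class StreamSummarySketch {
	final int slots;
	final Map&lt;Integer, long[]&gt; traced = new HashMap&lt;&gt;();	// element -&gt; {count, err}

	StreamSummarySketch(int slots) { this.slots = slots; }

	void add(int element) {
		long[] slot = traced.get(element);
		if (slot != null) {
			slot[0]++;					// traced element: increase its counter
		} else if (traced.size() &lt; slots) {
			traced.put(element, new long[] {1, 0});		// free slot, no potential error
		} else {
			// kick out the least frequent traced element (linear scan in this sketch)
			int victim = 0;
			long min = Long.MAX_VALUE;
			for (Map.Entry&lt;Integer, long[]&gt; e : traced.entrySet()) {
				if (e.getValue()[0] &lt; min) {
					min = e.getValue()[0];
					victim = e.getKey();
				}
			}
			traced.remove(victim);
			// the newcomer inherits the victim's count as its potential error
			traced.put(element, new long[] {min + 1, min});
		}
	}
}
</pre>
<p>Feeding the stream {1,2,2,2,3,1,1,4} into a 3-slot instance leaves element 1 and element 2 with count 3 and err 0, and element 4 with count 2 and err 1, as in the last step of the walkthrough.</p>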
<h2>Range Query: Array of Count-Min Sketches</h2>
<p>In theory, one can process a range query (something like SELECT count(v) WHERE v &gt;= c1 AND v &lt; c2) using a Count-Min sketch by enumerating all points within the range and summing the estimates of the corresponding frequencies. However, this approach is impractical because the number of points within a range can be very high, and accuracy also tends to be unacceptable because of the cumulative error of the sum. Nevertheless, it is possible to overcome these problems using multiple Count-Min sketches. The basic idea is to maintain a number of sketches with different “resolution”, i.e. one sketch that counts frequencies for each value separately, one sketch that counts frequencies for pairs of values (to do this one can simply drop the lowest bit of a value on the sketch’s input), one sketch with 4-item buckets and so on. The number of levels equals the base-2 logarithm of the maximum possible value. This schema is shown in the right part of the following picture:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/array-count-min-sketch.png"><img class="aligncenter size-full wp-image-503" title="array-count-min-sketch" src="http://highlyscalable.files.wordpress.com/2012/04/array-count-min-sketch.png?w=594&#038;h=265" alt="" width="594" height="265" /></a></p>
<p>Any range query can be reduced to a number of queries to the sketches of different levels, as shown in the right part of the picture above. This approach (called dyadic ranges) allows one to reduce the number of computations and increase accuracy. An obvious optimization of this schema is to replace sketches by exact counters at the lowest levels where the number of buckets is small.</p>
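<p>The reduction of a range to per-level queries can be sketched as follows. The code only computes the decomposition; it assumes 0 &lt;= lo &lt;= hi &lt; 2^62 and a hypothetical layout in which the sketch of level L receives each value with its L lowest bits dropped. The class and method names are illustrative.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.ArrayList;
import java.util.List;

class DyadicRanges {
	// Splits [lo, hi) into dyadic blocks. Each result is {level, index} and
	// covers the values from index * 2^level up to (index + 1) * 2^level - 1.
	static List&lt;long[]&gt; decompose(long lo, long hi) {
		List&lt;long[]&gt; blocks = new ArrayList&lt;&gt;();
		while (lo &lt; hi) {
			int level = 0;
			// grow the block while it stays aligned at lo and inside the range
			while ((lo &amp; ((2L &lt;&lt; level) - 1)) == 0 &amp;&amp; lo + (2L &lt;&lt; level) &lt;= hi) {
				level++;
			}
			blocks.add(new long[] {level, lo &gt;&gt; level});
			lo += 1L &lt;&lt; level;
		}
		return blocks;
	}
}
</pre>
<p>For the range [3, 11) this yields four blocks, {0, 3}, {2, 1}, {1, 4} and {0, 10}, so the answer is the sum of four sketch estimates (one per block, each taken from the sketch of the block's level) instead of eight point queries.</p>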
<p><a href="http://madlib.net/">MADlib</a> (a data mining library for PostgreSQL and Greenplum) implements this algorithm to process range queries and calculate percentiles on large data sets.</p>
<h2>Membership Query: Bloom Filter</h2>
<p>Bloom Filter is probably the most famous and widely used probabilistic data structure. There are multiple descriptions of the Bloom filter on the web, so I provide only a short overview here for the sake of completeness. Bloom filter is similar to Linear Counting, but it is designed to maintain an identity of each item rather than statistics. Similarly to the Linear Counter, the Bloom filter maintains a bitset, but each value is mapped not to one, but to some fixed number of bits by using several independent hash functions. If the filter is relatively large in comparison with the number of distinct elements, each element has a relatively unique signature, and it is possible to check whether a particular value is already registered in the bit set or not. If all the bits of the corresponding signature are ones, then the answer is yes (with a certain probability, of course); if any of them is zero, the value has definitely not been registered.</p>
<p>The following table contains formulas that allow one to calculate parameters of the Bloom filter as functions of error probability and capacity:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/04/bloom-filter-formulas.png"><img class="aligncenter size-full wp-image-504" title="bloom-filter-formulas" src="http://highlyscalable.files.wordpress.com/2012/04/bloom-filter-formulas.png?w=594" alt=""   /></a></p>
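<p>A compact sketch of the structure and its sizing is given below. It assumes the standard formulas m = -n * ln(p) / (ln 2)^2 for the number of bits and k = (m / n) * ln 2 for the number of hash functions, where n is the expected number of distinct elements and p is the target false positive probability. Instead of k truly independent hash functions, the sketch derives the bit positions from two scrambled hash values (double hashing), which is a substitution made here for brevity. The class name is illustrative.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.BitSet;

class BloomFilterSketch {
	final int m;		// number of bits in the filter
	final int k;		// number of hash functions
	final BitSet bits;

	// n: expected number of distinct elements, p: target false positive probability
	BloomFilterSketch(int n, double p) {
		double ln2 = Math.log(2);
		m = (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
		k = Math.max(1, (int) Math.round((double) m / n * ln2));
		bits = new BitSet(m);
	}

	// MurmurHash3 finalizer, used here only to scramble hashCode()
	static int mix(int h) {
		h = (h ^ (h &gt;&gt;&gt; 16)) * 0x85ebca6b;
		h = (h ^ (h &gt;&gt;&gt; 13)) * 0xc2b2ae35;
		return h ^ (h &gt;&gt;&gt; 16);
	}

	// i-th bit position of a value: h1 + i * h2 (double hashing)
	int position(Object value, int i) {
		int h1 = mix(value.hashCode());
		int h2 = mix(h1 ^ 0x9e3779b9) | 1;
		return Math.floorMod(h1 + i * h2, m);
	}

	void add(Object value) {
		for (int i = 0; i &lt; k; i++) bits.set(position(value, i));
	}

	// false means definitely absent; true means present with high probability
	boolean mightContain(Object value) {
		for (int i = 0; i &lt; k; i++) {
			if (!bits.get(position(value, i))) return false;
		}
		return true;
	}
}
</pre>
<p>For n = 1000 and p = 0.01 the constructor chooses 9586 bits and 7 hash functions, i.e. fewer than 10 bits per element.</p>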
<p>Bloom filter is widely used as a preliminary probabilistic test that allows one to reduce the number of exact checks. The following case study shows how the Bloom filter can be applied to cardinality estimation.</p>
<h4><strong>Case Study</strong></h4>
<p>There is a system that tracks a huge number of web events, and each event is marked by a number of tags, including the ID of the user this event corresponds to. It is required to report the number of unique users that meet a specified combination of tags (like users from city C that visited site A or site B).</p>
<p>A possible solution is to maintain a Bloom filter that tracks user IDs for each tag value and one more Bloom filter that contains the user IDs already counted in the final result. A user ID from each incoming event is tested against the per-tag filters to check whether it satisfies the required combination of tags. If the user ID passes this test, it is additionally tested against the Bloom filter that corresponds to the report itself; if it is not registered there yet, it is added to that filter and the final report counter is increased.</p>
<h2>References</h2>
<ol>
<li><a href="http://dblab.kaist.ac.kr/Publication/pdf/ACM90_TODS_v15n2.pdf">K. Whang, B. T. Vander-Zanden, H. M. Taylor. A Linear-Time Probabilistic Counting Algorithm for Database Applications</a></li>
<li><a href="http://algo.inria.fr/flajolet/Publications/DuFl03.pdf">M. Durand and P. Flajolet. Loglog Counting of Large Cardinalities</a></li>
<li><a href="http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf">G. Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications</a></li>
<li><a href="http://www.research.att.com/people/Cormode_Graham/library/publications/CormodeMuthukrishnan12.pdf">G. Cormode, S. Muthukrishnan. Approximating Data with the Count-Min Data Structure</a></li>
<li><a href="http://webdocs.cs.ualberta.ca/~fandeng/paper/cmm.pdf">F. Deng, D. Rafiei. New Estimation Algorithms for Streaming Data: Count-min Can Do More</a></li>
<li><a href="http://www.cs.rutgers.edu/~muthu/cmz-sdm.pdf">G. Cormode, S. Muthukrishnan. Summarizing and Mining Skewed Data Streams</a></li>
<li><a href="http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Probab/1985-Flajolet-Probabilistic-counting.pdf">P. Flajolet and N. Martin. Probabilistic counting algorithms for data base applications</a></li>
<li><a href="http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf">A. Metwally, D. Agrawal, A.E. Abbadi. Efficient Computation of Frequent and Top-K Elements in Data Streams</a></li>
<li><a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf">P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm</a></li>
<li><a href="http://arxiv.org/abs/0801.3552">P. Clifford, I. Cosma. A Statistical Analysis of Probabilistic Counting Algorithms</a></li>
</ol>
<p>It is worth mentioning that simple Java implementations of several of these structures can be found in the <a href="https://github.com/clearspring/stream-lib">stream-lib</a> library.</p>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/05/featured.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/featured.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/probabilistic-sizes.png" medium="image">
			<media:title type="html">probabilistic-sizes</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-formulas.png" medium="image">
			<media:title type="html">linear-counting-formulas</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/linear-counting-guidelines.png" medium="image">
			<media:title type="html">linear-counting-guidelines</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/log-log-counter.png" medium="image">
			<media:title type="html">log-log-counter</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/log-log-formulas.png" medium="image">
			<media:title type="html">log-log-formulas</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/count-min-sketch.png" medium="image">
			<media:title type="html">count-min-sketch</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/05/count-min-formulas1.png" medium="image">
			<media:title type="html">count-min-formulas</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist01.png" medium="image">
			<media:title type="html">count-min-hist01</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/count-min-hist02.png" medium="image">
			<media:title type="html">count-min-hist02</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/count-mean-min-hist01.png" medium="image">
			<media:title type="html">count-mean-min-hist01</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/stream-summary.png" medium="image">
			<media:title type="html">stream-summary</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/array-count-min-sketch.png" medium="image">
			<media:title type="html">array-count-min-sketch</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/bloom-filter-formulas.png" medium="image">
			<media:title type="html">bloom-filter-formulas</media:title>
		</media:content>
	</item>
		<item>
		<title>Hierarchical Navigation and Faceted Search on Top of Oracle Coherence</title>
		<link>http://highlyscalable.wordpress.com/2012/04/02/architecture-of-high-performance-ecommerce-backend/</link>
		<comments>http://highlyscalable.wordpress.com/2012/04/02/architecture-of-high-performance-ecommerce-backend/#comments</comments>
		<pubDate>Mon, 02 Apr 2012 14:06:03 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Case Studies]]></category>
		<category><![CDATA[Coherence]]></category>
		<category><![CDATA[architecture]]></category>
		<category><![CDATA[backend]]></category>
		<category><![CDATA[coherence]]></category>
		<category><![CDATA[data grid]]></category>
		<category><![CDATA[ecommerce]]></category>
		<category><![CDATA[facet]]></category>
		<category><![CDATA[imdg]]></category>
		<category><![CDATA[navigation]]></category>
		<category><![CDATA[pattern]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=230</guid>
		<description><![CDATA[Some time ago I participated in design of a backend for one large online retailer company. From the business logic point of view, this was a pretty typical eCommerce service for hierarchical and faceted navigation, although not without peculiarities, but high performance requirements led us to the quite advanced architecture and technical design. In particular, we [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Some time ago I participated in the design of a backend for a large online retailer. From the business logic point of view, this was a pretty typical eCommerce service for hierarchical and faceted navigation, although not without peculiarities, but high performance requirements led us to a quite advanced architecture and technical design. In particular, we built this system on top of Oracle Coherence and designed our own data structures and indexes.</p>
<p>In this article, I describe major architectural decisions we made and techniques we used. This description should not be considered as a solid blueprint, but rather a collection of the relatively independent ideas, patterns, and notes that can be used in different combinations and in different applications, not only in eCommerce systems.</p>
<h2>Business Logic: Hierarchical and Faceted Navigation</h2>
<p>I cannot disclose the customer&#8217;s name, so I will explain the business logic using <em>amazon.com</em> as an example; fortunately, the basic functionality is very similar. The first piece of functionality is structural or hierarchical navigation through categories and products, which are the main business entities of the system. Categories are organized in a tree-like structure, and the user is provided with several controls for navigating through this tree, starting from the highest categories (like departments on <em>amazon.com</em>) and going down to the lowest ones:</p>
<div id="attachment_244" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/navigation.png"><img class="size-full wp-image-244" title="navigation" src="http://highlyscalable.files.wordpress.com/2012/02/navigation.png?w=594&#038;h=233" alt="" width="594" height="233" /></a><p class="wp-caption-text">Hierarchical Navigation on Amazon.Com</p></div>
<p>Each product can be explicitly associated with one or more categories of any level, and a category contains a product if this product is explicitly associated with it or with any of its subcategories. These structural dependencies between categories and products are relatively static (the system refreshes this information daily), but the operations team can change individual relations at runtime to fix incorrect data or to inject other urgent changes. Besides this, each product has some transient information, like in-stock availability, that is subject to frequent updates (every 5 minutes or so).</p>
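<p>The containment rule can be illustrated with a naive in-memory model. This is only a sketch of the business rule, not the data structure used in the system described in this article; the class name, the string identifiers, and the two maps are assumptions made for the example.</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CategoryTree {
	final Map&lt;String, List&lt;String&gt;&gt; children = new HashMap&lt;&gt;();	// category -&gt; subcategories
	final Map&lt;String, Set&lt;String&gt;&gt; explicit = new HashMap&lt;&gt;();	// category -&gt; explicitly associated products

	// all products contained in a category: its own plus those of every descendant
	Set&lt;String&gt; products(String category) {
		Set&lt;String&gt; result = new HashSet&lt;&gt;(explicit.getOrDefault(category, Collections.emptySet()));
		for (String child : children.getOrDefault(category, Collections.emptyList())) {
			result.addAll(products(child));
		}
		return result;
	}
}
</pre>
<p>A product that is associated with several subcategories appears in the result only once because the result is a set.</p>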
<p>The second important piece of functionality is faceted navigation. Categories can contain thousands of products, and the user cannot efficiently search through this set without powerful tools. The most popular way to do this is faceted navigation, which can be thought of as generation of dynamic categories based on product attributes. For example, if the user opens a category that contains clothes, products will be characterized by properties like size, brand, color and so on. Available values of these properties (called facets) can be extracted from the product set and shown on the UI to enable the user to apply <em>user-selected filters,</em> which are particular AND-ed or OR-ed combinations of the facet values:</p>
<div id="attachment_245" class="wp-caption aligncenter" style="width: 419px"><a href="http://highlyscalable.files.wordpress.com/2012/02/facets.png"><img class="size-full wp-image-245" title="facets" src="http://highlyscalable.files.wordpress.com/2012/02/facets.png?w=594" alt=""   /></a><p class="wp-caption-text">Faceted Navigation on Amazon.Com</p></div>
<p>Each facet value is often accompanied by its <em>cardinality</em>, i.e. the number of products that will be in the result set if this filter is applied. When the user clicks on a facet, the system automatically applies the selected filters and narrows the product set according to the user&#8217;s interests. It is important that this style of navigation assumes high interactivity &#8211; each selection leads to recomputation of all available facets, their cardinalities, and the products in the result set.</p>
<p>There is a lot of information about faceted search on the web. I can recommend <a href="http://www.alistapart.com/articles/design-patterns-faceted-navigation/">this article</a> by Peter Morville and Jeffrey Callender for further reading. We will also return to some details of the business logic in the section devoted to the implementation of faceted navigation.</p>
<p>From the backend perspective, hierarchical and faceted navigation require the following operations to be implemented:</p>
<ul>
<li><strong>getProductsAndFacets(CategoryID, UserSelectedFilters) &#8211; </strong>return all products within the category filtered in accordance with the user-selected filters, compute available facet values and corresponding cardinalities for the filtered product set.</li>
<li><strong>traverseCategoryHierarchy(CategoryID) &#8211; </strong>return ancestors and descendants of the given category in the tree of categories. Depth of traversal is specified by the frontend.</li>
<li><strong>getProducts(ProductID[]) &#8211; </strong>return product domain entities that contain product attributes, prices, images etc. This information is used to populate a page with products and to display product details.</li>
<li><strong>getCategories(CategoryID[]) &#8211; </strong>return category domain entities that contain category attributes and properties.</li>
<li><strong>getProductsTransientAttributes(ProductID[]), getCategoryTransientAttributes(CategoryID[]) &#8211; </strong>return a short list of attributes that are subject to frequent changes (in-stock availability etc.). The rationale behind these methods is that the frontend should be able to fetch transient information very efficiently and separately from the heavy-weight domain entities, because this information cannot be cached.</li>
</ul>
<h2>System Properties and Major Technical Requirements</h2>
<p>From the technical perspective, the following properties should be highlighted:</p>
<ul>
<li>All data is initially stored in the relational database, but this database is heavily loaded because it is a master record for many applications. So, the only way was to <strong>cache all necessary data</strong> to minimize interaction with the RDBMS.</li>
<li>The content that is delivered to users (categories and products) is pretty much static. In such cases, a content delivery network (CDN) is typically used to cache the majority of the content and shield the system from high workload. Nevertheless, there were two obstacles that decreased the efficiency of a CDN in this project:
<ul>
<li>Faceted navigation leads to a large number of distinct views because users are able to select arbitrary combinations of facets, and, consequently, many unique requests have to be served.</li>
<li>Product in-stock availability is transient, especially during certain periods of the eCommerce system life cycle (sales and so on). This means that content &#8211; products and facets &#8211; is updated every few minutes.</li>
</ul>
</li>
<li>Taking into account the previous considerations, performance requirements were set as <strong>1000 faceted navigation requests/second</strong> per typical hardware blade.</li>
<li>Data capacity of the system is not less than <strong>1 million products</strong>.</li>
<li>Structural data are completely reloaded from the RDBMS every night. Transient information updates and requests for minor changes of structural information can arrive every few minutes.</li>
<li>The system is implemented in Java.</li>
</ul>
<h2>Deployment Schema and High-Level Architecture</h2>
<p>The major architectural decision was to use an in-memory data grid (IMDG) to shield the master RDBMS from workload during request processing. Oracle Coherence was chosen as an implementation. Coherence is used as a platform that provides distributed cache capabilities and can serve as a messaging bus for coordination of all application-level modules on all nodes in the cluster.</p>
<p>The deployment schema includes three types of nodes &#8211; processing nodes, storage nodes, and maintenance nodes. Processing nodes are responsible for serving requests and act as Coherence clients. Storage nodes are basically Coherence storage nodes. Maintenance nodes are responsible for data indexing and processing of transient information updates. Neither storage nor maintenance nodes serve client requests. This deployment schema is shown in the figure below:</p>
<div id="attachment_258" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/deployment-schema1.png"><img class="size-full wp-image-258" title="deployment-schema" src="http://highlyscalable.files.wordpress.com/2012/02/deployment-schema1.png?w=594&#038;h=579" alt="" width="594" height="579" /></a><p class="wp-caption-text">Deployment Schema</p></div>
<p>Nodes can be dynamically added to or removed from the cluster. All nodes (processing, storage, maintenance) host the same application that contains all modules for request processing, maintenance operations, and a Coherence instance. Basically, deployments on all nodes are identical and can serve both client requests and maintenance operations, although each type of node has its own configuration parameters. The rationale behind this architecture can be recognized as a pattern:</p>
<table style="background-color:#eeff99;border-style:dashed;border-width:1px;">
<tbody>
<tr>
<td>
<h4><strong>Pattern: Homogeneous Cluster Nodes</strong></h4>
</td>
</tr>
<tr>
<td><strong>Problem</strong><br />
There is a clustered system that consists of multiple business services and auxiliary modules (data loaders, administration controls, etc). The deployment process is going to be complex if each module is deployed as a separate artifact with its own deployment schema and configuration.</td>
</tr>
<tr>
<td><strong>Solution</strong><br />
Different groups of nodes in the cluster can have different roles and serve different needs, but it may be a good idea to create one application and one artifact that will be deployed throughout the cluster. Different modules of this application are activated on different nodes depending on explicitly specified configuration (say, property files) or simply because of the usage pattern (say, certain requests are routed only to particular nodes).</td>
</tr>
<tr>
<td><strong>Results</strong><br />
This approach simplifies deployment and release processes and mitigates the risk of incorrect deployment or misconfiguration. Development and QA processes are simplified because one can use either a single node or multiple nodes to run a fully functional environment.</td>
</tr>
</tbody>
</table>
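<p>As a rough illustration of this pattern, the following sketch selects the modules to activate from a property file entry. The property key <em>node.role</em>, the module names, and the mapping of roles to modules are assumptions made for the example; the text above only states that the Data Loader runs on maintenance nodes and that storage and maintenance nodes do not serve client requests.</p>

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Hypothetical module names, used only for this illustration. */
enum AppModule { DATA_LOADER, ENTITY_GATEWAY, NAVIGATION_ENGINE, FACET_ENGINE }

final class NodeRoles {
    /** Assumed property key, e.g. node.role=processing|storage|maintenance. */
    static final String ROLE_KEY = "node.role";

    /** Decides which modules of the single shared artifact are activated on this node. */
    static Set<AppModule> activeModules(Properties config) {
        String role = config.getProperty(ROLE_KEY, "processing").trim().toLowerCase(Locale.ROOT);
        switch (role) {
            case "maintenance":
                // Indexing and updates only; no client-facing modules.
                return EnumSet.of(AppModule.DATA_LOADER);
            case "storage":
                // The node only hosts the data grid; no application modules are active.
                return EnumSet.noneOf(AppModule.class);
            case "processing":
                return EnumSet.of(AppModule.ENTITY_GATEWAY,
                                  AppModule.NAVIGATION_ENGINE,
                                  AppModule.FACET_ENGINE);
            default:
                throw new IllegalArgumentException("Unknown node role: " + role);
        }
    }
}
```

<p>The same artifact is deployed everywhere, so switching a node from one role to another is a configuration change rather than a redeployment.</p>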
<p>Turning to the internals of the application itself, we can see that it includes the following components (these components are depicted in the figure below):</p>
<ul>
<li>Data Loader. The first role of this component is to fetch data from the master DB, assemble domain entities, and push these entities to Coherence. The second role is to build navigation indexes (these indexes will be described in the following sections), split them into chunks, and flush them to Coherence. The rationale behind splitting into chunks is that indexes can be quite large (hundreds of megabytes), and Coherence is not intended for storing such large entities: their transmission can block Coherence network IO and crash the cluster. The third role of the Loader module is to receive intraday updates and apply patches to the indexes and domain entities.</li>
<li>Entity Gateway. The role of this module is to return information about particular entities, i.e. products and categories. Basically, this module is just a facade for Coherence. It takes domain entities from Coherence, computes fields that depend on transient information using the navigation index, and returns data to the client.</li>
<li>Hierarchical Navigation Engine. This engine is responsible for hierarchical navigation and works as a primary navigation service for external clients. Besides this, the navigation index is a master record for transient attributes, so other modules like Entity Gateway request these attributes from the Navigation Engine. Implementation of the engine will be described in the next section.</li>
<li>Facet Engine. This engine is responsible for computation of facets and for filtering according to user-selected filters. Implementation of this module will be discussed later.</li>
</ul>
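<p>The chunking step performed by the Data Loader can be sketched as follows. This is a minimal illustration that assumes the index has already been serialized into a byte array; the class name and the chunk size are hypothetical, and the actual storage of chunks in Coherence is left out.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexChunker {
    /** Splits a serialized index into chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] serialized, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < serialized.length; from += chunkSize) {
            int length = Math.min(chunkSize, serialized.length - from);
            chunks.add(Arrays.copyOfRange(serialized, from, from + length));
        }
        return chunks;
    }

    /** Reassembles chunks, given in their original order, into one array. */
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) total += chunk.length;
        byte[] result = new byte[total];
        int position = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, result, position, chunk.length);
            position += chunk.length;
        }
        return result;
    }
}
```

<p>Each chunk would then be stored under its own key (for example, an index version plus a chunk number), so that no single cache entry is large enough to block network IO.</p>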
<div></div>
<div>
<div id="attachment_433" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/03/application-components.png"><img class="size-full wp-image-433" title="application-components" src="http://highlyscalable.files.wordpress.com/2012/03/application-components.png?w=594&#038;h=551" alt="" width="594" height="551" /></a><p class="wp-caption-text">Component Diagram and Data Flows</p></div>
<p>Data Loader is active only on the Maintenance nodes, where it has plenty of resources for temporary buffers, index compilation tasks and so on. All updates and indexing requests are routed only to the Maintenance nodes, not to Processing/Storage nodes. Such separation of the data loader and other maintenance units can be recognized as a common pattern:</p>
</div>
<table style="background-color:#eeff99;border-style:dashed;border-width:1px;">
<tbody>
<tr>
<td>
<h4><strong>Pattern: Maintenance Node</strong></h4>
</td>
</tr>
<tr>
<td><strong>Problem</strong><br />
There is a cluster of nodes where each node is able to serve both business and maintenance requests. Maintenance operations can consume a lot of resources and impact the performance of business requests.</td>
</tr>
<tr>
<td><strong>Solution</strong><br />
Maintenance operations like data indexing can be handled by any cluster node when a distributed platform like IMDG is used. Nevertheless, it is often a good idea to use a dedicated node for this purpose. This node can be identical to other nodes from the deployment point of view (the same application as on the other nodes), but user requests are not routed to it and more powerful hardware can be used in some cases.</td>
</tr>
<tr>
<td><strong>Results</strong><br />
On the one hand, maintenance node provides potentially resource-consuming indexing processes with dedicated hardware capacities. On the other hand, maintenance processes do not interfere with user requests.</td>
</tr>
</tbody>
</table>
<p>Data Loader loads <em>all active </em>data to Coherence during each daily update, but there is &#8220;<em>dark matter</em>&#8221; that is not loaded into Coherence but is occasionally requested by some clients. For instance, this includes obsolete products and categories that are not visible on the site and not available for purchase. The Coherence Read-Through feature is used to cope with these entities &#8211; it is acceptable to load them from the RDBMS on demand because the number of such requests is very low.</p>
<h2>Implementation of Data Loader</h2>
<p>Design of Data Loader is influenced by two major factors:</p>
<ul>
<li>The loader should efficiently fetch and process a large data set in a relatively short time.</li>
<li>There are multiple consumers, like index builders or entity savers, that should process the same data.</li>
</ul>
<p>As a result, Data Loader is organized as an asynchronous pipeline (<a href="http://eaipatterns.com/PipesAndFilters.html">Pipes and Filters</a> design pattern) where batches of entities are loaded from RDBMS by a set of units that work in parallel threads. Loaded entities are submitted to a queue, and each consumer works in its own thread taking batches and processing them independently from the other participants. This schema is shown in the figure below:</p>
<div id="attachment_317" class="wp-caption aligncenter" style="width: 504px"><a href="http://highlyscalable.files.wordpress.com/2012/02/data-loading-pipeline.png"><img class="size-full wp-image-317" title="data-loading-pipeline" src="http://highlyscalable.files.wordpress.com/2012/02/data-loading-pipeline.png?w=594" alt=""   /></a><p class="wp-caption-text">Data Loading Pipeline</p></div>
<p>This schema is relatively simple because there is only one data source and structure of entities is not too complicated. Nevertheless, this pipeline can become more complex if there are multiple data sources and one business entity is assembled using several sources. In this case, a batch of entities can be initially loaded from a single source and then passed to another loader that enriches entities by additional attributes and so on.</p>
<table style="background-color:#eeff99;border-style:dashed;border-width:1px;">
<tbody>
<tr>
<td>
<h4><strong>Pattern: Data Loading Pipeline</strong></h4>
</td>
</tr>
<tr>
<td><strong>Problem</strong><br />
A system should be populated with a large data set that comes from a single source or multiple sources. One business entity can depend on multiple sources. There are many consumers of the loaded business entities that index, persist, or process entities.</td>
</tr>
<tr>
<td><strong>Solution</strong><br />
Adopt the <a href="http://eaipatterns.com/PipesAndFilters.html">Pipes and Filters</a> pattern. Implement each operation (loading or indexing) as an isolated unit that produces or consumes entities. Data producers or loaders should be driven by incoming requests that specify data to be loaded. Connect all units via asynchronous data channels and run multiple instances of each unit as an independent process.</td>
</tr>
<tr>
<td><strong>Results</strong><br />
Data Loading Pipeline allows one to organize efficient data loading in a multithreaded environment. All units can work in a batch mode, and more parallel instances can be easily added. Special attention should be paid to memory consumption because queues with entities can consume a lot of memory if the system is unbalanced or misconfigured.</td>
</tr>
</tbody>
</table>
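<p>A minimal sketch of such a pipeline, using only the standard <em>java.util.concurrent</em> classes, is shown below. It is an illustration under several assumptions rather than the production code: entities are represented by plain integer IDs, the RDBMS query is replaced by a loop, and every consumer gets its own bounded queue so that all consumers see all batches. The bounded queues address the memory concern mentioned in the pattern: loaders block when consumers fall behind.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class LoadingPipeline {
    /** Sentinel batch that tells a consumer to stop (compared by reference). */
    private static final List<Integer> END = new ArrayList<>();

    /**
     * Loads ids [0, total) in batches using loaderCount threads and hands every batch
     * to each of consumerCount consumers. Returns the number of entities each consumer saw.
     */
    static int[] run(int total, int batchSize, int loaderCount, int consumerCount) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        List<BlockingQueue<List<Integer>>> queues = new ArrayList<>();
        List<AtomicInteger> seen = new ArrayList<>();
        ExecutorService consumers = Executors.newFixedThreadPool(consumerCount);
        for (int c = 0; c < consumerCount; c++) {
            BlockingQueue<List<Integer>> queue = new LinkedBlockingQueue<>(16); // bounded
            AtomicInteger counter = new AtomicInteger();
            queues.add(queue);
            seen.add(counter);
            consumers.execute(() -> consume(queue, counter));
        }
        AtomicInteger nextOffset = new AtomicInteger();
        ExecutorService loaders = Executors.newFixedThreadPool(loaderCount);
        for (int l = 0; l < loaderCount; l++) {
            loaders.execute(() -> load(total, batchSize, nextOffset, queues));
        }
        try {
            loaders.shutdown();
            loaders.awaitTermination(1, TimeUnit.MINUTES);
            for (BlockingQueue<List<Integer>> queue : queues) queue.put(END);
            consumers.shutdown();
            consumers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        }
        int[] result = new int[consumerCount];
        for (int c = 0; c < consumerCount; c++) result[c] = seen.get(c).get();
        return result;
    }

    /** Loader unit: claims the next offset, "loads" a batch, and hands it to every consumer. */
    private static void load(int total, int batchSize, AtomicInteger nextOffset,
                             List<BlockingQueue<List<Integer>>> queues) {
        try {
            int from;
            while ((from = nextOffset.getAndAdd(batchSize)) < total) {
                List<Integer> batch = new ArrayList<>();
                int to = Math.min(total, from + batchSize);
                for (int id = from; id < to; id++) batch.add(id); // stand-in for an RDBMS query
                for (BlockingQueue<List<Integer>> queue : queues) queue.put(batch);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Consumer unit: processes batches until it sees the END sentinel. */
    private static void consume(BlockingQueue<List<Integer>> queue, AtomicInteger counter) {
        try {
            for (List<Integer> batch = queue.take(); batch != END; batch = queue.take()) {
                counter.addAndGet(batch.size()); // stand-in for indexing or saving
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

<p>For example, <em>LoadingPipeline.run(1000, 64, 3, 2)</em> loads 1000 IDs with three loader threads, and both consumers end up having processed all 1000 entities.</p>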
<p>Data inconsistency during saving of new data to Coherence is practically avoided using techniques that were described in <a title="Implementation of MVCC Transactions for Key-Value Stores" href="http://highlyscalable.wordpress.com/2012/01/07/mvcc-transactions-key-value/">one of my previous articles</a>.</p>
<h2>Implementation of Hierarchical Navigation</h2>
<p>When we started to work on the navigation procedures, we first tried to implement them using standard Coherence capabilities, i.e. filters and entry processors. This attempt was not very successful because of high memory consumption and relatively low performance in general. The next step was to design a compact data structure that supports very fast category tree traversal and extraction of products by Category ID. The structure we created is based on the <a href="http://en.wikipedia.org/wiki/Nested_set_model">nested set model</a> and is shown in the figure below:</p>
<div id="attachment_260" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/hierarchical-navigation.png"><img class="size-full wp-image-260" title="hierarchical-navigation" src="http://highlyscalable.files.wordpress.com/2012/02/hierarchical-navigation.png?w=594&#038;h=575" alt="" width="594" height="575" /></a><p class="wp-caption-text">Hierarchical Navigation Index Structure</p></div>
<p>A navigation index is a huge array of product IDs and their basic attributes that are frequently used in computation and filtering, for example, in-stock availability. In our domain model these attributes are binary, hence we efficiently packed them into integer numbers where each bit is reserved for a particular attribute. Each element of this array corresponds to a product-to-category relation, and one product ID can occur in this array multiple times if the product is associated with multiple categories. The hierarchy itself is stored as an indexed tree of category IDs, and each node contains two indexes into the product-to-category array. These indexes point to the start and end positions of the relations that belong to the particular category.</p>
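<p>The idea can be sketched in a few dozen lines. The code below is a simplified illustration, not the production structure: the attribute bits <em>IN_STOCK</em> and <em>ON_SALE</em>, the class names, and the builder-style methods are hypothetical. It lays out relations depth-first, so that the range of a category covers its own relations and those of all its descendants, which is the property borrowed from the nested set model.</p>

```java
import java.util.ArrayList;
import java.util.List;

final class NavigationIndex {
    static final int IN_STOCK = 1;      // bit 0 (hypothetical attribute)
    static final int ON_SALE  = 1 << 1; // bit 1 (hypothetical attribute)

    /** A node of the category tree with its range in the relation arrays. */
    static final class Category {
        final int id;
        final List<Category> children = new ArrayList<>();
        final List<int[]> directProducts = new ArrayList<>(); // {productId, attributeBits}
        int start, end; // half-open range [start, end), assigned when the index is built

        Category(int id) { this.id = id; }
        Category child(Category c) { children.add(c); return this; }
        Category product(int productId, int bits) {
            directProducts.add(new int[] {productId, bits});
            return this;
        }
    }

    private final int[] productIds;    // one element per product-to-category relation
    private final int[] attributeBits; // packed binary attributes, parallel to productIds

    NavigationIndex(Category root) {
        List<int[]> relations = new ArrayList<>();
        layout(root, relations);
        productIds = new int[relations.size()];
        attributeBits = new int[relations.size()];
        for (int i = 0; i < relations.size(); i++) {
            productIds[i] = relations.get(i)[0];
            attributeBits[i] = relations.get(i)[1];
        }
    }

    /** Depth-first layout: a category's range covers its own relations and all descendants'. */
    private static void layout(Category c, List<int[]> out) {
        c.start = out.size();
        out.addAll(c.directProducts);
        for (Category child : c.children) layout(child, out);
        c.end = out.size();
    }

    /** Product IDs in the category (including subcategories) that have all requiredBits set. */
    List<Integer> products(Category c, int requiredBits) {
        List<Integer> result = new ArrayList<>();
        for (int i = c.start; i < c.end; i++) {
            if ((attributeBits[i] & requiredBits) == requiredBits) result.add(productIds[i]);
        }
        return result;
    }
}
```

<p>Extracting the products of a category is a linear scan over one contiguous range, and filtering by binary attributes is a single bitwise test per relation. Note that this sketch does not deduplicate a product that is associated with several categories under the same ancestor.</p>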
<p>The second notable feature of this navigation solution is that each Processing Node fetches the index from Coherence and caches it entirely in local memory. This allows one to perform navigational operations without touching heavy-weight domain objects. If the data volume becomes high, it is possible to partition the index into several shards and perform distributed processing, although this was not the case in our application (an index with millions of products can be easily handled by one JVM). This technique can be considered a common pattern (or anti-pattern, depending on scalability requirements):</p>
<table style="background-color:#eeff99;border-style:dashed;border-width:1px;">
<tbody>
<tr>
<td>
<h4><strong>Pattern: Replicated Custom Index</strong></h4>
</td>
</tr>
<tr>
<td><strong>Problem</strong><br />
There is an application with a distributed data storage. It is necessary to perform a special type of query that involves a limited number of attributes for each entity, but complex business logic or high performance requirements make standard distributed scans inefficient.</td>
</tr>
<tr>
<td><strong>Solution<br />
</strong> When a non-standard traversal or querying is required and amount of involved data is limited, each node in the cluster can cache domain-specific index and use it to perform the operation.</td>
</tr>
<tr>
<td><strong>Results</strong><br />
This approach can be very efficient when standard indexes do not work well, but it can turn into scalability bottleneck if implemented incorrectly. If there are reasons to assume that index will become too large to be cached on one node, this is a serious argument against this approach.</td>
</tr>
</tbody>
</table>
<p>Index propagation throughout the cluster is shown in the figure below. Maintenance Node loads data from the Master DB, builds index, saves it in a serialized partitioned form to Coherence, and then Processing Nodes fetch it and cache locally:</p>
<div id="attachment_327" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/index-propagation.png"><img class="size-full wp-image-327" title="index-propagation" src="http://highlyscalable.files.wordpress.com/2012/02/index-propagation.png?w=594&#038;h=437" alt="" width="594" height="437" /></a><p class="wp-caption-text">Index Building and Propagation</p></div>
<h2>Implementation of Faceted Navigation</h2>
<p>Faceted navigation was described in the first section of this article, but it should be mentioned that the logic of computation is not always straightforward and is often affected by business rules and peculiarities of the business model. As an interesting example, we can consider the following use case. Imagine that, according to the business model, a product is not a final item of purchase, but a group of such items. For instance, when the user looks into the Jeans category, he or she can see Levi&#8217;s Jeans 501 as a product, but the actual item to be purchased is a particular instance of Levi&#8217;s Jeans 501, say Levi&#8217;s Jeans 501 of size 34&#215;30, white color. Considered as a product domain entity, Levi&#8217;s Jeans 501 will contain many particular items of different colors and sizes. From the faceted navigation perspective, this leads to an interesting issue. At first glance, it is fine to attribute each product with all sizes or colors that can be found among all its instances and build facets based on this information. Now imagine that there are two instances of Levi&#8217;s Jeans 501 &#8211; one is of size 34&#215;30 and in white color, the other is of size 30&#215;30 and in black color. If the user looks for black jeans of size 34&#215;30, this product will match the filter if it is simply attributed with a plain list of instance-level attributes. Nevertheless, there are no black jeans of size 34&#215;30 in the store. This situation is illustrated in the figure below:</p>
<div id="attachment_356" class="wp-caption aligncenter" style="width: 545px"><a href="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-facets.png"><img class="size-full wp-image-356" title="nested-documents-facets" src="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-facets.png?w=594" alt=""   /></a><p class="wp-caption-text">Incorrect Modeling of Products and Instances</p></div>
<p>This is just one example of the non-trivial issues with facetization logic. Many more issues and merchandiser-driven tweaks can appear in a real system. The conclusion is that faceted navigation can be pretty sophisticated and a certain implementation flexibility is required.</p>
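<p>The difference between the two ways of modeling can be made concrete with a small sketch. It assumes a product with two instances, a white one of size 34&#215;30 and a black one of size 30&#215;30, and a filter that asks for black jeans of size 34&#215;30. Attribute strings such as <em>color:black</em> and all class and method names are hypothetical.</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InstanceLevelMatching {
    /** A purchasable item, e.g. one size/color combination of a product. */
    static final class Instance {
        final Set<String> attributes;
        Instance(String... attributes) {
            this.attributes = new HashSet<>(Arrays.asList(attributes));
        }
    }

    /** Flattened check: every filter term occurs somewhere among the product's instances. */
    static boolean matchesFlattened(List<Instance> product, Set<String> filter) {
        Set<String> all = new HashSet<>();
        for (Instance instance : product) all.addAll(instance.attributes);
        return all.containsAll(filter);
    }

    /** Instance-level check: a single instance has to satisfy all filter terms. */
    static boolean matchesPerInstance(List<Instance> product, Set<String> filter) {
        for (Instance instance : product) {
            if (instance.attributes.containsAll(filter)) return true;
        }
        return false;
    }
}
```

<p>With these two instances, the flattened check accepts the product although no black 34&#215;30 item exists, while the instance-level check correctly rejects it.</p>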
<p>To cope with such issues, it was decided to keep the design of the facet index very straightforward and not to use data layouts like <a href="http://en.wikipedia.org/wiki/Inverted_index">inverted indexes</a>. Basically, all products, their instances and higher level groups of items are stored simply as nested arrays and maps of objects:</p>
<div id="attachment_440" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/03/facet-index1.png"><img class="size-full wp-image-440" title="facet-index" src="http://highlyscalable.files.wordpress.com/2012/03/facet-index1.png?w=594&#038;h=396" alt="" width="594" height="396" /></a><p class="wp-caption-text">Facet Index</p></div>
<p>All attributes are mapped to integer values, and these values are compactly stored in open addressing hash sets inside each instance or product. This allows one to iterate over all items within a category, efficiently apply the user-selected filter to each item, and increment facet counters for all attributes of the accepted items. I provided a detailed description of data structures and algorithms that allow one to do this in <a title="Ultimate Sets and Maps for Java, Part II" href="http://highlyscalable.wordpress.com/2012/01/01/ultimate-sets-and-maps-for-java-p2/">my previous post</a>.</p>
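<p>The counting loop itself can be sketched as follows. This is a simplified illustration: standard <em>java.util</em> sets and maps stand in for the compact open addressing structures mentioned above, the filter is treated as an AND of attribute codes, and the class and method names are hypothetical.</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FacetCounter {
    /**
     * Iterates over the items of a category, keeps the items that contain every
     * attribute code of the filter, and counts how often each attribute code
     * occurs among the accepted items.
     */
    static Map<Integer, Integer> count(List<Set<Integer>> items, Set<Integer> filter) {
        Map<Integer, Integer> counters = new HashMap<>();
        for (Set<Integer> itemAttributes : items) {
            if (!itemAttributes.containsAll(filter)) continue; // rejected by the filter
            for (Integer code : itemAttributes) counters.merge(code, 1, Integer::sum);
        }
        return counters;
    }
}
```

<p>The resulting counters are the cardinalities shown next to the facet values for the filtered product set.</p>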
<p>If the user-selected filter includes many attributes, it may be inefficient to check all these attributes one by one for each item. Performance of filtering can be improved using a <a title="Ultimate Sets and Maps for Java, Part II" href="http://highlyscalable.wordpress.com/2012/01/01/ultimate-sets-and-maps-for-java-p2/">Bloom filter</a> that allows one to apply a filter of several terms to a set of attributes using a couple of processor instructions. A Bloom filter is liable to false positives, so it cannot completely replace traditional checks using hash sets with attributes, but it can be used as a preliminary test to decrease the number of relatively expensive exact checks. This technique is used in a number of well-known systems, Google Bigtable and Apache HBase among them.</p>
<table style="background-color:#eeff99;border-style:dashed;border-width:1px;">
<tbody>
<tr>
<td>
<h4><strong>Pattern: Probabilistic Test</strong></h4>
</td>
</tr>
<tr>
<td><strong>Problem</strong><br />
There is a large collection of items (domain entities, files, records etc). It is necessary to provide the ability to select items that meet certain criteria &#8211; a simple yes/no predicate or a complex filter.</td>
</tr>
<tr>
<td><strong>Solution<br />
</strong> Items can be grouped into buckets. Each bucket contains one or more items and has a compact <em>signature</em> that allows one to answer the question &#8220;<em>is there at least one item inside the bucket that meets the criteria</em>&#8221;. This signature is typically a kind of hash that has a much smaller memory footprint than the original collection and is liable to false positives. The query processor tests the bucket&#8217;s signature and, if the result shows that the bucket can potentially contain the requested items, it goes into the bucket and checks all items individually.</td>
</tr>
<tr>
<td><strong>Results</strong><br />
Probabilistic testing is a good way to trade memory for time or IO. It increases memory consumption because of the signatures, but allows one to significantly decrease the volume of processed data for selective queries.</td>
</tr>
</tbody>
</table>
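<p>A minimal sketch of such a signature test, in the spirit of a Bloom filter squeezed into a single 64-bit word, is given below. The hash constants, the choice of two bits per attribute, and all names are assumptions made for this example. A negative answer from the signature is always correct, while a positive answer has to be confirmed by the exact check.</p>

```java
import java.util.Set;

final class SignatureTest {
    /** Maps an attribute code to a 64-bit mask with up to two bits set. */
    static long mask(int code) {
        int h1 = code * 0x9E3779B1; // two cheap multiplicative hashes;
        int h2 = code * 0x85EBCA6B; // the odd constants are arbitrary
        // The top 6 bits of each hash select one of the 64 bit positions.
        return (1L << (h1 >>> 26)) | (1L << (h2 >>> 26));
    }

    /** Signature of a set of attribute codes: the OR of the masks of all codes. */
    static long signature(Set<Integer> codes) {
        long signature = 0L;
        for (int code : codes) signature |= mask(code);
        return signature;
    }

    /** Cheap pre-check: false means "definitely no match", true means "maybe". */
    static boolean mayContainAll(long itemSignature, long filterSignature) {
        return (itemSignature & filterSignature) == filterSignature;
    }

    /** Full check: the exact comparison runs only if the signature test passes. */
    static boolean matches(Set<Integer> itemCodes, long itemSignature,
                           Set<Integer> filter, long filterSignature) {
        return mayContainAll(itemSignature, filterSignature) && itemCodes.containsAll(filter);
    }
}
```

<p>Signatures of items are computed once at indexing time and the signature of the filter once per request, so the pre-check costs one AND and one comparison per item regardless of the number of filter terms.</p>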
<p>Replicated Custom Index pattern is used to distribute Facet Index throughout the cluster, just like Navigation Index.</p>
<h2>Conclusions</h2>
<p>The described design showed the following properties after being in production for a long time:</p>
<ul>
<li>(+) Computational performance is superior in comparison with the general-purpose databases and third-party products.</li>
<li>(+) The deployment schema is very efficient at all stages of development, functional testing, performance testing, and production maintenance because of its simplicity and flexibility.</li>
<li>(+) Cost of ownership and development is pretty low in comparison with third-party products usage due to high flexibility and relative simplicity of the used data structures.</li>
<li>(-) Scalability by data is not a built-in feature of the described design because of non-sharded replicated indexes. Nevertheless, actual capacity is relatively high for eCommerce domain and sharding capabilities can be added.</li>
<li>(-) In the long term, there is a negative tendency towards over-complicated extensions around the core structures, caused by the growing complexity of the business logic.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/04/02/architecture-of-high-performance-ecommerce-backend/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/04/featured1.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/04/featured1.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/navigation.png" medium="image">
			<media:title type="html">navigation</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/facets.png" medium="image">
			<media:title type="html">facets</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/deployment-schema1.png" medium="image">
			<media:title type="html">deployment-schema</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/03/application-components.png" medium="image">
			<media:title type="html">application-components</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/data-loading-pipeline.png" medium="image">
			<media:title type="html">data-loading-pipeline</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/hierarchical-navigation.png" medium="image">
			<media:title type="html">hierarchical-navigation</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/index-propagation.png" medium="image">
			<media:title type="html">index-propagation</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-facets.png" medium="image">
			<media:title type="html">nested-documents-facets</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/03/facet-index1.png" medium="image">
			<media:title type="html">facet-index</media:title>
		</media:content>
	</item>
		<item>
		<title>NoSQL Data Modeling Techniques</title>
		<link>http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/</link>
		<comments>http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/#comments</comments>
		<pubDate>Thu, 01 Mar 2012 12:54:40 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[big table]]></category>
		<category><![CDATA[data modeling]]></category>
		<category><![CDATA[document]]></category>
		<category><![CDATA[geohash]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[index]]></category>
		<category><![CDATA[key value]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[nosql]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=284</guid>
		<description><![CDATA[NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems.  At the same time, NoSQL [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=highlyscalable.wordpress.com&#038;blog=30930683&#038;post=284&#038;subd=highlyscalable&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> apply well to NoSQL systems.  At the same time, NoSQL data modeling is not so well studied and lacks the systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.</p>
<p>I would like to thank <a href="http://www.kirkdorffer.com/">Daniel Kirkdorffer</a> who reviewed the article and cleaned up the grammar.</p>
<p>To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts an imaginary &#8220;evolution&#8221; of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:</p>
<div id="attachment_310" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/overview2.png"><img class="size-full wp-image-310" title="overview" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/overview2.png?w=594&#038;h=699" height="699" width="594" /></a><p class="wp-caption-text">NoSQL Data Models</p></div>
<p>First, we should note that SQL and the relational model in general were designed a long time ago to interact with the end user. This user-oriented nature had vast implications:</p>
<ul>
<li>The end user is often interested in aggregated reporting information, not in separate data items, and SQL pays a lot of attention to this aspect.</li>
<li>No one can expect human users to explicitly control concurrency, integrity, consistency, or data type validity. That&#8217;s why SQL pays a lot of attention to transactional guarantees, schemas, and referential integrity.</li>
</ul>
<p>On the other hand, it turned out that software applications are often not interested in in-database aggregation and are able to control, at least in many cases, integrity and validity themselves. Besides this, the elimination of these features had an extremely important influence on the performance and scalability of the stores. And this was where a new evolution of data models began:</p>
<ul>
<li>Key-Value storage is a very simplistic, but very powerful model. Many techniques that are described below are perfectly applicable to this model.</li>
<li>One of the most significant shortcomings of the Key-Value model is its poor applicability to cases that require processing of key ranges. The Ordered Key-Value model overcomes this limitation and significantly improves aggregation capabilities.</li>
<li>The Ordered Key-Value model is very powerful, but it does not provide any framework for value modeling. In general, value modeling can be done by an application, but BigTable-style databases go further and model values as a map-of-maps-of-maps, namely, column families, columns, and timestamped versions.</li>
<li>Document databases advance the BigTable model by offering two significant improvements. The first one is values with schemas of arbitrary complexity, not just a map-of-maps. The second one is database-managed indexes, at least in some implementations. Full Text Search Engines can be considered a related species in the sense that they also offer flexible schemas and automatic indexes. The main difference is that Document databases group indexes by field names, as opposed to Search Engines that group indexes by field values. It is also worth noting that some Key-Value stores like Oracle Coherence gradually move towards Document databases via the addition of indexes and in-database entry processors.</li>
<li>Finally, Graph data models can be considered a side branch of evolution that originates from the Ordered Key-Value models. Graph databases allow one to model business entities very transparently (<em>this depends on that</em>), but hierarchical modeling techniques make other data models very competitive in this area too. Graph databases are related to Document databases because many implementations allow one to model a value as a map or document.</li>
</ul>
<h2>General Notes on NoSQL Data Modeling</h2>
<p>The rest of this article describes concrete data modeling techniques and patterns. As a preface, I would like to provide a few general notes on NoSQL data modeling:</p>
<ul>
<li>NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
<ul>
<li>Relational modeling is typically driven by the structure of available data. The main design theme is <strong>&#8220;What answers do I have?&#8221;</strong></li>
<li>NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is <strong>&#8220;What questions do I have?&#8221;</strong></li>
</ul>
</li>
<li>NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. In this article I describe several well-known data structures that are not specific to NoSQL, but are very useful in practical NoSQL modeling.</li>
<li>Data duplication and denormalization are first-class citizens.</li>
<li>Relational databases are not very convenient for hierarchical or graph-like data modeling and processing. Graph databases are obviously a perfect solution for this area, but most NoSQL solutions are actually surprisingly strong for such problems. That is why the current article devotes a separate section to hierarchical data modeling.</li>
</ul>
<div>Although data modeling techniques are basically implementation agnostic, this is a list of the particular systems that I had in mind while working on this article:</div>
<div>
<ul>
<li>Key-Value Stores: Oracle Coherence, Redis, Kyoto Cabinet</li>
<li>BigTable-style Databases: Apache HBase, Apache Cassandra</li>
<li>Document Databases: MongoDB, CouchDB</li>
<li>Full Text Search Engines: Apache Lucene, Apache Solr</li>
<li>Graph Databases: neo4j, FlockDB</li>
</ul>
</div>
<h2>Conceptual Techniques</h2>
<p>This section is devoted to the basic principles of NoSQL data modeling.</p>
<h3>(1) Denormalization</h3>
<p>Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify or optimize query processing or to fit the user&#8217;s data into a particular data model. Most techniques described in this article leverage denormalization in one form or another.</p>
<p>In general, denormalization is helpful for the following trade-offs:</p>
<ul>
<li><em>Query data volume</em> or <em>IO per query</em> VS <em>total data volume</em>. Using denormalization one can group all data that is needed to process a query in one place. This often means that for different query flows the same data will be accessed in different combinations. Hence we need to duplicate data, which increases total data volume.</li>
<li><em>Processing complexity</em> VS <em>total data volume</em>. Modeling-time normalization and consequent query-time joins obviously increase the complexity of the query processor, especially in distributed systems. Denormalization allows one to store data in a query-friendly structure to simplify query processing.</li>
</ul>
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases</p>
<h3>(2) Aggregates</h3>
<p>All major genres of NoSQL provide soft schema capabilities in one way or another:</p>
<ul>
<li>Key-Value Stores and Graph Databases typically do not place constraints on values, so values can have an arbitrary format. It is also possible to vary the number of records for one business entity by using composite keys. For example, a user account can be modeled as a set of entries with composite keys like <em>UserID_name, UserID_email, UserID_messages</em> and so on. If a user has no email or messages then the corresponding entry is not recorded.</li>
<li>BigTable models support soft schema via a variable set of columns within a <em>column family</em> and a variable number of <em>versions </em>for one <em>cell</em>.</li>
<li>Document databases are inherently schema-less, although some of them allow one to validate incoming data using a user-defined schema.</li>
</ul>
<p>Soft schema allows one to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities. This feature provides two major facilities:</p>
<ul>
<li>Minimization of one-to-many relationships by means of nested entities and, consequently, reduction of joins.</li>
<li>Masking of &#8220;technical&#8221; differences between business entities and modeling of heterogeneous business entities using one collection of documents or one table.</li>
</ul>
<div>These facilities are illustrated in the figure below. This figure depicts modeling of a product entity for an eCommerce business domain. Initially, we can say that all products have an ID, Price, and Description. Next, we discover that different types of products have different attributes like Author for Book or Length for Jeans. Some of these attributes have a one-to-many or many-to-many nature like Tracks in Music Albums. Next, it is possible that some entities cannot be modeled using fixed types at all. For example, Jeans attributes are not consistent across brands and are specific to each manufacturer. It is possible to overcome all these issues in a relational normalized data model, but the solutions are far from elegant. Soft schema allows one to use a single Aggregate (product) that can model all types of products and their attributes:</div>
<div>
<div id="attachment_404" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/soft-schema2.png"><img class="size-full wp-image-404" title="soft-schema" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/soft-schema2.png?w=594&#038;h=439" height="439" width="594" /></a><p class="wp-caption-text">Entity Aggregation</p></div>
</div>
<div>
<p>Embedding with denormalization can greatly impact updates both in performance and consistency, so special attention should be paid to update flows.</p>
</div>
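<p>As a rough illustration of the product Aggregate described above, the sketch below keeps heterogeneous products in one collection. The field names and sample values are assumptions made up for this example and are not taken from the figure:</p>

```python
# Hypothetical product aggregates stored in a single collection.
# Only id, price and description are common; everything else varies by type.
products = [
    {"id": 1, "price": 12.5, "description": "Paperback novel",
     "type": "Book", "author": "Jane Doe"},
    {"id": 2, "price": 9.9, "description": "Studio album",
     "type": "Album",
     # One-to-many relationship embedded as nested entities.
     "tracks": [{"title": "Intro", "length_sec": 95},
                {"title": "Outro", "length_sec": 210}]},
    {"id": 3, "price": 59.0, "description": "Slim-fit jeans",
     "type": "Jeans",
     # Manufacturer-specific attributes kept as a free-form map.
     "attributes": {"length": 32, "waist": 30, "wash": "dark"}},
]


def describe(product):
    """Split a product into its common fields and its type-specific fields."""
    common = {k: product[k] for k in ("id", "price", "description")}
    extras = {k: v for k, v in product.items() if k not in common}
    return common, extras
```

<p>Code that only needs the common fields can treat every product uniformly, while type-specific code reads the extra fields it knows about.</p>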
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases</p>
<h3>(3) Application Side Joins</h3>
<p>Joins are rarely supported in NoSQL solutions. As a consequence of the &#8220;question-oriented&#8221; NoSQL nature, joins are often handled at design time as opposed to relational models where joins are handled at query execution time. Query time joins almost always mean a performance penalty, but in many cases one can avoid joins using Denormalization and Aggregates, i.e. embedding nested entities. Of course, in many cases joins are inevitable and should be handled by an application. The major use cases are:</p>
<ul>
<li>Many-to-many relationships are often modeled by links and require joins.</li>
<li>Aggregates are often inapplicable when entity internals are the subject of frequent modifications. It is usually better to keep a record that something happened and join the records at query time as opposed to changing a value. For example, a messaging system can be modeled as a User entity that contains nested Message entities. But if messages are often appended, it may be better to extract Messages as independent entities and join them to the User at query time: <a href="http://highlyscalable.files.wordpress.com/2012/03/aggregates-joins.png"><img class="aligncenter size-full wp-image-422" title="aggregates-joins" alt="" src="http://highlyscalable.files.wordpress.com/2012/03/aggregates-joins.png?w=594"   /></a></li>
</ul>
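<p>The messaging example can be sketched as follows. Two plain dictionaries stand in for the User and Message collections, and all names here are hypothetical. The join is done by the application rather than by the store:</p>

```python
# Toy stand-ins for two independent collections (assumed layout).
users = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
messages = {
    "m1": {"user_id": "u1", "text": "Hello"},
    "m2": {"user_id": "u2", "text": "Hi"},
    "m3": {"user_id": "u1", "text": "Bye"},
}


def user_with_messages(user_id):
    """Application-side join: fetch the user, then attach its messages."""
    user = dict(users[user_id])  # copy so the stored record is not modified
    user["messages"] = [
        messages[key]["text"]
        for key in sorted(messages)
        if messages[key]["user_id"] == user_id
    ]
    return user
```

<p>Appending a message touches only the message collection, so the User entity is never rewritten. The full scan over messages is only for brevity; a real store would more likely serve this lookup through a key range or an index.</p>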
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases</p>
<h2>General Modeling Techniques</h2>
<p>In this section we discuss general modeling techniques that are applicable to a variety of NoSQL implementations.</p>
<h3>(4) Atomic Aggregates</h3>
<p>Many, although not all, NoSQL solutions have limited transaction support. In some cases one can achieve transactional behavior using distributed locks or <a title="Implementation of MVCC Transactions for Key-Value Stores" href="http://highlyscalable.wordpress.com/2012/01/07/mvcc-transactions-key-value/">application-managed MVCC</a>, but it is common to model data using an Aggregates technique to guarantee some of the ACID properties.</p>
<p>One of the reasons why powerful transactional machinery is an inevitable part of the relational databases is that normalized data typically require multi-place updates. On the other hand, Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically:</p>
<div id="attachment_409" class="wp-caption aligncenter" style="width: 543px"><a href="http://highlyscalable.files.wordpress.com/2012/02/atomic-aggregate1.png"><img class="size-full wp-image-409" title="atomic-aggregate" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/atomic-aggregate1.png?w=594"   /></a><p class="wp-caption-text">Atomic Aggregates</p></div>
<p>Of course, Atomic Aggregates as a data modeling technique is not a complete transactional solution, but if the store provides certain guarantees of atomicity, locks, or test-and-set instructions then Atomic Aggregates can be applicable.</p>
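<p>A minimal sketch of this idea, assuming the store offers a versioned compare-and-set primitive. The <em>VersionedStore</em> class below is a toy in-memory stand-in written for this example, not the API of any particular database:</p>

```python
import threading


class VersionedStore:
    """Toy in-memory store with a compare-and-set primitive (an assumption)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        """Write the value only if nobody has changed the entry in the meantime."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True


def add_item(store, order_id, item, price):
    """Update the whole order aggregate (items and total) in one atomic write."""
    while True:
        version, order = store.get(order_id)
        order = order or {"items": [], "total": 0.0}
        updated = {"items": order["items"] + [item],
                   "total": order["total"] + price}
        if store.compare_and_set(order_id, version, updated):
            return updated
        # Another writer won the race: reread and retry.
```

<p>Because the items and the total live in one aggregate, a reader never observes a total that disagrees with the item list, even though no multi-key transaction is involved.</p>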
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases</p>
<h3>(5) Enumerable Keys</h3>
<p>Perhaps the greatest benefit of an unordered Key-Value data model is that entries can be partitioned across multiple servers by just hashing the key. Sorting makes things more complex, but sometimes an application is able to take some advantage of ordered keys even if the storage doesn&#8217;t offer such a feature. Let&#8217;s consider the modeling of email messages as an example:</p>
<ol>
<li>Some NoSQL stores provide atomic counters that allow one to generate sequential IDs. In this case one can store messages using <em>userID_messageID</em> as a composite key. If the latest message ID is known, it is possible to traverse previous messages. It is also possible to traverse preceding and succeeding messages for any given message ID.</li>
<li>Messages can be grouped into buckets, for example, daily buckets. This allows one to traverse a mail box backward or forward starting from any specified date or the current date.</li>
</ol>
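<p>The first option can be sketched on top of a plain dictionary that plays the role of an unordered Key-Value store. The per-user counter, the <em>userID_latest</em> entry and the function names are assumptions introduced for this example:</p>

```python
import itertools

store = {}      # stands in for an unordered key-value store
counters = {}   # stands in for per-user atomic counters


def next_id(user_id):
    """Return the next sequential message ID for the user."""
    counter = counters.setdefault(user_id, itertools.count(1))
    return next(counter)


def put_message(user_id, text):
    """Store a message under the composite key userID_messageID."""
    message_id = next_id(user_id)
    store[f"{user_id}_{message_id}"] = text
    store[f"{user_id}_latest"] = message_id  # remember the newest ID
    return message_id


def latest_messages(user_id, limit):
    """Walk backward from the latest ID using point lookups only."""
    latest = store.get(f"{user_id}_latest", 0)
    ids = range(latest, max(latest - limit, 0), -1)
    return [store[f"{user_id}_{i}"] for i in ids]
```

<p>No range scan is needed: because IDs are dense and sequential, every key of interest can be computed by the application.</p>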
<p><strong>Applicability</strong>: Key-Value Stores</p>
<h3>(6) Dimensionality Reduction</h3>
<p>Dimensionality Reduction is a technique that allows one to map multidimensional data to a Key-Value model or to other non-multidimensional models.</p>
<p>Traditional geographic information systems use some variation of a Quadtree or R-Tree for indexes. These structures need to be updated in-place and are expensive to manipulate when data volumes are large. An alternative approach is to traverse the 2D structure and flatten it into a plain list of entries. One well known example of this technique is a Geohash. A Geohash uses a Z-like scan to fill 2D space and each move is encoded as 0 or 1 depending on direction. The bits for longitude moves and latitude moves are interleaved. The encoding process is illustrated in the figure below, where black and red bits stand for longitude and latitude, respectively:</p>
<div id="attachment_398" class="wp-caption aligncenter" style="width: 394px"><a href="http://highlyscalable.files.wordpress.com/2012/02/geohash-traversal1.png"><img class="size-full wp-image-398" title="geohash-traversal" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/geohash-traversal1.png?w=594"   /></a><p class="wp-caption-text">Geohash Index</p></div>
<p>An important feature of a Geohash is its ability to estimate distance between regions using bit-wise code proximity, as is shown in the figure. Geohash encoding allows one to store geographical information using plain data models, like sorted key values, while preserving spatial relationships. The Dimensionality Reduction technique for BigTable was described in [6.1]. More information about Geohashes and other related techniques can be found in [6.2] and [6.3].</p>
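<p>The bit interleaving can be sketched as follows. This is a simplified illustration that stops at the bit string (the usual Geohash additionally encodes the bits as base-32 text), and the function names are chosen for this example only:</p>

```python
def geohash_bits(lat, lon, precision=20):
    """Interleave longitude and latitude bisection bits, longitude first."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    for i in range(precision):
        # Even positions refine longitude, odd positions refine latitude.
        rng, value = (lon_range, lon) if i % 2 == 0 else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append("1")
            rng[0] = mid   # keep the upper half
        else:
            bits.append("0")
            rng[1] = mid   # keep the lower half
    return "".join(bits)


def common_prefix_length(a, b):
    """Number of leading characters shared by two bit strings."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length
```

<p>Points that are close to each other usually share a long prefix, so a prefix scan over sorted keys returns a spatial region. The reverse does not always hold: two neighbouring points that fall on different sides of a cell boundary can have codes that differ in an early bit.</p>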
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases</p>
<h3>(7) Index Table</h3>
<p>Index Table is a very straightforward technique that allows one to take advantage of indexes in stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table with keys that follow the access pattern. For example, there is a master table that stores user accounts that can be accessed by user ID. A query that retrieves all users by a specified city can be supported by means of an additional table where city is a key:</p>
<div id="attachment_399" class="wp-caption aligncenter" style="width: 448px"><a href="http://highlyscalable.files.wordpress.com/2012/02/index-table.png"><img class="size-full wp-image-399" title="index-table" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/index-table.png?w=594"   /></a><p class="wp-caption-text">Index Table Example</p></div>
<p>An Index Table can be updated for each update of the master table or in batch mode. Either way, it results in an additional performance penalty and can become a consistency issue.</p>
<p>Index Table can be considered as an analog of materialized views in relational databases.</p>
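<p>The users-by-city example can be sketched with two dictionaries, one for the master table and one for the index table. The table and function names are assumptions made for this illustration, and the index is maintained synchronously on every write:</p>

```python
users_by_id = {}        # master table: user ID -> user record
user_ids_by_city = {}   # index table: city -> set of user IDs


def put_user(user_id, name, city):
    """Write the master record and keep the index table in step with it."""
    old = users_by_id.get(user_id)
    if old is not None and old["city"] != city:
        # The user moved: remove the stale index entry first.
        user_ids_by_city[old["city"]].discard(user_id)
    users_by_id[user_id] = {"name": name, "city": city}
    user_ids_by_city.setdefault(city, set()).add(user_id)


def users_in_city(city):
    """Answer the query through the index, then fetch the master records."""
    ids = sorted(user_ids_by_city.get(city, ()))
    return [users_by_id[user_id] for user_id in ids]
```

<p>In a real store the two writes in <em>put_user</em> are usually not atomic, which is exactly where the consistency issue mentioned above comes from.</p>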
<p><strong>Applicability</strong>: BigTable-style Databases</p>
<h3>(8) Composite Key Index</h3>
<p>Composite key is a very generic technique, but it is extremely beneficial when a store with ordered keys is used. Composite keys in conjunction with secondary sorting allow one to build a kind of multidimensional index which is fundamentally similar to the previously described Dimensionality Reduction technique. For example, let&#8217;s take a set of records where each record is a user statistic. If we are going to aggregate these statistics by the region the user came from, we can use keys in the format <em>(State:City:UserID)</em> that allow us to iterate over records for a particular state or city if the store supports the selection of key ranges by a partial key match (as BigTable-style systems do):</p>
<pre class="brush: sql; title: ; notranslate">
SELECT Values WHERE state=&quot;CA:*&quot;
SELECT Values WHERE city=&quot;CA:San Francisco*&quot;
</pre>
<div id="attachment_477" class="wp-caption aligncenter" style="width: 407px"><a href="http://highlyscalable.files.wordpress.com/2012/03/composite-key-index.png"><img class="size-full wp-image-477" title="composite-key-index" alt="" src="http://highlyscalable.files.wordpress.com/2012/03/composite-key-index.png?w=594"   /></a><p class="wp-caption-text">Composite Key Index</p></div>
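<p>A partial key match over ordered keys can be sketched with a sorted list and a binary search. The sample keys and the <em>scan_prefix</em> helper are illustrative assumptions, not the API of a specific store:</p>

```python
import bisect

# Composite keys in State:City:UserID format, kept in sorted order.
keys = sorted([
    "CA:San Francisco:u1",
    "CA:San Jose:u2",
    "CA:San Francisco:u3",
    "NY:New York:u4",
])


def scan_prefix(sorted_keys, prefix):
    """Return all keys that start with the prefix using one range scan."""
    start = bisect.bisect_left(sorted_keys, prefix)  # first key >= prefix
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys with the same prefix are contiguous, so we can stop
        result.append(key)
    return result
```

<p>Here <em>scan_prefix(keys, "CA:")</em> selects a whole state and <em>scan_prefix(keys, "CA:San Francisco:")</em> selects one city. Both queries are a single seek followed by a sequential read.</p>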
<p><strong>Applicability</strong>: BigTable-style Databases</p>
<h3>(9) Aggregation with Composite Keys</h3>
<p>Composite keys may be used not only for indexing, but for different types of grouping. Let&#8217;s consider an example. There is a huge array of log records with information about internet users and their visits from different sites (<em>click stream</em>). The goal is to count the number of unique users for each site. This is similar to the following SQL query:</p>
<pre class="brush: sql; title: ; notranslate">
SELECT count(distinct(user_id)) FROM clicks GROUP BY site
</pre>
<p>We can model this situation using composite keys with a UserID prefix:</p>
<div id="attachment_383" class="wp-caption aligncenter" style="width: 585px"><a href="http://highlyscalable.files.wordpress.com/2012/02/composite-key-collating1.png"><img class="size-full wp-image-383" title="composite-key-collating" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/composite-key-collating1.png?w=594"   /></a><p class="wp-caption-text">Counting Unique Users using Composite Keys</p></div>
<p>The idea is to keep all records for one user collocated, so it is possible to fetch such a frame into memory (one user cannot produce too many events) and to eliminate site duplicates using a hash table or a similar structure. An alternative technique is to have one entry for one user and append sites to this entry as events arrive. Nevertheless, entry modification is generally less efficient than entry insertion in the majority of implementations.</p>
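<p>A sketch of this counting scheme, assuming the events arrive as <em>(user_id, site)</em> pairs already sorted by user ID, as a scan over keys with a UserID prefix would return them. The function name and sample data are made up for this example:</p>

```python
from collections import Counter
from itertools import groupby


def unique_users_per_site(sorted_events):
    """Count distinct users per site in one pass over user-sorted events."""
    counts = Counter()
    for _, user_events in groupby(sorted_events, key=lambda event: event[0]):
        # One user's frame fits in memory, so a set removes duplicate sites.
        sites = {site for _, site in user_events}
        counts.update(sites)  # each site gets +1 for this user
    return counts


events = sorted([
    ("u1", "a.com"), ("u2", "a.com"), ("u1", "a.com"),
    ("u1", "b.com"), ("u3", "b.com"),
])
```

<p>Only one user&#8217;s frame is held in memory at a time, so the memory footprint is bounded by the busiest user and the number of sites rather than by the size of the log.</p>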
<p><strong>Applicability</strong>: Ordered Key-Value Stores, BigTable-style Databases</p>
<h3>(10) Inverted Search &#8211; Direct Aggregation</h3>
<p>This technique is more a data processing pattern than a data modeling one. Nevertheless, data models are also impacted by usage of this pattern. The main idea of this technique is to use an index to find data that meets the criteria, but to aggregate data using the original representation or full scans. Let&#8217;s consider an example. There are a number of log records with information about internet users and their visits from different sites (<em>click stream</em>). Let&#8217;s assume that each record contains a user ID, the categories this user belongs to (Men, Women, Bloggers, etc), the city this user came from, and the visited site. The goal is to describe the audience that meets some criteria (site, city, etc) in terms of unique users for each category that occurs in this audience (i.e. in the set of users that meet the criteria).</p>
<p>It is quite clear that a search of users that meet the criteria can be efficiently done using inverted indexes like <em>{Category -&gt; [user IDs]}</em> or <em>{Site -&gt; [user IDs]}</em>. Using such indexes, one can intersect or union the corresponding user IDs (this can be done very efficiently if user IDs are stored as sorted lists or bit sets) and obtain an audience. But describing an audience, which is similar to an aggregation query like</p>
<pre class="brush: sql; title: ; notranslate">
SELECT count(distinct(user_id)) ... GROUP BY category
</pre>
<p>cannot be handled efficiently using an inverted index if the number of categories is big. To cope with this, one can build a direct index of the form <em>{UserID -&gt; [Categories]}</em> and iterate over it in order to build a final report. This schema is depicted below:</p>
<div id="attachment_388" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/invert-direct1.png"><img class="size-full wp-image-388" title="invert-direct" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/invert-direct1.png?w=594&#038;h=438" height="438" width="594" /></a><p class="wp-caption-text">Counting Unique Users using Inverse and Direct Indexes</p></div>
<p>And as a final note, we should take into account that random retrieval of records for each user ID in the audience can be inefficient. One can grapple with this problem by leveraging batch query processing. This means that some number of user sets can be precomputed (for different criteria) and then all reports for this batch of audiences can be computed in one full scan of direct or inverse index.</p>
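<p>The combination of the two index types can be sketched with in-memory dictionaries. The sample data and the <em>describe_audience</em> function are assumptions made for this illustration:</p>

```python
from collections import Counter

# Inverted indexes: attribute value -> set of user IDs.
site_to_users = {"a.com": {1, 2, 3}, "b.com": {3, 4}}
city_to_users = {"Paris": {1, 3, 4}, "Rome": {2}}

# Direct index: user ID -> categories.
user_to_categories = {
    1: ["Men", "Bloggers"],
    2: ["Women"],
    3: ["Women", "Bloggers"],
    4: ["Men"],
}


def describe_audience(site, city):
    """Find the audience with inverted indexes, aggregate with the direct one."""
    audience = site_to_users.get(site, set()) & city_to_users.get(city, set())
    report = Counter()
    for user_id in audience:
        # Each user contributes at most once to every category.
        report.update(set(user_to_categories.get(user_id, [])))
    return report
```

<p>The loop performs one direct-index lookup per user in the audience, which is the random access cost that the batch processing idea above is meant to amortize.</p>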
<p><strong>Applicability</strong>: Key-Value Stores, BigTable-style Databases, Document Databases</p>
<h2>Hierarchy Modeling Techniques</h2>
<h3>(11) Tree Aggregation</h3>
<p>Trees or even arbitrary graphs (with the aid of denormalization) can be modeled as a single record or document.</p>
<ul>
<li>This technique is efficient when the tree is accessed at once (for example, an entire tree of blog comments is fetched to show a page with a post).</li>
<li>Search and arbitrary access to the entries may be problematic.</li>
<li>Updates are inefficient in most NoSQL implementations (as compared to independent nodes).</li>
</ul>
<div id="attachment_381" class="wp-caption aligncenter" style="width: 452px"><a href="http://highlyscalable.files.wordpress.com/2012/02/tree-aggregation.png"><img class="size-full wp-image-381" title="tree-aggregation" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/tree-aggregation.png?w=594"   /></a><p class="wp-caption-text">Tree Aggregation</p></div>
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases</p>
<h3>(12) Adjacency Lists</h3>
<p>Adjacency Lists are a straightforward way of graph modeling &#8211; each node is modeled as an independent record that contains arrays of direct ancestors or descendants. It allows one to search for nodes by identifiers of their parents or children and, of course, to traverse a graph by doing one hop per query. This approach is usually inefficient for getting an entire subtree for a given node, or for deep or wide traversals.</p>
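<p>A small sketch of a tree stored as adjacency lists, with a dictionary standing in for the store. The node layout and the <em>subtree</em> function are assumptions for illustration. The function also counts lookups to show why subtree retrieval is costly:</p>

```python
from collections import deque

# Each node is an independent record with links to its direct neighbours.
nodes = {
    "root": {"parent": None, "children": ["a", "b"]},
    "a": {"parent": "root", "children": ["c"]},
    "b": {"parent": "root", "children": []},
    "c": {"parent": "a", "children": []},
}


def subtree(node_id):
    """Breadth-first traversal; returns visited IDs and the number of lookups."""
    visited, lookups = [], 0
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        record = nodes[current]  # one store lookup (one hop) per visited node
        lookups += 1
        visited.append(current)
        queue.extend(record["children"])
    return visited, lookups
```

<p>Fetching a subtree of <em>n</em> nodes costs <em>n</em> separate lookups, whereas finding the parent or children of a single node needs just one.</p>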
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases</p>
<h3>(13) Materialized Paths</h3>
<p>Materialized Paths is a technique that helps to avoid recursive traversals of tree-like structures. This technique can be considered a kind of denormalization. The idea is to annotate each node with the identifiers of all its ancestors or descendants, so that it is possible to determine all descendants or ancestors of the node without traversal:</p>
<div id="attachment_372" class="wp-caption aligncenter" style="width: 598px"><a href="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths2.png"><img class="size-full wp-image-372" title="materialized-paths" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths2.png?w=594"   /></a><p class="wp-caption-text">Materialized Paths for eShop Category Hierarchy</p></div>
<p>This technique is especially helpful for Full Text Search Engines because it allows one to convert hierarchical structures into flat documents. One can see in the figure above that all products or subcategories within the <em>Men&#8217;s Shoes </em>category can be retrieved using a short query which is simply a category name.</p>
<p>Materialized Paths can be stored as a set of IDs or as a single string of concatenated IDs. The latter option allows one to search for nodes that meet certain partial path criteria using regular expressions. This option is illustrated in the figure below (the path includes the node itself):</p>
<div id="attachment_377" class="wp-caption aligncenter" style="width: 425px"><a href="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths-2.png"><img class="size-full wp-image-377" title="materialized-paths-2" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths-2.png?w=594"   /></a><p class="wp-caption-text">Query Materialized Paths using RegExp</p></div>
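<p>The concatenated-string variant can be sketched as follows. The category IDs, the <em>/</em> separator and the function names are assumptions chosen for this example:</p>

```python
import re

# Node ID -> materialized path (the path includes the node itself).
paths = {
    "shoes": "/shoes",
    "mens": "/shoes/mens",
    "boots": "/shoes/mens/boots",
    "womens": "/shoes/womens",
    "hats": "/hats",
}


def descendants(node_id):
    """Find all nodes located below the given node with one regex match each."""
    pattern = re.compile("^" + re.escape(paths[node_id]) + "/")
    return sorted(n for n, path in paths.items() if pattern.match(path))


def ancestors(node_id):
    """Read the ancestors straight from the node's own path, root first."""
    return paths[node_id].strip("/").split("/")[:-1]
```

<p>Neither function follows parent or child links, so no recursive traversal is required. The price is that moving a node means rewriting the paths of its entire subtree.</p>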
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, Search Engines</p>
<h3>(14) Nested Sets</h3>
<p><a href="http://en.wikipedia.org/wiki/Nested_set_model">Nested sets</a> is a standard technique for modeling tree-like structures. It is widely used in relational databases, but it is perfectly applicable to Key-Value Stores and Document Databases. The idea is to store the leaves of the tree in an array and to map each non-leaf node to a range of leaves using start and end indexes, as is shown in the figure below:</p>
<div id="attachment_360" class="wp-caption aligncenter" style="width: 532px"><a href="http://highlyscalable.files.wordpress.com/2012/02/nested-sets.png"><img class="size-full wp-image-360" title="nested-sets" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/nested-sets.png?w=594"   /></a><p class="wp-caption-text">Modeling of eCommerce Catalog using Nested Sets</p></div>
<p>This structure is pretty efficient for immutable data because it has a small memory footprint and allows one to fetch all leaves for a given node without traversals. Nevertheless, inserts and updates are quite costly because the addition of one leaf causes an extensive update of indexes.</p>
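<p>A minimal sketch of the leaf array and the index ranges, using a made-up catalog. The category names and helper functions are assumptions, not a transcription of the figure:</p>

```python
# Leaves of the tree stored as one array.
leaves = ["Boots", "Sneakers", "Sandals", "Caps"]

# Each non-leaf node maps to an inclusive (start, end) range of leaf indexes.
ranges = {
    "Catalog": (0, 3),
    "Shoes": (0, 2),
    "Men's Shoes": (0, 1),
    "Women's Shoes": (2, 2),
    "Hats": (3, 3),
}


def leaves_of(node):
    """Fetch all leaves under a node with a single array slice."""
    start, end = ranges[node]
    return leaves[start:end + 1]


def contains(outer, inner):
    """True if the inner node's range lies inside the outer node's range."""
    outer_start, outer_end = ranges[outer]
    inner_start, inner_end = ranges[inner]
    return outer_start <= inner_start and inner_end <= outer_end
```

<p>Reading is a slice and containment is two comparisons. Inserting a new leaf in the middle of the array, however, would shift the indexes of every range to its right, which is the update cost noted above.</p>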
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases</p>
<h3>(15) Nested Documents Flattening: Numbered Field Names</h3>
<p>Search Engines typically work with flat documents, i.e. each document is a flat list of fields and values. The goal of data modeling is to map business entities to plain documents and this can be challenging if the entities have a complex internal structure. One typical challenge is mapping documents with a hierarchical structure, i.e. documents with nested documents inside. Let&#8217;s consider the following example:</p>
<div id="attachment_363" class="wp-caption aligncenter" style="width: 499px"><a href="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-1.png"><img class="size-full wp-image-363" title="nested-documents-1" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-1.png?w=594"   /></a><p class="wp-caption-text">Nested Documents Problem</p></div>
<p>Each business entity is some kind of resume. It contains a person&#8217;s name and a list of his or her skills with a skill level. An obvious way to model such an entity is to create a plain document with <em>Skill</em> and <em>Level</em> fields. This model allows one to search for a person by skill or by level, but queries that combine both fields are liable to result in false matches, as depicted in the figure above.</p>
<p>One way to overcome this issue was suggested in [4.6]. The main idea of this technique is to index each skill and corresponding level as a dedicated pair of fields <em>Skill_i</em> and <em>Level_i, </em>and to search for all these pairs simultaneously (where the number of OR-ed terms in a query is as high as the maximum number of skills for one person):</p>
<div id="attachment_365" class="wp-caption aligncenter" style="width: 550px"><a href="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-3.png"><img class="size-full wp-image-365" title="nested-documents-3" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-3.png?w=594"   /></a><p class="wp-caption-text">Nested Document Modeling using Numbered Field Names</p></div>
<p>This approach is not really scalable because query complexity grows rapidly as a function of the number of nested structures.</p>
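<p>The numbered field names idea, together with the false match it avoids, can be sketched in plain Python. The document layout and function names are assumptions, and a real search engine would express <em>matches</em> as an OR query over field pairs rather than as a loop:</p>

```python
def flatten(person):
    """Map each (skill, level) pair to its own numbered pair of fields."""
    doc = {"name": person["name"]}
    for i, (skill, level) in enumerate(person["skills"], start=1):
        doc[f"skill_{i}"] = skill
        doc[f"level_{i}"] = level
    return doc


def matches(doc, skill, level, max_skills):
    """OR over all numbered pairs: (skill_1 AND level_1) OR (skill_2 AND ...)."""
    return any(
        doc.get(f"skill_{i}") == skill and doc.get(f"level_{i}") == level
        for i in range(1, max_skills + 1)
    )


def naive_matches(person, skill, level):
    """Plain Skill and Level fields: the pairing information is lost."""
    skills = [s for s, _ in person["skills"]]
    levels = [lvl for _, lvl in person["skills"]]
    return skill in skills and level in levels


person = {"name": "John", "skills": [("Math", "Excellent"), ("Poetry", "Poor")]}
doc = flatten(person)
```

<p>The naive model reports that John is excellent at poetry, while the numbered fields do not. The cost is visible in <em>matches</em>: the number of OR-ed clauses equals the maximum number of skills per person.</p>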
<p><strong>Applicability</strong>: Search Engines</p>
<h3>(16) Nested Documents Flattening: Proximity Queries</h3>
<p>The problem with nested documents can be solved using another technique that was also described in [4.6]. The idea is to use proximity queries that limit the acceptable distance between words in the document. In the figure below, all skills and levels are indexed in one field, namely, SkillAndLevel, and the query indicates that the words &#8220;Excellent&#8221; and &#8220;Poetry&#8221; should follow one another:</p>
<div id="attachment_364" class="wp-caption aligncenter" style="width: 594px"><a href="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-2.png"><img class="size-full wp-image-364" title="nested-documents-2" alt="" src="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-2.png?w=594"   /></a><p class="wp-caption-text">Nested Document Modeling using Proximity Queries</p></div>
<p>[4.3] describes a success story for this technique used on top of Solr.</p>
<p><strong>Applicability</strong>: Search Engines</p>
<h3>(17) Batch Graph Processing</h3>
<p>Graph databases like neo4j are exceptionally good for exploring the neighborhood of a given node or exploring relationships between two or a few nodes. Nevertheless, global processing of large graphs is not very efficient because general purpose graph databases do not scale well. Distributed graph processing can be done using MapReduce and the Message Passing pattern that was described, for example, in <a title="MapReduce Patterns, Algorithms, and Use Cases" href="http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/">one of my previous articles</a>. This approach makes Key-Value stores, Document databases, and BigTable-style databases suitable for processing large graphs.</p>
<p><strong>Applicability</strong>: Key-Value Stores, Document Databases, BigTable-style Databases</p>
<h2>References</h2>
<p>Finally, I provide a list of useful links related to NoSQL data modeling:</p>
<ol>
<li>Key-Value Stores:
<ol>
<li><a href="http://www.devshed.com/c/a/MySQL/Database-Design-Using-KeyValue-Tables/">http://www.devshed.com/c/a/MySQL/Database-Design-Using-KeyValue-Tables/</a></li>
<li><a href="http://antirez.com/post/Sorting-in-key-value-data-model.html">http://antirez.com/post/Sorting-in-key-value-data-model.html</a></li>
<li><a href="http://stackoverflow.com/questions/3554169/difference-between-document-based-and-key-value-based-databases">http://stackoverflow.com/questions/3554169/difference-between-document-based-and-key-value-based-databases</a></li>
<li><a href="http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html">http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html</a></li>
</ol>
</li>
<li>BigTable-style Databases:
<ol>
<li><a href="http://www.slideshare.net/ebenhewitt/cassandra-datamodel-4985524">http://www.slideshare.net/ebenhewitt/cassandra-datamodel-4985524</a></li>
<li><a href="http://www.slideshare.net/mattdennis/cassandra-data-modeling">http://www.slideshare.net/mattdennis/cassandra-data-modeling</a></li>
<li><a href="http://nosql.mypopescu.com/post/17419074362/cassandra-data-modeling-examples-with-matthew-f-dennis">http://nosql.mypopescu.com/post/17419074362/cassandra-data-modeling-examples-with-matthew-f-dennis</a></li>
<li><a href="http://s-expressions.com/2009/03/08/hbase-on-designing-schemas-for-column-oriented-data-stores/">http://s-expressions.com/2009/03/08/hbase-on-designing-schemas-for-column-oriented-data-stores/</a></li>
<li><a href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable</a></li>
</ol>
</li>
<li>Document Databases:
<ol>
<li><a href="http://www.slideshare.net/mongodb/mongodb-schema-design-richard-kreuters-mongo-berlin-preso">http://www.slideshare.net/mongodb/mongodb-schema-design-richard-kreuters-mongo-berlin-preso</a></li>
<li><a href="http://www.michaelhamrah.com/blog/2011/08/data-modeling-at-scale-mongodb-mongoid-callbacks-and-denormalizing-data-for-efficiency/">http://www.michaelhamrah.com/blog/2011/08/data-modeling-at-scale-mongodb-mongoid-callbacks-and-denormalizing-data-for-efficiency/</a></li>
<li><a href="http://seancribbs.com/tech/2009/09/28/modeling-a-tree-in-a-document-database/">http://seancribbs.com/tech/2009/09/28/modeling-a-tree-in-a-document-database/</a></li>
<li><a href="http://www.mongodb.org/display/DOCS/Schema+Design">http://www.mongodb.org/display/DOCS/Schema+Design</a></li>
<li><a href="http://www.mongodb.org/display/DOCS/Trees+in+MongoDB">http://www.mongodb.org/display/DOCS/Trees+in+MongoDB</a></li>
<li><a href="http://blog.fiesta.cc/post/11319522700/walkthrough-mongodb-data-modeling">http://blog.fiesta.cc/post/11319522700/walkthrough-mongodb-data-modeling</a></li>
</ol>
</li>
<li>Full Text Search Engines:
<ol>
<li><a href="http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene">http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene</a></li>
<li><a href="http://www.lucidimagination.com/devzone/technical-articles/solr-and-rdbms-basics-designing-your-application-best-both">http://www.lucidimagination.com/devzone/technical-articles/solr-and-rdbms-basics-designing-your-application-best-both</a></li>
<li><a href="http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html">http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html</a></li>
<li><a href="http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/">http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/</a></li>
<li><a href="http://blog.mgm-tp.com/2011/03/non-standard-ways-of-using-lucene/">http://blog.mgm-tp.com/2011/03/non-standard-ways-of-using-lucene/</a></li>
<li><a href="http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene">http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene</a></li>
<li><a href="http://mysolr.com/tips/denormalized-data-structure/">http://mysolr.com/tips/denormalized-data-structure/</a></li>
<li><a href="http://sujitpal.blogspot.com/2010/10/denormalizing-maps-with-lucene-payloads.html">http://sujitpal.blogspot.com/2010/10/denormalizing-maps-with-lucene-payloads.html</a></li>
<li><a href="http://java.dzone.com/articles/hibernate-search-mapping-entit">http://java.dzone.com/articles/hibernate-search-mapping-entit</a></li>
</ol>
</li>
<li>Graph Databases:
<ol>
<li><a href="http://docs.neo4j.org/chunked/stable/tutorial-comparing-models.html">http://docs.neo4j.org/chunked/stable/tutorial-comparing-models.html</a></li>
<li><a href="http://blog.neo4j.org/2010/03/modeling-categories-in-graph-database.html">http://blog.neo4j.org/2010/03/modeling-categories-in-graph-database.html</a></li>
<li><a href="http://skillsmatter.com/podcast/nosql/graph-modelling">http://skillsmatter.com/podcast/nosql/graph-modelling</a></li>
<li><a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Schatz_MLG2010.pdf">http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Schatz_MLG2010.pdf</a></li>
</ol>
</li>
<li>Dimensionality Reduction:
<ol>
<li><a href="http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrelational-data-stores">http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrelational-data-stores</a></li>
<li><a href="http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves">http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves</a></li>
<li><a href="http://www.trisis.co.uk/blog/?p=1287">http://www.trisis.co.uk/blog/?p=1287</a></li>
</ol>
</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/feed/</wfw:commentRss>
		<slash:comments>62</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/03/featured.png?w=142" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/03/featured.png?w=142" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/overview2.png" medium="image">
			<media:title type="html">overview</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/soft-schema2.png" medium="image">
			<media:title type="html">soft-schema</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/03/aggregates-joins.png" medium="image">
			<media:title type="html">aggregates-joins</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/atomic-aggregate1.png" medium="image">
			<media:title type="html">atomic-aggregate</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/geohash-traversal1.png" medium="image">
			<media:title type="html">geohash-traversal</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/index-table.png" medium="image">
			<media:title type="html">index-table</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/03/composite-key-index.png" medium="image">
			<media:title type="html">composite-key-index</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/composite-key-collating1.png" medium="image">
			<media:title type="html">composite-key-collating</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/invert-direct1.png" medium="image">
			<media:title type="html">invert-direct</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/tree-aggregation.png" medium="image">
			<media:title type="html">tree-aggregation</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths2.png" medium="image">
			<media:title type="html">materialized-paths</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/materialized-paths-2.png" medium="image">
			<media:title type="html">materialized-paths-2</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/nested-sets.png" medium="image">
			<media:title type="html">nested-sets</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-1.png" medium="image">
			<media:title type="html">nested-documents-1</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-3.png" medium="image">
			<media:title type="html">nested-documents-3</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/nested-documents-2.png" medium="image">
			<media:title type="html">nested-documents-2</media:title>
		</media:content>
	</item>
		<item>
		<title>Tricks with Direct Memory Access in Java</title>
		<link>http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/</link>
		<comments>http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 12:09:36 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[HotSpot JVM]]></category>
		<category><![CDATA[direct memory management]]></category>
		<category><![CDATA[heap]]></category>
		<category><![CDATA[hot spot]]></category>
		<category><![CDATA[hotspot]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jdk7]]></category>
		<category><![CDATA[jvm]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=212</guid>
		<description><![CDATA[Java was initially designed as a safe, managed environment. Nevertheless, the Java HotSpot VM contains a &#8220;backdoor&#8221; that provides a number of low-level operations to manipulate memory and threads directly. This backdoor, sun.misc.Unsafe, is widely used by the JDK itself in packages such as java.nio and java.util.concurrent. It is hard to imagine a Java developer who uses this backdoor in [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Java was initially designed as a safe, managed environment. Nevertheless, the Java HotSpot VM contains a &#8220;backdoor&#8221; that provides a number of low-level operations to manipulate memory and threads directly. This backdoor, <tt>sun.misc.Unsafe</tt>, is widely used by the JDK itself in packages such as <tt>java.nio</tt> and <tt>java.util.concurrent</tt>. It is hard to imagine a Java developer who uses this backdoor in regular development, because the API is extremely dangerous, non-portable, and subject to change. Nevertheless, <tt>Unsafe</tt> provides an easy way to look into HotSpot JVM internals and do some tricks. Sometimes this is simply fun, sometimes it can be used to study VM internals without debugging C++ code, and sometimes it can be leveraged for profiling and development tools.</p>
<h2>Obtaining Unsafe</h2>
<p>The <tt>sun.misc.Unsafe</tt> class is so unsafe that the JDK developers added special checks to restrict access to it. Its constructor is private, and the caller of the factory method <tt>getUnsafe()</tt> must be loaded by the bootstrap class loader (i.e. the caller must itself be a part of the JDK):</p>
<pre class="brush: java; title: ; notranslate">
public final class Unsafe {
    ...
    private Unsafe() {}
    private static final Unsafe theUnsafe = new Unsafe();
    ...
    public static Unsafe getUnsafe() {
       Class cc = sun.reflect.Reflection.getCallerClass(2);
       if (cc.getClassLoader() != null)
           throw new SecurityException(&quot;Unsafe&quot;);
       return theUnsafe;
    }
    ...
}
</pre>
<p>Fortunately, there is a <tt>theUnsafe</tt> field that can be used to retrieve the <tt>Unsafe</tt> instance. We can easily write a helper method to do this via reflection:</p>
<pre class="brush: java; title: ; notranslate">
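// Required imports:
// import java.lang.reflect.Field;
// import sun.misc.Unsafe;

public static Unsafe getUnsafe() {
    try {
        // theUnsafe is a private static singleton, so reflection sidesteps the caller check
        Field f = Unsafe.class.getDeclaredField(&quot;theUnsafe&quot;);
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
        throw new IllegalStateException(&quot;sun.misc.Unsafe is not accessible&quot;, e);
    }
}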
</pre>
<p>In the next sections we will study several tricks that become possible due to the following methods of <tt>Unsafe</tt>:</p>
<ul>
<li><strong>long getAddress(long address)</strong> and <strong>void putAddress(long address, long x)</strong>, which read and write native pointers (dwords on a 32-bit JVM) directly in memory.</li>
<li><strong>int getInt(Object o, long offset)</strong>, <strong>void putInt(Object o, long offset, int x)</strong>, and other similar methods, which read and write data directly in the C structure that represents a Java object.</li>
<li><strong>long allocateMemory(long bytes)</strong>, which can be considered a wrapper for C&#8217;s malloc().</li>
</ul>
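<p>As a minimal sketch of how these methods fit together, the snippet below allocates four bytes outside the heap, writes an <tt>int</tt> through the absolute address, reads it back, and frees the block. It uses the address-only overloads <tt>putInt(long, int)</tt> and <tt>getInt(long)</tt>. The class name <tt>OffHeapIntSketch</tt> and the method <tt>roundTrip</tt> are illustrative, and the sketch assumes a JVM that still exposes <tt>sun.misc.Unsafe</tt>:</p>
<pre class="brush: java; title: ; notranslate">
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class OffHeapIntSketch {
    // Same reflection trick as getUnsafe() above, repeated so the sketch is self-contained.
    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField(&quot;theUnsafe&quot;);
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stores an int in manually allocated memory, reads it back, and frees the block.
    static int roundTrip(int value) {
        Unsafe unsafe = unsafe();
        long address = unsafe.allocateMemory(4L);   // 4 bytes outside the Java heap
        try {
            unsafe.putInt(address, value);          // write through the absolute address
            return unsafe.getInt(address);          // read the same 4 bytes back
        } finally {
            unsafe.freeMemory(address);             // the GC never releases this memory
        }
    }
}
</pre>
<p>Calling <tt>OffHeapIntSketch.roundTrip(777)</tt> should simply return the value that was written; the point is that no Java object ever holds it.</p>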
<div>
<h2>sizeof() Function</h2>
<p>The first trick is a C-like sizeof() function, i.e. a function that returns the shallow size of an object in bytes. Inspecting the JVM sources of JDK6 and JDK7, in particular <a href="http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/9b0ca45cd756/src/share/vm/oops/oop.hpp" rel="nofollow">src/share/vm/oops/oop.hpp</a> and <a href="http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/9b0ca45cd756/src/share/vm/oops/klass.hpp" rel="nofollow">src/share/vm/oops/klass.hpp</a>, and reading the comments in the code, we can see that the size of a class instance is stored in <tt>_layout_helper</tt>, which is the fourth field in the C structure that represents a Java class. Similarly, <a href="http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/9b0ca45cd756/src/share/vm/oops/oop.hpp" rel="nofollow">/src/share/vm/oops/oop.hpp</a> shows that each instance (i.e. object) stores a pointer to its class structure in its second field. For a 32-bit JVM this means that we can first read the class structure address from bytes 4-8 of the object structure and then skip 3&#215;4=12 bytes inside the class structure to reach the <tt>_layout_helper</tt> field, which holds the instance size in bytes. These structures are shown in the picture below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/02/oops-jdk7.png"><img class="aligncenter size-full wp-image-213" title="oops-jdk7" src="http://highlyscalable.files.wordpress.com/2012/02/oops-jdk7.png?w=594" alt=""   /></a></p>
<p>Thus, we can implement sizeof() as follows:</p>
</div>
<pre class="brush: java; title: ; notranslate">
public static long sizeOf(Object object) {
   Unsafe unsafe = getUnsafe();
   return unsafe.getAddress( normalize( unsafe.getInt(object, 4L) ) + 12L );
}

public static long normalize(int value) {
   if(value &gt;= 0) return value;
   return (~0L &gt;&gt;&gt; 32) &amp; value;
}
</pre>
<p>The normalize() function is needed because <tt>getInt</tt> returns a signed 32-bit value, so addresses between 2^31 and 2^32 come back as negative integers (two&#8217;s complement form) and have to be converted to an unsigned long. Let&#8217;s test it on a 32-bit JVM (JDK 6 or 7):</p>
<pre class="brush: java; title: ; notranslate">
// sizeOf(new MyStructure()) gives the following results:

class MyStructure { } // 8: 4 (start marker) + 4 (pointer to class)
class MyStructure { int x; } // 16: 4 (start marker) + 4 (pointer to class) + 4 (int) + 4 padding bytes to align the structure to a 64-bit boundary
class MyStructure { int x; int y; } // 16: 4 (start marker) + 4 (pointer to class) + 2*4
</pre>
<p>This function does not work for array objects because the <tt>_layout_helper</tt> field has a different meaning in that case, although it is still possible to generalize sizeOf() to support arrays.</p>
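<p>One possible way to cover arrays, sketched below, does not decode <tt>_layout_helper</tt> at all. It is a different technique that relies on two other <tt>Unsafe</tt> methods, <tt>arrayBaseOffset</tt> and <tt>arrayIndexScale</tt>. The rounding to 8 bytes is an assumption about HotSpot&#8217;s default object alignment, and the class and method names are illustrative:</p>
<pre class="brush: java; title: ; notranslate">
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class ArraySizeSketch {
    private static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField(&quot;theUnsafe&quot;);
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Estimates the shallow size of an array: header + elements, rounded up to the alignment.
    static long sizeOfArray(Object array) {
        Unsafe unsafe = unsafe();
        long header = unsafe.arrayBaseOffset(array.getClass()); // bytes before the first element
        long scale = unsafe.arrayIndexScale(array.getClass());  // bytes per element
        long raw = header + scale * Array.getLength(array);
        return (raw + 7) / 8 * 8;                                // assumed 8-byte object alignment
    }
}
</pre>
<p>The exact result depends on the JVM (32-bit or 64-bit, compressed pointers or not), so the value should be treated as an estimate rather than a guarantee.</p>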
<h2>Direct Memory Management</h2>
<p><tt>Unsafe</tt> makes it possible to allocate and deallocate memory explicitly via the <tt>allocateMemory</tt> and <tt>freeMemory</tt> methods. Allocated memory is not under GC control and is not limited by the maximum JVM heap size. In general, such functionality is safely available via NIO&#8217;s off-heap buffers. The interesting thing, however, is that it is possible to map a standard Java reference to off-heap memory:</p>
<pre class="brush: java; title: ; notranslate">
MyStructure structure = new MyStructure(); // create a test object
structure.x = 777;

long size = sizeOf(structure);
long offheapPointer = getUnsafe().allocateMemory(size);
getUnsafe().copyMemory(
                structure,      // source object
                0,              // source offset is zero - copy an entire object
                null,           // destination is specified by absolute address, so destination object is null
                offheapPointer, // destination address
                size
); // test object was copied to off-heap

Pointer p = new Pointer(); // Pointer is just a handler that stores address of some object
long pointerOffset = getUnsafe().objectFieldOffset(Pointer.class.getDeclaredField(&quot;pointer&quot;));
getUnsafe().putLong(p, pointerOffset, offheapPointer); // set pointer to off-heap copy of the test object

structure.x = 222; // rewrite x value in the original object
System.out.println(  ((MyStructure)p.pointer).x  ); // prints 777

....

class Pointer {
    Object pointer;
}
</pre>
<p>So it is technically possible to allocate and deallocate real objects manually, not only byte buffers. Of course, it&#8217;s an open question what may happen to the GC after such tricks.</p>
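<p>For comparison, the safe route mentioned at the start of this section, an NIO off-heap buffer, might look like the following minimal sketch. It only stores primitive values, not objects, and the class and method names are illustrative:</p>
<pre class="brush: java; title: ; notranslate">
import java.nio.ByteBuffer;

final class DirectBufferSketch {
    // Stores a value in off-heap memory and reads it back through the NIO API.
    static int roundTrip(int value) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(Integer.BYTES); // allocated outside the heap
        buffer.putInt(0, value);   // absolute write at byte index 0
        return buffer.getInt(0);   // absolute read of the same 4 bytes
    }
}
</pre>
<p>Unlike the <tt>Unsafe</tt> version, the buffer checks its bounds and its memory is released when the buffer object is collected.</p>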
<h2>Inheritance from Final Class and void*</h2>
<p>Imagine a situation where one has a method that takes a string as an argument, but it is necessary to pass some extra payload. There are at least two standard ways to do this in Java: put the payload into a thread local or use a static field. With <tt>Unsafe</tt>, two more possibilities appear: pass the payload address as a string, or make the payload class inherit from the String class. The first approach is pretty close to what we saw in the previous section: one just needs to obtain the payload address using Pointer and create a new Pointer to the payload inside the called method. In other words, any argument that can carry an address can be used as an analog of void* in C. In order to explore the second approach, we start with the following code, which compiles but obviously produces a ClassCastException at run time:</p>
<pre class="brush: java; title: ; notranslate">
Carrier carrier = new Carrier();
carrier.secret = 777;

String message = (String)(Object)carrier; // ClassCastException
handler( message );

...

void handler(String message) {
   System.out.println( ((Carrier)(Object)message).secret );
}

...

class Carrier {
   int secret;
}
</pre>
<p>To make it work, one needs to modify the Carrier class to simulate inheritance from String. A list of superclasses is stored in the Carrier class structure starting from position 28, as shown in the figure. The pointer to the Object class goes first and the pointer to Carrier itself goes after it (at position 32), since Carrier inherits from Object directly. In principle, it is enough to add the following code before the line that casts Carrier to String:</p>
<pre class="brush: java; title: ; notranslate">
long carrierClassAddress = normalize( unsafe.getInt(carrier, 4L) );
long stringClassAddress = normalize( unsafe.getInt(&quot;&quot;, 4L) );
unsafe.putAddress(carrierClassAddress + 32, stringClassAddress); // insert pointer to String class to the list of Carrier's superclasses
</pre>
<p>Now the cast works fine. Nevertheless, this transformation is not correct and violates VM contracts. A more careful approach would include additional steps:</p>
<ol>
<li>Position 32 in Carrier class actually contains a pointer to Carrier class itself, so this pointer should be shifted to position 36, not simply overwritten by the pointer to the String class.</li>
<li>Since Carrier is now inherited from String, final markers in String class should be removed.</li>
</ol>
<div>
<h2>Conclusion</h2>
<p><tt>sun.misc.Unsafe</tt> provides almost unlimited capabilities for exploring and modifying the VM&#8217;s runtime data structures. Although these capabilities are almost inapplicable in Java development itself, Unsafe is a great tool for anyone who wants to study the HotSpot VM without debugging C++ code or needs to create ad hoc profiling instruments.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/02/featured1.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/featured1.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/oops-jdk7.png" medium="image">
			<media:title type="html">oops-jdk7</media:title>
		</media:content>
	</item>
		<item>
		<title>MapReduce Patterns, Algorithms, and Use Cases</title>
		<link>http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/</link>
		<comments>http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/#comments</comments>
		<pubDate>Wed, 01 Feb 2012 06:52:05 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[pattern]]></category>
		<category><![CDATA[use case]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=120</guid>
		<description><![CDATA[In this article I have collected a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In this article I have collected a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.</p>
<div id="attachment_287" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/02/map-reduce.png"><img class="size-full wp-image-287" title="map-reduce" src="http://highlyscalable.files.wordpress.com/2012/02/map-reduce.png?w=594&#038;h=609" alt="" width="594" height="609" /></a><p class="wp-caption-text">MapReduce Framework</p></div>
<h1>Basic MapReduce Patterns</h1>
<h2>Counting and Summing</h2>
<p><strong>Problem Statement:</strong> There are a number of documents, where each document is a set of terms. It is required to calculate the total number of occurrences of each term in all documents. Alternatively, it can be an arbitrary function of the terms. For instance, there is a log file where each record contains a response time and it is required to calculate the average response time.</p>
<p><strong>Solution:</strong></p>
<p>Let&#8217;s start with something really simple. The code snippet below shows a Mapper that simply emits &#8220;1&#8221; for each term it processes and a Reducer that goes through the lists of ones and sums them up:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + c
      Emit(term t, count sum)
</pre>
<p>The obvious disadvantage of this approach is the large number of dummy counters emitted by the Mapper. The Mapper can reduce the number of counters by summing them for each document:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
          H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
</pre>
<p>In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage Combiners:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + c
      Emit(term t, count sum)
</pre>
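<p>The same structure carries over to the average response time mentioned in the problem statement, with one caveat: an average of partial averages is generally not the overall average, so a Combiner should pass on partial sums and counts and leave the division to the Reducer. The sketch below is plain Java rather than Hadoop API code, and <tt>ResponseTimeStats</tt> is a hypothetical name:</p>
<pre class="brush: java; title: ; notranslate">
final class ResponseTimeStats {
    final long sum;    // total of all response times seen so far
    final long count;  // number of records behind that total

    ResponseTimeStats(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Mapper side: one log record becomes a partial aggregate of size 1.
    static ResponseTimeStats of(long responseTime) {
        return new ResponseTimeStats(responseTime, 1);
    }

    // Combiner and Reducer side: merging is associative, so any grouping gives the same total.
    ResponseTimeStats merge(ResponseTimeStats other) {
        return new ResponseTimeStats(sum + other.sum, count + other.count);
    }

    // Final step, performed in the Reducer only.
    double average() {
        return (double) sum / count;
    }
}
</pre>
<p>For example, if one node combines the response times 10 and 20 and another node contributes 60, merging the partial aggregates gives 90/3 = 30, whereas averaging the two partial averages (15 and 60) would give 37.5.</p>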
<h3>Applications:</h3>
<p>Log Analysis, Data Querying</p>
<h2>Collating</h2>
<p><strong>Problem Statement:</strong> There is a set of items and a function of a single item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.</p>
<p><strong>Solution:</strong></p>
<p>The solution is straightforward. The Mapper computes the given function for each item and emits the value of the function as a key and the item itself as a value. The Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, items are occurrences of terms (words) in documents and the function returns the term, so the Reducer receives the IDs of all documents in which the term was found.</p>
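<p>For the inverted index case, the grouping can be sketched in a single process as follows. This is not Hadoop code: the nested loops stand in for the Mapper, the map of sets stands in for the shuffle, and each resulting entry is what one Reducer call would receive. The names are illustrative and documents are assumed to be non-empty, whitespace-separated text:</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // docs maps a document ID to its text; the result maps each term to the IDs containing it.
    static Map&lt;String, SortedSet&lt;Integer&gt;&gt; build(Map&lt;Integer, String&gt; docs) {
        Map&lt;String, SortedSet&lt;Integer&gt;&gt; index = new TreeMap&lt;&gt;();
        for (Map.Entry&lt;Integer, String&gt; doc : docs.entrySet()) {
            // Mapper step: emit (term, docId) for every term of the document.
            for (String term : doc.getValue().trim().split(&quot;\\s+&quot;)) {
                // Shuffle step: collect the emitted document IDs under their term.
                index.computeIfAbsent(term, t -&gt; new TreeSet&lt;&gt;()).add(doc.getKey());
            }
        }
        return index;
    }
}
</pre>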
<h3>Applications:</h3>
<p>Inverted Indexes, ETL</p>
<h2>Filtering (&#8220;Grepping&#8221;), Parsing, and Validation</h2>
<p><strong>Problem Statement:</strong> There is a set of records and it is required to collect all records that meet some condition or to transform each record (independently of other records) into another representation. The latter case includes such tasks as text parsing, value extraction, and conversion from one format to another.</p>
<p><strong>Solution:</strong>  The solution is absolutely straightforward &#8211; the Mapper takes records one by one and emits accepted items or their transformed versions.</p>
<h3>Applications:</h3>
<p>Log Analysis, Data Querying, ETL, Data Validation</p>
<h2>Distributed Task Execution</h2>
<p><strong>Problem Statement:</strong> There is a large computational problem that can be divided into multiple parts, and the results from all parts can be combined to obtain a final result.</p>
<p><strong>Solution:</strong>  The problem description is split into a set of specifications, and the specifications are stored as input data for the Mappers. Each Mapper takes a specification, performs the corresponding computations, and emits results. The Reducer combines all emitted parts into the final result.</p>
<h3>Case Study: Simulation of a Digital Communication System</h3>
<p>There is a software simulator of a digital communication system like WiMAX that passes some volume of random data through the system model and computes the error probability of throughput. Each Mapper runs the simulation for a specified amount of data, which is 1/Nth of the required sample, and emits the error rate. The Reducer computes the average error rate.</p>
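<p>A single-process sketch of this split is shown below. The channel model is an assumption made purely for illustration (a toy channel that flips each bit with a fixed probability, nothing WiMAX-specific), and all names are hypothetical. Each call to <tt>simulateShare</tt> plays the role of one Mapper and <tt>averageErrorRate</tt> plays the role of the Reducer:</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Random;

final class ChannelSimulationSketch {
    // Mapper: sends the given number of random bits through the toy channel
    // and returns the observed error rate for this share.
    static double simulateShare(long seed, int bits, double flipProbability) {
        Random random = new Random(seed);
        int errors = 0;
        for (int i = 0; i &lt; bits; i++) {
            boolean sent = random.nextBoolean();
            boolean flipped = random.nextDouble() &lt; flipProbability;
            boolean received = flipped ? !sent : sent;
            if (received != sent) errors++;
        }
        return (double) errors / bits;
    }

    // Reducer: averages the error rates of equally sized shares.
    static double averageErrorRate(double[] shareRates) {
        double sum = 0;
        for (double rate : shareRates) sum += rate;
        return sum / shareRates.length;
    }

    // Driver: splits totalBits into n equal shares, one per simulated Mapper.
    static double run(int totalBits, int n, double flipProbability) {
        double[] rates = new double[n];
        for (int i = 0; i &lt; n; i++) {
            rates[i] = simulateShare(i, totalBits / n, flipProbability);
        }
        return averageErrorRate(rates);
    }
}
</pre>
<p>A plain average is only correct here because every share has the same size; with unequal shares the Reducer would need error counts and bit counts instead of rates.</p>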
<h3>Applications:</h3>
<p>Physical and Engineering Simulations, Numerical Analysis, Performance Testing</p>
<h2>Sorting</h2>
<p><strong>Problem Statement:</strong> There is a set of records and it is required to sort these records by some rule or process these records in a certain order.</p>
<p><strong>Solution:</strong> Simple sorting is absolutely straightforward &#8211; Mappers just emit all items as values associated with sorting keys that are assembled as a function of the items. Nevertheless, in practice sorting is often used in quite tricky ways, which is why it is said to be the heart of MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping.</p>
<p>Sorting in MapReduce is originally intended for sorting of the emitted key-value pairs by key, but there exist techniques that leverage Hadoop implementation specifics to achieve sorting by values. See this <a href="http://www.riccomini.name/Topics/DistributedComputing/Hadoop/SortByValue/">blog</a> for more details.</p>
<p>It is worth noting that if MapReduce is used for sorting of the original (not intermediate) data, it is often a good idea to continuously maintain data in sorted state using BigTable concepts. In other words, it can be more efficient to sort data once during insertion than sort them for each MapReduce query.</p>
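<p>The effect of a composite key can be illustrated outside Hadoop with an ordinary comparator. In the sketch below each record is assumed to be a two-element array {group key, value}; sorting by the pair keeps each group contiguous and orders the values inside it, which is what secondary sorting provides to a Reducer. The record layout and names are illustrative:</p>
<pre class="brush: java; title: ; notranslate">
import java.util.Arrays;
import java.util.Comparator;

final class CompositeKeySortSketch {
    // Sorts records by the composite key (record[0], record[1]) without modifying the input array.
    static int[][] sortByCompositeKey(int[][] records) {
        int[][] sorted = records.clone();
        Arrays.sort(sorted, Comparator
                .comparingInt((int[] r) -&gt; r[0])     // primary part: the grouping key
                .thenComparingInt(r -&gt; r[1]));       // secondary part: order inside a group
        return sorted;
    }
}
</pre>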
<h3>Applications:</h3>
<p>ETL, Data Analysis</p>
<h1>Not-So-Basic MapReduce Patterns</h1>
<h2>Iterative Message Passing (Graph Processing)</h2>
<p><strong>Problem Statement:</strong> There is a network of entities and relationships between them. It is required to calculate the state of each entity on the basis of the properties of the other entities in its neighborhood. This state can represent a distance to other nodes, an indication that there is a neighbor with certain properties, a characteristic of neighborhood density, and so on.</p>
<p><strong>Solution:</strong> A network is stored as a set of nodes, and each node contains a list of adjacent node IDs. Conceptually, MapReduce jobs are performed iteratively, and at each iteration each node sends messages to its neighbors. Each neighbor updates its state on the basis of the received messages. Iterations are terminated by some condition, such as a fixed maximum number of iterations (say, the network diameter) or negligible changes in states between two consecutive iterations. From the technical point of view, the Mapper emits messages for each node using the ID of the adjacent node as a key. As a result, all messages are grouped by the receiving node, and the Reducer is able to recompute the state and rewrite the node with the new state. This algorithm is shown in the listing below:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(id n, object N)
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         Emit(id m, message getMessage(N))

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      messages = []
      for all s in [s1, s2,...] do
          if IsObject(s) then
             M = s
          else               // s is a message
             messages.add(s)
      M.State = calculateState(M.State, messages)
      Emit(id m, item M)
</pre>
<p>It should be emphasized that the state of one node rapidly propagates across the whole network if the network is not too sparse, because all nodes that were &#8220;infected&#8221; by this state start to &#8220;infect&#8221; all their neighbors. This process is illustrated in the figure below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/01/graph-propagation-3.png"><img class="aligncenter size-full wp-image-161" title="Iterative Message Passing" src="http://highlyscalable.files.wordpress.com/2012/01/graph-propagation-3.png?w=594" alt="Iterative Message Passing"   /></a></p>
<h3>Case Study: Availability Propagation Through The Tree of Categories</h3>
<p><strong>Problem Statement:</strong> This problem is inspired by a real-life eCommerce task. There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans). An end-of-line category is either available (contains products) or not. A higher-level category is available if there is at least one available end-of-line category in its subtree. The goal is to calculate availabilities for all categories if the availabilities of end-of-line categories are known.</p>
<p><strong>Solution:</strong> This problem can be solved using the framework that was described in the previous section, assuming that the outgoing relation of each category points to its parent so that availability flows up the tree. We define the getMessage and calculateState methods as follows:</p>
<pre class="brush: cpp; title: ; notranslate">
class N
   State in {True = 2, False = 1, null = 0}, initialized 1 or 2 for end-of-line categories, 0 otherwise

method getMessage(object N)
   return N.State

method calculateState(state s, data [d1, d2,...])
   return max( [s, d1, d2,...] )   // include the current state s so that a category without messages keeps its state
</pre>
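<p>A minimal runnable sketch of this case study is shown below. The category names, the <em>parent</em> map, and the fixed number of rounds are assumptions chosen for illustration; the state encoding (2, 1, 0) follows the definition above.</p>
<pre class="brush: python; title: ; notranslate">
TRUE, FALSE, UNKNOWN = 2, 1, 0

# Each category points to its parent; only end-of-line categories start with a known state.
parent = {"Men": None, "Men Jeans": "Men", "Men Blue Jeans": "Men Jeans",
          "Men Black Jeans": "Men Jeans", "Kids": None, "Kids Toys": "Kids"}
state = {"Men": UNKNOWN, "Men Jeans": UNKNOWN, "Men Blue Jeans": FALSE,
         "Men Black Jeans": TRUE, "Kids": UNKNOWN, "Kids Toys": FALSE}

def iterate(state):
    # Map: every category sends its state to its parent.
    inbox = {category: [] for category in state}
    for category, value in state.items():
        if parent[category] is not None:
            inbox[parent[category]].append(value)
    # Reduce: the new state is the maximum of the old state and all messages.
    return {category: max([value] + inbox[category]) for category, value in state.items()}

# The depth of the tree bounds the number of rounds that are needed.
for _ in range(3):
    state = iterate(state)
print(state)
</pre>
<p>Here Men and Men Jeans become available because of Men Black Jeans, while Kids ends up unavailable.</p>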
<h3>Case Study: Breadth-First Search</h3>
<p><strong>Problem Statement:</strong> There is a graph and it is required to calculate distance (a number of hops) from one source node to all other nodes in the graph.</p>
<p><strong>Solution:</strong> The source node starts with distance 0. At each iteration every node sends its distance incremented by 1 to all its neighbors, and each neighbor keeps the minimum of its current distance and the received values:</p>
<pre class="brush: cpp; title: ; notranslate">
class N
   State is distance, initialized 0 for source node, INFINITY for all other nodes

method getMessage(N)
   return N.State + 1

method calculateState(state s, data [d1, d2,...])
   return min( [s, d1, d2,...] )
</pre>
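<p>The same idea as a small runnable sketch. The graph, the node names, and the convergence loop are assumptions for illustration; unreachable nodes simply keep an infinite distance.</p>
<pre class="brush: python; title: ; notranslate">
import math

# Adjacency lists of a small directed graph; "s" is the source, "x" is unreachable.
edges = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": [], "x": []}
distance = {node: math.inf for node in edges}
distance["s"] = 0

def iterate(distance):
    # Map: each node sends (its distance + 1) to every neighbor.
    inbox = {node: [] for node in edges}
    for node, neighbors in edges.items():
        for neighbor in neighbors:
            inbox[neighbor].append(distance[node] + 1)
    # Reduce: keep the smallest value seen so far, including the current state.
    return {node: min([distance[node]] + inbox[node]) for node in edges}

while True:
    updated = iterate(distance)
    if updated == distance:   # no state changed, so the search has converged
        break
    distance = updated
print(distance)
</pre>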
<h3>Case Study: PageRank and Mapper-Side Data Aggregation</h3>
<p>This algorithm was suggested by Google to calculate the relevance of a web page as a function of the authoritativeness (PageRank) of the pages that link to it. The real algorithm is quite complex, but at its core it is just a propagation of weights between nodes, where each node calculates its weight as the sum of the incoming weights and each incoming weight is the sender&#8217;s PageRank divided by its number of outgoing links:</p>
<pre class="brush: cpp; title: ; notranslate">
class N
    State is PageRank

method getMessage(object N)
    return N.State / N.OutgoingRelations.size()

method calculateState(state s, data [d1, d2,...])
    return ( sum([d1, d2,...]) )
</pre>
<p>It is worth mentioning that the schema we use is too generic and doesn&#8217;t take advantage of the fact that the state is a numerical value. In most practical cases, this fact allows us to aggregate values on the Mapper side. This optimization is illustrated in the code snippet below (for the PageRank algorithm):</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Initialize
      H = new AssociativeArray
   method Map(id n, object N)
      p = N.PageRank  / N.OutgoingRelations.size()
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         H{m} = H{m} + p
   method Close
      for all id n in H do
         Emit(id n, value H{n})

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      p = 0
      for all s in [s1, s2,...] do
          if IsObject(s) then
             M = s
          else
             p = p + s
      M.PageRank = p
      Emit(id m, item M)
</pre>
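<p>The mapper-side aggregation can be tried out with the following sketch of a single iteration. It is a simplification under stated assumptions: the function names and the dictionary layout of a node are invented for this example, every node is assumed to have at least one outgoing link, and the damping factor of the real PageRank is left out, as in the pseudocode above.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

def map_partition(partition):
    """partition: list of (node_id, node) pairs handled by one mapper."""
    partial = defaultdict(float)          # plays the role of H in the pseudocode
    output = []
    for node_id, node in partition:
        output.append((node_id, node))    # pass the node structure through
        share = node["rank"] / len(node["out"])
        for target in node["out"]:
            partial[target] += share      # aggregate locally instead of emitting
    output.extend(partial.items())        # Close: one value per target node
    return output

def reduce_all(emitted):
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    result = {}
    for node_id, values in groups.items():
        node = next(v for v in values if isinstance(v, dict))
        rank = sum(v for v in values if not isinstance(v, dict))
        result[node_id] = {"rank": rank, "out": node["out"]}
    return result

graph = {"a": {"rank": 1.0, "out": ["b", "c"]},
         "b": {"rank": 1.0, "out": ["c"]},
         "c": {"rank": 1.0, "out": ["a"]}}
# Two mappers, each processing its own partition of the nodes.
partitions = [[("a", graph["a"]), ("b", graph["b"])], [("c", graph["c"])]]
emitted = [pair for partition in partitions for pair in map_partition(partition)]
ranks = {node_id: node["rank"] for node_id, node in reduce_all(emitted).items()}
print(ranks)
</pre>
<p>The first mapper emits a single aggregated value for node c instead of two separate messages, which is exactly the saving this optimization aims at.</p>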
<h3>Applications:</h3>
<p>Graph Analysis, Web Indexing</p>
<h2>Distinct Values (Unique Items Counting)</h2>
<p><strong>Problem Statement:</strong> There is a set of records that contain fields F and G. Count the total number of unique values of field F for each subset of records that have the same G (grouped by G).</p>
<p>The problem can be generalized slightly and formulated in terms of faceted search:</p>
<p><strong>Problem Statement:</strong> There is a set of records. Each record has a field F and an arbitrary number of category labels G = {G1, G2, &#8230;}. For each value of any label, count the total number of unique values of field F among the records that carry this label. Example:</p>
<pre class="brush: cpp; title: ; notranslate">
Record 1: F=1, G={a, b}
Record 2: F=2, G={a, d, e}
Record 3: F=1, G={b}
Record 4: F=3, G={a, b}

Result:
a -&gt; 3   // F=1, F=2, F=3
b -&gt; 2   // F=1, F=3
d -&gt; 1   // F=2
e -&gt; 1   // F=2
</pre>
<p><strong>Solution I:</strong></p>
<p>The first approach is to solve the problem in two phases. In the first phase, Mapper emits dummy counters for each pair of F and G, and Reducer collapses all occurrences of each such pair into a single record. The main goal of this phase is to guarantee uniqueness of F values within each G. In the second phase, pairs are grouped by G and the total number of items in each group is calculated.</p>
<p>Phase I:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, record [value f, categories [g1, g2,...]])
      for all category g in [g1, g2,...]
         Emit(record [g, f], count 1)

class Reducer
   method Reduce(record [g, f], counts [n1, n2, ...])
      Emit(record [g, f], null )
</pre>
<p>Phase II:</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(record [g, f], null)
      Emit(value g, count 1)

class Reducer
   method Reduce(value g, counts [n1, n2,...])
      Emit(value g, sum( [n1, n2,...] ) )
</pre>
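<p>The two phases can be checked against the example above with a tiny in-memory stand-in for a MapReduce job. The helper <em>run_job</em> and the function names are assumptions for this sketch; the records are the ones from the problem statement.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

records = [(1, ["a", "b"]), (2, ["a", "d", "e"]), (1, ["b"]), (3, ["a", "b"])]

def run_job(inputs, mapper, reducer):
    """In-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return [pair for key, values in groups.items() for pair in reducer(key, values)]

# Phase I: collapse duplicates of each (g, f) pair.
def phase1_map(record):
    f, categories = record
    return [((g, f), 1) for g in categories]

def phase1_reduce(pair, counts):
    return [(pair, None)]

# Phase II: count the surviving pairs per category g.
def phase2_map(item):
    (g, f), _ = item
    return [(g, 1)]

def phase2_reduce(g, counts):
    return [(g, sum(counts))]

unique_pairs = run_job(records, phase1_map, phase1_reduce)
distinct_counts = dict(run_job(unique_pairs, phase2_map, phase2_reduce))
print(distinct_counts)
</pre>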
<p><strong>Solution II:</strong></p>
<p>The second solution requires only one MapReduce job, but it is not really scalable and its applicability is limited. The algorithm is simple &#8211; Mapper emits values and categories, Reducer excludes duplicates from the list of categories for each value and increments counters for each category. The final step is to sum all counters emitted by the Reducers. This approach is applicable if the number of records with the same f value is not very high and the total number of categories is also limited. For instance, this approach is applicable to processing of web logs and classification of users &#8211; the total number of users is high, but the number of events for one user is limited, as is the number of categories to classify by. It is worth noting that Combiners can be used in this schema to exclude duplicates from category lists before the data is transmitted to Reducer.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, record [value f, categories [g1, g2,...]])
      for all category g in [g1, g2,...]
          Emit(value f, category g)

class Reducer
   method Initialize
      H = new AssociativeArray : category -&gt; count
   method Reduce(value f, categories [g1, g2,...])
      [g1', g2',..] = ExcludeDuplicates( [g1, g2,..] )
      for all category g in [g1', g2',...]
         H{g} = H{g} + 1
   method Close
      for all category g in H do
         Emit(category g, count H{g})
</pre>
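<p>A single-process sketch of this one-pass solution on the same example records is given below. The variable names are assumptions; with several Reducers, the per-Reducer counters would additionally have to be summed, as described above.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter, defaultdict

records = [(1, ["a", "b"]), (2, ["a", "d", "e"]), (1, ["b"]), (3, ["a", "b"])]

# Map + shuffle: group category labels by value f.
categories_by_value = defaultdict(list)
for f, categories in records:
    for g in categories:
        categories_by_value[f].append(g)

# Reduce: for each f, drop duplicate categories and bump one counter per category.
counts = Counter()
for f, categories in categories_by_value.items():
    for g in set(categories):
        counts[g] += 1
print(dict(counts))
</pre>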
<h3>Applications:</h3>
<p>Log Analysis, Unique Users Counting</p>
<h2>Cross-Correlation</h2>
<p><strong>Problem Statement: </strong>There is a set of tuples of items. For each possible pair of items calculate the number of tuples where these items co-occur. If the total number of items is N then N*N values should be reported.</p>
<p>This problem appears in text analysis (say, items are words and tuples are sentences) and market analysis (customers who buy <em>this</em> tend to also buy <em>that</em>). If N*N is quite small and such a matrix can fit in the memory of a single machine, then implementation is straightforward.</p>
<p><strong>Pairs Approach</strong></p>
<p>The first approach is to emit all pairs and dummy counters from Mappers and sum these counters on Reducer. The shortcomings are:</p>
<ul>
<li>The benefit from combiners is limited, as it is likely that all pairs are distinct</li>
<li>There is no in-memory accumulation</li>
</ul>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, items [i1, i2,...] )
      for all item i in [i1, i2,...]
         for all item j in [i1, i2,...]
            Emit(pair [i j], count 1)

class Reducer
   method Reduce(pair [i j], counts [c1, c2,...])
      s = sum([c1, c2,...])
      Emit(pair[i j], count s)
</pre>
<p><strong>Stripes Approach</strong></p>
<p>The second approach is to group data by the first item in pair and maintain an associative array (&#8220;stripe&#8221;) where counters for all adjacent items are accumulated. Reducer receives all stripes for leading item i, merges them, and emits the same result as in the Pairs approach.</p>
<ul>
<li>Generates fewer intermediate keys. Hence the framework has less sorting to do.</li>
<li>Greatly benefits from combiners.</li>
<li>Performs in-memory accumulation. This can lead to memory problems if not properly implemented.</li>
<li>More complex implementation.</li>
<li>In general, &#8220;stripes&#8221; is faster than &#8220;pairs&#8221;</li>
</ul>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, items [i1, i2,...] )
      for all item i in [i1, i2,...]
         H = new AssociativeArray : item -&gt; counter
         for all item j in [i1, i2,...]
            H{j} = H{j} + 1
         Emit(item i, stripe H)

class Reducer
   method Reduce(item i, stripes [H1, H2,...])
      H = new AssociativeArray : item -&gt; counter
      H = merge-sum( [H1, H2,...] )
      for all item j in H.keys()
         Emit(pair [i j], H{j})
</pre>
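<p>To see that both approaches produce the same counts, here is a compact in-memory sketch. The function names and the sample tuples are assumptions, and the shuffle phase is collapsed into ordinary dictionaries.</p>
<pre class="brush: python; title: ; notranslate">
from collections import Counter, defaultdict

tuples = [["a", "b"], ["a", "c"], ["a", "b", "c"]]

def pairs(tuples):
    counts = Counter()
    for items in tuples:
        for i in items:
            for j in items:
                counts[(i, j)] += 1          # Emit(pair [i j], 1), summed by the reducer
    return dict(counts)

def stripes(tuples):
    merged = defaultdict(Counter)
    for items in tuples:
        for i in items:
            stripe = Counter(items)          # one stripe per leading item i
            merged[i].update(stripe)         # reducer-side merge-sum
    return {(i, j): n for i, stripe in merged.items() for j, n in stripe.items()}

pair_counts = pairs(tuples)
stripe_counts = stripes(tuples)
print(pair_counts == stripe_counts)
print(stripe_counts[("a", "b")])
</pre>
<p>As in the pseudocode, an item is also paired with itself, so the diagonal entries hold the number of tuples that contain the item.</p>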
<h3>Applications:</h3>
<p>Text Analysis, Market Analysis</p>
<h3>References:</h3>
<ol>
<li>Lin J., Dyer C., Hirst G. <a href="http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421/">Data-Intensive Text Processing with MapReduce</a></li>
</ol>
<h1>Relational MapReduce Patterns</h1>
<p>In this section we go through the main relational operators and discuss how these operators can be implemented in MapReduce terms.</p>
<h2>Selection</h2>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(rowkey key, tuple t)
      if t satisfies the predicate
         Emit(tuple t, null)
</pre>
<h2>Projection</h2>
<p>Projection is only slightly more complex than selection: a Reducer is needed in this case to eliminate possible duplicates.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(rowkey key, tuple t)
      tuple g = project(t)  // extract required fields to tuple g
      Emit(tuple g, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of nulls
      Emit(tuple t, null)
</pre>
<h2>Union</h2>
<p>Mappers are fed with all records of the two sets to be united. Reducer is used to eliminate duplicates.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      Emit(tuple t, null)
</pre>
<h2>Intersection</h2>
<p>Mappers are fed with all records of the two sets to be intersected. Reducer emits only records that occurred twice. This is possible only if both sets contain the record, because a record includes the primary key and can occur in one set only once.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)   // n is an array of one or two nulls
      if n.size() = 2
          Emit(tuple t, null)
</pre>
<h2>Difference</h2>
<p>Suppose we have two sets of records &#8211; R and S. We want to compute the difference R &#8211; S. Mapper emits all tuples, each with a tag which is the name of the set the record came from. Reducer emits only records that came from R but not from S.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, string t.SetName)    // t.SetName is either 'R' or 'S'

class Reducer
   method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
      if n.size() = 1 and n[1] = 'R'
          Emit(tuple t, null)
</pre>
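<p>The tagging idea can be illustrated with a short in-memory sketch. The sample relations are assumptions; the same grouping also yields the intersection from the previous section when the tags of both sets are present.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

R = [(1, "x"), (2, "y"), (3, "z")]
S = [(2, "y"), (4, "w")]

# Map: emit each tuple keyed by itself and tagged with the set it came from.
tags_by_tuple = defaultdict(list)
for name, relation in (("R", R), ("S", S)):
    for t in relation:
        tags_by_tuple[t].append(name)

# Reduce: keep the tuples that were seen in R only.
difference = [t for t, tags in tags_by_tuple.items() if tags == ["R"]]
# The same grouping also answers intersection: tuples tagged by both sets.
intersection = [t for t, tags in tags_by_tuple.items() if sorted(tags) == ["R", "S"]]
print(difference, intersection)
</pre>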
<h2>GroupBy and Aggregation</h2>
<p>Grouping and aggregation can be performed in one MapReduce job as follows. Mapper extracts from each tuple the values to group by and to aggregate, and emits them. Reducer receives the values to be aggregated already grouped and calculates an aggregation function. Typical aggregation functions like sum or max can be calculated in a streaming fashion, hence they don&#8217;t require all values to be held simultaneously. Nevertheless, in some cases a two-phase MapReduce job may be required &#8211; see the pattern <strong>Distinct Values</strong> as an example.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, tuple [value GroupBy, value AggregateBy, value ...])
      Emit(value GroupBy, value AggregateBy)
class Reducer
   method Reduce(value GroupBy, [v1, v2,...])
      Emit(value GroupBy, aggregate( [v1, v2,...] ) )  // aggregate() : sum(), max(),...
</pre>
<h2>Joining</h2>
<p>Joins are perfectly possible in the MapReduce framework, but there exist a number of techniques that differ in efficiency and in the data volumes they are designed for. In this section we study some basic approaches. The references section contains links to detailed studies of join techniques.</p>
<h3>Repartition Join (Reduce Join, Sort-Merge Join)</h3>
<p>This algorithm joins two sets R and L on some key k. Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag that indicates the set it came from (&#8216;R&#8217; or &#8216;L&#8217;), and emits the tagged tuple using k as a key. Reducer receives all tuples for a particular key k and puts them into two buckets &#8211; one for R and one for L. When the two buckets are filled, Reducer runs a nested loop over them and emits a cross join of the buckets. Each emitted tuple is a concatenation of the key k, an R-tuple, and an L-tuple. This approach has the following disadvantages:</p>
<ul>
<li>Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.</li>
<li>Reducer should hold all data for one key in memory. If the data doesn&#8217;t fit in memory, it is the Reducer&#8217;s responsibility to handle this by some kind of swap.</li>
</ul>
<div>Nevertheless, Repartition Join is the most generic technique, and it can be successfully used when other optimized techniques are not applicable.</div>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Map(null, tuple [join_key k, value v1, value v2,...])
      Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ] )

class Reducer
   method Reduce(join_key k, tagged_tuples [t1, t2,...])
      H = new AssociativeArray : set_name -&gt; values
      for all tagged_tuple t in [t1, t2,...]     // separate values into 2 arrays
         H{t.tag}.add(t.values)
      for all values r in H{'R'}                 // produce a cross-join of the two arrays
         for all values l in H{'L'}
            Emit(null, [k r l] )
</pre>
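<p>A runnable in-memory sketch of the repartition join follows. The sample data (users and log events) and the variable names are assumptions made for illustration; the shuffle is simulated by a dictionary keyed by the join key.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

R = [(1, "alice"), (2, "bob"), (3, "carol")]                      # (join key, user name)
L = [(1, "login"), (1, "logout"), (2, "login"), (9, "login")]     # (join key, event)

# Map: tag every tuple with its source set and emit it under the join key.
shuffled = defaultdict(list)
for tag, relation in (("R", R), ("L", L)):
    for key, value in relation:
        shuffled[key].append((tag, value))

# Reduce: split each group into two buckets and emit their cross product.
joined = []
for key, tagged in shuffled.items():
    buckets = {"R": [], "L": []}
    for tag, value in tagged:
        buckets[tag].append(value)
    for r in buckets["R"]:
        for l in buckets["L"]:
            joined.append((key, r, l))
print(joined)
</pre>
<p>Keys 3 and 9 occur in one set only, so they travel through the shuffle but produce no output, which is the first disadvantage listed above.</p>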
<h3>Replicated Join (Map Join, Hash Join)</h3>
<p>In practice, it is typical to join a small set with a large one (say, a list of users with a list of log records). Let&#8217;s assume that we join two sets &#8211; R and L, and R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key. The most common and efficient indexing technique here is a hash table. After this, Mapper goes through the tuples of the set L and joins them with the corresponding tuples from R that are stored in the hash table. This approach is very effective because there is no need for sorting or transmission of the set L over the network, but the set R should be small enough to be distributed to all Mappers.</p>
<pre class="brush: cpp; title: ; notranslate">
class Mapper
   method Initialize
      H = new AssociativeArray : join_key -&gt; tuple from R
      R = loadR()
      for all [ join_key k, tuple [r1, r2,...] ] in R
         H{k} = H{k}.append( [r1, r2,...] )

   method Map(join_key k, tuple l)
      for all tuple r in H{k}
         Emit(null, tuple [k r l] )
</pre>
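<p>The replicated join can be sketched in the same style. Again, the data and the function names <em>build_index</em> and <em>map_large</em> are assumptions; the point is that the large set is only streamed, never grouped or sorted.</p>
<pre class="brush: python; title: ; notranslate">
from collections import defaultdict

R = [(1, "alice"), (2, "bob"), (3, "carol")]                      # small set
L = [(1, "login"), (1, "logout"), (2, "login"), (9, "login")]     # large set

def build_index(small_relation):
    """Initialize step: hash the small set by join key (done once per mapper)."""
    index = defaultdict(list)
    for key, value in small_relation:
        index[key].append(value)
    return index

def map_large(index, large_partition):
    """Map step: stream the large set and probe the in-memory hash table."""
    for key, value in large_partition:
        for r in index.get(key, []):
            yield (key, r, value)

index = build_index(R)
joined = list(map_large(index, L))
print(joined)
</pre>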
<h3>References:</h3>
<ol>
<li><a href="http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf">Join Algorithms using Map/Reduce</a></li>
<li><a href="http://infolab.stanford.edu/~ullman/pub/join-mr.pdf">Optimizing Joins in a MapReduce Environment</a></li>
</ol>
<h1>Machine Learning and Math MapReduce Algorithms</h1>
<ul>
<li>C. T. Chu <em>et al.</em> provide an excellent description of machine learning algorithms for MapReduce in the article <a title="Map-Reduce for Machine Learning on Multicore" href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf" target="_blank">Map-Reduce for Machine Learning on Multicore</a>.</li>
<li>FFT using MapReduce: <a href="http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce">http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce</a></li>
<li>MapReduce for integer factorization: <a href="http://www.javiertordable.com/files/MapreduceForIntegerFactorization.pdf">http://www.javiertordable.com/files/MapreduceForIntegerFactorization.pdf</a></li>
<li>Matrix multiplication with MapReduce: <a href="http://csl.skku.edu/papers/CS-TR-2010-330.pdf">http://csl.skku.edu/papers/CS-TR-2010-330.pdf</a> and <a href="http://www.norstad.org/matrix-multiply/index.html">http://www.norstad.org/matrix-multiply/index.html</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/feed/</wfw:commentRss>
		<slash:comments>36</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/02/featured.png?w=133" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/featured.png?w=133" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/02/map-reduce.png" medium="image">
			<media:title type="html">map-reduce</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/graph-propagation-3.png" medium="image">
			<media:title type="html">Iterative Message Passing</media:title>
		</media:content>
	</item>
		<item>
		<title>Implementation of MVCC Transactions for Key-Value Stores</title>
		<link>http://highlyscalable.wordpress.com/2012/01/07/mvcc-transactions-key-value/</link>
		<comments>http://highlyscalable.wordpress.com/2012/01/07/mvcc-transactions-key-value/#comments</comments>
		<pubDate>Sat, 07 Jan 2012 15:47:05 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[coherence]]></category>
		<category><![CDATA[concurrency control]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[lock-free]]></category>
		<category><![CDATA[multiversion]]></category>
		<category><![CDATA[mvcc]]></category>
		<category><![CDATA[nosql]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[transaction]]></category>
		<category><![CDATA[transaction implementation]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=77</guid>
		<description><![CDATA[ACID transactions are one of the most widely used software engineering techniques, a cornerstone of  the relational databases, and an integral part of the enterprise middleware where transactions are often offered as the black-box primitives. Notwithstanding all these and many other cases, the old-fashion approach to transactions cannot be maintained in a variety of modern large [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=highlyscalable.wordpress.com&#038;blog=30930683&#038;post=77&#038;subd=highlyscalable&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>ACID transactions are one of the most widely used software engineering techniques, a cornerstone of relational databases, and an integral part of enterprise middleware, where transactions are often offered as black-box primitives. Notwithstanding all these and many other cases, the old-fashioned approach to transactions cannot be maintained in a variety of modern large-scale systems and NoSQL storages because of high requirements on performance, data volumes, and availability. In such cases, traditional transactions are often replaced by a customized model that assumes implementation of transactional or semi-transactional operations on top of units that are not transactional by themselves.</p>
<p>In this post we consider implementation of lock-free transactional operations on top of Key-Value storages, although these techniques are generic and can be used in any database-like system. At GridDynamics, we recently used some of these techniques to implement lightweight nonstandard transactions on top of Oracle Coherence. In the first section we take a look at two simple approaches that are suitable for some important use cases; in the second section we study a more generic approach that resembles PostgreSQL&#8217;s MVCC implementation.</p>
<h2>Atomic Cache Switching, Read Committed Isolation</h2>
<p>Let&#8217;s start with simple and easy-to-implement techniques that are intended for relatively infrequent updates in read-mostly systems, for instance, daily data reload in eCommerce systems, administrative operations like repair of invalid items, or cache refreshes.</p>
<p>The most trivial case is a reload of all data in the cache (or key space). We wrap the cache interface in a proxy that intercepts all cache operations like <em>get()</em> or <em>put()</em>. This proxy is backed by two caches, namely, A and B, and works in accordance with the following simple logic (Fig.1):</p>
<ul>
<li>At any moment of time, only one cache is active and the proxy routes all user requests to it (Fig.1.1)</li>
<li>The refresh process loads new data into the inactive cache (Fig.1.2)</li>
<li>The refresh process switches a global flag that is shared by all proxies that participate in the refresh; this flag defines which cache is active (Fig.1.3). The proxy starts to dispatch all new read transactions to the new active cache.</li>
<li>Transactions that are in progress at the moment of cache switching can be handled differently depending on the required level of consistency and isolation. If non-repeatable reads are acceptable (some transaction can read data partially from the old state and partially from the new one) then the switch is straightforward and old data can be cleaned up immediately. Otherwise, the proxy should maintain a list of active transactions and route each one to the cache it was initially assigned. Old data can be purged only when all attached transactions have been committed or aborted.</li>
</ul>
<div id="attachment_94" class="wp-caption aligncenter" style="width: 492px"><a href="http://highlyscalable.files.wordpress.com/2012/01/tx-001.png"><img class="size-full wp-image-94 " title="Cache Switch" src="http://highlyscalable.files.wordpress.com/2012/01/tx-001.png?w=594" alt=""   /></a><p class="wp-caption-text">Fig.1 Cache Switch</p></div>
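<p>The switching logic in Fig.1 can be sketched as follows. This is a single-process illustration under stated assumptions: the class name <em>SwitchingCacheProxy</em> is invented, plain dictionaries stand in for the caches, an attribute stands in for the global flag, and non-repeatable reads are accepted, so the old data is dropped right after the switch.</p>
<pre class="brush: python; title: ; notranslate">
class SwitchingCacheProxy:
    """Routes reads to the active cache; a refresh fills the inactive one and flips."""

    def __init__(self):
        self._caches = {"A": {}, "B": {}}
        self._active = "A"                        # stands in for the global flag

    def get(self, key):
        return self._caches[self._active].get(key)

    def refresh(self, new_data):
        previous = self._active
        inactive = "B" if previous == "A" else "A"
        self._caches[inactive] = dict(new_data)   # load while readers still use the old cache
        self._active = inactive                   # the switch itself is a single assignment
        self._caches[previous] = {}               # old data is purged immediately

proxy = SwitchingCacheProxy()
proxy.refresh({"sku-1": "v1"})
before = proxy.get("sku-1")
proxy.refresh({"sku-1": "v2", "sku-2": "v2"})
after = (proxy.get("sku-1"), proxy.get("sku-2"))
print(before, after)
</pre>
<p>To support repeatable reads, the proxy would additionally have to remember which cache each in-progress transaction was attached to and postpone the purge.</p>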
<p>A similar technique can be used for partial updates. It can be implemented differently depending on the underlying storage; we consider one simple strategy with three caches. The framework is similar to the previous one, but the proxy acts in the following way (Fig.2):</p>
<ul>
<li>User requests are routed to the PRIMARY cache (Fig.2.1)</li>
<li>New and updated items are loaded into the NEW cache, keys of deleted items are stored in the DELETED cache (Fig.2.2)</li>
<li>The commit process begins with switching of the global flag. This flag instructs the proxies to look up requested keys in the NEW and DELETED caches first and, if not found, look up the same key in the PRIMARY cache (Fig.2.3). In other words, all user requests are switched to the new data at this step.</li>
<li>Commit process starts to propagate changes from NEW and DELETED caches to the PRIMARY cache, i.e. replace/add/remove items in the PRIMARY cache one by one in non-atomic way (Fig.2.4).</li>
<li>Finally, the commit process switches the global flag back and requests are routed to the PRIMARY cache (Fig.2.5).</li>
<li>Old data can be copied to another cache during step 4 in order to provide rollback possibility. In-progress transactions can be handled as for full refreshes.</li>
</ul>
<div id="attachment_95" class="wp-caption aligncenter" style="width: 548px"><a href="http://highlyscalable.files.wordpress.com/2012/01/tx-002.png"><img class="size-full wp-image-95 " title="Partial Cache Switch" src="http://highlyscalable.files.wordpress.com/2012/01/tx-002.png?w=594" alt=""   /></a><p class="wp-caption-text">Fig.2 Partial Cache Switch</p></div>
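<p>The steps in Fig.2 can be illustrated with the sketch below. It is a single-threaded simplification with invented names (<em>PartialUpdateProxy</em>, <em>stage</em>, <em>commit</em>); dictionaries and a set stand in for the PRIMARY, NEW, and DELETED caches, and rollback support is omitted.</p>
<pre class="brush: python; title: ; notranslate">
class PartialUpdateProxy:
    def __init__(self, primary):
        self.primary = dict(primary)
        self.new = {}
        self.deleted = set()
        self.use_overlay = False                 # stands in for the global flag

    def get(self, key):
        if self.use_overlay:                     # look in DELETED and NEW first
            if key in self.deleted:
                return None
            if key in self.new:
                return self.new[key]
        return self.primary.get(key)

    def stage(self, upserts, deletes):
        self.new.update(upserts)
        self.deleted.update(deletes)

    def commit(self):
        self.use_overlay = True                  # readers now see the new data
        for key, value in self.new.items():      # non-atomic propagation to PRIMARY
            self.primary[key] = value
        for key in self.deleted:
            self.primary.pop(key, None)
        self.use_overlay = False                 # PRIMARY is consistent again
        self.new, self.deleted = {}, set()

proxy = PartialUpdateProxy({"a": 1, "b": 2})
proxy.stage(upserts={"a": 10, "c": 30}, deletes=["b"])
staged_view = proxy.get("a")                     # the flag has not been switched yet
proxy.use_overlay = True                         # first step of the commit, shown separately
overlay_view = (proxy.get("a"), proxy.get("b"), proxy.get("c"))
proxy.commit()
print(staged_view, overlay_view, proxy.primary)
</pre>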
<p>Thus, from the examples above, we can conclude that attaching read transactions to a snapshot of the data and avoiding interference from commits of update transactions is one of the main sources of complexity. This is especially the case in write-intensive environments. In the next section we consider a very powerful technique that helps to solve this problem gracefully.</p>
<h2>MVCC Transactions, Repeatable Reads Isolation</h2>
<p>Isolation between transactions can be achieved using versioning of separate items in the Key-Value space. There are different ways to implement this technique; here we discuss an approach that is very similar to how PostgreSQL handles transactions.</p>
<p>As was said in the previous section, each transaction should be attached to a particular data snapshot, which is a set of items in the cache. At the same time, each item has its own life span &#8211; from the moment it was added to the cache until the moment it was removed or updated, i.e. replaced by a new version. So, isolation can be achieved by marking each item with two time stamps and each transaction with its start time, and checking that a transaction sees only items that were alive at the moment the transaction began. In practice, of course, global monotonically increasing counters are usually used instead of time stamps. More formally:</p>
<ul>
<li>When a new transaction is started, it is associated with:
<ul>
<li>Its Transaction ID or XID which is unique for each transaction and grows monotonically.</li>
<li>A list of XIDs of all transactions that are currently in-progress.</li>
</ul>
</li>
<li>Each item in the cache is marked with two values, <em>xmin</em> and <em>xmax.</em> Values are assigned as follows:
<ul>
<li>When item is created by some transaction, <em>xmin</em> is set to XID of this transaction, <em> xmax</em> is empty.</li>
<li>When item is removed by some transaction, <em>xmin</em> is not changed, <em>xmax</em> is set to XID. The item is not actually removed from the cache, it is merely marked as deleted.</li>
<li>When item is updated by some transaction, old version is preserved in the cache, its <em>xmax</em> is set to XID; new version is inserted with <em>xmin</em>=XID and empty <em>xmax</em>. In other words this is equivalent to remove + insert.</li>
</ul>
</li>
<li>Item is visible for transaction with XID = <em>txid </em>if the following two statements are true:
<ul>
<li><em>xmin</em> is the XID of a committed transaction and <em>xmin</em> is less than or equal to <em>txid</em></li>
<li><em>xmax</em> is blank, or is the XID of a non-committed (aborted or in-progress) transaction, or is greater than <em>txid</em></li>
</ul>
</li>
<li>Each <em>xmin</em> and <em>xmax </em>can store two bit flags that indicate whether the transaction was aborted or committed, in order to perform the checks described in the previous point.</li>
</ul>
<p>This logic is illustrated in the following graphic:</p>
<div id="attachment_96" class="wp-caption aligncenter" style="width: 604px"><a href="http://highlyscalable.files.wordpress.com/2012/01/tx-003.png"><img class="size-full wp-image-96" title="PostgreSQL-like MVCC " src="http://highlyscalable.files.wordpress.com/2012/01/tx-003.png?w=594&#038;h=588" alt="" width="594" height="588" /></a><p class="wp-caption-text">Fig.3 PostgreSQL-like MVCC</p></div>
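<p>The two visibility rules can be written down as a short check. This sketch implements only the rules listed above: the function name <em>is_visible</em>, the item layout, and the <em>status</em> table that maps an XID to its state are assumptions for illustration, and the list of transactions that were in progress when the reader started is not consulted here.</p>
<pre class="brush: python; title: ; notranslate">
from operator import gt, le   # le(a, b): a is at most b; gt(a, b): a is greater than b

COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in-progress"

def is_visible(item, txid, status):
    """item: dict with xmin and xmax (None when blank); status maps XID to its state."""
    xmin, xmax = item["xmin"], item["xmax"]
    # Rule 1: the creating transaction committed and is not newer than the reader.
    created = status[xmin] == COMMITTED and le(xmin, txid)
    # Rule 2: the item is not deleted as far as the reader can tell.
    not_deleted = xmax is None or status[xmax] != COMMITTED or gt(xmax, txid)
    return created and not_deleted

status = {1: COMMITTED, 2: ABORTED, 4: COMMITTED}
old_version = {"xmin": 1, "xmax": 4}       # created by 1, replaced by 4
new_version = {"xmin": 4, "xmax": None}    # created by 4, still alive
aborted_insert = {"xmin": 2, "xmax": None} # created by a transaction that aborted

for txid in (3, 5):
    print(txid,
          is_visible(old_version, txid, status),
          is_visible(new_version, txid, status),
          is_visible(aborted_insert, txid, status))
</pre>
<p>A reader with XID 3 sees only the old version, a reader with XID 5 sees only the new one, and the item written by the aborted transaction is never visible.</p>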
<p>The disadvantage of this approach is the rather complex procedure for removing obsolete versions. Because different transactions see different sets of items and versions, it is not straightforward to determine the moment when a particular version becomes invisible to everyone and may be eliminated. There are at least two different techniques to do this; the first one is used in PostgreSQL, the second one in the Oracle Database:</p>
<ul>
<li>All versions are stored in the same key-value space and there is no fixed limit on how many versions may be maintained. Old versions are collected by a background process that is executed continuously, on a schedule, or triggered by reads or writes.</li>
<li>The primary key-value space stores only the latest versions; the previous versions are stored in another fixed-size storage. The latest versions have references to the previous versions, and a particular version can be traced back by the transactions that require it. Because the size of the storage is limited, the oldest versions may be eliminated to free space for the &#8220;new old&#8221; items. If some transaction is not able to find a required version, it fails.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/01/07/mvcc-transactions-key-value/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/01/featured3.png?w=150" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/featured3.png?w=150" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/tx-001.png" medium="image">
			<media:title type="html">Cache Switch</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/tx-002.png" medium="image">
			<media:title type="html">Partial Cache Switch</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/tx-003.png" medium="image">
			<media:title type="html">PostgeSQL-like MVCC </media:title>
		</media:content>
	</item>
		<item>
		<title>Performance of Priority Queue Sorting with Pagination</title>
		<link>http://highlyscalable.wordpress.com/2012/01/02/sorting-with-pagination/</link>
		<comments>http://highlyscalable.wordpress.com/2012/01/02/sorting-with-pagination/#comments</comments>
		<pubDate>Mon, 02 Jan 2012 09:43:28 +0000</pubDate>
		<dc:creator>Ilya Katsov</dc:creator>
				<category><![CDATA[Fundamentals]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[information retrieval]]></category>
		<category><![CDATA[pagination]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[queue]]></category>
		<category><![CDATA[sorting]]></category>

		<guid isPermaLink="false">http://highlyscalable.wordpress.com/?p=59</guid>
		<description><![CDATA[In web applications, it is a very common task to sort a set of items according to user-selected criteria and return only the first or N-th page of the sorted result. The page size can be much smaller than the total number of items, hence it is typically not reasonable to sort the entire set and [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>In web applications, it is a very common task to sort a set of items according to user-selected criteria and return only the first or N-th page of the sorted result. The page size can be much smaller than the total number of items, so it is typically not reasonable to sort the entire set and then crop a single page; it&#8217;s much more efficient to extract this page on the fly while running through the initial unsorted set. Sorting with a priority queue is a well-known solution to this problem. In this post I present an analysis of priority queue sorting for page-oriented use cases.</p>
<p>Let us assume that a sorting unit iterates over the unsorted list of items and maintains a sorted page (a queue) of the largest items seen so far. If the unit meets an item that is greater than the minimal element in the queue, it removes the minimal element from the page and inserts the current item into the queue. This logic is illustrated below:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/01/page-sorting.png"><img class="aligncenter size-full wp-image-60" title="page-sorting" src="http://highlyscalable.files.wordpress.com/2012/01/page-sorting.png?w=594" alt=""   /></a></p>
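<p>The scanning logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <tt>top_page</tt> is hypothetical, and a binary min-heap from the standard <tt>heapq</tt> module stands in for the sorted page, so that the minimal element is always available at index 0.</p>

```python
import heapq
import itertools
import random


def top_page(items, p):
    """Return the p largest items in descending order, scanning items once."""
    if p <= 0:
        return []
    iterator = iter(items)
    # Fill the page with the first p items and turn it into a min-heap.
    page = list(itertools.islice(iterator, p))
    heapq.heapify(page)
    for x in iterator:
        # page[0] is the minimal element of the current page.
        if x > page[0]:
            # Remove the minimum and insert the new item in one O(log p) step.
            heapq.heapreplace(page, x)
    return sorted(page, reverse=True)


if __name__ == "__main__":
    items = list(range(5000))
    random.Random(0).shuffle(items)
    # The second page of size 24 is the tail of the 48 largest items.
    second_page = top_page(items, 48)[24:]
    print(second_page[:3])
```

<p>The heap is a design choice made for this sketch; keeping the page as a sorted list with binary-search insertion would follow the description more literally, with the same number of comparisons against the page minimum.</p>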
<p>Suppose we have <span style="color:#ff6600;"><tt>N</tt></span> randomly permuted items and we need to extract the <span style="color:#ff6600;"><tt>p</tt></span> largest ones. If we have a set of <span style="color:#ff6600;"><tt>k</tt></span> items <span style="color:#ff6600;"><tt>x</tt><sub>1</sub><tt>, ..., x</tt><sub>k</sub></span> and <span style="color:#ff6600;"><tt>x</tt><sub>k</sub></span> is the p-th largest element in this list, then we say that <span style="color:#ff6600;"><tt>rank(x</tt><sub>k</sub><tt>)=p</tt></span>. In other words, <span style="color:#ff6600;"><tt>rank(x</tt><sub>k</sub><tt>)</tt></span> is the position of <span style="color:#ff6600;"><tt>x</tt><sub>k</sub></span> in the list after sorting it in descending order. Let us consider the following case: we have scanned <span style="color:#ff6600;"><tt>k-1 &gt; p</tt></span> items and the page contains the <span style="color:#ff6600;"><tt>p</tt></span> largest items among <span style="color:#ff6600;"><tt>x</tt><sub>1</sub><tt>, ..., x</tt><sub>k-1</sub></span>. In this case, the next item <span style="color:#ff6600;"><tt>x</tt><sub>k</sub></span> will be inserted into the page with the following probability (assuming that all permutations are equally likely):</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/01/rank_prob.png"><img class="aligncenter size-full wp-image-67" title="rank_prob" src="http://highlyscalable.files.wordpress.com/2012/01/rank_prob.png?w=594" alt=""   /></a></p>
<p>To see this, note that there are <span style="color:#ff6600;"><tt>k!</tt></span> possible permutations of k items, and the k-th item will be inserted if it is the largest element (<span style="color:#ff6600;"><tt>(k-1)!</tt></span> permutations), or the second largest element (<span style="color:#ff6600;"><tt>(k-1)!</tt></span> permutations), &#8230;, or the p-th largest element (<span style="color:#ff6600;"><tt>(k-1)!</tt></span> permutations), which gives <span style="color:#ff6600;"><tt>p*(k-1)!/k! = p/k</tt></span>. Now we can estimate the average number of item comparisons for page extraction. This number consists of two components. The first one is the initial effort of scanning the first <span style="color:#ff6600;">p</span> of <span style="color:#ff6600;">N</span> elements: the list that represents the extracted page is initially empty, so we simply add the first <span style="color:#ff6600;">p</span> items to it and sort them, which requires about <span style="color:#ff6600;">p*log p</span> comparisons. The second component is the effort of scanning the remaining items. It requires <span style="color:#ff6600;">N-p-1</span> comparisons with the page minimal element (the &#8220;<span style="color:#ff6600;">1 +</span> <span style="color:#808080;">Pr{..</span>&#8221; term in the formula below) and an insertion into the page with complexity <span style="color:#ff6600;">log p</span>, but only in the cases when the item has a higher rank than the minimal item in the currently extracted page. Combining all this into one formula and simplifying it, we obtain a total complexity of about <span style="color:#ff6600;">p log p ln N + N</span>:</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/01/final_complexity.png"><img class="aligncenter size-full wp-image-68" title="final_complexity" src="http://highlyscalable.files.wordpress.com/2012/01/final_complexity.png?w=594&#038;h=98" alt="" width="594" height="98" /></a></p>
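<p>For readers who cannot view the formula image, the derivation can be written out roughly as follows. This is a reconstruction from the reasoning above, not a transcription of the image: the summation limits are an assumption, <tt>log</tt> is taken as the base-2 logarithm, and the harmonic sum is approximated by a natural logarithm.</p>

```latex
\begin{aligned}
C(N, p) &\approx p \log p + \sum_{k=p+1}^{N} \Bigl( 1 + \Pr\{\operatorname{rank}(x_k) \le p\} \cdot \log p \Bigr) \\
        &= p \log p + (N - p) + p \log p \sum_{k=p+1}^{N} \frac{1}{k} \\
        &\approx p \log p + (N - p) + p \log p \, \ln \frac{N}{p} \\
        &\le N + p \log p \, \ln N
\end{aligned}
```

<p>The last step drops the negative remainder <tt>p*(log p - 1 - log p * ln p)</tt>, which is why the final expression is an upper estimate rather than an exact value.</p>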
<p>By comparison, directly sorting all items takes an estimated <span style="color:#ff6600;"><tt>N*log N</tt></span> comparisons. For example, if we select the first 48 items (to serve the second page of size 24) from 5000 items, then sorting the entire set will take <tt>~61500</tt> comparisons, while page-oriented sorting will take <tt>~6470</tt> comparisons.</p>
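<p>The two estimates can be checked numerically with a short script. This is a sketch under assumptions: the function names are hypothetical, logarithms are base 2, each of the <tt>N-p</tt> items after the first <tt>p</tt> is charged one comparison with the page minimum, and the insertion probability <tt>p/k</tt> is summed exactly instead of being approximated by a logarithm.</p>

```python
import math


def page_sort_estimate(n, p):
    """Estimated comparisons for extracting the p largest of n items."""
    # Sort the first p items to build the initial page.
    initial = p * math.log2(p)
    # One comparison with the page minimum per remaining item.
    scan = n - p
    # Item k enters the page with probability p/k and then costs log2(p).
    inserts = math.log2(p) * sum(p / k for k in range(p + 1, n + 1))
    return initial + scan + inserts


def full_sort_estimate(n):
    """Estimated comparisons for sorting all n items."""
    return n * math.log2(n)


if __name__ == "__main__":
    print(round(page_sort_estimate(5000, 48)))
    print(round(full_sort_estimate(5000)))
```

<p>For <tt>N = 5000</tt> and <tt>p = 48</tt> both results land close to the rounded figures quoted above, roughly a tenfold difference in favour of page-oriented sorting.</p>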
<p>The plots below depict the complexity (number of comparisons) of page-aware sorting and of sorting the entire item set for different values of <span style="color:#ff6600;">p</span> and <span style="color:#ff6600;">N</span>.</p>
<p><a href="http://highlyscalable.files.wordpress.com/2012/01/plot.png"><img class="aligncenter size-full wp-image-83" title="plot" src="http://highlyscalable.files.wordpress.com/2012/01/plot.png?w=594&#038;h=298" alt="" width="594" height="298" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://highlyscalable.wordpress.com/2012/01/02/sorting-with-pagination/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:thumbnail url="http://highlyscalable.files.wordpress.com/2012/01/featured4.png?w=121" />
		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/featured4.png?w=121" medium="image">
			<media:title type="html">featured</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/c11f79021b0f6248403dbf5e4b9d529b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">highlyscalable</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/page-sorting.png" medium="image">
			<media:title type="html">page-sorting</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/rank_prob.png" medium="image">
			<media:title type="html">rank_prob</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/final_complexity.png" medium="image">
			<media:title type="html">final_complexity</media:title>
		</media:content>

		<media:content url="http://highlyscalable.files.wordpress.com/2012/01/plot.png" medium="image">
			<media:title type="html">plot</media:title>
		</media:content>
	</item>
	</channel>
</rss>
