Although there are many books on data mining in general and its applications to marketing and customer relationship management in particular [BE11, AS14, PR13 etc.], most of them are structured as data scientist manuals focusing on algorithms and methodologies and assume that human decisions play a central role in transforming analytical findings into business actions. In this article we are trying to take a more rigorous approach and provide a systematic view of econometric models and objective functions that can leverage data analysis to make more automated decisions. With this paper, we want to describe a hypothetical revenue management platform that consumes a retailer’s data and controls different aspects of the retailer’s strategy such as pricing, marketing, and inventory:
There are two major reasons why this study focuses on a combination of economic frameworks and data mining methods:
These two considerations suggest a high potential for automated decision making and dynamic optimization in retail, so we were keen to study this subject. Most of this article represents an overview of the results published by retailers and researchers who built practical decision making and optimization systems combining abstract economic models with data mining methods. More specifically, the article was inspired by three major case studies from Albert Heijn [KOK07], the largest supermarket chain in the Netherlands, Zara [CA12], an international apparel retailer, and RueLaLa [JH14], an innovative online fashion retailer. We also incorporate results from Amazon, Netflix, LinkedIn and many independent researchers and commercial projects. At the same time, we avoid academic results with little or no empirical support.
The study focuses mainly on optimization problems related to revenue management discipline which includes marketing and pricing questions. More specialized data mining applications like supply chain optimization and fraud detection are out of scope, as well as the implementation details of the data mining process (such as evaluation of model quality).
The rest of the article is organized as follows:
[This article is also available in PDF – see the Whitepapers page.]
This article describes six major optimization problems related to marketing and pricing that can be solved leveraging data mining techniques. Although these problems are very different, we are trying to establish a common framework that helps to design optimization and data mining tasks required for solutions.
The basic idea of the framework is to use an economic metric such as gross margin as the optimization objective and consider it a function of possible retailer’s actions such as marketing campaigns or assortment adjustments. The econometric objective is also a function of data in the sense that econometric models should be parameterized by properties of a particular retailer to produce a numerical value, such as gross margin, at its output. For instance, consider a retailer planning a marketing mailing campaign. The space of possible actions can be defined as a set of send/no-send decisions with regard to individual customers and the gross margin of the campaign depends both on actions (who will receive the incentive and who will not) and data such as expected revenue from a given customer and mailing costs. This approach can be expressed in more formal way by the following equation:
where is the data available for analysis, is the space of a retailer’s actions and decisions, is an econometric model defined as a function of actions and data, and is the optimal strategy. This framework resembles the approach suggested in [JK98].
The design of the model heavily depends on the problem. In most cases it is reasonable to model and optimize gross margin, but, as we will discuss in the next section dedicated to response modeling, other objectives are also possible. It is also important to keep in mind that the optimization problem (1) as a whole is somewhat dependent on time because of environmental changes (new products appear on the market, competitors make their moves etc.) and retailer’s own actions. The most typical approach for handling this dependency is to use stateless treating it as a mathematical function, but allow for historical data in the arguments to account for memory effects.
The role of data mining in the optimization problem (1) is crucial because econometric models are typically complex and have to be learned from data by means of regression and other data mining techniques. In some cases the model cannot be completely specified either because of high complexity (e.g. user behavior cannot be precisely predicted) or because it’s impossible to extrapolate the existing data to the case of interest (e.g. the action is to introduce a completely new service). A/B testing and panel surveys are used in such cases to get additional data points that improve the precision of the model.
Some resource such as an advertisement or a special offer will be distributed to a group of customers. Each unit of the resource is associated with a monetary cost such as the mailing cost of a printed catalog, or some negative effect (such as causing a customer to unsubscribe from irrelevant email notifications). At the same time, the resource can influence customers’ decisions urging them to make more purchases, buy promoted products etc. The goal is to find a set of the most promising candidates who should receive the resource in order to maximize the overall performance of the targeted group of customers.
The resource can be homogenous (i.e. all participating customers will get the same incentive) or personalized. In the latter case, a retailer has a set of different incentives such as discount coupons on different products and the goal is to offer a unique subset of incentives or no incentives to each customer to maximize the overall performance.
Response modeling is widely applicable in marketing and customer relationship management:
From the points we’ve talked about above, we now realize that the problem of resource distribution is an optimization problem that should be driven by an objective function. One of the most basic approaches is to model the overall profit of the campaign in terms of probability of response and the expected net value for a customer. Let us denote the entire population of customers as and the subset of customers reached in the scope of the campaign as . The expected gross profit of the campaign can then be modeled as follows:
where is the probability of the response on the incentive from the customer , is a response net value for the customer , and is a cost of the incentive resource. The first term corresponds to the expected gain from a responding customer and the second term corresponds to the expected loss of sending an incentive to which there’s no response. The objective is to maximize by finding a subset of customers that are likely to respond in the most profitable way. Since the equation (1.1) can be reduced as follows
where denotes the mathematical expectation of the gross margin for a given user assuming that the user will receive the incentive, the customer selection criteria boils down to the following condition
and the optimal subset of customers can be determined as a subset that maximizes the gross margin:
This approach can also be considered the maximization of targeted net value compared to random resource distribution. To see this, let us compare these two options assuming a fixed number of customers participating in a campaign. First, let us extend the equation (1.2) to explicitly include the expected gross margin from a campaign that distributes incentives among customers selected at random:
where is the average net value per customer over the population. This average net value is constant, hence it can be omitted assuming the fixed cardinality . The equation (1.2) can be also reduced in the case of fixed yielding the same result as (1.3):
However, it can be argued [VL02] that this model is imperfect because it favors customers who are likely to respond to an incentive, but does not take into account customers who are likely to respond anyway generating the same profit even without incentives. To address this shortcoming, let us separately calculate the gross margin for the set of customers in the following four cases:
The equation (1.2) maximizes the difference i.e. the lift of targeting compared to the random distribution. The alternative approach is to maximize which measures not only the lift compared to the random distribution but also the lift compared to the no-action baseline on the same set of customers. In that case, the equation (1.2) transforms into the following:
where the last term corresponds to the expected net value for customers who were not provided with the incentive. This approach is known as differential response analysis or uplift modeling [BE09].
It is worth noting that the expressions (1.2) and (1.4) are not necessarily optimized by maximizing marketing budgets. Consider the situation when the response net profit is $100 per customer and the incentive cost is $1. If a group of 1 million customers contains 0.5% potential responders, the most expensive marketing campaign that reaches each customer will effect a loss of $500K (the total response net value of $500K minus the campaign cost of $1M). At the same time, a data model that identifies ten thousand of the most likely customers with a response probability of 5% (10x lift) will produce a profit of $40,000 (a total response value of $50,000 minus the campaign cost of $10,000).
The equation (1.4) is especially important for different types of price discounts (coupons, temporary price discounts, and special offers). Consider the following question: “Should a retailer offer a discount coupon on apples to a person who buys apples every day?” This question would most likely will be answered in the affirmative according to the equation (1.2) because the person is likely to redeem a coupon. However, it is more probable that the customer would just buy the same amount of apples for a lower price, basically decreasing retailer’s profit. The equation (1.4) alleviates this problem by incorporating default customer behavior. We continue to discuss price discrimination in the next sections because it is a complex topic that goes far beyond the equation (1.4).
The mathematical expectations of the net revenue in the equations (1.2) and (1.4) can be estimated by means of classification and regression models trained on historical data for customers who have received incentives in the past and those who did not. This problem can be very challenging, especially when the incentive under evaluation is somewhat dissimilar to everything used in the past; in this case, the incentives may require testing on a customer panel before running a full-scale campaign. Moreover, gross margin is not the only performance metric that is important for retailers. The gross margin metric, in the sense it is used in the equations (1.2) and (1.4), is concerned with the immediate return from the first purchase which is a very simplistic view of customer relationship management. A retailer might be concerned with a variety of other metrics and this variety is so huge that there is a separate econometric discipline – propensity modeling [SG09, LE13] – that develops different models that predict customers’ future behaviors. The most important propensity models include:
The models above can be embedded into a framework similar to the equation (1.4) as the alternatives to the gross margin objective. We will take a closer look at propensity modeling in a later section dedicated to price discrimination where we will model propensity to response on a discount. More details on propensity modeling can be found in dedicated studies and books like [FX06] and [SG09].
The framework can also be extended to select an optimal incentive among multiple alternatives. For instance, a retailer can estimate the expected performance of two incentives A and B (e.g. chocolate ice-cream versus vanilla ice-cream) and then select the optimal option for a given user according to the following criteria [WE07]:
Finally, it is worth noting that the problem of response modeling is tightly coupled with customer segmentation:
There is a set of incentives where each incentive corresponds to a product or some other catalog item. Incentives are not associated with direct monetary costs, but only a limited number of incentives can be shown to a customer. In that respect, each incentive has a cost in terms of screen space or a customer’s attention span, so the negative effect of providing a customer with a given incentive can be measured as lost opportunity costs. The goal is to suggest a personalized subset of incentives (e.g. recommendations on a website) to each customer to maximize the purchasing performance of the population.
The most typical use cases for this problem are recommender systems, personalized search results ranking, and targeted ads. However, there are a number of other important applications:
It is critical to note that although recommendations are typically considered as a relatively specialized online service, the principles and techniques developed for recommender systems are fundamentally important for many aspects of retail because they aim to reveal the hidden mapping between the customers and products these customers are interested in. It is a principal task for any retailer.
Recommender systems have been a subject of extremely intensive research during the last two decades, and books [JZ10, RR10] were written to provide a systematic view on dozens of recommendation algorithms and techniques suggested in numerous articles, presentations and whitepapers. To a certain extent, such a high diversity of recommendation techniques is attributed to several implementation challenges like a sparsity of customer ratings, computational scalability, and lack of information on new items and customers. Clearly, we cannot overview even a fraction of these methods and algorithms in this section and it does not make a lot of sense to try to do so because many overviews of all kinds are widely available. Instead we will focus on the objectives and utility functions that drive the design of the recommender systems and mainly bypass the algorithmic and technical sides of the problem.
From an econometric perspective, the problem of recommendations is closely related to the rapid expansion of assortment in many retail sectors expedited by ecommerce and omni-channel commerce. Large assortment increases the number of slow moving items each of which has a small sales volume and contributes little to revenues, but the overall contribution of this “long tail” is significant:
Traditional recommendation techniques like advertising best sellers become insufficient to leverage the potential of slow moving items and more sophisticated recommendation methods are needed to guide the customer through the millions of items that he or she will never explore completely without suggestions.
Since we are mainly concerned with the models that describe customer preferences in regard to products, let us walk through the most widely used recommendation techniques ranging them by complexity of utility functions, starting from relatively simple and moving towards more advanced. We will use the hierarchy of recommendation techniques depicted in the figure below. This hierarchy resembles the commonly used classification of the recommender systems, although not exactly the same:
Multiple-objective solutions are especially interesting in the context of this study because they can incorporate economic goals alongside simple relevancy.
Single objective. Let us start with a basic definition of the single-objective recommendation task that is widely used in the literature on the recommender systems. A retailer sells line items to the population of users . The rating function expresses the opinion of a user about an item and ranges from negative (“don’t like”) to positive (“like”) within a certain numerical scale. The rating score for some pairs of users and items can be estimated by the explicitly set user ratings or by analyzing purchase histories, access logs on a website and so on. The recommendation task then can be defined as prediction of a rating score for a given user-item pair .
There are two natural ways to approach the problem of rating prediction: to estimate a rating score for each user independently by looking for items that are similar to what this particular user liked in the past; and to estimate the rating scores by averaging the ratings from users similar to a given one. These two approaches are known as content filtering and collaborative filtering, respectively.
Content filtering. The main idea of content filtering is to predict ratings by comparing past user preferences, behavior and purchases with the product items. Although different interpretations of content filtering are possible, we choose to treat it as a classification problem [PZ07] to highlight usage of data mining methods:
Although the process above seems to be relatively straightforward, it is very challenging because users and items are fundamentally different entities and it is difficult to find a representation for catalog items that can be directly transformed into such a subtle thing as user preferences by means of a regression model.
The main problem is that standard inventory attributes such as a brand, product name, or price are typically insufficient to measure the utility of an item to a user. Although some customers can be satisfactorily characterized by loyalty to a certain brand or price category, more subtle and informal dimensions like lifestyle or temperament are required to describe observed patterns and commonalities. These implicit dimensions are crucial for products like movies, books, music, and even for more tangible goods like apparel. A retailer can leverage standard classification techniques to label items with implicit dimensions as follows [GH02]:
Content filtering in general and item modeling in particular is the information retrieval task, so many text mining and search techniques (for example, see [MA08] for an overview) are typically leveraged to build a recommender. We omit these details here because they are not so important from the econometric perspective.
Collaborative filtering. The problem of implicit dimensions noted in the previous sections has crucial implications that lead us to the second family of recommendation techniques. This problem naturally arises in connection with the fundamental inability to rigorously model human tastes and judgments. Collaborative filtering represents a natural and probably the only possible solution that does not require manual training of the system – the need of a “human factor” in recommendation decisions is satisfied by using the feedback from other users.
The basic collaborative filtering model [RE94, BR98] is straightforwardly defined based on the similarity metric between users:
where is a known rating for item j set by user u, is the set of all users or a heuristically selected neighborhood around a given user, is a normalization coefficient, is a measure of similarity between two users, and is the average rating for a given user
assuming that is a set of items rated by the user. The equation (2.1) uses the concept of average user ratings to model the fact that some users have tendencies to give higher or lower ratings than others because they are more or less demanding. Although not absolutely required, this correction is very important in practice and widely used from the very first implementations of collaborative filtering.
The similarity function is typically calculated as a cosine distance or Pearson correlation coefficient between the rating vectors for and . In addition, this basic similarity measure can be adjusted in multiple ways [BR98, SU09] to improve its performance in practice.
The model (2.1) has significant drawbacks because of high computational complexity that grows proportionally to both the number of items and the number of users, and a sparsity of user ratings. A sparsity of ratings means that each user rates only a small fraction of available products, so the vectors and used to compute the similarity metric often have no elements in common which reduces the quality of recommendations. For instance, it is known that Amazon [SA01] and Netflix [YK08] are missing 99% of the possible ratings in . To overcome this limitation, the user-based model (2.1) is often replaced by the conceptually similar item-based model [SA01, YK08]:
where the similarity metric between the items is based on the ratings from all users they have in common and the baseline rating incorporates both user bias (user’s average rating compared to the overall average) and item bias (item’s average rating compared to the overall average). It is worth noting that a simplistic way to implement (2.2) is to find frequent itemsets (items that are frequently bough together) and compute the similarity based on co-occurrence in the same itemsets. This approach is also known as the “poor man’s recommendation engine” due to its simplicity [RE03].
The models (2.1) and (2.2) belong to the family of so-called neighborhood models that estimate rating by analyzing the neighborhood of most similar users or items. This family also includes a huge variety of techniques [SU09] that replace computationally expensive inspection of the neighborhood by more compact probabilistic models or other approximations.
Although neighborhood models are a proven recommendation technique known to be used by leading retailers like Amazon, they still suffer from the fundamental problem of the implicit dimensions we discussed in the context of content filtering. The user-to-user and item-to-item similarity metrics considered above are limited in their ability to reveal a complicated relationship between users and items. A similar problem appears in information retrieval when text documents are looked up by a search query – synonymy (multiple ways to express the same idea or concept using difference words and linguistic structures) and polysemy (multiple meanings of the same word or expression) phenomena make it very challenging to reveal the actual intent of a person who composed the query and to adequately translate this intent into a similarity measure between the document and query. To cope with this problem, a technique called Latent Semantic Analysis had been proposed in [DR90]. Ten years after, this method has been utilized in the design of recommender systems [SA00] creating a new family of Latent Factor models.
The main idea of latent factor models can be described as follows: the rating function R can be represented as matrix ( is the number of users and is the number of products) whose elements are the rating values. This can be considered a linear space of dimensions. The recommendation task then can be restated as the computing of a user rating vector as a linear combination of other rating vectors. Indeed, the equation (2.1) is naturally a linear combination of ratings with weights defined by the similarity function. However, the problem is that the rating matrix is sparse due to missing ratings, generally noisy because of biases and randomness, and based on item-wise dimensions which limit its ability to reveal user tastes that are generally related to item groups instead of individual items. In other words, the signal is scattered across this huge low-density matrix and mixed with noise to such an extent that it can be revealed only by studying hidden patterns. The idea of the latent factor model is to approximate this large linear space using a basis of a smaller dimensionality. It helps to achieve the following goals:
The goals above can be achieved by means of dimensionality reduction because of correlations in the original data representation R. As an illustration, consider the following example of two-dimensional data:
Each point in the set has substantially large coordinates both along and dimensions suggesting complex and irregular structure of the data. However, another coordinate system reveals that the data can be efficiently described by a coordinate along the dimension and dimension does not matter much, suggesting a one-dimensional latent factor model.
To a certain extent, the latent factor model can be compared to how discreet cosine transform (DCT) is used in image compression algorithms like JPEG to approximate the image by a few harmonics.
This chain of thinking leads us to the following formal model of the latent factors. One first selects the number of dimensions and models each user and each item as a vector in the space of this dimensionality. Let us denote the vector for user as and the vector for item as . The vectors are computed from the rating matrix in such a way that each of vector components corresponds to a latent dimension as described above. So, both users and items are now encoded in the same terms and the rating can be calculated as a product of these two vectors i.e. pairwise correlation between the corresponding dimensions:
There are different ways to compute the latent factor vectors and for users and items. The most straightforward way is factorization of the rating matrix using singular value decomposition (SVD), however iterative gradient descend optimization methods [YK09] are typically used in this practice because of computational stability and complexity concerns.
The sketch below illustrates the difference between the convolutions (2.1) and (2.3). On the left side, the sparse rating vector for a given item is convolved with the sparse similarity vector for a given user producing the estimate. On the right side, the rating is estimated by convolving two vectors of reduced dimensionality and higher energy density.
Multiple objectives. All recommendation methods discussed above are driven essentially by a single objective – to provide the best semantic match or predicted preference score. However, recommendation accuracy might not be the only concern of the recommender system design – a retailer might be interested to incorporate multiple competing objectives into the recommendations offered to the customers. For instance, grocers might be interested to boost perishables with a shorter shelf life, fashion stores might want to promote sponsored brands or seasonal collections, and a wide range of retailers can benefit from recommending products with a higher margin or from taking into account product stock levels to avoid stockouts.
A recommender system with multiple objectives was suggested in [JW10] and then developed and tested in practice at a large scale by LinkedIn [RP12]. In the case of LinkedIn, the primary objective was to recommend candidates who semantically match a job description and also, as a secondary objective, display a job-seeking behavior. The method described in [RP12] defines the recommendation task as the following optimization problem:
where
The main idea of the optimization problem above is to increase the utility of the hybrid recommendations that mix relevance scores with the secondary objective, but penalize the difference between the original relevancy-based recommendations and hybrid recommendations to make sure that relevance will not be completely sacrificed in pursuit of a secondary objective. The design of the function should include tunable parameters that control the trade-off between two objectives and will be the subject of optimization. This approach can be straightforwardly extended to incorporate more than two objectives.
We can illustrate how the optimization model above can be adapted to the practical problems using a couple of examples. First, consider the case of a retailer who wants to incorporate the revenue objective into the recommendation scores. The overall utility function can be defined as the expected gross margin, assuming that is a normalized gross margin of item p and the probability of purchase is modeled as a reciprocal to the ranking position (i.e. the lower the item in the list of recommendations, the lower the probability of conversion):
where is the probability normalization constant. The composite ranking function can be defined as
where is a parameter that controls the trade-off between the relevance and pitching of high-margin products. This parameter will be the subject of optimization in the problem (2.4).
The second example of re-ranking according to the secondary objective is a boosting of featured items such as on-sale products or perishables. The utility function can be specified as the average number of featured products in the short list of K recommendations:
where is a feature label that equals 1 if the item is featured and 0 otherwise. The composite ranking function combines the relevance score and feature labels with a trade-off parameter which is the subject of optimization:
The ranking function above can be straightforwardly extended to incorporate multiple separate features each of which contributes to the final ranking score according to its own trade-off parameter (all parameters will be optimized jointly):
More details on numerical optimization algorithms for the problem (2.4) can be found in [RP12].
A retailer offers a category of products to its customers. The demand on a given product depends on many factors including a product’s own properties such as price or brand, prices of competing products in the category, sales events, and even the weather. The goal is to build a demand model that incorporates these factors and allows one to perform what-if analysis to forecast response on price changes, assortment extensions and reductions, compute optimal stock levels, and allocate shelf-space units.
In this section we discuss the core problem of demand prediction. It can be considered a building block that is required to model actions that affect the demand or constrained by stock levels:
We will use the demand prediction model in later sections dedicated to price optimization and assortment planning.
Demand prediction can be considered a relatively straightforward data mining problem that boils down to building a regression model and evaluating it over historical data. However, the design of the regression model is not so straightforward because the demand is influenced by many factors with complex dependencies. In this section we study a regression model suggested and evaluated in [KOK07] for Albert Heijn, a supermarket chain in the Netherlands. This model is based on earlier marketing studies such as [BG92], and fashion retailers like RueLaLa [JH14] and Zara [CA12] who also reported usage of similar models in practice. However, it is important to understand that different optimization problems require different demand prediction models and it is hardly possible to build a universal demand model that incorporates a wide variety of factors that influence demand.
We start with the following model of the demand for a given product j:
where
All factors in the equation (3.1) can be estimated from the transactional data from the stores. The demand generally depends on date (day of the week, holidays etc.) and store (size, neighborhood demographics etc.), so we introduce subscripts and to denote date and store, respectively, and estimate demand as a function of these parameters. Alternatively, store properties such as size, location and average consumer’s income can be incorporated into the model as regressors. According to [KOK07], the number of store visitors can be modeled as follows:
where is the weather temperature, is the weather comfort index (humidity, cloudiness etc.), and are 0/1 dummy variables for a day of the week and public holidays, respectively, is the number of public holidays, and alphas are regression coefficients.
The purchase incidence is a binary variable (purchase/no purchase), so we can use a standard modeling approach – express the purchase probability as a sigmoid function and estimate its exponential parameter from the data:
The regression model for x will be
where is a dummy variable equal to 1 if product j is on promotion and 0 otherwise and is the total number of products in the category i.e. the regressor corresponds to the percentage of promoted products in the category.
Estimation of is a little bit more tricky. Consumer choice modeling is a fundamental econometric problem studied by a special economic discipline called choice modeling theory. The choice modeling theory justifies the notion that the multinomial logit model (MNL) is an efficient way to model the probability of choice among alternatives [FAD74, CN10]:
where i iterates over all products in a category and is a parameter variable. Similarly to the probability of the purchase incidence, we build a regression model for a parameter variable :
where the coefficients and are shared for all products, and are the product price and the average price in the category, respectively, and and are promotion dummy variables and the average promotion rate, as described above for the purchase incidence regression model.
Finally, the average number of units sold can be modeled as follows:
By substituting the models above to the root expression (3.1) one obtains a fully specified demand prediction model. This model can be adjusted to a retailer’s business use cases by adding more explanatory variables such as marketing events.
A retailer offers a category of products to customers. The goal is to assign an individual price for each customer in order to maximize the overall revenue. Alternatively, the problem can be restated as targeting discounts that change prices compared to the common baseline.
Price discrimination is widely used in retail and there are many explicit and implicit forms of it:
Although we have stated the problem in a way that suggests fine-grained individual prices, it is an extreme case and the more typical approach is to set prices for larger customer segments.
Price discrimination is one of the most fundamental problems in economic and marketing [SM11], so it makes sense to start with some background on classic economics. Broadly speaking, a retailer, as well as any other commercial enterprise, can be modeled using the following basic equation:
where is profit, is quantity sold, is unit price, is variable cost per unit which roughly corresponds to wholesale price in the case of a retailer, and is fixed costs like general management. Price and quantity in the right part of the equation (4.1) are interdependent because the demand typically decreases as the price increases and vice versa. The relationship between the price and quantity is often approximated by a linear function with a coefficient that is widely known as elasticity of demand:
In other words, the elasticity of demand is a ratio between the percentage change in quantity demanded and the percentage change in price. The equations (4.1) and (4.2) can be visualized as follows:
The demand curve is a line with a slope defined by the elasticity of demand and a retailer’s profit is the difference between the revenue and variable costs, numerically equal to where is the unit price set by a retailer. On the one hand, the profit tends to zero when the price approaches variable costs, high sales volumes notwithstanding. On the other hand, too high a price drives the sales volume down and, consequently, profit to its minimum. It basically means that the price is a subject of numerical optimization and a retailer can use statistical techniques to estimate the elasticity of demand and find the optimal price that maximizes the equation (4.1). This approach, known as economic price optimization, has limited practical applicability [SM11] because the model expressed by the equation (4.1) oversimplifies market behavior and discards important factors that impact price-quantity relationship in a competitive market. For instance, significant price drops are likely to trigger a symmetric response from competitors decreasing prices industry-wide, so all market players eventually find themselves in a situation with status quo sale volumes and shares but less profit.
Despite the limitations of economic price optimization, equations (4.1-4.2) shed some light on the nature of price discrimination. Any single price , does not matter how optimal it is, represents a tradeoff because some customers do not buy a product considering it too expensive, although they would be willing to buy it at a lower price in between and V positively contributing to the profit. Moreover, some customers will tolerate prices higher than although the sales volume they will generate is relatively small. In both cases, a retailer fails to capture additional profits that lie in the triangle in between the demand curve and variable costs line. Price discrimination is a natural way to overcome the limitations of a single regular price by segmenting customers according to their willingness to pay and offering different prices to different segments. Consider a particular case of this strategy where the regular price from the previous chart had been complemented by a higher premium price (note how the profit area increases comparing to the single price strategy):
This consideration leads to the challenging question of how a retailer can sell the same product to different customers at different prices. Broadly speaking, it requires setting fences between customers with different willingness to pay in such a way that customers with higher willingness will not be able to a pay a lower price intended for the lower segments. Retailers have a number of fencing mechanisms at their disposal including the following:
Although all these techniques have long been in use, the problem of building an integrated discount optimization model is very challenging and, to the best of our knowledge, all existing models are limited in one way or another. In the rest of this section we consider two price discrimination models that were designed and evaluated using data from US supermarkets, in particular Safeway’s subsidiary in Chicago.
Discrimination by quantity and location. The model developed in [KJ05] aims to jointly optimize quantity discounts based on package sizes and store-level price zones. This model is quite similar to the model we studied in the section dedicated to demand prediction, however it elaborates more on package size and discount parameters.
Let us consider the case of a retailer that operates multiple stores and sells a few substitutable brands of a product that comes in several sizes e.g. 2, 4 and 6 packs of Coca-Cola and Pepsi. The goal is to optimize prices for each size assuming that price per unit can vary depending on the package size and that price settings can vary across the stores as well. We start with the standard multinomial logit (MNL) model for demand prediction that we have discussed in a previous section:
which denotes the probability of purchase of product variant by customer at time in store and is the number of product variants (the total number of sizes across all brands plus no-purchase option). Time is measured in relatively large time intervals such as weeks. The parameter variable can be estimated using the following regression model:
where is package size, is price, is discount depth in dollars, incorporates competition effects like distance to the nearest competitive store, and incorporates environmental shifters like weather. Consequently, the regression parameters basically correspond to a customer’s bias to brand, preference for size, price sensitivity, responsiveness to discounts, impact of discounts on price sensitivity, bias to competitors, and sensitivity to shifting effects, respectively. It is argued in [KJ05] that the regression model for price sensitivity in the case of promotions should be more complex than just one regressor to capture the fact that promotions in the past can increase the current price sensitivity because the customer can stockpile the product. This aspect is modeled by decomposing the corresponding regressor into two subcomponents as follows:
where is the mean sensitivity and the second term represents the memory effect. The history depth W denotes the number of weeks in the past, denotes the regular price, and is the actual discounted price. In a similar way, it is argued that the sensitivity to a promotion depends on recent promotions:
where is the number of weeks elapsed since the previous promotion. This dependency on basically models the assumption that the longer the period between promotions, the higher the response to them.
The equation (4.3) allows one to predict sales volumes, so the price optimization problem then can be stated based on the equation (4.1), independently for each time period :
where denotes wholesale price and denotes predicted sales volume. The role of the optimization constraint here is to avoid sharp changes and skews in price that can trigger major changes in market competition or customer behavior. This particular constraint proposed in [KJ05] requires the share-weighted average price ( denotes the market share of product ) not to exceed the share-weighted average price before the optimization ( denotes original prices). The optimization problem (4.4) can be solved at a store level which will imply discrimination both by quantity and location or, alternatively, at a store chain level that will imply quantity discounts only.
Personalized discounts and coupons. Although the previous model allows for store-level prices which implies customer-level price discrimination, it is not designed for optimized discounts for individual customers. The second model we consider [JT13] was designed specifically to optimize personalized discounts or coupons. The main advantage of this model is that it optimizes not only the depth of the discount, but also tries to find the optimal time to offer a discount to a given user and its optimal duration. The idea of temporal properties optimization comes from the assumption that a customer’s probability to purchase is not uniform and varies over time, so there is an optimal discount time window for each user. The major shortcoming of this instead of a category, hence it can be used to optimize performance of a particular brand e.g. manufacturer promotions, but cannot be used for category management.
In order to model the temporal properties of a discount, we will decompose the probability of the purchase of product j by customer u at time t assuming discount depth d into a product of the product purchase probability and probability to make a purchase at time t:
The probability density function of the purchase of a given product can be estimated using MNL model we used in the equation (4.3). The probability density function of a purchase at time t is modeled in [JT13] in the form of an Erlang distribution:
where the parameter variable can be estimated by means of a regression model that, similarly to the model for the parameter variable x in the equation (4.3), includes the discount depth as a regressor, so it can later be a subject of optimization.
The probability of a purchase defined above enables us to model the sales volume for a given customer as a function of the discount depth in dollars d, discount start time t, and discount duration T:
This leads us to the following optimization problem for gross margin:
where m is the margin at the regular price. The first term in the equation above corresponds to the revenue that in turn consists of three components – revenue received before the promotion, during the promotion, and after the promotion – and the second term corresponds to promotional costs. The following chart shows this optimization problem:
The first plot on the top shows the probability density of purchase by customer u where the expected sales volume for a given product at the regular price corresponds to the area. A flat permanent discount will lift this volume by adding area , so the total revenue and promotional costs (shown in the middle plot) will both be proportional to . A time-optimized promotion will make the revenue proportional to and its costs will be proportional to (the plot in the bottom). This difference between the flat promotion and optimized promotion shows the potential to take advantage of temporal optimization in the case of certain quantitative properties of the probability density functions.
A retailer prepares a sales event – a limited-time discount on a particular product or group of products. The event planning requires the estimation the following dependent values:
We will consider a case where the stock level is predefined and a retailer is trying to calculate optimal prices. This problem statement is typical for fashion retailers who deal with seasonal clearance sales and collection renewals [JH14, CA12]. The problem can be stated in many different ways both to study demand forecasting and price optimization as separate problems and to optimize stock levels and prices simultaneously in order to achieve maximal revenue.
Sales event planning has several applications in retail:
Dynamic demand prediction and price optimization are fundamental problems studied in the economic discipline called revenue management. The theory of revenue management is well-developed and systematically described in books like [TA05]. The most advanced and efficient examples of revenue management automation are found in the service industries that deal with reservations – flight tickets, stadium seats, hotel rooms, rental cars etc. To understand how such techniques can be leveraged in the retail space, we will consider a methodology recently developed by RueLaLa, a fashion retailer [JH14].
Let us assume that a retailer plans to provide a discount on products or product groups where all products within a group have the same price (e.g. yogurts with different flavors or t-shirts of different colors). Let be the set of prices that includes all prices that can be assigned to each of the products. In practice, is often a relatively small set composed according to business rules. For instance, the lower bound on the price can be defined by a retailer’s profitability level as $29.90 per item, the maximum price can be determined from analysis of competitive offering as $49.90, and intermediate price values can have an increment of $5.00 from the psychological pricing considerations, giving P = {$29.90, $34.90, $39.90, $44.90, $49.90}.
It is assumed that all products or product groups in the sales event have something in common, e.g. belong to the same category like “Women’s Shoes” or “Christmas Eve Foods”, hence demand on one product is potentially dependent on the price of other products that can be used as substitutes. This assumption can be incorporated into the optimization model by introducing the variable equal to the sum of the prices of all competing products (product groups) that participate in the event and estimating expected demand on a given product as a mathematical expectation where is a random variable that represent the demand quantity, is the product index, and is the price of an individual product or product group. Since depends both on the product price and S, it implicitly incorporates the ratio between the price of the product and the average price of possible substitutions that influences the demand and its elasticity. Now we are all set to define the optimization problem under the assumption that S is fixed and then solve it for all possible S [JH14]:
The binary variable equals to 1 if product i has price and equals to 0 otherwise. The objective function in the optimization problem above naturally represents the revenue of the sales event. The first constraint ensures that any product is sold only once, and the second constraint ensures that all prices sum up to S. Additional constraints on stock levels can be included in the optimization problem as well.
The optimization problem above requires the estimation of the demand that can be done using techniques considered in the previous sections dedicated to demand prediction and price segmentation. However, one should pay close attention to the fact that stockouts are typical (and almost desirable) in the case of a sales event, so historical data for demand prediction modeling is chopped for many products. As suggested in [JH14], one can work around this issue by building multiple profiles for different categories of products using the data for items that had not run out of stock during the previous events and use these profiles to adjust the demand curves in the corresponding categories.
A retailer sells products by categories. A category represents a relatively cohesive set of products that have a lot in common (examples of categories are “desserts”, “women’s jeans” etc.), so it is generally possible that customers might be willing to substitute one product with another if the product of their choice is not available for some reason. The main reasons for product unavailability are permanent assortment reductions (e.g. because of limited shelf space) and temporary stockouts. The goal is to calculate a subset of products that meets physical constraints such as available shelf space and maximizes the gross margin by taking advantage of the substitution effect in the optimal way.
Category management is a relatively specialized task, but it deals with substitution effects that come up in other problems like promotion optimization when the goal is to optimize the overall performance of a product category as opposed to performance of a single product. Retailers are typically much more interested in the overall category performance rather than optimization of individual products, so the methods described in this section can be useful in many other applications to achieve truly optimal solutions. The model studied in this section can be directly applied to the following flavors of category management:
From an econometric perspective, the problem of category management arises from the law of diminishing returns or, more specifically, the fact that revenues and costs depend on the category size in different ways. The general tendency is that consumer buying capacity comes to saturation at some point meanwhile costs continue to grow because of increasing selling area and other operational costs:
This tendency leads to the category optimization problem. It is a very challenging problem because it requires the modeling of an entire category accounting for interdependencies between the products in it. However, despite these challenges, a practically feasible assortment optimization model has been developed in [KOK07] and applied at Albert Heijn, a supermarket chain in the Netherlands. To study their approach, let us first introduce the following notation:
Using the above notation, the assortment optimization problem can be specified as follows:
where is a function that describes the gross margin for a given product and corresponding observed demand. This function heavily depends on a retailer’s business model, so we can outline a few generic templates that can be customized for practical usage:
The equation (6.2) represents the simplest way to model gross profit by multiplying the observed demand by margin m. It implicitly assumes perfect replenishment and the absence of stockouts. This might be the case for fast moving consumer goods like groceries, but other retail domains such as apparel probably should take stockouts into account using equations like (6.3). Retailers of perishable goods should also take into account the losses due to disposed inventory that are modeled in the equation (6.4) by introducing a per-unit disposal loss . For the sake of brevity, we hereafter assume that all products are perfectly replenished, so stockouts are not possible or are negligible. It allows us to treat as a binary variable that indicates the presence of a product in the assortment. The more complex model with stockouts can be found in [KOK07].
To solve the optimization problem (6.1), one needs to define the observed demand function. In the case of the stockouts-free assumption we made above, the demand function can be modeled as follows:
where is the probability of substitution of product k by product j. The formula above is relatively straightforward: the first term is the original demand and the second term corresponds to the cumulative substitution effect from all products that are evicted from the assortment set.
The equation (6.5) requires the estimation of the substitution probabilities and original demand rates . In order to do this estimation, let us assume that the following variables are known (we already discussed demand prediction in one of the previous sections of this article):
The substitution rates are challenging to estimate because up to different rates can exist for the assortment of products. However, [KOK07] found that the following simplistic model of customer behavior is sufficiently accurate in practice and requires the estimation of just one variable instead of : if product k is not available, the customer either selects his or her second-choice product j as a substitution with the probability which is the same for all products in a category or no purchase takes place with the probability . This model leads to the following simple equation for the substitution rate:
In order to estimate , let us define the total demand at a given store as a sum of values that can be estimated from the data:
On the other hand, the same value can be estimated according to the expression (6.5) as follows:
Now can be estimated by solving the following optimization problem that minimizes the discrepancy between the observed and predicted values of the total demand:
The next step in solving the optimization problem (6.1) is to compute the original demand rates that are used in the equation (6.5). We first note that the total demand for all products in N at store h can be computed as follows:
where is the total number of customers visiting store h per day. In the equation (6.10), the sum of all multiplied by represents the total demand given a full assortment. However, the values are estimated for stores with a full assortment, so specifics of the given store h (e.g. location, store size in square feet, etc.) are not modeled. This is compensated for by scaling by the ratio of estimated category demand from equation (6.7) to the predicted demand from equation (6.8).
In a store with a restricted assortment, the total demand is the sum of two components: the demand that comes from the products included in the assortment of a given store and the demand for other products in N. The ratio between these two components can be expressed via as follows:
Consequently, represents the fraction of the demand attributed to the products in the assortment, and represents the remaining fraction attributed to the products that are not in the assortment. Finally, we compute the demand for a single product as a fraction of the total demand proportional to the estimated per-product demand:
All coefficients in equations (6.12) and (6.9) can be estimated from the data, so we can roll up all formulas to the original optimization problem (6.1) that can be solved using numerical methods proposed in [KOK07].
Equation (6.1) will produce a set of presumably optimal stock levels for all products. These levels can be used to adjust inventory and optimize shelf layout. It is important to note that the model enables a retailer to perform what-if analysis to evaluate how changes in assortment and stock levels might impact the gross margin. In particular, a retailer can plot curves that show expected gross margin as a function of stock levels for a given product or a group of products. Such curves are especially descriptive for perishable products because gross margin is a convex function that is zero when the stock level is zero and also zero when the stock level is too high causing losses from expired products, with a maximum in between these two extremes.
Our overview of optimization methods and the corresponding data problems would be incomplete without data about the financial performance of the discussed methods. Although this data is available, it should be considered with caution because of the specific dependency on a retailer’s business model, and the fact that we cannot isolate the impact of the optimization from other environmental factors such as market growth or competitors’ moves. Besides that, the numbers can vary greatly depending on many factors, so our goal here is just to provide a few benchmarks that give some sense of the magnitude of potential improvements. The following list gathers several facts that give some idea about financial impact of the described methods:
Finally, it is worth noting that most of the described optimization methods do not significantly impact retailer’s costs, so the increase in revenues is likely to contribute directly to net profits.
In the previous sections we overviewed a number of econometric problems relevant to retail, described their applications and use cases, and outlined data analysis tasks and optimization models that can be used. In this final section we will connect the dots between the discussed models and provide general conclusions to capture the whole picture.
The major goal of this article is to sketch a decision automation framework that completely relies on data mining and numerical optimization under the hood. Hence, it is reasonable to visualize this framework as a pipeline that consumes data and produces executable actions and decisions. Reviewing the solutions studied in the previous sections, we can also conclude that this pipeline has several internal stages or tiers.
First, we can put data exploration and knowledge discovery processes into a separate tier that uses mainly unsupervised learning algorithms and significantly relies on a human factor to evaluate data mining results such as customer clusters or frequently purchased item sets. Although these processes are highly important in practice, their ability to integrate with automated optimization is limited because discovered patterns typically require manual post processing and are more useful in strategic decisions rather incremental optimizations. Outputs of this tier can be used to configure downstream processes. For instance, a newly discovered customer cluster can be used to define a new propensity model or introduce and optimize a special discount.
The next two tiers relate to modeling and optimization, respectively. Broadly speaking, the fundamental goal of the modeling tier is to provide a comprehensive model of a consumer that quantitatively describes his or her price sensitivity, propensity to respond to offers and discounts, willingness to substitute one product with another, and to accept a recommendation, etc. It is extremely hard to build such a model in practice, so multiple specialized models for different applications are used instead. However, it is critical to note that this imaginary consumer model underlays all types of optimization, hence acquiring comprehensive data about all aspects of customer behavior is crucial. The main challenge of the optimization tier is a joint optimization of multiple objectives. Joint optimization represents a serious computational challenge and, most importantly, is constrained by the capabilities of the underlying predictive model, so almost all optimization techniques deal with not more than one or two objectives.
We put together these tiers in the figure below. There are many possible dependencies and interactions between the components, so we show just one sample flow related to response modeling to prevent cluttering the diagram.
Among this diversity of problems and goals, we should emphasize the importance of pricing decisions and all the optimizations directly or indirectly related to pricing. Let us consider a classic example that illustrates the importance of pricing decisions. Recall the basic equation for enterprise profit:
where Q is quantity sold, P is price, V denotes variable costs, and C denotes fixed costs. Consider an imaginary apparel retailer that sells 100,000 garments monthly at $40 per item, assuming a wholesale price of $25 per item and fixed costs of $500,000 per month. Let us calculate how a one percent change in sales volume, price, variable and fixed costs will impact profit:
In this example, one can see that pricing impacts the profits much more seriously than any other variable. Although it is an oversimplified and arbitrary example, this pattern prevails in a huge variety of enterprises across many industries. This leads us to the conclusion that retailers should pay special attention to the optimization methods related to pricing (discounts, personalized prices, dynamic pricing etc.) and the supporting data mining processes.
We also note that omni-channel retail can deliver new opportunities for automated price optimization. Since price discrimination is one of the most powerful pricing techniques, the ideal environment for price optimization is an environment where each customer is provided with a personalized price, explicitly or implicitly by means of discounts, and all these prices can be adjusted dynamically. Digital channels provide exactly these conditions where each customer has his or her own isolated and dynamic view of a retailer.
As we already mentioned, most optimization problems in retail are internally dependent on customer behavior models. The ability to build such models at the level of individual customers is one of the most important benefits of data mining and a key enabler of one-to-one marketing. The most sophisticated examples of customer modeling can be found in recommender systems that often use the concept of implicit dimensions to capture psychographic features of customers and products. This concept is so fundamental that it probably goes far beyond recommender systems, although, to the best of our knowledge, it is not so widely used in other applications as one might expect. It leads us to the conclusion that integrated optimization systems can benefit by adopting state-of-the-art techniques from the well-developed recommendations domain in less common applications.
The problem of completely automated decision making in the retail environment is extremely ambitious. It can even be argued that it is almost impossible to measure the performance of optimization methods in practice because the observed improvements can coincide with market trends, competitors’ actions, changes in customer tastes and myriad other factors. This problem, referenced as the endogeneity problem in economic texts, represents a huge challenge for developers and adopters of data-driven optimization techniques and compromises even seemingly successful attempts. However, during the last decade major retailers have been looking for integrated solutions that combine data mining with numerical optimization. Such advanced systems are naturally the next step in the evolution of enterprise data management that follows such a wide appreciation of data warehousing and adoption of data science.
At Grid Dynamics, we recently faced a necessity to build an in-stream data processing system that aimed to crunch about 8 billion events daily providing fault-tolerance and strict transactioanlity i.e. none of these events can be lost or duplicated. This system has been designed to supplement and succeed the existing Hadoop-based system that had too high latency of data processing and too high maintenance costs. The requirements and the system itself were so generic and typical that we describe it below as a canonical model, just like an abstract problem statement.
A high-level overview of the environment we worked with is shown in the figure below:
One can see that this environment is a typical Big Data installation: there is a set of applications that produce the raw data in multiple datacenters, the data is shipped by means of Data Collection subsystem to HDFS located in the central facility, then the raw data is aggregated and analyzed using the standard Hadoop stack (MapReduce, Pig, Hive) and the aggregated results are stored in HDFS and NoSQL, imported to the OLAP database and accessed by custom user applications. Our goal was to equip all facilities with a new in-stream engine (shown in the bottom of the figure) that processes most intensive data flows and ships the pre-aggregated data to the central facility, thus decreasing the amount of raw data and heavy batch jobs in Hadoop. The design of the in-stream processing engine itself was driven by the following requirements:
To find out how such a system can be implemented, we discuss the following topics in the rest of the article:
The article is based on a research project developed at Grid Dynamics Labs. Much of the credit goes to Alexey Kharlamov and Rafael Bagmanov who led the project and other contributors: Dmitry Suslov, Konstantine Golikov, Evelina Stepanova, Anatoly Vinogradov, Roman Belous, and Varvara Strizhkova.
It is clear that distributed in-stream data processing has something to do with query processing in distributed relational databases. Many standard query processing techniques can be employed by in-stream processing engine, so it is extremely useful to understand classical algorithms of distributed query processing and see how it all relates to in-stream processing and other popular paradigms like MapReduce.
Distributed query processing is a very large area of knowledge that was under development for decades, so we start with a brief overview of the main techniques just to provide a context for further discussion.
Distributed and parallel query processing heavily relies on data partitioning to break down a large data set into multiple pieces that can be processed by independent processors. Query processing could consist of multiple steps and each step could require its own partitioning strategy, so data shuffling is an operation frequently performed by distributed databases.
Although optimal partitioning for selection and projection operations can be tricky (e.g. for range queries), we can assume that for in-stream data filtering it is practically enough to distribute data among the processors using a hash-based partitioning.
Processing of distributed joins is not so easy and requires a more thorough examination. In distributed environments, parallelism of join processing is achieved through data partitioning, i.e. the data is distributed among processors and each processor employs a serial join algorithm (e.g. nested-loop join or sort-merge join or hash-based join) to process its part of the data. The final results are consolidated from the results obtained from different processors.
There are two main data partitioning techniques that can be employed by distributed join processing:
Disjoint data partitioning technique shuffles the data into several partitions in such a way that join keys in different partitions do not overlap. Each processor performs the join operation on each of these partitions and the final result is obtained as a simple concatenation of the results obtained from different processors. Consider an example where relation R is joined with relation S on a numerical key k and a simple modulo-based hash function is used to produce the partitions (it is assumes that the data initially distributed among the processors based on some other policy):
The divide and broadcast join algorithm is illustrated in the figure below. This method divides the first data set into multiple disjoint partitions (R1, R2, and R3 in the figure) and replicates the second data set to all processors. In a distributed database, division typically is not a part of the query processing itself because data sets are initially distributed among multiple nodes.
This strategy is applicable for joining of a large relation with a small relation or two small relations. In-stream data processing systems can employ this technique for stream enrichment i.e. joining a static data (admixture) to a data stream.
Processing of GroupBy queries also relies on shuffling and fundamentally similar to the MapReduce paradigm in its pure form. Consider an example where the data is grouped by a string key and sum of the numerical values is computed in each group:
In this example, computation consists of two steps: local aggregation and global aggregation. These steps basically correspond to Map and Reduce operations. Local aggregation is optional and raw records can be emitted, shuffled, and aggregated on a global aggregation phase.
The whole point of this section is that all the algorithms above can be naturally implemented using a message passing architectural style i.e. the query execution engine can be considered as a distributed network of nodes connected by the messaging queues. It is conceptually similar to the in-stream processing pipelines.
In the previous section, we noted that many distributed query processing algorithms resemble message passing networks. However, it is not enough to organize efficient in-stream processing: all operators in a query should be chained in such a way that the data flows smoothly through the entire pipeline i.e. neither operation should block processing by waiting for a large piece of input data without producing any output or by writing intermediate results on disk. Some operations like sorting are inherently incompatible with this concept (obviously, a sorting block cannot produce any output until the entire input is ingested), but in many cases pipelining algorithms are applicable. A typical example of pipelining is shown below:
In this example, the hash join algorithm is employed to join four relations: R1, S1, S2, and S3 using 3 processors. The idea is to build hash tables for S1, S2 and S3 in parallel and then stream R1 tuples one by one though the pipeline that joins them with S1, S2 and S3 by looking up matches in the hash tables. In-stream processing naturally employs this technique to join a data stream with the static data (admixtures).
In relational databases, join operation can take advantage of pipelining by using the symmetric hash join algorithm or some of its advanced variants [1,2]. Symmetric hash join is a generalization of hash join. Whereas a normal hash join requires at least one of its inputs to be completely available to produce first results (the input is needed to build a hash table), symmetric hash join is able to produce first results immediately. In contrast to the normal hash join, it maintains hash tables for both inputs and populates these tables as tuples arrive:
As a tuple comes in, the joiner first looks it up in the hash table of the other stream. If match is found, an output tuple is produced. Then the tuple is inserted in its own hash table.
However, it does not make a lot of sense to perform a complete join of infinite streams. In many cases join is performed on a finite time window or other type of buffer e.g. LFU cache that contains most frequent tuples in the stream. Symmetric hash join can be employed if the buffer is large comparing to the stream rate or buffer is flushed frequently according to some application logic or buffer eviction strategy is not predictable. In other cases, simple hash join is often sufficient since the buffer is constantly full and does not block the processing:
It is worth noting that in-stream processing often deals with sophisticated stream correlation algorithms where records are matched based on scoring metrics, not on field equality condition. A more complex system of buffers can be required for both streams in such cases.
In the previous section, we discussed a number of standard query processing techniques that can be used in massively parallel stream processing. Thus, on a conceptual level, an efficient query engine in a distributed database can act as a stream processing system and vice versa, a stream processing system can act as a distributed database query engine. Shuffling and pipelining are the key techniques of distributed query processing and message passing networks can naturally implement them. However, things are not so simple. In a contrast to database query engines where reliability is not critical because a read-only query can always be restarted, streaming systems should pay a lot of attention to reliable events processing. In this section, we discuss a number of techniques that are used by streaming systems to provide message delivery guarantees and some other patterns that are not typical for standard query processing.
Ability to rewind data stream back in time and replay the data is very important for in-stream processing systems because of the following reasons:
As a result, the input data typically goes from the data source to the in-stream pipeline via a persistent buffer that allows clients to move their reading pointers back and forth.
Kafka messaging queue is well known implementation of such a buffer that also supports scalable distributed deployments, fault-tolerance, and provides high performance.
As a bottom line, Stream Replay technique imposes the following requirements of the system design:
In a streaming system, events flow through a chain of processors until the result reaches the final destination (like an external database). Each input event produces a directed graph of descendant events (lineage) that ends by the final results. To guarantee reliable data processing, it is necessary to ensure that the entire graph was processed successfully and to restart processing in case of failures.
Efficient lineage tracking is not a trivial problem. Let us first consider how Twitter’s Storm tracks the messages to guarantee at-least-once delivery semantics (see the diagram below):
The described approach is elegant due to its decentrilized nature: nodes act independently sending acknowledgement messages, there is no cental entity that tracks all lineages explicitly. However, it could be difficult to manage transactional processing in this way for flows that maintain sliding windows or other buffers. For example, processing on a sliding window can involve hundreds of thousands events at each moment of time, so it becomes difficult to manage acknowledgements because many events stay uncommitted or computational state should be persisted frequently.
An alternative approach is used in Apache Spark [3]. The idea is to consider the final result as a function of the incoming data. To simplify lineage tracking, the framework processes events in batches, so the result is a sequence of batches where each batch is a function of the input batches. Resulting batches can be computed in parallel and if some computation fails, the framework simply reruns it. Consider an example:
In this example, the framework joins two streams on a sliding window and then the result passes through one more processing stage. The framework considers the incoming streams not as streams, but as set of batches. Each batch has an ID and the framework can fetch it by the ID at any moment of time. So, stream processing can be represented as a bunch of transactions where each transaction takes a group of input batches, transforms them using a processing function, and persists a result. In the figure above, one of such transactions is highlighted in red. If the transaction fails, the framework simply reruns it. It is important that transactions can be executed in parallel.
This simple but powerful paradigm enables centralized transaction management and inherently provides exactly-once message processing semantics. It is worth noting that this technique can be used both for batch processing and for stream processing because it treats the input data as a set of batches regardless to their streaming of static nature.
In the previous section we have considered the lineage tracking algorithm that uses signatures (checksums) to provide at-least-one message delivery semantics. This technique improves reliability of the system, but it leaves at least two major open questions:
Twitter’s Storm addresses these issues by using the following protocol:
This technique allows one to achieve exactly-once processing semantics assuming that data sources are fault-tolerant and can be replayed. However, persistent state updates can cause serious performance degradation even if large batches are used. By this reason, the intermediate computational state should be minimized or avoided whenever possible.
As a footnote, it is worth mentioning that state writing can be implemented in different ways. The most straightforward approach is to dump in-memory state to the persistent store as part of the transaction commit process. This does not work well for large states (sliding windows an so on). An alternative is to write a kind of transaction log i.e. a sequence of operations that transform the old state into the new one (for a sliding window it can be a set of added and evicted events). This approach complicates crash recovery because the state has to be reconstructed from the log, but can provide performance benefits in a variety of cases.
Additivity of intermediate and final computational results is an important property that drastically simplifies design, implementation, maintenance, and recovery of in-stream data processing systems. Additivity means that the computational result for a larger time range or a larger data partition can be calculated as a combination of results for smaller time ranges or smaller partitions. For example, a daily number of page views can be calculated as a sum of hourly numbers of page views. Additive state allows one to split processing of a stream into processing of batches that can be computed and re-computed independently and, as we discussed in the previous sections, this helps to simplify lineage tracking and reduce complexity of state maintenance.
It is not always trivial to achieve additivity:
Sketches is a very efficient way to transform non-additive values into additive. In the previous example, lists of ID can be replaced by compact additive statistical counters. These counters provide approximations instead of precise result, but it is acceptable for many practical applications. Sketches are very popular in certain areas like internet advertising and can be considered as an independent pattern of in-stream processing. A thorough overview of the sketching techniques can be found in [5].
It is very common for in-stream computations to depend on time: aggregations and joins are often performed on sliding time windows; processing logic often depends on a time interval between events and so on. Obviously, the in-stream processing system should have a notion of application’s view of time, instead of CPU wall-clock. However, proper time tracking is not trivial because data streams and particular events can be replayed in case of failures. It is often a good idea to have a notion of global logical time that can be implemented as follows:
We already have discussed that persistent store can be used for state checkpointing. However, it not the only way to employ an external store for in-stream processing. Let us consider an example that employs Cassandra to join multiple data streams over a time window. Instead of maintaining in-memory event buffers, one can simply save all incoming events from all data streams to Casandra using a join key as row key, as it shown in the figure below:
On the other side, the second process traverses the records periodically, assembles and emits joined events, and evicts the events that fell out of the time window. Cassandra even can facilitate this activity by sorting events according to their timestamps.
It is important to understand that such techniques can defeat the whole purpose of in-stream data processing if implemented incorrectly – writing individual events to the data store can introduce a serious performance bottleneck even for fast stores like Cassandra or Redis. On the other hand, this approach provides perfect persistence of the computational state and different performance optimizations – say, batch writes – can help to achieve acceptable performance in many use cases.
In-stream data processing frequently deals with queries like “What is the sum of the values in the stream over last 10 minutes?” i.e. with continuous queries on a sliding time window. A straightforward approach to processing of such queries is to compute the aggregation function like sum for each instance of the time window independently. It is clear that this approach is not optimal because of the high similarity between two sequential instances of the time window. If the window at the time T contains samples {s(0), s(1), s(2), …, s(T-1), s(T)}, then the window at the time T+1 contains samples {s(1), s(2), s(3), …, s(T), s(T+1)}. This observation suggests that incremental processing might be used.
Incremental computations over sliding windows is a group of techniques that are widely used in digital signal processing, in both software and hardware. A typical example is a computation of the sum function. If the sum over the current time window is known, then the sum over the next time window can be computed by adding a new sample and subtracting the eldest sample in the window:
Similar techniques exist not only for simple aggregations like sums or products, but also for more complex transformations. For example, the SDFT (Sliding Discreet Fourier Transform) algorithm [4] is a computationally efficient alternative to per-window calculation of the FFT (Fast Fourier Transform) algorithm.
Now let us return to the practical problem that was stated in the beginning of this article. We have designed and implemented our in-stream data processing system on top of Storm, Kafka, and Cassandra adopting the techniques described earlier in this article. Here we provide just a very brief overview of the solution – a detailed description of all implementation pitfalls and tricks is too large and probably requires a separate article.
The system naturally uses Kafka 0.8 as a partitioned fault-tolerant event buffer to enable stream replay and improve system extensibility by easy addition of new event producers and consumers. Kafka’s ability to rewind read pointers also enables random access to the incoming batches and, consequently, Spark-style lineage tracking. It is also possible to point the system input to HDFS to process the historical data.
Cassandra is employed for state checkpointing and in-store aggregation, as described earlier. In many use cases, it also stores the final results.
Twitter’s Storm is a backbone of the system. All active query processing is performed in Storm’s topologies that interact with Kafka and Cassandra. Some data flows are simple and straightforward: the data arrives to Kafka; Storm reads and processes it and persist the results to Cassandra or other destination. Other flows are more sophisticated: one Storm topology can pass the data to another topology via Kafka or Cassandra. Two examples of such flows are shown in the figure above (red and blue curved arrows).
It is great that the existing technologies like Hive, Storm, and Impala enable us to crunch Big Data using both batch processing for complex analytics and machine learning, and real-time query processing for online analytics, and in-stream processing for continuous querying. Moreover, techniques like Lambda Architecture [6, 7] were developed and adopted to combine these solutions efficiently. This brings us to the question of how all these technologies and approaches could converge to a solid solution in the future. In this section, we discuss the striking similarity between distributed relational query processing, batch processing, and in-stream query processing to figure out the technologies that could cover all these use cases and, consequently, have the highest potential in this area.
The key observation is that relational query processing, MapReduce, and in-stream processing could be implemented using exactly the same concepts and techniques like shuffling and pipelining. At the same time:
The two statement above imply that tunable persistence (in-memory message passing versus on-disk materialization) and reliability are the distinctive features of the imaginary query engine that provides a set of processing primitives and interfaces to the high-level frameworks:
Among the emerging technologies, the following two are especially notable in the context of this discussion:
In the rest of this article we study a number of distributed activities like replication of failure detection that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:
It is well known and fairly obvious that in geographically distributed systems or other environments with probable network partitions or delays it is not generally possible to maintain high availability without sacrificing consistency because isolated parts of the database have to operate independently in case of network partition. This fact is often referred to as the CAP theorem. However, consistency is a very expensive thing in distributed systems, so it can be traded not only to availability. It is often involved into multiple tradeoffs. To study these tradeoffs, we first note that consistency issues in distributed systems are induced by the replication and the spatial separation of coupled data, so we have to start with goals and desired properties of the replication:
Now let’s take a closer look at commonly used replication techniques and classify them in accordance with the described properties. The first figure below depicts logical relationships between different techniques and their coordinates in the system of the consistency-scalability-availability-latency tradeoffs. The second figure illustrates each technique in detail.
Let’s go through all these techniques moving from weak to strong consistency guarantees:
It is worth noting that the analysis above highlights a number of tradeoffs:
Let us start our study with the following problem statement:
There is a set of nodes and each data item is replicated to a subset of nodes. Each node serves update requests even if there is no network connection to other nodes. Each node periodically synchronizes its state with other nodes is such a way that if no updates take place for a long time, all replicas will gradually become consistent. How this synchronization should be organized – when synchronization is triggered, how a peer to synchronize with is chosen, what is the data exchange protocol? Let us assume that two nodes can always merge their versions of data selecting a newest version or preserving both versions for further application-side resolution.
This problem appears both in data consistency maintenance and in synchronization of a cluster state (propagation of the cluster membership information and so on). Although the problem above can be solved by means of a global coordinator that monitors a database and builds a global synchronization plan or schedule, decentralized databases take advantage of more fault-tolerant approach. The main idea is to use well-studied epidemic protocols [7] that are relatively simple, provide a pretty good convergence time, and can tolerate almost any failures or network partitions. Although there are different classes of epidemic algorithms, we focus on anti-entropy protocols because of their intensive usage in NoSQL databases.
Anti-entropy protocols assume that synchronization is performed by a fixed schedule – every node regularly chooses another node at random or by some rule and exchanges database contents, resolving differences. There are three flavors of anti-entropy protocols: push, pull, and push-pull. The idea of the push protocol is to simply select a random peer and push a current state of data to it. In practice, it is quite silly to push the entire database, so nodes typically work in accordance with the protocol which is depicted in the figure below.
Node A which is initiator of synchronization prepares a digest (a set of checksums) which is a fingerprint of its data. Node B receives this digest, determines the difference between the digest and its local data and sends a digest of the difference back to A. Finally, A sends an update to B and B updates itself. Pull and push-pull protocols work similarly, as it shown in the figure above.
Anti-entropy protocols provide reasonable good convergence time and scalability. The following figure shows simulation results for propagation of an update in the cluster of 100 nodes. On each iteration, each node contacts one randomly selected peer.
One can see that the pull style provides better convergence than the push, and this can be proven theoretically [7]. Also, push has a problem with a “convergence tail” when a small percent of nodes remains unaffected during many iterations, although almost all nodes are already touched. The Push-Pull approach greatly improves efficiency in comparison with the original push or pulls techniques, so it is typically used in practice. Anti-entropy is scalable because the average conversion time grows as a logarithmic function of the cluster size.
Although these techniques look pretty simple, there are many studies [5] regarding performance of anti-entropy protocols under different constraints. One can leverage knowledge of the network topology to replace a random peer selection by a more efficient schema [10]; adjust transmit rates or use advanced rules to select data to be synchronized if the network bandwidth is limited [9]. Computation of digest can also be challenging, so a database can maintain a journal of the recent updates to facilitate digests computing.
In the previous section we assumed that two nodes can always merge their versions of data. However, reconciliation of conflicting updates is not a trivial task and it is surprisingly difficult to make all replicas to converge to a semantically correct value. A well-known example is that deleted items can resurface in the Amazon Dynamo database [8].
Let us consider a simple example that illustrates the problem: a database maintains a logically global counter and each database node can serve increment/decrement operations. Although each node can maintain its own local counter as a single scalar value, but these local counters cannot be merged by simple addition/subtraction. Consider an example: there are 3 nodes A, B, and C and increment operation was applied 3 times, once per node. If A pulls value from B and adds it to the local copy, C pulls from B, C pulls from A, then C ends up with value 4 which is incorrect. One possible way to overcome these issues is to use a data structure similar to vector clock [19] and maintain a pair of counters for each node [1]:
class Counter { int[] plus int[] minus int NODE_ID increment() { plus[NODE_ID]++ } decrement() { minus[NODE_ID]++ } get() { return sum(plus) – sum(minus) } merge(Counter other) { for i in 1..MAX_ID { plus[i] = max(plus[i], other.plus[i]) minus[i] = max(minus[i], other.minus[i]) } } }
Cassandra uses a very similar approach to provide counters as a part of its functionality [11]. It is possible to design more complex eventually consistent data structures that can leverage either state-based or operation-based replication principles. For example, [1] contains a catalog of such structures that includes:
However, eventually consistent data types are often limited in functionality and impose performance overheads.
This section is dedicated to algorithms that control data placement inside a distributed database. These algorithms are responsible for mapping between data items and physical nodes, migration of data from one node to another and global allocation of resources like RAM throughout the database.
Let us start with a simple protocol that is aimed to provide outage-free data migration between cluster nodes. This task arises in situations like cluster expansion (new nodes are added), failover (some node goes done), or rebalancing (data became unevenly distributed across the nodes). Consider a situation that is depicted in the section (A) of the figure below – there are three nodes and each node contains a portion of data (we assume a key-value data model without loss of generality) that is distributed across the nodes according to an arbitrary data placement policy:
If one does not have a database that supports data rebalancing internally, he probably will deploy several instances of the database to each node as it is shown in the section (B) of the figure above. This allows one to perform a manual cluster expansion by turning a separate instance off, copying it to a new node, and turning it on, as it is shown in the section (C). Although an automatic database is able to track each record separately, many systems including MongoDB, Oracle Coherence, and upcoming Redis Cluster use the described technique internally, i.e. group records into shards which are minimal units of migration for sake of efficiency. It is quite obvious that a number of shards should be quite large in comparison with the number of nodes to provide the even load distribution. An outage-free shard migration can be done according to the simple protocol that redirects client from the exporting to the importing node during a migration of the shard. The following figure depicts a state machine for get(key) logic as it going to be implemented in Redis Cluster:
It is assumed that each node knows a topology of the cluster and is able to map any key to a shard and a shard to a cluster node. If the node determines that the requested key belongs to a local shard, then it looks it up locally (the upper square in the picture above). If the node determines that the requested key belongs to another node X, than it sends a permanent redirection command to the client (the lower square in the figure above). Permanent redirection means that the client is able to cache the mapping between the shard and the node. If the shard migration is in progress, the exporting and the importing nodes mark this shard accordingly and start to move its records locking each record separately. The exporting node first looks up the key locally and, if not found, redirects the client to the importing node assuming that key is already migrated. This redirect is a one-time and should not be cached. The importing node processes redirects locally, but regular queries are permanently redirected until migration is not completed.
The next question we have to address is how to map records to physical nodes. A straightforward approach is to have a table of key ranges where each range is assigned to a node or to use procedures like NodeID = hash(key) % TotalNodes. However, modulus-based hashing does not explicitly address cluster reconfiguration because addition or removal of nodes causes complete data reshuffling throughout the cluster. As a result, it is difficult to handle replication and failover.
There are different ways to enhance the basic approach from the replication and failover perspectives. The most famous technique is a consistent hashing. There are many descriptions of the consistent hashing technique in the web, so I provide a basic description just for sake of completeness. The following figure depicts the basic ideas of consistent hashing:
Consistent hashing is basically a mapping schema for key-value store – it maps keys (hashed keys are typically used) to physical nodes. A space of hashed keys is an ordered space of binary strings of a fixed length, so it is quite obvious that each range of keys is assigned to some node as it depicted in the figure (A) for 3 nodes, namely, A, B, and C. To cope with replication, it is convenient to close a key space into a ring and traverse it clockwise until all replicas are mapped, as it shown in the figure (B). In other words, item Y should be placed on node B because its key corresponds to B’s range, first replica should be placed on C, second replica on A and so on.
The benefit of this schema is efficient addition and removal of a node because it causes data rebalancing only in neighbor sectors. As it shown in the figures (C), addition of the node D affects only item X but not Y. Similarly, removal (or failure) of the node B affects Y and the replica of X, but not X itself. However, as it was pointed in [8], the dark side of this benefit is vulnerability to overloads – all the burden of rebalancing is handled by neighbors only and makes them to replicate high volumes of data. This problem can be alleviated by mapping each node not to a one range, but to a set of ranges, as it shown in the figure (D). This is a tradeoff – it avoids skew in loads during rebalancing, but keeps the total rebalancing effort reasonably low in comparison with module-based mapping.
Maintenance of a complete and coherent vision of a hashing ring may be problematic in very large deployments. Although it is not a typical problem for databases because of relatively small clusters, it is interesting to study how data placement was combined with the network routing in peer-to-peer networks. A good example is the Chord algorithm [2] that trades completeness of the ring vision by a single node to efficiency of the query routing. The Chord algorithm is similar to consistent hashing in the sense that it uses a concept of a ring to map keys to nodes. However, a particular node maintains only a short list of peers with exponentially growing offset on the logical ring (see the picture below). This allows one to locate a key in several network hops using a kind of binary search:
This figure depicts a cluster of 16 nodes and illustrates how node A looks up a key that is physically located on node D. Part (A) depicts the route and part (B) depicts partial visions of the ring for nodes A, B, and C. More information about data replication in decentralized systems can be found in [15].
Although consistent hashing offers an efficient data placement strategy when data items are accessed by a primary key, things become much more complex when querying by multiple attributes is required. A straightforward approach (that is used, for example, in MongoDB) is to distribute data by a primary key regardless to other attributes. As a result, queries that restrict the primary key can be routed to a limited number of nodes, but other queries have to be processed by all nodes in the cluster. This skew in query efficiency leads us to the following problem statement:
There is a set of data items and each item has a set of attributes along with their values. Is there a data placement strategy that limits a number of nodes that should be contacted to process a query that restricts an arbitrary subset of the attributes?
One possible solution was implemented in the HyperDex database. The basic idea is to treat each attribute as an axis in a multidimensional space and map blocks in the space to physical nodes. A query corresponds to a hyperplane that intersects a subset of blocks in the space, so only this subset of blocks should be touched during the query processing. Consider the following example from [6]:
Each data item is a user account that is attributed by First Name, Last Name, and Phone Number. These attributes are treated as a three-dimensional space and one possible data placement strategy is to map each octant to a dedicated physical node. Queries like “First Name = John” correspond to a plane that intersects 4 octants, hence only 4 nodes should be involved into processing. Queries that restrict two attributes correspond to a line that intersects two octants as it shown in the figure above, hence only 2 nodes should be involved into processing.
The problem with this approach is that dimensionality of the space grows as an exponential function of the attributes count. As a result, queries that restrict only a few attributes tend to involve many blocks and, consequently, involve many servers. One can alleviate this by splitting one data item with multiple attributes into multiple sub-items and mapping them to the several independent subspaces instead of one large hyperspace:
This provides more narrowed query-to-nodes mapping, but complicates coordination because one data item becomes scattered across several independent subspaces with their own physical locations and transactional updates become required. More information about this technique and implementation details can be found in [6].
Some applications with heavy random reads can require all data to fit RAM. In these cases, sharding with independent master-slave replication of each replica (like in MongoDB) typically requires at least double amount of RAM because each chunk of data is stored both on a master and on a slave. A slave should have the same amount of RAM as a master in order to replace the master in case of failure. However, shards can be placed in such a way that amount of required RAM can be reduced, assuming that the system tolerates short-time outages or performance degradation in case of failures.
The following figure depicts 4 nodes that host 16 shards, primary copies are stored in RAM and replicas are stored on disk:
The gray arrows highlight replication of shards from node #2. Shards from the other nodes are replicated symmetrically. The red arrows depict how the passivated replicas will be loaded into RAM in case of failure of node #2. Even distribution of replicas throughout the cluster allows one to have only a small memory reserve that will be used to activate replicas in case of failure. In the figure above, the cluster is able to survive a single node failure having only 1/3 of RAM in reserve. It is worth noting that replica activation (loading from disk to RAM) takes some time and cause temporally performance degradation or outage of the corresponding data during failure recovery.
In this section we discuss a couple of techniques that relates to system coordination. Distributed coordination is an extremely large area that was a subject of intensive study during several decades. In this article, we, of course, consider only a couple of applied techniques. A comprehensive description of distributed locking, consensus protocols and other fundamental primitives can be found in numerous books or web resources [17, 18, 21].
Failure detection is a fundamental component of any fault tolerant distributed system. Practically all failure detection protocols are based on a heartbeat messages which are a pretty simple concept – monitored components periodically send a heartbeat message to the monitoring process (or the monitoring process polls monitored components) and absence of heartbeat messages for a long time is interpreted as a failure. However, real distributed systems impose a number of additional requirements that should be addressed:
A possible way to address the first two requirements is so-called Phi Accrual Failure Detector [12] that is used with some modifications in Cassandra [16]. The basic workflow is as follows (see the figure below):
The scalability requirement can be addressed in significant degree by hierarchically organized monitoring zones that prevent flooding of the network with heartbeat messages [14] and synchronization of different zones via gossip protocol or central fault-tolerant repository. This approach is illustrated below (there are two zones and all six failure detectors talk to each other via gossip protocol or robust repository like ZooKeeper):
Coordinator election is an important technique for databases with strict consistency guarantees. First, it allows one to organize failover of a master node in master-slave systems. Second, it allows one to prevent write-write conflicts in case of network partition by terminating partitions that do not include a majority of nodes.
Bully algorithm is a relatively simple approach to coordinator election. MongoDB uses a version of this algorithm to elect leaders in replica sets. The main idea of the bully algorithm is that each member of the cluster can declare itself as a coordinator and announce this claim to other nodes. Other nodes can either accept this claim or reject it by entering the competition for being a coordinator. Node that does not face any further contention becomes a coordinator. Nodes use some attribute to decide who wins and who loses. This attribute can be a static ID or some recency metric like the last transaction ID (the most up-to-date node wins).
An example of the bully algorithm execution is shown in the figure below. Static ID is used as a comparison metric, a node with a greater ID wins.
Coordinator election process can count a number of nodes that participate in it and check that at least a half of cluster nodes are attend. This guarantees that only one partition can elect a coordinator in case of network partition.
Basically, the problem boils down to the following. There is a number of Jenkins slave nodes, and we have to split all JUnit tests into batches, run all batches in parallel using available slaves, and aggregate test results into a single report. The last two tasks (parallel execution and aggregation) can be solved using built-in Jenkins functionality, namely, multi-configuration jobs also known as matrix builds. Multi-configuration job allows one to configure a standard Jenkins job and specify a set of slave servers this job to be executed on. Jenkins is capable of running an instance of the job on all specified slaves in parallel, passing slave ID as a build parameter, and aggregating JUnit test results into a single report. On our build server, configuration matrix for a job is as simple as this:
Test splitting is a more tricky task. A straightforward approach is to obtain a list of test cases and cut it into equal pieces. This is definitely better than nothing, but execution time can vary significantly from batch to batch especially in presence of long-running tests. Our preliminary experiments showed that parallelization of Pig builds in such a way is not very efficient — some batches can run two or more times slower than other. To cope with this issue we decided to collect statistics about tests duration and assemble batches such that the difference between expected execution times is minimal and, consequently, the total build time is minimal. The next section is devoted to the implementation details of this approach.
One of our goals was to keep an implementation as simple as possible, so we came up with the design where each node executes a number of steps sequentially (as a solid script) and independently from the other nodes. The only information this script receives from Jenkins server is a node ID. Each instance of the multi-configuration job on each node includes the following steps:
In this section a Python implementation of each step is shown in a simplified form, details like error handling and logging are omitted for sake of readability.
First, we prepare an initial list of tests by scanning sources in the workspace:
#[ COLLECT A TEST POOL test_pool = set([]) for root, dirnames, filenames in os.walk("./test"): for filename in fnmatch.filter(filenames, 'Test*.java'): test_name = re.search(r".*(Test.*)\.java", os.path.join(root, filename)) test_pool.add(test_name.group(1)) #]
Second, we load test statistics from the shared store. We use MySQL as a database, but one can use version control system to store statistics along with the sources. This statistics is initially empty.
#[ LOAD TEST STATISTICS job_name = "Pig_gd-branch-0.9" db = MySQLdb.connect(...) cursor = db.cursor() cursor.execute(" SELECT test_name, duration FROM test_stat WHERE job_name=%s ", job_name) test_statistics_data = cursor.fetchall() test_statistics = dict(test_statistics_data) db.close() #]
The third step is a scheduling step that selects tests that have to be executed on the current node. We have to split the test pool into a fixed number of disjoint batches such that the difference of their execution times is minimal. We don’t need an optimal solution, a simple greedy algorithm is practically enough. This step produces a set of files with the test names:
random.seed(1234) # fix seed to produce identical results on all nodes #[ PREPARE SPLITS, GREEDY ALGORITHM test_splits = [ [] for i in range(SPLIT_FACTOR) ] test_times = [0] * SPLIT_FACTOR for test in sorted(test_pool, key=lambda test : -test_statistics.get(test, 0)): # select a split with minimal expected execution time split_index = test_times.index(min(test_times)) test_duration = test_statistics.get(test, 0) if not test_duration: # if statistics is unavailable, select a random split split_index = random.randint(0, SPLIT_FACTOR - 1) test_splits[split_index].append(test) test_times[split_index] += test_duration for split, id in zip(test_splits, range(SPLIT_FACTOR)): f = open(base_dir + 'upar-split.%d' % id, 'w') for test in split: # write ant's include mask to a file f.write("**/%s.java\n" % test) f.close() #]
As soon as splits are ready, the slave name is mapped to the batch ID and the build is executed for this batch (fortunately, Pig’s build system allows to submit a file with test filters as a build parameter). The similar thing can done for maven builds. The following piece of bash code do this part of the work:
case $SLAVEID in Slave-Alpha) JOBID=0;; # Slave-Alpha is a Jenkins node ID Slave-Beta) JOBID=1;; Slave-Gamma) JOBID=2;; Slave-Delta) JOBID=3;; Slave-Epsilon) JOBID=4;; Slave-Zeta) JOBID=5;; esac ant -Dtest.junit.output.format=xml clean test -Dtest.all.file=upar-split.${JOBID}
The final step is to parse test results and update test statistics in the DB. This is also quite trivial:
#[ UPDATE TEST STATISTICS db = MySQLdb.connect(...) cursor = db.cursor() path = "./build/test/logs/" for infile in glob.glob( os.path.join(path, 'TEST-*.xml') ): f = open(infile) text = f.read() f.close() time = re.search( r"<testsuite[^>]*time=\"([0-9\.]+)\"", text, flags=re.DOTALL) test_name = re.search(r".*TEST-.*(Test\w*).xml", infile).group(1) cursor.execute( "REPLACE INTO test_stat(job_name,test_name,duration) VALUES(%s,%s,%s)", (job_name, test_name, float(time.group(1))) ) db.close() #]
According to our experiments, the described technique allows one to achieve a very even load distribution among the nodes and, consequently, minimize the total build time. An example of the build duration distribution for Pig build is shown in the screenshot below (monolithic build takes more than 9 hours):
It should be noted that the real production implementation takes care about a few more issues:
Our starting point is a simple element-by-element intersection algorithm (also known as Zipper). Its implementation in C is shown below and do not require lengthy explanations:
#define int32 unsigned int // A, B - operands, sorted arrays // s_a, s_b - sizes of A and B // C - result buffer // return size of the result C size_t intersect_scalar(int32 *A, int32 *B, size_t s_a, size_t s_b, int32 *C) { size_t i_a = 0, i_b = 0; size_t counter = 0; while(i_a < s_a && i_b < s_b) { if(A[i_a] < B[i_b]) { i_a++; } else if(B[i_b] < A[i_a]) { i_b++; } else { C[counter++] = A[i_a]; i_a++; i_b++; } } return counter; }
Performance of this procedure both in C and Java will be evaluated in the last section. I believe that it is possible to improve this approach using a branchless implementation, but I had no chance to try it out.
It is intuitively clear that performance of intersection may be improved by processing of multiple elements at once using SIMD instructions. Let us start with the following question: how to find and extract common elements in two short sorted arrays (let’s call them segments). SSE instruction set allow one to do a pairwise comparison of two segments of four 32-bit integers each using one instruction (_mm_cmpeq intrinsic) that produces a bit mask that highlights positions of equal elements. If one has two 4-element registers, A and B, it is possible to obtain a mask of common elements comparing A with different cyclic shifts of B (the left part of the figure below) and OR-ing the masks produced by each comparison (the right part of the figure):
The resulting comparison mask highlights the required elements in the segment A. This 128-bit mask can be transformed to a 4-bit value (shuffling mask index) using __mm_movemask intrinsic. When this short mask of common elements is obtained, we have to efficiently copy out common elements. This can be done by shuffling of the original elements according to the shuffling mask that can be looked up in the precomputed dictionary using the shuffling mask index (i.e. each of 16 possible 4-bit shuffling mask indexes is mapped to some permutation). All common elements should be placed to the beginning of the register, in this case register can be copied in one shot to the output buffer C as it shown in the figure above.
The described technique gives us a building block that can be used for intersection of long sorted lists. This process is somehow similar to the scalar intersection:
In this example, during the first cycle we compare first 4-element segments (highlighted in red) and copy out common elements (2 and 11). Similarly to the scalar intersection algorithm, we can move forward the pointer for the list B because the tail element of the compared segments is smaller in B (11 vs 14). At the second cycle (in green) we compare the first segment of A with the second segment of B, intersection is empty, and we move pointer for A. And so on. In this example, we need 5 comparisons to process two lists of 12 elements each.
The complete implementation of the described techniques is shown below:
static __m128i shuffle_mask[16]; // precomputed dictionary size_t intersect_vector(int32 *A, int32 *B, size_t s_a, size_t s_b, int32 *C) { size_t count = 0; size_t i_a = 0, i_b = 0; // trim lengths to be a multiple of 4 size_t st_a = (s_a / 4) * 4; size_t st_b = (s_b / 4) * 4; while(i_a < s_a && i_b < s_b) { //[ load segments of four 32-bit elements __m128i v_a = _mm_load_si128((__m128i*)&A[i_a]); __m128i v_b = _mm_load_si128((__m128i*)&B[i_b]); //] //[ move pointers int32 a_max = _mm_extract_epi32(v_a, 3); int32 b_max = _mm_extract_epi32(v_b, 3); i_a += (a_max <= b_max) * 4; i_b += (a_max >= b_max) * 4; //] //[ compute mask of common elements int32 cyclic_shift = _MM_SHUFFLE(0,3,2,1); __m128i cmp_mask1 = _mm_cmpeq_epi32(v_a, v_b); // pairwise comparison v_b = _mm_shuffle_epi32(v_b, cyclic_shift); // shuffling __m128i cmp_mask2 = _mm_cmpeq_epi32(v_a, v_b); // again... v_b = _mm_shuffle_epi32(v_b, cyclic_shift); __m128i cmp_mask3 = _mm_cmpeq_epi32(v_a, v_b); // and again... v_b = _mm_shuffle_epi32(v_b, cyclic_shift); __m128i cmp_mask4 = _mm_cmpeq_epi32(v_a, v_b); // and again. __m128i cmp_mask = _mm_or_si128( _mm_or_si128(cmp_mask1, cmp_mask2), _mm_or_si128(cmp_mask3, cmp_mask4) ); // OR-ing of comparison masks // convert the 128-bit mask to the 4-bit mask int32 mask = _mm_movemask_ps((__m128)cmp_mask); //] //[ copy out common elements __m128i p = _mm_shuffle_epi8(v_a, shuffle_mask[mask]); _mm_storeu_si128((__m128i*)&C[count], p); count += _mm_popcnt_u32(mask); // a number of elements is a weight of the mask //] } // intersect the tail using scalar intersection ... return count; }
The described implementation uses the shuffle_mask dictionary to map the mask of common elements to the shuffling parameter. Building of this dictionary is straightforward (each bit in the mask corresponds to 4 bytes in the register):
// a simple implementation, we don't care about performance here void prepare_shuffling_dictionary() { for(int i = 0; i < 16; i++) { int counter = 0; char permutation[16]; memset(permutation, 0xFF, sizeof(permutation)); for(char b = 0; b < 4; b++) { if(getBit(i, b)) { permutation[counter++] = 4*b; permutation[counter++] = 4*b + 1; permutation[counter++] = 4*b + 2; permutation[counter++] = 4*b + 3; } } __m128i mask = _mm_loadu_si128((const __m128i*)permutation); shuffle_mask[i] = mask; } } int getBit(int value, int position) { return ( ( value & (1 << position) ) >> position); }
SSE 4.2 instruction set offers PCMPESTRM instruction that allows one to compare two segments of eight 16-bit values each and obtain a bit mask that highlights common elements. This sounds like an extremely efficient approach for intersection of sorted lists, but in its basic form this approach is limited by 16-bit values in the lists. This is not the case for many applications, so a workaround was recently suggested by Benjamin Schedel et al. in this article. The main idea is to store indexes in the partitioned format, where elements with the same most significant bits are grouped together. This approach also has limited applicability because each partition should contain a sufficient number of elements, i.e. it works well in case or very large lists or favorable distribution of the values.
Each partition has a header that includes a prefix which represents most significant bits that are common for all elements in the partition and the number of elements in the partition. The following figure illustrates the partitioning process:
The partitioning procedure that coverts 32-bit values into 16-bit values is shown in the code snippet below:
// A - sorted array // s_a - size of A // R - partitioned sorted array size_t partition(int32 *A, size_t s_a, int16 *R) { int16 high = 0; size_t partition_length = 0; size_t partition_size_position = 1; size_t counter = 0; for(size_t p = 0; p < s_a; p++) { int16 chigh = _high16(A[p]); // upper dword int16 clow = _low16(A[p]); // lower dword if(chigh == high && p != 0) { // add element to the current partition R[counter++] = clow; partition_length++; } else { // start new partition R[counter++] = chigh; // partition prefix R[counter++] = 0; // reserve place for partition size R[counter++] = clow; // write the first element R[partition_size_position] = partition_length; partition_length = 1; // reset counters partition_size_position = counter - 2; high = chigh; } } R[partition_size_position] = partition_length; return counter; }
A pair of partitions can be intersected using the following procedure that computes a mask of common elements using _mm_cmpestrm intrinsic and then shuffles these elements similarly to the vectorized intersection procedure what was described in the previous section.
size_t intersect_vector16(int16 *A, int16 *B, size_t s_a, size_t s_b, int16 *C) { size_t count = 0; size_t i_a = 0, i_b = 0; size_t st_a = (s_a / 8) * 8; size_t st_b = (s_b / 8) * 8; while(i_a < st_a && i_b < st_b) { __m128i v_a = _mm_loadu_si128((__m128i*)&A[i_a]); __m128i v_b = _mm_loadu_si128((__m128i*)&B[i_b]); __m128i res_v = _mm_cmpestrm(v_b, 8, v_a, 8, _SIDD_UWORD_OPS|_SIDD_CMP_EQUAL_ANY|_SIDD_BIT_MASK); int r = _mm_extract_epi32(res_v, 0); __m128i p = _mm_shuffle_epi8(v_a, shuffle_mask16[r]); _mm_storeu_si128((__m128i*)&C[count], p); count += _mm_popcnt_u32(r); int16 a_max = _mm_extract_epi16(v_a, 7); int16 b_max = _mm_extract_epi16(v_b, 7); i_a += (a_max <= b_max) * 4; i_b += (a_max >= b_max) * 4; } // intersect the tail using scalar intersection ... return count; }
The whole intersection algorithm looks similarly to the scalar intersection. It receives two partitioned operands, iterates over headers of partitions and calls intersection of particular partitions if their prefixes match:
// A, B - partitioned operands size_t intersect_partitioned(int16 *A, int16 *B, size_t s_a, size_t s_b, int16 *C) { size_t i_a = 0, i_b = 0; size_t counter = 0; while(i_a < s_a && i_b < s_b) { if(A[i_a] < B[i_b]) { i_a += A[i_a + 1] + 2; } else if(B[i_b] < A[i_a]) { i_b += B[i_b + 1] + 2; } else { C[counter++] = A[i_a]; // write partition prefix int16 partition_size = intersect_vector16(&A[i_a + 2], &B[i_b + 2], A[i_a + 1], B[i_b + 1], &C[counter + 1]); C[counter++] = partition_size; // write partition size counter += partition_size; i_a += A[i_a + 1] + 2; i_b += B[i_b + 1] + 2; } } return counter; }
The output of this procedure is also a partitioned vector that can be used in further operations.
Performance of the described techniques was evaluated for intersection of sorted lists of size 1 million elements, with average intersection selectivity about 30%. All evaluated methods excepts partitioned vectorized intersection do not require specific properties of the values in the lists. For partitioned vectorized intersection values were selected from range [0, 3M] to provide relatively large partitions.
In case of Lucene, a corpus of documents with two fields was generated to provide the mentioned index sizes and selectivity; RAMDirectory was used. Intersection was done using standard Boolean query with top hits limited by 1 to prevent generation of large result set. Of course, this not a fair comparison because Lucene is much more than a list intersector, but it is still interesting to try it out.
Performance testing was done on the ordinary Linux desktop with 2.8GHz cores. JDK 1.6 and gcc 4.5.2 (with -O3 option) were used.
I would like to thank Mikhail Khludnev and Kirill Uvaev, who reviewed this article and provided valuable suggestions.
Let us start with a simple example that illustrates capabilities of probabilistic data structures:
Let us have a data set that is simply a heap of ten million random integer values and we know that it contains not more than one million distinct values (there are many duplicates). The picture above depicts the fact that this data set basically occupies 40MB of memory (10 million of 4-byte elements). It is a ridiculous size for Big Data applications, but this a reasonable choice to show all structures in scale. Our goal is to convert this data set to compact structures that allow one to process the following queries:
The picture above shows (in scale) how much memory different representations of the data set will consume and which queries they support:
A number of probabilistic data structures is described in detail in the following sections, although without excessive theoretical explanations – detailed mathematical analysis of these structures can be found in the original articles. The preliminary remarks are:
Let us start with a very simple technique that is called Linear Counting. Basically, a liner counter is just a bit set and each element in the data set is mapped to a bit. This process is illustrated in the following code snippet:
class LinearCounter { BitSet mask = new BitSet(m) // m is a design parameter void add(value) { int position = hash(value) // map the value to the range 0..m mask.set(position) // sets a bit in the mask to 1 } }
Let’s say that the ratio of a number of distinct items in the data set to m is a load factor. It is intuitively clear that:
If so, we have to pose the following two questions:
Both questions were addressed in [1]. The following table contains key formulas that allow one to estimate cardinality as a function of the mask weight and choose parameter m by required bias or standard error of the estimation:
The last two equations cannot be solved analytically to express m or load factor as a function of bias or standard error, but it is possible to tabulate numerical solutions. The following plots can be used to determine the load factor (and, consequently, m) for different capacities.
The rule of thumb is that load factor of 10 can be used for large data sets even if very precise estimation is required, i.e. memory consumption is about 0.1 bits per unique value. This is more than two orders of magnitude more efficient than the explicit indexing of 32- or 64-bit identifiers, but memory consumption grows linearly as a function of the expected cardinality (n), i.e. capacity of counter.
It is important to note that several independently computed masks for several data sets can be merged as a bitwise OR to estimate the cardinality of the union of the data sets. This opportunity is leveraged in the following case study.
There is a system that receives events on user visits from different internet sites. This system enables analysis to query a number of unique visitors for the specified date range and site. Linear Counters can be used to aggregate information about registered visitor IDs for each day and site, masks for each day are saved, and a query can be processed using bitwise OR-ing of the daily masks.
Loglog algorithm [2] is a much more powerful and much more complex technique than the Linear Counting algorithm. Although some aspects of the Loglog algorithm are pretty complex, the basic idea is simple and ingenious.
In order to understand principles of the Loglog algorithm we should start one general observation. Let us imagine that we hashed each element in the data set and these hashed values are presented as binary strings. We can expect that about one half of strings will start with 1, one quarter will start with 01, and so on. Let’s denote the number of the leading zeros as a rank. Finally, one or a few values will have some maximum rank r, as it shown in the figure below.
From this consideration it follows that 2^r can be treated as some kind of the cardinality estimation, but a very unstable estimation – r is determined by one or few items and variance is very high. However, it is possible to overcome this issue by using multiple independent observations and averaging them. This technique is shown in the code snippet below. Incoming values are routed to a number of buckets by using their first bits as a bucket address. Each bucket maintains a maximum rank of the received values:
class LogLogCounter { int H // H is a design parameter int m = 2^k // k is a design parameter etype[] estimators = new etype[m] // etype is a design parameter void add(value) { hashedValue = hash(value) bucket = getBits(hashedValue, 0, k) estimators[bucket] = max( estimators[bucket], rank( getBits(hashedValue, k, H) ) ) } getBits(value, int start, int end) rank(value) }
This implementation requires the following parameters to be determined:
The auxiliary functions are specified as follows:
The following table provides the estimation formula and equations that can be used to determine numerical parameters of the Loglog Counter:
These formulas are very impressive. One can see that a number of buckets is relatively small for most of the practically interesting values of the standard error of the estimation. For example, 1024 estimators provide a standard error of 4%. At the same time, the length of the estimator is a very slow growing function of the capacity, 5-bit buckets are enough for cardinalities up to 10^11, 8-bit buckets (etype is byte) can support practically unlimited cardinalities. This means that less than 1KB of auxiliary memory may be enough to process gigabytes of data in the real life applications! This is a fundamental phenomenon that was revealed and theoretically analyzed in [7]: It is possible to recover an approximate value of cardinality, using only a (small and) constant memory.
Loglog counter is essentially a record about a single (rarest) element in the dataset.
More recent developments on cardinality estimation are described in [9] and [10]. This article also provides a good overview of the cardinality estimation techniques.
There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.) The straightforward approaches for implementation of this system are:
If number of users and criteria is high, both solutions assume very high amount of data to be stored, maintained, or processed. As an alternative, a LoglogCounter structure can be maintained for each criterion. In this case, thousands of criteria and hundreds of millions of visitors can be tracked using a very modest amount of memory.
There is a system that monitors traffic and counts unique visitors for different criteria (visited site, geography, etc.). It is required to compute 100 most popular sites using a number of unique visitors as a metric of popularity. Popularity should be computed every day on the basis of data for last month, i.e. every day one-day partition added, another one is removed from the scope. Similarly to the previous case study, straightforward solutions for this problem require a lot of resources if data volume is high. On the other hand, one can create a fresh set of per-site Loglog counters every day and maintain this set during 30 days, i.e. 30 sets of counters are active at any moment of time. This approach can be very efficient because of the tiny memory footprint of the Loglog counter, even for millions of unique visitors.
Count-Min Sketches is a family of memory efficient data structures that allow one to estimate frequency-related properties of the data set, e.g. estimate frequencies of particular elements, find top-K frequent elements, perform range queries (where the goal is to find the sum of frequencies of elements within a range), estimate percentiles.
Let’s focus on the following problem statement: there is a set of values with duplicates, it is required to estimate frequency (a number of duplicates) for each value. Estimations for relatively rare values can be imprecise, but frequent values and their absolute frequencies should be determined accurately.
The basic idea of Count-Min Sketch [3] is quite simple and somehow similar to Linear Counting. Count-Min sketch is simply a two-dimensional array (d x w) of integer counters. When a value arrives, it is mapped to one position at each of d rows using d different and preferably independent hash functions. Counters on each position are incremented. This process is shown in the figure below:
It is clear that if sketch is large in comparison with the cardinality of the data set, almost each value will get an independent counter and estimation will precise. Nevertheless, this case is absolutely impractical – it is much better to simply maintain a dedicated counter for each value by using plain array or hash table. To cope with this issue, Count-Min algorithm estimates frequency of the given value as a minimum of the corresponding counters in each row because the estimation error is always positive (each occurrence of a value always increases its counters, but collisions can cause additional increments). A practical implementation of Count-Min sketch is provided in the following code snippet. It uses simple hash functions as it was suggested in [4]:
class CountMinSketch { long estimators[][] = new long[d][w] // d and w are design parameters long a[] = new long[d] long b[] = new long[d] long p // hashing parameter, a prime number. For example 2^31-1 void initializeHashes() { for(i = 0; i < d; i++) { a[i] = random(p) // random in range 1..p b[i] = random(p) } } void add(value) { for(i = 0; i < d; i++) estimators[i][ hash(value, i) ]++ } long estimateFrequency(value) { long minimum = MAX_VALUE for(i = 0; i < d; i++) minimum = min( minimum, estimators[i][ hash(value, i) ] ) return minimum } hash(value, i) { return ((a[i] * value + b[i]) mod p) mod w } }
Dependency between the sketch size and accuracy is shown in the table below. It is worth noting that width of the sketch limits the magnitude of the error and height (also called depth) controls the probability that estimation breaks through this limit:
Accuracy of the Count-Min sketch depends on the ratio between the sketch size and the total number of registered events. This means that Count-Min technique provides significant memory gains only for skewed data, i.e. data where items have very different probabilities. This property is illustrated in the figures below.
Two experiments were done with the Count-Min sketch of size 3×64, i.e. 192 counters total. In the first case the sketch was populated with moderately skewed data set of 10k elements, about 8500 distinct values (element frequencies follow Zipfian distribution which models, for example, distribution of words in natural texts). The real histogram (for most frequent elements, it has a long flat tail in the right that was truncated in this figure) and the histogram recovered from the sketch are shown in the figure below:
It is clear that Count-Min sketch cannot track frequencies of 8500 elements using only 192 counters in the case of low skew of the frequencies, so the estimated histogram is very inaccurate.
In the second case the sketch was populated with a relatively highly skewed data set of 80k elements, also about 8500 distinct values. The real and estimated histograms are presented in the figure below:
One can see that result is more accurate, at least for the most frequent items. In general, applicability of Count-Min sketches is not a straightforward question and the best thing that can be recommended is experimental evaluation of each particular case. Theoretical bounds of Count-Min sketch accuracy on skewed data and measurements on real data sets are provided in [6].
The original Count-Min sketch performs well on highly skewed data, but on low or moderately skewed data it is not so efficient because of poor protection from the high number of hash collisions – Count-Min sketch simply selects minimal (less distorted) estimator. As an alternative, more careful correction can be done to compensate the noise caused by collisions. One possible correction algorithm was suggested in [5]. It estimates noise for each hash function as the average value of all counters in the row that correspond to this function (except counter that corresponds to the query itself), deduces it from the estimation for this hash function, and, finally, computes the median of the estimations for all hash functions. Having that the sum of all counters in the sketch row equals to the total number of the added elements, we obtain the following implementation:
class CountMeanMinSketch { // initialization and addition procedures as in CountMinSketch // n is total number of added elements long estimateFrequency(value) { long e[] = new long[d] for(i = 0; i < d; i++) { sketchCounter = estimators[i][ hash(value, i) ] noiseEstimation = (n - sketchCounter) / (w - 1) e[i] = sketchCounter – noiseEstimator } return median(e) } }
This enhancement can significantly improve accuracy of the Count-Min structure. For example, compare the histograms below with the first histograms for Count-Min sketch (both techniques used a sketch of size 3×64 and 8500 elements were added to it):
Count-Min sketches are applicable to the following problem: Find all elements in the data set with the frequencies greater than k percent of the total number of elements in the data set. The algorithm is straightforward:
In general, the top-k problem makes sense only for skewed data, so usage of Count-Min sketches is reasonable in this context.
There is a system that tracks traffic by IP address and it is required to detect most traffic-intensive addresses. This problem can be solved using the algorithm described above, but the problem is not trivial because we need to track the total traffic for each address, not a frequency of items. Nevertheless, there is a simple solution – counters in the CountMinSketch implementation can be incremented not by 1, but by absolute amount of traffic for each observation (for example, size of IP packet if sketch is updated for each packet). In this case, sketch will track amounts of traffic for each address and a heap with the most traffic-intensive addresses can be maintained as described above.
Count-Min Sketch and other similar techniques is not the only family of structures that allow one to estimate frequency-related metrics. Another large family of algorithms and structures that deal with frequency estimation is counter-based techniques. Stream-Summary algorithm [8] belongs to this family. Stream-Summary allows one to detect most frequent items in the dataset and estimate their frequencies with explicitly tracked estimation error.
Basically, Stream-Summary traces a fixed number (a number of slots) of elements that presumably are most frequent ones. If one of these elements occurs in the stream, the corresponding counter is increased. If a new, non-traced element appears, it replaces the least frequent traced element and this kicked out element become non-traced.
The figure below illustrates how Stream-Summary with 3 slots works for the input stream {1,2,2,2,3,1,1,4}. Stream-Summary groups all traced elements into buckets where each bucket corresponds to the particular frequency, i.e. to the number of occurrences. Additionally, each traced element has the “err” field that stores maximum potential error of the estimation.
The estimation procedure for most frequent elements and corresponding frequencies is quite obvious because of simple internal design of the Stream-Summary structure. Indeed, one just need to scan elements in the buckets that correspond to the highest frequencies. Nevertheless, Stream-Summary is able not only to provide estimates, but to answer are these estimates exact (guaranteed) or not. Computation of these guarantees is not trivial, corresponding algorithms are described in [8].
In theory, one can process a range query (something like SELECT count(v) WHERE v >= c1 AND v < c2) using a Count-Min sketch enumerating all points within a range and summing estimates for corresponding frequencies. However, this approach is impractical because the number of points within a range can be very high and accuracy also tends to be inacceptable because of cumulative error of the sum. Nevertheless, it is possible to overcome these problems using multiple Count-Min sketches. The basic idea is to maintain a number of sketches with the different “resolution”, i.e. one sketch that counts frequencies for each value separately, one sketch that counts frequencies for pairs of values (to do this one can simply truncate a one bit of a value on the sketch’s input), one sketch with 4-items buckets and so on. The number of levels equals to logarithm of the maximum possible value. This schema is shown in the right part of the following picture:
Any range query can be reduced to a number of queries to the sketches of different level, as it shown in right part of the picture above. This approach (called dyadic ranges) allows one to reduce the number of computations and increase accuracy. An obvious optimization of this schema is to replace sketches by exact counters at the lowest levels where a number of buckets is small.
MADlib (a data mining library for PostgreSQL and Greenplum) implements this algorithm to process range queries and calculate percentiles on large data sets.
Bloom Filter is probably the most famous and widely used probabilistic data structure. There are multiple descriptions of the Bloom filter in the web, I provide a short overview here just for sake of completeness. Bloom filter is similar to Linear Counting, but it is designed to maintain an identity of each item rather than statistics. Similarly to Linear Counter, the Bloom filter maintains a bitset, but each value is mapped not to one, but to some fixed number of bits by using several independent hash functions. If the filter has a relatively large size in comparison with the number of distinct elements, each element has a relatively unique signature and it is possible to check a particular value – is it already registered in the bit set or not. If all the bits of the corresponding signature are ones then the answer is yes (with a certain probability, of course).
The following table contains formulas that allow one to calculate parameters of the Bloom filter as functions of error probability and capacity:
Bloom filter is widely used as a preliminary probabilistic test that allows one to reduce a number of exact checks. The following case study shows how the Bloom filter can be applied to the cardinality estimation.
There is a system that tracks a huge number of web events and each event is marked by a number of tags including a user ID this event corresponds to. It is required to report a number of unique users that meet the specified combination of tags (like users from the city C that visited site A or site B).
A possible solution is to maintain a Bloom filter that tracks user IDs for each tag value and a Bloom filter that contains user IDs that correspond to the final result. A user ID from each incoming event is tested against the per-tag filters – does it satisfy the required combination of tags or not. If the user ID passes this test, it is additionally tested against the additional Bloom filter that corresponds to the report itself and, if passed, the final report counter is increased.
It is worth mentioning that simple Java implementations of several structures can be found in stream-lib library.
In this article, I describe major architectural decisions we made and techniques we used. This description should not be considered as a solid blueprint, but rather a collection of the relatively independent ideas, patterns, and notes that can be used in different combinations and in different applications, not only in eCommerce systems.
I cannot disclose customer’s name, so I will explain business logic using amazon.com as an example, fortunately the basic functionality is very similar. The first piece of functionality is structural or hierarchical navigation through categories and products, which are the main business entities of the system. Categories are organized in a tree-like structure and the user is provided with several controls that enable him to navigate through this tree starting from the highest categories (like departments on amazon.com) and going to the lowest ones:
Each product can be explicitly associated with one or more categories of any level and category contains a product if this product is explicitly associated with it or associated with any of its subcategories. These structural dependencies between categories and products are relatively static (the system refreshes this information daily), but operations team can change separate relations in runtime to fix incorrect data or to inject other urgent changes. Besides this, each product has some transient information like in-stock availability that is a subject of frequent updates (every 5 minutes or so).
The second important piece of functionality is a faceted navigation. Categories can contain thousands of products and user cannot efficiently search though this array without powerful tools. The most popular way to do this is a faceted navigation that can be thought as a generation of dynamic categories based on product attributes. For example, if the user opens a category that contains clothes, products will be characterized by properties like size, brand, color and so on. Available values of these properties (called facets) can be extracted from the product set and shown on the UI to enable the user to apply user-select filters, which are particular AND-ed or OR-ed combinations of the facet values:
Each facet value is often accompanied with cardinality, i.e. number of products that will be in the results set if this filter is applied. When user clicks on a facet, the system automatically applies the selected filters and narrows the product set according to the user interests. It is important that this style of navigation assumes high interactivity – each selection leads to recomputing of all available facets, their cardinalities, and products in a result set.
There is a lot of information about faceted search on the web. I can recommend this article by Peter Morville and Jeffrey Callender for further reading. We will also return back to some details of business logic in the section devoted to implementation of the faceted navigation.
From the backend perspective, hierarchical and faced navigation requires the following operations to be implemented:
From the technical perspective, the following properties should be highlighted:
The major architectural decision was to use in-memory data grid (IMDG) to shield the master RDBMS from workload during request processing. Oracle Coherence was chosen as an implementation. Coherence is used as a platform that provides distributed cache capabilities and can serve as a messaging bus for coordination of all application-level modules on all nodes in the cluster.
The deployment schema includes three types of nodes – processing nodes, storage nodes, and maintenance nodes. Processing nodes are responsible for requests serving and act as Coherence clients. Storage nodes are basically Coherence storage nodes. Maintenance nodes are responsible for data indexing and processing of transient information updates. Both Storage and Maintenance nodes do not serve client requests. This deployment schema is shown in the figure below:
Nodes can be dynamically added or removed from the cluster. All nodes (processing, storage, maintenance) host the same application that contains all modules for request processing, maintenance operations, and Coherence instance. Basically, deployments on all nodes are identical and can serve both client requests and maintenance operations, although each type of nodes has its own configuration parameters. The rationale behind this architecture can be recognized as a pattern:
Pattern: Homogeneous Cluster Nodes |
Problem There is a clustered system that consist of multiple business services and auxiliary modules (data loaders, administration controls, etc). The deployment process is going to be complex if each module is deployed as a separate artifact with its own deployment schema and configuration. |
Solution Different groups of nodes in the cluster can have different roles and serve different needs, but it may be a good idea to create one application and one artifact that will be deployed throughout the cluster. Different modules of this application are activated on different nodes depends on explicitly specified configuration (say, property files) or just because of usage pattern (say, certain requests are routed only to particular nodes). |
Results This approach simplifies deployment and release processes, mitigates risk of incorrect deployment or misconfiguration. Development and QA processes are simplified because one can use either singe node or multiple nodes to run fully functional environment. |
Turning to the internals of the application itself, we can see that it includes the following components (these components are depicted in the figure below):
Data Loader is active only on the Maintenance nodes where it has a plenty of resources for temporary buffers, index compilation tasks and so on. All updates and indexing requests are routed only to the Maintenance nodes, not to Processing/Storage nodes. Such separation of data loader and other maintenance units can be recognized as a common pattern:
Pattern: Maintenance Node |
Problem There is a cluster of nodes where each node is able to serve both business and maintenance requests. Maintenance operations can consume a lot resources and impact performance of business requests. |
Solution Maintenance operations like data indexing can be handled by any cluster node when a distributed platform like IMDG is used. Nevertheless, it is often a good idea to use a dedicated node for this purpose. This node can be identical to other nodes from the deployment point of view (the same application as on the other nodes), but user requests are not routed to it and more powerful hardware can be used in some cases. |
Results On the one hand, maintenance node provides potentially resource-consuming indexing processes with dedicated hardware capacities. On the other hand, maintenance processes do not interfere with user requests. |
Data Loader loads all active data to Coherence during each daily update, but there is “dark matter” that is not loaded into Coherence but occasionally requested by some clients. For instance, this matter is obsolete products and categories that are not visible on the site and not available for purchase. Coherence Read-Through feature is used to cope with these entities – it is acceptable to load them from the RDBMS on demand because the number of such requests is very low.
Design of Data Loader is influenced by two major factors:
As a result, Data Loader is organized as an asynchronous pipeline (Pipes and Filters design pattern) where batches of entities are loaded from RDBMS by a set of units that work in parallel threads. Loaded entities are submitted to a queue, and each consumer works in its own thread taking batches and processing them independently from the other participants. This schema is shown in the figure below:
This schema is relatively simple because there is only one data source and structure of entities is not too complicated. Nevertheless, this pipeline can become more complex if there are multiple data sources and one business entity is assembled using several sources. In this case, a batch of entities can be initially loaded from a single source and then passed to another loader that enriches entities by additional attributes and so on.
Pattern: Data Loading Pipeline |
Problem A system should be populated with a large data set that come from single or multiple sources. One business entity can depend on multiple sources. There are many consumers of the loaded business entities that index, persist, or process entities. |
Solution Adopt the Pipes and Filters pattern. Implement each operation (loading or indexing) as an isolated unit that produces or consumes entities. Data producers or loaders should be driven by incoming requests that specify data to be loaded. Connect all units via asynchronous data channels and run multiple instances of each unit as an independent process. |
Results Data Loading Pipeline allows one to organize efficient data loading in a multithreaded environment. All units can work in a batch mode, and more parallel instances can be easily added. A special attention should be paid to the memory consumption because queues with entities can consume a lot of memory if a system is not balanced or misconfigured. |
Data inconsistency during saving of new data to Coherence is practically avoided using techniques that were described in one of my previous articles.
When we first started to work on the navigation procedures, we first tried to do it using standard Coherence capabilities, i.e. filters and entry processors. This attempt was not very successful from the performance point of view due to high memory consumption and relatively low performance in general. The next step was to design a compact data structure that supports very fast category tree traversal and extraction of products by Category ID. The structure we created is based on the nested set model, it is shown in the figure below:
A navigation index represents a huge array of product IDs and their basic attributes that are frequently used in computation and filtering, for example, in-stock availability. In our domain model these attributes are binary, hence we efficiently packed them into integer numbers where each bit is reserved for a particular attribute. Each element of this array corresponds to the product-to-category relation and one product ID can occur in this array multiple times if product is associated with multiple categories. Hierarchy itself is stored as an indexed tree of category IDs and each node contains two indexes in product-to-category array. This indexes point to start and end positions of relations that belong to the particular category.
The second notable feature of this navigation solution is that each Processing Node fetches index from Coherence and entirely caches it in local memory. This allows one to perform navigational operations without touching heavy-weight domain objects. If data volume becomes high, it is possible to partition index into several shards and perform distributed processing, although it was not a case in our application (index with millions of products can be easily handled by one JVM). This technique can be considered as a common pattern (or anti-pattern, it depends on scalability requirements):
Pattern: Replicated Custom Index |
Problem There is an application with a distributed data storage. It is necessary to perform a special type of query that involves limited amount of attributes for each entity, but complex business logic or high performance requirements make standard distributed scans inefficient. |
Solution When a non-standard traversal or querying is required and amount of involved data is limited, each node in the cluster can cache domain-specific index and use it to perform the operation. |
Results This approach can be very efficient when standard indexes do not work well, but it can turn into scalability bottleneck if implemented incorrectly. If there are reasons to assume that index will become too large to be cached on one node, this is a serious argument against this approach. |
Index propagation throughout the cluster is shown in the figure below. Maintenance Node loads data from the Master DB, builds index, saves it in a serialized partitioned form to Coherence, and then Processing Nodes fetch it and cache locally:
Faceted Navigation was described in the first section of this article, but it should be mentioned that logic of computation is not always straightforward, but often affected by business rules and peculiarities of a business model. As an interesting example, we can consider the following use case. Imagine that, according to the business model, product is not a final item of purchase, but a group of such items. For instance, when user looks into the Jeans category, he or she can see Levi’s Jeans 501 as a product, but the actual item to be purchased is a particular instance of Levi’s Jeans 501, say Levi’s Jeans 501 of size 34×30, white color. Considered as a product domain entity, Levi’s Jeans 501 will contain many particular items of a different color and size. From the faceted navigation perspective, this leads to the interesting issue. At the first glance, it is fine to attribute each product with all sizes or colors that can be found among all its instances and build facets based on this information. Now imagine that there are two instances of Levi’s Jeans 501 – one is of size 34×30 and in white color, another one is one is of size 30×30 and in white color. If the user looks for black jeans of size 34×30, this product will match the filter if it is simply attributed by a plain list of instance-level attributes. Nevertheless, there are no black jeans of size 34×30 in the store. This situation is illustrated in the figure below:
This is just a one example of non-trivial issues with facetization logic. Many more issues and merchandiser-driven tweaks can appear in a real system. The conclusion is that faceted navigation can be pretty sophisticated and certain implementation flexibility is required.
To cope with such issues, it was decided to keep the design of a facet index very straightforward and do not use data layouts like inverted indexes. Basically, all products, their instances and higher level groups of items are stored just like nested arrays and maps of objects:
All attributes are mapped to the integer values and these values are compactly stored in open addressing hash sets inside each instance or product. This allows one to iterate over all items within a category, efficiently applying user selected filter to each item, and increment facet counters for all attributes that are inside accepted items. I provided a detailed description of data structures and algorithms that allow one to do this in my previous post.
If the user selected filter includes many attributes it may be inefficient to check all these attributes one by one for each item. Performance of filtering can be improved using Bloom filter that allows one to apply a filter of several terms to a set of attributes using a couple of processor instructions. Bloom filter is liable to false positives, so it can not completely replace traditional checks using hash sets with attributes, but it can be used as a preliminary test to decrease a number of relatively expensive exact checks. This technique is used in a number of well-known systems, Google Big Table and Apache HBase are among them.
Pattern: Probabilistic Test |
Problem There is a large collection of items (domain entities, files, records etc). It is necessary to provide the ability to select items that meet a certain criteria – simple yes/no predicate or complex filter. |
Solution Items can be grouped into buckets. Each bucket contains one or more items and has a compact signature that allows one to answer the question “is there at least one item inside the bucket that meets the criteria“. This signature is typically a kind of hash that has much smaller memory footprint than the original collection and liable to false positives. Query processor tests bucket’s signature and, if results shows that bucket potentially can contain the requested items, it goes into the bucket and checks all items independently. |
Results Probabilistic testing is good to trade time to memory or IO to memory. It increases memory consumption because of signatures, but allows one to significantly decrease volume of processed data for selective queries. |
Replicated Custom Index pattern is used to distribute Facet Index throughout the cluster, just like Navigation Index.
The described design showed the following properties after being in production for a long time:
I would like to thank Daniel Kirkdorffer who reviewed the article and cleaned up the grammar.
To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:
First, we should note that SQL and relational model in general were designed long time ago to interact with the end user. This user-oriented nature had vast implications:
On the other hand, it turned out that software applications are not so often interested in in-database aggregation and able to control, at least in many cases, integrity and validity themselves. Besides this, elimination of these features had an extremely important influence on the performance and scalability of the stores. And this was where a new evolution of data models began:
The rest of this article describes concrete data modeling techniques and patterns. As a preface, I would like to provide a few general notes on NoSQL data modeling:
This section is devoted to the basic principles of NoSQL data modeling.
Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model. Most techniques described in this article leverage denormalization in one or another form.
In general, denormalization is helpful for the following trade-offs:
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases
All major genres of NoSQL provide soft schema capabilities in one way or another:
Soft schema allows one to form classes of entities with complex internal structures (nested entities) and to vary the structure of particular entities.This feature provides two major facilities:
Embedding with denormalization can greatly impact updates both in performance and consistency, so special attention should be paid to update flows.
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases
Joins are rarely supported in NoSQL solutions. As a consequence of the “question-oriented” NoSQL nature, joins are often handled at design time as opposed to relational models where joins are handled at query execution time. Query time joins almost always mean a performance penalty, but in many cases one can avoid joins using Denormalization and Aggregates, i.e. embedding nested entities. Of course, in many cases joins are inevitable and should be handled by an application. The major use cases are:
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases, Graph Databases
In this section we discuss general modeling techniques that applicable to a variety of NoSQL implementations.
Many, although not all, NoSQL solutions have limited transaction support. In some cases one can achieve transactional behavior using distributed locks or application-managed MVCC, but it is common to model data using an Aggregates technique to guarantee some of the ACID properties.
One of the reasons why powerful transactional machinery is an inevitable part of the relational databases is that normalized data typically require multi-place updates. On the other hand, Aggregates allow one to store a single business entity as one document, row or key-value pair and update it atomically:
Of course, Atomic Aggregates as a data modeling technique is not a complete transactional solution, but if the store provides certain guaranties of atomicity, locks, or test-and-set instructions then Atomic Aggregates can be applicable.
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases
Perhaps the greatest benefit of an unordered Key-Value data model is that entries can be partitioned across multiple servers by just hashing the key. Sorting makes things more complex, but sometimes an application is able to take some advantages of ordered keys even if storage doesn’t offer such a feature. Let’s consider the modeling of email messages as an example:
Applicability: Key-Value Stores
Dimensionality Reduction is a technique that allows one to map multidimensional data to a Key-Value model or to other non-multidimensional models.
Traditional geographic information systems use some variation of a Quadtree or R-Tree for indexes. These structures need to be updated in-place and are expensive to manipulate when data volumes are large. An alternative approach is to traverse the 2D structure and flatten it into a plain list of entries. One well known example of this technique is a Geohash. A Geohash uses a Z-like scan to fill 2D space and each move is encoded as 0 or 1 depending on direction. Bits for longitude and latitude moves are interleaved as well as moves. The encoding process is illustrated in the figure below, where black and red bits stand for longitude and latitude, respectively:
An important feature of a Geohash is its ability to estimate distance between regions using bit-wise code proximity, as is shown in the figure. Geohash encoding allows one to store geographical information using plain data models, like sorted key values preserving spatial relationships. The Dimensionality Reduction technique for BigTable was described in [6.1]. More information about Geohashes and other related techniques can be found in [6.2] and [6.3].
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases
Index Table is a very straightforward technique that allows one to take advantage of indexes in stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table with keys that follow the access pattern. For example, there is a master table that stores user accounts that can be accessed by user ID. A query that retrieves all users by a specified city can be supported by means of an additional table where city is a key:
An Index table can be updated for each update of the master table or in batch mode. Either way, it results in an additional performance penalty and become a consistency issue.
Index Table can be considered as an analog of materialized views in relational databases.
Applicability: BigTable-style Databases
Composite key is a very generic technique, but it is extremely beneficial when a store with ordered keys is used. Composite keys in conjunction with secondary sorting allows one to build a kind of multidimensional index which is fundamentally similar to the previously described Dimensionality Reduction technique. For example, let’s take a set of records where each record is a user statistic. If we are going to aggregate these statistics by a region the user came from, we can use keys in a format (State:City:UserID) that allow us to iterate over records for a particular state or city if that store supports the selection of key ranges by a partial key match (as BigTable-style systems do):
SELECT Values WHERE state="CA:*" SELECT Values WHERE city="CA:San Francisco*"
Applicability: BigTable-style Databases
Composite keys may be used not only for indexing, but for different types of grouping. Let’s consider an example. There is a huge array of log records with information about internet users and their visits from different sites (click stream). The goal is to count the number of unique users for each site. This is similar to the following SQL query:
SELECT count(distinct(user_id)) FROM clicks GROUP BY site
We can model this situation using composite keys with a UserID prefix:
The idea is to keep all records for one user collocated, so it is possible to fetch such a frame into memory (one user can not produce too many events) and to eliminate site duplicates using hash table or whatever. An alternative technique is to have one entry for one user and append sites to this entry as events arrive. Nevertheless, entry modification is generally less efficient than entry insertion in the majority of implementations.
Applicability: Ordered Key-Value Stores, BigTable-style Databases
This technique is more a data processing pattern, rather than data modeling. Nevertheless, data models are also impacted by usage of this pattern. The main idea of this technique is to use an index to find data that meets a criteria, but aggregate data using original representation or full scans. Let’s consider an example. There are a number of log records with information about internet users and their visits from different sites (click stream). Let assume that each record contains user ID, categories this user belongs to (Men, Women, Bloggers, etc), city this user came from, and visited site. The goal is to describe the audience that meet some criteria (site, city, etc) in terms of unique users for each category that occurs in this audience (i.e. in the set of users that meet the criteria).
It is quite clear that a search of users that meet the criteria can be efficiently done using inverted indexes like {Category -> [user IDs]} or {Site -> [user IDs]}. Using such indexes, one can intersect or unify corresponding user IDs (this can be done very efficiently if user IDs are stored as sorted lists or bit sets) and obtain an audience. But describing an audience which is similar to an aggregation query like
SELECT count(distinct(user_id)) ... GROUP BY category
cannot be handled efficiently using an inverted index if the number of categories is big. To cope with this, one can build a direct index of the form {UserID -> [Categories]} and iterate over it in order to build a final report. This schema is depicted below:
And as a final note, we should take into account that random retrieval of records for each user ID in the audience can be inefficient. One can grapple with this problem by leveraging batch query processing. This means that some number of user sets can be precomputed (for different criteria) and then all reports for this batch of audiences can be computed in one full scan of direct or inverse index.
Applicability: Key-Value Stores, BigTable-style Databases, Document Databases
Trees or even arbitrary graphs (with the aid of denormalization) can be modeled as a single record or document.
Applicability: Key-Value Stores, Document Databases
Adjacency Lists are a straightforward way of graph modeling – each node is modeled as an independent record that contains arrays of direct ancestors or descendants. It allows one to search for nodes by identifiers of their parents or children and, of course, to traverse a graph by doing one hop per query. This approach is usually inefficient for getting an entire subtree for a given node, for deep or wide traversals.
Applicability: Key-Value Stores, Document Databases
Materialized Paths is a technique that helps to avoid recursive traversals of tree-like structures. This technique can be considered as a kind of denormalization. The idea is to attribute each node by identifiers of all its parents or children, so that it is possible to determine all descendants or predecessors of the node without traversal:
This technique is especially helpful for Full Text Search Engines because it allows one to convert hierarchical structures into flat documents. One can see in the figure above that all products or subcategories within the Men’s Shoes category can be retrieved using a short query which is simply a category name.
Materialized Paths can be stored as a set of IDs or as a single string of concatenated IDs. The latter option allows one to search for nodes that meet a certain partial path criteria using regular expressions. This option is illustrated in the figure below (path includes node itself):
Applicability: Key-Value Stores, Document Databases, Search Engines
Nested sets is a standard technique for modeling tree-like structures. It is widely used in relational databases, but it is perfectly applicable to Key-Value Stores and Document Databases. The idea is to store the leafs of the tree in an array and to map each non-leaf node to a range of leafs using start and end indexes, as is shown in the figure below:
This structure is pretty efficient for immutable data because it has a small memory footprint and allows one to fetch all leafs for a given node without traversals. Nevertheless, inserts and updates are quite costly because the addition of one leaf causes an extensive update of indexes.
Applicability: Key-Value Stores, Document Databases
Search Engines typically work with flat documents, i.e. each document is a flat list of fields and values. The goal of data modeling is to map business entities to plain documents and this can be challenging if the entities have a complex internal structure. One typical challenge mapping documents with a hierarchical structure, i.e. documents with nested documents inside. Let’s consider the following example:
Each business entity is some kind of resume. It contains a person’s name and a list of his or her skills with a skill level. An obvious way to model such an entity is to create a plain document with Skill and Level fields. This model allows one to search for a person by skill or by level, but queries that combine both fields are liable to result in false matches, as depicted in the figure above.
One way to overcome this issue was suggested in [4.6]. The main idea of this technique is to index each skill and corresponding level as a dedicated pair of fields Skill_i and Level_i, and to search for all these pairs simultaneously (where the number of OR-ed terms in a query is as high as the maximum number of skills for one person):
This approach is not really scalable because query complexity grows rapidly as a function of the number of nested structures.
Applicability: Search Engines
The problem with nested documents can be solved using another technique that were also described in [4.6]. The idea is to use proximity queries that limit the acceptable distance between words in the document. In the figure below, all skills and levels are indexed in one field, namely, SkillAndLevel, and the query indicates that the words “Excellent” and “Poetry” should follow one another:
[4.3] describes a success story for this technique used on top of Solr.
Applicability: Search Engines
Graph databases like neo4j are exceptionally good for exploring the neighborhood of a given node or exploring relationships between two or a few nodes. Nevertheless, global processing of large graphs is not very efficient because general purpose graph databases do not scale well. Distributed graph processing can be done using MapReduce and the Message Passing pattern that was described, for example, in one of my previous articles. This approach makes Key-Value stores, Document databases, and BigTable-style databases suitable for processing large graphs.
Applicability: Key-Value Stores, Document Databases, BigTable-style Databases
Finally, I provide a list of useful links related to NoSQL data modeling:
The sun.misc.Unsafe class is so unsafe that JDK developers added special checks to restrict access to it. Its constructor is private and caller of the factory method getUnsafe() should be loaded by Bootloader (i.e. caller should also be a part of JDK):
public final class Unsafe { ... private Unsafe() {} private static final Unsafe theUnsafe = new Unsafe(); ... public static Unsafe getUnsafe() { Class cc = sun.reflect.Reflection.getCallerClass(2); if (cc.getClassLoader() != null) throw new SecurityException("Unsafe"); return theUnsafe; } ... }
Fortunately there is theUnsafe field that can be used to retrieve Unsafe instance. We can easily write a helper method to do this via reflection:
public static Unsafe getUnsafe() { try { Field f = Unsafe.class.getDeclaredField("theUnsafe"); f.setAccessible(true); return (Unsafe)f.get(null); } catch (Exception e) { /* ... */ } }
In the next sections we will study several tricks that become possible due to the following methods of Unsafe:
The first trick we will do is C-like sizeof() function, i.e. function that returns shallow object size in bytes. Inspecting JVM sources of JDK6 and JDK7, in particular src/share/vm/oops/oop.hpp and src/share/vm/oops/klass.hpp, and reading comments in the code, we can notice that size of class instance is stored in _layout_helper which is the fourth field in C structure that represents Java class. Similarly, /src/share/vm/oops/oop.hpp shows that each instance (i.e. object) stores pointer to a class structure in its second field. For 32-bit JVM this means that we can first take class structure address as 4-8 bytes in the object structure and next shift by 3×4=12 bytes inside class structure to capture_layout_helper field which is instance size in bytes. These structures are shown in the picture below:
As so, we can implement sizeof() as follows:
public static long sizeOf(Object object) { Unsafe unsafe = getUnsafe(); return unsafe.getAddress( normalize( unsafe.getInt(object, 4L) ) + 12L ); } public static long normalize(int value) { if(value >= 0) return value; return (~0L >>> 32) & value; }
We need to use normalize() function because addresses between 2^31 and 2^32 will be automatically converted to negative integers, i.e. stored in complement form. Let’s test it on 32-bit JVM (JDK 6 or 7):
// sizeOf(new MyStructure()) gives the following results: class MyStructure { } // 8: 4 (start marker) + 4 (pointer to class) class MyStructure { int x; } // 16: 4 (start marker) + 4 (pointer to class) + 4 (int) + 4 stuff bytes to align structure to 64-bit blocks class MyStructure { int x; int y; } // 16: 4 (start marker) + 4 (pointer to class) + 2*4
This function will not work for array objects, because _layout_helper field has another meaning in that case. Although it is still possible to generalize sizeOf() to support arrays.
Unsafe allows to allocate and deallocate memory explicitly via allocateMemory and freeMemory methods. Allocated memory is not under GC control and not limited by maximum JVM heap size. In general, such functionality is safely available via NIO’s off-heap bufferes. But the interesting thing is that it is possible to map standard Java reference to off-heap memory:
MyStructure structure = new MyStructure(); // create a test object structure.x = 777; long size = sizeOf(structure); long offheapPointer = getUnsafe().allocateMemory(size); getUnsafe().copyMemory( structure, // source object 0, // source offset is zero - copy an entire object null, // destination is specified by absolute address, so destination object is null offheapPointer, // destination address size ); // test object was copied to off-heap Pointer p = new Pointer(); // Pointer is just a handler that stores address of some object long pointerOffset = getUnsafe().objectFieldOffset(Pointer.class.getDeclaredField("pointer")); getUnsafe().putLong(p, pointerOffset, offheapPointer); // set pointer to off-heap copy of the test object structure.x = 222; // rewrite x value in the original object System.out.println( ((MyStructure)p.pointer).x ); // prints 777 .... class Pointer { Object pointer; }
So, it is virtually possible to manually allocate and deallocate real objects, not only byte buffers. Of course, it’s a big question what may happen with GC after such cheats.
Imagine the situation when one has a method that takes a string as an argument, but it is necessary to pass some extra payload. There are at least two standard ways to do it in Java: put payload to thread local or use static field. With Unsafe another two possibilities appears: pass payload address as a string and inherit payload class from String class. The first approach is pretty close to what we see in the previous section – one just need obtain payload address using Pointer and create a new Pointer to payload inside the called method. In other words, any argument that can carrier an address can be used as analog of void* in C. In order to explore the second approach we start with the following code which is compilable, but obviously produces ClassCastException in run time:
Carrier carrier = new Carrier(); carrier.secret = 777; String message = (String)(Object)carrier; // ClassCastException handler( message ); ... void handler(String message) { System.out.println( ((Carrier)(Object)message).secret ); } ... class Carrier { int secret; }
To make it work, one need to modify Carrier class to simulate inheritance from String. A list of superclasses is stored in Carrier class structure starting from position 28, as it shown in the figure. Pointer to object goes first and pointer to Carrier itself goes after it (at position 32) since Carrier is inherited from Object directly. In principle, it is enough to add the following code before the line that casts Carrier to String:
long carrierClassAddress = normalize( unsafe.getInt(carrier, 4L) ); long stringClassAddress = normalize( unsafe.getInt("", 4L) ); unsafe.putAddress(carrierClassAddress + 32, stringClassAddress); // insert pointer to String class to the list of Carrier's superclasses
Now cast works fine. Nevertheless, this transformation is not correct and violates VM contracts. More careful approach should include more steps:
sun.misc.Unsafe provides almost unlimited capabilities for exploring and modification of VM’s runtime data structures. Despite the fact that these capabilities are almost inapplicable in Java development itself, Unsafe is a great tool for anyone who want to study HotSpot VM without C++ code debugging or need to create ad hoc profiling instruments.
Problem Statement: There is a number of documents where each document is a set of terms. It is required to calculate a total number of occurrences of each term in all documents. Alternatively, it can be an arbitrary function of the terms. For instance, there is a log file where each record contains a response time and it is required to calculate an average response time.
Solution:
Let start with something really simple. The code snippet below shows Mapper that simply emit “1” for each term it processes and Reducer that goes through the lists of ones and sum them up:
class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
The obvious disadvantage of this approach is a high amount of dummy counters emitted by the Mapper. The Mapper can decrease a number of counters via summing counters for each document:
class Mapper method Map(docid id, doc d) H = new AssociativeArray for all term t in doc d do H{t} = H{t} + 1 for all term t in H do Emit(term t, count H{t})
In order to accumulate counters not only for one document, but for all documents processed by one Mapper node, it is possible to leverage Combiners:
class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Combiner method Combine(term t, [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
Log Analysis, Data Querying
Problem Statement: There is a set of items and some function of one item. It is required to save all items that have the same value of function into one file or perform some other computation that requires all such items to be processed as a group. The most typical example is building of inverted indexes.
Solution:
The solution is straightforward. Mapper computes a given function for each item and emits value of the function as a key and item itself as a value. Reducer obtains all items grouped by function value and process or save them. In case of inverted indexes, items are terms (words) and function is a document ID where the term was found.
Inverted Indexes, ETL
Problem Statement: There is a set of records and it is required to collect all records that meet some condition or transform each record (independently from other records) into another representation. The later case includes such tasks as text parsing and value extraction, conversion from one format to another.
Solution: Solution is absolutely straightforward – Mapper takes records one by one and emits accepted items or their transformed versions.
Log Analysis, Data Querying, ETL, Data Validation
Problem Statement: There is a large computational problem that can be divided into multiple parts and results from all parts can be combined together to obtain a final result.
Solution: Problem description is split in a set of specifications and specifications are stored as input data for Mappers. Each Mapper takes a specification, performs corresponding computations and emits results. Reducer combines all emitted parts into the final result.
There is a software simulator of a digital communication system like WiMAX that passes some volume of random data through the system model and computes error probability of throughput. Each Mapper runs simulation for specified amount of data which is 1/Nth of the required sampling and emit error rate. Reducer computes average error rate.
Physical and Engineering Simulations, Numerical Analysis, Performance Testing
Problem Statement: There is a set of records and it is required to sort these records by some rule or process these records in a certain order.
Solution: Simple sorting is absolutely straightforward – Mappers just emit all items as values associated with the sorting keys that are assembled as function of items. Nevertheless, in practice sorting is often used in a quite tricky way, that’s why it is said to be a heart of MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping.
Sorting in MapReduce is originally intended for sorting of the emitted key-value pairs by key, but there exist techniques that leverage Hadoop implementation specifics to achieve sorting by values. See this blog for more details.
It is worth noting that if MapReduce is used for sorting of the original (not intermediate) data, it is often a good idea to continuously maintain data in sorted state using BigTable concepts. In other words, it can be more efficient to sort data once during insertion than sort them for each MapReduce query.
ETL, Data Analysis
Problem Statement: There is a network of entities and relationships between them. It is required to calculate a state of each entity on the basis of properties of the other entities in its neighborhood. This state can represent a distance to other nodes, indication that there is a neighbor with the certain properties, characteristic of neighborhood density and so on.
Solution: A network is stored as a set of nodes and each node contains a list of adjacent node IDs. Conceptually, MapReduce jobs are performed in iterative way and at each iteration each node sends messages to its neighbors. Each neighbor updates its state on the basis of the received messages. Iterations are terminated by some condition like fixed maximal number of iterations (say, network diameter) or negligible changes in states between two consecutive iterations. From the technical point of view, Mapper emits messages for each node using ID of the adjacent node as a key. As result, all messages are grouped by the incoming node and reducer is able to recompute state and rewrite node with the new state. This algorithm is shown in the figure below:
class Mapper method Map(id n, object N) Emit(id n, object N) for all id m in N.OutgoingRelations do Emit(id m, message getMessage(N)) class Reducer method Reduce(id m, [s1, s2,...]) M = null messages = [] for all s in [s1, s2,...] do if IsObject(s) then M = s else // s is a message messages.add(s) M.State = calculateState(messages) Emit(id m, item M)
It should be emphasized that state of one node rapidly propagates across all the network of network is not too sparse because all nodes that were “infected” by this state start to “infect” all their neighbors. This process is illustrated in the figure below:
Problem Statement: This problem is inspired by real life eCommerce task. There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans). End-of-line category is either available (contains products) or not. Some high level category is available if there is at least one available end-of-line category in its subtree. The goal is to calculate availabilities for all categories if availabilities of end-of-line categories are know.
Solution: This problem can be solved using the framework that was described in the previous section. We define getMessage and calculateState methods as follows:
class N State in {True = 2, False = 1, null = 0}, initialized 1 or 2 for end-of-line categories, 0 otherwise method getMessage(object N) return N.State method calculateState(state s, data [d1, d2,...]) return max( [d1, d2,...] )
Problem Statement: There is a graph and it is required to calculate distance (a number of hops) from one source node to all other nodes in the graph.
Solution: Source node emits 0 to all its neighbors and these neighbors propagate this counter incrementing it by 1 during each hope:
class N State is distance, initialized 0 for source node, INFINITY for all other nodes method getMessage(N) return N.State + 1 method calculateState(state s, data [d1, d2,...]) min( [d1, d2,...] )
This algorithm was suggested by Google to calculate relevance of a web page as a function of authoritativeness (PageRank) of pages that have links to this page. The real algorithm is quite complex, but in its core it is just a propagation of weights between nodes where each node calculates its weight as a mean of the incoming weights:
class N State is PageRank method getMessage(object N) return N.State / N.OutgoingRelations.size() method calculateState(state s, data [d1, d2,...]) return ( sum([d1, d2,...]) )
It is worth mentioning that the schema we use is too generic and doesn’t take advantage of the fact that state is a numerical value. In most of practical cases, we can perform aggregation of values on the Mapper side due to virtue of this fact. This optimization is illustrated in the code snippet below (for the PageRank algorithm):
class Mapper method Initialize H = new AssociativeArray method Map(id n, object N) p = N.PageRank / N.OutgoingRelations.size() Emit(id n, object N) for all id m in N.OutgoingRelations do H{m} = H{m} + p method Close for all id n in H do Emit(id n, value H{n}) class Reducer method Reduce(id m, [s1, s2,...]) M = null p = 0 for all s in [s1, s2,...] do if IsObject(s) then M = s else p = p + s M.PageRank = p Emit(id m, item M)
Graph Analysis, Web Indexing
Problem Statement: There is a set of records that contain fields F and G. Count the total number of unique values of filed F for each subset of records that have the same G (grouped by G).
The problem can be a little bit generalized and formulated in terms of faceted search:
Problem Statement: There is a set of records. Each record has field F and arbitrary number of category labels G = {G1, G2, …} . Count the total number of unique values of filed F for each subset of records for each value of any label. Example:
Record 1: F=1, G={a, b} Record 2: F=2, G={a, d, e} Record 3: F=1, G={b} Record 4: F=3, G={a, b} Result: a -> 3 // F=1, F=2, F=3 b -> 2 // F=1, F=3 d -> 1 // F=2 e -> 1 // F=2
Solution I:
The first approach is to solve the problem in two stages. At the first stage Mapper emits dummy counters for each pair of F and G; Reducer calculates a total number of occurrences for each such pair. The main goal of this phase is to guarantee uniqueness of F values. At the second phase pairs are grouped by G and the total number of items in each group is calculated.
Phase I:
class Mapper method Map(null, record [value f, categories [g1, g2,...]]) for all category g in [g1, g2,...] Emit(record [g, f], count 1) class Reducer method Reduce(record [g, f], counts [n1, n2, ...]) Emit(record [g, f], null )
Phase II:
class Mapper method Map(record [f, g], null) Emit(value g, count 1) class Reducer method Reduce(value g, counts [n1, n2,...]) Emit(value g, sum( [n1, n2,...] ) )
Solution II:
The second solution requires only one MapReduce job, but it is not really scalable and its applicability is limited. The algorithm is simple – Mapper emits values and categories, Reducer excludes duplicates from the list of categories for each value and increment counters for each category. The final step is to sum all counter emitted by Reducer. This approach is applicable if th number of record with the same f value is not very high and total number of categories is also limited. For instance, this approach is applicable for processing of web logs and classification of users – total number of users is high, but number of events for one user is limited, as well as a number of categories to classify by. It worth noting that Combiners can be used in this schema to exclude duplicates from category lists before data will be transmitted to Reducer.
class Mapper method Map(null, record [value f, categories [g1, g2,...] ) for all category g in [g1, g2,...] Emit(value f, category g) class Reducer method Initialize H = new AssociativeArray : category -> count method Reduce(value f, categories [g1, g2,...]) [g1', g2',..] = ExcludeDuplicates( [g1, g2,..] ) for all category g in [g1', g2',...] H{g} = H{g} + 1 method Close for all category g in H do Emit(category g, count H{g})
Log Analysis, Unique Users Counting
Problem Statement: There is a set of tuples of items. For each possible pair of items calculate a number of tuples where these items co-occur. If the total number of items is N then N*N values should be reported.
This problem appears in text analysis (say, items are words and tuples are sentences), market analysis (customers who buy this tend to also buy that). If N*N is quite small and such a matrix can fit in the memory of a single machine, then implementation is straightforward.
Pairs Approach
The first approach is to emit all pairs and dummy counters from Mappers and sum these counters on Reducer. The shortcomings are:
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] for all item j in [i1, i2,...] Emit(pair [i j], count 1) class Reducer method Reduce(pair [i j], counts [c1, c2,...]) s = sum([c1, c2,...]) Emit(pair[i j], count s)
Stripes Approach
The second approach is to group data by the first item in pair and maintain an associative array (“stripe”) where counters for all adjacent items are accumulated. Reducer receives all stripes for leading item i, merges them, and emits the same result as in the Pairs approach.
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] H = new AssociativeArray : item -> counter for all item j in [i1, i2,...] H{j} = H{j} + 1 Emit(item i, stripe H) class Reducer method Reduce(item i, stripes [H1, H2,...]) H = new AssociativeArray : item -> counter H = merge-sum( [H1, H2,...] ) for all item j in H.keys() Emit(pair [i j], H{j})
Text Analysis, Market Analysis
In this section we go though the main relational operators and discuss how these operators can implemented in MapReduce terms.
class Mapper method Map(rowkey key, tuple t) if t satisfies the predicate Emit(tuple t, null)
Projection is just a little bit more complex than selection, but we should use a Reducer in this case to eliminate possible duplicates.
class Mapper method Map(rowkey key, tuple t) tuple g = project(t) // extract required fields to tuple g Emit(tuple g, null) class Reducer method Reduce(tuple t, array n) // n is an array of nulls Emit(tuple t, null)
Mappers are fed by all records of two sets to be united. Reducer is used to eliminate duplicates.
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls Emit(tuple t, null)
Mappers are fed by all records of two sets to be intersected. Reducer emits only records that occurred twice. It is possible only if both sets contain this record because record includes primary key and can occur in one set only once.
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls if n.size() = 2 Emit(tuple t, null)
Let’s we have two sets of records – R and S. We want to compute difference R – S. Mapper emits all tuples and tag which is a name of the set this record came from. Reducer emits only records that came from R but not from S.
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S' class Reducer method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R' 'S'], or ['S', 'R'] if n.size() = 1 and n[1] = 'R' Emit(tuple t, null)
Grouping and aggregation can be performed in one MapReduce job as follows. Mapper extract from each tuple values to group by and aggregate and emits them. Reducer receives values to be aggregated already grouped and calculates an aggregation function. Typical aggregation functions like sum or max can be calculated in a streaming fashion, hence don’t require to handle all values simultaneously. Nevertheless, in some cases two phase MapReduce job may be required – see pattern Distinct Values as an example.
class Mapper method Map(null, tuple [value GroupBy, value AggregateBy, value ...]) Emit(value GroupBy, value AggregateBy) class Reducer method Reduce(value GroupBy, [v1, v2,...]) Emit(value GroupBy, aggregate( [v1, v2,...] ) ) // aggregate() : sum(), max(),...
Joins are perfectly possible in MapReduce framework, but there exist a number of techniques that differ in efficiency and data volumes they are oriented for. In this section we study some basic approaches. The references section contains links to detailed studies of join techniques.
This algorithm joins of two sets R and L on some key k. Mapper goes through all tuples from R and L, extracts key k from the tuples, marks tuple with a tag that indicates a set this tuple came from (‘R’ or ‘L’), and emits tagged tuple using k as a key. Reducer receives all tuples for a particular key k and put them into two buckets – for R and for L. When two buckets are filled, Reducer runs nested loop over them and emits a cross join of the buckets. Each emitted tuple is a concatenation R-tuple, L-tuple, and key k. This approach has the following disadvantages:
class Mapper method Map(null, tuple [join_key k, value v1, value v2,...]) Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ] ) class Reducer method Reduce(join_key k, tagged_tuples [t1, t2,...]) H = new AssociativeArray : set_name -> values for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays H{t.tag}.add(t.values) for all values r in H{'R'} // produce a cross-join of the two arrays for all values l in H{'L'} Emit(null, [k r l] )
In practice, it is typical to join a small set with a large one (say, a list of users with a list of log records). Let’s assume that we join two sets – R and L, R is relative small. If so, R can be distributed to all Mappers and each Mapper can load it and index by the join key. The most common and efficient indexing technique here is a hash table. After this, Mapper goes through tuples of the set L and joins them with the corresponding tuples from R that are stored in the hash table. This approach is very effective because there is no need in sorting or transmission of the set L over the network, but set R should be quite small to be distributed to the all Mappers.
class Mapper method Initialize H = new AssociativeArray : join_key -> tuple from R R = loadR() for all [ join_key k, tuple [r1, r2,...] ] in R H{k} = H{k}.append( [r1, r2,...] ) method Map(join_key k, tuple l) for all tuple r in H{k} Emit(null, tuple [k r l] )