Latest Posts

on 10 Mar 2015 (updated January 10, 2018)

Data Mining Problems in Retail

Retail is one of the most important business domains for data science and data mining applications because of its prolific data and numerous optimization problems such as optimal prices, discounts, recommendations, and stock levels that can be solved using data analysis methods. The rise of omni-channel retail that integrates marketing, customer relationship management, and inventory […]

on 20 Aug 2013 (updated July 7, 2017)

In-Stream Big Data Processing

by Ilya Katsov

The shortcomings of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing are an immediate need in many practical applications. In recent years, this idea got a lot of traction and a whole bunch of solutions […]

on 18 Sep 2012 (updated October 11, 2012)

Distributed Algorithms in NoSQL Databases

by Ilya Katsov

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that the NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche […]

on 14 Aug 2012

Speeding Up Hadoop Builds Using Distributed Unit Tests

by Ilya Katsov

We recently worked with one of the Hadoop vendors on the continuous integration system for Hadoop core and other Hadoop-related projects like Pig, Hive, and HBase. One of the challenges we faced was very slow automated tests: the full unit/integration test suite takes more than 2 hours for Hadoop core and more than 9 hours for […]

on 05 Jun 2012

Fast Intersection of Sorted Lists Using SSE Instructions

by Ilya Katsov

Intersection of sorted lists is a cornerstone operation in many applications including search engines and databases because indexes are often implemented using different types of sorted structures. At GridDynamics, we recently worked on a custom database for realtime web analytics where fast intersection of very large lists of IDs was a must for good performance. From a functional […]

on 01 May 2012 (updated August 19, 2012)

Probabilistic Data Structures for Web Analytics and Data Mining

by Ilya Katsov

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and […]

on 02 Apr 2012 (updated April 6, 2012)

Hierarchical Navigation and Faceted Search on Top of Oracle Coherence

by Ilya Katsov

Some time ago I participated in the design of a backend for a large online retailer. From the business logic point of view, this was a pretty typical eCommerce service for hierarchical and faceted navigation, although not without peculiarities, but high performance requirements led us to a quite advanced architecture and technical design. In particular, we […]

on 01 Mar 2012 (updated October 16, 2012)

NoSQL Data Modeling Techniques

by Ilya Katsov

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well-studied both in practice and theory because specific non-functional properties are often the main justification for NoSQL usage and fundamental results on distributed systems like the CAP theorem apply well to NoSQL systems. At the same time, NoSQL […]

on 02 Feb 2012 (updated May 2, 2012)

Tricks with Direct Memory Access in Java

by Ilya Katsov

Java was initially designed as a safe, managed environment. Nevertheless, the Java HotSpot VM contains a “backdoor” that provides a number of low-level operations to manipulate memory and threads directly. This backdoor – sun.misc.Unsafe – is widely used by the JDK itself in packages like java.nio or java.util.concurrent. It is hard to imagine a Java developer who uses this backdoor in […]

on 01 Feb 2012 (updated May 4, 2012)

MapReduce Patterns, Algorithms, and Use Cases

by Ilya Katsov

In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found on the web or in scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop MapReduce model with Mappers, Reducers, Combiners, Partitioners, and sorting. This […]