Statistical analysis and mining of huge multi-terabyte data sets are common tasks nowadays, especially in areas like web analytics and Internet advertising. Analyzing such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and […]
January 1, 2012
Greenplum Database is an interesting solution for data mining and data warehousing. In this post I focus on the MapReduce capabilities of Greenplum 4.1 and try to figure out how efficient its implementation is. Simple MapReduce Job: Let us consider a simplified version of a real-life problem that is typically solved using the MapReduce technique: analysis […]
May 1, 2012