Speeding Up Hadoop Builds Using Distributed Unit Tests

Posted on August 14, 2012

We recently worked with one of the Hadoop vendors on the continuous integration system for Hadoop core and other Hadoop-related projects such as Pig, Hive, and HBase. One of the challenges we faced was very slow automated tests: the full unit/integration test suite takes more than 2 hours for Hadoop core and more than 9 hours for Apache Pig. Although there are different ways to alleviate this problem (dividing tests into suites, optimizing tests by tweaking timeouts and sleeps, etc.), we decided to start with a quick solution that immediately and drastically improves CI efficiency: distributed parallel test execution. In this article I describe a technique we used to speed up a Pig build from 9 hours to 1 hour 30 minutes using 6 Jenkins nodes. The technique is generic and can be considered a general way to speed up Maven or Ant builds on the Jenkins CI server or other CI systems.

Solution Overview

Basically, the problem boils down to the following. There are a number of Jenkins slave nodes, and we have to split all JUnit tests into batches, run the batches in parallel on the available slaves, and aggregate the test results into a single report. The last two tasks (parallel execution and aggregation) can be solved using built-in Jenkins functionality, namely multi-configuration jobs, also known as matrix builds. A multi-configuration job allows one to configure a standard Jenkins job and specify a set of slave servers this job is to be executed on. Jenkins is capable of running an instance of the job on all specified slaves in parallel, passing the slave ID as a build parameter, and aggregating JUnit test results into a single report. On our build server, the configuration matrix for a job is as simple as this:
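Conceptually, the matrix consists of a single slave axis, so Jenkins spawns one instance of the job per listed node. A rough sketch of that configuration (the node names match the bash snippet later in the article):

```text
Configuration Matrix
  Axis: Slaves (label expression)
    Values: Slave-Alpha, Slave-Beta, Slave-Gamma,
            Slave-Delta, Slave-Epsilon, Slave-Zeta
```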

Test splitting is a trickier task. A straightforward approach is to obtain a list of test cases and cut it into equal pieces. This is definitely better than nothing, but execution time can vary significantly from batch to batch, especially in the presence of long-running tests. Our preliminary experiments showed that parallelizing Pig builds in such a way is not very efficient: some batches can run two or more times slower than others. To cope with this issue we decided to collect statistics about test durations and assemble batches such that the difference between their expected execution times is minimal and, consequently, the total build time is minimal. The next section is devoted to the implementation details of this approach.

Build Steps on Jenkins

One of our goals was to keep the implementation as simple as possible, so we came up with a design where each node executes a number of steps sequentially (as a single script) and independently from the other nodes. The only information this script receives from the Jenkins server is a node ID. Each instance of the multi-configuration job on each node includes the following steps:

  1. A list of available JUnit tests is obtained.
  2. Statistics about previous test runs are loaded from the central store.
  3. Available tests are divided into batches according to the statistics.
  4. A batch is selected according to the node ID and submitted to Ant/Maven as a build parameter.
  5. JUnit reports are parsed, and test statistics are extracted and saved to the central shared store.

In this section a Python implementation of each step is shown in a simplified form; details like error handling and logging are omitted for the sake of readability.

First, we prepare an initial list of tests by scanning sources in the workspace:

#[ COLLECT A TEST POOL
import fnmatch, os, re

test_pool = set()
for root, dirnames, filenames in os.walk("./test"):
    for filename in fnmatch.filter(filenames, 'Test*.java'):
        # extract the test class name from the file name, e.g. TestFoo
        match = re.search(r"(Test.*)\.java", filename)
        test_pool.add(match.group(1))
#]

Second, we load the test statistics from the shared store. We use MySQL as the database, but one could also use a version control system to store the statistics alongside the sources. The statistics are initially empty.
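The store itself needs only a single table. A minimal sketch of a schema that fits the queries below (the column types are our assumption; any equivalent layout works as long as (job_name, test_name) is a unique key, so that REPLACE INTO behaves as an upsert):

```sql
CREATE TABLE test_stat (
    job_name  VARCHAR(128) NOT NULL,
    test_name VARCHAR(128) NOT NULL,
    duration  FLOAT        NOT NULL,  -- last observed run time, in seconds
    PRIMARY KEY (job_name, test_name) -- lets REPLACE INTO act as an upsert
);
```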

#[ LOAD TEST STATISTICS
import MySQLdb

job_name = "Pig_gd-branch-0.9"
db = MySQLdb.connect(...)
cursor = db.cursor()
# query parameters must be passed as a tuple
cursor.execute("SELECT test_name, duration FROM test_stat WHERE job_name=%s", (job_name,))
test_statistics = dict(cursor.fetchall())
db.close()
#]

The third step is the scheduling step that selects the tests to be executed on the current node. We have to split the test pool into a fixed number of disjoint batches such that the difference between their execution times is minimal. We don't need an optimal solution; a simple greedy algorithm is good enough in practice. This step produces a set of files with the test names:

import random

random.seed(1234) # fix the seed to produce identical results on all nodes

#[ PREPARE SPLITS, GREEDY ALGORITHM
test_splits = [[] for i in range(SPLIT_FACTOR)]
test_times = [0] * SPLIT_FACTOR
# process longer tests first (longest-processing-time-first heuristic)
for test in sorted(test_pool, key=lambda test: -test_statistics.get(test, 0)):
    # select the split with the minimal expected execution time
    split_index = test_times.index(min(test_times))
    test_duration = test_statistics.get(test, 0)
    if not test_duration: # if statistics are unavailable, select a random split
        split_index = random.randint(0, SPLIT_FACTOR - 1)
    test_splits[split_index].append(test)
    test_times[split_index] += test_duration

for split_id, split in enumerate(test_splits):
    with open(base_dir + 'upar-split.%d' % split_id, 'w') as f:
        for test in split: # write Ant's include mask to a file
            f.write("**/%s.java\n" % test)
#]
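To make the balancing behavior concrete, here is a tiny self-contained illustration of the same longest-first heuristic; the test names and durations are made up for the example:

```python
# Toy illustration of the greedy longest-first split with SPLIT_FACTOR = 3.
# The test names and durations below are purely illustrative.
SPLIT_FACTOR = 3
test_statistics = {
    "TestParser": 600, "TestStorage": 480, "TestJoin": 450,
    "TestUdf": 300, "TestLoad": 120, "TestTypes": 90,
}

test_splits = [[] for _ in range(SPLIT_FACTOR)]
test_times = [0] * SPLIT_FACTOR
for test in sorted(test_statistics, key=lambda t: -test_statistics[t]):
    # place the test into the currently lightest split
    split_index = test_times.index(min(test_times))
    test_splits[split_index].append(test)
    test_times[split_index] += test_statistics[test]

print(test_times)  # [690, 600, 750] -- close to the ideal 680 per batch
```

With real statistics the same loop keeps every batch close to the average, which is why the slowest batch no longer dominates the build time.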

As soon as the splits are ready, the slave name is mapped to a batch ID and the build is executed for this batch (fortunately, Pig's build system allows submitting a file with test filters as a build parameter). A similar thing can be done for Maven builds. The following piece of bash code does this part of the work:

case $SLAVEID in
Slave-Alpha)   JOBID=0;; # Slave-Alpha is a Jenkins node ID
Slave-Beta)    JOBID=1;;
Slave-Gamma)   JOBID=2;;
Slave-Delta)   JOBID=3;;
Slave-Epsilon) JOBID=4;;
Slave-Zeta)    JOBID=5;;
esac
ant -Dtest.junit.output.format=xml clean test -Dtest.all.file=upar-split.${JOBID}
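For Maven projects, a comparable effect can be achieved with the Surefire plugin's includes file, which accepts the same kind of class patterns (a sketch; the surefire.includesFile property requires maven-surefire-plugin 2.13 or later):

```shell
# assumes maven-surefire-plugin >= 2.13, which understands surefire.includesFile
mvn clean test -Dsurefire.includesFile=upar-split.${JOBID}
```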

The final step is to parse test results and update test statistics in the DB. This is also quite trivial:

#[ UPDATE TEST STATISTICS
import glob, os, re
import MySQLdb

db = MySQLdb.connect(...)
cursor = db.cursor()
path = "./build/test/logs/"
for infile in glob.glob(os.path.join(path, 'TEST-*.xml')):
    with open(infile) as f:
        text = f.read()
    # extract the suite execution time from the JUnit XML report
    time = re.search(r"<testsuite[^>]*time=\"([0-9.]+)\"", text)
    test_name = re.search(r"TEST-.*?(Test\w*)\.xml", infile).group(1)
    cursor.execute(
        "REPLACE INTO test_stat(job_name,test_name,duration) VALUES(%s,%s,%s)",
        (job_name, test_name, float(time.group(1))))
db.commit() # REPLACE would otherwise be rolled back on close
db.close()
#]

Results

According to our experiments, the described technique allows one to achieve a very even load distribution among the nodes and, consequently, to minimize the total build time. An example of the build duration distribution for the Pig build is shown in the screenshot below (the monolithic build takes more than 9 hours):

It should be noted that the real production implementation takes care of a few more issues:

  • Split stability. Jenkins nodes can differ in performance, and large changes in the test-to-node mapping can lead to unpredictable results. For this reason it is preferable to have a relatively stable mapping procedure, i.e., changes in the execution time of a few tests should not lead to completely new batches. This can be achieved by using thresholds and deliberate coarsening of the statistics used in the computations.
  • Cohesion of artifacts. All instances of the multi-configuration job are executed in parallel and work independently. It is theoretically possible for two nodes to check out different revisions of artifacts or sources and, consequently, start with different test pools. This can be alleviated in multiple ways, including distribution of the test pool via the central store.
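The coarsening mentioned under split stability can be as simple as rounding durations up to a fixed bucket before the greedy pass; a sketch (the 30-second granularity is an arbitrary choice):

```python
import math

BUCKET = 30  # seconds; an arbitrary coarsening granularity

def coarsen(duration, bucket=BUCKET):
    """Round a test duration up to the nearest bucket so that small
    run-to-run fluctuations do not reshuffle the batches."""
    return int(math.ceil(duration / float(bucket))) * bucket

print(coarsen(1), coarsen(29.5), coarsen(30), coarsen(31))  # 30 30 30 60
```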
Posted in: Case Studies, Jenkins