﻿ Benchmarking - Algorithms in a Nutshell: A Practical Guide (2016) ﻿

## Algorithms in a Nutshell: A Practical Guide (2016)

### Appendix A. Benchmarking

Each algorithm in this book is accompanied by data about its performance. Because it’s important to use the right benchmarks to get accurate performance, we present our infrastructure to evaluate algorithm performance in this appendix. This should also help to address any questions or doubts you might have concerning the validity of our approach. We try to explain the precise means by which empirical data is computed, in order to enable you both to verify that the results are accurate and to understand where the assumptions are appropriate given the context in which the algorithm is intended to be used.

There are numerous ways by which algorithms can be analyzed. Chapter 2 presented a theoretical, formal treatment, introducing the concepts of worst-case and average-case analysis. These theoretic results can be empirically evaluated in some cases, but not all. For example, consider evaluating the performance of an algorithm to sort 20 numbers. There are 2.43*1018 permutations of these 20 numbers, and we cannot simply exhaustively evaluate each of these permutations to compute the average case. Additionally, we cannot compute the average by measuring the time to sort all of these permutations. We must rely on statistical measures to assure ourselves we have properly computed the expected performance time of the algorithm.

Statistical Foundation

This chapter focuses on essential points for evaluating the performance of algorithms. Interested readers should consult any of the large number of available textbooks on statistics for more information on the relevant statistical information used to produce the empirical measurements in this book.

To compute the performance of an algorithm, we construct a suite of T independent trials for which the algorithm is executed. Each trial is intended to execute an algorithm on an input problem of size n. Some effort is made to ensure these trials are all reasonably equivalent for the algorithm. When the trials are actually identical, the intent of the trial is to quantify the variance of the underlying implementation of the algorithm. This may be suitable, for example, if it is too costly to compute a large number of independent equivalent trials.

The suite is executed and millisecond-level timings are taken before and after the observable behavior. When the code is written in Java, the system garbage collector is invoked immediately prior to launching the trial; although this effort can’t guarantee that the garbage collector does not execute during the trial, it may reduce this risk of spending extra time unrelated to the algorithm. From the full set of T recorded times, the best and worst performing times are discarded as being “outliers.” The remaining T − 2 time records are averaged, and a standard deviation is computed using the following formula:

where xi is the time for an individual trial and x is the average of the T − 2 trials. Note here that n is equal to T − 2, so the denominator within the square root is T − 3. Calculating averages and standard deviations will help predict future performance, based on Table A-1, which shows the probability (between 0 and 1) that the actual value will be within the range [x – k*σx + k*σ], where σ represents the standard deviation computed in the equation just shown. The probability values become confidence intervals that declare the confidence we have in a prediction.

 k Probability 1 0.6827 2 0.9545 3 0.9973 4 0.9999 5 1 Table A-1.Standard deviation table

For example, in a randomized trial, it is expected that 68.27% of the time the result will fall within the range [x–σx+σ].

When reporting results, we never present numbers with greater than four decimal digits of accuracy, so we don’t give the mistaken impression that we believe the accuracy of our numbers extends any farther. This process will convert a computation such as 16.897986 into the reported number 16.8980.

Example

Assume we wanted to benchmark the addition of the numbers from 1 to n. An experiment is designed to measure the times for n = 8,000,000 to n = 16,000,000 in increments of two million. Because the problem is identical for n and doesn’t vary, we execute for 30 trials to eliminate as much variability as possible.

The hypothesis is that the time to complete the sum will vary directly in relation to n. We show three programs that solve this problem—in Java, C, and Python—and present the benchmark infrastructure by showing how it is used.

Java Benchmarking Solutions

On Java test cases, the current system time (in milliseconds) is determined immediately prior to, and after, the execution of interest. The code in Example A-1 measures the time it takes to complete the task. In a perfect computer, the 30 trials should all require exactly the same amount of time. Of course, this is unlikely to happen, because modern operating systems have numerous background processing tasks that share the same CPU on which the performance code executes.

Example A-1. Java example to time execution of task

public class Main {

public static void main (String[] args) {

TrialSuite ts = new TrialSuite();

for (long len = 8000000; len <= 16000000; len += 2000000) {

for (int i = 0; i < 30; i++) {

System.gc();

long now = System.currentTimeMillis();

/** Task to be timed. */

long sum = 0;

for (int x = 1; x <= len; x++) {

sum += x;

}

long end = System.currentTimeMillis();

ts.addTrial(len, now, end);

}

}

System.out.println (ts.computeTable());

}

}

The TrialSuite class stores trials by their size. After all the trials have been added to the suite, the resulting table is computed. To do this, the running times are added together to find the total sum, minimum value, and maximum value. As described earlier, the minimum and maximum values are removed from the set when computing the average and standard deviation.

Linux Benchmarking Solutions

For C test cases, we developed a benchmarking library to be linked with the code to test. In this section, we briefly describe the essential aspects of the timing code and refer the interested reader to the code repository for the full source.

Primarily created for testing sort routines, the C-based infrastructure can be linked to existing source code. The timing API takes over responsibility for parsing the command-line arguments:

usage: timing [-n NumElements] [-s seed] [-v] [OriginalArguments]

-n declares the problem size         [default: 100,000]

-v verbose output                    [default: false]

-s # set the seed for random values  [default: no seed]

-h print usage information

The timing library assumes a problem will be attempted whose input size is defined by the [-n] flag. To produce repeatable trials, the random seed can be set with [-s seed]. To link with the timing library, a test case provides the following functions:

void problemUsage()

Report to the console the set of [OriginalArguments] supported by the specific code. Note that the timing library parses the declared timing parameters, and remaining arguments are passed along to the prepareInput function.

void prepareInput (int size, int argc, char **argv)

For some problems, this function is responsible for building the input set to be processed within the execute method. Note that this information is not passed directly to execute via a formal argument, but instead is stored as a static variable within the test case.

void postInputProcessing()

If any validation is needed after the input problem is solved, that code can execute here.

void execute()

This method contains the body of code to be timed. Because the code is run once as part of the evaluation time, it can have a small impact on the reported time. When the execute method is empty, the overhead is considered to have no impact on the overall reporting.

The test case in Example A-2 shows the code task for the addition example.

Example A-2. Task describing addition of n numbers

extern int numElements;     /* size of n */

void problemUsage() { /* none */ }

void prepareInput() { /* none */ }

void postInputProcessing() { /* None */ }

void execute() {

int x;

long sum = 0;

for (x = 1; x <= numElements; x++) { sum += x; }

}

Each execution of the C function corresponds to a single trial, so we have a set of shell scripts to repeatedly execute the code being tested to generate statistics. For each suite, a configuration file name config.rc is created to represent the trial suite run. Example A-3 shows the file for the value-based sorting used in Chapter 4.

Example A-3. Sample configuration file to compare sort executions

# configure to use these BINS

BINS=./Insertion ./Qsort_2_6_11 ./Qsort_2_6_6 ./Qsort_straight

# configure suite

TRIALS=10

LOW=1

HIGH=16384

INCREMENT=*2

This specification file declares that the set of executables will be three variations of QuickSort with one Insertion Sort. The suite consists of problem sizes ranging from n = 1 to n = 16,384, where n doubles after each run. For each problem size, 10 trials are executed. The best and worst performers are discarded, and the resulting generated table will have the averages (and standard deviations) of the remaining eight trials.

Example A-4 contains the compare.sh script that generates an aggregate set of information for a particular problem size n.

Example A-4. compare.sh benchmarking script

#!/bin/bash

#

#  This script expects TWO arguments:

#     \$1  -- size of problem n

#     \$2  -- number of trials to execute

#  This script reads its parameters from the \$CONFIG configuration file

#    BINS    set of executables to execute

#    EXTRAS  extra command line arguments to use when executing them

#

#  CODE is set to directory where these scripts are to be found

CODE=`dirname \$0`

SIZE=20

NUM_TRIALS=10

if [ \$# -ge 1 ]

then

SIZE=\$1

NUM_TRIALS=\$2

fi

if [ "x\$CONFIG" = "x" ]

then

echo "No Configuration file (\\$CONFIG) defined"

exit 1

fi

if [ "x\$BINS" = "x" ]

then

if [ -f \$CONFIG ]

then

BINS=`grep "BINS=" \$CONFIG | cut -f2- -d'='`

EXTRAS=`grep "EXTRAS=" \$CONFIG | cut -f2- -d'='`

fi

if [ "x\$BINS" = "x" ]

then

echo "no \\$BINS variable and no \$CONFIG configuration "

echo "Set \\$BINS to a space-separated set of executables"

fi

fi

echo "Report: \$BINS on size \$SIZE"

echo "Date: `date`"

echo "Host: `hostname`"

RESULTS=/tmp/compare.\$\$

for b in \$BINS

do

TRIALS=\$NUM_TRIALS

# start with number of trials followed by totals (one per line)

echo \$NUM_TRIALS > \$RESULTS

while [ \$TRIALS -ge 1 ] do

\$b -n \$SIZE -s \$TRIALS \$EXTRAS | grep secs | sed 's/secs//' >> \$RESULTS

TRIALS=\$((TRIALS-1))

done

# compute average/stdev

RES=`cat \$RESULTS | \$CODE/eval`

echo "\$b \$RES"

rm -f \$RESULTS

done

compare.sh makes use of a small C program, eval, that computes the average and standard deviation using the method described at the start of this chapter. This compare.sh script is repeatedly executed by a manager script, suiteRun.sh, which iterates over the desired input problem sizes specified within the config.rc file, as shown in Example A-5.

Example A-5. suiteRun.sh benchmarking script

#!/bin/bash

CODE=`dirname \$0`

# if no args then use default config file, otherwise expect it

if [ \$# -eq 0 ]

then

CONFIG="config.rc"

else

CONFIG=\$1

echo "Using configuration file \$CONFIG..."

fi

# export so it will be picked up by compare.sh

export CONFIG

# pull out information

if [ -f \$CONFIG ]

then

BINS=`grep "BINS=" \$CONFIG | cut -f2- -d'='`

TRIALS=`grep "TRIALS=" \$CONFIG | cut -f2- -d'='`

LOW=`grep "LOW=" \$CONFIG | cut -f2- -d'='`

HIGH=`grep "HIGH=" \$CONFIG | cut -f2- -d'='`

INCREMENT=`grep "INCREMENT=" \$CONFIG | cut -f2- -d'='`

else

echo "Configuration file (\$CONFIG) unable to be found."

exit −1

fi

# headers

HB=`echo \$BINS | tr ' ' ','`

echo "n,\$HB"

# compare trials on sizes from LOW through HIGH

SIZE=\$LOW

REPORT=/tmp/Report.\$\$

while [ \$SIZE -le \$HIGH ]

do

# one per \$BINS entry

\$CODE/compare.sh \$SIZE \$TRIALS | awk 'BEGIN{p=0} \

{if(p) { print \$0; }} \

/Host:/{p=1}' | cut -d' ' -f2 > \$REPORT

# concatenate with, all entries ONLY the average. The stdev is

# going to be ignored

# ------------------------------

VALS=`awk 'BEGIN{s=""}\

{s = s "," \$0 }\

END{print s;}' \$REPORT`

rm -f \$REPORT

echo \$SIZE \$VALS

# \$INCREMENT can be "+ NUM" or "* NUM", it works in both cases.

SIZE=\$((\$SIZE\$INCREMENT))

done

Python Benchmarking Solutions

The Python code in Example A-6 measures the performance of computing the addition problem. It uses the timeit module, a standard for measuring the execution time of Python code fragments and entire programs.

Example A-6. Python example to time execution of task

import timeit

def performance():

"""Demonstrate execution performance."""

n = 8000000

numTrials = 10

print ("n", "Add time")

while n <= 16000000:

setup = 'total=0'

code  = 'for i in range(' + str(n) + '): total += i'

add_total = min(timeit.Timer(code, setup=setup).repeat(5,numTrials))

print ("%d %5.4f " % (n, add_total ))

n += 2000000

if __name__ == '__main__':

performance()

The timeit module returns a list of values reflecting the execution time in seconds of the code fragment. By applying min to this list, we extract the best performance of any of these trials. The timeit module documentation explains the benefits of using this approach to benchmark Python programs.

Reporting

It is instructive to review the actual results of the performance of three different implementations (in different programming languages) of the same program when computed on the same platform. We present three tables (Table A-2Table A-4, and Table A-5), one each for Java, C, and Python. In each table, we present the millisecond results and a brief histogram table for the Java results.

 n average min max stdev 8,000,000 7.0357 7 12 0.189 10,000,000 8.8571 8 42 0.5245 12,000,000 10.5357 10 11 0.5079 14,000,000 12.4643 12 14 0.6372 16,000,000 14.2857 13 17 0.5998 Table A-2. Timing results of computations in Java

The aggregate behavior of Table A-2 is shown in detail as a histogram in Table A-3. We omit from the table rows that have only zero values; all nonzero values are shaded in the table.

 time (ms) 8,000,000 10,000,000 12,000,000 14,000,000 16,000,000 7 28 0 0 0 0 8 1 7 0 0 0 9 0 20 0 0 0 10 0 2 14 0 0 11 0 0 16 0 0 12 1 0 0 18 0 13 0 0 0 9 1 14 0 0 0 3 22 15 0 0 0 0 4 16 0 0 0 0 2 17 0 0 0 0 1 42 0 1 0 0 0 Table A-3. Individual breakdowns of timing results

To interpret these results for Java, we turn to statistics, referring to the confidence intervals described earlier. We assume the timing of each trial is independent. If we are asked to predict the performance of a proposed run for n = 12,000,000, observe that its average performance, x, is 12.619 and the standard deviation, σ, is 0.282. Consider the range of values [x – 2*σx + 2*σ], which covers values that are plus or minus two standard deviations from the average. As you can see from Table A-1, the probability of being in this range of [9.5199, 11.5515] is 95.45%.

 n average min max stdev 8,000,000 8.376 7.932 8.697 .213 10,000,000 10.539 9.850 10.990 .202 12,000,000 12.619 11.732 13.305 .282 14,000,000 14.681 13.860 15.451 .381 16,000,000 16.746 15.746 17.560 .373 Table A-4. Timing results (in milliseconds) of computations in C

A few years ago, there would have been noticeable differences in the execution times of these three programs. Improvements in language implementations (especially just-in-time compilation) and computing hardware allows them to converge on pretty much the same performance for this specific computation. The histogram results are not as informative, because the timing results include fractional milliseconds, whereas the Java timing strategy reports only integer values. Comparing more realistic programs would show greater differences between the programming languages.

 n Execution time (ms) 8,000,000 7.9386 10,000,000 9.9619 12,000,000 12.0528 14,000,000 14.0182 16,000,000 15.8646 Table A-5. Timing results of computations in Python

Precision

Instead of using millisecond-level timers, we could use nanosecond timers. On the Java platform, the only change in the earlier timing code would be to invoke System.nanoTime() instead of accessing the milliseconds. To understand whether there is any correlation between the millisecond and nanosecond timers, we changed the code to that shown in Example A-7.

Example A-7. Using nanosecond timers in Java

TrialSuite tsM = new TrialSuite();

TrialSuite tsN = new TrialSuite();

for (long len = 8000000; len <= 16000000; len += 2000000) {

for (int i = 0; i < 30; i++) {

long nowM = System.currentTimeMillis();

long nowN = System.nanoTime();

long sum = 0;

for (int x = 1; x <= len; x++) { sum += x; }

long endM = System.currentTimeMillis();

long endN = System.nanoTime();

tsM.addTrial(len, nowM, endM);

tsN.addTrial(len, nowN, endN);

}

}

System.out.println (tsM.computeTable());

System.out.println (tsN.computeTable());

Table A-2, shown earlier, contains the millisecond results of the timings, whereas Table A-6 contains the results when using the nanosecond timer in C, and Table A-7 shows the Java performance. For these computations, the results are quite accurate. Because we don’t find that using nanosecond-level timers adds much extra precision or accuracy, we continue to use millisecond-level timing results within the benchmark results reported in the algorithm chapters. We also continue to use milliseconds to avoid giving the impression that our timers are more accurate than they really are. Finally, nanosecond timers on Unix systems are not yet standardized, and there are times when we wanted to compare execution times across platforms, which is another reason we chose to use millisecond-level timers throughout this book.

 n average min max stdev 8,000,000 6970676 6937103 14799912 20067.5194 10,000,000 8698703 8631108 8760575 22965.5895 12,000,000 10430000 10340060 10517088 33381.1922 14,000,000 12180000 12096029 12226502 27509.5704 16,000,000 13940000 13899521 14208708 27205.4481 Table A-6. Results using nanosecond timers in C
 n average min max stdev 8,000,000 6961055 6925193 14672632 15256.9936 10,000,000 8697874 8639608 8752672 26105.1020 12,000,000 10438429 10375079 10560557 31481.9204 14,000,000 12219324 12141195 12532792 91837.0132 16,000,000 13998684 13862725 14285963 124900.6866 Table A-7. Results using nanosecond timers in Java
﻿

﻿