
24 Oct 2011, Monday, 11:34 pm GMT

Why measure programs written in different programming languages?
  1. To show working programs written in less familiar programming languages
  2. To show the least we should expect from performance comparisons
       (between different programming languages)
  3. To show how difficult it can be to make meaningful comparisons
       (between different programming languages)
 How to compare these measurements

There are 4 sets of up-to-date measurements. Measurements for the 4 different OS/machine combinations are shown on color-coded pages.

Compare the source code for 2 programs

Example: open a web browser window, go to Ubuntu™ : Intel® Q6600® one core, then select fannkuch-redux and select C GNU GCC in the drop-down menus.

Example: open a second web browser window, go to Ubuntu™ : Intel® Q6600® one core, then select fannkuch-redux and select C CINT in the drop-down menus.

Tile the web browser windows side-by-side.

Notice the slight changes made to the C program for CINT. Notice the huge difference in the Time-used measurements. Notice which GCC compiler options and CINT interpreter flags were used.

Compare performance for 3 or 4 or more programming languages

Example: on Ubuntu™ : Intel® Q6600® quad-core, select all benchmarks and select all languages in the drop-down menus.

Compare the performance of all the programs for one benchmark

Example: on x64 Ubuntu™ : Intel® Q6600® quad-core, select spectral-norm and select all languages in the drop-down menus.

Compare program speed and size for 2 language implementations

Example: on x64 Ubuntu™ : Intel® Q6600® one core, select all benchmarks and select Java 7 -server in the drop-down menus.

Example: on Ubuntu™ : Intel® Q6600® one core, select all benchmarks, select Java 7 -server and select Python 3 in the drop-down menus.

Compare Memory-used for all the benchmarks

Don't confuse differences in default memory allocation with differences in Memory-used when the task requires programs to allocate more than the default memory.

Example: select "Which programming language is best?" and set the Memory KB weight to 1 and the Time secs weight to 0.

Notice that some of the programs are written for multicore and allocate additional buffers to accumulate results from multiple processes.

Compare Code-used for all the benchmarks

Don't expect programming-in-the-large to show as big a difference in Code-used measurements as these tiny programming-in-the-small tasks.

"This paper [pdf The Effect of Language Choice on Revision Control Systems] compares one scripting language, Python, with C in the domain of revision control systems, as large working implementations exist for both languages. It finds no clear evidence that scripting languages produce smaller systems…"

Example: select "Which programming language is best?" and set the Code B weight to 1 and the Time secs weight to 0.

Notice that some of the programs are written for multicore and include code to distribute work across multiple threads (or processes).

Compare measurements of all the programs for one language

Example: select Java 7 -server in the drop-down menu.

 How programs were measured
The Process
  1. Each program was run and measured at the smallest input value, program output redirected to a file and compared to expected output. As long as the output matched expected output, the program was then run and measured at the next larger input value until measurements had been made at every input value.
  2. If the program gave the expected output within an arbitrary cutoff time (now 120 seconds) the program was measured again (5 more times) with output redirected to /dev/null.
  3. If the program didn't give the expected output within an arbitrary timeout (usually one hour) the program was forced to quit. If measurements at a smaller input value had been successful within an arbitrary cutoff time (now 120 seconds), the program was measured again (5 more times) at that smaller input value, with output redirected to /dev/null.
  4. The measurements shown on the website are either
    • within the arbitrary cutoff - the lowest time and highest memory use from 6 measurements
    • outside the arbitrary cutoff - the sole time and memory use measurement
  5. For sure, programs taking 4 and 5 hours were only measured once!
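The reporting rule in step 4 can be sketched in Python, the language of the measurement scripts. This is only an illustration under assumptions: the `Measurement` record and the `reported` and `repeat_count` functions are hypothetical names, not taken from the actual scripts, and treating the 120 second cutoff as inclusive is a guess.

```python
from typing import List, NamedTuple

CUTOFF_SECS = 120  # the "arbitrary cutoff time" quoted in the steps above


class Measurement(NamedTuple):
    """One run of one program at one input value (hypothetical record)."""
    time_secs: float
    memory_kb: int


def repeat_count(first_time_secs: float) -> int:
    """How many more times to measure after the first successful run.

    Assumption: the cutoff boundary is inclusive.
    """
    return 5 if first_time_secs <= CUTOFF_SECS else 0


def reported(runs: List[Measurement]) -> Measurement:
    """Pick the values to show for one program at one input value.

    Within the cutoff there are 6 runs: report the lowest time and the
    highest memory use. Outside the cutoff there is a single run, so the
    same min/max simply returns that sole measurement.
    """
    if not runs:
        raise ValueError("no measurements")
    return Measurement(
        time_secs=min(r.time_secs for r in runs),
        memory_kb=max(r.memory_kb for r in runs),
    )
```

Note that the reported time and the reported memory use may come from different runs of the same program.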
How did you measure Time-used?

Each program was run as a child-process of a Python script using Popen.

  • CPU secs: The script child-process usr+sys rusage time was taken using os.wait3
  • Elapsed secs: The time was taken before forking the child-process and after the child-process exits, using time.time()
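A minimal sketch of the two time measurements described above, assuming a Unix system (os.wait3 is not available on Windows) and assuming the measured program is the only child process of the script. The `measure` function name and its return shape are illustrative, not the actual script.

```python
import os
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd as a child process; return (cpu_secs, elapsed_secs, exit_code)."""
    start = time.time()                      # taken before the child is forked
    child = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    _pid, status, rusage = os.wait3(0)       # blocks until a child exits
    elapsed = time.time() - start            # taken after the child exits
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    child.returncode = exit_code             # already reaped, so tell Popen
    cpu = rusage.ru_utime + rusage.ru_stime  # usr + sys rusage time
    return cpu, elapsed, exit_code


if __name__ == "__main__":
    cpu, elapsed, code = measure([sys.executable, "-c", "sum(range(10**6))"])
    print("CPU secs %.3f  Elapsed secs %.3f  exit %d" % (cpu, elapsed, code))
```

Because the clock starts before the fork, both numbers include the startup time of the language runtime.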

Time measurements include program startup time - see ↓ What about Java?

On win32 -

How did you measure Memory-used?

By sampling GTop proc_mem for the program and its child processes every 0.2 seconds. Obviously those measurements are unlikely to be reliable for programs that run for less than 0.2 seconds.
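The sampling idea can be sketched as a polling loop. This sketch does not use GTop: as a stand-in it reads VmRSS from /proc on Linux, it watches a single process rather than the program and its child processes, and the names `proc_rss_kb` and `peak_memory_kb` are hypothetical.

```python
import time


def proc_rss_kb(pid):
    """Resident set size in KB read from /proc (Linux only), or None if gone."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except (FileNotFoundError, ProcessLookupError):
        pass
    return None  # process has exited (or has no resident memory left)


def peak_memory_kb(read_kb, interval=0.2):
    """Poll read_kb() every `interval` seconds until it returns None.

    Returns the highest sample seen, or 0 if the process finished
    before the first sample could be taken.
    """
    peak = 0
    while True:
        sample = read_kb()
        if sample is None:
            return peak
        peak = max(peak, sample)
        time.sleep(interval)
```

A caller might pass `lambda: proc_rss_kb(child.pid)` as `read_kb`. A program that exits between two samples contributes nothing after its last sample, which is why very short runs give unreliable numbers.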

On win32 - QueryInformationJobObject(hJob,JobObjectExtendedLimitInformation) PeakJobMemoryUsed

How did you measure Code-used?

We started with the source-code markup you can see, removed comments, removed duplicate whitespace characters, and then applied minimum GZip compression. The Code-used measurement is the size in bytes of that GZip compressed source-code file.
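A rough sketch of that calculation, under stated assumptions: comment removal here handles only C-style comments with a naive regular expression, "removed duplicate whitespace" is read as collapsing each run of whitespace to one space, and "minimum GZip compression" is read as compression level 1. The `code_used` function is illustrative, not the script actually used.

```python
import gzip
import re


def code_used(source: str) -> int:
    """Size in bytes of the gzip-compressed, comment-free source text."""
    # Assumption: C-style /* ... */ and // comments only. This naive regex
    # would also strip comment-like text inside string literals.
    without_comments = re.sub(r"/\*.*?\*/|//[^\n]*", "", source, flags=re.DOTALL)
    # Collapse each run of whitespace into a single space.
    squeezed = re.sub(r"\s+", " ", without_comments).strip()
    # Assumption: "minimum" compression is taken to mean compresslevel=1.
    return len(gzip.compress(squeezed.encode("utf-8"), compresslevel=1))
```

With this approach, comments and layout do not change the measurement, while repetitive code costs less than its raw length suggests.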

Thanks to Brian Hurt for the idea of using size of compressed source code instead of lines of code.

(Note: There is some evidence that complexity metrics don't provide any more information than SLoC or LoC.)

How did you measure ≈ CPU Load?

The GTop cpu idle and GTop cpu total were taken before forking the child-process and after the child-process exits. The percentages represent the proportion of cpu not-idle to cpu total for each core.

On win32 - GetSystemTimes UserTime and IdleTime were taken before forking the child-process and after the child-process exits. The percentage represents the proportion of TotalUserTime to UserTime+IdleTime (because that's like the percentage you'll see in Task Manager).
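The per-core arithmetic described for the GTop counters can be written out as follows. The counter readings are passed in as plain numbers, so nothing here calls GTop itself; `cpu_load` and `show` are hypothetical helper names.

```python
def cpu_load(before, after):
    """Per-core load as whole percentages.

    before/after: one (idle, total) pair of cpu counters per core,
    read before the child process is forked and after it exits.
    """
    loads = []
    for (idle0, total0), (idle1, total1) in zip(before, after):
        total = total1 - total0
        not_idle = total - (idle1 - idle0)
        loads.append(round(100.0 * not_idle / total) if total > 0 else 0)
    return loads


def show(loads):
    """Format the loads the way they appear in the ≈ CPU Load column."""
    return " ".join("%d%%" % p for p in loads)
```

For example, a core whose total counter advanced by 100 while its idle counter advanced by 73 was not-idle for 27% of the measured interval.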

 How to contribute programs
How much effort should I put into getting the program correct?

Do design-iteration on your machine, or in a language newsgroup. Only contribute programs which give correct results on your machine - diff the program output with the provided output file before you contribute the program.

How should I implement programs?

Prefer plain vanilla programs - after all we're trying to compare language implementations not programmer effort and skill. We'd like your programs to be easily viewable - so please format your code to fit in less than 80 columns (we don't measure lines-of-code!).

How should I implement data-input?

Programs are measured across a range of input-values; programs are expected to either take a single command-line parameter or read text from stdin.

(Look at what the other programs do.)

How should I implement data-output?

Programs should write to stdout. Program output is redirected to a log-file and diff'd with the expected output.

(Look at what the other programs do.)
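To check a program against these input and output conventions before contributing it, something like the following sketch may help. It assumes the program takes N as its single command-line parameter; `output_matches` and `EXAMPLE_CMD` are hypothetical names, and the example command is a stand-in, not a benchmark program.

```python
import subprocess
import sys

# Hypothetical stand-in program: prints twice its command-line parameter.
EXAMPLE_CMD = [sys.executable, "-c", "import sys; print(int(sys.argv[1]) * 2)"]


def output_matches(cmd, n, expected, timeout_secs=3600):
    """Run cmd with N as its single command-line parameter and compare
    everything it writes to stdout with the expected output (bytes)."""
    result = subprocess.run(
        list(cmd) + [str(n)],
        stdout=subprocess.PIPE,
        timeout=timeout_secs,  # raises subprocess.TimeoutExpired if exceeded
    )
    return result.returncode == 0 and result.stdout == expected
```

In practice `expected` would be the contents of the provided output file, read in binary mode so that line endings are compared exactly.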

How should I identify my program?

Include a header comment in the program like this:

/* The Computer Language Benchmarks Game
   http://shootout.alioth.debian.org/

   contributed by …
   modified by …
*/
How should I implement loops?

Don't manually unroll loops!

Finally! Use the Tracker to contribute programs

Attach the full source-code file of a tested program. Please don't paste source-code into the description field. Please don't contribute patch-files.

Before contributing programs

  • Debian issue their own security certificate - your web browser will complain.
  • Read and accept the Revised BSD license - all contributed programs are published under this revised BSD license.
  • Create an Alioth ID and log in.

The Tracker

  • After login, go to the "Play the Benchmarks Game" Tracker
  • Find and click the "Play the Benchmarks Game: Submit New" link
  • Now start from the bottom of the form and work your way up

Start from the bottom

  1. Attach the program source-code file - do this first because it's easy to forget.
  2. Say in the Description how this program fixes an error or is faster or was missing or … Give us reasons to accept your program.
  3. Each Summary text must be unique! Follow this convention:
    language, benchmark, your-name, date, (version)
    Ruby nsieve Glenn Parker 2005-03-28
  4. Category: select the language implementation
  5. Group: select the benchmark
  6. Click the Submit button
How can I track what happens to the program I contributed?

You created an ↓ Alioth ID with a valid email address so you'll receive email updates when your program is accepted and measured.

 What does … mean?
What does N mean?

N means the value passed to the program on the command-line (or the value used to create the data file passed to the program on stdin). Larger N causes the program to do more work - mostly measurements are shown for the largest N, the largest workload.

Read ↓ How programs were measured

What does '27% 34% 28% 67%' ≈ CPU Load mean?

When the program was being measured: the first core was not-idle about 27% of the time, the second core was not-idle about 34% of the time, the third core was not-idle about 28% of the time, the fourth core was not-idle about 67% of the time.

When all the programs show ≈ CPU Load like this '0% 0% 0% 100%' you are probably looking at measurements of programs forced to use just one core - the fourth core (rather than being allowed to use any or all of the CPU cores).

Read ↓ How did you measure ≈ CPU Load?

What does Interesting Alternative Program mean?

Interesting Alternative Program means that the program doesn't implement the benchmark according to the arbitrary and idiosyncratic rules of The Computer Language Benchmarks Game - but we felt like showing the program anyway.

What do #2 #3 mean?

Nothing - they are arbitrary suffixes that identify a specific program.

 FAQs
What about Java®?

In these (Intel® Q6600® quad-core) examples we measured elapsed time inside the Java programs.

In the first case (Cold), we simply started and measured the program 66 times; and then discarded the first measurement leaving 65 data points.

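   public static void main(String[] args){
      // Cold: the loop body runs once; the JVM is started again for each measurement
      for (int i=0; i<1; ++i){
         System.gc();
         long t1 = System.nanoTime();
         nbody.program_main(args);
         long t2 = System.nanoTime();
         // print the elapsed time in seconds to stderr
         System.err.println( String.format( "%.6f", (t2 - t1) * 1e-9 ) );
      }
   }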

In the second case (Warmed), we started the program once and repeated the measurement 66 times without restarting the JVM; and then discarded the first measurement leaving 65 data points.

   public static void main(String[] args){
      // Warmed: 66 measurements inside one JVM, without restarting it
      for (int i=0; i<66; ++i){
         System.gc();
         long t1 = System.nanoTime();
         nbody.program_main(args);
         long t2 = System.nanoTime();
         // print the elapsed time in seconds to stderr
         System.err.println( String.format( "%.6f", (t2 - t1) * 1e-9 ) );
      }
   }

The usual measurements and the Java 7 "averaged" approximations are shown alongside for comparison.

"1.7.0" Java HotSpot(TM), System.nanoTime()

                  1) Cold             2) Warmed
                  mean      σ         mean      σ         usual    "averaged"
meteor contest    0.0107s   0.0011    0.0015s   0.0003    0.22s    0.12s
chameneos-redux   4.09s     0.28      4.00s     0.27      4.17s    4.09s
spectral norm     4.54s     0.13      4.39s     0.13      4.52s    4.39s
pidigits          5.35s     0.16      5.34s     0.16      5.37s    5.28s
mandelbrot        7.96s     0.23      7.98s     0.01      7.04s    8.02s
binary trees      10.82s    0.44      8.00s     0.29      9.59s    8.19s
fannkuch-redux    16.70s    1.50      17.28s    0.07      13.72s   17.35s
nbody             22.42s    0.01      22.41s    0.01      22.49s   22.38s

The largest and most obvious effects of bytecode loading and dynamic optimization can be seen with the meteor-contest program, which runs for only a fraction of a second.
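The mean and σ columns could be produced from the 66 printed timings along these lines. This is a sketch with one loud assumption: σ is computed here as the population standard deviation, since the page does not say whether a population or sample statistic was used. The `summarize` name is hypothetical.

```python
import statistics


def summarize(times_secs):
    """Discard the first measurement, then return (mean, sigma) of the rest."""
    kept = times_secs[1:]
    # Assumption: population standard deviation (statistics.pstdev);
    # statistics.stdev would give the sample standard deviation instead.
    return statistics.mean(kept), statistics.pstdev(kept)
```

Dropping the first data point matters most in the Warmed case, where the first iteration carries the cost of bytecode loading.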

Why don't you accept every program that gives the correct result?

We are trying to show the performance of various programming language implementations - so we ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.

We do show one contest where you can use different algorithms - meteor-contest.

What machine are you running the programs on?

We use a quad-core 2.4GHz Intel® Q6600® machine with 4GB of RAM and a 250GB SATA II disk drive.

The out-of-date measurements used a single-processor 2.2GHz AMD™ Sempron™ machine with 512MB of RAM and a 40GB IDE disk drive; and a single-processor 2GHz Intel® Pentium® 4 machine with 512MB of RAM and an 80GB IDE disk drive.

What OS are you using on the test machine?

We use Ubuntu™ 11.10 Linux Kernel 3.0.0-12-generic

The out-of-date measurements used Debian Linux 'unstable', Kernel 2.6.18-3-k7 and Gentoo Linux gentoo-sources-2.6.20-r6

Where can I see previous programs?

Periodically we go through and remove slower programs from the website (if there's a faster program for the same language implementation). We don't remove those programs from the "Play the Benchmarks Game" tracker.

You can see previous programs by browsing through the "Play the Benchmarks Game" tracker items and looking at the attached source code files. Log in with your Alioth ID and you will be able to create and save a query to search for particular tracker items.

Why do you only include language X one core measurements?

Probably because no one has contributed language X programs that use more than one core. Why don't you contribute language X programs that use more than one core?

Why don't you include language X?

Because I want to do fewer chores not more! Why don't you use our measurement scripts and publish measurements for language X?

For example

The Python script "bencher does repeated measurements of program cpu time, elapsed time, resident memory usage, cpu load while a program is running, and summarizes those measurements" - download bencher and unzip into your ~ directory, check the requirements and recommendations, and read the license before use.

As an alternative, you should take a look at these Python measurement scripts designed for statistically rigorous Java performance evaluation - JavaStats.

Why don't you include 3 or 4 implementations of the same language?

Because I want to do fewer chores not more! Why don't you use our measurement scripts and publish measurements for 3 or 4 implementations of the same language?

The Python script "bencher does repeated measurements of program cpu time, elapsed time, resident memory usage, cpu load while a program is running, and summarizes those measurements" - download bencher and unzip into your ~ directory, check the requirements and recommendations, and read the license before use.

Why don't you include Microsoft® Windows®?

Because I want to do fewer chores not more! Why don't you use our measurement scripts and publish measurements for Microsoft® Windows®?

The Python script "bencher does repeated measurements of program cpu time, elapsed time, resident memory usage, cpu load while a program is running, and summarizes those measurements" - download bencher and unzip into your c:\ directory, check the requirements and recommendations, and read the license before use.

(Here are some measurements made just as a demo of what you could do with bencher.py on Windows Vista®.)

Why don't you include LLVM?

Because I want to do fewer chores not more! Why don't you use our measurement scripts and publish measurements for LLVM?

The Python script "bencher does repeated measurements of program cpu time, elapsed time, resident memory usage, cpu load while a program is running, and summarizes those measurements" - download bencher and unzip into your ~ directory, check the requirements and recommendations, and read the license before use.

(Here are some measurements made just as a demo of what you could do building language implementations on the LLVM toolchain.)

What…? Where…? Why…?

Please create an Alioth ID, log in and ask your questions in the discussion forum.

Note: Debian issue their own security certificate - your web browser will complain.

Revised BSD license
